Disaster Recovery for Solr – a Story from the Field

We had Solr deployed in production, and one day my manager asked us to prepare a disaster recovery (DR) plan for it. My company already had a DR data center deployed with exactly the same nodes as production, so the main challenge was keeping Solr in the DR data center up to date with the data in the production Solr. Oh, and one more thing: the network between the production and DR data centers was slow.

Our first thought was: let’s add the Solr nodes in the DR data center to the production Solr cluster (more accurately, SolrCloud) and let Solr handle the replication for us. But we realized that this would hurt indexing performance, because Solr uses synchronous, two-phase-commit-like replication for strong consistency: when a document is indexed, the relevant shard leader verifies that all of the shard’s replica nodes have written the document to their transaction logs before acknowledging the request. This means that each index request takes as long as the slowest Solr node participating in the SolrCloud, and since the network to the DR Solr is slow, every index request would be slow. That’s bad. Our application required fast indexing, so this option was taken off the table.

Next, we thought about using a scheduled process running on the production Solr that copies updates of the Solr index files to the DR Solr using a utility like rsync. Thinking this through, we understood that this would not work: the Solr files might be in an inconsistent state while Solr is up, because some of its state is held in application memory and might not be persisted to disk at the time of replication. So we concluded that we needed to get the changes in the production Solr from the application that uses Solr.

Finally, we came up with the following scheme:

    1. In the production site, we introduced a replicator thread that continually indexed updated documents from the production Solr into the DR Solr. It replicated a fixed number of updates (preserving the order of the updates), then slept for some time to release resources, repeating this process as long as there were updates to replicate.
    2. The replicator queried the DR Solr for its latest update timestamp.
    3. The replicator searched for all docs in the production Solr with a timestamp newer than the latest one in the DR Solr, then reindexed them in the DR Solr.
    4. Special care was needed to handle deleted documents: the scheme above can’t track which documents need to be deleted in the DR Solr, because the production Solr no longer contains them. To solve this, we indexed a special document (a tombstone) in the production Solr for each deleted doc. We removed the tombstones from the production Solr once the associated doc had been deleted in the DR Solr.
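
The four steps above can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not our production code: `Index` is a hypothetical in-memory stand-in for a Solr collection (a real replicator would talk to Solr through a client library), a tombstone is modeled as a document that reuses the id of the deleted doc, and update timestamps are assumed to be unique and increasing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DrReplicatorSketch {

    /** A document as the replicator sees it. A tombstone marks the doc with the same id as deleted. */
    static final class Doc {
        final String id;
        final long updatedAt;    // update timestamp written by the application (assumed unique)
        final boolean tombstone;

        Doc(String id, long updatedAt, boolean tombstone) {
            this.id = id;
            this.updatedAt = updatedAt;
            this.tombstone = tombstone;
        }
    }

    /** Hypothetical in-memory stand-in for a Solr collection, keyed by document id. */
    static final class Index {
        final Map<String, Doc> docs = new HashMap<>();

        void put(Doc doc) { docs.put(doc.id, doc); }

        /** Step 2: the latest update timestamp in this index, or -1 if it is empty. */
        long latestTimestamp() {
            long latest = -1;
            for (Doc d : docs.values()) latest = Math.max(latest, d.updatedAt);
            return latest;
        }
    }

    /** Replicates at most batchSize updates, oldest first; returns how many were applied. */
    static int replicateBatch(Index production, Index dr, int batchSize) {
        long since = dr.latestTimestamp();

        // Step 3: everything in production that is newer than what DR has seen.
        List<Doc> pending = new ArrayList<>();
        for (Doc d : production.docs.values()) {
            if (d.updatedAt > since) pending.add(d);
        }
        pending.sort(Comparator.comparingLong((Doc d) -> d.updatedAt)); // keep update order

        int applied = 0;
        for (Doc d : pending) {
            if (applied == batchSize) break;
            if (d.tombstone) {
                dr.docs.remove(d.id);          // step 4: apply the delete in DR...
                production.docs.remove(d.id);  // ...then drop the tombstone in production
            } else {
                dr.put(d);
            }
            applied++;
        }
        return applied;
    }

    /** Step 1: replicate in fixed-size batches, pausing between batches to release resources. */
    static void replicateUntilCaughtUp(Index production, Index dr, int batchSize, long pauseMillis) {
        while (replicateBatch(production, dr, batchSize) > 0) {
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}
```

Processing the pending updates oldest first is what makes the "latest timestamp in DR" query a safe resume point: if a batch is cut short, the next batch starts from the first update that was not applied.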

Privilege Separation for Linux Security Enhancement

Problem Description

I was working on a monitoring product that used agents installed on customer hosts to report metrics to backend servers. Customers preferred to run agents as unprivileged processes, i.e., not as the root user, to prevent the agent from harming their host in case of an agent bug or malicious code. Our agent also supported running third-party plugins for monitoring specific applications; those plugins were not reviewed by our company, and so they increased the risk of harm by the agent. But our agent had to run as root in order to execute all of its functionality, e.g., checking a server’s reachability by sending raw ICMP packets.

I wanted the agent to run most of its actions in an unprivileged mode but allow it to run a limited list of predefined actions in privileged mode. The Linux security model does not let an unprivileged process raise its privileges after it has started. Generally speaking, a process’s privileges are determined by the set of privileges owned by the user that launches the process (Linux also supports file capability configuration). Here is a summary of the requirements:

  1. Allow the agent to execute unprivileged code (obvious).
  2. Allow the agent to execute privileged actions from a limited set of actions.
  3. Block the agent from executing privileged actions that are not in that set.
  4. Third-party plugin code should not have to change to support the new requirements, i.e., the solution needs to support legacy plugins, especially custom plugins.

Solution Sketch

My solution was inspired by an online security course I took: we will implement the privilege separation principle, i.e., separate a set of privileged actions from the rest of the system, limit access to them, and block any other privileged action.

The agent will run in an unprivileged process and will send requests for privileged actions to a second process that runs as a privileged user.

The privileged action requests will be defined by a strict API in the privileged process.

The privileged process will be accessible only by the agent process via a shared secret, and privileged action requests will be logged for auditing.
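
To make the "strict API plus audit log" idea concrete, here is a minimal sketch of the dispatching part of the privileged process. All names (`PrivilegedApiSketch`, `allow`, `handle`) are hypothetical, the audit log is just an in-memory list, and authentication of the caller is left out of this sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class PrivilegedApiSketch {
    // The strict API: only actions registered here can ever be executed.
    private final Map<String, Function<String, String>> allowed = new HashMap<>();
    private final List<String> auditLog = new ArrayList<>();

    /** Registers one action of the strict API; meant to be called once at helper startup. */
    void allow(String actionName, Function<String, String> action) {
        allowed.put(actionName, action);
    }

    /** Runs an allowed action or rejects the request; both outcomes are audited. */
    String handle(String actionName, String argument) {
        Function<String, String> action = allowed.get(actionName);
        if (action == null) {
            auditLog.add("DENIED " + actionName + " " + argument);
            throw new SecurityException("action not allowed: " + actionName);
        }
        auditLog.add("ALLOWED " + actionName + " " + argument);
        return action.apply(argument);
    }

    /** Read-only view of the audit trail. */
    List<String> auditLog() {
        return Collections.unmodifiableList(auditLog);
    }
}
```

Because requests are looked up by name in a fixed table, anything that is not registered is rejected by default, which is the behavior requirement 3 asks for.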

This solution will enable finer-grained file system access than OS access control alone: the agent will be able to read any file on the file system but will only be able to write to a sandboxed part of it.

This approach is straightforward for stateless actions like InetAddress.isReachable, but it is more challenging for stateful actions like reading a file. The second process will need to track the state and handle life cycle aspects, such as cleaning up after actions that have finished.
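
One possible way to track such state, sketched here as an assumption rather than as the final design, is a handle table in the second process: every opened resource gets a numeric handle that the agent quotes in later requests, and handles are closed either on request or after they have been idle for too long. The class and method names are hypothetical, and time is passed in explicitly to keep the sketch easy to test.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

class HandleRegistrySketch {
    private static final class Entry {
        final Closeable resource;
        long lastUsedMillis;

        Entry(Closeable resource, long nowMillis) {
            this.resource = resource;
            this.lastUsedMillis = nowMillis;
        }
    }

    private final Map<Long, Entry> open = new HashMap<>();
    private long nextHandle = 1;

    /** Registers an opened resource and returns the handle the agent will quote later. */
    synchronized long register(Closeable resource, long nowMillis) {
        long handle = nextHandle++;
        open.put(handle, new Entry(resource, nowMillis));
        return handle;
    }

    /** Looks up the resource behind a handle and marks it as recently used. */
    synchronized Closeable use(long handle, long nowMillis) {
        Entry e = open.get(handle);
        if (e == null) throw new IllegalArgumentException("unknown handle: " + handle);
        e.lastUsedMillis = nowMillis;
        return e.resource;
    }

    /** Closes one handle when the agent reports that the action has finished. */
    synchronized void release(long handle) {
        Entry e = open.remove(handle);
        if (e != null) closeQuietly(e.resource);
    }

    /** Closes handles idle for longer than maxIdleMillis; returns how many were closed. */
    synchronized int releaseIdle(long nowMillis, long maxIdleMillis) {
        int closed = 0;
        Iterator<Entry> it = open.values().iterator();
        while (it.hasNext()) {
            Entry e = it.next();
            if (nowMillis - e.lastUsedMillis > maxIdleMillis) {
                closeQuietly(e.resource);
                it.remove();
                closed++;
            }
        }
        return closed;
    }

    synchronized int openCount() {
        return open.size();
    }

    private static void closeQuietly(Closeable c) {
        try {
            c.close();
        } catch (IOException ignored) {
            // Nothing useful to do in a sketch; real code would log this.
        }
    }
}
```

The idle sweep covers the case where a plugin opens a file and never closes it, so the privileged process does not leak file descriptors on behalf of misbehaving plugins.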

Solution Details

  1. The agent will have 2 processes:
    1. process #1 – the agent core process, i.e., a process that will run much like the current agent. This process will run unprivileged.
    2. process #2 – a helper process that will be privileged. It will execute requests from the agent core process and reply with the results.
  2. process #1 will instrument plugin code and replace the implementation of a set of privileged actions, e.g., Java/Sigar classes, with our own new implementation. For example, the InetAddress class that sends ICMP will be replaced with a MyInetAddress class. This can be achieved with the JVM’s javaagent hook as the instrumentation entry point and the Javassist library for replacing classes; it might also be done via AspectJ. Our implementation will simply forward a request to process #2; process #2 will actually run the action and return the response to process #1. The plugin in process #1 is not aware of the new implementation. If a plugin running in process #1 executes a privileged action that was not instrumented, i.e., not in the set of allowed privileged actions, it will be blocked by the OS because process #1 is unprivileged.
  3. process #2 will have a single entry point, e.g., a TCP port, that will be accessible only by process #1, to prevent other unprivileged processes from executing privileged actions. This can be done by sharing a secret between process #1 and process #2: process #1 will authenticate when opening a connection to process #2.
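
The channel between the two processes could look roughly like the sketch below. It is an illustration under assumptions, not the actual agent code: the line-based protocol (`AUTH_OK`, `OK ...`, `ERR ...`), the class names, and passing the privileged action in as a function are all made up for the example. `forward` plays the role of the code that a replacement class such as MyInetAddress would call, and how the secret is distributed to both processes is left open.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.function.Function;

class HelperChannelSketch {

    /** Process #2 side: a loopback-only TCP entry point guarded by a shared secret. */
    static final class Helper implements AutoCloseable {
        private final ServerSocket server;
        private final byte[] secret;
        private final Function<String, String> action; // stands in for the privileged action

        Helper(String secret, Function<String, String> action) {
            try {
                // Port 0 picks a free port; binding to loopback keeps remote hosts out.
                this.server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            this.secret = secret.getBytes(StandardCharsets.UTF_8);
            this.action = action;
        }

        int port() {
            return server.getLocalPort();
        }

        /** Accepts one connection, authenticates it, and answers a single request. */
        void serveOnce() {
            try (Socket socket = server.accept();
                 BufferedReader in = reader(socket);
                 PrintWriter out = writer(socket)) {
                String presented = in.readLine();
                // Constant-time comparison of the presented secret with the expected one.
                boolean ok = presented != null
                        && MessageDigest.isEqual(secret, presented.getBytes(StandardCharsets.UTF_8));
                if (!ok) {
                    out.println("ERR unauthenticated");
                    return;
                }
                out.println("AUTH_OK");
                String argument = in.readLine();
                if (argument != null) out.println("OK " + action.apply(argument));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public void close() {
            try {
                server.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    /** Process #1 side: the forwarding stub an instrumented class would call. */
    static String forward(int port, String secret, String argument) {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             BufferedReader in = reader(socket);
             PrintWriter out = writer(socket)) {
            out.println(secret);                        // authenticate first
            String reply = in.readLine();
            if (!"AUTH_OK".equals(reply)) return reply; // rejected by the helper
            out.println(argument);                      // then send the request
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static BufferedReader reader(Socket s) throws IOException {
        return new BufferedReader(new InputStreamReader(s.getInputStream(), StandardCharsets.UTF_8));
    }

    private static PrintWriter writer(Socket s) throws IOException {
        return new PrintWriter(new OutputStreamWriter(s.getOutputStream(), StandardCharsets.UTF_8), true);
    }
}
```

Binding to the loopback address only stops remote hosts; other local processes can still connect to the port, which is why the secret check comes before any request is read.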