Disaster Recovery for Solr – a Story from the Field

We had Solr deployed in production, and one day my manager asked us to prepare a disaster recovery (DR) plan for it. My company already had a DR data center with exactly the same nodes as our production one, so the main challenge was to keep the Solr in the DR data center up to date with the data in the production Solr. Oh, and one more thing – the network between the production and DR data centers was slow.

Our first thought was: let's add the Solr nodes in the DR data center to the production Solr cluster (more accurately, SolrCloud) and let Solr handle the replication for us. But we realized this would hurt indexing performance, because SolrCloud replicates updates synchronously to keep replicas consistent: when a document is indexed, the relevant shard leader makes sure that all of the shard's replicas have written the document to their transaction logs before acknowledging the request. This means each index request takes as long as the slowest Solr node participating in the SolrCloud, and since the network to the DR Solr was slow, every index request would be slow. That's bad. Our application required fast indexing, so this option was taken off the table.
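To make the latency cost concrete, here is a minimal sketch (in Java with SolrJ, using a hypothetical ZooKeeper address, collection name, and document) that times a single add against a SolrCloud collection. Because the shard leader waits for every replica to acknowledge the update, the measured round trip would include the slow link to any DR replica that joined the cluster.

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Hypothetical ZooKeeper ensemble and collection name.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk-prod:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("products");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "latency-probe-1");

            long start = System.nanoTime();
            client.add(doc);   // blocks until all replicas of the shard have acknowledged the update
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;

            System.out.println("index round trip: " + elapsedMs + " ms");
        }
    }
}
```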

Next, we thought of using a scheduled process on the production side that would copy updates to the Solr index files over to the DR Solr using a utility like rsync. Thinking this through, we understood that this would not work either: the index files may be in an inconsistent state while Solr is up, since some of its state lives in Solr's application memory and might not have been persisted to disk at the time of replication. So we concluded that we needed to get the changes in the production Solr from the application that uses Solr, not from the files on disk.

Finally, we came up with the following scheme:

    1. In the production site, we introduced a replicator thread that continually re-indexed documents that had been updated in the production Solr into the DR Solr. It replicated a fixed number of updates (preserving the order of the updates), then slept for some time to release resources, repeating this process as long as there were updates to replicate (see the sketch after this list).
    2. The replicator queried DR Solr for its latest update timestamp.
    3. The replicator then searched the production Solr for all documents with a timestamp newer than the latest one in the DR Solr and re-indexed them into the DR Solr.
    4. Special care was needed for deleted documents: the scheme above cannot tell which documents need to be deleted in the DR Solr, because the production Solr no longer contains them. To handle this, we indexed a special document (a tombstone) in the production Solr for each deleted document, and removed the tombstone from the production Solr once the corresponding document had been deleted in the DR Solr.
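To make the scheme concrete, here is a minimal sketch of the replicator thread, again using SolrJ. The Solr URLs, batch size, sleep interval, and field names ("last_updated" for the update timestamp, "deleted" as the tombstone marker) are hypothetical, chosen just for illustration; the real implementation differed in the details.

```java
import java.util.Date;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrDocumentList;
import org.apache.solr.common.SolrInputDocument;

public class DrReplicator implements Runnable {

    private static final int BATCH_SIZE = 500;    // fixed number of updates per cycle
    private static final long SLEEP_MS = 5_000;   // pause between cycles to release resources

    // Hypothetical production and DR collection URLs.
    private final SolrClient prod =
            new HttpSolrClient.Builder("http://prod-solr:8983/solr/mycollection").build();
    private final SolrClient dr =
            new HttpSolrClient.Builder("http://dr-solr:8983/solr/mycollection").build();

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                replicateOneBatch();
                Thread.sleep(SLEEP_MS);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                e.printStackTrace();              // real code would log and retry with backoff
            }
        }
    }

    private void replicateOneBatch() throws Exception {
        // 1. Ask the DR Solr for the timestamp of its most recent update.
        SolrQuery latest = new SolrQuery("*:*")
                .setRows(1)
                .addSort("last_updated", SolrQuery.ORDER.desc);
        SolrDocumentList drDocs = dr.query(latest).getResults();
        String since = drDocs.isEmpty()
                ? "*"
                : ((Date) drDocs.get(0).getFieldValue("last_updated")).toInstant().toString();

        // 2. Fetch the next batch of production documents updated since that point,
        //    oldest first, so the order of updates is preserved. The inclusive lower
        //    bound re-indexes the newest DR document once more, which is harmless.
        SolrQuery changed = new SolrQuery("last_updated:[" + since + " TO *]")
                .setRows(BATCH_SIZE)
                .addSort("last_updated", SolrQuery.ORDER.asc);

        for (SolrDocument doc : prod.query(changed).getResults()) {
            String id = (String) doc.getFieldValue("id");
            if (Boolean.TRUE.equals(doc.getFieldValue("deleted"))) {
                // 3. Tombstone: delete the real document in the DR Solr, then remove
                //    the tombstone from production once the DR delete went through.
                dr.deleteById(id);
                prod.deleteById(id);
            } else {
                // 4. Re-index the document into the DR Solr, skipping Solr's internal
                //    _version_ field to avoid optimistic-concurrency conflicts.
                SolrInputDocument copy = new SolrInputDocument();
                for (String field : doc.getFieldNames()) {
                    if (!"_version_".equals(field)) {
                        copy.addField(field, doc.getFieldValue(field));
                    }
                }
                dr.add(copy);
            }
        }
        dr.commit();
        prod.commit();
    }
}
```

We then ran the replicator as a plain background thread on the production site; keeping the batches small meant each cycle stayed cheap over the slow link, and sorting by the update timestamp preserved the order of updates.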