Performance test for evaluating the overhead of tracing process memory activity

I was designing a PaaS service that needed a way to track the memory activity of hosted processes at runtime as a black box, i.e., without knowing in advance which process would be run. The service did not require access to the process source code or even its binary in advance. The memory tracking covered which memory addresses are read and written by the hosted processes.

I tried Intel's PIN, a dynamic binary instrumentation tool: it provides hooks for injecting your own code into a process at runtime. Using the hooks for memory reads and writes, I added code that tracked the addresses used by all MOV instructions.
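For reference, a counting Pin tool along these lines takes only a few dozen lines of C++. The sketch below is a simplified illustration (not my exact tool, and the name memcount is made up); it increments a counter for every executed instruction and for every memory read/write operand:

```cpp
// memcount.cpp - minimal Pin tool that counts executed instructions and
// memory read/write operands (a simplified sketch, not the exact tool used).
#include <iostream>
#include "pin.H"

static UINT64 insCount = 0;    // all executed instructions
static UINT64 readCount = 0;   // executed memory read operands
static UINT64 writeCount = 0;  // executed memory write operands

// Analysis routines: run at runtime, before each instrumented instruction.
// Plain increments are not thread-safe; good enough for a rough estimate.
VOID CountIns()   { insCount++; }
VOID CountRead()  { readCount++; }
VOID CountWrite() { writeCount++; }

// Instrumentation routine: called once per instruction when it is first JITed.
VOID Instruction(INS ins, VOID *v)
{
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)CountIns, IARG_END);

    const UINT32 memOperands = INS_MemoryOperandCount(ins);
    for (UINT32 memOp = 0; memOp < memOperands; memOp++)
    {
        if (INS_MemoryOperandIsRead(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)CountRead, IARG_END);
        if (INS_MemoryOperandIsWritten(ins, memOp))
            INS_InsertPredicatedCall(ins, IPOINT_BEFORE, (AFUNPTR)CountWrite, IARG_END);
    }
}

// Dump the counters when the instrumented process exits.
VOID Fini(INT32 code, VOID *v)
{
    std::cerr << "instructions=" << insCount
              << " mem reads=" << readCount
              << " mem writes=" << writeCount << std::endl;
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;

    INS_AddInstrumentFunction(Instruction, 0);
    PIN_AddFiniFunction(Fini, 0);

    PIN_StartProgram();  // never returns
    return 0;
}
```

A tool like this is loaded into an unmodified binary with a command along the lines of pin -t memcount.so -- <command>, and the totals are printed when the instrumented process exits.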

The service worked, but then I wondered: are there a lot of MOV instructions in a typical process? More precisely:

  1. What is the percentage of MOV instructions out of the total instructions executed? When idle? Under load?
  2. How many MOV instructions are executed per second? When idle? Under load?
  3. If the instrumented process is a service, what is the response time (latency) degradation caused by the instrumentation?

To answer these questions, I tested a popular database, MySQL, whose binaries were instrumented with Intel's PIN, and measured the rate of memory read/write actions while MySQL was idle and while it was processing requests. The instrumentation consisted only of code that counts memory actions and executed instructions, i.e., no other overhead was added to the MySQL binary at runtime. I also measured MySQL request latency with this very basic instrumentation.

Results summary

  1. Application memory instructions account for a significant share of the total instructions executed: ~25%.
  2. The application's memory instruction rate can be very high: ~100M instructions/second.
  3. Application latency degrades significantly with instrumentation: ~16X slower.

Detailed results

Table 1 below shows the results of comparing memory reads/writes during idle and during a minor load.

Metric         Idle (no load)    Load (SELECT & INSERT operations)
write/sec      20K               180M
read/sec       50K               100M
read/total     14%               30%
write/total    25%               25%

Table 1

Table 2 below shows the latency measured with and without instrumentation. The instrumentation included only code that increments counters on memory reads and writes (two counter increments per memory instruction). The load was composed of 10x SELECT and INSERT requests to MySQL. The mysqld process CPU was at 97% during the load with instrumentation.

Metric                    MySQL without instrumentation    MySQL with instrumentation
latency (milliseconds)    160                              2600

Table 2

Disaster Recovery for Solr – a Story from the Field

We had Solr deployed in production, and one day my manager asked us to prepare a disaster recovery (DR) plan for it. My company already had a DR data center with exactly the same nodes as production, so the main challenge was to keep the Solr in the DR data center up to date with the data in the production Solr. Oh, and one more thing: the network between the production and DR data centers was slow.

Our first thought was: let's add the Solr nodes in the DR data center to the production Solr cluster (more accurately, SolrCloud) and let Solr handle the replication for us. But we realized that this would hurt indexing performance, because Solr uses two-phase-commit replication for strong consistency: when a document is indexed, the relevant shard leader verifies that all of the shard's replica nodes have committed the document to their transaction logs before acknowledging the request. This means each index request takes the maximum commit time across all Solr nodes participating in the SolrCloud, and since the network to the DR Solr was slow, every index request would be slow. That's bad. Our application required fast indexing, so this option was taken off the table.

Next, we thought of running a scheduled process on the production Solr that copies the updated Solr index files to the DR Solr with a utility like rsync. Thinking this through, we realized this would not work, because the Solr index files might be in an inconsistent state while Solr is up: some of its state is stored in Solr's application memory and might not have been persisted to disk at the time of replication. So we concluded that we needed to capture the changes to the production Solr from the application that uses Solr.

Finally, we came up with the following scheme (a rough code sketch follows the list):

    1. In the production site, we introduced a replicator thread that continually re-indexed updated documents from the production Solr to the DR Solr. It replicated a fixed number of updates at a time (preserving the order of the updates), then slept for a while to release resources, and repeated this process for as long as updates remained.
    2. The replicator queried the DR Solr for its latest update timestamp.
    3. The replicator then searched the production Solr for all docs with a timestamp newer than the one in the target (DR) Solr and re-indexed them in the DR Solr.
    4. Special care was needed for deleted documents: the scheme above cannot tell which documents need to be deleted in the DR Solr, because the production Solr no longer contains them. For this we indexed a special document (a tombstone) in the production Solr for each deleted doc, and removed the tombstone from the production Solr once the associated doc was deleted in the DR Solr.
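To make the flow concrete, below is a rough, hypothetical sketch of the replicator loop. The solrMaxTimestamp, solrDocsNewerThan, solrIndex, and solrDeleteById helpers are placeholders for real Solr client calls (HTTP requests against /select and /update), and the batch size and sleep interval are made-up values, not what we actually used:

```cpp
// replicator.cpp - hypothetical sketch of the DR replication loop.
// The solr* helpers below are placeholders for real Solr client calls;
// they are not a real API.
#include <chrono>
#include <string>
#include <thread>
#include <vector>

struct Doc {
    std::string id;
    long long updatedAt = 0;  // update timestamp stored on every document
    bool tombstone = false;   // marks a document that was deleted in production
};

// --- placeholder Solr helpers, for illustration only ---
long long solrMaxTimestamp(const std::string& /*solrUrl*/) { return 0; }
std::vector<Doc> solrDocsNewerThan(const std::string& /*solrUrl*/, long long /*ts*/,
                                   int /*batchSize*/) { return {}; }  // ordered by updatedAt
void solrIndex(const std::string& /*solrUrl*/, const Doc& /*doc*/) {}
void solrDeleteById(const std::string& /*solrUrl*/, const std::string& /*id*/) {}

void replicatorLoop(const std::string& prodSolr, const std::string& drSolr) {
    const int kBatchSize = 100;                   // fixed number of updates per round (made up)
    const auto kPause = std::chrono::seconds(5);  // sleep between rounds (made up)

    for (;;) {
        // 1. Ask the DR Solr how far it has caught up.
        const long long drLatest = solrMaxTimestamp(drSolr);

        // 2. Fetch the next batch of newer docs from production, oldest first,
        //    so the order of updates is preserved.
        const std::vector<Doc> batch = solrDocsNewerThan(prodSolr, drLatest, kBatchSize);

        // 3. Re-index the batch in the DR Solr; tombstones become deletes,
        //    and the tombstone itself is removed from production afterwards.
        for (const Doc& doc : batch) {
            if (doc.tombstone) {
                solrDeleteById(drSolr, doc.id);
                solrDeleteById(prodSolr, doc.id);  // drop the tombstone once applied
            } else {
                solrIndex(drSolr, doc);
            }
        }

        // 4. Sleep for a while to release resources before the next round.
        std::this_thread::sleep_for(kPause);
    }
}

int main() {
    replicatorLoop("http://prod-solr:8983/solr/core", "http://dr-solr:8983/solr/core");
}
```

A nice property of this design is that it is idempotent: re-indexing a document that is already up to date in the DR Solr simply overwrites it by its unique key, so the replicator can safely restart from the DR Solr's latest timestamp at any time.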