Semantic Hashing

Semantic hashing is a method that maps documents to a short code, e.g., a 32-bit memory address, so that documents with semantically close content are mapped to close addresses. It can be used to implement an information retrieval (IR) system where the query is itself a document and the search results contain documents with similar content (semantics). The method was published by R. Salakhutdinov and G. Hinton in this paper.

Indexing is implemented in the following manner: a document is mapped to a word-count vector, and this vector is then passed through an RBM-based autoencoder that encodes it into a 32-bit address.

For searching – the query string is treated as a document, i.e., its word-count vector is passed through the encoder to get its matching address. The search result is then the documents stored at the query address plus the documents stored at close addresses. For example, close addresses can be addresses that differ by up to 4 bits (Hamming distance) from the original address.
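To make the lookup concrete, here is a minimal sketch in Python (not from the paper; the `encode` function standing in for the trained RBM encoder is hypothetical) that retrieves documents stored at the query address and at all addresses within a small Hamming distance:

```python
from itertools import combinations

def hamming_ball(address, bits=32, radius=4):
    """Yield the address itself and every address within `radius` bit flips."""
    yield address
    for r in range(1, radius + 1):
        for positions in combinations(range(bits), r):
            flipped = address
            for p in positions:
                flipped ^= (1 << p)
            yield flipped

def semantic_search(query_doc, encode, index):
    """`encode` maps a document to a 32-bit code (e.g., a trained RBM encoder);
    `index` is a dict mapping codes to lists of document ids."""
    code = encode(query_doc)
    results = []
    for addr in hamming_ball(code):
        results.extend(index.get(addr, []))
    return results
```

Note that a Hamming ball of radius 4 around a 32-bit code contains roughly 41 thousand addresses, so each query touches a small, fixed number of buckets regardless of corpus size.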

One of the common methods in IR systems today is tf-idf indexing. Each document is indexed so that each term in the corpus points to a list of documents containing that term. Basic searching is done by looking up the document list of each term in the search query and intersecting those lists, leaving the documents that contain all query terms. The disadvantage of this method is that search time grows with the number of terms in the query. In contrast, semantic hashing gets the list of relevant documents in a single lookup and so is not affected by query size.
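For comparison, here is a toy sketch of the inverted-index lookup described above (illustrative only, not tied to any particular IR library):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> list of terms."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

def boolean_search(query_terms, index):
    """Intersect the posting lists of all query terms: cost grows with query size."""
    postings = [index.get(t, set()) for t in query_terms]
    if not postings:
        return set()
    return set.intersection(*postings)

docs = {1: ["deep", "learning"], 2: ["deep", "hashing"], 3: ["hashing"]}
index = build_inverted_index(docs)
print(boolean_search(["deep", "hashing"], index))  # {2}
```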

Another method in IR systems is latent semantic analysis (LSA). Documents are mapped to word-count vectors, and the dimension of these vectors is then reduced using SVD. Search is done by mapping the query document to a word-count vector, reducing its dimension, and measuring the angle between the query vector and the vectors of all corpus documents. The disadvantage of this method is that search time depends linearly on the size of the corpus. In contrast, semantic hashing is only affected by the size of the document lists, i.e., a larger corpus means more collisions in the memory-address mapping and longer document lists. The size of these lists does not grow linearly, because the documents are spread across the memory addresses.
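A small numpy sketch of the LSA-style search described above (a toy illustration, not the system evaluated in the paper): reduce the term-document matrix with a truncated SVD and rank corpus documents by cosine similarity, which is exactly the linear scan that semantic hashing avoids:

```python
import numpy as np

def lsa_fit(term_doc, k=2):
    """term_doc: terms x docs count matrix. Returns the rank-k projection basis."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return U[:, :k]  # project word-count vectors onto the top-k concepts

def lsa_search(query_counts, term_doc, Uk):
    """Rank every corpus document by cosine similarity in the reduced space."""
    docs_low = Uk.T @ term_doc            # k x num_docs
    q_low = Uk.T @ query_counts           # k
    sims = (docs_low.T @ q_low) / (
        np.linalg.norm(docs_low, axis=0) * np.linalg.norm(q_low) + 1e-12)
    return np.argsort(-sims)              # best matches first, O(corpus size)

term_doc = np.array([[2., 0., 1.], [1., 1., 0.], [0., 3., 1.]])
Uk = lsa_fit(term_doc, k=2)
print(lsa_search(np.array([2., 1., 0.]), term_doc, Uk))
```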

Below are results from the paper, where it can be seen that the search quality of this technique is similar to tf-idf (which is considered state of the art for IR systems). The axes show the recall and precision values of the systems. The tests included a comparison of an LSA system and semantic hashing (graph on the left), and a comparison of LSA, tf-idf, and semantic hashing followed by tf-idf filtering (graph on the right). The right graph shows that semantic hashing with tf-idf filtering was close to tf-idf, i.e., semantic hashing returns documents similar to those tf-idf would return – this basically shows that semantic hashing is working 🙂

Screen Shot 2017-06-24 at 8.45.58 PM


Introduction to Restricted Boltzmann Machine (RBM)

Restricted Boltzmann Machines (RBMs) are building blocks for a certain type of neural network and were invented by G. E. Hinton.

In a paper published in Science, Hinton describes how to use neural networks to reduce the dimensionality of data using an autoencoder. Below is a sketch of a standard autoencoder: data is fed in at layer X on the left, and the code for that data is produced at layer Z. The network is trained by adding a decoder network after the code layer, computing the error as the difference between the reconstruction at layer X’ and the original data at layer X, and then using back propagation to update the network weights.

Screen Shot 2017-06-16 at 2.36.59 PM
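A minimal sketch of such an autoencoder in PyTorch (my own toy illustration, not the architecture from the paper; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, n_in=784, n_code=32):
        super().__init__()
        # encoder: X -> Z, decoder: Z -> X'
        self.encoder = nn.Sequential(nn.Linear(n_in, 256), nn.Sigmoid(),
                                     nn.Linear(256, n_code), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(n_code, 256), nn.Sigmoid(),
                                     nn.Linear(256, n_in), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AutoEncoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # a dummy batch standing in for real data
for _ in range(10):                  # training loop: reconstruct X at X'
    x_rec = model(x)
    loss = loss_fn(x_rec, x)         # error = difference between X' and X
    optimizer.zero_grad()
    loss.backward()                  # back propagation
    optimizer.step()
```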

One of the challenges of training a deep autoencoder is that unless the network’s weights are initialized close to a good solution, the training will fail and the encoder will not work. Hinton suggested a pre-training phase in which every pair of consecutive layers, up to the code layer, is trained in isolation from the rest of the network. Once the first two layers are pre-trained, we move to the next pair, where the last layer of the previous step becomes the first layer of the next pair. The graph below shows that training a 7-layer network fails compared to a network that was pre-trained.

Screen Shot 2017-06-16 at 2.10.35 PM

An RBM is a two-layer network with the following constraints:

  • The layer on the left is called the visible layer and the one on the right is called the hidden layer.
  • The connections are symmetrical.
  • There are no connections between nodes within the same layer.
  • There are bias weights for both the visible and the hidden units.

Screen Shot 2017-06-16 at 2.17.02 PM

The pre-training is composed of the following steps (taken from Wikipedia):

Screen Shot 2017-06-16 at 2.24.30 PM

Comments on the pre-training procedure (a sketch of one update step follows this list):

  • ‘v’ is the visible vector, ‘h’ is the hidden layer vector, ‘a’ is the bias vector of the visible layer and ‘b’ is the bias vector of the hidden layer.
  • In step 1, the hidden layer contains only 0/1 values. It is computed by a stochastic process: generate a random number in the range (0,1); if the activation probability of h_i is greater than this random value then h_i=1, otherwise h_i=0. The same goes for step 3.
  • The outer product is the matrix generated by multiplying the column vector v with the row vector h.
  • Epsilon is the learning rate.
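Here is a minimal numpy sketch of one such update (contrastive divergence with a single Gibbs step, CD-1), written to match the comments above; it is my own illustration rather than code from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v, W, a, b, epsilon=0.1):
    """One CD-1 step. v: visible vector, W: visible x hidden weight matrix,
    a: visible biases, b: hidden biases."""
    # step 1: sample binary hidden units from the visible data
    p_h = sigmoid(v @ W + b)
    h = (p_h > np.random.rand(*p_h.shape)).astype(float)
    # step 2: reconstruct the visible layer from the hidden sample
    p_v = sigmoid(h @ W.T + a)
    # step 3: compute hidden activations again from the reconstruction
    p_h_rec = sigmoid(p_v @ W + b)
    # update: difference of outer products (positive phase - negative phase)
    W += epsilon * (np.outer(v, p_h) - np.outer(p_v, p_h_rec))
    a += epsilon * (v - p_v)
    b += epsilon * (p_h - p_h_rec)
    return W, a, b
```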

Each pair of encoder layers is pre-trained by passing the training data through it several times (these passes are also called epochs). When the pre-training of a pair is complete, its hidden layer becomes the visible layer of the next pair of the encoder, and it is pre-trained together with the next layer of the encoder, in a greedy fashion.

When all layers up to the code layer are pre-trained, the network is unrolled, which means the transposes of the weight matrices are used as the weights of the decoder network. This is described in the following sketch:

Screen Shot 2017-06-16 at 2.43.15 PM

Now the fine-tuning phase starts, which is standard unsupervised autoencoder training via the back propagation algorithm.

We are ready to encode! We use the neural net from the bottom layer up to the code layer and pass our data through it to get its encoding.

Here are some results of encoding and then decoding with an RBM-based autoencoder, compared to PCA, taken from Hinton’s paper:

Figure A 

  • Architecture: (28X28)-400-200-100-50-25-6
  • Training: 20,000 images
  • Testing: 10,000 new images.

Screen Shot 2017-06-16 at 2.48.27 PM

Figure B

  • Architecture: 784-1000-500-250-30.
  • Training: 60,000 images
  • Testing: 10,000 new images.

Screen Shot 2017-06-16 at 2.49.28 PM

Figure C

Screen Shot 2017-06-16 at 2.49.55 PM

Figure showing visual clustering of RBM compared to PCA:

Screen Shot 2017-06-16 at 2.50.08 PM


Book recommendation – “DEEP WORK – RULES FOR FOCUSED SUCCESS IN A DISTRACTED WORLD” by Cal Newport

I recently read Cal Newport’s book “DEEP WORK – RULES FOR FOCUSED SUCCESS IN A DISTRACTED WORLD” and found it interesting and very useful for my own career. It opens by explaining the value of working deeply, which basically means working with high concentration on things that are important to your career. The book continues with how hard it is to work in deep mode and then brings practical practices for achieving it. If you are a person who is self-aware and likes to improve their productivity, then I recommend you buy this book.

Rituals

A proven scientific fact is that people fight desires all day long. One should expect to be bombarded throughout the day with the desire to do anything but work deeply. Another fact is that a person has a finite amount of willpower that becomes depleted as it is used; you get about 4 hours per day, at most, of deep work. Rituals help you harness your limited willpower to work deeply. Do not rely on your good intentions to work deep – you will probably fail. You need rituals.

Some effective rituals

  1. Schedule every minute of your day: at the beginning of each workday, plan every hour of your day. I use Google Calendar to schedule an appointment for each task I plan. Quantify the depth of every activity – ask yourself: how long would it take to train a smart recent college graduate with no specialized training in my field to complete this task? If the answer is less than 3 months, then the task is shallow and you do not want to do it. Reduce your shallow work; this will leave you with more energy to work deeply. Aim at small, important goals – you know that you need goals, right?
  2. Where you’ll work: have a special location where you work deep, e.g., a library with a good atmosphere or a pleasant coffee shop. Sometimes, make a grand gesture – go to a special place to work deep, e.g., work from a hotel in another city for a couple of days, or book a round-trip flight to Japan and work during the flight. You can practice ‘productive meditation’, which means using the time spent walking your dog or doing other monotonous exercise to think about one specific deep question. This requires careful planning of what the question is, and avoiding looping over the same thoughts.
  3. How long you’ll work: limit the deep work time. Concentrating is hard – your brain will do whatever it can to stop you. I’m using the Pomodoro technique.
  4. How you’ll work once you start to work: ban internet use and cellular activity. Act on the lead measures: there are two types of metrics, lag measures, e.g., publishing 6 papers in a year or learning the design of 10 systems in 2 months, and lead measures, e.g., the number of hours spent in deep work. Lead measures help you accomplish the short-term tasks that contribute to the lag measures.
  5. At the end of the day: have an end to your working day. Do not continue to work from home at night. At the end of the day do a shutdown ritual, which includes documenting all the new tasks you collected during the day, quickly skimming all open tasks, planning your next day and saying to yourself something like “shutdown complete!”. No more thinking about work after that. Downtime is important because:
    1. Downtime helps produce insights.
    2. Downtime helps recharge the energy needed to work deeply.
    3. You should complete your 4 hours of deep work at the office; the work you can do at night will not be deep, so there is no point doing it.
    4. Limiting your time necessitates more careful thinking about your organizational habits.
  6. At the end of the week: plan next week’s tasks, focusing on the tasks that are important; review the tasks you completed this week and check how close they are to the plan you made the week before. If you are not close – understand what happened.

Definitions

  1. Deep Work: Professional activities performed in a state of distraction-free concentration that push your cognitive capabilities to their limit. These efforts create new value, improve your skill, and are hard to replicate.
  2. The Deep Work Hypothesis: The ability to perform deep work is becoming increasingly rare at exactly the same time it is becoming increasingly valuable in our economy. As a consequence, the few who cultivate this skill, and then make it the core of their working life, will thrive.
  3. Shallow Work: Noncognitively demanding, logistical-style tasks, often performed while distracted. These efforts tend not to create much new value in the world and are easy to replicate.


Summary of “97 things every software architect should know”

 

The book “97 Things Every Software Architect Should Know” discusses what it means to be a software architect. It is a good source of ideas on how to improve your skills as an architect, or which skills you need to develop to become one.

In this post I summarize the chapters I found interesting. Currently, only some chapters from the range 71-95 are summarized.

73. The Business Vs. The Angry Architect – By Chad LaVigne

  1. You need to listen to what other people say to you and not concentrate only on what you have to say. You already know what you are going to say – so you might learn something new from listening to others. This is especially true for experienced architects, who might think they already know the next step for any situation.
  2. The business is your employer – you need to serve it. The business domain experts are your major partners – listen to what they have to say. It is fine to disagree with them; learn to live with disagreements. If there are too many and you cannot contain that – look for another business to work for and contribute your experience to its success.

75. Before anything, an architect is a developer – By Mike Brown

An architect of a project should also write code of the project. This will have several benefits:

  1. It helps the architect estimate how much time coding tasks will take his developers to complete.
  2. It’s fun.
  3. It keeps him in sync with the latest technology updates.
  4. His developers will appreciate him more, as they will see his high-quality code and that he takes part in the development effort.

77. Stable problems get high quality solutions – By Sam Gardiner

  1. An architect should be able to look at a problem and separate it into chunks that are cohesive and decoupled.
  2. The split should be driven to create chunks that are stable, i.e., chunks that do not change even if other chunks are changed.
  3. When a problem is stable its design can be stable, and when the design is stable the implementation can be made stable, with high quality.
  4. An architect should develop an internal sense for this separation, similar to the good sense of direction that some people have.

78. It Takes Diligence – By Brian Hart

An architect is expected to show ingenuity in his work, but a less common characteristic is diligence.

Effective architects use daily and weekly checklists to remind them to do the things that are necessary but harder for them to do, like tracking project status.

They make commitments, for example:

  • keeping project’s schedule and budget
  • complete tasks that are not fun for them
  • commitment to process/methodology
  • accepting overall responsibility of the project

Getting better in what you do requires diligence.

79. Take responsibility for your decisions – By Yi Zhou

Here is how you can become a responsible architect:

  1. Each major decision needs to be well documented, traceable and communicated to engineers and other stakeholders so they can review it and have the chance to object to it.
  2. Review your past architectural decisions. Identify decisions that remain valid and those that do not – figure out how you could have avoided the bad ones. Learn from your mistakes.
  3. Enforce your decisions.
  4. Delegate decisions to others who are experts in domains where you are not proficient. You are not a know-it-all genius.

80. Don’t Be a Problem Solver – By Eben Hewitt

Architects have a reflex to try to solve a given problem instead of asking themselves whether the problem is a real problem at all, and whether it can be avoided.

An architect should not accept requirements as set in stone. If removing some of them yields a more sustainable design, he should propose that.

81. Choose your weapons carefully, relinquish them reluctantly – Chad LaVigne

  1. Selecting the technologies we use to attack problems is a large part of the software architect’s job.
  2. Architects have a toolbox of technologies they are experienced with. They know the technologies’ limitations, they know how to solve problems when they arise, and they are able to estimate project effort accurately.
  3. Technologies improve over time – you need to replace old technologies when new ones prove to be significantly better than the ones you know.
  4. Update technologies in your product when there is an obvious improvement for the business.

88. Don’t Be Clever – By Eben Hewitt

  1. Good qualities an architect should have:
    1. General intelligence
    2. resourcefulness
    3. thoughtfulness
    4. a breadth and depth of knowledge
    5. affinity for precision
  2. Write simple solutions not clever ones. The disadvantages of clever solutions:
    1. They are not simple
    2. Hard to fix
    3. Hard to replace, they stick for ever

90. Find and retain passionate problem solvers

    1. Putting together a team of outstanding developers is one of the most important things you can do to ensure the success of a software project.

    2. Building a team
      1. asking someone to explain their approach to diagnosing a performance problem gives you great insight into their methods for problem solving.

      2. ask what they would change given the chance to start their most recent project anew.

      3. Good developers are passionate about their work. Asking them about past experience will bring out that passion and tell you what correct answers to technical trivia questions cannot.

    3. keep the team together

      1. Finding great developers is difficult. Letting people know they are valued is not. Don’t miss simple chances to build morale and boost productivity.

      2. Keep criticism constructive and don’t require that every solution look like it came from you.

      3. go the extra mile to keep it together

92. Pay down your technical debt – By Burk Hufnagel

  1. A bug fix in a production system might be needed ASAP, which means it will probably be a “quick and dirty” fix.
  2. Such fixes have hidden costs:
    1. system instability
    2. increased maintenance cost
    3. lack of proper design
    4. lack of documentation
    5. lack of tests
  3. Once the fix is in production, have the developers go back and fix it properly so that the proper fix can be included in the next scheduled release.

 

95. The Importance of Consommé – By Eben Hewitt

  1. A consommé is an extremely clarified broth, usually made with beef or veal, served as a delicate soup. It is made by repeated, simple, fine-grained straining.

  2. Software architecture requires a continual refinement of thought, a repeated straining of ideas until we have determined the essence of each requirement in the system.

Performance test for evaluating the overhead of tracing process memory activity

I was designing a PaaS service that needed a way to track a hosted process’s memory actions at runtime as a black box, i.e., without knowing in advance which process will be used. The service did not require knowing the process’s source code or even its binary in advance. The memory tracking covered which memory addresses are read and written by the hosted processes.

I tried Intel’s Pin tool, which is a dynamic binary instrumentation tool – it provides hooks to inject your code into a process at runtime. Using the hooks for memory reads and writes, I added code that tracked the addresses used by all MOV instructions.

The service worked, but then I wondered: are there a lot of MOV instructions in a typical process? More precisely:

  1. What percentage of the total instructions are MOV instructions? When idle? Under load?
  2. How many MOV instructions are executed every second? When idle? Under load?
  3. If the instrumented process is a service – what is the response time (latency) degradation due to the instrumentation?

To answer these questions, I tested a popular database, MySQL, whose binaries were instrumented with Intel’s Pin, and measured the rate of memory read/write actions while idle and while MySQL processed requests. The instrumentation code only counted the number of memory actions and instructions, i.e., no other overhead was added to the MySQL binary at runtime. I also measured the MySQL request latency under this very basic instrumentation.

Results summary

  1. Memory instructions account for a significant share of the total instructions executed: ~25%.
  2. The rate of memory instructions can be very high: ~100M instructions/second.
  3. Application latency degrades significantly with instrumentation: ~16X slower.

Detailed results

Table 1 below shows the read/write rates while idle and under a minor load:

Metric       | Idle (no load) | Load of select & insert operations
write/sec    | 20K            | 180M
read/sec     | 50K            | 100M
read/total   | 14%            | 30%
write/total  | 25%            | 25%

Table 1

Table 2 below shows the latency with and without instrumentation. The instrumentation included code that increments counters on memory reads and writes; it incremented two counters for each memory instruction. The load was composed of 10x SELECT and INSERT requests to MySQL. The mysqld process CPU was at 97% during the load with instrumentation.

Metric                 | MySQL without instrumentation | MySQL with instrumentation
latency (milliseconds) | 160                           | 2600

Table 2

The Architecture of Kafka – A Distributed Streaming Platform

Kafka is a scalable distributed message queue. This means it allows producers to push a high volume of messages to a queue and consumers to pull messages from the queue. In this post I will describe Kafka’s key abstractions and architecture principles.

A Kafka cluster is composed of nodes which are called brokers. A producer client connects to a broker and pushes a message. Internally, the broker creates a sequence number per message and appends the message to a log file located on the node’s local hard disk. For efficient fetching of a message, the brokers manage an offset table whose key is the sequence number and whose value is the seek location of the message in the log file and its size. In Kafka, the consumer client – not the server – is responsible for tracking which messages it has already consumed (using ZooKeeper); it requests a fixed number of (or all) updates since the last message it consumed. Other message queues approach this tracking differently – their server tracks this data for each of its consumers. In Kafka, after a message is consumed by a client it is not removed by the server, so other consumers can consume it later, including the consumer that just consumed it. This can be helpful in case data needs to be processed again (replayed) by the consumer after some fault in the first processing. Messages are kept for some retention period. To emphasize: the queue is actually implemented as a continuous log file on the node’s local disk.

You might wonder at this point how Kafka manages efficient reads and writes to the local disk. Kafka’s interaction with the local disk has two characteristics: 1) it appends new writes to an open file (i.e., the message log file) and 2) it usually reads contiguous areas from disk. These characteristics let Kafka rely on the OS disk cache for optimizing its reads from and writes to disk. The OS cache is highly efficient, always enabled, and optimized for append calls and for multiple reads from a contiguous area on disk. Further advantages of using the OS-level cache are that the Kafka process uses little JVM heap space, avoiding GC-related issues, and that if the broker restarts it does not need any cache warm-up, as the OS cache is not cleared when a process terminates. Another optimization strategy used by Kafka is to reduce unnecessary copying of data – for example, copying data from OS network buffers into application buffers. This is achieved by using Linux’s sendfile API, which reads data directly from disk (hopefully from the OS disk cache) and routes it directly to a network socket without passing through the application, thus avoiding the unnecessary buffer copies.

Now, consider the use case where several producers and consumers are interested in different types of messages. In order to support this use case, Kafka uses the abstraction of a ‘topic’. A producer sends messages to a topic and consumers read messages from a topic.

In order to support a large volume of messages and parallel consumption, Kafka introduced the concept of topic partitioning, which means that when producing to a topic, the producer also specifies the method for partitioning the messages, e.g., round robin, or a partition function together with a message field that define a partition per message. For example, if a single topic has 5 partitions on a Kafka cluster with 5 brokers, then each broker will manage (lead) one of the partitions. Message order is kept only between messages in the same partition.
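For illustration, here is a small producer sketch using the kafka-python client (assuming a broker listening on localhost:9092 and a topic named 'events'; the client library, broker address and topic name are my choices, not part of the original post). Messages sharing the same key are hashed to the same partition, so their relative order is preserved:

```python
from kafka import KafkaProducer

# Messages with the same key land in the same partition,
# so ordering is preserved per key (i.e., per partition).
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for i in range(10):
    producer.send(
        "events",                       # topic
        key=f"user-{i % 3}".encode(),   # partitioning key
        value=f"message {i}".encode(),  # payload
    )

producer.flush()  # block until all buffered messages are sent
producer.close()
```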

High availability is achieved by replicating each partition to other brokers, so in case of a failure in one node, another node starts to manage its partitions. Usually there are more topics and partitions than cluster nodes, so a typical broker leads several partitions and also holds several replicas of partitions that are led by other brokers. Replication is implemented by having a broker act like a regular client that consumes a specific partition. All producers to a partition write to the single broker that leads that partition, and the same goes for consumers of a partition – they all read from the broker that leads the partition.

Kafka uses the concept of consumer groups to allow a pool of processes to divide up the work of consuming and processing records. These processes can either be running on the same machine or, as is more likely, they can be distributed over many machines to provide additional scalability and fault tolerance for processing.

Each Kafka consumer is able to configure a consumer group that it belongs to, and can dynamically set the list of topics it wants to subscribe to, or subscribe to all topics matching a certain pattern. Kafka will deliver each message in the subscribed topics to one process in each consumer group. This is achieved by balancing the partitions in the topic over the consumer processes in each group. So if there is a topic with four partitions, and a consumer group with two processes, each process would consume from two partitions. This group membership is maintained dynamically: if a process fails, the partitions assigned to it will be reassigned to other processes in the same group, and if a new process joins the group, partitions will be moved from existing consumers to this new process.

So if two processes subscribe to a topic both specifying different groups they will each get all the records in that topic; if they both specify the same group they will each get about half the records.

Conceptually you can think of a consumer group as being a single logical subscriber that happens to be made up of multiple processes. As a multi-subscriber system, Kafka naturally supports having any number of consumer groups for a given topic without duplicating data.

This is a slight generalization of the functionality that is common in messaging systems. To get semantics similar to a queue in a traditional messaging system all processes would be part of a single consumer group and hence record delivery would be balanced over the group like with a queue. Unlike a traditional messaging system, though, you can have multiple such groups. To get semantics similar to pub-sub in a traditional messaging system each process would have its own consumer group, so each process would subscribe to all the records published to the topic.
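A matching consumer sketch (same hypothetical broker and topic as above): run several copies of this script with the same group_id and Kafka balances the topic's partitions across them; use distinct group_ids and each copy receives every record.

```python
from kafka import KafkaConsumer

# Processes started with the same group_id share the topic's partitions
# (queue semantics); processes with different group_ids each get every
# record (pub-sub semantics).
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="billing-service",
    auto_offset_reset="earliest",
)

for record in consumer:
    print(record.partition, record.offset, record.key, record.value)
```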
References

  1. Kafka Documentation in Apache site
  2. Kafka API

Disaster Recovery for Solr – a Story from the Field

We had Solr deployed in production, and one day my manager asked us to prepare a disaster recovery (DR) plan for it. My company already had a DR data center with exactly the same nodes as production, so the main challenge was to keep the Solr in the DR data center up to date with the data in the production Solr. Oh, and one more thing – the network between the production and DR data centers was slow.

Our first thought was: let’s add the Solr nodes in the DR data center to the production Solr cluster (more accurately, SolrCloud) and let Solr handle the replication for us. But we realized that this would hurt indexing performance, as Solr uses two-phase-commit-style replication for strong consistency: when a document is indexed, the relevant Solr shard leader verifies that all the shard’s replica nodes have committed the document to their transaction log before acknowledging the request. This means that each index request takes the maximum commit time over all Solr nodes participating in the SolrCloud, and since the network to the DR Solr is slow, every index request would be slow. That’s bad. Our application required fast indexing, so this option was taken off the table.

Next, we thought of using a scheduled process running on the production Solr that copies the updates to the Solr index files to the DR Solr using a utility like rsync. Thinking this through, we understood that it would not work, as Solr’s files might be in an inconsistent state while Solr is up: some of its state is held in Solr’s application memory and might not be persisted to disk at the time of replication. So we concluded that we needed to get the changes in the production Solr from the application that uses Solr.

Finally, we came up with the following scheme:

    1. In the production site, we introduced a replicator thread that continually reindexed documents that were updated in the production Solr into the DR Solr. It replicated a fixed number of updates (keeping the update order), then slept for some time – releasing resources – and repeated this process as long as updates remained (a sketch of this loop follows the list).
    2. The replicator queried the DR Solr for its latest update timestamp.
    3. The replicator searched the production Solr for all docs with a timestamp newer than the latest one in the DR Solr – and then reindexed them into the DR Solr.
    4. Special care was needed to handle documents that were deleted: this is a challenge, as the above scheme cannot track which documents need to be deleted in the DR Solr because the production Solr does not contain them anymore. For this, we indexed a special document (a tombstone) in the production Solr for each doc that was deleted, and removed the tombstone from the production Solr once we deleted the associated doc in the DR Solr.
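A minimal sketch of that replication loop using the pysolr client (illustrative only; the URLs, the `update_ts` field name and the batch size are hypothetical, not from our actual system):

```python
import time
import pysolr

PROD = pysolr.Solr("http://prod-solr:8983/solr/core1")
DR = pysolr.Solr("http://dr-solr:8983/solr/core1")
BATCH = 500

def latest_dr_timestamp():
    """Ask the DR Solr for the most recent update timestamp it holds."""
    hits = list(DR.search("*:*", sort="update_ts desc", rows=1))
    return hits[0]["update_ts"] if hits else "1970-01-01T00:00:00Z"

def replicate_once():
    """Reindex into DR every production doc updated after DR's latest timestamp."""
    since = latest_dr_timestamp()
    updated = PROD.search(f"update_ts:{{{since} TO *]",
                          sort="update_ts asc", rows=BATCH)
    docs = [dict(d) for d in updated]
    if docs:
        DR.add(docs, commit=True)
    return len(docs)

while True:
    if replicate_once() < BATCH:   # caught up: sleep to release resources
        time.sleep(30)
```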

Privilege Separation for Linux Security Enhancement

Problem Description

I was working on a monitoring product that used agents installed on customer hosts to report metrics to backend servers. Customers preferred to run the agents as unprivileged processes, i.e., not as the root user, to prevent the agent from harming their host in case of a bug or malicious code in the agent. Our agent also supported running third-party plugins for monitoring specific applications; those plugins were not reviewed by our company and so increased the risk of harm by the agent. But our agent needed to run as root in order to execute all of its functionality, e.g., checking a server’s reachability by sending raw ICMP packets.

I wanted the agent to run most of its actions in unprivileged mode but allow it to run a limited list of predefined actions in privileged mode. The Linux security model does not support raising a process’s security level after it has started. Generally speaking, a process’s privileges are determined by the set of privileges owned by the user that launches the process (Linux also supports file capability configuration). Here is a summary of the requirements:

  1. allow the agent to execute unprivileged code (obvious).
  2. allow the agent to execute privileged actions from a set of limited actions.
  3. block the agent from executing privileged actions not in the set of limited actions.
  4. Third-party plugin code should not be changed to support the new requirements – i.e., the solution needs to support legacy plugins, especially custom plugins.

Solution Sketch

My solution was inspired by an online security course I took: implement the privilege separation principle, i.e., separate a set of privileged actions from the rest of the system, limit access to them, and block any other privileged action.

The agent will run in an unprivileged process and will send requests for privileged actions to a second process that runs as a privileged user.

The privileged action request will be defined by a strict API in the privileged process.

The privileged process will be accessible only by the agent process via a shared secret and privileged action requests will be logged for auditing.

This solution also enables finer-grained file system access control than the OS provides: the agent will be able to read any file on the file system but write only to a sandboxed part of the file system.

This approach is straightforward for stateless actions like InetAddress.isReachable, but is more challenging for actions that have state, like reading a file. The second process will need to track that state and handle life-cycle aspects such as cleaning up after actions that finish.

Solution Details

  1. The agent will have 2 processes:
    1. Process #1 – the agent core process, i.e., a process that runs much like the current agent. This process will run unprivileged.
    2. Process #2 – a helper process that will be privileged; it will execute requests from the agent core process and reply with the results.
  2. Process #1 will instrument the plugin code and replace the implementation of a set of privileged actions, e.g., Java/Sigar classes, with our own new implementation; for example, the InetAddress class that sends ICMP will be replaced with a MyInetAddress class. This can be achieved with the JVM’s javaagent hook as the instrumentation entry point and the javassist library for replacing classes; it might also be done via AspectJ. Our implementation will simply forward a request to process #2, which will actually run the action and return the response to process #1. The plugin in process #1 is not aware of the new implementation. If a plugin running in process #1 executes a privileged action that was not instrumented, i.e., not in the set of allowed privileged actions, it will be blocked by the OS since process #1 is unprivileged.
  3. Process #2 will have a single entry point, e.g., a TCP port that is accessible only by process #1, to prevent other unprivileged processes from executing privileged actions. This can be done by sharing a secret between process #1 and process #2: process #1 will authenticate when opening a connection to process #2 (a sketch of such a helper follows this list).
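Here is a minimal sketch of the privileged-helper idea in Python (my own illustration of the principle, not the product’s Java implementation): a whitelist of actions, an HMAC over each request using the shared secret, and a TCP entry point bound to localhost. The port number, secret handling and the ‘ping’ action are all hypothetical.

```python
import hmac, hashlib, json, socket, subprocess

SECRET = b"shared-secret-loaded-from-a-protected-file"  # hypothetical
ALLOWED = {"ping"}                                      # whitelist of privileged actions

def handle(request: dict) -> dict:
    """Run one whitelisted privileged action; reject everything else."""
    if request.get("action") not in ALLOWED:
        return {"error": "action not allowed"}
    # the privileged action (raw-socket ping) is emulated here by calling ping(8)
    rc = subprocess.call(["ping", "-c", "1", request["host"]],
                         stdout=subprocess.DEVNULL)
    return {"reachable": rc == 0}

def serve(port=5555):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", port))     # only local processes can connect
    srv.listen(1)
    while True:
        conn, _ = srv.accept()
        with conn:
            mac, _, body = conn.recv(65536).partition(b"\n")
            expected = hmac.new(SECRET, body, hashlib.sha256).hexdigest().encode()
            if not hmac.compare_digest(mac, expected):
                conn.sendall(b'{"error": "authentication failed"}')
                continue
            reply = handle(json.loads(body))      # audit logging could go here
            conn.sendall(json.dumps(reply).encode())

if __name__ == "__main__":
    serve()   # run as the privileged user; the agent core runs unprivileged
```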


Huffman coding – the optimal prefix code

Short story

Recently, I remembered that when I was a student I read about Huffman coding, which is a clever compression algorithm, and I had wanted to implement it ever since but never found the chance. Last week I finally did it. To test my implementation, I took a 160 KB file containing the text of the book ‘Alice in Wonderland’ and compressed it with my Huffman implementation down to a 94 KB binary file. Hurrah!

What is prefix code?

There are many ways to map a set of characters to a set of bits and so represent text as bits. For example, ANSI is a map that takes the Latin alphabet and maps each character to a byte. When you get a file in ANSI encoding you know that you need to scan the file reading groups of 8 bits (one byte) and translate each byte to a character. Giving every character the same number of bits simplifies decoding, as you only need to read a fixed number of bits and translate each group to the relevant character, but it is not optimal from a size perspective, since frequently used characters get the same-length representation as rarely used characters.

Some encoding techniques map characters to bits so that different characters are mapped to bit strings of different lengths. When the more frequently used characters are mapped to shorter bit strings, you expect the encoded file to take less space than one encoded in ANSI (more on how to measure space efficiency later in the post). The challenge with this approach is to know, while decoding a stream of bits, where one character’s representation ends and another begins. One solution is to use some bit pattern as a marker between the bit strings of different characters; another is to use binary trees like the Huffman tree (more on this later in the post).

A prefix code is a code that lets you decode the encoded text without a special marker. For example, the map {a=0, b=10, c=11} is a prefix code, as no marker is needed to decode any string. If you have ‘000101011’ then it is clear where the bit string of one character ends and another begins, and you translate it directly to ‘aaabbc’. As a counterexample, the map {a=0, b=1, c=11} is not a prefix code, as the string ‘111’ can be translated to ‘bbb’ or ‘bc’, i.e., we would need some separator marker between the bit strings in this code.
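A tiny sketch of decoding with a prefix code (a greedy walk over the bits; this works precisely because no code word is a prefix of another):

```python
def decode(bits, code):
    """bits: string of '0'/'1'; code: prefix-code map, e.g. {'a': '0', 'b': '10', 'c': '11'}."""
    reverse = {v: k for k, v in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:        # a full code word has been read
            out.append(reverse[current])
            current = ""
    return "".join(out)

print(decode("000101011", {"a": "0", "b": "10", "c": "11"}))  # -> aaabbc
```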

How to measure coding efficiency?

If the coding is represented by a map from a character c to a bit string, then we define length(c) to be the number of bits representing the character c, and freq(c) to be the frequency with which character c appears in the text.

The code efficiency is calculated by \sum_{c\in Alphabet} length(c)\times freq(c), i.e., the weighted average of the encoding lengths according to their frequencies.

Every prefix code can also be represented as a binary tree where each edge is marked ‘0’ or ‘1’ and the leaves are marked with characters, so the list of edges down to a leaf represents that character’s code. Here is a picture that shows this idea:

IMG_3419

The depth of a leaf is exactly the length of that character’s prefix code, so this representation provides an alternative expression of the code efficiency in the language of trees, i.e.,

B(T) = \sum_{c\in Alphabet} depth(c)\times freq(c) where depth(c) is the depth of the leaf c in the tree.

How does Huffman encoding work?

Here is the algorithm for building a Huffman tree (a code sketch follows the steps):

  1. Create a node for each character with its frequency and insert it into a list.
  2. Remove from the list the two nodes with the minimum frequencies. Then create a tree where these two nodes are sibling leaves and their parent is marked with the sum of their frequencies (this parent has no character associated with it).
  3. Add the new parent node to the list. The list is now shorter by one element.
  4. Repeat steps 2-3 until a single node remains in the list. This node is the root of the Huffman tree.
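A compact sketch of this algorithm in Python using a min-heap for the list of nodes (my own implementation, not the one mentioned at the start of the post); a counter breaks ties so the heap never has to compare tree nodes:

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_codes(text):
    """Return a dict mapping each character to its Huffman code string."""
    tie = count()  # tie-breaker so heapq never compares the tree nodes themselves
    # step 1: one leaf node (freq, tie, (char, left, right)) per character
    heap = [(freq, next(tie), (ch, None, None)) for ch, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # steps 2-3: merge the two minimum-frequency nodes under a new parent
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (None, left, right)))
    _, _, root = heap[0]

    codes = {}
    def walk(node, prefix):
        ch, left, right = node
        if ch is not None:                # leaf: the path so far is the code
            codes[ch] = prefix or "0"     # single-character corner case
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(root, "")
    return codes

text = "a" * 20 + "b" * 15 + "c" * 5 + "d" * 15 + "e" * 45
print(build_huffman_codes(text))  # 'e' gets the shortest code, 'c' the longest
```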

Example:

Let’s look at a text with 20 ‘a’s, 15 ‘b’s, 5 ‘c’s, 15 ‘d’s and 45 ‘e’s.

Step 1 says we prepare the list {a/20, b/15, c/5, d/15, e/45}.

Iteration 1:

Step 2 says we need to choose two minimum-frequency nodes, e.g., ‘c’ and ‘d’ (we could also have taken ‘c’ and ‘b’, since ‘b’ and ‘d’ have the same frequency), and create a tree:

Screen Shot 2016-01-01 at 3.45.07 PM

We give this node the alias ‘n1’ for consistent labeling of the nodes, but the label itself is never used. Note that its frequency is 20 = 5 + 15.

Step 3 says we add ‘n1’ to the list, ending up with {a/20, n1/20, b/15, e/45}.

Iteration 2:

In step 2, similarly to what we did above, we choose ‘a’ and ‘b’ and produce a new node n2 and a tree:

Screen Shot 2016-01-01 at 3.48.00 PM

In step 3, the list ends up looking like this: {n2/35, n1/20, e/45}.

Iteration 3

In step 2, we combine n2/35 and n1/20 under a new node n3/55:

N3

In step 3, the list ends up looking like this: {n3/55, e/45}.

It should be clear that after the 4th iteration we will end up with the Huffman tree shown in the previous section.

Why Huffman encoding is optimal? – a mathematical proof

In our context, optimal encoding means that if you take a fixed alphabet with known frequencies, then the Huffman code will have the minimum code-efficiency value (as calculated above) compared to all possible prefix codes for this alphabet (with the same frequencies).

Lemma 1: Consider the two letters, x and y with the smallest frequencies. Then there is an optimal tree in which these two letters are sibling leaves in the tree in the lowest level.

Proof: First, we notice that an optimal tree T must be full, in the sense that every internal node has two child nodes – otherwise, if there were an internal node P with a single child node Q, we could remove P from the tree and put Q (with its whole subtree) in P’s place, which would reduce the depth of all nodes under Q by one and so reduce the value of B(T), contradicting the fact that B(T) is minimal.

Next, let T be an optimal prefix code tree, and let b and c be two siblings at the maximum depth of the tree (they must exist because T is full). Since x and y have the two smallest frequencies, it follows (without loss of generality) that freq(x)\le freq(b) and freq(y)\le freq(c). Because b and c are at the deepest level of the tree, we know that depth(x)\le depth(b) and depth(y)\le depth(c).

Now switch the positions of x and b in the tree, resulting in a different tree T’, and see how the cost changes.

B(T')=B(T)-freq(b)\times depth(b) -freq(x)\times depth(x) + freq(b)\times depth (x) + freq(x)\times depth(b)

Rearranging the right-hand side of the equation,

B(T')=B(T)- [(freq(x) - freq(b))\times (depth(x)-depth(b))] \le B(T), since the subtracted product is non-negative (both factors are \le 0). But as T is optimal,

B(T)\le B(T') and so B(T)=B(T').

Repeating this argument, swapping y and c, results in the required optimal tree in which x and y are sibling leaves at the lowest level.

Q.E.D


Theorem: The Huffman code has a code efficiency that is lower than or equal to that of every prefix code for the alphabet.

Proof: We will prove this by induction on the size of the alphabet.

Starting with an alphabet of size 2, Huffman encoding generates a tree with one root and two leaves. It is obvious that this is the smallest possible tree, and so the coding efficiency of this tree is minimal.

Assume the theorem is true for all alphabets of size n-1; we will prove that it is also true for all alphabets of size n.

Let A be an alphabet with n characters and let T be the optimal tree for A. By Lemma 1, T has two leaves ‘x’ and ‘y’ with minimum frequencies that are siblings. We look at T’, which is constructed from T by removing the leaves ‘x’ and ‘y’. We can view T’ as a prefix code tree for the alphabet A’, which is constructed from A by removing the characters ‘x’ and ‘y’ and adding a new character ‘z’, their parent node in T. Every character needs an associated frequency, so we set the frequency of ‘z’ to freq(x)+freq(y).

We have

(1) depth(x)=depth(y)=depth(z)+1, therefore depth(z)=depth(x)-1=depth(y)-1.

(2) freq(z) = freq(x)+freq(y).

(3) B(T') =B(T)-freq(x)\times depth(x) - freq(y)\times depth(y) + freq(z)\times depth(z) .

substituting freq(z) and depth(z) in (3) we get

B(T') = B(T) -freq(x) - freq(y) and so

(4) B(T)=B(T')+freq(x) + freq(y).

Let H’ be the Huffman tree for A’ . According to our assumption  H’ is optimal and so

(5) B(H') \le B(T').

Adding to H’ two leaves ‘x’ and ‘y’ under the parent ‘z’ leaves us with the Huffman tree H for the alphabet A.

This is true because the first step of the algorithm above, when building H, is to put ‘x’ and ‘y’ under ‘z’; it then effectively removes the characters ‘x’ and ‘y’ from A and replaces them with ‘z’, i.e., the rest of the algorithm builds the Huffman tree for A’, which is exactly H’ (I find this part of the proof the hardest to grasp).

Following the same argument as we had for calculating B(T) we get

(6) B(H)=B(H') + freq(x)+freq(y).

From (5) and (6) we get

(7) B(H)=B(H')+freq(x)+freq(y)\le B(T')+freq(x)+freq(y).

from (7) and (4) we get

(8) B(H)\le B(T).

As T is an optimal tree, we also have

(9) B(T)\le B(H).

From (8) and (9) we get

B(T)= B(H) i.e., the Huffman tree H is optimal.

This completes the induction proof.

Q.E.D



RSYNC design – a method for replicating updates

Background

rsync is an algorithm for efficient remote update of files over a low-bandwidth network link.

Definitions

  • S (source) – the process at the end that has access to the latest version of a file.
  • T (target) – the process at the end that has access to a version of a file that is older than the version S has.

Requirements

  • Sync the older version of the file at T with the newer version at S.
  • Minimize the network traffic used for the synchronization.

Implementation

  1. T (the target) divides its file into blocks of a fixed size L.
  2. For each block, T calculates a signature that is fast to compute but not unique (the ‘fast’ signature).
  3. For each block, T calculates a signature that is slower to compute but effectively unique (the ‘slow’ signature).
  4. T sends S a list of pairs of fast and slow signatures, one pair per block.
  5. S starts checking its first block.
  6. T creates a temporary empty file X that will eventually replace its current file with the version on the S side.
  7. S calculates the fast signature for the block. If it is not equal to any of T’s fast signatures, S sends the block’s first byte to T, and T appends this byte to file X. If the fast signature matches, S calculates the slow signature of the block and compares it to the matching slow signature. If the two signatures are equal, S sends T the id of the block, and T appends the corresponding block from its local copy, looked up by block id. If the slow signature does not match, S sends the first byte to T as before.
  8. The next block S checks starts at the next byte if the previous block did not match, or at an offset of one block size if the previous block matched. S may need to calculate the fast signature for every possible block (i.e., blocks generated from the previous block by shifting one byte at a time, from the first block to the last) – so rsync uses a rolling checksum similar to the Adler checksum, which can be updated incrementally from the previous block’s signature in a small number of steps (a sketch of such a rolling checksum follows the list).
  9. Repeat steps 7-8 until S reaches the end of its file.
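To illustrate step 8, here is a minimal sketch of an Adler-style rolling (weak) checksum in Python (a simplified illustration of the idea, not the exact checksum rsync uses):

```python
MOD = 1 << 16

def weak_checksum(block):
    """Compute the weak signature of a block from scratch."""
    a = sum(block) % MOD
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
    return a, b

def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte: drop `out_byte`, add `in_byte`, in O(1)."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - block_len * out_byte + a) % MOD
    return a, b

data = b"the quick brown fox jumps over the lazy dog"
L = 8
a, b = weak_checksum(data[0:L])
for i in range(1, len(data) - L + 1):
    a, b = roll(a, b, data[i - 1], data[i + L - 1], L)
    assert (a, b) == weak_checksum(data[i:i + L])   # rolling matches recompute
print("rolling checksum verified for", len(data) - L, "shifts")
```

The point of the rolling form is that shifting the window by one byte costs a handful of arithmetic operations instead of re-reading the whole block, which is what makes checking every byte offset in step 8 affordable.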

rsync1

rsync2

rsync3

rsync4

Reference:

  1. Andrew Tridgell,  “Efficient Algorithms for Sorting and Synchronization” , https://rsync.samba.org/~tridge/phd_thesis.pdf