RSYNC design – a method for replicating updates

Background

rsync is an algorithm for efficiently updating files on a remote machine over a low-bandwidth network link.

Definitions

  • S (source) – the process at the end that has access to the latest version of a file.
  • T (target) – the process at the end that has access to a version of a file that is older than the version S has.

Requirements

  • Bring the target’s (T) older version of the file in sync with the source’s (S) latest version.
  • Minimize the network traffic for the synchronization.

Implementation

  1. T (the target) divides the file into blocks of a fixed size L.
  2. For each block, T calculates a signature that is fast to calculate but not unique (a weak checksum).
  3. For each block, T also calculates a signature that is slower to calculate but effectively unique (a strong checksum).
  4. T sends S a list of pairs with the fast and slow signatures of each block.
  5. S starts checking its first block.
  6. T creates a temporary empty file X that will eventually replace its current file with the version from S’s side.
  7. S calculates the fast signature for the block. If it does not equal any of T’s fast signatures, S sends the first byte of the block to T, and T appends this byte to file X. If the fast signature matches, S calculates the slow signature of the block and compares it to the matching slow signature. If the two slow signatures are equal, S sends T the id of the block, and T appends that block from its local copy to X using the block id. If the slow signatures are not equal, S sends the first byte to T as before.
  8. The next block S checks starts at the next byte if the previous block did not match, or at an offset of one block size if it did match. S may therefore need to calculate the fast signature for every possible block (i.e., blocks generated from the previous block by shifting one byte at a time, from the first block to the last), so rsync uses a rolling checksum similar to Adler-32 that can be updated from the previous block’s signature in a small, constant number of steps (see the sketch after this list).
  9. Repeat steps 7-8 until S reaches the end of its file.
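
To make step 8 concrete, below is a minimal Python sketch of the block-matching idea – not rsync’s actual implementation. It assumes a tiny block size L for illustration, uses an Adler-32-style rolling checksum as the fast signature and MD5 as the slow signature, and skips all I/O and protocol details.

    # A toy sketch of rsync-style block matching (illustrative, not the real rsync).
    import hashlib

    L = 4          # block size, tiny on purpose
    MOD = 65521    # modulus used by Adler-32

    def fast_signature(block):
        # Weak, rolling-friendly checksum of a block.
        a = sum(block) % MOD
        b = sum((len(block) - i) * byte for i, byte in enumerate(block)) % MOD
        return (a, b)

    def roll(sig, out_byte, in_byte, length):
        # Update the fast signature when the window slides one byte forward:
        # drop out_byte from the front and append in_byte at the back.
        a, b = sig
        a = (a - out_byte + in_byte) % MOD
        b = (b - length * out_byte + a) % MOD
        return (a, b)

    def slow_signature(block):
        # Strong (effectively unique) checksum of a block.
        return hashlib.md5(block).hexdigest()

    def target_signatures(old_data):
        # Steps 1-4: T splits its old file into fixed-size blocks and signs each one.
        sigs = {}
        for block_id in range(len(old_data) // L):
            block = old_data[block_id * L:(block_id + 1) * L]
            sigs.setdefault(fast_signature(block), []).append((block_id, slow_signature(block)))
        return sigs

    def source_delta(new_data, sigs):
        # Steps 5-9: S scans its new file (bytes) and emits either block ids or literal bytes.
        delta, i, sig = [], 0, None
        while i + L <= len(new_data):
            block = new_data[i:i + L]
            sig = fast_signature(block) if sig is None else roll(sig, new_data[i - 1], new_data[i + L - 1], L)
            match = next((bid for bid, slow in sigs.get(sig, []) if slow == slow_signature(block)), None)
            if match is not None:
                delta.append(("block", match))   # T can copy this block from its local file
                i += L
                sig = None                        # restart the rolling window after a match
            else:
                delta.append(("byte", new_data[i]))
                i += 1
        delta.extend(("byte", b) for b in new_data[i:])  # trailing bytes shorter than one block
        return delta

Running target_signatures over T’s old file and source_delta over S’s new file yields a list of ("block", id) and ("byte", value) instructions that T can replay against its local copy to build the temporary file X.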

[Figures: rsync1, rsync2, rsync3, rsync4]

Reference:

  1. Andrew Tridgell, “Efficient Algorithms for Sorting and Synchronization”, https://rsync.samba.org/~tridge/phd_thesis.pdf

Architecture of ZAB – ZooKeeper Atomic Broadcast protocol

Background

ZooKeeper supports clients reading and updating key-value pairs with high availability. High availability is achieved by replicating the data to multiple nodes and letting clients read from any node. Critical to the design of ZooKeeper is the observation that each state change is incremental with respect to the previous state, so there is an implicit dependence on the order of the state changes. ZooKeeper Atomic Broadcast (ZAB) is the protocol under the hood that provides ZooKeeper’s replication order guarantee. It also handles electing a leader and recovering from failing leaders and nodes. This post is about ZAB.

Definitions

  • leader and followers – in a ZooKeeper cluster, one node has the leader role and the rest have the follower role. The leader is responsible for accepting all incoming state changes from clients and replicating them to itself and to the followers. Read requests are load-balanced across the followers and the leader.
  • transactions – client state changes that a leader propagates to its followers.
  • ‘e’ – the epoch of a leader. An epoch is an integer generated by a leader when it starts to lead, and it must be larger than the epochs of all previous leaders.
  • ‘c’ – a sequence number generated by the leader, starting at 0 and increasing. Together with the epoch it orders the incoming client state changes (a small sketch of this ordering follows this list).
  • ‘F.history’ – the follower’s history queue, used for committing incoming transactions in the order they arrived.
  • outstanding transactions – the set of transactions in F.history with sequence numbers smaller than the sequence number of the current COMMIT.
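
As a small illustration of how the pair (e, c) orders transactions, here is a Python sketch; the name Zxid and the comparison order (epoch first, then counter) follow the usual ZooKeeper convention, the rest is illustrative.

    # Transactions are ordered by (epoch, counter): epoch first, then counter.
    from dataclasses import dataclass

    @dataclass(order=True)
    class Zxid:
        epoch: int    # 'e' - generated by the leader when it starts leading
        counter: int  # 'c' - incremented by the leader for each state change

    # Every transaction from epoch 2 is ordered after every transaction from epoch 1,
    # regardless of the counters.
    assert Zxid(1, 7) < Zxid(2, 0) < Zxid(2, 1)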

ZAB Requirements

  1. Replication guarantees
    1. Reliable delivery – If a transaction, M, is committed by one server, it will be eventually committed by all servers.
    2. Total order – If transaction A is committed before transaction B by one server, A will be committed before B by all servers. If A and B are committed messages, either A will be committed before B or B will be committed before A.
    3. Causal order – If a transaction B is sent after a transaction A has been committed by the sender of B, A must be ordered before B. If a sender sends C after sending B, C must be ordered after B.
  2. Transactions are replicated as long as a majority (quorum) of nodes is up.
  3. When a node fails and later restarts, it should catch up on the transactions that were replicated while it was down.

ZAB Implementation

  • Clients read from any of the ZooKeeper nodes.
  • Clients write state changes to any of the ZooKeeper nodes, and these state changes are forwarded to the leader node.
  • ZooKeeper uses a variation of the two-phase-commit protocol for replicating transactions to followers. When the leader receives a change update from a client, it generates a transaction with sequence number c and the leader’s epoch e (see definitions) and sends the transaction to all followers. A follower adds the transaction to its history queue and sends an ACK to the leader. When the leader receives ACKs from a quorum, it sends the quorum a COMMIT for that transaction. A follower that receives a COMMIT commits that transaction, unless its history queue still contains earlier transactions (outstanding transactions) that have not been committed yet; in that case it waits to receive COMMITs for those earlier transactions before committing. (A sketch of this flow appears after the figure below.)

[Figure: picture taken from reference [4]]
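
The following is a minimal, single-process Python sketch of the broadcast flow described above; real ZooKeeper sends these messages over the network and handles failures, and all names here are illustrative.

    # A toy sketch of the ZAB broadcast message flow: PROPOSE -> ACK -> COMMIT.

    class Follower:
        def __init__(self):
            self.history = []    # F.history: accepted but not yet committed transactions
            self.committed = []

        def on_propose(self, txn):
            self.history.append(txn)   # accept the proposal into the history queue
            return "ACK"

        def on_commit(self, zxid):
            # Commit in order: deliver every accepted transaction up to and including zxid.
            while self.history and self.history[0][0] <= zxid:
                self.committed.append(self.history.pop(0))

    class Leader:
        def __init__(self, followers, epoch):
            self.followers, self.epoch, self.counter = followers, epoch, 0

        def broadcast(self, change):
            self.counter += 1
            zxid = (self.epoch, self.counter)       # (e, c)
            txn = (zxid, change)
            acks = sum(1 for f in self.followers if f.on_propose(txn) == "ACK")
            if acks + 1 > (len(self.followers) + 1) // 2:   # the leader counts itself in the quorum
                for f in self.followers:
                    f.on_commit(zxid)

    followers = [Follower(), Follower()]
    leader = Leader(followers, epoch=1)
    leader.broadcast("set /x = 1")
    leader.broadcast("set /x = 2")
    assert followers[0].committed == [((1, 1), "set /x = 1"), ((1, 2), "set /x = 2")]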

  • When the leader crashes, nodes execute a recovery protocol, both to agree upon a common consistent state before resuming regular operation and to establish a new leader to broadcast state changes.
  • To exercise the leader role, a node must have the support of a quorum of nodes. As nodes can crash and recover, there can be multiple leaders over time, and the same node may exercise the leader role multiple times.
  • node’s life cycle: each node executes one iteration of this protocol at a time, and at any time a process may drop the current iteration and start a new one by proceeding to Phase 0 (a sketch of this loop appears after this list).
    • Phase 0 –  prospective leader election
    • Phase 1 – discovery
    • Phase 2 – synchronization
    • Phase 3 – broadcast
  • Phases 1 and 2 are important for bringing the ensemble to a mutually consistent state, especially when recovering from crashes.
  • Phase 1 – Discovery
    In this phase, followers communicate with their prospective leader, so that the leader gathers information about the most recent transactions that its followers accepted. The purpose of this phase is to discover the most updated sequence of accepted transactions among a quorum, and to establish a new epoch so that previous leaders cannot commit new proposals. Because every change committed by the previous leader was accepted by a quorum of followers, and any two quorums intersect, at least one of the followers in the current quorum has in its history queue all the changes committed by the previous leader, which means that the new leader will obtain them as well. The exact Phase 1 algorithm is given in the “Algorithm for Zab Phase 1 – Discovery” section below.
  • Phase 2 – Synchronization
    The Synchronization phase concludes the recovery part of the protocol, synchronizing the replicas in the cluster using the leader’s updated history from the discovery phase. The leader communicates with the followers, proposing transactions from its history. Followers acknowledge the proposals if their own history is behind the leader’s history. When the leader sees acknowledgements from a quorum, it issues a commit message to them. At that point, the leader is said to be established and is no longer prospective. The exact Phase 2 algorithm is given in the “Algorithm for Zab Phase 2 – Synchronization” section below.
  • Phase 3 – Broadcast
    If no crashes occur, peers stay in this phase indefinitely, performing broadcast of transactions as soon as a ZooKeeper client issues a write request. The exact Phase 3 algorithm can be found in reference [2].
  • To detect failures, Zab employs periodic heartbeat messages between followers and their leader. If a leader does not receive heartbeats from a quorum of followers within a given timeout, it abandons its leadership and moves to leader election and Phase 0. A follower also goes to the leader election phase if it does not receive heartbeats from its leader within a timeout.
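
As a rough illustration of the node life cycle above, here is a schematic Python sketch of the phase loop; all phase bodies are placeholders and the names are illustrative only, not ZooKeeper’s actual implementation.

    # A node cycles through Phase 0 -> 1 -> 2 -> 3 and drops back to Phase 0
    # whenever the current iteration fails (for example, heartbeats time out).

    class RestartIteration(Exception):
        """Raised when a node must abandon the current iteration and re-elect."""

    def phase0_election(node):                return "prospective-leader"
    def phase1_discovery(node, leader):       return ([], 1)          # (latest history, new epoch)
    def phase2_synchronization(node, leader, history, epoch): pass
    def phase3_broadcast(node, leader):       raise RestartIteration  # runs until a failure is detected

    def run_node(node, max_iterations=3):     # a real node loops forever
        for _ in range(max_iterations):
            try:
                leader = phase0_election(node)                          # Phase 0 - leader election
                history, epoch = phase1_discovery(node, leader)         # Phase 1 - discovery
                phase2_synchronization(node, leader, history, epoch)    # Phase 2 - synchronization
                phase3_broadcast(node, leader)                          # Phase 3 - broadcast
            except RestartIteration:
                continue                                                # back to Phase 0

    run_node("node-1")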

References:

  1. Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini, “Zab: High-performance broadcast for primary-backup systems”
  2. André Medeiros, “ZooKeeper’s atomic broadcast protocol: Theory and practice”
  3. ZooKeeper Apache documentation
  4. Benjamin Reed, Flavio P. Junqueira, “A simple totally ordered broadcast protocol”

Algorithm for Zab Phase 2 – Synchronization

This is part of my main post on architecture of ZAB.

There are four variables that constitute the persistent state of a node, which are used during the recovery part of the protocol:
history: a log of transaction proposals accepted;
acceptedEpoch: the epoch number of the last NEWEPOCH message accepted;
currentEpoch: the epoch number of the last NEWLEADER message accepted;
lastZxid: zxid of the last proposal in the history;
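
A minimal Python sketch of this persistent state (the names follow the variables above; representing a zxid as an (epoch, counter) pair is an assumption of this sketch):

    from dataclasses import dataclass, field
    from typing import List, Tuple

    Zxid = Tuple[int, int]   # (epoch e, counter c)

    @dataclass
    class PersistentState:
        history: List[Tuple[Zxid, object]] = field(default_factory=list)  # accepted proposals
        acceptedEpoch: int = 0   # epoch of the last NEWEPOCH message accepted
        currentEpoch: int = 0    # epoch of the last NEWLEADER message accepted

        @property
        def lastZxid(self) -> Zxid:
            # zxid of the last proposal in the history (or the minimal zxid if the history is empty)
            return self.history[-1][0] if self.history else (0, 0)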

[Figure: algo2 – pseudocode of Zab Phase 2 (Synchronization)]

References: André Medeiros, “ZooKeeper’s atomic broadcast protocol: Theory and practice”, http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

Algorithm for Zab Phase 1 – Discovery

This is part of my main post on architecture of ZAB.

There are four variables that constitute the persistent state of a node, which are used during the recovery part of the protocol:
history: a log of transaction proposals accepted;
acceptedEpoch: the epoch number of the last NEWEPOCH message accepted;
currentEpoch: the epoch number of the last NEWLEADER message accepted;
lastZxid: zxid of the last proposal in the history;

[Figure: algo1 – pseudocode of Zab Phase 1 (Discovery)]

References: André Medeiros, “ZooKeeper’s atomic broadcast protocol: Theory and practice”, http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

Two-Phase-Commit

What are we talking about here?

We want to commit a transaction on several nodes so that it is committed on all of the nodes or on none.

Assumptions

  1. One node is designated as the coordinator.
  2. Each node has persistent storage.
  3. Each node can eventually communicate with any other node.
  4. Each node will eventually recover from a crash, i.e., the fail-recovery model.

Phase-1

  1. The coordinator sends a COMMIT_REQUEST to all other nodes.
  2. A node that receives a COMMIT_REQUEST writes it to its transaction log, holds the resources required for the commit, and sends an AGREED message to the coordinator. If the node cannot commit, it sends an ABORT message to the coordinator.
  3. If the coordinator does not receive a response from a node, it re-sends the COMMIT_REQUEST to that node, or after some time it sends an ABORT message to all the nodes.

Phase-2

  1. If the coordinator received AGREED from all nodes, it sends a COMMIT message to all nodes (a sketch of the full flow appears after this list).
  2. If the coordinator received ABORT from any of the nodes, it sends an ABORT message to all nodes.
  3. If the coordinator does not receive a response from a node, it re-sends the COMMIT message to that node.
  4. A node that receives a COMMIT message commits the transaction and responds with COMMITTED.
  5. The coordinator considers the commit complete once it has received a COMMITTED message from all the nodes.
  6. A node that receives an ABORT message rolls back the transaction.
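
Below is a minimal, in-process Python sketch of the flow described in the two phases above; real implementations add persistent logging, timeouts, and retries, and all names here are illustrative.

    # A toy two-phase commit: the coordinator asks every node to prepare (Phase 1)
    # and commits only if all of them agreed (Phase 2).

    class Node:
        def __init__(self, name, can_commit=True):
            self.name, self.can_commit, self.state = name, can_commit, "idle"

        def on_commit_request(self, txn):
            # Phase 1: log the transaction, hold the commit resources, and vote.
            if not self.can_commit:
                return "ABORT"
            self.state = "prepared"
            return "AGREED"

        def on_decision(self, decision):
            # Phase 2: commit or roll back according to the coordinator's decision.
            self.state = "committed" if decision == "COMMIT" else "aborted"
            return "COMMITTED" if decision == "COMMIT" else "ABORTED"

    def coordinator(nodes, txn):
        votes = [n.on_commit_request(txn) for n in nodes]            # Phase 1
        decision = "COMMIT" if all(v == "AGREED" for v in votes) else "ABORT"
        for n in nodes:                                              # Phase 2
            n.on_decision(decision)
        return decision

    nodes = [Node("a"), Node("b"), Node("c", can_commit=False)]
    assert coordinator(nodes, "txn-1") == "ABORT"
    assert all(n.state == "aborted" for n in nodes)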

Disadvantages

  1. This protocol is blocking, i.e., a node holds commit-related resources until the coordinator has received COMMITTED messages from all nodes.
  2. We assume that all nodes will recover after a crash, but this is not always the case: if the coordinator crashes after the nodes approved the commit and before it sent them the COMMIT message, the nodes will wait for the coordinator and hold their resources forever.

Reference

  1. http://courses.cs.vt.edu/~cs5204/fall00/distributedDBMS/duckett/tpcp.html
  2. http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/

Crash-Recovery System Model Characteristics

This is a model that is useful to describe systems that have nodes that can crash and recover later on.

  1. The system is a set of processes \Pi = \{ p_1, \ldots, p_n \}.
  2. Processes communicate by sending messages over TCP connections.
  3. Each process has persistent storage.
  4. Processes have two states: up and down.
  5. A process may crash and recover indefinitely many times.

Reference: André Medeiros, “ZooKeeper’s atomic broadcast protocol: Theory and practice”, http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf

What is Zookeeper?

ZooKeeper is a high-performance coordination service for distributed applications.

Zookeeper replicates its data to multiple servers, which makes the data highly reliable and available.

It exposes common services so you don’t have to write them from scratch – such as:

  • configuration management
  • synchronization

You can use it to implement:

  • leader election
  • presence protocols

Main Features

Zookeeper has a very simple, file-system-like API.
You can think of it as a distributed Windows registry.
You can:

  • create a path.
  • set the value of a path.
  • read the value of a path.
  • delete a path.
  • list the children of a path.

It does a few more interesting things:

  • one can register a watcher on a path and get notified when the children of a path or the value of a path have changed.
  • a path can be created as ephemeral (as opposed to persistent), which means that if the creating client goes away, the path is automatically removed by the Zookeeper server (see the sketch after this list).
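
For a feel of the API, here is a short sketch using the third-party Python client kazoo; the host address, paths, and values are made up for illustration.

    # Basic ZooKeeper operations via the kazoo client (pip install kazoo).
    from kazoo.client import KazooClient

    zk = KazooClient(hosts="127.0.0.1:2181")   # assumed local ZooKeeper server
    zk.start()

    zk.create("/app/config/timeout", b"30", makepath=True)    # create a path with a value
    value, stat = zk.get("/app/config/timeout")               # read the value of a path
    zk.set("/app/config/timeout", b"60")                      # set the value of a path
    children = zk.get_children("/app/config")                 # list the children of a path

    # get notified when the value of a path changes
    @zk.DataWatch("/app/config/timeout")
    def on_change(data, stat):
        print("timeout changed to", data)

    # an ephemeral path disappears automatically when this client's session ends
    zk.create("/app/workers/worker-1", b"", ephemeral=True, makepath=True)

    zk.delete("/app/config/timeout")                          # delete a path
    zk.stop()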

Programming Zookeeper

http://zookeeper.apache.org/doc/r3.4.2/zookeeperTutorial.html

References

http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf

Bloom Filter – Space/Time Trade-offs in Hash Coding with Allowable Errors

What is bloom filter?

  • The main idea is to use a fixed-size bit array to check whether a key belongs to a set of keys. The array is much smaller than the key set itself. The algorithm will say for sure when a key does not belong to the set, but it may give a false-positive answer for belonging to the set. The larger the array, the lower the chance of a false positive.

What is it used for?

  • Powerful for checking whether a key belongs to a set of keys when the key space is very large but the set is relatively small. For example, assume we want to write an application that does word hyphenation. 90% of English words follow the same simple hyphenation rule, while the rest need special, complicated rules. We can use a filter to check for the 10% of words in an efficient way.
  • Used by web caches for ignoring one-time hits. According to Wikipedia, 75% of cached URLs are single-hit URLs. To ignore them, a web cache checks whether its Bloom filter might contain a newly requested URL; if not, this is the first hit, so the cache only updates the Bloom filter and does not cache the object. The next time the same URL is hit, it is found in the Bloom filter, and the object is then added to the cache.
  • Can be used to reduce load on a server by delegating work to the browsers: say you have a big list that must be checked against by a server. You can generate a Bloom filter for it and have the client store it; only when a key might be in the list is a more accurate examination done on the server side. For example, this can be used to manage URL blacklists in browsers.
  • Used by Cassandra to check whether a key exists in a table file.

Bloom algorithm details

  • Inserting a key from the set into the bit array:
    • Define a bit array of size M.
    • Choose K different hash functions that take a key from the set as input and map it to a value in the range {1,..,M}; we treat the hash result as an index into the array.
    • Apply each hash function to the key. This produces K values in the range {1,..,M}. Set the bit at array index j if j is one of the K hash results, i.e., at most K bits are set to ‘1’ per key.
    • Repeat the process for each key; with each key, more bits may be set in the array.
    • We end up with a single bit array A representing the set.
  • Testing whether a key belongs to the set:
    • Apply each of the K hash functions to the key; this gives a subset L of {1,..,M}. Verify that all the bits of A at the positions in L are set, e.g., if L contains 3, check that the 3rd bit of A is set.
    • If for one of the K hash functions the result is N but the N-th bit of A is not set, we can be sure that the key does not belong to the set.
    • If all K checks pass, the key might belong to the set, but there is a chance of a false positive (a small sketch appears after this list).
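
A minimal Python sketch of the insert and test operations above; deriving the K index functions from two hash digests (“double hashing”) is an implementation convenience of this sketch, not part of the original paper.

    # A small Bloom filter: an M-bit array and K index functions per key.
    import hashlib

    class BloomFilter:
        def __init__(self, m=1024, k=5):
            self.m, self.k = m, k
            self.bits = [0] * m

        def _indexes(self, key):
            # Derive K array indexes for the key from two hash digests.
            h1 = int.from_bytes(hashlib.md5(key.encode()).digest(), "big")
            h2 = int.from_bytes(hashlib.sha1(key.encode()).digest(), "big")
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, key):
            # Inserting a key sets at most K bits.
            for j in self._indexes(key):
                self.bits[j] = 1

        def might_contain(self, key):
            # False means "definitely not in the set"; True means "probably in the set".
            return all(self.bits[j] for j in self._indexes(key))

    bf = BloomFilter()
    for word in ["naphtha", "rhythm", "colonel"]:   # e.g., the hard-to-hyphenate words
        bf.add(word)

    print(bf.might_contain("rhythm"))   # True - it was added
    print(bf.might_contain("table"))    # almost certainly False, with a small false-positive chance

With m bits, k hash functions, and n inserted keys, the false-positive rate is roughly (1 - e^(-kn/m))^k, which is why a larger array lowers the chance of a false positive.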

References