What is a gossip protocol?
It is a protocol that lets large number of nodes exchange key value data with no central management. Each node gossip with a random small subset of the nodes, updating of its knowledge on the key/value data of other nodes.
The data of node p can be modeled with (r,k,v,n) where r is a node id, k is the key id, v is the value of the key and n is the version in r for k/v.
When one of node change the value of a key, we want to have k/v updated in all nodes to the latest version in minimum time.
What is the usage?
– data replication
let say we want to store a key/value on several nodes for high availability. The value of a key can be changed by one of the node and so need to be replicate to all other nodes.
If a node handles a range of keys, then each node can know which nodes handle a key so it can be used as to route requests related to a key to the right node.
– nodes liveness monitoring
The key will be the node id and the value the last time it was contacted.
Each node can decide that a node X is down if more than half of the other nodes did not contact X in the last T sec.
– performance monitoring
The key will be a node id plus a performance counter, the value will be the counter value. A node can detect when other nodes are available and route traffic to it.
What are the challenges?
The gossip message size can be larger than the MTU (maximum transmission unit) due to the large number of nodes and key/value need to be transmitted. In such case we need to send part of the data and there are different prioritzation algorithms (also called reconciliation algorithms) to decide which key/value subset need to be to sent to a node.
– precise reconciliation
Node q send node p a message with (r,k,.,n) i.e. without values.
Node p send node q a subset of (r,k,v,n’) where n’>n which fills the message.
The subset can be taken by starting with the smallest n’ which is larger than the value of n as seeing by p for q i.e., (q,k,.,n). Another algorithm takes the highest n’ which larger than n (this can cause starvation).
– Scuttlebutt reconciliation
1) Node p will store a new k,v with a timestamp n that is bigger than all existing timestamps in p.
2) Node q send node p a message with (r,max n in r).
3) Node p send node q a message that is built by taking the first (r,k,v,n) he has with the smallest n that is larger than the max seeing by q for r.
4) p will send q its own k,v in an increasing order of timestamps i.e. When q calculates max n for p then it can assume all other k,v from p has larger timestamps n.
5) If p has (r,k,v,n) that is newer than the k,v in q but n is smaller than q known max timestamp for r then p will not send it to q as it will assume that a newer version for k,v exist in r should be known to other nodes.
Example is need here:
p has (r,a,.,1) , (r,b,.,2)
q has (r,a,.,12), (r,b,.13)
r has (r,a,.21)
r sends q and p (r,a,.,21)
p has (r,a,.,21), (r,b,.,2)
q has (r,a,.,21)), (r,b,.13)
p send q (r, max n i.e., 21)
q will NOT send p (r,b,.,13) although it is newer than p value for b.
The idea here is that the q knows that r has a newer value for b as he got (r,.,.,21) from r and b version in q is 13 i.e., q understand that all version of r will have newer version than 21 so his value for b is not the latest and so there is no point sending it. q assumes other nodes will update the latest value of b to p.
When the rate of updates grows, nodes requirs more cpu and network resources to process them, therefore a carefull adjustments of rate control is needed. In general, it is assumed that correct value will be set eventually and less stress is put on having the latest update immediatelly.
The purpose here is to
1) let each node publish messages in the highest rate not causing network latency.
2) all nodes use the same rate.
Here latency is determined by the size of the gossip messages backlog size in a node.
Each node start to send updates and increase the rate in liner fashion.
When latency is noticed, it will reduce the message size by a constant factor.
Nodes will gossip their update rate and when p and q replace update rate they will average them to increase or reduce their rate so both will have the same rate and the overall system rate is not changed.