What are we talking about here?
We want to commit a transaction across several nodes so that it is committed on all of them or on none.
- One node is designated as the coordinator.
- Each node has persistent storage.
- Each node can eventually communicate with any other node.
- Each node eventually recovers from a crash, i.e., the fail-recovery model.
- The coordinator sends a COMMIT_REQUEST message to all other nodes.
- A node that receives COMMIT_REQUEST writes it to its transaction log, holds the resources required for the commit, and sends an AGREED message to the coordinator. If the node cannot commit, it sends an ABORT message to the coordinator instead.
- If the coordinator does not receive a response from a node, it re-sends COMMIT_REQUEST to that node or, after some time, sends an ABORT message to all nodes.
- If the coordinator receives AGREED from all nodes, it sends a COMMIT message to all nodes.
- If the coordinator receives ABORT from any node, it sends an ABORT message to all nodes.
- If the coordinator does not receive a response to a COMMIT message from a node, it re-sends the COMMIT message to that node.
- A node that receives a COMMIT message commits the transaction and replies with a COMMITTED message.
- The coordinator completes the commit after it has received a COMMITTED message from all nodes.
- A node that receives an ABORT message rolls back the transaction.
- This protocol is blocking, i.e., a node that has sent AGREED holds its commit-related resources until it receives the coordinator's COMMIT or ABORT message.
- We assume that all nodes recover after a crash, but this is not always the case. If the coordinator crashes after the nodes have agreed to commit but before it sends them the COMMIT message, the nodes will wait for the coordinator and hold their resources forever.
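
The message flow described above can be sketched as a small in-memory simulation. This is only an illustration under simplifying assumptions: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names, messages are modeled as direct method calls that always arrive, and the transaction log is a plain list rather than persistent storage. Timeouts, re-sends, and crash recovery are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical in-memory node; a real node would log to persistent storage."""
    name: str
    can_commit: bool = True          # assumption: stands in for a local commit check
    log: list = field(default_factory=list)
    state: str = "INIT"

    def on_commit_request(self, txn):
        # Record the request before voting, as the protocol requires.
        self.log.append(("COMMIT_REQUEST", txn))
        if self.can_commit:
            self.state = "AGREED"    # resources are held from this point on
            return "AGREED"
        self.state = "ABORTED"
        return "ABORT"

    def on_commit(self, txn):
        self.log.append(("COMMIT", txn))
        self.state = "COMMITTED"     # resources can be released
        return "COMMITTED"

    def on_abort(self, txn):
        self.log.append(("ABORT", txn))
        self.state = "ABORTED"       # roll back and release resources


def two_phase_commit(txn, participants):
    """Coordinator logic: collect votes, then broadcast the decision."""
    # Phase 1: send COMMIT_REQUEST to every node and collect the replies.
    votes = [p.on_commit_request(txn) for p in participants]

    # Phase 2: COMMIT only if every node agreed, otherwise ABORT everywhere.
    if all(vote == "AGREED" for vote in votes):
        acks = [p.on_commit(txn) for p in participants]
        if all(ack == "COMMITTED" for ack in acks):
            return "COMMITTED"
    for p in participants:
        p.on_abort(txn)
    return "ABORTED"


nodes = [Participant("a"), Participant("b")]
print(two_phase_commit("t1", nodes))   # COMMITTED

nodes = [Participant("a"), Participant("b", can_commit=False)]
print(two_phase_commit("t2", nodes))   # ABORTED
```

In the second run node "a" votes AGREED and is then told to abort, which is the all-or-nothing behavior the protocol is meant to guarantee. The blocking problem shows up if the coordinator stops between the two phases: node "a" would stay in the AGREED state with its resources held.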