Citizens in a democratic society usually vote to elect their leaders based on a pool of candidates. Distributed systems can choose a master node as well. Moreso , they can invoke “spot” elections in the situation of a leadership vacuum which helps in maintaining redundancy.

Introducing redundancy in services may introduce new problems of duplication and multi-operations. Leaders in a group of servers for redundancy will serve the business logic with the followers ready to take over leadership. Not a trivial problem with multiple systems that need to share state and gain consensus to elect the leader. Synchronization overhead and reduction of network trips are key considerations for this process.

Dealing with frameworks like Apache Spark on a daily basis makes you more aware of the inherent challenges stemming from a driver/worker architecture and the limited fault tolerance options when the master node goes down especially in high availability scenarios. Though several mitigation options exist (restarting with checkpoints, WALs etc)

As described here, Stable leader election ensures the leader remains that until any crashes irrespective of other behavior and its preferred they ensure minimal communication overhead, fault tolerant and ensure the leader is elected in constant time when the system is stable.

The Leader election process should guarantee that at most there is one leader at a time and eliminate the situation of multiple leaders being elected at the same time. This could be possible in network-partitioned systems where leaders being elected are unaware of each other. Zookeeper, Multi-paxos and Raft use temporary leaders to reduce number of messages required to reach an agreement.

Common Leader Election algorithms include the Bully algorithm or the Ring algorithm.


Bully Algorithm

Explained here

  • Each node gets a unique rank to help identify the leader
  • Elections begin when a node does not respond and the node that notices it first starts an election notifying the nodes greater than the fallen leaders rank.
  • The highest ranked responder takes over the process by responding to lower ranked nodes as well as a check to the fallen leader if any response.
  • If the node does not respond, there is a new leader elected.

This could be vulnerable to the multi-leader scenario if the nodes get split into different partitions. Also, the proclivity to the highest rank means unstable leaders can be elected which may mean frequent elections.

Here’s a quick barebones non-socket invocation of a 6-node election process.

Ring Algorithm

Explained here

  • Follows a ring topology and all nodes in the system form a ring
  • New election started when leader undergoes failure, with an alerting node notifying the next node on the ring.
  • Each node on the ring then contacts the next available node (higher ranked within the ring topology)
  • The node receives the responses back and confirms completeness of the traversal around the ring when it sees its own ID on the list and then picks the highest ID.

Apart from these, there are several other interesting implementations such as:

  • Next-in-line failover: Each elected leader provides a list of failover nodes and the leader failure starts a new election round.
  • Candidate/ordinary optimization: The nodes are split into candidate and ordinary – only one of the candidates can become a leader.
  • Invitation algorithm- Allows nodes to invite other nodes to join their groups instead of trying to outrank them

Alex Petrov’s book on Database Internals does a great job of explaining Election scenarios in a distribution context and has served as an inspiration for more detailed study in this topic.