This tutorial implements a leader election algorithm using the Apache ZooKeeper APIs.
Distributed processing and storage frameworks let us store, process and analyse big data on commodity hardware machines, referred to as nodes. However, these frameworks tend to have master/leader nodes that are responsible for coordination, maintaining metadata and ensuring read/write consistency.
Some of these frameworks, such as Apache Hadoop, Apache Storm and Apache Spark, draw a hard line between master/leader and slave/worker nodes by configuring them differently. Typically, the master/leader node holds the list of slave/worker nodes. This fixed assignment of the master/leader role limits failover: if the master node goes down, the whole cluster becomes unavailable. These frameworks do provide master node failover mechanisms, but those are limited.
On the other hand, there are frameworks that do not have differently configured leader/master nodes. Instead, all nodes are configured the same way and the leader is elected at run time. This provides better failover because the cluster does not depend on one or two nodes. If the leader node fails, the remaining nodes start the leader election process again and choose a new leader, so the cluster keeps working. However, the cluster is still not fully functional while the nodes are electing a leader.
Hence leader election is an important process for distributed frameworks. In this tutorial, we will implement a leader election algorithm using Apache ZooKeeper, a coordination service. If you do not have any programming experience with Apache ZooKeeper, you are strongly recommended to first follow this Apache ZooKeeper tutorial - Programming with Apache ZooKeeper.
Here are the steps followed by each process participating in the leader election.
To summarize, each node creates an ephemeral, sequential znode at startup under the persistent "/election" znode. Because the znode is sequential, ZooKeeper appends a unique sequence number to its name. The process then fetches all child znodes of "/election" and looks for the child with the smallest sequence number. If that child is the znode created by this process, the process declares itself leader by printing the message "I am the new leader".
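The leader check described above can be sketched in plain Java without a ZooKeeper connection. This is a minimal illustration, not the code from the repository: the class `LeaderCheck` and its methods are hypothetical names, and the child list is hard-coded where the real program would use the children returned by ZooKeeper.

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical helper (not part of the tutorial's repository) showing the leader check.
class LeaderCheck {

    // Extracts the numeric suffix of a sequential znode name, e.g. "p_0000000012" -> 12.
    static long sequenceOf(String znodeName) {
        return Long.parseLong(znodeName.substring(znodeName.lastIndexOf('_') + 1));
    }

    // A process is the leader when its own znode has the smallest sequence number.
    static boolean isLeader(String ownZnode, List<String> children) {
        String smallest = children.stream()
                .min(Comparator.comparingLong(LeaderCheck::sequenceOf))
                .orElseThrow(() -> new IllegalStateException("no child znodes found"));
        return smallest.equals(ownZnode);
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for the children of "/election".
        List<String> children = List.of("p_0000000013", "p_0000000012", "p_0000000014");
        if (isLeader("p_0000000012", children)) {
            System.out.println("I am the new leader!");
        }
    }
}
```

Comparing the parsed numbers instead of the raw strings keeps the check correct even if the name prefixes differ between processes.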
However, if the znode of the process does not have the smallest sequence number, the process sets a watch on the znode whose sequence number is the largest one below its own. For example, if the current process znode is "p_0000234" and the other process znodes are "p_0000123", "p_0000129", "p_0000223", "p_0000235", "p_0000245", then the current process sets the watch on the znode with path "p_0000223".
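The choice of the znode to watch can be sketched as a sort followed by a lookup of the immediate predecessor. The class `WatchTarget` and the method `znodeToWatch` are hypothetical names used only for this illustration. The sketch relies on the assumption that all names share the same prefix and zero-padded width, so that string order matches sequence order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical helper (not part of the tutorial's repository).
class WatchTarget {

    // Returns the znode just before ownZnode in sequence order,
    // or an empty Optional when ownZnode is the smallest (the leader).
    static Optional<String> znodeToWatch(String ownZnode, List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted); // zero-padded names sort in sequence order
        int index = sorted.indexOf(ownZnode);
        if (index < 0) {
            throw new IllegalArgumentException(ownZnode + " is not among the children");
        }
        return index == 0 ? Optional.empty() : Optional.of(sorted.get(index - 1));
    }

    public static void main(String[] args) {
        // The znode names from the example in the text.
        List<String> children = List.of("p_0000234", "p_0000123", "p_0000129",
                "p_0000223", "p_0000235", "p_0000245");
        // prints "p_0000223"
        System.out.println(znodeToWatch("p_0000234", children).orElse("none, I am the leader"));
    }
}
```

Watching only the immediate predecessor means that a single znode removal notifies exactly one process, instead of waking every participant.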
As soon as ZooKeeper removes the watched ephemeral znode because its process has shut down, the current process receives a WatchedEvent notification. It then fetches the child znodes of "/election" again and repeats the check of whether it is the leader.
To implement the leader election algorithm explained in the previous section, we have created the following three classes (please click on the heading to see the source code on GitHub) -
- ZooKeeperService.java - This class is responsible for interacting with the ZooKeeper cluster: connecting to the ZooKeeper service, creating, deleting and getting znodes, and setting watches on znodes.
- ProcessNode.java - This class represents a process in the leader election. It is implemented as a Runnable and carries out the leader election algorithm with the help of the ZooKeeperService class.
- LeaderElectionLauncher.java - This is the launcher class responsible for starting the ProcessNode thread. Because the Apache ZooKeeper client delivers watch event notifications on daemon threads, this class uses an ExecutorService to start ProcessNode so that the program doesn't exit after the ProcessNode thread has finished executing.
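The reason an ExecutorService keeps the program running is that the worker threads created by the default thread factory are non-daemon threads, and the JVM does not exit while a non-daemon thread is alive. The sketch below only probes that property. The class `LauncherSketch` is a hypothetical name and is not the launcher from the repository.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical probe (not part of the tutorial's repository).
class LauncherSketch {

    // Runs a task on a single-thread executor and reports whether the worker is a daemon thread.
    static boolean workerIsDaemon() {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Callable<Boolean> probe = () -> Thread.currentThread().isDaemon();
            return executor.submit(probe).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            // Shut down so this demo terminates. A launcher that wants to keep
            // receiving watch events would leave the executor running.
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        // prints "worker is daemon: false"
        System.out.println("worker is daemon: " + workerIsDaemon());
    }
}
```

If the executor were never shut down, its idle non-daemon worker would keep the JVM alive, which is the behaviour the launcher depends on.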
You can browse the complete code of the leader election implementation as a Maven Eclipse project in our GitHub repository here.
To quickly run the project, download the pre-built jar file from here.
To demonstrate the algorithm implementation, we will start the program several times with different process ids to emulate the different process nodes participating in the leader election. To start a process, open a command prompt on Windows or a terminal on Linux and type the following command, replacing host:port with the host and port of your ZooKeeper node -
>>java -jar leader-election-0.0.1-SNAPSHOT.jar 1 host:port
2015-06-15 15:41:37 INFO ProcessNode:67 - Process with id: 1 has started!
2015-06-15 15:41:37 DEBUG ProcessNode:92 - [Process: 1] Event received: WatchedEvent state:SyncConnected type:None path:null
2015-06-15 15:41:37 DEBUG ProcessNode:81 - [Process: 1] Process node created with path: /election/p_0000000012
2015-06-15 15:41:37 INFO ProcessNode:49 - [Process: 1] I am the new leader!
Since this is the only active process so far, it declares itself leader by printing the message 'I am the new leader'. Now let's start another process with id 2 by opening another command prompt/terminal window as shown below:
>>java -jar leader-election-0.0.1-SNAPSHOT.jar 2 host:port
2015-06-15 15:48:23 INFO ProcessNode:67 - Process with id: 2 has started!
2015-06-15 15:48:23 DEBUG ProcessNode:92 - [Process: 2] Event received: WatchedEvent state:SyncConnected type:None path:null
2015-06-15 15:48:23 DEBUG ProcessNode:81 - [Process: 2] Process node created with path: /election/p_0000000013
2015-06-15 15:48:23 INFO ProcessNode:57 - [Process: 2] - Setting watch on node with path: /election/p_0000000012
Notice that the second process has set the watch on the znode with path '/election/p_0000000012', which is the znode created by process 1. In other words, it is listening for events on the znode of process 1. Similarly, start two more processes in two more windows as shown below -
>>java -jar leader-election-0.0.1-SNAPSHOT.jar 3 host:port
2015-06-15 17:54:10 INFO ProcessNode:67 - Process with id: 3 has started!
2015-06-15 17:54:10 DEBUG ProcessNode:92 - [Process: 3] Event received: WatchedEvent state:SyncConnected type:None path:null
2015-06-15 17:54:10 DEBUG ProcessNode:81 - [Process: 3] Process node created with path: /election/p_0000000014
2015-06-15 17:54:10 INFO ProcessNode:57 - [Process: 3] - Setting watch on node with path: /election/p_0000000013

>>java -jar leader-election-0.0.1-SNAPSHOT.jar 4 host:port
2015-06-15 17:56:11 INFO ProcessNode:67 - Process with id: 4 has started!
2015-06-15 17:56:11 DEBUG ProcessNode:92 - [Process: 4] Event received: WatchedEvent state:SyncConnected type:None path:null
2015-06-15 17:56:11 DEBUG ProcessNode:81 - [Process: 4] Process node created with path: /election/p_0000000016
2015-06-15 17:56:11 INFO ProcessNode:57 - [Process: 4] - Setting watch on node with path: /election/p_0000000014
Now let's move to the interesting part and kill the leader process (the process with id 1). As soon as you kill it, the process with id 2 gets the event notification and declares itself the leader by printing the message 'I am the new leader' as shown below -
2015-06-15 17:57:58 DEBUG ProcessNode:92 - [Process: 2] Event received: WatchedEvent state:SyncConnected type:NodeDeleted path:/election/p_0000000012
2015-06-15 17:57:58 INFO ProcessNode:49 - [Process: 2] I am the new leader!
Now let's kill the process with id 3. Since the process with id 3 is not the leader, the process with id 4 does not claim leadership; instead it sets the watch on the znode ('/election/p_0000000013') created by the process with id 2.
2015-06-15 17:58:56 DEBUG ProcessNode:92 - [Process: 4] Event received: WatchedEvent state:SyncConnected type:NodeDeleted path:/election/p_0000000014
2015-06-15 17:58:56 INFO ProcessNode:57 - [Process: 4] - Setting watch on node with path: /election/p_0000000013
As shown above, all the processes detect the failure of other nodes and decide on the leader node among themselves.
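The two failures from the demonstration can be replayed in memory to see why one removal produces a new leader and the other only moves a watch. This is a simplified model under stated assumptions: the class `ElectionReplay` is a hypothetical name, a sorted set stands in for the children of "/election", and each znode is assumed to be watched only by its immediate successor.

```java
import java.util.TreeSet;

// Hypothetical in-memory model (not part of the tutorial's repository).
class ElectionReplay {

    // Sorted set standing in for the child znodes of "/election".
    private final TreeSet<String> znodes = new TreeSet<>();

    void join(String znode) {
        znodes.add(znode);
    }

    // Removes a znode, as ZooKeeper does when its owner dies, and returns
    // the reaction of the process that was watching it.
    String fail(String znode) {
        String watcher = znodes.higher(znode); // immediate successor holds the watch
        znodes.remove(znode);
        if (watcher == null) {
            return "nobody was watching " + znode;
        }
        String predecessor = znodes.lower(watcher);
        return predecessor == null
                ? watcher + ": I am the new leader!"
                : watcher + ": Setting watch on node with path: " + predecessor;
    }

    public static void main(String[] args) {
        ElectionReplay replay = new ElectionReplay();
        replay.join("/election/p_0000000012"); // process 1
        replay.join("/election/p_0000000013"); // process 2
        replay.join("/election/p_0000000014"); // process 3
        replay.join("/election/p_0000000016"); // process 4
        // Killing process 1: the owner of p_0000000013 becomes leader.
        System.out.println(replay.fail("/election/p_0000000012"));
        // Killing process 3: the owner of p_0000000016 re-targets its watch to p_0000000013.
        System.out.println(replay.fail("/election/p_0000000014"));
    }
}
```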
Thank you for reading through the tutorial. If you have any feedback, questions or concerns, please share them in the comments and we will get back to you as soon as possible.