ZooKeeper

“Because coordinating distributed systems is a Zoo”

Guarantees:


ZooKeeper’s read performance can scale nearly linearly with the number of servers, a consequence of the guarantees it chooses to provide.

Client requests are processed in FIFO order; for reads, this ordering is per client.

Let’s consider the distributed key-value store shown below which makes use of a Raft module in each replica.


In a Raft-based system, adding more servers will likely degrade performance. All reads must still go through the leader, and the leader must now coordinate with more followers on every write.

ZooKeeper, on the other hand, allows us to scale the performance of our system linearly by adding more servers. It does this by relaxing the definition of correctness and providing weaker guarantees for clients.

Reads can be served from any replica but writes are still sent to a leader. While this has the downside that reads may return stale data, it greatly improves the performance of reads in the system. ZooKeeper is a system designed for read-heavy workloads, and so the trade-off that leads to better read performance is worth it.
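This read/write split can be sketched as a toy model (an illustration of the trade-off, not ZooKeeper's actual code): writes go through a single leader and replicate asynchronously, so a read served by a lagging follower may return stale data.

```python
# Toy model: writes go to the leader, replication is asynchronous,
# and reads can be served by any replica -- possibly a stale one.

class Replica:
    def __init__(self):
        self.data = {}

class Cluster:
    def __init__(self, n):
        self.leader = Replica()
        self.followers = [Replica() for _ in range(n - 1)]

    def write(self, key, value):
        # All writes go to the leader first.
        self.leader.data[key] = value

    def replicate(self):
        # Asynchronous replication: followers catch up eventually.
        for f in self.followers:
            f.data = dict(self.leader.data)

    def read(self, key, follower=None):
        # Reads may be served by any replica, so they scale with cluster size.
        replica = self.leader if follower is None else self.followers[follower]
        return replica.data.get(key)

cluster = Cluster(3)
cluster.write("x", 1)
cluster.replicate()
cluster.write("x", 2)                    # not yet replicated
stale = cluster.read("x", follower=0)    # lagging follower still sees 1
fresh = cluster.read("x")                # leader sees 2
```

Adding followers adds read capacity without adding load on the leader, which is exactly why this trade-off pays off for read-heavy workloads.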


ZooKeeper vs. Raft


ZAB


Background

ZooKeeper in the Hadoop ecosystem


Motivation

ZooKeeper: designed to relieve developers from writing coordination logic code.

A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services

Highly Available

Tolerates the loss of a minority of ensemble members and still functions.

Another way of looking at this odd-numbered majority scheme: when the network partitions, at most one partition can contain a majority of the servers.

High Performance

Strictly ordered access

Basic Cluster Interactions


When one server goes down, clients see a disconnect event and reconnect themselves to another member of the quorum.

Another property of 2 * n + 1 servers is that any two majorities share at least one server. Because each majority contains at least n + 1 servers, any new majority must intersect the previous majority in at least one server.
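The overlap claim can be checked exhaustively for small ensembles, which is a nice sanity check on the pigeonhole argument above:

```python
# Exhaustively verify: with 2n + 1 servers, any two majorities
# (subsets of size >= n + 1) share at least one server.
from itertools import combinations

def majorities_always_overlap(n):
    servers = range(2 * n + 1)
    majority_size = n + 1
    return all(set(a) & set(b)          # non-empty intersection is truthy
               for a in combinations(servers, majority_size)
               for b in combinations(servers, majority_size))

result = all(majorities_always_overlap(n) for n in range(1, 5))
```

For every ensemble size from 3 to 9 servers, every pair of majorities overlaps, which is what allows a new majority to always learn the decisions of the old one.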

ZooKeeper data structure


File system analogy


Watches

The leader executes all write requests forwarded by followers and then broadcasts the resulting changes.
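A key detail of ZooKeeper watches is that they are one-shot triggers: a watch fires on the next change and must be re-registered if the client wants further notifications. A minimal sketch of that behavior (a simplification, not the real server):

```python
# Minimal sketch of one-shot watches: a watch set via get() fires once
# on the next set() for that path, then is discarded.

class ZNodeStore:
    def __init__(self):
        self.nodes = {}
        self.watches = {}   # path -> callbacks, each fired at most once

    def get(self, path, watch=None):
        if watch is not None:
            self.watches.setdefault(path, []).append(watch)
        return self.nodes.get(path)

    def set(self, path, value):
        self.nodes[path] = value
        # Fire and clear: watches are one-shot.
        for cb in self.watches.pop(path, []):
            cb(path, value)

events = []
store = ZNodeStore()
store.set("/config", b"v1")
store.get("/config", watch=lambda p, v: events.append((p, v)))
store.set("/config", b"v2")   # triggers the watch, which is then removed
store.set("/config", b"v3")   # no watch registered any more -> no event
```

The one-shot design keeps the server's bookkeeping cheap and pushes the re-registration decision to the client.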

API

Creation API

Get/Watch API

Other API

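The creation API's ephemeral and sequential flags can be illustrated with a toy server (a simplification: real ZooKeeper keeps a per-parent counter, while this sketch uses a single global one):

```python
# Toy creation API: create() accepts ZooKeeper-style ephemeral and
# sequential flags; sequential nodes get a monotonically increasing,
# zero-padded suffix appended by the server.

class ToyZK:
    def __init__(self):
        self.nodes = {}
        self.counter = 0    # simplification: one global counter

    def create(self, path, data=b"", ephemeral=False, sequential=False):
        if sequential:
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        self.nodes[path] = {"data": data, "ephemeral": ephemeral}
        return path         # the server returns the actual path created

zk = ToyZK()
first = zk.create("/locks/lock-", sequential=True, ephemeral=True)
second = zk.create("/locks/lock-", sequential=True, ephemeral=True)
```

Because the server assigns the suffix and returns the final path, clients can create uniquely named nodes without coordinating among themselves; this is the primitive the lock and election recipes below are built on.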

Implementation Details


The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
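The "serialize to disk before applying in memory" rule is classic write-ahead logging. A minimal sketch under simplified assumptions (JSON lines instead of ZooKeeper's binary transaction log, no snapshots):

```python
# Write-ahead logging sketch: every update is appended and fsync'd to a
# log file BEFORE the in-memory tree is modified, so a crash can be
# recovered by replaying the log from the start.
import json, os, tempfile

class WalStore:
    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}          # in-memory "data tree"
        self._replay()

    def _replay(self):
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]

    def write(self, key, value):
        # 1. Serialize to the log and force it to disk first...
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. ...only then apply to the in-memory database.
        self.data[key] = value

log = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalStore(log)
store.write("/app/config", "v1")
recovered = WalStore(log)   # simulate a restart: state rebuilt from the log
```

Since the log hits disk before the in-memory state changes, no acknowledged update can be lost to a crash.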

ZooKeeper uses a custom atomic messaging protocol. Since the messaging layer is atomic, ZooKeeper can guarantee that the local replicas never diverge. When the leader receives a write request, it calculates what the state of the system will be once the write is applied and transforms this into a transaction that captures the new state.

Uses of Zookeeper

ZooKeeper was not implemented to be a large datastore.


Discovery of hosts

A typical use case for ephemeral nodes is using ZooKeeper for discovery of hosts in your distributed system. Each server publishes its IP address in an ephemeral node; should a server lose connectivity with ZooKeeper and fail to reconnect within the session timeout, its information is deleted.
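The discovery pattern can be sketched with a toy registry (assumed paths and names, not a real client): each registration is tied to a session, and session expiry deletes everything that session created.

```python
# Toy host discovery with ephemeral nodes: registrations live only as
# long as the session that created them.

class Registry:
    def __init__(self):
        self.ephemerals = {}   # session_id -> {znode path: address}

    def register(self, session_id, path, address):
        self.ephemerals.setdefault(session_id, {})[path] = address

    def session_expired(self, session_id):
        # Session death deletes all of that session's ephemeral nodes.
        self.ephemerals.pop(session_id, None)

    def discover(self):
        return sorted(addr for nodes in self.ephemerals.values()
                      for addr in nodes.values())

reg = Registry()
reg.register("s1", "/services/api/host-a", "10.0.0.1:8080")
reg.register("s2", "/services/api/host-b", "10.0.0.2:8080")
reg.session_expired("s1")   # host-a lost connectivity past the timeout
```

Clients that also set a watch on the parent node get notified when the membership list changes, so stale hosts disappear from view automatically.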

Ephemeral nodes


Leader election

An easy way of doing leader election with ZooKeeper is to let every server publish its information in a zNode that is both sequential and ephemeral.

Then, whichever server has the lowest sequential zNode is the leader. If the leader (or any other server, for that matter) goes offline, its session dies, its ephemeral node is removed, and all other servers can observe who the new leader is.

If leader election instead used a single zNode that every server races to write (rather than the lowest sequential zNode), then when the leader dies ZooKeeper would send the notification to all servers, and all of them would try to write to ZooKeeper at the same time to become the new leader, creating a herd effect.

The same leader-election recipe can also be used to lock a resource.
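The recipe above can be sketched in a few lines (a single-process simulation, not a real ZooKeeper client): servers create ephemeral sequential znodes, the owner of the lowest one leads, and when its node vanishes the next-lowest takes over.

```python
# Leader election by lowest ephemeral sequential znode (simulated).

class Election:
    def __init__(self):
        self.seq = 0
        self.candidates = {}   # znode name -> server id

    def join(self, server_id):
        # Ephemeral sequential create: server assigns the suffix.
        name = f"/election/n-{self.seq:010d}"
        self.seq += 1
        self.candidates[name] = server_id
        return name

    def leader(self):
        # The owner of the lowest sequential znode is the leader.
        return self.candidates[min(self.candidates)]

    def leave(self, name):
        # Session death removes the ephemeral node.
        del self.candidates[name]

e = Election()
a = e.join("server-a")
b = e.join("server-b")
first_leader = e.leader()   # server-a owns the lowest znode
e.leave(a)                  # leader's session dies
next_leader = e.leader()    # server-b takes over
```

Because the sequence numbers define a total order, succession is unambiguous without any extra negotiation between the surviving servers.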

Building blocks


Avoiding the “Herd Effect”

If many clients are waiting for a lock, and one client releases it, you don’t want all the waiting clients to wake up at the same time and flood the ZooKeeper server with requests.
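The standard fix is that each waiting client watches only the znode immediately before its own in the sequence, so a release wakes exactly one client. A small sketch of that watch assignment (illustrative names):

```python
# Herd avoidance with sequential lock znodes: each waiter watches only
# its immediate predecessor; the lowest znode holds the lock.

def watch_targets(znodes):
    """Map each znode to the predecessor it should watch (None = holder)."""
    ordered = sorted(znodes)
    targets = {ordered[0]: None}        # lowest znode holds the lock
    for prev, cur in zip(ordered, ordered[1:]):
        targets[cur] = prev             # everyone else watches its predecessor
    return targets

targets = watch_targets(["lock-003", "lock-001", "lock-002"])
```

When a node is deleted, only its single watcher is notified; that client re-checks whether it now holds the lowest znode, so each release triggers one wake-up instead of a stampede.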


Message queue

Using watchers, one can implement a message queue: every client interested in a certain topic registers a watcher on that topic's zNode, and messages regarding that topic are broadcast to all clients by writing to that zNode.

Other use cases

https://zookeeper.apache.org/doc/r3.6.1/recipes.html

Ref: