Distributed Systems — Replication
This is the second story on the topic of DS. If you haven't read the first one, please have a look here.
Replication is at the core of Distributed Systems (DS). We all know what replication is: in its simplest form, it just means keeping a duplicate copy of the original so that if the original is destroyed, we can use the backup. But in DS replication is not that simple. In fact, it’s much more complicated, and it has to work for DS to ‘work’. In a system where storage, for example, is not distributed, we can assume that if the master server fails, the backup system will take over with all the data (generally up to date, since maintaining one replica is easier). In DS, things can go haywire.
So what complexity does DS have that makes replication hard?
1. Real-Time Sync- The replicas have to be in sync with the master. What does in-sync mean? Let's take a database as an example. In most databases, any write is generally recorded in a WAL (write-ahead log) first. Assuming the database storage is distributed over several storage servers, the data also needs to be replicated for availability, for scalability, and as a fallback in case of a server crash or failure. But for a fallback server to start serving API requests in place of the master, it needs to be in sync with the master.
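To make "in sync" concrete, here is a minimal Python sketch. It assumes a toy in-memory log, not the WAL format of any real database, and the names `Leader`, `Follower`, `catch_up`, and `is_in_sync` are made up for illustration. The idea it shows: the leader appends every write to its log before applying it, the follower replays that log in order, and the follower counts as in sync once it has replayed every record.

```python
class Leader:
    """Toy leader: every write is appended to the log before it is applied."""

    def __init__(self):
        self.wal = []    # ordered list of (key, value) records
        self.data = {}   # state built by applying the log

    def write(self, key, value):
        self.wal.append((key, value))  # 1. record the write in the log first
        self.data[key] = value         # 2. only then apply it to the data


class Follower:
    """Toy follower: rebuilds its state by replaying the leader's log in order."""

    def __init__(self):
        self.applied = 0  # number of log records replayed so far
        self.data = {}

    def catch_up(self, leader, max_records=None):
        pending = leader.wal[self.applied:]
        if max_records is not None:
            pending = pending[:max_records]
        for key, value in pending:
            self.data[key] = value
            self.applied += 1


def is_in_sync(leader, follower):
    # In this sketch, "in sync" means the follower has replayed every record.
    return follower.applied == len(leader.wal)


leader, follower = Leader(), Follower()
leader.write("room-101", "booked by alice")
leader.write("room-102", "booked by bob")

follower.catch_up(leader, max_records=1)
print(is_in_sync(leader, follower))  # False: the follower is one record behind

follower.catch_up(leader)
print(is_in_sync(leader, follower))  # True: every record has been replayed
```

Comparing log positions rather than whole datasets is what keeps the check cheap in this sketch; how real systems track replica positions varies by database.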
For example, consider an application serving hotel bookings to customers. As soon as a write occurs on the portion of the database held by the master server, the replica needs to be brought in sync, because if the master crashes before the write gets replicated, that booking is lost and the same room might be booked again for someone else.
So how do we keep replicas in sync in near real time? It is one of the tough problems in computer science.
Let's talk about real-time sync and the issues related to it. Consider 1 master and 2 replicas. As soon as you write to the master, you can declare the write finished and then copy the data to the replicas in the background. This is known as asynchronous writes.
This has the disadvantage that the replicas are not in sync in real time, and data can be lost if the primary node crashes before the copy completes.
What's the solution to the above problem? We can wait for the replicas to finish writing (synchronous writes), making the write call blocking. But this poses another problem: writes become too slow (and can block indefinitely if the follower replicas never respond), which causes a very bad client experience, and not all business use cases can afford this.
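The difference between the two write modes can be sketched in a few lines of Python. This is a toy model under stated assumptions: replicas are plain objects with an artificial `delay`, the master is just a list, and the names `Replica`, `write_async`, and `write_sync` are hypothetical. The `timeout` on the synchronous path is my addition, included to show one way of avoiding the indefinite blocking mentioned above.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


class Replica:
    def __init__(self, delay=0.0):
        self.delay = delay  # simulated network plus disk latency, in seconds
        self.records = []

    def apply(self, record):
        time.sleep(self.delay)
        self.records.append(record)
        return True


pool = ThreadPoolExecutor(max_workers=4)


def write_async(master, replicas, record):
    """Acknowledge right after the master write; replicate in the background."""
    master.append(record)
    return [pool.submit(r.apply, record) for r in replicas]


def write_sync(master, replicas, record, timeout=1.0):
    """Acknowledge only after every replica confirms, or give up on timeout."""
    master.append(record)
    futures = [pool.submit(r.apply, record) for r in replicas]
    done, not_done = wait(futures, timeout=timeout)
    if not_done:
        # The master already holds the record here; a real system would have
        # to decide whether to retry, roll back, or drop the slow replica.
        raise TimeoutError(f"{len(not_done)} replica(s) did not acknowledge in time")
    return True


master = []
replicas = [Replica(delay=0.05), Replica(delay=0.05)]

pending = write_async(master, replicas, "booking-1")  # returns right away
wait(pending)                                         # let the background copy finish

write_sync(master, replicas, "booking-2", timeout=1.0)  # returns after both replicas confirm
print([r.records for r in replicas])
```

With `write_async` the caller gets control back before any replica has the record, which is exactly the window in which a master crash loses data. With `write_sync` that window is closed, at the cost of every write taking at least as long as the slowest replica.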
2. Leader-Follower issues- In a typical replication setup there are leaders and followers. The leader is the node/server/master with which the client API interacts. In our database example, it is the server that accepts writes, and the followers are the servers that act as backup/failsafe/slaves.
Whether we follow synchronous or asynchronous writes, both come with a set of challenges that might cause replication to fail and DS to not work as expected.
If we adopt asynchronous replication, the write is declared finished as soon as the data is written to the leader, and the client gets the success flag. It is then the leader’s responsibility to propagate the writes to the followers. This causes issues like inconsistency within the application.
For example, in an application that updates football scores, two people using the same application can see different scores for a moment just after a goal is scored. Both users will see the same score after some time. In the IT world this is referred to as eventual consistency, and it is adopted by applications where a slight inconsistency, usually of milliseconds, is tolerable.
A second example is writing a comment on someone’s Facebook post. Sometimes you notice that a comment you just posted does not show up, and after you refresh a couple of times, voila! It’s right there. Again, eventual consistency.
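The football example can be reproduced with a small Python sketch. It assumes one leader and one lagging follower, with replication triggered by hand so the stale window is visible; `ScoreService` and its methods are hypothetical names, not part of any real library.

```python
from collections import deque


class ScoreService:
    """Toy leader with one lagging follower; replication is a manual step."""

    def __init__(self):
        self.leader = {}
        self.follower = {}
        self.pending = deque()  # writes acknowledged but not yet replicated

    def write(self, match, score):
        self.leader[match] = score
        self.pending.append((match, score))  # replicated later

    def read_from_leader(self, match):
        return self.leader.get(match)

    def read_from_follower(self, match):
        return self.follower.get(match)

    def replicate_pending(self):
        while self.pending:
            match, score = self.pending.popleft()
            self.follower[match] = score


svc = ScoreService()
svc.write("home-vs-away", "0-0")
svc.replicate_pending()

svc.write("home-vs-away", "1-0")               # a goal is scored
print(svc.read_from_leader("home-vs-away"))    # 1-0
print(svc.read_from_follower("home-vs-away"))  # 0-0 (stale)

svc.replicate_pending()
print(svc.read_from_follower("home-vs-away"))  # 1-0 (converged)
```

Two users whose reads land on different servers see different scores until the pending queue drains, after which both reads agree.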
Another issue arises if the leader fails. After “discovering” that the leader has failed, the system elects a new leader. If this new leader does not have the delta of writes that were not yet replicated, the system is in an inconsistent state, which violates the client's durability expectations.
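A small Python sketch shows what that delta looks like. It assumes each follower's log is a prefix of the failed leader's log, and it uses "pick the follower with the longest log" as one simple election heuristic, which is my assumption and not a strategy taken from this article. The function names are hypothetical.

```python
def elect_new_leader(follower_logs):
    """Pick the follower with the longest log, to lose as few writes as possible."""
    return max(follower_logs, key=lambda name: len(follower_logs[name]))


def unreplicated_writes(old_leader_log, new_leader_log):
    """Writes the old leader acknowledged that the new leader never received."""
    return old_leader_log[len(new_leader_log):]


old_leader_log = ["w1", "w2", "w3", "w4"]  # all four were acknowledged to clients
follower_logs = {"f1": ["w1", "w2", "w3"], "f2": ["w1", "w2"]}

new_leader = elect_new_leader(follower_logs)
lost = unreplicated_writes(old_leader_log, follower_logs[new_leader])
print(new_leader, lost)  # f1 ['w4']
```

Even with the best available follower promoted, `w4` was confirmed to a client and is now gone, which is the durability violation described above.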
On the synchronous replication front, all is not good and merry either. For the system to stay in a consistent state, the leader must wait on every write call for a follower to respond ok. If the follower crashes, or in a rare scenario even if the ok message is simply not received, the leader cannot know what the fault was; it will just wait. One way to avoid writing synchronously to all followers is to write synchronously to just one follower (keeping it consistent) while the rest of the followers receive updates asynchronously. This is also known as a semi-synchronous system.
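Here is a rough Python sketch of the semi-synchronous idea, under a few assumptions: followers are simulated with an artificial `delay`, the leader itself sends records to the asynchronous followers, and the leader bounds its wait with an `ack_timeout` instead of waiting forever. `Follower`, `SemiSyncLeader`, and `ack_timeout` are hypothetical names chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as AckTimeout


class Follower:
    def __init__(self, name, delay=0.0):
        self.name = name
        self.delay = delay  # simulated time to persist and acknowledge
        self.log = []
        # One worker per follower keeps its records in the order they were sent.
        self._inbox = ThreadPoolExecutor(max_workers=1)

    def _apply(self, record):
        time.sleep(self.delay)
        self.log.append(record)
        return "ok"

    def send(self, record):
        """Start replicating one record; the returned future resolves to the ack."""
        return self._inbox.submit(self._apply, record)


class SemiSyncLeader:
    def __init__(self, sync_follower, async_followers, ack_timeout=0.5):
        self.log = []
        self.sync_follower = sync_follower
        self.async_followers = async_followers
        self.ack_timeout = ack_timeout

    def write(self, record):
        self.log.append(record)
        # Fire-and-forget to the asynchronous followers.
        for follower in self.async_followers:
            follower.send(record)
        # Block only on the single synchronous follower, with an upper bound.
        ack = self.sync_follower.send(record)
        try:
            return ack.result(timeout=self.ack_timeout) == "ok"
        except AckTimeout:
            # No ack in time: the leader cannot tell a crash from a lost message.
            return False


leader = SemiSyncLeader(
    sync_follower=Follower("sync", delay=0.0),
    async_followers=[Follower("async-1", delay=0.1), Follower("async-2", delay=0.1)],
)
print(leader.write("booking-1"))  # True once the synchronous follower has acknowledged
```

A `False` return leaves the decision to the caller: retry, fail the write, or promote a different follower to the synchronous role. Which of these is right depends on the business use case.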
Apart from the above issues, there are issues like node outages, adding/removing followers, split-brain, and leader election strategies that make replication in distributed systems a complicated mechanism. I will write a separate article on solutions to the above issues and how leader-follower systems work with minimum failures and fallback mechanisms.
3. Geography- One aspect of distributed systems that is often overlooked is geography. How do we ensure that a user in Vietnam and one in Luxembourg who happen to be using the same application get the same real-time updates and user experience? To avoid lag, requests are often served from the nearest data center, governed by complex algorithms that decide which users are served how and from where. But to serve correct data, the systems themselves have to be in the correct state. This brings the challenge of replication among data centers, which has its own difficulties. Can we think leader-follower here? A leader among data centers and then a leader within each data center: you see, the problems discussed above just exploded to a new level, with lag and with signals and notifications sent across continents, but somehow distributed systems are making it work.
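As a heavily simplified illustration of "serving from the nearest data center", the Python sketch below picks the healthy data center with the lowest latency for a user. Real routing is far more involved; the function `pick_datacenter`, the data center names, and the latency numbers are all made up for this example.

```python
def pick_datacenter(latency_ms, healthy):
    """Return the healthy data center with the lowest latency for this user."""
    candidates = {dc: ms for dc, ms in latency_ms.items() if dc in healthy}
    if not candidates:
        raise RuntimeError("no healthy data center available")
    return min(candidates, key=candidates.get)


# Made-up round-trip times in milliseconds, for illustration only.
latency_from_vietnam = {"asia-dc": 30, "europe-dc": 210}
latency_from_luxembourg = {"asia-dc": 220, "europe-dc": 15}
healthy = {"asia-dc", "europe-dc"}

print(pick_datacenter(latency_from_vietnam, healthy))        # asia-dc
print(pick_datacenter(latency_from_luxembourg, healthy))     # europe-dc
print(pick_datacenter(latency_from_vietnam, {"europe-dc"}))  # europe-dc (fallback)
```

The two users end up on different data centers, so they only see the same data if replication between those data centers keeps up, which is the cross-continent problem described above.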
Replication is at the core of making distributed systems work by keeping copies, and it is one of the many crucial components that make distribution possible, apart from the actual distribution (of either storage or compute) itself.