As the title says, this is Distributed Systems 101, so I'm going to write a very basic introduction to what distributed systems are and how they work. I'll refer to them as DS.
So what is a DS?
Let's go at it word by word. The word "distributed" means spread across multiple places. So what exactly gets distributed? It could be anything depending on the use case, for example:
- Storage: a case where processing and storage are separated, and your storage is distributed across logical or physical boundaries. Say you run a query on your laptop that fetches data from different systems and then processes it in your system's RAM (assuming the RAM can accommodate the data). In this case the storage is said to be distributed. It could be spread across different databases on the same machine or on different machines (servers), and it is the responsibility of your application code to fetch that "distributed data" from those servers, transport it over the network, and bring it to your local system for computing.
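The distributed-storage case above can be sketched in a few lines. This is a toy illustration, not a real client: the two "servers" are simulated with in-memory dicts, where a real system would make network calls (JDBC, REST, etc.).

```python
# Hypothetical shards: user records split across two storage servers.
SERVER_A = {1: {"name": "alice", "spend": 120}, 2: {"name": "bob", "spend": 80}}
SERVER_B = {3: {"name": "carol", "spend": 200}}

def fetch_all(server):
    """Stand-in for a network fetch that returns every row on a server."""
    return list(server.values())

def total_spend():
    # The application code pulls the distributed data to the local machine
    # and does the whole computation here, in local RAM.
    rows = fetch_all(SERVER_A) + fetch_all(SERVER_B)
    return sum(row["spend"] for row in rows)

print(total_spend())  # 400
```

Note that all the data crosses the "network" to the local machine before any computing happens, which is exactly the cost the later scenarios try to avoid.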
- Compute: an alternate scenario where the compute is distributed but the storage is not. How can computation be "distributed"? Consider this: you have a master storage server holding all the data, which could be GBs, TBs, or even PBs (with no replication; we're assuming the data itself is not distributed for this use case). If we have to compute something that involves scanning a major chunk of this data, we need a matching amount of processing power. So we "distribute" the processing. How? We run several compute servers with relatively little storage, each working on a logical chunk of the data and returning its result to a single master server that collates the results; the client connects to this master and gets the final answer.
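This scatter-gather pattern can be sketched as below. It's a minimal simulation: the "workers" are threads on one machine standing in for separate compute servers, and the master splits the data into logical chunks, farms them out, and collates the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def worker_scan(chunk):
    """Each compute server scans only its own chunk and returns a partial sum."""
    return sum(chunk)

def master_compute(data, n_workers=4):
    # Split the master's data into logical chunks, one per worker.
    size = -(-len(data) // n_workers)  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # In a real DS each chunk would go to a different server over the network;
    # here a thread pool stands in for those servers.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(worker_scan, chunks)
    # The master collates the partial results into the final answer.
    return sum(partials)

print(master_compute(list(range(1_000_000))))  # 499999500000
```

The key property is that no single worker ever needs to hold or scan the full dataset; only small partial results flow back to the master.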
- Storage and Compute: a third scenario where both storage and compute are distributed, which is the most common setup we see in computer science. I'll talk about processing data using Spark. Say our data resides on AWS S3 or on HDFS on on-premise servers, and we write Spark code to compute something that solves a business use case. The fundamental data structure Spark exposes is the DataFrame. If you are a data engineer you know what a DataFrame is: a distributed collection of data. But again, what exactly is "distributed" here? The data could be Parquet files stored in S3, which is distributed storage (with replication, and versioning if it's switched on for the bucket), and when Spark loads a DataFrame, it just knows which files to pick while forming logical partitions in memory on the servers where it will process them.
The core idea in big data processing is taking compute to where the storage is, to avoid data "traveling" over the network, or at least to minimize the amount of data that has to travel (a "shuffle" in Spark's terminology).
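The locality principle can be sketched as a tiny scheduler. This is a toy model (the server names and file layout are made up, and real schedulers like Spark's are far more involved): each task is pinned to the server that already holds its file, so file contents never cross the network.

```python
# Hypothetical layout: data files spread across storage servers.
FILES_BY_SERVER = {
    "server-1": ["part-0000", "part-0001"],
    "server-2": ["part-0002"],
}

def schedule_tasks(layout):
    """Assign one task per file, pinned to the server holding that file."""
    tasks = []
    for server, files in layout.items():
        for f in files:
            # Compute goes to the data, not the other way around.
            tasks.append({"file": f, "runs_on": server})
    return tasks

tasks = schedule_tasks(FILES_BY_SERVER)
# Every task runs where its file lives; only small results travel back.
assert all(t["file"] in FILES_BY_SERVER[t["runs_on"]] for t in tasks)
print(len(tasks))  # 3
```

Only the per-task results (typically tiny compared to the scanned files) need to be shuffled or collated over the network.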
So in our example with both distributed storage and compute, Spark actually loads data into memory (the JVM heap) on the servers where it will process those logical chunks, thus distributing both data and compute during processing. Spark follows lazy evaluation, loading only what is needed for the computation; it "figures" this out from the user's code: what is being loaded, what is filtered, which columns are worked upon, etc. No system should load all the data into a single JVM, since that defeats the purpose of distributing, and in most cases you simply cannot process it all on a single JVM, server, or machine anyway; hence the need for DS.
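Lazy evaluation is easy to demonstrate in miniature. The class below is a toy, not Spark's actual implementation: `filter` and `select` are "transformations" that only record a plan, and nothing is scanned until the `collect` "action" runs, at which point only the filtered rows and selected column are materialized.

```python
class LazyFrame:
    def __init__(self, rows, ops=None):
        self._rows = rows          # pretend these are partitioned files
        self._ops = ops or []      # the recorded plan, not yet executed

    def filter(self, predicate):   # transformation: just extend the plan
        return LazyFrame(self._rows, self._ops + [("filter", predicate)])

    def select(self, column):      # transformation: just extend the plan
        return LazyFrame(self._rows, self._ops + [("select", column)])

    def collect(self):             # action: now the plan actually runs
        stream = iter(self._rows)
        for op, arg in self._ops:
            if op == "filter":
                stream = filter(arg, stream)
            elif op == "select":
                stream = map(lambda r, c=arg: r[c], stream)
        return list(stream)        # only the surviving values materialize

rows = [{"name": "a", "spend": 10}, {"name": "b", "spend": 50}]
df = LazyFrame(rows).filter(lambda r: r["spend"] > 20).select("name")
print(df.collect())  # ['b']
```

Because the plan is known before anything runs, an engine like Spark can push filters down to the scan and skip files, row groups, and columns that the result can never need.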
This is just the tip of the DS iceberg. I'll write more posts on how distributed systems work under the hood: how databases handle replication, how nodes figure out who is the leader and who is a follower, leader election strategies, etc.