Distributed Systems 101

So what is a distributed system (DS)?

  1. Storage = The case where processing and storage are separated and your storage is distributed across logical or physical boundaries. Say you are running a query on your laptop that fetches data from different systems and then processes it in your machine's RAM (assuming the machine can accommodate the data). In this case the storage is said to be distributed. It could be spread across different databases on the same system or on different systems (servers), and it is the responsibility of your application code to fetch that 'distributed data' from those servers, transport it over the network, and bring it to your local system for computing.
  2. Compute = The alternate scenario, where the compute is distributed but the storage is not. How can computing something be 'distributed'? Consider a master storage server holding all the data, which could be gigabytes, terabytes, or even petabytes (with no replication; we assume the data is not distributed for this use case). If we have to compute something that involves scanning a major chunk of this data, we need a matching amount of processing power. So we distribute the processing: we can have several compute servers with relatively little storage, each of which works on a logical chunk of the data and returns its partial result to a single master server. The master collates the results, and the client connects to it to get the final answer.
  3. Storage and Compute = The third scenario, where both storage and compute are distributed, is the most common case we see in computer science. I'll talk about processing data with Spark. Say our data resides on AWS S3 or on HDFS on on-premise servers, and we write Spark code to compute something that solves a business use case. The fundamental data structure Spark uses is the DataFrame. If you are a data engineer you know what a DataFrame is: a distributed collection of data. But again, what exactly is 'distributed' here? The data could be Parquet files stored in S3, which is distributed storage (with replication, and versioning if enabled on the bucket), and when Spark loads a DataFrame into memory, it just knows which files to pick while forming logical partitions in memory on the servers where it will process them.
    The core idea in big data processing is taking compute to where the storage is, to avoid data 'travelling' over the network, or at least to minimise the amount of data that must travel (be 'shuffled', in Spark's terminology).
    So in our example with both distributed storage and distributed compute, Spark loads data into memory (the JVM) on the servers where it will process those logical chunks, thereby distributing both the data and the computation. Spark follows lazy evaluation to load only what is needed for the computation; it 'figures this out' from the user's code: what is being loaded, what is filtered, which columns are worked on, and so on. No system should load all the data into a single JVM, since that defeats the purpose of distribution, and in most cases you simply cannot process it on a single JVM, server, or machine; hence the need for distributed systems.
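The first scenario can be sketched in a few lines of Python. Here two in-memory SQLite databases stand in for two remote storage servers (the `sales` table and its rows are made up for illustration): the application code pulls the 'distributed data' from each store over what would normally be the network, then aggregates it locally in this machine's RAM.

```python
import sqlite3

# Hypothetical setup: two separate in-memory databases stand in for two
# remote storage servers. In a real deployment these would be network
# connections to different machines, not local handles.
def make_db(rows):
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

server_a = make_db([("north", 100), ("south", 250)])
server_b = make_db([("east", 300), ("west", 150)])

# The application code fetches the 'distributed data' from each store
# and does the computation locally.
rows = []
for conn in (server_a, server_b):
    rows.extend(conn.execute("SELECT region, amount FROM sales"))

total = sum(amount for _, amount in rows)
print(total)  # 800
```

Note that every row crosses the 'network' before any computation happens, which is exactly the cost the later scenarios try to avoid.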
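The second scenario (distributed compute over a single store) is the classic scatter/gather pattern. This is a minimal sketch using a thread pool, where each worker plays the role of a compute server scanning its own logical chunk; in a real cluster these would be separate machines, and `scan_chunk` is a made-up name for the per-server work.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker: each 'compute server' scans only its own logical
# chunk of the data and returns a small partial result.
def scan_chunk(chunk):
    return sum(chunk)

def master(data, n_workers=4):
    # The master splits the data into one logical chunk per server...
    size = max(1, -(-len(data) // n_workers))  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(scan_chunk, chunks))
    # ...and collates the partial results for the client.
    return sum(partials)

print(master(list(range(1, 101))))  # 5050
```

The key point is that only the small partial results travel back to the master, not the raw chunks themselves.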
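Spark's lazy evaluation can be illustrated (very loosely) with plain Python generators; this is a sketch of the idea, not Spark's API. Building the pipeline does no work, and only the rows and columns the final action demands are ever touched, which is how Spark can prune columns and push down filters from the user's code.

```python
# A lazy 'scan' over logical partitions: rows are produced on demand.
def scan(partitions):
    for part in partitions:
        for row in part:
            yield row

def where(rows, predicate):
    for row in rows:
        if predicate(row):
            yield row  # filtering, conceptually pushed toward the scan

def select(rows, columns):
    for row in rows:
        yield {c: row[c] for c in columns}  # column pruning

# Made-up data: three rows spread over two logical partitions.
partitions = [
    [{"region": "north", "amount": 100, "notes": "..."},
     {"region": "south", "amount": 250, "notes": "..."}],
    [{"region": "east", "amount": 300, "notes": "..."}],
]

# Composing the pipeline does no work yet (like Spark transformations);
# the final sum (the 'action') is what drives evaluation.
pipeline = select(where(scan(partitions), lambda r: r["amount"] > 150),
                  ["amount"])
print(sum(r["amount"] for r in pipeline))  # 550
```

In real Spark each partition would live in a JVM on the server holding that chunk of data, which is the "take compute to the storage" idea from above.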




Harry Singh
