Distributed Systems 101

So what is a distributed system?

  1. Storage = A setup where processing and storage are separated, and the storage is distributed across logical or physical boundaries. Say you run a query on your laptop that fetches data from several systems and then processes it in your machine's RAM (assuming the machine can accommodate the data). Here the storage is said to be distributed: it could live in different databases on the same system or on different systems (servers), and it is the responsibility of your application code to fetch that 'distributed data' from the various servers, transport it over the network, and bring it to your local system for computing.
  2. Compute = The alternate scenario is where the compute is distributed but the storage is not. How can computing something be 'distributed'? Say you have a master storage server holding all the data, which could run into GBs, TBs, or even PBs (with no replication; we are assuming the data is not distributed for this use case). If we have to compute something that involves scanning a major chunk of this data, we need a matching amount of processing power. So we 'distribute' the processing: we run several compute servers with relatively little storage, each of which works on a logical chunk of the data and returns its partial result to a single master server. The master collates the results, and the client connects to it to get the final answer.
  3. Storage and Compute = The third scenario is where both storage and compute are distributed, which is the most common case we see in practice. Take processing data with Spark. Say our data resides in AWS S3 or in HDFS on on-premises servers, and we write Spark code to compute something that solves a business use case. The fundamental data structure Spark exposes is the DataFrame; if you are a data engineer, you know it is a distributed collection of data. But again, what exactly is 'distributed' here? The data could be Parquet files stored in S3, which is itself distributed storage (with replication, and versioning if enabled on the bucket). When Spark loads a DataFrame, it simply knows which files to pick while forming logical partitions in memory on the servers where it will process them.
    The core idea in big data processing is taking the compute to where the storage is, to avoid data 'travelling' over the network, or at least to minimize the amount of data that needs to travel ('shuffle', in Spark's terminology).
    So in our example with both distributed storage and distributed compute, Spark loads data into memory (the JVM) on the servers where it will process those logical chunks, thereby distributing both the data and the computation. Spark follows lazy evaluation, loading only what is needed to compute: it 'figures out' from the user's code what is being loaded, what is filtered, which columns are worked on, and so on. No system should load all the data into a single JVM, since that defeats the purpose of distribution, and in most cases you simply cannot process it in a single JVM, server, or machine — hence the need for distributed systems.
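The first scenario — distributed storage with local compute — can be sketched in a few lines of plain Python. The "stores" below are just in-memory dicts standing in for databases on different servers, and `fetch` stands in for a network call; both are hypothetical names, not any real API.

```python
# Sketch of scenario 1: storage is distributed, compute is local.
# Each "store" stands in for a database on a different server; the
# application code fetches every shard over the "network" and
# combines the rows in local RAM before computing.

# Hypothetical shards of a users table, split across two servers.
store_a = {"users": [{"id": 1, "spend": 120}, {"id": 2, "spend": 80}]}
store_b = {"users": [{"id": 3, "spend": 200}]}

def fetch(store, table):
    """Stand-in for a network call to a remote database."""
    return store[table]

def local_total_spend(stores):
    # All the data travels to the local machine, then we compute.
    rows = [row for s in stores for row in fetch(s, "users")]
    return sum(row["spend"] for row in rows)

print(local_total_spend([store_a, store_b]))  # 400
```

Note that the total volume moved over the network here equals the full dataset — which is exactly the cost the later scenarios try to avoid.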
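The second scenario — distributed compute over a single data source — is a scatter/gather pattern. As a rough sketch (threads standing in for separate compute servers, a list standing in for the master storage server), it might look like:

```python
# Sketch of scenario 2: one big dataset, compute distributed.
# Threads stand in for separate compute servers; each scans only its
# logical chunk and returns a partial result, which the "master"
# collates at the end.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))  # pretend this lives on one storage server

def worker(chunk):
    """Each 'compute server' processes its chunk independently."""
    return sum(chunk)

def distributed_sum(data, n_workers=4):
    size = len(data) // n_workers
    chunks = [data[i * size:(i + 1) * size] for i in range(n_workers - 1)]
    chunks.append(data[(n_workers - 1) * size:])  # last chunk gets the remainder
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(worker, chunks)
    return sum(partials)  # the master collates the partial results

print(distributed_sum(data))  # 499999500000
```

The key property is that each worker only ever sees its own chunk; the master sees just the small partial results, never the raw data.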
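Spark's lazy evaluation can be mimicked with Python generators: transformations are chained without touching any data, and only the final action pulls rows through the pipeline. This is a conceptual sketch, not Spark's API — the partition lists below stand in for Parquet files on S3/HDFS, and `scan`/`where`/`select` are made-up names.

```python
# Sketch of Spark-style lazy evaluation using generators.
# Nothing is read until the final action; filters and column
# projections are chained lazily, so only the rows and columns the
# query needs are ever materialized.

# Hypothetical "partitions" (would be Parquet files on S3 or HDFS).
partitions = [
    [{"country": "IN", "amount": 10, "note": "x" * 1000}],
    [{"country": "US", "amount": 25, "note": "y" * 1000}],
    [{"country": "IN", "amount": 5,  "note": "z" * 1000}],
]

def scan(parts):
    for part in parts:  # walk partition by partition, lazily
        yield from part

def where(rows, pred):
    return (r for r in rows if pred(r))  # transformation: lazy

def select(rows, *cols):
    return ({c: r[c] for c in cols} for r in rows)  # projection: lazy

# Build the "plan" -- no partition has been scanned yet.
plan = select(where(scan(partitions), lambda r: r["country"] == "IN"),
              "amount")

# The action triggers the pipeline, touching only what it needs.
print(sum(r["amount"] for r in plan))  # 15
```

In real Spark, the plan additionally lets the optimizer push filters and column pruning down to the storage layer, so the large `note` column here would never even leave the Parquet files.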


All things Data| Lead Data Platform @ Razorpay | Ex-MongoDB

Harry Singh
