Distributed Systems — Sharding
This is the third story in distributed systems, you can read the others here.
Sharding is the basis of the distribution systems, without sharding DS can simply not exists (for the most part..). Sharding is one of the easiest and trickiest things to explain and understand. It is at the crux of NoSql databases.
When i interview candidates about No-Sql and DS, one of the things i ask in detail is sharding? Many people just blunt out that dividing data is sharding, although that is technically ok, but i can also divide data into multiple Rdbms tables , does that makes my rdbms database a no-sql db now?
some use the fancy terms like horizontally scaling, and if i ask why not vertically scale, they throw terms like backups , master/slave which can again, be done in a vertically scaled up server. So lets discuss ‘sharding’ in a very basic and simple term using databases and servers as examples.
So what exactly is sharding? well, the interviewee mentioned above was not totally wrong, it is indeed dividing but dividing what? and how? What to do after divide? and how will the processing be impacted?
Let's consider a working system that has a server that internally maintains a database to store the AADHAR (A Unique ID for a citizen by Govt of India, like the SSN in the USA) information. Now if we think ,it has a lot of information about the person such as his address, biometrics details, phone number (which again is used for OTP and Authentication purposes by a lot of Government and private entities), and other information.
Let us use an Oracle Database to store this information, we are not considering performance issues right now, we’ll talk about them later but since the above information can be stored in oracle, we see how it performs.
Potentially the no of AADHARS are above 1 Billion, so we have that many entries , that many addresses and so on.
One table could be just the name and AADHAR_ID and the phone number.

And other tables could be addresses, biometrics , with foreign key relationship on AADHAR in the primary table. But You see the problem right away, to search for an ID you have to scan the entire table which will have potential a billion entries.
Lets try sharding, What if we split the tables state wise, each state would just contain the data for its domicile. But we have to know before hand or figure out how aadhar numbers map to states.
Another option is to use the phone number since those belong to a calling circle and that information is available.
But both the above options requires some data mapping that needs to be known before hand. What if we use “AADHAR” number itself ?
The AADHAR number is a 12 digit number of the form 1234 1234 1234
If we just split the tables into the range 0000 0000 0000–9999 9999 9999 into N tables where N can be computed by studying a variety of numbers.
For sake of simplicity lets say we just divide into 5 tables.
Now if a query comes for aadhar no 1234 4321 1224 we know just by looking at the number “123443211224” that it will be in the first table and that just reduces our searching time by 1/5 already. There are other things like indexes etc but we are just considering very basic sharding here.
What we have essentially done is called RANGE-BASED SHARDING. But hey, it is still oracle,still and RDBMS, we haven’t even thought about NO SQL.we have just sharded the storage. But the problem is far from over.Now if millions of queries come and bombard the system, many of them will fail cause the system can take up a definite load only. We can queue , scale up the ram, have indexes, caching but as the load increase there will be latency and AADHAR is used for authentication where the OTP expires quickly so we have to do something here to improve the response time and avoid starvation etc.
We already have sharded the storage into N tables based on aadhars and our system knows which aadhars belong to which tables based on range, so why not just shift those tables to a different server equally capable (caching, indexes, queues) and have a master server just re direct the traffic based on aadhar number? Welcome to sharding the compute as well. Now we have horizontally split the compute and storage to different servers with a dedicated server just handling a range of aadhars and a range of queries.
Wat can we do more? If we see the address information that is not used very frequently digitally, also the biometric information of a person are accessed while applying for Mobile Sim/Mobile calling card, so it really makes sense to group the information together that is accessed together,which is what HBASE do(stores data based on column families ), so hbase (which is a no SQL db) can be used for this purpose , it basically does a range partioining, figuring out which info is stored in which region servers(what we just discussed above), plus it maintains versioning(which is configurable) to store changes to addresses, biometrics, personal information , correction of faults in information etc. There is also compaction for data storage.
We started from a RDBMS and ended up at a NO SQL database horizontally sharding and scaling for the solution.
But some might argue why use an HBASE , why not mongodb and have multiple indexes (on state for example) to group individuals based on state and leverage its uses. THERE IS NO SILVER BULLET. It all depends on requirements and functionalities that the system offers. Obviously we would need backups of our indivdaual range servers (master-slave for example) incase , our master region server goes down, and as we increase complexity other issues which we discussed earlier like leader-follower issues start coming into picture. we can just simply store all information as a document in a mongodb and have multiple indexing (from range to state etc) and sharding there. We just need to see which system best meets most of our requirements and which one is the best fit of all.
Hope it helps.