On 3 April 2012, Instagram launched its Android app and promptly racked up 1 million downloads in less than 24 hours. From a marketing and business point of view, that sounds like an exciting prospect; for an engineer, adding a million new users in under 24 hours sounds like a major headache. That they pulled it off without any reported outages, and with a team of only 13 employees, makes the engineering achievement all the more impressive.
What kind of infrastructure did they have powering their service throughout all this buzz? Surely some sexy, distributed, BigData, NoSQL mega-cluster?
Nope. It was PostgreSQL.
Don't get me wrong, there was a lot of clever stuff going on and lots of technologies involved, but the data storage was done by a traditional relational database technology.
The problem with storing a lot of data in a relational database is that your database has to live on one server. When you hit the maximum capacity of a single server, you're stuck: there is no built-in way to combine the power of multiple servers. That's where sharding comes in. Sharding means finding a way to break your data up so you can store a portion of it on separate servers. It's exactly what the Encyclopaedia Britannica did when they had too many pages to bind into one book: they broke it down into volumes.
This is all very well, but you need some way of knowing which volume (or shard) any given piece of information is going to be in. The Encyclopaedia Britannica solved this problem by sorting their content alphabetically and then labelling each volume with the range of letters whose entries it contained. This is actually a very effective sharding strategy: it's simple, it creates fairly evenly sized shards, and because you generally know what word you're trying to look up, it doesn't require any trial and error to find the volume you're interested in.
When it comes to data storage, the key to making sharding work is to come up with a system like Encyclopaedia Britannica's. One that gives any client application a simple rule to follow for finding the server with the data they want, and which produces roughly evenly sized shards.
One common sharding technique, if most of your data is associated with someone's user account, is to make a rule that usernames starting with "a" live on the "a" server, usernames starting with "b" live on the "b" server, and so on. When someone wants to log in, they type in their username, you take the first letter, locate the right server, then load the data from there. Usernames are generally spread fairly evenly throughout the alphabet, so you get evenly sized shards, and you can store 26 times as much data before hitting the maximum server size limit.
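The routing rule above is simple enough to sketch in a few lines. Here's a minimal illustration in Python; the server hostnames and the shape of the lookup are made up for the example, not any real API:

```python
# Map each first letter to the server that holds those usernames.
# Hostnames here are purely illustrative.
SERVERS = {letter: f"db-{letter}.example.com"
           for letter in "abcdefghijklmnopqrstuvwxyz"}

def shard_for(username: str) -> str:
    """Return the server holding this user's data, based on the first letter."""
    first = username[0].lower()
    if first not in SERVERS:
        raise ValueError(f"no shard rule for username {username!r}")
    return SERVERS[first]
```

So `shard_for("alice")` gives `"db-a.example.com"`, and every client that follows the same rule finds the same server with no coordination needed.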
Sometimes a small central database is used to store the location of the server with the bulk of the data on it. Applications that want to access the data always start with the central database, then follow the pointer to the real data store. This can make data slower to find, and the central database can become a bottleneck, but it has the benefit of allowing you to relocate data later if one server starts to fill up.
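This directory approach can be sketched as follows. In a real system the directory would be a small central database table; here a plain dictionary stands in for it, and the function names are just for illustration:

```python
# A dict standing in for the small central "directory" database.
directory: dict[int, str] = {}

def assign(user_id: int, server: str) -> None:
    """Record which server holds this user's data."""
    directory[user_id] = server

def locate(user_id: int) -> str:
    """Every read pays one extra lookup against the central directory."""
    return directory[user_id]

def relocate(user_id: int, new_server: str) -> None:
    # After copying the user's rows to new_server, rebalancing is just
    # a pointer update - clients automatically follow the new location.
    directory[user_id] = new_server
```

The trade-off is exactly as described: an extra hop on every access, in exchange for being able to move data around by updating a single pointer.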
Instagram had a somewhat cleverer approach to sharding, but the principles are similar.
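One widely used refinement of these ideas, and roughly the shape of Instagram's scheme as they've described it publicly, is to split data into many small *logical* shards and then map those onto a handful of physical servers. A simplified sketch, with all numbers and server names invented for illustration:

```python
NUM_LOGICAL_SHARDS = 4096                      # many small logical shards
PHYSICAL_SERVERS = ["pg-0", "pg-1", "pg-2", "pg-3"]  # illustrative names

def logical_shard(user_id: int) -> int:
    """Deterministically assign each user to a logical shard."""
    return user_id % NUM_LOGICAL_SHARDS

def physical_server(user_id: int) -> str:
    # Each logical shard lives entirely on one physical server. To grow,
    # you move whole logical shards to new servers rather than
    # re-splitting all the data.
    return PHYSICAL_SERVERS[logical_shard(user_id) % len(PHYSICAL_SERVERS)]
```

Because the logical shard count is fixed and generous, adding capacity never changes which logical shard a user belongs to; only the logical-to-physical mapping moves.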
Is Sharding a Good Idea?
Sharding is useful if you have a lot of experience with a particular database technology and don't want to learn something new, or if you have a lot of code written to work with that technology and don't want to change it. If you use shards, you do potentially open yourself up to inconsistencies between the data on each server, something a single relational database normally prevents, so you might wonder whether it's better to just bite the bullet and learn a NoSQL platform. But with the right sharding strategy, sharding is extremely effective and well worth it.