Big Data Doesn't Have to Be Unstructured Data

March 11, 2014


I recently had a brief to design a system for storing about 2 billion records that would be created during a simulation process. Each one would have two decimal numbers, a time stamp and a bit of metadata. The data store needed to support a range of queries based on any combination of the 6 meta data properties on each record, and would ideally be able to aggregate 10s of 1000s of records as part of each query, as users would be doing analysis of the output from each simulation.

We had done something similar (but smaller) in the past and had used an SQL Server database to hold the output. We looked at how much space each record was taking up in that system and it came to about 0.25 KB. If you do the maths, our predicted output of 2 billion rows comes to around 550 GB. In one database table.

When we considered all the other bits of application data we would be needing, our database was starting to look like it would be 650 GB on day one of operation.

Too Big for a Relational Database?

Now, 650 GB in one SQL Server database is perfectly do-able. But things do start to get non-trivial as you get above that size. Making your database fast, able to cope with server failures and adequately backed up will require a lot of disk space. Building indexes on your tables takes a long time, taking backups takes a long time, running queries can start to take longer than you'd like, and when they do, your only option is to start spending some serious money beefing up your one database server.

When we considered all the angles, we felt that looking after a relational database that big would be quite expensive. What's more, the data would never be updated, just deleted and regenerated over night. So the guarantee of consistency that a relational data gave us was mostly irrelevant. Fortunately, some of the NoSQL stores that are available now offer an alternative based on bringing many cheap commodity servers together to spread the load, rather than scaling up a single server. I started to consider what NoSQL store might work best for our particular requirements.

Structured NoSQL Stores

Hadoop is probably the most established NoSQL store out there, but in this case it's not the most appropriate. Our data is made of lots of small records and Hadoop doesn't handle lots of small files all that well. What's more, all of our data conforms to the same very rigid structure, something Hadoop won't be able to take advantage of when running a query. It could be made to work, for sure, but I began looking into possible alternatives.

Microsoft Azure's Table data store could easily handle the volume we were looking at, but getting the data out quickly depends on you storing the data in the ideal structure for your query. If you have more than one type of query you want to run, then you either have to store the same data twice but structured differently, or accept poor performance. For that reason Azure Table was not ideal.

Google's BigTable supports "indexes", meaning all of the queries you need to run can be made to run quickly. Unfortunately the only way to use BigTable is to run your app in Google's AppEngine so that's not ideal either.

HBase is a data store for structured data built on top of Hadoop and shares many of same properties as Azure Table. It does also have some support for indexes, although they're not quite as baked in as it might be. So while HBase would probably work, it's not the best we can do.

The Cassandra data store is just the sort of thing we're looking for. It is designed for storing structured data. It supports indexes meaning we can run fast queries based on our various bits of meta data and it doesn't tie us into one vendors infrastructure. The only problem was that Cassandra doesn't offer any help with aggregating the data that you're querying. So working out the sum of all records that share the same meta data properties would have to be done within our app if we used Cassandra.

The one data store that really caught my eye during this process was Elasticsearch. It has everything we were after, scalability and resilience on commodity hardware, indexes, and aggregations. As an alternative to managing large scale relational databases full of structured time series data, Elasticsearch looks great.

There's So Much Choice

In recent years, the NoSQL movement has delivered a massive range of new technologies. While a relational database is usually a good default option for any system, once you get to scale, choosing the right technology depends not just on how much data you have but also on the type of data you've got an the type of queries you need to run. Relational database with billions of rows in a single table require a lot of careful thought and performance tuning, but there are alternatives out there which can handle Big and Structured data well.