Someone asked me this week: "Is it better to capture all the data you can now and decide what to do with it later, or to decide what specific questions you need to answer and develop a big data strategy?". My answer at the time was: "A bit of both, it's an iterative process". But I'll elaborate a bit more in this article.
As I've said before on this blog, designing reports should be done by starting at the end and working backwards. Start by asking: what are the business's objectives? What report would give us perfect insight into which course of action to take to achieve those objectives? What data do we need in order to build that report?
Having gone through that thought process, what you're left with is a clear definition of the questions you need to answer and therefore a clear set of data requirements. But the world doesn't stand still, and you'll always think of new questions you want to answer. So it's worth keeping as much data around as you can, just in case.
There's a clear divide, however, between the data you need to report on right now, and the data you may want to report on one day. And you need to work out how much data you're going to have in each category because choosing the wrong technology for the wrong type of data could cause a lot of technical debt and cost you a lot of money.
Small Amounts of Data That You Know What To Do With
If you know exactly what reports you want to run and the data amounts to less than 500 GB or so, then a relational database is usually the best option. You can structure the tables and indexes to enable quick retrieval, and it's a well-understood technology that you won't have any trouble supporting.
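As a minimal sketch of this idea (the "orders" table, its columns, and the report query are all hypothetical), the schema and index are designed around a question you already know you need to answer:

```python
import sqlite3

# Hypothetical orders table: schema and index chosen to serve a known report.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_orders_region ON orders (region)")  # matches the report's grouping
conn.executemany("INSERT INTO orders (region, amount) VALUES (?, ?)",
                 [("EU", 120.0), ("EU", 80.0), ("US", 200.0)])

# The report the table was built for: total revenue per region.
report = list(conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"))
print(report)  # [('EU', 200.0), ('US', 200.0)]
```

Because the queries are known up front, the indexes can be chosen to make them fast; that's exactly the trade the later categories give up.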
Small Amounts of Data That You Don't Know What To Do With
If you have only a few hundred gigabytes of raw data, but you have no idea what you might want to report on in the future, then OLAP cubes are worth looking into. OLAP cubes, like relational databases, are restricted to the size of a single server, but they give you a playground in which to quickly explore the data and keep experimenting until you've found something worth showing to others.
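The core trick an OLAP cube relies on is pre-aggregating a measure across every combination of dimension values, so that any slice becomes a lookup rather than a scan. A toy sketch of that idea in plain Python (the facts and dimensions are hypothetical):

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact rows: three dimensions plus one measure (sales).
facts = [
    {"year": 2012, "region": "EU", "product": "A", "sales": 10},
    {"year": 2012, "region": "US", "product": "A", "sales": 7},
    {"year": 2013, "region": "EU", "product": "B", "sales": 4},
]
dimensions = ["year", "region", "product"]

# Pre-aggregate at every level of detail; None stands for the "all" member.
cube = defaultdict(int)
for row in facts:
    for mask in product([True, False], repeat=len(dimensions)):
        key = tuple(row[d] if keep else None for d, keep in zip(dimensions, mask))
        cube[key] += row["sales"]

# "Slicing" is now a dictionary lookup: total EU sales across all years/products.
print(cube[(None, "EU", None)])  # 14
```

A real cube engine is vastly more sophisticated about storage, but this is why ad-hoc exploration is cheap once the cube is built.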
Large Amounts of Data That You Know What To Do With
Where data volumes exceed 1 TB, but you have a very clear idea of how that data is going to be queried, then large distributed data stores like Microsoft's Azure Table storage or Google's BigTable are ideal. Building your own sharded network of relational databases is also an option. These systems spread the data across multiple machines, but with a clear structure that makes retrieving the item you need possible within a few milliseconds. The downside to these systems is that accessing the data in a way you hadn't planned for can take hours or even days.
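The planned-versus-unplanned gap comes from how these stores address data by key. A rough sketch of the access pattern (the key layout and data are hypothetical, with a dictionary standing in for the distributed store):

```python
# Rows live under a (partition key, row key) pair; a lookup by key touches
# one partition, while a query the keys weren't designed for scans everything.
store = {
    ("customer-1", "2014-01-03"): {"amount": 50.0},
    ("customer-1", "2014-02-10"): {"amount": 20.0},
    ("customer-2", "2014-01-07"): {"amount": 99.0},
}

# Planned access path: direct key lookup, effectively constant time.
order = store[("customer-1", "2014-02-10")]
print(order["amount"])  # 20.0

# Unplanned question ("all orders over 40, regardless of customer"): full scan.
big_orders = sorted(k for k, v in store.items() if v["amount"] > 40.0)
print(big_orders)
```

At three rows the scan is instant; at a few billion rows spread over many machines, that second query is the one that takes hours.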
Large Amounts of Data That You Don't Know What To Do With
Distributed processing frameworks like Hadoop and document-oriented databases like CouchDB are good here, because they enable multiple servers to collaborate on storing and querying the data. This structure means the amount of time a query takes to run is dictated more by the number of servers you have collaborating and less by the size of the data you're holding. Queries will never be as quick as with a distributed table option because the data is not stored in the optimal format for any one type of query. But given that you don't know what queries you want to run yet, this is the best option.
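The pattern these systems share is map/reduce: each server computes a partial answer over its own slice of the data, and the partials are merged. A minimal single-process sketch (the shards and documents are hypothetical):

```python
from collections import Counter
from functools import reduce

# Two hypothetical shards, each held by a different server.
shards = [
    [{"quarter": "Q1"}, {"quarter": "Q2"}],  # server 1's documents
    [{"quarter": "Q1"}, {"quarter": "Q1"}],  # server 2's documents
]

def map_shard(docs):
    # Map phase: each server counts orders per quarter over its own shard.
    return Counter(doc["quarter"] for doc in docs)

partials = [map_shard(docs) for docs in shards]   # would run in parallel
totals = reduce(lambda a, b: a + b, partials)     # reduce phase: merge the counts

print(totals["Q1"])  # 3
```

Adding servers shrinks each shard and so shrinks the map phase, which is why query time tracks cluster size more than data size.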
While it's not necessary to plan for every question you'll ever have up front, by giving some thought to what questions you want to answer, how much data that requires and how much other data is there, you can arrive at a sensible strategy for handling that data.
Choosing the Right Tools
I take exception to the phrase "Big Data Strategy". It's like saying "Microsoft Excel Strategy" or "Screwdriver Strategy". It suggests to me you've decided on the tool you're going to use, now you just need to find the right problem.
For me, this is about finding the right mix of tools. A relational database with 1 TB of data in it is no fun to work with. A poorly thought out Azure Table is a massive pain. And explaining to your boss that you're writing a MapReduce job just to count the number of orders you had this quarter would make you look like a fool. So make sure you choose the right tools for each job.