Why You Should Query Your Event Logs in Hadoop

February 21, 2014

If you're developing an application or website, the chances are that you have some form of event log recording what's happening, when, and who triggered it.

Logs are great for telling you how much the various parts or features of your system are being used. They are great at telling you which users are more active than others and are often good for figuring out how different categories of user are using the system or what the conversion rates of different types of user are.

Sometimes these event logs are stored in a relational database, but mostly they are just dumped into a text file on a server. Unfortunately, neither of those is an ideal choice.

What's So Bad About Relational Databases?

Log data by its very nature contains a variety of different types of events, with each event having different attributes that relate to it. When somebody opens a particular page you're going to want to store what page it was in the log, but when a user does a search you'll want to record that search text instead of a page number. So each event has different properties that need to be captured. If you were going to design a database structure to efficiently store all those different events you'd have to separate out each type of event into its own table. That would be a huge amount of work and every time someone wanted to start logging a new event, someone would have to design a new table in which to store it. That's probably why I've never seen any system built this way.
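To make the mismatch concrete, here is a minimal sketch of two events with different attributes and what happens when they are forced into a single set of columns. The event names and fields (page, search_text and so on) are made up for illustration and are not taken from any real system.

```python
import json

# Two hypothetical events. The field names are illustrative assumptions.
events = [
    {"event": "page_view", "user": "alice", "time": "2014-02-21T09:15:00", "page": 42},
    {"event": "search", "user": "bob", "time": "2014-02-21T09:16:30", "search_text": "hadoop streaming"},
]


def union_of_columns(events):
    """Return every attribute name that appears in any event, sorted."""
    columns = set()
    for event in events:
        columns.update(event)
    return sorted(columns)


def to_wide_rows(events):
    """Force every event into the same set of columns, padding gaps with None."""
    columns = union_of_columns(events)
    return [{column: event.get(column) for column in columns} for event in events]


if __name__ == "__main__":
    for row in to_wide_rows(events):
        print(json.dumps(row))
```

Every new event type adds columns that are empty for all the other types, which is the choice a single table forces on you: a wide, sparse table or a new table per event.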

Let's assume you're throwing every type of event into one table. That table will be relatively unstructured, it will certainly grow quite big over time, and you'll probably want to avoid adding indexes to it so that inserts stay fast. In other words, it'll be a nightmare to query. Your database may only be a few tens of GB in size, but digging through that table for the information you're interested in will be slow and painful. Finally, storing many GBs of data in a relational database table can be quite expensive. Microsoft Azure SQL Databases are priced at up to $3.996 per GB whereas storing simple file data on Azure costs up to $0.12 per GB. That's more than 30 times cheaper.

What's So Bad About Log Files?

If relational databases are no good for log data, what about simply storing it in files? There's much to be said for file based logs. It's a simple, no-nonsense approach. It's cheap. And querying the data can be done using tools like grep, awk, sed, perl, powershell or python. But this is valuable data, so you're going to want to back up your logs. You may also find that running queries on a single machine starts to take a long time, so it would be helpful to be able to throw more computing power at the task, perhaps by slicing up your files and getting each machine in a cluster to query just one small section of the data before aggregating the results at the end.
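The slice-and-aggregate idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the log format (tab-separated timestamp, user and event name) is hypothetical, and the chunks are processed one after another here, whereas in a cluster each chunk would be handed to a different machine.

```python
from collections import Counter

# Hypothetical log lines: timestamp <TAB> user <TAB> event name.
SAMPLE_LOG = [
    "2014-02-21T09:15:00\talice\tpage_view",
    "2014-02-21T09:16:30\tbob\tsearch",
    "2014-02-21T09:17:10\talice\tpage_view",
    "2014-02-21T09:18:45\tcarol\tsearch",
    "2014-02-21T09:19:05\tbob\tpage_view",
]


def split_into_chunks(lines, n_chunks):
    """Slice the log into at most n_chunks pieces of roughly equal size."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def count_events(chunk):
    """Count event names within a single chunk (the work one machine would do)."""
    counts = Counter()
    for line in chunk:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # skip malformed lines
            counts[fields[2]] += 1
    return counts


def merge_counts(partial_counts):
    """Aggregate the per-chunk results into one overall total."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    chunks = split_into_chunks(SAMPLE_LOG, 2)
    print(merge_counts(count_events(chunk) for chunk in chunks))
```

Because each chunk is counted independently and the partial counts are simply added together, the final answer does not depend on how the log was sliced, which is what makes this kind of query easy to spread across machines.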

In other words you want a distributed backup and query capability, which is exactly what Hadoop has to offer. Storing log data in Hadoop ensures that data is kept safe and provides a platform for coordinating the processing power of multiple machines when running a query.

Personally, I'm a big fan of the cloud based Hadoop offerings such as Microsoft Azure's HDInsight or Amazon's Elastic MapReduce. The best thing about these products is that data can be stored in Azure Blob Storage or Amazon's S3 storage service and the computing power is only turned on when a query needs to be run. This keeps log storage simple and cheap, while providing a mechanism to bring lots of computing power to bear on those logs when a query needs to be run. Finally, queries can be written in almost any language thanks to Hadoop's Streaming service, which will run any executable program or script that reads its input from standard input and writes its results to standard output.
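As a rough sketch of what a Hadoop Streaming query can look like, the script below counts events per user. It assumes a hypothetical tab-separated log format (timestamp, user, event name), and the map and reduce command-line arguments are this sketch's own convention, not part of Hadoop. Streaming feeds log lines to the mapper on standard input, sorts the mapper's tab-separated key/value output by key, and feeds the sorted lines to the reducer.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit 'user<TAB>1' for every well-formed log line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:  # assumed format: timestamp, user, event name
            yield f"{fields[1]}\t1"


def reducer(lines):
    """Sum the counts for each user. Input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for user, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{user}\t{sum(int(value) for _, value in group)}"


def main(argv):
    # 'map' and 'reduce' are this sketch's own arguments, chosen for illustration.
    role = argv[1] if len(argv) > 1 else None
    if role == "map":
        output = mapper(sys.stdin)
    elif role == "reduce":
        output = reducer(sys.stdin)
    else:
        return
    for line in output:
        print(line)


if __name__ == "__main__":
    main(sys.argv)
```

Saved under a name of your choosing, the same file could be passed to the streaming jar's -mapper and -reducer options with the map or reduce argument. It can also be tried locally without a cluster by piping a log file through the map step, sort, and then the reduce step.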

Logging as much as you can in your applications and then keeping that data around forever, in case you want to refer back to it, is a really good idea. But a little bit of thought up front about how you're going to store it, back it up and query it later on will make your life a lot easier when you do eventually come to use it. You could do a lot worse than daily log files, a cloud based storage service, and a Hadoop cluster when, if ever, you want to go digging.