Sep 14, 2013

A common-sense approach to Big Data

Big data is heavily hyped these days and everyone wants to get in on the fun. This post continues this blog's "Belaboring the Obvious" theme by describing a common-sense, hype-free approach for deciding whether you need to go there and how to get started if you do.

The main advice here is to take the steps in the order presented, particularly the first ones, to avoid some of the train wrecks I've seen at companies that have tried doing this the wrong way.

1- What is the business problem to be solved?

The first step is the most crucial because it provides both the compass that guides all the other steps and the financial fuel to get you there. Big data is neither easy nor cheap, and it requires far more effort and money than the hype would lead you to believe.

Start by documenting the business problem you hope to solve and by getting official buy-in from the people you expect to cover the cost. One business problem is enough to get started, and it will help you stay focused; more can always be added once you succeed with the first one.

2- What data can we use to solve it?

With the goal decided and approved, the search for data that can support it can begin. This often involves getting permission to use the data. If it is privacy-sensitive, custom development may be needed to anonymize it. You'll also need a way of regularly moving the data from the collection environment to the processing environment.
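
As a deliberately simple illustration of the kind of custom development this step tends to require, here is a minimal Python sketch that pseudonymizes a sensitive identifier column while copying a CSV file out of the collection environment. The file names, column name and salt are placeholders you would replace with your own.

    import csv
    import hashlib

    SALT = b"replace-with-a-secret-salt"   # keep the salt out of the processing environment

    def pseudonymize(value):
        """Replace an identifier with a salted, one-way hash."""
        return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()

    with open("raw_events.csv", newline="") as src, \
         open("anonymized_events.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["customer_id"] = pseudonymize(row["customer_id"])  # hypothetical sensitive column
            writer.writerow(row)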

3- Is the data set truly large?

This is a critical question, and the answer is not necessarily what the big data hype implies. The boundary is not fixed, but for current purposes we'll define large as "bigger than most computers can handle easily", typically a few tens of gigabytes. If your data set is smaller than this, or can be reduced to this size through filtering and cleaning, and is not likely to grow beyond this limit during the lifetime of your project, count yourself lucky: you have a small data problem and can start learning to swim in the shallow end of the pool.
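
Answering the question can be as mundane as adding up what's on disk. A rough sketch, assuming the raw data lives under a single directory (the path and the threshold below are placeholders):

    import os

    THRESHOLD_GB = 50   # "a few tens of gigabytes"

    def total_size_gb(root):
        total = 0
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                total += os.path.getsize(os.path.join(dirpath, name))
        return total / 1e9

    size = total_size_gb("/data/raw")
    print("Raw data: %.1f GB -> %s data problem"
          % (size, "large" if size > THRESHOLD_GB else "small"))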

4- Choose suitable tools; build as required

There are an overwhelming number of ready-to-use tools for both small and large data, so you'll rarely need to build your own advanced analysis logic. You'll still need lots of custom coding, but most of it will be "glue" code for anonymizing sensitive data, converting your data into the formats the off-the-shelf tools require, transferring it between machines, integrating it with other data sources, and so forth.
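
Most of that glue is as unglamorous as the following sketch, which reshapes tab-separated log lines into the flat CSV that most off-the-shelf tools expect. The log layout (timestamp, user, action) and the file names are made up for illustration:

    import csv

    with open("app.log") as src, open("app_events.csv", "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["timestamp", "user", "action"])
        for line in src:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 3:          # skip malformed lines rather than crash
                writer.writerow(parts)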

For small data that is likely to remain small for the lifetime of the project, avoid the heavily hyped but less mature cluster-based tools like Hadoop and focus instead on the much older data mining environments like R, RapidMiner, Knime, Weka and others. R seems to be the oldest, most complete and most heavily used (at least in academic circles), but it is the least graphical of the lot. They all support the common data analysis tasks, or can be extended to do so via plugins, and they have the considerable advantage of running on ordinary file systems, which are downright trouble-free.
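
The appeal of this end of the pool is that an entire analysis is just one short script against ordinary files. Sketched here in Python with the pandas library rather than one of the environments above, and with a hypothetical input file and column names, a first pass can be as small as this:

    import pandas as pd

    events = pd.read_csv("anonymized_events.csv")           # hypothetical input file
    print(events.describe(include="all"))                   # quick overview of every column
    per_day = events.groupby(events["timestamp"].str[:10]).size()   # assumes ISO timestamps
    per_day.to_csv("events_per_day.csv")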

If your data set is still small but will grow too large for an ordinary file system within your project's lifetime, you'll need to design your project around a distributed (aka "clustered") file system like Hadoop's HDFS. This doesn't mean you need to dive right into the deep end by installing your own distributed cluster. Hadoop supports a "local" mode that lets tutorials run on a single PC, often downloading a data set to demonstrate the tutorial's features. The tutorials are indispensable for learning the many parts of the Hadoop ecosystem: Avro, Cassandra, Chukwa, HBase, Hive, Mahout, Pig, ZooKeeper, Impala and Accumulo are just a few. These roughly approximate the capabilities of the non-distributed toolkits but, as a rule, aren't nearly as easy to use or as comprehensive.
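
To get a feel for how the cluster tools differ, consider Hadoop Streaming, where an analysis is split into a mapper and a reducer that read and write plain text on stdin and stdout. The classic word-count pair below is a sketch you can exercise on a single PC with an ordinary shell pipeline (cat some_text.txt | python wordcount.py map | sort | python wordcount.py reduce), which imitates the map-sort-reduce flow a real cluster performs at scale:

    import sys

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin.
        for line in sys.stdin:
            for word in line.split():
                print("%s\t1" % word)

    def reducer():
        # Input arrives sorted by word, so all counts for a word are adjacent.
        current, count = None, 0
        for line in sys.stdin:
            word, n = line.rstrip("\n").split("\t")
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, count))
                current, count = word, 0
            count += int(n)
        if current is not None:
            print("%s\t%d" % (current, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Handing the same two commands to a Hadoop Streaming job is conceptually the same thing, just spread across many machines.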

If your data set is truly large, you have no option but to dive right in by building your own distributed processing cluster. The easiest way I know of is to use a virtual hosting service like Amazon EC2 and use that to host a Hadoop distribution from Cloudera or perhaps HortonWorks (I've no direct experience with the latter). I recommend avoiding Apache's own distributions, which are difficult to install and hard to understand without considerable experience. Cloudera invests heavily in testing and integrating the fast-changing parts of the Hadoop ecosystem, and their distribution includes Cloudera Manager, which comes remarkably close to providing a trouble-free install experience given the complexity of getting the numerous Hadoop components to work right in combination. My main advice here is to avoid experimenting with any options you don't yet fully understand: Cloudera Manager is not entirely reliable at undoing experimental changes, so it is still all too easy to wind up with an unusable cluster that you won't be able to repair short of starting over from scratch (and losing all your data).

5- Evaluate, replan and adjust

From there it's simply wash, rinse and repeat: evaluate the results against the business problem from step 1, adjust your plan and your data as needed, and go around again.