Big Data, Murky Waters

deep dank murky lake
unorganized data will make
pray nothing shall break

This is the philosophy many organizations take when implementing a big data platform. It’s easy to say, “I’ll just dump all the data in the lake and worry about getting it out later.” And while a data lake is built for exactly that, without a little upfront organization, the developers and analysts who wade into its waters may have a hard time.

We recently ran into an issue in our data lake where we were using internal (managed) Hive tables and appending data via INSERT INTO statements. Unbeknownst to us, each INSERT INTO creates a new file in the table’s warehouse directory. We ended up with a Hive table backed by over two hundred thousand tiny files. Consequently, the table became unusable and put a strain on the HDFS NameNode.
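To see why the file count balloons, here is a minimal sketch that mimics the append pattern on a local directory. The directory and file names are made up for illustration; the point is simply that when every append writes a brand-new file, the file count grows linearly with the number of appends.

```python
import os
import tempfile
import uuid

# Hypothetical stand-in for a managed Hive table's warehouse directory.
table_dir = tempfile.mkdtemp(prefix="hive_table_")

def insert_into(rows):
    """Mimic one INSERT INTO on a managed table: the appended rows
    land in a brand-new tiny file rather than an existing one."""
    path = os.path.join(table_dir, f"part-{uuid.uuid4().hex}")
    with open(path, "w") as f:
        f.write("\n".join(rows))

# A thousand small appends leave a thousand tiny files behind
# (we hit two hundred thousand in production before noticing).
for i in range(1000):
    insert_into([f"row-{i}"])

print(len(os.listdir(table_dir)))
```

Each of those files costs NameNode memory regardless of how little data it holds, which is why the strain shows up long before the data volume looks large.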

Since then, we have cleaned up part of our data pipeline by converting datasets to a common format as soon as they land in the lake.

So what can you learn from us? Periodically concatenate tiny files to keep the NameNode footprint down, and use external Hive tables to limit the amount of duplicate data.
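As a sketch of what that periodic concatenation can look like, the function below compacts a local directory of plain-text part files into chunks near a target size (a common choice is the HDFS block size). This is an illustration only, and the names are ours; on a real lake you would more likely lean on Hive itself or a distributed job rather than a single-machine script.

```python
import os

def compact_tiny_files(table_dir, target_bytes=128 * 1024 * 1024):
    """Concatenate small files into larger ones of roughly target_bytes.
    Sketch only: works on a local directory of plain part files and
    assumes a format (e.g. delimited text) where byte concatenation
    is a valid merge."""
    files = sorted(
        f for f in os.listdir(table_dir) if not f.startswith("compacted-")
    )
    batch, batch_size, part, out_paths = [], 0, 0, []

    def flush():
        nonlocal batch, batch_size, part
        if not batch:
            return
        out = os.path.join(table_dir, f"compacted-{part:05d}")
        with open(out, "wb") as dst:
            for name in batch:
                src_path = os.path.join(table_dir, name)
                with open(src_path, "rb") as src:
                    dst.write(src.read())
                os.remove(src_path)  # the tiny file is now redundant
        out_paths.append(out)
        part += 1
        batch, batch_size = [], 0

    for name in files:
        size = os.path.getsize(os.path.join(table_dir, name))
        if batch and batch_size + size > target_bytes:
            flush()
        batch.append(name)
        batch_size += size
    flush()
    return out_paths
```

Run on a schedule, something in this spirit keeps the file count (and the NameNode’s memory bill) bounded while leaving the total data untouched.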

Take a little extra time and think about how you will organize the data. What file system layout or pattern will you employ? What compression algorithms will you use? How often should you compact smaller files with the same data type? What effect will the reorganization of data have on the pipeline or process?
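One concrete way to answer the layout question is to settle on a path convention before anything lands. The convention below is a hypothetical example, not a recommendation of ours specifically: partitioning by source, dataset, and ingest date keeps related files together and lets an external Hive table prune partitions instead of scanning everything.

```python
from datetime import date

def landing_path(source, dataset, ingest_date):
    """Build a hypothetical lake path of the form
    /lake/<source>/<dataset>/dt=YYYY-MM-DD/ so that a Hive
    external table partitioned on dt can prune by date."""
    return f"/lake/{source}/{dataset}/dt={ingest_date.isoformat()}"

print(landing_path("crm", "accounts", date(2017, 3, 1)))
```

Whatever pattern you pick matters less than picking one and enforcing it everywhere, so the answers to the other questions (compression, compaction cadence) can be applied uniformly.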

A little work upfront can save you a lot of headache downstream.

What steps have you taken to help organize your data lake?

© 2017 Lampo Licensing, LLC. All rights reserved.