We recently brought Waterline Data Science into the Information Asset Big Data Lab for hands-on testing. Waterline is a VC-funded startup run by some of my former IBM colleagues, including Alex Gorelik and Oliver Claude, so I was interested in their newly released product.
Waterline has positioned itself as the “Amazon of Big Data.” If you have thousands of files in your Hadoop environment, Waterline seeks to provide faceted search so you can find what you are looking for with minimal effort.
Full Profiling and Tag Propagation
We uploaded a few sample files into our Cloudera environment and then issued commands to profile them. As shown in Figure 1, Waterline executed MapReduce jobs in the background to profile the data and assign tags to the files.
Figure 1: Waterline executes MapReduce jobs to profile data and propagate tags in Cloudera.
Figure 2 shows the results for MOCK_Data.csv. Waterline produced data profiling results for each field in the file and also generated tags that could be used later for search.
Figure 2: Results of data profiling and tag propagation with Waterline.
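To make the idea of field-level profiling and tagging concrete, here is a minimal Python sketch. It is not Waterline's implementation, which runs as MapReduce jobs and whose internals we did not inspect. The function name `profile_csv`, the `TAG_PATTERNS` rules, the 80% match threshold, and the sample data are all assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical tag rules: a tag name mapped to a pattern its values should match.
TAG_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "integer": re.compile(r"^-?\d+$"),
}


def profile_csv(text, match_threshold=0.8):
    """Return per-field statistics and suggested tags for CSV text."""
    rows = list(csv.DictReader(io.StringIO(text)))
    fields = list(rows[0].keys()) if rows else []
    profiles = {}
    for field in fields:
        values = [row[field] for row in rows]
        non_empty = [v for v in values if v not in (None, "")]
        tags = []
        for tag, pattern in TAG_PATTERNS.items():
            matches = sum(1 for v in non_empty if pattern.match(v))
            # Suggest a tag when enough of the non-empty values match the rule.
            if non_empty and matches / len(non_empty) >= match_threshold:
                tags.append(tag)
        profiles[field] = {
            "count": len(values),
            "empty": len(values) - len(non_empty),
            "distinct": len(set(non_empty)),
            "tags": tags,
        }
    return profiles


# Made-up sample data, not the contents of MOCK_Data.csv.
SAMPLE = "id,email\n1,a@example.com\n2,b@example.com\n3,\n"
profiles = profile_csv(SAMPLE)
```

In this sketch, each field gets a row count, an empty-value count, a distinct-value count, and any tags whose pattern matches most of the values. A real profiler would add type inference, value distributions, and a much richer rule set.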
Waterline leverages Apache Lucene for search. As shown in Figure 3, the faceted search in Waterline allows users to retrieve data based on key facets such as keywords, source type, content type, file/table size, and last accessed.
Figure 3: Faceted search in Waterline.
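For readers unfamiliar with faceted search, the following plain-Python sketch shows the basic mechanics: filter a catalog by keyword and facet values, then count the remaining entries per facet. It does not use Lucene and is not how Waterline is built. The catalog entries, field names, tags, and the `search` and `facet_counts` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical catalog entries; field names and tags are illustrative only.
CATALOG = [
    {"name": "MOCK_Data.csv", "source_type": "hdfs", "content_type": "csv", "tags": ["email"]},
    {"name": "orders.json", "source_type": "hdfs", "content_type": "json", "tags": ["order_id"]},
    {"name": "customers", "source_type": "hive", "content_type": "table", "tags": ["email"]},
]


def search(catalog, keyword=None, **facets):
    """Return entries matching the keyword (in name or tags) and every facet value."""
    results = []
    for entry in catalog:
        if keyword:
            needle = keyword.lower()
            in_name = needle in entry["name"].lower()
            in_tags = needle in (tag.lower() for tag in entry["tags"])
            if not (in_name or in_tags):
                continue
        if all(entry.get(key) == value for key, value in facets.items()):
            results.append(entry)
    return results


def facet_counts(entries, facet):
    """Count how many entries fall under each value of the given facet."""
    return Counter(entry[facet] for entry in entries)


hits = search(CATALOG, keyword="email")
by_source = facet_counts(hits, "source_type")
```

The counts per facet value are what a search interface typically displays next to each filter, so a user can narrow thousands of files down by clicking through facets rather than guessing at file names.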
Our team found the product easy to use and especially liked the automated tag propagation feature. The first step in governing big (and small) data is building a data inventory, and Waterline fills that unmet need. We will be talking about this product to our growing stable of clients who are looking to govern big data.