In this blog post, we will discuss how to establish end-to-end data lineage across SQL Server and Hadoop using DAG MetaCenter and Cloudera Navigator. Cloudera Navigator provides lineage within the Hadoop environment itself. However, if you want end-to-end data lineage (including non-Hadoop data sources), then you need to work with an enterprise metadata repository like DAG MetaCenter.

The components in this demo include the following:

  1. MetaCenter
  2. Cloudera Distribution for Apache Hadoop
  3. Cloudera Navigator
  4. MetaCenter ActiveLinx for Cloudera Navigator (the metadata adaptor)
  5. Microsoft SQL Server

We configured the MetaCenter ActiveLinx to read the metadata from Cloudera Navigator. As shown in Figure 1, we can now view the Hive tables and columns in MetaCenter.

Figure 1: Hive tables and columns in MetaCenter.

In Figure 2, we can also view the metadata relating to a Sqoop import job. MetaCenter also reads other Hadoop metadata pertaining to pig scripts, Hive queries, MapReduce jobs and Yarn jobs.

Figure 2: Sqoop import job in MetaCenter.

Finally, Figure 3 shows end-to-end data lineage across SQL Server, Sqoop and Hive. The lineage tells us that that the Product table from SQL Server was imported into Hive using the Sqoop import command. While performing the Sqoop import job, the import process will instantiate a MapReduce job fetches the Product table from SQL Server and stores the results in HDFS under /user/hive/warehouse/product.product as product.part-m-00000 and then will write the output into Hive’s default database.

Figure 3: End-to-end data lineage across SQL Server, Sqoop and Hive.