The Information Asset team has been working with Cloudera Navigator and Collibra. Cloudera Navigator provides rich Hadoop metadata around artifacts like Hive tables and Sqoop jobs. Collibra provides tooling to govern these data artifacts. In this blog post, we discuss how we imported the metadata from Cloudera Navigator into Collibra so that it can be governed appropriately.
In Figure 1, we show a list of Hive tables in Cloudera Navigator.
Figure 1: Hive tables in Cloudera Navigator.
We updated the Collibra metamodel to create a custom data asset called “Hive Table,” as shown in Figure 2.
Figure 2: Custom data asset called Hive Table in Collibra.
Figure 3 shows a sample of the metadata that is exposed by Cloudera Navigator in JSON format.
Figure 3: Sample Cloudera Navigator metadata in JSON format.
We used a Python script to parse the JSON data, as shown in Figure 4.
Figure 4: Python script to parse Cloudera Navigator JSON data.
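For readers who cannot see the figure, here is a minimal sketch of what such a parsing step might look like. It is not the script from Figure 4. The field names (`sourceType`, `type`, `originalName`, `parentPath`, `owner`) are assumptions about the Navigator JSON layout, the function name `extract_hive_tables` is hypothetical, and the sample records are made up for illustration.

```python
import json


def extract_hive_tables(raw_json):
    """Return name, database, and owner for each Hive table entity."""
    entities = json.loads(raw_json)
    tables = []
    for entity in entities:
        # Assumed field names: "sourceType" and "type" identify Hive tables.
        if entity.get("sourceType") != "HIVE" or entity.get("type") != "TABLE":
            continue
        tables.append({
            "name": entity.get("originalName"),
            # Assumption: the parent path holds the database name, e.g. "/sales".
            "database": (entity.get("parentPath") or "").strip("/"),
            "owner": entity.get("owner"),
        })
    return tables


# Made-up sample input: one Hive table and one unrelated entity.
sample = json.dumps([
    {"sourceType": "HIVE", "type": "TABLE", "originalName": "customers",
     "parentPath": "/sales", "owner": "etl_user"},
    {"sourceType": "HDFS", "type": "FILE", "originalName": "part-00000"},
])

for table in extract_hive_tables(sample):
    print(table)
```

Filtering on the source and entity type first keeps the output limited to Hive tables, so the later import step does not need to know about other Hadoop artifacts.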
We then used the Collibra REST API to import the parsed JSON data into Collibra. Figure 5 shows the imported list.
Figure 5: List of imported Hive metadata in Collibra.
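The import step could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the code we ran: the endpoint path, the payload field names (`name`, `domainId`, `typeId`), the helper name `build_asset_request`, and the example host and IDs are all assumptions, and should be checked against the REST API documentation for your Collibra version. Authentication is omitted.

```python
import json
import urllib.request


def build_asset_request(base_url, table_name, domain_id, type_id,
                        endpoint="/rest/2.0/assets"):
    """Build (but do not send) a POST request that creates one asset."""
    # Assumed payload shape: asset name, target domain, and asset type.
    payload = {"name": table_name, "domainId": domain_id, "typeId": type_id}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host and IDs; the type ID would refer to the "Hive Table" asset.
request = build_asset_request(
    "https://collibra.example.com", "customers", "domain-123", "type-456"
)
print(request.get_method(), request.full_url)
# To actually send it, add credentials and call urllib.request.urlopen(request).
```

Building the request separately from sending it makes the payload easy to inspect before anything is written to Collibra.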
The list of Hive tables can then be governed like any other asset in Collibra. In Figure 6, we show the data steward for each Hive table in Collibra. We used a similar approach to also import Hive columns, Sqoop jobs, and other Hadoop artifacts into Collibra.
Figure 6: Data steward for Hive tables in Collibra.