Sunil Soares
March 20, 2018

A data catalog is a single inventory of data sources that allows analysts and business users to easily access, understand, query and transform data in a collaborative manner. Data catalogs are part of the broader trend toward data democratization, which involves opening up access to data across the enterprise to drive business value. Many vendors are starting to develop platforms that integrate data catalog and data governance tools, but the two categories differ in important ways. Data catalogs are generally focused on analytics productivity, while data governance tools are often geared towards improving data trustworthiness and regulatory compliance.

There are several functional and non-functional requirements to consider when selecting a data catalog:

  1. Ingestion of Diverse Data Sources & Open API – Data catalogs should provide the ability to ingest data from diverse data sources including relational databases, reporting tools, Hadoop, XML, JSON and NoSQL. If the vendor does not provide out-of-the-box support for a specific data source, it should provide an API for creating a custom data source.
  2. Preview of Sample Data & Profiling – Data catalogs should provide users with a preview of sample data including schemas, tables and columns. Users should also be able to view a profile of the data including min, max, percentage of nulls and cardinality.
  3. Report Catalog – Data catalogs should provide users with a catalog of reports in tools such as Tableau and BusinessObjects. Ideally, users should be able to drill down into details about a report including name, owner, key attributes and certification status, and then navigate from the data catalog to the actual visualization in the reporting tool itself.
  4. Data Lineage – Data catalogs should provide integration with the metadata hub so that users can view lineage from a data asset such as a report back to the underlying data source.
  5. Business Glossary & Data Governance Integration – Data catalogs should provide a business glossary so that users can view business terms, definitions and data stewards. Ideally, the data catalog can reuse business terms and definitions from the data governance tool.
  6. Data Shopping Cart & Workflow Enablement – Data catalogs should provide an “Amazon-like” shopping cart interface so that users can request access to a data set. Data access can then be provided to the user based on pre-defined rules or based on a workflow where the access request is routed to the data owner. For example, the HR data owner may create a rule stating that the compensation data set may be provisioned, with name and title masked, to users whose access has been authorized by the director of human resources.
  7. Interface for SQL Queries & Data Wrangling – Data catalogs should provide an intuitive user interface so that data analysts can create SQL queries and other transformations against data sources.
  8. Data Quality on Ingestion – Some data catalogs also provide the ability to run quality checks when data is ingested. For example, data quality checks may quarantine any customer records that have missing dates of birth. A workflow can then route the quarantined records to a data steward for the appropriate remediation activities.
  9. Data Discovery & Masking – Some data catalogs leverage machine learning to discover and mask sensitive data within data sources. A classic example would be the discovery of social security numbers within a field called EMP_NUM.
  10. Attractive User Interface & Social Enablement – The success of data democratization initiatives depends on how widely data catalogs are adopted, and adoption by business users and data analysts requires an attractive user interface. In addition, some data catalogs allow users to tag data and provide their own qualitative rating of its value. For example, a business user may rate the definition of “customer” as a “5” on a scale of 1 to 5.
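
To make a few of these requirements more concrete, the Python sketch below illustrates the column profile from item 2, the quarantine check from item 8 and the masked provisioning rule from item 6. It is a minimal illustration only and does not represent any vendor's API. The function names, the record fields and the "***" mask value are all assumptions made up for the example.

```python
from typing import Any, Dict, List, Set, Tuple

Record = Dict[str, Any]


def profile_column(records: List[Record], column: str) -> Dict[str, Any]:
    """Item 2: min, max, percentage of nulls and cardinality for one column."""
    values = [r.get(column) for r in records]
    present = [v for v in values if v is not None]
    pct_null = 100.0 * (len(values) - len(present)) / len(values) if values else 0.0
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "pct_null": pct_null,
        "cardinality": len(set(present)),  # number of distinct non-null values
    }


def split_on_missing(
    records: List[Record], required: str
) -> Tuple[List[Record], List[Record]]:
    """Item 8: quarantine records in which a required field is missing."""
    accepted: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        if record.get(required) in (None, ""):
            quarantined.append(record)  # would be routed to a data steward
        else:
            accepted.append(record)
    return accepted, quarantined


def provision(
    records: List[Record], masked_columns: Set[str], approved: bool
) -> List[Record]:
    """Item 6: release a data set with selected columns masked, once approved."""
    if not approved:
        raise PermissionError("access request has not been approved")
    return [
        {key: ("***" if key in masked_columns else value) for key, value in r.items()}
        for r in records
    ]


if __name__ == "__main__":
    # Hypothetical sample records, invented for this illustration.
    rows = [
        {"name": "Ana", "title": "Analyst", "date_of_birth": "1985-04-12", "salary": 70000},
        {"name": "Ben", "title": "Manager", "date_of_birth": None, "salary": 90000},
        {"name": "Cy", "title": "Analyst", "date_of_birth": "1990-11-03", "salary": 70000},
    ]
    print(profile_column(rows, "salary"))
    accepted, quarantined = split_on_missing(rows, "date_of_birth")
    print(len(accepted), "accepted;", len(quarantined), "quarantined")
    print(provision(accepted, {"name", "title"}, approved=True))
```

In a real catalog the approval flag would come from the access workflow and the quarantined records would be handed to a data steward. Here both are reduced to plain function arguments and return values to keep the sketch short.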

Data catalogs are an emerging technology category and the requirements are still evolving. However, the requirements above will provide useful guidance to any organization that is looking to implement a data catalog.