The Cost of Bad Data
The costs of poor data quality are so high that many have trouble believing the statistics. Gartner estimated that the average organization takes a $15M hit from poor data quality every year. For some organizations, it can even be fatal. I’m often reminded of a story told by my Data Science Innovation Summit co-presenter, Dan Enthoven from Domino Data Labs, about the high-frequency trading firm Knight Capital, which deployed a faulty update to its algorithm without testing its effect. Within a day, the firm had automated away nearly all of its capital and had to orchestrate an emergency sale to another firm.
He also speaks of a credit card company that failed to validate the FICO® credit score field from a 3rd-party provider. When the company later switched providers, the new provider indicated “no credit” with an illegal value of 999 (850 is the highest legal value). Because there was no data quality check in place, the automated approvals algorithm started approving these applications with huge credit limits, leading to major losses.
DataOps ensures Data Quality is Maintained
In an earlier post, I discussed a profound shift to DataOps that will change the way data is governed. Just recently, we have gained the capability to repeatably version and replicate data at scale, and this will enable us to move from process and control applied to a data system (i.e. control board meetings) to guarantees made within the system. One of the key elements of data governance that will be affected is Data Quality Management, which will move from a task assigned to Data Stewards who attempt to “clean up” data after it has already been corrupted to a part of all development activities within an organization that produce data.
First, a few definitions so we’re on the same page
Because we are talking about data quality, it’s best to be precise. These definitions are used throughout this article:
data source — anything that generates data over time and makes that data available for consumption as a data entity: an operational application, a refined data warehouse, a 3rd-party API, a sensor, etc.
data product — a data entity that fills a specific need, typically produced by transforming and combining data from one or more data sources
data pipeline — a series of transformations through which data flows in from the sources and out into data products
data definition — the metadata describing a particular data entity. The definition should enable data consumers to use the data entity without extra input; it typically includes the location, schema, and format of the entity as well as descriptions and constraints for each field.
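To make the last definition concrete, here is a minimal sketch of a data definition as versionable code. The entity, field names, and constraint format are all invented for illustration, and the 999 sentinel from the credit score story above is exactly the kind of value such a definition should catch:

```python
# A hypothetical data definition for a credit-application entity.
# The field names and constraints are illustrative, not a standard.
CREDIT_APPLICATION_DEFINITION = {
    "location": "s3://warehouse/credit_applications/",
    "format": "parquet",
    "schema": {
        "application_id": {"type": "str", "description": "Unique application key"},
        "fico_score": {
            "type": "int",
            "description": "FICO credit score; 999 'no credit' sentinels are invalid",
            "min": 300,
            "max": 850,
        },
        "requested_limit": {"type": "float", "min": 0.0},
    },
}

def violations(record: dict, definition: dict) -> list[str]:
    """Return a list of constraint violations for one record."""
    problems = []
    for field, rules in definition["schema"].items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
            continue
        if "min" in rules and value < rules["min"]:
            problems.append(f"{field}: {value} below {rules['min']}")
        if "max" in rules and value > rules["max"]:
            problems.append(f"{field}: {value} above {rules['max']}")
    return problems

# The 999 sentinel fails the range check instead of slipping through:
bad = {"application_id": "A-1", "fico_score": 999, "requested_limit": 5000.0}
print(violations(bad, CREDIT_APPLICATION_DEFINITION))
# → ['fico_score: 999 above 850']
```

Because the definition is just data, it can be versioned, diffed, and reviewed alongside the pipeline code that depends on it.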
Data Quality is part of Everyone’s Job
One additional definition that is important to align on is the broad category of people in your organization who use data to produce insights. I’ll call them data product creators. They may come from a variety of backgrounds and typically hold a title of data analyst, data engineer, or data scientist. Each will follow slightly different development processes and may use different sets of tools. However, they should all have responsibility for maintaining and improving data quality.
The 3 types of Data Product Creators all have data quality responsibilities
To move data quality from an after-the-fact data steward operation into everyday life, every role that touches the data must participate in keeping the data, and the tests that protect it, up to date. This means that all roles that create data products must take on the tasks typically assigned to Data Stewards. It does not mean the role of Data Steward should fully go away, as there are still plenty of data sources and data sets within the organization that are not undergoing active development.
DataOps and test-driven development
Compilers need to be more robust than the code they compile. Given that data pipeline code effectively compiles data into data products, and that the data itself isn’t under your control when you write the code, your data pipeline code should be robust enough to handle any possible data scenario. When building library code, software developers often use tools like Microsoft IntelliTest to generate test inputs that account for all possible input scenarios.
“We expect owners of pipelines to treat the schema as a production asset at par with source code and adopt best practices for reviewing, versioning, and maintaining the schema.” — Google Data Engineering
Within DataOps, when developing new code that uses data as an input (i.e., data pipeline code), the data definition (the schema and the set of rules for possible values) must accurately represent the constraints expected on the input data. A structure-aware fuzzing tool such as Google’s libFuzzer, or a data generator such as Faker, can be used to generate potential data inputs based on the constraints in the data definition. The failures these tests generate help harden both the processing code and the data definition. More details on the practices used by Google can be found in this article. Most of the time, this capability uncovers updates needed to make the data definition clearer about what is considered possible, but sometimes it also uncovers code bugs that could break data pipelines.
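As a stdlib-only stand-in for the fuzzing tools mentioned above, the sketch below generates boundary, sentinel, and random values from a (hypothetical) constraint in the data definition and records which ones the pipeline code rejects; the constraint format and the `risk_tier` function are invented for illustration:

```python
import random

# A hypothetical constraint pulled from a data definition.
FICO_RULES = {"min": 300, "max": 850}

def fuzz_values(rules: dict, n: int = 100, seed: int = 0) -> list[int]:
    """Generate candidate inputs: boundary values, known bad sentinels,
    and random values straddling the legal range."""
    rng = random.Random(seed)
    lo, hi = rules["min"], rules["max"]
    cases = [lo, hi, lo - 1, hi + 1, 0, 999]
    cases += [rng.randint(lo - 100, hi + 200) for _ in range(n)]
    return cases

def risk_tier(fico_score: int) -> str:
    """Pipeline code hardened by fuzzing: reject out-of-definition input
    instead of silently treating a 999 sentinel as excellent credit."""
    if not FICO_RULES["min"] <= fico_score <= FICO_RULES["max"]:
        raise ValueError(f"fico_score {fico_score} outside data definition")
    return "prime" if fico_score >= 660 else "subprime"

failures = []
for value in fuzz_values(FICO_RULES):
    try:
        risk_tier(value)
    except ValueError:
        failures.append(value)
print(f"{len(failures)} generated inputs violate the definition")
```

Each rejected value is a prompt to decide: is the input genuinely impossible (tighten the definition), or possible but unhandled (fix the code)?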
Data testing and code testing share many qualities: the tests are committed to source control, they can run as part of continuous integration using tools like Jenkins or CircleCI, and test coverage can be measured against sample datasets. However, data tests differ in one important way: when they fail, they leave behind bad data to clean up. In traditional data development architectures, running against production data is the only way to ensure that the data pipeline truly works. Therefore, while creating tests to evaluate data quality, data product creators should work in safe development environments, using what DataKitchen calls the Right to Repair Architecture. This allows them to create or modify a data product without introducing bad data into production.
In many cases, testing is made more difficult by the fact that the data products being delivered are simply a representation of the data inputs that generated them (for example, the charts and graphs that populate a dashboard). In these cases, it is difficult to write code-based unit tests that specify the full set of constraints for a dataset. The tests therefore often ensure that the pipeline doesn’t break in unexpected ways, but they do not guarantee that the data is high quality, much as a compiler does not check for business-logic flaws in your code. In agile software development, coders work with the product owner to define the tests a feature must pass to be accepted. In data product development, data owners should likewise be able to specify the conditions they expect in the data without much complex coding.
Data Definitions and continuous integration
In this way, the definition of data quality becomes intrinsic to the data definition:
1. Data definitions are treated as first-class code: versioned, tested, and published
2. In pipeline steps that merge or blend multiple inputs, the inputs are specified as a group, allowing tests to be written for relationships between the input data.
3. Data is run through a check against the data definition as it is loaded into the data platform in production. These checks can be used to apply statistical process control to the running platform.
By combining these principles, the team can guarantee that data issues are captured on ingestion.
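As a minimal sketch of the third principle, the check below applies a simple statistical process control rule on ingestion: flag any batch whose mean drifts more than three standard deviations from the historical batch means. The statistic, threshold, and history handling are illustrative assumptions:

```python
import statistics

def batch_in_control(history_means: list[float], batch: list[float],
                     sigmas: float = 3.0) -> bool:
    """Flag a batch whose mean drifts more than `sigmas` standard
    deviations from the historical batch means."""
    center = statistics.mean(history_means)
    spread = statistics.stdev(history_means)
    batch_mean = statistics.mean(batch)
    return abs(batch_mean - center) <= sigmas * spread

history = [700.0, 705.0, 698.0, 702.0, 701.0]  # past mean score per batch
good_batch = [690, 710, 705]
bad_batch = [999, 999, 999]  # a sentinel leak shifts the batch mean sharply
print(batch_in_control(history, good_batch))   # True
print(batch_in_control(history, bad_batch))    # False
```

Row-level definition checks catch individually invalid values; a control rule like this catches batches that are individually plausible but collectively drifting.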
The data validation procedure in use at Google, from: Data validation for machine learning
Debugging issues and root cause automation
In any system, end users are going to find some of the issues. In a self-service world, we want data consumers to be able to quickly and easily report issues with the data products they rely on. But simply reporting issues and documenting them for other data consumers does not build trust; the data owner must quickly determine the root cause and resolve the issue. However, data issues can be notoriously hard and time-consuming to diagnose. Questions that arise include:
What: Is the issue caused by a bug in the code or bad input data at the source?
Where: If it was a code bug, at what stage in the pipeline was the issue introduced?
How: If it was a source issue, is our understanding of the source data correct?
Who: If it was a source issue, did the user contribute other bad data?
How Big: What other data products use the same bad input data (and are therefore also incorrect)?
When: At what point in time was the data issue introduced (how much other similarly bad data exists)?
With data and pipeline code versioning in place, and tests continuously running against the data in the pipeline, we can answer these questions much more quickly.
With DataOps foundations in place, the answer to the root cause (What, Where, and How) questions above can be obtained systematically. Detailed data lineage allows us to utilize the following process:
1. Work with the data consumer to understand their expectations
2. Update the data product definition to include these expectations
3. Produce a data test that fails when these expectations are violated (fuzz testing tools can help generate the offending inputs)
4. Identify the specific data inputs that cause the test to fail and determine whether they warrant a data definition update because they are unrealistic
5. If the data inputs are realistic, move back to the previous stage in the pipeline and add tests for the new output expectation
6. Continue moving back and adding tests until you find a stage where the problems in the output are not related to a particular set of inputs. In this case, the pipeline stage in question is the root cause of the issue.
7. If you reach the data sources of the pipeline, then the source data is the issue, and you need to properly discard the data within the pipeline.
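The walk-back in steps 5–7 can be sketched over a linear pipeline. The stage names, lineage structure, and per-stage test results below are all hypothetical; real lineage graphs are rarely this simple:

```python
# Each stage maps to its upstream stage (None marks a data source).
PIPELINE = {"report": "aggregate", "aggregate": "cleanse", "cleanse": None}

def find_root_cause(failing_stage: str, upstream: dict,
                    output_is_bad: dict) -> str:
    """Walk upstream while each stage's input data is also bad. The first
    stage whose inputs are clean harbors the code bug; reaching a source
    means the source data itself is at fault."""
    stage = failing_stage
    while True:
        parent = upstream[stage]
        if parent is None:
            return f"source data feeding {stage!r}"
        if not output_is_bad[parent]:
            return f"code bug in stage {stage!r}"
        stage = parent

# Hypothetical test results: the aggregate output passes its tests,
# so the report stage's transformation introduced the problem.
print(find_root_cause("report", PIPELINE,
                      {"report": True, "aggregate": False, "cleanse": False}))
# → code bug in stage 'report'
```

If instead every upstream output had also failed its tests, the walk would reach the source and the bad data would need to be quarantined within the pipeline.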
From there, data profiling can be used to determine When the data problem was introduced. Metadata on the source data can be used to determine Who caused the problem, and the lineage graph can answer the How Big question by tracing data issues from their source to the other affected data products.
Data Issue Documentation and Resolution
The data lineage graph also provides us with the capability to notify all consumers of data products that are affected by a detected data problem. Consider the following higher education enrollment scenario, where multiple data pipelines are in place to leverage enrollment metrics to help with budgeting and marketing. Based on the lineage graph between the various intermediate data products, the system can automatically alert the downstream consumers as soon as a problem is detected in a pipeline. A similar pattern of alerting can be used for multiple types of problems (the diagram shows three example problems, spanning job failures and slow-running jobs). These alerts allow the data consumer to determine whether they should use the affected data product (dashboard, application, etc.) or wait for a fix.
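Mechanically, finding everyone to alert is a graph traversal. The lineage graph below is a hypothetical rendering of the enrollment scenario (the product names are invented), and the walk also answers the How Big question from earlier:

```python
from collections import deque

# Hypothetical lineage for the enrollment scenario: edges point from
# each data product to the products built from it.
LINEAGE = {
    "enrollment_source": ["enrollment_metrics"],
    "enrollment_metrics": ["budget_dashboard", "marketing_feed"],
    "budget_dashboard": [],
    "marketing_feed": ["campaign_report"],
    "campaign_report": [],
}

def affected_products(graph: dict, failed: str) -> set[str]:
    """Breadth-first walk of the lineage graph to find every downstream
    product whose consumers should be alerted about a problem in `failed`."""
    seen, queue = set(), deque(graph[failed])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph[node])
    return seen

print(sorted(affected_products(LINEAGE, "enrollment_metrics")))
# → ['budget_dashboard', 'campaign_report', 'marketing_feed']
```

The same traversal drives both the alerts described here and the documentation propagation described next.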
In addition, the lineage graph can be used to propagate data quality documentation through a system. When any of the data quality issues shown above is detected, it can automatically be added as a note on the data definition in the data catalog. The quality issue can then be propagated to all downstream data products, allowing any other data product creators to understand the data quality issues present in the data sources they are working with.
This carry-forward of quality issues is also noted in the metadata for records to which correcting code has not yet been applied. Then, when a fix is applied within a data pipeline to correct for bad source data, that fix can be propagated to the metadata, and the data quality note can be removed from the source and affected data definitions.