The Rise of DataOps from the Ashes of Data Governance

Ryan Gross
Vice President, Chicago Office

Companies know they need data governance, but aren’t making any progress in achieving it

These days, executives are interested in data governance because of articles like these:

  1. Recent Gartner research has found that organizations believe poor data quality to be responsible for an average of $15 million per year in losses.
  2. The first major GDPR fine was the $57 million penalty levied against Google by the French data protection authority.
  3. The Equifax data breach has cost the firm $1.4 billion (and counting), even though the stolen data has never surfaced.

On the other hand, the vast majority of data governance initiatives fail to move the needle, with Gartner also categorizing 84% of companies as low maturity in data governance. And although nearly every organization recognizes the need for data governance, many companies are not even starting data governance programs because of the strong negative connotation the term carries within the executive ranks.

Current data governance “best practices” are broken

In my experience, the reason for the lack of progress is that we have been doing data governance the wrong way, making it dead on arrival. Stan Christiaens got this right in his Forbes article, despite the fact that it was essentially an ad (a very effective one) for his company. I agree with him that the primary reason governance has failed in the past is because the technology just wasn’t ready, and organizations couldn’t find ways to motivate people to follow the processes that filled the technology gaps. However, I disagree that modern data catalog tools provide the complete technology answer we need to be successful (although they are a step in the right direction).

If Data Catalog tools aren’t the answer, what is?

Recent advances in data lake tooling (specifically the ability to version data at scale) have put us at a tipping point where we can reimagine the way we govern data (i.e., the culture, structures, and processes in place to achieve the risk mitigation and cost reduction that governance promises). At the end of the transformation, data governance will look a lot more like DevOps, with data stewards, scientists, and engineers working closely together to codify the governance policies throughout the data analytics lifecycle. Companies that adopt these changes early will create a huge competitive advantage.

To understand how I came to that conclusion, we will have to go back through some of the history of Software Engineering, where two core technical innovations enabled process and eventually cultural changes that transformed coding from a hobby to a world-eating revolution. We’ll then see how similar innovations were the primary enablers of the DevOps movement, which has similarly transformed IT infrastructure in the cloud era. Finally, we’ll see how these innovations are poised to drive similar process and cultural changes to data governance. It’ll take a little while to build the case, but I haven’t found a better way to get the point across yet, so please stick with me.

Background: How Source Control and Compilation created Software Engineering

The core innovations that created the discipline of software engineering are:

  1. The ability to compile a set of inputs to executable outputs
  2. Version control systems to keep track of the inputs

Before these systems, back in the 1960s, software development was a craft, where a single craftsman had to deliver an entire working system. These innovations enabled new organizational structures and processes to be applied to the creation of software, and programming became an engineering discipline. This is not to say that the art of programming is not extremely important; it’s just not the topic of this article.

The first step to moving from craft to engineering was the ability to express programs in higher-level languages through compilers. This made programs easier for the people writing them to understand, and easier to share across multiple people on a team, because a program could be broken down into multiple files. Additionally, as compilers got more advanced, they added automated improvements to the code by passing it through many intermediate representations.

By adding a consistent version system across all of the changes made to the code that ended up producing the system, the art of coding became “measurable” over time (in the sense of Peter Drucker’s famous quote: “you cannot manage what you cannot measure”). From there, all sorts of incremental innovations, like automated tests, static analysis for code quality, refactoring, continuous integration, and many others were added to define additional measures. Most importantly, teams could file and track bugs against specific versions of code and make guarantees about specific aspects of the software they were delivering. Obviously there have been many other innovations to improve software development, but it is hard to think of ones that aren’t dependent in some way on compilers and version control.

Everything-as-code: Applying Software Engineering’s core innovations elsewhere

In recent years, these core innovations have been applied to new areas, leading to a movement aptly titled everything-as-code. While I wasn’t personally there, I can only assume that software developers met the first version control systems back in the 1970s with a skeptical eye. In much the same way, many new areas consumed by the everything-as-code movement have garnered similar skepticism, some even claiming that their discipline could never be reduced to code. Then, within a few years, everything within the discipline is reduced to code, leading to many-fold improvements over the “legacy” way of doing things.

Turning code into infrastructure using a "compiler" layer of virtualization and configuration management

The first area of expansion was infrastructure provisioning. Here, the code is a set of config files and scripts specifying the infrastructure configuration across environments, and the compilation happens within a cloud platform, where the config and scripts are read and executed against the cloud service APIs to create and configure virtual infrastructure. While it may seem like the Infrastructure as Code movement swept through all infrastructure teams overnight, a ton of amazing innovations (virtual machines, software-defined networks, resource-management APIs, etc.) went into making the “compilation” step possible. This likely started with proprietary solutions from firms like VMware and Chef, but it became widely adopted when public cloud providers made the core functionality free to use on their platforms. Before this shift, infrastructure teams managed their environments to ensure consistency and quality because environments were hard to recreate, which led to layers of governance designed to apply control at various checkpoints in the development process. Today, DevOps teams engineer their environments, and the controls can be built into the “compiler.” The result is an orders-of-magnitude improvement in the ability to deploy changes, going from months or weeks to hours or minutes.
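
To make the “compilation” step concrete, here is a minimal sketch of the idea in Python, assuming the boto3 AWS SDK; the AMI ID, instance type, and resource name are placeholders rather than real resources, and purpose-built tools like Terraform or CloudFormation do this far more robustly.

```python
# A minimal infrastructure-as-code sketch (assumes boto3 and AWS credentials).
# All identifiers below are placeholders for illustration only.
import boto3

# The "source code": a declarative description of the desired infrastructure.
desired_infrastructure = {
    "web_server": {
        "image_id": "ami-0123456789abcdef0",  # placeholder AMI
        "instance_type": "t3.micro",
    }
}

def provision(config):
    """'Compile' the declarative config into virtual infrastructure by
    calling the cloud provider's resource-management APIs."""
    ec2 = boto3.client("ec2")
    for name, spec in config.items():
        ec2.run_instances(
            ImageId=spec["image_id"],
            InstanceType=spec["instance_type"],
            MinCount=1,
            MaxCount=1,
            TagSpecifications=[{
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": name}],
            }],
        )

if __name__ == "__main__":
    provision(desired_infrastructure)
```

Because the whole environment is described in files like this, it can be versioned, reviewed, and recreated on demand like any other source code.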

This enabled a complete rethink of the possibilities for improving infrastructure. Teams started to codify each of the stages for creating their system from scratch, making compilation, unit testing, static analysis, infrastructure setup, deployment, and functional and load testing a fully automated process (Continuous Delivery). Additionally, teams started testing that the system was secure both before and after deployment (DevSecOps). As each new component moves into version control, its evolution becomes measurable over time, which inevitably leads to continuous improvement because we can now make guarantees about specific aspects of the environments we deliver.

Getting to the point: the same thing will happen to data governance

The next field to be consumed by this phenomenon will be data governance / data management. I’m not sure what the name will be (DataOps, Data as Code, and DevDataOps all seem a bit off), but its effects will likely be even more impactful than DevOps/infrastructure as code.

Data pipelines as compilers

“With Machine Learning, your data writes the code.” — Kris Skrinak, ML Segment Lead at AWS

The rapid rise of Machine Learning has provided a new way to build complex software (typically for classifying or predicting things, but it’s going to do more over time). This mindset shift to thinking of the data as the code will be a key first step to converting data governance to an engineering discipline. Said another way:

“Data pipelines are simply compilers that use data as the source code.”
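
To make that analogy concrete, here is a minimal sketch of a “data compiler” using pandas and scikit-learn; the input file, column names, and model choice are hypothetical, and a real pipeline would have many more stages.

```python
# A minimal "data pipeline as compiler" sketch: raw data in, executable model out.
# File and column names are hypothetical placeholders.
import pandas as pd
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# "Source code": the raw data.
raw = pd.read_csv("customer_events.csv")  # hypothetical input file

# Intermediate representation: the feature-engineered dataset.
features = raw[["visits_last_30d", "avg_order_value"]]
labels = raw["churned"]

# "Compilation": fit a model that turns the data into something executable.
compiler = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])
compiler.fit(features, labels)

# The "executable output": a serialized model that can now make predictions.
joblib.dump(compiler, "churn_model.joblib")
```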

There are three things that are different, and more complex, about these “data compilers” compared to those for software or infrastructure:

  1. Data teams own both the data processing code and the underlying data. But if the data is now the source code, it’s as if each data team is writing its own compiler to build something executable from the data.
  2. With data, we have been specifying the structure of data manually through metadata, because this helps the teams writing the “data compiler” understand what to do at each step. Software and Infrastructure compilers typically infer the structure of their inputs.
  3. We still don’t really understand how data writes code. This is why we have data scientists experiment to figure out the logic of the compilers, and then data engineers come in later to build the optimizers.

The current set of data management technology platforms (Collibra, Waterline, Tamr, etc.) is built to enable this workflow, and they’re doing a pretty good job. However, the workflow they support still makes the definition of data governance a manual process handled in review meetings, which holds back the type of improvements we saw after the advent of DevOps and Infrastructure as Code.

The missing link: Data Version Control

Applying data version control. Credit to the DVC Project: https://dvc.org

Because data is generated “in the real world,” not by the data team, data teams have focused on controlling the metadata that describes it. This is why we draw the line between data governance (trying to manage something you can’t directly control) and data engineering (where we are actually engineering the data compilers rather than the data itself). Currently, data governance teams attempt to apply manual checks at various points to ensure the consistency and quality of the data. Introducing version tracking for the data would allow data governance and engineering teams to engineer the data together: filing bugs against data versions, applying quality control checks to the data compilers, and so on. This will allow data teams to make guarantees about the system components that the data delivers, which history has shown will inevitably lead to orders-of-magnitude improvements in the reliability and efficiency of data-driven systems.

The data version control tipping point has arrived

Platforms like Palantir Foundry already treat the management of data in much the same way that developers treat the versioning of code. Within these platforms, datasets can be versioned, branched, and acted upon by versioned code to create new datasets. This enables data-driven testing, where the data itself is tested in much the same way that the code that modifies it might be tested by a unit test. As data flows through the system, its lineage is tracked automatically, as are the data products produced at each stage of each data pipeline. Each of these transformations can be considered a compile step, converting the input data into an intermediate representation, before machine learning algorithms convert the final intermediate representation (which data teams usually call the feature-engineered dataset) into an executable form to make predictions. If you have $10M-$40M lying around and are willing to go all in with a vendor, the integration of all of this in Foundry is pretty impressive (disclaimer: I don’t have a ton of hands-on experience with Foundry; these statements are based on demos I’ve seen of real implementations at clients).
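
As a rough, platform-agnostic illustration of what data-driven testing can look like, here is a minimal sketch in Python; the dataset, file name, and column names are hypothetical.

```python
# A minimal data-driven testing sketch: unit-test-style assertions run
# against a specific version of the data rather than against code.
# Column names and the versioned snapshot file are hypothetical.
import pandas as pd

def test_orders_dataset(df: pd.DataFrame) -> None:
    # Every order must have a customer and a non-negative amount.
    assert df["customer_id"].notna().all(), "null customer_id found"
    assert (df["order_amount"] >= 0).all(), "negative order amount found"
    # The primary key must be unique within this version of the dataset.
    assert df["order_id"].is_unique, "duplicate order_id found"

test_orders_dataset(pd.read_csv("orders_v42.csv"))  # hypothetical data version
```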

For the rest of us, there are now open source alternatives. The Data Version Control (DVC) project is one option, focused on data scientist users. For big data workloads, Databricks has taken the first step in open sourcing a true version control system for data lakes with the release of their open source Delta Lake project. These projects are brand new, so branching, tagging, lineage tracking, bug filing, etc. haven’t all been added yet, but I’m pretty sure the community will add them over the next year or so.

The Databricks Delta Lake open source project enables data version control for data lakes
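
To give a flavor of what this looks like in practice, here is a minimal sketch of Delta Lake’s “time travel” capability, assuming a Spark session configured with the delta-spark package; the paths and datasets are placeholders.

```python
# A minimal data-versioning sketch with Delta Lake (assumes delta-spark is installed).
# Paths and source data are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Each write creates a new version of the dataset instead of destroying the old one.
events = spark.read.json("/raw/events")  # placeholder source
events.write.format("delta").mode("overwrite").save("/lake/events")

# Read the dataset as it existed at an earlier version -- the basis for filing
# bugs against, and running tests on, specific versions of the data.
events_v0 = spark.read.format("delta").option("versionAsOf", 0).load("/lake/events")

# Inspect the full change history of the table.
spark.sql("DESCRIBE HISTORY delta.`/lake/events`").show()
```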

The next step is to rebuild data governance

The arrival of technology for versioning and compiling data puts the impetus on data teams to start rethinking how their processes can take advantage of this new capability. Those who actively leverage it to make guarantees will likely create a massive competitive advantage for their organization. The first step will be killing off the checkpoint-based governance process. Instead, the data governance, science, and engineering teams will work closely together to enable continuous governance of data as it is compiled by data pipelines into something executable. Somewhere behind that will be the integration of the components compiled from data alongside the pure software and infrastructure as a single unit, although I don’t think the technology to enable this exists yet. The rest will emerge over time (and in another post), enabling a culture of governance that reduces major issues while accelerating the time to value for machine learning initiatives. I know it sounds crazy to say, but this is an exciting time to be in data governance.

 

If you need to define a starting point with DataOps, or to pick up where you left off, let Pariveda come alongside you to show data governance in a different light and deliver value. Talk with us today!

 

Illustration by: Sharon Tsao

Article originally published on Ryan's Medium page


About the Author

Ryan Gross, Vice President, Chicago Office
Ryan Gross brings his passion for technology, problem-solving, and team-building together to build quality solutions across many industries. He has taken a generalist approach in solutions ranging from native mobile apps to enterprise web applications and public APIs. Most recently, he has focused on how the cloud enables new application delivery mechanisms, developing and applying a Continuous Experimentation development approach to build a cloud-native IoT data ingestion and processing platform. Ryan strongly believes in a virtuous cycle of technology enabling process and team efficiency to build improved technology.
