DataOps for business: A comprehensive introduction
Managing the people, technology, and business outcomes related to data has become an enterprise-level challenge over the past decade.
Simply put, everything data-related has become more complex, which creates many new challenges. Fortunately, there are frameworks designed to handle this complexity, foster a positive data culture, and keep business goals on track. One of these is DataOps.
What is DataOps?
DataOps — short for “data operations” — is a method used to collaboratively manage data across an enterprise, focusing on the intersection of technology systems and human systems.
DataOps is a holistic approach that recognizes the far-reaching impact data has on business, and addresses common problems on several fronts.
The term arose during the mid-2010s (around the same time as other industry terms such as the “modern data stack”) in response to an explosion of new tools and methodologies. These changes to the data scene brought new opportunities and pitfalls, which forced the industry to re-think many fundamental principles.
Because DataOps is so holistic, it can be hard to understand without a fair amount of industry context.
In this article, we’ll walk through how and why DataOps came to exist. Along the way, we’ll highlight some of the best of its many definitions. And we’ll dig into its benefits and how to select a DataOps platform.
The origin of DataOps
“DataOps is a data management method that emphasizes communication, collaboration, integration, automation and measurement of cooperation between data engineers, data scientists and other data professionals.”
This definition comes from Andy Palmer, who is credited with coining the term in his 2015 blog post.
Palmer proposed “DataOps” as a framework to help work through two major shifts he’d noticed in the data space:
- Democratization of analytics. Business stakeholders of all types rely on data analytics, from leadership to customer support to sales and marketing. Around the time of Palmer’s writing, business intelligence tools (his post mentions Tableau, though Looker is a more current alternative) began putting accessible analysis tools in the hands of all professionals.
While this was a big step toward data-driven enterprise, it meant data users were no longer just a small group of highly technical specialists with similar backgrounds. Suddenly, they were extremely diverse, with different needs, abilities, and points of view.
- “Purpose-built” data storage. At the same time, different types of database architectures and storage systems were proliferating. Storage was scaling in the cloud to an unprecedented level.
The systems Palmer calls out aren’t too relevant today, but the principle remains true. A given organization will likely have a relational database to power transactions, and a storage system designed for analytics. This could be a big-name data warehouse, like Snowflake or BigQuery, or an alternative with a newer storage model, like Rockset or Firebolt.
Given these changes, it’s easy to guess the resulting problem.
In general, the pool of data users, and the range of their needs, was broadening. Keeping an organization’s data usable, and keeping teams communicating, got much harder. This is the “Data” piece of DataOps.
At the same time, there was an increased technical demand on the data infrastructure. Data pipelines now needed to connect data across many new BI and SaaS tools, as well as new types of data storage systems. That’s the “Ops.”
By 2018, the term DataOps had spread across the industry. Mainly, it was being used in the marketing of various data platforms, as it is today.
Meanwhile, the problems Palmer described in 2015 haven’t gone away. Data and data-driven decision making have become more and more important, and the landscape of data tools has become more and more fractured.
And the demand for timely, reliable data has only increased. Uniting the systems in a data stack, and the teams that rely on them, has proven to be a moving target.
Making data agile
The term “DataOps” is a play on “DevOps,” which implies a comparison between the worlds of data and software development.
The two aren’t as similar as you might think. We’ll get into that more below. For now, we’ll talk about a key similarity: agile development.
The following definition of DataOps comes from the proceedings of the 2018 LWDA conference (it’s also the definition used by Wikipedia):
“DataOps is a set of practices, processes and technologies that combines an integrated and process-oriented perspective on data with automation and methods from agile software engineering to improve quality, speed, and collaboration and promote a culture of continuous improvement in the area of data analytics.”
Agile software development (or just “agile”) is a broad set of frameworks used in the development world.
Agile encompasses many different techniques, but all are geared toward the goals of:
- Adaptability to change and uncertainty
- Early and continuous (iterative) delivery of the product
In DataOps, we apply agile principles to data workflows in order to deliver data as a product. This means delivering data that is clean, discoverable, explorable, and transformed as needed for a given use case. In other words, it’s ready to be operationalized.
Because business goals and use cases shift constantly, data stakeholders must be able to iterate on the workflow that produces the data product (usually by refining the data pipeline). There must be a relatively straightforward, timely way to productionize new models.
For example, say the sales department wants new data integrated into their dashboards, but this data must come from a source system their team has never used before. They should be able to add this data source to their pipeline quickly on their own, and later tweak it to refine the data product. They should not have to engage an engineering team in a complex technical process that takes weeks.
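To make this concrete, here’s a minimal sketch of what that kind of self-service change could look like. This is hypothetical Python: the `Pipeline` and `Source` classes are illustrative stand-ins for a real platform’s declarative configuration, not an actual library.

```python
# Hypothetical, simplified model of a declarative data pipeline.
# In a real DataOps platform, this would be managed configuration.
from dataclasses import dataclass, field

@dataclass
class Source:
    name: str       # logical name for the upstream system
    connector: str  # e.g. "postgres", "salesforce", "s3"
    config: dict    # connection details: credentials, tables, etc.

@dataclass
class Pipeline:
    sources: list[Source] = field(default_factory=list)
    destination: str = "warehouse.analytics"  # where dashboards already read

    def add_source(self, source: Source) -> None:
        # New data flows to the same destination with no re-architecture.
        self.sources.append(source)

pipeline = Pipeline()

# The sales team adds a brand-new source themselves...
pipeline.add_source(Source(
    name="sales_crm",
    connector="salesforce",
    config={"objects": ["Opportunity"]},
))
# ...and can tweak the config later to refine the data product.
```

The design choice matters more than the code: when pipeline logic lives in declarative configuration, adding a source is a small, reviewable change rather than a weeks-long engineering engagement.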
DataOps vs DevOps
DevOps is a methodology that allows software development (“Dev”) to collaborate effectively with IT operations (“Ops”). It goes hand-in-hand with agile methodology, and emerged as society began to demand powerful, user-friendly, and ever-current computer software.
Traditional management practices failed to meet this demand for software, and siloed teams caused production to break down in an unprecedented way. Something had to change.
Like DevOps, DataOps involves agile principles and fosters collaboration between two otherwise siloed groups with common interests. These are the business users of data (“Data”) and the data engineering or infrastructure teams (“Ops”).
It’s tempting to just say that DataOps is “DevOps for data pipelines.” There are a number of reasons why this isn’t the case, but the biggest reason has to do with the human element.
While DevOps unifies two highly technical teams, DataOps unifies a spectrum of technical skill levels: from data engineers, to data analysts, to business users with minimal technical experience. While these teams roughly map into “Data” and “Ops” groups, the line is quite blurry.
By extension, the problem DataOps is solving is arguably more complex.
The variety of humans involved makes it harder to design shared workflows and communication methods. Leaders must be very intentional about choosing DataOps tools and processes that truly meet everyone’s needs.
Why DataOps?
We’ve established that DataOps is a nebulous — if appealing — concept. It’s also challenging to implement.
So, why bother?
Here are some of the problems you might face as a data-driven organization that doesn’t use DataOps:
- Poor collaboration both within and between data stakeholder teams.
- One team is the bottleneck on other teams’ data needs. Usually, this is the technical “Ops” team charged with creating data pipelines.
- There’s a backlog of technical debt and broken pipelines due to poor governance or tooling in the face of growing demand. The “Ops” team is constantly playing catch-up.
- “Data” teams are skeptical of data quality due to lack of transparency and governance.
Data outcomes strongly influence business outcomes, so inefficiencies in data workflows are also slow leaks of time and money for the whole organization.
Benefits of DataOps
On the flip side, benefits of DataOps include:
- More effective collaboration across disparate teams.
- Robust, streamlined data pipeline workflows that prevent technical debt.
- Improved data quality, lower error rates, and increased trust in data.
- Fast innovation and delivery of new insights from data.
It’s true that implementing the required new tools and practices will cost money up front, but DataOps saves money in the long run.
In general, your teams can meet their goals more effectively and efficiently, and waste much less energy on communication breakdowns, technical barriers, broken data pipelines… the list goes on.
And because data transparency is a major tenet of DataOps, you’ll be able to easily demonstrate the value and ROI.
Taking things a step further, implementing DataOps imparts tons of flexibility and opens the door to a variety of organizational frameworks. For example, remodeling your infrastructure as a data mesh would be difficult, if not impossible, without DataOps.
Picking a DataOps platform or tool
DataOps isn’t a magic bullet, and you can’t just buy it.
What you can do is purchase a platform that enables you to solve the technical challenges of data integration. From there, the cultural and managerial shifts become possible.
A good DataOps platform should:
- Support data through the full lifecycle of your data stack: from ingestion, to transformation, to operationalization.
- Be transparent, with accessible metrics and status information.
- Include a high degree of customization and automation of data workflows.
- Guarantee high-quality data, with automated testing and built-in safeguards to prevent errors from reaching production (see the sketch after this list).
- Offer an accessible interface for engineers and business users alike.
- Offer fast, scalable deployments and the ability to iterate easily.
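To make the automated testing point concrete, here’s a minimal sketch of a data quality gate, written in plain Python. The `check_not_null` helper and the sample `orders` records are illustrative assumptions; most platforms express checks like this declaratively, but the principle is the same: validate every batch, and block anything that fails before it reaches production.

```python
# Minimal sketch of an automated data quality gate: each batch of records
# must pass validation before it is published to production tables.
from typing import Iterable

def check_not_null(rows: Iterable[dict], columns: list[str]) -> list[str]:
    """Return human-readable errors for any missing required fields."""
    errors = []
    for i, row in enumerate(rows):
        for col in columns:
            if row.get(col) is None:
                errors.append(f"row {i}: required column '{col}' is null")
    return errors

def publish(rows: list[dict]) -> None:
    # Placeholder for loading the batch into a production table.
    print(f"published {len(rows)} rows")

# An illustrative batch of order records; the second row should fail.
orders = [
    {"order_id": 1, "amount": 42.50, "customer_id": 7},
    {"order_id": 2, "amount": None,  "customer_id": 9},
]

problems = check_not_null(orders, columns=["order_id", "amount", "customer_id"])
if problems:
    # Surface errors to the team instead of letting bad data land downstream.
    print("batch rejected:", *problems, sep="\n  ")
else:
    publish(orders)
```

However it’s implemented, the safeguard belongs in the pipeline itself, so trust in the data doesn’t depend on someone remembering to check.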
It doesn’t have to be branded as a DataOps platform or tool to be effective. “Data integration,” “data pipelines,” “data mesh,” “ELT,” and “data orchestration” are just a few other labels under which you might find your ideal platform.
With the right platform in place, you’ll have the flexibility to manage the workflows, collaboration, and communication around data the same way you would with any other aspect of your organization. What that looks like, of course, is unique to you.
Estuary Flow is a real-time DataOps platform focused on flexible, scalable pipelines.