Bad Data Quality: Everyone Has the Same Complaint and They All Mean Something Different
Why every “bad data” complaint lands in a different root cause — and how to build a Trusted Data Environment anyway.
📌 THE POINT IS: Everyone says ‘bad data,’ but they’re describing different problems. Define data quality by component and owner, then build a plan to improve trust, because agentic AI will operationalize your data at scale.
Like and Restack/Share this article if it hits the spot!
The Plinko Problem: You can only guess where a “bad data” conversation will land you
Lately I’ve had a lot of conversations that end the same way: a business or technology leader shrugging their shoulders in defeat about “bad data.” In a conversation with colleagues not long ago, I stopped them and started asking, “But what does bad data mean…to you…?” The term “bad data” is used to describe several distinct problems, but unless you dig into what the person is actually experiencing, it’s like watching a Plinko ball fall down the pegboard on “The Price Is Right”: you just don’t know where the conversation will land.
Confusion usually comes from two places: either enterprises don’t have a clear definition of “data quality” or that definition includes too many components. Good data quality in the analytical environment is achieved when the data faithfully matches the upstream systems. I over-index on that to build trust in the environment.
“3 out of every 5 data scientists… spend the most time cleaning and organizing data.” — CrowdFlower Data Science Report (2016)
Upstream—where data is created—CTOs and application leaders own quality. To an application owner, good data is data that’s captured accurately, within any required business bounds, so it can serve whatever use cases come later. As long as the data is created correctly using input controls and other audit strategies, the possibilities are endless.
Build the Plinko Board: the 4 buckets “bad data” usually falls into
Data quality typically falls into a handful of buckets. Each one could be an article, but there are patterns, at varying levels of effort and investment, that technology and business teams should apply to increase quality.
We don’t have the data (not captured, not retained, not accessible)
We have it, but people don’t understand it (definitions, timing, joins, grain)
We have it, but it’s malformed at creation (input controls, free text, weak validation)
We have it, but the platform breaks trust (fidelity gaps vs source, missing lineage, silent transforms)
“Data quality refers to the usability and applicability of data used for an organization’s priority use cases…” — Gartner
Increasing data quality used to be a very hard business case to write. With legacy tools, adding a single input control or validation could cost $1MM! Business teams would scratch their heads because the change seemed so simple, but between large enterprise IT processes ballooning the price and vendor fees for custom work adding to the ticket, the amount was very hard to justify.
Nowadays two things are changing the equation very quickly:
The importance of high-quality data in operational systems that an agentic AI will depend on.
The rise of AI coding tools that significantly reduce both the coding time and the testing time required to make changes.
Suddenly the business case is simultaneously becoming much more important and much easier to attain for organizations. This will change the game.
“Quality is made at the factory”: input controls are the shift-left move
“Finding and fixing flawed data soon becomes a permanent fixture.” — Thomas C. Redman, Harvard Business Review
Agentic AI will operate systems, which means data quality becomes operational safety, not reporting hygiene. Agents will also be swimming in the analytical data lake to surface insights that drive business decisions and performance. Interns won’t be able to “fix numbers” before they’re seen by an executive. Input controls in the operational environment will function much like Knowledge Graphs will in the analytical one.
“Shifting left” to ensure superior logic is built into applications will be paramount. I asked a group of CTOs recently: How many web forms in the past week let you type letters into a phone number field? The answer is most likely zero, because JavaScript input controls have been around for decades! The same thinking needs to be applied across our operational environment, and AI coders will close that gap in the near future.
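To make the shift-left idea concrete, here is a minimal sketch of an input control that validates a phone number at the point of capture instead of leaving it for downstream cleanup. The function name and the 10-digit North American format are illustrative assumptions, not a prescription:

```typescript
// Shift-left input control: normalize and validate a phone number at creation.
// Returns the canonical 10-digit string, or null if the input is malformed —
// malformed data gets blocked at the source rather than passed downstream.
function normalizePhone(raw: string): string | null {
  const digits = raw.replace(/\D/g, ""); // strip parentheses, dashes, spaces
  if (digits.length === 10) return digits;
  // Accept an 11-digit form with a leading country code of 1
  if (digits.length === 11 && digits.startsWith("1")) return digits.slice(1);
  return null; // reject: don't let "letters in the phone field" into the system
}
```

The same rule can run in the browser form, the API layer, and the database constraint; the point is that the check lives where the data is created, so every downstream consumer inherits clean values for free.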
The business cases should write themselves.
“Poor data quality costs organizations at least $12.9 million a year on average…” — Gartner
That quote is from 2020, predating generative AI, which means Gartner was focused on reporting, analytics, and machine learning use cases. Take that estimate and scale it for a rapidly accelerating world of AI-driven business outputs.
Six Actions that Executives Should Consider Today
Here’s where you should start if you’re ready to dive into tuning up your enterprise’s data:
Force the definition upfront: Every “bad data” complaint must declare which bucket it’s in and which use case is harmed. (Fitness-for-use framing.)
Shift-left with input controls: Put validation rules where data is created; stop funding downstream bandages as a “strategy.”
Make fidelity measurable: Reconciliation checks between source and Trusted Environment for key KPIs (especially “executive numbers”).
Invest in observability + lineage like it’s production monitoring: because it is.
Kill the master-sheet incentive: publish “trust scores” (lineage, freshness, known issues) so people don’t feel safer in spreadsheets.
Treat AI as a trust multiplier: no agent rollout without defined data contracts + controls + monitoring.
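The fidelity check in the third action above can be sketched in a few lines: compare each key KPI as computed in the source system against the same KPI in the Trusted Environment, and flag anything that drifts beyond a tolerance. The interface names and the 0.1% default tolerance are assumptions for illustration:

```typescript
// One KPI measured in two places: the source system and the Trusted Environment.
interface KpiReading {
  name: string;
  sourceValue: number;
  trustedValue: number;
}

// Return a human-readable "break report" for every KPI whose trusted value
// drifts from the source by more than tolerancePct percent.
function reconcile(readings: KpiReading[], tolerancePct = 0.1): string[] {
  const breaks: string[] = [];
  for (const r of readings) {
    const driftPct =
      r.sourceValue === 0
        ? (r.trustedValue === 0 ? 0 : 100) // avoid divide-by-zero on empty KPIs
        : (Math.abs(r.trustedValue - r.sourceValue) / Math.abs(r.sourceValue)) * 100;
    if (driftPct > tolerancePct) {
      breaks.push(
        `${r.name}: source=${r.sourceValue} trusted=${r.trustedValue} drift=${driftPct.toFixed(2)}%`
      );
    }
  }
  return breaks; // empty array means fidelity holds for these KPIs
}
```

Run something like this on a schedule for the “executive numbers,” publish the break report, and fidelity stops being a feeling and becomes a metric.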
When someone says, “we have bad data!”…
Don’t argue! They’re probably right. However, dig deeper with a couple of probing questions. Get them to explain what they actually mean and are experiencing so that you can “route the call” appropriately.
Meanwhile, as you look at opportunities for future investment, push your teams to bring at least one idea that tunes up the source systems so that data is produced well; when it flows downstream, everyone benefits. This “root cause approach” will pay dividends: less time to produce reports, to build machine learning and other AI models, and to train AI agents to do quality work, plus an expanded executive worldview through highly accurate, automated business metrics and insights.