
Here's something that might surprise you: after decades of innovation
in data pipeline technology, data analysts are still spending roughly
80% of their time on data preparation. That's not a typo, and it's not
getting better.
This sobering reality came up at the 62nd edition of The IT Press Tour during a conversation with
Arvind Prabhakar, CEO and co-founder of Tabsdata, a company that emerged
from stealth in February with what sounds like a radical
proposition: replacing traditional data pipelines entirely with something
called "Pub/Sub for Tables." Prabhakar should know a thing or two about
data pipelines. He spent the previous decade as founder and CTO of
StreamSets, building the company up to its eventual acquisition.
But here's where things get interesting. Prabhakar isn't trying to
build a better pipeline. He's arguing that the entire pipeline approach
is fundamentally broken.
The Reality Recreation Problem
"What's even more mind-blowing," Prabhakar explained during the
briefing, "is that all of this data preparation, all this pipeline of
transformations, aggregations, cleansing, validation-what do you think
it is achieving? What are you trying to do? One way to express what
you're trying to do is to recreate the reality that those systems are
representing."
Think about that for a moment. Your source systems, whether they're
Salesforce, your ERP, or your customer database, already represent
business reality. They're actively powering your front-end operations.
But when you build a traditional data pipeline, you copy all that data
into a staging area, then spend enormous effort trying to recreate that
same reality through transformations and joins.
"The data was grounded in reality," Prabhakar continued. "You copied
it over. Then you have to go through all of this to recreate that
reality. And what's even more mind-blowing is the people who are doing
those modeling and transformations are the data team, which have no clue
and they're not grounded in the front-end systems."
This disconnect creates what Prabhakar calls "huge wedges of
information loss." The sales team understands their forecast data
perfectly, but by the time it's been through multiple pipeline stages
and transformations, that business context has been filtered through the
interpretations of data engineers who may never have spoken to a
salesperson.
When Kitchen Telemetry Becomes the Recipe
Prabhakar offered a brilliant analogy that captures the absurdity of
current approaches: "It's like when you're at a restaurant having your
favorite meal. If you were to create a pipeline to figure out how that
meal was made, they'll give you the telemetry of the kitchen for the
last seven hours. Sure, you can recreate those dishes if you are
inherently a person who enjoys pain. But you could also ask for the
recipe."
That's the difference between current data pipelines and what
Tabsdata is proposing. Instead of giving data teams the "kitchen
telemetry" (raw dumps from every database table and message queue), why not
just ask the domain experts to share the specific datasets the
organization actually needs?
Consider a typical pipeline scenario: you connect to a Postgres
database with two tables containing millions of records. Your data
engineers perform complex joins and transformations, ultimately
producing a dozen meaningful rows. "You create a dozen records out of
millions," Prabhakar noted. "A lot of wasted effort in storage, IOPS,
compute, reprocessing time to get a dozen records that you could have
asked those teams for."
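To make that concrete, here is a minimal sketch of the pattern Prabhakar is describing, with hypothetical table and column names: two large extracts are staged, joined, and aggregated just to yield a handful of summary rows.

```python
import pandas as pd

# Hypothetical staged copies of the two Postgres tables,
# each containing millions of rows.
orders = pd.read_parquet("staging/orders.parquet")
customers = pd.read_parquet("staging/customers.parquet")

# Recreate a relationship that already existed in the source system.
joined = orders.merge(customers, on="customer_id", how="inner")

# The actual business question: revenue by region, roughly a dozen rows.
summary = (
    joined.groupby("region", as_index=False)["order_total"]
    .sum()
    .rename(columns={"order_total": "revenue"})
)
print(summary)  # a dozen records distilled from millions of staged rows
```

All the storage, IOPS, and compute go into staging and joining; the domain team that owns those systems could have published the dozen-row summary directly.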
How Pub/Sub for Tables Actually Works
The concept itself is elegantly simple, though the implications are
profound. Instead of treating business domains as passive data sources
to be mined, Tabsdata's approach makes them active publishers of
specific datasets.
Here's how it works in practice. Your sales team publishes a weekly
forecast table. Your finance team publishes licensing data. Customer
success publishes engagement metrics. Each of these becomes a versioned,
trackable data contract: a formal commitment about what data will be
shared and when.
Data consumers (analytics teams, ML workflows, business intelligence
applications) subscribe to these published tables. When the sales team
updates their forecast, all subscribers automatically receive the new
version. No complex pipelines. No transformation logic that assumes how
sales forecasting works. No wondering whether the data you're analyzing
is from last week or last month.
The system handles versioning, provenance tracking, and lineage
automatically. If there's a problem with a particular customer health
score, you can trace it back to the exact input records and timestamps
that contributed to that calculation.
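The briefing didn't go into Tabsdata's actual API, but the semantics are easy to see in a toy, self-contained sketch (all names hypothetical, not Tabsdata's implementation): every publish creates a new retained version, and subscribers are invoked with it automatically.

```python
from dataclasses import dataclass, field
from typing import Callable, List

import pandas as pd

@dataclass
class TableTopic:
    """A toy 'Pub/Sub for Tables' topic: versioned and push-based."""
    name: str
    versions: List[pd.DataFrame] = field(default_factory=list)
    subscribers: List[Callable[[pd.DataFrame, int], None]] = field(default_factory=list)

    def publish(self, df: pd.DataFrame) -> int:
        self.versions.append(df.copy())      # every version is retained
        version = len(self.versions) - 1
        for notify in self.subscribers:
            notify(df, version)              # push the new version out
        return version

    def subscribe(self, callback: Callable[[pd.DataFrame, int], None]) -> None:
        self.subscribers.append(callback)

# The sales team publishes its weekly forecast as a data contract.
forecast = TableTopic("sales_forecast")

# An analytics consumer reacts to every new version automatically.
forecast.subscribe(
    lambda df, v: print(f"v{v}: total pipeline = {df['amount'].sum()}")
)

forecast.publish(pd.DataFrame({"region": ["EMEA", "AMER"], "amount": [120, 340]}))
forecast.publish(pd.DataFrame({"region": ["EMEA", "AMER"], "amount": [150, 310]}))
```

A real system adds schema contracts, provenance, and durable storage on top, but the control flow is the point: data moves because the owner published it, not because a pipeline pulled it.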
The Conway's Law Connection
There's a deeper organizational principle at work here that Prabhakar
touched on: Conway's Law. This concept suggests that the design of any
complex system mirrors the communication structure of the organization
building it.
"If you have a team with two technical leads and you give them a
problem to solve, they will create a solution with two modules that work
together," Prabhakar explained. "Imagine a Fortune 100 company that has
thousands and thousands, tens of thousands of employees in thousands of
organizations. If they're solving a problem, guess what's happening?
Their communication structure is the first draft of their design."
Traditional pipelines fight against this natural organizational
structure by funneling everything into centralized staging areas where
"they become somebody else's problem." The Pub/Sub approach respects
Conway's Law, allowing organizations to "push down the responsibility of
quality, governance, manageability, testability, all of that in a
manner in which communication in that organization works."
This isn't just theoretical organizational design; it has real
operational benefits. When data quality issues arise, there's a clear
line of communication back to the domain experts who understand the
business context.
Why Existing Solutions Fall Short
You might wonder whether existing technologies could accomplish the
same thing. After all, message brokers like Kafka have been doing
pub/sub for years. Data platforms like Snowflake offer versioning and
time travel capabilities. Couldn't you just build this yourself?
Prabhakar walked through why each existing category falls short:
Message Brokers: These are designed for events and
streams, not tables. "When you're operating on a pub/sub system by
nature, you either are operating at a point-to-point kind of message
exchange capacity to build event-driven architecture, or you're building
stream processing systems," he explained. To make Kafka work for
tables, you'd need to define table schemas, implement versioning, build
provenance tracking-essentially rebuilding much of what Tabsdata
provides natively.
Data Platforms: While Delta Lake and Snowflake offer
many of the technical capabilities needed, they require giving all
domain teams expensive platform access. "The governance nightmare is because these platforms don't forget data. They will retain that data
forever," Prabhakar noted. Plus, you still haven't solved the
fundamental problem of disconnecting data from business reality.
Workflow Orchestrators: Tools like Airflow operate
on metadata, not data. You'd still need to build the underlying data
versioning and provenance capabilities separately.
The technical challenges of building this functionality on existing
platforms aren't insurmountable, but as Prabhakar put it, "you're
looking at a science experiment. You're not looking at something that's
easily doable. Maybe a PhD project."
Real-World Applications and Early Results
Tabsdata is currently working with early design partners across
data-heavy industries: fintech, healthcare, retail, and insurance.
The use cases emerging from these engagements reveal the practical
impact of the approach.
Take change data capture (CDC), for example. At StreamSets, Prabhakar
dealt with the notorious challenges of implementing CDC for Oracle
databases using LogMiner, an approach he describes as "the poor man's way"
because Oracle's preferred solution requires expensive GoldenGate
licenses. "LogMiner has bugs that Oracle doesn't fix. And they make
breaking changes from every major version upgrade to the next major
version."
With Tabsdata's versioning system, subscribers can automatically
compare current and previous table versions to identify changes. "You
can create CDC on any data source. You don't need specialized tools and
costly systems."
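Because versions are retained, a subscriber can derive its own change feed by diffing two snapshots. A minimal sketch with pandas, using toy data and hypothetical columns:

```python
import pandas as pd

# Two retained versions of the same published table.
prev = pd.DataFrame({"id": [1, 2, 3], "status": ["open", "open", "won"]})
curr = pd.DataFrame({"id": [2, 3, 4], "status": ["open", "lost", "open"]})

# A full outer join with an indicator column classifies every row.
diff = prev.merge(curr, on="id", how="outer",
                  suffixes=("_prev", "_curr"), indicator=True)

inserts = diff[diff["_merge"] == "right_only"]        # only in current
deletes = diff[diff["_merge"] == "left_only"]         # only in previous
updates = diff[(diff["_merge"] == "both")
               & (diff["status_prev"] != diff["status_curr"])]

print(len(inserts), "inserted;", len(deletes), "deleted;", len(updates), "updated")
# -> 1 inserted; 1 deleted; 1 updated (id 3 moved from won to lost)
```

No log mining, no database-specific hooks: any source that can be published as a table gets CDC essentially for free.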
The company is taking an open core approach, with the basic
infrastructure available under an Apache license and enterprise extensions
remaining closed source. They're targeting Python data engineers with a
pip-installable package that runs anywhere: laptops, Kubernetes clusters,
or cloud infrastructure.
The Bigger Picture: A Shift Left for Data
What Prabhakar is describing mirrors other major shifts in enterprise
technology. Just as the industry moved from waterfall to agile
methodologies and from monoliths to microservices, data architecture may
be ready for its own fundamental transformation.
"We started that exercise where we're still using the old markers of
'we'll build the products and then hand it over to operations for
testing, validation, certification,'" he explained. "Eventually, people
realized that's not going to scale, so they did what's known as shift
left: operations and validations and quality, all of those become part of
the development process."
The same principle applies to data. Instead of centralizing all data
preparation and hoping data teams can recreate business reality,
responsibility for quality, governance, and business context stays with
the domain experts who understand it best.
Looking Forward: When Data Integration Disappears
Tabsdata's vision is ambitious: "A future where data integration no
longer exists, just trusted datasets, instantly accessible across the
enterprise, ready for AI, analytics, and action."
That might sound like Silicon Valley hyperbole, but consider the
organizational pain points this approach addresses. Legal teams
demanding exact data lineage for ML workflows but lacking the
infrastructure to provide it. Analytics teams spending 80% of their time
on data preparation instead of actual analysis. Business domains losing
agility because every small change requires complex pipeline
modifications.
The company emerged from stealth in February, entered public beta,
and is targeting their enterprise release for July. The founding team's
decade of experience building StreamSets gives them deep credibility in
the data integration space, along with the battle scars to understand
what didn't work the first time around.
Whether Pub/Sub for Tables becomes the new standard for enterprise
data architecture remains to be seen. But for IT leaders struggling with
the complexity and cost of current data pipeline approaches,
Prabhakar's fundamental question deserves serious consideration: if your
source systems already represent business reality, why are you working
so hard to recreate it?
Sometimes the most powerful innovations come not from building
something faster or cheaper, but from asking whether you need to build
it at all.