Rethinking Data Architecture: How Pub/Sub for Tables Could Replace Traditional Data Pipelines


Here's something that might surprise you: after decades of innovation in data pipeline technology, data analysts are still spending roughly 80% of their time on data preparation. That's not a typo, and it's not getting better.

This sobering reality came up at the 62nd edition of The IT Press Tour during a conversation with Arvind Prabhakar, CEO and co-founder of Tabsdata, a company that emerged from stealth in February with what sounds like a radical proposition: replacing traditional data pipelines entirely with something called "Pub/Sub for Tables." Prabhakar should know a thing or two about data pipelines. He spent the last decade as founder and CTO of StreamSets, building the company up to its eventual acquisition.

But here's where things get interesting. Prabhakar isn't trying to build a better pipeline. He's arguing that the entire pipeline approach is fundamentally broken.

The Reality Recreation Problem

"What's even more mind-blowing," Prabhakar explained during the briefing, "is that all of this data preparation, all this pipeline of transformations, aggregations, cleansing, validation-what do you think it is achieving? What are you trying to do? One way to express what you're trying to do is to recreate the reality that those systems are representing."

Think about that for a moment. Your source systems, whether they're Salesforce, your ERP, or your customer database, already represent business reality. They're actively powering your front-end operations. But when you build a traditional data pipeline, you copy all that data into a staging area, then spend enormous effort trying to recreate that same reality through transformations and joins.

"The data was grounded in reality," Prabhakar continued. "You copied it over. Then you have to go through all of this to recreate that reality. And what's even more mind-blowing is the people who are doing those modeling and transformations are the data team, which have no clue and they're not grounded in the front-end systems."

This disconnect creates what Prabhakar calls "huge wedges of information loss." The sales team understands their forecast data perfectly, but by the time it's been through multiple pipeline stages and transformations, that business context has been filtered through the interpretations of data engineers who may never have spoken to a salesperson.

When Kitchen Telemetry Becomes the Recipe

Prabhakar offered a brilliant analogy that captures the absurdity of current approaches: "It's like when you're at a restaurant having your favorite meal. If you were to create a pipeline to figure out how that meal was made, they'll give you the telemetry of the kitchen for the last seven hours. Sure, you can recreate those dishes if you are inherently a person who enjoys pain. But you could also ask for the recipe."

That's the difference between current data pipelines and what Tabsdata is proposing. Instead of giving data teams the "kitchen telemetry" (raw dumps from every database table and message queue), why not just ask the domain experts to share the specific datasets the organization actually needs?

Consider a typical pipeline scenario: you connect to a Postgres database with two tables containing millions of records. Your data engineers perform complex joins and transformations, ultimately producing a dozen meaningful rows. "You create a dozen records out of millions," Prabhakar noted. "A lot of wasted effort in storage, IOPS, compute, reprocessing time to get a dozen records that you could have asked those teams for."
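To make the waste tangible, here's a tiny, illustrative Python sketch (invented table and column names, three rows standing in for millions) of the shape of that work: stage the raw tables, then join and aggregate them down to a summary the owning team could simply have published:

    import sqlite3

    # Illustrative only: tiny stand-ins for two Postgres tables with millions of rows.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)")
    conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                     [(1, 10, 100.0), (2, 10, 250.0), (3, 20, 75.0)])
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(10, "EMEA"), (20, "APAC")])

    # Copy, stage, join, aggregate: millions of rows in, a dozen summary rows out.
    summary = conn.execute("""
        SELECT c.region, COUNT(*) AS orders, SUM(o.amount) AS revenue
        FROM orders o JOIN customers c ON o.customer_id = c.id
        GROUP BY c.region
    """).fetchall()
    print(summary)  # e.g. [('APAC', 1, 75.0), ('EMEA', 2, 350.0)]

The pub/sub alternative is to skip the recreation entirely and have the team that owns the data publish that summary table itself.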

How Pub/Sub for Tables Actually Works

The concept itself is elegantly simple, though the implications are profound. Instead of treating business domains as passive data sources to be mined, Tabsdata's approach makes them active publishers of specific datasets.

Here's how it works in practice. Your sales team publishes a weekly forecast table. Your finance team publishes licensing data. Customer success publishes engagement metrics. Each of these becomes a versioned, trackable data contract: a formal commitment about what data will be shared and when.

Data consumers (analytics teams, ML workflows, business intelligence applications) subscribe to these published tables. When the sales team updates their forecast, all subscribers automatically receive the new version. No complex pipelines. No transformation logic that assumes how sales forecasting works. No wondering whether the data you're analyzing is from last week or last month.

The system handles versioning, provenance tracking, and lineage automatically. If there's a problem with a particular customer health score, you can trace it back to the exact input records and timestamps that contributed to that calculation.
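To make the model concrete, here is a minimal, self-contained Python sketch of the pattern. To be clear, this is an illustration of the concept, not Tabsdata's actual API; every name in it (TableHub, publish, subscribe) is invented:

    from dataclasses import dataclass

    @dataclass
    class TableVersion:
        version: int
        rows: list          # the published snapshot
        source: str         # provenance: who published it
        published_at: str   # provenance: when it was published

    class TableHub:
        """Toy registry: named tables, versioned snapshots, pushed to subscribers."""
        def __init__(self):
            self._versions = {}     # table name -> list of TableVersion (full history)
            self._subscribers = {}  # table name -> list of callbacks

        def publish(self, table, rows, source, published_at):
            history = self._versions.setdefault(table, [])
            tv = TableVersion(len(history) + 1, rows, source, published_at)
            history.append(tv)      # every version is retained for lineage and replay
            for notify in self._subscribers.get(table, []):
                notify(tv)          # push the new version to every subscriber

        def subscribe(self, table, callback):
            self._subscribers.setdefault(table, []).append(callback)

    hub = TableHub()
    hub.subscribe("sales_forecast",
                  lambda tv: print(f"got v{tv.version} from {tv.source}: {tv.rows}"))
    hub.publish("sales_forecast",
                rows=[{"region": "EMEA", "q3_forecast": 1_200_000}],
                source="sales-team", published_at="2025-06-09")

In a real system the platform would persist the version history and expose the provenance fields for lineage queries; the point of the sketch is simply that subscribers receive whole, versioned tables rather than raw change events.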

The Conway's Law Connection

There's a deeper organizational principle at work here that Prabhakar touched on: Conway's Law. This concept suggests that the design of any complex system mirrors the communication structure of the organization building it.

"If you have a team with two technical leads and you give them a problem to solve, they will create a solution with two modules that work together," Prabhakar explained. "Imagine a Fortune 100 company that has thousands and thousands, tens of thousands of employees in thousands of organizations. If they're solving a problem, guess what's happening? Their communication structure is the first draft of their design."

Traditional pipelines fight against this natural organizational structure by funneling everything into centralized staging areas where "they become somebody else's problem." The Pub/Sub approach respects Conway's Law, allowing organizations to "push down the responsibility of quality, governance, manageability, testability, all of that in a manner in which communication in that organization works."

This isn't just theoretical organizational design; it has real operational benefits. When data quality issues arise, there's a clear line of communication back to the domain experts who understand the business context.

Why Existing Solutions Fall Short

You might wonder whether existing technologies could accomplish the same thing. After all, message brokers like Kafka have been doing pub/sub for years. Data platforms like Snowflake offer versioning and time travel capabilities. Couldn't you just build this yourself?

Prabhakar walked through why each existing category falls short:

Message Brokers: These are designed for events and streams, not tables. "When you're operating on a pub/sub system by nature, you either are operating at a point-to-point kind of message exchange capacity to build event-driven architecture, or you're building stream processing systems," he explained. To make Kafka work for tables, you'd need to define table schemas, implement versioning, and build provenance tracking, essentially rebuilding much of what Tabsdata provides natively (a sketch of that burden follows below).

Data Platforms: While Delta Lake and Snowflake offer many of the technical capabilities needed, they require giving all domain teams expensive platform access. "The governance nightmare is because these platforms don't forget data. They will retain that data forever," Prabhakar noted. Plus, you still haven't solved the fundamental problem: the data remains disconnected from business reality.

Workflow Orchestrators: Tools like Airflow operate on metadata, not data. You'd still need to build the underlying data versioning and provenance capabilities separately.

The technical challenges of building this functionality on existing platforms aren't insurmountable, but as Prabhakar put it, "you're looking at a science experiment. You're not looking at something that's easily doable. Maybe a PhD project."
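To illustrate that do-it-yourself burden, here is a rough sketch of just the first step on the Kafka route: before anything ships, you must design your own message envelope carrying the schema, version, and provenance metadata that a table-native system would manage for you. The topic name and envelope fields here are our own invention; confluent-kafka is shown only because it is a common Python client:

    import json
    from confluent_kafka import Producer  # pip install confluent-kafka

    # A hand-rolled "table message" envelope: schema, version, and provenance
    # are all yours to define, evolve, and enforce.
    envelope = {
        "table": "sales_forecast",
        "version": 42,
        "schema": {"region": "string", "q3_forecast": "integer"},
        "source": "sales-team",
        "published_at": "2025-06-09",
        "rows": [{"region": "EMEA", "q3_forecast": 1200000}],
    }

    producer = Producer({"bootstrap.servers": "localhost:9092"})
    producer.produce("tables.sales_forecast",
                     key=envelope["table"], value=json.dumps(envelope))
    producer.flush(5)
    # ...and you still have no version history, no diffing, and no lineage.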

Real-World Applications and Early Results

Tabsdata is currently working with early design partners across data-heavy industries: fintech, healthcare, retail, and insurance companies. The use cases emerging from these engagements reveal the practical impact of the approach.

Take change data capture (CDC), for example. At StreamSets, Prabhakar dealt with the notorious challenges of implementing CDC for Oracle databases using LogMiner, an approach he describes as "the poor man's way" because Oracle's preferred solution requires expensive GoldenGate licenses. "LogMiner has bugs that Oracle doesn't fix. And they make breaking changes from every major version upgrade to the next major version."

With Tabsdata's versioning system, subscribers can automatically compare current and previous table versions to identify changes. "You can create CDC on any data source. You don't need specialized tools and costly systems."
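The mechanics are easy to picture once versioned snapshots exist. Here's a minimal, illustrative Python sketch (again, not Tabsdata's API; the diff_versions function and the key column are our own invention) that derives CDC-style changes by comparing two versions of a table:

    def diff_versions(previous, current, key="id"):
        """Compare two table snapshots (lists of row dicts) keyed on `key`."""
        prev = {row[key]: row for row in previous}
        curr = {row[key]: row for row in current}
        inserts = [curr[k] for k in curr.keys() - prev.keys()]
        deletes = [prev[k] for k in prev.keys() - curr.keys()]
        updates = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
        return inserts, updates, deletes

    v1 = [{"id": 1, "status": "active"}, {"id": 2, "status": "trial"}]
    v2 = [{"id": 1, "status": "churned"}, {"id": 3, "status": "trial"}]
    inserts, updates, deletes = diff_versions(v1, v2)
    print(inserts)  # [{'id': 3, 'status': 'trial'}]   -> new rows
    print(updates)  # [{'id': 1, 'status': 'churned'}] -> changed rows
    print(deletes)  # [{'id': 2, 'status': 'trial'}]   -> removed rows

Because every published version is retained, this comparison works against any source that publishes tables, with no database-specific log mining involved.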

The company is taking an open core approach, with the basic infrastructure available under an Apache license and enterprise extensions remaining closed source. They're targeting Python data engineers with a pip-installable package that runs anywhere: laptops, Kubernetes clusters, or cloud infrastructure.

The Bigger Picture: A Shift Left for Data

What Prabhakar is describing mirrors other major shifts in enterprise technology. Just as the industry moved from waterfall to agile methodologies and from monoliths to microservices, data architecture may be ready for its own fundamental transformation.

"We started that exercise where we're still using the old markers of 'we'll build the products and then hand it over to operations for testing, validation, certification,'" he explained. "Eventually, people realized that's not going to scale, so they did what's known as shift left-operations and validations and quality all of those become part of the development process."

The same principle applies to data. Instead of centralizing all data preparation and hoping data teams can recreate business reality, the quality, governance, and business understanding stay with the domain experts who understand it best.

Looking Forward: When Data Integration Disappears

Tabsdata's vision is ambitious: "A future where data integration no longer exists-just trusted datasets, instantly accessible across the enterprise, ready for AI, analytics, and action."

That might sound like Silicon Valley hyperbole, but consider the organizational pain points this approach addresses. Legal teams demanding exact data lineage for ML workflows but lacking the infrastructure to provide it. Analytics teams spending 80% of their time on data preparation instead of actual analysis. Business domains losing agility because every small change requires complex pipeline modifications.

The company emerged from stealth in February, entered public beta, and is targeting their enterprise release for July. The founding team's decade of experience building StreamSets gives them deep credibility in the data integration space, along with the battle scars to understand what didn't work the first time around.

Whether Pub/Sub for Tables becomes the new standard for enterprise data architecture remains to be seen. But for IT leaders struggling with the complexity and cost of current data pipeline approaches, Prabhakar's fundamental question deserves serious consideration: if your source systems already represent business reality, why are you working so hard to recreate it?

Sometimes the most powerful innovations come not from building something faster or cheaper, but from asking whether you need to build it at all.


Published Tuesday, June 10, 2025 7:45 AM by David Marshall