Datafold announced Open Source data-diff,
a new open source cross-database diffing package. This new product is an open
source extension to Datafold's original Data Diff tool for comparing data
sets. Open source data-diff validates the consistency of data across databases
using high-performance algorithms.
In the modern data stack, companies extract data from
sources, load that data into a warehouse, and transform that data so that it
can be used for analysis, activation, or data science use cases. Datafold has
been focused on automated testing during the transformation step with Data
Diff, ensuring that any change made to a data model does not break a dashboard
or cause a predictive algorithm to have the wrong data. With the launch of open
source data-diff, Datafold can now help with the extract and load part of the
process. Open source data-diff verifies that the data that has been loaded
matches the source from which it was extracted. All parts of the data
stack need testing for data engineers to create reliable data products, and
Datafold now gives them coverage throughout the extract, load, transform (ELT)
process.
"data-diff fulfills a need that wasn't previously being
met," said Gleb Mezhanskiy, Datafold founder and CEO. "Every data-savvy
business today replicates data between databases in some way, for example, to
integrate all available data in a warehouse or data lake to leverage it for
analytics and machine learning. Replicating data at scale is a complex and
often error-prone process, and although multiple vendors and open source tools
provide replication solutions, there was no tooling to validate the correctness
of such replication. As a result, engineering teams resorted to manual one-off
checks and tedious investigations of discrepancies, and data consumers couldn't
fully trust the data replicated from other systems."
Mezhanskiy continued, "data-diff solves this problem
elegantly by providing an easy way to validate consistency of data sets across
databases at scale. It relies on state-of-the-art algorithms to achieve
incredible speed: e.g., comparing one-billion-row data sets across different
databases takes less than five minutes on a regular laptop. And, as an open
source tool, it can be easily embedded into existing workflows and systems."
Answering an Important Need
Today's organizations are using data replication to
consolidate information from multiple sources into data warehouses or data
lakes for analytics. They're integrating operational systems with real-time
data pipelines, consolidating data for search, and migrating data from legacy
systems to modern databases.
Thanks to amazing tools like Fivetran, Airbyte and
Stitch, it's easier than ever to sync data across multiple systems and
applications. Most data synchronization scenarios call for 100% guaranteed data
integrity, yet the practical reality is that in any interconnected system,
records are sometimes lost due to dropped packets, general replication issues,
or configuration errors. To ensure data integrity, it's necessary to perform
validation checks using a data diff tool.
Datafold's approach constitutes a significant step
forward for developers and data analysts who wish to compare multiple databases
rapidly and efficiently, without building a makeshift diff tool themselves.
Currently, data engineers use multiple comparison methods, ranging from simple
row counts to comprehensive row-level analysis. The former is fast but not
comprehensive, whereas the latter approach is slow but guarantees complete
validation. Open source data-diff is fast and provides complete validation.
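To illustrate the trade-off between the two existing methods, here is a minimal Python sketch. It uses small in-memory dictionaries as stand-ins for two tables keyed by primary key; the function names and sample data are hypothetical and are not part of data-diff.

```python
# Illustrative only: dictionaries stand in for two database tables,
# mapping a primary key to the rest of the row.

def row_count_check(source, target):
    """Fast check: passes whenever both tables have the same number of rows."""
    return len(source) == len(target)

def row_level_diff(source, target):
    """Thorough check: return the sorted keys whose rows differ or are missing on one side."""
    keys = set(source) | set(target)
    return sorted(k for k in keys if source.get(k) != target.get(k))

source = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
target = {1: ("alice", 10), 2: ("bob", 25), 3: ("carol", 30)}  # row 2 was altered

counts_match = row_count_check(source, target)  # True: the count cannot see the change
changed_keys = row_level_diff(source, target)   # [2]: the row-level pass finds it
```

The row count passes even though one value changed, while the row-level pass catches it at the cost of reading every row from both sides.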
Open Source data-diff for Building and Managing Data Quality
Available today, data-diff uses checksums to verify 100%
consistency between two different data sources quickly and efficiently. This
method allows for a row-level comparison of 100 million records to be done in
just a few seconds, without sacrificing the granularity of the resulting
comparison.
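One common way to combine checksums with row-level granularity is to hash whole key ranges and only look inside the ranges whose hashes disagree. The sketch below shows that general idea in Python. It is an assumption-laden illustration, not Datafold's actual implementation: the tables are in-memory dictionaries with integer keys, and the function names, the MD5 choice, and the threshold parameter are all hypothetical.

```python
import hashlib

def segment_checksum(rows, lo, hi):
    """Hash every row with lo <= key < hi, in key order."""
    h = hashlib.md5()
    for key in sorted(k for k in rows if lo <= k < hi):
        h.update(repr((key, rows[key])).encode("utf-8"))
    return h.hexdigest()

def diff_by_checksum(a, b, lo, hi, threshold=4):
    """Return the keys in [lo, hi) whose rows differ between tables a and b."""
    # Matching checksums: treat the whole segment as identical and skip it.
    if segment_checksum(a, lo, hi) == segment_checksum(b, lo, hi):
        return []
    # Small segment: compare the rows directly.
    if hi - lo <= threshold:
        return [k for k in range(lo, hi) if a.get(k) != b.get(k)]
    # Otherwise split the key range in half and recurse into each half.
    mid = (lo + hi) // 2
    return (diff_by_checksum(a, b, lo, mid, threshold)
            + diff_by_checksum(a, b, mid, hi, threshold))

source = {k: ("row", k * 2) for k in range(1000)}
target = dict(source)
target[137] = ("row", -1)  # simulate a corrupted value
del target[902]            # simulate a lost row

mismatched = diff_by_checksum(source, target, 0, 1000)
```

In a real cross-database setting each checksum would be computed inside the database, so only the hashes, and the rows of the few mismatched segments, would need to cross the network.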
Datafold has released data-diff under the MIT license.
Currently, the software includes connectors for Postgres, MySQL, Snowflake,
BigQuery, Redshift, Presto and Oracle. Datafold plans to invite contributors to
build connectors for additional data sources and for specific business
applications.
To learn more about Datafold's open source data-diff,
visit https://github.com/datafold/data-diff/.