Virtualization Technology News and Information
VMblog's Expert Interviews: Girish Pancha of StreamSets Talks About Big Data Ingest, Data Drift and the Future


We've entered 2016, and one thing is certain: the volume and variety of data companies collect today are unprecedented, yet data ingest is often an afterthought. Businesses often don't realize until it is too late that a data pipeline is corrupted or that they have lost data.

Operators are spending a lot more time trying to sanitize raw data, make sense of it, and then use it to make business decisions.  A startup called StreamSets says the current data environment involves constantly changing infrastructure and semantics, which slows down the process of collecting and moving data so it can be used for reliable analytics -- a problem it calls "data drift."

Back in September, the company closed a $12.5 million Series A round and embarked on a mission to provide continuous ingest technology for the next generation of big data applications.  The company says it cleans and monitors data in motion to address this challenge and fuel real-time analysis.

To find out more, I spoke to the company's CEO, Girish Pancha.

VMblog:  Talk to us about "Data Drift."  What is it?  And what impact is it having on enterprises?

Girish Pancha:  We define data drift as the unpredictable and continuous mutation of data characteristics such as schema, semantics and the infrastructure components that created the data.  These changes are an unavoidable result of the constant evolution of the systems producing the data as part of their operational life-cycle.  

Data drift is a new Big Data problem because traditionally you had a high degree of control over your data sources. Now, these sources are often owned by others and based on systems where data output consistency is not well-controlled.

The downstream impacts of data drift fall into two major categories: damage to the reliability of end-to-end data operations and the destruction of data quality.  The first is an artifact of having lots of low-level code holding today's ingest systems together; when something changes at the source, the pipeline breaks, and sources are changing all the time.  The second, corroding data quality, is a function of the inability to detect when data has been lost or its meaning has changed.  You end up with data stores that are incomplete and lack integrity - an untrustworthy recipe that leads to false insights and bad business decisions.
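As an editorial illustration of the second category, here is a minimal Python sketch of how a change in field names or value types between two batches could be detected. It is not StreamSets code; the function names `field_profile` and `detect_drift` and the sample records are hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter

def field_profile(records):
    """Count how often each (field name, value type) pair appears in a batch."""
    profile = Counter()
    for record in records:
        for name, value in record.items():
            profile[(name, type(value).__name__)] += 1
    return profile

def detect_drift(baseline, current):
    """Return the (field, type) pairs that appeared or disappeared between batches."""
    base_keys = set(field_profile(baseline))
    curr_keys = set(field_profile(current))
    return {
        "added": sorted(curr_keys - base_keys),
        "removed": sorted(base_keys - curr_keys),
    }

# Hypothetical data: the source starts sending price as text and adds a field.
yesterday = [{"sku": "A1", "price": 9.99}, {"sku": "B2", "price": 4.50}]
today = [{"sku": "A1", "price": "9.99", "currency": "USD"}]

report = detect_drift(yesterday, today)
print(report)
```

In this sketch a price that silently changes from a number to a string shows up as one removed and one added (field, type) pair, which is the kind of change that would otherwise go unnoticed until an analysis produces wrong results.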

These problems are big, but today are ignored or accepted as an unfortunate fact of life because there hasn't been a solution until now.

VMblog:  Tell us about the StreamSets solution and how it works.

Pancha:  Last September, we launched StreamSets Data Collector as a solution for managing data in motion and attacking the data drift problem head on. With it, you can access data from a variety of sources and move it into your data stores quickly, reliably, with quality and without custom code.   

Specifically, it's an open source infrastructure for data ingest that uses a visual UI to make it easy for data engineers to connect data sources to destination data stores. It comes with built-in transformations that let you sanitize the data in-stream, so you land clean data. It also gives you the ability to monitor the data flows you create, covering both the operating condition of the pipeline and the state of the data passing through it.  For instance, you can automate early warning notifications by setting thresholds on data metrics.  Finally, it's built for large scale continuous operations since it runs 100% in-memory on edge nodes for optimal latency and resource utilization, or on your existing cluster for scalability.
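To make the idea of a threshold on a data metric concrete, here is a small Python sketch. It does not use the StreamSets Data Collector API; `check_thresholds`, its parameters and the sample batch are hypothetical and only illustrate an early warning that fires when too many records are missing a field.

```python
def check_thresholds(records, required_field, max_missing_ratio):
    """Return a warning string if too many records lack `required_field`, else None."""
    if not records:
        return None
    # A field counts as missing when it is absent, None, or an empty string.
    missing = sum(1 for r in records if r.get(required_field) in (None, ""))
    ratio = missing / len(records)
    if ratio > max_missing_ratio:
        return f"ALERT: {ratio:.0%} of records missing '{required_field}'"
    return None

# Hypothetical batch: two of four records have no usable price.
batch = [{"price": 9.99}, {"price": None}, {}, {"price": 4.5}]
alert = check_thresholds(batch, "price", max_missing_ratio=0.25)
print(alert)
```

The metric here is the share of records missing a field; the same pattern would apply to other metrics, such as record counts per interval or the rate of parse errors.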

VMblog:  Talk to me about the pain points you are trying to address.  What are they?

Pancha:  At a high level, the real pain is that today's low-level tools are not up to the task of ingesting real-time data in a way that leads to timely and quality insights for the business. For instance, an organization may be trying to collect data in order to predict the price of a particular item. They would gather data from third-party sources, but the problem is that the data's meaning or location may change unexpectedly. If this breaks your ingest, you can't perform the analysis quickly and with confidence, which means you underprice or overprice products, which damages the business.

At a lower level, there are three types of pain we're addressing - inefficiency, brittleness and opaqueness - each of which results from the current model of ingest operations driven by custom code.  Inefficiency comes from engineers creating and patching custom code to get data into stores, as well as lots of data cleansing cycles from data scientists to make use of the data. Pipeline brittleness is the result of low-level tools like Flume and Kafka requiring schema coupling between data producers and consumers; any producer data drift creates breakage and patching. Lastly, opaqueness results from custom code not being instrumented to allow for real-time monitoring, early warning detection or diagnosis of faults.
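The brittleness described above can be illustrated with a short Python sketch. It is a generic example, not code from Flume, Kafka or StreamSets; the parser names and the sample data are hypothetical. A consumer that is coupled to the producer's exact column layout breaks when a column is added, while one that selects fields by name keeps working.

```python
import csv
import io

def parse_by_position(line):
    """Brittle: assumes the producer always emits exactly sku, price in that order."""
    sku, price = line.split(",")
    return {"sku": sku, "price": float(price)}

def parse_by_name(text):
    """More tolerant: reads the header row and picks the needed fields by name."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"sku": row["sku"], "price": float(row["price"])} for row in reader]

original = "sku,price\nA1,9.99\n"
drifted = "sku,currency,price\nA1,USD,9.99\n"  # the producer added a column

# The name-based parser still finds the fields it needs after the change.
by_name = parse_by_name(drifted)

# The position-based parser fails because the row now has three values.
try:
    parse_by_position("A1,USD,9.99")
    position_ok = True
except ValueError:
    position_ok = False

print(by_name, position_ok)
```

Selecting by name only tolerates added or reordered columns; a renamed field or a change in meaning would still need to be detected separately.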

VMblog:  What makes the StreamSets solution different from competitors in the market?

Pancha:  StreamSets is unique in several ways. First, it allows you to build pipelines without specifying schema, which makes it resistant to data drift. Second, it lets you sanitize your data while it's being ingested, so it lands clean. This has numerous benefits, including reducing data cleansing after the fact and ensuring compliance of data in the data store. Third, but perhaps most important, StreamSets provides complete visibility into your data pipeline and the data within it.  Other solutions are "black boxes" that can't warn you of impending problems with your data, and have limited value in helping you get to the root cause of problems.  Since we inspect not just the pipeline performance but the data itself - we call this deep introspection - we can detect data drift whereas others can't.

Lastly, we built StreamSets Data Collector for seamless scalability.  It runs 100% in memory to minimize latency and runs on your existing YARN or Mesos-based cluster, which not only makes adding capacity trivial but also avoids extra data hops in the process.

VMblog:  I have to ask, why hasn't this been done in the past?

Pancha:  We are just now getting to a maturity level where enterprises are operationalizing Big Data ingest.  This maturity was achieved for traditional data warehouses, but the legacy technologies used solved a batch problem for database and application data.  For emerging Big Data applications that need to operate in both batch and real-time on all data -- database, application and machine-generated -- we have been going through the maturity cycle all over again, manually coding using low-level frameworks.

VMblog:  And what hurdles did you have to overcome to develop this solution?

Pancha:  We were fortunate to be able to design things for the current world from the ground up. Legacy data players and competitors are reliant on relatively old open source frameworks and are trapped by their architectures.

The key challenge was not technical; it was understanding the pain points so that we architected the product correctly.  We used a very methodical customer discovery process in which we spoke with dozens of companies to unearth their core concerns.  One unexpected insight was that simplifying ingest for developers wasn't enough; you had to deal with the operational issues - reliability, quality, agility.  This is what led us to make transparency and visibility a focal point.

VMblog:  What's next in big data streaming, down the road?

Pancha:  Down the road, we expect enterprises to continue to mature how they do big data streaming, which will lead to unprecedented operational control of data across applications, in areas such as data availability, fidelity and security. This level of sophistication will be a must-have for modern enterprises that want to run their business on real-time big data applications with the right level of trust.


Once again, thank you to Girish Pancha, CEO of StreamSets, for taking time out to speak with VMblog and answer a few questions.

Published Wednesday, January 20, 2016 6:36 AM by David Marshall