Virtualization Technology News and Information
VMblog's Expert Interviews: StreamSets Talks Big Data and Data Flow Performance Management


StreamSets, a company that delivers performance management for data flows, last month announced results from a global survey conducted by independent research firm Dimensional Research. The survey revealed pervasive data pollution, which implies that analytic results may be wrong, leading to false insights that drive poor business decisions. Even if companies can detect their bad data, cleaning it after the fact wastes data scientists' time and delays its use, which is deadly in a world increasingly reliant on real-time analysis.

To find out more, I spoke with StreamSets' Founder and CEO, Girish Pancha.

VMblog:  The recent survey by StreamSets and Dimensional Research found that data quality is one of the biggest challenges facing enterprises.  What impact does data quality have on big data initiatives?

Girish Pancha:  When analysis is used to generate insights and suggest actions, users assume it was fed by a baseline level of data quality. With the shift to real-time, data-driven applications, the margin for error in discovering and dealing with quality issues before using the data disappears. The immediate impact is that businesses make bad decisions or take improper actions (or improperly fail to act). The longer-term impact is an erosion of trust in any of the analysis coming out of those big data applications. So, it is of fundamental importance that enterprises address inbound data quality as they are ingesting the data.

VMblog:  What is data flow performance management and why is it needed now?

Pancha:  The concept of data flow performance management is to create an organizational discipline around the efficient and effective movement of data in order to achieve business goals. It is needed because managing data in motion has become complex, resource-intensive and problematic. In fact, our survey with Dimensional Research found that 68% of respondents cited data quality as their most common challenge.

Before big data, moving data was trivial because data sources were internally controlled and hand-crafted, which made them highly reliable and easy to import into the data warehouse. Now, more and more data is gathered from a variety of big data sources, such as system logs or IoT sensors, that are not under the direct supervision or quality control of IT. This leads to quality issues as data schema and semantics change without notice or detection, something we call data drift. As a result, bad data creeps into and pollutes the data store. In fact, 87% of survey respondents reported that "bad data" is flowing into their data stores and 74% stated they are currently storing "bad data."

VMblog:  Tell us about the StreamSets solution.  How does it address the identified pain points?

Pancha:  With the StreamSets Data Collector, we have created an important tool that helps businesses establish a performance management discipline to help tame this chaotic environment. Most data movement technologies to date have been focused on solving the development problem, easing the task of designing or coding data flows. We went beyond this; we built an integrated data environment (IDE) that spans the data-flow life cycle, including design, operation and evolution. It makes it a drag-and-drop exercise to build data flows, but also provides the ability to test and then run data flows within the IDE. While running data pipelines, users can see live metrics and set rules-based alerts to get an early warning for any issue. Then users can take necessary actions, such as trapping problematic data. Lastly, it is easy to update data flows to support new sources, transformations or destinations.
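The rules-based alerting and data trapping Pancha describes can be illustrated with a minimal sketch. This is not the StreamSets Data Collector API; the function and field names here are purely hypothetical, chosen to show the pattern of validating records in flight, trapping the problematic ones, and raising an alert when an error-rate rule trips.

```python
# Minimal sketch (not the StreamSets API): a pipeline step that validates
# records, routes bad ones to an error trap, and fires a rules-based alert.

def run_pipeline(records, max_error_rate=0.1):
    """Split records into good and trapped, alerting when the error rate
    crosses the configured threshold. The "amount" field is illustrative."""
    good, trapped = [], []
    for rec in records:
        if isinstance(rec.get("amount"), (int, float)):
            good.append(rec)
        else:
            trapped.append(rec)  # keep bad data aside for later inspection
    total = len(good) + len(trapped)
    error_rate = len(trapped) / total if total else 0.0
    if error_rate > max_error_rate:
        print(f"ALERT: error rate {error_rate:.0%} exceeds {max_error_rate:.0%}")
    return good, trapped

good, trapped = run_pipeline([{"amount": 3.5}, {"amount": "n/a"}],
                             max_error_rate=0.4)
```

The key design point is that bad records are diverted rather than silently dropped, so they remain available for diagnosis while the clean flow continues.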

The solution is also 100% open source under a standard Apache license that can be downloaded from our website.

VMblog:  What are the most common challenges found in data performance management?

Pancha:  Key findings from our survey identified data drift, operational blindness and infrastructure diversity as the most common challenges in data performance management.

Data drift is an insidious problem that is relatively new to big data. It refers to the constant changes to data schema and semantics that occur due to unexpected and unannounced modifications of the systems that generate big data, for instance IoT sensors or servers that produce log files. Eighty-five percent of respondents said that unexpected changes to data structure or semantics create a substantial operational impact. Data drift breaks and pollutes data flows built using hand coding. For example, data that stops conforming to the expected schema can get dropped, or data that has changed in meaning is passed through without notice. In either case, analysis reliant on this data is compromised.
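A simple form of the schema half of data drift can be made concrete with a short sketch. This is an illustrative example, not StreamSets code: it compares each incoming record's fields against a baseline schema and reports what appeared or vanished, the kind of unannounced change that hand-coded pipelines would silently drop or mishandle.

```python
# Illustrative sketch: detect schema drift by diffing an incoming record's
# fields against a baseline schema observed earlier.

def detect_drift(baseline_fields, record):
    """Return (added, missing) field sets relative to the baseline."""
    fields = set(record)
    added = fields - set(baseline_fields)
    missing = set(baseline_fields) - fields
    return added, missing

# Baseline schema for a hypothetical server-log source:
baseline = {"host", "ts", "status"}

# An upstream change renames "status" to "code" without notice:
added, missing = detect_drift(baseline, {"host": "a", "ts": 1, "code": 200})
# added == {"code"}, missing == {"status"} -- drift is flagged, not silent
```

Semantic drift (a field keeping its name but changing meaning or units) is harder to catch and generally requires profiling the values themselves, not just the field names.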

Operational blindness refers to the fact that the low-level frameworks used to build data flows today come, at best, with primitive instrumentation. Findings suggest that respondents highly value real-time visibility, yet do not have the capabilities they need to control their data pipelines. In practice, they are dealing with a black box. This means users discover problems the hard way, when pipelines stop operating or a forensic study of a data quality situation reveals, well after the fact, that data was corrupted at the source.
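The instrumentation gap behind operational blindness can be sketched in a few lines. This is a hypothetical wrapper, not any real framework's API: giving every pipeline stage its own live counters means a failure shows up in metrics the moment it happens, rather than in a forensic study later.

```python
# Sketch: per-stage counters give a data flow the live metrics that
# hand-rolled pipelines typically lack (all names are illustrative).

from collections import Counter

class InstrumentedStage:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.metrics = Counter()  # live in/out/error counts for this stage

    def process(self, record):
        self.metrics["in"] += 1
        try:
            out = self.fn(record)
            self.metrics["out"] += 1
            return out
        except Exception:
            self.metrics["error"] += 1  # visible immediately, not forensically
            return None

# A parse stage that fails on non-numeric input:
stage = InstrumentedStage("parse", lambda r: int(r))
stage.process("3")
stage.process("x")   # metrics now show in=2, out=1, error=1
```

Alert rules like the one in the earlier sketch can then be evaluated against these counters while the pipeline runs.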

Lastly, infrastructure diversity points to the fact that big data pipelines use a number of different technologies for ingestion, message queuing, pub/sub and storage/compute.  Each of these components is on its own development path, meaning users have a multiplicity of lifecycles to manage simultaneously.

VMblog:  What makes the StreamSets solution different from others in the market?

Pancha:  The StreamSets Data Collector was built to be an any-to-any data flow management solution that addresses today's reality of big data complexity, including data drift, infrastructure diversity and operational management. It differs from previous efforts in several key ways. First, rather than require that users define a complex fixed schema for incoming data, it is intent driven, meaning users can specify the fields they care about. This makes the solution resilient to changes in schema and semantics. 
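The "intent-driven" idea, declaring only the fields you care about rather than a full fixed schema, can be shown with a small sketch. The field names and the `project` helper below are hypothetical, not StreamSets' implementation; the point is that extra, reordered or unannounced fields in the source simply pass by without breaking the flow.

```python
# Sketch of intent-driven ingestion: declare only the fields this flow
# needs; anything else in the record is tolerated, not fatal.

WANTED = ("user_id", "event", "ts")  # illustrative field names

def project(record):
    """Extract only the declared fields, defaulting absent ones to None."""
    return {f: record.get(f) for f in WANTED}

# A record gains an unannounced field -- the pipeline is unaffected:
rec = {"user_id": 7, "event": "click", "ts": 1469000000, "new_col": "x"}
# project(rec) -> {"user_id": 7, "event": "click", "ts": 1469000000}
```

A missing declared field surfaces as an explicit `None` rather than a crash, which a downstream rule can then flag as drift.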

Second, it provides a full life cycle IDE where users design, test, run and monitor their data flows all from one interface. They receive real-time metrics, or KPIs, on both the performance of the data movement and the profile of the data flowing through their systems. In short, users are alerted to data drift as it happens and can take action, such as normalizing or re-routing the data. Third, the solution is built so users can upgrade big data components without downtime. This is possible because each stage in a StreamSets pipeline is logically isolated from the others, so users can swap out components seamlessly. For instance, this means users can run two incompatible versions of the same component in the same pipeline, to test and then deploy an upgrade.
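The stage-isolation point can be sketched as well. This is a toy illustration, not StreamSets internals: when every stage sits behind the same interface, two versions of a component can run side by side in one pipeline, letting you compare outputs before cutting over.

```python
# Toy sketch: stages behind a common interface make a test-then-deploy
# upgrade possible. Both parser versions are illustrative.

def parser_v1(line):
    return line.split(",")                      # old behavior: raw split

def parser_v2(line):
    return [f.strip() for f in line.split(",")]  # new behavior: trimmed fields

def pipeline(lines, parse):
    """The pipeline only depends on the stage interface, not the version."""
    return [parse(line) for line in lines]

# Run both versions on the same input and diff before switching over:
old = pipeline([" a , b "], parser_v1)   # [[' a ', ' b ']]
new = pipeline([" a , b "], parser_v2)   # [['a', 'b']]
```

Because the pipeline depends only on the stage interface, swapping `parser_v1` for `parser_v2` requires no change to any other stage.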

VMblog:  Based on these findings, what's next in data performance management, down the road?

Pancha:  While the bad news is that data pollution is widespread and a challenge for many enterprises, the silver lining is that the market is waking up to the fact that one must manage data flows as a continuous operation, end-to-end and holistically, because there is a many-to-many relationship between data sources and points of consumption, and this map is always shifting. We have spent quite a while perfecting our management of traditional data stores and data movement, usually using batch ETL. Now we need to raise our gaze to deal with key business applications that make real-time decisions using data whose structure is more volatile and whose fidelity must be continually monitored and managed. This is not only a software problem, but an issue of organizational focus and discipline. Thus, companies need to create a performance management practice for their data in motion.


A special thank you to Girish Pancha, CEO and Founder of StreamSets, for taking time out to speak with us.

Girish Pancha is a data industry veteran who has spent his career developing successful and innovative products that address the challenge of providing integrated information as a mission-critical, enterprise-grade solution. Before co-founding StreamSets, Girish was an early employee and chief product officer at Informatica, where he was responsible for the company’s entire product portfolio. Girish also previously co-founded Zimba, a developer of mobile applications providing real-time access to corporate information, which he led to a successful acquisition. Girish began his career at Oracle, where he led the development of Oracle’s BI platform.

Published Wednesday, July 20, 2016 7:02 AM by David Marshall