Virtualization Technology News and Information
VMblog Expert Interview: Dinesh Chandrasekhar Talks Cloudera Data Platform, IoT Initiatives, and Real-Time Data Insights


The future of business is hybrid-cloud.  But we should understand that the data warehouse (one of the oldest data disciplines on the planet, and perhaps one of the most important) is the underpinning of cloud success.  Data is an extremely hot topic right now, and VMblog recently caught up with an industry expert on the subject, Dinesh Chandrasekhar, Head of Product Marketing at Cloudera.  

VMblog:  How have conversations and challenges evolved since the start of the coronavirus pandemic?  What have you seen companies learn or not learn?

Dinesh Chandrasekhar:  Data has never been more valuable than right now. The pandemic has made enterprises very aware of the shortcomings in their data management strategies. Businesses used to be quite content with their IT organizations delivering data insights with varying levels of latency, which served up actionable intelligence to business leaders with only minimal accuracy. Today, everyone is conscious that data latency can be quite costly, both in missed business opportunities and in its impact on the bottom line. To truly address the real challenges of the pandemic, enterprises are now waking up to the fact that data latency is unacceptable and that actionable insights need to be truly real-time, so that business leaders can make the right decisions at the right time using the right data. The adoption of streaming analytics is on the rise. Traditional business-intelligence insights hardly match up to the needs of today's business challenges. Streaming analytics delivers predictive and prescriptive insights that help prevent mishaps and capture ideal business opportunities at the right moment. Agility is the order of the day, and streaming analytics helps deliver it.

VMblog:  How has the pandemic affected organizations' ability to monitor and analyze "real-time" data insights?  And how does Cloudera ensure that "real-time" insights are shared in a timely and accurate manner?

Chandrasekhar:  The pandemic has affected different industries in different ways. For example, if a retail chain was collecting data about customer purchases through POS terminals at its retail stores, it is no longer able to do so, since that entire market has shifted to online purchasing. Data collection and ingestion from traditional sources have suddenly become rare. The other important thing to consider is that while enterprises are looking to get real-time insights, the need for data collection has suddenly extended beyond the enterprise. For a manufacturer, it is no longer enough to understand the distributor's demands alone; it needs to feel the pulse of the customer directly and visualize the entire value chain. Manufacturers are yearning to pivot their processes and production cycles based on what actual customer demand is today. To do so, data needs to be collected from data-at-rest sources like traditional warehouses, CRM systems, and ERP systems, but also from data-in-motion sources like social streams, partner systems, clickstreams, IoT devices, and more.

Cloudera enables enterprises with analytics from edge-to-cloud with its Cloudera Data Platform (CDP), which is a comprehensive hybrid platform that supports analytic workloads of any type across various environments. CDP also extends Cloudera's Data-in-Motion platform, Cloudera DataFlow, from on-premises to the cloud. This empowers enterprises to truly capture data from all types of streaming and batch sources alike from the edge and easily move it through the data lifecycle while generating actionable insights at every touchpoint. CDP helps with immediately generating real-time streaming analytics as well as with more traditional operational insights and beyond with comprehensive machine learning insights.

VMblog:  Why does a connected data lifecycle matter to enterprises?

Chandrasekhar:  A data lifecycle refers to the process through which enterprise data makes its journey from inception as raw data to consumption as information or insight. In this lifecycle, the sanctity of the data is preserved and governed while it undergoes various transformations and is secured as it is handled by various user personas across the organization. A traditional data lifecycle starts when data is ingested from standard at-rest sources like databases, enterprise systems, legacy applications, etc. This is then eventually pushed into a data warehouse or a data lake and subsequently to even an operational database for generating operational insights. While this works for a lot of use cases, it does not address the need for immediate and real-time insights. This is also not ideal for handling data-in-motion from various types of streaming sources. This leads enterprises to adopt a separate data lifecycle for handling and managing streaming data. This lifecycle would involve the ingestion of heavy volume, high-velocity data from a variety of streaming sources. But, as soon as it is ingested, the data is instantly analyzed, on the wire, to produce real-time insights. Complex patterns are assessed from various streams and alerts are generated based on possible anomalies or violations of thresholds. All of this happens even before the data has landed into a data lake and that is why these insights are truly real-time in nature.
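
To make the "analyzed on the wire" idea concrete, here is a minimal Python sketch, not Cloudera's implementation. It assumes a stream of sensor readings represented as plain dictionaries; the function name `analyze_in_flight`, the field names, and the threshold value are all hypothetical. Each reading is checked against the threshold, and an alert is raised before the record is handed on for storage.

```python
from typing import Callable, Dict, Iterable, Iterator

Reading = Dict[str, object]  # hypothetical record shape: {"sensor": str, "value": float}


def analyze_in_flight(
    readings: Iterable[Reading],
    threshold: float,
    on_alert: Callable[[Reading], None],
) -> Iterator[Reading]:
    """Check each reading as it arrives, then pass it downstream unchanged."""
    for reading in readings:
        # The alert fires before the record is yielded, i.e. before it is stored.
        if reading["value"] > threshold:
            on_alert(reading)
        yield reading


# Illustrative data only; a real stream would come from a messaging system.
stream = [
    {"sensor": "s1", "value": 71.0},
    {"sensor": "s2", "value": 98.5},
    {"sensor": "s1", "value": 72.4},
]

alerts = []      # stands in for a real-time alerting channel
data_lake = []   # stands in for long-term storage

for record in analyze_in_flight(stream, threshold=90.0, on_alert=alerts.append):
    data_lake.append(record)  # the record lands only after it has been analyzed
```

In this sketch every record still reaches storage, so the at-rest analysis described earlier remains possible on the same data.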

However, these two lifecycles need not be, and should not be, mutually exclusive. As a matter of fact, they tend to be interlinked at every level, and the combination of a data-at-rest lifecycle and a data-in-motion lifecycle is what makes up a connected data lifecycle. When real-time streams are ingested within the data-in-motion lifecycle, they are sent to the streams messaging cluster for buffering or to a stream processing engine cluster for generating insights. But, at the same time, they can also be sent to an operational store for enhancing the data or to a data lake for long-term storage and analysis. Likewise, when a machine learning model is built or enhanced from data-at-rest, the same model can be pushed into the data-in-motion lifecycle and made available at the edge, so that data can be scored as soon as it is ingested there. There are lots of interlinked exchanges like this across the connected data lifecycle.

So, to provide comprehensive analytics to all the key stakeholders, enterprises must adopt a connected data lifecycle.

VMblog:  Is IoT a key component to a successful cloud offering?  And what are the key challenges that IoT initiatives face?

Chandrasekhar:  IoT is not really a key component to any successful cloud offering. Rather, IoT and cloud are two ends of a spectrum that rely on each other a lot. The growth of IoT initiatives means growth in the extremely high volumes of data being ingested from thousands of IoT devices. Rather than inundate their on-premises data centers, enterprises find the cloud to be an easy and extensible alternative for data storage and analytics. The sheer scalability of the cloud, along with its affordability, makes it a worthy extension to all the IoT initiatives these days. In essence, what starts out as data-in-motion from thousands of IoT devices and sensors is streamed into the cloud, where it ends up as data-at-rest.

Key challenges that IoT initiatives face are:
  • Edge data capture and edge processing - Time is of the essence when it comes to IoT initiatives. There is a strong need to capture data right at the source and process it right there to extract key pieces of information that can then be sent upstream. This becomes challenging when you consider the scale - hundreds of thousands of devices across each initiative.
  • Edge storage - While the cloud may be a natural extension to IoT device data, the roundtrips from the edge to the cloud may not always be possible due to the cost and bandwidth. This means that, in certain use cases, edge data may need to be stored locally at the edge itself. However, this gets complicated when you consider edge hardware and environmental limitations.
  • Edge Management - Controlling what happens at the edge requires you to set up an edge agent that can establish edge connectivity, capture data, and process it locally. However, controlling hundreds of thousands of such agents, including instructing them to adopt specific behavior, what to capture, and how to stream that data, requires edge management capabilities. Not all IoT platforms offer such seamless management capabilities.
  • Edge intelligence and edge analytics - This is the future. With the onset of Industry 4.0, every vertical has been focusing on use cases that revolve around themes like "autonomous this" or "connected that" or "smart whatever." Machine learning models being made available at the edge enable such use cases where the edge becomes not just smart, but also self-aware and autonomous. This requires the right kind of data management platform to connect all the dots together to make it work seamlessly across multiple clouds, data centers, and the edge.
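
As an illustration of the first challenge above (processing data at the source so that only key information goes upstream), here is a minimal Python sketch. It is an assumption-laden example rather than how MiNiFi or any specific edge agent works: the function name `summarize_at_edge`, the device name, and the summary fields are hypothetical.

```python
from statistics import mean
from typing import Dict, List, Union


def summarize_at_edge(device_id: str, samples: List[float]) -> Dict[str, Union[str, int, float]]:
    """Reduce a window of raw samples to one compact summary record."""
    if not samples:
        raise ValueError("cannot summarize an empty window")
    return {
        "device": device_id,
        "count": len(samples),
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
    }


# One window of raw readings collected locally (illustrative values only).
window = [1.0, 2.0, 3.0]

# Only this single record would be sent upstream instead of every raw sample.
summary = summarize_at_edge("pump-7", window)
```

Sending one summary per window instead of every sample is one way to reduce the edge-to-cloud round trips that the edge storage point describes as costly.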

Cloudera DataFlow's edge management capabilities allow you to seamlessly deploy hundreds of thousands of edge agents (MiNiFi) to edge devices and easily manage them from one single central agent management hub (Edge Flow Manager). With support for machine learning models at the edge, Cloudera DataFlow has enabled dozens of companies across multiple industry verticals to adopt sophisticated IoT initiatives over the years. Now, with Cloudera DataFlow capabilities available within CDP, Cloudera can truly enable companies with the edge-to-cloud vision.

VMblog:  Cloudera recently completed its enterprise data cloud vision with the launch of Cloudera Data Platform, Private Cloud.  Can you tell me more about how this came about?

Chandrasekhar:  One of the key tenets of an enterprise data cloud is the ability to have data and analytics wherever they make sense for your business and use case. This means multiple public clouds, private clouds, and hybrid clouds. Listening to the voice of the customer is also critical for us. Observing industry trends and getting feedback from our customers helped us define our roadmap towards the public and private cloud. We built CDP to be one platform with two form factors - public cloud and private cloud. So, it was always our vision to have a CDP Private Cloud. CDP provides hybrid cloud functionality, and we are going to be expanding these capabilities over the coming months to make data movement automated and secure.

VMblog:  Why is an enterprise data cloud so important in today's business landscape and what is unique about Cloudera Data Platform?

Chandrasekhar:  We've talked about one of the pillars of an enterprise data cloud, namely data in multiple public clouds and private clouds. The next pillar is the need to push data and analytics to more and more people in an organization so that they can make better decisions. But the more people who have access to data, the more likely it is that compliance is at risk (the simplest way to meet compliance is to restrict access to just a few people). So, data needs to be secured and governed in order to maintain compliance as more and more people access and analyze it. The third pillar has to do with the types of use cases companies are trying to implement. Predictive maintenance, credit card fraud detection, and similar use cases require more than just a single product. Instead, companies need a platform that provides ingestion, transformation, querying, transactional, and predictive capabilities - and that is the data lifecycle. Lastly, an enterprise data cloud needs to be open. Not only does the code need to be open-source, but the APIs need to be open so that you can integrate the tools that are relevant for your company and industry, and the storage formats need to be open so that you can always have access to your data. We have built CDP along these pillars to give customers the most flexibility, agility, and cost-efficiency possible.


Published Friday, September 18, 2020 7:35 AM by David Marshall