Virtualization Technology News and Information
Q&A: Interview with BlueData, Talking Big Data in the Enterprise, Hadoop and Spark

BlueData, a pioneer in Big Data private clouds, recently made some noise about the growth of its executive team -- adding former VMware executive Jim Lenox as vice president of worldwide sales and former Microsoft executive Greg Kirchoff as vice president of business development.  These two additions were said to help further the BlueData mission to democratize Big Data, making it easy and cost-effective for enterprises of all sizes to deploy a self-service, private cloud on-premises.

To find out more about what BlueData is up to, I had a chance to speak with the company's CEO and co-founder, Kumar Sreekanti.

VMblog:  What is the current state of Big Data in the enterprise?

Kumar Sreekanti:  Confusion is the best word to describe the current state of Big Data in the enterprise. Companies realize that they are generating incredible amounts of data, but are having trouble figuring out how to make this data valuable to them. Many have used the analogy of finding the needle in the haystack, and AMPLab has talked about Big Data having the capacity to cure cancer. It does, but you first have to figure out how to take all of this data and put it to the right use to make the right set of decisions for the enterprise. That's what everybody is wrestling with. I think 2015 will be the year when we see the transition from experimenting with the data to analyzing it in real production systems and gaining real insights for the enterprise.

VMblog:  What are some of the things that have slowed down Big Data adoption?

Sreekanti:  There are a number of common pitfalls that have slowed down the adoption of Big Data in the enterprise. One is a lack of planning. Enterprises will stand up a single physical Hadoop deployment, and then when another group within the enterprise decides it needs to use Hadoop, they can't expand appropriately.

The second problem is how they deal with data security. Typically, enterprises have a collection of historical repositories of data (data silos) that are accessed in different ways under different security mechanisms. The enterprise needs to understand how to incorporate those into its Big Data solution. If an enterprise covers those two areas, it will be successful.

VMblog:  And how should enterprises handle rapid changes in Big Data platforms?

Sreekanti:  One of the common problems enterprises face when starting to develop or use Big Data is not designing their infrastructure for rapid changes to the platforms. They develop an infrastructure that works great for one Hadoop distribution, such as MapR, or one Big Data application. However, when Spark comes along they don't have the flexibility to modify their physical deployment and take advantage of that technology. BlueData can help enterprises with their infrastructure because our solution does not require a single distribution or a single application; we support all of them. The enterprises that choose to implement BlueData are better prepared for future changes.

VMblog:  How do you see Apache Spark evolving?

Sreekanti:  Spark is the new paradigm in the distributed MapReduce space. A lot of people like it because it provides some really advanced features like Spark Streaming, Spark SQL, MLlib (a machine learning library) and GraphX (a library for graph analytics). 2014 was really the year of Spark, as it started getting deployed in earnest. Spark is roughly where Hadoop was 24 months ago. The areas I think they will continue to build on include becoming more enterprise grade in terms of better performance characteristics, being able to scale to larger volumes of data, and scaling data stored in in-memory structures, including some of the in-memory file systems that are going to emerge quickly next year. If the last year is any indication, we'll see much further advancement of Spark in the enterprise in 2015.
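For readers newer to the space, the map/shuffle/reduce pattern that both Hadoop MapReduce and Spark generalize across a cluster can be illustrated with a minimal, single-machine sketch. This is plain Python for illustration only; the function names below are ours, not Spark's actual API:

```python
# A toy, single-machine sketch of the map/shuffle/reduce pattern.
# In a real framework these phases run in parallel across many machines.
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between stages.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big clusters", "spark processes big data in memory"]
counts = reduce_phase(shuffle(map_phase(lines)))
# counts["big"] is 3 and counts["data"] is 2
```

In Spark proper the same computation is a few lines over an RDD or DataFrame, with the shuffle handled transparently and the intermediate data kept in memory, which is a large part of its performance appeal.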

VMblog:  Enterprises have different ways of handling security.  Is Hadoop flexible enough to handle all of these?

Sreekanti:  Currently, security is a key area of focus for Hadoop. It's not there yet; it's going to be a journey, and I think it will get there eventually, but there are a few key problems that Hadoop needs to address, like multi-tenancy. Specifically, what I mean by multi-tenancy is the ability to onboard different user groups or business units that have diverse requirements. Some business units may want to run more machine learning jobs, while others might want to run more batch processing. The ability to onboard all of these different users onto a single Hadoop cluster is a very hard problem to solve. Enterprises need to be cognizant of these different security needs when rolling out a Hadoop platform.

The other major area of security that needs to be addressed is data governance, particularly as it pertains to data masking, auditing and encryption. This is something that has been solved by many of the enterprise grade storage vendors, as they have all of these capabilities with respect to auditing, access controls and so on. Hadoop will solve this eventually as well. In the meantime, enterprises should leverage existing storage systems while the Hadoop ecosystem matures. This is where BlueData can really help customers on their journey to Hadoop adoption: by using our DataTap technology to point to existing storage systems, you can leverage all of the enterprise grade security features that exist in those systems.

VMblog:  Currently, there is legacy data sitting on non-HDFS systems.  How can companies access this data for business analysis?

Sreekanti:  Hadoop requires that you move all non-HDFS data into the Hadoop Distributed File System (HDFS) to conduct analysis and leverage some of the richer capabilities that are available. It is not easy to move data stored in existing NAS filers and other storage systems into HDFS, in part because HDFS lacks a lot of their data governance and auditing capabilities. This is particularly hard for regulated industries like healthcare.

Additionally, the cost and time associated with moving all of this data and making sure that the right people have access to this data in the Hadoop distributed file system is holding enterprises back from making this move. Many storage vendors are trying to expose an HDFS interface to the storage systems, but it is complex and hard to do. These complications are exactly why BlueData created its DataTap technology, which provides a separation so that existing Hadoop environments can access storage systems like NFS without having to move the data. By not moving the data, enterprises accelerate the time to data exploration and results without the added cost.
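The "access in place" idea can be sketched as a thin adapter that presents a simple file-system interface over an existing mount point. This is a toy illustration under our own assumptions; the `InPlaceStore` class and its methods are hypothetical and are not BlueData's actual DataTap API:

```python
# Toy illustration of the adapter idea: present an HDFS-like interface
# (ls / exists / open) over an existing POSIX path, such as an NFS mount,
# so analytics code can read data in place instead of copying it first.
# InPlaceStore and its method names are hypothetical, not a real product API.
import os

class InPlaceStore:
    def __init__(self, mount_point):
        # e.g. an NFS mount like /mnt/filer1 (illustrative path)
        self.mount_point = mount_point

    def ls(self, rel_path=""):
        # List entries under the mount, loosely mimicking `hdfs dfs -ls`.
        return sorted(os.listdir(os.path.join(self.mount_point, rel_path)))

    def exists(self, rel_path):
        return os.path.exists(os.path.join(self.mount_point, rel_path))

    def open(self, rel_path, mode="r"):
        # Hand back an ordinary file handle; no bulk copy into HDFS needed.
        return open(os.path.join(self.mount_point, rel_path), mode)
```

Because the data never leaves the original storage system, its existing access controls, auditing and backup regimes continue to apply, which is the security argument made above.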

VMblog:  There's been a lot of noise about data lakes.  Can you tell us your thoughts?

Sreekanti:  The goal of data lakes is great. It means that an enterprise can store all of its data in a low-cost, easy-to-access storage device, and all of its applications can use their own private APIs for accessing that data. Currently, the technology is such that there is a gap between the ideal of the data lake and what's actually practical.

Enterprise data storage has spent the last 20 years perfecting backup and recovery, security, and maintenance of data. HDFS and the other typical lower-cost distributed file systems used in the data lake do not yet have that sophistication. We need to find a way to bring those two together, either by enhancing the capabilities of HDFS or by reducing the cost of enterprise storage.

In 2015, data lakes will become more ubiquitous in the enterprise. You'll see a lot more of the existing storage systems being included within the data lake, so it will become multiple data lakes, or multiple data ponds, encompassed into one logical data lake. You will not have to move all of the data from all of these different storage systems into one central data lake, as the original prescription had it; instead, existing storage systems will essentially expose a data lake interface, providing the enterprise with a logical data lake. This should really improve agility and access for data workers.


Once again, a special thank you to BlueData CEO and co-founder, Kumar Sreekanti.

Prior to co-founding BlueData, Kumar was vice president of R&D at VMware, where he was responsible for Storage and Availability in the Cloud Infrastructure Business Unit. Kumar's responsibilities included VSAN, Virtual Volumes, Virtual Flash, Virtual Storage Appliance, SRM (Site Recovery Manager), High Availability, SMP-FT and Virtual Storage Fabric. Before VMware, Kumar was the founding CEO and CTO of Agami Systems, a high-performance, distributed NAS company. Earlier, Kumar was vice president of Engineering and Operations for Akamai Technologies, where he was responsible for the Streaming Business Unit. Earlier in his career, Kumar held senior technical and engineering management positions with Adaptec, Mylex and Seagate.

Kumar is a graduate of Indian Institute of Technology Kanpur where he studied Electronics and Electrical Engineering.

Published Wednesday, January 28, 2015 6:33 AM by David Marshall