Virtualization Technology News and Information
SnapLogic 2017 Predictions: Finding Dumbo - The Great Pachyderm Migration to the Cloud

VMblog Predictions 2017

Virtualization and Cloud executives share their predictions for 2017.  Read them in this 9th annual series exclusive.

Contributed by Shayne Hodge, Data Scientist at SnapLogic

Finding Dumbo – The Great Pachyderm Migration to the Cloud

Hadoop is famously named after a toy elephant. In 2017 we will see how prescient that designation was, as this will be the year IT departments realize its similarities to its peanut-munching namesake. Like real elephants, Hadoop is large, powerful, extremely capable, and can complete complex tasks. It is also a bit too much to handle without trained staff and a large (data center) space; viewing it at a comfortable distance on someone else's land is often preferable.

Prediction #1: Enterprises will move to cloud-hosted Hadoop

The first (and, for some, only) step for many enterprises will be to keep their Hadoop ecosystems as-is but move the infrastructure to the cloud. Organizations that do this will still need big data engineers, but they will save on the hardware and networking headaches that building their own Hadoop cluster can bring. However, this does not make Hadoop itself any easier to administer.

Prediction #2: Enterprises will move toward specific-purpose SaaS applications to replace some Hadoop applications

For many, Hadoop's primary feature has been as a cheap data warehouse. The rise of Spark, Hive, and other parts of the ecosystem has enabled enterprises to turn their warehouse into a data lake with enhanced processing capabilities beyond their original goals. While intriguing, the ROI often is not there for building these ancillary systems from scratch.

It instead makes sense to buy the software as a commodity; this trend is most visible with Amazon Redshift, but others are following. For many organizations whose core business is not technology, the most cost-effective solution will generally be a cloud-based Hadoop cluster (or perhaps simply a data warehouse) connected to specific SaaS offerings such as Microsoft's AzureML or Google's BigQuery, instead of building the equivalent of those applications in their cluster. [1]

Prediction #3: SQL Strikes Back

After the initial hype of NoSQL databases, it was decided that RDBMSs had some redeeming qualities after all and "NoSQL" suddenly became "Not Only SQL". [2] Similarly, on the analysis side, most things in Hadoop were written as MapReduce jobs, or perhaps in Pig, Cascading, or Spark. Recent releases of Spark have brought things full circle, with support for SQL queries and effort put into optimizing their performance. Given its large user base and the new SaaS applications leveraging it to analyze large data sets, SQL is poised to become the chief way of interacting with the Hadoop ecosystem for most users.
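
To make the contrast concrete, here is a minimal sketch of the same aggregation written twice: once as explicit map, shuffle, and reduce steps, and once as a single SQL query. It is an illustration only. The sample data and function names are hypothetical, and Python's built-in sqlite3 module stands in for a SQL engine; it is not Spark or Hadoop, where the same idea would run across a cluster.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (region, sale amount) pairs.
SALES = [("east", 10.0), ("west", 5.0), ("east", 2.5), ("west", 7.5), ("north", 4.0)]


def totals_map_reduce(records):
    """Total per region, written as explicit map, shuffle, and reduce steps."""
    # Map: emit one (key, value) pair per record.
    mapped = [(region, amount) for region, amount in records]
    # Shuffle: group values by key.
    grouped = defaultdict(list)
    for region, amount in mapped:
        grouped[region].append(amount)
    # Reduce: collapse each group to a single value.
    return {region: sum(amounts) for region, amounts in grouped.items()}


def totals_sql(records):
    """The same totals as a single declarative SQL query (sqlite3 as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
        rows = conn.execute(
            "SELECT region, SUM(amount) FROM sales GROUP BY region"
        ).fetchall()
    finally:
        conn.close()
    return dict(rows)
```

Both functions return the same per-region totals. The SQL version states what is wanted and leaves the how to the engine, which is a large part of its appeal to the many users who already know the language.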


About the Author

Shayne Hodge is a Data Scientist at SnapLogic, Inc., a leading provider of data and application integration software. He has spent time at a variety of Silicon Valley tech companies and an almost equal variety of schools. He can be found on Twitter as @PurpleQuark.


[1] On Hacker News, a discussion of a recent article on the analysis of large data sets had many comments to the effect of 'don't use Hadoop if a SaaS solution works'. Most of these solutions would work for 5 TB datasets, and possibly much bigger.

[2] It appears there might not be a consistent definition of NoSQL available for all partitions of the industry.
Published Thursday, December 01, 2016 6:50 AM by David Marshall