Virtualization and Cloud executives share their predictions for 2016. Read them in this 8th Annual VMblog.com series exclusive.
Contributed by Dale Kim, Director, Industry Solutions, MapR Technologies
The Rise of JSON in Virtualized Big Data Infrastructures
If you regularly talk to enterprise computing technologists,
you'll hear again and again that "everyone is moving to the cloud." While
that's not literally true (at a number of levels), we definitely see the
ongoing trend of leveraging virtualized environments for a wider variety of
applications.
Some of the big data use cases that will have significantly
more presence on virtualized architectures in the coming year include the many
types of real-time analytics on event data streams, particularly related to the
Internet of Things (IoT). Since these data sources often grow at fast and
unpredictable rates, they are ideally served by virtualized environments that
can elastically scale out to meet both the growing volume and the compute
requirements. And since these data sources are often generated in the cloud (or
at the "edge"), there is little friction in delivering the data to a cloud
infrastructure for large scale analytics.
A cloud-based topology for IoT data is certainly not new,
but one popular technology that will further facilitate deployments is the data
format known as JavaScript Object Notation (JSON). JSON will play a bigger role
in the cloud as it helps to deliver and store data in a flexible and
easy-to-process format. You may already know that JSON is great for structured
but non-relational data formats that are hierarchical, nested, and/or evolving,
as in product catalog data. It's also great for unstructured data formats where
a fixed schema may not exist, such as machine log files. More broadly, its
self-describing construct makes it great for data interchange, most notably as
a vehicle for web browsers to make partial updates to your page view via AJAX.
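To make the "self-describing" point concrete, here is a minimal Python sketch using only the standard library. The two product catalog records and their field names are hypothetical, invented for illustration. The second record carries a nested `specs` object that the first lacks, and a consumer can discover that structure from the data itself rather than from a predeclared schema.

```python
import json

# Two hypothetical product catalog records (illustrative data only).
# The second adds a nested "specs" object that the first does not have.
RECORDS = [
    '{"sku": "A100", "name": "Desk lamp", "price": 24.5}',
    '{"sku": "B200", "name": "Monitor", "price": 179.0, '
    '"specs": {"size_in": 27, "ports": ["HDMI", "DP"]}}',
]


def field_paths(value, prefix=""):
    """Return sorted dotted paths of the leaf fields in a parsed JSON value.

    Nested objects are walked recursively; arrays and scalars count as leaves.
    """
    if isinstance(value, dict):
        paths = []
        for key, child in value.items():
            child_prefix = f"{prefix}.{key}" if prefix else key
            paths.extend(field_paths(child, child_prefix))
        return sorted(paths)
    return [prefix]


# Parse each record and list the fields it actually contains.
parsed = [json.loads(raw) for raw in RECORDS]
for doc in parsed:
    print(doc["sku"], field_paths(doc))
```

Both records parse with the same code even though their shapes differ, which is the property that makes the format convenient for evolving data.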
These characteristics make JSON an ideal format for
representing an incoming stream of data points, including event data and sensor
readings. These time-based events and sensor measurements are not going to be
homogeneous across your entire enterprise, nor will they necessarily be in a
"flat" format that can be expressed as rows and columns or as comma-separated
values (CSV). Your entire set of "time series data" will have data points that
differ across sources, and may even change in structure over time, such as from
wearable devices that add new capabilities with each new version. And in many
cases when storing the data, you will want to group together data based on time
windows (such as all data collected within one-hour intervals) to make data
retrieval more efficient. You might also want to create aggregations,
summaries, and samples as ways of enriching your data.
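As a rough illustration of time-window grouping and aggregation, the Python sketch below buckets JSON events into one-hour windows and summarizes a numeric field. The event data, the field names (`ts`, `device`, `temp_c`, `heart_rate`), and the helper functions are all assumptions made for this example, not part of any particular product. Note that the events do not share the same fields, which the summary step tolerates.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

# Hypothetical sensor events (illustrative only). Different devices
# report different measurements, so the records are not homogeneous.
EVENTS = [
    '{"ts": "2016-01-05T10:05:00+00:00", "device": "t-1", "temp_c": 20.0}',
    '{"ts": "2016-01-05T10:40:00+00:00", "device": "t-1", "temp_c": 22.0}',
    '{"ts": "2016-01-05T11:10:00+00:00", "device": "w-7", "heart_rate": 64}',
    '{"ts": "2016-01-05T11:50:00+00:00", "device": "t-1", "temp_c": 21.0}',
]


def hour_window(ts):
    """Map an ISO 8601 timestamp to the start of its one-hour window (UTC)."""
    dt = datetime.fromisoformat(ts).astimezone(timezone.utc)
    return dt.replace(minute=0, second=0, microsecond=0).isoformat()


def group_by_hour(raw_events):
    """Parse JSON events and group them by one-hour window."""
    windows = defaultdict(list)
    for raw in raw_events:
        event = json.loads(raw)
        windows[hour_window(event["ts"])].append(event)
    return dict(windows)


def summarize(events, field):
    """Aggregate one numeric field, skipping events that do not carry it."""
    values = [e[field] for e in events if field in e]
    if not values:
        return None
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": sum(values) / len(values),
    }


windows = group_by_hour(EVENTS)
for start in sorted(windows):
    print(start, summarize(windows[start], "temp_c"))
```

Storing each window as one document keeps a retrieval for a given hour to a single lookup, and the summary can be stored alongside the raw points as enrichment.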
IoT devices will continue to adopt JSON, either as the raw
output format, or as an output from a binary format via downstream conversion.
This means the big data technologies deployed for IoT use cases will leverage
JSON more heavily in 2016. Apache Hadoop will be used much more for storing
JSON data, and technologies like the open source Open JSON Application
Interface (OJAI™) will provide a standardized interface to JSON in
Hadoop. This will be especially important for integrating many different data
sources that can be correlated in a central data repository.
NoSQL databases, especially the document databases based on
JSON, will play a huge role in capturing IoT data. And visionary tools like the
open source Apache Drill will provide a SQL query engine on JSON data so enterprises
can continue using their SQL expertise and business intelligence tools for new,
non-relational data sources. JSON has already "won" as the data format of
choice on the Internet, and it promises to play an important role in modern
data architectures that include virtualized technologies.
##
About the Author
Dale Kim is the Director of Industry Solutions at MapR. His background
includes a variety of technical and management roles at information technology
companies. While his experience includes work with relational databases, much
of his career pertains to non-relational data in the areas of search, content
management, and NoSQL, and includes senior roles in technical marketing, sales
engineering, and support engineering. Dale holds an MBA from Santa Clara
University, and a BA in Computer Science from the University of California,
Berkeley.