Virtualization Technology News and Information
Insider's Guide: Architecting for Edge Success, Pt. 2

Welcome back!

As discussed in Part 1 of this series, there are three key objectives to unlock the value of data produced by IoT sensors and devices. Data must be canonicalized; moved from the edge of the business back to a central data center; and, once there, queried, enriched, visualized, transformed, and otherwise made useful to business decision makers. These correspond roughly to the three stages of an architecture called the edge-to-cloud pipeline.

We touched on the "where" aspect of the architecture - the physical locations and devices in which each stage runs. However, an edge-to-cloud deployment also involves several distinct phases. So, let's dig in!

Entering the Gateway: Data Capture

Stage one of the pipeline begins inside the gateway, where two essential tasks must take place: sensor and device data must be captured in its raw form, and it must then be aggregated and canonicalized to reduce its payload size and conform to standardized, semantically meaningful formats.

Raw data capture is the first component of the beginning phase of an edge-to-cloud architecture, and it is where in-memory computing platforms prove beneficial. In-memory data grids (IMDGs), in particular, are ideal for capturing the kinds of heterogeneous data generated by edge devices. When deploying on edge gateway devices, your platform should have as small a footprint as possible - ideally contained in a single Java ARchive (JAR) file of about 15MB.

Unlike traditional databases, IMDGs are schema-free and operate as key-value stores with advanced capabilities for indexing and querying. Thanks to this flexibility, data captured from devices can be handled in a couple of interesting ways:

  1. Data from devices made by different manufacturers over several decades can be mixed into a unified store
  2. Data can be keyed on any unique value - keys can be intrinsic to the data, such as a universally unique identifier (UUID) extracted through text processing, or a timestamp recording when the data was captured

What's more, because IMDGs are pure in-memory solutions, they don't require flash or solid-state drive (SSD) access as part of their data-writing operations - a huge win for developers. Why? It means that even relatively constrained devices like IoT gateways can cope with big data ingest rates at very low latencies. For initial data capture from edge devices, it's hard to beat an IMDG!
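To make the capture step concrete, here's a minimal sketch in plain Python of a schema-free key-value capture layer keyed on a UUID plus a capture timestamp. It's illustrative only - not any platform's actual API - and the class and method names are hypothetical:

```python
import time
import uuid

# Minimal in-memory key-value capture layer: a stand-in for an IMDG's
# schema-free map. Values are stored exactly as the device produced them.
class CaptureStore:
    def __init__(self):
        self._map = {}

    def put_raw(self, payload: bytes) -> str:
        # Key on a UUID plus capture timestamp so heterogeneous devices
        # can share one unified store without key collisions.
        key = f"{uuid.uuid4()}:{time.time_ns()}"
        self._map[key] = payload
        return key

    def get(self, key: str) -> bytes:
        return self._map[key]

store = CaptureStore()
k = store.put_raw(b"\x01\x02\x03")   # raw bytes straight off a sensor
assert store.get(k) == b"\x01\x02\x03"
```

Because the store imposes no schema, the payload is kept exactly as the device emitted it; making it meaningful comes later in the pipeline.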

Entering the Gateway: Aggregation and Canonicalization

Data captured in its raw form, as output from sensors and devices, isn't as useful as one might believe. What do I mean by that? First, the data is in whatever format the device produced, which is probably not semantically meaningful to you from a business perspective. Second, there tends to be a lot of it. If a sensor generates samples at 30Hz, you get 30 data samples every second - even if almost all of those samples show very little change over small time windows.

What's needed are two further processes to make the data useful:

  1. We need to aggregate the data by transforming a series of fine-grained samples into coarser-grained (and more manageable) averages
  2. We need to convert the data from the raw sensor or device format into a meaningful and useful format

The second step goes by various names - standardization, normalization, or canonicalization - but it is essentially an extract, transform, load (ETL) process applied to a continuous stream of data.
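The two steps can be sketched in a few lines of plain Python. The 30Hz tuple format, field names, and schema label here are all hypothetical, purely for illustration:

```python
from statistics import mean

# Hypothetical raw format: (epoch_seconds, reading) tuples arriving at 30 Hz.
raw_samples = [(1000 + i / 30, 20.0 + 0.01 * i) for i in range(90)]  # 3 seconds

def aggregate(samples, window_s=1.0):
    """Step 1: collapse fine-grained samples into one average per time window."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(int(ts // window_s), []).append(value)
    return {w: mean(vals) for w, vals in sorted(buckets.items())}

def canonicalize(window, avg, device_id="sensor-42"):
    """Step 2: map a raw average into a standardized, meaningful record."""
    return {
        "deviceId": device_id,        # hypothetical canonical field names
        "windowStart": window,
        "meanTempC": avg,
        "schema": "temperature/v1",
    }

records = [canonicalize(w, avg) for w, avg in aggregate(raw_samples).items()]
assert len(records) == 3   # 90 samples at 30 Hz -> 3 one-second averages
```

Ninety raw samples become three canonical records - a 30x payload reduction before anything crosses the WAN.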

When we think of ETL tools, we tend to think of large, sophisticated, expensive systems like Informatica PowerCenter, designed to run in corporate data centers. At the edge, however, you need something API-driven and code-oriented rather than the GUI-focused ETL tools most of us are familiar with. When evaluating a platform, make sure it offers a lightweight stream processing option that packages inside the same 15MB JAR file as the IMDG we discussed earlier. Packaged this way, a stream processing engine runs blazingly fast, even in a resource-constrained environment like an edge gateway.

By harnessing the stream processing engine, data stored in the underlying IMDG storage layer is continuously transformed from key-value maps of raw device data into key-value maps of meaningful, canonicalized data.
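As an illustration of that idea - with plain Python generators standing in for a real stream processing engine, and hypothetical keys and fields - a continuous raw-to-canonical transform might look like:

```python
import itertools

def raw_source():
    """Simulate an unbounded stream of raw device readings."""
    for i in itertools.count():
        yield {"key": f"dev-7:{i}", "raw": 20.0 + i}

def canonicalize_stage(stream):
    """Continuously turn raw entries into canonical key-value pairs."""
    for entry in stream:
        yield entry["key"], {"tempC": entry["raw"], "schema": "temperature/v1"}

# Drain the first five canonical entries into a map, as a sink stage would.
canonical_map = dict(itertools.islice(canonicalize_stage(raw_source()), 5))
assert canonical_map["dev-7:0"]["tempC"] == 20.0
```

The generators never materialize the whole stream, which is the property that lets this style of processing fit on a memory-constrained gateway.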

From Edge-to-Cloud: Data Transport

Once device data is canonicalized into a format that will make sense to decision-makers and data scientists alike, it's time to move it from the edge into the cloud or data center. Thankfully, if you choose the recommended IMDG platform, this usually means a simple flick of a switch.

Every platform instance should come with the ability to replicate data in an eventually consistent, asynchronous manner to other, geographically distinct platform instances - even over slow or unreliable wide area network (WAN) links. WAN replication comes with myriad configuration options, but in many cases it's sufficient to turn it on and let the default settings do the rest.
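Conceptually, eventually consistent, asynchronous replication looks like the following plain-Python sketch, with in-process dicts and a background thread standing in for real stores and real WAN shipping:

```python
import queue
import threading

# Local writes return immediately; a background thread ships them to a
# remote replica later - hence "asynchronous" and "eventually consistent".
local_store, remote_store = {}, {}
outbox = queue.Queue()

def put(key, value):
    local_store[key] = value       # local write completes right away
    outbox.put((key, value))       # replication happens later, off-thread

def replicator():
    while True:
        key, value = outbox.get()
        if key is None:            # sentinel: stop the sketch's replicator
            break
        remote_store[key] = value  # in reality: a batched send over the WAN

t = threading.Thread(target=replicator, daemon=True)
t.start()
put("site-3/temp", 21.7)
outbox.put((None, None))
t.join()
assert remote_store["site-3/temp"] == 21.7
```

The key property: a slow or dropped WAN link only lengthens the outbox queue; it never blocks writes at the edge.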

While WAN replication seamlessly moves data updates from the edge back to the cloud, you might be asking yourself: can I use the same data storage technology on beefy servers in the cloud that I use on resource-constrained edge devices? The answer is an emphatic yes!

In the Cloud (or Data Center): Making Sense of Information

Once aggregated and canonicalized device data makes its way from the edge back to the cloud, your platform becomes an incredible tool for unlocking the value of that data. So how do you make sense of it all? Since the data arrives in an IMDG instance, you can leverage the IMDG's distributed query API, which immediately makes the data available in paged result sets to other enterprise systems.
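For illustration, here's what paging a query over canonicalized key-value data might look like in plain Python - the record layout, predicate, and page size are hypothetical, not any vendor's query API:

```python
# Ten canonical records in a key-value store (layout is made up for the demo).
store = {f"rec-{i:03d}": {"tempC": 20 + i % 5} for i in range(10)}

def query_page(predicate, page=0, page_size=4):
    """Return one page of keys whose values satisfy the predicate."""
    matches = sorted(k for k, v in store.items() if predicate(v))
    start = page * page_size
    return matches[start:start + page_size]

# First page of records reading 22 degrees C or hotter.
page0 = query_page(lambda v: v["tempC"] >= 22)
```

Paging matters because a consuming enterprise system can start working with the first page while later pages are still being assembled across the grid.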

What's more, if you've developed machine learning (ML) models to do any of the following, you can use an inference runner to execute those models as part of a pipeline - all within the same platform.

  1. Analyze device performance
  2. Predict device failures
  3. Forecast manufacturing yield percentages from your edge data
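As a sketch of the inference-runner idea - in plain Python, with a trivial stand-in for a real ML model and hypothetical field names - an inference stage in a pipeline might look like:

```python
# Canonical records arriving from the edge (fields are hypothetical).
records = [
    {"deviceId": "pump-1", "vibration": 0.2},
    {"deviceId": "pump-2", "vibration": 0.9},
]

def failure_model(record):
    """Stand-in for a trained ML model: high vibration -> high failure risk."""
    return min(1.0, record["vibration"] * 1.1)

def inference_stage(stream, model, threshold=0.8):
    """Run the model over each record and emit only likely failures."""
    for rec in stream:
        score = model(rec)
        if score >= threshold:
            yield {**rec, "failureRisk": round(score, 2)}

alerts = list(inference_stage(records, failure_model))
assert alerts[0]["deviceId"] == "pump-2"
```

Because the stage is just another step in the pipeline, predictions land in the same grid as the data that produced them, ready for the query API above.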

And you needn't worry about doing something poorly understood that no one else has done before. Using the platform to unlock device and sensor data transmitted back to the cloud or data center is a well-trodden path.

By now you've realized that building your edge architecture correctly isn't always easy. In fact, many IT managers wrestle with the pipeline's inherent challenges when architecting it. However, given how much faster and more effectively a business can respond to the information provided, the benefits far outweigh the challenges.


About the Author

Lucas Beeler 

Lucas is a senior architect at Hazelcast, where he helps Hazelcast's most demanding customers architect, design, and operationalize enterprise software systems based around Hazelcast IMDG and Jet. Before joining Hazelcast, Lucas held similar positions at GigaSpaces and GridGain, giving him a uniquely broad and deep understanding of the in-memory platform space. Lucas holds a B.S.E. in computer science from the University of Michigan.

Published Tuesday, August 04, 2020 7:34 AM by David Marshall