Virtualization Technology News and Information
Q&A with Craig Nunes of Datrium on the AWS re:Invent 2017 Conference


Before they headed out to AWS re:Invent 2017, VMblog was able to catch up with Datrium's VP of Marketing, Craig Nunes, to get the early scoop on what they would be showcasing at the event and learn more about their technology.

VMblog:  In a nutshell, what is Open Convergence?

Craig Nunes:  Open Convergence is an architectural evolution that extends beyond hyperconvergence and mirrors hyperscaler approaches. With Open Convergence, VM and IO processing on servers (compute nodes) is split from shared durable data on the network (data nodes). This split architecture eliminates east-west traffic between compute nodes, so there is never neighbor noise or network overhead, and the system can mix and scale workloads while maintaining low latency.

Compute nodes maintain all VM data in local flash for ultra-low latency performance, and are stateless so maintenance is simple and data availability is assured.  Data nodes maintain all persistent copies of data, which is always compressed, globally deduped, and erasure coded with double fault tolerance.

VMblog:  This is the first time Datrium has attended AWS re:Invent. Can you give our readers the elevator pitch on what they can expect to see from you?

Nunes:  At AWS re:Invent we are excited to be demonstrating Datrium Cloud DVX, a port of the DVX software to Amazon Web Services (AWS).  Cloud DVX is a natural architectural fit with AWS, in that its compute nodes align perfectly with EC2 and its data nodes with S3. 

The first service available on Cloud DVX provides VM and vDisk backup and recovery for on-premise Datrium DVX systems.  Beyond as-a-service simplicity, the new service offering delivers breakthrough global cloud deduplication and forever-incremental backups to minimize the cost of capacity and network bandwidth.  Granular, dedupe-aware recovery also provides an accelerated Recovery Time Objective (RTO) versus other AWS S3-based offerings.

VMblog:  Cloud-based backup is already being done by the likes of Nutanix. How are you different?

Nunes:  With Cloud DVX, we have focused on 'as-a-service' simplicity.  Cloud DVX automates all the key Day One and Day Two tasks, so customers don't have to specialize in AWS infrastructure mechanics. We offer a one-click set-up which makes getting started with Cloud DVX in AWS as simple as pairing another replication target to an on-premise DVX. Simply select the AWS region, enter credentials and click finish. In just a few minutes, Cloud DVX is set up in AWS automatically and customers can start replicating snapshots immediately. 

Beyond that, we provide self-healing availability as a part of the backup and recovery service. Cloud DVX leverages serverless management of AWS resources to ensure continuity of data replication tasks. In the event that an EC2 instance suffers an outage, another EC2 instance is automatically spun up so that ongoing replication tasks can finish successfully. The service also provides automated software upgrades for Cloud DVX without any customer intervention required, and just like on-premise DVX, Cloud DVX sends a range of telemetry inputs to the Datrium support teams for proactive issue identification and resolution.

VMblog:  How does this impact a user's AWS subscription?

Nunes:  We are immensely sensitive to the cost of a customer's AWS subscription, and our goal with Cloud DVX is the lowest total cost of ownership.  Cloud DVX optimizes cost in a few main ways. First, Datrium DVX collapses three-tier D2D2C or F2D2C into a two-tier model by converging primary and secondary in one system and connecting to the cloud directly. By delivering better performance, greater data efficiency and granular data management in one system, DVX can save customers over 70% in purchase costs alone, compared to separate primary and HCI backup vendor solutions.

Beyond that, Cloud DVX leverages compute in the cloud for in-cloud deduplication of data across multiple sites or systems. Only unique data is replicated from any on-premise DVX to the cloud and back, and all data stays compressed from initial ingestion. With an average of 3:1 local data reduction and 1.5:1 global 'cloud' data reduction across systems or sites, customers can experience an average of 4.5:1 data reduction for huge cost savings. 
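The 4.5:1 figure follows from multiplying the two quoted ratios, since global reduction is applied to data that has already been reduced locally. The sketch below works through that arithmetic; the function names and the 90 TB logical capacity are illustrative assumptions, not figures from Datrium.

```python
def combined_reduction(local_ratio: float, global_ratio: float) -> float:
    """Combine two successive data-reduction ratios, each expressed as N:1.

    Successive reductions multiply: 3:1 followed by 1.5:1 gives 4.5:1.
    """
    return local_ratio * global_ratio


def stored_capacity(logical_tb: float, ratio: float) -> float:
    """Physical capacity needed to hold `logical_tb` at a given N:1 ratio."""
    return logical_tb / ratio


# Ratios quoted in the interview: 3:1 local, 1.5:1 global ("cloud").
ratio = combined_reduction(3.0, 1.5)
print(ratio)  # 4.5

# Hypothetical example: 90 TB of logical VM data (an assumed figure).
print(stored_capacity(90.0, ratio))  # 20.0
```

Under these averages, every 4.5 TB of logical data would occupy roughly 1 TB of billed cloud capacity; actual ratios depend on the workload.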

Finally, Cloud DVX supports Forever-Incrementals and End-to-End Encryption. The former means that after initial seeding of snapshot data to the cloud, all subsequent snapshots result in only the differences being transferred. The latter means that DVX natively encrypts all data in-transit to the public cloud, eliminating the need for expensive public cloud VPN services, which are charged by transferred data capacity and on a connection-hours basis.

VMblog:  What about RTO with Cloud DVX -- how does it compare with other AWS data protection approaches?

Nunes:  It turns out that Cloud DVX provides excellent RTO compared to HCI and Backup HCI approaches. First, Cloud DVX minimizes the amount of data transferred from public cloud to an on-premise DVX by sending only unique data that is not already available on-premise. The system further reduces the amount of data transferred by allowing customers to recover not just virtual machines but also individual virtual disks, datastore-files (OVAs, ISOs) or persistent container volumes. 
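The idea of sending only data that is not already present at the destination can be sketched with content fingerprints. This is a minimal illustration, not Datrium's implementation: the fixed 4 KB chunk size, the use of SHA-256, and all function names are assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, chosen only for illustration


def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and pair each with its SHA-256 digest."""
    result = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        result.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return result


def chunks_to_send(data: bytes, known_at_destination: set,
                   chunk_size: int = CHUNK_SIZE) -> dict:
    """Return only the chunks whose fingerprints the destination lacks."""
    missing = {}
    for fingerprint, chunk in chunk_fingerprints(data, chunk_size):
        if fingerprint not in known_at_destination:
            missing[fingerprint] = chunk  # duplicates collapse to one entry
    return missing


# Hypothetical snapshot: chunk "A" appears twice, chunk "B" once.
chunk_a = b"A" * CHUNK_SIZE
chunk_b = b"B" * CHUNK_SIZE
snapshot = chunk_a + chunk_b + chunk_a

# Assume the destination already holds chunk "A".
destination_index = {hashlib.sha256(chunk_a).hexdigest()}

to_send = chunks_to_send(snapshot, destination_index)
print(len(to_send))  # 1
```

Of the three chunks in the snapshot, only the one the destination has never seen crosses the network, which is what keeps both transfer time and bandwidth cost down during a restore.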

In addition, the two-tier direct-to-cloud model that is available with DVX and Cloud DVX means data is recovered from the cloud straight to the host, where it can be instantly utilized on the primary infrastructure. There is no staging and rehydrating on a backup system and no second-hop copy to hosts, so RTO is faster. Moreover, Datrium snapshots provide always-synthetic full copies, which means that any snapshot restore is instant and there is no need to first wait for incrementals to be applied, as is the case with HCI backup vendors.

VMblog:  Can you give us a sneak peek into what additional features we can expect in the future for Cloud DVX?

Nunes:  Sure, we have several new things planned for Cloud DVX in 2018, including support for single-file-restore and full cloud instantiation of DVX to support cloud DR as well as other cloud data management use cases like test/dev and analytics.  We are also excited about another development that is currently underway, and which will provide cloud-managed orchestration and correlation across public and on-premise DVX deployments. I will be able to share more details when we get closer to the product launch date.

VMblog:  What is your future vision for Datrium's product portfolio?

Nunes:  Our focus is on scalable data services that are capable of moving instance data efficiently between hosts, local persistence pools, other sites or clouds by policy. This is also our core differentiator. We are not building a hypervisor. We're not building a custom cloud. We're about simplifying customers' Tier 1 compute and storage infrastructure on-premises - including scalable policies for data protection, while fighting data gravity to pre-position data efficiently across multiple sites, whether they are on-premises or in the public cloud, so that data is where it needs to be, when the business needs it. 

VMblog:  Can you talk about the all-flash data node?  What type of customers are looking at deploying it, and what use cases can you share with us?

Nunes:  Until this point, our data node has been disk-based like any other secondary storage designed to minimize cost. Since read I/O happens in flash on compute nodes and write I/O is written to NVRAM in the data node, the DVX provides high performance in most cases, independent of whether the data node contains disk or flash. 

However, there are a few situations where a DVX with flash data nodes (called DVX with Flash End-to-End) might benefit users. The first is for the most write-intensive workloads such as multi-thousand seat VDI and IoT. The DVX with Flash E2E supports up to 16GB/s of write bandwidth, about twice the standard DVX. The second use case is where a customer might desire the additional Tier 1 features of the DVX with Flash E2E. The system maintains data availability and low I/O latency under a range of failure conditions, which is ideally suited for mission critical deployments such as Oracle RAC. Finally, for customers looking for a true all-flash data center that includes both primary and secondary storage deployments, this is the first VM-based system in the market to offer such an option.

VMblog:  What does your recent Oracle RAC qualification deliver for customers?

Nunes:  We have customers for whom we have consolidated their entire Tier 1 virtual infrastructure, including databases and data warehouses, off of arrays and onto a single DVX. The holdout has been Oracle RAC.  Our Oracle RAC qualification now allows our customers to complete their Tier 1 consolidation onto Datrium with the same failure protection model (N-1 server failure protection) as Oracle RAC provides. 

VMblog:  And finally, speaking of availability, Datrium recommends double failure protection with its products.  Why is that?

Nunes:  We not only provide double-failure protection standard with every DVX, we go a step further and recommend that every customer, even those deploying non-Datrium systems, seriously consider double failure protection. The reason is not that a second drive might fail before reconstruction completes, but that there is a good chance of discovering a Latent Sector Error (LSE) during reconstruction. LSEs, essentially sectors on disk drives that become unreadable, can happen for a variety of reasons including media imperfections, stray particles, and issues that arose when the content was written. We found that LSEs occur more frequently on older drives, and that drives experiencing LSEs are prone to recurring issues.

Our analysis shows that a system with monthly disk scrubbing would have a high data-loss probability of about 0.5% in one year, or about 2% over a four-year period. Less frequent scrubbing spikes the data-loss probability to over 18% in the fourth year.
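The one-year and four-year figures are consistent with simple compounding of an annual risk. The sketch below assumes a constant, independent annual data-loss probability, which is a simplification introduced here and not a description of Datrium's analysis; it does not attempt to reproduce the 18% figure for less frequent scrubbing.

```python
def cumulative_loss_probability(annual_p: float, years: int) -> float:
    """Probability of at least one data-loss event over `years`,
    assuming the same independent probability `annual_p` each year."""
    return 1.0 - (1.0 - annual_p) ** years


# About 0.5% per year, as quoted for monthly scrubbing.
p4 = cumulative_loss_probability(0.005, 4)
print(f"{p4:.2%}")  # 1.99%
```

Under that assumption, four years at 0.5% per year comes to just under 2%, matching the quoted four-year figure.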

Arrays have been available with RAID 6 standard for nearly a decade. Yet hyperconverged system vendors still recommend single failure tolerance over double failure tolerance, which is a ticking time bomb in any data center. They do so largely because double failure protection is incredibly expensive with traditional HCI architectures, given the necessary data protection overhead.


Published Tuesday, November 28, 2017 7:30 AM by David Marshall