Written by Björn Kolbeck, co-founder and CEO of Quobyte
Random bit flips are far more common than most people, even IT professionals, think. Surprisingly, the problem isn't widely discussed, even though it silently causes data corruption that can directly impact our jobs, our businesses, and our security. It's unsettling to know that these corruptions happen in the memory of our computers and servers - before the data even reaches the network and storage portions of the stack. Google's in-depth study of bit-level DRAM errors showed that uncorrectable memory errors are a fact of life. And do you remember the time Amazon had to globally reboot its entire S3 service because of a single bit error?
## The Error-Prone Data Trail
Let's assume for a moment that your data survives its many passes through a system's DRAM and emerges intact. That data must then be safely transported over a network to the storage system, where it is written to disk. How do you ensure the data remains unaltered along the way? Well, if you're using one of the storage protocols that lack end-to-end checksums (e.g., NFSv2, NFSv3, SMBv2), your data remains susceptible to random bit flips and data corruption. Even NFSv4 plus Kerberos 5 with integrity checking (krb5i) doesn't offer true end-to-end checksums: once the data is extracted from the RPC, it is unprotected again. On top of that, NFSv4 hasn't seen widespread adoption, and even fewer deployments use krb5i.
Over a decade ago, the folks at CERN urged that "checksum mechanisms (...) be implemented and deployed everywhere." That appeal is only more urgent today, given the storage sizes and daily data transfer rates we're dealing with. Data corruption can no longer be dismissed as a merely "theoretical" issue. And if you think modern applications protect against this problem, I've got bad news for you: in 2017, researchers at the University of Wisconsin uncovered serious problems in several storage systems when they introduced bit errors into well-known and widely used applications.
## Checksums Came at a Cost That's Worth Paying Today
When NFS was designed, file writes and overall data volumes were relatively small, and checksum computations were expensive. Hence, the decision to rely on TCP checksums for data protection seemed reasonable. Unfortunately, these checksums proved too weak, especially when transferring more than 64 KB per packet - which easily happens when you move gigabytes per second. What about Ethernet checksums, you ask? They are indeed stronger. However, they don't provide end-to-end protection, and the opportunities for data corruption are manifold: cut-through switches that don't recompute checksums and kernel drivers for NICs are just two examples of where things can go horribly wrong.
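To make the weakness concrete, here is a small, self-contained sketch in Go (the data and function names are illustrative, not from any particular implementation). The Internet checksum used by TCP is a 16-bit ones'-complement sum, so it is order-independent: swapping two 16-bit words corrupts the payload, yet the checksum stays exactly the same and the error goes undetected.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// internetChecksum computes the 16-bit ones'-complement checksum
// used by TCP, UDP, and IPv4 headers.
func internetChecksum(data []byte) uint16 {
	var sum uint32
	for len(data) >= 2 {
		sum += uint32(binary.BigEndian.Uint16(data))
		data = data[2:]
	}
	if len(data) == 1 {
		sum += uint32(data[0]) << 8 // pad the trailing odd byte
	}
	for sum>>16 != 0 {
		sum = (sum & 0xffff) + (sum >> 16) // fold the carries
	}
	return ^uint16(sum)
}

func main() {
	original := []byte("AAAABBBBCCCCDDDD")

	// Swap two 16-bit words: the payload changes, but because the
	// ones'-complement sum ignores word order, the checksum does not.
	swapped := append([]byte(nil), original...)
	swapped[0], swapped[1], swapped[4], swapped[5] =
		swapped[4], swapped[5], swapped[0], swapped[1]

	fmt.Printf("original: %q checksum=0x%04x\n", original, internetChecksum(original))
	fmt.Printf("swapped:  %q checksum=0x%04x\n", swapped, internetChecksum(swapped))
}
```

Both lines print the same checksum even though the bytes differ - one of several blind spots that become statistically relevant once you push terabytes per day through the network.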
## Checksums and the End of Silent Data Corruption
Experts have seen this kind of silent data corruption happen, even in mid-sized installations. In one instance, enterprise administrators learned that their data was being corrupted in transit. At that point, they began investigating the network stack. It turned out to be a driver issue: a kernel update had broken the TCP offload feature of their NICs. Tracking down the problem was both difficult and time-consuming.
That's where end-to-end checksums come in. In one such design, each block (usually 4 KB, though this can be adjusted in the volume configuration) is checksummed as soon as the system receives the data from the operating system. Because this checksum stays with the data block forever, the data is protected - even against software bugs - as it travels through the software stack. The checksum is validated along the path and throughout the life of the data, even at rest when the data isn't being accessed (via periodic disk scrubbing). All of this is possible because dated legacy protocols like NFS aren't relied on; instead, an RPC protocol is used in which each data block, and the message itself, is checksum-protected. And since modern CPUs have built-in CRC32 instructions, there is no longer a performance penalty for using CRCs.
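As a rough illustration of that design - a sketch, not any vendor's actual implementation - the Go snippet below checksums incoming data in 4 KB blocks with CRC32C (the Castagnoli polynomial, which modern x86-64 CPUs accelerate in hardware), keeps the checksum attached to each block, and re-verifies it whenever the block is read or scrubbed. The `Block`, `ingest`, and `verify` names are made up for the example.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
)

const blockSize = 4096 // 4 KB blocks, as in the example above

// CRC32C (Castagnoli) is hardware-accelerated on modern CPUs, which is
// why per-block checksumming no longer carries a performance penalty.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// Block pairs a chunk of data with the checksum that travels with it
// through the entire stack: over the wire, through the write path, onto disk.
type Block struct {
	Data     []byte
	Checksum uint32
}

// ingest checksums data block by block as soon as it enters the system.
func ingest(data []byte) []Block {
	var blocks []Block
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		chunk := data[off:end]
		blocks = append(blocks, Block{
			Data:     append([]byte(nil), chunk...),
			Checksum: crc32.Checksum(chunk, castagnoli),
		})
	}
	return blocks
}

// verify recomputes the CRC and compares it with the stored one. The same
// check runs on every hop and during periodic disk scrubbing.
func verify(b Block) bool {
	return crc32.Checksum(b.Data, castagnoli) == b.Checksum
}

func main() {
	blocks := ingest(bytes.Repeat([]byte("quobyte "), 1024)) // ~8 KB of data

	// Simulate a silent single-bit flip somewhere in the stack.
	blocks[1].Data[42] ^= 0x04

	for i, b := range blocks {
		fmt.Printf("block %d: ok=%v\n", i, verify(b))
	}
}
```

The flipped bit is caught the moment the block is verified, long before it can propagate to disk or to another replica.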
## About the Author
Björn Kolbeck is the co-founder and CEO of Quobyte. Before taking the helm at Quobyte, Björn spent time at Google as tech lead for the hotel finder project (2011-2013) and was the lead developer of the open-source file system XtreemFS (2006-2011). His PhD thesis dealt with fault-tolerant replication.