0.036: data integrity and virtualized storage
Is your data really safe?
In what many think is a modern-day impression of Chicken Little, Robin Harris has been asking this question over on StorageMojo for quite a while. In his most recent blog post, he refuels his concerns using "evidence" presented in a Data Integrity research paper done by the folks at CERN.
I highly recommend you at least skim that document, as there are some interesting observations in it that could have far reaching ramifications in your own storage environment.
According to this paper, more than 3 of the MP3's or TiVo videos I have in my Terabyte Home are probably corrupted - and I might never know it!
Now Robin takes the 50,000-foot view of this, and comes to the conclusion that the world just may collapse soon if this data integrity issue isn't resolved. He even suggests that HEY! Shouldn't we be doing something NOW to avoid all this?
</sarcasm> (I leave it to the reader to figure out where the opening tag belongs)
Good news, Robin: some of us have already been solving this problem. Been doing so for years, in fact...
parity, checksums, and ecc domains - oh my!
The CERN folks did a lot of testing, and concluded that they were experiencing a Byte Error Rate of 3 in 10^7 - which, since the observed errors tended to span contiguous chunks rather than single bytes, works out to roughly 3 corruption incidents per Terabyte. That's a lot more than what you'd guess if you looked at the BER predictions for the individual components, and Robin is likely correct in his conclusion that the problem rests in all the transfers of the data between multiple (supposedly protected) domains.
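For the arithmetically inclined, here's a quick back-of-the-envelope version of that conversion. This is my own illustration, not a figure from the CERN paper - in particular the ~64 KB chunk size is just an assumption used to show how a raw byte error rate turns into a per-Terabyte incident count:

```python
# Back-of-the-envelope only: how a raw byte error rate turns into a
# per-Terabyte incident count.  The ~64 KB "chunk" size is my assumption
# for illustration, not a number taken from the CERN paper.

byte_error_rate = 3e-7          # roughly 3 bad bytes in every 10^7
terabyte        = 1e12          # bytes
chunk_size      = 64 * 1024     # assume corruption arrives in ~64 KB runs

corrupted_bytes = byte_error_rate * terabyte
incidents       = corrupted_bytes / chunk_size

print(f"corrupted bytes per TB:   {corrupted_bytes:,.0f}")
print(f"corruption events per TB: {incidents:,.1f}")    # a handful per Terabyte
```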
What seems to have been overlooked, however, is that the problem is likely caused by faulty architecture and design of the CERN storage infrastructure. Apparently no provisions had been made by either the storage vendor or the implementation architect to ensure that faults that occur between domains were detected and corrected.
But this is not universally true of all storage. A lot more than you think, maybe. But not all...
No, in fact, Symmetrix has long implemented front-to-back parity/checksum/ECC verification that data blocks don't change, from the moment they arrive at the FA port, through global memory, out the DA port, all the way out to the disk, and then back again. Checksums and/or parity are calculated coming into and validated going out of each internal "domain." ECC algorithms performing Single Nibble Correction and Double Nibble Detection (SNCDND) have long been used to protect Symmetrix memory against alpha-particle corruption of the memory chips themselves. Triple Modular Redundancy (TMR) was introduced in the DMX series to protect against Silent Bit Errors (SBEs) in the logic paths. And the data stored on disk is continually exercised and checked against its checksums to verify that it hasn't changed - and if it does change, the bad block is remapped, the block's mirror image is validated and restored, or the block is rebuilt from the RAID parity - all quietly behind the scenes.
More importantly, while each of the potential failure domains within the Symmetrix has always been protected by appropriate error detection & correction technology (typically using different algorithms for adjacent domains for maximum algorithmic integrity), the protection domains themselves always overlap the adjacent domains.
This is critically important, because you simply can't strip off the protection bits for a block until you know that its integrity was intact during the time the next domain's protection was being generated. And if an error is detected during the transfer across domains, the original can still be recovered from the prior domain and re-routed through a different path - or the I/O request can be rejected altogether before corrupt data is committed or returned (yes, Virginia, in most cases no data is better than bad data - especially if the alternative is silently bad data).
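If you want to picture how an overlapping hand-off works, here's a deliberately simplified sketch - mine, not actual Symmetrix microcode, and the checksum algorithms are arbitrary stand-ins. The prior domain's check bits are validated in the very same step that the next domain's protection is generated, and only then are the old bits dropped:

```python
import zlib

def crc32(data: bytes) -> int:
    """Protection used by the 'upstream' domain (illustrative choice)."""
    return zlib.crc32(data) & 0xFFFFFFFF

def fletcher16(data: bytes) -> int:
    """A *different* algorithm for the 'downstream' domain (illustrative choice)."""
    s1 = s2 = 0
    for b in data:
        s1 = (s1 + b) % 255
        s2 = (s2 + s1) % 255
    return (s2 << 8) | s1

class DomainHandoffError(Exception):
    """The block failed validation while crossing protection domains."""

def hand_off(block: bytes, upstream_check: int) -> tuple[bytes, int]:
    """Pass a block from one protection domain to the next.

    The upstream check bits are validated in the same step that the
    downstream protection is generated, so the old bits are never stripped
    before we know the block survived the transfer intact."""
    downstream_check = fletcher16(block)     # new domain's protection, generated first
    if crc32(block) != upstream_check:
        # Recover the original from the prior domain and retry via another
        # path, or fail the I/O outright -- no data is better than bad data.
        raise DomainHandoffError("corruption detected crossing domains")
    return block, downstream_check
```

Chain several of these hand-offs together - FA port to global memory to DA port to disk and back - and the block is covered by at least one set of check bits at every instant of its journey.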
So what's going on at CERN?
maybe they got what they paid for
As I read through the CERN paper, it struck me that the storage infrastructure being described probably was lacking this sort of attention to error detection and correction. All the talk about needing to ADD the overhead of checksums and integrity checks seems to imply that none existed before.
Now, this isn't necessarily surprising to ME, nor probably to anyone else who lives and breathes storage every day. But I suspect the CERN folks had never even considered the possibility that the storage devices they were using hadn't been designed to ensure the integrity of their data. Nor that they even needed to worry about it. In fact, they probably never connected the lower cost of the storage (as compared to, say, enterprise-class storage) to the higher potential for data corruption it might create. Whether due to budget constraints or just naiveté about the issue, they had been merrily doing their research without any concerns over the potential for data corruption.
And I am sure that they are not alone.
But the facts are pretty much as they've laid them out - unless you do something pro-active to detect and recover from Bit- or Byte-Errors, you will eventually find yourself working with corrupted data.
Now, when it happens to your laptop, you might not even notice it. Or your entire hard drive may have to be rebuilt because the error was so large. Then you'll moan, you'll swear to do backups religiously from now on, but eventually you'll probably accept that you really didn't lose anything life-threatening. I mean, if my copy of Chris Isaak's "Baby Did A Bad Bad Thing" suddenly turns up corrupted, I just copy it from one of my backups, or re-rip it from the original CD and move on.
It's not like I'm managing the global economy on my laptop or anything.
But I'm willing to bet that most storage admins have never even asked about the data integrity risks associated with their mid-tier storage devices, even as they deploy critical applications like particle physics. Or global arbitrage. Or air traffic control. Or genetic research.
Hey - wait a minute! Maybe they should be thinking about this more, before a bit error accidentally creates an unstoppable bio-virus, or something.
so what's this got to do with storage virtualization?
OK - here's the deal. Above I described (generally) the stuff that Symmetrix does to ensure that the data read from the array is in fact the same data that was written to it. I'm going to go out on a limb and assume that Hitachi's USP, USP-V and mini-me do similar things (although this sort of stuff is usually opaque unless you ask about it, so I don't really know for sure). And BarryW has asserted on his blog that the SVC has similar internal protections against corruption within its own domains.
But these virtualization engines can only assume that the storage behind them doesn't corrupt the data. They expect that if anything goes wrong, the storage device will fail the I/O back to the virtualization engine, which can then reissue the request with no data lost.
And if you virtualize a Symmetrix, that's exactly what should happen.
But what if the error occurs BETWEEN the virtualization device and the storage - a place where there is NO overlapping protection domain? Indeed, silent errors can be introduced and left undetected - the virtualizer sends "A", the storage sees "B" and dutifully stores it - and guarantees that you get "B" back.
Alas, the virtualization engine has no way to tell that the "B" it later gets back is wrong, and blindly sends it along to the host.
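Here's a toy model of that gap - my own sketch, not how any shipping product is actually coded. The virtualizer's protection ends at its own back door, so a flip on the way to the back-end array is invisible when the block comes back:

```python
import zlib

backend = {}    # stands in for the virtualized back-end array (purely illustrative)

def backend_write(lba: int, block: bytes) -> None:
    # Simulate a silent error on the path *between* the virtualizer and the
    # back-end: the array receives "B" where the virtualizer sent "A",
    # dutifully stores it, and will happily guarantee you get "B" back.
    backend[lba] = bytes([block[0] ^ 0x03]) + block[1:]

def backend_read(lba: int) -> bytes:
    return backend[lba]

host_block = b"A" * 512                     # what the host (and virtualizer) sent
backend_write(7, host_block)
returned = backend_read(7)

print(returned[:1])                          # b'B' -- not what was written
# The virtualizer's checks only span its own domain; nothing it holds covers
# the hop to the back-end, so the bad block sails straight through to the host.
# Only a check generated before the gap and verified after it would catch this:
print(zlib.crc32(returned) == zlib.crc32(host_block))   # False -- detectable
```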
Now - what if you were to put one of these "unprotected" storage devices behind a "protected" virtualization engine?
Scary thought now, isn't it? You take a device that has been observed to have a BER of 3 per Terabyte, and you add an additional layer of indirection. Even if the virtualization engine is fail-proof, you've increased the number of potential failure domains. So maybe your BER goes to 4/TB. Or maybe it goes to 1 per hundred gigabytes.
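The arithmetic is depressingly simple - with purely hypothetical numbers:

```python
# Purely hypothetical numbers -- the point is only that silent-error rates
# through unprotected, independent hops add up; they never cancel out.

per_tb_backend   = 3.0    # corruption incidents per TB on the back-end array
per_tb_extra_hop = 1.5    # made-up rate for the extra virtualizer<->array hop

print(f"combined: ~{per_tb_backend + per_tb_extra_hop} incidents per TB")
```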
And what happens if one of the data blocks that gets corrupted is part of a compressed file (hint: you probably can't read any more of the file). Or an encrypted dataset (hint: it might as well have been deleted). Or a block that's used by hundreds of different files and that has been de-duped down to a single instance (hint: EVERY one of those files is now corrupted).
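You can demonstrate the compression case for yourself in a few lines - a generic illustration, nothing to do with any particular array:

```python
import zlib

original = b"the quick brown fox jumps over the lazy dog " * 1000
compressed = bytearray(zlib.compress(original))

# Flip a single bit somewhere in the middle of the compressed stream --
# the analogue of one silently corrupted block on disk.
compressed[len(compressed) // 2] ^= 0x01

try:
    zlib.decompress(bytes(compressed))
except zlib.error as err:
    # One flipped bit and the stream either fails to inflate or fails its
    # integrity check -- either way, the whole file is effectively gone.
    print("decompression failed:", err)
```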
The problem is, you won't know about the corruption until it's too late - if ever. And we're not talking lost MP3's here.
end-to-end data integrity
So maybe this isn't just sensationalism on Robin's part (no, after his last attack, I didn't think I'd ever defend him either). Nor is this post just my latest weapon for the Colosseum battles with the Lions of Virtualization.
In fact, the issue is very real, and I offer as evidence the following technologies designed specifically to provide end-to-end data integrity validation: Oracle checksums, EMC Double Checksum and the T-10 DIF proposal. All of them intend to reliably deliver end-to-end protection of data from the generating application out to the storage (and back, of course).
It is somewhat disconcerting that such technologies are actually needed at all, but the fact is that errors are inevitable in virtually everything we do. And without error correction that overlaps domains to protect against and recover from errors, the only alternative is an end-to-end data integrity scheme like these. And in further support of Robin's lament: because these sorts of technologies are not yet ubiquitous, ALL of our data is operating at a higher risk than we probably believe, and we're probably not moving fast enough to address the situation.
BTW - for those that are interested, the EMC implementation of Double Checksum is somewhat unique: in cooperation with Oracle, the Symmetrix uses the Oracle checksum to validate the integrity of the data as it is received, and once it is verified, converts it to the standard internal protection used for all other data within the Symmetrix. This creates the overlapping protection domain without having to maintain the additional overhead of Oracle's checksums. On a read, the Oracle checksum is regenerated from within the Symmetrix' protection domain, and then presented back. And of course, both ends are sufficiently aware to recognize and notify the other of any discrepancies that are detected. There's a white paper that discusses this somewhere - oddly, the only publicly-accessible copy I can find isn't on EMC.com... it's here.
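In pseudo-code terms, the validate-then-convert idea looks something like this. To be clear, this is a sketch of the concept only - not EMC's or Oracle's actual implementation - and the checksum algorithms here are arbitrary stand-ins:

```python
import zlib

def app_checksum(block: bytes) -> int:
    """Stand-in for the application-level (Oracle-style) block checksum."""
    return zlib.crc32(block) & 0xFFFFFFFF

def array_checksum(block: bytes) -> int:
    """Stand-in for the array's own internal protection (a different algorithm)."""
    return sum(block) & 0xFFFF

class EndToEndError(Exception):
    pass

def ingest(block: bytes, checksum_from_app: int) -> tuple[bytes, int]:
    """On a write: validate the application's checksum as the block arrives,
    and only then convert to the array's internal protection -- the overlap
    means there is never a moment when the block is unprotected."""
    if app_checksum(block) != checksum_from_app:
        raise EndToEndError("corrupted on the way in -- reject the write")
    return block, array_checksum(block)

def read_back(block: bytes, internal_check: int) -> tuple[bytes, int]:
    """On a read: validate the internal protection, then regenerate the
    application checksum from inside the array's protection domain."""
    if array_checksum(block) != internal_check:
        raise EndToEndError("corrupted at rest -- recover from mirror/parity instead")
    return block, app_checksum(block)
```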
implications on tiered storage
I have to admit that I had originally intended to include this within part 3 of my (now very delayed) series on tiered storage, as one of the challenges. But I'm here, and there's no time like the present, as they say.
Simply put: I think this is a very good example of why customers should prefer to use an array optimized to support multiple tiers within-the-box instead of any sort of virtualization in front of lower tiers of storage (see part 2 of my tiered storage series for these definitions).
Simple fact is that the addition of the virtualization layer adds more places for data to be silently corrupted! Whether or not the external storage devices provide comprehensive error detection and recovery, operating that storage behind another device only increases your risk of undetectable data corruption. These virtualization engines can't compensate for the added exposure, and in fact they pretty much inarguably increase the probability of corruption for data stored on the external/virtualized storage.
And that's no matter what HHSNBN or even BarryW might say (sorry guys - facts is facts).
On the other hand, high-capacity SATA drives installed within a Symmetrix DMX-4 inherently and automatically benefit from the same comprehensive front-to-back data integrity protection that has long been applied to internal Fibre Channel storage (and to SCSI devices before the DMX, even). Data is checksummed and validated across overlapping domains every step of the way from the FA port to disk and back. The data on the drives is continually "scrubbed" to ensure its integrity, and regenerated from the mirror or parity when needed. And basically you get this added insurance at no incremental cost - just another benefit you get with Symmetrix that you might not be getting with your mid-tier storage array tucked in behind your fancy new USP-V.
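For completeness, the "scrubbing" concept itself is simple enough to sketch generically - the hooks for fetching a mirror copy and remapping a bad block are hypothetical placeholders here, not any vendor's real API:

```python
import zlib

def scrub(blocks, checksums, read_mirror, remap):
    """A generic background-scrub loop: re-read each block, re-verify its
    stored checksum, and repair from the redundant copy on a mismatch.

    read_mirror(lba) and remap(lba, data) are hypothetical hooks standing in
    for "fetch the mirror / rebuild from RAID parity" and "rewrite the block
    to a good location"."""
    for lba, block in enumerate(blocks):
        if (zlib.crc32(block) & 0xFFFFFFFF) == checksums[lba]:
            continue                      # still intact; move along
        good = read_mirror(lba)           # recover from the redundant copy
        if (zlib.crc32(good) & 0xFFFFFFFF) != checksums[lba]:
            raise RuntimeError(f"LBA {lba}: no valid copy remains")
        blocks[lba] = good
        remap(lba, good)                  # quietly fixed behind the scenes
```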
Go figure - you can avoid the cost and complexity of all those separate storage arrays for your lower tiers, skip all the extra ports on both ends to connect them to the virtualization engine, and ignore the licensing fees for the virtualization software itself - and improve your data integrity at the same time.