
September 19, 2007

0.036: data integrity and virtualized storage

Is your data really safe?

In what many think is a modern-day impression of Chicken Little, Robin Harris has been asking this question over on StorageMojo for quite a while. In his most recent blog post, he refuels his concerns using "evidence" presented in a Data Integrity research paper done by the folks at CERN.

I highly recommend you at least skim that document, as there are some interesting observations in it that could have far reaching ramifications in your own storage environment.

According to this paper, more than 3 of the MP3s or TiVo videos I have in my Terabyte Home are probably corrupted - and I might never know it!

Now Robin takes the 50,000 foot view of this, and comes to the conclusion that the world just may collapse soon if this data integrity issue isn't resolved. He even suggests that HEY! Shouldn't we be doing something NOW to avoid all this?

</sarcasm> (I leave it to the reader to figure out where the opening tag belongs ;-) )

Good news, Robin: some of us have already been solving this problem. Been doing so for years, in fact...  
 

parity, checksums, and ecc domains - oh my!

The CERN folks did a lot of testing, and concluded that they were experiencing a Byte Error Rate of 3 in 10^7 - or 3 corruptions per Terabyte. That's a lot more than what you'd guess if you looked at the BER predictions for the components, and Robin is likely correct in his conclusion that the problem rests in all the transfers of the data between multiple (supposedly protected) domains.
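To put that gap in perspective, here's a quick back-of-the-envelope comparison in Python. Note that the 1-in-10^15 figure is just my assumption of a typical enterprise-drive UBER spec of the era, not a number from the CERN paper:

```python
# Back-of-the-envelope only: compare an assumed per-component spec with the
# observed rate cited above. The UBER value is my assumption, not CERN's data.

TB_BITS = 1e12 * 8        # bits in one (decimal) terabyte
ASSUMED_UBER = 1e-15      # assumed unrecoverable errors per bit read (drive spec)

expected_per_tb = TB_BITS * ASSUMED_UBER   # what the component spec alone predicts
observed_per_tb = 3.0                      # the rate cited above

print(f"predicted from the drive spec: {expected_per_tb:.3f} errors per TB")
print(f"observed at CERN:              {observed_per_tb:.1f} errors per TB")
print(f"roughly {observed_per_tb / expected_per_tb:.0f}x worse than the spec predicts")
```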

What seems to have been overlooked, however, is that the problem is likely caused by faulty architecture and design of the CERN storage infrastructure. Apparently no provisions had been made by either the storage vendor or the implementation architect to ensure that faults occurring between domains would be detected and corrected.

But this isn't true of all storage. It may be true of a lot more of it than you'd think - but not all...

No, in fact, Symmetrix has long implemented front-to-back parity/checksum/ECC verification that data blocks don't change, from the moment they arrive at the FA port, through global memory, out the DA port, all the way out to the disk, and then back again. Checksums and/or parity are calculated coming into and validated going out of each internal "domain." ECC algorithms performing Single Nibble Correction and Double Nibble Detection (SNCDND) have long been used to protect Symmetrix memory against alpha-particle corruption of the memory chips themselves. Triple Modular Redundancy (TMR) was introduced in the DMX series to protect against Silent Bit Errors (SBEs) in the logic paths. And the data stored on disk is continually exercised and validated against checksums to ensure it hasn't changed - and if it does change, the bad block is remapped, the block's mirror image is validated and restored, or the block is rebuilt from the RAID parity - all quietly behind the scenes.
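For the curious, here's a minimal sketch of what that kind of background scrubbing looks like in principle - my own illustration in Python, not actual Symmetrix code or data structures:

```python
import zlib

def scrub_pass(blocks, mirror):
    """Illustrative scrub loop: validate each block against its stored checksum
    and repair from the mirrored copy whenever a mismatch is found."""
    repaired = 0
    for lba, (data, stored_crc) in enumerate(blocks):
        if zlib.crc32(data) != stored_crc:            # silent corruption detected
            good_data, good_crc = mirror[lba]          # fetch the mirror's copy
            assert zlib.crc32(good_data) == good_crc   # the mirror must itself verify
            blocks[lba] = (good_data, good_crc)        # rewrite/remap the bad block
            repaired += 1
    return repaired
```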

More importantly, while each of the potential failure domains within the Symmetrix has always been protected by appropriate error detection & correction technology (typically using different algorithms for adjacent domains for maximum algorithmic integrity), the protection domains themselves always overlap the adjacent domains.

This is critically important, because you simply can't strip off the protection bits for a block until you know that its integrity was intact while the next domain's protection was being generated. And if an error is detected during the transfer across domains, the original can still be recovered from the prior domain and re-routed through a different path - or the I/O request can be rejected altogether before corrupt data is committed or returned (yes, Virginia, in most cases no data is better than bad data - especially if the alternative is silently bad data).
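A toy illustration of why that overlap matters (my own sketch, obviously not Symmetrix internals): the outgoing domain's check is generated, and the incoming one validated, before the old protection is ever dropped.

```python
import zlib

def hand_off(block, incoming_crc, recover_from_prior_domain):
    """Toy hand-off between two protection domains."""
    if zlib.crc32(block) != incoming_crc:
        # Corruption caught at the boundary: recover via the prior domain
        # (or fail the I/O) rather than silently passing bad data along.
        block = recover_from_prior_domain()
    outgoing_crc = zlib.crc32(block)   # next domain's protection is generated...
    return block, outgoing_crc         # ...before the old protection is discarded
```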

So what's going on at CERN?
 

maybe they got what they paid for

As I read through the CERN paper, it struck me that the storage infrastructure being described probably lacked this sort of attention to error detection and correction. All the talk about needing to ADD the overhead of checksums and integrity checks seems to imply that none existed before.

Now, this isn't necessarily surprising to ME, nor probably to anyone else who lives and breathes storage every day. But I suspect the CERN folks had never even considered the possibility that the storage devices they were using hadn't been designed to ensure the integrity of their data - or that they even needed to worry about it. In fact, they probably never connected the lower cost of the storage (as compared to, say, enterprise-class storage) to the higher potential for data corruption that it might create. Whether due to budget constraints or just naïveté about the issue, they had been merrily doing their research without any concern over the potential for data corruption.

And I am sure that they are not alone.

But the facts are pretty much as they've laid them out - unless you do something pro-active to detect and recover from Bit- or Byte-Errors, you will eventually find yourself working with corrupted data.

Now, when it happens to your laptop, you might not even notice it. Or your entire hard drive may have to be rebuilt because the error was so large. Then you'll moan, you'll swear to do backups religiously from now on, but eventually you'll probably accept that you really didn't lose anything life-threatening. I mean, if my copy of Chris Isaak's "Baby Did A Bad Bad Thing" suddenly turns up corrupted, I just copy it back from one of my backups, or re-rip it from the original CD and move on.

It's not like I'm managing the global economy on my laptop or anything.

But I'm willing to bet that most storage admins have never even asked about the data integrity risks associated with their mid-tier storage devices, even as they deploy critical applications on them - like particle physics. Or global arbitrage. Or air traffic control. Or genetic research.

Hey - wait a minute! Maybe they should be thinking about this more, before a bit error accidentally creates an unstoppable bio-virus, or something.
 

so what's this got to do with storage virtualization?

OK - here's the deal. Above I described (generally) the stuff that Symmetrix does to ensure that the data read from the array is in fact the same data that was written to it. I'm going to go out on a limb and assume that Hitachi's USP, USP-V and mini-me do similar things (although this sort of stuff is usually opaque unless you ask about it, so I don't really know for sure). And BarryW has asserted on his blog that the SVC has similar internal protections against corruptions within its domains.

But these virtualization engines can only assume that the storage behind them doesn't corrupt the data. They expect that if anything goes wrong, the storage device will fail the I/O back to the virtualization engine, which can then retry the request without any loss.

And if you virtualize a Symmetrix, that's exactly what should happen.

But what if the error occurs BETWEEN the virtualization device and the storage - a place where there is NO overlapping protection domain? Indeed, silent errors can be introduced and left undetected - the virtualizer sends "A", the storage sees "B" and dutifully stores it - and guarantees that you get "B" back.

Alas, the virtualization engine has no way to tell that the "B" it later gets back is wrong, and blindly sends it along to the host.
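Here's a toy model of that gap in Python (illustrative only): because the back-end checksums whatever it receives, the corrupted block is stored with a perfectly valid checksum and reads back "clean."

```python
import zlib

def wire(block):
    return b"B" + block[1:]            # a silent flip between the two boxes

def backend_store(block, disk):
    disk["data"] = block
    disk["crc"] = zlib.crc32(block)    # the back end checksums what it *received*

def backend_read(disk):
    assert zlib.crc32(disk["data"]) == disk["crc"]   # passes - it faithfully kept "B"
    return disk["data"]

disk = {}
backend_store(wire(b"A" + bytes(511)), disk)   # the virtualizer sent "A"...
print(backend_read(disk)[:1])                  # b'B' - corrupt, and nobody can tell
```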

Now - what if you were to put one of these "unprotected" storage devices behind a "protected" virtualization engine?

Scary thought now, isn't it? You take a device that has been observed to have a BER of 3 per Terabyte, and you add an additional layer of indirection. Even if the virtualization engine is fail-proof, you've increased the number of potential failure domains. So maybe your BER goes to 4/TB. Or maybe it goes to 1 per hundred gigabytes.

And what happens if one of the blocks that gets corrupted is part of a compressed file (hint: you probably can't read any more of the file). Or an encrypted dataset (hint: it might as well have been deleted). Or a block that's used by hundreds of different files and has been de-duped down to a single instance (hint: EVERY one of those files is now corrupted).
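The de-dupe case is easy to see with a toy example (names and counts invented purely for illustration):

```python
# One deduplicated block shared by 300 files; one silent flip corrupts them all.
store = {"blk-42": b"shared contents"}                   # single physical instance
files = {f"file{i}": ["blk-42"] for i in range(300)}     # 300 files reference it

store["blk-42"] = b"shared c0ntents"                     # a single silent bit flip

damaged = [name for name, blocks in files.items()
           if any(store[b] != b"shared contents" for b in blocks)]
print(len(damaged))   # 300 - every referencing file is now corrupt
```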

The problem is, you won't know about the corruption until it's too late - if ever. And we're not talking lost MP3's here.
 

end-to-end data integrity

So maybe this isn't just sensationalism on Robin's part (no, after his last attack, I didn't think I'd ever defend him either). Nor is this post just my latest weapon for the Colosseum battles with the Lions of Virtualization.

In fact, the issue is very real, and I offer as evidence the following technologies designed specifically to provide end-to-end data integrity validation: Oracle checksum, EMC Double Checksum and the T-10 DIF proposal. All of them intend to reliably deliver end-to-end protection of data from the generating application out to the storage (and back, of course).

It is somewhat disconcerting that such technologies are needed at all, but the fact is that errors are inevitable in virtually everything we do. And without error detection and correction that overlaps domains, the only alternative is an end-to-end data integrity scheme like these. And in further support of Robin's lament: because these sorts of technologies are not yet ubiquitous, ALL of our data is operating at a higher risk than we probably believe, and we're probably not moving fast enough to address the situation.

BTW - for those that are interested, the EMC implementation of Double Checksum is somewhat unique: in cooperation with Oracle, the Symmetrix uses the Oracle checksum to validate the integrity of the data it receives and, once that's verified, converts it to the standard internal protection used for all other data within the Symmetrix. This creates the overlapping protection domain without having to maintain the additional overhead of Oracle's checksums. On a read, the Oracle checksum is regenerated from within the Symmetrix' protection domain, and then presented back. And of course, both ends are sufficiently aware to recognize and notify the other of any discrepancies that are detected. There's a white paper that discusses this somewhere - oddly, the only publicly-accessible copy I can find isn't on EMC.com...it's here.
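For the conceptually minded, here's a rough sketch of that handoff in Python. To be clear, these are my own made-up function names and a gross simplification - not EMC's or Oracle's actual implementation:

```python
import zlib

def internal_protect(payload):
    # Stand-in for whatever the array uses internally (parity/ECC/checksum).
    return zlib.crc32(payload, 0xFFFF)

def app_write(payload):
    """The application (e.g. the database) attaches its own end-to-end checksum."""
    return payload, zlib.crc32(payload)

def array_ingest(payload, app_crc):
    """Validate the application checksum on arrival, then convert to the array's
    internal protection so the app checksum needn't be carried around."""
    if zlib.crc32(payload) != app_crc:
        raise IOError("corruption detected on ingest - reject before committing")
    return payload, internal_protect(payload)

def array_read(payload, internal_crc):
    """Validate internal protection, then regenerate the app checksum from
    inside the array's protection domain before handing the block back."""
    if internal_protect(payload) != internal_crc:
        raise IOError("corruption detected on read")
    return payload, zlib.crc32(payload)

blk, crc = app_write(b"some database block")
stored, iprot = array_ingest(blk, crc)
print(array_read(stored, iprot)[1] == crc)   # True - the check survives the round trip
```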
 

implications on tiered storage

I have to admit that I had originally intended to include this within part 3 of my (now very delayed) series on tiered storage, as one of the challenges. But I'm here, and there's no time like the present, as they say.

Simply put: I think this is a very good example of why customers should prefer to use an array optimized to support multiple tiers within-the-box instead of any sort of virtualization in front of lower tiers of storage (see part 2 of my tiered storage series for these definitions).

The simple fact is that the addition of the virtualization layer adds more places for data to be silently corrupted! Whether or not the external storage devices provide comprehensive error detection and recovery, operating that storage behind another device only increases your risk of undetectable data corruption. These virtualization engines can't compensate for the increased probability of bit errors, and in fact they pretty much inarguably increase the probability of corruption for data stored on the external/virtualized storage.

And that's no matter what HHSNBN or even BarryW might say (sorry guys - facts is facts).

On the other hand, high-capacity SATA drives installed within a Symmetrix DMX-4 will inherently and automatically benefit from the same comprehensive front-to-back data integrity protection that has long been applied to internal Fibre Channel storage (and to SCSI devices before the DMX, even). Data is checksummed and validated across overlapping domains every step of the way from the FA port to disk and back. The data on the drives is continually "scrubbed" to ensure its integrity, and regenerated from the mirror or parity when needed. And basically you get this added insurance at no incremental cost - just another benefit you get with Symmetrix that you might not be getting from the mid-tier storage array tucked in behind your fancy new USP-V.

Go figure - you can avoid the cost and complexity of all those separate storage arrays for your lower tiers, skip all the extra ports on both ends to connect them to the virtualization engine, and ignore the licensing fees for the virtualization software itself - and improve your data integrity at the same time.

brilliant!



Comments


Matt

OK, so this begs the question...

What about Invista? I assume there is nothing "magic" in that environment that prevents silent corruption of virtualized disk....

the storage anarchist

Well, now that you mention it...

Invista doesn't terminate and regenerate the I/O requests - it merely instructs the SAN Director to retarget the I/O to a different location - on the fly, in real time, and without any of the overhead inflicted by UVM or SVC. Since there is no cache, every I/O is effectively a cache miss, executed at FC Director I/O latencies. As such, the data block doesn't pass through any cache memory on the way to its destination - it is simply retargeted by the director on the fly by rewriting the FC destination headers.

So arguably, since the data traverses the SAN in the exact same manner as it would on the way to the intended target storage/host, Invista adds (virtually zero)/(the least incremental) risk of data corruption. All the other implementations effectively act in a store-and-forward mode, copying the data blocks en route to the destination, which is where the added risk of corruption is introduced.
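If it helps, here's a toy contrast of the two models - entirely my own illustration, not Invista, UVM or SVC code:

```python
def store_and_forward(frame, engine_cache, new_target):
    engine_cache.append(bytes(frame["payload"]))   # the data is copied into the engine
    frame["payload"] = engine_cache[-1]            # ...and re-emitted from that copy,
    frame["dst"] = new_target                      # an extra place for it to be damaged
    return frame

def header_retarget(frame, new_target):
    frame["dst"] = new_target                      # only the destination header changes;
    return frame                                   # the payload bytes are never copied
```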

And I'm absolutely sure that BarryW will correct me here (and equally positive that neither HHSNBN nor Claus will chime in).

But hey - remember, I'm just the Symmetrix dude. What do I know about Invista ;^?

Snig

"the virtualizer sends "A", the storage sees "B" and dutifully stores it"

So how would the data get corrupted being sent from one device and received by another? You're saying that a switch could corrupt the SCSI string that is encapsulated in Fibre Channel?

Nigel Poulton

Barry,

Cracking open the FC packet and still maintaining FC Director latencies... I suppose it depends what you mean by FC Director latencies.

Does the DMX4 do any kind of read-after-write verification for data destined for SATA similar to what the HDS AMSXXX performs? The AMS range of HDS storage performs a verify-after-write for all data written to SATA disk. This impacts write performance. Or do you feel that the protections built into the DMX negate the requirement for that?

the storage anarchist

Snig - I'm not saying the probability is large, but I don't think anyone can guarantee that undetected errors never slip through the FC protocol either. And any time you add another thing into the data path, you increase the risk of failure - even a single component like an FC transceiver chip can introduce undetectable bit errors. Shouldn't happen, but then, the CERN guys thought their data was safe too (and you'll note they could only attribute part of their corruption to bad microcode).

This is in fact why T-10 DIF, Oracle Checksum and ZFS's CRC exist - to ensure that the relatively lightweight ECC/CRC/Checksum/Parity protections are wrapped in a higher level end-to-end integrity check.

NigelP - actually, a Cisco director can retarget packets within the native switching time; the overhead is effectively nil in the context of the inherent switching latency.

And no, neither Symm nor CLARiiON perform read-after-write on SATA drives; both platforms use CRC/ECC/Checksum techniques to validate data integrity on the drives, same as for FC. With literally hundreds of thousands of SATA drives deployed on CLARiiON, we've never observed the necessity of adding what you correctly observe is a tremendous write penalty.

In fact, we've often chuckled about Hitachi's strategy - it's probably part of the reason why they continue to refuse to add SATA support to the USP-V/VM.

Tony Rodriguez

Barry,
Nice post. As the inventor of the idea of doing an end-to-end checksum for Oracle, I thought your readers might be interested in how the idea came about...

Back in 1999, eBay was the darling of the dot-com era, and they suffered an outage that lasted 22 hours and made front-page news. Teams of engineers from Sun, Veritas, and Oracle were sent to eBay to try to get them back up. The crash was caused by bad blocks found in the primary Oracle database (Oracle always does a consistency check when it reads a block). The root cause was eventually tracked to a bug in Solaris that randomly scribbled over blocks in the buffer cache (after Oracle had done the write, but before the transaction commit).

As I studied the post-mortem analysis, I tried to think of ways we could have prevented this with Symmetrix (eBay was not using EMC storage at the time.) I was familiar with Oracle and knew they had the capability to generate checksums for each data block, so I took the idea of an end-to-end consistency check from networking (and a classic paper written by my old MIT profs: http://mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf)
I got together with the Oracle architects to work out the details and thus was born the end-to-end Oracle checksum.

Barry Whyte

What can I say. You have a point. A very small one, but it's true: unless you have end-to-end checking, you are at the mercy of a rogue bit of kit corrupting bits. However (there's always a however with me), it's one of the things we jump up and down about. The SVC team evolved from the UK storage group (the Winchester drive wasn't called that by chance), and data miscompares are a no-no. The SVC software is studiously tested (almost over-tested IMHO - Microsoft could take note). So yes, there is another box in the way, but I'd be much more worried about the reliability and availability of an FC-AL loop than about the virtualization hardware device.

the storage anarchist

BarryW - my point wasn't so much that the SVC itself might corrupt data, but that the bit of FC-AL loop inserted between your kit and the storage is an incremental opportunity for silent-but-deadly rogue bits. AND that while the chance of added damage is small, there's not much that the SVC kit can do to detect or correct errors inserted by the storage kit (unless you generate, add and validate another layer of ECC as the data passes through, that is). Somehow I don't think the SVC does this today, at least not given the 60 microsecond overhead you claim for a read miss pass-through...

brutus

Nice post indeed. You guys are very good at theoretical stuff. But practice really contradicts you.
A little bit of history: back in 2007 we experienced severe data corruption on a DMX2000 box due to a memory cache board failure. You are going to tell me that this is a very rare thing to happen, but still: what about the "overlapping protection domains"?
The EMC recommendation was to buy a DMX3.
This year we have also experienced data corruption with Invista (a software bug, fixed in release 2.2).
I'm not very sure that buying additional software for checksumming will ever guarantee corruption-free data. After all, EMC will never go to court over data corruption at customer premises.
In both cases, EMC support was not able to offer an impact analysis or a resolution in a decent time (and I mean days).

