3.011: hot air reclamation
As I said in a prior post, sometimes we in the storage industry misbehave.
And other times, we spew fish stories – the kind that would make Pinocchio’s nose grow a couple of feet instantly.
The latest fish tale to be exaggerated beyond all sense of reality is the Unused Space Reclamation geyser, and to hear it told is to be convinced that the world of underutilized storage hath been all but eliminated at long last by the ingenuity and design of a unique new magic trick that allows host software to tell storage systems they are no longer in need of certain blocks within a LUN.
Now, don’t get me wrong – this feature is extremely valuable and it will undoubtedly help us all to improve storage utilization and efficiency. But I’ve seen practically every vendor who is shipping support for this feature today practically claiming to have invented it, that it’s a key differentiating feature for their platforms, and that THEY are the ones driving the hypervisor, host operating system, file system, database and volume manager vendors to implement this new feature.
Reality Check Time
Folks, the fact is that the T10 SBC-3 committee has stabilized the RFCs for the two (yes 2) new SCSI commands that underpin all this hoopla. With stable RFCs, vendors are now able to implement one or both of these new operations without concern that the API is going to change (again). And these standards have been under development for over a year, with representation and comment from practically every vendor in the list I scribed above – as with most standards, it has been a communal effort.
Somehow, the early adopters see no need to explain these facts to their audiences, encouraging them instead to think that each vendor alone has mastered alchemy to turn deleted files into reusable space.
Alchemy, indeed…
UNMAP(), WRITE_SAME() and TRIM()
Fact is, these two new SCSI operations (UNMAP() and the new Unmap flag for the WRITE_SAME() command) follow by more than a year the xATA equivalent TRIM() command that was rushed through standardization to support flash drives. These space reclamation (actually, space release) APIs allow the host operating software to tell the target device (whether a real disk drive, an SSD or an array-based LUN presentation) that a range of blocks is no longer needed. This could be because the volume manager has truncated (shortened) the device, or because the file system has deleted files (and emptied the trash), or because the hypervisor wants to reinitialize an extent so it can be reassigned to a new virtual machine.
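To make "tell the target that a range of blocks is no longer needed" a little more concrete, here is a minimal sketch of how a host might assemble an UNMAP request. The field offsets follow my reading of the published SBC-3 layout (10-byte CDB, 8-byte parameter list header, 16-byte block descriptors) and should be checked against the standard before anyone relies on them. The function name build_unmap is hypothetical, and nothing here actually sends the command to a device.

```python
import struct

UNMAP_OPCODE = 0x42  # UNMAP operation code, per my reading of SBC-3


def build_unmap(ranges):
    """Return (cdb, parameter_list) for a list of (lba, block_count) ranges.

    Hypothetical helper for illustration only; it builds the bytes but
    does not issue them to any device.
    """
    # Each block descriptor: 8-byte LBA, 4-byte block count, 4 reserved bytes.
    descriptors = b"".join(
        struct.pack(">QI4x", lba, count) for lba, count in ranges
    )
    # Header: length of everything after the first field (2 bytes),
    # length of the descriptors (2 bytes), then 4 reserved bytes.
    header = struct.pack(">HH4x", len(descriptors) + 6, len(descriptors))
    params = header + descriptors
    # CDB: opcode, flags byte, 4 reserved bytes, group number,
    # parameter list length (2 bytes), control byte.
    cdb = struct.pack(">BB4xBHB", UNMAP_OPCODE, 0, 0, len(params), 0)
    return cdb, params


if __name__ == "__main__":
    cdb, params = build_unmap([(2048, 1024)])
    print(cdb.hex(), params.hex())
```

The point of the sketch is how little travels over the wire: one descriptor of 16 bytes can release an arbitrarily large run of blocks, and several ranges can ride in a single command.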
On the device/array side, these reclaim commands are typically interpreted to mean that the physical blocks of storage assigned to the “released” LBAs can be put back into the unused pool. For thin provisioned devices, this means that capacity is made available for other volumes that share the pool, helping to improve utilization efficiency. In the case of NAND-based SSDs, the device can reassign the now unused blocks to its own internal free pool, perhaps erasing them and putting them on the write queue to help optimize performance.
Pre-erase is a strategy employed by many SSDs to improve performance, because the erase cycle for NAND typically takes 2-4x longer than the write operation. Pre-erased blocks can thus be written faster than overwriting a block that contains (old) data.
Prior to thin provisioning and SSDs, there was really no reason for this information to be passed to the storage – the capacity was preallocated. File systems kept track of what was deleted and reassigned that space to new files according to each implementation’s strategy. With the introduction of Thin Provisioning, folks soon came to realize that “zero space reclaim” was insufficient to truly optimize utilization. This is because with most file systems, deleting files does not actually free up any space in a “thin” device – even when you “empty the trash.” For performance reasons, file systems simply don’t overwrite deleted files with zeros, so you will frequently find file systems that report that they are only 30% full (as an example) while the thin device itself reports far more space has been allocated.
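The mismatch between what the file system reports and what the thin device has allocated can be shown with a toy model. This is only a sketch under simplifying assumptions: fixed-size extents, a single device on the pool, and made-up class names (ThinPool, ThinDevice, ToyFileSystem) that do not correspond to any real product or API.

```python
class ThinPool:
    """Shared pool of fixed-size extents backing thin devices (toy model)."""

    def __init__(self, total_extents):
        self.free = total_extents


class ThinDevice:
    """Thin device that takes an extent from the pool on first write."""

    def __init__(self, pool, size_extents):
        self.pool = pool
        self.size = size_extents
        self.allocated = set()  # extents currently backed by pool capacity

    def write(self, extent):
        if extent not in self.allocated:
            if self.pool.free == 0:
                raise RuntimeError("pool exhausted")
            self.pool.free -= 1
            self.allocated.add(extent)

    def unmap(self, extent):
        # Space release: hand the extent's capacity back to the pool.
        if extent in self.allocated:
            self.allocated.remove(extent)
            self.pool.free += 1


class ToyFileSystem:
    """File system that tracks its own free list, as real ones do."""

    def __init__(self, device, issue_unmap=False):
        self.device = device
        self.issue_unmap = issue_unmap
        self.files = {}  # file name -> list of extents
        self.free_extents = list(range(device.size))

    def create(self, name, n_extents):
        extents = [self.free_extents.pop(0) for _ in range(n_extents)]
        for e in extents:
            self.device.write(e)
        self.files[name] = extents

    def delete(self, name):
        extents = self.files.pop(name)
        self.free_extents.extend(extents)  # the file system's view: free
        if self.issue_unmap:               # the device only learns of it here
            for e in extents:
                self.device.unmap(e)

    def used_fraction(self):
        return sum(len(v) for v in self.files.values()) / self.device.size


def demo(issue_unmap):
    pool = ThinPool(100)
    dev = ThinDevice(pool, 100)
    fs = ToyFileSystem(dev, issue_unmap)
    fs.create("a", 30)
    fs.create("b", 50)
    fs.delete("b")
    return fs.used_fraction(), len(dev.allocated), pool.free


if __name__ == "__main__":
    print("no unmap:  ", demo(False))
    print("with unmap:", demo(True))
```

Without the release call, the file system reports itself 30% full while the device still holds 80 of the pool's 100 extents. With it, the device drops back to 30 and the other 50 return to the pool.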
EMC has been among the T10 members driving these new standards (I’m not going to get into why there are two SCSI standards, nor why both differ so much from the SATA TRIM command – I’ll leave that bit of esoterica to someone else). This participation has allowed the Symmetrix VMAX team to prepare for these new APIs in advance. While the software update that adds the actual support for them (and yes, VMAX will support both SCSI variants) is planned for the fall timeframe, the infrastructure is already in place on the current VMAX version of Enginuity.
VMAX will support these APIs to reclaim space for thinly-allocated Virtual Provisioned devices – just as every other vendor who offers “thin” provisioning will eventually do, pending their software release cycles. In relatively short order, these APIs won’t be a “differentiating capability” for anyone.
But wait, there’s more!
You may be surprised to learn that VMAX will also support these commands for ALL devices, including pre-allocated VP devices and yes – even for traditionally-provisioned thick LUNs!
Why?, you might ask
Clearly, with devices that aren’t “thin” you cannot reclaim space from the unused blocks. But, the knowledge that certain blocks are no longer needed by the host can be used to optimize many other operations that the array routinely undertakes. For example:
- For blocks that reside on an SSD, passing the command down to the device can enable it to optimize its own performance (as discussed above)
- For LUNs that are being Cloned or Snapped, the array can avoid copying (or updating metadata) for blocks that are known not to be in use
- For LUNs that are being remotely replicated, bandwidth can be saved on initialization and resyncs of the remote device
- When a raw dump or copy of a LUN is being made (e.g., a dd copy), the array doesn’t actually have to read the unused blocks off the drives – it can simply forward an all-zero block to the host, reducing the I/O overhead on the back-end of the array.
VMAX supports efficiencies like these through metadata that tracks blocks that have never been written by the host (we call it the NWBH flag). Separately, the array tracks blocks that Should Be Zero (I call that one SBZ). Together, these two flags allow the array to know whether a given block (extent) is in use by the host software, and if it is, whether it is currently all-zeros or has actual data on it. And in fact, with a thinly-allocated VP device, NWBH and SBZ blocks don’t even have to physically exist beyond the metadata!
VMAX thus interprets the space reclamation APIs to mean “set these blocks back to NWBH and SBZ” and then if the target LUN is thinly provisioned, release the capacity back to the pool. Significantly, on VMAX this entire operation can be completed without a single disk I/O – the metadata is updated and the operation is completed. SBZ blocks on non-thin devices will eventually get zeros written to them, but that’s an asynchronous process to the actual SCSI command request.
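Here is a rough sketch of how per-extent flags like these could let an array handle a reclaim request without touching the disks. It is purely illustrative and is not Enginuity's implementation: the class names, the extent size and the disk_ios counter are all invented for the example, and the asynchronous zeroing of thick devices is left out.

```python
class Extent:
    """Per-extent metadata (toy model, not the real on-array layout)."""

    def __init__(self, backed):
        self.nwbh = True      # Never Written By Host
        self.sbz = True       # Should Be Zero
        self.backed = backed  # physical capacity currently assigned?


class ToyLun:
    def __init__(self, n_extents, thin, extent_size=4):
        self.thin = thin
        self.extent_size = extent_size
        # Thick LUNs are fully backed up front; thin LUNs start unbacked.
        self.extents = [Extent(backed=not thin) for _ in range(n_extents)]
        self.disk = {}      # extent index -> bytes; stands in for the drives
        self.disk_ios = 0   # counts simulated back-end I/Os

    def write(self, i, payload):
        e = self.extents[i]
        e.nwbh = False
        e.sbz = False
        e.backed = True
        self.disk[i] = payload
        self.disk_ios += 1

    def read(self, i):
        e = self.extents[i]
        if e.nwbh or e.sbz:
            # Answer from metadata alone: no back-end read needed.
            return bytes(self.extent_size)
        self.disk_ios += 1
        return self.disk[i]

    def unmap(self, i):
        # "Set these blocks back to NWBH and SBZ", metadata only.
        e = self.extents[i]
        e.nwbh = True
        e.sbz = True
        if self.thin:
            e.backed = False        # capacity goes back to the pool
            self.disk.pop(i, None)
        # On a thick LUN the stale data stays on disk until some later
        # background zeroing pass, which this sketch does not model.


if __name__ == "__main__":
    lun = ToyLun(4, thin=True)
    lun.write(0, b"data")
    before = lun.disk_ios
    lun.unmap(0)
    print(lun.read(0), lun.disk_ios - before)
```

In this model both the unmap and the read that follows it complete with zero additional back-end I/Os, which is the behavior described above. The same read path is what would let a clone, a resync or a dd copy skip unused extents.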
VAAI and Block Zero()
Somewhat related is the new Block Zero command in the VMware 4.1 VAAI set – VMAX is optimized for this command as well. As mentioned above, VMware can be configured to zero out VMDK space before assigning it to a new Virtual Machine, and with 4.1 this new Block Zero command provides a more efficient alternative to ESX simply writing gigabytes of 0’s to the target. The VAAI command basically says “write zeros to this LBA range for me,” allowing the array to handle however appropriate, but without all the I/O traffic on the fabric.
Some arrays will still dutifully write zeros to the disks, minimizing the realized value of the API. And indeed, the new APIs eliminate the benefits of zero-detecting hardware that a certain array vendor likes to brag about so much – instead of GBs of zeros, one simple command request effects the operation.
On VMAX, the operation itself can be effected in microseconds, and without any I/O operations to the disks. Instead, the SBZ bit is simply set for the specified block range and the operation returns complete.
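A bit of back-of-the-envelope arithmetic shows why offloading the zeroing matters for the fabric. The sketch below assumes the offload rides on a WRITE SAME-style command that carries one block of data plus a block count. The 512-byte block size, the 1 MiB host I/O size and the per-command block limit are assumptions picked for illustration; real arrays may advertise a much smaller maximum per command.

```python
def host_zeroing_cost(region_bytes, io_size=1 << 20):
    """Host writes the zeros itself: every byte crosses the fabric."""
    commands = -(-region_bytes // io_size)  # ceiling division
    return commands, region_bytes


def offloaded_zeroing_cost(region_bytes, block_size=512,
                           max_blocks_per_cmd=2**32 - 1):
    """Array does the zeroing: each command carries one block of payload.

    max_blocks_per_cmd is an assumed limit, not a figure from any product.
    """
    blocks = -(-region_bytes // block_size)
    commands = -(-blocks // max_blocks_per_cmd)
    return commands, commands * block_size


if __name__ == "__main__":
    vmdk = 40 * 2**30  # a 40 GiB region, chosen arbitrarily
    print("host-written zeros:", host_zeroing_cost(vmdk))
    print("offloaded:         ", offloaded_zeroing_cost(vmdk))
```

Under these assumptions, zeroing 40 GiB the old way takes 40,960 write commands and pushes all 40 GiB across the SAN, while the offloaded version needs a single command with 512 bytes of payload. Even with a far lower per-command limit, the payload stays in the kilobytes.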
coming soon to a vmax near you
The described support for these APIs on VMAX is pending an upcoming Enginuity release, and this discussion here is for planning purposes only – the final implementations may vary from what I’ve described. But the basic architecture is already shipping as described, and the implementation intent is consistent with development’s plans.
I have gone out on this limb because it is important for people to understand the realities of the whole situation, rather than be sucked in by the alchemists who would have you believe they alone can make free space from deleted files.
The standards have only recently stabilized, and it is thus reasonable to expect that support for them will soon be delivered by all the relevant vendors on both the host/initiator and array/target sides of the operation. I can’t speak for all the vendors who will be supporting these commands, but suffice it to say it will be a short-lived differentiator for the early adopters. More importantly, the value will grow exponentially as host SW like Windows Server, Oracle, Linux, AIX, et al. deliver their implementations.
Meanwhile the geysers are nothing more than hot air…
technorati tags: EMC, VMAX, Symmetrix, Virtual Provisioning, VMware, VAAI, Block Zero, WRITE_SAME, UNMAP, space reclamation, thin provisioning
Talk about geysers gushing. You make it sound like all the opportunities for space reclamation can be solved by an API. The thing you forgot to mention is that the host software also has to include the same functionality and that customers need to upgrade their software to get this functionality after it is finally released. Not exactly like hitting the EASY button now is it?
By contrast 3PAR (the company I work for) has solutions today for Windows, Symantec, Oracle and VMware environments using a very fast co-processor function that does not require an API at all, but native software functionality that uses a single command to zero blocks. Customers are using it today.
Oh yes - and of course we also will support the reclamation API standards too, as will most everybody else that matters in this industry.
This post reminds me of your incredibly premature announcement of automated storage tiering in April 2009. I assume this will be forthcoming sometime in 2010, long after 3PAR, Compellent and IBM delivered the same functionality?
Posted by: marc farley | August 09, 2010 at 10:35 AM
Any info as to whether this functionality will be made available in EMC's mid-tier storage offerings?
Posted by: Alexsmattson | August 09, 2010 at 10:40 AM
Alex -
I believe that CLARiiON supports the Write_Same flavor today; I'm not sure when they will add the unmap version.
Posted by: the storage anarchist | August 09, 2010 at 10:52 AM
Marc (aka "Old Faithful")
I repeatedly noted the need for host support in my post. Fortunately, most vendors appear to be planning a service pack to add support, much as Microsoft did for the TRIM command.
Writing zeros over unused space and then reclaiming the zeros is a hack that the new APIs eliminate, and one that Symm already supports for the same platforms as does 3PAR. The APIs will make things much more efficient on multiple dimensions and eliminate the unnecessary overhead of writing and reclaiming zeros.
Although your own blog touts the benefits, I note that you neglected to discuss the APIs or their benefits/implications.
As I have repeatedly observed, being first to market rarely breeds long-term success. In telling the market where we are going with our products, it is inevitable that others will rush features to market. Once Symmetrix sub-LUN FAST ships later in 2010, we will have plenty of time to compare and contrast implementations.
Posted by: the storage anarchist | August 09, 2010 at 11:09 AM
Barry, I don't mind being compared to Old Faithful, there are many things to like about it, especially the fact that for a natural phenomenon, it delivers spectacularly on schedule - year after year.
Furthering the Yellowstone analogy, maybe a good nickname for you would be "Mud Pot" for writing nonsense like "Although your own blog touts the benefits, I note that you neglected to discuss the APIs or their benefits/implications." Gee, that's not talking in circles, is it?!
I noticed that you did not mention the platforms that your "hack" reclamation works on. I too noted that you neglected to talk about how this hack that you supposedly have today works. I suspect, the Symm (apparently the Clariion hasn't implemented it) actually writes the zeros to disk before running a reclaim process. Gosh, it would probably suck a lot of array resources writing and reading zeros that never need to be written in the first place. I bet it degrades performance something awful. 3PAR Thin Persistence software does depend on the host writing zeros (what you call a hack), but it catches them on ingestion and reclaims space without writing them - which means there is no need to read them later because the job is already done. You can call that a hack if you want to, but we have customers who call it a Godsend.
The way I see it, there must be a reason you don't really say much about your hackish implementation. If all you have is a band-aid, the best thing to do is to pray for rapid standards adoption.
Posted by: marc farley | August 09, 2010 at 02:19 PM
Marc -
Symantec's support is already widely adopted, and I predict that OS, File System, Volume Manager and HyperVisor adoption of these APIs will be swift and painless.
The tone of your response is therefore entirely expected - clearly you are recognizing that yet another bit of 3PAR's secret sauce is on the verge of being deprecated to the point of irrelevance.
Posted by: the storage anarchist | August 09, 2010 at 02:35 PM
Barry,
Good to hear you talk about something technical and interesting again.
Couple of quick comments -
1. This really good post was let down (as usual) by your unnecessary digs at other vendors. I think your story would have gone down better with me if it weren't littered with attacks on the competition. Makes readers think your story needs padding out to have substance.
2. Every vendor I've spoken to about these technologies has listed the other vendors that support it too, and none of them have tried to convince me that they invented it, or that they alone can make free space from deleted files. However, you (Symm) are not currently on their list of vendors that currently do support it. That's the truth though right? If so, a far cry from the picture you're trying to paint. Again, detracts from the crux of your message (IMHO).
3. I assume the NWBH flag can only be applied to newly created LUNs and, for example, not LUNs migrated in from other arrays etc... Also I imagine that LUNs created with prior versions of Enginuity won't qualify either. Is this correct?
Still, despite your childish antics, there is some great content in this post. I think the NWBH and SBZ flags sound cool. But I have to ask - how long before the other vendors implement similar features?
BTW I think you owe us a technical post on the differences between WRITE SAME with UNMAP flag set, and UNMAP(). I'm cool with TRIM, but I think a lot of folks (myself included) are a little grey around WRITE SAME() and UNMAP()
Nigel
Posted by: Nigelpoulton | August 09, 2010 at 02:59 PM
Nigel -
Thanks for the feedback.
As you probably know, the SBC-3 RFCs have only just recently stabilized sufficiently for implementation. And indeed, VMAX support is included in an as-yet-not-shipping software update due later this year.
The flags are actually already implemented in VMAX, so existing LUNs will already be tagged. For VP devices, zero page reclaim will find and release all-zero extents. And the upcoming Enginuity release includes updates to Open Replicator that will avoid copying all-zero blocks (PPME and Open Migrator already do this today).
And indeed, I suspect others will take similar approaches to optimize their support for space reclamation - I explain the Symm's approach here for your interest, not as some sort of differentiated uniqueness.
I will look into what I can detail about the two different standards, possibly for a future post.
Posted by: the storage anarchist | August 09, 2010 at 03:39 PM
I was just saying on another blog how I hoped thin reclamation would become a standard API built into the operating systems someday. I appreciate the article.
Barry, of interest is your claim to have "seen practically every vendor who is shipping support for this feature today practically claiming to have invented it"... followed by your statement: "Writing zeros over unused space and then reclaiming the zeros is a hack that the new APIs eliminate".
You are practically implying that vendors are deploying published co-developed RFC technology and claiming it as their own, when technically they have had to develop a separate unique interim solution (hacks) to make the process work for now. I think you have a wrinkle in your space-time continuum. As you clearly stated, the published RFC technology requires the host operating systems to use the new SCSI commands, which as of today, they do not... hence the need, as you pointed out, for vendors to develop (invent) their own processes for zeroing out data and unallocating it. Everybody is doing it, in their own proprietary developed way.
I don't understand your position to wag fingers at vendors for developing thin provisioning technologies and taking credit for it. While I deeply appreciate the effort to standardize these functions, credit remains due to the early adopters from the "me-too" bandwagon-eers that benefit.
Posted by: Richard Siemers | August 14, 2010 at 03:35 PM
Richard -
Thank you very much for your comments.
While I agree that the early innovators indeed lead the way, the vendors I am referring to here are all, in point of fact, implementing the new standards. What I shake my head at is the fact that none of them chose to include mention or reference to the standards, implying by omission that they had done something unique in their space reclamation implementations.
As to the "hack" of writing zeros and then reclaiming them, I'll gladly bestow kudos on those that built hardware optimized for this purpose. But they are still hacks, because writing TBs of zeros places a huge burden on not only the storage, but on the storage networks as well. I've seen VMware clusters totally saturate an FC fabric writing zeros; implementing WRITE_SAME not only removes the need for search-for-zeros but places an infinitesimal overhead on the SAN and storage to boot.
Would that we vendors could work as hard on getting standards approved and implemented as we do trying to differentiate. Especially when a cooperation between multiple different components of the IT infrastructure is required for an optimal solution. I guess if you have an ASIC that can scrub zeros quickly there's not much incentive to push for standards that deprecate the value of your custom hardware.
Truth be told, work on these standards was started back in 2007 before EMC brought STEC's Flash Drives forward for the industry. It is a shame that none of the early-implementers of thin provisioning were pushing for these reclamation standards before then.
Posted by: the storage anarchist | August 14, 2010 at 04:53 PM