
July 03, 2008

1.014: the laurel and hardy of thin provisioning

UPDATED (July 3, 2008): This post has been revised with new information; the updates and corrected figures are noted inline below.

One is decidedly skinny.

The other is unabashedly portly.

And I'm not talking about Stan and Ollie, folks.

No, such is the differentiation between the thin provisioning implementations of IBM's SVC and Hitachi's USP-V/USP-VM.

Sir Barry Whyte eloquently describes the petite implementation of SVC's fine-grained Space-Efficient Virtual Disk (SEV for short) in a recent blog post (any resemblance of BarryW to fellow Brit Stan Laurel is purely coincidental, I'm sure).

Not to be outdone (and in an obvious attempt to justify the Hardy-ness of Hitachi's Dynamic Provisioning), HHSNBN explains why DP's heavyweight approach makes for better thinness (at least on the USP-V). Given the title of his post (When is Thin Provisioning Too Thin?), I figure ole' HHSNBN doesn't think the SVC's implementation is all that, shall we say, robust.

IMHO, both have managed to gloss over details that are very pertinent to understanding if, when and where one implementation is better than the other. Not surprising, especially since BarryW & I both know full well that HHSNBN will never respond directly to any inquiries or challenges. No, HHSNBN prefers only one-sided discourse (his side, of course), so I guess that leaves it up to me to try to tease out the truth.

So let's look a little deeper at these near-opposite implementations and see what we can figure out for ourselves, shall we?

Warning: readers of this blog have asked that I spend more time talking tech,
and less time bashing the competition.

This post is about as close as I can get to fulfilling those requests...

 

so thick is really thin?

With the Hitachi USP-V implementation of thin provisioning (called Dynamic Provisioning - probably because it isn't really "thin"), each write to an unallocated block of the "dynamic" device results in the allocation of an entire 42MB "page" (chubby chunk). Unfortunately, I have been unable to find any publicly-accessible technical description of why 42MB was chosen as the page size - the only explanation I've found is that it is sufficiently large as to ensure each page is wide-striped across multiple spindles.

The reason for selecting 42MB may be found indirectly, however.

The most popular published examples of Hitachi's DP typically describe a storage pool made up of 4ea. 7+1 RAID 5 array groups, for a total of 32 drives in the pool. Subtracting out the parity "drives" (I know, RAID 5 doesn't really do drive-based parity, but it's just easier to discuss as if it does), and using a 42MB page size, it would thus appear that 1.5MB of each page will be assigned to each data drive. Likewise, if you had 6 7+1 array groups in the pool (42 data + 6 parity), you'd allocate 1MB per spindle; 2 array groups (14 data + 2 parity) would place 3MB per spindle.

I suspect that this is at least part of the reason for selecting 42MB - because it can be spread across any even number of data spindles in up to 8ea RAID 5 or RAID 6 array groups such that each spindle will be assigned a sufficiently large, even multiple of 8KB blocks.

But 42MB is a huge amount of storage to be allocating, and there are smaller sizes that would have a similar effect. For example, instead of 42 megabytes, Hitachi could have chosen to allocate 42 "blocks" of 256KB each. Each "page" would then be 10.5MB in size, and each data drive would hold 384KB of the page in the 4-group (32-drive) case, or 768KB in the 2-array-group (16-drive) configuration. This would seem more space efficient than 42MB while still maintaining the wide-striping benefits.
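If you want to check the arithmetic yourself, here's a quick back-of-the-envelope sketch in Python. It's my own model, not anything published by Hitachi - it simply assumes a page is striped evenly across the data drives of the pool, as described above:

```python
# Per-spindle allocation math for Hitachi DP pages (my assumptions, not
# Hitachi documentation): a page striped evenly across the data drives
# of N 7+1 RAID 5 array groups.

def per_spindle_kb(page_mb, array_groups, data_drives_per_group=7):
    """KB of each page that lands on each data spindle."""
    data_drives = array_groups * data_drives_per_group
    return page_mb * 1024 / data_drives

for groups in (2, 4, 6):
    print(f"42MB page, {groups} groups ({groups * 8} drives): "
          f"{per_spindle_kb(42, groups):,.0f} KB/spindle")
# 42MB page, 2 groups (16 drives): 3,072 KB/spindle   (3MB)
# 42MB page, 4 groups (32 drives): 1,536 KB/spindle   (1.5MB)
# 42MB page, 6 groups (48 drives): 1,024 KB/spindle   (1MB)

for groups in (2, 4):
    print(f"10.5MB page, {groups} groups: "
          f"{per_spindle_kb(10.5, groups):,.0f} KB/spindle")
# 10.5MB page, 2 groups: 768 KB/spindle
# 10.5MB page, 4 groups: 384 KB/spindle
```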

Considering the inevitable hassle that competitors will give (indeed, have already given) Hitachi over this chubby page size, you have to wonder why.

I think I know the answer:

Metadata.

More specifically, the balance between the amount of metadata required to support Dynamic Provisioning and the rather limited amount of control memory that the USP-V has (max of 32GB, if I recall correctly).

See, with any thin provisioning implementation comes the overhead of the metadata that is inevitably required to keep track of which physical blocks on disk are assigned to each "thin" device. This metadata is generally in addition to whatever metadata is required for other purposes (device mapping, replication flags, access controls, etc.). And it has to include whatever pointers and flag bits are necessary for the implementation to keep track of the allocation of "pages" (in Hitachi's parlance - often called chunks or extents elsewhere) to their externally referenced LUNs or volumes.

And the thing is, the more actual "pages" you have to track, the more metadata will be required (assuming that the amount of metadata per page is constant). As a result, the 10.5MB page size I suggested above would require 4 times as much metadata for the same amount of allocated storage.

So there's at least SOME incentive to keep the number of pages as small as possible, since larger pages require less total metadata for a given amount of allocated capacity.

Now, I don't actually know how much metadata Hitachi's Dynamic Provisioning requires for each allocated "page," but I suspect that it's at least 8 bytes each (it's probably a lot more). But if only 8 bytes of metadata were required for each 42MB DP "page," then each allocated TB would require just about 195KB of metadata, while my suggested 10.5MB page would use up about 780KB per TB.

Sounds pretty manageable, doesn't it?

Until you consider the fact that Hitachi promotes the USP-V as supporting up to 838TB of internal and 247 PETABYTES of externally virtualized storage. (I smell Hitachi Math coming, don't you?)

Doing the math out, if each 42MB page only requires 8 bytes of metadata, and a TB requires 195KB of metadata, then 1 Petabyte will require 195MB, and 247.8PB would require right around 47.5GB of metadata.
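Here's that math in runnable form, in case you want to check me (remember: the 8-bytes-per-page figure is purely my assumption, not a Hitachi number):

```python
# Metadata sizing for Dynamic Provisioning, assuming a purely
# hypothetical 8 bytes of tracking metadata per 42MB page.

MB, GB, TB, PB = 2**20, 2**30, 2**40, 2**50

def dp_metadata(capacity_bytes, page_bytes=42 * MB, meta_per_page=8):
    return capacity_bytes / page_bytes * meta_per_page

print(f"{dp_metadata(TB) / 2**10:,.0f} KB per TB")             # ~195 KB
print(f"{dp_metadata(PB) / MB:,.0f} MB per PB")                # ~195 MB
print(f"{dp_metadata(247.8 * PB) / GB:,.1f} GB for 247.8 PB")  # ~47 GB
```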

Now, with only 32GB of control memory in which to store this metadata, we have one of those problems that only Hitachi Math can solve ;-)

OK, actually, there are a few options:

  1. There's less than 8 bytes of metadata per 42MB page. This I doubt - even 4 bytes would consume over half of the control store, which was only just doubled with the introduction of the USP-V. So it must be something else.
  2. The metadata isn't actually kept in the control store - it's either kept in the global cache (which I doubt), or it is paged to/from disk on demand (which I also doubt).
  3. The USP-V can't actually virtualize AND dynamically provision 247.8PB of storage...

Given my assumption that there's actually a lot more metadata per "page" than 8 bytes and that the metadata is in fact ALL kept in the control memory on the USP-V, I'm going with #3 - the actual amount of storage that can be allocated to thin devices is something well less than the advertised 247.8 petabytes.

But no matter whether I'm right or wrong, it becomes pretty clear that Hitachi couldn't use a smaller page size even if they wanted to. 10.5MB pages would need over 190GB of metadata, and smaller page sizes would require even more. Net-net, I suspect that 42MB was the smallest page size that they could use within the memory limitations of their control store.

And they'll be nagged incessantly about their "chubby provisioning" - I'll make sure of that!

so thin is really slow?

Over in IBM SVC-land, I'm sure Sir BarryW has been ROTFL about the obesity of Hitachi's Dynamic Provisioning. And well he should, because the SVC engineers have gone to the other extreme in their implementation of Space-Efficient Virtual Disk (SEV) - they're actually supporting a configurable "grain" size ("grain" is SEV's allocation unit and the equivalent of Hitachi's "page"). According to BarryW (I haven't seen the actual documentation yet), grains can be configured on a per-thin-device basis as 32KB, 64KB, 128KB or 256KB, with a recommended/expected default of 32KB.

Compared to 42MB, the SVC's allocation size is downright infinitesimal.

Clearly, the SVC implementation is striving to maximize utilization efficiency, although SEV's smallest grain size isn't quite as small as 3PAR's implementation, which uses 16KB chunklets. I'm sure the folks over at 3PAR will have something to say about this (although I doubt that they'll call SEV "chubby" - maybe just "slightly overweight"). Using a 32KB "grain" is pretty space efficient, especially since most databases tend to drive 8KB I/Os, while most file systems seem to allocate along either 8KB or 64KB boundaries.
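To put the granularity gap in perspective, consider the worst case: a single 8KB write landing in an otherwise-empty allocation unit. A quick sketch (my arithmetic only, not anybody's published figures):

```python
# Capacity allocated-but-unused when a lone 8KB write triggers the
# allocation of one empty unit, for the unit sizes discussed here.

KB, MB = 2**10, 2**20
units = [
    ("3PAR chunklet",        16 * KB),
    ("SEV grain (smallest)", 32 * KB),
    ("SEV grain (largest)",  256 * KB),
    ("my hypothetical page", int(10.5 * MB)),
    ("Hitachi DP page",      42 * MB),
]
write = 8 * KB
for name, size in units:
    print(f"{name:<22} {size - write:>12,} bytes stranded")
```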

but what about all the metadata?

Ahh...there's the rub. As we've just learned from the Hitachi exposé above, smaller allocation sizes require more metadata. And on top of that, the poor little SVC doesn't even have the benefit of 32GB of control memory - in fact, if I recall correctly, each SVC node is limited to a maximum of 8GB of memory total!

Time to do a little math here (real math this time, not Hitachi Math).

The most recent SVC code and hardware can support up to 2PB of physical storage per node, with a maximum of 2TB per exported LUN and 8,192 total LUNs per node (and no, you can't have 8000 2TB LUNs - that would be Hitachi Math). So with 2PB of physical storage allocated to SEV thin devices using 32KB "grains" and only 8 bytes of metadata per grain, you'll need 512GB of metadata!

That's a lot!

UPDATE (July 3, 2008): BarryW has posted a rebuttal to my analysis over on his blog, and in the ensuing discussion, he provided more accurate approximations of actual metadata overhead. I wholeheartedly encourage you to read BarryW's post as well as all the follow-on discussion between the two of us in the comments (just overlook BarryW's accusatory and defensive tone - it's part of the game between us).

Based on the new information supplied by BarryW, I've rewritten the following section.

But WAIT! In his blog post, BarryW gives us a hint about how much metadata is actually required - he says it is "less than 1% additional" to the allocated capacity. Now, he didn't say exactly HOW much less than 1%, but I'm gonna bet that being the engineer/master scientist that he is, he would have said "less than 0.5%" if it were actually that much smaller. So, assuming that the metadata is pretty close to 1% of the actual amount of allocated capacity, we can deduce that worst case (32K grains), each allocated "grain" requires somewhere around 300 bytes of metadata (imagine what that would mean for the Hitachi implementation!).

300 bytes per grain means ~18TB of metadata for a fully allocated 2PB node.

But WAIT! In his follow-on blog post, BarryW discloses that the actual metadata overhead for 32KB "grains" is less than 0.5%, and for 256KB "grains" it's less than 0.1%. Based on that, I'll adjust my original deductions by 50% - I'm guessing that each allocated "grain" requires about 150 bytes of metadata.

150 bytes per grain means ~9TB for a 2PB SVC node/cluster, or 36TB for the theoretical maximum 8PB SVC cluster, if everything were allocated.
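For the skeptics, here's the whole chain of deductions as a snippet. The per-grain byte counts are my inferences from BarryW's overhead percentages, not IBM-published numbers:

```python
# SEV metadata sizing under various per-grain assumptions.

KB, GB, TB, PB = 2**10, 2**30, 2**40, 2**50

print(f"{0.01 * 32 * KB:.0f}")   # ~1% of a 32KB grain   -> ~328 bytes ("~300")
print(f"{0.005 * 32 * KB:.0f}")  # ~0.5% of a 32KB grain -> ~164 bytes ("~150")

def sev_metadata(capacity_bytes, grain_bytes=32 * KB, meta_per_grain=150):
    return capacity_bytes / grain_bytes * meta_per_grain

print(f"{sev_metadata(2 * PB, meta_per_grain=8) / GB:,.0f} GB")    # 512 GB (my 8-byte strawman)
print(f"{sev_metadata(2 * PB, meta_per_grain=300) / TB:,.1f} TB")  # ~18.8 TB
print(f"{sev_metadata(2 * PB, meta_per_grain=150) / TB:,.1f} TB")  # ~9.4 TB
print(f"{sev_metadata(8 * PB, meta_per_grain=150) / TB:,.1f} TB")  # ~37.5 TB ("36TB", rounded)
```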

Now, of course, we all know that nobody in their right mind would put 8PB of physical storage behind an SVC cluster, much less configure enough SEV LUNs to actually consume all of the available storage.

But still...that's a LOT of metadata!

so where do they put all that metadata?

Well, since each SVC node has a maximum of 8GB of RAM, all of this metadata isn't going to be memory-resident all the time. As BarryW explains in the referenced blog post, the metadata is actually stored on disk, right alongside the data itself.

Brilliant! (Well, sorta.)

See, in order to access (read or write) any particular block of data from the thin SEV device, the SVC has to access the metadata to find it. And if the required metadata is not in memory, then the host I/O request has to be delayed while the SVC code reads the required metadata for that particular block from the disk. And if there's no room in memory for the newly required metadata, then some other piece of metadata has to be pushed out of memory to make room for it. And if the metadata being demoted was changed since it was last written out to the physical disk, well...you can see how one I/O request to a thin SEV LUN could result in 3 disk I/Os (2 reads and 1 write).
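To make the mechanism concrete, here's a toy model - emphatically not IBM's code, just a generic demand-paged directory with LRU eviction - showing how a random workload over far more grains than the cache can hold drives exactly the I/O amplification I just described:

```python
import random
from collections import OrderedDict

# Toy model: an LRU cache of per-grain metadata entries. A miss costs a
# metadata read from disk; evicting a dirty entry costs a metadata
# write; and the host's data I/O is always one more disk I/O.

class MetadataCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # grain_id -> dirty?
        self.disk_ios = 0

    def host_io(self, grain_id, is_write):
        if grain_id in self.entries:
            self.entries.move_to_end(grain_id)       # metadata cache hit
        else:
            self.disk_ios += 1                       # read metadata from disk
            if len(self.entries) >= self.capacity:
                _, dirty = self.entries.popitem(last=False)
                if dirty:
                    self.disk_ios += 1               # write back dirty metadata
            self.entries[grain_id] = False
        if is_write:
            self.entries[grain_id] = True            # writes dirty the metadata
        self.disk_ios += 1                           # the data I/O itself

cache = MetadataCache(capacity=1_000)                # far fewer entries than grains
N = 100_000
for _ in range(N):
    cache.host_io(random.randrange(N), is_write=random.random() < 0.5)
print(f"{cache.disk_ios / N:.2f} disk I/Os per host I/O")  # approaches 2.5 here
```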

Starts sounding a little like how Hitachi verifies writes to SATA drives (write, read, compare, return), doesn't it?

Now, you can surely bet that the SEV engineers did their damnedest to make sure this doesn't happen very often...but the fact is, this situation will happen - it is impossible to avoid, since all of the metadata cannot fit into memory at once. And should there be multiple random I/O applications hitting the poor little SVC node simultaneously, it ain't gonna matter how efficient the SVC is at passing an I/O through...the overhead of the metadata is going to force the system to do more (slow) disk I/Os, and the latency of those I/Os is going to have an impact on host application performance.

Just for fun, I worked backwards. If 100% of the 8GB of RAM in an SVC node were used to store ONLY the 150 bytes of metadata for each 32K grain, it would only be able to maintain the working set for about 16TB of allocated SEV capacity. And of course, we know that there are numerous other things competing for those memory resources, so it's probably more realistic to think the max working set for SEV capacity is closer to the 2-4TB range. Above this, the node will likely start thrashing as it tries to page metadata in and out from the disk.

So while IBM is marketing SEV as "free", you could well require an SVC node for every 4TB of allocated capacity in order to maintain reasonable performance.

I suspect it won't be too hard to prove this, perhaps even with a simple I/O generator. Should be fun to try.

Or maybe BarryW will just answer this for us...being that his job is SVC performance testing, I'm sure he already knows what the practical limit of allocated SEV capacity per node is for various workloads.

UPDATE (July 3, 2008): I asked BarryW if there was a recommended limit of SEV capacity for SVC configurations, and so far he has chosen not to respond.

is the svc intentionally dumbing down storage?

Well, DUH!

It is generally understood that the SVC is blind to the inner workings of the storage behind it. Oh sure, there are the tuning parameters that BarryW talks about from time to time that allow the SVC to be adjusted to accommodate some of the differences between storage platforms. But the SVC has absolutely no awareness of the inner workings of the storage it is "virtualizing" - it is blissfully unaware.

This is not necessarily a bad thing, except perhaps when there are errors reported by the storage that aren't properly handled by the SVC - I can imagine a few such cases probably exist even today.

But the small "grain" size of the SEV implementation introduces a new, wicked-bad side effect of this blindness, because SEV can in effect defeat the performance-optimizing algorithms of the external storage by turning what originated as sequential I/O from the host into blatantly random I/O on the back-end storage.

Let me explain.

Imagine you have several dozen (hundred) hosts, each being used to create relatively large sequential data files. Whether they're storing PowerPoint documents, digital photographs, X-rays, MP3 files, surveillance videos, VMware images, application executables, etc. - doesn't really matter. Let's say all those hosts are using SEV volumes on a single SVC node in front of Brand X storage, and the community of hosts are randomly writing new sequential files.

By definition, the data streams of these hosts are going to be broken up by SEV into 32KB "grains" and written to the back-end storage. And since they're all arriving "simultaneously", the grains being created to contain the data from each of the sequential data streams are going to be intermingled with all the other incoming grains, such that the data on the back end is no longer sequential. For example, a 1MB file would be broken into 32 "grains" by SEV, and there is a very high probability that none of these grains are actually written to the external storage such that they are adjacent to each other.

UPDATE (July 3, 2008): BarryW also explained that all grains for a vdisk are actually allocated out of 16MB-2GB extents that are vdisk-specific, meaning that grains of different LUNs won't be intermixed within an extent. However, grains are indeed allocated sequentially in the order they are received, so the grains of multiple concurrent applications on a single vdisk (LUN) will still be intermingled as described.

Of course, this is no concern to the SVC - when the host later posts a read request for any (or all) of that data, the SVC will happily submit the 32 individual 32KB read requests to the back-end storage, and it will (eventually) get all the data required to fulfill the request.

But instead of seeing read requests for sequential 32KB blocks of data, the back-end storage will see 32 individual random I/O requests.
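A little simulation makes the point. This is a toy model of write-order grain allocation as BarryW describes it (not the actual SVC allocator): four hosts each stream a 1MB file to the same vdisk at the same time.

```python
import itertools
import random

# Toy model: grains are handed out in arrival order, and four concurrent
# 1MB sequential streams (32 x 32KB grains each) arrive interleaved.

GRAINS_PER_FILE = 32
HOSTS = [f"host{i}" for i in range(4)]

next_grain = itertools.count()          # next back-end grain address
placement = {}                          # (host, file offset) -> back-end grain

for offset in range(GRAINS_PER_FILE):   # each round, every host writes one grain
    for host in random.sample(HOSTS, len(HOSTS)):  # arbitrary arrival order
        placement[(host, offset)] = next(next_grain)

layout = [placement[("host0", i)] for i in range(GRAINS_PER_FILE)]
adjacent = sum(b - a == 1 for a, b in zip(layout, layout[1:]))
print(layout[:8])                       # jumbled, e.g. [2, 7, 9, 14, ...]
print(f"{adjacent} of {GRAINS_PER_FILE - 1} grain pairs land adjacent on the back end")
```

Read host0's file back sequentially, and the back end sees 32 scattered requests instead of one nice consecutive sweep.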

Now, for dumb, inefficient, non-cached storage, the performance difference of this vs. 32 requests for consecutive 32KB blocks might not be all that large. You're basically going to get Read Miss response times for each, plus the overhead of relocating disk heads between each read.

But for an Intelligent Cached Disk Array such as a Symmetrix, a USP/USP-V, or even a DS8000 (in some rare cases, at least), the negative impact on sequential read performance as seen by the host will be quite significant. This is because the "grains" for each thin LUN were written "randomly" across the back-end storage device, such that no amount of back-end intelligence can effectively pre-fetch the data ahead of the individual read requests - the reads posted to the storage won't be for sequential, consecutive disk blocks.

Algorithms that are incredibly adept at increasing cache hits ahead of sequential reads in Symmetrix DMX, for example, will be totally defeated, even though the host itself is doing sequential I/O.

Ahh...the joys of external virtualization.

Thin provisioning is a prime example of why virtualization doesn't always belong in the network. Anything operating outside the software domain of the storage's own operating system is going to have a hard time integrating with the architecture and the algorithms employed inside the storage. In fact, it is all too easy for external virtualization to defeat or obviate the benefits of intelligent storage, reducing an intelligent disk array to little more than uncached JBOD.

And exactly why would anyone want to do that?

I guess if you're IBM or Hitachi, trying to promote externalized virtualization as the be-all and end-all of storage management, collateral damage to your competitors' storage products isn't all that disconcerting (if not outright intentional).

But if you're a customer who has paid good money for the performance and availability benefits of enterprise-class storage, you just might want to think twice about how that virtualization device might be destroying the value of your investment.

well then, isn't this a fine mess?

Yes, Stanley, it is quite a fine mess you've gotten yourself into.

And though I never thought I'd say this, in this one case Hitachi's "chubby" provisioning is probably more performance-efficient with external storage than the SVC's "thin" approach. But it is still horribly inefficient in the context of capacity utilization.

I'm sure you won't be surprised to hear this from me, but it is absolutely the case that Symmetrix Virtual Provisioning is both more space efficient than Hitachi's approach and inherently integrated with the internal architecture of the DMX platform for maximum performance and efficiency. Because indeed, the "thin extent" size used by Symmetrix Virtual Provisioning is both larger than the largest that SVC uses and (significantly) smaller than what Hitachi uses.

More importantly, Symmetrix VP's "thin extents" are perfectly aligned to take full advantage of the Symmetrix algorithms that accelerate both sequential and random I/O across all RAID types and configurations - something that neither IBM's nor Hitachi's storage virtualization or thin provisioning technologies come close to accomplishing.

As I said, it is not necessarily in the interests of either competitor to ensure that customers get all of the performance benefit that a Symmetrix has to offer once it is virtualized - they have every reason to try and "dumb down" Symmetrix to the lowly status of JBOD. Thin provisioning is just the latest demonstration of this.

"free" may be the most expensive solution you can buy...

Before you rush off to put a bunch of SVCs running (free) SEV in front of your storage arrays, you might want to consider the performance implications of that choice. Likewise, for Hitachi's DP, you probably want to understand the impact on capacity utilization that DP will have. DP isn't free, and it isn't very space efficient, either.

If you'd like a solution that is precisely tuned to maximize both performance and utilization efficiency, look no further: Symmetrix DMX-3 and DMX-4 deliver this today.

 



Comments


Sangineer

I think you are cherry-picking examples of how not to use thin provisioning wisely. You pick worst-case scenarios and then complain about how poor thin provisioning would be. Well, obviously. It is still the job of the storage admin to figure out where it is a good fit and where it isn't. We are not all idiots, and we don't feel the need to thin provision everything in the world; it is just another tool for specific cases.

the storage anarchist

Thanks for the feedback.

In no way was my post meant to imply that storage admins are idiots - in fact, just the opposite: I used (admittedly extreme) examples to show how the SVC implementation of SEV may have unexpected impacts on performance.

I doubt that IBM was going to explain these impacts, leaving it to the overworked storage admins to figure out on their own why SEV performance was worse on some applications or storage configurations than it was on others.

Or, more realistically, IBM would hope that they don't notice that they're not getting everything their storage could deliver. It's easier that way.

The important point of my post is that SEV marks the first time that the SVC presents I/O requests to the back-end storage in a manner that specifically differs from how the host actually formatted the requests - sequential host read requests will undoubtedly appear "random" in many cases to the back end storage!

I'm sure some people would never have realized that until they read my post.

Lacking integral knowledge of how these two different I/O request patterns impact the performance of the storage behind it, the SVC will thus silently defeat performance optimization algorithms and diminish the value of the storage.

Which, as I've noted, is precisely IBM's intent!

Barry Whyte

It's been some time, and I guess you and I are probably the few that will read this, but our recent SPC-1 result (much as you hate the SPC, and hate SVC for reducing the perceived benefit that DMX or Symm can provide) shows that although you painted a potentially interesting case here, it really doesn't stack up when you run a pseudo-real-life workload. Or indeed real-life workloads.

I think the ultimate point here is that you don't need to spend the money on enterprise products like DMX, when "little old" SVC and mid-range storage can provide enterprise-class performance, even when using SEV.

Nigel

Barry! I can't believe I missed this!

I went quiet while having a great time working on a great project last year, followed by a well-earned break. But the fun on that project would have been nothing compared to an exchange relating to your comments on this post!

If my wife doesn't go into labour this weekend I will see if I can put a response together (better late than never).

Off the top of my head and with my limited knowledge of the Symm my guess at your extent size would be around 5.25MB

For some info on why 42MB for HDS see my recent post -
http://blogs.rupturedmonkey.com/?p=182

Oh and if you wanted to know about the metadata you could have just asked. A mixture of dedicated DIMMs on the Shared Memory board (way less than 32GB) plus some reserved areas per pool and per volume used to create the pool...

Oh and as I work my way through some of the stuff should I hold my breath expecting you to reveal your extent size?
