0.021 the case against standardized (performance) testing
Fellow blogger Tony Pearson has just completed a week-long series on the values and merits of standardized storage performance benchmarking, in a not-so-subtle attempt to justify his recent assertion that a SPC-2 win for the SVC has awe-inspiring relevance to customers. And he's done so in an eloquent, perhaps even masterful manner, deftly leveraging the subtleties and nuances of the English language (who knew?) to make his case.
But if you ask me, he's failed miserably. Unless his readers get lost in the misdirection and fail to realize that his metaphors are totally unrelated to the world of storage performance. In fact, his tutorial underscores the problems associated with standardized testing.
Elsewhere in the blogosphere, I have offered my own personal perspective on standardized benchmarking, which boils down to this:
- Standardized benchmarking oversimplifies the complex interactions that make up a real-world environment - the requirement for "controlled and repeatable" forces standardized benchmarks to exclude the chaos of random, but normally occurring, events and overheads, often masking or even intentionally subverting key differentiating capabilities of the test targets
- The inherent quest to be best in standardized benchmarks inevitably drives participants to optimize their test targets for the test
- There is very little documented correlation between standardized testing results and the intended real-world application of the test target, and most people don't understand what the tests actually measure
- The inbred survival instincts of humans lead us to subconsciously establish relationships and hierarchies between similar objects, and in the absence of in-depth situational/contextual understanding, we will assign "better" based solely on whatever limited data points are available to us
I know - heady assertions, and my opinions all. But note that I harbor these opinions for ANY standardized test, be it the SPC, TPC, MPG, EER, SAT or every state's equivalent of MCAS. And my reasoning is simple:
Standardized testing homogenizes comparisons to a meaningless baseline that masks the unique strengths of the test targets, be they cars, servers, storage arrays or high school students. Unless you fully understand the test itself and the relevant requirements of your own application of the test target, you can draw no real conclusions on how standardized test results apply to your expected results.
So when Tony tries to convince readers that the SPC is like MPG, well...you know me, I gotta take exception.
masterfully mixing metaphors
For some reason, Tony tries to make it sound like SPC benchmarks are similar to Miles Per Gallon ratings, when I think it's pretty obvious that the real automotive parallel to a storage performance metric is Miles Per Hour. But when I specifically asked Tony about this on his blog, his response was surprising, if not counterintuitive. He said he was comparing SPC IOPS to MPG because (paraphrasing here):
- MPG results are consistent across every instance of a particular car model that comes off the production line.
- MPG is standardized and publicly available
- MPG is usage-based and connected to real-world conditions
- MPG can be used for cost/benefit analysis
Powerful assertions that seemingly support - well, nothing, really! What does any of that have to do with the fact that MPG measures the number of miles a car is supposedly able to go on one gallon of gasoline under some unknown (but well-labeled) conditions, while the SPC measures how many "SPC IOPS" a certain vendor-selected configuration can achieve?
I know what a mile is, and I know how far I need to travel each day and I even have a pretty good idea how much a gallon of gas is going to cost me, but I have NO CLUE what an SPC IOP is, nor how many of these my storage array needs to be able to do. The metaphor isn't even apples and oranges - it's more like fruit flies and potato peelers!
mpg isn't what you might think
Fact is, there's no real attempt to prove that every single car coming off of a production line will get identical MPG in the EPA's test - they take one "production car" off the line, break it in for 4,000-5,000 miles, test it, and publish the results. Done! And while the test is indeed standardized and publicly available, I doubt that most people have taken the time to read or understand these tests, or how they relate to their driving styles. Tony provided excerpts describing the driving patterns for City and Highway (his link to How Stuff Works), and I'm sure you all immediately recognized (as I did) that you never drive that way - ever. But that's not the only reason for the "your mileage may vary" disclaimer.
For example, I bet most people don't know that the tests are always run with the air conditioner, heater, radio, fans, GPS and lights all turned off - because all of these things draw power from the engine, which (you guessed it) lowers the results. And I'll bet you didn't know that the test doesn't actually measure the amount of fuel burned over the (synthetic) test course - it calculates it based on the hydrocarbon output at the exhaust. You and I, well, we calculate MPG based on the amount of fuel we have to put back into the tank to fill it back up. And as Tony points out in his rebuttal to my inquiry, engines are getting more and more efficient at burning fuel completely, which means the test results are being artificially inflated - a factor that is even greater with hybrid cars, because they don't generate any hydrocarbons while the electric motor is running, yet the tests don't sufficiently account for the energy overhead of charging the batteries. And the EPA tests are run in a temperature-controlled environment using specially-blended fuel with a consistent energy content - unlike what you can get at the pump (did you realize that "10% Ethanol" means you are only buying about 97.5% of the energy you would have gotten if there were no ethanol added to the gasoline?)
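If you want to sanity-check that ethanol math yourself, here's a back-of-the-envelope sketch - the per-gallon energy densities are my own assumptions based on commonly cited BTU figures, not anything from the EPA test spec (note that with these densities E10 comes out closer to 96.7%; the 97.5% figure above implies ethanol carrying about 75% of gasoline's energy per gallon):

```python
# Back-of-the-envelope: how much energy is in a gallon of E10
# relative to a gallon of straight gasoline?
# Energy densities below are assumptions (commonly cited figures),
# not measured values from any official test.

GASOLINE_BTU_PER_GAL = 114_000   # assumed energy density of gasoline
ETHANOL_BTU_PER_GAL = 76_000     # assumed energy density of ethanol

def blend_energy_fraction(ethanol_fraction: float) -> float:
    """Energy of a blended gallon as a fraction of pure gasoline."""
    blend_btu = ((1 - ethanol_fraction) * GASOLINE_BTU_PER_GAL
                 + ethanol_fraction * ETHANOL_BTU_PER_GAL)
    return blend_btu / GASOLINE_BTU_PER_GAL

print(f"E10 energy vs. gasoline: {blend_energy_fraction(0.10):.1%}")
# -> roughly 96.7% with these assumed densities
```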
To top it all off (so to speak), the EPA is in fact changing the way MPG is measured as we speak (see this article at Edmund's). That's right, they're reworking the benchmark, so the "city" numbers on 2008 model cars and trucks will be about 12% lower than on the identical 2007 models.
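If you did want to compare across model years, you'd have to crudely rescale one set of numbers first - a minimal sketch, assuming the ~12% city reduction applies uniformly across vehicles (it doesn't, which is rather the point):

```python
# Naive cross-year comparison: restate a 2007 city MPG rating in
# approximate 2008-methodology terms. The flat 12% reduction is an
# assumption taken from the figure cited above; the real adjustment
# varies by vehicle, which is exactly why such comparisons are shaky.

CITY_REDUCTION_2008 = 0.12  # assumed uniform; actually varies per model

def city_mpg_2007_as_2008(mpg_2007: float) -> float:
    """Crudely rescale a 2007 city rating to 2008-style numbers."""
    return mpg_2007 * (1 - CITY_REDUCTION_2008)

print(city_mpg_2007_as_2008(30.0))  # -> 26.4, if the flat 12% held
```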
your mileage will vary
So I ask you - how the heck are you going to be able to use the 2008 MPG numbers to make an informed car buying decision? You can't compare this year's MPG ratings to last year's cars, because the tests are different. More importantly, you still can't correlate the new MPG numbers to your own driving habits, because nobody has any hands-on experience to validate the adjustments the EPA asserts it has made. They "say" the new numbers are "more representative," and that they are tacitly credible because they are US Government EPA-sponsored tests and results (this is now being specifically called out on the ratings sheets, in an apparent response to market suspicion that the auto manufacturers have been "doping" the results over the years).
But guess what? I agree that MPG and SPC are indeed similar.
Similar in that both tests are impossible to correlate in advance to expected results! And when, after the fact, the real-world environment doesn't match the results predicted by the test, there's really not much anyone can do about it - either the test workload was misunderstood, or the specifics of the intended real-world workload were (more likely, both weren't understood sufficiently).
And even though the MPG tests are "standardized" (everybody knows what a "mile" and a "gallon" are), they don't necessarily cover my intended use case. If I want a 4x4 to go off-roading, how do I know that the relative EPA "on road" fuel economy ratings of my potential selections are going to be consistent in relationship to one another in the "off road" use cases? I don't. And I can't. The EPA tests don't cover my use case, and thus I have no idea how much fuel I should plan on needing.
Just like the SPC tells me nothing except maybe how well the specific tested configuration runs that benchmark. In fact, there's nothing to even explain WHY this specific configuration was chosen instead of, say, one with fewer, larger disk drives. More significantly, as I mentioned before, there is no common understanding of what an "SPC IOP" is (nor an "SPC MB/s," for that matter). Fact is, unless you're intimately involved in benchmarking, the SPC tests and the architectures of the storage itself, there is insufficient data to make any correlation of SPC results to any other real-world environment.
speed vs. efficiency
On top of all this, the fact is that the EPA's MPG is a measure of efficiency, not speed or performance. And the SPC is a measure of performance, not efficiency - the number of these specific "SPC" IOs that you can get done in a unit of time (e.g., one second: IOPS). And while you can divide SPC IOPS by the listed price of the test configuration to get an efficiency rating, it's the wrong one: MPG is Miles per Gallon, not Miles per List Price. The SPC equivalent would have to be SPC IOPS per Watt, but I can't seem to find the measured power utilization of the test configuration in any of the SPC benchmarks.
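To make that distinction concrete, here's a sketch of the two ratios in question, using entirely hypothetical numbers - SPC reports do publish the tested configuration's price, but not (as noted) its measured power draw, so the wattage below is a pure placeholder:

```python
# Two different "efficiency" ratios for a benchmark result.
# All numbers below are hypothetical placeholders, not real SPC data.

spc_iops = 200_000       # hypothetical benchmark result
list_price = 1_500_000   # USD; tested price does appear in SPC reports
power_draw = 8_000       # watts; NOT published in SPC reports (assumed)

iops_per_dollar = spc_iops / list_price  # what vendors tend to quote
iops_per_watt = spc_iops / power_draw    # the closer analog to MPG

print(f"IOPS per dollar: {iops_per_dollar:.3f}")
print(f"IOPS per watt:   {iops_per_watt:.1f}")
# IOPS/$ is a price ratio, not efficiency in the MPG sense;
# IOPS/W would be - but the input data isn't in the benchmark reports.
```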
No, the performance metric parallel we're looking for in the automotive world is very clearly Miles Per Hour. But as Tony points out, we all know that MPH isn't really all that useful in choosing the proper car in the real world, since almost all cars go faster than the speed limits (at least here in the US). Truth is, the vast majority of consumers don't buy cars based on which one wins at NASCAR or the Grand Prix.
So it clearly wouldn't have helped his argument any to relate SPC to MPH, because we all know MPH is irrelevant. And by association, that would admit that the SPC tests themselves might in fact be irrelevant.
but how much is good enough?
Here's the thing - neither SPC nor MPG ratings are true "tests" - their results are really only relative metrics, and there is no perfect score. In fact, the test creators don't know what a "good enough" score is, much less the best possible one. Unlike so-called "aptitude tests" (SAT, MCAS, IQ), neither SPC nor MPG really tells us anything about the ability (aptitude) of the test target to perform outside of the specific test criteria. And while the SAT or MCAS may provide some insights about a candidate's linguistic and mathematical abilities, neither offers the college registrar any real insight into a candidate's aptitude for, say, music, technology, or psychology.
Conversely, perhaps the biggest challenge with SPC and other standardized performance tests (SpecNFS, IOMeter, etc.) is not knowing what "good" or even "good enough" is. The predominant assertion is that "more is better," and that you always want to buy "the most you can for your money." But how do you know how much you really need? What if you could spend half as much money to get half as much performance and still meet your application's requirements and SLAs? Or say you spend $3.5M to get the top-rated performance configuration, only to find that it costs more to configure, operate, power & cool than you can afford? Or that its performance falls to a tiny fraction of the rated results while a disk drive is rebuilding or under the strain of synchronous remote replication?
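That "how much do you really need" question is a thresholding problem, not a ranking problem - here's a minimal sketch with hypothetical configurations and a hypothetical application requirement:

```python
# Select the cheapest configuration that satisfies the application's
# requirement, rather than the one with the biggest benchmark number.
# All configurations and numbers below are hypothetical.

required_iops = 45_000  # what the application actually needs (assumed)

configs = [
    {"name": "top-rated", "iops": 192_000, "price": 3_500_000},
    {"name": "mid-range", "iops": 95_000, "price": 1_600_000},
    {"name": "entry", "iops": 48_000, "price": 700_000},
]

# Filter to configs that meet the requirement, then pick the cheapest.
adequate = [c for c in configs if c["iops"] >= required_iops]
best_buy = min(adequate, key=lambda c: c["price"])
print(f"Cheapest adequate config: {best_buy['name']} "
      f"at ${best_buy['price']:,}")
# -> "entry" wins: "more is better" would have cost five times as much.
```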
No, tests like the SPC just aren't really all that helpful in making the appropriate storage selection.
FWIW, David Hitz had a little fun with this topic in his recent Lies, Damned Lies and Benchmark Results blog post. His point was that there are lots of different ways to analyze performance benchmarks, and you can come to different conclusions based on how you interpret them (although, not surprisingly, he was able to derive "NetApp is a little better," "NetApp is a lot better" and "NetApp is infinitely better" out of the same SpecFS results...go figure).
blogketing gone overboard?
Bottom line: my point is not that the SPC (or any other standardized test, for that matter) is bad. But to promote it as anything other than an interesting data point is to assign more importance to it than it deserves (IMHO).
Of course, that's what marketing is really all about, and given Tony's title and position (brand marketing, IBM storage), I know I really shouldn't expect anything else. But in the grand scheme of things, posting the best results for a benchmark that nobody can relate to the real world barely justifies a press release. The blogketing hype and "get under EMC's skin" response to the relevance challenge, the thinly veiled accusations that EMC is hiding something by not participating in SPC, followed by a week-long (semi-condescending) tutorial on performance metrics - all in defense of one little benchmark - well, I just think that's going more than a bit overboard to create relevance where it simply doesn't exist.
And trying to correlate SPC with MPG (instead of MPH) is really just obfuscating the argument - an approach that I think hurts the relevance case instead of helping it.
At least I'm now more than ever convinced that the SPC benchmark is pretty much as irrelevant as MPG!
But remember - YMMV!