Deconstructing disk performance rules: final thoughts

To summarize the discussion so far:

While my experience with rule-based approaches to computer performance leads me to be very skeptical of their ultimate value, I recognize they can be useful under many circumstances, especially if people understand their inherent limitations. For example, in the last couple of blog entries, I noted the usefulness of threshold rules for filtering the great quantities of esoteric performance data that can readily be gathered on Windows (and other computing platforms). The threshold rules implicitly select among the performance data to be gathered – after all, before you can perform the threshold test, you must first have acquired the data to be analyzed. Concentrating on measurement intervals where the threshold test succeeds also help you to narrow the search to periods of peak load and stress.

However, the mere mechanical iteration over some set of expert-derived performance rules, absent the judgment of an experienced performance analyst, is unlikely to be very useful as a diagnostic tool. A process that mechanically exercises a pre-defined set of threshold-based performance rules using settings arbitrarily defined by some “expert” or other without regard to the particulars of your application and environment is apt to be quite brittle. And if the rule happens to fire, signaling that a problem has possibly occurred, will you understand what to do next? Following confirmation of the diagnosis, will you even know how to begin to resolve the issue you have uncovered?

A crippling deficiency of static performance rules is that the best inference that can be made about whether any symptom of a performance problem that you can measure is “good” or “bad” is frequently, “It depends.” That is why I often try to turn the debate over what the proper setting for the threshold test should be into a discussion that explores the reasoning behind the setting. Knowing what insights the “expert” relied upon in formulating the Rule in the first place is often extremely helpful. Knowing what the rule threshold depends on turns out to be quite useful, especially as you proceed from diagnosis to cure.

*  *  *

Bottleneck analysis

In the last blog entry, I attempted to make this argument more concrete by drilling into a popular disk performance rule from the Microsoft experts behind the PAL tool. This rule primarily serves as a filter, a role that I can endorse without too much hesitation. I then looked at a similar, but more ambitious and more complex disk performance rule that was promulgated by a colleague of mine at Microsoft, Grant Holliday. Unbeknownst to me, this rule was formulated following work we did together to diagnose and solve a serious performance and scalability problem affecting a major server-based application. The rule attempts to encapsulate the analysis I performed to first diagnose the problem and then devise a remedy to fix it.

My style of working on an application performance investigation such as this one can appear bracingly unstructured, but it isn’t. Essentially, the method is one of bottleneck analysis, and is informed by queuing theory. At the outset I sift through large amounts of very detailed and esoteric measurement data very quickly. It may not look like I am being systematic & deliberate, but I am. During these initial stages, I do not always take the time to explain this analytic process. When the problem is diagnosed, of course, I am prepared to defend and justify the diagnosis. But at that point you are backtracking from the solution, crafting a narrative designed to convince and persuade, shorn of all the twists and wrong turns you took in actually arriving at your final destination.

A general method in a performance investigation is to hunt for a bottlenecked resource, defined as the one that saturates around the time response time (or throughput) also begins to back up. Analytically, I am searching for an elongation of queue time at some saturated resource (CPU, disk, network, software locking mechanism, all the usual suspects first) that also appears to be correlated with measures of application response time that are (hopefully) available. You formulate a tentative hypothesis about what bottlenecked resource could be responsible for the problem and then start testing the validity of that hypothesis. If there is a hypothesis that remains still credible after increasingly rigorous testing, that is the one to proceed with.

I can recognize in the disk performance diagnostic rule Grant created some of the line of reasoning I followed in the tuning project we worked on together. At least some of it has a familiar ring to it. The disk response time measurements in Windows include queue time, so an elongation in disk queue time would be reflected in those measurements. Since the disk response time measurements are difficult to interpret in isolation, I also like to examine the Current Disk Queue Length counters; to be sure, these measurements are samples taken at the end of each measurement interval, but they have the virtue of being a direct measurement of the length a disk device’s current request queue. If the physical disk entity known to Windows is actually a standalone physical disk, it is also possible to derive the disk service time. Then, you can back out the service time from the disk response time measurements taken at the device driver level and calculate the queue time (as detailed here).

Queuing for the disk is not always evident at the device driver level, however, and may not show up in the Windows counter data. Disk requests can back up inside an application like Microsoft SQL Server that performs its own scheduling of disk activity, for example. That turned out to be the case here, as a matter of fact.

My recollection is that in the tuning project I worked on with Grant disk response times were generally within acceptable levels of 15-20 ms. In other words, the rule Grant formulated in retrospect would not have been triggered in this instance. However, one of the physical disks was being hammered at rates ranging from 400-1400 requests/second. (The fact that a range of performance levels like that is expected on a physical disk that has caching enabled is discussed in the disk tuning article I referenced above.) I also observed Current Disk Queue Length measurements in the double digits during peak intervals. I needed to look at disk throughput, too, not just response times, which also strongly suggested the disk was saturated, which was not evident from the disk response time measurements alone.

So, it was clear to me pretty quickly that there was a disk device that was severely overloaded. I then had to make the link between what was happening at the disk and the request queue in the application that would also start to back-up when the disk became overloaded. When that causal link was established empirically, I thought we could proceed with confidence with a plan to replace the bottlenecked disk device with a much faster one. This would improve both throughput and response time. The new disk subsystem was quite expensive, and converting SQL Server to use it required a lengthy outage, but, by then, I was confident that the application would benefit enough to justify those costs. By the time we were able to make a switch, Grant had improved service level reporting for the TFS application in place, and he documented a 2x-5x improvement (depending on the request type) in application response time at the 90th percentile of all requests following the disk upgrade. The customer was satisfied with the result.

Deconstructing Grant’s Disk Performance Rule

Human reasoning proceeds from the particular to the general, from the concrete to the abstract. Socrates is a man. All men are mortal. Thus, Socrates is mortal. Q.E.D. We generate logical rules in an attempt to encapsulate this reasoning process into a formal one that can be executed repeatedly and with confidence.

But humans, by necessity, incorporate bounded rationality into their decision-making under conditions of uncertainty. Facing time constraints and lacking complete and reliable data, we still need to make timely decisions and execute effectively on them. (Daniel Dennett is particularly good on this subject in Elbow Room, I think.) Once we accept bounds on the formal reasoning process are necessary to arrive at timely decisions, it is difficult to know how far to relax those bounds. Knowing when to deviate from formal logic calls for experience and judgment, and these are precisely the aspects of expert decision-making are impossible to encapsulate in static, declarative rules. (We also learn from our mistakes, a process that machine learning approaches can successfully emulate, so there is hope for the Artificial Intelligence agenda.)

In this particular performance investigation that led to Grant’s formulating a disk performance rule, I was careful to explain to Grant what I was thinking and doing at some length since he was the public face of the project. I wanted to make sure that that Grant could, in the short term, faithfully report our findings up the chain of command, which he did. In the long term, I wanted him to be capable of carrying on independent of me in the various meetings and briefings he conducted and the memos and reports he prepared. Ultimately, I wanted him to produce a tuning guide that we could give customers faced with similar product scalability issues. I found Grant to be both willing & capable.

However, I am less than satisfied by the rule that Grant created to encapsulate the bottleneck analysis approach I employed. I find Grant’s rule a little too narrow to be genuinely useful, and certainly it would not have been adequate in the case we worked on together. Without digging considerably deeper, I would not be ready to jump to any conclusions, should this simple rule fire.

While I am not ready to throw out Grant’s proposed rule completely, I would argue that, at a minimum, it needs to be augmented by consideration of the IO rate, the block size of typical requests, the performance characteristics of the underlying physical disk, and the applications accessing the disk. There is a section in the Performance Guide (also available here in somewhat expanded form) that describes steps I recommend in a more thorough (and more complicated) investigation of a potential disk performance problem. I call it a “simplified” approach; at any rate, it is about as simple as I think I can make it, but it is considerably more complicated than any of the disk performance rules that PAL implements.

Final Thoughts on the Disk Performance Counters

In conclusion, as I discussed previously, the average disk response time measurements in Windows are difficult to interpret in isolation. The context is crucial to interpreting them correctly. For example:

  • These counters measure the average response time of disk requests. Response time includes time spent in either the disk driver request queue or the physical disk’s request queue, This complicates matters. Are these queuing mechanisms enabled? (That is usually a pretty good first question to ask if you suspect disk performance is a problem.) How much of the problem is due to poor device performance or is the disk simply overloaded? If you are able to answer that question, you already have a leg up in how to solve it.
  • Why choose 15 ms? Why not 20 or 25 ms? This isn‘t simply a question that is reducible to experts jockeying over what the best value for the threshold test is. The overall rate of IO requests and the block size of those requests impact the performance expectations that are reasonable for the underlying hardware. And, if there is potential for contention at the disk, that will factor into the response time measurements that include queue time.
  • Why choose Logical Disk, and not the associated measurements of the Physical Disk? Depending on how Logical Disks are mapped to Physical Disks, there could be built-in contention for the underlying physical disk.
  • Doesn’t it also depend on the performance characteristics of the underlying physical disk? What kind of physical disk entity is it? A standard SATA drive running at 7200 RPMs, a high end drive that spins at twice that speed, an SSD that doesn’t spin at all, a low power drive on a lightweight portable, or a logical volume carved out of an expensive SAN controller configuration? Each of these configuration options has different price/performance characteristics, and, consequently, a different set of expectations about what are acceptable and unacceptable levels of performance.
  • Won’t also it depend on how the application is accessing the disk? Is it accessing the file system sequentially, sucking up large blocks in a series of throughput-oriented requests? Or it requesting small blocks randomly distributed across the disk?
  • What about applications like back-up, search, virus scans, and database Table scans that intentionally monopolize the disk? These applications queue requests internally, so the driver is unable to measure queue time directly. What do the measurements look like under these circumstances when disk usage is scheduled at the application level? If one of those applications is running, how does that impact the performance of other applications that are trying to access the same disk concurrently?
  • Is caching enabled at the disk, the disk subsystem, in the application, or in multiple places? How effective is the caching? Doesn’t disk caching affect the pattern of physical disk access? What is the impact on the overall response times of requests when caching is enabled?

Ok, I admit I am deliberately messing with your head a little here to show how quickly Clint’s original simple and elegant Rule of Thumb with regard to expected levels of disk performance and Grant’s more nuanced version starts to sprout messy complications.

Crucially, Grant’s rule leaves out consideration of the IO rate. When the disk activity rate is low, then even quite poor disk response time will not have a significant impact on the performance of the application. On the other hand, if the disk IO rate is high enough, indicating saturation of the resource, the disk can be a significant bottleneck without any degradation of the disk response time measurements being observed, which is what I mainly observed in this case.

In addition, both the block size and the proportion of reads vs. writes is often very important, especially when it comes to figuring out what to do to remedy the situation. I am certain I discussed some of this with Grant at the time (and referred him to the Performance Guide for the rest), but none of those considerations found its way into Grant’s rule. My version of the rule would definitely include a clause that filters out situations where the overall disk activity rates are low. In addition, as the disk activity rate increases, I grow increasingly concerned about disk performance becoming the bottleneck that constrains the overall responsiveness or throughput of the application.

Finally, knowing what to do to resolve a disk performance problem that you have just diagnosed is crucial, but is something that Grant’s rule fails to touch upon. Once you understand the problem, then you have to start thinking about what the fix is. In this case, I felt that swapping out the overloaded device and replacing it with a device from a very expensive, cached enterprise class disk subsystem would relieve the bottleneck. (We anticipated that fixing the application code would take considerably longer.) But before making that suggestion, which was both disruptive and expensive, I needed to understand the way the application itself was architected. It was multi-tier, with a large SQL Server-based back-end and a large variety of web services that are used to communicate with the database.

The next step was to provide the technical input to justify to senior management a very costly disk upgrade, also an unanticipated one that had not been previously budgeted for. Not only that, I had to justify a significant operational outage allowing the data the application needed to be migrated to the new disk subsystem. Finally, I had to make sure after the fact that the change yielded a significant performance boost, warranting both the time and expense of everyone involved.