Deconstructing disk performance rules: final thoughts

To summarize the discussion so far:

While my experience with rule-based approaches to computer performance leads me to be very skeptical of their ultimate value, I recognize they can be useful under many circumstances, especially if people understand their inherent limitations. For example, in the last couple of blog entries, I noted the usefulness of threshold rules for filtering the great quantities of esoteric performance data that can readily be gathered on Windows (and other computing platforms). The threshold rules implicitly select among the performance data to be gathered – after all, before you can perform the threshold test, you must first have acquired the data to be analyzed. Concentrating on measurement intervals where the threshold test succeeds also helps you narrow the search to periods of peak load and stress.
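To make that filtering role concrete, here is a minimal sketch of the idea in Python. The interval labels, counter values, and the 25 ms threshold are all made-up placeholders, not recommended settings; the point is only that the rule selects which measurement intervals deserve a closer look.

```python
# A minimal sketch of a threshold rule used purely as a filter:
# keep only the measurement intervals worth a closer look.
# Counter values and the 25 ms threshold are illustrative, not a recommendation.

samples = [
    # (interval start, Avg. Disk sec/Transfer in seconds, Disk Transfers/sec)
    ("10:00", 0.004, 150),
    ("10:01", 0.031, 900),
    ("10:02", 0.008, 300),
    ("10:03", 0.027, 1100),
]

THRESHOLD_SECS = 0.025  # arbitrary example threshold for the filter

def peak_intervals(samples, threshold=THRESHOLD_SECS):
    """Return the intervals where disk response time exceeds the threshold."""
    return [s for s in samples if s[1] > threshold]

for interval, resp_time, iops in peak_intervals(samples):
    print(f"{interval}: {resp_time * 1000:.1f} ms at {iops} IOPS -- worth a closer look")
```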

However, the mere mechanical iteration over some set of expert-derived performance rules, absent the judgment of an experienced performance analyst, is unlikely to be very useful as a diagnostic tool. A process that mechanically exercises a pre-defined set of threshold-based performance rules using settings arbitrarily defined by some “expert” or other without regard to the particulars of your application and environment is apt to be quite brittle. And if the rule happens to fire, signaling that a problem has possibly occurred, will you understand what to do next? Following confirmation of the diagnosis, will you even know how to begin to resolve the issue you have uncovered?

A crippling deficiency of static performance rules is that, for almost any measurable symptom of a performance problem, the best answer to whether the value you observe is “good” or “bad” is frequently, “It depends.” That is why I often try to turn the debate over what the proper setting for the threshold test should be into a discussion that explores the reasoning behind the setting. Knowing what insights the “expert” relied upon in formulating the Rule in the first place is often extremely helpful. Knowing what the rule threshold depends on turns out to be quite useful, especially as you proceed from diagnosis to cure.

*  *  *

Bottleneck analysis

In the last blog entry, I attempted to make this argument more concrete by drilling into a popular disk performance rule from the Microsoft experts behind the PAL tool. This rule primarily serves as a filter, a role that I can endorse without too much hesitation. I then looked at a similar, but more ambitious and more complex disk performance rule that was promulgated by a colleague of mine at Microsoft, Grant Holliday. Unbeknownst to me, this rule was formulated following work we did together to diagnose and solve a serious performance and scalability problem affecting a major server-based application. The rule attempts to encapsulate the analysis I performed to first diagnose the problem and then devise a remedy to fix it.

My style of working on an application performance investigation such as this one can appear bracingly unstructured, but it isn’t. Essentially, the method is one of bottleneck analysis, and is informed by queuing theory. At the outset I sift through large amounts of very detailed and esoteric measurement data very quickly. It may not look like I am being systematic & deliberate, but I am. During these initial stages, I do not always take the time to explain this analytic process. When the problem is diagnosed, of course, I am prepared to defend and justify the diagnosis. But at that point you are backtracking from the solution, crafting a narrative designed to convince and persuade, shorn of all the twists and wrong turns you took in actually arriving at your final destination.

A general method in a performance investigation is to hunt for a bottlenecked resource, defined as the one that saturates around the time response time (or throughput) also begins to back up. Analytically, I am searching for an elongation of queue time at some saturated resource (CPU, disk, network, software locking mechanism, all the usual suspects first) that also appears to be correlated with measures of application response time that are (hopefully) available. You formulate a tentative hypothesis about which bottlenecked resource could be responsible for the problem and then start testing the validity of that hypothesis. If a hypothesis remains credible after increasingly rigorous testing, that is the one to proceed with.
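For readers who want the queuing-theory intuition spelled out, the toy calculation below shows why queue time balloons as a resource saturates. It assumes the simplest single-server open queueing model (M/M/1) with an illustrative 5 ms service time; this is a sketch of the reasoning, not a model of any particular system.

```python
# Sketch: why queue time elongates as a resource saturates.
# Assumes the simplest M/M/1 open queueing model; real devices and
# applications are messier, but the shape of the curve is the point.

def mm1_response_time(service_time_ms, utilization):
    """R = S / (1 - U) for an M/M/1 queue; queue time is R - S."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_ms / (1.0 - utilization)

service_time_ms = 5.0  # illustrative service time
for u in (0.50, 0.80, 0.90, 0.95, 0.99):
    r = mm1_response_time(service_time_ms, u)
    print(f"U={u:.0%}: response={r:6.1f} ms, of which queue time={r - service_time_ms:6.1f} ms")
```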

I can recognize in the disk performance diagnostic rule Grant created some of the line of reasoning I followed in the tuning project we worked on together. At least some of it has a familiar ring to it. The disk response time measurements in Windows include queue time, so an elongation in disk queue time would be reflected in those measurements. Since the disk response time measurements are difficult to interpret in isolation, I also like to examine the Current Disk Queue Length counters; to be sure, these measurements are samples taken at the end of each measurement interval, but they have the virtue of being a direct measurement of the length of a disk device’s current request queue. If the physical disk entity known to Windows is actually a standalone physical disk, it is also possible to derive the disk service time. Then, you can back out the service time from the disk response time measurements taken at the device driver level and calculate the queue time (as detailed here).
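As a rough sketch of that derivation, assuming a standalone physical disk, illustrative counter values, and the Utilization Law (utilization = throughput × service time), you can back out the service time and, from it, the queue time:

```python
# A sketch of backing out disk service time and queue time from the
# standard Windows physical disk counters, using the Utilization Law.
# Only meaningful when the "physical disk" really is a single standalone device.
# The numbers are illustrative.

idle_time_pct = 20.0               # PhysicalDisk\% Idle Time
transfers_per_sec = 400.0          # PhysicalDisk\Disk Transfers/sec
avg_disk_sec_per_transfer = 0.012  # PhysicalDisk\Avg. Disk sec/Transfer (response time)

utilization = (100.0 - idle_time_pct) / 100.0
service_time = utilization / transfers_per_sec          # Utilization Law: U = X * S
queue_time = avg_disk_sec_per_transfer - service_time   # response time = service + queue

print(f"utilization  = {utilization:.0%}")
print(f"service time = {service_time * 1000:.1f} ms")
print(f"queue time   = {queue_time * 1000:.1f} ms")
```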

Queuing for the disk is not always evident at the device driver level, however, and may not show up in the Windows counter data. Disk requests can back up inside an application like Microsoft SQL Server that performs its own scheduling of disk activity, for example. That turned out to be the case here, as a matter of fact.

My recollection is that in the tuning project I worked on with Grant, disk response times were generally within acceptable levels of 15-20 ms. In other words, the rule Grant formulated in retrospect would not have been triggered in this instance. However, one of the physical disks was being hammered at rates ranging from 400-1400 requests/second. (The fact that a range of performance levels like that is expected on a physical disk that has caching enabled is discussed in the disk tuning article I referenced above.) I also observed Current Disk Queue Length measurements in the double digits during peak intervals. I needed to look at disk throughput, too, not just response times; the throughput and queue length data strongly suggested the disk was saturated, something that was not evident from the disk response time measurements alone.
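A back-of-the-envelope check using Little’s Law (N = X × R) shows why those throughput and queue length numbers pointed at saturation even though the response times looked acceptable. The inputs below simply reuse the rough ranges quoted above as assumptions:

```python
# Little's Law (N = X * R) as a sanity check on the counters cited above.
# Inputs are the rough ranges from the text, not precise measurements.

def outstanding_requests(throughput_per_sec, response_time_ms):
    """N = X * R: average number of requests resident in the disk subsystem."""
    return throughput_per_sec * (response_time_ms / 1000.0)

for iops in (400, 1000, 1400):
    n = outstanding_requests(iops, response_time_ms=15)
    print(f"{iops:5d} IOPS at ~15 ms -> ~{n:4.1f} requests outstanding on average")
```

At the upper end of that range, the average number of outstanding requests is well into double digits, which lines up with the Current Disk Queue Length samples and with a device running at or near saturation.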

So, it was clear to me pretty quickly that there was a disk device that was severely overloaded. I then had to make the link between what was happening at the disk and the request queue in the application that would also start to back up when the disk became overloaded. When that causal link was established empirically, I thought we could proceed with confidence with a plan to replace the bottlenecked disk device with a much faster one. This would improve both throughput and response time. The new disk subsystem was quite expensive, and converting SQL Server to use it required a lengthy outage, but, by then, I was confident that the application would benefit enough to justify those costs. By the time we were able to make the switch, Grant had put improved service level reporting for the TFS application in place, and he documented a 2x-5x improvement (depending on the request type) in application response time at the 90th percentile of all requests following the disk upgrade. The customer was satisfied with the result.

Deconstructing Grant’s Disk Performance Rule

Human reasoning proceeds from the particular to the general, from the concrete to the abstract. Socrates is a man. All men are mortal. Thus, Socrates is mortal. Q.E.D. We generate logical rules in an attempt to encapsulate this reasoning process into a formal one that can be executed repeatedly and with confidence.

But humans, by necessity, incorporate bounded rationality into their decision-making under conditions of uncertainty. Facing time constraints and lacking complete and reliable data, we still need to make timely decisions and execute effectively on them. (Daniel Dennett is particularly good on this subject in Elbow Room, I think.) Once we accept that bounds on the formal reasoning process are necessary to arrive at timely decisions, it is difficult to know how far to relax those bounds. Knowing when to deviate from formal logic calls for experience and judgment, and these are precisely the aspects of expert decision-making that are impossible to encapsulate in static, declarative rules. (We also learn from our mistakes, a process that machine learning approaches can successfully emulate, so there is hope for the Artificial Intelligence agenda.)

In this particular performance investigation, the one that led to Grant’s formulating a disk performance rule, I was careful to explain to Grant what I was thinking and doing at some length, since he was the public face of the project. I wanted to make sure that Grant could, in the short term, faithfully report our findings up the chain of command, which he did. In the long term, I wanted him to be capable of carrying on independent of me in the various meetings and briefings he conducted and the memos and reports he prepared. Ultimately, I wanted him to produce a tuning guide that we could give customers faced with similar product scalability issues. I found Grant to be both willing & capable.

However, I am less than satisfied by the rule that Grant created to encapsulate the bottleneck analysis approach I employed. I find Grant’s rule a little too narrow to be genuinely useful, and certainly it would not have been adequate in the case we worked on together. Without digging considerably deeper, I would not be ready to jump to any conclusions, should this simple rule fire.

While I am not ready to throw out Grant’s proposed rule completely, I would argue that, at a minimum, it needs to be augmented by consideration of the IO rate, the block size of typical requests, the performance characteristics of the underlying physical disk, and the applications accessing the disk. There is a section in the Performance Guide (also available here in somewhat expanded form) that describes steps I recommend in a more thorough (and more complicated) investigation of a potential disk performance problem. I call it a “simplified” approach; at any rate, it is about as simple as I think I can make it, but it is considerably more complicated than any of the disk performance rules that PAL implements.

Final Thoughts on the Disk Performance Counters

In conclusion, as I discussed previously, the average disk response time measurements in Windows are difficult to interpret in isolation. The context is crucial to interpreting them correctly. For example:

  • These counters measure the average response time of disk requests. Response time includes time spent waiting in either the disk driver request queue or the physical disk’s request queue, which complicates matters. Are these queuing mechanisms enabled? (That is usually a pretty good first question to ask if you suspect disk performance is a problem.) How much of the problem is due to poor device performance, and how much to the disk simply being overloaded? If you are able to answer that question, you already have a leg up on how to solve it.
  • Why choose 15 ms? Why not 20 or 25 ms? This isn’t simply a question that is reducible to experts jockeying over what the best value for the threshold test is. The overall rate of IO requests and the block size of those requests impact the performance expectations that are reasonable for the underlying hardware. And, if there is potential for contention at the disk, that will factor into the response time measurements that include queue time. (See the sketch following this list.)
  • Why choose Logical Disk, and not the associated measurements of the Physical Disk? Depending on how Logical Disks are mapped to Physical Disks, there could be built-in contention for the underlying physical disk.
  • Doesn’t it also depend on the performance characteristics of the underlying physical disk? What kind of physical disk entity is it? A standard SATA drive running at 7200 RPMs, a high end drive that spins at twice that speed, an SSD that doesn’t spin at all, a low power drive on a lightweight portable, or a logical volume carved out of an expensive SAN controller configuration? Each of these configuration options has different price/performance characteristics, and, consequently, a different set of expectations about what are acceptable and unacceptable levels of performance.
  • Won’t it also depend on how the application is accessing the disk? Is it accessing the file system sequentially, sucking up large blocks in a series of throughput-oriented requests? Or is it requesting small blocks randomly distributed across the disk?
  • What about applications like back-up, search, virus scans, and database Table scans that intentionally monopolize the disk? These applications queue requests internally, so the driver is unable to measure queue time directly. What do the measurements look like under these circumstances when disk usage is scheduled at the application level? If one of those applications is running, how does that impact the performance of other applications that are trying to access the same disk concurrently?
  • Is caching enabled at the disk, the disk subsystem, in the application, or in multiple places? How effective is the caching? Doesn’t disk caching affect the pattern of physical disk access? What is the impact on the overall response times of requests when caching is enabled?
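To illustrate how quickly “it depends” piles up, here is the sketch promised above: a response time expectation that shifts with the kind of device and the access pattern. The device categories, the millisecond figures, and the queueing allowance are placeholders I made up for this example, not calibrated guidance.

```python
# Sketch: a disk response time expectation that depends on context.
# Device categories and the millisecond figures are illustrative only.

EXPECTED_SERVICE_MS = {
    # (device type, access pattern): rough expected service time in ms
    ("7200rpm", "random"): 12.0,
    ("7200rpm", "sequential"): 3.0,
    ("15krpm", "random"): 6.0,
    ("ssd", "random"): 0.5,
    ("san_cached", "random"): 2.0,
}

def acceptable_response_ms(device, pattern, queueing_allowance=2.0):
    """Scale the expectation by an allowance for some queueing at the device."""
    return EXPECTED_SERVICE_MS[(device, pattern)] * queueing_allowance

for key in EXPECTED_SERVICE_MS:
    print(f"{key}: flag if Avg. Disk sec/Transfer > {acceptable_response_ms(*key):.1f} ms")
```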

Ok, I admit I am deliberately messing with your head a little here to show how quickly Clint’s original simple and elegant Rule of Thumb with regard to expected levels of disk performance and Grant’s more nuanced version starts to sprout messy complications.

Crucially, Grant’s rule leaves out consideration of the IO rate. When the disk activity rate is low, then even quite poor disk response time will not have a significant impact on the performance of the application. On the other hand, if the disk IO rate is high enough, indicating saturation of the resource, the disk can be a significant bottleneck without any degradation showing up in the disk response time measurements, which is essentially what I observed in this case.

In addition, both the block size and the proportion of reads vs. writes are often very important, especially when it comes to figuring out what to do to remedy the situation. I am certain I discussed some of this with Grant at the time (and referred him to the Performance Guide for the rest), but none of those considerations found its way into Grant’s rule. My version of the rule would definitely include a clause that filters out situations where the overall disk activity rates are low. In addition, as the disk activity rate increases, I grow increasingly concerned about disk performance becoming the bottleneck that constrains the overall responsiveness or throughput of the application.
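For what it is worth, a sketch of the shape such an augmented rule might take appears below. It is not Grant’s rule, and it is not a finished rule of my own; every threshold in it is a placeholder that would need to be tuned to the environment. The point is structural: the response time test only fires when the disk is actually busy, and the context needed to act on the diagnosis (block size, read/write mix) travels along with the verdict.

```python
# A sketch of an augmented disk rule: the response time test only fires
# when the disk is actually busy, and the context needed for a remedy
# (block size, read/write mix) is carried along with the verdict.
# All thresholds here are placeholders to be tuned per environment.

def disk_rule(resp_time_ms, transfers_per_sec, queue_length,
              avg_bytes_per_transfer, pct_reads,
              min_iops=100, resp_threshold_ms=15, queue_threshold=5):
    if transfers_per_sec < min_iops:
        return "ignore: disk is nearly idle, response time is irrelevant"
    if resp_time_ms > resp_threshold_ms or queue_length > queue_threshold:
        return (f"investigate: {transfers_per_sec:.0f} IOPS, "
                f"{resp_time_ms:.1f} ms, queue={queue_length}, "
                f"{avg_bytes_per_transfer / 1024:.0f} KB/transfer, "
                f"{pct_reads:.0f}% reads")
    return "ok"

print(disk_rule(8, 1200, 12, 65536, 70))   # busy disk, decent latency, long queue
print(disk_rule(30, 40, 1, 8192, 90))      # slow but nearly idle: not worth an alert
```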

Finally, knowing what to do to resolve a disk performance problem that you have just diagnosed is crucial, but is something that Grant’s rule fails to touch upon. Once you understand the problem, then you have to start thinking about what the fix is. In this case, I felt that swapping out the overloaded device and replacing it with a device from a very expensive, cached enterprise class disk subsystem would relieve the bottleneck. (We anticipated that fixing the application code would take considerably longer.) But before making that suggestion, which was both disruptive and expensive, I needed to understand the way the application itself was architected. It was multi-tier, with a large SQL Server-based back-end and a large variety of web services that were used to communicate with the database.

The next step was to provide the technical input to justify to senior management a very costly disk upgrade, and an unanticipated one that had not been previously budgeted for. Not only that, I had to justify a significant operational outage to allow the data the application needed to be migrated to the new disk subsystem. Finally, I had to make sure after the fact that the change yielded a significant performance boost, warranting both the time and expense of everyone involved.

*  *  *

More on Performance Rules: Context is King

As discussed in the last blog entry, unfortunately, an automated, rules-based, expert systems approach to diagnosing performance-related problems turns out to be too brittle to be very effective. The simple threshold-based rules invoked by various authorities often need to be fleshed out with additional conditions and exceptions. Once the rule is burdened with all the predicates necessary to qualify an expert’s assessment of the data in context, the automated reasoning process starts to break down.

It turns out it isn’t so easy to encapsulate an expert’s knowledge and judgment into a simple, declarative rule. The expertise a performance analyst cultivates can involve pattern matching based on experience with many other incidents with similar problems encountered in the past. Where the human diagnostic expert often indulges in intuition based on that background and professional experience, it is difficult to craft a mathematical or logical rule that can accurately mimic that reasoning and decision-making process. We haven’t yet figured out how to get a computerized expert system to play a hunch or take an educated guess.

Moreover, “It depends” is often the right answer when it comes to setting an Alert threshold for many of the common performance metrics that are gathered for Windows – or any other type of machine, for that matter. Generating alerts based on measurements exceeding some pre-defined threshold value – as determined by some expert – that are genuinely useful usually requires a deeper understanding of what the threshold value depends on.

The fact that the experts often argue over what the Alert threshold for this or that Windows counter should be isn’t too surprising. The rules themselves often need to be interpreted in context – what is the workload, what kind of hardware, what is the application, etc. That is why I often try to turn the argument over what the proper setting for the rule is into a discussion of the reasoning underlying why the expert chose that particular threshold value. If you understand why the expert chose this or that threshold value for the rule, you have a much better chance at getting that rule to work for you in your environment.

Another issue you need to face when it comes to threshold settings for alerts is that in many, many cases what it depends on is what is customary in your specific environment. The shorthand for this dependency that I used in the Win2K3 Server Resource Kit book is along the lines of “Build alerts for important server application processes based on deviation from historical norms [emphasis added].”

For example, take the Context Switches/sec counter in Windows. It helps, of course, to have some basic understanding of what a context switch is in Windows and how context switches are counted. The “textbook” definition I provided in the Win2K3 Performance Guide reads,

A context switch occurs whenever the operating system stops one running thread and replaces it with another. This can happen because the thread that was originally running voluntarily relinquishes the processor, usually because it needs to wait until an I/O finishes before it can resume processing. A running thread can also be preempted by a higher priority thread that is ready to run, again, often due to an I/O interrupt that has just occurred. User mode threads also switch to a corresponding kernel mode thread whenever the User mode application needs to perform a privileged mode operating system or subsystem service.

The rate at which thread context switches occur is tallied at the Thread level and at the overall system level. While this is an intrinsically interesting statistic, there is very little that a system administrator normally can do to influence the rate that context switches occur on a machine with a given workload.

which, let’s face it, may or may not be all that helpful. A thread is a unit of execution, and there are typically hundreds of them, most of which are usually idle. When there is work to be done by a thread, though, the thread is scheduled for execution. A context switch occurs when the thread actually begins execution. Logically, there is a context switch (newthreadid, oldthreadid) event that occurs, and these are what are being counted. (Note: if the processor was idle at the time the new thread is scheduled for execution, the oldthreadid = the Idle thread, which is more of a bookkeeping mechanism than an actual execution thread.)
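As a toy illustration of that bookkeeping, the snippet below tallies a handful of made-up (newthreadid, oldthreadid) events per thread and system-wide; thread 0 stands in for the Idle thread:

```python
# Toy illustration of how context switch events are tallied.
# Each event records (new thread, old thread); thread 0 stands in for Idle.
from collections import Counter

events = [(8, 0), (12, 8), (8, 12), (0, 8), (12, 0), (8, 12)]

per_thread = Counter(new for new, _old in events)  # switches *to* each thread
total = len(events)                                # the system-wide tally

print("context switches (system):", total)
for tid, count in sorted(per_thread.items()):
    label = "Idle" if tid == 0 else f"thread {tid}"
    print(f"  switched to {label}: {count}")
```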

It certainly sounds like monitoring the rate at which context switches occur could be useful. So far as an alert threshold is concerned, however, there is simply no carved-in-stone, right or wrong threshold you can set based on the number of context switches per second that occur on a Windows machine. It depends on the workload. For additional context (pun intended), let’s look at some other measurements that are apt to be related to the number of context switches that occur.

The Perfmon screen shot in Figure 1 shows that the number of context switches that occur can vary a great deal during a typical execution interval. The data here, as you can see, was gathered over a 50 minute interval and ranges between 8K and 32K context switches per second, which is a considerable degree of variability. This will make building a statistical quality control alert “based on deviation from historical norms” a bit more challenging, because it means understanding what that customary range of behavior is. (I will return to the subject of the statistical quality control approach in more detail later. For now, see this earlier blog entry of mine from last year.)
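Here is a sketch of what an alert “based on deviation from historical norms” might look like for this counter. The baseline window and the three-sigma band are arbitrary illustrative choices, and the sample values are made up to resemble the 8K-32K range in the chart:

```python
# Sketch: a "deviation from historical norms" alert for Context Switches/sec.
# The baseline window and the 3-sigma band are arbitrary illustrative choices.
from statistics import mean, stdev

def build_baseline(history):
    """Summarize the customary range from historical interval samples."""
    return mean(history), stdev(history)

def out_of_range(value, baseline, sigmas=3.0):
    m, s = baseline
    return abs(value - m) > sigmas * s

history = [8_000, 12_500, 21_000, 32_000, 18_000, 15_000, 27_500, 9_500]
baseline = build_baseline(history)

for observed in (28_000, 105_000):
    verdict = "ALERT" if out_of_range(observed, baseline) else "within historical norms"
    print(f"{observed:>7,} context switches/sec: {verdict}")
```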

Figure 1. Charting the relationship between device interrupts (purple) and context switches (green) using Perfmon.

 

Knowing what a metric like context switches depends on can also be very helpful in determining whether the number you are seeing is problematic or not. That was why I gathered data on Interrupt rates at the same time. Under normal circumstances, each device interrupt that is processed will cause at least two thread context switches:

  1. an initial context switch that invokes the Interrupt Service routine, and
  2. a subsequent context switch to a user mode thread that is lying dormant, waiting to be woken up by the OS when the data transmitted to or from the device that caused the interrupt is completed.

The relationship between the two measurements is quite evident. (Be careful: Perfmon graphs wrap when charting in real time, so the data on the right side of the chart was gathered before the data on the left side.) These are not two independent variables. They are, in fact, closely related to each other. (The technical term is autocorrelation.) Context switches/sec is at least partially a function of the device interrupt rate.
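If you want to test that claim against your own counter logs, a simple correlation coefficient between the two series is a reasonable first check. The sample series below are made up for illustration:

```python
# Sketch: checking how strongly Context Switches/sec tracks Interrupts/sec.
# The two sample series below are made up for illustration.
from statistics import mean

interrupts = [1200, 2500, 4100, 3900, 2200, 5600, 6100, 3000]
context_switches = [3100, 5900, 9800, 9200, 5300, 13500, 14800, 7100]

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# A value close to 1.0 means the two counters move together, as argued above.
print(f"correlation coefficient: {pearson(interrupts, context_switches):.3f}")
```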

I initiated a large file copy operation on this machine after I started tracking the performance counters in Perfmon. Instead of tracking the overall interrupt rate, I could have also looked specifically at the number of disk interrupts that are occurring using the PhysicalDisk(_Total)\Disk Transfers/sec counter, which is especially helpful if the workload happens to be disk IO bound. In fact, if the workload is disk IO bound, as my file copy operation undoubtedly was, the number of context switches/sec that occurs is primarily an artifact of disk IO capacity. If the workload is disk IO bound and you are able to swap in a faster disk, the number of disk transfers will increase, with a corresponding increase in the number of context switches.

The spike in context switches on the right side of the graph is the result of some web server activity I also initiated once I got the file copy operation going. In the case of network IO requests, things are more complicated. Network Sends and Receives must traverse several layers of the TCP/IP stack, ultimately arriving at the application layer for processing. For example, the data from an http request to an IIS web server is handled in turn by a series of kernel and user mode threads before it finally arrives in your ASP.NET application for processing. In the case of a TCP Receive of an http packet, I would expect to see at least three or four thread context switches before it is finally processed in your ASP.NET application.

Let’s widen the context a little more. In Figure 2, both interrupts/sec and context switches/sec are shown in relationship to overall processor utilization. (I resorted here to serving up a chart from the NTSMF reporting portal that overlays interrupts/sec and context switches/sec over a stacked area graph that reports processor utilization per processor. Perfmon charts don’t do this type of reporting very well.) In Figure 2, the corresponding relationships between device interrupt handling, which includes all network IO requests on a web server machine, thread context switching, and the demand for overall processor resources are evident. Interrupts lead to context switches, which, from another perspective, also represent the units of processor work that need to be performed to service typical web requests.

Figure 2. The relationship between device interrupts, thread context switches and CPU utilization on a Windows machine.

The point is these are not independent measurements. Context switches, interrupts, and processor utilization are measuring related aspects of thread scheduling and thread execution time. From a statistical viewpoint, they are not only highly correlated, they are autocorrelated.

The fact that these metrics are all potentially related adds a whole new dimension to this discussion about performance rules and alerts. Perhaps how many context switches per second my machine can handle is more appropriately a question of how much processor, disk and network capacity I have on hand to field http or other network requests without starting to impact server responsiveness in my ASP.NET application. From a capacity planning perspective, we can also see that being able to calculate the amount of CPU time per network request on average, and then trending that data over time, is extremely useful.
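As a sketch of that capacity planning calculation, with made-up interval data standing in for real CPU and request counts:

```python
# Sketch: trending CPU cost per web request over time.
# Inputs per interval: total CPU busy seconds and requests served (made-up data).

intervals = [
    ("Mon", 1450.0, 510_000),
    ("Tue", 1520.0, 540_000),
    ("Wed", 1610.0, 545_000),
    ("Thu", 1980.0, 560_000),  # cost per request creeping up
]

for label, cpu_seconds, requests in intervals:
    ms_per_request = (cpu_seconds / requests) * 1000.0
    print(f"{label}: {ms_per_request:.2f} CPU ms per request")
```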

Ultimately, it is in this wider overall context that tracking a metric like context switches/sec makes sense. This is why raising an alert when a web server that normally handles 15-30K context switches per second suddenly is processing more than 100,000 context switches per second can be useful. The “deviation from historical norms” may represent something significant that you will want to investigate. The next step is drilling into the context switch data to see whether the relationship between device interrupts, context switches, and CPU processing that held in the past continues to hold true in the current situation. Has a disk controller suddenly gone bad and started spewing forth a ton of extra device interrupts? Or maybe it is a denial of service attack on your web server by some evil hackers. Or maybe Ashton Kutcher just tweeted about that cat photo you posted and your web server is being deluged with requests to view it.

Or maybe you’ve just gone online with a new rev of the application, and this is something that should never have gotten past the QA team.

Encapsulate that knowledge into a Performance Rule and you are in business.

Performance Rules!

Around the time that Odysseus Pentakalos and I were writing our original book (the Windows 2000 Performance Guide from O’Reilly), there were already several books in print that provided guidance on Windows NT performance topics. (Internally, Windows 2000 is Windows NT version 5.0, while the current Windows 7 OS is version 6.1). In my travels, I had read several of these, along with almost every technical article on the subject I could get my hands on.

While these Windows performance books all had some merit, I also found they had serious shortcomings, in my less than humble opinion. Unfortunately, none of them were written with the benefit of understanding the Windows operating system from the inside out, which was largely a black box until the publication of David Solomon’s original “Inside Windows NT” in 1998. (You can check out the review of the Solomon book I wrote for Amazon almost immediately after it was published here.)

Moreover, none of those early books relied on a systematic approach to computer performance that gathered measurement data and analyzed it rigorously under a variety of conditions. Only by using an empirical approach – computer science, to the extent it can be considered an actual “science” and not just an engineering discipline, is an empirical one – could you systematically & reliably determine how a Windows machine actually behaved when it was under stress or what would happen to it when you tried tweaking one of its many (often hidden) performance-oriented configuration options. Carefully gathering measurements of repeatable benchmarks run under varied conditions and analyzing the results is one of the cornerstones of the empirical approach I pursue.

I had the naïve notion that someone interested in this esoteric subject matter would be willing to invest the time and effort necessary to understand it in sufficient depth. But I found that some readers were disappointed that the book did not contain enough simple recipes – short cuts and other step-by-step procedures that could be followed by rote and that were guaranteed paths to success. When I wrote the second book, I made an extra effort to address this criticism, which struck me as a legitimate Reader reaction to the book that I had written. As much as I tried to adhere to Occam’s Razor in writing it, the book was short on simple recipes. We intended it as a guide book, not a cookbook. But I could appreciate that some Readers had bought the book because they faced critical performance problems that they were hoping to get practical advice on how to fix. Naturally, these Readers might become frustrated when they did not discover simple solutions to their problems.

The challenge, of course, is that I am not sure there are too many simple recipes for success in this field.

In the 2nd book, I tried to be much more explicit about the empirical approach to diagnosing computer performance problems. I tried to communicate clearly that the cookbook approach with simple recipes anyone could follow often would not suffice. (Sometimes, it is all about managing expectations.) In addition, I tried to include more concrete examples that I could discuss in detail, case studies that illustrated, methodically, step by step, a systematic, empirical approach. And I worked harder to formulate what crisp rules and recommendations I could, identifying those patterns and Best Practices in data collection, analysis, configuration and tuning that I thought were worthy.

Having learned something from writing the 1st book, the 2nd book was hopefully an improvement. But I can still imagine that some Readers of the 2nd book, which, being part of the official Windows Server 2003 documentation set, circulated much more widely, were frustrated to find fewer simple recipes for success than they had hoped.

So, while I am certainly sympathetic to the desire that many people have to purchase a set of concise prescriptions for success distilled into a Windows Performance Cookbook, obviously, I have been unable to produce one. This is not for lack of trying, because I am sure that I could sell considerably more copies of a book entitled “Windows Performance for Dummies” than the books I did write. It is because there are formidable obstacles to producing a decent, worthwhile book of fail-safe recipes.

Rule-based expert systems

Let me take a minute and explain. One popular cookbook-like approach to performance attempts to encapsulate the knowledge of expert practitioners into declarative rules. These rules selectively analyze some measurement data and test it against some threshold value. An experienced practitioner in the problem domain selects what data to look at, how to look at it (i.e., summarize it, calculate a ratio between two values, look for a consistent linear relationship between two values by calculating a correlation coefficient, etc.), and what values to use in the threshold tests. Programmatically, the rule is then evaluated as either unambiguously True or False in the current context. Computer programs that execute along these lines are known as expert systems.
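The shape such a system takes can be sketched in a few lines: declarative rules, each naming a counter, a comparison, and a threshold, handed to a generic engine that evaluates them as True or False. The two rules shown are placeholders, not recommended settings:

```python
# Minimal sketch of a declarative, threshold-based rules engine.
# Each rule names a counter, a comparison, and a threshold; the engine just
# evaluates them as True/False against the current measurements.
# The two rules shown are placeholders, not recommended settings.
import operator

RULES = [
    ("Disk response time high", "Avg. Disk sec/Transfer", operator.gt, 0.015),
    ("Processor saturated",     "% Processor Time",       operator.gt, 90.0),
]

def evaluate(rules, measurements):
    for name, counter, compare, threshold in rules:
        fired = compare(measurements[counter], threshold)
        yield name, fired

current = {"Avg. Disk sec/Transfer": 0.022, "% Processor Time": 45.0}

for name, fired in evaluate(RULES, current):
    print(f"{name}: {'FIRED' if fired else 'ok'}")
```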

Back when I was a grad student working on a degree in Computer Science, there were high hopes in artificial intelligence (AI) for expert systems. One of the more appealing aspects of the expert system approach was that you would not have to do much custom programming; some generic rules processing engine could do the bulk of the heavy lifting. What you would need instead was a knowledge engineer capable of encoding the domain-specific rules that an expert diagnostician followed. Once that expert knowledge was encapsulated in a set of declarative rules, a separate Rules engine would effectively be able to replicate those diagnostic procedures.

After I got my degree and started developing software that did computer performance analysis, I was among those who wanted to see to what extent these AI techniques could be successfully adapted to this problem domain. One thing that was clear was that performance analysts had access to huge amounts of diagnostic and other measurement data to sift through. Not having enough data wasn’t our problem, as it might be in medical diagnosis, another problem domain where people were hoping expert systems might help. Computer performance analysts were swimming in it. Building software that could automatically analyze that data and generate suggestions about how to act on it would be very helpful.

At the time I entered the field, there were already many experienced practitioners using tools like SAS to process and analyze copious amounts of computer performance measurement data. There were tools to build Performance Data Bases (PDBs), repositories for all this measurement data where you could track growth and detect changes over time. There was undoubtedly some rich ore here, if we could only figure out how to mine it. Vendor-developed tools that massaged this measurement data into a form where analytic queuing modeling techniques could be applied were also in widespread use. (I worked for some of these tool vendors.) These analytic queuing models provided a “deep” understanding of the scalability behavior of complex computer systems, offering valuable predictive capabilities.

At the time I also encountered many self-appointed “experts” proposing rules that defined both desirable and undesirable run-time computer system characteristics. This was commonly known as the “Rule of Thumb” (ROT) approach to diagnosing performance problems, to distinguish it, I suppose, from more precise analytic approaches, much as seafaring explorers needed to use dead reckoning instead of precise navigation techniques before the technology to build accurate seagoing clocks was available. A problem that arises almost immediately is that encapsulating these rough-hewn Rules of Thumb into rule definitions that could be processed by some AI-derived Rules engine requires that they be stated with precision. If these rules are to be executed by a computer program, they need to be precise; there is no way for the expert system to play a hunch or rely on intuition. (In theory, at least, this mechanical process of rule evaluation was what experts did to arrive at a decision, and computers could mimic that behavior. That some amount of mathematical-logical analysis of relevant data is a component of an expert’s decision-making process is probably true. But in rule-based expert systems, this component is the entire decision process. As an aside, I am not convinced that augmenting the rules with some combination of fuzzy logic and/or Bayesian inference to try to deal with the uncertainty inherent in many problem domains helps all that much. In the problem domain that I know – which is computer performance analysis – I know it doesn’t help that much.)

Many of these useful Rules of Thumb resisted being rendered with enough precision that they could be evaluated programmatically by an Expert System’s rules engine. When you tried to pin them down to a precise logical formulation, many of the ROTs postulated by the reigning domain experts incorporated so many additional conditions and qualifying predicates that I soon developed my own (tongue-in-cheek) Rule of Thumb characterizing them as largely unhelpful and, in some cases, even downright dangerous to apply. The ROT I formulated to characterize the adequacy of a diagnosis based on a ROT firing is as follows:

  1.  In evaluating the precise True/False value for the Rule, if the number of predicates qualifying the conditions under which the rule applies exceeds the number of predicates in the body of the Rule by a factor of 5, then the Rule itself should be discarded.

When enforced, Friedman’s Rule on performance Rules eliminates many of the rules proposed by the leading computer performance experts. Unfortunately for the sake of the rule-based approach, “It depends” is frequently the correct answer to most queries that ask if a measurement that exceeds some postulated threshold value is a valid indicator of a related performance problem. Friedman’s Rule on performance rules suggests that any rule so over-burdened with pre-conditions, post-mortems, and other qualifiers is probably not a useful rule. When there are so many reasons why the rule won’t work, it is not that reasonable a rule. (Lots of puns there, but you get the idea.)

In the next blog entry, I will give an example of a simple performance rule and then drill into some of the qualifications and conditions that you have to look for before it is at all reasonable to apply the rule.
