More on Performance Rules: Context is King

As discussed in the last blog entry, unfortunately, an automated, rules-based, expert systems approach to diagnosing performance-related problems turns out to be too brittle to be very effective. The simple threshold-based rules invoked by various authorities often need to be fleshed out with additional conditions and exceptions. Once the rule is burdened with all the predicates necessary to qualify an expert’s assessment of the data in context, the automated reasoning process starts to break down.

It turns out it isn’t so easy to encapsulate an expert’s knowledge and judgment into a simple, declarative rule. The expertise a performance analyst cultivates often involves pattern matching against the many similar incidents encountered in the past. Where the human diagnostic expert indulges in intuition grounded in that background and professional experience, it is difficult to craft a mathematical or logical rule that accurately mimics that reasoning and decision-making process. We haven’t yet figured out how to get a computerized expert system to play a hunch or take an educated guess.

Moreover, “It depends” is often the right answer when it comes to setting an Alert threshold for many of the common performance metrics that are gathered for Windows – or any other type of machine, for that matter. Generating genuinely useful alerts when measurements exceed some pre-defined threshold value – as determined by some expert – usually requires a deeper understanding of what that threshold value depends on.

The fact that the experts often argue over what the Alert threshold for this or that Windows counter should be isn’t too surprising. The rules themselves often need to be interpreted in context – what is the workload, what kind of hardware is it, what is the application, etc. That is why I often try to turn the argument over the proper setting for the rule into a discussion of the reasoning behind the expert’s choice of that particular threshold value. If you understand why the expert chose this or that threshold value for the rule, you have a much better chance of getting that rule to work for you in your environment.

Another issue you face when it comes to threshold settings for alerts is that, in many, many cases, what the threshold depends on is what is customary in your specific environment. The shorthand for this dependency that I used in the Win2K3 Server Resource Kit book is along the lines of “Build alerts for important server application processes based on deviation from historical norms [emphasis added].”
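By way of illustration, here is a minimal sketch of what a “deviation from historical norms” test might look like, assuming you already have a history of interval samples for the counter in question; the three-sigma band and the sample values are my own illustrative choices, not anything prescribed in the book:

    import statistics

    def deviates_from_historical_norms(samples, current_value, sigmas=3.0):
        """Flag current_value if it falls outside the customary range,
        defined here (illustratively) as mean +/- sigmas * standard deviation
        of the historical interval samples."""
        mean = statistics.mean(samples)
        stdev = statistics.pstdev(samples)
        return abs(current_value - mean) > sigmas * stdev

    # Example: a counter that customarily runs in the 8,000-32,000/sec range.
    history = [8000, 12500, 21000, 18000, 32000, 27500, 15000]
    print(deviates_from_historical_norms(history, 110000))  # True  -> raise an alert
    print(deviates_from_historical_norms(history, 24000))   # False -> customary range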

For example, take the context switches/sec counter in Windows. It helps, of course, to have some basic understanding of what a context switch is in Windows and how they are counted. The “textbook” definition I provided in the Win2K3 Performance Guide reads,

A context switch occurs whenever the operating system stops one running thread and replaces it with another. This can happen because the thread that was originally running voluntarily relinquishes the processor, usually because it needs to wait until an I/O finishes before it can resume processing. A running thread can also be preempted by a higher priority thread that is ready to run, again, often due to an I/O interrupt that has just occurred. User mode threads also switch to a corresponding kernel mode thread whenever the User mode application needs to perform a privileged mode operating system or subsystem service.

The rate at which thread context switches occur is tallied at the Thread level and at the overall system level. While this is an intrinsically interesting statistic, there is very little that a system administrator can normally do to influence the rate at which context switches occur on a machine with a given workload.

which, let’s face it, may or may not be all that helpful. A thread is a unit of execution, and there are typically hundreds of them, most of which are usually idle. When there is work for a thread to do, though, the thread is scheduled for execution. A context switch occurs when the thread actually begins executing. Logically, there is a context switch (newthreadid, oldthreadid) event, and these events are what is being counted. (Note: if the processor was idle at the time the new thread is scheduled for execution, the oldthreadid is the Idle thread, which is more of a bookkeeping mechanism than an actual execution thread.)

It certainly sounds like monitoring the rate at which context switches occur could be useful. So far as an alert threshold is concerned, however, there is simply no carved-in-stone, right-or-wrong threshold you can set based on the number of context switches per second that occur on a Windows machine. It depends on the workload. For additional context (pun intended), let’s look at some other measurements that are apt to be related to the number of context switches that occur.

The Perfmon screen shot in Figure 1 shows that the number of context switches that occur can vary a great deal during a typical execution interval. The data here, as you can see, was gathered over a 50 minute interval and ranges between 8K and 32K context switches per second, a considerable degree of variability. That variability makes building a statistical quality control alert “based on deviation from historical norms” a bit more challenging, because it means first understanding what that customary range of behavior is. (I will return to the statistical quality control approach in more detail later. For now, see this earlier blog entry of mine from last year.)

Figure 1. Charting the relationship between device interrupts (purple) and context switches (green) using Perfmon.

 

Knowing what a metric like context switches depends on can also be very helpful in determining whether the number you are seeing is problematic or not. That was why I gathered data on Interrupt rates at the same time. Under normal circumstances, each device interrupt that is processed will cause at least two thread context switches:

  1. an initial context switch that invokes the Interrupt Service routine, and
  2. a subsequent context switch to a user mode thread that is lying dormant, waiting to be woken up by the OS when the data transfer to or from the device that caused the interrupt completes.

The relationship between the two measurements is quite evident. (Be careful: Perfmon graphs wrap when charting in real time, so the data on the right side of the chart was gathered before the data on the left side.) These are not two independent variables. They are, in fact, closely related to each other. (The technical term is autocorrelation.) Context switches/sec is at least partially a function of the device interrupt rate.
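To make the point about dependence concrete, here is a small sketch that computes the correlation between paired interval samples of the two counters; the sample values are invented for illustration and roughly follow the two-context-switches-per-interrupt relationship described above:

    from statistics import correlation  # available in Python 3.10+

    # Hypothetical per-interval samples (values are illustrative only).
    interrupts_per_sec       = [4000, 5500, 9000, 12000, 15000, 7000]
    context_switches_per_sec = [9000, 12500, 19500, 26000, 31500, 15500]

    r = correlation(interrupts_per_sec, context_switches_per_sec)
    print(f"Pearson r = {r:.3f}")  # close to 1.0: the two counters move together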

I initiated a large file copy operation on this machine after I started tracking the performance counters in Perfmon. Instead of tracking the overall interrupt rate, I could also have looked specifically at the number of disk interrupts occurring, using the PhysicalDisk(_Total)\Disk Transfers/sec counter, which is especially helpful if the workload happens to be disk IO bound. In fact, if the workload is disk IO bound, as my file copy operation undoubtedly was, the number of context switches/sec that occurs is primarily an artifact of disk IO capacity. If the workload is disk IO bound and you are able to swap in a faster disk, the number of disk transfers will increase, with a corresponding increase in the number of context switches.

The spike in context switches on the right side of the graph is the result of some web server activity I also initiated once I got the file copy operation going. In the case of network IO requests, things are more complicated. Network Sends and Receives must traverse several layers of the TCP/IP stack, ultimately arriving at the application layer for processing. For example, the data from an http request to an IIS web server is handled in turn by a series of kernel and user mode threads before it finally arrives in your ASP.NET application for processing. In the case of a TCP Receive of an http packet, I would expect to see at least three or four thread context switches before it is finally processed in your ASP.NET application.

Let’s widen the context a little more. In Figure 2, both interrupts/sec and context switches/sec are shown in relation to overall processor utilization. (I resorted here to serving up a chart from the NTSMF reporting portal that overlays interrupts/sec and context switches/sec on a stacked area graph of processor utilization per processor. Perfmon charts don’t do this type of reporting very well.) In Figure 2, the corresponding relationships between device interrupt handling – which includes all network IO requests on a web server machine – thread context switching, and the demand for overall processor resources are evident. Interrupts lead to context switches, which, from another perspective, also represent the units of processor work that need to be performed to service typical web requests.

Figure 2. The relationship between device interrupts, thread context switches and CPU utilization on a Windows machine.

The point is these are not independent measurements. Context switches, interrupts, and processor utilization are measuring related aspects of thread scheduling and thread execution time. From a statistical viewpoint, they are not only highly correlated, they are autocorrelated.

The fact that these metrics are all potentially related adds a whole new dimension to this discussion of performance rules and alerts. Perhaps how many context switches per second my machine can handle is more appropriately a question of how much processor, disk, and network capacity I have on hand to field http or other network requests without starting to impact server responsiveness in my ASP.NET application. From a capacity planning perspective, we can also see that being able to calculate the average amount of CPU time per network request and then trend that data over time is extremely useful.
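As a rough illustration of that capacity planning calculation, here is one way CPU time per request could be derived from interval measurements of processor utilization and request throughput; the counter values and the 4-processor configuration are hypothetical:

    def cpu_ms_per_request(cpu_busy_percent, num_processors, requests_per_sec):
        """Average CPU time consumed per request during the interval, in
        milliseconds, derived from processor utilization and throughput."""
        cpu_seconds_per_sec = (cpu_busy_percent / 100.0) * num_processors
        return 1000.0 * cpu_seconds_per_sec / requests_per_sec

    # Hypothetical hourly observations from a 4-way web server.
    observations = [
        {"cpu": 35.0, "rps": 180.0},
        {"cpu": 42.0, "rps": 220.0},
        {"cpu": 55.0, "rps": 240.0},   # cost per request creeping up: worth investigating
    ]
    for obs in observations:
        print(round(cpu_ms_per_request(obs["cpu"], 4, obs["rps"]), 2), "ms per request")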

Ultimately, it is in this wider overall context that tracking a metric like context switches/sec makes sense. This is why raising an alert when a web server that normally handles 15-30K context switches per second is suddenly processing more than 100,000 context switches per second can be useful. The “deviation from historical norms” may represent something significant that you will want to investigate. The next step is drilling into the context switch data to see whether the relationship between device interrupts, context switches, and CPU processing that held in the past continues to hold in the current situation. Has a disk controller suddenly gone bad and started spewing forth a ton of extra device interrupts? Or maybe it is a denial of service attack on your web server by some evil hackers. Or maybe Ashton Kutcher just tweeted about that cat photo you posted and your web server is being deluged with requests to view it.

Or maybe you’ve just gone online with a new rev of the application, and this is something that should never have gotten past the QA team.

Encapsulate that knowledge into a Performance Rule and you are in business.
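For instance, a sketch of how that knowledge might be encapsulated, combining the historical-norms test with a check on whether the customary relationship between interrupts and context switches still holds, might look like the following; the thresholds here are illustrative assumptions, not recommended values:

    def context_switch_alert(cswitches_per_sec, interrupts_per_sec,
                             customary_high=30000, customary_ratio=2.5):
        """Illustrative rule: alert on context switch rates well above the
        historical norm, and report whether the usual relationship to the
        device interrupt rate still holds."""
        if cswitches_per_sec <= customary_high:
            return None  # within the customary range; nothing to report
        ratio = cswitches_per_sec / max(interrupts_per_sec, 1)
        if ratio > customary_ratio:
            return ("Context switches are high AND out of proportion to interrupts: "
                    "suspect the workload (new code, runaway polling, attack?)")
        return ("Context switches are high but still track interrupts: "
                "suspect the interrupt source (device, network load, bad controller?)")

    print(context_switch_alert(110000, 15000))   # high, out of proportion to interrupts
    print(context_switch_alert(110000, 50000))   # high, but tracks the interrupt rate
    print(context_switch_alert(22000, 9000))     # None: within the customary range

The interesting part is not the threshold test itself but the second check, which points the investigation toward either the workload or the interrupt source, depending on whether the historical relationship still holds.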

Performance Rules!

Around the time that Odysseus Pentakalos and I were writing our original book (the Windows 2000 Performance Guide from O’Reilly), there were already several books in print that provided guidance on Windows NT performance topics. (Internally, Windows 2000 is Windows NT version 5.0, while the current Windows 7 OS is version 6.1). In my travels, I had read several of these, along with almost every technical article on the subject I could get my hands on.

While these Windows performance books all had some merit, I also found they had serious shortcomings, in my less than humble opinion. Unfortunately, none of them were written with the benefit of understanding the Windows operating system from the inside out, which was largely a black box until the publication of David Solomon’s original “Inside Windows NT” in 1998. (You can check out the review of the Solomon book I wrote for Amazon almost immediately after it was published here.)

Moreover, none of those early books relied on a systematic approach to computer performance that gathered measurement data and analyzed it rigorously under a variety of conditions. Only by using an empirical approach – computer science, to the extent it can be considered an actual “science” and not just an engineering discipline, is an empirical one – could you systematically and reliably determine how a Windows machine actually behaved when it was under stress, or what would happen when you tried tweaking one of its many (often hidden) performance-oriented configuration options. Carefully gathering measurements of repeatable benchmarks run under varied conditions and analyzing the results is one of the cornerstones of the empirical approach I pursue.

I had the naïve notion that someone interested in this esoteric subject matter would be willing to invest the time and effort necessary to understand it in sufficient depth. But I found that some readers were disappointed that the book did not contain enough simple recipes – short cuts and other step-by-step procedures that could be followed by rote and were guaranteed paths to success. When I wrote the second book, I made an extra effort to address this criticism, which struck me as a legitimate Reader reaction to the book that I had written. As much as I tried to adhere to Occam’s Razor in writing it, the book was short on simple recipes. We intended it as a guide book, not a cookbook. But I could appreciate that some Readers had bought the book because they faced critical performance problems and were hoping for practical advice on how to fix them. Naturally, those Readers might become frustrated when they did not discover simple solutions to their problems.

The challenge, of course, is that I am not sure there are too many simple recipes for success in this field.

In the 2nd book, I tried to be much more explicit about the empirical approach to diagnosing computer performance problems. I tried to communicate clearly that the cookbook approach with simple recipes anyone could follow often would not suffice. (Sometimes, it is all about managing expectations.) In addition, I tried to include more concrete examples that I could discuss in detail, case studies that illustrated, methodically, step by step, a systematic, empirical approach. And I worked harder to formulate what crisp rules and recommendations I could, identifying those patterns and Best Practices in data collection, analysis, configuration and tuning that I thought were worthy.

Having learned something from writing the 1st book, I hope the 2nd book was an improvement. But I can still imagine that some Readers of the 2nd book – which, being part of the official Windows Server 2003 documentation set, circulated much more widely – were still frustrated to find fewer simple recipes for success than they had hoped.

So, while I am certainly sympathetic to the desire that many people have to purchase a set of concise prescriptions for success distilled into a Windows Performance Cookbook, obviously, I have been unable to produce one. This is not for lack of trying; I am sure that I could sell considerably more copies of a book entitled “Windows Performance for Dummies” than of the books I did write. It is because there are formidable obstacles to producing a decent, worthwhile book of fail-safe recipes.

Rule-based expert systems.

Let me take a minute and explain. One popular, cookbook-like approach to performance attempts to encapsulate the knowledge of expert practitioners into declarative rules. These rules selectively analyze some measurement data and test it against some threshold value. An experienced practitioner in the problem domain selects what data to look at, how to look at it (i.e., summarized over an interval, as a ratio between two values, as a correlation coefficient that tests for a consistent linear relationship between two values, etc.), and what values to use in the threshold tests. Programmatically, the rule is then evaluated as either unambiguously True or False in the current context. Computer programs that execute along these lines are known as expert systems.
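In skeletal form, the approach amounts to something like the following toy sketch of a rules engine; it illustrates the general idea only and is not modeled on any particular vendor's product, and the example rule and measurement names are made up:

    from dataclasses import dataclass
    from typing import Callable, Dict

    @dataclass
    class Rule:
        name: str
        applies: Callable[[Dict[str, float]], bool]   # qualifying predicates: does the rule apply here?
        fires: Callable[[Dict[str, float]], bool]     # the threshold test itself
        advice: str

    def evaluate(rules, measurements):
        """Return the advice from every applicable rule that tests True
        against the current set of measurements."""
        return [r.advice for r in rules
                if r.applies(measurements) and r.fires(measurements)]

    # A toy rule in the classic "counter exceeds threshold" mold.
    rules = [
        Rule(name="HighContextSwitching",
             applies=lambda m: m.get("is_web_server", 0) == 1,
             fires=lambda m: m["context_switches_per_sec"] > 100000,
             advice="Investigate interrupt sources and recent application changes."),
    ]

    print(evaluate(rules, {"is_web_server": 1, "context_switches_per_sec": 125000}))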

Back when I was a grad student working on a degree in Computer Science, there were high hopes in artificial intelligence (AI) for expert systems. One of the more appealing aspects of the expert system approach was that you would not have to do much custom programming; some generic rules processing engine could do the bulk of the heavy lifting. What you would need instead was a knowledge engineer capable of encoding the domain-specific rules that an expert diagnostician followed. Once that expert knowledge was encapsulated in a set of declarative rules, a separate Rules engine would effectively be able to replicate those diagnostic procedures.

After I got my degree and started developing software that did computer performance analysis, I was among those who wanted to see to what extent these AI techniques could be successfully adapted to this problem domain. One thing that was clear was that performance analysts had access to huge amounts of diagnostic and other measurement data to sift through. Not having enough data wasn’t our problem, as it might be in medical diagnosis, another problem domain where people were hoping expert systems might help. Computer performance analysts were swimming in it. Building software that could automatically analyze that data and generate suggestions about how to act on it would be very helpful.

At the time I entered the field, there were already many experienced practitioners using tools like SAS to process and analyze copious amounts of computer performance measurement data. There were tools to build Performance Data Bases (PDBs), repositories for all this measurement data where you could track growth and detect changes over time. There was undoubtedly some rich ore here, if we could only figure out how to mine it. Vendor-developed tools that massaged this measurement data into a form where analytic queuing modeling techniques could be applied were also in widespread use. (I worked for some of these tool vendors.) These analytic queuing models provided a “deep” understanding of the scalability behavior of complex computer systems, offering valuable predictive capabilities.

At the time I also encountered many self-appointed “experts” proposing rules that defined both desirable and undesirable run-time computer system characteristics. This was commonly known as the “Rule of Thumb” (ROT) approach to diagnosing performance problems, to distinguish it, I suppose, from more precise analytic approaches, much as seafaring explorers needed to use dead reckoning instead of precise navigation techniques before the technology to build accurate seagoing clocks was available. A problem that arises almost immediately is that encapsulating these rough-hewn Rules of Thumb into rule definitions that could be processed by some AI-derived Rules engine requires that they be stated with precision: to be executed by a computer program, the rules need to be precise. There is no way for the expert system to play a hunch or rely on intuition. (In theory, at least, this mechanical process of rule evaluation was what experts did to arrive at a decision, and computers could mimic that behavior. That some amount of mathematical-logical analysis of relevant data is a component of an expert’s decision-making process is probably true. But in rule-based expert systems, this component is the entire decision process. As an aside, I am not convinced that augmenting the rules with some combination of fuzzy logic and/or Bayesian inference to try to deal with the uncertainty inherent in many problem domains helps all that much. In the problem domain that I know – computer performance analysis – I know it doesn’t help that much.)

Many of these useful Rules of Thumb resisted being rendered with enough precision that they could be evaluated programmatically by an Expert System’s rules engine. When you tried to pin them down to a precise logical formulation, many of the ROTs postulated by the reigning domain experts incorporated so many additional conditions and qualifying predicates that I soon developed my own (tongue-in-cheek) Rule of Thumb characterizing them as largely unhelpful and, in some cases, even downright dangerous to apply. The ROT I formulated to characterize the adequacy of a diagnosis based on a ROT firing is as follows:

  1.  In evaluating the precise True/False value for the Rule, if the number of predicates qualifying the conditions under which the rule applies exceeds the number of predicates in the body of the Rule by a factor of 5, then the Rule itself should be discarded.
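Stated as code, the meta-rule itself is trivially simple; the point is the bookkeeping it forces you to do about how heavily qualified a rule really is (the predicate counts below are hypothetical):

    def discard_rule(num_qualifying_predicates, num_body_predicates, factor=5):
        """Friedman's (tongue-in-cheek) Rule on performance Rules: discard any
        rule whose qualifying conditions outnumber its body by the given factor."""
        return num_qualifying_predicates > factor * num_body_predicates

    # A one-predicate threshold test hedged by a dozen "it depends" conditions.
    print(discard_rule(num_qualifying_predicates=12, num_body_predicates=1))  # True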

When enforced, Friedman’s Rule on performance Rules eliminates many of the rules proposed by the leading computer performance experts. Unfortunately for the rule-based approach, “It depends” is frequently the correct answer to most queries that ask whether a measurement that exceeds some postulated threshold value is a valid indicator of a related performance problem. Friedman’s Rule on performance rules suggests that any rule so over-burdened with pre-conditions, post-mortems, and other qualifiers is probably not a useful rule. When there are so many reasons why the rule won’t work, it is not that reasonable a rule. (Lots of puns there, but you get the idea.)

In the next blog entry, I will give an example of a simple performance rule and then drill into some of the qualifications and conditions that you have to consider before it is at all reasonable to apply the rule.


Not quite ready for Twitter

It might be helpful to set a few expectations up front for Readers of this blog about what kind of blogging you can expect from me, with the added caveat that I believe such expectations are made to be broken…

As an adjunct to my day job as a software developer, I have written and published numerous articles on topics related to computer performance over the years. Since about 1996, I have focused on Windows, so I was mainly writing and publishing articles on Windows performance and scalability topics. (If you are interested, many of these early articles are available on the DemandTech web site here: http://demandtech.com/knowledge_perspectives.html. More concise answers written in response to FAQs are published here: http://faq.demandtech.com/.) In the past, I often published similar material in industry newsletters, trade magazines, and the official publications of the Computer Measurement Group, a professional association that I joined in 1983 and remained active in for years.

With a colleague, Dr. Odysseus Pentakalos, I eventually organized these Windows articles into a book called Windows 2000 Performance Guide, published in 2002 by O’Reilly. (My editor at O’Reilly was responsible for the title, not me. That may have been the name of the book the publisher wished Odysseus and I had written.) I decided to write the book mainly because I discovered a shortage of solid, reliable guidance on performance topics for Microsoft Windows developers and system administrators as I began working to develop performance tools for that platform.

Later, and with the support and cooperation of the Microsoft Windows Server Performance team, I updated that original book to create a volume available in the official Windows Server 2003 Resource Kit, also entitled a Performance Guide, published in 2005 by Microsoft Press. Because the Win2K3 book was written with the active assistance of Microsoft and the material I wrote was vetted by members of the development staff, the 2nd book is certainly the more authoritative of the two. It also benefitted from feedback I received from Readers of the 1st book and from attendees of a seminar I taught on Windows performance that covered much of the same ground as the books.

Today, that book on Windows Server 2003 performance (internally, Windows Server 2003 is Windows NT version 5.2) unquestionably needs updating to reflect changes to the Windows operating system since 2003. Advances in computer hardware over the interim also contribute to the need for a new edition. For instance, there is very little information in either of those books for systems professionals interested in the performance implications of

·         multi-core processors,
·         solid state disks,
·         virtualization,
·         cloud computing,

and a wide range of other “hot” topics of more recent vintage. Nor is there very much solid information in those books for computer professionals responsible for the scalability and performance of large scale web sites running on Windows Servers using technology associated with ASP.NET, Microsoft SQL Server, and the .NET Framework in general.

As it happens, these are some of the topics in computer performance that I have been actively investigating in the past 5-6 years or so since the publication of the Win2K3 book. I plan to use this blog to consolidate some of the material I have written since the Win2K3 book that I view as “work in progress” towards a next edition of the Windows Performance Guide, filling in some of the major gaps I mentioned earlier. I will also publish newer material based on current topics of interest to me.

Anticipating a question that arises frequently from fans of my two Windows Performance Guide books, I would like to explain that a complete and up-to-date revision requires both a willing publisher and enough spare cycles on my part to execute. Since I expect a willing publisher would not be too difficult to find, what is lacking, to be totally honest, is the time on my part to execute a thorough revision. The process of assembling and polishing a book for print is both grueling and time-consuming. Realistically, I don’t see it happening. So the best I can offer in the interim is this blog, where I will post on topics that are either missing from the book or need serious revision in light of subsequent changes to the hardware and/or software.

A blog works for me. The phenomenon of self-publishing on the Internet is something I have gotten interested in, too, having set up and contributed to a team blog during my recent 4-year stint working in the Developer Division at Microsoft, and occasionally ghostwritten an entry for one of the executives there to publish. You are not likely to discover a book that I might write in the check-out line of your local bookstore – assuming you still have a local bookstore; the rewards of writing such a book are certainly not financial. Visually, the blogging format is often less than adequate for the type of writing I do – I am hoping the latest version of Blogspot is more flexible.

I suppose I am not the normal blogger, attuned to the short attention span that characterizes most content on the web. If you care to check out the team blog I started at Microsoft (for a representative example, see http://blogs.msdn.com/b/ddperf/archive/2008/06/10/mainstream-numa-and-the-tcp-ip-stack-part-i.aspx), you should get a good flavor of my blogging style. You can see that, as a writer, I am not quite ready for Twitter. I am more likely to publish a longer piece with considerable research behind it once every two months or so than something short and pithy three times a week. Whenever I have something substantial of chapter length ready to go, I will likely chop it into pieces of more convenient size for publishing on the web and blog about it in installments.
