Measuring Processor Utilization in Windows and Windows applications: Part 2

An event-driven approach to measuring processor execution state.

The limitations of the legacy approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. The legacy sampling approach is doubtless very efficient, but the measurement facility is deeply embedded in the OS kernel’s Scheduler, a chunk of code that is very risky to tamper with. More efficient power management, something that is crucial for battery-powered Windows devices, strongly argues for an event-driven alternative, however: you do not want the OS to wake up from a low power state on an idle machine at regular intervals just to perform its CPU usage accounting duties.

A straightforward alternative to periodically sampling the processor execution state is to measure the time spent in each processor state directly, by instrumenting the processor state transitions themselves. Processor state transitions in Windows are known as context switches. A context switch occurs whenever the processor switches execution context to run a different thread. Processor state transitions also occur as a result of high priority Interrupt Service Routines (ISRs) gaining control following a device interrupt, as well as the Deferred Procedure Calls (DPCs) that ISRs schedule to complete the interrupt processing. By recording the time at which each context switch occurs, it is possible to construct a complete and accurate picture of CPU consumption.
(See the two-part article “Core OS Events in Windows 7” by Insung Park and Alex Bendetovers, published in MSDN Magazine beginning in September 2009. The authors are, respectively, the architect and lead developer of the ETW infrastructure. The article provides a conceptual overview of how to use the various OS kernel events to reconstruct a state machine for processor execution, along with other diagnostic scenarios. Park and Bendetovers report, “In state machine construction, combining Context Switch, DPC and ISR events enables a very accurate accounting of CPU utilization.”)

It helps to have a good general understanding of thread scheduling in the OS in order to interpret this stream of events. Figure 3 is a diagram depicting the state machine associated with thread execution. At any point in time, a thread can be in only one of the three states indicated: Waiting, Ready, or Running. The state transition diagram shows the changes in execution state that can occur. A Waiting thread is usually waiting for some event to occur: a Wait timer to expire, an IO operation to complete, a mouse click or keystroke that signals user interaction with the application, or a synchronization event from another thread indicating it is OK to continue processing.

A thread that is Ready to run is placed in the Dispatcher’s Ready Queue, which is ordered by priority. When a processor becomes available, the OS Scheduler selects the highest priority thread on the Ready Queue and schedules it for execution on that processor. Once it is running, a thread remains in the Running state until it completes its execution cycle and transitions back to the Wait state. An executing thread can also be preempted because a higher priority execution unit needs to run (this is known as preemptive scheduling) or interrupted by the OS Scheduler because its time slice has expired. A Running thread can also be delayed by a page fault, which occurs when it accesses data or an instruction in virtual memory that is not currently resident in physical memory. These thread execution delays are often referred to as involuntary waits.


Figure 3. A state machine for thread execution.

Figure 4 associates these thread execution state transitions with the ETW events that record when these transitions occur. The most important of these is the CSwitch event record that is written on every processor context switch. The CSwitch event record indicates the thread ID of the thread that is entering the Running state (the new thread id), the thread ID that was displaced (the old thread ID) and provides the Wait Reason code associated with an old thread ID that is transitioning from Running back to the Wait state. The processor number indicating which logical CPU has undergone this state change is provided in an ETW_Buffer_Context structure associated with the ETW standard record header. In addition, it is necessary to know that Thread 0 from Process 0 indicates the Idle thread, which is dispatched on a processor whenever there are no Ready threads waiting for execution. While a thread other than the Idle thread is “active,” the CPU is considered busy.
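
To make the discussion concrete, here is a minimal sketch, in Python, of the handful of CSwitch-related fields that the accounting described below relies on. The field names are illustrative shorthand, not the exact ETW payload layout, and the real event record carries additional fields that are omitted here.

```python
from dataclasses import dataclass

@dataclass
class ContextSwitchEvent:
    """The subset of context switch information needed for CPU accounting."""
    timestamp: int               # when the context switch occurred
    cpu: int                     # logical processor number, taken from the
                                 # ETW buffer context, not the event payload
    new_thread_id: int           # thread entering the Running state
    old_thread_id: int           # thread being displaced from the processor
    old_thread_wait_reason: int  # why the old thread stopped running
```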

Conceptually, a context switch event is something like a processor state switch(oldThreadId, newThreadId), with a time stamp identifying when the context switch occurred. The CPU time of a thread is precisely the amount of time it spends in the Running state. It can be measured using the CSwitch events that show the thread transitioning from Ready to the Running state and the subsequent CSwitch events that show the same thread transitioning from Running back to Waiting. To calculate how busy a processor is, you sum the amount of time the processor spends running the Idle thread over the measurement interval and subtract that from 100%.
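
Here is a minimal sketch of that calculation, assuming the context switch events sketched above have been captured without loss and sorted by time stamp. Thread ID 0 stands in for the Idle thread, and any time on a CPU before its first observed switch in the interval is simply ignored.

```python
from collections import defaultdict

IDLE_THREAD_ID = 0  # Thread 0 of Process 0 is the Idle thread

def cpu_busy_percent(events, interval_start, interval_end):
    """Compute per-CPU % busy from a time-ordered list of context switch events."""
    idle_time = defaultdict(int)  # cpu -> accumulated time in the Idle thread
    running_since = {}            # cpu -> (thread_id, time stamp of last switch)

    for ev in events:
        if ev.cpu in running_since:
            thread_id, since = running_since[ev.cpu]
            if thread_id == IDLE_THREAD_ID:
                idle_time[ev.cpu] += ev.timestamp - since
        running_since[ev.cpu] = (ev.new_thread_id, ev.timestamp)

    # Close out whatever was still running at the end of the interval.
    for cpu, (thread_id, since) in running_since.items():
        if thread_id == IDLE_THREAD_ID:
            idle_time[cpu] += interval_end - since

    interval = interval_end - interval_start
    return {cpu: 100.0 * (1.0 - idle_time[cpu] / interval)
            for cpu in running_since}
```

Per-thread CPU time falls out of the same bookkeeping: instead of accumulating Idle thread time per processor, accumulate Running time per old thread ID each time that thread is switched out.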


Figure 4. The state transition diagram for thread execution, indicating the ETW trace events that mark thread state transitions.

One complication in this approach is that the ETW infrastructure does not guarantee delivery of every event to a Listener application. If the Listener application cannot keep up with the stream of events, ETW will drop memory-resident buffers filled with events rather than queue them for delivery later. CSwitch events can occur at very high rates; 20,000 to 40,000 per second per CPU is not unusual on busy machines, so there is definitely the potential to miss enough context switch events to bias the resulting calculations. In practice, handling the events efficiently in the Listener application and making appropriate adjustments to the ETW buffering options minimize the potential for missed events.

To see this event-driven processor execution state measurement facility at work, access the Resource Monitor application (resmon.exe), which is available beginning in Vista and Windows Server 2008. Resource Monitor can be launched directly from the command line, or from either the Performance Monitor plug-in or the Task Manager Performance tab. Figure 5 is a screen shot showing Resource Monitor in action on a Windows 7 machine, calculating CPU utilization over the last 60 seconds of operation and breaking out that utilization by process. The CPU utilization measurements that ResMon calculates are based on the context switch events. These measurements are very accurate, about as good as it gets from a vantage point inside the OS.


Figure 5. The Windows 7 Resource Monitor application.

 

Resource Monitor measures CPU busy in real time by listening to the ETW event stream, which generates an event every time a context switch occurs. It also produces similar reports from memory, disk, and network events.

To summarize these developments, this trace-driven measurement source positions the Windows OS to replace its legacy CPU measurement facility with something more reliable and accurate sometime in the near future. Unfortunately, converting all the existing features in Windows, including Perfmon and Task Manager, to support the new measurements is a big job, and not always as straightforward as one would hope. But we can look forward to future versions of the Windows OS where an accurate, event-driven approach to measuring processor utilization supplants the legacy sampling approach that Task Manager and Perfmon rely on today.

In the next blog entry in this series, I will show a quick example using xperf to calculate the same CPU utilization metrics from the ETW event stream. I will point xperfview at an .etl file gathered during the same measurement interval as the one illustrated in Figure 5 using ResMon.

Measuring Processor Utilization in Windows and Windows applications: Part 1

Introduction.

This blog entry discusses the legacy technique for measuring processor utilization in Windows, which is based on sampling, and compares and contrasts it with other sampling techniques. It also introduces newer techniques for measuring processor utilization in Windows that are event-driven. The event-driven approaches are distinguished by far greater accuracy. They also entail significantly higher overhead, but measurements indicate this overhead is well within acceptable bounds on today’s high-powered server machines.

As of this writing, Windows continues to report measurements of processor utilization based on the legacy sampling technique. The more accurate measurements that are driven using an event-driven approach are gaining ground and can be expected to supplant the legacy measurements in the not too distant future.

While computer performance junkies like me positively salivate at the prospect of obtaining more reliable and more precise processor busy metrics, the event-driven measurements do leave several very important issues in measuring CPU utilization unresolved. These include validity and reliability issues that impact the accuracy of most timer-based measurements when Windows is running as a guest virtual machine under VMware, Xen, or Hyper-V. (In an aside, mitigation techniques for avoiding some of the worst measurement anomalies associated with virtualization are discussed.)

A final topic concerns characteristics of current Intel-compatible processors that undermine the rationale for using measurements of CPU busy that are based solely on thread execution. This section outlines the appeal of using internal hardware measurements of the processor’s instruction execution rate in addition to, or instead of, wall-clock timing of thread execution state. While I do try to make the case for using internal hardware measurements of the processor’s instruction execution rate to augment more conventional measures of CPU busy, I will also outline some of the difficulties that advocates of this approach encounter when they attempt to put it into practice today.

Sampling processor utilization.

The methodology used to calculate processor utilization in Windows was originally designed 20 years ago for Windows NT. Because a key design goal of Windows NT was hardware independence, the measurement methodology was deliberately not tied to any specific set of processor hardware measurement features.

The familiar % Processor Time counters in Perfmon are measurements derived using a sampling technique. The OS Scheduler samples the execution state of the processor once per system clock tick, driven by a high priority timer-based interrupt. Currently, the clock tick interval the OS Scheduler uses is usually 15.6 ms, roughly 64 clock interrupts per second. (The precise interval between timer interrupts is available by calling the GetSystemTimeAdjustment() function.) If the processor is running the Idle loop when the quantum interrupt occurs, it is recorded as an Idle Time sample. If the processor is running some application thread, that is recorded as a busy sample. Busy samples are accumulated continuously at both the thread and process level.
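
The following toy simulation (a sketch of the accounting idea, not how the kernel is actually coded) illustrates the technique: one busy or idle sample is recorded at each clock tick, and % Processor Time falls out as the busy fraction of the samples.

```python
CLOCK_TICK_SECONDS = 0.0156  # roughly 64 Scheduler clock interrupts per second

def sampled_busy_percent(duration_seconds, is_busy_at):
    """Record one busy or idle sample per clock tick and report the busy fraction."""
    busy = idle = 0
    t = 0.0
    while t < duration_seconds:
        if is_busy_at(t):   # the execution state at the instant of the tick
            busy += 1
        else:
            idle += 1
        t += CLOCK_TICK_SECONDS
    return 100.0 * busy / (busy + idle)

# Example: a workload that runs in 5 ms bursts out of every 20 ms (about 25% busy).
# Over a 60 second interval the sampled estimate lands close to that figure.
print(sampled_busy_percent(60.0, lambda t: (t % 0.020) < 0.005))
```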

Figure 1 illustrates the calculation of CPU time based on this sampling of the processor execution state as reported in the Performance tab of the Windows Task Manager.


Figure 1. The Performance tab of the Windows Task Manager reports processor utilization based on sampling the processor execution state once every quantum, approximately 64 times per second.

When a clock interrupt occurs, the Scheduler performs a number of other tasks, including adjusting the dispatching priority of threads that are currently executing, with the intention of stopping the progress of any thread that has exceeded its time slice. Because the same high priority OS Scheduler clock interrupt used for CPU accounting also implements processor time-slicing, the interval between Scheduler interrupts is known as the quantum. At one time in Windows NT, the quantum was set based on the speed of the processor: the faster the processor, the more frequently the OS Scheduler would gain control. Today, however, the quantum value is constant across all processors.

Another measurement function performed by the OS Scheduler’s clock interrupt is to take a sample of the length of the processor queue, which consists of Ready, but waiting, threads. The System\Processor Queue Length counter in Perfmon is an instantaneous counter that reflects the last measurement taken by the OS Scheduler’s clock interrupt of the current number of Ready threads waiting in the OS Scheduler queue. Thus, the System\Processor Queue Length counter represents one sampled observation, and needs to be interpreted with that in mind.

The Processor Queue Length metric is sometimes subject to anomalies due to the kind of phased behavior you can see on an otherwise idle system. Even on a mostly idle system, a sizable number of threads can be waiting on the same clock interrupt (typically, polling the system state once per second), one of which happens to be the Perfmon measurement thread, also cycling once per second. These sleeping threads tend to clump together so that they are woken up at the exact same time by the timer interrupt. (As I mentioned, this happens mainly when the machine is idling with little or no real work to do.) The awakened threads then flood the OS dispatching queue at exactly the same time. If one of these threads is the Perfmon measurement thread that gathers the Processor Queue Length measurement, you can see how this “clumping” behavior distorts the measurements. The Perfmon measurement thread executes at an elevated priority level of 15, so it is scheduled for execution ahead of any other User mode threads that were awakened by the same Scheduler clock tick. The effect is that at the precise time the processor Ready queue length is measured, there are likely to be a fair number of Ready threads. Compared to the modeling assumption where processor scheduling is subject to random arrivals, one observes a disproportionate number of Ready threads waiting for service, even (or especially) when the processor itself is not very busy overall.

This anomaly is essentially a low-utilization effect that perturbs the measurement when the machine is loafing. It generally ceases to be an issue when processor utilization climbs or there are more available processors on the machine. But this bunching of timer-based interrupts remains a serious concern, for instance, whenever Windows is running as a guest virtual machine under VMware or Hyper-V. Another interesting side discussion is how this clumping of timer-based interrupts interacts with power management, but I do not intend to venture further into that subject here.
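
The following rough simulation of the clumping effect makes the distortion easier to see. It assumes, purely for illustration, that every polling thread either wakes on the same clock tick as the measurement thread or on a tick chosen at random; the behavior on a real system is messier.

```python
import random

def observed_queue_length(n_polling_threads, clumped, ticks_per_second=64,
                          trials=10_000):
    """Average Ready-queue length seen by a once-per-second measurement thread,
    with the polling threads' wakeups either clumped onto its clock tick or
    spread across random ticks."""
    total = 0
    for _ in range(trials):
        if clumped:
            # Every polling thread wakes on the measurement thread's tick and
            # is still Ready when the higher priority sampler runs first.
            total += n_polling_threads
        else:
            measurement_tick = random.randrange(ticks_per_second)
            total += sum(1 for _ in range(n_polling_threads)
                         if random.randrange(ticks_per_second) == measurement_tick)
    return total / trials

print(observed_queue_length(10, clumped=True))   # 10 Ready threads observed
print(observed_queue_length(10, clumped=False))  # about 10/64, i.e. roughly 0.16
```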

To summarize, the CPU utilization measurements at the system, process and thread level in Windows are based on a sampling methodology. Similarly, the processor queue length is also sampled. Like any sampling approach, the data gathered is subject to typical sampling errors, including

  • accumulating a sufficient number of sample observations to be able to make a reliable statistical inference about the underlying population, and
  • ensuring that there are no systematic sources of sampling error that cause sub-classes of the underlying population to be markedly under- or over-sampled.

So, these CPU measurements face familiar issues with sample size and the potential for systematic sampling bias, as well as the usual difficulty of ensuring that the sample data is representative of the underlying population (something known as non-sampling error). For example, the interpretation of the CPU utilization data that Perfmon gathers at the process and thread level is subject to limitations based on a small sample size for collection intervals of less than, say, 15 seconds. At one minute intervals, there are enough samples to expect accuracy within 1-2%, a reasonable trade-off of precision against overhead. Over even longer measurement intervals, say 5 or 10 minutes, the current sampling approach leads to minuscule sampling errors, except in anomalous cases where there is systematic under-sampling of the processor’s execution state.

Sample size is also the reason that Microsoft does not currently permit Perfmon to gather data at intervals more frequent than once per second. If you were to run performance data collection at intervals of 0.1 seconds, for example, the impact of relying on a very small number of processor execution state samples would be quite evident: at 0.1 second intervals, it is possible to accumulate just 5 or 6 samples per interval. If you are running a micro-benchmark and want to access the Thread\% Processor Time counters from Perfmon over 0.1 second intervals, you are looking for trouble. Under these circumstances, the % Processor Time measurements cease to resemble a continuous function over time.
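
A back-of-the-envelope estimate of that error, under the idealized assumption that each clock-tick sample is an independent observation of the busy/idle state (systematic bias of the kind discussed above can make matters worse), shows why sub-second intervals are problematic:

```python
import math

def sampling_error_pct_points(interval_seconds, busy_fraction=0.5,
                              ticks_per_second=64):
    """Standard error, in percentage points, of a sampled % Processor Time
    estimate built from interval_seconds * ticks_per_second samples."""
    n = interval_seconds * ticks_per_second
    return 100.0 * math.sqrt(busy_fraction * (1.0 - busy_fraction) / n)

print(sampling_error_pct_points(0.1))  # ~20 points from ~6 samples
print(sampling_error_pct_points(1))    # ~6 points from ~64 samples
print(sampling_error_pct_points(60))   # under 1 point from ~3,840 samples
```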

The limitations of the current approach to measuring CPU busy in Windows and the need for more precise measurements of CPU utilization are recognized in many quarters across the Windows development organization at Microsoft. More on how this area is evolving in the next post in this series.