How Windows performance counters are affected by running under VMware ESX

This post is a prequel to a recent one on correcting the Process(*)\% Processor Time counters on a Windows guest machine.

To assess the overall impact of the VMware virtualization environment on the accuracy of the performance measurements available for Windows guest machines, it is necessary to first understand how VMware affects the clocks and timers that are available on the guest machine. Basically, VMware virtualizes all calls made from the guest OS to hardware-based clock and timer services on the VMware Host. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” contains an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. These clock and timer distortions, in turn, distort a large set of Windows performance counters, to a degree that depends on the specific type of performance counter. (The different types of performance counters are described here.)

First, let’s look into the clock and timer distortions that occur in VMware, and then let’s discuss how these distorted clock values impact various Windows performance measurements.

You may recall from an earlier blog entry on the subject that the Windows OS provides the following clock and timer services:

a.       a periodic clock interrupt that Windows relies upon to maintain the System Time of Day clock,

b.      the rdtsc instruction, which Windows wraps using a call to the QueryPerformanceCounter function.

In VMware, virtualization influences both time sources that Windows depends upon in performance measurement. From inside the Windows guest machine, there is no clock or timer service that is consistently reliable. The consequences for performance monitoring are profound.

The fact that virtualization impacts external, hardware-based clock and timer services should not be too big a surprise. Whenever the guest machine accesses an external hardware timer or clock, that access is virtualized like any other access to an external device. Any external device IO request is intercepted by the VM Host software, which then redirects the request to the actual hardware device. If the target device happens to be busy servicing another request originating from a different virtual machine, the VM Host software must queue the request. When the actual hardware device completes its processing of the request, there is an additional processing delay associated with the VM Host routing the result to the guest machine that originated the request. In the case of a synchronous HPET timer request, this interception and routing overhead leads to some amount of “jitter” in interpreting the clock values that are retrieved that is not otherwise present when Windows is running in a native mode.

 % Processor Time counters

Windows schedules a periodic clock interrupt designed to interrupt the machine 64 times per second. During servicing of this interrupt, Windows maintains its System Time of Day clock. (The Windows System Time of Day clock is maintained in 100 nanosecond timer units, but the actual time of day values are only updated 64 times per second, approximately once every 15.6 milliseconds.) The clock interrupt service routine also samples the execution state of the machine, and these samples form the basis of the % Processor Time measurements that are maintained at the processor, process, and thread levels in Windows.
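
The tick arithmetic and the sampling idea can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not the Windows kernel's accounting code: the function names and the list of busy/idle samples are hypothetical, invented for illustration.

```python
TICKS_PER_SECOND = 64  # periodic clock interrupt rate described in the text


def tick_interval_ms(ticks_per_second=TICKS_PER_SECOND):
    """Length of one clock tick in milliseconds."""
    return 1000.0 / ticks_per_second


def percent_processor_time(samples):
    """Estimate % busy from execution-state samples (True = busy, False = idle)."""
    if not samples:
        return 0.0
    return 100.0 * sum(samples) / len(samples)


# Hypothetical one-second interval: the processor was found busy on 16 of 64 ticks.
one_second = [True] * 16 + [False] * 48

print(tick_interval_ms())                  # 15.625
print(percent_processor_time(one_second))  # 25.0
```

The estimate is only as good as the regularity of the samples, which is why delayed or dropped clock interrupts matter so much.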

VMware’s virtualization technology impacts any asynchronous timer request of this type. Timer interrupts are subject to additional delay because the original interrupt is serviced first by VMware Host software services, and only sometime later is a virtualized device interrupt presented to the guest machine. Interrupt delay time is normally minimal if the virtual guest is currently dispatched. However, if the guest machine is blocked by the VMware Host scheduler (where it is accumulating Ready time) at the moment the periodic clock interval the Windows Scheduler relies upon is due to expire, the interrupt delay time will be significantly longer.

In extreme circumstances, it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. If the backlog of timer interrupts ever exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over.
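
To get a feel for how quickly a backlog of timer ticks can build up, here is a rough Python sketch. It is a toy model, not VMware's actual catch-up logic: the function names are hypothetical, and the only figures taken from the text are the 64 ticks per second rate and the 60 second backlog limit.

```python
TICKS_PER_SECOND = 64    # guest's periodic clock interrupt rate
RESYNC_THRESHOLD_S = 60  # backlog limit mentioned in the text


def ticks_owed(blocked_ms, ticks_per_second=TICKS_PER_SECOND):
    """Whole timer ticks that came due while the guest was not dispatched."""
    return int(blocked_ms * ticks_per_second // 1000)


def backlog_exceeds_threshold(backlog_ticks,
                              ticks_per_second=TICKS_PER_SECOND,
                              threshold_s=RESYNC_THRESHOLD_S):
    """True when the backlog represents more guest time than the threshold."""
    return backlog_ticks / ticks_per_second > threshold_s


# Hypothetical example: a guest held off the processor for 250 ms.
print(ticks_owed(250))                  # 16
print(backlog_exceeds_threshold(3840))  # False: exactly 60 seconds of ticks
print(backlog_exceeds_threshold(3841))  # True
```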

Unfortunately, there is currently no way to measure directly the magnitude of these guest machine dispatcher delays. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was waiting for execution, but was delayed in the ESX Dispatcher queue. (VMware has a pretty good white paper on the subject available here).

If the underlying ESX Host is not too busy, guest machines are subject to very little CPU Ready delay. Under those circumstances, the delays associated with virtualizing clock timer interrupts will introduce some jitter into the Windows guest CPU utilization measurements, but these should not be too serious. However, if the underlying ESX Host does become very heavily utilized, clock timer interrupts are subject to major delays.

At the guest level, these delays impact both the rate of execution state samples and the duration between samples. Consider a common scenario where a dual socket VMware Host machine is configured to run 8 or more Windows guest machines. Each Windows guest machine expects a periodic clock interrupt, which it uses to perform CPU accounting, 64 times per second. If more vCPUs for guest machines are defined than exist physically, then some of those clock interrupts are going to be subject to VMware dispatching delays.

The busier the underlying VMware Host machine is servicing other guest machines, the more these measurements are distorted.

Actually, this all too common scenario is a nightmare for VMware – the effect is that of multiple Windows guest machines continuously issuing and processing clock interrupts, even when the machines are idle or nearly idle.

You do have recourse to actual measurements of processor utilization by guest machines from VMware, which are available inside the Windows guest when the VMware tools are installed. However, be aware that the VMware guest machine processor utilization counters are not measuring the same thing as the internal Windows CPU accounting function. VMware measures the amount of time a guest machine was dispatched on a physical processor during the last interval. A Windows guest machine that appears to the VMware scheduler to be running might actually be executing its Idle loop on one or more of the vCPUs that are defined.

To summarize this discussion, remember that in a virtualization environment, at the processor level the guest machine’s % Processor Time counters are difficult to interpret. They are largely meaningless, although when the underlying VMware Host machine is not very busy with other guests, the internal measurements are probably not that far from reality. You can substitute the guest machine processor time measurements that the VMware Host reports, but they measure the amount of time the guest machine was dispatched without regard for any periods of time when Windows guest machine vCPUs were executing the Idle loop.

As discussed in the last blog entry, at the process and thread level, the guest machine % Processor Time measurements can be readily corrected so long as you have access to the VMware guest machine measurements, so they continue to be quite useful.

Difference counters.

Windows performance counters that are difference counters are also impacted by virtualization of the Windows time of day clock. Difference counters, which are mainly of type PERF_COUNTER_COUNTER, are probably the most common type of Windows performance counters. They are based on the simple counting of the number of events that occur. Examples include Pages/sec, Disk transfer/sec, TCP segments/sec, and other similar counters.

Difference counters are all performance counters whose values are transformed by performance monitoring software and reported as events/second rates. The performance monitoring software calculates an interval delta by subtracting the counter value at the end of the previous measurement interval from the current value of the counter. This interval delta value is then divided by the interval duration to create a rate, i.e., events/second. In making that events/second calculation, the numerator – the number of these events that were observed during the interval – remains a valid count field. What is not reliable under VMware, however, is the interval duration, which is derived from two successive calls to virtualized timer services that may or may not be delayed significantly. There should be no doubt that the events that were counted during the measurement interval actually occurred. What is suspect is the calculation of the rate at which those events occurred, as performed by Perfmon and other performance monitoring applications.
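
The rate calculation described above can be written out as a short Python sketch. The function name and all of the numbers are hypothetical, chosen only to show how an unreliable interval duration skews the rate while the event count stays correct.

```python
def events_per_second(prev_count, curr_count, prev_time_s, curr_time_s):
    """Rate for a difference counter: interval delta / interval duration."""
    duration = curr_time_s - prev_time_s
    if duration <= 0:
        raise ValueError("interval duration must be positive")
    return (curr_count - prev_count) / duration


# Hypothetical interval: 6,000 events counted over a true 60 seconds.
true_rate = events_per_second(10_000, 16_000, 0.0, 60.0)

# Same event count, but assume the guest's two clock readings are only
# 50 seconds apart because its clock fell behind during the interval.
skewed_rate = events_per_second(10_000, 16_000, 0.0, 50.0)

print(true_rate)    # 100.0
print(skewed_rate)  # 120.0
```

In both cases the numerator is the same 6,000 events; only the denominator differs.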

Instantaneous counters.

On the other hand, there are counter types that are unaffected by virtualized clock and timer values. These are instantaneous counters, effectively a snapshot of a value such as Memory\Available Bytes that is observed at a single point in time. (This sort of counter is known in the official documentation as a PERF_COUNTER_RAWCOUNT.) Since there is no interval timer or duration associated with production of this type of counter, virtualized clock and timer values have no impact on the validity of these measurements.

Disk performance counters that use the QueryPerformanceCounter API

Finally, there is a set of counters that use the QueryPerformanceCounter API to measure duration at higher precision than the System Time of Day clock permits. QueryPerformanceCounter is a Windows API that wraps the hardware rdtsc instruction. (There is an extensive discussion of this interface here.) Under VMware, even timings based on the lightweight rdtsc instruction issued from guest machines are subject to virtualization delays. The VMware Host OS traps all rdtsc instructions and returns virtualized timer values. Despite the fact that an rdtsc instruction can normally be issued by a program executing at any protection level, it is still trapped and virtualized in VMware. (In contrast, Microsoft’s Hyper-V chooses not to trap rdtsc instructions, so guest machines in that environment do have native access to the hardware Read TimeStamp Counter instruction.)

Performance counters that utilize the QueryPerformanceCounter API are found mainly among the Logical and Physical Disk counters. The System Time of Day clock, which advances 64 times per second, provides too low a resolution for timing disk I/O operations that often complete within 10 milliseconds or less. In fact, the QueryPerformanceCounter API was originally introduced back in Windows 2000 to improve the resolution of performance counters such as Logical Disk(*)\Avg. Disk sec/Transfer, which explains the unfortunate name for an API that is actually the Windows high resolution timer function.

In theory, VMware’s virtualization of the rdtsc instruction should not have a major impact on the timer-based calculations used to produce counters like Avg. Disk sec/Transfer that measure disk latency in Windows. To calculate disk latency in Windows, the disk device driver issues an rdtsc instruction at time t1 when the IO operation to disk is initiated. The driver issues a second rdtsc at time t2 when the IO completes. The device driver software calculates the interval delta, t2 – t1, and accumulates the total elapsed time for the device in a DISK_PERFORMANCE structure. Since the rdtsc timing instructions are effectively issued in line by the guest machine and presumably processed in line by VMware, the clock value returned by a virtualized rdtsc call should be very close to the actual rdtsc value, plus some minor amount of overhead associated with VMware trapping the instruction.

However, it is entirely possible for the entire guest machine to be blocked from executing sometime between t1 and t2 while it has a disk IO pending. When the guest machine is re-dispatched, the disk IO that was pending will finally complete, at which point the clock value associated with that completion is acquired by the guest machine device driver. Now, however, when the interval delta, t2 – t1, is calculated, the measurement of disk latency includes the amount of time the guest machine was delayed in the VMware dispatcher ready queue. Of course, from the standpoint of the application waiting for the IO operation to complete, the t2 – t1 calculation remains perfectly valid. It is no longer disk latency, per se, but it does accurately represent the delay that the application issuing the IO request perceives.

A final complication is that it is not possible to tell, for any measurement interval, how many of the IO operations had guest machine dispatching delays folded into the disk latency time that the Windows disk device driver accumulates.
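
A small Python sketch makes the effect concrete. It is an illustration only: the function names, the 8 ms service time, and the 20 ms ready-queue delay are all hypothetical values, not measurements from any real system.

```python
def measured_latency_ms(t1_ms, t2_ms):
    """Interval delta the guest's disk driver would record for one IO."""
    return t2_ms - t1_ms


def average_latency_ms(latencies_ms):
    """Average latency over an interval: total elapsed time / number of IOs."""
    return sum(latencies_ms) / len(latencies_ms)


device_service_ms = 8.0  # assumed time the disk really took
ready_delay_ms = 20.0    # assumed time the guest sat in the ready queue

t1 = 1000.0
t2_native = t1 + device_service_ms                  # guest never descheduled
t2_guest = t1 + device_service_ms + ready_delay_ms  # guest blocked mid-IO

print(measured_latency_ms(t1, t2_native))  # 8.0
print(measured_latency_ms(t1, t2_guest))   # 28.0

# Ten IOs in one interval, two of which overlapped a dispatching delay.
latencies = [8.0] * 8 + [28.0] * 2
print(average_latency_ms(latencies))       # 12.0
```

The 12 ms average gives no hint that eight of the ten IOs completed in 8 ms. Only the accumulated total and the IO count survive, which is exactly the complication described above.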

As with the VMware Host measurements of guest machine physical processor utilization, VMware does provide direct measurements of disk latency from the point of view of the VMware Host. For each VMware guest machine, there are additional disk performance statistics available that track bytes transferred and reads vs. writes. Measurements of disk latency at the guest machine level are not available from VMware, however.

 

Reconciling the VMware Host measurements of overall disk latency with the Windows guest machine measurements of disk performance at a logical or physical disk level is problematic at best. Figure 1 is an attempt to reconcile these disparate data sources under the best possible circumstances – the VMware Host was managing a single guest machine.

Figure 1. Comparing the VMware Host and Windows guest machine measurements of disk latency. To make the comparison possible, there was only a single guest machine defined for the VMware Host to run.

In Figure 1, I plotted about 35 minutes’ worth of disk performance data from the VMware Host (the area plotted in light orange) and from the single Windows guest (the dotted line overlay) for the same one-minute measurement intervals. In an ideal world, the two sets of data should be quite similar, and in some of the intervals, the disk latency measurements are quite similar. But, in many other intervals, the measurements are quite different. It is unclear how to reconcile these differences.

Overall, since VMware is not able to report disk latency at the guest machine level, the disk performance measurements available internally from the Windows guest machine remain quite useful. You should attempt to reconcile the VMware disk measurements with the guest machine internal measurements, which with regard to reads, writes and bytes transferred should be comparable. Whenever a Windows guest machine running on a busy VMware Host reports disk latency that is much worse than the VMware Host disk latency measurements, you need to consider the possibility that at least some of the difference is due to guest machine dispatcher delays and not poor disk hardware performance.


[1] You may recall from an earlier blog entry that the QueryPerformanceCounter function that is used in performance monitoring to generate granular measurements of elapsed time uses the hardware rdtsc instruction in Windows 7, but reverts to the HPET external timer on older machines when rdtsc cannot be trusted. For simplicity’s sake, I will ignore the potential use of the HPET in the discussion here.

Presenting two sessions at the upcoming UKCMG meeting in Oxford, England on May 14-15.

Some news that regular readers of this blog might be interested in hearing about…
I plan to present two sessions at the upcoming UKCMG annual conference, which is being held this year on May 14 & 15 at the Oxford Belfry on the outskirts of Oxford, England.
The first presentation is a repeat performance of the one I gave at the US CMG in December, a paper entitled “Measuring Processor Utilization in Windows and Windows applications,” essentially pulling together the series of blog entries I have been posting here, beginning with the first installment, but with a good deal more material than I have gotten around to posting to the blog.
For instance, the last blog post discussing the high resolution clocks and timer facilities in Windows leads directly to a consideration of what happens to the various CPU utilization measurements when Windows is running as virtual guest under VMware or Hyper-V. That discussion is in the paper, but, unfortunately, hasn’t made it to the blog yet.
But you can download the full paper from my company’s web site here.
It is shameful to admit that the full paper has been available since December. Inept as I am at blogging, I had not alerted you blog readers about its availability. Unfortunately, and it will be forever thus, or at least until I retire from my day job, self-publishing on this blog takes a back seat to work that actually pays the bills around here.
(I will resist the temptation to go off on a rant here about the idiotic and naïve notion expounded by fanatical proponents of Open Source technology that information should be free. That’s a wonderful ideal state, of course, but flies in the face of the economics of information gathering, production, storage and dissemination, which has real costs associated with it. Even in the digital age, which has revolutionized the costs associated with information storage and dissemination, these costs remain and they are considerable. My contrarian view is that no one, other than gods and saints, in possession of potentially valuable information is apt to give it away for free under our system of capitalism, but that is another topic entirely.)
Workshop in Web Application Performance and Tuning.
The second session is an extended workshop on web application performance. It is focused on Windows technology (IIS, ASP.NET, AJAX, etc.), but many of the tools and techniques discussed are directly applicable to other web hosting platforms.
The workshop is based on a course that I used to give in-house back in Microsoft to the developers working on various Microsoft web-based applications. While I have published very little on this topic over the years, it has actually been the focus of much of my software development work over the past five years or so. I do expect to start publishing something soon on the subject, especially as I am in the late stages of developing a new software tool aimed squarely at Microsoft web application performance.
Reading between the lines of some of my recent blog postings that are ETW-oriented, including the CPU measurement series, you would be correct in guessing that the new tool attempts to leverage ETW trace events, specifically, in this case, the events that instrument the Microsoft IIS web server and the TCP/IP networking stack. This new trace analysis tool also correlates these system-oriented trace events from various Windows components with events issued from inside application scenarios instrumented using the Scenario class library (a free component, currently posted in the MSDN Archive here).
Instrumenting your application for performance monitoring is a crucial step, and that is where the Scenario class library comes in. Originally, I conceived of the Scenario instrumentation library as a .NET flavor of the open source Application Response Measurement (ARM) initiative that was championed by both HP and IBM (and supported by the US CMG, where I was the ARM Committee liaison for many years). Soon after I arrived at Microsoft, it became apparent that I needed to adapt my original conception to leverage ETW tracing technology, which had the considerable weight of the Windows Fundamentals organization behind it.
In the workshop I explain how to use this application-oriented instrumentation as part of integrating software performance engineering best practices into the software development life cycle. This involves first setting performance goals around key application scenarios that you’ve identified, and then instrumenting those scenarios to determine whether or not the application as delivered for testing is actually capable of meeting those goals. The instrumentation can also safely be embedded in the application when it is ultimately deployed in production. This is fundamentally necessary to enable service level reporting and verify, for example, that the app is meeting its stated performance objectives. Most ARM advocates concentrate on monitoring application performance in production, but tend to neglect the crucial earlier stages of application development where it is important to bake goal-oriented, performance monitoring in at the outset.
The new Windows performance tool is currently in a very limited beta release, and contrary to the negative views I expressed in my earlier aside — not a rant — about information being free, we are looking at some sort of freebie distribution of the initial “commercial” version of the tool to allow you guys to explore the technology and see what it can do for you.
So, if you happen to be in the neighborhood of Oxford, England next month, you can hear & see more about this initiative. In the meantime, stay tuned to this space, where I will try to do a better job keeping you posted as we make progress in this area.
