Presentations for the upcoming CMG conference are available on slideshare.net

I am presenting two topics at the Computer Measurement Group (CMG) annual conference this week, and I have just posted the slide decks I will be using on slideshare.net.

The latest slide deck for Monitoring Web Application Response Times in Windows is available here.

And the slide deck for HTTP/2: Recent protocol changes and their impact on web application performance is available here.

If you are a regular reader of this blog, most of this material will look familiar. Enjoy!


Correcting the Process level measurements of CPU time for Windows guest machines running under VMware ESX

Recently, I have been writing about how Windows guest machine performance counters are affected by running in a virtual environment, including publishing two recent, longish papers on the subject, one about processor utilization metrics and another about memory management. In the processor utilization paper (which is available here), it is evident that when Windows runs under VMware, the performance counters that measure processor utilization are significantly distorted. At a system level, this distortion is not problematic so long as one has recourse to the VMware measurements of actual physical CPU usage by each guest machine.

A key question – one that I failed to address properly, heretofore – is whether it is possible to correct for that distortion in the measurements of processor utilization taken at the process level inside the guest machine OS. The short answer for Windows, at least, is, “Yes.” The % Processor Time performance counters in Windows that are available at the process level are derived using a simple technique that samples the execution state of the machine approximately 64 times per second. VMware does introduce variability in the duration of the interval between samples and sometimes causes the sampling rate itself to fluctuate. However, neither the variability in the time between samples nor possible changes to the sampling rate undermine the validity of the sampling process.

In effect, the sampling data should still accurately reflect the relative proportion of the CPU time used by the individual processes running on the guest machine. If you have access to the actual guest machine CPU usage – from the counters installed by the vmtools, for example – you should be able to adjust the internal guest machine process level data accordingly. You need to accumulate a sufficient number of guest machine samples to ensure that the sample data is a good estimator of the underlying population. And, you also need to make sure that the VMware measurement intervals are reasonably well-synchronized to the guest machine statistics. (The VMware ESX performance metrics are refreshed every 20 seconds.) In practice, that means if you have at least five minutes’ worth of Windows counter data and VMware guest CPU measurements across a roughly equivalent five-minute interval, you should be able to correct for the distortion due to virtualization and reliably estimate processor utilization at the process level within Windows guest machines.

Calculate a correction factor, w, from the ratio of the actual guest machine physical CPU usage as reported by the VMware hypervisor to the average processor utilization reported by the Windows guest:

w = Aggregate Guest CPU Usage / Average % Processor Time

Then multiply the process level processor utilization measurements by this correction factor to estimate the actual CPU usage of individual guest machine processes.
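
Expressed as code, the calculation is trivial. Here is a minimal sketch, assuming the counter values have already been gathered over roughly the same interval; the function and variable names are mine, purely illustrative:

# Minimal sketch of the correction factor calculation described above.
# The names are illustrative; the inputs would come from the VMware
# measurements of guest CPU usage and the guest's own % Processor Time counters.

def correction_factor(vmware_guest_cpu_pct: float, guest_avg_cpu_pct: float) -> float:
    """w = aggregate guest CPU usage (VMware) / average % Processor Time (guest)."""
    return vmware_guest_cpu_pct / guest_avg_cpu_pct

def corrected_process_cpu(process_cpu_pct: float, w: float) -> float:
    """Estimate the actual CPU usage of a guest process by weighting its counter by w."""
    return process_cpu_pct * w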

The impetus for me revisiting this subject was the following question that Steve Marksamer posed on a LinkedIn forum devoted to computer capacity and performance:

[What is] the efficacy of using cpu and memory data extracted from Windows O/S objects when this data is for a virtual machine in VMware. Has there been any timing correction allowing for guest to host delay? Your thoughts/comments are appreciated. Any other way to get at process level data more accurately?

This is my attempt to formulate a succinct answer to Steve’s question.

In the spirit of full disclosure, I should also note that Steve and I once collaborated to write an article on virtualization technology that was published back in 2007. I don’t think I have seen or talked to Steve since we last discussed the final draft of that joint article.

A recent paper of mine focused on the processor utilization measurements that are available in Windows (the full paper is available here), and contains a section that describes how they are impacted by VMware’s virtualization of timer interrupts and the Intel RDTSC clock instruction. Meanwhile, on this blog I posted a number of entries that covered similar ground (beginning here), but in smaller chunks. Up to now, however, I neglected to post any material here about the impact of virtualization on the Windows processor utilization measurements. Let me start to correct that error of omission.

In considering how VMware affects the CPU measurements reported at the system and process level inside a guest Windows machine, a particularly salient point is that VMware virtualizes the clock timer interrupts that are the basis for the Windows clock. Virtualizing clock timer interrupts has an impact on the legacy method used in Windows to measure CPU utilization at the system and process level, which samples the execution state of the system roughly 64 times per second. When the OS handles this periodic clock interrupt, it examines the execution state of the system prior to the interrupt. If a processor was running the Idle thread, then the CPU accounting function attributes all the time associated with the current interval as Idle Time, which is accumulated at the processor level. If the machine was executing an actual thread, all the time associated with the current interval is attributed to that thread and its process, and is accumulated in counters associated with active threads and processes. Utilization at the processor level is calculated by subtracting the amount of Idle time accumulated in a measurement interval from the total amount of time in that interval.
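
To make the accounting mechanism concrete, here is a schematic sketch of the tick-based scheme just described. It is a simplification for illustration only, not actual Windows kernel code, and the data structures are invented for the example:

# Schematic sketch of legacy clock-tick CPU accounting. All names and types
# are invented for illustration; this is not actual Windows kernel code.
from dataclasses import dataclass, field

TICK = 15.6e-3   # nominal time between clock interrupts, roughly 64 per second

@dataclass
class Thread:
    process: str
    is_idle: bool = False

@dataclass
class Counters:
    idle_time: float = 0.0
    per_process: dict = field(default_factory=dict)

def on_clock_interrupt(running: Thread, c: Counters) -> None:
    # Charge the entire tick to whatever was executing when the interrupt fired.
    if running.is_idle:
        c.idle_time += TICK
    else:
        c.per_process[running.process] = c.per_process.get(running.process, 0.0) + TICK

def processor_utilization(c: Counters, interval: float) -> float:
    # Utilization is derived by subtracting accumulated Idle time from the interval.
    return (interval - c.idle_time) / interval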

So, the familiar Windows performance counters that report % Processor Time are derived using a sampling technique. Note that a simple statistical argument based on the sample size strongly suggests that these legacy CPU time measurements (i.e., any of the % Processor Time counters in Windows) are not reliable at intervals of less than 15 seconds, which was the official guidance in the version of the Windows Performance Guide that I wrote for Microsoft Press back in 2005. But so long as the measurement interval is long enough to accumulate a sufficient number of samples – one-minute measurement intervals are based on slightly more than 3,600 execution state samples – the Windows performance counters provide reasonable estimates of processor utilization at both the hardware and process levels. You can easily satisfy yourself that this is true by using xperf to calculate CPU utilization precisely from context switch events and comparing those calculations to Windows performance counters gathered over the same measurement interval.
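
For comparison, the precise calculation from context switch events works roughly as sketched below. The event tuple layout here is a hypothetical stand-in for the fields an ETW CSwitch event carries; it is meant only to show how busy time is accumulated between switches:

# Sketch of computing per-CPU utilization precisely from context switch events.
# Each event is a hypothetical (timestamp, cpu_id, next_thread_is_idle) tuple,
# sorted by time; real ETW CSwitch records carry more fields than this.

def cpu_utilization_from_cswitch(events, interval_start, interval_end):
    busy = {}     # cpu_id -> accumulated busy seconds
    last = {}     # cpu_id -> (timestamp of last switch, CPU busy since then?)
    for ts, cpu, next_is_idle in events:
        # Assume each CPU is idle before its first observed switch (simplification).
        prev_ts, was_busy = last.get(cpu, (interval_start, False))
        if was_busy:
            busy[cpu] = busy.get(cpu, 0.0) + (ts - prev_ts)
        last[cpu] = (ts, not next_is_idle)
    # Close out the interval for each CPU.
    for cpu, (prev_ts, was_busy) in last.items():
        if was_busy:
            busy[cpu] = busy.get(cpu, 0.0) + (interval_end - prev_ts)
    length = interval_end - interval_start
    return {cpu: busy.get(cpu, 0.0) / length for cpu in last}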

In virtualizing clock interrupts, VMware introduces variability into the duration of the interval between processor utilization samples. If a clock interrupt occurs while the target Windows guest machine is parked in the ESX dispatcher queue, the VMware dispatcher delays delivery of that interrupt until the next time that guest machine is dispatched. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” has an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. Note that it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. In extreme cases, when the backlog of timer interrupts exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over.

Unfortunately, there is currently no way to measure directly the magnitude of these guest machine dispatcher delays, which is the other part of Steve’s question. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was ready to execute but was delayed in the ESX dispatcher queue. (VMware has a pretty good white paper on the subject available here.)

If the underlying ESX Host is not too busy, guest machines are subject to very little CPU Ready delay. Under those circumstances, the delays associated with virtualizing clock timer interrupts will introduce some jitter into the Windows guest CPU utilization measurements, but these should not be too serious. However, if the underlying ESX Host does become very heavily utilized, clock timer interrupts are subject to major delays. At the guest level, these delays impact both the rate of execution state samples and the duration between samples.

In spite of this potential disruption, it is still possible to correct the guest level % Processor Time measurements at a process and thread level if you have access to the direct measurements of hardware CPU utilization that VMware supplies for each guest machine. The execution state sampling performed by the periodic clock interrupt handler still amounts to a random sampling of processor utilization. The amount of CPU time accumulated at the process level and reported in the Windows performance counters remains proportional to the actual processor time consumed by processes running on the guest machine. This reconciliation does require gathering the processor utilization measurements over longer intervals and synchronizing the external VMware measurements reasonably well with the internal Windows guest measurements.

To take a simple example of calculating and applying the factor for correcting the processor utilization measurements, let’s assume a guest machine that is configured to run with just one vCPU. Suppose you see the following performance counter values for the Windows guest machine:

                % Processor Time (virtual)
Overall         73%
Process A       40%
Process B       20%

Then suppose you see from the VMware ESX measurements that the actual CPU utilization of the guest machine is 88%. You can then correct the internal process CPU time measurements by weighting them by the ratio of the actual CPU time consumed to the CPU time reported by the guest machine (88:73, or roughly 1.2).

                % Processor Time (virtual)    % Processor Time (actual)
Overall         73%                           88%
Process A       40%                           48%
Process B       20%                           24%

Similarly, for a guest machine with multiple vCPUs, use the average utilization per virtual processor to calculate the weighting factor, since VMware only reports an aggregate CPU utilization for each guest machine.
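
Putting the numbers from the table into the same form as the earlier sketch (the two per-vCPU values at the end are purely illustrative, included only to show the averaging step):

# Worked example using the numbers from the table above.
vmware_guest_cpu = 88.0            # actual guest CPU usage reported by ESX
guest_overall_cpu = 73.0           # _Total % Processor Time inside the guest

w = vmware_guest_cpu / guest_overall_cpu          # 88 / 73, roughly 1.2

for name, virtual_pct in [("Process A", 40.0), ("Process B", 20.0)]:
    print(f"{name}: {virtual_pct * w:.0f}%")      # roughly 48% and 24%

# For a multi-vCPU guest, average the guest's per-processor counters first,
# since VMware reports only one aggregate utilization figure per guest machine.
per_vcpu_pct = [80.0, 66.0]                       # illustrative per-vCPU values
guest_avg_cpu = sum(per_vcpu_pct) / len(per_vcpu_pct)
w_multi = vmware_guest_cpu / guest_avg_cpu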

What about the rest of the Windows performance counters?

Stay tuned. I will discuss the rest of the Windows performance counters in the next blog post.

Virtual memory management in VMware: Final thoughts

This is the final blog post in a series on VMware memory management. The previous post in the series is here:

Final Thoughts

My colleagues and I constructed a case study, which I have been discussing here in some detail, in which VMware memory over-commitment led to guest machine memory ballooning and swapping, which, in turn, had a substantial impact on the performance of the applications that were running. When memory contention was present, the benchmark application executed to completion three times slower than the same application run standalone. The difference was entirely due to memory management “overhead,” the cost of demand paging when the supply of machine memory was insufficient to the task.

Analysis of the case study results unequivocally shows that the cost equation associated with aggressive server consolidation using VMware needs to be adjusted based on the performance risks that can arise when memory is over-committed. When configuring the memory on a VMware Host machine, it is important to realize that virtual memory systems do not degrade gracefully. When virtual memory workloads overflow the amount of physical memory available for them to execute, they are subject to page fault resolution delays that are punishing in nature. The delay a running thread incurs while a disk I/O brings a block of code or data from the paging file into memory to resolve a page fault is several orders of magnitude larger than almost any other sort of execution time delay the thread is ever likely to encounter.

VMware implements a policy of memory over-commitment in order to support aggressive server consolidation. In many operational environments, such as server hosting or application testing, guest machines are frequently dormant, but when they are active, they are extremely active in bursts. These kinds of environments are well-served by aggressive guest machine consolidation on server hardware that is massively over-provisioned.

On the other hand, implementing overly aggressive server consolidation of active production workloads with more predictable levels of activity presents a very different set of operational challenges. One entirely unexpected result of the benchmark was the data on Transparent Memory Sharing, reported in an earlier post, that showed the benefits of memory sharing evaporating almost completely once guest machines were actively using their allotted physical memory. Since the guest machines used in the benchmark were configured identically, down to running the same exact application code, it was surprising to see how ineffective memory sharing proved to be once the benchmark applications started to execute on their respective guest machines. Certainly the same memory sharing mechanism is extremely effective when guest machines are idle for extended periods of time. This finding that memory sharing is ineffective when the guest machines are active, if it can be replicated in other environments, would call for re-evaluating the value of the whole approach, especially since idle machines can be swapped out of memory entirely.

Moreover, the performance-related risks for critical workloads that arise when memory over-commitment leads to ballooning and swapping are substantial. Consider that if an appropriate amount of physical memory was chosen for a guest machine configuration at the outset, removing pages from the guest machine memory footprint via ballooning and/or swapping is potentially very damaging. For this reason, for example, warnings from SQL Server DBAs about VMware’s policy of over-committing machine memory are very prominent in blog posts. See http://www.sqlskills.com/blogs/jonathan/the-accidental-dba-day-5-of-30-vm-considerations/ for an example.

In the benchmark test discussed here, each of the guest machines ran identical workloads that, when a sufficient number of them were run in tandem, combined to stress the virtual memory management capabilities of the VMware Host. Using the ballooning technique, VMware in effect successfully transmitted the external memory contention to the individual guest machines. This transmission diffused the response to the external problem, but did not in any way lessen its performance impact.

More typical of a production environment, perhaps, is the case where a single guest machine is the primary source of the memory contention. Just as, within a single OS image, one user process consuming an excess of physical memory can create a resource shortage with a global impact, a single guest machine consuming an excess of machine memory can generate a resource shortage that impacts multiple tenants in the virtualization environment.

Memory Reservations

In VMware, customers do have the ability to prioritize guest machines so that all tenants sharing an over-committed virtualization Host machine are not penalized equally when there is a resource shortage. The most effective way to protect a critical guest machine from being subjected to ballooning and swapping due to a co-resident guest is to set up a machine memory Reservation. A machine memory Reservation establishes a floor, guaranteeing that a certain amount of machine memory is always granted to the guest. With a Reservation value set, VMware will not subject a guest machine to ballooning or swapping that would result in the machine memory granted to the guest falling below that minimum.
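
As a concrete illustration, here is a sketch of setting such a Reservation programmatically using the pyVmomi Python bindings. The vCenter host, credentials, VM name, and the 4,096 MB value are placeholders I have made up for the example; the same setting can, of course, be made interactively from the vSphere Client.

# Sketch: setting a machine memory Reservation for a guest machine with pyVmomi.
# Host, credentials, the VM name, and the 4096 MB reservation are placeholders.

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="password")           # SSL context setup omitted for brevity
try:
    content = si.RetrieveContent()
    # Simple container-view lookup of the guest machine by name
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    vm = next(v for v in view.view if v.name == "critical-guest")

    # Reserve 4096 MB of machine memory as a floor for this guest
    spec = vim.vm.ConfigSpec()
    spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=4096)
    vm.ReconfigVM_Task(spec=spec)
finally:
    Disconnect(si)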

But in order to set an optimal memory Reservation size, it is first necessary to understand how much physical memory the guest machine requires, not always an easy task. A Reservation value that is set too high on a Host machine experiencing memory contention will have the effect of increasing the level of memory reclamation activity on the remaining co-tenants of the VMware Host.

Another challenge is how to set an optimal Reservation value for guest machines running applications that, like the .NET Framework application used in the benchmark discussed here, dynamically expand their working set to grab as much physical memory as possible on the machine. Microsoft SQL Server is one of the more prominent Windows server applications that do this, but others include the MS Exchange Store process (fundamentally also a database application) and ASP.NET web sites. Like the benchmark application, SQL Server and Store listen for Low Memory notifications from the OS and will trim back their working set of resident pages in response. If the memory remaining proves inadequate to the task, there are performance ramifications.

With server applications like SQL Server that expand to fill the size of RAM, it is often very difficult to determine how much RAM is optimal, except through trial and error. The configuration flexibility inherent in virtualization technology does offer a way to experiment with different machine memory configurations. Once the appropriate set of performance “experiments” has been run, the results can then be used to reserve the right amount of machine memory for these guest machines. Of course, these workloads are also subject to growth and change over time, so once memory reservation parameters are set, they need to be actively monitored at both the VMware Host and the guest machine and application levels.