How Windows performance counters are affected by running under VMware ESX

This post is a prequel to a recent one on correcting the Process(*)\% Processor Time counters on a Windows guest machine.

To assess the overall impact of the VMware virtualization environment on the accuracy of the performance measurements available for Windows guest machines, it is necessary to first understand how VMware affects the clocks and timers that are available on the guest machine. Basically, VMware virtualizes all calls made from the guest OS to hardware-based clock and timer services on the VMware Host. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” contains an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. These clock and timer distortions, in turn, distort a considerable number of Windows performance counters, depending on the specific type of performance counter. (The different types of performance counters are described here.)

First, let’s look into the clock and timer distortions that occur in VMware, and then let’s discuss how these distorted clock values impact various Windows performance measurements.

You may recall from an earlier blog entry on the subject that the Windows OS provides the following clock and timer services:

a. a periodic clock interrupt that Windows relies upon to maintain the System Time of Day clock,

b. the rdtsc instruction, which Windows wraps using a call to the QueryPerformanceCounter function.

In VMware, virtualization influences both time sources that Windows depends upon in performance measurement. From inside the Windows guest machine, there is no clock or timer service that is consistently reliable. The consequences for performance monitoring are profound.

The fact that virtualization impacts external, hardware-based clock and timer services should not be too big a surprise. Whenever the guest machine accesses an external hardware timer or clock, that access is virtualized like any other access to an external device. Any external device IO request is intercepted by the VM Host software, which then redirects the request to the actual hardware device. If the target device happens to be busy servicing another request originating from a different virtual machine, the VM Host software must queue the request. When the actual hardware device completes its processing of the request, there is an additional processing delay associated with the VM Host routing the result to the guest machine that originated the request. In the case of a synchronous HPET timer request, this interception and routing overhead introduces some amount of “jitter” into the clock values that are retrieved, jitter that is not present when Windows is running in native mode.

% Processor Time counters

Windows schedules a periodic clock interrupt designed to interrupt the machine 64 times per second. During servicing of this interrupt, Windows maintains its System Time of Day clock. (The Windows System Time of Day clock is maintained in 100 nanosecond timer units, but the actual time of day values are only updated 64 times per second – approximately once every 15.6 milliseconds.) The clock interrupt service routine also samples the execution state of the machine, and these samples form the basis of the % Processor Time measurements that are maintained at the processor, process, and thread levels in Windows.
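To make that granularity concrete, here is a quick back-of-the-envelope sketch (in Python, purely for illustration) of the clock tick arithmetic described above:

```python
# A back-of-the-envelope sketch of the clock tick arithmetic: the System Time
# of Day clock is kept in 100-nanosecond units but only advances once per
# periodic clock interrupt, 64 times per second.

TICKS_PER_SECOND = 64
HUNDRED_NS_UNITS_PER_SECOND = 10_000_000   # one second = 10 million 100 ns units

tick_interval_ms = 1000 / TICKS_PER_SECOND                              # 15.625 ms
tick_interval_100ns = HUNDRED_NS_UNITS_PER_SECOND // TICKS_PER_SECOND   # 156,250 units

print(f"Clock advances every {tick_interval_ms} ms, "
      f"i.e. {tick_interval_100ns} hundred-nanosecond units per tick")
```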

VMware’s virtualization technology impacts any asynchronous timer request of this type. Timer interrupts are subject to additional delay because the original interrupt is serviced first by VMware Host software services, and only sometime later is a virtualized device interrupt presented to the guest machine. Interrupt delay time is normally minimal if the virtual guest is currently dispatched. However, if the guest machine is blocked by the VMware Host scheduler (where it is accumulating Ready time) at the point when the periodic clock interval that the Windows Scheduler relies upon is due to expire, the interrupt delay time will be significantly longer.

In extreme circumstances, it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. In extreme cases where the backlog of timer interrupts ever exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over. (A VMware white paper entitled “Timekeeping in VMware Virtual Machines” has an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays.)

Unfortunately, there is currently no way to measure directly the magnitude of these guest machine dispatcher delays. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was waiting for execution, but was delayed in the ESX Dispatcher queue. (VMware has a pretty good white paper on the subject available here).

If the underlying ESX Host is not too busy, guest machines are subject to very little CPU Ready delay. Under those circumstances, the delays associated with virtualizing clock timer interrupts will introduce some jitter into the Windows guest CPU utilization measurements, but these should not be too serious. However, if the underlying ESX Host does become very heavily utilized, clock timer interrupts are subject to major delays.

At the guest level, these delays impact both the rate of execution state samples and the duration between samples. Consider a common scenario where a dual socket VMware Host machine is configured to run 8 or more Windows guest machines. Each Windows guest machine expects a periodic clock interrupt 64 times per second, which it uses to perform CPU accounting. If more guest machine vCPUs are defined than physical CPUs exist, then some of those clock interrupts are going to be subject to VMware dispatching delays.
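The arithmetic below is a rough, hypothetical illustration of the virtual clock-interrupt load this scenario creates for the VMware Host; the guest count is invented, and it assumes one 64-per-second interrupt stream per guest:

```python
# Rough, invented illustration of the virtual clock-interrupt load on a VMware
# Host: each Windows guest expects its 64-per-second clock interrupt stream
# whether it is busy or idle.

guests = 8                       # hypothetical number of Windows guest machines
ticks_per_guest_per_second = 64  # the Windows periodic clock interrupt rate

virtual_clock_interrupts_per_second = guests * ticks_per_guest_per_second
print(virtual_clock_interrupts_per_second)   # 512 interrupts/sec the Host must deliver
```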

The busier the underlying VMware Host machine is servicing other guest machines, the more these measurements are distorted.

Actually, this all too common scenario is a nightmare for VMware – the effect is that of multiple Windows guest machines continuously issuing and processing clock interrupts, even when the machines are idle or nearly idle.

You do have recourse to actual measurements of processor utilization by guest machines from VMware, which are available inside the Windows guest when the VMware tools are installed. However, be aware that the VMware guest machine processor utilization counters are not measuring the same thing as the internal Windows CPU accounting function. VMware measures the amount of time a guest machine was dispatched on a physical processor during the last interval. A Windows guest machine that appears, to the VMware scheduler, to be running might actually be executing its Idle loop on one or more of the vCPUs that are defined.

To summarize this discussion, remember that in a virtualization environment, at the processor level the guest machine’s % Processor Time counters are difficult to interpret. They are largely meaningless, although when the underlying VMware Host machine is not very busy with other guests, the internal measurements are probably not that far from reality. You can substitute the guest machine processor time measurements that the VMware Host reports, but they measure the amount of time the guest machine was dispatched without regard for any periods of time when Windows guest machine vCPUs were executing the Idle loop.

As discussed in the last blog entry, at the process and thread level, the guest machine % Processor Time measurements can be readily corrected so long as you have access to the VMware guest machine measurements, so they continue to be quite useful.

Difference counters.

Windows performance counters that are difference counters are also impacted by virtualization of the Windows time of day clock. Difference counters, which are mainly of type PERF_COUNTER_COUNTER, are probably the most common type of Windows performance counters. They are based on the simple counting of the number of events that occur. Examples include Pages/sec, Disk transfer/sec, TCP segments/sec, and other similar counters.

Difference counters are all performance counters whose values are transformed by performance monitoring software and reported as events/second rates. The performance monitoring software calculates an interval delta by subtracting the value from the last measurement interval from the current value of the counter. This interval delta is then divided by the interval duration to create a rate, i.e., events/second. In making that events/second calculation, the numerator – the number of these events that were observed during the interval – remains a valid count field. What is not reliable under VMware, however, is the interval duration, which is derived from two successive calls to virtualized timer services that may or may not be delayed significantly. There should be no doubt that the events that were counted during the measurement interval actually occurred. What is suspect is the rate at which those events occurred, as calculated by Perfmon and other performance monitoring applications.
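A minimal sketch of that calculation may help. The counter values and interval durations below are invented; the point is that stretching the measured interval changes the reported rate even though the event count itself is still valid:

```python
# Sketch of the events/second calculation for a difference counter. The counts
# and durations below are invented. Note that the event count (the numerator)
# stays valid; only the interval duration (the denominator) is suspect.

def events_per_second(previous_value, current_value, interval_seconds):
    return (current_value - previous_value) / interval_seconds

prev, curr = 10_000, 13_000      # 3,000 events (e.g., disk transfers) occurred

print(events_per_second(prev, curr, 15.0))   # 200.0/sec if the interval really was 15 s
print(events_per_second(prev, curr, 18.5))   # ~162/sec if timer delays stretched the
                                             # measured interval to 18.5 s
```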

Instantaneous counters.

On the other hand, there are counter types that are unaffected by virtualized clock and timer values. These are instantaneous counters, effectively a snapshot of a value such as Memory\Available Bytes that is observed at a single point in time. (This sort of counter is known in the official documentation as a PERF_COUNTER_RAWCOUNT.) Since there is no interval timer or duration associated with production of this type of counter, virtualized clock and timer values have no impact on the validity of these measurements.

Disk performance counters that use the QueryPerformanceCounter API

Finally, there is a set of counters that use the QueryPerformanceCounter API to measure duration at higher precision than the System Time of Day clock permits. QueryPerformanceCounter is a Windows API that wraps the hardware rdtsc instruction. (There is an extensive discussion of this interface here.) Under VMware, even timings based on the lightweight rdtsc instruction issued from guest machines are subject to virtualization delays. The VMware Host OS traps all rdtsc instructions and returns virtualized timer values. Despite the fact that an rdtsc instruction can normally be issued by a program executing at any protection level, these instructions are still trapped and virtualized in VMware. (In contrast, Microsoft’s Hyper-V chooses not to trap rdtsc instructions, so guest machines in that environment do have native access to the hardware Read TimeStamp Counter instruction.)

Performance counters that utilize the QueryPerformanceCounter API are found mainly among the Logical and Physical Disk counters. The System Time of Day clock, which advances 64 times per second, provides resolution that is too low for timing disk I/O operations that often complete within 10 milliseconds or less. In fact, the QueryPerformanceCounter API was originally introduced back in Windows 2000 to improve the resolution of performance counters such as Logical Disk(*)\Avg. Disk sec/Transfer – and, thus, the unfortunate name for an API that is actually the Windows high resolution timer function.

In theory, VMware’s virtualization of the rdtsc instruction should not have a major impact on the timer-based calculations used to produce counters like Avg. Disk sec/Transfer that measure disk latency in Windows. To calculate disk latency in Windows, the disk device driver issues an rdtsc instruction at time t1 when the IO operation to disk is initiated. The driver issues a second rdtsc at time t2 when the IO completes. The device driver software calculates the interval delta, t2 – t1, and accumulates the total elapsed time for the device in a DISK_PERFORMANCE structure. Since the rdtsc timing instructions are effectively issued in line by the guest machine and presumably processed in line by VMware, the clock value returned by the virtualized rdtsc call should be very close to the actual rdtsc value, plus some minor amount of overhead associated with VMware trapping the instruction.
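The sketch below mimics that accounting logic in Python, purely for illustration. It uses Python's own high resolution timer rather than rdtsc, and the DiskStats class is only a stand-in for the driver's DISK_PERFORMANCE accumulators:

```python
import time

class DiskStats:
    """Illustrative stand-in for the per-device DISK_PERFORMANCE accumulators."""

    def __init__(self):
        self.io_count = 0
        self.total_io_seconds = 0.0

    def start_io(self):
        return time.perf_counter()           # t1: timestamp when the IO is initiated

    def complete_io(self, t1):
        t2 = time.perf_counter()             # t2: timestamp when the IO completes
        self.io_count += 1
        self.total_io_seconds += t2 - t1     # accumulate the interval delta

    def avg_disk_sec_per_transfer(self):
        # what performance monitors report as Avg. Disk sec/Transfer
        return self.total_io_seconds / self.io_count if self.io_count else 0.0

stats = DiskStats()
t1 = stats.start_io()
# ... the IO is in flight here; under VMware the whole guest could be
# preempted during this window, inflating the measured latency ...
stats.complete_io(t1)
print(stats.avg_disk_sec_per_transfer())
```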

However, it is entirely possible for the entire guest machine to be blocked from executing sometime between t1 and t2 while it has a disk IO pending. When the guest machine is re-dispatched, the disk IO that was pending will finally complete, at which point the clock value associated with that completion is acquired by the guest machine device driver. Now, however, when the interval delta, t2 – t1, is calculated, the measurement of disk latency includes the amount of time the guest machine was delayed in the VMware dispatcher ready queue. Of course, from the standpoint of the application awaiting the IO operation to complete, the t2 – t1 calculation remains perfectly valid. It is no longer disk latency, per se, but it does accurately represent the delay that the application issuing the IO request perceives.

A final complication is that there is no way to tell how many times during any measurement interval the disk latency time the Windows disk device driver accumulates includes guest machine dispatching delays that occurred while IO operations were pending.

As with guest machine physical processor utilization, VMware does provide direct measurements of disk latency from the point of view of the VMware Host. For each VMware guest machine, there are additional disk performance statistics available that track bytes transferred and reads vs. writes. Measurements of disk latency at the guest machine level are not available from VMware, however.


Reconciling the VMware Host measurements of overall disk latency with the Windows guest machine measurements of disk performance at a logical or physical disk level is problematic at best. Figure 1 is an attempt to reconcile these disparate data sources under the best possible circumstances – the VMware Host was managing a single guest machine.


Figure 1. Comparing the VMware Host and Windows guest machine measurements of disk latency. To make the comparison possible, there was only a single guest machine defined for the VMware Host to run.

In Figure 1, I plotted about 35 minutes worth of disk performance data from the VMware Host (the area plotted in light orange) and from the single Windows guest (the dotted line overlay) for the same one minute measurement intervals. In an ideal world, this data should be quite similar, and in some of the intervals, the disk latency measurements are quite similar. But, in many other intervals, the measurements are quite different. It is unclear how to reconcile these differences.

Overall, since VMware is not able to report disk latency at the guest machine level, the disk performance measurements available internally from the Windows guest machine remain quite useful. You should attempt to reconcile the VMware disk measurements with the guest machine internal measurements, which with regard to reads, writes, and bytes transferred should be comparable. Whenever a Windows guest machine running on a busy VMware Host reports disk latency that is much worse than the VMware Host disk latency measurements, you need to consider the possibility that at least some of the difference is due to guest machine dispatcher delays and not poor disk hardware performance.


[1] You may recall from an earlier blog entry that the QueryPerformanceCounter function that is used in performance monitoring to generate granular measurements of elapsed time uses the hardware rdtsc instruction in Windows 7, but reverts to the HPET external timer on older machines when rdtsc cannot be trusted. For simplicity’s sake, I will ignore the potential use of the HPET in the discussion here.


Correcting the Process level measurements of CPU time for Windows guest machines running under VMware ESX

Recently, I have been writing about how Windows guest machine performance counters are affected by running in a virtual environment, including publishing two recent, longish papers on the subject, one about processor utilization metrics and another about memory management. In the processor utilization paper (which is available here), it is evident that, running under VMware, the Windows performance counters that measure processor utilization are significantly distorted. At a system level, this distortion is not problematic so long as one has recourse to the VMware measurements of actual physical CPU usage by each guest machine.

A key question – one that I failed to address properly, heretofore – is whether it is possible to correct for that distortion in the measurements of processor utilization taken at the process level inside the guest machine OS. The short answer for Windows, at least, is, “Yes.” The % Processor Time performance counters in Windows that are available at the process level are derived using a simple technique that samples the execution state of the machine approximately 64 times per second. VMware does introduce variability in the duration of the interval between samples and sometimes causes the sampling rate itself to fluctuate. However, neither the variability in the time between samples nor possible changes to the sampling rate undermine the validity of the sampling process.

In effect, the sampling data should still accurately reflect the relative proportion of the CPU time used by the individual processes running on the guest machine. If you have access to the actual guest machine CPU usage – from the counters installed by the vmtools, for example – you should be able to adjust the internal guest machine process level data accordingly. You need to accumulate a sufficient number of guest machine samples to ensure that the sample data is a good estimator of the underlying population. And, you also need to make sure that the VMware measurement intervals are reasonably well-synchronized to the guest machine statistics. (The VMware ESX performance metrics are refreshed every 20 seconds.) In practice, that means if you have at least five minutes’ worth of Windows counter data and VMware guest CPU measurements across a roughly equivalent five minute interval, you should be able to correct for the distortion due to virtualization and reliably estimate processor utilization at the process level within Windows guest machines.

Calculate a correction factor, w, from the ratio of the actual guest machine physical CPU usage as reported by the VMware hypervisor to the average processor utilization reported by the Windows guest:

w = Aggregate Guest CPU Usage / Average % Processor Time

Then multiply the process level processor utilization measurements by this correction factor to estimate the actual CPU usage of individual guest machine processes.

The impetus for me revisiting this subject was the following question, posed by Steve Marksamer on a LinkedIn forum devoted to computer capacity and performance:

[What is] the efficacy of using cpu and memory data extracted from Windows O/S objects when this data is for a virtual machine in VMware. Has there been any timing correction allowing for guest to host delay? Your thoughts/comments are appreciated. Any other way to get at process level data more accurately?

This is my attempt to formulate a succinct answer to Steve’s question.

In the spirit of full disclosure, I should also note that Steve and I once collaborated to write an article on virtualization technology that was published back in 2007. I don’t think I have seen or talked to Steve since we last discussed the final draft of our joint article back in 2007.

A recent paper of mine focused on the processor utilization measurements that are available in Windows (the full paper is available here), and contains a section that describes how they are impacted by VMware’s virtualization of timer interrupts and the Intel RDTSC clock instruction. Meanwhile, on this blog I posted a number of entries that covered similar ground (beginning here), but in smaller chunks. But, up to now, I neglected to post any material here related to the impact of virtualization on the Windows processor utilization measurements. Let me start to correct that error of omission.

Considering how VMware affects the CPU measurements reported at the system and process level inside a guest Windows machine, a particularly salient point is that VMware virtualizes the clock timer interrupts that are the basis for the Windows clock. Virtualizing clock timer interrupts has an impact on the legacy method used in Windows to measure CPU utilization at the system and process level, which samples the execution state of the system roughly 64 times per second. When the OS handles this periodic clock interrupt, it examines the execution state of the system prior to the interrupt. If a processor was running the Idle thread, then the CPU accounting function attributes all the time associated with the current interval as Idle Time, which is accumulated at the processor level. If the machine was executing an actual thread, all the time associated with the current interval is attributed to that thread and its process, and accumulated in counters associated with active threads and processes. Utilization at the processor level is calculated by subtracting the amount of Idle time accumulated in a measurement interval from the total amount of time in that interval.
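Here is a minimal sketch of that execution-state sampling, with an invented sequence of samples, showing how each tick is attributed either to the Idle thread or to the running process:

```python
# Sketch of the execution-state sampling used for CPU accounting. The sample
# sequence is invented; each sample attributes one whole tick to whatever
# was found running when the clock interrupt fired.

TICK_SECONDS = 1 / 64

samples = ["Idle", "ProcessA", "ProcessA", "Idle", "ProcessB", "ProcessA"]

accumulated = {}
for running in samples:
    accumulated[running] = accumulated.get(running, 0.0) + TICK_SECONDS

interval = len(samples) * TICK_SECONDS
busy = interval - accumulated.get("Idle", 0.0)
print(f"Processor % Processor Time = {100 * busy / interval:.1f}%")
for name, seconds in accumulated.items():
    print(f"{name}: {100 * seconds / interval:.1f}% of the interval")
```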

So, the familiar Windows performance counters that report % processor time are derived using a sampling technique. Note that a simple statistical argument based on the sample size strongly suggests that these legacy CPU time measurements (i.e., any of the % Processor Time counters in Windows) are not reliable at intervals of less than 15 seconds, which was the official guidance in the version of the Windows Performance Guide that I wrote for Microsoft Press back in 2005. But so long as the measurement interval is long enough to accumulate a sufficient number of samples – one minute measurement intervals are based on roughly 3,840 execution state samples (64 per second × 60 seconds) – the Windows performance counters provide reasonable estimates of processor utilization at both the hardware and process levels. You can easily satisfy yourself that this is true by using xperf to calculate CPU utilization precisely from context switch events and comparing those calculations to Windows performance counters gathered over the same measurement interval.

In virtualizing clock interrupts, VMware introduces variability into the duration of the interval between processor utilization samples. If a clock interrupt destined for a Windows machine that is currently parked in the ESX dispatcher queue occurs, the VMware dispatcher delays delivery of that interrupt to the guest machine until the next time that guest machine is dispatched. A VMware white paper entitled “Timekeeping in VMware Virtual Machines” has an extended discussion of the clock and timer distortions that occur in Windows guest machines when there are virtual machine scheduling delays. Note that it is possible for VMware guest machine scheduling delays to grow large enough to cause some periodic timer interrupts to be dropped entirely. In VMware terminology, these are known as lost ticks, another tell-tale sign of contention for physical processors. In extreme cases where the backlog of timer interrupts ever exceeds 60 seconds, VMware attempts a radical re-synchronization of the time of day clock in the guest machine, zeros out its backlog of timer interrupts, and starts over.

Unfortunately, there is currently no way to measure directly the magnitude of these guest machine dispatcher delays, which is the other part of Steve’s question. The VMware measurement that comes closest to characterizing the nature and extent of the delays associated with timely delivery of timer interrupts is CPU Ready milliseconds, which reports the amount of time a guest machine was waiting for execution, but was delayed in the ESX Dispatcher queue. (VMware has a pretty good white paper on the subject available here).

If the underlying ESX Host is not too busy, guest machines are subject to very little CPU Ready delay. Under those circumstances, the delays associated with virtualizing clock timer interrupts will introduce some jitter into the Windows guest CPU utilization measurements, but these should not be too serious. However, if the underlying ESX Host does become very heavily utilized, clock timer interrupts are subject to major delays. At the guest level, these delays impact both the rate of execution state samples and the duration between samples.

In spite of this potential disruption, it is still possible to correct the guest level % processor time measurements at a process and thread level if you have access to the direct measurements of hardware CPU utilization that VMware supplies for each guest machine. The execution state sampling performed by the periodic clock interval handler still amounts to a random sampling of processor utilization. The amount of CPU time accumulated at the process level and reported in the Windows performance counters remains proportional to the actual processor time consumed by processes running on the guest machine. This reconciliation requires processor utilization measurements gathered over longer intervals, with a reasonable degree of synchronization between the external VMware measurements and the internal Windows guest measurements.

To take a simple example of calculating and applying the factor for correcting the processor utilization measurements, let’s assume a guest machine that is configured to run with just one vCPU. Suppose you see the following performance counter values for the Windows guest machine:

% Processor Time    (virtual)
Overall             73%
Process A           40%
Process B           20%

From the VMware ESX measurements, suppose you then see that the actual CPU utilization of the guest machine is 88%. You can then correct the internal process CPU time measurements by weighting them by the ratio of the actual CPU time consumed to the CPU time reported by the guest machine (88:73, or roughly 1.2).

% Processor Time    (virtual)    (actual)
Overall             73%          88%
Process A           40%          48%
Process B           20%          24%

Similarly, for a guest machine with multiple vCPUs, first compute the average utilization per vCPU to derive the weighting factor, since VMware only reports an aggregate CPU utilization for each guest machine.
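Putting the pieces together, here is a sketch of the full calculation for a hypothetical two-vCPU guest; all of the numbers are invented for illustration:

```python
# Sketch of the correction for a hypothetical two-vCPU guest. All numbers are
# invented. VMware reports one aggregate CPU usage figure per guest, so it is
# averaged across the vCPUs before forming the weighting factor w.

vcpus = 2
vmware_aggregate_guest_cpu = 150.0   # percent of one physical CPU, summed over vCPUs
windows_avg_processor_time = 62.0    # _Total % Processor Time inside the guest

actual_avg_per_vcpu = vmware_aggregate_guest_cpu / vcpus   # 75% per vCPU
w = actual_avg_per_vcpu / windows_avg_processor_time       # correction factor, ~1.21

process_counters = {"Process A": 40.0, "Process B": 20.0}  # guest-internal values
corrected = {name: round(pct * w, 1) for name, pct in process_counters.items()}
print(round(w, 2), corrected)    # 1.21 {'Process A': 48.4, 'Process B': 24.2}
```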

What about the rest of the Windows performance counters?

Stay tuned. I will discuss the rest of the Windows performance counters in the next blog post.

A comment on “Memory Overcommitment in the ESX Server”

The VMware Technical Journal recently published a paper entitled “Memory Overcommitment in the ESX Server.” It traverses similar ground to my recent blog entries on the subject of VMware memory management, and similarly illustrates the large potential impact paging can have on the performance of applications running under virtualization. Using synthetic benchmarks, the VMware study replicates the major findings from the VMware benchmark data that I recently reported on beginning here.

That VMware is willing to air publicly performance results that are less than benign is a very positive sign. Unfortunately, it is all too easy for VMware customers to configure machines where memory is overcommitted and subject to severe performance problems. VMware customers require frank guidance from their vendor to help them recognize when this happens and understand what steps to take to minimize these problems in the future. The publication of this article in the VMTJ is a solid first step in that direction.

Memory overcommitment factor

One of the interesting results reported in the VMTJ paper is Figure 6, reproduced below. It reports application throughput (in operations per minute for some synthetic benchmark workload that simulates an online DVD store) against a memory overcommitment factor, calculated as:

memory overcommitment factor = Σ (guest machine allocated virtual memory) / available machine memory

This overcommitment factor is essentially a V:R ratio, the amount of virtual memory allocated to guest machines divided by the amount of RAM available to the VMware ESX host.
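A quick sketch of that calculation follows, using the figures quoted from the paper; the 10 GB plus 10 GB split between the two virtual machines is my assumption, since only the 20 GB total is given:

```python
# Sketch of the memory overcommitment factor using the figures quoted from the
# paper. The 10 GB + 10 GB split between the two virtual machines is assumed;
# the paper only states that the two VMs allocate about 20 GB in total.

def overcommitment_factor(guest_allocated_gb, machine_memory_gb):
    return sum(guest_allocated_gb) / machine_memory_gb

guest_allocations_gb = [10, 10]
for machine_memory_gb in (42, 12, 7):
    factor = overcommitment_factor(guest_allocations_gb, machine_memory_gb)
    print(f"{machine_memory_gb} GB host: overcommitment factor = {factor:.2f}")
# 42 GB -> 0.48 (not over-committed), 12 GB -> 1.67, 7 GB -> 2.86
```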


Figure 6. Reproduced from the article, “Memory Overcommitment in the ESX Server” published in the VMware Technical Journal.

The two virtual machines configured in the test allocate about 20 GB of RAM, according to the authors. The benchmark was initially run on a physical machine configured with 42 GB of RAM. With machine memory not over-committed, throughput of the test workload was measured at almost 26K operations per minute. The authors then began to tighten the screws on the amount of machine memory available for the VMware Host machine to manage. The performance of the workload degraded gradually as machine memory was reduced in stages down to 12 GB of machine memory. (The memory overcommitment factor for the 12 GB machine is 20/12, or 1.6667.)

Finally, the authors attempted to run the benchmark workload in a machine configured with only 7 GB of RAM, yielding the degradation shown on the line chart. Under severe memory constraints, the benchmark app managed less than 9,000 operations/minute, a degradation of about 3X.

The same sequence of experiments was repeated on the machine, but this time with a solid state disk (SSD) configured for VMware to use for swapping (plotted using the green line). With the faster paging device configured, throughput of the test workload increased about 10% on the 12 GB machine. The magnitude of the improvements associated with using the SSD increased even more for the 7 GB machine, but the workload still suffered serious performance degradation, as shown.

Reading the SSD comparison, I experienced an extremely acute (but very pleasant) sense of déjà vu. Thirty years ago, I worked for StorageTek, which was selling one of the industry’s first SSDs, known as the 4305 because it emulated a high speed disk drum that IBM built called the 2305. Due to its limited capacity and the fact that the storage itself was volatile RAM, the 4305 was best suited to serve as a fast paging device. It was often attached to an IBM mainframe running the MVS operating system.

In the system engineering organization at STK, where I worked back in the stone age, I first learned to assess the potential improvement in performance from plugging in the 4305 by calculating a V:R ratio using the customer’s data. I was very taken with the idea that we could use this V:R calculation for performance prediction, something I explored more rigorously a few years later in a paper called “Nonlinear Regression Models and CPE Applications,” available at the Computer Measurement Group’s web site. V:R remains a very useful metric to calculate, and I have also used it subsequently when discussing Windows virtual memory management.

By way of comparison to the results reported in the VMTJ paper, the overcommitment factor in the VMware ESX benchmarks I reported on was 2:1; the performance impact on throughput was an approximately 3X degradation in the elapsed time of the benchmark app under the impact of VMware memory ballooning and swapping.

Flaws in the analysis

I can recommend reading the VMTJ paper on its merits, but I could not help noticing it was flawed in several places, which I will mention here.

Transparent memory sharing

The first is an error of omission. The authors fail to report any of the shared memory measurements that are available. Instead, they accept on faith the notion that VMware’s transparent memory sharing feature is effective even when machine memory is substantially over-committed. That is not what I found.

Transparent memory sharing is an extremely attractive mechanism for packing as many idle guest Windows machines as possible onto a single VMware ESX host. What I discovered (and reported here) in the benchmarks we ran was that the amount of memory sharing was effectively reduced to near zero as the guest workloads started to become active.

The VMTJ authors do note that when the Windows OS initializes, it writes zeros to every physical page of RAM that is available. Since VMware can map all the pages on the OS Zero list to a single resident machine memory page, transparent memory sharing is a spectacularly successful strategy initially. However, once the Windows OS’s own virtual memory management kicks in when the guest machine becomes active, the contents of the Zero list, for example, become quite dynamic. This apparently defeats the simple mechanism ESX uses to detect identical virtual pages in real-time.

VMware transparent memory sharing is a wonderful mechanism for packing as many mostly idle guest Windows machines onto ESX Servers as possible. On the other hand, it can be a recipe for disaster when you are configuring production workloads to run as VMware virtual machines.

Measuring guest machine Active Memory

A second flaw in the paper is another error of omission. The VMTJ authors decline to discuss the sampling methodology VMware uses to measure a guest machine’s current working set. This is what VMware calls guest machine active memory, the measurement that is plugged into the formula for calculating the memory over-commitment factor as the numerator. To avoid duplicating all the effort that the guest machine OS expends trying to implement some form of LRU-based page replacement, VMware calculates a guest machine’s active memory using a sampling technique. The benchmark results I reported on showed the guest machine Active Memory measurement VMware reports can be subject to troubling sampling errors.

The VMTJ authors manage to avoid using this dicey metric to calculate the memory overcommitment factor by relying on a benchmark program that is effectively a memory “sponge.” These are programs designed specifically to stress LRU-based memory management by touching at random all the physical pages that are allocated to the guest machine.
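For illustration only (this is not the authors' benchmark code), a memory sponge can be sketched in a few lines: allocate a large buffer, then keep touching its pages at random so that every page looks recently referenced:

```python
import random

PAGE_BYTES = 4096
SPONGE_BYTES = 512 * 1024 * 1024   # hypothetical size; in practice sized to the guest's RAM

sponge = bytearray(SPONGE_BYTES)   # commit a large block of virtual memory
page_count = SPONGE_BYTES // PAGE_BYTES

for _ in range(10_000_000):
    # touch a random page so that every page looks recently referenced,
    # which stresses LRU-based page replacement in the guest and the host
    sponge[random.randrange(page_count) * PAGE_BYTES] = 1
```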

With their sponge program running, the authors can reliably calculate the memory overcommitment factor without using the Active Memory counter that is subject to measurement anomalies. With the sponge program, the amount of guest machine committed physical memory is precisely equal to the amount of active memory. For the other DVDStore workload, they were able to assess its active memory requirements by running it in stand-alone mode without any memory contention.

Again, things aren’t nearly so simple in the real world of virtualization. In order to calculate the memory overcommitment factor for real-world VMware workloads, customers need to use the Active Memory counter, which is not a wholly reliable number due to the way it is calculated using sampling.

To manage VMware, access to a reliable measure of a guest machine’s active memory is very important. Assessing just how much virtual memory your Windows application needs isn’t always a simple matter, as I tried to show in my discussion of the VMware benchmark results. A complication arises in Windows whenever any well-behaved Windows application is running that (1) expands to fill the size of physical RAM, (2) listens for low physical memory notifications from the OS, and (3) uses these notifications to trigger virtual memory management routines internally. Examples of applications that do this include Microsoft SQL Server. Any 64-bit .NET Framework application, including 64-bit ASP.NET web sites, is also in this category because the Framework’s Common Language Runtime (CLR) triggers garbage collection in its managed Heaps whenever it receives low physical memory notifications from the OS. With processes designed to take advantage of whatever physical memory is available, it is actually quite difficult to assess how much RAM they require for good performance, absent, of course, internal performance statistics on throughput or response time.

Memory management in Hyper-V

Finally, the VMware paper mischaracterizes the memory management features in Microsoft’s Hyper-V product, dismissing it as an inferior derivative of what VMware accomplishes in a more comprehensive fashion. Moreover, this dismissal is unaccompanied by any hard performance data to back up the claim that VMware ESX is better. It may well be true that VMware is superior, but I am not aware of any data in the public domain that would enable anyone to argue conclusively one way or the other.

It seems entirely possible to me that Hyper-V actually currently holds the advantage when it comes to virtual memory management. Memory management in Hyper-V, which installs on top of the Windows Server OS, can leverage the Windows Memory Manager. Memory management benefits from the fact that the OS understands the guest machine’s actual pattern of physical memory usage since the guest machine appears as simply another process address space executing on the virtualization host.

In its original incarnation, Hyper-V could not take advantage of this because the entire container process that the guest machine executed in was relegated to non-paged memory in the virtualization host. Guest machines acquired their full allotment of virtualized “physical” memory at initialization and retained it throughout as static non-paged memory, which put Hyper-V at a significant disadvantage compared to VMware, which granted guest machine memory on demand.

In the case of a Windows guest machine, VMware would initially have to grant the guest all the physical memory it requested because of the zero memory initialization routine Windows runs at start-up to clear the contents of RAM. While a Windows guest machine is idle, however, VMware’s transparent memory sharing feature reduces the guest machine memory footprint by mapping all those identical, empty zero-filled pages to a single page in machine memory. This behavior was evident in the benchmark test results I reported.

So, VMware’s dynamic approach to memory management lets you over-commit machine memory and pack significantly more guest machines into the same memory footprint. This, as I have indicated, works just great so long as those guest machines are mostly idle, making machines devoted to application testing and QA, for example, no-brainer candidates for virtualization.

However, the most recent versions of Hyper-V have a feature that Microsoft calls “Dynamic Memory” that changes all that. It is fair to say that Dynamic Memory gives Hyper-V parity with VMware in this area. RAM is allocated on demand, similar to VMware. When you configure a Hyper-V guest, you specify the maximum amount of RAM that Hyper-V will allocate on its behalf. But that RAM is only granted on demand. To improve start-up performance, you can also specify the amount of an initial grant of RAM to the guest machine. (I suspect a Windows guest machine is “enlightened” when it detects it is running under Hyper-V and skips the zero memory initialization routine.)

There is one other configurable parameter the new dynamic memory management feature uses called the memory buffer. The VMTJ authors mistakenly conflate the Hyper-V memory buffer with the ballooning technique that VMware pioneered. In fact, since the Hyper-V Host understands the memory usage pattern of its guests, there is no reason to resort to anything like ballooning to drive page replacement at the guest machine level when the virtualization host faces memory over-commitment.

The configurable Hyper-V guest machine memory buffer is a feature which tries to maintain a pool of free or available pages to stay ahead of the demand for new pages. In Hyper-V, the virtualization host allocates machine memory pages for the guest based on guest machine committed memory. (My guess is that the allocation mechanism Hyper-V relies on is the guest machine’s construction of the Page Table entry for the virtual memory that is allocated.)

The memory buffer is a cushion that Hyper-V adds to the guest machine’s Committed bytes – but only when sufficient machine memory is available – to accommodate rapid expansion since the number of committed bytes fluctuates as processes come and go, for example. Requests for new pages are satisfied from this pool of available memory immediately without incurring any of the delays associated with page replacement, which can often be relegated to background operations.
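My reading of how the memory buffer works can be sketched as follows; treating the buffer as a percentage cushion added to the guest's committed bytes is an assumption on my part, and the numbers are invented:

```python
# Sketch of my reading of the Hyper-V memory buffer: a configurable cushion
# added on top of the guest's committed bytes, granted only when the host has
# machine memory to spare. Treating the buffer as a percentage is an assumption.

def target_guest_memory_mb(committed_mb, buffer_percent, host_memory_is_tight=False):
    if host_memory_is_tight:
        return committed_mb                     # no cushion when memory is scarce
    return committed_mb * (1 + buffer_percent / 100)

print(target_guest_memory_mb(4096, 20))                             # ~4915 MB
print(target_guest_memory_mb(4096, 20, host_memory_is_tight=True))  # 4096 MB
```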

Configuring a guest machine’s memory buffer is basically a hint to the Hyper-V virtualization host memory manager – essentially, the Windows Memory Manager – to allocate and maintain some amount of additional virtual memory for the guest machine to use when it needs to expand. This is a good practice in any virtual memory management system. It is not a ballooning mechanism at all.