Virtual memory management in VMware: Final thoughts

This is the final blog post in a series on VMware memory management. The previous post in the series is here:

Final Thoughts

My colleagues and I constructed, and I have been discussing in some detail, a case study in which VMware memory over-commitment led to guest machine memory ballooning and swapping, which, in turn, had a substantial impact on the performance of the applications that were running. When memory contention was present, the benchmark application took three times longer to execute to completion than the same application run standalone. The difference was entirely due to memory management “overhead”: the cost of demand paging when the supply of machine memory was insufficient to the task.

Analysis of the case study results unequivocally shows that the cost equation associated with aggressive server consolidation using VMware needs to be adjusted to account for the performance risks that can arise when memory is over-committed. When configuring the memory on a VMware Host machine, it is important to realize that virtual memory systems do not degrade gracefully. When virtual memory workloads overflow the amount of physical memory available for them to execute, they are subject to punishing page fault resolution delays. The delay a running thread incurs while a disk I/O brings a block of code or data from the paging file into memory to resolve a page fault is several orders of magnitude longer than almost any other sort of execution time delay that thread is ever likely to encounter.

VMware implements a policy of memory over-commitment in order to support aggressive server consolidation. In many operational environments, such as server hosting or application testing, guest machines are frequently dormant, and when they are active, they tend to be extremely active in bursts. These kinds of environments are well-served by aggressive guest machine consolidation on server hardware that is massively over-provisioned.

On the other hand, implementing overly aggressive server consolidation of active production workloads with more predictable levels of activity presents a very different set of operational challenges. One entirely unexpected result of the benchmark was the data on Transparent Memory Sharing, reported in an earlier post, that showed the benefits of memory sharing evaporating almost completely once guest machines actively used their allotted physical memory. Since the guest machines used in the benchmark were configured identically, down to running the exact same application code, it was surprising to see how ineffective memory sharing proved to be once the benchmark applications started to execute on their respective guest machines. Certainly the same memory sharing mechanism is extremely effective when guest machines are idle for extended periods of time. But this finding that memory sharing is ineffective when the guest machines are active, if it can be replicated in other environments, would call for re-evaluating the value of the whole approach, especially since idle machines can be swapped out of memory entirely.

Moreover, the performance-related risks for critical workloads that arise when memory over-commitment leads to ballooning and swapping are substantial. Consider that if an appropriate amount of physical memory was chosen for a guest machine configuration at the outset, removing any pages from the guest machine memory footprint via ballooning and/or swapping is potentially very damaging. For this reason, for example, warnings from SQL Server DBAs about VMware’s policy of over-committing machine memory are very prominent in blog posts. See http://www.sqlskills.com/blogs/jonathan/the-accidental-dba-day-5-of-30-vm-considerations/ for an example.

In the benchmark test discussed here, each of the guest machines ran identical workloads that, when a sufficient number of them were run in tandem, combined to stress the virtual memory management capabilities of the VMware Host. Using the ballooning technique, VMware successfully transmitted the external memory contention to the individual guest machines. This successful transmission diffused the response to the external problem, but did not in any way lessen its performance impact.

More typical of a production environment, perhaps, is the case where a single guest machine is the primary source of the memory contention. Just as, in a single OS image, one user process consuming an excess of physical memory can create a resource shortage with a global impact, a single guest machine consuming an excess of machine memory can generate a resource shortage that impacts multiple tenants of the virtualization environment.

Memory Reservations.

In VMware, customers do have the ability to prioritize guest machines so that all tenants sharing an over-committed virtualization Host machine are not penalized equally when there is a resource shortage. The most effective way to protect a critical guest machine from being subjected to ballooning and swapping due to a co-resident guest is to set up a machine memory Reservation. A machine memory Reservation establishes a floor, guaranteeing that a certain amount of machine memory is always granted to the guest. With a Reservation value set, VMware will not subject a guest machine to ballooning or swapping that would cause the machine memory granted to the guest to fall below that minimum.

But in order to set an optimal memory Reservation size, it is first necessary to understand how much physical memory the guest machine requires, not always an easy task. A Reservation value that is set too high on a Host machine experiencing memory contention will have the effect of increasing the level of memory reclamation activity on the remaining co-tenants of the VMware Host.

Another challenge is how to set an optimal Reservation value for guest machines running applications that, like the .NET Framework application used in the benchmark discussed here, dynamically expand their working set to grab as much physical memory as possible on the machine. Microsoft SQL Server is one of the more prominent Windows server applications that does this, but others include the MS Exchange Store process (fundamentally also a database application) and ASP.NET web sites. Like the benchmark application, SQL Server and Store listen for Low Memory notifications from the OS and will trim back their working set of resident pages in response. If the memory remaining proves inadequate to the task, there are performance ramifications.
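For readers unfamiliar with this mechanism, the sketch below shows, in outline, how an application can wait on the Windows low-memory notification. The Win32 calls (CreateMemoryResourceNotification and QueryMemoryResourceNotification) are the documented API; the TrimCaches method is a hypothetical stand-in for whatever SQL Server, the Exchange Store process, or the CLR actually does when it is notified.

```csharp
// A minimal sketch of a "well-behaved" Windows application listening for the
// OS low-memory notification. The Win32 signatures are real; TrimCaches() is a
// hypothetical placeholder for the application-specific response.
using System;
using System.Runtime.InteropServices;
using System.Threading;

class LowMemoryListener
{
    enum MemoryResourceNotificationType { LowMemoryResourceNotification = 0 }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr CreateMemoryResourceNotification(MemoryResourceNotificationType type);

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool QueryMemoryResourceNotification(IntPtr handle, out bool state);

    static void Main()
    {
        IntPtr handle = CreateMemoryResourceNotification(
            MemoryResourceNotificationType.LowMemoryResourceNotification);

        while (true)
        {
            // The notification object is signaled whenever the OS judges that
            // physical memory is running low (i.e., page trimming is imminent).
            if (QueryMemoryResourceNotification(handle, out bool lowMemory) && lowMemory)
            {
                TrimCaches();   // hypothetical: release cached buffers, shrink the working set
                GC.Collect();   // the CLR's own response to this signal is comparable
            }
            Thread.Sleep(TimeSpan.FromSeconds(5));
        }
    }

    static void TrimCaches() { /* application-specific cleanup */ }
}
```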

With server applications like SQL Server that expand to fill the size of RAM, it is often very difficult to determine how much RAM is optimal, except through trial and error. The configuration flexibility inherent in virtualization technology does offer a way to experiment with different machine memory configurations. Once the appropriate set of performance “experiments” has been run, the results can then be used to reserve the right amount of machine memory for these guest machines. Of course, these workloads are also subject to growth and change over time, so once memory reservation parameters are set, they need to be actively monitored at the VMware Host, guest machine, and application levels.

Virtual memory management in VMware: Swapping

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here, a discussion of what happens when the VMware memory balloon driver inflates to try to force the guest machines resident on a machine where memory is over-committed to shed some of their least recently used pages, which can then be stolen by VMware.

Swapping

As discussed in an earlier post, VMware has recourse to steal physical memory pages granted to a guest OS at random, which VMware terms swapping, to relieve a serious shortage of machine memory. When free machine memory drops below a 4% threshold, swapping is triggered. Memory ballooning, which is triggered when free machine memory drops below a 6% threshold, remains active during swapping. VMware resorts to random page trimming from guest machine working sets at the lower threshold because ballooning can take a while to work, and reaching the 4% threshold suggests that a faster-acting antidote to memory over-commitment is necessary.
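To keep the two thresholds straight, here is a schematic restatement of the policy just described, written as a small C# function. It simply mirrors the percentages reported in this series; it is an illustration, not VMware source code, and the real memory-state machine has additional states and transitions.

```csharp
// Schematic only: maps a free machine memory fraction to the reclamation actions
// described in this series (ballooning below 6%, swapping added below 4%).
// This is an illustration of the stated policy, not VMware code.
using System;

class ReclamationThresholds
{
    static string ActionsFor(double freeMachineMemoryFraction)
    {
        if (freeMachineMemoryFraction < 0.04)
            return "ballooning + swapping";   // the faster-acting antidote is needed
        if (freeMachineMemoryFraction < 0.06)
            return "ballooning";              // ask the guest OSes to page on VMware's behalf
        return "none";                        // free machine memory is considered adequate
    }

    static void Main()
    {
        foreach (double free in new[] { 0.10, 0.05, 0.03 })
            Console.WriteLine($"{free:P0} free -> {ActionsFor(free)}");
    }
}
```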

During the case study, VMware resorted to swapping beginning around 9:10 AM when the Memory State variable reported a memory state transition to the “Hard” memory state, as shown in Figure 19. Initially, VMware swapped out almost 600 MB of machine memory granted to the four guest machines. Also, note that swapping is very biased. The ESXAS12B guest machine was barely touched, while at one point 400 MB of machine memory from the ESXAS12E machine was swapped out.

VMware-memory-management-Figure-19

Figure 19. VMware resorted to random page replacement – or swapping – to relieve a critical shortage of machine memory when usage of machine memory exceeded 96%. Swapping was biased – not all guest machines were penalized equally.

Given how infrequently random page replacement policies are implemented, it is surprising to discover that they often perform reasonably well in simulations. Nevertheless, they still perform much worse than stack algorithms that order candidates for page replacement based on Least Recently Used criteria. Because VMware selects pages from a guest machine’s allotted machine memory for swapping at random, without any knowledge of what sort of page it is, it is entirely possible for VMware to remove truly awful candidates from the current working set of a guest machine’s machine memory pages. With random page replacement, some worst case scenarios are entirely possible. For example, VMware might choose to swap out a frequently referenced page that contains code from the operating system kernel or Page Table entries, pages that the guest OS itself would be among the least likely to choose for page replacement. That is the fundamental concern with a random page replacement policy. However, given that it is likely that only a very small subset of a guest machine’s set of allocated pages is critical to performance, the actual performance of a random page replacement policy is not always horrendous.
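To see why this concern is real but not always fatal, the toy simulation below compares random eviction against LRU on a synthetic reference string with strong locality. It is a deliberately simplified sketch: the frame count, hot-set size, and access pattern are invented for illustration, and it bears no relationship to VMware's actual swap implementation.

```csharp
// Toy page-replacement simulation: random eviction vs. LRU on a synthetic,
// locality-heavy reference string. Illustrative only; not VMware's algorithm.
using System;
using System.Collections.Generic;
using System.Linq;

class PageReplacementToy
{
    static void Main()
    {
        var rng = new Random(42);
        // 90% of references hit a small "hot" set of pages; 10% are scattered
        // across a much larger cold range (a crude model of program locality).
        int[] refs = Enumerable.Range(0, 200_000)
            .Select(_ => rng.NextDouble() < 0.9 ? rng.Next(0, 64) : rng.Next(0, 2048))
            .ToArray();

        int frames = 128;   // frames of "machine memory" available to the guest
        Console.WriteLine($"Random replacement faults: {SimulateRandom(refs, frames, rng)}");
        Console.WriteLine($"LRU replacement faults:    {SimulateLru(refs, frames)}");
    }

    // Random replacement: on a fault, evict an arbitrary resident page.
    static int SimulateRandom(int[] refs, int frames, Random rng)
    {
        var resident = new List<int>();
        int faults = 0;
        foreach (int page in refs)
        {
            if (resident.Contains(page)) continue;   // hit
            faults++;
            if (resident.Count >= frames)
                resident.RemoveAt(rng.Next(resident.Count));   // victim chosen blindly
            resident.Add(page);
        }
        return faults;
    }

    // LRU replacement: on a fault, evict the page referenced longest ago.
    static int SimulateLru(int[] refs, int frames)
    {
        var lastUse = new Dictionary<int, int>();
        int faults = 0;
        for (int t = 0; t < refs.Length; t++)
        {
            int page = refs[t];
            if (!lastUse.ContainsKey(page))
            {
                faults++;
                if (lastUse.Count >= frames)
                {
                    int victim = lastUse.OrderBy(kv => kv.Value).First().Key;
                    lastUse.Remove(victim);
                }
            }
            lastUse[page] = t;   // record most recent reference time
        }
        return faults;
    }
}
```

With these parameters, random eviction generates noticeably more faults than LRU, but nowhere near the worst case, because most references still land in the small hot set that usually stays resident under either policy.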

To gauge how effective VMware’s random page replacement policy is, the rate of pages swapped out was compared to the swap-in rate. This comparison is shown in Figure 20. There were two large bursts of swap out activity, the first one taking place at 9:10 AM when the swap out rate was reported at about 8 MB/sec. The swap-in rate never exceeded 1 MB/sec, but a small amount of swap-in activity continued to be necessary over the next 90 minutes of the benchmark run, until the guest machines were shut down and machine memory was no longer over-committed. In clustered VMware environments, the vMotion facility can be invoked automatically to migrate a guest machine from an over-committed ESX Host to another machine in the cluster that is not currently experiencing memory contention. This action may relieve the immediate memory over-commitment, but it may also simply shift the problem to another VM Host, to say nothing of the disruption in processing that the vMotion-induced migration causes.

As noted in the previous blog entry, the benchmark program took three times longer to execute when there was memory contention from all four active guest machines, compared to running in a standalone guest machine. Delays due to VMware swapping were certainly one of the important factors contributing to elongated program run-times.

VMware-memory-management-Figure-20

Figure 20. Comparing pages swapped out to pages swapped in.

This entry on VMware swapping concludes the presentation of the results of the case study that stressed the virtual memory management facilities of a VMware ESX host machine. Based on an analysis of the performance data on memory usage gathered at the level of both the VMware Host and internally in the Windows guest machines, it was possible to observe the virtual memory management mechanisms used by VMware in operation very clearly.

With this clearer understanding of VMware memory management in mind, I’ll discuss some of the broader implications for performance and capacity planning of large scale virtualized computing infrastructures in the next (and last) post in this series.

Virtual memory management in VMware: memory ballooning

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here. Details of the case study are discussed in this earlier post.

Ballooning

Ballooning is a complicated topic, so bear with me if this post is much longer than the previous ones in this series.

As described earlier, VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate” when it begins to encounter contention for machine memory, defined as the amount of free machine memory available for new guest machine allocation requests dropping below 6%. In the benchmark example I am discussing here, the Memory Usage counter rose to 98% allocation levels and remained there for the duration of the test while all four virtual guest machines were active.

Figure 7, which shows the guest machine Memory Granted counter for each guest, with an overlay showing the value of the Memory State counter reported at the end of each one-minute measurement interval, should help to clarify the state of VMware memory management during the case study. The Memory State transitions indicated mean that VMware would attempt to use both ballooning and swapping to try to relieve the over-committed virtual memory condition.
The impact of ballooning will be discussed first.

VMware-memory-contention-showing-memory-state

Figure 7. Memory State transitions associated with the active memory granted to guest machines, which drained the supply of free machine memory pages.

Ballooning occurs when the VMware Host recognizes that there is a shortage of machine memory that must be replenished using page replacement. Since VMware has only limited knowledge of current page access patterns, it is not in a position to implement an optimal LRU-based page replacement strategy. Ballooning attempts to shift responsibility for page replacement to the guest machine OS, which presumably can implement a more optimal page replacement strategy than the VMware hypervisor. Essentially, using ballooning, VMware reduces the amount of physical memory available for internal use within the guest machine, forcing the guest OS to exercise its own memory management policies.

The impact that ballooning has on virtual memory management being performed internally by the guest OS suggests that it will be well worth looking inside the guest OS to assess how it detects and responds to the shortage of physical memory that ballooning induces. Absent ballooning, when the VMware Host recognizes that there is a shortage of machine memory, this external contention for machine memory is not necessarily manifest inside the guest OS, where, for example, the physical memory as configured might, in fact, be sized very appropriately. Unfortunately, for Windows guest machines the problem is not simply that a guest machine configured to run well with a given amount of physical memory abruptly finds itself able to access far less than the amount of physical memory configured for it to use, which is serious enough. An additional complication is that well-behaved Windows applications listen for notifications from the OS and attempt to manage their own process virtual memory address spaces in response to these low memory events.

Baseline measurements (no memory contention).

To help understand what happened during the memory contention benchmark, it will be useful to compare those results to a standalone baseline set of measurements gathered when there was no contention for machine memory. We begin by reviewing some of the baseline memory measurements taken from inside Windows when the benchmark program was executed standalone on the VMware ESX server with only one guest machine active.

When the benchmark program was executed standalone with only one guest machine active, there was no memory contention. A single 8 GB guest defined to run on a 16 GB VMware ESX machine could count on all the machine memory granted to it being available. For the baseline, the VMware Host reported overall machine memory usage never exceeding 60%, and the balloon targets communicated to the guest machine balloon driver were zero values. The Windows guest machine did experience some internal memory contention, though, as evidenced by a significant amount of demand paging that occurred during the baseline.

Physical memory usage inside the Windows guest machine during the baseline run is profiled in Figure 8. The Available Bytes counter is shown in light blue, while memory allocated to process address spaces is reported in dark blue. Windows performs page replacement when the pool of available bytes drops below a threshold number of free pages on the Zero list. When the benchmark process is run, beginning at 2:50 pm, physical memory begins to be allocated to the benchmark process, shrinking the pool of available bytes. Over the course of the execution of the benchmark process – a period of approximately 30 minutes – the Windows page replacement policy periodically needs to replenish the pool of available bytes by trimming back the number of working set pages allocated to running processes. In standalone mode, when the benchmark process is active, the Windows guest machine manages to utilize all 8 GB of the physical memory granted to it under VMware. This is primarily a result of the automatic memory management policy built into the .NET Framework runtime, which the benchmark program uses. While a .NET Framework program specifically allocates virtual memory as required, the runtime, not the program, is responsible for reclaiming any memory that was previously allocated but is no longer needed. The .NET Framework runtime periodically reclaims currently unused memory by scheduling a garbage collection when it receives a notification from the OS that physical memory (Available Bytes) is running short.

The result is a tug of war: the benchmark process continually grows its working set to fill physical memory, and Windows memory management periodically signals the CLR of an impending shortage of physical memory, causing the CLR to free some of the virtual memory previously allocated by the managed process. The demand paging rate of the Windows guest is reported in Figure 8 as a dotted line chart, plotted against the right axis. There are several one-minute spikes reaching 150 hard page faults per second. The Windows page replacement policy leads to the OS trimming physical pages that are later re-accessed during the benchmark run and subsequently need to be retrieved from the paging file. In summary, during a standalone execution of the benchmark workload, the benchmark process allocates and uses enough physical memory to trigger the Windows page replacement policy on the 8 GB guest machine.

VMware-memory-management-Figure-8

Figure 8. Physical Memory usage by the Windows guest OS when a single guest machine was run in a standalone mode.

The physical memory utilization profile of the Windows guest machine reflects the use of virtual memory by the benchmark process, which is the only application running. This process, ThreadContentionGenerator.exe, is a multithreaded 64-bit program written in C# that deliberately stresses the automated memory management functions of the .NET runtime. The benchmark program’s usage of process virtual memory is highlighted in Figure 9.

The benchmark program allocates some very large data structures and persists them through long processing cycles that access, modify, and update them at random. The program allocates these data structures using the managed Heaps associated with the .NET Framework’s Common Language Runtime (CLR). The CLR periodically schedules a garbage collection thread that automatically deletes and compacts the memory previously allocated to objects that are no longer actively referenced. (Jeff Richter’s book, CLR via C#, is a good, basic reference on garbage collection inside a .NET Framework process.)
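A minimal sketch of that allocation pattern appears below: arrays of roughly 85,000 bytes or more are allocated directly on the Large Object Heap and then touched at random so their pages stay resident. The segment sizes and loop counts here are arbitrary illustrations; this is not the actual ThreadContentionGenerator.exe source.

```csharp
// Illustrative sketch of a workload that builds large, long-lived data structures
// on the Large Object Heap and references them at random. Not the benchmark's code.
using System;
using System.Collections.Generic;

class LargeObjectHeapSketch
{
    static void Main()
    {
        var rng = new Random();
        var segments = new List<byte[]>();

        // Allocate ~4 GB in 64 MB chunks; arrays this large go straight to the LOH.
        for (int i = 0; i < 64; i++)
            segments.Add(new byte[64 * 1024 * 1024]);

        // Touch the data at random so the pages stay in the process working set.
        for (long touches = 0; touches < 100_000_000; touches++)
        {
            byte[] segment = segments[rng.Next(segments.Count)];
            segment[rng.Next(segment.Length)]++;
        }

        Console.WriteLine($"Gen 2 collections so far: {GC.CollectionCount(2)}");
    }
}
```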

VMware-memory-management-Figure-9

Figure 9. The sizes of the managed heaps inside the benchmarking process address space when there was no “external” memory contention. The size of the process’s Large Object Heap varies between 4 and 8 GB during a standalone run.

Figure 9 reports the sizes of the four managed Heaps during the standalone benchmark run. It shows the amount of process private virtual memory bytes allocated ranging between 4 and 8 GB during the run. The size of the Large Object Heap, which is never compacted during garbage collection, dwarfs the sizes of the other three managed Heaps, which are generation-based. Normally in a .NET application process, garbage collection is initiated automatically when the size of the Generation 0 heap grows to exceed a threshold value, which is chosen in 64-bit Windows based on the amount of physical memory that is available. But the Generation 0 heap in this case remains quite small, smaller than the Generation 0 “budget.”  The CLR also initiates garbage collection when it receives a LowMemoryResourceNotification event from the OS, a signal that page trimming is about to occur. Well-behaved Windows applications that allow their working sets to expand until they reach the machine’s physical memory capacity wait on this notification. In response to the LowMemoryResourceNotification event, the CLR dispatches its garbage collection thread to reclaim whatever unused virtual memory it can find inside the process address space. Garbage collections initiated by LowMemoryResourceNotification events cause the size of the Large Object Heap to fluctuate greatly during the standalone benchmark run.

To complete the picture of virtual memory management at the .NET process level, Figure 10 charts the cumulative number of garbage collections that were performed inside the ThreadContentionGenerator.exe process address space. For this discussion, it is appropriate to focus on the number of Generation 0 garbage collections – the fact that so many Generation 0 collections escalate to Gen 1 and Gen 2 collections is a byproduct of the fact that so much of the virtual memory used by the ThreadContentionGenerator program was allocated in the Large Object Heap. Figure 10 shows about 1000 Generation 0 garbage collections occurring. This represents a reasonable, but rough, estimate of the number of times the OS generated Low Memory resource notifications during the run to trigger garbage collection.
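For reference, the cumulative collection counts plotted in Figure 10 can be sampled from inside a managed process with GC.CollectionCount(), or externally via the “.NET CLR Memory” performance counters. A minimal sampler might look like the following sketch; the one-minute interval simply matches the measurement interval used in this study.

```csharp
// A small sketch that samples cumulative garbage-collection counts once a minute,
// mirroring the kind of data charted in Figure 10. Illustrative only.
using System;
using System.Threading;

class GcCountSampler
{
    static void Main()
    {
        while (true)
        {
            Console.WriteLine(
                $"{DateTime.Now:HH:mm:ss}  Gen0={GC.CollectionCount(0)}  " +
                $"Gen1={GC.CollectionCount(1)}  Gen2={GC.CollectionCount(2)}");
            Thread.Sleep(TimeSpan.FromMinutes(1));
        }
    }
}
```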

VMware-memory-management-Figure-10

Figure 10. The number of CLR Garbage collections inside the benchmarking process when the guest Windows machine was run in standalone mode.

To summarize, the benchmark program is a multi-threaded, 64-bit .NET Framework application that will allocate virtual memory up to the physical memory limits of the machine. When the Windows OS encounters a shortage of empty Zero pages as a result of these virtual memory allocations, it issues a Low Memory notification that is received and processed by the CLR. Upon receipt of this Low Memory notification, the CLR schedules a garbage collection to reclaim any private bytes previously allocated on the managed Heaps that are no longer in use.

Introducing external memory contention.

Now, let’s review the same memory management statistics when the same VMware Host is asked to manage a configuration of four such Windows guest machines running concurrently, all actively attempting to allocate and use 8 GB of physical memory.

As shown in Figure 11, ballooning begins to kick in during the benchmark run around 9:10 AM, which also corresponds to an interval in which the Memory State transitioned to the “hard” state where both ballooning and swapping would be initiated. (From Figure 2, we saw this corresponds to intervals where the machine memory usage was reported running about 98% full.) These balloon targets are communicated to the balloon driver software resident in the guest OS. An increase in the target instructs the balloon driver to “inflate” by allocating memory, while a decrease in the target causes the balloon driver to deflate. Figure 11 reports the memory balloon targets communicated to each of the guest machine resident balloon drivers, with balloon targets rising to over 4 GB per machine. When the balloon drivers in each guest machine begin to inflate, the guest machines will eventually encounter contention for physical memory, which they will respond to by using their page replacement policies to identify older pages to be trimmed from physical memory.

VMware-memory-management-Figure-11

Figure 11. The memory balloon targets for each guest machine increase to about 4 GB when machine memory fills.

Note that when the Windows guest machine has the VMware tools installed, the balloon driver’s memory targets are also reported in the guest’s VM Memory performance counters.

In Windows, when VMware’s vmmemsty.sys balloon driver inflates, it allocates physical memory pages and pins them in physical memory until they are explicitly released. To determine how effectively ballooning works to relieve a shortage of machine memory, it is useful to drill into the guest machine performance counters and look for signs of increased demand paging and other indicators of memory contention. Based on how Windows virtual memory management works [3], we investigated the following telltale signs that virtual memory was under stress as a result of the balloon driver inflating inside the Windows guest:

  1. memory allocated in the nonpaged pool should spike due to allocation requests from the VMware balloon driver
  2. a reduction in Available Bytes, leading to an increase in hard page faults (Page Reads/sec) as a result of increased contention for virtual memory
  3. applications that listen for low memory notifications from the OS will initiate page replacement to trim their resident sets voluntarily

(As discussed above, the number of garbage collections performed inside the ThreadContentionGenerator process address space corresponds to the number of low memory notifications received from the OS indicating that page trimming is about to occur.)
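A simple way to watch for the first two of these telltale signs from inside the guest is to sample the corresponding Windows performance counters. The sketch below polls them once a minute, matching the measurement interval used in this case study; it is a minimal illustration, not the collection tooling actually used here.

```csharp
// Sketch: sample the guest-side Windows Memory counters associated with the
// telltale signs listed above (nonpaged pool bytes, Available Bytes, Page Reads/sec).
using System;
using System.Diagnostics;
using System.Threading;

class GuestMemoryMonitor
{
    static void Main()
    {
        using var nonpaged  = new PerformanceCounter("Memory", "Pool Nonpaged Bytes");
        using var available = new PerformanceCounter("Memory", "Available Bytes");
        using var pageReads = new PerformanceCounter("Memory", "Page Reads/sec");

        // Rate counters need two samples; take a throwaway first reading.
        pageReads.NextValue();

        while (true)
        {
            Thread.Sleep(TimeSpan.FromMinutes(1));
            Console.WriteLine(
                $"{DateTime.Now:HH:mm}  NonpagedMB={nonpaged.NextValue() / (1024 * 1024):F0}  " +
                $"AvailableMB={available.NextValue() / (1024 * 1024):F0}  " +
                $"PageReads/s={pageReads.NextValue():F1}");
        }
    }
}
```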

Generally speaking, if ballooning is effective, it should cause a reduction in the working set of the benchmark process address space, since that is the main consumer of physical memory inside each guest. Let’s take a look.

Physical memory usage.

Figure 12 shows physical memory usage inside one of the active Windows guest machines, reporting the same physical memory usage metrics as Figure 8. (All four show a similar pattern.) Beginning at 9:10 AM when the VMware balloon driver inflates, process working sets (again shown in dark blue) are reduced, which shrinks the physical memory footprint of the guest machine to approximately 4 GB. Concurrently, the Available Bytes counter, which includes the Standby list, also drops precipitously.

VMware-memory-management-Figure-12

Figure 12. Windows guest machine memory usage counters show a reduction in process working sets when the balloon driver inflates.

Figure 12 also shows an overlay line graph of the Page Reads/sec counter, which is a count of all the hard page faults that need to be resolved from disk. Page Reads/sec is quite erratic while the benchmark program is active. What tends to happen when the OS is short of physical memory is that demand paging to and from disk increases, until the paging disk saturates. Predictably, a physical memory bottleneck is transfigured into a disk IO bottleneck in systems with virtual memory. Throughput of the paging disk serves as an upper limit on performance when there is contention for physical memory. In a VMware configuration, this paging IO bottleneck is compounded when guest machines share the same physical disks, which was the case here.

Process virtual memory and paging.

Figure 13 drills into the process working set for the benchmarking application, ThreadContentionGenerator.exe. The process working set is evidently impacted by the ballooning action, decreasing in size from a peak value of 6 GB down to less than 4 GB. The overlay line in Figure 13 shows the amount of Available Bytes in Windows. The reduction in the number of Available Bytes triggers the low memory notifications that the OS delivers and the .NET Framework CLR listens for. When the CLR receives an OS LowMemoryResourceNotification event, it schedules a garbage collection run to release previously allocated, but currently unused, virtual memory inside the application process address space.

VMware-memory-management-Figure-13

Figure 13. The working set of the benchmarking application process is reduced from over 4 GB down to about 2 GB when VMware ballooning occurs.

Figure 14 looks for additional evidence that the VMware balloon driver induces memory contention inside the Windows guest machine when the balloon inflates. It graphs the counters associated with physical memory allocations, showing a sharp increase in the number of bytes allocated in the Nonpaged pool, corresponding to the period when the balloon driver begins to inflate. The size of the Nonpaged pool shows a sharp increase from 30 MB to 50 MB, beginning shortly after 9:10 AM. The balloon evidently deflates shortly before 10:30 AM, over an hour later when VMware no longer experiences memory contention.

What is curious, however, is that the magnitude of the increase in the size of the nonpaged pool shown in Figure 14 is so much smaller than the VMware guest machine balloon targets reported in Figure 11. The guest machine balloon target is approximately 4 GB in Figure 11, and it is evident in Figure 12 that the balloon inflating reduced the memory footprint of the OS by approximately 4 GB. However, the increase in the size of the nonpaged pool (in light blue) reported in Figure 14 is only 20 MB. This discrepancy requires some explanation.

What seems likely is that the balloon driver inflates by calling MmProbeAndLockPages, allocating physical memory pages that are not associated with either of the standard paged or nonpaged system memory pools. (Note that the http.sys IIS kernel-mode driver allocates a physical memory resident cache that is similarly outside the range of both the nonpaged and paged pools and is not part of any process address space either. Like the balloon driver’s memory, the size of the http.sys cache is not reported in any of the Windows Memory performance counters. By allocating physical memory that is outside the system’s virtual memory addressing scheme, the VMware balloon driver can inflate effectively even in 32-bit Windows, where the virtual memory size of the nonpaged pool is constrained architecturally.)

The size of the VMware memory balloon is not captured directly by any of the standard Windows memory performance counters. The balloon inflating does appear to cause an increase in the size of the nonpaged pool, probably reflecting the data structures that the Windows Memory Manager places there to keep track of the locked pages that the balloon driver allocates.

VMware-memory-management-Figure-14

Figure 14. Physical memory allocations from inside one of the Windows guest machines. The size of the nonpaged pool shows a sharp increase from 30 MB to 50 MB, beginning shortly after 9:10 AM when the VMware balloon driver inflates.

Figure 14 reveals a gap in the measurement data at around 9:15. This gap is probably a missed data collection interval caused by VMware taking drastic action to alleviate memory contention: blocking execution of the guest machine, which is triggered when the amount of free machine memory dips below 2%.

Figure 15 provides some evidence that ballooning induces an increased level of demand paging inside the Windows guest machines. Windows reports demand paging rates using the Page Reads/sec counter, which shows consistently higher levels of demand paging activity once ballooning is triggered. The increased level of demand paging is less pronounced than might otherwise be expected from the extent of the process working set reduction that was revealed in Figure 13. As discussed above, the demand paging rate is predictably constrained by the bandwidth of the paging disk, which in this configuration is a disk shared by all the guest machines.

VMware-memory-management-Figure-15

Figure 15. Ballooning causes an increased level of demand paging inside the Windows guest machines. The Pages Read/sec counter from one of the active Windows guest machines is shown. Note, however, the demand paging rate is constrained by the bandwidth of the paging disk, which in this test configuration is a single disk shared by all the guest machines.

For comparison purposes, it is instructive to compare the demand paging rates in Windows from one of the guest machines to the same guest machine running the benchmark workload in a standalone environment where only a single Windows guest was allowed to execute. In Figure 16, the demand paging rate for the ESXAS12B Windows guest during the benchmark is contrasted with the same machine and workload running the standalone baseline. The performance counter data from the standalone run is graphed as a line chart overlaying the data from the memory contention benchmark. In standalone mode, the Windows guest machine has exclusive access to the virtual disk where the OS paging file resides; when executing in contention mode, the guest machines share the disk where their paging files reside. In standalone mode – where there is no disk contention – Figure 16 shows that the standalone Windows guest machine is able to sustain higher demand paging rates.

Due to the capacity constraint imposed by the bandwidth of the shared paging disk, the most striking comparison in the measurement data shown in Figure 16 is the difference in run-times. The benchmark workload that took about 30 minutes to execute in standalone mode ran for over 90 minutes when there was memory contention, almost three times longer. Longer execution times are the real performance impact of memory contention in VMware, as in any other operating system that supports virtual memory. The hard page faults that occur inside the guest OS delay the execution of every instruction thread that experiences them, even operating system threads. Moreover, when VMware-initiated swapping is also occurring – more about that in a moment – execution of the guest OS workload is also potentially impacted by page faults whose source is external.

VMware-memory-management-Figure-16

Figure 16. Hard page fault rates for one of the guest machines, comparing the rate of Page Reads/sec with and without memory contention. In standalone mode – where there is no disk contention – the Windows guest machine is able to sustain higher demand paging rates. Moreover, the benchmark workload that took about 30 minutes to execute in a standalone mode ran for over 90 minutes when there was memory contention.

VMware ballooning appears to cause Windows to send low memory notifications to the CLR, which then schedules a garbage collection run that attempts to shrink the amount of virtual memory the benchmark process uses.

Garbage collection inside a managed process.

Finally, let’s look inside the process address space for the benchmark application and at the memory management performed by the .NET Framework’s CLR. The cumulative number of .NET Framework garbage collections that occurred inside the ThreadContentionGenerator process address space is reported in Figure 17 for the configuration where there was memory contention. Figure 18 shows the size of the managed heaps inside the ThreadContentionGenerator process address space and the impact of ballooning. Again, the size of the Large Object Heap dwarfs the other managed heaps.

Comparing the usage of virtual memory by the benchmarking process in Figure 18 to the working set pages that are actually resident in physical memory, as reported in Figure 13, it is apparent that the Windows OS is aggressively trimming pages from the benchmarking process address space. Figure 13 shows the working set of the benchmarking process address space constrained to just 2 GB, once the balloon driver inflated. Meanwhile, the size of the Large Object Heap shown in Figure 18 remains above 4 GB. When an active 4 GB virtual memory address space is confined to running within just 2 GB of physical memory, the inevitable result is that the application’s performance will suffer from numerous paging-related delays. We conclude that ballooning successfully transforms the external contention for machine memory that the VMware Host detects into contention for physical memory that the Windows guest machine needs to manage internally.

Comparing Figure 17 to Figure 10, we see that in both situations, the CLR initiated about 1000 Generation 0 garbage collections. During the memory contention test, the benchmark process takes much longer to execute, so these garbage collections are reported over a much larger interval. This causes the shape of the cumulative distribution in Figure 17 to flatten out, compared to Figure 10. The shape of the distribution changes because the execution time of the application elongates.

VMware-memory-management-Figure-17

Figure 17. CLR Garbage collections inside the benchmarking process when there was “external” memory contention.

Comparing the sizes of the Large Object Heap allocated by the ThreadContentionGenerator process address space, Figure 9, from the standalone run, shows the size of the Large Object Heap fluctuating from almost 8 GB down to 4 GB. Meanwhile, in Figure 18, the size of the Large Object Heap is constrained to between 4 and 6 GB once the VMware balloon driver began to inflate. The size of the Large Object Heap is constrained due to aggressive page trimming by the Windows OS.

VMware-memory-management-Figure-18

Figure 18. The size of the managed heaps inside the benchmarking process address space when there was “external” memory contention. The size of the Large Object Heap dwarfs the sizes of the other managed heaps, but is constrained to between 4 and 6 GB due to VMware ballooning.

That completes our look at VMware ballooning during a benchmark run where machine memory was substantially over-committed and VMware was forced into responding, using its ballooning technique.

In the next post, we will consider the impact of VMware’s random page replacement algorithm, which it calls swapping.