Virtual memory management in VMware: Transparent memory sharing

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.

In this installment, I will discuss the impact and effectiveness of transparent memory sharing, using the performance data that was gathered during a benchmark that stressed VMware’s virtual memory management capabilities.

Transparent memory sharing.

Transparent memory sharing is one of the key memory management mechanisms that supports aggressive server consolidation. VMware dynamically detects memory pages that are identical within or across guest machine images. When identical pages are detected, VMware maps them to a single page in machine memory. When guest machines are largely idle, transparent memory sharing enables VMware to pack guest machine images efficiently into a single hardware platform, especially for machines running the same OS, the same OS version, and the same applications. However, when guest machines are active, the benefits of transparent memory sharing are greatly reduced, as will soon be apparent.

VMware uses a background thread that scans guest machine pages continuously, looking for duplicates. This process is illustrated in Figure 5. Candidates for memory sharing are found by calculating a hash value from the contents of each page and looking for a collision in a hash table built from the hash values of other current pages. If a collision is found, the candidate for sharing is compared to the base page byte by byte. If the contents of the candidate and the base page match, VMware points the PTE of the copy to the same page of machine memory backing the base page.
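
To make the mechanism concrete, here is a minimal C sketch of the hash-then-compare logic just described. None of this is VMware's actual code: the structures and names (try_share, page_hash) are invented for illustration, and a simple FNV-1a hash stands in for whatever hash function ESX actually uses.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define BUCKETS   65536

typedef struct page_entry {
    uint64_t hash;               /* hash of the page contents */
    uint8_t *machine_page;       /* backing machine memory page */
    struct page_entry *next;     /* hash-bucket chain */
} page_entry;

static page_entry *table[BUCKETS];

/* FNV-1a over the whole page; any fast hash would do here. */
static uint64_t page_hash(const uint8_t *p)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++)
        h = (h ^ p[i]) * 1099511628211ULL;
    return h;
}

/* Returns the base page the candidate can share, or NULL if no identical
 * page exists yet, in which case the candidate is inserted as a new base. */
uint8_t *try_share(uint8_t *candidate, page_entry *slot)
{
    uint64_t h = page_hash(candidate);
    for (page_entry *e = table[h % BUCKETS]; e; e = e->next) {
        /* Hash collision: confirm the match with a byte-by-byte compare. */
        if (e->hash == h && memcmp(e->machine_page, candidate, PAGE_SIZE) == 0)
            return e->machine_page;   /* remap the candidate's PTE here */
    }
    slot->hash = h;
    slot->machine_page = candidate;
    slot->next = table[h % BUCKETS];
    table[h % BUCKETS] = slot;
    return NULL;
}

int main(void)
{
    static uint8_t a[PAGE_SIZE], b[PAGE_SIZE];   /* two identical zero pages */
    static page_entry ea, eb;
    try_share(a, &ea);                            /* first page becomes a base */
    printf("second page shared: %s\n", try_share(b, &eb) ? "yes" : "no");
    return 0;
}
```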

Memory sharing is provisional. VMware uses a copy-on-write mechanism to break sharing whenever a shared page is modified and can no longer be shared. This is accomplished by flagging the shared page's PTE as read-only. When an instruction subsequently attempts to store data into the page, the hardware generates a protection exception. VMware handles the exception by creating a duplicate of the page and re-executing the failed store instruction against the duplicate.

Transparent memory sharing has great potential benefits, but there is some overhead necessary to support the feature. One source of overhead is the processing performed by the background scanning thread. There are tuning parameters to control the rate at which these background memory scans run, but, unfortunately, there are no associated performance counters that would help a system administrator adjust those parameters. The other source of overhead is the copy-on-write mechanism, which entails handling the additional hardware interrupts associated with these soft page faults. Nor is there a metric that reports the rate at which these additional soft page faults occur.
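
The copy-on-write "break" can be sketched the same way. Again, the types and names below are illustrative only, assuming a simplified one-field PTE; the point is the sequence: a store faults on the read-only shared page, the handler creates a private duplicate, repoints the PTE, and lets the store proceed.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

typedef struct {
    uint8_t *machine_page;   /* machine frame backing this guest page */
    int      writable;       /* cleared while the frame is shared */
} pte_t;

/* Invoked from the write-protection fault handler when a store hits a
 * shared, read-only page: give the writer its own private duplicate. */
void cow_break(pte_t *pte)
{
    if (pte->writable)
        return;                                   /* spurious fault */
    uint8_t *copy = malloc(PAGE_SIZE);
    if (!copy)
        return;                                   /* out of memory: retry later */
    memcpy(copy, pte->machine_page, PAGE_SIZE);   /* private duplicate */
    pte->machine_page = copy;                     /* repoint this PTE */
    pte->writable = 1;                            /* stores now succeed */
    /* ...the faulting store instruction is then re-executed against the copy */
}

int main(void)
{
    static uint8_t shared_frame[PAGE_SIZE];       /* one frame, two PTEs */
    pte_t p1 = { shared_frame, 0 }, p2 = { shared_frame, 0 };
    cow_break(&p1);                               /* p1's guest wrote the page */
    printf("still sharing: %s\n",
           p1.machine_page == p2.machine_page ? "yes" : "no");
    free(p1.machine_page);
    return 0;
}
```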

[Image: VMware-memory-sharing]

Figure 5. Transparent memory sharing uses a background thread to scan memory pages, compute a hash value from their contents, and compare it to the hash values already computed. In the case of a collision, the contents of the candidate page are compared byte by byte to the collided page. If the pages have identical contents, VMware points both pages to the same machine memory location.

In the case study, transparent memory sharing is initially extremely effective – when the guest machines are largely idle. Figure 6 renders the Memory Shared performance counter from each of the guest machines as a stacked area chart. At 9 AM, when the guest machines are still idle, almost all of the 8 GB granted to three of the machines (ESXAS12C, ESXAS12D, and ESXAS12E) is being shared by pointing those pages at the machine memory pages assigned to the fourth guest machine (ESXAS12B). Together, these three guest machines share about 22 GB of memory, which allows VMware to pack four 8-GB OS images into a machine footprint of about 10-12 GB.

However, once the benchmark programs start to execute, the amount of shared memory dwindles to near zero. This is an interesting result. With this workload of identically configured virtual machines, there should still be significant opportunities to share identical code pages even while the benchmark programs are active. But VMware is apparently unable to capitalize on this opportunity once the guest machines become active. A likely explanation for the diminished returns from memory sharing is that the virtual memory management performed by each of the active guest Windows machines changes the contents of too many virtual memory pages too frequently, overwhelming the copy-detection mechanism. Since the benchmark programs also consume CPU resources, another possible explanation is severe processor contention that prevented the memory-scanning thread from being dispatched while the benchmark programs were active. However, the VMware Host reported overall processor utilization of only about 40-60% throughout most of the active benchmarking period, so this hypothesis was rejected. This is where resource accounting that reported the memory scan rate, or the amount of time the scan thread was active, would be quite helpful.

[Image: VMware-memory-management-Figure-5]

Figure 6. The impact of transparent memory sharing dwindled to near zero once the benchmarking workloads became active.

In the next post in this series, we will dig into VMware’s use of ballooning.


Virtual memory management in VMware: a case study

This is a continuation of a series of blog posts on VMware memory management. The previous post in the series is here.

Case Study.

The case study reported here is based on a benchmark using a simulated workload that generates contention for machine memory. VMware ESX Server software was installed on a Dell Optiplex 990 with an Intel i7 quad-core processor and 16 GB of RAM. (Hyper-Threading was disabled on the processor through the BIOS.) Four identical Windows Server 2012 guest machines were then defined, each configured to run with 8 GB of physical memory. Each Windows guest runs a 64-bit benchmark program, ThreadContentionGenerator.exe, which the author developed.

The benchmark program was written using the .NET Framework. The program allocates a very large block of private memory and accesses that memory randomly. The benchmark program is multi-threaded, and it updates the allocated array using explicit locking to maintain the integrity of its internal data structures. Executing threads also periodically simulate I/O waits by going to sleep rather than issuing reads or writes against the file system, which avoids exercising the machine’s physical disks. Performance data from both VMware and the Windows guest machines was gathered at one-minute intervals for the duration of the benchmark testing, approximately two hours. For comparison purposes, a single guest machine was activated to execute the same benchmark in a standalone environment with no contention for machine memory. Running standalone, with no memory contention, the benchmark executes in about 30 minutes.
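
The source of ThreadContentionGenerator.exe is not reproduced here, and the sketch below is C rather than .NET, but it mimics the behavior just described: a large private allocation, multiple threads updating random slots under an explicit lock, and sleeps standing in for I/O waits. The sizes, thread count, and sleep interval are made-up values, not the benchmark's actual parameters.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define ARRAY_BYTES (2UL * 1024 * 1024 * 1024)   /* large private allocation */
#define THREADS     8
#define TOUCHES     (1UL << 22)                  /* updates per thread */

static long *block;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    /* Per-thread xorshift64 PRNG; the added constant keeps the seed nonzero. */
    uint64_t x = (uint64_t)(uintptr_t)arg * 2654435761UL + 88172645463325252UL;
    size_t slots = ARRAY_BYTES / sizeof(long);
    for (unsigned long i = 0; i < TOUCHES; i++) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        pthread_mutex_lock(&lock);                /* explicit locking */
        block[x % slots]++;                       /* random page touch */
        pthread_mutex_unlock(&lock);
        if ((i & 0xFFFF) == 0)
            usleep(1000);   /* simulated I/O wait: sleep, never touch disk */
    }
    return NULL;
}

int main(void)
{
    block = calloc(1, ARRAY_BYTES);   /* pages are committed as they are touched */
    if (!block) return 1;
    pthread_t t[THREADS];
    for (long i = 0; i < THREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < THREADS; i++)
        pthread_join(t[i], NULL);
    free(block);
    return 0;
}
```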

Memory allocation on demand

Figure 2 tracks three key ESX memory performance metrics during the test: the total Memory Granted to the four guest Windows machines, the total Memory Active for the same four guest Windows machines, and the VMware Host’s Memory Usage counter, reported as a percentage of the total machine memory available. The total Memory Granted counter increases at the outset in 8-GB steps as each of the Windows guest machines spins up. The memory benchmarking programs were started just before 9 AM and continued executing over the next two hours, finally winding down near 11 AM. The benchmark programs drive Active Memory to almost 15 GB shortly after 9 AM, and overall Memory Usage to 98%. (In this configuration, 2% free memory translates into about 300 MB of available machine memory.)

Notice that the Memory Active counter, which purports to measure the guest OS working set of resident pages, exhibits some anomalies, presumably associated with the way it is estimated using sampling. There are periodic spikes in the counter at the beginning of the testing period, when the guest machines have just been activated but are not yet doing any work. Toward the end of the benchmark period, after many of the benchmark worker threads have completed, there is another spike resembling the earlier ones. This later spike shows total guest machine active memory briefly reaching some 20 GB, which, of course, is physically impossible.

[Image: VMware-memory-management-Figure-2]

Figure 2. Memory Granted, Memory Active and % Memory Used during the benchmark.

As the benchmark programs execute in each of the guest machines, the Memory Granted counter plunges from 32 GB down to about 15 GB. The vCenter Performance Counters documentation provides this definition of the counter: “The amount of memory that was granted to the VM by the host. Memory is not granted to the [guest] until it is touched one time and granted memory may be swapped out or ballooned away if the VMkernel needs the memory.” Evidently, during initialization, a Windows Server machine touches every page in physical memory, so the full 8 GB of RAM is initially granted to each guest machine. But in this case study, only 16 GB of machine memory is available in total. As VMware detects memory contention, the memory granted to each guest machine is reduced through page replacement, using the ballooning and swapping mechanisms.

Figure 3 attempts to show the breakdown of machine memory allocated by adding the allotments associated with the VMKernel to the sum of the Active memory consumed by each of the guest machines. The anomalous spike in Active Memory near the end of the benchmark test pushes overall machine memory usage beyond the amount of RAM actually installed, which, as noted above, is physically impossible. This measurement anomaly, possibly associated with a systematic sampling error, is troubling because it makes it difficult to obtain a reliable, precise breakdown of machine memory allocation and usage in VMware.

[Image: VMware-memory-management-Figure-3]

Figure 3. Machine memory allocations, including the areas of memory allocated by the VMKernel.

Figure 3 also shows a dotted-line overlay that reports the value of the Memory State counter. The Memory State counter reports the value of the memory state at the end of each measurement interval, so these values should be interpreted as sample observations. There were three sample observations when the memory state was “Soft,” indicating that ballooning was taking place, and there was an earlier sample observation when the memory state was “Hard,” indicating that swapping was triggered.

Figure 4 shows the same counter data as Figure 3, without the Memory Active counter data. We see that the VMware Host management functions consume about 1.5 GB of RAM altogether. This includes the Memory Overhead counter, which reports the space the shadow page tables occupy. The amount of machine memory that the VMware hypervisor consumes remains flat throughout the active benchmarking period.

[Image: VMware-memory-management-Figure-4]

Figure 4. Machine memory areas allocated by the VMKernel, including memory management “overhead.”

In the next post in this series, we will look at the effectiveness of another VMware memory management feature, transparent memory sharing.


Virtual memory management in VMware.

Server virtualization technology, as practiced by products such as the VMware ESX hypervisor, applies similar virtual memory management techniques in order to operate an environment where multiple virtual guest machines are provided separate address spaces so they can execute concurrently, sharing a single hardware platform. To avoid confusion, in this section machine memory will refer to the actual physical memory (or RAM) installed on the underlying VMware Host platform. Virtual memory will continue to refer to the virtual address space a guest OS builds for a process. Physical memory will refer to the virtualized view of machine memory that VMware grants to each guest machine. In other words, virtualization adds a second level of memory address virtualization. (A white paper published by VMware entitled “Understanding Memory Resource Management in VMware® ESX™ Server” is a good reference.)

When VMware spins up a new virtual guest machine, it grants that machine a set of contiguous virtual memory addresses that correspond to a fixed amount of physical memory, as specified by configuration parameters. The fact that this grant of physical memory pages does not reflect a commitment of actual machine memory is transparent to the guest OS, which proceeds to create page tables and allocate this (virtualized) physical memory to running processes just as it would if it were running natively on the hardware. The VMware hypervisor is then responsible for maintaining a second set of physical-to-machine memory mapping tables, which VMware calls shadow page tables. Just as the page tables maintained by the OS map virtual addresses to (virtualized) physical addresses, the shadow page tables map the virtualized physical addresses granted to the guest OS to actual machine memory pages, which are managed by the VMware hypervisor.

VMware maintains a set of shadow page tables that map virtualized physical addresses to machine memory addresses for each guest machine it is executing. In effect, a second level of address translation occurs each time a program executing inside a guest machine references a virtual memory address: first the guest OS maps the process virtual address to a virtualized physical address, and then the VMware hypervisor maps the virtualized physical address to an actual machine memory address. Server hardware is available that supports this two-phase virtual-to-physical address mapping, as illustrated in Figure 1. In a couple of white papers, VMware reports that this hardware greatly reduces the effort required by the VMware Host software to maintain the shadow page tables.
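
A toy model may help fix the idea of the two-step translation. Flat arrays stand in here for real multi-level page tables, and all the names are illustrative; the hardware support mentioned above effectively performs both lookups during address translation.

```c
#include <stdint.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  ((1UL << PAGE_SHIFT) - 1)

typedef struct {
    const uint64_t *guest_pt;    /* guest OS: virtual page -> "physical" page  */
    const uint64_t *shadow_pt;   /* hypervisor: "physical" page -> machine page */
} address_space;

/* Walk both levels to turn a guest virtual address into a machine address. */
uint64_t translate(const address_space *as, uint64_t va)
{
    uint64_t vpn = va >> PAGE_SHIFT;    /* step 1: guest page table lookup */
    uint64_t ppn = as->guest_pt[vpn];   /* virtualized physical page number */
    uint64_t mpn = as->shadow_pt[ppn];  /* step 2: shadow page table lookup */
    return (mpn << PAGE_SHIFT) | (va & PAGE_MASK);
}

int main(void)
{
    /* Toy four-page guest: virtual page 1 -> physical page 2 -> machine page 5. */
    const uint64_t guest_pt[4]  = { 0, 2, 1, 3 };
    const uint64_t shadow_pt[4] = { 7, 6, 5, 4 };
    address_space as = { guest_pt, shadow_pt };
    uint64_t ma = translate(&as, (1UL << PAGE_SHIFT) | 0x10);
    return ma == ((5UL << PAGE_SHIFT) | 0x10) ? 0 : 1;
}
```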

[Image: VMware-virtual-memory-management-shadow-page-tables]

Figure 1. Two levels of page tables are maintained on virtualization Hosts. The first level is the normal set of page tables that the guest machines build to map virtual address spaces to (virtualized) physical memory. The virtualization layer builds a second set of shadow page tables that are used in a two-step address translation process to derive the actual machine memory address during instruction execution.

Ballooning.

VMware attempts to manage virtual memory on demand without duplicating all the effort that its guest machines already expend on managing virtual memory. The VMware hypervisor, which needs to scale effectively on machines with very large amounts of physical memory, gathers only a minimal amount of information about the memory access patterns of the virtual machine guests it is currently running. When VMware needs to replenish its inventory of available pages, it pressures the resident virtual machines to make the page replacement decisions themselves by inducing paging within the guest OS, using a technique known as ballooning.

The VMware memory manager intervenes to handle the page fault that occurs when a page initially granted to a guest OS is first referenced. This first reference triggers the allocation of machine memory to back the affected page, and it results in the hypervisor setting the valid bit of the corresponding shadow page table entry. Because the valid bit was set on this first reference, VMware knows the page is active. Following the initial access, however, VMware does very little to track the reference patterns of a guest OS’s active pages, nor does it attempt to use an LRU-based page replacement algorithm.
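
The first-touch path is simple enough to sketch in a few lines; the structure below is an invented stand-in for a real shadow PTE.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef struct {
    uint8_t *machine_page;   /* NULL until the guest first touches the page */
    int      valid;          /* shadow-PTE valid bit */
} shadow_pte;

/* Fault handler path for the first reference to a granted-but-unbacked page:
 * back it with machine memory and mark the shadow PTE valid. */
void first_touch_fault(shadow_pte *pte)
{
    if (pte->valid)
        return;                                /* already backed */
    pte->machine_page = calloc(1, PAGE_SIZE);  /* allocate machine memory */
    pte->valid = 1;                            /* VMware now counts it active */
}

int main(void)
{
    shadow_pte pte = { 0 };
    first_touch_fault(&pte);   /* guest touches the page for the first time */
    return pte.valid ? 0 : 1;
}
```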

VMware does try to estimate how many of the pages allocated to a guest machine are actually active, using sampling. Periodically, it selects a small random sample of the guest machine’s active pages and flips the valid bit in each corresponding shadow PTE. (According to the “Understanding Memory Resource Management in VMware® ESX™ Server” white paper, ESX selects 100 physical pages at random from each guest machine and records how many of them are accessed within the next 60 seconds. The sampling rate can be adjusted by changing Mem.SamplePeriod in the ESX advanced settings.) If any of the sampled pages flagged as invalid are referenced again, they are soft-faulted back into the guest OS working set with little delay. The percentage of sampled pages that are re-referenced within the sampling period is used to estimate the total number of active pages in the guest machine working set. Note that it is only an estimate. This sampling is done mainly to identify guest machines that are idle and to assess what is known as an idle memory tax: pages from idle guest machines are preferred if VMware needs to perform page replacement.
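
Here is a sketch of that sampling estimate. The 100-page sample and 60-second window come from the white paper cited above; everything else (structures, names, the toy numbers in main) is illustrative.

```c
#include <stddef.h>
#include <stdio.h>

#define SAMPLE_SIZE 100   /* pages sampled per guest, per the white paper */

typedef struct {
    int valid;     /* shadow-PTE valid bit, deliberately flipped off */
    int touched;   /* set by the soft-fault handler on re-reference */
} sampled_pte;

/* Start of a sample period: invalidate the randomly chosen pages. */
void begin_sample(sampled_pte s[SAMPLE_SIZE])
{
    for (int i = 0; i < SAMPLE_SIZE; i++) {
        s[i].valid = 0;      /* next touch soft-faults and sets .touched */
        s[i].touched = 0;
    }
}

/* End of the period (60 seconds by default): scale the touched fraction
 * up to the guest's whole allocation to estimate its active working set. */
size_t estimate_active(const sampled_pte s[SAMPLE_SIZE], size_t total_pages)
{
    size_t touched = 0;
    for (int i = 0; i < SAMPLE_SIZE; i++)
        touched += (size_t)(s[i].touched != 0);
    return touched * total_pages / SAMPLE_SIZE;
}

int main(void)
{
    sampled_pte sample[SAMPLE_SIZE];
    begin_sample(sample);
    for (int i = 0; i < 25; i++)
        sample[i].touched = 1;   /* pretend a quarter were re-referenced */
    /* 25% of a 2,097,152-page (8 GB) guest: about 524,288 active pages. */
    printf("estimated active pages: %zu\n", estimate_active(sample, 2097152));
    return 0;
}
```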

Using the page fault mechanism described above, VMware assigns free machine memory pages to a guest OS on demand. When the amount of free machine memory available for new guest machine allocation requests drops below 6%, ballooning is triggered. Ballooning is an attempt to induce page stealing in the guest OS. It works as follows. VMware installs a balloon driver inside the guest OS and signals the driver to begin to “inflate.” In a Windows guest, the balloon device driver is vmmemctl.sys. The balloon driver uses a private communications channel to poll the VMware Host once per second to obtain a ballooning target. Waldspurger [7] reports that in Windows, the balloon inflates by calling standard routines available to device drivers that need to pin virtual memory pages in memory. The two memory allocation APIs Waldspurger references are MmProbeAndLockPages and MmAllocatePagesForMdlEx. These APIs allocate pages that remain resident in physical memory until the device driver explicitly frees them.

After allocating these balloon pages, which remain empty of any content, the balloon driver sends a return message to the VMware Host, providing a list of the physical addresses of the pages it has acquired. Since these pages will remain unused, the VMware memory manager can reclaim the machine memory backing them immediately upon receipt of this reply. Ballooning itself, then, has no guaranteed immediate impact on memory contention inside the guest. The intent is to pin enough guest OS pages in physical memory to trigger the guest machine’s page replacement policy. If ballooning does not cause the guest OS to experience memory contention, i.e., if the balloon request can be satisfied without triggering the guest machine’s page replacement policy, there will be no visible impact inside the guest machine. If there is no relief from the memory contention at the host level, VMware may continue to increase the guest machine’s balloon target until the guest machine starts to shed pages. We will see how effectively this process works in the next blog entry in this series.
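
The inflate loop on the guest side might look something like the following sketch. The helper functions here are toy stubs, not the real driver's interfaces; in an actual Windows guest the pinning would go through kernel APIs like MmAllocatePagesForMdlEx, and the target would arrive over the private channel mentioned above.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_BALLOON_PAGES 16    /* toy capacity; a real driver tracks far more */

static uint64_t pinned_pfns[MAX_BALLOON_PAGES];
static size_t   balloon_size;   /* pages the balloon currently holds */

/* Toy stubs standing in for the real machinery. */
static size_t poll_balloon_target(void) { return 4; }
static uint64_t next_pfn = 100;
static uint64_t pin_guest_page(void) { return next_pfn++; }
static void report_pfns_to_host(const uint64_t *pfns, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("ballooned PFN %llu\n", (unsigned long long)pfns[i]);
}

/* Once-per-second poll: inflate toward the host's target, then tell the
 * host which physical pages it may reclaim from machine memory. */
void balloon_poll(void)
{
    size_t target = poll_balloon_target();
    size_t start = balloon_size;
    while (balloon_size < target && balloon_size < MAX_BALLOON_PAGES)
        pinned_pfns[balloon_size++] = pin_guest_page();  /* pin, never use */
    if (balloon_size > start)
        report_pfns_to_host(pinned_pfns + start, balloon_size - start);
}

int main(void)
{
    balloon_poll();
    return 0;
}
```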

Because inducing page replacement at the guest machine level using ballooning may not act quickly enough to relieve a machine memory shortage, VMware will also resort to random page replacement from guest OS working sets when necessary. In VMware, this is called swapping. VMware swapping is triggered when the amount of free machine memory available for new guest machine allocation requests drops below 4%. Random page replacement is a page replacement policy that can be performed without gathering any information about the age of resident pages, and while it is less effective than an LRU-based approach, simulation studies show its performance can be reasonably good.

VMware’s current level of machine memory contention is encapsulated in a performance counter called Memory State. The Memory State variable is set based on the amount of free memory available. Memory state transitions trigger the reclamation actions reported in Table 1:

State   Value   Free Memory Threshold   Reclamation Action
High    0       ≥ 6%                    None
Soft    1       < 6%                    Ballooning
Hard    2       < 4%                    Swapping to disk or page compression
Low     3       < 2%                    Blocks execution of active VMs above their target allocations

Table 1. The values reported in the ESX Host Memory State performance counter.
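
Since the table is just a set of thresholds on free memory, it reduces to a few comparisons. The enum names in this sketch are invented; only the threshold values and actions come from Table 1.

```c
#include <stdio.h>

/* Memory states from Table 1; enum names are illustrative. */
typedef enum { STATE_HIGH = 0, STATE_SOFT = 1,
               STATE_HARD = 2, STATE_LOW  = 3 } mem_state;

/* Map the percentage of free machine memory to a memory state. */
mem_state memory_state(double free_pct)
{
    if (free_pct >= 6.0) return STATE_HIGH;   /* no reclamation         */
    if (free_pct >= 4.0) return STATE_SOFT;   /* ballooning             */
    if (free_pct >= 2.0) return STATE_HARD;   /* swapping / compression */
    return STATE_LOW;                         /* block VMs over target  */
}

int main(void)
{
    printf("5%% free -> state %d\n", memory_state(5.0));   /* Soft: ballooning */
    return 0;
}
```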

In monitoring the performance of a VMware Host configuration, the Memory State counter is one of the key metrics to track.

In the case study discussed beginning in the next blog entry, a benchmarking workload was executed that generated contention for machine memory on a VMware ESX server. During the benchmark, we observed the memory state transitioning to both the “soft” and “hard” states shown in Table 1, triggering both ballooning and swapping.
