Hyper-V architecture: Dynamic Memory

This is the fourth post in a series on Hyper-V performance. The series begins here.

The hypervisor also contains a Memory Manager component for managing access to the machine’s physical memory, i.e., RAM. For the sake of clarity, when discussing memory management in the Hyper-V environment, I will call RAM machine memory, the Hyper-V host machine’s actual physical memory, to distinguish it from the view of virtualized physical memory granted to each partition. Guest machines never access machine memory directly. Each guest machine is presented with a range of Guest Physical memory addresses (GPA), based on its configuration definitions, that the hypervisor maps to machine memory with a set of page tables that the hypervisor maintains.

Machine memory cannot be shared in the same way that other computer resources like CPUs and disks can be shared. Once memory is in use, it remains 100% occupied until the owner of those memory locations frees it. The hypervisor’s Memory Manager is responsible for distributing machine memory among the root and child partitions. It can partition memory statically, or it can manage the allocation of memory to partitions dynamically. In this section, we will focus on the dynamic memory management capabilities of Hyper-V, an extremely valuable option from the standpoint of capacity planning and provisioning. Dynamic Memory, as the feature is known, enables Hyper-V to host considerably more guest machines, so long as these guest machines are not actively using all the Guest Physical Memory they are eligible to acquire.

The unit of memory management is the page, a fixed-size block of contiguous memory addresses. Windows supports standard 4K pages on Intel hardware and also uses some Large 2 MB pages in specific areas where it is appropriate. Hyper-V supports allocation using both page sizes. Pages of machine memory are either (1) allocated and in use by a child partition or (2) free and available for allocation on demand as needed.

Each guest machine assumes that the physical memory it is assigned is machine memory, and builds its own unique set of Guest Virtual Address (GVA) to Guest Physical Address mappings – its own set of page tables. Both sets of page tables are referenced by the hardware during virtual address translation when a guest machine is running. As mentioned earlier in this series, this hardware capability is known as Second Level Address Translation (SLAT). SLAT hardware makes virtualization much more efficient. Figure 8 illustrates the capability of SLAT hardware to reference both the hypervisor Page Tables that map machine memory and the guest machine’s Page Tables that map Guest Virtual Addresses to Guest Physical addresses during virtual address translation.

Hyper-V virtual memory management

Figure 8. Second Level Address Translation (SLAT) hardware and the tagged TLB are hardware optimizations that improve the performance of virtual machines.

Figure 8 illustrates another key hardware feature called the tagged TLB that was specifically added to the Intel architecture to improve the performance of virtual machines. The Translation Lookaside Buffer (TLB) is a small, dedicated cache internal to the processor core containing recently translated virtual addresses and the corresponding machine memory addresses they map to. In the processor hardware, virtual addresses are translated to machine memory addresses during instruction execution, and TLBs are extremely effective at speeding up that process. With virtualization hardware, each entry in the processor’s TLB is tagged with a virtual machine guest ID, as illustrated, so when the hypervisor Scheduler dispatches a new virtual machine, the TLB entries associated with the previously executing virtual machine can be identified and purged from the table.
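To make the two-stage translation and the tagged TLB concrete, here is a minimal Python sketch of the idea. It is purely conceptual: the single-level dictionaries, page numbers, and guest IDs are illustrative assumptions, not the multi-level hardware page table structures Hyper-V and the processor actually use.

    # Conceptual sketch of SLAT-style two-stage address translation with a tagged TLB.
    # Single-level dict "page tables" stand in for the real multi-level hardware structures.

    PAGE_SIZE = 4096   # standard 4K pages

    class GuestMachine:
        def __init__(self, guest_id, gva_to_gpa):
            self.guest_id = guest_id
            self.gva_to_gpa = gva_to_gpa           # the guest's own page tables (GVA page -> GPA page)

    class Hypervisor:
        def __init__(self):
            self.gpa_to_machine = {}               # per-guest second-level tables (GPA page -> machine page)
            self.tlb = {}                          # tagged TLB: (guest_id, GVA page) -> machine page

        def map_guest_physical(self, guest_id, gpa_page, machine_page):
            self.gpa_to_machine.setdefault(guest_id, {})[gpa_page] = machine_page

        def translate(self, guest, gva):
            page, offset = divmod(gva, PAGE_SIZE)
            key = (guest.guest_id, page)
            if key in self.tlb:                    # TLB hit: both table walks are skipped
                return self.tlb[key] * PAGE_SIZE + offset
            gpa_page = guest.gva_to_gpa[page]                              # first-level walk
            machine_page = self.gpa_to_machine[guest.guest_id][gpa_page]   # second-level walk
            self.tlb[key] = machine_page           # cache the result, tagged with the guest ID
            return machine_page * PAGE_SIZE + offset

        def purge_guest(self, guest_id):
            # Only the departing guest's entries are flushed when a new VM is dispatched.
            self.tlb = {k: v for k, v in self.tlb.items() if k[0] != guest_id}

    hv = Hypervisor()
    vm1 = GuestMachine(guest_id=1, gva_to_gpa={0: 10})
    hv.map_guest_physical(guest_id=1, gpa_page=10, machine_page=777)
    print(hex(hv.translate(vm1, 0x0040)))          # GVA 0x40 -> a machine address in page 777

The benefit of the guest ID tag shows up in purge_guest: only the departing guest machine's cached translations need to be flushed when the hypervisor Scheduler switches virtual machines.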

Memory management for the Root partition is handled a little differently from the child partitions. The Root partition requires access to machine memory addresses and other physical hardware on the motherboard like the APIC to allow the Windows OS running in the Root partition to manage physical devices like the keyboard, mouse, video display, storage peripherals, and the network adaptor. But the Root partition is also a Windows machine that is capable of running Windows applications, so it builds page tables for mapping virtual addresses to physical memory addresses like a native version of the OS. In the case of the Root partition’s page tables, unlike any of the child partitions, physical addresses in the Root partition correspond directly to machine memory addresses. This allows the Root OS to access memory mapped for use by the video card and video driver, for example, as well as the physical memory accessed by other DMA device drivers. In addition, the hypervisor reserves some machine memory locations exclusively for its own use, which is the only machine memory that is off limits to the Root partition.

From a capacity planning perspective, it is important to remember that the Root partition requires some amount of Guest Physical Memory, too. You can see how much physical memory the Root is currently using by looking at the usual OS Memory performance counters.

Dynamic memory.

The Hyper-V hypervisor can adjust machine memory grants to guest machines up or down dynamically, a feature that is called dynamic memory. Dynamic memory refers to adjustments in the size of the Guest Physical address space that the hypervisor grants to a guest machine running inside a child partition. When dynamic memory is configured for a guest machine, Hyper-V can give a partition more physical memory to use or remove guest physical memory from a guest machine that doesn’t require it, ignoring for a moment the relative memory priority of the guests. With the dynamic memory feature of Hyper-V, you can pack significantly more virtual machines into the memory footprint of the VM host machine, although you still must be careful not to pack in too many guest machines and create a memory bottleneck that can impact all the guest machines that are resident on the Hyper-V Host.

When dynamic memory is enabled for a child partition, you set minimum and maximum Guest Physical memory values and allow Hyper-V to make adjustments based on actual physical memory usage. Figure 9 charts the amount of physical memory that is visible to one of the child partitions in a test scenario that I will be discussing. Starting around 5:45 pm, Hyper-V boosted the amount of Guest Physical Memory visible to this guest machine from approximately 4 GB to 8 GB, which was its maximum dynamic memory allotment. Figure 9 also shows two additional Hyper-V dynamic memory metrics, the cumulative number of Add and Remove memory operations that Hyper-V performed. Judging from the shape of the Add Memory Operations line graph, the measurement counts operations, not pages, so it is not a particularly useful measurement. In fact, once Dynamic Memory is enabled for one or more guest machines, memory adjustment operations occur on a more or less continuous basis. But the rate at which Memory Add and Remove operations occur provides no insight into the magnitude of those adjustments, which is reflected instead in the changes to the amount of Guest Physical Memory that is visible to the partition.

Hyper-V guest machine visible physical memory

Figure 9. Guest machines configured to use Dynamic Memory are subject to adjustments in the amount of Guest Physical Memory that is visible to them. Hyper-V adjusted the amount of physical memory visible to the TEST5 guest machine upwards from 4 GB to 8 GB beginning around 5:45 PM. 8 GB was the maximum amount of Dynamic Memory that was configured for the partition. This adjustment upwards occurred after two of the other guest machines being hosted were shut down, freeing up the memory they had allocated.

 

The hypervisor makes decisions to add or remove physical memory from a guest machine based on measurements of how much virtual and physical memory the guest OS is actually using. A guest OS enlightenment reports the number of committed bytes, in effect, the number of virtual and physical memory pages that the Windows guest has constructed Page Table entries (PTEs) to address. Each guest machine’s committed bytes is then compared to how much physical memory it currently has available, a metric Hyper-V calls Memory Pressure, calculated using the formula:

Memory Pressure = (Guest machine Committed Memory / Visible Guest Physical Memory) * 100

Memory Pressure is the ratio of the virtual and physical memory allocated by the guest machine to the amount of physical memory currently allotted to the guest, expressed as a percentage. For example, any guest machine with a Memory Pressure value less than 100 has allocated fewer pages of virtual and physical memory than its current physical memory allotment. Guest machines with a Memory Pressure measure greater than 100 have allocated more virtual memory than their current physical memory allotment and are at risk for higher demand paging rates, assuming all the allocated virtual memory is active.

Figure 10 reports the Current value of Memory Pressure for the guest machine shown in Figure 9 over an 8-hour period that includes the measurement interval used in the earlier chart. Just prior to Hyper-V’s Add Memory operation that increased the amount of Guest Physical Memory visible to the partition from 4 GB to 8 GB, the Memory Pressure was steady at 150. Working backwards from the measurements reported in Figure 9 that showed 4 GB of Guest Physical Memory visible to the guest and the corresponding Memory Pressure values shown in Figure 10, you can calculate the number of guest machine Committed Bytes:

   Committed Bytes = (Memory Pressure / 100) * Guest Physical Memory

So, combining Figures 9 and 10, a Memory Pressure reading of 150 up until 5:45 pm means the guest machine had approximately 6 GB of committed bytes, a situation that left the guest machine severely memory constrained.
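Applying the two formulas above to the numbers reported in Figures 9 and 10, a short Python sketch shows how Memory Pressure is derived and how Committed Bytes can be back-calculated from it (the 4 GB, 8 GB, and 150 values are the ones from the example; the rest is just the arithmetic):

    GB = 1024 ** 3

    def memory_pressure(committed_bytes, visible_physical_bytes):
        # Memory Pressure = (guest Committed Memory / visible Guest Physical Memory) * 100
        return committed_bytes / visible_physical_bytes * 100

    def committed_from_pressure(pressure, visible_physical_bytes):
        # Committed Bytes = (Memory Pressure / 100) * Guest Physical Memory
        return pressure / 100 * visible_physical_bytes

    # Before the Add Memory operation: 4 GB visible, Memory Pressure steady at 150.
    print(committed_from_pressure(150, 4 * GB) / GB)    # -> 6.0 GB of committed bytes
    # After the grant grows to 8 GB, the same 6 GB of committed bytes yields:
    print(memory_pressure(6 * GB, 8 * GB))              # -> 75.0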

Memory pressure looking at a single vm

Figure 10. The Memory Pressure indicator is the ratio of guest Committed Bytes to visible Physical Memory, multiplied by 100. Prior to Hyper-V increasing the amount of Physical Memory visible to the guest from 4 GB to 8 GB around 5:45 PM, the Memory Pressure for the guest was 150.

Memory Pressure, then, corresponds to the ratio of virtual to physical memory that was discussed in the earlier chapter on Windows memory management where I recommended using it as a memory contention index for memory capacity planning. This is precisely the way Memory Pressure is used in Hyper-V. The Memory Pressure values calculated for the guest machines subject to dynamic memory management are then used to determine how to adjust the amount of Guest Physical memory granted to those guest machines.
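Hyper-V does not document the balancing algorithm in detail, but the general idea of using per-guest Memory Pressure to drive add and remove decisions can be sketched as follows. This is a hypothetical simplification, assuming an illustrative target pressure of 80 and ignoring relative guest memory priority; it is not Hyper-V's actual logic.

    # Hypothetical balancer sketch: per-guest Memory Pressure drives add/remove decisions.
    # The target pressure of 80 is an illustrative assumption, not a documented Hyper-V value.

    def target_visible(committed, target_pressure=80):
        # Pick a grant that would bring Memory Pressure down to the target value.
        return committed * 100 / target_pressure

    def rebalance(guests, free_machine_memory):
        # guests: list of dicts with 'committed', 'visible', 'minimum', 'maximum' (all in bytes)
        for g in guests:
            pressure = g['committed'] / g['visible'] * 100
            desired = min(max(target_visible(g['committed']), g['minimum']), g['maximum'])
            delta = desired - g['visible']
            if delta > 0:                          # add memory, limited by what is free
                grant = min(delta, free_machine_memory)
                g['visible'] += grant
                free_machine_memory -= grant
            elif delta < 0 and pressure < 100:     # remove memory the guest is not using
                g['visible'] += delta              # delta is negative
                free_machine_memory -= delta
        return free_machine_memory

    GB = 1024 ** 3
    guest = {'committed': 6 * GB, 'visible': 4 * GB, 'minimum': 1 * GB, 'maximum': 8 * GB}
    free = rebalance([guest], free_machine_memory=4 * GB)
    print(guest['visible'] / GB, free / GB)        # -> 7.5 0.5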

It is also important to remember that the memory contention index is not always a foolproof indicator of a physical memory constraint in Windows. Committed Bytes as an indicator of demand for physical memory can be misleading. Windows applications like SQL Server and the Exchange Server store.exe process will allocate as much virtual memory to use as data buffers as possible, up to the limit of the amount of physical memory that is available. SQL Server then listens for low and high memory notifications issued by the Windows Memory Manager to tell it when it is safe to acquire more buffers or when it needs to release older, less active ones. The .NET Framework functions similarly. In a managed Windows application, the CLR listens for low memory notifications and triggers a garbage collection run to free used virtual memory in any of the managed Heaps when the low memory signal is received. What makes Hyper-V dynamic memory interesting is that these process-level dynamic virtual memory management adjustments that SQL Server and .NET Framework applications use continue to operate when Hyper-V adds or removes memory from the guest machine.

Memory Buffer

Another dynamic memory option pertains to the size of the buffer of free machine memory pages that Hyper-V maintains on behalf of each guest in order to speed up memory allocation requests. By default, the hypervisor maintains a memory buffer for each guest machine subject to dynamic memory that is 20% of the size of its grant of physical guest memory, but that parameter can be adjusted upwards or downwards for each guest machine. On the face of it, the Memory Buffer option looks useful for reducing the performance risks associated with under-provisioned guest machines. So long as the Memory Buffer remains stocked with free machine memory pages, operations to add physical memory to a child partition can be satisfied quickly.

Hyper-V available memory chart

Figure 14. Hyper-V maintains Available Memory to help speed memory allocation requests.

A single performance counter is available that tracks overall Available Memory, which is compared to the average Memory Pressure in Figure 14. Intuitively, maintaining some excess capacity in the hypervisor’s machine memory buffer seems like a good idea. When the Available Memory buffer is depleted, in order for the Hyper-V Dynamic Memory Balancer to add memory to a guest machine, it first has to remove memory from another guest machine, an operation which does not take effect immediately. Generating an alert based on Available Memory falling below a threshold value of 3-5% of total machine memory is certainly appropriate. Unfortunately, Hyper-V does not provide much information or feedback to help you make adjustments to the tuning parameter and understand how effective the Memory Buffer is.
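An alerting rule along those lines is simple to script. The sketch below assumes you have already collected the Available Memory counter value and the host's total machine memory by some means; the 5% threshold is the one suggested above.

    def available_memory_alert(available_mb, total_machine_memory_mb, threshold_pct=5):
        # Warn when Hyper-V Available Memory drops below roughly 3-5% of machine memory.
        floor_mb = total_machine_memory_mb * threshold_pct / 100
        if available_mb < floor_mb:
            return (f"WARNING: Available Memory {available_mb} MB is below "
                    f"{threshold_pct}% of machine memory ({floor_mb:.0f} MB)")
        return None

    # Example: a 64 GB host with only 2 GB left in the hypervisor's buffer of free pages.
    print(available_memory_alert(available_mb=2048, total_machine_memory_mb=64 * 1024))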


Hyper-V architecture: Intercepts, Interrupts and Hypercalls.

This is the third post in a series on Hyper-V performance. The series begins here.

Three interfaces exist that allow for interaction and communication between the hypervisor, the Root partition and the guest partitions: intercepts, interrupts, and the direct Hypercall interface. These interfaces are necessary for the virtualization scheme to function properly, and their usage accounts for much of the overhead virtualization adds to the system. Hyper-V measures and reports on the rate these different interfaces are used, which is, of course, workload dependent. Frankly, the measurements that show the rate that the hypervisor processes interrupts and Hypercalls are seldom of interest to anyone outside the Microsoft developers working on Hyper-V performance itself. But these measurements do provide insight into the Hyper-V architecture and can help us understand how the performance of the applications running on guest machines is impacted by virtualization. Figure 3 is a graph showing these three major sources of virtualization overhead in Hyper-V.

Hyper-V overheads

Figure 3. Using the Hyper-V performance counters, you can monitor the rate that intercepts, virtual interrupts, and Hypercalls are handled by the hypervisor and various Hyper-V components.

Intercepts.

Intercepts are the primary mechanism used to maintain a consistent view of the virtual processor that is visible to the guest OS. Privileged instructions and other operations issued by the guest operating system that are valid when the OS is accessing the native hardware must be intercepted by the hypervisor and handled in a way that preserves a consistent view of the virtual machine. Intercepts rely on another hardware assist: the virtualization hardware allows the hypervisor to trap these operations when a guest executes them. Intercepted operations include the guest machine OS

  • issuing a CPUID instruction to identify the hardware characteristics
  • accessing machine-specific registers (MSRs)
  • accessing I/O ports directly
  • executing instructions that cause hardware exceptions that must be handled by the OS

When these guest machine operations are detected by the hardware, control is immediately transferred to the hypervisor to resolve. For example, if the guest OS believes it is running on a 2-way machine and issues a CPUID instruction, Hyper-V intercepts that instruction and, through the intercept mechanism, supplies a response that is consistent with the virtual machine image. Similarly, whenever a guest OS issues an instruction to read or update a Control Register (CR) or a Machine-Specific Register (MSR) value, this operation is intercepted, and control is transferred to the parent partition where the behavior the guest OS expects is simulated.

Resolving intercepts in Hyper-V is a cooperative process that involves the Root partition. When a virtual machine starts, the Root partition makes a series of Hypercalls that establish the intercepts it will handle, providing a call back address that the hypervisor uses to signal the Root partition when that particular interception occurs. Based on the virtual machine state maintained in the VM worker process, the Root partition simulates the requested operation and then allows the intercepted instruction to complete its execution.
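The register-a-callback-and-simulate flow described above can be sketched in a few lines of Python. This is purely conceptual: the intercept names, the handler signature, and the guest state dictionary are illustrative assumptions, not the actual Hypercall interface or VM worker process data structures.

    # Conceptual sketch of intercept registration and dispatch -- not the real Hypercall API.

    class Hypervisor:
        def __init__(self):
            self.intercept_handlers = {}           # intercept type -> Root partition callback

        def hv_register_intercept(self, intercept_type, callback):
            # Issued by the Root partition at guest start-up via a series of Hypercalls.
            self.intercept_handlers[intercept_type] = callback

        def on_intercept(self, intercept_type, guest_state):
            # The hardware transfers control here when the guest executes a trapped operation.
            result = self.intercept_handlers[intercept_type](guest_state)
            guest_state['registers'].update(result)   # the guest resumes with a consistent answer
            return result

    def cpuid_handler(guest_state):
        # The Root's VM worker process answers consistently with the *virtual* machine:
        # a 2-way guest sees 2 logical processors regardless of the physical host.
        return {'eax': guest_state['virtual_processor_count']}

    hv = Hypervisor()
    hv.hv_register_intercept('CPUID', cpuid_handler)
    guest = {'virtual_processor_count': 2, 'registers': {}}
    hv.on_intercept('CPUID', guest)
    print(guest['registers'])                      # {'eax': 2}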

Hyper-V is instrumented to report the rate that several categories of intercepts occur. Some intercepts occur infrequently, like issuing CPUID instructions, something the OS needs to do rarely. Others like Machine-Specific Register access are apt to occur more frequently, as illustrated in Figure 4, which compares the rate of MSR accesses to the overall intercept rate, summed over all virtual processors for a Hyper-V host machine.

MSR access intercepts per second graph

Figure 4. The rate MSR intercepts are processed, compared to the overall intercept rate (indicated by an overlay line graphed against the secondary, right-hand y-axis).

In order to perform its interception functions, the Root partition’s VM worker process maintains a record of the virtual machine state. This includes keeping track of the virtual machine’s registers each time there is an interrupt, plus maintaining a virtual APIC for interrupt handling, as well as additional virtual hardware interfaces, what some authors describe as a “virtual motherboard” of devices representing the full simulated guest machine hardware environment.

Interrupts.

Guest machines accessing the synthetic disk and network devices that are installed are presented with a virtualized interrupt handling mechanism. Compared to native IO, this virtualized interrupt process adds latency to guest machine disk and network IO requests to synthetic devices. Latency increases because device interrupts need to be processed twice, once in the Root partition, and again in the guest machine. Latency also increases when interrupt processing at the guest machine level is deferred because none of the virtual processors associated with the guest are currently dispatched.

To support guest machine interrupts, Hyper-V builds and continuously maintains a synthetic interrupt controller associated with the guest’s virtual processors. When an external interrupt generated by a hardware device attached to the Host machine occurs because the device has completed a data transfer operation, the interrupt is directed to the Root partition to process. If the device interrupt is found to be associated with a request that originated from a guest machine, the guest’s synthetic interrupt controller is updated to reflect the interrupt status, which triggers action inside the guest machine to respond to the interrupt request. The device drivers loaded on the guest machine are suitably “enlightened” to skip execution of as much redundant logic as possible during this two-phased process.

The first phase of interrupt processing occurs inside the Root partition. When a physical device raises an interrupt that is destined for a guest machine, the Root partition handles the interrupt in the Interrupt Service Routine (ISR) associated with the device immediately in the normal fashion. When the device interrupt is in response to a disk or network IO request from a guest machine, there is a second phase of interrupt processing that occurs associated with the guest partition. The second phase, which is required because the guest machine also must handle the interrupt, increases the latency of every IO interrupt that is not processed directly by the child partition.

An additional complication arises if none of the guest machine’s virtual processors are currently dispatched. If no guest machine virtual processor is executing, then interrupt processing on the guest is deferred until one of its virtual processors is executing. In the meantime, the interrupt is flagged as pending in the state machine maintained by the Root partition. The amount of time that device interrupts are pending also increases the latency associated with synthetic disk and network IO requests initiated by the guest machine.
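The effect of deferred interrupt processing on latency can be modeled with a toy sketch like the one below, which simply timestamps interrupts as they become pending and measures how long they sat before a virtual processor was available to process them. It is an illustration of the bookkeeping, not Hyper-V's synthetic interrupt controller.

    # Toy model of deferred synthetic interrupt delivery -- illustrative bookkeeping only.
    import time

    class SyntheticInterruptController:
        def __init__(self):
            self.pending = []                      # (interrupt, time it was flagged as pending)

        def assert_interrupt(self, interrupt):
            # Phase 1 is finished in the Root; flag the interrupt as pending for the guest.
            self.pending.append((interrupt, time.monotonic()))

        def deliver_pending(self):
            # Called when one of the guest's virtual processors is finally dispatched.
            delivered = [(i, time.monotonic() - flagged) for i, flagged in self.pending]
            self.pending.clear()
            return delivered

    sic = SyntheticInterruptController()
    sic.assert_interrupt('vmbus: disk IO complete')
    time.sleep(0.002)                              # guest VPs not scheduled for ~2 ms
    for interrupt, latency in sic.deliver_pending():
        print(f"{interrupt}: pending for {latency * 1000:.1f} ms before guest processing")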

The increased latency associated with synthetic device interrupt-handling can have a very serious performance impact. It can present a significant obstacle to running disk or network IO-bound workloads as guest machines. The problem is compounded because the added delay and its impact on an application is difficult to quantify. The Logical Disk and Physical Disk\Avg. Disk secs/Transfer counters on the Root partition are not always reliably capable of measuring the disk latency associated with the first phase of interrupt processing because Root partition virtual processors are also subject to deferred interrupt processing and virtualized clocks and timers. The corresponding guest machine Logical Disk and Physical Disk\Avg. Disk secs/Transfer counters are similarly burdened. Unfortunately, careful analysis of the data leaves it unclear whether any of the Windows disk response time measurements are valid under Hyper-V, even for disk devices that are natively attached to the guest partition.

The TCP/IP networking stack, as we have seen in our earlier look at NUMA architectures, has a well-deserved reputation for requiring execution of a significant number of CPU instructions to process network IO. Consequently, guest machines that handle a large amount of network traffic are subject to this performance impact when running virtualized. The guest machine synthetic network driver enlightenment helps considerably with this problem, as do NICs featuring TCP offload capabilities. Network devices that can be attached to the guest machine in native mode are particularly effective performance options in such cases.

In general, over-provisioning processor resources on the VM Host is an effective mitigation strategy to limit the amount and duration of deferred interrupt processing delays that occur for both disk and network IO. Disk and network hardware that can be directly attached to the guest machine is certainly another good alternative. Interrupt processing for disk and network hardware that is directly attached to the guest is a simpler, one-phase process, but one that is also subject to pending interrupts whenever the guest’s virtual processors are themselves delayed. The additional latency associated with disk and network IO is one of the best reasons to run a Windows machine in native mode.

VMBus

Guest machine interrupt handling relies on an inter-partition communications channel called the VMBus, which makes use of the Hypercall capability that allows one partition to signal another partition and send messages. (Note that since child partitions have no knowledge of other child partitions, this Hypercall signaling capability is effectively limited to use by the child partition and its parent, the Root partition.) Figure 5 illustrates the path taken when a child partition initiates a disk or network IO to a synthetic disk or network device installed in the guest machine OS. IOs to synthetic devices are processed by the guest machine device driver, which is enlightened, as discussed above. The synthetic device driver passes the IO request to another Hyper-V component installed inside the guest called a Virtualization Service Client (VSC). The VSC inside the guest machine translates the IO request into a message that is put on the VMBus.

The VMBus is the mechanism used for passing messages between a child partition and its parent, the Root partition. Its main function is to provide a high bandwidth, low latency path for the guest machine to issue IO requests and receive replies. According to Mark Russinovich, writing in Windows Internals, one message-passing protocol the VMBus uses is a ring of buffers that is shared by the child and parent partitions: “essentially an area of memory in which a certain amount of data is loaded on one side and unloaded on the other side.” Russinovich’s book continues, “No memory needs to be allocated or freed because the buffer is continuously reused and simply rotated.” This mechanism is good for message passing between the partitions, but is too slow for large data transfers due to the necessity of copying data to and from the message buffers.
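A bare-bones version of the kind of shared ring Russinovich describes looks something like the following sketch. It is a generic single-producer, single-consumer ring buffer, assuming ordinary Python objects as messages; the real VMBus rings live in shared memory and have their own layout and signaling protocol.

    # Generic single-producer/single-consumer ring buffer sketch -- not the VMBus layout.

    class RingBuffer:
        def __init__(self, size):
            self.buf = [None] * size
            self.size = size
            self.head = 0                          # next slot the producer (the child's VSC) fills
            self.tail = 0                          # next slot the consumer (the Root's VSP) drains

        def put(self, message):
            if (self.head + 1) % self.size == self.tail:
                return False                       # ring full; the sender must wait
            self.buf[self.head] = message
            self.head = (self.head + 1) % self.size
            return True

        def get(self):
            if self.tail == self.head:
                return None                        # ring empty
            message = self.buf[self.tail]
            self.buf[self.tail] = None             # the slot is simply reused, never freed
            self.tail = (self.tail + 1) % self.size
            return message

    ring = RingBuffer(size=8)
    ring.put({'op': 'read', 'vhdx_offset': 0x10000, 'length': 4096})
    print(ring.get())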

Another VMBus messaging protocol uses child memory that is mapped directly to the parent partition address space. This direct memory access VMBus mechanism allows disk and network devices managed by the Root partition to reference buffers allocated in a child partition. This is the technique Hyper-V uses to perform bulk data IO operations for synthetic disk and network devices. For the purpose of issuing IO requests to native devices, the Root partition is allowed to access machine memory addresses directly. In addition, it can request the hypervisor to translate guest machine virtual addresses allocated for use as VMBus IO buffers into machine addresses that can be referenced by the physical devices supporting DMA that are attached to the Root.

Inside the Root partition, Hyper-V components known as Virtualization Service Providers (VSPs) receive the synthetic device IO requests from the guest machines and translate them into physical disk and network IO requests. Consider, for example, a guest partition request to read or write a .vhdx file that the VSP must translate into a disk IO request to the native file system on the Root. These translated requests are then passed to the native disk IO driver or the networking stack installed inside the Root partition that manages the physical devices. The VSPs also interface with the VM worker process that is responsible for the state machine that represents the virtualized physical hardware presented to the guest OS. Using this mechanism, interrupts for guest machine synthetic devices can be delivered properly to the appropriate guest machine.

When the native device completes the IO operation requested, it raises an interrupt that the Root partition handles normally. This process is depicted in Figure 5. When the request corresponds to one issued by a guest machine, what is different under Hyper-V is that a waiting thread provided by the VSP and associated with that native device is then awakened by the device driver. The VSP also ensures that the device response adheres to the form that the synthetic device driver on the guest machine expects. It then uses the VMBus inter-partition messaging mechanism to signal the guest machine that it has an interrupt pending.

HyperV interrupt processing

Figure 5. Synthetic interrupt processing involves the Virtualization Service Provider (VSP) associated with the device driver invoked to process the interrupt. Data acquired from the device is transferred directly into guest machine memory using a VMBus communication mechanism, where it is processed by the Virtualization Service Client (VSC) associated with the synthetic device.

From a performance monitoring perspective, the Hyper-V hypervisor reports on the overall rate of virtual interrupt processing, as illustrated in Figure 6. The hypervisor, however, has no understanding of what hardware device is associated with each virtual interrupt. It can report the number of deferred virtual interrupts, but it does not report the amount of pending interrupt delay, which can be considerable. The measurement components associated with disk and network IO in the Root partition function normally, with the caveat that the disk and network IO requests counted by the Root partition aggregate all the requests from both the Root and child partitions. Windows performance counters inside the guest machine continue to provide an accurate count of disk and network IO and the number of bytes transferred for that partition. The guest machine counters are useful for identifying which guest partitions are responsible for the overload when the Root disks or network interface cards are saturated. Later on, we will review some examples that illustrate how all these performance counters function under Hyper-V.

Hyper-V virtual interrupts chart

Figure 6. Virtual interrupt processing per guest machine virtual processor. The rate of pending interrupts is displayed as a dotted line plotted against the secondary y-axis. In this example, approximately half of all virtual interrupts are subject to deferred interrupt processing delays.

 

Hypercalls.

The Hypercall interface provides a calling mechanism that allows child partitions to communicate with the Root partition and the hypervisor. Some of the Hypercalls support the guest OS enlightenments mentioned earlier. Others are used by the Root partition to communicate requests to the hypervisor to configure, start, modify, and stop child partitions. There is another set of Hypercalls used in dynamic memory management, which is discussed in the post on Dynamic Memory. Hypercalls are also defined to enable the hypervisor to log events and post performance counter data back to the Root partition, where it can be gathered by Perfmon and other similar tools.

Hypercalls per second graph

Figure 7. Monitoring the rate Hypercalls are being processed.


Hyper-V architecture

This is the second post in a series on Hyper-V performance. The series begins here.

Hyper-V installs a hypervisor that gains control of the hardware immediately following a boot, and works in tandem with additional components that are installed in the Root partition. Hyper-V even installs a few components into Windows guest machines (e.g., synthetic device drivers, enlightenments). This section looks at how all these architectural components work together to provide a complete set of virtualization services to guest machines.

Hyper-V uses a compact hypervisor component that installs at boot time. There is one version for Intel machines, Hvix64.exe, and a separate version for the slightly different AMD hardware, Hvax64.exe. The hypervisor is responsible for scheduling the processor hardware and apportioning physical memory to the various guest machines. It also provides a centralized timekeeping facility so that all guest machines can access a common clock. Except for the specific enlightenments built into Windows guest machines, the hypervisor functions in a manner that is transparent to a guest machine, which allows Hyper-V to host guest machines running an OS other than Windows.

After the hypervisor initializes, the Root partition is booted. The Root partition performs virtualization functions that do not require the higher privilege level associated with the hardware virtualization interface. The Root partition owns the file system, for instance, of which the hypervisor has no knowledge, let alone a way to access it, since it lacks the system software to perform file IO. The Root creates and runs a VM worker process (vmwp.exe) for each guest machine. Each VM worker process keeps track of the current state of a child partition. The Root partition is also responsible for the operation of native IO devices, which includes disks and network interfaces, along with the device driver software that is used to access those hardware peripherals. The Windows OS running in the Root partition also continues to handle some of the machine management functions, like power management.

The distribution of machine management functions across the Hyper-V hypervisor and its Root partition makes for an interesting, hybrid architecture. A key advantage of this approach from a software development perspective is that Microsoft already had a server OS and saw no reason to reinvent the wheel when it developed Hyper-V. Because the virtualization Host uses the device driver software installed on the Windows OS running inside the Root partition to connect to peripheral devices for disk access and network communication, Hyper-V immediately gained support for an extensive set of devices without ever needing to coax 3rd party developers to support the platform. Hyper-V also benefits from all the other tools and utilities that run on Windows Server: PowerShell, network monitoring, and the complete suite of Administrative tools. The Root partition uses Hypercalls to pull Hyper-V events and logs them to its Event log.

Hyper-V performance monitoring also relies on cooperation between the hypervisor and the Root partition to work. The hypervisor maintains several sets of performance counters that keep track of guest machine dispatching, virtual processor scheduling, machine memory management, and other hypervisor facilities. A performance monitor running inside the Root partition can pull these Hyper-V performance counters by accessing a Hyper-V perflib that gathers counter values from the hypervisor using Hypercalls.

In Hyper-V, the Root Partition is used to administer the Hyper-V environment, which includes defining the guest machines, as well as starting and stopping them. Microsoft recommends that you dedicate the use of the Hyper-V Root partition to Hyper-V administration, and normally you would install Windows Server Core to run Hyper-V. Keeping the OS footprint small is especially important if you ever need to patch the Hyper-V Root and re-boot it, an operation that would also force you either to shut down and restart all the resident guest machines, or migrate them, which is the preferred way to do reconfiguration and maintenance. While there are no actual restrictions that prevent you from running any other Windows software that you would like inside the Root partition, it is not a good practice. In Hyper-V, the Root partition provides essential runtime services for the guest machines, and running any other applications on the Root could conflict with the timely delivery of those services.

The Root partition is also known as the parent partition, because it serves as the administrative parent of any child guest machines that you define and run on the virtualization Host machine. A Hyper-V Host machine with a standard Windows Server OS Root partition and with a single Windows guest machine running in a child partition, along with the major Hyper-V components installed in each partition, is depicted in Figure 1.

HyperV architecture

Figure 1. Major components of the Hyper-V architecture.

Figure 1 shows the Hyper-V hypervisor installed and running at the very lowest (and highest priority) level of the machine. Its role is limited to the management of the guest machines, so, by design, it has a very limited range of functions and capabilities. It serves as the Scheduler for managing the machine’s logical processors, which includes dispatching virtual machines to execute on what are called virtual processors. It also manages the creation and deletion of the partitions where guest machines execute. The hypervisor also controls and manages machine memory (RAM). The Hypercall interface is used for communication with the Root partition and any active child partitions.

Processor scheduling

Virtual machines are scheduled to execute on virtual processors, which is the scheduling mechanism used by the hypervisor to run the guest machines while maintaining the integrity of the hosting environment. Each child partition can be assigned one or more virtual processors, up to the maximum number of logical processors (i.e., physical processors with Intel’s HyperThreading enabled) present in the hardware. Tallying up all the guest machines, more virtual processors are typically defined than the number of logical processors that are physically available in the hardware, so the hypervisor maintains a Ready Queue of dispatchable guest machine virtual processors analogous to the Ready Queue maintained by the OS for threads that are ready to execute. If the guest OS running in the child partition supports the feature, Hyper-V can even add virtual processors to the child partition while the guest machine is running. (The Windows Server OS supports adding processors to the machine dynamically without having to reboot.)

By default, virtual processors are scheduled to run on actual logical processors using a round robin policy that balances the CPU load across virtual machines evenly. Processor scheduling decisions made by the hypervisor are visible to the Root partition via a hypervisor call, which allows the Root to keep the state machines it maintains for each individual guest current.

Similar to the Windows OS Scheduler, the hypervisor Scheduler implements soft processor affinity, where an attempt is made to schedule a virtual processor on the same logical processor where it executed last. The hypervisor’s guest machine scheduling algorithm is also aware of the NUMA topology of the underlying hardware, and attempts to assign all the virtual processors for a guest machine to execute on the same NUMA node, where possible. NUMA considerations loom large in virtualization technology because most data center hardware has NUMA performance characteristics. If the underlying server hardware has NUMA performance characteristics, best practice is to (1) assign guest machines no more virtual processors than the number of logical processors that are available on a single NUMA node, and then (2) rely on the NUMA support in Hyper-V to keep that machine confined to one NUMA node at a time.

Because the underlying processor hardware from Intel and AMD supports various sets of specialized, extended instructions, virtual processors can be normalized in Hyper-V, a setting which allows for live migration across different Hyper-V server machines. Intel and AMD hardware are different enough, in fact, that you cannot live migrate guest machines running on one manufacturer’s equipment to the other manufacturer’s equipment. (See Ben Armstrong’s virtualization blog for details.) Normalization occurs when you select the guest machine’s compatibility mode setting on the virtual processor configuration panel. The effect of this normalization is that Hyper-V will present a virtual processor to the guest that might exclude some of the hardware-specific instruction sets that might exist on the actual machine. If it turns out that the guest OS specifically relies on some of these hardware extensions – many have to do with improving the execution of super-scalar computation, audio and video streaming, or 3D rendering – then you are presented with an interesting choice to make between higher availability and higher performance.

Processor performance monitoring

When Hyper-V is running and controlling the use of the machine’s CPU resources, you must access the counters associated with the Hyper-V Hypervisor to monitor processor utilization by the hypervisor, the Root partition, and the guest machines running in child partitions. These can be gathered by running a performance monitoring application on the Root partition. The Hyper-V processor usage statistics need to be used instead of the usual Processor and Processor Information performance counters available at the OS level.

There are three sets of Hyper-V Hypervisor processor utilization counters and the key to using them properly is to understand what entity each instance of the counter set represents. The three Hyper-V processor performance objects include the following:

  • Hyper-V Hypervisor Logical Processor

There is one instance of HVH Logical Processor counters available for each hardware logical processor that is present on the machine. The instances are identified as VP 0, VP 1, …, VP n-1, where n is the number of Logical Processors available at the hardware level. The counter set is similar to the Processor Information set available at the OS level (which also contains an instance for each Logical Processor). The main metrics of interest are % Total Run Time and % Guest Run Time. In addition, the % Hypervisor Run Time counter records the amount of CPU time, per Logical processor, consumed by the hypervisor.

  • Hyper-V Hypervisor Virtual Processor

There is one instance of the HVH Virtual Processor counter set for each child partition Virtual Processor that is configured. The guest machine Virtual Processor is the abstraction used in Hyper-V dispatching. The Virtual Processor instances are identified using the format guestname: Hv VP 0, guestname: Hv VP 1, etc., up to the number of Virtual Processors defined for each partition. The % Total Run Time and the % Guest Run Time counters are the most important measurements available at the guest machine Virtual Processor level.

CPU Wait Time Per Dispatch is another potentially useful measurement, indicating the amount of time a guest machine Virtual Processor is delayed in the hypervisor Dispatching queue, which is comparable to the OS Scheduler’s Ready Queue. Unfortunately, it is not clear how to interpret this measurement. Not only are the units undefined – although 100 nanosecond timer units are plausible – but the counter reports values that are inexplicably discrete.

The counter set also includes a large number of counters that reflect the guest machine’s use of various Hyper-V virtualization services, including the rate that various intercepts, interrupts and Hypercalls are being processed for the guest machine. (These virtualization services are discussed in more detail in the next section.) These are extremely interesting counters, but be warned they are often of little use in diagnosing the bulk of capacity-related performance problems where either (1) a guest machine is under-provisioned with respect to access to the machine’s Logical Processors or (2) the Hyper-V Host processor capacity is severely over-committed. This set of counters is useful in the context of understanding what Hyper-V services the guest machine consumes, especially if over-use of these services leads to degraded performance. Another context where they are useful is understanding the virtualization impact on a guest machine that is not running any OS enlightenments.

  • Hyper-V Hypervisor Root Virtual Processor

The HVH Root Virtual Processor counter set is identical to the metrics reported at the guest machine Virtual Processor level. Hyper-V automatically configures a Virtual Processor instance for each Logical Processor for use by the Root partition. The instances of these counters are identified as Root VP 0, Root VP 1, etc., up to VP n-1.

Table 1. Hyper-V Processor usage measurements are organized into distinct counter sets for child partitions, the Root partition, and the hypervisor.

Counter Group: HVH Logical Processor
# of instances: # of hardware logical processors (VP 0, VP 1, etc.)
Key counters: % Total Run Time, % Guest Run Time

Counter Group: HVH Virtual Processor
# of instances: # of guest machines * virtual CPUs (guestname: Hv VP 0, etc.)
Key counters: % Total Run Time, % Guest Run Time, CPU Wait Time Per Dispatch, Hypercalls/sec, Total Intercepts/sec, Pending Interrupts/sec

Counter Group: HVH Root Virtual Processor
# of instances: # of hardware logical processors (Root VP 0, Root VP 1, etc.)
Key counters: % Total Run Time
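One simple way to collect the counters listed in Table 1 from the Root partition is to shell out to the standard typeperf utility with counter paths built from those object names. The sketch below assumes it is run on the Windows Root partition with the Hyper-V role installed; instance names (VP 0, guestname: Hv VP 0, Root VP 0, and so on) vary from host to host, so the wildcard form is used.

    # Sample the key Hyper-V processor counters from the Root partition via typeperf.
    import subprocess

    COUNTERS = [
        r"\Hyper-V Hypervisor Logical Processor(*)\% Total Run Time",
        r"\Hyper-V Hypervisor Logical Processor(*)\% Guest Run Time",
        r"\Hyper-V Hypervisor Virtual Processor(*)\% Total Run Time",
        r"\Hyper-V Hypervisor Root Virtual Processor(*)\% Total Run Time",
    ]

    def sample(counters, samples=5, interval=15):
        # typeperf emits CSV: a header row of instance names, then one row per interval.
        cmd = ["typeperf"] + counters + ["-sc", str(samples), "-si", str(interval)]
        return subprocess.run(cmd, capture_output=True, text=True).stdout

    if __name__ == "__main__":
        print(sample(COUNTERS))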

Scheduling Priority.

The Hyper-V hypervisor, which is responsible for guest machine scheduling, by default implements a simple, round-robin policy in which any virtual processor in the execution state is equally likely to be dispatched. Defining and running more virtual processors than logical processors creates the possibility of dispatching queuing delays, but so long as the CPU capacity of the hardware is adequate, these dispatching delays are normally minimal because many virtual processors are in the idle state, and over-commitment of the physical processors often does not impact performance greatly. On the other hand, once these dispatching delays start to become significant, the priority scheduling options that Hyper-V provides can become very useful.

In addition to the basic round-robin scheme, the hypervisor also factors processor affinity and the NUMA topology into scheduling decisions, considerations that add asymmetric constraints to processor scheduling. Note that the guest machine’s current CPU requirements are not directly visible to the Hyper-V hypervisor, so its scheduling function does not know when there is a big backlog of ready threads delayed inside the guest. With the single exception of an OS enlightenment in Windows guests to notify Hyper-V via a Hypercall that it is entering a long spinlock, Hyper-V has no knowledge of guest machine behavior that it could use to make better informed scheduling decisions. Nor does the fact that a virtual processor has one or more interrupts pending improve the dispatching position of a guest machine that is currently idle. (In other words, Hyper-V’s scheduling algorithm does not attempt to implement any form of the Mean Time to Wait (MTTW) algorithm, which improves throughput in the general case by favoring the dispatching of guest machines that spend proportionally more time waiting on service from external devices. In contrast, the Windows OS Scheduler does implement a form of MTTW thread scheduling by temporarily boosting the priority of a thread that is awakened when a device interrupt it was waiting for occurs.) These are all factors that suggest it is not a good idea to try to push the CPU load to the limits of processor capacity under Hyper-V.

When enough guest machine virtual processors attempt to execute and the physical machine’s logical processors are sufficiently busy, performance under Hyper-V will begin to degrade. You can then influence Hyper-V guest machine scheduling using one of the available priority settings. The CPU priority settings are illustrated in Figure 2, which is a screenshot showing the Hyper-V Processor Resource control configuration options that are available. Hyper-V provides several tuning knobs to customize the Hyper-V virtual processor scheduler function. These include:

  • reserving CPU capacity in advance on behalf of a guest machine,
  • setting an upper limit to the CPU capacity a guest machine can use, and
  • setting a relative priority (or weight) for this virtual machine to be applied whenever there is contention for the machine’s processors

Let’s compare the processor reservation, capping and weighting options and discuss which make sense to use and when to use them.

HyperV logical processor resource scheduling screenshot

Figure 2. Processor resource control settings include reservations, upper limits on CPU consumption, and prioritized guest machine dispatching based on the relative weights of the VMs.

Reservations.

The Hyper-V processor reservation setting is specified as the percentage of the capacity of a logical processor to be made available to the guest machine whenever it is scheduled to run. The reservation setting applies to each virtual processor that is configured for the guest. Setting a processor reservation value is mainly useful if you know that a guest machine requires a certain, minimal amount of processing power and you don’t want Hyper-V to schedule that VM to run unless that minimum amount of CPU capacity is available. When the virtual processor is dispatched, reservations guarantee that it will receive the minimum level of service specified.

Hyper-V reservations work differently from many other Quality of Service (QoS) implementations that feature them, in which any capacity that is reserved but not used remains idle. In Hyper-V, when the guest machine does not consume all the capacity that is reserved for it, virtual processors from other guest machines that are waiting can be scheduled instead. Implementing reservations in this manner suggests the hypervisor makes processor scheduling decisions on a periodic basis, functionally equivalent to a time-slice, but whatever that time-slicing interval is, it is undocumented.

Capping.

Setting an upper limit using the capping option is mainly useful when you have a potentially unstable workload, for example, a volatile test machine, and you don’t want to allow that rogue guest to dominate the machine to the detriment of every other VM that is also trying to execute. Similar to reservations, capping is also specified as a percentage of logical processor capacity.

Weights.

Finally, setting a relative weight for the guest machine’s virtual processors is useful if you know the priority of a guest machine’s workload relative to the other guest machine workloads that are resident on the Hyper-V host. This is the most frequently used virtual processor priority setting. Unlike reservations, weights do not provide any guarantee that a guest machine’s virtual processor will be dispatched. Instead, they simply increase or decrease the likelihood that a guest machine’s virtual processor is dispatched. As implemented, weights are similar to reservations, except that with weights it is easier to get confused about what percentage of a logical processor you intend to allocate.

Here is how the weights work. Each virtual processor that a guest machine is eligible to use gets a default weight of 100. If you don’t adjust any of the weights, each virtual processor is equally likely to be dispatched as any other, so the default scheme is the balanced, round-robin approach. In default mode (i.e., round-robin), the probability that a virtual processor is selected for execution is precisely

1 / (number of eligible guest machines)

Notice that the number of virtual processors defined for a guest machine determines how many virtual processors it is eligible to run, but it is otherwise not a factor in calculating the dispatching probability for any one of its virtual processors.

When you begin to adjust the weights of guest machines, scheduling decisions are based on the relative weighting factors you have chosen. To calculate the probability that a virtual processor is selected for execution when the weighting factors are non-uniform, calculate a base value that is the total of all the guest machine weighting factors. For instance, if you have three guest machines with weighting factors of 100, 150 and 250, the base value is 500, and the individual guest machine dispatching probabilities are calculated using the simple formula

weight(i) / SUM(weight(1):weight(n))

So, for that example, the individual dispatching probabilities are 20%, 30% and 50%, respectively.
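That arithmetic is easy to script. The sketch below reproduces the 20%/30%/50% result using the weights from the example (the guest names are placeholders):

    def dispatch_probabilities(weights):
        # Relative weights -> probability that each guest's virtual processor is selected.
        base = sum(weights.values())               # SUM(weight(1):weight(n))
        return {guest: weight / base for guest, weight in weights.items()}

    print(dispatch_probabilities({'GuestA': 100, 'GuestB': 150, 'GuestC': 250}))
    # -> {'GuestA': 0.2, 'GuestB': 0.3, 'GuestC': 0.5}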

Relative weights make it easier to mix production and test workloads on the same Hyper-V host. You could boost the relative weight of production guests to 200, for example, while lowering the weight of the guest machines used for testing to 50. If you have a production SQL Server guest machine that services the production servers, you would then want to set its weight higher than the other production guests. And, if you had a SQL Server guest test machine that services the other test guests, you could leave that machine running at the default weight of 100, higher than the other test machines, but still lower than any of the production guest machines.

CPU Weight example.

Consider a Hyper-V Host with four guest machines: two production guests and two test guests running a discretionary workload. To simplify matters, let’s assume each guest machine is configured to run two virtual processors. The guest machine virtual processor weights are configured as shown in Table 2:

Table 2.

Guest Machine Workload     VP Weight
Production                    200
Test                           50

Since Hyper-V guest machine scheduling decisions are based on relative weight, you need to sum the weights over all the guest machines and compute the total weighting factor per logical processor. Then Hyper-V calculates the relative weight per logical processor for each guest machine and attempts to allocate logical processor capacity to each guest machine proportionately.

Table 3. Calculating the guest machine relative weights per logical processor.

Workload      # guests    VP Weight    Total Weight (Weight * guests)    Guest Relative Weight per LP
Production        2          200                   400                              80%
Test              2           50                   100                              20%
Totals            4                                500

Under those conditions, assuming 100% CPU utilization and an equivalent CPU load from each guest, you can expect to see the two production machines consuming about 80% of each logical processor, leaving about 20% of a logical processor for the test machines to share. In a later post, we will look at benchmark results using this CPU weighting scenario.
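The arithmetic behind Table 3 and the expected 80/20 split can be reproduced with a few lines of Python, assuming the uniform, CPU-saturating load from every guest described above:

    workloads = {
        'Production': {'guests': 2, 'vp_weight': 200},
        'Test':       {'guests': 2, 'vp_weight': 50},
    }

    total_weight = sum(w['guests'] * w['vp_weight'] for w in workloads.values())   # 500
    for name, w in workloads.items():
        workload_weight = w['guests'] * w['vp_weight']
        share = workload_weight / total_weight * 100
        print(f"{name}: total weight {workload_weight}, roughly {share:.0f}% of each logical processor")
    # Production: total weight 400, roughly 80% of each logical processor
    # Test: total weight 100, roughly 20% of each logical processor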

If you are mixing production and test machines on the same Hyper-V host, you may want to also consider an extra bit of insurance and use capping to set a hard upper limit on the amount of CPU capacity that the test machines can ever use. Keep in mind that none of these resource controls has much effect unless there is significant contention for the processors, which is something to be avoided in general under any form of virtualization. The Hyper-V processor resource controls come in handy when you pursue an aggressive server consolidation program and want to add a little extra margin of safety to the effort.

Reservations and limits are commonly used in Quality of Service mechanisms, and weights are often used in task scheduling. However, it is a little unusual to see a scheduler that implements all three tuning knobs. System administrators can use a mix of all three settings for the guest machines, which is confusing enough. And, for maximum confusion, you can place settings on different machines that are basically incompatible with the settings of the other resident guest machines. So, it is wise to exercise caution when using these controls. Later in this chapter, I will report on some benchmarks that I ran where I tried some of these resource control settings and analyzed the results. When the Hyper-V Host machine is running at or near its CPU capacity constraints, these guest machine dispatching priority settings do have a significant impact on performance.

 

The next post in this series continues this Hyper-V architecture discussion, describing its use of intercepts, interrupts and Hypercalls, the three interfaces that allow for interaction between the hypervisor, the root partition, and the guest partitions.
