This is the third post in a series on Hyper-V performance. The series begins here.
Three interfaces exist that allow for interaction and communication between the hypervisor, the Root partition and the guest partitions: intercepts, interrupts, and the direct Hypercall interface. These interfaces are necessary for the virtualization scheme to function properly, and their usage accounts for much of the overhead virtualization adds to the system. Hyper-V measures and reports on the rate these different interfaces are used, which is, of course, workload dependent. Frankly, the measurements that show the rate that the hypervisor processes interrupts and Hypercalls is seldom of interest outside the Microsoft developers working on Hyper-V performance itself. But these measurements do provide insight into the Hyper-V architecture and can help us understand how the performance of the applications running on guest machines is impacted due to virtualization. Figure 3 is a graph showing these three major sources of virtualization overhead in Hyper-V.
Intercepts are the primary mechanism used to maintain a consistent view of the virtual processor that is visible to the guest OS. Privileged instructions and other operations issued by the guest operating system that are valid when the OS is accessing the native hardware need to be intercepted by the hypervisor and handled in a way that maintains a consistent view of the virtual machine. Intercepts make use of another hardware assist – the virtualization hardware that allows the hypervisor to intercept certain operations. Intercepts include the guest machine OS
- issuing a CPUID instruction to identify the hardware characteristics
- accessing machine-specific registers (MSRs)
- accessing I/O ports directly
- instructions that cause hardware exceptions when executed that must be handled by the OS
When these guest machine operations are detected by the hardware, control is immediately transferred to the hypervisor to resolve. For example, if the guest OS believes it is running on a 2-way machine and issues a CPUID instructions, Hyper-V intercepts that instruction and, through the intercept mechanism, supplies a response that is consistent with the virtual machine image. Similarly, whenever a guest OS issues an instruction to read or update a Control Register (CR) or a Machine-Specific Register (MSR) value, this operation is intercepted, and control is transferred to the parent partition where the behavior the guest OS expects is simulated.
Resolving intercepts in Hyper-V is a cooperative process that involves the Root partition. When a virtual machine starts, the Root partition makes a series of Hypercalls that establish the intercepts it will handle, providing a call back address that the hypervisor uses to signal the Root partition when that particular interception occurs. Based on the virtual machine state maintained in the VM worker process, the Root partition will then simulate the requested operation, and then allow the intercepted instruction to complete its execution.
Hyper-V is instrumented to report the rate that several categories of intercepts occur. Some intercepts occur infrequently, like issuing CPUID instructions, something the OS needs to do rarely. Others like Machine-Specific Register access are apt to occur more frequently, as illustrated in Figure 4, which compares the rate of MSR accesses to the overall intercept rate, summed over all virtual processors for a Hyper-V host machine.
In order to perform its interception functions, the Root partition’s VM worker process maintains a record of the virtual machine state. This includes keeping track of the virtual machine’s registers each time there is an interrupt, plus maintaining a virtual APIC for interrupt handling, as well as additional virtual hardware interfaces, what some authors describe as a “virtual motherboard” of devices representing the full simulated guest machine hardware environment.
Guest machines accessing the synthetic disk and network devices that are installed are presented with a virtualized interrupt handling mechanism. Compared to native IO, this virtualized interrupt process adds latency to guest machine disk and network IO requests to synthetic devices. Latency increases because device interrupts need to be processed twice, once in the Root partition, and again in the guest machine. Latency also increases when interrupt processing at the guest machine level is deferred because none of the virtual processors associated with the guest are currently dispatched.
To support guest machine interrupts, Hyper-V builds and continuously maintains a synthetic interrupt controller associated with the guest’s virtual processors. When an external interrupt generated by a hardware device attached to the Host machine occurs because the device has completed a data transfer operation, the interrupt is directed to the Root partition to process. If the device interrupt is found to be associated with a request that originated from a guest machine, the guest’s synthetic interrupt controller is updated to reflect the interrupt status, which triggers action inside the guest machine to respond to the interrupt request. The device drivers loaded on the guest machine are suitably “enlightened” to skip execution of as much redundant logic as possible during this two-phased process.
The first phase of interrupt processing occurs inside the Root partition. When a physical device raises an interrupt that is destined for a guest machine, the Root partition handles the interrupt in the Interrupt Service Routine (ISR) associated with the device immediately in the normal fashion. When the device interrupt is in response to a disk or network IO request from a guest machine, there is a second phase of interrupt processing that occurs associated with the guest partition. The second phase, which is required because the guest machine also must handle the interrupt, increases the latency of every IO interrupt that is not processed directly by the child partition.
An additional complication arises if none of the guest machine’s virtual processors are currently dispatched. If no guest machine virtual processor is executing, then interrupt processing on the guest is deferred until one of its virtual processors is executing. In the meantime, the interrupt is flagged as pending in the state machine maintained by the Root partition. The amount of time that device interrupts are pending also increases the latency associated with synthetic disk and network IO requests initiated by the guest machine.
The increased latency associated with synthetic device interrupt-handling can have a very serious performance impact. It can present a significant obstacle to running disk or network IO-bound workloads as guest machines. The problem is compounded because the added delay and its impact on an application is difficult to quantify. The Logical Disk and Physical Disk\Avg. Disk secs/Transfer counters on the Root partition are not always reliably capable of measuring the disk latency associated with the first phase of interrupt processing because Root partition virtual processors are also subject to deferred interrupt processing and virtualized clocks and timers. The corresponding guest machine Logical Disk and Physical Disk\Avg. Disk secs/Transfer counters are similarly burdened. Unfortunately, a careful analysis of the data shows it is not clear that any of the Windows disk response time measurements are valid under Hyper-V, even disk devices that are natively attached to the guest partition.
The TCP/IP networking stack, as we have seen in our earlier look at NUMA architectures, has a well-deserved reputation for requiring execution of a significant number of CPU instruction in processing Network IO. Consequently, guest machines that handle a large amount of network traffic are subject to this performance impact when running virtualized. The guest machine synthetic network driver enlightenment helps considerably with this problem, as do NICs featuring TCP offload capabilities. Network devices that can be attached to the guest machine in native mode are particularly effective performance options in such cases.
In general, over-provisioning processor resources on the VM Host is an effective mitigation strategy to limit the amount and duration of deferred interrupt processing delays that occur for both disk and network IO. Disk and network hardware that can be directly attached to the guest machine is certainly another good alternative. Interrupt processing for disk and network hardware that is directly attached to the guest is a simpler, one-phase process, but one that is also subject to pending interrupts whenever the guest’s virtual processors are themselves delayed. The additional latency associated with disk and network IO is one of the best reasons to run a Windows machine in native mode.
Guest machine interrupt handling relies on an inter-partition communications channel called the VMBus, which makes use of the Hypercall capability that allows one partition to signal another partition and send messages. (Note that since child partitions have no knowledge of other child partitions, this Hypercall signaling capability is effectively limited to use by the child partition and its parent, the Root partition.) Figure 5 illustrates the path taken when a child partition initiates a disk or network IO to a synthetic disk or network device installed in the guest machine OS. IOs to synthetic devices are processed by the guest machine device driver, which is enlightened, as discussed above. The synthetic device driver passes the IO request to another Hyper-V component installed inside the guest called a Virtualization Service Client (VSC). The VSC inside the guest machine translates the IO request into a message that is put on the VMBus.
The VMBus is the mechanism used for passing messages between a child partition and its parent, the Root partition. Its main function is to provide a high bandwidth, low latency path for the guest machine to issue IO requests and receive replies. According to Mark Russinovich, writing in Windows Internals, one message-passing protocol the VMbus uses is a ring of buffers that is shared by the child and partition: “essentially an area of memory in which a certain amount of data is loaded on one side and unloaded on the other side.” Russinovich’s book continues, “No memory needs to be allocated or freed because the buffer is continuously reused and simply rotated.” This mechanism is good for message passing between the partitions, but is too slow for large data transfers due to the necessity to copy data to and from the message buffers.
Another VMBus messaging protocol uses child memory that is mapped directly to the parent partition address space. This direct memory access VMBus mechanism allows disk and network devices managed by the Root partition to reference buffers allocated in a child partition. This is the technique Hyper-V uses to perform bulk data IO operations for synthetic disk and network devices. For the purpose of issuing IO requests to native devices, the Root partition is allowed to access machine memory addresses directly. In addition, it can request the hypervisor to translate guest machine virtual addresses allocated for use as VMBus IO buffers into machine addresses that can be referenced by the physical devices supporting DMA that are attached to the Root.
Inside the Root partition, Hyper-V components known as Virtualization Service Providers (VSPs) receive IO requests from synthetic devices from the guest machines and translate them into physical disk and network IO requests. Consider, for example a guest partition request to read or write a .vhdx file that the VSP must translate into a disk IO request to the native file system on the Root. These translated requests are then passed to the native disk IO driver or the networking stack installed inside the Root partition that manages the physical devices. The VSPs also interface with the VM worker process that is responsible for the state machine that represents the virtualized physical hardware presented to the guest OS. Using this mechanism, interrupts for guest machine synthetic devices can be delivered properly to the appropriate guest machine.
When the native device completes the IO operation requested, it raises an interrupt that the Root partition handles normally. This process is depicted in Figure 5. When the request corresponds to one issued by a guest machine, what is different under Hyper-V is that a waiting thread provided by the VSP and associated with that native device is then awakened by the device driver. The VSP also ensures that the device response adheres to the form that the synthetic device driver on the guest machine expects. It then uses the VMBus inter-partition messaging mechanism to signal the guest machine that has an interrupt pending.
From a performance monitoring perspective, the Hyper-V hypervisor reports on the overall rate of virtual interrupt processing, as illustrated in Figure 6. The hypervisor, however, has no understanding of what hardware device is associated with each virtual interrupt. It can report the number of deferred virtual interrupts, but it does not report the amount of pending interrupt delay, which can be considerable. The measurement components associated with disk and network IO in the Root partition function normally, with the caveat that the disk and network IO requests counted by the Root partition aggregate all the requests from both the Root and child partitions. Windows performance counters inside the guest machine continue to provide an accurate count of disk and network IO and the number of bytes transferred for that partition. The guest machine counters are useful when the Root disks or network interface cards are saturated to identify which guest partitions are responsible for the overload. Later on, we will review some examples that illustrate how all these performance counters function under Hyper-V.
The Hypercall interface provides a calling mechanism that allows child partitions to communicate with the Root partition and the hypervisor. Some of the Hypercalls support the guest OS enlightenments mentioned earlier. Others are used by the Root partition to communicate requests to the hypervisor to configure, start, modify, and stop child partitions. There is another set of Hypercalls used in dynamic memory management, which is discussed below. Hypercalls are also defined to enable the hypervisor log events and post performance counter data back to the Root partition where it can be gathered by Perfmon and other similar tools..