Over-provisioned Hyper-V Hosts: Understanding Guest Machine Performance, Part II

This long post discusses how guest machine performance under Hyper-V looks when the Hyper-V Host is over-provisioned. When a Hyper-V Host is over-provisioned, guest machine workloads are subject to only a minimal performance penalty, which I will attempt to quantify. This is the eighth post in a series on Hyper-V performance. The series began here.

It is easy to recognize a generously over-provisioned Hyper-V Host machine – its processors are underutilized and machine memory is not fully allocated. When the machine’s logical CPUs are seldom observed running in excess of 25-40% busy, there is ample CPU capacity for all its resident guest machines, especially considering that most Hyper-V Host machines are multiprocessors. Memory can safely be regarded as underutilized when more than 40% of it is Available for allocation by the hypervisor and no guest machine is running at its maximum Dynamic Memory setting.

Note: The dispatching of a guest machine virtual processor is delayed when all CPUs are busy, so it is forced to wait. In a symmetric multiprocessor, assuming each processor’s utilization is independent, the probability that all CPUs are busy simultaneously is the joint probability: the product of the individual processor utilizations. For example, if there are four CPUs and each CPU is busy 25% of the time, the joint probability of all the CPUs being busy simultaneously is 0.25 * 0.25 * 0.25 * 0.25, or about 0.004. In other words, the probability that all CPUs are simultaneously busy is only ~0.4%.
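
The arithmetic is easy to generalize. Here is a minimal C# sketch of the calculation, assuming independent processor utilizations as in the note above; the numbers it prints are just the worked example from the text plus one more data point at 40% busy:

using System;

class AllBusyProbability
{
    // Probability that all n CPUs are busy at once, assuming each CPU is busy
    // a fraction 'utilization' of the time and the CPUs behave independently.
    static double AllBusy(double utilization, int cpus) => Math.Pow(utilization, cpus);

    static void Main()
    {
        // The example from the note: four CPUs, each 25% busy.
        Console.WriteLine(AllBusy(0.25, 4));   // 0.00390625, i.e. ~0.4%
        // The same machine with each CPU 40% busy.
        Console.WriteLine(AllBusy(0.40, 4));   // 0.0256, i.e. ~2.6%
    }
}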

When Hyper-V Host machines are over-provisioned, the performance of guest machine applications approaches the level of native hardware. The problem with over-provisioned VM Host machines is that they are not economical. Over-provisioning on a wide enough scale often leads to an initiative to increase the degree of server consolidation by trying to pack more guest machines into the existing virtualization infrastructure.

Discussing over-committed Hyper-V Host machines and under-provisioned guest machines is much more interesting because that is when serious performance problems can occur. The first set of benchmarking runs reported below was directed at showing how these two conditions can be characterized based on various Hyper-V and Windows OS performance measurements.

Several of the benchmarking runs discussed in this section deliberately overload the Hyper-V Host processors or the Host machine’s memory footprint and then look at the Hyper-V and guest machine performance counters that characterize those overloaded conditions. As we have seen, you can implement virtual processor and dynamic memory priority settings to try to protect higher priority workloads from this degradation. Additional benchmarking runs in which physical resources are over-committed reveal how effective these Hyper-V virtual processor and machine memory priority settings are in shielding higher priority workloads from the performance impact when the Hyper-V Host is overloaded.

Scaling up and scaling out

Virtualization hardware and software technology has evolved to the point where it provides compelling benefits in modern data center operations, although improving application performance is not among them. For example, one of the clear trends in hardware manufacturing that favors virtualization is building more powerful multiprocessor cores, not faster individual processors. CPU manufacturers have resorted to packaging more processors on a chip rather than making processors faster because increases in clock speed lead to disproportionate increases in power consumption, which also ramps up the amount of heat that has to be dissipated. In semiconductor fabrication, the manufacturers have encountered a “power wall” that resists other engineering solutions.

A second factor that promotes virtualization is industry Best Practices that lead to building and deploying Windows machines that are dedicated to performing a single role, whether they are explicit Server roles or more general purpose desktop and portable workstations handling diverse personal computing tasks. A related practice is the Technical Support group within the IT organization building and then certifying for distribution one or more stable images of the operating system and the application software installed on top of it after a lengthy period of comprehensive Acceptance Testing. This stable image is then cloned each time there is an organizational need to support another copy of this application. Virtualization software that can deploy new copies of these system images through rapid cloning of virtual machines – a process that can also be automated – adds valuable flexibility to data center operations.

Most of the virtual machines configured to handle a single server role are clearly not well matched against the powerful capabilities of the data center machines they would be deployed to. Without virtualization, these data center machines would often be massively over-provisioned if they were only capable of running an individual Windows Server workload. Virtualization technology offers relief from this conundrum, a convenient way to consolidate many of these individual workloads on a single piece of equipment. Essentially, virtualization technology provides a flexible, software-based mechanism that allows system administrators to utilize current hardware more effectively while retaining all the administrative advantages of isolating workloads on dedicated servers.

Still, spinning up a new guest machine from the standard server or workstation Build is not the only possible response to each new request for IT services. There are viable alternatives, including allowing a single instance of IIS, for example, to host multiple application web sites or installing multiple instances of SQL Server on a production or test machine. IT professionals are sometimes reluctant to choose these configuration alternatives because they are concerned about the performance risks associated with multiple web servers sharing a single machine image, for example. Of course, these performance risks do not magically disappear when multiple guest machines are provisioned instead. The problem of over-committing shared computer resources is merely elevated to the level associated with Hyper-V administration.

Finally, having firmly established itself as an integral part of large scale data center operations, virtualization technology continues to evolve other virtual machine management capabilities, including replication, live migration, dynamic load balancing, automatic failover and recovery. The flexibility that virtualization solutions also provide in being able to provision a new machine quickly can benefit the performance of workloads that are running up against capacity limits in their current configuration and need to scale out across multiple machines in an application cluster to achieve higher levels of throughput.

Benchmark results

To gain some additional perspective on the performance impact of virtualization, we will look first at some benchmarking results showing the performance of virtual machines in various simple configurations, which we will also compare to native performance where Windows is installed directly on top of the hardware. For these performance tests, I used a benchmarking program that simulates the multi-threaded CPU and memory load of an active ASP.NET web application, but without issuing disk or network requests so that those limited resources on the target machine are not overwhelmed in the course of executing the benchmark program.

The benchmark program I used for stress testing Hyper-V guest machines is a Load Generator application I wrote that is parameter-driven to generate a wide variety of “challenging” workloads. The current version is a 64-bit .NET program written in C# called the ThreadContentionGenerator. It has a main dispatcher thread and a variable number of worker threads, similar to ASP.NET. You set it to execute a fixed number of concurrent tasks and to perform a specific number of iterations of each task. Each task allocates a large .NET collection object that it then fills with random data. It then searches the collection repeatedly, and finally deletes all the data. In this fashion, the program stresses both the processor and virtual memory. Periodically, each active thread simulates an IO wait by sleeping, where the simulated IO rate and the IO duration are also subject to some degree of realistic variation.

The benchmark program is a very flexible beast that can be adjusted to stress the machine’s CPUs, memory or both. You can execute it in a shared-nothing environment where the threads execute independently of each other. Alternatively, you can set a parameter that adds an element of resource sharing to the running process so that the threads face lock contention. In contention mode, the main thread sets up some shared data structures that the worker threads access serially to generate a degree of realistic lock contention that can be dialed up or down by increasing or decreasing the amount of processing spent in the critical section.
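
The sketch below is not the actual ThreadContentionGenerator source, just a stripped-down illustration of the worker-task pattern described above: each task fills a large collection with random data, searches it repeatedly, sleeps to simulate IO waits, and, in contention mode, funnels a slice of its work through a shared lock. The collection size, sleep range, and helper names are invented for illustration; the task and iteration counts mirror the command line shown just below.

using System;
using System.Collections.Generic;
using System.Threading;

class WorkerSketch
{
    static readonly object SharedLock = new object();
    static long SharedCounter;   // shared state touched inside the critical section

    static void RunTask(int iterations, bool contentionMode, Random rng)
    {
        for (int i = 0; i < iterations; i++)
        {
            // Stress virtual memory: allocate a large collection and fill it with random data.
            var data = new List<int>(capacity: 500_000);
            for (int j = 0; j < 500_000; j++) data.Add(rng.Next());

            // Stress the CPU: search the collection repeatedly.
            for (int pass = 0; pass < 10; pass++) data.Contains(rng.Next());

            if (contentionMode)
            {
                // Serialize part of the work through a shared lock to generate contention.
                lock (SharedLock) { SharedCounter += data.Count; }
            }

            data.Clear();                   // delete the data before the next iteration
            Thread.Sleep(rng.Next(5, 50));  // simulate an IO wait of varying duration
        }
    }

    static void Main()
    {
        const int tasks = 32, iterations = 90;   // same settings used in the benchmark runs
        var threads = new List<Thread>();
        for (int t = 0; t < tasks; t++)
        {
            int seed = t;
            var worker = new Thread(() => RunTask(iterations, contentionMode: false, new Random(seed)));
            worker.Start();
            threads.Add(worker);
        }
        threads.ForEach(worker => worker.Join());
    }
}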

For this first set of Hyper-V guest machine performance experiments, I set the number of concurrent worker tasks to 32 and the number of iterations to 90:

ThreadContentionGenerator.exe -tasks 32 -iterations 90


There are additional parameters to vary the virtual memory footprint of the program, the duration of IO waits and the rate of lock contention, but for this set of tests I let the program run with default values for those three parameters. With these settings, the program generates a load that is similar in many respects to a busy ASP.NET web application, one that is compute-bound, with requests that can be processed largely independent of each other. Note that the intent was to stress the Hyper-V environment, beginning by stressing the machine’s CPU capacity, without attempting a realistic simulation of a representative or a particular ASP.NET workload.

The hardware was an Intel i7 single socket machine with four physical CPUs (and Intel Hyper-Threading disabled) and 12 GB of RAM. The OS was Windows Server 2012 R2.

  • Native performance baseline

Running first on the native machine – after re-booting with Hyper-V disabled – the benchmark program ran to completion in about 90 minutes, the baseline execution time we will use to compare the various virtualization configurations that were tested. The only other active process running on the native Windows machine was Demand Technology’s Performance Sentry performance monitor, DmPerfss.exe, gathering performance counters once per minute.

At this stage, the only aspect of the benchmark program’s resource usage profile that is relevant is its CPU utilization. Because each task being processed goes to sleep periodically to simulate I/O, individual worker threads are not CPU-bound. However, since there are 32 worker threads executing concurrently and only four physical CPUs available, the overall workload is CPU-bound, as evidenced in Figure 25, which reports processor utilization by the top 5 consumers of CPU time during a one hour slice when the ThreadContentionGenerator program was active on the native machine.

Figure 25. Native execution of the benchmark program shows CPU utilization near 400% on a single socket machine with 4 physical CPUs. Instantaneous measurements of the System/Processor Queue Length counter, represented by a dotted line chart plotted against the right-hand y-axis, indicate a significant amount of processor queuing.

You can see in Figure 25 that overall processor utilization approaches the capacity of the machine at close to 400% utilization. The dotted line graph in Figure 25 also shows the instantaneous values obtained from the Processor Queue Length counter. The number of threads waiting in the Windows Scheduler Ready Queue exceeds fifteen for some of the observations. We can readily see that not only are the four physical CPUs on the machine quite busy, but at many intervals there are also a large number of ready threads waiting for service. Figure 26 confirms that the threads waiting in the Ready Queue are predominantly from the ThreadContentionGenerator process (shown in blue), which is the behavior I expected, by the way.
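
For reference, the counter plotted as the overlay line in Figure 25 and the benchmark process’s CPU consumption can be sampled programmatically. This is a minimal sketch using System.Diagnostics.PerformanceCounter (Windows-only); the Process instance name “ThreadContentionGenerator” is assumed to match the benchmark executable’s name:

using System;
using System.Diagnostics;
using System.Threading;

class ReadyQueueSampler
{
    static void Main()
    {
        // Instantaneous depth of the OS Scheduler Ready Queue (system-wide, not per CPU).
        var queueLength = new PerformanceCounter("System", "Processor Queue Length");

        // CPU consumed by the benchmark process; can exceed 100% on a multiprocessor.
        var benchmarkCpu = new PerformanceCounter(
            "Process", "% Processor Time", "ThreadContentionGenerator");

        benchmarkCpu.NextValue();   // rate counters need a first call to prime them
        while (true)
        {
            Thread.Sleep(60_000);   // one-minute samples, matching the benchmark runs
            Console.WriteLine(
                $"Ready queue: {queueLength.NextValue():F0}  " +
                $"Benchmark CPU: {benchmarkCpu.NextValue():F0}%");
        }
    }
}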

Figure 26. This chart shows threads with a Wait State Reason indicating they are waiting in the OS Scheduler Ready Queue. As expected, most of the ready threads in the Ready Queue are from the benchmark program, the ThreadContentionGenerator process.

  • Standalone in the Root partition

In the next scenario, running standalone in the Root partition under Hyper-V with no child partitions active, the same benchmark executed for approximately 100 minutes, about 11% longer than the native execution baseline. In many scenarios a 10% performance penalty is a small price to pay for the other operational benefits virtualization provides, but it is important to keep in mind that there is always some performance penalty whenever you run an application in a virtualized environment.

Applications take longer to run inside a virtual machine compared to running native because of a variety of virtualization costs that are not encountered on a native machine. These include the performance costs associated with Hyper-V intercepts and Hypercalls, plus the additional path length associated with synthetic interrupt processing. As mentioned above, the benchmark program simulates IO by issuing Timer Waits. These require the timer services of the hypervisor, which are less costly than the synthetic interrupt processing associated with disk and network IO. So, the 10% increase in execution time is very likely a best case for the performance degradation to expect.

Those costs of virtualization are minor irritants so long as the Hyper-V Host machine can supply ample resources to the guest machine. The performance costs of virtualization do increase substantially, however, when guest machines start to contend for shared resources on the Host machine.

Since processor scheduling is under the control of the hypervisor in the second benchmark run, for reliable processor measurements, it is necessary to turn to the Hyper-V Logical Processor counters, as shown in Figure 27. For a one-hour period while the benchmark program was active, overall processor utilization is reported approaching 400%, but you will notice it is slightly lower than the levels reported for the native machine in Figure 25. Figure 27 also shows an overlay line graphing hypervisor processor utilization against the right-hand y-axis, which accounts for some of the difference. The hypervisor consumes about 6% of one processor over the same measurement interval. The amount of CPU time consumed directly by the Hyper-V hypervisor is one readily quantifiable source of virtualization overhead that causes performance of the benchmark application to degrade by 10% or so.

Standalone guest virtual processor utilization

Figure 27. Running the benchmark workload standalone in the Root partition, the hypervisor consumes about 6% of one processor. Overall CPU utilization approaches 400% busy, slightly less busy than the native configuration shown in Figure 25.
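
The utilization data graphed in Figure 27 comes from the Hyper-V Hypervisor Logical Processor counter set, which is only visible from the Root partition. Below is a minimal C# sketch for sampling it; the category and counter names are as they appear on my test systems and may differ slightly across Hyper-V versions:

using System;
using System.Diagnostics;
using System.Threading;

class HypervisorCpuSampler
{
    static void Main()
    {
        const string category = "Hyper-V Hypervisor Logical Processor";

        // The _Total instance aggregates across all logical processors on the Host.
        var totalRun = new PerformanceCounter(category, "% Total Run Time", "_Total");
        var hvRun    = new PerformanceCounter(category, "% Hypervisor Run Time", "_Total");
        var guestRun = new PerformanceCounter(category, "% Guest Run Time", "_Total");

        // Prime the rate counters, then sample once a minute.
        totalRun.NextValue(); hvRun.NextValue(); guestRun.NextValue();
        while (true)
        {
            Thread.Sleep(60_000);
            Console.WriteLine(
                $"Total: {totalRun.NextValue():F1}%  " +
                $"Hypervisor: {hvRun.NextValue():F1}%  " +
                $"Guest: {guestRun.NextValue():F1}%");
        }
    }
}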

Reviewing the Hyper-V counter measurement data, we can see that threads inside the Root partition execute on virtual processors, subject to the hypervisor Scheduler, the same as the virtual processor scheduling performed for any guest machine child partition. When the Windows OS inside the Root partition executes a thread context switch, the Hyper-V performance counters graphed in Figure 28 show that there is a corresponding hypervisor context switch. For child partitions, there is an additional Hyper-V Scheduler interrupt that requires processing on a context switch, so there is slightly more virtualization overhead whenever child partitions are involved.

Standalone guest logical processor context switches

Figure 28. Each time the Windows OS inside the Root Partition executes a thread context switch, there is a corresponding hypervisor context switch.

The Hyper-V Logical Processor measurements do include a metric that should be directly comparable to the System\Processor Queue Length measurement shown in Figure 25: CPU Wait Time per Dispatch, which is available at the virtual processor level. Unfortunately, this performance counter is not helpful. It is not clear what units the Wait Time is reported in, although an educated guess is the standard Windows 100-nanosecond timer units. It also reports Wait Time in very discrete, discontinuous measurements, which is strange. Together, these two issues make interpretation problematic. Fortunately, the System\Processor Queue Length counter is an instantaneous measurement that remains serviceable under Hyper-V. Figure 29 shows the same set of Process(*)\% Processor Time counters and a Processor Queue Length overlay line as Figure 25. The length of the processor Ready Queue for the Root partition run is comparable to the native benchmark run, with even some evidence that the Ready Queue delays are slightly longer in the configuration where virtualization was enabled.

Standalone guest machine benchmark process utilization

Microsoft strongly suggests that you do not use the Root partition to execute any work other than what is necessary to administer the VM Host machine. There is no technical obstacle that prevents you from executing application programs in the Root partition, like I did with the benchmark program, but it is not a recommended practice. The Root partition provides a number of high priority virtualization services, like the handling of synthetic disk and network IO requests, which you want to take pains not to impact by running other applications in the Root.

  • Standalone in a single child partition

Given the prohibition against running applications in the Root, the more useful comparison quantifying the minimum overhead of virtualization would be to compare performance of a guest machine in a child partition with performance on native hardware. So, on the same physical machine, I then created a Windows 8.1 virtual machine and configured it to run with 4 virtual processors. Making sure that nothing else was running on the Hyper-V server, I then ran the same benchmark on the 4-way guest machine. This time the benchmark ran to completion in 105 minutes.

Notice that the benchmark run took about 5% longer in the child partition than it did standalone in the Root partition, even though this single 4-way guest machine had access to all the physical CPUs available on the physical machine and executed in a standalone environment where it did not have to contend with any other guest VMs for processor resources. At 105 minutes, execution time is about 17% longer than it took the same benchmark program to execute in native mode. Figure 30, which shows the rate that the Hyper-V hypervisor processed several types of virtualization-related interrupts, provides some insight into why execution time elongates under virtualization. Notice that hypervisor Scheduler interrupts occur when child partitions are executing – these Scheduler interrupts do not occur when threads are executing inside the Root partition, as illustrated back in Figure 28.

Logical processor interrupts for standalone child partition

Figure 30. Interrupt processing rates reported for the hypervisor when a child partition is active.

This configuration was also noteworthy because hypervisor CPU consumption was reported at about 8% of one processor, a utilization level roughly 25% higher than in any of the other configurations evaluated.

Today, performance testing is often performed on virtual machines because test machines are only intermittently active, and because of the ease with which you can spin them up and tear them down again. In my experience it is reasonable to expect the same workload to take about 10% longer to execute if you run it inside a VM under ideal circumstances, which implies the VM has access to all the resources it needs on the machine and there is no or minimal contention for those resources from other resident guest machines. This first set of benchmark tests shows that the performance degradation to expect when a guest machine executes on an efficiently-provisioned VM Host is for tasks to run approximately 10% slower. Consider this a minimum stretch factor that elongates execution time due to various virtualization overheads. Furthermore, it is reasonable to expect this stretch factor to increase whenever the guest machine is under-provisioned or the Hyper-V machine is over-committed.
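
To make the stretch factor concrete, here is a hypothetical helper that expresses the results reported above (90, 100, and 105 minutes of elapsed time) as ratios against the native baseline; the helper name and structure are mine, the numbers are from the benchmark runs:

using System;

class StretchFactor
{
    // Stretch factor = elapsed time under virtualization / elapsed time on native hardware.
    static double Stretch(double virtualMinutes, double nativeMinutes)
        => virtualMinutes / nativeMinutes;

    static void Main()
    {
        const double native = 90;                  // native baseline run, in minutes
        Console.WriteLine(Stretch(100, native));   // Root partition run:  ~1.11 (11% longer)
        Console.WriteLine(Stretch(105, native));   // child partition run: ~1.17 (17% longer)
    }
}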

In the next post, this baseline measurement is compared to the other possible VM configurations: an efficiently-provisioned Host machine, an over-committed VM Host machine, and, finally, an under-provisioned guest machine. In the case of an efficiently-provisioned VM Host machine, we can expect a stretch factor comparable to the minimum stretch factor reported here. However, as we will see, when the VM Host machine is significantly over-committed or the guest machine is significantly under-provisioned, guest machine workloads can experience a severe performance penalty.


Understanding Guest Machine Performance under Hyper-V

In this post I begin to consider guest machine performance under Hyper-V. This is the seventh post in a series on Hyper-V performance. The series began here.

All virtualization technology requires executing additional layers of systems software, which adds overhead in many functional areas of Windows, including

  • processor scheduling,
  • intercepting and emulating certain guest machine instructions that would violate the integrity of the virtualization scheme,
  • machine memory management,
  • initiating and completing IO operations, and
  • synthetic device interrupt handling.

The effect of virtualization in each one of these areas of execution is to impart a performance penalty, and this applies equally to VMware, Xen, and the other flavors of virtualization that are available. Windows guest machine enlightenments under Hyper-V serve to reduce some of the performance penalties associated with virtualization, but they cannot eliminate them entirely. Your application suffers some performance penalty when it executes on a virtual machine. The question is how big that performance penalty is.

Executing these additional layers of software under virtualization always impacts the performance of a Windows application negatively, particularly its responsiveness. Individually, executing these extra layers of software adds a very small amount of overhead every time one of these functional areas is exercised. Added together, however, these additional overhead factors are significant enough to take notice of. But the real question is whether they are substantial enough to actively discourage data centers from adopting virtualization technology, given its benefits in many operational areas. Earlier in this series, I suggested a preliminary answer, which is “No, in many cases the operational benefits of virtualization substantially outweigh the performance risks.” Still, there are many machines that remain better off being configured to run on native hardware. Whenever maximum responsiveness and/or throughput is required, native Windows machines reliably outperform Windows guest machines executing the same workload.

Where Hyper-V virtualization technology excels is in partitioning and distributing hardware resources across virtual machines that require far less capacity than is available on powerful server machines. Furthermore, by exploiting the ability to clone new guest machines rapidly, virtualization technology is often used to enhance the scalability and performance of an application that requires a cluster of Windows machines to process its workload. Virtualization can make scaling up and scaling out such an application operationally easier. However, you should be aware that there are other ways to cluster machines that achieve the same scaling up and scaling out improvements without incurring the overhead of virtualization.

Performance risks.

The configuration flexibility that virtualization provides is accompanied by a set of risk factors that expose virtual machines to potential performance problems that are much more serious in nature than the additional overhead considerations discussed immediately above. These performance risks need to be understood by IT professionals charged with managing the data center infrastructure. The most serious risk that you will encounter is the ever-present danger of over-loading the Hyper-V Host machine, which leads to more serious performance degradation than any of the virtualization “overheads” enumerated above. Shared processors, shared memory and shared devices introduce opportunities for contention for those physical resources among guest machines that would not otherwise be sharing those components if allowed to run on native hardware. The added complexity of administering the virtualization infrastructure with its more ubiquitous level of resource sharing is a related risk factor.

When a Hyper-V Host machine is overloaded, or over-committed, all its resident guest machines are apt to suffer, but isolating them so they share fewer resources, particularly disk drives and network adaptors, certainly helps. However, shared CPUs and shared memory are inherent in virtualization, so achieving the same degree of isolation with regard to those resources is more difficult, to say the least. This aspect of resource sharing is the reason Hyper-V has virtual processor scheduling and dynamic memory management priority settings, and we will need to understand when to use these settings and how effective they are. In general, priority schemes are only useful when a resource is over-committed, essentially an out-of-capacity situation. This creates a backlog of work – a work queue – that is not getting done. Priority sorts the work queue, allowing more of the higher priority work to get done, at the expense of lower priority workloads. Like any other out-of-capacity situation, the ultimate remedy is not priority, but finding a way to relieve the capacity constraint. With a properly provisioned virtualization infrastructure, there should be a way to move guest machines from an over-committed VM Host to one that has spare capacity.

Somewhere between over-provisioned and under-provisioned is the range where the Hyper-V Host is efficiently provisioned to support the guest machine workloads it is configured to run. Finding that balance can be difficult, given constant change in the requirements of the various guest machines.

Finally, there are also performance risks associated with guest machine under-provisioning, where the VM Host machine has ample capacity, but one or more child partitions is constrained by its virtual machine settings from accessing enough of the Hyper-V Host machine’s processor and memory resources.

Table 2 summarizes the four kinds of Hyper-V configurations that need to be understood from a cost/performance perspective, focusing on the major performance penalties that can occur.

Table 2. Performance consequences of over- or under-provisioning the VM Host and its resident guest machines.

Condition                           Who suffers a performance penalty
Over-committed VM Host              All resident guest machines suffer
Efficiently provisioned VM Host     No resident guest machines suffer
Over-provisioned VM Host            No guest machines suffer, but hardware cost is higher than necessary
Under-provisioned Guest             Guest machine suffers

In the next blog entry, I will make an effort to characterize the performance profile of each configuration condition, beginning with the case that generates the least damaging performance penalty, namely that of the over-provisioned VM Host. Characterizing application performance when the Hyper-V Host machine is over-provisioned will provide insight into the minimum performance penalties that you can expect to accrue under virtualization.

Hyper-V architecture: Intercepts, Interrupts and Hypercalls.

This is the third post in a series on Hyper-V performance. The series begins here.

Three interfaces exist that allow for interaction and communication between the hypervisor, the Root partition and the guest partitions: intercepts, interrupts, and the direct Hypercall interface. These interfaces are necessary for the virtualization scheme to function properly, and their usage accounts for much of the overhead virtualization adds to the system. Hyper-V measures and reports on the rate these different interfaces are used, which is, of course, workload dependent. Frankly, the measurements that show the rate at which the hypervisor processes intercepts, interrupts and Hypercalls are seldom of interest to anyone outside the Microsoft developers working on Hyper-V performance itself. But these measurements do provide insight into the Hyper-V architecture and can help us understand how the performance of the applications running on guest machines is impacted by virtualization. Figure 3 is a graph showing these three major sources of virtualization overhead in Hyper-V.

Hyper-V overheads

Figure 3. Using the Hyper-V performance counters, you can monitor the rate that intercepts, virtual interrupts, and Hypercalls are handled by the hypervisor and various Hyper-V components.

Intercepts.

Intercepts are the primary mechanism used to maintain a consistent view of the virtual processor that is visible to the guest OS. Privileged instructions and other operations issued by the guest operating system that would be valid if the OS were accessing the native hardware need to be intercepted by the hypervisor and handled in a way that maintains a consistent view of the virtual machine. Intercepts make use of another hardware assist – the virtualization hardware that allows the hypervisor to intercept certain operations. Intercepts include the guest machine OS

  • issuing a CPUID instruction to identify the hardware characteristics
  • accessing machine-specific registers (MSRs)
  • accessing I/O ports directly
  • instructions that, when executed, cause hardware exceptions that must be handled by the OS

When these guest machine operations are detected by the hardware, control is immediately transferred to the hypervisor to resolve. For example, if the guest OS believes it is running on a 2-way machine and issues a CPUID instruction, Hyper-V intercepts that instruction and, through the intercept mechanism, supplies a response that is consistent with the virtual machine image. Similarly, whenever a guest OS issues an instruction to read or update a Control Register (CR) or a Machine-Specific Register (MSR) value, this operation is intercepted, and control is transferred to the parent partition where the behavior the guest OS expects is simulated.

Resolving intercepts in Hyper-V is a cooperative process that involves the Root partition. When a virtual machine starts, the Root partition makes a series of Hypercalls that establish the intercepts it will handle, providing a call back address that the hypervisor uses to signal the Root partition when that particular interception occurs. Based on the virtual machine state maintained in the VM worker process, the Root partition will then simulate the requested operation, and then allow the intercepted instruction to complete its execution.

Hyper-V is instrumented to report the rate that several categories of intercepts occur. Some intercepts occur infrequently, like issuing CPUID instructions, something the OS needs to do rarely. Others like Machine-Specific Register access are apt to occur more frequently, as illustrated in Figure 4, which compares the rate of MSR accesses to the overall intercept rate, summed over all virtual processors for a Hyper-V host machine.

MSR access intercepts per second graph

Figure 4. The rate MSR intercepts are processed, compared to the overall intercept rate (indicated by an overlay line graphed against the secondary, right-hand y-axis).
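
If you want to see exactly which intercept-related counters your Hyper-V version exposes, one option is to enumerate them rather than relying on a hard-coded list. The sketch below assumes the counter category is named “Hyper-V Hypervisor Virtual Processor”, which is how it appears on my systems; it must run in the Root partition:

using System;
using System.Diagnostics;

class HyperVCounterExplorer
{
    static void Main()
    {
        // Category name as it appears on my test systems; verify with
        // "typeperf -q" or Perfmon if your Hyper-V version differs.
        var category = new PerformanceCounterCategory("Hyper-V Hypervisor Virtual Processor");

        foreach (string instance in category.GetInstanceNames())   // one instance per guest virtual processor
        {
            foreach (PerformanceCounter counter in category.GetCounters(instance))
            {
                // List only the intercept-related counters.
                if (counter.CounterName.IndexOf("intercept", StringComparison.OrdinalIgnoreCase) >= 0)
                    Console.WriteLine($"{instance}: {counter.CounterName}");
            }
        }
    }
}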

In order to perform its interception functions, the Root partition’s VM worker process maintains a record of the virtual machine state. This includes keeping track of the virtual machine’s registers each time there is an interrupt, plus maintaining a virtual APIC for interrupt handling, as well as additional virtual hardware interfaces, what some authors describe as a “virtual motherboard” of devices representing the full simulated guest machine hardware environment.

Interrupts.

Guest machines accessing the synthetic disk and network devices that are installed are presented with a virtualized interrupt handling mechanism. Compared to native IO, this virtualized interrupt process adds latency to guest machine disk and network IO requests to synthetic devices. Latency increases because device interrupts need to be processed twice, once in the Root partition, and again in the guest machine. Latency also increases when interrupt processing at the guest machine level is deferred because none of the virtual processors associated with the guest are currently dispatched.

To support guest machine interrupts, Hyper-V builds and continuously maintains a synthetic interrupt controller associated with the guest’s virtual processors. When an external interrupt generated by a hardware device attached to the Host machine occurs because the device has completed a data transfer operation, the interrupt is directed to the Root partition to process. If the device interrupt is found to be associated with a request that originated from a guest machine, the guest’s synthetic interrupt controller is updated to reflect the interrupt status, which triggers action inside the guest machine to respond to the interrupt request. The device drivers loaded on the guest machine are suitably “enlightened” to skip execution of as much redundant logic as possible during this two-phased process.

The first phase of interrupt processing occurs inside the Root partition. When a physical device raises an interrupt that is destined for a guest machine, the Root partition handles the interrupt in the Interrupt Service Routine (ISR) associated with the device immediately in the normal fashion. When the device interrupt is in response to a disk or network IO request from a guest machine, there is a second phase of interrupt processing that occurs associated with the guest partition. The second phase, which is required because the guest machine also must handle the interrupt, increases the latency of every IO interrupt that is not processed directly by the child partition.

An additional complication arises if none of the guest machine’s virtual processors are currently dispatched. If no guest machine virtual processor is executing, then interrupt processing on the guest is deferred until one of its virtual processors is executing. In the meantime, the interrupt is flagged as pending in the state machine maintained by the Root partition. The amount of time that device interrupts are pending also increases the latency associated with synthetic disk and network IO requests initiated by the guest machine.

The increased latency associated with synthetic device interrupt-handling can have a very serious performance impact. It can present a significant obstacle to running disk or network IO-bound workloads as guest machines. The problem is compounded because the added delay and its impact on an application are difficult to quantify. The Logical Disk and Physical Disk\Avg. Disk sec/Transfer counters on the Root partition are not always reliably capable of measuring the disk latency associated with the first phase of interrupt processing because Root partition virtual processors are also subject to deferred interrupt processing and virtualized clocks and timers. The corresponding guest machine Logical Disk and Physical Disk\Avg. Disk sec/Transfer counters are similarly burdened. Unfortunately, after a careful analysis of the data, it is not clear that any of the Windows disk response time measurements are valid under Hyper-V, even for disk devices that are natively attached to the guest partition.

The TCP/IP networking stack, as we have seen in our earlier look at NUMA architectures, has a well-deserved reputation for requiring execution of a significant number of CPU instructions to process network IO. Consequently, guest machines that handle a large amount of network traffic are subject to this performance impact when running virtualized. The guest machine synthetic network driver enlightenment helps considerably with this problem, as do NICs featuring TCP offload capabilities. Network devices that can be attached to the guest machine in native mode are particularly effective performance options in such cases.

In general, over-provisioning processor resources on the VM Host is an effective mitigation strategy to limit the amount and duration of deferred interrupt processing delays that occur for both disk and network IO. Disk and network hardware that can be directly attached to the guest machine is certainly another good alternative. Interrupt processing for disk and network hardware that is directly attached to the guest is a simpler, one-phase process, but one that is also subject to pending interrupts whenever the guest’s virtual processors are themselves delayed. The additional latency associated with disk and network IO is one of the best reasons to run a Windows machine in native mode.

VMBus

Guest machine interrupt handling relies on an inter-partition communications channel called the VMBus, which makes use of the Hypercall capability that allows one partition to signal another partition and send messages. (Note that since child partitions have no knowledge of other child partitions, this Hypercall signaling capability is effectively limited to use by the child partition and its parent, the Root partition.) Figure 5 illustrates the path taken when a child partition initiates a disk or network IO to a synthetic disk or network device installed in the guest machine OS. IOs to synthetic devices are processed by the guest machine device driver, which is enlightened, as discussed above. The synthetic device driver passes the IO request to another Hyper-V component installed inside the guest called a Virtualization Service Client (VSC). The VSC inside the guest machine translates the IO request into a message that is put on the VMBus.

The VMBus is the mechanism used for passing messages between a child partition and its parent, the Root partition. Its main function is to provide a high bandwidth, low latency path for the guest machine to issue IO requests and receive replies. According to Mark Russinovich, writing in Windows Internals, one message-passing protocol the VMBus uses is a ring of buffers shared by the child and parent partitions: “essentially an area of memory in which a certain amount of data is loaded on one side and unloaded on the other side.” Russinovich’s book continues, “No memory needs to be allocated or freed because the buffer is continuously reused and simply rotated.” This mechanism is good for message passing between the partitions, but it is too slow for large data transfers due to the necessity of copying data to and from the message buffers.
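
To make the ring-of-buffers idea concrete, here is a purely conceptual C# sketch of the pattern Russinovich describes: a fixed set of slots is reused continuously, with one side advancing a write cursor and the other side advancing a read cursor, so no memory is allocated or freed per message. This is an illustration only, not the actual VMBus implementation, and synchronization between the two sides is omitted for brevity:

using System;

// Conceptual single-producer/single-consumer ring of message slots,
// illustrating the "load on one side, unload on the other" pattern.
class MessageRing
{
    private readonly byte[][] _slots;
    private int _head;   // next slot the producer writes
    private int _tail;   // next slot the consumer reads
    private int _count;  // slots currently in use

    public MessageRing(int slotCount, int slotSize)
    {
        _slots = new byte[slotCount][];
        for (int i = 0; i < slotCount; i++) _slots[i] = new byte[slotSize];
    }

    // Producer side: copy a message into the next free slot, then rotate the cursor.
    public bool TryPost(byte[] message)
    {
        if (_count == _slots.Length || message.Length > _slots[_head].Length) return false;
        Array.Copy(message, _slots[_head], message.Length);
        _head = (_head + 1) % _slots.Length;
        _count++;
        return true;
    }

    // Consumer side: hand back the oldest slot, then rotate the cursor.
    public bool TryReceive(out byte[] slot)
    {
        if (_count == 0) { slot = null; return false; }
        slot = _slots[_tail];
        _tail = (_tail + 1) % _slots.Length;
        _count--;
        return true;
    }
}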

Another VMBus messaging protocol uses child memory that is mapped directly to the parent partition address space. This direct memory access VMBus mechanism allows disk and network devices managed by the Root partition to reference buffers allocated in a child partition. This is the technique Hyper-V uses to perform bulk data IO operations for synthetic disk and network devices. For the purpose of issuing IO requests to native devices, the Root partition is allowed to access machine memory addresses directly. In addition, it can request the hypervisor to translate guest machine virtual addresses allocated for use as VMBus IO buffers into machine addresses that can be referenced by the physical devices supporting DMA that are attached to the Root.

Inside the Root partition, Hyper-V components known as Virtualization Service Providers (VSPs) receive the IO requests that guest machines issue to synthetic devices and translate them into physical disk and network IO requests. Consider, for example, a guest partition request to read or write a .vhdx file, which the VSP must translate into a disk IO request to the native file system on the Root. These translated requests are then passed to the native disk IO driver or the networking stack installed inside the Root partition that manages the physical devices. The VSPs also interface with the VM worker process that is responsible for the state machine that represents the virtualized physical hardware presented to the guest OS. Using this mechanism, interrupts for guest machine synthetic devices can be delivered properly to the appropriate guest machine.

When the native device completes the IO operation requested, it raises an interrupt that the Root partition handles normally. This process is depicted in Figure 5. When the request corresponds to one issued by a guest machine, what is different under Hyper-V is that a waiting thread provided by the VSP and associated with that native device is then awakened by the device driver. The VSP also ensures that the device response adheres to the form that the synthetic device driver on the guest machine expects. It then uses the VMBus inter-partition messaging mechanism to signal the guest machine that it has an interrupt pending.

HyperV interrupt processing

Figure 5. Synthetic interrupt processing involves the Virtualization Service Provider (VSP) associated with the device driver invoked to process the interrupt. Data acquired from the device is transferred directly into guest machine memory using a VMBus communication mechanism, where it is processed by the Virtualization Service Client (VSC) associated with the synthetic device.

From a performance monitoring perspective, the Hyper-V hypervisor reports on the overall rate of virtual interrupt processing, as illustrated in Figure 6. The hypervisor, however, has no understanding of which hardware device is associated with each virtual interrupt. It can report the number of deferred virtual interrupts, but it does not report the amount of pending interrupt delay, which can be considerable. The measurement components associated with disk and network IO in the Root partition function normally, with the caveat that the disk and network IO requests counted by the Root partition aggregate all the requests from both the Root and child partitions. Windows performance counters inside the guest machine continue to provide an accurate count of disk and network IO and the number of bytes transferred for that partition. The guest machine counters are useful for identifying which guest partitions are responsible for the overload when the Root’s physical disks or network interface cards are saturated. Later on, we will review some examples that illustrate how all these performance counters function under Hyper-V.

Hyper-V virtual interrupts chart

Figure 6. Virtual interrupt processing per guest machine virtual processor. The rate of pending interrupts is displayed as a dotted line plotted against the secondary y-axis. In this example, approximately half of all virtual interrupts are subject to deferred interrupt processing delays.
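
As a practical note, the guest-level disk and network counters mentioned above are ordinary Windows counters and can be sampled inside the guest machine with a few lines of code. This is a minimal sketch using the standard LogicalDisk and Network Interface counter sets; the one-minute sampling interval is arbitrary:

using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;

class GuestIoSampler
{
    static void Main()
    {
        // Standard Windows counters sampled inside the guest; they count only this
        // partition's traffic, which helps identify which guest machine is driving
        // the load when the Root's physical disks or NICs are saturated.
        var diskBytes = new PerformanceCounter("LogicalDisk", "Disk Bytes/sec", "_Total");
        var diskOps   = new PerformanceCounter("LogicalDisk", "Disk Transfers/sec", "_Total");

        // Network Interface has one instance per adapter, so enumerate them.
        var nics = new PerformanceCounterCategory("Network Interface")
            .GetInstanceNames()
            .Select(n => new PerformanceCounter("Network Interface", "Bytes Total/sec", n))
            .ToArray();

        diskBytes.NextValue(); diskOps.NextValue();   // prime the rate counters
        foreach (var nic in nics) nic.NextValue();

        while (true)
        {
            Thread.Sleep(60_000);
            double netBytesPerSec = nics.Sum(n => n.NextValue());
            Console.WriteLine(
                $"Disk: {diskOps.NextValue():F0} IO/sec, {diskBytes.NextValue():F0} bytes/sec  " +
                $"Network: {netBytesPerSec:F0} bytes/sec");
        }
    }
}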


Hypercalls.

The Hypercall interface provides a calling mechanism that allows child partitions to communicate with the Root partition and the hypervisor. Some of the Hypercalls support the guest OS enlightenments mentioned earlier. Others are used by the Root partition to communicate requests to the hypervisor to configure, start, modify, and stop child partitions. There is another set of Hypercalls used in dynamic memory management, which is discussed below. Hypercalls are also defined to enable the hypervisor to log events and post performance counter data back to the Root partition, where it can be gathered by Perfmon and other similar tools.

Hypercalls per second graph

Figure 7. Monitoring the rate Hypercalls are being processed.
