Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison.
Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed in the previous post, there are few professional-grade, application response time monitoring and profiling tools that exploit the ETW facilities in Windows. As we have seen illustrated, the Concurrency Visualizer demonstrates the capability to reconstruct an application’s execution time profile from kernel trace events with great precision and detail. An application response time monitoring tool that generates ETW response time events that can then be correlated with the Windows OS thread scheduling events has potential value in performance investigations of many sorts.
Fortunately, there is such an application response time monitoring tool for both Windows C++ native and .NET applications that is integrated with ETW. It is called the Scenario class. (The name “Scenario” was suggested by a Habib Heydarian, an excellent Program Manager at Microsoft, and is a nod to the popular practice of scenario-based development, associated primarily with Agile software development methodologies. See, for example, Scenario-Based Design of Human-Computer Interactions, by John Carroll and User Stories Applied: For Agile Software Development by Mike Cohen.)
The Scenario instrumentation library is a free download available from the MSDN Code Gallery site here.
The Scenario class is an ETW wrapper built around an extended version of the .NET Stopwatch class. The standard .NET Stopwatch class in the System.Diagnostics Namespace is used to measure elapsed time in a .NET Framework application. The .NET Stopwatch class itself is a managed wrapper around the native Windows APIs called QueryPerformanceFrequency and QueryPerformanceCounter (QPC) that access a high resolution timer and are used to calculate elapsed time. A straightforward extension of the .NET Stopwatch class adds a call to QueryThreadCycleTime (QTCT), a Windows API that provides a measure of CPU usage at the thread level, beginning in Window version 6.
Prior to discussing the use of the application-oriented Scenario instrumentation library, however, we should first take a deeper look at the Windows APIs it utilizes, namely QueryPerformanceCounter (QPC), QueryPerformanceFrequency, and the newer QueryThreadCycleTime (QTCT). Using the Scenario library properly and interpreting the measurement data it produces will benefit from a deeper understanding of how these Windows APIs work.
The QueryPerformanceCounter and QueryPerformanceFrequency APIs were added to Windows beginning with Windows 2000 when the Performance Monitor developers noticed that the granularity of the Windows system clock was inadequate for measuring disk IO response time. As discussed in the first post in this series, the Windows system time of day clock contains time values that display precision to 100 nanosecond timer units. However, the current time of day clock value maintained by the OS is only updated once per quantum, effectively about once every 15.6 milliseconds in current versions of Windows. Using values from the system clock to time disk IOs in early versions of Windows NT, the Logical and Physical DiskAvg Disk sec/Transfer counters in Perfmon often reported zero values whenever an IO operation started and ran to completion within a single tick of the system clock.
The solution in Windows 2000 was the addition of the QueryPerformanceCounter (QPC) and QueryPerformanceFrequency APIs. In Windows 2000, the QueryPerformanceCounter API returned a value from an rdtsc (Read TimeStampCounter) instruction. The TSC is a special hardware performance monitoring counter on Intel-compatible processors that is incremented once per processor clock cycle. On a processor clocked at 2 GHz, for example, one expects two TSC clock cycle ticks to occur every nanosecond. (As will be discussed further, elapsed time measurements that are based on successive rdtsc instructions are far less precise than the processor’s actual instruction execution cycle time. See, for example, this FAQ from the Intel Software Network for an official explanation from the CPU manufacturer.)
Issuing an rdtsc instruction on a processor clocked at 2 GHz returns a clock value considerably more precise than standard Windows timer values delineated in 100 nanosecond units. Since the processor clock speed is hardware-dependent, an additional API call, QueryPerformanceFrequency, was provided to supply the processor clock speed. Once you know the clock frequency, you can translate the output from successive rdtsc instructions into elapsed wall clock time.
When the QueryPerformanceCounter and QueryPerformanceFrequency APIs became generally available in Windows, they were rapidly adopted by other applications in need of a more precise timer facility than the Windows system clock. However, developers using QPC() soon began to notice discrepancies in time measurements taken using the rdtsc instruction due to the way in which the TSC was implemented in the hardware. There were two major discrepancies that were encountered using the rdtsc instruction on Intel-compatible processors, namely:
- lack of synchronization of the TSC across processors, and
- dynamic changes to the TSC clock update interval as a result of the processor entering a lower power state, slowing both the clock rate and the TSC update interval in tandem.
The effect of these TSC anomalies was quickly evident in the disk driver routine responsible for timing the duration of disk operations when running on a multiprocessor, which was especially ironic since QPC was originally built in order to measure disk IO operations accurately. It was possible on a multi-processor for a disk IO that was initiated on CPU 0 and completed on CPU 1 to retrieve a TSC value from CPU 1 for the IO completion that was before the TSC value retrieved when the IO operation was initiated on CPU 0. (The flavor of the difficulties in dealing with rdtsc instructions across multiple CPUs on multiprocessors is captured in this blog entry written by Microsoft SQL Server customer support representatives.)
Of lesser concern, the latency of an rdtsc instruction was surprisably larger than expected for a hardware instruction, on the order of several hundred clock cycles on older Intel hardware. That, and the fact that the rdtsc instruction does not serialize the machine, made QPC() unsuitable for accurately timing, say, micro-benchmarks of less than several thousand instructions.
To deal with the drift in TSC values across multiple processors, in Windows 6 (Vista and Windows Server 2008), the Windows QueryPerformanceCounter function was changed to use one of several external, high resolution timer facilities that are usually available on the machine. These include the High Precision Event Timer (HPET), often also referred to as the Multimedia Timer in Windows, and the external ACPI Power Management Timer, another high resolution timer facility that is independent of the main processor hardware. Because these timer facilities are external to the processors, they are capable of supplying uniform values that are consistent across CPUs. (At the same time, QueryPerformanceFrequency was also re-written to return the frequency of the external timer source.) This change effectively fixed the problems associated with accurate measurements of disk IO response time that were evident in Windows 2000 and Windows XP.
However, using an external timer in the QPC implementation does have one major disadvantage, namely addional latency. If you wrap rdtsc instructions around QPC() calls in Windows Vista, you can typically measure latency on the order of 800 nanoseconds to call the QPC API, or roughly 1 m-second per call. This latency is particularly problematic given how frequently QPC is called. In ETW tracing, for instance, QPC is the default timer routine that is used to generate the event timestamps. When gathering high volume events such as the CSwitch, ReadyThread, ISR and DPC, using QPC for timestamps in ETW generates significant overhead. If one is expecting, say, 20,000 ETW events to be generated per second per processor on a Vista or Windows Server 2008 machine, calling QPC that frequently adds about 2% additional CPU utilization per processor.
(Note: The clock facility used by default in ETW is QPC. Alternatively, one can specify either the low resolution system timer (an option that should only be used in rare instances where the low resolution 15.6 ms clock ticks suffice), or that an rdtsc instruction be issued directly. The rdtsc option is termed the “CPU cycle counter” clock resolution option. See this ETW documentation entry on the MSDN Library for details.)
For the record, ETW is hardly the only application that routinely relies on QPC-based measurements of elapsed time. When the ETW tracing infrastructure is used to gather OS kernel scheduling events, it is probably the most frequent caller of the API on the machine. Other frequent callers of QPC include the disk driver routines mentioned earlier that measure disk IO response time – reported in Perfmon as the Avg. Disk Secs/Transfer counters. These were re-written in Windows 2000 to use QPC. The TCP protocol, which needs to estimate the Round Trip Time (RTT) of packets sent to remote TCP sessions, utilizes QPC for high resolution timing, also. As mentioned earlier, the .NET Framework Stopwatch class allows an application to issue calls to QPC and QPF as an alternative to using the low resolution DateTime.Now() method that access the Windows system clock.
The hardware manufacturers were meanwhile at work making improvements in the TSC hardware to allow it to serve as an efficient and accurate elapsed time counter. Changing the behavior of the TSC when there was a power state change that adjusted the processor’s clock cycle time was the first fix. Newer machines now contain a TSC with a constant tick rate across power management states. (Which is what I like to call it, instead of what Intel officially calls an invariant TSC, terminology I find a little awkward). For details, see the Intel hardware manual entitled Intel® 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1. Section 16.9 of this manual discusses the Time-Stamp Counter on Intel hardware and its processor-dependent behavior. The manual reports, “Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.” [Emphasis added.]
The second problem related to the TSC clocks not being synchronized across processors still exists, but the TSCs for all the processor cores resident on a single multi-core socket do run off the same underlying hardware source, at least. Clock drift across processor cores remains an issue on multi-socket NUMA hardware, but built-in NUMA node thread and interrupt affinity in Windows minimizes some of these concerns, while not eliminating them completely.
Finally, the hardware vendors also report that the latency of the rdtsc instruction has improved significantly in current hardware. Reportedly, the latency of an rdtsc instruction has improved by an order of magnitude on current hardware. Unfortunately, the hardware vendors are reluctant to disclose the specific latency of their rdtsc instructions. The rdtsc instruction still does not serialize the processor instruction execution engine, so rdtsc continues to return timer values that are subject to some amount of processor core “jitter,” imprecision about instruction execution latency simply is a fact of like in superscalar, pipelined machines. The need to maintain a TSC with a constant tick rate across power state changes also results in some loss of precision in rdtsc return values, which is really no big deal since it affects just one or two of the least significant clock resolution timer bits.
The long latency associated with accessing an external clock facility combined, with the rdtsc hardware improvements described above, prompted another round of changes in QueryPerformanceCounter for Windows 7 (and Windows Server 2008 R2). During system initialization, Windows 7 attempts to figure out if the current hardware supports a TSC tick rate that is constant across power state changes. When Windows 7 determines that the processor’s TSC tick rate is constant, the QPC routine is set to issue rdtsc instructions. If it appears that the TSC is not invariant across processor core frequency changes, then QPC will be resolved as in Windows 6 by calling the machine’s external timer. In this fashion, QPC in Windows 7 automatically provides a well-behaved, high resolution hardware clock timer service that uses the low latency rdtsc instruction whenever it is well-behaved.
This ability in Windows 7 to resolve the QPC timer service dynamically based on the current hardware is the reason Windows application developers are advised to stay away from using rdtsc – unless you absolutely know what you are doing – and to use QPC instead. Coding an inline rdtsc instruction is still going to be faster than calling QPC to access the TSC, but I hope I have made clear that using rdtsc directly is not for the faint of heart. One especially has to be aware of older portables in my experience. In newer machines, however, my experience using rdtsc is very positive.
Another good reason to avoid issuing rdtsc instructions in your app is the portability of your app onto ARM machines going forward in Windows 8. On ARM-based tablets and phone, there are a slew of hardware dependencies that we are going to have to start looking out for — which is a good topic for another set of blog posts!
Next time, the QueryThreadCycleTime API introduced in Windows 6 that is based on instrumenting the OS Scheduler using rdtsc instructions..