Using QueryThreadCycleTime to access CPU execution timing

As a prelude to a discussion of the Scenario instrumentation library, I mentioned in the previous post that a good understanding of the clocks and timers available to the Windows developer would likely prove helpful. Timekeeping in Windows is an area fraught with confusion, and I confess it was not something I understood well enough at the time I wrote either of my Windows performance books.

The previous post in this series focused on the Windows QueryPerformanceCounter API, which is a wrapper around the hardware rdtsc (Read Time-Stamp Counter) instruction. QPC() provides access to a high-resolution timer that is frequently used in performance monitoring, and it is useful in a variety of other measurement contexts as well. That led to a brief discussion of the vagaries involved in using the hardware rdtsc instruction itself. Windows 7 attempts to utilize rdtsc where it is appropriate, based on repeatedly measuring the execution time of a routine during system start-up to determine whether the execution times are reliably stable across power management events. On older PC hardware, power management changes that slowed the machine execution cycle time were reflected in the frequency of TSC ticks. On current Intel hardware, however, the frequency of TSC ticks is constant across power management events, one of which is dynamic over-clocking, something Intel brands as its Turbo Boost technology.

If all this talk of hardware clocks and timing instructions strikes you as a bit esoteric or unnecessarily complicated, I'd like to assure you that it is actually a fundamental topic in computer performance. In Dr. Neil Gunther's excellent introduction to computer performance modeling, The Practical Performance Analyst, one of the early foundation chapters discusses at length the importance of time measurements in the field. (For reference, Neil's often entertaining computer capacity planning blog is here.)

Before stumbling into the field of computer capacity planning, Neil trained as an astrophysicist, so an interest in the subject of time comes naturally to him. The fundamental insight of Einstein's famous Special Theory of Relativity was that time measurements are necessarily relative to the position of the observer. Relativity placed the results of the famous Michelson-Morley experiment, published in 1887, in a radically new context that revolutionized 20th century physics. The Michelson-Morley experiment showed that measurements of the speed of light were invariant relative to the motion of the earth through space, a surprising result that the physics of the day did not predict. Crucial to interpreting the experimental data was the ingenious instrumentation rig Michelson and Morley constructed, the Michelson interferometer, which measured the speed of light at an unprecedented level of accuracy and precision.

To make the connection to computer performance a little less obscure: the lifetime achievement award granted by the Computer Measurement Group (of which Dr. Gunther is a recent recipient) is named in honor of Professor Michelson.

It is interesting to note that a related time measurement problem confounding physics in the late 19th century was clock synchronization across large geographical distances, a problem that arose in the context of coordinating long-distance rail travel. After trying to synchronize the clocks in the French railway system using an electrical signaling apparatus, the famous French physicist and mathematician Henri Poincare was reluctantly led to conclude that the naive conception of absolute simultaneity, two events occurring in different places at the same time, was worthless, given that the speed of light itself is finite.

(The conceptual link between these two time measurement problems in 19th century physics is the subject of a recent popular science book entitled “Einstein’s Clocks, Poincare’s Maps.” That interesting book has quite possibly engendered the longest and most convoluted book review I have ever encountered on the Amazon web site, from John Ryskamp, evidently an amateur mathematician. So if you crave even more diversion than I can supply directly (after all, if you are reading this, you are probably just wasting time grazing on the Internet to begin with), the Ryskamp book review is here, and to understand the intellectual baggage Ryskamp carries to his reading of the otherwise unassuming Einstein/Poincare book, try this article of his that he references. Fortunately for me (time-wasting-wise, that is), Ryskamp is not a more prolific Amazon reviewer. He is basically Charles Kinbote manifest in the flesh, and you will have to work out that obscure reference on your own.)

And, finally, to bring this digression full circle back to computer measurement: the problem of synchronizing clocks across large distances reasserted itself in the early days of the development of the Internet Protocol (IP) for use in Wide Area Networking (WAN). The original IP header (version 4) Time-to-Live field had to be reinterpreted as a hop count when the search for a way to synchronize clocks across IP routers arrayed around the world proved futile (once again!). The Transmission Control Protocol (TCP) that sits atop IP in the network stack resorts instead to measuring the Round Trip Time (RTT) of Send requests relative to the sender. RTT, of course, is an important measurement used in TCP to anchor its network congestion avoidance strategies.

And now back to Windows clocks and timers…

Beginning in version 6 of Windows (which refers to both Vista and Windows Server 2008), there is a new, event-driven mechanism for measuring processor utilization at the thread level. This measurement facility relies on the OS Scheduler issuing an rdtsc instruction at the beginning and end of each thread execution dispatch. By accumulating these CPU time measurements at the thread and process level each time a context switch occurs, the OS can maintain an accurate running total of the amount of time on the processor an executing thread consumes. Application programs can then access these accumulated CPU time measurements by calling a new API, QueryThreadCycleTime(), and passing it a thread handle.
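As a minimal sketch of the call itself, here is how a thread might retrieve its own accumulated cycle count (error handling trimmed for brevity):

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    ULONG64 cycles = 0;

    // Ask the OS Scheduler's accounting for the number of CPU cycles
    // this thread has been charged with, accumulated at context switches.
    if (QueryThreadCycleTime(GetCurrentThread(), &cycles)) {
        printf("Cycles charged to this thread so far: %llu\n", cycles);
    }
    return 0;
}
```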

QueryThreadCycleTime(), or QTCT, provides measurements of CPU time gathered by issuing an rdtsc instruction each time a context switch occurs. Using the same mechanism, the OS Scheduler also keeps track of processor idle time, which can be retrieved by calling either QueryIdleProcessorCycleTime() or QueryIdleProcessorCycleTimeEx(), depending on whether multiple processor groups are defined in Windows 7. Overall CPU utilization for any given interval is then calculated by subtracting Idle time from the interval duration:

CPU % Busy = (Interval Duration - Idle Time) * 100 / Interval Duration
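Here is a sketch of that calculation built on the idle cycle counters, assuming a single processor group. Note that the 2.5 GHz tick rate used to convert idle cycles into seconds is a placeholder of mine, not something the API supplies; the actual rate has to be determined for the machine at hand:

```cpp
#include <windows.h>
#include <stdio.h>
#include <vector>

int main()
{
    // One 64-bit idle cycle counter per logical processor.
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    std::vector<ULONG64> before(si.dwNumberOfProcessors);
    std::vector<ULONG64> after(si.dwNumberOfProcessors);
    ULONG bytes = (ULONG)(before.size() * sizeof(ULONG64));

    QueryIdleProcessorCycleTime(&bytes, before.data());
    Sleep(1000);                              // measurement interval: 1 second
    QueryIdleProcessorCycleTime(&bytes, after.data());

    ULONG64 idleCycles = 0;
    for (size_t i = 0; i < before.size(); ++i)
        idleCycles += after[i] - before[i];

    // ASSUMPTION: a nominal 2.5 GHz tick rate to convert cycles to time;
    // on real hardware the actual rate must be measured or looked up.
    const double cyclesPerSecond = 2.5e9;
    double idleSeconds = idleCycles / cyclesPerSecond;

    // Total CPU-seconds available across all processors in the interval.
    double intervalSeconds = 1.0 * si.dwNumberOfProcessors;
    double busyPct = (intervalSeconds - idleSeconds) * 100.0 / intervalSeconds;
    printf("CPU %% Busy over the interval: %.1f\n", busyPct);
    return 0;
}
```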

With QTCT, CPU time is reported in units of clock ticks, whose length is model-dependent. So you also need to call QueryPerformanceFrequency() to obtain the clock frequency in order to transform processor cycles into wall clock time.
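A sketch of that conversion follows. One caveat worth flagging: QueryPerformanceFrequency() reports the QPC tick rate, which corresponds to the TSC tick rate only on hardware where QPC is itself backed by rdtsc, so treat the result as an approximation:

```cpp
#include <windows.h>
#include <stdio.h>

int main()
{
    ULONG64 cycles = 0;
    LARGE_INTEGER freq;

    QueryThreadCycleTime(GetCurrentThread(), &cycles);
    QueryPerformanceFrequency(&freq);   // QPC ticks per second

    // Treat the QPC frequency as the cycle tick rate, per the suggestion
    // above; this only holds where QPC is TSC-backed on this hardware.
    double cpuMs = (double)cycles * 1000.0 / (double)freq.QuadPart;
    printf("Approximate CPU time consumed: %.3f ms\n", cpuMs);
    return 0;
}
```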

The potential significance of having the Windows OS thread scheduler instrumented can hardly be overstated, but using the rdtsc instruction isn't quite as straightforward as computer measurement people would like. The OS Scheduler, at least, handles some of the vagaries automatically. Potential clock drift across processors isn't a problem, for instance, because the CSwitch event that signals the beginning of a thread execution time interval and the CSwitch event that terminates it both occur on the same (logical) CPU.

However, if you are running on an older Intel processor that does not have a constant tick rate across power management states, a thread's cycle time is potentially a very difficult number to interpret. That is because on older CPUs, whenever a power state change alters the clock frequency, the frequency of the associated Time Stamp Counter (the TSC register) is adjusted in tandem. When a power management change does occur, the OS Scheduler does not attempt to adjust for the change in the length of a clock tick. That means that on one of these machines, it is possible for QTCT() to return accumulated clock ticks for an interval in which one or more p-state changes have occurred, such that clock ticks of different lengths are aggregated together. Obviously, this creates a problem in interpretation, but only to the extent that power management changes are actually occurring during thread execution, and it is a problem confined to older hardware.

Given that set of concerns with an rdtsc-based measurement mechanism, QTCT() remains a major step forward in measuring CPU usage in Windows. Instrumenting the OS Scheduler directly to measure processor usage is the first step towards replacing the legacy sampling technique I discussed at the outset of this series of blog articles. It has all the advantages of accuracy and precision that accrue to an event-oriented methodology. Plus, the OS Scheduler issuing an rdtsc instruction in-line during a context switch is much more efficient than generating an ETW event that must be post-processed later, which is the approach xperf takes.

As of Windows 7, the OS Scheduler measurements are only exposed through QTCT at the thread level. I suspect that the rdtsc anomalies due to the variable clock rate on older machines are one factor holding up wider adoption, while the scope of retrofitting all the services in and around the OS that currently rely on the sampling-based processor utilization data is probably another. I am curious to see whether Windows 8 takes some of the obvious next steps to improve the CPU utilization measurements.

The QTCT() API that gives access to these timings at the thread level does have one other serious design limitation that I ran into. QTCT currently returns the number of processor cycles a thread has consumed only up to the last time a context switch occurred. There is no method that allows a running thread to get an up-to-date, point-in-time measurement that includes the cycle time accumulated up to the present moment, not just to the last context switch. A serializing method along those lines would make QTCT suitable for explicitly accounting for CPU usage at the thread level between two phases in the execution of a Windows program, the bracketing pattern sketched below.
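To make the pattern concrete, here is a sketch of bracketing one phase of execution with QPC and QTCT together. This is not the Scenario library's actual implementation; PhaseTimer is a hypothetical helper of mine, and the cycle delta it reports understates CPU usage for exactly the reason just described:

```cpp
#include <windows.h>
#include <stdio.h>

// Hypothetical marker pair bracketing one phase of an application,
// in the spirit of what an instrumentation library might do.
struct PhaseTimer {
    LARGE_INTEGER qpcStart;
    ULONG64 cyclesStart;

    void Begin() {
        QueryPerformanceCounter(&qpcStart);
        QueryThreadCycleTime(GetCurrentThread(), &cyclesStart);
    }

    void End() {
        ULONG64 cyclesEnd;
        LARGE_INTEGER qpcEnd, freq;
        QueryThreadCycleTime(GetCurrentThread(), &cyclesEnd);
        QueryPerformanceCounter(&qpcEnd);
        QueryPerformanceFrequency(&freq);

        double elapsedMs =
            (qpcEnd.QuadPart - qpcStart.QuadPart) * 1000.0 / freq.QuadPart;
        // Caveat: the cycle delta reflects usage only up to the most
        // recent context switch, so it understates CPU time for a phase
        // that ends mid-dispatch.
        printf("Elapsed: %.3f ms, cycles consumed: %llu\n",
               elapsedMs, cyclesEnd - cyclesStart);
    }
};

int main() {
    PhaseTimer t;
    t.Begin();
    Sleep(50);   // stand-in for the application phase being measured
    t.End();
    return 0;
}
```

Next: the Scenario class uses QPC and QTCT together to try to calculate both the elapsed time and the CPU time consumed between two markers strategically inserted into your Windows application. In the next blog entry in this series, I will discuss the Scenario instrumentation library in more detail.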
