The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):

A Reader writes:

My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.

Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to get the CPU cycles. Hence, if someone mistakenly commits some change that slows it down, we can catch it. It worked fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance, we can use the other cores that are idle while one core is executing the sequential code. We found the timing profile all messed up. The time difference can be as much as 5x for parts of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.

Then I wrote this small bit of code (at the end of this email) to test the QueryPerformanceCounter(), GetProcessTimes(), and QueryProcessCycleTime() functions. If I run the code solo (just one instance), we get pretty consistent numbers.

However, if I run 10 instances on my quad-core machine with HyperThreading enabled (8 logical processors), all three methods (QueryPerformanceCounter(), GetProcessTimes() and QueryProcessCycleTime()) report random numbers.

[Table: the reader's measurements by # of Concurrent Processes, showing QueryPerformanceCounter time and GetProcessTimes (User); the data itself is not reproduced here.]
Can you please help me understand what is going on here?


   — Dazed and confused.

Note that the QPC and GetProcessTimes values are reported in seconds. The QueryProcessCycleTime (QPCT) values are in CPU cycles, which are model-dependent. You can use the QueryProcessorFrequency() API call to get the processor speed in cycles/second, or check the ~MHz field in the Registry, as illustrated in Figure 1. In my experience, that will report a number similar to QPF().
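The arithmetic for converting a cycle count into seconds is straightforward. Here is a sketch in Python with made-up numbers (a 2 GHz processor and a hypothetical QPCT reading; neither value comes from the reader's machine):

```python
# Convert a cycle count (as returned by QueryProcessCycleTime-style APIs)
# into seconds, given the processor frequency in cycles/second, e.g. taken
# from the ~MHz field in the Registry. All numbers here are hypothetical.
def cycles_to_seconds(cycles, cycles_per_second):
    return cycles / cycles_per_second

freq = 2_000_000_000           # a 2 GHz processor: 2e9 cycles/second
cycles = 42_000_000_000        # a hypothetical QPCT reading
print(cycles_to_seconds(cycles, freq))   # → 21.0 seconds
```

Remember that the frequency is model-dependent, so the same cycle count means different amounts of wall clock time on different machines.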


Figure 1. The ~MHz field in the Registry will report the processor speed in cycles/second.


Swallowing deeply, I (humbly) replied:


Dear Dazed,

I see that you are confused, but there is a very good explanation for what you are seeing.

If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.

You are correct: QueryPerformanceCounter() counts wall clock time, not cycles. Any time the thread is switched out between successive calls to QPC() is included in the timing, which is why you obtained some unexpected results. You can expect thread preemption to occur when you have 10 long-running threads Ready to run on your 8-way machine.

But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects – one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs:

  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run 8 concurrent processes
  5. then, keep increasing the # of concurrent processes and see if a pattern emerges.
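The shape of that experiment can be sketched generically. The following toy Python harness (not the reader's program, and Python's clocks rather than the Windows APIs) runs N copies of a CPU-bound task concurrently and reports per-task wall-clock time versus per-task CPU time:

```python
# Toy version of the experiment: run N copies of a CPU-bound task
# concurrently and compare each task's wall-clock time (perf_counter)
# with its CPU time (process_time). This is a generic sketch only.
import multiprocessing as mp
import time

def cpu_bound_task(_):
    start_wall = time.perf_counter()
    start_cpu = time.process_time()
    total = 0
    for i in range(500_000):            # busy loop standing in for real work
        total += i * i
    return (time.perf_counter() - start_wall,
            time.process_time() - start_cpu)

if __name__ == "__main__":
    for n in (1, 2, 4):
        with mp.Pool(processes=n) as pool:
            results = pool.map(cpu_bound_task, range(n))
        avg_wall = sum(r[0] for r in results) / n
        avg_cpu = sum(r[1] for r in results) / n
        print(f"{n} processes: avg wall={avg_wall:.4f}s avg cpu={avg_cpu:.4f}s")
```

On a machine with spare cores, wall time and CPU time should track each other until the runnable tasks outnumber the logical processors, at which point wall time stretches while CPU time stays roughly flat.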

Using GetProcessTimes is not very precise (for all the reasons discussed on my blog, it is based on sampling), but for this test of a long-running process (~21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck: as it happens, the program you are testing does not incur any significant IO wait time. My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Better still, instrument the code using the Scenario class library that we put up on Code Gallery some time back; it calls both QPC and QTCT in-line during thread execution.

I personally don’t have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works could make it problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.

The Mystery Solved.

Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, and more processes in parallel, up to 32 concurrent processes on this machine. Figure 2 summarizes the results he obtained:


Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).

(Note that I truncated the results at 24 concurrent processes in this graphic to focus on the left side of the chart. Beyond 24 processes, both line graphs continue along the same pattern indicated: CPU time remains flat and elapsed time continues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock time) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.
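The Begin/End measurement pattern can be illustrated with a rough analogue in Python, using time.perf_counter as a stand-in for QPC and time.thread_time as a loose stand-in for QTCT. This is a sketch of the pattern only, not the Scenario library itself, and the class name is invented:

```python
# A rough analogue of Scenario.Begin()/Scenario.End(): pair a wall-clock
# reading (perf_counter, standing in for QPC) with a per-thread CPU time
# reading (thread_time, loosely standing in for QTCT) at each marker.
import time

class ScenarioSketch:
    def begin(self):
        self._wall0 = time.perf_counter()   # QPC analogue
        self._cpu0 = time.thread_time()     # QTCT analogue

    def end(self):
        # Elapsed = wall-clock delta; ElapsedCpu = thread CPU-time delta
        self.elapsed = time.perf_counter() - self._wall0
        self.elapsed_cpu = time.thread_time() - self._cpu0

s = ScenarioSketch()
s.begin()
_ = sum(i * i for i in range(100_000))       # the instrumented section
s.end()
print(f"elapsed={s.elapsed:.4f}s cpu={s.elapsed_cpu:.4f}s")
```

For a CPU-bound section running undisturbed, the two deltas should be close; any preemption between the markers shows up only in the elapsed (wall clock) figure.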

Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.

We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.

For more than 8 processes, the CPU time curve flattens out, while the elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than there are logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result: the queuing delays aren't growing exponentially, and there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.
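A first-order model captures this pattern: each process needs a fixed amount of CPU time, and once the runnable processes outnumber the logical processors, elapsed time stretches by the over-commit ratio. The numbers below are purely illustrative, not the reader's measurements:

```python
# First-order queuing model of the observed scaling: below saturation,
# elapsed time equals CPU time; above it, elapsed time grows linearly
# with the over-commit ratio. Illustrative numbers only.
def modeled_elapsed(n_processes, logical_cpus, cpu_seconds):
    overcommit = max(1.0, n_processes / logical_cpus)
    return cpu_seconds * overcommit

LOGICAL_CPUS = 8       # quad-core with HT, as in the reader's machine
CPU_SECONDS = 30.0     # hypothetical per-process CPU time at saturation
for n in (8, 16, 24, 32):
    print(n, modeled_elapsed(n, LOGICAL_CPUS, CPU_SECONDS))
```

The model predicts flat CPU time and linearly growing elapsed time beyond 8 processes, which is exactly the shape of the curves in Figure 2.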

The HT effects are confined to the area of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice with six processes running concurrently, CPU time elongates only slightly – the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time – this is a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.

The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.

HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of cache lines because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads “false sharing” cache lines is that it subjects the executing threads to time-consuming delays due to writing back updated cache lines to RAM and re-fetching them. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver in Windows and working in C++, false sharing in the processor data cache is something you need to be careful to address.

I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.


— Your obedient servant, etc., etc..

Using QueryThreadCycleTime to access CPU execution timing

As a prelude to a discussion of the Scenario instrumentation library, I mentioned in the previous post that a good understanding of the clocks and timers available to the Windows developer would likely prove helpful. Timekeeping in Windows is an area fraught with confusion, and I confess, it was not something I understood well enough at the time I wrote either of my Windows performance books.

The previous post in this series focused on the Windows QueryPerformanceCounter API that is a wrapper around the hardware rdtsc (Read Time-Stamp Counter) instruction. QPC() provides access to a high-resolution timer that is frequently used in performance monitoring. It is also useful in a variety of other measurement contexts. That then led to a brief discussion of the vagaries involved in using the hardware rdtsc instruction itself. Windows 7 attempts to utilize rdtsc where it is appropriate, based on repeatedly measuring the execution time of a routine during system start-up to determine if the execution times are reliably stable across power management events. On older PC hardware, power management changes that slowed the machine execution cycle time were reflected in the frequency of TSC ticks. On current Intel hardware, however, the frequency of TSC ticks is constant across power management events, one of which is dynamic over-clocking, something Intel brands as its Turbo Boost technology.

If all this talk of hardware clocks and timing instructions strikes you as a bit on the esoteric side or unnecessarily complicated, I’d like to assure you that it is actually a fundamental topic in computer performance. In Dr. Neil Gunther’s excellent introduction to computer performance modeling, The Practical Performance Analyst, one of the early foundation chapters contains an excellent discussion of the importance of time measurements in the field. (For reference, Neil’s often entertaining computer capacity planning blog is here.)

Prior to stumbling into the field of computer capacity planning, Neil’s academic background was in astrophysics, so an interest in the subject of time comes naturally to him. The fundamental insight of Einstein’s famous Special Theory of Relativity was that time measurements are necessarily relative to the position of the observer. Relativity placed the results of the famous Michelson-Morley experiment published in 1887 in a radically new context that revolutionized 20th century physics. The Michelson-Morley experiment showed that measurements of the speed of light were invariant relative to the motion of the earth through space. This was a surprising result that the astrophysics of the day did not predict. Crucial to interpreting the experimental data was the ingenious instrumentation rig Michelson and Morley constructed, the Michelson interferometer, to measure the speed of light at an unprecedented level of accuracy and precision.

Making the connection to computer performance a little less obscurely, the lifetime achievement award granted by the Computer Measurement Group, of which Dr. Gunther is a recent recipient, is named in honor of Professor Michelson.

It is interesting to note that a related time measurement phenomenon confounding physics in the late 19th century was the problem of clock synchronization across large geographical distances. This problem arose in the context of coordinating long-distance rail travel. After trying to synchronize the clocks in the French railway system using an electrical signaling apparatus, the famous French physicist and mathematician Henri Poincare was reluctantly led to conclude that a naive conception of the absolute simultaneity of two events occurring in different places was untenable, given that the speed of light itself is finite.

(The conceptual link between these two time measurement problems in 19th century physics is the subject of a recent popular science book entitled “Einstein’s Clocks, Poincare’s Maps.” That interesting book has engendered quite possibly the longest and most convoluted book review I have ever encountered on the Amazon web site. It is from John Ryskamp, evidently an amateur mathematician. So if you crave even more diversion than I can supply directly — after all, if you are reading this, you are probably just wasting time grazing on the Internet to begin with — the Ryskamp book review is here, and to understand the intellectual baggage Ryskamp carries to his reading of the otherwise unassuming Einstein/Poincare book, try this article of his that he references. Fortunately for me — time-wasting-wise, that is — Ryskamp is not a more prolific Amazon reviewer. He is basically Charles Kinbote manifest in the flesh — and you will have to work out that obscure reference on your own.)

And, finally, to bring this digression full circle back to computer measurement, the problem of synchronizing clocks across large distances reasserted itself in the early days of the development of the Internet Protocol (IP) for use in Wide Area Networking (WAN). The original IP header (version 4) Time-to-Live field had to be reinterpreted as a hop count when the search for a way to synchronize clocks across IP routers arrayed around the world proved futile (once again!). The Transmission Control Protocol (TCP) that sits atop IP in the network stack resorts instead to measuring the Round Trip Time (RTT) of Send requests relative to the sender. RTT, of course, is an important measurement used in TCP to anchor network congestion avoidance strategies.

And now back to Windows clocks and timers…

Beginning in Windows 6 (which refers to both Vista and Windows Server 2008), there is a new, event-driven mechanism for measuring processor utilization at the thread level. This measurement facility relies on the OS Scheduler issuing an rdtsc instruction at the beginning and end of each thread execution dispatch. By accumulating these CPU time measurements at the thread and process level each time a context switch occurs, the OS can maintain an accurate running total of the amount of time on the processor an executing thread consumes. Application programs can then access these accumulated CPU time measurements by calling a new API, QueryThreadCycleTime() and specifying a Thread Id.

QueryThreadCycleTime(), or QTCT, provides measurements of CPU time gathered by issuing an rdtsc instruction each time a context switch occurs. Using the same mechanism, the OS Scheduler also keeps track of processor idle time, which can be retrieved by calling either QueryIdleProcessorCycleTime or QueryIdleProcessorCycleTimeEx(), depending on whether multiple processor groups are defined in Windows 7. (Overall CPU utilization for any given interval can then be calculated by subtracting Idle time from the interval duration:

CPU % Busy = (IntervalDuration – IdleTime) * 100 / IntervalDuration )
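The idle-time subtraction above, expressed as code with illustrative values (a 1-second interval during which 250 ms of idle time accumulated):

```python
# Overall CPU utilization derived from measured idle time, as in the
# formula above. The interval and idle values here are illustrative.
def cpu_percent_busy(interval_duration, idle_time):
    return (interval_duration - idle_time) * 100.0 / interval_duration

print(cpu_percent_busy(1.0, 0.25))   # → 75.0 (% busy)
```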

With QTCT, CPU time is kept in units of clock ticks, which are model-dependent. So you also need to call QueryPerformanceFrequency to get the clock frequency in order to transform processor cycles into wall clock time.

The potential significance of the Windows OS thread scheduler being instrumented can hardly be overstated, but using the rdtsc instruction isn’t quite as straightforward as computer measurement people would like. The OS Scheduler, at least, handles some of the vagaries automatically. Potential clock drift across processors isn’t a problem, for instance: the CSwitch event that signals the beginning of a thread execution time interval and the CSwitch event that terminates it both occur on the same (logical) CPU.

However, if you are running on an older Intel processor that does not have a constant tick rate across power management states, a thread’s cycle time is potentially going to be a very difficult number to interpret. That is because on older CPUs, whenever there is a power state change that changes the clock frequency, the frequency of the associated Time Stamp Counter (the TSC register) is adjusted in tandem. When a power state change does occur, the OS Scheduler does not attempt to adjust for any change in the time between clock ticks. That means that on one of these machines, it is possible for QTCT() to return accumulated clock ticks for an interval in which one or more p-state changes have occurred, such that clock ticks of different lengths are aggregated together. Obviously, this creates a problem in interpretation, but, of course, only to the extent that power management changes are actually occurring during thread execution, and it is a problem that only occurs on older hardware.

Given that set of concerns with an rdtsc-based measurement mechanism, QTCT() remains a major step forward in measuring CPU usage in Windows. Instrumenting the OS Scheduler directly to measure processor usage is the first step towards replacing the legacy sampling technique I discussed at the outset of the series of blog articles. It has all the advantages of accuracy and precision that accrue to an event-oriented methodology. Plus, the OS Scheduler issuing an rdtsc instruction in-line during a context switch is much more efficient than generating an ETW event that must be post-processed subsequently, the approach that xperf uses.

As of Windows 7, the OS Scheduler measurements are only exposed through QTCT at the thread level. I suspect that the rdtsc anomalies due to the variable rate of the clock on older machines are probably one factor holding up wider adoption, while the scope of retrofitting all the services in and around the OS that currently rely on the sampling-based processor utilization data is probably another. I am curious to see if Windows 8 takes some of the obvious next steps to improve the CPU utilization measurements.

The QTCT() API that gives access to these timings at the thread level does have one other serious design limitation that I ran into. QTCT currently returns the number of processor cycles a thread has consumed up until the last time a context switch occurred. There is no method that allows a running thread to get an up-to-date, point-in-time measurement that includes the cycle time accumulated up to the present moment, not just up to the last context switch. A serializing method along those lines would make QTCT suitable for explicitly accounting for CPU usage at a thread level between two phases in the execution of a Windows program. The Scenario class uses QPC and QTCT together to try to calculate both the elapsed time and the CPU time consumed between two markers strategically inserted into your Windows application. I will discuss the Scenario instrumentation library in more detail in the next blog entry in this series.

High Resolution Clocks and Timers for Performance Measurement in Windows.

Within the discipline of software performance engineering (SPE), application response time monitoring refers to the capability of instrumenting application requests, transactions and other vital interaction scenarios in order to measure their response times. There is no single, more important performance measurement than application response time, especially in the degree to which the consistency and length of application response time events reflect the user experience and relate to customer satisfaction. All the esoteric measurements of hardware utilization that Perfmon revels in pale by comparison.

Of course, performance engineers usually still want to be able to break down application response time into its component parts, one of which is CPU usage. Other than the Concurrency Visualizer that is packaged with the Visual Studio Profiler that was discussed in the previous post, there are few professional-grade, application response time monitoring and profiling tools that exploit the ETW facilities in Windows. As we have seen illustrated, the Concurrency Visualizer demonstrates the capability to reconstruct an application’s execution time profile from kernel trace events with great precision and detail. An application response time monitoring tool that generates ETW response time events that can then be correlated with the Windows OS thread scheduling events has potential value in performance investigations of many sorts.

Fortunately, there is such an application response time monitoring tool for both Windows C++ native and .NET applications that is integrated with ETW. It is called the Scenario class. (The name “Scenario” was suggested by Habib Heydarian, an excellent Program Manager at Microsoft, and is a nod to the popular practice of scenario-based development, associated primarily with Agile software development methodologies. See, for example, Scenario-Based Design of Human-Computer Interactions, by John Carroll, and User Stories Applied: For Agile Software Development by Mike Cohn.)

The Scenario instrumentation library is a free download available from the MSDN Code Gallery site here.

The Scenario class is an ETW wrapper built around an extended version of the .NET Stopwatch class. The standard .NET Stopwatch class in the System.Diagnostics Namespace is used to measure elapsed time in a .NET Framework application. The .NET Stopwatch class itself is a managed wrapper around the native Windows APIs called QueryPerformanceFrequency and QueryPerformanceCounter (QPC) that access a high resolution timer and are used to calculate elapsed time. A straightforward extension of the .NET Stopwatch class adds a call to QueryThreadCycleTime (QTCT), a Windows API that provides a measure of CPU usage at the thread level, beginning in Windows version 6.

Prior to discussing the use of the application-oriented Scenario instrumentation library, however, we should first take a deeper look at the Windows APIs it utilizes, namely QueryPerformanceCounter (QPC), QueryPerformanceFrequency, and the newer QueryThreadCycleTime (QTCT). Using the Scenario library properly and interpreting the measurement data it produces will benefit from a deeper understanding of how these Windows APIs work.


The QueryPerformanceCounter and QueryPerformanceFrequency APIs were added to Windows beginning with Windows 2000 when the Performance Monitor developers noticed that the granularity of the Windows system clock was inadequate for measuring disk IO response time. As discussed in the first post in this series, the Windows system time of day clock contains time values that display precision to 100 nanosecond timer units. However, the current time of day clock value maintained by the OS is only updated once per quantum, effectively about once every 15.6 milliseconds in current versions of Windows. Using values from the system clock to time disk IOs in early versions of Windows NT, the Logical and Physical Disk Avg. Disk sec/Transfer counters in Perfmon often reported zero values whenever an IO operation started and ran to completion within a single tick of the system clock.

The solution in Windows 2000 was the addition of the QueryPerformanceCounter (QPC) and QueryPerformanceFrequency APIs. In Windows 2000, the QueryPerformanceCounter API returned a value from an rdtsc (Read TimeStampCounter) instruction. The TSC is a special hardware performance monitoring counter on Intel-compatible processors that is incremented once per processor clock cycle. On a processor clocked at 2 GHz, for example, one expects two TSC clock cycle ticks to occur every nanosecond. (As will be discussed further, elapsed time measurements that are based on successive rdtsc instructions are far less precise than the processor’s actual instruction execution cycle time. See, for example, this FAQ from the Intel Software Network for an official explanation from the CPU manufacturer.)

Issuing an rdtsc instruction on a processor clocked at 2 GHz returns a clock value considerably more precise than standard Windows timer values delineated in 100 nanosecond units. Since the processor clock speed is hardware-dependent, an additional API call, QueryPerformanceFrequency, was provided to supply the processor clock speed. Once you know the clock frequency, you can translate the output from successive rdtsc instructions into elapsed wall clock time.

When the QueryPerformanceCounter and QueryPerformanceFrequency APIs became generally available in Windows, they were rapidly adopted by other applications in need of a more precise timer facility than the Windows system clock. However, developers using QPC() soon began to notice discrepancies in time measurements taken using the rdtsc instruction due to the way in which the TSC was implemented in the hardware. There were two major discrepancies that were encountered using the rdtsc instruction on Intel-compatible processors, namely:

  1. lack of synchronization of the TSC across processors, and
  2. dynamic changes to the TSC clock update interval as a result of the processor entering a lower power state, slowing both the clock rate and the TSC update interval in tandem.

The effect of these TSC anomalies was quickly evident in the disk driver routine responsible for timing the duration of disk operations when running on a multiprocessor, which was especially ironic since QPC was originally built in order to measure disk IO operations accurately. It was possible on a multi-processor for a disk IO that was initiated on CPU 0 and completed on CPU 1 to retrieve a TSC value from CPU 1 for the IO completion that was before the TSC value retrieved when the IO operation was initiated on CPU 0. (The flavor of the difficulties in dealing with rdtsc instructions across multiple CPUs on multiprocessors is captured in this blog entry written by Microsoft SQL Server customer support representatives.)

Of lesser concern, the latency of an rdtsc instruction was surprisingly large for a hardware instruction, on the order of several hundred clock cycles on older Intel hardware. That, and the fact that the rdtsc instruction does not serialize the machine, made QPC() unsuitable for accurately timing, say, micro-benchmarks of less than several thousand instructions.

To deal with the drift in TSC values across multiple processors, in Windows 6 (Vista and Windows Server 2008), the Windows QueryPerformanceCounter function was changed to use one of several external, high resolution timer facilities that are usually available on the machine. These include the High Precision Event Timer (HPET), often also referred to as the Multimedia Timer in Windows, and the external ACPI Power Management Timer, another high resolution timer facility that is independent of the main processor hardware. Because these timer facilities are external to the processors, they are capable of supplying uniform values that are consistent across CPUs. (At the same time, QueryPerformanceFrequency was also re-written to return the frequency of the external timer source.) This change effectively fixed the problems associated with accurate measurements of disk IO response time that were evident in Windows 2000 and Windows XP.

However, using an external timer in the QPC implementation does have one major disadvantage, namely additional latency. If you wrap rdtsc instructions around QPC() calls in Windows Vista, you can typically measure latency on the order of 800 nanoseconds per call to the QPC API, or roughly 1 microsecond per call. This latency is particularly problematic given how frequently QPC is called. In ETW tracing, for instance, QPC is the default timer routine used to generate event timestamps. When gathering high volume events such as CSwitch, ReadyThread, ISR and DPC, using QPC for timestamps in ETW generates significant overhead. If one is expecting, say, 20,000 ETW events to be generated per second per processor on a Vista or Windows Server 2008 machine, calling QPC that frequently adds about 2% additional CPU utilization per processor.
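The arithmetic behind that estimate is worth making explicit; it works out to roughly 1.6% of one processor, in line with the "about 2%" figure:

```python
# Back-of-the-envelope overhead estimate: 20,000 events/sec per processor,
# each paying ~800 ns of QPC call latency for its timestamp.
events_per_sec = 20_000
qpc_latency_sec = 800e-9                 # 800 nanoseconds per QPC call

# Seconds of CPU spent in QPC per second of wall time = utilization fraction
overhead_fraction = events_per_sec * qpc_latency_sec
print(f"{overhead_fraction * 100:.1f}% of one processor")   # → 1.6%
```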

(Note: The clock facility used by default in ETW is QPC. Alternatively, one can specify either the low resolution system timer (an option that should only be used in rare instances where the low resolution 15.6 ms clock ticks suffice), or that an rdtsc instruction be issued directly. The rdtsc option is termed the “CPU cycle counter” clock resolution option. See this ETW documentation entry on the MSDN Library for details.)

For the record, ETW is hardly the only application that routinely relies on QPC-based measurements of elapsed time. When the ETW tracing infrastructure is used to gather OS kernel scheduling events, it is probably the most frequent caller of the API on the machine. Other frequent callers of QPC include the disk driver routines mentioned earlier that measure disk IO response time – reported in Perfmon as the Avg. Disk Secs/Transfer counters. These were re-written in Windows 2000 to use QPC. The TCP protocol, which needs to estimate the Round Trip Time (RTT) of packets sent to remote TCP sessions, utilizes QPC for high resolution timing, also. As mentioned earlier, the .NET Framework Stopwatch class allows an application to issue calls to QPC and QPF as an alternative to using the low resolution DateTime.Now() method that access the Windows system clock.

The hardware manufacturers were meanwhile at work making improvements in the TSC hardware to allow it to serve as an efficient and accurate elapsed time counter. Changing the behavior of the TSC when there was a power state change that adjusted the processor’s clock cycle time was the first fix. Newer machines now contain a TSC with a constant tick rate across power management states. (Which is what I like to call it, instead of what Intel officially calls an invariant TSC, terminology I find a little awkward). For details, see the Intel hardware manual entitled Intel® 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1. Section 16.9 of this manual discusses the Time-Stamp Counter on Intel hardware and its processor-dependent behavior. The manual reports, “Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward.” [Emphasis added.]

The second problem related to the TSC clocks not being synchronized across processors still exists, but the TSCs for all the processor cores resident on a single multi-core socket do run off the same underlying hardware source, at least. Clock drift across processor cores remains an issue on multi-socket NUMA hardware, but built-in NUMA node thread and interrupt affinity in Windows minimizes some of these concerns, while not eliminating them completely.

Finally, the hardware vendors also report that the latency of the rdtsc instruction has improved significantly, reportedly by an order of magnitude, on current hardware. Unfortunately, the hardware vendors are reluctant to disclose the specific latency of their rdtsc instructions. The rdtsc instruction still does not serialize the processor instruction execution engine, so rdtsc continues to return timer values that are subject to some amount of processor core “jitter”; imprecision in instruction execution latency is simply a fact of life in superscalar, pipelined machines. The need to maintain a TSC with a constant tick rate across power state changes also results in some loss of precision in rdtsc return values, which is really no big deal since it affects just one or two of the least significant bits of the timer value.

The long latency associated with accessing an external clock facility, combined with the rdtsc hardware improvements described above, prompted another round of changes in QueryPerformanceCounter for Windows 7 (and Windows Server 2008 R2). During system initialization, Windows 7 attempts to figure out if the current hardware supports a TSC tick rate that is constant across power state changes. When Windows 7 determines that the processor’s TSC tick rate is constant, the QPC routine is set to issue rdtsc instructions. If it appears that the TSC is not invariant across processor core frequency changes, then QPC will be resolved as in Windows 6 by calling the machine’s external timer. In this fashion, QPC in Windows 7 automatically provides a well-behaved, high resolution hardware clock timer service that uses the low latency rdtsc instruction whenever the TSC is well-behaved.

This ability in Windows 7 to resolve the QPC timer service dynamically based on the current hardware is the reason Windows application developers are advised to stay away from using rdtsc – unless you absolutely know what you are doing – and to use QPC instead. Coding an inline rdtsc instruction is still going to be faster than calling QPC to access the TSC, but I hope I have made clear that using rdtsc directly is not for the faint of heart. One especially has to be aware of older portables in my experience. In newer machines, however, my experience using rdtsc is very positive.
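In practice, following this advice usually just means using your platform's high-resolution timer wrapper rather than issuing rdtsc yourself. In Python, for example, time.perf_counter() is implemented on top of QueryPerformanceCounter/QueryPerformanceFrequency on Windows:

```python
# Timing a section of code with the platform's high-resolution clock
# instead of raw rdtsc. On Windows, Python's time.perf_counter() is built
# on QueryPerformanceCounter/QueryPerformanceFrequency.
import time

t0 = time.perf_counter()
total = sum(i * i for i in range(100_000))   # the timed section
elapsed = time.perf_counter() - t0
print(f"section took {elapsed * 1e3:.3f} ms")
```

Letting the OS (or the runtime) pick the clock source means you automatically get the rdtsc fast path on hardware where the TSC is invariant, and the external timer where it is not.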

Another good reason to avoid issuing rdtsc instructions in your app is the portability of your app onto ARM machines going forward in Windows 8. On ARM-based tablets and phones, there are a slew of hardware dependencies that we are going to have to start looking out for — which is a good topic for another set of blog posts!


Next time: the QueryThreadCycleTime API introduced in Windows 6, which is based on instrumenting the OS Scheduler using rdtsc instructions.