Presenting two sessions at the upcoming UKCMG meeting in Oxford, England on May 14-15.

Some news that regular readers of this blog might be interested in hearing about…
I plan to present two sessions at the upcoming UKCMG annual conference, which is being held this year on May 14 & 15 at the Oxford Belfry on the outskirts of Oxford, England.
The first presentation is a repeat performance of the one I gave at the US CMG in December, a paper entitled “Measuring Processor Utilization in Windows and Windows applications.” It essentially pulls together the series of blog entries I have been posting here, beginning with the first installment, but contains a good deal more material than I have gotten around to posting to the blog.
For instance, the last blog post, discussing the high-resolution clocks and timer facilities in Windows, leads directly to a consideration of what happens to the various CPU utilization measurements when Windows is running as a virtual guest under VMware or Hyper-V. That discussion is in the paper but, unfortunately, hasn’t made it to the blog yet.
But you can download the full paper from my company’s web site here.
I am ashamed to admit that the full paper has been available since December; inept as I am at blogging, I never alerted you, my blog readers, to its availability. Unfortunately, it will be forever thus, or at least until I retire from my day job: self-publishing on this blog takes a back seat to work that actually pays the bills around here.
(I will resist the temptation to go off on a rant here about the idiotic and naïve notion expounded by fanatical proponents of Open Source technology that information should be free. That is a wonderful ideal state, of course, but it flies in the face of the economics of information gathering, production, storage, and dissemination, all of which have real costs associated with them. Even in the digital age, which has revolutionized the costs associated with information storage and dissemination, these costs remain, and they are considerable. My contrarian view is that no one, other than gods and saints, in possession of potentially valuable information is apt to give it away for free under our system of capitalism, but that is another topic entirely.)
Workshop on Web Application Performance and Tuning.
The second session is an extended workshop on web application performance. It is focused on Windows technology (IIS, ASP.NET, AJAX, etc.), but many of the tools and techniques discussed are directly applicable to other web hosting platforms.
The workshop is based on a course that I used to give in-house at Microsoft to the developers working on various Microsoft web-based applications. While I have published very little on this topic over the years, it has actually been the focus of much of my software development work for the past five years or so. I do expect to start publishing something soon on the subject, especially as I am in the late stages of developing a new software tool aimed squarely at Microsoft web application performance.
Reading between the lines of some of my recent blog postings that are ETW-oriented, including the CPU measurement series, you would be correct in guessing that the new tool attempts to leverage ETW trace events, specifically, in this case, the events that instrument the Microsoft IIS web server and the TCP/IP networking stack. This new trace analysis tool also correlates these system-oriented trace events from various Windows components with events issued from inside application scenarios instrumented using the Scenario class library (a free component, currently posted in the MSDN Archive here).
Instrumenting your application for performance monitoring is a crucial step, and that is where the Scenario class library comes in. Originally, I conceived of the Scenario instrumentation library as a .NET flavor of the open source Application Response Measurement (ARM) initiative that was championed by both HP and IBM (and supported by the US CMG, where I was the ARM Committee liaison for many years). After I arrived at Microsoft, it quickly became apparent that I needed to adapt my original conception to leverage ETW tracing technology, which had the considerable weight of the Windows Fundamentals organization behind it.
In the workshop I explain how to use this application-oriented instrumentation as part of integrating software performance engineering best practices into the software development life cycle. This involves first setting performance goals around key application scenarios that you’ve identified, and then instrumenting those scenarios to determine whether or not the application as delivered for testing is actually capable of meeting those goals. The instrumentation can also safely be embedded in the application when it is ultimately deployed in production. This is fundamentally necessary to enable service level reporting and to verify, for example, that the app is meeting its stated performance objectives. Most ARM advocates concentrate on monitoring application performance in production, but tend to neglect the crucial earlier stages of application development, where it is important to bake goal-oriented performance monitoring in at the outset.
The new Windows performance tool is currently in a very limited beta release, and, contrary to the negative views I expressed in my earlier aside (not a rant) about information being free, we are looking at some sort of freebie distribution of the initial “commercial” version of the tool to allow you guys to explore the technology and see what it can do for you.
So, if you happen to be in the neighborhood of Oxford, England next month, you can hear and see more about this initiative. In the meantime, stay tuned to this space, where I will try to do a better job keeping you posted as we make progress in this area.


The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):

A Reader writes:

My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.

Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to count CPU cycles; that way, if someone mistakenly commits some change that slows it down, we can catch it. It worked fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance we can use the other cores that are idle while one core is executing the sequential code. We found the timing profile all messed up: the time difference can be as much as 5x for parts of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.

Then I wrote a small bit of code (at the end of this email) to test the QueryPerformanceCounter(), GetProcessTimes(), and QueryProcessCycleTime() functions. If I run the code solo (just one instance), we get pretty consistent numbers.

However, if I run 10 instances on my 4-way Intel machine with HyperThreading enabled (8 logical processors), all three methods (QueryPerformanceCounter(), GetProcessTimes(), and QueryProcessCycleTime()) report random numbers.

                               # of Concurrent Processes
                                    1                 10
QueryPerformanceCounter time        21.5567753        39.6047058
GetProcessTimes (User)              21.4501375        39.4214527
QueryProcessCycleTime               77376526479       141329606436

Can you please help me understand what is going on here?

Signed,

   — Dazed and confused.

Note that the QPC and GetProcessTimes values are reported in seconds. The QueryProcessCycleTime values are in CPU cycles, which are model-dependent. You can use the QueryPerformanceFrequency() API call to get a frequency in cycles/second, or check the ~MHz field in the Registry, as illustrated in Figure 1. In my experience, the ~MHz field reports a number similar to what QPF() returns.
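For reference, here is a minimal C++ sketch of how all three measurements might be captured around a section of code. The workload placeholder and variable names are mine, and error handling is omitted for brevity:

```cpp
#include <windows.h>
#include <stdio.h>

static ULONGLONG FileTimeTo100ns(const FILETIME& ft) {
    ULARGE_INTEGER u;
    u.LowPart = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart;                       // 100-nanosecond units
}

int main() {
    LARGE_INTEGER freq, qpc0, qpc1;
    FILETIME creation, exitTime, kern0, user0, kern1, user1;
    ULONG64 cycles0, cycles1;

    QueryPerformanceFrequency(&freq);        // QPC ticks per second
    QueryPerformanceCounter(&qpc0);
    GetProcessTimes(GetCurrentProcess(), &creation, &exitTime, &kern0, &user0);
    QueryProcessCycleTime(GetCurrentProcess(), &cycles0);

    // ... the CPU-intensive workload under test runs here ...

    QueryPerformanceCounter(&qpc1);
    GetProcessTimes(GetCurrentProcess(), &creation, &exitTime, &kern1, &user1);
    QueryProcessCycleTime(GetCurrentProcess(), &cycles1);

    // The QPC delta is wall clock time: it includes any time the thread
    // spent switched out, which is the crux of the mystery here.
    printf("QPC elapsed:    %.7f s\n",
           (double)(qpc1.QuadPart - qpc0.QuadPart) / (double)freq.QuadPart);

    // GetProcessTimes reports sampled CPU time in 100 ns units.
    printf("User CPU time:  %.7f s\n",
           (double)(FileTimeTo100ns(user1) - FileTimeTo100ns(user0)) / 1e7);

    // QueryProcessCycleTime reports raw, model-dependent CPU cycles.
    printf("Process cycles: %llu\n", cycles1 - cycles0);
    return 0;
}
```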


Figure 1. The ~MHz field in the Registry reports the nominal processor speed (in MHz).
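If you would rather pick up the ~MHz value programmatically than eyeball it in regedit, a minimal sketch follows; the key path and value name are the standard locations Windows populates at boot:

```cpp
#include <windows.h>
#include <stdio.h>

int main() {
    DWORD mhz = 0, cb = sizeof(mhz);
    // HKLM\HARDWARE\DESCRIPTION\System\CentralProcessor\0 is the volatile
    // key Windows populates at boot; ~MHz is a REG_DWORD.
    LSTATUS rc = RegGetValueW(HKEY_LOCAL_MACHINE,
                              L"HARDWARE\\DESCRIPTION\\System\\CentralProcessor\\0",
                              L"~MHz", RRF_RT_REG_DWORD, NULL, &mhz, &cb);
    if (rc == ERROR_SUCCESS)
        printf("Nominal processor speed: %lu MHz\n", (unsigned long)mhz);
    return 0;
}
```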

 

Swallowing deeply, I (humbly) replied:

 

Dear Dazed,

I see that you are confused, but there is a very good explanation for what you are seeing.

If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.

You are correct: QueryPerformanceCounter() counts wall clock time, not cycles. Any time the thread is switched out between successive calls to QPC() is included in the timing, which is why you obtained some unexpected results. You can expect thread preemption to occur when you have 10 long-running threads Ready to run on your 8-way machine.

But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects: one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs (a simple driver program for scripting the experiment is sketched after the list):

  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run eight concurrent processes
  5. then keep increasing the number of concurrent processes and see if a pattern emerges
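To make the experiment easy to repeat, it helps to script it. Here is an illustrative C++ driver along those lines; “app.exe” is a placeholder for the actual test program, and the whole thing is a sketch rather than the tooling actually used:

```cpp
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int wmain(int argc, wchar_t** argv) {
    int n = (argc > 1) ? _wtoi(argv[1]) : 1;   // # of concurrent processes
    HANDLE procs[64] = { 0 };
    LARGE_INTEGER freq, t0, t1;

    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);

    for (int i = 0; i < n && i < 64; i++) {
        STARTUPINFOW si = { sizeof(si) };
        PROCESS_INFORMATION pi = { 0 };
        wchar_t cmd[] = L"app.exe";            // placeholder test program
        if (CreateProcessW(NULL, cmd, NULL, NULL, FALSE, 0, NULL, NULL,
                           &si, &pi)) {
            procs[i] = pi.hProcess;
            CloseHandle(pi.hThread);
        }
    }

    // WaitForMultipleObjects is limited to 64 handles per call, which is
    // plenty for the 1..32 process runs suggested above. Error handling
    // (e.g., a failed CreateProcessW) is omitted for brevity.
    WaitForMultipleObjects((DWORD)n, procs, TRUE, INFINITE);

    QueryPerformanceCounter(&t1);
    printf("%d processes: %.3f s elapsed\n", n,
           (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart);
    return 0;
}
```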

Using GetProcessTimes is not very precise (for all the reasons discussed on my blog, it is based on sampling), but for this test of a long-running process (~21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck, because, as it happens, the program you are testing does not incur any significant IO wait time. My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Actually, I recommend you instrument the code using the Scenario class library that we put up on Code Gallery some time back; see http://archive.msdn.microsoft.com/Scenario. It calls both QPC and QTCT in-line during thread execution.

I personally don’t have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works suggests it could be problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.

The Mystery Solved.

Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, etc. processes in parallel, up to 32 concurrent processes on this machine. Figure 2 summarizes the results he obtained:


Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).

(Note that I truncated the results at 24 concurrent processes to focus this graphic on the left side of the chart. Beyond 24 processes, both line graphs continue along the same pattern indicated: CPU time remains flat, and elapsed time continues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.
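The Scenario library itself is managed code, but the mechanism is easy to picture. Here is a rough C++ analogue of the paired measurements just described; ScenarioTimer is an invented name, and, unlike the real library, this sketch returns raw cycles rather than converting the CPU delta into timer ticks:

```cpp
#include <windows.h>

// Invented name: a rough C++ analogue of the Begin/End measurement the
// (managed) Scenario library performs, per the description above.
class ScenarioTimer {
    LARGE_INTEGER qpcBegin_;
    ULONG64 cyclesBegin_;
public:
    void Begin() {
        QueryPerformanceCounter(&qpcBegin_);                      // wall clock
        QueryThreadCycleTime(GetCurrentThread(), &cyclesBegin_);  // CPU cycles
    }
    void End(LONGLONG* elapsedTicks, ULONG64* cpuCycles) {
        ULONG64 cyclesEnd;
        LARGE_INTEGER qpcEnd, freq;
        QueryThreadCycleTime(GetCurrentThread(), &cyclesEnd);     // QTCT first,
        QueryPerformanceCounter(&qpcEnd);                         // then QPC
        QueryPerformanceFrequency(&freq);
        // Convert the QPC delta to standard Windows 100 ns timer ticks.
        // (The multiply may overflow for very long intervals; fine here.)
        *elapsedTicks = (qpcEnd.QuadPart - qpcBegin_.QuadPart) * 10000000LL
                        / freq.QuadPart;
        // The real library also reports the CPU delta in timer ticks; raw
        // cycles are returned here to keep the sketch short.
        *cpuCycles = cyclesEnd - cyclesBegin_;
    }
};
```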

Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.

We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.

For more than 8 processes, the CPU time curve flattens out, while elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than there are logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result: the queuing delays aren’t exponential, and there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.

The HT effects are confined to the region of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice that with six processes running concurrently, CPU time elongates only slightly; the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time, a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.

The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.

HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of cache lines, because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads falsely sharing cache lines is that it subjects the executing threads to time-consuming delays as updated cache lines are written back to RAM and re-fetched. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver for Windows in C++, false sharing in the processor data cache is something you need to be careful to address.
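To make the false sharing hazard concrete, here is an illustrative C++ sketch (the names are mine). Timing the run with and without the padding on a multi-core machine shows the cost of two threads fighting over a single cache line:

```cpp
#include <thread>

// Each counter is padded to its own 64-byte cache line. Remove the
// alignas(64) and the two counters will typically share one line,
// forcing the cores to ping-pong it back and forth as both threads write.
struct PaddedCounter {
    alignas(64) volatile long long value;
};

PaddedCounter counters[2];

void Worker(int i) {
    // The increments are not atomic; the point of this sketch is only
    // the memory layout, not the correctness of the final counts.
    for (long long n = 0; n < 100000000LL; n++)
        counters[i].value++;
}

int main() {
    std::thread t0(Worker, 0), t1(Worker, 1);
    t0.join();
    t1.join();
    return 0;
}
```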

I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.

Signed,

— Your obedient servant, etc., etc.