Measuring thread execution state using trace events.

Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows

 Besides measuring processor utilization at the system level, the stream of context switch events can also be re-constructed to drill into CPU consumption at the process and thread level. An exemplary example of this approach is the Visual Studio Profiler’s Concurrency Visualizer, available in Visual Studio 2010. (For reference, see “Performance Tuning with the Concurrency Visualizer in Visual Studio 2010 in the Visual Studio 2010 Profiler,” an MSDN Magazine article written by the tool’s principal architect, Hazim Shafi.)  The Concurrency Visualizer gathers Context Switch events to calculate processor utilization for the application being profiled.

The VS Concurrency Visualizer creates a system-level CPU Utilization View with an interesting twist – the view pivots based on the application you are profiling, a perspective that matches that of a software performance engineer engaged in a performance investigation. Based on the sequence of context switch trace events, the Concurrency Visualizer calculates processor utilization by the process, aggregates it for the current selection window, and displays it in the CPU Utilization View. In the CPU Utilization View, all other processor activity for processes (other than one being profiled) is lumped together under a category called “Other Processes.” System-processes and the “Idle process,” which is a bookkeeping mechanism, not an actual process that is dispatched, are also broken out separately. See Dr. Shafi’s article for more details. (For reference, Figure 12 below illustrates the CPU Utilization View.)

The Concurrency Visualizer’s primary focus is on being able to reconstruct the sequence of events that impact an application’s execution progress. The Concurrency Visualizer’s Threads View is the main display showing an application’s execution path. The view here is of execution progress on a thread by thread basis. For each thread in your application, the Concurrency Visualizer shows the precise sequence of context switch events that occurred. These OS Scheduler events reflect that thread’s execution state over time. See Figure 8 for an example of this view.


Figure 8. Screen shot of the Concurrency Visualizer illustrating thread preemption by a higher priority system routine.



 Figure 8 shows the execution path of six application threads, a Main thread, a generic worker thread, and 4 CLR (the Common Language Runtime for .NET languages) worker threads that the application created by instantiating a .NET ThreadPool object. (There were originally more threads than this, but I chose to hide those that were inactive over the entire run.) For each thread, the execution state of the thread – whether it is running or whether it is blocked – is indicated over time.

The upper half of the display is a timeline that shows the execution state of each thread over time. The execution progress of each thread display is constructed horizontally from left to right from rectangles that indicate the start and end of a particular thread state. An interval when the thread was running shows as green. An interval where the thread is sleeping is shown in blue. A ready thread that is blocked from executing because a higher priority thread is running is shown in yellow. (This state is labeled “preemption.”) A thread in a synchronization delay waiting on a lock is visualized as red.

On the lower left of the display is a Visible Timeline Profile. This summarizes the state of all threads that are visible within the selected time window. In the screen shot in Figure 8, I have zoomed into a time window that is approximately 150 milliseconds wide. During that interval, the threads shown were in a state where they were actively executing instruction only 11% of the time. For 25% of the time interval, threads were blocked waiting on a lock. Finally, there is a tabbed display at the lower right. If you click on the “Profile Report” tab, a histogram displays that summarizes the execution state of each individual thread over the time window. In the screen shot, I have clicked on the “Current stack” tab that displays the call stack associated with the ETW context switch event. If the thread is blocked, the call stack indicates where in the code the thread will resume execution once it unblocks. We will drill into that call stack in a moment.

Note: The Threads View also displays call stacks from processor utilization samples that ETW gathers on a system-wide basis once per millisecond. Call-stacks samples are visible during any periods when the thread is executing instructions (and ETW execution sampling is active). One of the ETW OS kernel events that the Concurrency Visualizer does not analyze is the ReadyThread event. The interval between a ReadyThread event and a subsequent Context Switch that signals that a ready Thread is being dispatched measures CPU queue time delay directly. Using event data, it is possible to measure CPU queuing delays precisely. Analysis of the ETW kernel event stream far exceeds anything that can be done using Windows performance counters to try to estimate the impact of CPU queuing delays.

The Concurrency Visualizer screen shot in Figure 8 illustrates the calculation of a running thread’s CPU queuing delay. Thread 6920, which happens to be a CLR thread pool worker thread, is shown at a point in time where it was preempted by a higher priority task. The specific delay that I zoomed in on in the screen shot is preemption due to the scheduling of a high priority LPC or ISR – note this category in the Concurrency Visualizer also encompasses assorted APCs and DPCs. In this specific example, execution of Thread 6920 was delayed for 0.7718 milliseconds. According to the trace, that is the amount of time between Thread 6920 being preempted by a high priority system routine and a subsequent context switch when the ready thread was again re-dispatched.

The tool also displays the call stack of the preempted thread. The call stack indicates that the CLR’s garbage collector (GC) was running at the time that thread execution was preempted. Interpreting the call stack, it looks like the GC is sweeping the Large Object Heap (LOH), trying to free up some previously allocated virtual memory. This is not an opportune time to get preempted. You can see that one of the other CLR worker threads, Thread 6420, is also delayed. Notice from the color coding that Thread 6420 is delayed waiting on a lock. Presumably, one of the other active CLR worker threads in the parent process holds the lock that Thread 6420 is waiting for.

This is one of those “Aha” moments. If you click on the synchronization delay that Thread 6420 is experiencing, as illustrated in Figure 9, you can see that the lock that Thread 6420 is trying to acquire is, in fact, currently held by Thread 6920, the one that was preempted somewhere in the midst of running garbage collection. Clicking on the tab that says “Current Stack” (not illustrated) indicates that the duration of the synchronization delay that Thread 6420 suffered in this specific instance of lock contention was about 250 milliseconds.

The scenario here shows one CLR worker thread blocked on a lock that is held by another CLR worker thread, which in turn finds itself being delayed due to preemptions from higher priority Interrupt processing. We can see that whatever high priority work preempted Thread 6920 has the side effect of also delaying Thread 6420, since 6420 was waiting on a lock that Thread 6920 happened to be holding at the time. The tool in Figure 9 displays the Unblocking stack from Thread 6920 which shows the original memory allocation from the Dictionary.Resize() method call being satisfied, releasing a global GC lock. When Thread 6920 resumed execution following its preemption, the GC operation completes, releasing the global GC lock. Thread 6920 continues to execute for another 25 microseconds or so, before it is preempted because its time slice expired. Even as Thread 6920 blocks, Thread 6420 continues to wait while a different CLR thread pool thread (4664) begins to execute instead. Finally, after another 25 microseconds delay, Thread 6420 resumes execution. For a brief period both 6420 and 4664 execute in parallel from approximately the 7640 to 7650 microsecond milestones. (However, they are subject to frequent preemptions during that period of overlapped execution.)

Welcome to the indeterminacy associated with parallel programming.

I won’t take the time here to go into what this little concurrent CLR (Common Language Runtime ) thread pool application is doing. Suffice to say that it instantiates and references a very large Dictionary object in .NET, and I wrote it to illustrate some of the performance issues developers can face trying to do parallel programming, which is a topic I was blogging about at the time. (I should also note that the test program puts the worker threads to sleep periodically to simulate synchronous I/O waits to create an execution profile similar to what one could expect in processing a typical ASP.NET web request that needs to access an external database, an excellent idea I appropriated from a colleague, Joe Hellerstein.)

When I first began to profile this test app using the VS Concurrency Visualizer, I was able to see blocking issues like the one described here where the CLR introduced synchronization and locking considerations that are otherwise opaque to the developer. Well, caveat emptor, I suppose, when it comes to utilizing someone else’s code framework in your application. (See Rico Mariani’s Performance Tidbits blog for a singular discussion of his intriguing proposal that a .NET Framework method provide a performance signature that would allow a developer to make an informed decision before ever calling into some 3rd party’s code. Alas, static code analysis cannot be used to predict the performance of some arbitrarily complex method call embedded in your application, something Rico was eventually forced to concede.)

It turns out that .NET Framework collection classes do use locks to ensure thread-safe operation in a multi-threaded program, whether it is necessary or not. See the MSDN “Thread-Safe Collections” Help topic for more information. Each worker thread in my test program instantiated and accessed a dedicated instance of the Dictionary class during processing, so locking wasn’t necessary in this little test application. Because I had taken steps to ensure thread-safety issues would never arise in my test program, I was unpleasantly surprised when the tool uncovered lock contention for these Dictionary objects. Unfortunately, there is no way for the developer to explicitly signal the runtime that locking is not necessary. Some of the popular .NET Framework collection classes – like the HashTable – do provide a Synchronized method that exposes a lock created implicitly. But the Synchronized method is designed to support more complex multi-threaded access patterns, such as a multiple readers and writers scenario, for example. To assist in parallel programming tasks, several newer collection classes were introduced in the System.Collections.Concurrent Namespace that use “lock-free” and optimistic locking approaches that promise better scalability for parallel programs.

I eventually tweaked the test app into an especially ghoulish version I call the LockNestMonster program that uses explicit global locks to shine an even brighter light on these issues.


Figure 9. CLR Worker Thread 6420 blocked because it is waiting on a GC lock that happens to be held by Thread 6920, which is subject to preemption by higher priority system routines.



The Concurrency Visualizer also utilizes context switch events to calculate the delays a thread encounters during execution due to preemption, as a result of the expiration of a thread’s time-slice. In Figure 10, I clicked on the large yellow block on the right hand side of the execution time bar graph for Thread 6920 indicating another long delay. As in Figure 9, I have hidden all but the three active CLR thread pool threads. Using a combination of zooming to a point of interest in the event stream and filtering out extraneous threads, as illustrated in Figure 10, the Concurrency Visualizer is able to construct an execution time profile using just those events that are visible in the current time-window.

Overall, the three active CLR worker threads are only able to execute 18% of the time, while they are delayed by synchronization 9% of the time and subject to preemption 39% of the time. (You can click on the Profile Report tab in the middle right portion of the display and see a profile report by thread.)


Figure 10. Using the Concurrency Visualizer to drill into thread preemption delays.


At the point indicated by the selection, the time-slice quantum for Thread 6920 expired and the Scheduler preempted the executing thread in favor of some other ready thread. Looking at the visualization, it should be apparent that the ready thread the Scheduler chose to execute next was another CLR thread pool worker thread, namely Thread 4664, which then blocked Thread 6920 from continuing. The tool reports that a context switch(6920, 4664) occurred, and that Thread 6920 was delayed for about 275 milliseconds before it resumed execution after being preempted.

As illustrated in this example, the Concurrency Visualizer uses the ETW-based event data from a profiling run to construct a state machine that reflects the precise execution state of each application thread over the time interval being monitored. It goes considerably beyond calculating processor queue time at the thread level. It understands how to weave the sequence of Ready Thread and Context switch events together to create this execution time profile. It summarizes the profiling data, calculating the precise amount time of time each thread is delayed by synchronous IO, page faults (i.e., involuntary waits due to memory management overhead), processor contention, preemption by higher priority work, and lock contention over the profiling interval. (Note: In the Concurrency Visualizer, memory management waits that are resolved very quickly, usually in less than 1 m-second, correspond to soft page faults. When hard pages faults occur, the tool will show a corresponding disk IO, and the delay is ordinarily several milliseconds in duration, depending on the speed of the paging disk.) Furthermore, it analyzes the call stacks gathered at each Context Switch event, looking for signatures that identify the specific blocking reason. And, specifically, to help with lock contention issues, which are otherwise often very difficult to identify, it also identifies the thread that ultimately unblocks the thread that was found waiting to acquire a lock..

Is there an ARM server in your future?

I cannot resist adding to some of the industry buzz about the recent HP Project Moonshot announcement and what this potentially means for folks that run Windows. As part of Project Moonshot, HP is planning to release something called the Redstone Server Development Platform in 1H12, an ARM-based, massively parallel supercomputer that uses 90% less energy than comparable Intel microprocessors.

Coupled with the big.Little hardware announcement from the ARM Consortium about two weeks ago, I think this is big news that could shake up the foundations of Windows computing. ARM-based Windows Server machines could well be in all our futures.

Let’s start with ARM and the big.Little architecture announcement, which is pretty interesting in itself. Details are from this white paper. big.Little is an explicit multi-core architecture where a simple, low power version of the processor is packaged together with a significantly more powerful (~ 2x) version of the processor that also uses about 3 times the energy consumption. The Little CPU core in the architecture is “an in-order, non-symmetric dual-issue processor with a pipeline length of between 8-stages and 10-stages.” The big guy is “an out-of-order sustained triple-issue processor with a pipeline length of between 15-stages and 24-stages,” significantly more powerful and more complex.

The idea of pairing them is that a task that needs access to more CPU power could migrate from the Little guy to the big guy in about 20K cycles, about 20 m-seconds of delay. The idea is that when you want to playback something with HD graphics on your phone, the architecture can power up the big guy and migrate the CPU-bound graphics-rendering thread so fast that the user experience is like a turbocharger kicking in. Whoosh. Meanwhile, the device – a phone or a tablet, probably – can stay powered up and ready for action for an extended period of time with just the low power CPU. Which, at 1 GHz, is still a potent microprocessor.

Apple, Android, Blackberry and Windows smartphones today all run on ARM. ARM is a reduced instruction set computer (RISC) that has significantly lower energy costs, which is why it is so popular on smartphones. The Apple iPad tablet uses ARM, and so do most iPad competitors today from folks like Samsung and HTC. Some of these tablets are dual core ARM machines today. The tablet computer running an early build of Windows 8 that Microsoft gave away at the recent PDC also uses an ARM chip. When Windows 8 does ship next year sometime, we can expect to see a flurry of announcements for Windows 8 tablets based on ARM chips. Portable devices rely on ARM chips because they consume relatively little power, and power is the most significant performance constraint in portable computing of all types.

To be fair, we will probably see some Intel Atom-based tablets, too, running Win8 being released at the same time. (The Atom is a simplified, in-order processor that supports the full Intel x64 instruction set.) But I think ARM will continue to dominate the market for smartphones and tablets in the coming years. And an ARM-compatible version of Windows is quite likely to add to that momentum – most industry observers are guessing that plenty of Windows 8 tablets are going to be sold.

Another piece of the puzzle: the 64-bit ARM extensions were just formally announced this week.

So, ARM is happening in a big way, and it is time to take notice. Plus, Windows is jumping into ARM-based computing in its next release.

BTW, I am not counting Intel out. Intel has the semiconductor industry’s foremost fabrication facilities; it is still the industry leader in manufacturing. It is the low cost, high volume manufacturer. One of the keys to its manufacturing prowess has been high volume, which is necessary to recoup the enormous capital costs associated with constructing a new, leading edge fab plant.

However, the ARM architecture is the first viable, high volume challenger to Intel’s x86/x64 products to emerge in years. In the most recent quarter, about 400 million smartphones were shipped, easily 10 times the number of Intel-compatible PCs that were sold in the same time frame. That kind of high volume success breeds more success in semiconductor manufacturing because it drives down the manufacturing cost per unit, a key component of Moore’s law.

The HP Moonshot Project announcement suggests none of this is lost on the data center where energy consumption is a major cost factor. The goal is to fill that 1-x rack space with an array of ARM computers installed on a compact motherboard with just 10% of the energy consumption of the comparable Intel computing footprint. The projected cost savings over three years are something like 60%. So, at the higher end of HPC, this is starting to happen. The Windows 8 port to ARM then leaves just one small step to get Windows Server running on these same ARM-based servers.

It will be very interesting to see how this all plays out..

Using xperf to analyze CSwitch events

Continuing the discussion from the previous blog entry on event-driven approaches to measuring CPU utilization in Windows

Last time around I discussed the same CPU busy calculations that the Resource Manager in Windows makes. This same calculation can also be performed after the fact using the event data from ETW. This is the technique used in the Windows Performance Toolkit (WPT, but which is better known around Microsoft as xperf), for example, to calculate CPU usage metrics.

Once you have downloaded and installed the Windows Performance Toolkit, you can launch a basic ETW collection session using the following xperf command:

xperf -on DiagEasy

Then, after you have accumulated enough data, issue another command to stop tracing and capture the event stream to a file:

xperf -d cputrace.etl

Next, process the cputrace.etl file using the xperfview app. After the trace file is loaded, xperfview provides visualizations that are very similar to ResMon. See Figure 6 for an example.


Figure 6. CPU utilization graphs in xperfview, based on ETW context switch event data gathered with the xperf utility. Additional ISR and DPC events are used to calculate the amount of time device drivers spend processing interrupts.

With xperfview, you can also request a Summary Table which displays CPU usage by process (and thread) by right-clicking on the CPU Usage graph and accessing the pop-up context menu. An example of the CPU Scheduling Aggregate Summary Table is illustrated in Figure 6. It is similar to the one ResMon produces. The data here was gathered while running a multi-threaded CPU soaker program called ThreadContentionGenerator while ResMon was also active. You can see that the calculation in Figure 7 roughly mirrors the results shown in Figure 5 for ResMon, allowing for some variation that is to be expected since the intervals themselves are not identical. The xperf interval shown in Figure 6 is approximately 500 seconds long, while ResMon maintains a rolling window that is only 60 seconds in duration. The ResMon screen shot was taken somewhere in the middle of the longer xperf tracing session.

For some perspective on the volume of trace events that can be generated, the binary .etl trace file produced in this example was approximately 325 MB on disk for a DiagEasy trace session that ran for more than ten minutes. Running with the xperf defaults, I received a notification when the trace session closed that three ETW 64K buffers of data were dropped during the trace because xperf was unable to keep pace with processing the event stream in real-time.

The Context Switch event also provides the old thread’s Wait Reason code, which helps you to understand why the sequence of thread scheduling events occurred. For reference, a Windows context switch is defined here, while the contents of the ETW (Event Tracing for Windows) context switch event record are defined here, including a list of the current thread WaitReason codes.

Note that you can measure CPU queue time accurately from the ETW events, an important indicator of processor contention when the processor becomes saturated. As illustrated in Figure 4, the transition from the Wait state to the Ready state is marked by a ReadyThread event record. The time between the ReadyThread event and a subsequent CSwitch event marking its transition to Running is one form of CPU Queue time. A second form of CPU queue time is the time between a CSwitch(…,out,WaitReason,…) where the WaitReason is either a Preempt or time-slice quantum expiration and a subsequent re-dispatch. Both forms of CPU queue time can be measured accurately using ETW.

When precision in understanding patterns of CPU consumption is required, post-processing the ETW context switching event stream is a much better way to proceed than attempting to use the Windows % Processor Time performance counters. Measuring CPU consumption from the context switch events is considerably more precise than, for example, the Windows performance counter data available in Perfmon that report processor utilization at the system and process level based on processor state sampling. Such high precision measurements are not always required, of course, and processing the ETW context switching event stream is relatively expensive due to the extremely high volume of trace data that you must deal with.

Figure 6 illustrates several of the CPU utilization graphs that xperfview creates from the context switch event stream. To help make the graph more readable, I filtered out Idle time calculation for all but two of the logical processors on this machine. (The machine illustrated has 8 logical CPUs.) To gain insight into what high priority Interrupt Service Routines (ISRs) and DPCs are running, ISR and DPC events should also be gathered, which the DiagEasy event profile in xperf does automatically. (Windows device driver developers are very interested in them. For an example, see this blog posting discussing using the xperf ETW utility to capture CPU consumption by the TCP/IP network driver stack.)


Figure 7. The CPU Scheduling Aggregate Summary Table calculated by xperfview. The results of these calculations closely resembles the rolling one-minute calculation reported by the Resource Monitor in Figure 5.