Presentations for the upcoming CMG conference are available on slideshare.net

I am presenting two topics at the Computer Measurement Group (CMG) annual conference this week, and I have just posted the slide decks I will be using on slideshare.net.

The latest slide deck for Monitoring Web Application Response Times in Windows is available here.

And the HTTP/2: Recent protocol changes and their impact on web application performance one is available here.

If you are a regular reader of this blog, most of this material will look familiar. Enjoy!

 .

Measuring application response time using the Scenario instrumentation library.

This blog post describes the Scenario instrumentation library, a simple but useful tool for generating response time measurements from inside a Windows application. The Scenario instrumentation library uses QPC() and QTCT(), the Windows APIs discussed in an earlier blog entry, to gather elapsed times and CPU times between two explicit application-designated code markers. The application response time measurements are then written as ETW events that you can readily gather and analyze.

 You can download a copy of the Scenario instrumentation library here at http://archive.msdn.microsoft.com/Scenario.

The Scenario class library was originally conceived as a .NET Framework-flavored version of the Application Response Measurement (ARM) standard, which was accepted and sponsored by the Open Group. The idea behind ARM was that adding application response time measurements to an application in a standardized way would promote 3rd party tool development. This was moderately successful for a spell, ARM even developed some momentum, and was adopted by a number of IT organizations. Some major management tool vendors, including IBM’s Tivoli suite and HP OpenView, supported ARM measurements in their tools.

In the Microsoft world, however, the ARM standard itself never stirred much interest, and application response time measurements are conspicuously absent from the performance counters supplied by many Microsoft-built client and server applications. However, many of these applications are extensively instrumented and can report response time measurements using ETW, which is one of the many reasons that something ARM-like for Windows should leverage ETW.  The Scenario instrumentation library tries to satisfy the same set of requirements as an ARM-compliant tool, but tailored to the Windows platform.

The topic is very much on my mind at the moment – thus this blog post – because I am working on a new tool for reporting on web application response times under Windows using ETW events generated by the HttpServer component of Windows (better known as IIS) and the TCP/IP networking stack. One of the early users of the Beta version of the tool also wanted a way to track application Scenario markers, so I am currently adding that capability. I expect to have an initial version of this reporting tool that will also support web applications instrumented with the Scenario library available next month, so…

Why measure application response times?

There are several very good reasons for instrumenting applications so that they can gather response time measurements. If you are at all serious about performance, it is well-nigh impossible to do good work without access to application response time measurements. If you don’t have a way to quantify empirically what is “good” response time and compare it to periods of “bad” response time, let’s face it, you are operating in the dark. Performance analysis begins with measurement data, and you cannot manage what you can’t measure.

Application response time measurements are important for two main reasons. The first is that application response time measurements correlate with customer satisfaction. In survey after survey of customers, performance concerns usually rank just below reliability (i.e., bugs and other defects) as the factor most influential in forming either a positive or negative attitude towards an application. They are a critical aspect of software quality that you can measure and quantify.

In performance analysis, application response time measurements are also essential to apply any of the important analytic techniques that people have developed over the years for improving application response time. These techniques include using Queuing Theory and related mathematical techniques used by capacity planners to predict response time in the face of growing workloads and changing hardware. Any form of optimization or tuning you want to apply to your application also needs to be grounded – how can you know if this or that optimization leads to an improvement if you are not measuring response times, both before and after. Even knowing which aspect of the application’s response time to target a tuning effort on requires measurements that allow you to break the response times you observe into their component parts – CPU, IO, network, etc., an analysis technique known as response time decomposition.

So, for these and other reasons, application response time measurements are extremely important. Which is why it is especially annoying to discover that application response time measurements are largely absent from the standard Windows performance counter measurements that are available for both native C++ and managed .NET Framework apps in Windows. The Scenario instrumentation library helps to address this gap in a standard fashion, similar to the ARM standard, and likewise enables the development of 3rd party reporting tools.

Tips for gettng your applications instrumented.

Adding ARM-like instrumentation to an application inevitably raises some concerns. The prime concern is that adding the library calls means you have to open up the code and modify it. Even if the instrumentation API is easy to use – and the Scenario class library is very simple – modifying the code is risky, riskier than doing nothing. It needs to be performed by someone who knows the code and will add the instrumentation carefully. A reluctance to open up the code and expose it to additional risk is usually the big initial obstacle organizations face when it comes to adding instrumentation – it is one more thing on the ToDo list that has to be coded and tested, and one more thing that can go wrong.

The best approach to overcoming this obstacle is to line up executive support for the initiative. Let’s face it, your IT management will appreciate receiving regular service level reports that accurate reflect application usage and response time. We all want satisfied customers, and meeting service objectives associated with availability, reliability and performance is highly correlated with customer satisfaction. Application response time data is critical information for IT decision makers.

The 2nd obstacle, which is actually the more serious one, is that someone has to figure out what application scenarios to instrument. In practice, this is not something that is technically difficult. It just requires some thought from the people who designed the application and care about its quality, and perhaps some input from the folks that use the app to understand what pieces of it they rely on the most. Technical input is also required at various points in the process of coming up with the scenario definitions – decisions about what scenarios to instrument need to be made in light of any technical obstacles that can arise.

Let me illustrate one of the technical considerations that will arise when you first consider instrumenting an application to report response times. You will discover rather quickly that reporting response times alone apart from some explanatory context leads to problems in interpretation. Let me illustrate with an example from a well-known and well-instrumented application you are probably familiar with – Google Search. At the top of the panel where Google displays Search results is a response time measurement. For example, I just typed in “search engines” and Google Search returns the information that it knows of 264,000,000 “results” that match that search criteria. Google then reports that it required all of 0.25 seconds to find that information for me and report back. The 250 milliseconds measure is the response time and the 264 million results is the context needed to interpret whether that response time is adequate.

When you instrument your application using the Scenario class library, you have two additional fields that can be used to report context, a 64-bit integer Property called Size and string Property called Category. If you were instrumenting Google Search using calls to the Scenario class, you would set the Size Property to the 260 million results value and place the search keywords into the Category string. The Size and Category Properties that are reported alongside the response time measurement provide context to assist you in interpreting whether or not the response time result the application supplied was adequate in that particular instance.

So, one final aspect of instrumenting your application to consider is what additional context you want to supply that will aid you in interpreting the response time measurements after the fact. The contextual data that is usually most helpful is associated with what I like to call the application’s scalability model. The application’s scalability is your theory about the factors that have the most influence over the application’s response time. If, for example, you think that how many rows in the database the application must process has something to do with how long it takes to compute a result and generate a response message, that conjecture reflects a simple scalability model

                f(x) = y * rowsn  

Populating the Scenario object’s Size and Category Properties from data relevant to the application’s scalability model helps provide the context necessary for interpreting whether a specific response time measurement that was reported is adequate or not.

It may be challenging to squeeze all the data relevant to the application’s scalability model into the limited space the Scenario class provides for customization. In practice, many adopters turn the Category string into a composite value. That’s something I often resort to myself. Consequently, in my reporting program I support a standard method for packing data into the Category string value, which is automatically broken out in the reporting process. Worst case is that you will never be able to shoehorn all the contextual data needed into the Scenario class’s simple Size and Category Property fields. When that happens, you will need to develop your own instrumentation class – and your own reporting.

Software performance engineering and application development lifecycle

Given how fundamentally important application response time measurements to any worthwhile program for maintaining and improving software quality, it is worth thinking a bit about why this critical data is so often missing. Let’s consider the full application development life cycle – from gathering requirements, to design, coding, testing, acceptance and stress testing, to deployment and ongoing “maintenance” activities. Too often, the application’s performance requirements are relegated to a narrow window somewhere towards the end of the development process but presenting a significant hurdle during acceptance testing. When performance testing is concentrated in this fashion late in the acceptance testing stage, this positioning is almost guaranteed to cause resentment among the hard-pressed software developer staff way. Performance requirements should actually be set early in the requirements process as scenarios targeted for development are being defined. Provisioning the application so that it can report on the response times of those key scenarios emphasizes performance considerations at every stage of the application development lifecycle.

In principle, the application scenarios are specified early in the development life cycle, and early in the cycle is also the best time to begin thinking about response time requirements. In the software development methodologies that are in fashion, application performance is usually viewed as a “non-functional” requirement, one that does not get a lot of attention. This is all wrong, of course. As one of my colleagues was fond of telling our CIO, performance isn’t a coat of paint you can slap on an application after it is done that will beautify it. Decisions about how to structure the application made during the design stage often determine what application response times are even achievable.

On the contrary, performance needs to be considered at every stage of application development. Performance goals for key application scenarios need to be defined early in the design stage. Instrumentation to measure application response time allows developers to assess their progress accurately in meeting these goals at every stage of the process. Instrumentation embedded in the application also aids in performance testing. My experience is that with the right level of instrumentation, every functional test can also serve as a performance test.

While application response time measurements are the closest we can get to understanding and quantifying the end user experience, it is worth noting that the correlation between response time and customer satisfaction is typically not a simple linear relationship. Human beings are a little more complicated than that. If you want a set of relatively simple guidelines to help you decide what response times are good for your application and which are bad, I recommend Steve Seow’s book on the subject “Designing and Engineering Time: The Psychology of Time Perception in Software.” The guidelines in Dr. Seow’s book are grounded in the relevant Human-Computer Interaction (HCI) psychological research, but it is not a dry, academic discussion.

Steve’s book also promotes a set of prescriptive techniques for engineering a better user experience whenever the processing requirements associated with a request are too demanding to produce a prompt application response. For instance, by managing a customer’s expectations about how long some longer running operation is potentially going to take, you can engineer a highly satisfying solution without blistering response times. Showing a progress bar that accurately and reliably reflects the application’s forward progress and providing ways for a customer to cancel out of a long running task that ties up her machine are two very effective ways to create a positive experience for your customers when it is just not possible to complete the computing task at hand quickly enough.

 Using the Scenario instrumentation library.

The Scenario instrumentation class library provides a convenient way for a developer to indicate in the application the beginning and end of a particular usage scenario. Internally, the Scenario instance uses an ExtendedStopwatch() object to gather both wall clock time (using QueryPerformanceCounter) and the CPU ticks from QTCT() for the Scenario when it executes. The Scenario class can then output these measurements in an ETW trace event record that renders for posterity the elapsed time and CPU time of the designated block of code.

The Scenario class wraps calls to an internal ExtendedStopwatch object that returns both the elapsed time and CPU time of a demarcated application scenario. Once a Scenario object is instantiated by an application, calls to Scenario.Begin() and Scenario.End() are used to mark the beginning and end of a specific application scenario. After the Scenario.End() method executes, the program can access the object’s Elapsed and ElapsedCpu time properties. In addition, the Scenario.Begin() and Scenario.End() methods generate ETW events that can be post-processed. The payload of the ETW trace event that is issued by the Scenario.End() method reports the elapsed time and CPU time measurements that are generated internally by the class.

To support more elaborate application response time monitoring scenarios, there is a Scenario.Step method that provides intermediate CPU and wall clock timings. The Scenario class also provides a correlation ID for use in grouping logically related requests. Nested parent-child relationships among scenarios that are part of larger scenarios are also explicitly supported. For details, see the API and ETW payload documentation pages on the MSDN archive.

I developed an earlier version of the Scenario instrumentation library in conjunction with several product development teams in the Developer Tools Division when I was at Microsoft. An early adopter was a product team attempting to build a next-generation C++ compiler. The test case they were looking to optimize was the commercial Windows Build, a set of very demanding and long running compiler jobs. This product team gave the original Scenario instrumentation library quite a stress test, and I added several features to the library to make it more efficient and effective in that challenging environment.

Subsequently, what was effectively the Scenario instrumentation library version 2.0 was included in the commercial version of Visual Studio 2010, with instrumentation added for several newly developed components that shipped in the product. The Visual Studio 2010 flavor of the Scenario instrumentation library is known as the MeasurementBlock class. If you have Visual Studio 2010 installed, you can incorporate this into your application by referencing the Microsoft.VisualStudio.Diagnostics.Measurement.dll that is located at C:Program Files (x86)Microsoft Visual Studio 10.0Common7IDE. After adding a Reference to Microsoft.VisualStudio.Diagnostics.Measurement.dll, you can verify that the semantics of the MeasurementBlock class are functionally identical to the published Scenario library using the Object Browser applet in Visual Studio, as illustrated in the screen shot shown in Figure 1 below.

MeasurementBlock-class-as-view-in-the-VS-Object-Browser

Figure 1. MeasurementBlock class imported from Microsoft.VisualStudio.Diagnostics.Measurement.dll as viewed in the Visual Studio Object

You will note some cosmetic differences, though. The eponymous Mark() method in the original implementation was renamed to Step() in the subsequent Scenario library. (Ever since I found out what it meant a few years back, I have always wanted to use the word “eponymous” in a sentence!) MeasurementBlock also uses a different ETW Provider GUID; we didn’t want developers adding Scenario markers to their apps suddenly seeing the VS MeasurementBlock events, too, when they enabled the ETW Provider.

Using the Scenario library in your application is straightforward and is documented here, but I will provide some additional coding guidance for use in interfacing to the new performance tool I am working on in the next blog entry.

.

The mystery of the scalability of a CPU-intensive application on an Intel multi-core processor with HyperThreading (HT).

This blog entry comes from answering the mail (note: names and other incriminating identification data were redacted to protect the guilty):

A Reader writes:

My name is … and I am a developer for a software company. I encountered something I didn’t understand profiling our software and stumbled across your blog. Listen, hot-shot, since you are such an expert on this subject, maybe you can figure this out.

Our software is very CPU-intensive, so we use QueryPerformanceCounter() and QueryPerformanceFrequency() to get the CPU cycles. Hence, if someone mistakenly commits some change to slow it down, we can catch it. It works fine until one day we decided to fire up more than one instance of the application. Since a large part of our code is sequential and could not be parallelized, by firing up another instance, we can use the other cores that are idle when one core is executing the sequential code. We found the timing profile all messed up. The time difference can be 5x difference for part of the code. Apparently QueryPerformanceCounter() is counting wall time, not CPU cycles. BTW, I have a quad-core hyper-threaded i7 PC.

Then I wrote this small bit of code  (at the end of this email) to test QueryPerformanceCounter(), GetProcessTimes() function AND QueryProcessCycleTime() function. If I run the code solo (just one instance), we get pretty consistent numbers.

However, if I run 10-instances on my Intel 4-way with HyperThreading enabled (8-logical processors) machine, all three methods (QueryPerformanceCounter(),GetProcessTimes() and QueryProcessCycleTime() report random numbers.

# of Concurrent Processes

1

10

QueryPerformanceCounter time
21.5567753
39.6047058
GetProcessTimes (User)
21.4501375
39.4214527
QueryProcessCycleTime
77376526479
141329606436

Can you please help me understand what is going on here?

Signed,

   — Dazed and confused.

Note that the QPC and GetProcessTimes are reported in seconds. The QPCT times are in cpu cycles, which are model-dependent. You can use the QueryProcessorFrequency() API call to get the processor speed in cycles/second. Or check the ~MHz field in the Registry as illustrated in Figure 1. In my experience, that will report a number similar to QFC().

processor-speed-from-the-Windows-Registry

Figure 1. The ~MHz field in the Registry will report the processor speed in cycles/second.

 

Swallowing deeply, I (humbly) replied:

 

Dear Dazed,

I see that you are confused, but there is a very good explanation for what you are seeing.

If I understand this correctly, the app runs for 21 seconds in a standalone environment. When you run 10 processes concurrently, the app runs for 39 seconds.

You are correct, QueryPerformanceCounter() is counting wall clock time, not cycles. Any time that the thread is switched out between successive calls to QPC() would be included in the timing, which is why you obtained some unexpected results. You can expect thread pre-emption to occur when you have 10 long-running threads Ready to run on your 8-way machine.

But I think there is something else going on in this case, too. Basically, you are seeing two multiprocessor scalability effects – one due to HyperThreading and one due to more conventional processor queuing. The best way to untangle these is to perform the following set of test runs:

  1. run standalone
  2. run two concurrent processes
  3. run four concurrent processes
  4. run 8 concurrent processes
  5. then, keep increasing the # of concurrent processes and see if a pattern emerges.

Using GetProcessTimes is not very precise — for all the reasons discussed on my blog, it is based on sampling – but for this test of a long running processing (~ 21 seconds), the precision is sufficient. Since the process is CPU-intensive, this measurement is consistent with elapsed time as measured using QPC, but that is just pure luck, because, as it happens, the program you are testing does not incur any significant IO wait time.  My recommendation is to try calling QueryThreadCycleTime instead; it is a better bet, but it is not foolproof either (as I have discussed on my blog). Actually, I recommend you instrument the code using the Scenario class library that we put up on Code Gallery sometime back, see http://archive.msdn.microsoft.com/Scenario. It calls both QPC and QTCT in-line during thread execution.

I personally don’t have a lot of direct experience with QueryProcessCycleTime, but my understanding of how it works could make it problematic when you are calling it from inside a multi-threaded app. If your app runs mainly single-threaded, it should report CPU timings similar to QTCT.

The Mystery Solved.

Within a few days, the intrepid Reader supplied some additional measurement data, per my instructions. After instrumenting the program using the Scenario class, he obtained measurements from executing 2, 4, 6, 8, etc. processes in parallel, up to running 32 processes in parallel on this machine. Figure 2 summarizes the results he obtained:

chart-showing-parallel-process-scalability-on-an-HT-machine

Figure 2. Elapsed times and CPU times as a function of the number of concurrent processes. The machine environment is a quad-core Intel i7 processor with HT enabled (the OS sees 8 logical processors).

(Note that I truncated the results at 24 concurrent processes to create this graphic to focus on left side of the chart. Beyond 24 processes, both line graphs continue consistently along the same pattern indicated. CPU time remains flat and elapsed time nues to scale linearly.)

To be clear about the measurements being reported, the Elapsed property of the Scenario object is the delta between a QPC() clock timer issued at Scenario.Begin() and one issued at Scenario.End(). It is reported in standard Windows 100 nanosecond timer ticks. The ElapsedCpu property is the delta between two calls to QTCT, made immediately after the QPC call in the Scenario.Begin method and just before Scenario.End calls QPC. It is also reported in standard Windows timer ticks. Respectively, they are the elapsed (wall clock time) time and the CPU thread execution time for the thread calling into the embedded Scenario instrumentation library.

Looking at the runs for two and four concurrent processes, we see both elapsed time and CPU time holding constant. In terms of parallel scalability, this application running on this machine with 4 physical processor cores scales linearly for up to four concurrent processes.

We also see that for up to 8 concurrent processes, elapsed time and CPU time are essentially identical.

For more than 8 processes, the CPU time curve flattens out, while elapsed time of the application increases linearly with the number of concurrent processes. Running more processes than logical processors does nothing to increase throughput; it simply creates conventional processor queuing delays. The fact that the workload scales linearly up to 32 concurrent processes is actually a very positive result. The queuing delays aren’t exponential, there is no evidence of locking or other blocking that would impede scalability. These results suggest that if you had a 32-way (physical processor) machine, you could run 32 of these processes in parallel very efficiently.

The HT effects are confined to the area of the elapsed and CPU time curves between 4 and 8 concurrent processes. Notice with six processes running concurrently, CPU time elongates only slightly – the benefit of the HT hardware boost is evident here. However, running eight concurrent processes leads to a sharp spike in CPU time – this is a sign that 8 concurrent processes saturate the four processor cores. The HT feature no longer effectively increases instruction execution throughput.

The CPU time curve flattening out after 8 concurrent processes suggests that at least some of that CPU time spike at 8 concurrent processes is due to contention for shared resources internal to the physical processor core from threads running in parallel on logical (HT) processors.

HT is especially tricky because the interaction between threads contending for shared resources internal to the physical processor core is very workload dependent. Running separate processes in parallel eliminates most of the parallel scalability issues associated with “false” sharing of shared cache lines because each process is running in its own dedicated block of virtual memory. What is particularly nefarious about concurrently executing threads “false sharing” of cache lines is that it subjects executing to time-consuming delays due to writing back updated cache lines to RAM and re-fetching them. In theory, these delays can be minimized by changing the way you allocate your data structures. In practice, you do not have a lot of control over how memory is allocated if you are an application developer using the .NET Framework. On the other hand, if you are writing a device driver in Windows and working in C++, false sharing in the processor data cache is something you need to be careful to address.

I trust this further exploration and explanation will prove helpful, and I look forward to seeing a check in the mail from your company to me for helping to sweep aside this confusion.

Signed,

— Your obedient servant, etc., etc..