Why is this web app running slowly? — Optimization strategies (Part 4)

This is a continuation of a series of blog entries on this topic. The series starts here.

The YSlow model of web application performance, depicted back in Equations 3 & 4 in the previous post, leads directly to an optimization strategy to minimize the number of round trips, decrease round trip time, or both. Several of the YSlow performance rules reflect tactics for minimizing the number of round trips to the web server and back that are required to render the page. These include

  • designing the Page so there are fewer objects to Request,
  • using compression to make objects smaller so they require fewer packets to be transmitted, and
  • techniques for packing multiple objects into a single request.

For example, the recommendation to make fewer HTTP requests is in the category of minimizing round trips. YSlow rules regarding image compression or the use of “minified” external files containing JavaScript and CSS are designed to reduce the size of Response messages, which, in turn, reduces the number of round trips that are required to fetch these files. The project page for the free Minify utility at https://code.google.com/p/minify/, for instance, contains the following description of the program:

Minify is a PHP5 app that helps you follow several of Yahoo!’s Rules for High Performance Web Sites. It combines multiple CSS or Javascript files, removes unnecessary whitespace and comments, and serves them with gzip encoding and optimal client-side cache headers.

All the text-based files that are used in composing the page – .htm, .css, and .js – tend to compress very well, while the HTTP protocol supports automatic unpacking of gzip-encoded files. There is not a great benefit from compressing files already smaller than the Ethernet MTU, so YSlow recommends packing smaller files into larger ones so that text compression is more effective.

Meanwhile, the performance rules associated with cache effectiveness are designed to minimize RTT, the round trip time. If current copies of the HTTP objects requested from the web server can be retrieved from sources physically located considerably closer to the requestor, the average network round trip time for those Requests can be improved.

With its focus on the number and size of the files that the web browser must assemble to construct the page’s document object, YSlow uses an approach to optimization known in the field of Operations Research (OR) as decomposition. The classic example of decomposition in OR is the time and motion study, where a complex task is broken into a set of activities that are performed in sequence to complete it. The one practical obstacle to using decomposition here, however, is that YSlow understands the components that are used to compose the web page, but it lacks measurements of how long the overall task and its component parts take.

As discussed in the previous section, these measurements would be problematic from the standpoint of a tool like YSlow, which analyzes the DOM once it has been completely assembled. YSlow does not attempt to measure the time it took to perform that assembly. Moreover, the way the tool works, YSlow deals with only a single instance of the rendered page. If it did attempt to measure network latency, cache effectiveness, or client-side processing power, it would only be able to gather a single instance of those measurements. There is no guarantee that this single observation would be representative of the range and variation in behavior a public-facing web application should expect to encounter in reality. As we consider the many and varied ways caching technology, for example, is used to speed up page load times, you will start to see just how problematic it can be to use a single observation of page load time to represent the range and variation in actual web page load times.

Caching.

Several of the YSlow performance rules reflect the effective use of the caching services that are available for web content. These services include the portion of the local file system that is used for the web client’s cache, Content Delivery Networks (CDNs), which are caches geographically distributed around the globe, and various server-side caching mechanisms. Effective use of caching improves the round trip time for any static content that can readily be cached. Since network transmission time is roughly a function of distance, the cache that is physically closest to the web client is naturally the most effective at reducing RTT. Of the caches that are available, the cache maintained by the web browser on the client machine’s file system is physically the closest, and, thus, is usually the best place for caching to occur. The web browser automatically stores a copy of any HTTP objects it has requested that are eligible for caching in a particular folder within the file system; the web browser cache corresponds to the Temporary Internet Files folder in Internet Explorer, for example.

If a file referenced in a GET Request is already resident in the web browser cache – the disk folder where recently accessed cacheable HTTP objects are stored – the browser can add that file to the DOM without having to make a network request. Web servers add an Expires header to Response messages to indicate to the web browser that the content is eligible for caching. As the name indicates, the Expires header specifies how long the existing copy of that content remains current. Fetching that content from the browser cache requires a disk operation, which is normally significantly faster than a network request. If a valid copy of the content requested is already resident in the browser cache, the round trip time normally improves by an order of magnitude, since a block can be fetched from disk in 5-10 milliseconds on average. Note that reading a cached file from disk is not always faster than accessing the network to get the same data. Like any other factor, it is important to measure to see which design alternative performs better. In the case of an intranet web application where web browser requests can be fielded very quickly – often with less than 1 ms of network latency – it can be faster to get the requested HTTP object directly from the IIS kernel-mode cache than for the web client to access the local disk folder where its Temporary Internet Files are stored.
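
For illustration, here is a minimal sketch of how an ASP.NET handler serving relatively static content might set these caching headers, using the System.Web caching API. The handler name, the file path, and the 30-day lifetime are arbitrary placeholders, not recommendations:

    using System;
    using System.Web;

    // Sketch: an IHttpHandler that marks its Response as cacheable by the browser
    // (and by intermediate caches) for 30 days. Pick a lifetime that matches how
    // often the content actually changes.
    public class StaticContentHandler : IHttpHandler
    {
        public bool IsReusable
        {
            get { return true; }
        }

        public void ProcessRequest(HttpContext context)
        {
            // Expires header: tells the browser how long its cached copy remains current.
            context.Response.Cache.SetExpires(DateTime.UtcNow.AddDays(30));

            // Equivalent Cache-Control directives for HTTP/1.1 clients.
            context.Response.Cache.SetCacheability(HttpCacheability.Public);
            context.Response.Cache.SetMaxAge(TimeSpan.FromDays(30));

            context.Response.ContentType = "text/css";
            context.Response.WriteFile(context.Server.MapPath("~/styles/site.css"));
        }
    }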

Note also that while caching does not help the first time a customer accesses a new web page, it has a substantial impact on subsequent accesses to the page. Web traffic analysis programs will report the number of unique visitors to a web site – each of these visitors starts with a browser cache that contains none of the content that is requested. This is referred to as a cache cold start. It is only the repeat visitors that benefit from caching, and only when the repeat visit to the web site occurs prior to the content expiration date and time. In his book, Souders reports an encouragingly high number of repeat visits to the Yahoo site as evidence for the YSlow recommendation. When network latency for an external web site is at least 100-200 ms, accessing the local disk-resident browser cache is an order of magnitude faster.

When the web browser is hosted on a mobile phone, which is often configured without a secondary storage device, the capability to cache content is consequently very limited. When Chrome detects it is running on an Android phone, for example, it configures a memory-resident cache that will only hold up to 32 files at any one time. If you access any reasonably complex web site landing page with, say, more than 20-30 href= external file references, the effect is to flush the contents of the Chrome mobile phone cache.

Any CSS and JavaScript files that are relatively stable can potentially also be cached, but this entails implementing a versioning policy that your web developers adhere to. The snippet of HTML that I pulled from an Amazon product landing page, discussed earlier, illustrates the sort of versioning policy your web developers need to implement to reap the benefits of caching while still enabling program bug fixes, updates, and other maintenance to ship promptly.
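
As a sketch of what such a policy might look like in server-side code – a hypothetical helper and version token, not Amazon’s actual scheme – the application can stamp every static file URL it emits with a version string. A long Expires lifetime then becomes safe to use, because shipping a new version of the file changes the URL and forces browsers to fetch a fresh copy:

    // Hypothetical helper for emitting versioned URLs to static files. The version
    // token might come from a build number, an assembly version, or a hash of the
    // file contents; the constant below is only a placeholder.
    public static class StaticUrl
    {
        private const string Version = "2013.10.07.1";

        public static string Versioned(string path)
        {
            // "/scripts/site.js"  ->  "/scripts/site.js?v=2013.10.07.1"
            return path + "?v=" + Version;
        }
    }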

Another caching consideration is that when popular JavaScript libraries like jquery.js or angular.js (or any of their add-on libraries) are incorporated into your web application, you will often find that current copies of these files already exist in the browser’s cache and do not require network requests to retrieve them. Taking a moment to check the contents of my Internet Explorer disk cache, I can see several different versions of jquery.js currently resident in the IE cache. Another example is the Google Analytics script, ga.js, which so many web sites use to track web page usage and which is frequently already resident in the browser cache. (I will be discussing some interesting aspects of the Google Analytics program in an upcoming section.)

Content that is generated dynamically is more problematic to cache. Web 2.0 pages that are custom-built for a specific customer probably contain some elements that are unique to that user ID, while other web page parts are apt to be shared among many customers. Typically, the web server programs that build dynamic HTML Response messages will simply flag them to expire immediately so that they are ineligible for caching by the web browser. Nevertheless, caching dynamic content is appropriate whenever common portions of the pages are reused, especially when it is resource-intensive to re-generate that content on demand. We will discuss strategies and facilities for caching at least some portion of the dynamic content web sites generate in a future post.
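
The “expire immediately” case can be illustrated with a short sketch using the same System.Web caching API as in the earlier handler example (the helper name here is hypothetical):

    using System;
    using System.Web;

    // Sketch: flag a dynamically generated Response as ineligible for caching,
    // so that the browser always returns to the server for fresh content.
    public static class DynamicContentHelper
    {
        public static void MarkAsUncacheable(HttpResponse response)
        {
            response.Cache.SetCacheability(HttpCacheability.NoCache);
            response.Cache.SetExpires(DateTime.UtcNow.AddDays(-1)); // already expired
            response.Cache.SetNoStore();                            // do not retain a copy at all
        }
    }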

Beyond caching at the local machine, YSlow also recommends the use of a Content Delivery Network (CDN), such as the commercial Akamai caching service, to reduce the RTT for relatively static Response messages. CDNs replicate your web site content across a set of geographically distributed web servers, which allows a CDN server physically closer to the requestor to serve up the requested content. The net result is a reduction in the networking round trip time simply because the CDN server is physically closer to the end user than your corporate site. Note that the benefits of a CDN extend even to first-time visitors to your site, because the CDN servers hold up-to-date copies of the static content from your primary web site host. For Microsoft IIS web servers and ASP.NET applications, there are additional server-side caching options for both static and dynamic content that I will explore much later in this discussion.

Extensive use of caching techniques in web technologies to improve page load time is one of the reasons why a performance tool like YSlow does not actually attempt to measure Page Load Time. When YSlow re-loads the page to inventory all the file-based HTTP objects that are assembled to construct the DOM, the web browser is likely to discover many of these objects in its local disk cache, drastically reducing the time it takes to compose and render the web page. Were YSlow to measure the response time, the impact of the local disk cache would bias the results. A tool like the WebPageTest.org site tries to deal with this measurement quandary by accessing your web site a second time, and comparing the results to first-time user access involving a browser cache cold start.

Having read and absorbed the expert advice encapsulated in the YSlow performance rules and beginning to contemplate modifying your web application based on that advice, you start to feel the lack of actual page load time measurements keenly. It is good to know that using a minify utility and making effective use of the cache control headers should speed up page load time. But without the actual page load time measurements you cannot know how much adopting these best practices will help your specific application. It also means you do not know how to weigh the value of improvements from tactics like Expires headers for CSS and JavaScript files to boost cache effectiveness against the burden of augmenting your software development practices with an appropriate approach to versioning those files, for example. Fortunately, there are popular tools to measure web browser Page Load Time directly, and we will look at them in a moment.

Next: Complications that the simple YSlow model does not fully take into account.

Why is my web app running slowly? — Part 2

This is a continuation of a series of blog entries on this topic. The series starts here.

In this blog entry, I start to dive a little deeper into the model of web page response time that is implicit in the YSlow approach to web application performance.

[Image: Simple HTTP GET Request]

Figure 3. A GET Request issued from the web browser triggers a Response message from the server, which is received by the web client where it is rendered into a web page.

The simple picture in Figure 3 (left) leaves out many of the important details of the web protocols, including the manner in which the web server that can respond to the GET Request is located using DNS, the Domain Name System. But it will suffice to frame a definition of web application response time, which comprises (1) the time it takes the browser to issue the GET Request, including the time needed to locate the web server, (2) the time it takes the web server to create the Response message in reply, plus the network transmission time to send these messages back and forth, and, finally, (3) the time it takes the web browser to render the Response message appropriately on the display. The response time for this Request is measured from the time of the initial GET Request to the time the browser’s display of the Response is complete, such that the customer can then interact with any of the controls (buttons, menus, hyperlinks, etc.) that are rendered on the page. This response time measurement is also known as Page Load time. The YSlow tool incorporates a set of rule-based calculations that are intended to assist the web developer in reducing page load time.

Elsewhere on this blog, I have expounded at some length on the limitations of the ask-the-expert, rule-based approach, but there is no doubting its appeal. It makes perfect sense that someone who encounters a performance problem and is a relative newcomer to web application development would seek advice from someone with much more experience. However, when this expertise is encapsulated in static rules, and these rules are applied mechanically, there is often too much opportunity for nuance in the expert’s application of the rule to be missed. The point I labored to make in that earlier series of blog posts was that, too often, the mechanical application of an expert-based rule does not capture some aspect of its context that is crucial to its application. This is why in the field of Artificial Intelligence the mechanical application of rules in so-called expert systems gave way to the current machine learning approach that trains the decision-making apparatus based on actual examples[1]. On the other hand, understanding how and why an expert formulated a particularly handy performance rule is often quite helpful. That is how human experts train other humans to become expert practitioners themselves.

In essence, as Figure 3 illustrates, HTTP is a means to locate and send files around the Internet. These files contain static content, structured using the HTML markup language, which supplies the instructions the web client needs to compose and render them. However, many web applications generate Response messages dynamically, which is the case with the specific charting application we are discussing here in the case study. In that example, the Response messages are generated dynamically by an ASP.NET server-side application based on the machine, date, and chart template selected, which are all passed as parameters appended to the original GET Request message sent to the web server to request a specific database query to be executed.

As depicted in Figure 4, the HTTP protocol is layered on top of TCP, which requires a session-oriented connection, but HTTP itself is a sessionless protocol. Being sessionless means that web servers process each GET Request independently, without regard to history. In practice, however, more complex web applications, especially ones given to generating dynamic HTML, are very often session-oriented, utilizing parameters appended to the GET Request message, cookies, hidden fields, and other techniques to encapsulate the session state, reflecting past interactions with a customer and linking current Requests with that customer’s history. A canonical example is session-oriented data that associates a current GET Request from a customer with that customer’s shopping cart, filled with items to purchase from the current session or retained from a previous session.
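
As a deliberately simplified illustration of the shopping cart example, a server-side handler might use a cookie to carry a cart identifier between otherwise independent GET Requests. The cookie name and the cart lookup are hypothetical, and a real application would typically also sign or encrypt such an identifier:

    using System;
    using System.Web;

    // Sketch: using a cookie to link stateless HTTP Requests to a customer's
    // shopping cart. "CartId" and the cart store are placeholder names.
    public class CartHandler : IHttpHandler
    {
        public bool IsReusable
        {
            get { return true; }
        }

        public void ProcessRequest(HttpContext context)
        {
            HttpCookie cookie = context.Request.Cookies["CartId"];
            string cartId = (cookie != null) ? cookie.Value : Guid.NewGuid().ToString("N");

            // First visit (or expired cookie): issue a cart identifier that the
            // browser will send back with every subsequent Request to this site.
            if (cookie == null)
            {
                HttpCookie newCookie = new HttpCookie("CartId", cartId);
                newCookie.Expires = DateTime.UtcNow.AddDays(7); // survives across browser sessions
                context.Response.Cookies.Add(newCookie);
            }

            // A real application would now look up the saved cart for this identifier,
            // e.g. CartStore.Lookup(cartId), and build the Response accordingly.
            context.Response.ContentType = "text/plain";
            context.Response.Write("Cart: " + cartId);
        }
    }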

[Image: TCP/IP networking stack block diagram]

Figure 4. The networking protocol stack.

Working our way down the networking stack, the TCP protocol sits atop IP and then (usually) Ethernet at the hardware-oriented Media Access level. These lower level networking protocols transform network requests into datagrams and from there into packets to build the streams of bits that are actually transmitted over the networking hardware. For mainly historical reasons, the Ethernet protocol supports a maximum transmission unit (MTU) of approximately 1500 bytes. Note that each protocol layer in the networking stack inserts its addressing and control metadata into a succession of headers that are added to the front of the data packet. After accounting for the protocol headers – roughly 20 bytes each for the TCP and IP headers – the maximum capacity of the data payload is closer to about 1460 bytes. The Ethernet MTU requires that HTTP Request or Response messages larger than 1460 bytes be broken into multiple packets at the Sender, and the lower protocol layers are also responsible for reassembling the message at the Receiver. These details of the networking protocol stack are the basis in YSlow for the performance rules that analyze the size of the Request and Response messages that are used at the level of the HTTP protocol to compose the page.

A further complication is that many web pages are composed from multiple Response messages, as depicted in Figure 5. Typically, the HTML that is returned in the original Response message contains references to additional files. These can and often do include image files that the browser must display, style sheets for formatting, video and audio files that the browser may play, etc. In the web charting application I am using as an example here, the charts themselves are rendered on the server as .jpg image files. The HTML in the original Response message references these image files, which causes the browser to issue additional HTTP GET Requests to retrieve them during the process of page composition. Of course, the server-side application builds a .jpg file for each of the two charts that are to be rendered when it builds the original Response message. In order to display presentation-quality charts, the .jpg files that are built are rather hefty, which matters because they must be transferred over the network to the web client. The GET Requests to retrieve these charts, fully rendered on the web server in .jpg form, generate very large Response messages that then require multiple data packets to be built and transmitted.

So, not only may composing a web page require multiple GET Requests to be issued as a result of <link> and similar tags, but Response messages that are larger than the Ethernet MTU also require the transmission of multiple packets. The number of networking data transmission round trips, then, is a function of both the number of GET Requests and the size of the Response messages. For the sake of completeness, note that whenever the size of a GET Request exceeds the MTU, the GET Request must be broken into multiple packets, too. The most common reason that GET Requests exceed the Ethernet MTU is when large amounts of cookie data need to be appended to the Request.
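
To make that relationship concrete, here is a rough back-of-the-envelope sketch – not YSlow’s actual calculation, and not the Equations 3 & 4 from the earlier post – showing how both the number of Requests and the size of each Response feed into the count of packets that have to cross the network. The response sizes are made-up examples:

    using System;
    using System.Linq;

    // Back-of-the-envelope estimate of the data packets needed to compose a page,
    // given the sizes of its Response messages. Assumes roughly 1460 bytes of
    // payload per packet and ignores TCP connection setup, parallel connections,
    // and many other real-world details.
    static class PageTransferEstimate
    {
        const int PayloadBytesPerPacket = 1460;

        static int PacketsFor(long responseBytes)
        {
            return (int)Math.Ceiling(responseBytes / (double)PayloadBytesPerPacket);
        }

        static void Main()
        {
            // Hypothetical page: base HTML, two chart .jpg files, a style sheet, a script.
            long[] responseSizes = { 24000, 180000, 175000, 6000, 35000 };

            int requests = responseSizes.Length;          // each Response needs at least one GET Request
            int packets = responseSizes.Sum(PacketsFor);  // large Responses are split into many packets

            Console.WriteLine("GET Requests: {0}, data packets: {1}", requests, packets);
        }
    }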

Both the number of files that are requested to render the page and the file sizes are factors in the YSlow performance rules. For example, the HTML returned in the original Response message generated by the ASP.NET application may reference external style sheets, which are files that contain layout and formatting instructions for the browser to use. Formatting instructions in style sheets can include what borders and margins to wrap around display elements, which fonts to use and at what sizes, what colors to display, etc. The example app does rely on several style sheets, but none of them are very extensive or very large. Still, each separate style sheet file requires a separate GET Request and Response message, and some of the style sheets referenced by the document are large enough to require multiple packets to be transmitted.

Finally, the HTML can reference scripts that need to be loaded and executed when the document is being initially displayed. Scripts that modify the DOM by adding new elements to the page or changing the format of existing elements dynamically are quite common in web applications. Usually written in JavaScript, script code can be embedded within the original HTML, bracketed by <script></script> tags. However, to facilitate sharing common scripts across multiple pages of a web application, JavaScript code that manipulates the DOM can also reside in external files, which then must be requested and loaded separately.

That is enough about JavaScript for now, but we will soon see that dynamic manipulation of the DOM via the execution of script code running inside the web client, usually in response to user interaction, has the potential to complicate the performance analysis of a web application considerably. It is worth noting, however, that YSlow does not attempt to execute any of the scripts that make up the page. Like the other HTTP objects that are requested, YSlow only catalogs the number of JavaScript files that are requested and their size. It does not even begin to attempt to understand how long any of these scripts might take to execute.

The central point is that page composition and rendering inside the browser frequently requires a series of GET Request and Response messages. The original Response message to a GET Request contains links to additional data files – static images, style sheets, JavaScript files, etc. – that are embedded in the HTML document that was requested. Each additional resource file that is needed to compose the page requires an additional GET Request and a Response message from the web server. Consequently, the simple depiction of Page Load Time shown in Figure 3 gives way to the more complicated process depicted in Figure 5. When all the resources the web client has identified to compose the page are finally available and processed, the document composition process is complete. At that point, the page reaches its loaded state, which means it is available for the end user to interact with.

Take a minute to consider all those times you encounter a web page that is partially rendered but incomplete, with the UI blocked. The web browser has its instructions to gather all the HTTP objects that the web page references, and you are not able to interact with the page until all the objects referenced in the DOM are resolved. If the Response message for any one of those objects is delayed, the page is not ready for interaction. That is what Page Load time measures.

[Image: Multiple HTTP GET Requests]

Figure 5. Composing web pages usually requires multiple GET Request:Response message sequences due to links to external files embedded in either the original Response message or in subsequent Response messages. It is not until all external references are resolved that the page reaches its completed or “loaded” state, in which all the controls on the page are available to use.

 

NEXT: Exploring the YSlow scalability model.


[1] In contrast to the expert systems approach to AI, which used explicit rules to mimic the decision-making behavior of experts, machine learning algorithms do not always need to formulate explicit decision-making rules. The program “learns” through experience, making adjustments to the decision-making mechanism based on the success or failure of various trials. The decision-making mechanisms used in machine learning algorithms vary from Bayesian networks (which encode probabilistic dependencies among variables) to neural networks to genetic algorithms. The element that is common to the machine learning approach is a feedback mechanism from the learning trials. See, for example, Peter Flach, Machine Learning, for more details on the approach.


Measuring application response time using the Scenario instrumentation library.

This blog post describes the Scenario instrumentation library, a simple but useful tool for generating response time measurements from inside a Windows application. The Scenario instrumentation library uses QPC() and QTCT() – QueryPerformanceCounter and QueryThreadCycleTime, the Windows APIs discussed in an earlier blog entry – to gather elapsed times and CPU times between two explicit application-designated code markers. The application response time measurements are then written as ETW events that you can readily gather and analyze.

You can download a copy of the Scenario instrumentation library at http://archive.msdn.microsoft.com/Scenario.

The Scenario class library was originally conceived as a .NET Framework-flavored version of the Application Response Measurement (ARM) standard, which was accepted and sponsored by the Open Group. The idea behind ARM was that adding application response time measurements to an application in a standardized way would promote 3rd party tool development. This was moderately successful for a spell; ARM even developed some momentum and was adopted by a number of IT organizations. Some major management tool suites, including IBM’s Tivoli and HP OpenView, supported ARM measurements.

In the Microsoft world, however, the ARM standard itself never stirred much interest, and application response time measurements are conspicuously absent from the performance counters supplied by many Microsoft-built client and server applications. However, many of these applications are extensively instrumented and can report response time measurements using ETW, which is one of the many reasons that something ARM-like for Windows should leverage ETW.  The Scenario instrumentation library tries to satisfy the same set of requirements as an ARM-compliant tool, but tailored to the Windows platform.

The topic is very much on my mind at the moment – thus this blog post – because I am working on a new tool for reporting on web application response times under Windows using ETW events generated by the HttpServer component of Windows (better known as IIS) and the TCP/IP networking stack. One of the early users of the Beta version of the tool also wanted a way to track application Scenario markers, so I am currently adding that capability. I expect to have an initial version of this reporting tool that will also support web applications instrumented with the Scenario library available next month, so…

Why measure application response times?

There are several very good reasons for instrumenting applications so that they can gather response time measurements. If you are at all serious about performance, it is well-nigh impossible to do good work without access to application response time measurements. If you don’t have a way to quantify empirically what is “good” response time and compare it to periods of “bad” response time, let’s face it, you are operating in the dark. Performance analysis begins with measurement data, and you cannot manage what you can’t measure.

Application response time measurements are important for two main reasons. The first is that application response time measurements correlate with customer satisfaction. In survey after survey of customers, performance concerns usually rank just below reliability (i.e., bugs and other defects) as the factor most influential in forming either a positive or negative attitude towards an application. They are a critical aspect of software quality that you can measure and quantify.

In performance analysis, application response time measurements are also essential for applying any of the important analytic techniques that people have developed over the years for improving application response time. These include Queuing Theory and the related mathematical techniques used by capacity planners to predict response time in the face of growing workloads and changing hardware. Any form of optimization or tuning you want to apply to your application also needs to be grounded in measurement – how can you know whether this or that optimization leads to an improvement if you are not measuring response times both before and after? Even knowing which aspect of the application’s response time to target in a tuning effort requires measurements that allow you to break the response times you observe into their component parts – CPU, IO, network, etc. – an analysis technique known as response time decomposition.

So, for these and other reasons, application response time measurements are extremely important. Which is why it is especially annoying to discover that application response time measurements are largely absent from the standard Windows performance counter measurements that are available for both native C++ and managed .NET Framework apps in Windows. The Scenario instrumentation library helps to address this gap in a standard fashion, similar to the ARM standard, and likewise enables the development of 3rd party reporting tools.

Tips for getting your applications instrumented.

Adding ARM-like instrumentation to an application inevitably raises some concerns. The prime concern is that adding the library calls means you have to open up the code and modify it. Even if the instrumentation API is easy to use – and the Scenario class library is very simple – modifying the code is risky, riskier than doing nothing. It needs to be performed by someone who knows the code and will add the instrumentation carefully. A reluctance to open up the code and expose it to additional risk is usually the big initial obstacle organizations face when it comes to adding instrumentation – it is one more thing on the ToDo list that has to be coded and tested, and one more thing that can go wrong.

The best approach to overcoming this obstacle is to line up executive support for the initiative. Let’s face it, your IT management will appreciate receiving regular service level reports that accurately reflect application usage and response time. We all want satisfied customers, and meeting service objectives associated with availability, reliability, and performance is highly correlated with customer satisfaction. Application response time data is critical information for IT decision makers.

The 2nd obstacle, which is actually the more serious one, is that someone has to figure out what application scenarios to instrument. In practice, this is not something that is technically difficult. It just requires some thought from the people who designed the application and care about its quality, and perhaps some input from the folks that use the app to understand what pieces of it they rely on the most. Technical input is also required at various points in the process of coming up with the scenario definitions – decisions about what scenarios to instrument need to be made in light of any technical obstacles that can arise.

Let me illustrate one of the technical considerations that will arise when you first consider instrumenting an application to report response times. You will discover rather quickly that reporting response times alone, apart from some explanatory context, leads to problems in interpretation. Consider an example from a well-known and well-instrumented application you are probably familiar with – Google Search. At the top of the panel where Google displays search results is a response time measurement. For example, I just typed in “search engines” and Google Search returned the information that it knows of 264,000,000 “results” that match that search term. Google then reports that it required all of 0.25 seconds to find that information for me and report back. The 250 milliseconds is the response time measurement, and the 264 million results are the context needed to interpret whether that response time is adequate.

When you instrument your application using the Scenario class library, you have two additional fields that can be used to report context: a 64-bit integer Property called Size and a string Property called Category. If you were instrumenting Google Search using calls to the Scenario class, you might set the Size Property to the 264 million results value and place the search keywords into the Category string. The Size and Category Properties that are reported alongside the response time measurement provide context to assist you in interpreting whether or not the response time the application reported was adequate in that particular instance.

So, one final aspect of instrumenting your application to consider is what additional context you want to supply that will aid you in interpreting the response time measurements after the fact. The contextual data that is usually most helpful is associated with what I like to call the application’s scalability model. The application’s scalability model is your theory about the factors that have the most influence over the application’s response time. If, for example, you think that the number of rows in the database the application must process has something to do with how long it takes to compute a result and generate a response message, that conjecture reflects a simple scalability model of the form

                f(x) = y * rows^n

Populating the Scenario object’s Size and Category Properties from data relevant to the application’s scalability model helps provide the context necessary for interpreting whether a specific response time measurement that was reported is adequate or not.

It may be challenging to squeeze all the data relevant to the application’s scalability model into the limited space the Scenario class provides for customization. In practice, many adopters turn the Category string into a composite value. That’s something I often resort to myself. Consequently, in my reporting program I support a standard method for packing data into the Category string value, which is automatically broken out in the reporting process. The worst case is that you simply cannot shoehorn all the contextual data needed into the Scenario class’s simple Size and Category Property fields. When that happens, you will need to develop your own instrumentation class – and your own reporting.
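
As an illustration of one possible packing convention – a semicolon-delimited list of key=value pairs, with hypothetical field names – consider the following helper. Whatever convention you adopt, the reporting side needs to use the same one to break the values back out:

    using System.Collections.Generic;

    // Illustration of one possible packing convention for the Category string:
    // semicolon-delimited key=value pairs. Field names are hypothetical examples.
    static class CategoryPacker
    {
        public static string Pack(IDictionary<string, string> fields)
        {
            var parts = new List<string>();
            foreach (KeyValuePair<string, string> field in fields)
            {
                parts.Add(field.Key + "=" + field.Value);
            }
            return string.Join(";", parts);
        }
    }

    // Example: Pack(new Dictionary<string, string> {
    //     { "report", "DailyChart" }, { "machine", "WEB01" }, { "rows", "125000" } })
    // yields "report=DailyChart;machine=WEB01;rows=125000".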

Software performance engineering and application development lifecycle

Given how fundamentally important application response time measurements are to any worthwhile program for maintaining and improving software quality, it is worth thinking a bit about why this critical data is so often missing. Let’s consider the full application development life cycle – from gathering requirements, to design, coding, testing, acceptance and stress testing, to deployment and ongoing “maintenance” activities. Too often, the application’s performance requirements are relegated to a narrow window somewhere towards the end of the development process, where they present a significant hurdle during acceptance testing. When performance testing is concentrated late in the acceptance testing stage in this fashion, it is almost guaranteed to cause resentment among the hard-pressed software development staff. Performance requirements should actually be set early in the requirements process, as the scenarios targeted for development are being defined. Provisioning the application so that it can report on the response times of those key scenarios emphasizes performance considerations at every stage of the application development lifecycle.

In principle, the application scenarios are specified early in the development life cycle, and early in the cycle is also the best time to begin thinking about response time requirements. In the software development methodologies that are in fashion, application performance is usually viewed as a “non-functional” requirement, one that does not get a lot of attention. This is all wrong, of course. As one of my colleagues was fond of telling our CIO, performance isn’t a coat of paint you can slap on an application after it is done that will beautify it. Decisions about how to structure the application made during the design stage often determine what application response times are even achievable.

On the contrary, performance needs to be considered at every stage of application development. Performance goals for key application scenarios need to be defined early in the design stage. Instrumentation to measure application response time allows developers to assess their progress accurately in meeting these goals at every stage of the process. Instrumentation embedded in the application also aids in performance testing. My experience is that with the right level of instrumentation, every functional test can also serve as a performance test.

While application response time measurements are the closest we can get to understanding and quantifying the end user experience, it is worth noting that the correlation between response time and customer satisfaction is typically not a simple linear relationship. Human beings are a little more complicated than that. If you want a set of relatively simple guidelines to help you decide what response times are good for your application and which are bad, I recommend Steve Seow’s book on the subject “Designing and Engineering Time: The Psychology of Time Perception in Software.” The guidelines in Dr. Seow’s book are grounded in the relevant Human-Computer Interaction (HCI) psychological research, but it is not a dry, academic discussion.

Steve’s book also promotes a set of prescriptive techniques for engineering a better user experience whenever the processing requirements associated with a request are too demanding to produce a prompt application response. For instance, by managing a customer’s expectations about how long some longer running operation is potentially going to take, you can engineer a highly satisfying solution without blistering response times. Showing a progress bar that accurately and reliably reflects the application’s forward progress and providing ways for a customer to cancel out of a long running task that ties up her machine are two very effective ways to create a positive experience for your customers when it is just not possible to complete the computing task at hand quickly enough.

 Using the Scenario instrumentation library.

The Scenario instrumentation class library provides a convenient way for a developer to indicate in the application the beginning and end of a particular usage scenario. Internally, the Scenario instance uses an ExtendedStopwatch() object to gather both wall clock time (using QueryPerformanceCounter) and the CPU ticks from QTCT() for the Scenario when it executes. The Scenario class can then output these measurements in an ETW trace event record that renders for posterity the elapsed time and CPU time of the designated block of code.

The Scenario class wraps calls to an internal ExtendedStopwatch object that returns both the elapsed time and CPU time of a demarcated application scenario. Once a Scenario object is instantiated by an application, calls to Scenario.Begin() and Scenario.End() are used to mark the beginning and end of a specific application scenario. After the Scenario.End() method executes, the program can access the object’s Elapsed and ElapsedCpu time properties. In addition, the Scenario.Begin() and Scenario.End() methods generate ETW events that can be post-processed. The payload of the ETW trace event that is issued by the Scenario.End() method reports the elapsed time and CPU time measurements that are generated internally by the class.

To support more elaborate application response time monitoring scenarios, there is a Scenario.Step method that provides intermediate CPU and wall clock timings. The Scenario class also provides a correlation ID for use in grouping logically related requests. Nested parent-child relationships among scenarios that are part of larger scenarios are also explicitly supported. For details, see the API and ETW payload documentation pages on the MSDN archive.
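
Putting those pieces together, a minimal usage sketch might look like the following. The constructor call, the Step() signature, and the workload methods are placeholders – consult the API documentation on the MSDN archive for the actual overloads – but the Begin/Step/End calls and the Elapsed, ElapsedCpu, Size, and Category members are the ones described above:

    using System;

    // Sketch of instrumenting one application scenario with the Scenario library.
    // Assumes the Scenario assembly from http://archive.msdn.microsoft.com/Scenario
    // has been referenced; the constructor and Step() arguments shown are placeholders.
    static class ResponseTimeExample
    {
        static void Main()
        {
            var scenario = new Scenario();      // see the library docs for constructor overloads

            scenario.Begin();                   // start timing; also emits a Begin ETW event

            LoadDataFromDatabase();
            scenario.Step();                    // intermediate wall clock and CPU timings

            RenderChart();

            // Context for interpreting the measurement, per the scalability model:
            scenario.Size = 125000;             // e.g., rows processed (hypothetical value)
            scenario.Category = "report=DailyChart";

            scenario.End();                     // stop timing; emits the End ETW event with the payload

            Console.WriteLine("Elapsed: {0}, CPU: {1}", scenario.Elapsed, scenario.ElapsedCpu);
        }

        static void LoadDataFromDatabase() { /* placeholder for the real work */ }
        static void RenderChart() { /* placeholder for the real work */ }
    }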

I developed an earlier version of the Scenario instrumentation library in conjunction with several product development teams in the Developer Tools Division when I was at Microsoft. An early adopter was a product team attempting to build a next-generation C++ compiler. The test case they were looking to optimize was the commercial Windows Build, a set of very demanding and long running compiler jobs. This product team gave the original Scenario instrumentation library quite a stress test, and I added several features to the library to make it more efficient and effective in that challenging environment.

Subsequently, what was effectively the Scenario instrumentation library version 2.0 was included in the commercial version of Visual Studio 2010, with instrumentation added for several newly developed components that shipped in the product. The Visual Studio 2010 flavor of the Scenario instrumentation library is known as the MeasurementBlock class. If you have Visual Studio 2010 installed, you can incorporate this into your application by referencing the Microsoft.VisualStudio.Diagnostics.Measurement.dll that is located at C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE. After adding a Reference to Microsoft.VisualStudio.Diagnostics.Measurement.dll, you can verify that the semantics of the MeasurementBlock class are functionally identical to the published Scenario library using the Object Browser applet in Visual Studio, as illustrated in the screen shot shown in Figure 1 below.

[Image: MeasurementBlock class as viewed in the Visual Studio Object Browser]

Figure 1. MeasurementBlock class imported from Microsoft.VisualStudio.Diagnostics.Measurement.dll, as viewed in the Visual Studio Object Browser.

You will note some cosmetic differences, though. The eponymous Mark() method in the original implementation was renamed to Step() in the subsequent Scenario library. (Ever since I found out what it meant a few years back, I have always wanted to use the word “eponymous” in a sentence!) MeasurementBlock also uses a different ETW Provider GUID; we didn’t want developers adding Scenario markers to their apps suddenly seeing the VS MeasurementBlock events, too, when they enabled the ETW Provider.

Using the Scenario library in your application is straightforward and is documented here, but I will provide some additional coding guidance for interfacing with the new performance tool I am working on in the next blog entry.
