Performance testing is one of the most important practices associated with applying software performance engineering principles to acceptance testing and other quality assurance processes. This blog entry discusses the key role scalability models play in performance testing. A scalability model hypothesizes a relationship between a performance-oriented response time or throughput goal for a scenario and one or more scalability dimensions associated with that specific scenario. One such scalability dimension, the arrival rate of customer requests, is a staple of performance stress tests, for example.
One of the main assertions here is the value of making explicit the scalability assumptions that are implicit in defining performance acceptance testing criteria. Assumptions about the dimensions of an application’s scalability inform the definition of an appropriate performance test matrix to ensure the testing process is both rigorous and effective. Furthermore, feedback from empirical measurement data derived from performance tests should be used to validate the scalability assumptions that were formulated. This is especially important because the scalability assumptions made at the outset of a project are often incomplete. Validating the model against the measurements generated by the performance tests can even show it is not accurate enough for its intended purpose, which is essentially to assess the quality of the application.
Performance tests are frequently used to verify that new software releases meet quality assurance goals, including the so-called non-functional business requirements associated with application responsiveness. In building responsive web sites and creating scalable applications of all sorts, these requirements are understood to be crucial in attaining higher levels of customer satisfaction.
In acceptance testing, elaborate performance benchmarks are often constructed. These benchmarks are executed to determine the hardware configuration needed to meet capacity requirements for a new or revised system, based, for example, on the expected number of concurrent users of those new or enhanced program features. Experienced performance engineering specialists learn to use a variety of commercial tools to generate and execute synthetic performance benchmarks, which involves defining the test scenarios to execute, using facilities for ramping up the number of simulated user interactions, etc.
In short, in order to ensure the quality of a new release of a software application, it is a common practice in performance engineering to develop and execute synthetic performance tests. The goal of this effort is to predict reliably the performance and scalability of that application prior to deploying it in production. To that end, many software development organizations expend considerable time and expense developing and executing performance tests to try to ascertain whether a new software release meets its quality objectives.
I am not, however, primarily concerned here with stress testing designed to aid in capacity planning, although scalability models should and often do play an important role in decisions about which test scenarios to automate and exercise. Instead, the focus here is on performance timing tests, which are designed to execute and measure scenarios to ensure their performance meets their response time objectives. Performance timing tests are also associated with regression testing, ensuring that subsequent changes to the application do not undermine the service levels obtained in a prior release.
The process of developing performance tests includes determining what specific aspects of an application to test. This is challenging for any complex application with a myriad of access paths and interaction scenarios. Here is where a scalability model is most helpful. Consider a database query and reporting scenario that returns a large number of results that must be sorted. Understanding that sorting is a numerical operation that is sensitive to the number of items in the result set suggests developing timing tests for the scenario that test the assumption that response time scales appropriately with the size of the result set the query returns. More generally, a scalability model explicitly identifies the major factors impacting application performance, and timing tests need to be formulated along the same dimensions as the model.
After deciding what application-oriented scenarios need testing, the next step is to formulate a response time objective or goal for these scenarios. The scalability model is also very useful in this context because the response time goals need to be realistic in light of the processing resources required to execute the scenario. Finally, the performance test itself needs to be developed – in effect a timing test for the scenario – instrumented, and executed. For example, executing a series of timing tests that show response times for the database query and reporting scenario increasing exponentially with the size of the query result set raises an alarm. The application as delivered is not performing as expected. Maybe the sorting utility that was implemented needs re-visiting.
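One way to check a series of timing results against the scaling the model expects is to estimate how fast response time grows with result-set size. The following is a minimal sketch, not a definitive method: the function names and the 1.5 cutoff are assumptions chosen for illustration. It fits the slope of log(time) against log(size); a slope near 1 is consistent with roughly linear or n log n growth, a slope near 2 suggests quadratic growth, and anything growing faster than that produces an even larger slope.

```python
import math


def growth_exponent(sizes, times):
    """Least-squares slope of log(time) versus log(size).

    sizes and times are parallel sequences of positive numbers, one pair
    per timing test (for example, rows returned and seconds elapsed).
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def scales_as_expected(sizes, times, max_exponent=1.5):
    """True when growth stays below the cutoff.

    max_exponent=1.5 is an illustrative assumption that leaves room for
    n log n sorting while flagging quadratic or worse growth.
    """
    return growth_exponent(sizes, times) <= max_exponent
```

Feeding the function the elapsed times from timing tests run at, say, 1,000, 2,000, 4,000 and 8,000 rows gives a single number to compare against the scalability model before anyone starts re-visiting the sorting utility.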
Frequently, empirical measurements from timing tests executed against the current system establish a baseline of expectations for the level of performance customers will tolerate in the new release. In the absence of any other predictive criteria, that baseline is what a new release must meet. It is also important to verify that new features and functions in a future release do not inadvertently undermine the performance levels that customers are accustomed to experiencing in the current release. Identical timing tests run against the old and the new system can verify that subsequent changes or additions have not produced significant performance regressions in legacy features.
The baseline expectations associated with the current system are relevant because studies in Human Factors engineering show that people accustomed to using the current system do not perceive response time improvements unless they are of sufficient magnitude. Steve Seow’s “Engineering Time” is an excellent summary of the Human Factors research in this area. Seow’s guidance for software developers is that customers usually do not perceive response time improvements in the application that are less than 20% better than the old system. (Similarly, response time degradation less than 20% is not liable to be noticed either.)
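The 20% guideline can be turned into a simple comparison between identical timing tests run against the old and the new release. This is a minimal sketch under stated assumptions: the function name and the three labels are hypothetical, and the threshold is just the 20% figure cited above, applied symmetrically to improvements and degradations.

```python
def classify_change(baseline_seconds, candidate_seconds, threshold=0.20):
    """Compare a new release's response time with the baseline release.

    Returns "regression" or "improvement" when the relative change reaches
    the threshold, and "not noticeable" otherwise.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline response time must be positive")
    change = (candidate_seconds - baseline_seconds) / baseline_seconds
    if change >= threshold:
        return "regression"
    if change <= -threshold:
        return "improvement"
    return "not noticeable"
```

A scenario that took 1.0 seconds in the current release and 1.5 seconds in the new one would be flagged as a regression, while a move from 1.0 to 1.1 seconds would fall inside the band customers are unlikely to perceive.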
In summary, a critical success factor in developing performance tests that, unfortunately, is quite often under-emphasized in practice, is the development and validation of a suitable scalability model for the application under test. The scalability model dictates what specific aspects of the application’s performance need to be investigated during testing. It is also important to regard the scalability model assumptions as contingent until validated by specific test results. The absence of a suitable scalability model to guide performance testing can spell doom for that effort.
What is a scalability model?
A scalability model is conceptual in nature, providing a working hypothesis that describes the factors that contribute to the response times of the scenarios exercised in the performance tests. The scalability model implicit in stress testing a web-based application, for example, by successively increasing the arrival rate of customer requests assumes there is a relationship between the response time for web requests and the number of concurrent requests, namely
RT = f(N)
where N is the number of concurrent requests. This is a scalability model that is derived from mathematical queuing theory, where the relationship between response time and the number of customers is modeled by a non-linear function, such as the one that characterizes a simple M/M/n open queuing model:
RT = ST + QT = ST + (ST * u / (1-u))
where RT, ST, and QT are the Response Time, the Service Time, and the Queue Time, respectively, and u is the utilization of some bottlenecked resource the web request requires. As the utilization of this resource approaches its capacity limit, u → 100%, the response time for servicing those requests at the bottlenecked resource spikes characteristically, as illustrated in Figure 1. (Note that at 100% utilization, the function reduces to a singularity and the result is undefined.)
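To see the shape of this curve numerically, the formula can be evaluated directly. The sketch below is illustrative only: the function names are assumptions, and it implements exactly the closed form written above, which is the single-server case, where RT simplifies to ST / (1 - u).

```python
def response_time(service_time, utilization):
    """RT = ST + QT = ST + ST * u / (1 - u), valid for 0 <= u < 1."""
    if not 0.0 <= utilization < 1.0:
        # At u = 1 the denominator is zero and the result is undefined.
        raise ValueError("utilization must be in the range [0, 1)")
    queue_time = service_time * utilization / (1.0 - utilization)
    return service_time + queue_time


def response_curve(service_time, utilizations):
    """Pair each utilization level with its predicted response time."""
    return [(u, response_time(service_time, u)) for u in utilizations]
```

With a service time of 0.1 seconds, the predicted response time doubles to about 0.2 seconds at 50% utilization and reaches about 1 second at 90% utilization, which is the characteristic spike.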
Wherever the relationship hypothesized in an M/M/n queueing model holds, you should be able to execute a series of performance tests, steadily increasing the arrival rate of customer requests until a response time spike is evident, manifesting a performance bottleneck that constrains performance.
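Given the results of such a ramp-up, one simple way to locate the spike is to look for the first arrival rate at which response time exceeds some multiple of the response time observed under the lightest load. This is a rough sketch, not a standard technique: the function name and the default factor of 2.0 are assumptions, and a real test harness would supply its own measurements.

```python
def find_knee(measurements, factor=2.0):
    """Return the first arrival rate whose response time exceeds
    factor times the response time at the lowest arrival rate.

    measurements is a list of (arrival_rate, response_time) pairs.
    Returns None when no spike is evident in the data.
    """
    ordered = sorted(measurements)
    if not ordered:
        return None
    light_load_rt = ordered[0][1]
    for arrival_rate, rt in ordered[1:]:
        if rt > factor * light_load_rt:
            return arrival_rate
    return None
```

If `find_knee` returns None, the ramp probably did not push the bottlenecked resource close enough to its capacity limit, and the series of tests needs to continue to higher arrival rates.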
The value of even simple queuing model equations, such as the one illustrated in Figure 1, is that they derive analytically behavior that you are apt to observe in real world applications. They “explain” why it is unrealistic to expect the performance of complex applications to scale linearly with the number of concurrent customers, for example, and why a configuration where utilization at each resource is “load-balanced” avoids queuing bottlenecks and produces optimal performance. Fundamentally, this correspondence between what a model predicts and the actual behavior observed is why model building matters in performance analysis.
However, for complex, real-world applications, there is risk in adopting simple models that are apt to be too simple to be completely trustworthy. After all, the arrival rate of customer requests is only one likely dimension of scalability. Consider an electronic mail or messaging application, for example, like Hotmail or Outlook. The size of the message being sent, the number of recipients, the size and number of attachments, the network bandwidth available, as well as the distance of the various recipients from the e-mail server are all factors likely to impact message delivery time, the key performance-oriented indicator of service quality. In addition, some less obvious, ancillary factors, such as the number of recipient mailboxes that are configured, may need to be included before the behavior predicted by the model resembles the actual behavior.
That is, indeed, frequently the rub in any conceptual model-building exercise in science. Models that are too simple do not embody enough detail about the application to accurately predict its real world behavior. On the other hand, too many moving parts in the conceptual model can readily lead to computational complexity that limits the usefulness of the model as a predictive tool. We are limited to models that can actually be solved mathematically.
The same considerations apply in performance testing. During acceptance testing, measurements must be taken along each relevant axis of scalability to ensure that application response goals are being met. Two dimensions of scalability, for example the number of customers (n) and the complexity of their database queries (C), define a performance testing matrix that is n * C wide. If a third scalability dimension is required, such as m, the number of search criteria used in the database query, the number of test cases necessary increases combinatorially to n * m * C. The effectiveness of any sequence of performance tests is gauged by how well it covers the underlying model of the application’s scalability. If the tests do not accurately assess performance against each of the relevant scalability factors, there is no reason to trust that they will accurately predict actual performance levels when the application is ultimately deployed.
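The combinatorial growth of the test matrix is easy to see by enumerating it. The following sketch assumes each scalability dimension is tested at a small set of discrete levels; the function name, the dimension names and the levels in the usage note are hypothetical.

```python
from itertools import product


def build_test_matrix(**dimensions):
    """Return one test case (a dict) per combination of dimension levels.

    Each keyword argument names a scalability dimension and supplies the
    list of levels at which that dimension should be tested.
    """
    names = list(dimensions)
    return [dict(zip(names, combo))
            for combo in product(*dimensions.values())]
```

For example, calling `build_test_matrix(customers=[1, 10, 100], search_criteria=[1, 3], query_complexity=["simple", "join", "aggregate"])` yields 3 * 2 * 3 = 18 test cases. Adding one more level to any dimension, or one more dimension, multiplies that count, which is why the choice of dimensions in the scalability model matters so much.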
The desirability of a model that is simple, but not too simple, is encapsulated in Occam’s razor: in effect, the simplest explanation among any that conform to the observational measurements is preferred. This balancing act inherent in distilling complex behavior into a simplified conceptual model is often discussed in mathematical circles with reference to an old joke, which goes something like the following:
One evening a man leaves a bar and sees a drunk on his hands and knees in the bright arc of a nearby streetlight. “Can I help you?” he volunteers. “Sure. I am looking for my car keys,” the drunk slurs in response. After a few minutes of futile groping on the ground, the first man asks, “Are you sure this is where you dropped them?” The drunk replies, “No, actually I dropped them over there. But I am looking over here because the light’s better.”
Scientists involved in building conceptual models of complex real-world behavior are reconciled to the fact that they are limited to searching in areas where there is ample empirical measurement data to illuminate the phenomenon requiring an explanation. Consequently, conceptual models that are abstracted from empirical data are always regarded as contingent, subject to revision whenever observational data is acquired that contradicts these assumptions.
Similarly, in performance investigations, we are often constrained to searching only in the areas illuminated by the measurement data we manage to gather. There are frequently aspects of performance that we do not have sufficient measurement data to understand. We are simply too much in the dark about these areas to render a reasonable judgment about them. Moreover, implicit scalability assumptions are frequently embedded in the measurement tools that are employed, determining what measurements are available and what measurements are not. Unfortunately, performance engineers are often reduced to looking for a solution only in those aspects of the system where there is sufficient instrumentation.
More on the contingency of models and the way implicit scalability assumptions are often embedded in our measurement tools in the next post in this series.