"Anything you build on a large scale or with intense passion invites chaos."
Francis Ford Coppola
Mr. Coppola was, of course, talking about the making of Apocalypse Now. But the general case is also true, and perhaps particularly relevant when applied to information technology (IT). From organization to architecture, managing the mayhem is increasingly difficult as the scope grows. It takes a very good architect to harness the chaos of the Internet and the vast array of available options, and design systems that support thousands or even millions of users.
This article discusses the opportunities and challenges when leveraging the cloud to performance test large-scale websites and applications. Getting performance right, particularly at web-scale, requires a level of passion that results in both a view of the big picture and an attention to detail. We'll describe how to use the scale of the cloud to gain confidence when deploying sites servicing potentially massive amounts of web traffic. We'll start by describing what we mean by the cloud in order to set context. We'll then describe the components that comprise SOASTA's CloudTest, focus on how we deliver the offering to the market, and examine the key tenets of the cloud testing methodology based on SOASTA's experiences.
Scale introduces the potential for chaos. The number of things that can impact performance, from hardware to software to network, grows exponentially as an application scales, and the inter-dependencies between the various components grow increasingly complex.
Testing at web-scale requires:
an infrastructure to support the traffic
a means to manage the deployment and execution of the test
analytics that can massage the massive amounts of data and deliver actionable information
performance engineers with the experience to navigate through complex environments
a methodology to do it all as efficiently and effectively as possible
This article describes how SOASTA addresses these issues and the considerations for anyone trying to test at web-scale.
At SOASTA, we help ensure that websites and applications are highly reliable and scalable. This has become an increasingly dicey proposition. External events such as the Super Bowl, Cyber-Monday, Mother's Day, a significant drop in the interest rate, or even news about a celebrity can suddenly drive unforeseen traffic to a website. The impact of social media can exacerbate the situation, turning the success of driving traffic to a website into an embarrassment if a site is slow, or worse, if it crashes. The proliferation of mobile devices, and the effort by businesses of all types to expand their reach, has driven a level of traffic to websites and applications that was previously experienced by only a few high-profile sites.
SOASTA helps companies respond to these unprecedented spikes in traffic by delivering performance intelligence: capturing data about all the elements that impact a website's responsiveness and reliability and immediately turning it into actionable information. SOASTA delivers this service through a combination of a test application, services, and a Performance Test methodology that recognizes the unique challenges of testing live websites at peak traffic loads.
Most IT professionals tend to describe the cloud in terms of services, typically categorized as:
Software-as-a-Service (SaaS) provides application capabilities, such as backup, email, customer relationship management (CRM), billing, or testing.
Platform-as-a-Service (PaaS) offerings, such as Force.com, Google Apps, Microsoft Azure and Engine Yard, are designed to make it easier to build and deploy applications in a specific multi-tenant environment without having to worry about the underlying computing stack.
Infrastructure-as-a-Service (IaaS) allows companies to take advantage of the compute and storage resources delivered by vendors such as Amazon, Rackspace and GoGrid. The differentiator from traditional hosting or managed service providers is the greater flexibility provided by virtualization and elastic application programming interfaces (APIs).
Software and Infrastructure services are the farthest along in terms of adoption. While SaaS has often been positioned as an alternative to open source, proprietary offerings from companies like Salesforce.com now have open source counterparts such as SugarCRM and Zimbra. Similarly, the early infrastructure providers developed proprietary APIs and used commercial virtualization tools to build out their services. Today, projects such as libCloud and Eucalyptus are emerging as open source alternatives for providing elasticity, and open source virtualization solutions are also available.
Clouds are often categorized as private, public, or hybrid. Internal clouds, usually referred to as private clouds, are typically about optimizing a private infrastructure. Public clouds offer resources that anyone can access. Today, public cloud vendors are offering hybrid alternatives that leverage their compute and storage resources yet require proper authorization for access. Extending an internal or managed infrastructure by renting virtualized resources on-demand has become an increasingly viable option. As a result, the line between public and private clouds is starting to blur.
With a private cloud, instead of renting the infrastructure provided by others, companies use commercial or open source products to build their own cloud. Of course, more control means more responsibility. These companies need to address challenges such as repurposing servers, choosing a virtualization platform, image management and provisioning, and capturing data for budgeting and charge back.
The Killer App for the Cloud
It has become clear that the infrastructure services provided by the cloud are being driven by applications that are particularly well suited to taking advantage of virtualization and elasticity. Elasticity is most commonly associated with changes in supply and demand based on price. The definition for the cloud is not much different. An elastic API refers to an infrastructure vendor's ability to quickly respond to demand by allowing customers to quickly spin up servers, and just as quickly take them down. For applications such as performance testing this is incredibly important.
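To make the elastic contract described above concrete, here is a minimal Python sketch: capacity is requested for a burst, released immediately afterwards, and billed only for the instance-hours actually consumed. The class, its method names, and the hourly rate are hypothetical illustrations, not any vendor's actual API.

```python
class ElasticPool:
    """Toy model of an elastic infrastructure API (illustration only;
    the names here are hypothetical, not Amazon's or SOASTA's interface)."""

    def __init__(self, hourly_rate=0.10):
        self.rate = hourly_rate          # assumed pay-per-use price
        self.running = 0                 # servers currently provisioned
        self.instance_hours = 0.0        # cumulative billed usage

    def scale_to(self, n, hold_hours=1.0):
        """Provision (or release) servers to reach n, then bill the usage."""
        self.running = n
        self.instance_hours += n * hold_hours

    def cost(self):
        """Pay only for what was actually used."""
        return round(self.instance_hours * self.rate, 2)

pool = ElasticPool(hourly_rate=0.10)
pool.scale_to(500, hold_hours=2)   # two-hour load test on 500 servers
pool.scale_to(0)                   # release everything when the test ends
print(pool.cost())                 # 100.0 -> 1000 instance-hours at $0.10
```

The point of the sketch is the shape of the economics: a thousand instance-hours of burst capacity costs the same whether it is consumed in one afternoon or spread across a month, which is exactly what makes short-lived workloads like load testing a natural fit.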
The best-known examples are companies that leverage the cloud to accommodate dramatic swings in traffic or short-term application requirements. For instance, pharmaceutical and financial services companies have heavy compute requirements that may last for hours or weeks at a time. These apps are a great match for the ability to deploy hundreds or thousands of compute cores on-demand. As a specific example, last year Intuit.com deployed servers in the cloud for its TaxCaster application, since it needed to support peak traffic for only the few weeks of the year surrounding tax deadlines.
Testing and development have become killer apps for the cloud. Development and test beds are often deployed for short periods of time, see increased use based on business cycles, need to scale on-demand to meet specific requirements such as duplicating production infrastructure or generating load, have limited privacy and security concerns, and are simple to deploy.
Leveraging the cloud for testing yields a number of benefits:
it's now possible to test at both typical and peak traffic levels, from hundreds of users to millions
generating geographically dispersed load provides the most accurate representation of real-world traffic
the lower cost of renting hardware and using an on-demand service lets testers keep pace with accelerated development cycles, making agile performance testing a realistic alternative
measuring both internal and external tests, and using both the lab and production environments, provides the most efficient and effective results
testing live web-based applications in production and from outside the firewall is the only way to gain complete confidence in the application
This last point usually causes folks to sit up and take notice. There's risk involved in testing a live, production site. Historically, companies would test a fraction of expected load in the lab and then extrapolate what those results would mean when actual load was hitting the site. So where's the greater risk: testing a live site or not knowing if your site will scale to meet demand?
According to JP Garbani, Vice President and Principal Analyst at Forrester Research, 74% of application performance problems are still reported by end-users to a service/support desk rather than found by infrastructure management. That may have been somewhat acceptable when most applications were internal facing and had hundreds of users. But web applications are exposed to a much larger potential audience, often made up of customers. And the failure or poor performance of an application has a direct impact on the perception of the brand and, potentially, revenue.
The SOASTA Performance Test Methodology
To address this gap, SOASTA drew on its experience deploying nearly 300,000 on-demand load testing servers to develop a methodology that builds on existing best practices, extending traditional approaches to address the new opportunities and challenges presented by cloud testing. The following sections provide a high-level view of this methodology.
Testing in the Performance Lab
Cloud testing does not obviate the need or eliminate the benefits of testing in a lab environment as well as the production environment, and it's important to have continuity between the two. Ongoing performance testing in a lab allows application engineering teams to assess performance over time, and helps catch any show-stopping performance bugs before they reach production. In addition, the lab provides a place to performance test code and configuration changes for performance regression, before releasing changes to production and outside of the normal build cycle. This could include things like a quick bug fix in a page, or a seemingly minor configuration change that could have a performance impact and should be tested before it is deployed. Often, these kinds of changes are deployed with little to no testing and come back later to cause performance issues.
Testing in Production
Testing in production is the best way to get a true picture of capacity and performance in the real world. Testing in production is the only way to ensure that online applications will perform as expected. There are many things that SOASTA's production testing approach typically catches that cannot be found with traditional test methods. These include:
batch jobs that are not present in the lab (log rotations, backups, etc.) or the impact of other online systems affecting performance
load balancer performance issues, such as mis-configured algorithm settings
network configuration problems, such as switch ports set to 100Mbps instead of 1Gbps, and routing problems
latency between systems inside and outside of application bubbles
SOASTA's production testing methodology helps identify the invisible walls that show up in architectures after they move out of the lab. Traditionally, testers have been limited to extrapolating from small tests on a few servers in a lab to predict whether production can support exponentially higher load. Without proper testing, these assumptions often lead to hitting unexpected barriers after years of steady traffic growth. We have seen that successful companies use production testing to learn things about the performance of their sites that they could never have learned in a lab.
Strategy and Planning
This approach to performance engineering calls for an umbrella strategy with associated individual test plans. Test plans roll up into an overall strategy that ensures confidence in the ability of key revenue generating applications to perform as expected. The result is an ongoing performance engineering strategy throughout an application's evolution. It includes a number of test plans centered on individual objectives, such as holiday readiness, a major architectural change, or the release of a major version of code.
Having a well-defined strategy, with explicit test plans, provides business and engineering leaders with a high degree of confidence in operational readiness. Using this approach gives greater insight into an application's performance and readiness.
Using an iterative process within test plans to achieve defined goals creates a cycle of continuous improvement in the applications being tested: each iteration starts with the test definition and ends with actionable intelligence.
The process of creating a test plan starts with the define phase. During this phase, the flows to be tested throughout the site are defined, metrics to be monitored are established, and success criteria for the tests are agreed upon.
In the design phase, the user scenarios are written and test parameters are set up. Things such as the mix of users executing different parts of the application, the virtual user targets, and the ramp-time are modeled.
The test phase is where the execution of tests takes place, and where data is collected for assessment.
Finally, the assess phase, parts of which may occur during the test execution, is when the data collected throughout test execution is used to provide actionable intelligence.
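The four phases above can be sketched as a simple data structure that carries a plan from definition through assessment. This is an illustrative model only; the field names, scenarios, and thresholds are hypothetical, not a CloudTest schema.

```python
from dataclasses import dataclass, field

@dataclass
class TestPlan:
    """Sketch of the define -> design -> test -> assess cycle."""
    # Define phase: flows, metrics, and agreed success criteria.
    flows: list
    metrics: list
    success_criteria: dict          # metric -> maximum acceptable value
    # Design phase: user mix, virtual user targets, ramp time.
    user_mix: dict = field(default_factory=dict)   # scenario -> fraction
    peak_virtual_users: int = 0
    ramp_minutes: int = 0
    # Test phase output, consumed by the assess phase.
    results: dict = field(default_factory=dict)

    def design(self, user_mix, peak, ramp_minutes):
        assert abs(sum(user_mix.values()) - 1.0) < 1e-9, "mix must sum to 1"
        self.user_mix = user_mix
        self.peak_virtual_users = peak
        self.ramp_minutes = ramp_minutes

    def assess(self):
        """Compare collected results against the agreed success criteria."""
        return {metric: self.results.get(metric, float("inf")) <= limit
                for metric, limit in self.success_criteria.items()}

plan = TestPlan(
    flows=["browse", "search", "checkout"],
    metrics=["p95_response_ms", "error_rate"],
    success_criteria={"p95_response_ms": 2000, "error_rate": 0.001},
)
plan.design({"browse": 0.6, "search": 0.3, "checkout": 0.1},
            peak=50_000, ramp_minutes=30)
plan.results = {"p95_response_ms": 1850, "error_rate": 0.0004}
print(plan.assess())   # {'p95_response_ms': True, 'error_rate': True}
```

Capturing the success criteria in the define phase, before any load is generated, is what makes the assess phase mechanical rather than a matter of opinion.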
Types of Tests
The following are common test types included in a plan, which, when taken together, make for a well-rounded view of application performance and reliability. The most successful online application companies are executing on well-defined performance and readiness plans that include a mix of these tests.
Baseline: the most common type of performance test. Its purpose is to achieve a certain level of peak load on a pre-defined ramp-up and sustain it while meeting a set of success criteria such as acceptable response times with no errors.
Spike: simulates steeper ramps of load, and is critical to ensuring that an application can withstand unplanned surges in traffic, such as users flooding into a site after a commercial or email campaign. A spike test might ramp to the baseline peak load in half of the time, or a spike may be initiated in the middle of steady state of load.
Endurance: helps ensure that there are no memory leaks or stability problems over time. These tests typically ramp up to baseline load levels, and then run for anywhere from 2 to 72 hours.
Failure: ramps up to peak load while the team simulates the failure of critical components such as the web, application, and database tiers. A typical failure scenario would be to ramp up to a certain load level, and while at steady state the team would pull a network cable out of a database server to simulate one node failing over to the other. This would ensure that failover took place, and would measure the customer experience during the event.
Stress: finds the breaking point for each individual tier of the application or for isolated pieces of functionality. A stress test may focus on hitting only the home page until the breaking point is observed, or it may focus on having concurrent users logging in as often as possible to discover the tipping point of the login code.
Diagnostic: designed to troubleshoot a specific issue or code change. These tests typically use a specially designed scenario outside of the normal library of test scripts to hit an area of the application under load and to reproduce an issue or verify issue resolution.
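As a rough illustration of how a baseline ramp differs from a spike, the following sketch computes a per-minute virtual-user target for a linear ramp followed by a steady state. The function and its parameters are assumptions for illustration; real tools such as CloudTest or JMeter have their own ramp configuration.

```python
def ramp_profile(peak, ramp_minutes, hold_minutes):
    """Return the target virtual-user count for each minute of a test:
    a linear ramp to `peak`, then a steady state held for `hold_minutes`."""
    profile = []
    for minute in range(ramp_minutes):
        # Linear interpolation from 0 up to the peak load.
        profile.append(round(peak * (minute + 1) / ramp_minutes))
    profile.extend([peak] * hold_minutes)   # steady state at peak
    return profile

baseline = ramp_profile(peak=10_000, ramp_minutes=20, hold_minutes=10)
# A spike test reaches the same peak in half the ramp time,
# stressing the system's ability to absorb a sudden surge.
spike = ramp_profile(peak=10_000, ramp_minutes=10, hold_minutes=10)
```

The same helper covers several of the test types above by varying its inputs: a long `hold_minutes` approximates an endurance run, while raising `peak` until errors appear approximates a stress test.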
SOASTA CloudTest is deployed as an on-demand service, leveraging the cloud to generate the load. It comprises the methodology described above, the services provided by our experienced load testers, and the Global Cloud Test Platform, which provides a cross-cloud infrastructure for generating load. Open source libraries are a fundamental part of the offering, used throughout the product to provide a variety of functions. SOASTA provides the software as part of the service. As seen in Figure 1, CloudTest is deployed using a distributed architecture in the cloud, complemented by an appliance for testing behind the firewall.
Figure 1: SOASTA Architecture
While customers can use SOASTA's application for test creation and execution, the Global Cloud Test Platform is built to support additional tools, including Apache JMeter, the most popular open source load-testing tool. The SOASTA platform reduces the complexity and time of deploying JMeter scripts to the cloud, making it dramatically easier for the JMeter community to create, deploy, execute and analyze web-scale load and performance tests. JMeter scripts run without modification. Once the test is built, SOASTA takes care of managing and provisioning servers and executing the test.
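To give a feel for the kind of analysis performed on load test output, here is a short sketch that summarizes results in JMeter's JTL/CSV format. The column names follow JMeter's defaults, but the sample data and the `summarize` helper are made up for illustration.

```python
import csv
import io

# A few synthetic rows in JMeter's default JTL/CSV result format
# (column names per JMeter's defaults; the data itself is invented).
JTL = """timeStamp,elapsed,label,responseCode,success
1000,120,home,200,true
1050,340,login,200,true
1100,95,home,200,true
1150,2200,search,500,false
1200,410,login,200,true
"""

def summarize(jtl_text):
    """Compute sample count, error rate, and 95th-percentile latency."""
    rows = list(csv.DictReader(io.StringIO(jtl_text)))
    elapsed = sorted(int(r["elapsed"]) for r in rows)
    errors = sum(r["success"] != "true" for r in rows)
    # Nearest-rank 95th percentile over the sorted response times.
    p95 = elapsed[max(0, int(round(0.95 * len(elapsed))) - 1)]
    return {"samples": len(rows),
            "error_rate": errors / len(rows),
            "p95_ms": p95}

print(summarize(JTL))
```

Even this toy version shows why percentiles, rather than averages, drive the success criteria: the single slow error at 2200ms dominates the p95 while barely moving the mean.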
The key capabilities we've built into this approach come as a result of our experience deploying to the cloud. The first deployment environment was Amazon EC2. Because the requirements for load and performance testing fit almost all of the characteristics described above, Amazon's implementation of a cloud infrastructure was a perfect match. EC2 was the first to provide a platform that dramatically changed the cost equation for computing resources and delivered an elastic API for speed of deployment.
Because the application depends on swiftly provisioning and releasing servers, SOASTA had to quickly identify bad instances and bring up replacements. The provisioning technology in SOASTA's implementation is one of the key features of the platform. As new APIs, including open source alternatives such as libCloud, become available, SOASTA will use them to expand the reach of the Global Test Cloud.
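A replace-bad-instances loop of the kind described above can be sketched as follows. The `launch` and `health_check` callables stand in for a cloud provider's API; this is an illustration under those assumptions, not SOASTA's actual provisioning code.

```python
def provision(count, launch, health_check, max_attempts=None):
    """Keep launching until `count` healthy servers are up, replacing
    any instance that fails its health check."""
    healthy = []
    attempts = 0
    limit = max_attempts or count * 3   # retry headroom for bad instances
    while len(healthy) < count and attempts < limit:
        instance = launch()
        attempts += 1
        if health_check(instance):
            healthy.append(instance)
        # A bad instance is simply discarded and replaced on the next pass.
    return healthy

# Simulated provider in which every 10th instance comes up bad.
counter = {"n": 0}
def launch():
    counter["n"] += 1
    return {"id": counter["n"]}

def health_check(instance):
    return instance["id"] % 10 != 0

servers = provision(100, launch, health_check)
print(len(servers))   # 100 healthy servers after replacing 11 bad ones
```

The design choice worth noting is that bad instances are replaced rather than repaired: when servers are rented by the hour, discarding a faulty one and launching another is almost always faster than diagnosing it.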
The other key capability is a real-time analytic engine built exclusively for testing web and mobile applications, enabling quality assurance and development teams to test and monitor their websites under both typical and extreme traffic conditions. Given the massive amounts of data generated in web-scale tests, including resource metrics captured while the test executes, a cloud-based, highly scalable engine is required to deliver actionable information in real time.
The cloud has approached that point in the hype cycle where its value is being questioned by various pundits because the benefits don't necessarily conform to their specific requirements. The reality is that many companies have found tremendous value. When combined with web-based technology, experienced people and a new methodology, it is clear that performance testing from the cloud can help tame the chaos associated with large scale.