Performance analysis helps ensure that a site is ready for unexpected high-traffic situations, including unpredictable request spikes. It also improves resiliency during expected high-traffic events and under normal day-to-day traffic.
Timing a site audit
It is beneficial to analyze a site sooner rather than later. Performance issues can happen at any time, not only during a high-traffic event.
Performance issues can be hidden
Many sites operate well under normal traffic, but performance issues can surface when something minor changes. For example, as the number of posts and comments increases over time, database queries approach a threshold where they may need to use physical disk space to sort or scan. In that scenario, a single bot crawling a number of uncached pages, or a slight increase in traffic, can easily push the site into more obviously poor performance. That poor performance was there the entire time, just less visible.
Optimizing for better performance brings resiliency so that a site can handle minor traffic bumps.
Frequent regular reviews are recommended
The most important part of analyzing performance is to do it continuously. Teams attempting to deliver new features are often diverted to make last-minute adjustments for performance optimization as the deadline for the high traffic event approaches. Waiting until just before a major event can result in compromises and unsatisfactory outcomes.
Priorities for a performance analysis
The WordPress VIP platform provides resiliency via full page cache, object cache, database indexes, database read replicas, separate handling of static files, on-the-fly image size transformation, and auto-scaling of web containers.
The most critical metrics for site resilience and performance are cache hit rate and the speed of page generation for origin requests (i.e. uncached requests).
Cache hit rate
The edge cache serves to speed up responses to commonly requested URLs and protect and buffer a site’s server resources from increased load, such as traffic spikes.
The cache “hit rate” is the percentage of requests served by the edge cache (a full-page Varnish cache) as compared to the requests that bypass the cache and go to the origin.
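As a minimal illustration of that definition, the hit rate can be computed from two counters. The function name and the sample counts below are hypothetical, not values taken from any VIP tool.

```python
def cache_hit_rate(hits: int, origin_requests: int) -> float:
    """Return the percentage of requests served by the edge cache.

    `hits` counts requests answered from cache; `origin_requests` counts
    requests that bypassed the cache and reached the origin.
    """
    total = hits + origin_requests
    if total == 0:
        return 0.0  # no traffic observed, so report 0 rather than divide by zero
    return 100.0 * hits / total


# Hypothetical sample: 9,500 cached responses and 500 origin requests.
print(cache_hit_rate(hits=9_500, origin_requests=500))  # 95.0
```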
If your requests have a high cache hit rate, the site’s back-end resources (web containers, database, and object cache) will be less busy and more available to handle traffic changes. Sudden traffic spikes are better served by the edge cache as that avoids putting additional load on the server resources.
You can ensure a high cache hit rate by avoiding additional query parameters, user cookies, and non-
The cache-control headers sent by an application control how long a page or response is retained in cache. If some of a site’s endpoints rarely change, a higher Time To Live (or max-age) can be set for those routes.
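The idea of a longer max-age for rarely changing routes can be sketched as a lookup from route prefix to header value. This is a language-neutral illustration in Python; the route prefixes, the TTL values, and the default are all assumptions chosen for the example, not platform defaults.

```python
# Hypothetical route prefixes mapped to max-age values in seconds.
ROUTE_MAX_AGE = {
    "/about/": 3600,  # rarely changes, so it can stay cached longer
    "/feed/": 300,
}
DEFAULT_MAX_AGE = 60  # assumed fallback for every other route


def cache_control_for(path: str) -> str:
    """Return a Cache-Control header value for the given request path."""
    for prefix, max_age in ROUTE_MAX_AGE.items():
        if path.startswith(prefix):
            return f"max-age={max_age}"
    return f"max-age={DEFAULT_MAX_AGE}"
```

In a real application the returned value would be sent as the `Cache-Control` response header for that route.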
Page generation times
“Page generation time” is the time it takes a site’s server resources to process, understand, and respond to a particular request. It is a small part of what a site’s audience cares about, but a critical aspect of a site’s resilience.
Requests that make it past the edge cache must be served by the origin servers (an application’s code on a web container). These responses should be generated as quickly as possible. Even with autoscaling, resources are not infinite: the higher the demand for resources, the less resilient the site is to variations in traffic load.
For uncached traffic, faster page generation time can be ensured by:
- Optimizing database queries.
- Using the object cache for frequently run or slow queries and for remote requests.
Both are important, and the two work together.
It is best not to completely rely on the object cache to eliminate all issues associated with a slower query. The object cache values will inevitably need to be replaced and cache stampedes combined with slow queries can cause disruptions. Instead, optimize the underlying slow queries as much as possible to reduce the time window of a potential cache stampede (i.e. replace the cache quickly) and, if necessary, implement logic to reduce the chance of a stampede by serving stale data during the update period.
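One way to serve stale data during the update period is sketched below. It is an in-memory illustration only: the class name is hypothetical, it does not use the WordPress object cache API, and it does not deduplicate cold misses (when no previous value exists, every caller computes).

```python
import threading
import time
from typing import Any, Callable, Dict, Set, Tuple


class StaleWhileRefreshCache:
    """Serve the previous value while one caller recomputes an expired key."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._values: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expires_at)
        self._refreshing: Set[str] = set()
        self._lock = threading.Lock()

    def get(self, key: str, compute: Callable[[], Any]) -> Any:
        now = time.monotonic()
        with self._lock:
            entry = self._values.get(key)
            if entry is not None and entry[1] > now:
                return entry[0]  # fresh value
            if entry is not None and key in self._refreshing:
                return entry[0]  # stale, but someone else is already refreshing
            self._refreshing.add(key)  # this caller takes on the refresh
        try:
            value = compute()  # the slow query runs outside the lock
            with self._lock:
                self._values[key] = (value, time.monotonic() + self.ttl)
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

The faster `compute` is, the shorter the window in which other callers receive stale data, which is why optimizing the underlying query still matters.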
These optimizations are even more important if your posts table has grown large. A large number of posts can easily result in standard core database queries needing to use the filesystem to sort results even in cases where only a few posts are being returned. When the database frequently uses the filesystem, other fast queries will be delayed and overall performance will be reduced. Fortunately, most queries can be easily adjusted to make them faster.
With an understanding of what’s needed to improve performance and maintain site resilience at a high level, the next and more difficult task is usually just identifying what needs to be improved.
Identifying performance issues
Many performance issues are easy to identify and correct. If you’re not familiar with WordPress VIP’s performance analysis tools, here’s an overview:
Debug Bar and Query Monitor
Runtime Logs reports PHP errors (fatals, warnings, and notices) for WordPress applications, and retrieves output sent to stderr for Node.js applications.
Investigating specific items in a site’s code repository is usually done after using a separate debugging tool to identify an issue, but application code may benefit from an occasional manual review. Focus especially on the places in code where remote requests and queries are defined.
Local (or non-production) environment
Though a local environment is not an absolute replica of a VIP environment, it can provide insights and understanding of the code executed during a request. If a recent backup of a production environment’s database has been imported to a VIP Local Development Environment, MySQL’s EXPLAIN command will be more accurate and can reveal inefficient queries. End-to-end tests can be run with a VIP Local Development Environment if it is running with a clone of the application’s GitHub repository.
New Relic, an Application Performance Monitoring (APM) tool, shows current and historical average page generation times. It also captures traces of slow requests. Traces can be reviewed to determine where the most time is being spent.
If a particular URL has an intermittently slow issue due to object caches expiring, a trace should be captured. These issues can be difficult to catch in Query Monitor.
PHP errors, warnings, and notices (via New Relic) can point out issues in code. The fewer code issues that exist on a regular basis, the easier it will be to identify, review, and resolve new issues. Ideally, an application routinely generates no errors, and few to no warnings.
New Relic also has a summary of database queries and remote requests, so potential issues specific to a table or API endpoint can be identified.
PHP_CodeSniffer will flag many (but not all) possible performance issues for a manual review.
Request logs are available via log shipping and can be analyzed with a variety of tools. Logs include the page generation time, response code, and cache status. By analyzing your site’s traffic patterns, you can identify the slowest requests and focus attention on the code involved with those.
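A rough sketch of that kind of analysis is shown below. It assumes the shipped logs have been converted to JSON lines with hypothetical field names `url` and `generation_ms`; the real log format and field names may differ.

```python
import json
from collections import defaultdict
from statistics import mean
from typing import Iterable, List, Tuple


def slowest_urls(log_lines: Iterable[str], top_n: int = 3) -> List[Tuple[str, float]]:
    """Return the `top_n` URLs with the highest average generation time."""
    times = defaultdict(list)
    for line in log_lines:
        record = json.loads(line)
        # Field names are assumptions for this sketch.
        times[record["url"]].append(record["generation_ms"])
    averages = [(url, mean(values)) for url, values in times.items()]
    return sorted(averages, key=lambda item: item[1], reverse=True)[:top_n]
```

Passing an open file handle as `log_lines` processes the log one line at a time, so large files do not need to be loaded into memory first.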
If assistance is required to review a site’s performance, create a VIP Support ticket that describes specific recent issues and the times they occurred. VIP’s Support team will investigate and identify causes wherever possible and make recommendations to address the underlying causes.
What to look at
Ensuring the fastest possible page generation time should be a priority. Continually monitoring the Apdex score in New Relic provides a baseline performance indicator. Request logs are helpful to identify the slowest requests. Audits should prioritize the main features of your site first.
Most used URLs
Review the most frequently used pages and their templates, including the homepage, major landing pages, the single page/post templates, and category/archive templates.
A starting point for performance monitoring and improvement might be:
- Ensuring that 99% of requests take less than 500ms to generate a result; and
- Making uncached requests take as little time as possible.
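The 99% target above can be checked against a list of page generation times with a nearest-rank percentile. This is a sketch under the assumption that the times have already been extracted from logs or an APM export; the function names are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(generation_ms: Sequence[float], pct: int = 99, target_ms: float = 500) -> bool:
    """True when the pct-th percentile generation time is under the target."""
    return percentile(generation_ms, pct) < target_ms
```

Averages hide outliers, so a percentile is usually the more useful number when the goal is expressed as "99% of requests".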
Review uncached templates that can suddenly receive a high level of traffic from bots or users:
- This includes the 404 template, your sitemaps, any search functionality, infinite scrolling or AJAX-generated requests.
- Verify that the 404 template does not make any database queries or remote requests, especially any based on the request URL, aside from the core queries needed to determine that the URL is not found. Optimally, your 404 template is the fastest route on your site and handles requests in less than 250ms, because 404 status responses are not cached for as long as a normal URL: they are cached for a very brief time.
- Confirm that front-end, consumer-originated requests — whether from browsers or mobile apps — do not routinely result in database writes or remote requests.
- Ensure database queries and remote requests follow best practices, are performant and cached, and items used across many pages avoid cache stampedes.
- Ensure remote requests have a reasonable timeout and retry setting. If an endpoint that is accessed to fetch data became unresponsive, how would it impact the site?
- Ensure API requests and consumer-originated AJAX requests also perform quickly.
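The timeout and retry point can be illustrated with a small wrapper. This is a Python sketch rather than WordPress code: the function name, the default values, and the `opener` parameter (added so the behavior can be exercised without a network) are assumptions for the example.

```python
import time
import urllib.request
from typing import Callable, Optional


def fetch_with_retry(
    url: str,
    timeout_seconds: float = 3.0,
    max_attempts: int = 2,
    backoff_seconds: float = 0.5,
    opener: Callable = urllib.request.urlopen,
) -> Optional[bytes]:
    """Fetch a URL with a bounded timeout and a small, fixed number of attempts."""
    for attempt in range(1, max_attempts + 1):
        try:
            with opener(url, timeout=timeout_seconds) as response:
                return response.read()
        except OSError:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            if attempt < max_attempts:
                time.sleep(backoff_seconds * attempt)
    return None  # caller decides on a fallback, such as cached or default content
```

Returning `None` instead of raising keeps an unresponsive endpoint from failing the whole page: the worst case is bounded by roughly `max_attempts` timeouts plus the backoff.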
Logs and APM traces
Review request logs (available via our log shipping option) for the following:
- The presence of significantly slower requests
- Patterns of bot activity or bot requests for pages that don’t need to be indexed (for example, bots should be requesting sitemaps and individual posts, but not crawling through every page of an archive or tag)
- The presence of cache busting URL parameters
- Requests for WP uploads files, especially large images, without resize parameters
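Two of the checks above, cache-busting URL parameters and upload images requested without resize parameters, can be roughly automated as follows. The allowlist of expected query parameters and the set of resize parameter names are assumptions for this sketch and should be adjusted to the site being reviewed.

```python
from typing import List
from urllib.parse import parse_qs, urlsplit

ALLOWED_PARAMS = {"s", "paged"}  # hypothetical allowlist of expected parameters
RESIZE_PARAMS = {"w", "h", "resize", "fit"}  # assumed resize parameter names
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".webp")


def flag_request(url: str) -> List[str]:
    """Return a list of review flags for one requested URL."""
    parts = urlsplit(url)
    params = set(parse_qs(parts.query, keep_blank_values=True))
    path = parts.path.lower()
    flags = []
    is_upload_image = "/wp-content/uploads/" in path and path.endswith(IMAGE_EXTENSIONS)
    if is_upload_image:
        if not params & RESIZE_PARAMS:
            flags.append("image-without-resize")
    elif params - ALLOWED_PARAMS:
        flags.append("unexpected-query-params")
    return flags
```

Counting the flags per URL or per user agent across a day of logs gives a quick view of where cache hit rate is being lost.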
Plugins can sometimes significantly affect performance. Many are not tested under high traffic conditions or with sites that have extremely large databases. Even a few WordPress core functions can be slow on large sites if not optimized.
- Review your plugins and modules in New Relic to ensure they are performing quickly. Any frequently slow portions of traces should be looked at closely.
- Use the tools listed above to ensure the plugins you are using are not throwing warnings and errors, are not making front-end database writes, and any queries or filesystem operations are completing in a reasonable time.
- Review any documented plugin or core issues, especially if you have a significant amount of content, because large database tables are often not a priority during plugin development or core testing.