Performance analysis of your sites helps ensure readiness for unexpected high-traffic situations such as sudden request spikes. It also brings greater resiliency for planned high-traffic events and for normal day-to-day traffic.
Timing a site audit
If you’re unsure whether there are performance issues in your site’s codebase, it’s best to analyze the site sooner rather than later. We at WordPress VIP have worked with many customers, and performance issues can happen at any time, not only during a high-traffic event.
Performance issues can be hidden
There are plenty of sites that operate well under normal traffic but quickly run into problems when something minor changes. For example, over time, as the number of posts and comments increases, database queries approach a threshold where they may need to use physical disk space to sort or scan. At that point, a single bot crawling a number of uncached pages, or a slight increase in traffic, can easily push the site into visibly poor performance. That poor performance was there the entire time, just less visible.
Optimizing for better performance brings resiliency so the site can handle minor traffic bumps.
Frequent, regular reviews are recommended
The most important part of analyzing performance is to do it continuously. Teams attempting to deliver new features are often diverted to make last-minute adjustments for performance optimization as the deadline for the high traffic event approaches. Waiting until just before a major event can result in compromises and unsatisfactory outcomes.
Getting assistance from WordPress VIP
If your support package includes Application Support or Enterprise Support, our team can assist you in performance analysis by looking more closely at any part of your site and offering advice on addressing performance issues.
What follows here are guidelines your development team can use to understand how performance issues can happen, audit the site and code, and prioritize areas for assistance from the WordPress VIP Customer Success team.
Priorities for a performance analysis
The WordPress VIP platform provides resiliency via full page cache, object cache, database indexes, database read replicas, separate handling of static files, on-the-fly image size adjustments, and auto-scaling of web containers.
The most critical metrics for site resilience and performance are cache hit rate and the speed of page generation for origin requests (i.e. uncached requests).
Cache hit rate
The edge cache serves to speed up responses to commonly requested URLs and protect and buffer your site’s server resources from increased load, such as traffic spikes.
The cache “hit rate” is the percentage of requests served by the edge cache (a full-page Varnish cache) as compared to the requests that bypass the cache and go to the origin.
If your requests have a high cache hit rate, the site’s back-end resources (web containers, database, and object cache) will be less busy and more available to handle traffic changes. Sudden traffic spikes are better served by the edge cache as that avoids putting additional load on the server resources.
You can help ensure a high cache hit rate by avoiding unnecessary query parameters, cookies that vary responses per user, and non-GET requests.
The cache-control headers sent by your application control how long a page or response is retained in the cache. If some of your endpoints rarely change, you can set a higher time to live (or max-age) for those routes.
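As a minimal sketch of raising the TTL for a rarely changing page (the `is_page( 'pricing' )` check and one-hour TTL are illustrative assumptions, not recommendations for your site):

```php
// Hedged sketch: raise the edge-cache TTL for a rarely changing route.
// The is_page( 'pricing' ) check and one-hour TTL are illustrative only.
add_action( 'template_redirect', function () {
	if ( is_page( 'pricing' ) && ! is_user_logged_in() ) {
		// Ask the edge cache to retain this response for an hour.
		header( 'Cache-Control: max-age=3600' );
	}
} );
```

The logged-in check is there because responses that vary per user should not be given a long shared-cache lifetime.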
Page generation times
The time it takes your server resources to process, understand, and respond to a particular request is the page generation time. It’s only one part of the load time your audience experiences, but a critical aspect of your site’s resilience.
Requests that make it past the edge cache must be served by the origin servers (your application code on a web container). These responses should be generated as quickly as possible. Even with autoscaling, resources are not infinite: the higher the demand for resources, the less resilient the site is to variations in traffic load.
For uncached traffic, you can ensure the fastest page generation time with the following:
- Optimized database queries
- Use of the object cache for frequently accessed data, slow queries, or remote requests
Both are important, and the two work together.
It’s best not to completely rely on the object cache to eliminate all issues associated with a slower query. The object cache values will inevitably need to be replaced and cache stampedes combined with slow queries can cause disruptions. Instead, optimize the underlying slow queries as much as possible to reduce the time window of a potential cache stampede (i.e. replace the cache quickly) and, if necessary, implement logic to reduce the chance of a stampede by serving stale data during the update period.
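As one hedged illustration of the serve-stale-while-refreshing idea (the `myprefix_` function names, cache keys, and TTLs are invented for this sketch; `wp_cache_get()`, `wp_cache_add()`, `wp_cache_set()`, and `wp_cache_delete()` are the core WordPress object cache APIs):

```php
// Sketch: let one request refresh the cache while others serve stale data.
function myprefix_get_expensive_data() {
	$data = wp_cache_get( 'expensive_data', 'myprefix' );

	// 'expires_at' is our own soft-expiry convention, stored with the value,
	// set to fire before the real (hard) cache TTL below.
	if ( false === $data || time() > $data['expires_at'] ) {
		// wp_cache_add() only succeeds for one request at a time: a cheap
		// lock so a single request does the slow work during a refresh.
		if ( wp_cache_add( 'expensive_data_lock', 1, 'myprefix', 30 ) ) {
			$fresh = myprefix_run_slow_query(); // hypothetical slow operation
			$data  = array(
				'value'      => $fresh,
				'expires_at' => time() + 240, // soft expiry, ~1 min before hard TTL
			);
			// Keep the entry past its soft expiry so stale data stays available.
			wp_cache_set( 'expensive_data', $data, 'myprefix', 300 );
			wp_cache_delete( 'expensive_data_lock', 'myprefix' );
		}
	}

	// Callers should handle null: a cold cache plus a held lock means no
	// data is available yet.
	return is_array( $data ) ? $data['value'] : null;
}
```

Note that faster underlying queries still matter here: the shorter the refresh inside the lock, the smaller the window in which other requests serve stale data or miss entirely.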
These optimizations are even more important if your posts table has grown large. A large number of posts can easily result in standard core database queries needing to use the filesystem to sort results even in cases where only a few posts are being returned. When the database frequently uses the filesystem, other fast queries will be delayed and overall performance will be reduced. Fortunately, most queries can be easily adjusted to make them faster.
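For example, a few standard `WP_Query` arguments can cheaply reduce the work a large posts table has to do. This sketch assumes pagination counts, sticky posts, and meta/term data are not needed for the loop in question; drop the flags that don’t apply:

```php
// Sketch: common WP_Query optimizations for sites with large posts tables.
$query = new WP_Query( array(
	'post_type'              => 'post',
	'posts_per_page'         => 10,
	'no_found_rows'          => true,  // skip the total-row count when pagination links aren't needed
	'ignore_sticky_posts'    => true,  // avoid the extra sticky-post handling
	'update_post_meta_cache' => false, // skip priming post meta if it isn't used
	'update_post_term_cache' => false, // skip priming terms if they aren't used
) );
```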
With an understanding of what’s needed to improve performance and maintain site resilience at a high level, the next and more difficult task is usually just identifying what needs to be improved.
Identifying performance issues
Many performance issues are easy to identify and correct. If you’re not familiar with WordPress VIP’s performance analysis tools, here’s an overview:
PHP_CodeSniffer will flag many (but not all) possible performance issues for a manual review.
Debug Bar / Query Monitor
This front-end tool can identify PHP errors, slow queries, remote requests, and other anomalies on specific pages. It also reports the page generation time.
New Relic
This Application Performance Monitoring tool shows current and historical average page generation times. It also captures traces of slow requests. You can review the traces to see where the most time is being spent.
If a particular URL has an intermittently slow issue due to object caches expiring, a trace should be captured. These issues can be difficult to catch in Query Monitor.
New Relic also has a summary of database queries and remote requests, so you can identify potential issues specific to a table or API endpoint.
Other places you can look to identify pain points include the following:
Usually you’ll look at specific items in your site’s code repository after identifying something with one of the above tools, but your application code may benefit from an occasional manual review. Focus especially on the places in code where remote requests and queries are defined.
Request logs are available via log shipping and can be analyzed with a variety of tools. Logs include the page generation time, response code, and cache status. By analyzing your site’s traffic patterns you can identify the slowest requests and focus attention on the code involved with those.
Errors and warnings
PHP errors, warnings, and notices (via New Relic) can point out issues in code. The fewer of these you have on a regular basis, the easier it will be to identify, review, and resolve new issues. Ideally, your application routinely generates no errors, and few to no warnings.
Local (or non-production) environment
With a local development environment (application and database) you can run end-to-end tests. These may not completely replicate the VIP environment, but it’s a good opportunity to understand the code executed during a request. If your local database has a recent copy of production data, then MySQL’s EXPLAIN command will be more accurate and can reveal inefficient queries.
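As an illustration (the query below is a generic core-style example, not taken from your site), running EXPLAIN against a local copy might look like:

```sql
-- Inspect how MySQL plans a core-style posts query.
EXPLAIN SELECT ID, post_title
FROM wp_posts
WHERE post_status = 'publish' AND post_type = 'post'
ORDER BY post_date DESC
LIMIT 10;
-- In the output, watch the Extra column for "Using filesort" or
-- "Using temporary", and the rows column for large scan estimates.
```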
If you need help reviewing performance and you can describe specific recent issues (with times they occurred), open a support ticket, and we’ll investigate. Where possible we identify the cause and make recommendations to address issues.
We can also help with running MySQL EXPLAIN, sharing recent PHP warnings and errors, capturing slow queries, assisting with using New Relic or Query Monitor, or assessing and making recommendations on planned optimizations or new features to help minimize the impact to performance.
What to look at
Ensuring the fastest possible page generation time should be a priority. Continually monitoring the Apdex score in New Relic provides a baseline performance indicator. Request logs help identify the slowest requests. Audits should prioritize the main features of your site.
Most used URLs
Review the most frequently used pages and their templates, including the homepage, major landing pages, the single page/post templates, and category/archive templates.
A starting point for performance monitoring and improvement might be:
- Ensuring that 99% of the time requests take less than 500ms to generate a result; and
- Making uncached requests take as little time as possible.
Review uncached templates that can suddenly receive a high level of traffic from bots or users:
- This includes the 404 template, your sitemaps, any search functionality, infinite scrolling, and AJAX-generated requests.
- Verify that the 404 template does not make any database queries or remote requests, especially any based on the request URL, aside from the core queries needed to determine that the URL is not found. Ideally, your 404 template is the fastest route on your site and handles requests in less than 250ms, because 404 responses are cached only briefly, not for as long as a normal URL.
- Confirm that front-end, consumer-originated requests — whether from browsers or mobile apps — do not routinely result in database writes or remote requests.
- Ensure database queries and remote requests follow best practices, are performant and cached, and items used across many pages avoid cache stampedes.
- Ensure remote requests have reasonable timeout and retry settings. If an endpoint used to fetch data becomes unresponsive, how would it impact the site?
- Ensure API requests and consumer-originated AJAX requests also perform quickly.
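On the timeout point above, here is a hedged sketch (the URL, cache key, and TTLs are invented for illustration) of a short timeout paired with a cached fallback, so an unresponsive endpoint does not stall page generation:

```php
// Sketch: short timeout plus a cached fallback for a remote request.
$response = wp_remote_get( 'https://api.example.com/data', array(
	'timeout' => 3, // seconds; the default can be too generous for front-end requests
) );

if ( is_wp_error( $response ) ) {
	// Fall back to the last good copy rather than failing the page.
	$body = wp_cache_get( 'myprefix_api_data_fallback', 'myprefix' );
} else {
	$body = wp_remote_retrieve_body( $response );
	wp_cache_set( 'myprefix_api_data_fallback', $body, 'myprefix', 6 * HOUR_IN_SECONDS );
}
```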
Logs and APM traces
Review request logs (available via our log shipping option) for the following:
- The presence of significantly slower requests
- Patterns of bot activity or bot requests for pages that don’t need to be indexed (for example, bots should be requesting sitemaps and individual posts, but not crawling through every page of an archive or tag)
- The presence of cache busting URL parameters
- Requests for WP uploads files, especially large images, without resize parameters
Plugins
Plugins can sometimes significantly affect performance. Many are not tested under high-traffic conditions or with sites that have extremely large databases. Even a few WordPress core functions can be slow on large sites if not optimized.
- Review your plugins and modules in New Relic to ensure they are performing quickly. Any frequently slow portions of traces should be looked at closely.
- Use the tools listed above to ensure the plugins you use are not throwing warnings and errors, are not making front-end database writes, and that any queries or filesystem operations complete in a reasonable time.
- Review any documented plugin or core issues, especially if you have a significant amount of content, because large database tables are often not a priority during plugin development or core testing.