Analyze a site’s server performance
Performance analysis helps ensure that a site is ready for unexpected high-traffic situations such as unpredictable request spikes. It also provides greater resiliency for expected high-traffic events and for normal traffic on an average day.
Professional Service Upgrade
Customers who add the WordPress VIP Performance Service to their support package can work with a team of VIP’s skilled engineers to address needs and goals related to site performance, conduct performance testing, and identify opportunities to improve the performance of their WordPress VIP-hosted website.
Timing a site audit
It is beneficial to analyze a site sooner rather than later. Performance issues can happen at any time, not only during a high-traffic event.
Performance issues can be hidden
Many sites operate well under normal traffic, but performance issues can surface after a minor change. For example, as the number of posts and comments increases over time, database queries can approach a threshold where they may need to use physical disk space to sort or scan. In that scenario, a single bot crawling a number of uncached pages, or a slight increase in traffic, can easily push a site into more obvious poor performance. That poor performance was there the entire time, just less visible.
Optimizing for better performance creates resiliency, allowing a site to handle minor increases in traffic.
Frequent reviews on a regular cadence are recommended
The most important part of analyzing performance is to do it continuously. Teams attempting to deliver new features are often diverted to make last-minute performance optimizations as the deadline for a high-traffic event approaches. Waiting until just before a major event can result in compromises and unsatisfactory outcomes.
Priorities for a performance analysis
The WordPress VIP platform provides resiliency via full page cache, object cache, database indexes, database read replicas, separate handling of static files, dynamic image size transformation, and auto-scaling of web containers.
The most critical metrics for site resilience and performance are cache hit rate and the speed of page generation for origin requests (i.e. uncached requests).
Cache hit rate
VIP’s edge cache servers speed up responses to commonly requested URLs. The edge cache protects and buffers a site’s origin servers from increased load (e.g., traffic spikes) with a full-page cache.
The cache “hit rate” is the percentage of requests served by the edge cache as compared to the requests that bypass cache and result in SQL queries on the origin server. A large volume of direct SQL queries can overload the primary database and lead to an increase in responses with a 503 HTTP status code.
If requests to a site have a high cache hit rate, the site’s backend resources (web containers, database, and object cache) will be less busy and more available to handle traffic changes.
Cache hit rate can be increased by avoiding additional query parameters, user cookies, and non-GET requests that bypass the cache.
Cache-control headers sent by an application control how long a page or response is retained in cache. Site endpoints that rarely change can be set to a higher Time To Live (or max-age) for those routes.
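As an example, one common way to do this in a WordPress application is to send a Cache-Control header once the main query has been resolved. The sketch below is a minimal illustration, not a prescribed implementation; the pricing page slug and the one-hour max-age are hypothetical values.

```php
<?php
/**
 * Minimal sketch: lengthen the edge cache TTL for a route whose content
 * rarely changes. The 'pricing' page slug and one-hour max-age are
 * hypothetical; adjust both to the actual routes and freshness needs.
 */
add_action( 'template_redirect', function () {
	if ( is_page( 'pricing' ) ) {
		// Ask the page cache to retain this response for up to one hour.
		header( 'Cache-Control: max-age=3600' );
	}
} );
```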
Page generation times
“Page generation time” is the duration of time required for a site’s server resources to process, understand, and respond to a particular request. Users of a site appreciate fast generation times, but fast page generation times are critical for a site’s resilience.
Requests that bypass the edge cache must be served by the origin servers (an application’s code on a web container). These responses should be generated as quickly as possible. Even with autoscaling, resources are not infinite: the higher the demand for resources, the less resilient the site is to variations in traffic load.
For uncached traffic, faster page generation time can be ensured by:
- Optimizing database queries.
- Using the object cache for frequently accessed slow queries, or remote requests.
Both are important, and the two work together.
It is best not to completely rely on the object cache to eliminate all issues associated with a slower query. The object cache values will inevitably need to be replaced and cache stampedes combined with slow queries can cause disruptions. Instead, optimize the underlying slow queries as much as possible to reduce the time window of a potential cache stampede (i.e. replace the cache quickly) and, if necessary, implement logic to reduce the chance of a stampede by serving stale data during the update period.
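A minimal sketch of that pattern is shown below. It assumes a hypothetical get_expensive_results() function that wraps the already-optimized slow query; the cache keys, cache group, and TTL values are illustrative rather than recommendations.

```php
<?php
/**
 * Sketch: cache a slow query in the object cache and reduce the chance of a
 * stampede by letting only one request regenerate the value while other
 * requests serve a longer-lived stale copy. All key names, the cache group,
 * and TTLs are illustrative; get_expensive_results() is hypothetical.
 */
function get_cached_results() {
	$value = wp_cache_get( 'expensive_results', 'example_group' );
	if ( false !== $value ) {
		return $value;
	}

	// wp_cache_add() fails if the lock already exists, meaning another
	// request is regenerating the value; fall back to the stale copy.
	if ( false === wp_cache_add( 'expensive_results_lock', 1, 'example_group', 30 ) ) {
		$stale = wp_cache_get( 'expensive_results_stale', 'example_group' );
		if ( false !== $stale ) {
			return $stale;
		}
	}

	$value = get_expensive_results(); // Hypothetical optimized slow query.

	wp_cache_set( 'expensive_results', $value, 'example_group', 5 * MINUTE_IN_SECONDS );
	// Keep a longer-lived copy to serve while the primary entry is rebuilt.
	wp_cache_set( 'expensive_results_stale', $value, 'example_group', HOUR_IN_SECONDS );
	wp_cache_delete( 'expensive_results_lock', 'example_group' );

	return $value;
}
```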
Optimizing core queries at scale is even more important if a site’s posts table has grown large. A large number of posts can easily result in standard core database queries needing to use the filesystem to sort results even in cases where only a few posts are being returned. When the database frequently uses the filesystem, other fast queries will be delayed and overall performance will be reduced. Fortunately, most queries can be easily adjusted to make them faster.
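For example, a standard WP_Query can often be trimmed with core arguments that skip work the template does not need. The sketch below is illustrative only; the events post type is hypothetical, and which flags are safe to disable depends on how the results are used.

```php
<?php
// Sketch: keep a listing query lean on a site with a large posts table.
// The 'events' post type is a hypothetical example.
$recent_events = new WP_Query( array(
	'post_type'              => 'events',
	'posts_per_page'         => 5,
	'no_found_rows'          => true,  // Skip counting total rows when pagination is not needed.
	'ignore_sticky_posts'    => true,  // Avoid extra sticky-post handling.
	'update_post_meta_cache' => false, // Skip priming meta caches if meta is not used.
	'update_post_term_cache' => false, // Skip priming term caches if terms are not used.
) );
```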
With an understanding of what’s needed to improve performance and maintain site resilience at a high level, the next and more difficult task is usually just identifying what needs to be improved.
Identifying performance issues
Many performance issues can be identified with the available performance analysis tools. Once issues are identified, work toward correcting them.
Analysis tools
Query Monitor
Query Monitor can be helpful to identify PHP errors, slow queries, remote requests, and other anomalies on specific pages. It also reports the page generation time.
Runtime Logs
Runtime Logs reports PHP errors, including fatals, warnings, and notices, for WordPress applications, and retrieves output sent to stdout or stderr for Node.js applications.
GitHub
Investigating specific items in a site’s code repository is usually done after using a separate debugging tool to identify an issue, but application code may benefit from an occasional manual review. Focus especially on the places in code where remote requests and queries are defined.
Local (or non-production) environment
Though a local environment is not an absolute replica of a VIP environment, it can provide insights and understanding of the code executed during a request. If a recent backup of a production environment’s database has been imported to a VIP Local Development Environment, MySQL’s EXPLAIN command will be more accurate and can reveal inefficient queries. End-to-end tests can be run with a VIP Local Development Environment if it is running with a clone of the application’s GitHub repository.
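For example, the SQL generated for the main query can be captured in a local environment and then inspected with EXPLAIN against the imported database copy. This is a minimal sketch under that assumption; logging on the wp action and the use of error_log() are illustrative choices, not the only way to capture queries.

```php
<?php
// Sketch for a local environment only: log the SQL of the main query so it
// can be pasted into a MySQL client and inspected with EXPLAIN.
add_action( 'wp', function () {
	global $wp_query;
	// $wp_query->request holds the SQL generated for the main query.
	error_log( 'EXPLAIN ' . $wp_query->request );
} );
```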
New Relic
This Application Performance Monitoring tool shows current and historical average page generation times. It also captures traces of slow requests. Traces can be reviewed to determine where the most time is being spent.
If a particular URL is intermittently slow due to object caches expiring, a trace should be captured; these issues can be difficult to catch in Query Monitor.
PHP errors, warnings, and notices (via New Relic) can point out issues in code. The fewer code issues that exist on a regular basis, the easier it will be to identify, review, and resolve new issues. Ideally, an application routinely generates no errors, and few to no warnings.
New Relic also has a summary of database queries and remote requests, so potential issues specific to a table or API endpoint can be identified.
PHPCS
PHP_CodeSniffer will flag many (but not all) potential performance issues for a manual review.
HTTP request logs
Request logs are available via HTTP request Log Shipping and can be analyzed with a variety of tools such as GoAccess. Logs include the page generation time, response code, and cache status. By analyzing a site’s traffic patterns, the slowest requests can be identified and attention can be focused on the code that needs to be optimized.
What to look at
Ensuring the fastest possible page generation time should be a priority. Continually monitoring the Apdex score in New Relic provides a baseline performance indicator. Request logs are helpful to identify the slowest requests. Audits should prioritize the main features of a site.
Most used URLs
Review the most frequently used pages and their templates, including the homepage, major landing pages, the single page/post templates, and category/archive templates.
A starting point for performance monitoring and improvement might be:
- Ensuring that 99% of the time requests take less than 500ms to generate a result; and
- Making uncached requests take as little time as possible.
Other URLs
Review uncached templates that can suddenly receive a high level of traffic from bots or users:
- This includes the 404 template, sitemaps, any search functionality, infinite scrolling, or AJAX-generated requests.
- Verify that the 404 template does not make any database queries or remote requests, especially any based on the request URL, aside from the core queries needed to determine that the URL is not found. Optimally, the 404 template is the fastest route on a site and handles requests in less than 250ms, because 404 status responses are not cached for as long as a normal URL; they are cached for only a very brief time.
- Confirm that frontend, consumer-originated requests (whether from browsers or mobile apps) do not routinely result in database writes or remote requests.
- Reduce the occurrence of front-end requests that result in database writes whenever possible. Subsequent database reads from tables that were recently written to will always be handled by the primary database, which can negatively affect site performance during moderate to high traffic events.
- Ensure database queries and remote requests follow best practices, are performant and cached, and items used across many pages avoid cache stampedes.
- Ensure remote requests have a reasonable timeout and retry setting. If an endpoint that is accessed to fetch data becomes unresponsive, how would it impact the site? (See the sketch following this list.)
- Ensure API requests and consumer-originated AJAX requests also perform quickly.
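The following sketch illustrates two of the points above: returning early on the 404 template and giving a remote request an explicit, short timeout so an unresponsive endpoint cannot stall page generation. The endpoint URL, function name, and three-second timeout are hypothetical examples, not values taken from this document.

```php
<?php
// Sketch: skip non-essential work on the 404 template and fail fast when a
// remote endpoint is slow. The endpoint URL and function name are hypothetical.
function example_fetch_related_content() {
	if ( is_404() ) {
		// The 404 template should stay as light as possible.
		return array();
	}

	$response = wp_remote_get(
		'https://api.example.com/related', // Hypothetical endpoint.
		array( 'timeout' => 3 )            // Fail fast instead of waiting on a slow endpoint.
	);

	if ( is_wp_error( $response ) || 200 !== wp_remote_retrieve_response_code( $response ) ) {
		return array();
	}

	return json_decode( wp_remote_retrieve_body( $response ), true );
}
```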
Logs and APM traces
Review request logs (available via our log shipping option) for the following:
- The presence of significantly slower requests
- Patterns of bot activity or bot requests for pages that don’t need to be indexed (for example, bots should be requesting sitemaps and individual posts, but not crawling through every page of an archive or tag)
- The presence of cache busting URL parameters
- Requests for files in WP uploads, especially large images, without resize parameters
- Requests for invalid URLs that may be coming from page templates or JavaScript
Plugins
Plugins can sometimes significantly affect performance. Many are not tested under high traffic conditions or with sites that have extremely large databases. Even a few WordPress core functions can be slow on large sites if not optimized.
- Review a site’s plugins and modules in New Relic to ensure that they are performing quickly. Any frequently slow portions of traces should be looked at closely.
- Use the tools listed above to ensure that a site’s plugins are not throwing warnings and/or errors, are not making front-end database writes, and any queries or filesystem operations are completing in a reasonable time.
- Review any documented plugin issues or core issues, especially if a site has a significant amount of content. Large DB tables are often not a priority during plugin development or core testing, but they can lead to performance issues as they grow in size.
Last updated: August 03, 2023