Site outage protocol
Although our 24×7 automated monitoring would make us aware of an outage immediately, you should still let us know if your application is not loading. To notify our entire team immediately, submit a ticket with the word “urgent” in the subject line. Someone will respond to you within minutes.
Urgent tickets should be used only for outages, security issues, or workflow-blocking concerns. Please include as many details as you can to help us resolve the issue faster.
VIP has documented, agreed-upon procedures for outage scenarios, and VIP staff are trained in and required to understand these procedures. Before we touch on how we would respond, let’s go over what we do to prevent outages from happening.
VIP offers different code review levels based on the contract type, all intended to make sure that your site is secure, performant, and in line with best practices. This extra layer of stability helps to identify and prevent code-related issues. We also run high-performance checks ahead of site launches to ensure the application is optimized before it goes into production.
VIP also maintains a blend of documented and automated procedures for dealing with various equipment or other failures. All VIP production environments are backed up every hour, including custom tables, and 30 days of backups are held in the origin data center. A recent backup for each production environment is stored in a separate data center.
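As a rough illustration of the retention figures above, the sketch below works out how many hourly backups a 30-day window holds and checks whether a given backup still falls inside that window. The function names are hypothetical and chosen only for this example; this is not how VIP's backup system is implemented.

```python
from datetime import datetime, timedelta

# Figures taken from the text: hourly backups, 30 days of retention.
BACKUP_INTERVAL = timedelta(hours=1)
RETENTION = timedelta(days=30)


def expected_backup_count(interval=BACKUP_INTERVAL, retention=RETENTION):
    """Number of backups held if one is taken every `interval` for `retention`."""
    return retention // interval


def is_within_retention(backup_time, now, retention=RETENTION):
    """True if a backup taken at `backup_time` is at most `retention` old."""
    age = now - backup_time
    return timedelta(0) <= age <= retention


if __name__ == "__main__":
    print(expected_backup_count())  # 30 days * 24 backups per day
    print(is_within_retention(datetime(2024, 1, 1), datetime(2024, 1, 30)))
```

Under these assumptions, a single production environment would have roughly 720 restore points available in the origin data center at any time.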
VIP maintains several origin data centers, each with additional capacity that can be used if sites need to move to another data center in the event of a failure. Our primary data centers are located in or around Dallas-Fort Worth, Los Angeles, and Washington, D.C.
The VIP hosting infrastructure is designed to prevent equipment failures from making a given application environment (e.g. the production environment) unavailable. The resources for all VIP application environments are spread across networking (e.g. switch) and power (e.g. rack power) infrastructure, to reduce the risk that a networking or power equipment issue affects all of the resources assigned to running a given VIP production environment.
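The idea of spreading an environment across networking and power infrastructure can be sketched as a simple check: given the resources assigned to one environment, report any switch or power feed that all of them depend on. The `Resource` type, its fields, and the identifiers below are assumptions made for illustration and do not describe VIP's actual tooling.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical record of where one resource of an environment runs."""
    name: str
    switch: str
    rack_power: str


def single_points_of_failure(resources):
    """Return the switch or power identifiers that every resource shares.

    An empty list means no single switch or power feed carries the
    whole environment.
    """
    if not resources:
        return []
    total = len(resources)
    shared = []
    for attr in ("switch", "rack_power"):
        counts = Counter(getattr(r, attr) for r in resources)
        shared.extend(f"{attr}:{value}" for value, n in counts.items() if n == total)
    return shared


if __name__ == "__main__":
    spread = [
        Resource("web-1", switch="sw-a", rack_power="pdu-1"),
        Resource("web-2", switch="sw-b", rack_power="pdu-2"),
    ]
    print(single_points_of_failure(spread))
```

A placement where both resources sat behind the same switch would instead be reported as a shared dependency on that switch.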
VIP maintains its procedures on multiple communication channels and knowledge-sharing systems, on both Automattic and third-party infrastructure, so that a wide-ranging issue with our systems does not leave us relying on a single system.
We provide our clients with access to New Relic, which includes the availability and monitoring service Synthetics Lite. We set specific thresholds that trigger warnings for performance issues; these often flag items that could contribute to, or indicate, an application environment being unavailable or having service issues.
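Threshold-based warnings of the kind described above can be pictured with the minimal sketch below. The metric names and limits are invented for the example; they are not VIP's real thresholds, and the code does not use the New Relic API.

```python
# Hypothetical thresholds, for illustration only.
THRESHOLDS = {
    "response_time_ms": 2000.0,  # warn when average response time exceeds 2 s
    "error_rate": 0.05,          # warn when more than 5% of requests fail
}


def warnings_for(sample, thresholds=THRESHOLDS):
    """Return one warning message per metric in `sample` above its threshold."""
    return [
        f"{metric} is {sample[metric]} (threshold {limit})"
        for metric, limit in thresholds.items()
        if metric in sample and sample[metric] > limit
    ]


if __name__ == "__main__":
    for message in warnings_for({"response_time_ms": 3500, "error_rate": 0.01}):
        print(message)
```

In this sample only the response time exceeds its limit, so a single warning is produced; metrics missing from the sample are skipped rather than treated as failures.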
In outage scenarios, VIP uses an established Outage Mode protocol. Different leads are assigned to different roles, allowing our engineering team to focus on resolving the issue. This also ensures someone is updating the VIP Status Twitter feed and VIP Lobby. Once the outage has been resolved, we follow up with an After Action Report detailing what led to the issue and any preventative measures put in place to keep a similar scenario from occurring again.
In addition to the communication outlined above, Enterprise clients should expect to hear from their designated Technical Account Manager (TAM). The TAM will alert the client's primary contact about the situation with the details gathered so far, and then provide status updates throughout. While the situation is being tended to, an email, Slack direct message, or text message is typically best, so that we can keep contributing to the resolution. Once the issue has been resolved, the TAM will follow up to confirm the resolution. This could be an email, a Zendesk ticket, or a quick phone call if desired.