Postmortem for outage of us-east-1

We would like to share the details of what occurred during the outage on 5/27/2014 in our us-east-1 data center, what we have learned, and what actions we are taking to prevent this from happening again. On behalf of all of Joyent, we are extremely sorry for this outage and the severe inconvenience it may have caused to you and your customers.

Background

In order to understand the event, first we need to explain a few basics about the architecture of our data centers. All of Joyent's data centers run our SmartDataCenter product, which provides centralized management of all administrative services and of the compute nodes (servers) used to host customer instances. The architecture of the system is built such that the control plane, which includes both the API and boot sequences, is highly available within a single data center and survives any two failures. In addition to this control plane stack, every server in the data center has a daemon on it that responds to normal, machine-generated requests for things like provisioning, upgrades, and changes related to maintenance.

In order for the system to be aware of all servers in the data center and their current instances (VMs), software levels, and other relevant information, we have our own DHCP/TFTP system that responds to PXE boot requests from servers in the data center. The DHCP/TFTP system caches the state required to service these requests, which is ordinarily accessed via the control plane.

Because of this, existing capacity that is already a part of the SDC architecture is able to boot even when the control plane is not fully available, such as during this event.
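To make that fallback concrete, here is a minimal sketch of the cache-and-fallback pattern described above. It is purely illustrative: the function names and cache path are our assumptions, not Joyent's actual boot service.

```python
import json
import os

CACHE_DIR = "/var/cache/bootparams"  # assumed cache location, for illustration only

def fetch_from_control_plane(mac):
    """Stand-in for a lookup against the control plane API."""
    raise ConnectionError("control plane unavailable")  # simulate this event

def boot_params_for(mac):
    """Answer a PXE boot request: prefer fresh data from the control plane,
    but fall back to the last cached copy when the control plane is down."""
    cache_path = os.path.join(CACHE_DIR, f"{mac}.json")
    try:
        params = fetch_from_control_plane(mac)
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(cache_path, "w") as f:
            json.dump(params, f)       # refresh the cache on every successful lookup
        return params
    except ConnectionError:
        with open(cache_path) as f:    # control plane down: serve the cached record
            return json.load(f)
```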

Given this architecture, and the need to support automated upgrades, provisioning requests, and many innocuous tasks across all data centers in the fleet, there exists tooling that can be remotely executed on one or more compute nodes. This tooling is primarily used by automated processes, but for specific operator needs it may be used by human operators as well.
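Conceptually, that tooling is a fan-out helper: given a set of target nodes and a command, it runs the command on each node and collects the results. The sketch below is a hypothetical Python illustration; the `execute` transport and node names are assumptions, and the real tooling runs over Joyent's own agent infrastructure rather than anything shown here.

```python
from typing import Callable, Dict, Iterable

def run_on_nodes(nodes: Iterable[str], command: str,
                 execute: Callable[[str, str], str]) -> Dict[str, str]:
    """Run `command` on each named compute node using the supplied `execute`
    transport (agent RPC, SSH, etc.) and collect the per-node output."""
    results = {}
    for node in nodes:
        results[node] = execute(node, command)
    return results

# Example usage with a stand-in transport that only echoes the request:
if __name__ == "__main__":
    fake_transport = lambda node, cmd: f"{node}: would run {cmd!r}"
    print(run_on_nodes(["cn01", "cn02"], "uname -v", fake_transport))
```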

What Happened?

Due to an operator error, all us-east-1 API systems and customer instances were simultaneously rebooted at 2014-05-27T20:13Z (13:13 PDT). Rounded to minutes, the minimum downtime for customer instances was 20 minutes, and the maximum was 149 minutes (about 2.5 hours). 80 percent of customer instances were back within 32 minutes, and over 90 percent were back within 59 minutes. The instances that took longer were affected by a few independent, isolated problems, which are described below.

The us-east-1 API was fully restored by 2014-05-27T21:30Z (1 hour and 17 minutes of downtime). The reasons for the extended API outage are also covered below.

The root cause of this incident was operator error. An operator was performing upgrades of some new capacity in our fleet, using the tooling that allows for remote updates of software. The command to reboot the select set of new systems that needed to be updated was mistyped, and instead specified all servers in the data center. Unfortunately, the tool in question does not have enough input validation to prevent this from happening without extra steps and confirmation, and it went ahead and issued a reboot command to every server in the us-east-1 availability zone without delay.

Once systems rebooted, they by design looked for a boot server to respond to PXE boot requests. Because every system in the data center rebooted simultaneously, there was extremely high contention on the TFTP boot infrastructure, which, like all of our infrastructure, normally has throttles in place to ensure that it cannot run away with a machine. Once we identified that these throttles were causing the compute nodes to boot more slowly, we removed them. This enabled most customer instances to come online over the following 20-30 minutes.
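For illustration, the kind of throttle involved can be as simple as a cap on concurrent transfers. The Python sketch below is our own simplification (the limit and structure are assumptions, not Joyent's configuration); lifting such a cap trades boot-server headroom for faster fleet-wide recovery, which is the trade-off made during the incident.

```python
import threading

MAX_CONCURRENT_TRANSFERS = 32              # assumed cap, chosen for illustration
transfer_slots = threading.BoundedSemaphore(MAX_CONCURRENT_TRANSFERS)

def serve_tftp_request(node_id: str, send_image) -> None:
    """Serve one PXE/TFTP image transfer, waiting for a free slot so a
    thundering herd of simultaneous boots cannot overwhelm the boot server."""
    with transfer_slots:                   # blocks while all slots are in use
        send_image(node_id)
```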

The small number of machines that lagged for the remaining time were subject to a known, transient bug in a network card driver on legacy hardware platforms, whereby obtaining a DHCP lease upon boot occasionally fails. In our experience, platforms with this network device encounter this boot-time issue about 10% of the time. The mitigation is for an operator to simply initiate another reboot, which we performed on the afflicted nodes as soon as we identified them. Our newer equipment uses different network cards, which is why the impact was limited to a smaller number of compute nodes.
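Because the failure is transient (roughly a 10% chance per boot), the chance a node is still stuck after n reboot attempts is roughly 0.1^n, which is why simply rebooting again is an effective mitigation. The sketch below is hypothetical; `reboot` and `has_lease` stand in for operator tooling that is not shown here.

```python
import time

def ensure_node_boots(node_id: str, reboot, has_lease,
                      max_attempts: int = 5, wait_seconds: int = 300) -> bool:
    """Reboot `node_id` until `has_lease(node_id)` reports a successful boot,
    mirroring the manual mitigation described above."""
    for _ in range(max_attempts):
        if has_lease(node_id):
            return True
        reboot(node_id)
        time.sleep(wait_seconds)           # give the node time to PXE boot again
    return has_lease(node_id)
```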

The extended API outage was due to the need for manual recovery of stateful components in the control plane. While each stateful system is deployed with 2F+1 replicas so that it can tolerate F failures, rebooting the entire data center resulted in the complete failure of all of these components at once, and they did not maintain enough history to bring themselves back online. This is partially by design: we would rather a system be unavailable than come up "split brain" and suffer data loss as a result. That said, we have identified several ways we can make this recovery much faster.
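To make the trade-off concrete: a quorum-based system with 2F+1 replicas needs a majority (F+1) to agree, and after a whole-data-center reboot every replica comes back "cold". The simplified Python sketch below is our illustration, not Joyent's code; it shows the conservative rule such a system follows, which is to refuse to self-assemble unless persisted history can prove which replica is freshest.

```python
def can_form_quorum(count: int, total_replicas: int) -> bool:
    """A 2F+1 replica cluster needs a majority (F+1) to agree."""
    return count >= total_replicas // 2 + 1

def safe_to_auto_recover(replicas) -> bool:
    """Unless a majority of replicas retained enough on-disk history to prove
    which copy of the data is newest, electing a leader automatically risks a
    stale primary ("split brain"), so the safe behaviour is to stay down and
    wait for an operator."""
    usable = [r for r in replicas if r["up"] and r["has_persisted_history"]]
    return can_form_quorum(len(usable), len(replicas))

# During this event: the replicas came back up but without enough history,
# so automatic recovery was (correctly) refused and operators stepped in.
replicas = [{"up": True, "has_persisted_history": False} for _ in range(5)]
assert safe_to_auto_recover(replicas) is False
```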

Next Steps

We will be taking several steps to prevent this failure mode from happening again, and to ensure that we can recover more quickly from other disaster scenarios.

First, we will be dramatically improving the tooling that humans (and systems) interact with, so that input validation is much more strict and will not allow all servers, including control plane servers, to be rebooted simultaneously. We have already begun putting in place a number of immediate fixes to the tools operators use to mitigate this, and over the coming days and weeks we will be rethinking which tools are necessary so that "full power" tools are not the only means by which to accomplish routine tasks.
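As one example of the kind of guard rail we mean, a destructive fan-out can refuse control plane targets outright and require an explicitly re-typed confirmation for bulk operations. The sketch below is a minimal Python illustration; the threshold, names, and confirmation phrase are assumptions, not the actual tooling.

```python
BULK_THRESHOLD = 10        # assumed: anything larger is treated as a bulk operation

class RefusedError(Exception):
    """Raised when a dangerous fan-out is rejected by the guard rail."""

def confirm_reboot(target_nodes, all_nodes, control_plane_nodes, prompt=input):
    """Refuse a reboot that touches the control plane, and require the operator
    to re-type an explicit phrase before any bulk or whole-fleet reboot."""
    targets = set(target_nodes)
    if targets & set(control_plane_nodes):
        raise RefusedError("control plane reboots require a separate, audited workflow")
    if len(targets) > BULK_THRESHOLD or targets == set(all_nodes):
        phrase = f"reboot {len(targets)} nodes"
        if prompt(f"Type '{phrase}' to continue: ").strip() != phrase:
            raise RefusedError("confirmation phrase did not match; aborting")
    return sorted(targets)
```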

Second, we are determining what extra steps in control plane recovery can be taken so that we can safely reboot all nodes simultaneously without waiting for operator intervention. We will not be able to serve requests during a complete outage, but we will ensure that each node records enough state that we can recover without human intervention.
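One way to record that state, sketched below purely as an illustration (the file location and field names are our assumptions), is for every stateful node to durably write its latest commit position, so that after a full restart the replicas can compare records and promote the most up-to-date copy without an operator.

```python
import json
import os

STATE_PATH = "/var/db/cluster-recovery.json"   # assumed location, illustration only

def record_commit(generation: int, last_committed: int, path: str = STATE_PATH) -> None:
    """Durably record the latest known commit position on local storage."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"generation": generation, "last_committed": last_committed}, f)
        f.flush()
        os.fsync(f.fileno())       # ensure the record survives a sudden reboot
    os.replace(tmp, path)          # atomic rename so readers never see a torn file

def freshest(records):
    """After a full restart, pick the replica whose record is newest."""
    return max(records, key=lambda r: (r["generation"], r["last_committed"]))
```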

Lastly, we will be assessing a more aggressive migration of customer instances off our older legacy hardware platforms. We will be doing this on a case-by-case basis, as there will be impact to instances as we do this, which will require us to work with each customer.

Closing

We want to reiterate our apology for the magnitude of this issue and the impact it caused our customers and their customers. We will be working as diligently as we can, and as expediently as we can, to prevent an issue like this from happening again.

Sincerely,
The Joyent Team


