A number of users were unable to open or create Figma files between 17:25 and 18:04 PDT on June 16, 2020. Most users were still able to edit files during this time. We understand how disruptive it is when Figma is unavailable. We take downtime seriously and will be doing everything we can to learn from this incident. While we’re still finalizing the investigation with our infrastructure provider (Amazon Web Services), we wanted to share the details we have so far.
We use a managed service from AWS called Elasticache. At its core, it leverages Redis, an open-source in-memory data store that supports fast read and write operations. We use Elasticache to accelerate access to frequently accessed data. We are deliberate about only storing data here that we are ok losing: rate limit counters, internal stats, etc. Note that Elasticache is distinct from our main datastore, which is the source of truth for customer data.
Elasticache automates many of the operations that go into running a Redis cluster. We use it in a Multi-AZ setup with a Primary and a Replica instance for high availability. The Primary and Replica instances communicate constantly so that their state stays in sync. This is designed to ensure we can fail over to the Replica in the event of an issue with the Primary. Additionally, we have set up daily backups so that we can recover from catastrophic failures more seamlessly.
In Elasticache, the Replica is responsible for performing backups. This allows the Primary to continue serving client requests without any performance impact.
It is useful to know that Redis is “mostly single-threaded”. This means it can only really use a single CPU core to process requests.
The incident was triggered by a routine Elasticache backup operation. This operation is performed by Elasticache on a daily basis. On June 16th, the backup triggered a cascading series of failures that caused the interruption to our service.
At 17:01 PDT the Replica instance started the backup operation. This was done using a forkless method where a cooperative background process performs the backup. AWS made this the default method because it reduces the amount of additional RAM needed for the backup. Unfortunately, it comes with a significant performance hit — it uses the same CPU core that is being used for all other operations. The forkless backup kept the Replica so busy that it could not keep up with the updates from the Primary. This caused the Primary instance to buffer updates for the Replica, to the point that its buffer filled up and hit an internal limit. This limit is designed to prevent the Primary from running out of memory as the buffer fills up. Unfortunately, it’s set very low and was tripped very quickly relative to the large amount of free memory the Primary instance had available. When the limit was reached, the Primary terminated its connection to the Replica.
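The buffering behavior described above can be illustrated with a toy model. This is a minimal sketch, not Redis or Elasticache code: the function name, the byte rates, and the limit are hypothetical values chosen for illustration, and the real limit and its exact semantics are not stated in this report. It shows only the general shape of the failure: once the Replica drains updates more slowly than the Primary queues them, the buffer grows steadily until it crosses the cap and the connection is dropped.

```python
from dataclasses import dataclass


@dataclass
class ReplicationResult:
    disconnected: bool      # True if the buffer cap was exceeded
    seconds_elapsed: int    # when the disconnect happened, or the full duration
    peak_buffer_bytes: int  # largest backlog observed


def simulate_replication_buffer(
    write_rate: int,    # bytes/second the Primary queues for the Replica (hypothetical)
    drain_rate: int,    # bytes/second the busy Replica can consume (hypothetical)
    buffer_limit: int,  # cap on the Primary's per-replica buffer (hypothetical)
    duration: int,      # seconds the backup keeps the Replica busy
) -> ReplicationResult:
    """Step a one-second clock and track the Primary's backlog for the Replica."""
    buffered = 0
    peak = 0
    for second in range(1, duration + 1):
        buffered += write_rate                    # new updates are queued
        buffered = max(0, buffered - drain_rate)  # Replica consumes what it can
        peak = max(peak, buffered)
        if buffered > buffer_limit:
            # The Primary gives up and drops the Replica's connection.
            return ReplicationResult(True, second, peak)
    return ReplicationResult(False, duration, peak)


# A Replica that keeps up never accumulates a backlog.
healthy = simulate_replication_buffer(1000, 2000, 10_000, 120)

# A Replica slowed by a backup falls behind and is disconnected.
busy = simulate_replication_buffer(1000, 100, 10_000, 120)
```

In this model the size of the cap only changes how long the Replica can lag before being cut off, which is why a cap that is small relative to available memory trips quickly.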
At 17:03 PDT the Replica finished the backup operation. It then reconnected with the Primary and requested a partial synchronization, an efficient way to get back in sync from where it left off. Unfortunately, the Primary rejected the request and initiated its own forkless snapshot to prepare a full synchronization instead. This caused the Primary to use up 100% of its CPU core. Since the Primary was also serving live traffic at this time, the forkless snapshot took MUCH longer on the Primary than it had on the Replica and did not complete until 18:04 PDT. Between 17:03 and 17:25 PDT, the Primary was still serving most client requests with a small degradation in latency, and Figma was fine. However, it took significantly longer to process a small fraction of client requests involving writes.
While this was going on, connections were growing in Redis. We’re working with AWS to isolate why these connections were growing. Our own systems maintained a relatively constant number of connections to Redis during this time, so it’s clear that something was causing Redis to leak connections. We know that the version of Redis we were using has trouble detecting stale connections. At 17:25 PDT the connections reached the connection limit on the Primary and it started to reject all subsequent connection attempts. This is what ultimately caused Figma to be unavailable for many of our users.
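To make the connection growth concrete, here is a toy model of one way a server can accumulate connections while its clients hold a constant number. It is an illustration under stated assumptions, not a diagnosis: the root cause was still under investigation at the time of writing, and `ToyServer`, `ticks_until_rejected`, and all the numbers are hypothetical. The model assumes a client connection is replaced once per tick and that the server may or may not notice that the old connection went away.

```python
from typing import Optional


class ToyServer:
    """Hypothetical stand-in for a server with a hard connection limit."""

    def __init__(self, max_clients: int) -> None:
        self.max_clients = max_clients
        self.open_sockets = 0  # connections the server believes are alive

    def accept(self) -> bool:
        """Accept a new connection unless the limit has been reached."""
        if self.open_sockets >= self.max_clients:
            return False
        self.open_sockets += 1
        return True

    def client_closed(self, detected: bool) -> None:
        """A client went away; the slot is freed only if the server notices."""
        if detected:
            self.open_sockets -= 1


def ticks_until_rejected(
    pool_size: int,
    max_clients: int,
    server_detects_close: bool,
    max_ticks: int = 1000,
) -> Optional[int]:
    """Return the tick at which a reconnect is first rejected, or None."""
    if pool_size > max_clients:
        raise ValueError("pool_size must not exceed max_clients")
    server = ToyServer(max_clients)
    for _ in range(pool_size):
        server.accept()
    for tick in range(1, max_ticks + 1):
        # One client abandons its connection and opens a new one, so the
        # client-side pool size never changes.
        server.client_closed(detected=server_detects_close)
        if not server.accept():
            return tick
    return None


leaky = ticks_until_rejected(pool_size=3, max_clients=10, server_detects_close=False)
clean = ticks_until_rejected(pool_size=3, max_clients=10, server_detects_close=True)
```

With detection turned off, the server-side count climbs by one per tick until new connections are refused, even though the client pool stays at three; with detection on, the limit is never reached.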
We were alerted by our monitoring system within seconds of the Primary initiating the backup at 17:04 PDT. We quickly identified Redis as the source of the problem. Our dashboard showed that the number of connections on Redis was abnormally high and growing. We shut down internal services that talk to Redis but do not have any user-visible impact (e.g., our asynchronous job processing system). However, this did not result in any reduction in the number of connections due to the way Redis was leaking connections.
At this point the connection limit had already been hit, and we were unable to connect to Redis directly to force it to drop connections. We asked Elasticache to initiate a failover to the Replica instance. This should have happened quickly, but the operation did not complete until the Primary finished the backup at 18:04 PDT, at which point the service had already recovered.
There is much to learn from this incident. We worked with AWS to change our configuration so that we do not use the forkless method for synchronization; instead, Redis now forks a child process for backups. This will prevent the same sequence of events from recurring, since the Redis process on the Replica and Primary will not be overloaded in the event of a backup.
We have scheduled an upgrade to a newer version of Redis that provides improvements to connection management, which should prevent connections from leaking and growing. We are also exploring using cluster-mode with Elasticache to take advantage of the improvements AWS has made for High Availability.
And most importantly, we are working on ensuring that Figma is resilient to unexpected Redis failures in the future. For all the advantages that Elasticache gives us, it’s clear that there are some edge cases it still cannot handle well. We kicked off plans to remove our dependence on Redis for service availability weeks before this incident, and regret that we didn’t prioritize it sooner.
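One common way to keep a service available when a cache fails is to "fail open" on data that is safe to lose, such as the rate limit counters mentioned earlier. The sketch below shows that pattern only; it is not a description of our implementation. `CounterStore`, `CacheUnavailable`, and `allow_request` are hypothetical names, and the in-memory store stands in for a real Redis client.

```python
class CacheUnavailable(Exception):
    """Raised by the hypothetical store when the cache cannot be reached."""


class CounterStore:
    """In-memory stand-in for a cache client; `available` simulates an outage."""

    def __init__(self) -> None:
        self.counts: dict[str, int] = {}
        self.available = True

    def incr(self, key: str) -> int:
        if not self.available:
            raise CacheUnavailable("cache is rejecting connections")
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]


def allow_request(store: CounterStore, user_id: str, limit: int) -> bool:
    """Apply a per-user rate limit, but never block users on a cache outage."""
    try:
        count = store.incr(f"rate:{user_id}")
    except CacheUnavailable:
        # Fail open: losing rate limiting briefly is preferable to an outage.
        return True
    return count <= limit


store = CounterStore()
decisions = [allow_request(store, "user-1", limit=2) for _ in range(3)]

store.available = False  # simulate the cache refusing connections
during_outage = allow_request(store, "user-1", limit=2)
```

Failing open is a trade-off: rate limits are briefly unenforced during an outage, so the pattern only suits data whose loss is acceptable.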
Finally, we’re updating the health information we expose here on our status page to ensure that it comprehensively captures our uptime going forward.
If you have any questions about this incident, please don’t hesitate to get in touch at email@example.com.