Key Highlights
- Coinbase published a post-mortem on the May 7 outage that disrupted trading, deposits, and withdrawals.
- The incident was triggered by AWS cooling system failures that shut down critical infrastructure.
- Additional Kafka-related issues delayed recovery and affected multiple backend services.
Crypto exchange Coinbase today published a detailed post-mortem analysis of its May 7, 2026 service outage. The disruption lasted several hours and took down trading, deposits, withdrawals, and other essential functions on the platform.
According to the official announcement, the incident began around 7:20 PM ET, after multiple chiller units malfunctioned in a single Amazon Web Services (AWS) data hall in the use1-az4 availability zone of the us-east-1 region. The malfunction triggered a thermal safety shutdown, which powered off EC2 instances and EBS volumes.
Outage affected trading services
Coinbase trading went down around 7:48 PM ET, and retail users could not buy, sell, send, receive, deposit, or withdraw on almost all products available on the platform. Coinbase Prime clients also experienced degraded order routing.
The matching engine was restored in cancel-only mode at 2:25 AM ET on May 8, and trading fully resumed on all order books at 3:49 AM ET, about eight hours after it went down. Consumer-facing services were partially restored at 5:30 AM ET and returned to normal operations at 9:53 AM ET. The last event stream backlog was cleared at 2:00 PM ET.
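The durations implied by these timestamps can be checked with a short calculation. The following is a minimal Python sketch, not part of the Coinbase report: the event labels are hypothetical names chosen for illustration, and it assumes that every time after midnight falls on May 8 and that all times are in ET.

```python
from datetime import datetime

# Timestamps as reported in the article, all in ET.
# Assumption: every post-midnight time falls on May 8, 2026.
# The dictionary keys are hypothetical labels, not terms from the report.
EVENTS = {
    "chillers_fail": datetime(2026, 5, 7, 19, 20),
    "trading_down": datetime(2026, 5, 7, 19, 48),
    "cancel_only": datetime(2026, 5, 8, 2, 25),
    "trading_resumed": datetime(2026, 5, 8, 3, 49),
    "consumer_partial": datetime(2026, 5, 8, 5, 30),
    "consumer_normal": datetime(2026, 5, 8, 9, 53),
    "backlog_cleared": datetime(2026, 5, 8, 14, 0),
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two named events."""
    delta = EVENTS[end] - EVENTS[start]
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


if __name__ == "__main__":
    for start, end in [
        ("trading_down", "trading_resumed"),
        ("trading_down", "consumer_normal"),
        ("chillers_fail", "backlog_cleared"),
    ]:
        print(f"{start} -> {end}: {as_hours_minutes(minutes_between(start, end))}")
```

Under these assumptions, the trading halt itself accounts for roughly eight hours, while the span from the first chiller failure to the last cleared backlog is considerably longer.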
Technical problems behind the outage
According to the report, two major technical problems prolonged the outage. First, Coinbase's matching engine ran in a single AWS building, where a Raft-based cluster was deployed in a Cluster Placement Group to minimize trading latency.
If three of the five nodes fail, the cluster can no longer maintain quorum, and no automatic failover to another availability zone was in place for this situation. Fixing it required a manual code change and building a completely new node group. Quorum was restored at 12:06 AM ET; however, the trading markets stayed closed for longer than that.
Second, there was a problem with AWS's Managed Streaming for Apache Kafka (MSK). According to the report, an error in the MSK control plane prevented the automatic election of new partition leaders when the zone failed. As a result, certain clusters became blocked, and event stream pipelines could not be processed correctly.
As a consequence, various dependent systems were affected, including those responsible for fees, quotes, ledgers, payments, and data pipelines. Moreover, one cluster had a two-AZ configuration, which made it more vulnerable. Manual partition reassignments were eventually performed around 3:00 AM ET with assistance from AWS engineers.
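The quorum arithmetic behind this failure mode is simple majority counting. The sketch below is a generic illustration of Raft-style majority quorum in Python, not Coinbase's implementation; the function names are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest strict majority of a cluster: floor(n / 2) + 1."""
    return cluster_size // 2 + 1


def has_quorum(cluster_size: int, failed_nodes: int) -> bool:
    """True if the surviving nodes still form a majority of the full cluster."""
    healthy = cluster_size - failed_nodes
    return healthy >= quorum_size(cluster_size)


if __name__ == "__main__":
    # A five-node cluster tolerates two failures but not three.
    for failed in range(4):
        print(f"failed={failed} quorum={has_quorum(5, failed)}")
```

A five-node cluster needs three healthy nodes to commit anything, so it survives two failures. When all five nodes sit in one building, a single facility-level event can take out three or more at once, which is the trade-off of optimizing placement for latency.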
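To see why replica placement matters when a zone disappears, consider a toy model of partitions and the zones that host their replicas. This Python sketch is purely illustrative: it does not use any Kafka or MSK API, the partition names and layouts are invented, and every zone ID other than use1-az4 is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str                 # hypothetical partition label
    leader_zone: str          # zone hosting the current leader replica
    replica_zones: list[str]  # zones hosting all replicas, leader included


def impact_of_zone_failure(partitions: list[Partition], failed_zone: str) -> dict:
    """For each partition, report whether it needs a new leader and how many
    replicas survive once every replica in failed_zone is gone."""
    report = {}
    for p in partitions:
        survivors = [zone for zone in p.replica_zones if zone != failed_zone]
        report[p.name] = {
            "needs_new_leader": p.leader_zone == failed_zone,
            "surviving_replicas": len(survivors),
        }
    return report


if __name__ == "__main__":
    # Illustrative layouts only; not taken from the Coinbase report.
    partitions = [
        Partition("orders-3az", "use1-az4", ["use1-az4", "use1-az1", "use1-az2"]),
        Partition("fees-2az", "use1-az4", ["use1-az4", "use1-az1"]),
    ]
    for name, status in impact_of_zone_failure(partitions, "use1-az4").items():
        print(name, status)
```

In this model both partitions lose their leader and must elect a new one, but the two-zone partition is left with a single surviving replica. If leader election does not happen automatically, the partition stays unavailable until someone reassigns it by hand.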
Company acknowledges shortcomings
In the report, Coinbase admitted that the incident fell short of its own reliability expectations. The company stressed that its systems were supposed to withstand the failure of a single availability zone within a cloud provider's infrastructure.
The company also thanked AWS engineers, along with its own team, for their efforts during the incident. The outage highlights the continuing difficulty of running large-scale cryptocurrency exchange services on third-party cloud infrastructure while meeting the strict performance requirements of a trading system. Outages of this kind may become critical during periods of market volatility.
