Note: This article is quite technical and aimed at developers.
Coinbase, a digital currency exchange, faced scaling challenges on its platform during the 2017 cryptocurrency boom. To resolve them, the engineering team focused on upgrading and optimizing MongoDB, segregating traffic around hotspots, and building capture and replay tools to prepare for future surges.
Customer traffic at Coinbase spiked beyond anticipated levels during May-June 2017, exceeding five times the typical maximum and causing downtime. The team attacked the easy issues first: vertical scaling, upgrading MongoDB for performance improvements, index optimization, and traffic segregation based on hotspots. The existing monitoring system did not capture enough contextual information, so it was augmented with code instrumentation that logged the missing data. Even with these improvements, Coinbase faced multiple outages again during the December 2017 Bitcoin price surge. The team has since focused on ensuring their systems can handle higher volumes of traffic by emulating traffic patterns with capture and replay tools.
Both Coinbase’s Ruby app and MongoDB experienced higher latencies during the initial outages, with the time split equally between Ruby and MongoDB. To better understand the context of these calls across different components, the team logged additional data by modifying MongoDB’s database driver. This helped them narrow down the issue to an unoptimized response object that increased the network load. Fixing this issue gave the application a performance boost. Additionally, high-volume reads were sped up by adding Memcached-based caching at the Object Relational Mapping (ORM) layer as well as in the driver layer. Adding missing indices also improved response times. By June 2017, the team had upgraded their MongoDB clusters to 3.2, which had the faster WiredTiger storage engine. Coinbase uses Redis to implement services like rate limiting, which were affected during these outages because of Redis’s single-threaded model.
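The driver-level instrumentation described above can be illustrated with a minimal Python sketch. It is not Coinbase's code (their application is written in Ruby), and every name in it (InstrumentedCollection, FakeCollection, largest_payloads, the "GET /accounts" context labels) is hypothetical. The idea is to wrap each database call so that the calling context, latency, and approximate response size are recorded together, which makes an oversized response object easy to spot.

```python
import json
import time
from collections import defaultdict


class FakeCollection:
    """In-memory stand-in for a real database collection (hypothetical)."""

    def __init__(self, docs):
        self._docs = docs

    def find(self, query):
        # Return every document whose fields match all key/value pairs in query.
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in query.items())]


class InstrumentedCollection:
    """Wraps a collection-like object and records context for each call."""

    def __init__(self, collection, name, stats):
        self._collection = collection
        self._name = name
        self._stats = stats

    def find(self, query, context):
        start = time.perf_counter()
        docs = list(self._collection.find(query))
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Approximate the payload size by serializing the result to JSON.
        size_bytes = len(json.dumps(docs).encode("utf-8"))
        self._stats[(context, self._name)].append((elapsed_ms, size_bytes))
        return docs


def largest_payloads(stats):
    """Return (context, collection, total_bytes) rows, largest first."""
    totals = [(ctx, coll, sum(size for _, size in calls))
              for (ctx, coll), calls in stats.items()]
    return sorted(totals, key=lambda row: row[2], reverse=True)


stats = defaultdict(list)
users = InstrumentedCollection(
    FakeCollection([
        {"id": 1, "name": "a", "history": ["x"] * 500},  # oversized document
        {"id": 2, "name": "b", "history": []},
    ]),
    "users",
    stats,
)
users.find({"id": 1}, context="GET /accounts")
users.find({"id": 2}, context="GET /prices")
report = largest_payloads(stats)
```

Sorting the recorded calls by total bytes puts the endpoint that returns the bloated document at the top of the report. In a real system the same wrapper would sit around the actual driver calls and write to a log or metrics pipeline rather than an in-memory dictionary.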
To prepare for future surges in traffic, the team has worked on tools called Capture and Cannon that can capture traffic from production systems and replay it on demand against new systems to test their resilience. Capture and Cannon are both based on mongoreplay, a tool that captures traffic to MongoDB instances from the network interface and records the commands being invoked. This log can then be replayed against another MongoDB instance. Traffic is captured across multiple application servers and merged into a single file. The captured traffic and a disk snapshot are stored on AWS S3, from where Cannon plays them back later.
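The merge-then-replay flow can be sketched in a few lines of Python. This is an illustration of the general idea only: it does not use mongoreplay's recording format or command-line interface, and the Op record, merge_captures, and replay names are assumptions made for the example. Each server's capture is assumed to be already sorted by timestamp; the streams are merged into one time-ordered stream, which is then replayed with the original gaps between operations, optionally compressed by a speed factor.

```python
import heapq
import time
from typing import Callable, Iterable, Iterator, NamedTuple


class Op(NamedTuple):
    ts: float      # capture timestamp, in seconds
    server: str    # application server the operation was captured on
    command: str   # recorded database command (simplified to a string)


def merge_captures(*captures: Iterable[Op]) -> Iterator[Op]:
    """Merge per-server captures (each sorted by ts) into one ordered stream."""
    return heapq.merge(*captures, key=lambda op: op.ts)


def replay(ops: Iterable[Op],
           execute: Callable[[Op], None],
           speed: float = 1.0,
           sleep: Callable[[float], None] = time.sleep) -> int:
    """Replay ops in order, keeping relative timing divided by `speed`."""
    count = 0
    previous_ts = None
    for op in ops:
        if previous_ts is not None:
            delay = (op.ts - previous_ts) / speed
            if delay > 0:
                sleep(delay)
        execute(op)  # in a real tool, this would send the command to the target
        previous_ts = op.ts
        count += 1
    return count


app1 = [Op(0.00, "app1", "find users"), Op(0.03, "app1", "update accounts")]
app2 = [Op(0.01, "app2", "find orders"), Op(0.02, "app2", "insert trades")]

replayed = []
total = replay(merge_captures(app1, app2), replayed.append, speed=10.0)
```

A speed factor above 1.0 replays the same operations in less wall-clock time, which is one simple way to emulate a heavier load than the one that was captured. Here execute just appends to a list; a real replayer would issue each command against the system under test.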
Coinbase maintains a public status page at https://status.coinbase.com/