CloudFlare's Juniper routers choked on a slight programming change designed to deflect a distributed denial-of-service attack, knocking the company's services off the Internet for about an hour early Sunday morning.
The San Francisco-based company provides a service that speeds up the delivery of web pages and reduces bandwidth use. It also provides a suite of security tools that helps website owners identify and filter malicious traffic.
CEO Matthew Prince wrote that a bug in the company's routers caused its services to effectively drop off the Internet around 1:47 a.m. PST on Sunday. The routers had been given a new rule, a type of filter, intended to deflect a DDoS attack underway against one of its customers.
CloudFlare saw that the attack used data packets that appeared to be between 99,971 and 99,985 bytes, far larger than the 500- to 600-byte average. The company's engineers wrote a rule telling the routers to drop the extra-large packets, and the rule was then distributed to the routers using the Flowspec protocol, Prince wrote.
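To illustrate why a rule like this would be expected to be harmless, here is a minimal Python sketch of a length-based drop filter. It is not Juniper's or CloudFlare's implementation: `LengthRule`, `can_ever_match` and `should_drop` are hypothetical names, and the sketch assumes plain IPv4, where the 16-bit total-length header field caps a packet at 65,535 bytes.

```python
from dataclasses import dataclass

# Assumption: plain IPv4. The total-length header field is 16 bits wide,
# so no packet can be longer than 65,535 bytes.
MAX_IPV4_PACKET_BYTES = 65_535


@dataclass(frozen=True)
class LengthRule:
    """Hypothetical stand-in for a packet-length drop filter."""
    min_bytes: int
    max_bytes: int

    def matches(self, packet_len: int) -> bool:
        # True when the packet length falls inside the rule's range.
        return self.min_bytes <= packet_len <= self.max_bytes

    def can_ever_match(self) -> bool:
        # A rule whose lower bound exceeds the protocol maximum is unreachable.
        return self.min_bytes <= MAX_IPV4_PACKET_BYTES


def should_drop(rule: LengthRule, packet_len: int) -> bool:
    """Return True if a packet of this length would be dropped by the rule."""
    return rule.matches(packet_len)


# The range reported for the attack packets.
rule = LengthRule(min_bytes=99_971, max_bytes=99_985)
```

Under those assumptions, the rule built from the reported range can never match, so a correctly working filter would pass every packet through untouched.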
"What should have happened is that no packet should have matched that rule because no packet was actually that large," he wrote "What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed."
Some of the crashed routers rebooted themselves, but others didn't. As some data centers came back online, they bore the brunt of all of the traffic hitting CloudFlare's network and crashed again.
"We were able to access some routers and see that they were crashing when they encountered this bad rule," Prince wrote. "We removed the rule and then called the network operations teams in the data centers where our routers were unresponsive to ask them to physically access the routers and perform a hard reboot."
A little over an hour later, CloudFlare had fixed the problem. The company has asked Juniper if the bug is a known issue or one that is unique to CloudFlare's network setup, Prince wrote.
"We will be doing more extensive testing of Flowspec provisioned filters and evaluating whether there are ways we can isolate the application of the rules to only those data centers that need to be updated, rather than applying the rules network wide," Prince wrote.
CloudFlare customers with service-level agreements will be issued credits, Prince wrote.
"Any amount of downtime is completely unacceptable to us and the whole CloudFlare team is sorry we let our customers down this morning," he wrote.