[Resolved] Another Cloudflare outage, this time internally caused?

Lurick · July 3, 2019

13 minutes ago, rcmaehl said:

No one:
Cloudflare: Let's test changes in production!

Production is best place to test changes!

Even better if it's during critical/peak customer time

colonel_mortis · July 3, 2019

13 minutes ago, will4623 said:

A single? so nice to know about half the internet has no redundency.

It was a single extra rule, which was deployed globally and on all of their redundant systems. In the blog post they say that they do staged deployments for most things, but that wasn't the case with the WAF rule because it was only deployed in monitoring mode, so it shouldn't have been able to cause issues. That meant that when their backups kicked in, they also failed.

It's entirely reasonable to think that a bad regular expression couldn't do damage, but unfortunately there are some really pathological ones out there - running something fairly innocent-looking, such as

/^a+(_?a+)*$/.exec('a'.repeat(33) + '!');

in nodeJS (using a 34 character input) took over a minute while maxing out my CPU, and increasing the input size causes the execution time to increase exponentially (at 50 characters it takes at least several hours).

While this issue could clearly have been avoided, they have an impressive uptime record so far, so they are doing most things right. They are the type of company that will learn from this and make sure that nothing like it happens in future.