
[Resolved] Another Cloudflare outage, this time internally caused?

brwainer
13 minutes ago, rcmaehl said:

No one:
Cloudflare: Let's test changes in production!

Production is best place to test changes!

Even better if it's during critical/peak customer time :D


13 minutes ago, will4623 said:

A single? So nice to know about half the internet has no redundancy.

It was a single extra rule, but it was deployed globally, including to all of their redundant systems, so when their backups kicked in, they failed too. In the blog post they say that they do staged deployments for most things, but that wasn't the case with this WAF rule: it was only deployed in monitoring mode, so it shouldn't have been able to cause issues.

 

It's entirely reasonable to think that a bad regular expression couldn't do damage, but unfortunately there are some really pathological ones out there - running something fairly innocent-looking, such as

/^a+(_?a+)*$/.exec('a'.repeat(33) + '!');

in Node.js (using a 34-character input) took over a minute while maxing out my CPU, and the execution time grows exponentially with the input length (at 50 characters it takes at least several hours).
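
If you want to see the growth for yourself without locking up your machine, here is a rough sketch that times the same regex on much shorter inputs. The helper `timeFailedMatch` and the rewritten `linear` regex are my own illustration, not something from the Cloudflare blog post; the rewrite just makes the `_` mandatory so there is only one way to split a run of a's.

```javascript
// Same regex as above: `_?` is optional, so a run of a's can be divided
// between `a+` and the repeated group `(_?a+)*` in exponentially many ways.
const pathological = /^a+(_?a+)*$/;

// Hypothetical rewrite (an assumption, not from the blog post): with `_`
// mandatory the ambiguity disappears, and the same strings are accepted.
const linear = /^a+(_a+)*$/;

// Time one failed match against n a's followed by '!'.
function timeFailedMatch(re, n) {
  const input = 'a'.repeat(n) + '!';
  const start = Date.now();
  const matched = re.test(input);
  return { n, matched, ms: Date.now() - start };
}

// Keep n small: each extra character roughly doubles the pathological time.
for (let n = 16; n <= 24; n += 2) {
  console.log('pathological', timeFailedMatch(pathological, n));
}

// The rewritten regex rejects a far longer input almost immediately.
console.log('linear', timeFailedMatch(linear, 10000));
```

Exact timings will depend on your machine and Node.js version, but the pathological numbers should climb steeply as n increases while the linear one stays negligible.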

 

While this issue could clearly have been avoided, they have an impressive uptime record so far, so they are doing most things right. They are the type of company that will learn from this and make sure that nothing like it happens in future.

HTTP/2 203
