A breakdown of Saturday's outage

colonel_mortis

Timeline of the outage (times in UTC):

  • Starting at 02:03 on Saturday, requests began intermittently returning error 502, and many of the requests that were served successfully were significantly slower than normal
  • At 05:59, most of the services on the server crashed. All subsequent requests were served by Cloudflare with error code 502
  • At about 11:00, I came online and attempted to diagnose the problem. Due to the previous failures I was unable to access the server, so all I was able to do was route traffic to an offline page
  • At 16:35, with extra help, the server was forcibly restarted
  • At 16:43 the server seemed to have come up successfully, so we enabled traffic and monitored the status
  • By 16:53 it was clear that performance was very poor and a significant fraction of the requests were resulting in error 502 (which in this case means the server was already processing too many requests, so there were no workers available; a rough sketch of this kind of check follows the timeline), so we disabled traffic again to investigate the situation further
  • At 17:36, there was nothing clearly wrong so we tried enabling traffic again
  • At 17:39 performance was degrading significantly again, so the site was taken offline once more
  • At 18:00 we ordered a new server
  • At 18:35 the server was ready to be set up with all of the forum-related things, and for the data to be migrated to it
  • At 21:28, the new server was fully set up and the forum was turned back online
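
For illustration, here is a minimal Python sketch of the kind of 502-rate check referred to in the timeline above. It is not the monitoring we actually ran, and the URL, sample count and threshold are placeholders.

# Illustration only: sample the site and report what fraction of requests
# come back as 502 ("no workers available") versus other failures.
import urllib.error
import urllib.request

SITE = "https://forum.example.com/"    # placeholder, not the real address
SAMPLES = 20

def sample_rates(url=SITE, samples=SAMPLES, timeout=10):
    bad_gateway = other_errors = 0
    for _ in range(samples):
        try:
            urllib.request.urlopen(url, timeout=timeout).close()
        except urllib.error.HTTPError as err:
            if err.code == 502:
                bad_gateway += 1       # the "no workers available" case
            else:
                other_errors += 1
        except OSError:                # URLError, timeouts, connection resets
            other_errors += 1
    return bad_gateway / samples, other_errors / samples

if __name__ == "__main__":
    rate_502, rate_other = sample_rates()
    print(f"502 rate: {rate_502:.0%}, other failures: {rate_other:.0%}")
    if rate_502 + rate_other > 0.2:    # arbitrary "significant fraction" threshold
        print("Looks unhealthy; traffic should probably be disabled again.")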

 

What were the symptoms?

  • IOWait accounted for the majority of CPU time, but IO utilisation was relatively normal (a sketch of this kind of check follows this list)
  • In the syslog we were seeing a number of IO timeouts for the primary drive
  • Prior to rebooting, the limited errors that we could see indicated that there had been disk corruption
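
To make the IOWait figure concrete, here is a rough Python sketch that estimates the iowait share of CPU time from /proc/stat on a Linux host. It illustrates the measurement only, and is not the monitoring tooling that was actually in place.

# Sample /proc/stat twice and report the fraction of CPU time spent in iowait.
# The aggregate "cpu" line is: user nice system idle iowait irq softirq steal ...
import time

def read_cpu_times():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

def iowait_fraction(interval=5.0):
    before = read_cpu_times()
    time.sleep(interval)
    after = read_cpu_times()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0   # field 4 is iowait

if __name__ == "__main__":
    print(f"iowait over the last 5s: {iowait_fraction():.1%}")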

 

What was the root cause?

Although we aren't 100% sure, we think it's fairly likely that one of the two RAID 1 disks that form the primary disk had failed, and the poor performance was a consequence of trying to rebuild the array.
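
As an aside, on a Linux software RAID (md) mirror (not necessarily what this server used), a failed member or an in-progress rebuild shows up in /proc/mdstat. The following Python sketch, for illustration only, checks for those two signs.

# Illustration only: look for a rebuilding or degraded md array in /proc/mdstat.
def mdstat_status(path="/proc/mdstat"):
    with open(path) as f:
        text = f.read()
    findings = []
    if "recovery" in text or "resync" in text:
        findings.append("an array is rebuilding or resyncing")
    # md shows member health in brackets: [UU] is healthy, [U_] has a failed disk
    for line in text.splitlines():
        last_bracket = line.rsplit("[", 1)[-1]
        if "]" in last_bracket and "_" in last_bracket:
            findings.append("degraded array: " + line.strip())
    return findings or ["no obvious md problems found"]

if __name__ == "__main__":
    for finding in mdstat_status():
        print(finding)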

 

Why replace the server rather than the disk?

There were already plans to replace the server; this failure just accelerated them. We were also of the opinion that having the disk replaced and the array rebuilt would likely not end up being faster, especially as we did not have sufficient information to pinpoint the failure.

4 Comments

Thanks for giving us the word and telling us at least what you think the problem was, specifically. Most other non-tech sites would probably just tell you "hardware problems" at most, assuming nobody has the knowledge to understand what the specific issue really was, even when people want better answers.


Hm... Really interesting how issues can manifest. Troubleshooting can be a real problem. Glad you guys figured it out! Great detective work on that one.
