Kilrah

Moderator
  • Posts

    16,479
  • Joined

  • Last visited

Reputation Activity

  1. Like
    Kilrah reacted to GOTSpectrum for a blog entry, Award Ceremony   
    This was a nice little event for us all to stretch our legs and keep ourselves warmed up for folding month this year. As many of you know, I have been plagued with health issues. Because of my health conditions, this will continue and slowly get worse, but I plan to keep pushing on with more events for as long as I can. Thank you to everyone who helps the events run smoothly, and thank you to everyone who signs up for understanding my struggles.
     
    This will certainly not be the longest awards ceremony, as is the case with sprint events, but I would like to thank those who volunteered to help with the event, with special thanks to @marknd59, @cbigfoot and @leadeater, who are event veterans at this point.
     
    To claim prizes, you need to send me a PM within the next 7 days with the subject line "please can I have my prize".
     
    https://docs.google.com/spreadsheets/d/1_1PdoJlyopU4MesssRoLq5PQxtuFwDO04X0IdkGlEdA
     
    Thank you to everyone who got involved during this event. It was great to be back even with my own struggles. 
     
    The next event will commence on the 1st of November. I look forward to seeing you all there.
     
    I ended up rewriting this as morphine-brained spec didn't write a good post. 
     
    Happy folding,
     
    Spec.
  2. Informative
    Kilrah reacted to colonel_mortis for a blog entry, A breakdown of Saturday's outage   
    Timeline of the outage: (times in UTC)
    • Starting at 02:03 on Saturday, requests started intermittently returning error 502, and many of the requests that were served successfully were significantly slower than normal
    • At 05:59, most of the services on the server crashed. All subsequent requests were served by Cloudflare with error code 502
    • At about 11:00, I came online and attempted to diagnose the problem. Due to the previous failures I was unable to access the server, so all I was able to do was route traffic to an offline page
    • At 16:35, with extra help, the server was forcibly restarted
    • At 16:43 the server seemed to have come up successfully, so we enabled traffic and monitored the status
    • By 16:53 it was clear that performance was very poor and a significant fraction of the requests were resulting in error 502 (which in this case means that the server was already processing too many requests, so there were no workers available), so we disabled traffic again to investigate the situation further
    • At 17:36, there was nothing clearly wrong so we tried enabling traffic again
    • At 17:39 the performance was significantly regressing again so the site was turned offline again
    • At 18:00 we ordered a new server
    • At 18:35 the server was ready to be set up with all of the forum-related things, and for the data to be migrated to it
    • At 21:28, the new server was fully set up and the forum was turned back online
    What were the symptoms?
    • IOWait accounted for the majority of CPU time, but IO utilisation was relatively normal
    • In the syslog we were seeing a number of IO timeouts for the primary drive
    • Prior to rebooting, the limited errors that we could see indicated that there had been disk corruption
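
    For readers who want to check the first symptom on their own machine, here is a minimal sketch of how the IOWait share of CPU time can be estimated on Linux from two snapshots of the aggregate `cpu` line in `/proc/stat`. The function names and the sample numbers are made up for illustration; none of them come from the incident, and the post does not say which tool was actually used.

```python
def parse_cpu_line(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(field) for field in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_fraction(before_text, after_text):
    """Fraction of CPU time spent in iowait between two snapshots.

    Field order is: user nice system idle iowait irq softirq steal ...
    Only the first eight fields are summed, because guest time is already
    counted inside user time.
    """
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[4] / total  # index 4 is iowait


# Hypothetical snapshots (illustrative values only).
SNAPSHOT_BEFORE = "cpu  100 0 100 800 0 0 0 0 0 0"
SNAPSHOT_AFTER = "cpu  110 0 110 820 60 0 0 0 0 0"

print(f"iowait share: {iowait_fraction(SNAPSHOT_BEFORE, SNAPSHOT_AFTER):.0%}")
```

    On a real Linux host the two strings would be the contents of `/proc/stat` read a few seconds apart. A high share here combined with ordinary throughput matches the pattern described above: the CPU is mostly waiting on a device that is slow to answer rather than one that is busy.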
    What was the root cause?
    Although we aren't 100% sure, we think it's fairly likely that one of the two RAID 1 disks that form the primary disk had failed, and the poor performance was a consequence of trying to rebuild the array.
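
    As a rough illustration of how a degraded mirror can be spotted, the sketch below scans `/proc/mdstat`-style text for arrays with a missing member. This assumes Linux software RAID (md), which the post does not confirm; a hardware RAID controller would need its vendor's tooling instead. The sample text and the `degraded_arrays` helper are hypothetical and are not taken from the affected server.

```python
import re


def degraded_arrays(mdstat_text):
    """Map array name -> member status string for arrays missing a member.

    In /proc/mdstat, a status such as [UU] means every member is up,
    while an underscore (for example [_U]) marks a missing or failed member.
    """
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\w+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded


# Illustrative sample only: md0 has lost one of its two RAID 1 members.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      976630464 blocks super 1.2 [2/1] [_U]

md1 : active raid1 sdb2[1] sda2[0]
      1046528 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE_MDSTAT))
```

    A check like this only works while the server is reachable, which was not the case for much of the outage described here.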
     
    Why replace the server rather than the disk?
    There were already plans to replace the server; this failure just accelerated them. We were also of the opinion that having the disk replaced and getting the array rebuilt would likely not end up being faster, especially as we did not have sufficient information to pinpoint the failure.