Rolling2405

Member
  • Posts

    341
  • Joined

  • Last visited

Reputation Activity

  1. Informative
    Rolling2405 reacted to colonel_mortis for a blog entry, LTT Forum Tech Stack   
    The LTT forum is built on Invision Community (previously known as IPS and IPB), an off-the-shelf forum software. The majority of the code that powers the forum is theirs, and the technologies that we can use are largely constrained by what Invision Community supports.
     
    The backend is written entirely in PHP, using a custom framework built specially for Invision Community. Some type annotations have been introduced into the codebase recently, but most of the code was written before PHP supported types and is therefore untyped (this applies to both IC's code and our custom additions). For a sense of scale, there are over 450,000 lines of PHP in Invision Community, and about 12,000 lines of LTT-specific PHP.
     
    Data is stored in MySQL, Elasticsearch and Redis. MySQL is the source of truth for all of the data in the system, and the backend code interacts with it using Active Records. User content is duplicated into Elasticsearch to power the search functionality, as well as activity feeds and content feeds on user profiles (everywhere that combines content of multiple different types into one view). Redis is used only as a cache for frequently accessed or expensive values.
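     
    The split between a source of truth and a cache can be illustrated with a minimal cache-aside sketch. This is not the forum's code (which is PHP): the `CacheAside` class, the `FAKE_DB` dictionary and the key format are hypothetical, and plain dictionaries stand in for MySQL and Redis.

```python
from typing import Callable, Dict, Optional


class CacheAside:
    """Check the cache first; on a miss, read the source of truth and remember the result."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db
        self._cache: Dict[str, str] = {}  # stand-in for Redis (assumption)
        self.db_reads = 0                 # counts how often the database is consulted

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            return self._cache[key]       # cache hit: no database access
        self.db_reads += 1
        value = self._load_from_db(key)   # cache miss: fall back to the source of truth
        if value is not None:
            self._cache[key] = value
        return value

    def invalidate(self, key: str) -> None:
        """Drop a cached value after the underlying row changes."""
        self._cache.pop(key, None)


# Hypothetical data standing in for a MySQL table.
FAKE_DB = {"member:1": "colonel_mortis"}
members = CacheAside(FAKE_DB.get)
```

    Because the cache is never authoritative, losing it costs only speed: every value can be reloaded from the database. The price is that a write must invalidate (or overwrite) the cached copy, otherwise readers keep seeing the stale value.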
     
    Requests are handled by NGINX, and they pass through Cloudflare before reaching the server.
     
    All of the backend services are hosted on one server, with a 16 core EPYC processor. On the server, we're running CentOS 8.
     
    For the frontend, all of the HTML rendering is performed on the server side, using a custom HTML template syntax that works similarly to mustache (although with considerably less polish). Writing this code is virtually identical to writing raw HTML, except that you can use variables, use loops, and import other templates. Invision Community also has its own front end CSS framework, which generally works well to minimise the specific CSS required for each new feature.
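     
    To give a feel for what "raw HTML plus variables and loops" means, here is a toy renderer. The `{name}` and `{{each ... as ...}}` syntax is invented for this sketch and is not Invision Community's actual template syntax; template imports and nested loops are left out.

```python
import html
import re
from typing import Any, Dict

# One pattern, two alternatives: a loop block, or a single variable.
# Hypothetical syntax: {{each items as item}} ... {{end}} and {variable}.
TOKEN = re.compile(
    r"\{\{each (\w+) as (\w+)\}\}(.*?)\{\{end\}\}|\{(\w+)\}",
    re.DOTALL,
)


def render(template: str, context: Dict[str, Any]) -> str:
    """Replace variables and expand loops; everything else is passed through as HTML."""

    def expand(match: "re.Match[str]") -> str:
        list_name, item_name, body, var_name = match.groups()
        if var_name is not None:
            # Variable: escape the value so user content cannot inject markup.
            return html.escape(str(context[var_name]))
        # Loop: render the body once per item, with the item bound to its own name.
        return "".join(
            render(body, {**context, item_name: item})
            for item in context[list_name]
        )

    return TOKEN.sub(expand, template)


page = render(
    "<ul>{{each posts as p}}<li>{p}</li>{{end}}</ul><p>{title}</p>",
    {"posts": ["a", "<b>"], "title": "Hi"},
)
```

    Substituted values are never rescanned for template syntax, so a post containing `{title}` as text is emitted literally rather than expanded.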
     
    Continuing with the trend, there is also a custom Javascript framework for attaching controllers to elements and for declaring and using UI widgets. The framework is built for ES5 + jQuery; while it is possible to use ES6+ features in custom code, we rarely need to write enough custom code to make much use of them.
     
    We can easily edit the PHP, HTML and CSS provided by Invision Community, both using their hook system and by directly modifying code on our fork, but the Javascript framework is generally not editable because it doesn't provide a generally usable hooking mechanism and the code is stored in the database rather than in files. This means that custom Javascript is limited to the small amount needed to support new functionality; accordingly, we only have about 400 lines of LTT-specific JS.
     
    I've also built a couple of GitHub Actions in TypeScript to automate some boring tasks, such as updating the repo with the latest version of Invision Community.
  2. Informative
    Rolling2405 reacted to colonel_mortis for a blog entry, A breakdown of Saturday's outage   
    Timeline of the outage: (times in UTC)
    • Starting at 02:03 on Saturday, requests started intermittently returning error 502, and many of the requests that were served successfully were significantly slower than normal
    • At 05:59, most of the services on the server crashed. All subsequent requests were served by Cloudflare with error code 502
    • At about 11:00, I came online and attempted to diagnose the problem. Due to the previous failures I was unable to access the server, so all I was able to do was route traffic to an offline page
    • At 16:35, with extra help, the server was forcibly restarted
    • At 16:43 the server seemed to have come up successfully, so we enabled traffic and monitored the status
    • By 16:53 it was clear that performance was very poor and a significant fraction of the requests were resulting in error 502 (which means in this case that the server was already processing too many requests, so there were no workers available), so we disabled traffic again to investigate the situation further
    • At 17:36, there was nothing clearly wrong so we tried enabling traffic again
    • At 17:39 the performance was significantly regressing again so the site was taken offline again
    • At 18:00 we ordered a new server
    • At 18:35 the server was ready to be set up with all of the forum-related things, and for the data to be migrated to it
    • At 21:28, the new server was fully set up and the forum was turned back online
    What were the symptoms?
    • IOWait accounted for the majority of CPU time, but IO utilisation was relatively normal
    • In the syslog we were seeing a number of IO timeouts for the primary drive
    • Prior to rebooting, the limited errors that we could see indicated that there had been disk corruption
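     
    The first symptom can be measured on Linux by sampling the aggregate `cpu` line of `/proc/stat` twice and comparing the counters. The sketch below is illustrative rather than the tooling used during the outage; the function names and the sample lines are made up. IOWait counts time the CPU sat idle while IO was outstanding, so it can dominate even when throughput looks normal, for example when a drive is slow to answer.

```python
from typing import List


def cpu_times(stat_line: str) -> List[int]:
    """Parse the aggregate 'cpu' line from /proc/stat into its tick counters."""
    fields = stat_line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def iowait_fraction(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two samples of the 'cpu' line."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # Field order: user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice.
    # Guest time is already included in user/nice, so only the first eight are summed.
    total = sum(deltas[:8])
    return deltas[4] / total if total else 0.0


# Made-up samples: 60 of the 100 ticks between them were spent waiting on IO.
sample_before = "cpu  100 0 50 800 50 0 0 0 0 0"
sample_after = "cpu  110 0 60 820 110 0 0 0 0 0"
fraction = iowait_fraction(sample_before, sample_after)
```

    On a real machine the two samples would come from reading the first line of `/proc/stat` a few seconds apart.
     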
    What was the root cause?
    Although we aren't 100% sure, we think it's fairly likely that one of the two RAID 1 disks that form the primary disk had failed, and the poor performance was a consequence of trying to rebuild the array.
     
    Why replace the server rather than the disk?
    There were already plans to replace the server; this failure just accelerated them. We also believed that having the disk replaced and the array rebuilt would probably not be any faster, especially as we did not have enough information to pinpoint the failure.