
A network failure causes service disruption at KLIA

Yongtjunkit

A network failure causes system/service disruption at KLIA 

 

Quote

Sepang (Bernama): Malaysia Airports has identified a network failure as the cause of the disruption at the Kuala Lumpur International Airport (KLIA).

Quote

Sepang (Bernama): Operations at the Kuala Lumpur International Airport (KLIA) returned to normal on Sunday (Aug 25) after it was hit by a network equipment failure on Wednesday (Aug 21) night, which caused long queues and flight delays.

Quote

SEPANG (Bernama): The situation at the KL International Airport (KLIA) was under control on Monday (Aug 26) morning even though the Flight Information Display System (FIDS) at the Departure Hall was not functioning.
 

Quote

Checks by Bernama at noon Monday showed that the FIDS had partially recovered.

Quote

There has been no official reason why the systems failed, but sources indicate that the core switch that links all the information systems at the KL International Airport (KLIA) broke down because it had exceeded its expected lifespan.

"There were still intermittent disruptions up to midday yesterday, especially at the back-end, even though the IT experts, including those from Cisco, were still trying to change the hardware and stabilise the network," said an expert.

The source said the core switch is part of the Total Airport Management System, which is an integrated airport management system used to interface and integrate the majority of electronic information within the airport for services such as check-in, baggage, WiFi, flight information display systems, communications systems and several others.

He added that it was an old system that was due for an upgrade in 2012-2014, but was not approved despite several requests. A back-up is in place but could not withstand the overloading, adding to the chaos, the source said.

Sources:

1. https://www.thestar.com.my/news/nation/2019/08/25/operations-at-klia-back-to-normal-four-days-after-disruption

2. https://www.thestar.com.my/news/nation/2019/08/23/m039sia-airports-disruption-at-KLIA-due-to-network-failure-equipment-replaced

3. https://www.thestar.com.my/news/nation/2019/08/26/situation-in-klia-under-control-despite-non-functional-fids-at-departure-hall

4. https://www.thestar.com.my/business/business-news/2019/08/27/airlines-to-seek-mahb-compensation-for-delays-losses

 

Airport mission-critical systems, such as the flight information display system and airport WiFi, were taken down by a network failure, causing long queues and flight delays for four days.


27 minutes ago, Yongtjunkit said:

A failed network switch causes system/service disruption at KLIA 

 

 

Sources:

1. https://www.thestar.com.my/news/nation/2019/08/25/operations-at-klia-back-to-normal-four-days-after-disruption

2. https://www.thestar.com.my/news/nation/2019/08/23/m039sia-airports-disruption-at-KLIA-due-to-network-failure-equipment-replaced

 

Airport mission-critical systems, such as the flight information display system and airport WiFi, were taken down by a failed network switch, causing long queues and flight delays.

Like all electronics, they eventually fail. It does suck that it happened. Now my question is: why wasn't there a failover switch?


16 minutes ago, rockking1379 said:

Like all electronics, they eventually fail. It does suck that it happened. Now my question is: why wasn't there a failover switch?

The problem is that when a system is up and running, no one notices vulnerabilities such as these. In very complicated setups like these it's rather easy to miss something. The networking setup is often enormous and grows over time, getting updates here and there. There's a tendency to just go "well, it works, so no one touch it". Perhaps there were plenty of failovers and backups in place elsewhere, but they just happened not to consider this one switch.


I wonder how long this system has been running? Airport computers tend to be upgraded slowly.


48 minutes ago, akio123008 said:

The problem is that when a system is up and running, no one notices vulnerabilities such as these. In very complicated setups like these it's rather easy to miss something. The networking setup is often enormous and grows over time, getting updates here and there. There's a tendency to just go "well, it works, so no one touch it". Perhaps there were plenty of failovers and backups in place elsewhere, but they just happened not to consider this one switch.

It really isn't that complicated to use a stacking switch, or a pair that isn't stacked but is configured correctly. Spine-and-leaf topologies have existed for decades and redundancy is extremely simple to implement. Given the scale of impact from the outage, the equipment failure has to have been at the core/distribution layer of the network; there's zero excuse not to have redundancy there.

 

Equipment failures aren't that uncommon; sadly, it's also not uncommon to have no redundant network paths, or to think you do when you actually don't.


7 hours ago, leadeater said:

It really isn't that complicated to use a stacking switch, or a pair that isn't stacked but is configured correctly. Spine-and-leaf topologies have existed for decades and redundancy is extremely simple to implement. Given the scale of impact from the outage, the equipment failure has to have been at the core/distribution layer of the network; there's zero excuse not to have redundancy there.

That's true but it's not really what I meant; of course the implementation of redundancy itself in any network is rather straightforward, what I'm saying is that the overall complexity of very large networks like this can cause people to lose track of what's actually going on. Keeping track of all the hardware that's involved and how the redundancy is ensured can be a more difficult task than one might expect. That's not an excuse, but it is an explanation of why stuff like this happens. Of course a properly set up, designed and documented network shouldn't have this issue, but the world isn't quite perfect so shit happens.


20 minutes ago, akio123008 said:

That's true but it's not really what I meant; of course the implementation of redundancy itself in any network is rather straightforward, what I'm saying is that the overall complexity of very large networks like this can cause people to lose track of what's actually going on. Keeping track of all the hardware that's involved and how the redundancy is ensured can be a more difficult task than one might expect. That's not an excuse, but it is an explanation of why stuff like this happens. Of course a properly set up, designed and documented network shouldn't have this issue, but the world isn't quite perfect so shit happens.

It really isn't that hard; it's literally as simple as two fibre runs to each network cabinet that contains two switches and is fed from a distribution layer with redundant paths. As long as you follow correct standards for anything important, you'll never have the problem.

 

Having a few non-important single switches feeding a couple of things, yeah, that can happen, but this wasn't it. It's not like only a couple of APs or a few terminals went down; it was pretty much everything, meaning it was in a network segment that was important and should have had redundancy. Being where it would have been, it's not something you'd forget to do or lose track of.
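To make that concrete, here's a minimal Python sketch of the kind of check that design implies: losing any one switch shouldn't cut a critical endpoint off from the core. The topology and node names below are entirely invented for illustration.

```python
# Toy single-point-of-failure check over an invented switch topology.
# Node names and links are made up for illustration only.
from collections import defaultdict, deque

LINKS = [
    ("core-1", "core-2"),
    ("core-1", "dist-1"), ("core-2", "dist-1"),
    ("core-1", "dist-2"), ("core-2", "dist-2"),
    ("dist-1", "cab-A-sw1"), ("dist-2", "cab-A-sw1"),
    ("dist-1", "cab-A-sw2"), ("dist-2", "cab-A-sw2"),
    ("cab-A-sw1", "check-in-host"),   # host is only connected to one switch!
]

def reachable(links, start, failed=None):
    """Return all nodes reachable from `start` when the `failed` node is removed."""
    graph = defaultdict(set)
    for a, b in links:
        if failed not in (a, b):
            graph[a].add(b)
            graph[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

def single_points_of_failure(links, endpoint, core="core-1"):
    """Switches whose failure cuts `endpoint` off from the core."""
    switches = {n for link in links for n in link} - {endpoint, core}
    return [sw for sw in switches
            if endpoint not in reachable(links, core, failed=sw)]

print(single_points_of_failure(LINKS, "check-in-host"))  # ['cab-A-sw1']
```

In this made-up topology only the single-homed access switch shows up, which is exactly the kind of gap dual connections are meant to eliminate.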


Is it normal for everything to be running on the same network switch?

I get that you can segregate the networks on the switch, but I would have thought that the check-in computers and baggage handling would run on a completely different network from the public WiFi, for security.


1 minute ago, yolosnail said:

Is it normal for everything to be running on the same network switch?

I get that you can segregate the networks on the switch, but I would have thought that the check-in computers and baggage handling would run on a completely different network from the public WiFi, for security.

It's normal to use the same equipment, just not to have single points of failure. Enterprise APs can put SSIDs on segregated VLANs and IP ranges, and those APs will tunnel the traffic back to the wireless controller over secure VLANs at the network switch layer. Virtual network segments on top of physical networking.
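As a purely illustrative sketch of that idea (the SSID names, VLAN numbers and frame format below are invented), traffic from different SSIDs can share the same physical switch while staying separated by VLAN tags:

```python
# Toy model of SSID-to-VLAN segregation on shared hardware.
# SSIDs, VLAN IDs and the "frame" format are illustrative only.

SSID_TO_VLAN = {
    "Public-WiFi": 100,   # guest traffic
    "Operations": 200,    # check-in / baggage systems
}

def tag_frame(ssid, payload):
    """Attach a VLAN tag to a frame based on the SSID it arrived on."""
    return {"vlan": SSID_TO_VLAN[ssid], "payload": payload}

def forward(frame, port_vlans):
    """Return the switch ports allowed to carry this frame's VLAN."""
    return [port for port, vlans in port_vlans.items() if frame["vlan"] in vlans]

if __name__ == "__main__":
    # Port 1 trunks both VLANs back to the controller; port 2 carries guest only.
    ports = {1: {100, 200}, 2: {100}}
    guest = tag_frame("Public-WiFi", b"http request")
    ops = tag_frame("Operations", b"check-in record")
    print(forward(guest, ports))  # [1, 2]
    print(forward(ops, ports))    # [1] - operations traffic never reaches the guest-only port
```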


OK tech nerds, what if they had put in a redundant/failover switch, but the failure was in the bit that allows that to work, and it stopped either one from working?


Just now, mr moose said:

OK tech nerds, what if they had put in a redundant/failover switch, but the failure was in the bit that allows that to work, and it stopped either one from working?

Then maybe they should have made that bit redundant as well!


4 minutes ago, yolosnail said:

Then maybe they should have made that bit redundant as well!

where do we stop? 


1 minute ago, mr moose said:

where do we stop? 

Usually that's when management says it costs too much or they want to find a way to cut back on costs, security and redundancy are some of the first things to get put on the chopping block.


14 minutes ago, leadeater said:

It's normal to use the same equipment, just not to have single points of failure. Enterprise APs can put SSIDs on segregated VLANs and IP ranges, and those APs will tunnel the traffic back to the wireless controller over secure VLANs at the network switch layer. Virtual network segments on top of physical networking.

The APs aren't what worries me, it's the mission-critical systems that didn't have redundant connections to two separate switches, or at least some sort of additional redundancy. A switch stack won't really save you if all your APs are sitting on that stack, or if you're an end host with a single port and that switch goes bad; then it goes down. You should at least have two connections to different switches for anything mission-critical.


1 hour ago, leadeater said:

It really isn't that hard; it's literally as simple as two fibre runs to each network cabinet that contains two switches and is fed from a distribution layer with redundant paths. As long as you follow correct standards for anything important, you'll never have the problem.

Again, it's not really "hard" it's just that people are negligent and networking systems can be a mess. Just the fact that a network is deployed in a big facility doesn't indicate it's been set up correctly.  Especially as networks don't just get set up once and then stay online, but rather get modified all the time (by different people); the odds of something being set up incorrectly at some point aren't very small.

 

It's easy to just say "oh, if you just do it this way and follow the correct standards it always works", but if it were like that in the real world, this kind of stuff wouldn't happen. This actually goes for most technical problems: they aren't very hard to deal with, and in many cases could easily have been avoided in hindsight. It's just stupid tiny (human) errors that cause the problem.

 

1 hour ago, leadeater said:

it's not something you'd forget to do or lose track of. 

If that were the case, the problem wouldn't have existed, so apparently it is something you can lose track of.

 

I don't disagree with you in any way; redundancy could and should have been implemented here, and shame on the network engineers who didn't. I was just trying to point out that this kind of stupid stuff happens all too often because even high end large networks can be a mess when people don't do their job properly.

 

"Never attribute to malice what can be adequately explained by stupidity" is a rule that I like to follow, so I am ignoring the possibility that someone knew the problem existed and didn't bother fixing it. (or worse actively caused the problem to disrupt certain things)


2 hours ago, Lurick said:

The APs aren't what worries me, it's the mission-critical systems that didn't have redundant connections to two separate switches, or at least some sort of additional redundancy. A switch stack won't really save you if all your APs are sitting on that stack, or if you're an end host with a single port and that switch goes bad; then it goes down. You should at least have two connections to different switches for anything mission-critical.

That stacking issue is why we don't use any stacking technologies in our DCs at all; we just multi-path everything. A lot easier to do that in a DC where every endpoint device is going to have dual NICs, though. It does mean we can't use switch-assisted LACP, but ESXi can do a good enough job at balancing traffic per VM, so we get some amount of active-active capability. Storage just has multipathing capabilities at its core nowadays, so that's not a problem at all.


1 hour ago, akio123008 said:

Again, it's not really "hard" it's just that people are negligent and networking systems can be a mess. Just the fact that a network is deployed in a big facility doesn't indicate it's been set up correctly.  Especially as networks don't just get set up once and then stay online, but rather get modified all the time (by different people); the odds of something being set up incorrectly at some point aren't very small.

I understood your point; however, where the failure would have had to occur means simple forgetfulness about ensuring redundancy is off the cards. You don't just forget to do that, and you'd know it's like that every day. I agree with potential negligence, but that is different from saying networks sprawl, get complicated and end up not well documented.

 

Had it been just a switch out on the edge providing connectivity to 10-40 devices, then sure, it's totally reasonable to forget to go back and add a second switch and distribute the clients across the two, or activate that second fibre link, or a number of other things, but the problem is that it wasn't that.

 

I actually wouldn't be surprised at all if the failure wasn't actually a network switch and was instead the storage array that underpins all the servers; that's more likely to have a fault that would take everything down. As far as most people care, the network is down and that's what gets reported, but it doesn't actually have to mean it was a network fault at all.


I don't think we have enough information to actually say what happened here or how it could be avoided. 

What segment failed?

Why did it fail? 

Why wasn't there any redundancy? 

 

So far all we know is that it wasn't because of a malicious attack or vulnerability. 

 

We also know that the things which stopped working were: WiFi, flight info displays, check-in counters and baggage handling systems. So it sounds like it was something major that went down. But it doesn't sound like everything went down. 

I'm not even entirely convinced that it was a switch or router. It just says "network equipment" but who knows what that might mean.

 

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

 

Let's not point fingers and play armchair experts when we have next to no info to go on. 


49 minutes ago, LAwLz said:

I don't think we have enough information to actually say what happened here or how it could be avoided. 

What segment failed?

Why did it fail? 

Why wasn't there any redundancy? 

 

So far all we know is that it wasn't because of a malicious attack or vulnerability. 

 

We also know that the things which stopped working were: WiFi, flight info displays, check-in counters and baggage handling systems. So it sounds like it was something major that went down. But it doesn't sound like everything went down. 

I'm not even entirely convinced that it was a switch or router. It just says "network equipment" but who knows what that might mean.

 

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

 

Let's not point fingers and play armchair experts when we have next to no info to go on. 

Oops, my bad 

 

just updated the topic

Thanks for pointing it out


3 hours ago, Lurick said:

Usually that's when management says it costs too much or they want to find a way to cut back on costs, security and redundancy are some of the first things to get put on the chopping block.

Then usually next on the chopping block is your job when that system fails because "you should have been able to keep the system working correctly, that's your job right?" xD

 

Moral of the story is that you can never appease the bean counters...


1 hour ago, LAwLz said:

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

Configuration errors could also cause this. Someone created a switching loop in our office by plugging an unmanaged switch into itself. A configuration error by the networks team meant that the multiple routines used to detect loops and disable ports cancelled each other out. That switching loop brought down the entire building for most of the day... oops. 
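Conceptually, a switching loop is just a cycle in the graph of links between switches, which is what loop-detection routines are there to catch and break. A rough Python sketch of that underlying idea, with an invented topology (this is not how spanning tree is actually implemented, just the cycle check it exists to perform):

```python
# Minimal cycle check over an undirected "switch topology" graph.
# Conceptual sketch only; node names are made up for illustration.

def has_loop(links):
    """Union-find: adding a link between two already-connected switches creates a loop."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            return True   # a and b were already connected: this link closes a loop
        parent[ra] = rb
    return False

# The unmanaged switch plugged into itself is the degenerate case:
print(has_loop([("core", "sw1"),
                ("sw1", "office-switch"),
                ("office-switch", "office-switch")]))  # True
```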


Another thing to keep in mind is that this is an airport. It needs to be working 24/7, 365 days a year without hiccups.

You do not want to touch those things if they are working. Even installing a quite simple update might require a reboot, which might cause things to stop working momentarily in the best case scenario, and it might reveal/cause a lot of issues in the worst case scenario.

 

It's not fun justifying changes to a fully functional network just because "we need to fix an issue that might cause problems sometime in the future". The risk when touching these networks can often be larger than the risk of just letting things be the way they are. Being the person in charge of making the changes is stressful as hell. I mean, simply restarting a switch might cause an issue because of a software bug or some improper design choice made 20 years ago by a person who doesn't work there anymore. And since you were the one who restarted it, you're also the one who is responsible for fixing it.


10 hours ago, mr moose said:

where do we stop? 

 

For real though it comes down to "what's reasonable?"

If there's a relatively easy and inexpensive thing you can add to the current system that would massively improve reliability, it would be negligent and considered "cheaping out" not to have it. If it's already relatively reliable, you just got unlucky with a failure, and guarding against something that rare would cost a fortune, then at least you've done what's reasonable and you simply have to accept the failure.


1 hour ago, Ryan_Vickers said:

For real though it comes down to "what's reasonable?"

If there's a relatively easy and inexpensive thing you can add to the current system that would massively improve reliability, it would be negligent and considered "cheaping out" not to have it. If it's already relatively reliable, you just got unlucky with a failure, and guarding against something that rare would cost a fortune, then at least you've done what's reasonable and you simply have to accept the failure.

Or you can do a risk analysis. There are several models for it but the one I like to use is a very simple impact multiplied by probability.

 

How likely is it that something occurs? 1 to 5

 

How big of an impact does it have if it were to occur? 1 to 5

 

Take the first number and multiply it by the second to get an overall score of how much time and resources you should spend on preparing for it.

More advanced models map the different factors to dollars: how much would it cost if the core went down, how much would it cost to design in redundancy, and how likely is it to happen?

Whatever happened here was probably very likely, and it might have cost a lot to design around it, so it was ignored (assuming the issue was a lack of redundancy). Sometimes it's even worth eating failures like this because it still ends up being cheaper than designing around it.
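As a worked example of that model, here is a small Python sketch that computes the simple 1-to-5 score and the dollar-based comparison. All of the numbers are invented for illustration, not estimates for KLIA.

```python
# Simple risk scoring: probability x impact, plus a dollar-based comparison.
# All figures below are made-up examples.

def risk_score(probability, impact):
    """Both inputs on a 1-5 scale; a higher score means more worth preparing for."""
    assert 1 <= probability <= 5 and 1 <= impact <= 5
    return probability * impact

def worth_mitigating(annual_failure_probability, cost_of_failure, mitigation_cost_per_year):
    """Dollar version: mitigate if expected annual loss exceeds the yearly cost of redundancy."""
    expected_annual_loss = annual_failure_probability * cost_of_failure
    return expected_annual_loss > mitigation_cost_per_year

if __name__ == "__main__":
    print(risk_score(probability=2, impact=5))        # 10: unlikely but very painful
    # e.g. a 5% chance per year of a core failure costing $2M,
    # versus $50k/year to run a redundant core:
    print(worth_mitigating(0.05, 2_000_000, 50_000))  # True: redundancy pays for itself
```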

