
A network failure causes service disruption at KLIA

Yongtjunkit

A network failure causes system/service disruption at KLIA 

 

Quote

Sepang (Bernama): Malaysia Airports has identified a network failure as the cause of the disruption at the Kuala Lumpur International Airport (KLIA).

Quote

Sepang (Bernama): Operations at the Kuala Lumpur International Airport (KLIA) returned to normal on Sunday (Aug 25) after it was hit by a network equipment failure on Wednesday (Aug 21) night, which caused long queues and flight delays.

Quote

SEPANG (Bernama): The situation at the KL International Airport (KLIA) was under control on Monday (Aug 26) morning even though the Flight Information Display System (FIDS) at the Departure Hall was not functioning.
 

Quote

Checks by Bernama at noon Monday showed that the FIDS had partially recovered.

Quote

There has been no official reason why the systems failed, but sources indicate that the core switch that links all the information systems at the KL International Airport (KLIA) broke down because it had exceeded its expected lifespan.

"There were still intermittent disruptions up to midday yesterday, especially at the back-end, even though the IT experts, including those from Cisco, were still trying to change the hardware and stabilise the network," said an expert.

The source said the core switch is part of the Total Airport Management System, which is an integrated airport management system used to interface and integrate the majority of electronic information within the airport for services such as check-in, baggage, WiFi, flight information display systems, communications systems and several others.

He added that it was an old system that was due for an upgrade in 2012-2014, but was not approved despite several requests. A back-up is in place but could not withstand the overloading, adding to the chaos, the source said.

Sources:

1. https://www.thestar.com.my/news/nation/2019/08/25/operations-at-klia-back-to-normal-four-days-after-disruption

2. https://www.thestar.com.my/news/nation/2019/08/23/m039sia-airports-disruption-at-KLIA-due-to-network-failure-equipment-replaced

3. https://www.thestar.com.my/news/nation/2019/08/26/situation-in-klia-under-control-despite-non-functional-fids-at-departure-hall

4. https://www.thestar.com.my/business/business-news/2019/08/27/airlines-to-seek-mahb-compensation-for-delays-losses

 

Airport mission-critical systems, such as the flight information display system and airport WiFi, were taken down by a network failure, causing long queues and flight delays for four days.


27 minutes ago, Yongtjunkit said:

A failed network switch causes system/service disruption at KLIA 

 

 

Sources:

1. https://www.thestar.com.my/news/nation/2019/08/25/operations-at-klia-back-to-normal-four-days-after-disruption

2. https://www.thestar.com.my/news/nation/2019/08/23/m039sia-airports-disruption-at-KLIA-due-to-network-failure-equipment-replaced

 

Airport mission-critical systems, such as the flight information display system and airport WiFi, were taken down by a failed network switch, causing long queues and flight delays.

Like all electronics, they eventually fail. It does suck that it happened. Now my question is: why wasn't there a failover switch?


16 minutes ago, rockking1379 said:

Like all electronics, they eventually fail. It does suck that it happened. Now my question is: why wasn't there a failover switch?

The problem is that when a system is up and running, no one notices vulnerabilities such as these. In very complicated setups like these it's rather easy to miss something. The networking setup is often enormous and grows over time, getting updates here and there. There's a tendency to just go "well, it works, so no one touch it". Perhaps there were plenty of failovers and backups in place elsewhere, but they just happened not to consider this one switch.


I wonder how long this system has been running? Airport computers tend to be upgraded slowly.


48 minutes ago, akio123008 said:

The problem is that when a system is up and running, no one notices vulnerabilities such as these. In very complicated setups like these it's rather easy to miss something. The networking setup is often enormous and grows over time, getting updates here and there. There's a tendency to just go "well, it works, so no one touch it". Perhaps there were plenty of failovers and backups in place elsewhere, but they just happened not to consider this one switch.

It really isn't that complicated to use a stacking switch, or a pair that isn't stacked but is configured correctly. Spine-and-leaf topologies have existed for decades and redundancy is extremely simple to implement. Given the scale of impact from the outage, the equipment failure has to have been at the core/distribution layer of the network; there's zero excuse not to have redundancy there.

 

Equipment failures aren't that uncommon; sadly, it's also not uncommon to have no redundant network paths, or to think you do when you actually don't.


7 hours ago, leadeater said:

It really isn't that complicated to use a stacking switch, or a pair that isn't stacked but is configured correctly. Spine-and-leaf topologies have existed for decades and redundancy is extremely simple to implement. Given the scale of impact from the outage, the equipment failure has to have been at the core/distribution layer of the network; there's zero excuse not to have redundancy there.

That's true but it's not really what I meant; of course the implementation of redundancy itself in any network is rather straightforward, what I'm saying is that the overall complexity of very large networks like this can cause people to lose track of what's actually going on. Keeping track of all the hardware that's involved and how the redundancy is ensured can be a more difficult task than one might expect. That's not an excuse, but it is an explanation of why stuff like this happens. Of course a properly set up, designed and documented network shouldn't have this issue, but the world isn't quite perfect so shit happens.


20 minutes ago, akio123008 said:

That's true but it's not really what I meant; of course the implementation of redundancy itself in any network is rather straightforward, what I'm saying is that the overall complexity of very large networks like this can cause people to lose track of what's actually going on. Keeping track of all the hardware that's involved and how the redundancy is ensured can be a more difficult task than one might expect. That's not an excuse, but it is an explanation of why stuff like this happens. Of course a properly set up, designed and documented network shouldn't have this issue, but the world isn't quite perfect so shit happens.

It really isn't that hard; it's literally as simple as two fibre runs to each network cabinet that contains two switches and is fed from a distribution layer with redundant paths. As long as you follow correct standards for anything important, you'll never have the problem.

 

Having a few non-important single switches feeding a couple of things, yeah, that can happen, but this wasn't it. It's not like only a couple of APs or a few terminals went down; it was pretty much everything, meaning it was in a network segment that was important and should have had redundancy. Being where it would have been, it's not something you'd forget to do or lose track of.
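To make that concrete, here's a minimal Python sketch of the kind of check that design implies: losing any one switch shouldn't cut a critical endpoint off from the core. The topology and node names below are entirely invented for illustration.

```python
# Toy single-point-of-failure check over an invented switch topology.
# Node names and links are made up for illustration only.
from collections import defaultdict, deque

LINKS = [
    ("core-1", "core-2"),
    ("core-1", "dist-1"), ("core-2", "dist-1"),
    ("core-1", "dist-2"), ("core-2", "dist-2"),
    ("dist-1", "cab-A-sw1"), ("dist-2", "cab-A-sw1"),
    ("dist-1", "cab-A-sw2"), ("dist-2", "cab-A-sw2"),
    ("cab-A-sw1", "check-in-host"),   # host is only connected to one switch!
]

def reachable(links, start, failed=None):
    """Return all nodes reachable from `start` when the `failed` node is removed."""
    graph = defaultdict(set)
    for a, b in links:
        if failed not in (a, b):
            graph[a].add(b)
            graph[b].add(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen

def single_points_of_failure(links, endpoint, core="core-1"):
    """Switches whose failure cuts `endpoint` off from the core."""
    switches = {n for link in links for n in link} - {endpoint, core}
    return [sw for sw in switches
            if endpoint not in reachable(links, core, failed=sw)]

print(single_points_of_failure(LINKS, "check-in-host"))  # ['cab-A-sw1']
```

In this made-up topology only the single-homed access switch shows up, which is exactly the kind of gap dual connections are meant to eliminate.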


Is it normal for everything to be running on the same network switch?

I get that you can segregate the networks on the switch, but I would have thought that the check-in computers and baggage handling would run on a completely different network from the public WiFi, for security.


1 minute ago, yolosnail said:

Is it normal for everything to be running on the same network switch?

I get that you can segregate the networks on the switch, but I would have thought that the check-in computers and baggage handling would run on a completely different network from the public WiFi, for security.

It's normal to use the same equipment, just not to have single points of failure. Enterprise APs can put SSIDs on segregated VLANs and IP ranges, and those APs will tunnel the traffic back to the wireless controller over secure VLANs at the network switch layer. Virtual network segments on top of physical networking.
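As a purely illustrative sketch of that idea (the SSID names, VLAN numbers and frame format below are invented), traffic from different SSIDs can share the same physical switch while staying separated by VLAN tags:

```python
# Toy model of SSID-to-VLAN segregation on shared hardware.
# SSIDs, VLAN IDs and the "frame" format are illustrative only.

SSID_TO_VLAN = {
    "Public-WiFi": 100,   # guest traffic
    "Operations": 200,    # check-in / baggage systems
}

def tag_frame(ssid, payload):
    """Attach a VLAN tag to a frame based on the SSID it arrived on."""
    return {"vlan": SSID_TO_VLAN[ssid], "payload": payload}

def forward(frame, port_vlans):
    """Return the switch ports allowed to carry this frame's VLAN."""
    return [port for port, vlans in port_vlans.items() if frame["vlan"] in vlans]

if __name__ == "__main__":
    # Port 1 trunks both VLANs back to the controller; port 2 carries guest only.
    ports = {1: {100, 200}, 2: {100}}
    guest = tag_frame("Public-WiFi", b"http request")
    ops = tag_frame("Operations", b"check-in record")
    print(forward(guest, ports))  # [1, 2]
    print(forward(ops, ports))    # [1] - operations traffic never reaches the guest-only port
```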


OK tech nerds, what if they had put in a redundant/failover switch, but the failure was in the bit that allows that to work, and it stopped either one from working?


Just now, mr moose said:

OK tech nerds, what if they had put in a redundant/failover switch, but the failure was in the bit that allows that to work, and it stopped either one from working?

Then maybe they should have made that bit redundant as well!


4 minutes ago, yolosnail said:

Then maybe they should have made that bit redundant as well!

where do we stop? 


1 minute ago, mr moose said:

where do we stop? 

Usually that's when management says it costs too much or they want to find a way to cut back on costs, security and redundancy are some of the first things to get put on the chopping block.


14 minutes ago, leadeater said:

It's normal to use the same equipment, just not to have single points of failure. Enterprise APs can put SSIDs on segregated VLANs and IP ranges, and those APs will tunnel the traffic back to the wireless controller over secure VLANs at the network switch layer. Virtual network segments on top of physical networking.

The APs aren't what worries me, it's the mission-critical systems that didn't have redundant connections to two separate switches, or at least some sort of additional redundancy. A switch stack won't really save you if all your APs are sitting on that stack, or if you're an end host with a single port and that switch goes bad; then it goes down. You should at least have two connections to different switches for anything mission-critical.


1 hour ago, leadeater said:

It really isn't that hard; it's literally as simple as two fibre runs to each network cabinet that contains two switches and is fed from a distribution layer with redundant paths. As long as you follow correct standards for anything important, you'll never have the problem.

Again, it's not really "hard" it's just that people are negligent and networking systems can be a mess. Just the fact that a network is deployed in a big facility doesn't indicate it's been set up correctly.  Especially as networks don't just get set up once and then stay online, but rather get modified all the time (by different people); the odds of something being set up incorrectly at some point aren't very small.

 

It's easy to just say "oh, if you just do it this way and follow the correct standards it always works", but if it were like that in the real world, this kind of stuff wouldn't happen. This actually goes for most technical problems: they aren't very hard to deal with, and in many cases could easily have been avoided in hindsight. It's just stupid tiny (human) errors that cause the problem.

 

1 hour ago, leadeater said:

it's not something you'd forget to do or lose track of. 

If that were the case, the problem wouldn't have existed, so apparently it is something you can lose track of.

 

I don't disagree with you in any way; redundancy could and should have been implemented here, and shame on the network engineers who didn't. I was just trying to point out that this kind of stupid stuff happens all too often because even high end large networks can be a mess when people don't do their job properly.

 

"Never attribute to malice what can be adequately explained by stupidity" is a rule that I like to follow, so I am ignoring the possibility that someone knew the problem existed and didn't bother fixing it. (or worse actively caused the problem to disrupt certain things)


2 hours ago, Lurick said:

The APs aren't what worries me, it's the mission-critical systems that didn't have redundant connections to two separate switches, or at least some sort of additional redundancy. A switch stack won't really save you if all your APs are sitting on that stack, or if you're an end host with a single port and that switch goes bad; then it goes down. You should at least have two connections to different switches for anything mission-critical.

That stacking issue is why we don't use any stacking technologies in our DCs at all; we just multi-path everything. A lot easier to do that in a DC where every endpoint device is going to have dual NICs, though. It does mean we can't use switch-assisted LACP, but ESXi can do a good enough job at balancing traffic per VM, so we get some amount of active-active capability. Storage just has multipathing capabilities at its core nowadays, so that's not a problem at all.


1 hour ago, akio123008 said:

Again, it's not really "hard" it's just that people are negligent and networking systems can be a mess. Just the fact that a network is deployed in a big facility doesn't indicate it's been set up correctly.  Especially as networks don't just get set up once and then stay online, but rather get modified all the time (by different people); the odds of something being set up incorrectly at some point aren't very small.

I understood your point; however, where the failure would have had to occur means simple forgetfulness about ensuring redundancy is off the cards. You don't just forget to do that, and you'd know it's like that every day. I agree with potential negligence, but that is different from saying networks sprawl, get complicated and end up not well documented.

 

Had it been just a switch out on the edge providing connectivity to 10-40 devices, then sure, it's totally reasonable to forget to go back and add a second switch and distribute the clients across the two, or activate that second fibre link, or a number of other things, but the problem is that it wasn't that.

 

I actually wouldn't be surprised at all if the failure wasn't actually a network switch and was instead the storage array that underpins all the servers; that's more likely to have a fault that would take everything down. As far as most people care, the network is down and that's what gets reported, but it doesn't actually have to mean it was a network fault at all.


I don't think we have enough information to actually say what happened here or how it could be avoided. 

What segment failed?

Why did it fail? 

Why wasn't there any redundancy? 

 

So far all we know is that it wasn't because of a malicious attack or vulnerability. 

 

We also know that the things which stopped working were: WiFi, flight info displays, check-in counters and baggage handling systems. So it sounds like it was something major that went down. But it doesn't sound like everything went down. 

I'm not even entirely convinced that it was a switch or router. It just says "network equipment" but who knows what that might mean.

 

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

 

Let's not point fingers and play armchair experts when we have next to no info to go on. 


49 minutes ago, LAwLz said:

I don't think we have enough information to actually say what happened here or how it could be avoided. 

What segment failed?

Why did it fail? 

Why wasn't there any redundancy? 

 

So far all we know is that it wasn't because of a malicious attack or vulnerability. 

 

We also know that the things which stopped working were: WiFi, flight info displays, check-in counters and baggage handling systems. So it sounds like it was something major that went down. But it doesn't sound like everything went down. 

I'm not even entirely convinced that it was a switch or router. It just says "network equipment" but who knows what that might mean.

 

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

 

Let's not point fingers and play armchair experts when we have next to no info to go on. 

Oops, my bad 

 

just updated the topic

Thanks for pointing it out


3 hours ago, Lurick said:

Usually that's when management says it costs too much or they want to find a way to cut back on costs, security and redundancy are some of the first things to get put on the chopping block.

Then usually next on the chopping block is your job when that system fails because "you should have been able to keep the system working correctly, that's your job right?" xD

 

Moral of the story is that you can never appease the bean counters...


1 hour ago, LAwLz said:

There might also have been redundancy in place but it didn't work for some reason, for example a bug in some equipment's OS. 

Configuration errors could also cause this. Someone created a switching loop in our office by plugging an unmanaged switch into itself. A configuration error by the networks team meant that the multiple routines used to detect loops and disable ports cancelled each other out. That switching loop brought down the entire building for most of the day... oops. 
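Conceptually, a switching loop is just a cycle in the graph of links between switches, which is what loop-detection routines are there to catch and break. A rough Python sketch of that underlying idea, with an invented topology (this is not how spanning tree is actually implemented, just the cycle check it exists to perform):

```python
# Minimal cycle check over an undirected "switch topology" graph.
# Conceptual sketch only; node names are made up for illustration.

def has_loop(links):
    """Union-find: adding a link between two already-connected switches creates a loop."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra == rb:
            return True   # a and b were already connected: this link closes a loop
        parent[ra] = rb
    return False

# The unmanaged switch plugged into itself is the degenerate case:
print(has_loop([("core", "sw1"),
                ("sw1", "office-switch"),
                ("office-switch", "office-switch")]))  # True
```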


Another thing to keep in mind is that this is an airport. It needs to be working 24/7, 365 days a year without hiccups.

You do not want to touch those things if they are working. Even installing a quite simple update might require a reboot, which might cause things to stop working momentarily in the best case scenario, and it might reveal/cause a lot of issues in the worst case scenario.

 

It's not fun justifying changes to a fully functional network just because "we need to fix an issue that might cause problems sometime in the future". The risk when touching these networks can often be larger than the risk of just letting things be the way they are. Being the person in charge of making the changes is stressful as hell. I mean, simply restarting a switch might cause an issue because of a software bug or some improper design choice made 20 years ago by a person who doesn't work there anymore. And since you were the one who restarted it, you're also the one who is responsible for fixing it.


10 hours ago, mr moose said:

where do we stop? 

 

For real though it comes down to "what's reasonable?"

If there's a relatively easy and inexpensive thing you can add to the current system that would massively improve reliability, it would be negligent and considered "cheaping out" not to have it. If it's already relatively reliable, you just got unlucky with a failure, and guarding against something that rare would cost a fortune, then at least you've done what's reasonable and you simply have to accept the failure.


1 hour ago, Ryan_Vickers said:

For real though it comes down to "what's reasonable?"

If there's a relatively easy and inexpensive thing you can add to the current system that would massively improve reliability, it would be negligent and considered "cheaping out" not to have it. If it's already relatively reliable, you just got unlucky with a failure, and guarding against something that rare would cost a fortune, then at least you've done what's reasonable and you simply have to accept the failure.

Or you can do a risk analysis. There are several models for it but the one I like to use is a very simple impact multiplied by probability.

 

How likely is it that something occurs? 1 to 5

 

How big of an impact does it have if it were to occur? 1 to 5

 

Take the first number and multiply it by the second to get an overall score of how much time and resources you should spend on preparing for it.

More advanced models map the different factors to dollars: how much would it cost if the core went down, how much would it cost to design in redundancy, and how likely is it to happen?

Whatever happened here was probably very likely, and it might have cost a lot to design around it, so it was ignored (assuming the issue was a lack of redundancy). Sometimes it's even worth eating failures like this because it still ends up being cheaper than designing around it.
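As a worked example of that model, here is a small Python sketch that computes the simple 1-to-5 score and the dollar-based comparison. All of the numbers are invented for illustration, not estimates for KLIA.

```python
# Simple risk scoring: probability x impact, plus a dollar-based comparison.
# All figures below are made-up examples.

def risk_score(probability, impact):
    """Both inputs on a 1-5 scale; a higher score means more worth preparing for."""
    assert 1 <= probability <= 5 and 1 <= impact <= 5
    return probability * impact

def worth_mitigating(annual_failure_probability, cost_of_failure, mitigation_cost_per_year):
    """Dollar version: mitigate if expected annual loss exceeds the yearly cost of redundancy."""
    expected_annual_loss = annual_failure_probability * cost_of_failure
    return expected_annual_loss > mitigation_cost_per_year

if __name__ == "__main__":
    print(risk_score(probability=2, impact=5))        # 10: unlikely but very painful
    # e.g. a 5% chance per year of a core failure costing $2M,
    # versus $50k/year to run a redundant core:
    print(worth_mitigating(0.05, 2_000_000, 50_000))  # True: redundancy pays for itself
```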

