A Network Failure causes service disruption at KLIA

Yongtjunkit
21 minutes ago, LAwLz said:

Snip

That's more or less what I mean; I'm just speaking very generally about it, and this is more specific and quantitative.


3 hours ago, LAwLz said:

Sometimes it's even worth eating failures like this because it still ends up being cheaper than designing around it.

Crudely speaking, if you want HA, double the price of the solution; if you also want DR, double it again. For something as simple as a SQL server for a core business application, that's 4 servers for us: 2 on the primary site in an Active-Passive Failover Cluster, then another 2 at the DR site on standby, also in an Active-Passive Failover Cluster. That's not even taking into account the storage for it or the network service between the sites to handle the storage replication.

 

That is, of course, assuming that you have tested the DR procedure and know it'll actually work too. You could be wasting all the money on the DR site if it turns out you can't actually use it, or it would be quicker/easier to just fix the primary.

 

We have another consideration, and it's not an uncommon one: reputation damage. How much would it hurt our reputation if we were down for 2 or 3 days, and what would the ongoing impact to the business be?
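
To put rough numbers on the doubling rule, here is a minimal Python sketch. The function name and the per-server price are hypothetical and only there for illustration; storage and the inter-site network link are left out, as in the estimate above.

```python
def servers_needed(base_servers: int = 1, ha: bool = False, dr: bool = False) -> int:
    """Rule-of-thumb server count: HA doubles it, DR doubles it again."""
    count = base_servers
    if ha:
        count *= 2  # active + passive node in a failover cluster
    if dr:
        count *= 2  # mirror the whole setup at the DR site
    return count


# Hypothetical unit price, purely for illustration.
PRICE_PER_SERVER = 10_000

for label, ha, dr in [("single", False, False), ("HA", True, False), ("HA + DR", True, True)]:
    n = servers_needed(1, ha, dr)
    print(f"{label}: {n} server(s), ~${n * PRICE_PER_SERVER:,}")
```

With one base server, HA plus DR comes out at the 4 servers mentioned above.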


2 minutes ago, leadeater said:

Crudely speaking, if you want HA, double the price of the solution; if you also want DR, double it again. For something as simple as a SQL server for a core business application, that's 4 servers for us: 2 on the primary site in an Active-Passive Failover Cluster, then another 2 at the DR site on standby, also in an Active-Passive Failover Cluster. That's not even taking into account the storage for it or the network service between the sites to handle the storage replication.

 

That is, of course, assuming that you have tested the DR procedure and know it'll actually work too. You could be wasting all the money on the DR site if it turns out you can't actually use it, or it would be quicker/easier to just fix the primary.

 

We have another consideration, and it's not an uncommon one: reputation damage. How much would it hurt our reputation if we were down for 2 or 3 days, and what would the ongoing impact to the business be?

Another consideration is incompetent management that can't do these kinds of basic cost/benefit calculations to save their lives.  I have second-hand stories from a company that had better than 5 9s availability and decided to outsource, which resulted in them going down for 1+ days on a regular basis.  Not good when the systems involved handle as much as $6M of business per day.  They definitely did not save money on that change...
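
For a sense of scale, here is a rough Python sketch of what "5 9s" allows per year versus a one-day outage. The $6M/day figure is from the story above; treating a full day's business as lost during the outage is an assumption and a worst case, and the function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignoring leap years


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10 ** (-nines)
    return MINUTES_PER_YEAR * unavailability


def outage_cost(days_down: float, business_per_day: float) -> float:
    """Worst-case impact, assuming all daily business is lost while down."""
    return days_down * business_per_day


print(f"5 nines allows about {allowed_downtime_minutes(5):.2f} minutes of downtime per year")
print(f"A 1-day outage at $6M/day puts up to ${outage_cost(1, 6_000_000):,.0f} at risk")
```

Five nines works out to a little over five minutes of downtime a year, so a single day-long outage uses up centuries' worth of that budget.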


18 minutes ago, Ryan_Vickers said:

Another consideration is incompetent management that can't do these kinds of basic cost/benefit calculations to save their lives.  I have second-hand stories from a company that had better than 5 9s availability and decided to outsource, which resulted in them going down for 1+ days on a regular basis.  Not good when the systems involved handle as much as $6M of business per day.  They definitely did not save money on that change...

You can never outsource risk.

 

The problem is that one of the most common underlying reasons outsourcing happens is to shift the risk/responsibility, which is impossible to do. What frustrates me is how obvious that is if you actually think about it. It doesn't matter what kind of availability guarantees a provider offers you; it's still your risk that you accept by partnering with them, so what you are left with is all the risk and limited to no control over it.


22 hours ago, leadeater said:

I understood your point; however, where the failure would have had to occur means simple forgetfulness to ensure redundancy is off the cards. You don't just forget to do that, and you'd know it's like that every day. I agree with potential negligence, but that is different from saying networks sprawl, get complicated and are not well documented.

I'm only just learning networking and already the idea of redundancy and fault tolerance is being pushed heavily. It should be a big factor when designing a network, especially for a place like an airport.


Quote

There has been no official reason why the systems failed, but sources indicate that the core switch that links all the information systems at the KL International Airport (KLIA) broke down because it had passed its current lifespan.
“There were still intermittent disruptions up to midday yesterday, especially at the back-end even though the IT experts, including those from Cisco, were still trying to change the hardware and stabilise the network,” said an expert.

The source said the core switch is part of the Total Airport Management System, which is an integrated airport management system used to interface and integrate the majority of electronic information within the airport for services such as check-in, baggage, WiFi, flight information display systems, communications systems and several others.

He added that it was an old system that was due for an upgrade in 2012-2014, but was not approved despite several requests. A back-up is in place but could not withstand the overloading, adding to the chaos, the source said.


Read more at https://www.thestar.com.my/business/business-news/2019/08/27/airlines-to-seek-mahb-compensation-for-delays-losses#CpEUZLJAr4hRWEH2.99

Some updates.


On 8/25/2019 at 3:03 PM, akio123008 said:

That's true but it's not really what I meant; of course the implementation of redundancy itself in any network is rather straightforward, what I'm saying is that the overall complexity of very large networks like this can cause people to lose track of what's actually going on. Keeping track of all the hardware that's involved and how the redundancy is ensured can be a more difficult task than one might expect. That's not an excuse, but it is an explanation of why stuff like this happens. Of course a properly set up, designed and documented network shouldn't have this issue, but the world isn't quite perfect so shit happens.

Yeah, it's complicated when you look at it as a whole, but at the same time an outage of any kind for an airline or airport costs an absolutely crazy amount of money, so most airlines and airports spend a lot of money making sure that never happens. There is little excuse for this tbh.


On 8/26/2019 at 8:33 AM, mr moose said:

OK tech nerds, what if they had put in a redundant/fail-over switch, but the failure was in the bit that allows that to work and it stopped either from working?

Probably the more concerning part is that it took 4 days to fix. I would say it's, like you say, critical infrastructure.
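
The scenario in the quoted question can be sketched with basic availability arithmetic: two redundant switches in parallel, with the fail-over mechanism in series as a shared single point of failure. This is a simplified model that assumes independent failures, and every number in it is hypothetical.

```python
def parallel(*availabilities: float) -> float:
    """Availability of redundant components: down only if all are down."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


def series(*availabilities: float) -> float:
    """Availability of a chain: up only if every component is up."""
    up = 1.0
    for a in availabilities:
        up *= a
    return up


# Hypothetical numbers, for illustration only.
switch = 0.999      # each core switch on its own
failover = 0.9995   # the bit that detects a failure and switches over

redundant_pair = parallel(switch, switch)
whole_system = series(failover, redundant_pair)

print(f"single switch:        {switch:.6f}")
print(f"redundant pair:       {redundant_pair:.6f}")
print(f"pair behind failover: {whole_system:.6f}")
```

In this model the whole setup can never be more available than the fail-over mechanism itself, however many switches sit behind it.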


1 hour ago, Brooksie359 said:

Yeah, it's complicated when you look at it as a whole, but at the same time an outage of any kind for an airline or airport costs an absolutely crazy amount of money, so most airlines and airports spend a lot of money making sure that never happens. There is little excuse for this tbh.

It's one of those classic IT budget oversights. Everything works...until it doesn't. 


Let's be honest, the fact that so many of these mission-critical networks have little to no redundancy isn't even a surprise anymore. It's really just so stupid that they don't, because it isn't all that hard to do in the grand scheme of things. And, let's be honest, if something can go wrong, it eventually will!

 

Talking about networks, Folding@home is a network of volunteer computers that help with medical research. Get yourself signed up to LTT's official folding month: bragging rights to be gained, prizes to be won!

 

 


1 hour ago, ZacoAttaco said:

It's one of those classic IT budget oversights. Everything works...until it doesn't. 

Then it's IT's fault for not explaining properly to management that it was important 6 years ago.


3 hours ago, mr moose said:

Then it's IT's fault for not explaining properly to management that it was important 6 years ago.

I can tell that you do not work in IT.

Sometimes management doesn't want an explanation. Maybe there isn't room in the budget.


2 hours ago, LAwLz said:

I can tell that you do not work in IT.

Sometimes management doesn't want an explanation. Maybe there isn't room in the budget.

"I don't like the way this report sounds. Could you re-word it in such a way that it is not so alarming? Also, it would be best if the summary of the report aligns with my vision of how things should be."


2 hours ago, LAwLz said:

I can tell that you do not work in IT.

Sometimes management doesn't want an explanation. Maybe there isn't room in the budget.

My experience with management would indicate exactly that: they never have enough room in the budget for explanations, let alone enough time to meet with anyone to discuss spending money.

 

 


On 8/25/2019 at 6:49 AM, rockking1379 said:

Like all electronics, they eventually fail. It does suck it happened. Now my question is: why wasn't there a failover switch?

Because IT infrastructure is usually seen as a waste of money.

 

That is, until something breaks and a company loses millions of dollars in operational hours. And even that gets blamed on IT people instead of the jackass who denied the upgrade or additional redundant hardware install.


21 hours ago, Trik'Stari said:

Because IT infrastructure is usually seen as a waste of money.

 

That is, until something breaks and a company loses millions of dollars in operational hours. And even that gets blamed on IT people instead of the jackass who denied the upgrade or additional redundant hardware install.

Not just infrastructure but often the actual people doing the job as well.  Turns out there actually is value in knowledge, experience, attention to detail, etc. and farming the operation out to the lowest bidder with none of those attributes isn't a good move in the long run... who would have thought  xD  (again not saying that's what happened here, just my stories)


2 hours ago, Ryan_Vickers said:

Not just infrastructure but often the actual people doing the job as well.  Turns out there actually is value in knowledge, experience, attention to detail, etc. and farming the operation out to the lowest bidder with none of those attributes isn't a good move in the long run... who would have thought  xD  (again not saying that's what happened here, just my stories)

I've been dealing with that all summer.

 

Not even just lacking people with IT experience, but having people who can't wrap their heads around stacking boxes in reverse numerical order.
