
Hi All,

 

I've been happily running a Brocade ICX7250 48P-2X10G network switch for a few months now 24/7. Recently I observed my Proxmox cluster nodes regularly rebooting my VMs and I think I have traced the problem to the Brocade switch. It seems to reboot intermittently multiple times per day and I don't see an obvious pattern to it.

I forward syslog to a remote server, so those logs persist, whereas the output of the show logging command doesn't seem to survive a reboot. I don't see anything weird in there other than regular warnings along the lines of:

May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees,

I see this rise steadily to about 80C, then at some stage a reboot occurs, after which the temperature is closer to 60C and starts rising again.

I wonder then if these issues are actually thermal related, despite being below the shutdown temperature of 105C? I also recently discovered that one of my ports with an SFP+ to 10GBaseT adapter was not blinking its LEDs or providing a network link. On replacing the adapter, the switch quickly went into a boot loop with all amber LEDs. I left the system powered off overnight and the next day it powered on OK with the adapter in, except for this more intermittent boot looping.

To test the thermals I tried blocking the fans. The temperature reached around 90C according to the logs, but the high fan mode just kicked in, so it cooled down and did not reboot. So maybe under heavy load the switch could be overheating to the shutdown temperature so quickly that it has no time to send a remote syslog message (the highest I see in the logs in normal usage is around 80C before it drops lower, e.g. 50C on the next reboot)? But this seems unusual. Are there any other explanations/fixes people can think of?
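One way to check the thermal theory against the remote syslog is to pull out the temperature readings and list the last value seen before each sharp drop. This is only a rough sketch: it assumes every line looks like the sample warning above, that the year is 2025 (the log lines don't carry one), and that a fall of 10 degrees between consecutive readings marks a reboot. The function names and the threshold are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines like:
#   May 26 19:42:24:A:System: Stack unit 1 Temperature 67.0 C degrees,
TEMP_LINE = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}):\w:System: "
    r"Stack unit (?P<unit>\d+) Temperature (?P<temp>\d+(?:\.\d+)?) C"
)

def parse_temps(lines, year=2025):
    """Return (timestamp, temperature) pairs from switch syslog lines.

    The year is an assumption because the log format omits it.
    """
    readings = []
    for line in lines:
        m = TEMP_LINE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
            readings.append((ts, float(m["temp"])))
    return readings

def peaks_before_drops(readings, drop=10.0):
    """Return readings that are followed by a fall of at least `drop` deg C.

    The 10 degree default is an arbitrary guess at what a reboot looks like.
    """
    return [
        prev for prev, cur in zip(readings, readings[1:])
        if prev[1] - cur[1] >= drop
    ]
```

If the peaks it reports are all well under 105C, that would suggest the reboots are not ordinary thermal shutdowns, although readings lost just before a crash could still hide a spike.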

 

Am hoping someone can make sense of this/have seen something similar in their own switch, and suggest how I can fix this. Any advice would be very much appreciated at this stage as aside from this new issue I have been very happy with this device. Thanks for your help! 

 

 


Here is some further debug output: 

show version
Copyright (c) Ruckus Networks, Inc. All rights reserved.
UNIT 1: compiled on Aug 8 2023 at 23:06:54 labeled as SPR08095m
(33554432 bytes) from Primary SPR08095m.bin (UFI)
SW: Version 08.0.95mT213
Compressed Primary Boot Code size = 786944, Version:10.1.26T215 (spz10126)
Compiled on Tue Nov 29 23:13:15 2022

HW: Stackable ICX7250-48-HPOE
==========================================================================
UNIT 1: SL 1: ICX7250-48P POE 48-port Management Module
Serial #UK3845L1DZ
Software Package: ICX7250_L3_SOFT_PACKAGE (LID: fwmINJKnGfb)
Current License: l3-prem-8X10G
P-ASIC 0: type B344, rev 01 Chip BCM56344_A0
==========================================================================
UNIT 1: SL 2: ICX7250-SFP-Plus 8-port 80G Module
==========================================================================
1000 MHz ARM processor ARMv7 88 MHz bus
8 MB boot flash memory
2 GB code flash memory
2 GB DRAM
STACKID 1 system uptime is 3 hour(s) 44 minute(s) 17 second(s)
The system started at 19:38:19 CST Mon May 26 2025

The system : started=cold start
 

 


show chassis
The stack unit 1 chassis info:

Power supply 1 (AC - PoE) present, status ok
Power supply 2 not present
Power supply 3 not present

Fan 1 ok, speed (auto): [[1]]<->2
Fan 2 ok, speed (auto): [[1]]<->2
Fan 3 ok, speed (auto): [[1]]<->2

Fan controlled temperature:
Rule 1/2 (MGMT THERMAL PLANE): 91.3 deg-C
Rule 2/2 (AIR OUTLET NEAR PSU): 40.5 deg-C

Fan speed switching temperature thresholds:
Rule 1/2 (MGMT THERMAL PLANE):
Speed 1: NM<-----> 95 deg-C
Speed 2: 85<----->105 deg-C (shutdown)
Rule 2/2 (AIR OUTLET NEAR PSU):
Speed 1: NM<-----> 41 deg-C
Speed 2: 34<----->105 deg-C (shutdown)

Fan 1 Air Flow Direction: Front to Back
Fan 2 Air Flow Direction: Front to Back
Fan 3 Air Flow Direction: Front to Back
Slot 1 Current Temperature: 91.3 deg-C (Sensor 1), 40.5 deg-C (Sensor 2)
Slot 2 Current Temperature: NA
Warning level.......: 85.0 deg-C
Shutdown level......: 105.0 deg-C
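For anyone reading the threshold table, here is a small sketch of how I read Rule 1/2 (MGMT THERMAL PLANE). It assumes the ranges describe hysteresis: the fans stay at speed 1 until 95C, stay at speed 2 until the temperature falls below 85C, and the unit shuts down at 105C. That interpretation, the function name and the exact boundary comparisons are my assumptions, not something the output states.

```python
def next_fan_state(speed, temp_c, up_at=95.0, down_at=85.0, shutdown_at=105.0):
    """Return the next state for one sensor rule: 1, 2 or "shutdown".

    Defaults are the Rule 1/2 values from the show chassis output.
    Whether the real firmware uses >= or > at each boundary is a guess.
    """
    if temp_c >= shutdown_at:
        return "shutdown"
    if speed == 1 and temp_c >= up_at:
        return 2  # speed 1 band ends at 95C
    if speed == 2 and temp_c < down_at:
        return 1  # speed 2 band starts at 85C
    return speed

# 91.3C with the fans at speed 1 stays at speed 1 under this reading,
# which matches "speed (auto): [[1]]<->2" in the output above.
print(next_fan_state(1, 91.3))  # 1
```

Under this reading a sensor value of 91.3C at fan speed 1 is not a contradiction, it just has not crossed the 95C step-up point yet.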



 

https://linustechtips.com/topic/1613347-brocade-switch-intermittent-restarting/


does where you got it from offer a replacement?


Try updating to 08.0.95s or 09.0.10j_cd1 - the change logs include several instances of switches randomly crashing in the updates since 08.0.95m, although at a glance I didn’t see any related to temperature.

 

What modules are you using and how much power usage are they rated for? The temperature of the SFP+ area of the switch may be getting much hotter than the overall switch temperature reading indicates.


Thanks for the swift responses!

 

I got the switch second hand so can't get a replacement but yes, one option would be to buy a new one or a better one... Do you happen to have suggestions for a switch with similar features but ideally a quieter fan? Very happy with the current features but the fans are pretty intense... 

 

Thanks for the suggestion to update, I will try and work out how to do that and give it a go. Just weird that it was working fine for the last few months and now suddenly having issues! 

 

This is the module I am currently using from fs.com:
Brocade Compatible 10GBASE-T SFP+ Copper 30m RJ-45 Transceiver Module (LOS)

The previous module, a FlyproFiber transceiver, worked OK for months but then appeared to have died.

Here is the switch monitoring for the current one in case it gives clues:

show optic 1/2/8
Port   Temperature  Voltage       Tx Power       Rx Power       Tx Bias Current
+-----+------------+-------------+--------------+--------------+---------------+
1/2/8  42.0000 C    3.2500 volts  -004.1930 dBm  -002.9174 dBm  6.016 mA
       Normal       Normal        Normal         Normal         Normal

 

The only error I see in the latest switch startup logs is:
May 29 11:16:15:C:hmond[733]: Application logmgr.py failed recovery and functionality provided by it will not be available until failure reason is remedied (may require manual intervention)

But not sure what that means?

 

Here is the latest from show version which shows the reboot:
STACKID 1 system uptime is 26 minute(s) 7 second(s)
The system started at 11:14:23 CST Thu May 29 2025

The system : started=cold start

which says it's only been up about 30 mins despite no intervention from me. It is intermittent so hard to debug, but I notice the fans on high multiple times a day and expect this is when it is rebooting. It's also hard to tell since the logs don't seem to persist across reboots despite me forwarding syslog to a remote server (which does at least receive the thermal logs indicated previously).
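To line switch boots up against other logs, it can help to turn the uptime line into a duration and work out when the show version snapshot was taken. A minimal sketch, assuming the uptime line always uses the "N unit(s)" wording seen above (the "day(s)" unit is my guess, since it doesn't appear in either sample) and with made-up function names:

```python
import re
from datetime import datetime, timedelta

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}

def parse_uptime(line):
    """Turn '... 3 hour(s) 44 minute(s) 17 second(s)' into a timedelta."""
    total = 0
    for count, unit in re.findall(r"(\d+) (day|hour|minute|second)\(s\)", line):
        total += int(count) * UNIT_SECONDS[unit]
    return timedelta(seconds=total)

def snapshot_time(started, uptime_line):
    """Time the output was captured: reported start time plus uptime."""
    return started + parse_uptime(uptime_line)

# Values copied from the show version output above (switch clock, CST).
started = datetime(2025, 5, 29, 11, 14, 23)
line = "STACKID 1 system uptime is 26 minute(s) 7 second(s)"
print(snapshot_time(started, line))  # 2025-05-29 11:40:30
```

The start time is in the switch's own clock and time zone, so it only lines up with the Proxmox task times if both clocks agree.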

 

 

At this stage it is probably helpful to give more context about the other issues I am seeing regarding VMs rebooting on my Proxmox cluster that led me to find this switch issue. 

 

Here are the most recent Proxmox logs showing when the VM reboots happen (all status OK except for the one with the error indicated):

May 29th 10.15 AM: Node1: Bulk start VMs and Containers (this is the most recent reboot of VMs and coincides with the switch reboot shown above)
May 29th 10.14 AM: Node2: Bulk start VMs and Containers
May 29th 9.16 AM: Node1: Bulk start VMs and Containers
May 29th 9.16 AM: Node2: Bulk start VMs and Containers
May 29th 8.28 AM: Node1: Bulk start VMs and Containers
May 29th 8.25 AM: Node2: Bulk start VMs and Containers
May 29th 8.00 AM: Node1: Bulk start VMs and Containers
May 29th 8.00 AM: Node2: Bulk start VMs and Containers
May 29th 7.13 AM: Node1: Bulk start VMs and Containers
May 25th 9.53 AM: Node4: Bulk start VMs and Containers
May 25th 9.47 AM: Node3: Bulk start VMs and Containers
May 25th 9.47 AM: Node5: Bulk start VMs and Containers
May 24th 6.43 PM: Node4: Bulk start VMs and Containers: TASK ERROR: cluster not ready - no quorum?
May 24th 6.32 PM: Node2: Bulk start VMs and Containers
May 24th 6.30 PM: Node5: Bulk start VMs and Containers

 

These logs are surprising to me actually and perhaps indicate a different issue. I have noticed the switch seemingly rebooting more regularly over the last few days (based on checking uptime when I log in and regularly hearing the fans in high speed mode for short intervals, just like it does when booting). E.g. on May 26th the reported uptime of the switch was only 3 hours, suggesting it had rebooted, but there seems to be no coinciding Proxmox VM reboot.

I initially observed the Proxmox rebooting problem. This led me to notice the SFP+ module for the network link on Node4 was dead. I figured quorum was lost in my 6-node cluster since it wasn’t seen by the others. I keep one node cold offline, which I realise is problematic, and have recently removed its votes so now quorum is reached from 3/6 nodes. I have since replaced the SFP+ module as mentioned in my previous post and it has a network link again, aside from the reported problems.
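The quorum numbers can be sanity-checked with the usual majority rule (more than half of the configured votes), which as far as I know is what corosync votequorum uses by default. A tiny sketch with made-up function names:

```python
def quorum_needed(total_votes):
    """Smallest number of votes that is a strict majority."""
    return total_votes // 2 + 1

def has_quorum(online_votes, total_votes):
    """True if the online nodes hold a strict majority of all votes."""
    return online_votes >= quorum_needed(total_votes)

# 6 nodes with one vote each: 4 votes needed, so 3 online is not enough.
print(quorum_needed(6), has_quorum(3, 6))  # 4 False
# Cold node's vote removed: 5 votes in total, so 3 online is enough.
print(quorum_needed(5), has_quorum(3, 5))  # 3 True
```

This matches the description above: with the offline node's vote removed, 3 of the 6 nodes are enough, but before that change losing one more node alongside the cold one would leave the cluster one vote short.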

 

Then I observed in the other 4 nodes that their network links were going up and down (Dell Optiplex micro servers with same hardware) so I considered if their Realtek network drivers were failing. A bit weird for all of them to fail at once though. At this point I realised the switch was seemingly booting intermittently as reported and it makes more sense to me that the network itself is rebooting with it rather than each individual NIC on the nodes.

 

I should also note that I am running HA replication between 3 of the nodes for some VMs, and regular backup jobs to Proxmox Backup Server, which uses storage from an NFS share from one of the VMs on the node that originally had the failing SFP+ module (Node 4). Whilst probably complex/non-ideal, this system had been working fine for a few months until these recent issues, identified about a week ago.

 

I don’t see an immediate pattern in the Proxmox reboot logs above. It looks like each of the nodes is being rebooted at different times and again I did not expect to see that they had actually stayed up over the last few days…
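One way to look for a pattern is to compute the gap between consecutive bulk-start tasks on each node. The sketch below uses only the May 29th entries for Node1 and Node2 copied from the list above; the function name is made up, and it assumes the times are all on the same day and in the same time zone.

```python
from collections import defaultdict
from datetime import datetime

# (node, time) pairs copied from the May 29th entries of the task log.
EVENTS = [
    ("Node1", "10.15 AM"), ("Node2", "10.14 AM"),
    ("Node1", "9.16 AM"), ("Node2", "9.16 AM"),
    ("Node1", "8.28 AM"), ("Node2", "8.25 AM"),
    ("Node1", "8.00 AM"), ("Node2", "8.00 AM"),
    ("Node1", "7.13 AM"),
]

def gaps_by_node(events):
    """Return {node: [minutes between consecutive bulk starts]}."""
    per_node = defaultdict(list)
    for node, stamp in events:
        per_node[node].append(datetime.strptime(stamp, "%I.%M %p"))
    gaps = {}
    for node, times in per_node.items():
        times.sort()
        gaps[node] = [
            int((later - earlier).total_seconds() // 60)
            for earlier, later in zip(times, times[1:])
        ]
    return gaps

print(gaps_by_node(EVENTS))
# {'Node1': [47, 28, 48, 59], 'Node2': [25, 51, 58]}
```

On this small sample the gaps range from 25 to 59 minutes with no fixed period, and Node1 and Node2 restart within a few minutes of each other each time, which fits a shared cause better than independent node faults.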

 

On the other hand, Uptime Kuma over the last week shows the attached logs for Node1. This shows more yellow outages than the Proxmox logs report, so maybe they aren’t telling the full story? I am just reporting the ones shown in the Proxmox Web GUI…

 

 

Hopefully with this more detailed overview of my setup you can offer some insight into what I can do to debug this? Thanks again for your help!


Are you just doing replication or is HA actually enabled for VMs? If HA is enabled, then the switch reboots may be triggering it to take action, which includes rebooting a node that the others deem offline. If they are all connected only to this one switch then you really should be going from all-connected to all-isolated, but maybe when connectivity is restored one of them randomly takes longer to reconnect to the cluster and is voted out and commanded to reboot.


21 minutes ago, techfan84 said:

Brocade Compatible 10GBASE-T SFP+ Copper 30m RJ-45 Transceiver Module (LOS)

If you mean FS P/N SFP-10G-T-30 then this should be fine, as it is rated for 2.9W. I also looked at a few modules from FlyproFiber and they all listed 2.5W power usage - but if you have the specific model it would be good to verify. Considering the transceiver status output is reading 42C, the module power usage doesn’t seem to be an issue.

 

Aside from updating the switch firmware, I would start logging the CPU and memory usage via SNMP. Zabbix is a good app for that.


Sorry for being imprecise with my summary. 2 of my VMs have HA enabled between 3 cluster nodes. For 4 VMs I have replication enabled between those nodes. I have on the order of 10 other VMs that have neither, yet those VMs are still rebooting on the reported schedule, so it looks like the nodes themselves are being rebooted.

 

Yeah, it makes sense to me also that the switch reboots are the underlying issue that causes the cluster reboot issues. Just mentioned it to better explain the overall problem in case I am missing something.

 

Yes, the current module has part number SFP-10G-T-30. The previous module had part number SFP-10GT-BC-30M. Good to hear that it might be OK, and yes, I am confused as to why it seemed related to the problem. My observation was that when I put the new module in, the switch got stuck in a boot loop. When I removed it, it came good again. When I left the switch turned off overnight and put the module back in, the switch came online in the more intermittent regime that it's in now, but the module does give a 10G link to the connected Proxmox node. I was worried that some short circuit or power spike occurred when I first put in the module, but given it is not dead now I am not sure what is going on.

 

Do you happen to have a good guide for enabling that CPU and memory logging with this switch? That sounds like a good idea in general. I use Grafana Alloy/Loki/Mimir stack for my logging, but would need a way for the switch to export that info in a sensible way. 

 

I will also try removing the module, power cycling the whole setup, and seeing if it's more stable or not. 


2 hours ago, techfan84 said:

Do you happen to have a good guide for enabling that CPU and memory logging with this switch? That sounds like a good idea in general. I use Grafana Alloy/Loki/Mimir stack for my logging, but would need a way for the switch to export that info in a sensible way. 

I only know SNMP. The CPU and memory (and temperature and lots of other things) are available via SNMP on well-defined (industry-standard) OIDs. It's what enterprise monitoring has been built around for decades.
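As one concrete example of a standard OID, sysUpTime (1.3.6.1.2.1.1.3.0 in SNMPv2-MIB) counts hundredths of a second since the agent started, so a value that goes backwards between two polls points to a restart. Below is a minimal sketch of that check on already-collected samples. It does not talk to the switch, the function name and sample numbers are made up, and it assumes the SNMP agent restarts only when the switch itself reboots.

```python
def find_reboots(samples):
    """Estimate boot times from (poll_time_s, sysuptime_ticks) samples.

    `samples` must be ordered by poll time. sysUpTime is in hundredths of
    a second, so boot time is roughly poll time minus ticks / 100.
    """
    reboots = []
    for (_, up_prev), (t_cur, up_cur) in zip(samples, samples[1:]):
        if up_cur < up_prev:  # counter went backwards: agent restarted
            reboots.append(t_cur - up_cur / 100.0)
    return reboots

# Hypothetical polls every 60 s; the third shows only 30 s of uptime.
polls = [(0, 100000), (60, 106000), (120, 3000)]
print(find_reboots(polls))  # [90.0]
```

The counter also wraps after roughly 497 days, so a single backwards step on a long-running device is not always a reboot, though that is not a concern for a switch restarting several times a day.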


I removed the transceiver and have started monitoring temperature and CPU stats via SNMP over this time, thanks for the suggestion. I am still trying to work out if I am reading the CPU stats correctly and what the actual variable names should be... but the temperature curve looks believable at least:

[Image: temperature and CPU graph]

 

Zoomed view over latest outage:
[Image: zoomed temperature and CPU graph]

 

There look to be two outages/reboots over this time, noticeable as discrete dropouts in the temperature curve. The CPU curve may not be logging correctly but it does at least correlate with these events.

 

So even without the adapter we seem to have the problems. It's interesting to me that the two events occurred at similar times in the morning of each day. I am wondering if this is significant? If so, it is tricky for me to see what could be causing it. There are no obvious clues in that temperature sensor, at least to me. The power comes from a clean UPS and its logs show no irregularities in power supply. I wonder if network demand regularly spikes around those times due to e.g. some backup job, and that higher demand triggers a crash? Or perhaps my OPNsense router or some VM has the power to trigger such a reboot cycle of the switch, but I don't see how...

 

Do you have any further ideas or things to help with debugging? I guess one option would be to try upgrading the firmware but I am baffled as to why this would be needed. Maybe some damage has occurred from the adapter and now I have an intermittent hardware fault so need to replace the switch? Weird because it otherwise still works fine... 
