
Server Alarms and Shuts Down (Not Heat)

Hey guys, I've got a Tyan server board running a pair of E5-2630s, nothing fancy. It POSTs, and after much struggling with the BIOS I've got Windows Server 2016 installed and booting. I've run into a weird problem though: after it's been running for a while, whether I'm working on it or just leaving it idle, it gives me an alert tone (the same one I've heard for an overheat alert; I tested it), Windows goes into emergency shutdown mode, and the whole thing powers off.

 

Not sure what other than heat can cause that...but I'd really like to make it stop, so I can actually use the server. Anyone got any suggestions? Thanks!

 

Edit: Added the event logs, metadata is included in case it's needed for viewing.

eventlog.rar

 

Edit: Images as requested.

 

0099 - Overhead view of the SAS and SATA chipsets.

0104 - Think that's the north bridge, two exhaust fans right above here

0100 - Near PSU, two exhaust fans right above here

0101 - Drive area, there's three fans in the front of the case that pull in air and it passes the currently 4 HDDs, the CPU coolers push air towards the back of the case and then out

0102 - Overview

0103 - Close up of SAS area

And BIOS images

0105 to 0112 - Various apparent errors in some BIOS logs. The upper right has some obscure information that I'm sure could help us solve this, but I do not understand it.

0115 - BIOS temps (not CPU) after letting it run for around an hour. The PCH and SAS temperatures climbed to those levels and then seemingly stayed stable. That seems a bit warm for those chipsets, but they are passively cooled. This section of the BIOS lists alert temperatures, but only for the CPUs, not for any other component.

 

IMG_0104.JPG

IMG_0099.JPG

IMG_0100.JPG

IMG_0101.JPG

IMG_0102.JPG

IMG_0103.JPG

IMG_0115.JPG

IMG_0105.JPG

IMG_0106.JPG

IMG_0107.JPG

IMG_0108.JPG

IMG_0109.JPG

IMG_0110.JPG

IMG_0111.JPG

IMG_0112.JPG


Check the event logs and see what the system says is happening. Windows documents a lot of stuff regarding failures/errors, and it can be accessed easily with Event Viewer in the Administrative Tools folder; it's the first thing you should check when the system is acting like this. What is probably happening is that Windows is detecting that the hardware is getting damaged in some way (hence the forced shutdown) and is shutting itself down to prevent further damage. It's not so severe that the system BSODs or turns off without warning, but severe enough that Windows shuts down to prevent damage.

Your best bet is to boot the OS in another system to access the logs so that the hardware doesn't get damaged.

Once you get the logs, have a good look. If they don't make sense to you (I don't blame you, they're quite cryptic and hard to understand), upload them here and people on the forums might be able to help you understand what is going on. If you need help finding the relevant entries, sort the log with errors first and find the most recent one.

If you still are unable to find it, click Action -> Save All Events As... on the toolbar, save them to a USB drive and upload them here, and we can manually have a look at what the hell is going on.

*Insert Witty Signature here*

System Config: https://au.pcpartpicker.com/list/Tncs9N

 


2 minutes ago, Salv8 (sam) said:

Check the event logs and see what the system says is happening. Windows documents a lot of stuff regarding failures/errors, and it can be accessed easily with Event Viewer in the Administrative Tools folder; it's the first thing you should check when the system is acting like this. What is probably happening is that Windows is detecting that the hardware is getting damaged in some way (hence the forced shutdown) and is shutting itself down to prevent further damage. It's not so severe that the system BSODs or turns off without warning, but severe enough that Windows shuts down to prevent damage.

Your best bet is to boot the OS in another system to access the logs so that the hardware doesn't get damaged.

Once you get the logs, have a good look. If they don't make sense to you (I don't blame you, they're quite cryptic and hard to understand), upload them here and people on the forums might be able to help you understand what is going on. If you need help finding the relevant entries, sort the log with errors first and find the most recent one.

If you still are unable to find it, click Action -> Save All Events As... on the toolbar, save them to a USB drive and upload them here, and we can manually have a look at what the hell is going on.

 

I did take a cursory look, and there are some random warnings that seem meaningless. The only error indicated is DistributedCOM 10016, which happens before every one, then there's a Kernel-Power event which is just the shutdown. No BSODs. No temperature warnings in the BIOS logs or the system logs.

 

I'll pull the logs off the system and post them here shortly, I'll edit the original post so they're easy to find.


Note: I added the server's event logs, hopefully someone can demystify them for me enough to fix this!


The process C:\Windows\System32\RuntimeBroker.exe (UMBRA) has initiated the power off of computer UMBRA on behalf of user UMBRA\Administrator for the following reason: Other (Planned)
 Reason Code: 0x85000000
 Shutdown Type: power off
 Comment: 

 

There are a few events similar to the one above. Because they're all different, it looks like some sort of hardware failure.
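If anyone wants to compare these 1074 entries side by side, here is a rough Python sketch that pulls the fields out of the message text. The regex layout is an assumption inferred from the single sample quoted above (other Windows versions or locales may word the message differently), and `parse_shutdown_event` is just an illustrative name. The only decoding it attempts is the "planned" bit, which is `SHTDN_REASON_FLAG_PLANNED` (0x80000000) in the Windows shutdown reason codes.

```python
import re

# Sample text copied from the Event ID 1074 entry quoted above.
SAMPLE = r"""The process C:\Windows\System32\RuntimeBroker.exe (UMBRA) has initiated the power off of computer UMBRA on behalf of user UMBRA\Administrator for the following reason: Other (Planned)
 Reason Code: 0x85000000
 Shutdown Type: power off
 Comment: """

# Assumed layout: inferred from the one sample above, not from any spec.
EVENT_1074 = re.compile(
    r"The process (?P<process>.+?) \((?P<host>[^)]+)\) has initiated the "
    r"(?P<action>.+?) of computer (?P<computer>\S+) on behalf of user "
    r"(?P<user>.+?) for the following reason: (?P<reason>.+?)\s+"
    r"Reason Code: (?P<code>0x[0-9A-Fa-f]+)\s+"
    r"Shutdown Type: (?P<shutdown_type>[^\r\n]+)",
    re.DOTALL,
)

# SHTDN_REASON_FLAG_PLANNED from the Windows shutdown reason codes.
PLANNED_FLAG = 0x80000000


def parse_shutdown_event(message):
    """Return the fields of a 1074 message as a dict, or None if no match."""
    match = EVENT_1074.search(message)
    if match is None:
        return None
    fields = {name: value.strip() for name, value in match.groupdict().items()}
    # Only the "planned" bit is decoded; the rest of the code is left raw.
    fields["planned"] = bool(int(fields["code"], 16) & PLANNED_FLAG)
    return fields


if __name__ == "__main__":
    for name, value in parse_shutdown_event(SAMPLE).items():
        print(f"{name}: {value}")
```

Running each saved 1074 message through this makes it easier to see which process and reason code differ between shutdowns.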

What is the history of this server?

Have you had it "as-is" for a while?

Is it an ex-lease server, or custom built with parts?

Have you done any upgrades recently? Hardware, BIOS, OS updates, etc.?

I suspect it's something hardware. Does the server have any sort of management logs, like HP's iLO or Dell's iDRAC?

You haven't had any brownouts when it's triggered (lights flickering/dimming, etc.)? Is the server on a UPS?

Spoiler

Desktop: Ryzen9 5950X | ASUS ROG Crosshair VIII Hero (Wifi) | EVGA RTX 3080Ti FTW3 | 32GB (2x16GB) Corsair Dominator Platinum RGB Pro 3600Mhz | EKWB EK-AIO 360D-RGB | EKWB EK-Vardar RGB Fans | 1TB Samsung 980 Pro, 4TB Samsung 980 Pro | Corsair 5000D Airflow | Corsair HX850 Platinum PSU | Asus ROG 42" OLED PG42UQ + LG 32" 32GK850G Monitor | Roccat Vulcan TKL Pro Keyboard | Logitech G Pro X Superlight  | MicroLab Solo 7C Speakers | Audio-Technica ATH-M50xBT2 LE Headphones | TC-Helicon GoXLR | Audio-Technica AT2035 | LTT Desk Mat | XBOX-X Controller | Windows 11 Pro

 

Spoiler

Server: Fractal Design Define R6 | Ryzen 3950x | ASRock X570 Taichi | EVGA GTX1070 FTW | 64GB (4x16GB) Corsair Vengeance LPX 3000Mhz | Corsair RM850v2 PSU | Fractal S36 Triple AIO + 4 Additional Venturi 120mm Fans | 14 x 20TB Seagate Exos X22 20TB | 500GB Aorus Gen4 NVMe | 2 x 2TB Samsung 970 Evo Plus NVMe | LSI 9211-8i HBA

 


I have googled this and it gets stranger.

This person has got the same issue:

https://community.spiceworks.com/topic/612854-event-1074-mysterious-shutdown

but it's being caused by winlogon.exe, not RuntimeBroker.exe. The dude brought it to Dell and Dell didn't know what was going on either; he checked the PSUs (both of them) and wasn't able to find anything.

Linux might give more info on what is going on. Back up the Windows OS and try installing Linux, then see how it deals with our haunted hardware and what it does when the error happens.

Then let's check the logs to see exactly what happened. If we can't get anything, we have to open the server and inspect it carefully for any type of damage anywhere.


26 minutes ago, Jarsky said:

The process C:\Windows\System32\RuntimeBroker.exe (UMBRA) has initiated the power off of computer UMBRA on behalf of user UMBRA\Administrator for the following reason: Other (Planned)
 Reason Code: 0x85000000
 Shutdown Type: power off
 Comment: 

 

There are a few events similar to the one above. Because they're all different, it looks like some sort of hardware failure.

What is the history of this server?

Have you had it "as-is" for a while?

Is it an ex-lease server, or custom built with parts?

Have you done any upgrades recently? Hardware, BIOS, OS updates, etc.?

I suspect it's something hardware. Does the server have any sort of management logs, like HP's iLO or Dell's iDRAC?

You haven't had any brownouts when it's triggered (lights flickering/dimming, etc.)? Is the server on a UPS?

 

Currently it is not on a UPS, I was wondering if dirty power might be the cause and was intending to hook it up to one tomorrow and see if that solves the issue. In the meantime it's unplugged, and when it was running it was on a surge protector.

 

It's a custom build: a Tyan S7067 board with a pair of used but, in theory, tested E5-2630v2s. RAM is supposedly also tested DIMMs from a datacenter. No upgrades, just got it running today in fact. The only management logs are the BIOS server logs, which seem to indicate very little; one has an "OS" event listed, with no real information for it, and the rest of the entries are just power on and power off events.

 

If it's a hardware failure, I'd love to isolate it so I can replace whatever is wrong if possible. Must be some way it can be narrowed down. It does have one expansion networking card, a gigabit 4 port Intel usually used in Dell servers. Seems to present no issues immediately that I can find.


3 minutes ago, Salv8 (sam) said:

I have googled this and it gets stranger.

This person has got the same issue:

https://community.spiceworks.com/topic/612854-event-1074-mysterious-shutdown

but it's being caused by winlogon.exe, not RuntimeBroker.exe. The dude brought it to Dell and Dell didn't know what was going on either; he checked the PSUs (both of them) and wasn't able to find anything.

Linux might give more info on what is going on. Back up the Windows OS and try installing Linux, then see how it deals with our haunted hardware and what it does when the error happens.

Then let's check the logs to see exactly what happened. If we can't get anything, we have to open the server and inspect it carefully for any type of damage anywhere.

 

I am less great with Linux than I should be, any distro you'd recommend? You think it will have the same "protective features" that Windows seems to have for shutting it down? I'm ignorant of how to retrieve logs from Linux as is done with Windows.

 

Seems super strange to be caused by the login service, especially when I'm already logged in.


2 minutes ago, Grimm Spector said:

I am less great with Linux than I should be, any distro you'd recommend? You think it will have the same "protective features" that Windows seems to have for shutting it down? I'm ignorant of how to retrieve logs from Linux as is done with Windows.

Ubuntu is the easiest to get used to and it also has a lot of support for different hardware devices. Download it, stick it on a USB, boot into the USB and install like normal (hopefully; since your server isn't stable you might encounter problems, but we can try and help you out with them). Log in to the system, wait for the error to happen and see how Ubuntu deals with it. Ubuntu is quite resistant to this type of stuff, but if it does or says anything then it's an issue that's affecting other OSes as well.

How to retrieve logs in Ubuntu is something I also don't know how to do. I will CC a dude called leadeater, as he has more Linux experience than me and should be able to help you retrieve and make sense of the logs.

CC: @leadeater
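On the log retrieval question, one possible starting point (a sketch, not something anyone in this thread has tested on this box): on a systemd-based distro such as recent Ubuntu releases, `journalctl` can show the errors from the boot before the current one, which is the boot that ended in the shutdown. This assumes the journal is stored persistently so that earlier boots are still available. The helper names `journal_command` and `read_journal` are made up for illustration; the flags themselves (`--boot`, `--priority`, `--dmesg`, `--no-pager`) are standard journalctl options.

```python
import subprocess


def journal_command(boot_offset=-1, priority="err", kernel_only=False):
    """Build a journalctl command line for one boot's log entries.

    boot_offset: 0 is the current boot, -1 the one before it, and so on.
    priority: syslog level name; "err" shows errors and anything worse.
    kernel_only: restrict output to kernel messages.
    """
    cmd = [
        "journalctl",
        "--no-pager",
        f"--boot={boot_offset}",
        f"--priority={priority}",
    ]
    if kernel_only:
        cmd.append("--dmesg")
    return cmd


def read_journal(**options):
    """Run journalctl and return its text output (may need root or sudo)."""
    result = subprocess.run(
        journal_command(**options), capture_output=True, text=True, check=False
    )
    return result.stdout
```

After a surprise shutdown, boot back up and call `read_journal()` (or type the equivalent `journalctl --boot=-1 --priority=err` in a terminal) to see the last errors logged before the power went off.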


26 minutes ago, Salv8 (sam) said:

Ubuntu is the easiest to get used to and it also has a lot of support for different hardware devices. Download it, stick it on a USB, boot into the USB and install like normal (hopefully; since your server isn't stable you might encounter problems, but we can try and help you out with them). Log in to the system, wait for the error to happen and see how Ubuntu deals with it. Ubuntu is quite resistant to this type of stuff, but if it does or says anything then it's an issue that's affecting other OSes as well.

How to retrieve logs in Ubuntu is something I also don't know how to do. I will CC a dude called leadeater, as he has more Linux experience than me and should be able to help you retrieve and make sense of the logs.

CC: @leadeater

 

Heh, I would love to do this, however I haven't successfully gotten the stupid thing to boot via USB. It detects the drives, but doesn't see them as a valid boot option, no matter what tool I use to set them up. I'll give it another go. I suspect some of the optional boot hardware, like the video/LAN/etc. option ROMs, might block out the USB boot devices just by existing, as one of them was actually blocking me from booting from the main hard drive after Windows was installed. Was a headache to track down.


There are 3 of these events, EventID 41

 

Quote

The system has rebooted without cleanly shutting down first. This error could be caused if the system stopped responding, crashed, or lost power unexpectedly.

 

This usually indicates a hardware issue of some kind. Can you provide a picture of the case internals? I'm suspecting one of the chipsets, like the onboard SAS controller, is getting too hot and triggering a reboot. That board looks to have a few chipsets on it, so as a step to see if that is the problem, temporarily get a fan to blow over those heat sinks and see if it stops the rebooting.

These motherboards like to go in chassis with rather high air flow.
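To tell apart "Windows was asked to power off" (Event 1074) from "the machine came back without a clean shutdown" (Kernel-Power Event 41), a small sketch like the one below can line the two up. It is only a rough aid under a few assumptions: the events have already been exported as (timestamp, event ID) pairs, and the function name is hypothetical. Keep in mind that Event 41 is written during the following boot, so its timestamp is when the server came back up, not when it went down.

```python
from datetime import datetime

SHUTDOWN_REQUEST = 1074  # User32: a process asked Windows to shut down or restart
UNCLEAN_REBOOT = 41      # Kernel-Power: system came back up without a clean shutdown


def pair_reboots_with_requests(events):
    """For each Event 41, find the latest Event 1074 logged before it.

    `events` is an iterable of (timestamp, event_id) tuples in any order.
    Returns a list of (reboot_time, request_time_or_None) tuples.
    """
    last_request = None
    pairs = []
    for when, event_id in sorted(events):
        if event_id == SHUTDOWN_REQUEST:
            last_request = when
        elif event_id == UNCLEAN_REBOOT:
            pairs.append((when, last_request))
            last_request = None  # each request explains at most one reboot
    return pairs


# Made-up timestamps, purely to show the shape of the input and output.
example = [
    (datetime(2018, 1, 1, 12, 30), UNCLEAN_REBOOT),
    (datetime(2018, 1, 1, 12, 0), SHUTDOWN_REQUEST),
    (datetime(2018, 1, 1, 15, 0), UNCLEAN_REBOOT),
]
for reboot, request in pair_reboots_with_requests(example):
    print(reboot, "<- request at", request)
```

An Event 41 with no preceding request points more toward power loss or a hard hang, while one that follows a 1074 suggests something asked Windows to power off and the shutdown did not finish cleanly.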


What are your drives? Do you have any sort of RAID configuration? 

Also I notice there is an event from 2016, so where did this OS install come from? Did you pull the drive from another server and plug it in? If so did you reset it using a sysprep? 


28 minutes ago, leadeater said:

There are 3 of these events, EventID 41

 

 

This usually indicates a hardware issue of some kind. Can you provide a picture of the case internals? I'm suspecting one of the chipsets, like the onboard SAS controller, is getting too hot and triggering a reboot. That board looks to have a few chipsets on it, so as a step to see if that is the problem, temporarily get a fan to blow over those heat sinks and see if it stops the rebooting.

These motherboards like to go in chassis with rather high air flow.

Added some images, I can go into the BIOS and watch the temperatures for various chipsets, may be worth doing though I don't seem to have any such shutdowns in the BIOS that I've seen so far. I can toss some more cooling overtop temporarily and see if it stops.

 

23 minutes ago, Jarsky said:

What are your drives? Do you have any sort of RAID configuration? 

Also I notice there is an event from 2016, so where did this OS install come from? Did you pull the drive from another server and plug it in? If so did you reset it using a sysprep? 

I have 4 drives on the SAS controller that are currently empty and not being used. No RAID. The main system drive is a brand new SSD, clean install, the odd date might be that when I cleared the CMOS I forgot to set the date/time and it was wrong until Windows set it from the internet. Install was done via PXE boot.


13 minutes ago, Grimm Spector said:

Added some images, I can go into the BIOS and watch the temperatures for various chipsets, may be worth doing though I don't seem to have any such shutdowns in the BIOS that I've seen so far. I can toss some more cooling overtop temporarily and see if it stops.

Looks to be plenty of fans to give enough cooling; maybe just set them to spin a bit faster for the test instead of adding another fan somehow.


1 hour ago, Grimm Spector said:

-snip-

then you have to go old school.....

If you have an optical drive AND a spare DVD-RW disc handy (ensure that its capacity is bigger than 2GB, otherwise you won't be able to burn the ISO to it!!!!), burn the ISO to the disc and boot off that. If you don't have a DVD-RW but you have a spare HDD, image the ISO to that, tell the system to boot off it and install it that way. It should act exactly like you were installing from a USB/disc. When you are finished, shut the system down, remove the install HDD and boot it back up, and the system should boot into Ubuntu.

Just make sure that the HDD doesn't have anything on it before using it!!!!

Edited by Salv8 (sam)
Edit: fixed up some things


1 hour ago, Salv8 (sam) said:

then you have to go old school.....

If you have an optical drive AND a spare DVD-RW disc handy

Old school, DVD :P.

 

windows1.jpg


24 minutes ago, leadeater said:

windows1.jpg

That's 3.5" floppies... you are just a little baby, watch this

2397460107_bb174c2334_o.jpg

The Selectron tube, developed in ~1946 (development happened in that year, and they also planned to have 200 produced by the end of that year; that didn't go so well, as it was harder to make than they initially predicted). Yes, it's RAM, but it still counts as storage as RAM stores stuff; it just can't retain it after losing power...

And yes, it wasn't commercially viable, but it still counts as the first storage medium. Fun fact: it can hold ~32 to 512 bytes of data, which wouldn't be enough to hold the amount of text (even in a plain text format) in this very comment...

https://en.wikipedia.org/wiki/Selectron_tube


Had a similar issue with one of my servers oddly enough: patched ESXi and shortly after it just started shutting down. Long story short, found some article that suggested updating the firmware of my power supply, which I never knew had firmware. Updated the P/S firmware (was a challenge because whoever I bought my server from had swapped in other ones) and it's been stable since.

I also went through and updated all firmware; previously I had only updated the BIOS and iDRAC. Oddly enough there was yet another update for my iDRAC, so I did that again too.

tl;dr - can't hurt to go through and update all firmware for the hell of it.


9 hours ago, Salv8 (sam) said:

then you have to go old school.....

If you have an optical drive AND a spare DVD-RW disc handy (ensure that its capacity is bigger than 2GB, otherwise you won't be able to burn the ISO to it!!!!), burn the ISO to the disc and boot off that. If you don't have a DVD-RW but you have a spare HDD, image the ISO to that, tell the system to boot off it and install it that way. It should act exactly like you were installing from a USB/disc. When you are finished, shut the system down, remove the install HDD and boot it back up, and the system should boot into Ubuntu.

Just make sure that the HDD doesn't have anything on it before using it!!!!

Yup, that was my main thought, pretty sure I have an old SATA blu-ray/dvd RW somewhere, just have to find it.

2 hours ago, Mikensan said:

Had a similar issue with one of my servers oddly enough: patched ESXi and shortly after it just started shutting down. Long story short, found some article that suggested updating the firmware of my power supply, which I never knew had firmware. Updated the P/S firmware (was a challenge because whoever I bought my server from had swapped in other ones) and it's been stable since.

I also went through and updated all firmware; previously I had only updated the BIOS and iDRAC. Oddly enough there was yet another update for my iDRAC, so I did that again too.

tl;dr - can't hurt to go through and update all firmware for the hell of it.

Power supply firmware? o.O Well, it's a brand new Seasonic PSU, who knows. It's not a server PSU as they're just too expensive, but it's a good quality one. The BIOS is the latest, and there is no new firmware for the LSI chip that I can find, which just leaves drivers.

 

I'm also updating the original post with some images from the BIOS, found some interesting bits that may be a clue!


2 hours ago, Grimm Spector said:

Yup, that was my main thought, pretty sure I have an old SATA blu-ray/dvd RW somewhere, just have to find it.

Power supply firmware? o.O Well, it's a brand new Seasonic PSU, who knows. It's not a server PSU as they're just too expensive, but it's a good quality one. The BIOS is the latest, and there is no new firmware for the LSI chip that I can find, which just leaves drivers.

 

I'm also updating the original post with some images from the BIOS, found some interesting bits that may be a clue!

Ah ok, well then cross that off the list lol. How many watts is your PSU?


1 hour ago, Salv8 (sam) said:

wait is one of the temperatures in the minus?

that looks to be a good start...

Nope, that's the temperature delta to the alarm threshold, quite specifically mentioned in the manual as not being in Celsius or anything other than whatever Intel decided it was going to be in, which is just a raw delta regardless of what units I ask the system to report temperatures in.

 

And, I think I figured out the problem! It's an over-voltage sensor, I think, as per the BIOS' strange description. After a lot of poking I noticed it was running the RAM at both the wrong voltage and the wrong clock speed, despite being set to AUTO, which is the default I see on most boards. Had to reduce the voltage manually and force the proper speed, as it's PC3L ECC RAM. Seems to be running stable so far.
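For anyone else checking the same thing: PC3L (DDR3L) modules are nominally 1.35 V parts, while standard PC3 (DDR3) is nominally 1.5 V. A minimal sketch of the comparison is below. The 5% tolerance is an arbitrary illustration value and not a JEDEC limit, the function name is made up, and the 1.5 V reading in the example is hypothetical since the actual BIOS value isn't given here. DDR3L is generally specified to tolerate 1.5 V as well, so a mismatch by itself may not be the fault.

```python
# Nominal supply voltages for DDR3 module classes.
NOMINAL_VOLTS = {"PC3": 1.5, "PC3L": 1.35}


def voltage_check(module_class, configured_volts, tolerance=0.05):
    """Compare a configured DIMM voltage with the nominal one.

    tolerance is a fraction of nominal and is an assumed value for
    illustration, not a figure taken from the memory standard.
    """
    nominal = NOMINAL_VOLTS[module_class]
    deviation = (configured_volts - nominal) / nominal
    if deviation > tolerance:
        return "over"
    if deviation < -tolerance:
        return "under"
    return "ok"


# Hypothetical reading: BIOS on AUTO driving low-voltage DIMMs at 1.5 V.
print(voltage_check("PC3L", 1.5))   # over
print(voltage_check("PC3L", 1.35))  # ok
```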

 

I also hooked it up to a UPS to scrub the power, though I took it off after solving the RAM issue and it seems all good so far. Going to put it on another UPS in the near future.

 

So I think at this point I'll just run some stress tests to see how it goes, not sure what I want to use just yet. Not used to server testing. Thanks for all the help everyone, maybe keep the weird BIOS info in mind for future problems like this that others may have.


21 hours ago, Mikensan said:

Ah ok, well then cross that off the list lol. How many watts is your PSU?

Sigh, crap. Turns out I was wrong, not sure why it ran stably for a couple hours yesterday, but now it's back to randomly alarming and shutting down after a short time of running, even while idle.

 

PSU is 700 watts, well more than the server is drawing at any given time. With a load meter I have not seen it try to draw more than 220 so far in fact, though I haven't had the CPUs at full load yet.
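Putting those two numbers together as a quick sanity check (a trivial sketch; `psu_load` is just an illustrative helper): 220 W against a 700 W rating is roughly a third of capacity. A wall-side load meter reads AC input, which is higher than what the PSU delivers on the DC side, so the real load on the supply is a bit lower still.

```python
def psu_load(draw_watts, rated_watts):
    """Return (load fraction, remaining watts) for a measured draw."""
    if rated_watts <= 0:
        raise ValueError("rated_watts must be positive")
    return draw_watts / rated_watts, rated_watts - draw_watts


fraction, headroom = psu_load(220, 700)
print(f"load: {fraction:.0%}, headroom: {headroom} W")  # load: 31%, headroom: 480 W
```

This only covers the peak seen so far; it says nothing about draw with both CPUs fully loaded, which hasn't been measured yet.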

 

I've attached another set of server logs here...though this time I'm even more confused about the cause of the shut down than before. This is driving me nuts.

serverfail.rar


For shits and giggles, could you try booting to an Ubuntu live CD (or USB) and leaving it on? See if it will run for more than 24 hours.


11 hours ago, Mikensan said:

For shits and giggles, could you try booting to an Ubuntu live CD (or USB) and leaving it on? See if it will run for more than 24 hours.

So... I tried that, and once Ubuntu was installed to another disk, the whole thing won't boot. The server starts, but the boot options are all blank. When I manually add another one, it even has me select which EFI file I want to boot from, and the Ubuntu install is for some reason the only option... and then it doesn't boot from it, and all the boot entries are blank again when it reboots. I have no idea why, nor how to fix this...

 

Edit: I managed to get around this by disabling the ability in the BIOS for the entire disk controller that disk was on to boot, and suddenly I could boot to windows again, it's like if certain boot options are on, others are entirely impossible to reach which is very ridiculous and frustrating. Of course this hasn't solved the shutting down after being on for a few minutes problem.

Edited by Grimm Spector
New information