Xyratex HS-1235E constant fail mode

Hi,

I bought a Xyratex HS-1235E server chassis as a fun project, to use as a NAS with a standard ATX motherboard. I also bought a SAS9220-8i to connect to the chassis backplane. It works in the sense that I can use the drives, but I can't get rid of the orange (!) LED. The fault also causes all 10 fans to run at 12k RPM; I think the chassis is not far from flying off the table, and the neighbors are not far from complaining :)

When I turn on the power, the noise level is OK, with the fans spinning at a lower RPM, but as soon as the (!) LED lights up the fans rev up to 12k.

Does anyone have a good idea what could be causing the alarm indicator? I use SATA drives, so maybe that's why?

I have been trying to use sg_ses to lower the RPM, but the enclosure doesn't seem to care while it's in alarm mode.

One way to fix this might be to replace the fans, or some other hardware hack.

I hope someone with more knowledge than me can give me some advice, because I would really like to use this chassis in the same building I live in :)


Keeping any heat source further away from the drives is key; then the drives can work without cooling at all. Maybe the CPU can't handle all the work you throw at it?

If you power the fans in a DIY way, you could give them less power, so they could not spin at max revs...

Then again, maybe the LED is warning you about high temperatures, in which case slowing the fans down isn't a solution, and you might indeed look for fans that make less noise.


Have you checked to verify that none of the fans have died? It's common for the fans to ramp up to full speed to compensate when a fan dies.


Thank you for the replies. As you both wrote, it has something to do with temperature. I found a manual for a similar system that explains the orange fault LED. Apparently, if it is on all the time, a temperature is outside the safe range; if none of the LEDs on the fans indicate an error, the manual says to move the enclosure to a better environment.

I also checked each fan and they are all spinning, so my conclusion is that the temperature sensor is not working.

The fans hit 12k a few seconds after I start the server, so it should not have anything to do with high CPU load, as it is not even past the BIOS yet.

I suspect that this temperature sensor, or several of them, is not located on the PCB in the chassis. Snooping around further on the internet, I found the manual for the original motherboard for this case. I bought the case without one, so I use a normal ATX PC motherboard. Looking in this manual and at some images of the board mounted, I found a cable with what is apparently called an IPMB connector, which should connect to the BMC on the motherboard over an I2C bus. This is part of IPMI, which does indeed have temperature monitoring and fan control among its features, along with other nice things you need in a server.

So if my guesses are correct (and please let me know if I misunderstood anything), I am missing the part that sends the CPU temperature to the fan controller. I did see that you might be able to set the temperature that is considered safe, but I don't know how. It might be possible to allow 32k C or whatever an unconnected sensor reports. But it might also be that the fault is triggered when the BMC is not reporting anything.

Does anyone know if I can simulate a BMC somehow, maybe with an Arduino that can talk I2C?


This is strange, from sg_ses:

 

 Element type: Temperature sensor, subenclosure id: 0 [ti=3]
      Overall descriptor: <empty>
      Element 0 descriptor: Int
      Element 1 descriptor: Ext
      Element 2 descriptor: CPU0
      Element 3 descriptor: CPU1
      Element 4 descriptor: <empty>
      Element 5 descriptor: DIMM


 Element type: Temperature sensor, subenclosure id: 0 [ti=3]
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Unrecoverable
        Ident=0, Fail=0, OT failure=1, OT warning=0, UT failure=0
        UT warning=0
        Temperature: <reserved>
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=25 C
      Element 1 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=24 C
      Element 2 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Not installed
        Ident=0, Fail=0, OT failure=1, OT warning=0, UT failure=0
        UT warning=0
        Temperature=31 C
      Element 3 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Not installed
        Ident=0, Fail=0, OT failure=1, OT warning=0, UT failure=0
        UT warning=0
        Temperature=31 C
      Element 4 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Not installed
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=31 C
      Element 5 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        UT warning=0
        Temperature=226 C


 

The CPU sensors are marked as "Not installed", but the DIMM sensor is "OK" at 226 C. I wonder if that is the reason for the fault, rather than the enclosure not being able to find the CPU temperature.
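
To avoid reading the flags by eye, the listing above can be scanned with a small script. This is only a rough sketch: it assumes the text layout of the status output pasted above, the function names are made up for illustration, and the 100 C "plausible" limit is my own guess, not something from the manual.

```python
import re

# Excerpt of the status listing from the post (overall + elements 0, 2, 5).
SAMPLE = """
      Overall descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Unrecoverable
        Ident=0, Fail=0, OT failure=1, OT warning=0, UT failure=0
        Temperature: <reserved>
      Element 0 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        Temperature=25 C
      Element 2 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: Not installed
        Ident=0, Fail=0, OT failure=1, OT warning=0, UT failure=0
        Temperature=31 C
      Element 5 descriptor:
        Predicted failure=0, Disabled=0, Swap=0, status: OK
        Ident=0, Fail=0, OT failure=0, OT warning=0, UT failure=0
        Temperature=226 C
"""

def parse_temperature_status(text):
    """Collect index, status, OT failure flag and temperature per element."""
    elements = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(("Element type", "Overall descriptor")):
            current = None  # headers and the overall block are not elements
            continue
        m = re.match(r"Element (\d+) descriptor:$", line)
        if m:
            current = {"index": int(m.group(1)), "status": None,
                       "ot_failure": False, "temp_c": None}
            elements.append(current)
            continue
        if current is None:
            continue
        m = re.search(r"status: (.+)$", line)
        if m:
            current["status"] = m.group(1)
        m = re.search(r"OT failure=(\d)", line)
        if m:
            current["ot_failure"] = m.group(1) == "1"
        m = re.match(r"Temperature=(-?\d+) C$", line)
        if m:
            current["temp_c"] = int(m.group(1))
    return elements

def suspicious(elements, max_plausible_c=100):
    """List elements with OT failure set or a reading above the assumed limit."""
    flagged = []
    for e in elements:
        reasons = []
        if e["ot_failure"]:
            reasons.append("OT failure set")
        if e["temp_c"] is not None and e["temp_c"] > max_plausible_c:
            reasons.append("implausible reading")
        if reasons:
            flagged.append((e["index"], e["status"], reasons))
    return flagged

if __name__ == "__main__":
    for index, status, reasons in suspicious(parse_temperature_status(SAMPLE)):
        print(f"element {index} ({status}): {', '.join(reasons)}")
```

On the excerpt this separates the two oddities: the OT failure flag on a CPU sensor that is "Not installed", and the DIMM sensor that reports "OK" with a reading far outside any sensible range.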

 


I have been trying to understand the commands I can send to the backplane with sg_ses. Apparently you can disable sensors.

Also, it is elements 2 and 3 above that trigger an OT failure; the DIMM sensor (element 5) is probably not the one causing the alarm.

 

After googling like a madman and reading anything I could find, I think this should disable the sensor:

sg_ses /dev/sg2 --index=ts,5 --set=0:5:1=1

The manual excerpt below shows the bits for disabling an element. The format is --set=[byte]:[start bit]:[number of bits, counting to the right]=[value].
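
To double-check my reading of the byte:startbit:bits notation, here is a small sketch of the bit arithmetic. It is not how sg_ses is implemented; set_field is a made-up helper, it assumes a 4-byte control element, and it only handles fields that fit inside one byte.

```python
def set_field(element, byte, start_bit, num_bits, value):
    """Return a copy of a control element with one bit field changed.

    start_bit is the leftmost (most significant) bit of the field, with
    bit 7 being the leftmost bit of the byte; the field extends num_bits
    to the right. Assumption: the field does not cross a byte boundary.
    """
    if not (0 <= start_bit <= 7 and 1 <= num_bits <= start_bit + 1):
        raise ValueError("field must fit inside a single byte")
    if not 0 <= value < (1 << num_bits):
        raise ValueError("value does not fit in the field")
    shift = start_bit - num_bits + 1          # position of the field's lowest bit
    mask = ((1 << num_bits) - 1) << shift     # ones over the field, zeros elsewhere
    out = bytearray(element)
    out[byte] = (out[byte] & ~mask) | (value << shift)
    return bytes(out)

if __name__ == "__main__":
    # Equivalent of --set=0:5:1=1 on an all-zero 4-byte element.
    print(set_field(bytes(4), 0, 5, 1, 1).hex())
```

So 0:5:1=1 should turn on bit 5 of byte 0 (value 0x20) and leave everything else in the element alone.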

 

[Image: manual excerpt showing the bits used to disable an element]

 

The problem seems to be that it is write-protected somehow. The only command that seems to work is ident, so I can light up the LEDs on the fans and drive bays. No other set command seems to do anything; a get on the same address shows no change after the set.

So is it common for vendors to disable all commands to backplanes, or am I doing something wrong, or do I need to enable writing somehow?
