HPE warns of firmware bug that bricks SSDs after 40,000 hours

lazypc

HPE and firmware problems, name a more iconic duo...

[Out-of-date] Want to learn how to make your own custom Windows 10 image?

 

Desktop: AMD R9 3900X | ASUS ROG Strix X570-F | Radeon RX 5700 XT | EVGA GTX 1080 SC | 32GB Trident Z Neo 3600MHz | 1TB 970 EVO | 256GB 840 EVO | 960GB Corsair Force LE | EVGA G2 850W | Phanteks P400S

Laptop: Intel M-5Y10c | Intel HD Graphics | 8GB RAM | 250GB Micron SSD | Asus UX305FA

Server 01: Intel Xeon D 1541 | ASRock Rack D1541D4I-2L2T | 32GB Hynix ECC DDR4 | 4x8TB Western Digital HDDs | 32TB Raw 16TB Usable

Server 02: Intel i7 7700K | Gigabyte Z170N Gaming 5 | 16GB Trident Z 3200MHz

8 hours ago, Vishera said:

So they made firmware with planned obsolescence, and after a while said it was a "bug" and released a fix for the firmware...

 

I don't think they would build planned obsolescence into enterprise hardware, especially not in a way that falls short of the stated expected lifetime hours.

Specs: Motherboard: Asus X470-PLUS TUF gaming (Yes I know it's poor but I wasn't informed) RAM: Corsair VENGEANCE® LPX DDR4 3200MHz CL16-18-18-36 2x8GB

            CPU: Ryzen 9 5900X          Case: Antec P8     PSU: Corsair RM850x                        Cooler: Antec K240 with two Noctua Industrial PPC 3000 PWM

            Drives: Samsung 970 EVO plus 250GB, Micron 1100 2TB, Seagate ST4000DM000/1F2168 GPU: EVGA RTX 2080 ti Black edition

11 hours ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something caused by accident and never noticed).  If it had been 65536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

Possibly just a mistake. It's either 1) an error where the device memory fills up because the controller's rollover/cache clearing failed, or 2) the "please replace, device is old" warning misfires and bricks the device instead :P

 

Not a "kill in 4000 hours to get free resale/repurchases". It's an error, but usually these things just say "buy a new one!!!" they don't self destruct.

5 hours ago, RejZoR said:

My Samsung 850 Pro 2TB has over 31,000 hours clocked in. I guess I should be getting worried...

AFAIK with SSDs it's writes/reads that matter, not hours used. Unlike HDDs, which both spin and have fluid/grease for the spindle, an SSD's lifespan is theoretically the life of the chips, and dormant chips can last a LONG time... but data needs refreshing on an SSD and doesn't on an HDD. So an SSD will lose some life just rewriting the data to refresh it every so often.
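As a rough back-of-the-envelope illustration of why writes matter more than hours, here's a tiny sketch; the TBW rating and the daily write volume are placeholder numbers, not any particular drive's spec.

#include <stdio.h>

int main(void)
{
    /* Placeholder figures for illustration only. */
    const double tbw_rating_tb   = 300.0;  /* endurance rating: 300 TB written */
    const double daily_writes_tb = 0.05;   /* roughly 50 GB written per day */

    double days  = tbw_rating_tb / daily_writes_tb;
    double years = days / 365.0;

    printf("At %.0f GB/day, a %.0f TBW drive lasts about %.0f days (~%.1f years)\n",
           daily_writes_tb * 1000.0, tbw_rating_tb, days, years);
    return 0;
}

In other words, at typical desktop write volumes the flash wears out from writes long after any reasonable hour count, which is why power-on hours alone say little about SSD wear.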

2 minutes ago, TechyBen said:

Possibly just a mistake. It's either 1) an error where the device memory fills up because the controller's rollover/cache clearing failed, or 2) the "please replace, device is old" warning misfires and bricks the device instead :P

 

Not a "kill in 4000 hours to get free resale/repurchases". It's an error, but usually these things just say "buy a new one!!!" they don't self destruct.

Maybe I'm reading into nothing, but I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure, despite my never saying that and even outright stating the opposite. So just for the record (again lol): I'm not suggesting this was intentional in any way :P

 

As for which, I'm not sure how best to describe it, but all I'm trying to say is that I'm leaning strongly toward number 2 here.  It's clearly code that was added by a person to do *something*, and due to a mistake it causes this bricking instead of whatever it was supposed to do (send an alert, etc.).  I don't think it's the kind of thing where an underlying logic or coding error simply wasn't noticed (number 1, I guess), because those types of issues tend to involve "computer numbers" like the ones I listed, and like the one in the previous issue the source cited.
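To make the "computer numbers" point concrete, a toy example (not the actual firmware): a power-on-hours counter kept in a signed 16-bit field can only count to 32,767, which is exactly where the previously reported bug landed, whereas 40,000 doesn't fall on any such boundary and only shows up if someone typed it in.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A signed 16-bit counter tops out at 32767 hours; overflow bugs land
     * on "computer numbers" like 32768, not on round decimal values.
     * (The wrap on conversion is implementation-defined in C, but on
     * common platforms it goes to -32768.) */
    int16_t hours = 32767;
    hours++;
    printf("after overflow: %d\n", hours);

    /* 40000, by contrast, sits on no power-of-two boundary; it only appears
     * if somebody wrote it into the code deliberately. */
    const uint32_t threshold = 40000u;   /* hypothetical hand-chosen constant */
    printf("hand-written threshold: %u\n", threshold);
    return 0;
}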

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.

8 minutes ago, Ryan_Vickers said:

Maybe I'm reading into nothing, but I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure, despite my never saying that and even outright stating the opposite. So just for the record (again lol): I'm not suggesting this was intentional in any way :P

 

As for which, I'm not sure how best to describe it, but all I'm trying to say is that I'm leaning strongly toward number 2 here.  It's clearly code that was added by a person to do *something*, and due to a mistake it causes this bricking instead of whatever it was supposed to do (send an alert, etc.).  I don't think it's the kind of thing where an underlying logic or coding error simply wasn't noticed (number 1, I guess), because those types of issues tend to involve "computer numbers" like the ones I listed, and like the one in the previous issue the source cited.

Impossible to tell, really, without the source code/analysis. Sometimes the date comes from the *driver's origin* or from the version 1 device design. So 38,000 hours could actually be counted as 40,000, because 2,000 got dropped somewhere when version 2 of the driver was released later that year. Not all systems default to perfect "day zero" coding methods (see the Unix epoch, etc.). Or they use a signed/unsigned int/double but have it start at some arbitrary point due to some other coding compatibility (the counter is based on the date the system coder added it in, not the starting date of the device's lifetime). Lots of little things can creep into a design, like copy/pasting code from a running device, so instead of starting at hour zero it accidentally starts at day 200! :P
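A tiny sketch of that "wrong starting point" idea, purely hypothetical: if the hour counter is seeded from some nonzero base (a driver-version epoch, a copy-pasted value, whatever) instead of from zero at first power-on, the drive trips the threshold well before it has actually run that long.

#include <stdint.h>
#include <stdio.h>

#define TRIGGER_HOURS 40000u   /* hypothetical threshold the firmware checks */

int main(void)
{
    /* Hypothetical bug: the counter starts at a nonzero base, e.g. 2,000
     * hours inherited from an earlier driver/firmware version, instead of
     * at the drive's own hour zero. */
    uint32_t base_offset   = 2000;
    uint32_t real_hours    = 38000;   /* what the drive has actually run */
    uint32_t counted_hours = base_offset + real_hours;

    if (counted_hours >= TRIGGER_HOURS)
        printf("trips at %u counted hours, after only %u real hours\n",
               counted_hours, real_hours);
    return 0;
}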

3 hours ago, TechyBen said:

AFAIK with SSDs it's writes/reads that matter, not hours used.

Only writes, I think; at least that's what I can deduce from the absence of a read counter in SMART...

10 hours ago, mr moose said:

 

I'm coming up on 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error, and the last report gave it 100%. Even still, I don't trust it anymore and it's now in a non-integral system.

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL

16 hours ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something caused by accident and never noticed).  If it had been 65536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

40k hours is a fairly typical warranty number; it's possible the drive is supposed to signal in the S.M.A.R.T. metadata when it's past that limit, but somehow the code that does that is broken and bricks the unit instead.
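If anyone wants to see where their own drive sits relative to that kind of limit, smartctl can report the power-on hours. A minimal sketch, assuming smartmontools is installed and /dev/sda happens to be the drive you care about (on SATA drives the value is attribute 9, usually printed as Power_On_Hours; SAS drives report it in a different form):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Assumes smartmontools is installed; /dev/sda is just an example path. */
    FILE *p = popen("smartctl -A /dev/sda", "r");
    if (!p) {
        perror("popen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof line, p)) {
        /* Print whichever line carries the hour count for this drive type. */
        if (strstr(line, "Power_On_Hours") || strstr(line, "power on time"))
            fputs(line, stdout);
    }
    pclose(p);
    return 0;
}

Running smartctl -A directly tells you the same thing, of course; the point is just that the hour counter is ordinary S.M.A.R.T. metadata the firmware already tracks.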

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

29 minutes ago, Dabombinable said:

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

It'll be interesting to see long-term data on SSDs (by that I mean failure rates across time, not just in the first few months); platter drives, however, generally only go about 5 years.

 

For now, while it looks good, I am still recommending that SSDs at the 4-5 year mark get replaced and relocated to non-critical use.

Grammar and spelling is not indicative of intelligence/knowledge.  Not having the same opinion does not always mean lack of understanding.  

12 minutes ago, Sauron said:

40k hours is a fairly typical warranty number; it's possible the drive is supposed to signal in the S.M.A.R.T. metadata when it's past that limit, but somehow the code that does that is broken and bricks the unit instead.

Yeah, that seems to be the prevailing idea, and it does make sense.  I would be curious to know how what should be an alert ends up doing this instead, though.  Perhaps these are coded at an extremely low level, for example just writing binary directly to certain memory addresses.  In such a context I can easily imagine how this issue could result; with more familiar, high-level coding methods, though, it is difficult for me to understand.
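For what it's worth, here's the kind of low-level slip that could turn an alert into something fatal; the layout and field names are entirely made up, just to show how a raw byte write to the wrong offset clobbers a neighbouring field the controller actually needs.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: a status block where the warranty-warning flag sits
 * right next to a value the controller needs in order to come up at all. */
struct status_block {
    uint8_t warranty_warning;   /* byte 0: the flag the code meant to set */
    uint8_t map_magic;          /* byte 1: must stay 0xA5 for the drive to boot */
    uint8_t reserved[14];
};

int main(void)
{
    struct status_block blk;
    memset(&blk, 0, sizeof blk);
    blk.map_magic = 0xA5;

    /* Intended: raw[0] = 1 (set the warning flag).
     * Illustrative off-by-one: the write lands on byte 1 instead and wipes
     * out the value the controller depends on. */
    uint8_t *raw = (uint8_t *)&blk;
    raw[1] = 1;

    printf("warning=%u map_magic=0x%02X (%s)\n",
           blk.warranty_warning, blk.map_magic,
           blk.map_magic == 0xA5 ? "boots fine" : "refuses to boot");
    return 0;
}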

Quote

So, although a large majority of even LTT forum browsers may not have enterprise grade drives sitting at home

 

Well uhm... I do...

At work

Over 20 servers, at least

 

Well, actually, I try to regularly update all of them, including the BIOS, SSD and HDD firmware, and the OS itself, so I don't worry that much; they're all under warranty.

1 minute ago, Ryan_Vickers said:

I would be curious to know how what should be an alert ends up doing this instead, though.

Probably not that hard: if it does anything to corrupt the S.M.A.R.T. data, the controller is probably going to lose its mind. It's an enterprise SAS drive, so just rolling with it when the status data is corrupted is probably not a good idea.

5 minutes ago, Sauron said:

Probably not that hard: if it does anything to corrupt the S.M.A.R.T. data, the controller is probably going to lose its mind. It's an enterprise SAS drive, so just rolling with it when the status data is corrupted is probably not a good idea.

Whether or not that's the cause aside, I would think that in the event of corruption like that it would lock down into a safe mode where the whole drive is read-only, or something like that.  The fact that it just bricks feels like the result of something necessary for operation breaking, rather than something making an observation and "deciding" to take this course of action.
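That "fail safe into read-only" behaviour is a choice somebody has to write explicitly, though; if nobody anticipated the failure mode, there is no fallback to fall into. A crude sketch of the difference, with everything here hypothetical:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical metadata check run when the controller starts up. */
static bool metadata_ok(void)
{
    return false;   /* pretend the wear/S.M.A.R.T. region failed its check */
}

int main(void)
{
    bool read_only = false;

    if (!metadata_ok()) {
        /* Option A: graceful degradation, keep serving reads. */
        read_only = true;
        fprintf(stderr, "metadata corrupt, falling back to read-only\n");

        /* Option B, what a "brick" looks like: refuse to come up at all.
         * If nobody wrote Option A, this is effectively what you get. */
        /* abort(); */
    }

    printf("drive online, read_only=%d\n", read_only);
    return 0;
}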

34 minutes ago, Ryan_Vickers said:

Whether or not that's the cause aside, I would think that in the event of corruption like that it would lock down into a safe mode where the whole drive is read-only, or something like that.  The fact that it just bricks feels like the result of something necessary for operation breaking, rather than something making an observation and "deciding" to take this course of action.

Hard to say without knowing exactly what the bug is. If the corrupted region is something the controller requires to address the cells, it may cause something like that. Regardless, locking into read-only in a RAID setup is pretty much the same as a brick; the owner might even prefer a hard failure in that case, so they can hot-swap the drive rather than have the whole array lock up because suddenly you can't write to one drive. Of course, that's in a normal situation where the drives don't all die at roughly the same time.

15 hours ago, Dabombinable said:

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

Some early SSDs were known to fail, though. The tech has come a long way.

On 3/27/2020 at 5:20 AM, Ryan_Vickers said:

 I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure,

 

I read it that way too.  I guess the lack of non-verbal cues, plus the fact that so many people on these forums seem paranoid and automatically assume nefarious motives, means most of us are becoming geared to read things in that light.

I'm thinking the coders on that one moved over from the printer division, where they perfected their craft: expiring printer cartridges older than a year even if they were still fine, so the printers refused to use them. SAS HDDs now; next it might be RAM, or displays... mice? It's endless, like printing money!

Last Wednesday my Crucial M4 256GB bricked itself during a reboot with 99% life remaining. It was only half full, if that. It locks up any PC it gets plugged into. I am not shocked to see big players playing this game. But we enable them, so that makes it OK.

AMD R7 5800X3D | Thermalright Phantom Spirit 120 EVO, 1x T30

Asus Crosshair VIII Dark Hero | 32GB G.Skill Trident Z @ 3733C14

Zotac 4070 Ti Trinity OC @ 3060/1495 | WD SN850, SN850X, SN770

Seasonic Vertex GX-1000 | Fractal Torrent Compact RGB, Many CFM's

43 minutes ago, freeagent said:

Last Wednesday my Crucial M4 256GB bricked itself during a reboot with 99% life remaining. It was only half full, if that. It locks up any PC it gets plugged into. I am not shocked to see big players playing this game. But we enable them, so that makes it OK.

I have an SD card that does that; it's called hardware failure.
