
HPE warns of firmware bug that bricks SSDs after 40,000 hours

lazypc


 

Hewlett Packard Enterprise (HPE) has warned that certain solid state drives will brick after exactly 40,000 hours (4 years, 206 days, and 16 hours) of service due to a firmware bug. Assuming the products entered service soon after being bought, that would put the first failures in October. There is a firmware update available, but if a drive fails in this manner there is no fix and the data is currently not recoverable. This could easily be catastrophic in servers using these drives in RAID or parity configurations: drives that entered service at the same time will hit the 40,000-hour mark at the same time and fail together, taking the entire array with them.

 

HPE is currently recommending that users update the firmware or plan on migrating to new drives before then. Although the large majority of even LTT forum browsers may not have enterprise-grade drives sitting at home, those who buy used hardware or work with this type of hardware will want to be wary of this going forward.
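
For anyone trying to work out roughly when a given drive crosses the threshold, here's a quick back-of-the-envelope sketch (a rough illustration only, assuming the drive has been powered on 24/7; the install date below is a made-up example, not anything from HPE):

from datetime import datetime, timedelta

FAILURE_HOURS = 40_000  # 40,000 hours = 4 years, 206 days and 16 hours (counting 365-day years)

def estimated_brick_date(in_service: datetime) -> datetime:
    """Rough estimate of when a drive reaches 40,000 power-on hours,
    assuming it has been powered on continuously since in_service."""
    return in_service + timedelta(hours=FAILURE_HOURS)

# Hypothetical example: a drive racked on 1 March 2016 crosses the line
# in late September 2020, which fits with failures starting around October.
print(estimated_brick_date(datetime(2016, 3, 1)))  # 2020-09-22 16:00:00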

 

Quote

The drives in question are 800GB and 1.6TB SAS models and storage products listed in the service bulletin here. It applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and check the total time on the drive to best plan an upgrade. According to HPE, the drives could start failing as early as October this year.

 

Sources: https://www.engadget.com/2020-03-25-hpe-ssd-bricked-firmware-flaw.html

https://www.bleepingcomputer.com/news/hardware/hpe-warns-of-new-bug-that-kills-ssd-drives-after-40-000-hours/

https://www.techradar.com/news/new-bug-destroys-hpe-ssds-after-40000-hours

 

Customer Service Bulletin: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00097382en_us


At first I was slightly worried. Then I saw it's for SAS drives and now I give 0 fucks. It'll never affect me.


The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.
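
To make the distinction concrete (a toy illustration only, nothing to do with HPE's actual firmware): a power-on-hours counter stored in a signed 16-bit field would misbehave at exactly 32,768 hours, a power-of-two boundary, whereas 40,000 is the kind of value that only shows up if someone typed it in.

import ctypes

def poh_as_int16(hours: int) -> int:
    """What a power-on-hours reading would look like if the firmware
    (hypothetically) stored it in a signed 16-bit field."""
    return ctypes.c_int16(hours).value

print(poh_as_int16(32_767))  # 32767   -- last value that still fits
print(poh_as_int16(32_768))  # -32768  -- wraps negative: the classic overflow signature
print(poh_as_int16(40_000))  # -25536  -- 40,000 isn't a power-of-two boundary, so a
                             #            failure exactly there points to a hand-written
                             #            constant rather than an overflow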


So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 


9 minutes ago, Ryan_Vickers said:

The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.


Just now, mr moose said:

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD


17 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Of course it's something designed by a person.  Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there.  Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.


...immediately went and looked through all the datasheets of the products I support for my job.  This is a call driver (one of those "Are we affected?" calls).


19 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Was probably something like a Predictive Failure metric. All SSDs and HDDs from HPE have power-on hour counters, and they're used along with other metrics (SMART data but also other stuff) to put a storage device into a warning/error state and raise an alert if the device is about to fail (maybe/probably).

 

There is no logic to the hours here either, as the part comes with a standard manufacturer warranty of 3 years, but basically everyone buys an extended warranty on the server system which covers the drives in it; we extend ours to 5 years as part of the CTO purchase.

 

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

 

Edit:

Also, power-on hours alone isn't normally enough to put a device into Predictive Failure status.
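
A rough sketch of the kind of check being described, just to make the "warn vs. lock" distinction concrete (every name and threshold here is made up for illustration; none of it comes from HPE or SanDisk firmware):

WARN_POWER_ON_HOURS = 40_000  # hypothetical threshold, for illustration only

def intended_check(power_on_hours: int, other_metrics_ok: bool) -> str:
    # What a predictive-failure check is supposed to do: combine power-on
    # hours with other health data and raise a warning so the drive gets
    # swapped out while it keeps serving data.
    if power_on_hours >= WARN_POWER_ON_HOURS or not other_metrics_ok:
        return "PREDICTIVE_FAILURE"  # alert raised, drive stays readable
    return "OK"

def buggy_check(power_on_hours: int, other_metrics_ok: bool) -> str:
    # The failure mode described in this thread: crossing the hour mark
    # locks the drive outright instead of merely flagging it.
    if power_on_hours >= WARN_POWER_ON_HOURS:
        return "BRICKED"  # oops
    return intended_check(power_on_hours, other_metrics_ok)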


24 minutes ago, Vishera said:

So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 

As mentioned above/below, this is not very likely.  It makes no sense from their perspective.  The damage to their PR and reputation with valuable customers, regardless of outcome but especially if something were to go wrong, is way bigger than any extra earnings they could get from trying to squeeze people with an intentional planned failure.

5 minutes ago, mr moose said:

Of course it's something designed by a person.  Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there.  Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.

 


4 minutes ago, leadeater said:

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

Not even locking like Intel drives do when they're at the end of their life and switch to read-only.  Full on brick-shitting mode.


What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs get bricked the same way: once they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence gone wrong" type of "bug"...


36 minutes ago, AnonymousGuy said:

Not even locking like Intel drives do when they're at the end of their life and switch to read-only.  Full on brick-shitting mode.

Well, as far as I know it's a Toshiba/SanDisk SSD and the firmware issue is from Toshiba/SanDisk; HPE don't create the firmware, they only provide the specs.

 

Edit:

Probably Sandisk, 3rd guess is the charm.


22 minutes ago, RejZoR said:

What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs get bricked the same way: once they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence gone wrong" type of "bug"...

That was SanDisk, and it's literally the same bug and same power-on time. Guess I was wrong about these being Toshiba SSDs, or the pic was not of the actual SSD (so many are just stock pics grr).


And I think Crucial also had such a "bug" somewhere down the line, back when the M4 SSDs were a thing, though there were some more conditions there, not just the hour mark. Or was it Samsung? Damn...


As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.


1 hour ago, leadeater said:

That was SanDisk, and it's literally the same bug and same power-on time. Guess I was wrong about these being Toshiba SSDs, or the pic was not of the actual SSD (so many are just stock pics grr).

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.


10 minutes ago, lazypc said:

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.

Don't worry, I wasn't actually using your image, just trying to find a high-quality one on Google Images that was actually correct and where I could see all the part numbers on it. Almost everyone uses a few stock images for these parts, not the real thing.


I find it interesting that the data is not recoverable, though.


3 hours ago, Ryan_Vickers said:

The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

Honestly I'd guess it's a similar situation to how the Y2K bug came into existence. When the drives were created it was widely assumed that SSDs had a much shorter lifespan than platter drives, and I would assume they arbitrarily limited some variable in the firmware because they assumed the drive would be long dead before that point.

 

I cannot imagine them doing this with the intent of killing drives prematurely.


1 hour ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSDs fail just as much as mechanical HDDs, iirc.


58 minutes ago, RejZoR said:

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSDs fail just as much as mechanical HDDs, iirc.

 

I'm coming up to 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error and the last report gave it 100%. Even so, I don't trust it anymore and it's now in a non-critical system.


This could be as simple as using a signed integer instead of an unsigned one in the firmware, or a float instead of a double. Go beyond 40,000 and it rolls over to a negative value, a condition that can never occur during normal operation and that the firmware can't cope with.


2 hours ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

My OCZ Vertex 4 would like to have a word with you... :D


1 hour ago, mr moose said:

 

I'm coming up to 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error and the last report gave it 100%. Even so, I don't trust it anymore and it's now in a non-critical system.

My Samsung 850 Pro 2TB has over 31,000 hours clocked in. I guess I should be getting worried...

