lazypc

HPE warns of firmware bug that bricks SSDs after 40,000 hours


Posted · Original Poster


 

Hewlett Packard Enterprise (HPE) has warned that certain solid state drives will brick after exactly 40,000 hours (4 years, 206 days, and 16 hours) of service due to a firmware bug. Assuming the drives entered service soon after purchase, failures would start appearing in October. A firmware update is available, but if a drive fails this way there is no fix and the data is currently not recoverable. This could easily be catastrophic for servers running these drives in RAID or parity configurations: drives that entered service at the same time will hit the 40,000-hour mark at the same time, fail at the same time, and take the entire array with them.

 

HPE is currently recommending that users update the firmware or plan to migrate to new drives before then. Although the large majority of even LTT forum browsers may not have enterprise-grade drives sitting at home, anyone who buys used hardware or works with this type of hardware should be wary of this going forward.

 

Quote

The drives in question are 800GB and 1.6TB SAS models and storage products listed in the service bulletin here. It applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and check the total time on the drive to best plan an upgrade. According to HPE, the drives could start failing as early as October this year.

 

Sources: https://www.engadget.com/2020-03-25-hpe-ssd-bricked-firmware-flaw.html

https://www.bleepingcomputer.com/news/hardware/hpe-warns-of-new-bug-that-kills-ssd-drives-after-40-000-hours/

https://www.techradar.com/news/new-bug-destroys-hpe-ssds-after-40000-hours

 

Customer Service Bulletin: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00097382en_us
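
For anyone who wants a rough idea of where their own drives sit, power-on hours can be read from SMART data. Below is a minimal sketch using smartctl from smartmontools (my own assumption, not HPE's supported method; the bulletin describes the proper way to check time on the affected drives, and the exact field names smartctl prints vary between SATA and SAS devices):

```python
# Minimal sketch: read a drive's power-on hours with smartctl (smartmontools).
# Assumes smartctl is installed and the script runs with enough privileges;
# output field names differ between SATA and SAS drives.
import re
import subprocess
import sys


def power_on_hours(device):
    out = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True, text=True, check=False,
    ).stdout

    # SATA/ATA attribute table: "  9 Power_On_Hours ... 31000"
    m = re.search(r"Power_On_Hours.*?(\d+)\s*$", out, re.MULTILINE)
    if m:
        return int(m.group(1))

    # SAS/SCSI output: "Accumulated power on time, hours:minutes 31000:15"
    m = re.search(r"Accumulated power on time.*?(\d+):\d+", out)
    if m:
        return int(m.group(1))

    return None


if __name__ == "__main__":
    dev = sys.argv[1] if len(sys.argv) > 1 else "/dev/sda"
    hours = power_on_hours(dev)
    if hours is None:
        print(f"Could not read power-on hours for {dev}")
    else:
        print(f"{dev}: {hours} power-on hours "
              f"({40000 - hours} hours until the 40,000-hour mark)")
```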


At first I was slightly worried. Then I saw it's for SAS drives and now give 0 fucks. It'll never affect me. 



The 40000 hours figure is far too "round" and arbitrary for this to not be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65536 hours or something like that it would be a different story but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.
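
To illustrate the point (a toy example only, nothing from the actual firmware): a counter overflow lands on a power-of-two boundary, while a clean 40,000 looks like a hand-written constant.

```python
# Toy illustration only -- not HPE's firmware logic. Shows why 32768 smells
# like a signed 16-bit overflow while 40000 does not.

def as_int16(hours):
    # Reinterpret the low 16 bits of a counter as a signed value, the way a
    # careless C cast would.
    low = hours & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low

print(as_int16(32767))   #  32767 -> fine
print(as_int16(32768))   # -32768 -> wraps negative at exactly 2**15 hours,
                         #           the signature of the earlier, unrelated bug
print(as_int16(40000))   # -25536 -> 40000 isn't a type boundary at all, so a
                         #           failure at exactly that hour count reads
                         #           like an explicit hard-coded comparison
```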



So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 


9 minutes ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this to not be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65536 hours or something like that it would be a different story but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.


Just now, mr moose said:

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD


17 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Of course it's something designed by a person. Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there. Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.



...immediately went and looked through all the datasheets of the products I support for my job.  This is a call driver (one of those "Are we affected?" calls).

19 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Was probably something like a Predictive Failure metric. All SSDs and HDDs from HPE have power-on hour counters, and it's used alongside other metrics (SMART data but also other stuff) to put a storage device into a warning/error state and raise an alert if the device is about to fail (maybe/probably).

 

There is no logic to the hour count here either, as the part comes with a standard 3-year manufacturer warranty, but basically everyone buys an extended warranty on the server system which covers the drives in it; we extend ours to 5 years as part of the CTO purchase.

 

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

 

Edit:

Also, power-on hours alone aren't normally enough to put a device into Predictive Failure status.
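
Something like the following, perhaps (a purely speculative sketch with hypothetical names and thresholds, just to show the idea of hours feeding a warning rather than a lockout):

```python
# Speculative sketch of predictive-failure logic as described above.
# All names and thresholds are hypothetical, not taken from HPE firmware.
from dataclasses import dataclass


@dataclass
class DriveHealth:
    power_on_hours: int
    wear_level_pct: int          # % of rated write endurance consumed
    reallocated_blocks: int
    uncorrectable_errors: int


def predictive_failure(h):
    """Weigh several metrics together; hours alone shouldn't trip anything."""
    score = 0
    if h.power_on_hours > 35_000:    # approaching expected service life
        score += 1
    if h.wear_level_pct > 90:
        score += 2
    if h.reallocated_blocks > 100:
        score += 2
    if h.uncorrectable_errors > 0:
        score += 3
    return score >= 3                # raise an alert so a tech can swap the drive


def buggy_check(h):
    # What the reported bug behaves like: a hard lockout on hours alone.
    if h.power_on_hours >= 40_000:
        raise RuntimeError("drive bricked")   # oops -- should have been a warning
```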

24 minutes ago, Vishera said:

So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 

As mentioned above/below, this is not very likely. It makes no sense from their perspective. The damage to their PR and reputation with valuable customers, regardless, but especially if something were to go wrong, is far bigger than any extra earnings they could get from trying to squeeze people with an intentional planned failure.

5 minutes ago, mr moose said:

Of course it's something designed by a person. Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there. Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.

 


4 minutes ago, leadeater said:

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

Not even locking like Intel drives will do when they're at the end of life and switch to read only.  Full on brick-shitting mode.



What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs bricked the same way; as they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence" gone wrong type of "bug"...

36 minutes ago, AnonymousGuy said:

Not even locking like Intel drives will do when they're at the end of life and switch to read only.  Full on brick-shitting mode.

Well as far as I know it's a Toshiba/Sandisk SSD and the firmware issue comes from them; HPE don't create the firmware, they only provide the specs.

 

Edit:

Probably Sandisk, 3rd guess is the charm.

22 minutes ago, RejZoR said:

What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs bricked the same way; as they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence" gone wrong type of "bug"...

That was Sandisk and is literally the same bug and same power on time, guess I was wrong about these being Toshiba SSDs or the pic was not of the actual SSD (so many are just stock pics grr).


And I think Crucial also had such "bug" sometime down the line back when M4 SSD drives were a thing. Though there were some more conditions there not just the hour mark. Or was it Samsung. Damn...


As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.


Posted · Original Poster
1 hour ago, leadeater said:

That was Sandisk and is literally the same bug and same power on time, guess I was wrong about these being Toshiba SSDs or the pic was not of the actual SSD (so many are just stock pics grr).

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.

10 minutes ago, lazypc said:

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.

Don't worry I wasn't actually using your image, just trying to find a high quality one on Google images that was actually correct and I could see all the part numbers on it. Almost everyone uses a few stock images for these parts not the real thing.


I find it interesting that the data is not recoverable, though.


3 hours ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this to not be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65536 hours or something like that it would be a different story but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

Honestly I'd guess it's a similar situation to how the Y2K bug came into existence. When the drives were created it was widely assumed that SSDs had a much shorter lifespan than platter drives, and I would assume they arbitrarily limited some variable in the firmware because they expected the drive to be long dead before that point.

 

I cannot imagine them doing this with the intent of killing drives prematurely.


1 hour ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSD's fail just as much as mechanical HDD's iirc.

58 minutes ago, RejZoR said:

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSD's fail just as much as mechanical HDD's iirc.

 

I'm coming up to 8 years on my Vertex 3. Couldn't tell you how many working hours that is without checking. It still hasn't thrown an error and the last report gave it 100%. Even still, I don't trust it anymore and it's now in a non-integral system.



This could be as simple as using a signed integer instead of an unsigned one in the firmware, or a float instead of a double. Go past the type's limit and it rolls over to a negative value, a condition that can never occur during normal operation and that the firmware can't cope with.


2 hours ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

My OCZ Vertex 4 would like to have a word with you... :D

1 hour ago, mr moose said:

 

I'm coming up to 8 years on my Vertex 3. Couldn't tell you how many working hours that is without checking. It still hasn't thrown an error and the last report gave it 100%. Even still, I don't trust it anymore and it's now in a non-integral system.

My Samsung 850 Pro 2TB has over 31,000 hours clocked in. I guess I should be getting worried...

