
HPE warns of firmware bug that bricks SSDs after 40,000 hours

lazypc


 

Hewlett Packard Enterprise (HPE) has warned that certain solid state drives will brick after exactly 40,000 hours (4 years, 206 days, and 16 hours) of service due to a firmware bug. Assuming the products entered service soon after being bought, that would put the first failures in October. There is a firmware update available, but if a drive fails in this manner there is no fix and the data is currently not recoverable. This could easily be catastrophic in servers using these drives in RAID or parity configurations: drives that entered service at the same time will hit the 40,000-hour mark at the same time and fail together, taking the entire array with them.

 

HPE is currently recommending that users update the firmware or plan on migrating to new drives before then. Although the large majority of even LTT forum browsers may not have enterprise-grade drives sitting at home, those who buy used hardware or work with this type of hardware will want to be wary of this going forward.
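
For anyone trying to work out roughly when a given drive crosses the threshold, here's a quick back-of-the-envelope sketch (a rough illustration only, assuming the drive has been powered on 24/7; the install date below is a made-up example, not anything from HPE):

from datetime import datetime, timedelta

FAILURE_HOURS = 40_000  # 40,000 hours = 4 years, 206 days and 16 hours (counting 365-day years)

def estimated_brick_date(in_service: datetime) -> datetime:
    """Rough estimate of when a drive reaches 40,000 power-on hours,
    assuming it has been powered on continuously since in_service."""
    return in_service + timedelta(hours=FAILURE_HOURS)

# Hypothetical example: a drive racked on 1 March 2016 crosses the line
# in late September 2020, which fits with failures starting around October.
print(estimated_brick_date(datetime(2016, 3, 1)))  # 2020-09-22 16:00:00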

 

Quote

The drives in question are 800GB and 1.6TB SAS models and storage products listed in the service bulletin here. It applies to any products with HPD7 or earlier firmware. HPE also includes instructions on how to update the firmware and check the total time on the drive to best plan an upgrade. According to HPE, the drives could start failing as early as October this year.

 

Sources: https://www.engadget.com/2020-03-25-hpe-ssd-bricked-firmware-flaw.html

https://www.bleepingcomputer.com/news/hardware/hpe-warns-of-new-bug-that-kills-ssd-drives-after-40-000-hours/

https://www.techradar.com/news/new-bug-destroys-hpe-ssds-after-40000-hours

 

Customer Service Bulletin: https://support.hpe.com/hpesc/public/docDisplay?docLocale=en_US&docId=a00097382en_us


At first I was slightly worried. Then I saw it's for SAS drives and now I give 0 fucks. It'll never affect me.


The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.
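
To make the distinction concrete (a toy illustration only, nothing to do with HPE's actual firmware): a power-on-hours counter stored in a signed 16-bit field would misbehave at exactly 32,768 hours, a power-of-two boundary, whereas 40,000 is the kind of value that only shows up if someone typed it in.

import ctypes

def poh_as_int16(hours: int) -> int:
    """What a power-on-hours reading would look like if the firmware
    (hypothetically) stored it in a signed 16-bit field."""
    return ctypes.c_int16(hours).value

print(poh_as_int16(32_767))  # 32767   -- last value that still fits
print(poh_as_int16(32_768))  # -32768  -- wraps negative: the classic overflow signature
print(poh_as_int16(40_000))  # -25536  -- 40,000 isn't a power-of-two boundary, so a
                             #            failure exactly there points to a hand-written
                             #            constant rather than an overflow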


So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 


9 minutes ago, Ryan_Vickers said:

The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.


Just now, mr moose said:

If I was writing code for a product like this I would use rounded numbers for timed events.  40K hours is easier to remember than 39872 hours.

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD


17 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Of course it's something designed by a person.  Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there.  Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.


...immediately went and looked through all the datasheets of the products I support for my job.  This is a call driver (one of those "Are we affected?" calls).


19 minutes ago, Ryan_Vickers said:

Exactly, this is clearly something chosen by a human and not an integer overflow or something of that nature.  I wonder what the intended purpose was though, as I assume bricking the drive was not a timed event they intended to schedule xD

Was probably something like a Predictive Failure metric. All SSDs and HDDs from HPE have power-on hour counters, and they're used along with other metrics (SMART data but also other stuff) to put a storage device into a warning/error state and raise an alert if the device is about to fail (maybe/probably).

 

There is no logic to the hours here either, as the part comes with a standard manufacturer warranty of 3 years, but basically everyone buys an extended warranty on the server system which covers the drives in it; we extend ours to 5 years as part of the CTO purchase.

 

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

 

Edit:

Also, power-on hours alone isn't normally enough to put a device into Predictive Failure status.
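
A rough sketch of the kind of check being described, just to make the "warn vs. lock" distinction concrete (every name and threshold here is made up for illustration; none of it comes from HPE or SanDisk firmware):

WARN_POWER_ON_HOURS = 40_000  # hypothetical threshold, for illustration only

def intended_check(power_on_hours: int, other_metrics_ok: bool) -> str:
    # What a predictive-failure check is supposed to do: combine power-on
    # hours with other health data and raise a warning so the drive gets
    # swapped out while it keeps serving data.
    if power_on_hours >= WARN_POWER_ON_HOURS or not other_metrics_ok:
        return "PREDICTIVE_FAILURE"  # alert raised, drive stays readable
    return "OK"

def buggy_check(power_on_hours: int, other_metrics_ok: bool) -> str:
    # The failure mode described in this thread: crossing the hour mark
    # locks the drive outright instead of merely flagging it.
    if power_on_hours >= WARN_POWER_ON_HOURS:
        return "BRICKED"  # oops
    return intended_check(power_on_hours, other_metrics_ok)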


24 minutes ago, Vishera said:

So they made a firmware with planned obsolescence and after a while said it was a "bug" and released a fix to the firmware...

 

As mentioned above/below, this is not very likely.  It makes no sense from their perspective.  The damage to their PR and reputation with valuable customers, regardless of outcome but especially if something were to go wrong, is way bigger than any extra earnings they could get from trying to squeeze people with an intentional planned failure.

5 minutes ago, mr moose said:

Of course it's something designed by a person.  Maybe one of the HPE gurus will be able to tell us what it was supposed to do or why it was there.  Given the drive is designed for power computing in big clusters, it could be something as benign as a check designed to throw a warning that the drive is nearing the end of its serviceable life so technicians can ensure it gets replaced before it fails.

 


4 minutes ago, leadeater said:

Sounds to me like instead of putting the device into an error state it's locking it, oops lol.

Not even locking like Intel drives do when they're at the end of their life and switch to read-only.  Full on brick-shitting mode.


What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs get bricked the same way: once they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence gone wrong" type of "bug"...


36 minutes ago, AnonymousGuy said:

Not even locking like Intel drives do when they're at the end of their life and switch to read-only.  Full on brick-shitting mode.

Well, as far as I know it's a Toshiba/SanDisk SSD and the firmware issue is from Toshiba/SanDisk; HPE don't create the firmware, they only provide the specs.

 

Edit:

Probably Sandisk, 3rd guess is the charm.


22 minutes ago, RejZoR said:

What's up with this timed bricking and SSDs? Coz this isn't the first time this exact "bug" has happened. I remember some other manufacturer having SSDs get bricked the same way: once they reached a certain hour mark of operation, they just bricked. Stinks of a "planned obsolescence gone wrong" type of "bug"...

That was SanDisk, and it's literally the same bug and same power-on time. Guess I was wrong about these being Toshiba SSDs, or the pic was not of the actual SSD (so many are just stock pics grr).


And I think Crucial also had such a "bug" somewhere down the line, back when the M4 SSDs were a thing, though there were some more conditions there, not just the hour mark. Or was it Samsung? Damn...


As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.


1 hour ago, leadeater said:

That was SanDisk, and it's literally the same bug and same power-on time. Guess I was wrong about these being Toshiba SSDs, or the pic was not of the actual SSD (so many are just stock pics grr).

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.


10 minutes ago, lazypc said:

Yeah sorry about that, I tried looking up a model # for an image and didn't get very far so random stock image time... With a side of white background to offend all those dark mode users.

Don't worry, I wasn't actually using your image, just trying to find a high-quality one on Google Images that was actually correct and where I could see all the part numbers on it. Almost everyone uses a few stock images for these parts, not the real thing.


I find it interesting that the data is not recoverable, though.


3 hours ago, Ryan_Vickers said:

The 40,000-hour figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something they caused by accident and didn't realize).  If it had been 65,536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

Honestly I'd guess it's a similar situation to how the Y2K bug came into existence. When the drives were created it was widely assumed that SSDs had a much shorter lifespan than platter drives, and I would assume they arbitrarily limited some variable in the firmware because they assumed the drive would be long dead before that point.

 

I cannot imagine them doing this with the intent of killing drives prematurely.


1 hour ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSDs fail just as much as mechanical HDDs, iirc.


58 minutes ago, RejZoR said:

ONE MILLION HOURS MTBF!!!! Remember how everyone was flexing with this crap? And in the last reliability survey it turns out SSDs fail just as much as mechanical HDDs, iirc.

 

I'm coming up to 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error and the last report gave it 100%. Even so, I don't trust it anymore and it's now in a non-critical system.


This could be as simple as using a signed integer instead of an unsigned one in the firmware, or a float instead of a double. Go beyond 40,000 and it rolls over to a negative value, a condition that can never occur during normal operation and that the firmware can't cope with.


2 hours ago, mr moose said:

As always, you should be replacing your drives at the 4 year mark anyway.  Manufacturer motive aside, the tech just doesn't last that long anyway.

My OCZ Vertex 4 would like to have a word with you... :D


1 hour ago, mr moose said:

 

I'm coming up to 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error and the last report gave it 100%. Even so, I don't trust it anymore and it's now in a non-critical system.

My Samsung 850 Pro 2TB has over 31,000 hours clocked in. I guess I should be getting worried...

