HPE warns of firmware bug that bricks SSDs after 40,000 hours

lazypc

HPE and firmware problems, name a more iconic duo...

[Out-of-date] Want to learn how to make your own custom Windows 10 image?

 

Desktop: AMD R9 3900X | ASUS ROG Strix X570-F | Radeon RX 5700 XT | EVGA GTX 1080 SC | 32GB Trident Z Neo 3600MHz | 1TB 970 EVO | 256GB 840 EVO | 960GB Corsair Force LE | EVGA G2 850W | Phanteks P400S

Laptop: Intel M-5Y10c | Intel HD Graphics | 8GB RAM | 250GB Micron SSD | Asus UX305FA

Server 01: Intel Xeon D 1541 | ASRock Rack D1541D4I-2L2T | 32GB Hynix ECC DDR4 | 4x8TB Western Digital HDDs | 32TB Raw 16TB Usable

Server 02: Intel i7 7700K | Gigabyte Z170N Gaming 5 | 16GB Trident Z 3200MHz

8 hours ago, Vishera said:

So they made firmware with planned obsolescence, and after a while said it was a "bug" and released a fix for the firmware...

 

I don't think they would build planned obsolescence into enterprise hardware, especially not in a way that falls short of the stated expected lifetime hours.

Specs: Motherboard: Asus X470-PLUS TUF gaming (Yes I know it's poor but I wasn't informed) RAM: Corsair VENGEANCE® LPX DDR4 3200MHz CL16-18-18-36 2x8GB

            CPU: Ryzen 9 5900X          Case: Antec P8     PSU: Corsair RM850x                        Cooler: Antec K240 with two Noctua Industrial PPC 3000 PWM

            Drives: Samsung 970 EVO plus 250GB, Micron 1100 2TB, Seagate ST4000DM000/1F2168 GPU: EVGA RTX 2080 ti Black edition

11 hours ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something caused by accident and never noticed).  If it had been 65536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

 

Edit: The source does reference a previous issue that did occur at 32768 hours, just as I suspected could happen, but goes on to say that it's unrelated, again, as I suspected.

Possibly just a mistake. It's either 1) an error where the device memory fills up because the controller's rollover/cache clearing failed, or 2) the "please replace, device is old" warning misfires and bricks the device instead :P

 

Not a "kill in 4000 hours to get free resale/repurchases". It's an error, but usually these things just say "buy a new one!!!" they don't self destruct.

5 hours ago, RejZoR said:

My Samsung 850 Pro 2TB has over 31,000 hours clocked in. I guess I should be getting worried...

AFAIK with SSDs it's writes/reads that matter, not hours used. Unlike HDDs, which both spin and have fluid/grease for the spindle, an SSD's lifespan is theoretically the life of the chips, and dormant chips can last a LONG time... but data needs refreshing on an SSD and doesn't on an HDD. So an SSD will lose some life just rewriting the data to refresh it every so often.
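As a rough back-of-the-envelope illustration of why writes matter more than hours, here's a tiny sketch; the TBW rating and the daily write volume are placeholder numbers, not any particular drive's spec.

#include <stdio.h>

int main(void)
{
    /* Placeholder figures for illustration only. */
    const double tbw_rating_tb   = 300.0;  /* endurance rating: 300 TB written */
    const double daily_writes_tb = 0.05;   /* roughly 50 GB written per day */

    double days  = tbw_rating_tb / daily_writes_tb;
    double years = days / 365.0;

    printf("At %.0f GB/day, a %.0f TBW drive lasts about %.0f days (~%.1f years)\n",
           daily_writes_tb * 1000.0, tbw_rating_tb, days, years);
    return 0;
}

In other words, at typical desktop write volumes the flash wears out from writes long after any reasonable hour count, which is why power-on hours alone say little about SSD wear.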

2 minutes ago, TechyBen said:

Possibly just a mistake. It's either 1) an error where the device memory fills up because the controller's rollover/cache clearing failed, or 2) the "please replace, device is old" warning misfires and bricks the device instead :P

 

Not a "kill in 4000 hours to get free resale/repurchases". It's an error, but usually these things just say "buy a new one!!!" they don't self destruct.

Maybe I'm reading into nothing, but I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure, despite my never saying that and even outright stating the opposite. So just for the record (again lol): I'm not suggesting this was intentional in any way :P

 

As for which, I'm not sure how best to describe it, but all I'm trying to say is that I'm leaning strongly toward number 2 here.  It's clearly code that was added by a person to do *something*, and due to a mistake it causes this bricking instead of whatever it was supposed to do (send an alert, etc.).  I don't think it's the kind of thing where an underlying logic or coding error simply wasn't noticed (number 1, I guess), because those types of issues tend to involve "computer numbers" like the ones I listed, and like the one in the previous issue the source cited.
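To make the "computer numbers" point concrete, a toy example (not the actual firmware): a power-on-hours counter kept in a signed 16-bit field can only count to 32,767, which is exactly where the previously reported bug landed, whereas 40,000 doesn't fall on any such boundary and only shows up if someone typed it in.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* A signed 16-bit counter tops out at 32767 hours; overflow bugs land
     * on "computer numbers" like 32768, not on round decimal values.
     * (The wrap on conversion is implementation-defined in C, but on
     * common platforms it goes to -32768.) */
    int16_t hours = 32767;
    hours++;
    printf("after overflow: %d\n", hours);

    /* 40000, by contrast, sits on no power-of-two boundary; it only appears
     * if somebody wrote it into the code deliberately. */
    const uint32_t threshold = 40000u;   /* hypothetical hand-chosen constant */
    printf("hand-written threshold: %u\n", threshold);
    return 0;
}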

Solve your own audio issues  |  First Steps with RPi 3  |  Humidity & Condensation  |  Sleep & Hibernation  |  Overclocking RAM  |  Making Backups  |  Displays  |  4K / 8K / 16K / etc.  |  Do I need 80+ Platinum?

If you can read this you're using the wrong theme.  You can change it at the bottom.

8 minutes ago, Ryan_Vickers said:

Maybe I'm reading into nothing, but I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure, despite my never saying that and even outright stating the opposite. So just for the record (again lol): I'm not suggesting this was intentional in any way :P

 

As for which, I'm not sure how best to describe it, but all I'm trying to say is that I'm leaning strongly toward number 2 here.  It's clearly code that was added by a person to do *something*, and due to a mistake it causes this bricking instead of whatever it was supposed to do (send an alert, etc.).  I don't think it's the kind of thing where an underlying logic or coding error simply wasn't noticed (number 1, I guess), because those types of issues tend to involve "computer numbers" like the ones I listed, and like the one in the previous issue the source cited.

Impossible to tell, really, without the source code/analysis. Sometimes the date comes from the *driver's origin* or from the version 1 device design. So 38,000 hours could actually be counted as 40,000, because 2,000 got dropped somewhere when version 2 of the driver was released later that year. Not all systems default to perfect "day zero" coding methods (see the Unix epoch, etc.). Or they use a signed/unsigned int/double but have it start at some arbitrary point due to some other coding compatibility (the counter is based on the date the system coder added it in, not the starting date of the device's lifetime). Lots of little things can creep into a design, like copy/pasting code from a running device, so instead of starting at hour zero it accidentally starts at day 200! :P
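A tiny sketch of that "wrong starting point" idea, purely hypothetical: if the hour counter is seeded from some nonzero base (a driver-version epoch, a copy-pasted value, whatever) instead of from zero at first power-on, the drive trips the threshold well before it has actually run that long.

#include <stdint.h>
#include <stdio.h>

#define TRIGGER_HOURS 40000u   /* hypothetical threshold the firmware checks */

int main(void)
{
    /* Hypothetical bug: the counter starts at a nonzero base, e.g. 2,000
     * hours inherited from an earlier driver/firmware version, instead of
     * at the drive's own hour zero. */
    uint32_t base_offset   = 2000;
    uint32_t real_hours    = 38000;   /* what the drive has actually run */
    uint32_t counted_hours = base_offset + real_hours;

    if (counted_hours >= TRIGGER_HOURS)
        printf("trips at %u counted hours, after only %u real hours\n",
               counted_hours, real_hours);
    return 0;
}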

3 hours ago, TechyBen said:

AFAIK with SSDs it's writes/reads that matter, not hours used.

Only writes, I think; at least that's what I can deduce from the absence of a read counter in SMART...

10 hours ago, mr moose said:

 

I'm coming up on 8 years on my Vertex 3.  I couldn't tell you how many working hours that is without checking.  It still hasn't thrown an error, and the last report gave it 100%. Even still, I don't trust it anymore and it's now in a non-integral system.

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

"We also blind small animals with cosmetics.
We do not sell cosmetics. We just blind animals."

 

"Please don't mistake us for Equifax. Those fuckers are evil"

 

This PSA brought to you by Equifacks.
PMSL

16 hours ago, Ryan_Vickers said:

The 40000 hours figure is far too "round" and arbitrary for this not to be the result of a coding mistake (something added on purpose that was meant to be changed or removed but never was, as opposed to something caused by accident and never noticed).  If it had been 65536 hours or something like that it would be a different story, but tbh this kind of issue after any duration strikes me as a little odd.  I'm curious what the thought process and purpose was surrounding this number and the code involved in this bug.

40k hours is a fairly typical warranty number; it's possible the drive is supposed to signal in the S.M.A.R.T. metadata when it's past that limit, but somehow the code that does that is broken and bricks the unit instead.
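If anyone wants to see where their own drive sits relative to that kind of limit, smartctl can report the power-on hours. A minimal sketch, assuming smartmontools is installed and /dev/sda happens to be the drive you care about (on SATA drives the value is attribute 9, usually printed as Power_On_Hours; SAS drives report it in a different form):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Assumes smartmontools is installed; /dev/sda is just an example path. */
    FILE *p = popen("smartctl -A /dev/sda", "r");
    if (!p) {
        perror("popen");
        return 1;
    }

    char line[512];
    while (fgets(line, sizeof line, p)) {
        /* Print whichever line carries the hour count for this drive type. */
        if (strstr(line, "Power_On_Hours") || strstr(line, "power on time"))
            fputs(line, stdout);
    }
    pclose(p);
    return 0;
}

Running smartctl -A directly tells you the same thing, of course; the point is just that the hour counter is ordinary S.M.A.R.T. metadata the firmware already tracks.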

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*

29 minutes ago, Dabombinable said:

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

It'll be interesting to see long-term data on SSDs (by that I mean failure rates across time, not just in the first few months); platter drives, however, generally only go about 5 years.

 

For now, while it looks good, I am still recommending that SSDs at the 4-5 year mark get replaced and relocated to non-critical use.

Grammar and spelling is not indicative of intelligence/knowledge.  Not having the same opinion does not always mean lack of understanding.  

12 minutes ago, Sauron said:

40k hours is a fairly typical warranty number; it's possible the drive is supposed to signal in the S.M.A.R.T. metadata when it's past that limit, but somehow the code that does that is broken and bricks the unit instead.

Yeah, that seems to be the prevailing idea, and it does make sense.  I would be curious to know how what should be an alert ends up doing this instead, though.  Perhaps these are coded at an extremely low level, for example just writing binary directly to certain memory addresses.  In such a context I can easily imagine how this issue could result; with more familiar, high-level coding methods, though, it is difficult for me to understand.
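For what it's worth, here's the kind of low-level slip that could turn an alert into something fatal; the layout and field names are entirely made up, just to show how a raw byte write to the wrong offset clobbers a neighbouring field the controller actually needs.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical layout: a status block where the warranty-warning flag sits
 * right next to a value the controller needs in order to come up at all. */
struct status_block {
    uint8_t warranty_warning;   /* byte 0: the flag the code meant to set */
    uint8_t map_magic;          /* byte 1: must stay 0xA5 for the drive to boot */
    uint8_t reserved[14];
};

int main(void)
{
    struct status_block blk;
    memset(&blk, 0, sizeof blk);
    blk.map_magic = 0xA5;

    /* Intended: raw[0] = 1 (set the warning flag).
     * Illustrative off-by-one: the write lands on byte 1 instead and wipes
     * out the value the controller depends on. */
    uint8_t *raw = (uint8_t *)&blk;
    raw[1] = 1;

    printf("warning=%u map_magic=0x%02X (%s)\n",
           blk.warranty_warning, blk.map_magic,
           blk.map_magic == 0xA5 ? "boots fine" : "refuses to boot");
    return 0;
}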

Quote

So, although a large majority of even LTT forum browsers may not have enterprise grade drives sitting at home

 

Well uhm... I do...

At work

Over 20 servers, at least

 

Well, actually, I try to regularly update all of them, including the BIOS, SSD and HDD firmware, and the OS itself, so I don't worry that much; they're all under warranty.

1 minute ago, Ryan_Vickers said:

I would be curious to know how what should be an alert ends up doing this instead, though.

Probably not that hard: if it does anything to corrupt the S.M.A.R.T. data, the controller is probably going to lose its mind. It's an enterprise SAS drive, so just rolling with it when the status data is corrupted is probably not a good idea.

5 minutes ago, Sauron said:

Probably not that hard: if it does anything to corrupt the S.M.A.R.T. data, the controller is probably going to lose its mind. It's an enterprise SAS drive, so just rolling with it when the status data is corrupted is probably not a good idea.

Whether or not that's the cause aside, I would think that in the event of corruption like that it would lock down into a safe mode where the whole drive is read-only, or something like that.  The fact that it just bricks feels like the result of something necessary for operation breaking, rather than something making an observation and "deciding" to take this course of action.
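That "fail safe into read-only" behaviour is a choice somebody has to write explicitly, though; if nobody anticipated the failure mode, there is no fallback to fall into. A crude sketch of the difference, with everything here hypothetical:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical metadata check run when the controller starts up. */
static bool metadata_ok(void)
{
    return false;   /* pretend the wear/S.M.A.R.T. region failed its check */
}

int main(void)
{
    bool read_only = false;

    if (!metadata_ok()) {
        /* Option A: graceful degradation, keep serving reads. */
        read_only = true;
        fprintf(stderr, "metadata corrupt, falling back to read-only\n");

        /* Option B, what a "brick" looks like: refuse to come up at all.
         * If nobody wrote Option A, this is effectively what you get. */
        /* abort(); */
    }

    printf("drive online, read_only=%d\n", read_only);
    return 0;
}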

34 minutes ago, Ryan_Vickers said:

Whether or not that's the cause aside, I would think that in the event of corruption like that it would lock down into a safe mode where the whole drive is read-only, or something like that.  The fact that it just bricks feels like the result of something necessary for operation breaking, rather than something making an observation and "deciding" to take this course of action.

Hard to say without knowing exactly what the bug is. If the corrupted region is something the controller requires to address the cells, it may cause something like that. Regardless, locking into read-only in a RAID setup is pretty much the same as a brick; the owner might even prefer a hard failure in that case, so they can hot-swap the drive rather than have the whole array lock up because suddenly you can't write to one drive. Of course, that's in a normal situation where the drives don't all die at roughly the same time.

15 hours ago, Dabombinable said:

Pretty much: if a drive doesn't fail within the first year, it'll last a long time (excluding physical damage, of course). That's why I know I got statistical defects with my first 240GB SSD (OCZ ARC 100, failed while in use) and my first 480GB NVMe SSD (also failed while in use), while my first SSD (a 60GB Corsair Force LS) is still going well.

Some early SSDs were known to fail, though. The tech has come a long way.

On 3/27/2020 at 5:20 AM, Ryan_Vickers said:

 I feel like a lot of people are thinking that I'm suggesting this was planned obsolescence or some other intentional failure,

 

I read it that way too.  I guess the lack of non-verbal cues, plus the fact that so many people on these forums seem paranoid and automatically assume nefarious motives, means most of us are becoming geared to read things in that light.

I'm thinking the coders on that one moved over from the printer division, where they perfected their craft: expiring printer cartridges older than a year even if they were still fine, so the printers refused to use them. SAS HDDs now; next it might be RAM, or displays... mice? It's endless, like printing money!

Last Wednesday my Crucial M4 256GB bricked itself during a reboot with 99% life remaining. It was only half full, if that. It locks up any PC it gets plugged into. I am not shocked to see big players playing this game. But we enable them, so that makes it OK.

AMD R7 5800X3D | Thermalright Phantom Spirit 120 EVO, 1x T30

Asus Crosshair VIII Dark Hero | 32GB G.Skill Trident Z @ 3733C14

Zotac 4070 Ti Trinity OC @ 3060/1495 | WD SN850, SN850X, SN770

Seasonic Vertex GX-1000 | Fractal Torrent Compact RGB, Many CFM's

43 minutes ago, freeagent said:

Last Wednesday my Crucial M4 256GB bricked itself during a reboot with 99% life remaining. It was only half full, if that. It locks up any PC it gets plugged into. I am not shocked to see big players playing this game. But we enable them, so that makes it OK.

I have an SD card that does that; it's called hardware failure.
