
Google's 6y study into enterprise and consumer grade SSDs - it's not pretty

zMeul

source: https://www.usenix.org/sites/default/files/fast16_full_proceedings.pdf - 380p document; the interesting part is at page 67 (78 in the reader)

via: http://www.zdnet.com/article/ssd-reliability-in-the-real-world-googles-experience/

 

The FAST 2016 paper Flash Reliability in Production: The Expected and the Unexpected, by Professor Bianca Schroeder of the University of Toronto, and Raghav Lagisetty and Arif Merchant of Google, covers:

  • millions of drive days over 6 years
  • 10 different drive models
  • 3 different flash types: MLC, eMLC and SLC
  • enterprise and consumer drives

 

Quote

KEY CONCLUSIONS

  • Ignore Uncorrectable Bit Error Rate (UBER) specs. A meaningless number.
  • Good news: Raw Bit Error Rate (RBER) increases slower than expected from wearout and is not correlated with UBER or other failures.
  • High-end SLC drives are no more reliable than MLC drives.
  • Bad news: SSDs fail at a lower rate than disks, but UBER rate is higher.
  • SSD age, not usage, affects reliability.
  • Bad blocks in new SSDs are common, and drives with a large number of bad blocks are much more likely to lose hundreds of other blocks, most likely due to die or chip failure.
  • 30-80 percent of SSDs develop at least one bad block and 2-7 percent develop at least one bad chip in the first four years of deployment.

 

 

differences in types of flash

Quote

Comparison of MLC, eMLC, and SLC drives

 

eMLC and SLC drives target the enterprise market and command a higher price point. Besides offering a higher write endurance, there is also the perception that the enterprise drives are higher-end drives, which are overall more reliable and robust. This section evaluates the accuracy of this perception.
Revisiting Table 3, we see that this perception is correct when it comes to SLC drives and their RBER, as they are orders of magnitude lower than for MLC and eMLC drives. However, Tables 2 and 5 show that SLC drives do not perform better for those measures of reliability that matter most in practice: SLC drives don’t have lower repair or replacement rates, and don’t typically have lower rates of non-transparent errors.

The eMLC drives exhibit higher RBER than the MLC drives, even when taking into account that the RBER for MLC drives are lower bounds and could be up to 16X higher in the worst case. However, these differences might be due to their smaller lithography, rather than other differences in technology.
Based on our observations above, we conclude that SLC drives are not generally more reliable than MLC drives.

tables 2, 3 and 5:

[images of Tables 2, 3 and 5 from the paper: TJYeMCl.png, qTxs42F.png, pb1VGYr.png]

 

comparison with hard disk drives:

Quote

An obvious question is how flash reliability compares to that of hard disk drives (HDDs), their main competitor. We find that when it comes to replacement rates, flash drives win. The annual replacement rates of hard disk drives have previously been reported to be 2-9%, which is high compared to the 4-10% of flash drives we see being replaced in a 4 year period. However, flash drives are less attractive when it comes to their error rates. More than 20% of flash drives develop uncorrectable errors in a four year period, 30-80% develop bad blocks and 2-7% of them develop bad chips. In comparison, previous work on HDDs reports that only 3.5% of disks in a large population developed bad sectors in a 32 months period – a low number when taking into account that the number of sectors on a hard disk is orders of magnitudes larger than the number of either blocks or chips on a solid state drive, and that sectors are smaller than blocks, so a failure is less severe.

In summary, we find that the flash drives in our study experience significantly lower replacement rates (within their rated lifetime) than hard disk drives. On the downside, they experience significantly higher rates of uncorrectable errors than hard disk drives.

 

---
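
The replacement figures quoted above sit on different time bases: 2-9% per year for hard disks against 4-10% over four years for flash. Here is a minimal sketch of putting them on the same footing. It assumes a constant replacement rate across the four years, which is my simplification and not something the paper states, and `annualized_rate` is a hypothetical helper name.

```python
def annualized_rate(cumulative: float, years: float) -> float:
    """Convert a cumulative replacement fraction over `years` into an
    equivalent per-year fraction, assuming the same rate every year."""
    if not 0.0 <= cumulative < 1.0:
        raise ValueError("cumulative must be in [0, 1)")
    if years <= 0:
        raise ValueError("years must be positive")
    # Survival over the whole period is (1 - annual) ** years,
    # so invert that to recover the per-year fraction.
    return 1.0 - (1.0 - cumulative) ** (1.0 / years)


if __name__ == "__main__":
    # 4% and 10% are the four-year flash figures quoted from the paper.
    for four_year in (0.04, 0.10):
        print(f"{four_year:.0%} over 4 years is about "
              f"{annualized_rate(four_year, 4):.2%} per year")
```

Under that assumption the flash figures work out to roughly 1% to 2.6% per year, against the 2-9% per year reported for hard disks.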

 

so what have we learned?

back up your data!

  • at least three copies
  • in two different formats
  • with one of those copies off-site
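
The three rules above are easy to check mechanically. A rough sketch, where `BackupCopy` and `satisfies_3_2_1` are made-up names for illustration and "format" is taken to mean the kind of storage a copy lives on:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    label: str
    storage_format: str  # e.g. "ssd", "hdd", "tape", "cloud" (illustrative values)
    offsite: bool


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """True if there are at least 3 copies, in at least 2 different
    formats, with at least 1 copy kept off-site."""
    formats = {c.storage_format for c in copies}
    return (
        len(copies) >= 3
        and len(formats) >= 2
        and any(c.offsite for c in copies)
    )


if __name__ == "__main__":
    plan = [
        BackupCopy("working copy", "ssd", offsite=False),
        BackupCopy("local NAS", "hdd", offsite=False),
        BackupCopy("remote archive", "cloud", offsite=True),
    ]
    print(satisfies_3_2_1(plan))
```

Dropping the remote archive from the example plan fails the check twice over: only two copies remain and none is off-site.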

 

in the end, SSDs are more prone to losing data than HDDs; HDDs are more prone to fail completely than SSDs


I think the final comment that SSDs will cause more data corruption, while HDDs break down more often is more or less what you would expect. Mechanical parts, bearings and so on have wear and tear that a solid state drive will never see. But when an SSD chip dies, that's it. You can still recover the data on a mechanical drive by transplanting the platters into another enclosure, or by replacing the read-head bearing.

Intel i7 5820K (4.5 GHz) | MSI X99A MPower | 32 GB Kingston HyperX Fury 2666MHz | Asus RoG STRIX GTX 1080ti OC | Samsung 951 m.2 nVME 512GB | Crucial MX200 1000GB | Western Digital Caviar Black 2000GB | Noctua NH-D15 | Fractal Define R5 | Seasonic 860 Platinum | Logitech G910 | Sennheiser 599 | Blue Yeti | Logitech G502

 

Nikon D500 | Nikon 300mm f/4 PF  | Nikon 200-500 f/5.6 | Nikon 50mm f/1.8 | Tamron 70-210 f/4 VCII | Sigma 10-20 f/3.5 | Nikon 17-55 f/2.8 | Tamron 90mm F2.8 SP Di VC USD Macro | Neewer 750II


1 minute ago, Fetzie said:

You can still recover the data on a mechanical drive by transplanting the platters into another enclosure, or by replacing the read-head bearing.

You can recover data from an SSD too. 'Just' remove the chip from the SSD and place it in a third-party circuit (this can be done on aircraft black boxes etc.). OK, this is very, very expensive, but data retrieval is not impossible with an SSD.

 Two mottoes to live by   "Sometimes there are no shortcuts"

                                           "This too shall pass"


14 minutes ago, soup said:

You can recover data from an SSD too. 'Just' remove the chip from the SSD and place it in a third-party circuit (this can be done on aircraft black boxes etc.). OK, this is very, very expensive, but data retrieval is not impossible with an SSD.

not if the data is encrypted, and in enterprise grade SSDs you bet your ass it is; and it's done at the controller level


8 minutes ago, zMeul said:

not if the data is encrypted, and in enterprise grade SSDs you bet your ass it is; and it's done at the controller level

Still easily recoverable. If all you're doing is copying the data over, it can be rejoined with the rest and decrypted by the same controller. Further, if the encryption standard used by the controller and the passkey are known, decryption is trivial.

Software Engineer for Suncorp (Australia), Computer Tech Enthusiast, Miami University Graduate, Nerd


1 minute ago, patrickjp93 said:

Still easily recoverable. If all you're doing is copying the data over, it can be rejoined with the rest and decrypted by the same controller. Further, if the encryption standard used by the controller and the passkey are known, decryption is trivial.

doesn't the SSD generate a new key each time the drive is formatted (secure erase)?

so, that part of the key needs to be extracted somehow

 

and as far as I read, this applies to consumer SSDs too


For SSDs no one has ever looked at UBER. Write endurance, or bytes written, has been the measure ever since standardized rating of SSDs became a thing. Not sure why this was even listed in the key conclusions since it was never used.

 

Also in case anyone is confused or misunderstands what they mean by reliability of SLC vs eMLC they mean failure rate of the drive unrelated to usage. SLC SSD will last longer and have a higher wear rating but all SSDs have similar failure rates.


1 minute ago, leadeater said:

SLC SSD will last longer and have a higher wear rating but all SSDs have similar failure rates.

SLCs "last longer" because the drives based on this type of flash have bigger overprovisioning, not because the type of flash is better

it's written somewhere in their findings


7 minutes ago, patrickjp93 said:

Still easily recoverable. If all you're doing is copying the data over, it can be rejoined with the rest and decrypted by the same controller. Further, if the encryption standard used by the controller and the passkey are known, decryption is trivial.

If you are pulling NAND chips for data recovery the most likely reason for this is the controller has failed. SED/FIPS encrypted disks and SSDs require the original controller to read the data, that is the point of disk encryption, so decryption in this case is not that easy.


3 minutes ago, zMeul said:

SLCs "last longer" because the drives based on this type of flash have bigger overprovisioning, not because the type of flash is better

it's written somewhere in their findings

SLC lasts longer because it stores a single bit per cell. To get the same capacity as MLC you need many more NAND chips. The reliability of the NAND is exactly the same between SLC and MLC, but a failed cell in SLC only affects a single bit, not multiple. This is the main reason why SLC has higher wear ratings.

SSD manufacturers generally have two models of SSDs of the eMLC/MLC type: read optimized and write optimized. As you can likely guess, write optimized has more overprovisioning and so has a higher wear rating. SLC drives are always designed as 'write optimized', so they feature high overprovisioning.

 

SLC with 30% OP will last longer than eMLC with 30% OP.


Just now, leadeater said:

SLC with 30% OP will last longer than eMLC with 30% OP.

that's oversimplifying it since the SLC based SSDs will always have more overprovisioning than the rest

as I said, it's written somewhere in their findings


Just now, zMeul said:

that's oversimplifying it since the SLC based SSDs will always have more overprovisioning than the rest

as I said, it's written somewhere in their findings

Yes, but that is the essential point. There are so many different designs on the market, from different SSD manufacturers targeting different usages, that you also can't simply say SLC always has more OP than eMLC, since this is not always true. SLC is a flash type and is unrelated to OP; it just stands to reason that customers buying SLC-based SSDs are doing so for write endurance, so manufacturers pair SLC designs with high OP.

The point of my post was that a failing cell in an SLC SSD's NAND results in a single bit failing, not two. It should be very obvious from this fact why a 100GB SLC wear rating is higher than a 100GB eMLC one, regardless of OP.


35 minutes ago, zMeul said:

doesn't the SSD generate a new key each time the drive is formatted (secure erase)?

so, that part of the key needs to be extracted somehow

 

and as far as I read, this applies to consumer SSDs too

That's a configurable setting.


24 minutes ago, leadeater said:

If you are pulling NAND chips for data recovery the most likely reason for this is the controller has failed. SED/FIPS encrypted disks and SSDs require the original controller to read the data, that is the point of disk encryption, so decryption in this case is not that easy.

We have hardware-level AES 256. Just use that or the new 512 version.


2 minutes ago, patrickjp93 said:

We have hardware-level AES 256. Just use that or the new 512 version.

What has that got to do with decrypting the data stored on the disk? That is an encryption standard and hardware level just means the device itself can do the encryption and doesn't require off-board software. This in no way helps decrypting the data.

 

Data recovery companies are only just now starting to be able to recover data from SED disks and they still require you to know the encryption key and the original disk still has to be mostly functional. They do not break the encryption at all. Actually breaking even small bit AES encryption would still take longer than you will be alive with current computing power.

 

http://www.businesswire.com/news/home/20160106005210/en/ACE-Announces-Data-Recovery-Solution-Failed-WD
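
For a sense of scale on the "longer than you will be alive" point, here is a back-of-the-envelope sketch of exhaustive key search. It assumes the cipher is known, the key is not, and no shortcut attack exists; the guess rate of 10^18 keys per second is a deliberately generous hypothetical, not a measured figure for any real machine.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def expected_bruteforce_years(key_bits: int, keys_per_second: float) -> float:
    """Expected years to find a key by exhaustive search.

    On average half the keyspace, 2 ** (key_bits - 1) keys, has to be
    tried before the right one turns up.
    """
    expected_trials = 2 ** (key_bits - 1)
    return expected_trials / keys_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    assumed_rate = 1e18  # hypothetical guesses per second
    for bits in (128, 256):
        years = expected_bruteforce_years(bits, assumed_rate)
        print(f"AES-{bits}: about {years:.1e} years at {assumed_rate:.0e} keys/s")
```

Even at that assumed rate the 128-bit case comes out in the trillions of years, which is why recovery services ask for the key instead of attacking the cipher.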


1 hour ago, zMeul said:

not if the data is encrypted,

I wasn't commenting on how easy (or hard/impossible) it was to understand the data, just that it could be recovered.


19 minutes ago, leadeater said:

What has that got to do with decrypting the data stored on the disk? That is an encryption standard and hardware level just means the device itself can do the encryption and doesn't require off-board software. This in no way helps decrypting the data.

 

Data recovery companies are only just now starting to be able to recover data from SED disks and they still require you to know the encryption key and the original disk still has to be mostly functional. They do not break the encryption at all. Actually breaking even small bit AES encryption would still take longer than you will be alive with current computing power.

 

http://www.businesswire.com/news/home/20160106005210/en/ACE-Announces-Data-Recovery-Solution-Failed-WD

I didn't say you had to break the encryption. Decrypting data only requires 3 things: the data, the key, and the encryption algorithm (since the decryption algorithm is heavily based on it anyway, I don't distinguish between the two).

 

AES128 can be broken (intelligent pattern-finding brute force) by the Titan supercomputer in about 4 hours. AES256 is exponentially more complex. Further, this isn't about breaking, just standard decryption. If you're stupid enough to use encryption that lives and dies with the controller, you deserve your loss of money.


17 minutes ago, patrickjp93 said:

I didn't say you had to break the encryption. Decrypting data only requires 3 things: the data, the key, and the encryption algorithm.

 

AES128 can be broken (intelligent pattern-finding brute force) by the Titan supercomputer in about 4 hours. AES256 is exponentially more complex. Further, this isn't about breaking, just standard decryption. If you're stupid enough to use encryption that lives and dies with the controller, you deserve your loss of money.

Not everyone has a supercomputer, data recovery companies included, but yes this isn't the point.

 

SED disks are designed to be secure; the only stupid thing is having one copy of the data. Data recovery off standard disks is already the luck of the draw, and throwing in encryption further reduces the chances of successful recovery. Data encryption comes with downsides along with the upsides.

 

I guess we are getting pretty far off topic now though :P


32 minutes ago, Misanthrope said:

Nice, I'll be sure not to buy SSDs made 6 fucking years ago.

because the NAND tech advanced so much in the last 4-6y .. NOT!


13 minutes ago, zMeul said:

because the NAND tech advanced so much in the last 4-6y .. NOT!

It has. Endurance is up an order of magnitude vs. what we had in 2008.


54 minutes ago, CUDA_Cores said:

couldn't you do this with an SSD in the event of a controller failure by simply desoldering the flash chips and resoldering them onto a new board?

Depends on the SSD. Hence the discussion in this thread about encrypted data on the chips :)


1 minute ago, Fetzie said:

Depends on the SSD. Hence the discussion in this thread about encrypted data on the chips :)

It's really quite simple. Don't buy SSDs with proprietary encryption standards, and only have Secure-Erase working on actual erasures.


I have a problem with that ZD Net article. 

Quote

Two standout conclusions from the study. First, that MLC drives are as reliable as the more costly SLC "enterprise" drives. This mirrors hard drive experience, where consumer SATA drives have been found to be as reliable as expensive SAS and Fibre Channel drives.

This is simply not true. The URE rating is 1 per 10^14 bits on consumer drives and 1 per 10^15 on enterprise drives. On platters, it matters a lot. On single disks it might not matter, but in large RAID arrays it's extremely important. It will determine your ability to salvage a degraded array.

Anywho, that's just one thing that irked me. Continue on with the SSD debate.


8 minutes ago, MageTank said:

I have a problem with that ZD Net article. 

This is simply not true. The URE rating is 1 per 10^14 bits on consumer drives and 1 per 10^15 on enterprise drives. On platters, it matters a lot. On single disks it might not matter, but in large RAID arrays it's extremely important. It will determine your ability to salvage a degraded array.

Anywho, that's just one thing that irked me. Continue on with the SSD debate.

Except that the ratings have been found to be bogus, so the actual quality between the two is the same in this respect. Also, just because you read 1*10^15 bits from a drive that has a 1*10^-15 chance per bit to have an error does not mean you will wind up with an error. Do a little basic research on combinatorics and its role in probability.

 

That's only about a 10% chance of having an unrecoverable error. And frankly, this is what RAID 10 is for. Most enterprise doesn't use RAID 6 anymore for exactly this reason. It's practically a non-factor now.
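
The probability argument in this exchange can be worked through numerically. A minimal sketch, assuming every bit read fails independently with the same rated probability (a textbook simplification; real drives need not behave this way, and the ratings themselves are disputed above). `prob_at_least_one_error` is a hypothetical helper name.

```python
import math


def prob_at_least_one_error(bits_read: float, per_bit_error_rate: float) -> float:
    """P(at least one unrecoverable read error) = 1 - (1 - p) ** n,
    assuming independent errors with the same per-bit probability p."""
    # log1p and expm1 keep precision when p is tiny and n is huge.
    return -math.expm1(bits_read * math.log1p(-per_bit_error_rate))


if __name__ == "__main__":
    for bits in (1e14, 1e15):
        for rate in (1e-14, 1e-15):
            p = prob_at_least_one_error(bits, rate)
            print(f"read {bits:.0e} bits at {rate:.0e} per bit: {p:.1%}")
```

Under these assumptions, reading 10^15 bits from a drive rated at 10^-15 per bit gives about a 63% chance of at least one error, not a certainty. A figure near 10% corresponds to reading 10^14 bits (about 12.5 TB) from such a drive.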
