
Linux 5.1 kernel hit by SSD TRIM bug which causes massive data loss

Chunchunmaru_

Summary: It looks like there is a bad commit in the Linux 5.1.2 kernel: in some circumstances, issuing an ATA TRIM causes live data to be discarded instead of only the deleted blocks.

Looks like Windows 10 is not the only one to get data loss after updates.

So far, it appears the issue happens if:

  • LUKS + dm-crypt (Linux kernel software encryption) is used
  • A Samsung SSD is used
  • Linux 5.1.2 is used
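
If you are not sure which kernel you are running, a minimal sketch like the one below can tell you whether it is in the 5.1 series. Treating every 5.1.x release as suspect is an assumption (the reports so far mention 5.1.2), and `is_suspect_kernel` is just an illustrative name:

```shell
#!/bin/sh
# Sketch: report whether a kernel release string belongs to the 5.1 series.
# Assumption: every 5.1.x release is treated as suspect; the reports in this
# thread only confirm 5.1.2.
is_suspect_kernel() {
    case "$1" in
        5.1|5.1.*|5.1-*) return 0 ;;   # e.g. 5.1.2-arch1-1
        *) return 1 ;;                 # e.g. 5.0.x, 5.10.x, 4.19.x
    esac
}

# Use the first argument if given, otherwise the running kernel.
release="${1:-$(uname -r)}"
if is_suspect_kernel "$release"; then
    echo "kernel $release is in the 5.1 series: avoid TRIM until you switch kernels"
else
    echo "kernel $release is not in the 5.1 series"
fi
```

The pattern requires a literal dot or dash after `5.1`, so later series such as 5.10 are not matched by accident.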

 

Quote
LINUX STORAGE --

As a forewarning to those using LVM, dm-crypt, and Samsung solid-state drives, this combination in some manner(s) may lead to data corruption if using the Linux 5.1 kernel.

Linux FSTRIM/Discard is being too aggressive leading to data loss on certain setups, which at this point seem to be isolated to those using LVM and dm-crypt. The device mapper bug in Linux 5.1 is causing for blocks to be discarded wrongly or too much and that can lead to "massive data loss" issues.


Note that this only happens when an ATA TRIM is issued: when the "discard" fstab option is used, when the weekly "fstrim" cron job or systemd timer runs, or when fstrim is run manually from the CLI.

Ubuntu-based distros should not be affected by this, as they don't use the Linux 5.1 kernel.

Arch Linux and Manjaro users are sadly affected. On Manjaro fstrim is enabled automatically, while on Arch it has to be run or configured manually.
I do not know if other distros are affected.

Since it's a bug related to the device mapper, this could happen on any filesystem.

Mitigation: The best thing you can do is shut down the system immediately and boot another kernel.
You can also try disabling the fstrim timer with these commands (run as root), though this is the less recommended option:

`systemctl disable fstrim.timer`
`systemctl stop fstrim.timer`

Or, if you are an expert, check /etc/fstab for the discard option (but you are basically screwed if you have this enabled, as it's a continuous trim).
For your SSD's health and optimal performance, remember to re-enable those options once everything is over by replacing disable with enable, and stop with start.
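
For the /etc/fstab check, here is a rough sketch that lists the mount points whose options include "discard". It assumes the standard fstab layout with the mount options in the fourth field, and `list_discard_mounts` is a made-up helper name:

```shell
#!/bin/sh
# Sketch: print the mount point of every fstab entry that uses the
# continuous "discard" option. Comment and short lines are skipped.
# Assumption: standard fstab layout, mount options in field 4.
list_discard_mounts() {
    awk '
        /^[ \t]*#/ || NF < 4 { next }
        {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "discard") {   # exact match, so "nodiscard" is ignored
                    print $2
                    break
                }
        }
    ' "${1:-/etc/fstab}"
}

# Only look at the real file if it is readable.
if [ -r /etc/fstab ]; then
    list_discard_mounts /etc/fstab
fi
```

If it prints nothing, no entry uses continuous TRIM. Any mount point it does print is one where "discard" would need to be removed from the options before remounting.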

Source:
https://www.redhat.com/archives/dm-devel/2019-May/msg00084.html
https://bbs.archlinux.org/viewtopic.php?id=246569
https://bugs.archlinux.org/task/62693
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg87788.html
https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.1-FSTRIM-Bug

Edited by Chunchunmaru_
Added mitigation

1 minute ago, Chunchunmaru_ said:

Ubuntu-based distros should not be affected by this, as they don't use the Linux 5.1 kernel.

Good to know, nobody in my family uses other Linux distros.

 

I'm curious, does it only happen with the Samsung SSDs, or would it happen with a standard drive?

Quote or tag me( @Crunchy Dragon) if you want me to see your reply

If a post solved your problem/answered your question, please consider marking it as "solved"

Community Standards // Join Floatplane!


I'm glad I found out now that I switched back to macOS...

 

I was actually gonna try the 5.1 kernel too... oof. 

She/Her


1 minute ago, Crunchy Dragon said:

Good to know, nobody in my family uses other Linux distros.

 

I'm curious, does it only happen with the Samsung SSDs, or would it happen with a standard drive?

It's not completely clear at this point. The only things the reporters of these issues seem to have in common are Samsung SSDs and LUKS + LVM encryption; Samsung SSDs are pretty common, though...


At least it's not just update and it's gone, you have to run fstrim or hit your scheduled trim. This along with the hard/software combo will hopefully keep loss to a minimum.

Resident Mozilla Shill.   Typed on my Ortholinear JJ40 custom keyboard
               __     I am the ASCIIDino.
              / _)
     _.----._/ /      If you can see me you 
    /         /       must put me in your 
 __/ (  | (  |        signature for 24 hours.
/__.-'|_|--|_|        

4 minutes ago, Fez_Boy said:

At least it's not just update and it's gone, you have to run fstrim or hit your scheduled trim. This along with the hard/software combo will hopefully keep loss to a minimum.

For those who have the continuous "discard" fstab option enabled, it should be completely destructive on those specific setups.

I checked on Ubuntu with a btrfs filesystem, and only a weekly fstrim timer is enabled. I haven't used Manjaro in ages, so I can't really tell, but if I remember correctly it does the same.


Just now, Chunchunmaru_ said:

Ubuntu-based distros should not be affected by this, as they don't use the Linux 5.1 kernel.

Arch Linux and Manjaro users are sadly affected. On Manjaro fstrim is enabled automatically, while on Arch it has to be run or configured manually.
I do not know if other distros are affected.

I am frankly very grateful that my PC refused to install Manjaro now. Additionally that I discovered that my CPU overclocks were causing Ubuntu not to boot, and so I stuck with Ubuntu as my Linux distro of choice.

mechanical keyboard switches aficionado & hi-fi audio enthusiast

switch reviews  how i lube mx-style keyboard switches


1 minute ago, seoz said:

Additionally that I discovered that my CPU overclocks were causing Ubuntu not to boot, and so I stuck with Ubuntu as my Linux distro of choice.

Huh?


Just now, Fez_Boy said:

Huh?

> be me
> want to try Linux
> try Lubuntu, refused to boot and kept causing hardware errors
> try Manjaro for experimental reasons, booted into installer but again kept causing hardware errors
> "Well shucks, what could be causing these?"
> try underclocking to 4.5GHz 1.25V
> Lubuntu boots
> "I don't like this after all"
> try Ubuntu 19.04
> now we're here


1 minute ago, seoz said:

-snip-

Oh, you said Ubuntu both times, not Lubuntu.


Quote

Linux 5.1 kernel hit by SSD TRIM bug which causes massive data loss

Has SSD in system running 5.1.x

Heart rate increases

Quote

It looks like there is a bad commit in the Linux 5.1.2 kernel: in some circumstances, issuing an ATA TRIM causes live data to be discarded instead of only the deleted blocks.

Said system is running 5.1.2

Heart rate increases further

Quote

on some Samsung SSD's

Said SSD is a samsung

Heart rate increases even further

Quote

if LUKS+dm_crypt is used

Doesn’t use LUKS

Heart rate stabilizes

 

uh, yeah, still gonna go update the kernel on that machine right now though.

Current LTT F@H Rank: 90    Score: 2,503,680,659    Stats

Yes, I have 9 monitors.

My main PC (Hybrid Windows 10/Arch Linux):

OS: Arch Linux w/ XFCE DE (VFIO-Patched Kernel) as host OS, windows 10 as guest

CPU: Ryzen 9 3900X w/PBO on (6c 12t for host, 6c 12t for guest)

Cooler: Noctua NH-D15

Mobo: Asus X470-F Gaming

RAM: 32GB G-Skill Ripjaws V @ 3200MHz (12GB for host, 20GB for guest)

GPU: Guest: EVGA RTX 3070 FTW3 ULTRA Host: 2x Radeon HD 8470

PSU: EVGA G2 650W

SSDs: Guest: Samsung 850 evo 120 GB, Samsung 860 evo 1TB Host: Samsung 970 evo 500GB NVME

HDD: Guest: WD Caviar Blue 1 TB

Case: Fractal Design Define R5 Black w/ Tempered Glass Side Panel Upgrade

Other: White LED strip to illuminate the interior. Extra fractal intake fan for positive pressure.

 

unRAID server (Plex, Windows 10 VM, NAS, Duplicati, game servers):

OS: unRAID 6.11.2

CPU: Ryzen R7 2700x @ Stock

Cooler: Noctua NH-U9S

Mobo: Asus Prime X470-Pro

RAM: 16GB G-Skill Ripjaws V + 16GB Hyperx Fury Black @ stock

GPU: EVGA GTX 1080 FTW2

PSU: EVGA G3 850W

SSD: Samsung 970 evo NVME 250GB, Samsung 860 evo SATA 1TB 

HDDs: 4x HGST Deskstar NAS 4TB @ 7200RPM (3 data, 1 parity)

Case: Silverstone GD08B

Other: Added 3x Noctua NF-F12 intake, 2x Noctua NF-A8 exhaust, Inateck 5 port USB 3.0 expansion card with USB 3.0 front panel header

Details: 12GB ram, GTX 1080, USB card passed through to windows 10 VM. VM's OS drive is the SATA SSD. Rest of resources are for Plex, Duplicati, Spaghettidetective, Nextcloud, and game servers.


16 minutes ago, firelighter487 said:

I'm glad I found out now that I switched back to macOS...

 

I was actually gonna try the 5.1 kernel too... oof. 

IKR :P though you shouldn't worry about this unless you have LUKS and a samsung drive.

 

This is why I use the LTS kernel on Arch.

25 minutes ago, Chunchunmaru_ said:

I do not know if other distros are affected

I would assume this applies to any distro running 5.1; after all, one of the sources is a Red Hat mailing list, and Ubuntu/RHEL/Arch are the base for a lot of popular distros.

Don't ask to ask, just ask... please 🤨

sudo chmod -R 000 /*


1 minute ago, Sauron said:

I would assume this applies to any distro running 5.1; after all, one of the sources is a Red Hat mailing list, and Ubuntu/RHEL/Arch are the base for a lot of popular distros.

I miss the times when Torvalds himself roasted people for bad kernel commits...

Edited by Chunchunmaru_

1 minute ago, geo3 said:

Are that many systems using 5.1 yet?  Also using LUKS + dm_crypt specifically on a Samsung SSD. This is a very specific combination you need. 

Arch has been on 5.1 for a few weeks IIRC, so I’d imagine that most arch users are.


Just now, geo3 said:

Are that many systems using 5.1 yet?  Also using LUKS + dm_crypt specifically on a Samsung SSD. This is a very specific combination you need. 

Arch already has 5.1.4 in the main repos; unless you were using the LTS kernel like I do, you'd have gotten 5.1 a while ago.

2 minutes ago, geo3 said:

Are that many systems using 5.1 yet?  Also using LUKS + dm_crypt specifically on a Samsung SSD. This is a very specific combination you need. 

That's the only known combination.


39 minutes ago, Chunchunmaru_ said:

It's not completely clear at this point. The only things the reporters of these issues seem to have in common are Samsung SSDs and LUKS + LVM encryption; Samsung SSDs are pretty common, though...

That's a rather important distinction, though. If it only happens on some Samsung drives, and even then not all of them, then the bug is more likely due to a poor implementation in Samsung's controllers than to any bug in the Linux kernel itself.


38 minutes ago, Fez_Boy said:

At least it's not just update and it's gone, you have to run fstrim or hit your scheduled trim. This along with the hard/software combo will hopefully keep loss to a minimum.

Yes, but by default the trim service runs once per week and this has been around for multiple weeks.

 

By the way, DISABLE THE TRIM SERVICE if you're running 5.1 - we don't know for sure that this is the only affected combination and I would err on the side of safety. @Chunchunmaru_ might be a good idea to include this in the OP.


28 minutes ago, Dan Castellaneta said:

-snip-

linux 5.1 users*

 

12 minutes ago, Sauron said:

By the way, DISABLE THE TRIM SERVICE if you're running 5.1 - we don't know for sure that this is the only affected combination and I would err on the side of safety. @Chunchunmaru_ might be a good idea to include this in the OP.

Found out it's a util-linux package dependency. It ships two systemd units, fstrim.service and fstrim.timer (the timer is like a bootleg cron), so one can force-disable either of them by removing the actual file. The disable command doesn't seem to work for me. Rather than breaking packages, the safest thing people can do is just not use that kernel...
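
If disable alone does not seem to stick, one alternative to deleting unit files is `systemctl mask`, which links a unit to /dev/null so that nothing can start it. The sketch below is only an illustration: the `run` and `block_periodic_trim` helpers and the `DRY_RUN` switch are made up here, and by default it only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: stop and mask the periodic TRIM units so nothing can start them.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 and
# run as root to actually apply them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

block_periodic_trim() {
    run systemctl stop fstrim.timer
    run systemctl mask fstrim.timer fstrim.service
}

block_periodic_trim
```

Once you are on a fixed kernel, `systemctl unmask fstrim.timer fstrim.service` reverses the masking. After that the timer can be enabled and started again as usual.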


1 minute ago, Chunchunmaru_ said:

the safest thing people can do is just not to use that kernel...

Yeah sure, but disabling the timer is a more convenient fix than trying to downgrade or switching to the LTS branch just for a few days.


11 minutes ago, Sauron said:

Yeah sure, but disabling the timer is a more convenient fix than trying to downgrade or switching to the LTS branch just for a few days.

Looks like it works, but I'm not sure how APT handles later updates to util-linux. Well, Ubuntu users are not affected anyway...

EDIT: It just gives a warning about it being disabled, nvm

But I don't know if it's any different on Manjaro; if I remember correctly, pacman doesn't enable services automatically.


19 minutes ago, GoodBytes said:

If only Linux hired a QA team instead of exploiting people do their work for free,

 

/s

How would a kernel hire a QA team? (bad joke)

It's impossible to always do a perfect job, even with a QA team, I guess? A QA team isn't really compatible with a project like the Linux kernel anyway, as it's not owned by a single company.

Btw, the actual dangerous commit in this case came from a specific company (Red Hat, now part of IBM) that is also a maintainer, so it probably wasn't checked by others, and the kernel eventually reached a stable release before anyone noticed a problem.

Most issues are resolved while a kernel release is still in the mainline or rc stage.

Anyway, even if a kernel like 5.1 is considered "stable" by the kernel developers, most distributions don't treat it as such, except the one mentioned before (Arch).


49 minutes ago, Chunchunmaru_ said:

I miss the times when Torvalds himself roasted people for bad kernel commits...

That’s like saying “I miss the times when Jobs himself roasted Google for stealing products.”

 

Come Bloody Angel

Break off your chains

And look what I've found in the dirt.

 

Pale battered body

Seems she was struggling

Something is wrong with this world.

 

Fierce Bloody Angel

The blood is on your hands

Why did you come to this world?

 

Everybody turns to dust.

 

Everybody turns to dust.

 

The blood is on your hands.

 

The blood is on your hands!

 

Pyo.


16 minutes ago, Chunchunmaru_ said:

How would a kernel hire a QA team? (bad joke)

It's impossible to always do a perfect job, even with a QA team, I guess? A QA team isn't really compatible with a project like the Linux kernel anyway, as it's not owned by a single company.

Btw, the actual dangerous commit in this case came from a specific company (Red Hat, now part of IBM) that is also a maintainer, so it probably wasn't checked by others, and the kernel eventually reached a stable release before anyone noticed a problem.

Most issues are resolved while a kernel release is still in the mainline or rc stage.

Heh, yes.

 

I was referring to the comment that shows up here on just about every report of a Windows bug: that Microsoft doesn't have a QA team (even though they do), and that people "will switch to Linux!" because that OS is absolutely perfect and has no issues. And that Microsoft, not having a QA team (which, again, they do), relies solely (they don't) on Insiders, people foolish enough to test Windows for them without even being paid.

I just flipped the statement, as the joke.

 

The point I was conveying with the silly joke is that:

Software like an OS is very complex these days, and testing everything with all the different possible configurations is virtually impossible (unless you never want it to be released, as there is always something to fix, including fixes that break other things, which may require a massive architectural change to fix, which can in turn break other things). In addition, as long as humans have anything to do with development or testing, things will always slip through the cracks. Heck, buy a newly built home and I can guarantee you it will have bugs (well, in this case, issues).

 

It is just the sad reality we live in today. And yes, all communities and software companies that run into such issues are working on it, trying different strategies to improve testing methodologies and code review to reduce bugs, as software (whatever the case) becomes too complicated to handle with traditional testing methods.
