ZFS Z2, 2 Drives Faulted. What next?

LIGISTX

Hi everyone,

This isn't the first time I have had drives go bad… but it is the first time I have had 2 fault at the same time. Thankfully it's a Z2 array, and I have a cold spare waiting to go in. But with 2 drives down, I am a little wary of how to approach this.

Would it make sense to do a zpool clear to force it to think everything is OK, and then replace one of the drives with my cold spare? See how the resilver goes, do a scrub, and monitor the situation? I don't want to get myself into a worse situation by jumping to any conclusions prematurely.
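Spelled out, the sequence I'm contemplating would be roughly the following (echoed as a dry run; the spare's device path is a placeholder since it isn't installed yet, and the TrueNAS GUI wraps the same operations):

```shell
# Dry-run sketch: echo the commands instead of running them.
POOL=pergamum                                        # my pool
FAULTED=gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee   # worse of the two, per zpool status
SPARE=/dev/da9                                       # placeholder path for the cold spare

echo "zpool clear $POOL"                    # reset error counters, bring faulted members back online
echo "zpool replace $POOL $FAULTED $SPARE"  # swap the worst member for the spare, triggering a resilver
echo "zpool status -v $POOL"                # watch resilver progress; scrub afterwards
```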

I have had a few SMART errors pop up over the past few SMART tests; I guess I didn't dig deep enough into them because I had thought the drives were still working fine (I have had erroneous errors in the past that didn't actually result in bad drives or any corruption). Looking at the SMART status of da5 (the drive with 62 read errors according to zpool status), I am seeing:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       43
  3 Spin_Up_Time            0x0027   186   161   021    Pre-fail  Always       -       5700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       234
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52765
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

da7 (the drive with 61 read errors):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   159   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       235
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52758
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged
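For anyone following along, both dumps above are the attribute section of a full smartctl report; pulling them looks roughly like this (echoed as a dry run, with the FreeBSD-style da device names from my setup):

```shell
# Print the full SMART report (attributes + error log) for each suspect drive.
# Drop the echo to actually query the disks.
for disk in da5 da7; do
    echo "smartctl -a /dev/$disk"
done
```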

 

Zpool status:

pool: pergamum
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 1.89M in 06:25:16 with 0 errors on Thu Dec 21 06:25:36 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	pergamum                                        DEGRADED     0     0     0
	  raidz2-0                                      DEGRADED     0     0     0
	    gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac  ONLINE       0     0     0
	    gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     0
	    gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
	    gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
	    gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0

errors: No known data errors

 

What should my next steps here be?

 

To stave off any questions about hardware: it can all be found in my signature, but the controller is an H310 going to a SAS expander, and it has been in good working order for 7+ years (minus a few drive failures over the years). It's possible a SAS → SATA cable (or two) is going bad; it's happened to me before. But I am thinking this is not that sort of situation.

Rig: i7 13700k - - Asus Z790-P Wifi - - RTX 4080 - - 4x16GB 6000MHz - - Samsung 990 Pro 2TB NVMe Boot + Main Programs - - Assorted SATA SSD's for Photo Work - - Corsair RM850x - - Sound BlasterX EA-5 - - Corsair XC8 JTC Edition - - Corsair GPU Full Cover GPU Block - - XT45 X-Flow 420 + UT60 280 rads - - EK XRES RGB PWM - - Fractal Define S2 - - Acer Predator X34 -- Logitech G502 - - Logitech G710+ - - Logitech Z5500 - - LTT Deskpad

 

Headphones/amp/dac: Schiit Lyr 3 - - Fostex TR-X00 - - Sennheiser HD 6xx

 

Homelab/Media Server: Proxmox VE host - - 512 NVMe Samsung 980 RAID Z1 for VMs/Proxmox boot - - Xeon E5 2660 V4 - - Supermicro X10SRF-i - - 128 GB ECC 2133 - - 10x4 TB WD Red RAID Z2 - - Corsair 750D - - Corsair RM650i - - Dell H310 6Gbps SAS HBA - - Intel RES2SC240 SAS Expander - - TrueNAS + many other VMs

 

iPhone 14 Pro - 2018 MacBook Air

Oof, that's not a good position to be in. 

First question: do you have a backup of the important data? 

 

It also seems a little sus that both drives failed at the exact same time. Could be the cable (weird), the PSU (also a little weird), or, my guess, maybe the controller? Could it be that these two are on a separate one?

 

Did you buy all 10 drives at the same time? If so, rebuilding the array with a new drive or two might lead to more dying quickly, so I'd be very cautious. 

Gaming HTPC:

R5 5600X - Cryorig C7 - Asus ROG B350-i - EVGA RTX2060KO - 16gb G.Skill Ripjaws V 3333mhz - Corsair SF450 - 500gb 960 EVO - LianLi TU100B


Desktop PC:
R9 3900X - Peerless Assassin 120 SE - Asus Prime X570 Pro - Powercolor 7900XT - 32gb LPX 3200mhz - Corsair SF750 Platinum - 1TB WD SN850X - CoolerMaster NR200 White - Gigabyte M27Q-SA - Corsair K70 Rapidfire - Logitech MX518 Legendary - HyperXCloud Alpha wireless


Boss-NAS [Build Log]:
R5 2400G - Noctua NH-D14 - Asus Prime X370-Pro - 16gb G.Skill Aegis 3000mhz - Seasonic Focus Platinum 550W - Fractal Design R5 - 
250gb 970 Evo (OS) - 2x500gb 860 Evo (Raid0) - 6x4TB WD Red (RaidZ2)

Synology-NAS:
DS920+
2x4TB Ironwolf - 1x18TB Seagate Exos X20

 

Audio Gear:

Hifiman HE-400i - Kennerton Magister - Beyerdynamic DT880 250Ohm - AKG K7XX - Fostex TH-X00 - O2 Amp/DAC Combo - 
Klipsch RP280F - Klipsch RP160M - Klipsch RP440C - Yamaha RX-V479

 

Reviews and Stuff:

GTX 780 DCU2 // 8600GTS // Hifiman HE-400i // Kennerton Magister
Folding all the Proteins! // Boincerino

Useful Links:
Do you need an AMP/DAC? // Recommended Audio Gear // PSU Tier List 

The first drive's CRC errors make me think the cable is bad. It also has other issues in SMART, so time to replace it.

 

I'd just do a zpool replace and wait and see. 

 

I have had configs like this where I could reboot and get the drives working again in the pool, but I wouldn't mess with that here, since you've got the spare and the pool is still working fine for now.

 

 

11 minutes ago, Electronics Wizardy said:

The first drive's CRC errors make me think the cable is bad. It also has other issues in SMART, so time to replace it.

 

This drive has had those CRC errors for about a year; one day a bunch cropped up, I reseated the cable, and I have not seen additional CRC errors since then. The read errors though, those are new as of last night's scrub, which is what alerted me to this.

 

Seeing as I have two bad drives, da5 with more read errors and da7 with fewer, and currently only 1 cold spare, what exactly should I do here? Would it make sense to do a zpool clear (will this online both faulted drives?), at which point I would then go into the GUI and replace da5 with the cold spare and let it resilver? I have a new drive on the way (it will be here tomorrow), but I will need to badblocks it for a few days before deploying it, so I am just trying to get myself into the best position possible, as quickly as possible.
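For context, my usual burn-in is a destructive badblocks pass over the whole disk, roughly the following (echoed as a dry run; the device path is a placeholder, and the real command erases everything on the drive):

```shell
NEWDISK=/dev/da9   # placeholder: whatever the new drive enumerates as
# -w = destructive write test (four patterns), -s = show progress,
# -b 4096 = test in 4K blocks. Takes days on a 4 TB drive.
echo "badblocks -b 4096 -ws $NEWDISK"
```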

 

I am not great with the actual mechanics of ZFS, so a step by step guide would be extremely helpful here.

 

15 minutes ago, FloRolf said:

First question: do you have a backup of the important data? 

 

Yes, all important data is backed up to B2.

 

15 minutes ago, FloRolf said:

Then it sounds a little sus, that both drives fail at the exact same time. Could be the cable (also weird), the PSU (also little weird) or my guess, maybe the controller? Could it be that these two are on a separate one? 

 

Did you buy all 10 drives at the same time? If so, rebuilding the array with a new drive or two might lead to more dying quickly, so I'd be very cautious. 

I agree, it is strange. These are both original drives in this array, which just turned 6 years old. I have had to replace 3 other drives over the years. The controller is an H310; it seems to be fine? I don't suspect the PSU, since everything else seems to be fine, but PSU or controller are both technically possible. All drives are plugged in via SAS → 4x SATA breakout cables from the SAS expander, and the HBA → expander link uses 2 SAS cables.

3 minutes ago, LIGISTX said:

This drive has had those CRC errors for about a year; one day a bunch cropped up, I reseated the cable, and I have not seen additional CRC errors since then. The read errors though, those are new as of last night's scrub, which is what alerted me to this.

 

Yea, ignore that if it hasn't changed recently.

 

3 minutes ago, LIGISTX said:

Seeing as I have two bad drives, da5 with more read errors and da7 with fewer, and currently only 1 cold spare, what exactly should I do here? Would it make sense to do a zpool clear (will this online both faulted drives?), at which point I would then go into the GUI and replace da5 with the cold spare and let it resilver? I have a new drive on the way (it will be here tomorrow), but I will need to badblocks it for a few days before deploying it, so I am just trying to get myself into the best position possible, as quickly as possible.

 

I typically ignore the raw read error rate, as it's often not meant to be interpreted as bigger = worse and can be mixed with other data. And Backblaze found it didn't correlate well with drive failure.

 

No point in running the new drive through badblocks, just let the rebuild start. ZFS will complain if there are errors on the drive.

 

6 minutes ago, LIGISTX said:

I am not great with the actual mechanics of ZFS, so a step by step guide would be extremely helpful here.

 

zpool replace pergamum gptid/af89686d-44ea-11e8-8cad-e0071bffdaee /dev/NEWDRIVEHERE

 

 

 

29 minutes ago, Electronics Wizardy said:

Yea, ignore that if it hasn't changed recently.

 

I typically ignore the raw read error rate, as it's often not meant to be interpreted as bigger = worse and can be mixed with other data. And Backblaze found it didn't correlate well with drive failure.

 

No point in running the new drive through badblocks, just let the rebuild start. ZFS will complain if there are errors on the drive.

 

zpool replace pergamum gptid/af89686d-44ea-11e8-8cad-e0071bffdaee /dev/NEWDRIVEHERE

 

 

 

Hmm, I added my cold spare, but the TrueNAS GUI didn't seem to detect it. So I went for a reboot. It seems to be hanging, and looking at the Proxmox console, I am now seeing this:

 

[screenshot: console error output]

 

I didn't look at the console prior to issuing the reboot command from the TrueNAS GUI, but this makes it seem like maybe I do have a failing controller? It seems awfully mad about something. Looks like da0 was also removed from the array?

 

I don't have any previous history as I can't scroll up... I am at a bit of a loss. TrueNAS is still hung trying to reboot.

8 minutes ago, LIGISTX said:

Hmm, I added my cold spare, but the TrueNAS GUI didn't seem to detect it. So I went for a reboot. It seems to be hanging, and looking at the Proxmox console, I am now seeing this:

 

[screenshot: console error output]

 

I didn't look at the console prior to issuing the reboot command from the TrueNAS GUI, but this makes it seem like maybe I do have a failing controller? It seems awfully mad about something. Looks like da0 was also removed from the array?

 

I don't have any previous history as I can't scroll up... I am at a bit of a loss. TrueNAS is still hung trying to reboot.

This screenshot is from the boot-up, right? Not shutdown?

 

I've seen bad disks do something like this before. Try unplugging the bad drives and see if it boots up. 

13 minutes ago, Electronics Wizardy said:

This screenshot is from the bootup right? Not shutdown?

From shutdown. 
 

It was hung on that screen for a good 5 minutes, so I just shut down Proxmox, reseated the HBA in the PCIe slot, and replugged the SAS cables. And now it's booting up. I have all the VMs that read and write to the array turned off (shut them down when I woke up, and set them to not start on Proxmox boot), so I guess we will see what happens here in a minute…

 

I am thinking possibly an HBA failure?

1 minute ago, LIGISTX said:

From shutdown. 
 

It was hung on that screen for a good 5 minutes, so I just shut down Proxmox, reseated the HBA in the PCIe slot, and replugged the SAS cables. And now it's booting up. I have all the VMs that read and write to the array turned off (shut them down when I woke up, and set them to not start on Proxmox boot), so I guess we will see what happens here in a minute…

 

I am thinking possibly an HBA failure?

Could easily be an HBA failure.

 

How is it doing after a host reboot?

1 minute ago, Electronics Wizardy said:

Could easily be an HBA failure.

 

How is it doing after a host reboot?

Rebooted, and this is what I see in the Proxmox console for the TrueNAS VM:

 

[screenshot: Proxmox console output for the TrueNAS VM]

This is curious: upon reboot, all drives are reporting correctly (even with the above errors being shown on the console), and apparently it was 71% done with a resilver... and still trying to resilver? I shut it down; if there is a bad HBA throwing whatever those errors above are, I don't think trying to resilver is a good idea. Unless I am reading that wrong and it is just spitting out errors for a particular drive? I don't know, I am starting to be a bit confused as to what is happening here.

 

But at the time of shutdown, all drives reported online and the pool reported healthy. I am not entirely sure I trust that, though?

 

Next step: leave TrueNAS off, get myself a new HBA, bring it up, and see what it reports?

24 minutes ago, LIGISTX said:

This is curious: upon reboot, all drives are reporting correctly (even with the above errors being shown on the console), and apparently it was 71% done with a resilver... and still trying to resilver? I shut it down; if there is a bad HBA throwing whatever those errors above are, I don't think trying to resilver is a good idea. Unless I am reading that wrong and it is just spitting out errors for a particular drive? I don't know, I am starting to be a bit confused as to what is happening here.

 

But at the time of shutdown, all drives reported online and the pool reported healthy. I am not entirely sure I trust that, though?

 

Next step: leave TrueNAS off, get myself a new HBA, bring it up, and see what it reports?

Can you log in? Does it show the disks?

 

Could easily be a bad disk, da6, going by that page.

2 minutes ago, Electronics Wizardy said:

Can you log in? Does it show the disks?

 

Could easily be a bad disk, da6, going by that page.

Ya, I can log on. It showed my array as fully healthy, in the middle of a resilver, and the array was online.
 

I turned TrueNAS off not fully understanding what the above error actually meant.

 

Suggestion would be: boot TrueNAS, determine which drive is throwing the error, remove it, and replace it with the spare?

16 minutes ago, LIGISTX said:

Ya, I can log on. It showed my array as fully healthy, in the middle of a resilver, and the array was online.
 

I turned TrueNAS off not fully understanding what the above error actually meant.

 

Suggestion would be: boot TrueNAS, determine which drive is throwing the error, remove it, and replace it with the spare?

I'd boot it back up and start a scrub. 

 

While your drives may have issues, it's not a simple dual disk failure here. Try replugging everything while you're at it.
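Kicking off the scrub is a one-liner; something like this, using your pool name from the status output (echoed as a dry run):

```shell
POOL=pergamum                   # pool name from the zpool status output above
echo "zpool scrub $POOL"        # re-reads every allocated block and verifies checksums
echo "zpool status $POOL"       # scrub progress shows up on the 'scan:' line
```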

 

It could easily be an HBA issue; I've lost a few of those in my day.

 

Well, assuming the TrueNAS GUI is not nonsensical, I am replacing the worse of the 2 drives with my cold spare now. It looks like it was already resilvering the current worst drive (it is again listed as da6), so once that finishes it will start the replacement with the cold spare.

I am not sure why it is trying to resilver the currently bad drive since it was faulted out... I guess when I rebooted earlier, it kicked off a resilver since the drive did come back online? Either way, the pool is currently listed as online with no data errors across the pool itself, but the worst offending drive is showing 10 read errors currently.

 

  pool: pergamum
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 21 11:48:40 2023
    19.5T scanned at 30.0G/s, 16.6T issued at 25.6G/s, 20.5T total
    32K resilvered, 81.23% done, 00:02:33 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    pergamum                                          ONLINE       0     0     0
      raidz2-0                                        ONLINE       0     0     0
        gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac    ONLINE       0     0     0
        gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793    ONLINE       0     0     0
        gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        replacing-8                                   ONLINE       0     0     0
          gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  ONLINE      10     0     0  (resilvering)
          gptid/a1020a2d-a04d-11ee-8a53-0002c95458ac  ONLINE       0     0     0  (awaiting resilver)
        gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0

errors: No known data errors


Thanks for the advice; we shall see how this goes. The new drive to replace the potential second bad drive shows up tomorrow. Assuming all is well and nothing gets worse, I plan to badblocks it as I normally do prior to deploying a drive, and then repeat this process. If the second suspected bad drive starts to yeet itself from the array, I may forgo badblocks, as I think I would rather have a likely-working new drive than a known bad drive in my array. We will see...

4 hours ago, LIGISTX said:

I am thinking possibly an HBA failure?

I've had faulty disks lock up controllers. If everything else is working, the controller is fine, and it's more likely a bad HDD locking up the SAS bus while the controller tries to detect it.

14 minutes ago, leadeater said:

I've had faulty disks lock up controllers. If everything else is working, the controller is fine, and it's more likely a bad HDD locking up the SAS bus while the controller tries to detect it.

That does appear to be what was happening. Things have settled down (I had three drives resilvering about an hour ago... the failing drive, the replacement for it, and the suspected second drive), and at this point the array is fully healthy, with a single resilver process running to replace the bad drive with the cold spare.

 

It looks like the resilver of the other 2 drives was successful. I do see 10 read errors on the known bad drive being replaced, but the array is reporting healthy and not degraded. I assume it addressed this by reallocating the bad sectors?

 

da6 is still throwing lots of errors in the proxmox console window for truenas, which leads me to believe this drive is what caused the errors I was seeing, but thankfully nothing is locked up. 

 

What's also curious... while the multiple resilvers were happening, the web UI didn't report any of them. If I clicked pool status, it did show the same thing that zpool status over SSH showed, but the top right, where the web UI typically reports job progress, had no jobs to report... now that only 1 resilver is happening, it is once again working correctly and shows that there is a resilver job running. Strange.
