ZFS Z2, 2 Drives Faulted. What next?

LIGISTX

Hi everyone,

This isn't the first time I have had drives go bad… but it is the first time I have had 2 fault at the same time. Thankfully it's a Z2 array, and I have a cold spare waiting to go in. But with 2 drives down, I am a little wary of how to approach this.

Would it make sense to do a zpool clear to force it to think everything is OK, and then replace one of the drives with my cold spare? See how the resilver goes, do a scrub, and monitor the situation? I don't want to get myself into a worse situation by jumping to any conclusions prematurely.
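Spelled out, the sequence I'm contemplating would be roughly the following (echoed as a dry run; the spare's device path is a placeholder since it isn't installed yet, and the TrueNAS GUI wraps the same operations):

```shell
# Dry-run sketch: echo the commands instead of running them.
POOL=pergamum                                        # my pool
FAULTED=gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee   # worse of the two, per zpool status
SPARE=/dev/da9                                       # placeholder path for the cold spare

echo "zpool clear $POOL"                    # reset error counters, bring faulted members back online
echo "zpool replace $POOL $FAULTED $SPARE"  # swap the worst member for the spare, triggering a resilver
echo "zpool status -v $POOL"                # watch resilver progress; scrub afterwards
```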

I have had a few SMART errors pop up over the past few SMART tests; I guess I didn't dig deep enough into them because I had thought the drives were still working fine (I have had erroneous errors in the past that didn't actually result in bad drives or any corruption). Looking at the SMART status of da5 (the drive with 62 read errors according to zpool status), I am seeing:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       43
  3 Spin_Up_Time            0x0027   186   161   021    Pre-fail  Always       -       5700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       234
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52765
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

da7 (the drive with 61 read errors):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   159   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       235
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52758
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged
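For anyone following along, both dumps above are the attribute section of a full smartctl report; pulling them looks roughly like this (echoed as a dry run, with the FreeBSD-style da device names from my setup):

```shell
# Print the full SMART report (attributes + error log) for each suspect drive.
# Drop the echo to actually query the disks.
for disk in da5 da7; do
    echo "smartctl -a /dev/$disk"
done
```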

 

Zpool status:

pool: pergamum
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: scrub repaired 1.89M in 06:25:16 with 0 errors on Thu Dec 21 06:25:36 2023
config:

	NAME                                            STATE     READ WRITE CKSUM
	pergamum                                        DEGRADED     0     0     0
	  raidz2-0                                      DEGRADED     0     0     0
	    gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac  ONLINE       0     0     0
	    gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     0
	    gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
	    gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
	    gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
	    gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0

errors: No known data errors

 

What should my next steps here be?

 

To stave off any questions about hardware: it can all be found in my signature, but the controller is an H310 going to a SAS expander, and it has been in good working order for 7+ years (minus a few drive failures over the years). It's possible a SAS → SATA cable (or two) is going bad; it's happened to me before. But I am thinking this is not that sort of situation.

Rig: i7 13700k - - Asus Z790-P Wifi - - RTX 4080 - - 4x16GB 6000MHz - - Samsung 990 Pro 2TB NVMe Boot + Main Programs - - Assorted SATA SSD's for Photo Work - - Corsair RM850x - - Sound BlasterX EA-5 - - Corsair XC8 JTC Edition - - Corsair GPU Full Cover GPU Block - - XT45 X-Flow 420 + UT60 280 rads - - EK XRES RGB PWM - - Fractal Define S2 - - Acer Predator X34 -- Logitech G502 - - Logitech G710+ - - Logitech Z5500 - - LTT Deskpad

 

Headphones/amp/dac: Schiit Lyr 3 - - Fostex TR-X00 - - Sennheiser HD 6xx

 

Homelab/Media Server: Proxmox VE host - - 512 NVMe Samsung 980 RAID Z1 for VMs/Proxmox boot - - Xeon E5 2660 V4 - - Supermicro X10SRF-i - - 128 GB ECC 2133 - - 10x4 TB WD Red RAID Z2 - - Corsair 750D - - Corsair RM650i - - Dell H310 6Gbps SAS HBA - - Intel RES2SC240 SAS Expander - - TrueNAS + many other VMs

 

iPhone 14 Pro - 2018 MacBook Air

Oof, that's not a good position to be in. 

First question: do you have a backup of the important data? 

 

It also seems a little sus that both drives failed at the exact same time. Could be the cable (weird), the PSU (also a little weird), or, my guess, maybe the controller? Could it be that these two are on a separate one?

 

Did you buy all 10 drives at the same time? If so, rebuilding the array with a new drive or two might lead to more dying quickly, so I'd be very cautious. 

Gaming HTPC:

R5 5600X - Cryorig C7 - Asus ROG B350-i - EVGA RTX2060KO - 16gb G.Skill Ripjaws V 3333mhz - Corsair SF450 - 500gb 960 EVO - LianLi TU100B


Desktop PC:
R9 3900X - Peerless Assassin 120 SE - Asus Prime X570 Pro - Powercolor 7900XT - 32gb LPX 3200mhz - Corsair SF750 Platinum - 1TB WD SN850X - CoolerMaster NR200 White - Gigabyte M27Q-SA - Corsair K70 Rapidfire - Logitech MX518 Legendary - HyperXCloud Alpha wireless


Boss-NAS [Build Log]:
R5 2400G - Noctua NH-D14 - Asus Prime X370-Pro - 16gb G.Skill Aegis 3000mhz - Seasonic Focus Platinum 550W - Fractal Design R5 - 
250gb 970 Evo (OS) - 2x500gb 860 Evo (Raid0) - 6x4TB WD Red (RaidZ2)

Synology-NAS:
DS920+
2x4TB Ironwolf - 1x18TB Seagate Exos X20

 

Audio Gear:

Hifiman HE-400i - Kennerton Magister - Beyerdynamic DT880 250Ohm - AKG K7XX - Fostex TH-X00 - O2 Amp/DAC Combo - 
Klipsch RP280F - Klipsch RP160M - Klipsch RP440C - Yamaha RX-V479

 

Reviews and Stuff:

GTX 780 DCU2 // 8600GTS // Hifiman HE-400i // Kennerton Magister
Folding all the Proteins! // Boincerino

Useful Links:
Do you need an AMP/DAC? // Recommended Audio Gear // PSU Tier List 

The first drive's CRC errors make me think the cable is bad. It also has other issues in SMART, so time to replace it.

 

I'd just do a zpool replace and wait and see. 

 

I have had configs like this where I could reboot and get the drives working again in the pool, but I wouldn't mess with that here, since you've got the spare and the pool is still working fine for now.

 

 

11 minutes ago, Electronics Wizardy said:

The first drive's CRC errors make me think the cable is bad. It also has other issues in SMART, so time to replace it.

 

This drive has had those CRC errors for about a year; one day a bunch cropped up, I reseated the cable, and I have not seen additional CRC errors since then. The read errors though, those are new as of last night's scrub, which is what alerted me to this.

 

Seeing as I have two bad drives, da5 with more read errors and da7 with fewer, and currently only 1 cold spare, what exactly should I do here? Would it make sense to do a zpool clear (will this online both faulted drives?), at which point I would then go into the GUI and replace da5 with the cold spare and let it resilver? I have a new drive on the way (it will be here tomorrow), but I will need to badblocks it for a few days before deploying it, so I am just trying to get myself into the best position possible, as quickly as possible.
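For context, my usual burn-in is a destructive badblocks pass over the whole disk, roughly the following (echoed as a dry run; the device path is a placeholder, and the real command erases everything on the drive):

```shell
NEWDISK=/dev/da9   # placeholder: whatever the new drive enumerates as
# -w = destructive write test (four patterns), -s = show progress,
# -b 4096 = test in 4K blocks. Takes days on a 4 TB drive.
echo "badblocks -b 4096 -ws $NEWDISK"
```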

 

I am not great with the actual mechanics of ZFS, so a step by step guide would be extremely helpful here.

 

15 minutes ago, FloRolf said:

First question: do you have a backup of the important data? 

 

Yes, all important data is backed up to B2.

 

15 minutes ago, FloRolf said:

Then it sounds a little sus, that both drives fail at the exact same time. Could be the cable (also weird), the PSU (also little weird) or my guess, maybe the controller? Could it be that these two are on a separate one? 

 

Did you buy all 10 drives at the same time? If so, rebuilding the array with a new drive or two might lead to more dying quickly, so I'd be very cautious. 

I agree, it is strange. These are both original drives in this array, which just turned 6 years old. I have had to replace 3 other drives over the years. The controller is an H310; it seems to be fine? I don't suspect the PSU, since everything else seems to be fine, but PSU or controller are both technically possible. All drives are plugged in via SAS → 4x SATA breakout cables from the SAS expander, and the HBA → expander link uses 2 SAS cables.

3 minutes ago, LIGISTX said:

This drive has had those CRC errors for about a year; one day a bunch cropped up, I reseated the cable, and I have not seen additional CRC errors since then. The read errors though, those are new as of last night's scrub, which is what alerted me to this.

 

Yea, ignore that if it hasn't changed recently.

 

3 minutes ago, LIGISTX said:

Seeing as I have two bad drives, da5 with more read errors and da7 with fewer, and currently only 1 cold spare, what exactly should I do here? Would it make sense to do a zpool clear (will this online both faulted drives?), at which point I would then go into the GUI and replace da5 with the cold spare and let it resilver? I have a new drive on the way (it will be here tomorrow), but I will need to badblocks it for a few days before deploying it, so I am just trying to get myself into the best position possible, as quickly as possible.

 

I typically ignore the raw read error rate, as it's often not meant to be interpreted as bigger = worse and can be mixed with other data. And Backblaze found it didn't correlate well with drive failure.

 

No point in running the new drive through badblocks, just let the rebuild start. ZFS will complain if there are errors on the drive.

 

6 minutes ago, LIGISTX said:

I am not great with the actual mechanics of ZFS, so a step by step guide would be extremely helpful here.

 

zpool replace pergamum gptid/af89686d-44ea-11e8-8cad-e0071bffdaee /dev/NEWDRIVEHERE

 

 

 

29 minutes ago, Electronics Wizardy said:

Yea, ignore that if it hasn't changed recently.

 

I typically ignore the raw read error rate, as it's often not meant to be interpreted as bigger = worse and can be mixed with other data. And Backblaze found it didn't correlate well with drive failure.

 

No point in running the new drive through badblocks, just let the rebuild start. ZFS will complain if there are errors on the drive.

 

zpool replace pergamum gptid/af89686d-44ea-11e8-8cad-e0071bffdaee /dev/NEWDRIVEHERE

 

 

 

Hmm, I added my cold spare, but the TrueNAS GUI didn't seem to detect it. So I went for a reboot. It seems to be hanging, and looking at the Proxmox console, I am now seeing this:

 

[screenshot: console error output]

 

I didn't look at the console prior to issuing the reboot command from the TrueNAS GUI, but this makes it seem like maybe I do have a failing controller? It seems awfully mad about something. Looks like da0 was also removed from the array?

 

I don't have any previous history as I can't scroll up... I am at a bit of a loss. TrueNAS is still hung trying to reboot.

8 minutes ago, LIGISTX said:

Hmm, I added my cold spare, but the TrueNAS GUI didn't seem to detect it. So I went for a reboot. It seems to be hanging, and looking at the Proxmox console, I am now seeing this:

 

[screenshot: console error output]

 

I didn't look at the console prior to issuing the reboot command from the TrueNAS GUI, but this makes it seem like maybe I do have a failing controller? It seems awfully mad about something. Looks like da0 was also removed from the array?

 

I don't have any previous history as I can't scroll up... I am at a bit of a loss. TrueNAS is still hung trying to reboot.

This screenshot is from the boot-up, right? Not shutdown?

 

I've seen bad disks do something like this before. Try unplugging the bad drives and see if it boots up. 

13 minutes ago, Electronics Wizardy said:

This screenshot is from the bootup right? Not shutdown?

From shutdown. 
 

It was hung on that screen for a good 5 minutes, so I just shut down Proxmox, reseated the HBA in the PCIe slot, and replugged the SAS cables. And now it's booting up. I have all the VMs that read and write to the array turned off (shut them down when I woke up, and set them to not start on Proxmox boot), so I guess we will see what happens here in a minute…

 

I am thinking possibly an HBA failure?

1 minute ago, LIGISTX said:

From shutdown. 
 

It was hung on that screen for a good 5 minutes, so I just shut down Proxmox, reseated the HBA in the PCIe slot, and replugged the SAS cables. And now it's booting up. I have all the VMs that read and write to the array turned off (shut them down when I woke up, and set them to not start on Proxmox boot), so I guess we will see what happens here in a minute…

 

I am thinking possibly an HBA failure?

Could easily be an HBA failure.

 

How is it doing after a host reboot?

1 minute ago, Electronics Wizardy said:

Could easily be an HBA failure.

 

How is it doing after a host reboot?

Rebooted, and this is what I see in the Proxmox console for the TrueNAS VM:

 

[screenshot: Proxmox console output for the TrueNAS VM]

This is curious: upon reboot, all drives are reporting correctly (even with the above errors being shown on the console), and apparently it was 71% done with a resilver... and still trying to resilver? I shut it down; if there is a bad HBA throwing whatever those errors above are, I don't think trying to resilver is a good idea. Unless I am reading that wrong and it is just spitting out errors for a particular drive? I don't know, I am starting to be a bit confused as to what is happening here.

 

But at the time of shutdown, all drives reported online and the pool reported healthy. I am not entirely sure I trust that, though?

 

Next step: leave TrueNAS off, get myself a new HBA, bring it up, and see what it reports?

24 minutes ago, LIGISTX said:

This is curious: upon reboot, all drives are reporting correctly (even with the above errors being shown on the console), and apparently it was 71% done with a resilver... and still trying to resilver? I shut it down; if there is a bad HBA throwing whatever those errors above are, I don't think trying to resilver is a good idea. Unless I am reading that wrong and it is just spitting out errors for a particular drive? I don't know, I am starting to be a bit confused as to what is happening here.

 

But at the time of shutdown, all drives reported online and the pool reported healthy. I am not entirely sure I trust that, though?

 

Next step: leave TrueNAS off, get myself a new HBA, bring it up, and see what it reports?

Can you log in? Does it show the disks?

 

Could easily be a bad disk, da6, going by that page.

2 minutes ago, Electronics Wizardy said:

Can you log in? Does it show the disks?

 

Could easily be a bad disk, da6, going by that page.

Ya, I can log on. It showed my array as fully healthy, in the middle of a resilver, and the array was online.
 

I turned TrueNAS off not fully understanding what the above error actually meant.

 

Suggestion would be: boot TrueNAS, determine which drive is throwing the error, remove it, and replace it with the spare?

16 minutes ago, LIGISTX said:

Ya, I can log on. It showed my array as fully healthy, in the middle of a resilver, and the array was online.
 

I turned TrueNAS off not fully understanding what the above error actually meant.

 

Suggestion would be: boot TrueNAS, determine which drive is throwing the error, remove it, and replace it with the spare?

I'd boot it back up and start a scrub. 

 

While your drives may have issues, it's not a simple dual disk failure here. Try replugging everything while you're at it.
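Kicking off the scrub is a one-liner; something like this, using your pool name from the status output (echoed as a dry run):

```shell
POOL=pergamum                   # pool name from the zpool status output above
echo "zpool scrub $POOL"        # re-reads every allocated block and verifies checksums
echo "zpool status $POOL"       # scrub progress shows up on the 'scan:' line
```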

 

It could easily be an HBA issue; I've lost a few of those in my day.

 

Well, assuming the TrueNAS GUI is not nonsensical, I am replacing the worse of the 2 drives with my cold spare now. It looks like it was already resilvering the current worst drive (it is again listed as da6), so once that finishes it will start the replacement with the cold spare.

I am not sure why it is trying to resilver the currently bad drive since it was faulted out... I guess when I rebooted earlier, it kicked off a resilver since the drive did come back online? Either way, the pool is currently listed as online with no data errors across the pool itself, but the worst offending drive is showing 10 read errors currently.

 

  pool: pergamum
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 21 11:48:40 2023
    19.5T scanned at 30.0G/s, 16.6T issued at 25.6G/s, 20.5T total
    32K resilvered, 81.23% done, 00:02:33 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    pergamum                                          ONLINE       0     0     0
      raidz2-0                                        ONLINE       0     0     0
        gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac    ONLINE       0     0     0
        gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793    ONLINE       0     0     0
        gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        replacing-8                                   ONLINE       0     0     0
          gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  ONLINE      10     0     0  (resilvering)
          gptid/a1020a2d-a04d-11ee-8a53-0002c95458ac  ONLINE       0     0     0  (awaiting resilver)
        gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0

errors: No known data errors


Thanks for the advice; we shall see how this goes. The new drive to replace the potential second bad drive shows up tomorrow. Assuming all is well and nothing gets worse, I plan to badblocks it as I normally do prior to deploying a drive, and then repeat this process. If the second suspected bad drive starts to yeet itself from the array, I may forgo badblocks, as I think I would rather have a likely-working new drive than a known bad drive in my array. We will see...

4 hours ago, LIGISTX said:

I am thinking possibly an HBA failure?

I've had faulty disks lock up controllers. If everything else is working, the controller is fine, and it's more likely a bad HDD locking up the SAS bus while the controller tries to detect it.

14 minutes ago, leadeater said:

I've had faulty disks lock up controllers. If everything else is working, the controller is fine, and it's more likely a bad HDD locking up the SAS bus while the controller tries to detect it.

That does appear to be what was happening. Things have settled down (I had three drives resilvering about an hour ago... the failing drive, the replacement for it, and the suspected second drive), and at this point the array is fully healthy, with a single resilver process running to replace the bad drive with the cold spare.

 

It looks like the resilver of the other 2 drives was successful. I do see 10 read errors on the known bad drive being replaced, but the array is reporting healthy and not degraded. I assume it addressed this by reallocating the bad sectors?

 

da6 is still throwing lots of errors in the proxmox console window for truenas, which leads me to believe this drive is what caused the errors I was seeing, but thankfully nothing is locked up. 

 

What's also curious... while the multiple resilvers were happening, the web UI didn't report any of them. If I clicked pool status, it did show the same thing that zpool status over SSH showed, but the top right, where the web UI typically reports job progress, had no jobs to report... now that only 1 resilver is happening, it is once again working correctly and shows that there is a resilver job running. Strange.
