
Freenas degraded array, how to troubleshoot potential HDD problem?

I shut down the NAS earlier today because the electrician needed to cut the power for a while.  Upon booting it again I heard the usual triple beep ... and some extra beeps a second or so later.

I knew right away that there was an issue.  Sure enough, as soon as I entered the decryption key for the HDD pool I got a warning that the array was degraded.

 

Checking the volume status shows the following :

[Screenshot: Freenasmisery001.jpg]

 

 

Plenty of redundancy still and I have a recent backup (I'm going to do a fresh backup anyway before I screw things up), so I'm not panicking ... yet.

I opened up the NAS to check the cables, but those are still connected properly.  Even blew on the contacts as if they were Nintendo cartridges, but that didn't help either. 

 

I can troubleshoot a disk in Windows just fine, but with FreeNAS I have absolutely no idea what to do next.  I never had to repair or resilver an array, so I really have  no idea what I should and shouldn't do.  Any advice? 

How can I verify that it is indeed the drive and not the SAS->4xSATA cables?  Format the drive in Windows and run HDTune etc to verify its health? 

 

All the drives are 4TB WD Reds (32TB raw space, 19.9TB usable space after redundancy and ZFS overhead), all were purchased in June 2015 at the same place.  (yeah, I know, that wasn't the brightest idea) 
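The capacity figures above can be sanity-checked with a bit of arithmetic. This is only a rough sketch: the function name is made up for illustration, it assumes the drives are labelled in decimal terabytes while the UI presumably reports binary tebibytes, and it ignores ZFS metadata and padding overhead, which is why it lands somewhat above the 19.9 figure quoted.

```python
def raidz_capacity(drives, drive_tb, parity):
    """Rough RAIDZ capacity before ZFS metadata/padding overhead.

    drives   -- number of disks in the vdev
    drive_tb -- size of one disk in decimal terabytes (10**12 bytes)
    parity   -- 1 for RAIDZ, 2 for RAIDZ2, 3 for RAIDZ3
    """
    if drives <= parity:
        raise ValueError("need more drives than parity disks")
    raw_tb = drives * drive_tb
    data_tb = (drives - parity) * drive_tb
    # Convert decimal terabytes to binary tebibytes (2**40 bytes).
    data_tib = data_tb * 1e12 / 2**40
    return {"raw_tb": raw_tb, "data_tb": data_tb, "data_tib": round(data_tib, 2)}


if __name__ == "__main__":
    # 8x 4TB in RAIDZ2, as in the HDD pool described above.
    print(raidz_capacity(8, 4, 2))
```

The gap between the computed data capacity and what the UI shows is the part this sketch does not model.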

 

Full specs:

 

MoBo : Intel Server S1200V3RPS
CPU : Intel Core i3-4170 (ECC compatible)
Ram : 32GB (4x 8GB) Kingston ValueRAM ECC DDR3-1333
 
HBA : LSI 9211-8i flashed to IT mode
SSD pool 1 : 2x Crucial MX200 250GB (mirrored)
SSD pool 2 : 4x Samsung 950PRO 256GB (RAIDZ)
HDD pool : 8x WD Red 4TB (RAIDZ2)

 

Cables : 2x SAS->4xSATA from the HBA to the HDDs.  6x sharkoon round SATA cables from the motherboard to the SSDs

OS : FreeNAS 11 Stable (just found out that 11.2 is out) running from an 8GB Kingston DataTraveler stick.

 

A couple of months ago I had the same problem with the same drive when I tried to reboot the NAS after an update.  Back then another reboot fixed it. 

This time though I rebooted twice already and it is still listed as unavailable.  So I'd like to look into it before it becomes a real problem. 

If the drive itself is indeed failing, I can expect the others to start giving me trouble soon. 


You can check if it's still visible on the system with geom list.

 

I'm guessing it's just a bad drive. I'd take it out of the server, put a new one in, and rebuild the array.

2 hours ago, Captain Chaos said:

If the drive itself is indeed failing, I can expect the others to start giving me trouble soon. 

Not really, there is a lot of variation in failure times. They probably won't all fail at once.

 

 


3 hours ago, Electronics Wizardy said:

You can check if it's still visible on the system with geom list.

 

I'm guessing it's just a bad drive. I'd take it out of the server, put a new one in, and rebuild the array.

Not really, there is a lot of variation in failure times. They probably won't all fail at once.

 

 

Agreed. @Captain Chaos, I'd also suggest that, after replacing the drive and rebuilding the array, you set up some cron jobs to run SMART short tests on the drives regularly (say, once per week) and SMART long tests less often (say, once per month).

 

This way you can more reasonably pick out a dying drive before it bites the dust. If the drive is still visible to FreeNAS, you can run the smart short and smart long tests, and then pull the results and see if anything looks off.
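For anyone wanting to script this, here is a minimal sketch of what such a job could run. The helper names and the da0..da7 device list are assumptions for illustration (the thread only confirms names like da0 and da4); `smartctl -t short`, `-t long` and `-l selftest` are the standard smartmontools options. FreeNAS also lets you schedule S.M.A.R.T. tests from its UI, which is probably the easier route.

```python
import subprocess


def smart_test_command(device, kind):
    """Build the smartctl command that starts a self-test on one drive."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, "/dev/" + device]


def selftest_log_command(device):
    """Build the smartctl command that prints the drive's self-test log."""
    return ["smartctl", "-l", "selftest", "/dev/" + device]


def start_tests(devices, kind):
    """Kick off the chosen self-test on every listed drive."""
    for device in devices:
        # check=False: one unreachable drive should not abort the loop.
        subprocess.run(smart_test_command(device, kind), check=False)


if __name__ == "__main__":
    # Hypothetical device names; adjust to what your system reports.
    hdds = ["da%d" % i for i in range(8)]
    start_tests(hdds, "short")
```

A weekly cron entry could call this with "short" and a monthly one with "long"; the results are then read back with the command from `selftest_log_command` once the tests have finished.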

 

But I'd still just replace the drive and rebuild, as per Wizardy.

For Sale: Meraki Bundle

 

iPhone Xr 128 GB Product Red - HP Spectre x360 13" (i5 - 8 GB RAM - 256 GB SSD) - HP ZBook 15v G5 15" (i7-8850H - 16 GB RAM - 512 GB SSD - NVIDIA Quadro P600)

 


Ok, I apparently am too stupid to get SSH running.  I keep getting "permission denied (publickey, password)" after attempting to enter the password. 

As I said I'm a complete noob at this.  I set the NAS up 4 years ago and it just worked ever since, so I never bothered to look into all kinds of functions that I'd normally never use.  I guess that's biting me in the rear right now.

 

Freenas' built-in shell is generally considered pretty useless, but when entering geom status and geom list there the very last drive in the list is da4, the one that's giving me trouble right now. So the drive is recognized.  It's not in its usual position though.  It's like it has been kicked out of its pool. 

 

[Screenshot: Freenasmisery002.jpg]

 

[Screenshot: Freenasmisery003.jpg]

 

Looked up the command to read the SMART status, but not sure if I got the right one.  When I enter smartctl -a da4 I get the following result :

 

[Screenshot: Freenasmisery004.jpg]
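A side note on the command itself: as far as I know smartctl wants a device path such as /dev/da4 rather than the bare name da4, which may be why the output looked odd. Below is a small sketch, with made-up helper names, that calls `smartctl -A` (attributes only) and pulls the ID and raw value out of each row. It assumes the usual ten-column ATA attribute table.

```python
import subprocess


def device_path(name):
    """Turn a bare name like 'da4' into '/dev/da4'; leave full paths alone."""
    return name if name.startswith("/dev/") else "/dev/" + name


def parse_attributes(text):
    """Map SMART attribute ID -> raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[int(fields[0])] = int(fields[9])
            except ValueError:
                continue  # skip rows whose raw value is not a plain integer
    return attrs


def read_attributes(name):
    """Run smartctl against the drive and return its parsed attributes."""
    result = subprocess.run(
        ["smartctl", "-A", device_path(name)],
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return parse_attributes(result.stdout)


if __name__ == "__main__":
    print(read_attributes("da4"))
```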

 

 

 

 

 

I've been looking at my options and crunching the numbers.  I really don't feel like buying another HDD, not with SSD prices falling this rapidly.

 

I'll probably do a full backup, remove the dead drive and build a new RAIDZ2 pool with the 7 remaining drives.  I still have more than enough free space to go that route.  That way I can have 2 (or even 3) more drives fail before things become problematic.  In the meantime SSD prices will most likely continue to drop.  Then when the time comes and I really need to replace drives, I'll just throw eight 4TB SSDs at the problem and call it a day. 

 

Question though : can I swap those SAS->SATA cables around?  For example plug da4's cable into da5 and vice versa?  Or would that mess things up with regards to striping?  I really would like to make sure that the cable itself isn't causing the problem.   


You should be able to swap the SAS cables around, yes, assuming they are connected to an HBA or a RAID Card in IT mode, not a proper RAID card.

 

However, I would label the cables and the drives in their current position, in case something doesn't work.


1 hour ago, dalekphalm said:

You should be able to swap the SAS cables around, yes, assuming they are connected to an HBA or a RAID Card in IT mode, not a proper RAID card.

 

However, I would label the cables and the drives in their current position, in case something doesn't work.

 

The card is an LSI 9211-8i in IT mode (I bought it pre-flashed after @MrBucket101 suggested going that route), so it's not doing any RAID stuff itself.  I went that route specifically because Freenas likes to do the RAID in software.

 

The drives are labeled and I have photos of the labels together with the serial numbers.  So I did at least that part of my homework beforehand. 

I just swapped its data cable with the drive above and it looks like it's still the same.  So the cable is okay and it's the drive itself that's effed up. 

 

Oh well, guess I'll be running 7 HDDs from now on then. 

 

 


1 minute ago, Captain Chaos said:

 

The card is an LSI 9211-8i in IT mode (I bought it pre-flashed after @MrBucket101 suggested going that route), so it's not doing any RAID stuff itself.  I went that route specifically because Freenas likes to do the RAID in software.

 

The drives are labeled and I have photos of the labels together with the serial numbers.  So I did at least that part of my homework beforehand. 

I just swapped its data cable with the drive above and it's still the same.  So the cable is okay and it's the drive itself that's effed up. 

 

 

Yep - so at this point, you just need to decide what you're going to do.

 

If you want to follow through with your plan, do a full backup of anything you don't want to lose, then kill the array and build a new one in whatever config you want (with 7 drives I'd recommend RAIDZ2 at minimum; RAIDZ3 is likely overkill for 7 drives).

 

Otherwise just buy a replacement, pop it in, and start a rebuild.

 

In either case, I would make your choice ASAP, because you don't want to leave it in a degraded state for too long.


@CaptainChaos you should set up daily SMART short tests, weekly SMART long tests, and monthly pool scrubs. This will help keep your array in good condition and give you advance warning of drive failures, so you know early if there is a problem.

 

As for how to troubleshoot a drive problem: you can try adding the same drive back into the array. It will probably work, but TBH, you should just buy a new drive. Once your trust in a drive has been lost, you shouldn't really keep using it.

In the UI you will see the drive serial numbers. Match those to the drives in your bays to find the drive that is bad, as its serial won't be listed in the UI.

Afterwards, in the comment section for the drive, I would record the respective drive bay. Makes determining failures much easier.

If you would like to run through some tests and figure out how to replace a failed disk, you can set up a virtual machine, install FreeNAS and set up an array using 10MB vdisks. Set up the array, then delete or disconnect one vdisk. Replace it with another 10MB vdisk and try to replace/rebuild.

This will get you some real-world experience, without consequence. (Or you can read any of the numerous guides out there.)
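A variation on the same idea that skips the VM is a throwaway pool built on plain files. The sketch below only prints the commands for such a dry run and does not execute anything. The directory, pool name and helper are made up for illustration, and it uses 128M files rather than 10MB because, as far as I know, ZFS refuses devices smaller than 64 MiB.

```python
import os


def practice_commands(workdir="/tmp/zfs-practice", pool="practice",
                      disks=4, size="128M"):
    """Return the shell commands for a file-backed RAIDZ2 replace drill."""
    files = [os.path.join(workdir, "disk%d.img" % i) for i in range(disks)]
    spare = os.path.join(workdir, "replacement.img")

    cmds = [["mkdir", "-p", workdir]]
    # Sparse files stand in for the disks, plus one for the replacement.
    cmds += [["truncate", "-s", size, path] for path in files + [spare]]
    cmds.append(["zpool", "create", pool, "raidz2"] + files)
    # Simulate a failure by taking one "disk" offline, then swap it out.
    cmds.append(["zpool", "offline", pool, files[0]])
    cmds.append(["zpool", "replace", pool, files[0], spare])
    cmds.append(["zpool", "status", pool])
    # Clean up the practice pool when done.
    cmds.append(["zpool", "destroy", pool])
    return cmds


if __name__ == "__main__":
    for cmd in practice_commands():
        print(" ".join(cmd))
```

Run the printed commands by hand as root on a test box, never against the pool that holds real data, and watch `zpool status` go from DEGRADED back to ONLINE after the replace.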


Yeah, I'm going to just eliminate that drive, create a 7-drive array and copy everything back over from my backup disks.  Should be a good workout for the remaining drives.

 

 

 

@MrBucket101 It looks like Freenas was already doing scrubs on its own without me setting that up.  According to the picture in the first post it did a scrub on May 5th.

 

I'm not putting that drive back in the array.  I know it's the same one that had an issue earlier, so I guess it has been slowly dying for months now.  As Steve Gibson always says "if a hard drive wants to die, you can use all the tools you want to keep it alive but eventually it will win"

 

I had already figured out the labeling thing.  Each drive has its serial number on the front (WD puts a decal there with the number on), as well as a label indicating how Freenas sees it (da0, da1, da2 etc).  I have photos of the HDD cages on my phone and a list in a spreadsheet.  So I know exactly which drive it is.

I didn't think about using the comment section in the UI for that though.  Thanks for that tip!

 


Yeah, you should be able to remove the drive and plug it into another PC and run WD Lifeguard Diagnostics on it (extended run) to see if it is toast or not.

 

Though this is making me nervous because I also have 4TB Red drives bought in May 2015.


11 minutes ago, scottyseng said:

Yeah, you should be able to remove the drive and plug it into another PC and run WD Lifeguard Diagnostics on it (extended run) to see if it is toast or not.

 

Though this is making me nervous because I also have 4TB Red drives bought in May 2015.

Don't worry too much about the drive age. 2015 is still pretty young for drives. Any HDD of any brand of any model can die at any time. He got unlucky.


Glad you've got backups. Just to supplement what everyone is saying - set it up for email alerts. You'll get notifications of scrubs / smart errors / security checks. You also get notified if it loses connectivity with AD. I used gmail and simply generated a one time password - had it going within 10 minutes. Pretty sure this saved my bacon a couple times.

 

 

 

I bought 5 Seagate 4TBs when I built my NAS, as of today 4 of 5 need(ed) replacements. 2 outright died, these other 2 are on death's door. Bought them in 2016 - this is why in my first IT job my then manager told me to always try to buy from different vendors and not in batches. Harder to do as a consumer though.

 

Not afraid of losing data, just really depressed given how much they cost me when I bought them.

 

 


4 hours ago, Mikensan said:

Glad you've got backups. Just to supplement what everyone is saying - set it up for email alerts. You'll get notifications of scrubs / smart errors / security checks. You also get notified if it loses connectivity with AD. I used gmail and simply generated a one time password - had it going within 10 minutes. Pretty sure this saved my bacon a couple times.

 

 

 

I bought 5 Seagate 4TBs when I built my NAS, as of today 4 of 5 need(ed) replacements. 2 outright died, these other 2 are on death's door. Bought them in 2016 - this is why in my first IT job my then manager told me to always try to buy from different vendors and not in batches. Harder to do as a consumer though.

 

Not afraid of losing data, just really depressed given how much they cost me when I bought them.

 

 

What Seagate drives did you buy? 4TB IronWolf drives come with 3 year warranties, so I'd imagine you could have replaced all the dead ones. The regular Seagate drives have 2 year warranties, so the early failures would have been an annoyance, but replaceable.


22 minutes ago, dalekphalm said:

What Seagate drives did you buy? 4TB Ironwolf's come with 3 year warranties, so I'd imagine you could have replaced all the dead ones. The regular Seagate drives have 2 year warranties, so the DOA's would have been an annoyance, but replaceable.

Definitely not IronWolfs, these were their janky consumer models. The notorious x000Tdm series - 1 year warranty per the box. My first total failure thankfully was within a year and the warranty process was great (shipped a new one - used box to return old one).


3 hours ago, Mikensan said:

Definitely not iron wolfs, these were their janky consumer models. the notorious x000Tdm series - 1 year warranty per the box. My first total failure thankfully was within a year and the warranty process was great (shipped a new one - used box to return old one).

Ahh too bad. At least you got some of them replaced though. I try to avoid drives with 1 year warranty, unless there's a substantial cost difference (Eg: Shucking a huge USB drive for the drive inside if it's on clearance sale).


13 minutes ago, dalekphalm said:

Ahh too bad. At least you got some of them replaced though. I try to avoid drives with 1 year warranty, unless there's a substantial cost difference (Eg: Shucking a huge USB drive for the drive inside if it's on clearance sale).

Honestly, I've had such good luck for the past 20-some-odd years that I just got cocky and ignored the warranty term. I have a 250MB disk, I think it is, from when I was a kid that still spins up. Connected it via IDE-to-USB just several months ago for fun. Nothing like pain to teach one a lesson :'(. I also have a 3GB drive from 2001 that still spins up, though I think that bad boy is a Maxtor Quantum or whatever that's oil filled.


  • 2 weeks later...

It's been a while, mostly because I decided to make 2 more backups of 16TB worth of data first ... to external USB 3.0 drives ... via a gigabit network.  :$

 

I removed the problem drive and am trying to set up a new share with the 7 remaining drives.  Struggling a bit with that, but I haven't been in the FreeNAS menus for quite a while except to run updates and to enter my encryption keys after reboots.

EDIT : Got it all set up again.  Now all I have to do is copy everything back (sigh).  Perhaps this is a good opportunity to do some cleaning.

 

 

 

I formatted the bad drive in Windows and it seems to mount just fine there.  So I opened CrystalDiskInfo for a S.M.A.R.T. reading.

I don't see a problem there either.  Or am I missing something?

 

[Screenshot: BadHDDSMART.jpg]


The main things you want to look at are 05, C5 & C6
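For reference, those hex IDs are 5, 197 and 198 in the decimal numbering smartctl uses. A minimal sketch of the check, assuming you already have the raw values in a dict keyed by decimal attribute ID (the function name and the dict layout are illustrative, not taken from any tool):

```python
# Hex 05, C5 and C6 as decimal IDs, with the names smartctl usually shows.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def suspicious_attributes(raw_values):
    """Return the watched attributes whose raw value is above zero."""
    return {
        WATCHED[attr_id]: raw_values[attr_id]
        for attr_id in WATCHED
        if raw_values.get(attr_id, 0) > 0
    }


if __name__ == "__main__":
    # Made-up values purely to show the call; not read from any drive.
    print(suspicious_attributes({5: 0, 197: 8, 198: 0}))
```

An empty result only means these three counters are at zero; it does not prove the drive is healthy.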

 

Drive looks good, so maybe it was a logical issue and got corrected with a reformat. 

Before you reformatted it, the culprit could have been C5 (Current Pending Sector count). When that count gets high enough it can cause major problems with the logical drive. This is generally resolved by running a check over the disk with fsck (Linux) or chkdsk (Windows), which will fix the sectors or mark them as bad. Because you formatted it, you wiped all that. I've had a disk with this exact behavior, and it was fine after the format.

Spoiler

Desktop: Ryzen9 5950X | ASUS ROG Crosshair VIII Hero (Wifi) | EVGA RTX 3080Ti FTW3 | 32GB (2x16GB) Corsair Dominator Platinum RGB Pro 3600Mhz | EKWB EK-AIO 360D-RGB | EKWB EK-Vardar RGB Fans | 1TB Samsung 980 Pro, 4TB Samsung 980 Pro | Corsair 5000D Airflow | Corsair HX850 Platinum PSU | Asus ROG 42" OLED PG42UQ + LG 32" 32GK850G Monitor | Roccat Vulcan TKL Pro Keyboard | Logitech G Pro X Superlight  | MicroLab Solo 7C Speakers | Audio-Technica ATH-M50xBT2 LE Headphones | TC-Helicon GoXLR | Audio-Technica AT2035 | LTT Desk Mat | XBOX-X Controller | Windows 11 Pro

 

Spoiler

Server: Fractal Design Define R6 | Ryzen 3950x | ASRock X570 Taichi | EVGA GTX1070 FTW | 64GB (4x16GB) Corsair Vengeance LPX 3000Mhz | Corsair RM850v2 PSU | Fractal S36 Triple AIO | 12 x 8TB HGST Ultrastar He10 (WD Whitelabel) | 500GB Aorus Gen4 NVMe | 2 x 2TB Samsung 970 Evo Plus NVMe | LSI 9211-8i HBA

 


39 minutes ago, Jarsky said:

Drive looks good, so maybe it was a logical issue and got corrected with a reformat. 

hmm ... so maybe I should just run checkdisk on it (or Spinrite), put it back in the NAS and start copying my data back to it. 

 

Then again I had problems with this exact drive before.  I just don't trust it anymore.  Guess I'll put it in my old/spare parts collection. 

If I ever run into trouble with one of the other drives, I can always try to put this one back in and see if that helps me along for a while.


8 hours ago, Captain Chaos said:

hmm ... so maybe I should just run checkdisk on it (or Spinrite), put it back in the NAS and start copying my data back to it. 

 

Then again I had problems with this exact drive before.  I just don't trust it anymore.  Guess I'll put it in my old/spare parts collection. 

If I ever run into trouble with one of the other drives, I can always try to put this one back in and see if that helps me along for a while.

I wouldn't put it back into the array. I would run Checkdisk though, and afterwards run the SMART tests again. If it looks okay, use it in a non-critical, non-array situation.
