
I see a lot of articles talking about how, in say RAID 5, one HDD fails and then, while you are rebuilding the array, another HDD fails.

This also seems to be a problem with RAID 1: if you have 2 HDDs in a RAID 1 config, one fails, and while you are rebuilding the array the other fails and your data is gone.

So, I was wondering if it's possible to mitigate this problem by adding drives halfway through the lifespan of the existing drives.

 

Say I have 2 drives in RAID 1 with a lifespan of 5 years. Can I add another drive about 2.5 years in and then not have to worry about both drives failing at the same time?


1 minute ago, Trinopoty said:

So, I was wondering if it's possible to mitigate this problem by adding drives halfway through the lifespan of the existing drives.

 

Say I have 2 drives in RAID 1 with a lifespan of 5 years. Can I add another drive about 2.5 years in and then not have to worry about both drives failing at the same time?

Going for a higher RAID level would be a more reliable solution; RAID 6, for example, can handle two drives failing, so the likelihood of a catastrophic failure goes down. RAID 10 is another solution. I guess swapping drives for newer ones after some time could reduce the likelihood of such a failure to some extent, but I doubt it'd reduce it all that much and you could still get unlucky. Besides, you have no way of knowing the lifespan of a drive anyway -- I have some 15 year old drives that still work as if they were new, and I have had some that failed within a week after purchase.
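
To put rough numbers on that, here's a minimal back-of-the-envelope sketch (in Python) comparing the chance of losing a RAID 5 vs a RAID 6 array during a rebuild. It assumes drive failures are independent and uses a made-up per-drive failure probability for the rebuild window, so the figures are purely illustrative, not a model of real drives:

```python
# Rough, illustrative comparison of losing a RAID 5 vs RAID 6 array during a
# rebuild. Assumes independent failures and an assumed (made-up) per-drive
# failure probability for the rebuild window.
from math import comb

def p_array_loss(remaining_drives: int, tolerated_failures: int, p_fail: float) -> float:
    """Probability that more than `tolerated_failures` of the remaining
    drives fail before the rebuild completes (simple binomial model)."""
    p_survive = sum(
        comb(remaining_drives, k) * p_fail**k * (1 - p_fail)**(remaining_drives - k)
        for k in range(tolerated_failures + 1)
    )
    return 1 - p_survive

p = 0.02  # assumed chance a given drive dies during the rebuild window
# 6-drive array, one drive already failed, 5 survivors being read:
print(f"RAID 5 (no further failures tolerated): {p_array_loss(5, 0, p):.2%}")
print(f"RAID 6 (one more failure tolerated):    {p_array_loss(5, 1, p):.2%}")
```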



9 minutes ago, Trinopoty said:

Say I have 2 drives in RAID 1 with a lifespan of 5 years. Can I add another drive about 2.5 years in and then not have to worry about both drives failing at the same time?

Yes, you can. Add a new disk to the array using the configuration utility, but don't expand the array or anything like that; just add the disk. Wait for the array reconstruction/rebuild to complete, then remove one of the original disks from the array using the array configuration utility (don't just pull it out).
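
For Linux software RAID, that same "add first, remove later" sequence can be scripted. The sketch below is only an outline under assumed conditions: an mdadm RAID 1 array at /dev/md0 and hypothetical device names (/dev/sdb1 for the member being retired, /dev/sdc1 for the new drive). Hardware RAID utilities have their own equivalents of these steps.

```python
#!/usr/bin/env python3
# Sketch: replace a RAID 1 member without ever degrading the array (mdadm).
# Device names are hypothetical examples; run as root on a real system.
import subprocess
import time

ARRAY = "/dev/md0"
OLD_DISK = "/dev/sdb1"   # member to retire (assumed)
NEW_DISK = "/dev/sdc1"   # freshly added drive (assumed)

def run(*args: str) -> None:
    print("+", " ".join(args))
    subprocess.run(args, check=True)

def resync_active() -> bool:
    # The kernel reports rebuild/resync progress in /proc/mdstat.
    with open("/proc/mdstat") as f:
        status = f.read()
    return "recovery" in status or "resync" in status

# 1. Add the new disk and grow the mirror to three members, so the data is
#    copied while both original disks stay in place.
run("mdadm", ARRAY, "--add", NEW_DISK)
run("mdadm", "--grow", ARRAY, "--raid-devices=3")

# 2. Wait for the rebuild onto the new disk to finish.
while resync_active():
    time.sleep(30)

# 3. Only now mark the old disk failed, remove it, and shrink the mirror
#    back to two members.
run("mdadm", ARRAY, "--fail", OLD_DISK)
run("mdadm", ARRAY, "--remove", OLD_DISK)
run("mdadm", "--grow", ARRAY, "--raid-devices=2")
```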


3 minutes ago, WereCatf said:

I have some 15 year old drives that still work as if they were new, and I have had some that failed within a week after purchase.

Typically, if a drive is going to fail it will do so in the first 12 months; otherwise it'll go for years and years.


13 minutes ago, Trinopoty said:

Say I have 2 drives in RAID 1 with a lifespan of 5 years. Can I add another drive about 2.5 years in and then not have to worry about both drives failing at the same time?

You can also do a 3-drive RAID 1, where there are three copies of the data, meaning 33% usable space. As in my first post, you can do that after the fact a few years later and actually just leave all the disks in; then if one fails you still have a fully working mirror while you replace the broken drive.

 

A 3-disk RAID 1 is rather wasteful on usable space and cost, though.
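
For reference, the usable-capacity arithmetic behind those numbers is just the following (a tiny sketch assuming equal-sized disks and ignoring any controller overhead):

```python
# Usable fraction of raw capacity for common layouts (equal-sized disks assumed).
def mirror_usable(copies: int) -> float:
    return 1 / copies            # n-way RAID 1: one copy's worth of space

def raid5_usable(disks: int) -> float:
    return (disks - 1) / disks   # one disk's worth of parity

def raid6_usable(disks: int) -> float:
    return (disks - 2) / disks   # two disks' worth of parity

print(f"2-disk RAID 1: {mirror_usable(2):.0%}")  # 50%
print(f"3-disk RAID 1: {mirror_usable(3):.0%}")  # 33%
print(f"4-disk RAID 5: {raid5_usable(4):.0%}")   # 75%
print(f"6-disk RAID 6: {raid6_usable(6):.0%}")   # 67%
```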


3 minutes ago, leadeater said:

Yes, you can. Add a new disk to the array using the configuration utility, but don't expand the array or anything like that; just add the disk. Wait for the array reconstruction/rebuild to complete, then remove one of the original disks from the array using the array configuration utility (don't just pull it out).

I mean yes, you need to rebuild the array, either manually or automatically. I'm not saying just add a drive and pull one out.


Just now, leadeater said:

You can also do a 3-drive RAID 1, where there are three copies of the data, meaning 33% usable space. As in my first post, you can do that after the fact a few years later and actually just leave all the disks in; then if one fails you still have a fully working mirror while you replace the broken drive.

 

A 3-disk RAID 1 is rather wasteful on usable space and cost, though.

So maybe another option is to have a RAID 1 with 1 drive (not really a RAID but bear with me). Then add a drive a few years in.

Not a good solution I think but it is theoretically possible.


6 minutes ago, Trinopoty said:

So maybe another option is to have a RAID 1 with 1 drive (not really a RAID but bear with me). Then add a drive a few years in.

Not a good solution I think but it is theoretically possible.

Um, what would the point be? A RAID1 with only one drive is the same thing as no RAID at all. It wouldn't be able to handle a single failure, so the original issue would still exist, but it'd be even worse.



8 minutes ago, Trinopoty said:

So maybe another option is to have a RAID 1 with 1 drive (not really a RAID but bear with me). Then add a drive a few years in.

Not a good solution I think but it is theoretically possible.

Yes, you can do that, though when it comes to hardware RAID cards some will enforce a 1-disk RAID 0 and others will enforce a 1-disk RAID 1. Software RAID doesn't have that issue.
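
As a concrete software-RAID example, mdadm will happily create a RAID 1 with one real member and one deliberately "missing" slot, and you can add the mirror disk years later. A minimal sketch, assuming hypothetical device names:

```python
# Sketch: start a RAID 1 with a single disk, add the mirror later (mdadm).
# Device names are hypothetical; the array runs degraded until the second
# disk is added and the rebuild finishes.
import subprocess

def run(*args: str) -> None:
    subprocess.run(args, check=True)

# Create the array with one real member and one deliberately missing slot.
run("mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
    "/dev/sdb1", "missing")

# ...a few years later: add the second disk; mdadm rebuilds onto it.
run("mdadm", "/dev/md0", "--add", "/dev/sdc1")
```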


Just now, leadeater said:

Yes, you can do that, though when it comes to hardware RAID cards some will enforce a 1-disk RAID 0 and others will enforce a 1-disk RAID 1. Software RAID doesn't have that issue.

I haven't actually used a RAID card before, so I don't really have real-world knowledge on the matter.

 

So, another solution would probably be to have a 2 drive RAID, remove a drive and add a new one (and rebuild), then use the older drive you just removed for something else.


15 minutes ago, Trinopoty said:

I mean yes, you need to rebuild the array, either manually or automatically. I'm not saying just add a drive and pull one out.

Well, I mentioned that because in a normal situation, during a drive failure, you remove the failed disk, and when you insert the replacement it is automatically added to the array and the rebuild starts. If you are not in a failure situation the steps are not automated, so you have to add and remove the disks using the utility. But if you add a new disk to an array that is, say, RAID 5 and you expand it, you cannot remove any disks afterwards, as the used space is now large enough to require the disk you added. You just have to be careful, because you can do things which mean you can no longer remove a disk you originally wanted to, since you typically cannot shrink an array.


1 minute ago, leadeater said:

Well, I mentioned that because in a normal situation, during a drive failure, you remove the failed disk, and when you insert the replacement it is automatically added to the array and the rebuild starts. If you are not in a failure situation the steps are not automated, so you have to add and remove the disks using the utility. But if you add a new disk to an array that is, say, RAID 5 and you expand it, you cannot remove any disks afterwards, as the used space is now large enough to require the disk you added. You just have to be careful, because you can do things which mean you can no longer remove a disk you originally wanted to, since you typically cannot shrink an array.

With RAID 5 or 6, you can just pull out a drive and the controller will probably assume it died. Then just add a fresh one in and it should trigger a rebuild.


2 minutes ago, Trinopoty said:

With RAID 5 or 6, you can just pull out a drive and the controller will probably assume it died. Then just add a fresh one in and it should trigger a rebuild.

Yes, but you're wanting to do it the safe way before a failure, and the correct way is to add the disk to the array, then remove a disk after the rebuild, so the array never goes into a degraded health state putting your data at risk.


3 minutes ago, leadeater said:

Yes, but you're wanting to do it the safe way before a failure, and the correct way is to add the disk to the array, then remove a disk after the rebuild, so the array never goes into a degraded health state putting your data at risk.

Well, it's a theoretical possibility. No one would actually pull out a drive from a RAID 5 unless it's dead already.

So, as I understand it, it's slightly more difficult to do with RAID 5/6 than with RAID 1.

But it would still hold for RAID 51, 15, 61, 16, etc. (no one uses those, I imagine), as long as there is a RAID 1 layer.


8 minutes ago, Trinopoty said:

Well, it's a theoretical possibility. No one would actually pull out a drive from a RAID 5 unless it's dead already.

Depends. I have a few times, but that was during a capacity upgrade. On multiple occasions I had to replace 300GB SAS disks with 600GB SAS disks one by one, and finally at the end, after days of add + rebuild + remove + rebuild per disk, expand the array to double its size, which you can only do once all the disks have been swapped for 600GB ones.

 

When you're working with RAID and doing disk maintenance outside of a failure, you should never do anything that puts the array into a degraded health state; you don't need to. As you noted in your original post, rebuilding is the most dangerous time, and taking a currently healthy array, making it unhealthy and then triggering a rebuild is not good practice. If a failure does happen, it's not going to be easily explained away when there is a better, safer way to do the task.


11 minutes ago, Trinopoty said:

With RAID 5 or 6, you can just pull out a drive and the controller will probably assume it died. Then just add a fresh one in and it should trigger a rebuild.

In a RAID 5, doing it this way you're putting your array at risk.

If another disk fails, or the new disk fails, then your array goes offline. You need to remove the failed disk, re-add the original disk, delete and recreate the array hoping that you used the default values (or that you recorded the values you used), and then it will again have to do a rebuild, which stresses your drives. I've had to do this a couple of times in the past.

 


6 minutes ago, Jarsky said:

In a RAID 5, doing it this way you're putting your array at risk.

If another disk fails, or the new disk fails, then your array goes offline. You need to remove the failed disk, re-add the original disk, delete and recreate the array hoping that you used the default values (or that you recorded the values you used), and then it will again have to do a rebuild, which stresses your drives. I've had to do this a couple of times in the past.

 

Well, we're trying to prevent an array going offline for any length of time, so having to delete and recreate an array is not what we want.

Guess this kind of gimmick is best pulled with RAID 1.

Then again, stressing drives halfway through their life is better than stressing them near the end of their life.

RAID 5 is not advised anymore precisely because the failure rate shoots way up as soon as a drive naturally fails.

The failure probability should only increase very slightly if you pull out a drive when none has actually failed.
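
For context on why RAID 5 gets that reputation, here's a rough sketch of the usual unrecoverable-read-error (URE) argument. It assumes the commonly quoted consumer-drive spec of one URE per 10^14 bits and treats errors as independent, both of which are simplifications, so treat the result as a ballpark only:

```python
# Rough odds of hitting at least one unrecoverable read error (URE) while
# reading the surviving disks during a RAID 5 rebuild. Assumes the commonly
# quoted consumer spec of 1 URE per 1e14 bits and independent errors.
def p_ure_during_rebuild(surviving_disks: int, disk_tb: float,
                         ure_per_bit: float = 1e-14) -> float:
    bits_to_read = surviving_disks * disk_tb * 1e12 * 8  # TB -> bits
    return 1 - (1 - ure_per_bit) ** bits_to_read

# Example: a 4-disk RAID 5 of 4 TB drives loses a disk; 3 disks must be read in full.
print(f"{p_ure_during_rebuild(3, 4.0):.0%} chance of at least one URE during the rebuild")
```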


30 minutes ago, Trinopoty said:

The failure probability should only increase very slightly if you pull out a drive when none has actually failed.

Problem is, it's really hard to know the actual health of a disk; they could all be fine or near death, or only one near death, and you'll never know. I've watched a cascading disk failure happen in the space of a few hours: a full 42U rack of disks, and three died one by one during a rebuild. The disks were all different ages and production runs, and the end result was total system failure (of medical data at a hospital, so yay for that). Thankfully I didn't actually work there, but I did get to watch it happen and experience the multiple days of them trying to get it fixed.

 

For both RAID 1 and 5/6 it's simple to add a disk, wait for the rebuild, then remove one. You can also just leave it in the system as a hot spare in idle spin-down; using a previously in-use 2.5-year-old disk as a hot spare may not be the best idea, but in a home situation it's better than not having a spare.
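
On the software-RAID side, the hot-spare part is a one-liner: with mdadm, a disk added to a healthy array just sits as a spare until it's needed. A minimal sketch, assuming a hypothetical device name:

```python
# Sketch: register a disk as a hot spare on a healthy mdadm array.
# On a non-degraded array, --add leaves the disk as a spare; mdadm
# rebuilds onto it automatically if a member later fails.
import subprocess

subprocess.run(["mdadm", "/dev/md0", "--add", "/dev/sdd1"], check=True)
```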

 

RAID 1 also has another nice trick which is sort of getting lost to time as the industry has moved on. When you're making OS or big system changes you can pop out one of the disks in the mirror so you have a 'snapshot' from before the changes. If it all goes wrong, you can shut down the system, pull out the disk that was just active (with the failed change on it) and put in the disk holding the state from before the change, then add the other disk back and rebuild/overwrite it, and you can repeat and try again. It's a pretty neat old-school trick that you can now do very easily with VMs, since snapshots are a native capability.


1 minute ago, Trinopoty said:

Well, we're trying to prevent an array going offline for any length of time, so having to delete and recreate an array is not what we want.

Guess this kind of gimmick is best pulled with RAID 1.

Then again, stressing drives halfway through their life is better than stressing them near the end of their life.

The problem with that logic is that you have no idea what the drive's lifespan is.

You could go by the manufacturer's warranty period, which for most drives is approximately 5 years, but the issue is you'd then have to be replacing drives every 2.5 years, which would be extremely costly.

The best alternative is to go off the drive's MTBF rating. However, that logic is flawed. If you go with, say, the WD Reds, they have an MTBF of 1,000,000 hours. That's about 114 years. Let's say you have 8 drives in a recommended RAID 6; that would mean roughly one drive going down in the array every 14 years. But we know the real failure rate is far higher than that. And you don't want to be waiting 7 years to replace drives if uptime is your primary concern.
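
The MTBF arithmetic is easy to check; a minimal sketch, assuming the quoted 1,000,000-hour figure applies to all 8 drives and failures are independent:

```python
# Back-of-the-envelope MTBF arithmetic for the figures quoted above.
MTBF_HOURS = 1_000_000      # quoted WD Red MTBF
HOURS_PER_YEAR = 24 * 365
DRIVES = 8

per_drive_years = MTBF_HOURS / HOURS_PER_YEAR
array_years_between_failures = per_drive_years / DRIVES  # 8 drives fail 8x as often

print(f"Per drive:   ~{per_drive_years:.0f} years MTBF")                                   # ~114 years
print(f"8-drive set: one failure roughly every {array_years_between_failures:.1f} years")  # ~14.3 years
```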

Generally I've found that with mechanical disks, if they get past the first few weeks then they'll be fine for 5 years (the typical warranty period).

I've found that is generally when it's most cost effective to swap out the disks.

 

1 minute ago, Trinopoty said:

RAID 5 is not advised anymore precisely because the failure rate shoots way up as soon as a drive naturally fails.

The failure probability should only increase very slightly if you pull out a drive when none has actually failed.

 

True, but RAID 5 is really your best option for 4-bay enclosures such as consumer NAS shelf units, or for 1U servers.

RAID 5 is also much faster than RAID 6, so it is still relevant where speed is more of a necessity, such as a caching server.

RAID 6 is generally more relevant with 6+ disks, and of course where speed is not as important as redundancy, such as a general-purpose file server for a small site.
