RAID Maintenance Help

IBM_THINKPAD_R51

So, I just saw this "Our data is GONE... Again" video today and that got me wondering about my own server.

 

I have a Dell PowerEdge R910 with two RAID 6 arrays of seven 15K RPM SAS drives each.

 

Basically, in the video he said something about data rot and how the checks were never configured properly (since he is using software RAID).

 

My question is: since I have hardware RAID with the more robust Serial Attached SCSI (SAS) standard, does my hardware RAID card do all the necessary checks automatically? In the OS it only comes up as "PERC H700 SCSI Disk Device".


Do you have the software in your OS to manage the RAID array? There should be software for most major OSes; otherwise you can use the DRAC or the BIOS menu.

 

But my PERC cards seem to do a full disk check every week by default; just check that that's the case on yours.

 

But I'd make sure your backups are good; RAID scrubs only do so much.
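If Dell's OpenManage Server Administrator (OMSA) is installed, checking and kicking off that disk check could look roughly like this. This is a sketch, not a definitive recipe: the controller and vdisk IDs below are assumptions, so list yours first.

```shell
# List controllers, then the virtual disks on controller 0
# (IDs are placeholders -- yours may differ)
omreport storage controller
omreport storage vdisk controller=0

# Show the state of a specific virtual disk, including any
# background check that is currently running
omreport storage vdisk controller=0 vdisk=0

# Manually start a consistency check on virtual disk 0
omconfig storage vdisk action=checkconsistency controller=0 vdisk=0
```

These commands need root/administrator rights and only work against a controller OMSA actually supports.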


Hardware RAID isn't necessarily more robust than software RAID. It's often LESS robust, as there are more parts that can fail.

What happened was that ZFS (very advanced and very robust, but it needs a lot more sophistication to be used well) was detecting errors, but the normal process to work around them was NOT being performed (a user configuration error; that behavior is configured by default), there were no hot spare drives in the servers (also a configuration error), and there was no alerting (also a configuration error).

 

With any luck, Linus will be able to recover 99% or more of his data. It's not guaranteed, and he'll likely have some headaches in the days and potentially weeks to come. He will want to do some research to make sure nothing foolish is done (like trying to replace ALL of the drives at once).

---

As far as YOU are concerned: you should have your config set to alert you on drive errors, and if you're using an advanced file system (ZFS, Btrfs, etc.) you should make sure it's doing all of its appropriate background maintenance tasks on a regular schedule.
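For ZFS specifically, the background maintenance task in question is a scheduled scrub. A minimal sketch (the pool name `tank` is an assumption; substitute your own):

```shell
# Start a scrub: reads every block in the pool and verifies it
# against its checksum, repairing from redundancy where it can
zpool scrub tank

# Watch progress and see whether any checksum errors were found
zpool status -v tank

# Example root cron entry to scrub on the 1st of each month at 2am
# (add via `crontab -e`):
# 0 2 1 * * /sbin/zpool scrub tank
```

Many distros ship a scrub timer already (e.g. a systemd timer or a script in /etc/cron.d), so it's worth checking whether one exists before adding your own.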

 

My expectation is that if you're doing HW RAID, these checks aren't even a thing, and you'll just lose data to bit rot without ever being aware of it. This probably won't matter.

https://arstechnica.com/information-technology/2014/01/bitrot-and-atomic-cows-inside-next-gen-filesystems/

3900x | 32GB RAM | RTX 2080

1.5TB Optane P4800X | 2TB Micron 1100 SSD | 16TB NAS w/ 10Gbe
QN90A | Polk R200, ELAC OW4.2, PB12-NSD, SB1000, HD800
 


-> Moved to Storage Devices

^^^^ That's my post ^^^^
<-- This is me --- That's your scrollbar -->
vvvv Who's there? vvvv


I have my servers (with HW RAID) set to do a volume check every month, where it checks all the data. It takes anywhere from 10 to 20 hours. It has shown an error only a few times, which was always corrected; it's been years since the last one. I scheduled them just for peace of mind, really; the whole bitrot thing always felt like a myth to me, to be honest. Good to know the volume check helps to prevent it though, more peace of mind hehe.

 

I did have to set up the checks myself, so you'd better check yours, I guess.

I have no signature


Linus failed to monitor his drives. It wasn't data rot or unicorn dust. If he had bothered to run SMART checks on his drives, they would have flagged errors. "Hey look, I'm running Linux and don't have to do that." Lol. This is why I'm such a strong opponent of running these open-source NAS platforms without hardware monitoring.
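With smartmontools, the SMART checks being described look roughly like this. Device names are placeholders, and note that drives behind a MegaRAID-family controller (like the PERC H700) usually aren't visible as plain /dev/sdX and must be addressed through the controller:

```shell
# One-off health summary, attributes, and error log for a drive
smartctl -a /dev/sda

# Kick off a long (full-surface) self-test; results show up
# later in the self-test log section of `smartctl -a`
smartctl -t long /dev/sda

# Drive number 0 behind a MegaRAID-family controller
smartctl -a -d megaraid,0 /dev/sda

# Example /etc/smartd.conf line for continuous monitoring:
# monitor everything found, email on problems (address is a placeholder)
# DEVICESCAN -a -m admin@example.com
```

The smartd daemon then polls the drives periodically and alerts on failing attributes, which is exactly the kind of monitoring being argued for here.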

 

I've had quarter million dollar SAN arrays puke 75% of their drives over relatively short periods. Drives fail, and when you get boxes of them directly from the manufacturer they seem to fail more.

 

If it's a Dell server running Windows, you should be able to download Dell's OpenManage, and it will show you the drive status through the H700. This is a decent card, and RAID 6 is much more resilient than RAID 5. However, 15K spinners are notoriously prone to failure because of their higher heat and stress. Don't drink the Kool-Aid that SAS drives are more reliable than SATA. They aren't. At a minimum you need to get a monitor on that RAID card to see the drive status. OpenManage is the quickest solution, IMO.
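The drive-status check through OpenManage could look like this (OMSA's CLI tools; controller ID 0 is an assumption, so enumerate yours first):

```shell
# Physical disk states (Online / Failed / Predicted failure)
# as seen through the H700
omreport storage pdisk controller=0

# Overall controller health and the cache battery state
omreport storage controller controller=0
omreport storage battery controller=0
```

A "Predicted failure" flag on a pdisk is the early warning you want an alert on, well before the array actually degrades.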

 

I do a fair amount of migrations from legacy storage like this to SSD. The server itself will run forever, and there's nothing wrong with the OS. However, older spinners, especially 15Ks, are a bomb waiting to go off, so I just migrate the mess to a data-center-grade Intel or Kingston SSD. That single SSD has a far lower chance of failure than the previous RAID array, and it's faster.

 

 


9 hours ago, wseaton said:

Linus failed to monitor his drives. It wasn't data rot or unicorn dust. [...]

Yeah, it's Windows Server, so I do have all the utilities at my disposal. Although I will say I haven't had any 15K SAS drives fail on me yet since I started using them 5-10 years ago, while my previous server had some 1TB 7.2K SATA hard drives and those have failed, so... I don't know. It makes sense that higher stress makes for a less reliable drive, but for me so far it's actually been the other way around.

