
Best way to simulate a RAID 5 disk failure

What is the best way to simulate a RAID 5 drive failure? One of the goals I have been working towards recently is simulating different catastrophic events in an "enterprise" environment (on a very small scale) so that I can better understand not only what to expect when something goes wrong but also what steps I need to take to recover. One of the next items on my list is recovering a failed RAID array in a safe, non-production environment, using data that is either non-critical or has known backups. (There are lots of others, like recovering deleted data, which I will hopefully get wrapped up tomorrow, as well as building "simulated" networks on real hardware; think CCNA practice labs.)

 

So, what would you guys recommend doing in order to simulate a failed RAID 5 array? Just up and pull a drive out of the machine while it's running? Do I need to sacrifice a disk with a hammer for the greater good? (Or open one up and maybe just poke the platter?) Who knows. What would you do? If there are multiple different suggestions (e.g. one for a drive that has completely dropped out, one for a damaged drive, etc.), please post below with instructions on the best way you can think of. Also include other things, like what the machine should be running: VMware with active VMs? Straight Windows? Should I put a load on the machine, or possibly initiate a heavy read/write cycle, when I break it?

 

These tests will be running on a retired T330 with a RAID card and a SAS backplane. I have 6 drives in it at the moment: two 1 TB drives in RAID 0 (I am currently just using those as host OS storage) and four 1 TB drives in RAID 5 (I don't remember if a hot spare is enabled or not). These specific drives are the only SAS ones I have access to that match in size (and in manufacturer/model within each array), so I plan to only do a pulled-drive test or something similar with them. For any physical damage that I might need to inflict, I have a multitude of retired SATA drives that I will swap in instead.

 

Thank you!

 

Breaking things 1 day at a time


2 minutes ago, TubsAlwaysWins said:

Just up and pull a drive out of the machine while it's running?

Yes.

2 minutes ago, TubsAlwaysWins said:

Do I need to sacrifice a disk with a hammer for the greater good?

What good would that do?

3 minutes ago, TubsAlwaysWins said:

Or open one up and maybe just poke the platter?

Again, what good would that do? Corrupted or bad sectors aren't a "catastrophic failure", and the array will just correct such errors on the fly. The drive failing completely, e.g. losing power or you pulling it out, would be a "catastrophic failure" of the drive, though it still wouldn't be one for the whole array.

Hand, n. A singular instrument worn at the end of the human arm and commonly thrust into somebody’s pocket.


@ragnarok0273

Sounds good. Will do

 

@WereCatf

I mean, obviously no good. I just know drives can fail in multiple ways. If it's just a waste of a drive then I won't do it. Thank you!

 


3 minutes ago, TubsAlwaysWins said:

@ragnarok0273

Sounds good. Will do

 

@WereCatf

I mean, obviously no good. I just know drives can fail in multiple ways. If it's just a waste of a drive then I won't do it. Thank you!

They can fail in many, many ways, but there are really only two "ways" the RAID controller/software would "care" about: either data corruption, or loss of data entirely via the loss of a drive. For the loss of a drive: pull a drive, wipe it and rebuild (or rebuild with a different drive). As for corruption, it should fix this on the fly as stated previously, assuming there is sufficient redundancy to rebuild it; if there isn't, there's not much you can do as a sysadmin besides go find a backup :)
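To make the "pull a drive and rebuild" case a bit more concrete, here is a rough Python sketch of the XOR parity idea behind RAID 5. It is only an illustration, not how any particular controller does it: the function names (`xor_blocks`, `make_stripe`, `rebuild`) are made up for this post, each "drive" is just a short byte string, and parity sits on one fixed drive instead of rotating across the drives the way real RAID 5 does.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_stripe(data_blocks):
    """Return one stripe: the data blocks plus a single parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]

def rebuild(stripe):
    """Rebuild the one missing block (marked None) from the survivors."""
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of all surviving blocks (data and parity) gives the lost block.
    repaired[missing[0]] = xor_blocks(survivors)
    return repaired

# Four "drives": three data blocks plus one parity block.
original = make_stripe([b"AAAA", b"BBBB", b"CCCC"])

degraded = list(original)
degraded[1] = None                  # "pull" drive 1
assert rebuild(degraded) == original

degraded[2] = None                  # a second drive drops before the rebuild
try:
    rebuild(degraded)
except ValueError as err:
    print(err)                      # two blocks gone: time to find a backup
```

With one block missing, the XOR of the survivors gives it back exactly. With two missing there is not enough information left, which is the "go find a backup" case.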

Rig: i7 13700k - - Asus Z790-P Wifi - - RTX 4080 - - 4x16GB 6000MHz - - Samsung 990 Pro 2TB NVMe Boot + Main Programs - - Assorted SATA SSD's for Photo Work - - Corsair RM850x - - Sound BlasterX EA-5 - - Corsair XC8 JTC Edition - - Corsair GPU Full Cover GPU Block - - XT45 X-Flow 420 + UT60 280 rads - - EK XRES RGB PWM - - Fractal Define S2 - - Acer Predator X34 -- Logitech G502 - - Logitech G710+ - - Logitech Z5500 - - LTT Deskpad

 

Headphones/amp/dac: Schiit Lyr 3 - - Fostex TR-X00 - - Sennheiser HD 6xx

 

Homelab/ Media Server: Proxmox VE host - - 512 NVMe Samsung 980 RAID Z1 for VM's/Proxmox boot - - Xeon e5 2660 V4- - Supermicro X10SRF-i - - 128 GB ECC 2133 - - 10x4 TB WD Red RAID Z2 - - Corsair 750D - - Corsair RM650i - - Dell H310 6Gbps SAS HBA - - Intel RES2SC240 SAS Expander - - TrueNAS + many other VM’s

 

iPhone 14 Pro - 2018 MacBook Air


Oh, also... I should probably state the obvious: if this is ultimately for an enterprise rollout, I wouldn't be running RAID 5; RAID 6 at a minimum. Even in my homelab I wouldn't use RAID 5 these days. I run Z2 (RAID-Z2, the ZFS equivalent of RAID 6).
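For a rough feel of the trade-off, here is a small Python sketch comparing usable capacity and how many failed drives each level survives. The helper name `raid_summary` is made up for this post, and the 4 x 1 TB figures are just the array from the first post used as an example; it ignores hot spares, filesystem overhead and anything controller-specific.

```python
def raid_summary(level, drives, size_tb):
    """Return (usable capacity in TB, drive failures tolerated)."""
    parity_drives = {"RAID 0": 0, "RAID 5": 1, "RAID 6": 2}[level]
    minimum = {"RAID 0": 2, "RAID 5": 3, "RAID 6": 4}[level]
    if drives < minimum:
        raise ValueError(f"{level} needs at least {minimum} drives")
    # One drive's worth of space goes to parity per failure tolerated.
    return (drives - parity_drives) * size_tb, parity_drives

for level in ("RAID 0", "RAID 5", "RAID 6"):
    usable, tolerated = raid_summary(level, drives=4, size_tb=1)
    print(f"{level}: {usable} TB usable, survives {tolerated} failed drive(s)")
```

So with four 1 TB drives, going from RAID 5 to RAID 6 costs one more drive's worth of space in exchange for surviving a second failure, for example one that happens during a rebuild.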


If you want to simulate bit rot (a bit changing state, which happens over time and is often not properly caught by normal hardware RAID controllers or Linux mdraid, but is caught and handled by ZFS and Btrfs), then you can do this:

  1. Shut down the system.
  2. Either move the victim drive to another system, or boot the system with a separate OS (maybe a live CD).
  3. Write a small amount of random data to random locations on the drive (not the first or last few gigabytes, though).
  4. Return the system to its normal configuration. Boot it and then do a patrol read (if normal RAID) or a scrub (if ZFS).
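If you want to rehearse the idea before touching a real disk, the steps above can be sketched in Python against a scratch image file. Everything here is an assumption for illustration: the helper names, the 1 MiB image, the 4 KiB checksum blocks, and the 64 KiB guard zone standing in for "the first or last few gigabytes". It flips single bits rather than writing random bytes so that every touched location is guaranteed to change, and the checksum comparison is only a stand-in for what a scrub does, not how ZFS or a RAID controller actually implements it. Don't point it at a real device.

```python
import hashlib
import os
import random
import tempfile

BLOCK = 4096          # checksum granularity (an arbitrary choice)
GUARD = 64 * 1024     # stand-in for "not the first or last few gigabytes"

def block_checksums(path):
    """SHA-256 of every BLOCK-sized chunk of the file."""
    sums = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK):
            sums.append(hashlib.sha256(chunk).digest())
    return sums

def inject_bit_rot(path, flips, rng):
    """Flip one bit at each of `flips` random offsets, avoiding both ends."""
    size = os.path.getsize(path)
    offsets = rng.sample(range(GUARD, size - GUARD), flips)
    with open(path, "r+b") as f:
        for offset in offsets:
            f.seek(offset)
            byte = f.read(1)[0]
            f.seek(offset)
            f.write(bytes([byte ^ (1 << rng.randrange(8))]))
    return sorted(offsets)

def scrub(path, expected):
    """Return the indices of blocks whose checksum no longer matches."""
    current = block_checksums(path)
    return [i for i, (old, new) in enumerate(zip(expected, current)) if old != new]

# Scratch image file standing in for the victim drive.
with tempfile.TemporaryDirectory() as tmp:
    image = os.path.join(tmp, "victim.img")
    with open(image, "wb") as f:
        f.write(os.urandom(1024 * 1024))

    before = block_checksums(image)
    offsets = inject_bit_rot(image, flips=5, rng=random.Random(0))
    damaged = scrub(image, before)

    # Every flipped bit should show up as a damaged block, and nothing else.
    assert damaged == sorted({offset // BLOCK for offset in offsets})
    print("damaged blocks:", damaged)
```

The file size never changes and no read error is raised; only the stored checksums reveal the damage, which is why bit rot tends to be invisible to anything that doesn't keep them.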

Looking to buy GTX690, other multi-GPU cards, or single-slot graphics cards: 

 


On 11/25/2020 at 11:50 PM, LIGISTX said:

Oh, also... I should probably state the obvious: if this is ultimately for an enterprise rollout, I wouldn't be running RAID 5; RAID 6 at a minimum. Even in my homelab I wouldn't use RAID 5 these days. I run Z2 (RAID-Z2, the ZFS equivalent of RAID 6).

This is not an enterprise rollout, just a test so I know what to expect when I do have to do one. And I will keep that in mind when I am building an enterprise/client server.

Thank you!

 
