Let's Talk About: RAID Survival Rates

Advanced RAID: Survival Rates

 

In previous discussions, we've talked about RAID survival in general, the advantages and disadvantages of levels of parity, and the advantages and disadvantages of striping. However, for advanced RAID volumes, we left one topic hanging. Can you guess which one it is?

 

Of course you can. It's the topic of maximum and minimum redundancy levels. In the discussion, we mentioned how striped RAID volumes (RAID 10, RAID 50 and RAID 60) have both a minimum and maximum tolerance to failed drives. However, we didn't talk about how likely it is that a volume will fail if a certain number of drives fail. That's what we're going to do today.

 

However, there will be no math discussed here today. I won't bore you with details of probability, or enrage someone because I made a mistake (everyone does). We're going to simulate drives failing.

 

That's right. We're going to simulate RAID configurations, drive failures, and check if the RAID is still alive. But we're not going to do it just once, we're going to do it twenty thousand times.

 

Why so many times?

 

Take a coin and flip it once. It comes up heads. Does that mean that coin flips always come up heads? Of course not.

Flip it ten times. Let's say you get 7 heads and 3 tails. Does that mean that coin flips are 70/30 to heads/tails? Of course not.

Flip it ten thousand times and get 5,200 heads and 4,800 tails. So heads must come up 52% of the time, right?

But we know that a fair coin is truly 50/50, and as we flip it more and more times, the overall ratio trends toward 50/50. It never lands exactly on 50/50, because each individual flip nudges the running average a tiny bit one way or the other, but with enough flips that noise becomes negligible.

 

So, for any one configuration with N drives, we'll randomly choose M of them to fail and then check whether the array is still alive: the check looks at every stripe, and if any single stripe has lost more drives than it can tolerate, the whole array counts as dead. Repeating this many times gives us an estimate of the configuration's survival rate, which I find more interesting than the failure rate. Along the way, we'll also look at how the number of stripes in a volume and the number of drives per stripe influence that rate.

 

I'll be using MATLAB to simulate it, because it's easy to make plots that look nice and because I've used it a lot in undergraduate and graduate school.
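
For anyone curious, here's roughly the kind of thing I mean, written as a minimal MATLAB/Octave sketch. The function name simulate_survival and its inputs are placeholders of my own choosing, not necessarily the exact script behind the plots below. The array is described as a list of stripe groups, each with a size and a failure tolerance; drives are killed at random, and the array counts as dead if any single group loses more drives than it can absorb.

% simulate_survival.m -- minimal sketch, not necessarily the exact script used for the plots
%   group_sizes:     drives in each stripe group, e.g. [2 2 2 2] for an 8-drive RAID 10
%   group_tolerance: failures each group can absorb (1 = mirror or RAID 5, 2 = RAID 6, 3 = RAID-Z3)
%   num_failed:      how many drives to fail at random
%   num_trials:      how many simulated failure scenarios to run
function survival_rate = simulate_survival(group_sizes, group_tolerance, num_failed, num_trials)
    num_drives = sum(group_sizes);

    % label every drive with the stripe group it belongs to
    group_of_drive = zeros(1, num_drives);
    idx = 1;
    for g = 1:numel(group_sizes)
        group_of_drive(idx:idx + group_sizes(g) - 1) = g;
        idx = idx + group_sizes(g);
    end

    survived = 0;
    for trial = 1:num_trials
        failed = randperm(num_drives, num_failed);        % pick which drives die, no repeats
        losses = accumarray(group_of_drive(failed)', 1, [numel(group_sizes) 1]);
        if all(losses <= group_tolerance(:))              % no group lost more drives than it can tolerate
            survived = survived + 1;
        end
    end
    survival_rate = survived / num_trials;
end

Describing an array this way (stripe sizes plus per-stripe tolerances) covers RAID 10, 50, 60 and 70 with the same code, which is all we need here.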

 

 

We'll say this RAID 10 configuration has 8 drives, because that's a nice healthy 12TB of usable storage using 3TB drives. That means it has four RAID 1 mirrors striped together. From the previous discussion, that means our RAID array has a worst-case redundancy of 1 drive and a best-case redundancy of 4 drives.

 

So let's simulate the worst scenario the array could conceivably survive: 4 drives failing at random. We'll also take a look at how our estimated survival rate converges to its true value as the number of simulations increases.
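
To show where a convergence plot like the one below comes from, here's another small sketch (again with names I made up for illustration): record each trial's outcome and plot the running average.

% 8-drive RAID 10: four two-drive mirrors, each mirror tolerates one failure
group_of_drive  = [1 1 2 2 3 3 4 4];    % which mirror each drive belongs to
group_tolerance = [1 1 1 1];
num_failed = 4;                         % the best-case redundancy limit
num_trials = 20000;

alive = zeros(1, num_trials);           % 1 if the array survived that trial
for trial = 1:num_trials
    failed = randperm(8, num_failed);
    losses = accumarray(group_of_drive(failed)', 1, [4 1]);
    alive(trial) = all(losses <= group_tolerance(:));
end

running_rate = cumsum(alive) ./ (1:num_trials);   % survival rate after each trial so far
plot(100 * running_rate);
xlabel('Number of simulations'); ylabel('Survival rate (%)');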

 

[Plot: estimated survival rate vs. number of simulations for an 8-drive RAID 10 with 4 failed drives, settling at about 23%]

 

I wouldn't want this to happen to me. You can see, at the beginning, the survival rate is all over the place. However, as the number of simulations increases, the survival rate stabilizes to about 23%. We could do more simulations, but I felt this was good enough for our purposes.

 

As a sanity check, let's make sure that, if 5 drives die, the survival rate is zero.

 

[Plot: estimated survival rate vs. number of simulations with 5 failed drives, flat at 0%]

 

If you were to ask how often the experimental data matched the true survival rate here, the answer is every single time: with only four mirrors, five failed drives guarantees that at least one mirror loses both of its drives, so the array always dies.

 

Now that we've seen how our estimated survival rate converges to a stable value, we can start running the simulation for various drive configurations.

 

Let's estimate the chance of an array surviving a given number of failed drives. We'll test RAID 10, RAID 50, RAID 60 and striped RAID-Z3 (which I'll call RAID 70: striped triple parity), starting with one failed drive and working our way up until the configuration always fails. We'll put them all on one graph for convenience.

 

[Plot: survival rate vs. number of failed drives for 12-drive RAID 10, 50, 60 and 70 with the minimum number of drives per stripe]

 

Some things to note:

  • Each RAID configuration used the minimum number of drives per stripe: two for RAID 10, three for RAID 50, four for RAID 60 and six for RAID 70. With 12 drives total, that works out to 6 redundant drives for RAID 10, four for RAID 50, six for RAID 60, and six for RAID 70 (these layouts are written out in the sketch after this list).
  • Twenty thousand simulations per data point.
  • I recommend going back to the discussion and cross-comparing the worst and best case redundancy with our data.
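
For reference, a sweep like the one behind the graph above could be written something like this, reusing the hypothetical simulate_survival sketch from earlier and the 12-drive layouts from the first bullet:

% 12 drives, minimum drives per stripe
configs = struct( ...
    'name',      {'RAID 10', 'RAID 50', 'RAID 60', 'RAID 70'}, ...
    'sizes',     {2*ones(1,6), 3*ones(1,4), 4*ones(1,3), 6*ones(1,2)}, ...
    'tolerance', {1*ones(1,6), 1*ones(1,4), 2*ones(1,3), 3*ones(1,2)});

num_trials = 20000;
for c = 1:numel(configs)
    for num_failed = 1:12
        rate = simulate_survival(configs(c).sizes, configs(c).tolerance, num_failed, num_trials);
        fprintf('%s, %2d failed drives: %5.1f%% survival\n', configs(c).name, num_failed, 100 * rate);
    end
end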

 

We can see that RAID 50 fares the worst and RAID 70 the best. It's also interesting to note that the survival rate drops only slowly for the first few failures, then falls off quickly. This is especially noticeable for RAID 60 and RAID 70.

 

These configurations are the best case for redundancy, since they have the highest possible number of redundant drives. This is good for keeping data safe, but expensive to set up, because more raw storage is lost to redundancy.

 

 

In this case, we keep the same RAID 10 and RAID 70 configurations as before, since for those levels the previous layout was already the most space-efficient one available. However, our RAID 50 and RAID 60 now use two 6-drive stripes each, leaving only two and four parity drives, respectively. Let's see what happens to their survival rates now.

 

[Plot: survival rate vs. number of failed drives for the space-efficient 12-drive configurations]

 

RAID 50's ability to take a hit falls off a cliff as early as the second drive failure: before, it was around 85% likely to survive two failures; now it's closer to 55%. RAID 60 similarly becomes less safe, with a third drive failure letting the array live only about 80% of the time versus 95% before. And where the previous RAID 60 survived 4 drive failures 80% of the time, this one only manages about 45%.
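
As an aside, these particular cases are small enough to check by hand, assuming the 12-drive layouts described above: the space-efficient RAID 50 (two 6-drive stripes) survives two failures only if the dead drives land in different stripes, and the space-efficient RAID 60 survives four failures only if they split two and two.

% Quick closed-form checks (MATLAB), which land right on top of the simulated numbers above
p_raid50_2dead = nchoosek(6,1) * nchoosek(6,1) / nchoosek(12,2)    % 36/66   -> about 55%
p_raid60_4dead = nchoosek(6,2) * nchoosek(6,2) / nchoosek(12,4)    % 225/495 -> about 45%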

 

Clearly our ability to survive drive failures takes a big hit, but we do gain more capacity. This is the tradeoff you face when deciding how to set up a striped RAID volume: redundancy vs. capacity. Performance does change between the two configurations, but only for certain workloads, and rebuild time should be considered as well: as parity and total capacity increase, rebuild times increase.

 

I'd like to do one more comparison, this time with 24 drives...

 

 

With 24 drives and the minimum number of drives per stripe, RAID 10 has 12 redundant drives, RAID 50 has 8, RAID 60 has 12, and RAID 70 has 12.
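
In the terms of the earlier sketch, these 24-drive minimum-stripe-size layouts would be described as something like this (the variable names are just mine):

% 24 drives, minimum drives per stripe
raid10_sizes = 2 * ones(1,12);  raid10_tol = 1 * ones(1,12);   % 12 two-drive mirrors
raid50_sizes = 3 * ones(1, 8);  raid50_tol = 1 * ones(1, 8);   % eight RAID 5 stripes, 8 parity drives
raid60_sizes = 4 * ones(1, 6);  raid60_tol = 2 * ones(1, 6);   % six RAID 6 stripes, 12 parity drives
raid70_sizes = 6 * ones(1, 4);  raid70_tol = 3 * ones(1, 4);   % four RAID-Z3 stripes, 12 parity drives

% for example, RAID 60 with four dead drives out of 24
simulate_survival(raid60_sizes, raid60_tol, 4, 20000)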

 

[Plot: survival rate vs. number of failed drives for the 24-drive configurations with minimum drives per stripe]

 

Looks pretty similar to the previous high-redundancy configuration, doesn't it? However, the survival rate is dramatically improved. For example, the likelihood of RAID 60 surviving four dead drives is now 95%, rather than 80%. RAID 70 can survive 6 drive failures 93% of the time, versus 45% of the time before. Even RAID 50 looks better, surviving two drive failures more than 90% of the time versus about 80% before.

 

 

Just like the last space-efficient setup, the parity-based levels end up with the same number of redundant drives they had at 12 drives. Let's see how they are affected as the number of dead drives goes past 7.

 

[Plot: survival rate vs. number of failed drives for the space-efficient 24-drive configurations]

 

We can't harp on RAID 50, 60 and 70 too much here, since their layouts changed while RAID 10's effectively didn't: a RAID 10 is always two-drive mirrors, so its space efficiency never changes. That is why RAID 10 overtakes every other configuration in redundancy. However, the likelihood of a RAID 10 actually surviving that many drive failures is very low.

 

For the most part, the other configurations didn't gain or lose any redundancy going to more drives. Space efficiency did go up from the 12-drive scenario, but so did the rate of drive failure.

 

Admittedly these are the extreme ends of the spectrum. Choosing a nice middle ground will get you a balance of space efficiency and redundancy in your storage. Maybe 6 drives per stripe in a RAID 50 or 8 drives per stripe in a RAID 60 with 24 drives would be more balanced...
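
In the same terms, those middle-ground layouts would look something like this (again just a sketch with the hypothetical function from earlier):

% 24 drives, middle-ground stripe sizes
raid50_mid_sizes = 6 * ones(1,4);  raid50_mid_tol = 1 * ones(1,4);   % four 6-drive RAID 5 stripes
raid60_mid_sizes = 8 * ones(1,3);  raid60_mid_tol = 2 * ones(1,3);   % three 8-drive RAID 6 stripes

simulate_survival(raid50_mid_sizes, raid50_mid_tol, 2, 20000)   % e.g. two dead drives
simulate_survival(raid60_mid_sizes, raid60_mid_tol, 4, 20000)   % e.g. four dead drives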

 

[Plot: survival rate vs. number of failed drives for the 24-drive middle-ground RAID 50 (6 drives per stripe) and RAID 60 (8 drives per stripe)]

 

It's really up to you. Good luck with your storage build, and make sure you plan ahead.



Is it weird that there is a stack of floppy disks by my keyboard while I'm reading a thread on RAID?



Very cool and detailed post, thanks for taking the time to write this.



By the way, if you want to calculate survival rates for different configurations, here is a good calculator.

 

http://raid-failure.com/raid10-50-60-failure.aspx

 

Their calculations match up with my experimental data, so I can recommend it.

 

EDIT: Their calculations for RAID 60 with 4 drives failing don't match up. However, everything else appears to be okay.


