RAID 10 Performance Considerations in HPC Setup

genome-guy · February 4, 2018

So I wanted to bounce something off the group....

In our research lab, we're currently setting up a private micro-HPC environment due to some functionality issues with our university's central HPC cluster. We deal with very large volumes of raw genomic data, which is both extremely data-dense and resource-intensive to process.

Based on our computing needs (which are highly parallelized), we have specced out 4x [2x 32-core Epyc] Thinkmate servers (256 cores total), each linked through 40Gb QSFP+ (via a top-of-rack switch) to 2x 36-drive Thinkmate storage servers. Each storage server will be filled to capacity with 12 TB drives in RAID 10, for a usable-storage total of ~216 TB each (I should note that these will contain distinct data sets, so we do not intend to cluster them).

The type of data analysis we do is extremely I/O intensive, and as such RAID 10 is the obvious choice for us (the ability to rebuild quickly in the event of a drive failure is also imperative).

What isn't so obvious for us is how well the RAID 10 will scale in terms of read/write performance. "In theory" my understanding is that it should scale more-or-less linearly (R:N & W:N/2), i.e. given each enterprise drive maxes out at ~250MB/s, we would pretty much just saturate the 40Gb QSFP+ links at a sustained read/write speed of 4.5-5GB/s.
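A rough sketch of that arithmetic, assuming ideal linear scaling, the ~250MB/s per-drive figure above, and a raw 40Gb/s line rate with no protocol overhead. The function names are made up for illustration, and none of these numbers are measurements:

```python
# Back-of-the-envelope RAID 10 throughput vs. link capacity.
# Every figure here is an estimate from the post, not a benchmark result.

def raid10_throughput_mb_s(drives, per_drive_mb_s):
    """Return (read, write) ideal sequential throughput in MB/s.

    Reads can be spread over all N drives; every write lands on both
    halves of a mirror, so only N/2 drives' worth of bandwidth counts.
    """
    read = drives * per_drive_mb_s
    write = (drives // 2) * per_drive_mb_s
    return read, write


def link_capacity_mb_s(gigabits_per_s):
    """Convert a raw line rate in Gb/s to MB/s (decimal units, no overhead)."""
    return gigabits_per_s * 1000 / 8


if __name__ == "__main__":
    read, write = raid10_throughput_mb_s(drives=36, per_drive_mb_s=250)
    link = link_capacity_mb_s(40)
    print(f"ideal array read:  {read / 1000:.1f} GB/s")
    print(f"ideal array write: {write / 1000:.1f} GB/s")
    print(f"40Gb link:         {link / 1000:.1f} GB/s")
    # Whatever the array can do, clients never see more than the link allows.
    print(f"ceiling over the link: read {min(read, link) / 1000:.1f} GB/s, "
          f"write {min(write, link) / 1000:.1f} GB/s")
```

Under these assumptions the array's ideal read rate is well above what one 40Gb link can carry, while the ideal write rate sits just under it, which is where the 4.5-5GB/s range comes from.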

So my question is, for people who have experience building large RAID 10 arrays, how well do read/write speeds actually scale? Are my estimates way off from real-world results? (And for that matter, how good is real-world 40GbE throughput?)

The other option we are considering is adding ~8TB NVMe SSDs to each server to act as local scratch space, but that obviously would greatly increase our costs.

GDRRiley · February 4, 2018

Don't use RAID 10; use RAID 60 if anything. If you need this much fast storage, consider SSDs. While they are expensive, they are best for high random IOPS. Then back that up with a large HDD array.

As for SSDs, look at U.2 (shown on Newegg as PCIe in 2.5in) or at enterprise SATA/SAS drives.
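For a sense of the capacity trade-off between the two layouts, here is a minimal sketch using the 36 x 12 TB servers from the original post. The RAID 60 group counts are assumptions picked for illustration (nobody in the thread specifies one), and the helper names are hypothetical:

```python
# Usable capacity of one 36-drive server under RAID 10 vs. RAID 60.
# Ignores filesystem overhead and hot spares.

def raid10_usable_tb(drives, drive_tb):
    """Mirrored pairs: half of the raw capacity is usable."""
    return (drives // 2) * drive_tb


def raid60_usable_tb(drives, drive_tb, groups):
    """Stripe across `groups` RAID 6 sets; each set gives up 2 drives to parity."""
    if drives % groups != 0:
        raise ValueError("drives must divide evenly into groups")
    per_group = drives // groups
    if per_group < 4:
        raise ValueError("a RAID 6 set needs at least 4 drives")
    return (per_group - 2) * groups * drive_tb


if __name__ == "__main__":
    print("RAID 10:             ", raid10_usable_tb(36, 12), "TB")
    # Group counts below are assumed, not taken from the thread.
    for groups in (2, 3):
        print(f"RAID 60 ({groups} x RAID 6):", raid60_usable_tb(36, 12, groups), "TB")
```

This only covers capacity. Rebuild time and write performance, which the original post cares about, are not modeled here.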

leadeater · February 4, 2018

Have a look at using Ceph instead. Traditional RAID will not scale and is not suited to HPC, whereas Ceph is very well suited to it.

You also have the minimum number of servers required to start a Ceph cluster, which is 3. You can run Ceph and the computation on the same servers, then later look to move the Ceph storage off to dedicated single-socket storage nodes.

You'll get much more performance with Ceph as it is distributed across nodes and easily scales out.

You'll want a minimum of 2 SSDs per server in RAID 1 for the journal for the HDDs. You can also tier the storage using SSD OSDs and HDD OSDs so active data has lower latency.

If you want to use erasure coding to get more usable storage, tiering is highly recommended. Also get Mellanox NICs that have erasure coding offload engines to greatly reduce the CPU load, which would be very important for you if you share the computation and storage on the same server nodes.

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
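To put a number on "more usable storage" with erasure coding, here is a rough comparison of usable capacity under replication and under a k+m erasure-coded pool. The 864 TB raw figure is simply the two 36 x 12 TB servers from the original post, the 4+2 profile is an assumed example, and the helper names are hypothetical. Real clusters also need headroom for recovery, which this ignores:

```python
# Usable capacity: replicated pool vs. erasure-coded pool (idealized).

def replicated_usable_tb(raw_tb, copies):
    """Every object is stored `copies` times."""
    return raw_tb / copies


def erasure_coded_usable_tb(raw_tb, k, m):
    """Each object is split into k data chunks plus m coding chunks."""
    return raw_tb * k / (k + m)


if __name__ == "__main__":
    raw_tb = 2 * 36 * 12  # both storage servers from the original post
    print(f"raw:                 {raw_tb} TB")
    print(f"3x replication:      {replicated_usable_tb(raw_tb, 3):.0f} TB")
    # k=4, m=2 is an assumed profile, not one recommended in the thread.
    print(f"erasure coding 4+2:  {erasure_coded_usable_tb(raw_tb, 4, 2):.0f} TB")
```

The extra capacity is paid for in CPU time spent computing coding chunks, which is the load the offload-capable NICs mentioned above are meant to reduce.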

leadeater · February 4, 2018

16 minutes ago, genome-guy said:

In our research lab, we're currently setting up a private micro-HPC environment due to some functionality issues with our university's central HPC cluster. We deal with very large volumes of raw genomic data, which is both extremely data-dense and resource-intensive to process.

Are you able to explain what these are?

genome-guy · February 4, 2018

5 minutes ago, leadeater said:

Are you able to explain what these are?

I presume you're referring to the functionality issues? There are a number of things, but the deal-breaker for us is the very antiquated campus network setup that severely bottlenecks data flow from our current local storage servers to the cluster's temp storage. Because we may be transferring 10s of terabytes at any one time, we spend probably just as much time transferring data back and forth as we do processing. This results in huge delays in getting results back. We can't relocate our primary storage arrays to the HPC cluster for a number of reasons I can't really get into here.
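To illustrate why the link speed dominates here, a quick sketch of how long a bulk transfer takes at different rates. The campus network speed isn't stated anywhere in this thread, so the rates below are hypothetical, and `efficiency` is an assumed fudge factor for protocol overhead:

```python
# Time to move a dataset over a network link (idealized).

def transfer_hours(terabytes, gigabits_per_s, efficiency=1.0):
    """Hours to move `terabytes` (decimal TB) at a given line rate.

    `efficiency` is the assumed fraction of the line rate actually achieved.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_s * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # 10 TB stands in for "10s of terabytes"; link rates are hypothetical.
    for rate in (1, 10, 40):
        print(f"10 TB at {rate:>2} Gb/s: {transfer_hours(10, rate):.1f} h")
```

At an assumed 1 Gb/s, a single 10 TB transfer takes the better part of a day before any processing starts.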

leadeater · February 4, 2018

Thanks, exactly the type of information I was looking for.

Do you think you'll be expanding the setup beyond what is indicated? The two storage servers and 4 compute nodes is a very simple setup and easy to maintain, not much to go wrong and almost anyone can diagnose issues.

If you are likely to expand, this is where you'll start to notice the limitations of the storage servers and RAID. Are you talking about hardware RAID or ZFS, though? Hardware RAID gives excellent performance in smaller arrays (24 disks and below), but the performance gain per disk in the array tapers off rather quickly. ZFS scales better in that respect.

Ceph is the next logical iteration beyond ZFS storage servers and scale-up-only designs. I very much doubt that you'll be able to get 40Gbps out of a single storage server, but 20Gbps should easily be achievable.

We have a number of departments doing exactly what you are going to do, buying storage and compute servers for research (genomics, geological, eco & finance, etc.), and we're looking to start up a Ceph + OpenStack eResearch cluster of our own to offer out to these departments. Another department on one of our other campuses runs its own Hadoop cluster, but it's already a hodgepodge of new and older hardware and we have nothing to do with it; they hate ITS.

genome-guy · February 4, 2018

We were initially thinking either ZFS (striped mirrored vdevs - quasi-RAID 10) or Btrfs RAID 10 (on CentOS). We were leaning towards the latter, as we're already spending a small fortune on memory for the compute nodes. But in theory we could supercharge the storage nodes with a ton of RAM for ZFS.

leadeater · February 4, 2018

If you're not doing a large percentage of writes you shouldn't have to throw in lots of RAM; sequential reads of large data sets, and even writes, should perform really well even with 32GB of RAM. It's something you should be able to test once you get the hardware anyway: hope to use one, with the other as a fallback if not.

https://calomel.org/zfs_raid_speed_capacity.html

Erkel · February 4, 2018

RAID 60.

I think one of the main things to consider is the size of the active dataset. If disk is a bottleneck, you do not want to be touching mechanical disks as part of the actual processing if you can avoid it; I try to run everything cached in RAM if possible.

You want to use as much as possible of each of these, in this order, before hitting disk:

RAM

NVRAM on raid controller

SSD caching / tiering
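A simplified way to picture that ordering is a spill model: fill each tier in turn and see how much of the active dataset is left for the mechanical disks. The capacities below are hypothetical, the function name is made up, and real caches do not fill this neatly, so treat this as a sizing aid only:

```python
# Spill an active dataset through cache tiers, fastest first.
# Whatever does not fit in the listed tiers ends up on HDD.

def place_active_set(active_gb, tiers):
    """Return {tier name: GB placed there}, plus an 'HDD' entry for the remainder.

    `tiers` is a list of (name, capacity_gb) pairs in priority order.
    """
    remaining = active_gb
    placement = {}
    for name, capacity_gb in tiers:
        used = min(remaining, capacity_gb)
        placement[name] = used
        remaining -= used
    placement["HDD"] = remaining
    return placement


if __name__ == "__main__":
    # Capacities are assumed for illustration only.
    tiers = [("RAM", 256), ("NVRAM", 8), ("SSD", 8000)]
    for tier, gb in place_active_set(2000, tiers).items():
        print(f"{tier:>5}: {gb} GB")
```

If the "HDD" entry comes out non-zero for your real active dataset, that is the portion of the working set that will be served at mechanical-disk speeds.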
