RAID 10 Performance Considerations in HPC Setup

Solved by leadeater.
So I wanted to bounce something off the group....

 

In our research lab, we're currently setting up a private micro-HPC environment due to some functionality issues with our university's central HPC cluster. We deal with very large volumes of raw genomic data, which is both extremely data-dense and resource-intensive to process.

 

Based on our computing needs (which are highly parallelized), we have specced out 4x [2x 32-core Epyc] Thinkmate servers (256 cores total), each linked through 40Gb QSFP+ (via a top-of-rack switch) to 2x 36-drive Thinkmate storage servers. Each storage server will be filled to capacity with 12 TB drives in RAID 10, for a usable-storage total of ~216 TB each (I should note that these will contain distinct data sets, so we do not intend to cluster them).
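As a quick sanity check on those numbers, here is a minimal sketch that uses only the figures quoted above and assumes RAID 10 leaves exactly half of the raw capacity usable (filesystem overhead is ignored):

```python
# Figures taken from the post; the only assumption is that RAID 10
# usable capacity is exactly half of raw (no filesystem overhead).
COMPUTE_NODES = 4
SOCKETS_PER_NODE = 2
CORES_PER_SOCKET = 32
DRIVES_PER_STORAGE_SERVER = 36
DRIVE_TB = 12


def total_cores(nodes, sockets, cores):
    """Total core count across all compute nodes."""
    return nodes * sockets * cores


def raid10_usable_tb(drives, drive_tb):
    """RAID 10 mirrors every drive, so usable space is half of raw."""
    return drives * drive_tb / 2


cores = total_cores(COMPUTE_NODES, SOCKETS_PER_NODE, CORES_PER_SOCKET)
raw_tb = DRIVES_PER_STORAGE_SERVER * DRIVE_TB
usable_tb = raid10_usable_tb(DRIVES_PER_STORAGE_SERVER, DRIVE_TB)

print(f"cores: {cores}")
print(f"raw per storage server: {raw_tb} TB")
print(f"usable per storage server (RAID 10): {usable_tb:.0f} TB")
```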

 

The type of data analysis we do is extremely I/O intensive, so RAID 10 is the obvious choice for us (the ability to rebuild quickly in the event of a drive failure is also imperative).

 

What isn't so obvious to us is how well the RAID 10 will scale in terms of read/write performance. In theory, my understanding is that it should scale more or less linearly (reads: N drives, writes: N/2 drives). Given that each enterprise drive maxes out at about 250MB/s, we would pretty much saturate the 40Gb QSFP+ links at a sustained read/write speed of 4.5-5GB/s.
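To spell out that estimate, here is a rough sketch of the ideal-case arithmetic. It assumes perfectly linear scaling, decimal units (1 GB = 1000 MB), and no protocol or filesystem overhead, so treat the results as upper bounds rather than predictions:

```python
def raid10_throughput_mb_s(drives, per_drive_mb_s):
    """Ideal RAID 10 scaling: reads use all N drives, writes use N/2."""
    reads = drives * per_drive_mb_s
    writes = (drives // 2) * per_drive_mb_s
    return reads, writes


def link_capacity_mb_s(gbit_per_s):
    """Raw line rate of a network link in MB/s (no protocol overhead)."""
    return gbit_per_s * 1000 / 8


reads, writes = raid10_throughput_mb_s(drives=36, per_drive_mb_s=250)
link = link_capacity_mb_s(40)

# Whichever is slower, the array or the link, caps the sustained rate.
effective_read = min(reads, link)
effective_write = min(writes, link)

print(f"array reads:  {reads} MB/s, array writes: {writes} MB/s")
print(f"40Gb link:    {link:.0f} MB/s")
print(f"ideal sustained read:  {effective_read:.0f} MB/s")
print(f"ideal sustained write: {effective_write:.0f} MB/s")
```

Under these assumptions reads are capped by the link at 5 GB/s and writes by the array at 4.5 GB/s, which is where the 4.5-5GB/s range comes from.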

 

So my question is, for people who have experience building large RAID 10 arrays, how well do read/write speeds actually scale? Are my estimates way off from real-world? (And for that matter, how good is real-world 40GbE throughput performance?)

 

The other option we are considering is adding ~8TB NVMe SSDs to each server to act as local scratch space, but that would obviously increase our costs considerably.

 

 

Link to post
Share on other sites

Don't use RAID 10; use RAID 60 if anything. If you need this much fast storage, consider SSDs. They are expensive, but they are the best option for high random IOPS. Then back that up with a large HDD array.

As for SSDs, look at U.2 drives (shown on Newegg as PCIe in 2.5in) or at enterprise SATA/SAS drives.


Link to post
Share on other sites

Have a look at using Ceph instead. Traditional RAID will not scale and is not suited to HPC, whereas Ceph is very well suited to it.

You also have the minimum number of servers required to start a Ceph cluster, which is 3. You can run Ceph and the computation on the same servers, then later look to move the Ceph storage off to dedicated single-socket storage nodes.

You'll get much more performance with Ceph as it is distributed across nodes and easily scales out.

You'll want a minimum of 2 SSDs per server in RAID 1 for the journal for the HDDs. You can also tier the storage using SSD OSDs and HDD OSDs so active data has lower latency.

If you want to use erasure coding to get more usable storage, tiering is highly recommended. Also get Mellanox NICs that have erasure coding offload engines to greatly reduce the CPU load, which would be very important for you if you share computation and storage on the same server nodes.

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
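To illustrate why erasure coding yields more usable storage than replication, here is a small sketch. The raw capacity reuses the 2x 36x 12 TB drives from the original post purely as an example figure; the 3-replica pool and the k=4, m=2 erasure-code profile are assumptions chosen for illustration, not a recommendation from this thread:

```python
def replicated_usable_tb(raw_tb, replicas):
    """A replicated pool stores every object `replicas` times."""
    return raw_tb / replicas


def erasure_coded_usable_tb(raw_tb, k, m):
    """An erasure-coded pool stores k data chunks plus m coding chunks."""
    return raw_tb * k / (k + m)


# Example raw capacity: 2 servers x 36 drives x 12 TB (from the original post).
RAW_TB = 2 * 36 * 12

rep = replicated_usable_tb(RAW_TB, replicas=3)   # assumed replica count
ec = erasure_coded_usable_tb(RAW_TB, k=4, m=2)   # hypothetical EC profile

print(f"raw: {RAW_TB} TB")
print(f"3x replication usable: {rep:.0f} TB")
print(f"EC 4+2 usable:         {ec:.0f} TB")
```

Both layouts in this example tolerate the loss of two chunks or copies, but the erasure-coded pool pays for it in CPU time, which is the load the offload engines mentioned above are meant to reduce.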

Link to post
Share on other sites

16 minutes ago, genome-guy said:

In our research lab, we're currently setting up a private micro-HPC environment due to some functionality issues with our university's central HPC cluster. We deal with very large volumes of raw genomic data, which is both extremely data-dense and resource-intensive to process.

Are you able to explain what these are?

Link to post
Share on other sites

5 minutes ago, leadeater said:

Are you able to explain what these are?

I presume you're referring to the functionality issues? There are a number of things, but the deal-breaker for us is the very antiquated campus network setup, which severely bottlenecks data flow from our current local storage servers to the cluster's temp storage. Because we may be transferring tens of terabytes at any one time, we probably spend just as much time transferring data back and forth as we do processing it. This results in huge delays in getting results back. We can't relocate our primary storage arrays to the HPC cluster for a number of reasons I can't really get into here.
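To give a feel for the scale of the problem, here is a rough sketch of transfer times for 10 TB. The link speeds are assumptions for illustration only (the actual campus link speed isn't stated here), and the calculation assumes the link runs at full line rate with no overhead:

```python
def transfer_hours(terabytes, link_gbit_per_s, efficiency=1.0):
    """Hours needed to move `terabytes` (decimal TB) over a link.

    `efficiency` is the fraction of line rate actually achieved;
    1.0 means an ideal link with no protocol overhead.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 3600


# Hypothetical link speeds in Gb/s, chosen only for comparison.
for gbit in (1, 10, 40):
    print(f"10 TB over {gbit:>2} Gb/s: {transfer_hours(10, gbit):.1f} h")
```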

Link to post
Share on other sites

Thanks, exactly the type of information I was looking for.

 

Do you think you'll be expanding the setup beyond what is indicated? Two storage servers and 4 compute nodes make for a very simple setup that is easy to maintain: not much can go wrong and almost anyone can diagnose issues.

 

If you are likely to expand, this is where you'll start to notice the limitations of the storage servers and RAID. Are you talking about hardware RAID or ZFS, though? Hardware RAID gives excellent performance in smaller arrays, 24 disks and below, but the performance gain per disk in the array tapers off rather quickly. ZFS does a better job at that.

 

Ceph is the next logical iteration beyond ZFS storage servers and scale-up-only designs. I very much doubt that you'll be able to get 40Gbps out of a single storage server, but 20Gbps should easily be achievable.

 

We have a number of departments doing exactly what you are going to do, buying storage and compute servers for research (genomics, geological, eco&finance etc), and we're looking to start up a Ceph + OpenStack eResearch cluster of our own to offer out to these departments. We also have another department on one of our other campuses running their own Hadoop cluster, but it's already a hodgepodge of new and older hardware and we have nothing to do with it; they hate ITS :P.

Link to post
Share on other sites

We were initially thinking either ZFS (striped mirrored vdevs - quasi-RAID 10) or Btrfs RAID 10 (on CentOS). We were leaning towards the latter, as we're already spending a small fortune on memory for the compute nodes :P . But in theory we could supercharge the storage nodes with a ton of RAM for ZFS.
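For reference, the "striped mirrored vdevs" layout just means pairing the drives into two-way mirrors and letting ZFS stripe across the pairs. Here is a small sketch that builds the corresponding `zpool create` command line for 36 drives. The pool name `tank` and the `disk00`..`disk35` device names are hypothetical placeholders, and the script only prints the command rather than running it:

```python
def mirror_pairs(devices):
    """Group devices into consecutive two-way mirror pairs."""
    if len(devices) % 2 != 0:
        raise ValueError("need an even number of devices for 2-way mirrors")
    return [(devices[i], devices[i + 1]) for i in range(0, len(devices), 2)]


def zpool_create_command(pool, devices):
    """Build a `zpool create` command with one mirror vdev per pair.

    ZFS stripes across all top-level vdevs, so a pool made of many
    mirror vdevs behaves roughly like RAID 10.
    """
    parts = ["zpool", "create", pool]
    for first, second in mirror_pairs(devices):
        parts += ["mirror", first, second]
    return " ".join(parts)


# Hypothetical device names; real systems would use stable by-id paths.
devices = [f"disk{i:02d}" for i in range(36)]
cmd = zpool_create_command("tank", devices)

print(f"{len(mirror_pairs(devices))} mirror vdevs")
print(cmd)
```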

Link to post
Share on other sites

If you're not doing a large percentage of writes you shouldn't have to throw in lots of RAM; large sequential reads, and even writes, should perform really well even with 32GB of RAM. It's something you should be able to test once you get the hardware anyway: plan to use one option, with the other as a fallback if it doesn't work out.

https://calomel.org/zfs_raid_speed_capacity.html

Link to post
Share on other sites

RAID 60.

I think one of the main things to consider is the size of the active dataset. If the disks are a bottleneck, you do not want to be touching mechanical disks during the actual processing if you can avoid it; I try to run everything cached in RAM if possible.

You want to use as much as possible of these before hitting disk, in this order:

 

RAM

NVRAM on raid controller

SSD caching / tiering
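The ordering above can be sketched as a simple lookup that tries each tier from fastest to slowest and only falls through to disk on a miss. This is a toy illustration with plain dictionaries standing in for each tier, not how any particular controller or filesystem implements its caches:

```python
def read_block(block_id, tiers):
    """Return (tier_name, data) from the first tier holding the block.

    `tiers` is an ordered list of (name, store) pairs, fastest first.
    """
    for name, store in tiers:
        if block_id in store:
            return name, store[block_id]
    raise KeyError(block_id)


# Toy tiers: the HDD tier holds everything, faster tiers hold subsets.
tiers = [
    ("RAM", {"a": b"hot"}),
    ("controller NVRAM", {"b": b"warm"}),
    ("SSD cache", {"c": b"cool"}),
    ("HDD", {"a": b"hot", "b": b"warm", "c": b"cool", "d": b"cold"}),
]

for block in ["a", "b", "c", "d"]:
    tier, _ = read_block(block, tiers)
    print(f"block {block} served from {tier}")
```

The larger the share of the active dataset that fits in the upper tiers, the less often the lookup reaches the mechanical disks.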

 

 

Link to post
Share on other sites
