alpha754293

Crazy NAS/data/file server build


 

8 hours ago, alpha754293 said:

Ergo, with about 6 SAS 12 Gbps drives, you're already overloading the bus. Even if you assume that the bus is only about 80% efficient, that means that with 7 SAS 12 Gbps drives, you're at 84 Gbps.

The actual throughput of SAS SSDs isn't 12 Gbps, though; even the better ones only burst to around 9 Gbps and won't sustain it for long. Interface connection speed is a link rating, not a measure of actual bandwidth utilization or of whether a device with that interface could actually use it.

 

10 hours ago, alpha754293 said:

I mean, this is the motivation with NVMe, right?

Not really. NVMe is about increasing IOPS, because the SATA/SAS protocol is the limiting factor in 4 KB/64 KB random I/O performance. SSDs got to the point where they were faster than SATA/SAS could do the protocol translations.

 

If your I/O block size (not file size) is large, then NVMe won't actually give you a speed increase, and you're just spending money on more expensive SSDs that won't really be much better for you. You need to figure out your I/O profile first, before deciding what storage devices you are going to use and how to configure them.
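(A quick way to get a first feel for an I/O profile without buying anything is to time sequential vs. random 4 KiB writes. A rough Python sketch; the block count and sizes are arbitrary placeholders, and a real characterization should use fio:)

```python
import os, random, tempfile, time

BLOCK = 4 * 1024   # 4 KiB blocks, matching the random-I/O sizes discussed above
COUNT = 4096       # ~16 MiB total so the test stays quick

buf = os.urandom(BLOCK)

def timed_writes(shuffled: bool) -> float:
    """Write COUNT blocks sequentially or at shuffled offsets; return seconds."""
    offsets = list(range(COUNT))
    if shuffled:
        random.shuffle(offsets)
    with tempfile.TemporaryFile() as f:
        f.truncate(BLOCK * COUNT)          # pre-size the file
        start = time.perf_counter()
        for i in offsets:
            f.seek(i * BLOCK)
            f.write(buf)
        f.flush()
        os.fsync(f.fileno())               # force the data to the device
        return time.perf_counter() - start

seq = timed_writes(False)
rnd = timed_writes(True)
print(f"sequential: {BLOCK * COUNT / seq / 1e6:.0f} MB/s, "
      f"random 4K: {BLOCK * COUNT / rnd / 1e6:.0f} MB/s")
```

If the two numbers come out close, your workload's bottleneck is bandwidth, not IOPS, and the NVMe premium buys little.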

 

You also need to weigh cost-benefit against overall run time: is spending 60%-200% more for an overall 20% speed increase worth it, for example? How long is the job's compute run time, how long are the file access times, and what proportion of the total does each make up? Are you focusing a lot of time and funds on 20% of the total job time?
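To put numbers on that trade-off, a quick Amdahl-style sketch (the 20% I/O fraction and 5x disk speedup are made-up placeholders, not measurements):

```python
def overall_speedup(io_fraction: float, io_speedup: float) -> float:
    """Amdahl's law: only the I/O fraction of total job time gets faster."""
    return 1.0 / ((1.0 - io_fraction) + io_fraction / io_speedup)

# Hypothetical job: 80% compute, 20% file I/O.
# Even a 5x faster scratch disk shortens the whole job by only ~19%.
s = overall_speedup(io_fraction=0.20, io_speedup=5.0)
print(f"overall speedup: {s:.2f}x")   # -> overall speedup: 1.19x
```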

 

For example, if we look at dedicated storage arrays, the difference between all-flash NVMe and all-flash SAS isn't always that big; the type of workload makes one much better than the other, but for workloads that don't benefit, not so much.

 

NetApp AFF A800 (NVMe)

Quote

For our VDBench workloads, the NetApp AFF A800 continued to shine. Highlights include 1.2 million IOPS in 4K read, 439K IOPS in 4K write, 18.9GB/s in sequential 64K read, and 5.03GB/s in 64K write.

https://www.storagereview.com/netapp_aff_a800_review

 

NetApp AFF A300 (SAS)

Quote

The A300 put up some impressive numbers highlights include random 4K peak performances of 635K IOPS read and nearly 209K IOPS write. For 64K sequential, the array hit 5.71GB/s read and 3.1GB/s write.

https://www.storagereview.com/netapp_aff_a300_review

 

Something like triple the cost for only ~2 GB/s more sequential write performance, which is the part that is of interest to you.

Posted · Original Poster (OP)
22 hours ago, Windows7ge said:

This begs the question is there any way to make the work that's being performed more efficient in such a way that the required bandwidth can be reduced?

 

Kind of like instead of trying to source the worlds most high-end hardware instead try and make the application more efficient. Does this software have any wiggle room for that? Or have you quite possibly already done what you can in that department?

It's COTS software. I'm sure that it's already as efficient as it can possibly be for the types of analyses it performs.

 

(It's in the nature of the type of problem/solution/solver. In-core sparse direct solvers are known to be memory hungry and, as such, are also scratch-space hungry. And since I only have four half-width nodes in a 2U chassis (Supermicro Twin^2), if I can offload the scratch I/O from the local system, which only has SATA 6 Gbps, to NVMe-oF, in the hope that it will be faster than local SATA 6 Gbps, then all the better.)

 

If you've ever run the Intel Burn Test on Very High, it eats up a lot of memory because it is solving the linear algebra system Ax=b; this is (loosely) what this is about, except with a sparse solver. Same basic gist/principle, though.

Posted · Original Poster (OP)
21 hours ago, Electronics Wizardy said:

What was the full config of this test? I have gotten over these speeds with just SMB, RAID 0 SATA3 SSDs, no TRIM, no special config, no RDMA.

 

That ain't gonna happen. Those $100k systems do pay for things.

 

What type of cluster? What OS are these nodes running?

 

 

Have you tried to set up Storage Spaces Direct? A cluster seems to be the way to go here. I'd get like 4 dual-2011 nodes here for storage, put as much RAM as you can on all of them, and cheaper SATA SSDs. That's the best you're gonna get for $6k. You won't get the performance you want for $6k; you can't even buy the drives for that amount to take in that much data.

 

 

 

I'm not sure what you mean by "full config" of this test.

 

I literally installed SLES12 SP1, installed the MLNX_OFED Linux drivers for it, assigned IPs to the ports, mounted the RAM drive using either tmpfs or ramfs on both systems (source and target), copied a 50 GB file (one of my project archive files) over from my network server to the source server, and sftp'd it over to the target.

 

I've also tried setting up the RAM drive as an NFS share: enabled the NFS server and client on the source, enabled the NFS client on the target, and tried to copy it over NFS as well. Again, I ran this test with the RAM drive as both tmpfs and ramfs. Same result.

 

Well...you might be surprised.

 

The setup that I've got now, combined, cost less than $10k. New, this would have been like a $30k-40k system. So, you'd be surprised what you are able to accomplish now. And the system is still plenty fast in my mind.

 

re: "Have you tried to setup storage spaces direct?"
No. Again, as I have stated -- not yet, because I would need to purchase/acquire the hardware that we are talking about on here first before I'll be able to do that. The best that I can do right now won't even come remotely close to it, because I'm currently limited to just SATA 6 Gbps connections.

(And again, when you do the math on the bandwidths alone, along with the cost analysis, a distributed storage system would cost more than $6k. From rough mental math, in order to match the bandwidth capacity, you'd have to go with something like 4 storage systems/nodes, all with NVMe drives as well, which takes at least two to three extra 4x EDR IB NICs and cables, and basically everything else too -- chassis (unless you're going with a $20k rackmount solution for the chassis), CPU, cooler, power supplies, etc.)

I'll have to see the data behind the cost-bandwidth-power-cooling analysis to ascertain that and figure out whether or not that's a viable option.

 

re: "Your won't get the performance you want for 6k, you can't even buy the drives for that amount to take in that much data."

The Asus Hyper M.2 x16 v2 card is only $80 on Newegg.

 

A Samsung 970 Pro M.2 2280 1 TB PCIe 3.0 x4 NVMe SSD is only $300 on Newegg (https://www.newegg.com/samsung-970-pro-1tb/p/N82E16820147694).

 

Eight of those will run only $2400, which will give 8 TB of ultra-high-speed, pure NVMe SSD.

 

Remember that the system REPLACES that volume of data per iteration, so it isn't like I am saving multiple copies of that 50 GB data, for example. So it's 50, 50, 50, 50 -- not 50, 100, 150, 200, etc. (The data doesn't accumulate, not until the very end.)

Based on that, I don't see why, with eight NVMe M.2 SSDs, I won't be able to get that.

Posted · Original Poster (OP)
17 hours ago, leadeater said:

 

The actual throughput of SAS SSDs isn't 12 Gbps, though; even the better ones only burst to around 9 Gbps and won't sustain it for long. Interface connection speed is a link rating, not a measure of actual bandwidth utilization or of whether a device with that interface could actually use it.

Not really. NVMe is about increasing IOPS, because the SATA/SAS protocol is the limiting factor in 4 KB/64 KB random I/O performance. SSDs got to the point where they were faster than SATA/SAS could do the protocol translations.

If your I/O block size (not file size) is large, then NVMe won't actually give you a speed increase, and you're just spending money on more expensive SSDs that won't really be much better for you. You need to figure out your I/O profile first, before deciding what storage devices you are going to use and how to configure them.

You also need to weigh cost-benefit against overall run time: is spending 60%-200% more for an overall 20% speed increase worth it, for example? How long is the job's compute run time, how long are the file access times, and what proportion of the total does each make up? Are you focusing a lot of time and funds on 20% of the total job time?

For example, if we look at dedicated storage arrays, the difference between all-flash NVMe and all-flash SAS isn't always that big; the type of workload makes one much better than the other, but for workloads that don't benefit, not so much.

NetApp AFF A800 (NVMe)

https://www.storagereview.com/netapp_aff_a800_review

NetApp AFF A300 (SAS)

https://www.storagereview.com/netapp_aff_a300_review

Something like triple the cost for only ~2 GB/s more sequential write performance, which is the part that is of interest to you.

re: NVMe
Interesting - did not know that about the history and the comparison of AHCI vs. NVMe.

 

For some of the file sizes that we're talking about here (for my smaller runs), the total size of the file in the scratch directory is about 10 GB.

 

re: "You also need to factor in cost benefit against overall run time, is spending 60%-200% more for an overall 20% speed increase worth it for example. How long is the job compute run time, how long is the file access times, what proportion does this make up. Are you focusing a lot of time and funds on 20% of the total job time etc."

 

Well...see...that's the thing - until I actually build and deploy the system for testing, I actually won't know that.

 

There is nothing in my entire overall hardware infrastructure that can take on anything more or faster than SATA 6 Gbps. I currently don't have any NVMe devices at all (in testing or otherwise), which is why so much of the analysis is based on theoretical throughput rather than actuals. If I had it, I would probably already be testing it by now.

 

Again, right now, the entire main-line storage subsystem infrastructure is all SATA 6 Gbps, which means that even with a RAID server, the best that I would be able to do is around 48 Gbps. But that also currently has all mechanical, large-capacity hard drives (and getting eight 1 TB Samsung 860 Pro SSDs is only $18 cheaper each than the 1 TB Samsung 970 Pro NVMe SSD). Also, none of my current systems are able to accept NVMe drives either. So for me, this would be a storage subsystem paradigm shift away from SATA 6 Gbps with mechanically rotating hard drives over to an all-NVMe SSD setup.

 

Therefore, I don't have a way of testing it until the solution is deployed into testing. Running it with my current setup would be pointless, because I know that my SATA RAID array is limited by the fact that it's using mechanically rotating hard drives, and it tops out at 1.2 Gbps. (This was tested with a 10 GbE connection. The mechanically rotating drives just can't write the data that fast.)

 

However, I also know that there is, of course, a huge difference when I went from previously running with said rotating hard disks to running it on SSDs.

 

And this build (and the research going on right now behind it) will be the basis for the next jump in performance, going away from SATA 6 Gbps SSDs to NVMe-based SSDs. That is what this is trying to ascertain.

 

Thank you.

48 minutes ago, alpha754293 said:

literally installed SLES12 SP1, installed the MLNX_OFED linux drivers for it, assign IP to the ports, mount the RAM drive using either tmpfs or ramfs on both systems (source and target), copy a 50 GB file (one of my project archive files) over from my network server to the source server, and sftp it over to the target.

Why are you using sftp? That's not good for this use.
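(For context: sftp runs over SSH, so every byte is encrypted, which usually caps throughput well below the link rate. A raw-TCP loopback test gives an upper bound to compare any transfer tool against; a self-contained Python sketch, with an arbitrary payload size:)

```python
import socket
import threading
import time

PAYLOAD = 128 * 1024 * 1024   # 128 MiB of zeros, loopback only
CHUNK = 1 << 20               # 1 MiB send/recv buffer

def sink(srv: socket.socket, total: list):
    """Accept one connection and count every byte received until close."""
    conn, _ = srv.accept()
    with conn:
        got = 0
        while True:
            data = conn.recv(CHUNK)
            if not data:
                break
            got += len(data)
        total.append(got)

srv = socket.socket()
srv.bind(("127.0.0.1", 0))    # ephemeral port
srv.listen(1)
total = []
t = threading.Thread(target=sink, args=(srv, total))
t.start()

buf = bytes(CHUNK)
sent = 0
start = time.perf_counter()
with socket.create_connection(srv.getsockname()) as c:
    while sent < PAYLOAD:
        c.sendall(buf)
        sent += CHUNK
t.join()
elapsed = time.perf_counter() - start
srv.close()
print(f"raw TCP loopback: {total[0] / elapsed / 1e9:.2f} GB/s")
```

If sftp lands far below this number on the same machine, the protocol, not the disk or NIC, is the bottleneck.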

 

49 minutes ago, alpha754293 said:

'm not sure what you mean by "full config" of this test.

Full specs of the test systems? What software are you using?

 

50 minutes ago, alpha754293 said:

I've also tried it where I set up then set up the RAM drive as a NFS share, enabled the NFS server and client on the source, enable the NFS client on the target, and try and copy that over NFS as well. Again, I ran this test with both the RAM drive as tmpfs and also as ramfs. Same result.

What were you using to test the speed? Try using fio.

 

What was CPU usage?

 

52 minutes ago, alpha754293 said:

Remember that the system REPLACES that volume of data per iteration, so it isn't like I am saving multiple copies of that 50 GB data, for example. So it's 50, 50 50 50, not 50 100 150 200, etc. (The data doesn't accumulate, not until the very end.)

How much data were you thinking for the working data set?

 

Have you tried getting something like a 900p as a scratch disk in each of the nodes, then copying the files over to the main server when you're done? I don't see why you want the main scratch disk to be over the network, especially when you're dealing with a low budget; I'd just put the storage in the compute nodes.

 

54 minutes ago, alpha754293 said:

- not yet, because I would need to purchase/acquire the hardware that we are talking about on here first before I'll be able to do that.

Why not rent hardware and see how it goes? Then buy it if it works well; if it doesn't, it won't cost much.

 

55 minutes ago, alpha754293 said:

The best that I will be able to do right now won't even come remotely close to it due to me being limited with just SATA 6 Gbps connections right now.

Do these nodes have a PCIe slot? I'd just put a high-speed SSD in there.

 

22 minutes ago, alpha754293 said:

), the total size of the file is about 10 GB in the scratch directory.

If it's only 10 GB, why not just do this on a local RAM disk and add a bit more RAM to each of the nodes?

3 hours ago, alpha754293 said:

Therefore; I don't have a way of testing it until the solution is deployed into testing. Running it with my current setup would be pointless because I know that my SATA RAID array is limited by the fact that it's using mechanically rotating hard drives, and tops out at 1.2 Gbps. (This was tested with a 10 GbE connection. The mechanically rotating drives just can't write the data that fast.)

1 Gbps or 1 GB/s? SATA RAID can write past what a single 10Gb NIC can do, but the problem with RAID cards is that the write-back cache is the fastest you can ever write; that's why it's automatically disabled for SSD arrays, as LSI etc. found that even a few SSDs are faster.

 

Multiple SATA-based RAIDs, SSD or HDD, can do the job, but you have to overlay a distributed storage system like Gluster on top so you can span that I/O across the RAID cards. It's a pretty cheap way of getting very high throughput, though even spanning across RAID cards you won't be able to get the throughput you are looking at from one box. That's why SDS exists: there isn't anything as a single server/system that is able to do more than a few GB/s over the network, plus it's actually cheaper to build 10 servers that can do 1 GB/s than to build 1 that can do 10 GB/s.

 

The issue with NVMe, and many of them, is that you actually lose a bit of that latency advantage: once you go past the number of slots or U.2 ports natively available, PCIe PLX switches are used, which does two things -- increases latency and splits the bandwidth across devices, as you can't create more bandwidth from nothing. So any server with 24 NVMe bays will perform significantly less than you'd expect, no better than 4 to 6 NVMe devices in sequential performance, and then network overhead eats a lot into that as well.

 

A modest NVMe system is easier to build and more compact, so you should probably still base your design on that; however, expecting to be able to do it with just a single server is unrealistic. Just get a few good NVMe SSDs and a good, cheap, used 40Gb Mellanox NIC, then add more servers as you deem necessary to increase performance.

 

I'm assuming you're going to be using directly connected networking? That will limit your scale-out, but I don't think you'll need more than 3 storage servers; I expect 2 would be enough. Having to also get 40Gb network switching would be very expensive; if you had to, I would personally drop back down to 10Gb and scale out more.

3 hours ago, Electronics Wizardy said:

Have you tried getting something like a 900p as a scratch disk in each of the nodes, then copying the files over to the main server when you're done? I don't see why you want the main scratch disk to be over the network, especially when you're dealing with a low budget; I'd just put the storage in the compute nodes.

 

3 hours ago, Electronics Wizardy said:

If it's only 10 GB, why not just do this on a local RAM disk and add a bit more RAM to each of the nodes?

This is actually much more common. If you need very high-performance storage and do not require shared access to it across compute nodes, do it locally on the compute node, then copy the resulting final data back to main storage when finished. That removes all the complexity and avoids networking altogether, at significantly lower cost.

Posted · Original Poster (OP)
On 6/14/2019 at 4:53 PM, Electronics Wizardy said:

Why are you using sftp? That's not good for this use.

Full specs of the test systems? What software are you using?

What were you using to test the speed? Try using fio.

What was CPU usage?

How much data were you thinking for the working data set?

Have you tried getting something like a 900p as a scratch disk in each of the nodes, then copying the files over to the main server when you're done? I don't see why you want the main scratch disk to be over the network, especially when you're dealing with a low budget; I'd just put the storage in the compute nodes.

Why not rent hardware and see how it goes? Then buy it if it works well; if it doesn't, it won't cost much.

Do these nodes have a PCIe slot? I'd just put a high-speed SSD in there.

If it's only 10 GB, why not just do this on a local RAM disk and add a bit more RAM to each of the nodes?

sftp because it's easy/what I know.

 

Haven't tried scp or anything else for that matter (in terms of data transfer tools).

 

Dual Intel Xeon E5-2690 (v1) (8 cores, 2.9 GHz stock, 3.3 GHz all core turbo, 3.6 GHz max turbo)

8x Samsung DDR3-1866 16 GB 2Rx4 ECC Registered RAM running at DDR3-1600 speeds due to it being 2R.

Intel 545 Series 1 TB SATA 6 Gbps SSD

HGST 3 TB SATA 6 Gbps 7200 rpm HDD (RAID0)

HGST 6 TB SATA 6 Gbps 7200 rpm HDD (RAID0)

Mellanox ConnectX-4 VPI dual port 4x EDR IB

Supermicro Twin^2 6027TR-HTRF

1+1 Redundant 1620W power supply

SLES12 SP1 (full install)

MLNX OFED linux driver

 

sftp reports out the speed

 

no idea

 

Are you talking about my current test data set, or about the ultimate size of my production data? For the current test data set, I'm capping it at 50 GB because I'm testing with the RAM drive and have 128 GB of RAM; I create a RAM drive that's 110 GB in size so that it can host a copy of the file and also send/receive another copy of the same.

 

Production data, when deployed, will be limited to the capacity of the array/pool, whatever the solution is able to handle. The pricing that I've put together so far sizes the solution at about 8 TB of NVMe storage, which means the system will be able to handle up to that, of course. Some of my simulations/analyses might get close to that; other times, less so. And other times, I'm into tens to hundreds of TB of data transferred, but only a fraction of that will remain in the end -- which is the nature of scratch data; it's highly volatile/temporary storage.

 

I'm not using a 900p because I don't have any free PCIe slots. The chassis itself only has a single PCIe 3.0 x16 HHHL slot available (https://www.supermicro.com/products/system/2U/6027/SYS-6027TR-HTRF.cfm).

 

That slot is currently occupied by the Mellanox ConnectX-4 VPI dual port 4x EDR IB card.

 

Plus, even if I could do that, the maximum problem size I could handle would be limited by the capacity of the AIC. Pooling the storage together means each node can have access to more than what would otherwise be available locally, so if, say, one node needed all 8 TB to itself, it would be available. With an AIC per host, if a node needs more space, then I would be reading/writing to a clustered scratch FS anyway, which wouldn't be any different from what I'm currently trying to accomplish here, except that a) the current Supermicro chassis doesn't have any free PCIe slots, and b) the IOPS can be offloaded onto the NIC with "true" RDMA so that the compute nodes can just run the computations. I/Os are RDMA'd.

 

(BTW, the 900p caps out at 480 GB. I have runs going right now that are scratching to the local 1 TB SATA 6 Gbps SSD, because it's all I've got right now.)

 

I don't know of a place where I'd be able to rent this kind of hardware.

 

Most of the time, the kicker is going to be that there aren't very many outfits that offer 100 Gbps NIC connections. Everything else would generally be less of an issue. A 100 Gbps NIC is harder to "rent" because it's not widely deployed even for most cloud-based systems. (It's used on the backend, but not exposed to customers, to the best of my knowledge.)

 

See above.

 

Because a 100 Gbps RDMA scratch disk is faster than a SATA 6 Gbps SSD, even with a 10 GB file.
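As rough arithmetic (both sustained rates below are assumptions for illustration, not measurements):

```python
FILE_GB = 10
sata_mbps = 550    # assumed sustained write for a SATA 6 Gbps SSD
rdma_mbps = 6000   # assumed achievable over a 100 Gbps EDR link after overheads

sata_s = FILE_GB * 1000 / sata_mbps
rdma_s = FILE_GB * 1000 / rdma_mbps
print(f"SATA SSD: {sata_s:.1f} s, 100 Gbps RDMA target: {rdma_s:.1f} s")
# -> SATA SSD: 18.2 s, 100 Gbps RDMA target: 1.7 s
```

Even at a conservative fraction of line rate, the remote scratch wins on a 10 GB file.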

Posted · Original Poster (OP)
On 6/14/2019 at 8:11 PM, leadeater said:

1 Gbps or 1 GB/s? SATA RAID can write past what a single 10Gb NIC can do, but the problem with RAID cards is that the write-back cache is the fastest you can ever write; that's why it's automatically disabled for SSD arrays, as LSI etc. found that even a few SSDs are faster.

Multiple SATA-based RAIDs, SSD or HDD, can do the job, but you have to overlay a distributed storage system like Gluster on top so you can span that I/O across the RAID cards. It's a pretty cheap way of getting very high throughput, though even spanning across RAID cards you won't be able to get the throughput you are looking at from one box. That's why SDS exists: there isn't anything as a single server/system that is able to do more than a few GB/s over the network, plus it's actually cheaper to build 10 servers that can do 1 GB/s than to build 1 that can do 10 GB/s.

The issue with NVMe, and many of them, is that you actually lose a bit of that latency advantage: once you go past the number of slots or U.2 ports natively available, PCIe PLX switches are used, which does two things -- increases latency and splits the bandwidth across devices, as you can't create more bandwidth from nothing. So any server with 24 NVMe bays will perform significantly less than you'd expect, no better than 4 to 6 NVMe devices in sequential performance, and then network overhead eats a lot into that as well.

A modest NVMe system is easier to build and more compact, so you should probably still base your design on that; however, expecting to be able to do it with just a single server is unrealistic. Just get a few good NVMe SSDs and a good, cheap, used 40Gb Mellanox NIC, then add more servers as you deem necessary to increase performance.

I'm assuming you're going to be using directly connected networking? That will limit your scale-out, but I don't think you'll need more than 3 storage servers; I expect 2 would be enough. Having to also get 40Gb network switching would be very expensive; if you had to, I would personally drop back down to 10Gb and scale out more.

1.2 Gbps (150 MB/s). That's the fastest large-volume, network-based storage transfer I can currently manage.

 

For large-scale installations, that's true. But since I am (currently) only dealing with four compute nodes, it's actually, surprisingly, cheaper for me to do it this way (again, my current estimate for the build price is at $6k with 8 TB of all-NVMe SSD storage on two Asus Hyper M.2 x16 v2 cards).

Most "new" servers that I've seen so far, the chassis alone will come close to that.

 

The 4x EDR IB card I bought used for < $300. And since I already spent the money on a used Mellanox 36-port 4x EDR IB switch (which was only about a quarter of retail), my home office already has the infrastructure to support 100 Gbps 4x EDR IB.

 

If I didn't have that already, then you'd probably be correct, and yes, it would have been vastly cheaper to go with, say, 10 GbE. But having already made the jump to 100 Gbps (which COULD be configured as 100 GbE, if I had or built a 100 GbE bridge/switch instead of getting a separate 100 GbE switch), the cost differential for me now is such that it's actually cheaper to build a single, ultra-fast system than a bunch of slower systems clustered together.

 

This is why I am trying to get the block diagrams from Dell: I need to assess how the PCIe bus subsystem is broken up/divided/muxed/plexed, because that will highlight any potential bottlenecks in how they've engineered their system. And it doesn't quite make sense to me. Even assuming that NVMe drives 0-11 are in bay 0 and 12-23 are in bay 1, 12 PCIe 3.0 x4 devices demand 48 lanes alone. There is no documentation that the four PCIe ports that the backplane connects to can support 24 PCIe 3.0 lanes each. And this doesn't include the rest of the onboard peripherals, etc.

 

This is why I want to see their block diagram. On the other hand, their system also already costs more than my DIY build, granted, they are "advertising" more bandwidth than my build will have.

Once I add the dual ConnectX-4 cards into that system, it will eat up all 128 PCIe 3.0 lanes that EPYC offers. So I have to see the block diagram to believe that this type of build is even possible, or whether I will run into bottlenecks in the PCIe bus subsystem, which would kind of defeat the purpose of the build, at least in part.

 

No, I'm using a 36-port Mellanox 4x EDR IB switch (MSB-7890, externally managed). Yeah, I ran into that scalability problem around December of last year, so I bought the switch around February to take care of that. And now that I have it, the cost to put all of my systems onto the 100 Gbps IB network really isn't that bad (~$350-400 per system, which seems pricey, but really isn't for the performance the infrastructure can support) -- it's actually quite the bargain!

 

I already bought the switch. Again, it was about 1/4th the price of retail, so it really wasn't that bad at all.

 

A simulation that used to take me 6 days to run now runs in about 2 since deploying the switch, so it's clearly working out for me and what I need it for/what I do.

 

And with that, my cluster is now more capable. But again, I'm running into this whole issue of ultra-fast scratch space, because the problems that I can now handle, at least on the compute side of the shop, are vastly larger and therefore drive larger data sets, which demand more space. And scratching to 7200 rpm hard drives (for their larger capacities) would make this really pointless; hence the investigation and research going into resolving that bottleneck, which is why I'm here.

Posted · Original Poster (OP)

To give you guys an idea of what the data usage pattern is like, below is a screenshot that I had taken ca. 2013 when I ran a free-free modal analysis on an automotive body-in-white structure on a system with only 6 cores and 64 GB of RAM.

 

You can see that it only ran for about 84 minutes wall-clock time, and in that time it read in 7.7 TB of data but wrote only 393 GB, just to give you guys an idea.
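For scale, that read volume over that wall-clock time works out to a sustained average of about 1.5 GB/s:

```python
read_tb = 7.7   # data read, from the run statistics above
minutes = 84    # wall-clock time

gb_per_s = read_tb * 1000 / (minutes * 60)
print(f"average read rate: {gb_per_s:.2f} GB/s")   # -> average read rate: 1.53 GB/s
```

That average alone is already an order of magnitude past what a single mechanical drive sustains, before considering bursts.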

[Screenshot: solver run summary showing ~84 minutes wall-clock, 7.7 TB read, 393 GB written]

4 hours ago, alpha754293 said:

And with that, my cluster is now more capable. But again, I'm running into this whole issue of ultra-fast scratch space, because the problems that I can now handle, at least on the compute side of the shop, are vastly larger and therefore drive larger data sets, which demand more space. And scratching to 7200 rpm hard drives (for their larger capacities) would make this really pointless; hence the investigation and research going into resolving that bottleneck, which is why I'm here.

Scratch space on compute nodes is supposed to be NVMe or NVDIMM.

8 hours ago, alpha754293 said:

sftp because it's easy/what I know.

 

Try something else; sftp isn't designed for speed.

 

8 hours ago, alpha754293 said:

(BTW, 900p caps out at 480 GB. I have runs that's going right now that's scratching to the local 1TB SATA 6 Gbps SSD because it's all I've got right now.)

The 905p is the bigger model.

 

8 hours ago, alpha754293 said:

don't know where there's a place where I'd be able to rent this kind of hardware.

Did you google around? There are a few places I have found. Or you can just buy the hardware from a place that allows returns.

 

8 hours ago, alpha754293 said:

I'm not using 900p because I don't have any free PCIe slots. The chassis itself only has a single PCIe 3.0 x16 HHHL slot available (https://www.supermicro.com/products/system/2U/6027/SYS-6027TR-HTRF.cfm).

Can you get different nodes that have multiple PCIe slots? Those systems are also getting pretty old now as well.

 

 

You're in a tough spot. You don't have enough money for the correct solution, so really your only option is to buy some used hardware like an R720xd with 24x 2.5" SAS/SATA slots, fill it with SATA SSDs, make a big RAID volume, and share it. You're not gonna get much better for your budget. I wouldn't worry so much about PCIe lanes and max bandwidth because you won't fill them anyway; just try to get the most you can, and a large SATA RAID is about the best bang-for-the-buck performance you can get today. Maybe add a few more nodes for more performance later on and use something like Ceph.

Posted · Original Poster (OP)
6 hours ago, leadeater said:

Scratch space on compute nodes is supposed to be NVMe or NVDIMM.

Yes, but if you don't have the PCIe slots for it and there are no onboard NVMe slots either, this is the next best thing.

Posted · Original Poster (OP)
3 hours ago, Electronics Wizardy said:

Try something else; sftp isn't designed for speed.

The 905p is the bigger model.

Did you google around? There are a few places I have found. Or you can just buy the hardware from a place that allows returns.

Can you get different nodes that have multiple PCIe slots? Those systems are also getting pretty old now as well.

You're in a tough spot. You don't have enough money for the correct solution, so really your only option is to buy some used hardware like an R720xd with 24x 2.5" SAS/SATA slots, fill it with SATA SSDs, make a big RAID volume, and share it. You're not gonna get much better for your budget. I wouldn't worry so much about PCIe lanes and max bandwidth because you won't fill them anyway; just try to get the most you can, and a large SATA RAID is about the best bang-for-the-buck performance you can get today. Maybe add a few more nodes for more performance later on and use something like Ceph.

re: "Try something else, sftp isn't designed for speed."
Such as...?

re: "Or you can just buy the hardware from a place that allows returns."
Testing time generally would put me outside of the return window.

 

re: "Can you get different nodes that have multiple pcie slots. Those systems are also getting pretty old now aswell."

If I could, then I wouldn't really be here having this discussion.

 

Even the latest that Supermicro has to offer doesn't have NVMe by default or dual-port 4X EDR IB onboard, so no. 

 

re: "You don't have enough money for the correct solution, so really your only option is to buy some used hardware like a r720xd with 24x 2.5 sas/sata slots, fill it with sata ssds, make a big raid volume and share it."

The numbers don't support that conclusion, though.

 

24x Samsung 860 Pro 1 TB SSDs at $277.99 each (https://www.newegg.com/samsung-860-pro-series-1tb/p/N82E16820147682) come to $6672 just for the SSDs alone. 24x 6 Gbps = 144 Gbps. The Broadcom/LSI 9405W-16i is one of the few PCIe 3.1 x16 HBAs that I've been able to find so far, and that's $615 a pop; two of them (16 ports each, to cover all 24 drives) take the total up to $7902 before anything else. 

 

Conversely, 8 Samsung 970 Pro 1 TB M.2 PCIe 3.0 x4 NVMe drives are $297.99 each, for a total of $2384, with a total interface bandwidth of 256 Gbps. The Asus Hyper M.2 x16 v2 was $80 each (x2, for a total of $160, https://www.newegg.com/p/N82E16815293043?cm_sp=SearchSuccess-_-INFOCARD-_-asus+hyper+m.2+x16-_-9SIAKAC9BG6395-_-1&Description=asus+hyper+m.2+x16). That puts the total for that at $2544.

 

That works out to be 0.01822 Gbps/$ using SAS/SATA SSDs and 0.10062893 Gbps/$ using NVMe. I don't get how having 24 SAS/SATA SSDs would be better.
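Spelling out the per-dollar math above (prices as quoted; note this is raw interface bandwidth, not sustained device throughput):

```python
# Cost per unit of raw interface bandwidth for the two options quoted above.

# SATA option: 24x Samsung 860 Pro 1 TB plus two 16-port 9405W-16i HBAs.
sata_cost = 24 * 277.99 + 2 * 615.00   # ~$7902 total
sata_bw = 24 * 6                       # 144 Gbps at the SATA interface

# NVMe option: 8x Samsung 970 Pro 1 TB on two Asus Hyper M.2 x16 v2 cards.
nvme_cost = 8 * 297.99 + 2 * 80.00     # ~$2544 total
nvme_bw = 8 * 32                       # 256 Gbps (PCIe 3.0 x4 per drive)

print(f"SATA: {sata_bw / sata_cost:.5f} Gbps/$")   # ~0.01822
print(f"NVMe: {nvme_bw / nvme_cost:.5f} Gbps/$")   # ~0.10063
```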

 

And with dual Hyper M.2 x16 v2 cards, I would be surprised if 400 Gbps coming into the system, all at the same time, couldn't saturate the 256 Gbps that the two cards offer at the interface. 

 

And yes, the 905p is the bigger brother, topping out at 1.5 TB, but they're also over $2000 apiece. (https://www.newegg.com/intel-optane-ssd-905p-series-1-5tb/p/0D9-002V-003X1?Description=905p&cm_re=905p-_-0D9-002V-003X1-_-Product). The 4 kB random write performance on the 1.5 TB 905p is up to 550 kIOPS, whereas if I get two Intel 760p 2 TB drives, put them in the Asus Hyper M.2 x16 v2 and RAID 0 them, each will be able to do 275 kIOPS at a cost of $357.99 each (or $716 total, https://www.newegg.com/intel-760p-series-2tb/p/N82E16820167450).

 

If I go with the Samsung 970 Pros (1 TB), each is capable of up to 500 kIOPS, which means that two of them will give me 33% more capacity and 1000 kIOPS, at a lower cost.
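Putting the three options above side by side on rated 4 kB random-write IOPS per dollar (spec-sheet burst figures and prices as quoted, so treat these as upper bounds):

```python
# Rated 4 kB random-write kIOPS per dollar for the three options discussed.
# All figures are the manufacturer spec numbers quoted above.
options = {
    "1x Optane 905p 1.5 TB":   (2000.00, 550),        # price is "over $2000"
    "2x Intel 760p 2 TB":      (2 * 357.99, 2 * 275),
    "2x Samsung 970 Pro 1 TB": (2 * 297.99, 2 * 500),
}
for name, (cost, kiops) in options.items():
    print(f"{name}: {kiops / cost:.3f} kIOPS/$")
```

The 970 Pro pair comes out well ahead on this metric, which is the point being made.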

 

(One of my workstations has one of the older Intel 750 series 400 GB PCIe AIC SSDs. It's great; it's just low capacity. This build should get around a lot of that. And with four systems, each having a dual-port 100 Gbps NIC, that can easily take up three PCIe 3.0 x16 slots' worth of bandwidth. Attached is a screenshot from the IB benchmarking itself, where the interface tops out at 96.4 Gbps (out of 100 Gbps), so saturating three PCIe 3.0 x16 slots' worth of bandwidth is very much within my grasp and quite potentially very real for me.)

 

 

48425872_10100138772895339_7575031230089920512_n.jpg

Link to post
Share on other sites
2 hours ago, alpha754293 said:

Yes, but if you don't have the PCIe slots for it and there's no onboard NVMe slots either, this is the next best thing.

How many slots do you have and what's using them all? GPUs + NICs?

 

Edit: Don't worry saw the link to the supermicro system.

Link to post
Share on other sites
29 minutes ago, alpha754293 said:

Conversely, 8 Samsung 970 Pro 1 TB M.2 PCIe 3.0 x4 NVMe is $297.99 each, for a total of $2384 and has a total bandwidth capacity of 256 Gbps. The Asus Hyper M.2 x16 v2 was $80 each (x2, for a total of $160, https://www.newegg.com/p/N82E16815293043?cm_sp=SearchSuccess-_-INFOCARD-_-asus+hyper+m.2+x16-_-9SIAKAC9BG6395-_-1&Description=asus+hyper+m.2+x16). That puts the total for that at $2544.

 

That works out to be 0.01822 Gbps/$ using SAS/SATA SSDs and 0.10062893 Gbps/$ using NVMe. I don't get how having 24 SAS/SATA SSDs would be better.

You can also buy smaller ones if you don't need that much space for the NVMe option. You're comparing 8TB against 24TB, the smaller one is easily going to be cheaper with that configuration. Performance per dollar is better on NVMe though, that won't change no matter how small the SATA SSDs you compare to.

 

I would also split the SATA SSDs across more than one HBA and go with a cheaper older generation: no increase in bandwidth unless you use 3 of them, but cheaper. You should be able to get the SATA price down to around or just below $4k; NVMe is still cheaper for the performance.

 

The advantage of SATA/SAS is capacity and scale; you can't really scale out NVMe. This is where local scratch space and copying resulting data works out so well, because even cheap NVMe storage is still expensive for the capacity. You don't have a PCIe riser in your nodes that splits out to two slots though, so that's not an option for you; the motherboard has a 24-lane riser slot, so I don't know why there isn't a second x8 slot alongside that x16 slot you have.

 

38 minutes ago, alpha754293 said:

If I go with the Samsung 970 Pros (1 TB), each is capable of upto 500 kIOPS, which means that two of them will give me 33% more capacity, 1000 kIOPS, at a lower cost.

These Samsung consumer devices don't actually perform to their manufacturer specs; those are only short-burst performance figures using not-that-common I/O load settings. Server/enterprise SSDs undergo different testing and their specifications are based off that testing, so figures on those are a lot more accurate compared to consumer ones. Just something to keep in mind.

 

44 minutes ago, alpha754293 said:

Attached is a screenshot from just the IB benchmarking itself where the interface tops out at 96.4 Gbps (out of 100 Gbps), so the probability of me saturating the three PCIe 3.0 x16 slots' worth of bandwidth is very much within my grasp and quite potentially very real for me.)

The actual storage performance at the I/O block size you'll likely be doing, which I suspect will be 64k, is unlikely to reach these speeds for a sustained time. 970 Pros aren't really designed to handle that. They should do a very good job for you though, just lower the expectation slightly.

 

The other problem is getting the server to handle that I/O load. You're going to be using a software solution to pool these NVMe devices? That's going to be very high CPU load to do that. If not you're going to need to use an NVMe RAID card which can't actually handle that either, 7.8 GB/s, unless you use two.
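A quick sanity check on the two-card point, comparing one 4X EDR IB port against the RAID card ceiling quoted above (raw line rate, ignoring protocol overhead):

```python
import math

# One 4X EDR InfiniBand port vs. the quoted ceiling of one NVMe RAID card.
edr_line_rate_gbps = 100
edr_gB_per_s = edr_line_rate_gbps / 8        # 12.5 GB/s of raw line rate
raid_card_gB_per_s = 7.8                     # single-card ceiling quoted above

cards_needed = math.ceil(edr_gB_per_s / raid_card_gB_per_s)
print(cards_needed)   # 2
```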

 

This performance limitation equally applies to NVMe and SATA, so it's not so much a factor in which device type to buy; it's more that you need to actually go ahead with the system build to find the limitations, which are unlikely to come down to communication bus bandwidths or network. Whatever you do, it's highly likely to be able to be reused or reconfigured anyway.

 

4 hours ago, Electronics Wizardy said:

Maybe add a few more nodes for more performance later on and use something like ceph

Ceph isn't really all that high performance compared to Lustre; you get good aggregated throughput, but it fills a slightly different need than Lustre does. It's a very murky difference though, and Ceph is getting faster and faster over time.

 

Personally if the compute nodes had the PCIe slots I would be looking at a converged solution of storage and compute in the same nodes and dual tasking them. Then as you add nodes you're adding capacity for both. It's not a large scale solution that many do but it's something I would do in a home lab.

Link to post
Share on other sites
Posted · Original PosterOP

re: number of drives

We can always argue it both ways.

 

I stuck with 1 TB drives to keep at least the per-drive capacity the same.

 

There is no COTS board right now where you can actually run 24 NVMe M.2 drives, which is why I stuck with 8 for the NVMe option and 24 for the total number of drive bays for the Dells. (I still haven't heard back from them in regards to a block diagram yet.)

 

Despite that, again, remember that the 1 TB Samsung SATA SSD is only $18 LESS than the same-capacity NVMe drive. The practical and functional difference is that the SATA SSD is limited to 6 Gbps whereas NVMe over PCIe 3.0 x4 is capable of 32 Gbps, over 5 TIMES the bandwidth, for $18 more. So why wouldn't you spend the extra $18 per drive, whether it be 8 or 24 drives (other than the fact that you can't actually install 24 NVMe drives), and quintuple your bandwidth, and therefore your performance?

 

re: "The advantage of SATA/SAS is capacity and scale, you can't really scale out NVMe."

But again, I'm not capacity limited.

 

And if by "scale" it means multiple systems, well, a distributed/cluster FS using SATA was already suggested by you, which means that if I am going to run a distributed/cluster FS, why couldn't/wouldn't I do it with the NVMe setup (clone the hardware/software setup) and repeat it? Again, I'm not limited by capacity.

 

Each node is limited by capacity, but if I can build a system like this, then the capacity is shared between the nodes, so no single node will run out of capacity. At least, it isn't projected/expected to do so for the next 2-3 years, because by that point I'll be back to being CPU bound anyway; the problems will be large enough that I will have to start adding more nodes, more storage, and more scratch disk space, so the entire cluster config grows/doubles in size and everything, and I mean literally everything, gets doubled.

 

re: "so I don't know why there isn't a second x8 slot with that x16 slot you have."

There is one, but it isn't one where you can attach an AIC. (There's no physical spot to mount a second card via a PCIe 3.0 x8-to-x8 riser.)

 

re: "Server/Enterprise SSDs undergo different testing and the specification is based off that different testing, figures on those are a lot more accurate compared to consumer ones. Just something to keep in mind."

 

Good point.

 

Thank you.

 

re: "The actual storage performance and the I/O block size you'll likely be doing, should be 64k I suspect, is unlikely to reach these speeds for a sustained time. 970 Pro's aren't really designed to handle that, they should do a very good job for you though, just lower the expectation slightly."

 

Yeah, it'll be interesting to see what happens when you literally push these devices and the entire PCIe bus subsystem to its absolute limits because my compute nodes will be able to shove data onto this new build at a very rapid pace.

 

re: "The other problem is getting the server to handle that I/O load. You're going to be using a software solution to pool these NVMe devices? That's going to be very high CPU load to do that. If not you're going to need to use an NVMe RAID card which can't actually handle that either, 7.8 GB/s, unless you use two."

 

Yeah, the Asus Hyper M.2 x16 v2 cards state that it's supposed to support RAID either built into the card or via VROC. I've never tested/experimented/played with this technology, so again, it'll be interesting to see how it works.

 

There's hardly any data out there, at least again, in the public domain that talks about RDMA and VROC together, so this will be really interesting.

 

But it is my understanding that without the key module, VROC only supports RAID 0 (which is fine, since this is designed to be a "burst-y" scratch disk anyway; I don't need parity on the scratch disk data), and that since the PCIe bus isn't dependent on the CPU clock speed or operation, the CPU shouldn't have a problem with it. That's supposed to be the benefit of having so many PCIe lanes that are controlled directly by the CPU rather than routed through a PCH.

 

Yeah, the plan is to use Asus Hyper M.2 x16 v2 cards. I'm not sure if you can create a RAID array across the two cards, so again, that is yet to be seen/determined. If I can't, then there will be a software RAID layer (a ZFS stripe pool) that comes into play. If I can, then this won't even be an issue.

 

re: "Personally if the compute nodes had the PCIe slots I would be looking at a converged solution of storage and compute in the same nodes and dual tasking them. Then as you add nodes you're adding capacity for both. It's not a large scale solution that many do but it's something I would do in a home lab."

 

Yeah, but the problem with that that I have right now is that I kind of want to purposely decouple that a little bit.

 

That way, if I am reviewing the results from a run or a data set where I want to keep the results, they can be sent to my longer-term storage servers, keeping the scratch server neater and cleaner.

 

That way, I can take my time post-processing the results. Some of the first-round post-processing (e.g. converting the results files from one format to another in preparation for the post-processor) can be done directly on the NVMe/scratch server, and then only the converted files would be transferred to my main storage servers for longer-term storage and further/actual post-processing.

 

Again, this build serves these objectives:

- Increase the speed of the scratch disk (since the nodes don't have NVMe slots and no more PCIe slots where an extra AIC can be installed).

- Increases the total, consolidated capacity of the scratch disk space to be shared amongst all nodes, not just locally to itself.

- Can act as the first stage post-processing to convert files from one format to another in preparation for post-processing

- Offloads the handling of storage from the compute nodes onto a dedicated server/system

- Maximizes the throughput/bandwidth of the data transfer system

- Leverages and maximizes the utilization of the PCIe bus subsystem to handle all of the data I/O traffic

- And do all of that without breaking the bank (cost optimized solution)

 

 I found this Gigabyte board: https://www.newegg.com/gigabyte-mz31-ar0-amd-epyc-7000-series-processor-family-single-processor-14nm-up-to-32-core-64-thr/p/N82E16813145034

so the thing that I still need to do is to check and verify that the PCIe 3.0 x16 slots can be bifurcated into x4/x4/x4/x4 in preparation for the NVMe RAID. And I also need to confirm that AMD's EPYC processors even support NVMe RAID (AMD's counterpart to Intel's VROC), because so far, the only thing that I've been able to find online is that they've added it to the Threadripper/Ryzen lineup. No data/no confirmation on that front in regards to the EPYC processors.

(And I don't want to use the Threadripper processors, mostly because they only have 64 PCIe 3.0 lanes vs. EPYC's 128. If I use a Threadripper, per this (https://community.amd.com/community/gaming/blog/2017/10/02/now-available-free-nvme-raid-upgrade-for-amd-x399-chipset), I will be limited to just six NVMe M.2 devices, which means that I will top out at a theoretical 192 Gbps total bandwidth limit. AMD's own testing (from the same source) shows a throughput efficiency of 86.344%, hitting 165.78125 Gbps, at least on reads. Their writes still only barely hit 90 Gbps, or 46.92% of the rated throughput. Pity.)
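Those efficiency percentages fall out directly from AMD's published numbers (the write figure below uses a flat 90 Gbps, so it rounds slightly differently than AMD's 46.92%):

```python
# AMD's published X399 NVMe RAID throughput vs. the theoretical interface limit.
drives = 6                    # X399 NVMe RAID tops out at 6 M.2 devices
per_drive_gbps = 32           # PCIe 3.0 x4 per drive
theoretical = drives * per_drive_gbps      # 192 Gbps

read_gbps = 165.78125         # AMD's measured read throughput
write_gbps = 90.0             # roughly what AMD's writes hit

print(f"read:  {read_gbps / theoretical:.3%}")   # 86.344%
print(f"write: {write_gbps / theoretical:.1%}")  # ~46.9%
```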

Link to post
Share on other sites
28 minutes ago, alpha754293 said:

re: number of drives

We can always argue it both ways.

 

I stuck with 1 TB drives to keep at least the per-drive capacity the same.

 

There is no COTS board that exists right now that where you can actually have 24 NVMe M.2 drives, which is why I stuck with 8 for the NVMe and 24 for the total number of drive bays for the Dells. (I still haven't heard back from them in regards to a block diagram yet.)

 

Despite that, again, remember that the 1 TB Samsung SATA SSD is only $18 LESS than the same capacity NVMe drives. And the practical and functional difference is that the SATA SSD will be limited to 6 Gbps whereas the NVMe 3.0 x4 is capable of 32 Gbps or over 5 TIMES that bandwidth. For $18 more. So...why wouldn't you spend the extra $18 per drive, whether it'd be 8 or 24 (other than you can't actually install 24 NVMe drives) and quintuple your bandwidth, and therefore; performance?

Sorry, it wasn't clear what I meant: if you only need 8 TB for the NVMe option, then using 24 1 TB 860 Pros is a bit odd when you could use 512 GB ones instead and get a significantly lower cost for the SATA option. It's still a higher cost for the performance though, like I mentioned.

Link to post
Share on other sites
Posted · Original PosterOP
5 hours ago, leadeater said:

Sorry it wasn't that clear what I meant, I was meaning if you only need 8TB for the NVMe option then using 24 1TB 860 Pro's is a bit odd when you could use 512GB ones instead and get a significantly lower cost for the SATA option. It's still a higher cost for performance though like I mentioned.

Yeah, I get that. I mean, again, you can always argue it both ways, right, especially when it comes to the cost comparison?

 

In other news though, there appears to be critical flaws with my plan, which I think other people have been saying and/or alluding to:

 

1) Apparently AMD NVMe RAID (at the BIOS level) ONLY works with Windows 10. (That is a problem, because the Mellanox drivers don't work in Windows 10.)

 

2) If I go with linux, it'll have to be software RAID as it was alluded to earlier where IB works, but the NVMe RAID from AMD doesn't.

 

Then the question of being CPU bound starts coming into play a little more, but it looks like with ext4 it isn't as much of an issue (https://www.phoronix.com/scan.php?page=article&item=linux418-nvme-raid&num=2). Phoronix is a great resource for a lot of Linux stuff. In the same test, it also showed that, basically, Btrfs sucks compared to XFS or ext4.

 

Still, even with an average of 5857 MB/s sequential write, that's only 45.75 Gbps, which makes having two cards almost a necessity in order to get close to the throughput of a single 4X EDR IB port. It also highlights how, even with the latest and greatest, the drives still can't fully take advantage of the PCIe 3.0 x16 bus bandwidth. 
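For reference, here's the unit conversion behind that figure (binary-prefix conversion, which is how the ~45.75 Gbps number above falls out), and what two such cards would give next to the measured IB port speed:

```python
# Converting Phoronix's ext4 NVMe software-RAID average to Gbps and
# comparing against the 96.4 Gbps measured on a single 4X EDR IB port.
seq_write_MBps = 5857
gbps = seq_write_MBps * 8 / 1024     # ~45.76 Gbps for one 4-drive card
two_cards = 2 * gbps                 # ~91.5 Gbps

print(f"one card:  {gbps:.2f} Gbps")
print(f"two cards: {two_cards:.2f} Gbps vs. 96.4 Gbps measured on the IB port")
```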

 

He also tested with 20 "conventional" SSDs and found that Btrfs works better in that configuration, while ZFS was almost as good as in the NVMe test but required five times the number of drives. (https://www.phoronix.com/scan.php?page=article&item=freebsd-12-zfs&num=2)

 

Based on this research, I'm left with the conclusion that SSD technology still isn't mature enough to saturate even a PCIe 3.0 x4 interface for writes, and that in deploying a system like this, I need to be prepared to be massively underwhelmed by the write performance relative to what the interface should be able to do.

 

Interestingly enough, the slightly older Intel Optane 900p (as a single device) didn't really perform all that well in Phoronix's test suite (https://www.phoronix.com/scan.php?page=article&item=samsung-970evo-plus&num=3), which suggests that even AIC SSDs suffer the same fate; it's just fundamental to the way SSD flash memory chips are designed and manufactured.

 

In other words, based on his tests, even having local scratch space wouldn't be terribly great either. 

 

Pity/bummer. *sigh...*

Link to post
Share on other sites
31 minutes ago, alpha754293 said:

1) Apparently AMD NVMe RAID (at the BIOS level) ONLY works with Windows 10. (So...that is a problem because the Mellanox drivers doesn't work in Windows 10.)

I think that doesn't include EPYC though, that platform requires either software RAID or a hardware RAID controller (which would mean converting U.2 to M.2 in your situation).

 

31 minutes ago, alpha754293 said:

2) If I go with linux, it'll have to be software RAID as it was alluded to earlier where IB works, but the NVMe RAID from AMD doesn't.

This should be fine though if you stick to striping, mirroring or a combination of the two. Parity is a no-go performance-wise, especially without hardware offload to handle it.

 

31 minutes ago, alpha754293 said:

Then the question of it being CPU bound starts coming into play a little bit more, but it looks like that with ext4, it isn't as much of an issue. (https://www.phoronix.com/scan.php?page=article&item=linux418-nvme-raid&num=2).Phoronix is a great resource for a lot of stuff Linux. In their same test, it also showed that basically, Btrfs sucks compared to XFS or ext4.

BTRFS has a bunch of nice features that MD RAID doesn't have but it's not a very high performance option, still good but won't let you leverage the performance of NVMe as you have noted.

 

Both BTRFS and ZFS will use more CPU than MD RAID also.

 

31 minutes ago, alpha754293 said:

Still, even with an average of 5857 MB/s sequential write, that's still only 45.75 Gbps, which would put the estimate of having or needing two cards almost a necessity in order to achieve close to the throughput of a single 4X EDR IB port, and also highlights how even with the latest and greatest, it still can't fully take advantage of the PCIe 3.0 x16 bus bandwidth. 

You might be able to get 8+ GB/s with more NVMe devices spread across PCIe slots as you plan to do. If you hit scaling problems you can also look at doing two 4-NVMe RAID 0 arrays etc. There's a slight bit of overhead, which grows as the number of devices in the same RAID array increases. Doing that would make it harder on your end in having to pick which file path to put data on.

 

31 minutes ago, alpha754293 said:

He also tested it with 20 "conventional" SSDs, and found that Btrfs works better in that configuration, but ZFS was almost as good as the NVMe test, but required five times the number of drives. (https://www.phoronix.com/scan.php?page=article&item=freebsd-12-zfs&num=2)

 

Based on this research, I'm left with the conclusion that the SSD technology still isn't mature enough yet to be able to saturate even PCIe 3.0 x4 interface bandwidth for writes, and that deploying a system like this, I need to be prepared to be significantly/massively underwhelmed by the write performance relative to what the interface should be able to do.

Yep, this is why I said the bus and device connection speeds aren't the best things to look at: the devices are either slower than them, or other technical limitations come up well before you hit bus bandwidth limits. All-flash NVMe arrays are quite complex and specialized, which is why the enterprise offerings are so darn expensive, really really expensive. Like that Netapp AFF800 I linked earlier; whatever you guess the price is, you'd likely be a factor of 10 off 😉.

 

31 minutes ago, alpha754293 said:

Interestingly enough, the slightly older Intel Optane 900p (as a single device), didn't really perform all that well in Phoronix's test suite (https://www.phoronix.com/scan.php?page=article&item=samsung-970evo-plus&num=3) which suggests that even AIC SSDs suffer the same fate, which is just fundamental to the way SSD flash memory chips are designed and manufactured.

 

In other words, based on his tests, even having local scratch space wouldn't be terribly great either. 

Larger Optane drives are faster, same with NAND. Small devices have fewer memory chips to parallelize the I/O over, so they don't perform as well as the bigger ones with more. This only affects the smallest ones though; middle and upper sized ones have the same number of memory chips using different sizes.

 

For local scratch using NVMe, currently the upper limit is around 3 GB/s to 4 GB/s. With PCIe 4.0 coming into the market you'll be able to get past the 5 GB/s mark soon, but likely not that close to 8 GB/s sadly. It'll be expensive, however, because it's new technology etc.
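Those ceilings line up with what an x4 link can actually carry once you account for 128b/130b encoding (this ignores further protocol overhead, so real drives land a bit below these numbers):

```python
# Usable bandwidth of a PCIe x4 link after 128b/130b line encoding.
# This is why today's 3.0 x4 drives stop near 4 GB/s and "close to
# 8 GB/s" needs PCIe 4.0.
def pcie_x4_gB_per_s(gt_per_s_per_lane):
    return gt_per_s_per_lane * 4 * (128 / 130) / 8

print(f"PCIe 3.0 x4: {pcie_x4_gB_per_s(8):.2f} GB/s")    # ~3.94
print(f"PCIe 4.0 x4: {pcie_x4_gB_per_s(16):.2f} GB/s")   # ~7.88
```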

 

NVRAM is where it's at for really high performance scratch space but that's even more expensive.

Link to post
Share on other sites
Posted · Original PosterOP

It would appear that NVMe RAID for AMD EPYC currently isn't even available, so I concur with you that it would have to be either an all-software or an all-hardware based solution.

 

But that's also why I had mentioned that the NVMe RAID will also only work with Windows 10 (on Threadripper).

 

According to the benchmarking by Phoronix, ZFS appears to be a better option than Btrfs.

 

(I'm not really sure how NVMe devices will present themselves to ZFS, especially ZFS on Linux.)

 

It's too bad that the U.2 format for PCIe 3.0 x4 NVMe isn't that popular. A quick check on Newegg shows that it's pretty much (almost) ALL Intel devices that use that interface format, and again, the Optane 905P series isn't cheap.

 

For now, this project/build is put on hold as the CBA calculation is being reworked/redone with the new information.

Link to post
Share on other sites
