
Crazy NAS/data/file server build

Has anybody ever pushed their available PCI express subsystem to the limits and tested it?

 

Here's my problem statement:

 

I run a micro supercomputer/cluster in my basement (four nodes, dual Xeon E5-2690 (v1) each, 128 GB per node, 512 GB total, 1 TB SSD, 3 TB + 6 TB HDD per node) that I perform mechanical engineering simulations on/with, and as a result, I generate a LOT of data.

 

The four nodes are currently connected via a Mellanox ConnectX-4 dual port VPI (4x EDR IB) 100 Gbps QSFP28 PCI express 3.0 x16 network card with a 36-port Mellanox MSB-7890 4x EDR IB switch (7.2 Tbps switching capacity).

 

Port-to-port over InfiniBand, the system has already been benchmarked at around a 96 Gbps transfer rate, which means that with four nodes I can generate just shy of 400 Gbps worth of traffic (or roughly 78% of the bandwidth of four PCI Express 3.0 x16 slots).
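(For reference, the port-to-port numbers are the kind of thing you get out of the perftest tools that come with the OFED stack; a rough sketch of such a run, with mlx5_0 and the IPoIB address being nothing more than example values:

# ib_write_bw -d mlx5_0 -a -F              <- on the receiving node
# ib_write_bw -d mlx5_0 -a -F 10.10.10.1   <- on the sending node, pointed at the receiver

The -a flag sweeps the message sizes, so you can see where the link actually tops out.)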

 

I know that AMD EPYC supposedly has 128 PCI Express 3.0 lanes, which means that I can have half of it (64 lanes) dedicated to just RECEIVING data from the compute nodes as it is being generated, but then I will also need a storage system that ISN'T a ramdrive (tried it - didn't work) to be able to write the data to short-term scratch space before it is either committed to SATA SSDs or SATA HDDs.

 

So... has anybody ever tested what a fully maxed-out PCI Express subsystem looks like, or what it can actually do?

 

I was thinking about using two Asus Hyper M.2 x16 cards with four 2280 M.2 NVMe 3.0 x4 SSDs each (total of 8 NVMe SSDs) so that it would be able to receive the data at line rate, but I didn't want to dump a ton of money into this project only to have it fail (i.e. achieve only a fraction of the sustained write speed when the line itself is capable of so much more).

 

Any ideas?

 

Will it work? Has anybody pushed their PCI express subsystem to the limits like this before?

 

(I am literally running out of PCI Express lanes, and this also assumes that I have to be using the IGP off the CPU because I won't have any free PCIe x16 slots for a discrete GPU.)

 

And two of the PCIe x16 slots will have to be bifurcated to x4/x4/x4/x4 operation mode.
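(Once bifurcation is enabled, my plan for a quick sanity check is just to make sure each SSD enumerates as its own x4 device; something along the lines of the following, where the bus address is purely an example:

# lspci | grep -i "non-volatile"
# lspci -vv -s 41:00.0 | grep LnkSta

Each drive should report its own link at 8 GT/s, width x4.)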

 

And I am also trying to contain this system entirely within a single "box" because I don't want to have to run something like Lustre on top of ZFS.

 

Thanks.

IB >>> ETH


Hmm, @leadeater Thoughts?

 

This goes quite beyond anything I've played with. The only thing I can think of is perhaps a disk shelf with a 100Gbit NIC or two, loaded with either SATA III or SAS SSDs. Your M.2 drive idea would in theory work; if you do have bifurcation support, you'd just have to set up the RAID, mount the pool, then point the files to write to it. You'd have to tinker with the block size and such if you're working with such large datasets, otherwise you may not see the read/write performance.
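If you get that far, something like fio is the easiest way to see how much the block size matters before you commit to anything; a rough sketch, with the path and sizes being placeholders:

# fio --name=seq1m --filename=/mnt/pool/testfile --rw=write --bs=1M --size=20G --ioengine=libaio --iodepth=32 --direct=1
# fio --name=seq4k --filename=/mnt/pool/testfile --rw=write --bs=4k --size=20G --ioengine=libaio --iodepth=32 --direct=1

Comparing the two runs will tell you pretty quickly whether your workload's block size is what's holding the array back.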

 

I was just talking to a kid who found a 1TB M.2 drive for $50 so if you're not afraid of used hardware you could pull this off cheaply (relatively speaking).


Have you looked at PCIe x8 SSDs? They're a good amount faster.

 

Optane might be a better bet here as well, as it has much lower latency.

 

Also look at rack servers with 24x NVMe 2.5" drive bays instead of using M.2 adapters.

 

Also, you might be running into CPU limits with all that bandwidth.

 

And if you go with something like Ryzen/EPYC or dual socket, you run into the problem that there is inter-chip communication that will slow it down.


20 hours ago, Electronics Wizardy said:

Have you looked at PCIe x8 SSDs? They're a good amount faster.

 

Optane might be a better bet here as well, as it has much lower latency.

 

Also look at rack servers with 24x NVMe 2.5" drive bays instead of using M.2 adapters.

 

Also, you might be running into CPU limits with all that bandwidth.

 

And if you go with something like Ryzen/EPYC or dual socket, you run into the problem that there is inter-chip communication that will slow it down.

The problem, again, even with PCIe x8 SSDs, is that a single SSD still doesn't have enough bandwidth. The PCIe x8 ones are PCIe 2.0 x8 (at least for consumer-grade hardware), which only allows for 40 Gbps. The incoming data stream from my four compute nodes is 400 Gbps (in a 4x 100 Gbps configuration).

 

If I switch to enterprise-grade PCIe 3.0 x8 SSDs, a 3.2 TB card is $2000, and that's still only 64 Gbps, which means that I will need two of them (in a RAID0 configuration) to have any hope of being able to write at the combined interface speed of 128 Gbps, which will be the same as the NIC itself.

 

This is why I was looking at trying to do something like this with the Asus Hyper M.2 x16 (due to cost).

 

Optane doesn't have the capacity that I will need.

 

I've looked at Supermicro's all-NVMe solution as well, except that their system with the 24 2.5" NVMe drive bays is actually hosted by four half-width dual-socket nodes, with each node having only 6 drives. That means that in order for the system to present itself to my network as a single logical volume/image, I would have to run something like Lustre on top of either ZFS and/or ext4, or something else that would be more NVMe-SSD friendly.

 

A single-socket EPYC is "supposed to, in theory" be able to go up to 128 PCIe 3.0 lanes, but I'm having difficulty confirming whether that's ACTUALLY true or whether I would need a dual-socket setup for that. AMD's website is of no help in confirming it, but that's the thought process right now, and this is the idea running around in my head: to see whether something like this would be even REMOTELY possible, and whether it would get even REMOTELY close to that.

(In my early tests trying to do this with ramdrives on my current systems, I could barely even get a 10 Gbps transfer rate to/from the ramdrive running in Windows. In Linux, it was barely getting 2.5 Gbps, which is about the same speed as the SATA 6 Gbps SSD that I've got in the system right now for the OS, etc., so I'm trying to find ways to significantly increase the storage subsystem I/O in order to keep up with the NIC, which is now, at least on paper, CPU PCIe 3.0 lane-bound.)

 

Thank you.

 

P.S. I'm actually a little less worried about intersocket communication, only because AMD's Infinity Fabric has a theoretical maximum throughput of 4096 Gbps, so... I don't think that's going to be much of a concern if that number is real. This, of course, contrasts with the older QPI that Intel Xeons use, which, at best, can only muster 307.2 Gbps.

 

The new Intel Xeon W-series finally supports 64 PCIe 3.0 lanes, but those are single-socket solutions, which means that I would only be able to support HALF of what I am intending to set up/use a system like this for; on the other hand, they won't have intersocket communication bandwidth issues, since those processors are single-socket only.

IB >>> ETH


20 hours ago, Windows7ge said:

Hmm, @leadeater Thoughts?

 

This goes quite beyond anything I've played with. The only thing I can think of is perhaps a disk shelf with a 100Gbit NIC or two, loaded with either SATA III or SAS SSDs. Your M.2 drive idea would in theory work; if you do have bifurcation support, you'd just have to set up the RAID, mount the pool, then point the files to write to it. You'd have to tinker with the block size and such if you're working with such large datasets, otherwise you may not see the read/write performance.

 

I was just talking to a kid who found a 1TB M.2 drive for $50 so if you're not afraid of used hardware you could pull this off cheaply (relatively speaking).

SATA "3" (or SATA 6 Gbps) is only up to 6 Gbps theorectical interface peak. And with my Samsung 860 Evo 1 TB SSDs that are in the system right now, I can only barely muster up around 2.5 Gbps (312.5 MB/s) sustained write speeds to it.

 

SAS 12 Gbps will only double that. The incoming data rate can be, again, up to eight times higher than that, per node (and I have four nodes that are spitting out data like there's no tomorrow, mostly in the form of scratch files during the course of the simulation), so being able to keep up with the high-speed interconnect/network subsystem is important.

 

(There's NVMe-over-Fabrics (NVMeoF), and there's also iSCSI Extensions for RDMA (iSER), so I'm still not sure which solution will be able to keep up with my four compute nodes spitting out data like there's no tomorrow.)

 

Thanks. 

IB >>> ETH


9 minutes ago, alpha754293 said:

SATA "3" (or SATA 6 Gbps) is only up to 6 Gbps theorectical interface peak. And with my Samsung 860 Evo 1 TB SSDs that are in the system right now, I can only barely muster up around 2.5 Gbps (312.5 MB/s) sustained write speeds to it.

 

SAS 12 Gbps will only double that. The incoming data rate can be, again, up to eight times higher than that, per node (and I have four nodes that are spitting out data like there's no tomorrow, mostly in the form of scratch files during the course of the simulation), so being able to keep up with the high-speed interconnect/network subsystem is important.

 

(There's NVMe-over-Fabrics (NVMeoF), and there's also iSCSI Extensions for RDMA (iSER), so I'm still not sure which solution will be able to keep up with my four compute nodes spitting out data like there's no tomorrow.)

 

Thanks. 

I can't really advise you any further because I haven't worked with anything quite this high-end. I'm but a peasant aggregating 10Gbit links with a pure SATAIII SSD server. Haven't touched 100Gbit Infiniband or gone NVMe storage outside of a read/write cache for a VM server.

 

I'd like to play with 100Gbit Infiniband though. It'd make for quite a powerful cluster computer.


4 hours ago, alpha754293 said:

The PCIe x8 ones are PCIe 2.0 x8 (at least for consumer-grade hardware), which only allows for 40 Gbps.

Pretty much everything has been PCIe 3.0 for ages now; you'd only encounter that buying used server equipment off eBay.

 

@alpha754293 You're pretty much never going to have an I/O profile that will actually allow you to utilize such a setup. Most I/O loads are too small in block size, and unless you have hundreds to thousands of nodes generating a high enough I/O queue, almost every link in the chain will be significantly below maximum bandwidth.

 

Just a warning though: you'll never be able to achieve what you want using consumer-grade SSDs. They just do not have the sustained write capability and latency consistency to work; you'll probably get about 10 to 30 minutes of peak performance, then you'll drop down to around 20% of that.

 

Raw throughput wise going with NVMe won't actually allow more than using SATA or SAS. You can achieve equal by using HBAs and enough SATA/SAS SSDs to get the bandwidth you want. There is no effective difference here other than the number of required storage devices. NVMe is more to give lower latency and higher performance at smaller block sizes, with the added benefit of fewer storage devices, however at a higher per-device cost.


4 hours ago, alpha754293 said:

This is why I was looking at trying to do something like this with the Asus Hyper M.2 x16 (due to cost).

What is your budget?

 

4 hours ago, alpha754293 said:

I've looked at Supermicro's all-NVMe solution as well, except that their system with the 24 2.5" NVMe drive bays is actually hosted by four half-width dual-socket nodes,

They make many with 24x nvme in a single or dual socket system. Not just 4 node systems

 

 

4 hours ago, alpha754293 said:

A single-socket EPYC is "supposed to, in theory" be able to go up to 128 PCIe 3.0 lanes, but I'm having difficulty confirming whether that's ACTUALLY true or whether I would need a dual-socket setup for that. AMD's website is of no help in confirming it, but that's the thought process right now, and this is the idea running around in my head: to see whether something like this would be even REMOTELY possible, and whether it would get even REMOTELY close to that.

You can use all those lanes; the problem is you're limited by other things at that point, like protocol overhead, Infinity Fabric, and other issues.

 

4 hours ago, alpha754293 said:

(In my early tests trying to do this with ramdrives on my current systems, I could barely even get a 10 Gbps transfer rate to/from the ramdrive running in Windows. In Linux, it was barely getting 2.5 Gbps, which is about the same speed as the SATA 6 Gbps SSD that I've got in the system right now for the OS, etc., so I'm trying to find ways to significantly increase the storage subsystem I/O in order to keep up with the NIC, which is now, at least on paper, CPU PCIe 3.0 lane-bound.)

How did you make the ram drive?

 

You should probably be using nvmeof here instead of something like samba or nfs.

 

Do you have rdma on the nic?

 

 

 

I'd do more testing with the RAM drive; if you can't get the speeds with the RAM drive, you have a CPU, network, or protocol problem, and adding much slower flash will just make it slower. Those 24x NVMe servers like the Dell R7415 are probably the best hope you have here for taking in a ton of writes.

 

Also why does this have to be on one server? A cluster will allow much faster speeds, or have one data dump per node making data.


19 hours ago, Windows7ge said:

I can't really advise you any further because I haven't worked with anything quite this high-end. I'm but a peasant aggregating 10Gbit links with a pure SATAIII SSD server. Haven't touched 100Gbit Infiniband or gone NVMe storage outside of a read/write cache for a VM server.

 

I'd like to play with 100Gbit Infiniband though. It'd make for quite a powerful cluster computer.

Yeah, 100 Gbps... well, it WAS (and maybe still is) the standard for supercomputers, HPC, and clusters.

 

Interestingly enough, I actually got it because of the video that Linus made about how you can now pick up the cards off eBay at a fraction of the new, retail cost, so that's how (and why) I made the jump.

 

It's certainly interesting. And my line of work IS one of the few instances where even for just four nodes, it can really make use of it.

 

Now, I am just trying to expand its use case.

IB >>> ETH


16 hours ago, leadeater said:

Pretty much everything has been PCIe 3.0 for ages now; you'd only encounter that buying used server equipment off eBay.

 

@alpha754293 You're pretty much never going to have an I/O profile that will actually allow you to utilize such a setup. Most I/O loads are too small in block size, and unless you have hundreds to thousands of nodes generating a high enough I/O queue, almost every link in the chain will be significantly below maximum bandwidth.

 

Just a warning though: you'll never be able to achieve what you want using consumer-grade SSDs. They just do not have the sustained write capability and latency consistency to work; you'll probably get about 10 to 30 minutes of peak performance, then you'll drop down to around 20% of that.

 

Raw throughput wise going with NVMe won't actually allow more than using SATA or SAS. You can achieve equal by using HBAs and enough SATA/SAS SSDs to get the bandwidth you want. There is no effective difference here other than the number of required storage devices. NVMe is more to give lower latency and higher performance at smaller block sizes, with the added benefit of fewer storage devices, however at a higher per-device cost.

If you go to Newegg, for example, and you go to Internal SSDs, under the filters that they have set up for the interface, they only list either PCIe 3.0/3.1 x4 or PCIe 2.0 x8.

 

I would LOVE it if they had an SSD that was natively PCIe 3.0 x8 or x16, but so far, no such luck, at least not with the "consumer grade" stuff. With the enterprise SSDs, there are three, and one of them is out of stock on Newegg, so you're down to two: the 4 TB DC P3608 at $3000 each, or the 3.2 TB version of the same for ONLY $2000 each.

 

I don't know about the I/O loads being too small.

 

I mean, part of what I'm also looking at is what the SSD's random write IOPS rating is as well.

 

On the other hand, for some of my simulations the scratch DBALL file will be in excess of 11 GB apiece for a relatively small run, and the solver generates a new one for each solution iteration, which happens VERY quickly, so being able to erase or overwrite 11 GB/s (~88 Gbps) would be fantastic.

 

Most "normal" workloads ARE much smaller than that, probably between 1 MB - 4 MB a piece at max. But for what I do, it's literally closer to like tens of GBs each pass, so the faster that it can purge the previous DBALL scratch file and write the new one, the faster my overall solution times will be. (And this isn't uncommon for mechancial engineering analyses using a direct sparse solver.)

 

That's part of the reason for putting the NVMe SSDs into a RAID0 array: to help mitigate some of that. And the plan is to see if I can actually stripe it ACROSS the two Asus Hyper M.2 x16 cards, but I'll have to play around with that if I end up trying to deploy this kind of solution. It'll also be interesting to see how and if NVMeoF will work (or not).
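For what it's worth, the striping itself should just be plain mdadm RAID0 across all eight drives, assuming they all enumerate; a rough sketch, where the device names, chunk size, and filesystem are only placeholders:

# mdadm --create /dev/md0 --level=0 --raid-devices=8 --chunk=512K \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
      /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1 /dev/nvme7n1
# mkfs.xfs /dev/md0
# mount /dev/md0 /mnt/scratch

Whether the drives actually show up as /dev/nvme0n1 through /dev/nvme7n1 will depend on the board and the bifurcation settings, so take the names with a grain of salt.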

 

"Raw throughput wise going with NVMe won't actually allow more than using SATA or SAS."

 

I am not sure why you would say that.

 

It would depend on the CPU (and therefore how many PCIe 3.0 lanes the CPU actually supports directly).

 

For example, with the Intel Core i9-9900K, which has only 16 PCIe 3.0 lanes on the CPU itself (and 8 more via the chipset/PCH), I would agree with you if it were routed through the chipset/PCH, where the 8 PCIe 3.0 lanes are shared with the SATA/SAS. But for the 16 PCIe 3.0 lanes that are directly connected to the CPU, as long as the CPU itself has an IGP (such that I am not "spending" additional PCIe 3.0 lanes for graphics), I can put the Asus Hyper M.2 x16 card in the slot that's directly connected to the CPU.

 

But again, this is part of the question: if the AMD EPYC processors really are capable of 128 PCIe 3.0 lanes (the AMD EPYC 7351P, for example), then paired with the Gigabyte MZ31-AR0 (https://www.newegg.com/gigabyte-mz31-ar0-amd-epyc-7000-series-processor-family-single-processor-14nm-up-to-32-core-64-thr/p/N82E16813145034), this is at least theoretically possible, although I would only be able to fit maybe two Mellanox ConnectX-4 cards and two (or three) of the Asus Hyper M.2 x16 cards.

 

"You can achieve equal by using HBAs and enough SATA/SAS SSDs to get the bandwidth you want."

 

Well, Linus just built the 27x 3.84 TB Seagate IronWolf all-SSD server using dual Broadcom/LSI 9305-16i SAS 12 Gbps HBAs. The problem will be that the HBA itself will be the limiting factor. I mean, this is the motivation with NVMe, right? It's so that instead of routing through an HBA, you're connected to the PCIe bus directly. And Linus' new all-SSD build costs $40k.

IB >>> ETH


17 minutes ago, alpha754293 said:

Well, Linus just built the 27x 3.84 TB Seagate IronWolf all-SSD server using dual Broadcom/LSI 9305-16i SAS 12 Gbps HBAs. The problem will be that the HBA itself will be the limiting factor. I mean, this is the motivation with NVMe, right? It's so that instead of routing through an HBA, you're connected to the PCIe bus directly. And Linus' new all-SSD build costs $40k.

You are aware that it's 12Gbit per port, not per card, right? Each SSD has access to its own 12Gbit link, which is actually overkill. The limiting factor will be the PCIe 3.0 x8 connection the card is using. Those SSDs will have a cumulative bandwidth of up to almost 8GB/s between them and the rest of the system, so there's plenty of bandwidth to go around.

 

That's assuming my research is correct of course.


How compressible is the data you have?

 

There are algorithms like LZ4 that can do ~500 MB/s per core, or Zstandard, which can achieve high compression speeds... if you can reduce the data to even 50% of its size, you may be able to use NVMe drives in RAID, or split the data across multiple 40 Gbps cards and dump everything in near real time, etc.

See  https://facebook.github.io/zstd/

And here they show a 7-Zip mod which compresses data at up to 700 MB/s on a laptop using LZ4... think how it would do on a 24-32 core EPYC/Threadripper: https://mcmilk.de/projects/7-Zip-zstd/
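If you want a quick read on how compressible the scratch files actually are, zstd has a built-in benchmark mode that will test a few levels against a real file in seconds (the file name is just a placeholder):

# zstd -b1 -e3 -T0 some_scratch_file.DBALL

That prints the compression ratio and MB/s for levels 1 through 3 using all cores, which should tell you right away whether this idea is worth pursuing.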

 

 

 


17 hours ago, Electronics Wizardy said:

What is your budget?

 

They make many with 24x nvme in a single or dual socket system. Not just 4 node systems

 

 

You can use all those lanes; the problem is you're limited by other things at that point, like protocol overhead, Infinity Fabric, and other issues.

 

How did you make the ram drive?

 

You should probably be using nvmeof here instead of something like samba or nfs.

 

Do you have rdma on the nic?

 

 

 

I'd do more testing with the RAM drive; if you can't get the speeds with the RAM drive, you have a CPU, network, or protocol problem, and adding much slower flash will just make it slower. Those 24x NVMe servers like the Dell R7415 are probably the best hope you have here for taking in a ton of writes.

 

Also why does this have to be on one server? A cluster will allow much faster speeds, or have one data dump per node making data.

If it works with the AMD EPYC and the Gigabyte motherboard, then it'll just be the cost of all of the components which will sit at around maybe $6000 when it's all said and done.

 

"They make many with 24x nvme in a single or dual socket system. Not just 4 node systems"

 

Can you send me links to those? From my research, the only ones that I've found that have 24 NVMe drive bays are all multi-node systems, so I would definitely be interested in seeing who offers just a "single pizza box" that's either a single-socket or a dual-socket solution (not a multi-node solution) for this.

 

Thank you.

 

"You can use all those lanes, the problem is your limited by other things at that point like protocol overhead, infinity fabric and other issues."

 

Actually, with NVMeoF, it should be able to bypass almost all of that.

 

(See: https://community.mellanox.com/s/article/howto-configure-nvme-over-fabrics)

 

In Linux, I actually tried it both ways:

 

# mount -t tmpfs -o size=100g tmpfs /mnt/ramdisk 

 

(Note: My system has 128 GB of RAM per node.)

 

And when that yielded results that were comparable to the SSD that was already on the system, I tried it with ramfs instead.

 

# mount -t ramfs -o size=100g ramfs /mnt/ramdisk

 

That, too, resulted in similar performance to the SATA 6 Gbps SSD that I have in the system.
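(Something I still need to do is benchmark the ramdisk locally, without the network in the picture, to separate "the ramdisk is slow" from "the path to the ramdisk is slow"; even something as crude as the following would do, with the count being just an example size:

# dd if=/dev/zero of=/mnt/ramdisk/testfile bs=1M count=10240 conv=fdatasync

If that local number is high and the over-the-wire number is still ~2.5 Gbps, then the problem is the network/protocol path rather than the ramdisk itself.)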

 

In Windows Server 2016, I used this:

 

https://www.ghacks.net/2014/01/26/create-dynamic-ramdisk-imdisk-toolkit/

 

(However, with Windows Server 2016, I was actually able to get close to the 10 Gbps mark; it would saturate that for about 20-30 seconds or so, then drop back down to around 2.5-ish Gbps.)

 

re: "You should probably be using nvmeof here instead of something like samba or nfs."

 

That's the plan, although there's also SMB Direct, which has RDMA capabilities, as well as iSER. So it'll be quite a bit (a lot) of experimentation to see which configuration will actually get me closest to the theoretical performance that has been calculated so far.

 

re: "Do you have rdma on the nic?"

 

Yes. It's a Mellanox ConnectX-4 dual QSFP28 VPI port, 4x EDR IB card attached to a Mellanox 36-port MSB-7890 4x EDR IB externally managed switch.

 

re: "THose 24x nvme server like the dell r7415 is the best hope you probably have here at taking in a ton of writes. "

 

I'm trying to get a block diagram out of Dell so that I can see how the 24 NVMe drives are connected to the CPU (so that I can see if there are any bottlenecks in the layout/configuration of the system). The R7415 also has other PCIe slots, etc., so I want to know how the PCIe subsystem is divided up.

 

re: "Also why does this have to be on one server? A cluster will allow much faster speeds, or have one data dump per node making data."

 

Because deploying Lustre (probably with ZFS) is non-trivial. And I have to assume that intrasystem communication is faster than intersystem communication.

IB >>> ETH


56 minutes ago, Windows7ge said:

You are aware that it's 12Gbit per port, not per card, right? Each SSD has access to its own 12Gbit link, which is actually overkill. The limiting factor will be the PCIe 3.0 x8 connection the card is using. Those SSDs will have a cumulative bandwidth of up to almost 8GB/s between them and the rest of the system, so there's plenty of bandwidth to go around.

 

That's assuming my research is correct of course.

Thank you.

 

Yes, I am.

 

The Broadcom/LSI SAS 9305-16i uses 4x4 breakout cables to be able to deliver 12 Gbps per port. 16 ports * 12 Gbps each = 192 Gbps. The card itself is on a PCIe 3.0 x8 link (64 Gbps). (Source: https://www.broadcom.com/products/storage/host-bus-adapters/sas-9305-16i#specifications)

 

Ergo, with about 6 SAS 12 Gbps drives, you're already overloading the bus. Even if you assume that the bus is only about 80% efficient, with 7 SAS 12 Gbps drives you're already at 84 Gbps of demand.

 

Conversely, each NVMe PCIe 3.0 x4 SSD is 32 Gbps, meaning two of those will equal the bandwidth of that single SAS 12 Gbps HBA.

 

So the thing to remember is that I have four nodes that are all outputting data and I'm trying to be able to get it on and off that system as fast as it can go. 400 Gbps = 50 GB/s. That's my bogey. That's the amount of data that is expected to be coming into the system at any given point in time.

 

Even with two Asus Hyper M.2 x16 cards, it'll still only be able to pull off around 32 GB/s (out of the 50 GB/s that's required).

 

re: "Each SSD has access to it's own 12Gbit link which is actually overkill."

 

Not when you're trying to shove 50 GB/s of data onto it.

 

Oh the joys of HPC.

 

Thank you.

IB >>> ETH


27 minutes ago, mariushm said:

How compressible is the data you have?

 

There are algorithms like LZ4 that can do ~500 MB/s per core, or Zstandard, which can achieve high compression speeds... if you can reduce the data to even 50% of its size, you may be able to use NVMe drives in RAID, or split the data across multiple 40 Gbps cards and dump everything in near real time, etc.

See  https://facebook.github.io/zstd/

And here they show a 7-Zip mod which compresses data at up to 700 MB/s on a laptop using LZ4... think how it would do on a 24-32 core EPYC/Threadripper: https://mcmilk.de/projects/7-Zip-zstd/

 

 

 

The data really isn't compressible at all.

 

Moreover, the highly transient nature of the data doesn't warrant compression at all. (Compression isn't executed during the solve process/at solve time.)

 

Scratch data is a type of data which can be very large in volume, but it is only used temporarily by the solver as the data is being manipulated into and out of RAM, and into and out of the CPU.

 

Even a small run can generate a few TBs worth of scratch data during the solve process. That's not uncommon at all.

 

The intent is that, given the highly transient nature of the data, instead of having a small(er) pool of localized NVMe storage on each node, I can build a scratch server like this and pool together a lot more storage than would fit on a node (since the node itself only has a single PCIe 3.0 x16 slot, and that's used for the Mellanox ConnectX-4 card), so I have to "offload" the storage onto another system (like this) and use it as the scratch server.

 

At the end of the run, when I have my final results, that data will ultimately be moved off the scratch server and onto either another SSD server (which can be sufficiently served with just SATA 6 Gbps SSDs, only lots of them) or even mechanically rotating hard drives for longer-term storage, because at that point it's more about total available capacity/volume rather than pure/sheer/raw speed.

 

Some of that data might be a little bit more compressible (up to ~30%).

 

Thanks.

 

(P.S. I currently use 7-zip a LOT!)

 

 

IB >>> ETH


6 minutes ago, alpha754293 said:

Even with two Asus Hyper M.2 x16 cards, it'll still only be able to pull off around 32 GB/s (out of the 50 GB/s that's required).

Ah, so you're well and truly beyond anything I can "reasonably" think of to help. You mentioned attempting to use a RAM disk didn't work. In my own experiments with RAM disks, I don't recall quad-channel memory being capable of 50GB/s. At least not quad-channel DDR3.

 

I'm wondering if you might have to utilize multiple servers to somehow link-aggregate the offload so each server sees only a portion of that.


2 hours ago, alpha754293 said:

Can you send me links to those? From my research, the only ones that I've found that have 24 NVMe drive bays are all multi-node systems, so I would definitely be interested in seeing who offers just a "single pizza box" that's either a single-socket or a dual-socket solution (not a multi-node solution) for this.

Dell r740xd

 

https://www.dell.com/en-us/work/shop/cty/pdp/spd/poweredge-r740xd/pe_r740xd_12238_vi_vp

 

Dell r7415/r7425

 

https://www.dell.com/en-us/work/shop/povw/poweredge-r7415

 

This supermicro https://ftpw.supermicro.com.tw/products/system/2U/2029/SYS-2029U-E1CR4T.cfm

 

Lots of servers have 24x NVMe on one node

 

2 hours ago, alpha754293 said:

And when that yielded results that were comparable to the SSD that was already on the system, I tried it with ramfs instead.

How did you test performance? What protocol were you using over the network?

 

 

 

2 hours ago, alpha754293 said:

Because deploying Lustre (probably with ZFS) is non-trivial. And I have to assume that intrasystem communication is faster than intersystem communication.

Why are you mostly looking at Lustre?

 

What OS are the data-making nodes running? Have you looked at Ceph or Storage Spaces Direct?

 

2 hours ago, alpha754293 said:

I'm trying to get a block diagram out of Dell so that I can see how the 24 NVMe drives are connected to the CPU (so that I can see if there are any bottlenecks in the layout/configuration of the system). The R7415 also has other PCIe slots, etc., so I want to know how the PCIe subsystem is divided up.

Did you call up Dell? They should give you one. Probably some PCIe splitters, though.

 

2 hours ago, alpha754293 said:

In Windows Server 2016, I used this:

So what OS is this all running? I'd go all Windows or all Linux; it normally works better that way. If you're going Windows, I'd just do Storage Spaces Direct here.

 


1 hour ago, Windows7ge said:

Ah, so you're well and truly beyond anything I can "reasonably" think of to help. You mentioned attempting to use a RAM disk didn't work. In my own experiments with RAM disks, I don't recall quad-channel memory being capable of 50GB/s. At least not quad-channel DDR3.

 

I'm wondering if you might have to utilize multiple servers to somehow link-aggregate the offload so each server sees only a portion of that.

Yeah, that was also a question that @Electronics Wizardy asked.

 

It isn't that it's impossible, but it gets vastly more complicated when I'm running a distributed file system (Lustre), which would tie together "pools" of ZFS zpools/arrays so that they're presented to the network as a single logical volume or mount(able) filesystem "image".

 

To answer your other point: no, no memory subsystem that's currently in use/available/deployed (or deployable) is capable of 50 GB/s (400 Gbps) data transfers. The proposed or draft specifications for DDR5-51200 SDRAM can hit a theoretical 409.6 Gbps, but there's nothing using it at the moment.

Also interestingly enough, in trying to benchmark RAMdisks, "conventional" transfers (TCP over IPoIB) don't use RDMA, so that's probably a big part of the reason why it wasn't any faster than my existing SATA 6 Gbps SSD.
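(One thing I still want to check on the IPoIB side is whether the interface is running in connected mode with the large MTU, since that alone can move plain TCP numbers quite a bit; roughly, assuming the interface is ib0:

# echo connected > /sys/class/net/ib0/mode
# ip link set ib0 mtu 65520

That doesn't make it RDMA, of course, but it at least takes some of the blame off the TCP path.)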

 

To really be able to use a RAMdisk like that, I would probably have to set up something like iSER and mount it; then it might work, and that's about as close as I would get (at least on Linux systems).

 

The advantage of using NVMe is that there's NVMeoF, which should, at least in theory (if I understand it correctly), expose the NVMe volume/array/devices to the RDMA-enabled and -aware fabric, so that I shouldn't (again, in theory) have to deploy iSER on top of the NVMe in order to achieve the same stated goal.
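(From what I can gather from the Mellanox howto linked earlier, the target side on Linux is driven through configfs and the initiator side is just nvme-cli; a rough sketch of what I expect the setup to look like, where the NQN, the IP address, and the backing device are all placeholders:

On the target (the EPYC scratch box):
# modprobe nvmet
# modprobe nvmet-rdma
# mkdir /sys/kernel/config/nvmet/subsystems/scratch
# echo 1 > /sys/kernel/config/nvmet/subsystems/scratch/attr_allow_any_host
# mkdir /sys/kernel/config/nvmet/subsystems/scratch/namespaces/1
# echo /dev/md0 > /sys/kernel/config/nvmet/subsystems/scratch/namespaces/1/device_path
# echo 1 > /sys/kernel/config/nvmet/subsystems/scratch/namespaces/1/enable
# mkdir /sys/kernel/config/nvmet/ports/1
# echo rdma > /sys/kernel/config/nvmet/ports/1/addr_trtype
# echo ipv4 > /sys/kernel/config/nvmet/ports/1/addr_adrfam
# echo 10.10.10.10 > /sys/kernel/config/nvmet/ports/1/addr_traddr
# echo 4420 > /sys/kernel/config/nvmet/ports/1/addr_trsvcid
# ln -s /sys/kernel/config/nvmet/subsystems/scratch /sys/kernel/config/nvmet/ports/1/subsystems/scratch

On each compute node:
# modprobe nvme-rdma
# nvme connect -t rdma -n scratch -a 10.10.10.10 -s 4420

Whether that actually holds up at anywhere near line rate is exactly the part I can't test yet.)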

 

Unfortunately, unless or until I actually get or build a system like this, I won't know, because I don't have the hardware to test it. So this, of course, presents a bit of a catch-22/chicken-and-egg problem, and I was hoping that someone else might have already done work like this and might be able to offer their insights on pushing their entire PCI Express subsystem to its limit, and on what's in the realm of the possible and what's not.

IB >>> ETH


5 minutes ago, alpha754293 said:

To answer your other point: no, no memory subsystem that's currently in use/available/deployed (or deployable) is capable of 50 GB/s (400 Gbps) data transfers. The proposed or draft specifications for DDR5-51200 SDRAM can hit a theoretical 409.6 Gbps, but there's nothing using it at the moment.

Well, people are getting >130GB/s on single-socket EPYC, as 8-channel RAM will speed stuff up a lot.

 

6 minutes ago, alpha754293 said:

Also interestingly enough, in trying to benchmark RAMdisks, "conventional" transfers (TCP over IPoIB) don't use RDMA, so that's probably a big part of the reason why it wasn't any faster than my existing SATA 6 Gbps SSD.

 

You should be able to get well over 6 Gb/s without RDMA; something seems wrong if you can't.

 

And you can use ramdisks with nvmeof

 

7 minutes ago, alpha754293 said:

It isn't that it's impossible, but it gets vastly more complicated when I'm running a distributed file system (Lustre), which would tie together "pools" of ZFS zpools/arrays so that they're presented to the network as a single logical volume or mount(able) filesystem "image".

Have you set up a cluster? It really isn't that hard, and it's probably the best solution here.


3 minutes ago, alpha754293 said:

--snip--

This still confuses me, because RDMA (Remote Direct Memory Access), to my knowledge, bypasses the CPU and moves data directly to/from memory. If we are both aware that even quad-channel DDR4 doesn't touch 400Gbit, then no single server will be able to host an NVMe pool to intake these massive file streams. You'd have no choice but to go distributed, with a cluster running these NVMe drives.

 

Quote

I was hoping that someone else might have already done work like this and might be able to offer their insights on pushing their entire PCI Express subsystem to its limit, and on what's in the realm of the possible and what's not.

Hopefully you've posted this question to a few different forums, because I would say 95%+ of this forum's demographic is somewhere between entry-level computer users and computer enthusiasts (overclockers, custom builds, etc...), with maybe 5% actively into servers & networking, so you probably won't find as many answers here as you would have hoped.


6 minutes ago, Electronics Wizardy said:

Dell r740xd

 

https://www.dell.com/en-us/work/shop/cty/pdp/spd/poweredge-r740xd/pe_r740xd_12238_vi_vp

 

Dell r7415/r7425

 

https://www.dell.com/en-us/work/shop/povw/poweredge-r7415

 

This supermicro https://ftpw.supermicro.com.tw/products/system/2U/2029/SYS-2029U-E1CR4T.cfm

 

Lots of servers have 24x NVMe on one node

 

How did you test performance? What protocol were you using over the network?

 

 

 

Why are you mostly looking at Lustre?

 

What OS are the data-making nodes running? Have you looked at Ceph or Storage Spaces Direct?

 

Did you call up Dell? They should give you one. Probably some PCIe splitters, though.

 

So what OS is this all running? I'd go all Windows or all Linux; it normally works better that way. If you're going Windows, I'd just do Storage Spaces Direct here.

 

The Supermicro system: only four of the 24 drive bays are NVMe, and that's with an optional AOC. The rest are still just SATA/SAS3 ports. (To even get all 24 ports to be SAS, you still need the AOC, and again, four of the ports are either SAS OR NVMe.)

re: "Lots of servers havae 24x nvme on one node"

Lots of servers have 24 2.5" bays. Only the Dells that you've sent so far support NVMe across all of them, and even then, for the Dell R740xd I wasn't able to find or confirm a block diagram of how the system is laid out in regards to the PCI Express lane allocation.

 

re: "How did you test performance? What protocol were you using over the network?"

 

The network is IPoIB. The FS is ext4 (in Linux), unless it's a RAM drive, in which case it's either tmpfs or ramfs. Transfers were over NFS (although Mellanox has removed NFSoRDMA support from their drivers, so now it's up to iSER).

 

I have a whole bunch of files, ranging from roughly 10 GB to 50 GB, which contain the archived results data; those get copied between the systems as the test subject.

 

re: "Why are you looking at mostly luster?"

Mostly because the Top500 systems already deploy it, so there's a lot of documentation and resources on how to do it. My mental model is that if it works for thousands to hundreds of thousands of nodes, then it should be able to work for my puny little four-node cluster.

 

re: "What os are the data making nodes running? Have you looked at ceph or storage spaces direct?"

 

SLES12 SP1. Storage Spaces Direct isn't too far removed from SMB Direct, but I haven't gone down that road yet, only because the compute nodes right now are all Linux (SLES12 SP1) nodes. This build will be using either Windows Server 2016 (if I'm using Storage Spaces/SMB Direct) or Linux (distro TBD, depending on the NVMeoF implementation/deployment).

 

re: "Did you call up dell? They should give you one. Probably some pcie splitters though"

Yup, already did. Their enterprise account manager is scheduled to call me back either tomorrow or some time next week and hopefully they'll be able to provide an answer in regards to that.

 

re: "SO what os is this all running? Id go all windows or all linux, normally works better that way. If your going windows id just do storage spaces direct here."

I'm currently leaning towards Linux (CentOS or something like that), mostly because the Windows Mellanox OFED drivers don't include the subnet manager necessary to actually RUN IB on Windows (without a Linux system running the SM itself). And again, if I'm running either iSER or NVMeoF, my current and early research into the software side of the deployment strategy is pointing mostly towards *nix anyway.
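(The subnet manager side of it is trivial on Linux anyway; with the distro or MLNX OFED packages it's roughly just one of the following on some node on the fabric, depending on how it's packaged:

# opensm                    <- run it directly
# systemctl start opensm    <- or as a service, where it's packaged that way

so the SM is really only a problem if the whole box ends up being Windows.)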

 

IB >>> ETH


55 minutes ago, Electronics Wizardy said:

Well, people are getting >130GB/s on single-socket EPYC, as 8-channel RAM will speed stuff up a lot.

 

You should be able to get well over 6 Gb/s without RDMA; something seems wrong if you can't.

 

And you can use ramdisks with nvmeof

 

Have you set up a cluster? It really isn't that hard, and it's probably the best solution here.

 

I am going to guess that the >130 GB/s on a single-socket EPYC with 8 channels of RAM is a CPU-to-RAM test, not a RAM-of-node-A-to-RAM-of-node-B test.

 

My working-level assumption right now (as to why I can't get over 6 Gb/s without RDMA) is that it's still going through the TCP/IP stack (i.e. not RDMA). Beyond that, I'm not really sure why (it could be TRIM, ext4 not being optimized for SSDs, a whole bunch of factors like that).

 

The problem with RAMdisks is right there in the name: they're RAM disks. If RAM is "spent" on storage, then that RAM isn't available for the solve processes, and that's bad.

 

The compute nodes are already in a cluster.

 

It's the complexity (or the perceived complexity) of deploying Lustre (on top of ZFS, which I am not sure SLES even supports out of the box) that I'm not so sure I want to get into. If I can do it with NVMeoF and a single server with a single logical volume and mountable image, then it removes that additional layer of complexity, and therefore also removes an additional potential point of failure.

 

Besides, my current compute nodes don't have NVMe support, so it'd have to be a new build anyway, and rather than spending something like $20,000 on a distributed, multi-node NVMe server, I can do it for about $6000 with a single-socket AMD EPYC instead.

And currently, the compute nodes only have SATA 6 Gbps, which means that at best, even with four compute nodes, they would only be able to deliver 24 Gbps anyway.

 

This project aims to increase that by a factor of 16.

 

Setting up the cluster was actually more involved than you might think. I mean, that's always the case when you're treading new ground; it's easy once you know how to do it. (And no, it didn't always work. That's where a lot of time was spent: researching and making sure that all of the nodes were able to talk to each other. I now have documented procedures for bare-metal deployment that I had to scrape together from all of the various sources, along with finding out which features have been DEFEATURED/are no longer available (e.g. NFSoRDMA).)

 

If it were easy, everybody would be doing it.

IB >>> ETH


1 hour ago, Windows7ge said:

This still confuses me, because RDMA (Remote Direct Memory Access), to my knowledge, bypasses the CPU and moves data directly to/from memory. If we are both aware that even quad-channel DDR4 doesn't touch 400Gbit, then no single server will be able to host an NVMe pool to intake these massive file streams. You'd have no choice but to go distributed, with a cluster running these NVMe drives.

 

Hopefully you've posted this question to a few different forums, because I would say 95%+ of this forum's demographic is somewhere between entry-level computer users and computer enthusiasts (overclockers, custom builds, etc...), with maybe 5% actively into servers & networking, so you probably won't find as many answers here as you would have hoped.

 

So... the thought with RDMA for applications is that you have zero-copy communication between the application and data residing in RAM. That's fundamentally RDMA at its core (for applications to use).

 

But when it comes to storage technologies, my thought process is that I'm more or less accessing the PCIe bus directly through RDMA, along with the devices connected to said PCIe bus, controlled by the controller that's on the CPU itself. So I'm not actually going from RAM to RAM, but rather RAM <-> PCIe on node0's CPU <-> PCIe on the new server's CPU <-> NVMe storage.

 

In other words, it is true that no single memory subsystem can handle the 400 Gbps data rate coming into it. But it's RAM from the compute nodes (204.8 Gbps for 3.2 GT/s QPI) to the AMD EPYC to NVMe storage, as opposed to the RAM that's on the AMD EPYC server system.

 

No, I've only posted this question here. I was going to post this question on ServeTheHome as well, but haven't gotten around to doing so.

 

But the discussion is helping me think through a plan/course of action, though.

 

(i.e. even on that Gigabyte motherboard with 5 PCIe 3.0 x16 slots, only four of them run at x16 speeds and the 5th runs at x8, which means that at best, even with an AMD EPYC that has 128 PCIe 3.0 lanes, I'll only be able to use 64 of them anyway between the NICs and storage. If I want to improve on that any further, then I'll HAVE to switch to a distributed file system, which just further complicates things, because then I'll be running Lustre on top of ZFS on top of NVMeoF on top of IPoIB.)

 

So, you can see how the layering of the different technologies and protocols can prove to be detrimental to the entire purpose of the build.

 

But this is also aimed at getting something like the performance of a $100,000 system (with a "proper" distributed NVMeoF solution) for around just $6k.

 

If it works, I just "made" $94k.

IB >>> ETH


This begs the question: is there any way to make the work that's being performed more efficient, such that the required bandwidth can be reduced?

 

Kind of like, instead of trying to source the world's most high-end hardware, try to make the application more efficient. Does this software have any wiggle room for that? Or have you quite possibly already done what you can in that department?


42 minutes ago, alpha754293 said:

My working-level assumption right now (as to why I can't get over 6 Gb/s without RDMA) is that it's still going through the TCP/IP stack (i.e. not RDMA). Beyond that, I'm not really sure why (it could be TRIM, ext4 not being optimized for SSDs, a whole bunch of factors like that).

What was the full config of this test? I have gotten over these speeds with just SMB, RAID 0 SATA3 SSDs, no TRIM, no special config, no RDMA.

 

33 minutes ago, alpha754293 said:

But this is also aimed at getting something like the performance of a $100,000 system (with a "proper" distributed NVMeoF solution) for around just $6k.

That ain't gonna happen. That $100k does pay for things.

 

45 minutes ago, alpha754293 said:

The compute nodes are already in a cluster.

What type of cluster? What OS are these nodes running?

 

 

Have you tried to set up Storage Spaces Direct? A cluster seems to be the way to go here. I'd get like 4 dual-socket 2011 nodes here for storage, put as much RAM as you can in all of them, and cheaper SATA SSDs. That's the best you're gonna get for 6k. You won't get the performance you want for 6k; you can't even buy the drives for that amount to take in that much data.

 

 

