
My Ryzen cluster now has (almost) 100 Gbps networking between each other

alpha754293

I have two AMD Ryzen systems, both with the Ryzen 9 5950X processor, but one has an Asus X570 TUF Gaming Pro WiFi motherboard whilst the other has an Asus ROG STRIX X570-E Gaming WiFi II motherboard.

Earlier today, I was trying to diagnose an issue with my 100 Gbps network connection between the two Ryzen nodes and the microcluster headnode, where upon running ib_send_bw, I was only getting around 14 Gbps.
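
(For anyone who wants to run the same kind of point-to-point test, it's just the ib_send_bw tool from the perftest package on both ends. Below is a rough Python sketch of driving it; the device name and server address are placeholders rather than a description of my exact setup.)

```python
#!/usr/bin/env python3
"""Rough sketch: drive a point-to-point ib_send_bw test from the client side.

Assumes the perftest package is installed on both nodes and the ConnectX-4
HCA shows up as mlx5_0; the device name and server address below are
placeholders, not my actual configuration.
"""
import subprocess

HCA = "mlx5_0"       # placeholder InfiniBand device name (check with `ibstat`)
SERVER = "10.0.0.1"  # placeholder address of the node running the server side

# On the server node, start the listener first:
#   ib_send_bw -d mlx5_0 -a --report_gbits
# Then run the client side against it, sweeping all message sizes (-a)
# and reporting results in Gbit/s instead of MB/s:
result = subprocess.run(
    ["ib_send_bw", "-d", HCA, "-a", "--report_gbits", SERVER],
    capture_output=True,
    text=True,
)
print(result.stdout)  # perftest prints a table of message size vs. bandwidth
```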

Now that I've taken the discrete GPUs out of each of the systems, I'm getting 96.58 Gbps on my micro HPC cluster.

Yay!!!

(I can't imagine there being too many people who have Ryzen systems with 100 Gbps networking tying them together.)

IB >>> ETH


Uhmm... Show Off?

I have an ASUS G14 2021 with Manjaro KDE and I am a professional Linux NoOB and also pretty bad at General Computing.

 

ALSO I DON'T EDIT MY POSTS* NOWADAYS SO NO NEED TO REFRESH BEFORE REPLYING *unless I edit my post


That's pretty cool. What kinda NICs are you using? And what's the purpose of the cluster? 

Gaming HTPC:

R5 5600X - Cryorig C7 - Asus ROG B350-i - EVGA RTX2060KO - 16gb G.Skill Ripjaws V 3333mhz - Corsair SF450 - 500gb 960 EVO - LianLi TU100B


Desktop PC:
R9 3900X - Peerless Assassin 120 SE - Asus Prime X570 Pro - Powercolor 7900XT - 32gb LPX 3200mhz - Corsair SF750 Platinum - 1TB WD SN850X - CoolerMaster NR200 White - Gigabyte M27Q-SA - Corsair K70 Rapidfire - Logitech MX518 Legendary - HyperXCloud Alpha wireless


Boss-NAS [Build Log]:
R5 2400G - Noctua NH-D14 - Asus Prime X370-Pro - 16gb G.Skill Aegis 3000mhz - Seasonic Focus Platinum 550W - Fractal Design R5 - 
250gb 970 Evo (OS) - 2x500gb 860 Evo (Raid0) - 6x4TB WD Red (RaidZ2)

Synology-NAS:
DS920+
2x4TB Ironwolf - 1x18TB Seagate Exos X20

 

Audio Gear:

Hifiman HE-400i - Kennerton Magister - Beyerdynamic DT880 250Ohm - AKG K7XX - Fostex TH-X00 - O2 Amp/DAC Combo - 
Klipsch RP280F - Klipsch RP160M - Klipsch RP440C - Yamaha RX-V479

 

Reviews and Stuff:

GTX 780 DCU2 // 8600GTS // Hifiman HE-400i // Kennerton Magister
Folding all the Proteins! // Boincerino

Useful Links:
Do you need an AMP/DAC? // Recommended Audio Gear // PSU Tier List 


2 hours ago, fUnDaMeNtAl_knobhead said:

Uhmm... Show Off?

Uhmm... jealous?

 

wtf...

 

2 hours ago, alpha754293 said:

Now that I've taken the discrete GPUs out of each of the systems, I'm getting 96.58 Gbps on my micro HPC cluster.

When I read that, the first thing I thought was: not enough PCIe lanes! Then I remembered that the 5950X offers 24 PCIe 4.0 lanes: 16 for the GPU, 4 for NVMe storage, and 4 to the chipset.


Even at only 4 PCIe 3.0 lanes you should have a total bandwidth of about 32 Gbit/s available. The 14 Gbit/s doesn't really make sense there. So honestly, no idea what's going on.

 

 

1 hour ago, FloRolf said:

And what's the purpose of the cluster? 

I expect it to be for the most logical and reasonable purpose of all time: LULZ! 😛

 

 

 

 


?



On 6/14/2022 at 12:30 AM, fUnDaMeNtAl_knobhead said:

Uhmm... Show Off?

 

Not really. It's just a function of money.

 

I actually bought the NICs and cables because of a video that LTT made a few years ago showing that you can buy the network cards off eBay for about half the price of retail. So it seemed like a "good enough" deal for me.

And if you calculate the cost efficiency in $/Gbps, 100 Gbps works out cheaper per port than even 10 GbE.
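
(A rough back-of-the-envelope version of that math is below; the prices in it are made-up placeholders just to show the comparison, not what I actually paid.)

```python
# Back-of-the-envelope $/Gbps comparison. The prices are made-up
# placeholders for illustration -- plug in whatever eBay/retail prices
# you actually find.
nics = {
    # name: (price in dollars, Gbps per port)
    "used 100 Gbps IB NIC": (150.0, 100),
    "10 GbE NIC":           (100.0, 10),
}

for name, (price, gbps) in nics.items():
    print(f"{name}: ${price / gbps:.2f} per Gbps")

# Even if the 100 Gbps card costs more up front, the cost per Gbps per
# port ends up far lower than 10 GbE -- provided you can actually use
# the bandwidth.
```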

 

It saves you money if you're able to make use of the bandwidth.

 

On 6/14/2022 at 1:26 AM, FloRolf said:

What kinda NICs are you using? And what's the purpose of the cluster?


Mellanox ConnectX-4 dual-port VPI InfiniBand network card, 100 Gbps per port (MCX456A-ECAT).

It's a PCIe 3.0 x16 card with two 100 Gbps ports, and they're Virtual Protocol Interconnect (VPI), meaning I can set each port to run in either InfiniBand (IB) mode or Ethernet (ETH) mode, or run one port in each.
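
(For anyone curious, flipping a port's personality is done with mlxconfig from the Mellanox Firmware Tools. A rough sketch is below; the PCI address is a placeholder for whatever your card shows up as in lspci, and the change only takes effect after a reboot or firmware reset.)

```python
# Rough sketch: set port 1 of a ConnectX-4 VPI card to InfiniBand and
# port 2 to Ethernet using mlxconfig (Mellanox Firmware Tools).
# LINK_TYPE values: 1 = InfiniBand, 2 = Ethernet.
# The PCI address is a placeholder -- use whatever `lspci | grep Mellanox`
# reports for your card. mlxconfig asks for confirmation before applying,
# and the new port types only take effect after a reboot/firmware reset.
import subprocess

DEVICE = "0000:01:00.0"  # placeholder PCI address of the ConnectX-4

subprocess.run(
    ["mlxconfig", "-d", DEVICE, "set", "LINK_TYPE_P1=1", "LINK_TYPE_P2=2"],
    check=True,
)
```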

(Although I'm only using one port per card per system right now, because whilst I would like to deploy 100 GbE, the switches still aren't quite cost-effective enough for me. Plus, PCIe 3.0 x16 can only support a max of about 128 Gbps, which would be shared between both 100 Gbps ports, so it's actually better for me to use just the one port for now.)
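
(Roughly where that 128 Gbps figure comes from: PCIe 3.0 runs 8 GT/s per lane, and after 128b/130b encoding a x16 link tops out just under that, which is nowhere near enough for two ports at 100 Gbps each.)

```python
# Why a single PCIe 3.0 x16 card can't feed two 100 Gbps ports at once:
# usable link bandwidth = lanes * transfer rate * encoding efficiency.
def pcie_gbps(lanes: int, gtps: float, encoding_efficiency: float) -> float:
    """Usable bandwidth of a PCIe link in Gbit/s."""
    return lanes * gtps * encoding_efficiency

gen3_x16 = pcie_gbps(16, 8.0, 128 / 130)   # PCIe 3.0: 8 GT/s, 128b/130b
gen4_x16 = pcie_gbps(16, 16.0, 128 / 130)  # PCIe 4.0: 16 GT/s, 128b/130b

print(f"PCIe 3.0 x16: ~{gen3_x16:.0f} Gbps total")  # ~126 Gbps
print(f"PCIe 4.0 x16: ~{gen4_x16:.0f} Gbps total")  # ~252 Gbps

# Two ports at 100 Gbps each want 200 Gbps, so only a Gen4 x16 slot
# (i.e. a ConnectX-5 Ex class card) could actually feed both ports.
print(f"Gen3 shortfall vs. 2 x 100 Gbps: {200 - gen3_x16:.0f} Gbps")
```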

 

I use my cluster for HPC/CAE/CFD/FEA applications.

This is an example of the type of stuff I run on my micro HPC cluster:

 

This specific simulation actually ran previously on my 4-node cluster, where each node was dual-socket with 8 cores per socket (HTT disabled) and 128 GB of RAM, for a total of 512 GB across the entire 4-node cluster.

 

This is another video of another type of simulation that I ran previously as well:


This simulation models what would happen if you were about to be in a head-on collision and both cars swerve away from each other, but not enough, and they end up hitting each other anyway.

Stuff like this.
 

23 hours ago, Senzelian said:

When I read that, the first thing I thought was: not enough PCIe lanes! Then I remembered that the 5950X offers 24 PCIe 4.0 lanes: 16 for the GPU, 4 for NVMe storage, and 4 to the chipset.


Even at only 4 PCIe 3.0 lanes you should have a total bandwidth of about 32 Gbit/s available. The 14 Gbit/s doesn't really make sense there. So honestly, no idea what's going on.

Yes and no.

I think that when I first set up my first node/system, I did try to see if I could install the discrete GPU (it was either the GTX 660 or the GTX 980, I think) into the secondary PCIe slot and have the Mellanox card in the primary slot.

And as I also vaguely recall, the POST screen complained about the discrete GPU not being in the primary slot and refused to boot.

 

But now that both systems seem to be running relatively stably, they're completely headless. And because the 5950X doesn't have an iGPU, I literally have no video output from the nodes at all. (Not unless I spend $300 per node on the TinyPilot iKVM.)

So right now, I just remote in over ssh and/or VNC.

 

re: 14 Gbps

Yeah, I don't know what was going on with that either.

I forget how to check what the PCIe link rate is in Linux (I'm pretty sure I could've googled it, but that would have just been "information only", i.e. I don't know if there was a way for me to FORCE a specific link rate). And even if I could, that might have ended up causing system instabilities, which wouldn't really have fixed the problem properly anyway.
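
(For what it's worth, one way to at least read the negotiated link speed and width on Linux is straight out of sysfs; a rough sketch is below, with the PCI address as a placeholder for wherever the Mellanox card sits.)

```python
# Read the negotiated PCIe link speed/width for a device from sysfs.
# The PCI address is a placeholder -- find the Mellanox card's address
# with something like `lspci | grep Mellanox` first.
from pathlib import Path

DEVICE = "0000:01:00.0"  # placeholder PCI address
dev = Path("/sys/bus/pci/devices") / DEVICE

for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(f"{attr}: {(dev / attr).read_text().strip()}")

# If the current speed/width reads lower than the max (e.g. x4 negotiated
# down to x2, or 8.0 GT/s down to 5.0 GT/s), that kind of downgrade would
# line up with the ~14 Gbps ib_send_bw result.
```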

 

One of my hypotheses is that it actually dropped the link from PCIe 3.0 x4 down to either PCIe 3.0 x2 or PCIe 2.0 x4. Hard to tell at this point (since I've moved the Mellanox card to the primary PCIe slot now).

(The Mellanox cards are only PCIe 3.0 x16 cards anyways, so PCIe 4.0 won't really help.)

Even as a 100 Gbps card, I've never gotten exactly 100 Gbps out of it.

The closest that I've gotten was about 97 Gbps in Windows and 96.58 Gbps in Linux (CentOS 7.7.1908).

In an actual application, it was testing at around 89 Gbps. So if the link had dropped from PCIe 3.0 x4 to PCIe 3.0 x2, it would've only been delivering a theoretical max of 16 Gbps, and hitting 14 Gbps would mean it was getting 87.5% of the PCIe 3.0 x2 theoretical bandwidth.

If it was PCIe 2.0 x4, that should've been capable of 20 Gbps, of which 14 Gbps would be 70% of the theoretical bandwidth.
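
(The same math written out below, with the encoding overhead included for completeness.)

```python
# The two link-downgrade hypotheses, written out. Raw rate = lanes * GT/s;
# usable rate also subtracts the encoding overhead (128b/130b for PCIe 3.0,
# 8b/10b for PCIe 2.0).
measured = 14.0  # Gbps observed with ib_send_bw in the secondary slot

hypotheses = {
    # name: (lanes, GT/s per lane, encoding efficiency)
    "PCIe 3.0 x2": (2, 8.0, 128 / 130),
    "PCIe 2.0 x4": (4, 5.0, 8 / 10),
}

for name, (lanes, gtps, eff) in hypotheses.items():
    raw = lanes * gtps
    usable = raw * eff
    print(f"{name}: {raw:.0f} Gbps raw (~{usable:.1f} Gbps usable), "
          f"14 Gbps = {measured / raw:.1%} of raw")

# PCIe 3.0 x2: 16 Gbps raw -> 14 Gbps is 87.5% of that.
# PCIe 2.0 x4: 20 Gbps raw -> 14 Gbps is 70% of that.
```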

Either way, something was off. I also couldn't tell whether the card was "firing signals" on the rest of its connector: the slot is x16 physically but may only be x4 or x8 electrically, so the part of the card without a closed contact/connection could have been expecting responses to those signals that never came back.

So...who knows.

It's a pity that the Mellanox ConnectX-5 Ex (MCX556A-EDAT) cards are still too expensive for me, even on eBay, because those are PCIe 4.0 x16 cards, which the Ryzen systems could actually make use of: you'd be able to run both ports at 100 Gbps each out of the shared ~256 Gbps that a PCIe 4.0 x16 slot affords.



Got any pictures? 😛 

CPU: AMD Ryzen 5 5600X | CPU Cooler: Stock AMD Cooler | Motherboard: Asus ROG STRIX B550-F GAMING (WI-FI) | RAM: Corsair Vengeance LPX 16 GB (2 x 8 GB) DDR4-3000 CL16 | GPU: Nvidia GTX 1060 6GB Zotac Mini | Case: K280 Case | PSU: Cooler Master B600 Power supply | SSD: 1TB  | HDDs: 1x 250GB & 1x 1TB WD Blue | Monitors: 24" Acer S240HLBID + 24" Samsung  | OS: Win 10 Pro

 

Audio: Behringer Q802USB Xenyx 8 Input Mixer |  U-PHORIA UMC204HD | Behringer XM8500 Dynamic Cardioid Vocal Microphone | Sound Blaster Audigy Fx PCI-E card.

 

Home Lab:  Lenovo ThinkCenter M82 ESXi 6.7 | Lenovo M93 Tiny Exchange 2019 | TP-LINK TL-SG1024D 24-Port Gigabit | Cisco ASA 5506 firewall  | Cisco Catalyst 3750 Gigabit Switch | Cisco 2960C-LL | HP MicroServer G8 NAS | Custom built SCCM Server.

 

 


6 hours ago, alpha754293 said:

 

Not really. It's just a function of money.

 

 

I think he means that there's no question or discussion initiated by your post. It's more status-update material...


On 6/15/2022 at 5:44 AM, Sir Asvald said:

Got any pictures?

 

Not really, only because the systems are just in old Antec full-tower cases that are probably around 15 years old now, so there's really not much that's exciting or interesting to look at.

 

On 6/15/2022 at 7:42 AM, Blue4130 said:

I think he means that there's no question or discussion initiated by your post. It's more status-update material...

It is.

The LTT forum doesn't really seem to have anything akin to a "build log" section, and since this pertains to networking, that's why I posted it in this section of the forum.

 

No, the "question" that would've come up previously was "why is my 100 Gbps InfiniBand card only getting around 14 Gbps?" when it was plugged into the secondary PCIe slot.

But the moment I removed the discrete GPU and plugged said Mellanox card into the primary PCIe slot, it was able to get close to 100 Gbps speeds again.

The more INTERESTING question is whether I can force BOTH of those motherboards to accept said discrete GPUs in the secondary slot, because the last time I tried it, an error message came up after POST saying that the GPU was in the secondary slot, and the system wouldn't boot.

So if there is a way to FORCE it to boot with the GPU in the secondary slot, that would be great. And no, neither of the motherboard manuals covers that. (In fact, the manual for the X570 TUF Gaming Pro WiFi doesn't even have a table for multi-GPU operation, because I don't think the board supports it and/or there aren't enough PCIe slots for it. I also don't think there's even an option to run both slots at x8/x8, since the secondary PCIe slot is possibly an x4 slot going through the chipset, and I don't know of a way to FORCE the system to accept that as the slot for the discrete GPU and proceed with the boot sequence.)

 

(On the flip side, it works fine now as a purely headless system, so I'm not super worried about not having a discrete GPU in there. However, if there are ever problems with getting the Mellanox InfiniBand card to operate at full speed and I need console access, I'd likely need to spend $300 on the TinyPilot iKVM to get console access without said discrete GPU installed.)


