
My Ryzen cluster now has (almost) 100 Gbps networking between each other

alpha754293

I have two AMD Ryzen systems, both with the Ryzen 9 5950X processor, but one has an Asus X570 TUF Gaming Pro WiFi motherboard whilst the other has an Asus ROG STRIX X570-E Gaming WiFi II motherboard.

Earlier today, I was trying to diagnose an issue with my 100 Gbps network connection between the two Ryzen nodes and the microcluster headnode, where upon running ib_send_bw, I was only getting around 14 Gbps.
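
(For anyone who wants to run the same kind of point-to-point test, it's just the ib_send_bw tool from the perftest package on both ends. Below is a rough Python sketch of driving it; the device name and server address are placeholders rather than a description of my exact setup.)

```python
#!/usr/bin/env python3
"""Rough sketch: drive a point-to-point ib_send_bw test from the client side.

Assumes the perftest package is installed on both nodes and the ConnectX-4
HCA shows up as mlx5_0; the device name and server address below are
placeholders, not my actual configuration.
"""
import subprocess

HCA = "mlx5_0"       # placeholder InfiniBand device name (check with `ibstat`)
SERVER = "10.0.0.1"  # placeholder address of the node running the server side

# On the server node, start the listener first:
#   ib_send_bw -d mlx5_0 -a --report_gbits
# Then run the client side against it, sweeping all message sizes (-a)
# and reporting results in Gbit/s instead of MB/s:
result = subprocess.run(
    ["ib_send_bw", "-d", HCA, "-a", "--report_gbits", SERVER],
    capture_output=True,
    text=True,
)
print(result.stdout)  # perftest prints a table of message size vs. bandwidth
```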

Now that I've taken the discrete GPUs out of each of the systems, I'm getting 96.58 Gbps on my micro HPC cluster.

Yay!!!

(I can't imagine there being too many people who have Ryzen systems with 100 Gbps networking tying them together.)

IB >>> ETH


Uhmm... Show Off?

I have an ASUS G14 2021 with Manjaro KDE and I am a professional Linux NoOB and also pretty bad at General Computing.

 

ALSO I DON'T EDIT MY POSTS* NOWADAYS SO NO NEED TO REFRESH BEFORE REPLYING *unless I edit my post


That's pretty cool. What kinda NICs are you using? And what's the purpose of the cluster? 

Gaming HTPC:

R5 5600X - Cryorig C7 - Asus ROG B350-i - EVGA RTX2060KO - 16gb G.Skill Ripjaws V 3333mhz - Corsair SF450 - 500gb 960 EVO - LianLi TU100B


Desktop PC:
R9 3900X - Peerless Assassin 120 SE - Asus Prime X570 Pro - Powercolor 7900XT - 32gb LPX 3200mhz - Corsair SF750 Platinum - 1TB WD SN850X - CoolerMaster NR200 White - Gigabyte M27Q-SA - Corsair K70 Rapidfire - Logitech MX518 Legendary - HyperXCloud Alpha wireless


Boss-NAS [Build Log]:
R5 2400G - Noctua NH-D14 - Asus Prime X370-Pro - 16gb G.Skill Aegis 3000mhz - Seasonic Focus Platinum 550W - Fractal Design R5 - 
250gb 970 Evo (OS) - 2x500gb 860 Evo (Raid0) - 6x4TB WD Red (RaidZ2)

Synology-NAS:
DS920+
2x4TB Ironwolf - 1x18TB Seagate Exos X20

 

Audio Gear:

Hifiman HE-400i - Kennerton Magister - Beyerdynamic DT880 250Ohm - AKG K7XX - Fostex TH-X00 - O2 Amp/DAC Combo - 
Klipsch RP280F - Klipsch RP160M - Klipsch RP440C - Yamaha RX-V479

 

Reviews and Stuff:

GTX 780 DCU2 // 8600GTS // Hifiman HE-400i // Kennerton Magister
Folding all the Proteins! // Boincerino

Useful Links:
Do you need an AMP/DAC? // Recommended Audio Gear // PSU Tier List 


2 hours ago, fUnDaMeNtAl_knobhead said:

Uhmm... Show Off?

Uhmm... jealous?

 

wtf...

 

2 hours ago, alpha754293 said:

Now that I've taken the discrete GPUs out of each of the systems, I'm getting 96.58 Gbps on my micro HPC cluster.

When I read that, the first thing I thought was: not enough PCIe lanes! Then I remembered that the 5950X offers 24 PCIe 4.0 lanes: 16 for the GPU, 4 for NVMe storage, and 4 to the chipset.


Even at only 4 PCIe 3.0 lanes you should have a total bandwidth of about 32 Gbit/s available. The 14 Gbit/s doesn't really make sense there. So honestly, no idea what's going on.

 

 

1 hour ago, FloRolf said:

And what's the purpose of the cluster? 

I expect it to be for the most logical and reasonable purpose of all time: LULZ! 😛

 

 

 

 


?



On 6/14/2022 at 12:30 AM, fUnDaMeNtAl_knobhead said:

Uhmm... Show Off?

 

Not really. It's just a function of money.

 

I actually bought the NICs and cables because of a video that LTT made a few years ago showing that you can buy the network cards off eBay for about half the price of retail. So it seemed like a "good enough" deal for me.

And if you calculate the cost efficiency in $/Gbps, 100 Gbps works out cheaper per port than even 10 GbE.
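
(A rough back-of-the-envelope version of that math is below; the prices in it are made-up placeholders just to show the comparison, not what I actually paid.)

```python
# Back-of-the-envelope $/Gbps comparison. The prices are made-up
# placeholders for illustration -- plug in whatever eBay/retail prices
# you actually find.
nics = {
    # name: (price in dollars, Gbps per port)
    "used 100 Gbps IB NIC": (150.0, 100),
    "10 GbE NIC":           (100.0, 10),
}

for name, (price, gbps) in nics.items():
    print(f"{name}: ${price / gbps:.2f} per Gbps")

# Even if the 100 Gbps card costs more up front, the cost per Gbps per
# port ends up far lower than 10 GbE -- provided you can actually use
# the bandwidth.
```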

 

It saves you money if you're able to make use of the bandwidth.

 

On 6/14/2022 at 1:26 AM, FloRolf said:

What kinda NICs are you using? And what's the purpose of the cluster?


Mellanox ConnectX-4 dual-port VPI InfiniBand network card, 100 Gbps per port (MCX456A-ECAT).

It's a PCIe 3.0 x16 card with two 100 Gbps ports, and they're Virtual Protocol Interconnect (VPI), meaning I can set each port to run in either InfiniBand (IB) mode or Ethernet (ETH) mode, or run one port in each.
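
(For anyone curious, flipping a port's personality is done with mlxconfig from the Mellanox Firmware Tools. A rough sketch is below; the PCI address is a placeholder for whatever your card shows up as in lspci, and the change only takes effect after a reboot or firmware reset.)

```python
# Rough sketch: set port 1 of a ConnectX-4 VPI card to InfiniBand and
# port 2 to Ethernet using mlxconfig (Mellanox Firmware Tools).
# LINK_TYPE values: 1 = InfiniBand, 2 = Ethernet.
# The PCI address is a placeholder -- use whatever `lspci | grep Mellanox`
# reports for your card. mlxconfig asks for confirmation before applying,
# and the new port types only take effect after a reboot/firmware reset.
import subprocess

DEVICE = "0000:01:00.0"  # placeholder PCI address of the ConnectX-4

subprocess.run(
    ["mlxconfig", "-d", DEVICE, "set", "LINK_TYPE_P1=1", "LINK_TYPE_P2=2"],
    check=True,
)
```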

(Although I'm only using one port per card per system right now, because whilst I would like to deploy 100 GbE, the switches still aren't quite cost-effective enough for me. Plus, PCIe 3.0 x16 can only support a max of about 128 Gbps, which would be shared between both 100 Gbps ports, so it's actually better for me to use just the one port for now.)
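
(Roughly where that 128 Gbps figure comes from: PCIe 3.0 runs 8 GT/s per lane, and after 128b/130b encoding a x16 link tops out just under that, which is nowhere near enough for two ports at 100 Gbps each.)

```python
# Why a single PCIe 3.0 x16 card can't feed two 100 Gbps ports at once:
# usable link bandwidth = lanes * transfer rate * encoding efficiency.
def pcie_gbps(lanes: int, gtps: float, encoding_efficiency: float) -> float:
    """Usable bandwidth of a PCIe link in Gbit/s."""
    return lanes * gtps * encoding_efficiency

gen3_x16 = pcie_gbps(16, 8.0, 128 / 130)   # PCIe 3.0: 8 GT/s, 128b/130b
gen4_x16 = pcie_gbps(16, 16.0, 128 / 130)  # PCIe 4.0: 16 GT/s, 128b/130b

print(f"PCIe 3.0 x16: ~{gen3_x16:.0f} Gbps total")  # ~126 Gbps
print(f"PCIe 4.0 x16: ~{gen4_x16:.0f} Gbps total")  # ~252 Gbps

# Two ports at 100 Gbps each want 200 Gbps, so only a Gen4 x16 slot
# (i.e. a ConnectX-5 Ex class card) could actually feed both ports.
print(f"Gen3 shortfall vs. 2 x 100 Gbps: {200 - gen3_x16:.0f} Gbps")
```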

 

I use my cluster for HPC/CAE/CFD/FEA applications.

This is an example of the type of stuff I run on my micro HPC cluster:

 

This specific simulation actually ran previously on my 4-node cluster, where each node was dual-socket with 8 cores per socket (HTT disabled) and 128 GB of RAM, for a total of 512 GB across the entire 4-node cluster.

 

This is another video of another type of simulation that I ran previously as well:


This simulation models what would happen if you were about to be in a head-on collision and both cars swerve away from each other, but not enough, and they end up hitting each other anyway.

Stuff like this.
 

23 hours ago, Senzelian said:

When I read that, the first thing I thought was: not enough PCIe lanes! Then I remembered that the 5950X offers 24 PCIe 4.0 lanes: 16 for the GPU, 4 for NVMe storage, and 4 to the chipset.


Even at only 4 PCIe 3.0 lanes you should have a total bandwidth of about 32 Gbit/s available. The 14 Gbit/s doesn't really make sense there. So honestly, no idea what's going on.

Yes and no.

I think that when I first set up my first node/system, I did try to see if I could install the discrete GPU (it was either the GTX 660 or the GTX 980, I think) into the secondary PCIe slot and have the Mellanox card in the primary slot.

And as I also vaguely recall, the POST screen complained about the discrete GPU not being in the primary slot and refused to boot.

 

But now that both systems seem to be running relatively stably, they're completely headless. And because the 5950X doesn't have an iGPU, I literally have no video output from the nodes at all. (Not unless I spend $300 per node on the TinyPilot iKVM.)

So right now, I just remote in over ssh and/or VNC.

 

re: 14 Gbps

Yeah, I don't know what was going on with that either.

I forget how to check what the PCIe link rate is in Linux (I'm pretty sure I could've googled it, but that would have just been "information only", i.e. I don't know if there was a way for me to FORCE a specific link rate). And even if I could, that might have ended up causing system instabilities, which wouldn't really have fixed the problem properly anyway.
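
(For what it's worth, one way to at least read the negotiated link speed and width on Linux is straight out of sysfs; a rough sketch is below, with the PCI address as a placeholder for wherever the Mellanox card sits.)

```python
# Read the negotiated PCIe link speed/width for a device from sysfs.
# The PCI address is a placeholder -- find the Mellanox card's address
# with something like `lspci | grep Mellanox` first.
from pathlib import Path

DEVICE = "0000:01:00.0"  # placeholder PCI address
dev = Path("/sys/bus/pci/devices") / DEVICE

for attr in ("current_link_speed", "max_link_speed",
             "current_link_width", "max_link_width"):
    print(f"{attr}: {(dev / attr).read_text().strip()}")

# If the current speed/width reads lower than the max (e.g. x4 negotiated
# down to x2, or 8.0 GT/s down to 5.0 GT/s), that kind of downgrade would
# line up with the ~14 Gbps ib_send_bw result.
```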

 

One of my hypotheses is that it actually dropped the link from PCIe 3.0 x4 down to either PCIe 3.0 x2 or PCIe 2.0 x4. Hard to tell at this point (since I've moved the Mellanox card to the primary PCIe slot now).

(The Mellanox cards are only PCIe 3.0 x16 cards anyways, so PCIe 4.0 won't really help.)

Even as a 100 Gbps card, I've never gotten exactly 100 Gbps out of it.

The closest that I've gotten was about 97 Gbps in Windows and 96.58 Gbps in Linux (CentOS 7.7.1908).

In an actual application, it was testing at around 89 Gbps. So if the link had dropped from PCIe 3.0 x4 to PCIe 3.0 x2, it would've only been delivering a theoretical max of 16 Gbps, and hitting 14 Gbps would mean it was getting 87.5% of the PCIe 3.0 x2 theoretical bandwidth.

If it was PCIe 2.0 x4, that should've been capable of 20 Gbps, of which 14 Gbps would be 70% of the theoretical bandwidth.
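
(The same math written out below, with the encoding overhead included for completeness.)

```python
# The two link-downgrade hypotheses, written out. Raw rate = lanes * GT/s;
# usable rate also subtracts the encoding overhead (128b/130b for PCIe 3.0,
# 8b/10b for PCIe 2.0).
measured = 14.0  # Gbps observed with ib_send_bw in the secondary slot

hypotheses = {
    # name: (lanes, GT/s per lane, encoding efficiency)
    "PCIe 3.0 x2": (2, 8.0, 128 / 130),
    "PCIe 2.0 x4": (4, 5.0, 8 / 10),
}

for name, (lanes, gtps, eff) in hypotheses.items():
    raw = lanes * gtps
    usable = raw * eff
    print(f"{name}: {raw:.0f} Gbps raw (~{usable:.1f} Gbps usable), "
          f"14 Gbps = {measured / raw:.1%} of raw")

# PCIe 3.0 x2: 16 Gbps raw -> 14 Gbps is 87.5% of that.
# PCIe 2.0 x4: 20 Gbps raw -> 14 Gbps is 70% of that.
```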

Either way, something was off. I also couldn't tell whether the card was "firing signals" on the rest of its connector: the slot is x16 physically but may only be x4 or x8 electrically, so the part of the card without a closed contact/connection could have been expecting responses to those signals that never came back.

So...who knows.

It's a pity that the Mellanox ConnectX-5 Ex (MCX556A-EDAT) cards are still too expensive for me, even on eBay, because those are PCIe 4.0 x16 cards, which the Ryzen systems could actually make use of: you'd be able to run both ports at 100 Gbps each out of the shared ~256 Gbps that a PCIe 4.0 x16 slot affords.



Got any pictures? 😛 

CPU: AMD Ryzen 5 5600X | CPU Cooler: Stock AMD Cooler | Motherboard: Asus ROG STRIX B550-F GAMING (WI-FI) | RAM: Corsair Vengeance LPX 16 GB (2 x 8 GB) DDR4-3000 CL16 | GPU: Nvidia GTX 1060 6GB Zotac Mini | Case: K280 Case | PSU: Cooler Master B600 Power supply | SSD: 1TB  | HDDs: 1x 250GB & 1x 1TB WD Blue | Monitors: 24" Acer S240HLBID + 24" Samsung  | OS: Win 10 Pro

 

Audio: Behringer Q802USB Xenyx 8 Input Mixer |  U-PHORIA UMC204HD | Behringer XM8500 Dynamic Cardioid Vocal Microphone | Sound Blaster Audigy Fx PCI-E card.

 

Home Lab:  Lenovo ThinkCenter M82 ESXi 6.7 | Lenovo M93 Tiny Exchange 2019 | TP-LINK TL-SG1024D 24-Port Gigabit | Cisco ASA 5506 firewall  | Cisco Catalyst 3750 Gigabit Switch | Cisco 2960C-LL | HP MicroServer G8 NAS | Custom built SCCM Server.

 

 


6 hours ago, alpha754293 said:

 

Not really. It's just a function of money.

 

 

I think he means that there's no question or discussion initiated by your post. It's more status-update material...


On 6/15/2022 at 5:44 AM, Sir Asvald said:

Got any pictures?

 

Not really, only because the systems are just in old Antec full-tower cases that are probably around 15 years old now, so there's really not much that's exciting or interesting to look at.

 

On 6/15/2022 at 7:42 AM, Blue4130 said:

I think he means that there's no question or discussion initiated by your post. It's more status-update material...

It is.

The LTT forum doesn't really seem to have anything akin to a "build log" section, and since this pertains to networking, that's why I posted it in this section of the forum.

 

No, the "question" that would've come up previously was "why is my 100 Gbps InfiniBand card only getting around 14 Gbps?" when it was plugged into the secondary PCIe slot.

But the moment I removed the discrete GPU and plugged said Mellanox card into the primary PCIe slot, it was able to get close to 100 Gbps speeds again.

The more INTERESTING question is whether I can force BOTH of those motherboards to accept said discrete GPUs in the secondary slot, because the last time I tried it, an error message came up after POST saying that the GPU was in the secondary slot, and the system wouldn't boot.

So if there is a way to FORCE it to boot with the GPU in the secondary slot, that would be great. And no, neither of the motherboard manuals covers that. (In fact, the manual for the X570 TUF Gaming Pro WiFi doesn't even have a table for multi-GPU operation, because I don't think the board supports it and/or there aren't enough PCIe slots for it. I also don't think there's even an option to run both slots at x8/x8, since the secondary PCIe slot is possibly an x4 slot going through the chipset, and I don't know of a way to FORCE the system to accept that as the slot for the discrete GPU and proceed with the boot sequence.)

 

(On the flip side, it works fine now as a purely headless system, so I'm not super worried about not having a discrete GPU in there. However, if there are ever problems with getting the Mellanox InfiniBand card to operate at full speed and I need console access, I'd likely need to spend $300 on the TinyPilot iKVM to get console access without said discrete GPU installed.)


