nicklmg

This Server Deployment was HORRIBLE


Posted · Original PosterOP

 

Buy AMD EPYC Rome processors on Amazon (PAID LINK): https://geni.us/PYlsxp

 

 

Buy Intel P4500 on Amazon (PAID LINK): https://geni.us/ij2Qd0

 

Buy Crucial 32GB DDR4-2933 on Amazon (PAID LINK): https://geni.us/c8MOx

 

Gigabyte EPYC server: https://www.gigabyte.com/Rack-Server/R272-Z32-rev-100#ov

 

Purchases made through some store links may provide some compensation to Linus Media Group.


Sign up for Floatplane. Do it. It's fantastic.

Link to post
13 minutes ago, nicklmg said:

<snip>

the video is unlisted on youtube at the moment btw


Link to post
8 minutes ago, Ben17 said:

the video is unlisted on youtube at the moment btw

True


Link to post

A dual-socket system should give you more overall memory bandwidth, and I would have thought to try dual 24- or 32-core chips if you could get your hands on them.


Link to post

This shows one of the actual advantages of a RAID card: it can save the CPU some of the trouble of handling this many devices.

That opens another can of worms, though: a RAID card for PCIe 4.0 NVMe is likely rather exotic, to say the least, and maybe unobtainium at the moment. (I haven't checked.)

 

Still, it is rather amazing how fast an application can chew through memory bandwidth.

A dual-socket system is yet another can of worms. Even if it theoretically gives twice the memory bandwidth, sending data between NUMA nodes suddenly becomes an issue, and the inter-CPU link introduces new potential bottlenecks. (QPI in Intel's offerings isn't all that amazing compared to a x16 PCIe 4.0 slot, and AMD's Infinity Fabric isn't much different, to be fair.)

 

At that point, it might be more logical to split the whole server and divide the editors into groups instead, or to give them their own drives, with the downside of not having a single database server for it all. Those compromises could be fixed in software, though; after all, a somewhat "similar" solution is what Floatplane has going. But keeping coherency between file servers is likely a headache-inducing topic, especially for a bunch of editors who potentially send files back and forth on a regular basis. There is likely a whole slew of different solutions for having multiple servers handle a database, all with their own pros and cons, though latency is likely the big downside regardless... (An interesting topic, but way too large for a single forum post...)

 

In the end, these types of systems are interesting.

Link to post

I had thoughts of making my own PCIe SSD storage array. I think I'll stick with SATA for a while longer if it's this big of an issue even in the enterprise.


Link to post

I would recommend trying a low-latency (1,000 Hz) kernel. The default generic kernel in Ubuntu is 250 Hz. From a performance standpoint, 250 and even 100 Hz are better because the system has fewer "interruptions", but 1,000 Hz can probably offer better stability because it gives the system more opportunities to check in on the needs of the system as a whole. I also recommend enabling x2APIC mode and MSI-X; these will provide more interrupts for your system.

 

I strongly recommend ZFS (For my work I've done extensive comparison testing with Ext4 and XFS), but I must urge you to use the latest 0.8.x branch because it's way faster; in some cases it offers 2x the performance of the 0.7.0 branch. This is available in Ubuntu 19.10 natively, and via PPA in Ubuntu 16.04 and 18.04... https://launchpad.net/~jonathonf/+archive/ubuntu/zfs

 

In the video you said nothing about NUMA domains; you are most assuredly hitting inter-NUMA transfer bandwidth limitations. You need to ensure everything is pinned to the same NUMA domains that the NVMe devices are attached to. You may be better off segregating all of the storage subsystem to a single NUMA domain.
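To check which NUMA domain each NVMe controller hangs off, something like the following rough Python sketch might help. It assumes a Linux host where each controller shows up under /sys/class/nvme with a device/numa_node file; the function name nvme_by_numa_node is just made up for illustration, and a value of -1 means the platform did not report a node.

```python
from collections import defaultdict
from pathlib import Path


def nvme_by_numa_node(sysfs_root="/sys/class/nvme"):
    """Group NVMe controller names by the NUMA node their PCIe device reports.

    Assumption: each controller directory exposes device/numa_node,
    as PCI devices normally do in Linux sysfs.
    """
    root = Path(sysfs_root)
    if not root.is_dir():
        return {}  # not a Linux host, or no NVMe class present

    groups = defaultdict(list)
    for ctrl in sorted(root.glob("nvme*")):
        node_file = ctrl / "device" / "numa_node"
        if not node_file.is_file():
            continue  # e.g. fabrics controllers without a local PCI device
        node = int(node_file.read_text().strip())  # -1 means "not reported"
        groups[node].append(ctrl.name)
    return dict(groups)
```

Calling nvme_by_numa_node() and printing the result gives a quick picture of how the drives are spread across nodes, which is the starting point for pinning the storage workload with a tool such as numactl.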

 

From a ZFS tuning perspective, a 128 KB record size offers the best overall net performance gain (assuming a mixed workload); however, 64 KB is a really good choice too if you work with small files with random access patterns. The qcow2 format uses 64 KB as the default cluster size, so that is also a good choice if the primary use case is VM storage. For 24 NVMe drives I would recommend 4x 6-disk raidz (or raidz2) vdevs; this will be nearly as fast as striped mirrors but offer a lot more usable capacity. Finding the right ashift value is not straightforward; it's best to simply performance test each value (e.g. 9, 12, 13, etc.). Use atime=off.

 

zpool create -O recordsize=64k -O compression=on -O atime=off -o ashift=9 data
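To put a rough number on the "a lot more usable capacity" point, here is a small Python sketch comparing the layouts mentioned above for 24 drives. It only counts data drives versus parity/mirror drives; raidz padding, metadata and compression are ignored, so treat it as back-of-the-envelope only. The helper name usable_drives is an assumption for illustration.

```python
def usable_drives(total_drives, vdev_width, redundancy_per_vdev):
    """Number of drives' worth of capacity left after parity/mirror copies.

    Assumption: every vdev has the same width and the drives divide evenly.
    """
    vdevs, leftover = divmod(total_drives, vdev_width)
    if leftover:
        raise ValueError("drives do not divide evenly into vdevs")
    return vdevs * (vdev_width - redundancy_per_vdev)


layouts = {
    "12x 2-way striped mirrors": usable_drives(24, 2, 1),
    "4x 6-disk raidz": usable_drives(24, 6, 1),
    "4x 6-disk raidz2": usable_drives(24, 6, 2),
}

for name, drives in layouts.items():
    print(f"{name}: {drives} of 24 drives usable ({drives / 24:.0%})")
```

With these assumptions, striped mirrors keep half of the raw capacity, while the raidz and raidz2 layouts keep 20 and 16 drives' worth respectively.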

 

For benchmarking I recommend the following:

 

fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bsrange=64k-192k --numjobs=16 --group_reporting=1 --random_distribution=zipf:0.5 --norandommap=1 --iodepth=24 --size=32G --rwmixread=50 --time_based=90 --runtime=90 --readwrite=randrw

Link to post

So...to recap: The 40 GbE Mellanox NIC is definitely the bottleneck. The NIC itself might be a PCIe 3.0 x16 card (I don't remember, I'll have to watch the original set up video to pull the model number and then look up the specs of it to confirm), but on the assumption that it's a PCIe 3.0 x16, that's 128 Gbps at the interface or 16 GB/s. (Note, it's a 40 Gb (lowercase b) E card, ergo 5 GB/s per port). If the 24 NVMe SSDs are PCIe 3.0 x4 each, that would be 32 Gbps per drive * 24 = 768 Gbps or 96 GB/s. I don't know if there are enterprise PCIe 4.0 x4 NVMe drives out just yet. I'm somewhat surprised that AMD EPYC and NVMe devices are having such a hard time with your array.
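The arithmetic in that recap can be written out as a short Python sketch. It uses the same round figure as the post, about 8 Gbps per PCIe 3.0 lane, and ignores encoding and protocol overhead (that simplification is an assumption, so real-world numbers will be somewhat lower).

```python
# Round number used in the post: ~8 Gbps per PCIe 3.0 lane.
# Assumption: encoding and protocol overhead are ignored.
PCIE3_GBPS_PER_LANE = 8


def pcie3_gbps(lanes):
    """Nominal PCIe 3.0 bandwidth in gigabits per second for a lane count."""
    return lanes * PCIE3_GBPS_PER_LANE


def gbps_to_gigabytes(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


nic_slot_gbps = pcie3_gbps(16)   # x16 slot for the NIC
nic_port_gbps = 40               # one 40 GbE port
array_gbps = 24 * pcie3_gbps(4)  # 24 NVMe drives at x4 each

print(f"NIC slot: {nic_slot_gbps} Gbps = {gbps_to_gigabytes(nic_slot_gbps)} GB/s")
print(f"NIC port: {nic_port_gbps} Gbps = {gbps_to_gigabytes(nic_port_gbps)} GB/s")
print(f"Array:    {array_gbps} Gbps = {gbps_to_gigabytes(array_gbps)} GB/s")
print(f"Array vs. one port: {array_gbps / nic_port_gbps:.1f}x")
```

On paper the drives can supply roughly 19 times what a single 40 GbE port can carry, which is why the port looks like the obvious ceiling.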

 
It's a pity that I don't have any NVMe RAID arrays up and running to be able to help you diagnose and fix the issue. The SSD array that I do have running right now is just four 1TB Samsung 860 EVOs SATA 6 Gbps SSDs, running in RAID0, and BEST case scenario, I am able to get about 2 GB/s write speeds (as each drive is good for about 500 MB/s), which is around 16 Gbps tops, on a four drive array. (This is being accessed over 100 Gbps 4x EDR Infiniband, by four compute nodes.)
 
So...having said all that, maybe a better solution for you might have been to get one of those four-half width nodes that can only support 6 NVMe devices per node, and then you can run a distributed/parallel filesystem like GlusterFS on it for your purposes.
 
(Even my four 6 TB HGST 7200 rpm SATA 6 Gbps HDDs are able to sustain around 700-750 MB/s (5.6-6 Gbps) speeds, as each drive is good for up to around 190 MB/s.)

That's what's in my scratch disk/head node right now.
 
In your use case, you might actually benefit from a distributed, parallel filesystem like GlusterFS.
 
I'm surprised that Wendell didn't suggest that to you. (Of course, it has the potential to increase your cost, but if each processor installed per node is, say, only an 8-core or 16-core CPU, then you might still be able to come out on top because your total number of PCIe lanes increases with the number of CPUs you install.)
 
Hmmm....interesting.
 
Good to know though.
 
(I still think that you should be able to get around it, but again, I don't have the hardware nor the budget to be able to help you verify that, unfortunately. :( )
Link to post

Partition your server (segregate your hardware) if you can.

Run GlusterFS.

 

(I was never able to get parallel NFS deployed on my cluster. GlusterFS was easier to deploy, so I would maybe test that out instead.)

Link to post

Company ordered one in early January, at which point it was said to arrive within 7 days. After a week we were told the distributor had all the parts but was unable to deliver, since the official rollout had been delayed until mid February. Sounded to me like they were holding it back until some update arrives.

 

Outside of video editing, systems such as this are moving towards being virtualization platforms that run everything from a RAM disk, while the NVMe drives are just there to provide backup layers.

Link to post
50 minutes ago, alpha754293 said:

<snip>

 

Mellanox ConnectX cards rarely ever meet expectations with regard to performance. The ASIC is gimped, and Mellanox lies about the cards' capabilities because, I guess, they know that very few people actually have the skill set necessary to performance test high-speed networking equipment and, more importantly, the confidence that the benchmarking was done correctly to call them on their bs. I've seen the performance problems in their ConnectX-2, ConnectX-3, and ConnectX-5 chips personally. We have ConnectX-5 Ex 100 GbE cards in the lab right now that can barely even hit 40 GbE in a Xeon Platinum 8160 system. Without the card, the system is capable of 2000 Gb/s over the loopback interfaces, using network namespaces to isolate each side of the connection. If you ever run across their dual port cards, just assume only one port is active because the second port is basically just a virtual function hardwired to a second physical port. If the two ports are active at the same time the ConnectX cards will behave like half duplex devices.

Link to post

I really, really enjoy watching this more in-depth kind of content. As much as I like GN and Level1, their delivery style and content quality just aren't the same as LTT's, and I much prefer your style. Please keep the gravy train coming with this type of stuff.

 

One of the videos I'm really looking forward to is the studio videos with the Mac Pro (and the Mac Pro rack?)


Link to post

However much you are going to hate hearing this, this is exactly why enterprises like the one where I work spend big money on SANs.  Our AFF-A320 arrays might cost as much as a house but they just work and connecting to them over FC avoids all those bottlenecks.  

Link to post

This was a really excellent video and very fun to watch. Linus' explanations were concise and understandable, even though I'm not familiar with most of the technology used. Wendell's messages were useful to see; the man is astounding. The bloopers were very funny and the editors were useless guinea pigs, as usual. ^_~

Link to post

Dear Linus and LMG staff,

Can you stop with that thing you call "Clicky Title"?

Linus once said he would improve it, but that seems to have lasted only a short period.

Up to this point, clicking on an LTT video feels like gambling to me. I don't know what to expect from the title, and I have problems searching for videos that I have already watched.

Some videos have really cool projects, but the title seems to be off by a long shot. I know you want to please the mighty YouTube algorithm,

but I don't know what to say any more. It's a hard subject to discuss.

P.S. I did not watch the video.

Link to post
36 minutes ago, Sang said:

<snip>

I mean, cry, cry... It's not their fault; it's Google and their YouTube algorithm. If you took a few seconds to look that up, you'd know it's been an issue for well over 3 years now. Even before I started watching Linus I knew about it.

Link to post
11 hours ago, Ben17 said:

the video is unlisted on youtube at the moment btw

That's normal, sometimes the videos get posted here before they are made public on youtube.

Link to post
16 hours ago, nbritton said:

 

<snip>

I think that it depends on what you're using the cards for.

 

For my HPC/CAE/FEA/CFD workloads, I've been able to hit around 80 Gbps (almost 10 GB/s) of actual data throughput when the sparse direct FEA solver is running across four nodes, 64 cores, where each node has a Mellanox ConnectX-4 dual port VPI 4x EDR Infiniband QSFP28 adapter.

 

For DATA transfers though, like I said, the FASTEST that I am able to get right now is around 16 Gbps (out of a possible 24 Gbps provided by the four 1 TB Samsung Evo 860 SATA 6 Gbps SSDs). In other words, I am hardware limited by the HW RAID array (those four drives are connected to a Broadcom MegaRAID SAS 9341-8i 12 Gbps SAS HW RAID HBA).

 

I would either need like 6 times as many SATA 6 Gbps SSDs, connected to three Broadcom 9341-8i's, and then I would still have to aggregate the RAID arrays across the three controllers in order to ensure that there's no muxing of the SATA 6 Gbps ports, or I would have to set up all 24 drives as JBOD and then use mdraid to manage the entire array. (As mentioned here -- each chiplet on the AMD EPYC processor has only two memory channels and it is the entire package (minimum four chiplets) that would actually be able to make use of all eight channels of memory, where each chiplet is still only limited to two channels.)

 

Having the 24 SATA 6 Gbps SSDs would push the usable bandwidth (assuming linear scalability) from 16 Gbps out of 24 Gbps to 96 Gbps out of 144 Gbps. 
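That projection can be reproduced with a few lines of Python. The sketch simply carries the measured ratio (16 Gbps out of 24 Gbps on four drives) over to a larger drive count, which is the "assuming linear scalability" caveat from the sentence above; the function name is hypothetical.

```python
SATA_LINK_GBPS = 6  # nominal SATA III link rate per drive


def projected_throughput(drives, measured_gbps=16, measured_drives=4):
    """Return (raw link bandwidth, projected usable bandwidth) in Gbps.

    Assumption: throughput scales linearly with drive count, i.e. the
    measured efficiency on the small array carries over unchanged.
    """
    raw = drives * SATA_LINK_GBPS
    usable = raw * measured_gbps / (measured_drives * SATA_LINK_GBPS)
    return raw, usable


raw, usable = projected_throughput(24)
print(f"24 drives: about {usable:.0f} Gbps usable out of {raw} Gbps raw")
```

In practice the controllers, the PCIe lanes and the CPU would all have to keep up for the linear assumption to hold.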

 

So it depends. (And this is further complicated by the fact that it also depends on what file system you're using to format the array, and how you are presenting the array to your network.) In my case, my array is formatted using XFS (which has tuning parameters), and the array is presented to the network using NFS over RDMA (NFS v4.1, proto=rdma, port=20049), which not all operating systems support. (Windows, I think, only supports SMB Direct. Some OSes will support iSER, but not all.) NFS also has its own performance tuning parameters.

 

My point being that I can only hit my performance numbers because I am now hardware limited. (There are additional problems when you have so many devices, because each of the Broadcom 9341-8i controllers is PCIe 3.0 x8, which means that at its interface, each takes up 64 Gbps of bandwidth already. So if you have three of them to address 24 SATA 6 Gbps SSDs, you're using 24 PCIe 3.0 lanes (192 Gbps) of bandwidth, plus you need 128 Gbps of bandwidth (16 PCIe 3.0 lanes) just to feed a SINGLE 100 Gbps 4x EDR IB port (let alone two of them -- more on that later), which already puts your total requirement at 40 PCIe 3.0 lanes before any consideration for other peripherals/interfaces, graphics cards, etc.)

 

So...it depends.

 

The internal IB bandwidth benchmark does test the 4x EDR 100 Gbps IB at 96.4 Gbps (slightly older data that I ran when I was still running SLES12 SP1).

 

I'll have to dig up the solver output file where it shows that the direct sparse FEA solver was able to hit close to 80 Gbps (almost 10 GB/s) throughput between the compute nodes.

 

So...to your points -- it REALLY depends on what you do and how you are doing it. I won't have the hardware to push the system much further than this for a while because my current MegaRAID SAS controller doesn't support NVMe 3.0 x4 devices, and of the ones that Broadcom makes and sells, none of them have RAID built into it by default (probably for really good reason), so to put four NVMe 3.0 x4 U.2 SSDs into RAID, I am going to have to use mdraid for that, and then the CPU will be the bottleneck.

 

(Unless I go with 24 SATA 6 Gbps SSDs, three SAS/SATA JBOD HBAs, and still use mdraid to deal with/manage all of those devices.)

 

And if I don't do that, then I'll be using four more half-width nodes, where each node can address up to 6 NVMe 3.0 x4 U.2 devices, and then I would have to deploy GlusterFS again to be able to push the system back to the limits.

 

re: "If you ever run across their dual port cards, just assume only one port is active because the second port is basically just a virtual function hardwired to a second physical port. If the two ports are active at the same time the ConnectX cards will behave like half duplex devices."

 

My Mellanox ConnectX-4 (MCX456A-ECAT) cards are dual VPI ports, 4x EDR IB 100 Gbps per port.

 

The reason why my ports will run at approximately HALF the speed is because the PCIe 3.0 x16 interface itself is only good for 128 Gbps. And since you won't ever be able to use all of the theoretical bandwidth that any interface provides, you're inherently going to be working with less than that anyway.

So if I actually TRIED to use both ports at the same time, they would be choked by the fact that the NIC AIC interface itself is only a PCIe 3.0 x16 interface, which is only good for up to 128 Gbps of bandwidth.

Therefore, if you really want to use two ports, you need to physically separate them out into two single port cards, where each card will be able to run at 100 Gbps (out of a possible 128 Gbps) speeds.

 

But again, now you have a problem where you've literally run out of PCIe lanes. 

 

A single Intel Xeon Platinum 8160 is only good for 48 PCIe 3.0 lanes (Source: https://ark.intel.com/content/www/us/en/ark/products/120501/intel-xeon-platinum-8160-processor-33m-cache-2-10-ghz.html).

 

To get almost 100 Gbps worth of bandwidth with SATA 6 Gbps SSDs, you need 24 of them (SSDs). (144 Gbps is around 18 PCIe 3.0 lanes.) 

 

To get 100 Gbps worth of bandwidth on my ConnectX-4 4x EDR IB, you'll need 16 PCIe 3.0 lanes. If you want to run two ports AT their full speed/full bandwidth, now you need 32 PCIe 3.0 lanes.

 

The host bus adapter for the RAID controller to manage three groups of 8 SATA 6 Gbps SSDs -- each HBA is a PCIe 3.0 x8. So that needs 24 lanes.

 

So, if you want to max out a 100 Gbps network with 24 SATA 6 Gbps SSDs, you'll need 24 + 16 PCIe 3.0 lanes, just for that. That's 40 lanes total, out of the 48 that the Intel Xeon Platinum 8160 can supply. And that assumes that the chipset and all other peripherals use zero lanes. (Or they can use up to the remaining 8 lanes.)

 

If you want to run two ConnectX-4 or ConnectX-5 cards/ports at full speed, now you need 24 + 16 + 16 PCIe 3.0 lanes or 56 lanes total, of which, only 48 can be supplied by the processor, and of that, it also assumes that the remaining peripherals consume no lanes.
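The lane budget above is easy to tabulate. This is only a rough Python sketch using the figures from this post (48 lanes from a single Xeon Platinum 8160, x8 per HBA, x16 per NIC); the function name lane_budget is hypothetical, and chipset or other peripheral lanes are deliberately left out.

```python
CPU_LANES = 48  # PCIe 3.0 lanes from one Xeon Platinum 8160, per the post


def lane_budget(hbas, nics, lanes_per_hba=8, lanes_per_nic=16):
    """Return (lanes needed, lanes left over) for a given card count.

    Assumption: every card gets its full electrical width and nothing
    else in the system consumes CPU lanes.
    """
    needed = hbas * lanes_per_hba + nics * lanes_per_nic
    return needed, CPU_LANES - needed


for nics in (1, 2):
    needed, spare = lane_budget(hbas=3, nics=nics)
    status = "fits" if spare >= 0 else "over budget"
    print(f"3 HBAs + {nics} NIC(s): {needed} lanes needed, {spare} spare ({status})")
```

With one NIC there are 8 lanes to spare; with two the configuration is 8 lanes short on a single socket.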

 

You can increase your total PCIe lane count by going to a dual-socket configuration, but now you will have to look at the block diagram so that you have as little traffic as possible going through the UPI socket-to-socket interconnect.

 

If you're using a dual AMD EPYC system, the entire system can still only provide 128 PCIe 4.0 lanes, because 128 PCIe 4.0 lanes are consumed by the Infinity Fabric system interconnect for socket-to-socket communications.

 

My point being: from my own testing and work, Mellanox CAN deliver at least up to 80% of their advertised speeds (at which point other factors also start coming into play, such as how the application was programmed (MPI implementation, version, etc.)), and until I get significantly faster hardware, I won't really be able to push the storage system tests as much (relative to my IB network) because I've already maxed out what the SATA 6 Gbps SSDs can do.

 

I have one aerodynamics CFD run that used to take 42 days to run on an Intel Core i7-3930K (6-cores, HTT disabled, 3.2 GHz stock, 3.5 GHz max all core turbo) whereas on my cluster now, with 64 cores supplied by four half-width dual socket nodes, each node having two Intel Xeon E5-2690 (v1) (8-core, HTT disabled, 2.9 GHz stock, 3.3 GHz max all core turbo), and 100 Gbps IB all around (going through a Mellanox MSB-7890 externally managed 36-port 100 Gbps IB switch), and my headnode -- that run now completes in 2 days 22 hours.
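For a sense of scale, the run times quoted above work out as follows. This is a small Python sketch using only the numbers from this post; it deliberately ignores the clock speed and CPU generation differences between the two systems, so the per-core comparison is only indicative.

```python
from datetime import timedelta

before = timedelta(days=42)           # single Core i7-3930K workstation
after = timedelta(days=2, hours=22)   # four-node, 64-core cluster

speedup = before / after              # timedelta / timedelta gives a float
core_ratio = 64 / 6                   # cluster cores vs. workstation cores

print(f"Wall-clock speedup: {speedup:.1f}x")
print(f"Core count ratio:   {core_ratio:.1f}x")
print(f"Speedup relative to core count: {speedup / core_ratio:.2f}")
```

A speedup above the core-count ratio suggests the interconnect is not holding the solver back, though the newer CPUs and memory configuration account for part of that gain.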

 

Therefore, as much as it is possible, I think that the Mellanox IB system interconnect/network that I've got is hitting the fastest speeds that it is able to hit. The rest of the performance degradation comes from either the software/application stack, or the IB performance tuning parameters, the RDMA performance tuning parameters, the NFS tuning parameters, and also the XFS tuning parameters. (I've left everything at the defaults because my MPI messages vary in both size and quantity during the course of a run, and also depending on what I am running (application-wise) and the nature of the problem that I am trying to solve.)

 

But as far as I can tell, I'm now just running into other hardware bottlenecks whereby my network is no longer the reason why the solution can't run any faster.

48425872_10100138772895339_7575031230089920512_n.jpg

Link to post
10 hours ago, Sang said:

<snip>

You want them to do something that will make them earn less money, and maybe force them to cut employee pay, lay some of them off, or have fewer resources to put into future videos, because you don't like how they worded a title?

Link to post

For those who might be interested, these are the NFSoRDMA data transfer rate results for four HGST 6 TB 7200 rpm SATA 6 Gbps HDDs in RAID0 on a Broadcom/Avago/LSI MegaRAID SAS 9341-8i 12 Gbps SAS HW RAID HBA on an Intel Core i7-4930K (6-core, 3.4 GHz base clock, 3.9 GHz max turbo, 3.6 GHz max all core turbo, HTT disabled), with 64 GB (should be an 8x 8 GB configuration I think) DDR3-1600 Crucial RAM, on an Asus P9X79-E WS motherboard.

 

Note that with four conventional, mechanically rotating HDDs, the system (my cluster headnode) is pushing almost 1 GB/s data transfer speeds (8 Gbps) without using SSDs.

 

This is what having a high bandwidth, ultra low latency, RDMA enabled storage gets you. If you're transferring data over "conventional", non-RDMA TCP/IP or UDP, the extra overhead eats into your transfer speeds.

 

This job that I am currently running has been going for around 11320 minutes (188.6666 hours or 7 days 20 hours 40 minutes) so far, and it's maybe about half way done or so. Something like that. (It's a computational fluid dynamics (CFD) project that I am/the system is currently running.)

 

So, my point is that if Linus and LMG want to actually be able to take advantage of their NVMe drives, they need to be running something that supports RDMA (on the NIC itself), they need to be running an operating system that can support RDMA (like CentOS), and they also need to be running NVMeoF.

 

If I can get almost 1 GB/s (8 Gbps) speeds from "spinning rust", imagine what NVMeoF running iSER with a NIC that supports RDMA, will be able to provide to your Windows Server 2016+ clients.

 

Please note that I have done nothing extra or special (in the way of performance tuning on the IB interface, nor NFS, nor XFS) in order to get these numbers. (The storage and network loads vary too much to be able to tune it.)

All of this is using stuff that's "out of the box". I don't even use Mellanox's MLNX_OFED linux drivers for my ConnectX-4 (MCX456A-ECAT) cards.

 

And again, I got into the IB "biz" because of the video that Linus did a while ago, showing that you can now pick up the hardware at a fraction of what it would normally cost at retail. I'm not just running benchmarks on it; I'm actually using it in my production micro-cluster environment.

Capture.PNG

Link to post
On 2/5/2020 at 11:36 PM, spartaman64 said:

you want them to do something that will make them earn less money and maybe force them cut employee pay/lay some of them off or have less resources to put into future videos because you dont like how they worded a title?

Interpret it however you like, and sorry that my wording might have misrepresented my intention with my broken English (same level as Dennis, or maybe even below him, lol),

but I still wish the best for LMG. 

Link to post

P.S.

 

I'm not sure if Linus and/or his team and/or Wendell covered it, but I would recommend that at least one member of LTT's staff/team read this Anandtech write-up about benchmarking enterprise NVMe drives, which covers async limitations in Linux where they use fio.

 

https://www.anandtech.com/show/15491/enterprise-nvme-hynix-samsung-dapustor-dera

 

I think that those limitations may also apply to your server deployment, as far as configuration/deployment notes go, so you might want to check that out. (Especially the part where they talk about io_uring.)

Link to post