I would recommend trying a low-latency (1,000 Hz) kernel. The default generic kernel in Ubuntu is 250 Hz. From a raw throughput standpoint 250 or even 100 Hz is better because the system takes fewer timer interrupts, but 1,000 Hz can offer better stability because it gives the scheduler more opportunities to check in on the needs of the system as a whole. I'd also recommend enabling x2APIC mode and MSI-X; these give you more interrupt vectors to spread across your devices, which matters with this many NVMe drives.
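If you want to verify what you're currently running, something like this (standard Ubuntu paths; the lowlatency flavour is a normal package) shows the timer frequency and whether x2APIC came up:

```shell
# Timer frequency the running kernel was built with (250 on generic, 1000 on lowlatency)
grep 'CONFIG_HZ=' /boot/config-"$(uname -r)"
# Check whether the kernel enabled x2APIC at boot
dmesg | grep -i x2apic
# Install the lowlatency kernel flavour on Ubuntu
sudo apt install linux-lowlatency
```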
I strongly recommend ZFS (for my work I've done extensive comparison testing against ext4 and XFS), but I must urge you to use the latest 0.8.x branch because it's much faster; in some cases it offers 2x the performance of the 0.7.x branch. It's available natively in Ubuntu 19.10, and via PPA in Ubuntu 16.04 and 18.04: https://launchpad.net/~jonathonf/+archive/ubuntu/zfs
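If you go the PPA route on 16.04/18.04, it's just the usual add-apt-repository dance (exact package names may vary slightly by release; the DKMS module builds against your running kernel):

```shell
sudo add-apt-repository ppa:jonathonf/zfs
sudo apt update
sudo apt install zfs-dkms zfsutils-linux
```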
In the video you said nothing about NUMA domains, and you are almost certainly hitting inter-NUMA transfer bandwidth limitations. You need to ensure everything is pinned to the same NUMA domain(s) the NVMe devices are attached to; you may even be better off segregating the entire storage subsystem onto a single NUMA domain.
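You can check which node each drive hangs off via sysfs, then pin with numactl (a sketch; node numbers and the workload name are placeholders you'd adjust to your topology):

```shell
# Show the NUMA node each NVMe controller is attached to (-1 means no affinity reported)
for d in /sys/class/nvme/nvme*; do
  printf '%s: node %s\n' "$d" "$(cat "$d"/device/numa_node)"
done
# Run a workload with CPUs and memory pinned to node 0 (adjust to match your drives)
numactl --cpunodebind=0 --membind=0 ./your_workload
```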
From a ZFS tuning perspective, a 128 KB record size offers the best overall performance for a mixed workload, but 64 KB is a really good choice if you work with small files and random access patterns. The qcow2 format uses 64 KB as its default cluster size, so 64 KB is also a good fit if the primary use case is VM storage. For 24 NVMe drives I would recommend 4x 6-disk raidz (or raidz2) vdevs; this will be nearly as fast as striped mirrors but offer a lot more usable capacity. Finding the right ashift value is not straightforward, so it's best to simply performance test each value (e.g. 9, 12, 13). Also set atime=off.
zpool create -O recordsize=64k -O compression=on -O atime=off -o ashift=9 data \
  raidz /dev/nvme{0..5}n1 raidz /dev/nvme{6..11}n1 raidz /dev/nvme{12..17}n1 raidz /dev/nvme{18..23}n1
For benchmarking I recommend the following:
fio --randrepeat=1 --ioengine=libaio --direct=0 --gtod_reduce=1 --name=test --filename=test --bsrange=64k-192k --numjobs=16 --group_reporting=1 --random_distribution=zipf:0.5 --norandommap=1 --iodepth=24 --size=32G --rwmixread=50 --time_based --runtime=90 --readwrite=randrw
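And if you do sweep ashift values, a throwaway loop along these lines works (destructive sketch: it creates and destroys the pool each pass, so point it at a scratch device only; /dev/nvme0n1 and the pool name are placeholders):

```shell
# WARNING: destroys and recreates the pool on each iteration. Scratch device only!
for ashift in 9 12 13; do
  zpool create -f -o ashift="$ashift" -O atime=off -O compression=on scratch /dev/nvme0n1
  fio --name=ashift-"$ashift" --directory=/scratch --ioengine=libaio --direct=0 \
      --readwrite=randrw --bs=64k --size=8G --time_based --runtime=60 --group_reporting=1
  zpool destroy scratch
done
```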