Holiday Hypervisor Server Upgrade

9 hours ago, Sir Asvald said:

Yes, I'd like to know about your other servers.

I'm currently in the process of writing up a new build log of a 1U water-cooled server. Not quite as in-depth as @Windows7ge's build logs though. I also have a few water-cooled servers done already.

Still, I'm surprised at the lack of a server modding community anywhere on the web. There really aren't that many people case-modding servers, even though there's so much potential!


17 hours ago, RollinLower said:

Not quite as in-depth as @Windows7ge's build logs though.

Force of habit. I unintentionally write my build-logs the same way I write my guides and tutorials. It does make for good personal reference material though. I will go back and view my own posts just to show myself how I did something a long while back when I've forgotten (which is more frequent than I'd like to admit). 😆

 

17 hours ago, RollinLower said:

Still, I'm surprised at the lack of a server modding community anywhere on the web. There really aren't that many people case-modding servers, even though there's so much potential!

I have, of all things, an IBM eServer iSeries Model 270 sitting in storage, complete with all the bells and whistles. The downside is that cold storage killed the hardware, but it's an absolutely magnificent chassis made with seriously thick plate steel, and I'd love to mod it into a home theater centerpiece/PC. The other catch is that it's so old and proprietary I'll have to rivet or otherwise mount a standard motherboard tray in it and make a bracket for the PSU, because it too is very much not standard.

 

Back on topic though: the cables are in! And I have to say I hate that they're indistinguishable from their opposite counterparts.

 

P1000303.JPG

 

SFF-8087 breakout cables come in more than one shape/color/size. There are forward breakout cables that look 100% identical to this one, and there are no markings at all saying this is a reverse breakout cable. That explains the 3-star rating these had, with many one-star reviews from people complaining that the cables "Don't work.", "I bought three of them, all broken", and "I connected them to my SFF-8087 HBA, they don't work!"

 

Of course they didn't work. You were using them wrong. 😆

 

Referencing my last update about SFF-8087 breakout cables: I showed that, using my red breakout cables, we saw no output on the screen when I connected a drive. Forward cables cannot be used in reverse, and likewise reverse cables cannot be used forwards.

 

So this is the current setup with the reverse cables:

 

P1000304.JPG

 

I bought them shorter because once installed the run won't be very long. Now, unlike last time, when I connect an HDD:

 

P1000307.JPG

 

It registers that an HDD was connected and it lets us use it. :old-grin:

 

Next problem, and this is actually a problem I distinctly remember talking to @leadeater about years ago, is the dang heat-shrink design they use on these cables:

 

P1000308.JPG

 

I could not source a right-angle SFF-8087 reverse breakout cable. I have to fit two in here, and they will interfere with the fan behind them. So I did what any other sensible person would do in this situation: I butchered the cable.

 

P1000309.JPG

 

Goodbye, any possibility of a refund! In hindsight I should have done this with those SFF-8087 cables from years ago. Oh well. This will give the SFF-8087 connector the flexibility it needs to make that turn without clipping the hot-swap fan behind it.

 

As of right now the server is still running fine. No hiccups of any kind. For that reason I think we're quickly approaching installation time. I'm currently waiting on one part to arrive, as now is as good a time as any to install it. Once it does, we can start the upgrade process.

 

I can tell you right now it's going to be fun and a PITA at the same time. 😅


The part has arrived but I'm flabbergasted that this insignificant component has just stuffed up the whole operation.

 

 P1000311.JPG

 

This is an LSI 9207-8e HBA. All it does is let you plug in more 6Gb/s storage drives. Of everything I've tested in this server, even old legacy hardware that wasn't part of the plan, everything has been incredibly stable, yet this simple component is causing the server to not POST. What's irritating is that when I plug it into another computer it behaves just fine. I have no idea what the problem is, but I have to come up with something before we can move forward. It's frustrating.


After a lot of BS and fighting I finally got the LSI 9207-8e working in the server and showing up in the OS.

 

ae01a05c4c0d54c288d1b974c66864e3a81db805.png

 

So what I had to do was update both the firmware & BIOS of this HBA. The revision numbers of each didn't change by much, which tells me the differences are small, but for whatever reason that's what was needed to get the card to show up.
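
The update itself goes through Broadcom/LSI's sas2flash utility. A rough sketch of the procedure, with the firmware and BIOS image names as placeholders for whatever the 9207-8e package actually ships:

sas2flash -listall                               # confirm the card is seen and note the current FW/BIOS versions
sas2flash -o -f 9207-8e_fw.bin -b mptsas2.rom    # -f flashes the firmware image, -b the option ROM/BIOS image
sas2flash -listall                               # verify the new versions stuck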

 

Outside of doing extended testing on this new HBA, I think we're ready to start the upgrade. It's going to start out slow tonight though, by simply disabling new tasks for BOINC on the new server. I configured it to hold up to half a day's worth of tasks, so by early tomorrow morning we should be ready to get our hands on some hardware! :old-grin:
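
For reference, "no new tasks" is a per-project toggle that can also be flipped from the shell; the project URL below is a placeholder:

boinccmd --project https://project.example.org/ nomorework    # finish what's queued, fetch nothing new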

 

For things to transition as seamlessly as possible though, there's a series of operations that need to take place in a specific order.

  1. All CTs & VMs need to be turned off and auto-startup disabled.
  2. All CTs & VMs need to be backed up to a tertiary file server (see the sketch after this list).
  3. The new server needs to be cleared of all CTs & VMs.
  4. All relevant IP addresses need to be noted down.
  5. Both systems can then be shut down.
  6. After the hardware swap, the slot holding the PCIe SSDs needs to be bifurcated.
  7. Now the operating system can be loaded.
  8. Relevant IP addresses need to be changed back to what they were before.
  9. ZFS pools need to be imported and their configurations added to the WebUI.
  10. vfio-pci configured for the relevant hardware devices.
  11. Backup server share added to the WebUI config for CT & VM import.
  12. Import all CTs & VMs. Start them one by one and re-enable auto-startup.
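
Steps 2 and 12 are plain vzdump and restore operations. A minimal sketch of what that looks like, with the VMIDs, storage name, and archive names all as placeholders:

vzdump 100 101 --storage backup-nfs --mode stop --compress zstd                         # step 2: dump stopped CTs/VMs to the backup share
pct restore 100 /mnt/pve/backup-nfs/dump/vzdump-lxc-100-2022_01_05-22_00_00.tar.zst     # step 12: re-import a CT on the new install
qmrestore /mnt/pve/backup-nfs/dump/vzdump-qemu-101-2022_01_05-22_30_00.vma.zst 101      # step 12: re-import a VM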

I've probably forgotten something somewhere, but that's tomorrow's plan. I'll update everyone once it's over!


Yesterday was a rough day getting the hardware swapped over but it's finally done.

 

Screenshot from 2022-01-06 13-44-43.png

 

My oh my, where to begin... I did everything I could to make the transition as seamless as possible, but it still took me from ~9 AM to 11 PM to get everything moved over, and that was partially due to various problems I ran across.

 

So first and foremost I started breaking down the test setup:

 

P1000321.JPG

 

Looking over at the current setup this is what I was working with:

 

P1000325.JPG

 

Quick specs:

2x Intel Xeon E5-2698 v3

8x 64GB (0.5TB total) NEMIX DDR4 RDIMM ECC 2400MHz (running at 2133MHz due to CPU limitation)

Supermicro X10DRi-T-O motherboard

CT & VM storage by way of a RAID1 pair of 1.92TB Micron 7300 PRO NVMe SSDs

Quad 10Gbit SFP+

3x LSI 9207-8i (fears realized, more on that later 🙄)

 

So, first things first: after pulling the old motherboard and getting the new one installed, I didn't want to introduce all of the PCIe components at once in case I ran into issues; I wanted to know what the cause was right away.

 

To start, I installed the two Micron 7300 PRO NVMe SSDs:

 

P1000328.JPG

 

They're on a bifurcation board under the tilted fan. They get pretty toasty, so I wanted active cooling on them. I went into the BIOS, switched the slot from x16 to x4x4x4x4, went into the OS, and the drives showed up; I was able to import the pool without incident.
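
With the slot split, bringing the pool back is a single command (pool name taken from the status output below):

zpool import flash    # scans the attached devices for the mirror's members and brings the pool online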

 

  pool: flash
 state: ONLINE
  scan: resilvered 659G in 00:25:47 with 0 errors on Fri Dec 17 15:36:26 2021
config:

	NAME                                             STATE     READ WRITE CKSUM
	flash                                            ONLINE       0     0     0
	  mirror-0                                       ONLINE       0     0     0
	    nvme-Micron_7300_MTFDHBG1T9TDF_20222880AA1F  ONLINE       0     0     0
	    nvme1n1                                      ONLINE       0     0     0

errors: No known data errors

I think one of the drives is showing up the way it is because I had to replace one of them before. An unfortunate, very early death for an enterprise-grade SSD.
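
If the mismatched label ever bothers me, it can usually be cleaned up by re-importing the pool against the by-id device paths; a quick sketch, assuming the pool can be taken offline for a moment:

zpool export flash
zpool import -d /dev/disk/by-id flash    # members get re-listed under their persistent /dev/disk/by-id names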

 

After this I introduced the four 10Gig NICs:

 

P1000329.JPG

 

These are generic dual-port BCM57810S SFP+ NICs. I wanted to pass them through to the VM I use them for, which wasn't the original configuration, but first:

  1. These NICs REALLY hated the vfio-pci driver. The VM would not boot.
  2. These NICs would just not behave predictably when I disabled vfio-pci and blacklisted their driver instead.

All four interfaces host DHCP servers, and two of the four refused to lease IP addresses.

So, back to the original configuration I was using: let Proxmox handle them and just pass virtual NICs to the VM. Everything worked again.
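
"Letting Proxmox handle them" just means putting each port in a host bridge and handing the VM virtio NICs; the VMID and bridge name below are placeholders:

qm set 100 --net1 virtio,bridge=vmbr1    # vmbr1 would be defined on top of one of the 10Gig ports in /etc/network/interfaces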

 

Next up are the LSI 9207-8i cards.

 

P1000331.JPG

 

...And what I feared was realized: these HBAs caused the same POST hang-up that the LSI 9207-8e caused. I made a post over on the Level1Techs forum about the issue, and some other server-savvy people helped me narrow it down to the HBA failing to initialize; the system freezes right when the HBA is supposed to initialize. So I applied the same fix to these two HBAs that I applied to the first: updated their firmware & BIOS. This worked, so the internal HBAs are installed.

 

After this I added in the LSI 9207-8e, which we had already tested, plus one more single-port SFP+ card, a Mellanox ConnectX-3 CX311A, and tossed the fan on top of everything:

 

P1000334.JPG

 

I should mention that all the while I was installing these expansion cards, Proxmox kept changing the name of the NIC for the management network. So every time I added a PCIe device, I kept having to go back into the console and rename the NIC back to what it was supposed to be so I could access the WebUI. Very annoying; no idea why it kept doing that, but they're all installed now so it's not an issue worth investigating IMO.
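
My best guess is that it's the predictable-interface-name scheme reacting to the PCIe topology shifting every time a card goes in. One way to pin the management NIC to a fixed name regardless of slot order is a systemd .link file matched on its MAC address (the MAC below is a placeholder):

cat > /etc/systemd/network/10-mgmt-nic.link <<'EOF'
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=mgmt0
EOF
# takes effect on the next boot; /etc/network/interfaces would then reference mgmt0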

 

From here I wanted to pass the LSI 9207-8e through to a VM. So, as I explained during testing, I intercepted the default kernel driver and told the system to hand the card to vfio-pci:

d8:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087] (rev 05)
	Subsystem: Broadcom / LSI 9207-8e SAS2.1 HBA [1000:3040]
	Flags: bus master, fast devsel, latency 0, IRQ 926, NUMA node 1, IOMMU group 179
	I/O ports at f000 [size=256]
	Memory at fbe40000 (64-bit, non-prefetchable) [size=64K]
	Memory at fbe00000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at fbd00000 [disabled] [size=1M]
	Capabilities: [50] Power Management version 3
	Capabilities: [68] Express Endpoint, MSI 00
	Capabilities: [d0] Vital Product Data
	Capabilities: [a8] MSI: Enable- Count=1/1 Maskable- 64bit+
	Capabilities: [c0] MSI-X: Enable+ Count=16 Masked-
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [1e0] Secondary PCI Express
	Capabilities: [1c0] Power Budgeting <?>
	Capabilities: [190] Dynamic Power Allocation <?>
	Capabilities: [148] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: vfio-pci
	Kernel modules: mpt3sas
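
For anyone curious, a minimal runtime sketch of one way to hand this single device to vfio-pci, using the address from the output above:

modprobe vfio-pci
echo 0000:d8:00.0 > /sys/bus/pci/drivers/mpt3sas/unbind             # detach the stock driver if it already grabbed the card
echo vfio-pci > /sys/bus/pci/devices/0000:d8:00.0/driver_override
echo 0000:d8:00.0 > /sys/bus/pci/drivers_probe                      # re-probe; vfio-pci claims it now

The persistent equivalent is usually an "options vfio-pci ids=1000:0087" line plus "softdep mpt3sas pre: vfio-pci" in /etc/modprobe.d/, but that ID also matches the 9207-8i cards, so an ID-based bind would grab all of them.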

 

Then I added this PCIe device to the VM's config:

 

Screenshot from 2022-01-06 14-38-42.png
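
The same thing from the CLI, with the VMID as a placeholder (pcie=1 assumes the guest is using the q35 machine type):

qm set 100 --hostpci0 d8:00.0,pcie=1    # hand the whole HBA at d8:00.0 to the guest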

 

Now when we boot this VM the HBA shows up as a directly connected device:

01:00.0 Serial Attached SCSI controller [0107]: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 [1000:0087] (rev 05)
	Subsystem: Broadcom / LSI 9207-8e SAS2.1 HBA [1000:3040]
	Physical Slot: 0
	Flags: bus master, fast devsel, latency 0, IRQ 16
	I/O ports at d000 [size=256]
	Memory at c2040000 (64-bit, non-prefetchable) [size=64K]
	Memory at c2000000 (64-bit, non-prefetchable) [size=256K]
	Expansion ROM at c2100000 [disabled] [size=1M]
	Capabilities: <access denied>
	Kernel driver in use: mpt3sas
	Kernel modules: mpt3sas

And so do the disks inside the disk shelf connected to it (QEMU HARDDISK is a virtual disk assigned to the VM in the normal way):

Disk model: QEMU HARDDISK   
Disk model: ST10000NM0086-2A
Disk model: ST10000NM0086-2A
Disk model: ST10000NM0086-2A
Disk model: ST10000NM0086-2A
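
A listing like that can be pulled inside the guest with a one-liner, for example:

lsblk -d -o NAME,MODEL,SIZE    # one line per physical disk, model string included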

 

One thing that I am actually quite delighted to see is that putting the system inside a box had little to no impact on the maximum operating temperatures:

 

Screenshot from 2022-01-06 14-51-38.png

 

This will actually give me room to optimize the fan curve for a little bit of quieter operation.

 

At this point I think we're done unless something goes wrong soon and I have to start troubleshooting. Thanks for tagging along. I'll tag anybody who wants to be notified about a couple of follow-up server builds I'll be doing. :old-grin:

