
PROXMOX - Rebuilding ZFS RAID rpool After Disk Failure

PROXMOX is a fantastic free and open Linux KVM hypervisor (with an optional subscription - not required), but it's not without its caveats. If, when you installed PROXMOX, you opted to create a ZFS rpool for the OS - be that a mirror (RAID1), a striped mirror (RAID10) or one of the parity layouts (RAIDZ-1/2/3) - you'll find the installer creates more than just a ZFS partition on each disk. It creates two additional partitions: a roughly 1MB BIOS boot partition and a 512MB EFI system partition. Here's what lsblk shows for one of my boot disks:

nvme0n1     259:6    0 447.1G  0 disk 
├─nvme0n1p1 259:7    0  1007K  0 part 
├─nvme0n1p2 259:8    0   512M  0 part 
└─nvme0n1p3 259:9    0 118.7G  0 part 

Now these p1 & p2 partitions play an important role in allowing the hypervisor to boot: they hold the bootloader. Without them the BIOS (or UEFI firmware) has no way of knowing where the OS is, even though it's there in its entirety on p3.
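If you want to see which partition is which on your own system, sgdisk (the same tool we'll be using later) can print the partition table along with each partition's type name - assuming your disk is /dev/nvme0n1 as in my example:

sgdisk -p /dev/nvme0n1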

 

This is where ZFS has its caveat. If you're like me and you've suffered a few disk failures, you'll be familiar with this zpool status output:

  pool: rpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: scrub repaired 0B in 00:00:44 with 0 errors on Sun Jun 11 00:24:48 2023
config:

        NAME                     STATE     READ WRITE CKSUM
        rpool                    DEGRADED     0     0     0
          mirror-0               DEGRADED     0     0     0
            nvme0n1p3            ONLINE       0     0     0
            4849641676990992824  UNAVAIL      0     0     0  was /dev/nvme3n1p3

errors: No known data errors

In this scenario one SSD in a mirror used to boot the server has failed.
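If you're ever unsure which physical drive that dead GUID corresponds to, the symlinks under /dev/disk/by-id include each drive's model and serial number, which you can match against the label on the disk itself; smartctl (from the smartmontools package, which you may need to install first) will likewise report the serial and health of any disk you point it at:

ls -l /dev/disk/by-id/ | grep nvme
smartctl -a /dev/nvme0n1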

 

Now ZFS makes it pretty easy to recover from this: just replace the SSD and resilver with the appropriate command. But that will only restore partition #3.

nvme0n1     259:6    0 447.1G  0 disk 
├─nvme0n1p1 259:7    0  1007K  0 part 
├─nvme0n1p2 259:8    0   512M  0 part 
└─nvme0n1p3 259:9    0 118.7G  0 part
nvme3n1     259:0    0 447.1G  0 disk
└─nvme3n1p3 259:13   0 118.7G  0 part

At this stage ZFS is happy, it's hunky-dory, it's walking on sunshine... but what would happen if we were to lose nvme0n1? Well, everything would keep running fine - until you shut the server down or rebooted. At that point it doesn't matter that ZFS mirrored your boot pool, because ZFS never copied the bootloader. Your data is still there - it exists - but you'd have to recover the bootloader partitions before the server could boot again.

 

To avoid ever being in that position, what I want to demonstrate are the steps required to re-create and initialize the bootloader partitions, with a real example.

 

Before you resilver the array you'll want to run these commands, where nvme0n1 is the healthy disk and nvme3n1 is the new, blank replacement:

sgdisk /dev/nvme0n1 -R /dev/nvme3n1
sgdisk -G /dev/nvme3n1

Be careful to replicate any capitalized letters such as -R and -G, as capital options can perform entirely different operations than their lowercase counterparts.

 

What this has done is take the new disk from this:

nvme3n1     259:0    0 447.1G  0 disk

 

To this:

nvme3n1     259:0    0 447.1G  0 disk 
├─nvme3n1p1 259:1    0  1007K  0 part 
├─nvme3n1p2 259:2    0   512M  0 part 
└─nvme3n1p3 259:13   0 118.7G  0 part

Now note that all this has done is copy the partition layout. We still need to write the actual data over to the new disk - that's what the resilver is for.
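If you want to double-check the clone before resilvering, lsblk will happily take both devices at once so you can compare the two tables side by side:

lsblk /dev/nvme0n1 /dev/nvme3n1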

 

From here we can start resilvering our ZFS array:

zpool replace -f rpool 4849641676990992824 /dev/nvme3n1p3
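If you prefer the stable names under /dev/disk/by-id, you can hand zpool replace that path to the new partition instead - the name below is purely a made-up placeholder, yours will contain your drive's actual model and serial number:

zpool replace -f rpool 4849641676990992824 /dev/disk/by-id/nvme-EXAMPLE_MODEL_SERIAL-part3   # placeholder by-id name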

And check in on how the resilver is progressing:
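zpool status rpool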

  pool: rpool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Oct 14 17:31:37 2023
        37.3G scanned at 764M/s, 24.5G issued at 502M/s, 37.3G total
        24.7G resilvered, 65.73% done, 00:00:26 to go
config:

        NAME                       STATE     READ WRITE CKSUM
        rpool                      DEGRADED     0     0     0
          mirror-0                 DEGRADED     0     0     0
            nvme0n1p3              ONLINE       0     0     0
            replacing-1            DEGRADED     0     0     0
              4849641676990992824  UNAVAIL      0     0     0  was /dev/nvme3n1p3/old
              nvme3n1p3            ONLINE       0     0     0  (resilvering)

errors: No known data errors

Once this completes, that's p3 taken care of - but p1 & p2 still aren't configured.

 

To do that, we first need to know whether PROXMOX is being booted with GRUB or SYSTEMD-BOOT. The easiest way to find out for sure is to check which boot menu you see when the server is turned on.

 

One of these two screens should look familiar to you:

 

[Screenshot: the blue GRUB boot menu]

 

[Screenshot: the black systemd-boot menu]

 

If you see the blue screen then your server boots using GRUB.

 

If you see a black screen then your server boots using SYSTEMD-BOOT.

 

You may also check from the OS without rebooting using:

efibootmgr -v

 

This will return a lot of information but you should see a line resembling one of the following:

Boot0005* proxmox       [...] File(\EFI\proxmox\grubx64.efi)
Boot0006* Linux Boot Manager    [...] File(\EFI\systemd\systemd-bootx64.efi)

These should be self-explanatory: if the entry points at grubx64.efi you're booting with GRUB, and if it points at systemd-bootx64.efi you're booting with SYSTEMD-BOOT. (Note that efibootmgr only works on systems booted in UEFI mode - on a legacy BIOS install it will just complain that EFI variables aren't supported, which itself tells you you're on GRUB.)
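On installs where proxmox-boot-tool manages the ESPs you can also just ask it - it reports whether the system booted via legacy BIOS or UEFI and which partitions it's keeping in sync (on older GRUB-only installs it may simply complain that it isn't configured, which is an answer in itself):

proxmox-boot-tool status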

 

From here you will use the corresponding command to configure your bootloader:

 

GRUB:

grub-install /dev/nvme3n1

 

SYSTEMD:

proxmox-boot-tool format /dev/nvme3n1p2
proxmox-boot-tool init /dev/nvme3n1p2
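If you went the proxmox-boot-tool route, it's worth confirming the new ESP was picked up: status should now list its UUID, and a refresh will (re)copy the current kernels and initrds onto it (init normally does this for you already, so the refresh is just belt and braces):

proxmox-boot-tool refresh
proxmox-boot-tool status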

 

And you're done.

 

If you're so inclined as to test whether this actually worked, you can shut down the server and remove the healthy drive you copied the boot data from. Turn the server on and see if it boots from the new drive you just partitioned and resilvered. If everything went well the server should boot as normal.
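A gentler sanity check, if you'd rather not pull a drive: make sure the pool reports both mirror members ONLINE, and (on UEFI systems) that the firmware has boot entries pointing at both disks:

zpool status rpool
efibootmgr -v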

 

I hope this helped.


Love it.

 

My Proxmox server is currently running a pair of WD Blue 1TB SSDs in a mirrored ZFS pool as the boot drive, so there's a chance I find this saving my life in the future.



16 minutes ago, Crunchy Dragon said:

Love it.

 

My Proxmox server is currently running a pair of WD Blue 1TB SSDs in a mirrored ZFS pool as the boot drive, so there's a chance I find this saving my life in the future.

I've needed this information more than once now, so I'm just spreading the good word that there's a solution and it's not too painful - just annoying to memorize.

