TrueNAS Core 13 Failing Disk Replacement - Sanity Check

Greetings…

I have a TrueNAS Core v13 setup that has a pool with a drive that is failing with multiple reallocated sectors. I will be replacing that drive and have been reading the TrueNAS docs and some forum posts here to determine my proper steps. I think I understand the basic steps, but thought I would outline them here and ask if someone could sanity check me and ensure I’m on the right path.

Setup: My TrueNAS Core machine is virtualized within Proxmox and has one RAIDZ1 pool consisting of four 4 TB Toshiba N300 drives, all made available to the VM via PCI Passthrough. This setup has been working flawlessly for nearly two years. I started seeing rare errors across the SATA controller about a month ago, but infrequently enough to just monitor it. Now (literally this morning), SMART tells me the drive is failing due to too many reallocated sectors. The pool is still functional and accessible - just that the drive in question has had too many sector reallocations. Note, too, that I have no spare SATA ports in this system, so I cannot install the new drive as a hot spare.
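
For anyone who wants to check the same thing from a shell, here is a minimal Python sketch of reading the attribute table that `smartctl -A` prints and flagging any attribute whose normalized value has dropped to or below its threshold. The sample text and its numbers are made up for illustration (they are not copied from my drive), and the function names are my own.

```python
# Illustrative excerpt in the column layout printed by `smartctl -A`.
# The values below are invented for the example, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   001   001   050    Pre-fail  Always   FAILING_NOW 4120
  9 Power_On_Hours          0x0032   058   058   000    Old_age   Always       -       17012
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""


def parse_smart_attributes(text):
    """Return {attribute_name: fields} for every attribute row in the table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # this skips the header row and any blank lines.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = {
            "id": int(fields[0]),
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "when_failed": fields[8],
            # The raw value can contain spaces on some drives, so keep the rest.
            "raw": " ".join(fields[9:]),
        }
    return attrs


def failing_attributes(attrs):
    """Names of attributes whose normalized value is at or below a nonzero threshold."""
    return sorted(
        name
        for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    )


if __name__ == "__main__":
    print(failing_attributes(parse_smart_attributes(SAMPLE)))
```

In real use the text would come from running `smartctl -A` against the device rather than from the hard-coded sample.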

So, with that groundwork, the steps I believe I must take so far are as follows:

  1. Take the failing (failed) drive offline within the TrueNAS pool.
  2. Shut the TrueNAS VM down.
  3. Shut down Proxmox (I do not have SATA hot-swap enabled on the host).
  4. Physically replace the failed drive with the new drive, using the drive serial number to confirm I am pulling the correct one.
  5. Restart the system
  6. Pass through the new drive to the TrueNAS VM
  7. Within the TrueNAS GUI, confirm that the OFFLINE disk is now shown as REMOVED.
  8. Within the affected pool, go to the Pool Status screen and REPLACE the removed disk with the newly installed disk.
  9. Verify that TrueNAS now shows the pool being resilvered and eventually back online.
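
For step 9, a minimal Python sketch of how the `zpool status` text could be checked from a shell instead of the GUI. It only reads the `pool:`, `state:` and `scan:` header lines. The pool name `tank`, the device labels and the whole sample output are hypothetical placeholders, and the function names are my own.

```python
# Hypothetical sample in the general layout of `zpool status` output.
# Pool name, device labels and the timestamp are placeholders.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
  scan: resilver in progress since Wed Jan  3 19:02:11 2024
config:

    NAME             STATE     READ WRITE CKSUM
    tank             DEGRADED     0     0     0
      raidz1-0       DEGRADED     0     0     0
        disk-a       ONLINE       0     0     0
        disk-b       ONLINE       0     0     0
        disk-c       ONLINE       0     0     0
        replacing-3  DEGRADED     0     0     0
          old-disk   OFFLINE      0     0     0
          new-disk   ONLINE       0     0     0  (resilvering)

errors: No known data errors
"""


def summarize_zpool_status(text):
    """Pick out the first pool/state/scan header lines from the status text."""
    summary = {"pool": None, "state": None, "scan": None}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.strip().partition(":")
        if sep and key in summary and summary[key] is None:
            summary[key] = value.strip()
    return summary


def resilver_state(summary):
    """Classify the scan line as 'running', 'finished' or 'none'."""
    scan = summary["scan"] or ""
    if scan.startswith("resilver in progress"):
        return "running"
    if scan.startswith("resilvered"):
        return "finished"
    return "none"


if __name__ == "__main__":
    info = summarize_zpool_status(SAMPLE_STATUS)
    print(info["pool"], info["state"], resilver_state(info))
```

The pool is back to normal once the state reads ONLINE and the scan line reports a completed resilver rather than one in progress.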

What might I be missing, what do I have wrong, what else do I need to consider?

Many thanks in advance for the sanity check.

-Dave

Looks good…

Perhaps swap steps 6 and 7.

Step 4 is two phases… remove and replace. As you noted, verify and track serial numbers.

Guide is here:

Thank you so much for your feedback. I definitely saw and took quite a bit from that guide, but what had me a bit concerned was interpreting the steps for having no hot spare and not having the replacement drive available when the old one was removed.

Add to that the fact it is virtualized and I definitely wanted a sanity check on the steps.

Should I defer bringing up the TrueNAS VM until I complete the PCI passthrough of the new disk, or let it come up and allow it to see the drive as it is made available when the PCI passthrough is done on the host?

The real irony here is that I bought these Toshiba drives because they were intended for NAS setups (I believe they’re the ones used by Synology), yet they have only about 17,000 hours on them. I have other older plain desktop drives in this box that have been around for years LOL.

The fact that you are asking this question is concerning. You don’t pass the drives through, you pass the controller through to the VM. You shouldn’t have to do anything except hook up the replacement drive and it should be available to TrueNAS in the GUI if your controller is passed through correctly.

Hardware specs would be helpful here.


Thanks for your feedback.

It has been about two years since I built this system, so I will have to go back to my notes and verify the setup to see if what you suggested is true.

My immediate recollection when I started examining this problem was that I passed the individual drives through.

When I get off work I will reexamine my setup.

I confirmed my original suspicion. The four drives for my NAS are discretely passed through to the VM (by specifying the relevant /dev/disk/by-id identifier). A separate drive is passed to a different, unrelated VM.
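
To make the serial-number matching concrete, here is a rough Python sketch of picking the whole-disk /dev/disk/by-id entry for a given serial and building the matching Proxmox `qm set` argument list. The entry names, the serials, the VM ID `100` and the `scsi1` slot are all hypothetical placeholders, not my real configuration, and the sketch assumes the usual `ata-<model>_<serial>` naming for SATA disks.

```python
BY_ID_PREFIX = "/dev/disk/by-id/"

# Hypothetical directory listing; real names would come from
# os.listdir("/dev/disk/by-id") on the Proxmox host.
EXAMPLE_NAMES = [
    "ata-EXAMPLE_MODEL_SERIAL0001",
    "ata-EXAMPLE_MODEL_SERIAL0001-part1",
    "ata-EXAMPLE_MODEL_SERIAL0002",
    "wwn-0x5000000000000001",
]


def whole_disk_links_for_serial(names, serial):
    """Return by-id paths of whole disks (not partitions) ending in the serial."""
    matches = []
    for name in names:
        if "-part" in name:
            continue  # skip partition entries such as ...-part1
        if name.startswith("ata-") and name.endswith("_" + serial):
            matches.append(BY_ID_PREFIX + name)
    return matches


def qm_set_command(vmid, bus_slot, disk_path):
    """Build (but do not run) a `qm set` argument list attaching a disk by id.

    vmid and bus_slot are assumptions for illustration; use the values
    from your own VM configuration.
    """
    return ["qm", "set", str(vmid), "-" + bus_slot, disk_path]


if __name__ == "__main__":
    paths = whole_disk_links_for_serial(EXAMPLE_NAMES, "SERIAL0002")
    # Refuse to continue unless exactly one disk matches the serial.
    if len(paths) == 1:
        print(" ".join(qm_set_command(100, "scsi1", paths[0])))
```

The sketch only prints the command so it can be compared against the label on the physical drive before anything is changed on the host.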


You are on a course that is known to lead to total pool loss…


If you could please elaborate on that more, I would be most appreciative. Under what circumstances would such a “total loss” occur? Are we talking about something related to the pending replacement of the failing drive, or a broader operational issue that might arise under certain circumstances?

Appreciate your input.

When running TrueNAS as a VM, it is critical to pass through either the SATA controller or an HBA entirely to TrueNAS & blacklist it from the hypervisor (this could be complicated & require a lot of rebuilding if you have drives on that controller/HBA feeding other VMs or the hypervisor boot drive). You’d also want to make sure the hypervisor isn’t importing the ZFS pool.

The short version (my poor understanding of it) is that ZFS doesn’t like to share & needs unrestricted, direct access to drives, else bad things eventually happen.

Things will likely work fine at first if you don’t. That is the problem: things are initially fine & then your pool is suddenly corrupted… eventually.

Replace the disk first. Then, somewhere on the forums there is a guide called something like “if you absolutely must virtualize truenas”


Thanks for input. I’ve been running this setup for about two years now on a 24x7 basis with no problem (obviously until this failed drive arose), but I appreciate the concern you raise.

One other thing I would mention - I see no indication the hypervisor is importing the zfs pool.


I wish you all the best man, replace the drive & go from there. These are best practices because so many individual cases have eventually had critical issues.

Hopefully everything keeps working fine until it is time to retire the hardware & upgrade.

Well, what I wanted to point out is that I have already been checking out hardware for an entirely new build; new MB, CPU, the works. Was planning to do this as soon as Christmas expenses were past LOL. My replacement drive will be here Wednesday, and I’m hoping to look at the new build here in a month or two. So my current setup only has to work through then.

Thanks again.


A quick read through this thread should help explain some of the things that have been discussed here.

https://www.truenas.com/community/resources/absolutely-must-virtualize-truenas-a-guide-to-not-completely-losing-your-data.212/


I didn’t really read the thread, but everything looks good until I saw this step.

If you’ve done the PCI passthrough correctly as recommended (i.e. passing through the entire controller), you shouldn’t have to pass through anything right now or change any Proxmox settings.