TrueNAS Core boot pool checksum

I have an old TrueNAS that has worked its way up from FreeNAS 9. It is still using USB 2 drives, and the drives seem unhappy. I get an alert that the boot pool has problems and that it tried to fix the issue. Recent scrubs are OK, but I should probably replace both USB drives to make this last a bit longer. Which one should I offline and replace first?

[Screenshot: Truenas_boot_pool1]

I’m guessing DA4 should be the first to be replaced, but wanted to make sure I’m looking at the checksum the correct way (3 read issues and 12 write issues?).

The counters are read / write / checksum, so that is 3 read / 0 write / 12 checksum errors.
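
For reference, the error counters in `zpool status` are listed in the order READ, WRITE, CKSUM. Here is a minimal Python sketch of reading one device row. The sample row is made up to match the numbers in this thread (the device name `da4p2` and the state are assumptions), and it only handles plain integer counters, not abbreviated values such as `1.2K`.

```python
from typing import NamedTuple


class DeviceErrors(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


def parse_device_row(row: str) -> DeviceErrors:
    """Parse one device row of `zpool status` (NAME STATE READ WRITE CKSUM)."""
    fields = row.split()
    if len(fields) < 5:
        raise ValueError(f"not a device row: {row!r}")
    name, state, read, write, cksum = fields[:5]
    return DeviceErrors(name, state, int(read), int(write), int(cksum))


# Hypothetical row; the device name and state are assumptions.
sample = "    da4p2   ONLINE       3     0    12"
errors = parse_device_row(sample)
print(f"{errors.name}: {errors.read} read / {errors.write} write / {errors.cksum} checksum")
```
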
You likely cannot “replace” the old thumb drives, because the pool was probably created with ashift=9 (512-byte sectors), which is too low for most current drives. The easiest way is to save the configuration file, install anew on the new boot device(s), and load the configuration back.


In addition to what @etorix wrote, there are rare failure cases where you can’t replace a mirror disk because both mirror disks are experiencing bad block problems, but in different places.

If you have a new boot device available and can install it now, you may be able to attach it as a temporary third member of the mirror. If it resilvers to this third device successfully, you can pull DA4 and repeat the process for DA5.

One neat thing about ZFS is replace in place, which automates the temporary third-mirror approach. It resilvers the new replacement device and, when that finishes without errors, detaches the source device. If the source device has bad blocks but the other device (DA5 in your case) has the data, all is good: ZFS will use that copy to sync up the replacement device.
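
As a rough sketch of the two approaches, the Python below only builds and prints the underlying `zpool` command lines; it does not run anything. The pool name `boot-pool` and the partition names `da4p2`, `da5p2`, and `da6p2` are assumptions for illustration, so check `zpool status` for the real names on your system.

```python
from typing import List


def attach_cmd(pool: str, existing: str, new: str) -> List[str]:
    """Add `new` as an extra side of the mirror that contains `existing`."""
    return ["zpool", "attach", pool, existing, new]


def detach_cmd(pool: str, device: str) -> List[str]:
    """Remove `device` from its mirror."""
    return ["zpool", "detach", pool, device]


def replace_cmd(pool: str, old: str, new: str) -> List[str]:
    """Resilver onto `new`, then drop `old` once the resilver succeeds."""
    return ["zpool", "replace", pool, old, new]


POOL = "boot-pool"  # assumption: older installs may use a different name

# Option 1: temporary three-way mirror, then pull the suspect device.
three_way = [
    attach_cmd(POOL, "da5p2", "da6p2"),  # da6p2 is the hypothetical new device
    ["zpool", "status", POOL],           # confirm the resilver finished cleanly
    detach_cmd(POOL, "da4p2"),
]

# Option 2: replace in place, which ZFS handles as one operation.
in_place = [replace_cmd(POOL, "da4p2", "da6p2")]

for cmd in three_way + in_place:
    print(" ".join(cmd))
```

On TrueNAS the web UI normally drives boot device changes and prepares the new disk for you, so treat these command lines as a picture of what happens underneath rather than a recipe.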


Something similar happened to me with a RAID-5 in a data center. I had bad blocks on multiple disks (there was a manufacturing issue with the disks...). So while the RAID-5 was good, I could not replace ANY disk because of this. The hardware RAID disk array did not support replace in place. Nor did we care about RAID-5 -> RAID-6 (even if the disk array supported such...). Thus it was full backup, fix, and restore time.

I’ll have to look into the third drive addition to see if I can clear the errors, then get the mirror back. I didn’t know it could do this, and it is certainly worth a try.

I’ve done the fresh install and restore config in the past, I think I may have done that to get into the 11 version. Our semester is almost finished, and that will give time to work on this.

Just looked at my drives, and the boot pool is on 32GB drives. I only have 8GB drives on hand, so I’ll have to order some new ones. The boot pool currently holds around 6GB of data, and I don’t think I can mirror from a bigger drive to a smaller drive.

In the meantime, I grabbed the config file again. I also deleted the oldest OS on the drives to make a little space and speed up the mirror process. I’ll have to look and see if I have room to jam a couple of SSDs into this server. It’s a little 1U with 8 drives, so there is not a lot of space and maybe no free SATA ports either. I can’t tell you the last time I opened this chassis; it has to be well over 6 years now. I think the closest I got was removing and replacing all of the storage drives about 3 years ago.

My other action will be to copy all the files to a share on another server, set the permissions, and probably change the DNS records to point to that second server. I think that will keep me running for a while, at least the data is backed up.

[edit] I think once I get everything backed up, I may burn this server down and go to Scale. After messing around in Core to make these shares and set permissions, Scale seems more comfortable these days. I’ll have to see when I get to that point and either start clean or upgrade in place to Scale. It’s a storage only server so shouldn’t be any issues with either approach.

Well, I think I may have a problem… I got a “device is too small” error when I tried to replace one of the boot mirror drives with one of the new drives. Is there any magic that can make this happen? I know there is plenty of free space on the original drives; can I shrink them to fit?

Currently running Core 13.0-U6.
I have a feeling that a fresh install and import the pool is in order here, but hoping for something “better”.
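
For anyone hitting the same error: ZFS cannot shrink an existing mirror, so as a simplified rule the new device has to be at least as large as the members already in it, and drives sold with the same nominal capacity can differ in actual byte count. A small Python sketch of that check is below. The byte counts are invented for illustration and are not taken from the drives in this thread.

```python
from typing import List


def can_join_mirror(new_bytes: int, member_bytes: List[int]) -> bool:
    """Simplified rule: the new device must not be smaller than any current member."""
    return new_bytes >= min(member_bytes)


# Hypothetical sizes: three drives that are all sold as "32GB".
old_members = [31_914_983_424, 31_914_983_424]
new_drive = 30_752_636_928

if can_join_mirror(new_drive, old_members):
    print("new drive is large enough")
else:
    shortfall = min(old_members) - new_drive
    print(f"device is too small by {shortfall} bytes")
```

When the check fails, the fresh install and config restore route avoids the problem, because the new pool is created at the size of the new drives.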

I also copied everything to a different server, debating just making a new share on that server, setting permissions, and editing the DNS entry to point to this different server. Need to change DNS so that current scripts will point (and mount) to that share.

Always scary, but in the end, it all worked out.

Had to apply the config twice to get it to show my shares, and had to refresh the AD account to get permissions back (plus a couple of reboots). Also had to log out of my workstation and log back in before permissions on the shares worked. But overall, it took about an hour of messing around to put two fresh drives in place, reload the OS, and get everything working again.

When I had the server down, I did pull the cover and vacuum everything… It’s older than I remembered and really needs to go. It’s a Supermicro X9 which puts it in the 2012 era.