GPT Partition table corruption on boot disk

Protopia · April 14, 2024, 4:44pm

Today my TrueNAS Scale system hung - I could Ping it but that was about it. No GUI, no SSH, so no way to do a software initiated reboot or shutdown.

So it was a hard reset - and then it wouldn’t reboot. I checked the boot drive and the primary GPT table was corrupt - but the data was in the backup partition table. I plugged the boot SSD into a normal Linux system and used gdisk to copy the backup table to the primary and was then able to boot successfully.

So, in essence, the BIOS was unable to find the UEFI partition and so was unable to boot.

So - a few questions:

1. How would the primary GPT partition table become corrupt when it wasn’t being written - or at least shouldn’t have been written, since it is months (and several reboots) since I last changed partitions on this drive?
2. What can I do to try to prevent this from occurring? Can I prevent GPT sectors from being written somehow?
3. I doubt it is possible, but is there any way to enhance the BIOS so that it will use the backup partition table if the primary one is corrupt? And is there any way to do something in UEFI (or Linux) to automatically rewrite the primary GPT partition table if it is corrupt but the backup one is OK?

ericloewe · April 14, 2024, 5:35pm

Dodgy boot device, most likely.

Well, it’s not super likely that that’s what happened, so I wouldn’t concern myself too much with it.

That’s something for the UEFI vendors to do. Given the miserable state of most AMI firmware deployments, I’ll go with “don’t hold your breath”.

You can hack together a shell script to check if the GPTs match and fix them if one of them breaks (gdisk probably does this somewhat automatically).
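
As a rough illustration of the kind of check suggested above, here is a minimal read-only sketch in Python rather than shell. It assumes 512-byte sectors and that the backup header sits in the very last sector of the device, and it validates only the header CRC32 (not the partition entry array). The names `gpt_header_is_valid` and `check_disk` are made up for this example.

```python
import os
import struct
import zlib

GPT_SIGNATURE = b"EFI PART"
MIN_HEADER_SIZE = 92  # size of the GPT header fields that precede the padding


def gpt_header_is_valid(sector: bytes) -> bool:
    """Return True if `sector` holds a GPT header whose own CRC32 checks out."""
    if len(sector) < MIN_HEADER_SIZE or sector[:8] != GPT_SIGNATURE:
        return False
    # Offset 12: header size, offset 16: CRC32 of the header (little-endian).
    header_size, stored_crc = struct.unpack_from("<II", sector, 12)
    if not MIN_HEADER_SIZE <= header_size <= len(sector):
        return False
    # The CRC is computed with the CRC field itself set to zero.
    header = bytearray(sector[:header_size])
    header[16:20] = b"\x00\x00\x00\x00"
    return (zlib.crc32(bytes(header)) & 0xFFFFFFFF) == stored_crc


def check_disk(path: str, sector_size: int = 512) -> dict:
    """Read the primary (LBA 1) and backup (last LBA) headers without writing."""
    with open(path, "rb") as disk:
        total = disk.seek(0, os.SEEK_END)   # device or image size in bytes
        disk.seek(sector_size)              # LBA 1: primary header
        primary = disk.read(sector_size)
        disk.seek(total - sector_size)      # last LBA: assumed backup location
        backup = disk.read(sector_size)
    return {
        "primary_ok": gpt_header_is_valid(primary),
        "backup_ok": gpt_header_is_valid(backup),
    }
```

Something like `check_disk("/dev/sdX")` run as root from a periodic job would only report the state; any repair would still be a separate, deliberate step with gdisk.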

Protopia · April 14, 2024, 8:28pm

There is no sign that the boot drive is dodgy. In all other respects it is behaving normally. But I guess you never know.

My concern is that I have no idea what happened, unless it was a consequence of an unexpected power off to the SSD itself.

I am not holding my breath - the BIOS is almost certainly functionally stabilised. And to get to UEFI, you need to boot off the UEFI partition, which means you need to find it, which in turn means that the BIOS would need to use the backup GPT partition table if the primary has been corrupted. So my own view was pretty much zero hope - but I thought I would ask just in case anyone else knows or can see something I can’t.

A shell script to check the GPT Partition Table will only run once UEFI has been found and TrueNAS has booted - so not much use I fear.

So I guess I just have to hope it doesn’t happen again.

ericloewe · April 14, 2024, 9:33pm

Think of the script like a scrub. It’s likely that whatever happened did not happen precisely during the reboot, so a periodic check can repair the damage before it gets critical.

Protopia · April 15, 2024, 7:41am

@ericloewe That is certainly an idea worth considering. I would just have to be absolutely certain that it wasn’t going to create a problem or make an existing problem worse.

Protopia · April 15, 2024, 2:08pm

I looked into this a little further. You can do sgdisk -v /dev/sdf and it will check whether the partition tables are valid - if you get rc=2 then they aren’t. At this point you would have two options:

You could create some sort of alert e.g. by email so the admin can perform the partition table recovery from the TrueNAS console; or
Script the analysis of what is broken and the recovery actions. The sgdisk command doesn’t give different return codes for different partition table failures, so you would then need to examine the output of this sgdisk command to see what errors were found. Next, sgdisk does not have the advanced facilities of the gdisk command for recovering the primary from backup or vice versa - so you would have to script the gdisk command to do the recovery.

This is non-trivial with the real risk of screwing up your system if you get it wrong - and my BASH / TrueNAS API skills are (unfortunately) not up to this.
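
For the first option (alert only, no automatic repair), a minimal sketch might look like the following. It is written in Python rather than Bash, treats any non-zero return code from `sgdisk -v` as a failure (the post above reports rc=2), and does not try to interpret the output. The `runner` and `alert` parameters are hypothetical hooks for this example; `alert` could be replaced by whatever sends email on your system.

```python
import subprocess
from typing import Callable


def run_sgdisk_verify(device: str) -> subprocess.CompletedProcess:
    """Run `sgdisk -v DEVICE` (verification only) and capture its output."""
    return subprocess.run(
        ["sgdisk", "-v", device],
        capture_output=True,
        text=True,
        check=False,  # we inspect the return code ourselves
    )


def verify_gpt(
    device: str,
    runner: Callable[[str], subprocess.CompletedProcess] = run_sgdisk_verify,
    alert: Callable[[str], None] = print,
) -> bool:
    """Return True if the check passed; otherwise call `alert` and return False."""
    result = runner(device)
    if result.returncode == 0:
        return True
    # Pass the raw sgdisk output along so the admin can decide what to repair.
    alert(
        f"GPT check failed on {device} (rc={result.returncode})\n"
        f"{result.stdout}{result.stderr}"
    )
    return False
```

Keeping the repair manual sidesteps the risk described above: the script never writes to the disk, it only tells the admin that a recovery with gdisk is needed.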

Stux · April 16, 2024, 12:41am

Multi-report feature?

Protopia · April 16, 2024, 8:56am

@Stux Thanks for the idea.

I had already taken a brief look at Multi-Report but not implemented it.

It does look like a good home for this requirement - so I will implement multi-report and then take a look at how to add this functionality.

Protopia · April 16, 2024, 8:34pm

I have implemented multi_report.

But the GPT Partition Table corruption happened again this evening.

I do not think it is a TrueNAS issue or a ZFS issue. So either a USB or an SSD issue. I doubt I will get far with Amazon or SSK for a warranty claim.

So I am wondering what to do to avoid this happening again. (It takes me c. 30 mins to take the NAS apart to access the USB stick, insert it in a Linux box, run gdisk to rewrite the partition table, then put it back together and reboot.)

Perhaps I should buy a second SSK stick to use in a rear panel external USB port and mirror the first one to it. Providing that a) the NAS BIOS will boot from an external USB and b) that the corruption isn’t replicated, then the BIOS should switch automatically to the second SSD…

Davvo · April 16, 2024, 9:10pm

GRUB is, apparently, a massive PIA. Iirc, @kris mentioned they were working on/considering an alternative… but I might be remembering this totally wrong.

It’s amazing how much simpler the multi report makes monitoring the system. Can’t do without it now, @joeschmuck & Co. did a fantastic job. I still don’t understand why it has not been implemented in TN yet.

ericloewe · April 16, 2024, 9:31pm

GRUB is just slightly better than cancer. But worse than HIV. Maybe HIV 15 years ago levels of bad. BUT: it’s not GRUB’s fault that the system firmware is too stupid to go look at the backup partition table, which exists for the sole purpose of being a backup in case the primary table gets corrupted so that things that read GPTs can move on with life and minimize downtime.

Still, if the same table is getting corrupted again, it seems that you either have an errant thing going around nuking the GPT or (more likely) a rather dodgy boot medium incapable of remapping bad sectors.

Protopia · April 16, 2024, 9:44pm

I am not sure why Grub is being updated by TrueNAS SCALE outside a version upgrade, nor why Grub would be updating the partition table even then.

Also, I think Grub can be implemented in various ways i.e. UEFI systems can be different and the UEFI implementation is probably better architected. I doubt that ixSystems is ready to put UEFI as a strict requirement though.

Stux · April 16, 2024, 10:51pm

If you had mirrors set up, and you found a disk had a corrupt GPT, removing the disk and re-adding it to the mirror would rebuild the table.

But, correct me if I’m wrong, you have non-standard partitioning on this disk right? Are you sure that’s not the root cause?

etorix · April 17, 2024, 7:30am

Pain level 1: Install Linux on a SATA drive, in a system which has another drive. Happily boot from that to check it works.
Now insert another drive and/or swap ports. Try booting. Admire how GRUB loses its marbles because former boot drive /dev/sda got a new letter. (Drive LETTERS? In 2024???)

Pain level 2: Use your favourite search engine to find instructions to repair GRUB from its command line. Follow said instructions to boot, once. Shut down and boot again. Did your repair stick? If so, congratulations!

That behaviour is bad enough when just fiddling with some Ubuntu distro to quickly check a build before installing a pickier OS. But with OpenMediaVault or TrueNAS SCALE, i.e. a NAS platform where one will be adding, removing or replacing drives… Abysmal.

ericloewe · April 17, 2024, 9:55am

As in “congratulations, you’re a liar”. Have fun doing it all again next boot!

etorix · April 17, 2024, 10:42am

Oh, so it’s not just me being stumped or plain “I hate the GPL so GRUB hates me”?

ericloewe · April 17, 2024, 1:24pm

Hell no. GRUB is the most byzantine piece of software I’ve had the pleasure of using. For over half a year now, I’ve had a critical server that won’t boot on its own because GRUB’s configuration won’t be regenerated because zsys breaks from too many datasets (due to docker) and the configs are apparently too ridiculous for me to figure out in polynomial time.
So, if it’s ever shut down, the local team (I can do the same via IPMI) has a cheat sheet with the correct parameters needed to boot Ubuntu.

I can’t wait for the thing’s next service window to nuke GRUB and restructure the install for ZBM. All this because I needed to replace dying Samsung 870 Evos!

Needless to say, I am not deploying GRUB again.

Stux · April 17, 2024, 1:30pm

ZBM?

ericloewe · April 17, 2024, 1:50pm

ZFS Boot Menu

Protopia · April 17, 2024, 2:20pm

Yes - I have an apps pool on the boot drive and it is not currently mirrored - so it is non-standard/unsupported, and this was a decision I made in full knowledge of that. So IF this is the cause (even because it is triggering a bug that shouldn’t be there regardless) then I have only myself to blame.

That said…

I can see no reason why having the apps pool on the boot drive should do this.

It has been operating just fine, even after the last version upgrade and several reboots, and I don’t see any reason for TrueNAS or Debian or even Grub to be changing the Partition Tables, so it feels more like a hardware issue than a software one.

One thing I am now pondering is whether:

1. This corruption is because I have to do a sudden power off (because TrueNAS has hung - no Web GUI, cannot SSH, SMB not responding, only an ICMP ping gets a response);

or alternatively:

2. The TrueNAS hang is a consequence of the GPT getting corrupted, Debian then reloading its partition information and finding that the partitions have disappeared, and Linux then grinding to a halt (because e.g. TrueNAS middleware is trying to write to the boot pool and the write fails).

Of these two it is the second one which feels most likely (guess/hunch), because if something is updating (and corrupting) the partition table, then generally speaking such code will reload the partition table afterwards. So this would explain the hang as well, whereas for option 1 we need a separate explanation for the hang.

So now we have to try to work out what in Debian / TrueNAS is causing the GPT Partition Table to be rewritten.
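
One low-risk way to start narrowing this down is to fingerprint the primary GPT area periodically and log when it changes, so the time window can be correlated with system logs. The sketch below is read-only and assumes the common layout of 512-byte sectors with the protective MBR, primary header and a 128-entry partition array in the first 34 sectors. The function names are made up for this example, and it shows only that the area changed, not what changed it.

```python
import hashlib


def gpt_fingerprint(path: str, sector_size: int = 512, sectors: int = 34) -> str:
    """Return a SHA-256 hex digest of the first `sectors` sectors of `path`."""
    with open(path, "rb") as disk:
        data = disk.read(sector_size * sectors)
    return hashlib.sha256(data).hexdigest()


def gpt_changed(path: str, baseline: str, **kwargs) -> bool:
    """Return True if the primary GPT area no longer matches `baseline`."""
    return gpt_fingerprint(path, **kwargs) != baseline
```

Recording a baseline once while the table is known to be good, then calling `gpt_changed` from a frequent scheduled job, would at least show whether the table changes before the hang or only after the hard power-off.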