Drives were underpowered for a few minutes, then errors started to pop up.
Since the corrupted files weren’t critical, I made a backup and wiped /var/db/system/netdata/ and /var/db/system/core/.
After that, the corrupted files were listed as HDDs/.system/netdata-ae32c386e13840b2bf9c0083275e7941:<0xc> and HDDs/.system/cores:<0x12d> instead.
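As far as I understand, an entry like HDDs/.system/cores:<0x12d> is not a filesystem path. It is a dataset name followed by a hexadecimal object number, which zpool status -v prints when it can't map the object back to a file name (for example, after the file was deleted). Here is a minimal sketch that splits such an entry into the dataset and a decimal object number. The function name parse_zfs_error_entry is made up for illustration; it only does string handling and never touches the pool.

```shell
# Hypothetical helper: split a "dataset:<0xNN>" entry from `zpool status -v`
# into the dataset name and the object number in decimal.
parse_zfs_error_entry() {
  entry=$1
  dataset=${entry%%':<'*}   # everything before ":<"
  hex=${entry##*':<'}       # e.g. "0x12d>"
  hex=${hex%'>'}            # e.g. "0x12d"
  printf '%s %d\n' "$dataset" "$hex"
}

parse_zfs_error_entry 'HDDs/.system/cores:<0x12d>'
# prints: HDDs/.system/cores 301
```

So the two entries point at object 12 in the netdata dataset and object 301 in the cores dataset, not at anything I could reach with ls or cd.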
Then I’m just stuck. These errors just WON’T go away.
HDDs/.system doesn’t exist as far as the shell is concerned: ls -a and cd show nothing, and it’s the same if I SSH in instead.
I have no idea how to access that…and I probably shouldn’t.
zpool clear HDDs sounded promising, but it does nothing except reset the checksum error counts.
I’ve run a scrub and memtest, swapped every cable, transferred the drives to another rig, and even done a fresh OS install without importing the config. The corrupted files always persist.
================================================
One odd thing that I don’t know what to make of:
2 out of 3 drives report the same number of checksum errors; 1 reports none.
And yes, they all passed a long SMART test, and it’s the same 2 drives no matter how many environmental factors I changed.
Something is definitely up with the data they’re storing.
Why is it 2? Why isn’t it 1 or all of them? This is the part I still have absolutely no clue about.
🛠️ Recommended Steps to Resolve the Corruption
Since a fresh OS install didn't fix it (because the pool and data are separate from the OS drive), the corruption is truly within the ZFS pool data/metadata.
1. Try to Delete the Corrupted Datasets
The simplest and cleanest solution is to destroy the corrupted system datasets, as you've already backed up and don't care about the historical netdata/core dump files.
First, identify the actual dataset name. Open the Shell in your TrueNAS GUI or via SSH:
List Datasets:
zfs list -r HDDs
Look for a line that resembles HDDs/.system or something similar, like HDDs/iocage or HDDs/jails if you use those. Note the exact name.
Destroy the Datasets (CAUTION): If the name is indeed HDDs/.system, destroy it. This will permanently delete ALL system data (RRD graphs, netdata, core dumps) for this TrueNAS installation.
zfs destroy -r HDDs/.system
(The -r flag is for recursive if it has children like HDDs/.system/netdata)
Wait for TrueNAS to Recreate: TrueNAS will automatically detect the missing .system dataset on the next reboot or shortly after and recreate it. This should clear the corruption references.
It’s super confident that TrueNAS will recreate .system if it’s deleted… I’m not trusting that just yet.
Okay, so… after asking around in the ZFS community, the conclusion is: back everything up, destroy .system, and see what happens.
Since everything was backed up, I’m more comfortable following the AI’s advice here…
So… basically, I switched the system dataset to boot-pool, then switched it back, and the errors are gone.
No idea if anything I did before that mattered.