Checksum error and resilvering disaster

krushin · May 21, 2025, 3:09am

Hi all,

I’m having an issue with a 6 x 8TB drive TrueNAS Core server in a RAIDZ2 configuration. I had one drive go bad, which I replaced. After resilvering, another drive was giving SMART errors, so I replaced it, resilvered, and then scrubbed. Everything looked fine, so I ran a short and a long SMART test on all drives… no errors on any of them. I added another dataset and copied 2TB worth of data to it. Not long after, I ran another scrub, and now everything copied to the new dataset, plus files that haven’t been touched, shows up as unrecoverable errors in the zpool status -v output. Well over 1000 files. Each drive shows roughly the same number of checksum errors: 200+, differing by no more than 5 between drives. I saw it mentioned that cabling or the HBA could be the cause, so I replaced my 9200-8i with a 9305-16i and also installed new cables.

Still the same issues afterwards. If I try to move files, around half of them fail with I/O errors in Windows. I’m trying to back up what I can, but it seems to be getting worse. Any suggestions where to go from here? Is the pool lost? I was counting on the two-drive redundancy to protect my data, but now I have thousands of pictures and other files throwing errors. I don’t have another backup, and I can’t copy many of the files from the NAS.

MSI Pro Z90-P Wifi DDR4
Intel 12th Gen CPU
32gb RAM
6 x HGST Ultrastar SAS drives
9305-16i HBA
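
For anyone comparing per-drive counts like the ones described above: near-identical CKSUM numbers on every disk tend to point at something all disks share (RAM, HBA, cabling, power) rather than six disks failing at once. Here is a minimal Python sketch that tallies the CKSUM column from text laid out like the `zpool status` config table. The sample text, the `da` device prefix, the function names, and the tolerance of 5 are all assumptions for illustration; none of it is output from this server.

```python
import re

# Hypothetical excerpt in the layout `zpool status` uses for its config table.
# Pool name, device names, and counts are invented for illustration only.
SAMPLE = """\
config:

    NAME        STATE     READ WRITE CKSUM
    tank        ONLINE       0     0     0
      raidz2-0  ONLINE       0     0     0
        da0     ONLINE       0     0   212
        da1     ONLINE       0     0   210
        da2     ONLINE       0     0   214
        da3     ONLINE       0     0   211
        da4     ONLINE       0     0   213
        da5     ONLINE       0     0   212

errors: 1043 data errors, use '-v' for a list
"""

# Matches indented rows of the form: NAME STATE READ WRITE CKSUM (plain integers).
# Assumption: counts are not abbreviated (e.g. "1.2K"); such rows are skipped.
ROW = re.compile(r"^\s+(\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def cksum_by_disk(status_text, disk_prefix="da"):
    """Return {device: checksum_error_count} for rows whose name starts with disk_prefix."""
    counts = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and m.group(1).startswith(disk_prefix):
            counts[m.group(1)] = int(m.group(5))
    return counts


def looks_uniform(counts, tolerance=5):
    """True if every disk has errors and the spread between disks is small."""
    values = list(counts.values())
    return len(values) > 1 and min(values) > 0 and max(values) - min(values) <= tolerance


if __name__ == "__main__":
    counts = cksum_by_disk(SAMPLE)
    print(counts)
    print("uniform across disks:", looks_uniform(counts))
```

A uniform result is only a hint about where to look next, not a diagnosis.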

Captain_Morgan · May 21, 2025, 6:39am

Which version of CORE?

Does your RAM have ECC?

krushin · May 21, 2025, 12:57pm

The version is TrueNAS-13.0-U6.7

RAM is actually 64GB, not 32GB. It’s 2 x GSkill Ripjaw F4-3200C16D-32GVK (Non-ECC)

krushin · May 21, 2025, 10:18pm

Going to try replacing the RAM with some ECC RAM and see if that helps. It was new though, so not holding out much hope it was bad. Frustrating that TrueNAS isn’t detecting any errors on the long tests, but my data is all corrupted. Thought I was doing the right thing to protect it with the 2 drive redundancy.

winnielinnie · May 21, 2025, 11:04pm

Temperatures? Cabling? Anything loose?

Captain_Morgan · May 22, 2025, 5:33am

Replacing the RAM won’t fix a corrupted pool.

If you run a memory test, you might see issues and know what the cause is.
I don’t know if it’s the cause, I just know it’s a potential source of corruption.

neofusion · May 22, 2025, 12:03pm

Your motherboard does not support ECC and your CPU likely doesn’t do so either.

Also, if RAM is bad or incompatible you typically find that out at the very start. While RAM can “go bad over time”, that’s not especially common, which is why manufacturers often feel they can offer lifetime warranties.

etorix · May 22, 2025, 1:39pm

Actually the CPU supports ECC if it is not an i3 (no detailed specs in OP…), but the Z motherboard definitely doesn’t.
RAM does degrade, and here a bad RAM stick would be a reasonable explanation for the multiple checksum errors without SMART errors, so running MemTest should be high on the list.
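
To illustrate why bad RAM can produce checksum errors that parity cannot repair, here is a toy Python sketch. It is a simplified model, not ZFS code: SHA-256 stands in for the pool's checksum, a single XOR parity stands in for RAIDZ2's two parity columns, and the order of events (checksum taken from good data, bit flipped in memory before striping) is just one assumed failure scenario. All names are hypothetical.

```python
import hashlib
from functools import reduce


def xor_parity(chunks):
    """Byte-wise XOR of equally sized chunks (single parity, for simplicity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted, mimicking a RAM fault."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def split(data, n):
    """Cut data into n equal chunks (len(data) must be divisible by n)."""
    size = len(data) // n
    return [data[i * size:(i + 1) * size] for i in range(n)]


def rebuild(chunks, parity, missing):
    """Reconstruct chunk `missing` from the other chunks plus parity."""
    others = [c for i, c in enumerate(chunks) if i != missing]
    return xor_parity(others + [parity])


block = bytes(range(64))                          # the data as the application wrote it
stored_checksum = hashlib.sha256(block).digest()  # checksum computed from the good data

in_ram = flip_bit(block, 137)   # assumed fault: one bit flips in memory before the write
chunks = split(in_ram, 4)       # striped across four healthy "data disks"
parity = xor_parity(chunks)     # parity is computed over the already-corrupt buffer

# Reading back from perfectly healthy disks still fails verification.
direct_ok = hashlib.sha256(b"".join(chunks)).digest() == stored_checksum

# Reconstructing any single chunk from parity just reproduces the corrupt data.
repaired = False
for i in range(len(chunks)):
    candidate = chunks[:i] + [rebuild(chunks, parity, i)] + chunks[i + 1:]
    if hashlib.sha256(b"".join(candidate)).digest() == stored_checksum:
        repaired = True

print("direct read verifies:", direct_ok)
print("parity reconstruction verifies:", repaired)
```

In this model the disks never misbehave, which matches a pool where SMART tests pass on every drive while a scrub still reports unrecoverable errors.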

neofusion · May 22, 2025, 2:43pm

Fair enough, most consumer 12th gen CPUs aimed at the desktop market do support ECC. I recommend checking first just to make sure, as there are several SKUs in the i5 and higher lines that lack support.

I brought that up because the poster conveyed that the RAM was new and thus not likely to be bad. I did not mean to suggest that RAM can’t go bad.

Running memtest is a very good move when a new system is put online (for many reasons), and follow-up tests at later points are warranted if the system suddenly becomes unstable.

MSameer · May 22, 2025, 3:14pm

Even with ECC RAM?

neofusion · May 22, 2025, 3:36pm

I run memtest86 even on ECC systems.
Compared to thoroughly testing every other possible culprit, it takes so little effort to run memtest overnight that I don’t see why I wouldn’t.

MSameer · May 22, 2025, 3:48pm

I run it on new memory but thought it would not be needed afterwards. I guess you are right that it will not hurt if we are debugging.

krushin · May 22, 2025, 4:36pm

So I ran MemTest, it looks like both sticks of RAM are faulty. I swapped in a different stick and it passed with no issues.

However, now I can no longer boot the system. I am getting this “panic: free guard1 fail at 0x70ee18f8 from unknown:0” error. Hopefully I didn’t lose access to the whole pool. Any suggestions how to address this error?

neofusion · May 22, 2025, 4:40pm

Do you have a copy of the config, including encryption keys if you used key encryption for any pools?

If yes, you can reinstall TrueNAS over your old boot device and upload your config. Be careful when selecting the installation target; you do NOT want to overwrite your data pool. There’s a good argument for disconnecting the data drives entirely, but YMMV.

Edit: Wait, maybe the RAM swap caused some BIOS settings to change. Before you go the reinstall route, double-check the BIOS. If nothing looks amiss, reinstall.

krushin · May 22, 2025, 4:50pm

Yes, I’ve been messing with the BIOS and tried setting defaults and a few other things, but it doesn’t make any difference. I don’t have a copy of the config. The encrypted datasets I have were passkey encrypted.

Would I still be able to add the pool back in without the config files? There’s just one big pool with all the drives.

I would definitely disconnect all the drives for the install.

krushin · May 22, 2025, 4:52pm

I should say, they are passphrase encrypted. So a password has to be typed in.

neofusion · May 22, 2025, 4:55pm

Yes. You will need to redo your settings though, such as creating users, shares, etc.
Import the existing data pool; don’t make a new one with the same drives.

krushin · May 22, 2025, 4:56pm

Awesome, as long as I can still get to the data on the pool I’m happy. Also going to back up the config files this time!

neofusion · May 22, 2025, 4:59pm

There is unfortunately a scenario here where the data pool will be unimportable due to RAM errors having corrupted things on it.

I hope it will work out for you.

krushin · May 22, 2025, 5:01pm

I guess all I can do at this point is cross my fingers. It seems to be completely unbootable. Wish I had grabbed the config before messing with the RAM.

I may take this as an opportunity to update to SCALE. It seems like that has more options than CORE.