I’m having issues with a 6 x 8TB drive TrueNAS Core server in a RAIDZ2 configuration. I had one drive go bad, which I replaced. After resilvering, another drive started giving SMART errors, so I replaced that one too, resilvered, and then scrubbed. Everything looked fine, so I ran short and long SMART tests on all drives… no errors on any of them.
I then added another dataset and copied 2TB worth of data to it. Not long after, I ran another scrub, and now everything copied to the new dataset, plus files that haven’t been touched, shows up as unrecoverable errors in the output of zpool status -v. Well over 1000 files. Every drive shows 200+ checksum errors, with roughly the same count on each (within about 5 of each other). I saw it mentioned that cabling or the HBA could be the cause, so I replaced my 9200-8i with a 9305-16i and also installed new cables.
Still the same issues afterwards. If I try to move files, around half of them fail with I/O errors in Windows. I’m trying to back up what I can, but it seems to be getting worse. Any suggestions on where to go from here? Is the pool lost? I was counting on the two spare drives to protect my data, but now I have thousands of pictures and other files throwing errors. I don’t have another backup, and I can’t copy many of the files from the NAS.
MSI Pro Z90-P Wifi DDR4
Intel 12th Gen CPU
32 GB RAM
6 x HGST Ultrastar SAS drives
9305-16i HBA
Going to try replacing the RAM with some ECC RAM and see if that helps. The RAM was new, though, so I’m not holding out much hope that it’s the problem. Frustrating that TrueNAS isn’t detecting any errors on the long tests, but my data is all corrupted. Thought I was doing the right thing by protecting it with two-drive redundancy.
If you run a memory test, you might see issues and know what the cause is.
I don’t know if it’s the cause; I just know it’s a potential source of corruption.
Your motherboard does not support ECC and your CPU likely doesn’t do so either.
Also, if RAM is bad or incompatible you typically find that out at the very start. While RAM can “go bad over time”, that’s not especially common, which is why manufacturers often feel they can offer lifetime warranties.
Actually, the CPU supports ECC if it is not an i3 (no detailed specs in the OP…), but the Z motherboard definitely doesn’t.
RAM does degrade, and here a bad RAM stick would be a reasonable explanation for multiple checksum errors on every drive without any SMART errors, so running MemTest should be high on the list.
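To make that reasoning concrete, here is a rough Python sketch that checks whether every disk in a pool reports a similar, nonzero checksum count, which points at something shared (RAM, HBA, cabling) rather than one failing disk. The sample text, the pool name tank, the device names da0 to da5 and both function names are made up for illustration; the parser assumes a single vdev and plain integer counts, and the tolerance of 5 just mirrors the "off by up to 5" figure from the first post.

```python
# Illustrative text shaped like the config table of `zpool status`.
# Pool name, device names and counts are invented, not taken from the thread.
SAMPLE = (
    "config:\n"
    "\n"
    "\tNAME        STATE     READ WRITE CKSUM\n"
    "\ttank        ONLINE       0     0     0\n"
    "\t  raidz2-0  ONLINE       0     0     0\n"
    "\t    da0     ONLINE       0     0   212\n"
    "\t    da1     ONLINE       0     0   214\n"
    "\t    da2     ONLINE       0     0   211\n"
    "\t    da3     ONLINE       0     0   213\n"
    "\t    da4     ONLINE       0     0   212\n"
    "\t    da5     ONLINE       0     0   215\n"
    "\n"
    "errors: (list of affected files omitted)\n"
)


def leaf_cksum_counts(status_text):
    """Return {device: CKSUM count} for the most deeply indented rows.

    Assumes one vdev, so the deepest rows are the disks. Counts must be
    plain integers; an abbreviated value such as "1.2K" raises ValueError.
    """
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True
            continue
        if not in_table:
            continue
        if not fields:
            break  # a blank line ends the config table
        if len(fields) < 5:
            continue
        indent = len(line) - len(line.lstrip())
        rows.append((indent, fields[0], int(fields[4])))
    if not rows:
        return {}
    deepest = max(indent for indent, _, _ in rows)
    return {name: cksum for indent, name, cksum in rows if indent == deepest}


def similar_errors_on_every_disk(counts, tolerance=5):
    """True if every disk has errors and the counts are within `tolerance`."""
    values = list(counts.values())
    if len(values) < 2 or min(values) == 0:
        return False
    return max(values) - min(values) <= tolerance


counts = leaf_cksum_counts(SAMPLE)
print(counts)
print("shared cause likely:", similar_errors_on_every_disk(counts))
```

A single disk with a high count and clean neighbours would return False here, which is the pattern you would expect from an actual failing drive. Treat this as a heuristic only; it does not replace running MemTest.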
Fair enough, most consumer 12th-gen CPUs aimed at the desktop market do support ECC. I recommend checking first just to make sure, as there are several SKUs in the i5 and higher lines that lack support.
I brought that up because the poster conveyed that the RAM was new and thus not likely to be bad. I did not mean to suggest that RAM can’t go bad.
Running memtest is a very good move when a new system is put online (for many reasons), and follow-up tests at later points are warranted if the system suddenly becomes unstable.
I run memtest86 even on ECC systems.
Compared to thoroughly testing every other possible culprit, it takes so little effort to run memtest overnight that I don’t see why I wouldn’t.
So I ran MemTest, and it looks like both sticks of RAM are faulty. I swapped in a different stick and it passed with no issues.
However, now I can no longer boot the system. I am getting this “panic: free guard1 fail at 0x70ee18f8 from unknown:0” error. Hopefully I didn’t lose access to the whole pool. Any suggestions on how to address this error?
Do you have a copy of the config, including encryption keys if you used key encryption for any pools?
If yes, you can reinstall TrueNAS over your old boot device and upload your config. Be careful when selecting the installation target: you do NOT want to overwrite your data pool. There’s a good argument for disconnecting the data drives entirely, but YMMV.
Edit: Wait, maybe the RAM swap caused some BIOS settings to change. Before you go the reinstall route, double-check the BIOS. If nothing looks amiss, reinstall.
Yes, I’ve been messing with the BIOS and tried setting defaults and a few other things, but it doesn’t make any difference. I don’t have a copy of the config. The encrypted datasets I have were passkey encrypted.
Would I still be able to add the pool back in without the config files? There’s just one big pool with all the drives.
I would definitely disconnect all the drives for the install.
Yes, you can import the pool without the config, though you will need to redo your settings, like creating users, shares, etc. Import the existing data pool; don’t create a new one with the same drives.