New TrueNAS install, errors in pool after copying data to NFS share

alexm87 · April 6, 2024, 9:08pm

Hi,

I’m new to TrueNAS and need a bit of help figuring out what is happening with a pool.
I’ve set up a server with a pool on a single 8 TB Seagate IronWolf NAS drive. The drive is brand new.

Today I copied ~2.5 terabytes of data to an SMB share on that pool. All was well, but now I see the pool is marked degraded.

At first it showed a large number of errors, in the millions. I shut down the server and swapped the SATA cable for another one. After rebooting it now shows 1,420 errors. I ran a short SMART test and it says everything is fine.

I’m not worried about the data at all, but I’m trying to figure out whether something is wrong with the hard disk, as it is still in the return window. I’ve started a scrub, but it looks like that will take at least a few hours.

Does anyone have some ideas of what I can do to figure out what is going on?

joeschmuck · April 6, 2024, 9:20pm

First, one single drive as a Stripe? Not a good thing for ZFS but maybe you are testing it out.

ZFS errors are not the same as SMART errors. From the CLI run zpool status -v and see what the output is. Does it list files that are damaged? If yes, you need to delete those. Then run a scrub and make sure no more errors occur. If that works, you can run ‘zpool clear poolname’ (where poolname is your pool name) and that should clear things up. HOWEVER, did you test your system for stability?
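A rough sketch of that sequence as a shell function, in case it helps. The function name pool_recheck and the pool name tank are placeholders I made up, not anything from your system. It only wraps the standard zpool status, zpool scrub and zpool clear subcommands, and it deliberately does not run the clear for you.

```shell
# Sketch of the check -> scrub -> clear sequence described above.
# pool_recheck is a hypothetical helper name; pass your own pool name.
pool_recheck() {
  # Show pool health and list any files with permanent errors.
  zpool status -v "$1" || return 1
  # Start a scrub. It runs in the background; watch progress with zpool status.
  zpool scrub "$1" || return 1
  # Clearing the error counters is left as a manual step once the scrub is clean.
  echo "Scrub started on $1. When it finishes with no new errors, run: zpool clear $1"
}

# Usage (as root on the NAS; "tank" is a placeholder): pool_recheck tank
```

Delete or restore any files that zpool status -v lists before you rely on the scrub result.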

alexm87 · April 6, 2024, 9:35pm

Yes, it is a stripe, as I am just trying it out. I have another 4TB disk (currently in another machine) plus a new 4TB one on the way for a mirror pool, and my plan was to get another 8TB disk later too.

I’ll check. It listed one file when I ran zpool status -v, but I clicked away from the shell in the UI and started a scrub afterwards, and now the scrub refuses to stop…

I did not test the system stability; my assumption was that not much could be wrong. The RAM is from my main system, which I’ve used for a couple of years now, but the motherboard and CPU are indeed used: a Supermicro board with a Xeon E3-1260v6.

Should I install Windows and run something like the AIDA64 stability test, or Prime95?

Thank you!

joeschmuck · April 6, 2024, 9:38pm

I would recommend you run Memtest86+ to test the RAM, and after that a CPU stress test like Prime95 for about 30 minutes. Some folks run it for days; I personally do not. If that works out, well, you have data on your drive already, so burning it in is out of the question. How much RAM do you have?

alexm87 · April 6, 2024, 9:46pm

16 GB of RAM. What is odd is that the pool was fine for most of the day. I copied stuff in batches, 300-400 GB at a time. In between I was also checking the TrueNAS UI, and there were no errors. They only appeared after the last few folders I copied over.

I’ll throw a Windows install on it and see what Memtest says, and also run Prime95 to see if it crashes or runs fine for ~30 min. What do you mean by burning the drive in?

Thanks again for the pointers!

joeschmuck · April 6, 2024, 9:51pm

You can use UBCD (Google it) which is a bootable image and has Memtest86+ and CPU stress tests, no need to install Windows.

alexm87 · April 6, 2024, 9:54pm

Thanks, will do!

Meanwhile I can see the file that has errors: /mnt/ZPoolStorage/ix-applications/k3s/server/db/state.db-shm. So it looks like some issue with the applications - I was trying to run the Plex app but it kept refusing to start up.

alexm87 · April 7, 2024, 8:13am

Ok, found the issue, but I had to redo the setup and recreate the pool. I’ve been using a PSU I had lying around for troubleshooting stuff, and it turns out the hard disk was disconnecting sometimes. I moved the drive power to another cable and the problem is gone.

I already ordered a new PSU on Friday; it should be here next week.

Thanks for the help!

Stux · April 7, 2024, 10:18am

In theory, you’ve had a hardware failure.

It could be RAM, cabling, a power glitch, the HD, etc.

Unfortunately with no redundancy there is no way to correct it.

I would check the smartctl -a results thoroughly for any issues (pending sectors, reallocations, UDMA CRC errors, etc.).
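If you just want those counters without reading the whole report, something like this should work. It is a sketch only: smart_key_attrs is a name I made up, /dev/sda is a placeholder for your disk, and it assumes the usual ten-column ATA attribute table that smartctl prints. Attribute names and IDs can vary by vendor.

```shell
# Print ID, name and raw value for the SMART attributes most relevant here:
#   5   reallocated sectors
#   197 current pending sectors
#   198 offline uncorrectable sectors
#   199 UDMA CRC errors (usually points at cabling rather than the disk)
# smart_key_attrs is a hypothetical helper; it filters text on stdin.
smart_key_attrs() {
  awk 'NF >= 10 && ($1 == 5 || $1 == 197 || $1 == 198 || $1 == 199) { print $1, $2, $10 }'
}

# Usage (as root; /dev/sda is a placeholder): smartctl -A /dev/sda | smart_key_attrs
```

Non-zero raw values on 5, 197 or 198 point at the disk itself, while a rising 199 count is more typical of a bad cable or connection.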

If you don’t care about the data, it could be worthwhile performing a hard disk burn-in.

Stux · April 7, 2024, 10:19am

Sorry. Discourse messed up. I typed up that reply quite a few hours ago…. But there ya go.

alexm87 · April 7, 2024, 10:22am

Yes, I should have checked dmesg earlier. I only noticed some ATA reconnect errors because I had the BMC remote console open and the logs were showing where the 1-9 options are.
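For anyone landing here later, this is roughly how I would look for it next time. It is only a sketch: ata_link_events is a name I made up, and the pattern is a guess at the usual kernel wording for SATA link resets and errors, so it may miss or over-match lines on other systems.

```shell
# Keep only kernel log lines that mention an ATA port together with a
# reset, link change, exception or error. Reads log text on stdin.
# ata_link_events is a hypothetical helper name.
ata_link_events() {
  grep -Ei 'ata[0-9]+(\.[0-9]+)?: .*(reset|link|exception|error)'
}

# Usage (as root on the NAS): dmesg | ata_link_events
```

Repeated link resets on the same port while the pool is busy would have pointed me at power or cabling much sooner.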

I’m assuming the disk is actually fine. I have another copy of the data and it is just media anyway, so I don’t care much if it gets lost.