I’m new to TrueNAS and need a bit of help figuring out what is happening with a pool.
I’ve set up a server with a pool using an 8 TB Seagate IronWolf NAS drive; the drive is brand new.
Today I copied about 2.5 terabytes of data to an SMB share on that pool. All was well, but now I see the pool is marked degraded.
At first it showed a large number of errors, in the millions. I stopped the server and swapped the SATA cable for another one. After rebooting, it now shows 1,420 errors. I ran a short SMART test and it says everything is fine.
I’m not worried about the data at all, but I’m trying to figure out whether there is something wrong with the hard disk, as it is still in the return window. I’ve started a scrub, but it seems that will take some hours at least.
Does anyone have some ideas of what I can do to figure out what is going on?
First, one single drive as a Stripe? Not a good thing for ZFS, since with no redundancy it can detect errors but has nothing to repair them from, but maybe you are testing it out.
ZFS errors are not the same as SMART errors. From the CLI, run zpool status -v and see what the output is. Does it list files that are damaged? If yes, you need to delete those. Then run a scrub and make sure no more errors occur. If that works, you can run ‘zpool clear poolname’ (where poolname is the name of your pool) and that should clear things up. HOWEVER, did you test your system for stability?
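If it helps, here is a rough Python sketch of the "does it list damaged files?" check. It only parses the text that zpool status -v prints and pulls out whatever is listed under the "Permanent errors" heading. The function names and the SAMPLE text are my own illustration (the sample is hand-written, not real output from this system), and the exact layout of the status output can differ between OpenZFS versions, so treat it as a starting point rather than a finished tool.

```python
import subprocess

# Heading that `zpool status -v` prints before the list of damaged files.
MARKER = "Permanent errors have been detected in the following files:"


def damaged_files(status_text: str) -> list[str]:
    """Return the entries listed under the 'Permanent errors' heading."""
    paths = []
    in_list = False
    for line in status_text.splitlines():
        entry = line.strip()
        if MARKER in line:
            in_list = True
            continue
        if entry.startswith("pool:"):
            # A new pool section starts; the error list of the previous one ends.
            in_list = False
            continue
        if in_list and entry:
            paths.append(entry)
    return paths


def read_status(pool: str) -> str:
    """Run `zpool status -v <pool>` and return its output (needs ZFS installed)."""
    result = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hand-written, shortened sample for illustration only (assumption, not a real log).
SAMPLE = """\
  pool: ZPoolStorage
 state: DEGRADED
errors: Permanent errors have been detected in the following files:

        /mnt/ZPoolStorage/ix-applications/k3s/server/db/state.db-shm
"""

if __name__ == "__main__":
    for path in damaged_files(SAMPLE):
        print(path)
```

On a live system you would call damaged_files(read_status("poolname")) instead of using SAMPLE. An empty list after a completed scrub is what you want to see before running zpool clear.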
Yes, it is a stripe, as I am just trying it out. I have another 4 TB disk (currently in another machine) plus a new 4 TB one on the way for a mirror pool, and my plan was to get another 8 TB disk later too.
I’ll check. It listed one file when I ran zpool status -v, but I clicked away from the shell in the UI and started a scrub afterwards; now the scrub refuses to stop…
I did not test the system’s stability; my assumption was that not much could be wrong. The RAM is from my main system, which I’ve used for a couple of years now, but the motherboard and CPU are indeed used: a Supermicro board with a Xeon E3-1260v6.
Should I install Windows and run something like the AIDA64 stability test, or Prime95?
I would recommend you run Memtest86+ to test the RAM, and after that some CPU stress test like Prime95 for about 30 minutes. Some folks do it for days; I personally do not. If that works out, well, you have data on your drive already, so burning it in is out of the question. How much RAM do you have?
16 GB of RAM. What is odd is that the pool was fine for most of the day. I copied stuff in batches, 300-400 GB at a time. In between I was also checking through the TrueNAS UI: no errors. They only appeared after the last few folders I copied over.
I’ll throw a Windows install on it and see what memtest says, and also run Prime95 to see if it crashes or runs fine for ~30 min. What do you mean by burning the drive in?
Meanwhile I can see the file that has errors: /mnt/ZPoolStorage/ix-applications/k3s/server/db/state.db-shm. So it looks like some issue with the applications. I was trying to run the Plex app but it kept refusing to start up.
OK, found the issue, but I had to redo the setup and recreate the pool. I’ve been using a PSU I had lying around for troubleshooting stuff, and it turns out the hard disk was disconnecting sometimes. I moved the drive power to another cable and the problem is gone.
I already ordered a new PSU on Friday; it should be here next week.
Yes, I should have checked dmesg earlier. I only noticed some ATA reconnect errors because I had the BMC remote open and the logs were showing where the 1-9 options are.
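For anyone who lands here with the same symptoms, this is roughly how the dmesg check could be scripted. It is a minimal Python sketch that counts SATA link events per ATA port in kernel log text. The message fragments in the pattern are ones the Linux libata driver commonly prints, but the exact wording varies by kernel version, so the pattern is an assumption. The names LINK_EVENT and events_per_port are mine, and SAMPLE_LOG is hand-written for illustration, not copied from my machine.

```python
import re
from collections import Counter

# Assumed fragments of libata link messages; adjust to what your kernel prints.
LINK_EVENT = re.compile(
    r"\b(?P<port>ata\d+)(?:\.\d+)?: .*?"
    r"(?P<event>hard resetting link|SATA link down|SATA link up|exception Emask)"
)


def events_per_port(log_text: str) -> Counter:
    """Count link-related kernel messages for each ATA port (e.g. 'ata3')."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINK_EVENT.search(line)
        if match:
            counts[match.group("port")] += 1
    return counts


# Hand-written sample lines for illustration only (assumption, not a real log).
SAMPLE_LOG = """\
[ 1234.567890] ata3: SATA link down (SStatus 0 SControl 300)
[ 1240.123456] ata3: hard resetting link
[ 1241.000000] ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 1250.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
"""

if __name__ == "__main__":
    for port, count in events_per_port(SAMPLE_LOG).items():
        print(f"{port}: {count} link events")
```

On a real system you would feed it the output of dmesg instead of SAMPLE_LOG. A port that keeps resetting or dropping its link while the drive passes SMART tests points at cabling or power rather than the disk itself, which is what it turned out to be in my case.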
I’m assuming the disk is actually fine, I have another copy of the data and it is just media anyway, so I don’t care much if it gets lost.