I have a weird issue with my TrueNAS SCALE box: an HP MicroServer Gen8 with four 3TB disks (one disk is now 8TB, the problematic one).
One of the disks has been replaced twice with brand-new disks, and the new disks were also throwing errors, both read and write. Self-tests run fine. If I restart the machine, the errors go away for a few minutes or so.
Last year I replaced the disk with a new 3TB disk and the errors persisted. So I replaced that disk with yet another new one, an 8TB this time, and the issues still persist.
  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: resilvered 5.32M in 00:00:02 with 0 errors on Thu Sep 26 16:09:57 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            9bec5965-8369-11e9-81e5-941882385a30  ONLINE       0     0     0
            d2b67705-9ae5-42f4-b92b-98685460ec64  FAULTED      3     0     0  too many errors
            4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       0     0     0
            e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       0     0     0
I cannot zpool clear; it does nothing. The disks' smartctl output:
Assuming the disks are connected to the motherboard SATA ports (and not an HBA, which might be in RAID mode), I was going to suggest these same two possible causes, cable issues being the more likely.
When you say sudo zpool clear data does (literally) nothing, that seems VERY suspicious. There is literally zero output when you run the command?
Yes there is no output when zpool clear data
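For context, zpool clear is silent on success, so "no output" by itself doesn't prove it failed; the exit code is the way to tell the two cases apart. A quick sketch (nothing here is TrueNAS-specific):

```shell
# zpool clear prints nothing when it succeeds; the exit code is the
# only reliable indicator. 0 means the request went through, non-zero
# means the command itself failed.
sudo zpool clear data
echo "zpool clear exit status: $?"

# Then confirm whether the error counters were actually reset:
zpool status data
```

If the exit status is 0 but the errors reappear immediately, the clear worked and the hardware is simply re-faulting the device.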
I do not know how to check whether it's the power cable, the SATA cable, or the PSU.
The data and power cable is a proprietary combo one that is screwed onto the backplane. Power comes from one 4-pin cable from the PSU. The four SATA connections on the backplane become one combined SATA cable on the motherboard.
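One way to narrow down cable vs. drive without opening the box: SATA link problems usually show up as UDMA CRC errors in SMART, which a genuinely failing platter does not produce. A sketch (device names are hypothetical; list yours with lsblk):

```shell
# UDMA_CRC_Error_Count (SMART attribute 199) increments on SATA
# signalling problems (bad cable or backplane contact), not on media
# failures. Device names below are examples -- check lsblk for yours.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
  echo "== $dev =="
  smartctl -A "$dev" | grep -i -E 'CRC|Command_Timeout'
done

# Kernel-side symptoms of a flaky link or power drop look like
# repeated link resets in dmesg:
dmesg | grep -i -E 'ata[0-9]+.*(link|reset|SError)'
```

If attribute 199 climbs on every disk at once, that points at the shared backplane/PSU path rather than any individual drive.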
I cannot make any sense of it.
So,
I shut it down and swapped the slot of the faulty disk with another disk from the pool. Booted up, and everything is normal again. For a few minutes or hours, I guess.
Before:
Slot 1: 8TB “faulty” disk
Slot 2: 3TB ok disk
Slot 3: 3TB ok disk
Slot 4: 3TB ok disk
Now:
Slot 1: 3TB ok disk
Slot 2: 3TB ok disk
Slot 3: 3TB ok disk
Slot 4: 8TB “faulty” disk
I don’t have any errors on any of my pools, but I ran this on a pool anyway to see what the output was, and I also got literally no output. So apologies for having doubted you.
So time was all it needed once again. This is what I woke up to:
root@freenas[~]# zpool status data -v
  pool: data
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 4.38M in 00:00:03 with 0 errors on Thu Sep 26 21:45:10 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       8    30     0
            9bec5965-8369-11e9-81e5-941882385a30  ONLINE       3    16     0
            d2b67705-9ae5-42f4-b92b-98685460ec64  ONLINE       3    16     0
            4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       3    14     0
            e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       3    14     0

errors: List of errors unavailable: pool I/O is currently suspended
Either all your drives are simultaneously failing, or there's an issue with the cabling, the backplane and/or the controller.
The first option is unlikely, especially with one drive newer than the other three, and would show up in SMART reports; running a long SMART test to confirm cannot hurt.
The second option essentially implies replacing the server. Do you have another system which could host the four drives?
I am assuming the same. It's either the SATA cable, the power cable, or the PSU in general. I will probably be replacing the whole server soon.
I will boot it up and run a few smart tests now and see.
I will post later.
I replaced the PSU with a spare desktop one I got.
The server boots and the pool is imported. I got some checksum errors; I think they are from an ungraceful shutdown that corrupted the following file:
root@freenas[~]# zpool status data -v
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 4.38M in 00:00:03 with 0 errors on Thu Sep 26 21:45:10 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        data                                      ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            9bec5965-8369-11e9-81e5-941882385a30  ONLINE       0     0   132
            d2b67705-9ae5-42f4-b92b-98685460ec64  ONLINE       0     0   132
            4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       0     0   132
            e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       0     0   132

errors: Permanent errors have been detected in the following files:

        /var/db/system/netdata-7c35bc62b22f460fb3766e1c156d5c44/dbengine/journalfile-1-0000001051.njf
I stopped the netdata service and disabled it too. I am now monitoring for read/write errors and letting it run until I know the PSU is the only issue.
What can my next steps be? Should I scrub and smart test all disks?
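A scrub plus long SMART self-tests is a reasonable way to verify the pool after the hardware change. A sketch (device names are hypothetical; find yours with lsblk):

```shell
# Start a scrub -- ZFS will read back and verify every allocated block,
# which exercises all four disks under sustained load.
zpool scrub data

# Queue a long (extended) SMART self-test on each disk. These run in
# the background on the drive itself; device names are examples only.
for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
  smartctl -t long "$dev"
done

# Check progress later:
zpool status data                 # scrub progress and any new errors
smartctl -l selftest /dev/sda     # self-test log for one of the disks
```

A clean scrub under the replacement PSU would be fairly strong evidence that the old PSU was the culprit.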
So,
The server has been running OK for a few hours now (in a FrankenPSUstein state). I decided to delete the corrupted netdata journal, restart the service, and zpool clear the pool. Everything is back to normal.
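For reference, the cleanup steps above amount to roughly this (the file path comes from the earlier zpool status output; the service name and how it is managed may differ on your SCALE version):

```shell
# Remove the corrupted netdata journal that zpool status flagged
rm /var/db/system/netdata-7c35bc62b22f460fb3766e1c156d5c44/dbengine/journalfile-1-0000001051.njf

# Restart the service (service name may vary by TrueNAS SCALE release)
systemctl restart netdata

# Clear the pool's error counters and verify it comes back clean
zpool clear data
zpool status data -v
```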
I am now 90%+(growing by the hour) sure that the issue is the PSU, as most of you mentioned.
I do not have any apps started yet. I will let it run like that for a few days and start the apps later. After that, if the server is still doing fine, I will order a new PSU.
Today I am running the server with the usual amount of stress, since yesterday there were no issues. All apps are running and the disks are working. So far so good; 99% the issue is the PSU. Tomorrow I will probably order a new one. Tonight at midnight I will also run a scrub.
Hello all,
Final update. The new PSU was installed yesterday and everything is back to normal and tidy once again. I marked @Stux's answer as the solution, because it is the easiest to read on a quick search, but I really thank everyone for helping out with my problem.