ZFS device fault for pool "data" on FreeNAS

Hello there,

I have a weird issue with my TrueNAS SCALE box: an HP MicroServer Gen8 with four 3 TB disks (one disk, the problematic one, is now 8 TB).

One of the disks has been replaced twice with brand-new disks, and the replacements also throw errors, both read and write. Self-tests run fine. If I restart the machine, the errors go away for a few minutes or so.

Last year I replaced the disk with a new 3 TB disk and the errors persisted. So I replaced that disk with a new one again, this time an 8 TB disk, and the issues still persist.

  pool: data
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
	repaired.
  scan: resilvered 5.32M in 00:00:02 with 0 errors on Thu Sep 26 16:09:57 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	data                                      DEGRADED     0     0     0
	  raidz1-0                                DEGRADED     0     0     0
	    9bec5965-8369-11e9-81e5-941882385a30  ONLINE       0     0     0
	    d2b67705-9ae5-42f4-b92b-98685460ec64  FAULTED      3     0     0  too many errors
	    4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       0     0     0
	    e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       0     0     0

I cannot zpool clear it; the command does nothing. The disk’s smartctl attributes:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   200   200   021    Pre-fail  Always       -       8983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       6
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1258
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       6
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       3
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       4
194 Temperature_Celsius     0x0022   114   111   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

Thank you in advance. I am starting to believe this is either a cable issue or a power issue from the PSU.

Assuming the disks are connected to motherboard SATA ports (and not to an HBA, which might be in RAID mode), I was going to suggest the same two possible causes, with cable issues being the more likely.

When you say sudo zpool clear data does (literally) nothing, that seems VERY suspicious. There is literally zero output when you run the command?
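
For reference, you can at least confirm the command itself was accepted by checking its exit status; a quick sketch, assuming the pool is named data:

sudo zpool clear data
echo $?    # 0 means zpool accepted the clear; non-zero means it failed outright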


Yes, there is no output when I run zpool clear data.
I do not know how to check whether it’s the power cable, the SATA cable, or the PSU.
The data and power cabling is a proprietary combo that is screwed onto the backplane. Power comes from one 4-pin cable off the PSU. The four SATA connections on the backplane merge into one combined SATA connector on the motherboard.
I cannot make any sense of it.
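
In the meantime, I will try to catch the errors in the kernel log as they happen; something like this (assuming the problem disk shows up as /dev/sdd):

sudo dmesg | grep -iE 'ata|reset|link'    # look for ATA bus errors and link resets
sudo smartctl -l error /dev/sdd           # the drive's own logged errors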

I will move the “faulty” disk to another position to see what happens.

So,
I shut it down and switched the slots of the faulty disk and another disk from the pool. Booted up, and everything is normal again; I guess for a few minutes, or hours.

Before:
Slot 1: 8TB “faulty” disk
Slot 2: 3TB ok disk
Slot 3: 3TB ok disk
Slot 4: 3TB ok disk

Now:
Slot 1: 3TB ok disk
Slot 2: 3TB ok disk
Slot 3: 3TB ok disk
Slot 4: 8TB “faulty” disk

Let’s see what happens now.

I don’t have any errors on any of my pools, but I ran this on a pool anyway to see what the output was, and I also got literally no output. So apologies for having doubted you.


Have you tried clearing the errors since you did this? Did they clear?

Please post the output of sudo zpool status -v now so we can see the current status.


Hello again,

So time was all it needed once again. This is what I woke up to…

root@freenas[~]# zpool status data -v
  pool: data
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-JQ
  scan: resilvered 4.38M in 00:00:03 with 0 errors on Thu Sep 26 21:45:10 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	data                                      ONLINE       0     0     0
	  raidz1-0                                ONLINE       8    30     0
	    9bec5965-8369-11e9-81e5-941882385a30  ONLINE       3    16     0
	    d2b67705-9ae5-42f4-b92b-98685460ec64  ONLINE       3    16     0
	    4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       3    14     0
	    e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       3    14     0

errors: List of errors unavailable: pool I/O is currently suspended

I don’t know what to make of it…

Either all your drives are simultaneously failing, or there’s an issue with the cabling, the backplane, and/or the controller.
The first option is unlikely, especially with one drive newer than the other three, and it would show up in SMART reports; running a long SMART test on each drive to confirm cannot hurt (see the sketch below).
The second option essentially implies replacing the server. Do you have another system which could host the four drives?
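
A sketch of kicking the long tests off, assuming the drives enumerate as /dev/sda through /dev/sdd:

for d in /dev/sd{a,b,c,d}; do sudo smartctl -t long "$d"; done   # start long self-tests on all four drives
sudo smartctl -l selftest /dev/sda                               # read the results once a test completes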


Or the power supply is failing.


I am assuming the same: it’s either the SATA cable, the power cable, or the PSU in general. I will probably be replacing the whole server soon.
I will boot it up and run a few SMART tests now and see.
I will post later.

Thank you All!

Hello again,

I replaced the PSU with a spare desktop unit I had.
The server boots and the pool is imported. I got some checksum errors; I think they are from an ungraceful shutdown that corrupted the following file:

root@freenas[~]# zpool status data -v
  pool: data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 4.38M in 00:00:03 with 0 errors on Thu Sep 26 21:45:10 2024
config:

	NAME                                      STATE     READ WRITE CKSUM
	data                                      ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    9bec5965-8369-11e9-81e5-941882385a30  ONLINE       0     0   132
	    d2b67705-9ae5-42f4-b92b-98685460ec64  ONLINE       0     0   132
	    4cf28e25-eef7-11e8-b5aa-941882385a30  ONLINE       0     0   132
	    e0c932e6-82fc-11e9-82ab-941882385a30  ONLINE       0     0   132

errors: Permanent errors have been detected in the following files:

        /var/db/system/netdata-7c35bc62b22f460fb3766e1c156d5c44/dbengine/journalfile-1-0000001051.njf

I stopped the netdata service and disabled it too. I am now monitoring for read/write errors and letting it run until I know the PSU was the only issue.

What should my next steps be? Should I scrub and SMART-test all the disks?
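
I am thinking of something like:

sudo zpool scrub data     # kick off a full scrub of the pool
sudo zpool status data    # watch progress and any new READ/WRITE/CKSUM counts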

Thank you again so much!
George.

So,
The server has been running OK for a few hours now (in a FrankenPSUstein state). I decided to delete the corrupted netdata journal, restart the service, and zpool clear the pool. Everything is back to normal.
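
Roughly this, from memory (the netdata service name is my assumption):

sudo rm /var/db/system/netdata-7c35bc62b22f460fb3766e1c156d5c44/dbengine/journalfile-1-0000001051.njf
sudo systemctl restart netdata    # service name assumed; bring netdata back up
sudo zpool clear data             # reset the pool's error counters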

I am now 90%+ sure (and growing by the hour) that the issue is the PSU, as most of you mentioned.
I do not have any apps started yet. I will let it run like that for a few days and start the apps later. After that, if the server is still doing fine, I will order a new PSU.

Thank you very much all of you!
George.


Hurrah!!! A good result I think.


Today I am running the server under the usual amount of stress, since yesterday there were no issues. All apps are running and the disks are working. So far so good; I am 99% sure the issue is the PSU. Tomorrow I will probably order a new one, and tonight at midnight I will also run a scrub.


Well done narrowing it down to the PSU!


Hello all,
Final update: the new PSU was installed yesterday and everything is back to normal and tidy once again. I marked @Stux’s answer as the solution because it is the easiest to read on a quick search, but I really thank everyone for helping out with my problem.

Thanks again
George
