One or more devices has experienced an unrecoverable error on zpool

One of the pools in TrueNAS Core is in an unhealthy state, and it is still showing as unhealthy this morning. Should I just replace the drive?
So far I have run zpool status -v and zpool clear, scrubbed the pool, and started a long SMART self-test:

smartctl -t long /dev/da1

  pool: zvol
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 4.40M in 05:02:08 with 0 errors on Fri Jan 24 03:05:12 2025
config:

	NAME                                            STATE     READ WRITE CKSUM
	zvol1                                           ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/bd2b0b9b-4d14-11e9-9224-000c29995503  ONLINE       0     0 1.15K
	    gptid/89e5fb1d-779c-11e9-9cb3-000c29995503  ONLINE       0     0     0
	    gptid/3d1ffde8-7852-11e9-9964-000c29995503  ONLINE       0     0     0

errors: No known data errors
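
For anyone reading the status output above: the READ and WRITE columns are all 0, and the only non-zero counter is CKSUM (1.15K) on one disk. Here is a minimal Python sketch for pulling non-zero counters out of `zpool status` text. The function names are made up for illustration, and it assumes zpool abbreviates large counters in powers of 1024, so the converted number is approximate.

```python
# Assumption: zpool abbreviates counters with K/M/G/T in powers of 1024.
_SUFFIXES = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_count(text):
    """Convert a counter such as '0' or '1.15K' to an approximate integer."""
    text = text.strip()
    if text and text[-1].upper() in _SUFFIXES:
        return round(float(text[:-1]) * _SUFFIXES[text[-1].upper()])
    return int(text)


def devices_with_errors(status_text):
    """Return {name: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_table:
            # The device table starts right after the NAME/STATE header row.
            in_table = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the device table
        if len(fields) < 5:
            continue  # section labels such as 'logs' or 'cache'
        try:
            counts = tuple(parse_count(f) for f in fields[2:5])
        except ValueError:
            continue  # not a counter row
        if any(counts):
            flagged[fields[0]] = counts
    return flagged


# Device rows copied from the output posted in this thread.
SAMPLE_STATUS = """
config:

	NAME                                            STATE     READ WRITE CKSUM
	zvol1                                           ONLINE       0     0     0
	  raidz1-0                                      ONLINE       0     0     0
	    gptid/bd2b0b9b-4d14-11e9-9224-000c29995503  ONLINE       0     0 1.15K
	    gptid/89e5fb1d-779c-11e9-9cb3-000c29995503  ONLINE       0     0     0
	    gptid/3d1ffde8-7852-11e9-9964-000c29995503  ONLINE       0     0     0

errors: No known data errors
"""

if __name__ == "__main__":
    for name, (read, write, cksum) in devices_with_errors(SAMPLE_STATUS).items():
        print(f"{name}: read={read} write={write} cksum~{cksum}")
```

On the posted output this flags only the first gptid device, with roughly 1,178 checksum errors and no read or write errors.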

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 241)	Self-test routine in progress...
					10% of test remaining.
Total time to complete Offline 
data collection: 		( 4784) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 702) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x303d)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   197   196   021    Pre-fail  Always       -       9150
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       79
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   031   031   000    Old_age   Always       -       50841
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       79
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       76
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       1307
194 Temperature_Celsius     0x0022   115   105   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
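
In the SMART table above, the raw values for reallocated, pending, and uncorrectable sectors and for UDMA CRC errors are all 0. A rough Python sketch for extracting just those rows from `smartctl -a` text is below. Which attribute IDs to watch (5, 197, 198, 199) is a common rule of thumb rather than anything official, and the function name is hypothetical.

```python
import re

# One row of the "Vendor Specific SMART Attributes" table:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]+\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumption: attribute IDs commonly checked for media or link trouble.
WATCHED_IDS = {5, 197, 198, 199}


def watched_raw_values(smart_text, watched=WATCHED_IDS):
    """Return {attribute_name: raw_value} for the watched attribute IDs."""
    result = {}
    for line in smart_text.splitlines():
        match = ROW.match(line)
        if match and int(match.group(1)) in watched:
            result[match.group(2)] = int(match.group(3))
    return result


# Rows copied from the smartctl output posted in this thread.
SAMPLE_SMART = """
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   031   031   000    Old_age   Always       -       50841
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    for name, raw in watched_raw_values(SAMPLE_SMART).items():
        print(f"{name}: {raw}")
```

All four watched values come back as 0 for this drive, so the drive itself is not reporting media problems.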

Over 1,000 ZFS checksum errors?

I would check the cabling, connections, and any HBA cards, if they are involved.

I have not opened up the server for months. The drives are connected to an LSI 3008 on the latest firmware… Could a cable or the HBA be going bad?

Can’t hurt to check.

Cables, connections, HBA temperature or something else related to it, and the usual…

I reseated all the connectors and the problem went away. Thanks!
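
One way to confirm a fix like this is to reset the error counters, scrub again, and re-check the status once the scrub has finished. These are the same zpool commands mentioned in the first post. The Python sketch below only builds the command lines; the helper names are hypothetical, and the pool name is a placeholder to replace with your own.

```python
import subprocess


def verification_commands(pool):
    """Return the zpool command lines for a clear / scrub / status check."""
    return [
        ["zpool", "clear", pool],         # reset the READ/WRITE/CKSUM counters
        ["zpool", "scrub", pool],         # re-read and verify all data
        ["zpool", "status", "-v", pool],  # run again after the scrub completes
    ]


def run(cmd):
    """Run one command as root on the NAS and return its standard output."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


if __name__ == "__main__":
    # "zvol1" is a placeholder pool name; nothing is executed here.
    for cmd in verification_commands("zvol1"):
        print(" ".join(cmd))
```

The scrub runs in the background (the one shown above took about five hours), so the status check is only meaningful after it finishes. If the CKSUM column stays at 0 afterwards, the reseat most likely fixed it.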