Troubleshoot WD Red Pro drive "removed" alert

I have a stock TrueNAS Mini X+ running TrueNAS-SCALE-23.10.2 with two 4-month-old 14TB WD Red Pro drives configured as a mirror. They have been running fine for those 4 months, with regular scrubs and SMART tests passing. Midway through a cloud sync task today, while both drives were still physically inside the enclosure, I got an email alert saying:

ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 31
class: statechange
state: REMOVED
host: truenas
time: 2024-08-24 15:36:21-0400
vpath: /dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325
vguid: 0x10A1847C42757226
pool: pool-1 (0x3862E2ECDAE59FE6)

and then

    Pool pool-1 state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
    The following devices are not healthy:
        Disk WDC_WD142KFGX-68AFPN0 6AGHKUNX is REMOVED

The same DEGRADED message also appears under my current alerts.

I ran the diagnostic steps shown below. How do I figure out whether the drive is bad, a connection is loose, or something else is wrong? Thanks.

lsblk

NAME          MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda             8:0    0  12.7T  0 disk  
└─sda1          8:1    0   7.3T  0 part  
nvme0n1       259:0    0 232.9G  0 disk  
├─nvme0n1p1   259:1    0   260M  0 part  
├─nvme0n1p2   259:2    0 216.6G  0 part  
└─nvme0n1p3   259:3    0    16G  0 part  
  └─nvme0n1p3 253:0    0    16G  0 crypt [SWAP]

zpool status

 pool: boot-pool
 state: ONLINE
status: One or more features are enabled on the pool despite not being
        requested by the 'compatibility' property.
action: Consider setting 'compatibility' to an appropriate value, or
        adding needed features to the relevant file in
        /etc/zfs/compatibility.d or /usr/share/zfs/compatibility.d.
  scan: scrub repaired 0B in 00:00:13 with 0 errors on Mon Aug 19 03:45:14 2024
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p2  ONLINE       0     0     0

errors: No known data errors

  pool: pool-1
 state: DEGRADED
status: One or more devices has been removed by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0B in 04:18:12 with 0 errors on Sun Aug 11 06:18:13 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        pool-1                                    DEGRADED     0     0     0
          mirror-0                                DEGRADED     0     0     0
            1322e964-85c3-40b4-87ad-5d8f1cbdb325  REMOVED      0     0     0
            cf27888e-30ec-4012-b4ef-a864853fc485  ONLINE       0     0     0

errors: No known data errors

dmesg -H

[Aug24 15:35] ata1.00: exception Emask 0x0 SAct 0x400a3102 SErr 0x0 action 0x6 frozen
[  +0.001215] ata1.00: failed command: READ FPDMA QUEUED
[  +0.001192] ata1.00: cmd 60/08:08:08:f5:2b/00:00:36:00:00/40 tag 1 ncq dma 4096 in
                       res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.002539] ata1.00: status: { DRDY }
[  +0.001154] ata1.00: failed command: READ FPDMA QUEUED
[  +0.001111] ata1.00: cmd 60/08:40:a0:12:a9/00:00:26:02:00/40 tag 8 ncq dma 4096 in
                       res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.002326] ata1.00: status: { DRDY }
[  +0.001160] ata1.00: failed command: READ FPDMA QUEUED
[  +0.001169] ata1.00: cmd 60/00:60:70:50:87/05:00:a6:01:00/40 tag 12 ncq dma 655360 in
                       res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.002477] ata1.00: status: { DRDY }
[  +0.001238] ata1.00: failed command: READ FPDMA QUEUED
[  +0.001293] ata1.00: cmd 60/00:68:70:58:87/03:00:a6:01:00/40 tag 13 ncq dma 393216 in
                       res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.001699] ata1.00: status: { DRDY }
[  +0.000830] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000861] ata1.00: cmd 60/00:88:70:5b:87/01:00:a6:01:00/40 tag 17 ncq dma 131072 in
                       res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.001732] ata1.00: status: { DRDY }
[  +0.000873] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000873] ata1.00: cmd 60/40:98:b0:be:54/00:00:5e:02:00/40 tag 19 ncq dma 32768 in
                       res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.001821] ata1.00: status: { DRDY }
[  +0.001011] ata1.00: failed command: READ FPDMA QUEUED
[  +0.000908] ata1.00: cmd 60/00:f0:80:15:5d/08:00:6c:02:00/40 tag 30 ncq dma 1048576 in
                       res 40/00:01:06:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[  +0.001914] ata1.00: status: { DRDY }
[  +0.000962] ata1: hard resetting link
[  +5.333491] ata1: link is slow to respond, please be patient (ready=0)
[  +4.627922] ata1: SATA link down (SStatus 0 SControl 300)
[  +0.619080] ata1: hard resetting link
[  +5.364792] ata1: link is slow to respond, please be patient (ready=0)
[  +4.683928] ata1: COMRESET failed (errno=-16)
[  +0.001275] ata1: hard resetting link
[  +4.082624] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.002146] ata1.00: revalidation failed (errno=-2)
[  +5.077768] ata1: hard resetting link
[  +0.318217] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.002313] ata1.00: revalidation failed (errno=-2)
[  +0.000698] ata1.00: disable device
[  +0.006788] sd 0:0:0:0: rejecting I/O to offline device
[  +0.000707] I/O error, dev sdb, sector 9378152200 op 0x0:(READ) flags 0x0 phys_seg 8 prio class 2
[  +0.000730] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=4801611829248 size=462848 flags=1573248
[  +0.001515] I/O error, dev sdb, sector 908887112 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.000059] I/O error, dev sdb, sector 778071744 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.000747] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=465348104192 size=4096 flags=1573248
[  +0.000852] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=398370635776 size=4096 flags=1573248
[  +0.001702] I/O error, dev sdb, sector 4624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.002618] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=270336 size=8192 flags=721089
[  +0.001966] I/O error, dev sdb, sector 15628055568 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.001513] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=8001562353664 size=8192 flags=721089
[  +0.002353] I/O error, dev sdb, sector 15628056080 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.001105] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=8001562615808 size=8192 flags=721089
[  +0.696124] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  +0.003413] sd 0:0:0:0: [sdb] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=60s
[  +0.002118] sd 0:0:0:0: [sdb] tag#1 Sense Key : Not Ready [current] 
[  +0.002027] sd 0:0:0:0: [sdb] tag#1 Add. Sense: Logical unit not ready, hard reset required
[  +0.002118] sd 0:0:0:0: [sdb] tag#1 CDB: Read(16) 88 00 00 00 00 00 36 2b f5 08 00 00 00 08 00 00
[  +0.002075] I/O error, dev sdb, sector 908850440 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.002151] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=465329328128 size=4096 flags=1573248
[  +0.004538] sd 0:0:0:0: [sdb] tag#8 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=63s
[  +0.002358] sd 0:0:0:0: [sdb] tag#8 Sense Key : Not Ready [current] 
[  +0.002270] sd 0:0:0:0: [sdb] tag#8 Add. Sense: Logical unit not ready, hard reset required
[  +0.002334] sd 0:0:0:0: [sdb] tag#8 CDB: Read(16) 88 00 00 00 00 02 26 a9 12 a0 00 00 00 08 00 00
[  +0.002307] I/O error, dev sdb, sector 9238549152 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
[  +0.002382] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=4730135068672 size=4096 flags=1573248
[  +0.004844] sd 0:0:0:0: [sdb] tag#12 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=87s
[  +0.002550] sd 0:0:0:0: [sdb] tag#12 Sense Key : Not Ready [current] 
[  +0.002555] sd 0:0:0:0: [sdb] tag#12 Add. Sense: Logical unit not ready, hard reset required
[  +0.002530] sd 0:0:0:0: [sdb] tag#12 CDB: Read(16) 88 00 00 00 00 01 a6 87 50 70 00 00 05 00 00 00
[  +0.002576] I/O error, dev sdb, sector 7088853104 op 0x0:(READ) flags 0x0 phys_seg 10 prio class 2
[  +0.002590] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=3629490692096 size=655360 flags=1074267264
[  +0.005365] sd 0:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=87s
[  +0.001670] sd 0:0:0:0: [sdb] tag#13 Sense Key : Not Ready [current] 
[  +0.001554] sd 0:0:0:0: [sdb] tag#13 Add. Sense: Logical unit not ready, hard reset required
[  +0.001523] sd 0:0:0:0: [sdb] tag#13 CDB: Read(16) 88 00 00 00 00 01 a6 87 58 70 00 00 03 00 00 00
[  +0.001509] I/O error, dev sdb, sector 7088855152 op 0x0:(READ) flags 0x0 phys_seg 6 prio class 2
[  +0.001564] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=3629491740672 size=393216 flags=1074267264
[  +0.003268] sd 0:0:0:0: [sdb] tag#17 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=87s
[  +0.001590] sd 0:0:0:0: [sdb] tag#17 Sense Key : Not Ready [current] 
[  +0.001572] sd 0:0:0:0: [sdb] tag#17 Add. Sense: Logical unit not ready, hard reset required
[  +0.001585] sd 0:0:0:0: [sdb] tag#17 CDB: Read(16) 88 00 00 00 00 01 a6 87 5b 70 00 00 01 00 00 00
[  +0.001544] zio pool=pool-1 vdev=/dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325 error=5 type=1 offset=3629492133888 size=131072 flags=1573248

If the drive is still currently detected, what is the result of:
smartctl -a /dev/sdb
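
You can also confirm which physical disk ata1 maps to, and which pool member it is (assuming the sdb node from your dmesg output still exists; with the device disabled these may return nothing until it reappears):

# The /sys/block symlink target includes the ataN port the disk hangs off of;
# lsblk can then match the model/serial to the partuuid that zpool reports.
ls -l /sys/block/sdb
lsblk -o NAME,MODEL,SERIAL,PARTUUID /dev/sdb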

But yeah, it could be an issue with a SATA port and/or cable; it shouldn't hurt to replace the cable and port if possible. I've had that happen. I've also had several drives randomly die (including DOA) and go back for RMA within warranty, so sometimes things break sooner than you'd hope.

The errors really do seem to point to a link issue between the controller and the drive, but that doesn't guarantee it isn't the drive itself.
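
If the drive shows up again, one quick data point is SMART attribute 199 (UDMA_CRC_Error_Count): interface CRC errors tend to implicate the cable or backplane connector rather than the drive itself. Something like:

# A non-zero or climbing raw value here usually points at cabling, not platters.
# (The device node is a guess; it may enumerate differently after a reset.)
smartctl -A /dev/sdb | grep -i -E 'crc|199'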

If you did a burn-in on the drive before deployment, then this is surprising, but not impossible. If you can, try to back up pool-1 just in case; see the sketch below.
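
If you have somewhere local to send it, even a quick recursive snapshot plus a send is better than nothing; backup-pool below is just a placeholder for whatever destination you actually have:

# Snapshot everything in pool-1 and replicate it elsewhere.
# "backup-pool/pool-1" is a hypothetical destination dataset.
zfs snapshot -r pool-1@pre-repair
zfs send -R pool-1@pre-repair | zfs recv -u backup-pool/pool-1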

Thanks for the quick reply. I want to run a SMART test on the removed drive, but the problem is that the system doesn't even recognize there's a drive in that bay, even though the drive is physically in it. This Mini NAS is only 10 months old. I'm in the middle of backing up the entire pool to Backblaze and will wait until that's done before starting the physical part of the troubleshooting.

Good choice on finishing the backup.

I mean, if it is the bay itself, I'm sure iX will do the needful (I have no idea what their warranty or customer service is like; I'm just making assumptions here).

After the Backblaze cloud sync finished, I shut down the NAS, took the affected drive out of the bay, reseated it, and powered the NAS back on. The NAS recognized the drive again, and I immediately received these emails:

Pool pool-1 state is ONLINE: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.

and then

ZFS has finished a resilver:

eid: 21
class: resilver_finish
host: truenas
time: 2024-08-27 11:56:13-0400
pool: pool-1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: resilvered 16.4G in 00:02:26 with 0 errors on Tue Aug 27 11:56:13 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        pool-1                                    ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            1322e964-85c3-40b4-87ad-5d8f1cbdb325  ONLINE       0     0     2
            cf27888e-30ec-4012-b4ef-a864853fc485  ONLINE       0     0     0

errors: No known data errors

I then ran a long SMART test on the drive, which passed. This is the output of smartctl -a /dev/sda:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.74-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD142KFGX-68AFPN0
Serial Number:    6AGHKUNX
LU WWN Device Id: 5 000cca 2dec7147d
Firmware Version: 83.00A83
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 published, ANSI INCITS 529-2018
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Aug 29 01:50:08 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (  101) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1448) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   001    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0004   147   147   054    Old_age   Offline      -       52
  3 Spin_Up_Time            0x0007   094   094   001    Pre-fail  Always       -       275
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   001    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000a   100   100   001    Old_age   Always       -       0
  8 Seek_Time_Performance   0x0004   140   140   020    Old_age   Offline      -       15
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1736
 10 Spin_Retry_Count        0x0012   100   100   001    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       4
 22 Unknown_Attribute       0x0023   100   100   025    Pre-fail  Always       -       6553700
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       73
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       73
194 Temperature_Celsius     0x0002   055   055   000    Old_age   Always       -       39 (Min/Max 24/44)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1723         -
# 2  Extended offline    Completed without error       00%      1469         -
# 3  Extended offline    Completed without error       00%      1301         -
# 4  Extended offline    Completed without error       00%      1133         -
# 5  Extended offline    Completed without error       00%       971         -
# 6  Extended offline    Completed without error       00%       853         -
# 7  Extended offline    Completed without error       00%       702         -
# 8  Extended offline    Completed without error       00%       534         -
# 9  Extended offline    Completed without error       00%       367         -
#10  Extended offline    Completed without error       00%       198         -
#11  Extended offline    Completed without error       00%        30         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

I'm not sure what went wrong to trigger that cascade of problems. I've read through this guide, and maybe a data cable got knocked loose somehow? How do I get TrueNAS to recognize the pool as healthy again? Thanks.

The command should be:
zpool clear pool-1
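
After clearing, it's worth running a scrub and re-checking status to make sure the CKSUM counter stays at zero:

zpool scrub pool-1       # cheap peace of mind after a resilver
zpool status -v pool-1   # all counters should read 0 when the scrub finishes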

Just to double-check: did you verify that your 1322e964-85c3-40b4-87ad-5d8f1cbdb325 drive (the one that showed errors) was in fact the drive you ran smartctl on after the reboot?

Device names (ada, sda, etc.) can and do get reassigned across reboots; the partuuid is the stable identifier here.
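
Since the partuuid is what zpool status reports, a quick check is to resolve it and compare serial numbers:

# The by-partuuid symlink resolves to the partition (e.g. ../../sda1);
# the disk itself is that node minus the partition number.
ls -l /dev/disk/by-partuuid/1322e964-85c3-40b4-87ad-5d8f1cbdb325
smartctl -i /dev/sda | grep -i serial   # should match 6AGHKUNX from the alert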