Unremovable I/O failure and can't do anything with the pool

So I have a striped pool…non-critical data, have backup anyway, so doesn’t matter what happens to it.
The problems started yesterday with Read errors on the pool and then Checksum errors. I first assumed it was a bad cable since I’ve had that previously including with this same drive (in hindsight it might not have been the cable, but haven’t had errors since replacing it). But this time the 199 UDMA_CRC_Error_Count and 188 Command_Timeout did not move at all.
Swapped the cable while the drive was suspended and immediately got more errors after doing zpool clear. So I tried clearing it again and the terminal got stuck on that for minutes. No response, no interrupt with CTRL+C, nothing. New SSH sessions ran fine unless I tried to clear it again.
I had no idea what was happening with this drive and said fuck it I’ll try exporting the pool and see what happens and that got stuck on exporting too. Rebooted the whole system and now it’s exported but unable to import with this error:
[EZFS_IO] Failed to import 'acapoolco' pool: cannot import 'acapoolco' as 'acapoolco': I/O error
Can anyone shed some light on what is happening with this drive? Again, the data is not important, so I might do zpool import -f just to see what data is damaged and maybe figure out what is wrong. Here’s the SMART data on the drive. It’s directly connected to an X10SLL-F motherboard, and I’ve run a short and a conveyance test after all this happened.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Exos X16
Device Model:     ST16000NM001G-2KK103
Serial Number:    ZL2KXYAN
LU WWN Device Id: 5 000c50 0dc324b53
Firmware Version: SN03
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5625
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Jan 25 17:15:23 2025 EET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1410) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       212503035
  3 Spin_Up_Time            0x0003   096   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   087   087   020    Old_age   Always       -       13487
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   088   060   045    Pre-fail  Always       -       672513947
  9 Power_On_Hours          0x0032   070   070   000    Old_age   Always       -       27008
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   089   089   020    Old_age   Always       -       12212
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   001   000    Old_age   Always       -       8706
190 Airflow_Temperature_Cel 0x0022   064   049   040    Old_age   Always       -       36 (Min/Max 34/36)
192 Power-Off_Retract_Count 0x0032   094   094   000    Old_age   Always       -       12150
193 Load_Cycle_Count        0x0032   037   037   000    Old_age   Always       -       126339
194 Temperature_Celsius     0x0022   036   051   000    Old_age   Always       -       36 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       99
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       9515h+04m+41.350s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       113511989286
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       798428513671

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Conveyance offline  Completed without error       00%     27008         -
# 2  Short offline       Completed without error       00%     27008         -
# 3  Extended offline    Interrupted (host reset)      00%     26900         -
# 4  Short offline       Completed without error       00%     26858         -
# 5  Short offline       Completed without error       00%     26690         -
# 6  Extended offline    Completed without error       00%     26588         -
# 7  Short offline       Completed without error       00%     26522         -
# 8  Short offline       Completed without error       00%     26354         -
# 9  Short offline       Completed without error       00%     26186         -
#10  Extended offline    Completed without error       00%     26174         -
#11  Short offline       Completed without error       00%     26018         -
#12  Short offline       Completed without error       00%     25850         -
#13  Extended offline    Completed without error       00%     25838         -
#14  Short offline       Completed without error       00%     25682         -
#15  Short offline       Completed without error       00%     25514         -
#16  Extended offline    Completed without error       00%     25454         -
#17  Short offline       Completed without error       00%     25346         -
#18  Short offline       Completed without error       00%     25178         -
#19  Extended offline    Completed without error       00%     25119         -
#20  Short offline       Completed without error       00%     25010         -
#21  Short offline       Completed without error       00%     24842         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more
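
For anyone who wants to check whether counters such as 188 Command_Timeout or 199 UDMA_CRC_Error_Count moved between two readings, here is a rough sketch, not anything official. It assumes the output of smartctl -A has been saved to text files; smart_raw is a made-up helper name, and /dev/sdX and the file names in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: print ID, name and raw value of selected SMART attributes from a
# saved `smartctl -A` output, so two readings can be compared.
# smart_raw is a hypothetical helper, not part of smartmontools.

# Usage: smart_raw FILE ID...
smart_raw() {
    local file=$1
    shift
    local id
    for id in "$@"; do
        # Attribute rows start with the numeric ID and have at least 10
        # columns; column 10 is the first token of RAW_VALUE.
        awk -v id="$id" 'NF >= 10 && $1 == id { print $1, $2, $10 }' "$file"
    done
}

# Example (placeholders, run as root):
#   smartctl -A /dev/sdX > before.txt
#   ... clear the pool errors, wait for new ones ...
#   smartctl -A /dev/sdX > after.txt
#   diff <(smart_raw before.txt 188 199) <(smart_raw after.txt 188 199)
```

If diff prints nothing, neither counter moved between the two readings, which is what was observed here.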

Please post the output from the following commands (with the output of each command in a separate </> box):

  • lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
  • lspci
  • sudo sas2flash -list
  • sudo sas3flash -list
  • sudo zpool status -v
  • sudo zpool import

Sure, here it goes:

  • lsblk:
root@treehouse:~ # lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME   MODEL                     ROTA PTTYPE TYPE    START           SIZE PARTTYPENAME             PARTUUID
sda    HGST HDN726040ALE614         1 gpt    disk           4000787030016
└─sda2                              1 gpt    part  4194688  3998639332864 Solaris /usr & Apple ZFS 21508070-4ec5-485f-9108-18f9a69e06f7
sdb    ST4000VN008-2DR166           1 gpt    disk           4000787030016
└─sdb1                              1 gpt    part     2048  4000785104896 Solaris /usr & Apple ZFS fcae35d9-337b-42d4-974f-3cce92bfc0ee
sdc    Samsung SSD 870 EVO 500GB    0 gpt    disk            500107862016
└─sdc1                              0 gpt    part     4096   500105740800 Solaris /usr & Apple ZFS 476b279b-d014-4f23-bfd4-badd528001ab
sdd    Samsung SSD 870 EVO 500GB    0 gpt    disk            500107862016
└─sdd1                              0 gpt    part     4096   499034096128 Solaris /usr & Apple ZFS 329858ba-ba34-459b-9981-0e074abf2bd8
sde    WDC WD120EDAZ-11F3RA0        1 gpt    disk          12000138625024
└─sde2                              1 gpt    part  4194560 11997990993408 Solaris /usr & Apple ZFS 1ba4afba-2254-4569-9632-530dd36087ac
sdf    HGST HDN726040ALE614         1 gpt    disk           4000787030016
└─sdf2                              1 gpt    part  4194560  3998639398400 Solaris /usr & Apple ZFS a89a8948-8cd8-4101-a2b8-c1976e0ab883
sdg    WDC WD120EDAZ-11F3RA0        1 gpt    disk          12000138625024
└─sdg2                              1 gpt    part  4194432 11997991055360 FreeBSD ZFS              8f906718-32ab-11ec-804a-0cc47a91fff1
sdh    WDC WD120EDAZ-11F3RA0        1 gpt    disk          12000138625024
└─sdh2                              1 gpt    part  4194432 11997991055360 FreeBSD ZFS              8fa502ca-32ab-11ec-804a-0cc47a91fff1
sdi    Samsung SSD 850 PRO 128GB    0 gpt    disk            128035676160
├─sdi1                              0 gpt    part     4096        1048576 BIOS boot                07e2c959-10bc-43e1-83e6-8c5f0a8b48b2
├─sdi2                              0 gpt    part     6144      536870912 EFI System               909fd1bd-6a8c-47f5-af75-2d10f1682aaa
├─sdi3                              0 gpt    part 34609152   110315773440 Solaris /usr & Apple ZFS 742397b5-0553-4ce9-b4a1-15e97a4a26c8
└─sdi4                              0 gpt    part  1054720    17179869184 Linux swap               bf7226c0-a1cd-4c08-ac6c-9ce2531fa8e3
sdj    ST16000NM001G-2KK103         1 gpt    disk          16000900661248
└─sdj2                              1 gpt    part  2097280 15999826836992 Solaris /usr & Apple ZFS 3ac32ec4-ae9c-4bcc-9710-2cdefdefe12c
sdk    HGST HDN726040ALE614         1 gpt    disk           4000787030016
└─sdk2                              1 gpt    part  4194816  3998639267328 Solaris /usr & Apple ZFS dad24db7-5496-48ee-931d-3e3b335e5aac
zd0                                 0 gpt    disk            161061273600
zd16                                0 gpt    disk             16213508096
zd32                                0 gpt    disk             21474836480
zd48                                0 gpt    disk             64424525824
zd64                                0 gpt    disk             16213508096
zd80 
  • lspci
root@treehouse:~ # lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v3 Processor DRAM Controller (rev 06)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x16 Controller (rev 06)
00:01.1 PCI bridge: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor PCI Express x8 Controller (rev 06)
00:14.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 05)
00:1a.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #1 (rev d5)
00:1c.1 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #2 (rev d5)
00:1c.4 PCI bridge: Intel Corporation 8 Series/C220 Series Chipset Family PCI Express Root Port #5 (rev d5)
00:1d.0 USB controller: Intel Corporation 8 Series/C220 Series Chipset Family USB EHCI #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation C222 Series Chipset Family Server Essential SKU LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 8 Series/C220 Series Chipset Family 6-port SATA Controller 1 [AHCI mode] (rev 05)
00:1f.3 SMBus: Intel Corporation 8 Series/C220 Series Chipset Family SMBus Controller (rev 05)
00:1f.6 Signal processing controller: Intel Corporation 8 Series Chipset Family Thermal Management Controller (rev 05)
02:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2308 PCI-Express Fusion-MPT SAS-2 (rev 05)
03:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
04:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
05:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03)
06:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
  • sudo sas2flash -list
root@treehouse:~ # sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2308_2(D1)

        Controller Number              : 0
        Controller                     : SAS2308_2(D1)
        PCI Address                    : 00:02:00:00
        SAS Address                    : 500605b-0-08f9-b140
        NVDATA Version (Default)       : 14.01.00.06
        NVDATA Version (Persistent)    : 14.01.00.06
        Firmware Product ID            : 0x2214 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9207-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9207-8i
        Board Assembly                 : H5-25412-00C
        Board Tracer Number            : SV42445368

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
  • sas3flash -list
root@treehouse:~ # sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
  • zpool status -v
root@treehouse:~ # zpool status -v
  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:47 with 0 errors on Sat Jan 25 03:45:48 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdi3      ONLINE       0     0     0

errors: No known data errors

  pool: content_pool
 state: ONLINE
  scan: scrub repaired 0B in 19:07:58 with 0 errors on Wed Jan 15 20:08:08 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        content_pool                              ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            8f906718-32ab-11ec-804a-0cc47a91fff1  ONLINE       0     0     0
            1ba4afba-2254-4569-9632-530dd36087ac  ONLINE       0     0     0
            8fa502ca-32ab-11ec-804a-0cc47a91fff1  ONLINE       0     0     0

errors: No known data errors

  pool: evolution_pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:12:04 with 0 errors on Sun Jan 19 23:12:08 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        evolution_pool                            ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            329858ba-ba34-459b-9981-0e074abf2bd8  ONLINE       0     0     0
            476b279b-d014-4f23-bfd4-badd528001ab  ONLINE       0     0     0

errors: No known data errors

  pool: spawning_pool
 state: ONLINE
  scan: scrub repaired 0B in 04:32:04 with 0 errors on Tue Jan 21 23:48:25 2025
expand: expanded raidz1-0 copied 8.92T in 1 days 15:22:22, on Mon Nov 25 19:18:32 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        spawning_pool                             ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            21508070-4ec5-485f-9108-18f9a69e06f7  ONLINE       0     0     0
            dad24db7-5496-48ee-931d-3e3b335e5aac  ONLINE       0     0     0
            a89a8948-8cd8-4101-a2b8-c1976e0ab883  ONLINE       0     0     0
            fcae35d9-337b-42d4-974f-3cce92bfc0ee  ONLINE       0     0     0

errors: No known data errors
  • zpool import
root@treehouse:~ # zpool import
  pool: acapoolco
    id: 4749044951322641739
 state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

        acapoolco                               ONLINE
          3ac32ec4-ae9c-4bcc-9710-2cdefdefe12c  ONLINE
  • zpool import acapoolco
root@treehouse:~ # zpool import acapoolco
cannot import 'acapoolco': I/O error
        Recovery is possible, but will result in some data loss.
        Returning the pool to its state as of Sat Jan 25 06:37:36 2025
        should correct the problem.  Approximately 15 minutes of data
        must be discarded, irreversibly.  Recovery can be attempted
        by executing 'zpool import -F acapoolco'.  A scrub of the pool
        is strongly recommended after recovery.

Of these, only the boot-pool and the failed acapoolco are connected to the MB; everything else is connected via the HBA.

The output looks pretty good. The HBA is in IT mode, the other pools are all healthy. The acapoolco pool seems to have a problem, perhaps caused by the cable, but the zpool import seems to give clear instructions (from the zfs developers) on how to proceed.

Other people may have more expertise, and you may want to wait to see if anyone else can add to my advice, but it seems to me that you should try the following, stopping if you get any form of error so we can advise further, and posting the output from each command here:

  • sudo zpool import -F acapoolco
  • sudo zpool scrub acapoolco (or if the pool is now shown as available in the UI, then it might be better to run a scrub from there).
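
If you would like to see what the recovery would do before committing to it, zpool import also accepts -n together with -F, which only checks whether the pool can be made importable again. A minimal sketch of the two steps above with that preview added; rewind_import is a hypothetical helper name, and the ZPOOL variable is only there so the function can be pointed at a stub for a dry run.

```shell
#!/usr/bin/env bash
# Sketch: preview a recovery (rewind) import, then optionally commit to it.
# rewind_import is a hypothetical helper; ZPOOL may be overridden with a stub.
ZPOOL=${ZPOOL:-zpool}

# Usage: rewind_import POOL [commit]
rewind_import() {
    local pool=$1 mode=${2:-preview}
    if [ "$mode" = "commit" ]; then
        # Discards the most recent transactions, as the import message
        # warns, then starts a scrub if the import succeeded.
        "$ZPOOL" import -F "$pool" && "$ZPOOL" scrub "$pool"
    else
        # -F -n only checks whether the rewind would work; nothing is changed.
        "$ZPOOL" import -F -n "$pool"
    fi
}

# Example (run as root):
#   rewind_import acapoolco          # preview only
#   rewind_import acapoolco commit   # actual recovery plus scrub
```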

Thank you for the help. I ran zpool import -F acapoolco, but it came back with an error:

root@treehouse:~ # zpool import -F acapoolco
cannot mount '/acapoolco': failed to create mountpoint: Read-only file system
Import was successful, but unable to mount some datasets

It shows as healthy and fine in zpool status -v, but in the UI it only appears under Datasets and complains about [ENOENT] Path /acapoolco not found.
It’s also missing in ls -al /mnt/.

That said I started a scrub, so far without errors, and no new errors on smart data. I plan to restore a previous snapshot from backup if the drive appears to be fine, but will there be any way to view what files were damaged before that?

If you absolutely must import a pool in the shell, you would use something like:
zpool import -R /mnt <name of pool>
… to specify the proper mount point. Adding any additional flags required if you need to force import for some reason.

Since the GUI does not necessarily know what you’re doing in the shell, after a successful import you then likely need to export the pool and reimport it using the GUI, to make sure they are in sync.

In your case you would need to start off by exporting the pool since you can’t reimport a pool that is already imported.
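
Roughly, the shell side of that could look like the sketch below. It is only an illustration under assumptions: shell_import is a made-up helper name, /mnt is the altroot mentioned above, and the ZPOOL variable exists purely so the function can be tried against a stub.

```shell
#!/usr/bin/env bash
# Sketch: import a pool from the shell under an alternate root.
# shell_import is a hypothetical helper; ZPOOL may be overridden with a stub.
ZPOOL=${ZPOOL:-zpool}

# Usage: shell_import POOL [ALTROOT]
shell_import() {
    local pool=$1 altroot=${2:-/mnt}
    # A pool that is already imported has to be exported first.
    if "$ZPOOL" list -H -o name "$pool" >/dev/null 2>&1; then
        "$ZPOOL" export "$pool" || return 1
    fi
    # -R sets the altroot, so datasets mount below it instead of under /.
    "$ZPOOL" import -R "$altroot" "$pool"
}

# Example (run as root):
#   shell_import acapoolco
# When done in the shell, export again and reimport from the GUI:
#   zpool export acapoolco
```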

Exported and reimported from the GUI and everything is fine.

The final verdict, for posterity: likely a problem on the MB side, because another drive plugged into a neighboring SATA port was restricted to SATA 1 speeds according to dmesg. No corrupted files on the drive according to the scrub; just some of the files used by an app running on that pool were rolled back a few hours. That part was done automatically by the recovery import (zpool import -F), not by me rolling back a snapshot.
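
In case it helps someone else, this is roughly how the negotiated link speeds can be pulled out of dmesg. A small sketch, assuming the usual libata "SATA link up N Gbps" kernel messages; slow_sata_links is a made-up helper name and the 6.0 default is just my choice of threshold.

```shell
#!/usr/bin/env bash
# Sketch: read kernel log lines on stdin and print the SATA links that
# negotiated below a wanted speed in Gbps (default 6.0).
# slow_sata_links is a hypothetical helper name.

# Usage: dmesg | slow_sata_links [MIN_GBPS]
slow_sata_links() {
    awk -v want="${1:-6.0}" '
        /SATA link up/ {
            # The speed is the field right after the word "up".
            for (i = 1; i < NF; i++)
                if ($i == "up" && $(i + 1) + 0 < want + 0)
                    print
        }'
}

# Example (run as root):
#   dmesg | slow_sata_links
```

Keep in mind that a 3.0 Gbps link is not necessarily a fault: the port itself may only be SATA II, so check the board manual before blaming the cable or the drive.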