SCSI write errors causing pool to fail after updating TrueNAS

Rubydesic · May 4, 2024, 10:01pm

After updating TrueNAS from 22.12.2 to 23.10.2, my pool started failing, with 6 of my 8 drives showing write errors. The six failing drives are all the same model, NetApp X315_SMEGA04TA07; the two that are not failing are NetApp X477_HMKPX04TA07. I can temporarily fix it by rebooting the server and then running zpool clear.

This is what zpool status shows:

admin@rubennas[~]$ sudo zpool status RubenPool -v
  pool: RubenPool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 816K in 00:00:02 with 0 errors on Wed Apr 24 15:35:14 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        RubenPool                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     1     0
            52097ba9-3451-4161-ab2f-d88f393fd749  ONLINE       0     1     0
            beefb0bd-8234-4cdf-a82f-74f6b1d1c08b  ONLINE       0     1     0
            8553513c-b1c9-43dd-bd27-1758fe121dc8  ONLINE       0     0     0
            0c63b7a7-5216-44a3-9eb8-8dbc70585241  ONLINE       0     1     0
            6a3d05d1-a0aa-43f0-ad22-fcbcf952cfd9  ONLINE       0     1     0
            f5f2304d-a14d-46d0-965a-30d7e63b4727  ONLINE       0     0     0
            c53dfa6b-33e7-47f0-8d81-f70d8ced485e  ONLINE       0     1     0
            ab7d473d-783a-4cab-87de-fe34f25dc64a  ONLINE       0     1     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x15dab>
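
For anyone who wants to tally the affected devices from this output rather than by eye, here is a minimal Python sketch. It assumes the leaf devices are named by partition UUID, as they are in this pool, and that the counters are plain integers (zpool status abbreviates large counts unless run with -p). The function name and the regular expression are illustrative only.

```python
import re

# Device table copied from the zpool status output above.
STATUS_TABLE = """\
        NAME                                      STATE     READ WRITE CKSUM
        RubenPool                                 ONLINE       0     0     0
          raidz2-0                                ONLINE       0     1     0
            52097ba9-3451-4161-ab2f-d88f393fd749  ONLINE       0     1     0
            beefb0bd-8234-4cdf-a82f-74f6b1d1c08b  ONLINE       0     1     0
            8553513c-b1c9-43dd-bd27-1758fe121dc8  ONLINE       0     0     0
            0c63b7a7-5216-44a3-9eb8-8dbc70585241  ONLINE       0     1     0
            6a3d05d1-a0aa-43f0-ad22-fcbcf952cfd9  ONLINE       0     1     0
            f5f2304d-a14d-46d0-965a-30d7e63b4727  ONLINE       0     0     0
            c53dfa6b-33e7-47f0-8d81-f70d8ced485e  ONLINE       0     1     0
            ab7d473d-783a-4cab-87de-fe34f25dc64a  ONLINE       0     1     0
"""

# Assumption: leaf vdevs are named by partition UUID, as in this pool.
PARTUUID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}")


def leaf_write_errors(table: str) -> dict:
    """Map each leaf device name to its WRITE error count."""
    counts = {}
    for line in table.splitlines():
        fields = line.split()
        # Rows are: NAME STATE READ WRITE CKSUM
        if len(fields) == 5 and PARTUUID.fullmatch(fields[0]):
            counts[fields[0]] = int(fields[3])
    return counts


counts = leaf_write_errors(STATUS_TABLE)
failing = [name for name, n in counts.items() if n > 0]
print(f"{len(failing)} of {len(counts)} leaf devices report write errors")
```

On the table above this reports 6 of 8 leaf devices, which matches the drive count given in the post.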

Entries like this are showing up in dmesg, which seems to indicate that my drives are rejecting the SCSI commands that are sent to them?

[  +0.004639] zio pool=RubenPool vdev=/dev/disk/by-partuuid/52097ba9-3451-4161-ab2f-d88f393fd749 error=121 type=2 offset=1799616221184 size=12288 flags=1605761
[  +0.078204] scsi_io_completion_action: 233 callbacks suppressed
[  +0.000009] sd 0:0:5:0: [sdb] tag#857 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  +0.004591] sd 0:0:5:0: [sdb] tag#857 Sense Key : Illegal Request [current] [descriptor] 
[  +0.000006] sd 0:0:5:0: [sdb] tag#857 Add. Sense: Invalid field in cdb
[  +0.000004] sd 0:0:5:0: [sdb] tag#857 CDB: Write(16) 8a 00 00 00 00 00 d1 c0 be a8 00 00 00 18 00 00
[  +0.000002] blk_print_req_error: 233 callbacks suppressed
[  +0.000002] critical target error, dev sdb, sector 3519069864 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  +0.000007] zio pool=RubenPool vdev=/dev/disk/by-partuuid/0c63b7a7-5216-44a3-9eb8-8dbc70585241 error=121 type=2 offset=1799616221184 size=12288 flags=1572992
[  +0.015465] sd 0:0:3:0: [sdg] tag#852 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  +0.000005] sd 0:0:3:0: [sdg] tag#852 Sense Key : Illegal Request [current] [descriptor] 
[  +0.000003] sd 0:0:3:0: [sdg] tag#852 Add. Sense: Invalid field in cdb
[  +0.000003] sd 0:0:3:0: [sdg] tag#852 CDB: Write(16) 8a 00 00 00 00 00 d1 c0 be a0 00 00 00 18 00 00
[  +0.000002] critical target error, dev sdg, sector 3519069856 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[  +0.000006] zio pool=RubenPool vdev=/dev/disk/by-partuuid/c53dfa6b-33e7-47f0-8d81-f70d8ced485e error=121 type=2 offset=1799616217088 size=12288 flags=1572992
[  +0.000085] sd 0:0:4:0: [sdd] tag#854 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
[  +0.016791] sd 0:0:4:0: [sdd] tag#854 Sense Key : Illegal Request [current] [descriptor] 
[  +0.000004] sd 0:0:4:0: [sdd] tag#854 Add. Sense: Invalid field in cdb
[  +0.000003] sd 0:0:4:0: [sdd] tag#854 CDB: Write(16) 8a 00 00 00 00 00 d1 c0 be a8 00 00 00 18 00 00
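
As a sanity check on the log above, the rejected CDB can be decoded by hand. The sketch below follows the standard WRITE(16) layout (opcode 0x8a, flag bits in byte 1, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13). The 512-byte logical block size is an assumption, and the function name is made up for illustration.

```python
def decode_write16(cdb_hex: str, block_size: int = 512) -> dict:
    """Decode the main fields of a SCSI WRITE(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    blocks = int.from_bytes(cdb[10:14], "big")
    return {
        "lba": int.from_bytes(cdb[2:10], "big"),
        "blocks": blocks,
        # block_size is an assumption (512-byte logical sectors)
        "bytes": blocks * block_size,
        "wrprotect": cdb[1] >> 5,
        "dpo": bool(cdb[1] & 0x10),
        "fua": bool(cdb[1] & 0x08),
    }


# CDB copied from the [sdb] tag#857 entry in the dmesg output above.
info = decode_write16("8a 00 00 00 00 00 d1 c0 be a8 00 00 00 18 00 00")
print(info)
```

The decoded LBA is 3519069864, the same sector the kernel reports for sdb, and 24 blocks of 512 bytes is 12288 bytes, which agrees with size=12288 in the zio lines. Byte 1 is zero, so no WRPROTECT, DPO, or FUA bits are set. This only confirms the log is self-consistent; it does not show which field the drive objected to.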

joeschmuck · May 4, 2024, 10:59pm

I recommend you roll back to your previous TrueNAS version of 22.12.2 and verify the problem clears.

I also would not run zpool clear yet. You have data errors, so fix those first and run another scrub; once the scrub reports no data errors, you can clear the WRITE errors.

Protopia · May 5, 2024, 8:40am

I would go a little further than @joeschmuck and say be very careful what you do in order not to lose your data. If you are not absolutely 100% certain about what you are doing, make sure that you get advice on what you are thinking of doing before you actually do it.

Rubydesic · May 5, 2024, 9:02pm

Thanks for the advice, I’ll revert the OS version soon and report back. To be clear, since I realize I wasn’t in my original post: the two drives that are not giving write errors are a different model (NetApp X477_HMKPX04TA07).

Rubydesic · May 15, 2024, 12:17am

Downgrading SCALE back to the previous release seems to fix the issue with my drives, but all the apps are now broken with this error: Failed to configure kubernetes cluster for Applications: Missing "RubenPool/ix-applications/docker" dataset(s) required for starting kubernetes