ZFS: lots of READ errors during heavy usage (re: multiple drives and controllers)

sofakng · June 2, 2024, 4:59pm

I have a six-drive RAID-Z1 array, but under heavy usage (simultaneous read and write access) it produces READ errors.

Here is my system:

SuperMicro X9DR3-F
2x Intel E5-2643 v2
192GB ECC RAM
LSI 9211-8i
Debian 12.5
ZFS for Linux v2.1.11-1

Here is the array:

NAME                                  STATE     READ WRITE CKSUM  
tank                                  ONLINE       0     0     0  
  raidz1-0                            ONLINE       0     0     0  
    ata-WDC_WUH721816ALE604_2BG0GELD  ONLINE       0     0     0
    ata-WDC_WUH721816ALE604_2BG0L76G  ONLINE       0     0     0
    ata-WDC_WUH721816ALE604_2BHWKSPN  ONLINE       0     0     0
    ata-WDC_WUH721816ALE604_2CHTZ60P  ONLINE       0     0     0
    ata-WDC_WUH721816ALE604_2CJL1ELJ  ONLINE       0     0     0
    ata-WDC_WUH721816ALE604_3WG186VT  ONLINE       0     0     0

Here is a snippet from syslog:

[121977.823237] sd 8:0:0:0: attempting task abort!scmd(0x000000003121adb0), outstanding for 30888 ms & timeout 30000 ms
[121977.823249] sd 8:0:0:0: [sdl] tag#2048 CDB: Read(16) 88 00 00 00 00 04 4f b4 f5 b0 00 00 00 08 00 00 
[121977.823253] scsi target8:0:0: handle(0x000e), sas_address(0x4433221105000000), phy(5) 
[121977.823257] scsi target8:0:0: enclosure logical id(0x500605b0034db930), slot(6)  
[121978.032697] sd 8:0:0:0: [sdl] tag#2259 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=1s 
[121978.032715] sd 8:0:0:0: [sdl] tag#2259 CDB: Read(16) 88 00 00 00 00 05 61 06 93 70 00 00 00 08 00 00 
[121978.032721] I/O error, dev sdl, sector 23102657392 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2 
[121978.032803] sd 8:0:0:0: [sdl] tag#1981 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK cmd_age=28s 
[121978.033154] sd 8:0:0:0: [sdl] tag#1981 CDB: Read(16) 88 00 00 00 00 07 3b 60 e8 f8 00 00 00 30 00 00 
[121978.033157] I/O error, dev sdl, sector 31060977912 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2 
[121978.033173] zio pool=tank vdev=/dev/disk/by-id/ata-WDC_WUH721816ALE604_3WG186VT-part1 error=5 type=1 offset=11828559536128 size=4096 flags=180880
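
As a sanity check on the log above, the Read(16) CDB can be decoded by hand to confirm that the SCSI failure, the block-layer I/O error, and the ZFS zio error all refer to the same request. This is only a rough sketch: it assumes 512-byte logical sectors and that part1 starts at sector 2048 (1 MiB), neither of which is shown in the log. The function name decode_read16 is made up for illustration.

```python
def decode_read16(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, transfer_length_in_blocks) from a READ(16) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex)  # fromhex() skips the spaces between bytes
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks


SECTOR_SIZE = 512          # assumption: 512-byte logical sectors
PART1_START_SECTOR = 2048  # assumption: part1 begins 1 MiB into the disk

# CDB from the tag#2259 line in the log
lba, blocks = decode_read16("88 00 00 00 00 05 61 06 93 70 00 00 00 08 00 00")

# offset= from the zio line is relative to the partition, so shift it
# by the assumed partition start to get a whole-disk sector number
zio_offset = 11828559536128
zio_sector = zio_offset // SECTOR_SIZE + PART1_START_SECTOR

print(lba, blocks * SECTOR_SIZE, zio_sector)
```

If those assumptions hold, the decoded LBA equals the sector in the "I/O error, dev sdl" line, and the zio offset maps to that same sector with the same 4096-byte size. That would mean the zio error is just ZFS reporting the read that the controller/driver layer failed with DID_SOFT_ERROR.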

The drives have no SMART errors at all.

I’ve been able to run ‘zpool clear tank’ and the pool works fine again, but the problem returns whenever I’m reading from and writing to the drives at the same time (i.e. heavy usage).

The drives were purchased as manufacturer refurbished from ServerPartDeals, but before I created the array I put each drive through a very long test (writing and reading back checksummed data across the whole drive [twice], SMART long self-tests, etc.) and they were fine.

I’m also seeing the READ errors at random on every drive (not at the same time), so I doubt they are all defective.
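
To back up the "every drive" observation with numbers, the kernel log can be tallied per device. The following is a minimal sketch, assuming the log lines have the same "I/O error, dev ..." format as the snippet above. The helper name tally_io_errors is hypothetical, and the sample lines are copied from that snippet.

```python
import re
from collections import Counter

# Matches block-layer lines such as:
#   I/O error, dev sdl, sector 23102657392 op 0x0:(READ) flags 0x700 ...
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+) op 0x[0-9a-f]+:\((\w+)\)")


def tally_io_errors(lines):
    """Count I/O error lines per (device, operation) pair."""
    counts = Counter()
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            dev, _sector, op = match.groups()
            counts[(dev, op)] += 1
    return counts


sample = [
    "[121977.823237] sd 8:0:0:0: attempting task abort!scmd(0x000000003121adb0), outstanding for 30888 ms & timeout 30000 ms",
    "[121978.032721] I/O error, dev sdl, sector 23102657392 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2",
    "[121978.033157] I/O error, dev sdl, sector 31060977912 op 0x0:(READ) flags 0x700 phys_seg 1 prio class 2",
]

counts = tally_io_errors(sample)
for (dev, op), n in sorted(counts.items()):
    print(dev, op, n)
```

Run over a full saved log rather than this three-line sample, a roughly even spread across all six drives would point toward something shared (controller, cabling, power) rather than the individual disks.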

Does anybody have any idea what might be causing these?

NugentS · June 2, 2024, 8:31pm

OK

This is a TrueNAS forum, not a ZFS forum
What case is all this in? Is there a possibility that the LSI card is overheating?
You don’t have a power supply (at least not one you told us about)
Check the firmware on the LSI card - is it the correct version?
Are you running regular short and (more importantly) long SMART tests?