TrueNAS SCALE random full system hangs, pool sometimes degrades, SMART shows CRC errors and some past read failures, need help isolating root cause

Hi all, I am looking for help diagnosing random full system hangs on my TrueNAS SCALE box. The hangs are severe: the Web UI, SSH, and local shell become unresponsive. The only thing that still responds is IPMI. Recovery requires a hard reset via IPMI (or power cycle).

Hardware

  • Platform: Supermicro X11SSH-LN4F

  • CPU: Intel Xeon E3-1275 v5

  • RAM: 64 GB ECC (Micron)

  • Boot: 2x SATA SSD (mirrored boot pool)

  • Data HBA: Supermicro AOC-S3008L-L8E (LSI SAS3008, 12Gbps)

  • Data drives: mix of sizes, but the main pool includes multiple 3 TB WD Red drives (WD30EFRX class). I also have at least one Toshiba 3 TB drive present.

  • Workload: Plex streaming, plus some other apps (Scrutiny, cloudflared, filebrowser, etc.)

Pool layout and current situation

  • Main data pool is RAIDZ2 (8-wide), but I do have mixed drive sizes overall in the system.

  • I ran a scrub recently and the pool came back DEGRADED with a drive showing errors.

  • The confusing part is that the “suspect” drive has not been consistent. After a few days, the pool degraded again but it was a different disk reporting issues compared to what I originally thought, and Scrutiny was also flagging a different drive than the one TrueNAS highlighted.

Symptom pattern (randomness)

  • The hangs are highly inconsistent and random.

  • Sometimes the system runs fine for around 3 days straight.

  • Other times it can freeze multiple times in a single day.

  • It can hang while I am actively using it (example: watching a movie on Plex), and it can also hang when it appears mostly idle.

  • I physically removed one of the drives I suspected most, but the system still froze later during a Plex stream.

What I have checked and tried so far

1) Logs

  • I checked journal logs from the previous boot (journalctl -b -1) and I do not see an obvious smoking gun right before a hang.

  • Because these are hard hangs that require a reset, I am aware logs might not flush to disk right before the freeze.
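One way to look for storage-related messages in whatever did reach the journal before a hang is to filter the previous boot's kernel log. This is only a sketch: the function name `storage_events` and the pattern list are assumptions chosen for illustration, not an exhaustive list of what the HBA or SATA drivers log.

```shell
#!/usr/bin/env bash
# Filter kernel log lines on stdin down to ones that commonly accompany
# storage trouble. The pattern list is an assumption; extend it as needed.
storage_events() {
  grep -iE 'mpt3sas|i/o error|task abort|resetting link|blocked for more than'
}

# Usage (previous boot, kernel messages only):
#   sudo journalctl -k -b -1 --no-pager | storage_events
```

If nothing shows up, that does not rule out the storage path, since the last messages before a hard hang may never have been written to disk.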

2) SMART tests and SMART review

I ran SMART short tests across all disks using smartctl --scan-open and smartctl -t short for each device, then reviewed:

  • overall SMART health

  • self-test history

  • SMART attributes, especially:

    • Reallocated_Sector_Ct

    • Current_Pending_Sector

    • Offline_Uncorrectable

    • UDMA_CRC_Error_Count

Key findings:

  • Reallocated_Sector_Ct = 0, Current_Pending_Sector = 0, Offline_Uncorrectable = 0 on the drives I focused on.

  • Several drives show UDMA_CRC_Error_Count increments:

    • One WD Red shows a very high CRC count (158).

    • Another WD Red shows CRC count (24).

    • Two other drives show CRC count (1).

  • A couple of WD Reds have self-test history entries that show “Completed: read failure” at some LBAs (older entries), even though more recent extended tests show as completed without error on some disks.
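Since UDMA_CRC_Error_Count is a lifetime counter that does not reset, the useful question is whether it is still climbing. A minimal sketch for pulling the raw value out of `smartctl -A` output and comparing two snapshots follows; the function names `crc_count` and `crc_grew` are hypothetical helpers, and the device path in the usage comment is only an example.

```shell
#!/usr/bin/env bash
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` output supplied on stdin.
crc_count() {
  awk '$1 == 199 && $2 == "UDMA_CRC_Error_Count" { print $10 }'
}

# Succeed (exit status 0) only if the counter grew between two snapshots.
crc_grew() {
  local before=$1 after=$2
  [ "$after" -gt "$before" ]
}

# Usage (example device path; note the drive serial alongside each value):
#   before=$(sudo smartctl -A /dev/sda | crc_count)
#   ...reseat or swap the cable, run for a few days...
#   after=$(sudo smartctl -A /dev/sda | crc_count)
#   crc_grew "$before" "$after" && echo "CRC errors still increasing"
```

A count that stays flat after a cable or port change suggests the old errors came from the link rather than the drive itself.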

This is what is confusing me:

  • Some indicators look like potential media read problems (historic self-test read failures).

  • Some indicators look like link problems (UDMA CRC errors, especially the large count on one drive).

  • The pool degrading has not consistently pointed to the same disk.

  • Removing the “worst looking” disk did not stop the random full system hangs.

What I am trying to figure out

At this point I am unsure if I am dealing with:

  • multiple aging disks failing independently

  • a cabling or backplane problem causing intermittent SATA/SAS link errors (CRC errors)

  • an HBA problem (LSI 3008 path issues, resets, firmware, etc.)

  • something else entirely that causes a full system hang (PSU, RAM, kernel driver issue, etc.)

What I would like advice on

  1. Given the mix of symptoms (random hard hangs + pool degradations not consistently pointing to one disk + CRC errors on multiple disks), what would you suspect first?

  2. What is the best isolation plan?

    • For example: swap HBA cables, move disks to different ports, bypass the HBA temporarily and test on onboard SATA, etc.

  3. What logs or outputs would be most useful to share here to avoid guessing?

    • I can provide zpool status -v, SMART dumps for the specific drives (without serials), dmesg, and IPMI SEL logs if that helps.

Thanks in advance for any direction on how to narrow this down efficiently.

You need to check all your drive models and make sure none of them are SMR types. You only want CMR drives with ZFS / TrueNAS.
Track the drives by serial number, as their device names can change between boots: sda could become sdb on the next boot. SMART long tests should be run on all your drives. I don’t expect HBA cooling issues, since you appear to be on a server platform, but these HBAs expect about 150-200 linear feet per minute of airflow according to the manufacturer’s documentation.

Some of the following may help you sort out your drives. You don’t need to post it all back here.

sudo ZPOOL_SCRIPTS_AS_ROOT=1 zpool status -vLtsc lsblk,serial,smartx,smart

lsblk -bo NAME,LABEL,MAJ:MIN,TRAN,ROTA,ZONED,VENDOR,MODEL,SERIAL,PARTUUID,START,SIZE,PARTTYPENAME

for disk in /dev/sd?; do sudo smartctl -x "$disk"; done

What firmware are you running on this card?

sas3flash -list

Thanks for the help!

I just checked and all drives are CMR.

Regarding the HBA, is there a way to monitor the temperature via TrueNAS? The card is placed next to the exhaust at the top of the case so it should have pretty good airflow, but I’d like to rule it out.

@Johnny_Fartpants Thank you for your input, I just checked using your command and I got the following:

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.

If problems with the HBA or the breakout cables could cause a full system hang, should I connect my boot drives directly to the motherboard headers instead? Please let me know what you think. :folded_hands:

Maybe try ‘sudo’ before the command (sudo sas3flash -list)? Or perhaps it doesn’t work with an OEM version?

I think it would say no command found if that was the issue. But worrying that it can’t see the card.

What does lspci | grep -i sas show?

and lsmod | grep mpt3sas

Can we also see the output of zpool status -v

and try sudo sas3flash -listall

@SmallBarky @Johnny_Fartpants Sudo did it!

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

        Controller Number              : 0
        Controller                     : SAS3008(C0)
        PCI Address                    : 00:08:00:00
        SAS Address                    : 5003048-0-2495-1400
        NVDATA Version (Default)       : 0e.01.30.28
        NVDATA Version (Persistent)    : 0e.01.30.28
        Firmware Product ID            : 0x2221 (IT)
        Firmware Version               : 16.00.10.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : LSI3008-IT
        BIOS Version                   : 08.37.00.00
        UEFI BSD Version               : 18.00.00.00
        FCODE Version                  : N/A
        Board Name                     : LSI3008-IT
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS3Flash

I also did -listall

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        Adapter Selected is a Avago SAS: SAS3008(C0)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS3008(C0)  16.00.10.00    0e.01.30.28    08.37.00.00     00:08:00:00

        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

truenas_admin@truenas[~]$ lspci | grep -i sas
08:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS3008 PCI-Express Fusion-MPT SAS-3 (rev 02)

truenas_admin@truenas[~]$ lsmod | grep mpt3sas
mpt3sas               405504  4
raid_class             12288  1 mpt3sas
scsi_transport_sas     57344  2 ses,mpt3sas
scsi_mod              319488  9 ses,scsi_transport_sas,sd_mod,raid_class,drivetemp,libata,sg,ahciem,mpt3sas
scsi_common            16384  6 scsi_mod,sd_mod,libata,sg,ahciem,mpt3sas

truenas_admin@truenas[~]$ zpool status -v
  pool: Armaniac
 state: ONLINE
  scan: resilvered 12.1G in 00:07:09 with 0 errors on Fri Feb 13 13:24:32 2026
config:

        NAME                                      STATE     READ WRITE CKSUM
        Armaniac                                  ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            724c23b3-be90-473f-b9e3-876847d62ba4  ONLINE       0     0     0
            e118e225-6418-4729-9ae6-934fb5a755e6  ONLINE       0     0     0
            71e7718e-0d1a-457a-80cf-01966744c79c  ONLINE       0     0     0
            0092d156-e7df-407d-89d9-3d6f29a6780e  ONLINE       0     0     0
            082931f8-4468-454d-9529-09e5d89037d4  ONLINE       0     0     0
            ecd78c1b-3fa1-46ec-9930-87739341ce98  ONLINE       0     0     0
            23ebd19e-e9cb-4b5e-a5e9-f049d7cfa897  ONLINE       0     0     0
            9ecb541b-d6c6-4295-bbb5-de7fa9d1836e  ONLINE       0     0     0

errors: No known data errors

  pool: SSD
 state: ONLINE
  scan: scrub repaired 0B in 00:18:54 with 0 errors on Sun Jan 18 00:18:55 2026
config:

        NAME                                    STATE     READ WRITE CKSUM
        SSD                                     ONLINE       0     0     0
          ee4ac092-1b80-493b-83fd-372ccc815b4e  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:28 with 0 errors on Thu Feb 12 03:45:30 2026
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdd3    ONLINE       0     0     0
            sdc3    ONLINE       0     0     0

errors: No known data errors

Also, do you guys think I should be connecting my boot mirror directly to my motherboard instead of the HBA?

Ah great. You could do with a firmware update as that version did have issues.

There is most likely a newer one supplied by Supermicro if you ask them.

If you are using an LSI 9300 HBA with FreeNAS or the soon-to-be TrueNAS CORE, you may experience some performance issues causing the controller to reset when using SATA HDDs.

The 9300 uses the same 3008 chip as is on your card.

I always connect my boot drives to the mobo so not a bad idea.

I reckon the firmware update will sort your issues as it’s a known bug.

That never crossed my mind! I’ll update it and let you know.

Just to be sure, is the newer firmware on the official product page okay? I don’t see a .bat file unlike the one in the post you mentioned. I’m worried I’ll brick the thing so I may just update to the older firmware. :rofl:

Yeah always best to use the firmware provided by the supplier.

Download the .rar, extract it, read the instructions, and you will be fine. It’s never a bad idea to eject your main pool before running the commands, just to be safe.

Eject NOT DESTROY :wink:

P16.00.14.00, of February 2024 :astonished_face:
Wow! That’s a new one.

Best to flash from the UEFI shell.
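Before and after flashing, it may help to confirm the installed firmware version against the target. The sketch below parses the `Firmware Version` line from `sas3flash -list` output and compares it with a target version; the helper names `fw_version` and `fw_older_than` are hypothetical, and it assumes GNU `sort -V` is available (as on Debian-based systems).

```shell
#!/usr/bin/env bash
# Print the value of the "Firmware Version" line from `sas3flash -list`
# output supplied on stdin.
fw_version() {
  awk -F: '/^[[:space:]]*Firmware Version/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# Succeed (exit status 0) only if the installed version sorts strictly
# before the target version. Relies on GNU sort's version ordering.
fw_older_than() {
  local installed=$1 target=$2
  [ "$installed" != "$target" ] &&
    [ "$(printf '%s\n%s\n' "$installed" "$target" | sort -V | head -n1)" = "$installed" ]
}

# Usage:
#   installed=$(sudo sas3flash -list | fw_version)
#   fw_older_than "$installed" 16.00.14.00 && echo "firmware is older than target"
```

With the output posted earlier in the thread, the installed version is 16.00.10.00, which sorts before 16.00.14.00.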