Hi all, I am looking for help diagnosing random full system hangs on my TrueNAS SCALE box. The hangs are severe: the Web UI, SSH, and local shell become unresponsive. The only thing that still responds is IPMI. Recovery requires a hard reset via IPMI (or power cycle).
Hardware
Platform: Supermicro X11SSH-LN4F
CPU: Intel Xeon E3-1275 v5
RAM: 64 GB ECC (Micron)
Boot: 2x SATA SSD (mirrored boot pool)
Data HBA: Supermicro AOC-S3008L-L8E (LSI SAS3008, 12Gbps)
Data drives: mix of sizes, but the main pool includes multiple 3 TB WD Red drives (WD30EFRX class). I also have at least one Toshiba 3 TB drive present.
Workload: Plex streaming, plus some other apps (Scrutiny, cloudflared, filebrowser, etc.)
Pool layout and current situation
Main data pool is RAIDZ2 (8-wide), but I do have mixed drive sizes overall in the system.
I ran a scrub recently and the pool came back DEGRADED with a drive showing errors.
The confusing part is that the "suspect" drive has not been consistent. After a few days the pool degraded again, but a different disk was reporting issues than the one I originally suspected, and Scrutiny was also flagging a different drive than the one TrueNAS highlighted.
Symptom pattern (randomness)
The hangs are highly inconsistent and random.
Sometimes the system runs fine for around 3 days straight.
Other times it can freeze multiple times in a single day.
It can hang while I am actively using it (example: watching a movie on Plex), and it can also hang when it appears mostly idle.
I physically removed one of the drives I suspected most, but the system still froze later during a Plex stream.
What I have checked and tried so far
1) Logs
I checked journal logs from the previous boot (journalctl -b -1) and I do not see an obvious smoking gun right before a hang.
Because these are hard hangs that require a reset, I am aware logs might not flush to disk right before the freeze.
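Since hard hangs often cut the journal off before the last messages hit disk, two things can help capture evidence: persistent journald storage and mirroring kernel messages to another machine with netconsole. A minimal sketch, assuming the standard systemd/kernel paths; the IPs, interface name, and MAC address are placeholders for your own network, and config changes on SCALE may not survive updates:

```shell
# Persistent journal: journald drop-in (standard systemd path), so the
# previous boot's log survives the hard reset:
#   /etc/systemd/journald.conf.d/persistent.conf
#     [Journal]
#     Storage=persistent

# Kernel log mirroring over UDP with netconsole (all addresses below are
# placeholders -- substitute your sender IP/interface and receiver IP/MAC):
#   modprobe netconsole netconsole=6665@192.168.1.50/eno1,6666@192.168.1.10/aa:bb:cc:dd:ee:ff

# On the receiving machine, listen for the messages:
#   nc -u -l 6666
```

Even when the journal loses the final seconds, netconsole frequently catches the last kernel messages (driver resets, hung-task warnings) because they go out over the wire immediately.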
2) SMART tests and SMART review
I ran SMART short tests across all disks using smartctl --scan-open and smartctl -t short for each device, then reviewed:
overall SMART health
self-test history
SMART attributes, especially:
Reallocated_Sector_Ct
Current_Pending_Sector
Offline_Uncorrectable
UDMA_CRC_Error_Count
Key findings:
Reallocated_Sector_Ct = 0, Current_Pending_Sector = 0, Offline_Uncorrectable = 0 on the drives I focused on.
Several drives show UDMA_CRC_Error_Count increments:
One WD Red shows a very high CRC count (158).
Another WD Red shows CRC count (24).
Two other drives show CRC count (1).
A couple of WD Reds have self-test history entries showing "Completed: read failure" at some LBAs (older entries), even though more recent extended tests completed without error on some disks.
This is what is confusing me:
Some indicators look like potential media read problems (historic self-test read failures).
Some indicators look like link problems (UDMA CRC errors, especially the large count on one drive).
The pool degrading has not consistently pointed to the same disk.
Removing the "worst looking" disk did not stop the random full system hangs.
What I am trying to figure out
At this point I am unsure if I am dealing with:
multiple aging disks failing independently
a cabling or backplane problem causing intermittent SATA/SAS link errors (CRC errors)
an HBA problem (LSI 3008 path issues, resets, firmware, etc.)
something else entirely that causes a full system hang (PSU, RAM, kernel driver issue, etc.)
What I would like advice on
Given the mix of symptoms (random hard hangs + pool degradations not consistently pointing to one disk + CRC errors on multiple disks), what would you suspect first?
What is the best isolation plan?
For example: swap HBA cables, move disks to different ports, bypass the HBA temporarily and test on onboard SATA, etc.
What logs or outputs would be most useful to share here to avoid guessing?
I can provide zpool status -v, SMART dumps for the specific drives (without serials), dmesg, and IPMI SEL logs if that helps.
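If it helps, the outputs above can be gathered in one pass. A sketch (tool names are the standard ones; anything missing on a given box is just noted and skipped):

```shell
# Sketch: collect the diagnostics mentioned above into one directory so they
# can be posted together. Tools that are absent are noted rather than failing.
outdir="hang-diag-$(date +%Y%m%d)"
mkdir -p "$outdir"

run() {  # run <label> <cmd> [args...]: save output, or note the tool is absent
  label=$1; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" > "$outdir/$label.txt" 2>&1
  else
    echo "$1 not available" > "$outdir/$label.txt"
  fi
}

run zpool-status zpool status -v
run dmesg        dmesg
run ipmi-sel     ipmitool sel list
run journal-prev journalctl -b -1 --no-pager
echo "collected into $outdir"
```

The IPMI SEL in particular is worth checking after every hang: it lives on the BMC, so it survives the hard reset even when nothing made it into the journal.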
Thanks in advance for any direction on how to narrow this down efficiently.
You need to check all your drive models and make sure none of them are SMR types. You only want CMR drives with ZFS / TrueNAS.
Track the drives by serial number, as their device names can change between boots: sda one boot could be sdb the next. SMART long tests should be run on all your drives. I don't expect HBA cooling issues since it appears you are on a server platform, but these HBAs expect about 150-200 linear feet per minute of airflow per the manufacturer's documents.
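On the SMR point: for the WD Reds specifically, the EFRX models (like the WD30EFRX class mentioned above) are CMR, while the later EFAX models are SMR. A sketch of a model-string check; the table is a small illustrative subset, not an exhaustive list, so verify anything it does not cover against WD's own CMR/SMR documentation:

```shell
# Sketch: classify a drive model string as CMR or SMR. Only a couple of
# WD Red families are covered here (illustrative, not exhaustive).
recording_type() {
  case "$1" in
    *EFRX*) echo CMR ;;      # older CMR WD Red models, e.g. WD30EFRX
    *EFAX*) echo SMR ;;      # WD Red models that switched to SMR
    *)      echo UNKNOWN ;;  # anything else: check the vendor's lists
  esac
}

# On a real box the model strings would come from something like:
#   lsblk -d -n -o NAME,MODEL,SERIAL
recording_type "WD30EFRX-68EUZN0"
recording_type "WD40EFAX-68JH4N0"
```

Anything that comes back SMR is a problem for ZFS resilvers regardless of the hang issue, so it is worth ruling out early.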
Some of the following may help with you sorting out your drives. You don't need to post it all back here.
sudo ZPOOL_SCRIPTS_AS_ROOT=1 zpool status -vLtsc lsblk,serial,smartx,smart
Regarding the HBA, is there a way to monitor the temperature via TrueNAS? The card is placed next to the exhaust at the top of the case so it should have pretty good airflow, but I'd like to rule it out.
@Johnny_Fartpants Thank you for your input. I just checked using your command and got the following:
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
No Avago SAS adapters found! Limited Command Set Available!
ERROR: Command Not allowed without an adapter!
ERROR: Couldn't Create Command -list
Exiting Program.
If issues with the HBA or the breakout cables can cause a full system hang, maybe I should connect my boot drives directly to the motherboard SATA headers instead. Please let me know what you think.
Ah great. You could do with a firmware update as that version did have issues.
There is most likely a newer one supplied by Supermicro if you ask them.
If you are using an LSI 9300 HBA with FreeNAS or the soon-to-be TrueNAS CORE, you may experience some performance issues causing the controller to reset when using SATA HDDs.
The 9300 uses the same 3008 chip as is on your card.
That never crossed my mind! I'll update it and let you know.
Just to be sure, is the newer firmware on the official product page okay? I don't see a .bat file, unlike the one in the post you mentioned. I'm worried I'll brick the thing, so I may just update to the older firmware.
Yeah always best to use the firmware provided by the supplier.
Download the .rar, unzip it, and read the instructions and you will be fine. It's never a bad idea to export your main pool before running the commands, just to be safe.
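For reference, the usual sas3flash sequence on an SAS3008 card looks like the sketch below. The firmware filename is deliberately left as a placeholder (use exactly what Supermicro supplies for the AOC-S3008L-L8E and follow their instructions), and since your Linux run reported no adapters, you may need the EFI build (sas3flash.efi) run from a UEFI shell instead:

```shell
# Sketch of the typical sas3flash flow; filenames are placeholders, and the
# actual steps should come from Supermicro's instructions for your card.
if command -v sas3flash >/dev/null 2>&1; then
  have_sas3flash=1
  sas3flash -list                    # confirm the adapter and current firmware version
  # zpool export tank                # export your data pool (name is yours) first
  # sas3flash -o -f <firmware.bin>   # flash the IT-mode firmware Supermicro supplies
else
  have_sas3flash=0
  echo "sas3flash not found; run on the TrueNAS box, or use sas3flash.efi from a UEFI shell"
fi
```

Running `-list` again after the flash confirms the new firmware version took before you import the pool again.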