Our TrueNAS box has started taking forever to boot. The boot cycle (until the UI is available) takes approximately 20 minutes. Once the UI is available, it is extremely sluggish for hours and then returns to normal.
When logging into the UI, we see pool_dataset.sync_db_keys and pool.import_on_boot tasks running at 0% for hours.
We updated to 24.10.0 today to see if the problem had been fixed. Unfortunately, it is still there.
The hardware is a Dell R750 with 2x Xeon 4314 and 128GB ECC memory. The storage pool has 12TB spinning SAS on a Dell 355i HBA. The boot device is a Dell BOSS-S1 with 2x SSD. All server firmware, including disk firmware, is updated to the latest available.
There is a BIG difference between an HBA in RAID mode configured to present each drive singly and an HBA in IT mode, which automatically presents each drive singly. The difference is in what happens under the covers in the HBA controller. In RAID mode, the controller “optimises” writes to minimise seek time, i.e. it can change the sequence of writes, and this is true even for JBOD disks. ZFS, however, needs a specific sequence of writes to ensure consistency.
So you really should flash it to IT mode IMO. However, be careful: if the RAID JBOD mode doesn’t present the disks to ZFS in exactly the same way as IT mode presents the native disks, you may lose access to your data.
This is good news, though when you say that, I am unclear whether it comes from SMART attributes, a SMART short test or a SMART long test.
Try reseating the HBAs and SAS cables as previously suggested.
IMO it is related only in the sense that the symptoms are similar. This was fixed a year or so ago, so assuming that you are on Dragonfish you already have the fix. In the case of this specific issue, there were specific symptoms in the hardware console. Please connect a console and copy and paste any relevant error messages here.
The HBA is not in RAID mode; you do not need to flash the device to use it in IT mode.
SMART reports no errors for either short or long tests on any drive.
Correct, I am on Dragonfish (Actually Electric Eel as of today). During boot, I see the same very long service start times in the console before the WebUI is available. I will post screenshots when I can reboot the system again.
You should check how busy the disks are with e.g. iostat -xmt 1.
You might find an outlier (one of the disks being much busier than the others); that sort of hardware error does not show up in SMART.
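If eyeballing a dozen disks in scrolling iostat output is awkward, the comparison can be scripted. The following is only a rough sketch, not an official tool: it assumes the extended report has a header line starting with "Device" and a column named "%util" (the exact layout varies between sysstat versions), and the function names and the thresholds (3x the median, at least 20% busy) are made up for illustration.

```python
import statistics


def parse_util(iostat_text):
    """Return {device: %util} from the last report in `iostat -x` output.

    Assumption: each report starts with a header line beginning with
    "Device" that contains a "%util" column, and ends at a blank line.
    """
    util = {}
    col = None
    for line in iostat_text.splitlines():
        fields = line.split()
        if not fields:
            col = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            # New report: find the %util column and discard older samples.
            col = fields.index("%util") if "%util" in fields else None
            util = {}
        elif col is not None and len(fields) > col:
            try:
                util[fields[0]] = float(fields[col])
            except ValueError:
                pass  # not a numeric row, skip it
    return util


def busy_outliers(util, factor=3.0, floor=20.0):
    """Return devices whose %util is at least `floor` and more than
    `factor` times the median across all listed devices.

    Both thresholds are arbitrary illustration values, not tuned ones.
    """
    if len(util) < 3:
        return []  # too few devices for a meaningful median
    median = statistics.median(util.values())
    return sorted(d for d, u in util.items() if u >= floor and u > factor * median)


if __name__ == "__main__":
    import sys

    # Example: iostat -xmt 1 5 | python3 this_script.py
    print(busy_outliers(parse_util(sys.stdin.read())))
```

Note that this compares every device in the report, including the boot devices, so it makes sense to look only at the pool members when interpreting the result. A disk that is flagged repeatedly while its siblings in the same vdev stay quiet is the one to investigate first.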