Extremely slow boot

Hi,

Our TrueNAS box has started taking forever to boot. The boot cycle (until the UI is available) takes approximately 20 minutes. After the UI becomes available, it is extremely sluggish for hours and then returns to normal.

When logging into the UI, we see pool_dataset.sync_db_keys and pool.import_on_boot tasks stuck at 0% for hours.

We updated to 24.10.0 today to see if the problem has been fixed. Unfortunately, it is still there.

The hardware is a Dell R750 with 2x Xeon 4314 and 128GB of ECC memory. The storage pool consists of 12TB spinning SAS drives on a Dell HBA355i. The boot device is a Dell BOSS-S1 with 2x SSDs. All server firmware, including disk firmware, is updated to the latest available.

When the system is finally up, disk pool performance seems normal. However, a scrub takes forever; the estimated completion time is 9 months.

I would start by:

  1. Checking SMART attributes for all drives (see the command sketch below).
  2. Powering down and reseating all the HBAs and SAS connectors.

(I assume that your HBA has been flashed to IT mode.)
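
If it helps, here is a rough sketch of step 1 from the shell (adjust the /dev/sd? glob if your drive names differ; SAS drives print a slightly different attribute layout):

for d in /dev/sd?; do echo "== $d =="; smartctl -H -A "$d"; done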

We have run a full hardware diagnostic on the system, including all drives (a 16-hour process). Everything tested OK. SMART reports no errors.

(The HBA presents all drives individually; there is no need to flash the Dell HBA355i.)

I found this issue on Jira. [NAS-123504] Boot hangs on ix-zfs.service job for 15 minutes - iXsystems TrueNAS Jira

It seems to be related, but the issue is closed.

There is a BIG difference between an HBA in RAID mode configured to present each drive singly and an HBA in IT mode, which automatically presents each drive singly - the difference is what happens under the covers in the HBA controller. In RAID mode, the controller “optimises” writes to minimise seek time, i.e. it can reorder writes, and this is true even for JBOD disks - but ZFS needs its writes to reach the disks in a specific sequence to ensure consistency.

So you really should flash it to IT mode IMO - however, be careful: if RAID/JBOD mode doesn’t present the disks to ZFS in exactly the same way as IT mode presents the native disks, you may lose access to your data.

This is good news - though when you say that, I am unclear whether it is based on SMART attributes, a SMART short test, or a SMART long test.

Try reseating the HBAs and SAS cables as previously suggested.

IMO it is related only in the sense that the symptoms are similar. That issue was fixed a year or so ago, so assuming that you are on Dragonfish, you already have the fix. In the case of that specific issue, there were specific symptoms visible on the hardware console. Please connect a console and copy and paste any relevant error messages here.
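
If it is easier than screenshots, a couple of commands run after a slow boot should show roughly where the time went (just a sketch; the exact unit names may differ on your build):

systemd-analyze blame | head -20
journalctl -b -u ix-zfs.service --no-pager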

Hi Protopia,

Thank you very much for your reply.

The HBA is not in RAID mode; you do not need to flash the device to use it in IT mode.

SMART reports no errors for either short or long tests on any drive.

Correct, I am on Dragonfish (actually Electric Eel as of today). During boot, I see the same very long service start times in the console before the WebUI becomes available. I will post screenshots when I can reboot the system again.

You should check how busy the disks are with e.g. iostat -xmt 1.
You might find an outlier (one of the disks being a lot busier than the others); that sort of hardware fault does not show up in SMART.
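
For example, a few 5-second samples are usually easier to compare than 1-second ones (nothing TrueNAS-specific here, just plain sysstat):

iostat -xmt 5 3

Look at %util and the r_await/w_await columns and see whether one or two disks stand out from the rest.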

I will give it a try. Here is what I see during boot.


After 15-20 minutes, the console shows the system as booted.

This might be interesting, too.

The first screenshot is completely normal; it is just a consequence of which ZFS features GRUB supports.

The second one clearly shows a problem with either your boot pool disks (too slow) or a data disk misbehaving.
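
If you want to rule the boot pool in or out quickly, its status and error counters are cheap to check (the pool is usually called boot-pool on SCALE; adjust the name if yours differs):

zpool status -v boot-pool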

Here is the output from iostat:

What’s your boot drive and is the screenshot taken soon after the reboot?

The boot drive is a Dell BOSS-S1 with 2x 240GB SSDs. The screenshot was taken very soon after the reboot.

I think you now have enough info to create your own iX support ticket.

sdg and sdh seem to have higher %util values than the others.

Can you drop me the results of

for disk in /dev/sd?; do hdparm -W "$disk"; done

in a command prompt?
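
For reference, a healthy single-disk result looks roughly like this (hdparm is an ATA tool, so pure SAS drives may refuse the query or report it differently):

hdparm -W /dev/sda
/dev/sda:
 write-caching =  1 (on)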

Question - do you know how many snapshots you have in the system?
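
If you are not sure, here is a quick way to count them (a rough one-liner; it counts every snapshot on every imported pool):

zfs list -H -t snapshot -o name | wc -l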

Here is what iostat looks like after the system is fully booted:

The output from hdparm is here:

There are approximately 800 snapshots on the system.

I have tried removing all snapshots, but there is no change in behavior.

I have also tried a clean install and restoring the configuration from backup - no change in behavior.