Scale crashing during resilver

I have a server running TrueNAS SCALE 24.10.2. During a resilver, the server either crashes services while the console stays available, or the whole system reboots after about an hour of resilvering.

System Specs:
Storinator AV15
128GB RAM
14 x Seagate Exos X18 16TB drives
Roughly 50TB of data

zpool status shows:
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:12 with 0 errors on Sat Jul 20 03:45:13 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sdp3    ONLINE       0     0     0
        sdo3    ONLINE       0     0     0

errors: No known data errors

pool: hddpool1
state: ONLINE
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Jul 22 07:45:58 2024
1.65T / 54.2T scanned at 1.63G/s, 0B / 53.4T issued
0B resilvered, 0.00% done, no estimated completion time
config:

    NAME                                      STATE     READ WRITE CKSUM
    hddpool1                                  ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
      raidz2-1                                ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0
        xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  ONLINE       0     0     0

Despite the description, the status of hddpool1 shows no sign of a problem. All drives are ONLINE and the pool state is NOT DEGRADED.
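For anyone who wants to repeat that check without reading every row by eye, here is a minimal sketch in Python. It assumes the `zpool status` text has already been captured into a string, and the function name `unhealthy_devices` is made up for this example. It only looks at the NAME / STATE / READ / WRITE / CKSUM rows and reports any that are not ONLINE or that show non-zero error counters.

```python
def unhealthy_devices(status_text):
    """Return config rows from `zpool status` text that look unhealthy.

    Hypothetical helper: a row counts as unhealthy if its state is not
    ONLINE or any of its READ/WRITE/CKSUM counters is not "0".
    """
    problems = []
    for line in status_text.splitlines():
        tokens = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes...]
        # Skip the header row and any line whose second field is not
        # an all-caps state word (e.g. "errors: No known data errors").
        if len(tokens) < 5 or tokens[0] == "NAME" or not tokens[1].isupper():
            continue
        name, state, read, write, cksum = tokens[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((name, state, read, write, cksum))
    return problems
```

Run against the output posted above, this should return an empty list, which matches what the status shows. An empty list only means ZFS has not flagged a device; it says nothing about why the system is crashing.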

Yeah, right now there is no way to tell what this is. If it’s throwing a kernel panic on the screen or in the logs, then sure, we can track that down, and we’ll need a bug ticket to confirm.

If it’s hard-resetting with no panic, then it could be heat or some other system instability that the resilver is merely exposing, because it keeps all the disks busy at the same time.
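A rough way to tell the two cases apart is to search the kernel log from the boot that crashed for panic-style messages. The sketch below is only an illustration: the function name `find_panic_lines` and the marker list are assumptions, and the list is not exhaustive. It expects the log text as a string, for example kernel messages from the previous boot saved with `journalctl -k -b -1`, which assumes the journal is kept across reboots.

```python
import re

# Assumed, non-exhaustive set of strings the Linux kernel prints when it
# panics or hits a serious fault. Matching is case-sensitive on purpose
# so that ordinary "debug:" lines are not picked up by the BUG: marker.
PANIC_MARKERS = re.compile(r"Kernel panic|Oops:|\bBUG:|Call Trace:")

def find_panic_lines(log_text):
    """Return the log lines that contain one of the panic markers."""
    return [line for line in log_text.splitlines() if PANIC_MARKERS.search(line)]
```

If this returns matching lines, there is something concrete to attach to a bug ticket. If it returns nothing, that is consistent with a hard reset (heat, power, or other hardware instability), although a panic can also fail to reach the log before the machine goes down.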

I do have a debug package downloaded. Where can I submit a bug ticket?

There is a “report a bug” link in the header at the top of this page. Create the ticket, and afterward it’ll send you a link where you can submit the debug file securely so that only our staff can review it.

Found it. Ticket created and private logs uploaded.
