The curious case of the extra drive - Boot drive appearing as unassigned in TrueNAS UI

Hello all,

Possible bug I’ve run into with dragonfish that I can’t seem to wrap my head around.
TrueNAS suddenly started acting unusually slow, to the point where logging into the UI was taking about 5 minutes. The VM I had been working in had completely frozen - and I thought some odd issue had caused it to spike CPU util and bring middlewared to it’s knees (I was also just about to go into a meeting in the VM as it happened… Murphy’s law!).

Unfortunately that didn’t seem to be the case, killing off the VM didn’t help, disabling all applications didn’t help. I have a mirrored boot pool so I decided to just reboot and hope for the best.

Rebooted and everything is fine, except for two unusual things.

root@truenas[~]# zpool status -v boot-pool
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:01:18 with 0 errors on Fri May  3 18:24:12 2024
config:

        NAME                                             STATE     READ WRITE CKSUM
        boot-pool                                        ONLINE       0     0     0
          mirror-0                                       ONLINE       0     0     0
            sdl3                                         ONLINE       0     0     0
            ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN  ONLINE       0     0     0

Previously this was sdl3 and sdj3, this ata-INTEL name is odd. It’s correct for the disk but no idea why ZFS is reporting this as the name.

Scrubbed boot-pool, no issues reported.

Okay, weird, so I run a SMART test and it passes, maybe something went wrong with the drive, but I can’t see anything super obvious here:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       11
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       586
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       1072
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       11
171 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
173 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       45
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       157
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   041   053   000    Old_age   Always       -       41 (Min/Max 16/53)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       157
194 Temperature_Celsius     0x0032   041   053   000    Old_age   Always       -       41 (Min/Max 16/53)
199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
225 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       7701
226 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
227 Unknown_SSD_Attribute   0x0032   100   100   000    Old_age   Always       -       0
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       0
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0033   092   092   001    Pre-fail  Always       -       0
236 Unknown_Attribute       0x0033   092   092   001    Pre-fail  Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       7701
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       4614
249 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       9532

The thing that’s striking me as super odd is the fact that I now have /dev/sdj showing up as an unassigned disk in the storage tab, though when looking at disks in the UI, it clearly shows as assigned to boot-pool. Settings → Boot → Boot Pool Status shows the exact same (though with this ata-INTEL name).

Fdisk shows the disk as completely empty:

Disk /dev/sdj: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: INTEL SSDSC2KF24
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

When attempting to offline, detach, or remove the SSD from boot-pool I am unable to:

root@truenas[~]# zpool offline boot-pool ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN
cannot offline ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN: no such device in pool
root@truenas[~]# zpool offline boot-pool /dev/sdj
cannot offline /dev/sdj: no such device in pool
root@truenas[~]# zpool offline boot-pool sdj
cannot offline sdj: no such device in pool
root@truenas[~]# zpool offline boot-pool sdj3
cannot offline sdj3: no such device in pool

Trying to replace ata-INTEL with sdj also doesn’t work, as the device is busy, which makes sense.

I was dumping syslog into Splunk which I am about to go through and will update if I find anything interesting. But seeing as that’s also on a VM in TrueNAS, it’s unlikely :frowning:

If anyone has any suggestions or ideas on what might have happened here they would be much appreciated, as it seems very unusual!

Okay, interesting.
There was a spike of network and disk activity at around 16:50 (for reference I noticed this at around 16:58) - this brought context switches above 200,000 and interrupts to 65,00!!!
That explains that part then, I know exactly what caused that and it’s a easy fix (I’d actually turned off boost clock and had forgotten to re-enable it).

While these writes were not going to the boot-pool I can see that there was quite a lot of data being written to the boot-pool at the time, metadata? That still doesn’t really explain why this odd disk behavior is being seen (I also didn’t think metadata would be in the order of 100s MiB/s).

Still no luck on figuring out why the disk is showing up as unassigned now, perhaps I’ll pull it out, format it, reconnect it, and do a “replace” so that it copies data from the mirror.

Something I also missed in the initial post: I am seeing this alert on boot:

### Critical

#### Failed to check for alert HadUpdate:

2024-05-03 18:34:13 (Europe/London)

I have absolutely no idea what this means, but it only started after the initial issues I saw earlier this afternoon!

Pulled the drive out to check it, and it’s completely empty! Not a thing on it! Bizarre.
CrystalDiskInfo seems to know what a lot of these unknown attributes are and there isn’t anything out of the ordinary.

I’ve detached the missing drive from the boot-pool, so am going to reconnect this drive and add it back - worst case it fails, again!

See also Zpool shows mixed device id´s

Physically removing the drive, detaching it from the pool, and then adding it back again worked. After this point I also no longer saw the “Failed to check for alert HadUpdate” alert on boot.

Still absolutely no idea what caused it.

Hit it again - albeit without knocking my boot drive out and a breakthrough has been had. I’m swapping a hell of a lot more than I should be (that’d be the writes I was seeing to my boot-pool as mentioned in a previous post), and I can see a lot of related threads with similar issues of slowness (though the boot drive issue seems to be an outlier, it may be because I rebooted the machine during this and hit some unknown issue).

I’m playing around at the moment with vm.swappiness and enforcing the following:
zfs_arc_max: 121437156556
zfs_arc_sys_free: 8589934592

Hopefully it doesn’t crop up again.