Hello all,
Possible bug I’ve run into with dragonfish that I can’t seem to wrap my head around.
TrueNAS suddenly started acting unusually slow, to the point where logging into the UI was taking about 5 minutes. The VM I had been working in had completely frozen - and I thought some odd issue had caused it to spike CPU util and bring middlewared to it’s knees (I was also just about to go into a meeting in the VM as it happened… Murphy’s law!).
Unfortunately that didn’t seem to be the case, killing off the VM didn’t help, disabling all applications didn’t help. I have a mirrored boot pool so I decided to just reboot and hope for the best.
Rebooted and everything is fine, except for two unusual things.
root@truenas[~]# zpool status -v boot-pool
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:01:18 with 0 errors on Fri May 3 18:24:12 2024
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
sdl3 ONLINE 0 0 0
ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN ONLINE 0 0 0
Previously this was sdl3 and sdj3, this ata-INTEL name is odd. It’s correct for the disk but no idea why ZFS is reporting this as the name.
Scrubbed boot-pool, no issues reported.
Okay, weird, so I run a SMART test and it passes, maybe something went wrong with the drive, but I can’t see anything super obvious here:
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 11
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 586
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 1072
170 Unknown_Attribute 0x0033 100 100 010 Pre-fail Always - 11
171 Unknown_Attribute 0x0032 100 100 010 Old_age Always - 0
172 Unknown_Attribute 0x0032 100 100 010 Old_age Always - 0
173 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 45
174 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 157
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0033 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 041 053 000 Old_age Always - 41 (Min/Max 16/53)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 157
194 Temperature_Celsius 0x0032 041 053 000 Old_age Always - 41 (Min/Max 16/53)
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 0
225 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 7701
226 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 0
227 Unknown_SSD_Attribute 0x0032 100 100 000 Old_age Always - 0
228 Power-off_Retract_Count 0x0032 100 100 000 Old_age Always - 0
232 Available_Reservd_Space 0x0033 099 099 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0033 092 092 001 Pre-fail Always - 0
236 Unknown_Attribute 0x0033 092 092 001 Pre-fail Always - 0
241 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 7701
242 Total_LBAs_Read 0x0032 100 100 000 Old_age Always - 4614
249 Unknown_Attribute 0x0032 100 100 000 Old_age Always - 9532
The thing that’s striking me as super odd is the fact that I now have /dev/sdj showing up as an unassigned disk in the storage tab, though when looking at disks in the UI, it clearly shows as assigned to boot-pool. Settings → Boot → Boot Pool Status shows the exact same (though with this ata-INTEL name).
Fdisk shows the disk as completely empty:
Disk /dev/sdj: 223.57 GiB, 240057409536 bytes, 468862128 sectors
Disk model: INTEL SSDSC2KF24
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
When attempting to offline, detach, or remove the SSD from boot-pool I am unable to:
root@truenas[~]# zpool offline boot-pool ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN
cannot offline ata-INTEL_SSDSC2KF240H6L_CVLT646100S0240CGN: no such device in pool
root@truenas[~]# zpool offline boot-pool /dev/sdj
cannot offline /dev/sdj: no such device in pool
root@truenas[~]# zpool offline boot-pool sdj
cannot offline sdj: no such device in pool
root@truenas[~]# zpool offline boot-pool sdj3
cannot offline sdj3: no such device in pool
Trying to replace ata-INTEL with sdj also doesn’t work, as the device is busy, which makes sense.
I was dumping syslog into Splunk which I am about to go through and will update if I find anything interesting. But seeing as that’s also on a VM in TrueNAS, it’s unlikely
If anyone has any suggestions or ideas on what might have happened here they would be much appreciated, as it seems very unusual!