QNAP TS-251+ upgraded to SCALE 24.04 drops off the network a while after boot

I have a QNAP TS-251+ with 16 GB of RAM that had been running well on TrueNAS CORE, kept up to date, for about 2 years. The hardware appeared solid up to now.

Note - I’m new to the forum, and I suspect that writing up this thread will help me figure out what’s going on, so please be patient.

Recently I decided to upgrade to SCALE 24.04. The upgrade went well enough. I changed from USB thumb drive boot media to a USB-connected SSD, and had to futz with the BIOS and use an HDMI/keyboard connection for the first time. The process I used created a clean install of CORE. I uploaded the backed-up settings, then ran the SCALE upgrade.

I had two VMs in the CORE config: HomeAssistant and an Ubuntu 22.04 VM running Logitech Media Server (LMS). I changed the assigned NIC on both VMs to the new name of the NIC I had been using. HomeAssistant migrated fine and talks to all my devices, but something was broken in the Ubuntu VM’s network config: it booted fine but was unreachable on the network.

The IPs I was using under CORE:
x.x.x.2 - primary NAS interface
x.x.x.10 - home assistant
x.x.x.64 - LMS

LMS is not a critical system (like my HVAC…), so I have been letting it sit for a few days. The weird thing is that twice now the SCALE host’s own network interface has dropped off the network. HomeAssistant is still running and reachable on its IP address, but ping and nmap can’t find the primary NAS IP. Power cycling restored it for a while yesterday, but when I went back to resume troubleshooting it was gone again.

Immediately after booting I can access
x.x.x.2 - primary NAS interface
x.x.x.10 - home assistant

But now all I can see is
x.x.x.10 - home assistant

The NAS doesn’t respond to ping, HTTP, or SSH requests. This behavior seems really weird to me. Can any of you make sense of it?
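
In case it helps anyone reproduce the check, here is a minimal Python sketch of a reachability probe that tests TCP ports instead of relying on ping. The addresses are placeholders from the 192.0.2.x documentation range standing in for my redacted x.x.x IPs, the ports are assumed defaults (80 for the web UI, 22 for SSH, 8123 for HomeAssistant), and `probe`, `TARGETS` and `report` are names made up for this example.

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False

# Hypothetical targets: substitute the real addresses and ports.
TARGETS = {
    "NAS web UI": ("192.0.2.2", 80),
    "NAS SSH": ("192.0.2.2", 22),
    "HomeAssistant": ("192.0.2.10", 8123),
}

def report(targets=TARGETS) -> None:
    """Print one up/DOWN line per target."""
    for name, (host, port) in targets.items():
        state = "up" if probe(host, port) else "DOWN"
        print(f"{name:15} {host}:{port} {state}")
```

Calling `report()` every minute or so after boot would show roughly when the NAS address stops answering while the VM address keeps responding.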

Looking at the console, it looks like I’m seeing device errors on the boot-pool drive.

WARNING: Pool 'boot-pool' has encountered an unrecoverable I/O failure and has been suspended

There are many other warnings as well: unable to enumerate device, etc.
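
To pick that message out of a long console capture, a small filter along these lines might help. It is only a sketch: it matches the exact wording quoted above (other ZFS versions may phrase the warning differently), and `find_suspended_pools` is a hypothetical helper name.

```python
import re

# Matches the warning wording quoted above, capturing the pool name.
SUSPENDED = re.compile(
    r"Pool '(?P<pool>[^']+)' has encountered an unrecoverable I/O failure "
    r"and has been suspended"
)

def find_suspended_pools(log_text: str) -> list[str]:
    """Return pool names reported as suspended, in order of first appearance."""
    seen: list[str] = []
    for match in SUSPENDED.finditer(log_text):
        pool = match.group("pool")
        if pool not in seen:
            seen.append(pool)
    return seen
```

Feeding it the text saved from the console should return `['boot-pool']` for the warning above, and an empty list if no pool has been suspended.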

I’m also encountering occasional kernel panics during boot. I first saw that when I was performing the upgrade. I was running headless before, but the box was about as stable as could be - many months of uptime. The new instability is puzzling.

I think I need to try different boot media, but this box only has 2 SATA bays and 3 USB ports. I’m considering a 2-bay USB mSATA enclosure populated with 2 cheap 32 GB SSDs to act as a mirrored boot volume. Thoughts?

I spent some more time watching the system (SCALE 24.04.1) after boot today, and I noticed about 250 sleeping processes and a troubling trend in the load average: it climbed from 15 to 20, with the system getting slower and less responsive by the minute. It eventually locked up again after about 45 minutes of uptime.
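
On Linux the load average counts tasks stuck in uninterruptible sleep as well as runnable ones, so a rising load with mostly sleeping processes would be consistent with processes blocked on I/O to the suspended boot pool, though I have not confirmed that is the cause here. A rough sketch for logging the trend rather than eyeballing it follows; `sample_load` and `is_climbing` are made-up names, and the 1.0 threshold is an arbitrary assumption.

```python
import os
import time

def sample_load(count: int, interval_s: float) -> list[float]:
    """Collect `count` 1-minute load averages, `interval_s` seconds apart."""
    samples = []
    for i in range(count):
        samples.append(os.getloadavg()[0])  # Unix-only call
        if i < count - 1:
            time.sleep(interval_s)
    return samples

def is_climbing(samples: list[float], min_rise: float = 1.0) -> bool:
    """True if samples never decrease and rise by at least min_rise overall."""
    if len(samples) < 2:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples, samples[1:]))
    return non_decreasing and (samples[-1] - samples[0]) >= min_rise
```

For example, `is_climbing(sample_load(10, 60))` would take one reading a minute for ten minutes and report whether the load only went up over that window.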

Went back to my CORE boot media today and it’s running fine. I think that’s where I have to leave it for now. I learned a little about backing up and restoring configurations and installing on other boxes, and that was useful.