Major stability issues after moving to Dragonfish

Hi all! I’ve been having major stability issues after upgrading to Dragonfish.

App Containers

The first thing I noticed was a bunch of containers sitting in the yellow state instead of green. They'd be green, then go yellow and stay that way. Even if I stop and restart a container, it's not long before it goes from green back to yellow again.

I have a mix of TrueCharts and TrueNAS containers.

It was most apparent in Plex and Jellyfin. Jellyfin wouldn’t load, and Plex kept having connection issues every so often. If you got a media file to play, you’d probably be fine because of buffering, but the next video file would not load, song lyrics wouldn’t load, song files wouldn’t load, etc. It’s so bad I’m spending a bunch of money to build a new app-only server.

TrueNAS Charts also have issues

I thought it was just TrueCharts containers, but even Pi-Hole and Resilio-Sync won’t start now, and I’m using the official TrueNAS container for them. I need Resilio-Sync to work on a project with someone else, so it’s problematic when this happens.

NAS restarted itself

Just tonight, the NAS restarted itself! I was copying a bunch of large multi-gigabyte video files. The copy was going fine, and then the system suddenly restarted. This is making me think it's the OS and not the NIC or Kubernetes.

Cache-size maybe?

One thing I know changed in Dragonfish is the ZFS ARC cache size. It's now dynamic like on CORE, so I removed the init script I was using to set it manually. I rebooted after removing it, though, so this crash should be unrelated to that change.
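For anyone comparing notes, the init script I removed was just a post-init one-liner along these lines (the value here is an example for illustration, not my actual setting):

  # pre-Dragonfish workaround: cap the ZFS ARC manually (example value, ~160 GiB)
  echo 171798691840 > /sys/module/zfs/parameters/zfs_arc_max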

What happened?

I wanna know how I can find out what caused the restart, and prevent it from happening again. My SuperMicro BMC didn’t show anything, so it’s most likely OS-level.

You should post your actual system specs so someone can help you.
System Settings >> advanced >> save debug.
Take a look through those files and you may find the issue.
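If you're not sure where to start once you've extracted the debug archive, grepping the collected logs for crash-related keywords is a reasonable first pass (the directory below is just a placeholder for wherever you extracted it):

  grep -riE 'panic|oops|mce|watchdog' <extracted-debug-dir>/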

Random reboots usually mean hardware failure: a bad power supply, or a heat-related issue from a failed or improperly installed fan, bad heatsink compound, dirt clogging the cooling passages through the chassis, or an overheating component.

I have had systems and industrial robots randomly reboot during production due to bad capacitors in power supplies, slow (failing) fans, bad A/C on the server rack, etc., when under heavy load. Yet they all worked fine under no or light load.

1 Like

Which version of Dragonfish? I know there were some issues that were generally patched up after the .0 release.

1 Like

OS Version: TrueNAS-SCALE-24.04.1.1
Product: Super Server
Model: AMD EPYC 7313P 16-Core Processor
Memory: 252 GiB

I downloaded the debug information. I’ll look through it, but I have no clue where to start.

My system is 50C max, and that was while I was doing file writes.
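If anyone wants to sanity-check temperatures the same way on their own box, the BMC sensor readout via ipmitool is the quickest route; this is the generic form, not anything specific to my board:

  # read temperature sensors from the BMC
  ipmitool sdr type Temperature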

Rough, that should have fixed most of the likely culprits. I've noticed sometimes my ARC cache size won't stick to what I've manually set (usually after launching/closing VMs), and then I get some performance hiccups until I force it back to what I think is reasonable.
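When that happens, I compare what I set against what the kernel is actually using; a quick check, assuming the standard OpenZFS-on-Linux paths:

  cat /sys/module/zfs/parameters/zfs_arc_max     # what I set
  grep '^c_max' /proc/spl/kstat/zfs/arcstats     # what the ARC is actually capped at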

Uhhh, otherwise the only thing I can think of is that TrueCharts had some changes with the Dragonfish release, and they have their own posts on how to fix/work around that. Luckily I hopped off of TrueCharts when changing over to Dragonfish.

If you've got the debug info, then maybe it'd make sense to just jump into making a Jira ticket and see if the iX team can find something.

1 Like

Ignoring the TrueCharts and app issues, I would suggest trying a 24.04.2 nightly, or waiting for 24.04.2, which is imminent.

There are a lot of stability bugs in the kernel used in 24.04.1.x

And regarding TrueCharts… as they no longer support TrueNAS, I’d migrate those apps to a docker compose setup using Jailmaker, which will help with migrating them to Electric Eel.

TrueNAS Scale: Setting up Sandboxes with Jailmaker
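If it helps, the per-app footprint inside a sandbox is small. As a sketch, the docker run equivalent for one app looks roughly like the following (a compose file just captures the same options in YAML); the image tag and host paths are examples, not recommendations:

  docker run -d --name jellyfin \
    --network host \
    --restart unless-stopped \
    -v /mnt/tank/apps/jellyfin/config:/config \
    -v /mnt/tank/media:/media:ro \
    jellyfin/jellyfin:latest

The same definitions carry over to a docker-compose.yml, which is also what makes the eventual Electric Eel migration easier.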

1 Like

I’ll try out 24.04.2 when it comes out for sure! I’ve been skeptical about upgrading TrueNAS, but that could work.

Honestly, the app issues have been a huge problem, especially because I run Tailscale on here as well. I'll still have to run Tailscale in the new environment, so that Jailmaker approach might still be relevant.

In regards to TrueCharts, I wanna try my hand at TalosOS or Proxmox + TalosOS.

I bought the parts to build a new app server because I’m tired of apps constantly breaking on my NAS between OS upgrades.

Sadly, that's going to push app traffic over the network, which is way slower than the 9-18 GB/s read speeds I was getting locally, but it's much better than dealing with these issues. I just think it's a waste of a good server, since a 16C/32T CPU is overkill for TrueNAS alone (I think).

Nothing in ipmitool sel list?

Hard-crashes or “sudden restarts” typically imply one of two things - hardware faults, or an overenthusiastic hardware watchdog timer that turns a slow response into “oh no, system crash, NMI reboot it”
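If you want to rule the watchdog in or out, two quick checks (I'm assuming wdctl from util-linux is present on SCALE, since it's a standard Debian tool):

  wdctl                      # kernel-side watchdog device state, if one exists
  ipmitool mc watchdog get   # BMC watchdog timer state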

Yeah, I checked:

  74 | 12/05/23 | 18:04:10 CST | Power Supply #0xc4 | Presence detected () | Asserted
  75 | 12/05/23 | 18:04:11 CST | Power Supply #0xc5 | Presence detected () | Asserted
  76 | 12/05/23 | 19:35:12 CST | Power Supply #0xc5 | Failure detected () | Asserted
  77 | 12/05/23 | 19:35:19 CST | Power Supply #0xc5 | Power Supply AC lost () | Asserted
  78 | 12/05/23 | 19:35:25 CST | Power Supply #0xc5 | Presence detected () | Deasserted
  79 | 12/05/23 | 19:35:25 CST | Power Supply #0xc5 | Failure detected () | Deasserted
  7a | 12/05/23 | 19:35:25 CST | Power Supply #0xc5 | Power Supply AC lost () | Deasserted
  7b | 12/05/23 | 19:35:43 CST | Power Supply #0xc5 | Presence detected () | Asserted
  7c | 12/05/23 | 19:35:43 CST | Power Supply #0xc5 | Failure detected () | Asserted
  7d | 12/05/23 | 19:35:43 CST | Power Supply #0xc5 | Power Supply AC lost () | Asserted
  7e | 12/05/23 | 19:35:52 CST | Power Supply #0xc5 | Failure detected () | Deasserted
  7f | 12/05/23 | 19:35:52 CST | Power Supply #0xc5 | Power Supply AC lost () | Deasserted
  80 | 12/05/23 | 19:36:57 CST | Power Supply #0xc4 | Failure detected () | Asserted
  81 | 12/05/23 | 19:37:03 CST | Power Supply #0xc4 | Power Supply AC lost () | Asserted
  82 | 12/05/23 | 19:37:15 CST | Power Supply #0xc4 | Presence detected () | Deasserted
  83 | 12/05/23 | 19:37:15 CST | Power Supply #0xc4 | Failure detected () | Deasserted
  84 | 12/05/23 | 19:37:15 CST | Power Supply #0xc4 | Power Supply AC lost () | Deasserted
  85 | 12/05/23 | 19:37:24 CST | Power Supply #0xc4 | Presence detected () | Asserted
  86 | 12/05/23 | 19:37:24 CST | Power Supply #0xc4 | Failure detected () | Asserted
  87 | 12/05/23 | 19:37:24 CST | Power Supply #0xc4 | Power Supply AC lost () | Asserted
  88 | 12/05/23 | 19:37:36 CST | Power Supply #0xc4 | Failure detected () | Deasserted
  89 | 12/05/23 | 19:37:36 CST | Power Supply #0xc4 | Power Supply AC lost () | Deasserted
  8a | 12/18/23 | 08:13:01 CST | Unknown #0xff |  | Asserted
  8b | 02/03/24 | 21:01:05 CST | Critical Interrupt | PCI PERR () | Asserted
  8c | 02/03/24 | 21:01:05 CST | Critical Interrupt | PCI PERR () | Asserted
  8d | 02/03/24 | 21:01:05 CST | Critical Interrupt | PCI PERR () | Asserted
  8e | 02/11/24 | 00:43:04 CST | Critical Interrupt | PCI PERR () | Asserted
  8f | 03/21/24 | 01:16:34 CDT | Unknown #0xff |  | Asserted
  90 | 03/21/24 | 01:16:36 CDT | Unknown #0xff |  | Asserted
  91 | 03/21/24 | 01:17:24 CDT | Unknown #0xff |  | Asserted
  92 | 03/21/24 | 01:17:34 CDT | Unknown #0xff |  | Asserted
  93 | 03/21/24 | 01:17:36 CDT | Unknown #0xff |  | Asserted
  94 | 03/21/24 | 01:20:01 CDT | Unknown #0xff |  | Asserted
  95 | 03/21/24 | 01:20:40 CDT | Unknown #0xff |  | Asserted
  96 | 04/08/24 | 04:03:12 CDT | Critical Interrupt | PCI PERR () | Asserted
  97 | 04/08/24 | 04:03:12 CDT | Critical Interrupt | PCI PERR () | Asserted
  98 | 04/08/24 | 04:03:13 CDT | Critical Interrupt | PCI PERR () | Asserted

It just happened again:

Reminder: this also happens with apps from the official TrueNAS catalog, not just TrueCharts.

Once that motherboard comes in next week, I'll move Plex over, but the point is that this is related to the issues I've been having.

If those containers existed before, you may have to remake them. I had an issue with an app, and after a lot of digging, debug reports, and a bug report, I gave up and reinstalled the app; now it updates like it's supposed to. This is on Dragonfish 24.04.1.1.

From the report it looks like you may have a weak or bad power supply or bad power distributor.

Those reports about PSUs are from December, and I've made some changes to my NAS since then. I was running 96 SSDs on only 50 A of 5 V (not necessarily enough, and the load was spread between two PSUs of 25 A each). There's now a separate 20 A PSU for each of the 8 sets of 16 SSDs, so that's no longer relevant.


I don’t have an “update” issue, I have an “app stops working” issue.

I might have to remake those containers, but all of my containers, other than the two Tailscale ones, are having this "Deploying" state issue. Kubernetes randomly loses track of an app until I stop and restart its container. Sometimes containers never come back at all and stay in the "Deploying" state, such as Pi-Hole, Resilio-Sync, and Jellyfin.
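The stuck state is visible at the pod level too, using the k3s bundled with SCALE; roughly like this (namespaces follow the ix-<appname> pattern, and pihole is just the example here):

  k3s kubectl get pods -n ix-pihole
  k3s kubectl describe pod <pod-name> -n ix-pihole
  k3s kubectl get events -n ix-pihole --sort-by='.lastTimestamp'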

Still waiting on 24.04.2. I'm tempted to switch to the beta channel, but I'm also afraid more things are gonna break.

Is this related to my issue maybe?
[Screenshot: TrueNAS reporting 13K+ snapshots]

It seems to be counting every snapshot in the system, not just on a single zpool. But also, I'm not seeing even 5,000 snapshots, so I don't know where it's getting this 13K+ number.

Either way, could that be affecting anything related to Apps?

Strangely enough, one of my RAM sticks went bad. I've seen that stick disappear before, but it was apparently working when I first posted, because TrueNAS reported 252 GiB of RAM at the time.

The last time I checked, though, it reported only 220 GiB, and after upgrading the SuperMicro BMC firmware, it showed only 7 sticks.

I'm wondering if that was the issue all along, but I don't think so. I have 256 GB of RAM, and the chances of the TrueNAS apps just happening to land on RAM stick 5 seem low. I can see it causing a reboot, though.

I installed a replacement stick, and it shows up fine now. I put the bad stick in another identical motherboard, and it errored there too.
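For what it's worth, on the OS side, memory errors on ECC DIMMs generally surface through EDAC in the kernel log and sysfs counters; these are the generic checks, not anything specific to this board:

  dmesg | grep -iE 'edac|mce|memory error'
  grep . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null   # corrected-error counters, if the EDAC driver is loaded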

We’ll find out soon if that was the issue all along.

After swapping the RAM stick and testing again, I’m still having the “Deploying” state issue.

Maybe?

I don’t think it’d be that issue; I’d expect that (if it affected apps at all) to result in the apps service not starting at all. But if so, it should be fixed in 24.04.2, which I understand to be due out next month.

1 Like

The excessive number of snapshots will likely be on your ix-applications dataset, as each app has multiple sub-datasets which get snapshotted again and again if you are using recursive snapshots. Try clicking “Manage snapshots” on that and see how many appear in the total at the bottom.
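If the UI total looks off, a quick cross-check from the shell (the grep pattern assumes the default ix-applications dataset name):

  zfs list -H -t snapshot -o name | wc -l                     # every snapshot on the system
  zfs list -H -t snapshot -o name | grep -c ix-applications   # just the app-related ones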

All the ix-applications dataset snapshots are automatic snapshots, and they're also permanent in the sense that they don't expire; they have to be manually cleaned out. You can see them from the Snapshots button, which lists all snapshots on the system.
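If you do clean them up from the CLI rather than the UI, it's the usual zfs destroy; the -n flag makes it a dry run, and the dataset path below is a placeholder for whatever yours actually is:

  zfs list -H -t snapshot -o name | grep ix-applications | head
  zfs destroy -nv tank/ix-applications/<child>@<snapshot>   # dry run; drop -n to actually delete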