TrueNAS SCALE crashes every 0.27 days consistently

Hello Forum,
this is my first post here, so hopefully I am in the right category.

My Problem:
For some reason my (first) TrueNAS SCALE NAS crashes every 0.27 days ±10 min (very regularly).

My system:

  • MINIS FORUM 790S7
  • AMD Ryzen 9 7940HX with Radeon Graphics
  • 2x 2 TB M.2 NVMe SSDs
  • 1 external USB boot SSD (I know this is not optimal, but I have no choice since I do not have enough M.2 / SATA slots)

Things I tried to fix it:

  • reseat RAM (and run a short RAM test)
  • disable Apps
  • disable all overclocking

Some observations

  • I get the "‘boot-pool’ is consuming USB devices ‘sda’ which is not recommended." warning, I think because of my external boot drive
  • when watching the crash happen, RAM and CPU usage are normal
  • shortly before the crash the web UI becomes unstable, and once I saw the "‘boot-pool’ is consuming USB devices ‘sda’ which is not recommended." warning reappear even though I had dismissed it after boot
  • after 10 min the “freeze” resolves itself and the machine runs again with no issues for another 0.27 days. You can shorten the 10 minutes by hard-powering it off and booting again
  • The screen is blank in the “freeze” period
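
For reference, the 0.27 days figure can be converted into hours and minutes. This is just a small Python sketch; the function name is made up for illustration, and the only input is the uptime value quoted above:

```python
def days_to_hms(days: float) -> str:
    """Convert a fractional number of days to an H:MM:SS string."""
    total_seconds = round(days * 86_400)  # 86,400 seconds per day
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


print(days_to_hms(0.27))  # 6:28:48
```

So the freeze comes back roughly every six and a half hours.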

These are the error logs connected to one of the events:

Nov 12 14:49:41 truenas systemd-coredump[360044]: Process 1711 (asyncio_loop) of user 0 dumped core.

Module libsystemd.so.0 from deb systemd-252.26-1~deb12u2.amd64
Module libudev.so.1 from deb systemd-252.26-1~deb12u2.amd64
Stack trace of thread 3731:
#0  0x000000000054d8d0 n/a (python3.11 + 0x14d8d0)
#1  0x000000000050c70e n/a (python3.11 + 0x10c70e)
#2  0x000000000050bcbe n/a (python3.11 + 0x10bcbe)
#3  0x0000000000633477 n/a (python3.11 + 0x233477)
#4  0x00000000004fbe73 _PyObject_GC_New (python3.11 + 0xfbe73)
#5  0x000000000056a27d PyMethod_New (python3.11 + 0x16a27d)
#6  0x0000000000523813 _PyObject_GetMethod (python3.11 + 0x123813)
#7  0x000000000052d1b8 _PyEval_EvalFrameDefault (python3.11 + 0x12d1b8)
#8  0x000000000051fde7 _PyObject_FastCallDictTstate (python3.11 + 0x11fde7)
#9  0x00000000005b528c n/a (python3.11 + 0x1b528c)
#10 0x0000000000518bc6 _PyObject_MakeTpCall (python3.11 + 0x118bc6)
#11 0x000000000052c6a0 _PyEval_EvalFrameDefault (python3.11 + 0x12c6a0)
#12 0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#13 0x0000000000585118 n/a (python3.11 + 0x185118)
#14 0x00000000005138c4 n/a (python3.11 + 0x1138c4)
#15 0x00000000005306f7 _PyEval_EvalFrameDefault (python3.11 + 0x1306f7)
#16 0x000000000055d661 _PyFunction_Vectorcall (python3.11 + 0x15d661)
#17 0x00000000005306f7 _PyEval_EvalFrameDefault (python3.11 + 0x1306f7)
#18 0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#19 0x0000000000585118 n/a (python3.11 + 0x185118)
#20 0x000000000067bf0c n/a (python3.11 + 0x27bf0c)
#21 0x0000000000656cb4 n/a (python3.11 + 0x256cb4)
#22 0x00007f1568dfc134 start_thread (libc.so.6 + 0x89134)
#23 0x00007f1568e7c7dc __clone3 (libc.so.6 + 0x1097dc)

Stack trace of thread 1714:
#0  0x00007f1568df8e96 __futex_abstimed_wait_common64 (libc.so.6 + 0x85e96)
#1  0x00007f1568e03cd0 __new_sem_wait_slow64 (libc.so.6 + 0x90cd0)
#2  0x00000000004f9bb3 PyThread_acquire_lock_timed (python3.11 + 0xf9bb3)
#3  0x000000000058af6f n/a (python3.11 + 0x18af6f)
#4  0x00000000005518ee n/a (python3.11 + 0x1518ee)
#5  0x000000000053b94c PyObject_Vectorcall (python3.11 + 0x13b94c)
#6  0x000000000052c6a0 _PyEval_EvalFrameDefault (python3.11 + 0x12c6a0)
#7  0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#8  0x0000000000585118 n/a (python3.11 + 0x185118)
#9  0x00000000005306f7 _PyEval_EvalFrameDefault (python3.11 + 0x1306f7)
#10 0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#11 0x0000000000585118 n/a (python3.11 + 0x185118)
#12 0x000000000067bf0c n/a (python3.11 + 0x27bf0c)
#13 0x0000000000656cb4 n/a (python3.11 + 0x256cb4)
#14 0x00007f1568dfc134 start_thread (libc.so.6 + 0x89134)
#15 0x00007f1568e7c7dc __clone3 (libc.so.6 + 0x1097dc)

Stack trace of thread 1715:
#0  0x00007f1568df8e96 __futex_abstimed_wait_common64 (libc.so.6 + 0x85e96)
#1  0x00007f1568e03cd0 __new_sem_wait_slow64 (libc.so.6 + 0x90cd0)
#2  0x00000000004f9bb3 PyThread_acquire_lock_timed (python3.11 + 0xf9bb3)
#3  0x000000000058af6f n/a (python3.11 + 0x18af6f)
#4  0x00000000005518ee n/a (python3.11 + 0x1518ee)
#5  0x000000000053b94c PyObject_Vectorcall (python3.11 + 0x13b94c)
#6  0x000000000052c6a0 _PyEval_EvalFrameDefault (python3.11 + 0x12c6a0)
#7  0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#8  0x0000000000585118 n/a (python3.11 + 0x185118)
#9  0x00000000005306f7 _PyEval_EvalFrameDefault (python3.11 + 0x1306f7)
#10 0x00000000005860d4 n/a (python3.11 + 0x1860d4)
#11 0x0000000000585118 n/a (python3.11 + 0x185118)
#12 0x000000000067bf0c n/a (python3.11 + 0x27bf0c)
#13 0x0000000000656cb4 n/a (python3.11 + 0x256cb4)
#14 0x00007f1568dfc134 start_thread (libc.so.6 + 0x89134)
#15 0x00007f1568e7c7dc __clone3 (libc.so.6 + 0x1097dc)

I hope someone can help me. Thanks in advance!

What version of SCALE is this?

Thank you for the fast response! Oh sorry, I forgot. It is the latest stable build, TrueNAS SCALE 24.10.0.2

Please file a bug ticket on our Jira.

How much RAM do you have? Also look at your swap usage: is it above zero? If it is more than a few KB then this could be your problem.

I have 64 GB of RAM, of which 30 GB is free, so I think there shouldn’t be any swap in use. But where can I look that up? :slight_smile: Sorry, I am quite new to TrueNAS

It is in the GUI, I think under reports. Sorry, I’m on the road so I can’t give you an exact answer. But if the system has 30 GB free, you are correct, it should not be a problem; however, it is still worth looking at to rule it out as a cause.
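
If you have shell access, swap can also be read from /proc/meminfo. Here is a minimal Python sketch, assuming the standard Linux SwapTotal and SwapFree fields (both reported in kB); the function name and the sample values are made up for illustration and are not taken from your system:

```python
def swap_used_kib(meminfo_text: str) -> int:
    """Return swap in use (KiB) given the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])  # first token is the number
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


# Illustrative sample only; on a real system pass open("/proc/meminfo").read()
SAMPLE = """\
MemTotal:       65536000 kB
SwapTotal:       2097152 kB
SwapFree:        2097152 kB
"""
print(swap_used_kib(SAMPLE))  # 0
```

Anything above zero means some swap is in use.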

You have an interesting problem. Hope it gets solved quickly.

IIRC, SCALE 24.04 and later don’t use swap, so that shouldn’t be a factor.

Yup, I didn’t catch the TrueNAS version being run. Thanks for posting the correction to my mistake :shushing_face:

Thanks! I have just filed a bug ticket on Jira. But if anyone here knows what the problem might be, I am still happy to hear any ideas / approaches :slight_smile:

Your ticket shows that you have modified the base TrueNAS install. These get automatically closed with a request to reproduce on a clean install.

I did install Coral drivers for Frigate, but the issue occurred before that too. I will redo a clean install and send a new / updated ticket when the next crash happens. Is that OK?

To me, this appears to be udev-related. Based on some experiences I’ve had with setting up zswap and issues with BIOS-managed power modes, you may need to ensure that you have the appropriate udev rules for your device and double-check that your USB ports or the USB drive itself aren’t entering a low power mode. For udev, I found that I needed rules that ensured certain drives were configured properly in udev with the right filesystem type and ignored by udev (in the case of zram and zswap, anyway) by including:

ENV{ID_FS_TYPE}=="zfs_member", ENV{UDISKS_IGNORE}="1"

as part of the devices’ rules. You will also need to have the correct ENV{ID_PART_ENTRY_TYPE} set as well. I don’t know what your particular udev settings need to be, but maybe this will help point you in the right direction if the device isn’t being detected properly by udev.

Also note that if the drive appears to have “spun down” for any reason, especially if TrueNAS middleware didn’t trigger it, then TrueNAS will offline it. If the kernel doesn’t panic, it may come back online again at some point, but that behavior seems like it would be highly undefined.

Ensure you’re delivering full power to your USB port at all times, that your boot-pool isn’t configured to use any power saving modes in the TrueNAS middleware or BIOS, and that the USB device itself isn’t designed to be a low-power device or to spin itself down when not in use. I can’t guarantee that will fix your problem, but the measures I just mentioned (other than changes to udev) certainly can’t hurt.

Hi, thanks for the response. This is very interesting, as my drive does come back after some time and the crashing is very repeatable, which looks like the power to the drive is being cut off at some point (but the LED on the drive actually stays on while the freeze happens).

I am actually a complete newcomer to NAS systems and drive / USB power modes. Could you (or someone else) help me with where to begin?
I searched the BIOS and the only parameter I could find was an ERP setting, which was set to “disabled”.

How can I set udev rules? I can’t find anything in the TrueNAS search. The boot device is just a USB-to-NVMe adapter with an SSD mounted inside it.

I have just reinstalled a clean version and will upload another ticket when the system crashes again after 0.27 days.

I’m sorry; I’m not a udev expert. If you look at the kernel archives or various Linux how-tos you might find what you need. If not, Linux & Unix Stack Exchange or ServerFault might be useful places to ask about specific udev settings.

In all honesty, though, I suspect it’s the spin-down. You can control TrueNAS-managed spin-down in the TrueNAS web client for that disk (make sure it’s disabled). As for the BIOS settings, you would need to look at the manual for your particular BIOS. For me, they were in multiple places like the PCIe and USB settings, power management settings, port settings, and the AHCI and similar settings. In other words, they were all over the place; my system happened to come with sensible defaults, but yours may not.

There may also be settings for the kernel or in the sys and proc filesystems that will allow you to disable low power modes for your device. You’d have to Google for them; I know there are some, because I’ve used them before, but if find /sys -iname "*power*" or similar doesn’t turn up something appropriate then I wouldn’t know specifically where to point you.
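
As one concrete place to look: on Linux each USB device normally exposes a power/control file under /sys/bus/usb/devices, where "auto" means the kernel may runtime-suspend the device and "on" keeps it awake. The following is only a sketch that lists those values; the function name is my own, and I can't promise this is the setting that matters on your box:

```python
from pathlib import Path


def usb_power_settings(sysfs_root: str = "/sys/bus/usb/devices") -> dict[str, str]:
    """Map each USB device directory name to its power/control value."""
    settings: dict[str, str] = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return settings  # not Linux, or sysfs not mounted
    for control in sorted(root.glob("*/power/control")):
        try:
            settings[control.parent.parent.name] = control.read_text().strip()
        except OSError:
            continue  # device vanished or file not readable
    return settings


for device, mode in usb_power_settings().items():
    print(f"{device}: {mode}")
```

If the adapter holding the boot SSD shows "auto", writing "on" to that file as root is the usual way to rule out runtime suspend, though the change does not survive a reboot.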

If you have a computer that doesn’t let you access USB or link-state power (and there certainly may be some), then you will probably need to buy a powered enclosure that isn’t relying on the power from the USB port itself, or buy a different system that has more ports or gives you BIOS or OS access to USB power states. I know that’s not what you want to hear, but if you can’t find a way to control your USB power then you might not have a choice.

Alternatively, you might consider rebuilding your array and assigning one of your existing data drives to be your boot drive. You’d have to dedicate a drive for it, but it could be a good stopgap solution using the hardware you already have if you don’t have any other options.

Hello, after some research I tried to disable autosuspend for the USB ports as described here:

But this wasn’t it either. The fresh install did crash anyways as expected after 0.27 days. I added a second debug report to the ticket. Is that ok @awalkerix ?

Thank you!

Hello,
I tried troubleshooting by booting from USB (same issue) and bought a PCIe-to-NVMe adapter, which seemed to work. But about one hour past the usual 0.27 days the system crashed, this time completely, with these logs:

Then, when I try to reboot, it doesn’t want to boot, and when trying to reinstall TrueNAS the following error is shown: (Fixed after formatting the SSD)

Can someone help me? Otherwise I will have to send the system back, as I think I have tried every combination of booting with nearly the same result…
Thanks!

This error is telling you that your GPT partition tables are corrupted in an unfixable way. You probably need to reformat the stick and reinstall. You may also need to enable legacy USB support in your BIOS and boot from an MBR instead of GPT if TrueNAS will even let you do that; the default seems to be using the EFI partition of a GPT table.
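
If you want a quick sanity check before reformatting, the primary GPT header sits in LBA 1 and starts with the 8-byte signature "EFI PART". The sketch below only tests for that signature; it assumes 512-byte logical sectors, the function name is made up, and it does not validate the header CRCs or the backup table, so a partitioning tool such as gdisk is still the better judge:

```python
def has_gpt_signature(path: str, sector_size: int = 512) -> bool:
    """Return True if the primary GPT header signature is present at LBA 1."""
    with open(path, "rb") as disk:
        disk.seek(sector_size)  # LBA 0 is the protective MBR, LBA 1 the GPT header
        return disk.read(8) == b"EFI PART"


# Hypothetical usage (needs root; replace /dev/sdX with the real device):
#   has_gpt_signature("/dev/sdX")
```

A False result means the header is missing or overwritten; a True result does not prove the table is intact.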

You’ve been wrestling with this for a while. No one else can really help you directly because it’s an unsupported configuration on hardware the rest of us can’t replicate for you. Not all computers, USB ports, BIOS options, or enclosures will work. You can get a decent mini PC even before Cyber Monday for under US $137 that will likely work if you have an internal SSD for your boot pool and a better enclosure or better cables.

There’s absolutely nothing wrong with dancing on the edge with your hardware or configuration, but doing that risks the sort of problems you’re facing. Ultimately, your current hardware (especially your need to rely on an external USB stick that may or may not be fast enough or high quality enough) is going to keep bogging you down. It’s time to try some different hardware.

If you really want to use your current hardware anyway or to use an alternative that requires booting from USB then you may want to at least try UnRAID to see if it works with your existing hardware. I actually switched to TrueNAS from UnRAID for similar reasons, but you may have the opposite use case.

UnRAID works differently, and has different performance goals and use cases. It’s not better or worse in an objective sense; it’s just different. Assuming it works with your hardware and you don’t need the performance of striping across disks (UnRAID uses a file mover to move files from cache to a single disk in your array that’s optionally protected by parity), then the only real downside of UnRAID is that there’s no community edition. It’s got a lengthy free trial, but all versions require a paid license. The license costs aren’t high, but it’s definitely non-free in both a FOSS sense and a free-as-in-beer sense. However, it may be a cheaper alternative than replacing your components if you need or want to continue working within the constraints of your current hardware.
