What is causing my random system failures?

I’m a first-time TrueNAS user on a brand-new device. From the get-go the system has been randomly failing every few days or so, forcing me to power it off and on again with the hardware power button. How can I figure out what the issue is?

I have:

  • ASUSTOR Flashstore 12 Pro FS6712X (Intel Celeron N5105 CPU)
  • 12 * Kingston NV2 4TB M.2 2280 NVMe Solid State Drive (SNV2S/4000G)
  • 10 GbE NIC:
$ lspci | grep -i Ethernet
01:00.0 Ethernet controller: Aquantia Corp. Device 04c0 (rev 03)
$ sudo dmesg | grep eth0
[ 1.808985] atlantic 0000:01:00.0 enp1s0: renamed from eth0
  • 32GB RAM (CT2K16G4SFRA32A)
  • Running TrueNAS Scale 24.04.2.3 (“Dragonfish”) on a Kingston NV2 1TB M.2 2280 NVMe Solid State Drive (SNV2S/1000G) in a UGREEN M.2 NVMe SSD Enclosure (USB 3.2 Gen 2 10Gbps)

Things I’ve looked at:

  • It passes Memtest86+ fine (the high CPU temp during the test is just because the fan stays at its default low setting the whole time).
  • It passes stress -c 4 fine, with a fan-control script running in TrueNAS.
  • All my drive temps are always fine.
  • Looking at the CPU usage reporting in the TrueNAS UI, I can see roughly when the system must have failed, and I’ve tried looking at the logs around those times:
[first crash sometime Oct 18th:]
cd /var/log
sudo find . -type f | xargs sudo grep "Oct 18" | tac | sort -u -t: -k1,1
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system.journal: binary file matches
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system@000624d0750df85d-5ec3f7a345d87caa.journal~: binary file matches
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system@3ff6f22ecd8c4d67af42c7100a83d694-0000000000004fa9-000624895da58294.journal: binary file matches
./auth.log:Oct 18 15:17:01 truenas CRON[538076]: pam_unix(cron:session): session closed for user root
./cron.log:Oct 18 15:17:01 truenas CRON[538077]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
./daemon.log:Oct 18 16:10:39 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./error:Oct 18 00:00:39 truenas systemd[1]: Failed to start logrotate.service - Rotate log files.
./messages:Oct 18 12:00:39 truenas systemd-journald[609]: /var/log/journal/56993206fd474e34837b80dbdd2b3737/system.journal: Journal header limits reached or header out-of-date, rotating.
./syslog:Oct 18 16:10:39 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.

[crashed again 19th night:]
./auth.log:Oct 19 22:47:01 truenas CRON[159369]: pam_unix(cron:session): session closed for user root
./cron.log:Oct 19 22:47:01 truenas CRON[159370]: (root) CMD (test -x /usr/sbin/anacron || { cd / && run-parts --report /etc/cron.weekly; })
./daemon.log:Oct 19 23:10:13 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./debug:Oct 19 01:46:35 truenas kernel: sd 0:0:0:0: [sda] Mode Sense: 37 00 00 08
./error:Oct 19 20:48:42 truenas kernel: snd_hda_intel 0000:00:1f.3: spurious response 0x0:0x0, last cmd=0x1470900
./kern.log:Oct 19 21:57:02 truenas kernel: perf: interrupt took too long (3924 > 3922), lowering kernel.perf_event_max_sample_rate to 50750
./messages:Oct 19 21:57:02 truenas kernel: perf: interrupt took too long (3924 > 3922), lowering kernel.perf_event_max_sample_rate to 50750
./syslog:Oct 19 23:10:13 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./user.log:Oct 19 08:49:34 truenas TNAUDIT_MIDDLEWARE[924]: @cee:{"TNAUDIT": {"aid": "4833b09c-3eaa-40d9-812b-b38e1c656d5b", "vers": {"major": 0, "minor": 1}, "addr": "127.0.0.1", "user": "root", "sess": "22a8c372-2e8a-4dbf-83a1-5ffef8642851", "time": "2024-10-19 15:49:34.083113", "svc": "MIDDLEWARE", "svc_data": "{\"vers\": {\"major\": 0, \"minor\": 1}, \"origin\": \"pid:66740\", \"protocol\": \"WEBSOCKET\", \"credentials\": {\"credentials\": \"UNIX_SOCKET\", \"credentials_data\": {\"username\": \"admin\"}}}", "event": "AUTHENTICATION", "event_data": "{\"credentials\": {\"credentials\": \"UNIX_SOCKET\", \"credentials_data\": {\"username\": \"admin\"}}, \"error\": null}", "success": true}}

[crashed again during an internal rsync sometime between 18:20 and 18:50 on 21st Oct]
[in the reporting UI, max disk temp was 46 until metrics were lost at 18:21; CPU temp was 75 shortly before, and came back at 82 before stabilising around 70 during the 2nd rsync attempt]
admin@truenas[/var/log]$ sudo find . -type f | xargs sudo grep "Oct 21 18:2" | tac | sort -u -t: -k1,1
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system@00062500352c263e-e991ce3dc559543c.journal~: binary file matches
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system@3ff6f22ecd8c4d67af42c7100a83d694-0000000000008e56-00062500352a6315.journal: binary file matches
./auth.log:Oct 21 18:27:33 truenas sudo[889985]: pam_unix(sudo:session): session opened for user root(uid=0) by admin(uid=950)
./daemon.log:Oct 21 18:20:08 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./error:Oct 21 18:21:57 truenas sudo[887801]:    admin : user not allowed to change root directory to chown ; TTY=pts/80 ; PWD=/mnt/NVMes/ix-applications/releases/plex/volumes/ix_volumes/config/Library/Application Support ; USER=root ; COMMAND=apps:apps 'Plex Media Server'
./syslog:Oct 21 18:29:58 truenas systemd[1]: run-containerd-runc-k8s.io-7698dafe5332c65a05429b531bdadde02e5eb8442b681c20e1a79fe481c00534-runc.DeJCyT.mount: Deactivated successfully.
admin@truenas[/var/log]$ sudo find . -type f | xargs sudo grep "Oct 21 18:3" | tac | sort -u -t: -k1,1
./daemon.log:Oct 21 18:30:08 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./syslog:Oct 21 18:30:18 truenas systemd[1]: run-containerd-runc-k8s.io-15b9261149aba98371554f90d4e1fe921b64f35664a3cd35608b58be3bf488c6-runc.VeUsYt.mount: Deactivated successfully.

[and again at 3:45 on the 23rd while it should have been idling, though there was a rise in CPU activity and temp shortly before]
sudo find . -type f | xargs sudo grep "Oct 23 03:" | tac | sort -u -t: -k1,1
grep: ./journal/56993206fd474e34837b80dbdd2b3737/system@0006252c94ec4d43-807848c4bdb0ea02.journal~: binary file matches
./auth.log:Oct 23 03:45:01 truenas CRON[1048347]: pam_unix(cron:session): session closed for user root
./cron.log:Oct 23 03:45:01 truenas CRON[1048350]: (root) CMD (midclt call pool.scrub.run boot-pool 7 > /dev/null 2>&1)
./daemon.log:Oct 23 03:50:00 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
./syslog:Oct 23 03:51:20 truenas systemd[1]: run-containerd-runc-k8s.io-779be4f11b6a7b446d454749cb23ddf3eca4278ef3c0117f947bba8a76afde80-runc.LqjR4w.mount: Deactivated successfully.

I don’t really know what to make of the above. Do some of the log outputs suggest it’s not a sudden hardware failure, but something triggering a deliberate shutdown?

Are there other places I should be looking? What commands should I run?

Here are some other obvious issues I have, but I don’t know if they’re the cause of the actual random failures I’m concerned about, nor what to do about them:

  • Due to what I imagine is some kind of BIOS issue, it doesn’t boot into my USB-attached boot drive with TrueNAS on it following a software-requested reboot. I have to power cycle with the hardware button.
  • TrueNAS alerts "'boot-pool' is consuming USB devices 'sda' which is not recommended." following every boot. (But this hardware gives me no other option; I need all 12 NVMe slots for data.)
  • Rarely, it gets stuck during boot, apparently with some boot drive issue (just powering off and on again makes it work).

The CPU obviously doesn’t have 48 PCIe lanes for 12 NVMe drives, so there is some PCIe switching going on. I can’t find any info on the chip used for that. Even expensive PLX switches get hot and need proper cooling.

I somehow doubt that Asustor manages to properly address 12 drives simultaneously, as needed by ZFS.


In addition to what’s already been said, I don’t know why you’re underestimating those CPU temps.

At least try using a different fan profile; Memtest86+ doesn’t stress the CPU that much, so temperatures can be worse under real intensive use!

I think you may be able to fix this problem by upgrading to Electric Eel; I’ve seen two recent similar cases on the forum. But in your place I would test it somehow first (just check whether the same thing happens with another OS or not).

It’s a functional system, used and tested by other people as well. There don’t seem to be any issues with data storage. If you think there might be a storage-related issue causing the failures, what command can I run to confirm or deny it?

As noted in my first post, Memtest86+ just doesn’t know how to control my fan. I did a stress test (100% CPU usage) in TrueNAS, where the fan is controlled, and it stays under 80 degrees.

Tested with ZFS? Not just by installing TrueNAS, which obviously works.

Your CPU has 8 PCIe lanes, so natively good for two x4 NVMe drives.
Everything beyond that gets multiplexed somehow. This can be a problem for ZFS.

I would:

  • run the system with only 1 drive installed for testing purposes. See how that goes under a heavy load (see the sketch after this list), then increase the number of drives.
  • try finding the chip in charge of the switching by tracing the PCI lanes. Put a heavy load on the system, and check the temps on the chip.
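
For the heavy-load part, a minimal sketch, assuming fio is available on the system (the directory, sizes, and job count are placeholders to adjust):

sudo mkdir -p /mnt/NVMes/fio-test
# Sustained sequential writes to load the drives and the PCIe switches for 10 minutes:
sudo fio --name=switch-load --directory=/mnt/NVMes/fio-test \
    --rw=write --bs=1M --size=8G --numjobs=4 --time_based --runtime=600 \
    --group_reporting

Watch the CPU and chip temps while it runs.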

Yes, I have a ZFS DRAID2 pool across the 12 drives, and have rsync’d about 19TB of data to it without issue, other than 1 of the random failures that happened during the rsync. (I also get random failures while it should have been more or less idle.)

I did some research and found this (hope it can help):

One of the big challenges with 12x M.2 NVMe SSDs is that the drives each need to connect to the system via at least a PCIe x1 link. With 12 drives, plus 2-4 PCIe lanes usually reserved for the 10Gbase-T NIC, that is a lot to ask of the Intel N5105 with only 8 PCIe lanes total. Asustor is using ASMedia ASM2806 PCIe switches and ASM1480 PCIe mux devices to help tame the PCIe needs in this system.

regarding the chips involved.
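
If you want to see that topology from the OS side, something like this should show the PCIe tree with the switches and the drives behind them (just a sketch; the exact layout varies by system):

# Tree view of the PCIe hierarchy -- the ASM2806 switches show up as bridges:
sudo lspci -tv
# Narrow the flat listing to the relevant devices:
sudo lspci | grep -iE 'asmedia|non-volatile|bridge'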

I admit I never realized that Memtest86+ was supposed to control the fan, and I didn’t know about that utility either, so ignore this point if you’re sure temperatures are not a problem.

Reading the review posted above, they report instability with more than 16GB of RAM. You have 32GB; Memtest86+ doesn’t complain about it, but it could be another place to start debugging.


I don’t think many around here are running dRAID. It might be part of your problem - or not. Maybe try a more traditional RAIDZ?

Your system can ONLY support 16GB RAM, not two 16GB RAM modules.
Intel says so, and your product specs and installation guide say so.

The Asustor Upgrade site also lists the upgrade using 8GB RAM modules, they do not sell 16GB sticks for your device.

This is what I feel your problem is. However, you have a non-traditional computer build that is very specific to the company which built it. Pop a pair of 8GB RAM modules into it and see if that fixes your issue.

As for Memtest86+, for all I know it examines the RAM modules and sees two 16GB sticks, which equals 32GB. When it runs the testing, it likely overlaps the memory space when it tries to address anything above 16GB. I don’t know if it works that way, but if the CPU does not have enough address lines to reach above 16GB, that is what I suspect is happening.

EDIT: To clarify, I suspect your system when running TrueNAS, when going above 16GB is accessing and overwriting RAM in the first 16GB range, never actually accessing anything above that physical limit.

This is just my opinion. Take it for what it is worth.

What you mean is that it is functional in normal use - but that does NOT mean that your disk system will be functional under high stress or when recovery is needed - and those will be the worst times to find out.

There are recommendations about storage controllers for a reason, typically a combination of what is formally supported together with real users’ horror stories of when they went off-piste.

That said, my system has a USB SSD as a boot drive and although I have had some issues with stability on an internal USB port, switching to an external one has fixed it.

So my suggestion would be to try a different USB port for your boot drive.

That’s what I’m thinking as well. You’re seeing errors from the xhci driver timing out trying to talk to your boot device - try plugging the boot device into one of the USB 2.0 ports instead perhaps?
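
If you want to catch that in the act, something along these lines should surface xhci/USB resets in the kernel log (a sketch; exact messages vary by enclosure):

# Human-readable timestamps, filtered to USB controller trouble:
sudo dmesg --ctime | grep -iE 'xhci|usb.*(reset|disconnect|offline)' | tail -n 40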

When the system becomes unresponsive, are you still able to interact with it from the local HDMI console?

Seems this may have been an accident when I set things up. I didn’t intend to use the non-standard thing. But a bit late now, unless I can confirm this really is the issue and I need to change it. Any way to do that?

Lots of other people with this hardware claim to be happily using 32 or even 64GB RAM. My system claims 23.2GB ZFS Cache right now, and seems “happy”.

If RAM is the issue, is there any way to confirm this (without sourcing compatible 8GB sticks, which I wasn’t able to do last time I tried)?

The machine has 2x USB 3.2 Gen 2 (10Gbps) ports and 2x USB 2.0 ports. I’m using one of the Gen 2 ports at the moment. I can try the other one, and also one of the USB 2.0 ports, and see if it makes any difference.

No way to change it after the fact, but I suspect you more likely made a “RAIDZ2” and not a “DRAID2” - you can check from the Storage tab under “Data VDEVs” - it will tell you if you’re on RAIDZ, dRAID, etc.
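
You can also check from a shell; the vdev name gives it away - "raidz2-0" for RAIDZ2 versus "draid2:..." for dRAID2. A sketch, using the pool name visible in the paths earlier in the thread:

# Look at the vdev line under the pool name: raidz2-0 vs draid2:...-0
sudo zpool status NVMes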

Re: the RAM, I imagine ZFS would have thrown a very loud fit about incorrect checksums (and probably a whole kernel panic from the OS) if it was doing the “counterfeit-storage” routine of just overwriting the first bytes once it “looped around” past the supported capacity - however, this doesn’t mean that your platform doesn’t have an edge-case around stability with amounts of RAM greater than what’s officially supported. The N5105 seems to be used in quite a few “mini-PC” style systems so you’d have to check with community users there. From a quick search on my part it seems like there’s reports of both success and failure, so it may be down to the exact RAM sticks, or something more subtle like timings.

Edit: With that said on the RAM, from a review of the Flashstor:

The FS6712X had 2x 8GB installed which is our recommended upgrade. The FS6706T had a single 32GB SODIMM installed, and it was not stable with that.

So there’s potentially some validity to 16GB being a limitation of the platform or processor.
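
On the "confirm without new sticks" question: on a generic Linux box you can cap usable RAM for a single boot with the mem= kernel parameter. I’m not sure how cleanly that carries over to TrueNAS’s managed bootloader, so treat this as a sketch only:

# At the GRUB menu, press 'e' on the boot entry and append to the 'linux' line:
#     mem=16G
# Then boot with Ctrl-X and confirm the cap took effect:
free -h    # total should now read roughly 16GB

If the crashes stop while capped to 16GB, that points at the RAM amount rather than the sticks themselves.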

That I haven’t tried yet. It lives headless in a cupboard, and I have to power it off and move it temporarily to my office to get it attached to a monitor and keyboard. I guess next time I’ll try and wrangle taking a monitor and keyboard to it instead.

If you’re still getting keystroke/cursor response, even if it results in a hang after something like attempting to enter shell from the console, then I suspect boot device is offlining itself and middleware isn’t taking it too well.

Move the USB device to the 2.0 port, see if it behaves better from there. May need to change the enclosure as well - I know I have a couple that like to go offline or engage some manner of “power-savings” mode regardless of OS if they don’t receive regular I/O.
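
One thing worth checking for that power-savings behaviour is USB autosuspend. A sketch - the device path is a placeholder; find yours with lsusb -t:

# 'auto' means the kernel may suspend the device; 'on' keeps it powered:
grep . /sys/bus/usb/devices/*/power/control
# Keep a specific device awake ('2-1' is a placeholder for your enclosure):
echo on | sudo tee /sys/bus/usb/devices/2-1/power/control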

That’s what somebody with not much experience, or knowledge of how electronics and computers work, would “know”, think and say.

Even if that was all true, “slowness” should not cause a crash!

I am sure that other people have those same exact systems running without trouble, so this comes down to this specific case.

Now, to the person that posted this issue:
You do not need to use sudo to run grep!
sudo find . -type f | xargs sudo grep "Oct 18" | tac | sort -u -t: -k1,1

Then, the logs given are not in chronological order, so it’s hard to figure out what executed last before the crash and when the machine rebooted, etc.
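
For a chronological view, those binary journal files the grep flagged can be read directly with journalctl - for example, list the boots and then tail the journal from the boot before a crash (a sketch):

# Each crash/power-cycle starts a new boot ID:
sudo journalctl --list-boots
# Last 100 lines of the previous boot, i.e. what ran right before it died:
sudo journalctl -b -1 -n 100 --no-pager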

My quick guess? Hardware issue, either the box itself or anything that was added or changed.

Start by using a different USB SSD enclosure.
If nothing, then try these in whichever order you prefer:
  • Replace the 10GbE NIC (try a different one, even if it is only 1GbE).
  • Run it without the NVMes: export the pool(s), reboot, and see if it crashes again (see the sketch below).
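
For that last test, a sketch, using the pool name from earlier in the thread (adjust as needed):

# Take the data pool offline so the NVMe/switch side sits idle:
sudo zpool export NVMes
# ...let it run for a few days; if it still crashes, the NVMes are likely off the hook.
sudo zpool import NVMes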

That’s an unnecessary ad hominem. Shall we not, and instead focus on working the problem at hand?