I have a strange problem on a fresh install of 24.10.1: any time I write to my 5-NVME-drive RAIDZ1 pool, the machine immediately reboots. No error is thrown as far as I can tell, and when it comes back up, there’s nothing in the logs that I can find for hints.
Here are my hardware details:
CWWK fanless box with 4 ETH interfaces and their NVME expansion card (which works via bifurcation)
Boot drive: Samsung 860 512 GB SATA drive (via UASP USB enclosure)
RAIDZ1 array (single pool):
4x SK Hynix Gold P31 2 TB NVME drives
1x Sabrent Rocket 4 Plus NVME drive
It doesn’t matter whether I copy to the array from the network (via SMB or SCP) or whether I copy a large file locally from the SATA system drive to the array—the thing just reboots immediately.
I’ve been building 4-drive Z1 arrays trying to find a bad drive, but I can’t. Most four-drive arrays cause the reboot, but some don’t. It’s very frustrating.
I’m hoping I’m either missing something basic or I’ve stumbled into a well-known bug. But Google didn’t find much on either of those fronts.
Any thoughts? I’m a reasonably experienced Linux sysadmin, but I’ve got nothing so far.
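For anyone hunting the same kind of silent reboot, here is a minimal sketch of pulling kernel messages from the boot that crashed. It assumes the systemd journal is persistent on the box (I have not confirmed that for TrueNAS), and the function name is hypothetical. With a hard reset, the last messages may never reach disk, so an empty result does not prove nothing was logged.

```python
import shutil
import subprocess


def previous_boot_kernel_log_cmd(boots_back=1, priority="warning"):
    """Build a journalctl command for kernel messages from an earlier boot."""
    if boots_back < 1:
        raise ValueError("boots_back must be at least 1")
    return [
        "journalctl",
        "--no-pager",
        "-k",                     # kernel messages only
        f"--boot=-{boots_back}",  # -1 is the boot before the current one
        "-p", priority,           # this priority and anything more severe
    ]


if __name__ == "__main__":
    if shutil.which("journalctl") is None:
        print("journalctl not found on this system")
    else:
        cmd = previous_boot_kernel_log_cmd()
        result = subprocess.run(cmd, capture_output=True, text=True)
        lines = result.stdout.splitlines()
        # Only the tail matters: the messages just before the reset.
        print("\n".join(lines[-40:]) if lines else result.stderr.strip())
```

If the journal only keeps the current boot, journalctl should report that the requested boot is unavailable rather than show anything useful.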
Try deleting the pool and setting up a new pool with a single drive - do this for each NVMe - then test each pool. See if you can isolate the issue to a single drive.
Thanks for your input! Yeah…NVME temp was my first thought too. But nope.
I have already set up each drive as an individual pool—and I couldn’t replicate the problem on any of them. Then I tried three-drive arrays, thinking it might be something to do with being a Z1 array.
I couldn’t build a non-rebooting three-drive Z1 array from any combination in which at least one drive differed in make/model from the other two. I was able to build a stable Z1 array with four of my SK Hynix drives, but not when I mixed in the Sabrent drive. I swapped out the one Sabrent for a spare and got the same result. The same happens when two drives in the array are Sabrent and only one is SK Hynix.
It now seems clear that mixing two NVME makes/models is the root of the problem. That was, in fact, my very first thought, but when I tested a mixed four-drive array (3 SK + 1 Sabrent) it worked just fine—or at least, I thought it did. Maybe I just didn’t build the array I thought I did that time.
I can live with just four matched drives. I’ll use the fifth NVME slot for a little OS drive and dump the USB SATA drive. (I kind of hated the clunkiness of putting the OS on an external drive anyway.) I wish I could install TrueNAS SCALE to just a partition of the OS drive so I could use the remaining space for other things. But that’s not possible with TrueNAS, is it?
As I said above, I think the solution is to restrict the array to four identical NVME drives. While I thought five identical drives was ideal, it was my understanding that TrueNAS could accommodate 4+1 as long as they were also NVME drives of identical size. Was I wrong about that?
I have to admit I’m surprised that Scale would do an instant reboot when there’s a problem writing to a pool that’s not the OS pool—I’d expect the OS to take the array offline and throw an error. My guess is there’s something weird going on with how the kernel talks to the two different NVME drive types in the array, and then something Very Bad happens to the kernel (though I’m just guessing).
Is Scale known to fail this way in other situations? If mixing similar drives is supposed to be OK, has anyone else had a problem when doing so?
SCALE is known to fail on some N100-class Terramaster devices if VT-d is enabled in the BIOS. The culprit is the cheap PCIe switch used by the motherboard.
You may have a similar issue.
Interesting. Is it known to fail in the way I described?
Unlike the Terramaster device you mentioned, this box doesn’t have a PCIe switch—just nine PCIe3 lanes. The NVME card I’m using (from the system manufacturer) gets four of those, bifurcated from a single 4x slot. There’s no switch.
I might try turning off VT-d in the BIOS, but only if the Terramaster device does a sudden reboot on writes like my machine does.
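Before and after toggling VT-d, it would be worth confirming what the kernel actually sees. A rough sketch, assuming the standard Linux sysfs layout where an active IOMMU exposes numbered directories under /sys/kernel/iommu_groups (the function name is my own):

```python
from pathlib import Path


def iommu_group_count(root="/sys/kernel/iommu_groups"):
    """Count IOMMU groups exposed by the kernel; 0 suggests the IOMMU is off."""
    path = Path(root)
    if not path.is_dir():
        return 0
    return sum(1 for entry in path.iterdir() if entry.is_dir())


if __name__ == "__main__":
    count = iommu_group_count()
    if count:
        print(f"IOMMU appears active: {count} group(s) found")
    else:
        print("No IOMMU groups found; VT-d is likely disabled or unsupported")
```

A count of zero after disabling VT-d would at least confirm that the BIOS setting took effect before I rerun the write test.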