Greetings to all and thank you in advance for the assist. I am very new to TrueNAS and virtualization in general so please pardon my ignorance.
Hardware:
MB: ASRock X399 Fatal1ty
Processor: AMD Ryzen Threadripper 2990WX (32 cores, 64 threads)
Processor Cooler: Noctua tower cooler
RAM: 94GB DDR4-3200. Should be 128GB, but only 94GB shows up. It all tested good, too.
1x 4TB NVME
4x 12TB Spinning Rust
OS: Proxmox VE 8.2.7
TrueNAS Scale Dragonfish-24.04.2.3 with 12 cores and 64GB of RAM assigned…
First I will start with my current problem/situation and then I will explain how I got here.
Problem: All the spinning rust has checksum errors, and each disk has almost exactly the same number of errors. I ran SpinRite on the disks and they all came back clean, so I don’t suspect disk failure. I am guessing it was the cheap SATA cables that came with the hot-swap drive caddy, or the caddy itself. So far I have just swapped cables and am trying to recover from there. I did remove the disks from the caddy with the old cables and had the same problem; replaced the cables, same problem. Now everything is back in the caddy with new cables. The drives hold data, but the pool is not in good shape.
Here is what I did to try to remedy the problem. All drives are being passed through from PM to TN. I took one drive offline, wiped it, then tried to bring it back online in the hope it would trigger a resilver without the checksum errors. Well, the drive shows up, but I can’t add it back to the pool, and I can’t remember the error it gave me or how to find it again. I ran zpool status and took a screenshot; now to figure out how to add it to this post.
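For reference, a rough sketch of what that offline/wipe/re-add sequence usually looks like from a TrueNAS shell; the pool name “tank” is a placeholder and the disk id is just borrowed from the passthrough list further down, not verified against the actual pool:
zpool offline tank ata-ST12000NM0127_ZJV4Y28S     # take the member offline
# ...wipe the disk...
# A wiped disk has no ZFS label left, so “zpool online” will refuse it;
# “zpool replace” against the same slot is the usual way to trigger a resilver:
zpool replace -f tank ata-ST12000NM0127_ZJV4Y28S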
I’m sure I need an instruction manual with popups and a juice box, but could anyone give me a vector on how to fix my situation?
First, CKSUM errors are a function of corrupt data on the drive, detected either by a zpool scrub or when a file is read and its checksum turns out to be invalid. This is almost never a drive failure when multiple drives are reporting CKSUM errors.
Run a scrub; if you have no corrupt files, then you can run a zpool clear to clear the errors. If you still have a problem with your system, the errors will come back. So it is possible your cable replacement fixed it, or maybe you killed the TrueNAS VM while it was writing data; I can’t tell you.
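In command form, a minimal sketch (the pool name “tank” is a placeholder):
zpool scrub tank        # start the scrub
zpool status -v tank    # watch progress; -v lists any files flagged as corrupt
zpool clear tank        # reset the error counters once a scrub comes back clean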
I will say this much: running TrueNAS in a VM can be tricky, and it definitely has to be done knowing that TrueNAS MUST be shut down before killing the VM/Proxmox.
Wow, I remember SpinRite from the late 1980s or early 1990s. It was a great utility in its day for troubleshooting floppy drive systems, and you could even perform head alignments with it if you knew what you were doing, though a dedicated head-alignment disk was preferred.
I am no expert in virtualizing TrueNAS (VMware or Proxmox), but the one thing others have said is to pass through the whole controller that has your TrueNAS ZFS data disks.
Some people have passed through a disk for the boot-pool. Not sure if that works or works well. But for data pool disks? No.
One specific issue that seems to affect Proxmox is that Proxmox understands ZFS. During boot of Proxmox, it is necessary to make sure beyond ANY doubt that a ZFS pool is not imported by both Proxmox and the TrueNAS VM. Having both Proxmox and TrueNAS import the same pool at the same time will almost certainly lead to irreparable ZFS pool corruption. It was mentioned in another forum post that it is possible to blacklist the disk controller that has the TrueNAS data disks so that Proxmox won’t use it.
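Purely as an illustration of that approach (the PCI vendor:device ID here is a placeholder, not taken from this thread), the host-side steps usually described are: confirm the pool is not imported on the Proxmox host, then bind the data-disk controller to vfio-pci so the host driver never claims it:
zpool list                        # on the Proxmox host: the TrueNAS data pool should NOT show up here
lspci -nn | grep -i sata          # note the controller’s vendor:device ID, e.g. [1022:7901]
echo "options vfio-pci ids=1022:7901" >  /etc/modprobe.d/vfio.conf
echo "softdep ahci pre: vfio-pci" >> /etc/modprobe.d/vfio.conf   # make vfio-pci load before ahci
update-initramfs -u               # rebuild the initramfs, then reboot the host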
Until recently, we had good information on reliably virtualizing TrueNAS with VMware. But since the money issues at VMware, Proxmox seems to be a popular substitute. However, there are differences that need to be handled to make a reliable TrueNAS VM under Proxmox.
My advice: see if bare-metal TrueNAS works reliably on that hardware. If not, then there is something else going on.
Thank you for the CKSUM clarification. I was thinking CKSUM had to do with just the data because there were no read/write errors; the rest of the internet is what was throwing me off. I have run many scrubs and they all produce errors.
I think you may have identified the cause. As much as this system has been up and down while I figured it all out, I’m pretty sure there was a point where the VM got killed due to lack of response. Seriously, I’ve reinstalled this thing so many times it’s maddening. Any other ideas besides running a scrub? I do have another machine that I am seriously considering doing a bare-metal install on and then running everything off of that.
Is it possible to make this pool green again without losing data?
Okay, so here is where my newbie side really starts to shine (as if it hasn’t already). The drives are all connected by SATA cable directly to the MB, and there are no additional cards managing the drives. So when you say controller, are you referring to a plug-in device that all drives would be tied to, which combines the disks and presents one device to the host? I would assume there are also controllers that pass individual drives through. I have no experience with such controllers.
Or are you referring to the built-in “controller” on the MB? I put that in quotes because I assume all systems have to have some sort of controller to handle disks. So same same?
I didn’t know I could pass a controller through to TN, but I never really thought about it. How would I go about doing so? I assume I could just google it?
Since the disks are a straight pass-through, TN is handling all the ZFS; PM is just handing over the disks, from what I can tell. I ran the following commands in PM for the pass-through.
qm set 102 -scsi1 /dev/disk/by-id/ata-ST12000NM0127_ZJV4Y28S
qm set 102 -scsi2 /dev/disk/by-id/ata-OOS12000G_0000QZEP
qm set 102 -scsi3 /dev/disk/by-id/ata-OOS12000G_000HEWHZ
qm set 102 -scsi4 /dev/disk/by-id/ata-OOS12000G_000Q417B
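One tweak that often comes up with this style of passthrough (noted here only as a commonly suggested addition, not something verified in this thread): a serial= option can be appended so the guest sees a stable drive serial instead of a generic QEMU one, e.g.
qm set 102 -scsi1 /dev/disk/by-id/ata-ST12000NM0127_ZJV4Y28S,serial=ZJV4Y28S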
As much as I hate to say it, a bare-metal install is probably going to be the route I take. I would prefer to keep all the data if possible, but it’s just a media drive and I can get it back. It just means I will probably have to upgrade some hardware in another box before the transition. Meh, I’ll fire it up on the old hardware and see what happens. Worst-case scenario is it doesn’t work. I should be able to put all the media back into the VM and bring it back online. Emphasis on “should” lol.
Hold up. I thought I just had a couple of bad sticks of RAM. I mean, I actually did have a bad stick of RAM that I replaced. But I replaced my 8GB modules totaling 64GB with 16GB modules trying to hit the max of 128…but all I get is 94GB. This is a known MB issue? The 8GB modules are Corsair and the 16GB modules are Timetec Pinnacle Konduit 64GB KIT (4x16GB) DDR4 3200MHz; I bought two of those. I guess I’m about to google the **** out of that one! Thanks for the inject!
This is a major issue and I would not recommend you virtualize TrueNAS. If you still want to do that, you must do a few things (just my advice):
Run a CPU stress test for at least 24 hours. This will heat things up, and if the system remains stable, that is a good sign.
Run a RAM stress test for several days. You need to run at least 5 COMPLETE passes, and depending on the amount of RAM, it could take a long time. These first two tests are to make sure you have stable components; they do not address everything, but they root out the majority of stability issues. (A rough command sketch follows after this list.)
As @Arwen has said, you must pass through the drive controller. Yes, it is possible to pass through individual drives however you will end up with data issues if you are not careful.
When your hypervisor starts, insert a 2 minute delay before your TrueNAS VM powers on.
When you power off your Hypervisor, ensure you have a 2 minute delay AFTER the VM closes.
Never assign your TrueNAS VM drives to another VM unless you have the boot/shutdown order figured out. This kind of thing will screw people over.
I also don’t recommend sleeping the drives in a VM.
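A rough sketch of the first two tests from a Linux shell (the tools and sizes here are assumptions; a bootable MemTest86+ run is the more thorough option for RAM):
stress-ng --cpu 0 --cpu-method all --metrics-brief --timeout 24h   # --cpu 0 loads every core, for 24 hours
memtester 80G 5                                                    # exercise ~80GB of RAM for 5 full passes (leave headroom for the OS)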
A disk controller can be part of the CPU chipset, an extra chip on the system board, or a PCIe plug-in card.
Having the disk controller passed through to a TrueNAS VM appears to be the only reliable method of getting storage devices into TrueNAS (when using a VM). I have no clue how that is done (VMware OR Proxmox).
Here is the SATA disk controller for my miniature desktop PC:
> lspci | grep SATA
05:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 61)
Of course, passing through the whole SATA controller means that Proxmox will not be able to have any of its local disks on that controller. Some system boards have multiple SATA controllers, so passing through all but one SATA controller can work in that case.
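For what it’s worth, on Proxmox the usual shape of that is said to be the following (the 05:00.0 address is just the example from the lspci output above, the VM ID 102 is assumed, and IOMMU has to be enabled in the BIOS and kernel for any of this to work):
lspci -nn | grep -i sata             # find the PCI address of the controller to hand over
qm set 102 -hostpci0 0000:05:00.0    # pass that whole controller to the TrueNAS VM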
Unless someone has reasonable knowledge of both Linux and ZFS, I would not recommend virtualizing TrueNAS SCALE.
Wow. As a pretty new and inexperienced TrueNASer/Proxmoxer, this is news to me. Though from what I understand, assuming you shut down Proxmox gracefully, it automatically shuts down all its VMs gracefully first. Is that different from going into TrueNAS itself and shutting it down (which will result in the VM in Proxmox being shut down) and then shutting down Proxmox itself?
I guess I wasn’t clear, in my brain it was clear to me.
If you are running TrueNAS and have some shared storage with another VM on Proxmox/ESXi, then you need to turn off the other VMs using those shares, then turn off TrueNAS, then power off the computer.
Yes, most OSes can be gracefully shut down; however, you do need to plan it out if you have dependencies on other running VMs.
I’m not sure about Proxmox, but ESXi (as I understand it) will tell a VM to power down; however, if the VM does not do so within a certain amount of time, it could be dropped. I don’t know exactly how that works, but I have mine set up to wait 2 minutes, which should be more than twice as long as it needs.
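For the Proxmox side, the setting that appears to correspond is the per-VM startup/shutdown option, sketched here with placeholder values:
qm set 102 --startup order=1,up=120,down=120   # start this VM first, wait 120s before the next VM starts, allow 120s for a clean shutdown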
Hopefully that clears it up; sorry for not being clearer.
Until recently I was running TrueNAS in a Proxmox VM. I passed an LSI HBA through to TrueNAS and it was solid. However, when the 24.10 RC was released with Docker support, I immediately moved to bare-metal TrueNAS and am very happy.
Okay, so here is where I ended up. I ran a bare-metal install on a 4-core Intel CPU with 16GB of RAM and 4x 12TB spinning rust. It’s an older board and CPU, but for the purpose of figuring things out and just making it run, it will work. If anyone really wants the specs, I will dig into the UEFI for you. I’m just experimenting with it as a STRICT NAS server now. It’s also running Tailscale. I tried running Plex off of it, but it only has enough juice to stream one feed.
Post-troubleshooting analysis: I believe this was a compound problem amplified by running it in Proxmox (so, running a hypervisor inside a hypervisor?). When my problems initially started, I found a bad stick of memory. Replaced it, and still had problems with this stack of spinning rust, as previously seen in this post. Then the bare-metal install on the “new” old board turned up a disk that had write errors! Replaced that disk with the same model, resilvered, and now it looks like it’s working properly. Anyone else ever run into multiple hardware failures at the same time? I’ll post another topic for that question. Maybe I’ll learn something new about troubleshooting and how to make my life suck less. Thanks, all, for entertaining my questions and educating me on TrueNAS.