Multiple issues, half recovered now, but I have a question

I'll try to keep this short.

Pools:
6x Toshiba 16TB in a RAIDZ2 pool named big
2x shucked WD 8TB in a striped pool named small

Server specs:

  • Asus W680 IPMI motherboard
  • Intel Core i5-14600K
  • 2x 48GB DDR5
  • Usually an LSI 9500-8i, but currently an LSI 9300-8i card for troubleshooting
  • Latest Proxmox 8.x with PCIe passthrough configured correctly for the TrueNAS VM
  • TrueNAS SCALE 25.04.1

Something happened about 6 months ago and my TrueNAS became super unstable.

  • Switched from a 13600K to a 14600K under warranty
  • Swapped the RAM, as it was faulty in memtest

I rebuilt the VM and imported the config from backup, which resulted in constant rebooting during boot-up at ix-zfs.service and ix-netif.service.
Desperate, I unplugged both WD drives and it booted!

I plugged in one WD drive and it still boots fine, but now in the GUI one of the 16TB drives is coming up as belonging to the pool small?
It appears it's there on the zpool status command; I'm so confused.

How can I be confident of what is happening, and how do I remedy the problem?
On top of that, one of the 16TB drives needs to be replaced; it is throwing a few uncorrectable errors but is still working. Good grief!

Thank you!

I’m sure someone might be able to help you… Not!


Throwing some stuff out there.

What is your PSU?

Did the new RAM pass the memtests?

Did you run any SMART tests on the HDDs recently? Any long ones?

The PSU is a Super Flower 1000W heavy-duty beast; I don't have the exact model handy at the moment.
The new RAM and CPU pass all testing, including memtest, just fine.
I have scheduled SMART short tests, and that's where the problem with one of the drives was identified after the VM was able to boot.
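
For completeness, this is roughly how I kick off the longer tests and pull details from the shell (sdX is just a placeholder for the suspect drive):

# start an extended (long) self-test on the suspect drive
smartctl -t long /dev/sdX
# later, review the self-test log and full SMART attributes
smartctl -a /dev/sdX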

I should note that I am absolutely confident of the Proxmox install and the VM setup for TrueNAS. This is not my first rodeo with TrueNAS in a VM: I have had the same pool through several hardware iterations, drive upgrades, and even a switch from ESXi to Proxmox over the years.

It ran fine for about 6 months until all the CPU and memory issues seemed to hit at once :frowning:

Thanks!

This could be a bug in the GUI, if the zpool command in the terminal shows the correct information.


Does the boot issue occur as long as both WD drives are plugged in, but there’s no issue if either one or the other is plugged in?

Post the output of zpool status, please - we might spot why the UI shows weird things.
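
If a failed boot leaves anything in the persistent journal, the log from the previous boot for the units you mentioned might also show why they fail. Something along these lines (unit names taken from your post, adjust as needed):

# show the previous boot's log entries for the services that hang during startup
journalctl -b -1 -u ix-zfs.service -u ix-netif.service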

It may well be a bug, but it's a bit concerning to see it there with the other (broken) pool's name.
I can't actually get it to boot with both WD drives plugged in :frowning:
It goes into a boot loop when it's starting ix-zfs.service and ix-netif.service, until I remove one WD drive.
Mind you, it does show both of the WD drives during boot, prior to it boot looping. I'll try to attach a video of it POSTing with both drives.

I have only ever tested with the same WD drive plugged in; I should try booting with just the other WD drive on its own. I'll be even more puzzled if that does boot!

I wanted to, but I could not embed images, and I need to work out what has happened to my SSH access. Please bear with me.

Thanks, appreciate it

Complete the forum tutorial you were invited to when you joined; check your DMs.

And please do not post pictures of shell commands or their output; copy and paste the text instead. It supposedly even works in the CE web shell?

Completed, thanks!

Right! I was not aware of that all this time. Trying it now; it seems it was Ctrl + Insert.

root@storage[~]# zpool status
  pool: big
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 823G in 03:33:57 with 6 errors on Fri Aug 15 03:36:29 2025
config:

        NAME        STATE     READ WRITE CKSUM
        big         ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda2    ONLINE       0     0    12
            sdd2    ONLINE       0     0    12
            sdb2    ONLINE       0     0    12
            sdc2    ONLINE       0     0    12
            sdf2    ONLINE       0     0    12
            sdg2    ONLINE       0     0    12

errors: 8 data errors, use '-v' for a list

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Sat Aug  9 03:45:05 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          vda3      ONLINE       0     0     0

errors: No known data errors
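
I guess my next step on this is to see which files those 8 data errors actually hit, as the output suggests:

# list the individual files affected by the reported data errors
zpool status -v big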

How did you create that pool? For the UI and the middleware to work you must always refer to partitions via their UUID, not their raw device.

E.g.

  pool: nvme
 state: ONLINE
  scan: scrub repaired 0B in 00:01:06 with 0 errors on Sun Jul 13 00:01:12 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	nvme                                      ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    d3597b94-ffd8-4770-a4be-017340effec6  ONLINE       0     0     0
	    d34e46e4-4fc9-4c84-9c1c-c2f167d030b7  ONLINE       0     0     0

errors: No known data errors

If you created the pool on the command line you did it wrong.
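
To double-check how the members are currently recorded, and to map the sdX names to partition UUIDs, something like this should do (just a suggestion):

# show pool members with full device paths
zpool status -P big
# map kernel device names to partition UUIDs
lsblk -o NAME,SIZE,PARTUUID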

These are the disks:

Name    Serial          Disk Size    Pool
sdc     9130A05MFWTG    14.55 TiB    big
sdd     9130A02KFWTG    14.55 TiB    big
sdf     9120A0B3FWTG    14.55 TiB    big
sdg     9130A04WFWTG    14.55 TiB    small (Exported)
sdb     9130A03NFWTG    14.55 TiB    big
sda     23D0A0UGF4MJ    18.19 TiB    big
sde     7SH5AU9D        7.28 TiB     small (Exported)
vda                     32 GiB       boot-pool

The pool was created using the GUI; I only use the CLI for the zpool command and for getting more detailed SMART results.

Is that LSI 9300-8i flashed to IT firmware?

It is definitely on IT firmware; I flashed it myself years ago.

A year or more ago I migrated from Core to SCALE; does the partition-UUID vs raw-device difference perhaps come from the migration?

Here is a clip from a few days ago of the boot loop, with all 8 disks powered and cabled to the LSI card:

youtu.be/5b_Y5O1VWbM?si=SrMtJOM7GErXclxx

A small update: your comment has me quite worried at this point. What is going on here!?
I unplugged the WD drive again and everything is back to normal now; the big pool has all its drives pointing to it.

I have not found a solution to this issue, or even worked out how it got to this point, so I would appreciate any help.

Further to this, I now reckon the other pool “small” is probably fine, and it may explain the boot loop: maybe the 2 small-pool drives are interfering with the big pool, causing the whole issue.

So I will keep the 2x WD drives out of the picture for now, but I need to figure out how to get the UUIDs picked up permanently.

Thanks!

Well, I got the disk UUIDs, I think? But I'm not quite sure what to do with them yet.

root@storage[~]# ls -la /dev/disk/by-partuuid 
total 0
drwxr-xr-x 2 root root 340 Aug 15 23:42 .
drwxr-xr-x 8 root root 160 Aug 15 23:42 ..
lrwxrwxrwx 1 root root  10 Aug 15 23:42 11ba6627-3bd4-11ee-8d9d-0050569f4659 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 11d30ed6-3bd4-11ee-8d9d-0050569f4659 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 52f63443-2c39-11ec-a18b-0050569f4659 -> ../../sde1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 5387d5a3-2c39-11ec-a18b-0050569f4659 -> ../../sde2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 53c3d7b9-2c39-11ec-a18b-0050569f4659 -> ../../sdd1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 5414c88d-2c39-11ec-a18b-0050569f4659 -> ../../sdf1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 544bbfdb-2c39-11ec-a18b-0050569f4659 -> ../../sda1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 544de43a-2c39-11ec-a18b-0050569f4659 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 549903f2-2c39-11ec-a18b-0050569f4659 -> ../../sdd2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 54c2efe8-2c39-11ec-a18b-0050569f4659 -> ../../sdf2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 54c4fb3b-2c39-11ec-a18b-0050569f4659 -> ../../sda2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 55137372-2c39-11ec-a18b-0050569f4659 -> ../../sdc2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 bd271d56-44fe-4004-b47b-9ec3c50ede65 -> ../../vda1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 c1d9ec34-b6a8-4381-9998-0885167899df -> ../../vda3
lrwxrwxrwx 1 root root  10 Aug 15 23:42 d06bc247-b974-4176-b9be-52d1e54ab71f -> ../../vda2

Did you also blacklist the PCIe device so that Proxmox can’t use it?
Passthrough is not enough.
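
For reference, on the Proxmox host that usually means something along these lines (file names are examples; check the HBA's vendor:device ID with lspci -nn first):

# /etc/modprobe.d/blacklist-hba.conf -- keep the host from loading the HBA driver
blacklist mpt3sas

# /etc/modprobe.d/vfio.conf -- optionally bind the HBA to vfio-pci by ID
# (1000:0097 is typical for a SAS3008-based 9300-8i; verify with lspci -nn)
options vfio-pci ids=1000:0097

followed by update-initramfs -u -k all and a reboot of the host.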

You can export the pool and reimport it using the partuuids instead.
I believe you just add -d /dev/disk/by-partuuid/ to the import command, but someone else here can probably chime in on that (sketch after the list below).

  • export the pool from the UI
  • zpool import -d /dev/disk/by-partuuid big
  • zpool export big
  • import the pool from the UI
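
A minimal sketch of the CLI part of that, assuming the pool is still named big and has already been exported from the UI:

# import using the partuuid symlinks so the vdevs get recorded by partition UUID
zpool import -d /dev/disk/by-partuuid big
# confirm the members now show up as UUIDs rather than sdX2 names
zpool status big
# export again so the UI / middleware can do the final import
zpool export big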

Yes, the mpt3sas driver is blacklisted for the LSI cards :+1:
The vfio modules are enabled, and I can confirm vfio-pci is the driver in use.
IOMMU is enabled and confirmed via “dmesg | grep -e DMAR -e IOMMU -e AMD-Vi”.
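
For the record, this is roughly how I checked the driver binding on the Proxmox host (the PCI address is just an example; yours will differ):

# confirm the HBA is bound to vfio-pci rather than mpt3sas
lspci -nnk -s 01:00.0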