Multiple issues, half recovered now, but I have a question

I'll try to keep this short.

Pools:
6x Toshiba 16TB in a RAIDZ2 pool named big
2x shucked WD 8TB in a striped pool named small

Server specs:

  • Asus W680 IPMI motherboard
  • Intel Core i5-14600K
  • 2x 48GB DDR5
  • Usually an LSI 9500-8i, but currently an LSI 9300-8i card for troubleshooting
  • Latest Proxmox 8.x with PCIe passthrough configured correctly for the TrueNAS VM
  • TrueNAS SCALE 25.04.1

Something happened about 6 months ago and my TrueNAS became super unstable.

  • Switched from a 13600K to a 14600K under warranty
  • Swapped the RAM, as it was faulty in memtest

I rebuilt the VM and imported the config from backup, which resulted in constant rebooting during boot-up at ix-zfs.service and ix-netif.service.
Desperate, I unplugged both WD drives and it booted!

I plugged in one WD drive and it still boots fine, but now in the GUI one of the 16TB drives is coming up as belonging to the pool small?
It appears it's there on the zpool status command; I'm so confused.

How can I be confident of what is happening, and how do I remedy the problem?
On top of that, one of the 16TB drives needs to be replaced; it is throwing a few uncorrectable errors but is still working. Good grief!

Thank you!

I’m sure someone might be able to help you… Not!


Throwing some stuff out there.

What is your PSU?

Did the new RAM pass the memtests?

Did you run any SMART tests on the HDDs recently? Any long ones?

The PSU is a Super Flower 1000W heavy-duty beast; I don't have the exact model handy at the moment.
The new RAM and CPU pass all testing, including memtest, just fine.
I have scheduled SMART short tests, and that's where the problem with one of the drives was identified after the VM was able to boot.
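
For completeness, this is roughly how I kick off the longer tests and pull details from the shell (sdX is just a placeholder for the suspect drive):

# start an extended (long) self-test on the suspect drive
smartctl -t long /dev/sdX
# later, review the self-test log and full SMART attributes
smartctl -a /dev/sdX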

I should note that I am absolutely confident of the Proxmox install and the VM setup for TrueNAS. This is not my first rodeo with TrueNAS in a VM: I have had the same pool through several hardware iterations, drive upgrades, and even a switch from ESXi to Proxmox over the years.

It ran fine for about 6 months until all the CPU and memory issues seemed to hit at once :frowning:

Thanks!

This could be a bug in the GUI, if the zpool command in the terminal shows the correct information.


Does the boot issue occur as long as both WD drives are plugged in, but there’s no issue if either one or the other is plugged in?

Post the output of zpool status, please - we might spot why the UI shows weird things.
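
If a failed boot leaves anything in the persistent journal, the log from the previous boot for the units you mentioned might also show why they fail. Something along these lines (unit names taken from your post, adjust as needed):

# show the previous boot's log entries for the services that hang during startup
journalctl -b -1 -u ix-zfs.service -u ix-netif.service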

It may well be a bug, but it's a bit concerning to see it there with the other (broken) pool's name.
I can't actually get it to boot with both WD drives plugged in :frowning:
It goes into a boot loop when it's starting ix-zfs.service and ix-netif.service, until I remove one WD drive.
Mind you, it does show both of the WD drives during boot, prior to it boot looping. I'll try to attach a video of it POSTing with both drives.

I have only ever tested with the same WD drive plugged in; I should try booting with just the other WD drive on its own. I'll be even more puzzled if that does boot!

I wanted to, but I could not embed images, and I need to work out what has happened to my SSH access. Please bear with me.

Thanks, appreciate it

Complete the forum tutorial you were invited to when you joined; check your DMs.

And please do not post pictures of shell commands or their output; copy and paste the text instead. It supposedly even works in the CE web shell?

Completed, thanks!

Right! I was not aware of that all this time. Trying it now; it seems it was Ctrl + Insert.

root@storage[~]# zpool status
  pool: big
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 823G in 03:33:57 with 6 errors on Fri Aug 15 03:36:29 2025
config:

        NAME        STATE     READ WRITE CKSUM
        big         ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda2    ONLINE       0     0    12
            sdd2    ONLINE       0     0    12
            sdb2    ONLINE       0     0    12
            sdc2    ONLINE       0     0    12
            sdf2    ONLINE       0     0    12
            sdg2    ONLINE       0     0    12

errors: 8 data errors, use '-v' for a list

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Sat Aug  9 03:45:05 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          vda3      ONLINE       0     0     0

errors: No known data errors
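
I guess my next step on this is to see which files those 8 data errors actually hit, as the output suggests:

# list the individual files affected by the reported data errors
zpool status -v big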

How did you create that pool? For the UI and the middleware to work you must always refer to partitions via their UUID, not their raw device.

E.g.

  pool: nvme
 state: ONLINE
  scan: scrub repaired 0B in 00:01:06 with 0 errors on Sun Jul 13 00:01:12 2025
config:

	NAME                                      STATE     READ WRITE CKSUM
	nvme                                      ONLINE       0     0     0
	  mirror-0                                ONLINE       0     0     0
	    d3597b94-ffd8-4770-a4be-017340effec6  ONLINE       0     0     0
	    d34e46e4-4fc9-4c84-9c1c-c2f167d030b7  ONLINE       0     0     0

errors: No known data errors

If you created the pool on the command line you did it wrong.
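
To double-check how the members are currently recorded, and to map the sdX names to partition UUIDs, something like this should do (just a suggestion):

# show pool members with full device paths
zpool status -P big
# map kernel device names to partition UUIDs
lsblk -o NAME,SIZE,PARTUUID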

These are the disks:

Name    Serial          Disk Size    Pool
sdc     9130A05MFWTG    14.55 TiB    big
sdd     9130A02KFWTG    14.55 TiB    big
sdf     9120A0B3FWTG    14.55 TiB    big
sdg     9130A04WFWTG    14.55 TiB    small (Exported)
sdb     9130A03NFWTG    14.55 TiB    big
sda     23D0A0UGF4MJ    18.19 TiB    big
sde     7SH5AU9D        7.28 TiB     small (Exported)
vda                     32 GiB       boot-pool

The pool was created using the GUI; I only use the CLI for the zpool command and for getting more detailed SMART results.

Is that LSI 9300-8i flashed to IT firmware?

It is definitely on IT firmware; I flashed it myself years ago.

A year or more ago I migrated from Core to SCALE; does the partition-UUID vs raw-device difference perhaps come from the migration?

Here is a clip from a few days ago of the boot loop, with all 8 disks powered and cabled to the LSI card:

youtu.be/5b_Y5O1VWbM?si=SrMtJOM7GErXclxx

A small update: your comment has me quite worried at this point. What is going on here!?
I unplugged the WD drive again and everything is back to normal now; the big pool has all its drives pointing to it.

I have not found a solution to this issue, or even worked out how it got to this point, so I would appreciate any help.

Further to this, I now reckon the other pool “small” is probably fine, and it may explain the boot loop: maybe the 2 small-pool drives are interfering with the big pool, causing the whole issue.

So I will keep the 2x WD drives out of the picture for now, but I need to figure out how to get the UUIDs picked up permanently.

Thanks!

Well, I got the disk UUIDs, I think? But I'm not quite sure what to do with them yet.

root@storage[~]# ls -la /dev/disk/by-partuuid 
total 0
drwxr-xr-x 2 root root 340 Aug 15 23:42 .
drwxr-xr-x 8 root root 160 Aug 15 23:42 ..
lrwxrwxrwx 1 root root  10 Aug 15 23:42 11ba6627-3bd4-11ee-8d9d-0050569f4659 -> ../../sdb1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 11d30ed6-3bd4-11ee-8d9d-0050569f4659 -> ../../sdb2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 52f63443-2c39-11ec-a18b-0050569f4659 -> ../../sde1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 5387d5a3-2c39-11ec-a18b-0050569f4659 -> ../../sde2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 53c3d7b9-2c39-11ec-a18b-0050569f4659 -> ../../sdd1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 5414c88d-2c39-11ec-a18b-0050569f4659 -> ../../sdf1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 544bbfdb-2c39-11ec-a18b-0050569f4659 -> ../../sda1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 544de43a-2c39-11ec-a18b-0050569f4659 -> ../../sdc1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 549903f2-2c39-11ec-a18b-0050569f4659 -> ../../sdd2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 54c2efe8-2c39-11ec-a18b-0050569f4659 -> ../../sdf2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 54c4fb3b-2c39-11ec-a18b-0050569f4659 -> ../../sda2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 55137372-2c39-11ec-a18b-0050569f4659 -> ../../sdc2
lrwxrwxrwx 1 root root  10 Aug 15 23:42 bd271d56-44fe-4004-b47b-9ec3c50ede65 -> ../../vda1
lrwxrwxrwx 1 root root  10 Aug 15 23:42 c1d9ec34-b6a8-4381-9998-0885167899df -> ../../vda3
lrwxrwxrwx 1 root root  10 Aug 15 23:42 d06bc247-b974-4176-b9be-52d1e54ab71f -> ../../vda2

Did you also blacklist the PCIe device so that Proxmox can’t use it?
Passthrough is not enough.
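
For reference, on the Proxmox host that usually means something along these lines (file names are examples; check the HBA's vendor:device ID with lspci -nn first):

# /etc/modprobe.d/blacklist-hba.conf -- keep the host from loading the HBA driver
blacklist mpt3sas

# /etc/modprobe.d/vfio.conf -- optionally bind the HBA to vfio-pci by ID
# (1000:0097 is typical for a SAS3008-based 9300-8i; verify with lspci -nn)
options vfio-pci ids=1000:0097

followed by update-initramfs -u -k all and a reboot of the host.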

You can export the pool and reimport it using the partuuids instead.
I believe you just add -d /dev/disk/by-partuuid/ to the import command, but someone else here can probably chime in on that (sketch after the list below).

  • export the pool from the UI
  • zpool import -d /dev/disk/by-partuuid big
  • zpool export big
  • import the pool from the UI
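
A minimal sketch of the CLI part of that, assuming the pool is still named big and has already been exported from the UI:

# import using the partuuid symlinks so the vdevs get recorded by partition UUID
zpool import -d /dev/disk/by-partuuid big
# confirm the members now show up as UUIDs rather than sdX2 names
zpool status big
# export again so the UI / middleware can do the final import
zpool export big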

Yes, the mpt3sas driver is blacklisted for the LSI cards :+1:
The vfio modules are enabled, and I can confirm vfio-pci is the driver in use.
IOMMU is enabled and confirmed via “dmesg | grep -e DMAR -e IOMMU -e AMD-Vi”.
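
For the record, this is roughly how I checked the driver binding on the Proxmox host (the PCI address is just an example; yours will differ):

# confirm the HBA is bound to vfio-pci rather than mpt3sas
lspci -nnk -s 01:00.0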