Pretty sure this is a recover-from-backup event, but I want a second opinion

Had some problems with my TrueNAS instance and only noticed after about a day.
From what I could tell, I had over-allocated RAM and it was running in swap for a while, which caused everything to go into ultra-slow mode.

I assume running in this state for a while caused issues with the pool. This is the current state of the pool.

Ran a `zpool clear` just to see what would happen.
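For reference, this is roughly what I ran from the shell (the pool is named Datastore):

```
# Clear the pool's error counters and retry any suspended I/O
zpool clear Datastore

# Then see what state it actually comes back in
zpool status -v Datastore
```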

Screenshot 2024-08-19 104117

The disks show up as unassigned at this time. The disks in the pool show as numbers and not their serial numbers.

2 jobs running:
disk.sync_all just keeps running and restarting all the time.
The scrub has just been chilling.

Screenshot 2024-08-19 104250

Getting this error a lot when I am in the GUI:
Screenshot 2024-08-19 104552

Rebooting causes a lockup with a message along the lines of “Can't unmount XXXX”.

Ouch! You’re at serious risk of losing everything here.

Please describe your hardware, and how all these disks are attached.

Dell R630XD
Attached via an LSI Dell SAS 2008 card going to 2 NetApp DS4243 disk shelves.

Top shelf is all 8TB drives:
6 x 8TB RAIDZ2 x 4

Second / bottom shelf is my expansion shelf:
6 x 12TB RAIDZ2
6 x 18TB RAIDZ2

TrueNAS is running on Proxmox (I know that's not preferred) with the LSI card passed directly through to the VM.

This system was bare-metal FreeNAS / CORE that I migrated to TrueNAS SCALE maybe 1-2 months ago from an R620.

I can provide any other info that may assist.

Here are some reference photos of the setup I took when attempting to sell my old R620s. There is one more row of HDDs on the bottom now.



Perhaps they aren’t that sensitive, but did you mean to post a picture containing the service code for your device?

Only picture I had of it. As far as I am aware, you can't really do too much with the ST of a server whose support ended in 2016.

Plus it's up for sale, so the ST is out in the wild by this point.

Passing through an HBA used to be enough, until this:

Now it is advised to blacklist the HBA driver on the host as well.
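On a Debian-based Proxmox host that usually looks something like this (a sketch; the file name is arbitrary):

```
# /etc/modprobe.d/blacklist-mpt3sas.conf
# Stop the Proxmox host from ever binding the HBA driver,
# so only the passed-through TrueNAS guest touches the card.
blacklist mpt3sas
```

Then rebuild the initramfs with `update-initramfs -u -k all` and reboot so the blacklist also applies at early boot.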

@Poetart does your Proxmox host OS show the pool under a `zpool list` or `zpool status` command? HBA passthrough should be enough to keep it separated (host and guest can't both attach to the physical PCIe device at the same time), but I'd like to make sure it's actually doing that and not using raw device passthrough instead.
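Something like this, on the Proxmox host shell itself rather than inside the TrueNAS VM:

```
# Run on the Proxmox host, not in the guest
zpool list      # any pools the host itself has imported?
zpool status    # per-device detail for whatever shows up
```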

Just showing my NVMe array.

So no disk sharing, that’s a good start.

I count 36 drives (6 × 6-wide Z2) in your zpool status, but your hardware description only lists 18 drives.

Are these SAS drives by any chance?

Top shelf is 24 drives, 6 x 4.
Bottom shelf is 12 drives, 6 x 2.

The 18TB drives are native SAS.
The rest are native SATA, using the NetApp SAS-SATA interposers that came with the shelf.

I also went ahead and blacklisted the mpt3sas driver under Proxmox, just in case it was causing any issues now or would in the future.

Rebooted to test some things out; here are the errors I get when I reboot.

It just sits there forever and throws errors from time to time.

A pool with suspended I/O - especially if it also holds the system dataset - may well be blocking the regular shutdown process.

Do you by any chance have multipath SAS configured from your head unit to your SAS shelves? I believe the top controller in your NetApp shelves is the primary module - do you have anything cabled to the secondary (lower) controller in each shelf?
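One quick way to check for an accidental second path is to look for the same serial number appearing under two different disk names (a sketch, from the TrueNAS shell):

```
# Each physical drive should appear exactly once (-d skips partitions).
# The same SERIAL/WWN listed under two sd names suggests multipath.
lsblk -d -o NAME,SERIAL,WWN,SIZE
```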

Looks like it shut down successfully.

When it booted up, the pool was showing fine for about 30-45 seconds before going back offline / slowly propagating the …
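Next time it drops I'll try to catch it in the act from the shell (a rough sketch; the grep pattern is just my guess at what the HBA driver logs):

```
# Follow kernel messages live while the pool goes offline;
# SAS link resets and I/O errors from the HBA usually land here
dmesg -wT | grep -iE 'mpt3sas|i/o error'
```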

The SAS card is connected to both the top and bottom shelf.

One cable to the back of the top shelf
One cable to the back of the bottom shelf

I attempted to daisy chain them at one point but was unable to get it working.

This is the current Storage tab of TrueNAS, and why I am starting to suspect that this is an issue with the top shelf.


And these are the devices for my pool:

Sounds like no multipath, which is good; but your 24 unassigned disks make me think a controller may have failed in the 24-bay unit.

In your shoes, I’d shut down the system, swap the controllers (not the cables, the controllers) in your top 24-bay system, and see if the disks are detected properly.

I have a bit of luck on my side.

Was looking on eBay and found a listing for 2 x IOM6 controllers, only to notice that the seller is about 5 minutes away from my house.

Got local pickup. About to head out to grab them and test.

The different IOM6 is showing 13 of the 24 8TB drives assigned to Datastore, but that's better than whatever it was trying to do before.

Going to mess with the cables a bit to see what happens.

Well, this can't be good.