3 Pools on NetApp DS4246 Causing Issues

Hey all, I’ve got a doozy of an issue here. I have an old Dell R620 running two Xeon 2620s and 16GB of RAM. I threw in an LSI SAS9207-8e and flashed the Dell RAID controller to HBA firmware so I could use this box as a ZFS replication target for backups. The drives sit in a NetApp DS4246 shelf connected via the LSI HBA mentioned above.

So I created a pool out of four 4TB IronWolf NAS drives in RAIDZ2 for server backups, and another pool out of two RAIDZ2 vdevs of four 12TB Seagates each. Everything was running smoothly, with no issues to speak of.
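
For reference, that layout corresponds roughly to the following zpool structure. This is just a sketch with placeholder pool and disk names; TrueNAS builds pools through the web GUI and references partitions by UUID rather than by raw sdX names:

    # Pool 1: a single RAIDZ2 vdev of four 4TB drives (placeholder names throughout)
    zpool create backups raidz2 sda sdb sdc sdd

    # Pool 2: two RAIDZ2 vdevs of four 12TB drives each
    zpool create bigpool \
        raidz2 sde sdf sdg sdh \
        raidz2 sdi sdj sdk sdl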

Then I created the third pool: four 10TB HGST drives. I ran ZFS replication to it, and all was well for a few days, until Sunday. I lay down in bed and got 20 emails about being unable to read SMART data across any of the drives. I went to take a looksie, and all the pools were in a suspended state but appeared to be online in the web GUI. I poked around in the shell and couldn’t get anywhere, so I chose to reboot. From then on I haven’t been able to get the pools to work properly.

I have done clean installs of a few different versions of TrueNAS, and every time the third pool is imported and the system reboots, the following happens:

  1. Smartd fails to run:
    (screenshot: smartd failing to start)

  2. The system seems to run several disk.sync jobs, which never work quite right and lead to disks having labels that seem…off:

  3. The system seems unable to reboot properly, giving the following error. Apologies for not being able to copy/paste this, it was on my KVM:

The screencaps above are from 23.10.2, but I see the same behavior on 24.04.1. The trigger seems to be having three pools; the order of import does not seem to affect things as far as I can tell.
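
One sanity check that can help narrow this down, independent of smartd itself, is to query a couple of the drives directly and see whether they still answer SMART requests at all. If these hang or return I/O errors, the problem is on the SAS path rather than in smartd’s configuration (device names here are just examples):

    # List the disks the HBA is currently presenting to the OS
    lsblk -o NAME,MODEL,SERIAL,SIZE

    # Query one drive directly: -x dumps everything, -H is a quick health summary
    smartctl -x /dev/sda
    smartctl -H /dev/sda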

Anyone have a clue here?

Edits: Better pictures and clarification

Little bit more information here. This is the storage page:

At this stage the only pool that appears to be broken is G_Backup, and we can verify that with zpool status -v:

However, I seem unable to view the data on any of these pools.

Not only that, but if we clear the G_Backup pool that is currently in the suspended state, the others seem to freak out in turn:

I am unsure of the “why” here. The trigger is definitely having three pools, but I am unsure whether this is a hardware limitation or some random issue with TrueNAS. I have a new cable coming in for the HBA tomorrow, so I’ll try that… but it’s very strange.
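
For anyone following along, the commands in play here are roughly the following (G_Backup is the pool from the screenshots above; a suspended pool generally will not accept I/O again until its error state is cleared, and clearing only helps if the underlying devices are actually reachable):

    # Detailed status, including per-device read/write/checksum errors
    zpool status -v G_Backup

    # Attempt to clear the error/suspended state on the pool
    zpool clear G_Backup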

Tried moving ports around; no good. I’ve found a few other posts around the internet about similar issues, but I don’t see a solution:

https://www.reddit.com/r/homelab/comments/1c0ah40/ds4246_and_proxmox_zfs_io_errors/

Just for fun, I tried importing the pools on CORE 13.3 to see if the issue was present there as well. It is not, so this appears to be an issue with either SCALE or maybe something upstream in Debian?

Have you cabled both controllers on the DS4246 by any chance? This will potentially create an unsupported SAS multipath configuration.

If I recall correctly, the DS4246 only has a single SAS in/out per controller - connect only a single cable to it, and let’s see what happens.

Just the one cable. I tried various ports on it, to no avail.

So on CORE 13.3, I imported all three pools and rebooted five times, and did not experience this issue. It seems SCALE-specific… not sure what would fix this, though.

Have you got a debug file that you can collect from SCALE to file a bug ticket with?

I don’t offhand, but I can totally run another install real quick and trigger the issue. How do I create a debug file for that?

Sorry, evidently I am incompetent and can’t operate a reply button…

System → Advanced → Save Debug will do it. Then you can file a support ticket through the Report a Bug link at the top of the forums - don’t attach the debug as a regular file, but wait for it to confirm the bug was filed, and you’ll get a prompt for a secure upload portal.


Thanks. I have submitted a ticket.

https://ixsystems.atlassian.net/browse/NAS-129294


Got your bug, and confirming that the debug is attached and correctly hidden from public view. 🙂

Thanks for linking back to the forum post as well for further context.

Having a similar issue with the same server, an R620, and an LSI 9207-8e. If I reboot the server, sometimes one, two, or all three pools will get I/O errors and eventually lock out. If I keep rebooting, eventually all the pools will come online and be happy. No issues on CORE; this started when I upgraded to SCALE. The SAS card is in IT mode, and I also tried another card model with the same issue.

If either of you would like to PM a debug file I can take a look for you tonight.

Looks like the ticket got closed, citing no support for JBODs on the community version…

My ticket also got closed, on 24.04.0 RC1.

I have the same issue.

Dell R510xd
NetApp DS4246

My workaround for now if I really need to reboot (rough command sketch after this list):

  • stop SMB and uncheck start at boot (so you don’t lose your SMB share config)
  • stop any other service related to the pools
  • export the pools
  • reboot
  • import the pools manually after reboot
  • start all services that I stopped before the reboot

This is not a solution, it is just what I do.
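
Roughly speaking, the export/import part of that list boils down to something like the following (pool name is a placeholder; on TrueNAS it is generally safer to export and import through the Storage page so the middleware stays in sync, but these are the underlying commands):

    # Before the reboot: cleanly export each pool
    zpool export G_Backup

    # After the reboot: see which pools are available, then import by name
    zpool import
    zpool import G_Backup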

I am stuck with TrueNAS SCALE for now. I updated the ZFS feature flags on my pools because I thought they were related to the issue.

Big mistake to update the ZFS flags and not be able to get back to TrueNAS CORE, but lesson learned.
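
For what it’s worth, you can check which feature flags a pool actually has in use before deciding whether an older OpenZFS (like the one shipped with CORE) can still import it; something along these lines, with the pool name as a placeholder:

    # Lists any pools that do not have every supported feature enabled
    zpool upgrade

    # Show each feature flag and whether it is disabled, enabled, or active;
    # 'active' features are the ones an older OpenZFS must be able to read
    zpool get all G_Backup | grep feature@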

I did not have this problem on TrueNAS CORE; I had been running with this configuration since 2016.

Hey, heads up: I was able to import my pools on CORE 13.3 BETA. Really not pleased with having to do that since CORE is going the way of the dodo… but since there seems to be no fix in sight for this issue from the dev team, this looks like the best option at the moment…

I would be more than happy to… but I can’t seem to find the button to send private messages, lol…

@QuirkyKirkHax

You click on the profile pic and then there is a blue message button.

I don’t seem to have the button. I wonder if it is because I am a new user?
