I’m not sure, to be frank… The other people I’ve spoken to are also using NetApps, so maybe? But I’d hesitate to buy new hardware for this issue when CORE seems to function properly. This still reeks of a software/OS issue to me.
The issue is I just can’t run my setup on CORE.
Damn. Sadly iX doesn’t seem interested in fixing this issue. I would be curious to see if another JBOD would work without issue on this, but I suspect it’ll be more of the same, to be frank.
I might suggest trying a rootdelay in the kernel params, or maybe tweaking GRUB. I don’t know that it will work, but at this point I am out of ideas… Please keep me posted.
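If you want to try it, something like this on a stock Debian box is what I had in mind (TrueNAS manages its own boot config, so it may well overwrite this on update, purely illustrative):

# in /etc/default/grub, tack a root delay onto the kernel command line so boot waits for slow enclosures
GRUB_CMDLINE_LINUX_DEFAULT="quiet rootdelay=30"
# then regenerate the boot config and reboot
update-grub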
One thing I’d love to try, if I had the hardware, would be making 3 pools with a single disk each on a NetApp to see if it still breaks.
So I have a second R620 I had intended to use as a bare-metal host for another TrueNAS install (to migrate my primary to it), and for fun I installed TrueNAS 23.10 on there in a mirror across 2 SSDs. I then created 3 single-disk pools (all internal, using the H310 flashed to IT mode), and it seems to work fine.
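For reference, the CLI equivalent of what I did is roughly this (the by-id paths are just placeholders for the three internal disks):

# one single-disk pool per drive
zpool create test1 /dev/disk/by-id/ata-DISK1
zpool create test2 /dev/disk/by-id/ata-DISK2
zpool create test3 /dev/disk/by-id/ata-DISK3
# confirm all three pools are ONLINE
zpool status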
Unfortunately I do not have another JBOD to test with here. One other thing I want to try is actually plugging in the second power supplies on the JBOD and Dell server… Will that make a meaningful difference? Probably not, but I am straight up out of things to try lol
I found a thread on Reddit regarding something similar happening with an Ubuntu server… I am curious if this is some kind of upstream Debian issue with enclosures…
https://www.reddit.com/r/DataHoarder/comments/yxd17g/netapp_ds4246_and_lsi_sas2308_on_truenas_issue/
I took my NetApp out of service and put in a Supermicro (EL1) single-channel backplane JBOD; the issue is now resolved…
I needed the additional disk slots anyway
Dunno if this is gonna work or not, but I bought a used IOM12 control board to see if maybe replacing the IOM6 is the solution here. I have the cable coming in for it on Sunday, and I’ll post here with the results.
No dice. The DS4246 won’t even fully power on with the IOM12.
I’m going to wager that this is a driver issue with LSI.
I’m going to eventually hook my old 4246 up to another machine and see if I can get it to behave. I need a few days of decompression after fighting it for a good long while.
If you find a solution, I will buy you a coffee… I’ve been fighting off and on for a while myself, with no real luck. I’m avoiding an 800-dollar equipment purchase like the plague lol
I’ve experienced similar, and it’s super maddening.
I’ve also noticed that, from a fresh start, while TrueNAS is booting, the NetApp (which has been running for a while prior) ‘blinks out’ (for lack of a better phrase): all but the SAS drives’ activity lights go out (no lights on the drives whatsoever), and all the SATA drives stop responding.
After that, the drives are just not accessible, with the exception of the SAS drives. At this point TrueNAS has racked up a few read/write errors and just puts the pools into I/O suspend. All drive lights come back on, but it’s too late.
Doing a sas2ircu 1 DISPLAY (I have 2 HBAs) shows the controller and the SAS drives themselves, but all the other SATA drives are gone: online light on, but no activity.
If I change which controller my HBA is connected to, it seems to pick it back up, at least partially. I have to run zpool clear on my pools before they go completely into I/O suspend, but it works until the next reboot.
Rerunning sas2ircu 1 DISPLAY then shows all drives, SAS and SATA.
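For anyone following along, the checks look roughly like this (controller index and pool name are just what they happen to be on my box):

# list the HBAs sas2ircu can see, with their index numbers
sas2ircu LIST
# dump controller 1's view of the enclosure - the SATA drives vanish from this list when it breaks
sas2ircu 1 DISPLAY
# clear the accumulated read/write errors before ZFS suspends the pool ("tank" is a placeholder)
zpool clear tank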
I’m not 100% sure if interposers would fix my issue; these are just my observations.
In my case interposers did not change anything…
Turns out the extra system I had is having CPU issues so I’m not sure I’ll be able to test for a while.
While I wait, out of morbid curiosity, is everyone running a pair of IOM6’s?
I have tested two separate IOM6s and two IOM3s, only one connected at a time…
I tried with and without interposers
I have changed my HBA and all cables…
Same problems
I ditched my NetApp and bought 12 x 20TB hard drives for my Dell R710XD.
TL;DWTR - Try removing the second IOM6 from the chassis (or at least sliding it out far enough to disconnect it from the backplane) and throwing it at the wall like I almost did.
I have also ditched it for a Supermicro JBOD but I want to try and save this for when/if I/you/everyone needs another 24 bays…
I told @QuirkyKirkHax I’d be poking this bear again now that the NetApp is not in use and I’ve got some free time. While I was looking at the errors, re-reading everything, and just grasping at straws, I saw some mention about multipath/MPIO/etc. I wondered, what happens if there is no option for a second path? I pulled one of the IOM6s completely out of the chassis, threw some super old drives in it, and connected it up. No issues. Built a Z1 pool and tested that. All good. This is where I was stuck, though, because I can’t reboot my TrueNAS (ElectricEel-24.10-BETA.1) box right now (large backup syncs running), so I can’t test whether the reboot issues pop up.
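Side note: if anyone wants to check whether their box is actually seeing two paths per disk with both IOM6s cabled, something like this should show it (the second command needs multipath-tools installed):

# duplicate WWNs in this list usually mean the same disk is visible down two paths
lsblk -o NAME,SIZE,WWN,SERIAL
# list any multipath maps the OS has built, if multipath-tools is present
multipath -ll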
Then I remembered Proxmox has ZFS support and my development Proxmox machine isn’t busy. Exported the pool on TrueNAS and imported it on the Proxmox machine with no issues. I’ve rebooted 3 or 4 times now without exporting the pool and have had no issues.
I couldn’t put this to bed because I had changed 2 things: the OS and removing an IOM6. Then I figured, well, Proxmox and ElectricEel-24.10-BETA.1 are both Debian derivatives. They’re running different kernel versions (6.8.x on Proxmox versus 6.6.x on TrueNAS), but when you check the kernel module info on both, you see they’re using the same version of the mpt3sas module (43.100.00.00). Maybe something in Proxmox’s 6.8.x fixes it?? I had to go deeper. Exported the pool from Proxmox.
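The comparison I’m talking about is just this, run on each box:

# kernel version
uname -r
# version of the mpt3sas driver shipped with that kernel
modinfo mpt3sas | grep -i '^version'
# or, if the module is already loaded
cat /sys/module/mpt3sas/version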
Spun up a TrueNAS ElectricEel-24.10-RC.1 VM, passed through the LSI card, and fired it up. I had zero issues importing the pool. Did a few normal reboots through the UI and CLI without exporting. No issues. Set up an SMB share, copied a couple hundred GB to it. No issues. Force rebooted the VM via Proxmox, no issues upon reboot. I force rebooted the entire machine without shutting down the VM or exporting. No issues once everything booted back up. For poops and giggles I checked, and RC.1 uses the same mpt3sas version as Proxmox and BETA.1.
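For anyone wanting to reproduce the VM test, the passthrough side on Proxmox is roughly this (VM ID and PCI address are placeholders, and IOMMU has to be enabled in the BIOS and bootloader first):

# find the PCI address of the LSI HBA
lspci | grep -i lsi
# hand the whole card to the VM (ID 100 here); the host loses access to it while the VM runs
qm set 100 --hostpci0 0000:03:00.0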
I hate to say it, because I want to slap myself for not trying this before jumping ship (aka spending money), but it looks like removing that second IOM6 fixes it. I didn’t test whether it has to be the top slot populated, but that’s the one I’ve got in. The second one almost found a resting place in my wall.
Only one IOM module was inserted; sometimes it rebooted fine and sometimes it didn’t.
Interesting thread. Sorry to hear about your troubles; I can feel your pain, trust me.
I had three Supermicro 45-bay JBODs many years ago, and when I added drives to the rear of the chassis I’d get loads of CAM control errors, but not at the front. It turned out the internal wiring of the chassis needed changing. I was told by the hardware engineer that for some OSes it needed to be wired one way and for others the other way. I was running FreeNAS (FreeBSD) at the time. Sure enough, after the internal re-wire those systems ran great for many years. My point is it’s not always a case of hardware good or bad, OS good or bad; it can sometimes be much more nuanced.
Yeah it is really a pain…
My setup with the NetApp ran for at least 6 years without problems on TrueNAS CORE.
Since I changed to TrueNAS SCALE the nightmare began.
I want to mention that the problems I had were only on reboots with the pool attached. If I exported the pool, then rebooted, and then imported the pool again, it ran fine until the next reboot without exporting the pool first.
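In other words, the workaround that kept it running was basically this (“tank” standing in for the real pool name):

# before rebooting, detach the pool cleanly
zpool export tank
# ...reboot...
# after boot, bring it back
zpool import tank
# if it doesn't show up by name, list importable pools first
zpool import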
I had no errors whatsoever, no SMART errors, nothing; all scrub tasks went fine…
In my opinion this is a driver incompatibility.
Yea, I’m not too sure why dropping to one IOM works but it does for me. I don’t have a need for the other 24 bays right now, so I’m going to just shelve this shelf (haha) for now.
Like @Bloodpack, mine ran for years on CORE without issue, with both IOMs installed but only one connected.