On reboot, the disks attached to the LSI 9207-8e HBA, which connects to the NetApp shelf (in dumb JBOD mode), throw errors,
and the pools on that 24-disk shelf often get suspended and fail. Disk Sync runs like crazy in the top-right task view on each reboot while the errors are being thrown.
Current Resolution:
- Power down the disk shelf
- Force a cold boot of the Dell server
- Export the pools that live on the powered-down disk shelf (after the server has finished its cold boot)
- Power up the shelf (wait for all disks to show up)
- Import the pools
Result: NO ERRORS, but we can't reboot without the errors occurring again.
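For anyone wanting to script the export/import part of the workaround above, here is a minimal sketch of the ordering. It is not a tested tool: the pool names `media` and `working` are placeholders, the function names are made up for illustration, and by default it only prints the `zpool export` / `zpool import` commands instead of running them. It also assumes that using `zpool` directly is acceptable on your system.

```python
import subprocess

# Placeholder pool names; substitute the pools that live on the shelf.
SHELF_POOLS = ["media", "working"]


def export_commands(pools):
    """Commands for the export step (after the cold boot, shelf still off)."""
    return [["zpool", "export", pool] for pool in pools]


def import_commands(pools):
    """Commands for the import step (once every shelf disk is visible)."""
    return [["zpool", "import", pool] for pool in pools]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(export_commands(SHELF_POOLS))
    print("# power up the shelf and wait for all disks to appear")
    run(import_commands(SHELF_POOLS))
```

Keeping the dry run as the default means the script can be read through and checked before anything touches the pools.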
Any ideas on how to troubleshoot this?
I'm considering replacing the HBA.
Hardware
Dell PowerEdge 720
LSI 9207-8e HBA → to the NetApp shelf
NetApp DS4246 with 24x WD Red 3.5" SATA spinners and interposers (this is just functioning as a JBOD)
Software
TrueNAS SCALE Dragonfish (current)
Glad to see this isn’t just me at least…
Just curious, how many pools are there on your system, and how many are in the JBOD?
I am experiencing this with 4 pools total on the system (including boot), and 3 of those pools are on the JBOD.
I have 4 pools total, including the boot pool:
Boot pool is in the Dell's front panel, NOT in the JBOD (it has ZERO issues)
App pool is also in the front panel, NOT in the JBOD (it has ZERO issues)
Media pool: 3 RAIDZ2 vdevs, 18 disks total (in the JBOD, has problems)
Working pool: 2 RAIDZ1 vdevs, 6 disks (in the JBOD, has problems)
All 24 disks in the NetApp DS4246 JBOD are affected.
Interesting… Your total number of pools is consistent with my issue. I was thinking that moving pools to the front panel might resolve this, but it appears that the third non-boot pool is the trigger in general, irrespective of the JBOD…
When you look at the job history on the borked reboot, do you see several disk_sync.all jobs? And what do the disk labels look like in the webgui?
Yes, the disk sync jobs go crazy and my logs look like yours in the other thread.
The disk sync keeps going for about 5 minutes, over and over, after reboot.
How many power supplies are in use in your netapp? Don’t think it matters really, just trying to find how many commonalities there are.
Have you found a solution to this issue? I haven’t been able to do much of use lately on my end.
Out of curiosity, what is the output of zpool status?
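If the full status output is too long to paste, a rough sketch like the one below could pull out just the pools that are not healthy. It assumes the tab-separated text printed by `zpool list -H -o name,health`; the function name is made up, and the sample text is invented for illustration, not output from either system in this thread.

```python
def unhealthy_pools(listing):
    """Return (name, health) pairs for pools whose health is not ONLINE.

    `listing` is text in the tab-separated form printed by
    `zpool list -H -o name,health`.
    """
    problems = []
    for line in listing.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, health = line.split("\t")
        if health.strip() != "ONLINE":
            problems.append((name, health.strip()))
    return problems


# Invented sample, NOT real output from this thread.
SAMPLE = "boot-pool\tONLINE\nmedia\tSUSPENDED\nworking\tDEGRADED\n"

if __name__ == "__main__":
    for name, health in unhealthy_pools(SAMPLE):
        print(f"{name}: {health}")
```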
I’m not an IT guy, but do I understand you both correctly: you have a bunch of drives in a JBOD configuration and are trying to use ZFS? If true, I’m fairly certain that will end in disaster. Maybe you could explain a bit more.
JBOD as in a disk enclosure, not a JBOD disk configuration.
The NetApp shelf was taken out of service; a Supermicro JBOD with a single-channel (EL1) backplane replaced it…
The issue is resolved.
Thanks for the update…
sigh
Gonna be a fun purchase to run by the missus…
I’m assuming you’ve stuck to CORE 13.3 beta?
I’m having identical issues with the exact same setup. I really don’t want to swap out a working enclosure for this.
I have stuck with 13.3 BETA for the time being. I spent a loooong time poking and prodding at this, and all I can figure is it's either:
- Some kernel fuckery with SAS Timing
- An Upstream Debian problem
- Possibly an issue with the NetApp firmware on the IOM6 modules (might be able to try something with the IOM3s, but you’d take a performance hit).
I want to try the Electric Eel beta, but I haven’t had a free chance in recent weeks to screw with it… Please let me know if you find anything.
I spent a bit of time poking around myself. The issue is in ElectricEel-24.10-BETA.1 as well so don’t waste your time.
I compared the TrueNAS kernel source (6.6.44) against the vanilla kernel source (kernel.org 6.6.44). The only differences are some additions for NVME to SAS encapsulation (on the TrueNAS side). This leads me to believe this is a hosed combination on any Linux distro.
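For anyone who wants to repeat that kind of tree comparison, here is a rough sketch using Python's standard `filecmp.dircmp`. It is not necessarily how the comparison above was done; the function name is made up, and it assumes both source trees are already unpacked locally. Note that `dircmp` uses a shallow comparison (file type, size and mtime first, contents only when those differ), so treat the result as a starting list rather than proof.

```python
import filecmp
import os


def differing_files(left, right):
    """Recursively list paths that differ or exist on one side only."""
    result = []

    def walk(cmp, prefix):
        # Files present on both sides whose comparison failed.
        for name in cmp.diff_files:
            result.append(("differs", os.path.join(prefix, name)))
        # Entries present in only one of the two trees.
        for name in cmp.left_only:
            result.append(("left only", os.path.join(prefix, name)))
        for name in cmp.right_only:
            result.append(("right only", os.path.join(prefix, name)))
        # Descend into directories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return sorted(result)
```

Calling `differing_files("truenas-linux", "vanilla-linux")` (both directory names are placeholders) would return a sorted list of `(status, relative_path)` tuples.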
Much like yourself, after several reboots everything works until that next reboot. This is unfortunate for me because my first reboot after installation worked flawlessly, so I transitioned all my infrastructure over to leverage Scale’s capabilities. I don’t have the time or the inclination to revert to Core. For now, I’m just not rebooting and have ordered a Supermicro JBOD like @Tyler_Shield. Like him, I can also use the extra bays so eh, I’m sorta ok with the switch.
Once that arrives, I believe I have a spare machine I can hook the DS4246 up to with a 9200-8e (same issue as the 9207-8e), and I’d be happy to give someone, or a team, full access to it to test.
Well, the Supermicro just worked… It is an EL1 Supermicro.
Hate to ask, but could you link me to the model you bought?
Got mine yesterday and got all the disks swapped over.
It works perfectly. It is a bit louder than the DS4246 but I doubt that’s an issue if you’ve been using the DS4246 for a while.