SCALE and ESXi iSCSI Dead Status

So my battle with SCALE has hit another snag. I have a test system that is using the hardware from my second ESXi host (I am running 1 host currently). I have gutted this system and all it has is a pair of M.2 boot disks and 2 SAS controllers. I put 2 SSDs in there just to be able to make a test pool.

As virtualization is borked on 25.04.01, I thought I would do some testing of other things. One such thing was iSCSI. After restoring my CORE config to SCALE and re-arranging all of the networking to work properly, I tried to set up a new iSCSI share to test if it worked as expected. I have tried both the wizard and a manual setup, and either way I keep getting a Dead status:

I added the dynamic Portal connection in ESXi:

and the static links show up:

(I renamed the portal during testing - made no difference)

On the SCALE side, I have a simple setup:

(yes, it is nested in a datastore - tried it this way and in the root and it made no difference - it is a zvol)

You can see in TrueNAS that the ESXi system is connecting to the target:

but in ESXi it is showing as dead:

But TrueNAS thinks things are swell…

I am so stumped currently. I have rebooted SCALE (a number of times) and tried over and over with no luck. The iSCSI networks can ping between ESXi/SCALE of course. 2 NIC ports, all 10G through a 3850X. Each subnet is on a separate VLAN. This setup works fine with CORE:

and here is SCALE’s network:

Finally, here is what SCALE is reporting:

So, yeah, I am stumped. I really cannot reboot ESXi right now without an outage. May punt this SCALE back to default settings and start over - see if there is something coming from the CORE restore.
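Ping only proves ICMP reachability, so one more check that might be worth running from a machine on each iSCSI VLAN is a plain TCP connect to the portal port. Below is a minimal Python sketch of that idea. It assumes the default iSCSI port 3260, and the function names and any addresses passed in are placeholders for illustration, not values from this setup.

```python
import socket

ISCSI_PORT = 3260  # assumed default iSCSI portal port; change if the portal differs


def portal_reachable(host, port=ISCSI_PORT, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


def check_portals(hosts, port=ISCSI_PORT, timeout=2.0):
    """Map each portal address to True/False depending on TCP reachability."""
    return {host: portal_reachable(host, port, timeout) for host in hosts}
```

Calling `check_portals([...])` with one portal IP per subnet would show whether the port answers on both paths. A successful connect only says the listener is up; it does not say anything about the iSCSI login or LUN mapping.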

Any help at all would be FANTASTIC - I sure hope that I did not make some rookie mistake here…

Cheers,

So I have seen some weird edge cases trigger bugs in older VMware versions. All Paths Down (APD) is triggered when the storage is rebooted in, say, a non TrueNAS Enterprise HA context. If the outage lasts longer than 140 seconds, an APD is triggered in VMware. This can happen accidentally during maintenance, for example because of the wrong order of operations. VMware should be able to handle that either manually or automatically, but I have seen cases in certain 6.x versions where this does not work. The only solution I have ever found is the one below. I never ran into it personally on 7, and I have not used 8 in any meaningful capacity. Their docs actually scope 7 and 8, though.

  • All affected ESXi hosts may require a reboot to remove any residual references to the affected devices that are in an APD state.

https://knowledge.broadcom.com/external/article/318850/all-paths-down-for-a-storage-device.html

I don’t think your issue is related but this is interesting because your problem sounds like it’s the opposite problem?

The vmk log files might be more interesting, I think. Can you share those, @scharbag?
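If the logs are large, something like the following could trim them down before posting. This is only a rough Python sketch: it assumes the log has already been copied off the host as a plain text file, and the keyword list is an illustrative guess rather than an official set of ESXi log markers.

```python
import re

# Illustrative keywords only (an assumption); adjust to whatever is relevant.
KEYWORDS = ("iscsi", "apd", "all paths down")


def filter_log_lines(lines, keywords=KEYWORDS):
    """Return the lines that contain any keyword, matched case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


def filter_log_file(path, keywords=KEYWORDS):
    """Read a copied log file and return only the matching lines."""
    with open(path, "r", errors="replace") as handle:
        return filter_log_lines(handle, keywords)
```

For example, `filter_log_file("vmkernel.log")` (a hypothetical local copy) would return just the iSCSI and APD related lines, which is easier to paste into a thread than the whole file.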

I am running ESXi 7.0.3…

All my CORE paths are fine. Just new SCALE paths are being stupid. Set back to default and now ESXi will not even find a target on the SCALE machine. Research suggests I need an ESXi reboot but not right now… I am tired and crabby.

This has eaten my lunch for sure. Booo.

But thanks for the reply and I will look into this when I can.

Cheers,
