TrueNAS SCALE Cobia SMB Kernel Panic

This was reproducible from two different VMs and Docker containers:

I am running Cobia 23.10.2. I have an Ubuntu 22.04.4 host that mounts the SMB share using cifs-utils. The share is mounted with AD credentials and a uid/gid mask for local control of the shared folder.
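For reference, the mount is along these lines; the server/share names, credentials file path, and uid/gid values here are illustrative, not my exact settings:

    # /etc/fstab entry on the Ubuntu host (hypothetical names and IDs)
    //truenas/share /mnt/share cifs credentials=/root/.smbcred,uid=1000,gid=1000,file_mode=0664,dir_mode=0775 0 0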

The two containers using this share were unable to report the disk size and triggered a kernel panic when attempting to move files onto the drive. They could create and read files fine and fully traverse the directory tree, but could not move or copy files into the share.

The samba4 logs showed multiple instances of this error (directory/filename.ext is a placeholder):

[2024/04/09 16:05:08.774429, 1] ../../source3/smbd/close.c:872(close_normal_file)
  Failed to disconnect durable handle for file directory/filename.ext: NT_STATUS_NOT_SUPPORTED - proceeding with normal close

I rolled back the containers and the TrueNAS version, tried recreating the share, modified the ACL, etc. No setting seemed to affect the behavior of the shares. As far as I can tell, this began after the upgrade from 23.10.1.3 to 23.10.2.

I have since modified the share to use NFS instead of SMB, and all is functioning correctly. This post is mostly informational, but I would be willing to reproduce the issue as necessary.
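The NFS replacement is nothing exotic, roughly this (illustrative server and dataset path, not my exact layout):

    # Mount the same dataset over NFS instead of SMB
    sudo mount -t nfs truenas:/mnt/tank/share /mnt/share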

My biggest concern is that the failure appears to be invisible: I can't tell what is actually going wrong. I have been using the same setup through multiple Ubuntu, Docker, and TrueNAS version upgrades.

Those errors were the only thing of value I could find in any of the logs while I was in the midst of troubleshooting to get it back up and operating.

Paging @awalkerix


You seem to have enough info to file a bug, especially since you can reproduce it and have done extensive testing.

If you can afford to, it would be helpful to know whether 24.0-RC1 has the same issue. Obviously, it would be great if it were fixed there.

Can you post the NAS ticket number here?

Well, it looks like I may have missed what should have been one of my first troubleshooting steps: I never deleted the share and recreated it.

I am unable to reproduce it now, at least using a basic bash container on a new SMB share.

Our SMB server runs in userspace, so a kernel panic from it is unlikely. There's not much I can do without a repro case.

OK, good news: I have reproduced the issue. I still can't tell exactly what causes it, but it definitely has something to do with permissions. I created a new dataset and SMB share and tried multiple permission sets.

Here is the log produced by samba4; no placeholders this time, as there is no PII. The container used is sabnzbd, and I am testing within the application by creating a file on the drive.

I also included a line that seems benign but is very annoying: it throws this deprecated-setting warning every 5 to 10 seconds.

[2024/04/11 12:31:00.030791,  1] ../../lib/param/loadparm.c:1909(lpcfg_do_global_parameter)
  lpcfg_do_global_parameter: WARNING: The "syslog only" option is deprecated
[2024/04/11 12:31:07.145396,  1] ../../source3/smbd/close.c:872(close_normal_file)
  Failed to disconnect durable handle for file outputTESTING.txt: NT_STATUS_NOT_SUPPORTED - proceeding with normal close

Here is a snippet of my redacted ACL that causes the error:
[screenshot: redacted ACL entries]

Using the Posix Open preset, the SMB share works; however, it appears that it would allow everyone in.

Using the Posix Restricted preset, I cannot change the group owner, and adding the user and group to the preset results in the same behavior (snippet):

Again, the logs are not terribly forthcoming, so I can't tell what is actually happening. My guess, based on the behavior of the container and SMB, is that the SMB connection ends up hanging, taking both the smbd service and the container down with it.

The container becomes unresponsive to both web connections and Docker socket connections (e.g., docker kill container). I am able to stop the smbd process through systemctl so that TrueNAS doesn't reset. If I let it go, memory is eventually exhausted and TrueNAS resets itself.

I have also been able to replicate this from another VM. This one is not domain-joined, but it uses AD credentials and a sabnzbd container.

It doesn't exhibit the same propensity for completely breaking everything, but it definitely hangs: the container and TrueNAS have to be forced to close their connections, and the smbd service doesn't want to stop (this behavior appeared when connecting from both systems).

I don't seem to get the same kernel panic from this system, though.

How do you have the application configured for storage? Host Path?

Yeah.

The SMB share is mounted on the host using cifs-utils. (ex: //server/share /mnt/share)
The directory is bind-mounted into the container. (ex: -v /mnt/share:/mnt/share)
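Concretely, the two steps look something like this (illustrative names; the image and mount options here are examples, not my exact config):

    # 1. Mount the SMB share on the Ubuntu host
    sudo mount -t cifs //server/share /mnt/share -o credentials=/root/.smbcred,uid=1000,gid=1000
    # 2. Bind mount the host directory into the container
    docker run -d --name sabnzbd -v /mnt/share:/mnt/share lscr.io/linuxserver/sabnzbd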

Hmm… So, like this:

graph TD;
    TrueNAS("TrueNAS")
    Ubuntu("Ubuntu")
    sabnzbd("sabnzbd (Kubernetes)")

    sabnzbd -.->|-v /mnt/share:/mnt/share| Ubuntu
    Ubuntu -->|//server/share /mnt/share| TrueNAS

Can you reproduce it if you mount the SMB share directly in the container instead? I think something is wonky with bind mounting a host path that sits on an SMB share, which I don't think is good practice anyway.

graph TD;
    TrueNAS("TrueNAS")
    sabnzbd("sabnzbd (Kubernetes)")

    sabnzbd -.->|//server/share /mnt/share| TrueNAS
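If you're on plain Docker Engine rather than Kubernetes, one way to mount directly in the container would be the local volume driver's CIFS support, roughly like this sketch (hypothetical server, credentials, and image):

    # Named volume backed directly by the SMB share
    docker volume create --driver local \
        --opt type=cifs \
        --opt device=//server/share \
        --opt o=username=aduser,password=secret,uid=1000,gid=1000 \
        smbshare
    # Attach the volume instead of a host path
    docker run -d --name sabnzbd -v smbshare:/mnt/share lscr.io/linuxserver/sabnzbd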

GitHub - kubernetes-csi/csi-driver-smb: This driver allows Kubernetes to access SMB Server on both Linux and Windows nodes.

EDIT: Looks like FlexVolume is deprecated now, so never mind, sorry! :frowning: It seems you can only use SMB as a backing store for PVCs and not actually interact with a share/files.

Yeah, that's a solid representation, other than that I'm not using Kubernetes, just Docker Engine.

That does seem more elegant. On my production system, I have multiple containers manipulating data on the same SMB share. For that use case, is it better to mount SMB directly in each container, or to mount on the host and bind mount into the containers as I've been doing?

Also, I've continued my testing, and the issue appears related to the mask. I am absolutely a hobbyist in the systems space (I am a network engineer/architect by trade), so I don't really understand POSIX permissions and the new permission sets. The ACLs made more sense to me before.

When I set the mask to full permissions (including execute), everything behaves as desired and expected. When the mask is set to none, I can connect to the share and do some limited actions, but the container can't move things around properly (this is probably by design; I've been trying to read up on what the mask does).
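From what I've read, the POSIX ACL mask caps the effective permissions of every named user and group entry, which would explain the behavior. A minimal sketch, assuming a hypothetical apps user and dataset path:

    # Grant a named user rwx, then tighten the mask
    setfacl -m u:apps:rwx /mnt/tank/share
    setfacl -m m::r-x /mnt/tank/share
    # getfacl shows the entry capped by the mask:
    getfacl /mnt/tank/share
    #   user:apps:rwx    #effective:r-x

With the mask set to none, even an entry granted rwx would be effectively nothing, which matches the container failing to write or move files.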
