TrueNAS crashed and ACLs were corrupted afterwards

So looking through the logs I see a few things.
I’m on ElectricEel-24.10.2 and have this issue (which is resolved upstream).

I think this is not my root cause but it is causing some log spam and is broken.

This ACL corruption issue on SMB has happened regularly for me over several releases, and I have to strip and reapply ACLs to retain access. I thought I had eliminated it by correcting my network configuration: I previously had 2 IPs, got rid of the 2nd connection, and stopped seeing issues for about a month.

This issue also occurred right after a scheduled task I have that deletes recycle bin files.

find /mnt/SMB_POOL/SMB_SHARE_PATH/.recycle/* -atime +30 -exec rm -rf '{}' \;
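Before re-enabling a job like this, it may help to see what it would match. The following is a rough Python sketch, not the scheduled task itself: it only lists entries and deletes nothing. The function name `stale_entries` is made up for illustration, and the age test is an attempt to mirror how `find -atime +30` counts whole 24-hour periods.

```python
import os
import time


def stale_entries(root, days=30, now=None):
    """Return paths under root last accessed more than `days` whole days ago.

    Mirrors the -atime +N rule of find: the age is truncated to whole
    24-hour periods and must be strictly greater than N. Nothing is deleted.
    """
    now = time.time() if now is None else now
    stale = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if int((now - atime) // 86400) > days:
                stale.append(path)
    return sorted(stale)
```

Calling `stale_entries("/mnt/SMB_POOL/SMB_SHARE_PATH/.recycle")` and printing the result gives a list to review by hand before anything is removed.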

I have it scheduled to run weekdays at 12pm, and it normally runs without issue. It runs as a domain administrator account on my SMB share.

I have disabled this task for the time being just in case it is causing an issue.

I’d like to get someone to have a look at my logs though, as something is still funky here: I thought I had resolved the ACL corruption issue, and I don’t see why a crash/reboot would cause that to occur. I’m not sure what crashed either; looking at the logs, I just see it start rebooting a few minutes after 12pm yesterday.

Actually the net-snmpd issue appears not to be fixed until net-snmpd 5.10 releases (which should be soon).

What do you mean by corruption?

I mean access from clients to the share suddenly stops working, and stripping and reapplying the ACLs allows them to connect again. As to exactly what is happening, maybe that needs to be investigated next time it happens. I’d be happy to run whatever is needed to analyze that before I repair the share.

So you haven’t determined whether they’re corrupted. Are the ACLs changing between working / non-working? If they are changing have you tried to figure out why they are changing? For example, some apps will perform recursive permissions operations (potentially breaking permissions).

From when I have looked at this in the past I think that is the case, the ACLs are changed and no longer match the actual Active Directory domain ACLs. But I’m not sure what I need to do to generate the info I need to confirm that.

If they are NFSv4 ACL type then look at nfs4xdr_getfacl command, otherwise getfacl. Permissions do not change on their own. This is typically either an SMB client changing permissions, or more likely some other process doing either a recursive chown / chmod (which some poorly-designed apps do on startup).

I don’t think it’s an application causing this, as I have a few different shares, and one of the shares that is often inaccessible is an archive share that pretty much nobody is accessing or modifying. It gets mounted to most of my clients but isn’t data that anyone is typically accessing (it’s like 20-year-old dwg drawings and such).

Also, I should be explicit: it’s the filesystem ACLs that I reapply, not the share ACLs.

Right. As I mentioned, the permissions don’t change themselves. You’ll need to track down what is doing the action. Actual ACL corruption (not a permissions change) would be something completely different and more serious (typically this would manifest via a panic).

It appears I am set up with NFSv4 ACLs. nfs4xdr_getfacl -v -i appears to dump as much info as possible. I saved that info for 2 of my shares so I can compare later.
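For the later comparison, a plain text diff of the saved dumps is probably enough. A minimal Python sketch, assuming each dump was redirected to its own text file (the helper name `acl_dump_diff` is hypothetical and nothing here depends on the exact output format of nfs4xdr_getfacl):

```python
import difflib


def acl_dump_diff(before_text, after_text):
    """Return unified-diff lines between two saved ACL dumps.

    An empty list means the two dumps are identical line for line.
    """
    return list(
        difflib.unified_diff(
            before_text.splitlines(),
            after_text.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )
```

Reading the working-state dump and the broken-state dump with `open(...).read()` and passing both strings in shows exactly which ACL entries, if any, differ.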

Also there is still the issue of the crash itself…

The crash is something new, but your permissions being changed was something old?

The crash / reboot is new (TrueNAS rebooted itself). It’s been about a month since the last issue with ACLs, but previously they did not involve crashes.

So the crash / reboot then is unrelated to the permissions issues you are periodically seeing.

I can’t say for sure. I just know they happened at the same time yesterday.

Since the permissions issues were happening for a time (and were apparently addressed by a network fix at one point) the odds that they are related to the reboot are small. Your time is probably better spent trying to figure out what’s changing when permissions are working vs non-working.

Typical reasons for this would be:

  1. client changing permissions
  2. local process changing permissions
  3. (Since you have AD) – connection to AD domain breaking for some unknown reason. Various layers of caching can obscure when domain is being wonky. We do health checks every 10 minutes and should alert if things go sideways (assuming you have alerts configured).
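One rough way to test reasons 1 and 2 is to look for entries whose inode change time (ctime) moved after the share was last known to work, since chmod, chown and ACL changes should normally bump ctime. A Python sketch under that assumption follows; `changed_since` is a made-up name, and ordinary writes also update ctime, so a hit is a lead rather than proof.

```python
import os


def changed_since(root, since_epoch):
    """Return (path, ctime) pairs under root with ctime >= since_epoch.

    Checks root itself, every subdirectory, and every file. Nothing is modified.
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            try:
                ctime = os.lstat(path).st_ctime
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if ctime >= since_epoch:
                hits.append((path, ctime))
    return sorted(hits)
```

On an archive share that nobody writes to, an empty result would suggest nothing on disk was touched and point toward reason 3 instead.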

Yes alerts are configured and the directory services are at the default WARNING level and I should be getting emails at the WARNING level.

I doubt it’s 1. because this would have also existed before we started using TrueNAS (the shares were originally hosted directly on the domain controller for about a decade before I got involved).

Ok, so I have lost access to one of my non-critical shares, which means I have more time than usual to investigate it.

I compared the output of nfs4xdr_getfacl -i -v on the root of the share and there are no changes but I still cannot access it from my clients.

From one of my Windows clients I first get a notification that there was a problem accessing the share, and then it switches to a dialog.

Checking in via the TrueNAS GUI, it appears it is failing to resolve the group IDs to the group names… the group IDs on disk have not changed though.
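The same lookup can be tried outside the GUI. This Python sketch takes a list of GIDs (for example, copied by hand from the ACL dump) and reports which ones the system cannot map to a group entry. It uses the standard `grp` module, and whether that reaches AD groups depends on how name resolution is configured on the box, which is an assumption here. `unresolved_gids` is a hypothetical helper name.

```python
import grp


def unresolved_gids(gids, lookup=grp.getgrgid):
    """Return the GIDs from `gids` that `lookup` cannot map to a group entry.

    grp.getgrgid raises KeyError when no entry exists for a GID.
    """
    missing = []
    for gid in gids:
        try:
            lookup(gid)
        except KeyError:
            missing.append(gid)
    return missing
```

If the GIDs stored on the failing share come back as unresolved while those on a working share resolve, the problem is in ID mapping rather than in the ACL itself.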

Also currently this is only failed for 1 share out of about 7… so not sure how that happens.

Also, I have 2 ACL presets that use some of the same groups. When I try to reapply the preset to the JobArchives share that is currently inaccessible (without stripping it), it does not resolve the group name correctly, as in the screenshot above. The other presets are still correct (e.g. the one for my main share), so something funny is definitely going on within TrueNAS itself.

So it looks like both of the groups that failed this time have spaces in their names. These groups have been this way for probably over a decade but only recently moved to TrueNAS. It’s odd, though, that they would work most of the time and sometimes error out.

Should these GIDs not match the IDs in the filesystem for this share? They are 10xxxxxxx instead of 9xxxxxx ranged ids.

It looks like the AD and SMB IDs are getting munged up?

You can see here after rebuilding the AD cache… it has changed the GIDs for some of the groups in use.
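To catch this the next time it happens, the group-name-to-GID mapping could be recorded while things work and compared after a cache rebuild. A small Python sketch of the comparison step; `gid_changes` and the dictionary layout are assumptions for illustration, not anything TrueNAS provides.

```python
def gid_changes(before, after):
    """Return {group_name: (old_gid, new_gid)} for every name whose GID differs.

    A name present in only one mapping is reported with None on the other side.
    """
    names = set(before) | set(after)
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(names)
        if before.get(name) != after.get(name)
    }
```

The two dictionaries could be built with something like `grp.getgrnam(name).gr_gid` for each group used in the ACL presets, once before and once after the rebuild.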