So looking through the logs I see a few things.
I’m on ElectricEel-24.10.2 and have this issue (which is resolved upstream).
I think this is not my root cause but it is causing some log spam and is broken.
This ACL corruption issue on SMB has happened regularly for me over several releases, and I have to strip and reapply ACLs to retain access. I thought I had eliminated it for about a month by correcting my network configuration (previously had 2 IPs, I got rid of the 2nd connection and stopped seeing issues for about a month).
This issue also occurred right after a scheduled task I have that deletes recycle bin files.
I have it scheduled to run weekdays at 12pm, and it normally runs without issue. It runs as a domain administrator account on my SMB share.
I have disabled this task for the time being just in case it is causing an issue.
I’d like to get someone to have a look at my logs though… something is still funky here, as I thought I had resolved the ACL corruption issue. And I don’t see why a crash/reboot would cause that to occur. I’m not sure what crashed either; looking at the logs, I just see it start rebooting a few minutes after 12pm yesterday.
I mean access from clients to the share suddenly stops working, and stripping and reapplying the ACLs allows them to connect again. As to exactly what is happening, maybe that needs to be investigated next time it happens. I’d be happy to run whatever is needed to analyze that before I repair the share.
So you haven’t determined whether they’re corrupted. Are the ACLs changing between working / non-working? If they are changing have you tried to figure out why they are changing? For example, some apps will perform recursive permissions operations (potentially breaking permissions).
From when I have looked at this in the past, I think that is the case: the ACLs are changed and no longer match the actual Active Directory domain ACLs. But I’m not sure what I need to do to generate the info to confirm that.
If they are NFSv4 ACL type then look at the nfs4xdr_getfacl command, otherwise getfacl. Permissions do not change on their own. This is typically either an SMB client changing permissions, or more likely some other process doing a recursive chown / chmod (which some poorly-designed apps do on startup).
I don’t think it’s an application causing this, as I have a few different shares and one of the shares that is often inaccessible is an archive share that pretty much nobody is accessing or modifying. It gets mounted to most of my clients but isn’t data that anyone is typically actively accessing (it’s like 20 year old dwg drawings and such).
Also I should be explicit: it’s the filesystem ACLs that I reapply, not the share ACLs.
Right. As I mentioned, the permissions don’t change themselves. You’ll need to track down what is doing the action. Actual ACL corruption (not a permissions change) would be something completely different and more serious (typically this would manifest via a panic).
It appears I am set up with NFSv4 ACLs. nfs4xdr_getfacl -v -i appears to dump as much info as possible. I saved that info for 2 of my shares so I can compare later.
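For comparing a saved dump against a fresh one the next time a share stops working, a plain text diff is probably enough. This is only a minimal sketch: it assumes each dump was redirected to a text file, and the function names `diff_acl_dumps` and `diff_acl_files` are made up for illustration. It does not parse the ACL format at all; it just reports which lines differ.

```python
import difflib
import sys
from pathlib import Path


def diff_acl_dumps(before_text: str, after_text: str,
                   before_name: str = "before",
                   after_name: str = "after") -> list[str]:
    """Return unified-diff lines between two ACL dumps (empty list if identical)."""
    return list(difflib.unified_diff(
        before_text.splitlines(),
        after_text.splitlines(),
        fromfile=before_name,
        tofile=after_name,
        lineterm="",
    ))


def diff_acl_files(before_path: str, after_path: str) -> list[str]:
    """Read two saved dump files (paths supplied by the caller) and diff them."""
    before = Path(before_path).read_text()
    after = Path(after_path).read_text()
    return diff_acl_dumps(before, after, before_path, after_path)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python acl_diff.py <dump-while-working> <dump-while-broken>
    changes = diff_acl_files(sys.argv[1], sys.argv[2])
    print("\n".join(changes) if changes else "ACL dumps are identical")
```

If the diff comes back empty while clients are locked out, that would suggest the on-disk ACL itself has not changed and the problem lies elsewhere (for example in how names are being resolved).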
Also there is still the issue of the crash itself…
The crash / reboot is new (TrueNAS rebooted itself). It’s been about a month since the last issue with ACLs, but previously they did not involve crashes.
Since the permissions issues were happening for a time (and were apparently addressed by a network fix at one point), the odds that they are related to the reboot are small. Your time is probably better spent trying to figure out what’s changing when permissions are working vs non-working.
Typical reasons for this would be:
1. client changing permissions
2. local process changing permissions
3. (Since you have AD) – connection to AD domain breaking for some unknown reason. Various layers of caching can obscure when domain is being wonky. We do health checks every 10 minutes and should alert if things go sideways (assuming you have alerts configured).
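The first two causes in that list (a client or a local process changing permissions) should leave traces in file metadata, so one rough way to narrow things down is to snapshot owner, group and mode for a tree while it is working and compare again when it breaks. The sketch below is an illustration only: `snapshot_tree` and `changed_paths` are hypothetical helper names, and it uses plain `os.lstat`, so it would catch a recursive chown / chmod but does not read the NFSv4 ACL entries themselves. The ctime is stored only as a hint for roughly when a path's metadata last changed.

```python
import os
import stat


def snapshot_tree(root: str) -> dict[str, tuple[int, int, str, float]]:
    """Map every directory and file under root to (uid, gid, mode string, ctime)."""
    snap = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        paths = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in paths:
            st = os.lstat(path)  # lstat: do not follow symlinks
            snap[path] = (st.st_uid, st.st_gid,
                          stat.filemode(st.st_mode), st.st_ctime)
    return snap


def changed_paths(old: dict, new: dict) -> list[str]:
    """Paths present in both snapshots whose uid, gid or mode differ."""
    common = old.keys() & new.keys()
    return sorted(p for p in common if old[p][:3] != new[p][:3])
```

Taking one snapshot while the share works and another when it fails, then looking at the ctime of whatever `changed_paths` reports, may help line the change up with a scheduled task or a client session.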
Yes alerts are configured and the directory services are at the default WARNING level and I should be getting emails at the WARNING level.
I doubt it’s 1., because this would have also existed before we started using TrueNAS (the shares were originally hosted directly on the domain controller for about a decade before I got involved).
Also, currently this has only failed for 1 share out of about 7… so not sure how that happens.
Also I have 2 ACL presets that use some of the same groups. For the JobArchives share that is currently inaccessible, when I try to reapply the ACL from the preset (without stripping it) it does not resolve the group name correctly, as in the screenshot above. The other presets are still correct (e.g. the one for my main share), so something funny is definitely going on within TrueNAS itself.
So it looks like both of the groups that failed this time have spaces in their names. These groups have been this way for probably over a decade but were only recently moved to TrueNAS… it’s odd though that they would work most of the time and sometimes error out.
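Since the failure seems intermittent, one thing that might help is to check periodically whether those group names still resolve, and log when they stop. The following is a rough sketch, not a TrueNAS tool: it assumes the AD groups are visible through the normal system group lookup (which is what Python's `grp` module queries), and the group name in the usage example is a placeholder, not one of my real groups.

```python
import grp
from datetime import datetime


def resolve_group(name: str):
    """Return the gid for a group name, or None if the lookup fails."""
    try:
        return grp.getgrnam(name).gr_gid
    except KeyError:
        return None


def check_groups(names) -> list[str]:
    """Return the subset of names that currently fail to resolve."""
    return [n for n in names if resolve_group(n) is None]


def report(names) -> str:
    """One timestamped log line listing any names that did not resolve."""
    failed = check_groups(names)
    status = "FAILED: " + ", ".join(failed) if failed else "all resolved"
    return f"{datetime.now().isoformat(timespec='seconds')} {status}"
```

Calling something like `print(report(["EXAMPLE\\Some Group With Spaces"]))` from a scheduled job every few minutes would give a timeline of when resolution breaks, which could then be compared against the directory service health checks and the time the share became inaccessible.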