Something really weird/scary keeps happening to my system. I first saw it after updating to 25.10.0.1, so I reverted to 25.10.0, which didn’t fix it. I then rebuilt from scratch on 25.04.2.6, and it behaved fine while I configured everything.
At some point during the later stages of configuration, while rsync’ing data into a couple of folders to replace files that had disappeared from them, something deleted a load of files from /etc, causing PAM to error (→ no SSH, no web UI shell, all jobs failing). Rebooting threw up further errors from nginx (→ no web UI) and multiple other boot errors, all relating to missing configs in /etc.
I’ve just rolled back to an earlier snapshot of all the boot-pool datasets, rebooted and it’s back up and running.
I can pin down the most recent occurrence of the /etc problem to within about 10 minutes. There’s nothing obvious in syslog, just a raft of complaints from processes that can’t open a shell. /var/log/error matches that with a load of PAM errors. The journal shows nothing interesting.
With the files disappearing from other datasets, it’s a bit random: never folders, only lowest-level files, and not all files either; one folder still has an XLS but is missing a load of DOCs.
This is a real pain: a major issue, with real potential for data loss.
Does anyone have any ideas what could be causing this or where else to look for clues? Thanks in advance!
System: 10th-gen i5, 32GB RAM. Boot is an NVMe drive; the data volume that also has disappearing files is a mirror of two 2TB SATA SSDs.
This is utterly bizarre. I’ve just come back to the system after leaving it last night with about 700GB of data on my main mirror. There is now less than 200GB of data there. Whole folder trees are empty of files, with just the folder structure left. In the boot-pool, a load of files have gone missing from /etc again, and the PAM errors etc. are all back. Checking my other disks, I’m also missing a ton of files from the simple 1-disk pool; again, the folders are there, the files are not, and ‘du’ reports much less than ought to be there. That volume doesn’t even have snapshots scheduled yet; it’s the simplest pool possible.
Looking at the datasets, those where I took a manual snapshot after re-loading the data yesterday look like this: USED 5.4G, USEDSNAP 5.4G, USEDDS 8M, REFER 8M, WRITTEN 5.4G (they should have 5.4G of data). I’m not very knowledgeable about ZFS, but that looks to me like a massive deletion of files happened after the snapshot: nearly all the space is now held only by the snapshot, with just 8M of live data still referenced. /etc is similar, going from 7M of data on an early snap to 5M in snapshots + 2M of live data now. Restoring the snapshots recovers all the missing files from those datasets.
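For anyone wanting to check the same accounting on their own system, ZFS exposes those columns directly (the pool name ‘tank’ below is just a placeholder, not from my setup):

```shell
# Per-dataset space accounting. USED ≈ USEDDS (live data) + USEDSNAP
# (space held only by snapshots), so a dataset whose USEDSNAP suddenly
# balloons while USEDDS shrinks is consistent with a mass deletion of
# files after the last snapshot was taken.
zfs list -r -o name,used,usedbysnapshots,usedbydataset,referenced,written tank
```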
The only job scheduled to run overnight was an rsync out from an unaffected pool on a different disk. Scrubs are switched off. At first sight, nothing informative in the logs. I’ve downloaded the whole of /var/log to my PC for closer review. ‘zpool status -v’ shows no errors anywhere.
I’ve never heard of TrueNAS behaving like that, and I’m afraid I can’t offer much direct help here.
But maybe you could share some more details about your setup (is TrueNAS running on bare metal, etc.?).
With a bit more information, other users here might be able to piece together what’s going on.
OMG. I think I’ve found it, thanks to the job logs. A cron job that’s supposed to clear out old backups from a particular folder ran at about 23:45. Its log is full of errors about being unable to delete read-only files from /usr… which it shouldn’t be looking in at all!
Looks like its environment went gaga somehow and the variable holding the folder to delete from wasn’t defined. The effect of running ‘find -mtime +30 -exec rm -f {} \;’ without a path, I dread to think, but I can guess!! ‘Dear Linux, please remove all files over 30 days old from my entire system’?
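For what it’s worth, the usual belt-and-braces fix is to make the script refuse to run when the directory variable is empty, rather than letting ‘find’ fall through to the current directory. A minimal sketch, with made-up names (not the actual TrueNAS job):

```shell
#!/bin/sh
# Hypothetical hardened version of a backup-cleanup cron job.
# cleanup_old_backups DIR deletes regular files older than 30 days
# under DIR, and aborts loudly if DIR is missing or empty instead of
# letting find default to the current working directory.
cleanup_old_backups() {
    dir="${1:?cleanup_old_backups: no backup directory given - refusing to run}"
    find "$dir" -type f -mtime +30 -exec rm -f {} +
}
```

The ‘${1:?message}’ expansion is standard POSIX shell: it prints the message to stderr and aborts a non-interactive script when the parameter is unset or empty. Adding ‘set -u’ at the top of the script would make any other unset variable fail the same way.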
You know what, I’ve never seen that happen in all my years with Linux. I would’ve thought that ‘find’ would error if it’s not given a path, but apparently not so, it just helpfully assumes the CWD I guess!
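That’s exactly it, at least with GNU findutils (which is what a Debian-based system like TrueNAS SCALE ships): POSIX ‘find’ requires a starting path, but GNU ‘find’ documents that when none is given it assumes ‘.’. Easy to demonstrate safely in a scratch directory:

```shell
# Demo: GNU find with no starting point silently assumes '.' (the CWD).
tmp=$(mktemp -d)
cd "$tmp"
touch -d "40 days ago" ancient.txt   # GNU touch: backdate the mtime
touch fresh.txt

# No path given, so find walks the current directory instead of erroring:
find -type f -mtime +30              # prints ./ancient.txt
```

Swap that last line’s predicates for ‘-exec rm -f {} +’ and you get exactly the overnight carnage described above, rooted at wherever cron happened to start the job.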
Once I finish laughing, I have some very long rsync jobs to run! At least I know my backups work, they’ve had a good test the last few days.