I’ve encountered this issue for nearly 10 years, dating back to the FreeNAS days, and it continues to be a problem. If, for instance, you set a 10GB quota on a dataset and configure snapshots and replication to a secondary system, replication will fail once the dataset reaches its quota, because there isn’t enough space on the receiving end to receive the latest increment. This happens whenever you include dataset properties in replication, since the dataset quotas then match on both sides. The current workaround is to manually increase the quota on the receiving system, after which replication resumes. While this fix is simple, it’s inconvenient, especially when dealing with nested replications across multiple datasets, where just one dataset hitting its quota can disrupt the entire replication process, essentially leaving you without up-to-date backups until it’s addressed.
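For reference, the manual fix is quick on the receiving box; something like this (dataset names invented for illustration):

```sh
# On the receiving system: see what quota the replication carried over
zfs get -H -o value quota backuppool/users/alice

# Raise it enough for the pending incremental to land; replication then resumes
zfs set quota=12G backuppool/users/alice
```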
I’m not sure quite how or where in the stack this should be fixed, but it’s clearly an issue, and given that replication is such a key component of ZFS and TrueNAS, it’s an obvious fix that’s needed.
This sounds like it would be expected behavior. If you have a quota set, and the quota is respected on the other side (because you’re maintaining properties), why is this wrong?
Setting a quota to ensure a given user or group doesn’t exceed their storage allocation is a pretty standard and reasonable ask; what isn’t reasonable is that when that user or group hits their quota, their entire backup breaks. I get what you’re saying from a technical point of view, but from a more practical angle it’s broken. It doesn’t allow for the fact that replication applies the most recent increment before discarding old snapshots, meaning the receiving end will always exceed its quota the day you hit quota, which breaks replication. Seems to me like a small amount of headroom, say 1-2%, needs to be added on the receiving end by default to largely prevent this from happening.
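Until something like that exists, I suppose it could be approximated on the receiving side; a rough sketch that re-pads each replicated quota by ~2% after every replication run (pool layout invented):

```sh
#!/bin/sh
# Rough sketch: pad the quota on each replicated dataset by ~2% so the next
# incremental has room to land before old snapshots are pruned.
# Assumes the datasets under backuppool/users mirror the source layout.
for ds in $(zfs list -H -o name -r backuppool/users); do
    quota=$(zfs get -Hp -o value quota "$ds")
    [ "$quota" = "0" ] && continue               # with -p, "0" means no quota set
    zfs set quota=$((quota + quota / 50)) "$ds"  # +2% headroom
done
```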
Looks like the solution is to NOT set “full filesystem replication” but to manually define suitable properties (such as a larger quota, or even no quota at all) on the target system.
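For example, after the initial replication, something along these lines on the target (dataset names made up):

```sh
# On the target system: either drop the quota entirely...
zfs set quota=none backuppool/users/alice
# ...or give it some headroom over the source's 10G
zfs set quota=12G backuppool/users/alice
```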
So I don’t actually use “full filesystem replication” but I do include dataset properties.
The problem I have is that I use the secondary system I replicate to as a failover system, so in the event of a disaster I need to be able to use it as the primary, and I could really do with all dataset properties, including quotas, coming across.
Excluding properties would be an equal nightmare, as every time I update someone’s quota on the primary I would have to hop onto the secondary and do the same. As you say, having no quota would fix this, but that’s just not practical in the situation I’m using it in.
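(The quota sync could probably be semi-scripted, something like this rough sketch over SSH, with the host alias and pool names invented, but it’s still an extra moving part to maintain:)

```sh
#!/bin/sh
# Sketch: mirror quota values from the primary to the secondary so the
# target keeps matching quotas without receiving the quota property itself.
# "primary" is an assumed SSH alias; pool layouts assumed identical.
ssh primary zfs get -Hp -o name,value -r quota tank/users |
while read -r name value; do
    target="backuppool/${name#tank/}"   # remap source pool to backup pool
    [ "$value" = "0" ] && value=none    # with -p, "0" means no quota
    zfs set quota="$value" "$target"
done
```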
Appreciate all your thoughts. I just wanted to make sure I wasn’t missing something blindingly obvious.
It may not apply, but I’ve always considered a user’s quota to apply only to their “undeleted” live content, i.e. their current content. They can’t control snapshots.
And the primary system has the quota.
Then IT backs it up.
That’s then IT’s quota/problem.
Anyway, I guess the point is that the backup target shouldn’t have a quota, and I guess that if you’re sending properties then you need to not send the quota property.
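If I remember right, recent OpenZFS can also strip a property at the receiving end rather than on send, so a raw send/recv pair could keep everything except the quota; roughly (snapshot and dataset names invented):

```sh
# Send with all properties, but have the receiver ignore the quota property
zfs send -R tank/users@snap | ssh backup zfs recv -F -x quota backuppool/users
```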
I use the backup system as a failover target as well, so if the datasets on the backup had no quotas and I failed over, users could run riot with the storage. Seems to me that if replication is designed to always send the most recent increment and prune afterwards, it should have some sort of failsafe built in to prevent this exact issue; otherwise quotas plus replication are destined to fail one day, which seems sort of cruel. I think it’s fair to assume most people use the defaults, and the defaults replicate quota properties, so out of the box, replication on TrueNAS will fail the day you hit quota. Again, it seems a bit cruel to set someone up to fail from the start with the defaults.