Bad ZFS pool performance after switching from CORE 12.0 to SCALE 24.10

Good day,

I’m banging my head against an issue I’ve been having since switching from CORE 12.0 to SCALE 24.10.

First of all, the scenario I have:

-zpool of 44 TiB composed of:

  • 7 hard disks of 9.1 TiB each in RAIDZ1
  • 2 NVMe disks of 447 GiB each as L2ARC cache

-CPU Intel(R) Xeon(R) Bronze 3104 CPU @ 1.70GHz (server is Supermicro SSG-6049P-E1CR36L)

-32 GiB ECC RAM

-zpool was created in TrueNAS CORE 12 and was not upgraded after moving to SCALE 24.10

-zpool is exported via NFS

-zpool is used for Proxmox backups: the Proxmox cluster backs up to a PBS VM in the cluster, which has the TrueNAS zpool mounted over NFS

-backup speed is limited to 1 Gbit/s by the port speed

I know this is not a recommended setup, nor a particularly good one, but it’s what we have.

The issue I’m facing is this: on CORE 12, PBS garbage collection was fast (about one day for 40 TiB of used space), while on SCALE 24.10 garbage collection needs one week to complete. That causes knock-on problems: because garbage collection is so slow, the verify job starts while it is still running, and then a zpool scrub begins as well, bringing backup transfer speed down to 2 MB/s.

Proxmox backups run at the same speed on both CORE and SCALE, though.
So we have the exact same hardware/setup/zpool; the only thing that changed is TrueNAS going from CORE to SCALE.

Is this a known issue, or something we can work on?
Is there any more info I can provide?

One more piece of info: CORE was pretty much at “defaults”; it also had autotune available.
We have no apps/jails/vms in truenas.

I have no knowledge about PBS nor about what exactly PBS Garbage Collection does under the covers.

But judging from the description it would seem to me that whatever PBS is doing over NFS during garbage collection is working several orders of magnitude more slowly under SCALE than under CORE. My guess (and it is nothing more than that) is that PBS asks for details about existing files and this is happening file by file on SCALE when it was previously done directory by directory.

So my first question is whether you can take some measurements when garbage collection is happening but nothing else (i.e. not a backup nor a verify nor a scrub) and see what the utilisation is for NVMe, for HDDs and for network i.e. which is the bottleneck?
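A sketch of how those measurements could be taken from a shell on the TrueNAS host while only garbage collection is running; the pool name `tank` and interface name `eno1` are placeholders for the real ones:

```shell
# Per-vdev bandwidth and IOPS, sampled every 5 seconds (Ctrl-C to stop).
# This separates the RAIDZ1 HDDs from the L2ARC NVMe devices.
zpool iostat -v tank 5

# Per-disk latency and %util, to see whether the HDDs are saturated on IOPS
# while moving very little data (typical of metadata-heavy workloads).
iostat -x 5

# Interface byte/packet counters, sampled twice to estimate network load.
ip -s link show eno1; sleep 5; ip -s link show eno1
```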

I also have some configuration questions:

  1. L2ARC is not normally recommended with < 64GB of memory; however, you have a very specific PBS workload which might benefit from it nonetheless. This may also be an area which has changed between Core and Scale Dragonfish, specifically the amount of RAM used for ARC, and the impact this might have on L2ARC usage. It seems unlikely, but the upgrade might somehow have resulted in an L2ARC incompatibility, with the result that L2ARC is not actually working at the moment. (The pool can survive an L2ARC disk failure - are the 2x NVMe disks mirrored or striped?)

    • What can you see in the TrueNAS Scale reports pages / Netdata pages? Is there anything there that looks odd? (Screenshots, please.)
    • I am wondering whether first removing L2ARC would make any noticeable difference and then whether adding it back in again would make a difference.
  2. Asynchronous vs. synchronous writes - again, is this something that has changed between Core & Scale? What are the recommendations regarding async vs. sync writes from PBS over NFS? Can you double-check that writes are actually asynchronous rather than synchronous? (Datasets should be marked for asynchronous writes, NFS parameters should force asynchronous writes, and NFS should be mounted for PBS as asynchronous.)
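A sketch of how both ends could be checked (the dataset name and mount paths are placeholders):

```shell
# On TrueNAS: check the sync property on the dataset backing the NFS export.
# "standard" honours what the client requests; "always"/"disabled" override it.
zfs get sync tank/pbs-backups

# On the PBS VM: show the NFS mount options actually in effect,
# to confirm whether the share is mounted sync or async.
findmnt -t nfs,nfs4 -o TARGET,OPTIONS
```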

That’s pretty much all I can think of (from my position of inexperience with either Core or Scale - but some general background in other large-scale backup systems over the decades).

I suggest removing the L2ARC. 447GiB of L2ARC is eating into your RAM just to exist. Those 2 NVMe drives could be used as another pool on your machine for Apps or VMs. The L2ARC is costing you about 9 GB of your RAM. You didn’t mention what resources you have assigned to the PBS VM.

See ZFS Read Cache in the ZFS Primer, and the Memory Sizing section of the TrueNAS 24.10 documentation.


First of all, thanks for the replies!
I will get back soon with answers for all the questions.

One quick note though: PBS does a lot of small read/write operations.

Your pool has the IOPS of one drive. I can’t comment on the differences from CORE to SCALE you are experiencing, but for this kind of workload you would be better off using mirror pairs of HDDs, or going SSD. For NFS sync writes to an HDD pool, an enterprise-grade SLOG device will speed writes up.

Thanks for the heads up, but the question still remains: with the exact same hardware/layout, why did performance get so much worse just by going from CORE to SCALE?

I just rolled back to CORE in one of our 3 sites and performance is back to being good

A difference I notice is disk I/O; on SCALE I have, for example:


Whereas on CORE, I have:

It is something we can try; still, I don’t understand why it’s good on CORE vs bad on SCALE

It is recommended to use sync to avoid losing any data; no problem if backups are a bit slower. The real issue here is the infamous garbage collection

A few last details that might be useful:
on CORE we had the system installed on 2x USB pen drives (yep, horrible and not recommended), whereas SCALE is installed on 2x WD Red disks

Also, since we have 3 sites with this setup, I rolled back one of them to CORE while the other two are still on SCALE, so I can compare the systems

Unfortunately IMO most of the stats you posted can’t tell us anything meaningful.

The NICs running at 100-150kb/s probably tells us that the network is not the bottleneck and so we should look at storage.

htop doesn’t tell us much because it doesn’t tell us which NVMes or HDDs are contributing to the disk i/o so we have no idea whether it is L2ARC hits or misses.

The “cache load” (zpool status) tells us only how much data has accumulated in L2ARC over time, and nothing about how frequently it is being used. There are much better graphs in TrueNAS and Netdata to tell us about ARC and L2ARC usage.

The NVMe usage graphs don’t tell us much because there is nothing that says what PBS actions were running at e.g. the peaks. They do show that ZFS is striping things roughly evenly, but that’s about it. In particular, they do not show how much memory L2ARC is taking away from ARC, so there is no way to assess whether L2ARC is of any benefit. However, with only 9GB of data held in L2ARC, my guess is that it isn’t going to benefit you much, and the detriment to standard ARC might outweigh the benefits.

My guess is also that L2ARC is not the cause of this Garbage Collection issue.

My advice would be to remove L2ARC from the SCALE servers for the moment so that we can see if that helps, and to simplify the configuration to aid with diagnostic analysis.
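Removing (and later re-adding) an L2ARC device is a safe, online operation, so it is a cheap experiment. A sketch, with placeholder pool/device names - check `zpool status` for the real ones:

```shell
# Cache (L2ARC) devices hold no unique data and can be removed at any time.
zpool remove tank nvme0n1 nvme1n1

# To re-add them later as striped L2ARC:
zpool add tank cache nvme0n1 nvme1n1
```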

It would also help to see copies of all the TrueNAS and Netdata graphs from both SCALE and CORE that relate to ARC usage so that we can see whether we can assess the positive or negative impact of L2ARC.

The graphs which do give rise to concern are the CORE vs SCALE HDD utilisation, where these are VERY different. The SCALE graphs show a background utilisation of 0.5MB/s per disk in the vDev, so c. 3.5MB/s overall - and it seems reasonable to assume that this is attributable to the PBS garbage collection. Unfortunately, I have no idea where on the CORE graphs garbage collection is happening, so I cannot compare the disk utilisations for the same processing. In essence, we are unable to determine whether the disk I/O is slowing the PBS garbage collection to a snail’s pace, or whether something else is doing this and causing the same level of background I/O to occur for much, much longer.

And finally, we come to synchronous vs. asynchronous I/O. Firstly we have an issue of PBS speed here so clearly I/O performance for PBS does matter. It is entirely possible that Garbage Collection is somehow being slowed down by synchronous I/O, especially since you don’t have an SLOG.

It is also unclear whether the potential loss of data during an OS crash actually has any meaning in the context of PBS backups. This is NOT a transactional system where zero data loss is essential for business integrity, nor a zVolume where another guest file system needs ZFS to guarantee its integrity.

It is perfectly feasible that SCALE has worse synchronous I/O than CORE, or that SCALE and CORE somehow differ as to whether they are using async I/O at all, or possibly that ZFS on SCALE and CORE process file deletes differently. I suspect that garbage collection is actually a large bunch of very small I/Os (hence the low network utilisation) that do quite a lot of disk I/O under the covers, rewriting a bunch of metadata to remove the files and to return the files’ blocks to available storage - a wildly different pattern of network and disk I/O from the writes of large files during streaming backups. I have no idea why CORE and SCALE would handle this differently in general, or differently when using synchronous I/O, but this seems to me definitely something that should be tested.

I therefore recommend 3 actions here:

  1. Undertake technical research as to whether using async I/O with PBS represents a genuine data integrity risk. What do the experts say?

  2. Regardless of 1., for a trial period set the ZFS datasets so that writes are asynchronous and see if that makes a difference.

  3. After you have removed the L2ARC NVMes temporarily to see what happens and measured the results, and before you set async writes for the datasets, try using these same drives as a temporary SLOG mirror on the pool and see what impact this has.
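Sketches for trials 2 and 3, with placeholder pool/dataset/device names; `sync=disabled` should only be set temporarily for the test:

```shell
# Trial 2: force asynchronous writes on the backup dataset, run a garbage
# collection, measure, then revert.
zfs set sync=disabled tank/pbs-backups
# ...run the test...
zfs set sync=standard tank/pbs-backups

# Trial 3: reuse the freed NVMe devices as a mirrored SLOG.
zpool add tank log mirror nvme0n1 nvme1n1
# Log vdevs can be removed again at any time; use the vdev name that
# "zpool status" reports (e.g. mirror-1).
zpool remove tank mirror-1
```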

Hopefully, with these additional tests (no L2ARC, SLOG, async writes) we can try to narrow down the cause.

P.S. As a fairly minor aside that has literally nothing to do with this issue, and is probably not worth changing now that the pool exists: I would personally say that for future pools of e.g. 6x 10TB usable, an 8x RAIDZ2 would generally be preferable to a 7x RAIDZ1, due to the size and width of the vDev, the time required for resilvering, and the risk that a 2nd drive fails during the stress of resilvering the first. Of course, this is a backup server, so the data is a backup copy and not primary data; in this instance you may well have considered the risk perfectly acceptable when it wouldn’t be for primary data.

I didn’t spot this comment. If L2ARC is removing 9GB of ARC in order to store 9GB of data in L2ARC, it would seem that you would almost certainly be better off without L2ARC at all.

I looked at the Proxmox PBS docs. Installation — Proxmox Backup 3.3.0-1 documentation


Recommended Server System Requirements

    CPU: Modern AMD or Intel 64-bit based CPU, with at least 4 cores

    Memory: minimum 4 GiB for the OS, filesystem cache and Proxmox Backup Server daemons. Add at least another GiB per TiB storage space.

    OS storage:

        32 GiB, or more, free storage space

        Use a hardware RAID with battery protected write cache (BBU) or a redundant ZFS setup (ZFS is not compatible with a hardware RAID controller).

    Backup storage:

        Prefer fast storage that delivers high IOPS for random IO workloads; use only enterprise SSDs for best results.

        If HDDs are used: Using a metadata cache is highly recommended, for example, add a ZFS special device mirror.

    Redundant Multi-GBit/s network interface cards (NICs)
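Applying the quoted sizing rule to this setup (~40 TiB of used backup space) suggests the 32 GiB of RAM is already below the PBS recommendation, before the ZFS ARC takes its share; a quick check:

```shell
# PBS rule of thumb from the docs quoted above: 4 GiB base + 1 GiB per TiB
# of backup storage. The 40 TiB figure is taken from the original post.
DATASTORE_TIB=40
RECOMMENDED_GIB=$((4 + DATASTORE_TIB))
echo "PBS recommendation: ${RECOMMENDED_GIB} GiB RAM (server has 32 GiB, shared with ZFS ARC)"
```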


I don’t know what your exact problem is, and I don’t know your SSD setup as you didn’t specify the details, but from my experience with Proxmox and VM backups, especially when done over NFS, which requires sync writes, you need fast storage or a write cache.

And by fast, I don’t just mean any SSDs. It has to be SSDs with PLP (power-loss protection), which can guarantee fast sync writes. This generally means enterprise-level SSDs. Using consumer-level SSDs (even NVMe) will often lead to extremely slow speeds (slower than spinning rust), and you can see plenty of posts on the Proxmox forums complaining about this.


I was thinking about this again in the shower this morning, and @SmallBarky’s point about using a special allocation vDev to hold metadata and small files might be the solution - though IMO we should continue to try to determine the reason that we see this problem on Scale i.e. Linux and not on Core i.e. FreeBSD.

However, experimenting with a special allocation vDev is not something to be undertaken lightly, since once added it cannot be removed (RAIDZ) or removal has performance implications (mirrors), but also because the special vDev’s mirroring really needs to be at least as resilient (if not more so) as the data vDevs.

If anyone is interested, here is my current thinking based on this description of how Garbage Collection works:

  • Phase 1 reads index files and then updates the atime metadata for any chunks that are referenced.
  • Phase 2 scans the chunks and any with old atimes are deleted.

This should result in the following I/Os:

  1. Sequential reads of the backup index files - these should benefit from sequential pre-fetch. I can see no reason why Core/Scale should be different in this respect and if it were then I think there would be other more widespread symptoms.

  2. Large scale updates of chunk file atimes which are held in metadata blocks. Since Garbage Collection explicitly handles delays of 24hrs5mins caused by relatime, you need to check that relatime is set for the pool/dataset. Again I see no reason why this would be different for Core and Scale, but we should double-check.

    It seems to me that there is no real reason why these need to be synchronous writes - if there is a crash or power loss during phase 1 of GC you would just run it from the beginning again anyway. Again, I see no reason why this should be different between Core & Scale.

    I tried to find a definitive source to say whether PBS writes should be sync or async. I found pages that suggest that PBS native writes are async, but if they are sent over NFS they may be converted to sync writes unnecessarily.

  3. Reading the atimes of all chunks - again all metadata. Hopefully this metadata would already be in ARC but if for some reason they are not being cached then this would cause a lot of small i/os.

  4. Deleting the chunks that are too old. Deleting files is also all metadata writes i.e.:

    • Removing the directory entry and writing the directory metadata block back to disk - chunks are 4MB in size, so there are a lot of them, and the directory metadata will be held in many blocks (“ZAP”s?). Currently, when files are deleted, the resulting empty metadata blocks are not consolidated; there is a merged OpenZFS fix, not yet shipped, that consolidates them. But the net result of these huge directories is that updating the metadata to remove the directory entry may result in large metadata writes.
    • Assuming no snapshots, the file’s data blocks also need to be returned to the free-space list. I assume that ZFS has been designed not to “lose” freed blocks if there is a crash between the directory entry being removed and the data blocks being returned to the free list. However, these are metadata writes as well.

So Garbage Collection is heavy on metadata writes and there may be a difference between Core and Scale versions of OpenZFS as to how these are handled. And this would suggest that using a special allocation vDev to store metadata might speed up garbage collection substantially.
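Two quick checks/sketches that follow from this, with placeholder pool/dataset/device names; note the earlier warning that a special vDev on a RAIDZ pool cannot be removed later:

```shell
# Confirm relatime is in effect - GC's 24h5m grace period assumes it.
zfs get atime,relatime tank/pbs-backups

# Sketch: add a mirrored special allocation vdev for metadata.
# WARNING: on a RAIDZ pool this is permanent, and the mirror should be
# at least as redundant as the data vdev.
zpool add tank special mirror nvme0n1 nvme1n1
```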

You guys keep talking about changing the ZFS pool layout, editing ZFS tunables, editing NFS mount options, and so on and so forth; still, everybody keeps missing the main point the OP is making.

Same hardware, same zfs pool, same workload type, same zfs tunables, same everything… TrueNAS SCALE has major performance issues; TrueNAS CORE does not.
Maybe some default tunable value differs from CORE to SCALE? Maybe a completely different inner working between ZFS on BSD and OpenZFS on Linux affects this specific workload?

I don’t know, but this is a serious problem that the devs should focus on if they intend to supersede the BSD-based version.


No - we are NOT missing this. We are trying both:

  1. To diagnose why there is this difference / issue with Scale; but also
  2. To find a workaround to a pressing issue without needing to wait for an OpenZFS issue to be diagnosed, and a fix coded and tested, and for this to make it into Debian and / or kernel, and for the updated Debian and/or kernel to make it into TrueNAS.

This is a huge exaggeration and I have already told you 10 billion times not to exaggerate like this. :stuck_out_tongue_winking_eye:

Most people don’t have a performance problem on Scale. This particular use case has an issue.

  1. It is OpenZFS in both Core and Scale.
  2. I agree, @urgali should raise a ticket and provide a debug file from both Core and Scale systems.

In the mean-time, those community members here who are not iX employees and who are donating our time for free to help @urgali don’t really appreciate it when dismissive comments such as these are made about the help we are giving.


I actually have updates on the topic.

First of all thanks a lot to @Protopia for taking a lot of time and effort to help out, much appreciated.
Anyway, I checked NFS and my previous statement was wrong; it is mounted as async.

Now, on to the matter: we “solved” the issue (more like worked around it) by dropping NFS and starting to use SMB/CIFS, since I found some posts describing almost the same issues on NFS that got solved by using SMB.
I now have even faster I/O speed than on CORE with NFS.

So, at this point I’d say there is something different about how NFS is exposed by SCALE and CORE, since the client options are exactly the same, as stated before; unfortunately I don’t have time to dig deeper, so I’ll just use SMB if/when needed.

Thanks all


Maybe your issue was related to this. Depends on particulars of how data is used / structured.

Yes - NAS-132930 matches extremely well the issue we have been seeing - according to the PBS documentation you can have 10s or 100s of thousands of 4MB files in the chunks directory, so a performance issue with enumerating those over NFS would almost certainly explain it.
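To get a feel for the enumeration load involved, the chunk files can be counted on the datastore (the path is a placeholder; PBS keeps chunks under a hidden `.chunks` directory):

```shell
# Count chunk files, then first-level chunk subdirectories, to gauge the
# directory enumeration load that NAS-132930 describes.
find /mnt/tank/pbs-backups/.chunks -type f | wc -l
find /mnt/tank/pbs-backups/.chunks -mindepth 1 -maxdepth 1 -type d | wc -l
```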

I am very glad you got a result.

Okay. It’s been merged for the 24.10.1 release.


Nice to know, thanks a lot!
…do you happen to know when we can expect the .1 release, or if I should roll back from .2?

Thanks

24.10.1 is planned for next week
