Any help with this would be greatly appreciated; I’ve really been pulling my hair out on this one. I recently migrated from Core 13.0-U6.2 and in the process upgraded the system from a dual-socket Xeon E5-2670 v2 to a single-socket EPYC 7402P. The first thing I noticed with SCALE was greatly reduced (as in orders of magnitude lower) sync=always performance on all of my pools, regardless of which PLP SLOG SSD I used.
I spent days chasing this but eventually I built a very simple test environment and was able to replicate it there. I’ll present this basic environment for review and perhaps someone can help me chase this down -
The basic environment is as follows:
Ryzen 5900X, 64GB DDR4 host with 1 boot NVMe drive and 1x Samsung PM1733 960GB drive.
Vanilla install of Core tested vs. vanilla install of Scale.
Single-vdev, single-drive stripe ZFS pool with 1x Samsung PM1733, 1M recordsize, standard lz4 compression, sync set to always (rough CLI sketch at the end of this post).
fio --name=fiotest --ioengine=posixaio --size=50G --rw=write --bs=1M --direct=1 --runtime=60 --iodepth=32 --sync=1
TrueNAS Core : write: IOPS=1221, BW=1221MiB/s (1280MB/s)(50.0GiB/41928msec); 0 zone resets
TrueNAS Scale: write: IOPS=366, BW=367MiB/s (384MB/s)(21.5GiB/60060msec); 0 zone resets
I have tried every imaginable ioengine and different queue depths; I always get about 1/10th or worse performance on SCALE for sync=always vs. TrueNAS Core. This seems like such a massive performance margin that it must be a misconfiguration somewhere, but this test environment is basically just out of the box on a really simple setup.
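For anyone wanting to reproduce the pool layout from the command line, it is roughly the following (pool/dataset names and the device path are just examples; the same recordsize/compression/sync settings can equally be applied in the UI):

zpool create tank /dev/nvme1n1        # single-drive stripe vdev (example device path)
zfs create tank/test
zfs set recordsize=1M tank/test
zfs set compression=lz4 tank/test
zfs set sync=always tank/test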
So we can eliminate non-sync aspects, can you also try the tests with sync=standard or sync=disabled and confirm that these give equivalent performance on Core vs. SCALE?
I can confirm that sync=standard and sync=disabled are equivalent between Core and SCALE in all of my test cases, or so close that they are within tolerance - working as intended for those cases.
Requested testing: dropping sync=1 from the fio test has no effect. Dropping direct=1 or setting direct=0 also has no effect.
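For anyone reproducing the sync comparisons, toggling the dataset sync property between runs looks roughly like this (dataset name is an example):

zfs set sync=always tank/test      # every write goes through the ZIL
zfs set sync=standard tank/test    # only application-requested syncs (fsync etc.)
zfs set sync=disabled tank/test    # sync requests ignored - testing only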
And @Captain_Morgan - this test is being done directly on the NAS itself, but I see the same differences over iSCSI and NFS… it was NFS where I first noticed the major drop in performance when moving to SCALE.
Good thoughts, but yeah… I already tried these during my week-long debug session on this. I don’t see anything compelling in iostat, and arc_summary as well as zfs get all don’t show any differences.
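For anyone following along, these are the kinds of commands being compared between the two installs (pool/dataset names are examples):

zpool iostat -v tank 1      # per-vdev activity while the fio run is active
arc_summary                 # ARC sizing and hit rates
zfs get all tank/test       # dataset properties to diff between Core and SCALE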
So, interestingly, I moved up to the SCALE 25.04 BETA branch for testing - which has ZFS 2.3.0-1 - and there is a significant improvement in sync write performance, but it is still nowhere near as fast as Core.
I’m seeing write: IOPS=564, BW=565MiB/s (592MB/s)(33.1GiB/60049msec); 0 zone resets on the 25.04 BETA… versus about 1250MB/s on Core as shown above. This is very perplexing.
I can confirm that I’ve tested this SSD formatted as ext4, with both sync and async writes, and do not see any difference in performance. I’ve also tested this against other SSDs with the same result.
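The ext4 baseline is just the raw drive formatted and mounted, then the same fio job pointed at it - something along these lines (device path and mountpoint are examples, and mkfs wipes the drive):

mkfs.ext4 /dev/nvme1n1
mount /dev/nvme1n1 /mnt/ext4test
fio --name=fiotest --ioengine=posixaio --size=50G --rw=write --bs=1M --direct=1 --runtime=60 --iodepth=32 --sync=1 --directory=/mnt/ext4test
# drop --sync=1 for the async comparison run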
I’ll definitely try tomorrow with 25.04-RC.1. If it persists I’ll submit a bug. Thanks.
A log device isn’t assigned, as it’s a single-drive pool for testing purposes - can you point me to the change in ZFS 2.3 for this behavior? I’m having trouble finding it.
I discovered a pretty significant issue with RC.1 today… NFSv4 to ESXi was working great up until the point that I upgraded the pool to the latest feature set…
Now I’m getting errors when attempting Storage vMotions into that dataset. I’ve been chasing it all day and finally narrowed it down to the ZFS feature-flag update.
2025-03-11T20:22:34.035Z In(05) vmx - SVMotion: Enter Phase 1
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskGetSrcInfo: disk scsi0:0: type: 11, allocType: 2, capacityInBytes: 107374182400, grain: 0, numlinks: 1, rdm: null, disk sector size: 512.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskGetDstInfo: disk scsi0:0: type: 11, allocType: 2, capacityInBytes: 107374182400, grain: 0, numlinks: 1, rdm: null, disk sector size: 512.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskSetup: Adding disk scsi0:0: moveRDMDesc: 0, isRemote: 0, skipZeros: 1.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 389 us.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_PREPARING (2) to MIGRATE_TO_VMX_PRECOPY (3).
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 205 us.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - UTIL: Change file descriptor limit from soft 16499,hard 16499 to soft 32998,hard 32998.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotion: Enter Phase 2
2025-03-11T20:22:34.038Z In(05) worker-6323006 - SVMotionDiskGetCreateExtParams: not using a storage policy to create disk '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk'
2025-03-11T20:22:34.038Z In(05) worker-6323006 - DISKLIB-LIB_CREATE : DiskLibCreateObjExtParamsInt: CreateObjExtParams: Object backing type 0 is invalid. Figuring out the most suitable backing type...
2025-03-11T20:22:34.045Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : open successful (33554433) size = 4096, hd = 0. Type 3
2025-03-11T20:22:34.047Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : closed.
2025-03-11T20:22:34.047Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 229 us.
2025-03-11T20:22:34.047Z In(05) worker-6323006 - SVMotion: Enter Phase 3
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-VMFS : VmfsExtentCommonOpen: possible extent truncation (?) realSize is 0, size in descriptor 209715200.
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : failed to open (The file specified is not a virtual disk): Size of extent in descriptor file larger than real size. Type 3
2025-03-11T20:22:34.048Z Er(02) worker-6323006 - DISKLIB-LINK : DiskLinkOpen: Failed to open '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk': : The file specified is not a virtual disk
2025-03-11T20:22:34.048Z Er(02) worker-6323006 - DISKLIB-CHAIN : DiskChainOpen: "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk": failed to open: The file specified is not a virtual disk.
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-LIB : Failed to open '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk' with flags 0x820a The file specified is not a virtual disk (15).
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - Mirror: scsi0:0: SVMotionLocalDiskLoad: failed to open the destination disk The file specified is not a virtual disk.
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - Mirror: scsi0:0: Failed to load dest disk /vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk.
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - SVMotionPrepareForCopyThread: Failed to load destination disks
2025-03-11T20:22:34.048Z In(05) worker-6323006 - SVMotion: FailureCleanup thread completes.
Yes, it was working on the BETA - it’s only after updating to RC.1 and upgrading the pool feature flags that the problem cropped up. It worked on RC.1 right up until I upgraded the ZFS feature flags. I’ve run out of cycles to troubleshoot/bug-report this today, but I’ll hopefully circle back when I have some more time later in the week.
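For reference, checking and applying the pool feature upgrade from the CLI looks something like this (pool name is an example; note the upgrade is one-way):

zpool status tank                    # notes when newer feature flags are available
zpool get all tank | grep feature@   # lists each feature as disabled/enabled/active
zpool upgrade tank                   # enables all supported features; older ZFS versions may no longer import the pool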
I appreciate that you understand the details, but for anyone reading this who doesn’t:
ZFS does synchronous writes when:
dataset sync=always : all writes
dataset sync=standard : only writes the application requests synchronously (fsync, O_SYNC, etc.)
dataset sync=disabled : never
Synchronous writes are made to the ZIL. A SLOG changes where the ZIL is located, and AFAICS there are three potential reasons to use a SLOG:
The SLOG device is far faster than the data vDev (or possibly than a special allocation (metadata) vDev if you have one, though this may depend on the small record size you have set for your dataset - I haven’t yet found a definitive answer on this).
You need more IOPS than your current ZIL devices can provide - separating the ZIL out to a SLOG moves those IOPS to a separate device.
You want to reduce free-space fragmentation (apparently because ZIL writes can increase this).
Thus, if your data vDev is already NVMe or Optane and the workload is not heavy, there may be no significant benefit to having a separate SLOG.
So IMO, having the ZIL on the same SSD as the data is arguably a normal use case that should be tested. But iX will need to form their own judgement on this.
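For anyone who wants to test the separate-SLOG case alongside this, adding and later removing a log vDev is non-destructive - roughly (pool and device are examples):

zpool add tank log /dev/nvme2n1    # attach a dedicated SLOG device
zpool iostat -v tank 1             # sync-write traffic should now show on the log vdev
zpool remove tank nvme2n1          # log vdevs can be removed at any time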
I don’t disagree, but it’s not a configuration we sell and performance-test for enterprise customers. It’s a community config that I assumed was still working well.