Any help with this would be greatly appreciated; I’ve really been pulling my hair out on this one. I recently migrated from Core 13.0-U6.2 and in the process upgraded the system from a dual-socket Xeon E5-2670 v2 to a single-socket EPYC 7402P. The first thing I noticed with SCALE was greatly reduced (as in orders of magnitude lower) sync=always performance on all of my pools, regardless of which PLP SLOG SSD I used.
I spent days chasing this but eventually I built a very simple test environment and was able to replicate it there. I’ll present this basic environment for review and perhaps someone can help me chase this down -
The basic environment is as follows:
Ryzen 5900X, 64GB DDR4 host with 1 boot NVMe drive and 1x Samsung PM1733 960GB drive.
Vanilla install of Core tested vs. vanilla install of Scale.
Single-vdev, single-drive stripe ZFS pool with 1x Samsung PM1733, 1M recordsize, standard lz4 compression, sync set to always (rough CLI sketch at the end of this post).
fio --name=fiotest --ioengine=posixaio --size=50G --rw=write --bs=1M --direct=1 --runtime=60 --iodepth=32 --sync=1
TrueNAS Core : write: IOPS=1221, BW=1221MiB/s (1280MB/s)(50.0GiB/41928msec); 0 zone resets
TrueNAS Scale: write: IOPS=366, BW=367MiB/s (384MB/s)(21.5GiB/60060msec); 0 zone resets
I have tried every imaginable ioengine and different queue depths; I always get about 1/10th or worse performance on SCALE for sync=always vs. TrueNAS Core. This seems like such a massive performance margin that it must be a misconfiguration somewhere, but this test environment is basically just out of the box on a really simple setup.
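For anyone wanting to reproduce the pool layout from the command line, it is roughly the following (pool/dataset names and the device path are just examples; the same recordsize/compression/sync settings can equally be applied in the UI):

zpool create tank /dev/nvme1n1        # single-drive stripe vdev (example device path)
zfs create tank/test
zfs set recordsize=1M tank/test
zfs set compression=lz4 tank/test
zfs set sync=always tank/test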
So we can eliminate non-sync aspects, can you also try the tests with sync=standard or sync=disabled and confirm that these give equivalent performance on Core vs. SCALE?
I can confirm that sync=standard and sync=disabled are equivalent between Core and SCALE in all of my test cases, or so close that they are within tolerance - working as intended for those cases.
Requested testing: dropping sync=1 from the fio test has no effect. Dropping direct=1 or setting direct=0 also has no effect.
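For anyone reproducing the sync comparisons, toggling the dataset sync property between runs looks roughly like this (dataset name is an example):

zfs set sync=always tank/test      # every write goes through the ZIL
zfs set sync=standard tank/test    # only application-requested syncs (fsync etc.)
zfs set sync=disabled tank/test    # sync requests ignored - testing only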
And @Captain_Morgan - this test is being done directly on the NAS itself, but I see the same differences over iSCSI and NFS… it was NFS where I first noticed the major drop in performance when moving to SCALE.
Good thoughts, but yeah… I already tried these during my week-long debug session on this. I don’t see anything compelling in iostat, and arc_summary as well as zfs get all don’t show any differences.
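For anyone following along, these are the kinds of commands being compared between the two installs (pool/dataset names are examples):

zpool iostat -v tank 1      # per-vdev activity while the fio run is active
arc_summary                 # ARC sizing and hit rates
zfs get all tank/test       # dataset properties to diff between Core and SCALE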
So, interestingly, I moved up to the SCALE 25.04 BETA branch for testing - which has ZFS 2.3.0-1 - and there is a significant improvement in sync write performance, but it is still nowhere near as fast as Core.
I’m seeing write: IOPS=564, BW=565MiB/s (592MB/s)(33.1GiB/60049msec); 0 zone resets on the 25.04 BETA… versus about 1250MB/s on Core as shown above. This is very perplexing.
I can confirm that I’ve tested this SSD formatted as ext4, with both sync and async writes, and do not see any difference in performance. I’ve also tested this against other SSDs with the same result.
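The ext4 baseline is just the raw drive formatted and mounted, then the same fio job pointed at it - something along these lines (device path and mountpoint are examples, and mkfs wipes the drive):

mkfs.ext4 /dev/nvme1n1
mount /dev/nvme1n1 /mnt/ext4test
fio --name=fiotest --ioengine=posixaio --size=50G --rw=write --bs=1M --direct=1 --runtime=60 --iodepth=32 --sync=1 --directory=/mnt/ext4test
# drop --sync=1 for the async comparison run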
I’ll definitely try tomorrow with 25.04-RC.1. If it persists I’ll submit a bug. Thanks.
A log device isn’t assigned, as it’s a single-drive pool for testing purposes - can you point me to the change in ZFS 2.3 for this behavior? I’m having trouble finding it.
I discovered a pretty significant issue with RC.1 today… NFSv4 to ESXi was working great up until the point that I upgraded the pool to the latest feature set…
Now I’m getting errors when attempting Storage vMotions into that dataset. I’ve been chasing it all day and finally narrowed it down to the ZFS feature-flag update.
2025-03-11T20:22:34.035Z In(05) vmx - SVMotion: Enter Phase 1
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskGetSrcInfo: disk scsi0:0: type: 11, allocType: 2, capacityInBytes: 107374182400, grain: 0, numlinks: 1, rdm: null, disk sector size: 512.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskGetDstInfo: disk scsi0:0: type: 11, allocType: 2, capacityInBytes: 107374182400, grain: 0, numlinks: 1, rdm: null, disk sector size: 512.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotionDiskSetup: Adding disk scsi0:0: moveRDMDesc: 0, isRemote: 0, skipZeros: 1.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 389 us.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateSetState: Transitioning from state MIGRATE_TO_VMX_PREPARING (2) to MIGRATE_TO_VMX_PRECOPY (3).
2025-03-11T20:22:34.037Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 205 us.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - UTIL: Change file descriptor limit from soft 16499,hard 16499 to soft 32998,hard 32998.
2025-03-11T20:22:34.037Z In(05) worker-6323006 - SVMotion: Enter Phase 2
2025-03-11T20:22:34.038Z In(05) worker-6323006 - SVMotionDiskGetCreateExtParams: not using a storage policy to create disk '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk'
2025-03-11T20:22:34.038Z In(05) worker-6323006 - DISKLIB-LIB_CREATE : DiskLibCreateObjExtParamsInt: CreateObjExtParams: Object backing type 0 is invalid. Figuring out the most suitable backing type...
2025-03-11T20:22:34.045Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : open successful (33554433) size = 4096, hd = 0. Type 3
2025-03-11T20:22:34.047Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : closed.
2025-03-11T20:22:34.047Z In(05) worker-6323006 - MigrateWriteHostLog: Writing to log file took 229 us.
2025-03-11T20:22:34.047Z In(05) worker-6323006 - SVMotion: Enter Phase 3
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-VMFS : VmfsExtentCommonOpen: possible extent truncation (?) realSize is 0, size in descriptor 209715200.
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-VMFS : "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1-flat.vmdk" : failed to open (The file specified is not a virtual disk): Size of extent in descriptor file larger than real size. Type 3
2025-03-11T20:22:34.048Z Er(02) worker-6323006 - DISKLIB-LINK : DiskLinkOpen: Failed to open '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk': : The file specified is not a virtual disk
2025-03-11T20:22:34.048Z Er(02) worker-6323006 - DISKLIB-CHAIN : DiskChainOpen: "/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk": failed to open: The file specified is not a virtual disk.
2025-03-11T20:22:34.048Z In(05) worker-6323006 - DISKLIB-LIB : Failed to open '/vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk' with flags 0x820a The file specified is not a virtual disk (15).
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - Mirror: scsi0:0: SVMotionLocalDiskLoad: failed to open the destination disk The file specified is not a virtual disk.
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - Mirror: scsi0:0: Failed to load dest disk /vmfs/volumes/29d19148-36cd94c1-0000-000000000000/ad1_1/ad1.vmdk.
2025-03-11T20:22:34.048Z Wa(03) worker-6323006 - SVMotionPrepareForCopyThread: Failed to load destination disks
2025-03-11T20:22:34.048Z In(05) worker-6323006 - SVMotion: FailureCleanup thread completes.
Yes, it was working on the BETA - it’s only after updating to RC.1 and upgrading the pool feature flags that the problem cropped up. It worked on RC.1 right up until I upgraded the ZFS feature flags. I’ve run out of cycles to troubleshoot/bug-report this today, but I’ll hopefully circle back when I have some more time later in the week.
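For reference, checking and applying the pool feature upgrade from the CLI looks something like this (pool name is an example; note the upgrade is one-way):

zpool status tank                    # notes when newer feature flags are available
zpool get all tank | grep feature@   # lists each feature as disabled/enabled/active
zpool upgrade tank                   # enables all supported features; older ZFS versions may no longer import the pool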
I appreciate that you understand the details, but for anyone reading this who doesn’t:
ZFS does synchronous writes when:
dataset sync=always : all writes
dataset sync=standard : only writes the application requests synchronously (fsync, O_SYNC, etc.)
dataset sync=disabled : never
Synchronous writes are made to the ZIL. A SLOG changes where the ZIL is located, and AFAICS there are three potential reasons to use a SLOG:
The SLOG device is far faster than the data vDev (or possibly than a special allocation (metadata) vDev if you have one, though this may depend on the small record size you have set for your dataset - I haven’t yet found a definitive answer on this).
You need more IOPS than your current ZIL devices can provide - separating the ZIL out to a SLOG moves those IOPS to a separate device.
You want to reduce free-space fragmentation (apparently because ZIL writes can increase this).
Thus, if your data vDev is already NVMe or Optane and the workload is not heavy, there may be no significant benefit to having a separate SLOG.
So IMO, having the ZIL on the same SSD as the data is arguably a normal use case that should be tested. But iX will need to form their own judgement on this.
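For anyone who wants to test the separate-SLOG case alongside this, adding and later removing a log vDev is non-destructive - roughly (pool and device are examples):

zpool add tank log /dev/nvme2n1    # attach a dedicated SLOG device
zpool iostat -v tank 1             # sync-write traffic should now show on the log vdev
zpool remove tank nvme2n1          # log vdevs can be removed at any time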
I don’t disagree, but it’s not a configuration we sell and performance-test for enterprise customers. It’s a community config that I assumed was still working well.