ZFS SCALE slowdown on local disk-to-disk copy

I’m seeing a slowdown when copying files. I’m moving data off XFS disks I’ve placed in the TrueNAS SCALE server and onto the local zpool.

I see 190 MB/s copies (probably the max of the single disk I’m reading from) for about 30 seconds, and then it tanks to 5 MB/s. About 2-3 minutes later it’s back up at 190 MB/s (or sometimes around 100 MB/s), then tanks to 5 MB/s again after only 20-30 seconds or so.
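For what it’s worth, that shape - full speed for ~30 seconds, then a stall - is what you’d expect if writes land in RAM until the ZFS dirty-data ceiling fills and the write throttle kicks in. A back-of-envelope sketch, assuming the default 4 GiB `zfs_dirty_data_max` (an assumption; check your own tunables):

```shell
# How long a ~190 MB/s stream takes to fill a 4 GiB dirty-data buffer.
# Both numbers are assumptions taken from the figures in this post.
dirty_max_mib=$((4 * 1024))   # default zfs_dirty_data_max, in MiB
write_rate_mib=190            # observed copy speed, in MiB/s
echo $((dirty_max_mib / write_rate_mib))   # prints 21 (seconds)
```

~21 seconds is right in the ballpark of the 20-30 second bursts described above; the pool draining concurrently would stretch it a bit.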

Here is my layout (yes, I did 13; bad me, the next one won’t go past 12):

config:

        NAME                                      STATE     READ WRITE CKSUM
        zfspool                                   ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            b4f09d49-a1ce-4d38-96c1-d2458ccc81b3  ONLINE       0     0     0
            2f530f09-4742-4b6f-a0f8-72e73ccfe826  ONLINE       0     0     0
            4e9bde16-4f18-4a6e-884f-602db26bead2  ONLINE       0     0     0
            26d75f96-689f-4f50-b3a6-0c89f66753e4  ONLINE       0     0     0
            d3f74673-4557-4be4-b242-671c10c2325e  ONLINE       0     0     0
            sdd1                                  ONLINE       0     0     0
            66d741d7-bb41-43fc-87a3-e871a7d75d23  ONLINE       0     0     0
            sdf1                                  ONLINE       0     0     0
            7124feb8-ca1a-435f-818c-661e9f14f22c  ONLINE       0     0     0
            sdk1                                  ONLINE       0     0     0
        special
          sdv                                     ONLINE       0     0     0
        logs
          0b382af6-a553-4ee4-9475-83b901be0686    ONLINE       0     0     0
        cache
          nvme0n1                                 ONLINE       0     0     0
root@truenas[/mnt/mnt/TowerDisks/sdr/Media/BOOKS]# zpool iostat -v
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
boot-pool                                 2.63G  92.4G      1     27  16.7K   365K
  mirror-0                                2.63G  92.4G      1     27  16.7K   365K
    sdw3                                      -      -      0     13  9.15K   182K
    sdx3                                      -      -      0     13  7.52K   182K
----------------------------------------  -----  -----  -----  -----  -----  -----
zfspool                                   46.5T  26.7T    292    922  1.22M  25.3M
  raidz2-0                                46.4T  26.4T      8     78  52.4K  8.23M
    b4f09d49-a1ce-4d38-96c1-d2458ccc81b3      -      -      0     10  5.26K   843K
    2f530f09-4742-4b6f-a0f8-72e73ccfe826      -      -      0      7  5.21K   843K
    4e9bde16-4f18-4a6e-884f-602db26bead2      -      -      0      4  5.23K   843K
    26d75f96-689f-4f50-b3a6-0c89f66753e4      -      -      0      4  5.25K   843K
    d3f74673-4557-4be4-b242-671c10c2325e      -      -      0     14  5.22K   843K
    sdd1                                      -      -      0      7  5.26K   843K
    66d741d7-bb41-43fc-87a3-e871a7d75d23      -      -      0      7  5.24K   843K
    sdf1                                      -      -      0      7  5.21K   843K
    7124feb8-ca1a-435f-818c-661e9f14f22c      -      -      0      7  5.27K   843K
    sdk1                                      -      -      0      7  5.24K   843K
special                                       -      -      -      -      -      -
  sdv                                      123G   341G    283    843  1.17M  17.0M
logs                                          -      -      -      -      -      -
  0b382af6-a553-4ee4-9475-83b901be0686    3.62M   952G      0      0     17  12.9K
cache                                         -      -      -      -      -      -
  nvme0n1                                 1.75T  75.6G    150     92   613K  6.69M
----------------------------------------  -----  -----  -----  -----  -----  -----

Let me know what else is useful to have detail-wise. This is driving me a bit nuts.

Are all drives in the pool CMR, not SMR?

Please post your hardware details and the pool details. Your listing shows special, logs, and cache vdevs, and I don’t know exactly what they are used for.

This post had a nice system summary by Russell, as an example.



All CMR, what details would you like?

=== System Information ===
Hostname: truenas
Uptime: up 9 hours, 56 minutes
=== CPU Information ===
CPU(s):                               24
On-line CPU(s) list:                  0-23
Model name:                           Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz
BIOS Model name:                      Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00GHz  CPU @ 2.0GHz
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            2
CPU(s) scaling MHz:                   56%
NUMA node0 CPU(s):                    0-5,12-17
NUMA node1 CPU(s):                    6-11,18-23
=== Memory Information ===
Total Memory: 110Gi
Used Memory: 78Gi
Free Memory: 8.0Gi
=== Disk Information ===
sda         TEAM TM8PS7001T           953.9G
`-sda1                                953.9G
sdb         WDC WD80EFZZ-68BTXN0        7.3T
`-sdb1                                  7.3T
sdc         WDC WD80EFAX-68LHPN0        7.3T
`-sdc1                                  7.3T
sdd         WDC WD80EMAZ-00WJTA0        7.3T
`-sdd1                                  7.3T
sde         WDC WD80EMAZ-00WJTA0        7.3T
`-sde1                                  7.3T
sdf         WDC WD80EMAZ-00WJTA0        7.3T
`-sdf1                                  7.3T
sdg         WDC WD80EDAZ-11TA3A0        7.3T
`-sdg1                                  7.3T
sdh         WDC WD80EFZX-68UW8N0        7.3T
`-sdh1                                  7.3T
sdi         WDC WD80EMAZ-00WJTA0        7.3T
`-sdi1                                  7.3T
sdj         WDC WD80EMAZ-00WJTA0        7.3T
`-sdj1                                  7.3T
sdk         WDC WD80EMAZ-00WJTA0        7.3T
`-sdk1                                  7.3T
sdl         WDC WD80EFZX-68UW8N0        7.3T
`-sdl1                                  7.3T
sdm         WDC WD80EMAZ-00WJTA0        7.3T
`-sdm1                                  7.3T
sdn         WDC WD80EMAZ-00WJTA0        7.3T
`-sdn1                                  7.3T
sdo         WDC WD80EMAZ-00WJTA0        7.3T
`-sdo1                                  7.3T
sdp         WDC WD80EFZZ-68BTXN0        7.3T
`-sdp1                                  7.3T
sdq         WDC WD40EFRX-68WT0N0        3.6T
`-sdq1                                  3.6T
sdr         WDC WD40EFRX-68N32N0        3.6T
`-sdr1                                  3.6T
sds         WDC WD40EFRX-68WT0N0        3.6T
`-sds1                                  3.6T
sdt         WDC WD80EMAZ-00WJTA0        7.3T
`-sdt1                                  7.3T
sdu         WDC WD80EDAZ-11TA3A0        7.3T
`-sdu1                                  7.3T
sdv         Samsung SSD 840 EVO 500GB 465.8G
sdw         Samsung SSD 850 EVO 120GB 111.8G
|-sdw1                                    1M
|-sdw2                                  512M
|-sdw3                                 95.3G
`-sdw4                                   16G
sdx         SAMSUNG SSD 830 Series    119.2G
|-sdx1                                    1M
|-sdx2                                  512M
|-sdx3                                102.7G
`-sdx4                                   16G
sdy         WDC WD80EFZX-68UW8N0        7.3T
`-sdy1                                  7.3T
nvme0n1     WD_BLACK SN850X 2000GB      1.8T
=== Network Interfaces ===
enp2s0f0         UP
enp2s0f1         UP
enp131s0f0       DOWN
enp131s0f1       DOWN
enp132s0f0       DOWN
enp132s0f1       DOWN
bond0            UP             

It seems like you have a drive blocking I/O and everything else getting stuck waiting on it. Maybe check the GUI disk reports and see if anything is bottlenecking.

Have you checked the SMART data for the HDDs?

Also check whether you have an L2ARC and/or SLOG (separate intent log).
If I am following the data you have posted above, you seem to have a lot of possibly mismatched devices. The boot pool is two different series of Samsung SSD? The Z2 pool has HDDs with different internal cache sizes, 128 MB and 256 MB models. I just checked a few.
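On the SMART suggestion, something like this walks the pool disks (the device names are examples pulled from the lsblk listing below; shown as a dry run with echo so nothing executes until you remove it):

```shell
# Dry run: print the smartctl commands for each pool data disk.
# The device list is an example; substitute your own pool members.
for dev in sdb sdc sdd sde sdf sdg; do
  echo "smartctl -t short /dev/$dev"   # queue a short self-test
  echo "smartctl -a /dev/$dev"         # dump SMART attributes and test results
done
```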

I do have L2ARC, and SLOG.

I have boot devices as 1x 120GB and 1x 128GB, yes. Boot shouldn’t affect this, right?

I’ll run a SMART test against everything.

I should also point out that only the following devices are actually participating in the zpool:

This is the vdev config:

Yes, I have another set of SSDs that will be added. I’d love a set of recommendations for cache; mostly what I’ve read is “give ZFS all the RAM and all the SSDs, and have lots of vdevs and mirrors, etc.” I’m doing what I can afford here, and I hope it’ll be enough. I can certainly add more RAM, and I have 2450 v2 Xeons on their way, so I can add 64GB DIMMs; the mobo will support up to 1TB of RAM (I may not go all that way, maybe 512GB).

Disk Models are:
WD80EDAZ
WD80EFAX
WD80EFZX
WD80EFZZ
WD80EMAZ

WD80EDAZ

  • Type: HDD (Hard Disk Drive)
  • Capacity: 8TB
  • Cache: 256MB
  • RPM: 5400 RPM
  • Spindle Count: 1
  • Throughput: Up to 175 MB/s

WD80EFAX

  • Type: HDD (Hard Disk Drive)
  • Capacity: 8TB
  • Cache: 256MB
  • RPM: 5400 RPM
  • Spindle Count: 1
  • Throughput: Up to 175 MB/s

WD80EFZX

  • Type: HDD (Hard Disk Drive)
  • Capacity: 8TB
  • Cache: 256MB
  • RPM: 7200 RPM
  • Spindle Count: 1
  • Throughput: Up to 200 MB/s

WD80EFZZ

  • Type: HDD (Hard Disk Drive)
  • Capacity: 8TB
  • Cache: 256MB
  • RPM: 7200 RPM
  • Spindle Count: 1
  • Throughput: Up to 205 MB/s

WD80EMAZ

  • Type: HDD (Hard Disk Drive)
  • Capacity: 8TB
  • Cache: 256MB
  • RPM: 5400 RPM
  • Spindle Count: 1
  • Throughput: Up to 175 MB/s

Summary

  • Capacity: All models are 8TB.
  • Cache: All models have 256MB.
  • RPM: WD80EFZX and WD80EFZZ have 7200 RPM; WD80EDAZ, WD80EFAX, and WD80EMAZ have 5400 RPM.
  • Throughput: Ranges from 175 MB/s to 205 MB/s depending on the model.

I tried to add a performance tag but it doesn’t appear at the top of the forum. Maybe you can add it?

I think you should be fine with just your basic Z2 pool, then researching and adding “special devices” if necessary. Some will kill the entire pool upon failure; others are recommended to be mirrored. An L2ARC device could die and the pool will still work. Just checking: you don’t have deduplication turned on? That’s another very special-use option.

I’ll point you to the ZFS Primer.
Look at the info for SLOG. It says a SLOG uses 16GiB of space, and it also points to drive models with power-loss protection.

I’ll attempt to add the tag.

I am using dedup, getting ~9% reduction as well.

I will read that.

DEDUP.
That should be the problem child. Search the forums and you can see when it really should be used. I don’t know if you have to rebuild your pool from scratch when you turn it off; you will have to check the documentation and forums for advice.

Old Forum Search results for query: deduplication | TrueNAS Community


Lots of good information here. A few minor points:

  1. The recommendation to stick to 12 drives or fewer in a RAIDZ vDev is based only on the number of data drives, and you have only 10.

  2. I am unclear why some drives show a device name (sd*) and others a UUID. The point of UUIDs is that they are independent of the drive’s position in the hardware, whereas sd* names can change between reboots. I have no idea whether this is a big problem, nor how to fix it.

  3. IMO you should NOT have an unmirrored metadata special vDev - if you lose it you WILL lose the entire pool, so it is more important than anything else to ensure it has good redundancy. However, if you configure the L2ARC correctly it will cache all the metadata anyway, so the special vDev is probably redundant. I don’t know whether it is possible to remove the metadata vDev without rebuilding the pool, but IMO you have to do something about this ASAP.

  4. I am not sure just how useful the SLOG will be - it depends on your workload, but it only helps with synchronous writes, and e.g. Windows SMB usage is asynchronous by default. Asynchronous writes generally appear faster to the network client because no write to either SSD or HDD is needed before the client sees the operation as complete, and the more memory you have, the more asynchronous writes can be queued in memory awaiting disk capacity to be written out. Of course, any asynchronous writes still in memory will be lost on a crash or power outage, so you have to evaluate that risk. If you are using synchronous writes, you similarly need to evaluate the risk of losing what is stored on the SLOG and not yet committed to HDD should the SLOG SSD die, and decide whether it needs to be mirrored.
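To see which mode a share is actually using, the `sync` property can be inspected per dataset. A sketch (the dataset name is a placeholder; echoed as a dry run so nothing changes):

```shell
# Dry run: inspect, and optionally change, a dataset's sync behavior.
# 'zfspool/media' is a placeholder name - use your own dataset.
echo "zfs get sync zfspool/media"             # standard | always | disabled
echo "zfs set sync=standard zfspool/media"    # default: honor sync requests
```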


I have a mirror coming to join that metadata vdev.

I’ll probably keep the SLOG unless it’s slowing things down.

I’m just as confused about the sd* device names vs. PARTUUIDs as well. The TrueNAS GUI did all of that on its own.

Thank you for the points!

If you have a large L2ARC anyway, a metadata vDev probably won’t speed up your reads significantly, though it might allow a slightly higher volume of writes. But if you have one, it must be mirrored - and IMO a 3-way mirror, because of its importance.

If write volumes are not a concern, then a metadata vDev just adds complexity and additional points of failure to your pool. So my advice (as a non-expert) would be to remove the metadata vDev if you can, and rely on L2ARC to speed up metadata reads.


I’ve done a 2x mirror on the metadata now (just finished resilvering).

I can add a 4TB L2ARC (QLC, unfortunately) if it will matter. I just really want the disks to handle the writes. I get that it’s a CoW filesystem, but I’d also like not to be bottlenecked by the 4GB zfs_dirty_data_max, which appears to be the case as best I can tell. That should be how it works, right? Because zfs_dirty_data_max_max is also 4GB, you can’t really raise it.
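For anyone wanting to check those numbers, the OpenZFS tunables are exposed as module parameters on Linux (the paths below are an assumption for SCALE; echoed as a dry run so nothing runs):

```shell
# Dry run: where the dirty-data tunables live on Linux OpenZFS.
echo "cat /sys/module/zfs/parameters/zfs_dirty_data_max"
echo "cat /sys/module/zfs/parameters/zfs_dirty_data_max_max"
# zfs_dirty_data_max can be raised at runtime, but only up to
# zfs_dirty_data_max_max, which is fixed at module load - so going
# past 4GB needs a module option set at boot.
```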

You need to size L2ARC appropriately. The bigger it is, the bigger the RAM usage. It will do nothing if your current ARC hit rate is high.

Please read the ZFS Primer and TrueNAS Scale documents.
One reference on L2ARC L2ARC | TrueNAS Documentation Hub
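To put a rough number on that RAM usage: each record cached in L2ARC keeps a header in ARC - roughly 70 bytes per record on current OpenZFS is a commonly cited figure, though it varies by version. A sketch for a 2 TiB device full of 128 KiB records (both assumptions):

```shell
# Rough ARC-header overhead for a fully populated L2ARC.
l2arc_bytes=$((2 * 1024 * 1024 * 1024 * 1024))  # 2 TiB device
recordsize=$((128 * 1024))                      # assumed 128 KiB records
header=70                                       # assumed bytes per L2ARC header
echo $((l2arc_bytes / recordsize * header / 1024 / 1024))  # prints 1120 (MiB)
```

Headers alone are only about 1 GiB under these assumptions, but smaller average blocks multiply that several-fold, and the headers compete with the ARC that actually produces hits.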

Start out with a regular Z2 pool setup and start learning its performance. If there are problems, then come back to the forum and post a new thread with your entire setup described and what you are doing. You just seem to be trying to use all the ZFS features without understanding when and how to use them.

I haven’t seen anything but guesses on why the performance would be slow. I am not randomly trying out features; I’m trying to let others provide examples of their performance characteristics.

The near-constant advisories not to use dedup lead me to believe it is misunderstood by the community in general. ZFS was designed with this feature and compression in mind; it is a solid feature. ZFS doesn’t belong on a small system, as noted by Oracle’s own ZFS arrays being absolute monsters (I’ve managed them myself).

I’ll figure out what the issue is and post a how to. Maybe it will benefit someone.

The advisories against dedup are from experience on the forums and at iX Systems. I linked to the old forums, but there may be something in the current Resources section or the guides on the old forum.

Deduplication has been identified as a likely culprit here; while you do have a special vdev, it was only a single disk (since changed to a mirror) - and the 840 EVO may not be particularly suited to the task.


The raidz2 is “only” 10-wide, and that’s what we look at, so it is not TOO wide.
But it is intriguing that three drives are shown by partition name rather than by GPTID, as should be the case.

And you have a blinking red alert with this pool: a single-drive special vdev!
Never ever do that! If this single drive dies, your pool is lost. Add at least one other drive, and preferably two, to extend this special vdev into a 2-way or 3-way mirror.

It is not apparent whether you actually have a use for SLOG. But you do not have enough RAM for a 2 TB L2ARC… and this could even be part of your problem.
The good point here is you can remove SLOG and/or L2ARC.

BINGO!
9% is absolutely ridiculous for this resource hog, and it’s straining the single-drive special vdev for good measure. :scream:
Unless you’re willing to backup and destroy your pool to rebuild sanely, you should:

  1. FIRST AND FOREMOST extend this special vdev!
  2. For each dataset which has dedup enabled, create a new dataset without dedup…
  3. … set up a one-off LOCAL replication task from the deduped dataset to the new one; uncheck “Include dataset properties” and “(Almost) Full Filesystem Replication” (set manually any other non-default property you want; you may also crank up compression if your dataset can benefit from it); and run…
  4. … check that your data has safely moved into the new, non-deduped dataset, remove the “read-only” property, and make the new dataset the new share…
  5. … and delete the old deduped dataset.
  6. Disable dedup at pool level.
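Steps 2-5 above can also be sketched from the CLI; a hypothetical equivalent with placeholder dataset names, echoed as a dry run so nothing executes:

```shell
# Dry run: move a deduped dataset's data into a fresh non-deduped one.
# 'zfspool/media' and 'zfspool/media-new' are placeholder names.
echo "zfs snapshot zfspool/media@migrate"
echo "zfs send zfspool/media@migrate | zfs receive -o dedup=off zfspool/media-new"
echo "zfs destroy -r zfspool/media"   # only after verifying the copy!
```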

(If dedup is on for the entire pool, you do not have enough RAM to handle it - far from it. The system could still cope thanks to EITHER a special/dedup vdev OR a persistent L2ARC, not both at the same time, but your current special vdev is a major liability - and you cannot remove it. For 9%, just get rid of dedup entirely.)
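To put a number on “not enough RAM”: a commonly quoted rule of thumb is on the order of 320 bytes of core per unique block in the dedup table. A worst-case sketch for the ~46.5 TiB this pool has allocated, assuming 128 KiB blocks (both the per-entry size and the block size are rough assumptions):

```shell
# Worst-case in-RAM dedup table (DDT) estimate for this pool.
alloc_gib=47616                                             # ~46.5 TiB allocated
entries=$((alloc_gib * 1024 * 1024 * 1024 / (128 * 1024)))  # unique 128 KiB blocks
echo $((entries * 320 / 1024 / 1024 / 1024))                # prints 116 (GiB of RAM)
```

That lands right around the 110 GiB installed - before the ARC caches any actual data - and smaller average blocks inflate it dramatically.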


Sure, we’re all morons, but you’re the sole genius who will save us all from the deep darkness of Misunderstanding…

Take it from my personal experience with a little deduped dataset (16 TB quota, 10 TB used): just scrubbing - that is, reading - a mere 10 TB of deduped data would bring a system with 64 GB RAM to its knees.
Dedup is a solid feature… for enterprise systems sized with over $10k of RAM. Home-labbers are not invited.
