Long post, I apologize for the detail, but I didn't want to leave out information that might be needed to answer the question. Summary: I am not an expert at setting up RAIDZ2, and I suspect something in my setup is causing I/O to get bogged down. I'm looking for recommendations and an explanation of why they would help. My initial suspicion is that I need to add a metadata, log, or cache vdev, though that's little more than a hunch right now.
- Environment: TrueNAS Scale 24.10.2.4
- Dell Poweredge R730xd, disks presented directly, not RAID configured
- Data Disks: Micron_5200_MTFDDAK1T9TDD
- OS Disk: OCZ-Vertex3
- NIC: X520-DA2 10Gb
- Memory: 62.8 GiB, ECC. Currently showing 6.7 GiB allocated to services and 22.9 GiB to ZFS cache.
- CPU: Xeon E5-2623 v3 @ 3GHz. 2 CPUs, total of 8 cores.
- RAID configuration: 2 x RAIDZ2 | 6 wide | 1.75 TiB
- No Metadata, Log, Cache, or dedup vdevs.
- 2 x spare drives
- Currently 61% full. Usable capacity 13.8 TiB, used 8.41 TiB.
- ZFS health shows good, no errors, and no smart errors.
- 3 datasets, plus the iocage/ix-applications
- 1 NFS share, 1 SMB share, and one that is both SMB and NFS
Testing when I changed to 10Gb in my house about 6 months ago showed consistent large file transfers (multi-gig movie files) that would top out around 1 GBps and settle around 600 MBps (network traffic of 9.2 Gbps, settling around 4.8 Gbps). I established that this was expected behavior (the in-memory ZFS write cache fills up, and then the transfer settles to disk speed) and was happy with it.
Over the last several months I retired several of my other pools, deleted them, and removed the disks. At some point afterwards I noticed that file transfers to my one remaining pool would start out at full speed for a few seconds and then slow down to something on the order of KBps to 2 MBps.
I have tested the network layer with iperf and confirmed that raw throughput is still what I'd expect from 10Gb.
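For reference, the kind of test I ran was along these lines; I'm showing iperf3 syntax, and 192.168.1.50 is just a stand-in for the NAS address:

```
# On the TrueNAS box: run an iperf3 server
iperf3 -s

# On my desktop: push traffic at it for 30 seconds
iperf3 -c 192.168.1.50 -t 30

# And the reverse direction, NAS -> desktop
iperf3 -c 192.168.1.50 -t 30 -R
```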
I have started preparing for the upgrade to 25.04 to see if the problem is tied to the old version of TrueNAS, but when I went to create a tar file of my docker apps as a backup, I noticed that performance was extremely slow; it showed the same behavior internally as I was seeing on the SMB share. This leads me to believe the issue is with either the hardware or the ZFS configuration, and it also eliminates the share protocol from the list of suspects. I then tested on the NFS share (again, same pool, different dataset) and confirmed the same behavior: fast for the first couple of megabytes, then slowing to somewhere between KBps and MBps.
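To take the network and share protocols out of the picture entirely, a purely local write test should show the same thing. A minimal sketch, assuming the pool is mounted at /mnt/tank (a placeholder path):

```
# Write ~4 GiB of incompressible data straight into the pool, bypassing SMB/NFS.
# /dev/urandom avoids ZFS compression hiding the problem; urandom itself only
# manages a few hundred MBps, but that's still far above the KBps range I'm seeing.
dd if=/dev/urandom of=/mnt/tank/scratch/testfile bs=1M count=4096 status=progress

# Clean up afterwards
rm /mnt/tank/scratch/testfile
```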
Because I'm a new member, I can't upload the output of running iostat -x 1 for an extended period; can someone suggest a way to share it? In the meantime I'll post a comment with a subset of what I'm seeing in the results.
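For reference, the capture itself is just something like this (the filename and the 10-minute duration are arbitrary):

```
# Extended-stats iostat, one-second samples for 10 minutes, saved to a file
iostat -x 1 600 > /tmp/iostat-capture.txt
```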
Current RAIDZ2 vdevs (the commands I use to confirm this layout from the shell are shown after the list):
- RAIDZ2 device 1:
- sda, sdc, sde, sdf, sdg, sdh
- RAIDZ2 device 2:
- sdj, sdk, sdl, sdm, sdn, sdo
- Spares:
- sdb, sdi
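That layout is from the TrueNAS UI; to confirm it from the shell I'd run something like this, where tank is a placeholder for the real pool name:

```
# Full vdev layout, including any log/cache/special/dedup vdevs and spares
zpool status -v tank

# Per-vdev capacity, fragmentation, and health
zpool list -v tank
```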
What I've noticed is that for extended periods of time, one of the RAIDZ2 vdevs (what I'm calling device 1, the a–h disks) has at least one disk saturated at 100% utilization, while the other vdev only periodically gets a burst of activity. I am at a loss as to what I need to do to fix this.
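A per-vdev view should make that imbalance easier to see than raw iostat; something along these lines (tank again being a placeholder pool name):

```
# Per-vdev and per-disk throughput plus latency columns, refreshed every 5 seconds
zpool iostat -v -l tank 5
```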
I am wondering if, when I deleted the unused pools, I somehow deleted something that was being used as a log, cache, or metadata vdev for this pool. I don't know how I would have done that, but I can't say it's impossible. Either way, what should I look at next to troubleshoot this? I don't want to attempt the upgrade to 25.04 if the I/O for this pool is abysmal or the pool setup is already broken. I do have Intel SSD D3-S4510 Series 1.92 TB drives that I could swap in for the Microns if that would help (I retired my company's datacenter a while back and they let me keep the hardware), as well as spare Microns of the same model if simply adding a log, cache, or metadata vdev would help.
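One thing I can check myself is whether a log, cache, or metadata vdev was ever added to or removed from this pool. Assuming the pool is named tank, something like this should show it:

```
# Show every administrative operation ever run on the pool; any vdev
# add/remove would appear here
zpool history tank | grep -Ei 'add|remove|attach|detach'
```

The zpool status output from earlier would also list any log, cache, or special vdevs still attached, so if neither shows anything, I can probably rule that theory out.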