Slow expansion of RAIDZ1

I’m setting up my new TrueNAS SCALE server and expanding my existing RAIDZ1 pool with a repurposed HDD. The UI has been stuck at 25% on the pool.attach job. I’ve read some of the other posts (like this one here, which sadly did not reach a resolution), done some rudimentary troubleshooting, and learned from the CLI that the expansion is running at ~8MiB/s with an ETA of 14 days, which doesn’t seem normal. Looking at the disk dashboard, the write speed has momentarily peaked at around 60MiB/s a couple of times since the start, but the peaks are very short and few.

System: ElectricEel-24.10.1
Disks in question:
sda ST8000VN004-3CP101
sdb ST8000VN002-2ZM188
sdc ST8000VN002-2ZM188
sdd ST8000VN004-2M2101 <= ‘newcomer’

Current zpool status
$ zpool status hdd_array -v
  pool: hdd_array
 state: ONLINE
expand: expansion of raidz1-0 in progress since Sat Jan  4 00:03:08 2025
        480G / 10.4T copied at 8.47M/s, 4.50% done, 14 days 06:31:07 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        hdd_array                                 ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            28b6cbd7-3726-447e-a894-5d618c9ab79f  ONLINE       0     0     0
            c71373d1-b30a-43b9-8f26-51e249021d5d  ONLINE       0     0     0
            1879b64d-45b0-42a9-8d48-b0381f87a21f  ONLINE       0     0     0
            336608bc-9edf-43e8-a630-55f24fbbbced  ONLINE       0     0     0

errors: No known data errors
iostats
$ sudo zpool iostat -v
                                            capacity     operations     bandwidth
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
app_pool                                  16.6G   935G      1     30   103K  1.51M
  d71d0cdb-e4f6-42fb-b225-e67fff830bb2    16.6G   935G      1     30   103K  1.51M
----------------------------------------  -----  -----  -----  -----  -----  -----
boot-pool                                 2.65G   233G      0      5  9.70K  59.4K
  nvme1n1p3                               2.65G   233G      0      5  9.70K  59.4K
----------------------------------------  -----  -----  -----  -----  -----  -----
hdd_array                                 10.4T  11.4T    209    521  22.4M  3.53M
  raidz1-0                                10.4T  11.4T    209    521  22.4M  3.53M
    28b6cbd7-3726-447e-a894-5d618c9ab79f      -      -    148    183  7.46M   956K
    c71373d1-b30a-43b9-8f26-51e249021d5d      -      -     30     75  7.46M   954K
    1879b64d-45b0-42a9-8d48-b0381f87a21f      -      -     30     79  7.48M   954K
    336608bc-9edf-43e8-a630-55f24fbbbced      -      -      0    537  17.6K  2.14M
----------------------------------------  -----  -----  -----  -----  -----  -----

$ iostat -x
Linux 6.6.44-production+truenas (truenas)       01/04/25        _x86_64_        (12 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           2.21    1.50    2.20    3.81    0.00   90.28

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB
nvme0n1          1.90    103.53     0.00   0.00    0.33    54.35   31.75   1541.81     0.00   0.00    0.17    48.56    0.00      0.
nvme1n1          0.42     10.55     0.00   0.00    0.25    25.17    5.35     59.41     0.00   0.00    0.11    11.10    0.00      0.
sda            148.76   7639.85     0.11   0.07    1.47    51.36  183.97    955.40     0.01   0.01    0.20     5.19    0.00      0.
sdb             31.58   7657.77     0.03   0.11   18.64   242.47   79.44    953.76     0.01   0.01    6.58    12.01    0.00      0.
sdc             31.89   7632.76     0.03   0.11   17.93   239.36   75.51    953.88     0.01   0.01    6.92    12.63    0.00      0.
sdd              0.13      6.27     0.00   0.06    3.98    48.26  183.00    747.14     0.20   0.11    0.22     4.08    0.00      0.

I’ve seen the speedup thread and applied the suggestion:

$ echo $((100 * 16777216)) | sudo tee /sys/module/zfs/parameters/raidz_expand_max_copy_bytes
1677721600
$ cat /sys/module/zfs/parameters/raidz_expand_max_copy_bytes
1677721600

This does not seem to have influenced the process. Similarly, the “pause” suggestion from that same thread had no effect.
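(Side note: the same tools can be run with an interval to watch the trend rather than a single snapshot; the intervals below are arbitrary and the device names are just mine.)

$ sudo zpool iostat -v hdd_array 10            # per-vdev throughput, 10 s samples
$ iostat -x sda sdb sdc sdd 10                 # per-disk latency/utilisation over time
$ zpool status hdd_array | grep copied         # one-line progress/ETA check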

:grey_question: Can someone more knowledgeable make something of this?

:grey_question: Additionally, is it safe to reboot the system? I’ve considered this to try and apply the increased copy_bytes, but I don’t dare interrupt the expansion.

  1. RAIDZ expansion has to move a lot of data. Assuming your 3x8TB drives are 75% full, each disk holds 6TB. Expansion needs to spread this evenly, so each disk should end up with c. 4.5TB on it - meaning a total of roughly 4.5TB has to move from the existing 3 disks to the extra disk (rough arithmetic spelled out below).

  2. The expansion is (supposed to be) restartable after a reboot - and whilst this is a rarely used function, anecdotes suggest it works. It does mean, however, that ZFS cannot run at full speed: it has to keep writing consistency state to the disks, which makes the process slower than you might expect.

People have reported that expansion can take days to complete - and according to the zpool status you only started today (4 Jan) - so my advice is to avoid a reboot unless absolutely necessary, let it run, and hopefully it will eventually finish.
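Spelling out that estimate (under the 75%-full assumption above, decimal TB, parity included):

    3 disks × 6 TB ≈ 18 TB of allocated blocks
    18 TB ÷ 4 disks ≈ 4.5 TB per disk after expansion
    each old disk therefore sheds ~1.5 TB, i.e. ~4.5 TB in total migrates to the new disk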


Maybe this post helps

The OP was able to double his expansion speed…

  1. The disks report 6.95 TiB usage. With parity that would be ~10.5TiB to redistribute. Indeed not a small number.
  2. Thanks, that’s good to know. It’s just baffling to me that the redistribution is so slow - for what it’s worth, almost all the data in the pool initially came from this disk, and copying it over completed nicely overnight (~10h) at 200MB/s. I would be fully on board with this taking a day or three, but 14 days made me think I’m doing something wrong. I did indeed start the process at midnight today, which is now roughly 17 hours ago, and I’m at 513G / 10.4T → 8.4MiB/s average since start.

I guess I’ll heed your advice and wait it out, since I don’t want to lose the data, I’m in no rush, and the system will be on anyway.

That’s the exact thread I quoted as “already tried” in my original post.

Then I’m sorry for double posting; I should really start reading the whole post instead of skipping the code parts when I’m half asleep…

How are the drives connected to the motherboard? What is your full hardware list?

Also, never trust the estimated completion time: it’s not reliable. You will very likely finish earlier.

Sorry - but whilst the total of data blocks including parity is 10.5TB, you only need to move 1/4 of this to the new drive, which is c. 2.6TB.

Once your expansion is complete, full records written from then on will be 3 data blocks + 1 parity rather than 2 data blocks + 1 parity, while the pre-existing data keeps its old 2+1 ratio. So 3 old-layout records occupy 9 blocks, and they could be rewritten using 8 blocks. Small records won’t benefit. So using a rebalancing script to rewrite every file would recover (say) up to 1TB in space.
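For reference, the rebalancing scripts people pass around essentially boil down to copying each file and replacing the original, so the new copy is written with the post-expansion 3 data + 1 parity layout. A bare-bones sketch of the idea (illustrative only - the path is an example, and real scripts also handle hardlinks, snapshots that keep the old blocks referenced, and checksum verification):

#!/bin/bash
# Rewrite every file under a dataset so its blocks use the new stripe width.
set -euo pipefail
TARGET="/mnt/hdd_array/some_dataset"      # example path - adjust to your pool

find "$TARGET" -type f -print0 | while IFS= read -r -d '' f; do
    cp -a "$f" "$f.rebalance.tmp"         # new copy lands in the 3+1 layout
    mv "$f.rebalance.tmp" "$f"            # atomically replace the original
done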


It’s a ‘stock’ ZimaCube Pro. Are there any specifics you’re looking for? I can probably dig them up from the somewhat scattered resources.

I’m mostly interested in how the drives are connected to the motherboard in order to understand if something is limiting the transfer.

Here’s a deep dive into the HW - especially the HDD connection pretty much at the end.

Per the docs, the 6 SATA bays are connected to the MoBo via PCIe 4.0 x2.

I assume, then, they are going through the PCH? It’s not clear from the link you posted.

Looks like the SATA ports are controlled by an ASMedia ASM1166 SATA controller, but I don’t know exactly how it’s connected.

I’ll be honest: I can’t make much out of the data I’m pulling, nor do I know how to get better info, so if there’s anything more specific or useful about the topology that I can pull from the CLI, please let me know.

lspci -nn
$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation Alder Lake-U15 Host and DRAM Controller [8086:4601] (rev 04)
00:02.0 VGA compatible controller [0300]: Intel Corporation Alder Lake-UP3 GT2 [Iris Xe Graphics] [8086:46a8] (rev 0c)
00:06.0 PCI bridge [0604]: Intel Corporation 12th Gen Core Processor PCI Express x4 Controller #0 [8086:464d] (rev 04)
00:07.0 PCI bridge [0604]: Intel Corporation Alder Lake-P Thunderbolt 4 PCI Express Root Port #0 [8086:466e] (rev 04)
00:07.2 PCI bridge [0604]: Intel Corporation Alder Lake-P Thunderbolt 4 PCI Express Root Port #2 [8086:462f] (rev 04)
00:08.0 System peripheral [0880]: Intel Corporation 12th Gen Core Processor Gaussian & Neural Accelerator [8086:464f] (rev 04)
00:0a.0 Signal processing controller [1180]: Intel Corporation Platform Monitoring Technology [8086:467d] (rev 01)
00:0d.0 USB controller [0c03]: Intel Corporation Alder Lake-P Thunderbolt 4 USB Controller [8086:461e] (rev 04)
00:0d.2 USB controller [0c03]: Intel Corporation Alder Lake-P Thunderbolt 4 NHI #0 [8086:463e] (rev 04)
00:0d.3 USB controller [0c03]: Intel Corporation Alder Lake-P Thunderbolt 4 NHI #1 [8086:466d] (rev 04)
00:14.0 USB controller [0c03]: Intel Corporation Alder Lake PCH USB 3.2 xHCI Host Controller [8086:51ed] (rev 01)
00:14.2 RAM memory [0500]: Intel Corporation Alder Lake PCH Shared SRAM [8086:51ef] (rev 01)
00:15.0 Serial bus controller [0c80]: Intel Corporation Alder Lake PCH Serial IO I2C Controller #0 [8086:51e8] (rev 01)
00:15.1 Serial bus controller [0c80]: Intel Corporation Alder Lake PCH Serial IO I2C Controller #1 [8086:51e9] (rev 01)
00:16.0 Communication controller [0780]: Intel Corporation Alder Lake PCH HECI Controller [8086:51e0] (rev 01)
00:19.0 Serial bus controller [0c80]: Intel Corporation Alder Lake-P Serial IO I2C Controller #0 [8086:51c5] (rev 01)
00:19.1 Serial bus controller [0c80]: Intel Corporation Alder Lake-P Serial IO I2C Controller #1 [8086:51c6] (rev 01)
00:1c.0 PCI bridge [0604]: Intel Corporation Device [8086:51b8] (rev 01)
00:1c.4 PCI bridge [0604]: Intel Corporation Device [8086:51bc] (rev 01)
00:1c.6 PCI bridge [0604]: Intel Corporation Device [8086:51be] (rev 01)
00:1c.7 PCI bridge [0604]: Intel Corporation Alder Lake PCH-P PCI Express Root Port #9 [8086:51bf] (rev 01)
00:1d.0 PCI bridge [0604]: Intel Corporation Alder Lake PCI Express Root Port [8086:51b0] (rev 01)
00:1e.0 Communication controller [0780]: Intel Corporation Alder Lake PCH UART #0 [8086:51a8] (rev 01)
00:1e.3 Serial bus controller [0c80]: Intel Corporation Alder Lake SPI Controller [8086:51ab] (rev 01)
00:1f.0 ISA bridge [0601]: Intel Corporation Alder Lake PCH eSPI Controller [8086:5182] (rev 01)
00:1f.3 Audio device [0403]: Intel Corporation Alder Lake PCH-P High Definition Audio Controller [8086:51c8] (rev 01)
00:1f.4 SMBus [0c05]: Intel Corporation Alder Lake PCH-P SMBus Host Controller [8086:51a3] (rev 01)
00:1f.5 Serial bus controller [0c80]: Intel Corporation Alder Lake-P PCH SPI Controller [8086:51a4] (rev 01)
01:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01)
02:00.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01)
02:04.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01)
02:08.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01)
02:0c.0 PCI bridge [0604]: ASMedia Technology Inc. ASM2824 PCIe Gen3 Packet Switch [1b21:2824] (rev 01)
05:00.0 Non-Volatile memory controller [0108]: MAXIO Technology (Hangzhou) Ltd. NVMe SSD Controller MAP1202 [1e4b:1202] (rev 01)
5b:00.0 Non-Volatile memory controller [0108]: Kingston Technology Company, Inc. OM8PGP4 NVMe PCIe SSD (DRAM-less) [2646:501b]
5c:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1166 Serial ATA Controller [1b21:1166] (rev 02)
5d:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
5e:00.0 Ethernet controller [0200]: Intel Corporation Ethernet Controller I226-V [8086:125c] (rev 04)
5f:00.0 Ethernet controller [0200]: Aquantia Corp. AQtion AQC113 NBase-T/IEEE 802.3an Ethernet Controller [Antigua 10G] [1d6a:04c0] (rev 03)
lspci -t
$ lspci -t
-[0000:00]-+-00.0
           +-02.0
           +-06.0-[01-06]----00.0-[02-06]--+-00.0-[03]--
           |                               +-04.0-[04]--
           |                               +-08.0-[05]----00.0
           |                               \-0c.0-[06]--
           +-07.0-[07-30]--
           +-07.2-[31-5a]--
           +-08.0
           +-0a.0
           +-0d.0
           +-0d.2
           +-0d.3
           +-14.0
           +-14.2
           +-15.0
           +-15.1
           +-16.0
           +-19.0
           +-19.1
           +-1c.0-[5b]----00.0
           +-1c.4-[5c]----00.0
           +-1c.6-[5d]----00.0
           +-1c.7-[5e]----00.0
           +-1d.0-[5f]----00.0
           +-1e.0
           +-1e.3
           +-1f.0
           +-1f.3
           +-1f.4
           \-1f.5
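
(For the record, the negotiated link speed and width of the SATA controller itself can be read with lspci -vv, using its address from the listing above - look for the LnkCap/LnkSta lines:)

$ sudo lspci -vv -s 5c:00.0 | grep -E 'LnkCap|LnkSta'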

6 SATA ports from 2 PCIe lanes. Great for a low-cost platform that is starved of lanes. Not so great for performance when ZFS requests access to all drives simultaneously…
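Rough back-of-envelope on that shared uplink, assuming the ASM1166 links at its native PCIe 3.0 x2 (roughly 985 MB/s usable per Gen3 lane) rather than the Gen4 the slot offers:

    2 lanes × ~985 MB/s ≈ 1.97 GB/s for the whole controller
    ÷ 6 ports ≈ ~330 MB/s per drive if all six are busy at once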


I understand that it’s suboptimal but is that reason enough for such slow speeds?

I’m afraid it is. Flexible chipset I/O usually trades one PCIe lane for one SATA lane. A SAS HBA would have no trouble repackaging 3 SATA links into a single SAS lane, but this low-power ASM1166 certainly does not have the processing power of an LSI 3008.

I wouldn’t bet all my eggs on it, but from the evidence we have it seems plausible.

Also, there is the “7th” drive bay, which houses the other ASMedia chip that allows you to connect 4 NVMe drives. That is connected to the backplane as well, which in turn is connected to the mainboard via these two cables.

Not sure if that is common or not, or what this means in terms of PCIe lanes.

  1. Cheap hardware can often only be cheap because compromises are made, and those compromises are usually in performance. By buying cheap hardware you are accepting that performance will be good under low load but will suffer under stress. If you are looking for peak performance under stress, you need to pay a lot more for a server architected to achieve that, and also for top-speed LAN infrastructure.

  2. With artificial benchmarks it is VERY difficult to design the tests so that they measure the right things, and the results can be difficult to analyse properly. But because artificial tests stress the system, the one thing they are good at is highlighting all the compromises made to get the price down, regardless of whether you will actually notice the issue in real life. For example, in real life you expect a lot of your reads to be satisfied from memory (ARC, prefetch) - but a benchmark won’t usually reflect this properly; it will either get everything from cache or very little.

  3. In the end what matters - and the only thing that really matters - is whether the performance you get in real life with a real workload is acceptable. If most of the time you are either 1) streaming (prefetched), 2) reading or writing a handful of smallish files (low volumes), or 3) doing a bulk copy that will take minutes at best and where taking twice as long may not be an issue - then in reality cheap hardware is probably going to do you fine.

I would also like to point out that not being able to get this kind of information from the manufacturer is a major obstacle to troubleshooting. Not being transparent about the hardware is a big red flag to me.

Further testing once the expansion is complete would be appreciated.