Issue with TrueNAS SCALE 24.10 zpool expansion

Hi,

I had a ZFS pool of 8 drives and 3 more drives I wanted to use to expand it. The first drive went relatively well, although some errors were detected afterwards on one disk; the SMART data says the drive is fine. I decided to clear the error, tried to add a second HDD, and got this error…

Anyone know what the issue could be?

This is the error I get:

FAILED, [EFAULT] 2098 is not a valid Error

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 488, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 533, in _run_body
    rv = await self.method(*args)
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/pool/attach_disk.py", line 67, in attach
    await extend_job.wait(raise_error=True)
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 438, in wait
    raise CallError(self.error)
middlewared.service_exception.CallError: [EFAULT] 2098 is not a valid Error

Well, things are getting worse. I had no issues prior to trying to expand my ZFS pool, and now all disks are reporting write errors one by one. Something is seriously wrong with this zpool expansion feature. I would paste a screenshot, but that is not permitted here.

Another thing I just noticed: when I tried to expand the first time and all seemed to work, I had 30 TB of usable space (with eight 6 TB drives)… With 9 drives it is still showing 30 TB. All 9 drives are ONLINE, and all SMART data is passing with no failures. Not sure what is going on.

For the errors, I’ll try a SCRUB to see if that fixes things.
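(For reference, this is how I'm starting and checking the scrub from a shell; my pool is named MAIN:)

```
# Start a scrub of the pool (non-destructive; reads back and verifies all data):
sudo zpool scrub MAIN

# Check progress and the per-device READ/WRITE/CKSUM error counters:
sudo zpool status -v MAIN
```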

But yeah, before I tried to expand, everything was working fine. It looks like the expansion has broken something.

You mention that your zpool is accumulating write errors. Is there anything on the local console (or in sudo dmesg) when this happens? Can you post the output of sudo zpool status poolname? Please post these with triple backticks at the beginning and end, i.e. ```

The Python error is just the TrueNAS middleware giving up on waiting; it doesn't tell us much about what is going on with the pool itself.

Write errors can be caused by a few things besides a failing drive; the most relevant are bad cables and bad/buggy driver or kernel code. I'm leaning towards the first.
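For what it's worth, a few commands that help narrow this down (sdX is a placeholder; use the device names from your own logs):

```
# Pull out the kernel-level errors that line up with the ZFS write errors:
sudo dmesg | grep -iE 'critical target error|zio pool'

# Map the vdev partuuid from a "zio pool=... vdev=..." line back to a /dev name:
ls -l /dev/disk/by-partuuid/

# Full SMART detail for the disk behind that /dev name:
sudo smartctl -x /dev/sdX
```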

Is there a way for me to attach files here? I don’t see the option to do so, otherwise it will be a long “cut & paste”

Click the little gear icon on the text entry field and choose "Hide Details"; you can paste long output inside the collapsed section.


Here is the result from sudo dmesg:

[ 70.645946] critical target error, dev sde, sector 42190448 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 70.645978] zio pool=MAIN vdev=/dev/disk/by-partuuid/c5adb8fc-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=19453960192 size=4096 flags=1572992
[ 84.767639] systemd[1]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set
[ 85.299466] ioatdma: Intel(R) QuickData Technology Driver 5.00
[ 85.301940] NTB Resource Split driver, version 1
[ 85.303501] Software Queue-Pair Transport over NTB, version 4
[ 85.316424] NMI watchdog: Enabled. Permanently consumes one hw-PMU counter.
[ 86.843446] pps pps2: new PPS source ptp2
[ 86.843506] ixgbe 0000:24:00.1: registered PHC device on enp36s0f1
[ 92.033826] ixgbe 0000:24:00.1 enp36s0f1: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[ 101.060545] audit: type=1400 audit(1730324357.343:2): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=4571 comm="apparmor_parser"
[ 101.060549] audit: type=1400 audit(1730324357.343:3): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=4571 comm="apparmor_parser"
[ 101.060668] audit: type=1400 audit(1730324357.343:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=4570 comm="apparmor_parser"
[ 101.060761] audit: type=1400 audit(1730324357.343:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="virt-aa-helper" pid=4575 comm="apparmor_parser"
[ 101.061336] audit: type=1400 audit(1730324357.343:6): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/bin/man" pid=4573 comm="apparmor_parser"
[ 101.061339] audit: type=1400 audit(1730324357.343:7): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_filter" pid=4573 comm="apparmor_parser"
[ 101.061342] audit: type=1400 audit(1730324357.343:8): apparmor="STATUS" operation="profile_load" profile="unconfined" name="man_groff" pid=4573 comm="apparmor_parser"
[ 101.061399] audit: type=1400 audit(1730324357.343:9): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirtd" pid=4577 comm="apparmor_parser"
[ 101.061402] audit: type=1400 audit(1730324357.343:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="libvirtd//qemu_bridge_helper" pid=4577 comm="apparmor_parser"
[ 101.062264] audit: type=1400 audit(1730324357.343:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/chronyd" pid=4576 comm="apparmor_parser"
[ 101.788126] systemd-journald[1019]: Data hash table of /var/log/journal/d5e92ecedd4e49478aae4286ed723502/system.journal has a fill level at 75.1 (8540 of 11377 items, 6553600 file size, 767 bytes per hash table item), suggesting rotation.
[ 101.788131] systemd-journald[1019]: /var/log/journal/d5e92ecedd4e49478aae4286ed723502/system.journal: Journal header limits reached or header out-of-date, rotating.
[ 102.756796] NFSD: Using UMH upcall client tracking operations.
[ 102.756802] NFSD: starting 90-second grace period (net f0000000)
[ 241.686339] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 241.686382] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 241.686411] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 241.686437] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 241.686471] sd 0:0:0:0: [sdd] tag#2327 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=7s
[ 241.686483] sd 0:0:0:0: [sdd] tag#2327 Sense Key : Illegal Request [current]
[ 241.686487] sd 0:0:0:0: [sdd] tag#2327 Add. Sense: Logical block address out of range
[ 241.686490] sd 0:0:0:0: [sdd] tag#2327 CDB: Write(16) 8a 00 00 00 00 00 02 8c fa d0 00 00 00 08 00 00
[ 241.686492] critical target error, dev sdd, sector 42793680 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 241.686519] zio pool=MAIN vdev=/dev/disk/by-partuuid/c57ae2de-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=19762814976 size=4096 flags=1572992
[ 316.690841] mce: [Hardware Error]: Machine check events logged
[ 316.690848] mce: [Hardware Error]: Machine check events logged
[ 439.604108] sdh: sdh1
[ 471.275837] sdh: sdh1
[ 515.356903] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 515.357632] sd 0:0:0:0: [sdd] tag#2167 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 515.357644] sd 0:0:0:0: [sdd] tag#2167 Sense Key : Illegal Request [current]
[ 515.357650] sd 0:0:0:0: [sdd] tag#2167 Add. Sense: Logical block address out of range
[ 515.357656] sd 0:0:0:0: [sdd] tag#2167 CDB: Write(16) 8a 00 00 00 00 00 04 4b 4a b8 00 00 00 08 00 00
[ 515.357660] critical target error, dev sdd, sector 72043192 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 515.358757] zio pool=MAIN vdev=/dev/disk/by-partuuid/c57ae2de-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=34738565120 size=4096 flags=1572992
[ 523.149286] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 523.150021] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 523.150686] sd 0:0:2:0: [sdb] tag#2153 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 523.150698] sd 0:0:2:0: [sdb] tag#2153 Sense Key : Illegal Request [current]
[ 523.150705] sd 0:0:2:0: [sdb] tag#2153 Add. Sense: Logical block address out of range
[ 523.150711] sd 0:0:2:0: [sdb] tag#2153 CDB: Write(16) 8a 00 00 00 00 00 04 4b e7 08 00 00 00 08 00 00
[ 523.150715] critical target error, dev sdb, sector 72083208 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 523.151611] zio pool=MAIN vdev=/dev/disk/by-partuuid/c4910958-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=34759053312 size=4096 flags=1572992
[ 541.941551] sdh: sdh1
[ 557.080076] sdi: sdi1
[ 627.980118] mce: [Hardware Error]: Machine check events logged
[ 627.980125] mce: [Hardware Error]: Machine check events logged
[ 687.007351] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 687.008015] sd 0:0:0:0: [sdd] tag#2112 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 687.008028] sd 0:0:0:0: [sdd] tag#2112 Sense Key : Illegal Request [current]
[ 687.008034] sd 0:0:0:0: [sdd] tag#2112 Add. Sense: Logical block address out of range
[ 687.008040] sd 0:0:0:0: [sdd] tag#2112 CDB: Write(16) 8a 00 00 00 00 00 04 5d 33 78 00 00 00 08 00 00
[ 687.008044] critical target error, dev sdd, sector 73216888 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 687.008743] zio pool=MAIN vdev=/dev/disk/by-partuuid/c57ae2de-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=35339497472 size=4096 flags=1572992
[ 690.201829] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.202322] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.202709] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.203072] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.203390] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.203688] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.203986] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.204283] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.204578] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.204872] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.205168] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 690.205471] sd 0:0:2:0: [sdb] tag#2177 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 690.205480] sd 0:0:2:0: [sdb] tag#2177 Sense Key : Illegal Request [current]
[ 690.205484] sd 0:0:2:0: [sdb] tag#2177 Add. Sense: Logical block address out of range
[ 690.205489] sd 0:0:2:0: [sdb] tag#2177 CDB: Write(16) 8a 00 00 00 00 00 04 5e 0e f8 00 00 00 08 00 00
[ 690.205491] critical target error, dev sdb, sector 73273080 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 690.206471] zio pool=MAIN vdev=/dev/disk/by-partuuid/c4910958-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=35368267776 size=4096 flags=1572992
[ 799.699633] sdh: sdh1
[ 824.421679] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.422433] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.423096] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.423739] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.424372] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.425001] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.425629] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.426253] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.426875] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.427501] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.428122] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 824.428753] sd 0:0:2:0: [sdb] tag#2187 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 824.428768] sd 0:0:2:0: [sdb] tag#2187 Sense Key : Illegal Request [current]
[ 824.428775] sd 0:0:2:0: [sdb] tag#2187 Add. Sense: Logical block address out of range
[ 824.428781] sd 0:0:2:0: [sdb] tag#2187 CDB: Write(16) 8a 00 00 00 00 00 04 5f 2d 60 00 00 00 18 00 00
[ 824.428786] critical target error, dev sdb, sector 73346400 op 0x1:(WRITE) flags 0x0 phys_seg 3 prio class 2
[ 824.430367] zio pool=MAIN vdev=/dev/disk/by-partuuid/c4910958-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=35405807616 size=12288 flags=1074267264
[ 939.261698] mce: [Hardware Error]: Machine check events logged
[ 939.261705] mce: [Hardware Error]: Machine check events logged
[ 1203.302110] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.302897] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.303570] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.304229] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.304885] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.305532] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.306173] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.306811] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1203.307457] sd 0:0:2:0: [sdb] tag#2168 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=6s
[ 1203.307469] sd 0:0:2:0: [sdb] tag#2168 Sense Key : Illegal Request [current]
[ 1203.307475] sd 0:0:2:0: [sdb] tag#2168 Add. Sense: Logical block address out of range
[ 1203.307480] sd 0:0:2:0: [sdb] tag#2168 CDB: Write(16) 8a 00 00 00 00 00 04 60 02 40 00 00 00 08 00 00
[ 1203.307485] critical target error, dev sdb, sector 73400896 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 1203.308628] zio pool=MAIN vdev=/dev/disk/by-partuuid/c4910958-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=35433709568 size=4096 flags=1572992
[ 1204.408176] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.408995] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.409691] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.410371] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.411047] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.411722] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.412398] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.413070] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.413740] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.414405] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 1204.415076] sd 0:0:1:0: [sde] tag#2133 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=8s
[ 1204.415087] sd 0:0:1:0: [sde] tag#2133 Sense Key : Illegal Request [current]
[ 1204.415093] sd 0:0:1:0: [sde] tag#2133 Add. Sense: Logical block address out of range
[ 1204.415099] sd 0:0:1:0: [sde] tag#2133 CDB: Write(16) 8a 00 00 00 00 00 04 5f eb 40 00 00 00 08 00 00
[ 1204.415103] critical target error, dev sde, sector 73395008 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
[ 1204.416337] zio pool=MAIN vdev=/dev/disk/by-partuuid/c5adb8fc-4c75-11ec-b658-d05099df7916 error=121 type=2 offset=35430694912 size=4096 flags=1572992
[ 1250.547396] mce: [Hardware Error]: Machine check events logged
[ 1561.833900] mce: [Hardware Error]: Machine check events logged
[ 1561.833911] mce: [Hardware Error]: Machine check events logged
[ 1873.119008] mce: [Hardware Error]: Machine check events logged
[ 1873.119014] mce: [Hardware Error]: Machine check events logged
[ 2184.404708] mce: [Hardware Error]: Machine check events logged
[ 2184.404715] mce: [Hardware Error]: Machine check events logged
[ 2495.690205] mce: [Hardware Error]: Machine check events logged
[ 2495.690213] mce: [Hardware Error]: Machine check events logged
[ 2806.975827] mce: [Hardware Error]: Machine check events logged

Are you sure the job completed? I added a 7th disk to my 6x20TB pool and it requires 3 1/2 days to complete. Run zpool status yourpoolname and look at the expand section.
Mine for example:

root@truenas:/home/admin# zpool status rust
  pool: rust
 state: ONLINE
  scan: scrub repaired 0B in 19:03:19 with 0 errors on Fri Sep 27 00:03:21 2024
expand: expansion of raidz2-0 in progress since Tue Oct 29 21:45:45 2024
        23.8T / 96.6T copied at 335M/s, 24.64% done, 2 days 15:12:01 to go
config:

        NAME                                      STATE     READ WRITE CKSUM
        rust                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            9605883d-0150-49a6-9440-f19c323646b1  ONLINE       0     0     0
            a7dad7ae-96a2-4955-9851-5a4f2ab09f59  ONLINE       0     0     0
            9678a58a-42bb-4b04-af32-541d867b1981  ONLINE       0     0     0
            b5432386-1de9-4b32-b043-27e818bfd0d6  ONLINE       0     0     0
            58923fef-67b5-4110-8068-1818ebedbafb  ONLINE       0     0     0
            f2197047-5a18-4b86-a2b8-50c510b33943  ONLINE       0     0     0
            1ea2493e-8987-4c7b-b7b8-70a0a5274bb6  ONLINE       0     0     0

errors: No known data errors
root@truenas:/home/admin# zpool iostat -v rust
                                            capacity     operations     bandwidth 
pool                                      alloc   free   read  write   read  write
----------------------------------------  -----  -----  -----  -----  -----  -----
rust                                      96.6T  12.5T    376  26.0K   333M   334M
  raidz2-0                                96.6T  12.5T    376  26.0K   333M   334M
    9605883d-0150-49a6-9440-f19c323646b1      -      -     62  3.43K  55.6M  47.7M
    a7dad7ae-96a2-4955-9851-5a4f2ab09f59      -      -     62  3.46K  55.6M  47.7M
    9678a58a-42bb-4b04-af32-541d867b1981      -      -     62  3.49K  55.6M  47.7M
    b5432386-1de9-4b32-b043-27e818bfd0d6      -      -     62  3.49K  55.6M  47.7M
    58923fef-67b5-4110-8068-1818ebedbafb      -      -     62  3.45K  55.6M  47.7M
    f2197047-5a18-4b86-a2b8-50c510b33943      -      -     62  3.46K  55.6M  47.7M
    1ea2493e-8987-4c7b-b7b8-70a0a5274bb6      -      -      0  5.23K      2  47.9M
----------------------------------------  -----  -----  -----  -----  -----  -----

sudo zpool status MAIN details:

 pool: MAIN
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
       attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
       using 'zpool clear' or replace the device with 'zpool replace'.
  see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
 scan: scrub in progress since Wed Oct 30 17:57:31 2024
       5.62T / 5.62T scanned, 632G / 5.62T issued at 364M/s
       0B repaired, 10.99% done, 03:59:55 to go
expand: expansion of raidz2-0 in progress since Wed Oct 30 17:20:50 2024
       12.2G / 5.62T copied at 2.99M/s, 0.21% done, paused for resilver or clear
config:

       NAME                                      STATE     READ WRITE CKSUM
       MAIN                                      ONLINE       0     0     0
         raidz2-0                                ONLINE       0     0     0
           c3a86428-4c75-11ec-b658-d05099df7916  ONLINE       0     0     0
           c4997696-4c75-11ec-b658-d05099df7916  ONLINE       0     0     0
           c4c145cb-4c75-11ec-b658-d05099df7916  ONLINE       0     0     0
           c4910958-4c75-11ec-b658-d05099df7916  ONLINE       0     6     0
           c5720668-4c75-11ec-b658-d05099df7916  ONLINE       0     0     0
           c57ae2de-4c75-11ec-b658-d05099df7916  ONLINE       0     1     0
           c5a324be-4c75-11ec-b658-d05099df7916  ONLINE       0     0     0
           c5adb8fc-4c75-11ec-b658-d05099df7916  ONLINE       0     1     0
           4068a277-c737-4d95-9ddf-f9b6bb27fcb1  ONLINE       0     0     0

errors: No known data errors

Yeah dude… look at the job. It's only 0.21% done and you're trying to add another. Wait until the first one completes. And to make it even worse, you're running a scrub.
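If you want the expansion to resume before the scrub finishes, the scrub can be paused (a sketch, using your pool name):

```
# Pause the in-progress scrub; resume it later by running "zpool scrub MAIN" again:
sudo zpool scrub -p MAIN

# Confirm the state change:
sudo zpool status MAIN
```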

Everything was online and the disk was added to the pool; nothing anywhere indicated that it was still copying/expanding. I did run zpool status poolname.

From the UI I did see "pausing: resilvering", and then errors were detected. It stayed like that with no progress for a long time (an hour), at which point I rebooted.

The write errors are very concerning; this should be fixed ASAP, as ZFS will otherwise desperately want to kick disks out of your pool. Try swapping a cable and see whether the error moves with the drive or with the cable, then replace the culprit hardware once you know which it is.
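A rough way to run that test from the shell (pool and device names are taken from the dmesg output above; your serial numbers will differ):

```
# Note the serial numbers of the disks throwing errors; the serial stays with
# the physical drive even if /dev names shuffle after a cable or port swap:
sudo smartctl -i /dev/sdb | grep -i serial
sudo smartctl -i /dev/sdd | grep -i serial

# After swapping a cable, reset the pool's error counters and watch whether new
# WRITE errors follow the serial number (the drive) or the port (cable/HBA):
sudo zpool clear MAIN
sudo zpool status -v MAIN
```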

Ah yeah, I see it’s paused for the resilver.

With errors and resilvering in progress, should I instead wait for the scrub to finish and see if errors are fixed?

As for the cables, I'm not sure, since I had no errors before; the setup has been like this for several months now. It only started happening when I tried to expand the pool.

One more thing about that: it has been saying resilvering for a long time now, and there is no way to know if that is still going on. It literally hasn't moved from 0.21%. Maybe because I'm scrubbing?

What is the best way to deal with errors or test / resolve them? Will a scrub do? Anything else I can do?

It's likely in its scanning phase, which is used for sequential resilver/scrub.
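One way to keep an eye on whether it is actually moving (adjust the pool name to yours):

```
# Re-run zpool status every 60 seconds and watch the scan:/expand: lines:
sudo watch -n 60 "zpool status MAIN | grep -A 3 -E 'scan:|expand:'"
```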

Pray you have a good backup! :slight_smile:
My takeaway so far is that zpool expansion isn't worth it. It takes forever with large pools, and even after it's done we have to run a damn script to re-write all the data, which will take even more days (see the sketch below).
We would've been better off backing up or ZFS-replicating the data elsewhere, blowing away the pool, building a new one, and restoring/replicating back.
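For reference, that script boils down to rewriting every file in place so its blocks get re-allocated at the new data-to-parity geometry. A minimal sketch of the idea, not the community script itself; the mount point is a placeholder, snapshots will keep the old blocks pinned, and you should only run something like this against data you have backed up:

```
#!/bin/bash
# Rewrite each file by copying it and swapping the copy into place, so ZFS
# re-allocates its blocks with the post-expansion stripe width.
DATASET_MOUNT="/mnt/MAIN/mydataset"   # placeholder: do one dataset at a time

find "$DATASET_MOUNT" -type f -print0 | while IFS= read -r -d '' f; do
    tmp="${f}.rebalance.tmp"
    cp -a "$f" "$tmp" && mv "$tmp" "$f"   # preserve attributes, then swap
done
```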

Sorry, I know this doesn't address your current issue. It looks like you should find out whether there's a hardware/drive issue first. Look at smartctl and the logs to see what's going on. If it's just one error at the beginning, you can try clearing it as shown in the article:
https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P/index.html
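If you go the smartctl route, something like this works (sdd is one of the devices from the dmesg output above; the long self-test takes hours and is non-destructive):

```
# Kick off a long SMART self-test on a suspect drive:
sudo smartctl -t long /dev/sdd

# Later, check the self-test log for the result:
sudo smartctl -a /dev/sdd | grep -A 8 'Self-test'

# If it all checks out and the errors look like a one-off, reset the pool's
# error counters as the linked article describes:
sudo zpool clear MAIN
```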