I had a 6-wide RAIDZ1 VDEV that I extended with another 20TB drive.
Unfortunately, somewhere in the process it stopped extending.
Now it shows the drive with an error, and all my apps won't show because of a Docker error, but they all still work.
Does anyone have an idea what went wrong or how to fix it?
My system:
TrueNAS SCALE 24.10.1
6x 20TB hard drives in RAIDZ1 (7 now with the new one)
As already stated on Reddit, I have no idea why RAIDZ expansion would cause this - unless the pool is now offline.
Understanding the state of the pool and your configuration might help, so please run the following commands and post the output back here with each command's output in a separate </> box:
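(The command list was not preserved here; reconstructed from the column headers and error messages in the output below, so the exact flags originally requested may have differed slightly.)
lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
sudo sas2flash -list
sudo sas3flash -list
sudo zpool status -v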
NAME MODEL ROTA PTTYPE TYPE START SIZE PARTTYPENAME PARTUUID
sda ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sda1 1 gpt part 2048 20000586858496 Solaris /usr & Apple ZFS a238b596-47c6-4fbd-a831-da6f35053c9f
sdb ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sdb1 1 gpt part 4096 20000586841600 Solaris /usr & Apple ZFS 46789880-6932-429a-8e37-5c9f368ca23c
sdc ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sdc1 1 gpt part 4096 20000586841600 Solaris /usr & Apple ZFS 3bfb07ef-c4ca-4ac8-b458-1cfefd82578a
sdd ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sdd1 1 gpt part 2048 20000587890176 Solaris /usr & Apple ZFS aff30a52-9d06-48d6-bc26-6b4ce57f3055
sde ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sde1 1 gpt part 4096 20000586841600 Solaris /usr & Apple ZFS 85367f74-0401-4e3c-8f9d-d0eac0ecb266
sdf ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sdf1 1 gpt part 4096 20000586841600 Solaris /usr & Apple ZFS 022662b0-e6bc-49b4-9ad5-9e0fbc1afd34
sdg ST20000NM007D-3DJ103 1 gpt disk 20000588955648
└─sdg1 1 gpt part 4096 20000586841600 Solaris /usr & Apple ZFS 6b54881f-9855-4c0d-8a50-eb29e24d9116
nvme0n1 KINGSTON SNV2S1000G 0 gpt disk 1000204886016
├─nvme0n1p1 0 gpt part 4096 1048576 BIOS boot b9f13970-efdc-4e29-ba31-95adf1dd4ccd
├─nvme0n1p2 0 gpt part 6144 536870912 EFI System ad2e38d1-e354-4f14-924c-c5df0e032369
├─nvme0n1p3 0 gpt part 34609152 982484983296 Solaris /usr & Apple ZFS 9f9a3b67-09c5-4fd0-8744-e061a99a860e
└─nvme0n1p4 0 gpt part 1054720 17179869184 Linux swap 222567ac-07fe-4677-a111-686b8a3b55c9
nvme1n1 KINGSTON SNV2S1000G 0 gpt disk 1000204886016
├─nvme1n1p1 0 gpt part 4096 1048576 BIOS boot a658ba02-d492-4cd4-953b-b5407221f4d6
├─nvme1n1p2 0 gpt part 6144 536870912 EFI System 64964f1d-6db5-497e-9103-cf4496a6f119
├─nvme1n1p3 0 gpt part 34609152 982484983296 Solaris /usr & Apple ZFS b12cd586-9d09-4a17-9564-d4453a0e110a
└─nvme1n1p4 0 gpt part 1054720 17179869184 Linux swap 116f3ba1-8649-4be6-89bc-140ff746dc38
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved
No LSI SAS adapters found! Limited Command Set Available!
ERROR: Command Not allowed without an adapter!
ERROR: Couldn't Create Command -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.
No Avago SAS adapters found! Limited Command Set Available!
ERROR: Command Not allowed without an adapter!
ERROR: Couldn't Create Command -list
Exiting Program.
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:12 with 0 errors on Fri Jan 24 03:45:13 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
nvme1n1p3 ONLINE 0 0 0
nvme0n1p3 ONLINE 0 0 0
errors: No known data errors
pool: volume1
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Fri Jan 24 13:15:03 2025
53.3T / 82.8T scanned at 339M/s, 52.8T / 82.8T issued at 336M/s
0B repaired, 63.75% done, 1 days 02:01:11 to go
expand: expansion of raidz1-0 in progress since Wed Jan 22 12:06:54 2025
53.3T / 82.9T copied at 163M/s, 64.21% done, 2 days 04:53:06 to go
config:
NAME STATE READ WRITE CKSUM
volume1 ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
3bfb07ef-c4ca-4ac8-b458-1cfefd82578a ONLINE 0 0 0
46789880-6932-429a-8e37-5c9f368ca23c ONLINE 0 0 0
85367f74-0401-4e3c-8f9d-d0eac0ecb266 ONLINE 0 0 0
022662b0-e6bc-49b4-9ad5-9e0fbc1afd34 ONLINE 0 0 0
6b54881f-9855-4c0d-8a50-eb29e24d9116 ONLINE 0 0 0
aff30a52-9d06-48d6-bc26-6b4ce57f3055 ONLINE 0 0 0
a238b596-47c6-4fbd-a831-da6f35053c9f ONLINE 0 0 1
errors: No known data errors
It is possible that the Docker service was the process that hit the unrecoverable error and so it didn't start on this occasion, but it should start again after a reboot.
Also, the zpool status and lsblk output point to the problematic disk (the single checksum error is on the partition belonging to sda), so please run the following and post the results:
sudo smartctl -x /dev/sda
P.S. If you can wait the ~2 days for the expansion to complete without Docker running, that would be best; but if you do need to reboot after running the above command, the expansion should pick up from where it was before the reboot.
One point I missed, but which I think you need to address, is that you are running a scrub concurrently with the expansion. I think the ongoing scrub should be cancelled and any scheduled scrubs turned off until the expansion completes.
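For reference, an in-progress scrub can be stopped from the shell; a minimal example, using the pool name volume1 from the status output above:
sudo zpool scrub -s volume1
(The -s flag stops a running scrub; scheduled scrubs can be disabled in the UI under Data Protection > Scrub Tasks.)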
SMART doesn't show anything specific. The LBAs read seem high for 93 power-on hours (291 TiB), and the temperature is stable but higher than I would really like to see (airflow issues), but these are very minor points.
IMO (and others may differ in their opinions) a sudo zpool clear volume1 is warranted.
So I did stop the scrub. I know about the temperature, but since it is stable and within range I will leave it as it is for the moment.
How can you see that my system is still doing an expansion? I don't see it anywhere.
Hopefully none of this data is important, because you are putting all your eggs in a RAIDZ1 basket. The moment one disk fails, you are on the brink of total data loss.
With that many drives of that size (20TB), you stand a fair risk of that happening.
About 1-2 TB of the data is important, but it is backed up in a few different locations. The rest is not that important, but losing it would still be a big loss. I also always keep a spare hard disk at home in case one fails.
What would your recommendation be in my situation? From what I know, the only way would be to start over with new disks and build a RAIDZ2, but that would mean I need a second system and at least 7 new 20TB disks, right?