Hi everyone, sorry for my English first of all )
I have a bare-metal instance of TrueNAS-SCALE-24.04.2.2
on an ASUS P10S-I.
One day I found out that I have a lot of checksum errors on all 4 disks and some files are corrupted.
After troubleshooting I came to the conclusion that it was either the power supply or the cables. After replacing them, my pool disappeared. So, at this point I have:
root@truenas[/home/admin]# lsblk -o name,size,type,partuuid
NAME SIZE TYPE PARTUUID
sda 3.6T disk
├─sda1 2G part a4722b12-5b0f-476d-a704-7f384653ad21
│ └─md127 2G raid1
│ └─md127 2G crypt
└─sda2 3.6T part 78768fa5-a92a-4548-9557-0f42401ccccf
sdb 3.6T disk
├─sdb1 2G part 02b02130-2f0b-415f-ae8c-85d849261c33
└─sdb2 3.6T part 83c95c75-7fe2-4fec-9845-d974fc3f637f
sdc 3.6T disk
├─sdc1 2G part 9ee250cd-dccf-494c-aab3-1f2cc2fbd84c
│ └─md127 2G raid1
│ └─md127 2G crypt
└─sdc2 3.6T part 7148f5b2-5cf3-4f20-8f67-3e13d6350289
sdd 223.6G disk
├─sdd1 1M part 0bfe2fd4-c7c6-48f7-95c3-9ac97455137b
├─sdd2 512M part c2e55895-c266-48e3-a249-8574d9502604
├─sdd3 207.1G part a8ca35d1-8aef-4b47-8784-68848999cfe0
└─sdd4 16G part ef366a6b-d331-4d0b-a08b-9c6e73abe626
sde 3.6T disk
├─sde1 2G part 9f7ed069-5e4e-4b7f-a806-5f9e40047067
│ └─md127 2G raid1
│ └─md127 2G crypt
└─sde2 3.6T part af598827-99d2-46cc-bb20-e3821e34ca31
root@truenas[/home/admin]#
root@truenas[/home/admin]# zpool import
pool: pool
id: 13997916364281612929
state: ONLINE
status: Some supported features are not enabled on the pool.
(Note that they may be intentionally disabled if the
'compatibility' property is set.)
action: The pool can be imported using its name or numeric identifier, though
some features will not be available without an explicit 'zpool upgrade'.
config:
root@truenas[/home/admin]# zpool import pool
cannot import 'pool': insufficient replicas
Destroy and re-create the pool from
a backup source.
So, the heart of the matter is getting the data back. I have a backup, but I'll only consider that as a last resort.
On the old forum I saw a solution like this:
sysctl vfs.zfs.max_missing_tvds=1
sysctl vfs.zfs.spa.load_verify_metadata=0
sysctl vfs.zfs.spa.load_verify_data=0
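(For reference, those are FreeBSD sysctls from the CORE days. On SCALE, which is Linux-based, the closest equivalents I know of are the OpenZFS module parameters below; this is only a sketch of a last-resort import aid, not something to apply unless someone here advises it:)
# Linux/OpenZFS module parameters corresponding to those FreeBSD sysctls (names assumed from the OpenZFS docs)
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_max_missing_tvds
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_metadata
echo 0 | sudo tee /sys/module/zfs/parameters/spa_load_verify_data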
The good news is that the disks and partitions all seem lined up with the pool definitions.
Try the following and see if any of them work and, if not, whether you get any different error messages (if you get no message, that might indicate it worked):
sudo zpool import -d /dev/disk/by-partuuid -R /mnt pool
sudo zpool import -R /mnt -f pool
sudo zpool import -d /dev/disk/by-partuuid -R /mnt -f pool
P.S. Don’t try random stuff you read off web pages in case it makes things worse. The above commands will either work or not, and should not make things worse.
@svag Welcome here, and thanks for giving a reasonably good description of your setup and issue. But please use the formatted text button </> when pasting terminal output: it makes things easier to follow.
root@truenas[/home/admin]# zpool import -f -o readonly=on pool
cannot import 'pool': I/O error
Destroy and re-create the pool from
a backup source.
root@truenas[/home/admin]#
With 2 disks' worth of redundancy, it is unlikely that you have a totally corrupt pool. However, if one disk got detached before the errors on the other disks got worse, it is possible that ZFS can't assemble the pool with common TXGs.
What I am looking for is the txg: field. If there are one or 2 disks that are different, you should be okay. Or if they are all different, but not too far apart, you can roll your pool's transaction group back to a common one. But there is a limited number of rollbacks possible.
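(A sketch of how to check that, assuming the large sdX2 partitions from the lsblk output above are the pool's data vdevs; zdb -l reads the on-disk labels even while the pool is not imported:)
# run once per data partition, substituting each partuuid from the lsblk output
sudo zdb -l /dev/disk/by-partuuid/78768fa5-a92a-4548-9557-0f42401ccccf | grep -E 'txg|guid|state'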
I don’t know what others think, but the labels appear good and consistent. What is worrying is that three of the partitions are shown in the labels as aux_state: 'err_exceeded'.
I guess we need to see the smart attributes. Please run the following commands and post the output:
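(Presumably something along these lines, with smartctl from smartmontools; the sdX names are assumed from the lsblk output above and may have shifted:)
sudo smartctl -x /dev/sda
sudo smartctl -x /dev/sdb
sudo smartctl -x /dev/sdc
sudo smartctl -x /dev/sde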
Yes, it looks like 1 disk has a last TXG (ZFS Transaction Group number) of 8847849, different from the others, which are at 8847875.
This means you can't import the pool as is. Yet, since it is a redundant pool, you can survive without that 1 disk.
From what I can tell, you have 2 choices:
Disconnect or remove “sdb” and try importing the pool again. It should complain about the missing disk, but you should be able to import it, potentially needing additional options to the zpool import command. This will mean zero data lost.
Try rolling back the pool to the most common TXG, 8847849. This will mean some of the most recent data written could be lost.
But I am not sure option 2 will work, because it is possible that the 3 good disks have gone past the ring buffer size for TXGs. (A rough sketch of both options follows below.)
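(As a rough sketch only, using the device layout and TXG from this thread; please wait for confirmation before running either:)
# Option 1: with the out-of-sync disk physically detached, try a plain (possibly forced) import
sudo zpool import -R /mnt -f pool
# Option 2: read-only rewind import to the older TXG; zpool import -T is a
# last-resort debugging option and implies an extreme rewind, so only with guidance
sudo zpool import -o readonly=on -f -T 8847849 -R /mnt pool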
Someone else may have additional suggestions, so you might wait and see what others say first.
That was a good spot by @arwen which I had missed.
I think that their suggestion to remove the disk with the lower TXG is a sensible one. As they said, you will end up with a temporarily degraded pool; however, assuming that sdb is not failing, you should be able to resilver to it.
Before trying this import, I think it would be useful to get the SMART attributes anyway, so we can see what state the drives are in and whether they have any other problems that might cause things to go wrong.
(The difference in TXG numbers is 26. With ashift=14 I think you get 32 uberblocks, in which case a recovery to txg 8847849 should hopefully be possible. But I still think an attempt to recover as a degraded pool would be better as @arwen suggested.)
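(If it helps, the actual ashift and the TXGs still present in the uberblock ring can, I believe, be checked from a label, roughly like this:)
# partuuid assumed from the lsblk output above; -u lists the uberblocks stored in the labels
sudo zdb -ul /dev/disk/by-partuuid/78768fa5-a92a-4548-9557-0f42401ccccf | grep -E 'ashift|txg'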
So, according to the first choice, I need to
zpool offline pool gptid/83c95c75-7fe2-4fec-9845-d974fc3f637f
OR
just disconnect it from the bay physically.
Correct?
But clearly something went wrong when writing to /dev/sdb and it might be sensible to see whether there is a hardware issue with this drive, and check that there aren’t hardware issues with the other drives.
I finally found out which sdX contains the wrong TXG. The sdX names getting reassigned every time is quite annoying.
So at this point I got this:
root@truenas[/home/admin]# lsblk -o name,size,type,partuuid
NAME SIZE TYPE PARTUUID
sda 3.6T disk
├─sda1 2G part a4722b12-5b0f-476d-a704-7f384653ad21
└─sda2 3.6T part 78768fa5-a92a-4548-9557-0f42401ccccf
sdb 223.6G disk
├─sdb1 1M part 0bfe2fd4-c7c6-48f7-95c3-9ac97455137b
├─sdb2 512M part c2e55895-c266-48e3-a249-8574d9502604
├─sdb3 207.1G part a8ca35d1-8aef-4b47-8784-68848999cfe0
└─sdb4 16G part ef366a6b-d331-4d0b-a08b-9c6e73abe626
└─sdb4 16G crypt
sdc 3.6T disk
├─sdc1 2G part 9ee250cd-dccf-494c-aab3-1f2cc2fbd84c
└─sdc2 3.6T part 7148f5b2-5cf3-4f20-8f67-3e13d6350289
sdd 3.6T disk
├─sdd1 2G part 02b02130-2f0b-415f-ae8c-85d849261c33
└─sdd2 3.6T part 83c95c75-7fe2-4fec-9845-d974fc3f637f
root@truenas[/home/admin]#
root@truenas[/home/admin]# zpool import
pool: pool
id: 13997916364281612929
state: DEGRADED
status: One or more devices contains corrupted data.
action: The pool can be imported despite missing or damaged devices. The
fault tolerance of the pool may be compromised if imported.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
config:
pool DEGRADED
raidz2-0 DEGRADED
7148f5b2-5cf3-4f20-8f67-3e13d6350289 ONLINE
78768fa5-a92a-4548-9557-0f42401ccccf ONLINE
9143061293094038254 UNAVAIL
83c95c75-7fe2-4fec-9845-d974fc3f637f ONLINE
root@truenas[/home/admin]#
root@truenas[/home/admin]#
root@truenas[/home/admin]#
root@truenas[/home/admin]# zpool status -v
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:19 with 0 errors on Fri May 30 03:45:20 2025
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sdb3 ONLINE 0 0 0
errors: No known data errors
root@truenas[/home/admin]#
but I still got this:
root@truenas[/home/admin]#
root@truenas[/home/admin]# zpool import pool
cannot import 'pool': insufficient replicas
Destroy and re-create the pool from
a backup source.
root@truenas[/home/admin]#
admin@truenas[~]$ sudo zpool import -F pool
cannot import 'pool': insufficient replicas
Destroy and re-create the pool from
a backup source.
admin@truenas[~]$ sudo zpool import -f pool
cannot import 'pool': insufficient replicas
Destroy and re-create the pool from
a backup source.
admin@truenas[~]$ sudo zpool import -f -m pool
cannot import 'pool': insufficient replicas
Destroy and re-create the pool from
a backup source.
admin@truenas[~]$
This is an issue. You should look into replacing these with CMR drives, or building a new pool.
Even though it may not be the (sole) reason for your trouble, as it seems that 24.10.2 and/or 25.04 are perfectly capable of eating pools without SMR drives being involved.
But first, your data!
Check before trying potentially dangerous commands.
Since -f and -F have already failed, your next attempts would be -X, or an import at -T 8847849 with the fourth drive attached.
Try sudo zpool import -FXn pool in a tmux session, as this could take a long time to return… and the desirable result is actually nothing (no error). This should be safe due to the -n option. Wait for confirmation by @Arwen or @HoneyBadger before attempting real recovery (without -n but with -R /mnt).
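(Roughly, in a tmux session; the second command is for later, only after posting the dry-run result and getting confirmation:)
tmux new -s recovery
sudo zpool import -FXn pool
# later, only if the dry run is clean and the folks above agree:
sudo zpool import -FX -R /mnt pool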
I agree SMR drives are not good, but we should focus on getting the pool back online first. However, I would personally advise against trying to resilver onto the same SMR drive if you get the pool online with one drive extracted.
We only have SMART data for one drive, and it isn’t showing any major defects but there are some oddities:
1.5 start/stops PER HOUR???!!! Without APM??? Could be an aspect of this being a drive with firmware designed for ad-hoc desktop usage rather than NAS usage.
SMART short tests every 24 hours (too frequent?) but no long tests at all?
@etorix’s advice sounds right to me. Try doing a trial import with -FXn and see what the result is. But don’t try it for real until you have posted the results of the trial and the real experts have taken a look.