I rebooted, but when it started there were no datasets and no pools, so I ran your command again. Now I see the datasets but no pool.
Try exporting again, then:
zpool import -F -o altroot=/mnt HDD
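For the record, the full sequence would be something like this (assuming the pool is still named HDD):

```
zpool export HDD                       # cleanly release the pool first
zpool import -F -o altroot=/mnt HDD    # then re-import it in recovery mode under /mnt
```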
I tried. After the reset there are no pools and no datasets, so I can only run
zpool import -F -o altroot=/mnt HDD
but again I see the datasets and no pool. I restarted again and under Pools I see no pool, but if I click Import it shows I can import HDD, so I’m importing from the GUI.
OK, it finished importing from the GUI and everything looks fine now: the pool is there with no errors, and the datasets and apps are back.
A huge thank you for your help. Now, if I can bother you a bit more: how do I find out whether any of the drives are actually bad?
I didn’t really help.
Better copy away your important data…
I think I’ll replace discs a and b, whichever they are, one by one. Can I mix them with Seagate drives? Maybe I’ll have better luck.
Yes. Just make sure they are CMR drives and not SMR, and of the same size.
Thank you again. I’m running a short SMART test on all drives.
You should do long tests.
Also, when replacing discs in a dodgy raidz1, it’s best to replace drives while the old ones are still connected.
Don’t I need to export the faulty disk, connect the new one instead, and let the system resilver?
If you have a spare port, do not offline anything: plug in the new drive, go to Storage > Pool > Status, select a drive and click “Replace”. ZFS will resilver to replace the drive and then offline the old drive; remove the old drive, rinse and repeat.
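For reference, the command-line equivalent would be roughly the following - the GUI route above is safer, and the names in angle brackets are placeholders, not your actual values:

```
# <old-disk-partuuid> is the partuuid of the drive being replaced, as shown in zpool status;
# <new-disk> is the freshly connected drive, e.g. /dev/sde
zpool replace HDD <old-disk-partuuid> <new-disk>

# watch the resilver progress
zpool status HDD
```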
@smic717394 You got lucky when you did as the error message suggested and ran a zpool import -F - you could easily have made things worse by trying random commands.
Take my advice and SLOOOOOW DOWN, and wait for expert advice from people who have some knowledge and understanding of ZFS.
So, before you decide to swap out drives you need to establish what problems you now have and whether a drive actually needs to be swapped out. If you have some other sort of problem and you attempt to swap out a drive you may end up losing your data.
So, here is my opinion on what you need to do:

- Run SMART Long tests on each drive (see the command sketch after this list). Once they have all finished…
- Run smartctl -x /dev/sdX for each drive again, and post the responses, this time making sure to follow my previous instructions about enclosing them with lines containing ``` so that the output is readable. The last lot were 100x more difficult to interpret because you didn’t do this - but from what I could tell from trying to read this mess, all the drives looked fine.
- Reboot and check the pool comes online automatically again - maybe repeat this to be doubly sure.
- Let us analyse the smartctl output and advise on whether your disks have a problem or not, and if so what to do about it.
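To make those first two steps concrete, the commands would look something like this (device names are assumptions; substitute your actual drives):

```
# start a long (extended) self-test - repeat for each of the four disks
smartctl -t long /dev/sda

# after the tests have completed (several hours later), capture the full details
smartctl -x /dev/sda
```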
Agreed… I didn’t just run the zpool import -f out of nowhere; I was told here to run the command zpool import, and when I ran it the TrueNAS message said I should run zpool import -f. Anyway, the advice is good and appreciated.
I’m running a long test on the first disk; it’s at 40%, so we’ll see how it goes. When I’ve finished them all I’ll post the results.
So far Topology, Usage, ZFS Health and Disk Health are all green. I’m thinking maybe the hot-swap connectors on the back of some bays are bad, or the power supply, but it’s early to say. It’s still running on the first disk, at 50% after about 3 hours.
Anyway, I really appreciate all the help from you guys.
I forgot to mention that after I imported the pool and everything was working, I later got a notification:
```
ZFS has finished a resilver:

   eid: 28
 class: resilver_finish
  host: HomeServer
  time: 2024-11-20 14:25:03+0100
  pool: HDD
 state: ONLINE
  scan: resilvered 696K in 00:00:03 with 0 errors on Wed Nov 20 14:25:03 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            73e8954f-fd43-4770-a5f8-78faa91fd6ee  ONLINE       0     0     0
            25087691-9aac-45c6-99a2-741fcda14a58  ONLINE       0     0     0
            f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58  ONLINE       0     0     0
            c842d9a6-4ca8-4477-81aa-cae554d19506  ONLINE       0     0     0

errors: No known data errors
```
And looking at the error notification from this morning, is there any way to identify the drive from this info?
```
The number of I/O errors associated with a ZFS device exceeded
acceptable levels. ZFS has marked the device as faulted.

 impact: Fault tolerance of the pool may be compromised.
    eid: 19
  class: statechange
  state: FAULTED
   host: HomeServer
   time: 2024-11-20 05:34:37+0100
  vpath: /dev/disk/by-partuuid/f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
  vguid: 0xC6B9449BC950CBC3
   pool: HDD (0x05C0866261A9460F)
```
@smic717394 A little knowledge is a dangerous thing. zpool import -f and zpool import -F are very different things. You REALLY need to know what you are doing when you issue ZFS console commands, especially when you do it as root where there are few if any safety nets.
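Roughly speaking (check the zpool-import man page for the exact wording):

```
# -f  forces the import of a pool that looks like it is in use by another system
zpool import -f HDD

# -F  is recovery mode: it winds back the last few transactions to get the pool
#     into an importable state, and can therefore lose the most recent writes
zpool import -F HDD
```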
To save time, you can run the SMART long tests in parallel as each one only involves the drive you run it on.
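For example, something along these lines would start them all at once (device names are assumptions; adjust to your actual disks):

```
# kick off a long self-test on every data disk; the tests run inside the drives themselves
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
  smartctl -t long "$d"
done
```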
The resilver message and the most recent zpool status show zero errors. The details of which drive had errors previously may be available in the system logs; someone else will need to tell you which commands to run to check for this. But let’s wait until we see the smartctl -x output once the long tests have finished and see if that helps identify the root cause.
Once the system is fully stable you should implement @joeschmuck’s Multi-Report script so that you get early warnings by email of any disk issues.
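In the meantime, the fault notification itself already points at the disk: the vpath is a partition UUID, and you can resolve it to the current device name, for example (the UUID below is the one from your notification; /dev/sdX names can change between boots, so treat the result as a snapshot):

```
# resolve the faulted partition's partuuid to its current /dev/sdX name
readlink -f /dev/disk/by-partuuid/f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
```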
Still running the SMART tests, but I think I identified sda2 and sdb2.
Running zpool status I get
```
  pool: HDD
 state: ONLINE
  scan: scrub canceled on Wed Nov 20 14:37:59 2024
config:

        NAME                                      STATE     READ WRITE CKSUM
        HDD                                       ONLINE       0     0     0
          raidz1-0                                ONLINE       0     0     0
            73e8954f-fd43-4770-a5f8-78faa91fd6ee  ONLINE       0     0     0
            25087691-9aac-45c6-99a2-741fcda14a58  ONLINE       0     0     0
            f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58  ONLINE       0     0     0
            c842d9a6-4ca8-4477-81aa-cae554d19506  ONLINE       0     0     0

errors: No known data errors
```
Then running zpool status -LP HDD I get
```
root@HomeServer[~]# zpool status -LP HDD
  pool: HDD
 state: ONLINE
  scan: scrub canceled on Wed Nov 20 14:37:59 2024
config:

        NAME           STATE     READ WRITE CKSUM
        HDD            ONLINE       0     0     0
          raidz1-0     ONLINE       0     0     0
            /dev/sdb2  ONLINE       0     0     0
            /dev/sdd2  ONLINE       0     0     0
            /dev/sda2  ONLINE       0     0     0
            /dev/sdc2  ONLINE       0     0     0

errors: No known data errors
```
So I guess:
sdb2: 73e8954f-fd43-4770-a5f8-78faa91fd6ee
sdd2: 25087691-9aac-45c6-99a2-741fcda14a58
sda2: f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
sdc2: c842d9a6-4ca8-4477-81aa-cae554d19506
Then by running dd if=/dev/sdX2 of=/dev/null bs=1M count=5000 against each one I can see which physical drive is which.
It is a reasonable assumption that the devices are shown in the same order, but it is still an assumption rather than a definitive extrapolation.
The correct way to do this is using lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID which will list the device name /dev/sdX1 and the associated UUID, which you can then use to map against the UUIDs in the pool.
Also, to identify drives by flashing lights, the dd command is probably an OK way to do it, but you can probably specify the device by UUID, i.e. dd if=/dev/disk/by-uuid/25087691-9aac-45c6-99a2-741fcda14a58 of=/dev/null bs=1M count=5000.
I like this command. Thank you. For the second one I had to change dd if=/dev/disk/by-uuid/ to dd if=/dev/disk/by-partuuid/ because I see the disks are divided into 2 partitions, a swap sda1 and the main partition sda2. Not sure if this is OK.
```
NAME   MODEL                ROTA PTTYPE TYPE    START          SIZE PARTTYPENAME             PARTUUID
sda    WDC WD40EFZX-68AWUN0    1 gpt    disk          4000787030016
├─sda1                         1 gpt    part      128    2147418624 Linux swap               d3c6800b-f1f3-48a5-8732-ffa98f2d65f7
└─sda2                         1 gpt    part  4194432 3998639463936 Solaris /usr & Apple ZFS f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
```
Sorry - my mistake.
OK, so I’ve done a long SMART test on all 4 drives, one by one; each took about 7 hours, and they all show a pass.
What else could it be?