TrueNAS SCALE disk errors and pool missing

Hi guys,

I’m in a bit of a mess. My pool is gone and all my drives are offline.

The drives are about 3 years old. About a year ago I started getting random errors saying that drives were failing: sometimes it is sda, sometimes sdb, sometimes both, and a few times sdc and sdd, but mainly a and b. Usually turning off the system, removing the drives, dusting the connectors and reconnecting fixes the error: the pool shows healthy and the drives’ SMART shows no errors.

Yesterday I got a message (see below). I turned the system off, cleaned the connectors and turned it back on; now all 4 drives are offline and the pool is missing… I restarted a few times, nothing. Please help.

I’m running the latest version of TrueNAS SCALE.

  • Device: /dev/sda [SAT], ATA error count increased from 0 to 72.
  • Device: /dev/sdb [SAT], ATA error count increased from 0 to 72.
  • Failed to configure kubernetes cluster for Applications: Missing "HDD/ix-applications/releases, HDD/ix-applications/k3s" dataset(s) required for starting kubernetes.
  • 'boot-pool' is consuming USB devices 'sdc' which is not recommended.
  • Pool HDD state is OFFLINE: None
  • SMB shares have path-related configuration issues that may impact service stability: apps: Path does not exist., Robert_Backup: Path does not exist., work HDD: Path does not exist., plex: Path does not exist., Home-Assistant: Path does not exist., addguard: Path does not exist.
  • Failed to configure kubernetes cluster for Applications: Missing "HDD/ix-applications/k3s,

Here is a standard set of commands from @Protopia which will help narrow down the issue.
lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
lspci
sas2flash -list
sas3flash -list
sudo zpool status -v
sudo zpool import

Also, when you say SMART shows no errors, were these long SMART tests? Can you also post the results?

Also be careful when referring to drives as sda, sdb etc. These names can change between reboots. It could have been the same drive all along…

A drive that has been complaining for that long should also have been replaced long ago.

The sas commands are probably not useful in your case, since I assume the drives are not connected to an HBA.

Is there any way of knowing which drive is sda and which is sdb, since the names change?

No, you have to identify them by serial number.

Ideally you will identify them by Partition UUID - both zpool and my lsblk commands show partition UUID.

This is what the commands listed by @Farout are intended to do. Run them and copy and paste the results here (see below for copy & paste instructions).
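To make the "identify by Partition UUID" point concrete, here is a minimal sketch of how a PARTUUID from zpool output can be mapped back to whatever device name it has on the current boot. The function name `partuuid_to_dev` and its optional second argument are made up for illustration; the only system facts it relies on are the `/dev/disk/by-partuuid` symlinks that udev maintains and `readlink -f`.

```shell
# Resolve a partition UUID to the device node it currently points at.
# $1 = PARTUUID, $2 = lookup directory (hypothetical override, defaults
# to the udev-maintained /dev/disk/by-partuuid).
partuuid_to_dev() {
    uuid=$1
    dir=${2:-/dev/disk/by-partuuid}
    if [ ! -e "$dir/$uuid" ]; then
        echo "no such PARTUUID: $uuid" >&2
        return 1
    fi
    # Follow the symlink to the real device node, e.g. /dev/sdb2
    readlink -f "$dir/$uuid"
}

# Example on a live system (the result can differ after every reboot):
#   part=$(partuuid_to_dev 73e8954f-fd43-4770-a5f8-78faa91fd6ee)
#   disk=$(lsblk -no PKNAME "$part")          # parent disk name, e.g. sdb
#   lsblk -dno NAME,MODEL,SERIAL "/dev/$disk" # serial printed on the label
```

The serial number is what you match against the sticker on the physical drive; the sdX name is only valid until the next reboot.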

Now to try to answer the original post…

I have a hunch that the I/O errors are what have taken your HDD pool offline. Most of the other error messages are a result of this, so this is what we need to concentrate on. Rebooting will not bring the drives back online - we have to do some command line stuff.

Bad electrical contact can be the cause of I/O errors, but not necessarily. Dust is IMO unlikely to be the cause, so dusting the connectors is unlikely to be a fix. Grease, greasy dirt, corrosion or oxidation are more likely, and for those you need to clean the contacts on both sides of the connector and at both ends of the cable with a degreasing agent, and maybe give them a gentle scrape to remove any oxidation. What happens when you remove and reconnect is that you create enough friction to make contact again, but not enough for the connection to stay good over time.

You have good quality cables, and a quick check with an online PSU calculator says that a 250W PSU should be fine.

So, at present I don’t see any detailed evidence to point the way to a cause.

In addition to the previous commands, can you please also run:

  • smartctl -x /dev/sda
  • smartctl -x /dev/sdb
  • smartctl -x /dev/sdc
  • smartctl -x /dev/sdd
  • smartctl -x /dev/sde

Please copy and paste the output of each command separately and enclose each within separate lines containing just ``` i.e.

```
paste of command 1 results
```
```
paste of command 2 results
```
which should come out looking like this:

   paste of command 1 results
   paste of command 2 results
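If typing five commands and copying five outputs by hand is awkward, a small loop can save each drive's report to its own file first. This is only a sketch: `collect_smart`, the `SMARTCTL` override and the `smart-<name>.txt` file names are invented for illustration; the one real command it runs is `smartctl -x <device>`, exactly as listed above.

```shell
# Save `smartctl -x` output for each given device into its own file.
# $1 = output directory, remaining arguments = device nodes.
# SMARTCTL is a hypothetical override for the binary to call.
collect_smart() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for dev in "$@"; do
        name=$(basename "$dev")
        # smartctl's exit status is a bit mask and can be non-zero even
        # when the report was produced, so a failure does not stop the loop.
        ${SMARTCTL:-smartctl} -x "$dev" > "$outdir/smart-$name.txt" 2>&1 || true
    done
}

# Example (run as root):
#   collect_smart /tmp/smart /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Each resulting file can then be pasted between its own pair of ``` lines.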

A few more questions:

  1. I cannot find the MB you specified - is it A1SRi-2758F or something else?

  2. What exact Atom processor do you have? C2758?

  3. Are you using the PCIe slot(s)?

  4. You do realise that whilst these MBs have 6x SATA ports, only 2 are SATA 3 with the other 4 being SATA 2? Since you are using RAIDZ and I/O goes to all ports, this is effectively equivalent to all ports being SATA2.

  5. You may have some different issues running off a USB converter - but it should work, and these shouldn’t cause this issue…

  6. I am a bit worried by some of what you said: “I’m running the latest version of TrueNAS SCALE” suggests you are running 24.10, which does NOT use Kubernetes. Yet you are also getting the error “Failed to configure kubernetes cluster for Applications”. So I would guess that you are probably on 24.04 Dragonfish and not 24.10 ElectricEel. However, a lot of these messages are because your pool is offline, so that is what we need to focus on.

  7. Finally, and most importantly, do you have a backup of your system configuration file? If not take one now!!

P.S. Once we have it back up and running you should think about a small SSD to hold your apps and their data.

0x01 0x008 4 64 --- Lifetime Power-On Resets
0x01 0x010 4 20820 --- Power-on Hours
0x01 0x018 6 34484635712 --- Logical Sectors Written
0x01 0x020 6 1617000114 --- Number of Write Commands
0x01 0x028 6 72177585644 --- Logical Sectors Read
0x01 0x030 6 161084819 --- Number of Read Commands
0x01 0x038 6 1937555968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20696 --- Spindle Motor Power-on Hours
0x03 0x010 4 20681 --- Head Flying Hours
0x03 0x018 4 101 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 1 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 33 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 4 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 32 --- Average Short Term Temperature
0x05 0x018 1 32 --- Average Long Term Temperature
0x05 0x020 1 57 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 53 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 32 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 412 --- Number of Hardware Resets
0x06 0x010 4 195 --- Number of ASR Events
0x06 0x018 4 1 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 58 --- Lifetime Power-On Resets
0x01 0x010 4 20814 --- Power-on Hours
0x01 0x018 6 34614825386 --- Logical Sectors Written
0x01 0x020 6 1615251139 --- Number of Write Commands
0x01 0x028 6 72191888854 --- Logical Sectors Read
0x01 0x030 6 162599582 --- Number of Read Commands
0x01 0x038 6 1915955968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20689 --- Spindle Motor Power-on Hours
0x03 0x010 4 20682 --- Head Flying Hours
0x03 0x018 4 91 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 27 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 3 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 33 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 56 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 405 --- Number of Hardware Resets
0x06 0x010 4 195 --- Number of ASR Events
0x06 0x018 4 1 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 58 --- Lifetime Power-On Resets
0x01 0x010 4 20774 --- Power-on Hours
0x01 0x018 6 33576322377 --- Logical Sectors Written
0x01 0x020 6 1575699408 --- Number of Write Commands
0x01 0x028 6 69991566414 --- Logical Sectors Read
0x01 0x030 6 156448445 --- Number of Read Commands
0x01 0x038 6 1771955968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20648 --- Spindle Motor Power-on Hours
0x03 0x010 4 20634 --- Head Flying Hours
0x03 0x018 4 98 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 1 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 26 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 1 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 33 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 55 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 31 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 406 --- Number of Hardware Resets
0x06 0x010 4 194 --- Number of ASR Events
0x06 0x018 4 2 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value

Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 57 --- Lifetime Power-On Resets
0x01 0x010 4 20823 --- Power-on Hours
0x01 0x018 6 33781113043 --- Logical Sectors Written
0x01 0x020 6 1582398369 --- Number of Write Commands
0x01 0x028 6 71483697664 --- Logical Sectors Read
0x01 0x030 6 166213734 --- Number of Read Commands
0x01 0x038 6 1948355968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20697 --- Spindle Motor Power-on Hours
0x03 0x010 4 20682 --- Head Flying Hours
0x03 0x018 4 97 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 5 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 25 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 1 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 32 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 55 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 32 --- Lowest Average Short Term Temperature
0x05 0x040 1 47 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 407 --- Number of Hardware Resets
0x06 0x010 4 194 --- Number of ASR Events
0x06 0x018 4 2 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
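For anyone comparing dumps like the four above, a short filter can pull out the handful of Device Statistics counters that usually matter when deciding between "bad disk" and "bad connection". This is a rough sketch, not an official tool: the function name `smart_summary` and the `key=value` labels are invented here, and it assumes the value sits in the fourth whitespace-separated column, as it does in the `smartctl -x` output pasted in this thread.

```shell
# Read `smartctl -x` output on stdin and print selected Device Statistics
# counters as key=value lines, in the order they appear in the report.
smart_summary() {
    awk '
        /Number of Reallocated Logical Sectors/   { print "reallocated=" $4 }
        /Number of Reported Uncorrectable Errors/ { print "uncorrectable=" $4 }
        /Number of Hardware Resets/               { print "hw_resets=" $4 }
        /Number of Interface CRC Errors/          { print "crc_errors=" $4 }
    '
}

# Example (run as root):
#   smartctl -x /dev/sda | smart_summary
```

In the dumps posted here, reallocated sectors are 0 on all four drives, while hardware resets are between 405 and 412 on every one of them. That is at least consistent with something the drives share (cabling, power, controller) rather than four disks failing at once, but it does not prove it.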

NAME MODEL ROTA PTTYPE TYPE START SIZE PARTTYPENAME PARTUUID
sda WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sda1 1 gpt part 128 2147418624 Linux swap d3c6800b-f1f3-48a5-8732-ffa98f2d65f7
└─sda2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
sdb WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdb1 1 gpt part 128 2147418624 Linux swap a24c4968-41fd-488e-8b45-830d1980e329
└─sdb2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS 73e8954f-fd43-4770-a5f8-78faa91fd6ee
sdc WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdc1 1 gpt part 128 2147418624 Linux swap eeb74157-84fd-4099-858a-73d8cffdf35e
└─sdc2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS c842d9a6-4ca8-4477-81aa-cae554d19506
sdd WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdd1 1 gpt part 128 2147418624 Linux swap 2fac4b9e-41b7-4bfe-bd30-0fa4dd8ae50e
└─sdd2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS 25087691-9aac-45c6-99a2-741fcda14a58
sde EC-SP02 0 gpt disk 250059350016
├─sde1 0 gpt part 4096 1048576 BIOS boot 8aa8c827-4d03-49e7-a4e9-9c83c53e8fd2
├─sde2 0 gpt part 6144 536870912 EFI System 06886ca9-0b4d-439c-94c2-57a0929ee358
├─sde3 0 gpt part 34609152 232339447296 Solaris /usr & Apple ZFS 58bd29ca-0a73-46db-ad0c-532333469d90
└─sde4 0 gpt part 1054720 17179869184 Linux swap 0924bb8d-be28-4557-a778-edaf3597201e
└─sde4 0 crypt 17179869184

00:00.0 Host bridge: Intel Corporation Atom processor C2000 SoC Transaction Router (rev 02)
00:01.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 1 (rev 02)
00:02.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 2 (rev 02)
00:03.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 3 (rev 02)
00:0b.0 Co-processor: Intel Corporation Atom processor C2000 QAT (rev 02)
00:0e.0 Host bridge: Intel Corporation Atom processor C2000 RAS (rev 02)
00:0f.0 IOMMU: Intel Corporation Atom processor C2000 RCEC (rev 02)
00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 02)
00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.1 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.2 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.3 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:16.0 USB controller: Intel Corporation Atom processor C2000 USB Enhanced Host Controller (rev 02)
00:17.0 SATA controller: Intel Corporation Atom processor C2000 AHCI SATA2 Controller (rev 02)
00:18.0 SATA controller: Intel Corporation Atom processor C2000 AHCI SATA3 Controller (rev 02)
00:1f.0 ISA bridge: Intel Corporation Atom processor C2000 PCU (rev 02)
00:1f.3 SMBus: Intel Corporation Atom processor C2000 PCU SMBus (rev 02)
01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
03:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)

The sas2flash and sas3flash commands show no adapter.

root@HomeServer[~]# sudo zpool status -v

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:04:04 with 0 errors on Mon Nov 18 03:49:06 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sde3      ONLINE       0     0     0

errors: No known data errors

root@HomeServer[~]# sudo zpool import

pool: HDD
id: 414478922821486095
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:

    HDD                                       ONLINE
      raidz1-0                                ONLINE
        73e8954f-fd43-4770-a5f8-78faa91fd6ee  ONLINE
        25087691-9aac-45c6-99a2-741fcda14a58  ONLINE
        f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58  ONLINE
        c842d9a6-4ca8-4477-81aa-cae554d19506  ONLINE

root@HomeServer[~]#

I ran sudo zpool import HDD and got this error:

cannot import 'HDD': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Wed Nov 20 05:34:37 2024
should correct the problem. Approximately 5 seconds of data
must be discarded, irreversibly. Recovery can be attempted
by executing 'zpool import -F HDD'. A scrub of the pool
is strongly recommended after recovery.

I then ran sudo zpool import -F HDD

and got

cannot mount '/HDD': failed to create mountpoint: Read-only file system
Import was successful, but unable to mount some datasets
root@HomeServer[~]#

I’m running a scrub now.

When importing via CLI, you must use:

zpool import -o altroot=/mnt POOLNAME

Should I stop the scrub and run this command? I can’t upload photos, but all disks are online and show no errors; clicking the dataset, however, shows the error.

Yes. Stop it. Export via GUI, then import via CLI using the altroot.
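For context on why the altroot matters: without it, the earlier import tried to create the mountpoint at /HDD on the root filesystem, which is read-only on TrueNAS, hence the "Read-only file system" message. TrueNAS keeps its pools under /mnt. The ZFS side of the sequence could be sketched as below. Treat it as an illustration only: the function name and the `RUN` dry-run switch are made up, and `zpool export` is used here as the CLI counterpart of the GUI export recommended above, which is an assumption of this sketch rather than what was done in this thread.

```shell
# Stop a running scrub, export the pool, and re-import it under /mnt.
# $1 = pool name. Set RUN=echo to print the commands instead of running
# them (hypothetical dry-run switch). Run as root.
stop_scrub_and_reimport() {
    pool=$1
    # A failure here usually just means no scrub was in progress.
    ${RUN:-} zpool scrub -s "$pool" || true
    # CLI stand-in for the GUI export (assumption, see lead-in).
    ${RUN:-} zpool export "$pool" || return 1
    # altroot=/mnt makes the datasets mount under /mnt/<pool> instead of /<pool>.
    ${RUN:-} zpool import -o altroot=/mnt "$pool"
}

# Dry run, nothing is changed:
#   RUN=echo; stop_scrub_and_reimport HDD
```

This only covers what ZFS itself sees; whether the TrueNAS GUI then lists the pool under Storage is a separate question.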

I get this
root@HomeServer[~]# zpool import -o altroot=/mnt HDD
cannot import 'HDD': a pool with that name already exists
use the form 'zpool import <pool | id> <newpool>' to give it a new name
root@HomeServer[~]#

export the pool first

OK, when exporting I get 3 options: destroy data, delete saved config and confirm export. I guess the first is a no; the second is already selected.

IMO second is NO too.

OK, I ran the command. I see the datasets, but if I go to Storage, it says no pool.

What does zpool status -v say now?

root@HomeServer[~]# zpool status -v
pool: HDD
state: ONLINE
scan: scrub canceled on Wed Nov 20 14:37:59 2024
config:

    NAME                                      STATE     READ WRITE CKSUM
    HDD                                       ONLINE       0     0     0
      raidz1-0                                ONLINE       0     0     0
        73e8954f-fd43-4770-a5f8-78faa91fd6ee  ONLINE       0     0     0
        25087691-9aac-45c6-99a2-741fcda14a58  ONLINE       0     0     0
        f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58  ONLINE       0     0     0
        c842d9a6-4ca8-4477-81aa-cae554d19506  ONLINE       0     0     0

errors: No known data errors

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:04:04 with 0 errors on Mon Nov 18 03:49:06 2024
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      sde3      ONLINE       0     0     0

errors: No known data errors

Try a reboot…it seems to have imported fine :man_shrugging:

Sorry my ZFS-fu is at an end now.