I'm in a bit of a mess. My pool is gone and all my drives are offline.
The drives are about 3 years old. About a year ago I started getting random errors saying that drives were failing: sometimes sda, sometimes sdb, sometimes both, and a few times sdc and sdd, but mainly sda and sdb. Usually, turning off the system, removing the drives, dusting the connectors and reconnecting them fixed the error: the pool would show healthy and the drives' SMART data showed no errors.
Yesterday I got a message (see below). I turned the system off, cleaned the connectors and turned it back on, and now all 4 drives are offline and the pool is missing… I restarted a few times, nothing. Please help.
I'm running the latest version of TrueNAS SCALE.
Device: /dev/sda [SAT], ATA error count increased from 0 to 72.
Device: /dev/sdb [SAT], ATA error count increased from 0 to 72.
Failed to configure kubernetes cluster for Applications: Missing "HDD/ix-applications/releases, HDD/ix-applications/k3s" dataset(s) required for starting kubernetes.
"boot-pool" is consuming USB devices "sdc" which is not recommended.
Pool HDD state is OFFLINE: None
SMB shares have path-related configuration issues that may impact service stability: apps: Path does not exist., Robert_Backup: Path does not exist., work HDD: Path does not exist., plex: Path does not exist., Home-Assistant: Path does not exist., addguard: Path does not exist.
Failed to configure kubernetes cluster for Applications: Missing "HDD/ix-applications/k3s,
Here is a standard set of commands from @Protopia which will help narrow down the issue:
lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
lspci
sas2flash -list
sas3flash -list
sudo zpool status -v
sudo zpool import
Also, when you say SMART shows no errors, were these long SMART tests? Can you also post the results?
Also be careful when referring to drives as sda, sdb etc. These names can change between reboots. It could always have been the same drive…
A drive that has been complaining for so long should also have been replaced long ago.
The sas commands are probably not useful in your case, since I assume the drives are not connected to an HBA.
Ideally you will identify them by Partition UUID - both zpool and my lsblk commands show partition UUID.
This is what the commands listed by @Farout are intended to do. Run them and copy and paste the results here (see below for copy & paste instructions).
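As a rough sketch of identifying a drive by partition UUID rather than by its sdX name: the helper below reads `NAME PARTUUID` pairs, which is the shape of `lsblk -rno NAME,PARTUUID` output, and prints whichever device currently carries a given UUID. The function name `dev_for_partuuid` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel name of the partition that
# currently carries the given PARTUUID.
# Input (stdin): lines of "NAME PARTUUID", e.g. from
#   lsblk -rno NAME,PARTUUID
dev_for_partuuid() {
    # Whole-disk rows have no PARTUUID column, so they never match.
    awk -v want="$1" '$2 == want { print $1 }'
}

# Example with two rows taken from the lsblk output posted in this thread:
printf '%s\n' \
    'sda2 f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58' \
    'sdb2 73e8954f-fd43-4770-a5f8-78faa91fd6ee' |
    dev_for_partuuid 73e8954f-fd43-4770-a5f8-78faa91fd6ee
```

On a live system the same lookup would be `lsblk -rno NAME,PARTUUID | dev_for_partuuid <uuid>`. The answer can differ after a reboot, which is exactly why the UUID is the safer label.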
Now to try to answer the original post…
I have a hunch that the I/O errors are what has taken your HDD pool offline. Most other error messages are a result of this. So we need to concentrate on this. Rebooting will not bring them back online - we have to do some command line stuff.
Bad electrical contact can be the cause of I/O errors, but not necessarily. Dust is IMO unlikely to be the cause, so dusting the connectors is unlikely to be a fix. The cause is more likely to be grease, greasy dirt, corrosion or oxidization, and for that you need to clean the contacts on both sides of the connector and at both ends of the cable with a degreasing agent, and maybe give them a gentle scrape to remove any oxidization. What happens when you remove and reconnect is that you create enough friction to make contact again, but not enough for the connection to stay good over time.
You have good quality cables, and a quick check with an online PSU calculator says that a 250W PSU should be fine.
So, at present I don't see any detailed evidence to point the way to a cause.
In addition to the previous commands, can you please also run:
smartctl -x /dev/sda
smartctl -x /dev/sdb
smartctl -x /dev/sdc
smartctl -x /dev/sdd
smartctl -x /dev/sde
Please copy and paste the output of each command separately and enclose each within separate lines containing just ``` i.e.
```
paste of command 1 results
```
```
paste of command 2 results
```
which should come out looking like this:
paste of command 1 results
paste of command 2 results
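If typing the five smartctl commands one by one is tedious, something along these lines could collect them into one file per drive, ready to paste separately. This is only a sketch: `SMARTCTL`, `OUTDIR` and `collect_smart` are illustrative names, not TrueNAS settings.

```shell
#!/bin/sh
# Sketch: run "smartctl -x" for each drive and keep one output file per drive.
# SMARTCTL and OUTDIR are illustrative variables chosen for this example.
SMARTCTL="${SMARTCTL:-smartctl}"
OUTDIR="${OUTDIR:-/tmp/smart-reports}"

collect_smart() {
    mkdir -p "$OUTDIR" || return 1
    for dev in "$@"; do
        name=$(basename "$dev")
        # smartctl uses a non-zero exit status to flag many conditions,
        # so keep going and save whatever it printed.
        "$SMARTCTL" -x "$dev" > "$OUTDIR/$name.txt" 2>&1 || true
        echo "saved $OUTDIR/$name.txt"
    done
}

# Usage (as root):
#   collect_smart /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde
```

Each file can then be pasted between its own pair of ``` lines as described above.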
A few more questions:
I cannot find the MB you specified - is it A1SRi-2758F or something else?
What exact Atom processor do you have? C2758?
Are you using the PCIe slot(s)?
You do realise that whilst these MBs have 6x SATA ports, only 2 are SATA 3 with the other 4 being SATA 2? Since you are using RAIDZ and I/O goes to all ports, this is effectively equivalent to all ports being SATA2.
You may have some different issues running off a USB converter - but it should work, and these shouldn't cause this issue…
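On the SATA 2 versus SATA 3 point, the kernel log records the speed each link actually negotiated (3.0 Gbps for SATA 2, 6.0 Gbps for SATA 3). A minimal sketch for listing them follows; `link_speeds` is a made-up helper name, and it assumes the usual libata "SATA link up" log line.

```shell
#!/bin/sh
# Sketch: list the negotiated speed of each SATA link from kernel log text.
# Reads log lines on stdin; link_speeds is a made-up helper name.
link_speeds() {
    # Kernel lines look like:
    #   ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
    sed -n 's/.*\(ata[0-9][0-9]*\): SATA link up \([0-9.]*\) Gbps.*/\1 \2/p'
}

# Usage on a live system (as root):
#   dmesg | link_speeds
```

A link that repeatedly drops and comes back at a lower speed would be another hint of a marginal cable or connector.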
I am a bit worried by some of what you said: "Im running the latest version of truenas scale" suggests you are running 24.10, which does NOT use Kubernetes. Yet you are also getting the error "Failed to configure kubernetes cluster for Applications". So I would guess that you are probably on 24.04 Dragonfish and not 24.10 ElectricEel. However, a lot of these messages are because your pool is offline, so that is what we need to focus on.
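To settle the version question, the version string can be mapped to the two releases mentioned above. This is a sketch under assumptions: `apps_backend` is a hypothetical helper, and reading the version from `/etc/version` is an assumption about where SCALE stores it (the web UI dashboard shows it too).

```shell
#!/bin/sh
# Sketch: map a TrueNAS SCALE version string to the release names mentioned
# above. apps_backend is a hypothetical helper, not a TrueNAS command.
apps_backend() {
    case "$1" in
        24.04*) echo "Dragonfish: apps use Kubernetes" ;;
        24.10*) echo "ElectricEel: apps do not use Kubernetes" ;;
        *)      echo "unknown release: $1" ;;
    esac
}

# Assumption: the installed version is readable from /etc/version.
if [ -r /etc/version ]; then
    apps_backend "$(cat /etc/version)"
fi
```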
Finally, and most importantly, do you have a backup of your system configuration file? If not, take one now!!
P.S. Once we have it back up and running you should think about a small SSD to hold your apps and their data.
0x01 0x008 4 64 --- Lifetime Power-On Resets
0x01 0x010 4 20820 --- Power-on Hours
0x01 0x018 6 34484635712 --- Logical Sectors Written
0x01 0x020 6 1617000114 --- Number of Write Commands
0x01 0x028 6 72177585644 --- Logical Sectors Read
0x01 0x030 6 161084819 --- Number of Read Commands
0x01 0x038 6 1937555968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20696 --- Spindle Motor Power-on Hours
0x03 0x010 4 20681 --- Head Flying Hours
0x03 0x018 4 101 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 1 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 33 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 4 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 32 --- Average Short Term Temperature
0x05 0x018 1 32 --- Average Long Term Temperature
0x05 0x020 1 57 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 53 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 32 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 412 --- Number of Hardware Resets
0x06 0x010 4 195 --- Number of ASR Events
0x06 0x018 4 1 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 58 --- Lifetime Power-On Resets
0x01 0x010 4 20814 --- Power-on Hours
0x01 0x018 6 34614825386 --- Logical Sectors Written
0x01 0x020 6 1615251139 --- Number of Write Commands
0x01 0x028 6 72191888854 --- Logical Sectors Read
0x01 0x030 6 162599582 --- Number of Read Commands
0x01 0x038 6 1915955968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20689 --- Spindle Motor Power-on Hours
0x03 0x010 4 20682 --- Head Flying Hours
0x03 0x018 4 91 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 0 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 27 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 3 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 33 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 56 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 30 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 405 --- Number of Hardware Resets
0x06 0x010 4 195 --- Number of ASR Events
0x06 0x018 4 1 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 58 --- Lifetime Power-On Resets
0x01 0x010 4 20774 --- Power-on Hours
0x01 0x018 6 33576322377 --- Logical Sectors Written
0x01 0x020 6 1575699408 --- Number of Write Commands
0x01 0x028 6 69991566414 --- Logical Sectors Read
0x01 0x030 6 156448445 --- Number of Read Commands
0x01 0x038 6 1771955968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20648 --- Spindle Motor Power-on Hours
0x03 0x010 4 20634 --- Head Flying Hours
0x03 0x018 4 98 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 1 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 26 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 1 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 33 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 55 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 31 --- Lowest Average Short Term Temperature
0x05 0x040 1 48 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 406 --- Number of Hardware Resets
0x06 0x010 4 194 --- Number of ASR Events
0x06 0x018 4 2 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 57 --- Lifetime Power-On Resets
0x01 0x010 4 20823 --- Power-on Hours
0x01 0x018 6 33781113043 --- Logical Sectors Written
0x01 0x020 6 1582398369 --- Number of Write Commands
0x01 0x028 6 71483697664 --- Logical Sectors Read
0x01 0x030 6 166213734 --- Number of Read Commands
0x01 0x038 6 1948355968 --- Date and Time TimeStamp
0x03 ===== = = === == Rotating Media Statistics (rev 1) ==
0x03 0x008 4 20697 --- Spindle Motor Power-on Hours
0x03 0x010 4 20682 --- Head Flying Hours
0x03 0x018 4 97 --- Head Load Events
0x03 0x020 4 0 --- Number of Reallocated Logical Sectors
0x03 0x028 4 5 --- Read Recovery Attempts
0x03 0x030 4 0 --- Number of Mechanical Start Failures
0x03 0x038 4 0 --- Number of Realloc. Candidate Logical Sectors
0x03 0x040 4 25 --- Number of High Priority Unload Events
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 1 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 0 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 24 --- Current Temperature
0x05 0x010 1 32 --- Average Short Term Temperature
0x05 0x018 1 34 --- Average Long Term Temperature
0x05 0x020 1 55 --- Highest Temperature
0x05 0x028 1 23 --- Lowest Temperature
0x05 0x030 1 52 --- Highest Average Short Term Temperature
0x05 0x038 1 32 --- Lowest Average Short Term Temperature
0x05 0x040 1 47 --- Highest Average Long Term Temperature
0x05 0x048 1 34 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 65 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 407 --- Number of Hardware Resets
0x06 0x010 4 194 --- Number of ASR Events
0x06 0x018 4 2 --- Number of Interface CRC Errors
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
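Since the suspicion in this thread is the connection rather than the disks themselves, the counters worth comparing across the four drives are the transport and error ones. In the output above, every drive reports roughly 405 to 412 hardware resets against only 57 to 64 power-on resets. A small sketch for pulling just those counters out of a pasted `smartctl -x` statistics section follows; `link_counters` is a made-up helper name.

```shell
#!/bin/sh
# Sketch: pull the transport and error counters out of one drive's
# "smartctl -x" device statistics. Reads the text on stdin.
# link_counters is a made-up helper name.
link_counters() {
    # Field 4 of each statistics row is the counter value.
    awk '
        /Number of Hardware Resets/               { print "hardware_resets=" $4 }
        /Number of ASR Events/                    { print "asr_events=" $4 }
        /Number of Interface CRC Errors/          { print "crc_errors=" $4 }
        /Number of Reported Uncorrectable Errors/ { print "uncorrectable=" $4 }
    '
}

# Example using values posted above for one of the drives:
printf '%s\n' \
    '0x04 0x008 4 4 --- Number of Reported Uncorrectable Errors' \
    '0x06 0x008 4 412 --- Number of Hardware Resets' \
    '0x06 0x010 4 195 --- Number of ASR Events' \
    '0x06 0x018 4 1 --- Number of Interface CRC Errors' |
    link_counters
```

On a live system the input would come from `smartctl -x /dev/sdX | link_counters`, run once per drive.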
NAME MODEL ROTA PTTYPE TYPE START SIZE PARTTYPENAME PARTUUID
sda WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sda1 1 gpt part 128 2147418624 Linux swap d3c6800b-f1f3-48a5-8732-ffa98f2d65f7
└─sda2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS f4d8b7aa-08e7-49b8-a06e-c0c1bca59d58
sdb WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdb1 1 gpt part 128 2147418624 Linux swap a24c4968-41fd-488e-8b45-830d1980e329
└─sdb2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS 73e8954f-fd43-4770-a5f8-78faa91fd6ee
sdc WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdc1 1 gpt part 128 2147418624 Linux swap eeb74157-84fd-4099-858a-73d8cffdf35e
└─sdc2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS c842d9a6-4ca8-4477-81aa-cae554d19506
sdd WDC WD40EFZX-68AWUN0 1 gpt disk 4000787030016
├─sdd1 1 gpt part 128 2147418624 Linux swap 2fac4b9e-41b7-4bfe-bd30-0fa4dd8ae50e
└─sdd2 1 gpt part 4194432 3998639463936 Solaris /usr & Apple ZFS 25087691-9aac-45c6-99a2-741fcda14a58
sde EC-SP02 0 gpt disk 250059350016
├─sde1 0 gpt part 4096 1048576 BIOS boot 8aa8c827-4d03-49e7-a4e9-9c83c53e8fd2
├─sde2 0 gpt part 6144 536870912 EFI System 06886ca9-0b4d-439c-94c2-57a0929ee358
├─sde3 0 gpt part 34609152 232339447296 Solaris /usr & Apple ZFS 58bd29ca-0a73-46db-ad0c-532333469d90
└─sde4 0 gpt part 1054720 17179869184 Linux swap 0924bb8d-be28-4557-a778-edaf3597201e
  └─sde4 0 crypt 17179869184
00:00.0 Host bridge: Intel Corporation Atom processor C2000 SoC Transaction Router (rev 02)
00:01.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 1 (rev 02)
00:02.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 2 (rev 02)
00:03.0 PCI bridge: Intel Corporation Atom processor C2000 PCIe Root Port 3 (rev 02)
00:0b.0 Co-processor: Intel Corporation Atom processor C2000 QAT (rev 02)
00:0e.0 Host bridge: Intel Corporation Atom processor C2000 RAS (rev 02)
00:0f.0 IOMMU: Intel Corporation Atom processor C2000 RCEC (rev 02)
00:13.0 System peripheral: Intel Corporation Atom processor C2000 SMBus 2.0 (rev 02)
00:14.0 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.1 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.2 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:14.3 Ethernet controller: Intel Corporation Ethernet Connection I354 (rev 03)
00:16.0 USB controller: Intel Corporation Atom processor C2000 USB Enhanced Host Controller (rev 02)
00:17.0 SATA controller: Intel Corporation Atom processor C2000 AHCI SATA2 Controller (rev 02)
00:18.0 SATA controller: Intel Corporation Atom processor C2000 AHCI SATA3 Controller (rev 02)
00:1f.0 ISA bridge: Intel Corporation Atom processor C2000 PCU (rev 02)
00:1f.3 SMBus: Intel Corporation Atom processor C2000 PCU SMBus (rev 02)
01:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 03)
02:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 30)
03:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
The other two commands, sas2flash and sas3flash, show no adapter.
root@HomeServer[~]# sudo zpool status -v
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:04:04 with 0 errors on Mon Nov 18 03:49:06 2024
config:
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sde3 ONLINE 0 0 0
errors: No known data errors
root@HomeServer[~]# sudo zpool import
pool: HDD
id: 414478922821486095
state: ONLINE
action: The pool can be imported using its name or numeric identifier.
config:
cannot import 'HDD': I/O error
Recovery is possible, but will result in some data loss.
Returning the pool to its state as of Wed Nov 20 05:34:37 2024
should correct the problem. Approximately 5 seconds of data
must be discarded, irreversibly. Recovery can be attempted
by executing 'zpool import -F HDD'. A scrub of the pool
is strongly recommended after recovery.
I ran sudo zpool import -F HDD
and got
cannot mount '/HDD': failed to create mountpoint: Read-only file system
Import was successful, but unable to mount some datasets
root@HomeServer[~]#
I get this
root@HomeServer[~]# zpool import -o altroot=/mnt HDD
cannot import 'HDD': a pool with that name already exists
use the form 'zpool import <pool | id> <newpool>' to give it a new name
root@HomeServer[~]#
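Reading the two outputs together: the `-F` import did succeed, so the pool is already imported (hence "a pool with that name already exists"), but without an altroot it tried to mount at /HDD on the read-only root file system. One possible way forward, as a sketch only, is to export the pool and import it again with `-R /mnt`. `ZPOOL` and `reimport_under_mnt` are illustrative names, and TrueNAS normally expects pools to be imported through its web UI, so treat this as an illustration rather than the recommended procedure.

```shell
#!/bin/sh
# Sketch only: re-import the pool under /mnt after the rewind import above.
# ZPOOL lets the commands be dry-run (e.g. ZPOOL=echo); it is an illustrative
# variable, not something zpool itself reads.
ZPOOL="${ZPOOL:-zpool}"

reimport_under_mnt() {
    pool="$1"
    # The pool is already imported, so it has to be exported first.
    "$ZPOOL" export "$pool" || return 1
    # -R sets altroot, so datasets mount under /mnt instead of the read-only /.
    "$ZPOOL" import -R /mnt "$pool" || return 1
    # Show the resulting pool state, including any files with errors.
    "$ZPOOL" status -v "$pool"
}

# Usage (as root):
#   reimport_under_mnt HDD
# The import message above recommends a scrub afterwards:
#   zpool scrub HDD
```

Running it first with `ZPOOL=echo` prints the three commands without touching the pool.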