Confused on why my Pool died. Need advice

I need a bit of advice. A couple of days ago I first had random errors on my ZFS pool. I decided to reboot my NAS, and then the BIOS didn't recognize the boot SSD anymore. I thought my cheap PCIe to SATA adapter had died and the sudden loss of my boot SSD had corrupted some things, so I installed a new 2.5" SATA drive instead and reinstalled. The pool was not dead at this point, it just had a few corrupted files. After I deleted those it worked again for a few days. But last night I heard my NAS make really weird noises (maybe the HDDs?), and when I woke up this morning and checked, the pool was dead and I got this error:
[image of the error, not preserved]

If I check on the drives in the storage menu, it says they both have roughly 60 errors on them. My question now is why this happened. I had my pool set up with one vdev consisting of 2 drives in a mirror. If one drive died, I don't understand how both of them have errors. The drives are Seagate Barracuda 4TB and are only a year old. And why didn't the S.M.A.R.T. self-tests say anything?

Here is my log file; the weird noise was at around 2am on the 4th of December: Nov 28 06:32:18 truenas kernel: Console: switching to colour dummy device 80x25 - Pastebin.com

Update: I just rebooted the NAS again and the error switched to “Pool Main_ZPool state is Offline:None”. The storage panel says the vdev is offline and gives no further info, and the disk section no longer recognizes either of my HDDs. What's the chance of both of them dying at the same time? Maybe the SATA ports on my motherboard are just faulty? I really have no clue. I also just realized that a long SMART self-test was scheduled to run on the 4th at 0am; maybe that caused the drives to die?

Update 2: Looking through the log file, the first error appears to be “ata1: SATA link down (SStatus 0 SControl 300)”. Does that mean it disconnected from the motherboard side, or did the drive stop responding? Did my motherboard data controller die maybe? If someone knows how to proceed here, any help is much appreciated :smiley:
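On the “SATA link down (SStatus 0 SControl 300)” question: the kernel prints both registers in hexadecimal. In the SATA register layout, the lowest four bits of SStatus (DET) report device detection, the next four (SPD) the negotiated speed, and the next four (IPM) the interface power state. Below is a minimal Python sketch of that decoding; the function and table names are made up for illustration. A DET value of 0 only says the host port saw no device at all, so on its own it cannot distinguish a dead drive from a bad cable, a power drop, or a failing controller.

```python
# Decode the SStatus value that libata prints in "SATA link down/up" messages.
# Field layout (DET, SPD, IPM) follows the SATA SStatus register definition.
# Names used here (DET_MEANING, decode_sstatus) are illustrative only.

DET_MEANING = {
    0x0: "no device detected, Phy communication not established",
    0x1: "device detected, Phy communication not established",
    0x3: "device detected, Phy communication established",
    0x4: "Phy offline (interface disabled or in loopback)",
}


def decode_sstatus(hex_text):
    """Split a hex SStatus string (as printed by the kernel) into its fields."""
    value = int(hex_text, 16)
    det = value & 0xF          # bits 0-3: device detection
    spd = (value >> 4) & 0xF   # bits 4-7: negotiated speed generation (0 = none)
    ipm = (value >> 8) & 0xF   # bits 8-11: interface power management state
    return {
        "det": det,
        "spd": spd,
        "ipm": ipm,
        "det_meaning": DET_MEANING.get(det, "reserved"),
    }


if __name__ == "__main__":
    # Value taken from the log line quoted in the post.
    print(decode_sstatus("0"))
```

For comparison, a healthy 6.0 Gbps link is usually logged with SStatus 133, which decodes to DET = 3, SPD = 3, IPM = 1.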

Glad you came here from Reddit to get some help.

The first things we will need are a detailed description of your hardware and some diagnostics. Please run the following commands and copy and paste the results here (putting the results of each command in a separate unformatted box by clicking the </> button):

  • lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
  • smartctl -x /dev/sdX for each of the missing drives indicated in the previous command
  • lspci
  • sas2flash -list
  • sas3flash -list
  • sudo zpool status -v
  • sudo zpool import

SMART does not (and cannot) catch everything. It's already good that you had set up regular tests; what do the SMART reports say?
The “Barracuda” drives could be SMR, whose internal maintenance can look like an I/O error to ZFS (the drive does not respond fast enough? FAULTED!).

According to CMR and SMR Hard Drives | Seagate US, they are SMR drives. Only the 1TB drives in the Barracuda line are CMR drives.

Edit: Fixed typo

@palomar Are these drives connected to motherboard SATA ports?

Some drives have Error Recovery Control (ERC, also known as TLER or CCTL), which lets you set (through smartctl) how long the drive keeps retrying before it reports an error back to TrueNAS. If these specific Barracuda drives support it (we will know when we see the output from smartctl -x), then increasing it could prevent this from happening.
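As a rough illustration of what checking and setting that timeout looks like: smartctl exposes it through the `-l scterc` option, with values given in tenths of a second. The Python sketch below only builds the command lines and does not run them; the helper names are hypothetical and `/dev/sdX` is the same placeholder used in the command list above. A drive without this feature typically reports that SCT commands are not supported, and on many drives the setting does not survive a power cycle, so treat this as something to verify per drive.

```python
# Build (but do not execute) smartctl command lines for SCT Error Recovery Control.
# Helper names are hypothetical; "-l scterc[,READ,WRITE]" is the smartctl option,
# and READ/WRITE are expressed in deciseconds (tenths of a second).
from typing import List


def scterc_query_command(device: str) -> List[str]:
    """Command that reads the current ERC read/write timeouts."""
    return ["smartctl", "-l", "scterc", device]


def scterc_set_command(device: str, read_seconds: float, write_seconds: float) -> List[str]:
    """Command that sets the ERC timeouts, converting seconds to deciseconds."""
    read_ds = round(read_seconds * 10)
    write_ds = round(write_seconds * 10)
    return ["smartctl", "-l", f"scterc,{read_ds},{write_ds}", device]


if __name__ == "__main__":
    # Print the commands so they can be reviewed before being run with sudo.
    print(" ".join(scterc_query_command("/dev/sdX")))
    print(" ".join(scterc_set_command("/dev/sdX", 7, 7)))
```

The 7-second value in the usage lines is only an example figure, not a recommendation for these drives.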

Obviously we need to try to get the existing disks back online first. But a note to @Palomar for the future: avoid SMR disks like the plague in redundant ZFS pools. Regardless of how they behave in normal operation, if you ever need to resilver you may be waiting several days or even weeks for it to complete.

Yes, they were. They stopped being detected though, and the logs say various SATA ports disconnected multiple times. The first SSD that I thought was bad was also connected via SATA, which makes me think the data controller on my motherboard chipset failed.

I disconnected the hard drives to prevent further damage, but they were not being detected by the system anymore anyway. Here are the outputs:

This is my motherboard+cpu combo, if that matters: ASRock J5040-ITX Mini-ITX

lsblk -bo:

NAME   MODEL ROTA PTTYPE TYPE   START         SIZE PARTTYPENAME             PARTUUID
sda    ZOTAC    0 gpt    disk         120034123776
├─sda1          0 gpt    part    4096      1048576 BIOS boot                747939a2-6acb-49e3-87a7-cdd4a29c402d
├─sda2          0 gpt    part    6144    536870912 EFI System               500314b4-cc1d-4445-8d77-e059aadb84fd
└─sda3          0 gpt    part 1054720 119494090240 Solaris /usr & Apple ZFS 7acd1083-369c-43fc-8e3e-6a8f9372c016

smartctl -x /dev/sda:
(the SSD is the only drive that responded; the others returned "no such device", even before I disconnected them)

=== START OF INFORMATION SECTION ===
Device Model:     ZOTAC SATA SSD
Serial Number:    A1D4076508F100171777
Firmware Version: SAFM12.2
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        Not in smartctl database 7.3/5625
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec  4 16:16:11 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
DSN feature is:   Unavailable
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (   30) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   2) minutes.
Conveyance self-test routine
recommended polling time:        (   3) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     PO-R--   100   100   050    -    0
  9 Power_On_Hours          -O--C-   100   100   000    -    3482
 12 Power_Cycle_Count       -O--C-   100   100   000    -    1763
168 Unknown_Attribute       -O--C-   100   100   000    -    2
170 Unknown_Attribute       PO----   100   100   010    -    109
173 Unknown_Attribute       -O--C-   100   100   000    -    1441847
192 Power-Off_Retract_Count -O--C-   100   100   000    -    139
194 Temperature_Celsius     PO---K   070   070   030    -    30 (Min/Max 30/30)
218 Unknown_Attribute       PO-R--   100   100   050    -    1
231 Unknown_SSD_Attribute   PO--C-   100   100   000    -    98
241 Total_LBAs_Written      -O--C-   100   100   000    -    4356
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O     51  Comprehensive SMART error log
0x03       GPL     R/O     64  Ext. Comprehensive SMART error log
0x04       GPL,SL  R/O      6  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x30       GPL,SL  R/O      8  IDENTIFY DEVICE data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (64 sectors)
No Errors Logged

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      3473         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Commands not supported

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 2) ==
0x01  0x008  4            1763  ---  Lifetime Power-On Resets
0x01  0x010  4            3482  ---  Power-on Hours
0x01  0x018  6      9135778932  ---  Logical Sectors Written
0x01  0x020  6       137682828  ---  Number of Write Commands
0x01  0x028  6     13532224226  ---  Logical Sectors Read
0x01  0x030  6       217831962  ---  Number of Read Commands
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x04  0x010  4               1  ---  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              32  ---  Current Temperature
0x05  0x010  1              32  ---  Average Short Term Temperature
0x05  0x018  1              32  ---  Average Long Term Temperature
0x05  0x020  1              50  ---  Highest Temperature
0x05  0x028  1               5  ---  Lowest Temperature
0x05  0x030  1              50  ---  Highest Average Short Term Temperature
0x05  0x038  1              16  ---  Lowest Average Short Term Temperature
0x05  0x040  1              50  ---  Highest Average Long Term Temperature
0x05  0x048  1              16  ---  Lowest Average Long Term Temperature
0x05  0x050  4               0  ---  Time in Over-Temperature
0x05  0x058  1              50  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  ---  Time in Under-Temperature
0x05  0x068  1               5  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4         2046704  ---  Number of Hardware Resets
0x06  0x018  4               1  ---  Number of Interface CRC Errors
0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
0x07  0x008  1               5  ---  Percentage Used Endurance Indicator
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  4            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            3  Device-to-host register FISes sent due to a COMRESET
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0010  2            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  2            0  R_ERR response for host-to-device non-data FIS, non-CRC

lspci:

truenas% lspci
00:00.0 Host bridge: Intel Corporation Gemini Lake Host Bridge (rev 06)
00:00.1 Signal processing controller: Intel Corporation Celeron/Pentium Silver Processor Dynamic Platform and Thermal Framework Processor Participant (rev 06)
00:02.0 VGA compatible controller: Intel Corporation GeminiLake [UHD Graphics 605] (rev 06)
00:0e.0 Audio device: Intel Corporation Celeron/Pentium Silver Processor High Definition Audio (rev 06)
00:0f.0 Communication controller: Intel Corporation Celeron/Pentium Silver Processor Trusted Execution Engine Interface (rev 06)
00:12.0 SATA controller: Intel Corporation Celeron/Pentium Silver Processor SATA Controller (rev 06)
00:13.0 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.1 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.2 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:13.3 PCI bridge: Intel Corporation Gemini Lake PCI Express Root Port (rev f6)
00:15.0 USB controller: Intel Corporation Celeron/Pentium Silver Processor USB 3.0 xHCI Controller (rev 06)
00:1f.0 ISA bridge: Intel Corporation Celeron/Pentium Silver Processor LPC Controller (rev 06)
00:1f.1 SMBus: Intel Corporation Celeron/Pentium Silver Processor Gaussian Mixture Model (rev 06)
03:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 15)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)

sas2flash -list:

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        No LSI SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.

sas3flash -list:

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.

sudo zpool status -v (again this is only the ssd boot pool)

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Wed Dec  4 03:45:09 2024
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sda3      ONLINE       0     0     0

errors: No known data errors

sudo zpool import:

no pools available to import

TBH we don't have any diagnostics that can point to any specific component.

lspci says that there are two SATA controllers:

00:12.0 SATA controller: Intel Corporation Celeron/Pentium Silver Processor SATA Controller (rev 06)
04:00.0 SATA controller: ASMedia Technology Inc. ASM1062 Serial ATA Controller (rev 02)

So it is quite possible that your boot drive is connected to one controller and your HDDs to the other. You might want to see which drives are currently visible in the BIOS (screenshot please) and then see what is visible if you swap the SATA cables between ports (screenshot please). If a hard drive plugged into the current boot drive's SATA port becomes visible, that will almost certainly suggest that one of the SATA controllers is dead.
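For the drives the OS can still see, the controller behind each one can also be read from sysfs, since the resolved path of `/sys/block/<disk>` contains the PCI address of the controller it hangs off. The Python sketch below is a hedged illustration of that idea: the function names are invented here, and the sample paths in the usage note are made up to resemble this board's two controllers (00:12.0 Intel, 04:00.0 ASMedia) rather than copied from the actual system.

```python
# Map each visible block device to the PCI address of its controller, so the
# result can be compared against the lspci output. Linux-only; function names
# are illustrative.
import os
import re

# Matches a full PCI address such as 0000:00:12.0 and captures the short
# form (00:12.0) that lspci prints by default.
PCI_ADDR = re.compile(r"[0-9a-f]{4}:([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])")


def controller_of(sysfs_path):
    """Return the last PCI address in a resolved sysfs path, or None."""
    matches = PCI_ADDR.findall(sysfs_path)
    # The last address is the device closest to the disk; earlier ones are bridges.
    return matches[-1] if matches else None


def disk_controllers(block_dir="/sys/block"):
    """Return {disk name: controller PCI address} for disks behind a PCI device."""
    result = {}
    if not os.path.isdir(block_dir):
        return result
    for name in sorted(os.listdir(block_dir)):
        real_path = os.path.realpath(os.path.join(block_dir, name))
        address = controller_of(real_path)
        if address:
            result[name] = address
    return result


if __name__ == "__main__":
    for disk, address in disk_controllers().items():
        print(disk, address)
```

With a made-up path such as `/sys/devices/pci0000:00/0000:00:13.2/0000:04:00.0/ata5/host4/target4:0:0/4:0:0:0/block/sdb`, `controller_of` returns `04:00.0`, which would correspond to the ASMedia entry in the lspci listing. This only helps for drives that are still detected, so it complements the cable-swap test rather than replacing it.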

Looking at this from an entirely different perspective: regardless of which component has actually failed, it does seem reasonably certain that this is a motherboard failure. Have you taken a physical look at the MB to see if there are any signs of e.g. overheating or a component fritzing out? If it is the motherboard, we should start planning for a replacement and you should get buying.

If you want to post the full details of your case, existing motherboard and disks, we can start a debate on your best route forwards.

Tysm for your help. Right now I have already sent an RMA request for the motherboard and am waiting on a response. I kind of don't trust the system enough anymore to actually put the drives in it again though. They hold some pretty important data and I would like to prevent further damage.
I don't have the time to take the machine apart and inspect the parts right now, but I will definitely do that tomorrow afternoon.

My specs are the following:

  • My motherboard and CPU combo was the ASRock J5040-ITX
  • My case is a NAS case with a built-in PSU and four 3.5" HDD bays, called “Eolize SVD-NC11-4 Mini ITX”
  • The first SSD that disconnected randomly, and that I thought had died, is a WD Green 256GB SATA M.2 on a PCIe to SATA M.2 adapter
  • The second boot drive, the one that still connects but on which the pool died, is a 128GB Zotac SATA SSD plugged directly into one of the 4 SATA ports
  • My storage drives were 2 4TB Seagate Barracuda HDDs, which I found out while researching an RMA are SMR drives, which is less than ideal

Obviously a warranty replacement will be free (if you get one); however, you really should consider buying all new hardware.

  • An M.2 → SATA adapter is not recommended for ZFS. The disconnects could well be due to this adapter.
  • SMR drives are completely unsuitable for ZFS redundant pools.

So you should probably:

  1. Get a MB with either enough SATA ports or enough PCIe capacity to support a decent HBA. Other people here can almost certainly make ITX MB recommendations.
  2. Replace the Barracudas with CMR drives, though it is obviously more important to see if we can access the data on them.

I have had time to look at it now. The board has 2 SATA chipsets, an ASMedia one and an Intel one. I had 2 drives in a mirror, and each was plugged into a different chipset.
So I had drives disconnect on:
- The ASMedia chipset
- The Intel chipset
- The PCIe to SATA adapter with the first boot drive

I am confused. Are all my drives (both HDDs and that first boot SSD) bad, or did both chipsets and the boot drive adapter all fail at the same time? Both seem rather unlikely to me.
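One way to narrow this down from the posted log is to check whether ports on different controllers dropped at the same moment, which would point more toward a shared cause (power, for example) than toward independent drive failures. The Python sketch below is one possible approach: it assumes syslog-style lines like the “ata1: SATA link down (SStatus 0 SControl 300)” entry quoted earlier, and the function names and the sample lines in the usage section are hypothetical.

```python
# Group "SATA link down" kernel messages by timestamp to spot moments when
# several ports lost their link together. Assumes syslog-style lines such as
#   "Dec  4 02:01:10 truenas kernel: ata1: SATA link down (SStatus 0 SControl 300)"
import re
from collections import defaultdict

LINK_DOWN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"(?:\[\s*[\d.]+\]\s+)?"          # optional kernel uptime stamp
    r"(?P<port>ata\d+): SATA link down"
)


def link_down_by_time(lines):
    """Return {timestamp: set of ata ports that reported link down}."""
    events = defaultdict(set)
    for line in lines:
        match = LINK_DOWN.match(line)
        if match:
            events[match.group("stamp")].add(match.group("port"))
    return dict(events)


def simultaneous_drops(lines):
    """Keep only the timestamps at which more than one port went down."""
    return {
        stamp: sorted(ports)
        for stamp, ports in link_down_by_time(lines).items()
        if len(ports) > 1
    }


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from the actual log.
    sample = [
        "Dec  4 02:01:10 truenas kernel: ata1: SATA link down (SStatus 0 SControl 300)",
        "Dec  4 02:01:10 truenas kernel: ata5: SATA link down (SStatus 0 SControl 300)",
        "Dec  4 02:03:00 truenas kernel: ata1: SATA link down (SStatus 0 SControl 300)",
    ]
    print(simultaneous_drops(sample))
```

Matching on the exact second is a simplification; drops a second or two apart may still share a cause, so the grouped output is worth reading by eye as well.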

Really looks like SMR drives being at fault here, especially since the Barracudas seem to have a bad reputation in general.

Perhaps the dodgy PSU also contributed.

As long as you plug in a single drive, PCIe to SATA should be somewhat acceptable.

In your place I would troubleshoot by removing any potential issue: run the system with a single boot drive and use a good PSU with new cables.

Then we could try to see if something can be done for the drives, but I would not have much hope.

I am not sure about that. Why would my original boot drive also drop out if the HDDs were failing?

There are quite a few possibilities. I would focus on solving the HDD issue first since the boot drive is kinda expendable.

How big (wattage) is your PSU? Also, why are you using a PCIe to SATA adapter since you have 4x SATA ports?

I have 4 hdd bays in my case so I wanted to keep enough ports free to be able to expand later.
