Failed SMART test on mirrored boot pool

One of the two drives in my mirrored boot pool failed its weekly SMART test overnight. I’m having some difficulty finding more info about the test results in the admin panel - all I have is the text of the alerts.

Boot pool status is ONLINE: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.

Device: /dev/sde [SAT], not capable of SMART self-check.

Device: /dev/sde [SAT], failed to read SMART Attribute Data.

Device: /dev/sde [SAT], Read SMART Self-Test Log Failed.

Device: /dev/sde [SAT], Read SMART Error Log Failed.

I’d appreciate some guidance on how to approach dealing with this.

What model of disk is used in the boot pool?
Try running “smartctl -a /dev/[disk]” to check the supported SMART tests.
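
For example, for the device named in the alerts (adjust the device name if yours differs):

sudo smartctl -a /dev/sde

Near the top of the output, check the “SMART support is:” lines and the self-test capabilities section.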

Some SSDs do not support SMART tests.
Or worse, a disk in your boot pool has failed.

Both of the disks in the boot pool are Patriot P210 128GB SATA SSDs. I built this NAS slightly over a year ago, and this is the first time a SMART test has failed. So, the model definitely supports SMART tests.

You should try out my troubleshooting flowcharts - see the link in my signature.

You may or may not have a drive problem - it could just be ZFS. The charts will have you enter a few commands, and then you just do what they say.

I would appreciate feedback as well so I can improve on it.

This flowchart seems useful in general, but I don’t think it helps me here. Like I said, the SMART test for this drive fails, so the flowchart just immediately dead-ends.

I know I probably have a serious problem - that’s why I’m here. I’m asking what I need to do about it.

edit - I’ve tried running smartctl in the TrueNAS shell, but it says the command isn’t found. Do I need to install it? I’ve done command-line installs plenty of times on desktop Linux, but never in TrueNAS.

“System Settings” → “Boot”
“Boot pool status” button

That should show you the vDev status of your boot pool. I expect you will have one drive shown as Error.

I have not done the following myself, so I might have got something not quite right, but AFAIK…

If you have a spare SATA port, power off, add your replacement SSD, power on and from the above screen select attach from the vertical dots on the working drive. Once it has completely resilvered, click the vertical dots for the Error drive and click Remove.

If you DON’T have a spare SATA port, power off, replace the broken SSD, power on and from the above screen select replace from the vertical dots on the Errored drive.
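
For reference, I believe what the GUI does in each case corresponds roughly to the ZFS operations below - a sketch only, since the GUI also clones the boot/EFI partitions, which raw zpool commands do not. The names assume the faulted member is sde3 (the ZFS partition on the sde drive named in the alerts) and the healthy one is sdf3; <new-part> is a placeholder for the new SSD’s ZFS partition:

sudo zpool attach boot-pool sdf3 <new-part>   # mirror onto the new member; starts a resilver
sudo zpool status boot-pool                   # wait for the resilver to finish
sudo zpool detach boot-pool sde3              # then drop the faulted member

For a boot pool, stick to the GUI, though - it handles the partitioning and bootloader for you.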

You need to run it as root - you are a non-privileged user right now. Or try sudo smartctl -a /dev/???

Your situation helps me make certain things clearer in the flowchart, such as whether you are running as a privileged user or not.

When you are writing up a post, you should include the following:
What version of TrueNAS you are running.
The output of zpool status -v, the drive identification by GPTID (or Drive Ident for SCALE), and the SMART output from smartctl -x /dev/sde.
Refer to Appendix B for the commands.

Please include those using the ‘</>’ tags (above) to maintain the proper formatting of the messages.

Here is my “go to” list of commands to run to provide the info that @joeschmuck was asking for…

  • lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
  • sudo zpool status -v
  • sudo zpool import
  • lspci
  • sudo storcli show all
  • sudo sas2flash -list
  • sudo sas3flash -list

When I went into the boot pool in the GUI per Protopia’s suggestion, it says the pool overall is healthy, but that one of the two drives has errors.

And smartctl did work when I ran it with sudo. I’d seen terminal commands say they need root privileges to run before, but I’d never seen one that said the command didn’t exist at all without sudo. It just says Inquiry Failed when I run it on the bad drive, though, whether with -a or -x. sudo smartctl does work when I run it on the other drives, so the drive is definitely the problem.

I do not have an extra SATA port, so I’ll need to boot off one drive. Is there any point to formatting the bad drive, re-mirroring the boot pool onto it, and seeing if the issue recurs? Or should I just go get a new one?

Here’s what I get from those commands:

lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID

NAME     MODEL                ROTA PTTYPE TYPE    START           SIZE PARTTYPENAME             PARTUUID
sda      ST12000VN0008-2YS101    1 gpt    disk          12000138625024                          
└─sda1                           1 gpt    part     2048 12000136527872 Solaris /usr & Apple ZFS a4d7f440-4faf-4aff-b9a5-2814927494fd
sdb      ST12000VN0008-2YS101    1 gpt    disk          12000138625024                          
└─sdb1                           1 gpt    part     2048 12000136527872 Solaris /usr & Apple ZFS e38c5f62-3d4f-4e37-9c04-d7638f85bb20
sdc      ST12000VN0008-2YS101    1 gpt    disk          12000138625024                          
└─sdc1                           1 gpt    part     2048 12000136527872 Solaris /usr & Apple ZFS f58a057f-ddbd-46d4-abb5-e75d853a09f2
sdd      ST12000VN0008-2YS101    1 gpt    disk          12000138625024                          
└─sdd1                           1 gpt    part     2048 12000136527872 Solaris /usr & Apple ZFS 1c04a49e-d86e-467a-aa35-d6306bc531ad
sde      Patriot P210 128GB      0 gpt    disk            128035676160                          
├─sde1                           0 gpt    part     4096        1048576 BIOS boot                5079a1fa-6a9d-4b7f-9394-64f02b64a1dd
├─sde2                           0 gpt    part     6144      536870912 EFI System               18b626e0-9f8b-49b4-ac89-7f758f5b05e1
├─sde3                           0 gpt    part 34609152   110315773440 Solaris /usr & Apple ZFS 4f3aef10-ac5e-40d6-8b1a-ae79bda1e941
└─sde4                           0 gpt    part  1054720    17179869184 Linux swap               07e35d1d-3147-4f33-8278-86ff147a0bb8
sdf      Patriot P210 128GB      0 gpt    disk            128035676160                          
├─sdf1                           0 gpt    part     4096        1048576 BIOS boot                e0e589e8-461a-4058-8277-2a180cbd8a63
├─sdf2                           0 gpt    part     6144      536870912 EFI System               1acf25c0-59b0-47fa-b440-a29c3bed357a
├─sdf3                           0 gpt    part 34609152   110315773440 Solaris /usr & Apple ZFS a71562de-4c31-4a2d-932c-5ee7f7a571de
└─sdf4                           0 gpt    part  1054720    17179869184 Linux swap               cacda179-8611-40af-9304-810f17a50864
nvme0n1  WD Blue SN580 500GB     0 gpt    disk            500107862016                          
└─nvme0n1p1                      0 gpt    part     4096   500105740800 Solaris /usr & Apple ZFS 55ddfbb4-c896-4a96-9df4-d65487c57796

sudo zpool status -v

  pool: Apps
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:02:54 with 0 errors on Sun Feb  9 04:02:55 2025
    scan warning: skipped blocks that are only referenced by the checkpoint.
checkpoint: created Tue Jun  4 22:11:12 2024, consumes 19.5G
config:

        NAME                                    STATE     READ WRITE CKSUM
        Apps                                    ONLINE       0     0     0
          55ddfbb4-c896-4a96-9df4-d65487c57796  ONLINE       0     0     0

errors: No known data errors

  pool: RAID
 state: ONLINE
  scan: scrub repaired 0B in 04:22:46 with 0 errors on Sun Feb  9 04:22:47 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        RAID                                      ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1c04a49e-d86e-467a-aa35-d6306bc531ad  ONLINE       0     0     0
            e38c5f62-3d4f-4e37-9c04-d7638f85bb20  ONLINE       0     0     0
            a4d7f440-4faf-4aff-b9a5-2814927494fd  ONLINE       0     0     0
            f58a057f-ddbd-46d4-abb5-e75d853a09f2  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
  scan: scrub repaired 0B in 00:02:02 with 0 errors on Mon Mar  3 03:47:03 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sde3    FAULTED      3    87     0  too many errors
            sdf3    ONLINE       0     0     0

errors: No known data errors

sudo zpool import

no pools available to import

lspci

00:00.0 Host bridge: Intel Corporation Device 461c
00:02.0 VGA compatible controller: Intel Corporation Alder Lake-N [UHD Graphics]
00:0d.0 USB controller: Intel Corporation Device 464e
00:14.0 USB controller: Intel Corporation Alder Lake-N PCH USB 3.2 xHCI Host Controller
00:14.2 RAM memory: Intel Corporation Alder Lake-N PCH Shared SRAM
00:16.0 Communication controller: Intel Corporation Alder Lake-N PCH HECI Controller
00:17.0 SATA controller: Intel Corporation Device 54d3
00:1c.0 PCI bridge: Intel Corporation Device 54ba
00:1c.3 PCI bridge: Intel Corporation Device 54bb
00:1c.6 PCI bridge: Intel Corporation Device 54be
00:1d.0 PCI bridge: Intel Corporation Device 54b0
00:1d.1 PCI bridge: Intel Corporation Device 54b1
00:1d.2 PCI bridge: Intel Corporation Device 54b2
00:1f.0 ISA bridge: Intel Corporation Alder Lake-N PCH eSPI Controller
00:1f.3 Audio device: Intel Corporation Alder Lake-N PCH High Definition Audio Controller
00:1f.4 SMBus: Intel Corporation Device 54a3
00:1f.5 Serial bus controller: Intel Corporation Device 54a4
01:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
02:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
03:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
04:00.0 Ethernet controller: Intel Corporation Ethernet Controller I226-V (rev 04)
05:00.0 Non-Volatile memory controller: Sandisk Corp WD Blue SN580 NVMe SSD (DRAM-less) (rev 01)
06:00.0 SATA controller: JMicron Technology Corp. JMB58x AHCI SATA controller

sudo storcli show all

CLI Version = 007.2807.0000.0000 Dec 22, 2023
Operating system = Linux 6.6.44-production+truenas
Status Code = 0
Status = Success
Description = None

Number of Controllers = 0
Host Name = truenas
Operating System  = Linux 6.6.44-production+truenas

sudo sas2flash -list

LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        No LSI SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.

sudo sas3flash -list

Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.

If it were me I would:

  1. Make sure that I had an up to date copy of the system configuration file.
  2. Shutdown the server and check that the SATA cables to the Patriot SSDs were seated properly.
  3. Restart and try to replace the failed drive with itself, i.e. trigger a resilver, and see what happens (a rough shell equivalent is sketched below).
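
I have not done this on a boot pool myself, so treat the following as a sketch. Per the action text in your zpool status output, clearing the errors on the faulted member marks it repaired and lets ZFS resilver it (sde3 assumed from your output; adjust if the name has changed):

sudo zpool clear boot-pool sde3   # mark the member repaired; ZFS resilvers it
sudo zpool status boot-pool       # watch the resilver and the error counters

If the errors are real, the member will simply get faulted again.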

Thank you. I have backed up the configuration file to another device, and will try what you advise later tonight.

After I check the cables and turn it back on, I assume TrueNAS will pop an alert about the bad boot drive. Will the alert have a “click here to resilver the bad boot drive” type prompt, or will I need to find that button myself?

What’s the cost of a replacement drive? 20 € with shipping? Less?
What’s the value of any time spent diagnosing what’s wrong with this drive?

Powered off, re-seated the SATA power and data cables, and rebooted. The boot pool now shows as DEGRADED instead of ONLINE. The drive that was working last time is still working, and the drive that wasn’t working still isn’t. However, the Boot Pool page now says the faulted drive only has 11 errors, whereas before it said there were 90. Also, the Disks page no longer lists a serial number for the faulted drive.

TrueNAS didn’t prompt me with an option to resilver the boot pool. How can I do that?

You only resilver after replacing the bad drive, physically.

That is the most obvious situation, but when a drive goes offline due to errors caused by a dodgy SATA connection, you can bring it back online by resilvering.
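
In command terms that is something like the following, depending on the member’s state in zpool status (sde3 assumed, as elsewhere in this thread):

sudo zpool online boot-pool sde3   # if the member shows OFFLINE
sudo zpool clear boot-pool sde3    # if it shows FAULTED

Either way ZFS resilvers whatever the member missed, and faults it again if the errors recur.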

Remove the failed drive using the GUI, and then extend the boot-pool from the GUI.

First, use sudo if you need to - that goes without saying.

Try this: smartctl --scan-open and see what the results are. You will be presented with the interface each drive uses; place that into the command below.

Example output:
/dev/sde -d scsi # /dev/sde, SCSI device

You then enter smartctl -d interface -x /dev/sde with that interface filled in - for the example above, smartctl -d scsi -x /dev/sde.

Let’s say this fails to work. We have one other option - not my favorite, as the format is not as easy to read, but I’m certain it will work: midclt call disk.smart_attributes sde | jq > sde.txt. This will create a file called sde.txt. You can then open the text file and cut/paste the contents to the forum.

If you are able to use smartctl, then the next command would be to start a SMART long test: smartctl -d interface -t long /dev/sde
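
Once the long test completes, you can read the result back from the drive’s self-test log using the same interface option (or re-run smartctl -x, which includes the log):

smartctl -d interface -l selftest /dev/sde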

Cheers

EDIT: I never jump to the conclusion a drive has failed without at least trying to prove it. Drives cost money, testing does not.

smartctl --scan-open

/dev/sda -d sat # /dev/sda [SAT], ATA device
/dev/sdb -d sat # /dev/sdb [SAT], ATA device
/dev/sdc -d sat # /dev/sdc [SAT], ATA device
/dev/sdd -d sat # /dev/sdd [SAT], ATA device
# /dev/sde -d scsi # /dev/sde, SCSI device open failed: INQUIRY failed
/dev/sdf -d sat # /dev/sdf [SAT], ATA device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device

Note - unplugging and re-plugging seems to have changed the sdN assignments. The misbehaving drive is still sde, but the working boot drive is now sdd instead of sdf.

sde shows as scsi, which seems odd when all of the other SATA drives show as ata. But anyway, smartctl -d scsi -x /dev/sde doesn’t get me anything helpful.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Standard Inquiry (36 bytes) failed [Input/output error]
Retrying with a 64 byte Standard Inquiry
Standard Inquiry (64 bytes) failed [Input/output error]
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

midclt call disk.smart_attributes sde | jq > sde.txt also fails at the top, but there’s a lot of detail after that:

[EFAULT] smartctl failed for disk sde:
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.6.44-production+truenas",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-A",
      "/dev/sde",
      "-j"
    ],
    "messages": [
      {
        "string": "Smartctl open device: /dev/sde failed: INQUIRY failed",
        "severity": "error"
      }
    ],
    "exit_status": 2
  },
  "local_time": {
    "time_t": 1741097294,
    "asctime": "Tue Mar  4 08:08:14 2025 CST"
  }
}

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 211, in call_method
    result = await self.middleware.call_with_audit(message['method'], serviceobj, methodobj, params, self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1529, in call_with_audit
    result = await self._call(method, serviceobj, methodobj, params, app=app,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 49, in nf
    res = await f(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/smart_attributes.py", line 41, in smart_attributes
    output = json.loads(await self.middleware.call('disk.smartctl', name, ['-A', '-j']))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1629, in call
    return await self._call(
           ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1460, in _call
    return await methodobj(*prepared_call.args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/schema/processor.py", line 179, in nf
    return await func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/middlewared/plugins/disk_/smartctl.py", line 75, in smartctl
    raise CallError(f'smartctl failed for disk {disk}:\n{cp.stdout}')
middlewared.service_exception.CallError: [EFAULT] smartctl failed for disk sde:
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": false,
    "svn_revision": "5530",
    "platform_info": "x86_64-linux-6.6.44-production+truenas",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "-A",
      "/dev/sde",
      "-j"
    ],
    "messages": [
      {
        "string": "Smartctl open device: /dev/sde failed: INQUIRY failed",
        "severity": "error"
      }
    ],
    "exit_status": 2
  },
  "local_time": {
    "time_t": 1741097294,
    "asctime": "Tue Mar  4 08:08:14 2025 CST"
  }
}

@etorix can you, or someone else, tell me exactly where to go in the GUI to remove the failed drive from the boot pool? And then where exactly to go to add it back and trigger the resilver?

I understand what you want me to do here, but I’ve spent a while going through the GUI, and I can’t find an option to do this. It’s not in Storage - Disks, System - Boot Pool, or anywhere else I’ve checked.