Which drive is offline?

Thank you.

WDC WUH721414ALE604 9JHDHHLT Currently sda - Looks fine
WDC WUH721414ALE604 9JGJ0WVT Currently sdb - Looks fine
TOSHIBA MG07ACA14TEY Y840A0AJFFHG Currently sdc - 1 error in the log
TOSHIBA MG07ACA14TEY Y810A05JFFHG Currently sdd - 1 error in the log
WDC WUH721414ALE604 9JHMBEJT Currently sde - Looks fine
WDC WUH721414ALE604 9JGJ3UJT Currently sdf - Looks fine
TOSHIBA MG07ACA14TEY Y880A0C1FFHG Currently sdg - 1 error in the log
TOSHIBA MG07ACA14TEY Y890A08UFFHG Currently sdh - Looks fine
WDC WUH721414ALN6L4 9RH9DH5C Currently sdi - Looks fine Edit: Actually has 8 uncorrectable errors.
TOSHIBA MG07ACA14TEY Y880A07TFFHG Currently sdj - 148 UDMA CRC Errors, 149 errors recorded in total
WDC WUH721414ALN6L4 9RHUA20C Currently sdl - 3 Reallocated sectors, 65536 read errors, may be literal. Looks dodgy, investigate further.
TOSHIBA MG07ACA14TEY Y870A03ZFFHG Currently sdm - 1 error in the log
WDC WUH721414ALN6L4 9JH45NBT Currently sdn - Looks fine
WDC WUH721414ALN6L4 9RHPHNMC Currently sdo - 15 read errors, 29 pending sectors, 16 errors in the log

  • The drives with 1 error in the log are probably fine, at least unless that count starts to climb. I suspect a single event (maybe a power failure?) caused that.
  • sdj may have a cable or connector issue, watch if the CRC errors continue to climb.
  • sdl especially but also sdo are suspect, read errors are a big warning sign, as are a growing number of reallocated sectors.

I’ll also note that it would be a good idea to run a long SMART test, especially on the suspect drives. It will take around a day to complete. A SMART test leverages the drive’s built-in self-diagnostic features and can help catch some issues before they result in a drive failure.

If a long test is reported as having failed it’s usually grounds for an RMA if the drive is under warranty.

Ideally you set up smart tests to run regularly on a schedule. A long test a month could be a good starting point, although some prefer to run them more frequently. A short test takes a minute or two but is also very limited in what it actually tests.
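For reference, a minimal sketch of starting a long test from the shell and checking the result afterwards. It assumes smartmontools is installed; `/dev/sdX` is a placeholder, and `failed_selftests` is a hypothetical helper name of my own, not something smartctl provides.

```shell
# Start an extended (long) self-test; it runs inside the drive in the background:
#   sudo smartctl -t long /dev/sdX
# Later, print the self-test log and filter it with the helper below:
#   sudo smartctl -l selftest /dev/sdX | failed_selftests

# failed_selftests (hypothetical helper): reads a smartctl self-test log on
# stdin and prints every log entry whose status is anything other than
# "Completed without error" or a test that is still in progress.
failed_selftests() {
    grep -E '^# *[0-9]+ ' | grep -Ev 'Completed without error|in progress'
}
```

The helper prints nothing (and returns non-zero) when every logged test completed cleanly, so it can also be used in an `if` to trigger a notification.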


Thanks for summarizing all that.

So it seems like you think I might not need to swap out a drive yet based on the SMART data. Is there other data I should look at too, or is that the primary source?

Two other times I had notifications like this, I replaced the drives, and the companies replaced them under warranty, and I purchased cold spares.

How often do the letter assignments (sda, sdo, etc) change? Seems weird they would change often.

Won’t a test on a potentially problematic drive potentially cause a failure?

The SMART report gives you a running tally, so at this point I would keep an eye on it. If the errors continue to climb on the same drives, they are failing.
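One way to watch for climbing counts is to save the attribute table now and compare it with a later snapshot. This is only a sketch: the file names are placeholders, `smart_delta` is a hypothetical helper name, and it assumes both files hold `smartctl -A` output for the same drive.

```shell
# Save a snapshot now and another one later (file names are placeholders):
#   sudo smartctl -A /dev/sdX > before.txt
#   sudo smartctl -A /dev/sdX > after.txt
#   smart_delta before.txt after.txt

# smart_delta (hypothetical helper): prints the watched attributes whose raw
# value rose between two saved `smartctl -A` outputs.
# Watched IDs: 5 reallocated, 197 pending, 198 offline uncorrectable, 199 UDMA CRC.
smart_delta() {
    awk '
        $1 !~ /^(5|197|198|199)$/ { next }      # skip everything but watched IDs
        NR == FNR { old[$1] = $10; next }       # first file: remember raw values
        ($1 in old) && ($10 + 0 > old[$1] + 0) {
            print $2, old[$1], "->", $10        # second file: report increases
        }
    ' "$1" "$2"
}
```

Raw values are vendor-specific for some attributes, so treat the output as a prompt to look closer rather than a verdict.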

It can happen every time the server boots. The fact that the notifications use these volatile device names to identify potential problems is unfortunate.

It depends on how you look at it. A test may stress a drive such that it chokes. But if a smart test manages that, isn’t it better to know early so you can get it replaced?

Smart tests are routine and you are, as a sensible data hoarder, expected to use them as one of many tools to keep tabs on the health of your drives.


Smart tests are routine and you are, as a sensible data hoarder, expected to use them as one of many tools to keep tabs on the health of your drives.

I think you’re the first netizen to call me sensible in almost a decade. What are you doing later? :kissing_heart: :rofl:

Getting back to the topic at hand, I guess I could put this back in the equipment closet and kick off some SMART tests. SDL and SDO first. Maybe I’ll leave it on my workbench and run the test so it’s easier with the names.

Do that long test on all the drives just to see where you’re at right now.
If a drive fails the long test, RMA if still available.

After that it would be good to schedule recurring SMART tests. Do so using Data Protection > Periodic S.M.A.R.T. Tests.

For extra points, look into using something like the most excellent Multi-Report to help you keep tabs on the tests and on whether your SMART attributes are going up in a bad way. There have been a few issues with Multi-Report in SCALE 24.10.1 specifically, but its developer joeschmuck has been hard at work resolving those and a new release is set to go stable in the near future. You can follow the development of the newest version in this thread.


Well, SDO failed an extended report, but doesn’t say much as to why.

Remaining: 0.9
Lifetime: 11334
Error: 3418095502

SDL is still going.

They don’t interfere if you run 2 at the same time, do they?

They can run concurrently just fine. A reboot would interrupt it, but that’s about it.

Reallocated sectors should be enough. If these were my drives, sdl and sdo would be either out to RMA or in the bin.

Potentially any time you reboot. Always track by serial number. (Complaints should be addressed to the Linux kernel maintainers…)
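A quick sketch of looking up which device name a serial number currently has, using `lsblk` from util-linux. `dev_for_serial` is a hypothetical helper name; the serial in the usage line is just one from this thread.

```shell
# List whole disks with their serial numbers:
#   lsblk -dno NAME,SERIAL
# Resolve a serial to its current device node with the helper below:
#   lsblk -dno NAME,SERIAL | dev_for_serial Y890A08UFFHG

# dev_for_serial (hypothetical helper): reads "NAME SERIAL" pairs on stdin and
# prints /dev/NAME for the row whose serial matches the first argument.
dev_for_serial() {
    awk -v want="$1" '$2 == want { print "/dev/" $1 }'
}
```

The symlinks under `/dev/disk/by-id/` serve the same purpose, since their names embed the model and serial and stay stable across reboots.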


SDL succeeded its extended SMART test. I will replace SDO, then let it rebuild. Should I run a test on anything else? Or wait till it rebuilds and then replace SDL?

SDI is currently 9RH9DH5C
SDO is currently 9RHPHNMC
SDL is currently 9RHUA20C

Also, I have the following notifications:

Critical
Device: /dev/sdo [SAT], Self-Test Log error count increased from 0 to 1.
Jan 10, 2025 15:04:44 (America/New_York)

Critical
Device: /dev/sdo [SAT], 29 Currently unreadable (pending) sectors.
Jan 11, 2025 11:04:44 (America/New_York)

Critical
Device: /dev/sdi [SAT], 8 Currently unreadable (pending) sectors.
Jan 11, 2025 11:04:44 (America/New_York)

Critical
Device: /dev/sdi [SAT], 8 Offline uncorrectable sectors.
Jan 11, 2025 11:04:44 (America/New_York)

Schedule regular SMART tests on all drives.

That sdi is also showing bad sectors is worrying. Unfortunately, I misread that SMART report: there were actually 8 uncorrectable errors in the report you posted.

You are fast approaching a pool failure. The unavoidable resilvers may push you over the edge, depending on whether the data on your “working” drives is readable or not.

I replaced SDO with one of my cold spares, resilvered, reached out to the seller for an RMA, and sent off the drive.

I ran a long SMART test on 9RH9DH5C (what was above SDI, but is now SDJ) and it failed rather quickly, just like SDO above. Time to do it all over again.

Remaining: 0.9
Lifetime: 11429
Error: 3417497693

I was able to use my spare, and order an actual 2nd standby drive (I thought I had 2 cold standby drives, but could only find 1) while the vendor was replenishing their stocks. I should get my 2 replacements Soon™, they should ship this week sometime.

The other zPool had a failure after the resilver and the drive was taken offline immediately by TrueNAS, so I pulled it and sent it off for a replacement. I left the NAS powered off during the RMA, since at that point, clearly, the solar winds were affecting it, or something.

I installed the replacement, let it resilver, and ANOTHER (4th) drive had an issue (sdb, sn Y890A08UFFHG). I did a long SMART test on this drive and it passed. It’s only had this one notification, and from this post and others, I’m not sure if this is actually a critical issue or not given it passing the long SMART test. Should I replace the one with the error too?

At this point we’re up to 4 failures on this thread. I’ve never had such bad luck with hard drives before, but we’re also literally at an 11-year peak in solar activity, so sunspots are on the table and not just superstition. These are all refurb’d enterprise drives, and these last 2 failures are Toshibas, whereas the other 2 on this thread, and 1 or 2 failures I had a long time ago, were all WD. Do you think something else is up, or did I just get lucky in the wrong way?

 Critical
Device: /dev/sdb [SAT], ATA error count increased from 0 to 1.
Jan 30, 2025 15:19:16 (America/New_York)
	
sdb Extended offline SUCCESS
	
Remaining: N/A
Lifetime: 41188
Error: N/A
lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME     MODEL   ROTA PTTYPE TYPE     START           SIZE PARTTYPENAME             PARTUUID
sda      WDC WUH    1 gpt    disk           14000519643136
└─sda1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 25ec455c-a629-4300-b16d-09deaa3a88a0
sdb      TOSHIBA    1 gpt    disk           14000519643136
└─sdb1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS c9e772ad-f118-4e92-b5bd-c872033b71a1
sdc      TOSHIBA    1 gpt    disk           14000519643136
└─sdc1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 41644b00-0499-4742-9d4e-a8e76f823d61
sdd      WDC WUH    1 gpt    disk           14000519643136
└─sdd1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 1851ae7d-b04d-4ce7-a718-5a8aefbcbb2b
sde      TOSHIBA    1 gpt    disk           14000519643136
└─sde1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 0278dd80-3622-46b5-8a8f-4fa4b6b6142b
sdf      TOSHIBA    1 gpt    disk           14000519643136
└─sdf1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 8544a5e0-4009-463e-90a2-7d937b658e61
sdg      WDC WUH    1 gpt    disk           14000519643136
└─sdg1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 5d64fc93-6e46-48d4-b69e-d94d9dceb84c
sdh      TOSHIBA    1 gpt    disk           14000519643136
└─sdh1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 6406fd6a-053c-4195-babe-f088281483f8
sdi      WDC WUH    1 gpt    disk           14000519643136
└─sdi1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 8c7742ef-a8a0-4dc3-b974-316e43d8a15a
sdj      WDC WUH    1 gpt    disk           14000519643136
└─sdj1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 3eaf1a0a-317f-402c-b8f3-60a3ac83a3cc
sdk      SanDisk    0 gpt    disk             128035676160
├─sdk1              0 gpt    part      4096        1048576 BIOS boot                f1eb4c83-ddc8-45bc-b59a-5909fba158b5
├─sdk2              0 gpt    part      6144      536870912 EFI System               84d5455d-9b3d-4d74-a64a-df269a20e05c
├─sdk3              0 gpt    part  34609152   110315773440 Solaris /usr & Apple ZFS 6ecbfa39-b644-4021-bcaa-5c7e684f174d
└─sdk4              0 gpt    part   1054720    17179869184 Linux swap               646e196c-7554-4fbe-8f45-45f26bdf5938
  └─sdk4            0        crypt             17179869184
sdl      WDC WUH    1 gpt    disk           14000519643136
└─sdl1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 76d05838-4e24-4f1d-a2b5-69687952b39e
sdm      WDC WUH    1 gpt    disk           14000519643136
└─sdm1              1 gpt    part      4096 14000516501504 Solaris /usr & Apple ZFS b6014162-d311-47ac-b63c-6e6a72c4f07f
sdn      WDC WUH    1 gpt    disk           14000519643136
└─sdn1              1 gpt    part      4096 14000516501504 Solaris /usr & Apple ZFS 1a4e3a70-25d0-4c61-91bd-9f76014d1ae3
sdo      TOSHIBA    1 gpt    disk           14000519643136
└─sdo1              1 gpt    part      4096 14000516497920 Solaris /usr & Apple ZFS 5fc7e2f4-0dde-4895-8f1a-432ca652f35e
admin@truenas[~]$
admin@truenas[~]$ lspci
00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne Root Complex
00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne IOMMU
00:01.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe GPP Bridge
00:01.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne PCIe GPP Bridge
00:02.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Renoir PCIe Dummy Host Bridge
00:08.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:08.2 PCI bridge: Advanced Micro Devices, Inc. [AMD] Renoir Internal PCIe GPP Bridge to Bus
00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD] FCH SMBus Controller (rev 51)
00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD] FCH LPC Bridge (rev 51)
00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 0
00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 1
00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 2
00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 3
00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 4
00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 5
00:18.6 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 6
00:18.7 Host bridge: Advanced Micro Devices, Inc. [AMD] Cezanne Data Fabric; Function 7
01:00.0 Serial Attached SCSI controller: Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
02:00.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse Switch Upstream
03:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:03.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:08.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:09.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
03:0a.0 PCI bridge: Advanced Micro Devices, Inc. [AMD] Matisse PCIe GPP Bridge
04:00.0 USB controller: ASMedia Technology Inc. Device 3241
05:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8125 2.5GbE Controller (rev 05)
06:00.0 Non-Essential Instrumentation [1300]: Advanced Micro Devices, Inc. [AMD] Starship/Matisse Reserved SPP
06:00.1 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
06:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 Host Controller
07:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
08:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 51)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev c9)
09:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Renoir Radeon High Definition Audio Controller
09:00.2 Encryption controller: Advanced Micro Devices, Inc. [AMD] Family 17h (Models 10h-1fh) Platform Security Processor
09:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
09:00.4 USB controller: Advanced Micro Devices, Inc. [AMD] Renoir/Cezanne USB 3.1
09:00.6 Audio device: Advanced Micro Devices, Inc. [AMD] Family 17h/19h HD Audio Controller
0a:00.0 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
0a:00.1 SATA controller: Advanced Micro Devices, Inc. [AMD] FCH SATA Controller [AHCI mode] (rev 81)
admin@truenas[~]$
admin@truenas[~]$ sudo sas2flash -list
[sudo] password for admin:
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

        Controller Number              : 0
        Controller                     : SAS2008(B2)
        PCI Address                    : 00:01:00:00
        SAS Address                    : 5003005-7-01a8-08b0
        NVDATA Version (Default)       : 14.01.00.08
        NVDATA Version (Persistent)    : 14.01.00.08
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 20.00.07.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9211-8i
        BIOS Version                   : 07.39.02.00
        UEFI BSD Version               : 07.27.01.01
        FCODE Version                  : N/A
        Board Name                     : SAS9211-8i
        Board Assembly                 : ARTofSERVER
        Board Tracer Number            : N/A

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
admin@truenas[~]$
sudo sas3flash -list
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02)
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        ERROR: Command Not allowed without an adapter!
        ERROR: Couldn't Create Command -list
        Exiting Program.
admin@truenas[~]$
sudo zpool status -v
  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Fri Jan 31 03:45:09 2025
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          sdk3      ONLINE       0     0     0

errors: No known data errors

  pool: zPool
 state: ONLINE
  scan: scrub repaired 0B in 04:24:43 with 0 errors on Sun Feb  2 04:24:45 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        zPool                                     ONLINE       0     0     0
          raidz2-0                                ONLINE       0     0     0
            1851ae7d-b04d-4ce7-a718-5a8aefbcbb2b  ONLINE       0     0     0
            0278dd80-3622-46b5-8a8f-4fa4b6b6142b  ONLINE       0     0     0
            25ec455c-a629-4300-b16d-09deaa3a88a0  ONLINE       0     0     0
            c9e772ad-f118-4e92-b5bd-c872033b71a1  ONLINE       0     0     0
            8544a5e0-4009-463e-90a2-7d937b658e61  ONLINE       0     0     0
            8c7742ef-a8a0-4dc3-b974-316e43d8a15a  ONLINE       0     0     0
            41644b00-0499-4742-9d4e-a8e76f823d61  ONLINE       0     0     0
          raidz2-1                                ONLINE       0     0     0
            5d64fc93-6e46-48d4-b69e-d94d9dceb84c  ONLINE       0     0     0
            6406fd6a-053c-4195-babe-f088281483f8  ONLINE       0     0     0
            5fc7e2f4-0dde-4895-8f1a-432ca652f35e  ONLINE       0     0     0
            b6014162-d311-47ac-b63c-6e6a72c4f07f  ONLINE       0     0     0
            1a4e3a70-25d0-4c61-91bd-9f76014d1ae3  ONLINE       0     0     0
            76d05838-4e24-4f1d-a2b5-69687952b39e  ONLINE       0     0     0
            3eaf1a0a-317f-402c-b8f3-60a3ac83a3cc  ONLINE       0     0     0

errors: No known data errors
admin@truenas[~]$
sudo smartctl -a /dev/sdb
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-6.1.63-production+truenas] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Toshiba MG07ACA... Enterprise Capacity HDD
Device Model:     TOSHIBA MG07ACA14TEY
Serial Number:    Y890A08UFFHG
LU WWN Device Id: 5 000039 908c8db37
Firmware Version: 4902
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Tue Feb  4 22:29:35 2025 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (1378) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7975
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   001   001   000    Old_age   Always       -       41268
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       39
 23 Helium_Condition_Lower  0x0023   100   100   075    Pre-fail  Always       -       0
 24 Helium_Condition_Upper  0x0023   100   100   075    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       204
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 15/41)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       1
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       69206024
222 Loaded_Hours            0x0032   006   006   000    Old_age   Always       -       37642
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       593
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 41140 hours (1714 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  84 41 10 cf 55 9e 40  Error: ICRC, ABRT at LBA = 0x009e55cf = 10376655

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 10 00 d0 55 9e 40 00      01:58:53.676  READ FPDMA QUEUED
  60 68 10 68 55 9e 40 00      01:58:53.676  READ FPDMA QUEUED
  60 30 08 30 55 9e 40 00      01:58:53.675  READ FPDMA QUEUED
  60 a8 00 88 54 9e 40 00      01:58:53.675  READ FPDMA QUEUED
  60 b8 10 d0 52 9e 40 00      01:58:53.675  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     41188         -
# 2  Short offline       Completed without error       00%     27264         -
# 3  Short offline       Completed without error       00%     18419         -
# 4  Short offline       Completed without error       00%     18353         -
# 5  Short offline       Completed without error       00%     18353         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

admin@truenas[~]$

Normally if I saw so many failures I’d think it was a faulty SAS cable or an HBA issue (fault, overheating, wrong firmware, etc.). But you have the correct firmware on the HBA, and most of these drives have pending/reallocated sector issues (which is a fault on the drive itself).

However, going back a step to mundane faults: sdb (serial number Y890A08UFFHG) has a logged CRC error (UDMA_CRC_Error_Count is 1, and the error log entry is ICRC, ABRT), which is usually caused by a loose or faulty cable or a controller fault.

For that specific error, I’d reseat connections, swap cables, make sure the HBA has airflow (not overheating, I slap a fan directly onto mine), try directly connecting to motherboard if possible, etc.

So far every time I’ve had a CRC error, it was because I knocked a cable slightly loose when cleaning/swapping drives.

The controller has plenty of space around it, and there are plenty of fans. 1 started making noise, and I have replacements on hand but have not replaced them yet.

Every time I fiddle with the drives, I make sure the SATA cables have a good connection.

I’m not that familiar with some of these diagnostics, where do I see the pending sector issues?

When I set this up the first time, I tried to interleave the brands of hard drives across both the motherboard and add in controller. Did the old diagnostic data show which controller the drive was connected to? If they were all on the same controller, that might be useful info. Then again, if the drives have on-drive issues, that’s not a controller issue.

For example sn ending in DH5C has bad sectors:

197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       8

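Attribute 197 shows 8 pending sectors: sectors the drive could not read and has flagged for possible reallocation, so any data in them needs to end up on good sectors. Attribute 198 shows, if I’m not mistaken, 8 sectors found uncorrectable during offline scanning (these could be the same 8).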
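To pull just those rows out for any drive, something like the following works as a rough sketch. `/dev/sdX` is a placeholder and `sector_attrs` is a hypothetical helper name; it assumes the usual ten-column attribute table that `smartctl -A` prints for ATA drives.

```shell
# Print the attribute table for one drive and keep only the sector-health rows:
#   sudo smartctl -A /dev/sdX | sector_attrs

# sector_attrs (hypothetical helper): reads `smartctl -A` output on stdin and
# prints "ID NAME RAW_VALUE" for the reallocated (5, 196), pending (197) and
# offline-uncorrectable (198) counters.
sector_attrs() {
    awk '$1 ~ /^(5|196|197|198)$/ { print $1, $2, $10 }'
}
```

On a healthy drive all of these raw values are normally 0, as in the sdb report above.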