WD Red Pro causing a weird error

Hello,
I had an old WD disk showing an increasing number of reallocated sectors, so I decided to replace it with a new one.
The previous model was WD Red Pro WDC_WD4003FFBX
The new one is WD Red Pro WDC_WD4005FFBX
During resilvering, the new drive went FAULTED. An IDNF error at an LBA caused write errors in the ZFS log.
I read that this could be caused by SMR drives being incompatible with ZFS, but as far as I can tell my drive is CMR.
I tried replacing the faulty new drive with another drive, but it gave me the same error…
What should I do?

Details on the motherboard, any other accessories, how the drive is connected, etc. would help.

The smartctl output for the drive(s) in question would also be invaluable.

Interesting. That is the exact Western Digital Red SMR error that plagues these drives when using ZFS.

Now I do believe you when you say you bought a WD Red Pro. However, we have started to see fraud being perpetrated on hard disk buyers.

So I second the request for the SMART output. I'd also ask you to physically examine the drive to see whether it may have been used before, or tampered with in any way (before you got it…).
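SMART keeps lifetime counters that give a first hint on the "used before" question without opening anything. The sketch below is a rough heuristic, assuming you have already copied the raw values of Power_On_Hours, Power_Cycle_Count, Start_Stop_Count and Load_Cycle_Count out of smartctl -A. The limits in NEW_DRIVE_LIMITS are hypothetical numbers picked for illustration, not WD figures, and a reseller may have reset the counters, so a clean result is a hint, not proof.

```python
from typing import Dict, List

# Hypothetical upper limits for a factory-fresh drive after a short burn-in
# and a resilver attempt. Illustrative assumptions only, not vendor specs.
NEW_DRIVE_LIMITS: Dict[str, int] = {
    "Power_On_Hours": 48,
    "Power_Cycle_Count": 10,
    "Start_Stop_Count": 10,
    "Load_Cycle_Count": 50,
}


def usage_red_flags(raw_values: Dict[str, int]) -> List[str]:
    """Return one message per counter that is missing or above its assumed limit."""
    flags = []
    for name, limit in NEW_DRIVE_LIMITS.items():
        value = raw_values.get(name)
        if value is None:
            flags.append(f"{name}: missing from SMART output")
        elif value > limit:
            flags.append(f"{name}: {value} exceeds assumed limit {limit}")
    return flags


if __name__ == "__main__":
    # Raw values as reported for the WD4005FFBX in this thread.
    observed = {
        "Power_On_Hours": 12,
        "Power_Cycle_Count": 1,
        "Start_Stop_Count": 1,
        "Load_Cycle_Count": 2,
    }
    print(usage_red_flags(observed) or "no usage red flags")
```

With the values from this thread, nothing is flagged, which is consistent with a new drive.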

Hello,
Thanks for your quick feedback.
Here are the details:
OS Version: 25.04.2.4
Product: ProLiant MicroServer Gen8
Model: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Memory: 16 GiB

All drives are connected through the integrated controller, a SAS controller (I don’t know the exact model, but it’s not causing problems for the other drives).

SMART logs:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red Pro
Device Model:     WDC WD4005FFBX-68CAUN0
Serial Number:    WD-AS00XXXX
LU WWN Device Id: 5 0014ee 21642a937
Firmware Version: 83.00A83
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database 7.3/5852
ATA Version is:   ACS-4 published, ANSI INCITS 529-2018
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Sep 24 08:58:26 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Disabled
DSN feature is:   Disabled
ATA Security is:  Disabled, frozen [SEC2]
Wt Cache Reorder: Enabled

Warning! SMART Attribute Thresholds Structure error: invalid SMART checksum.
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (31560) seconds.
Offline data collection
capabilities:                    (0x71) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 318) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   100   100   051    -    0
  2 Throughput_Performance  --S--K   100   100   000    -    0
  3 Spin_Up_Time            POS--K   100   100   021    -    0
  4 Start_Stop_Count        -O--CK   100   100   000    -    1
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  8 Seek_Time_Performance   --S--K   100   100   000    -    0
  9 Power_On_Hours          -O--CK   100   100   000    -    12
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    1
192 Power-Off_Retract_Count -O--CK   200   200   000    -    0
193 Load_Cycle_Count        -O--CK   200   200   000    -    2
194 Temperature_Celsius     -O---K   112   107   000    -    38
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   200   100   000    -    0
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      6  Ext. Comprehensive SMART error log
0x04       GPL     R/O    256  Device Statistics log
0x04       SL      R/O    255  Device Statistics log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x0a       GPL     R/W    256  Device Statistics Notification
0x0c       GPL     R/O   2048  Pending Defects log
0x0f       GPL     R/O      2  Sense Data for Successful NCQ Cmds log
0x10       GPL     R/O      1  NCQ Command Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x12       GPL     R/O      1  SATA NCQ Non-Data log
0x13       GPL     R/O      1  SATA NCQ Send and Receive log
0x15       GPL     R/W      1  Rebuild Assist log
0x24       GPL     R/O    312  Current Device Internal Status Data log
0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
0x53       GPL     R/O      1  Sense Data log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa0-0xa1  GPL,SL  VS      16  Device vendor specific log
0xa3-0xa7  GPL,SL  VS      16  Device vendor specific log
0xa8-0xb1  GPL,SL  VS       1  Device vendor specific log
0xb2       GPL     VS   65535  Device vendor specific log
0xb3-0xb6  GPL,SL  VS       1  Device vendor specific log
0xb9           SL  VS       1  Device vendor specific log
0xba       GPL,SL  VS      80  Device vendor specific log
0xbd       GPL,SL  VS       1  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xc1       GPL     VS      93  Device vendor specific log
0xd2           SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 1
        CR     = Command Register
        FEATR  = Features Register
        COUNT  = Count (was: Sector Count) Register
        LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
        LH     = LBA High (was: Cylinder High) Register    ]   LBA
        LM     = LBA Mid (was: Cylinder Low) Register      ] Register
        LL     = LBA Low (was: Sector Number) Register     ]
        DV     = Device (was: Device/Head) Register
        DC     = Device Control Register
        ER     = Error register
        ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 [0] occurred at disk power-on lifetime: 2 hours (0 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  10 -- 53 02 00 00 00 09 a6 9f e8 e0 00  Error: IDNF 512 sectors at LBA = 0x09a69fe8 = 161914856

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  35 00 00 02 00 00 00 09 a6 9f e8 e0 08     02:14:13.496  WRITE DMA EXT
  ca 00 00 00 58 00 00 09 a6 a7 78 e9 08     02:14:13.495  WRITE DMA
  ca 00 00 00 e0 00 00 09 a6 9e 00 e9 08     02:14:13.493  WRITE DMA
  35 00 00 02 00 00 00 09 a6 9c 00 e0 08     02:14:13.490  WRITE DMA EXT
  35 00 00 02 00 00 00 09 a6 9a 00 e0 08     02:14:13.335  WRITE DMA EXT

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%         8         -
# 2  Short offline       Completed without error       00%         2         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
Device State:                        Active (0)
Current Temperature:                    38 Celsius
Power Cycle Min/Max Temperature:     24/43 Celsius
Lifetime    Min/Max Temperature:     24/43 Celsius
Under/Over Temperature Limit Count:   0/0
Minimum supported ERC Time Limit:    65 (6.5 seconds)
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/65 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (266)

Index    Estimated Time   Temperature Celsius
 267    2025-09-24 01:01    40  *********************
 ...    ..(232 skipped).    ..  *********************
  22    2025-09-24 04:54    40  *********************
  23    2025-09-24 04:55    41  **********************
  24    2025-09-24 04:56    42  ***********************
  25    2025-09-24 04:57    42  ***********************
  26    2025-09-24 04:58    43  ************************
  27    2025-09-24 04:59    43  ************************
  28    2025-09-24 05:00    42  ***********************
  29    2025-09-24 05:01    42  ***********************
  30    2025-09-24 05:02    41  **********************
 ...    ..(  3 skipped).    ..  **********************
  34    2025-09-24 05:06    41  **********************
  35    2025-09-24 05:07    40  *********************
 ...    ..(  4 skipped).    ..  *********************
  40    2025-09-24 05:12    40  *********************
  41    2025-09-24 05:13    39  ********************
 ...    ..(  9 skipped).    ..  ********************
  51    2025-09-24 05:23    39  ********************
  52    2025-09-24 05:24    38  *******************
 ...    ..(213 skipped).    ..  *******************
 266    2025-09-24 08:58    38  *******************

SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

Device Statistics (GP Log 0x04)
Page  Offset Size        Value Flags Description
0x01  =====  =               =  ===  == General Statistics (rev 3) ==
0x01  0x008  4               1  -D-  Lifetime Power-On Resets
0x01  0x010  4              12  -D-  Power-on Hours
0x01  0x018  6       497180576  -D-  Logical Sectors Written
0x01  0x020  6         1090615  -D-  Number of Write Commands
0x01  0x028  6           83508  -D-  Logical Sectors Read
0x01  0x030  6            1009  -D-  Number of Read Commands
0x01  0x038  6        43200000  -D-  Date and Time TimeStamp
0x02  =====  =               =  ===  == Free-Fall Statistics (rev 1) ==
0x02  0x010  4               0  -D-  Overlimit Shock Events
0x03  =====  =               =  ===  == Rotating Media Statistics (rev 1) ==
0x03  0x008  4              12  -D-  Spindle Motor Power-on Hours
0x03  0x010  4              12  -D-  Head Flying Hours
0x03  0x018  4               2  -D-  Head Load Events
0x03  0x020  4               0  -D-  Number of Reallocated Logical Sectors
0x03  0x028  4               0  -D-  Read Recovery Attempts
0x03  0x030  4               0  -D-  Number of Mechanical Start Failures
0x03  0x038  4               0  -D-  Number of Realloc. Candidate Logical Sectors
0x03  0x040  4               0  -D-  Number of High Priority Unload Events
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               1  -D-  Number of Reported Uncorrectable Errors
0x04  0x010  4               0  -D-  Resets Between Cmd Acceptance and Completion
0x05  =====  =               =  ===  == Temperature Statistics (rev 1) ==
0x05  0x008  1              38  ---  Current Temperature
0x05  0x010  1               -  -D-  Average Short Term Temperature
0x05  0x018  1               -  -D-  Average Long Term Temperature
0x05  0x020  1              41  -D-  Highest Temperature
0x05  0x028  1              30  -D-  Lowest Temperature
0x05  0x030  1               -  -D-  Highest Average Short Term Temperature
0x05  0x038  1               -  -D-  Lowest Average Short Term Temperature
0x05  0x040  1               -  -D-  Highest Average Long Term Temperature
0x05  0x048  1               -  -D-  Lowest Average Long Term Temperature
0x05  0x050  4               0  -D-  Time in Over-Temperature
0x05  0x058  1              65  ---  Specified Maximum Operating Temperature
0x05  0x060  4               0  -D-  Time in Under-Temperature
0x05  0x068  1               0  ---  Specified Minimum Operating Temperature
0x06  =====  =               =  ===  == Transport Statistics (rev 1) ==
0x06  0x008  4               3  -D-  Number of Hardware Resets
0x06  0x010  4               1  -D-  Number of ASR Events
0x06  0x018  4               0  -D-  Number of Interface CRC Errors
0xff  =====  =               =  ===  == Vendor Specific Statistics (rev 1) ==
0xff  0x008  7               0  -D-  Vendor Specific
0xff  0x010  7               0  -D-  Vendor Specific
0xff  0x018  7               0  -D-  Vendor Specific
0xff  0x040  7               0  -D-  Vendor Specific
0xff  0x048  7               0  -D-  Vendor Specific
0xff  0x050  7               1  -D-  Vendor Specific
0xff  0x058  7               0  -D-  Vendor Specific
0xff  0x060  7               0  -D-  Vendor Specific
0xff  0x068  7               0  -D-  Vendor Specific
0xff  0x070  7               0  -D-  Vendor Specific
0xff  0x078  7               1  -D-  Vendor Specific
0xff  0x080  7               0  -D-  Vendor Specific
0xff  0x088  7               0  -D-  Vendor Specific
0xff  0x090  7            4953  -D-  Vendor Specific
0xff  0x098  7           12129  -D-  Vendor Specific
0xff  0x0a0  7             100  -D-  Vendor Specific
                                |||_ C monitored condition met
                                ||__ D supports DSN
                                |___ N normalized value

Pending Defects log (GP Log 0x0c)
No Defects Logged

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x0008  2            0  Device-to-host non-data FIS retries
0x0009  2            2  Transition from drive PhyRdy to drive PhyNRdy
0x000a  2            3  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x000d  2            0  Non-CRC errors within host-to-device FIS
0x000f  2            0  R_ERR response for host-to-device data FIS, CRC
0x0012  2            0  R_ERR response for host-to-device non-data FIS, CRC
0x8000  4        44692  Vendor specific

Does it help?

I should add that this drive is brand new; it came in a sealed bag…

My RAIDZ config:
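To read the extended error log entry above: in smartctl's 48-bit register dump, COUNT is two bytes (high byte first) and the LBA is the three LBA_48 bytes (the upper half of the address) followed by LH, LM and LL. In the error register ER, bit 4 (0x10) is IDNF ("ID not found"). The sketch below rebuilds the sector count and address from the register line and checks that the write fell inside the disk; the function and field names are my own, not part of smartctl.

```python
from typing import List, NamedTuple

# Bits of the ATA error register (ER); 0x10 is IDNF ("ID not found").
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


class ErrorEntry(NamedTuple):
    error_bits: List[str]
    status: int
    sectors: int
    lba: int


def decode_ext_error(registers: str) -> ErrorEntry:
    """Decode an 'ER -- ST COUNT LBA_48 LH LM LL DV DC' line from smartctl."""
    tokens = registers.split()
    if len(tokens) != 13:
        raise ValueError(f"expected 13 register bytes, got {len(tokens)}")
    # smartctl prints '--' in a column it leaves empty; treat it as zero.
    b = [0 if tok == "--" else int(tok, 16) for tok in tokens]
    count = (b[3] << 8) | b[4]  # COUNT is printed high byte first
    lba = 0
    for byte in b[5:11]:  # three LBA_48 bytes, then LH, LM, LL
        lba = (lba << 8) | byte
    bits = [name for bit, name in ERROR_BITS.items() if b[0] & bit]
    return ErrorEntry(error_bits=bits, status=b[2], sectors=count, lba=lba)


if __name__ == "__main__":
    entry = decode_ext_error("10 -- 53 02 00 00 00 09 a6 9f e8 e0 00")
    print(entry)
    sector_size = 512  # logical sector size reported by smartctl
    total_sectors = 4_000_787_030_016 // sector_size  # User Capacity / 512
    print(f"offset ~{entry.lba * sector_size / 1e9:.1f} GB, "
          f"inside the disk: {entry.lba + entry.sectors <= total_sectors}")
```

The failing command was a 512-sector (256 KiB) WRITE DMA EXT at LBA 161914856, roughly 82.9 GB into the disk and far below its last sector, so the drive rejected an address that should have been valid.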

I don’t see any critical failures in the SMART log. Consider reseating the connections, or connecting the drive directly to the motherboard if possible, then running a scrub to see if that clears things up.
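The counters that usually decide "is this disk failing?" are the reallocation, pending, offline-uncorrectable and interface CRC attributes. A minimal sketch that scans an attribute table for non-zero raw values in those rows, assuming the brief layout shown in this thread (ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE, which is what smartctl -x prints). The WATCHED set is my own selection, not a smartmontools definition.

```python
from typing import Dict

# Attributes whose raw value should stay at 0 on a healthy disk (my selection).
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}


def nonzero_watched(table: str) -> Dict[str, int]:
    """Return watched attributes whose raw value is above zero.

    Assumes smartctl's brief layout, where RAW_VALUE is the 8th column.
    """
    hits = {}
    for line in table.splitlines():
        cols = line.split()
        # Attribute rows start with a numeric ID; skip headers and legends.
        if len(cols) < 8 or not cols[0].isdigit():
            continue
        name, raw = cols[1], cols[7]
        if name in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


# The watched rows from the WD4005FFBX output in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   100   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
"""

if __name__ == "__main__":
    print(nonzero_watched(SAMPLE) or "no watched counters above zero")
```

All watched counters are 0 for this drive, which matches the reading above; a non-zero UDMA_CRC_Error_Count would point more at cabling or the backplane than at the platters.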

sas2flash or sas3flash should give us more info on your HBA. Do you have a fan pointing at your HBA? They run hot and don’t enjoy that.

Here are the results:

root@truenas:/home/admin# sas2flash
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

        No LSI SAS adapters found! Limited Command Set Available!
        Finished Processing Commands Successfully.
        Exiting SAS2Flash.
root@truenas:/home/admin# sas3flash
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

        No Avago SAS adapters found! Limited Command Set Available!
        Finished Processing Commands Successfully.
        Exiting SAS3Flash.

Huh… uhh, do try to dig up info on that HBA; a lot of folks in the past have eventually had serious issues with flaky controllers. The worst part is, they seem to work until they don’t.

Or, at worst, see if you can swap this drive’s motherboard connection with that of a known-working drive. If the errors follow the drive, we have our answer, and it is an easy RMA since the drive is new. New drives can be DOA; it has happened to me twice in a row before :frowning:

About the HBA: lspci | grep RAID should show us what type of controller you have. Then we can get more info with another command, depending on the model.
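grep RAID only matches controllers whose PCI class name contains "RAID". Depending on how the firmware configures it, the same storage controller can appear in lspci as "RAID bus controller", "SATA controller", "IDE interface" or "Serial Attached SCSI controller", so a broader match is safer. A small sketch that filters lspci text by those class names; STORAGE_CLASSES is my own, non-exhaustive list.

```python
from typing import List

# PCI class names lspci uses for common storage controllers (not exhaustive).
STORAGE_CLASSES = (
    "RAID bus controller",
    "SATA controller",
    "IDE interface",
    "Serial Attached SCSI controller",
    "SCSI storage controller",
    "Non-Volatile memory controller",
)


def storage_controllers(lspci_output: str) -> List[str]:
    """Return the lspci lines whose device class is a storage controller."""
    found = []
    for line in lspci_output.splitlines():
        # lspci lines look like '<slot> <class>: <vendor and device>'.
        parts = line.split(" ", 1)
        if len(parts) == 2 and parts[1].startswith(STORAGE_CLASSES):
            found.append(line)
    return found


# A few lines from the lspci output posted in this thread.
SAMPLE = """\
00:1f.0 ISA bridge: Intel Corporation C204 Chipset LPC Controller (rev 05)
00:1f.2 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 0-3) (rev 05)
00:1f.5 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 4-5) (rev 05)
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
"""

if __name__ == "__main__":
    print("\n".join(storage_controllers(SAMPLE)))
```

On a live system, pass the text printed by lspci instead of SAMPLE. Here it picks out the two chipset "IDE interface" functions that grep RAID misses.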


It gives me no output…

Details (full lspci output):

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/Ivy Bridge DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:06.0 PCI bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor PCI Express Root Port (rev 09)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b5)
00:1c.6 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 7 (rev b5)
00:1c.7 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 8 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C204 Chipset LPC Controller (rev 05)
00:1f.2 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 0-3) (rev 05)
00:1f.5 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 4-5) (rev 05)
01:00.0 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Slave Instrumentation & System Support (rev 05)
01:00.1 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200EH
01:00.2 System peripheral: Hewlett-Packard Company Integrated Lights-Out Standard Management Processor Support and Messaging (rev 05)
01:00.4 USB controller: Hewlett-Packard Company Integrated Lights-Out Standard Virtual USB Controller (rev 02)
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)

It seems you do not have a SAS controller, just your mainboard’s onboard SATA controller:

00:1f.2 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 0-3) (rev 05)
00:1f.5 IDE interface: Intel Corporation 6 Series/C200 Series Chipset Family Desktop SATA Controller (IDE mode, ports 4-5) (rev 05)

It should be in AHCI mode rather than IDE mode, though I have no idea whether this can cause such problems.
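Besides the class name, Linux shows the mode through the kernel driver bound to the controller: in IDE (legacy) mode an Intel chipset SATA controller is normally driven by ata_piix, in AHCI mode by ahci (lspci -k prints this as "Kernel driver in use"). A minimal sketch, assuming the standard sysfs layout where /sys/bus/pci/devices/<address>/driver is a symlink to the driver's directory; the addresses are the two chipset SATA functions from the lspci output above, with the 0000: PCI domain prefix that lspci omits.

```python
import os
from typing import Optional

SYSFS_PCI = "/sys/bus/pci/devices"


def bound_driver(pci_address: str, root: str = SYSFS_PCI) -> Optional[str]:
    """Return the name of the kernel driver bound to a PCI device, or None."""
    link = os.path.join(root, pci_address, "driver")
    if not os.path.islink(link):
        return None  # device absent or no driver bound
    # The symlink points at .../drivers/<name>; its last component is the name.
    return os.path.basename(os.readlink(link))


def controller_mode(driver: Optional[str]) -> str:
    """Map a driver name to the SATA controller mode it usually implies."""
    return {"ahci": "AHCI", "ata_piix": "IDE (legacy)"}.get(driver or "", "unknown")


if __name__ == "__main__":
    for addr in ("0000:00:1f.2", "0000:00:1f.5"):
        drv = bound_driver(addr)
        print(addr, drv, controller_mode(drv))
```

After switching the BIOS to AHCI, the same check should report ahci for the controller.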

Some information that I found:
HPE Dynamic Smart Array B120i Controller

I cannot see this controller in your lspci output.
It should be connected via PCIe 2.0 x4.
Is the output complete?

Or could this just be a fancy HP term for “we’re using the Intel chipset SATA controller here”? The B120i is not a SAS controller…

Yes, you’re right, the drives are connected directly to the motherboard via the integrated controller…

And it’s probably not a SAS controller indeed.

Have you checked the controller mode in the BIOS? It should be set to AHCI, not IDE.

Can I switch it?

It shouldn’t do any harm, and if it does, against all odds, you can just switch back to IDE mode.

Yeah, that is a Gen8 HP MicroServer; it does not have an HBA, and the RAID functionality comes via specific drivers.

That said, if you put the controller in AHCI mode, it always boots from the first 3.5" drive bay, so if you want it to boot from the ODD port, you’ll need to jump through a few hoops to get it working.

The other solution is to put the controller in legacy mode (or is it called RAID mode?), which lets you set up each disk as its own array; then you can boot from any disk you want. I think HP calls this the Intelligent Provisioning feature. I have never used it, though.