Assessing pool and disk status - help needed

wilcomir · December 6, 2024, 6:33am

Hello everyone!

I am running a virtualized TrueNAS SCALE instance with an HBA in PCI passthrough. The root disks are virtualized from the host, while the data disks are connected to the HBA.

I have 4 disks in total, organized in 2 mirror VDEVs. I run a pool scrub daily and a SMART short test daily on all disks; the long test runs weekly, Monday to Thursday, one disk per day. I do not run any checks on the boot pool, as it is virtualized and backed up anyway.

Yesterday I got an alert about /dev/sde failing the smartctl long test. I examined the logs and found one reallocated sector, plus four ATA errors.

I manually scrubbed the pool and cleared the errors, as the scrub successfully fixed everything. During the process I got another ATA error, for a total of five, and another reallocated sector.

Finally, I ran another long test on sde, but it failed again at around 60% remaining; judging by the ATA errors, the LBA appears to be the same every time.
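A caveat on comparing those LBAs: the address printed in the ATA error entries, 0x0fffffff, is the largest value a 28-bit field can hold, so it may well be a filler value rather than a real sector address. The Python sketch below only checks the arithmetic, using the capacity and logical sector size the drive reports in the smartctl output posted in this thread. Reading the value as a placeholder is an assumption, not something the log states.

```python
# Values copied from the smartctl output posted in this thread.
USER_CAPACITY_BYTES = 16_000_900_661_248
LOGICAL_SECTOR_BYTES = 512
REPORTED_ERROR_LBA = 0x0FFFFFFF  # "Error: WP at LBA = 0x0fffffff = 268435455"

total_lbas = USER_CAPACITY_BYTES // LOGICAL_SECTOR_BYTES
max_28bit_lba = 2**28 - 1

print(f"LBAs on the drive:        {total_lbas:,}")
print(f"Bits needed to address:   {(total_lbas - 1).bit_length()}")
print(f"Largest 28-bit LBA:       {max_28bit_lba:,}")
print(f"Reported LBA == 28-bit max: {REPORTED_ERROR_LBA == max_28bit_lba}")
```

If that reading is right, the identical LBA across all five errors says little about whether the same physical sector is involved each time; the extended log that `smartctl -x` prints may carry the full addresses.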

I am a bit at a loss about how to investigate or debug this further; my impression is that I have to monitor this disk closely and see what happens. This could still be a failing cable, but for some reason it does not look like a cable problem to me.
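One rough way to weigh "cable versus disk" is to compare the media-related SMART counters against the link-level CRC counter. The Python sketch below does that with the raw values from the attribute table posted in this thread. The grouping of attribute IDs into "media" and "link" is a common rule of thumb assumed here, not a vendor specification, and the verdict strings are purely illustrative.

```python
# Raw values copied from the SMART attribute table posted in this thread.
raw_values = {
    5: ("Reallocated_Sector_Ct", 2),
    187: ("Reported_Uncorrect", 0),
    197: ("Current_Pending_Sector", 1),
    198: ("Offline_Uncorrectable", 1),
    199: ("UDMA_CRC_Error_Count", 0),
}

# Assumed rule of thumb: these IDs point at the platters/heads, while 199
# counts CRC errors on the SATA link (cable, connector, backplane).
MEDIA_IDS = (5, 187, 197, 198)
LINK_IDS = (199,)


def nonzero(ids):
    """Return the names of the listed attributes whose raw value is above zero."""
    return [raw_values[i][0] for i in ids if raw_values[i][1] > 0]


media_hits = nonzero(MEDIA_IDS)
link_hits = nonzero(LINK_IDS)

if media_hits and not link_hits:
    verdict = "media problem more likely than cabling"
elif link_hits and not media_hits:
    verdict = "cabling or connector worth checking first"
elif media_hits and link_hits:
    verdict = "both media and link counters are non-zero"
else:
    verdict = "no counters raised"

print("media:", media_hits)
print("link: ", link_hits)
print("verdict:", verdict)
```

With the posted values, only the media counters are non-zero, which is consistent with the impression that this is not a cable problem.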

Here is the info I think is relevant:

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%      6152         -
# 2  Short offline       Completed without error       00%      6139         -
# 3  Extended offline    Completed: read failure       60%      6138         -
# 4  Short offline       Completed without error       00%      6125         -
# 5  Short offline       Completed without error       00%      6101         -
# 6  Short offline       Completed without error       00%      6077         -
# 7  Short offline       Completed without error       00%      6053         -
# 8  Short offline       Completed without error       00%      6029         -
# 9  Short offline       Completed without error       00%      6005         -
#10  Short offline       Completed without error       00%      5981         -
#11  Extended offline    Aborted by host               30%      5978         -
#12  Short offline       Completed without error       00%      5956         -
#13  Short offline       Completed without error       00%      5932         -
#14  Short offline       Completed without error       00%      5908         -
#15  Short offline       Completed without error       00%      5884         -
#16  Short offline       Completed without error       00%      5860         -
#17  Short offline       Completed without error       00%      5836         -
#18  Extended offline    Completed without error       00%      5815         -
#19  Short offline       Completed without error       00%      5788         -
#20  Short offline       Completed without error       00%      5764         -
#21  Short offline       Completed without error       00%      5740         -

wilcomir · December 6, 2024, 6:34am

And another log output. Apologies, I was unable to edit the original post.

smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST16000NM000J-2TW103
Serial Number:    ZR5DGX7Y
LU WWN Device Id: 5 000c50 07445c42a
Firmware Version: SN04
User Capacity:    16,000,900,661,248 bytes [16.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec  6 07:34:00 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 118) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1451) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x70bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   044    Pre-fail  Always       -       78235738
  3 Spin_Up_Time            0x0003   092   090   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       49
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       2
  7 Seek_Error_Rate         0x000f   086   060   045    Pre-fail  Always       -       409588130
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       6157
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       49
 18 Unknown_Attribute       0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   060   046   000    Old_age   Always       -       40 (Min/Max 37/52)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       33
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1287
194 Temperature_Celsius     0x0022   040   054   000    Old_age   Always       -       40 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       6113 (66 161 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       17708080098
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       53268052734

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 6143 hours (255 days + 23 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  45d+08:17:47.441  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+08:17:38.117  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+08:17:38.117  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+08:17:38.105  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+08:17:38.104  READ FPDMA QUEUED

Error 4 occurred at disk power-on lifetime: 6137 hours (255 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 18 ff ff ff 4f 00  45d+01:53:03.582  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+01:53:03.581  READ FPDMA QUEUED
  ea 00 00 00 00 00 00 00  45d+01:53:03.558  FLUSH CACHE EXT
  60 00 00 ff ff ff 4f 00  45d+01:53:03.493  READ FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:53:03.490  WRITE FPDMA QUEUED

Error 3 occurred at disk power-on lifetime: 6137 hours (255 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 00 ff ff ff 4f 00  45d+01:50:06.872  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:50:06.872  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:50:06.872  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:50:06.872  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:50:06.872  WRITE FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 6137 hours (255 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  45d+01:50:04.628  WRITE FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+01:49:55.309  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+01:49:55.309  READ FPDMA QUEUED
  60 00 00 ff ff ff 4f 00  45d+01:49:55.298  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00  45d+01:49:55.298  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 6137 hours (255 days + 17 hours)
  When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00  45d+01:49:51.874  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00  45d+01:49:51.874  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:49:51.874  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:49:51.874  WRITE FPDMA QUEUED
  61 00 00 ff ff ff 4f 00  45d+01:49:51.873  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       60%      6152         -
# 2  Short offline       Completed without error       00%      6139         -
# 3  Extended offline    Completed: read failure       60%      6138         -
# 4  Short offline       Completed without error       00%      6125         -
# 5  Short offline       Completed without error       00%      6101         -
# 6  Short offline       Completed without error       00%      6077         -
# 7  Short offline       Completed without error       00%      6053         -
# 8  Short offline       Completed without error       00%      6029         -
# 9  Short offline       Completed without error       00%      6005         -
#10  Short offline       Completed without error       00%      5981         -
#11  Extended offline    Aborted by host               30%      5978         -
#12  Short offline       Completed without error       00%      5956         -
#13  Short offline       Completed without error       00%      5932         -
#14  Short offline       Completed without error       00%      5908         -
#15  Short offline       Completed without error       00%      5884         -
#16  Short offline       Completed without error       00%      5860         -
#17  Short offline       Completed without error       00%      5836         -
#18  Extended offline    Completed without error       00%      5815         -
#19  Short offline       Completed without error       00%      5788         -
#20  Short offline       Completed without error       00%      5764         -
#21  Short offline       Completed without error       00%      5740         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try 'smartctl -x' for more

wilcomir · December 7, 2024, 2:48pm

Hello, small update.

I ran a total of three long tests on sde but it keeps stopping at 60% remaining. I still have a single offline uncorrectable, and no more ATA errors, so total of three. Reallocated sector count is now up to three.

I think I can assume this disk is toast, or soon to be. Does anyone else have any ideas? It seems to be failing slowly, but I am unsure whether I can accept a disk that just cannot complete a SMART long test.

winnielinnie · December 7, 2024, 3:23pm

I wouldn’t let that drive be used in any pool. Consider it failed or failing.

Don’t rely on ZFS to always get you out of this jam without replacing the drive.

How old is the drive? According to the Power_On_Hours, it has been running for less than a year. Surely it’s covered by warranty? Even Seagate “recertified” drives have a 1-year warranty.
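For reference, the "less than a year" figure follows directly from the Power_On_Hours raw value in the posted output. A quick Python check of the arithmetic (note that power-on time only counts hours the drive was running, so it is a lower bound on the drive's age):

```python
POWER_ON_HOURS = 6157  # raw value of attribute 9 in the posted smartctl output

days, hours = divmod(POWER_ON_HOURS, 24)
print(f"{POWER_ON_HOURS} h = {days} days + {hours} h")
print(f"Fraction of a 365-day year: {POWER_ON_HOURS / (365 * 24):.2f}")
```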

winnielinnie · December 7, 2024, 3:30pm

Daily scrubs and daily SMART tests are overkill. A monthly scrub should be fine, combined with weekly SMART tests.

wilcomir · December 7, 2024, 5:02pm

Thanks @winnielinnie, I appreciate you taking the time to answer me.

I bought 4 identical drives off of Amazon this March, all refurbished of course. Do you happen to know how the RMA process would work? Seagate tells me to contact Amazon, apparently - I guess I have to buy another drive in the meantime though.

winnielinnie · December 7, 2024, 5:24pm

Refurbished or manufacturer recertified?

From what I understand, refurbished drives are covered only for 90 days, while manufacturer recertified are covered anywhere from 6 months to 2 years, depending on the seller.

EDIT: To make matters more confusing, “recertified” drives listed on Amazon may in fact be refurbished or “seller recertified”, but not manufacturer recertified.

I have a hunch your drives are only covered for 90 days.

wilcomir · December 7, 2024, 5:30pm

Hey, according to Seagate the coverage has indeed expired, but Amazon is reimbursing it, so I am buying another one.

winnielinnie · December 7, 2024, 5:43pm

That’s good news! Is Amazon requiring you to return the drive first, before issuing a refund or exchange, or will they reimburse you and let you return the drive later?

At least if you can keep this drive in the pool until getting a replacement, you might be able to prevent the pool from falling into a degraded state.

wilcomir · December 7, 2024, 6:44pm

No, they want the drive first and I get the money afterwards. I ordered a new one; I will do the resilver with both drives in, then offline and detach the failing one, and then ship it back to them.

Stux · December 7, 2024, 10:58pm

The drive is failing. Replace it.

Stux · December 7, 2024, 11:02pm

Yes, I typically order a new replacement drive… and then RMA the failing one (if in warranty), as it’s better to have a failing drive than a failed or removed drive.

The same applies when replacing: if possible, leave the old drive connected while performing the replacement.

wilcomir · December 8, 2024, 8:15am

Thanks everyone for your input.

Now I am curious though - this drive seems to have just a single failing sector, which I would normally expect to be reallocated at a certain point, especially since this is in a mirror VDEV.

My understanding is that if this sector contained data, during a scrub it would be overwritten and therefore reallocated. Since this has not happened, I assume the sector currently does not contain any data; in fact my pool is only 20% full.

Now, is there any way I can try to write something to that specific sector? I would like to see if it gets reallocated, and if subsequently a long test passes.

What I also do not understand is why a long test just stops & gives up: if this is a single bad sector, then surely the long test can highlight this and proceed? Does this mean that everything after that sector is FUBAR?
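On targeting that specific sector: its address would have to be known first, and the self-test log in this thread shows "-" under LBA_of_first_error. One possible way to narrow it down is a selective self-test over the region where the long test stops, using smartctl's `-t select,N-M` option. The Python sketch below only builds such a command and does not run it. It rests on two assumptions that the logs do not confirm: that the long test walks LBAs linearly, and that "remaining" is rounded to 10% steps, hence the padded window.

```python
# Values from the smartctl output posted in this thread.
USER_CAPACITY_BYTES = 16_000_900_661_248
LOGICAL_SECTOR_BYTES = 512
REMAINING_PERCENT = 60   # "Completed: read failure  60%"
DEVICE = "/dev/sde"      # device name as given in the thread

# Assumptions, not facts: linear scan order and 10% reporting granularity,
# so the window is padded by one step on each side.
STEP_PERCENT = 10

total_lbas = USER_CAPACITY_BYTES // LOGICAL_SECTOR_BYTES
done_percent = 100 - REMAINING_PERCENT
low_percent = max(done_percent - STEP_PERCENT, 0)
high_percent = min(done_percent + STEP_PERCENT, 100)

first_lba = total_lbas * low_percent // 100
last_lba = total_lbas * high_percent // 100 - 1

# Build the command as text only; nothing is executed here.
command = f"smartctl -t select,{first_lba}-{last_lba} {DEVICE}"
print(command)
```

If the selective test reports an LBA, that would at least say which sector is involved; whether writing to it is a good idea on a drive that is about to be returned is a separate question.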

etorix · December 8, 2024, 9:05am

A scrub reads data to check its content. Reallocation occurs upon write failure.

Stux · December 8, 2024, 2:08pm

If there is an error reading the LBA, the LBA is rewritten with a corrected block.

You would expect to see an error in the pool status in that case.

A failed SMART test is grounds for a warranty claim if the drive is within its warranty term. At least in my experience.

wilcomir · December 9, 2024, 6:07am

Thanks to both. That was actually my assumption about how ZFS works in a mirror VDEV:

- a scrub is started, and data and checksums are read “section by section”* from both disks
- if they correspond, all good
- if they do not correspond, we try to determine the good section via the checksum
- if this goes well, the data is overwritten on the bad section → this would trigger a reallocation
- if this does not go well, you got unlucky and the error is probably uncorrectable
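The steps above can be sketched as a toy model in Python. This is not ZFS code: `scrub_block`, the use of SHA-256, and the two-element list standing in for a mirror are all illustrative assumptions. One difference from the description above is that each copy is checked against the stored checksum rather than against the other disk, which is how ZFS decides which copy is good.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Stand-in for the checksum ZFS stores for each block (assumed SHA-256 here)."""
    return hashlib.sha256(data).hexdigest()


def scrub_block(copies: list, expected: str) -> str:
    """Verify every mirror copy of one allocated block and repair bad copies in place.

    `copies` holds one entry per disk: bytes, or None for an unreadable sector.
    """
    good = next((c for c in copies if c is not None and checksum(c) == expected), None)
    if good is None:
        return "uncorrectable"  # no copy matches the checksum
    repaired = False
    for i, c in enumerate(copies):
        if c is None or checksum(c) != expected:
            copies[i] = good  # the rewrite that may lead the drive to reallocate
            repaired = True
    return "repaired" if repaired else "ok"


block = b"some pool data"
expected = checksum(block)

healthy = [block, block]
one_bad = [block, None]         # second disk returns a read error
both_bad = [b"garbage", None]   # neither copy matches the checksum

print(scrub_block(healthy, expected))
print(scrub_block(one_bad, expected), one_bad[1] == block)
print(scrub_block(both_bad, expected))
```

In this model a sector only gets rewritten if it belongs to a block the scrub actually visits, and a scrub only walks allocated blocks, so a bad sector in free space is never read or repaired by it.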

I have to say I did have a warning for the whole pool when this all started. I triggered a scrub and it went away, but the offline uncorrectable remained. Perhaps the offline uncorrectable sector does not contain any data, and therefore ZFS does not scrub it…

* I use the word “section” as in “chunk of data”; I am not sure whether ZFS works by sector, block or whatever, but that is not really relevant in this case, I would say.