SMART tests fail but badblocks passes

I rolled the dice and picked up several used drives from eBay. Since they were used, I wanted to do some additional testing before trusting them with any of my data. One of the drives is failing SMART tests but passed a badblocks run.

Here are the most recent runs of smartctl:

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Failed in segment -->       3    2319                 - [0x1 0xb 0x96]
# 2  Background short  Failed in segment -->       3    2314                 - [0x1 0xb 0x96]
# 3  Background long   Completed                   -    2236                 - [-   -    -]
# 4  Background long   Failed in segment -->       7    2193         175919960 [0x3 0x5d 0x1]
# 5  Background short  Completed                   -    2192                 - [-   -    -]
# 6  Background short  Failed in segment -->       3    2192                 - [0x1 0xb 0x96]
# 7  Background short  Completed                   -    2169                 - [-   -    -]

The drive showed up with one passed short test (#7). It failed my first test (#6), then passed another short test (#5), then failed a long test (#4). At some point I also did a badblocks run, which it passed:

truenas_admin@truenas[~]$ time sudo badblocks -t random -w -s -b 4096 /dev/sda                                                                    
Testing with random pattern: done                                                                                                                 
Reading and comparing: ^C9.76% done, 64:10:04 elapsed. (0/0/0 errors)                                                                             
                                                                                                                                                  
Interrupted at block 142973376                                                                                                                    
sudo badblocks -t random -w -s -b 4096 /dev/sda  247.30s user 1524.73s system 0% cpu 64:10:04.34 total

and now it’s failing the SMART tests again. What gives? Since badblocks writes to and then reads back every sector, isn’t that supposed to be more thorough than a SMART test? Would you consider this drive DOA?

If it’s at all relevant, these are SAS drives and it took something like 2.5 days to run the badblocks test.
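For reference, the way I’ve been kicking off the tests and reading the results is just the standard smartctl invocations, something along these lines (device name will differ per system):

# start a long (extended) self-test; it runs in the background on the drive
sudo smartctl -t long /dev/sda

# later, check the self-test log for pass/fail results
sudo smartctl -l selftest /dev/sda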

OK, well, the clock is ticking on my window to return these drives. I think I’ll go with what the SMART tests are telling me and return/exchange them before it closes. I was hoping someone could give me some insight into why badblocks would pass while SMART fails. If I actually had data on that drive I wouldn’t even be able to run a destructive badblocks pass, but the SMART results would tell me to replace it ASAP, so why play with fire on this one!

First of all, thank you for joining our forum.

I’m sorry no one answered your posting. Please keep in mind that this is all volunteer support; once in a while some of the iXsystems employees will jump in and toss out some good information, but that is the exception. Mostly it’s just us little guys out here trying to help out.

I only just saw this posting so I guess late is better than never.

If a SMART Short or Long test fails, the drive is bad regardless of any other testing.

As to why badblocks passed it, I can’t say definitively, but I do not see results for all four test patterns. From your output it looks like you ran just one pass, using the “random” pattern. I realise badblocks takes a long time, especially with large-capacity drives, but if you are going to use it as a tool in the future, I would recommend running all four default test patterns. There is a reason it isn’t just one pattern: it has to do with how data is recorded to the drive, exercising the different magnetic flux transitions to make sure it can faithfully write each possible variation. But I applaud you for taking action and trying to diagnose the failure.
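For what it’s worth, if you leave off -t entirely, badblocks in write mode cycles through its four default patterns (0xaa, 0x55, 0xff, 0x00) on its own, so a full run looks roughly like this (destructive; device name is a placeholder):

# destructive write-mode test over all four default patterns
sudo badblocks -wsv -b 4096 /dev/sdX

# or target a single pattern per run if you want to split it up, e.g.:
sudo badblocks -wsv -b 4096 -t 0xaa /dev/sdX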

Next time, please post the model of the drive; it would have helped round out the data, and that goes for any problem. Include the hardware specs too (most of it is already at the bottom of your posts).


Hey, it’s all good! I recognize this isn’t even a TrueNAS-specific question, but I thought I’d ask here since there’s a lot of overlap, and, well, you kind of need working (tested) drives in order to use TrueNAS. I appreciate you chiming in anyway :slight_smile:

If a SMART Short or Long test fails, the drive is bad regardless of any other testing.

Duly noted. Happy I’m making the right call here. I was perhaps naively hoping that the failure was a one-off and might go away, but alas, that’s not the case! I’ve had at least one drive where a SMART test failed and then subsequent tests somehow passed.

As to why badblocks passed it, I can’t say definitively, but I do not see the results for all four test patterns.

Your observation is correct. I only ran one (random) pass with it, then reran my SMART tests. This was, of course, after running sg_format on the drive as well. Is it true that sg_format fills each block with zeros as it formats the drive?
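For the record, the format was a plain sg_format from sg3_utils, something along these lines (I may not have used exactly these options):

# SCSI FORMAT UNIT via sg3_utils; destructive and can take many hours on a 6TB drive
sudo sg_format --format /dev/sda

# it can also reformat to a different logical block size, e.g.:
# sudo sg_format --format --size=4096 /dev/sda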

if you are to use it as a tool in the future, I would recommend using all four test patterns

Will do! And yes, you are correct, I did just one pass, which took about 2.5 days, so four passes would have been somewhere around 10 days. (I’m currently running the third pass on my other drives as I write this.)

Next time please post the model of the drive

Sure thing! FWIW, and for posterity, these drives are 6TB HGST HUS726060AL5214 SAS drives (I believe the string after “HGST” is the model number, not a serial number):

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060AL5214
Revision:             NE00
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca25542863c
Serial number:        K1H5L56F
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Thu Feb 27 22:11:00 2025 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

I don’t know, but that is a common practice, so I wouldn’t doubt it.

I just noticed that these are only 6TB drives; that testing time sounds slow for the capacity. To be honest, it has been a while since I have run badblocks on a drive. I am migrating to all NVMe, and doing a lot of write testing on an SSD is just a bad idea: it sucks the life out of the drive.

If you have any other questions about drives, feel free to reach out to me. I enjoy learning anything new, and I know quite a bit, though I also know that I don’t know it all.

Not sure if you have considered using Multi-Report (linked below), a script I started back when FreeNAS 8.0 was out (actually it was probably 8.3, but it was before 9). If you have not heard of it, take a look. I’m working on version 3.17 right now, but it will be a few months (could be six) before it’s complete, unless v3.16 has a critical problem, in which case I will fix that immediately.

Take care,
-Mark (aka. Joe)

The ‘S’ tells us it is an SMR drive, which explains the sluggishness. Replace it!

It failed SMART tests (even including a short one): Replace it!

As to why SMART failed but badblocks passed, the full report from smartctl -x might have given some clue, but you decided that snippets were enough for us to chew on…
(Going out on a limb, the drive might have permanently reallocated sectors.)
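(If you want to check that quickly on a SAS drive, the grown defect list and the error counter log from smartctl -x are the things to look at; something like this pulls them out, with the device name as a placeholder:)

# SAS: the grown defect list is roughly the reallocated-sector count;
# the error counter log shows corrected/uncorrected read and write errors
sudo smartctl -x /dev/sda | grep -i "grown defect"
sudo smartctl -x /dev/sda | grep -i -A 6 "Error counter log"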


Actually the breakdown of the drive model is:
H = HGST
U = Ultrastar
S = Standard (vs. Compact)
72 = 7200 RPM
60 = Full Capacity 6TB
60 = Capacity (yes again)
A = Generation Code
L = 26.1mm Height (thick)
52 = 512e SAS 12Gb/s
1 = 128MB Buffer
4 = Secure Erase Overwrite

The drive is not SMR.

Now if the model number had an ‘S’ in the second spot (HSH72… for example), yes, it would be SMR.

If I am wrong, please provide a link as proof; I don’t mind being wrong as long as I learn something, but I do want to see the proof.

Here is a link to the drive datasheet.


Second vs. third position… I stand corrected, thanks @joeschmuck.

Here you go! Full “smartctl -x” results. (Not sure of the best way to format this without spitting out a full page of output):

SMART Results
truenas_admin@truenas[~]$ sudo smartctl -x /dev/sda 
[sudo] password for truenas_admin: 
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               HGST
Product:              HUS726060AL5214
Revision:             NE00
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000cca2553f5d20
Serial number:        XXXXXXXXX <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Fri Feb 28 10:21:16 2025 CST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     32 C
Drive Trip Temperature:        55 C

Manufactured in week 01 of year 2017
Specified cycle count over device lifetime:  50000
Accumulated start-stop cycles:  10
Specified load-unload count over device lifetime:  600000
Accumulated load-unload cycles:  2668
Elements in grown defect list: 7

Vendor (Seagate Cache) information
  Blocks sent to initiator = 11881567664209920

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0      304         0       304   13560151     806340.124           0
write:         0      349         0       349    7283843      89907.800           8
verify:        0        0         0         0       7079          0.000           0

Non-medium error count:        0

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Failed in segment -->       3    2392                 - [0x1 0xb 0x96]
# 2  Background long   Failed in segment -->       3    2319                 - [0x1 0xb 0x96]
# 3  Background short  Failed in segment -->       3    2314                 - [0x1 0xb 0x96]
# 4  Background long   Completed                   -    2236                 - [-   -    -]
# 5  Background long   Failed in segment -->       7    2193         175919960 [0x3 0x5d 0x1]
# 6  Background short  Completed                   -    2192                 - [-   -    -]
# 7  Background short  Failed in segment -->       3    2192                 - [0x1 0xb 0x96]
# 8  Background short  Completed                   -    2169                 - [-   -    -]

Long (extended) Self-test duration: 6 seconds [0.1 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 2523:48 [151428 minutes]
    Number of background scans performed: 6,  scan progress: 0.00%
    Number of background medium scans performed: 6

General statistics and performance log page:
  General access statistics and performance:
    Number of read commands: 2629300700
    Number of write commands: 708196620
    number of logical blocks received: 175601170995
    number of logical blocks transmitted: 1574883054184
    read command processing intervals: 0
    write command processing intervals: 0
    weighted number of read commands plus write commands: 0
    weighted read command processing plus write command processing: 0
  Idle time:
    Idle time intervals: 3531682256
      in seconds: 176584112.800
      in hours: 49051.142

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 3
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: unknown
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000cca2553f5d21
    attached SAS address = 0x5003048001895e3f
    attached phy identifier = 2
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0
relative target port id = 2
  generation code = 3
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: power on
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000cca2553f5d22
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization count = 0
    Phy reset problem count = 0

In case it helps, here are the results from sg_readcap:

truenas_admin@truenas[~]$ sudo sg_readcap /dev/sda 
READ CAPACITY (10) indicates device capacity too large
  now trying 16 byte cdb variant
Read Capacity results:
   Protection: prot_en=0, p_type=0, p_i_exponent=0
   Logical block provisioning: lbpme=0, lbprz=0
   Last LBA=11721045167 (0x2baa0f4af), Number of logical blocks=11721045168
   Logical block length=512 bytes
   Logical blocks per physical block exponent=3 [so physical block length=4096 bytes]
   Lowest aligned LBA=0
Hence:
   Device size: 6001175126016 bytes, 5723166.6 MiB, 6001.18 GB, 6.00 TB

Your formatting is perfect. Thanks!

Now, this is SAS, and SMART reports are a lot less comprehensive than with SATA drives, but 8 uncorrected write errors does not look good, and I think that “Elements in grown defect list: 7” effectively means “7 reallocated sectors”.

Yes, that is true. In the past, when I hit 5 reallocated sectors, regardless of any other data, I would start considering a drive replacement, so 7 is more than I would live with. It is so much better to replace a drive on your own schedule than when the failure has to be dealt with immediately. But in this particular case you have test failures, which means: do it very soon.

Just to ease your mind a tiny bit: a Short/Long test failure due to an LBA read error does not automatically mean your drive is dead and that you should not use it. It means the drive could not read data at a particular location.

Run a SCRUB; if it passes, your data is fine. The system may try to write data to a bad location, but the drive reads the data back immediately to verify it matches. If it doesn’t match, it typically retries a few times and then moves on to a different location. The drive may mark the suspect LBA as a pending sector or reallocate it. As I understand it, if the drive can still write to the LBA but it takes a few tries, that is a pending sector; it hasn’t failed hard.
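On a SATA drive those two states show up as the SMART attributes Current_Pending_Sector and Reallocated_Sector_Ct, and a quick way to check them is something like this (device name is a placeholder; on a SAS drive like yours, the reallocations land in the grown defect list instead):

# SATA only: pending vs. reallocated sector counts (attributes 197 and 5)
sudo smartctl -A /dev/sdX | grep -E "Reallocated_Sector_Ct|Current_Pending_Sector"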

So if a SCRUB passes, your data is fine.
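And just so we are picturing the same thing, the scrub-and-check cycle on an existing pool is simply something like this (‘tank’ being a placeholder pool name):

# kick off a scrub, then check progress and any errors it found
sudo zpool scrub tank
sudo zpool status -v tank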

I don’t have any data on the disk. This is a “new” (from eBay) drive that I haven’t started using yet. You’re not recommending I make a single-drive pool and then run a scrub on that, are you?


Fair enough. This is just a used drive which has failed to pass qualification.

I’m not trying to confuse you; I read a lot of these posts each day and just forgot you were still testing them. Either way, it is still good advice for when you run across your next drive failure, once you have a pool.

Cheers
