Issues with formatting drives

JayC · February 2, 2025, 6:42pm

OK, for context, I’m pretty new at this NAS stuff. I bought a used T320, updated both the BIOS and the Lifecycle Controller firmware to the latest versions, flashed the controller into “IT” mode, changed the CMOS battery, and got iDRAC set up. I’m also on the latest version of TrueNAS. I only have a boot pool at the moment and no other storage pools.

The problem is that I have now purchased 10 x 4TB used Seagate SAS drives on eBay. The first thing I did was run a long health scan on 5 drives (I only have 8 bays, so I was planning on doing 2 passes, testing 5 drives each). All 5 drives passed with no problems, but TrueNAS presented a critical error saying (DIF) was unsupported on those drives. A quick Google search told me this could be solved by formatting the drives, so I ran “sudo sg_format --format --size=512 /dev/sda” in parallel via tmux as advised. This worked for 3 of the 5 drives, but I got this error when attempting the other two:

Format unit in progress, 7.57% done
test unit ready:
Descriptor format, <<<deferred>>>; Sense Key: Hardware Error
Additional Sense: Defect list not found
	Descriptor type: Information: 0x00000000228183a0
	Descriptor type: Field Replaceable unit code: 0x93
	Descriptor type: Vendor specific [0x80]
		00 00 00 00 00 00 00 00 00 00 00 00 00 00
FORMAT UNIT Complete
truenas_admin@truenas[~]$

(I had to type the above in manually because it’s not letting me paste images, but it should be verbatim apart from indentation spacing.)
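As a rough sketch of the format step described above, the Python snippet below only builds and prints the `sg_format` command line for each drive rather than running it, since formatting is destructive. The device list and the helper name `build_sg_format_cmd` are made up for illustration. `--fmtpinfo=0` is an sg_format option that, as far as I understand it, requests a format without protection information (it is also the default), which is what clearing the DIF warning relies on.

```python
import shlex

def build_sg_format_cmd(device, block_size=512):
    """Return the sg_format argv for a destructive format of `device`.

    --fmtpinfo=0 requests a format without protection information (DIF).
    """
    return [
        "sudo", "sg_format",
        "--format",
        f"--size={block_size}",
        "--fmtpinfo=0",
        device,
    ]

# Hypothetical device names; check which /dev/sdX is which before formatting.
devices = ["/dev/sda", "/dev/sdb"]

for dev in devices:
    # Print only (dry run). Each line can be pasted into its own tmux pane.
    print(shlex.join(build_sg_format_cmd(dev)))
```

Printing the commands first makes it easier to double-check the device names before anything irreversible happens.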

I retried them, unplugged the drives and replugged them into different slots, and removed all the other, non-problematic drives from their bays. I also just pulled the plug from the system because I heard it might help, but I’m not expecting much to change on this next attempt. Normally I would chalk this up to the drives being bad, but them passing the long test with no errors has been bugging me. Anyone have any ideas? Here’s some more info on the drives:

truenas_admin@truenas[~]$ sudo smartctl -a /dev/sda
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0023
Revision:             XMGH
Compliance:           SPC-4
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5006276db6f
Serial number:        Z1Z5S1QF0000C510AFWY
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Sun Feb  2 10:15:05 2025 PST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     38 C
Drive Trip Temperature:        68 C

scsiPrintBackgroundResults Failed [medium or hardware error (serious)]
Manufactured in week 38 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  610
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  3533
Read defect list: asked for grown list but didn't get it
Vendor (Seagate Cache) information
  Blocks sent to initiator = 3534769525
  Blocks received from initiator = 256483996
  Blocks read from cache and sent to initiator = 2498505367
  Number of read and write commands whose size <= segment size = 1325140844
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 56937.43
  number of minutes until next internal SMART test = 59

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3873026852        0         0  3873026852          0     678553.128           0
write:         0        0         0         0          0     276646.093           0
verify:  6471279        0         0   6471279          0          0.000           0

Non-medium error count:  3586938


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   56886                 - [-   -    -]
# 2  Background short  Completed                   -   18028                 - [-   -    -]
# 3  Background short  Completed                   -   15849                 - [-   -    -]
# 4  Background short  Completed                   -   11302                 - [-   -    -]
# 5  Background short  Completed                   -   10681                 - [-   -    -]
# 6  Background short  Completed                   -    1789                 - [-   -    -]
# 7  Background short  Completed                   -    1646                 - [-   -    -]

Long (extended) Self-test duration: 32700 seconds [9.1 hours]

truenas_admin@truenas[~]$ sudo smartctl -a /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST4000NM0023
Revision:             XMGH
Compliance:           SPC-4
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x5000c5006277660b
Serial number:        Z1Z5SJ080000C510BZ1Y
Device type:          disk
Transport protocol:   SAS (SPL-4)
Local Time is:        Sun Feb  2 10:16:51 2025 PST
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     37 C
Drive Trip Temperature:        68 C

scsiPrintBackgroundResults Failed [medium or hardware error (serious)]
Manufactured in week 38 of year 2014
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  607
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  3508
Read defect list: asked for grown list but didn't get it
Vendor (Seagate Cache) information
  Blocks sent to initiator = 2065260292
  Blocks received from initiator = 875576814
  Blocks read from cache and sent to initiator = 519023055
  Number of read and write commands whose size <= segment size = 1883588032
  Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 56904.57
  number of minutes until next internal SMART test = 59

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1639291679        0         0  1639291679          0     762657.538           0
write:         0        0         0         0          0     373244.293           0
verify:  2630170        0         0   2630170          0          0.000           0

Non-medium error count:  3571678


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -   56853                 - [-   -    -]
# 2  Background short  Completed                   -   18028                 - [-   -    -]
# 3  Background short  Completed                   -   15850                 - [-   -    -]
# 4  Background short  Completed                   -   11303                 - [-   -    -]
# 5  Background short  Completed                   -   10682                 - [-   -    -]
# 6  Background short  Completed                   -    1790                 - [-   -    -]
# 7  Background short  Completed                   -    1646                 - [-   -    -]

Long (extended) Self-test duration: 32700 seconds [9.1 hours]

This is an example of what they looked like before the first reformat:

~~I can’t post images, but it said user capacity [4.00 TB], logical block size: 512, Formatted with type 2 protection, 8 bytes of protection information per logical block~~ (all of which is now missing from the information sections for these two drives)
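A small sketch of how the before/after difference could be checked from saved `smartctl -a` text instead of screenshots. It assumes the pre-format output contains a line of the form "Formatted with type N protection", as quoted above. The function name `protection_type` and the sample strings are illustrative only, not a full smartctl dump.

```python
import re

def protection_type(smartctl_text):
    """Return the protection type (int) reported by smartctl, or None.

    Looks for a line such as 'Formatted with type 2 protection'.
    """
    match = re.search(r"formatted with type (\d+) protection",
                      smartctl_text, re.IGNORECASE)
    return int(match.group(1)) if match else None

# Sample text modelled on the lines quoted in this post (not a full dump).
before = """Logical block size:   512 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block"""
after = """Product:              ST4000NM0023
LU is fully provisioned"""

print(protection_type(before))  # 2
print(protection_type(after))   # None
```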

It is also worth noting that each drive appears to throw the error at around the same percentage on every attempt: one drive fails in the low 20s and the other in the mid 70s. Cheers

edit: trying to just call sg_format with -v gave me this which might be relevant:

SmallBarky · February 2, 2025, 7:26pm

Browse some other threads and do the Tutorial by the Bot to get your forum trust level up. You should be allowed images then.

TrueNAS-Bot
Type this in a new reply and send to bring up the tutorial, if you haven’t done it already.

@TrueNAS-Bot start tutorial

JayC · February 3, 2025, 3:06am

@TrueNAS-Bot start tutorial

ngl low key got trolled but it’s chill. Also, restarting the NAS didn’t seem to do the trick.

Arwen · February 3, 2025, 3:35am

Drives “sda” and “sdb” are 10 years old, and about 6.5 years powered up. It is perfectly understandable for them to have medium or hardware errors.
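For anyone checking those figures, here is a quick back-of-the-envelope calculation in Python using the values from the smartctl output for sda. Treating smartctl's "week 38 of year 2014" as ISO week 38 is an assumption made for simplicity, so the age is approximate.

```python
from datetime import date

HOURS_PER_YEAR = 365.25 * 24  # 8766 hours

def powered_on_years(hours):
    """Convert power-on hours to years."""
    return hours / HOURS_PER_YEAR

def age_years(year, week, today):
    """Approximate drive age, taking the Monday of the given ISO week."""
    manufactured = date.fromisocalendar(year, week, 1)
    return (today - manufactured).days / 365.25

# Values copied from the sda report above.
print(round(powered_on_years(56937.43), 1))             # 6.5
print(round(age_years(2014, 38, date(2025, 2, 2)), 1))  # 10.4
```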

On rare occasions a reformat can restore functionality by finding and sparing out failed media sections.

If, however, there is more bad media than there is spare space, that could be harder to overcome. One possibility is reducing the size by a few MB and then retrying the format. I’ve not done that before. If it works on the faulty drives, you would have to repeat it on the good drives so that they all have the same amount of usable space.
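To make the shrinking idea concrete, here is an untested sketch of the arithmetic in Python. It assumes sg_format's `--count` option can be used to set a smaller block count during a format, and that the current block count has been read from the drive first (for example with `sg_readcap`). The 7814037168 figure is the usual block count for a 4 TB drive with 512-byte sectors and is used here only as an example, not something taken from these drives. The command is printed, not executed.

```python
def reduced_block_count(current_blocks, shrink_mib, block_size=512):
    """Return the block count after giving up `shrink_mib` MiB of capacity."""
    blocks_to_drop = (shrink_mib * 1024 * 1024) // block_size
    if blocks_to_drop >= current_blocks:
        raise ValueError("cannot shrink by more than the drive's capacity")
    return current_blocks - blocks_to_drop

# Example value only: typical block count of a 4 TB, 512-byte-sector drive.
current = 7814037168
target = reduced_block_count(current, shrink_mib=100)

print(target)  # 7813832368
# Dry run: show the command rather than running a destructive format.
print(f"sudo sg_format --format --size=512 --count={target} /dev/sdX")
```

The same target count would then be used for every drive, good or bad, so they all end up the same size.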

JayC · February 3, 2025, 3:42am

Yeah, I understand the age of the drives, and they’re not meant to handle critical data. But wouldn’t these errors normally have been caught by the initial S.M.A.R.T. long scan before the reformat? It feels like if I hadn’t attempted a reformat in the first place to remove the DIF error, they would’ve worked perfectly fine.

Arwen · February 3, 2025, 6:43am

I doubt that the SMART long test does much writing, if any. A low-level format writes a new header, blank data block, ECC and trailer for each sector, unrelated to where the prior data existed on the track. Thus, blocks that appeared good when read before may now report as bad. I’m not certain that is the case…

Anyway, as I said, if it is a case where there are not enough spares to spare out the bad spaces, it may be possible to shrink the usable space slightly to make more spare space. That is all theoretical.

I do know Sun Microsystems used to shrink all its HDDs to a standard size so that drives from different manufacturers would end up the same size. Not sure if that shrinking allowed the drives to use the space as additional spare sectors.