Resilvering is Cooking My Drive

        NAME            STATE     READ WRITE CKSUM
        Tank1           DEGRADED     0     0     0
          raidz3-0      DEGRADED   175     0     0
            disk01      FAULTED    136     0     1  too many errors
            disk02      FAULTED     66     0     0  too many errors
            disk03      DEGRADED   178     0     0  too many errors
            disk04      DEGRADED   178     0     0  too many errors
            disk05      ONLINE       0     0     0
            disk06      FAULTED     67     0     1  too many errors
            disk07      ONLINE       0     0     0
            disk08      ONLINE       0     0     0
            disk09      ONLINE       0     0     0
            disk10      ONLINE       0     0     0
            disk11      ONLINE       0     0     0
            disk12      ONLINE       0     0     0
        cache
          cache01       ONLINE       0     0     0

Disk 3 was a replacement disk and it faulted during resilvering. Not only did the resilvering fail, disk 2 faulted as well. I have ordered some replacement disks and they probably won’t be here till next week. I have a bunch of questions… such as how many drives can I resilver at a time? I know that when I replace a disk the resilvering process starts automatically; can I replace all 3 disks at the same time? Also, I know resilvering puts pressure on disks, so disks that are shaky are more likely to fail… is there a way to reduce the pressure that’s put on the disks during resilvering? I don’t mind the process taking longer, but I want to prevent another disk from faulting or I’ll be fubared. Obviously I am backing up the critical data.

If there is any chance to keep the failing disks and the replacement disks connected at the same time, that is best.

Otherwise… that is a lot of drives having issues at once; how are they connected? Any chance you can post the output of smartctl -a /dev/sdWHATEVERHERE?


disk 1

=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST6000NM0054
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,122,432 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 44146:46
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1537088823        0         0  1537088823          0     464220.521           0
write:         0 <max>     <max>               0          0     271956.583           0
verify: 51148137        1         0  51148138          1        684.288           0

Non-medium error count:       36

disk 2

=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST6000NM0054
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,122,432 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 2
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 43483:08
Elements in grown defect list: 1

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1536910471        0         0  1536910471          0     436289.137           0
write:         0 <max>     <max>               1          2     298587.672           1
verify: 62712986        1         0  62712987          4        827.363           3

Non-medium error count:       55

disk 3

=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST6000NM0054
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,122,432 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned = 2
Power on minutes since format <not available>
Current Drive Temperature:     41 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 40578:26
Elements in grown defect list: 3

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1514933214        4         0  1514933218          4     426161.934           0
write:         0 <max>     <max>               7          7     289787.782           0
verify: 58030150        2         0  58030152          3        761.732           1

Non-medium error count:       81

disk 4

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 62972:54
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  32
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2770
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 2996945824
  Blocks received from initiator = 4037554984
  Blocks read from cache and sent to initiator = 1110404880
  Number of read and write commands whose size <= segment size = 1285212082
  Number of read and write commands whose size > segment size = 3276

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 62972.90
  number of minutes until next internal SMART test = 17

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   1317426148        0         0  1317426148          0     500712.718           0
write:         0        0         1         1          1     241863.552           0

Non-medium error count:        7

disk 5


=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     40 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 24409:58
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  101
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1072
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 1702257846
  Blocks received from initiator = 2252324992
  Blocks read from cache and sent to initiator = 3802460774
  Number of read and write commands whose size <= segment size = 458364490
  Number of read and write commands whose size > segment size = 1565629

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 24409.97
  number of minutes until next internal SMART test = 53

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   189751524        0         0  189751524          0     693067.712           0
write:         0        0         4         4          4      51999.200           0

disk 6

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     44 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 37500:56
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  116
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1661
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 1907351399
  Blocks received from initiator = 992475600
  Blocks read from cache and sent to initiator = 2311324172
  Number of read and write commands whose size <= segment size = 428214578
  Number of read and write commands whose size > segment size = 475775

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 37500.93
  number of minutes until next internal SMART test = 52

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   130392425        0         0  130392425          0     517829.838           0
write:         0        0         7         7          7      41885.166           0
verify:  9526670        0         0   9526670          0        156.079           0

Non-medium error count:       15

[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No Self-tests have been logged

disk 7

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     32 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 3221:39
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  145
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  2986
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 3642878040
  Blocks received from initiator = 4158124336
  Blocks read from cache and sent to initiator = 504539258
  Number of read and write commands whose size <= segment size = 525069723
  Number of read and write commands whose size > segment size = 3140525

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 3221.65
  number of minutes until next internal SMART test = 9

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   134433398        0         0  134433398          0      76631.949           0
write:         0        0         2         2          2      46116.577           0
verify:        6        0         0         6          1          0.000           1

Non-medium error count:       19

No Self-tests have been logged

disk 8


=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     39 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 3221:39
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  61
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  4494
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 1449338872
  Blocks received from initiator = 765594632
  Blocks read from cache and sent to initiator = 652611179
  Number of read and write commands whose size <= segment size = 518541897
  Number of read and write commands whose size > segment size = 2377487

Vendor (Seagate/Hitachi) factory information
  number of hours powere

disk 9


=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST6000NM0054 D5
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,122,432 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     42 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 5439:58
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3205797785        0         0  3205797785          0      54184.744           0
write:         0 18446744073709551615  18446744073709551615         0          0      62864.343           0
verify: 19918500        0         0  19918500          0        274.100           0

Non-medium error count:       29

disk 10

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              ST6000NM0034
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,126,016 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     45 C
Drive Trip Temperature:        60 C

Accumulated power on time, hours:minutes 3029:16
Manufactured in week XX of year XXXX
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  148
Specified load-unload count over device lifetime:  300000
Accumulated lo

disk 11


=== START OF INFORMATION SECTION ===
Vendor:               IBM-XIV
Product:              ST6000NM0054 D5
Revision:             XXXX
Compliance:           SPC-4
User Capacity:        6,001,175,122,432 bytes [6.00 TB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      <redacted>
Serial number:        <redacted>
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        <redacted>
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     34 C
Drive Trip Temperature:        65 C

Accumulated power on time, hours:minutes 40465:16
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   3915962011        0         0  3915962011          0     314866.978           0
write:         0 18446744073709551615  18446744073709551615         4          4     187429.692           0
verify: 2984967141        0         0  2984967141          0       6618.256           0

Non-medium error count:       20

SAS drives, eh? I don’t have much experience with them, but they seem to have “non-medium error counts”… which google-fu suggests aren’t caused by the drives themselves but are cable or controller related.

How are these drives connected? If we’re really lucky this could just be a faulty wire.
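
To compare those counters across disks without reading every report by eye, here is a minimal Python sketch. It only pulls two fields out of text that looks like the smartctl output posted above; the sample excerpts are copied from the disk 3 and disk 4 reports, and the function names are my own, not part of smartctl.

```python
import re

# Excerpts copied from the smartctl reports posted above (disk 3 and disk 4).
SAMPLE_REPORTS = {
    "disk 3": """
Elements in grown defect list: 3
Non-medium error count:       81
""",
    "disk 4": """
Elements in grown defect list: 0
Non-medium error count:        7
""",
}


def read_counter(report, label):
    """Return the integer that follows 'label:' on its own line, or None if absent."""
    pattern = rf"^{re.escape(label)}:[ \t]*(\d+)[ \t]*$"
    match = re.search(pattern, report, re.MULTILINE)
    return int(match.group(1)) if match else None


def summarize(reports):
    """Map each disk name to (grown defects, non-medium errors)."""
    return {
        name: (
            read_counter(text, "Elements in grown defect list"),
            read_counter(text, "Non-medium error count"),
        )
        for name, text in reports.items()
    }


if __name__ == "__main__":
    for name, (defects, non_medium) in summarize(SAMPLE_REPORTS).items():
        print(f"{name}: grown defects={defects}, non-medium errors={non_medium}")
```

A report that lacks one of the lines (the disk 5 paste has no non-medium line, for example) simply yields None for that field.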


There is a high chance that it is controller related.

But at the same time, your setup could be my first real world example of something I call the “bad batch problem”.

You have 12 disks, all from the same vendor.
All from the same drive family.
They are what, 10 years old?
We know that they ran for over 7y.
And your replacement disk has 5y runtime.

According to Backblaze, the bathtub curve is not entirely true anymore, but there is still a huge spike correlated to age.

So drives don’t fail for X years, and then they start to fail close to each other. If your drive pool is mostly unused otherwise, the failure rate gets accelerated even more, because when the resilver starts, there is stress on the disks.

Right now three drives have faulted. Your pool is on its last legs.
I hope you can resilver one, before another one fails.
I would insert a new drive, and by new I mean brand new.
If possible insert the drive into an empty slot.

Even if your pool does survive this, IMHO your pool is a ticking time bomb and you should replace all drives.

Having 6 mirrors where one drive is brand X and the other one is brand Y, is still a little risky, but IMHO less risk than a 12 wide RAIDZ3 consisting of ultra old drives, all from the same brand, all with the same runtime.

In my mirror example, all 6 drives from the same vendor can go down at the same time and the pool is still fine.
In your RAIDZ3, if only 4 out of the 11 drives die at the same time, your pool is gone.
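
To put rough numbers on that comparison, here is a small Python sketch. It assumes failures hit disks uniformly at random, which is a simplification of my own: the bad batch argument above is precisely that failures are correlated by brand and age. The function names are made up for illustration.

```python
from math import comb


def mirror_survival(pairs, failures):
    """Chance that a pool of two-way mirrors survives `failures` random disk losses.

    The pool survives only if no mirror loses both disks, so every failed
    disk has to sit in a different mirror.
    """
    if failures > pairs:
        return 0.0
    # Pick which mirrors are hit, then which side of each one fails.
    safe_patterns = comb(pairs, failures) * 2 ** failures
    return safe_patterns / comb(2 * pairs, failures)


def raidz_survival(parity, failures):
    """A single RAIDZ vdev survives any pattern of up to `parity` failures."""
    return 1.0 if failures <= parity else 0.0


if __name__ == "__main__":
    for k in range(1, 7):
        print(f"{k} failed: 6 mirrors {mirror_survival(6, k):.3f}, "
              f"RAIDZ3 {raidz_survival(3, k):.0f}")
```

Under that random-failure assumption, RAIDZ3 is safe for any 3 failures and lost at 4, while six mirrors survive 4 failures in 240 of the 495 possible patterns (roughly 48%), including the case where the failed disks are all the same brand, one per mirror.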


yes.

Keep in mind though, that in your example, if both disks from one of the mirrors die before you can remove the vdev from the pool, your whole pool is gone. I would of course prefer a new pool (and I mean new disks) like you described over a 10 year old RAIDZ3 pool, but if I had to choose how to configure a new pool using 12 new disks, be they from the same or mixed vendors, I’d still choose RAIDZ3 (at least if I don’t need the performance boost from the mirrors).

@Sara Do you have any recommendations for drives I should use?

I have also been curious, can I mix SATA and SAS drives?

Also, is there any way to reduce the intensity of the scrubbing? I’d hate to replace a disk and have the bomb go off…

Yes. You can mix any drives; you could have NVMe, USB, SAS and SATA all in the same array.

I had a similar issue a while ago and it was a faulty connector to the HBA. Changing the cable fixed it.

It fixed the drive failing issue?

Reseating fixed it, but ultimately the cable had a broken clip and was replaced. Fewer detected errors than you, but most of my array is SATA so silent corruption was missed until scrubbed.

It is possible, although less likely, that you and I have/had an overheating controller. Are you using an HBA?

Not saying it isn’t your drives, but a dodgy link seems more likely than mass failure.

@mmentzewD5PmP

First of all, Welcome to the TrueNAS Forums.

Second, you do not need to sanitize as much of the data as you did; if you have some new drives, maybe remove only the drive serial numbers.

Third, I didn’t see you define your hardware. This would be a very good start, otherwise we assume you know the status of all your hardware, which is not good.

As @Sara said, this could be the SAS controller at fault, or a backplane, or… What controller card do you have, and is there lots of airflow to keep it cool? Some people build a system and stuff a server card designed for high, controlled airflow into a non-server case, and these things overheat easily.

Looking at the drive data, I see no temperature issues reported. I do see that these drives are older drives. And if the drive temps do go up significantly during a RESILVER or SCRUB operation, then you are not cooling the drives properly. This is another airflow issue.
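
If you want to watch that during a RESILVER, a minimal Python sketch like the one below computes how far each drive is from its own trip temperature. It only parses the two temperature lines as they appear in the smartctl output posted above; the sample is copied from the disk 10 report and the function name is hypothetical.

```python
import re

# Temperature lines copied from the disk 10 report posted above.
DISK10 = """
Current Drive Temperature:     45 C
Drive Trip Temperature:        60 C
"""


def temperature_headroom(report):
    """Return (current, trip, headroom) in degrees C, or None if a line is missing."""
    current = re.search(r"Current Drive Temperature:\s*(\d+)\s*C", report)
    trip = re.search(r"Drive Trip Temperature:\s*(\d+)\s*C", report)
    if not (current and trip):
        return None
    cur, lim = int(current.group(1)), int(trip.group(1))
    return cur, lim, lim - cur


if __name__ == "__main__":
    print(temperature_headroom(DISK10))
```

Feeding it a fresh report every few minutes and watching the headroom shrink would show whether the drives heat up under load.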

Do not underestimate the need for good airflow, it is critical. And I’m not saying this is your sole problem but I suspect it is part of a problem, just a very wild guess.

So provide the hardware details. And are the drives Refurbished? Did you run Badblocks on them before you built your NAS?

Hardware: PowerEdge R720xd

sas2ircu 0 display
LSI Corporation SAS2 IR Configuration Utility.
Version 20.00.00.00 (2014.09.18)
Copyright (c) 2008-2014 LSI Corporation. All rights reserved.

Read configuration has been initiated for controller 0
------------------------------------------------------------------------
Controller information
------------------------------------------------------------------------
  Controller type                         : SAS2008
  BIOS version                            : 7.39.02.00
  Firmware version                        : 20.00.07.00
  Channel description                     : 1 Serial Attached SCSI
  Initiator ID                            : 0
  Maximum physical devices                : 255
  Concurrent commands supported           : 3432
  Slot                                    : Unknown
  Segment                                 : 0
  Bus                                     : 2
  Device                                  : 0
  Function                                : 0
  RAID Support                            : No


    vendor     = 'Broadcom / LSI'
    device     = 'SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon]'
    class      = mass storage
    subclass   = SAS

Now. I bought a replacement disk from Amazon and used that earlier last week and that worked fine. The second replacement disk faulted earlier last month. I built a test PC to test the faulted disks. I run a long SMART test and badblocks on them. If any of them shows a bad sector during a badblocks test, even if I get great results from a long SMART test, I discard the drive immediately.
Hence my surprise and fear when this disk, which showed very few errors in a SMART test and passed a badblocks test, failed. I admit that while resilvering the drive last week I noticed that the cooling wasn’t the best. The drives were reaching temperatures close to 60C.
I didn’t want to make this mistake during this recent resilvering, so I made sure that there was adequate cooling and it didn’t go over 50C; however, another drive faulted.

Since I used to buy these on Amazon… I am thinking of replacing them with these

Your LSI card needs LOTS of airflow. LOTS! or it will overheat and cause you some problems.

Your drives getting near 60C is also a sign of poor cooling.

Question: Are you using the correct firmware for your SAS2008 card for TrueNAS? I believe the command is sas2flash -listall, which should display the data, but I think “18.00.00.00W” is IR mode; also, the firmware is not up to date. The IR firmware will likely cause you problems, as you would have a RAID card doing what ZFS is doing.

LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2008(B2)

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2008(B2)     20.00.07.00    14.01.00.08    07.39.02.00     00:02:00:00

        Finished Processing Commands Successfully.
        Exiting SAS2Flash.

I have corrected my initial post. During the recent resilvering that failed I used an air mover. There are even 2 A/C units in the server room.

Sure, I just think that this risk is smaller than 4 drives failing out of 11 all from the same brand and age.

Anything that comes with a 5y warranty, is not SMR and, if possible, is filled with helium. IMHO helium drives (all drives 18TB and bigger) are quieter and produce less heat.


Thanks for posting that. It does in fact look like you are running the correct firmware. One more thing ruled out.

But still, the cooling of the HBA is very important. I don’t know if you have an IR thermal gun or device, but these are handy to check the temperature of the components.

What is the current status of your system? Is it resilvering a drive now, are you waiting on a few new drives? Do you have your important information backed up? Is the system just sitting idle doing some normal server stuff?

Now that I have had time to read all the SMART data you provided, here is my conclusion:
First, I wish you didn’t remove the entire serial number. It just makes it more difficult to track which drive is which.

Disk 1: I see nothing wrong here.
Disk 2: This drive has 1 bad sector (Grown Defects). I would not worry too much about this yet, but keep an eye on this. If it starts to increase, replace the drive.
Disk 3: This drive has 3 bad sectors (Grown Defects). I would not worry too much about this yet, but keep an eye on this. If it starts to increase, replace the drive.
Disk 4: I see nothing wrong here.
Disk 5: I see nothing wrong here.
Disk 6: I see nothing wrong here.
Disk 7: I see nothing wrong here.
Disk 8: It looks like the end of the data was cut off. Run smartctl -x /dev/??? and post it if there is different data.
Disk 9: I see nothing wrong with this drive, other than the enormous number of Write Errors. Check your Data Cable.
Disk 10: can you run smartctl -x /dev/??? and if the output has extra information, post it.
Disk 11: I see nothing wrong with this drive, it has over 40K hours but that is fine.
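
A side note on the write rows for Disk 9 and Disk 11: the number printed there, 18446744073709551615, is exactly 2^64 - 1, and the same two columns show <max> on Disks 1 to 3. That may mean the field is a placeholder for an all-ones value rather than a real count of errors; this is an assumption on my part, not something the smartctl output states. A quick Python check of the arithmetic:

```python
# Value printed in the write row for disk 9 and disk 11 in the posts above.
REPORTED = 18446744073709551615

# Largest value an unsigned 64-bit counter can hold (all 64 bits set).
U64_MAX = 2**64 - 1


def is_u64_max(value):
    """True if the value is the all-ones 64-bit pattern rather than a plausible count."""
    return value == U64_MAX


if __name__ == "__main__":
    print(is_u64_max(REPORTED))  # prints True
```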

All of this brings me to the conclusion that you either have a bad HBA or an overheating HBA. (HBAs need airflow, not just a cool place. The amount of air is what removes the heat. And I haven’t heard you say whether there is great airflow, only that it is an air conditioned space.)

Or the SAS data cables are not very good.

Question: Your SAS cables. It appears that the last 6 drives have no errors, and Disk 5 as well. Is there any commonality in the drive cabling? I’m looking to see if possibly the failing drives are all on one SAS plug.

How are your drives physically connected to the SAS cable? Do you have a backplane the drives plug into? If yes, also check the power cable to the backplane.

Solutions to these:
1 - Ensure great Airflow across the HBA. Put a fan on it, many people do if it is not in a server chassis.
2 - If you have them, replace the data cables. If you are using a backplane, that too could be a problem. If you cannot replace the cables right now, then swap them around. Move the connector for Disk 1 to Disk 6, 2 to 7,… and ensure you disconnect each side of the cable and reconnect it. This may allow for a better contact with the data connections.
3 - After all that is done, “IF YOU HAVE A BACKUP OF YOUR DATA”, run a scrub. Since you only have a RAIDZ3 and you have 3 Faulted drives, you should be cautious about stressing the system.
4 - One thing you can do: zpool clear Tank1 to clear the errors. This will just reset the values and as more errors occur, you will see them. Also, if you do disconnect and reconnect every data cable, when you start to RESILVER, your READ values will hopefully not increase for any drive.
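
To see whether the READ values climb again after the clear, you can compare two saved copies of the zpool status output. Below is a minimal Python sketch of that comparison. The "before" snapshot is a hypothetical all-zero state right after a clear, the "after" rows are copied from the first post, and the function names are my own. Counters that zpool prints in abbreviated form (such as 1.2K) would be skipped by this sketch.

```python
# Hypothetical snapshot taken right after `zpool clear Tank1` (all counters zero).
BEFORE = """
            disk01      ONLINE       0     0     0
            disk05      ONLINE       0     0     0
"""

# Rows copied from the zpool status output in the first post.
AFTER = """
            disk01      FAULTED    136     0     1  too many errors
            disk05      ONLINE       0     0     0
"""


def parse_error_counts(status_text):
    """Map device name -> (read, write, cksum) from zpool status device rows."""
    counts = {}
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: name state read write cksum [notes...]
        if len(fields) >= 5 and all(f.isdigit() for f in fields[2:5]):
            counts[fields[0]] = tuple(int(f) for f in fields[2:5])
    return counts


def read_increases(before, after):
    """Return {device: growth} for devices whose READ counter grew."""
    old, new = parse_error_counts(before), parse_error_counts(after)
    growth = {}
    for dev, (reads, _writes, _cksum) in new.items():
        previous = old.get(dev, (0, 0, 0))[0]
        if reads > previous:
            growth[dev] = reads - previous
    return growth


if __name__ == "__main__":
    print(read_increases(BEFORE, AFTER))
```

If the same devices keep showing up after the cables have been reseated or swapped, that points at the drives or the bays rather than the cable.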

Best of luck to you.


So I had a similar issue for months and it was driving me crazy for weeks. Eventually in my case it was caused by routing of SAS cables.
The SAS cables were running right next to the power feed cables for the motherboard so under very high load it was enough to cause issues and random disk drop outs.
In my case it may have been due to the cheap AliExpress cables I was using and the lack of shielding on the cables themselves.
I haven’t seen a single failure or transient error since I rerouted the drive cable runs
