Replacing drives procedure - RAIDZ1 but two drives impacted

Hi folks,
I have built my first TrueNAS server and am generally super happy.
Unfortunately, I built the RAIDZ1 from four disks (3x 3TB, 1x 4TB) that came out of a Synology that had been running for a while. The short SMART tests showed no errors, but the extended ones failed for two of the 3TB drives. The pool is not degraded yet, but obviously I want to replace them.

Now I am not sure which one to start with, and also whether I can just buy an 8TB drive and bring it into the array or whether I should only get a 3TB drive (would the larger drive give me additional space, or will the pool only ever use 4x 3TB in a RAIDZ1?).
Here is the SMART report for sde:

truenas_admin@truenas-koe[~]$ sudo smartctl -a /dev/sde
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N2AYHKD4
LU WWN Device Id: 5 0014ee 2b6e3f6a9
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 14 09:23:41 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 113) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (39540) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 397) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       471
  3 Spin_Up_Time            0x0027   184   176   021    Pre-fail  Always       -       5775
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8445
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   021   021   000    Old_age   Always       -       58186
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       267
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       34
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       35348
194 Temperature_Celsius     0x0022   113   108   000    Old_age   Always       -       37
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       5

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       10%     57908         1559325720
# 2  Extended offline    Completed: read failure       10%     57860         1559325720
# 3  Extended offline    Completed: read failure       10%     57798         1559325760
# 4  Short offline       Completed without error       00%     57663         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

And here is the one for sdb:

truenas_admin@truenas-koe[~]$ sudo smartctl -a /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4N4JFRPZ9
LU WWN Device Id: 5 0014ee 20d6ee415
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database 7.3/5528
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 14 09:30:01 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                        the read element of the test failed.
Total time to complete Offline 
data collection:                (38880) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 390) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   180   174   021    Pre-fail  Always       -       5966
  4 Start_Stop_Count        0x0032   092   092   000    Old_age   Always       -       8266
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   031   031   000    Old_age   Always       -       51049
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       255
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       30
193 Load_Cycle_Count        0x0032   190   190   000    Old_age   Always       -       32288
194 Temperature_Celsius     0x0022   115   109   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     50765         153168120
# 2  Extended offline    Completed: read failure       90%     50655         153168120
# 3  Short offline       Completed without error       00%     50526         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
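
For anyone comparing the two reports, a minimal Python sketch that pulls the failed entries out of the self-test log might look like the following. The regular expression and the function names are my own assumptions based on the column layout shown above; they are not part of smartmontools.

```python
import re

# Self-test log lines copied from the sde report above.
SDE_LOG = """\
# 1  Extended offline    Completed: read failure       10%     57908         1559325720
# 2  Extended offline    Completed: read failure       10%     57860         1559325720
# 3  Extended offline    Completed: read failure       10%     57798         1559325760
# 4  Short offline       Completed without error       00%     57663         -
"""

# Assumed layout: columns are separated by two or more spaces,
# while single spaces occur inside the test name and the status text.
LINE = re.compile(
    r"^#\s*(?P<num>\d+)\s+"
    r"(?P<test>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test log line that matches the assumed layout."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line)
        if match is None:
            continue  # skip headers and anything that does not look like an entry
        lba = match.group("lba")
        entries.append({
            "test": match.group("test"),
            "status": match.group("status"),
            "hours": int(match.group("hours")),
            "lba": None if lba == "-" else int(lba),
        })
    return entries


def failed_lbas(entries):
    """Return the distinct first-error LBAs of all tests that ended in a read failure."""
    return sorted({
        e["lba"] for e in entries
        if "read failure" in e["status"] and e["lba"] is not None
    })


if __name__ == "__main__":
    print(failed_lbas(parse_selftest_log(SDE_LOG)))
```

For the sde log this reports two distinct LBAs (1559325720 and 1559325760), so the three failed extended tests did not all stop at exactly the same sector.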

What’s the best way to proceed?
Thanks
Chris

I also just noticed that the 3TB drives are CMR and the 4TB drive is SMR. Is that a problem?

Yes - SMR is a problem as it has bulk write performance issues.

Before advising on the best approach, does your system have at least one, or possibly 3 or 4, SATA ports free, or is it fully populated with 4 drives?

Hi @Protopia - thanks for your reply!
So SMR is a performance issue but not relevant for data integrity/loss?

Yes, I do have 6 SATA slots!
Thanks a bunch!

Ok - so 6 SATA slots means that you have enough slots to install one or two new disks alongside the existing four, but not enough to run your existing pool and create a new pool to migrate to at the same time.

The reason I asked is that with old (and therefore less reliable) drives, you should really be thinking about RAIDZ2 rather than RAIDZ1. And for drives > 6TB or 6+ drives you should also be thinking about RAIDZ2. So since you need to buy at least 3 new drives (for 2 failing drives and 1 SMR drive), it might make sense to build a new pool and migrate your data rather than replacing drives one by one - but you would need 8 SATA slots to do that.

(I can think of a way to achieve this technically with only 6 slots, however it would involve some risks so I cannot recommend it. And you would need to buy 4 new drives now rather than only 2, so it might not be affordable either.)

So it looks like you are stuck with RAIDZ1 and we need to look at a drive replacement strategy…

  1. You can mix drives of different sizes within a vDev, but each drive only contributes as much space as the smallest drive in it - so buying an 8TB drive won’t enable you to replace 2x 3TB drives.

  2. If you are buying new drives it might make sense to buy slightly bigger ones, because when you replace the remaining old drives with more of the bigger drives you will be able to use the extra space.

  3. The SMR drive is (as you say) fine from an integrity perspective, and for a low write NAS it will be fine in normal usage too. But if for any reason that drive loses integrity without it actually failing and you want to resilver it, then the resilver will take forever and a day.

  4. With a drive replacement strategy it is lower risk to install the replacement drive alongside and do the ZFS Replace than to pull a drive and put the replacement in its place. This is especially true when you have two partially failing drives - if their failed sectors are from different records, then you still have a complete copy, but if you pull one of the drives to replace it, then any failed sectors on the 2nd failing drive will not have redundancy and you will end up losing data.
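
To make point 4 concrete, here is a deliberately simplified Python model. It assumes every record is spread across all drives in the vDev and that a record is recoverable as long as no more members are unreadable than the parity level (1 for RAIDZ1). The record names `r1` and `r2` and the function name are hypothetical, chosen only for illustration.

```python
def lost_records(bad_records_by_drive, pulled_drives=(), parity=1):
    """Return the records that can no longer be reconstructed.

    bad_records_by_drive maps a drive name to the set of records that
    have an unreadable sector on that drive. A pulled drive counts as
    unreadable for every record.
    """
    all_bad = set()
    for records in bad_records_by_drive.values():
        all_bad |= set(records)

    lost = []
    for record in sorted(all_bad):
        unreadable = set(pulled_drives)
        unreadable |= {
            drive for drive, records in bad_records_by_drive.items()
            if record in records
        }
        # More unreadable members than parity means the record is gone.
        if len(unreadable) > parity:
            lost.append(record)
    return lost


# Two failing drives whose bad sectors hit different (hypothetical) records.
bad = {"sde": {"r1"}, "sdb": {"r2"}}

if __name__ == "__main__":
    print("replace alongside:", lost_records(bad))
    print("pull sde first:   ", lost_records(bad, pulled_drives=["sde"]))
```

With all four old drives still attached, each damaged record is missing on only one drive and can be rebuilt. Once sde is pulled, `r2` is missing on both sde and sdb, which is more than RAIDZ1 can tolerate.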

So, you need to buy at least 2x drives of at least 3TB each. If it were me, I would be buying 4TB or 6TB drives with a view to future expansion.
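
As a rough back-of-the-envelope check on the sizing advice, the sketch below computes raw usable capacity as (number of drives minus parity) times the smallest drive. This ignores ZFS metadata, padding and TB/TiB differences, so treat the numbers as approximate; the function name is hypothetical.

```python
def raidz_usable_tb(drive_sizes_tb, parity=1):
    """Approximate raw usable space of a single RAIDZ vDev in TB.

    Every drive contributes only as much as the smallest member,
    and `parity` drives' worth of space goes to redundancy.
    """
    count = len(drive_sizes_tb)
    if count <= parity:
        raise ValueError("need more drives than parity devices")
    return (count - parity) * min(drive_sizes_tb)


if __name__ == "__main__":
    print(raidz_usable_tb([3, 3, 3, 4]))  # the pool as built
    print(raidz_usable_tb([8, 3, 3, 4]))  # one 3TB swapped for an 8TB
    print(raidz_usable_tb([4, 4, 3, 4]))  # both failing 3TB drives swapped for 4TB
    print(raidz_usable_tb([4, 4, 4, 4]))  # last 3TB drive replaced as well
```

The first three layouts all come out at about 9TB, because one 3TB drive is still the smallest member. Only the last layout, with every drive at 4TB, rises to about 12TB.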

  1. Buy 2x new drives and install them in slots 5 and 6 and reboot.
  2. Go into the UI Storage page for your pool, click on the 1st failing drive, click Replace, select the first of the new drives and wait for the replace (resilver) to complete. Then repeat the process with the 2nd failing drive and the 2nd new drive.
  3. Power down, remove the failed drives and reboot.

This part isn’t necessary; if you have the ports (which it sounds like OP does), you can run multiple replacements simultaneously.

You’ll just need to manually expand the pool, because iX has made nonsensical changes re: pool auto-expansion in SCALE.


ix + non-sensical? Surely not!


@Protopia Thanks so much for the detailed how to!
Just ordered two 4TB drives as you suggested and will follow along. SMR drive will hopefully last and be replaced next.

Makes sense - will add 6 drives :crossed_fingers: