Pool decline and getting cold feet

Hi all,

A few years ago, I built a NAS to replace my aging QNAP, but due to life (family, kids), the new system ended up mostly idle for several years. It consists of:

  • Supermicro X11SSH-F
  • Intel(R) Xeon(R) CPU E3-1245 v6 @ 3.70GHz
  • 64GB ECC RAM
  • Mirrored 240GB SSD boot pool
  • 6× Seagate IronWolf 4TB in RAIDZ2 (data pool)
  • All drives connected to the onboard C236 controller

Originally installed with FreeNAS, later upgraded to TrueNAS CORE, and recently clean-installed with TrueNAS SCALE (now CE). I reconfigured the pools, but haven’t stored any real data yet.

After the first scrub (is this the correct term?), ZFS reported errors on 3 of the 6 HDDs. A second scrub showed different numbers, and a fourth disk joined the list. This makes me seriously hesitant to put the system into production.

Important context:

  • The system was properly maintained with updates.
  • It ran idle but powered on, in a stable environment (~35-40°C drive temps).
  • No abnormal vibrations, power issues, etc.
  • I have a solid offline backup strategy (cold storage HDDs on rotation).

Now I’m wondering:

  • Are such errors to be expected after a few years, even if the drives were barely used (Power_On_Hours now corresponds to almost 4.5 years)?
  • Could this indicate that these disks are degrading and should be proactively replaced before going into production?
  • Is it common for several disks to fail like this around the same time?
  • Should I also suspect other hardware such as the SATA controller or the cabling (or is that not the case with this type of error)?

I have some spare Seagate IronWolf Pro ST4000VNA06 and WD Red Pro WD4005FFBX drives ready if replacement is advised — but I’d prefer to understand the failure pattern before acting, especially since the ZFS health status consistently shows a green check mark and a total of 0 errors. I’m having a hard time reconciling this.
I also now have an LSI SAS9300-8i which I could use.

I’d like to get a bit more of a feel for and experience with this situation, so that I can soon entrust my data to this TrueNAS system with peace of mind.

Thanks in advance for your insights!

New alerts:

  • Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors.
  • Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors.
  • Device: /dev/sde [SAT], 8 Currently unreadable (pending) sectors.
  • Device: /dev/sde [SAT], 8 Offline uncorrectable sectors.

Current alerts:

  • Device: /dev/sdc [SAT], 8 Currently unreadable (pending) sectors.
  • Device: /dev/sdc [SAT], 8 Offline uncorrectable sectors.
  • Device: /dev/sdd [SAT], 8 Currently unreadable (pending) sectors.
  • Device: /dev/sdd [SAT], 8 Offline uncorrectable sectors.
  • Device: /dev/sde [SAT], 8 Currently unreadable (pending) sectors.
  • Device: /dev/sde [SAT], 8 Offline uncorrectable sectors.
  • Device: /dev/sdg [SAT], 32 Currently unreadable (pending) sectors.
  • Device: /dev/sdg [SAT], 32 Offline uncorrectable sectors.

sudo smartctl -a /dev/sdg

=== START OF INFORMATION SECTION ===
Model Family: Seagate IronWolf
Device Model: ST4000VN008-2DR166
Serial Number: ZGY5VBLC
LU WWN Device Id: 5 000c50 0c326fb2c
Firmware Version: SC60
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5980 rpm
Form Factor: 3.5 inches
Device is: In smartctl database 7.3/5528
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Oct 1 08:40:34 2025 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 581) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 615) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x50bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   078   064   044    Pre-fail  Always       -       59287336
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       113
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       968311660
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       38033 (61 244 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       82
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   052   040    Old_age   Always       -       29 (Min/Max 21/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       123
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       418
194 Temperature_Celsius     0x0022   029   048   000    Old_age   Always       -       29 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       32
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       32
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       37714h+10m+23.806s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       15608177768
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1642305064

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Short offline       Completed without error        00%            38032  -
  2  Extended offline    Completed: read failure        20%            36752  -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

The above only provides legacy SMART information - try ‘smartctl -x’ for more

You said ZFS reported errors on 3 of the 6 HDDs. Can you clarify how ZFS reported those errors? Please share the output of sudo zpool status.

I am asking because what you posted is a list of TrueNAS alerts, which are based on SMART data - they are not based on ZFS. Drive letters tend to jump around after reboots, so I can’t tell how many drives are actually affected.

2 Likes

I wouldn’t suspect the SATA controller or cabling if all you have is pending sectors. Those are reported by the hard drives themselves - that data shouldn’t be susceptible to a bad controller or cabling.

2 Likes

Pending/uncorrectable sectors are bad sectors. These are not ZFS errors but hardware failures—on three drives at the same time. Bad batch, or drives arriving at the end of their life span.
Replace each and every one of the failing drives as soon as possible—and expect that the remaining drives may develop issues sooner rather than later.
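
For reference, a rough sketch of what a replacement looks like from the CLI, with placeholder pool and disk names (on TrueNAS you would normally do this from the Storage UI, which handles the partitioning for you):

sudo zpool offline <pool> <old-disk-id>          # take the failing member offline (placeholder names)
# physically swap in the new disk, then:
sudo zpool replace <pool> <old-disk-id> /dev/disk/by-id/<new-disk>
sudo zpool status -v <pool>                      # watch the resilver until it completes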

2 Likes

IME 5 years is about average for a drive replacement cadence. All these SMART errors mean is that the drive couldn’t read a particular sector (pending means the drive hasn’t yet fully confirmed total loss of the physical sector; uncorrectable means the data itself is unreadable).

Most likely the drives ended up idling the heads over a particular area and wore out that section of the platter. It’s theoretically possible to try recovering it if you go to the effort of identifying the affected sector and its neighbours and manually writing over them. This will cause data errors on the next scrub, so obviously only do one drive at a time. Alternatively, swap a drive out and just run the wipe function to overwrite the whole disk and see if the drive releases the pending-sector state.
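
If you want to experiment with that, here is a rough sketch of the manual approach (the LBA below is a made-up placeholder, and the write-sector step destroys whatever is in that sector, so only do this with full redundancy or on a drive you are about to retire):

sudo smartctl -t long /dev/sdg                     # a failing extended test logs LBA_of_first_error
sudo smartctl -l selftest /dev/sdg                 # read the failing LBA from the self-test log
sudo hdparm --read-sector 123456789 /dev/sdg       # 123456789 is a placeholder LBA; confirm it is unreadable
sudo hdparm --write-sector 123456789 --yes-i-know-what-i-am-doing /dev/sdg   # force a rewrite/remap of that sector
sudo smartctl -A /dev/sdg                          # check whether Current_Pending_Sector has dropped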

Either way, these drives are old and should be swapped out. You can toss them in other systems that ignore SMART, but until the situation is resolved you’ll get those alerts every day when the SMART daemon checks the disk again (ask me how I know).

If they were from the same batch (pretty common if you buy them all at the same time) and installed at the same time, then yes, this is a kind of legendary problem told in the folklore of NAS adventures, especially when it gets so bad that a drive dies and you try replacing it only for another to die during the replacement (and your pool gets lost because you exceeded the available redundancy protection).

I count myself among those who learned this lesson the hard way. :sweat_smile:

1 Like

@bacon You’re right, it is indeed about the S.M.A.R.T. data rather than the ZFS data:

pool: apps-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:01 with 0 errors on Sun Aug 24 00:00:03 2025
config:

    NAME                                    STATE     READ WRITE CKSUM
    apps-pool                               ONLINE       0     0     0
      7e3798e1-91aa-4cca-827c-0a31be495240  ONLINE       0     0     0

errors: No known data errors

pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:38 with 0 errors on Sat Sep 6 03:45:39 2025
config:

    NAME        STATE     READ WRITE CKSUM
    boot-pool   ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        sda3    ONLINE       0     0     0
        sdb3    ONLINE       0     0     0

errors: No known data errors

pool: data-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:01 with 0 errors on Sun Aug 24 00:00:03 2025
config:

    NAME                                      STATE     READ WRITE CKSUM
    data-pool                                 ONLINE       0     0     0
      raidz2-0                                ONLINE       0     0     0
        7f5f2575-bf49-4170-94f4-448d62ec5386  ONLINE       0     0     0
        031dea2f-a0ab-4ee6-88ab-48a5b9bb86c2  ONLINE       0     0     0
        2174db01-99e4-4c4e-98b2-b09c2ecf67a2  ONLINE       0     0     0
        625e6359-eed3-4dbb-b002-971cc5d2f54a  ONLINE       0     0     0
        3ef8004c-84a0-4822-825f-93b1019832bb  ONLINE       0     0     0
        c58aba4c-2f4e-4af9-9a5d-b161ea6a0f8a  ONLINE       0     0     0

errors: No known data errors

@Tsaukpaetra and @etorix so 5 years… In that case, I’ve had great luck with my 4x WD2002FYPS in my QNAP. These have been active since 2010.

I did indeed buy the six Seagate HDDs in one batch back then. A significant expense for me at the time. I’ll be buying future drives individually.
Are there any recommendations? Are the two drive types I currently have in stock (ST4000VN006/ST4000VNA06 and WD4005FFBX) suitable for my needs (24/7 availability with infrequent use in a home network)?
How can I initially test the new drives before putting them on the shelf as cold spares?

Thanks for the replies. It reassures me somewhat that the SMART errors are probably due to age and possibly from the same production batch.

I’ll continue my adventure.

I wouldn’t say 5 years is a normal time for HDDs (or at least enterprise-grade ones) to develop faults. Seagate Exos, WD Red Pro and Toshiba Cloudscale HDDs all carry 5-year warranties (or at least the models I’ve recently looked at do).
It could be planned obsolescence, of course, if they all failed right after the warranty ends, but I haven’t heard of that in enterprise HDDs yet.

I would say so. All CMR. Just be aware that the Seagate ST4000VNA06 is only 5400 rpm.

I’d first run long SMART tests on them, maybe badblocks as well, and then check the SMART data. If you have some space and time for it, you could also create a pool on them, store some data on it, and let them run for a few weeks. Then run a scrub at the end just to be sure. If all of that completes without errors, I’d be very confident in them.
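
Something along these lines, assuming /dev/sdX is the new, empty drive (badblocks -w is destructive, so never point it at a disk that holds data):

sudo smartctl -t long /dev/sdX          # extended self-test, roughly 10 hours on a 4TB drive
sudo smartctl -l selftest /dev/sdX      # once it finishes, confirm it completed without error
sudo badblocks -wsv -b 4096 /dev/sdX    # destructive write/read surface test over the whole disk
sudo smartctl -A /dev/sdX               # re-check the pending/reallocated/uncorrectable counters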

1 Like

Yeah, my case is much more cautious and probably more twitchy since I have a somewhat more error-prone environment. I try not to leave anything to chance with this.

Yes, having a planned replacement cadence is good; staggering purchases is usually the easiest and most natural way to do it.

My current routine depends on the landing target.
If it’s going to be a hot spare, I slot it into a bay and run a “wipe with random data” task from the UI (technically it doesn’t matter if it’s zeroes or random). Once done, I kick off a manual SMART long test. This usually takes about a day per 12 TB; if everything is good, add it to the appropriate pool as a spare.
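
The rough CLI equivalent of that last step, with placeholder pool and device names (the UI does essentially the same thing):

sudo zpool add <pool> spare /dev/disk/by-id/<new-disk>   # <pool> and <new-disk> are placeholders
sudo zpool status <pool>                                 # the disk should now show up under "spares"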

For a cold spare I just chuck it into a random PC, run Darik’s Boot and Nuke with a basic three-pass wipe, and toss it on a shelf when it’s done. It’s unusual for a drive to die while sitting unused; it’s almost always the powering down and up that’s hardest on them.

Those should be just fine. Sticking with slower drives typically means lower power draw and heat generation, which is important, and it seems like your environment is already stable enough for that to be a non-issue so long as you keep them moving.

2 Likes

In contrast, the drives in the OP are IronWolfs with 3-year warranties.

Yes, it is possible that multiple drives are reaching their end of life.

I agree with that argument solely for enterprise-grade drives. I have a bunch of 4TB WD Gold drives that I recycled from my workplace, and they are approaching 9 years of power-on time. The ones reaching 10 years are beginning to show bad sectors, and I’m replacing them gradually with WD Purple, since that’s the surveillance and entertainment media pool. I actually like the fact that the Purple drives run at least 5 °C cooler than the Red Pro drives in the same chassis - the major benefit of 5400 RPM drives when I don’t need crazy read/write performance for the use case.

For the OP though, IronWolf drives wearing out at around the 5-year mark is typical, in my experience. I have two NAS appliances: the master runs WD drives, and the slave (currently a Synology Rackstation that I’m wearing down until it dies) runs Seagate IronWolf drives relegated from the previous master NAS I had before I saw the light and went to TrueNAS. Bear in mind, the slave NAS does not operate 24/7, as it’s only used to sync files with the master on a nightly basis for backup, which lasts about one hour each night.

I run on the principle that if one brand (and hopefully one particular series of that brand) runs into a batch problem, then at least the other appliance will not be as likely to suffer the same consequences within a short timeframe. If a drive has a three year warranty, then my experience has been to witness issues occurring no earlier than five years of power-on time. That’s why I relegated those Ironwolf drives to slave NAS duties and moved up to Red Pro drives in the master NAS.

This is my first time using Red Pro drives, which have now been in service for five years, so it’ll be interesting to see how they fare over the next few years, hopefully longer, considering their five-year warranty. The Purple drives are completely new to me, as in just a few months, so I’m not sure what the outcome will be, but hopefully they’ll also see about five years, given they’re constantly being written to 24/7.

2 Likes