Problem/Justification
When a disk dies, or accumulates so many errors that TrueNAS disconnects it from the pool, a running scrub puts even more stress on the remaining disks. If there is no redundancy left, that increases the risk of data loss.
Impact
Less strain on a pool that might not have any redundancy left.
User Story
A few days ago, one of my disks died during a scrub. To my surprise, the scrub was not paused or stopped. It just carried on, with a much longer ETA.
In my opinion, when a pool gets into such a state during a scrub, that scrub should at the very least be paused, and all further scheduled scrubs of that pool should be paused as well.
I like the intention of this feature request, however the trigger for stopping a scrub will need to be more exact than “Degraded”. If a scrub throws one checksum error, the pool is degraded. That does not mean the drive has failed, although it might.
I suggest a different trigger, for example an excessive error count of 50 or more. Or maybe the drive being “Detached”? I really don’t know what the best trigger would be. Once that is sorted out, I would vote for it. Make sure you vote for it as well.
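To make the trigger idea a bit more concrete, here is a rough Python sketch of what such a check could look like. It is only an illustration: `DiskStatus`, `should_pause_scrub`, and the set of "gone" states are names and choices I made up, not anything from the TrueNAS middleware, and the threshold of 50 is just the example number from above. The point is that a single checksum error on a disk that is still in the pool would not pause anything.

```python
from dataclasses import dataclass

# Assumption: device states (as shown by `zpool status`) in which the pool
# is no longer using the disk. Which states should count is open to debate.
GONE_STATES = {"FAULTED", "REMOVED", "UNAVAIL", "OFFLINE"}

# Assumption: the "50 or more" example value from the post above.
ERROR_THRESHOLD = 50


@dataclass
class DiskStatus:
    """Hypothetical snapshot of one disk's state and error counters."""
    name: str
    state: str
    read_errors: int = 0
    write_errors: int = 0
    checksum_errors: int = 0


def should_pause_scrub(disks, threshold=ERROR_THRESHOLD):
    """Return True if any disk is gone or has reached the error threshold."""
    for disk in disks:
        if disk.state in GONE_STATES:
            return True
        total = disk.read_errors + disk.write_errors + disk.checksum_errors
        if total >= threshold:
            return True
    return False
```

With this logic, a disk reporting one checksum error while still attached would let the scrub continue, while a faulted or removed disk would pause it.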
Thanks for the feedback! I changed my feature request to clarify that this should only happen when a disk dies or is removed from the pool by TrueNAS due to too many errors.
I voted for this feature request as well.
I feel it is important that once a drive has been recognized as defective, even slightly defective, any additional wear is avoided. That means scrubs should be stopped until:
A) The affected drive serial number has been replaced with a different drive serial number, indicating the drive was replaced, then scrubs can be performed in the future without further user interaction.
B) The administrator clicks a button to tell it to resume scrubbing regardless of the drive errors, for this one time. I say for this one time because you would not want to cover up a real problem for any length of time.
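The two conditions above could be sketched roughly like this in Python. This is a hypothetical illustration only: `ScrubGate`, `allow_once`, and `may_scrub` are invented names, and I am assuming the system can report the serial number of whatever disk currently sits in the affected position (or `None` if the slot is empty).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrubGate:
    """Hypothetical gate that blocks scrubs while a flagged disk is present."""
    bad_serial: str              # serial number of the disk flagged as defective
    override_once: bool = False  # set when the administrator clicks "resume"

    def allow_once(self) -> None:
        """Condition B: the administrator permits exactly one scrub."""
        self.override_once = True

    def may_scrub(self, serial_in_slot: Optional[str]) -> bool:
        """Decide whether a scrub may start right now."""
        # Condition A: a different disk now sits in the affected position.
        if serial_in_slot is not None and serial_in_slot != self.bad_serial:
            return True
        # Condition B: a one-time override, consumed as soon as it is used.
        if self.override_once:
            self.override_once = False
            return True
        return False
```

The override resets itself after one use, so the next scheduled scrub is blocked again unless the drive has actually been replaced.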