This resource was originally created by user: jgreco on the TrueNAS Community Forums Archive. Please DM this account or comment in this thread to claim it.
One of the problems with consumer-grade hard drives is that most of them will hang in the event that they run into an error, and will internally retry the operation, possibly for a minute or more. For a desktop PC, where redundancy does not exist, this is the correct course of action, because failure of a sector means loss of the data.
Enterprise class drives typically support the ability to limit the amount of time a drive wastes trying to recover data. Most of these drives are used in RAID arrays, and so in the event of a failure, the data can be recovered from parity. A drive encountering read errors cannot be allowed to hang for large amounts of time, because this stalls whatever the server is trying to do. So manufacturers include features to control the retries of failures.
For Western Digital, this is called TLER - Time-Limited Error Recovery.
For Seagate, it is called ERC - Error Recovery Control.
Samsung and Hitachi call it CCTL - Command Completion Time Limit.
Some people are confused and think that these features are only necessary for hardware RAID, or aren’t useful for software RAID. It is absolutely true that this is a very important feature for hardware RAID, because a hardware RAID controller is probably configured to deem a “hung” hard drive as failed and to place it in an offline or recovery status, which has many negatives associated with it. So you absolutely do want TLER/ERC/etc for a hardware RAID setup.
But what about ZFS?
If you’ve got a ZFS pool, and your underlying disk device appears to hang for a minute, you probably stop serving up data. This is likely to be bad behaviour for a filer. Unlike a hardware RAID controller, ZFS will typically wait for the command to complete, and if it is trying to read many sectors, this could take a very long time. So TLER/ERC/etc are also desirable properties for a ZFS system.
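To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes (my assumption, not a measured figure) that each unreadable sector costs one full recovery attempt and that the sectors are handled one after another. The 60-second value stands in for "a minute or more" on a drive without ERC, and the 7-second value for a typical ERC limit. `worst_case_stall` is just an illustrative helper name.

```shell
#!/bin/sh
# Rough worst-case stall, in seconds, for one request.
# $1 = number of unreadable sectors the request touches (assumed)
# $2 = seconds the drive spends on each sector before giving up (assumed)
worst_case_stall() {
    echo $(( $1 * $2 ))
}

worst_case_stall 10 60   # no ERC, about a minute per sector: prints 600
worst_case_stall 10 7    # ERC limited to 7 seconds: prints 70
```

Real drives and real workloads will differ, but the gap between ten minutes and about a minute is the point.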
We’ve been thrilled in recent years to see the addition of “NAS” class hard drives, which are essentially conventional consumer-grade hard drives that have firmware that defaults to supporting TLER/ERC.
You can verify that a drive has TLER/ERC turned on by probing it with smartctl.
Code:
smartctl -l scterc /dev/ada0
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled
That doesn’t have it.
Code:
smartctl -l scterc /dev/ada4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
That does, and it’s set to a typical 7 seconds. Further, the same command can be used to try to set ERC.
Code:
smartctl -l scterc,80,80 /dev/ada4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control set to:
           Read:     80 (8.0 seconds)
          Write:     80 (8.0 seconds)
Some hard drives may not come with TLER/ERC enabled by default but can have it turned on regardless. If you try this, power cycle the drive afterward and query it again to confirm that the setting sticks around. It’s hard to test for TLER/ERC working correctly without actually encountering a bad drive, however.
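If you have more than a couple of drives, checking them one at a time gets tedious. The following is a minimal sketch of how the check could be scripted. It assumes the output has the Read:/Write: layout shown in the examples above. The helper names `scterc_status` and `report_scterc` are made up for illustration and are not part of smartmontools. Anything that does not contain both Read: and Write: lines is simply reported as "unknown".

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of smartmontools.

# Classify `smartctl -l scterc` output read from stdin.
# Prints "enabled <read> <write>" (values in units of 100 ms),
# "disabled" if either timer is reported as Disabled,
# or "unknown" if no Read:/Write: lines were found.
scterc_status() {
    awk '
        $1 == "Read:"  { rd = $2 }
        $1 == "Write:" { wr = $2 }
        END {
            if (rd == "" || wr == "") {
                print "unknown"
            } else if (rd == "Disabled" || wr == "Disabled") {
                print "disabled"
            } else {
                print "enabled", rd, wr
            }
        }'
}

# Print one status line for each device path given as an argument.
report_scterc() {
    for dev in "$@"; do
        printf '%s: %s\n' "$dev" \
            "$(smartctl -l scterc "$dev" 2>/dev/null | scterc_status)"
    done
}
```

Usage would be something like `report_scterc /dev/ada0 /dev/ada4`, run as root. Running it again after a power cycle is an easy way to see whether a setting you applied actually stuck.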
[2015-02-10] : I note that we just picked up some Samsung ST2000LM003 2.5" 2TB drives which appear to allow TLER to be set, but the setting appears to do nothing and isn’t persistent. I happened to luck out in that a drive failed SMART testing with a bad sector and was therefore easily tested.
I’ll be pruning responses to this thread, but if you have useful information to share, I may update this post and credit you.