I’m running ElectricEel-24.10.2 with SMART short tests scheduled to run periodically. When they run, I get alerts about pending sector errors on one or two SSDs in my storage pool. I’m wondering whether these alerts are bogus, because whenever I read the Current_Pending_Sector raw count from smartctl -x, the count is always 0 for the SSD in question. Am I interpreting this correctly? Could there be a bug in the alerting? Everything else seems fine; I don’t have any other errors with these drives.
You have asked a question without providing any real data for us to examine, so we can’t give you a proper analysis.
Joe’s Rules (link in my signature) has a list of data to provide for each problem type, and drive issues are one of them. To get good, accurate help, please provide the required data.
As for the error messages you receive, paste the exact, full error message. Also make sure you track your drives by serial number; the drive IDs can change on you with each reboot. They often remain the same, but they can and will change periodically when you least expect it.
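For example, a quick way to tie a device name to a physical drive (the device name here is just an example):

lsblk -o NAME,MODEL,SERIAL
smartctl -i /dev/sdc | grep -i 'serial number'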
What I’d like to see is:
The full output of smartctl -x /dev/??? from each suspect drive, in code brackets.
zpool status -v
The exact error message(s).
You say pending sector errors on 1 or 2 of your SSDs. Is it 1 or 2?
Do not assume we understand you. Assume we are idiots and you have to explain everything in detail. It sounds a bit harsh, but the worst thing a person can do is make an assumption that causes more harm.
Well, time for me to call it a night. If you post the required data, someone here (or I) will offer assistance.
I don’t know without seeing the data.
Could be, but again, need to see the data.
And then a failure happens when you wish you had seen it earlier and could have planned for it. It happens to the best of us.
Thanks for taking the time to look at this. I’m running ElectricEel-24.10.2. I’m getting alert notifications through the TrueNAS UI with the following messages. They seem to be triggered by the scheduled S.M.A.R.T. short tests. I don’t see these alerts every day, just once or twice a week; the S.M.A.R.T. tests run daily at midnight.
Device: /dev/sdc [SAT], 1 Currently unreadable (pending) sectors
Device: /dev/sdh [SAT], 1 Currently unreadable (pending) sectors
My pool seems fine and I do not see any read, write, or checksum errors. A scrub completes without any problems:
admin@nas01$ zpool status -L zvol
  pool: zvol
 state: ONLINE
  scan: scrub repaired 0B in 00:04:27 with 0 errors on Tue Mar 25 21:56:30 2025
config:

        NAME          STATE     READ WRITE CKSUM
        zvol          ONLINE       0     0     0
          raidz1-0    ONLINE       0     0     0
            sdi1      ONLINE       0     0     0
            sdj1      ONLINE       0     0     0
            sdc1      ONLINE       0     0     0
            sdg1      ONLINE       0     0     0
            sdh1      ONLINE       0     0     0
        logs
          mirror-3    ONLINE       0     0     0
            sdd1      ONLINE       0     0     0
            sde1      ONLINE       0     0     0
        spares
          sdf1        AVAIL

errors: No known data errors
When I examine the data from the SSDs using smartctl, I do not see the Current_Pending_Sector count incrementing; in fact, it is always 0. So I’m wondering whether these alerts are legitimate. I am aware that drive names can change following a reboot, so this data was collected from both SSDs using smartctl -x /dev/sdX:
[sdc.txt|attachment](upload://MZCRF9CqB8fRjAmLEL70yB1g7N.txt) (32.7 KB)
[sdh.txt|attachment](upload://hHHM7CXrI2IDL6l0dh86lGnCcos.txt) (18.1 KB)
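For a quick spot-check of just the pending-sector attribute, rather than wading through the full -x dump, I’ve been using something along these lines:

smartctl -A /dev/sdc | grep -i pending
smartctl -A /dev/sdh | grep -i pending

Both consistently show a raw value of 0.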
Regards,
John Rushford
/dev/sdc (P220EDCB23102704018) has some ICRC (Interface CRC) errors and a non-zero UDMA CRC error count, which could be caused by a bad cable connection. I’d reseat the connection and confirm whether this continues. The values are quite low, though, so they may not be of concern.
199 UDMA_CRC_Error_Count    -O--CK   100   100   000    -    2

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001       2      20  Command failed due to ICRC error
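If you want to watch just these counters over time, smartctl can print them directly rather than making you dig through the full -x output (sataphy pulls GP Log 0x11):

smartctl -l sataphy /dev/sdc
smartctl -A /dev/sdc | grep -i crc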
/dev/sdh (P220EDCB23102704033) looks fine.
I don’t see anything in relation to bad/pending sectors. Have you rebooted since the alert? Maybe the labels have been shuffled around.
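One way to confirm a label still points at the same physical disk is to match the serial embedded in the by-id symlinks (assuming a standard Linux/SCALE shell):

ls -l /dev/disk/by-id/ | grep -w sdc

The symlink names include the model and serial, so you can compare them against the serials in your smartctl dumps.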
No, I haven’t rebooted, and I’m aware that the labels can move around following a reboot. I’m 100% positive that the data shown corresponds to the SSDs identified in the alerts. I did have some issues with the LSI 9300-16i HBA; I had to change the PCIe bus speed to Gen2 in the BIOS. That seems to have corrected the issue I saw when I initially built this pool. Perhaps that accounts for the CRC errors.
OK, I flashed my HBA with the 16.00.12.00 firmware, and everything looks fine after the reboot. I’m going to monitor for a few days, and then I might change the PCIe speed BIOS setting back to Auto from Gen2.
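In case it helps anyone else, this is roughly how I verified the flash took and what link speed the HBA negotiated (sas3flash comes with the Broadcom/LSI tools; 0x1000 is the LSI PCI vendor ID):

sas3flash -listall
lspci -vv -d 1000: | grep -i lnksta

sas3flash reports the firmware version per controller, and lspci’s LnkSta line shows the current PCIe link speed, so I’ll be able to tell if anything changes when I set the BIOS back to Auto.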