ATA Error Counts increasing

joeschmuck · May 21, 2025, 2:13am

Well, I was hoping you had provided the wrong drive and would find that a different drive indicates the ATA error. Good luck troubleshooting this problem.

ruffhi · May 21, 2025, 3:43am

So … I pulled the four HDD cages out of the case … and found that one of the molex power connectors was either loose or out. That would have taken out / impacted 4 of my drives and could be the cause. I am just letting it sit in the janky (my new fav word - thanks Fleshmauler!) mode at the moment. I also ipmi’d the fan up to max and now all 8 HDDs are chill at or under 30ºC.

So … while watching an episode of Orphan Black, I googled, read up, youtubed and chatGPT’d about replicating my plexdata to my devl server.

Apparently, I need to establish an SSH connection between the two and then run a replication snapshot (see how much jargon you can pick up in a few hours). I tried it, only for TrueNAS to complain about self certificates (or something like that).

But that is a topic for another post.

Edit: One more round on youtube and ZimaBlade pointed out that you can just ignore the self certificate error. Replication task now running.

16.76 GiB / 4.01 TiB … it will be a long night.

ruffhi · May 21, 2025, 4:14am

Last post from me for tonight.

One thing that I have found over time is to decomplicate systems. If you can go from A to C without having to go through B, then do it.

This is another example … why have an HDD cage between your mb sata connection and your HDD sata connection when you don’t need one? Sure, it will make it harder to swap HDDs in / out, but I have found that I don’t actually use that option very much.

It is like putting a drain port in the middle of the tube between your CPU port and your radiator port - just another connection (2 in fact) that could fail. Put your drain port directly on one of your radiator ports and leave the tubing alone.

joeschmuck · May 22, 2025, 2:40am

I completely agree. I know many people want a hot swap drive bay because they feel it will make replacing a failed drive so much easier and faster. How often does a drive fail and need to be replaced? Every 3 to 5 years? How difficult is it to remove four screws and disconnect a power and data cable? The screws are optional with some of the slides they use these days.

Keep it simple. If you have a NEED for a hot swap bay, then it makes sense, like a HA system. Otherwise save a few bucks.

Brother, I hope that problem is completely gone and the issue was the power connector.

@Fleshmauler saves the day! Good advice on what to check.

ruffhi · May 22, 2025, 10:51am

Things have been very calm. I am starting to not feel so much panic when I get an ALERT email. It used to be ‘BLAH is offline’. Now it is ‘boot scrub finished’.

Ran a scrub - 3 hrs of boring
Now running a replicate of my plex medium files to my devl server.

neofusion · May 22, 2025, 12:53pm

It’s a good question to ask because device IDs (sda, etc.) jump around every reboot, so what was sda one boot could be a different drive the next.

PK1048 · May 22, 2025, 1:12pm

Device IDs should only change if there was a change in physical configuration. For example, I have a 16-slot hot swap drive cage, when I booted my TN it only had 12 drives in it. I have since added four more. At the next boot I expect my device names to have changed but NOT the UUID or PATH ID. If you make no changes to your hardware the device tree should not change.

This was a problem with very early ZFS. ZFS preferentially used the cache file (where it tracks which zpools are imported and where their physical devices are) over the on-disk labels when importing a zpool, and if you changed the physical configuration you could end up with a zpool that did not import at boot. Today ZFS uses the labels on the devices to map zpools to devices and assemble the zpool when importing at boot time. Note that TrueNAS exports all data zpools when you shut down and imports them when you boot.

neofusion · May 22, 2025, 2:33pm

Device names like sda, sde, etc. can in fact change every reboot. This isn’t dependent on a physical change. As far as I understand it, it comes down to timing during boot: what order the drives end up initialising in.

ZFS functionality in TrueNAS will normally not be impacted by this because TrueNAS uses the partuuid to reference the device (in most cases).

But the end user can still be affected because some error messages use the device name (sda) as the identifier. Because the discussion in this topic included running SMART tests I thought it sensible to bring it up, since running smartctl is typically done on a device name, not the part-uuid. You really want to verify that sda this boot is still the drive you think it is.
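One way to do that check, as a minimal sketch: on Linux-based systems, udev normally maintains symlinks under /dev/disk/by-id whose names include the model and serial number and whose targets are the current kernel names. The function name `map_by_id` is made up for this example, and the sketch assumes that directory exists on your system (it returns an empty mapping if it does not).

```python
import os


def map_by_id(by_id_dir="/dev/disk/by-id"):
    """Return {kernel_name: [stable ids]} by resolving the symlinks in by_id_dir.

    Assumption: by_id_dir holds udev-style symlinks such as
    ata-<model>_<serial> -> ../../sda. Non-symlink entries are ignored.
    """
    mapping = {}
    if not os.path.isdir(by_id_dir):
        return mapping
    for entry in sorted(os.listdir(by_id_dir)):
        link = os.path.join(by_id_dir, entry)
        if not os.path.islink(link):
            continue
        # Resolve the symlink and keep only the final path component (e.g. "sda").
        kernel_name = os.path.basename(os.path.realpath(link))
        mapping.setdefault(kernel_name, []).append(entry)
    return mapping


if __name__ == "__main__":
    for name, ids in sorted(map_by_id().items()):
        print(name, "->", ", ".join(ids))
```

Run it right before a SMART test and compare the serial shown for sda against the drive you intend to test.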

ruffhi · May 22, 2025, 3:54pm

Agreed, the 10k ways of referring to a single HDD are annoying. I have a script that runs periodically and dumps the various ways a single HDD is referenced (ID, UUID, blah, blah). Very useful reference point.
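For readers who want something similar, here is a rough sketch of such a cross-reference dump. It is not the script mentioned above; it simply assumes a Linux system where `lsblk` supports JSON output (`-J`) and the NAME, TYPE, SERIAL, WWN and PARTUUID columns. The helper name `flatten` is invented for this example.

```python
import json
import subprocess

# Assumption: lsblk from util-linux with JSON output support.
LSBLK_CMD = ["lsblk", "-J", "-o", "NAME,TYPE,SERIAL,WWN,PARTUUID"]


def flatten(lsblk_json):
    """Turn lsblk JSON text into a flat list of rows.

    Partitions usually report no serial of their own, so each child row
    inherits the serial of its parent disk.
    """
    rows = []

    def walk(node, parent_serial=None):
        serial = node.get("serial") or parent_serial
        rows.append({
            "name": node.get("name"),
            "type": node.get("type"),
            "serial": serial,
            "wwn": node.get("wwn"),
            "partuuid": node.get("partuuid"),
        })
        for child in node.get("children", []):
            walk(child, serial)

    for dev in json.loads(lsblk_json).get("blockdevices", []):
        walk(dev)
    return rows


if __name__ == "__main__":
    out = subprocess.run(LSBLK_CMD, capture_output=True, text=True, check=True).stdout
    for row in flatten(out):
        print(row)
```

Keeping the parsing separate from the `lsblk` call means the output of each run can be saved and compared later.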

joeschmuck · May 22, 2025, 5:32pm

This is unfortunately not true. While a system can power up and reboot while maintaining the same Device IDs, the assignment depends on which drive is recognized first: that one gets the first drive ID. Never track a drive issue by the drive ID alone; that is problematic.

@ruffhi If you are using my little script Multi-Report, it provides all the cross references needed to identify a drive with everything we need to track a drive using UUID, Device ID, Serial Number, where it is in a pool, all that fun stuff. Blah Blah Blah, lol.

I’m glad your issues are not coming back and maybe the backplane or the power connector was the problem.

ruffhi · May 22, 2025, 6:41pm

I have my own versions - smaller, more focused on single tasks. But I have a copy of your multi which I borrow (?) steal (?) use (!) as a good resource.

Thanks

PK1048 · May 22, 2025, 8:56pm

Computers are state machines. Well-behaved state machines are consistent in their operations. So if the kernel probes the attached devices in a deterministic order, and I am not aware of any kernel that probes its attached devices in a nondeterministic (random) order, then it is probing the devices and assigning device names in a specific order. Some kernels even cache the names they assign (SunOS / Solaris /etc/path_to_inst ) so that devices do not change device name even when the device tree changes.

I do agree that if drives spin up in a different order, then the kernel will see them in a different order and assign potentially different device IDs. But, I have never seen a well behaved system where the kernel is probing the devices before they are all accessible. A non-well behaved system (one with components that are not behaving correctly) needs to be repaired to return it to well-behaved status.

I am very curious under what circumstances you ( @joeschmuck ) have seen device IDs change on a stable system with no physical changes (by stable I mean one that has been rebooted with the current physical state).

All of the above should not encourage anyone to randomly start making changes when they are having a ZFS problem in the hopes that it will fix the problem. I cannot count the number of times someone lost a zpool because they started making changes before they asked for help or let the system sort itself out on its own.

joeschmuck · May 22, 2025, 11:47pm

(Joe steps up on his Soap Box)

@PK1048 you are making an assumption that all systems operate 100% exactly the same every time. I do not assume a system is always perfect, nor that it will boot exactly the same each time. I can’t tell you how many times that has bitten me in the rear, especially when involved with a government contracted system, no fun. I will say that likely 99.9% of the time, my personal systems all boot repeatably so the Device IDs typically remain the same, but then again, I am running TrueNAS on ESXi so all the drives should be fully ready well before my VM starts.

Generally we have to tell people that the Device ID may change and to use the serial number for tracking purposes because when they are asking for help, there is some sort of problem. The system is not 100% perfect.

I’m not saying that most systems will bootstrap and have different results every time, and maybe most of the time it is all exactly the same, but to assume it will always be the same is, in my opinion, incorrect.

We can disagree, I have no problems with that.
I have disagreed with others as well, but I try to explain exactly why I disagree. It certainly isn’t because I enjoy being difficult.

I am putting out information to best help a person troubleshoot a drive issue: using the drive serial number is the best practice and the most accurate way to track a drive. The Device ID can change, and it actually has done so for many people. Those who replaced the wrong drive because of this identification have lost data or, at best, wasted a lot of time, and the forums are splattered with reports of the Device ID changing. This is why my little script provides all the cross-reference data, so that identifying and keeping track of the potentially failed drive is possible, and to help a person feel secure about the drive they replace.

Thanks for the last part of your message, it is very true about lost pools. For those reading this and not sure what we are talking about…

Scenario: Joe is running TrueNAS with a RAIDZ1 pool, and the system has been up and running for 14 days when he gets an error message that drive sda has problems. Joe powers down the system and orders a replacement drive. The drive arrives, but Joe didn’t record the serial number; he just knows sda is the problem. Joe turns on the system to identify the serial number for sda, but after this reboot sda may no longer be the drive that reported the error. He then follows the GUI steps to replace a drive: powers off, replaces the drive, powers on, and the pool is gone because one of the good drives was replaced instead of the actual failed drive.
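The safeguard against this scenario can be sketched in a few lines: record which serial number sits behind each device name before powering down, then compare against the mapping after the reboot. The function name `moved_drives` and the serial numbers below are hypothetical, and how the two mappings are collected (smartctl, the GUI, a script) is left open.

```python
def moved_drives(before, after):
    """Compare two {serial: device_name} snapshots.

    Returns {serial: (old_name, new_name)} for every drive present in both
    snapshots whose device name differs between them.
    """
    return {
        serial: (before[serial], after[serial])
        for serial in before
        if serial in after and before[serial] != after[serial]
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken before shutdown and after the next boot.
    before = {"SN-AAA": "sda", "SN-BBB": "sdb", "SN-CCC": "sdc"}
    after = {"SN-AAA": "sdb", "SN-BBB": "sda", "SN-CCC": "sdc"}
    for serial, (old, new) in moved_drives(before, after).items():
        print(f"{serial}: was {old}, now {new}")
```

If the drive that reported the error shows up in this list, the name in the original alert no longer points at it, and the serial number is the only safe way to pick the drive to pull.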

This kind of scenario has happened way too many times in my past xx years using FreeNAS/TrueNAS. And with TrueNAS becoming much more user friendly, this is likely to happen much more often as people who are not well educated in this stuff do things wrong.

I feel we are just discussing two different points: you are discussing a perfectly good running system, and I am discussing a not 100% perfect system. I err on the side of caution.

Okay, I’m off my soap box (for those who remember those days). Also, my smartphone is needing some power.

EDIT: I just found out that the USA currency will be dropping the one cent piece. There goes the saying “I’ll give you my two cents”. Thought I’d toss that in here since the Soap Box is an old phrase as well.

Cheers!