Weird CHKSUM errors after drive replacement

I’ve got a TrueNAS 24.10 server with an 18-wide RAIDZ3.
Data disks are all Seagate Exos 18TB SAS drives.
There is also a mirrored special metadata device but that shouldn’t be of any consequence here.

Roughly a week ago, one of the disks in the RAIDZ3 started developing pending defects and failing SMART tests.

We replaced the disk with one that we had in cold storage and used the same drive bay.
I ran a short and then a long SMART test; both completed without issues.
The array resilvered, but then started reporting ZFS checksum errors.
Running a scrub brought the checksum error count up to more than 2,500, all on the “new” disk. SMART values all remained good.
Putting the disk in another drive bay did not change anything.

By then we had also received the replacement for the disk that first reported SMART errors. I swapped out the cold-storage disk for the completely new one and put it in a previously unused drive bay.

The pool again resilvered just fine, but the completely new disk now also shows ZFS checksum errors: about 30 after resilvering, and now, at about 22% of a new scrub, already 43.

All other disks are still fine.
Any idea what might be the fault here?

You should treat this as two different problems. You HAD a drive starting to fail and have fixed that problem. Now you have a ZFS error.

I will admit that I’m not the authority on ZFS but I will give it a shot.

At a minimum I need the outputs of these commands:
zpool status
lsblk -bo NAME,MODEL,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
smartctl -a /dev/sdX (where sdX = the suspect drive)

If you feel the need to redact the serial number, please leave the last 4 characters so we can follow the drives.
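If it helps to narrow things down before posting, a filter like this pulls out only the devices with a nonzero CKSUM count from zpool status output. The sample status text below is made up for illustration, not taken from this pool:

```shell
# Filter zpool status output for device lines with a nonzero CKSUM count.
# The sample is illustrative; in practice you would pipe in the real output,
# e.g.  zpool status | awk 'NF == 5 && $1 != "NAME" && $5+0 > 0 {print $1, $5}'
zpool_status_sample='  NAME        STATE     READ WRITE CKSUM
  tank        ONLINE       0     0     0
    raidz3-0  ONLINE       0     0     0
      sda     ONLINE       0     0     0
      sdx     ONLINE       0     0  2500'

printf '%s\n' "$zpool_status_sample" | awk 'NF == 5 && $1 != "NAME" && $5+0 > 0 {print $1, $5}'
```

On the sample above this prints only the offending device and its count (sdx 2500), which is easier to eyeball than the full status listing.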

It actually is sdx xD

zfs_disk_details.txt (18.2 KB)

I am aware that there is a redundancy mismatch between the special device(s) and the raidz3 - we are planning to fix that.

Because you have a Seagate high capacity drive, let’s grab the output of smartctl -l farm /dev/sdx and see if it looks okay. You may need to go online to verify the warranty is good as well to make sure you don’t have a heavily used drive that got back into the supply chain. And while this may not be ideal, if it is used, at least we know the problem.

here’s the FARM data
farm.txt (22.6 KB)

warranty is fine

Stupid question time: I made an assumption, stupid me, and just want to make sure it is clear. After the drive was replaced, did you run zpool clear pool to reset the cksum error count? If not, enter that command. Next, run a scrub on that pool with zpool scrub pool, and once it is done, check the status of that pool again.

But you may have already done these steps; I just should not assume things.

I have cleared the errors multiple times since they first appeared, most recently right before I started the latest scrub. I cancelled that scrub after the drive had accumulated 43 errors at 22% progress, since I had only started it to see whether new errors would appear.

Looking at the farm data:
The head write power-on hours are 278574020, which works out to over 31,800 years. I have seen this before and have learned not to trust that this value is what it claims to be.
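For reference, the conversion behind that number is a straight unit calculation (the raw value comes from the FARM log quoted above):

```shell
# FARM reports head write power-on hours as a raw counter.
# Dividing by 8760 hours per year shows why the figure is implausible.
raw_hours=278574020
hours_per_year=8760   # 24 * 365
echo $((raw_hours / hours_per_year))   # → 31800
```

Over 31,800 years of head write time on a drive that has existed for a few years at most, so the raw counter clearly does not mean what it appears to mean.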

It looks like you have written to about 5% of the drive, so the numbers look good in this respect. Overall I do not see any glaring FARM issues.

Yes, let the scrub complete. It should have no errors after being resilvered but it looks like your server is not playing nice.

The odd thing here, and a good thing, is the errors are only on the new drive.

As for the drive bay, can you be more descriptive here? I do not see that in your build info. Are you using the same data cable? This sounds like a hardware failure, so we need to think about what is common to the failures, then make a change and see whether the problem follows the drive or some other piece of electronics. Since I don't know what hardware you have for the drive interface, it could be a drive bay, a data connection, the power connection, or the HBA if you are using the same data connection. There are not too many things it could be. And of course, last on the list, the drive itself could be bad as well. The odds of getting two bad drives? Not high, but if they were from the same batch, possible.

This thread is not about my personal rig but about one at my workplace.
It’s a 4U Supermicro server with a SAS3408 controller and 30 drive bays in the front.

As stated in my first message, I've now used three different bays. First I put the cold-storage drive into the same bay the old drive failed in, then I moved it to another bay, and then I put the new drive whose FARM data I just sent into a third bay. I honestly don't know how the cabling is routed inside, but the three bays are spread relatively far apart, so I would be very surprised if all of these bays, and only these, shared a cable that no other bay shares. All SAS disks connect to the same controller.

As I said, one “new” drive had been in cold storage here for years; the other has a completely different serial number and we received it just a few days ago. I can give you SMART/FARM data for the cold-storage drive as well if you need it. It's still in the server, but not in any pool.

Don’t need any more FARM data, I don’t think that is the cause.

I also don’t think it is the drive bay backplane if you have placed the drives in completely different areas.

I have run out of ideas. As I said, I’m not the ZFS expert. The only other thing is, have you powered down completely and powered back up? Notice I didn’t say reboot. Stranger things have happened.

And now we wait for additional help.

A power cycle is what I’ve planned for tomorrow.
Thanks nonetheless.
Any idea who is the ZFS expert here?

There are a few… @HoneyBadger @Arwen @etorix @Protopia and many more. Just not me.

My only contribution at this point, is to check:

  • Aggressive head parking
  • Aggressive spin down / up

Neither should be an issue with these drives. But we here in the TrueNAS forums have seen some odd things before.

Post the output from smartctl -x /dev/sdx, or whichever drive is having the problem.

The original failure could be real; these replacements could have features enabled that are not generally suitable for TrueNAS or ZFS.

Don’t know about the SAS Exos, but the SATA Exos park the heads way too often, unless you disable it with SeaChest.
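On the SATA models, that aggressive parking shows up as a fast-growing Load_Cycle_Count attribute in smartctl -x output. A quick way to pull out the raw value; the sample line and count below are made up for illustration, not from this thread's drive:

```shell
# Extract the raw Load_Cycle_Count value from a smartctl attribute line.
# Real usage would pipe in the live output:
#   smartctl -x /dev/sdx | awk '$2 == "Load_Cycle_Count" {print $NF}'
smart_sample='193 Load_Cycle_Count        -O--CK   098   098   000    -    4503'
printf '%s\n' "$smart_sample" | awk '$2 == "Load_Cycle_Count" {print $NF}'
```

If that number climbs by hundreds per day, the drive is parking far too aggressively and is worth adjusting with SeaChest.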

I am by no means an expert, nor do I work with or manage a TrueNAS box in an enterprise environment. However, I have used enterprise hardware in my home setup and have gone through a couple of SAS expanders that went bad, and the issues you are experiencing are similar to what happened when an expander was failing: drives suddenly showing ZFS checksum errors (or even ZFS read/write errors), with the errors occurring on random drives after every power cycle, reboot, and scrub.

When this happened, I pulled the “problem” drives from the JBOD and used a different system to run a full surface test, only for the test to come back with no errors. Pop the drive back into the JBOD, and now the same drive, or even a different drive, shows ZFS errors. Take the drive out, run a full surface test again: all OK. Put it back in the JBOD, and this time the ZFS scrub comes back clean, but the next day it shows errors again. Rinse and repeat, over and over, until the expander finally gave up the ghost and stopped working altogether.

Put in a “new” expander, and all is working fine again. Scrubs come back clean, no errors after days, weeks, or multiple future scrubs.

That was a little more than a year ago, and I just had the same issue (same model HP SAS expander as well) crop up again a couple of weeks ago. Drives suddenly started having various ZFS errors, and after every power cycle, reboot, or scrub, a different drive would have different errors, or no errors until a day or two later.

Thanks for all your input. I’ll report back after the power cycle.

I’ve power cycled the server - still same issue.

I’ve opened a bug report.

smart output is attached there.

My report was closed because I was told to update to the latest version. According to the software status page, 24.10 is still recommended for conservative users. This storage needs to be rock solid, so I don't want to update to 25.04 until it is actually recommended to do so.

I checked and neither SMART output seems to indicate aggressive head parking or spinning down.

Sorry, I have no more ideas.

Thanks nonetheless. I very much hope iXsystems will reopen the issue.