Checksum errors while replicating

I have 2 TrueNAS Scale servers that I manage remotely (both on version 24.10.2). The main server has a pool of 6 x RAIDZ1 | 10 wide | 18.19 TiB. The second server is used as a mirror, and its pool is set up as 3 x RAIDZ2 | 20 wide | 18.19 TiB. I'll have physical access to them next week; in the meantime, I can only work on them remotely.

They’ve been running for 3 years without any issues. We recently moved the servers to a new room and then upgraded from Core to Scale. We also wanted to use ZFS replication for the mirror instead of rsync, so we wiped the mirror (we also have offline backups, by the way).

After upgrading to Scale, we initiated a brand new replication task (no encryption). Please view the attached screenshot.

Since we were mirroring 250TB, it took about 5 days.

The first problem: 2 days into the transfer, a scrub task on the main server started to run and began finding tons of checksum errors on a cluster of drives, all of which sit behind the same LSI 9305-16I HBA card (there are 4 x LSI 9305-16I cards in the server). My guess is that something is wrong with that card and it needs to be reseated or replaced? I attached a screenshot of the errors. If I run zpool status -v, I get a list of about 30 files with permanent errors.
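(Side note: if anyone wants to check which disks sit behind which HBA on Scale, something like the following shows the grouping; the exact path names are illustrative and will differ per system.)

    # Block devices listed by PCI/SAS path, so disks group by controller
    ls -l /dev/disk/by-path/ | grep -i sas

    # Confirm the four controllers are visible on the PCI bus
    lspci | grep -i -e lsi -e sas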

What happened next is that the ZFS replication failed right at the end, with this message: Partially received snapshot is saved. A resuming stream can be generated on the sending system by running: zfs send -t 789c636064000310a.... If I try to run this, I get this error:

Error: Stream cannot be written to a terminal.

You must redirect standard output.

And if I try to start the replication task, it keeps giving me the “Partially received snapshot is saved” error.
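From what I understand, that resume stream is meant to be piped into a zfs receive on the destination rather than printed to a terminal, roughly like the following, run on the sending side (the hostname and target dataset are placeholders, and the token would be the full string from the message, not the truncated one above):

    zfs send -t 789c636064000310a... | ssh root@mirror-host zfs receive -s Main/Mirror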

The second problem: On the mirror server, I can see that the Mirror dataset's usage is about the same size as Main's, so almost all the data has been transferred. But I can't access that data: if I go to /mnt/Main/Mirror, there is nothing in it. Also, when clicking on the Mirror dataset, I get this error: [EFAULT] Failed retrieving USER quotas for Main/Mirror. Is there a way I can access the data that has already been transferred?

Any pointers for both issues would be welcome! Thank you!


Our 2 SSDs are in an L2ARC cache vdev, which acts as a second-level read cache. We don’t have an SLOG vdev.
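In case it matters for the diagnosis, L2ARC usage can be checked with something like this (pool name as above):

    # ARC / L2ARC summary: sizes and hit rates
    arc_summary

    # Cache devices appear under the "cache" heading with their own I/O columns
    zpool iostat -v Main 5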

Sorry - a brain fart on my part! D’oh!!!

I assume that your server has >= 4GB of memory and that your stats show this is useful.

I am not an expert on this sort of hardware issue, but given that you haven’t yet had any responses from others, here is my input anyway…

You have a lot of data here, and a distinct risk of vdev raidz1-2 going offline and taking the entire pool with it. So whatever you do, you need to be careful.

  1. I am in two minds about what you should focus on first. If you can fix the replication and get it to complete, you might then have a full and valid backup; on the other hand, any stress on the primary might push it over the edge. On the whole, I think it is probably better to focus on the primary and worry about the backup replica later, but this will depend on what other backups you have and whether the 30 files need to be recovered from the replica or not.

  2. You should probably get further advice on this before doing it, but issuing a zpool clear Main to reset the error counters to zero might help avoid the pool going offline (see the command sketch after this list).

  3. You should probably stop all regular tasks like scrubs on this pool in order to avoid stressing it.

  4. If the affected drives are all on one HBA, then swapping out the HBA for a known good one (pre-flashed with the correct firmware on another system) might be a good first step. If there will be a delay on this, then removing and reseating the existing HBA might be a good interim step. Then you can see whether errors start to occur in volume again.

  5. The next step after that should IMO be to bring that vdev back from DEGRADED by resilvering. (Check the SMART stats for the offline drive first, but since it was, we think, taken offline because of an HBA issue, the likelihood is that it will be fine.)

  6. Then I would run a scrub and check that the errors don’t come back.

  7. Then, finally, I would think about whether to try to recover the replica and complete it with an incremental replication, or to simply start again. That decision would be based on whether the scrub results show that you need to recover anything from the partial replica or not.
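A rough command sketch for steps 2, 3, 5 and 6 (pool name taken from your posts; the disk name is a placeholder, and please sanity-check each step against the zpool status output before running it):

    zpool clear Main            # step 2: reset the error counters
    zpool scrub -s Main         # step 3: stop a scrub that is still running
    zpool online Main <disk>    # step 5: bring the offlined disk back; a resilver should start
    zpool status -v Main        # watch the resilver progress and the list of errored files
    zpool scrub Main            # step 6: fresh scrub once the resilver has finished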

Hi @Protopia,

Thanks for your reply. Yes, it’s pretty much what I’m planning to do.

I was able to mount the partially replicated dataset on the mirror (it was only missing a few gigabytes) by running zfs mount -a.
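For anyone hitting the same thing: zfs mount -a did the trick, and something like the following confirms the dataset is mounted and the data is visible (dataset name as shown in the UI):

    zfs mount -a                            # mount every dataset that isn't mounted yet
    zfs get mounted,mountpoint Main/Mirror  # confirm it is mounted and where
    ls /mnt/Main/Mirror                     # the replicated data should now be listed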

I will be swapping HBA cards on Monday and will report back after that.