My server pool has one checksum error on one drive; it has been two weeks already and nothing else has occurred since.
The pool is showing as degraded, so I was wondering whether I should be worried and do something about it or not.
I guess it is something that can just happen, but maybe not, and the drive has to go?!
I have already run a pool scrub and all went well, no errors.
Please advise.
EDIT:
Let me add a bit of context since I can’t attach a screenshot to this post.
The pool is a RAIDZ2 with eight 6 TB drives of the WDC WD60EFPX-68C model.
The smartctl tests on that drive did not report any errors; all 16 tests displayed show “SUCCESS”.
It would be really useful if you provided more detailed info about your system, and the output of some commands (in code blocks, please):
Hardware specs
how the disks are connected to the mainboard
pool status
SMART status of the involved disk
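The items above can be gathered from a shell in one pass. A minimal sketch, assuming a pool named `tank` and a suspect drive at `/dev/sda` (both placeholders, substitute your own); each command is guarded so the script is harmless on a machine where the tool is missing:

```shell
# Collect the requested diagnostics in one pass. "tank" and /dev/sda are
# placeholder names; substitute your own pool and the suspect drive.
pool=tank
disk=/dev/sda
for cmd in "zpool status -v $pool" "smartctl -a $disk" "lspci"; do
    echo "== $cmd =="
    if command -v "${cmd%% *}" >/dev/null 2>&1; then
        $cmd
    else
        echo "(tool not installed on this host)"
    fi
done
```

Paste each section into a separate code block so the columns stay aligned.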
Your pool can still handle the loss of another disk without losing data, so you are not yet in a critical situation, but IMHO check that your backups are fine.
So I think this is a ZFS issue, not a drive issue.
Either way, take a look at my Drive Troubleshooting Flowcharts link below, also in the TrueNAS resources. It will help guide you on what you should do and if you should be concerned. It will address the simple ZFS error.
Well, if there has been an error for two weeks and nothing has yet been done to investigate and resolve any outstanding issue, the situation might well have evolved into something more sinister…
The pool is running on a Dell Poweredge T430 Server.
The disks are connected to an LSI MegaRAID SAS-3 3108 (IT mode) through the server’s onboard SAS backplane.
The pool status is online but unhealthy.
For backup I have another TrueNAS server in a different location that replicates this one.
But that leads me to more questions now…
If my data is corrupted, I assume the replication server will replicate the corruption, correct?
I am no genius and I am at loss in this situation.
I thought the worst that could happen would be two drives failing, degrading my pool to the point of it going offline.
Now you are telling me there might be ZFS corruption that could impact my data. I was not aware this could happen so easily without any prior warning.
How can I figure out if my data is corrupted?
Can I rely on my other Truenas server data that replicated my main server?
Thanks! I’ve been trying to follow your guide.
I ran a scrub afterwards and I believe I got no errors, as the system did not report any, but I am wondering where I can see the scrub logs to confirm that.
Using “zpool status -v” I am getting this (the incident happened prior to the 22nd of June, so my scrub was run soon after it):
scan: scrub repaired 0B in 05:32:33 with 0 errors on Sun Jun 22 01:52:05 2025
errors: No known data errors
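On the “where are the scrub logs” question: the scrub summary lives in the `zpool status` output itself (and in `zpool history`). A minimal sketch that filters just the scan and error lines out of a saved capture; the sample text mirrors the output quoted above, and on a live system you would feed in `zpool status -v <pool>` instead:

```shell
# Sample capture mirroring the status output above; on a live system,
# replace this with: status=$(zpool status -v yourpool)
status='  scan: scrub repaired 0B in 05:32:33 with 0 errors on Sun Jun 22 01:52:05 2025
errors: No known data errors'

# Keep only the scrub summary and the error-count line
printf '%s\n' "$status" | grep -E 'scan:|errors:'
```

“repaired 0B with 0 errors” plus “No known data errors” is the outcome you want to see.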
Is this reassuring?
How can I test the integrity of my data please?
I am currently running another scrub… I’ll update once it’s done.
It is reassuring. If I’m not mistaken, when there is an issue with a file on a disk and a scrub detects it, then as long as there are enough healthy copies the scrub will resolve the problem.
…so it seems that ZFS protected you as intended. Should you investigate what could have caused a single error? Yes. Is it possible it was a random blip, and is there a limit to what counts as a reasonable investigation? Yes. Can something still go wrong that is either related or entirely unrelated? Yes.
Actually with raidz2 it would take three failures to take down the pool. But you should do your utmost to never be in that situation. React to the first error and do not allow a second error to creep in. We’ve seen a few cases where, for some reasons, the user ignored an error and only came to ask for help when more errors occurred; total pool loss was the usual outcome.
With two degrees of redundancy:
One error. You’re still fine but act NOW!
Two errors. You’re at risk.
Three errors: pool lost.
Fortunately, yours may have been a fluke.
That could be an issue. Even in IT mode, this is still a RAID controller and not a proper HBA.
Best to replace it with a 3008 at your earliest convenience.
So the scrub done today did not report any errors or repaired data:
scan: scrub repaired 0B in 05:11:02 with 0 errors on Sun Jul 20 15:04:32 2025
I will therefore clear the pool status.
I will also take your advice and replace that Raid controller with an HBA card.
I wasn’t aware that could cause problems.
I have 2 more questions please:
In case of data corruption, would it also impact the replication server’s data? If so, would you recommend switching my backup to an rsync-based method to avoid that scenario?
How would you investigate that single CRC error? It seems to me like chasing a random issue that may never reproduce. I was waiting to see if an additional error would show up on the same drive before disconnecting it from the pool. With a single error, I simply thought it could be a glitch, hence why I took some time to react.
ZFS never returns corrupted data, so errors should not propagate. Either way, rsync would not help.
That is the question…
At the very least, check the hardware: long SMART test, drive controller cooling, PSU, memtest on the RAM. With recurring errors, you would try shuffling drives and/or cables around and see whether the errors follow the drives or follow the cables/ports. But a single error which does not repeat is very difficult to track down.
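After the long test finishes, the result lands in the drive’s self-test log (`smartctl -l selftest /dev/sdX`). A sketch of checking such a log for failures; the two log lines below are hypothetical sample data, not output from the poster’s drive:

```shell
# Hypothetical self-test log lines, in the format smartctl prints them;
# on a live system use: selftest=$(smartctl -l selftest /dev/sda)
selftest='# 1  Extended offline    Completed without error       00%     17648         -
# 2  Short offline       Completed without error       00%     17640         -'

# Any log entry that is not "Completed without error" deserves a closer look
printf '%s\n' "$selftest" | grep -v 'Completed without error' \
    || echo "all recorded self-tests passed"
```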
So far you have not listed the smartctl output for the drive you think had the CRC error. Please make sure we are all talking about the same thing: “CRC” will have someone jumping to a drive failure, as most people will think of UDMA_CRC_Errors, while “CKSUM” will have us jumping to a ZFS error. Both are checksums, but they are different problems and may or may not be related.
If you are referring to a ZFS CKSUM error, clear the error and monitor for a recurrence. There are too many reasons it could have happened to speculate and drive yourself nuts.
If you are referring to a drive UDMA_CRC_Error, then run a SMART long test. That recorded error count will never return to zero; it lives with the drive forever. UDMA_CRC_Errors are most typically caused by a suspect SATA (data) cable, though other hardware can cause them as well. There is a lot of data on this; just Google it.
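For the drive-side check, UDMA CRC errors show up as attribute 199 in `smartctl -A` output. A sketch of pulling the raw lifetime count out of a capture; the sample attribute line is hypothetical:

```shell
# Hypothetical attribute line in smartctl -A format; on a live system use:
# smart=$(smartctl -A /dev/sda)
smart='199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  1'

# The raw value (last field) is the lifetime CRC error count for the drive
printf '%s\n' "$smart" | awk '$2 == "UDMA_CRC_Error_Count" { print "lifetime CRC errors:", $NF }'
```

If that raw value is 0, the error you saw was a ZFS CKSUM error, not a cable problem.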
By itself, this can just be an example of bit rot, which is normal.
The trick is to try to determine a cause, or to see if SMART etc. is showing an issue.
I would suggest ensuring that SMART tests are being run, and looking at the results for the disk in question.
It could well show you a single pending sector, etc.
Which in a way shows the drive failed… at least failed to read a sector.
So it depends on what happened… but it could just be a one-off issue. You can clear it with zpool clear <pool>, then keep an eye on it… if it happens again, then worry about it.
Yeah, ain’t this the truth. I have one pool with 3 drives and a single CRC error from the day I installed it, on one disk. No more errors ever since; it was like it was just born that way. It has never given me any trouble.
Some things you just have to shrug at and keep an eye on through future reporting. If there aren’t more errors, there’s not much to work with.