Hi all, I’d like to get some input on what I’m observing on a Dragonfish-24.04.2.1 system.
After running a scrub, the zpool status output looks as shown below. Please note the large checksum error counts, which are the same for both vdev mirror members.
What I’d like to understand is whether these checksum errors have a material impact; I don’t see any effect on the two iSCSI volumes that are part of that pool. Also, the CKSUM error count at the mirror-0 level shows “0”. Does this mean the mirror status is healthy, regardless of the checksum errors on the individual disks?
What would you do if you saw these kinds of stats on your own system? I’m planning to run a RAM test over the weekend to check whether all looks good in that area (32 GB of non-ECC RAM).
I’ve also attached the SMART stats. The disks are somewhat aged, but not ancient. Disk 1 has a high Command_Timeout counter, but it hasn’t budged in ages.
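(For anyone wanting to pull the same data themselves, smartctl per disk does it; device names here are as they appear in the pool output below:)

smartctl -a /dev/sdd
smartctl -a /dev/disk/by-id/ata-WUH721816ALE6L4_2CHD6NLN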
Any input/opinions are welcome.
root@nas3[~]# zpool status Pool1 -v
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 1 days 05:21:41 with 4 errors on Thu Oct 17 03:52:05 2024
remove: Removal of vdev 2 copied 1.62T in 7h50m, completed on Wed Sep 11 17:28:21 2024
        124M memory used for removed device mappings
config:

        NAME                                    STATE     READ WRITE CKSUM
        Pool1                                   ONLINE       0     0     0
          mirror-0                              ONLINE       0     0     0
            sdd2                                ONLINE       0     0  258K
            ata-WUH721816ALE6L4_2CHD6NLN-part2  ONLINE       0     0  258K
        logs
          c8448ba9-714a-4dcf-a35f-92eb393b70af  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x13329>:<0x1>
        <0x1c167>:<0x1>
        Pool1/VMs-LAB:<0x1>
        <0x134b6>:<0x1>
        Pool1/VMs-PROD@auto-2024-10-12_01-00:<0x1>
        Pool1/VMs-PROD:<0x1>
Your disks are running way too hot; the first one is at 51 °C with a recorded max of 58 °C, and that can really damage disks.
I would improve the airflow before testing anything else or starting another scrub!
Another example of why passing SMART tests doesn’t necessarily mean much. CRC errors usually point to a bad cable or port. But those drives are definitely running way too hot, and if the drives are that hot, the HBA (if any) must be hotter, so this may indeed just be a temperature problem. The drives are under their rated max, but I’d hate to see the HBA temps. That case or its airflow needs fixing!
The mirror is made up of “sdd2” and what looks like a partition of a by-id disk name; neither is the partition UUID you’d normally expect to see. What’s the hardware (beyond the non-ECC RAM), and how is it set up?
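You can cross-check what those names actually point to with something like:

lsblk -o NAME,SIZE,SERIAL,PARTUUID
ls -l /dev/disk/by-partuuid

That should show whether “sdd2” and the “-part2” entry really are the two partitions backing the mirror, and what their partition UUIDs are.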
Thank you all for the feedback. Some answers to the various questions:
I removed the case enclosure of my NAS a few days ago, which I think is why the airflow was bad. I’ve re-installed the enclosure, and the disk temperature is now 40 °C. This is a small mid-tower build sitting in my office, so it can’t be too noisy…
I’ve run a memory test, and it reported errors at a memory location; see the attached screenshot.
The motherboard is an ASRock Rack E3C246D4I-2T with a single 32 GB module and a Xeon E-2224 CPU.
The disk names in the output might look odd because this system was upgraded from CORE. I replaced the disks a while ago, and when replacing one of them I messed something up in the TrueNAS GUI, so I had to issue some commands on the command line and ended up with those odd disk names…
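If I ever want to normalize those names, my understanding is that exporting the pool and re-importing it with an explicit device search path should relabel the vdevs (untested on my side; on TrueNAS this should probably go through the GUI’s export/import instead):

zpool export Pool1
zpool import -d /dev/disk/by-partuuid Pool1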
I’ve ordered a new 32 GB ECC RAM module, as there is definitely an issue with the current RAM module…
I totally understand that; my NAS can’t be too noisy either.
I’ve seen a real improvement with a case that has sound-absorbing panels, and of course you can also add those yourself.
Another thing that helps, in my experience, is adding as many fans as you can, big ones, but letting them run as slowly as possible. I have three for intake (two at the front blowing directly on the disks, one on the side for the NVMe and the fanless CPU) and one on top for exhaust; there’s also the fan in the PSU, but that just cools the PSU itself. The max temperature I see is 37-38 °C, and the drives run at 29-31 °C 99% of the time.
Update on this: I’ve replaced the memory with ECC RAM and tested again with memtest; the errors are gone (… as expected).
But that didn’t get rid of the CKSUM and corruption errors in the ZFS pool. I suspect that whatever was corrupted in memory has been persisted to disk…
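For anyone in the same situation, the usual way to check whether such errors are persistent or were a one-off is to clear the counters and scrub again:

zpool clear Pool1
zpool scrub Pool1
zpool status -v Pool1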
So I decided to back up the data in various ways in order to rebuild the pool. The corruption prevented pool replication from working correctly, and copying corrupted files from the mounted file systems was also a bit tricky; a Windows tool called “unstoppable copy” helped with that.
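On the Linux side, rsync can serve a similar purpose, since it logs read errors and moves on to the next file rather than aborting the whole copy. A minimal sketch, assuming a dataset mounted at /mnt/Pool1/data and a backup target at /mnt/backup (both hypothetical paths):

rsync -av --progress /mnt/Pool1/data/ /mnt/backup/data/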
Anyway, it looks like I’m on track to have a clean system up and running again very soon…