All drives in a pool degraded - many checksum errors. Pool data seems ok, all errors are in rrd files

Hi all,
I’ve been using FreeNAS for about 7 years and TrueNAS for the last 6 or so months. Unfortunately I know next to nothing about it; I just set it up using the how-to guides and have had no problems so far. It is just a simple home system with one user (me). I use Plex to watch movies and listen to music, plus an SMB share to access documents, and that’s it.

I recently had one of my drives go bad in a pool of 5 x WD Red 3TB in RAIDZ1. I replaced it with a new WD Red 4TB drive (edit: this is a WD40EFPX Red Plus, which is CMR; I checked this after finding another post about SMR drives causing issues). I followed the same process I followed the last time a drive went bad on me, but this time I ran into some problems: the system hung during resilvering and I had to restart it several times. I did some troubleshooting and looking around the forum, and there was a suggestion that the new drive may have been drawing more power than the old drive and the PSU could not handle it. (edit 2: I just found my notes on the errors that made me try unplugging a drive: it was “Low water mark reached. Dropping 100% of metrics” written repeatedly on the monitor screen partway through a resilver). I physically removed a drive from another pool, and the resilvering then got through overnight. After resilvering, the 4 old drives were shown as degraded and the new drive showed as OK.

I ran a scrub on the pool and now all drives say they are degraded, with the exact same number of checksum errors (thousands of them) on every drive.

All of my data seems to be there and working fine. I ran zpool status -v on the pool, and the corrupt files all seem to be .rrd files to do with logging the status of the system. I have no idea what these are or how to clear them. I have searched the forum and tried various solutions from other threads, but I cannot find anything that works.

When I reboot the machine I get a repeated message that takes up the whole monitor:

vm_fault: pager read error, pid 2329 (rrdcached)

and then 4 lines of:

truenas.local collectd 2429 - - rrdcached plugin: Failed to connect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: Connection refused (status=61)

I’ve tried fixes from this thread but they don’t resolve my issue: rrdcached plugin: failed to connect | TrueNAS Community

Below is the result of the last time I ran zpool status. It previously showed thousands of checksum errors per drive; that is now down to 130, but growing again:

root@truenas[~]# zpool status -v WD3X5
pool: WD3X5
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 140K in 06:26:17 with 56 errors on Sun Jun 16 18:12:30 2024
config:

    NAME                                            STATE     READ WRITE CKSUM
    WD3X5                                           DEGRADED     0     0 0
      raidz1-0                                      DEGRADED     0     0 0
        gptid/8b60b5dd-2abc-11ef-bf68-480fcf670049  DEGRADED     0     0   130  too many errors
        gptid/2da89670-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors
        gptid/a9f077ad-e778-11ee-89a2-480fcf670049  DEGRADED     0     0   130  too many errors
        gptid/32bbdc95-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors
        gptid/37756027-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors

errors: Permanent errors have been detected in the following files:

    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_busy_percent-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_busy_percent-ada1.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_bw-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_bw-ada1.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada1.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada2.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops-ada1.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops_rwd-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops_rwd-ada2.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada0.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada1.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada2.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/load/load.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/memory/memory-free.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/memory/memory-inactive.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_eviction-eligible.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_eviction-ineligible.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_operation-deleted.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_data-miss.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_metadata-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_metadata-miss.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mfu-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mfu_ghost-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mru-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mru_ghost-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-prefetch_data-miss.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-prefetch_metadata-hit.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-arc.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-bonus_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-c.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-dbuf_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-0/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-dnode_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-1/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-10/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mfu_ghost_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-2/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-3/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mru_ghost_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mru_size.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-p.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-6/temperature.rrd
    /var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/hash_collisions.rrd

I’ve checked all the cabling to the drives. All seems ok.
I’ve tried clearing the errors and doing another scrub but they just start coming back again.
I’ve also tried stopping and starting the rrdcached service.
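
A basic sanity check from the shell (assuming the stock FreeBSD tools are available on CORE) would be something like:

    ps ax | grep -i rrdcached         # is the daemon actually running?
    ls -l /var/run/rrdcached.sock     # does the socket collectd is trying to reach exist?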

I’m really at a loss and not sure where to go from here.

Can I delete these files with errors? How do I do that? I don’t even know how to reach these folders other than through the Shell in the GUI.

Will restoring my system config from a backup help?

Can I safely back up the data that is currently on this pool? There is nothing super critical on there, just media files and working documents. I have a backup, but it is quite out of date, so I’d prefer not to lose everything that is currently in the pool; it would be quite a pain.

Any ideas, assistance, help on what to try next would be greatly appreciated.

System is an HP Z440 with an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz and 16GB RAM; I don’t know much more about it.
There are 2 pools: one is just a single 8TB drive, the other is the 5 x WD Red drives. Having the 8TB drive installed or not doesn’t seem to make a difference. All drives are running straight off the motherboard. I’m running TrueNAS CORE 13.0-U6.1 off a USB stick. I know this isn’t recommended, but it is how I always ran FreeNAS, and I haven’t gotten around to working out how to migrate to an SSD since I’ve run out of SATA ports on the board, so I’m not exactly sure how I’m going to do that. It was the next item on the list once I had replaced this faulty drive and got the pool healthy again.

edit 3: I should also mention that there is no reporting data available at all. All of the graphs in the GUI under “Reporting” are just blank.

What setting are you using for the SATA controller: AHCI?

There are a lot of threads about this kind of problem; do a Google search on ‘truenas rrd errors’ and you should find a few postings, mainly on the old forum site.

It looks like those files are in your System Dataset. Where is your system dataset located? Based on the errors, I suspect it is on this pool.
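
If you want to check from the shell rather than the GUI, something like this (using the pool name from your output) should show whether the .system datasets, including the rrd one, live on that pool:

    zfs list -r -o name,used,mountpoint WD3X5/.system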

First, make a backup of your TrueNAS configuration file.

The following steps could be wrong; they are what I would try myself, but I encourage you to read other forum threads on resolving this kind of problem. You might be able to just delete the files, but I would do a little research first. If someone states that I’m full of crap, then I will strike this post.

Next, you might try moving your System Dataset to the boot-pool and, once that is done, running another scrub on the pool. I don’t know if those files will be deleted automatically, but if not, you can then delete them manually, as sketched below.
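
As a minimal sketch (not tested on your system, so double-check the mountpoint before deleting anything), removing the leftover .rrd files from the shell could look like this, with the dataset name taken from your zpool status output:

    # find where the old rrd dataset is mounted
    MP=$(zfs get -H -o value mountpoint WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0)
    echo "$MP"                              # make sure this is a real path before going further
    # delete only the .rrd files under that mountpoint
    find "$MP" -type f -name '*.rrd' -delete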

After you have cleaned up that mess of files, run a scrub again and make sure no files are listed; you will still have checksum errors at that point.

If you have no further files listed as corrupt, clear the errors using zpool clear poolname and then run another scrub, and all should be good in the world.
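
For reference, on this pool the commands would be along these lines:

    zpool clear WD3X5          # reset the READ/WRITE/CKSUM counters
    zpool scrub WD3X5          # then verify everything with another scrub
    zpool status -v WD3X5      # check again once the scrub has finished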

You may move your System Dataset back to the pool afterwards if you desire. If your hard drives are always spinning anyway, then I personally like having my system dataset located there.


Thanks for the reply. I’m not sure. I will do some googling and work out how to check this. I was not aware I could change this on the HP system, so I will educate myself and do some poking around.

Thanks for the reply. I did some searching yesterday, but not with those exact keywords, so I will have another go tonight after work and see what other info I can find. I had tried a few things that came up in other threads, like moving the system dataset and checking/unchecking ‘syslog’, but not everything in the order you have described, so I will give that a go while I’m poking around for other info.

Thanks for taking the time to reply.

One thing I should have asked: do you have a backup of all your data? If not, can you make one? I recommend it regardless, but with a backup you should be able to wipe it all out and rebuild to recover.
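
If you do want to copy the data off before doing anything drastic, a minimal sketch using ZFS replication would be something like the following; note that WD3X5/media and the destination pool "backup" are placeholder names, not names from your system:

    # snapshot the dataset (and its children), then send the whole thing to another pool
    zfs snapshot -r WD3X5/media@pre-rebuild
    zfs send -R WD3X5/media@pre-rebuild | zfs receive -u backup/media

Since your corrupt files are all under the pool’s .system dataset, a copy of just your data datasets would not need to touch them. Plain rsync or copying over SMB to another machine works too; it is just slower.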

I doubt this is a connectivity problem; not for all drives in the one pool, that doesn’t make sense.
When you are done “fixing it”, if the problem comes back, I would recommend you run MemTest86+ for a few days to make sure nothing fails, then a CPU stress test for at least 30 minutes.
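
If you would rather not install anything extra for the CPU stress test, one rough way to load every core from the TrueNAS shell for a while is OpenSSL’s built-in benchmark (12 workers matches the 6-core/12-thread E5-1650 v3; adjust for your CPU):

    openssl speed -multi 12    # runs 12 parallel benchmark workers across the CPU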

When your system is running, look at SWAP UTILIZATION (Reporting → Memory) and ensure the value for USED: is zero. If there is a non-zero value there, then you have run out of RAM at some point.
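
The same information is available from the shell; swapinfo ships with FreeBSD:

    swapinfo -h    # the "Used" column should stay at (or very near) 0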

Question: You were using FreeNAS, what version? How did you upgrade to TrueNAS 13? And exactly what version are you running right now?

Question: Did you add any VMs, jails, or any new services?

Thanks again.

I don’t have a recent backup. I am planning on doing one. If I make a copy of this pool, will it copy the corrupt files too?

Thanks… I’ll do some looking into that. I have never used MemTest86+ but hopefully I can work it out. I’ll have to do some googling on how to do a CPU stress test too.

I can’t remember which version of FreeNAS I was on… a very old one! I did a clean install of TrueNAS and imported my pools. I’m running TrueNAS CORE 13.0-U6.1, which I think is the latest stable version.

No VMs, the only Jail is Plex and that has been running for about 6 months with no problems. No new services, just SMB and I’ve used that forever.

I’ve done this and the corrupt files haven’t been deleted, so I’ll have a go at manually doing that now and then run another scrub and see what happens.

I didn’t really think those files would just go away; however, since you moved your System Dataset, you should have no issues removing those files and cleaning up this mess. Leave the System Dataset on the boot-pool for a while and see if all remains good. You can also leave it there permanently; many people do, especially those folks who like to spin down the hard drives when not in use (home media server).

I think TrueNAS 13.0-U6.3 is the latest and you can update to it via the GUI; HOWEVER, if it has been running fine for approximately 6 months, there is no need to update at this time. So don’t mess with it if it’s working (well, once you get those files cleaned up).

Good luck.

The little HP documentation I can find only shows various RAID settings for the SATA controller in the BIOS. RAID mode would be a big no-no!

Thanks again for the info and help.
I just taught myself how to SSH in and also discovered Midnight Commander. That made navigating through and deleting the files a lot easier. I’m very happy that I’m learning a lot through this process. Last night I was pulling my hair out; tonight I feel some sense of achievement.
I’ve just finished deleting all the files. Now, when I run zpool status I get this:

errors: Permanent errors have been detected in the following files:

    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x137>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x138>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x13e>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x13f>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x145>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x146>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x147>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x14c>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x14d>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x153>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x155>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x15a>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x15b>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x15c>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x167>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x169>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x16a>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1ad>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1ae>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1af>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b3>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b4>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b5>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b6>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b7>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b8>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1b9>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1bb>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1bc>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c0>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c1>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c2>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c5>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xc6>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c6>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xc7>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xc8>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1c9>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xca>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xcb>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1cb>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1cd>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xce>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0x1ce>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xcf>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xd0>
    WD3X5/.system/rrd-b17eb1df7fd94a8281208d511a311fb0:<0xd1>

It is the same number of errors as the number of files I deleted, so I’m hoping this is just a remnant of that and these entries will clear out when I run another scrub, which I am about to start.

Thanks for the info; hopefully the advice from joeschmuck works and gets my pool back to a healthy state. Then I’ll do some investigation into what the SATA controller settings are. I’ve had this same HP system running FreeNAS for over 2 years and TrueNAS for over 6 months without any issues, but if it is something I can improve on, I’ll look into it.

Thanks again!
After manually deleting the files and scrubbing overnight, the corrupt files are gone, and running zpool clear has cleared the checksum errors. The pool is now showing as healthy. I’ll keep a close eye on it for the next week while I do more research on setting up appropriate backups and some of the other things I’ve come across while trawling through forum posts trying to resolve this issue.
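
In case it is useful to anyone else following along, zpool status -x is a quick way to keep an eye on things, since it only prints detail when something is wrong:

    zpool status -x WD3X5    # prints "pool 'WD3X5' is healthy" when all is well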
