Hi all,
I’ve been using FreeNAS for about 7 years and TrueNAS for the last 6 or so months. Unfortunately I know next to nothing about it; I just set it up using the how-to guides and have had no problems so far. It is a simple home system with one user (me). I use Plex to watch movies and listen to music, and an SMB share to access documents, and that’s it.
I recently had one of my drives go bad in a pool of 5 x WD Red 3TB in RAIDZ1. I replaced it with a new WD Red 4TB drive (edit: this is a WD WD40EFPX Red Plus, which is CMR; I checked this after finding another post about SMR drives causing issues). I followed the same process as the last time a drive went bad on me, but this time I ran into problems. The system hung several times during resilvering and I had to restart it each time. I did some troubleshooting and looked around the forum, and there was a suggestion that the new drive might be drawing more power than the old one and that the PSU might not be able to handle it. (edit 2: I just found my notes on the errors that made me try unplugging a drive: it was “Low water mark reached. Dropping 100% of metrics” written repeatedly on the monitor screen partway through a resilver.) I physically removed a drive belonging to another pool from the system, and then the resilver did complete overnight. After resilvering, all 4 of the old drives were shown as degraded and the new drive was shown as OK.
I ran a scrub on the pool and now all drives are shown as degraded, with exactly the same number of checksum errors (thousands) on every drive.
All of my data seems to be there and working fine. I ran zpool status -v on the pool, and the corrupt files all seem to be rrd files to do with logging the status of the system. I have no idea what these are or how to clear them. I have searched the forum and tried various solutions from other threads, but I cannot find anything that works.
When I reboot the machine I get a repeated message that fills the whole monitor:
vm_fault: pager read error, pid 2329 (rrdcached)
and then 4 lines of:
truenas.local collectd 2429 - - rrdcached plugin: Failed to connect to RRDCacheD at unix:/var/run/rrdcached.sock: Unable to connect to rrdcached: Connection refused (status=61)
I’ve tried the fixes from the thread “rrdcached plugin: failed to connect” (TrueNAS Community), but they don’t resolve my issue.
Below is the result of the last time I ran zpool status. It previously showed 1000s of checksum errors; it now shows 130, but the count is growing again:
root@truenas[~]# zpool status -v WD3X5
pool: WD3X5
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 140K in 06:26:17 with 56 errors on Sun Jun 16 18:12:30 2024
config:
NAME                                            STATE     READ WRITE CKSUM
WD3X5                                           DEGRADED     0     0     0
  raidz1-0                                      DEGRADED     0     0     0
    gptid/8b60b5dd-2abc-11ef-bf68-480fcf670049  DEGRADED     0     0   130  too many errors
    gptid/2da89670-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors
    gptid/a9f077ad-e778-11ee-89a2-480fcf670049  DEGRADED     0     0   130  too many errors
    gptid/32bbdc95-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors
    gptid/37756027-7a51-11e7-b34d-001bfcee4a51  DEGRADED     0     0   130  too many errors
errors: Permanent errors have been detected in the following files:
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_busy_percent-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_busy_percent-ada1.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_bw-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_bw-ada1.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada1.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_latency-ada2.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops-ada1.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops_rwd-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_ops_rwd-ada2.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada1.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_queue-ada2.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/load/load.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/memory/memory-free.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/memory/memory-inactive.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_eviction-eligible.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_eviction-ineligible.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_operation-deleted.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_data-miss.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_metadata-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-demand_metadata-miss.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mfu-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mfu_ghost-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mru-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-mru_ghost-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-prefetch_data-miss.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_result-prefetch_metadata-hit.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-arc.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-bonus_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-c.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-dbuf_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-0/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-dnode_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-1/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-10/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mfu_ghost_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-2/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-3/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mru_ghost_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-mru_size.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-p.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/cputemp-6/temperature.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/hash_collisions.rrd
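To double-check my reading that every drive reports the same checksum count and that only rrd files are affected, here is a rough Python sketch that summarizes zpool status -v text. It is only an illustration under assumptions: the function names and the SAMPLE excerpt (a shortened copy of the output above) are made up for this post, the script does not talk to ZFS at all, and it assumes error counts are printed as plain integers.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Shortened excerpt of the output above (only a few of the listed files).
SAMPLE = """\
gptid/8b60b5dd-2abc-11ef-bf68-480fcf670049 DEGRADED 0 0 130 too many errors
gptid/2da89670-7a51-11e7-b34d-001bfcee4a51 DEGRADED 0 0 130 too many errors
errors: Permanent errors have been detected in the following files:
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/geom_stat/geom_bw-ada0.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/load/load.rrd
/var/db/system/rrd-b17eb1df7fd94a8281208d511a311fb0/localhost/zfs_arc/cache_size-arc.rrd
"""

# Device rows: name, state, then READ / WRITE / CKSUM as plain integers (assumption).
DEVICE_RE = re.compile(r"^\s*(gptid/\S+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\d+)")


def checksum_counts(text):
    """Map each gptid device to its CKSUM column value."""
    counts = {}
    for line in text.splitlines():
        match = DEVICE_RE.match(line)
        if match:
            counts[match.group(1)] = int(match.group(5))
    return counts


def damaged_files(text):
    """Return the absolute paths listed under the permanent-errors heading."""
    return [ln.strip() for ln in text.splitlines() if ln.strip().startswith("/")]


def summarize(text):
    """Return (per-device CKSUM counts, files per directory, only-rrd flag)."""
    files = damaged_files(text)
    by_dir = Counter(PurePosixPath(f).parent.name for f in files)
    only_rrd = bool(files) and all(
        f.startswith("/var/db/system/rrd-") and f.endswith(".rrd") for f in files
    )
    return checksum_counts(text), by_dir, only_rrd


if __name__ == "__main__":
    counts, by_dir, only_rrd = summarize(SAMPLE)
    print("distinct CKSUM values:", sorted(set(counts.values())))
    print("damaged files per directory:", dict(by_dir))
    print("only rrd files affected:", only_rrd)
```

To run it on the full output instead of the excerpt, I would save the text with zpool status -v WD3X5 > status.txt and pass the contents of that file to summarize().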
I’ve checked all the cabling to the drives. All seems ok.
I’ve tried clearing the errors and doing another scrub but they just start coming back again.
I’ve also tried stopping and starting the rrdcached service.
I’m really at a loss and not sure where to go from here.
Can I delete these files with errors? How do I do that? I don’t even know how to reach these folders other than through the shell in the GUI.
Will restoring my system config from a backup help?
Can I safely back up the data that is currently on this pool? There is nothing super critical on there, just media files and working documents. I have a backup, but it is quite out of date, so I’d prefer not to lose everything that is currently in the pool; it would be quite a pain.
Any ideas, assistance, help on what to try next would be greatly appreciated.
The system is an HP Z440 with an Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz and 16GB RAM. I don’t know much more about it.
There are 2 pools: one is just a single 8TB drive, the other is the 5 x WD Red drives. Having the 8TB drive installed or not doesn’t seem to make a difference. All drives are running straight off the motherboard. I’m running TrueNAS CORE 13.0-U6.1 off a USB stick. I know this isn’t recommended, but it is how I always ran FreeNAS, and I haven’t gotten around to working out how to migrate to an SSD. I’ve run out of SATA ports on the board, so I’m not exactly sure how I’m going to do this. It was the next item on the list once I had replaced this faulty drive and got the pool healthy again.
edit 3: I should also mention that there is no reporting data at all available. All of the graphs in the GUI under “reporting” are just blank.