I am experiencing an error in my boot pool. As far as I remember, I had this or a similar error a while back and decided to reinstall (when I was still on CORE).
Any ideas on how I can debug the cause?
See details below:
pool: boot-pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 2K in 00:00:18 with 2 errors on Wed Feb 22 03:45:18 2023
config:
NAME         STATE     READ WRITE CKSUM
boot-pool    ONLINE       0     0     0
  sda2       ONLINE       1     0     0
errors: Permanent errors have been detected in the following files:
<0x107>:<0x1185>
<0x107>:<0x118b>
/var/db/system/rrd-6c934beb66de482e8faef2d3b30acc82/localhost/df-mnt-pool0-iocage/df_complex-used.rrd
/var/db/system/rrd-6c934beb66de482e8faef2d3b30acc82/localhost/df-mnt-pool0/df_complex-reserved.rrd
/var/db/system/rrd-6c934beb66de482e8faef2d3b30acc82/localhost/df-mnt-pool0-iocage-images/df_complex-used.rrd
/var/db/system/rrd-6c934beb66de482e8faef2d3b30acc82/localhost/df-mnt-pool0-iocage-log/df_complex-free.rrd
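For anyone who wants to check this kind of output automatically, here is a rough sketch that picks out the devices with non-zero READ/WRITE/CKSUM counters from the config table of a `zpool status` report. The function name `device_errors` and the `SAMPLE` text are made up for illustration (the sample copies the three table rows posted above), and the sketch assumes the counters are plain integers, so abbreviated counts such as `1.2K` would be skipped.

```python
import re

# Matches a config row: NAME STATE READ WRITE CKSUM, with integer counters.
# The header row does not match because its last three fields are not digits.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def device_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with any non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        counts = (int(read), int(write), int(cksum))
        if any(counts):
            flagged[name] = counts
    return flagged


# Sample input: the config table from the report above.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
boot-pool ONLINE 0 0 0
sda2 ONLINE 1 0 0
"""

print(device_errors(SAMPLE))  # {'sda2': (1, 0, 0)}
```

Here only `sda2` is flagged, with one read error, which points at the boot device (or its cable, port, or adapter) rather than at the pool as a whole.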
It is unclear from your message whether the reinstall was to a fresh device, but assuming this is a repeat error, I’d try one or more of the following. All start with saving the config. While the problem could be RAM or controller issues, I think it is most likely device- or cable-related, so you’ll want to try things that eliminate the device, cable, or port.
1. Reinstall, ensuring you choose the format device option,
2. do #1 after moving to a different port/cable, or
3. install to a fresh device on another port with a different cable.
Good luck.
John
Sorry for not being precise. I simply reinstalled (same hardware) and restored the config I had backed up.
That’s what I did again today. I wasn’t entirely happy with how the migration from Core to Scale went, so I simply reinstalled Scale, redid my setup, and will now monitor whether the errors happen again.
Always keep an up-to-date backup of your configuration at hand. Could you please list your hardware so we can understand whether you are using ECC RAM, what kind of device you are using as your boot drive, and how you are connecting your drives?
HP ProLiant MicroServer
CPU - AMD Turion II Neo N40L Dual-Core Processor
RAM - 16GB ECC UDIMM
Drives - 4x SATA for data pool, 1x SATA for boot pool (drive specs above)
What is on this hardware? Just the system pool or anything else?
If anything else on this SSD, is it backed up to the HDD?
Assuming that this is only the system pool, or that anything else on it is backed up, I would cut my losses and replace the SSD with a new one, reinstall the OS, restore the system config, restore any other data needed on the SSD, and get back to being able to sleep at night.
You can then examine the SSD on another computer at your leisure to determine whether it is failing or not.
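As a starting point for that examination, here is a rough sketch that summarizes a SMART report in JSON form, such as recent smartmontools versions can produce with `smartctl --json -a /dev/sdX`. Everything here is an assumption for illustration: the function name `summarize_smart` is made up, the `SAMPLE_REPORT` values are invented and are not from the drive in this thread, and the attribute IDs follow common convention only, since vendors may define them differently.

```python
import json

# Conventional meanings of a few SMART attribute IDs (assumption: vendors
# may name or define these differently, so treat the labels as hints).
WATCHED_IDS = {
    5: "reallocated sectors",
    197: "pending sectors",
    199: "interface CRC errors",
}


def summarize_smart(report):
    """Pull overall health, power-on hours and non-zero watched attributes."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    nonzero = {}
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED_IDS and raw > 0:
            nonzero[WATCHED_IDS[attr["id"]]] = raw
    return {
        "passed": report.get("smart_status", {}).get("passed"),
        "power_on_hours": report.get("power_on_time", {}).get("hours"),
        "nonzero": nonzero,
    }


# Invented sample data, shaped like smartctl's JSON output.
SAMPLE_REPORT = json.loads("""
{
  "smart_status": {"passed": true},
  "power_on_time": {"hours": 120},
  "ata_smart_attributes": {"table": [
    {"id": 5, "raw": {"value": 0}},
    {"id": 199, "raw": {"value": 7}}
  ]}
}
""")

print(summarize_smart(SAMPLE_REPORT))
```

A rising interface CRC error count with no reallocated or pending sectors would usually point at the cable, port, or adapter rather than the flash itself, which is why it is worth checking before writing the drive off.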
NOTE: I know that having additional pools / partitions on the system drive is unsupported, but that doesn’t mean people don’t do it. When the system drive needs to be 16GB and the smallest SSD is 128GB, there is an incentive to make use of the additional space even if it is frowned upon.
The system is installed on an M.2 SATA SSD that I had spare when I built this system a few years back. However, since the board doesn’t have any M.2 slots, I purchased a cheap M.2 to SATA adapter. The M.2 SSD sits in the adapter, which is connected to the SATA port.
The M.2 SSD is an INTEL SSDSCKKF240H6L (“Intel Pro 5400s”)
That’s what I think too. I will be swapping it for a different SSD in the next few days and will report back.
Yes, it doesn’t have many hours on it. This is not a 24/7 system that serves a lot of services. It’s a pure data backup system that I turn on once a month, run my backups, and turn off again.
But I very much doubt Intel would take back a 2019 drive, even though it has so few power-on hours on it.
The problem is the sh***y case/mobo I am using… it has only 5 SATA ports, and I am using 4 for data disks, and one for system drive. I guess I need to upgrade to a better case…