Write errors during pool extend - what do I do now?

I spoke too soon. I should’ve known not to say anything until the scrub was done.

There are now 600+ checksum errors on every single drive, except the very first drive that failed (drive 7), which has over 9000 checksum errors.

admin@DOLLY[~]$ sudo zpool status -LP DOLLY-ARRAY
[sudo] password for admin: 
  pool: DOLLY-ARRAY
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 200K in 00:00:01 with 0 errors on Thu Mar 27 13:16:17 2025
expand: expanded raidz2-0 copied 10.8T in 2 days 11:09:15, on Thu Mar 27 02:39:31 2025
config:

        NAME           STATE     READ WRITE CKSUM
        DOLLY-ARRAY    ONLINE       0     0     0
          raidz2-0     ONLINE       0     0     0
            /dev/sdf1  ONLINE       0     0 9.03K
            /dev/sdd1  ONLINE       0     0   648
            /dev/sdh1  ONLINE       0     0   661
            /dev/sde1  ONLINE       0     0   632
            /dev/sdg1  ONLINE       0     0   670
            /dev/sdj1  ONLINE       0     0   653
            /dev/sdc1  ONLINE       0     0   691
            /dev/sdk1  ONLINE       0     0   612
        cache
          /dev/sdi1    ONLINE       0     0     0

errors: No known data errors
admin@DOLLY[~]$ 

Since the pool is still online and reports “no known data errors”, I assume that means all the data is still intact, thanks to the raidz2 parity across the drives?

I don’t understand how this happened. Any ideas would be great.

Cable issue, overheating controller, outdated controller firmware…
If you’re running the drives directly on the motherboard, the result is very puzzling.

Very bad RAM?


So, no ECC … RAM status is anyone’s guess. You can get memtest86+ (the FOSS version), write it to a USB stick, and run it in a continuous loop for 5 days. If it comes back clean after 5 days, the RAM is probably fine.
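In case it helps, writing the memtest86+ boot image to a flash drive is just a raw copy with `dd`. This is a minimal sketch: `memtest.img` and `/dev/sdX` are placeholders for the image you actually download and the real device node of your USB stick.

```shell
# Write the memtest86+ bootable image to a USB stick.
# WARNING: this destroys everything on the target device.
# "memtest.img" and /dev/sdX are placeholders -- substitute the image you
# downloaded and the actual USB device node (double-check with lsblk).
lsblk                                   # identify the USB stick, e.g. /dev/sdX
sudo dd if=memtest.img of=/dev/sdX bs=4M status=progress conv=fsync
sync
# Reboot, pick the USB stick from the boot menu, and let the test loop.
```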

If it was RAM that’s great, you found an issue and can fix it.

Beyond that it gets a bit tough … heat overall? BIOS? Faulty motherboard? PSU?

Were the drives burnt in with a burn-in script and badblocks before being added?
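For reference, a typical destructive badblocks burn-in pass looks something like the sketch below. `/dev/sdX` is a placeholder for the drive under test, and the `-b 4096` block size is an assumption that should match the drive’s sector size.

```shell
# DESTRUCTIVE surface test -- this wipes the drive. Only run it on new or
# spare disks BEFORE they join the pool, never on a pool member.
# /dev/sdX is a placeholder for the drive under test.
sudo smartctl -t long /dev/sdX          # start a SMART extended self-test
sudo badblocks -b 4096 -wsv /dev/sdX    # write/read/verify passes, 4K blocks
sudo smartctl -a /dev/sdX               # then check reallocated/pending sectors
```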

Hm, a memory issue would be easy to diagnose. I’ll run memtest next chance I get. The system had been running fine until these drives were added, so I’m not sure. The only thing I can think of: I had to tape over the SMBus pins on the HBA, so maybe there is occasional contact causing SMBus conflicts with the memory? Is that possible?

A controller issue sounded likely, so I looked into it further. It turns out the motherboard has two controllers: six of the SATA ports come directly from the chipset, and the other two are on an ASMedia ASM1061. My eight drives are spread across both, so I’m not sure the controllers are to blame.

I want to make sure the data on the drives is intact. Once whatever problem is causing these errors is solved, will I be able to rebuild the erroneous data using the parity from the other drives? Did the zfs scrub already take care of that, and it’s just alerting me that it happened?

The last scrub reported that, so far, ZFS has maintained data integrity despite the many checksum errors. But if every drive keeps throwing errors, something will eventually be lost.
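Once the underlying hardware cause is fixed, the usual sequence (per the ZFS-8000-9P message in your status output) is to reset the error counters and run a fresh scrub. If that scrub finishes with zero checksum errors and still reports “No known data errors”, ZFS repaired everything from parity:

```shell
# After the suspected hardware issue has been fixed:
sudo zpool clear DOLLY-ARRAY            # reset the READ/WRITE/CKSUM counters
sudo zpool scrub DOLLY-ARRAY            # re-verify every block against its checksum
sudo zpool status -v DOLLY-ARRAY        # watch progress; counters should stay at 0
```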