Scale crashes while resilvering

SCALE Dragonfish 24.04.2.2 (also happened on 24.04.2)
128 GB memory
boot-pool: 1 SSD
app-pool: 2 SSD mirror
store: 9 x 8TB HDD

I replaced one 8TB disk because it reported 8 offline uncorrectable sectors.

During resilvering my system crashes. Nothing in the logs, nothing on screen, it goes straight to a reboot.
This has happened more than 10 times. Most of the time it reached 6 or 7%, sometimes 10%, and occasionally more than 15%.

I have changed disks before, no issues then with resilvering

I pulled half my memory, same issue.
I pulled the rest and put the first half back in, same issue.
I swapped my GPU for a simpler one, same issue.
I stopped all VMs and containers, same issue.
I changed /sys/module/zfs/parameters/zfs_scan_checkpoint_intval to 600 in the hope that the resilver would resume after a crash, but it always restarted.
I removed the UPS the server was plugged into, same issue.
I have an 850W PSU.
The metering software on my plug shows no values over 250W.
Nothing feels hot, and all fans are working.
Without resilvering, the server runs for weeks under heavy load without issue.
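
For anyone who wants to check or change that tunable from a script, here is a minimal Python sketch. The sysfs path and the parameter name are the ones quoted above; the helper names `read_zfs_param` and `write_zfs_param` are made up for illustration, and writing to the real path requires root.

```python
from pathlib import Path

# Directory where the ZFS kernel module exposes its tunables on Linux.
ZFS_PARAM_DIR = Path("/sys/module/zfs/parameters")


def read_zfs_param(name, param_dir=ZFS_PARAM_DIR):
    """Return the current value of a ZFS module parameter as a string."""
    return (Path(param_dir) / name).read_text().strip()


def write_zfs_param(name, value, param_dir=ZFS_PARAM_DIR):
    """Set a ZFS module parameter and return the value read back."""
    (Path(param_dir) / name).write_text(f"{value}\n")
    return read_zfs_param(name, param_dir)


# Example (hypothetical usage, needs root on a real system):
#   write_zfs_param("zfs_scan_checkpoint_intval", 600)
```

Note that a value written through sysfs only lasts until the next reboot or module reload. After a crash the tunable is back at its default (7200 seconds in current OpenZFS, as far as I know) and has to be set again.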

Please advise.

Tell us about the HBA?

No real HBA.
6 SATA ports on the motherboard
2 extra PCI cards with SATA ports

They most likely crap out during resilvering.

I have to say, while answering that I also thought this could be the issue. The new drive is connected to one of those PCI cards. I will swap it with a drive that is on a mobo port and try again.

Resilvering puts stress on ALL the drives. So the issue might persist.

I know, but I have to try everything at this point.
I expect most of the writes go to the new disk.

fingers crossed

Yeah but a controller that is overheating can write garbage to your discs, corrupting the pool.

Better get a proper HBA. They are like 30 bucks on ebay these days.

But lots of reads from the old discs…
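
To see how the load is actually spread across the disks during a resilver, one option is to compare two snapshots of /proc/diskstats. Below is a rough Python sketch, assuming a Linux system such as SCALE. The function names are hypothetical, and the field positions follow the kernel's documented diskstats layout, where sectors are counted in 512-byte units.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats(text):
    """Return {device: (bytes_read, bytes_written)} from /proc/diskstats text."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # skip blank or truncated lines
        name = fields[2]                 # device name, e.g. "sda"
        sectors_read = int(fields[5])    # sectors read since boot
        sectors_written = int(fields[9]) # sectors written since boot
        stats[name] = (sectors_read * SECTOR_BYTES,
                       sectors_written * SECTOR_BYTES)
    return stats


def io_delta(before, after):
    """Bytes read and written per device between two snapshots."""
    return {
        dev: (after[dev][0] - before[dev][0],
              after[dev][1] - before[dev][1])
        for dev in after
        if dev in before
    }


def snapshot(path="/proc/diskstats"):
    """Read and parse the live counters."""
    with open(path) as f:
        return parse_diskstats(f.read())
```

Taking one `snapshot()`, waiting ten seconds or so, taking another and passing both to `io_delta` should show large read deltas on the old disks and a large write delta on the replacement, which is why every controller in the path gets stressed.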

I just ordered an LSI 2940. Even if my port swap works, I will replace my PCI cards with it.

thanks for your advice

Don't forget to flash it to IT mode. Good luck 👍

Just to inform you: my system just crashed again. Resilvering was at 10%.
My HBA will arrive within 10 days.
My data is safe. It is a RAIDZ2.
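
As a quick sanity check on that redundancy claim, here is the arithmetic as a small Python sketch. It assumes `store` is a single 9-wide RAIDZ2 vdev (the layout is not spelled out in this thread) and that the disk being replaced counts as unavailable. The function name is made up for illustration.

```python
def raidz_redundancy(disks, parity, unavailable=0):
    """Summarise how much redundancy a single RAIDZ vdev has left."""
    if parity not in (1, 2, 3):
        raise ValueError("RAIDZ parity level must be 1, 2 or 3")
    if disks <= parity:
        raise ValueError("a RAIDZ vdev needs more disks than its parity level")
    if not 0 <= unavailable <= disks:
        raise ValueError("unavailable must be between 0 and the disk count")
    return {
        "data_disks": disks - parity,
        # Further disk losses the vdev can absorb right now.
        "failures_still_tolerated": max(parity - unavailable, 0),
        # True once more disks are gone than parity can cover.
        "data_at_risk": unavailable > parity,
    }


# Assumed layout: 9 disks, RAIDZ2, one disk out while it is being replaced.
status = raidz_redundancy(disks=9, parity=2, unavailable=1)
print(status)
# {'data_disks': 7, 'failures_still_tolerated': 1, 'data_at_risk': False}
```

So under those assumptions the pool can still lose one more disk during the resilver before data is at risk.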

edited: typo

Run the basic stress tests, CPU and RAM. It could be a failing motherboard. Make sure it is stable. And you didn’t specify which motherboard you have. Well, you didn’t specify any component. Shame, as it could possibly help.

You are absolutely right, I am ashamed 😁

My motherboard is an MSI B450-A PRO MAX (I updated the firmware between all my resilvering tests).
My CPU is an AMD Ryzen 7 5700X (8-core).

I did run stress tests when I built the system. It has now been running for almost 18 months, as stable as could be.

For now I will just wait for my HBA to arrive and then try it all again.
The work week is starting, so no more testing/playing.

That was then. How it is now is the real question. I’d still run the stress tests to have confidence something didn’t break.

Just to inform you: I installed a new “Inspur 9211-8i 6Gbps Hba Lsi Fw: P20 IT Modus Zfs Freenas Unraid + 2 * SFF-8087 Sata” from AliExpress.
Resilvering was faster than before and finished in one go.
