Hi, so, I’ve got myself a server which is currently not permanently deployed - I’m still playing around and building trust with it, but in the long term I intend to use it as my personal storage/home server/workhorse for anything I don’t want to stress my poor PC with.
I have noticed that after longer power-off periods (days to weeks) there are always several (1-10) checksum errors on a couple (1-4) of my drives (so far never on all drives in the same mirror vdev, though). I have not noticed any of these appearing while the system is powered on.
If I scrub the pool in this state, sometimes it doesn’t even seem to (?have to?) repair anything; the most recent run, however, repaired 920K. Up until now I’ve just kept clearing the errors, but in the long run that doesn’t seem optimal?
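For reference, this is roughly how I’ve been checking and clearing the counters - the pool name `tank` is a placeholder for my actual pool, and the `awk` one-liner just sums the CKSUM column so small per-drive counts don’t get overlooked:

```shell
# Show per-device read/write/checksum counters plus any affected files
# ("tank" is a placeholder pool name):
zpool status -v tank

# Sum the CKSUM column (5th field) across all device lines of the output:
zpool status tank | awk '$5 ~ /^[0-9]+$/ { sum += $5 } END { print sum+0 " checksum errors total" }'

# Start a scrub, then reset the counters afterwards (what I've been doing):
zpool scrub tank
zpool clear tank
```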
Not sure how concerning/bad/unexpected this is.
Specs/Pool setup:
I’m running TrueNAS SCALE 23.10.2
CPU: EPYC 7352
RAM: 8x 16GB 2400 MT/s ECC (Micron MTA36ASF2G72PZ-2G3B1)
Mobo: Supermicro H12SSL-NT
24 SSDs: 8x Transcend SSD230S, 8x WD Blue SA510, 8x Samsung 870 QVO (yes, I noticed too late that the capacities mismatch a bit, but such is life).
All of them are arranged in an 8-vdev pool, each vdev being a 3-wide mirror containing one drive of each kind.
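In `zpool create` terms the layout looks roughly like this - a sketch only, `tank` and the /dev/sdX names are placeholders, and in practice the stable /dev/disk/by-id paths should be used instead:

```shell
# Hypothetical sketch of the layout: 8 vdevs, each a 3-way mirror with one
# drive of each model (Transcend/WD/Samsung) per mirror. Device names are
# placeholders; use /dev/disk/by-id paths on a real system.
zpool create tank \
  mirror sda sdi sdq \
  mirror sdb sdj sdr \
  mirror sdc sdk sds \
  mirror sdd sdl sdt \
  mirror sde sdm sdu \
  mirror sdf sdn sdv \
  mirror sdg sdo sdw \
  mirror sdh sdp sdx
```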
The SSDs are connected to the 6 backplanes that came with my Inter-Tech 4U-4424 case so that each drive in a vdev is connected to a different backplane.
The backplanes are connected to the mobo via 3x LSI SAS 9207-8i HBAs (in IT/HBA mode), so that every HBA has a connection to each vdev.
I did it this way because it sounded like a good idea for redundancy reasons to me.
Running Memtest against this system for ~3 days reported no issues, so my assumption is that CPU/mobo/RAM are good.
I rechecked all the cabling and didn’t find any issues.
I bought all of the drives new, and the errors seem pretty randomly distributed, so I assume the drives are good as well - it seems unlikely they would all be broken from the factory.
The smartctl output didn’t seem concerning either.
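For completeness, this is roughly how I went through the drives - the attribute filter is just my assumption about which SMART counters matter here (CRC errors in particular would point at cabling/backplanes rather than the media):

```shell
# Dump SMART attributes for every disk and filter for the counters most
# relevant to transport/media problems (device glob is illustrative):
for dev in /dev/sd?; do
  echo "== $dev =="
  smartctl -A "$dev" | grep -Ei 'realloc|crc|uncorrect|wear|pending'
done
```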
Which leaves the cabling/HBAs/backplanes being broken? But then again, since the errors seem to be distributed pretty evenly, all of them would have to be broken, which seems a bit far-fetched.
Not really sure how to approach this at this point - can I get any opinions?