Hi, so, I’ve got myself a server which is currently not permanently deployed - I’m still playing around and building trust with it, but in the long term I intend to use it as my personal storage/home server/workhorse for anything I don’t want to stress my poor PC with.
I have noticed that after longer power-off periods (days to weeks) there are always several (1-10) checksum errors on a couple (1-4) of my drives (so far never on all drives in the same mirror vdev, though). I have not noticed any of these appearing while the system is powered on.
If I scrub the pool in this state, sometimes it doesn’t even seem to (?have to?) repair anything; the most recent run, however, repaired 920K. Up until now I’ve just kept clearing the errors, but in the long run that doesn’t seem optimal?
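For reference, this is roughly how I’ve been checking and clearing the counters - the pool name `tank` is a placeholder for my actual pool, and the `awk` one-liner just sums the CKSUM column so small per-drive counts don’t get overlooked:

```shell
# Show per-device read/write/checksum counters plus any affected files
# ("tank" is a placeholder pool name):
zpool status -v tank

# Sum the CKSUM column (5th field) across all device lines of the output:
zpool status tank | awk '$5 ~ /^[0-9]+$/ { sum += $5 } END { print sum+0 " checksum errors total" }'

# Start a scrub, then reset the counters afterwards (what I've been doing):
zpool scrub tank
zpool clear tank
```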
Not sure how concerning/bad/unexpected this is.
Specs/Pool setup:
I’m running TrueNAS SCALE 23.10.2
CPU: EPYC 7352
RAM: 8x 16GB 2400 MT/s ECC (Micron MTA36ASF2G72PZ-2G3B1)
Mobo: Supermicro H12SSL-NT
24 SSDs: 8x Transcend SSD230S, 8x WD Blue SA510, 8x Samsung 870 QVO (yes, I noticed too late that the capacities mismatch a bit, but such is life).
All of them are arranged in an 8-vdev pool, each vdev being a 3-wide mirror containing one drive of each kind.
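In `zpool create` terms the layout looks roughly like this - a sketch only, `tank` and the /dev/sdX names are placeholders, and in practice the stable /dev/disk/by-id paths should be used instead:

```shell
# Hypothetical sketch of the layout: 8 vdevs, each a 3-way mirror with one
# drive of each model (Transcend/WD/Samsung) per mirror. Device names are
# placeholders; use /dev/disk/by-id paths on a real system.
zpool create tank \
  mirror sda sdi sdq \
  mirror sdb sdj sdr \
  mirror sdc sdk sds \
  mirror sdd sdl sdt \
  mirror sde sdm sdu \
  mirror sdf sdn sdv \
  mirror sdg sdo sdw \
  mirror sdh sdp sdx
```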
The SSDs are connected to the 6 backplanes that came with my Inter-Tech 4U-4424 case so that each drive in a vdev is connected to a different backplane.
The backplanes are connected to the mobo via 3x LSI SAS 9207-8i HBAs (in IT/HBA mode), so that every HBA has a connection to each vdev.
I did it this way because it sounded like a good idea for redundancy reasons to me.
Running Memtest against this system for ~3 days reported no issues, so my assumption is that CPU/mobo/RAM are good.
I rechecked all the cabling and didn’t find any issues.
I bought all of the drives new, and the errors seem pretty randomly distributed, so I assume the drives are good as well - it seems unlikely they would all be broken from the factory.
The smartctl output didn’t seem concerning either.
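For completeness, this is roughly how I went through the drives - the attribute filter is just my assumption about which SMART counters matter here (CRC errors in particular would point at cabling/backplanes rather than the media):

```shell
# Dump SMART attributes for every disk and filter for the counters most
# relevant to transport/media problems (device glob is illustrative):
for dev in /dev/sd?; do
  echo "== $dev =="
  smartctl -A "$dev" | grep -Ei 'realloc|crc|uncorrect|wear|pending'
done
```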
Which leaves the cabling/HBAs/backplanes being broken? But then again, since the errors seem to be distributed pretty evenly, all of them would have to be broken, which seems a bit far-fetched.
Not really sure how to approach this at this point - can I get any opinions?