WRITE FPDMA QUEUED issue with new motherboard and 2 different SATA DOMs

UPDATE - SOLVED

Leaving up for folks who may have the same issue in the future.

Turns out both of the SATA DOMs are bad. I tried a new SSD on both boards and… the issue was the drives all along.
If you’re looking at this in the future trying to solve the same issue: download a live distro (I used “hrmph”, funny name but loaded with useful utils) or install TrueNAS on another disk, and check SMART. Both drives had an ungodly number of errors. I only grabbed this one screenshot. If it had been one drive I would not have been tripped up, but with both failing I assumed it must be something else.
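For anyone who would rather script the SMART check than eyeball it, here is a minimal Python sketch. It assumes smartmontools is installed and that the drive reports the usual ATA attribute table from `smartctl -A`. The helper names (`parse_attributes`, `suspicious`, `read_smart`) and the list of watched attribute IDs are my own choices for illustration, not anything TrueNAS or Supermicro prescribes, and vendors do not all report the same attributes.

```python
import subprocess

# Attribute IDs that are commonly worth a look on ATA drives.
# This list is an assumption; vendor support for each ID varies.
WATCHED = {
    5: "reallocated sectors",
    197: "pending sectors",
    198: "offline uncorrectable",
    199: "interface CRC errors",
}


def parse_attributes(text):
    """Parse the table printed by `smartctl -A` into {id: (name, raw_value)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit check.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            attrs[int(fields[0])] = (fields[1], int(raw) if raw.isdigit() else raw)
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is a nonzero integer."""
    return {
        attr_id: attrs[attr_id]
        for attr_id in WATCHED
        if attr_id in attrs
        and isinstance(attrs[attr_id][1], int)
        and attrs[attr_id][1] > 0
    }


def read_smart(device):
    """Run `smartctl -A` on a device (needs root) and parse its attribute table."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_attributes(result.stdout)
```

Run as root, something like `print(suspicious(read_smart("/dev/sda")))` lists the watched counters that are above zero. A nonzero CRC error count tends to point at the cable or port, while reallocated or pending sectors point at the drive itself, which is the distinction I was missing here.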

Back on Dec 31 of last year, I woke up to my TrueNAS box locking up and refusing to boot (link to old thread). I was getting errors on both of my SATA DOMs on all SATA ports (these have external power cables). I figured it was a bad board but, long story short, I couldn’t replace the board until recently. This morning I tried to boot from the SATA DOM that was my boot device (the other was a spare) and got a checksum error. I was thinking “the port probably borked the data, I have the config file, I will reinstall”. Immediately I got a “failed command: WRITE FPDMA” error, just like on the old board. No big deal, I have the other drive. The install was going fine until at one point I got the “failed command: WRITE FPDMA QUEUED” error again. It was much further into the install, and after throwing the error twice it hit “COMRESET failed (errno=-16)” before continuing. Once it finished I tried booting and it acted like there was no valid image on the boot media. So I thought I would try one more time and got “I/O error, dev sda, sector 1055248 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2” before the installer even started (second attached photo).
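If you can save the console output or dmesg to a file, tallying the error types makes it easier to compare one drive or port against another. Below is a rough Python sketch that counts the three kinds of messages I ran into. The sample lines are the messages from this post with the transcription typos fixed; the pattern names and the `summarize` helper are made up for illustration, and real kernel lines carry extra prefixes (timestamps, port numbers) that these patterns simply ignore.

```python
import re
from collections import Counter

# One capture group per pattern: the detail worth grouping by.
PATTERNS = {
    "failed_command": re.compile(r"failed command: ([A-Z0-9 /]+)"),
    "comreset": re.compile(r"COMRESET failed \(errno=(-?\d+)\)"),  # errno 16 is EBUSY on Linux
    "io_error": re.compile(r"I/O error, dev (\w+), sector \d+"),
}


def summarize(log_text):
    """Count occurrences of each (error kind, detail) pair in a kernel log."""
    counts = Counter()
    for line in log_text.splitlines():
        for kind, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(kind, match.group(1).strip())] += 1
    return counts


# Sample input built from the messages quoted in this post.
LOG = """\
failed command: WRITE FPDMA QUEUED
failed command: WRITE FPDMA QUEUED
COMRESET failed (errno=-16)
I/O error, dev sda, sector 1055248 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 2
"""

for (kind, detail), n in summarize(LOG).items():
    print(f"{kind:15} {detail:20} x{n}")
```

If the same device name keeps showing up no matter which port or board it is plugged into, that is a decent hint the drive is the common factor.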
I am a little perplexed. It seems unlikely that 2 drives (including 1 not in use) that I know were good died at the same time, but it’s possible. I would also be surprised if both boards had exactly the same issue (drives show up, the primary SATA DOM fails immediately, the other one takes longer), or if it were some combination of the two. It’s such a specific problem that I am not quite sure how to isolate it. If I only had 1 disk, it would be obvious what the issue was, but having 2 that I tested when I bought them fail at the same time is strange, though not impossible. In theory I can use my PC to diagnose the drives, but I would like to avoid that if I can (it is a very tight fit with the 4090 in the small case I have). I think there may be no way around it, though.
I guess what I am asking is: what suggestions do you have for troubleshooting/isolating the problem? The DOMs do not support SMART at all (I figured there might be something helpful in there), and I can’t find any similar tool for them in Supermicro’s documentation. If the new board is bad, I have 30 days to exchange or return it. If I need a new drive I can get one ordered, but I don’t love the idea of firing the parts cannon at this problem.
If I do need a new boot disk, any suggestions for reliable smaller SSDs? Large consumer drives are pretty cheap, but I think I would rather have a 32GB or 64GB disk with high endurance (even used). Maybe with wear leveling a 512GB drive would be similarly durable, but I want to avoid this issue again.

System Specs (I realize some may not be relevant but just want to make sure I am giving any info that might help):
TrueNAS SCALE 23.10.1
Supermicro X10DRL-i (on my second one)
Xeon E5-2660 v3
64GB (4x16GB) ECC 2133 MHz Samsung DDR4 (M393A2G40DB0)
LSI SAS3008 HBA card (data pool hooked up to this)
5 HGST He12 Drives

Thanks in advance for your help!

Updates: the disk that seems to always throw the FPDMA error passed an entire run of diskscan. Running the other drive through now. Next will be sdparm. I am guessing this test never issues the same command, though, so that could be why it passed. The first disk (/dev/sda, photo attached) is 64GB and took nearly 33 min. The second (/dev/sdb, photo attached) is 16GB and took 4 minutes. Not sure how useful either result is, but this issue is so weird that I am having a hard time knowing what to do without a second working system to test components on.
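To put those two scan times on the same footing, here is the quick arithmetic as a Python snippet. It assumes the nominal capacities (64GB and 16GB, with 1 GB = 1000 MB) and the rough times I noted, so treat the results as ballpark figures only.

```python
def throughput_mb_s(capacity_gb, minutes):
    """Average scan rate in MB/s, taking 1 GB = 1000 MB."""
    return capacity_gb * 1000 / (minutes * 60)


# Nominal capacity and approximate diskscan duration for each drive.
for dev, gb, minutes in [("/dev/sda", 64, 33), ("/dev/sdb", 16, 4)]:
    print(f"{dev}: {throughput_mb_s(gb, minutes):.1f} MB/s")
# /dev/sda: 32.3 MB/s
# /dev/sdb: 66.7 MB/s
```

So the 64GB drive scanned at roughly half the average rate of the 16GB one. That could just be a difference between the two models, but a drive that is retrying reads would also show up as a slower scan.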


Screenshot 2024-07-28 at 3.12.31 PM
diskscan /dev/sdb result:

diskscan /dev/sda result:
