Is my backplane bad? More R/W errors during Extended SMART test leading to faulted drives, but SMART continues to pass

Getting this out of the way: I am RMA’ing all 4 drives that have so far faulted in TrueNAS despite weekly healthy Extended test results. That said, please read on.

In my last thread, I was seeking advice regarding a couple drives that faulted, but with passing SMART tests. I know the prevailing advice on the sub is to replace the drives, regardless of SMART, since it isn’t the final answer for drive health. I am in the process of replacing them.

Despite that, I’m concerned that this will continue happening, and I wonder if it’s the backplane that I’m using (Supermicro BPN-SAS3-826EL-- see sig for full build for “Lily”). Logically, this could make sense: if the drives are reporting that they’re fine (repeatedly), and the drives are being marked as faulted due to read/write errors in TrueNAS, then it should be possible that the problem lies in the transmission (i.e. the cables or the backplane)-- right?

The problem is I don’t know how to adequately test this. I could replace the two Mini-SAS HD SFF8643 cables that transport the data from all 12 bays (which came with the case), but beyond that… what’s the simplest way to rule out the backplane/cables without destroying my zpool? I do have that second server in my sig that I could temporarily commandeer. The problem is that these errors have only happened twice since building the server in June 2024-- once right after building (2024-06-06) and once yesterday (2024-08-05). The infrequency of the errors, despite daily Short and weekly Long SMART tests, makes this difficult to diagnose.

Open to all ideas. Thanks in advance!

A SMART test result is not conclusive evidence, but the data obtained from those tests usually is. Without looking at the SMART data (I wasn’t able to find any in the linked thread), any suggestion regarding the drives would be a shot in the dark, given that no SMART tests have failed. SMR drives are known to cause similar issues, but I see you are using Exos.

Eh, lots of troubleshooting.

A good idea.

I would say putting the drives in a different system with different hardware, or putting different drives in the problematic system… then stress testing the hardware with @jgreco’s solnet-array-test (TrueNAS Community).

The issue is, there is no easy or short way to troubleshoot this. It could also be a power-related issue (i.e. faulty power cables, a faulty PSU, the drives not getting the clean power they require, etc…).

Heck, could also be some bit flipping in the RAM since it appears from your signature that Lily is not using ECC.

The hard part is to identify the correct troubleshooting steps and decide which goes before the others in order to save time, money, and effort: it is as rewarding as it’s frustrating.

Please provide the required information and maybe we can save you some pain :slight_smile:

Ultimately, you might just want to wait and see if the empire issue strikes again.

I also didn’t see smart data. Nor what type of drives they are :slight_smile:

If they are SATA drives, there is normally a UDMA CRC error field. If that is non-zero, or increases when you see checksum errors, it is indicative of a cabling/backplane/power issue.
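To make checking the UDMA CRC field mentioned above a bit less manual, here is a minimal sketch that pulls the raw value out of saved `smartctl -A` output. It assumes the usual ten-column attribute table and that the counter is attribute ID 199 (`UDMA_CRC_Error_Count`), which is common on SATA drives but not guaranteed for every model. The function name and the sample rows are made up for illustration; they are not data from this thread.

```python
import re

# Assumption: attribute 199 is UDMA_CRC_Error_Count, as on most SATA drives.
CRC_ATTRIBUTE_ID = 199


def udma_crc_count(smartctl_output: str):
    """Return the raw UDMA CRC error count, or None if the attribute is absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with the numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(fields) >= 10 and fields[0] == str(CRC_ATTRIBUTE_ID):
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


# Hypothetical sample in the layout `smartctl -A` prints (illustrative only).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       7
"""

print(udma_crc_count(SAMPLE))
```

Saving the output of `smartctl -A` for each drive before and after a scrub and comparing the two numbers is the useful part: a count that keeps climbing points at the link (cable, backplane, power) rather than the platters.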

And yes, these issues can also be caused by poor power cabling or a failing power supply.

And this is normally the case when you see multiple failures like this.

It sounds like you are confusing hard drive failure with ZFS problems.

A degraded pool generally comes from data not being recorded, or being dropped or scrambled, going to/from the hard drive. It does not necessarily mean that the drive recorded data incorrectly or that it is failing. Sure, there are exceptions, but this does not sound like a hard drive failure. It sounds more like your other hardware.
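One way to see which devices ZFS itself is unhappy with is to look at the per-device READ/WRITE/CKSUM counters in `zpool status`. The sketch below is only an illustration of reading that table: it assumes the standard `NAME STATE READ WRITE CKSUM` header and plain integer counters (abbreviated counts such as `1.2K` are skipped, so exact numbers are needed). The function name, pool name, and sample output are hypothetical and not taken from this thread.

```python
def device_errors(zpool_status_output: str):
    """Map device name -> (read, write, cksum) for rows with any non-zero counter."""
    errors = {}
    in_config = False
    for line in zpool_status_output.splitlines():
        fields = line.split()
        # The device table starts right after this header row.
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        counters = fields[2:5]
        # Skip anything that is not a plain integer triple (e.g. the "errors:" line).
        if not all(value.isdigit() for value in counters):
            continue
        read, write, cksum = (int(value) for value in counters)
        if read or write or cksum:
            errors[fields[0]] = (read, write, cksum)
    return errors


# Hypothetical sample in the layout `zpool status` prints (illustrative only).
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

    NAME        STATE     READ WRITE CKSUM
    tank        DEGRADED     0     0     0
      raidz2-0  DEGRADED     0     0     0
        da0     ONLINE       0     0     0
        da1     FAULTED     14    52     0  too many errors

errors: No known data errors
"""

print(device_errors(SAMPLE))
```

If the devices that show up here sit on the same cable, the same backplane port group, or the same power lead, that pattern is a stronger hint than any single drive's SMART result.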

As others have said, you did not post SMART data, so we are guessing here. However, unless TrueNAS is throwing you a drive error like “UDMA CRC Errors on /dev/ada2” or “Sector error from 0 to 1 on /dev/ada0”, something like that, I doubt your drives are at fault.

The title of this thread states you have R/W errors during Extended SMART testing. What did you receive to make you come to this conclusion? I ask not to make you prove anything but because if there is some drive failure I’m not aware of, I want to know.

How much system testing did you do before putting your system online? For stability purposes. And I don’t want to remind you about your non-ECC RAM (woops). And do you need an online spare with a RAIDZ2? My advice, remove the spare. And do you need the L2ARC? You might not so I’d remove that as well. Just some friendly advice, you do not have to take it, that is fine. But if you can simplify then troubleshooting your problems will be a little easier.

As for troubleshooting, stress test your systems. Look at the backplane, as I have seen others have problems with backplanes.

Time for me to go plug a tire and hope the plug holds air.
