One out of 4 drives keeps dropping out?

T0meo · June 19, 2024, 1:50pm

Hello.

I haven’t seen this issue anywhere yet, thus I decided to create an account and post it.

I did a TrueNAS Scale build 3 or so months back. Since then, I’ve had one prominent issue: one of the 4 drives keeps dropping out. What do I mean by dropping out? The drive in question randomly spins down with the sound typical of unplugging power from it: a loud, short screech and out. The drive does work and operates as usual, but it drops out randomly, freezing any tasks that were running on the drives until it spins up again. This issue keeps happening on and off. At one point it did not appear for 2 weeks, only to return now. I’ve seen someone post this type of sound on Reddit and someone told them that it’s the head hitting the platter, but this can’t be right.

Things I tried:

Switch around the SATA cables with other drives (Still the same drive had the issue).
Get brand new SATA cables (The same issue)
Switch out the drive (The new replacement drive would cut-out, the same issue.)
Switch around the power connections (The same issue).
Tested the first drive for any bad blocks/surfaces. Test ran fine through a USB dock, without any drop-outs on my main workstation PC. (Edit: I need to specify, by first drive I mean the one before the replacement arrived)

Mind you, this issue happens ONLY with 4 drives, and only on the last one. I haven’t tried taking the other drives offline yet, should I try that next? See if the 4th drive stops dropping out if I take the 2nd or 3rd drive offline?

One thing I noticed: when I pulled out the GPU and used the CPU’s integrated one instead, NO drive was detected. NONE of them, neither the 4 SATA drives nor the NVMe drive. Only the BIOS would come up, showing no drives.

As for hardware:

Ryzen 5 2600 (The integrated GPU came from another spare CPU I had. That one is a Ryzen 3 2200G)
24GB of ddr4 3200 ram (not ECC)
4 Seagate Constellation ES.3 4TB SATA III 3,5 (Running in RAIDZ1)
1 NVME drive (Some random brand laptop grade ssd.)
550W PSU (Would need to take it out to know the brand, it shouldn’t be the issue though?)
MOBO GA-A320M-H
Zotac 1050ti. For the past 2 weeks or so I was running a 3060TI, there were no drops in that timeframe. The issue is, I remember that sometimes these drops would still happen, even on the 3060ti. I’m not planning to run a 3060 in this build. It takes like 22W idle. That’s too much, considering the 1050ti does about 2-6W idle. (lol)

Possible suspect that I’m set on:

The motherboard has some issues (Since when no GPU would cause no drives? That’s weird? Maybe the PCI connection is broken at some point and it keeps dropping one drive?)

Has anyone had a similar issue in the past? I’m considering dropping $200 on a new ATX motherboard with an extra 32 GB of RAM and extra PCI slots at this point. I’d want to make sure that my suspicions are right on this one before I get extra hardware.

Another edit: Possible BIOS issue? I remember updating it to some specific version, since the really recent ones did not want to run my CPU. That version isn’t THAT old from what I recall.

There’s a lot of question marks that I’m putting. The reason is, I have no clue what could be the reason at this point. I don’t have another spare motherboard to test if it’s that, I’m basically guessing blindly by troubleshooting.

somethingweird · June 19, 2024, 2:27pm

Is 550w enough power for this setup? Just wondering.

T0meo · June 19, 2024, 2:45pm

For sure. About 200-250W should be left after full load (including power rating). The home-server itself doesn’t compute anything. The CPU is the only part that is working constantly (running VMs and apps). The GPU is only there, since without it no drives function.
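The headroom estimate above can be sketched as simple arithmetic. This is only a rough illustration: the 65 W CPU TDP and 75 W board power of the 1050 Ti are the published ratings, while every other wattage (and the padding on the CPU figure) is an assumed placeholder, not a measurement from this build.

```python
# Rough power-budget sketch. Wattages marked "assumed" are placeholders,
# not measurements from this build.
PSU_WATTS = 550

# component -> (count, assumed peak watts each)
loads = {
    "Ryzen 5 2600 (65 W TDP, padded, assumed)": (1, 90),
    "GTX 1050 Ti (75 W board power)": (1, 75),
    "3.5in HDD during spin-up (assumed)": (4, 25),
    "NVMe SSD (assumed)": (1, 8),
    "Motherboard, RAM, fans (assumed)": (1, 50),
}


def total_peak(loads):
    """Sum count * watts over every component."""
    return sum(count * watts for count, watts in loads.values())


def headroom(psu_watts, loads):
    """Watts left over if every component peaks at once."""
    return psu_watts - total_peak(loads)


if __name__ == "__main__":
    print("peak draw:", total_peak(loads), "W")
    print("headroom:", headroom(PSU_WATTS, loads), "W")
```

A total like this only says the PSU is large enough on paper. It cannot show a sagging rail or a bad connector, which have to be checked separately.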

T0meo · June 19, 2024, 4:30pm

So a quick update.

I took out the 1050ti and put the 3060ti back in. After I applied quite a bit of pressure, everything seems to be working, BUT sometimes the timeout still happens (currently running fine for the past 30 minutes). So this is in fact some kind of motherboard issue.

Stux · June 19, 2024, 10:25pm

How is the power organized for the drives? One long string of four SATA power connectors? Is it the last one that has an issue?

Sounds like a brownout on one of the rails.

T0meo · June 19, 2024, 10:42pm

The PSU has 8 SATA power connectors in total. 2 cables with 3 SATA connectors each are used to connect the 4 drives (one cable for 2 drives, the other one for the next 2 drives, leaving one connector open on each). So it can’t be that. The rest of the SATA connectors are on the Molex cables. This was the first basic thing I tried ruling out.

It is most likely some type of rail issue. The simple fact that the drives aren’t detected in the BIOS without a GPU present is just not normal. Pressing on or moving the GPU fixing the issue is also a giveaway. I’ll most likely just order a new motherboard at this point.

T0meo · June 22, 2024, 2:35pm

Just got a new motherboard, the issue is still there. Only one out of all the drives dropping out. I have no clue what the issue is now.

joeschmuck · June 22, 2024, 3:33pm

What Rev is the motherboard (rev 1.x, 2.x, or 3.x)?

The brand/model could be the key. I have to agree, it doesn’t sound like you have enough power; however, here are a few experiments to rule things out.

The drive in question? It is always that one drive, correct? If yes, disconnect the data cable from it. Run the pool degraded for a while (not really good unless you have a backup) and see if the drive powers down. If it does, how long does it take? If it does power down, turn the machine off and on again and see if the spin-down happens at the same interval.

If it happens at the same interval with the data cable removed, the drive is turning itself off.

If the drive continues to spin and never spins down, then it is probably not a power supply issue and likely TrueNAS is telling the drive to spin down.

Take the troubleshooting steps one at a time.
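The decision logic of the data-cable experiment above can be sketched as a small function. The names `Verdict` and `diagnose` are made up for illustration, and the third outcome (spin-down at irregular intervals) is not covered by the steps above, so it is marked as undetermined here as an assumption.

```python
from enum import Enum


class Verdict(Enum):
    DRIVE_SELF_SPINDOWN = "the drive is turning itself off"
    HOST_COMMANDED = "probably not the PSU; likely TrueNAS is telling the drive to spin down"
    # Assumption: this case is not described in the post above.
    UNDETERMINED = "not covered by the steps above; keep testing one change at a time"


def diagnose(spun_down: bool, same_interval: bool = False) -> Verdict:
    """Interpret a run with the data cable unplugged from the suspect drive.

    spun_down:     did the drive power down while the data cable was removed?
    same_interval: after a power cycle, did it happen at the same interval?
    """
    if not spun_down:
        return Verdict.HOST_COMMANDED
    if same_interval:
        return Verdict.DRIVE_SELF_SPINDOWN
    return Verdict.UNDETERMINED


if __name__ == "__main__":
    print(diagnose(spun_down=True, same_interval=True).value)
```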

Stupid question, the MB BIOS, is it at the most current revision?

I have no idea why the lack of a GPU would make the BIOS not recognize the drives, hence the BIOS revision question.

While you are at it, how about a few photos of the case layout. I’m looking at air flow, heat buildup issues.

EDIT: Is the power supply fan running? Or is it fanless?

T0meo · June 25, 2024, 12:23am

So, you’ve given me some ideas to test around. I did not do any tests on live data, since I was worried about data loss. I decided to just go for it, even If I’d experience some sort of data loss. It was the only way to get quick results.

The idea to remove the data cable actually helped diagnose the whole thing. Even if the data cable was out, the drive would experience that shutdown within 4-10 seconds. Yes, it was only a single drive. This made me question the PSU. The PSU fan spins, it isn’t fanless. It did NOT fit that the PSU is too weak. It ran just fine hitting about 450W while in my old rig, unless it would be some rail issue within the PSU itself.

While replacing the whole motherboard, I switched the power connections for the drives once again. It turns out, whenever I slightly nudge some of the connections, they fail with the same issue I had before. Turns out the PSU is fine, the previous motherboard is fine, everything is fine BUT the sata power connectors on the PSU cables. I think 3 out of the 8 are with issues.

I couldn’t get the answer by myself, since I did not test ALL the sata connections. Turns out I tested out the 3 bad ones (The luck in this is absurd.) and assumed it’s the motherboard at that point or the drive.
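Keeping a per-connector record makes it harder to miss untested plugs the way it happened here. A minimal sketch, assuming made-up labels for the 8 connectors (two SATA cables with three plugs each, plus two plugs on Molex cables); the results shown are hypothetical, not the actual findings:

```python
# Hypothetical labels for the 8 SATA power plugs described in this thread.
connectors = [f"sata{cable}-{plug}" for cable in (1, 2) for plug in (1, 2, 3)]
connectors += ["molex-1", "molex-2"]


def untested(all_connectors, results):
    """Connectors that have no recorded test result yet."""
    return [c for c in all_connectors if c not in results]


def suspects(results):
    """Connectors where the drive dropped out (recorded as False)."""
    return sorted(c for c, ok in results.items() if not ok)


if __name__ == "__main__":
    # Example record: True = drive stayed up, False = drive dropped out.
    results = {"sata1-1": False, "sata1-2": True}
    print("still to test:", untested(connectors, results))
    print("suspect plugs:", suspects(results))
```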

Thanks for that rundown!

joeschmuck · June 25, 2024, 2:06am

A note of warning: If you need to replace your power connectors (assuming it is a modular design), ensure you buy the proper cables. They may all look the same but they are likely to be wired differently at the power supply connector. Yes, there are posts like that on the old forums. Be careful.

T0meo · August 15, 2024, 2:03am

So after even more testing and even more drive drop-outs, it turns out it’s related to the apps themselves. Not the drives, not the power, not the connectors. Unless it’s some global issue for the drives I have.

When I pick a pool for my apps and it happens to be a HDD pool, ONE drive will start dropping out. Not instantly, it will happen over a week or so.

The fix I found at first was only temporary. It’s down to random luck whether everything works or not. Restarting multiple times would sometimes make everything work.

The current solution for me was installing everything onto a small SSD.

Stux · August 15, 2024, 4:51am

This is not something an App should be able to do.

I suspect that the apps may be triggering the cause, but are not actually the cause.

I.e., apps cause extra load. Extra load causes a failure.

“too much load” should not cause your drives to drop out.

Drives should not drop out. Full stop. Unless something is wrong.

Davvo · August 15, 2024, 7:50am

Did you run any smart long test on the drives?
When was the last time the pool was scrubbed?

While it’s recommended to install Apps and VMs on SSD, that is for performance reasons. The issue you are facing should not be happening and smells of hardware failure.
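For reference, the two checks asked about above (a SMART long test and a scrub) map to standard `smartctl` and `zpool` invocations. A minimal sketch that only builds the command lines; the device path `/dev/sdX` and the pool name `tank` are placeholders, and the commands need root on the NAS itself:

```python
import subprocess


def health_check_commands(device, pool):
    """Return argv lists for the SMART and scrub checks (nothing is executed)."""
    return [
        ["smartctl", "-t", "long", device],      # start an extended self-test
        ["smartctl", "-l", "selftest", device],  # show the self-test log afterwards
        ["zpool", "status", "-v", pool],         # pool state, last scrub, errors
        ["zpool", "scrub", pool],                # start a new scrub
    ]


def run_all(commands):
    """Run each command in order; only call this on the NAS, as root."""
    for argv in commands:
        subprocess.run(argv, check=False)


if __name__ == "__main__":
    # Placeholder device and pool names; printed rather than executed.
    for argv in health_check_commands("/dev/sdX", "tank"):
        print(" ".join(argv))
```

The long test runs in the background on the drive, so the self-test log is only meaningful once it has finished.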

T0meo · August 15, 2024, 9:40am

While I agree that the apps SHOULDN’T be able to do this, they are most definitely causing some sort of issue with the drives.

Yes, I’ve run multiple long tests on them. No failures. I’ve also not encountered any drive drop-outs since I started using an SSD. I can however confirm that installing and using the apps on an HDD started the drop-outs once again.

The pool scrubs every week or so. Since then it’s been scrubbed about 7-10 times. (This was set due to the drive drops. Might change it to be lower now.)

About a week ago, I had the same identical issue with the drop-outs. It got to the point where the whole pool got corrupted, unrecoverable. Nothing worked, import -f didn’t work, it told me to restore from a backup. When I looked at what was corrupted, several app components were missing inside the k3s folder, like settings and stuff. This isn’t a one-time thing. I re-installed the apps to the same drive 3 times and 3 times it corrupted itself.

If you guys are 100% sure this can’t be the apps themselves, I must be the unluckiest person on here. I ruled out power (connected a separate brick, still got drop-outs). The motherboard was changed. New RAM was added. Any part that could cause issues was replaced.

T0meo · August 15, 2024, 9:42am

I might also add, while installing the apps on a 4 drive pool, it corrupted the WHOLE pool. It was that bad, the TrueNAS system WOULDN’T boot with the pool active. It caused a kernel panic.

I got around this by unplugging all the drives and plugging them back again when the system was on-line. Then I had to enable read-only to recover data. Enabling read/write caused another kernel panic.

Davvo · August 15, 2024, 9:44am

Create a bug report and see what iX tells you then.

T0meo · October 29, 2024, 3:25pm

It’s been a while, but I got to the root cause of everything. It was indeed the PSU. Sorry lads for looking for the issue in a place where it wasn’t. This was a combination of unlucky events, from the PSU to the boot drive that wasn’t new (it was supposed to be new, though it already had over 40 TB of writes and reads).