HDD dying or just a bad cable?

Flyboy · November 6, 2025, 3:28am

I am curious if anyone can provide or point me towards some good troubleshooting tips or best practices for determining the viability of a hard drive that keeps having issues. I am new to TrueNAS and to owning a NAS in general. This past year I modded a Dell Optiplex 5050 and threw some 8TB drives in it, running HexOS, which was running TrueNAS Scale in the background on RAIDZ1. Long story short, I have mostly given up on HexOS and have been going to TrueNAS directly for basically everything.

Two weeks ago I had a drive seemingly die on me. I turned the NAS off and back on, as that was the typical PC thing I knew to do, and it still didn’t work. I ended up shutting it down again, and unplugging and replugging the cables seemed to revive the drive. Along the way I decided to buy another 8TB drive and go to RAIDZ2. I finished that about a week ago, and today I got another email saying the drive is faulted, reading the SMART error log failed, reading the SMART self-test log failed, and the drive is offline and unreadable.

When I added the 4th drive a week ago I switched power cables. Today I am going to try replacing the SATA cable and see if I can get the drive back, but is there anything I could be missing as to why a drive randomly goes offline? Or does anyone have any best practices for what they do when they start getting drive error messages?

Thanks in advance.

Fleshmauler · November 6, 2025, 4:10am

Output of ‘smartcl -a /devlsd#` (replace # with the drive letter), will go a long way in identifying the issue. Any other system details, confirmation on how these drives are connected (directly to motherboard, hba, something horrible?), etc. will go a long way in being able to help.

Do you have scheduled SMART tests & scrubs set up? Is it always the same drive failing? Length of SATA data cables?

Flyboy · November 6, 2025, 3:01pm

Yeah, thanks for the help. First question the ‘smartcl -a /devlsd#’ command with replacing # with the drive letter, is that supposed to be inserted into shell? I tried it and I got “zsh: command not found: smartcl” Did I input it wrong?

The Dell Optiplex 5050 motherboard has 4 SATA drive connections, so I have all 4 drives connected directly to the motherboard, each on its own cable.

I am unsure if I have SMART tests and scrubs set up. HexOS did the initial setup. Since the drive failed I have received several messages saying “read smart error log failed”, “read smart self test log failed”, “not capable of smart self-check”, so I would imagine some are set up? EDIT: After some looking in the Data Protection menu, I only see scrub tasks scheduled to run once a week. None of the other tasks seem to have anything set.

So far it has always been the same drive failing. The drive is sdb, with the same serial number every time; I checked both times. The cable is short, maybe 10 inches at the most? I have it all fitting carefully inside the case.

Hopefully those help. Happy to answer any other questions. Thanks for your time.

joeschmuck · November 6, 2025, 5:04pm

Take a look at my signature links for the flowcharts. Those will help a lot I hope. It is also in the TrueNAS docs as a step by step process, since I couldn’t upload an image.

etorix · November 6, 2025, 5:06pm

And don’t forget:

Your post provides no information on the hardware, wiring.
And we’re left wondering whether you actually went through “backup-destroy-restore” or merely expanded the initial raidz1 instead of going for raidz2.

Fleshmauler · November 6, 2025, 5:33pm

No, I just had a bunch of typos: smartctl -a /dev/sd#

joeschmuck · November 6, 2025, 5:48pm

@Fleshmauler

It happens. To me more while using my smartphone.

Flyboy · November 6, 2025, 6:51pm

Fantastic, thank you sir. I figured there was info already made somewhere. I will start there and see what I can do on my own. I am new to the forums so I wasn’t sure where to look. Thanks for your time!

Flyboy · November 6, 2025, 6:57pm

Ah, I must be messing things up now. When I enter that command into my shell, all I get now is a new line with the prompt “quote>”

To be clear this is what I am inserting

smartctl -a /dev/sdb’

I also tried it without the ‘ at the end and I still received command not found.

I am running TrueNAS Scale, ElectricEel-24.10.2.4. Is that what is causing me to not enter the command right?

Fleshmauler · November 6, 2025, 7:01pm

omg - obviously I’d include another “ ` “ by accident… that being said, without the extra ` it should produce the needful.

It does work fine for me in the shell:

dan · November 6, 2025, 7:02pm

Because you’re running as root.

@Flyboy, try prefixing sudo to that command: sudo smartctl -a /dev/sdb

Fleshmauler · November 6, 2025, 7:03pm

ffs - how many times am I going to screw up giving a basic command to someone in this post?

joeschmuck · November 6, 2025, 8:44pm

In the instructions it tells you that all commands are encompassed in single quotes so you know where the command starts and ends. Do not include the single quotes at the front or back of the command.

Edit: None of these commands can cause harm no matter what you do, except the nvme commands. Those could be botched if you try hard enough.

Flyboy · November 7, 2025, 6:00pm

Haha, yeah, the sudo prefix worked for me. The drive is definitely not communicating though. I just get an inquiry failed. I then tried my other drives and received tons of info, but with the failed drive it’s just back to Inquiry Failed.

And going through the critical drive error flowchart, I obviously cannot get any SMART test, and the drive just shows as faulted and offline. So it seems like something is definitely wrong and I am going to have to replace a cable or the drive. I will finally have time this evening to actually tear into the system. I’ll start back with the troubleshooting steps if I can get the drive back online again. Thanks for the tips.

Flyboy · November 9, 2025, 2:30am

Okay, additional question after some troubleshooting. I don’t understand why, but sometimes after clearing any alerts and simply restarting the NAS, the drive will come back online. At least once I had to open the PC, pull and reseat the plugs, and restart before the drive popped back up. When it does, the NAS boots without a problem, there are no errors, everything is green, and the drive passes a short SMART test and a scrub with no errors. But then after a random amount of time (could be days, could be hours) the same drive will inevitably hard fail again. This has happened at least 3-4 times now.

I have replaced the SATA cable to the motherboard with a new cable. The drive is powered through a splitter from the power supply, and as the other drive on it has never died, I don’t imagine that’s the issue. I also tried it in another power configuration and it still failed once. I am at the point where I am 80% sure the drive has some issue where it fails after a while. Before I just drop another 200 on a new drive, I want to make sure there isn’t some issue with that port on the motherboard, or with the motherboard itself.

(I have a Dell Optiplex 5050MT I repurposed that has a Dell WWJRX motherboard.)

So my question is this: how would TrueNAS react to swapping which ports 2 drives are hooked up to on the motherboard? Or is there a way to do that officially through TrueNAS? I want to see if the problem follows the drive or stays with that same port on the motherboard.

Fleshmauler · November 9, 2025, 4:44am

It wouldn’t care at all unless something is critically wrong with drives/ports/controller.

Flyboy · November 9, 2025, 5:31am

Gotcha, thanks. I wasn’t sure if it would freak out or just be fine and realize all the drives are there, just in different locations than before.

etorix · November 9, 2025, 8:26am

ZFS tracks pool members by UUID and does not care about location or device identifiers.
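
Because the pool does not depend on which port a drive sits on, the serial number is the practical way to tell whether the fault follows the drive or stays with the port. A minimal sketch, assuming `smartctl -i` prints a `Serial Number:` line as it typically does for SATA drives; the helper name `serial_of` is made up for illustration:

```shell
# serial_of: read `smartctl -i` output on stdin and print the drive's serial number.
# Hypothetical helper name; relies on the "Serial Number:" line in the
# information section of smartctl's output.
serial_of() {
  awk -F': *' '/^Serial Number:/ { print $2 }'
}

# Usage on the NAS (needs root), before and after swapping the two SATA ports:
#   sudo smartctl -i /dev/sdb | serial_of
```

If the drive that drops out after the swap still reports the same serial, the drive is the likely suspect; if a different serial starts failing on the original port, look at the port or its cable. Device names such as sdb can change between boots, so compare serials rather than letters.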

Flyboy · November 10, 2025, 4:00pm

Flowcharts were fantastic. They have definitely taught me a lot. I appreciate it. I don’t know if my drive randomly going offline with no input/output is related, but at least now I see a raw value of 2376 for attribute 197 (pending sectors) and a raw value of 2376 for attribute 198 (offline_uncorrectable). But as indicated I’ll just monitor them, especially since I am running RAIDZ2. I have also set up weekly scrubs and short SMART tests as per the guidance I read.
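
For monitoring those two counters over time, it may help to pull just attributes 197 and 198 out of the SMART attribute table. A minimal sketch, assuming the usual ATA table layout printed by `smartctl -A` (ID in column 1, name in column 2, raw value in column 10); the helper name `pending_sectors` is hypothetical:

```shell
# pending_sectors: read `smartctl -A` output on stdin and print the raw values
# of attributes 197 and 198 as NAME=RAW. Hypothetical helper name.
# Assumes the standard ATA attribute table: column 1 is the ID,
# column 2 the attribute name, column 10 the raw value.
pending_sectors() {
  awk '$1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Usage on the NAS (needs root):
#   sudo smartctl -A /dev/sdb | pending_sectors
```

Running this after each scrub or SMART test and noting the numbers shows whether the counts are stable or climbing.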

Additional question. Going to the SMART test results for the suspect drive, I also see many “Short offline” results that succeeded, but the two “Extended offline” results show as failed, with ‘1409470216’ as the LBA of first error. I am trying to run another offline test to see if I can get one to pass, but any time I try to run an offline SMART test I get this response:

“Sending command: “Execute SMART off-line routine immediately in off-line mode”. Drive command “Execute SMART off-line routine immediately in off-line mode” successful. Testing has begun.”

But then I get nothing showing in the jobs, and I never seem to get another extended offline result, not even a failed one. Am I getting confused between the different types of test? I can run a short and long smart test that passes.

joeschmuck · November 10, 2025, 4:39pm

If you cannot pass a SMART Extended/Long test, the drive has failed. If you have not run one yet, do it now: smartctl -t long /dev/sd? for the drive in question. If it cannot “complete” the test and it shows a failure, replace the drive.
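
A minimal sketch of checking the result once the long test has finished, assuming the self-test log format printed by `smartctl -l selftest`, where failed entries read “Completed: read failure” and the last column is the LBA of the first error. The helper name `failed_lbas` is hypothetical. As far as smartctl's wording goes, the “Execute SMART off-line routine immediately” message quoted earlier appears to belong to `-t offline`, which is a different routine from the extended self-test started by `-t long` and would not be expected to add an “Extended offline” entry to the log.

```shell
# failed_lbas: read `smartctl -l selftest` output on stdin and print the
# LBA of first error for every logged self-test that ended in a read failure.
# Hypothetical helper name; log entries are the lines starting with "#".
failed_lbas() {
  awk '/^#/ && /read failure/ { print $NF }'
}

# Usage on the NAS (needs root):
#   sudo smartctl -t long /dev/sdb                      # start the extended test
#   sudo smartctl -l selftest /dev/sdb | failed_lbas    # after the test has finished
```

No output means no logged self-test ended in a read failure; any LBA printed means at least one did.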
