Drive suddenly "Unhealthy". Unable to run S.M.A.R.T. tests. Help confirming if drive is bad?

Hi everyone,

I’m having an issue with one of my disks. I’d love to confirm if it’s the disk or something else before going through the RMA process. I’d hate to mail it off and have them say “the drive is fine”, etc.

My system:
I have TrueNAS Core v13.0-U6.7 running on a Dell PowerEdge T320 with an H310 HBA SATA/SAS card flashed to IT mode. I’m booting from an SSD and have a RAIDZ2 pool of six 26 TB Seagate Exos ST26000NM000C drives (refurbs from ServerPartDeals). The SSD is plugged into the motherboard and the six drives are plugged into the HBA card.

This is my secondary NAS, used as a backup to my primary TrueNAS machine. This server has been off for over a month as I rearranged cabling, etc.

What happened:
Yesterday, after being shut off for over a month, I turned the system on. The web interface informed me drive 5 was “unavailable”. The alert said:

“Pool TrueNAS Clone state is DEGRADED: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state. The following devices are not healthy. Disk ATA ST26000NM000C-3W ZXA0XXX is UNAVAILABLE”

What I’ve tried from the hardware side:

  • Opened the Dell PowerEdge and re-seated all the HBA drive cables and all power connectors. No capacitors or components look suspicious.
  • Swapped out the modular 750W power supply with a spare.
  • Confirmed all other disks are operating fine. They pass all short S.M.A.R.T. tests.
  • Swapped disk 5 into the two spare drive bays on the NAS, with no change in the problem.

When the system boots, all the LEDs on the drive carriers initially flash normally. However, during this initial process, the LED for disk 5 keeps blinking for an extended period of time, as if the system is having trouble accessing the drive.

The log shows this group repeated a few times:

Nov 16 15:14:09 truenas2 (da5:mps0:0:9:0): CAM status: SCSI Status Error
Nov 16 15:14:09 truenas2 (da5:mps0:0:9:0): SCSI status: Check Condition
Nov 16 15:14:09 truenas2 (da5:mps0:0:9:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
Nov 16 15:14:09 truenas2 (da5:mps0:0:9:0): Error 5, Retries exhausted
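For anyone wanting to check whether these errors are confined to one device, here is a minimal Python sketch that tallies "Retries exhausted" messages per device. It assumes the log lines have the same shape as the four lines quoted above; the function names `parse_cam_line` and `count_exhausted_retries` are made up for illustration and are not part of TrueNAS or FreeBSD.

```python
import re
from collections import Counter

# Matches CAM messages shaped like the ones quoted in the post, e.g.
# "Nov 16 15:14:09 truenas2 (da5:mps0:0:9:0): SCSI status: Check Condition"
CAM_LINE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):\d+:\d+:\d+\):\s*(?P<msg>.+)$"
)


def parse_cam_line(line):
    """Return (device, controller, message), or None if the line does not match."""
    match = CAM_LINE.search(line)
    if match is None:
        return None
    return match.group("dev"), match.group("ctrl"), match.group("msg").strip()


def count_exhausted_retries(lines):
    """Count 'Retries exhausted' messages per device (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        parsed = parse_cam_line(line)
        if parsed is not None and "Retries exhausted" in parsed[2]:
            counts[parsed[0]] += 1
    return counts
```

Feeding it the lines of a saved log file (for example `count_exhausted_retries(open(path))`, where `path` is wherever you saved the log) returns a `Counter` keyed by device name. If only one device ever appears, that points at that drive or its slot rather than the whole controller.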

Problems running S.M.A.R.T. on TrueNAS Web UI:

When attempting to run a S.M.A.R.T. test via the web UI on drive 5, short or long, I get the following:

“Sending command: “Execute SMART Short self-test routine immediately in off-line mode”. Command “Execute SMART Short self-test routine immediately in off-line mode” failed: scsi error aborted command”

I tried taking the drive offline and repeating all the S.M.A.R.T. tests and got the same error.

I then pulled drive 5 and plugged it into the SATA port on my PC. When the drive gets power I hear it struggling and it’ll even halt the BIOS boot process.

On Windows, CrystalDiskInfo says it’s “Healthy”. However, when I tried smartctl from the Windows command line, I got the error “failed: input/output error” for both short and long S.M.A.R.T. tests.

I haven’t tried a Linux system yet, but I can try that too. Here’s an image of the Windows CrystalDiskInfo report:

Is there anything else I should try before RMA’ing the drive?
This is the first disk issue I’ve had that wasn’t resolved by re-seating cables (back when I was using a custom tower), so I’m unsure if I’m missing anything. I’d be wary of formatting the disk and using it again if it’s really problematic.

Any thoughts or ideas would be wonderful, thank you!

P.S. Apologies for any rich text formatting errors.

-Steve

RMA Time :slight_smile:


You’ve basically ruled out everything physical. What I suspect is that the firmware on the drive itself has gone bad, to the point that even CrystalDiskInfo can’t actually request fresh health data and is showing an old cached value from the device firmware. That would also be why you can’t run SMART tests: the drive is no longer responding to these requests.

“scsi error aborted command” and “failed: input/output error” are active failures: the drive is reporting that the command itself failed.

In my opinion, this alone is enough to warrant an RMA without further questions or testing. It’s not worth more time or effort after that finding.


There have been some refurb drives failing recently. I’m curious if these were part of the China incident. Is it still able to be returned? If yes, then depending on what happens here, you may be contacting them and asking for a replacement. It is part of their job and I’m sure they get a lot of returns.

I will assume that before you can move onto the next step, the current step works.

  1. Now for the fun…
  2. Read the drive serial number off the label, you need to track by the serial number, always.
  3. Reinstall the drive into your NAS. This is only to ensure we are both using a known OS and what is installed on it.
  4. Power on the system.
  5. Open an SSH window.
  6. Log in as root
  7. Run smartctl --scan and you should have several drives listed.
  8. Run smartctl -a /dev/da0 and cycle through all the drives until you locate the drive with the matching serial number.
  9. Run smartctl -t /dev/daX where X=the drive.
  10. Immediately run echo $? and you should have a ‘zero’ return value which indicates the command was accepted. Typically the command on step 9 would immediately throw you an error but if it doesn’t, use the echo command.
  11. Wait 5 minutes and run smartctl -x /dev/daX and hopefully you will have a nice long piece of data.
  12. Post the data here for others to see what might be wrong.
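Steps 7 to 10 above can be sketched in Python as follows. This is a rough sketch, not a tested tool: the helper names are hypothetical, the exit-status bit descriptions are paraphrased from the smartctl man page, and it uses `smartctl -i` (identity info only) where the steps use `-a`, on the assumption that the serial number line is all that is needed to locate the drive.

```python
import subprocess

# smartctl's exit status is a bitmask; descriptions paraphrased from its man page.
EXIT_BITS = [
    "command line did not parse",
    "device open failed",
    "a SMART or other command to the disk failed, or checksum error",
    "SMART status check returned DISK FAILING",
    "prefail attributes at or below threshold",
    "attributes were at or below threshold at some time in the past",
    "device error log contains records of errors",
    "device self-test log contains records of errors",
]


def decode_exit_status(status):
    """Translate the value of `echo $?` after smartctl into a list of messages."""
    return [text for bit, text in enumerate(EXIT_BITS) if status & (1 << bit)]


def parse_serial(info_text):
    """Pull the serial number out of smartctl identity output, or return None."""
    for line in info_text.splitlines():
        key, sep, value = line.partition(":")
        # ATA output prints "Serial Number:", SCSI prints "Serial number:".
        if sep and key.strip().lower() == "serial number":
            return value.strip()
    return None


def smartctl_info(device):
    """Run `smartctl -i DEVICE` and return its text output (needs root)."""
    result = subprocess.run(
        ["smartctl", "-i", device], capture_output=True, text=True
    )
    return result.stdout


def find_device_by_serial(serial, devices, read_info=smartctl_info):
    """Cycle through devices until one reports the serial from the label."""
    for device in devices:
        if parse_serial(read_info(device)) == serial:
            return device
    return None
```

For example, `find_device_by_serial("ZXA0XXX", ["/dev/da0", "/dev/da1"])` would query each listed device in turn, and `decode_exit_status(4)` reports that a command to the disk failed. An empty list from `decode_exit_status` corresponds to the zero return value mentioned in step 10.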

If you cannot get the drive to recognize smartctl then you have an issue. With that said, the data from Crystal Disk Info comes from the SMART data so I suspect you ran the command improperly.

Post what you find.


I suppose that you meant smartctl -t short /dev/daX


Yup, thanks for the assist. Sometimes my mind knows what to do but the fingers do not hit the keys. :clown_face:

Thank you @itsharryshelton for the advice, I plan to pack up the drive today.

Hi @joeschmuck,

Yes, the drive is within warranty and I’ve opened an RMA claim. I just wanted to ensure I crossed my t’s and dotted my i’s before I sent it back.

I have a monitor plugged into the NAS and was able to run the command as root from the shell. If I try a short test, I get a “failed: scsi error aborted command” error. However, short tests on the other drives are successful.

Photo from the error on the monitor connected to the NAS:

In addition, I tried a slightly different approach with my PC. I booted an Ubuntu Live USB drive and installed smartmontools.

I also opened GParted just to have a look. The drive made all sorts of awful noises, and I had to click through at least a dozen “Error fsyncing/closing dev/sda: Remote I/O error” errors.

When I run sudo smartctl -t short /dev/sda (with sudo because it previously said “permission denied” without it)…

I get the following error:

Sending command: “Execute SMART Short self-test routine immediately in off-line mode”.

Command: “Execute SMART Short self-test routine immediately in off-line mode” failed. scsi error data protection error.

I guess this confirms things?

Thanks,

-Steve

You should have tried smartctl -d scsi -t short /dev/da5, which is what the scan was for. I am sorry, I was not clear about that; in fact I didn’t even say to do it, so that is on me. But the RMA will still work.
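To illustrate how the scan output feeds into the test command, here is a small Python sketch. It assumes `smartctl --scan` prints lines of the form `/dev/da5 -d scsi # /dev/da5, SCSI device`; the function names are hypothetical, and nothing in this thread shows whether passing `-d scsi` would have changed the result for this particular drive.

```python
def parse_scan_line(line):
    """Split one `smartctl --scan` line into (device, device_type).

    device_type is None when the line carries no `-d TYPE` hint.
    Returns None for blank or comment-only lines.
    """
    fields = line.split("#", 1)[0].split()  # drop the trailing "# ..." comment
    if not fields:
        return None
    device, device_type = fields[0], None
    if "-d" in fields[1:-1]:
        device_type = fields[fields.index("-d", 1) + 1]
    return device, device_type


def short_test_command(device, device_type=None):
    """Build the argument list for a short self-test, adding -d when known."""
    command = ["smartctl"]
    if device_type:
        command += ["-d", device_type]
    command += ["-t", "short", device]
    return command
```

With the assumed scan line above, `short_test_command(*parse_scan_line(line))` produces the same arguments as the command suggested here: `smartctl -d scsi -t short /dev/da5`.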

Thanks for replying, I booted the NAS up again to try that command…. but now disk da5 is no longer showing up at all, so I can’t even run a test! RMA time I suppose… thanks!

Then you have no doubt: something is wrong. Good for you for at least trying it. It would have been nice if the drive had come to life with no issues; then you could say “It was that blasted data cable again”. I hope it all works out.

Hi all,

Just an update, the drive was RMA’d and they said it was “DOA”, so there certainly was something up with the drive. Thankfully a replacement arrived the other day and it finished resilvering today, and all seems to be up and running again. Thank you!
