Device: /dev/sdd [SAT], ATA error count increased from 5860 to 5864

vvasilovski · August 11, 2024, 4:14pm

What should I do about this message: “Device: /dev/sdd [SAT], ATA error count increased from 5860 to 5864”? Please help! I recently migrated to SCALE. The system works, but every time I restart, it shows me this message.
This is an M.2 SSD boot drive.

joeschmuck · August 11, 2024, 4:17pm

You should do a few things:

post the entire output of smartctl -x /dev/sdd
post your system hardware and configuration.
post the exact version of SCALE you are using.

Do not let us “assume” anything, or we may unintentionally give you bad advice. We are not all-knowing here, even though we try.

sfatula · August 11, 2024, 6:07pm

Odds are, replace it! But I agree: post the results of the smartctl command, as it could well be something else hardware-wise.

5864 errors are 5,863 too many for me. I would have investigated at error 1 myself; data is too important.

vvasilovski · August 11, 2024, 6:12pm

NugentS · August 11, 2024, 6:21pm

And your hardware?
Also - how is the M.2 connected to the system?

sfatula · August 11, 2024, 6:38pm

Yeah, we need ALL the hardware. The SSD itself doesn’t look like it’s failing to me, so something up the chain must be having trouble, be incompatible, have old firmware, etc. How it’s connected is definitely important.

vvasilovski · August 11, 2024, 7:03pm

The computer is an OptiPlex 7040 Small Form Factor.
The M.2 drive is connected to the motherboard’s factory M.2 slot.
I apologize for any inaccuracies; I am using machine translation.

joeschmuck · August 11, 2024, 8:59pm

I see nothing actually wrong with the M.2 SSD.

I highly recommend that you run a SMART Long/Extended test on this drive. Start it with the command smartctl -t long /dev/sdd, wait at least 60 minutes, then grab the output of smartctl -a /dev/sdd and post it.

In the Extended Self-test log you should now have entry #1 with Test Description = Extended offline, Status = Completed without error, Remaining = 0%, LifeTime = the current power-on hours count, and LBA of first error = “-”.
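For anyone who would rather check this from a script than by eye, here is a minimal Python sketch that reads the self-test log text (as printed by smartctl -a or smartctl -l selftest) and checks entry #1 for those conditions. It assumes the column layout smartctl uses for SATA/ATA drives; NVMe self-test logs are formatted differently. The function name latest_extended_test_ok and the sample log row are hypothetical, not part of any TrueNAS tool.

```python
import re

# One row of the ATA self-test log as printed by `smartctl -a` or
# `smartctl -l selftest`. Columns are separated by two or more spaces,
# which lets the description and status contain single spaces.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)$"
)


def latest_extended_test_ok(selftest_log: str) -> bool:
    """True if log entry #1 is a finished extended test with no error LBA."""
    for line in selftest_log.splitlines():
        m = ROW.match(line.strip())
        if m and m["num"] == "1":
            return (
                m["desc"] == "Extended offline"
                and m["status"] == "Completed without error"
                and int(m["remaining"]) == 0
                and m["lba"] == "-"
            )
    return False  # no entry #1 found (e.g. no test has run yet)


if __name__ == "__main__":
    # Hypothetical sample text in smartctl's layout.
    log = """\
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1234         -"""
    print(latest_extended_test_ok(log))  # True
```

Splitting on runs of two or more spaces is what keeps multi-word fields like “Extended offline” intact; a failed test shows a status such as “Completed: read failure” and a numeric LBA, so the check returns False.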

The only issue I personally have had with M.2 and SCALE was TrueNAS sending the NVMe drive commands it did not recognize, for features that did not exist on that drive. I’m not saying that is the case in your situation; however, you do not have any actual errors with the drive. They all look like communication issues.

While it may not be what you want to hear, I recommend you ignore the errors but monitor them. If you notice the count increasing at an alarming rate, let’s revisit this topic, or just replace the M.2 drive with a different make/model NVMe drive.
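One way to do that monitoring is to record each alert and look at how fast the count grows. Below is a minimal Python sketch, assuming the alert text has exactly the wording quoted at the top of this thread. The names parse_alert and errors_per_day and the ALARM_PER_DAY threshold are hypothetical, since what counts as “alarming” is a judgment call.

```python
import re
from datetime import datetime

# Matches the smartd alert wording quoted in this thread.
MSG = re.compile(
    r"Device: (?P<dev>\S+) \[(?P<type>[^\]]+)\], "
    r"ATA error count increased from (?P<old>\d+) to (?P<new>\d+)"
)

# Hypothetical threshold: pick whatever "alarming" means for your system.
ALARM_PER_DAY = 10.0


def parse_alert(text: str) -> tuple[str, int, int]:
    """Return (device, old_count, new_count) from a smartd alert line."""
    m = MSG.search(text)
    if m is None:
        raise ValueError(f"not a smartd ATA error-count alert: {text!r}")
    return m["dev"], int(m["old"]), int(m["new"])


def errors_per_day(first: tuple[datetime, int], last: tuple[datetime, int]) -> float:
    """Average growth of the ATA error count per day between two readings."""
    (t0, c0), (t1, c1) = first, last
    days = (t1 - t0).total_seconds() / 86400
    if days <= 0:
        raise ValueError("readings must be in chronological order")
    return (c1 - c0) / days


if __name__ == "__main__":
    dev, old, new = parse_alert(
        "Device: /dev/sdd [SAT], ATA error count increased from 5860 to 5864"
    )
    print(dev, new - old)  # /dev/sdd 4
```

Feed errors_per_day two (timestamp, count) readings taken some days apart and compare the result against ALARM_PER_DAY; a count that only bumps by a few at each reboot, as described here, would stay well below a threshold like that.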

Also, you could check whether there is a newer firmware version for the M.2 NVMe drive and update it.

Make sure you back up your TrueNAS config file; it can save your butt later.

vvasilovski · August 12, 2024, 7:04am

After the long test, this is the result I got.

joeschmuck · August 12, 2024, 9:38am

I have to say, these screen captures are damn hard to read on a smartphone.

However, after zooming in a lot, your drive looks good. Just pay attention to those counts and run a SMART Long test weekly. If that passes and no other alarm condition exists, you will be good.