Drive Failure... SMART Log

Hi guys…

Please advise: what do you think caused the drive failure? See attached.

G

sdg-1.txt (7.8 KB)
sdg-2.txt (20.1 KB)

This drive has failed unfortunately.
# 1 Extended offline Completed: read failure 90% 1665 1315017816

This shows you that there are many bad sectors (LBAs):

Pending Defects log (GP Log 0x0c)
Index                LBA    Hours
    0         1315017816     1665
    1         1315017817     1665
    2         1315017818     1665
    3         1315017819     1665
    4         1315017820     1665
    5         1315017821     1665
    6         1315017822     1665
    7         1315017823     1665
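
To make the pattern in that log easier to see, here is a minimal Python sketch that groups the pending LBAs into contiguous runs. The log text is copied from the excerpt above; the function name `contiguous_runs` is made up for illustration. The comment about physical sectors assumes a drive with 512-byte logical and 4096-byte physical sectors, which the thread does not confirm.

```python
# Sketch: group the LBAs from a Pending Defects log into contiguous runs.
# PENDING_DEFECTS is copied from the log excerpt; contiguous_runs is a
# hypothetical helper name, not part of smartmontools.

PENDING_DEFECTS = """\
Index                LBA    Hours
    0         1315017816     1665
    1         1315017817     1665
    2         1315017818     1665
    3         1315017819     1665
    4         1315017820     1665
    5         1315017821     1665
    6         1315017822     1665
    7         1315017823     1665
"""


def contiguous_runs(log_text):
    """Return (first_lba, length) tuples, one per run of consecutive LBAs."""
    lbas = []
    for line in log_text.splitlines():
        fields = line.split()
        # Data rows have exactly three integer fields: index, LBA, hours.
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            lbas.append(int(fields[1]))
    lbas.sort()

    runs = []
    for lba in lbas:
        if runs and lba == runs[-1][0] + runs[-1][1]:
            # Extends the current run by one LBA.
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((lba, 1))
    return runs


for first, length in contiguous_runs(PENDING_DEFECTS):
    # ASSUMPTION: on a 512e drive, 8 consecutive LBAs starting on a multiple
    # of 8 would map to a single 4096-byte physical sector.
    aligned = first % 8 == 0 and length == 8
    print(first, length, "aligned 8-LBA block" if aligned else "")
```

For this log the eight entries form a single run starting at LBA 1315017816, which is a multiple of 8. If the 512e assumption holds, that would be one damaged physical sector rather than eight scattered ones, though the failed extended test is still reason enough to replace the drive.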

1665 hours is a very premature failure; however, this should fall under warranty.

I would not mess with it; replace it. Even if you got the drive to pass a long test, how long until it fails again?

Nothing you did could have caused this failure, unless you dropped the drive. Your highest recorded temperature was 51C, and the drive warranty covers up to 60C.

With that said, this drive also recommends a maximum temperature of 40C, and you exceeded it 780 times. This drive, and really all your drives, need better cooling, or you need to buy drives designed to run continuously in a warmer environment. I personally think you are fine if you stay around 45C and below. 51C is high in my opinion; it is still within the drive’s design, but drives don’t last when run at the upper limit.

I don’t know what this means (yet). Maybe you powered the system off without shutting it down properly? 54 times does sound like a lot.
0x03 0x040 4 54 --- Number of High Priority Unload Events

With all that said, my experience tells me that you just got a bad drive with some infant mortality. It happens and it is no fun when we get a failed product. The good thing is, it happened early.

If you have any further questions, please ask.

This system used to run 8 drives with the same cooling. It’s now reduced to one disk pool of 5 drives… and I’m getting this…

So there should be a lot less heat…
Scary: while it’s still rebuilding, I had another drive throw errors… I’m thinking the drive source is “questionable”.

G

Are these refurbished drives? I have no idea how to reset a drive’s statistics to make it look new, but I wish I could do that. Maybe that is the situation?

And another drive? That is an omen.

They’re not supposed to be… I just had another failure on another drive.
I have requested a refund on all the drives ordered, as I feel I need to replace them all.
Please see attached…

G

sdb1.txt (19.8 KB)

I’ve got 2 of my original 3 x 8TB HDDs that were in a RaidZ1. I will re-install them, configure a Raid1 disk pool, and move some of the critical media over… FFS…

G

0x05 0x020 1 58 --- Highest Temperature
Very close to the upper limit.

I don’t see an actual hardware failure on this drive, unless I missed it. I do see two errors that were recovered from, and these do happen. The values for ID1 and ID7 are error rates, and on a Seagate drive they mean nothing unless the value is greater than hex FFFFFFFF. These are not. Lesser values are attributed to read-ahead operations, where the drive guesses what data will be read next and the guess turns out to be wrong. It is an effort to reduce latency, and it works fine in things like a data center. In home use it probably only helps when you are copying files; with random access, that is where the error rate goes up. It also comes down.
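
The FFFFFFFF threshold lines up with a commonly reported reading of Seagate raw values for ID1 and ID7: a 48-bit number whose upper 16 bits count errors and whose lower 32 bits count operations. As far as I know that layout is community knowledge rather than something Seagate documents, so treat this Python sketch as an assumption. The function name and the sample values are made up and are not taken from the attached reports.

```python
# Sketch: split a Seagate-style 48-bit raw value for ID1/ID7.
# ASSUMPTION: upper 16 bits = error count, lower 32 bits = operation count.
# split_seagate_raw is a hypothetical helper, not a smartmontools function.


def split_seagate_raw(raw):
    """Return (errors, operations) under the assumed 16/32-bit layout."""
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations


# Made-up sample values, for illustration only.
for raw in (0x0000_1234_5678, 0x0002_0000_0010):
    errors, operations = split_seagate_raw(raw)
    print(f"raw={raw:#014x} errors={errors} operations={operations}")
```

Under that reading, any raw value at or below hex FFFFFFFF has an error count of zero, which is why only larger values would be worth worrying about.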

But if you feel the drives are questionable, I too would replace them all.

See.

Could there be a setting in TN to force increased fan speed?

This is now 5 drives… causing this…

Previously it was housing 5 x 4TB and 3 x 8TB, so the heat generation should be less than before.

G

TN is unlikely to add fan controllers to their software. I had this issue on one of my systems and was able to set the fans to max at all times. It makes it noisier, but keeps my drives below 40C.

Once the current resilver is done and I install the old 2 x 8TB drives, I will drop into the BIOS and see if I can up the fan speed. Strange though, going from 8 drives down to 5 and having issues… although I can’t think it’s the root cause.

G

The other thing you can try is buying a separate fan controller. I looked at a Noctua controller for one of my systems since I can’t force the fans on full in the BIOS.

I would also look at what direction your fans are blowing and whether there are cables, etc. blocking the airflow to your drives. This might just be an “I got WAY too many cables in front of my drives” issue.

Re-uploading a SMART report.

Can see how this drive is not a problem, yet the disk pool is reporting degraded… 1 drive not working…
Thank F for RaidZ2.

G
sdb2.txt (19.9 KB)

Could be, but this morning when I replaced the other drive I actually made sure the cables were out of the way, and spread the 5 drives across the 8 bays.

I previously had one drive overheating and got temp warnings inside the TN console; this time, nothing…

G

… just realised: although I pulled the information today, is it not based on the last SMART scan… which itself was not today, but before this morning?

It’s as if this drive went faulty while re-silvering the drive that was replaced this morning.

The re-silver of the new drive is now complete… Might I try a scan of this drive?

Do we want to somehow confirm its state first… confirm what the dashboard is saying, but somehow by putting load on it?

G

Interestingly enough, with that second “failure” as per above… I ended up re-installing the 2 x 8TB drives I had. Going to create a Raid1 disk pool from them and move some “critical” data off this pool.

While I had the system open this morning I unplugged all the power and data cables and replugged them… I also “reinstalled” the failed drive, physically and with the SATA cable connected. I want to run badblocks on the 2 x 8TB and on the failed drive from yesterday morning.

The disk pool itself, well it’s all Green now… no sign of yesterday’s error.

I will first do badblocks on the 8TBs, then create the new pool, then move the critical data, and then run a long scan on the problem drive from yesterday, and then on the other drives of the pool if that one passes the scan.

And oh btw, AliExpress is saying too bad, so sad…

G

I have dealt with them before, got all my money back, just took a few days. Hope the drives pass badblocks if they are not willing to refund the money.

They already informed me they won’t be replacing the drive.

I bought 5 drives from a vendor, had one failure in December, and got refunded. While waiting for that refund I replaced the drive via another vendor. This was now the 2nd drive from the original set… and then I had that other drive that threw errors yesterday, which somehow now appears healthy…

First I want to run badblocks on the drive that failed yesterday; if that passes then I’ve got a spare… I will then run a long test on the drive that gave errors yesterday during the re-silver and see from there.

badblocks command planned:

sudo badblocks -w -s -b 2048 -o out.txt /dev/sd?
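
A note on the `-b 2048` in that command, with a small Python sketch. `-w` is the destructive write-mode test, so everything on the target drive is erased. The sketch assumes that `badblocks` cannot address more than 2^32 - 1 blocks, which is how the e2fsprogs versions I have used behave but may not hold for every release. It also assumes nominal decimal capacities such as 8 TB = 8 x 10^12 bytes. The function name is made up.

```python
# Sketch: smallest power-of-two block size that keeps the block count
# within an assumed 32-bit limit.
# ASSUMPTION: badblocks rejects devices with more than 2**32 - 1 blocks.
# smallest_block_size is a hypothetical helper, not part of e2fsprogs.

MAX_BLOCKS = 2**32 - 1


def smallest_block_size(capacity_bytes, start=1024):
    """Double the block size from `start` (the badblocks default of 1024)
    until the device fits in MAX_BLOCKS blocks."""
    size = start
    while -(-capacity_bytes // size) > MAX_BLOCKS:  # ceiling division
        size *= 2
    return size


# Nominal capacities in bytes (decimal terabytes), for illustration only.
for terabytes in (4, 8, 16):
    print(f"{terabytes} TB -> -b {smallest_block_size(terabytes * 10**12)}")
```

Under those assumptions an 8TB drive needs at least `-b 2048`, which matches the planned command, while a 4TB drive would fit with the default block size.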