2 out of 4 disks have SMART failures

Hello everyone,

During the winter I upgraded the hardware of my NAS. I chose four recertified Seagate Exos X16 hard drives in a single RAIDZ2 vdev.
Two weeks ago the first HDD showed some SMART errors, so I decided to send the drive back to the seller. Today the second HDD is showing SMART failures too, so I disconnected that drive as well. So 2 out of 4 disks have problems.
Now my redundancy is gone and I would like to “stop” my dataset to secure my files. Is there any special procedure to follow?

Thanks for any advice,
Regards Chris

Which vendor?

I’m using tens of recertified drives. Some have had a single-digit number of bad sectors; I pull and replace those while running a full test to reallocate the bad sectors. Some have gone back into production without any additional errors after months. This is for data I can replace if lost, so it’s not something I’m recommending.

Just export the pool? That will disconnect it from further use. You can re-import later.
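If you prefer the command line, the export and the later re-import could look roughly like the sketch below. The pool name tank and both function names are placeholders, not anything from this thread, and if your NAS software has its own export/disconnect option in the UI, that is probably the safer route.

```shell
# Sketch only: "tank" is a placeholder pool name, not taken from this thread.

take_pool_offline() {
    pool="$1"
    zpool status "$pool" || return 1   # check the current state first
    zpool export "$pool"               # unmounts the datasets and releases the disks
}

bring_pool_back() {
    pool="$1"
    zpool import "$pool"               # re-import once the replacement disk is installed
    zpool status "$pool"               # confirm which devices are present or missing
}

# Usage on the real system (left commented out on purpose):
# take_pool_offline tank
# bring_pool_back tank
```

While the pool is exported nothing can write to it, which is the point when there is no redundancy left.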

The drives are from a German vendor, Mindfactory. I am already in contact with them. My first drive has been in their shop for 1.5 weeks, but nothing so far.

Thanks for the suggestion about disconnecting the dataset.

See if you’re affected by this:

It sounds like yours weren’t sold as new, but…

Were your drives ‘manufacturer recertified’, or does the vendor ‘recertify’ them? It’s all luck, but 2 out of 4 is a very poor ratio compared to the luck I’ve had with manufacturer-recertified drives.

Thanks for the information.

I don’t know whether the vendor or the OEM recertified them. In their shop they are named "Seagate Factory Recertified", so I assume they were recertified by the OEM.

I have already compared the SMART and FARM values, and to me they look okay. The operating hours of the heads were also unremarkable and fit with the operating hours of the drive.

My last 5x 2TB WD Red drives were also recertified drives and have survived 5 years without any failure.

I’d assume the same from ‘factory recertified.’ Maybe just bad luck, then.

How many errors did you get? I don’t recommend this for important data, but if I get <10 and the numbers don’t increase, I just keep using them.
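That rule of thumb (fewer than 10, and not growing) can be written down as a tiny check. This is just a sketch: the threshold of 10 is the one from the post, the function name is made up, and the commented smartctl line assumes the pending count is in the raw-value column of the attribute table.

```shell
# Sketch of the rule of thumb above: fewer than 10 pending sectors AND no growth.
# pending_verdict is a made-up helper name; the threshold 10 comes from the post.
pending_verdict() {
    prev="$1"   # pending-sector count at the previous check
    cur="$2"    # pending-sector count now
    if [ "$cur" -lt 10 ] && [ "$cur" -le "$prev" ]; then
        echo "keep watching"
    else
        echo "replace"
    fi
}

# One way to read the current count (assumes the usual 10-column "smartctl -A" table):
# smartctl -A /dev/sdX | awk '/Current_Pending_Sector/ {print $10}'
```

For example, pending_verdict 3 3 prints "keep watching", while a count that has grown or is already in the hundreds prints "replace".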

If you have another pool or drive you can copy the data to, maybe make a copy then disconnect the pool until you can restore parity.

The first drive had 278 currently unreadable (pending) sectors, and the second one now has 178.

Ouch. Yeah I wouldn’t trust those for continued use.

I have a second NVMe dataset, but its capacity is not big enough. So I have now disconnected the drives and will wait until the first drive is delivered, so I can start a resilver task.

Not sure how you got to that.

I bought 5 IronWolf drives originally; 1 failed outright and was refunded…
We’re now back to 5. Out of those 5, the one I bought from a different vendor than the original 5 is still all good, and its SMART and FARM numbers match up. The other 4, from the original purchase, well, all 4 are failing or have failed.

After the initial 5, I bought 2 replacement IronWolfs (1 as per above) and 1 Exos.

I haven’t even introduced the Exos into a disk pool, as I can see the discrepancy in the FARM log.

All but that 1 have issues…

Chris, I suggest you run smartctl -l farm /dev/sdX on the various drives and see if the power-on numbers and the power cycle numbers make sense and line up.
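To put the two sets of numbers side by side, something like the sketch below might help. It assumes smartmontools 7.4 or newer (older versions do not have the farm log option) and a Seagate drive; show_power_counters is a made-up helper name, and the grep filter is deliberately loose because the exact field labels may differ between versions.

```shell
# Sketch: print the power-related counters from the regular SMART attributes
# and from the Seagate FARM log so they can be compared by eye.
# show_power_counters is a hypothetical helper name.
show_power_counters() {
    dev="$1"
    echo "== SMART attributes: $dev =="
    smartctl -A "$dev" | grep -i 'power'        # e.g. power-on hours, power cycle count
    echo "== FARM log: $dev =="
    smartctl -l farm "$dev" | grep -i 'power'   # FARM's own power-on / power-cycle fields
}

# Usage on the real system (left commented out on purpose):
# show_power_counters /dev/sdX
```

If the FARM values are far higher than the SMART attributes, the SMART counters may have been reset at some point.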

G

Not sure how I got to what?

Reconnect the disk.

A faulty disk that has some faulty sectors is better than no disk.

Then get the disks replaced.

I personally obtain a replacement disk before sending an in-use disk back for RMA (this may involve purchasing a “spare” disk).

I have already checked this information and it looks valid to me.

Too late, the second disk is already on its way to the vendor. The last 2 disks are now disconnected and down until I have a new disk.

It’s too late now, but I wouldn’t recommend doing that. As you know, if one more drive fails, you will lose everything. An ‘erroring’ drive may last long enough to rebuild.

Anyway, hope it all works out.

Yes, there is a defined procedure for this - it is called “Shutdown & power-off”.
