Boot pool keeps failing

So…

I used a single 500GB Crucial SSD as my boot drive. It wasn’t mirrored (my bad), and I got an error complaining that the boot pool was degraded (unrecoverable sectors). I assumed it was just a bad disk, so I grabbed another 500GB Samsung drive and mirrored the pool. It seemed fine, then the Samsung drive got listed as removed, and of course the pool is now degraded again.

Up until now they were just disks I had lying about, so I thought better of it and bought a new 500GB Crucial disk because it matched the model of the first, but the same issue occurred as with the Samsung drive.

So… I stripped both disks, did a full wipe of the new disk, reinstalled SCALE and restored my config file. All seemed good until today, when it failed again, and now I’m down.

I’ve lost no data. I have a 100% up-to-date backup NAS which is working fine, and the drives that make up my SSD pool are fine, as is my spinning-rust pool.

My SSD pool is made up of Intel Enterprise SSDs and they are all good, so I’m wondering if the Crucial disks are just crap. Two second-hand drives failing would fall right in line with my luck lately, but a 3rd, brand-new disk failing as well seems a bit much.

I have 2 new 200GB Intel Enterprise drives on the way, but am I setting something up wrong that is causing the issues? (The initial lack of a mirror aside.)

I’d appreciate any suggestions.

Sorry, should have said:

Dragonfish-24.04.0
The boot pool is running on the on-board SATA ports rather than the HBA cards.

:100: :100:

Have you tried running SMART tests on the disks that failed?
And have you tried another SATA cable or a different SATA port?

The motherboard has only 4 SATA ports: 2 I intend for the boot pool and 2 are for my Log Cache, which sits in front of my spinning-disk pool.

I’ve ordered new SATA cables as a precaution, but I have tried running the boot pool off different SATA ports and no joy. Once the new drives arrive I will reimage, restore my config and then mirror the pool. If it fails again at that point, it’s either the SATA ports themselves or Dragonfish.

Has anyone ever used a PCIe SATA riser card?

https://www.amazon.co.uk/ACTIMED-Controller-Expansion-PCI-Express-Windows10-PCIE-SATA-4-port/dp/B0BCKK5D3J/ref=sr_1_1_sspa?crid=25BCF116EMXLX&dib=eyJ2IjoiMSJ9.gOBH4loWdwxO6BfKoWBqSPpL75mvxh0_ctSdejlQO8F_25ixlhQe7jVabgpwVDdSumMlXkEv0gjoz-gcBNG2iFg4huWXcvYgaHTknQZZseHuIGMMCwutxNgVtwqTqDnP9pZL4XYabDFLM-c2IkQgGtG7g8hemG67oVHpVQYisqEUGZyN3XGsdZJfK_gZ8XER6gIpIcShndPKfHfda-ppWRoO6utwTTeYfgAzCZPo8_Q.RMBIEUUNhyJBx8SkN77F7psUPoqKocWGt7FSFbMy2qw&dib_tag=se&keywords=pci+sata+card+4+port&qid=1716026077&sprefix=PCI+SATA%2Caps%2C133&sr=8-1-spons&sp_csd=d2lkZ2V0TmFtZT1zcF9hdGY&psc=1

Something like this, to bypass those ports altogether.

I had run a short SMART test on the SSDs and that was a success.

Yes, and trust me, don’t waste your money. It can only make things worse.

You should try the long test, not the short one (on an SSD it is still really fast).

I have the same concerns as you; it is really strange that all of the disks fail… I hope someone else can be more helpful.
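For anyone following along, here is a rough sketch of what the long test looks like from the shell, wrapped in a small Python helper. It only uses standard `smartctl` options (`-t long`, `-l selftest`, `-A`); the device path `/dev/sda` is an assumption and needs replacing with the actual boot disk, and the helper defaults to a dry run that just prints the commands.

```python
import shlex
import subprocess


def smart_long_test_commands(device):
    """Return the smartctl invocations for a long self-test on one device."""
    return [
        ["smartctl", "-t", "long", device],      # start the extended (long) self-test
        ["smartctl", "-l", "selftest", device],  # read the self-test log once it has finished
        ["smartctl", "-A", device],              # dump the vendor attributes (sector counters etc.)
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (needs root)."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # smartctl uses a bitmask exit status, so a non-zero code is not always fatal
            subprocess.run(cmd, check=False)


if __name__ == "__main__":
    # "/dev/sda" is a placeholder, not the poster's actual device
    run(smart_long_test_commands("/dev/sda"))
```

The self-test runs in the background on the drive, so the log only shows a result after the test has had time to complete.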

There is / was a bug with Crucial MX500 SSDs that showed a single sector error. Not more than 1, just a single error.

If that’s what you are getting, you may be OK. Try a different brand of SSD.

My Intel Enterprise SSDs have arrived, so I’m going to rebuild with these anyway.

I have run a long SMART test and wiped them with zeros. All good.

Let’s see what this does.

right…

Intel drives installed, SCALE installed & my config file has been restored. All back up and running

If it goes again, it’s either the SATA cables (don’t see that myself), the motherboard ports or Dragonfish.

I’ve ordered fresh SATA cables from Lindy, so I reckon I might swap them out for new ones anyway.

We’ll see, I guess.

Well, that lasted all of 20 mins.

2nd drive removed by the administrator, Pool status degraded.

I’ll swap the cables out tomorrow and hope it’s just that. I’ll have to see now if the first drive of the pool holds up.

so frustrating

Can I get a double check on my troubleshooting here?

I have four onboard SATA ports:
2 for the boot pool
2 for the Log Cache Drives

The Log Cache drives have always worked, so by association their SATA cables are good.

So I have binned off the Log Cache for now and, using what in my head is a known-good SATA cable and port, plugged a new second boot pool drive into one of these ports.

If the boot pool becomes degraded again, I will have ruled out the port/drives, and this is another Dragonfish issue?

How is it going, still degraded? :face_with_spiral_eyes:

It was fine until about 20 mins ago, and then I got…

Boot pool status is DEGRADED: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected…

That’s the 4th drive now, on 2 different ports and two different SATA cables.

It’s also the 3rd install of Dragonfish, with me uploading my config afterwards.

How can it be anything other than Dragonfish at this point, unless all the SATA ports on the board are faulty?

In my head, the only course of action that can rule that out is a PCIe SATA riser card, using its ports instead.

Can I install a previous version of SCALE and upload a saved Dragonfish config file?

The whole thing’s just gone down again…

I can’t believe I have that many damaged SATA cables and disks. I have ordered the riser card so I can bypass the ports on the motherboard; that’s on same-day delivery at least. New SATA cables should arrive today.
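One way to sanity-check that reasoning is to write down what was in play for each failure and see what never changed. The sketch below does that with a plain set intersection. The labels and the pairing of drives to ports and cables are assumptions made up for illustration (the thread only gives the totals: 4 drives, 2 ports, 2 cables), and the "always present" set is a guess at what stayed constant.

```python
def common_factors(failed_trials):
    """Return the components that were present in every failed trial."""
    sets = [set(trial) for trial in failed_trials]
    return set.intersection(*sets) if sets else set()


# Things that were (as far as the thread says) never swapped between failures.
ALWAYS_PRESENT = {
    "Dragonfish-24.04.0",
    "restored config file",
    "onboard SATA controller",
    "motherboard",
    "PSU",
}

# Hypothetical pairing of drives, ports and cables; only the counts come from the thread.
failed_trials = [
    ALWAYS_PRESENT | {"drive 1", "port A", "cable A"},
    ALWAYS_PRESENT | {"drive 2", "port A", "cable A"},
    ALWAYS_PRESENT | {"drive 3", "port B", "cable A"},
    ALWAYS_PRESENT | {"drive 4", "port B", "cable B"},
]

if __name__ == "__main__":
    for factor in sorted(common_factors(failed_trials)):
        print(factor)
```

Under those assumptions the individual drives, ports and cables drop out, but Dragonfish is not the only thing left: the restored config, the onboard controller, the motherboard and the PSU were also common to every failure.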

The PC itself is still on. Temps have been comparable to my other NAS all day, so I don’t think it’s an overheating issue, and it never powers off, so I don’t think it’s a power issue.

I’m going to check the BIOS settings and anything relating to the SATA ports, but I don’t think it would work for any period of time if they were wrong.

If I didn’t already shave my head, I’d pull my hair out.

At this point I’d say it’s the power supply. PSUs can fail in many ways that do not result in a direct shutdown.
If not that, it would be the RAM or motherboard.

The RAM is easy enough to test.

Deffo not the RAM. I have the exact same RAM set in my other NAS and I swapped the two sets over.

If it was the PSU, it’s only causing the boot pool issues, so by extension only affecting the onboard MB SATA ports. When it goes, it drops to the command line complaining of an unrecoverable error. I reboot the PC and it complains it can’t find a boot device.

I’ve also tried different tails of the PSU to eliminate that as an issue.

I’ll see if this riser card makes a difference when it arrives, and maybe swap the PSUs between the two NAS boxes and see if that’s a thing.

So frustrating. I wanted to have finished and forgotten about it by now.

The PSU is something common to all the drives.

Sounds like you need to try a different one.

When SCALE was new, I used a test setup made out of old PC hardware. The MB had two Intel SATA ports and additional ports on an ASMedia controller. The symptoms then were a bit like yours: the pool I created always kept degrading. I did not have your patience, so I just bought a Dell 310 HBA, I think, flashed it to IT mode and the problems went away.

That’s true. The thing throwing me is that it only ever affects the boot-pool; when I reinstall and upload my config, the drives that make up my other pools are all fine.

It seems odd that a PSU issue would only (& repeatedly) affect just the one set of drives, in this case the boot pool. I wouldn’t have thought it would be that specific.

As mentioned, if my next step doesn’t solve it, I’ll swap the PSUs between the two NAS boxes and see what happens.

I have more HBA cards on standby but only 1 PCIe x16 slot left free; I was hoping to keep it free for a GPU later if I needed it for Plex transcoding.

Leaving this as my last resort, short of a new motherboard.