Mid migration drive fault; new user. please advise

trashy · July 22, 2024, 12:43am

Hi All,
I’m most of the way done migrating from a synology system over to a new TrueNAS Scale 24 (latest) install on a supermicro dual xeon I picked up. It’s a 12 bay system. I have 8 drives in now as mirror VDEVs, all the same model. One of them is faulty and I’m not sure what to do. Here’s how I got here:

I started w/ new (refurbished) Seagate EXOs 4 x 16TB HDD in mirror vdev stripe to make a pool.
After I had migrated my data to the new pool from my synology, which also had 4 x 16 TB in BTRFS, I pulled those drives out and set them aside to wait ‘just in case’, and I loaded my old 4 x 4 TB drives from way back into the synology.

I then backed up key data from the TrueNAS system onto the old drives on the synology in JBOD as that will be my 3rd backup (first is cloud, second NAS, now third is syno).

At that point I felt good enough to take the 4 x 16 TB drives I had set aside and wipe them while adding them as more mirrored VDEVs to the same pool. Size increases. Looks great.

When I had migrated the data from the synology, it all went onto the first 4 drives. People talk about a ‘reflow’ to redistribute data across all the VDEVs, which is what I wanted to do here. I also didn’t find an instant way to move the data within the datasets, so I set up a copy (in midnight commander) and let it rip. I had moved some of the data from the syno backup to target dataset folders with rsync before, which was good, but I wanted to try midnight commander. This one is a lot of data.

About 9 hours into the 15 hour copy, I got an alert that one drive has an increased error count, which is now at 40 read errors. A corresponding alert also says "Pool EXOS-Pairs state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

Disk ST16000NM001G-2KK103 xxxxxx is FAULTED" (do you guys keep serial numbers private?)

zpool status for the drive says

 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
        repaired.
also “FAULTED too many errors”
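
For anyone who wants to pull the unhealthy devices out of that kind of output with a script, here is a minimal Python sketch. The sample text, the pool layout, and the device names (sda through sdd) are made up for illustration and only follow the general column layout of zpool status (NAME, STATE, READ, WRITE, CKSUM). They are not the actual output from this system, and the helper name unhealthy_devices is hypothetical.

```python
# Hypothetical sample in the general shape of `zpool status` output.
# Device names and counters are invented for illustration only.
SAMPLE_STATUS = """
  pool: EXOS-Pairs
 state: DEGRADED
config:

        NAME          STATE     READ WRITE CKSUM
        EXOS-Pairs    DEGRADED     0     0     0
          mirror-0    ONLINE       0     0     0
            sda       ONLINE       0     0     0
            sdb       ONLINE       0     0     0
          mirror-1    DEGRADED     0     0     0
            sdc       FAULTED     40     0     0  too many errors
            sdd       ONLINE       0     0     0

errors: No known data errors
"""

# States that point at a specific bad device. DEGRADED is left out on
# purpose because the pool and the parent mirror report it as well.
BAD_STATES = {"FAULTED", "UNAVAIL", "REMOVED", "OFFLINE"}


def unhealthy_devices(status_text):
    """Return (name, state, read_errors) for rows in a bad state.

    Counters are kept as strings because zpool may abbreviate large
    values (for example with a K suffix).
    """
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":
            in_config = True  # header row of the device table
            continue
        if fields[0] == "errors:":
            in_config = False  # table is over
        if not in_config or len(fields) < 5:
            continue
        name, state, read_errors = fields[0], fields[1], fields[2]
        if state in BAD_STATES:
            rows.append((name, state, read_errors))
    return rows


for name, state, read_errors in unhealthy_devices(SAMPLE_STATUS):
    print(f"{name}: {state}, read errors: {read_errors}")
```

On a real system the text would come from running zpool status for the pool rather than from a hard-coded string.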

I am running a SMART test on the faulty drive right now. The big MC copy event ended without issue. I am backing up non-essential stuff to the synology because the faulty drive isn’t showing activity in its faulty state.

Should I wait until SMART is done tomorrow and then run a SCRUB? And will the drive, which appears to be neither offline nor online, be included in that SCRUB?

Should I use that interesting function to remove that whole mirror VDEV which, I think, would migrate a few TB of data to the remaining 3 VDEVS (6 drives)?

Will the faulty drive be salvaged? I had a funny bug like this years ago when I first got my synology going and a simple system repair fixed it and never saw a problem again.

I did already order a replacement drive to be here in 2 days. I just don’t want to get screwed over in the 3-4 days it will take to get past this. Key data is at least in 3 places and data I really like is still on the new system and synology JBOD. Other stuff I kinda like, but don’t need, is all that is at risk at this point. It would sure be a big bummer if it were gone; it’s just not worth a ton of expensive cloud storage.

Sorry for the noob style of this. It’s my first post and I read various forum posts across the internet about this awesome system and I hope I’m not breaking too many noob rules. Haha. Been on many forums back in the day and I know people have a way of doing stuff.

Please help me make a good choice. I kinda trust in the other refurb EXOs to hold out.

SmallBarky · July 22, 2024, 1:07am

You probably stand the same chance of failure whether you attempt to move data or wait for the replacement drive and resilver.

Why did you choose your mirror configuration?

Next time do a burn in test on drives.

Post your complete TrueNAS setup hardware and software details and your goals for the machine. You can get others to look over what you have and make recommendations for pool setup, etc. You might have been better off with a different pool layout and vdevs.
Your current layout kills the entire pool if both drives go out in one mirror pair. You mention cloud backup instead of backup local so I don’t know how long it would take you to recover from total pool death. Higher reads or block storage may have been reasons for your choice.
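
To put a rough number on that risk, here is a small Python sketch that counts which two-drive failures are fatal for a pool of four 2-way mirrors. It assumes every drive is equally likely to fail and that failures are independent, which same-batch refurbished drives may not satisfy. The drive labels d0 through d7 and the function name are placeholders.

```python
from itertools import combinations


def fatal_two_drive_failures(mirror_pairs):
    """Count two-drive failure combinations that take out a whole mirror.

    Returns (fatal, total), where fatal is the number of combinations in
    which both failed drives belong to the same 2-way mirror.
    """
    drives = [d for pair in mirror_pairs for d in pair]
    pair_sets = {frozenset(pair) for pair in mirror_pairs}
    combos = list(combinations(drives, 2))
    fatal = sum(1 for combo in combos if frozenset(combo) in pair_sets)
    return fatal, len(combos)


# Hypothetical labels for 8 drives arranged as 4 mirror pairs.
layout = [("d0", "d1"), ("d2", "d3"), ("d4", "d5"), ("d6", "d7")]
fatal, total = fatal_two_drive_failures(layout)
print(f"{fatal} of {total} two-drive failures lose the pool")

# With one drive already faulted, only its partner is critical:
remaining = 2 * len(layout) - 1
print(f"1 of the {remaining} remaining drives is the critical one")
```

Under those assumptions, most second failures would land in a different mirror, but the one that hits the partner of the faulted drive loses everything.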

Pointing to ZFS primer.

trashy · July 22, 2024, 1:55am

Hey thanks @SmallBarky ! I also wondered if it’s similar odds. What about the remove VDEV option? I calculated it would reflow a few TB back to the remaining VDEVs and maybe take a couple hours? Hard to find info on that because my cursory understanding of TrueNAS is that that’s a newer function? Seems like most posts say you can’t, yet it’s right there as a button you can click in the WebUI. Does it not work like that? Would it just destroy my pool of data?

I chose mirror to get some read/write speed to go with higher network speeds I’m rolling out. I should have done a burn in but only did a short SMART and I guess lesson learned. Based on the link you posted I should do SMART tests one drive at a time, too. Is that still valid? My scheduler had them set to do LONG once every 2 months on ALL DRIVES which, I assumed, ran them all at once. Perhaps the setup and intentions will help clarify some of my decisions. But I am not opposed to reframing my plans in the longer term and you’ll see why.

Setup:
2U supermicro 12-bay (8 SATA/SAS, 4 SATA/SAS/NVME combo), with 2x2.5 SATA in the back. This is on an expander backplane with nice hot swap bays. Basically this is a package from ebay that looked like a great deal and easy way to start the new path I am aiming for.
2 x Xeon E5-2680 V3.
128 GB DDR3 ECC
X10DRH-iT (2x 10GbE model)
2 x 920W PS
The HBA is AOC-S3008L-L8e in IT mode (I think I remember that right).

Drives:
8 x 16 TB EXOs SATA
2 x Samsung EVO 850 Pro (boot drive and SSD partition for apps).
1 x 1TB USB drive for Frigate recordings. This was moved over from my synology and used to alleviate the NAS of constant read/writes and make the data portable at the same time.

Use/Intentions: Been into tech since I was a kid around 1988. I got my 286sx25 used from a friend in maybe 1990ish and had to overclock it to 33 MHz to get doom to run. I love tinkering with stuff. For a long time I produced music and turbocharged Nissans (including programming them with custom tunes and knock light, wide band O2, etc). Tech stuff and tweaking are fun. My career is similar: very technical and based on tweaking and modifying detailed things. Love it.

Fast forward to about 10 years ago and I’m a family man and I miss the old days. Everything that mattered on the internet is gone. The way of life, the home pages, geocities, all kinds of crap that you wish you could reach for. I’m tearing up typing this. They said the web would keep everything, but when web 2.0 took over, we weren’t hosting our stuff, THEY were. The ‘free’ ad/bigdata people were, and they frankly don’t care about legacy at all. I’m sure the people there do, but business is business. A recent article found 38% of web links from 5 years ago are dead. It’s like the Nothing from The NeverEnding Story is consuming our world of digital experiences!

Ok so 10 years ago I get microsoft 365 and my family uses it to back up our photos and critical stuff. It’s not bad. We each have 1TB. Added the extra account we didn’t need as a funny little account just to squeeze that included TB for some work related data I needed held for a couple years. I get nervous about online clouds. Look at this CrowdStrike crap that just happened! So I eventually get a Synology DS918+ going with 4 x 4 TB and set up BTRFS and SHR2 for extra protection. 8TB of good backup. Everything my family does goes to OneDrive and then to the synology. Feels pretty good. I dig up my old music production stuff, raw recordings, tons of ripped sample CDs I bought for production libraries, etc, and all that is backed up nice.

Then in the last year the synology was getting full and I have services running on it and the added 16gb of memory I had on there just wasn’t enough. The thing just isn’t so responsive. We use it for Plex and NVR for 2 security cameras with the surveillance station, etc. But this thing is chunkin along.

Recently I got a nice HP Elite Desk Mini and put proxmox on it, some added ram, and dual NVME drives for mirror OS and data so it’s a little snappy fireball. My kids have been playing minecraft on our PCs for something like 10 years and I finally have the ability to host it and back it up nightly! I moved some of the synology services over like Plex and PiHole and use the syno just as NAS. That’s going so so but now I’m kinda hooked on homelab stuff. Reading more about it, I really wanted a DS1825+ to come out this year to expand into, but the disappointing reality is that even though Nascompares had a nice leak about one actually coming, everyone knows it’s going to have old-bummy hardware and JUST STORE DATA. EEK. Why do I care?

Well AI is not really that intelligent yet, but it sure has use, and it’s accelerating. AI use on the internet is grossly an idea phishing scheme for their developers and if we use their systems and find special uses, they will deploy that before we do. I don’t read the EULA (always too long) but I’m sure something like “all your ideas are ours” is in there.

Ollama seems cool. AI agent swarms look interesting to develop, and oh man that’s just another dimension of tinkering to do on top of this hardware crap I got going on!

So I see that ebay has supermicro gear that’s maybe 5-7 years old and less than 10% of its original purchase price and basically a million times better than synology’s home-user hardware of the same price point. So here’s the thought… Shift all the family data to the supermicro, and it can probably host all that at really basic speeds. But then build another fatty Ollama server in the next year or so and use the supermicro 12 bay nas we are talking about here as a bank of data for it. Not the family data, but the other dozens of TB it has, haha. Load it up with my ideas and build out something fun.

I also might run some R Studio Server on an ubuntu server VM at times.

Ok so that came out like we’re at a bar and I’m 4 drinks into my woes.

I hope that helps. I will probably make a cute signature box like most of you OGs.

SmallBarky · July 22, 2024, 2:46am

WOW, that’s a lot.

Might help with signature link. I had to guess and play around to figure out how to do the ‘hidden’ thing. Forum Suggestions - #110 by Davvo

I think you stand the same chance of the second drive in the degraded mirror failing no matter what you do. All three methods have the drive reading and writing, moving data around.

It might help to look at the drive / pool data from Calomel. The test is old but it gives a general idea of drives, pool configurations and performance. ZFS Raidz Performance, Capacity and Integrity Comparison @ Calomel.org

trashy · July 24, 2024, 1:44am

Ok well I did get all of the data backed up to synology JBOD. I sat on it for a day as I’m waiting for the replacement drive and decided to reboot. After the reboot truenas put the faulted drive back into the pool, and it seems to be attempting to resilver it.

If it resilvers fine what do I do? Accept it? It failed a LONG SMART test a day and a half ago…

SmallBarky · July 24, 2024, 4:08am

I wouldn’t use the failed drive because of your pool setup. I don’t want your pool dead if the other drive in that mirror pair goes bad. You also have large drives and they take quite a while to resilver.
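
As a back-of-the-envelope illustration of why big drives take a while, here is a short Python sketch. It assumes a resilver only has to copy the data actually allocated on the vdev and that the drive sustains a steady average rate. The amounts and throughput figures below are guesses for illustration, not measurements from this system, and real resilvers are often slower on a busy or fragmented pool.

```python
def resilver_hours(allocated_tb, mb_per_s):
    """Estimate resilver time in hours.

    allocated_tb: data to copy onto the new drive, in decimal terabytes.
    mb_per_s: assumed sustained average throughput in megabytes per second.
    """
    seconds = (allocated_tb * 1e12) / (mb_per_s * 1e6)
    return seconds / 3600


# Hypothetical scenarios: a half-full and a nearly full 16 TB mirror member.
for allocated_tb in (8, 14):
    for mb_per_s in (100, 150, 200):
        hours = resilver_hours(allocated_tb, mb_per_s)
        print(f"{allocated_tb} TB at {mb_per_s} MB/s: about {hours:.1f} hours")
```

The whole time that copy runs, the surviving drive in the pair is the only copy of that vdev’s data.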

If you don’t have to return the drive, you can run stress tests on the failed/failing drive if you have another spare computer.

Have you been keeping an eye on your drive temperatures? If they appear high, make sure you have good airflow and the fans are working.

Davvo · July 24, 2024, 4:21am

I would eject it as soon as the replacement arrives.

SmallBarky · July 24, 2024, 4:53am

You heard Davvo, eject.

trashy · July 24, 2024, 10:40am

Thanks all.

I will replace immediately. After it’s out of the mirror can I run those burn in tests on the drive in the same system? There’s still 4 open bays and I’m sure the system can handle it. It’s in the middle of another long smart right now because I’ve got until this afternoon before the new drive shows up.

Thanks for all the great advice.

trashy · July 24, 2024, 2:02pm

LONG SMART failed again but system does show mirror ok, which I’m guessing means it has all the data on good sectors. It has a couple more critical alerts too from the truenas scale OS. But the mirror says it is good!

Ok just to give the latest as we wait for the replacement.

This SM server is new to me from last week, and obviously used, but looks super clean. It’s from reputable people on ebay. Anyway. Mildly concerned based on what I read about errors that it could be backplane, power source, or HBA heat related although my system temps are awesome. Drives all 32 deg C, processors 37-40 deg C, etc. I had the case open and the HBA/SAS card was quite warm.

So the reason I bring it up is my plan: I will put the new drive in a different slot this afternoon when it gets here, resilver, and then move it to the slot of the failing drive. If I see errors AFTER that, then maybe it’s something local to that slot or one of the hardware issues considered above?

Also, I want to enable snapshots but I’m unsure about doing it in the middle of this hardware crap. Should I wait until the drives are settled or can that run at the same time as these drive swaps, burn in and SMART tests and resilvering?

I really appreciate the thoughtful time y’all are taking with me and the responses here.

Best,
Cody

Davvo · July 24, 2024, 5:51pm

Snapshots are not going to place any significant burden on the other drive of the pair.
