I’m using TrueNAS Core on an Intel i5 with 32GB RAM and an NVMe boot device.
There is one pool containing six 3.5" 8TB SATA drives in a RAIDZ2 configuration.
I’m just using an SMB share to access the files from a few devices on the local network.
I had a problem reading files, so I checked the TrueNAS dashboard from the web interface. I got this in the alerts:
Pool Nova state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
2024-11-24 00:00:41 (America/Los_Angeles)
I feel like that means I need to replace one of my drives. They are only 7 months old, so I may be able to RMA.
Is replacing a drive in TrueNAS easy to do? It wasn’t covered in the tutorials, but I’m guessing this is something people do, so is there a good guide?
Before you do that, you’d need to ascertain that there are drives to be replaced in the first place. Please post hardware details and the output of `zpool status -v`, `camcontrol devlist`, and `smartctl -a /dev/adaN` (or `daN`, for all relevant values of N),
with text outputs nicely placed between triple backticks ``` for readability.
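For reference, a sketch of how those diagnostics can be collected from a TrueNAS Core shell (the `ada0` device name below is an example; substitute whatever `camcontrol` actually lists on your system):

```shell
# Pool layout, state, and per-device read/write/checksum error counters
zpool status -v

# Enumerate the ATA/SCSI devices the kernel can currently see
camcontrol devlist

# Full SMART report for one drive; repeat for each adaN/daN device listed
smartctl -a /dev/ada0
```

These commands are read-only, so running them carries minimal risk even on a degraded pool.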
The problem has gotten worse; I’m starting to worry that it’s pretty broken.
I couldn’t connect to the web interface.
Plugging in a monitor, I was getting some sort of `ada3:` error. Disconnecting the third drive made the problem go away and it booted normally, but it told me the pool was degraded, which I guess I expected because I’d lost one drive from a RAIDZ2.
I shut it down to deal with when I could make time.
Today I’m back in and cannot connect to the web interface. Connecting a monitor, there are some ada2 errors and a whole load of `metaslab.c:2457:metaslab_load_impl()` messages, which frankly scares me.
My plan was to learn how to replace a drive and then do that.
Losing ada2 immediately after ada3 makes me worry I’ve just lost data. I don’t know exactly how a 4+2 array works, but that feels bad.
I don’t think I can type any commands; I just get those errors on loop.
Adding sudo still has no noticeable effect.
I’m not rightly sure I can copy/paste anything; I’ve attached a keyboard and mouse to the machine.
I can’t connect remotely right now; I’m just waiting for it to be pingable or for the web interface to work. Best I could manage is taking a picture with my phone, I think.
Scratch that, it’s just starting interfaces now. I may be able to connect soon.
It took 10-15 minutes longer than normal to boot this far. I was assuming it was stuck forever, but maybe I just needed to be more patient.
Sorry, I should have specified.
It’s booted and I can connect to the web interface now, which means I can launch a shell from there and copy/paste. It also means I can stand down one DEFCON level from thinking everything is broken.
OK, so this is expected while one drive is unplugged. It’s possible I was too hasty taking it out.
After ten minutes of not booting and a screen full of ada3 messages, I thought it was stuck forever and removed the drive.
```
root@cybertron[~]# zpool status -v
  pool: Nova
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 5.95M in 00:00:01 with 0 errors on Wed Nov 27 10:20:28 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        Nova                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/887a3199-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            gptid/8893053b-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            13714867910798328405                        UNAVAIL      0     0     0  was /dev/gptid/88a98f95-13ad-11ef-8bb3-38d547750fc5
            gptid/88a4e176-13ad-11ef-8bb3-38d547750fc5  ONLINE       2     0     0
            gptid/888476a2-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            gptid/888de0d2-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Fri Nov 22 03:45:02 2024
config:
```
I’m surprised and slightly concerned that you had one drive fail and another giving errors so close together. I’m also surprised that you’re unable to boot while the originally degraded drive is connected.
If you want to be VERY cautious:
1. Power off the NAS and leave it offline until you get a replacement drive.
2. Once you have the replacement, connect it, power on the NAS, find the dead drive in the GUI, and replace it with the new one.
3. Wait for the resilver to finish.
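The GUI replace is the recommended path on TrueNAS (it handles partitioning for you), but for reference, a hedged sketch of the equivalent shell steps, using the numeric GUID of the missing member from your `zpool status` output and assuming the new disk shows up as `/dev/ada3` (a placeholder; check `camcontrol devlist` for the real name):

```shell
# Confirm the GUID of the UNAVAIL member before touching anything
zpool status -v Nova

# Replace the missing member with the new disk and kick off a resilver.
# Note: doing this from the CLI skips the swap/partition layout the
# TrueNAS GUI normally creates, which is why the GUI is preferred.
zpool replace Nova 13714867910798328405 /dev/ada3

# Re-check periodically; the pool returns to ONLINE once resilver completes
zpool status -v Nova
```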
Since you’ve got two-drive redundancy you’ve not lost any data yet, but two drives back to back ain’t fun. One more failure and you’re past the danger zone and straight into data loss.
If you’ve got a second system with spare SATA ports, it might be worth connecting the original dead drive and running some `smartctl -t long` tests on it to see whether it’d be accepted for RMA.
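Roughly like this, with `/dev/ada0` as a placeholder for wherever the drive appears in the second system:

```shell
# Start an extended (long) self-test; it runs in the drive's firmware
# in the background, and smartctl prints an estimated completion time
smartctl -t long /dev/ada0

# After that time has passed, read back the results: check the
# "SMART Self-test log" section and the overall health assessment
smartctl -a /dev/ada0
```

A failed long test (or reallocated/pending sector counts in the attributes) is typically the kind of evidence an RMA process asks for.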
Don’t remove the second drive that’s showing errors. If any failed drives are still visible to the system and aren’t causing additional issues, leave them connected while replacing drives.
(It might be worth gathering all the requested info for the others before shutting down; that’s pretty minimal risk.)