I am using HexOS, but underneath it is TrueNAS SCALE 25.10.1 (Goldeye).
I recently had an HDD fail and replaced it with a new one. Then, after 3 weeks, I started having issues with this new WD Red Plus drive. I tried another drive and that had the same problem instantly (clicking, checksum errors, etc.).
I eventually replaced the SATA cable, which fixed the issue, but I’m now unable to get the original replacement drive (the WD Red Plus) to become part of the RAID 1.
When I look at the VDEVs, I can see the Mirror VDEV with the drive that has worked the whole time, and a Replacing VDEV containing an unknown device and the WD Red Plus drive. When I first put it in, it was resilvering. I’m not sure whether that completed, but it’s kind of just stuck there, and I would absolutely prefer having some drive protection in case the other one also fails.
Could someone please help me figure out how to fix this? I really don’t wanna have to restart from scratch!
First off, we like to see the output of the following command, to clearly show how the pool is laid out and its status (in CODE tags, please):
zpool status
Next, we need to see your disks, (also in CODE tags):
lsblk -o NAME,MODEL,SIZE,LABEL,TYPE
Last, few of us here use HexOS, so any recommendations will be either through the TrueNAS GUI (which is supposed to be available to HexOS users) or via the Linux command line shell (not the TrueNAS CLI).
CODE tags means using Preformatted Text mode: the (</>) button on the reply toolbar, or Ctrl+E. It makes items easier to read. Arwen used it for the zpool and lsblk commands above.
zpool status: (I started a scrub to see if that would kick off anything)
  pool: HDDs
 state: DEGRADED
  scan: scrub in progress since Wed Jan 21 13:14:18 2026
        1.07T / 2.61T scanned at 338M/s, 188G / 2.61T issued at 57.7M/s
        896K repaired, 7.03% done, 12:14:23 to go
config:
        NAME                                        STATE     READ WRITE CKSUM
        HDDs                                        DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            238dfdfa-0a7c-40d9-90ca-3569714c58b8    ONLINE       0     0     0
            replacing-1                             DEGRADED    45     0     0
              8695138f-e042-4dab-bc81-62f97b6dbbb6  ONLINE       0     0    48  (repairing)
              2435261468018471615                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/a88dafb9-ab3c-4f60-92a8-7a28734131d9
errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Jan 17 03:47:54 2026
config:
        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          sdc3       ONLINE       0     0     0
errors: No known data errors
Disks:
NAME   MODEL                  SIZE   LABEL     TYPE
sda    WDC WD40EFPX-68C6CN0   3.6T             disk
└─sda1                        3.6T   HDDs      part
sdb    ST4000DM004-2U9104     3.6T             disk
└─sdb1                        3.6T   HDDs      part
sdc    KINGSTON SA400S37240G  223.6G           disk
├─sdc1                        1M               part
├─sdc2                        512M   EFI       part
└─sdc3                        223.1G boot-pool part
sdd    SD/MMC                 0B               disk
sde    Compact Flash          0B               disk
sdf    SM/xD-Picture          0B               disk
sdg    MS/MS-Pro              0B               disk
zd0                           100G             disk
Thanks for the help. I’ve gotten familiar with TrueNAS SCALE’s GUI (barely touched the CLI though).
In general, you don’t want to run a scrub while the pool is replacing a disk. With a single, simple Mirror vDev, a scrub during the replacement of a failed disk is somewhat useless / redundant, since the resilver already reads and verifies the data on the surviving disk.
But, since it is running, let it continue and check back in a few hours with another zpool status.
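For checking back on progress without reading the whole output each time, here is a minimal sketch that pulls the percentage and the time remaining out of zpool status text. The function name scrub_progress is made up for this example, and the sample text is copied from the output posted above; it assumes the "% done ... to go" line keeps that wording.

```shell
#!/bin/sh
# Hypothetical helper: reads zpool status text on stdin and prints
# the "% done" value and the estimated time remaining.
scrub_progress() {
    awk '
        / done, / && / to go/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /%$/)  pct = $i        # e.g. 7.03%
                if ($i == "to") eta = $(i - 1)  # field just before "to go"
            }
            print pct, eta
        }
    '
}

# Sample lines copied from the status output earlier in this thread.
sample='scan: scrub in progress since Wed Jan 21 13:14:18 2026
1.07T / 2.61T scanned at 338M/s, 188G / 2.61T issued at 57.7M/s
896K repaired, 7.03% done, 12:14:23 to go'

printf '%s\n' "$sample" | scrub_progress
```

On the server itself this would presumably be used as: zpool status HDDs | scrub_progress. It only reads output, so it stays in the "view only" category.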
Now, as for the status: your server only shows 2 x 3.6TB (aka 4TB) disks, which suggests you removed the failing / failed disk.
Based on this output, your pool is fine at the moment. The first Mirror disk has no errors at the ZFS level.
PS: There is a difference between normal Linux shell command line, and the TrueNAS API command line, (sometimes referred to as CLI). The TrueNAS API / CLI manages TrueNAS, and the Linux shell can manage / view everything else.
You do not want to make changes via the Linux shell that affect something TrueNAS normally manages. That can confuse the TrueNAS API… However, the commands that I’ve listed are “view”-only commands.
As for the resilvering, last I saw (last night) it was scanning, and when I woke up I couldn’t see it running. Could it just be that it was no longer visible in the GUI, or could something have gone wrong with the resilvering process?
  pool: HDDs
 state: DEGRADED
  scan: scrub repaired 5.37M in 17:33:04 with 0 errors on Thu Jan 22 06:47:22 2026
config:
        NAME                                        STATE     READ WRITE CKSUM
        HDDs                                        DEGRADED     0     0     0
          mirror-0                                  DEGRADED     0     0     0
            238dfdfa-0a7c-40d9-90ca-3569714c58b8    ONLINE       0     0     0
            replacing-1                             DEGRADED    45     0     0
              8695138f-e042-4dab-bc81-62f97b6dbbb6  ONLINE       0     0    86
              2435261468018471615                   UNAVAIL      0     0     0  was /dev/disk/by-partuuid/a88dafb9-ab3c-4f60-92a8-7a28734131d9
errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Jan 17 03:47:54 2026
config:
        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          sdc3       ONLINE       0     0     0
errors: No known data errors
I just noticed that the checksum errors increased by about 40 (from 48 to 86).
I haven’t heard any problematic sounds from the HDD anymore. Could there still be a problem with the drive, or is it fine-ish?
Also, could that be why it’s not resilvering at the moment?
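To compare the CKSUM counter between two zpool status outputs without eyeballing them, something like the following might help. It is only a sketch: cksum_of is a made-up helper name, the two sample rows are copied from the outputs in this thread, and it assumes the counters are plain integers (zpool status abbreviates large values, so zpool status -p, which prints exact numbers, would be the safer input).

```shell
#!/bin/sh
# Hypothetical helper: reads zpool status text on stdin and prints the
# CKSUM counter (5th column) of the row whose name matches $1.
cksum_of() {
    awk -v dev="$1" '$1 == dev { print $5; exit }'
}

# Device name and rows copied from the two status outputs in this thread.
dev=8695138f-e042-4dab-bc81-62f97b6dbbb6
before='8695138f-e042-4dab-bc81-62f97b6dbbb6 ONLINE 0 0 48 (repairing)'
after='8695138f-e042-4dab-bc81-62f97b6dbbb6 ONLINE 0 0 86'

old=$(printf '%s\n' "$before" | cksum_of "$dev")
new=$(printf '%s\n' "$after" | cksum_of "$dev")
echo "CKSUM went from $old to $new (+$((new - old)))"
```

With the two outputs posted here, the counter moved from 48 to 86, i.e. 38 new checksum errors during the scrub.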
Some additional hardware details may be of use: how is this hard drive connected, motherboard make/model, etc.
If you heard weird noises, it could be that a head or platter was already damaged. We’ve got rust spinning several thousand times per minute while metal needles read/write 0s & 1s without touching the magic spinning rust. Insane that it works even in the best of conditions…
The likely reason the resilver has stopped is that the new replacement drive developed too many errors. People here tend to attribute “CKSUM” errors to cables or the disk controller used by the drive. Proof of that stall is likely in the logs somewhere (I don’t remember where).
If you have good backups, I would use something like this to restart the resilver:
1. Offline the replacement drive
2. Shut down and replace the data cable to the replacement drive
3. Power up
4. Using zpool status, make a note of the errors
5. Clear the pool errors
6. Online the replacement drive; this should restart the resilver
7. Monitor the pool for new errors
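As a rough sketch only, the steps above might map onto the zpool commands below. The function name restart_resilver_plan is hypothetical, and it deliberately prints the commands instead of running them. The pool and device names are taken from the zpool status output in this thread. Offline/online are normally things TrueNAS manages itself, so the GUI may be the better place for those steps, and whether onlining the drive actually restarts the resilver is not confirmed here.

```shell
#!/bin/sh
# Hypothetical helper: prints (does NOT run) the zpool commands that
# roughly correspond to the steps listed above.
# $1 = pool name, $2 = device name as shown in zpool status.
restart_resilver_plan() {
    pool=$1
    dev=$2
    echo "zpool offline $pool $dev"      # offline the replacement drive
    echo "# shut down, replace the data cable, power up"
    echo "zpool status -v $pool"         # note the current error counters
    echo "zpool clear $pool"             # clear the pool errors
    echo "zpool online $pool $dev"       # bring the drive back online
    echo "zpool status -v $pool"         # monitor for new errors
}

restart_resilver_plan HDDs 8695138f-e042-4dab-bc81-62f97b6dbbb6
```

Printing first makes it easy to review each line before typing anything. The commands that change pool state (offline, clear, online) need root privileges when run from the shell.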
Now, I can’t be certain this is the correct method to “fix” your problem. There are lots of variables here, including your disk controller, power supply, HexOS, and a potential lack of skill to implement the above procedure. (Lack of skill in this context means not knowing when to stop and figure out why if you get a warning or error, and instead continuing and potentially making things worse.)
When trying to run the zpool clear command I get the following error:
cannot clear errors for HDDs: permission denied
Should I run it as root, or not risk it?
We have to be careful. HexOS users may have a special skill set. You know, those users that IT departments talk about who can crash anything. The PEBKAC. (Problem Exists Between Keyboard and Chair)
@Stressed_Out09 You are doing very well with running things and replying with info.
Just new to self-hosting. I’m just worried about an 1D10T issue being caused by the chair warmer himself lol.
Give me some enterprise servers and I’ll figure it out. (I also have plenty of backups and test deployments to muck around with.) Though, as you can probably see with my homelab server, it’s not very expensive…
Though HexOS did get me into TrueNAS, which I’m now using so much more than HexOS’s dashboard.