I agree that there is still stuff to do before this can be closed, but can you be explicit about what you think still needs to be done?
As the next step, can you please run sudo zpool status -v and post the results? If they are clean, you should probably run a scrub.
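For reference, both can be done from the shell if you prefer (substitute your actual pool name, which also appears in the zpool status output):
sudo zpool status -v              # check all pools for errors
sudo zpool scrub <poolname>       # start a scrub if the status is clean
sudo zpool status <poolname>      # check on the scrub's progress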
Also, I note from the SMART results you posted that you are only doing short self-tests, though you are doing them reasonably frequently (probably more frequently than necessary). For HDDs I would personally do a short test once per week and a long test once per month. You can schedule them to run simultaneously on all drives because they are self-contained - but do them at off-peak times, and don't do them at the same time as a scrub.
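The recurring schedule is normally set up in the TrueNAS GUI rather than from the shell, but if you want to kick a test off manually, something like this works (with /dev/sdX standing in for each drive):
sudo smartctl -t short /dev/sdX   # weekly short self-test
sudo smartctl -t long /dev/sdX    # monthly long self-test
sudo smartctl -a /dev/sdX         # review the results once the test has finished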
If you haven’t already done so, you should also implement @joeschmuck’s Multi-Report script.
I've run the script, and after it succeeded I rebooted my server. One drive was not detected or was not working. Also, one of the new drives that I used for resilvering kept getting disconnected, so I used a 6TB one instead, which is connected and working. Any ideas how I can remove the unavailable drive that is still shown as resilvering?
Apart from this, I will follow the SMART test pattern you mentioned and also remove that 4TB Seagate drive with SMR tech. Resilvering is taking ages 😅
Below is the status.
admin@truenas[~]$ sudo zpool status -v
[sudo] password for admin:
pool: WorkersDev04
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Wed Mar 12 05:12:28 2025
3.07T / 6.46T scanned at 745M/s, 857G / 6.46T issued at 203M/s
202G resilvered, 12.97% done, 08:03:29 to go
config:
NAME                                          STATE     READ WRITE CKSUM
WorkersDev04                                  DEGRADED     0     0     0
  raidz2-0                                    DEGRADED     0     0     0
    replacing-0                               DEGRADED     0     0     0
      531c0dfc-8fe3-42d8-812a-d5fc7142b9b6    ONLINE       0     0     0  (resilvering)
      8767114780025994078                     UNAVAIL      0     0     0  was /dev/disk/by-partuuid/53e0ba3b-0801-4485-b08a-df433ae01060
      d17f7ea3-9623-4476-8d24-eb8553064365    ONLINE       0     0     0  (resilvering)
    35aa855c-ffa1-491a-830c-4c867bc5c987      ONLINE       0     0     0  (resilvering)
    sdf1                                      ONLINE       0     0     0
    e4f4dca8-b41e-488f-bd12-92c595276c7d      ONLINE       0     0     0
    994bac5d-dbc6-4bd7-8592-295428c57b45      ONLINE       0     0     0  (resilvering)
errors: Permanent errors have been detected in the following files:
<I've removed the file locations for security reasons. (I don't know if they will come back after resilvering is done.) But I have a backup of those files anyway. (What else could be done, and will it affect other good files as well?)>
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:31 with 0 errors on Mon Mar 10 03:45:32 2025
config:
NAME          STATE     READ WRITE CKSUM
boot-pool     ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    sdb3      ONLINE       0     0     0
    sdh3      ONLINE       0     0     0
errors: No known data errors
Hmmm … I don't quite understand the zpool status output. Comparing it to the previous output of zpool import, what it seems to say is:
Partuuid 35aa855c-ffa1-491a-830c-4c867bc5c987, which was first reported as faulted, is resilvering as expected.
We have a disk shown as 8767114780025994078 ... was /dev/disk/by-partuuid/53e0ba3b-0801-4485-b08a-df433ae01060 that is no longer available. However, partuuid 53e0ba3b-0801-4485-b08a-df433ae01060 wasn't in the zpool import output at all, so I have no idea what it is or where it came from; I suspect it is a ZFS label from a device that had previously been part of the pool but was replaced at some earlier point. I think, or perhaps hope, that this is not going to be any real issue, but I do wonder whether it was a cause of the pool not importing in the first place.
Then we have two partuuids apparently resilvering the same device: 531c0dfc-8fe3-42d8-812a-d5fc7142b9b6, which was an original partuuid, and d17f7ea3-9623-4476-8d24-eb8553064365, which is a new one from somewhere. I am not sure how the same device can be shown as being resilvered twice, so I think we need to wait and see what happens after the resilver has finished.
And we have one other partuuid being resilvered: 994bac5d-dbc6-4bd7-8592-295428c57b45.
So it looks like you have a RAIDZ2 with resilvers running on 3 of the 5 drives, which is NOT a good sign when RAIDZ2 only allows for 2 drives to be lost. But then again, if the ZFS labels are screwed up, perhaps the output of zpool status is also screwed up.
My advice is as follows:
Don't reboot again for the moment unless you absolutely have to, in order to reduce the risk of another pool import failure and to preserve the device name mappings. Whilst the resilver is still running, run the following commands (in some cases again) and post the results now:
sudo lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID to give us the current device name mappings.
sudo zdb -l /dev/sdX for each disk of this pool, and sudo zdb -l /dev/sdXn against each of this pool's ZFS partitions as shown in that lsblk output, to see what the ZFS labels are (see the example below).
Then wait until the resilver completes, run sudo zpool status -v and the same commands again, and post the results. As mentioned, don't do anything else until we have all of the above results and can see how consistent everything is.
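As an illustration of the zdb commands above (using /dev/sda purely as an example - substitute each of your pool's disks and ZFS partitions from the lsblk output, and note that zdb needs the full /dev/... path):
sudo zdb -l /dev/sda      # labels on the whole disk
sudo zdb -l /dev/sda1     # labels on the ZFS partition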
P.S. In future please do NOT use your initiative to “reboot the server” or “use a 6TB drive” unless we advise you to do so, as it may make things worse. I have no idea what the status was after the import worked and before you rebooted, and no idea whether doing these things made anything worse - but if you had a working pool with all files visible immediately after the import and you don’t when the resilvering completes then it is quite possible that doing these two things will have lost you some (or possibly even all) of your data.
Regarding points 1, 2, 3 and 4: I think the data is back, and I can access everything fine after resilvering.
This is the latest zpool status (currently scrubbing)
admin@truenas[~]$ sudo zpool status
pool: WorkersDev04
state: ONLINE
scan: scrub in progress since Fri Mar 14 10:26:35 2025
2.07T / 6.46T scanned at 75.6G/s, 0B / 6.46T issued
0B repaired, 0.00% done, no estimated completion time
config:
NAME                                        STATE     READ WRITE CKSUM
WorkersDev04                                ONLINE       0     0     0
  raidz2-0                                  ONLINE       0     0     0
    d17f7ea3-9623-4476-8d24-eb8553064365    ONLINE       0     0     0
    35aa855c-ffa1-491a-830c-4c867bc5c987    ONLINE       0     0     0
    d37dc96f-bc91-4510-9e48-654e2b37f409    ONLINE       0     0     0
    e4f4dca8-b41e-488f-bd12-92c595276c7d    ONLINE       0     0     0
    994bac5d-dbc6-4bd7-8592-295428c57b45    ONLINE       0     0     0
errors: No known data errors
pool: boot-pool
state: ONLINE
scan: scrub repaired 0B in 00:00:31 with 0 errors on Mon Mar 10 03:45:32 2025
config:
NAME          STATE     READ WRITE CKSUM
boot-pool     ONLINE       0     0     0
  mirror-0    ONLINE       0     0     0
    sdb3      ONLINE       0     0     0
    sdh3      ONLINE       0     0     0
errors: No known data errors
lsblk is showing this.
admin@truenas[~]$ sudo lsblk -bo NAME,MODEL,ROTA,PTTYPE,TYPE,START,SIZE,PARTTYPENAME,PARTUUID
NAME MODEL ROTA PTTYPE TYPE START SIZE PARTTYPENAME PARTUUID
sda ST4000VX007-2DT166 1 gpt disk 4000787030016
└─sda1 1 gpt part 4096 4000784056832 Solaris /usr & Apple ZFS 35aa855c-ffa1-491a-830c-4c867bc5c987
sdb CONSISTENT SSD S6 128GB 0 gpt disk 128035676160
├─sdb1 0 gpt part 40 1048576 BIOS boot ff5d717e-ccd3-40d0-9735-e578907f719a
├─sdb2 0 gpt part 2088 536870912 EFI System 3146caa8-78ac-4e92-9a0c-e7e68ac394b7
├─sdb3 0 gpt part 34605096 110317850112 Solaris /usr & Apple ZFS e018c874-4812-4b3a-9e6b-c1f0e4d198e3
└─sdb4 0 gpt part 1050664 17179869184 Linux swap 37128fac-e8c0-4d8f-9122-b60af51d2220
└─md127 0 raid1 17162043392
└─md127 0 crypt 17162043392
sdc WDC WD40PURZ-85TTDY0 1 gpt disk 4000785948160
└─sdc1 1 gpt part 4096 4000783008256 Solaris /usr & Apple ZFS 994bac5d-dbc6-4bd7-8592-295428c57b45
sdd WDC WD40PURX-64NZ6Y0 1 gpt disk 4000787030016
└─sdd1 1 gpt part 4096 4000784056832 Solaris /usr & Apple ZFS e4f4dca8-b41e-488f-bd12-92c595276c7d
sde WDC WD60PURZ-85ZUFY1 1 gpt disk 6001175126016
└─sde1 1 gpt part 4096 4000785105408 Solaris /usr & Apple ZFS d17f7ea3-9623-4476-8d24-eb8553064365
sdg WDC WD40PURZ-85TTDY0 1 gpt disk 4000787030016
└─sdg1 1 gpt part 4096 4000784056832 Solaris /usr & Apple ZFS d37dc96f-bc91-4510-9e48-654e2b37f409
sdh EVM25/128GB 0 gpt disk 128035676160
├─sdh1 0 gpt part 4096 1048576 BIOS boot 10f746b2-d9b8-4364-8428-52432573852b
├─sdh2 0 gpt part 6144 536870912 EFI System 143e569e-c80e-4088-bf97-316a879dcc44
├─sdh3 0 gpt part 34609152 110315773440 Solaris /usr & Apple ZFS 4ef9db8b-a7ba-4749-ab68-59ddb19d125d
└─sdh4 0 gpt part 1054720 17179869184 Linux swap 9e9e064a-a34c-4e38-8de8-28ca2c078584
└─md127 0 raid1 17162043392
└─md127 0 crypt 17162043392
I tried sudo zdb -l dev/sda (and the other drive letters) but it didn't work.
As for the 6TB drive, it's a quick stopgap for now, but yes, I'll need to use a 4TB one. I found surveillance drives quite cheap over here, so that's what I use - I hope they are fine (please let me know your thoughts). NAS ones are double the price or more compared to the surveillance drives.
Reboots 😅 - I do them quite often as we don't have stable power here. I'm just scared of data loss; what just happened to me was like a nightmare.
I don't know if it's the power supply or something else, but some drives keep disconnecting on their own - not sure if it's the SATA power adapters, the drives themselves, or the power supply.
For the Molex to SATA splitter, I bought it from the PI Plus website (Pi+® (PiPlus®) Molex IDE 4Pin Male to 5 x SATA Power Cable-18AWG).
For the power supply, I'm running a 650W Bronze from Cooler Master - quite an old one.
I've ordered a new one; the model is TUF-GAMING-550B.
Also, I need a case - as of now the drives are all loose, some on the table and some inside the case, and I don't have a proper one. Would a DIY case work, or should I look for a new one? I currently have a case in mind, the prolab ai838 (it can support 10 drives). It's good but way too costly.
Thank you so much Truenas community and @Protopia,
P
Were some of these drives previously used in another storage system? I don't think that there should be md raid partitions on drives used in TrueNAS Scale. If they were previously used, you might want to remove the md raid superblocks. If TrueNAS sees a raid partition on two of the drives in its pool, then it can/will mess things up on boot.
I would have to disagree somewhat, as I believe there is another likely possibility.
The only time I have seen these md raid partitions in Scale is when previously used disks had an old mdadm software raid on them - such as disks used in a different storage system like QNAP, Synology, or a security system, purchased used disks, etc.
I have 2 in-service systems that came up (upgraded/migrated) through all the versions of Scale to current. All the drives do have swap partitions (1024 as partition 1 on each disk), as that was initially how the setup was done in Scale. None of these drives currently have any md raid1 partitions.
Most of these drives did initially have md raid partitions, as I reused disks from a couple of different storage systems. This presented all kinds of issues, such as missing drives after reboot, drives that showed in lsblk but were not visible in Scale, etc.
Some of the affected disks would sometimes show up in the Scale GUI, but I then found that the format option in Scale's disk setup would not clear the superblocks (as a safety measure, I was told), and the disks were still an issue after a reboot even though they had been made members of a vdev and thus of a pool.
The correct permanent solution was to boot into a live Linux environment, remove the superblocks on the affected drives, and then zero the drives out. This removed all traces of the old software raid(s), and Scale could then properly set up the partitions it needs on the drives without getting tripped up by an old raid.
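For anyone finding this later, a rough sketch of that cleanup from a live environment might look like the following - the md device and disk names are placeholders, and this destroys everything on the disk, so only run it against the old disks:
sudo mdadm --stop /dev/mdX                               # stop the auto-assembled array, if present
sudo mdadm --zero-superblock /dev/sdX1                   # remove the old md superblock from the raid partition
sudo wipefs -a /dev/sdX                                  # clear any remaining raid/filesystem signatures
sudo dd if=/dev/zero of=/dev/sdX bs=1M status=progress   # optionally zero the whole drive as well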
My install, currently still on Dragonfish-24.04.2.5, has swap and md127 partitions on all drives, and they have been there from the beginning. I did not add those myself, nor have I tinkered with the swap in any way.