How do I manually fail a drive?

My secondary server started a scrub on Saturday, and a couple of hours in, sdc seems to have kicked the bucket. Unfortunately, because these drives sit behind a RAID controller, the drive is not being failed automatically and is just stuck at 100% utilization. (Yes, I know I’m not supposed to do it this way, but I couldn’t figure out another way with this old junky Dell server.)

The issue I have now is the storage area of the UI won’t load because things are waiting forever on this failed disk. How do I manually fail a disk through the CLI? Bonus round, I know /dev/sdc is bad, but zpool status -v shows GUIDs. How do I convert sdc into the necessary GUID for whatever command I need to run?

Take a look at the RAID controller software, probably by rebooting.

I want to fail it before I reboot it if at all possible.

OK, I rebooted. The drive came back during boot, but it has now failed again and I’m stuck in the same spot. How do I identify this drive so I can detach it?

EDIT: OK, the answer was ls -alh on /dev/disk/by-id. Now I just need a way to fail this drive out.
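For anyone who lands here later, the by-id lookup can be scripted instead of eyeballing the ls output. This is only a rough sketch: links_for_device is a name I made up, and it assumes the entries are ordinary symlinks pointing at names like ../../sdc, which is what ls -alh shows on a typical Linux box. The directory is a parameter, so it can be pointed at another directory under /dev/disk if that is where your pool's names come from.

```shell
# Hypothetical helper: print every name in a /dev/disk/* directory whose
# symlink points at the given kernel device (for example sdc).
# Usage: links_for_device sdc [directory]
links_for_device() {
    dev="$1"
    dir="${2:-/dev/disk/by-id}"
    for link in "$dir"/*; do
        # Skip anything that is not a symlink.
        [ -L "$link" ] || continue
        target=$(readlink "$link")
        # Compare only the last path component, so "sdc" matches ../../sdc
        # but not the partition link ../../sdc2.
        if [ "${target##*/}" = "$dev" ]; then
            printf '%s\n' "${link##*/}"
        fi
    done
}
```

Calling links_for_device sdc lists the whole-disk names; pass sdc2 instead to get the partition entry.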

My funny bone says to use a hammer, that will fail it for certain.

If you know which drive it is, why not do as previously suggested and use the RAID controller software (typically accessed while the system is powering on) to remove the drive from the RAID? You can’t do that from TrueNAS or Debian if you are using a RAID controller. It is hardware based, not software based like ZFS.

Simple: someone would have to physically go touch the machine to do that. The IPMI is too old, with too many vulnerabilities, for us to connect to it. I have spare drives in the machine, so if I can detach this drive I can easily replace it. It’s just very inconvenient to physically access the machine.

Well, tough. You are using TrueNAS with a RAID controller (not recommended). There is no software on TN that can “control” the RAID controller, so the only choice you have is to use IPMI / iLO / whatever, OR physically work with the machine.

The RAID controller will be presenting a bunch of drives to TN as a single virtual disk. TN only sees the virtual disk.

Actually - how are you presenting the disks? RAID 0/1/5/6 or JBOD or a bunch of single drive RAID 0?

I’m not silly, I am presenting them as a bunch of single RAID 0s at least. :wink:

Would have done JBOD if I could, but it did not appear to be an option. :upside_down_face:

I know this is not a good way to do things, but my work has silly purchasing restrictions and buying a used card on eBay is just not an option for me. Have to make do with what’s available or buy new and there’s no budget for new.

So right now I need to get from /dev/disk/by-id/scsi-36782bcb01f8af00020d4690c07d709f5-part2 to a serial number or at least a slot number my coworker can yank. Any suggestions on that?

What does the GUI Storage/Disks show you? Is there a serial number in there?

Then do Storage/Manage Devices on the pool in question.

Can you correlate anything with those two pages?

OK, I was able to reboot, which seemed to give me a few minutes to stop the scrub it kept trying to run and detach the bad drive from the pool. Got one of the spares loaded up and resilvering.

Now I can actually load those pages and maybe figure out the serial number business.

OK, there’s nothing on there that looks like anything related to the serial number, but I figured out how to get the slot numbers.

root@freenas1[/dev/disk/by-id]# lsscsi
[0:2:0:0]    disk    DELL     PERC H700        2.10  /dev/sdb
[0:2:1:0]    disk    DELL     PERC H700        2.10  /dev/sda
[0:2:2:0]    disk    DELL     PERC H700        2.10  /dev/sde
[0:2:3:0]    disk    DELL     PERC H700        2.10  /dev/sdc
[0:2:4:0]    disk    DELL     PERC H700        2.10  /dev/sdd
[0:2:5:0]    disk    DELL     PERC H700        2.10  /dev/sdf
[0:2:6:0]    disk    DELL     PERC H700        2.10  /dev/sdg
[0:2:7:0]    disk    DELL     PERC H700        2.10  /dev/sdh
[0:2:8:0]    disk    DELL     PERC H700        2.10  /dev/sdi
[0:2:9:0]    disk    DELL     PERC H700        2.10  /dev/sdj
[0:2:10:0]   disk    DELL     PERC H700        2.10  /dev/sdk
[0:2:11:0]   disk    DELL     PERC H700        2.10  /dev/sdl
[0:2:12:0]   disk    DELL     PERC H700        2.10  /dev/sdm

The third number in the bracketed address on the left (the 3 in [0:2:3:0] for /dev/sdc) tells me which slot on the server it is. Then I can run smartctl against that slot to get the drive's info, slot 03 in this case.

root@freenas1[/dev/disk/by-id]# smartctl -a /dev -d megaraid,03
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.12.15-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE4
Device Model:     WDC WD2003FYYS-18W0B0
Serial Number:    WD-WMAY02031535
LU WWN Device Id: 5 0014ee 60114615e
Add. Product Id:  DELL(tm)
Firmware Version: 01.01D02
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database 7.3/5770
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Tue Jun 24 15:19:04 2025 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
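The lsscsi-to-slot step above can also be scripted. A minimal sketch: scsi_target_for is a made-up name, it reads lsscsi output on stdin, and it pulls the third field out of the bracketed [host:channel:target:lun] address. Treating that target number as the megaraid device ID is an assumption that happened to hold on this PERC H700 with single-drive RAID 0s; it may not hold on other controllers or layouts.

```shell
# Hypothetical helper: read lsscsi output on stdin and print the SCSI target
# number (third field of [host:channel:target:lun]) for one device.
# Usage: lsscsi | scsi_target_for sdc
scsi_target_for() {
    awk -v dev="/dev/$1" '
        $NF == dev {
            addr = $1
            gsub(/^\[|\]$/, "", addr)   # strip the surrounding brackets
            split(addr, parts, ":")     # host, channel, target, lun
            print parts[3]
        }'
}

# Example use (assumes the target number is also the megaraid device ID):
#   slot=$(lsscsi | scsi_target_for sdc)
#   smartctl -a -d "megaraid,$slot" /dev/sdc
```

The serial number in the smartctl output is what to hand to whoever is pulling the drive.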

zpool-offline is probably your friend:

ZPOOL-OFFLINE(8)          BSD System Manager's Manual          ZPOOL-OFFLINE(8)

NAME
     zpool-offline — take physical devices offline in ZFS storage pool

SYNOPSIS
     zpool offline [--power|[-ft]] pool device…
     zpool online [--power] [-e] pool device…

DESCRIPTION
     zpool offline [--power|[-ft]] pool device…
             Takes the specified physical device offline.  While the device is offline, no attempt is made to read or write to the device.  This command is not applicable to spares.

             --power
                     Power off the device's slot in the storage enclosure.  This flag currently works on Linux only

             -f      Force fault.  Instead of offlining the disk, put it into a faulted state.  The fault will persist across imports unless the -t flag was specified.

             -t      Temporary.  Upon reboot, the specified physical device reverts to its previous state.

     zpool online [--power] [-e] pool device…
             Brings the specified physical device online.  This command is not applicable to spares.

             --power
                     Power on the device's slot in the storage enclosure and wait for the device to show up before attempting to online it.  Alternatively, you can set the ZPOOL_AUTO_POWER_ON_SLOT
                     environment variable to always enable this behavior.  This flag currently works on Linux only

             -e      Expand the device to use all available space.  If the device is part of a mirror or raidz then all devices must be expanded before the new space will become available to the pool.

SEE ALSO
     zpool-detach(8), zpool-remove(8), zpool-reopen(8), zpool-resilver(8)

OpenZFS          August 9, 2019          OpenZFS
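Going by the man page above, manually faulting a disk would look roughly like the sketch below. The pool name and GUID in the comments are placeholders, not values from this system, and fault_vdev is a made-up wrapper that only prints the command unless APPLY=1 is set, since pointing this at the wrong vdev on a degraded pool would be painful. zpool status -g should list vdev GUIDs in place of device names, which gives you something to pass as the device argument.

```shell
# Hypothetical wrapper around "zpool offline -f" (force fault, per the man page).
# Prints the command by default; set APPLY=1 to actually run it.
# Usage: fault_vdev <pool> <device-or-guid>
fault_vdev() {
    pool="$1"
    vdev="$2"
    set -- zpool offline -f "$pool" "$vdev"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Example with placeholder values ("tank" and the GUID are not from this thread):
#   zpool status -g tank                 # find the GUID of the bad vdev
#   fault_vdev tank 1234567890           # dry run, prints the command
#   APPLY=1 fault_vdev tank 1234567890   # really fault it
```

Per the same man page, adding -t would make the state temporary so it reverts on reboot, and none of this applies to spares.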