Looking for help with a pool, do I just replace the device?

roger_bennett · November 24, 2024, 10:14am

I’m a domestic user and not an expert.

I’m using TrueNAS CORE on an Intel i5 with 32GB of RAM and an NVMe boot device.
There is one pool containing six 3.5" 8TB SATA drives in a RAIDZ2 configuration.
I’m just using an SMB share to access the files from a few devices on the local network.

I had a problem reading files, so I checked the TrueNAS dashboard in the web interface and found this in the alerts:

Pool Nova state is ONLINE: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected.
2024-11-24 00:00:41 (America/Los_Angeles)

I feel like that means I need to replace one of my drives. They are only 7 months old so I may be able to RMA.

Is replacing a drive in TrueNAS easy to do? It wasn’t in the tutorials, but I’m guessing this is something people do, so is there a good guide?

SmallBarky · November 24, 2024, 1:05pm

Please tag initial post with CORE

Documentation, yes an easy task.

etorix · November 24, 2024, 1:47pm

Before you do that, you need to ascertain that there are drives to be replaced in the first place. Please post hardware details and the output of
zpool status -v
camcontrol devlist
smartctl -a /dev/adaN (or daN, for all relevant values of N)
as text, nicely placed between triple backticks ``` for readability.

roger_bennett · November 27, 2024, 7:44pm

OK. Thanks I think.

The problem has gotten worse; I’m starting to worry that it’s pretty broken.

I couldn’t connect to the web interface.
With a monitor plugged in, I was getting some sort of ada3: problem. Disconnecting the third drive made the problem go away and it booted normally, but it told me the pool was degraded, which I guess I expected because I’d lost one drive from a RAIDZ2.
I shut it down to deal with when I could make time.

Today I’m back and cannot connect to the web interface. With a monitor connected, there are some ada2 errors and a whole load of metaslab.c:2457:metaslab_load_impl() messages, which frankly scares me.

My plan was to learn how to replace a drive and then do that.

Losing ada2 immediately after ada3 makes me worry I’ve just lost data. I don’t know how a 4+2 array works, but that feels bad.

I don’t think I can type any commands; I just get those errors on loop.

SmallBarky · November 27, 2024, 7:52pm

Put ‘sudo’ before the commands, like

sudo zpool status -v

Please copy and post the results using Preformatted text (Ctrl+e). Looks like </> on toolbar above where you reply and comment

EDIT - just noticed you are on CORE, not SCALE. no SUDO necessary for you.

roger_bennett · November 27, 2024, 7:59pm

Adding sudo still has no noticeable effect.
I’m not rightly sure I can copy and paste anything; I’ve attached a keyboard and mouse to the machine.

I can’t connect remotely right now - just waiting for it to be pingable or for the web interface to work. Best I could manage is taking a picture with my phone I think.

roger_bennett · November 27, 2024, 8:02pm

Scratch that, it’s just starting interfaces. I may be able to connect soon.
It took 10-15 minutes longer than normal to boot this far. I was assuming it was stuck forever, but maybe I needed to be more patient.

SmallBarky · November 27, 2024, 8:04pm

can you give us details on your hardware? Motherboard model, how the disks are physically attached? Directly to the motherboard?

roger_bennett · November 27, 2024, 8:09pm

Yeap sorry. I’ll get the details now.

Intel i5-7600K
CORSAIR Vengeance LPX 32GB DDR4 2133
Motherboard ASUS P10S WS

The motherboard has eight SATA ports, so I decided to use six of them to connect my drives in a 4+2 RAIDZ2 config.

It’s been working with no known problems for about seven months.

roger_bennett · November 27, 2024, 8:11pm

sudo zpool status -v 
Sorry,  user root is not allowed to execute '/usr/local/bin/zpool status -v' as root on cybertron.local

SmallBarky · November 27, 2024, 8:14pm

Leave off the sudo part. That was a mistake; I thought you were on SCALE.

roger_bennett · November 27, 2024, 8:16pm

Sorry, I should have specified.
It’s booted and I can connect to the web interface now, which means I can launch a shell from there and copy/paste. It also means I can stand down one DEFCON level from thinking everything is broken.

Core
TrueNAS-13.0-U6.1

roger_bennett · November 27, 2024, 8:19pm

OK, so this is expected while one drive is unplugged. It’s possible I was too hasty taking it out.
After ten minutes of not booting and a screen of ada3 messages, I thought it was stuck forever and removed it.



root@cybertron[~]# zpool status -v
  pool: Nova
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-2Q
  scan: resilvered 5.95M in 00:00:01 with 0 errors on Wed Nov 27 10:20:28 2024
config:

        NAME                                            STATE     READ WRITE CKSUM
        Nova                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/887a3199-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            gptid/8893053b-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            13714867910798328405                        UNAVAIL      0     0     0  was /dev/gptid/88a98f95-13ad-11ef-8bb3-38d547750fc5
            gptid/88a4e176-13ad-11ef-8bb3-38d547750fc5  ONLINE       2     0     0
            gptid/888476a2-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            gptid/888de0d2-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:02 with 0 errors on Fri Nov 22 03:45:02 2024
config:
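
For anyone following along, the config table in output like the Nova section above can be scanned mechanically. This is a minimal Python sketch, not a TrueNAS tool: the function name flag_vdevs and the SAMPLE text (a few rows copied from the Nova output) are illustrative assumptions. It lists every row whose state is not ONLINE or whose READ/WRITE/CKSUM counters are non-zero.

```python
def flag_vdevs(status_text):
    """Return (name, state, read, write, cksum) for each config row that is
    not ONLINE or that has a non-zero error counter."""
    flagged = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "NAME":        # header row of the config table
            in_config = True
            continue
        if fields[0] == "errors:":     # the table ends before this line
            in_config = False
            continue
        if not in_config or len(fields) < 5:
            continue
        name, state = fields[0], fields[1]
        try:
            read, write, cksum = (int(x) for x in fields[2:5])
        except ValueError:
            # Large counters may be abbreviated (e.g. 1.2K); this sketch skips them.
            continue
        if state != "ONLINE" or read or write or cksum:
            flagged.append((name, state, read, write, cksum))
    return flagged


# A few rows copied from the Nova output in this thread (illustrative sample).
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        Nova                                            DEGRADED     0     0     0
          raidz2-0                                      DEGRADED     0     0     0
            gptid/887a3199-13ad-11ef-8bb3-38d547750fc5  ONLINE       0     0     0
            13714867910798328405                        UNAVAIL      0     0     0  was /dev/gptid/88a98f95-13ad-11ef-8bb3-38d547750fc5
            gptid/88a4e176-13ad-11ef-8bb3-38d547750fc5  ONLINE       2     0     0

errors: No known data errors
"""

for row in flag_vdevs(SAMPLE):
    print(row)
```

On this pool it points at the UNAVAIL member and at the gptid/88a4e176 member with 2 read errors, alongside the DEGRADED pool and raidz2-0 rows.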

SmallBarky · November 27, 2024, 8:22pm

Try doing the other part

camcontrol devlist and
smartctl -a /dev/adaN (or daN, for all relevant values of N)

Fleshmauler · November 27, 2024, 8:23pm

I’m surprised & slightly concerned that you had 1 drive fail & another start giving errors so close to each other. I’m also surprised that you’re unable to boot while the originally degraded drive is connected.

If you want to be VERY cautious:

Power off the NAS & leave it offline until you get a replacement drive
Once you have the replacement, connect it, power on the NAS, find the dead drive in the GUI & replace it with the new one
Wait for the resilver to finish

Since you’ve got 2-drive redundancy, you’ve not lost any data yet, but 2 drives back to back ain’t fun. 1 more failure and you’re out of the danger zone & straight to data loss.

If you’ve got a second system with spare SATA ports, it might be worth connecting the original dead drive & running some smartctl -t long tests on it to see if it’d be accepted for RMA.

Don’t remove the second drive that has errors. If any failed drives are still visible to the system & not causing additional issues, leave them connected while replacing drives.

(might be worth gathering all the requested info from others before shutting down, that is pretty minimal risk)
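
The redundancy arithmetic above, and the 4+2 layout asked about earlier in the thread, can be sketched in a few lines of Python. This is only an illustration: raidz_summary is a made-up helper, and the usable figure ignores ZFS metadata, padding, and the TB vs TiB difference.

```python
def raidz_summary(total_drives, parity, drive_tb, missing=0):
    """Rough RAIDZ arithmetic for a single vdev (illustrative only)."""
    data_drives = total_drives - parity
    return {
        "data_drives": data_drives,
        # Approximate: real usable space is lower after ZFS overhead.
        "approx_usable_tb": data_drives * drive_tb,
        # How many more members can drop out before data becomes unreadable.
        "failures_still_tolerated": max(parity - missing, 0),
        "pool_readable": missing <= parity,
    }


# The pool in this thread: six 8TB drives in RAIDZ2 (4 data + 2 parity).
for missing in range(4):
    print(missing, raidz_summary(6, 2, 8, missing=missing))
```

With one member unavailable, the vdev can lose one more drive and still be readable. A drive that is ONLINE but returning read errors still counts as present, though it is a warning sign.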

etorix · November 27, 2024, 8:28pm

That’s the best part. Get a new drive, add it and you’re fully back in business.
(Have a look at SMART reports though.)

roger_bennett · November 27, 2024, 8:29pm


root@cybertron[~]# camcontrol devlist
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus3 target 0 lun 0 (ada2,pass2)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus5 target 0 lun 0 (ada3,pass3)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus6 target 0 lun 0 (ada4,pass4)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass5)
root@cybertron[~]#

roger_bennett · November 27, 2024, 8:30pm

Not 100% sure I understand that. Five out of six data devices and … something else?

Fleshmauler · November 27, 2024, 8:33pm

If I’m not mistaken, <AHCI SGPIO Enclosure 2.00 0001> is the enclosure-management device (ses0) exposed by the motherboard’s AHCI controller, not a disk. In short, nothing scary - no monsters here.
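
The devlist output above can also be checked with a short script. The following Python sketch is an assumption-laden illustration (parse_devlist, disks, and the regular expression are not part of any TrueNAS or FreeBSD tool); it separates disk entries from other devices such as ses0 and compares the disk count with the six drives the pool expects.

```python
import re

# One camcontrol devlist row: <model> at scbusN target N lun N (names)
DEVLIST_LINE = re.compile(
    r"^<(?P<model>.+?)>\s+at scbus(?P<bus>\d+) target \d+ lun \d+ \((?P<names>[^)]+)\)"
)


def parse_devlist(text):
    """Return a list of (device_name, model, scbus) tuples."""
    devices = []
    for line in text.splitlines():
        m = DEVLIST_LINE.match(line.strip())
        if not m:
            continue
        names = m.group("names").split(",")
        # Each row lists a passN passthrough name plus the driver name (ada0, ses0, ...).
        primary = next((n for n in names if not n.startswith("pass")), names[0])
        devices.append((primary, m.group("model").strip(), int(m.group("bus"))))
    return devices


def disks(devices):
    """Keep only adaN / daN entries, i.e. the actual disks."""
    return [name for name, _, _ in devices if re.fullmatch(r"a?da\d+", name)]


# Rows copied from the output posted in this thread.
SAMPLE = """\
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus0 target 0 lun 0 (ada0,pass0)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus1 target 0 lun 0 (ada1,pass1)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus3 target 0 lun 0 (ada2,pass2)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus5 target 0 lun 0 (ada3,pass3)
<WDC WD80EAAZ-00BXBB0 01.01A01>    at scbus6 target 0 lun 0 (ada4,pass4)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass5)
"""

EXPECTED_DISKS = 6  # six drives in the pool, per the first post

found = disks(parse_devlist(SAMPLE))
print(found, "missing:", EXPECTED_DISKS - len(found))
```

Note that adaN numbers are assigned at boot and can shift when a disk is removed, so it is probably safer to match drives by serial number (from smartctl) than by device name.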

roger_bennett · November 27, 2024, 8:34pm

root@cybertron[~]# smartctl -a /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD80EAAZ-00BXBB0
Serial Number:    WD-RD039N8E
LU WWN Device Id: 5 0014ee 2c0c26abe
Firmware Version: 01.01A01
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5640 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Nov 27 12:31:10 2024 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (12464) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 801) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x0031) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   206   190   021    Pre-fail  Always       -       6683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       71
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1966
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       71
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       21
193 Load_Cycle_Count        0x0032   179   179   000    Old_age   Always       -       63937
194 Temperature_Celsius     0x0022   120   108   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
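
When reading reports like the one above across several drives, it can help to pull out just the counters that usually matter. The Python sketch below is a rough illustration: parse_attributes, nonzero_watched, and the WATCHED list are assumptions based on a common rule of thumb (reallocated, pending, and uncorrectable sectors, plus CRC errors that often point at cabling), not a vendor specification. SAMPLE is a subset of the ada0 table from this thread.

```python
# Attribute IDs commonly watched; the selection is a rule of thumb, not a spec.
WATCHED = {
    5: "reallocated sectors",
    197: "pending sectors",
    198: "offline uncorrectable sectors",
    199: "UDMA CRC errors (often cabling or connectors)",
}


def parse_attributes(text):
    """Map attribute ID -> (name, raw_value_string) from a smartctl -a attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have a 0x... flag in column 3.
        if len(fields) < 10 or not fields[0].isdigit() or not fields[2].startswith("0x"):
            continue
        attrs[int(fields[0])] = (fields[1], fields[9])
    return attrs


def nonzero_watched(attrs):
    """Return (id, name, raw) for watched attributes whose raw value is non-zero."""
    hits = []
    for attr_id in sorted(WATCHED):
        if attr_id in attrs:
            name, raw = attrs[attr_id]
            if raw.isdigit() and int(raw) != 0:
                hits.append((attr_id, name, int(raw)))
    return hits


# Subset of the ada0 attribute table posted above.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1966
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

print(nonzero_watched(parse_attributes(SAMPLE)))
```

For ada0 none of the watched counters are raised, which matches the PASSED result and the empty error log. The same check would need to be run against the output for each remaining drive, since this report covers only one of them.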