Hi everyone.
I have a TrueNAS Core x64 box with five 2 TB disks in a RAIDZ1 pool.
I only turn the NAS on a couple of times a month, roughly twice every 30 days, when my work PC is full and I have to move projects to the NAS.
What happened: last night I turned it on and something was wrong (slow and intermittent transfers), so I opened its web page and saw that the pool was “DEGRADED”. I opened the pool and saw that the ada3 disk was “REMOVED”. Obviously I panicked; I didn’t know what to do. But I saw at the top left that a scan was in progress, with about 3 hours left to finish, so I let it run.
This morning I opened the NAS web page and everything was online again.
This doesn’t leave me at ease. What can I do to be sure my NAS is OK and properly maintained? Can I schedule some periodic command to make sure everything is always in order?
I read in one of these topics that some people schedule a scrub and a SMART test. Is that right? Or could that stress the NAS and the disks too much? Thanks.
Hi! Yes, of course I can share everything you need to help me, but how do I do that?
hardware specs (is there a specific page that shows this info?)
output of zpool status from the CLI (how do I get this output?)
output of smartctl -a /dev/ada3 (how do I get this output?)
(I assembled my NAS a few years ago following the guide on the TrueNAS site. I’m not a disk or NAS expert, sorry, but I learn quickly and can follow directions.)
What hardware are you using? How are your drives connected? What drives are you using? Home Screen will give you some info about processor and RAM.
You will do this via a shell session. Depending on which version of TrueNAS you are using, you may or may not have a shell option within the UI. If not, you will need to enable SSH and connect that way.
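As a rough sketch of what that looks like once a shell is open: the helper below runs a command and keeps a copy of its output in a file, so it can be pasted into a post afterwards. The name save_output and the file paths are my own choices for illustration, not anything TrueNAS provides.

```shell
#!/bin/sh
# Hypothetical helper: run a command, show its output on screen, and keep
# a copy (stdout and stderr) in the given file.
save_output() {
    outfile=$1
    shift
    "$@" 2>&1 | tee "$outfile"
}

# Example calls (run as root on the NAS; the file names are arbitrary):
#   save_output /tmp/zpool_status.txt zpool status -v
#   save_output /tmp/smart_ada3.txt  smartctl -a /dev/ada3
```

The saved files can then be opened or copied off the NAS and their contents pasted here.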
Yes, but what is the CLI, and how can I get that output? Thanks.
I think I managed to get smartctl -a /dev/ada3:
root@NAS3[~]# smartctl -a /dev/ada3
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate BarraCuda 3.5 (SMR)
Device Model: ST2000DM008-2FR102
Serial Number: WFL2YFDG
LU WWN Device Id: 5 000c50 0cc305c3d
Firmware Version: 0001
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Oct 6 14:19:12 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 199) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x30a5) SCT Status supported.
SCT Data Table supported.
…I have a 100 TB pool (showing off) and scrubbing takes 2 days.
But scrubbing is important, as it reads and checks the data on every HDD, and it will let you know every month (or whenever you schedule it) if an HDD is getting iffy.
If you use that NAS box the way you describe, the built-in system (the thing that makes TrueNAS better than just an external USB drive) cannot check for failing hardware.
Also, store that NAS in a dry place. If you keep it on, it stays moisture-free, but if it is off, humidity can break PCs more than usage does.
I just wanted to bring this up, as it is something to keep in mind.
Yes, I use it this way because I tried many HDD enclosures (Icy Dock etc.) with 4 or more bays, but USB transfer was slower than gigabit LAN, and those USB boxes offered no RAID. Their power management was also terrible for the HDDs: when I powered the boxes off, I remember all the disks making a terrible sound because there was no spin-down, just a hard power cut.
I also tried many NAS operating systems before TrueNAS. This is the one I love for its ease of use and modern UI.
Anyway… today ada3 shows as “REMOVED” again. The NAS has also done a resilver today, but now it seems to be performing a second one. Again? Why?
I can’t run the command on ada3, maybe because it appears as “removed”:
root@NAS3[~]# smartctl -a /dev/ada3
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
/dev/ada3: Unable to detect device type
Please specify device type with the -d option.
Use smartctl -h to get a usage summary
root@NAS3[~]#
This is ada2, for example:
root@NAS3[~]# smartctl -a /dev/ada2
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.14 (AF)
Device Model: ST2000DM001-1ER164
Serial Number: Z4ZCPNTH
LU WWN Device Id: 5 000c50 0b4dbf71e
Firmware Version: CC28
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 7200 rpm
Form Factor: 3.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Mon Oct 7 07:51:46 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 80) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 216) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x1085) SCT Status supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 115 099 006 Pre-fail Always - 85312696
3 Spin_Up_Time 0x0003 096 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 190
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 072 060 030 Pre-fail Always - 17961673
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 988
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 177
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 0 0
189 High_Fly_Writes 0x003a 093 093 000 Old_age Always - 7
190 Airflow_Temperature_Cel 0x0022 068 049 045 Old_age Always - 32 (Min/Max 22/32)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 75
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 544
194 Temperature_Celsius 0x0022 032 051 000 Old_age Always - 32 (0 8 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 922h+54m+00.085s
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 17215653685
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 92150140791
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 967 -
# 2 Conveyance offline Completed without error 00% 620 -
# 3 Short offline Completed without error 00% 11 -
# 4 Short offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
root@NAS3[~]#
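For reading a dump like this, a small filter can pull out the few attributes that are commonly watched for a failing disk or a bad cable (IDs 5, 187, 197, 198 and 199). This is only a sketch: the function name check_smart_attrs is made up, and treating any non-zero raw value as "CHECK" is my own simplification, not a rule from smartmontools.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -a` output on stdin and report the raw
# values of a few attributes, flagging any that are not zero.
check_smart_attrs() {
    awk '
        # Attribute rows have at least 10 columns: ID, name, flag, value,
        # worst, thresh, type, updated, when_failed, raw value.
        # The NF test skips the short "selective self-test" rows, which
        # also start with a small number.
        NF >= 10 && ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198 || $1 == 199) {
            status = ($10 == 0) ? "ok" : "CHECK"
            printf "%s %s raw=%s %s\n", $1, $2, $10, status
        }
    '
}

# Example call:
#   smartctl -a /dev/ada2 | check_smart_attrs
```

In the ada2 output above, all five of these raw values are 0.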
I don’t know how much I can help, as I’m new at this, but I’d shut down and swap the cables of that drive with those of another drive, to rule out a problem with the controller, the port, sulfidation of the contacts, or the power to that drive.
My guess is that the drive is not being seen at all, as if it were not connected.
It is rare for this to be purely a software issue without the HDD having a problem first.
Oh my… the resilver restarted again after reaching 100%.
I shut down the NAS to check the cable of disk 3.
It took a while to shut down this time.
The cable is OK; it is in a spot where nobody can move or touch it.
I tried powering on again.
Nothing changed. ada3 is still removed, and the resilver resumed from where it was before the shutdown. Can I interrupt the resilver now by shutting down the NAS and replacing the disk with a new one? Or do I have to wait, or run some command?
It seems that you have only run short S.M.A.R.T. tests.
Be sure to perform these two actions:
1. Check the hard disks: run a long S.M.A.R.T. test, not a short one. It will take several hours, but it tests the disk thoroughly. You can do this:
from the graphical user interface: Storage → Disks → select drive(s) → Manual test → Long
from the command line interface: smartctl -t long /dev/ada3
After that, you can check the results with smartctl -a /dev/ada3
2. Check (and correct) the file system. You can do this:
from the graphical user interface: Storage → Pools → Gear icon → Scrub pool
from the command line interface: zpool scrub Disco_1
After that, you can check the results with zpool status Disco_1
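If you prefer to check the pool from the shell, here is a minimal sketch that filters zpool status output down to the rows that are not healthy. The helper name list_unhealthy is my own invention, not something TrueNAS ships, and it assumes the usual NAME / STATE / READ / WRITE / CKSUM column layout.

```shell
#!/bin/sh
# Hypothetical helper: read `zpool status` output on stdin and print the
# name and state of every config row that is not ONLINE.
list_unhealthy() {
    awk '
        # Config rows have 5 or more columns: NAME STATE READ WRITE CKSUM.
        # The NF test skips the two-column "state: DEGRADED" summary line.
        NF >= 5 && $2 ~ /^(DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            print $1, $2
        }
    '
}

# Example call (pool name as used above):
#   zpool status Disco_1 | list_unhealthy
```

No output means every row in the config table reported ONLINE.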
Whatever the results are, RAIDZ1 tolerates the failure of one disk, so you can buy a new disk and replace the failed one without losing data. In fact, that’s why we use TrueNAS.
Yes, the server does not see it. Either there is a connection problem or the disk has failed.
Shut down the computer and check the cables, both power and data. If the disk is still missing after that, then it has failed. In that case, replace it with a new one. As I said, your configuration (RAIDZ1) allows you to do this without losing data.