I received a message overnight about a corrupted file on the SSD boot device.
zpool status -xv
pool: freenas-boot
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
scan: scrub repaired 0B in 00:00:31 with 1 errors on Fri Aug 9 03:45:31 2024
config:
NAME STATE READ WRITE CKSUM
freenas-boot ONLINE 0 0 0
ada8p2 ONLINE 0 0 6
errors: Permanent errors have been detected in the following files:
//boot/kernel-debug/iser.ko
The system is running TrueNAS-13.0-U6.2. The boot device is a SanDisk Plus 120GB connected via a SATA cable to the Supermicro X10SRH-cF motherboard. The syslog is also kept on the boot SSD, but nothing else uses it.
The boot SSD passes a SMART short test. I have never been able to get it to do long tests, so I simply have a short test run every day.
Digging deeper, the corrupted file seems to be related to iSCSI, which I don't use.
Looking back at the daily SMART reports I receive via email, I see that the Reported_Uncorrect value jumped from 8 to 10076 on 2024-07-28 and is now at 10092. I'll order a replacement boot SSD. Comments?
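To catch a jump like this sooner, something along these lines could scan each day's smartctl output. This is only a minimal sketch: raw_value and find_jumps are hypothetical helper names, the column layout is assumed to match the smartctl -x attribute table shown further down, and the threshold of 100 is an arbitrary choice.

```python
import re


def raw_value(smartctl_output, attr_id):
    """Return the leading integer of RAW_VALUE for one SMART attribute, or None.

    Assumes the `smartctl -x` layout:
    ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 8 columns and start with the numeric ID.
        if len(fields) >= 8 and fields[0] == str(attr_id):
            # RAW_VALUE may carry a suffix such as "40 (Min/Max 0/47)",
            # so keep only the leading digits.
            match = re.match(r"\d+", fields[7])
            if match:
                return int(match.group())
    return None


def find_jumps(readings, min_step=100):
    """Return (label, previous, current) for each rise of at least min_step.

    `readings` is a list of (label, value) pairs in chronological order.
    """
    jumps = []
    for (_, prev), (label, cur) in zip(readings, readings[1:]):
        if cur - prev >= min_step:
            jumps.append((label, prev, cur))
    return jumps


# Attribute row copied from the smartctl output in this post.
row = "187 Reported_Uncorrect -O--CK 100 100 000 - 10092"
print(raw_value(row, 187))

# Values from this post; "earlier" stands in for the report before the jump.
history = [("earlier", 8), ("2024-07-28", 10076), ("2024-08-09", 10092)]
print(find_jumps(history))
```

Comparing raw values between consecutive reports, rather than watching the normalized VALUE column, seems necessary here because VALUE stayed at 100 throughout.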
Given that the SSD is much larger than strictly needed for a boot device, I wonder if there is any benefit to setting the ZFS copies property to 2 or more. Is that possible on a boot device?
smartctl -x /dev/ada8
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: SanDisk SDSSDA120G
Serial Number: 162214402614
LU WWN Device Id: 5 001b44 4a4a89abb
Firmware Version: Z22000RL
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 1.8 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Aug 9 08:22:47 2024 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Disabled
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 10) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
5 Retired_Block_Count -O--CK 100 100 000 - 25
9 Power_On_Hours_and_Msec -O--CK 119 100 000 - 11895h+00m+00.000s
12 Power_Cycle_Count -O--CK 100 100 000 - 242
166 Unknown_Attribute -O--CK 100 100 000 - 19873
167 Unknown_Attribute -O--CK 100 100 000 - 15
168 Unknown_Attribute -O--CK 100 100 000 - 19926
169 Unknown_Attribute -O--CK 100 100 000 - 42
170 Reserve_Block_Count -O--CK 100 100 000 - 25
171 Program_Fail_Count -O--CK 100 100 000 - 0
172 Erase_Fail_Count -O--CK 100 100 000 - 0
173 Unknown_SandForce_Attr -O--CK 100 100 --- - 19822
174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 201
187 Reported_Uncorrect -O--CK 100 100 000 - 10092
194 Temperature_Celsius -O---K 060 100 000 - 40 (Min/Max 0/47)
199 SATA_CRC_Error_Count -O--CK 100 100 000 - 0
230 Life_Curve_Status -O--CK 100 100 000 - 663
232 Available_Reservd_Space PO--CK 100 100 004 - 91
233 SandForce_Internal -O--CK 100 100 000 - 1667078
241 Lifetime_Writes_GiB ----CK 253 253 000 - 51887
242 Lifetime_Reads_GiB ----CK 253 253 000 - 4364
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 GPL,SL R/O 1 Summary SMART error log
0x02 GPL,SL R/O 1 Comprehensive SMART error log
0x03 GPL,SL R/O 1 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 GPL,SL R/O 1 SMART self-test log
0x07 GPL,SL R/O 1 Extended self-test log
0x09 GPL,SL R/W 1 Selective self-test log
0x10 GPL,SL R/O 1 NCQ Command Error log
0x11 GPL,SL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 11786 -
# 2 Short offline Completed without error 00% 11774 -
# 3 Short offline Completed without error 00% 11762 -
# 4 Short offline Completed without error 00% 10468 -
# 5 Short offline Completed without error 00% 10457 -
# 6 Short offline Completed without error 00% 10446 -
# 7 Short offline Completed without error 00% 10436 -
# 8 Short offline Completed without error 00% 10424 -
# 9 Short offline Completed without error 00% 10389 -
#10 Reserved (0x0d) Completed without error 00% 7965 -
#11 Vendor (0x4b) Self-test routine in progress 90% 7955 -
#12 Short offline Unknown status (0xb) 10% 7945 -
#13 Short offline Completed without error 00% 7877 -
#14 Short offline Completed without error 00% 7868 -
#15 Short offline Completed without error 00% 7857 -
#16 Short offline Completed without error 00% 7847 -
#17 Reserved (0x0d) Completed without error 00% 7838 -
#18 Vendor (0x4b) Self-test routine in progress 90% 7693 -
#19 Short offline Unknown status (0xb) 10% 63819 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Commands not supported
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 2) ==
0x01 0x008 4 242 --- Lifetime Power-On Resets
0x01 0x010 4 11895 --- Power-on Hours
0x01 0x018 6 51887 --- Logical Sectors Written
0x01 0x020 6 3531021314 --- Number of Write Commands
0x01 0x028 6 4364 --- Logical Sectors Read
0x01 0x030 6 175808871 --- Number of Read Commands
0x02 ===== = = === == Free-Fall Statistics (empty) ==
0x03 ===== = = === == Rotating Media Statistics (empty) ==
0x04 ===== = = === == General Errors Statistics (rev 1) ==
0x04 0x008 4 10092 --- Number of Reported Uncorrectable Errors
0x04 0x010 4 201 --- Resets Between Cmd Acceptance and Completion
0x05 ===== = = === == Temperature Statistics (rev 1) ==
0x05 0x008 1 40 --- Current Temperature
0x05 0x010 1 -4 --- Average Short Term Temperature
0x05 0x018 1 58 --- Average Long Term Temperature
0x05 0x020 1 45 --- Highest Temperature
0x05 0x028 1 8 --- Lowest Temperature
0x05 0x030 1 -1 --- Highest Average Short Term Temperature
0x05 0x038 1 0 --- Lowest Average Short Term Temperature
0x05 0x040 1 -1 --- Highest Average Long Term Temperature
0x05 0x048 1 0 --- Lowest Average Long Term Temperature
0x05 0x050 4 0 --- Time in Over-Temperature
0x05 0x058 1 100 --- Specified Maximum Operating Temperature
0x05 0x060 4 0 --- Time in Under-Temperature
0x05 0x068 1 0 --- Specified Minimum Operating Temperature
0x06 ===== = = === == Transport Statistics (rev 1) ==
0x06 0x008 4 1442 --- Number of Hardware Resets
0x06 0x018 4 0 --- Number of Interface CRC Errors
0x07 ===== = = === == Solid State Device Statistics (rev 1) ==
0x07 0x008 1 148 --- Percentage Used Endurance Indicator
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 0 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 5 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0010 2 0 R_ERR response for host-to-device data FIS, non-CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x0013 2 0 R_ERR response for host-to-device non-data FIS, non-CRC