Hello everyone,
I have recently run into a problem on the SSD that hosts the few applications installed on my TrueNAS Scale server (version 23.10.2), and I cannot manage to diagnose it. ZFS reports that the issue is in the ix-applications/k3s dataset.
Here is the output of the command zpool status -xv:
  pool: Apps Data
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:01:28 with 2 errors on Sun Nov 17 00:01:35 2024
config:

        NAME                                    STATE     READ WRITE CKSUM
        Apps Data                               ONLINE       0     0     0
          c2e24faa-93e4-4695-9c11-0ec1adbdb298  ONLINE       0     0    12

errors: Permanent errors have been detected in the following files:

        /mnt/Apps Data/ix-applications/k3s/agent/containerd/io.containerd.content.v1.content/blobs/sha256/273dcc6abc5af9b503fcf06af8a60d803826a7be16992aced20a0202a9a2385e
        Apps Data/ix-applications/k3s@ix-applications-backup-HeavyScript_2024_11_20_17_28_20:/agent/containerd/io.containerd.content.v1.content/blobs/sha256/273dcc6abc5af9b503fcf06af8a60d803826a7be16992aced20a0202a9a2385e
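As a side note, the two entries in the errors list above look like the same containerd blob seen twice: once in the live dataset and once through the HeavyScript snapshot. The sketch below is only a way to check that reading. The helper name parse_entry is hypothetical, and the "dataset@snapshot:/path" split is an assumption based on how these two lines are written, not on a documented format.

```python
# The two entries copied from the "errors:" section of the zpool status output.
ENTRIES = [
    "/mnt/Apps Data/ix-applications/k3s/agent/containerd/"
    "io.containerd.content.v1.content/blobs/sha256/"
    "273dcc6abc5af9b503fcf06af8a60d803826a7be16992aced20a0202a9a2385e",
    "Apps Data/ix-applications/k3s@"
    "ix-applications-backup-HeavyScript_2024_11_20_17_28_20:"
    "/agent/containerd/io.containerd.content.v1.content/blobs/sha256/"
    "273dcc6abc5af9b503fcf06af8a60d803826a7be16992aced20a0202a9a2385e",
]


def parse_entry(entry):
    """Split one error entry into (kind, snapshot, path).

    Assumption: an entry starting with "/" is a file in the mounted dataset,
    anything else has the shape "dataset@snapshot:/path".
    """
    if entry.startswith("/"):
        return ("live", None, entry)
    head, _, path = entry.partition(":")
    dataset, _, snapshot = head.partition("@")
    return ("snapshot", f"{dataset}@{snapshot}", path)


parsed = [parse_entry(e) for e in ENTRIES]

# The last path component is the sha256 digest that names the blob.
blobs = {path.rsplit("/", 1)[-1] for _, _, path in parsed}

for kind, snapshot, path in parsed:
    print(kind, snapshot or "-", path.rsplit("/", 1)[-1][:12])
print("distinct blobs:", len(blobs))
```

If both entries resolve to a single digest, the "2 errors" of the scrub would concern one file and its snapshot copy rather than two unrelated files.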
I should add that smartctl does not seem to have detected any problem with this disk (the tests apparently reported no errors):
=== START OF INFORMATION SECTION ===
Device Model: SPCC Solid State Disk
Serial Number: AA000000000000007963
LU WWN Device Id: 0 000000 000000000
Firmware Version: HT3618B9
User Capacity: 128,035,676,160 bytes [128 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: Not in smartctl database 7.3/5528
ATA Version is: ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Nov 20 20:47:44 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Disabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, frozen [SEC2]
Wt Cache Reorder: Unavailable
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 18) seconds.
Offline data collection
capabilities: (0x5d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0002) Does not save SMART data before
entering power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 4) minutes.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate -O--CK 100 100 050 - 0
5 Reallocated_Sector_Ct -O--CK 100 100 050 - 0
9 Power_On_Hours -O--CK 100 100 050 - 6159
12 Power_Cycle_Count -O--CK 100 100 050 - 22
160 Unknown_Attribute -O--CK 100 100 050 - 0
161 Unknown_Attribute -O--CK 100 100 050 - 25093
163 Unknown_Attribute -O--CK 100 100 050 - 56
164 Unknown_Attribute -O--CK 100 100 050 - 2222
165 Unknown_Attribute -O--CK 100 100 050 - 1882
166 Unknown_Attribute -O--CK 100 100 050 - 1054
167 Unknown_Attribute -O--CK 100 100 050 - 1521
168 Unknown_Attribute -O--CK 100 100 050 - 0
169 Unknown_Attribute -O--CK 100 100 050 - 100
175 Program_Fail_Count_Chip -O--CK 100 100 050 - 473112569
176 Erase_Fail_Count_Chip -O--CK 100 100 050 - 9399547
177 Wear_Leveling_Count -O--CK 100 100 050 - 15863229
178 Used_Rsvd_Blk_Cnt_Chip -O--CK 100 100 050 - 0
181 Program_Fail_Cnt_Total -O--CK 100 100 050 - 0
182 Erase_Fail_Count_Total -O--CK 100 100 050 - 0
192 Power-Off_Retract_Count -O--CK 100 100 050 - 8
194 Temperature_Celsius -O--CK 100 100 050 - 38
195 Hardware_ECC_Recovered -O--CK 100 100 050 - 0
196 Reallocated_Event_Count -O--CK 100 100 050 - 0
197 Current_Pending_Sector -O--CK 100 100 050 - 0
198 Offline_Uncorrectable -O--CK 100 100 050 - 0
199 UDMA_CRC_Error_Count -O--CK 100 100 050 - 0
232 Available_Reservd_Space -O--CK 100 100 050 - 5
241 Total_LBAs_Written -O--CK 100 100 050 - 206331
242 Total_LBAs_Read -O--CK 100 100 050 - 4303
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 GPL,SL R/O 1 Summary SMART error log
0x02 GPL,SL R/O 1 Comprehensive SMART error log
0x03 GPL,SL R/O 1 Ext. Comprehensive SMART error log
0x04 GPL,SL R/O 8 Device Statistics log
0x06 GPL,SL R/O 1 SMART self-test log
0x07 GPL,SL R/O 1 Extended self-test log
0x09 GPL,SL R/W 1 Selective self-test log
0x10 GPL,SL R/O 1 NCQ Command Error log
0x11 GPL,SL R/O 1 SATA Phy Event Counters log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0 GPL,SL VS 16 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 0 (1 sectors)
No Errors Logged
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Offline Completed without error 00% 5683 -
# 2 Offline Self-test routine in progress 10% 5683 -
# 3 Offline Self-test routine in progress 10% 5683 -
# 4 Offline Self-test routine in progress 10% 5683 -
# 5 Offline Self-test routine in progress 10% 5683 -
# 6 Offline Self-test routine in progress 10% 5683 -
# 7 Offline Self-test routine in progress 10% 5683 -
# 8 Offline Self-test routine in progress 10% 5683 -
# 9 Offline Self-test routine in progress 10% 5683 -
#10 Offline Self-test routine in progress 10% 5683 -
#11 Offline Self-test routine in progress 10% 5683 -
#12 Offline Self-test routine in progress 10% 5683 -
#13 Offline Self-test routine in progress 10% 5683 -
#14 Offline Self-test routine in progress 10% 5683 -
#15 Offline Self-test routine in progress 10% 5683 -
#16 Offline Self-test routine in progress 10% 5683 -
#17 Offline Self-test routine in progress 10% 5683 -
#18 Offline Self-test routine in progress 10% 5683 -
#19 Offline Self-test routine in progress 10% 5683 -
SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Commands not supported
Device Statistics (GP Log 0x04)
Page Offset Size Value Flags Description
0x01 ===== = = === == General Statistics (rev 1) ==
0x01 0x008 4 22 --- Lifetime Power-On Resets
0x01 0x010 4 6159 --- Power-on Hours
0x01 0x018 6 13522108416 --- Logical Sectors Written
0x01 0x020 6 450236669 --- Number of Write Commands
0x01 0x028 6 282001408 --- Logical Sectors Read
0x01 0x030 6 1858915 --- Number of Read Commands
|||_ C monitored condition met
||__ D supports DSN
|___ N normalized value
Pending Defects log (GP Log 0x0c) not supported
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0009 2 121 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 88 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
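To put the write counters above into perspective, here is a rough conversion of the Device Statistics figures into bytes. It assumes the 512-byte logical sector size reported in the information section, and it assumes the drive's counters can be trusted at all, which is not a given for a device absent from the smartctl database. The 65536-sector unit used for attributes 241 and 242 is a guess that happens to match the numbers, not a documented fact for this drive.

```python
# Figures copied from the smartctl output above.
SECTOR_BYTES = 512                 # "Sector Size: 512 bytes logical/physical"
CAPACITY_BYTES = 128_035_676_160   # "User Capacity"
SECTORS_WRITTEN = 13_522_108_416   # Device Statistics: Logical Sectors Written
SECTORS_READ = 282_001_408         # Device Statistics: Logical Sectors Read
ATTR_241_RAW = 206_331             # Total_LBAs_Written (raw value)
ATTR_242_RAW = 4_303               # Total_LBAs_Read (raw value)

# Assumption: attributes 241/242 count blocks of 65536 sectors (32 MiB).
ATTR_UNIT_SECTORS = 65_536

bytes_written = SECTORS_WRITTEN * SECTOR_BYTES
bytes_read = SECTORS_READ * SECTOR_BYTES

# How many times the full capacity has been written, per these counters.
full_drive_writes = bytes_written / CAPACITY_BYTES

# Cross-check: do the SMART attributes agree with the Device Statistics log?
written_matches = ATTR_241_RAW * ATTR_UNIT_SECTORS == SECTORS_WRITTEN
read_matches = ATTR_242_RAW * ATTR_UNIT_SECTORS == SECTORS_READ

print(f"written: {bytes_written / 1e12:.2f} TB")
print(f"read:    {bytes_read / 1e9:.2f} GB")
print(f"full-drive writes: {full_drive_writes:.1f}")
print("attribute 241 consistent with log:", written_matches)
print("attribute 242 consistent with log:", read_matches)
```

Under those assumptions the two sources agree exactly, and the drive has taken a little under 7 TB of writes, or about 54 times its capacity, over its 6159 power-on hours.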
There is nothing crucial on this disk, apart from the fact that it took me a long time to configure these applications. I made a backup of everything with HeavyScript just in case, but could you point me in the right direction on this problem? Is there a fix, or do I need to replace this disk as soon as possible?
Thanks in advance for your help.
Thibaut M