Confusing crash on storage

Cousins_Nerdnee · January 22, 2025, 7:49pm

version: TrueNAS-SCALE-24.10.1

So I am relatively new to TrueNas, but my server just keeps crashing.
Reviewing the syslogs and notifications I see that it is complaining about a couple of my hard drives. Funny that both drive happen to be my WD drives.

Running long smart tests seem to reveal nothing in the smartctl.

I have spare drive for if these die but I am not sure if they need to be replaced right now based on these results.
And directly after these errors occur, TrueNas freezes and i have to power cycle the box just to get back into it. It happens every few days or so.

Can anyone provide some insight into this?

Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 102
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 7 to 27
Jan 20 19:14:30 truenas smartd[2946]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
Jan 20 19:17:01 truenas CRON[18488]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   182   169   021    Pre-fail  Always       -       5900
  4 Start_Stop_Count        0x0032   095   095   000    Old_age   Always       -       5534
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6348
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       139
 16 Total_LBAs_Read         0x0022   001   199   000    Old_age   Always       -       54328586736
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       85
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8214
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5149         -
# 2  Extended offline    Completed without error       00%      4981         -
# 3  Extended offline    Completed without error       00%      4819         -
# 4  Extended offline    Completed without error       00%      4734         -
# 5  Extended offline    Completed without error       00%      4617         -
# 6  Extended offline    Completed without error       00%      4548         -

etorix · January 22, 2025, 8:41pm

The above SMART report is from a healthy drive. But drive letters can change at each reboot, so ‘sdc’ is probably something else now: Check all drives.

Cousins_Nerdnee · January 22, 2025, 9:01pm

I have ran a check on all drives and they all reported similar results. I will run again to verify but if we assume for a moment that these smart tests all come back without issue, what would be my next step?

etorix · January 22, 2025, 9:09pm

If all SMART tests come good and TrueNAS hallucinated the above alerts, you’re actually running TrueNAS AI Edition…

Cousins_Nerdnee · January 23, 2025, 12:48am

While that is running, I do have a question. Truenas did freeze and require power cycling right after this error. That has now happened 3-4 times and I am not sure if it is crashing because I screwed something up or a dying drive can kill it. I think it is fair to say that TrueNas being taken down by 1 bad drive in a raidz1 array is not optimal.

etorix · January 23, 2025, 7:59am

That’s awfully hot… Could it be a SSD? And it’s sdb, not sdc

Time to describe your hardware in detail, I think.

Cousins_Nerdnee · January 23, 2025, 6:22pm

It is a hard drive with 2 intakes blowing into them. idk why it is marking it that high. My thermometer states its barely above room temp.
Also the long smart test came back and there is one drive that has allocated sectors. I did not know the drive letters change on reboot, so thank you for that information.

I will replace that drive. In the mean time, I am a bit concerned that TrueNas froze just after it logged this error. I thought it was me having xmp on in the bios but that was already off and that was the initial issue that lead me down this rabbit hole. Thoughts?

neofusion · January 23, 2025, 6:31pm

If you post the full reports we can help you identify the failing drive.

The partial report you did post is likely not related to the failing drive referred to in your log file.

Cousins_Nerdnee · January 23, 2025, 8:28pm

I found the dying drive by checking all the drives. I did not know the drive letters changed on reboot. I think as of the time I am writing this, it is sda.
I am about to replace it.

Right now I am concerned about the fact that TrueNas froze just after it posted that error in the syslogs. For context on the syslog, the entry that came before that block was 3 min older and did not seem relevant.

Here’s the smart report for it.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   104   099   006    Pre-fail  Always       -       6854048
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   095   095   020    Old_age   Always       -       5404
  5 Reallocated_Sector_Ct   0x0033   011   011   036    Pre-fail  Always   FAILING_NOW 58616
  7 Seek_Error_Rate         0x000f   048   048   030    Pre-fail  Always       -       2173288369086
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       32540
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       96
183 Runtime_Bad_Block       0x0032   098   098   000    Old_age   Always       -       2
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       8590065689
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   054   045    Old_age   Always       -       29 (Min/Max 28/30)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       5406
194 Temperature_Celsius     0x0022   029   046   000    Old_age   Always       -       29 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   014   007   000    Old_age   Always       -       6854048
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       4722 (198 91 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1085210477
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1659397574

Here is the syslog brock that came before it.

Jan 20 18:55:56 truenas systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jan 20 19:00:04 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Jan 20 19:00:04 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan 20 19:00:04 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Jan 20 19:10:00 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Jan 20 19:10:00 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan 20 19:10:00 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Jan 20 19:11:13 truenas systemd[1]: Created slice user-950.slice - User Slice of UID 950.
Jan 20 19:11:13 truenas systemd[1]: Starting user-runtime-dir@950.service - User Runtime Directory /run/user/950...
Jan 20 19:11:13 truenas systemd[1]: Finished user-runtime-dir@950.service - User Runtime Directory /run/user/950.
Jan 20 19:11:13 truenas systemd[1]: Starting user@950.service - User Manager for UID 950...
Jan 20 19:11:13 truenas systemd-xdg-autostart-generator[16476]: Exec binary '/usr/libexec/at-spi-bus-launcher' does not exist: No such file or directory
Jan 20 19:11:13 truenas systemd-xdg-autostart-generator[16476]: /etc/xdg/autostart/at-spi-dbus-bus.desktop: not generating unit, executable specified in Exec= does not exist.
Jan 20 19:11:13 truenas systemd[16460]: Queued start job for default target default.target.
Jan 20 19:11:13 truenas systemd[16460]: Created slice app.slice - User Application Slice.
Jan 20 19:11:13 truenas systemd[16460]: Reached target paths.target - Paths.
Jan 20 19:11:13 truenas systemd[16460]: Reached target timers.target - Timers.
Jan 20 19:11:13 truenas systemd[16460]: Starting dbus.socket - D-Bus User Message Bus Socket...
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jan 20 19:11:13 truenas systemd[16460]: Listening on gssuserproxy.socket - GSS User Proxy.
Jan 20 19:11:13 truenas systemd[16460]: Listening on dbus.socket - D-Bus User Message Bus Socket.
Jan 20 19:11:13 truenas systemd[16460]: Reached target sockets.target - Sockets.
Jan 20 19:11:13 truenas systemd[16460]: Reached target basic.target - Basic System.
Jan 20 19:11:13 truenas systemd[16460]: Reached target default.target - Main User Target.
Jan 20 19:11:13 truenas systemd[16460]: Startup finished in 189ms.
Jan 20 19:11:13 truenas systemd[1]: Started user@950.service - User Manager for UID 950.
Jan 20 19:11:13 truenas systemd[1]: Started session-1.scope - Session 1 of User admin.

etorix · January 23, 2025, 10:14pm

Ouch! Judging by the number of reallocated sectors this drive is failing really badly and really fast. It’s good you’ve found it and can replace it quickly.

Cousins_Nerdnee · January 23, 2025, 10:17pm

Agreed. I just wish I didn’t find it by a server crash. Any idea what that might be about?