So I am relatively new to TrueNAS, but my server just keeps crashing.
Reviewing the syslogs and notifications, I see that it is complaining about a couple of my hard drives. Funny that both drives happen to be my WD drives.
Running long SMART tests seems to reveal nothing in the smartctl output.
I have a spare drive in case these die, but I am not sure whether they need to be replaced right now based on these results.
Directly after these errors occur, TrueNAS freezes and I have to power cycle the box just to get back into it. It happens every few days or so.
Can anyone provide some insight into this?
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 103 to 102
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Jan 20 19:14:29 truenas smartd[2946]: Device: /dev/sdc [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 7 to 27
Jan 20 19:14:30 truenas smartd[2946]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 119
Jan 20 19:17:01 truenas CRON[18488]: (root) CMD (cd / && run-parts --report /etc/cron.hourly)
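For anyone who wants to pull the interesting bits out of a longer syslog, here is a minimal Python sketch that groups smartd messages by device. It assumes the line layout shown in the excerpt above; the function name `summarize_smartd` and the regexes are my own, not part of smartmontools. As far as I know, the "changed from X to Y" numbers are normalized attribute values unless smartd is configured to report raw values, so treat them as a trend rather than a count.

```python
import re

# Matches smartd syslog lines shaped like the ones in the excerpt above.
SMARTD_LINE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+) \[[^\]]*\], (?P<message>.*)$"
)
# Matches the "attribute changed" messages, e.g.
# "SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 7 to 27".
CHANGE = re.compile(
    r"SMART (?:Prefailure|Usage) Attribute: \d+ (?P<name>\S+) "
    r"changed from (?P<old>\d+) to (?P<new>\d+)"
)


def summarize_smartd(lines):
    """Group smartd messages by device into failures and attribute changes."""
    summary = {}
    for line in lines:
        match = SMARTD_LINE.search(line)
        if not match:
            continue  # not a smartd device line (cron, systemd, ...)
        entry = summary.setdefault(
            match.group("device"), {"failures": [], "changes": []}
        )
        message = match.group("message")
        change = CHANGE.search(message)
        if change:
            entry["changes"].append(
                (change.group("name"), int(change.group("old")), int(change.group("new")))
            )
        elif "FAILED" in message.upper():
            # Covers both "FAILED SMART self-check" and "Failed SMART usage Attribute".
            entry["failures"].append(message)
    return summary
```

Feeding it the lines of a saved syslog file (for example `summarize_smartd(open(path))`, with `path` being wherever you saved the log) returns a dict keyed by device name, which makes it easy to see which device logged failures and which only logged attribute changes.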
The above SMART report is from a healthy drive. But drive letters can change at each reboot, so ‘sdc’ is probably something else now: Check all drives.
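Since the sdX letters are not stable, it helps to tie each one to a persistent identifier before deciding which physical drive to pull. On Linux, the symlinks under /dev/disk/by-id normally point at the current kernel device names. A rough Python sketch, where the function name `stable_names` is my own and the directory layout is assumed to be the usual udev one:

```python
import os


def stable_names(by_id_dir="/dev/disk/by-id"):
    """Map current kernel names (e.g. 'sdc') to their persistent by-id aliases."""
    mapping = {}
    for alias in sorted(os.listdir(by_id_dir)):
        # Each alias is a symlink such as ../../sdc; resolve it to the real node.
        target = os.path.realpath(os.path.join(by_id_dir, alias))
        mapping.setdefault(os.path.basename(target), []).append(alias)
    return mapping
```

Running `stable_names()` after each reboot shows which model and serial number currently sits behind each letter, so a SMART warning for 'sdc' from before the reboot can be matched to the right disk. Partitions show up as separate keys such as 'sdc1'.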
I have run a check on all drives and they all reported similar results. I will run it again to verify, but if we assume for a moment that these SMART tests all come back without issue, what would be my next step?
While that is running, I do have a question. TrueNAS did freeze and require power cycling right after this error. That has now happened 3-4 times, and I am not sure if it is crashing because I screwed something up or because a dying drive can kill it. I think it is fair to say that TrueNAS being taken down by 1 bad drive in a RAIDZ1 array is not optimal.
It is a hard drive with 2 intake fans blowing onto it. Idk why it is marking it that high. My thermometer says it's barely above room temp.
Also, the long SMART test came back and there is one drive that has reallocated sectors. I did not know the drive letters change on reboot, so thank you for that information.
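For checking the reallocated sector count on every drive without reading each report by eye, smartctl can emit JSON (the `--json` option, available in smartmontools 7.0 and later as far as I know). Below is a small Python sketch; the helper names `raw_attribute_values` and `read_report` are hypothetical, and it assumes the ATA attribute table appears under `ata_smart_attributes.table` in the JSON output.

```python
import json
import subprocess


def raw_attribute_values(report, attribute_ids=(5,)):
    """Return {name: raw value} for the given SMART attribute IDs.

    `report` is the parsed JSON from `smartctl --json -A <device>`.
    ID 5 is Reallocated_Sector_Ct, the attribute flagged in the log.
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    return {
        row["name"]: row["raw"]["value"]
        for row in table
        if row.get("id") in attribute_ids
    }


def read_report(device):
    """Run smartctl on one device and parse its JSON output (needs root)."""
    # smartctl's exit status is a bitmask that can be non-zero even when
    # output is valid, so the return code is deliberately not checked here.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Something like `raw_attribute_values(read_report("/dev/sda"))` then gives the raw count for one drive; a non-zero and growing value is the thing to watch for.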
I will replace that drive. In the meantime, I am a bit concerned that TrueNAS froze just after it logged this error. I thought it was me having XMP on in the BIOS, but that was already off, and that was the initial issue that led me down this rabbit hole. Thoughts?
I found the dying drive by checking all the drives. I did not know the drive letters changed on reboot. I think as of the time I am writing this, it is sda.
I am about to replace it.
Right now I am concerned about the fact that TrueNAS froze just after it posted that error in the syslogs. For context on the syslog, the entry that came before that block was 3 min older and did not seem relevant.
Jan 20 18:55:56 truenas systemd[1]: run-credentials-systemd\x2dtmpfiles\x2dclean.service.mount: Deactivated successfully.
Jan 20 19:00:04 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Jan 20 19:00:04 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan 20 19:00:04 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Jan 20 19:10:00 truenas systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
Jan 20 19:10:00 truenas systemd[1]: sysstat-collect.service: Deactivated successfully.
Jan 20 19:10:00 truenas systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
Jan 20 19:11:13 truenas systemd[1]: Created slice user-950.slice - User Slice of UID 950.
Jan 20 19:11:13 truenas systemd[1]: Starting user-runtime-dir@950.service - User Runtime Directory /run/user/950...
Jan 20 19:11:13 truenas systemd[1]: Finished user-runtime-dir@950.service - User Runtime Directory /run/user/950.
Jan 20 19:11:13 truenas systemd[1]: Starting user@950.service - User Manager for UID 950...
Jan 20 19:11:13 truenas systemd-xdg-autostart-generator[16476]: Exec binary '/usr/libexec/at-spi-bus-launcher' does not exist: No such file or directory
Jan 20 19:11:13 truenas systemd-xdg-autostart-generator[16476]: /etc/xdg/autostart/at-spi-dbus-bus.desktop: not generating unit, executable specified in Exec= does not exist.
Jan 20 19:11:13 truenas systemd[16460]: Queued start job for default target default.target.
Jan 20 19:11:13 truenas systemd[16460]: Created slice app.slice - User Application Slice.
Jan 20 19:11:13 truenas systemd[16460]: Reached target paths.target - Paths.
Jan 20 19:11:13 truenas systemd[16460]: Reached target timers.target - Timers.
Jan 20 19:11:13 truenas systemd[16460]: Starting dbus.socket - D-Bus User Message Bus Socket...
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-browser.socket - GnuPG cryptographic agent and passphrase cache (access for web browsers).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-extra.socket - GnuPG cryptographic agent and passphrase cache (restricted).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent-ssh.socket - GnuPG cryptographic agent (ssh-agent emulation).
Jan 20 19:11:13 truenas systemd[16460]: Listening on gpg-agent.socket - GnuPG cryptographic agent and passphrase cache.
Jan 20 19:11:13 truenas systemd[16460]: Listening on gssuserproxy.socket - GSS User Proxy.
Jan 20 19:11:13 truenas systemd[16460]: Listening on dbus.socket - D-Bus User Message Bus Socket.
Jan 20 19:11:13 truenas systemd[16460]: Reached target sockets.target - Sockets.
Jan 20 19:11:13 truenas systemd[16460]: Reached target basic.target - Basic System.
Jan 20 19:11:13 truenas systemd[16460]: Reached target default.target - Main User Target.
Jan 20 19:11:13 truenas systemd[16460]: Startup finished in 189ms.
Jan 20 19:11:13 truenas systemd[1]: Started user@950.service - User Manager for UID 950.
Jan 20 19:11:13 truenas systemd[1]: Started session-1.scope - Session 1 of User admin.
Ouch! Judging by the number of reallocated sectors this drive is failing really badly and really fast. It’s good you’ve found it and can replace it quickly.