I pulled the sdb disk out of the ZFS pool because it is an SSD (Kingston) cache disk. I’m currently running a full SMART test on it to see if that reveals anything. The short tests were fine…
Any suggestions or ideas would be appreciated. Thanks in advance for the help.
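For anyone following along, the full (long) SMART self-test mentioned above is normally started with `smartctl -t long` and its result read back with `smartctl -l selftest`. Here is a minimal Python sketch, assuming the disk is still `/dev/sdb` and that smartctl would be run as root. The helper names `selftest_commands` and `run` are hypothetical, and the script defaults to a dry run that only prints the commands:

```python
import subprocess


def selftest_commands(device):
    """Return the smartctl invocations for a long self-test on `device`."""
    return {
        # Start the extended (long) self-test; it runs in the background on the drive.
        "start": ["smartctl", "-t", "long", device],
        # Print the self-test log; useful once the test has had time to finish.
        "log": ["smartctl", "-l", "selftest", device],
    }


def run(cmd, dry_run=True):
    """Print the command (dry run) or execute it and return its stdout."""
    if dry_run:
        print(" ".join(cmd))
        return None
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # /dev/sdb is an assumption taken from the post; adjust to the actual device.
    for cmd in selftest_commands("/dev/sdb").values():
        run(cmd)  # dry run: only prints the commands
```

Passing `dry_run=False` would actually invoke smartctl, so it is worth checking the device name first.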
I can’t tell anything from what you posted.
The first thing is to list your hardware. Next, run some tests on your system to see if it is stable: RAM and CPU stress tests, and not just one pass. Run the RAM test for 24 hours or longer and the CPU stress test for several hours at minimum.
I vaguely remember an old bug where this happened on an old TrueNAS 12.x build with Kingston SSDs in the pool. I had put that together as a dev server at my old job, just for us to toy around with technologies, etc. Anyway, that’s a different platform, FreeBSD vs Linux…
Anyway, my build is here. I have the following SSDs: the boot drive (KINGSTON_SV300S37A120G) and the drive that was in the data pool as the cache (KINGSTON_SV300S37A240G). I pulled the cache drive temporarily based on the errors.
This was the first time I was able to actually get some logs. I can’t SSH in or access the console, as the system is unresponsive when this issue happens.
So after some testing, it looks like it was the cache disk that died. It was causing the pool to lock up. Since I’ve removed the disk, the system has been fully functional.
Any recommendations for SSD drives to use with ZFS?
You can see the smartctl output below:
smartctl -i -A /dev/sdb
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: SandForce Driven SSDs
Device Model: KINGSTON SV300S37A240G
Serial Number: 50026B7764002F88
LU WWN Device Id: 5 0026b7 764002f88
Firmware Version: 60AABBF0
User Capacity: 240,057,409,536 bytes [240 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
TRIM Command: Available
Device is: In smartctl database 7.3/5528
ATA Version is: ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Aug 21 14:32:32 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0032 120 120 050 Old_age Always - 0/0
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
9 Power_On_Hours_and_Msec 0x0032 092 092 000 Old_age Always - 7596h+29m+37.090s
12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 3836
171 Program_Fail_Count 0x000a 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 161
177 Wear_Range_Delta 0x0000 000 000 000 Old_age Offline - 1
181 Program_Fail_Count 0x000a 100 100 000 Old_age Always - 0
182 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0012 100 100 000 Old_age Always - 0
189 Airflow_Temperature_Cel 0x0000 039 056 000 Old_age Offline - 39 (Min/Max 16/56)
194 Temperature_Celsius 0x0022 039 056 000 Old_age Always - 39 (Min/Max 16/56)
195 ECC_Uncorr_Error_Count 0x001c 120 120 000 Old_age Offline - 0/0
196 Reallocated_Event_Count 0x0033 100 100 003 Pre-fail Always - 0
201 Unc_Soft_Read_Err_Rate 0x001c 120 120 000 Old_age Offline - 0/0
204 Soft_ECC_Correct_Rate 0x001c 120 120 000 Old_age Offline - 0/0
230 Life_Curve_Status 0x0013 100 100 000 Pre-fail Always - 100
231 SSD_Life_Left 0x0000 097 097 011 Old_age Offline - 4294967297
233 SandForce_Internal 0x0032 000 000 000 Old_age Always - 7969
234 SandForce_Internal 0x0032 000 000 000 Old_age Always - 9067
241 Lifetime_Writes_GiB 0x0032 000 000 000 Old_age Always - 9067
242 Lifetime_Reads_GiB 0x0032 000 000 000 Old_age Always - 5590
244 Unknown_Attribute 0x0000 098 098 010 Old_age Offline - 5570623
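To read a table like this programmatically, here is a minimal Python sketch that parses attribute rows and lists any whose normalized VALUE has dropped to or below a nonzero THRESH, which is the usual SMART failure convention. The sample rows are copied from the output above; the names `parse_attributes` and `failing` are hypothetical, and the regular expression assumes the ten-column layout shown here:

```python
import re

# A few attribute rows copied from the smartctl output above.
SAMPLE = """\
5 Retired_Block_Count 0x0033 100 100 003 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 097 097 000 Old_age Always - 3836
174 Unexpect_Power_Loss_Ct 0x0030 000 000 000 Old_age Offline - 161
187 Reported_Uncorrect 0x0012 100 100 000 Old_age Always - 0
231 SSD_Life_Left 0x0000 097 097 011 Old_age Offline - 4294967297
"""

# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>.+)$"
)


def parse_attributes(text):
    """Map attribute ID -> parsed fields (IDs are unique, names are not)."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m["id"])] = {
                "name": m["name"],
                "value": int(m["value"]),
                "worst": int(m["worst"]),
                "thresh": int(m["thresh"]),
                "raw": m["raw"].strip(),
            }
    return rows


def failing(rows):
    """Return names of attributes with a nonzero threshold and VALUE <= THRESH."""
    return [
        r["name"]
        for r in rows.values()
        if r["thresh"] > 0 and r["value"] <= r["thresh"]
    ]


if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("Failing attributes:", failing(attrs))
    print("Unexpected power losses:", attrs[174]["raw"])
    print("Power cycles:", attrs[12]["raw"])
```

Rows are keyed by ID rather than name because this drive reports some names twice (for example `Program_Fail_Count` at both 171 and 181).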
Welp, didn’t have lockups for a few days with the SSD out of the pool. Had one last night.
Biting the bullet and running memtest now. I don’t think it’s the memory; I think it’s the CPU or the 24.04.2 release… We will see as I test… I will also try rolling back to the 24.04.0 release and see if that helps.
Maybe I’m wrong, but I remember reading about a BIOS setting causing crashes on Ryzen 1x00 CPUs… Something related to C-states that needs to be disabled…
So I found this… a similar issue to what I’m having. I went ahead and set Power Supply Idle Control to Typical Current Idle in the BIOS. We’ll see what I get. Thanks for pointing me in that direction.
Yeah, if you’re using a first-gen Ryzen you have to disable Global C-states, ErP Ready and AMD Cool’n’Quiet. Otherwise the system will hard lock after some time; for my 1600X it was around 3 days.
I never played around with Power Supply Idle Control. Three years ago, when I started my TrueNAS journey, I had to disable the settings mentioned above, or else the system wouldn’t become stable.
I’ll take a look through the settings and see if I can disable those. Once I get things stabilized, I will post back a summary thread of what I did to stabilize things.
One more question for you: in the thread I linked above about the PSU setting, he mentioned that his Vcore didn’t drop below 0.8 V anymore. Was that the case with your config as well? I’m noticing mine is hovering around 0.9 V now.
Honestly, I can’t remember. It’s been close to 3 years since I switched to a 3700X, which doesn’t need the above changes and works fine out of the box.
Cool, I have a 3900XT in my desktop. Maybe I’ll upgrade that one to the 5900X, since I do some gaming on it, and drop the 3900XT into my server. Moar cores!
I marked my reply from a few days ago as the solution. It’s been solid since I changed that setting. Thank you for all the suggestions. I will be looking to upgrade to a 3000/5000 series Ryzen as a fix in the future, but for now this is working great and it’s more power than I need.