Crashing/Freezing

dasunsrule32 · August 18, 2024, 10:12pm

Howdy!

I’ve been trying to diagnose an issue where my server locks up and I have to reset or power cycle it at least once a day.

I’m on 24.04.2. My build is here.

I finally caught a log from dmesg and the console.

dmesg:

[206636.430502] ------------[ cut here ]------------
[206636.430887] NETDEV WATCHDOG: enp36s0 (igb): transmit queue 1 timed out 5484 ms
[206636.431199] WARNING: CPU: 14 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x235/0x240
[206636.431474] Modules linked in: tcp_diag(E) udp_diag(E) inet_diag(E) nf_conntrack_netlink(E) xt_nat(E) xt_tcpudp(E) xt_conntrack(E) nft_chain_nat(E) xt_MASQUERADE(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) nft_compat(E) overlay(E) nvidia_uvm(POE) nf_tables(E) nfnetlink(E) veth(E) br_netfilter(E) nvme_fabrics(E) nvme_core(E) sunrpc(E) binfmt_misc(E) bridge(E) 8021q(E) garp(E) stp(E) mrp(E) llc(E) bonding(E) tls(E) ntb_netdev(E) ntb_transport(E) ntb_split(E) ntb(E) ioatdma(E) essiv(E) authenc(E) crypto_null(E) dm_crypt(E) ib_core(E) ipmi_ssif(E) nvidia_drm(POE) nvidia_modeset(POE) intel_rapl_msr(E) intel_rapl_common(E) amd64_edac(E) edac_mce_amd(E) nvidia(POE) kvm_amd(E) kvm(E) snd_hda_codec_hdmi(E) irqbypass(E) ghash_clmulni_intel(E) snd_hda_intel(E) sha512_ssse3(E) sha256_ssse3(E) snd_intel_dspcfg(E) sha1_ssse3(E) snd_hda_codec(E) snd_hda_core(E) aesni_intel(E) snd_hwdep(E) crypto_simd(E) snd_pcm(E) ast(E) cryptd(E) snd_timer(E) acpi_ipmi(E) snd(E) rapl(E)
[206636.431534]  drm_shmem_helper(E) video(E) wmi_bmof(E) sp5100_tco(E) soundcore(E) drm_kms_helper(E) ccp(E) ipmi_si(E) acpi_cpufreq(E) pcspkr(E) joydev(E) watchdog(E) k10temp(E) ipmi_devintf(E) evdev(E) ipmi_msghandler(E) sg(E) button(E) loop(E) drm(E) efi_pstore(E) configfs(E) dm_mod(E) ip_tables(E) x_tables(E) autofs4(E) zfs(POE) spl(OE) efivarfs(E) raid10(E) raid456(E) async_raid6_recov(E) async_memcpy(E) async_pq(E) async_xor(E) async_tx(E) xor(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) raid1(E) raid0(E) multipath(E) linear(E) md_mod(E) hid_generic(E) sd_mod(E) usbhid(E) t10_pi(E) cdc_ether(E) hid(E) usbnet(E) mii(E) crc64_rocksoft(E) crc64(E) crc_t10dif(E) crct10dif_generic(E) ahci(E) xhci_pci(E) ahciem(E) crct10dif_pclmul(E) libahci(E) crct10dif_common(E) xhci_hcd(E) crc32_pclmul(E) libata(E) igb(E) crc32c_intel(E) i2c_piix4(E) i2c_algo_bit(E) dca(E) usbcore(E) scsi_mod(E) usb_common(E) scsi_common(E) wmi(E) gpio_amdpt(E) gpio_generic(E)
[206636.437696] CPU: 14 PID: 0 Comm: swapper/14 Tainted: P           OE      6.6.32-production+truenas #1
[206636.438161] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X470D4U, BIOS P3.50 11/02/2020
[206636.438628] RIP: 0010:dev_watchdog+0x235/0x240
[206636.439091] Code: ff ff ff 48 89 df c6 05 f6 38 40 01 01 e8 83 2d fa ff 45 89 f8 44 89 f1 48 89 de 48 89 c2 48 c7 c7 b0 e3 11 a1 e8 1b 29 6b ff <0f> 0b e9 2a ff ff ff 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90
[206636.440251] RSP: 0018:ffffab458052ce78 EFLAGS: 00010286
[206636.440734] RAX: 0000000000000000 RBX: ffff88de6b048000 RCX: 0000000000000027
[206636.441237] RDX: ffff88ed1eda13c8 RSI: 0000000000000001 RDI: ffff88ed1eda13c0
[206636.441750] RBP: ffff88de6b048488 R08: 0000000000000000 R09: ffffab458052cd00
[206636.442265] R10: 0000000000000003 R11: ffff88ed5f2f57a8 R12: ffff88de6a502140
[206636.442792] R13: ffff88de6b0483dc R14: 0000000000000001 R15: 000000000000156c
[206636.443316] FS:  0000000000000000(0000) GS:ffff88ed1ed80000(0000) knlGS:0000000000000000
[206636.443844] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
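For anyone wanting to pull the transmit-timeout events out of a longer log, here is a minimal Python sketch. It assumes the warning line has the same format as the one above; the names `WATCHDOG_RE` and `find_tx_timeouts` are made up for illustration and are not part of any tool.

```python
import re

# Regex for the kernel's transmit-timeout warning, modelled on the line format
# seen in the log above (an assumption; other kernels may omit the "N ms" part).
WATCHDOG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NETDEV WATCHDOG: (?P<iface>\S+) "
    r"\((?P<driver>[^)]+)\): transmit queue (?P<queue>\d+) timed out"
    r"(?: (?P<ms>\d+) ms)?"
)


def find_tx_timeouts(log_text):
    """Return one dict per NETDEV WATCHDOG line found in log_text."""
    events = []
    for line in log_text.splitlines():
        m = WATCHDOG_RE.search(line)
        if m is None:
            continue
        events.append({
            "uptime_s": float(m.group("ts")),   # seconds since boot
            "iface": m.group("iface"),          # e.g. enp36s0
            "driver": m.group("driver"),        # e.g. igb
            "queue": int(m.group("queue")),
            "timeout_ms": int(m.group("ms")) if m.group("ms") else None,
        })
    return events


# The watchdog line copied from the dmesg output above.
sample = ("[206636.430887] NETDEV WATCHDOG: enp36s0 (igb): "
          "transmit queue 1 timed out 5484 ms")
for event in find_tx_timeouts(sample):
    print(event)
```

Feeding it the full text of `dmesg` or a saved log shows whether the timeouts always hit the same interface and queue.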

console:

I pulled the sdb disk out of the ZFS pool because it is an SSD (Kingston) cache disk. I’m currently running a full SMART test on it to see if that reveals anything. The short tests were fine…

Any suggestions or ideas would be appreciated. Thanks in advance for the help.

joeschmuck · August 19, 2024, 12:09am

I can’t tell anything from what you posted.
The first thing is to list your hardware. Next, run some tests on your system to see if it is stable: RAM and CPU stress tests, and not just one pass. Run the RAM test for 24 hours or longer and the CPU stress test for several hours minimum.

Post what you find out.

dasunsrule32 · August 19, 2024, 12:24am

Hi @joeschmuck!

Yeah, I was planning on doing that as well.

I vaguely remember an old bug where this happened with a TrueNAS 12.x build that had Kingston SSDs in the pool. That was a dev server I put together at my old job for us to toy around with technologies, etc. That’s a different architecture though, FreeBSD vs Linux…

Anyway, my build is here. I have the following SSDs: the boot drive (KINGSTON_SV300S37A120G) and the drive that was in the data pool as the cache (KINGSTON_SV300S37A240G). I pulled the cache drive temporarily based on the errors.

This was the first time I was able to actually get some logs. I can’t SSH in or access the console as it is unresponsive when this issue happens.

Thank you for the help!

dasunsrule32 · August 21, 2024, 6:31pm

So after some testing, it looks like it was the cache disk that died. It was causing the pool to lock up. Since I’ve removed the disk, the system has been fully functional.

Any recommendations for SSD drives to use with ZFS?

You can see the smartctl output below:

smartctl -i -A /dev/sdb  
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.32-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     SandForce Driven SSDs
Device Model:     KINGSTON SV300S37A240G
Serial Number:    50026B7764002F88
LU WWN Device Id: 5 0026b7 764002f88
Firmware Version: 60AABBF0
User Capacity:    240,057,409,536 bytes [240 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
TRIM Command:     Available
Device is:        In smartctl database 7.3/5528
ATA Version is:   ATA8-ACS, ACS-2 T13/2015-D revision 3
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Aug 21 14:32:32 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x0032   120   120   050    Old_age   Always       -       0/0
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
  9 Power_On_Hours_and_Msec 0x0032   092   092   000    Old_age   Always       -       7596h+29m+37.090s
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3836
171 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
172 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       161
177 Wear_Range_Delta        0x0000   000   000   000    Old_age   Offline      -       1
181 Program_Fail_Count      0x000a   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count        0x0032   100   100   000    Old_age   Always       -       0
187 Reported_Uncorrect      0x0012   100   100   000    Old_age   Always       -       0
189 Airflow_Temperature_Cel 0x0000   039   056   000    Old_age   Offline      -       39 (Min/Max 16/56)
194 Temperature_Celsius     0x0022   039   056   000    Old_age   Always       -       39 (Min/Max 16/56)
195 ECC_Uncorr_Error_Count  0x001c   120   120   000    Old_age   Offline      -       0/0
196 Reallocated_Event_Count 0x0033   100   100   003    Pre-fail  Always       -       0
201 Unc_Soft_Read_Err_Rate  0x001c   120   120   000    Old_age   Offline      -       0/0
204 Soft_ECC_Correct_Rate   0x001c   120   120   000    Old_age   Offline      -       0/0
230 Life_Curve_Status       0x0013   100   100   000    Pre-fail  Always       -       100
231 SSD_Life_Left           0x0000   097   097   011    Old_age   Offline      -       4294967297
233 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       7969
234 SandForce_Internal      0x0032   000   000   000    Old_age   Always       -       9067
241 Lifetime_Writes_GiB     0x0032   000   000   000    Old_age   Always       -       9067
242 Lifetime_Reads_GiB      0x0032   000   000   000    Old_age   Always       -       5590
244 Unknown_Attribute       0x0000   098   098   010    Old_age   Offline      -       5570623
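If you want to scan a table like this without reading every row, here is a rough Python sketch. It assumes the 10-column `smartctl -A` layout shown above; `WATCHED`, `parse_smart_attributes` and `flag_attributes` are hypothetical names, and the watch list is my own guess at which counters matter, not something smartmontools defines.

```python
# Counters whose non-zero raw value is usually worth a second look.
# This watch list is an illustrative assumption, not a smartmontools feature.
WATCHED = {
    "Retired_Block_Count", "Reallocated_Event_Count", "Reported_Uncorrect",
    "Program_Fail_Count", "Erase_Fail_Count", "Unexpect_Power_Loss_Ct",
}


def parse_smart_attributes(text):
    """Parse `smartctl -A` attribute rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 9)  # the raw value may itself contain spaces
        # Attribute rows have 10 columns with numeric ID/VALUE/WORST/THRESH.
        if len(parts) != 10:
            continue
        if not all(parts[i].isdigit() for i in (0, 3, 4, 5)):
            continue
        rows.append({
            "id": int(parts[0]),
            "name": parts[1],
            "value": int(parts[3]),
            "worst": int(parts[4]),
            "thresh": int(parts[5]),
            "raw": parts[9].strip(),
        })
    return rows


def flag_attributes(rows):
    """Return (name, raw) for rows at/below threshold or with watched counts."""
    flagged = []
    for row in rows:
        first = row["raw"].split()[0]
        raw_count = int(first) if first.isdigit() else 0
        # A threshold of 0 means the attribute can never trip, so skip those.
        below_thresh = row["thresh"] > 0 and row["value"] <= row["thresh"]
        if below_thresh or (row["name"] in WATCHED and raw_count > 0):
            flagged.append((row["name"], row["raw"]))
    return flagged


# A few rows copied from the output above.
SAMPLE = """\
  5 Retired_Block_Count     0x0033   100   100   003    Pre-fail  Always       -       0
174 Unexpect_Power_Loss_Ct  0x0030   000   000   000    Old_age   Offline      -       161
194 Temperature_Celsius     0x0022   039   056   000    Old_age   Always       -       39 (Min/Max 16/56)
231 SSD_Life_Left           0x0000   097   097   011    Old_age   Offline      -       4294967297
"""
for name, raw in flag_attributes(parse_smart_attributes(SAMPLE)):
    print(name, raw)
```

On the rows above, the only thing this flags is the unexpected power loss counter, which would fit with a box that keeps getting hard reset rather than with worn-out flash.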

dasunsrule32 · August 22, 2024, 2:57pm

Welp, didn’t have lockups for a few days with the SSD out of the pool. Had one last night.

Biting the bullet and running memtest now. I don’t think it’s the memory; I think it’s the CPU or the 24.04.2 release… We will see as I test… I will try rolling back to the 24.04.0 release and see if that helps as well.

dasunsrule32 · August 22, 2024, 7:43pm

Memtest+ passed for 6+ hours. Running some CPU tests now. Been running the CPU tests for over 2 hours now at 90C+, still no issues… sighs

There is the matter of an Nvidia GPU as well… testing that too…

oxyde · August 22, 2024, 8:10pm

Maybe I’m wrong… but I remember reading about a BIOS setting causing crashes on Ryzen 1x00 CPUs… Something related to C-states that need to be disabled…

dasunsrule32 · August 22, 2024, 8:11pm

Let me check… I don’t recall that.

dasunsrule32 · August 22, 2024, 8:27pm

So I found this… a similar issue to what I’m having. I went ahead and set the Power Supply Idle Control to Typical Current Idle in the BIOS. We’ll see what I get. Thanks for pointing me in that direction.

LarsR · August 22, 2024, 8:29pm

Yeah, if you’re using a first-gen Ryzen you have to disable Global C-states, ErP Ready and AMD Cool’n’Quiet. Otherwise the system will hard lock after some time; for my 1600X it was around 3 days.
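To see which idle states the running Linux kernel actually exposes after changing BIOS settings, something like this Python sketch can help. It reads the standard cpuidle sysfs files (`name` and `disable` under `/sys/devices/system/cpu/cpuN/cpuidle/stateM`). The function name `list_cpuidle_states` and its `sysfs_root` parameter are made up for illustration, and whether a firmware-level change shows up here at all is an assumption that depends on the board.

```python
from pathlib import Path


def list_cpuidle_states(cpu=0, sysfs_root="/sys"):
    """Return (state_dir, name, disabled) for each idle state of one CPU.

    Returns an empty list if the kernel exposes no cpuidle directory.
    """
    base = Path(sysfs_root) / "devices/system/cpu" / f"cpu{cpu}" / "cpuidle"
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so state10 comes after state9.
    for state_dir in sorted(base.glob("state[0-9]*"),
                            key=lambda p: int(p.name[len("state"):])):
        name = (state_dir / "name").read_text().strip()
        # "disable" holds 0 when the state is usable, non-zero when disabled.
        disabled = (state_dir / "disable").read_text().strip() != "0"
        states.append((state_dir.name, name, disabled))
    return states


for state, name, disabled in list_cpuidle_states():
    print(state, name, "disabled" if disabled else "enabled")
```

Comparing the list before and after the BIOS change is a quick sanity check, though it says nothing about Power Supply Idle Control, which is not a kernel-visible idle state as far as I know.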

dasunsrule32 · August 22, 2024, 8:29pm

Will what I set above suffice?

LarsR · August 22, 2024, 8:31pm

Never played around with Power Supply Idle Control. 3 years ago, when I started my TrueNAS journey, I had to disable the above-mentioned settings, else the system wouldn’t become stable.

dasunsrule32 · August 22, 2024, 8:32pm

I’ll take a look through the settings and see if I can disable those. Once I get things stabilized, I will post back a summary thread of what I did to stabilize things.

Thank you!

dasunsrule32 · August 22, 2024, 8:49pm

One more question for you: in the thread I linked above about the PSU setting, he mentioned that his Vcore didn’t drop below 0.8 V anymore. Was that the case with your config as well? I’m noticing mine is hovering around 0.9 V now.

LarsR · August 22, 2024, 9:03pm

Honestly I can’t remember. It’s been close to 3 years since I switched to a 3700X, which doesn’t need the above changes and works fine out of the box.

dasunsrule32 · August 22, 2024, 9:04pm

Cool, I have a 3900xt in my desktop, maybe I’ll upgrade that one to the 5900x since I do some gaming on it and drop the 3900xt into my server. Moar cores!

dasunsrule32 · August 24, 2024, 9:28pm

So far, so good… I’ll post the solution as soon as I verify stability.

Uptime: 2 days 59 minutes as of 17:27

dasunsrule32 · August 25, 2024, 10:45pm

I marked my reply from a few days ago as the solution. It’s been solid since I changed that setting. Thank you for all the suggestions. I will be looking to upgrade to a 3000/5000 series Ryzen as a fix in the future, but for now, this is working great and it’s more power than I need.