Hey there,
I had to cease work on the project for quite a while, but in about two weeks I will finally be on-site again to resume troubleshooting and hopefully finish the project, either by solving it or by giving up on virtualization once and for all…
So, in the meantime I was able to do a few tests:
I had someone check the temperature on the HBA with a laser thermometer, and the highest temperature measured was about 60 °C. Besides this, the highest temperature reported by the lsiutil program was about 68 °C, which in my opinion might explain throttling but never the data stream stopping completely through a shutdown.
Even before reaching steady-state temperature, the system failed while transferring a few big DVD rips.
Shortly after that, the transfer dropped to 0 while still reporting 5 MB/s and doing nothing. See what TrueNAS reported:
Weirdly enough, besides spot 4 in the center of the circuit board, spot 1 and the edge of the card were the hottest. Maybe a damaged component?
I also ran a memtest and it passed.
So, TrueNAS on bare metal… Well, I am running Raptor Lake. It worked well with VMs (just using the mainboard HBA with SSDs), but on bare metal I couldn’t install it, and as far as I am aware I must update to a TN Scale beta to even be able to install it on Gen 12/13/14 Intel. There would still be no guarantee that it works properly.
I’d be grateful for any suggestions and ideas! For now, I think the best way is to get a new HBA and put all the drives on one HBA, then put everything into an old i7 Gen 4 machine and troubleshoot it! I will use regular IP connections over a switch instead of VMs and watch the behavior.
That would be 80 bucks down the drain for the HBA. But at this point the troubleshooting alone has cost multiple times that in work…
I can check that next week; I am leaving the machine disconnected until I get my hands on it.
May I ask what you are looking for? Because TN Core won’t even boot on it bare metal.
Sadly, I have experienced this on my personal rig. I’ve had to manually cap frequencies to 5.5 GHz max across all cores - my motherboard’s default behaviour of boosting to 5.8 GHz would not be stable, even delidded & with water cooling…
Manually setting P1 & P2 to ~250 & 280 respectively & capping the CPU core current limit to ~300 A also helped. I think I also had to manually set the cache ratio & voltages; I can’t remember what I set them to off the top of my head. Each of these steps slowly introduced stability, and now I can’t even remember the last time I had an issue.
In short, the 13900K/14900K are a PITA to get working if you’re not lucky, even on W680 boards, let alone Z690/Z790. Oddly, I haven’t had the same nonsense with the 12900K.
Apparently lowering the RAM’s speeds helps greatly as well… basically, Intel messed up real big trying to match the pace of AMD. Lunar Lake should be way better though.
I don’t want to disable XMP or go under 6400 MHz, since that’d be a bigger performance hit than the maybe ~1-3% I’ve lost getting it stable so far. Depends on the benchmark. Though if problems come back, I guess I’d have to.
I’ve sadly learned more about overclocking trying to get this thing stable than I have in the last 10 years of trying to eke out additional performance. I miss the simple days of bumping ratio + voltage as much as possible without melting the CPU (not melting the CPU optional).
Intel 100% pushed the 13900K (and the rebranded ‘14900K’) way too far, and the Asus defaults were disgustingly unhelpful. I remember struggling to cool this POS while delidded - something I’ve never experienced before.
For the most immediate results I would recommend the following. Turn down the boost ratio and set it to the highest all-core ratio you can hit stably (instead of all-core + 2 boost); this helped with consistent crashing. Tune down the CPU cache ratio; this specifically helped with intermittent crashing. Just lower it by 100 MHz every time you crash randomly while doing decompression or pre-caching shaders, as those specifically triggered issues for me. Lower the CPU core current limit to something that JUST BARELY lets you hit your stable max all-core ratio (~300 A for me). Then mess with P1/P2 and set manual voltage curves to keep the CPU from melting, as it starts doing some odd things with voltages when you lower clocks for… reasons?..
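The "lower it by 100 MHz every time you crash" routine above is a simple step-down search. A minimal sketch of that loop follows; the function name `find_stable_ratio` and the `is_stable` callback are hypothetical stand-ins for "set the value in the BIOS, then run your decompression or shader workload and see whether it survives", which cannot be automated from a script like this.

```python
from typing import Callable, Optional


def find_stable_ratio(
    start_mhz: int,
    floor_mhz: int,
    is_stable: Callable[[int], bool],
    step_mhz: int = 100,
) -> Optional[int]:
    """Step a clock value down until the stability check passes.

    is_stable is a hypothetical callback: in practice it means
    "apply this value in the BIOS, run the workload that used to
    crash, and report whether it survived".
    """
    ratio = start_mhz
    while ratio >= floor_mhz:
        if is_stable(ratio):
            return ratio  # highest value tried that held up
        ratio -= step_mhz  # crashed: drop one step and retry
    return None  # nothing at or above the floor was stable


if __name__ == "__main__":
    # Pretend (assumption for the demo) the chip is stable at 4700 MHz or less.
    print(find_stable_ratio(5000, 4000, lambda mhz: mhz <= 4700))
```

The same loop works for the all-core ratio or the current limit; only the starting value, the step and the workload behind `is_stable` change.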
So far I haven’t changed anything. What exact BIOS setting are you looking for?
It is so funny that you guys immediately went with the current oxidation story. I had been thinking about that for a long time, but I was afraid to ask here because of the coincidence. But right now, as you say … it seems serious …
… BUT my i9 (13900KS) is not in the official list (by GamersNexus) of the problematic processors! At least that is the case for the oxidation story, or did I misunderstand something?
Nevertheless, I will check for that bug by lowering the power/clock settings. Can you let me know exactly which settings need to be changed? I will research it, but maybe you guys know immediately which setting fixes it for now.
I also separated the systems: I am now running TN bare metal on a 4th gen Intel, while the Unraid system stays on the i9.
Thanks again for the support, guys! I went with the consumer i9 series because of the single-core performance; I wouldn’t have thought that this would lead to months of troubleshooting pain.
Which CPU? Also an i9 13900KS? Should I also go with lower clocks together with a lower power setting?
Did you turn off Intel’s intelligent 6 GHz boost (I forgot the name)?
Will try that! What exact BIOS settings are P1 & P2? Is that long-term/short-term boost?
Damn it! I didn’t buy a high-end CPU to fix it myself …
Naw, the 13900K (got it on the cheap because someone was excited to sidegrade to a KS).
It depends on your motherboard manufacturer, but yes, it is long-term/short-term boost. For me it was somewhere in the CPU Power Management settings… I’d send you a picture, but I’m too lazy to get a FAT32 USB stick & start screenshotting it. I would also, again, recommend looking at the cache ratio & slightly dropping it.
Obvious disclaimer: this will impact performance. Lowering power targets doesn’t work any other way. You’re just trying to find the exact line where you hit stability without sacrificing a step more.
Sorry for turning the post into an overclocking thread. For the next one I’ll see if we can turn the subject to hardline watercooling!
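If the machine boots a Linux-based system, the long-term/short-term limits the OS currently sees can be read back without rebooting into the BIOS. The sketch below assumes the kernel's powercap (intel-rapl) sysfs interface is available at `/sys/class/powercap/intel-rapl:0`; whether it is exposed, readable without root, and in sync with what the BIOS enforces depends on the system, so treat the output as a cross-check only. The helper name `read_power_limits` is made up for this example.

```python
from pathlib import Path


def read_power_limits(zone: Path) -> dict:
    """Return {constraint name: limit in watts} for one powercap zone.

    Expects the usual powercap layout: constraint_N_name holds a
    label such as "long_term" or "short_term", and
    constraint_N_power_limit_uw holds the limit in microwatts.
    """
    limits = {}
    for name_file in sorted(zone.glob("constraint_*_name")):
        idx = name_file.name.split("_")[1]
        limit_file = zone / f"constraint_{idx}_power_limit_uw"
        if not limit_file.exists():
            continue
        label = name_file.read_text().strip()
        limits[label] = int(limit_file.read_text().strip()) / 1_000_000
    return limits


if __name__ == "__main__":
    # Assumed location of the CPU package zone; it may differ or be absent.
    zone = Path("/sys/class/powercap/intel-rapl:0")
    if zone.is_dir():
        print(read_power_limits(zone))
    else:
        print("No intel-rapl powercap zone found on this system.")
```

On typical Intel systems `long_term` corresponds to the long-term limit and `short_term` to the short-term one, so the two numbers should roughly match whatever was entered in the BIOS.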
Edit: You could also wait for Intel’s microcode fix coming out next month & see if that saves you a few days of slightly tuning values & then running validations.
With lower power settings you’ll automagically be throttled due to the lack of available power (should make sense: less power, fewer Hz). But yes, I’ve always preferred all cores running at the same speed over 1-2 cores magically hitting 6 GHz for 5 seconds & melting. Your single-core benchmark performance will go down, but realistically only in a single-run benchmark; any extended benchmark will average out the nonsense boosts that Intel tried to cook into our chips.
I basically had to enforce Intel stock settings, disable any sort of intelligence, and spend a few days manually setting clocks, power targets and voltage curves. I disabled any kind of 1-core/2-core boost targets, as I found them to be a nonsense gimmick with at best maybe a 1-3% positive single-core impact in my daily use; I don’t have much need for something to use a single core at 5.8 GHz for 30 s while the CPU melts. Once again, I learned more about my BIOS and overclocking by underclocking, undervolting, and fixing this CPU than I ever had trying to get more performance out of any other device I’ve ever owned.