Stability Issues with TrueNAS Core server

We have a TrueNAS-13.0-U6.1 Core server. We’ve had the hardware for a number of years; it was originally installed by the vendor, and we’ve not used it much.

But as it sits there with 130TB, we want to use it to provision NFS datastores to VMware vSphere servers. We have used this server on and off over the years for NFS, but it’s never been reliable: every now and then NFS just stops responding to ALL the ESXi servers (there are many), so it’s not just one. All of these servers are also connected to Synology NAS units, which do not show this issue.

So we recently upgraded all the networking on the server to 10GbE over fibre with jumbo frames, as we suspected the 1GbE networking could be the cause.
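
For reference, the way to confirm jumbo frames actually work end to end is roughly this (“ix0” and the address placeholders are just examples; 8972 bytes = 9000 MTU minus the 20-byte IP and 8-byte ICMP headers):

```sh
# On the TrueNAS side: confirm the interface MTU ("ix0" is an example interface name)
ifconfig ix0 | grep mtu

# From an ESXi host: jumbo-sized ping with fragmentation disabled
vmkping -d -s 8972 <truenas-ip>

# From the TrueNAS (FreeBSD) shell back towards an ESXi host; -D sets Don't Fragment
ping -D -s 8972 <esxi-ip>
```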

And we see the same thing!

But this time, the GUI stopped working, NFS stopped working, and all we could do was power off and on again.

We don’t believe it’s the hardware, which is an Intel(R) Xeon(R) Bronze 3204 CPU @ 1.90GHz, 128GB RAM, 130TB of storage, 16 x HGST Ultrastar 12TB disks in RAIDZ2, and 2 SSDs for the OS.

The motherboard is an Intel S2600STB, J17012-600.

There are no issues with the disks; all were checked using SMART data. Although, having spent some time looking at the NAS, I’m surprised there is no L2ARC cache device and/or SLOG for the ZIL.
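
For anyone wanting to repeat the check, something along these lines works from the shell (“da0” is just an example device name; the loop covers whatever disks the kernel reports):

```sh
# List the disks the kernel currently sees
camcontrol devlist

# Full SMART report for a single disk
smartctl -a /dev/da0

# Quick health-only pass across every disk the kernel reports
for d in $(sysctl -n kern.disks); do echo "== $d =="; smartctl -H /dev/$d; done
```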

Any ideas on what to tweak or check?

Given the specification, I would expect the performance to be far better than what we are experiencing, especially compared to a Synology NAS.

Not a good plan, most of the time.

Are you trying to do VMs, or any block device-type workload really, with a single 16-wide RAIDZ2 vdev? Because that would be a recipe for glacial performance.
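
To put rough numbers on “glacial” (the ~200 random IOPS per 7,200 rpm disk is a rule-of-thumb assumption): a RAIDZ vdev delivers roughly the random IOPS of a single member disk, so vdev count, not disk count, is what matters for this kind of workload.

```sh
# Assumed ~200 random IOPS per 7200 rpm HDD; a RAIDZ vdev ~ one disk's worth of random IOPS
echo "1 x 16-wide RAIDZ2 vdev: $((1 * 200)) random IOPS"
echo "8 x 2-way mirror vdevs : $((8 * 200)) random IOPS (same 16 disks)"
```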

Just stopped working again: no response from the GUI or the NFS datastores, under a very light load!

It’s frustrating: it responds to pings, and the last lockup happened while just trying to enable SSH!

Eventually it responded!

Could you support your post with evidence of why not to use jumbo frames? (It was set up before without them and now with them, and it seems to make no difference.)

The original vendor-supplied NAS was delivered configured as a RAIDZ2 vdev.

It is for a Content Library in vSphere, which holds templates and ISOs; it would be useful if it could also host VMs.

We are using NFS, so no block storage.

In this recent event, just enabling SSH caused a temporary pause in operation. If I look in /var/log/messages, something crashed, with loads of failures in a client.py module, and the server went offline for 2 hours!
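
For anyone digging into the same thing, the failures can be pulled out of the logs with something like this (the grep patterns are guesses at what the tracebacks contain; middlewared.log is the middleware’s own log on Core):

```sh
# Recent client.py failures, Python tracebacks and core dumps in the system log
grep -iE "client\.py|traceback|core dumped" /var/log/messages | tail -n 50

# The TrueNAS middleware also keeps its own log
tail -n 100 /var/log/middlewared.log
```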

Gone again! I cannot even display System Processes from the console; it just sits there at top!

Tanked again. The server is up and responds to pings, but there is no access via SSH (it just sits waiting at the login prompt) and no access via SCP, the NFS datastores will likely time out, and the console shows no networking, with CPU usage as normal.

It’s a mystery, but this is not stable enough for production.

How do you know it makes no difference if you haven’t tested? Jumbo frames add complexity and complexity is not your friend when it comes to diagnosing an issue. The closer your system is to plain vanilla, the easier it is to pinpoint a cause.

I’d also consider your choice of words. The folk here are not your paid IT consultants. It’s on you to do your research if a recommendation rubs you the wrong way, not @ericloewe to have to justify himself. See the @jgreco 10GbE primer over in the old forum as a starting point.

That aside, I wonder if you have a spare SFP+ card around to plug into one of the PCIe slots to see if it’s the Intel network chipset that has gone rogue.

The system has been in use for years on 1GbE with no jumbo frames (both interfaces tested).

That is why a working 10GbE interface was used, configured with jumbo frames.

Both interfaces have been tested, and both actually show the same issue.

In the recent case, just using SSH to connect failed and caused the server to pause for 2 hours doing nothing.

These are not performance-related issues, just stability issues with the OS on this platform.

Weird crashes and core dumps, and lots of client.py Python crashes in /var/log/messages. This is a stock OS; nothing has been added or taken away.

We’ve got lots of spare kit we could try, unless the OS simply does not like the Intel server.

The server is back up and running after another pause of 2 hours.

Could memory have gone bad and be causing random issues? I wonder, since I had some RAM sticks go kaploink after a year of use.

This is all ECC memory, so if a memory stick went bad, it would report the errors and continue operating.
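
One way to confirm that rather than assume it: the board’s BMC should log corrected and uncorrected memory events, and ipmitool is available from the shell, so something along these lines would show whether ECC errors are actually being recorded:

```sh
# Dump the BMC's system event log and look for memory-related entries
ipmitool sel elist | grep -iE "memory|ecc|correctable"

# Memory-related sensors as the BMC sees them
ipmitool sdr elist | grep -i mem
```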

This NAS server cost £30,000, so it was designed and purchased for this very purpose of high storage capacity (not my choice or budget). Given the choice, we would have purchased a Synology NAS.

It’s just gone and tanked again, while simply removing some files from the volume via SSH, so there was no inbound data transfer from any ESXi servers.

I’m wondering if the compression and dedup that were enabled on the volume are the cause.
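
Both settings, and how big the dedup table has actually grown, can be checked from the shell before destroying anything (“tank” is a placeholder pool name, substitute the real one):

```sh
# What is actually enabled, per dataset
zfs get -r compression,dedup tank

# Dedup table (DDT) histogram: entry counts and in-core/on-disk size
zpool status -D tank
```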

Interesting: the server appeared to stop responding again, so I

stopped iSCSI
stopped AFP
stopped SMB
stopped Syslog
stopped SNMP

and it jumped back into life! We are not using iSCSI, AFP or SMB, so those don’t really matter to us.

Wow, you overpaid. Massively. And they sold you RAIDZ2 for block storage? If so, that’s downright criminal. And if they did a single RAIDZ2 vdev at 16-wide, they need to be named and shamed. What’s the output of zpool status?
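
Specifically, something like this would show the vdev layout, capacity and fragmentation (“tank” being whatever the pool is actually called):

```sh
zpool status -v tank   # vdev layout, per-disk state and error counters
zpool list -v tank     # capacity, fragmentation and health per vdev
```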

Yikes, huge glaring red flag. The Red Army would be embarrassed to march under a flag as red as that one.

This was not current pricing, and it was not my budget anyway.

You keep mentioning block storage; we are just using NFS.

I’m wondering if dedup and compression could be the issue, but the pool would need to be destroyed and rebuilt. I did post that earlier, so what’s the issue with dedup, if it’s a red flag?

But did you see the mention above that stopping:

stopped iSCSI
stopped AFP
stopped SMB
stopped Syslog
stopped SNMP

returned the server to normal?

zpool status shows all disks online, no errors.

And why does syslog-ng keep crashing and dumping core?!
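
A few quick checks might narrow that down, given that syslog-ng writes into the system dataset (the service name and paths here are the stock Core ones, so treat this as a sketch):

```sh
# Is syslog-ng currently running?
service syslog-ng status

# Evidence of crashes or restarts in the main log
grep -i "syslog-ng" /var/log/messages | tail -n 30

# Where the system dataset (and therefore the syslog data) lives
zfs list -o name,used,mountpoint | grep -i "\.system"
```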

Good catch. Dedup has its purposes, but with a Xeon Bronze, a wide RAIDZ2 vdev, and lots of spinning HDDs, you might have some trouble on your hands. Who knows, it is out of my wheelhouse.

But wouldn’t that show up in CPU usage? It never shows any high CPU.

And we can find the server non-responsive with no traffic hitting the disks.
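
Next time it stalls, it may be more telling to watch the disks than the CPU, since dedup-induced stalls show up as disk latency rather than CPU load (pool name again assumed to be “tank”):

```sh
# Per-vdev throughput and latency, refreshed every 5 seconds
zpool iostat -v tank 5

# Per-disk busy percentage and queue depth (FreeBSD)
gstat -p
```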

VMs are a block storage workload. NFS won’t magically change that.

Very long story. The short version is that it sucks, it destroys performance and is detrimental to 95% of use cases where one might naïvely think of employing it. Turning on dedup without careful analysis is, to put it mildly, not a sign of a competent vendor.
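
To put a rough number on why (the 128 KiB average record size and ~320 bytes of RAM per dedup-table entry are rule-of-thumb assumptions, not measurements): a mostly-unique pool of this size needs a dedup table far bigger than 128 GB of RAM can hold, so DDT lookups end up hitting the spinning disks.

```sh
# Back-of-the-envelope DDT sizing under the assumptions above
BYTES=130000000000000                     # ~130 TB of mostly-unique data (assumed)
BLOCKS=$((BYTES / 131072))                # assumed 128 KiB average record size
RAM_GIB=$((BLOCKS * 320 / 1073741824))    # assumed ~320 bytes of RAM per DDT entry
echo "~$BLOCKS DDT entries -> ~$RAM_GIB GiB of RAM just for the dedup table"
```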

Not really. If dedup is indeed the problem, and it’s certainly a problem, your performance would be destroyed by disk latency before you ran into serious CPU bottlenecks (though those are a thing with dedup), particularly when running a single vdev.

Well, there are a few options:

  1. Completely borked hardware - let’s assume not for now
  2. Completely borked install - won’t hurt to scrub the boot pool, but this doesn’t scream likely
  3. System dataset is on the same pool that’s overburdened by dedup and, pending confirmation, a somewhat-tragically wide RAIDZ2 vdev. As services try to write to or read from the system dataset, everything crawls to a halt, services hang, some might crash outright… It sounds like a cop-out, but it’s looking pretty likely. (Quick checks for items 2 and 3 are sketched below.)
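
For items 2 and 3, the quick checks would look something like this (the boot pool is usually named boot-pool, or freenas-boot on systems upgraded from FreeNAS; the system dataset can also be moved from System > System Dataset in the GUI):

```sh
# Item 2: scrub the boot pool and check the result
zpool list
zpool scrub boot-pool && zpool status boot-pool

# Item 3: see which pool the .system dataset lives on
zfs list -o name,used,mountpoint | grep -i "\.system"
```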

It confirms my suspicions from the start.

Time to trash and rebuild, and decide what to install.

HGST Ultrastar 12TB disks… Can you provide the exact model of disks please?

Everything else, as pointed out by @ericloewe, smells of a bad combo for performance.

Seems you’re not actually hosting VMs on the NFS.
