Help me find the cause of these bursts of 'iowait' in SCALE

I have two VMs (Windows, and Debian running Docker) and two lightweight TrueCharts apps. I notice that sometimes the Debian VM will lock up for a minute, but from within the VM I can't see any issues. On the TrueNAS side, however, the reporting shows a burst of high iowait. I'm trying to determine the cause.

Just for the record, I don't really have issues with the Windows VM. I actually moved one Dockerized app off the Debian VM because it would hang randomly and cause this iowait issue, but it runs flawlessly on the Windows VM.


I've been experiencing the same thing since upgrading to Dragonfish-24.04.0.

The UI of TrueNAS is almost unusable during these iowait bursts.

While it's not impacting performance (yet) on my system, I'm also seeing very regular iowait spikes on my Cobia (23.10.2) system. It's an all-SAS-SSD system, so the drive speed is likely what's stopping the spikes from affecting performance, but the behaviour is similar.

I spent some time watching htop while it was happening, and it seemed to be middlewared causing the waits.

The real question is: how can we properly deduce the cause? What tools, cron jobs, and logs do we need in order to pin down the cause, or at least get an idea of it?
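Before reaching for heavier tools, a cheap first check is to confirm the burst from first principles. This is a minimal sketch, assuming a Linux host with /proc mounted (true on SCALE): field 6 of the `cpu` line in /proc/stat is the kernel's cumulative iowait time in USER_HZ ticks, so sampling it twice shows how much wait actually accrued.

```shell
#!/bin/sh
# Sample the cumulative iowait counter twice and print the delta.
# A large delta during a "hang" confirms the spike is real I/O wait,
# not just a quirk of the reporting graphs.
iowait_ticks() { awk '/^cpu /{print $6}' /proc/stat; }

before=$(iowait_ticks)
sleep 2
after=$(iowait_ticks)
echo "iowait ticks over 2s: $((after - before))"
```

To then attribute the wait to a process or device while the burst is happening, the usual next steps are `iostat -x 2` (per-device utilization and await), `pidstat -d 2` (per-process read/write activity, needs the sysstat package), and `zpool iostat -v 2` (per-vdev view on the TrueNAS side).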

I've heard the iowait metric is borderline misleading and not entirely accurate. However, it is undeniable that iowait is always high when the system/apps/VMs/containers experience hangs or delays.

I have two mirrored SSD vdevs, one for boot and one for VMs. The VMs/containers write via SMB/NFS to my mirrored storage vdevs. Speed should not be an issue here.

crossposting (assuming you are on Dragonfish)


Thanks. I am still on Cobia 23.10.2.

Iowait seems to be directly tied to disk backlog.

I decided to TRIM the SSDs manually; I had always kept auto-trim disabled, as per the defaults. At first glance this appears to have alleviated the disk backlog, but it will take more time to see whether the improvement persists.
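For anyone wanting to try the same, a one-off TRIM can be kicked off per pool from the shell. The pool names below are placeholders for your own boot and VM pools, and the commands need root and a real pool, so this is just a command sketch:

```shell
# Start a manual TRIM on each SSD pool (hypothetical pool names):
zpool trim boot-pool
zpool trim vm-pool

# Check TRIM progress per device:
zpool status -t vm-pool
```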

I will report back with my findings in the future.

Yeah, so iowait is a CPU state where the CPU is waiting for an I/O transaction (either a read or a write) to complete and can do no other work. It's typically caused by disk I/O, though NFS can also contribute; that is (I believe) the only network traffic that causes iowait.

I typically see very high iowait (and consequently high load) when TRIMming my disks, but not high disk I/O.

Scrubbing, on the other hand, produces high I/O but not particularly high iowait.

As per common advice: disable auto-trim and schedule weekly TRIM jobs outside of peak usage hours, plus monthly scrub jobs, also outside of peak usage hours, but not at the same time as the TRIM jobs! :slight_smile:
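As a concrete sketch of that schedule, here are raw crontab entries with hypothetical pool names (on SCALE you'd normally set these up as Data Protection tasks in the UI instead; adjust the times to your own quiet hours):

```shell
# Weekly TRIM, Sundays at 03:00 (off-peak)
0 3 * * 0  /sbin/zpool trim vm-pool

# Monthly scrub, 1st of the month at 03:00
0 3 1 * *  /sbin/zpool scrub tank
```

Note the two can still collide when the 1st falls on a Sunday, so shift one of them if you want a hard guarantee they never overlap.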

It's been 48 hours now and I haven't seen much disk backlog anymore. The SSDs in my boot and VM pools needed a TRIM. I've now enabled a weekly cron TRIM job for both.