Dealing with disk resets

I have a pool that consists of a few refurbished SAS HDDs. It's a homelab setup, so most of the time there is not a lot of activity and the NAS works fine. However, in one particular scenario, during a Proxmox backup job of a bunch of VMs and containers, the NAS starts to struggle. Some disks produce the following errors:

Aug 06 01:03:51 truenas kernel: sd 0:0:1:0: attempting task abort!scmd(0x00000000a7464490), outstanding for 31816 ms & timeout 30000 ms
Aug 06 01:03:51 truenas kernel: sd 0:0:1:0: [sdd] tag#1352 CDB: Write(16) 8a 00 00 00 00 00 40 e3 f1 18 00 00 08 00 00 00
Aug 06 01:03:51 truenas kernel: scsi target0:0:1: handle(0x0009), sas_address(0x5000c500ca860875), phy(0)
Aug 06 01:03:51 truenas kernel: scsi target0:0:1: enclosure logical id(0x56c92bf000281605), slot(3) 
Aug 06 01:03:51 truenas kernel: scsi target0:0:1: enclosure level(0x0000), connector name(     )
Aug 06 01:03:51 truenas kernel: sd 0:0:1:0: task abort: SUCCESS scmd(0x00000000a7464490)

Some disks have dozens of these errors. Eventually NFS hangs and the system needs to be rebooted.
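
For what it's worth, this is roughly how I count the abort events per disk from the kernel log. A rough sketch on my side: it assumes the messages follow the "attempting task abort" pattern above, and lsscsi (if installed) is only used to map the SCSI addresses back to /dev/sdX names.

journalctl -k | grep "attempting task abort" \
    | grep -oE "sd [0-9]+:[0-9]+:[0-9]+:[0-9]+" | sort | uniq -c | sort -rn
lsscsi    # map 0:0:1:0 etc. back to /dev/sdX device names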

I replaced two of the worst offenders and am now trying to check whether the other spare drives are at higher risk of struggling. However, I'm not able to fully eliminate the risk, and I don't have the budget to replace all the disks with brand-new ones.

Therefore I wonder if there are any strategies to mitigate the risks and I would appreciate some input/advice/recommendations.

So far I thought of the following:

  • Perhaps adding 2 or 3 small SSDs as cache VDEVs would help absorb the traffic spikes during backups? If I copy some big files, things are fine. I suspect that during the backup the IO is less sequential, the disks don't like it, and routing it through a cache might smooth things out.
  • Not sure if there is a way to throttle the IO during backups on the Proxmox or TrueNAS side (see the sketch after this list). The system has 6 HDDs and the network interface is 2.5 Gbit/s, so the overall load on each disk is not that high, peaking at around 40 MB/s, but perhaps lowering it during backups would help.
  • Perhaps adding more VRAM would help? Currently the system is equipped with 32 GB and most of it is used for cache or is free. Perhaps during the backup the system is able to buffer some IO in RAM while waiting for the disks to time out after 30 seconds and reset, but eventually runs out of memory and then NFS crashes?
  • Finally, I wonder if there is a way to reduce the timeout for the disks and make them reset after, for example, 10 seconds instead of 30 (also sketched below). Perhaps that would help as well.
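
To make the last two points concrete, this is the kind of thing I have in mind. Both are rough sketches with assumed values: the bandwidth cap, the VM ID and the 10-second timeout are placeholders, not tested recommendations.

# Proxmox side: cap backup bandwidth (value in KiB/s), per job or globally
vzdump 101 --bwlimit 102400            # ~100 MB/s for VM 101 (VM ID is just an example)
# or set it globally in /etc/vzdump.conf:
#   bwlimit: 102400

# TrueNAS side: lower the SCSI command timeout from the default 30 s
cat /sys/block/sdd/device/timeout      # current value in seconds
echo 10 > /sys/block/sdd/device/timeout

The sysfs value does not survive a reboot, so it would have to be reapplied per disk after boot, for example from a post-init script.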

I understand that my thoughts above are highly speculative and that the proper way would be to purchase reliable equipment. However, given the budget constraints, I'm forced to explore some workarounds.

You didn’t post your hardware details and setup. Are these two different machines, Proxmox and TrueNAS? Post the TrueNAS hardware, version, and disk/pool layout info. Recommended networking is Chelsio and Intel, with some warnings.

What disks are you using? Are they NAS grade drives using SMR (shingled magnetic recording)?
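
If you're unsure of the exact models, something like this on the TrueNAS box should list them (output format varies between SAS and SATA drives):

smartctl -i /dev/sdd                  # vendor, product/model, rotation rate
lsblk -d -o NAME,MODEL,SERIAL,SIZE    # quick overview of all disks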

Yes, these are two different machines.

NAS:
CPU: AMD 2200G
RAM: 32 GB
Network: I225-V
Controller:
Controller type : SAS3008
BIOS version : 8.37.02.00
Firmware version : 16.00.10.00
HDDs: 6 x ST16000NM002G
Pool: 1xRAIDZ2 + 1xLog VDEV (2 TB SATA SSD)

Proxmox runs on a mini PC with a 5600H, 64 GB RAM, and 2.5 Gbit/s Ethernet.

It might not be the answer to your specific resets, but the 16.00.12.00 controller firmware does address reset issues.
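
If you decide to try it, checking and flashing is usually done with the Broadcom/LSI sas3flash utility. A rough sketch, assuming sas3flash is available on the box; the image filename is a placeholder and depends on the 16.00.12.00 package you download:

sas3flash -list                        # current firmware/BIOS (should match the 16.00.10.00 you posted)
sas3flash -listall                     # if there is more than one adapter
sas3flash -o -f <firmware_image.bin>   # flash the new IT-mode firmware image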

Thank you for the pointer. The thread states explicitly that it does not affect SAS disks which is what I use. So I’m not sure if I stand to gain anything should I decide to update the FW on the controller.

Worth a try to test?

While it’s not clear which direction the data is flowing, since both TrueNAS and Proxmox can host VMs, I will for the moment presume Proxmox runs the VMs and you are storing backups of said VMs on NFS shares hosted on your TrueNAS. So the resets are happening when your TrueNAS HDDs are being written to. The type of ZFS cache that applies in that scenario is a SLOG. Long story short, a SLOG is not going to reduce the write pressure on your HDDs; it is more a way to use sync writes with higher throughput than without.
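
A quick way to see how much the SLOG is even involved is to check whether the dataset behind the NFS share uses sync writes and to watch the log vdev during a backup. A sketch with placeholder pool/dataset names:

zfs get sync tank/backups     # standard, always or disabled
zpool iostat -v tank 5        # per-vdev bandwidth, including the log device, refreshed every 5 s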

There may be, but I question the wisdom of sidestepping the cause like that. I find it akin to sweeping it under the carpet.

I’m going to presume that you mistyped “VRAM”, since video memory is an unlikely factor here. There’s no denying that ZFS loves RAM. There have been reports of OOM issues observed during NFS transfers in 24.04.# but I am not sure what the current status is. That is a separate issue from your resets though. Other than that, see the previous point. The system should be able to handle a transfer with “just” 32 GB of RAM without a hitch.
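
If you want to rule memory pressure in or out, watching the ARC and free memory during a backup is usually enough. A sketch, assuming arc_summary is available (it ships with the ZFS tools on SCALE):

arc_summary | head -n 40                    # ARC size, target and hit rates
free -h                                     # overall and available memory
journalctl -k | grep -i "out of memory"     # any OOM kills logged by the kernel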

See above. Address the cause instead of the symptoms.

Other than my previous suggestion to update the firmware, resets can be triggered by an overheating HBA. Without knowing what motherboard and case you’re using, it’s difficult to ascertain if you have adequate cooling.

HBAs are typically made with systems with high static-pressure airflow in mind, the kind with fans that make your ears bleed. Without that airflow, say, in an ordinary tower case, you may need a dedicated fan pointing directly at the HBA to keep it from overheating.

That was the correct assumption, backups go from Proxmox to TrueNAS. Thank you for the explanation.

And it was indeed about RAM, not VRAM. I ran into the OOM issue, which was then solved by the update; after that there were no OOM messages in the logs, but there were some traces from nfs processes.

For the time being I have replaced the two disks that had a lot of resets and things seem to be more stable.

Thanks a lot for the comprehensive feedback :pray:
