Force SATA link speed to 3.0 Gbps

i am having disk errors on multiple pools. details of my issues and hardware can be found here:

and here

even after moving drives from my HL15 (1.0) backplane to a SAS expander i am still getting disk errors, so it does not appear to be the motherboard SATA controller, the 9400-8i HBA card, the HBA cables, or the backplane; otherwise i would not have seen errors through the SAS expander as well.

one suggestion is to reduce the SATA link speed

DigitalGarden at 45HomeLab pointed me to

which mentions adding extraargs=libata.force=3.0 to GRUB, and he helpfully linked to this post

which appears to indicate the command

midclt call system.advanced.update '{"kernel_extra_options": "libata.force=1.5Gbps"}' should work, though at least for that individual it did not seem to.

so my question to the community: can someone help confirm exactly how i would use midclt (and/or directly edit the GRUB boot loader) to force my disks to use a link speed of 3.0 Gbps instead of 6.0 Gbps?
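for reference, this is a minimal sketch of how i understand the midclt approach would work on SCALE, assuming the kernel_extra_options field from the linked post behaves as described (a reboot is still needed, and the result can be sanity-checked against /proc/cmdline):

# hedged sketch: ask the TrueNAS middleware to add the kernel option persistently
# (assumes the kernel_extra_options field from the linked post exists on this SCALE version)
midclt call system.advanced.update '{"kernel_extra_options": "libata.force=3.0Gbps"}'

# confirm the middleware stored it (look for kernel_extra_options in the output)
midclt call system.advanced.config

# after rebooting, confirm the option actually made it onto the kernel command line
grep -o 'libata[^ ]*' /proc/cmdline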

ok, so i decided to spin up TrueNAS on an old Dell desktop i had lying around to play with

i was able to temporarily set libata.force=3.0G on this system using the instructions here: Kernel parameters - ArchWiki, by:

  1. rebooting, pressing e at the GRUB screen, appending libata.force=3.0G to the line starting with linux, and pressing ctrl+x to boot the system

confirmed

1.) it did set the SATA link speed to 3.0 Gbps for the single SATA disk (the boot disk is a small NVMe disk) when checking with smartctl -x /dev/sda

2.) confirmed it was temporary by rebooting normally, after which the SATA speed was back to 6.0 Gbps
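when i get to the real system, a quick loop like this (just a rough sketch, nothing TrueNAS-specific) should show the negotiated link speed for every disk at once instead of checking them one at a time with smartctl -x:

#!/bin/bash

# rough sketch: print the negotiated SATA link speed for each sd? disk
# smartctl -x reports a line like "SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)"
for d in /dev/sd? ; do
	echo "$d: $(smartctl -x "$d" | grep 'SATA Version')"
done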

right now i have LONG SMART tests running on my disks, which take around 1950 minutes (about 32 hours) on these 18TB drives, so i am going to let those finish and then try setting the libata.force=3.0G option temporarily on my troubled system and see what happens.
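in case it is useful to anyone following along, this is roughly how i am keeping an eye on the long tests (standard smartctl output; the percentage remaining shows up just below the status line, hence the -A 1):

# check how far along the long self-test is on a given disk
smartctl -a /dev/sda | grep -A 1 "Self-test execution status"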

does anyone know if the libata.force=3.0G option controls disks on the motherboard SATA ports AND disks on an HBA like my LSI 9400 cards, or will i need to configure the LSI 9400 cards separately using storcli or something similar?
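one way i think i can at least tell which disks are handled by libata (and which sit behind the mpt3sas driver for the 9400) is to look at where each sdX device lives in sysfs. this is just a sketch, but the path for a motherboard disk should contain an ataN element, while a disk hanging off the LSI 9400 should show the HBA's PCI device instead:

#!/bin/bash

# rough sketch: show which controller / driver each disk hangs off of
# motherboard AHCI disks show an "ataN" element in the path (libata),
# disks behind the LSI 9400 should show the mpt3sas controller's PCI path instead
for d in /sys/block/sd? ; do
	echo "$(basename "$d") -> $(readlink -f "$d"/device)"
done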

i ask as i have seen two sets of errors:

disks on my motherboard SATA controller give the following errors:

[829544.395755] ata5.00: exception Emask 0x0 SAct 0x1880002 SErr 0x0 action 0x6 frozen
[829544.396278] ata5.00: failed command: READ FPDMA QUEUED
[829544.396760] ata5.00: cmd 60/00:08:f8:48:12/08:00:32:07:00/40 tag 1 ncq dma 1048576 in
[829544.397713] ata5.00: status: { DRDY }
[829544.398181] ata5.00: failed command: READ FPDMA QUEUED
[829544.398648] ata5.00: cmd 60/00:98:a8:34:0c/05:00:96:07:00/40 tag 19 ncq dma 655360 in
[829544.399597] ata5.00: status: { DRDY }
[829544.400086] ata5.00: failed command: READ FPDMA QUEUED
[829544.400566] ata5.00: cmd 60/40:b8:78:48:12/00:00:32:07:00/40 tag 23 ncq dma 32768 in
[829544.401580] ata5.00: status: { DRDY }
[829544.402080] ata5.00: failed command: READ FPDMA QUEUED
[829544.402575] ata5.00: cmd 60/40:c0:b8:48:12/00:00:32:07:00/40 tag 24 ncq dma 32768 in
[829544.403375] ata5.00: status: { DRDY }
[829544.403799] ata5: hard resetting link
[829544.718017] ata5: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[829544.780035] ata5.00: configured for UDMA/133
[829544.780058] ata5: EH complete

HOWEVER disks on my LSI 9400 cards seem to only give these errors:

sd 0:0:2:0: attempting task abort!scmd(0x0000000064279725), outstanding for 30236 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#357 CDB: Read(16) 88 00 00 00 00 01 c1 b7 17 08 00 00 00 30 00 00
sd 0:0:2:0: task abort: SUCCESS scmd(0x0000000064279725)
sd 0:0:2:0: attempting task abort!scmd(0x00000000dc41a872), outstanding for 30460 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#354 CDB: Read(16) 88 00 00 00 00 01 c1 b7 17 e0 00 00 00 30 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000dc41a872) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000dc41a872)
sd 0:0:2:0: attempting task abort!scmd(0x00000000bbf55408), outstanding for 30460 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#353 CDB: Read(16) 88 00 00 00 00 06 f2 1e 4d f0 00 00 01 70 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000bbf55408) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000bbf55408)
sd 0:0:2:0: attempting task abort!scmd(0x0000000090bc8f3d), outstanding for 30244 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#383 CDB: Read(16) 88 00 00 00 00 07 10 10 93 b0 00 00 00 38 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x0000000090bc8f3d) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x0000000090bc8f3d)
sd 0:0:2:0: attempting task abort!scmd(0x00000000a8dd63ed), outstanding for 30248 ms & timeout 30000 ms
sd 0:0:2:0: [sdi] tag#382 CDB: Read(16) 88 00 00 00 00 07 10 10 94 88 00 00 07 e8 00 00
sd 0:0:2:0: No reference found at driver, assuming scmd(0x00000000a8dd63ed) might have completed
sd 0:0:2:0: task abort: SUCCESS scmd(0x00000000a8dd63ed)
sd 0:0:2:0: Power-on or device reset occurred

and i am not sure if that means libata.force will affect only my motherboard drives or if it will affect all drives.

i have also stumbled upon another possibility.

this article

is talking about needing to turn off Native Command Queuing for all of his WD Gold 16TB drives by setting the queue depth to 1, and it has links to other articles and discussions on the ZFS GitHub from 2020 about flawed NCQ in WD Golds…

I am using WD Gold 18TB drives…

i am going to let the LONG SMART tests finish (should be done late tonight) and then use this script to set the queue depth to 1 for ONLY my WD Gold 18TB drives. this will leave my Micron 1.92TB SSDs and my WD Purple drive (for Frigate surveillance) alone at their default queue depth of 32.

#!/bin/bash

# set queue_depth=1 (effectively disabling NCQ) on every SATA disk,
# skipping the Micron SSDs and the WD Purple (PURZ) drive
for i in /dev/sd? ; do
	model=$(smartctl -i "$i" | grep "Device Model")
	if [[ "$model" =~ "Micron" ]] || [[ "$model" =~ "PURZ" ]]; then
		echo "skipping disk: $i --> $model"
	else
		echo "Disabling NCQ for disk $i"
		echo 1 > "/sys/block/${i#/dev/}/device/queue_depth"
	fi
done
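after running it, i plan to double-check that the setting actually took (and keep in mind this does not survive a reboot, so it would need to be reapplied, e.g. as a post-init script) with something like:

# confirm the current queue depth for every disk (32 = NCQ on, 1 = effectively off)
grep . /sys/block/sd?/device/queue_depth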

so my plans are:

1.) set queue depth to 1, test system
2.) if there are still errors, try setting libata.force=3.0G. while this will not affect one of my pools, since it will always be running off an HBA, it would allow me to test my other pool connected to the motherboard SATA controller. that pool also gets errors under heavy load, so it will be worth testing. assuming this fixes the issue, i will then worry about configuring the HBA controllers.
3.) if there are still errors, i will test with BOTH libata.force=3.0G and the queue depth set to 1

i am hoping the issue will be fixed with option 1.

think i found my issue and my fix

is the article i linked above about needing to turn off Native Command Queuing on WD Gold drives by setting the queue depth to 1 (the flawed NCQ in WD Golds discussed on the ZFS GitHub back in 2020). since i am running WD Gold 18TB drives, i used the script from my earlier post to set the queue depth to 1 for ONLY the WD Golds, leaving my Micron 1.92TB SSDs and my WD Purple drive (for Frigate surveillance) at their default queue depth of 32.

so far, after more than 30 hours with the queue depth set to 1, there have been no errors on either pool. i am going to keep testing and try increasing the queue depth to 2 and then 3, etc., to see if anything errors out.
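for that next round of testing the plan is just the same sysfs knob with a higher value, something like this (sdX is a placeholder for whichever disks are the WD Golds):

# bump the queue depth back up gradually on one of the WD Golds (sdX is a placeholder),
# then watch dmesg and the pools under load for a while before going higher
echo 2 > /sys/block/sdX/device/queue_depth
cat /sys/block/sdX/device/queue_depth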