Should I be worried about CAM status: command timeouts?

nickt · February 13, 2025, 3:48am

A few months back, I upgraded my RAIDZ2 pool from 6x WD 3 TB to 6x Seagate Ironwolf 6TB drives. It’s gone perfectly well, but every once in a while I get the following in the log. I’ve never had these errors before. It is rare - maybe a handful every month or two.

Feb 13 04:21:11 saturn ahcich0: Timeout on slot 26 port 0
Feb 13 04:21:11 saturn ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 05:41:34 saturn ahcich0: Timeout on slot 29 port 0
Feb 13 05:41:34 saturn ahcich0: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:01:19 saturn ahcich0: Timeout on slot 30 port 0
Feb 13 07:01:19 saturn ahcich0: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000 cmd 0004de17
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:07:42 saturn ahcich0: Timeout on slot 5 port 0
Feb 13 07:07:42 saturn ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd c0 serr 00000000 cmd 0004c517
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 09:16:21 saturn ahcich0: Timeout on slot 31 port 0
Feb 13 09:16:21 saturn ahcich0: is 00000000 cs 80000001 ss 00000000 rs 80000001 tfd c0 serr 00000000 cmd 0004df17
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
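
For anyone trying to read the register dump in those ahcich0 lines, here is a minimal Python sketch that decodes one of them. It assumes that "cs" is the port's Command Issue register with one bit per command slot (my reading of the FreeBSD ahci driver, not something stated in this thread) and that the low byte of "tfd" is the standard ATA status register. The helper names are made up for illustration.

```python
import re

# One of the status lines from the log above, with the syslog prefix trimmed.
LINE = ("ahcich0: is 00000000 cs 80000001 ss 00000000 rs 80000001 "
        "tfd c0 serr 00000000 cmd 0004df17")

# Standard ATA status register bits (assumed to be the low byte of tfd).
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def parse_registers(line):
    """Return the hex fields of an ahcich status line as a dict of ints."""
    pairs = re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)", line)
    return {name: int(value, 16) for name, value in pairs}


def slots(mask):
    """List the slot numbers whose bit is set in a 32-bit slot mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


def status_flags(tfd):
    """Name the ATA status bits that are set in the low byte of tfd."""
    return [name for bit, name in ATA_STATUS_BITS.items() if tfd & 0xFF & bit]


regs = parse_registers(LINE)
print(slots(regs["cs"]))          # slots that still had a command outstanding
print(status_flags(regs["tfd"]))  # drive status when the timeout fired
print(hex(regs["serr"]))          # 0x0 would mean no SATA link error was latched
```

Under those assumptions, the slot number in each "Timeout on slot N" line is one of the bits set in cs (0x04000000 is bit 26, for example), tfd c0 reads as BSY plus DRDY, and serr is zero in every entry.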

SMART shows no errors on the relevant drive, and the corresponding zpool has detected zero errors. I’m aware that these errors sometimes indicate poorly seated SATA cables, but I’ve double-checked all of those.

Should I just ignore, or is there something deeper here?

nickt · February 27, 2025, 11:48pm

Just bumping this… it seems to happen every ten days or so, with only one or two reports each time. Everything is stable and error free, but they’re scary warnings. It would be great if someone could share their experience with these messages.

Thanks

Jorsher · February 28, 2025, 6:52am

I’ve had these errors before and they permanently scarred my SMART stats for the affected drives. In one case it was due to a storage controller having a brain fart, so I just moved on…

In the other case, my old build had a drive bay that was creating intermittent errors. I believe they were CAM errors.

You’ve checked the cables, so that’s good. The next thing you might want to try is shutting down, swapping the erroring disk with a non-erroring disk, and starting back up. If the errors return on the same disk, you know it’s a disk issue. If the errors return on the same slot/cable/connection, you know it’s a problem with the slot/cable/connection/etc.

nickt · March 30, 2025, 11:43pm

Thanks - that’s a good suggestion. I’ll give it a try, although I haven’t had any errors for a few weeks now. Feels like a bit of a needle in a haystack…

Belphegor · April 7, 2025, 8:39pm

This reminds me of an issue I had with Ironwolf drives four years ago. It was so bad that a scrub of the pool could degrade the 8-disk RAIDZ2 array. One random disk would be ejected nearly every time, and none of the usual cable checks or disk swaps helped.

It turned out that the Ironwolf 8 TB disks (ST8000VN004-2M21, firmware SC60) were incompatible with the integrated LSI 2308 HBA in my Supermicro X10SL7-F. This is documented in this post in the old forum by another user who encountered the same issue. I ended up moving the disks to a different system based on an AMD X570 chipset, which is now rock solid (knocking on wood).

To further assess the problem, I would suggest trying the following changes:

Disabling NCQ (which helped as a workaround with the original problem at the time using the LSI 2308 HBA since the disk controller couldn’t keep up with requests, hence the timeouts)
Moving the disks to a different disk controller (I assume they are directly attached to the C224 PCH in your motherboard but if that is not the case it might be worth a try)
Searching for a firmware update for the affected disks using the Download Finder tool and inputting the serial number. Sometimes Seagate will silently patch issues which trigger CAM errors in FreeBSD.
Doing a side-grade migration to SCALE to see if the drives react differently.

Johnny_Fartpants · April 8, 2025, 5:41am

What’s the output of smartctl -a /dev/ada0?

Just realised this thread is quite old but my guess is that ada0 is on the way out.

nickt · April 16, 2025, 2:16am

As I mentioned, there are no issues I can see with SMART stats that suggest ada0 is in any trouble. But here you go in case you can see something I’ve missed:

# smartctl -a /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST6000VN006-2ZM186
Serial Number:    WVX07BGW
LU WWN Device Id: 5 000c50 0f6fafc9b
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 16 12:12:53 2025 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 646) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x70bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       172170960
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       106
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       163987777
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2298
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       77310590994
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   063   040    Old_age   Always       -       28 (Min/Max 27/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       172170960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2294 (161 111 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       56593895724
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       233078403927

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2218         -
# 2  Short offline       Completed without error       00%      2122         -
# 3  Extended offline    Completed without error       00%      2088         -
# 4  Short offline       Completed without error       00%      2025         -
# 5  Short offline       Completed without error       00%      1929         -
# 6  Short offline       Completed without error       00%      1857         -
# 7  Short offline       Completed without error       00%      1761         -
# 8  Extended offline    Completed without error       00%      1714         -
# 9  Short offline       Completed without error       00%      1665         -
#10  Short offline       Completed without error       00%      1569         -
#11  Short offline       Completed without error       00%      1473         -
#12  Short offline       Completed without error       00%      1377         -
#13  Extended offline    Completed without error       00%      1360         -
#14  Short offline       Completed without error       00%      1281         -
#15  Short offline       Completed without error       00%      1185         -
#16  Short offline       Completed without error       00%      1089         -
#17  Extended offline    Completed without error       00%      1036         -
#18  Short offline       Completed without error       00%       993         -
#19  Short offline       Completed without error       00%       897         -
#20  Short offline       Completed without error       00%       801         -
#21  Short offline       Completed without error       00%       705         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

CAM errors are rare. To give you an idea, the last error was on 2 March. There were a couple of bursts in February:

# bzcat /var/log/messages.0.bz2 | grep CAM
Feb  3 17:25:58 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:15 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:50 saturn (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 04:07:30 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:26:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:44:36 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:50:59 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:59:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 06:55:55 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 04:23:43 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 08:14:54 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 18:31:23 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Mar  2 08:09:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
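
To make the bursts easier to see, the grep output above can be tallied per day. This is a small Python sketch that works on the lines exactly as pasted; the regular expression is written for this particular log format and is an assumption, not a general syslog parser.

```python
import re
from collections import Counter

# The grep output from above, copied verbatim.
LOG = """\
Feb  3 17:25:58 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:15 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:50 saturn (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 04:07:30 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:26:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:44:36 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:50:59 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:59:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 06:55:55 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 04:23:43 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 08:14:54 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 18:31:23 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Mar  2 08:09:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
"""

# month, day, time, host, (periph:channel:...), message
PATTERN = re.compile(
    r"^(\w{3})\s+(\d+) \S+ \S+ \((\w+):ahcich\d+:[\d:]+\): "
    r"CAM status: Command timeout$"
)

per_day = Counter()     # (month, day) -> number of timeouts
per_device = Counter()  # peripheral name (ada0, aprobe0) -> number of timeouts

for line in LOG.splitlines():
    match = PATTERN.match(line)
    if match:
        month, day, device = match.groups()
        per_day[(month, int(day))] += 1
        per_device[device] += 1

for (month, day), count in per_day.items():
    print(f"{month} {day:2d}: {count}")
print(dict(per_device))
print("total:", sum(per_day.values()))
```

The same loop could read from a file or stdin instead of the embedded string; the string is only there so the sketch runs on its own.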

I’m really not sure what to make of it. I haven’t done the cable switch test - the rarity of the errors makes it a tough exercise to pin anything down.

Johnny_Fartpants · April 16, 2025, 5:48am

This doesn’t look like a healthy drive to me. Reallocated_Sector_Ct has a raw value of 106. That value starts at 0; a range of 1-10 is not great, but 100+ means replace the drive. I’d do this first and see if your errors stop.

nickt · April 21, 2025, 12:19am

Aw crap - you’re right - I hadn’t noticed. Seagate’s stupid encoding of some of their SMART values means I haven’t been looking closely enough - but this one’s easy enough to read.
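
On the encoding point: several of the large Seagate raw values in the smartctl output become readable once they are split into smaller fields. The Python sketch below splits a 48-bit raw value into three 16-bit words, and separately into a top 16-bit count plus a low 32-bit counter. Both layouts are commonly described for Seagate drives, but treating them as correct for this model is an assumption, and the function names are made up for illustration.

```python
def words16(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, high word first."""
    return [(raw >> shift) & 0xFFFF for shift in (32, 16, 0)]


def count_and_operations(raw):
    """Assumed layout for attributes 1 and 7: (top 16 bits, low 32 bits)."""
    return raw >> 32, raw & 0xFFFFFFFF


# Raw values copied from the smartctl output above.
print(words16(77310590994))             # 188 Command_Timeout
print(count_and_operations(172170960))  # 1 Raw_Read_Error_Rate
print(count_and_operations(163987777))  # 7 Seek_Error_Rate
```

Read this way, the Command_Timeout raw value is three words of 18 each, and the top 16 bits of the read and seek error rates are both zero, with the big number sitting entirely in the low 32-bit counter.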

I guess the question is whether reallocated sectors could cause CAM timeouts, or whether CAM timeouts could lead to reallocated sectors.

In any case, time to test the warranty…

nickt · April 21, 2025, 2:15am

Warranty claim underway.

Having said that, I’m not sure this one is too bad…? From the Seagate SMART Attribute Specification:

3.4 Attribute ID 5: Retired Sectors Count

Normalized Retired Sectors Count = 100 - (100 * NumberOfRetiredSectors / (MinimumNumberOfSparesAvailable))
where MinimumNumberOfSparesAvailable depends on factory certification method, and available spare locations.

Raw Usage
Raw [1 – 0] = Current Retired Sector Count
Raw [3 - 2] = Current Retired Sector Count since SMART was last reset.

Looking at the report:

5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       106

The normalised value is 100, which means that – although there have been reallocated sectors – there is plenty of spare capacity for reallocations available. At least, that’s how it reads to me…
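
As a rough check of that reading, here is a Python sketch of the quoted raw layout and formula. The spec excerpt does not give MinimumNumberOfSparesAvailable or say how the division is rounded, so the spare counts below are purely hypothetical and the integer (truncating) division is an assumption.

```python
def retired_sector_counts(raw):
    """Decode attribute 5 per the quoted spec: bytes 1-0 and bytes 3-2."""
    current = raw & 0xFFFF                # Raw [1 - 0]
    since_reset = (raw >> 16) & 0xFFFF    # Raw [3 - 2]
    return current, since_reset


def normalized(retired, min_spares):
    """Quoted formula, assuming the division truncates toward zero."""
    return 100 - (100 * retired) // min_spares


current, since_reset = retired_sector_counts(106)
print(current, since_reset)

# Hypothetical spare-pool sizes, only to show how the formula behaves.
for spares in (1000, 10600, 10601, 20000):
    print(spares, normalized(current, spares))
```

With truncating division, 106 retired sectors still normalise to 100 only when the spare pool is larger than 10,600 sectors; a pool of 1,000 would already show 90. smartctl flags the attribute as failing once the normalised value drops to the threshold, which is 010 in the report.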

But I’m taking no chances, so will replace the drive under warranty.

sfatula · April 21, 2025, 3:32am

If there are any more than 1 or 2 errors, I replace the drive. If it keeps getting more errors, even if it has spare space, that’s not good, and I care more about my data than the hassle of getting it replaced.