Should I be worried about CAM status: command timeouts?

A few months back, I upgraded my RAIDZ2 pool from 6x WD 3 TB to 6x Seagate Ironwolf 6TB drives. It’s gone perfectly well, but every once in a while I get the following in the log. I’ve never had these errors before. It is rare - maybe a handful every month or two.

Feb 13 04:21:11 saturn ahcich0: Timeout on slot 26 port 0
Feb 13 04:21:11 saturn ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 05:41:34 saturn ahcich0: Timeout on slot 29 port 0
Feb 13 05:41:34 saturn ahcich0: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:01:19 saturn ahcich0: Timeout on slot 30 port 0
Feb 13 07:01:19 saturn ahcich0: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000 cmd 0004de17
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:07:42 saturn ahcich0: Timeout on slot 5 port 0
Feb 13 07:07:42 saturn ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd c0 serr 00000000 cmd 0004c517
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 09:16:21 saturn ahcich0: Timeout on slot 31 port 0
Feb 13 09:16:21 saturn ahcich0: is 00000000 cs 80000001 ss 00000000 rs 80000001 tfd c0 serr 00000000 cmd 0004df17
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
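
For anyone trying to read the `ahcich0: is ... cs ...` lines, here is a rough sketch of a decoder. It assumes the fields are the port's interrupt status, command-issue bitmask (`cs`), SATA-active/NCQ bitmask (`ss`), the driver's running-slot bitmask (`rs`), task file data (`tfd`) and SATA error register (`serr`). That is my reading of the FreeBSD ahci driver, not something confirmed in this thread, and the function and field names below are made up for illustration.

```python
import re

# Assumed field order, taken from the log lines in this thread:
# is, cs, ss, rs, tfd, serr (all hexadecimal).
LINE_RE = re.compile(
    r"is (?P<intr>[0-9a-f]+) cs (?P<cs>[0-9a-f]+) ss (?P<ss>[0-9a-f]+) "
    r"rs (?P<rs>[0-9a-f]+) tfd (?P<tfd>[0-9a-f]+) serr (?P<serr>[0-9a-f]+)"
)

# ATA status register bits (assumed to be the low byte of tfd).
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def set_bits(mask):
    """Return the indices of the bits that are set in a 32-bit mask."""
    return [bit for bit in range(32) if mask & (1 << bit)]


def decode_ahci_line(line):
    """Decode one ahcich register dump line into readable fields."""
    m = LINE_RE.search(line)
    if m is None:
        raise ValueError("not an ahcich register dump")
    regs = {name: int(value, 16) for name, value in m.groupdict().items()}
    status = regs["tfd"] & 0xFF
    return {
        "issued_slots": set_bits(regs["cs"]),   # commands still outstanding
        "ncq_slots": set_bits(regs["ss"]),      # outstanding NCQ commands
        "status_flags": [n for bit, n in ATA_STATUS_BITS.items() if status & bit],
        "serr": regs["serr"],                   # non-zero would hint at link errors
    }
```

Fed the 09:16:21 line, this reports slots 0 and 31 outstanding, no NCQ commands, status BSY and DRDY, and `serr` of 0. Under the assumptions above, that reads as a drive that was still busy with the cache flush when the timer expired, with no link-level error latched by the controller. That is only a hint, not a diagnosis.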

SMART shows no errors on the relevant drive, and the corresponding zpool has detected zero errors. I’m aware that these errors sometimes indicate poorly seated SATA cables, but I’ve double-checked all of those.

Should I just ignore, or is there something deeper here?

Just bumping this… it seems to happen every ten days or so, with only one or two reports each time. Everything is stable and error-free, but they’re scary warnings. It would be great if someone could share their experience with these messages.

Thanks

I’ve had these errors before and they permanently scarred the SMART stats of the affected drives. In one case it was due to a storage controller having a brain fart, so I just moved on…

In the other case, my old build had a drive bay that was creating intermittent errors. I believe they were CAM errors.

You’ve checked the cables, so that’s good. The next thing you might want to try is shutting down, swapping the erroring disk with another not-erroring disk, starting back up. If the errors return on the same disk, you know it’s a disk issue. If the errors return on the same slot/cable/connection, you know it’s a problem with the slot/cable/connection/etc.
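
The logic of the swap test can be written down as a small sketch. Everything here is hypothetical (the function name, the port names and the disk labels); it only encodes the "did the errors follow the disk or stay with the port" reasoning.

```python
def classify_after_swap(bad_serial, bad_port, events_after_swap):
    """Decide whether timeouts followed the disk or stayed with the port.

    events_after_swap is an iterable of (port, serial) pairs, one per
    timeout logged after the two disks traded places.
    """
    events = list(events_after_swap)
    if not events:
        return "no new timeouts yet"
    # The suspect disk now sits on a different port and still times out.
    followed_disk = any(s == bad_serial and p != bad_port for p, s in events)
    # A different disk now sits on the suspect port and times out there.
    stayed_on_port = any(p == bad_port and s != bad_serial for p, s in events)
    if followed_disk and not stayed_on_port:
        return "disk"
    if stayed_on_port and not followed_disk:
        return "slot/cable/controller port"
    return "inconclusive"
```

For example, if the suspect disk (call it `DISK_A`) was on `ahcich0` and, after the swap, a timeout shows up as `("ahcich3", "DISK_A")`, the function returns `"disk"`. Note the serial numbers before swapping, since the `adaN` names may change between boots.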


Thanks - that’s a good suggestion. I’ll give it a try, although I haven’t had any errors for a few weeks now. Feels like a bit of a needle in a haystack…

This reminds me of an issue I had with Ironwolf drives four years ago. It was so bad that a scrub of the pool could degrade the 8-disk RAIDZ2 array. One random disk would be ejected nearly every time, and none of the usual cable checks or disk swaps made any difference.

It turned out that the Ironwolf 8 TB disks (ST8000VN004-2M21, firmware SC60) were incompatible with the integrated LSI 2308 HBA in my Supermicro X10SL7-F. This is documented in this post in the old forum by another user who encountered the same issue. I ended up moving the disks to a different system based on an AMD X570 chipset, which is now rock solid (knocking on wood).

To further assess the problem, I would suggest trying the following changes:

  1. Disabling NCQ (which helped as a workaround with the original problem at the time using the LSI 2308 HBA since the disk controller couldn’t keep up with requests, hence the timeouts)
  2. Moving the disks to a different disk controller (I assume they are directly attached to the C224 PCH in your motherboard but if that is not the case it might be worth a try)
  3. Searching for a firmware update for the affected disks using the Download Finder tool and inputting the serial number. Sometimes Seagate will silently patch issues which trigger CAM errors in FreeBSD.
  4. Doing a side-grade migration to SCALE to see whether the drives behave differently.

What’s the output of smartctl -a /dev/ada0?

Just realised this thread is quite old but my guess is that ada0 is on the way out.

As I mentioned, there are no issues I can see with SMART stats that suggest ada0 is in any trouble. But here you go in case you can see something I’ve missed:

# smartctl -a /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST6000VN006-2ZM186
Serial Number:    WVX07BGW
LU WWN Device Id: 5 000c50 0f6fafc9b
Firmware Version: SC60
User Capacity:    6,001,175,126,016 bytes [6.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Apr 16 12:12:53 2025 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)	Offline data collection activity
					was never started.
					Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(    0) seconds.
Offline data collection
capabilities: 			 (0x73) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					No Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 646) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x70bd)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       172170960
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       106
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       163987777
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2298
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       77310590994
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   063   040    Old_age   Always       -       28 (Min/Max 27/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       172170960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2294 (161 111 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       56593895724
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       233078403927

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      2218         -
# 2  Short offline       Completed without error       00%      2122         -
# 3  Extended offline    Completed without error       00%      2088         -
# 4  Short offline       Completed without error       00%      2025         -
# 5  Short offline       Completed without error       00%      1929         -
# 6  Short offline       Completed without error       00%      1857         -
# 7  Short offline       Completed without error       00%      1761         -
# 8  Extended offline    Completed without error       00%      1714         -
# 9  Short offline       Completed without error       00%      1665         -
#10  Short offline       Completed without error       00%      1569         -
#11  Short offline       Completed without error       00%      1473         -
#12  Short offline       Completed without error       00%      1377         -
#13  Extended offline    Completed without error       00%      1360         -
#14  Short offline       Completed without error       00%      1281         -
#15  Short offline       Completed without error       00%      1185         -
#16  Short offline       Completed without error       00%      1089         -
#17  Extended offline    Completed without error       00%      1036         -
#18  Short offline       Completed without error       00%       993         -
#19  Short offline       Completed without error       00%       897         -
#20  Short offline       Completed without error       00%       801         -
#21  Short offline       Completed without error       00%       705         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
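
A note on reading the raw values above: the drive is reported as "Not in smartctl database", so smartctl prints the raw values as plain integers. On Seagate drives several of them are commonly described as packed fields. The sketch below assumes that Command_Timeout (188) holds three 16-bit counters and that Raw_Read_Error_Rate (1) and Seek_Error_Rate (7) keep an error count in the bits above a 32-bit operation count. Both layouts are assumptions based on how smartmontools formats known Seagate models, and the helper names are made up.

```python
def split_raw16(raw):
    """Split a 48-bit SMART raw value into three 16-bit words, low word first."""
    return [(raw >> shift) & 0xFFFF for shift in (0, 16, 32)]


def split_error_rate(raw):
    """Split a raw value into (assumed error count, assumed operation count)."""
    return raw >> 32, raw & 0xFFFFFFFF


def parse_attribute_line(line):
    """Return (id, name, raw value) from one row of the attribute table."""
    fields = line.split(None, 9)          # the tenth field is RAW_VALUE plus any notes
    raw = int(fields[9].split()[0])       # drop trailing notes such as "(Min/Max 27/29)"
    return int(fields[0]), fields[1], raw


# Values copied from the smartctl output above.
print(split_raw16(77310590994))      # Command_Timeout
print(split_error_rate(172170960))   # Raw_Read_Error_Rate
print(split_error_rate(163987777))   # Seek_Error_Rate
```

Under these assumptions, 77310590994 is 0x001200120012, so each 16-bit word is 18, and both error-rate attributes have 0 in the upper bits. If the layout holds, `smartctl -v 188,raw16 -a /dev/ada0` should show the same three words directly.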

CAM errors are rare. To give you an idea, last error was 2 March. A couple of bursts in February:

# bzcat /var/log/messages.0.bz2 | grep CAM
Feb  3 17:25:58 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:15 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:50 saturn (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 04:07:30 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:26:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:44:36 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:50:59 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:59:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 06:55:55 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 04:23:43 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 08:14:54 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 18:31:23 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Mar  2 08:09:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
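
To make the clustering easier to see, here is a minimal sketch that tallies these lines per day. It assumes the syslog format shown above; the function names are hypothetical, and the path in the usage note is simply the rotated log from the command above.

```python
import bz2
import re
from collections import Counter

# Matches lines such as:
# "Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout"
TIMEOUT_RE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<day>\d{1,2}) \d{2}:\d{2}:\d{2} \S+ "
    r"\((?P<periph>\w+):(?P<bus>\w+):[\d:]+\): CAM status: Command timeout"
)


def timeouts_per_day(lines):
    """Count CAM command timeouts per (month, day) in syslog-style lines."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m:
            counts[(m["month"], int(m["day"]))] += 1
    return counts


def timeouts_in_bz2_log(path):
    """Same tally, read from a bzip2-compressed log file."""
    with bz2.open(path, "rt", errors="replace") as fh:
        return timeouts_per_day(fh)
```

Calling `timeouts_in_bz2_log("/var/log/messages.0.bz2")` on the file above should give 1 for Feb 3, 7 for Feb 13, 6 for Feb 14, 3 for Feb 23 and 1 for Mar 2, which is 18 in total. Comparing the busy days against scrub and SMART test schedules might show whether the bursts line up with heavy I/O.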

I’m really not sure what to make of it. I haven’t done the cable switch test - the rarity of the errors makes it a tough exercise to pin anything down.

This doesn’t look like a healthy drive to me. Reallocated_Sector_Ct has a raw value of 106. That value starts at 0; anything in the 1-10 range is not great, and 100+ means replace the drive. I’d do that first and see if your errors stop.
