A few months back, I upgraded my RAIDZ2 pool from 6x WD 3 TB to 6x Seagate IronWolf 6 TB drives. It’s gone perfectly well, but every once in a while I get the following in the log. I’ve never had these errors before. They are rare - maybe a handful every month or two.
Feb 13 04:21:11 saturn ahcich0: Timeout on slot 26 port 0
Feb 13 04:21:11 saturn ahcich0: is 00000000 cs 04000000 ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 05:41:34 saturn ahcich0: Timeout on slot 29 port 0
Feb 13 05:41:34 saturn ahcich0: is 00000000 cs 20000000 ss 00000000 rs 20000000 tfd c0 serr 00000000 cmd 0004dd17
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:01:19 saturn ahcich0: Timeout on slot 30 port 0
Feb 13 07:01:19 saturn ahcich0: is 00000000 cs 40000000 ss 00000000 rs 40000000 tfd c0 serr 00000000 cmd 0004de17
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 07:07:42 saturn ahcich0: Timeout on slot 5 port 0
Feb 13 07:07:42 saturn ahcich0: is 00000000 cs 00000020 ss 00000000 rs 00000020 tfd c0 serr 00000000 cmd 0004c517
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
Feb 13 09:16:21 saturn ahcich0: Timeout on slot 31 port 0
Feb 13 09:16:21 saturn ahcich0: is 00000000 cs 80000001 ss 00000000 rs 80000001 tfd c0 serr 00000000 cmd 0004df17
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): Retrying command, 0 more tries remain
SMART shows no errors on the relevant drive, and the corresponding zpool has detected zero errors. I’m aware that these errors sometimes indicate poorly seated SATA cables, but I’ve double-checked all of those.
Should I just ignore these, or is there something deeper here?
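For anyone trying to read the two ahcich0 lines that precede each timeout, here is a minimal Python sketch that decodes them. It assumes the usual reading of the FreeBSD ahci driver's message: "cs" is the port's command-issue bitmask (one bit per slot), "tfd" carries the ATA status register in its low byte, and "serr" is the SATA error register. The function name and return layout are made up for illustration.

```python
import re

# ATA status register bits found in the low byte of "tfd".
ATA_STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x08: "DRQ", 0x01: "ERR"}

def decode_ahci_timeout(header: str, registers: str) -> dict:
    """Decode one 'Timeout on slot N' line plus its register dump line."""
    slot = int(re.search(r"Timeout on slot (\d+)", header).group(1))
    # Pick out "name hexvalue" pairs from the register dump.
    fields = dict(
        re.findall(r"\b(is|cs|ss|rs|tfd|serr|cmd) ([0-9a-f]+)", registers)
    )
    cs = int(fields["cs"], 16)
    # Each set bit in cs is a command slot that is still outstanding.
    pending = [bit for bit in range(32) if (cs >> bit) & 1]
    status = int(fields["tfd"], 16) & 0xFF
    flags = [name for mask, name in ATA_STATUS_BITS.items() if status & mask]
    return {
        "slot": slot,
        "pending_slots": pending,
        "status_flags": flags,
        "serr": int(fields["serr"], 16),
    }

if __name__ == "__main__":
    info = decode_ahci_timeout(
        "Feb 13 04:21:11 saturn ahcich0: Timeout on slot 26 port 0",
        "Feb 13 04:21:11 saturn ahcich0: is 00000000 cs 04000000 "
        "ss 00000000 rs 04000000 tfd c0 serr 00000000 cmd 0004da17",
    )
    print(info)
```

Read this way, cs 04000000 is simply bit 26, matching "slot 26", and tfd c0 means the drive was still reporting itself busy when the FLUSHCACHE48 timed out. serr 00000000 suggests the link itself recorded no errors, which would point away from cabling, though that is an inference rather than a diagnosis.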
Just bumping this… it seems to happen every ten days or so - only one or two reports. Everything is stable and error free, but they’re scary warnings. Would be great if someone could share their experience with these messages.
I’ve had these errors before and they permanently scarred my SMART stats for the affected drives. In one case it was due to a storage controller having a brain fart so I just moved on…
In the other case, my old build had a drive bay that was creating intermittent errors. I believe they were CAM errors.
You’ve checked the cables, so that’s good. The next thing you might want to try is shutting down, swapping the erroring disk with another disk that isn’t erroring, and starting back up. If the errors return on the same disk, you know it’s a disk issue. If the errors return on the same slot/cable/connection, you know it’s a problem with the slot/cable/connection/etc.
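The swap test boils down to comparing which disk sat on the erroring channel before and after the swap. Device names can follow the port rather than the disk, so it is safer to track disks by serial number (for example from smartctl -i). A small Python sketch of the decision logic follows; the function name, the second serial number and the "ahcich1" channel are hypothetical placeholders.

```python
def interpret_swap(before, after, channel_before, channel_after):
    """Interpret a disk swap test.

    before/after map channel name -> disk serial number.
    channel_before/channel_after are the channels that logged timeouts
    before and after the swap.
    """
    disk_before = before[channel_before]
    disk_after = after[channel_after]
    same_disk = disk_before == disk_after
    same_channel = channel_before == channel_after
    if same_disk and same_channel:
        # The suspect disk never moved, so the test says nothing.
        return "inconclusive"
    if same_disk:
        # Errors followed the disk to its new position.
        return "disk"
    if same_channel:
        # Errors stayed on the port even with a different disk attached.
        return "port/cable"
    return "inconclusive"

if __name__ == "__main__":
    # "HYPOTHETICAL1" stands in for the serial of the healthy disk.
    before = {"ahcich0": "WVX07BGW", "ahcich1": "HYPOTHETICAL1"}
    after = {"ahcich0": "HYPOTHETICAL1", "ahcich1": "WVX07BGW"}
    print(interpret_swap(before, after, "ahcich0", "ahcich1"))
```

With errors this rare, the result only means something once enough time has passed after the swap for a timeout to have plausibly occurred.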
Thanks - that’s a good suggestion. I’ll give it a try, although I haven’t had any errors for a few weeks now. Feels like a bit of a needle in a haystack…
This reminds me of an issue I had with IronWolf drives four years ago. It was so bad that a scrub of the pool could degrade the 8-disk RAIDZ2 array. One random disk would be ejected nearly every time, and none of the usual cable checks or disk swaps helped.
It turns out that the IronWolf 8 TB disks (ST8000VN004-2M21, firmware SC60) were incompatible with the integrated LSI 2308 HBA in my Supermicro X10SL7-F. This is documented in this post in the old forum by another user who encountered the same issue. I ended up moving the disks to a different system based on an AMD X570 chipset, which is now rock solid (knocking on wood).
To further assess the problem, I would suggest trying the following changes:
Disabling NCQ (which helped as a workaround with the original problem at the time using the LSI 2308 HBA since the disk controller couldn’t keep up with requests, hence the timeouts)
Moving the disks to a different disk controller (I assume they are directly attached to the C224 PCH in your motherboard but if that is not the case it might be worth a try)
As I mentioned, there are no issues I can see with SMART stats that suggest ada0 is in any trouble. But here you go in case you can see something I’ve missed:
# smartctl -a /dev/ada0
smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST6000VN006-2ZM186
Serial Number: WVX07BGW
LU WWN Device Id: 5 000c50 0f6fafc9b
Firmware Version: SC60
User Capacity: 6,001,175,126,016 bytes [6.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Apr 16 12:12:53 2025 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 646) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   082   064   006    Pre-fail  Always       -       172170960
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       4
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       106
  7 Seek_Error_Rate         0x000f   082   060   045    Pre-fail  Always       -       163987777
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2298
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       4
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       77310590994
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   072   063   040    Old_age   Always       -       28 (Min/Max 27/29)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       48
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       94
194 Temperature_Celsius     0x0022   028   040   000    Old_age   Always       -       28 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   082   064   000    Old_age   Always       -       172170960
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       2294 (161 111 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       56593895724
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       233078403927
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 2218 -
# 2 Short offline Completed without error 00% 2122 -
# 3 Extended offline Completed without error 00% 2088 -
# 4 Short offline Completed without error 00% 2025 -
# 5 Short offline Completed without error 00% 1929 -
# 6 Short offline Completed without error 00% 1857 -
# 7 Short offline Completed without error 00% 1761 -
# 8 Extended offline Completed without error 00% 1714 -
# 9 Short offline Completed without error 00% 1665 -
#10 Short offline Completed without error 00% 1569 -
#11 Short offline Completed without error 00% 1473 -
#12 Short offline Completed without error 00% 1377 -
#13 Extended offline Completed without error 00% 1360 -
#14 Short offline Completed without error 00% 1281 -
#15 Short offline Completed without error 00% 1185 -
#16 Short offline Completed without error 00% 1089 -
#17 Extended offline Completed without error 00% 1036 -
#18 Short offline Completed without error 00% 993 -
#19 Short offline Completed without error 00% 897 -
#20 Short offline Completed without error 00% 801 -
#21 Short offline Completed without error 00% 705 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
CAM errors are rare. To give you an idea, the last error was on 2 March. There were a couple of bursts in February:
# bzcat /var/log/messages.0.bz2 | grep CAM
Feb 3 17:25:58 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:15 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:50 saturn (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 04:07:30 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:26:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:44:36 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:50:59 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:59:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 06:55:55 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 04:23:43 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 08:14:54 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 18:31:23 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Mar 2 08:09:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
I’m really not sure what to make of it. I haven’t done the cable switch test - the rarity of the errors makes it a tough exercise to pin anything down.
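To see how the timeouts cluster, the grep output above can be tallied per day. A minimal Python sketch, with the 18 log lines pasted in as sample data; the function name is made up, and in practice the lines could be read straight from the compressed log with Python's bz2.open in text mode.

```python
from collections import Counter

SAMPLE = """\
Feb 3 17:25:58 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 04:21:11 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 05:41:34 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:01:19 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 07:07:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 09:16:21 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:15 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 13 17:27:50 saturn (aprobe0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 04:07:30 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:26:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:44:36 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:50:59 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 05:59:42 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 14 06:55:55 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 04:23:43 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 08:14:54 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Feb 23 18:31:23 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
Mar 2 08:09:13 saturn (ada0:ahcich0:0:0:0): CAM status: Command timeout
"""

def timeouts_per_day(lines):
    """Count 'Command timeout' lines per (month, day) of the syslog stamp."""
    counts = Counter()
    for line in lines:
        if "CAM status: Command timeout" not in line:
            continue
        month, day = line.split()[:2]
        counts[(month, int(day))] += 1
    return counts

if __name__ == "__main__":
    counts = timeouts_per_day(SAMPLE.splitlines())
    for (month, day), n in counts.items():
        print(f"{month} {day:2d}: {n}")
    print("total:", sum(counts.values()))
```

Grouped like this, the errors look less like a steady trickle and more like a few bad days (13, 14 and 23 February) with single events on either side, which may help when correlating against scrubs, SMART tests or other scheduled jobs.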
This doesn’t look like a healthy drive to me. Reallocated_Sector_Ct is at 106. This value starts at 0; anything in the 1-10 range is not great, and 100+ means replace the drive. I’d replace it first and see if your errors stop.
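Some of the large raw values in the SMART output posted earlier are easier to judge once unpacked. The sketch below assumes the commonly reported Seagate conventions, which are not confirmed anywhere in this thread: Command_Timeout packs three 16-bit counters into one 48-bit raw value, and Raw_Read_Error_Rate / Seek_Error_Rate keep an error count in the upper 16 bits and an operation count in the lower 32 bits. The function names are made up for illustration.

```python
from typing import Tuple

def split_words(raw: int) -> Tuple[int, int, int]:
    """Split a 48-bit raw value into three 16-bit words (high, mid, low)."""
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF

def seagate_error_rate(raw: int) -> Tuple[int, int]:
    """Return (errors, operations), assuming the Seagate packing above."""
    return raw >> 32, raw & 0xFFFFFFFF

if __name__ == "__main__":
    # Raw values copied from the smartctl output for ada0.
    print("Command_Timeout words:", split_words(77310590994))
    print("Raw_Read_Error_Rate  :", seagate_error_rate(172170960))
    print("Seek_Error_Rate      :", seagate_error_rate(163987777))
```

The Command_Timeout raw value 77310590994 is 0x001200120012, so each word is 18, which happens to equal the 18 "Command timeout" lines in the grep output above. Under the same assumption, both error-rate attributes have zero in the error field, so their large raw numbers are operation counts rather than errors. Reallocated_Sector_Ct is a plain counter, so its raw value of 106 is a genuine count of reallocated sectors.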