Imported pool write speed sometimes dropping to 0

System information is in my signature

I’ve finished importing my last pool, “Tank” (described in my signature), into my new system. I’m running local benchmarks with “fio”, using a 100GB file size and a 10-minute runtime, and in most cases the benchmark completes successfully. An example of a successful run is below:

Command:

root[/mnt/tank/]# fio --ramp_time=5 --gtod_reduce=1 --numjobs=1 --bs=1M --size=100G --runtime=10m --readwrite=write --name=testfile

Starting:

testfile: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][1.7%][w=62.8MiB/s][w=62 IOPS][eta 09m:55s]

2 minutes later (still good):

testfile: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][26.4%][w=65.0MiB/s][w=65 IOPS][eta 07m:25s]

Finished (Success):

testfile: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][70.8%][w=89.0MiB/s][w=89 IOPS][eta 04m:10s]
testfile: (groupid=0, jobs=1): err= 0: pid=98861: Mon Jul  8 10:52:09 2024
  write: IOPS=120, BW=120MiB/s (126MB/s)(70.6GiB/600397msec); 0 zone resets
   bw (  KiB/s): min= 1980, max=4472529, per=100.00%, avg=133161.09, stdev=502059.20, samples=1099
   iops        : min=    1, max= 4367, avg=129.43, stdev=490.34, samples=1099
  cpu          : usr=0.20%, sys=1.81%, ctx=282018, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,72298,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=70.6GiB (75.8GB), run=600397-600397msec

However, in some cases, for no apparent reason, the test will “fail”: the write speed drops to essentially 0 and the ETA jumps to more than 2 days. I don’t see any errors on the console or in /var/log/messages. Other pools are working fine.

Finished (Failure):


testfile: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
fio-3.28
Starting 1 process
testfile: Laying out IO file (1 file / 102400MiB)
Jobs: 1 (f=1): [W(1)][0.5%][eta 02d:12h:50m:30s]
testfile: (groupid=0, jobs=1): err= 0: pid=94385: Mon Jul  8 10:36:15 2024
  write: IOPS=0, BW=525KiB/s (538kB/s)(511MiB/996233msec); 0 zone resets
   bw (  KiB/s): min= 1984, max=85499, per=100.00%, avg=14824.70, stdev=19985.78, samples=63
   iops        : min=    1, max=   83, avg=13.71, stdev=19.63, samples=63
  cpu          : usr=0.00%, sys=0.01%, ctx=3806, majf=0, minf=1
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,511,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=525KiB/s (538kB/s), 525KiB/s-525KiB/s (538kB/s-538kB/s), io=511MiB (536MB), run=996233-996233msec
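
To put the two runs side by side, here is a minimal sketch that recomputes the average throughput from the issued-write counts and run times in the summaries above (72298 and 511 writes of 1MiB each; 600397 and 996233 msec). The helper name avg_bw is my own, not part of fio.

```shell
#!/bin/sh
# avg_bw MIB MSEC
# Average throughput for MIB mebibytes written in MSEC milliseconds.
# (avg_bw is a hypothetical helper name, not an fio feature.)
avg_bw() {
    awk -v mib="$1" -v ms="$2" 'BEGIN {
        s = ms / 1000                      # run time in seconds
        printf "%.1f MiB/s (%.0f KiB/s)\n", mib / s, mib * 1024 / s
    }'
}

avg_bw 72298 600397    # successful run: 72298 x 1MiB writes
avg_bw 511 996233      # failed run: 511 x 1MiB writes
```

The results line up with the 120MiB/s and 525KiB/s that fio reports. The failed run also lasted 996233 msec even though --runtime was 10m.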

Does anyone have any tips on what to investigate for this kind of intermittent problem?

In the UI I see many slow I/O alerts for disks in the pool tank:

Device /dev/gptid/488aa13e-9eb2-11e9-8305-ac1f6b0adb5a is causing slow I/O on pool tank.
2024-07-08 10:35:47

Device /dev/gptid/47d6807f-9eb2-11e9-8305-ac1f6b0adb5a is causing slow I/O on pool tank.
2024-07-08 10:35:48 

Device /dev/gptid/4cea6dd9-9eb2-11e9-8305-ac1f6b0adb5a is causing slow I/O on pool tank.
2024-07-08 10:35:48 

Device /dev/gptid/4c2fe3f8-9eb2-11e9-8305-ac1f6b0adb5a is causing slow I/O on pool tank.
2024-07-08 11:30:15

glabel status (entry for the first device in the alerts above):

gptid/488aa13e-9eb2-11e9-8305-ac1f6b0adb5a     N/A  da5p2
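
Since the alerts name several gptid labels, a small sketch like the following could map all of them to device names in one pass. It assumes the alert text has been saved to a file and that the output of glabel status has been saved to another. The function name map_alerts and both file names are placeholders of mine.

```shell
#!/bin/sh
# map_alerts ALERT_FILE GLABEL_FILE
# Prints "<gptid label> <provider>" for every gptid found in ALERT_FILE,
# using the Name/Components columns of saved `glabel status` output.
# (map_alerts and the file names below are hypothetical.)
map_alerts() {
    awk '
        NR == FNR {                        # first file: glabel status output
            if ($1 ~ /^gptid\//) dev[$1] = $3
            next
        }
        match($0, /gptid\/[0-9a-f-]+/) {   # second file: alert text
            id = substr($0, RSTART, RLENGTH)
            print id, ((id in dev) ? dev[id] : "unknown")
        }
    ' "$2" "$1"
}

# Example usage (run on the affected system):
#   glabel status > /tmp/glabel.txt
#   map_alerts /tmp/alerts.txt /tmp/glabel.txt
```

Labels that do not appear in the glabel output are printed as "unknown" rather than dropped, so a missing mapping stays visible.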

smartctl status:

smartctl 7.2 2021-09-14 r5236 [FreeBSD 13.1-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate IronWolf Pro
Device Model:     ST14000NE0008-2JK101
Serial Number:    REPLACED
LU WWN Device Id: 5 000c50 0b570f6af
Firmware Version: EN01
User Capacity:    14,000,519,643,136 bytes [14.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul  8 13:24:23 2024 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (1266) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   044    Pre-fail  Always       -       193407384
  3 Spin_Up_Time            0x0003   088   088   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       112
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       919076944
  9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       43626
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       112
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   054   045   040    Old_age   Always       -       46 (Min/Max 40/55)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1253
193 Load_Cycle_Count        0x0032   099   099   000    Old_age   Always       -       2005
194 Temperature_Celsius     0x0022   046   055   000    Old_age   Always       -       46 (0 20 0 0 0)
195 Hardware_ECC_Recovered  0x001a   083   064   000    Old_age   Always       -       193407384
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       43586h+25m+20.653s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       358126451530
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       704231568358
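
To make the table above easier to compare across the other disks in the pool, here is a rough sketch that pulls out only a handful of error-related attributes (IDs 5, 187, 188, 197, 198, 199) with their raw values. The choice of IDs and the function name smart_errors are my own, and the device name in the usage comment assumes the disk is da5, based on the glabel entry.

```shell
#!/bin/sh
# smart_errors
# Reads `smartctl -A` style attribute rows on stdin and prints
# "<ID> <ATTRIBUTE_NAME> <RAW_VALUE>" for a few error-related attributes.
# (smart_errors and the selected ID list are assumptions, not a smartctl feature.)
smart_errors() {
    awk '$1 ~ /^(5|187|188|197|198|199)$/ { print $1, $2, $10 }'
}

# Example usage (device name is an assumption):
#   smartctl -A /dev/da5 | smart_errors
```

For the table above, every one of these rows has a raw value of 0.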

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     41983         -
# 2  Short offline       Completed without error       00%     41406         -
# 3  Short offline       Completed without error       00%     40663         -
# 4  Short offline       Completed without error       00%     39967         -
# 5  Short offline       Completed without error       00%     39224         -
# 6  Short offline       Completed without error       00%     38493         -
# 7  Short offline       Completed without error       00%     37773         -
# 8  Short offline       Completed without error       00%     37029         -
# 9  Short offline       Completed without error       00%     36309         -
#10  Short offline       Completed without error       00%     35565         -
#11  Short offline       Completed without error       00%     34821         -
#12  Short offline       Completed without error       00%     34101         -
#13  Short offline       Completed without error       00%     33357         -
#14  Short offline       Completed without error       00%     32637         -
#15  Short offline       Completed without error       00%     31894         -
#16  Short offline       Completed without error       00%     31222         -
#17  Short offline       Completed without error       00%     30478         -
#18  Short offline       Completed without error       00%     29734         -
#19  Short offline       Completed without error       00%     29013         -
#20  Short offline       Completed without error       00%     28269         -
#21  Short offline       Completed without error       00%     27549         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Given the above, does anyone have suggestions on what else I could investigate to determine the cause?