Sudden, really slow scrub speed and increase of COMRESET errors on WD Reds

Sudden slowdown in scrub speed.
– The system has been rock stable for years, and software/firmware changes are frozen.
– Running old: TrueNAS-12.0-U8.1
– This is the first major problem I have seen.
– All drives are 4TB SATA and CMR by spec!

Scrub speed is down from 984M/s to 2.26M/s - 4.68M/s
– There is no other load on the disks than the scrub.

TrueNAS web interface
SCRUB
Status: SCANNING
Completed: 00.86%
Time Remaining: 71 days, 22 hours, 25 minutes, 34 seconds
Errors: 0
Date: 2025-04-27 00:00:01

The only recent change (Feb-March 2025) was the replacement of broken disks with new ones.
The 3 new Red Plus (WD40EFPX-68C6CN0) SATA drives:
– They are CMR by spec
– Resilver was fast and the change was successful

Noticed that recently the "Device-to-host register FISes sent due to a COMRESET" error counter
– has gone up, and big numbers only show on WDC drives (Red and Red Plus). Per-drive counts below; see the smartctl sketch after the list for how to read these counters.

– The only exception is one da11 WDC Red Plus that does not show a big increase, but still 7 for a new disk is strange…

  • da1 = Western Digital Blue, WDC WD40EZRZ-00WN9B0
    4

  • da2 = Western Digital Blue, WDC WD40EZRZ-00WN9B0
    4

  • da3 = Western Digital Blue, WDC WD40EZRZ-00WN9B0
    3

  • da4 = Western Digital Green, WDC WD40EZRX-00SPEB0
    4

  • da5 = Seagate Desktop HDD.15 ”almost ironwolf” ST4000DM000-2AE166
    4

  • da6 = Seagate Desktop HDD.15 ST4000DM000-1F2168
    3

  • da7 = Western Digital Red WDC WD40EFRX-68N32N0
    129

  • da8 = Western Digital Red WDC WD40EFRX-68N32N0
    93 (2025-03-16) → 96 (2025-04-27)

  • da9 = Seagate IronWolf ST4000VN008-2DR166
    8

  • da10 = Western Digital Red Plus, WDC WD40EFPX-68C6CN0
    65

  • da11 = Western Digital Red Plus, WDC WD40EFPX-68C6CN0
    7

  • da12 = Western Digital Red Plus, WDC WD40EFPX-68C6CN0
    70
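
For reference, these counters come from the SATA Phy Event Counters log (GP Log 0x11), which smartctl can print directly. A minimal sketch for tracking them across all pool disks (device names taken from this system):

for d in da1 da2 da3 da4 da5 da6 da7 da8 da9 da10 da11 da12; do
  echo "=== $d ==="
  # GP Log 0x11: SATA Phy Event Counters; COMRESET is counter 0x000a
  smartctl -l sataphy /dev/$d | grep -i comreset
done

Run it periodically (e.g. from cron) and diff the output over time, to see which drives' counters are actually still climbing versus just carrying an old total.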

Head parking / idle timers are disabled (except on the WDC Red Plus)

  • WDC drives: "idle3ctl" method used
  • Seagate drives: "HDAT disable APM" method used
  • Info on how to disable it (if needed) on the WDC Red Plus (WD40EFPX-68C6CN0)? See the camcontrol sketch below.
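
Not an authoritative answer, but newer WD Red drives reportedly dropped the idle3 vendor command in favor of the ATA EPC (Extended Power Conditions) feature set, which FreeBSD's camcontrol can drive. A hedged sketch, assuming the WD40EFPX actually implements EPC; verify the flags against camcontrol(8) before saving anything:

camcontrol epc da11 -c list                   # show supported power conditions and their timers
camcontrol epc da11 -c state -p Idle_b -d -s  # disable Idle_b (head unload on many drives) and save

If the list subcommand shows no EPC support, the remaining option usually mentioned is WD's own wdckit utility.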

What could be the problem, and any ideas on how to diagnose it?
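
One quick way to narrow this down on FreeBSD/TrueNAS CORE is to watch per-disk busy% and latency live while the scrub crawls, e.g.:

gstat -p -I 1s    # physical disks only, refresh once per second

If one drive sits near 100% busy with very high ms/r while the rest idle, chase that disk or its link; if all twelve are equally slow, suspect the HBA, cabling/backplane, the VM layer, or ZFS itself.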


zpool status
pool: ***** 
state: ONLINE
  scan: scrub in progress since Sun Apr 27 00:00:01 2025
        240G scanned at 4.68M/s, 225G issued at 4.38M/s, 25.4T total
        0B repaired, 0.86% done, no estimated completion time
config:

        NAME                                            STATE     READ WRITE CKSUM
        *****                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/c70 -> da4				ONLINE       0     0     0
            gptid/c96 -> da3				ONLINE       0     0     0
            gptid/cc1 -> da1 				ONLINE       0     0     0
            gptid/e0a -> da11 				ONLINE       0     0     0
            gptid/d0c -> da2				ONLINE       0     0     0
            gptid/cf7 -> da6  				ONLINE       0     0     0
            gptid/d2a -> da8  				ONLINE       0     0     0
            gptid/d3f -> da5  				ONLINE       0     0     0
            gptid/d5d -> da9				ONLINE       0     0     0
            gptid/e25 -> da12				ONLINE       0     0     0
            gptid/5f2 -> da10  				ONLINE       0     0     0
            gptid/da7d ->da7				ONLINE       0     0     0

errors: No known data errors

uptime: 1:50PM up 39 days, 14:44, 1 user, load averages: 0.72, 0.74, 0.74


zpool iostat -v -l 1
                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                       5.93G  25.6G      0      0  11.2K    150   27ms  814us    1ms  159us   18us    6us    2ms  810us   27ms      -
  da0p2                                         5.93G  25.6G      0      0  11.2K    150   27ms  814us    1ms  159us   18us    6us    2ms  810us   27ms      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
*****                                           25.4T  18.2T     52     61  8.37M   524K  880ms  697us   16ms  258us  712us   27us  583us  453us  875ms      -
  raidz2                                        25.4T  18.2T     52     61  8.37M   524K  880ms  697us   16ms  258us  712us   27us  583us  453us  875ms      -
    gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   711K  43.6K  798ms  599us   17ms  226us  643us   27us  570us  383us  792ms      -
    gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   715K  43.6K     1s  595us   17ms  227us  656us   27us  537us  378us     1s      -
    gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   715K  43.7K     1s  593us   17ms  226us  642us   27us  557us  377us     1s      -
    gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5      -      -      4      5   714K  43.8K  805ms  810us   15ms  282us  424us   26us  775us  542us  800ms      -
    gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   717K  43.4K     1s  594us   17ms  223us  722us   27us  538us  379us     1s      -
    gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   714K  43.6K  940ms  741us   16ms  291us  562us   27us  435us  467us  936ms      -
    gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   713K  43.6K  666ms  630us   15ms  210us  369us   27us  627us  433us  658ms      -
    gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   715K  43.7K  738ms  719us   15ms  299us  864us   27us  407us  439us  732ms      -
    gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   715K  43.8K  874ms  701us   16ms  298us  835us   26us  590us  422us  869ms      -
    gptid/e2524766-f243-11ef-93fd-000c2994e4e5      -      -      4      4   714K  43.8K  901ms  869us   16ms  296us  848us   26us  688us  590us  898ms      -
    gptid/5f29ba93-e321-11ef-925f-000c2994e4e5      -      -      4      4   716K  43.6K  995ms  880us   15ms  304us    1ms   26us  619us  594us  991ms      -
    gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5      -      -      4      5   713K  43.5K  696ms  654us   16ms  219us  889us   27us  641us  447us  689ms      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                       5.93G  25.6G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
  da0p2                                         5.93G  25.6G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
*****                                          25.4T  18.2T  2.65K      0  18.8M      0  882ms      -   14ms      -      -      -      -      -  867ms      -
  raidz2                                        25.4T  18.2T  2.65K      0  18.8M      0  882ms      -   14ms      -      -      -      -      -  867ms      -
    gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5      -      -    208      0  1017K      0  459ms      -   15ms      -      -      -      -      -  442ms      -
    gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5      -      -    221      0  1.05M      0  555ms      -   14ms      -      -      -      -      -  538ms      -
    gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5      -      -    229      0  1.88M      0  782ms      -   14ms      -      -      -      -      -  781ms      -
    gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5      -      -    215      0  1.88M      0  512ms      -   15ms      -      -      -      -      -  494ms      -
    gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5      -      -    227      0  1013K      0  573ms      -   14ms      -      -      -      -      -  557ms      -
    gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5      -      -    214      0  1.74M      0  824ms      -   15ms      -      -      -      -      -  822ms      -
    gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5      -      -    170      0  1.70M      0     2s      -   19ms      -      -      -      -      -     2s      -
    gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5      -      -    215      0  1.36M      0     1s      -   15ms      -      -      -      -      -     1s      -
    gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5      -      -    243      0  1.38M      0  664ms      -   13ms      -      -      -      -      -  639ms      -
    gptid/e2524766-f243-11ef-93fd-000c2994e4e5      -      -    264      0  2.22M      0     1s      -   12ms      -      -      -      -      -     1s      -
    gptid/5f29ba93-e321-11ef-925f-000c2994e4e5      -      -    254      0  1.46M      0  726ms      -   13ms      -      -      -      -      -  715ms      -
    gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5      -      -    243      0  2.05M      0  789ms      -   13ms      -      -      -      -      -  789ms      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
                                                  capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool                                            alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
boot-pool                                       5.93G  25.6G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
  da0p2                                         5.93G  25.6G      0      0      0      0      -      -      -      -      -      -      -      -      -      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
*****                                           25.4T  18.2T  2.14K      0  29.3M      0     2s      -   17ms      -      -      -      -      -     2s      -
  raidz2                                        25.4T  18.2T  2.14K      0  29.3M      0     2s      -   17ms      -      -      -      -      -     2s      -
    gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5      -      -    162      0  1.74M      0     3s      -   20ms      -      -      -      -      -     3s      -
    gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5      -      -    169      0  2.28M      0     3s      -   19ms      -      -      -      -      -     3s      -
    gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5      -      -    173      0  2.52M      0     2s      -   19ms      -      -      -      -      -     2s      -
    gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5      -      -    145      0  1.77M      0     4s      -   21ms      -      -      -      -      -     4s      -
    gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5      -      -    164      0  3.34M      0     3s      -   19ms      -      -      -      -      -     3s      -
    gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5      -      -    191      0  4.09M      0     1s      -   17ms      -      -      -      -      -     1s      -
    gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5      -      -    225      0  1.06M      0     1s      -   13ms      -      -      -      -      -     1s      -
    gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5      -      -    244      0  1.11M      0     1s      -   13ms      -      -      -      -      -     1s      -
    gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5      -      -    184      0  3.30M      0     4s      -   17ms      -      -      -      -      -     4s      -
    gptid/e2524766-f243-11ef-93fd-000c2994e4e5      -      -    178      0  2.15M      0     1s      -   18ms      -      -      -      -      -     1s      -
    gptid/5f29ba93-e321-11ef-925f-000c2994e4e5      -      -    167      0  3.48M      0     3s      -   19ms      -      -      -      -      -     3s      -
    gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5      -      -    186      0  2.44M      0     1s      -   17ms      -      -      -      -      -     1s      -
----------------------------------------------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
./badblocks4
da1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
No Errors Logged
da2
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
No Errors Logged
da3
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
No Errors Logged
da4
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
No Errors Logged
da5
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               2  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da6
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da7
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da8
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4             691  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da9
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4             364  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da10
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da11
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures
da12
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
Sector Sizes:     512 bytes logical, 4096 bytes physical
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
No Errors Logged
0x04  =====  =               =  ===  == General Errors Statistics (rev 1) ==
0x04  0x008  4               0  ---  Number of Reported Uncorrectable Errors
0x06  0x018  4               0  ---  Number of Interface CRC Errors
0x03  0x030  4               0  ---  Number of Mechanical Start Failures

No Bad Blocks detected

Known issues:

  • da9 = Seagate IronWolf, ST4000VN008-2DR166
    – Gets a steady increase in "UDMA_CRC_Error_Count"; considered normal for this system…

  • da8 = Western Digital Red, WDC WD40EFRX-68N32N0
    – 691 – Number of Reported Uncorrectable Errors
    – The drive had "1 Uncorrectable error" in the past (2025-02-26), went through "drive regeneration", and tested OK.
    – No errors since. Unless this latest scrub slowdown is one… :slight_smile:

The only counters that kept growing after the fix/tests are:
Number of Hardware Resets: 15133 → 15338, up 205
Number of ASR Events: 317 → 322, up 5
Number of High Priority Unload Events: 111 → 113, up 2
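
For what it's worth, these counters come from the ATA Device Statistics log (GP Log 0x04), so they can be snapshotted without wading through full smartctl -x output. A small sketch:

smartctl -l devstat /dev/da8 | grep -E 'Hardware Resets|ASR Events|High Priority Unload'

Logging that daily makes it easy to correlate jumps in the reset counters with scrub windows or dmesg events.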


Hardware specs:
– Running as a VM on ESXi

  • SAS9201-16i (LSI2116) is passed through to the VM
    – 12x 4TB SATA drives in RAIDZ2; encryption is used, but no dedup.
  • 16GB vRAM
  • 2 vCPUs
mpsutil show all
Adapter: mps0 Adapter:
       Board Name: SAS9201-16i
   Board Assembly:
        Chip Name: LSISAS2116
    Chip Revision: ALL
    BIOS Revision: 7.37.00.00
Firmware Revision: 19.00.00.00
  Integrated RAID: no
         SATA NCQ: ENABLED
 PCIe Width/Speed: x8 (5.0 GB/sec)
        IOC Speed: Full
      Temperature: Unknown/Unsupported

PhyNum  CtlrHandle  DevHandle  Disabled  Speed   Min    Max    Device
0       0001        0011       N         6.0     1.5    6.0    SAS Initiator
1       0002        0012       N         6.0     1.5    6.0    SAS Initiator
2       0003        0013       N         6.0     1.5    6.0    SAS Initiator
3       0005        0015       N         6.0     1.5    6.0    SAS Initiator
4       0004        0014       N         6.0     1.5    6.0    SAS Initiator
5       0006        0016       N         6.0     1.5    6.0    SAS Initiator
6       0008        0018       N         6.0     1.5    6.0    SAS Initiator
7       0007        0017       N         6.0     1.5    6.0    SAS Initiator
8       000a        001a       N         6.0     1.5    6.0    SAS Initiator
9       0009        0019       N         6.0     1.5    6.0    SAS Initiator
10      000c        001c       N         6.0     1.5    6.0    SAS Initiator
11      000b        001b       N         6.0     1.5    6.0    SAS Initiator
12                             N                 1.5    6.0    SAS Initiator
13                             N                 1.5    6.0    SAS Initiator
14                             N                 1.5    6.0    SAS Initiator
15                             N                 1.5    6.0    SAS Initiator

Devices:
B____T    SAS Address      Handle  Parent    Device        Speed Enc  Slot  Wdt
00   36   4433221100000000 0011    0001      SATA Target   6.0   0001 03    1
00   33   4433221101000000 0012    0002      SATA Target   6.0   0001 02    1
00   55   4433221102000000 0013    0003      SATA Target   6.0   0001 01    1
00   35   4433221104000000 0014    0004      SATA Target   6.0   0001 07    1
00   41   4433221103000000 0015    0005      SATA Target   6.0   0001 00    1
00   30   4433221105000000 0016    0006      SATA Target   6.0   0001 06    1
00   40   4433221107000000 0017    0007      SATA Target   6.0   0001 04    1
00   62   4433221106000000 0018    0008      SATA Target   6.0   0001 05    1
00   31   4433221109000000 0019    0009      SATA Target   6.0   0001 10    1
00   60   4433221108000000 001a    000a      SATA Target   6.0   0001 11    1
00   34   443322110b000000 001b    000b      SATA Target   6.0   0001 08    1
00   64   443322110a000000 001c    000c      SATA Target   6.0   0001 09    1

Enclosures:
Slots      Logical ID     SEPHandle  EncHandle    Type
  16    500605b002c8c141               0001     Direct Attached SGPIO

Expanders:
NumPhys   SAS Address     DevHandle   Parent  EncHandle  SAS Level
sas2flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01)
Copyright (c) 2008-2013 LSI Corporation. All rights reserved

        Adapter Selected is a LSI SAS: SAS2116_1(B1)

        Controller Number              : 0
        Controller                     : SAS2116_1(B1)
        PCI Address                    : 00:03:00:00
        SAS Address                    : 500605b-0-02c8-c141
        NVDATA Version (Default)       : 11.00.00.06
        NVDATA Version (Persistent)    : 11.00.00.06
        Firmware Product ID            : 0x2213 (IT)
        Firmware Version               : 19.00.00.00
        NVDATA Vendor                  : LSI
        NVDATA Product ID              : SAS9201-16i
        BIOS Version                   : 07.37.00.00
        UEFI BSD Version               : N/A
        FCODE Version                  : N/A
        Board Name                     : SAS9201-16i
        Board Assembly                 : N/A
        Board Tracer Number            : N/A

dmesg:
mps0: Controller reported scsi ioc terminated tgt 55 SMID 1938 loginfo 31120303
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 be 1b 60 98 00 00 18 00
(da9:mps0:0:55:0): CAM status: CCB request completed with an error
(da9:mps0:0:55:0): Retrying command, 3 more tries remain
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 be 1b 60 98 00 00 18 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 be 00 20 80 00 00 10 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 55 SMID 1497 loginfo 31120303
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 be 9e da 58 00 00 18 00
(da9:mps0:0:55:0): CAM status: CCB request completed with an error
(da9:mps0:0:55:0): Retrying command, 3 more tries remain
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 be 9e da 58 00 00 18 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 b9 dd 5b a8 00 00 10 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 55 SMID 1097 loginfo 31120303
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 bf 78 ec 18 00 00 18 00
(da9:mps0:0:55:0): CAM status: CCB request completed with an error
(da9:mps0:0:55:0): Retrying command, 3 more tries remain
(da9:mps0:0:55:0): READ(10). CDB: 28 00 be d5 91 c0 00 00 10 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 bf 63 f2 e0 00 00 08 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
mps0: Controller reported scsi ioc terminated tgt 55 SMID 1765 loginfo 31120303
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 c5 ee 26 30 00 00 20 00
(da9:mps0:0:55:0): CAM status: CCB request completed with an error
(da9:mps0:0:55:0): Retrying command, 3 more tries remain
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 c5 ee 26 30 00 00 20 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 c5 55 76 f0 00 00 10 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)

These types of dmesg errors are common for this server.
– I don't know why, but only the disk connected to:

  • SAS9201-16i slot 1 (of slots 0-11) → controller's physical port A (of ports A-D) → SFF-8087-to-SATA cable, SATA port 2 (of 1-4) → hot-swap bay 2 (of bays 1-12) → currently assigned to da9
  • Gets a steady increase in "UDMA_CRC_Error_Count", but it has not caused any issues other than showing up in dmesg / the SMART log.
    – Tried changing the cable (helped a little; it took a long time to get new errors, but they arrived over time)
    – Because it takes a long time to get errors, it's possible I did not catch the first ones, or the old version of TrueNAS simply did not cause them.
    – Changed the disk; the errors followed the new disk.
    – Changed the hot-swap sled; no help.
    – The disks have always passed checks etc.
    – This has been seen as just an annoying problem for years, not causing any "real world issues", and I don't think it is a direct cause of the current scrub problem.
    – Still, I wonder: is this a firmware, card, or bay problem…?
    → Have you experienced a similar problem of a steadily growing "UDMA_CRC_Error_Count"?
    → Any ideas?

What is the currently recommended firmware for this card on TrueNAS-12.0-U8.1 and/or a later version? Could a firmware upgrade solve some of the issues?

Update: noticed errors in the ESXi dmesg log

ESXi dmesg errors related to the TrueNAS VM
2025-04-16T18:08:13.639Z cpu5:67819)VSCSIFs: 3908: handle 8192(vscsi0:0):Invalid Opcode (0x4d) from (vmm0:TrueNas_Core)
2025-04-22T13:03:13.605Z cpu2:67815)VSCSIFs: 3908: handle 8192(vscsi0:0):Invalid Opcode (0x4d) from (vmm0:TrueNas_Core)

ESXi dmesg errors that keep popping up every 10 minutes…
2025-04-13T04:06:07.216Z cpu10:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-13T04:16:07.231Z cpu6:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-13T04:26:07.246Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
...
When the scrub started
2025-04-26T23:36:36.883Z cpu10:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-26T23:46:36.898Z cpu10:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-26T23:56:36.912Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T00:06:36.927Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T00:16:36.942Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T00:26:36.957Z cpu6:66122)DVFilter: 6027: Checking disconnected filters for timeouts
...
2025-04-27T12:26:38.033Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T12:36:38.048Z cpu7:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T12:46:38.063Z cpu9:66122)DVFilter: 6027: Checking disconnected filters for timeouts
...
2025-04-27T17:36:38.495Z cpu4:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T17:46:38.510Z cpu5:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T17:56:38.525Z cpu3:66122)DVFilter: 6027: Checking disconnected filters for timeouts

The latest DVFilter: 6027 message came at 2025-04-27T17:56:38.525 (at least, no more have appeared in the 3 hours since).

Hmm…

I can’t comment on the increase of COMRESETs in WD Reds.

However, slowdowns are indicative of the problem with SMR (Shingled Magnetic Recording), which is commonly used in WD Blue, WD Green, and Seagate desktop drives. I did not check those models; you should do so.

SMR drives get more and more internally fragmented to the point of noticeably slower access. The comment “rock stable for years” is meaningless when dealing with SMR disks.

Now, is that the cause of the slowdown?
Don't know for sure.

Another issue could be either a fullish pool or a highly fragmented pool. ZFS will change its write behavior at around 95% full to a slower free-space allocation method; thus the recommendation to keep a bit more free space. Of course, with modern storage devices exceeding 25TB, and pools in the 100s of TB, it may be less of a problem. BUT the ZFS code may not have been updated, so the issue may still exist.

Having a ZFS pool highly fragmented, on top of SMR disks that are also highly fragmented, is likely to be really bad.
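
Both of those theories are cheap to check: pool fullness and ZFS's free-space fragmentation metric (FRAG is free-space fragmentation, not file fragmentation) are one command away:

zpool list -o name,size,allocated,free,capacity,fragmentation

Capacity well below ~90% and low single-digit fragmentation would largely rule out the fullish/fragmented-pool explanation.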

Thanks for the reply

Back in the day I did an extensive search on whether the drives are CMR or SMR.

→ Many different sources say they are CMR, or give you information from which to conclude that they are CMR.
– But because of the doubt I had to double-check them again :smiley:

  • The only drive that could be SMR is the ST4000DM000-2AE166
    – Based on info in "Seagate ST4000DM000 multiple product versions - amazon mismatch | TrueNAS Community"
    – Because the drive does have 12V = 0.37A,
    → But IronWolf drives also have 12V = 0.37A, and the drive matches the IronWolf in more areas than an SMR Barracuda Compute drive.
    – Also, if it had the same platter design as the SMR Barracuda Compute drive (ST4000DM004-2CV104), its weight should be lower.
    – What is interesting is that the drive's firmware version is "1", the same as the Barracuda Compute drives have.
    – This drive has always bothered me with its weird firmware version number…

  • Also, I personally did stress testing back then.
    – I had SMR drives to compare, and these drives did not show the problems SMR drives do.
    – I measured the weight of the drives. None matched the low weights of SMR drives; all had weights that matched similar known CMR drives.

8 heads / 1TB platters:
WDC Blue WD40EZRZ-00WN9B0 ~678g
WDC Green WD40EZRX-00SPEB0 ~676g
Seagate Desktop HDD ST4000DM000-1F2168 ~615g
6 heads / 1.33TB platters:
WDC Red WD40EFRX-68N32N0 ~633g
Seagate Desktop HDD ST4000DM000-2AE166 ~609g
Seagate IronWolf ST4000VN008-2DR166 ~602g

SMR for comparison (not used in this system):
4 heads / 2TB platters:
Seagate Compute ST4000DM004-2CV104 ~428g (low weight = SMR)

– If any of the drives used in the pool are SMR, then they have magically acted like CMR drives for years :smiley:

SMR drives are not coming to this pool; I know what they do to ZFS performance :confused:

Does anyone know good test methods (tests I might have missed) to verify whether a drive truly is SMR or CMR?
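
Interrogation rarely settles it, since drive-managed SMR disks usually don't advertise their zones in IDENTIFY data, so a timing test is the usual fallback. A destructive sketch with fio (assumes fio is available; /dev/daX is a placeholder; run it ONLY on a blank spare drive, never a pool member):

# DESTROYS DATA on the target drive - spare/test disks only!
fio --name=smr-probe --filename=/dev/daX --direct=1 \
    --rw=randwrite --bs=64k --time_based --runtime=1800 \
    --ioengine=posixaio --status-interval=60

Sustained random writes stay roughly flat on CMR for the whole run; on DM-SMR the throughput typically collapses partway through, once the on-disk media cache fills, often to single-digit IOPS with multi-second latencies.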

Pool Used Space: 59%

Update
The scrub speed has increased

  pool: *****
 state: ONLINE
  scan: scrub in progress since Sun Apr 27 00:00:01 2025
        1.76T scanned at 20.9M/s, 1.73T issued at 20.4M/s, 25.4T total
        0B repaired, 6.78% done, 14 days 02:07:11 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        *****                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5  ONLINE       0     0     0
            gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e2524766-f243-11ef-93fd-000c2994e4e5  ONLINE       0     0     0
            gptid/5f29ba93-e321-11ef-925f-000c2994e4e5  ONLINE       0     0     0
            gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0

No errors detected/seen.
Also checked all drives' smartctl -x values.

The speed has increased, now going at 20.9M/s, but that is still way too slow.

Still don't know what is causing this… Any ideas where to look or what to try?


Latest is P20.00.07.00.
Is the HBA properly cooled?

Thanks for the reply

The firmware is old :confused: (in the past it was the recommended version for TrueNAS…)
Probably old info now? Should I upgrade it?

Any downsides to upgrading? In particular, is the newer firmware OK for the old TrueNAS-12.0-U8.1? etc.

The HBA temp should be OK
– The HBA is directly cooled by a 12cm fan that pulls fresh air directly from outside the case; room temperature is ~23°C

  • The HBA heat sink is warm to the touch, but not hot.
    – I can try to get more accurate readings later on.

As far as I know, firmware older than P20 is NOT OK.


Thanks for the info

I will upgrade the firmware once I get the current problems sorted out.
– Any good how-tos to follow, and what mistakes should I avoid when doing the firmware upgrade?
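
Not a guide anyone here can vouch for end-to-end, but the commonly repeated LSI SAS2 procedure is short: use the sas2flash you already have (or the EFI/DOS version), flash only the IT-mode P20 image for your exact board, and never interrupt the flash. A sketch, with the file names as placeholders for whatever Broadcom's 9201-16i P20 package actually contains:

sas2flash -list                                 # confirm board and current 19.00.00.00 firmware
sas2flash -o -f 9201-16i_it.bin -b mptsas2.rom  # flash P20 IT firmware plus boot BIOS

Mistakes people warn about: flashing an IR image onto an IT card (yours is already IT, product ID 0x2213), using a package built for a different board, and flashing while the pool is busy; so waiting until the scrub finishes, as you plan to, is right.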

Update: the scrub speed has increased to 63.4M/s, but it's still slow.

pool: *****
 state: ONLINE
  scan: scrub in progress since Sun Apr 27 00:00:01 2025
        7.65T scanned at 63.4M/s, 6.96T issued at 57.7M/s, 25.4T total
        0B repaired, 27.36% done, 3 days 21:16:49 to go
config:

        NAME                                            STATE     READ WRITE CKSUM
        *****                                           ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5  ONLINE       0     0     0
            gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e2524766-f243-11ef-93fd-000c2994e4e5  ONLINE       0     0     0
            gptid/5f29ba93-e321-11ef-925f-000c2994e4e5  ONLINE       0     0     0
            gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0

errors: No known data errors

Fragmentation: 2%

zpool list -v
NAME                                             SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
boot-pool                                       31.5G  5.93G  25.6G        -         -     1%    18%  1.00x    ONLINE  -
  da0p2                                         31.5G  5.93G  25.6G        -         -     1%  18.8%      -    ONLINE
*****                                           43.7T  25.4T  18.2T        -         -     2%    58%  1.00x    ONLINE  /mnt
  raidz2                                        43.7T  25.4T  18.2T        -         -     2%  58.3%      -    ONLINE
    gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/e2524766-f243-11ef-93fd-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/5f29ba93-e321-11ef-925f-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE
    gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5      -      -      -        -         -      -      -      -    ONLINE

And got 1 more error for da9 in dmesg


(da9:mps0:0:55:0): WRITE(10). CDB: 2a 00 c7 52 50 58 00 00 08 00
(da9:mps0:0:55:0): CAM status: CCB request completed with an error
(da9:mps0:0:55:0): Retrying command, 3 more tries remain
(da9:mps0:0:55:0): READ(10). CDB: 28 00 c8 fe 5c b8 00 00 10 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)
(da9:mps0:0:55:0): READ(10). CDB: 28 00 6b 05 29 b0 00 00 50 00
(da9:mps0:0:55:0): CAM status: SCSI Status Error
(da9:mps0:0:55:0): SCSI status: Check Condition
(da9:mps0:0:55:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
(da9:mps0:0:55:0): Retrying command (per sense data)

da9 Number of Interface CRC Errors +1, now at 365
– But this is considered "normal" for this system.

Device-to-host register FISes sent due to a COMRESET

  • da9 = Seagate IronWolf ST4000VN008-2DR166
    – Got 1 more, now at 9
  • Other drives, no change

ESXi
– Changed power priority from low to balanced
– dmesg: some new errors

2025-04-27T18:18:10.725Z cpu0:67695 opID=8adb71f5)World: 12236: VC opID esxui-48e3-ce29 maps to vmkernel opID 8adb71f5
2025-04-27T18:18:10.725Z cpu0:67695 opID=8adb71f5)CpuSched: 694: user latency of 901854 J6AsyncReplayManager 0 changed by 67695 hostd-worker -1
2025-04-27T18:19:29.984Z cpu7:67686 opID=ce48a161)World: 12236: VC opID esxui-ca3b-ce44 maps to vmkernel opID ce48a161
2025-04-27T18:19:29.984Z cpu7:67686 opID=ce48a161)CpuSched: 694: user latency of 901869 J6AsyncReplayManager 0 changed by 67686 hostd-worker -1
2025-04-27T18:19:36.421Z cpu0:67704 opID=625d9839)World: 12236: VC opID esxui-c318-ce51 maps to vmkernel opID 625d9839
2025-04-27T18:19:36.421Z cpu0:67704 opID=625d9839)CpuSched: 694: user latency of 901872 J6AsyncReplayManager 0 changed by 67704 hostd-worker -1

2025-04-28T01:01:20.777Z cpu5:67686 opID=e71c9ced)World: 12236: VC opID esxui-a645-de69 maps to vmkernel opID e71c9ced
2025-04-28T01:01:20.777Z cpu5:67686 opID=e71c9ced)Power: 1654: Current power management policy was set to "dynamic"
2025-04-28T01:01:20.777Z cpu5:67686 opID=e71c9ced)Config: 866: "CpuPolicy" = "dynamic", Old value: "low" (Status: 0x0)
2025-04-28T01:01:20.779Z cpu5:67686 opID=e71c9ced)Power: 1654: Current power management policy was set to "dynamic"

2025-04-28T06:58:13.702Z cpu2:67819)VSCSIFs: 3908: handle 8192(vscsi0:0):Invalid Opcode (0x4d) from (vmm0:TrueNas_Core)

And again a lot of DVFilter messages
2025-04-27T18:06:38.540Z cpu2:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-27T18:16:38.555Z cpu1:66122)DVFilter: 6027: Checking disconnected filters for timeouts
....
2025-04-28T08:06:39.800Z cpu8:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-28T08:16:39.815Z cpu6:66122)DVFilter: 6027: Checking disconnected filters for timeouts
2025-04-28T08:26:39.829Z cpu10:66122)DVFilter: 6027: Checking disconnected filters for timeouts

The DVFilter message comes at 10-minute intervals.

No other errors detected

Any ideas, or things to try?

The Scrub finished

Scrub of pool '*****' started.
2025-04-27 00:00:00

Scrub of pool '*****' finished.
2025-04-28 22:00:03

– It must have sped up a lot at the end, because the whole run took only 46 hours.

zpool status
  pool: *****
 state: ONLINE
  scan: scrub repaired 0B in 1 days 22:00:02 with 0 errors on Mon Apr 28 22:00:03 2025
config:

        NAME                                            STATE     READ WRITE CKSUM
        *****                                          ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/c7089767-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/c961973e-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cc1a6f1a-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e0afbd2f-e436-11ef-80ef-000c2994e4e5  ONLINE       0     0     0
            gptid/d0c2bd83-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/cf718244-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d2a57cec-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d3fd40b4-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/d5deb979-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0
            gptid/e2524766-f243-11ef-93fd-000c2994e4e5  ONLINE       0     0     0
            gptid/5f29ba93-e321-11ef-925f-000c2994e4e5  ONLINE       0     0     0
            gptid/da7d8690-9e0b-11eb-8c0d-000c2994e4e5  ONLINE       0     0     0

errors: No known data errors

Still no errors detected, no reason found for why the slowdown happened, or why it sped back up to normal.

Looking at TrueNAS Reporting / Disk

All disks have acted similarly.

There have been major changes in speed.

  • Very slow until a speed-up at 2025-04-28 8:26
    – Slowdown again at 2025-04-28 14:33

  • Speed-up again at 2025-04-28 17:53
    – Until starting to slow down towards the end, 2025-04-28 21:46 → 22:53

Nothing special was changed that would explain the speed changes.

Is there a way to debug or get more info on what happens in the scrub when it slows down?
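
One low-tech option is to log scrub progress and per-disk latency side by side, so the next slowdown can be pinned either to one misbehaving disk or to an across-the-board stall. A sketch (pool name masked as in the outputs above):

#!/bin/sh
# append a timestamped snapshot once a minute while the scrub runs
while true; do
  date
  zpool status ***** | grep scanned   # cumulative scan rate so far
  gstat -bp                           # one-shot per-disk busy% and latency
  sleep 60
done >> /tmp/scrub_trace.log 2>&1

When the reporting graphs show a stall, the log should say whether one disk's latency exploded (a drive or link problem) or all disks went quiet together (the controller, the VM, or ZFS waiting on something).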

Does it actually slow down?
Or is it just that the estimated time to completion is dismal?

I think it's not just the zpool status completion-time calculation causing problems, because all the iostats etc. showed the slowdown as well… Or something is causing problems in the calculations behind all these things.

Also, the mechanical sound of the disks was weird.
It sounded like a waterfall, and I could feel quite strong vibration as the wave passed. I am wondering whether the scrub could have caused a resonance issue, but that is a long-shot idea…

It was a little on/off, as if there were waiting periods and it was not running 100% of the time.

So there is something happening that slows down the scrub, but I don't know what…

Small suggestion for everyone with 24-hour+ scrubs: add RAM.

I had 64GB because that's all my board used to support. Scrubbing a 5-drive Z1 array about 60% used took about 24 hours.

Doubled recently to 128GB. Now it takes 15 hours for that same pool. I just wanted more breathing room for my future move to SCALE and the pile of apps I'll be using; this was a happy side effect.

People go on and on about maxing out RAM for good reason: things just work better the more you can throw at it.
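
For anyone who wants to check whether RAM is actually the bottleneck before buying more: on FreeBSD/CORE the ARC counters are exposed via sysctl, e.g.:

sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c_max    # current ARC size vs. its ceiling
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses   # rough cache-efficiency proxy

An ARC pinned at its ceiling with a poor hit/miss ratio during scrubs suggests more RAM may help; lots of free headroom suggests the bottleneck is elsewhere.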
