Hi - I just moved my system to an Intel Core i9-14900 desktop processor with 24 cores (8 P-cores + 16 E-cores) and 98 GB of RAM. The system runs perfectly fine except when a scrub is running. During the scrub it's impossible to even access the web UI, and I also get several email alerts:
New alert:
• Failed to check for alert ScrubPaused: Failed connection handshake
The following alert has been cleared:
• Failed to check for alert ZpoolCapacity: Failed connection handshake
Current alerts:
• Failed to check for alert SnapshotCount: Failed connection handshake
• Failed to check for alert HadUpdate:
• Failed to check for alert ScrubPaused: Failed connection handshake
And a bunch of errors on the command line when logged in:
2024 Jul 29 14:13:59 truenas Process 313177 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2289:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:15:01 truenas Process 314827 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2290:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:23 truenas Process 318293 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2292:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:23 truenas Process 316527 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2291:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:57 truenas Process 320057 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2294:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:17:05 truenas Process 322353 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2295:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
Is there any recommendation for running the scrub so that it does not consume all the resources?
Report A Bug is also a link at the top of the forum, on the right side.
You can open the ticket and then attach all the info afterwards.
Hard to tell what is going on without knowing the entire setup, and you pointed to the Thunderbolt problems thread.
I don’t get any hits searching for Thunderbolt in the SCALE documentation. I don’t know if it is an unsupported item, whether it even presents the drives to TrueNAS ‘correctly’, or whether it is more like USB enclosures that may not present the drives directly to TrueNAS. You not being able to access the GUI doesn’t help much.
Yes, I have not found anything on Thunderbolt (TB) in the SCALE documentation either. Having said that, TB is no different from any external PCIe expansion card: all my drives are properly recognized, and it gives me better results than the SAS expanders and Supermicro servers did, while saving tons of electricity. The issue is that scrubbing almost half a PB takes so long, and the system is very slow during that time…
root@truenas[~]# zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Sun Jul 28 00:01:43 2024
380T / 430T scanned at 2.15G/s, 380T / 430T issued at 2.14G/s
76K repaired, 88.34% done, 06:38:37 to go
Having said that, can I get any logs from the command line that would be helpful for the bug?
After you submit a bug ticket, it will get an auto-generated comment with a link to a private upload service. Use that link to upload a debug file from the system (System > Advanced > Save Debug).
Hardware details please, especially disk manufacturers & models.
Specifically, SMR disks run poorly (or not at all) on ZFS. Part of the issue is that ZFS wants to scrub in block order, but SMR disks can be fragmented, even extremely fragmented, and they get worse over time. That makes scrubs take ever-increasing time, even with the same amount of data (if the data was churned / re-written).
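If you want a quick check for drives that advertise themselves as zoned (host-aware or host-managed SMR), a small sysfs loop like the sketch below can help. Note this is only a rough check: drive-managed SMR disks usually still report "none" here, so you may also need to look up the model numbers against the manufacturer's SMR lists.
# List each disk's reported model and zoned mode (Linux sysfs).
# "host-aware" / "host-managed" indicates SMR; drive-managed SMR
# typically still shows "none", so verify the model separately.
for d in /sys/block/sd*; do
    printf '%-8s %-24s %s\n' "$(basename "$d")" \
        "$(cat "$d/device/model" 2>/dev/null)" \
        "$(cat "$d/queue/zoned" 2>/dev/null)"
done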
Thunderbolt expansion drives
While Thunderbolt may appear to work, and you may need it to work, from all the forum posts I’ve read it is quite rare (for use with TrueNAS…). Using unusual hardware puts you in a much smaller group, and one that has much less testing.
Another issue with Thunderbolt is the funnel effect.
How many disks are on each Thunderbolt connection?
Are there multiple Thunderbolt disk enclosures?
…plus the JMicron 585 controllers in there (two in each enclosure). At least port multipliers do not appear to be involved, but this cheap SATA controller could be part of your troubles.
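One way to answer the questions above about how the disks are distributed is to list them by transport and controller path from the shell. This is just a sketch (device names and paths will differ on your system):
# Show each disk with its model, transport type and size.
lsblk -d -o NAME,MODEL,TRAN,SIZE
# Show which PCI / controller path each disk is attached through.
ls -l /dev/disk/by-path/ | grep -v part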
ZFS scrub priority can be tuned via ZFS module parameters, for example zfs_vdev_scrub_max_active. Look them up in the ZFS documentation to see how you may want to use them; you could set them in a post-init script. It appears mine is 3 for the max. Setting the max to 1 might help you, but every system is different. Mine has no issues at all with scrubs running; they don’t noticeably affect anything. Other possibilities are zfs_scan_vdev_limit, zfs_vdev_nia_delay (newish) and zfs_vdev_nia_credit (newish).
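As a concrete example, a post-init script along these lines would apply the zfs_vdev_scrub_max_active=1 suggestion above on TrueNAS SCALE (Linux). It is only a sketch: the value is illustrative, the change takes effect immediately, and it does not persist across reboots unless the script runs at every boot.
#!/bin/sh
# Post-init sketch: throttle scrub I/O via OpenZFS module parameters.
# Lower the number of concurrently active scrub I/Os per vdev
# (the post above reports 3 as its current maximum).
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
# The other tunables mentioned can be inspected the same way before changing them:
grep . /sys/module/zfs/parameters/zfs_scan_vdev_limit \
       /sys/module/zfs/parameters/zfs_vdev_nia_delay \
       /sys/module/zfs/parameters/zfs_vdev_nia_credit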
Out of curiosity, are you using passive or active cables with your Thunderbolt?
quote from article & link
"To get all the benefits of Thunderbolt 3, you’ll need to use active cables. Active Thunderbolt 3 cables will use integrated chips to achieve full 40Gbps throughput. You should use active cables where throughput really matters, like when connecting your laptop to 4K or 5K displays. You’ll also want to use an active cable to get the fastest throughput out of local file storage for workstations and servers, particularly if you’re connecting to a solid-state-drive (SSD)-based RAID array. "
This can be part of the problem; not saying it is all of the problem. Basically, each hard disk is capable of perhaps 150 MB/s (1.2 Gbit/s). With 32 of them, that works out to 38.4 Gbit/s, which is within the 40 Gbit/s Thunderbolt 3 limit (assuming you are getting it). However, using maximum numbers is only part of performance. You could be limited to 20 Gbit/s due to lower-quality cables.
I don’t claim to be an expert in performance. So take this as something to consider, not a complete answer.
We also need the pool layout, e.g. `zpool status` in CODE tags.
Occasionally someone selects a poor-performing pool layout. It can appear fine until much more data is loaded.
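For example, the output of the following two commands (pasted in CODE tags, assuming the pool is still named tank as above) would show the vdev layout, per-vdev capacity and any per-device errors:
zpool status -v tank
zpool list -v tank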