Hi - I just moved my system to an Intel Core i9-14900 desktop processor with 24 cores (8 P-cores + 16 E-cores) and 98 GB of RAM. The system runs perfectly fine except when a scrub is running. During the scrub it's impossible to even access the web UI, and I also get several email alerts:
New alert:
• Failed to check for alert ScrubPaused: Failed connection handshake
The following alert has been cleared:
• Failed to check for alert ZpoolCapacity: Failed connection handshake
Current alerts:
• Failed to check for alert SnapshotCount: Failed connection handshake
• Failed to check for alert HadUpdate:
• Failed to check for alert ScrubPaused: Failed connection handshake
And a bunch of errors on the command line when logged in:
2024 Jul 29 14:13:59 truenas Process 313177 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2289:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:15:01 truenas Process 314827 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2290:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:23 truenas Process 318293 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2292:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:23 truenas Process 316527 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2291:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:16:57 truenas Process 320057 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2294:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
2024 Jul 29 14:17:05 truenas Process 322353 (php-fpm83) of user 1000 dumped core.
Module /usr/lib/libmcrypt.so.4.4.8 without build-id.
Module /usr/lib/libmcrypt.so.4.4.8
Stack trace of thread 2295:
#0 0x00005593cbff1556 n/a (/usr/sbin/php-fpm83 + 0x3f1556)
#1 0x00005593cbfb7c96 n/a (/usr/sbin/php-fpm83 + 0x3b7c96)
ELF object binary architecture: AMD x86-64
Is there any recommendation for running the scrub so that it does not consume all the resources?
Report A Bug is also a link at the top of the forum, on the right side.
You can open the ticket and then attach all the info afterwards.
Hard to tell what is going on without knowing the entire setup, and you pointed to the Thunderbolt problems thread.
I don’t get any hits searching for Thunderbolt in the SCALE documentation. I don’t know if it is an unsupported item, whether it even presents the drives to TrueNAS ‘correctly’, or whether it is more like USB enclosures that may not present the drives directly to TrueNAS. You not being able to access the GUI doesn’t help much.
Yes, I have not found anything on Thunderbolt (TB) in the SCALE documentation either. Having said that, TB is no different from any external PCIe expansion card: all my drives are properly recognized, and it gives me better results than the SAS expanders and Supermicro servers did, while saving tons of electricity. The issue is that scrubbing almost half a PB takes so long, and the system is very slow during that time…
root@truenas[~]# zpool status tank
pool: tank
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
scan: scrub in progress since Sun Jul 28 00:01:43 2024
380T / 430T scanned at 2.15G/s, 380T / 430T issued at 2.14G/s
76K repaired, 88.34% done, 06:38:37 to go
Having said that, can I get any logs from the command line that would be helpful for the bug?
After you submit a bug ticket, it will get an auto-generated comment with a link to a private upload service. Use that link to upload a debug file from the system (System > Advanced > Save Debug).
Hardware details please, especially disk manufacturers & models.
Specifically, SMR disks run poorly (or not at all) on ZFS. Part of the issue is that ZFS wants to scrub in block order, but SMR disks can be fragmented, even extremely fragmented, and they get worse over time. That makes scrubs take ever-increasing time, even with the same amount of data (if the data was churned / re-written).
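If you want a quick check for drives that advertise themselves as zoned (host-aware or host-managed SMR), a small sysfs loop like the sketch below can help. Note this is only a rough check: drive-managed SMR disks usually still report "none" here, so you may also need to look up the model numbers against the manufacturer's SMR lists.
# List each disk's reported model and zoned mode (Linux sysfs).
# "host-aware" / "host-managed" indicates SMR; drive-managed SMR
# typically still shows "none", so verify the model separately.
for d in /sys/block/sd*; do
    printf '%-8s %-24s %s\n' "$(basename "$d")" \
        "$(cat "$d/device/model" 2>/dev/null)" \
        "$(cat "$d/queue/zoned" 2>/dev/null)"
done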
Thunderbolt expansion drives
While Thunderbolt may appear to work, and you may need it to work, from all the forum posts I’ve read it is quite rare (for use with TrueNAS…). Using unusual hardware puts you in a much smaller group, and one that has much less testing.
Another issue with Thunderbolt is the funnel effect.
How many disks are on each Thunderbolt connection?
Are there multiple Thunderbolt disk enclosures?
…plus the JMicron 585 controllers in there (two in each enclosure). At least port multipliers do not appear to be involved, but this cheap SATA controller could be part of your troubles.
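One way to answer the questions above about how the disks are distributed is to list them by transport and controller path from the shell. This is just a sketch (device names and paths will differ on your system):
# Show each disk with its model, transport type and size.
lsblk -d -o NAME,MODEL,TRAN,SIZE
# Show which PCI / controller path each disk is attached through.
ls -l /dev/disk/by-path/ | grep -v part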
ZFS scrub priority can be tuned via ZFS module parameters, for example zfs_vdev_scrub_max_active. Look them up in the ZFS documentation to see how you may want to use them; you could set them in a post-init script. It appears mine is 3 for the max. Setting the max to 1 might help you, but every system is different. Mine has no issues at all with scrubs running; they don’t noticeably affect anything. Other possibilities are zfs_scan_vdev_limit, zfs_vdev_nia_delay (newish) and zfs_vdev_nia_credit (newish).
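As a concrete example, a post-init script along these lines would apply the zfs_vdev_scrub_max_active=1 suggestion above on TrueNAS SCALE (Linux). It is only a sketch: the value is illustrative, the change takes effect immediately, and it does not persist across reboots unless the script runs at every boot.
#!/bin/sh
# Post-init sketch: throttle scrub I/O via OpenZFS module parameters.
# Lower the number of concurrently active scrub I/Os per vdev
# (the post above reports 3 as its current maximum).
echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
# The other tunables mentioned can be inspected the same way before changing them:
grep . /sys/module/zfs/parameters/zfs_scan_vdev_limit \
       /sys/module/zfs/parameters/zfs_vdev_nia_delay \
       /sys/module/zfs/parameters/zfs_vdev_nia_credit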
Out of curiosity, are you using passive or active cables with your Thunderbolt?
quote from article & link
"To get all the benefits of Thunderbolt 3, you’ll need to use active cables. Active Thunderbolt 3 cables will use integrated chips to achieve full 40Gbps throughput. You should use active cables where throughput really matters, like when connecting your laptop to 4K or 5K displays. You’ll also want to use an active cable to get the fastest throughput out of local file storage for workstations and servers, particularly if you’re connecting to a solid-state-drive (SSD)-based RAID array. "
This can be part of the problem; not saying it is all of the problem. Basically, each hard disk is capable of perhaps 150 MB/s (1.2 Gbit/s). With 32 of them, that works out to 38.4 Gbit/s, which is within the 40 Gbit/s Thunderbolt 3 limit (assuming you are getting it). However, using maximum numbers is only part of performance. You could be limited to 20 Gbit/s due to lower-quality cables.
I don’t claim to be an expert in performance. So take this as something to consider, not a complete answer.
We also need the pool layout, e.g. `zpool status` in CODE tags.
Occasionally someone selects a poor-performing pool layout. It can appear fine until much more data is loaded.
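For example, the output of the following two commands (pasted in CODE tags, assuming the pool is still named tank as above) would show the vdev layout, per-vdev capacity and any per-device errors:
zpool status -v tank
zpool list -v tank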