We had an existing pool with a single vdev (30 drives, 18TB each, raidz2) on our TrueNAS SCALE server (I know this vdev width is way over the recommended number). Last week, we added a second vdev to the pool (15 drives, 18TB each, raidz2). We also upgraded to 24.04.
Two days ago, NFS connections started slowing down and eventually halted.
After much troubleshooting, we narrowed it down to a kernel panic caused by scrub tasks taking the majority of the disk I/O, since pausing the scrub tasks restored normal NFS usage. Before the new drives and the update, we never had issues with scrub tasks significantly slowing down the server.
So, some questions.
I know mixing vdevs of different sizes is not recommended because it can result in slower performance. Could this have caused the slowdown?
Is it because scrub tasks are I/O intensive on the new drives? If that's the case, would it get better on the next scrub?
Could it be caused by a bug in the 24.04 release?
In any case, I'm looking at the possibility of running scrub tasks only overnight (something like Justin Azoff's "Incremental zpool scrub" approach, perhaps?) so users aren't affected. Is something like this possible?
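Roughly what I have in mind is the sketch below: resume the scrub in the evening and pause it again in the morning from two cron jobs. This is only an outline, assuming OpenZFS's pause support (`zpool scrub -p`) and a hypothetical pool name `tank`, not something I've tested:

```python
#!/usr/bin/env python3
# Minimal sketch of the "scrub only overnight" idea: resume a paused scrub
# in the evening and pause it again in the morning. Assumes OpenZFS 0.8+
# (which supports `zpool scrub -p`) and a pool named "tank" -- adjust to taste.
# Driven by two cron jobs (or TrueNAS cron tasks), e.g.:
#   0 22 * * *  /root/scrub_window.py resume
#   0 6  * * *  /root/scrub_window.py pause
import subprocess
import sys

POOL = "tank"  # hypothetical pool name; replace with your pool


def scrub_state(pool: str) -> str:
    """Return 'scrubbing', 'paused', or 'idle' based on `zpool status` output."""
    out = subprocess.run(["zpool", "status", pool],
                         capture_output=True, text=True, check=True).stdout
    if "scrub paused" in out:
        return "paused"
    if "scrub in progress" in out:
        return "scrubbing"
    return "idle"


def main() -> None:
    action = sys.argv[1] if len(sys.argv) > 1 else ""
    state = scrub_state(POOL)
    if action == "pause" and state == "scrubbing":
        subprocess.run(["zpool", "scrub", "-p", POOL], check=True)  # pause
    elif action == "resume" and state == "paused":
        subprocess.run(["zpool", "scrub", POOL], check=True)        # resume paused scrub
    else:
        print(f"nothing to do (state={state}, action={action})")


if __name__ == "__main__":
    main()
```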
I do not understand why you added a 2nd vDev of 15x18TB, which is also over the recommended number of drives (12), rather than adding e.g. one vDev of 8x18TB and another of 7x18TB, when you know that your existing vDev already has too many drives.
I don’t think I have seen a recommendation about the width of vDevs being the same.
Unless you have explicitly rebalanced the 2 vDevs, all your old data is still on the original vDev, and only new data and some metadata will be on the new vDev.
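If you want to confirm how the data is spread, something like this rough sketch will show how empty the new vDev still is (it assumes your pool is called `tank`; `zpool list -v` reports allocation per top-level vDev):

```python
#!/usr/bin/env python3
# Rough sketch: print the per-vdev rows of `zpool list -v` so you can see
# how unbalanced the pool is. Assumes a pool named "tank"; each top-level
# raidz2 vdev gets its own indented line with SIZE/ALLOC/FREE columns.
import subprocess

POOL = "tank"  # replace with your pool name

out = subprocess.run(["zpool", "list", "-v", POOL],
                     capture_output=True, text=True, check=True).stdout

for line in out.splitlines():
    # Top-level vdevs (e.g. raidz2-0, raidz2-1) are indented under the pool name.
    if line.strip().startswith("raidz"):
        print(line.strip())
```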
Scrubs are supposed to run at a lower priority than normal I/Os and so should not have too much performance impact. Most NFS reads should be satisfied from ARC, so I am unclear why a scrub would impact these.
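If you do want to check whether scrub I/O is being throttled as expected, the ZIO scheduler's scrub limits are exposed as kernel module parameters on OpenZFS for Linux. A hedged sketch; verify these parameter names exist on your 24.04 install before acting on the output:

```python
#!/usr/bin/env python3
# Hedged sketch: read the OpenZFS module parameters that bound how much
# scrub I/O the scheduler will issue per vdev. On OpenZFS for Linux these
# live under /sys/module/zfs/parameters/; confirm the names on your system
# before relying on them.
from pathlib import Path

PARAMS = [
    "zfs_vdev_scrub_min_active",  # minimum concurrent scrub I/Os per vdev
    "zfs_vdev_scrub_max_active",  # maximum concurrent scrub I/Os per vdev
    "zfs_scan_vdev_limit",        # bytes of scan I/O in flight per top-level vdev
]

base = Path("/sys/module/zfs/parameters")
for name in PARAMS:
    path = base / name
    value = path.read_text().strip() if path.exists() else "<not present>"
    print(f"{name} = {value}")
```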
I am not sure whether a scrub runs at a pool/dataset level or at a vDev level. But it reads all metadata and data blocks and verifies their checksums, and where possible it repairs any blocks whose checksums do not match using the pool's redundancy.
From the above, I am not sure why adding a 2nd vDev which is mostly empty would cause any issues.
I am therefore thinking that the cause is something other than I/O contention, such as drive overheating. My advice is to review the TrueNAS Reports to see whether anything there leaps out as odd.
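For the overheating angle specifically, pulling the current drive temperatures is quick. A rough sketch, assuming smartmontools' `smartctl` is available (SCALE ships it) and that your drives show up as /dev/sdX devices:

```python
#!/usr/bin/env python3
# Rough sketch: print the current temperature of every whole-disk /dev/sd*
# device using smartctl's JSON output. Assumes smartmontools 7.x and that
# smartctl reports a "temperature" section for your drives.
import glob
import json
import subprocess

# Keep whole disks only (sda, sdab, ...), skipping partition nodes like sda1.
devices = [d for d in sorted(glob.glob("/dev/sd*")) if not d[-1].isdigit()]

for dev in devices:
    try:
        out = subprocess.run(["smartctl", "-A", "-j", dev],
                             capture_output=True, text=True).stdout
        temp = json.loads(out).get("temperature", {}).get("current")
        print(f"{dev}: {temp if temp is not None else 'n/a'} C")
    except (json.JSONDecodeError, OSError):
        print(f"{dev}: could not read SMART data")
```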
It was a decision based on financial and infrastructure constraints; we were trying to get the maximum usable storage out of the 15 drives.
TrueNAS does show a warning about mixed vdevs.
I see.
Scrub tasks are possibly not the only contributing factor to this crash, and the actual issue may be somewhere else. One common theme is that it happens during periods of high network traffic.
The logs I see are identical to those in the ticket and bug report below, so I think it's a bug that needs to be patched.