TrueNAS SCALE 22.12.3 Freezes System After Attempting to Delete Files

Hi Everyone,

I've been running the above version for a couple of years on a very high-performance server with 26 disk slots and never had a single slowdown until today. My users took my data pool usage to 84%, which I knew was unacceptable. It took them 3 weeks to figure out what they didn't want, and they then went to their SMB share to delete the documents they weren't using. When they did so, at about 10 AM, the whole TrueNAS slowed to a crawl: no one could access any shares, the web UI became totally unresponsive, and we could not even SSH to the host.
Just a few minutes ago, more than 3.5 hours later, the entire server came back to life and started responding as usual. All the shares came back online.
The usage is now 76%. I'm going to try to get it down to 60-70%, but I'm now scared to delete files.
Does anyone know why TrueNAS would slow down like that during a delete operation?

Thanks,
Bryan

File deletions in ZFS also need to handle the removal of metadata, checksums, etc.
Doing that over a massive amount of data involves a lot of very small writes, which generally won't be fast. As the pool gets fuller and more fragmented, operations like this will probably take longer.
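If it happens again, it may be worth watching per-vdev I/O from a shell while the deletes run (assuming you can still get one). Something like the below, with "tank" standing in as a placeholder for your actual pool name, prints activity for every vdev every 5 seconds:

    # print read/write ops and bandwidth per vdev, refreshed every 5 seconds
    # "tank" is a placeholder - substitute your pool's name
    zpool iostat -v tank 5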

I certainly don’t think this should be freezing the WebUI/SSH though, especially for such an amount of time. Did reporting manage to capture anything during this time? What was CPU utilization looking like?

1 Like

ZFS is copy-on-write: nothing is overwritten in place, so every write, including metadata updates, allocates new blocks.
If there is not enough free space available, contiguous free space cannot be allocated and writes become painfully slow.
Even deletes have to rewrite metadata, so deletion is also painfully slow when free space is scarce.
Massive deletions when free space is low are a bad idea.
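If you want to gauge how tight things are, capacity and free-space fragmentation are visible from the command line (the pool name below is a placeholder):

    # show allocated/free space, free-space fragmentation and capacity for the pool
    # "tank" is a placeholder - substitute your pool's name
    zpool list -o name,size,alloc,free,frag,cap tank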

2 Likes

Essinghigh,
Thanks for responding.
There are 24 physical cores (48 CPU threads) on this server. When the GUI would load at a few points during the 3.5 hours, the CPU showed anywhere from 8%-25% utilization, with individual threads spiking to 100% for a few seconds. I've never seen the CPU in TrueNAS that high before.
We eventually received this alert that I don’t know what to do about:

  • Attempt to connect to netlogon share failed with error: [ENOTCONN] (6, 'WBC_ERR_WINBIND_NOT_AVAILABLE: wbcInterfaceDetails failed', '…/…/nsswitch/py_wbclient.c:1685').

Hiiii mv…,
I agree; we never had less than 2.5TB of space free.
Our total usable space is about 28TB with RAIDZ3-level protection. I do have all the various caches in this TrueNAS version turned on and allocated to various SSDs, including the metadata cache. Not sure if that helps with the metadata copying you mention.
When the deletes were started today, we had 4TB out of the ~28TB available.

The web UI, when it's frozen, says "Waiting for Active TrueNAS controller to come up…"
Sometimes it loads, sometimes it doesn't.

I presume when you say massive deletions are a bad idea when free space is low, you mean doing smaller deletions is better. Is there a rule of thumb for the ratio of how much to delete versus free space?
Thanks!

Sorry. I don’t know.

Stupid question: 'massive' as in the number of files deleted, or 'massive' as in the size of the deleted file(s)? Or both?

Hi @JackOfAllIT

Are you by any chance using deduplication on this system?

Also, can you please describe the assignment of the extra SSDs in your system to the “various caches” as you say, as well as the make/model number of these disks?

HoneyBadger,
Thanks for engaging on this.
Yes, we are using deduplication. Here is the description of the drive assignments:
Cache: 2x400GB SAS SSD
Log: 800GB SAS SSD
MetaData: 480GB SATA SSD
Dedup is actually on a mirror of 2x 1.8TB SAS HDD spinning drives.

Thanks.

I’m afraid I have to be blunt - you’re painted into a corner that’s going to require you to do a full backup and restore in order to regain some semblance of order here.

Your performance problems under deletes are related to the use of deduplication, and your separate dedup vdev isn’t going to assist with this, as it’s also made up of spinning disks.

You’ve also got a perilous situation with your metadata vdev as it’s only a single drive - losing this device would mean your entire pool is unavailable regardless of the use of RAIDZ3.

And because you have a RAIDZ3 data vdev, you also can’t use the Remove vdev functionality to return the meta/dedup tables back to your main volume in order to regain some of that redundancy.

Can you show the output of zpool status -D yourpoolname here?

6 Likes

I would be careful with that metadata vdev. You've got a single point of failure there; it should be a 2-way or even 3-way mirror.

EDIT: @HoneyBadger beat me to the punch by under 5 seconds!!!

As @HoneyBadger suggests, creating your pool again using a better design/layout is the best solution, but this will be a painful fix.

As a fudge alternative that can fix the metadata mirror issue (but not the dedup delete issue) without needing a rebuild:

  1. Are you doing synchronous writes (from Macs or over NFS)? If not, an SLOG is a waste of a decent device, and you should be able to remove it without any issues.

    It would not be ideal, but you could then use the 800GB SAS SSD as a second mirror disk for the 480GB metadata SATA SSD (a rough command sketch for this follows the list).

  2. If you are doing synchronous writes, then break the cache mirror (cache doesn't need to be redundant, though a second device may give you more I/O capacity or faster I/O), use one of those cache SSDs for the SLOG, and use the current SLOG SSD as the mirror for the metadata vdev.
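For option 1, a rough sketch of the commands (device names are placeholders, so check which GUID/partition belongs to which disk in zpool status before running anything):

    # remove the SLOG; log vdevs can be removed safely, and sync writes then
    # fall back to the in-pool ZIL
    zpool remove yourpool <log-device>

    # attach the freed 800GB SSD as a mirror of the existing 480GB metadata device
    zpool attach yourpool <existing-metadata-device> <800GB-ssd>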

Hindsight Motto: ZFS pool design is best done correctly before you create the pool and populate it with data, because fixing a poor design usually means a very painful backup and restore, and since your NAS may well be your largest chunk of storage capacity, the only fix may end up being to build a whole new NAS to migrate to. So for new users planning their NAS: do ask for advice here in the forums before you build!!

1 Like

Thank you HoneyBadger,

So blunt I can do. This is quite refreshing.
Where to begin…

  1. Output of the zpool cmd.
    root@hiddenhostname[/dev/disk]# zpool status -D Pool1
      pool: Pool1
     state: ONLINE
      scan: scrub repaired 0B in 04:01:47 with 0 errors on Sun May 12 04:01:49 2024
    config:

        NAME                                      STATE     READ WRITE CKSUM
        Pool1                                     ONLINE       0     0     0
          raidz3-0                                ONLINE       0     0     0
            961defb9-aedf-43a8-ac23-24e8a9a99120  ONLINE       0     0     0
            ae333812-9f44-4543-90c7-3f9b6accb07a  ONLINE       0     0     0
            0963636e-c642-4a54-b931-93f27bad84b6  ONLINE       0     0     0
            9e6839f1-9876-4113-a4b9-2301b61c3040  ONLINE       0     0     0
            a06cd50d-6bce-413e-8118-4baa698b2db5  ONLINE       0     0     0
            7f02cf2e-b5b2-4fc0-af36-755c4c651557  ONLINE       0     0     0
            03942318-be03-4e3a-9295-e84effa28d39  ONLINE       0     0     0
            02ec6732-7175-437c-86a2-06deb5d2d481  ONLINE       0     0     0
            a7061e2f-6525-4260-8801-134703af50dc  ONLINE       0     0     0
            dba1a878-3c2b-4a8a-b42f-8a23739b2a75  ONLINE       0     0     0
            c3540e57-fc95-4644-b8c1-7964cf663d85  ONLINE       0     0     0
            387c60dd-4bd3-4593-9bbd-6f93ffceedf7  ONLINE       0     0     0
            b1d20ef4-329b-4c78-bfe7-b32850e8b419  ONLINE       0     0     0
            fe76d678-6141-491a-aab6-0d86b781c366  ONLINE       0     0     0
            69f1af0e-b6e9-4b57-aee4-8d43b39db249  ONLINE       0     0     0
            369f4a02-cf86-44bf-9ec2-fdd07ed27d71  ONLINE       0     0     0
            eb1a9b95-8b46-4a5c-95c2-16af1a1262c5  ONLINE       0     0     0
            5aabb141-eed0-4562-85bb-40bdfbef968d  ONLINE       0     0     0
            0fc578a9-a3e7-4ba2-87d4-113b67a3d360  ONLINE       0     0     0
        dedup
          mirror-3                                ONLINE       0     0     0
            23315a1b-072c-4d20-bde8-011fa1a979ae  ONLINE       0     0     0
            ea672079-44bf-44f7-badf-4ed654284e44  ONLINE       0     0     0
        special
          1e92f55d-017b-4149-b0c9-40cad3c690c5    ONLINE       0     0     0
        logs
          b2c0f0bb-eceb-44ef-a500-fa0ae994e77a    ONLINE       0     0     0
        cache
          62dd122c-a6a2-4b1d-af8b-67def5bd9168    ONLINE       0     0     0
          b9971860-8eaf-42c0-a9d7-874c170ce634    ONLINE       0     0     0

    errors: No known data errors

dedup: DDT entries 59179711, size 912B on disk, 294B in core

bucket allocated referenced


refcnt blocks LSIZE PSIZE DSIZE blocks LSIZE PSIZE DSIZE


 1    51.2M   6.39T   6.31T   6.31T    51.2M   6.39T   6.31T   6.31T
 2    4.77M    610G    579G    582G    10.1M   1.26T   1.20T   1.20T
 4     475K   59.3G   56.6G   56.8G    2.17M    277G    265G    266G
 8    18.8K   2.34G   1.76G   1.80G     201K   24.9G   19.1G   19.5G
16    5.75K    731M    594M    603M     134K   16.6G   13.8G   14.0G
32      591   71.3M   28.8M   31.9M    22.5K   2.69G   1.05G   1.17G
64      101   11.7M    752K   1.59M    8.23K    966M   57.9M    129M

128 10 326K 68K 153K 1.62K 50.0M 12.9M 26.6M
256 24 2.52M 2.11M 2.16M 7.73K 808M 660M 678M
512 7 278K 28K 89.5K 4.63K 163M 18.5M 59.1M
1K 3 136K 12K 38.3K 3.81K 175M 15.2M 48.7M
2K 5 268K 20K 63.9K 14.1K 729M 56.4M 180M
32K 1 128K 4K 12.8K 42.6K 5.33G 170M 545M
Total 56.4M 7.05T 6.94T 6.94T 63.9M 7.98T 7.81T 7.81T

  2. We hired and paid a consultant who has done support work for iXsystems in the past to help design and lay out our filesystems, so I'm surprised to hear about the metadata vdev.
  3. @Protopia, we do the largest chunk of SMB reads/writes/deletes from Mac workstations. They are the culprits who stuck a ton of their videos onto this fileserver and almost filled it up. They issued the deletes over the last two days that caused the system to slow to a crawl. We plan on using NFS in the future, but don't have anyone using it at the moment.
  4. I don't think the cache drives are mirrored; I believe they are what would be called striped in other RAID parlance. I took a screenshot (attached); hopefully you all can see it. We wanted maximum performance from the cache.
  5. I assume SLOG == Log in TrueNAS?
  6. Once I observed the dedup drives doing all the I/O while the Macs were deleting their files, I turned off dedup on that dataset. Can I remove the dedup vdev and replace it with 2 SSDs, turning that into another metadata vdev?

Thanks!!
Jack…
p.s. I’m working on the HTML to offer the minimize option on my signature, I don’t have it figured out just yet.

The output above isn't formatted well, so I took screenshots of it instead.
This might make more sense.


This is a pretty basic mistake, and IMO is a reflection of someone who does not understand what a metadata vdev does and how critical it is to the survival of the pool.

Similarly, a design using HDDs for de-dup also suggests someone that doesn’t understand the performance impact/requirements of dedup.

My understanding (and I may be wrong) is that both SMB from Macs and all NFS usage create synchronous writes, so an SLOG may well be beneficial to the write performance as perceived by your Mac users.

If these videos should not have been saved, you may wish to implement disk quotas to avoid this happening again.
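For example, something along these lines (dataset name and size are made up purely for illustration):

    # cap the hypothetical "tank/media" dataset and everything under it at 5TB
    zfs set quota=5T tank/media

    # or cap only the dataset's own data, excluding snapshots and children
    zfs set refquota=5T tank/media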

I agree - all the output suggests that they are not mirrors. I am not sure whether they are stripes however - so they may just give you 800GB of cache without the performance benefit of a stripe.

1 Like

We fixed the marketing media share problem by building them a simple TrueNAS with LFF drives and no fancy vdevs; they are putting all recent data there. I also have that other TrueNAS as a replication target so I can back up this one… in case something happens… to the metadata vdev…
I'm not sure about the Mac SMB behavior. I did see our dedup vdev getting hit really hard with I/O during the delete actions, but not our large 20-drive Z3. In the output above, the dedup is only 912B on disk, which seems to indicate there isn't much dedup going on. Could that be a result of my turning it off, or of the data being removed, or both?
There are only about 200MB of data left on that dataset that were deduped before I turned dedup off, most of which are MS Word documents.

L2ARC is always striped, never mirrored.

Luckily it's only a small SSD; you can buy another two for decently cheap (assuming you've got the I/O connections to spare) and attach them to mirror it 3 ways. It should be possible to do this (someone please correct me if I'm wrong, as I have not personally used special vdevs, but general experience says it should be).

Also just as a note for the formatting, anything in three backticks (`) will become a codeblock.

like this!

1 Like

So I’ve reformatted your dedup output for readability here:

dedup: DDT entries 59179711, size 912B on disk, 294B in core
bucket allocated referenced

refcnt blocks   LSIZE   PSIZE   DSIZE   blocks   LSIZE   PSIZE   DSIZE

   1    51.2M   6.39T   6.31T   6.31T    51.2M   6.39T   6.31T   6.31T
   2    4.77M    610G    579G    582G    10.1M   1.26T   1.20T   1.20T
   4     475K   59.3G   56.6G   56.8G    2.17M    277G    265G    266G
   8    18.8K   2.34G   1.76G   1.80G     201K   24.9G   19.1G   19.5G
  16    5.75K    731M    594M    603M     134K   16.6G   13.8G   14.0G
  32      591   71.3M   28.8M   31.9M    22.5K   2.69G   1.05G   1.17G
  64      101   11.7M    752K   1.59M    8.23K    966M   57.9M    129M
  128      10    326K     68K    153K    1.62K   50.0M   12.9M   26.6M
  256      24   2.52M   2.11M   2.16M    7.73K    808M    660M    678M
  512       7    278K     28K   89.5K    4.63K    163M   18.5M   59.1M
  1K        3    136K     12K   38.3K    3.81K    175M   15.2M   48.7M
  2K        5    268K     20K   63.9K    14.1K    729M   56.4M    180M
  32K       1    128K      4K   12.8K    42.6K   5.33G    170M    545M
Total   56.4M   7.05T   6.94T   6.94T    63.9M   7.98T   7.81T   7.81T

This indicates that you’re only saving a little under 1T from enabling dedup - about a 1.13:1 reduction ratio.

What that's costing you is 59,179,711 records at 912B each on disk and 294B each in RAM, for a total of about 50.2G on disk and 16.2G in RAM. It's small enough that reading and comparing against the dedup table fits in RAM - but the problem comes when it's time to go through and delete the associated hashes from the deduplication table on your spinning disks. Dedup table entries are tiny little things and put a huge level of I/O demand on the underlying storage - spinning disks are quite simply not fit for purpose here.
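Spelling out the arithmetic, with the numbers taken straight from the DDT summary above:

    59,179,711 entries x 912 B = ~54.0 GB = ~50.2 GiB on disk
    59,179,711 entries x 294 B = ~17.4 GB = ~16.2 GiB in core (RAM)

    Savings (Total row, referenced vs. allocated DSIZE):
    7.81T - 6.94T = ~0.87T saved, i.e. 7.81 / 6.94 = ~1.13:1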

I’m disappointed here as well. Your pool is basically reliant on your single metadata SSD remaining alive, so the mirroring of your dedup HDDs and the RAIDZ3 of your main pool is negated by this. Furthermore, the use of RAIDZ3 means that while you can widen the additional special vdevs, you can’t fully remove them.

Unfortunately your current cache SSDs are too small to be used to mirror the existing 480G “metadata” SSD - you’ll have to find ones at least as large as the current unit. The log drive could be used, assuming that the cache drives are capable of delivering sufficient sync-write performance.

But your dedup tables are ultimately the bottleneck here. Even if you fully disable it (and it seems as if it’s on more than just that one dataset, based on your DDT stats above) there’s no way to “un-dedup” the existing data, so you’ll still be stuck with the necessity of the updates/deletes being bound by the awful random I/O performance of your dedup HDDs.
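If you want to check exactly where dedup is (or was) enabled, the property can be listed recursively per dataset with standard OpenZFS tooling:

    # list the dedup setting for every filesystem and zvol in the pool
    zfs get -r -t filesystem,volume dedup Pool1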

Can you provide the make and model of your existing SSDs? If your 400G SAS drives make viable log devices, we can do a "shuffle": remove your cache SSDs, re-use the log SSD as a mirror for your metadata SSD, replace the dedup HDDs with a pair of SSDs at least as large as the 1.8T drives (1.92T is the most likely size; target a "mixed use" performance drive), and finally put a 400G drive back into service as a log.
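As a very rough sketch of that shuffle (device names below are placeholders, so verify each one against zpool status before running anything):

    # 1. drop the L2ARC devices - cache vdevs can be removed at any time
    zpool remove Pool1 <cache-ssd-1> <cache-ssd-2>

    # 2. remove the current 800G log device
    zpool remove Pool1 <log-ssd>

    # 3. attach the freed 800G SSD as a mirror of the 480G metadata SSD
    zpool attach Pool1 <480G-metadata-ssd> <800G-ssd>

    # 4. swap the dedup HDDs for >=1.92T SSDs, one at a time, letting each resilver finish
    zpool replace Pool1 <dedup-hdd-1> <1.92T-ssd-1>
    zpool replace Pool1 <dedup-hdd-2> <1.92T-ssd-2>

    # 5. put one of the 400G SAS SSDs back in as the log
    zpool add Pool1 log <400G-sas-ssd>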

But realistically, the single 19-wide RAIDZ3 isn't doing you any favors from a performance standpoint either. I'd prefer to see that as 2x 9-wide RAIDZ2 at most, backed up with a mirrored log device, a 3-way mirror for metadata, and optionally a single cache drive. And no dedup at all. :slight_smile: However, that requires a full backup, pool destroy, and recreation.

4 Likes

Thank you again, HoneyBadger, for your prolific and sesquipedalian writing; it's incredibly valuable to me.
I totally translated that DDT data incorrectly, thanks for the explanation.
I didn't know of a command that shows all the datasets with dedup turned on, but I went looking and could not find another dataset with it enabled. We intended many different document types to benefit from dedup, and I know photos and video are not great dedup candidates; we made a mistake with the settings on that dataset, created a new one with the correct settings, and left the Media folder on the old one. So I believe the poor ratio is because of the type of data living there.
We're probably going to migrate all the data off of that dataset, maybe even replicate it to another TrueNAS, then delete the dataset and see what that does to the DDT data. If it's all gone, I will try to delete the dedup vdev.
Then I'll reboot the box, pull out the dedup drives, and replace them with some SSDs to create another metadata vdev. I don't have the exact model of most of the drives, but I do have some details:
Cache: Toshiba 400GB 10x Enterprise Performance SAS SSD(qty 2)
Metadata: Intel 480GB 1x Enterprise Value SATA SSD(qty 1)
Log: Samsung 800GB 10x Enterprise Performance SAS SSD(qty 1)
OS Boot drive: Intel 120GB 1x Enterprise Value SATA SSD
About the only thing I got right was making the cache drives high-I/O-performance drives.
I do have lots of combinations of SSDs available. I have some 1.92TB and 3.84TB SATA enterprise value drives. I have lots of the Samsung 800GB EP SAS.
I like what you said about the 2x 9-wide Z2. I would have done something similar if I were doing RAID5 and creating a RAID50, probably with multiple 4+1 and 5+1 parity sets.
So I can definitely stick another 800GB Samsung drive in, but I feel like the metadata doesn’t need that much space or the crazy performance of a 10x EP drive? What do you think?
The log and cache being the most active, I wanted them on the best drives. But we honestly don't write much to this TrueNAS every day. I only have about 15 users, and even marketing just writes once and then reads it.
Are there commands that let me move metadata/log/cache around so I can swap drives in and out?
Thanks,
Jack

Word of the day for me! Going to start using this in my ticket notes!

You can remove the LOG and cache drives no problem, but you cannot move the metadata vdev, only expand (mirror) it or replace it. You could attach a mirror and then possibly remove (detach) the smaller device too.

I am on a train currently so will update as soon as I’m home with how you can do this if nobody else responds :stuck_out_tongue: