Question about ashift affecting allocated space

Disclaimer: this question is not directly connected to TrueNAS. Let’s assume it is a general ZFS question.

On one of my systems (Proxmox) I had a pool with ashift=9 (0 actually, but zdb showed a value of 9). This pool consisted of a single 2-way NVMe mirror. These NVMe drives were formatted as 512e (despite being 4Kn-capable); as I understand it, that is what caused the automatic ashift of 9 for the pool (and vdev).
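
Something like the following shows the mismatch (the pool name is a placeholder, and I'm going from memory on the exact zdb invocation):

# zpool get ashift <poolname>        # the pool property; 0 means "auto-detect from the device"
# zdb -C <poolname> | grep ashift    # the ashift actually recorded in the vdev configuration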

All datasets/zvols had a recordsize (volblocksize in the case of zvols) <= 128K.

I’ve done the following (roughly sketched below):

  1. renamed this “old” pool (with export/import).
  2. created a new ashift=12 pool (with a 4Kn-formatted drive underneath) with the same name.
  3. taken a recursive (-r) snapshot of my encrypted root dataset.
  4. sent this snapshot to the new pool with -R --raw.
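
A rough sketch of those steps, with placeholder names ("tank" for the pool, <rootds> for the root dataset, <nvme-dev> for the 4Kn-formatted drive) rather than my exact commands:

# zpool export tank
# zpool import tank tank-old                        # step 1: re-import the old pool under a new name
# zpool create -o ashift=12 tank /dev/<nvme-dev>    # step 2: new pool with the old name
# zfs snapshot -r tank-old/<rootds>@migrate         # step 3: recursive snapshot of the encrypted root dataset
# zfs send -R --raw tank-old/<rootds>@migrate | zfs receive tank/<rootds>    # step 4: raw replication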

After rebooting and unlocking the dataset (on the new pool), everything has been working OK so far.

However! zpool list -v is showing almost 20% more allocated space (79G vs 67G) on the new pool.

While I have some assumptions about possible reasons, I would like to hear the opinions/insights of ZFS veterans.

My first hunch is that something didn’t transfer over to the new pool, or that block cloning or deduplication, if applicable, was involved.

What if you compare datasets to each other and check block-cloning with zpool on both pools?
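
Something along these lines, for example (pool names are placeholders):

# zfs list -r -o name,used,referenced,logicalused,compressratio <old-pool>
# zfs list -r -o name,used,referenced,logicalused,compressratio <new-pool>
# zpool list -o name,allocated,dedupratio <old-pool> <new-pool>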

EDIT: It’s possible that the “less efficient” 4096-byte minimum allocations contribute to more space being used (for example, a block whose compressed size is 2.5K allocates 2.5K at ashift=9 but a full 4K at ashift=12), but 20% seems like too much.

EDIT 2: In terms of actual disk usage, I don’t think it would make a difference on drives with 4K physical sectors. Whether you use ashift=9 or ashift=12, the drive cannot physically write less than a 4096-byte unit.


But it is the new pool that is more occupied…

Not sure about block cloning; I haven’t reviewed that part of ZFS yet. I do not use deduplication (even though maybe I should).

I’ve already compared datasets but forgot to mention it. All datasets (at least the ones I’ve randomly checked) show this discrepancy in USED and, more interestingly, in REFER.

Again, I don’t know how to check/troubleshoot block cloning.

Yes, that is exactly my thought as well. I’ve read once (IIRC in one of @mav’s posts) that compression is only applied if it saves at least one ashift-sized sector. Thus, with raw replication, there could be old “tail” chunks that are smaller than 4K. But I doubt they would eat up 20%.

On second thought, perhaps I should look at the block size histograms for the old and new pools.
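
Probably something like this, if I have the zdb flags right (pool names are placeholders):

# zdb -Lbbbs <old-pool>
# zdb -Lbbbs <new-pool>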

Yeah, that is my assumption about the reason as well. ZFS thinks that it occupies less space, and the NVMe drives just don’t report “true” values in the 512e case.
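
As a sanity check on what the drives advertise, something like this should show the logical vs. physical sector size (device name is a placeholder; the PHY-SEC column may or may not say 4096, depending on the firmware):

# lsblk -o NAME,LOG-SEC,PHY-SEC /dev/<nvme-dev>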

On both pools, check with

zpool list -o name,bcloneused,bclonesaved <poolname>

Same topology of vdevs?


I’ll pretend I never said that. Now I have to contact Archive.org to delete all evidence of my stupidity. It’s going to take a long time…


All zeroes on both.

Welp, it was a mirror of 2 NVMe drives. I detached one drive, formatted it to 4K and created a single-drive pool with ashift=12. Now I have two single-drive pools. And yes, I have backups.
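
The detach/reformat part was roughly this (device name is a placeholder; the right --lbaf index depends on the drive, and the format of course wipes it):

# zpool detach <old-pool> /dev/<nvme-dev>                       # drop one side of the old mirror
# nvme id-ns -H /dev/<nvme-namespace> | grep -i 'LBA Format'    # list the drive's LBA formats
# nvme format /dev/<nvme-namespace> --lbaf=<4K-format-index>    # reformat to 4K sectors (destroys data!)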

Block size histograms:

# zdb -Lbbbs <ashift9-pool>
<...>
Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   388K   194M   194M   199K  99.5M  99.5M   218K   109M   109M
     1K:   450K   574M   768M  50.5K  60.6M   160M   607K   731M   840M
     2K:  1.23M  3.43G  4.18G  42.6K   112M   272M  1.24M  3.45G  4.27G
     4K:  1.89M  10.5G  14.7G  24.7K   131M   403M  1.90M  10.5G  14.8G
     8K:  1.28M  12.7G  27.4G  23.0K   263M   665M  1.28M  12.8G  27.5G
    16K:  2.04M  33.0G  60.4G  6.86M   110G   111G  2.01M  32.4G  59.9G
    32K:  56.3K  2.63G  63.0G  15.1K   675M   111G  85.1K  3.38G  63.3G
    64K:  22.1K  1.73G  64.8G  10.0K   905M   112G  32.1K  2.78G  66.1G
   128K:  12.9K  1.61G  66.4G   140K  17.5G   130G  13.0K  1.62G  67.7G
   256K:      0      0  66.4G      0      0   130G    209  57.2M  67.8G
   512K:      0      0  66.4G      0      0   130G      0      0  67.8G
     1M:      0      0  66.4G      0      0   130G      0      0  67.8G
     2M:      0      0  66.4G      0      0   130G      0      0  67.8G
     4M:      0      0  66.4G      0      0   130G      0      0  67.8G
     8M:      0      0  66.4G      0      0   130G      0      0  67.8G
    16M:      0      0  66.4G      0      0   130G      0      0  67.8G
<...>
# zdb -Lbbbs <ashift12-pool>
<...>
Block Size Histogram

  block   psize                lsize                asize
   size   Count   Size   Cum.  Count   Size   Cum.  Count   Size   Cum.
    512:   389K   195M   195M   200K   100M   100M      0      0      0
     1K:   439K   563M   757M  51.5K  61.9M   162M      0      0      0
     2K:  1.23M  3.43G  4.17G  43.5K   114M   276M      0      0      0
     4K:  1.92M  10.6G  14.8G  24.8K   132M   407M  2.18M  8.73G  8.73G
     8K:  1.28M  12.8G  27.6G  23.0K   263M   670M  2.86M  26.4G  35.1G
    16K:  2.07M  33.4G  60.9G  6.90M   111G   111G  2.20M  35.5G  70.6G
    32K:  57.1K  2.68G  63.6G  15.2K   676M   112G   112K  4.19G  74.8G
    64K:  22.0K  1.71G  65.3G  10.0K   906M   113G  36.3K  3.10G  77.9G
   128K:  13.0K  1.62G  66.9G   141K  17.6G   130G  13.1K  1.64G  79.5G
   256K:      0      0  66.9G      0      0   130G      0      0  79.5G
   512K:      0      0  66.9G      0      0   130G      0      0  79.5G
     1M:      0      0  66.9G      0      0   130G      0      0  79.5G
     2M:      0      0  66.9G      0      0   130G      0      0  79.5G
     4M:      0      0  66.9G      0      0   130G      0      0  79.5G
     8M:      0      0  66.9G      0      0   130G      0      0  79.5G
    16M:      0      0  66.9G      0      0   130G      0      0  79.5G
<...>

I think minor differences in Count are ok, as the new pool is in use.


I’m still trying to understand how it works…

So psize and lsize match (within margin of error/new use) but asize is off.
Same compression? But it wouldn’t make sense for the new pool to use a less efficient algorithm than the old pool…

The --raw flag would have preserved compressed blocks as-is from the old pool.


I can’t guarantee 100% that the pools have the same settings. However, I have the pool-creation command in the Proxmox datacenter’s notes, with ashift=12, which IIRC I fixed later after this auto ashift=9 “discovery”. And I used that exact command to create the new pool.

So let’s say I’m 99% sure the pools have the same settings (apart from ashift).
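
If I wanted to be fully sure, I suppose I could diff the properties directly, something like this (pool names are placeholders; the awk just strips the pool-name column so it doesn’t pollute the diff):

# diff <(zpool get all <old-pool> | awk '{$1=""; print}') <(zpool get all <new-pool> | awk '{$1=""; print}')

(Expect some noise from properties like size, free, allocated, and guid that legitimately differ.)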