Is ashift=12 fine for 256k PHY-SEC disks, or should it be ashift=18?

I’m doing some upgrades on my lab TrueNAS box (ElectricEel-24.10.2.2), including adding a new pool of NVMe drives with 256k physical sectors / 4k logical sectors. The default pool creation gave it ashift=12, which seems to be the default across my drives regardless of whether they report 512-byte, 4k, or 256k sectors. Is that fine?

Thanks for any advice.

Some disk details from my system:

# lsblk -o NAME,PATH,MODEL,SIZE,PHY-SEC,LOG-SEC,FSTYPE,FSVER,LABEL

NAME        PATH           MODEL                        SIZE PHY-SEC LOG-SEC   FSTYPE            FSVER LABEL
...
sdb         /dev/sdb       Samsung SSD 870 EVO 2TB      1.8T     512     512 
├─sdb1      /dev/sdb1                                     2G     512     512   linux_raid_member 1.2   TrueNAS02:swap1
└─sdb2      /dev/sdb2                                   1.8T     512     512   zfs_member        5000  TEST_SSD_2TB

sdc         /dev/sdc       Samsung SSD 870 EVO 2TB      1.8T     512     512
├─sdc1      /dev/sdc1                                     2G     512     512   linux_raid_member 1.2   TrueNAS02:swap1
└─sdc2      /dev/sdc2                                   1.8T     512     512   zfs_member        5000  TEST_SSD_2TB

sdd         /dev/sdd       ST18000NM000J-2TV103        16.4T    4096    4096
├─sdd1      /dev/sdd1                                     2G    4096    4096   linux_raid_member 1.2   truenas02:swap0
└─sdd2      /dev/sdd2                                  16.4T    4096    4096   zfs_member        5000  TEST_POOL_18TB
sde         /dev/sde       ST18000NM000J-2TV103        16.4T    4096    4096
├─sde1      /dev/sde1                                     2G    4096    4096   linux_raid_member 1.2   truenas02:swap0
└─sde2      /dev/sde2                                  16.4T    4096    4096   zfs_member        5000  TEST_POOL_18TB

nvme1n1     /dev/nvme1n1   Micron_7450_MTFDKCC3T2TFS    2.9T  262144    4096
└─nvme1n1p1 /dev/nvme1n1p1                              2.9T  262144    4096   zfs_member        5000  TANK_NVME
nvme0n1     /dev/nvme0n1   Micron_7450_MTFDKCC3T2TFS    2.9T  262144    4096
└─nvme0n1p1 /dev/nvme0n1p1                              2.9T  262144    4096   zfs_member        5000  TANK_NVME

… what I’m seeing for ashift values:

# zpool get ashift
NAME            PROPERTY  VALUE   SOURCE
TANK_NVME       ashift    12      local
TEST_SSD_2TB    ashift    12      local
TEST_POOL_18TB  ashift    12      local
1 Like

I think ashift defaults to 12 because the SSDs that report PHY-SEC 512 are actually doing 512e (512-byte emulated, really 4K internally).

There’s another layer to this decision. Regardless of whether the SSD reports 512 or 4K, the unit of writing for SSDs is the flash page. I don’t think I have seen anything larger than 16K yet, and the Micron 7450 should also have flash with 16K pages.
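For what it’s worth, you can check what the namespace itself advertises with nvme-cli and smartmontools (device paths taken from the lsblk output above; the exact wording of the output depends on your nvme-cli version):

# nvme id-ns /dev/nvme0n1 -H | grep -i "lba format"
# smartctl -i /dev/nvme0n1

The first command lists every LBA format the namespace supports and flags the one in use; the second prints the formatted LBA size in its identify summary.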

2 Likes

According to this: https://assets.micron.com/adobe/assets/urn:aaid:aem:d133a40b-b36c-4b17-b768-659acc4d4bca/renditions/original/as/7450-nvme-ssd-tech-prod-spec.pdf you can set it to either 512e or 4k.

The performance numbers it states were measured with 4k, so 4k probably performs better and is the default. 4k is ashift=12, so I guess everything is fine?

AFAIK you can ignore phy-sec, since there is firmware black magic in between and the controller is tuned to perform best with 4k anyway.
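If a drive does offer both, switching between them is a destructive low-level format. The namespaces in the lsblk output above already show a 4096-byte logical sector, so this only matters for drives shipped as 512e, and the LBA format index below is just a guess — check what id-ns lists first (this erases everything on the namespace):

# nvme id-ns /dev/nvme0n1 -H
# nvme format /dev/nvme0n1 --lbaf=1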

1 Like

Is that true? Back when 4k physical sectors were the new kid on the block, even when drives presented as 512 there was all sorts of advice to use ashift=12, matching the physical sectors, to avoid a “write-amplification” performance penalty.

For example, this older Ars Technica article:

This means that a ZFS admin is strongly advised to be aware of the actual sector size of his or her devices, and manually set ashift accordingly. If ashift is set too low, an astronomical read/write amplification penalty is incurred—writing 512 byte “sectors” to a 4KiB real sector means having to write the first “sector”, then read the 4KiB sector, modify it with the second 512 byte “sector”, write it back out to a new 4KiB sector, and so forth, for every single write.

In real world terms, this amplification penalty hits a Samsung EVO SSD—which should have ashift=13, but lies about its sector size and therefore defaults to ashift=9 if not overridden by a savvy admin—hard enough to make it appear slower than a conventional rust disk.

By contrast, there is virtually no penalty to setting ashift too high. There is no real performance penalty, and slack space increases are infinitesimal (or zero, with compression enabled). We strongly recommend even disks that really do use 512 byte sectors should be set ashift=12 or even ashift=13 for future-proofing.
ZFS 101—Understanding ZFS storage and performance - Ars Technica.

… which makes me wonder, now that I see a 256k physical sector, if it’s time for ashift=18?

Since 4k is what most people will use in production, controllers are tuned for that. Which isn’t to say a disk couldn’t perform better with ashift=18!

Please find out for us by running a benchmark :slight_smile:
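If anyone does try it, a rough sketch of a comparison run (the pool name and device path are placeholders — zpool create wipes whatever you point it at, so use a spare disk — and I believe current OpenZFS caps ashift at 16 anyway):

# zpool create -o ashift=12 testashift /dev/sdX
# zfs create -o recordsize=16k testashift/bench
# fio --name=randwrite --directory=/testashift/bench --rw=randwrite --bs=16k --size=4G --runtime=60 --time_based --ioengine=psync --numjobs=1 --group_reporting
# zpool destroy testashift

Then recreate the pool with a different -o ashift= value and compare the IOPS and latency numbers.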

This is IMHO pretty dangerous to say.
Setting the block size too high can have extreme implications.

Assume we have the default volblocksize of 16k and an ashift of 14 (i.e. 16k sectors). That would mean a RAIDZ1 only has the storage efficiency of a mirror (50%), and RAIDZ2 is even worse than a mirror (33%), because each stripe would be one data block plus one or two parity blocks.

If you set ashift to 18 (256k sectors), you would also need to set volblocksize to at least 256k. Even with a two-way mirror, the read and write amplification and fragmentation would be a disaster.
For every, say, 16k read or write your system does, it would have to read or write 256k.
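Quick back-of-the-envelope for those numbers (bash arithmetic, simplified: it ignores RAIDZ padding sectors):

vbs=$((16*1024))                                   # volblocksize 16k
data=$(( vbs >> 14 ))                              # ashift=14: one data sector per block
echo "RAIDZ1: $(( 100 * data / (data + 1) ))%"     # 1 data + 1 parity -> 50%
echo "RAIDZ2: $(( 100 * data / (data + 2) ))%"     # 1 data + 2 parity -> 33%
echo "16k I/O at ashift=18: $(( (1 << 18) / vbs ))x amplification"   # 16x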

But yeah, for a dataset consisting mostly of big files, with a recordsize of at least 256k, ashift=18 is (for non-RAIDZ pool geometries) probably fine.

1 Like

It’s not quite like this. On an SSD you write at page granularity, you mark pages as deleted at page granularity, and you ERASE at block level (a contiguous set of pages). Once a block is ERASED, its pages are ready to be used again.
With this in mind, your writes are distributed to whatever pages are available.

In the background, the controller may copy some pages from one block to another and mark the source pages as deleted. The goal of this process is to create a block made only of deleted pages, so it can be erased and its pages reused for writing.
This copying of pages from one block to another is what causes write amplification.

Moreover, if the page size is 16K and you send 4K, you MAY end up with 12K of the page wasted.
I say may because the controller will attempt to pack more data into that page, for example if you are writing a big file sequentially or sending multiple concurrent small write operations.
The worst case should be low queue depth random writes.

1 Like

I still think that 256K is quite a large value for a drive that small. Maybe there is some error.

EDIT: Read this: ashift=18 needed for NVMe with physical block size 256k · Issue #13917 · openzfs/zfs · GitHub

Maybe you need a firmware upgrade.

3 Likes

More specifically, firmware E2MU200 available from Micron here:

https://www.micron.com/products/storage/ssd/micron-ssd-firmware#accordion-e6c186b05b-item-e124228d2a
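For anyone doing the update by hand instead of through Micron’s own tooling, the generic nvme-cli flow is roughly this (the image filename is a placeholder, and the slot/action values should be checked against the drive’s firmware log and the NVMe spec first):

# nvme fw-log /dev/nvme0
# nvme fw-download /dev/nvme0 --fw=E2MU200_image.bin
# nvme fw-commit /dev/nvme0 --slot=1 --action=1

If I recall the spec right, action 1 commits the downloaded image to the slot and activates it on the next controller reset, so plan for a reboot.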

1 Like

So basically it is a firmware error and the drive is a 4k drive.
Thank god you did not set ashift to 18 :grinning:

1 Like

Wow, thanks so much for the help and your insights @WhiteNoise, @Sara, and @HoneyBadger!

2 Likes

There is another issue at play here, even with 4KB: the ZFS Uber block ring buffer, which keeps track of ZFS transactions. The change from 512 to 4KB reduced this history to 1/8 the number of entries (if I did my math right). Higher ASHIFT values would reduce this further, potentially to the point of no history at all.

Ideally the ZFS Uber block ring buffer would have kept a fixed number of entries and simply grown in size when using an ASHIFT greater than 9.

Unfortunately, making that change today would be a pool-format-breaking change, potentially requiring the “feature” to be active from pool creation time.

2 Likes

I have been hearing about this problem for a while.
Do you know if a change at the ZFS level will be introduced, at least for new pools?

I have no knowledge of if / when a “fix” will occur.

It could very well be that most ZFS developers are ignoring the problem because it is not too bad with ASHIFT=12 / 4KB.


A re-imagining of the problem could be warranted. I think the Uber block ring buffer uses the sector size to allow existing entries to remain untouched during an update. If I remember correctly, this Uber block ring buffer is 64KB in size. Using 512 byte sectors, we get 128 entries. ZFS basically overwrites the oldest entry with the newest writes, allowing both history and the ability to roll back up to 127 transactions (with ever-decreasing chances of a good rollback…).

During pool import, the Uber block ring buffer entry with the highest ZFS write transaction number is normally used. This means you get the latest writes, but have a record of older writes and what they point to.

But, because the design was based on updating one sector at a time as a way to avoid overwriting previous data, increasing the sector size reduced the number of possible entries.

A “new” design based on a “fixed” number of Uber block entries, like what I think is the original 128, should be no problem for modern-sized disks. For example, with 4KB sectors, 128 entries is only 512KB. There would be 2 of these Uber block ring buffers per disk (if I remember correctly), so a total of 1MB would be used. That’s nothing for a 1TB disk. Even 16KB sectors would still be reasonable: 2 x 2MB used.
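Taking those figures at face value (64KB ring, one entry per sector), the fall-off looks like this:

ring=$(( 64 * 1024 ))                        # assumed ring size from above
for ashift in 9 12 14 16 18; do
    echo "ashift=$ashift: $(( ring / (1 << ashift) )) entries"
done
# prints 128, 16, 4, 1 and 0 entries respectively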

Anyway, that is my limited knowledge, which could easily be incomplete or outright wrong.

What you say makes sense to me.
Thanks for taking the time to explain the issue.

1 Like

The other thing larger ashift values will cause is pretty terrible allocation efficiency for small files and especially metadata. ZFS’s smallest unit of allocation is defined by ashift, so every piece of metadata wastes 256k instead of 4k… you also have to mandate larger recordsizes, otherwise whatever compression saves just gets rounded back up to a full sector and becomes waste too.

On SSDs that offer the option to present these bigger sector sizes, you are almost always better off with 4Kn; even if the internal pages are bigger, you won’t lose much in performance.
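To put a number on the slack (assuming each piece of metadata or each small block gets rounded up to one ashift-sized allocation):

for ashift in 12 18; do
    echo "ashift=$ashift: $(( (1 << ashift) - 4096 )) bytes of slack per 4k block"
done
# ashift=12 wastes nothing; ashift=18 wastes ~252k for every 4k block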

Agreed.

However, using a Special Allocation Class device (aka Metadata vDev) would cover both cases: small files and standard metadata entries. Of course, this Metadata vDev would need to use a more reasonable ASHIFT value, like 4KB.

I don’t know if the ASHIFT value for Metadata vDev(s) can differ from the regular data vDevs, but if so, this would almost be the best of both worlds. The small-file cutoff would have to be set to something pretty large if using 256KB ASHIFT on the data vDevs, like 128KB to 256KB.
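I believe ashift can be set per vdev at the time it is added, so something like this should work (pool and device names are placeholders, and the allowed special_small_blocks values depend on the OpenZFS version):

# zpool add -o ashift=12 tank special mirror /dev/sdX /dev/sdY
# zfs set special_small_blocks=128K tank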

Might be a stupid question, but are there even any modern drives that don’t use 4k?

Anything larger than ashift=12 looks to me like one of those “2018 old TrueNAS Forum future predictions” that never happened :grin:

If I remember correctly, it is possible to low-level format some Enterprise HDDs to something other than 4KB. For example, I can see 2KB being useful for large databases that store small items, but a great many of them. Further, these DBs would not need high-speed access, yet are so huge that SSDs (SAS, SATA or NVMe) are not cost effective. Sure, it’s a specialized case, but that’s the Enterprise Data Center world.

Further, I vaguely recall seeing some 16KB-native HDDs. That was long enough ago that it could have been an SSD.

I think some Enterprise HDDs are still available in 512 native, because that is what some servers still use. Enterprise hardware tends to be both behind the state of the art and at the cutting edge (but not the bleeding edge…). This dichotomy comes from the long life a server might have, while new needs might call for something beyond what general consumers can purchase.

1 Like