Partitioning Optane P1600x in SCALE

I’ve got a couple of 118 GB P1600x drives and a couple of pools that will need a SLOG device (VM hosting + DB traffic). I just couldn’t bear to waste an entire drive on a single SLOG when I could use what I have more efficiently and still have redundancy.

I also didn’t find a good guide on how to do this in SCALE, so I pieced together a procedure that seems to work fine from the many guides on partitioning the boot-pool in SCALE, plus the scripts I had adapted from my old CORE setup.

Given that the pools sit behind a pair of 10 Gb/s NICs, I judged 16 GB should be more than enough (5 s of transaction group buffering × ~1.2 GB/s ≈ 6 GB), and I thought I’d use the extra space to prepare for a future jump to 40 Gb/s (it is highly unlikely that synchronous traffic will be coming into my storage at line speed for that long).
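
The 5 s figure assumes the default ZFS transaction group interval; on SCALE it should be possible to confirm it with something like:

cat /sys/module/zfs/parameters/zfs_txg_timeout   # default is 5 (seconds)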

The two P1600xs sit on the nvme0n1 and nvme1n1 devices, respectively. I want to carve out partitions for two SLOG mirrors, one per pool. Here is the procedure I used against a couple of test pools:

# wipe any previous partitions and initialise both disks with GPT
sudo sgdisk -og /dev/nvme0n1
sudo sgdisk -og /dev/nvme1n1

# create 16 GB ZFS partitions for the first SLOG mirror
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme0n1
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme1n1

# make the kernel aware of the new partitions
sudo partprobe

# work out the PARTUUIDs of the new partitions
sudo blkid /dev/nvme0n1p1
sudo blkid /dev/nvme1n1p1
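
Alternatively, something like lsblk should show the partitions and their PARTUUIDs for both drives in a single table:

sudo lsblk -o NAME,SIZE,PARTUUID /dev/nvme0n1 /dev/nvme1n1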

# take the PARTUUIDs from the output above and plug them into the command that adds the new SLOG mirror, i.e.
sudo zpool add vm-nand log mirror /dev/disk/by-partuuid/ce3e51ea-cf36-4f3e-9ae0-f3d7a19a48b1 /dev/disk/by-partuuid/d36a5367-91e4-40f1-b1d5-dfe2539c9ce5

# now repeat for the second SLOG mirror
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme0n1
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme1n1
sudo partprobe
sudo blkid /dev/nvme0n1p2
sudo blkid /dev/nvme1n1p2
sudo zpool add vm-optane log mirror /dev/disk/by-partuuid/5e201121-3b9d-483a-a49d-61920d153bc2 /dev/disk/by-partuuid/ec2dcb95-6a75-4f83-99be-0d45eb99d949

The end result should be something like this:
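
For vm-nand, zpool status should now report the new mirror under a logs section, roughly like this (data vdevs elided; names and indices will vary with the pool layout):

  pool: vm-nand
 state: ONLINE
config:

        NAME                                      STATE     READ WRITE CKSUM
        vm-nand                                   ONLINE       0     0     0
          ...
        logs
          mirror-2                                ONLINE       0     0     0
            ce3e51ea-cf36-4f3e-9ae0-f3d7a19a48b1  ONLINE       0     0     0
            d36a5367-91e4-40f1-b1d5-dfe2539c9ce5  ONLINE       0     0     0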

Aside from the usual caveat about doing this against an appliance OS (you’re basically on your own if anything goes awry), the end result looks promising.

The UI, however, is rightfully confused:

Any suggestions on how to improve this procedure or how to fix the UI?

EDIT: Should you need to remove the log mirror vdev from a pool, it’s best to do this through the CLI:

sudo zpool remove vm-nand mirror-2
sudo zpool remove vm-optane mirror-1

Doing this through the UI throws an error dialog, but ultimately it does seem to remove the log vdev.

Anyone considering this, heed this bit especially.
Doing this style of unsupported partitioning adds failure points and confuses the OS.

Applying this to a special metadata device would be especially ill-advised.

Why?
The above write-up does use PARTUUIDs rather than unreliable device names, and a SLOG can be removed from the pool to revert to a supported configuration.

I’d be wary of putting data or metadata duties on partitions, but SLOG and/or L2ARC are reasonably safe. Of course, slicing and mirroring two drives in this way is a complication over just assigning a single drive as a non-redundant SLOG to each pool.

Is there actually any point in having SLOG vdevs on drives of the same speed as the main data vdevs? (It seems pointless to me, but others may be more expert than me.)

Good catch. AFAIK, there isn’t - a meaningful SLOG device should have considerably lower latency than the pool it’s attached to.

This has just been a PoC, run against a couple of empty test pools that I can safely blow away, to understand what happens in case e.g. the middleware throws a fit or I mess something up badly.

I’d still like to be able to fix the UI… presumably this isn’t much of a problem beyond aesthetics.

Absolutely - do not put any critical member of a pool in such a configuration; doing this with a metadata vdev (or any other critical member of a pool) is just asking for trouble.
If an L2ARC dies, it’s no big deal beyond a possible immediate performance hit. If a SLOG dies, it’s not a catastrophe either, but if the server then dies suddenly, your in-flight sync writes are in peril. If a metadata vdev is lost, your pool is toast.

This is the big thing here - increased maintenance. Prepare scripts, document clearly, and know exactly what needs to happen when one of these drives fails, as it’s no longer as simple as removing the drive from the pool and attaching a new one. This is most likely the reason why this kind of messing around is unsupported by iX.
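
A rough sketch of what replacing a failed Optane might look like with this layout (placeholder PARTUUIDs, and assuming the replacement shows up at the same /dev/nvme1n1 path):

# re-create the same partition layout on the replacement drive
sudo sgdisk -og /dev/nvme1n1
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme1n1
sudo sgdisk -n 0:0:+16G -t 0:BF01 /dev/nvme1n1
sudo partprobe

# look up the new PARTUUIDs
sudo blkid /dev/nvme1n1p1
sudo blkid /dev/nvme1n1p2

# swap the failed members out: old PARTUUID first, new device second (placeholders)
sudo zpool replace vm-nand <old-p1-partuuid> /dev/disk/by-partuuid/<new-p1-partuuid>
sudo zpool replace vm-optane <old-p2-partuuid> /dev/disk/by-partuuid/<new-p2-partuuid>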

That being said, does anyone think that a future update of SCALE is likely to disallow this or somehow mess this up?

SLOG and data for the same pool on the same drive? I don’t see it either, but some may be tempted to use the recipe to have boot+data or boot+apps on the same drives to avoid “wasting” a whole drive for boot.
What I can imagine is someone doing “SLOG for HDD pool” and “fast NVMe pool” on the same drives, and that looks dangerous—unless the content of the fast pool is entirely disposable.

I have myself used SLOG + persistent L2ARC (for the same pool) on a partitioned Optane 900p. Optane has the special ability to maintain its performance in such a mixed read/write workload.
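
Roughly, that kind of split looks like this (the pool name and PARTUUIDs here are placeholders; the partitions are carved out the same way as above):

# log and cache vdevs from two partitions of the same Optane
sudo zpool add tank log /dev/disk/by-partuuid/<slog-partuuid>
sudo zpool add tank cache /dev/disk/by-partuuid/<l2arc-partuuid>

# L2ARC persistence across reboots is governed by this module parameter (1 = enabled)
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled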

I don’t believe that makes sense. In the absence of a “separate log” device, the log is simply written to the main pool devices anyway.

But splitting an Optane into SLOG / L2ARC / special, or using it for an Optane pool, can certainly make sense.

Just not having a SLOG and its pool on the same device.

One interesting thing I came across as I was adding and removing partitions from pools is that each removal introduced a hole. If you look at the vdev names below, it seems a once-used vdev name isn’t reused.

Even more interesting is the summary output of sudo zdb -Pbbbs tank, which mentions a couple of “holes”. Curiously, it lists only two of them (I’d expected three from the previous output).
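
The hole slots should also be visible directly in the cached pool configuration, with something like:

sudo zdb -C tank | grep -E "children|type:"   # removed slots should show up as type: 'hole'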

Does anyone know if this is significant or problematic? Could this cause problems in the future if, e.g., the pool is expanded with another (data) vdev?

When you remove a normal vdev, it leaves a pseudo vdev in its place with a redirection table for all the blocks.

I believe this table is trimmed as the blocks are rewritten and - though I’m not 100% sure - it may eventually be removed once no blocks in the pool depend on the redirection.

shrug
I am not saying that is necessarily how it should be, just that it is how it is. I believe we are in agreement for the most part: A) it’s unsupported, and iX will not test for this scenario; B) you say it complicates the setup, which is basically what I mean when I say it adds failure points; and C) the OS doesn’t know how to handle it, which is what sparked @dxun’s question in this thread about how to fix the GUI bugs.

You’re right that losing the SLOG or L2ARC isn’t a critical event (unless you run a database, maybe), but it will result in an unmountable pool until you remove the affected vdev(s). That’s a minor interruption for a seasoned ZFS user, but a showstopper for a novice who didn’t know what they were doing, until they can get help.

When I replied to this thread, my main concern wasn’t that @dxun was at risk of losing stuff; he appears to have a healthy understanding of the risks and has taken precautions to safeguard his data against the dangers. The target of my comment was rather the novices reading his guide and misusing the knowledge, resulting in irreparable data loss. Understand the risks.

Thanks, this is good to know. Do you know if there are any adverse effects or complications if these holes are present?

There’s a tiny amount of memory overhead to hold the tables for indirect blocks on removed vdevs, which will decrease as the data is rewritten.

Should be visible on zpool status -v
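
For example, on a pool where a data vdev has been removed, something like this should reveal the leftover entries:

sudo zpool status -v tank | grep -i indirect   # removed data vdevs appear as indirect-<n>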

Vaguely similarly, I manually partitioned an SSD to use as several L2ARCs for several pools.

The TrueNAS UI gets a little confused, but it works okay.