Pool Layout Questions

Hello Guys,

I know that redundancy does not guarantee data safety, that RAID is not a backup, and that there should be multiple copies of important data (3-2-1). When it comes to pool layout, I know that Z1 is quite risky: if a disk fails, the resilver/pool rebuild after the drive replacement can stress the remaining drives. That is well understood for spinning drives, but does it still apply to SSDs as well?

In short, I’m asking about the reliability of Z1 on an HDD pool versus an SSD pool. Any guidance would be highly appreciated.

Thanks

I would say that depends on the size and the quality of the SSDs.
A RAIDZ1 made of enterprise-grade SSDs is probably safer than one made of no-name SSDs from AliExpress.

The question you have to ask yourself is: are you OK with having no redundancy after one drive fails, or not?

2 Likes

It’s all about the pool being vulnerable during the (possibly lengthy) resilver after losing a member, so it applies irrespective of the kind of drive.
But SSDs would resilver (much) faster, and typically boast lower URE rates than HDDs, so the risk window is shorter.

Besides your tolerance for risk, the question is (ta-da!!!):

What’s the use case?

Anything dealing with small blocks (databases, apps, VMs, zvols) benefits from mirrors.
So are you doing bulk storage on SSDs, to make the best use of RAIDZ?

1 Like

Oh, yeah, absolutely. That makes sense. In my case, the drives are Intel DC-series SSDs.

My main question is: if a drive fails, I replace it and resilver the pool, what happens if another disk fails during that time?

Gotcha!

URE referring to the unrecoverable error count, I guess? Yes, I also think the risk is lower, since the resilver time would be shorter.

Just normal backups from Veeam, Acronis, etc.

The other day I asked a question about Immich deployment here. If you could have a look and guide me on that, it would be really helpful.

No, not really.

URE = Unrecoverable Read Error

HDDs and SSDs have a stated statistical chance of those occurring, published in the storage device’s specifications. The odds of reading an entire drive without hitting one used to be pretty good, up until we started seeing >2 TB storage devices. Then, the odds start to take a bad turn.
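To put a rough number on it, here is a minimal sketch of that spec-sheet math, assuming independent bit errors at the published rate (real failures are more correlated than this model, and the rates below are typical spec-sheet figures, not any particular drive):

```python
import math

def p_ure(capacity_tb: float, ber: float) -> float:
    """Probability of at least one URE when reading capacity_tb terabytes
    at a per-bit error rate of ber, assuming independent bit errors."""
    bits = capacity_tb * 1e12 * 8                # TB -> bits
    # 1 - (1 - ber)^bits, computed stably with log1p/expm1
    return -math.expm1(bits * math.log1p(-ber))

# Typical spec-sheet rates: consumer HDD ~1e-14, enterprise SSD ~1e-17
print(f"12 TB HDD @ 1e-14: {p_ure(12, 1e-14):.1%}")   # roughly 60%
print(f"12 TB SSD @ 1e-17: {p_ure(12, 1e-17):.2%}")   # well under 1%
```

This is why the lower URE rates on (good) SSDs matter: reading the same 12 TB during a resilver, the naive model puts the HDD at better-than-even odds of hitting a URE, while the enterprise SSD stays around a tenth of a percent.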

Having one or just a few UREs when using ZFS is not fatal, but it can take out parts of one or more files. You would then need to recover the affected file(s).

If using Dataset defaults, and if the URE(s) happen to affect Metadata, the chance of data loss is VERY LOW. This is because Metadata is redundant by default, EVEN ON a RAID-Zx vDev / pool.


To be clear, it is entirely possible for a second storage device, HDD or SSD, to fail completely during a re-silver. And if you are not using “replace in place” for the first failure, then with RAID-Z1 that means total pool loss.

By “replace in place”, I mean replacing a storage device that has not failed completely with another while both are installed. ZFS will simply Mirror the 2 devices, and when the resilver is complete, detach the failing device. For any block unavailable on the source disk, ZFS will use whatever redundancy is available, like RAID-Z1 parity.
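A sketch of that replace-in-place sequence, assuming a pool named `tank`; the device names here are placeholders (check yours with `zpool status`):

```shell
# The old disk is failing but still readable: leave it attached and
# replace it in place, so ZFS can still read from it during the resilver.
zpool status tank                              # identify the failing device
zpool replace tank ata-OLD_DISK ata-NEW_DISK   # old disk stays attached
zpool status tank                              # watch resilver progress
# ZFS temporarily mirrors old -> new, then detaches the old device
# automatically once the resilver completes.
```

If the old disk had already failed completely, the same `zpool replace` works, but every block then has to be reconstructed from parity alone, which is exactly the risk window discussed above.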

1 Like

IMHO SSDs are more prone to failure, because of firmware errors.

So I would recommend you use two different drives, from different vendors, with different controllers. That way something like a “Samsung: we instantly shut down instead of throttling” or an “ADATA: we lie about sync writes” firmware bug does not lead to the loss of the pool, just of one disk 🙂

1 Like