Pool Layout Help

Hello guys,

So, I've finally got all the drives and I'm planning to create a pool from scratch. I see four options for the layout:

  • Single VDEV: 16x16TB disks in RAID-Z2. However, the ZFS primer says: "We do not recommend using more than 12 disks per vdev. The recommended number of disks per vdev is between 3 and 9. With more disks, use multiple vdevs." So this is definitely not a good option.
  • 4x VDEV: 4x16TB disks per VDEV in RAID-Z2. RAID-Z2 needs 2 parity disks plus at least 2 data disks, so a 4-disk VDEV is the bare minimum. Multiple VDEVs would give me good performance (more IOPS), but I lose 2 disks per VDEV, which means 8 disks in total go to parity (2 disks x 4 VDEVs = 8 disks). That leaves me only half the space, though with good IOPS, I think.
  • 8x Mirrors: This may work, but I don't prefer it in my case as I don't need that much redundancy at the moment. I'm trying to strike a balance between performance and redundancy.
  • 2x VDEV: 8x16TB disks per VDEV in RAID-Z2. This is what I'm actually planning to go with (rough create-command sketch right after this list). That's 2 parity disks per VDEV, so 4 disks in total (2 disks x 2 VDEVs), which leaves 12 disks' worth of capacity for data, and that's fine. I think this layout strikes the right balance between performance and redundancy: it offers 2 VDEVs, uses RAID-Z2, and the pool can tolerate up to 4 disk failures (2 per VDEV) and still stay operational. As far as I'm aware, 4 disks failing at the same time (2 from each VDEV, or 4 from a single VDEV) is very unlikely, though it could happen in a catastrophic event (a PSU short-circuit, fire, a bad batch, or just extreme bad luck). So if a disk fails in VDEV1, I can replace it easily, and if another disk fails during the resilver, the pool will still work/survive (am I right here?), unless a third disk in VDEV1 fails during the resilver, which I guess would make that VDEV dead. I'm not sure about it.
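For reference, I expect the create command for this layout to look roughly like this (just my own sketch; "tank" and the device names are placeholders, and I would actually build it through the TrueNAS UI):

    # 2 RAID-Z2 VDEVs of 8 disks each in a single pool (pool name and
    # device names are placeholders)
    zpool create tank \
        raidz2 da0 da1 da2 da3 da4 da5 da6 da7 \
        raidz2 da8 da9 da10 da11 da12 da13 da14 da15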

My second question: if there are 2 VDEVs in RAID-Z2 with 8 disks each, and suppose 2 disks in VDEV1 fail in a row, the pool will still be operational, I guess. I can easily replace the two faulty disks and resilver the pool. However, I'm worried about what happens if another disk (a third one in VDEV1) fails in the same VDEV. Then I think the pool is toast, yeah? I know that's unlikely, as I wrote above with the exceptions, but I'm just trying to understand the basics before the actual deployment.

Also, I'll do the burn-in tests as per the experts' recommendations on this forum. Any idea if I can do them through SeaTools? It's available for Windows, Linux and DOS. I'm thinking of running a SMART test + short test + long test on each drive. Would that be sufficient for catching pre-failure of the disks, or is there a SMART script out there for this specific burn-in test?

Thanks

2 pools of RAIDZ2 is what I would do, and yes, if 3 drives fail at any time in either vdev, your data is smoked. That seems very unlikely, especially if you burn in the drives before deploying. Additionally, it is hella easy to replicate ZFS to another system (or do a cloud backup to Backblaze/Storj/Amazon/etc). You could go RAIDZ3, but then if 4 drives fail at the same time you are hosed, and if 5 drives fail… you see how this spins out of control. If you are always near your system, keep a replacement drive on hand; if one drive starts to fail, replace it. RAIDZ2 with a good backup covers you IMHO.

I would not rely on SMART long. badblocks works great and it will give your drives much more work than SMART long. You can use SeaTools to test, but I am unclear on what it actually does when testing. Long Generic might be okay for testing with SeaTools.
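If you do go the badblocks route, the usual destructive write test looks something like this (a sketch; /dev/sdX is a placeholder for each drive and -b 4096 assumes 4K-sector drives; note it wipes the disk, so only run it before you build the pool):

    # Write-mode test: writes and verifies four patterns across the whole drive.
    # -b 4096 : block size (assumption: 4K-sector drives)
    # -ws     : destructive write test, show progress
    # -o      : log any bad blocks to a file
    badblocks -b 4096 -ws -o /root/badblocks-sdX.log /dev/sdX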

What’s your intended use case? If this is just home-use storage for one or two users, I’d do 2x 8wZ2.

Some people use badblocks for disk burn-in but I think it’s overkill for home use. Just start loading up data and if a disk throws some errors, get a new one in there ASAP. It’s not a bad idea to order an extra disk to keep on hand (or, ideally, install as a hot spare).


2 pools? I think you meant 2 VDEVs, since the pool will be a single one.

OMG. I've heard of this but never actually tried it. Any link where I can read more about it and learn how to actually set it up, in case I need it someday?

Yes, I guess the same theory applies even to mirrors. Doesn't it?

Yes, pretty near. A few meters away only :slight_smile:
Regarding the backup, do you mean the snapshots feature, or setting up another TrueNAS server to hold a copy of the data? I'm clueless about how snapshots and backups work. I initially had questions about snapshots, but it sounded too complex and I was afraid to set it up. I'm also not sure how large the backup server's disks should be, since the main NAS will grow over time, so I don't know how much total space the backup server would need.

So you mean long tests are bad and actually a waste?

What do you mean by badblocks here? Can you please explain a little? Do you mean bad sectors?

It's a tool provided by Seagate to test the drives.

Yes, that's what I was talking about.

I also know a few other options, such as HD Tune Pro and HD Sentinel.

Well, I'm not sure what to call it, but I'll explain the scenario. I intend it to store mostly cold data (libraries, apps, programs, personal photos, project files and rendered videos). From time to time I need to keep adding data, and I pull out particular files when I need them, or sometimes just view them as needed. I think it's more of an archival-type server (if I'm guessing correctly; I'm not an expert, so not sure). I would be an active user, and one other person would need to access it the same way, so 2 users only. It will be accessed from 2 MacPro7,1 machines (sometimes a single user, sometimes 2 users at the same time), and read/write operations would be via the Finder.
Also, I don't need it to operate 24x7, as I'm planning to build an active NAS soon; it will stay on while I'm at work and be shut down when I leave work.

This is more work-related and stays in the office (the office is at home, xD). Two users, yes. With the help of @etorix it seemed wise to use 2x 8-wide Z2, as it gives the best of both worlds, redundancy and performance, up to an extent.

Is it some kind of app or script? I've never heard of it before ;(

Two years ago, I guess, the other NAS was used in the same way I explained above. Out of nowhere, a disk started developing bad sectors and, before I could replace it, it eventually failed. That affected my work, so I'm a bit scared and trying to be more cautious this time. Back then I wasn't aware of these things and used RAID-Z1, since RAID-Z2 was giving me the same kind of speeds over 10G (AQC107, Base-T), so I just went ahead with RAID-Z1.

Yes, I realised that, so I'm keeping a few extra ones as spares this time.

Also, I would like to know how much real-world performance I can expect from this setup. Here are the specs:

MB: Supermicro X12SPA-TF
CPU: Intel Xeon Platinum 8362
RAM: 16x32GB SK Hynix DDR4 RDIMM ECC 3200MHz
Boot Disk: 2xSK Hynix PC801 512GB PCIe 4.0 (OEM) (I plan to use them as a mirror)
HBA: LSI 9400-16i (flashed in IT mode)
Data Disks: 16xSeagate EXOS X18 16TB
SAS Cable: 4xSupermicro CBL-SAST-0556
NIC: I have several options: Intel XXV710-DA2 (25GbE, PCIe 3.0), E810-XXVDA2 (25GbE, PCIe 4.0) and XL710-DA2 (40GbE, PCIe 3.0). I can install whichever works best, as per your suggestions.
PSU: Antec Signature Platinum 1300W

Can this setup easily saturate a 25GbE single stream?
By performance, I basically mean read/write speed via SMB.

Whew, too much quoting. :slight_smile:

  1. Yes, I meant 2 VDEVs, sorry about that.

  2. I think this video is still good; Tom does a great job explaining snapshots and ZFS replication.

  3. Yup, you can go down quite the rabbit hole (I have). I settled on RAIDZ2, a cold spare sitting on my desk, replication to another server, and Backblaze backups of important data.

  4. Yes, snapshots and replication to another server; bonus points if it is offsite. You can grow the replicated system as your primary system grows, and it is also simple to delete and recreate your backup system as needed, as long as you have the bandwidth to re-replicate (my replica sits on Starlink and takes a long time to replicate terabytes of data). There is a rough sketch of what replication does under the hood right after this list.

  5. Long SMART tests are not a waste, they just don't hit the drive as hard as badblocks or SeaTools. I run a SMART long test on each drive once a week, since it can run without killing system performance. badblocks makes the system nearly unusable, and SeaTools makes the entire system unusable. SMART is not good for burn-in testing, but it is good maintenance.

  6. badblocks is software built into your system that you can use to heavily test your drives. Search this forum for burning in drives with badblocks and you will find lots of information. badblocks and other tools beyond SMART long should not be used on an ongoing basis, since you don't want to pound your drives all the time. :smiley:

  7. badblocks is free and built in, but whatever tool you want to use to hit all of the sectors on your drives will work.
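As promised above, the UI replication task boils down to roughly this under the hood (a sketch with made-up pool, dataset and host names; the built-in TrueNAS tasks handle snapshot naming and incrementals for you):

    # Take a recursive snapshot of the dataset (names are illustrative)
    zfs snapshot -r tank/data@backup-2024-01-01

    # Initial full send to the backup server
    zfs send -R tank/data@backup-2024-01-01 | ssh backup-nas zfs recv -u backuppool/data

    # Later runs only send the changes between two snapshots (incremental)
    zfs send -R -i tank/data@backup-2024-01-01 tank/data@backup-2024-01-08 | \
        ssh backup-nas zfs recv -u backuppool/data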


I apologise :upside_down_face:

Great :slight_smile:

I'll definitely check that out. My basic question regarding snapshots in TrueNAS: do they work the same way macOS stores snapshots for Time Machine, where I can simply mount/browse them and view the files? The second question is, how many disks do snapshots need, and how much space does a single snapshot require? Does a snapshot actually store all the data like Time Machine, or is it incremental, with the size varying according to the changes in the actual pool data?

Wow. Quite a nice strategy, I must say :slight_smile:
How do you replicate to another server? I mean, what software/script do you use? And does your other server have the same capacity and pool layout as well?

Wow. How large is your pool?

Oh, I see.

Yes, makes sense.

Of course. What I meant is: first I do the SMART test; if a disk passes that, I run the short generic test, and if it passes that, I run the long generic test. Same for all 16 drives. I'm kinda new to this badblocks thingy; I only came to know about it today because of you and @jro. xD

If a disk doesn't pass the initial SMART test, I'd save time and simply go for an RMA. BTW, any idea how long a full long generic test takes for a single 16TB HDD? That way I can make sure I'm available to keep an eye on it.

Wow? Really? I never knew that.

Cool. Will look into that.

Yes, yes.

Then I think SeaTools would be best, as I can use it in DOS mode. The other options are HD Sentinel and HD Tune Pro; both have disk surface tests (short and long).

It sort of works the same way, but generally you need to restore the entire pool or dataset at the same time. You can see the replicated datasets (read-only) on the replica server and pull individual files from there.

Snapshots are built into ZFS; you just set them up in the UI, the same way you set up your replication tasks.
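To your Time Machine questions: snapshots live inside the same pool (no extra disks), cost almost nothing when taken, and only grow as the data they reference changes. A quick sketch of how you can look at them from the shell (the pool, dataset and snapshot names are placeholders):

    # List snapshots with the space each one uniquely holds (USED column)
    zfs list -t snapshot -o name,used,referenced tank/data

    # Browse a snapshot read-only through the hidden .zfs directory
    ls /mnt/tank/data/.zfs/snapshot/backup-2024-01-01/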

My primary is 8x16TB RAIDZ2; the replica is 6x16TB RAIDZ2 (for replication) plus 2x16TB for local data.

Yes, SMART long first makes sense. On my system SMART long takes 28 hours or so and full drive sector scans take 23ish hours per drive. The nice thing with SMART is you can run the test on all drives at the same time.
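If it helps, kicking off the long test on every drive at once from the shell is just a loop over the device nodes (a sketch; adjust the device list to match your system):

    # Start a long self-test on each drive; the test runs inside the drive's
    # own firmware, so all 16 can run in parallel
    for disk in /dev/sd{a..p}; do
        smartctl -t long "$disk"
    done

    # Check progress and results on a drive later
    smartctl -a /dev/sda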

That should work, just remember it is about 1 day per drive, so it is a lot of days unless you have multiple systems you can run the drive tests on at the same time.

WOW. I think I need to try it first before I actually set it up on a production server.

Damn ;( 28hrs for a single drive. Shit

Testing all the drives at once is a bad idea, yeah?

You can't test them all at once from SeaTools (I don't think), but you can run badblocks on multiple drives at the same time. You just have to watch your heat: banging on all drives at once can lead to them overheating (this happened to me), so have a lot of air moving across your drives and coolish room temperatures.
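A simple way to keep an eye on temperatures while badblocks is pounding the drives (a sketch; the exact attribute name varies a bit between drive models):

    # Print each drive's temperature every 5 minutes
    while true; do
        for disk in /dev/sd{a..p}; do
            echo -n "$disk: "
            smartctl -A "$disk" | grep -i temperature
        done
        sleep 300
    done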


Cool. Thanks for all your input. I really appreciate that you took the time to answer my questions.

Have a wonderful day ahead!


I only read your OP and I want to clarify (emphasize) this. Fault tolerance/redundancy exists ONLY at the vdev level and NOT at the pool level. So that is only 2 disks from each vdev. If a vdev fails, the entire pool fails. That means if a third disk fails from the same vdev, your entire pool will fail.

Are you sure about that? Everything I have read says that if the drives fail within a VDEV, only the VDEV fails and not the entire pool (which makes sense, right?). I am willing to admit my understanding might be wrong; please provide more information on this, as it is 100% different from what I am reading (and from querying Grok/ChatGPT).

Yes, that is understood, and @Theo clarified this!

But I'm not sure whether the entire pool fails when a VDEV fails. I think you're right, though, because the data is stored across both VDEVs, which collectively form the pool. So yes, up to 2 disk failures in each VDEV and the pool continues to operate, which makes 4 disks in total.

If 2 disks fail in VDEV1, the pool still works, but if you replace the disks and resilver and one more disk fails from the same VDEV1, it's toast. This is what you mean, right, @Whattteva? And this will result in the entire pool becoming dead/offline.

OMG. Now I'm stuck on this as well.

@jro Could you please confirm what @Whattteva said? I think I understood it correctly, and @Whattteva is probably right, given that the data is stored across the rest of the disks besides the parity drives!

If you lose a vdev you lose your pool.


I expected it.

This is what I am reading, why is this wrong?

Data Storage in ZFS Pool

In a RAIDZ2 configuration with a 2-vdev pool, data is not stored in both vdevs simultaneously in the same way. Instead, each vdev in a RAIDZ2 configuration is a separate RAIDZ2 vdev, meaning that each vdev can independently store data and parity information.

For example, if you have a pool with two RAIDZ2 vdevs, each vdev will contain its own set of data and parity blocks. This means that data written to the pool is distributed across the disks within each vdev, but not directly mirrored or striped between the vdevs themselves. Each vdev can tolerate the failure of up to two disks without losing data.

However, it’s important to note that if one vdev fails, the entire pool will still be at risk if the failure is not addressed. Therefore, it is recommended to have multiple smaller vdevs to minimize the risk of data loss in the event of a vdev failure.

Yes

Yes

But if I'm not wrong, the 2 VDEVs make up one complete pool. You don't have two separate pools, so if a VDEV has more disk failures than it can tolerate, you lose that VDEV and the pool goes offline. The pool will continue to operate in the following cases:

  • up to a single disk failure in each VDEV
  • up to two disk failures in each VDEV

In the second case, each VDEV has effectively lost all of its redundancy (call it a stripe). And if a drive has a potential issue, or is more stressed during the resilver, or something else goes wrong and one more disk fails in either VDEV, the entire pool is toast and recovery is next to impossible for ZFS.

Disks can fail, but two disks failing at the same time in one VDEV is very unlikely, unless the disks are from a bad batch or were not fully tested before deployment. The usual scenario would be: a disk fails in a VDEV, you replace it and rebuild, and if another disk fails during the rebuild, it will still continue to rebuild, I guess (even though that VDEV now has no redundancy left, effectively a stripe).

Is that correct @Johnny_Fartpants ?

Data is dynamically striped across vdevs, not too dissimilar to how RAID 0 works, and we all know what happens if you lose a drive in that config :bomb:
