HBA with Expander in TN - A whole book of questions

I am currently looking for an HBA upgrade for my server that is running TrueNAS-13.0-U6.1.

*** I had to butcher the links because the forum only allows me two, which makes it hard to explain everything, sorry ***

There are multiple reasons I want to switch to a different/better model than my LSI 9300-16i. First of all, I want to get all my disks onto one PCIe slot to avoid possible future problems when troubleshooting a virtualized TrueNAS instance on my Unraid server. I had split my HBA across two separate IOMMU groups, which were handled differently by the hypervisor. Currently, I have everything scattered over the LSI, the onboard HBA (4x SATA + 1x SlimSAS → 4x SATA = 8 in total), and I even had one of those (apparently dangerous, because JBOD) adapters to get another two SATA ports from the M.2 WiFi slot.

On top of that, I was not totally happy with the upgradability/expansion options of the system, because I was already at the limit of SATA and even PCIe ports while not even fully utilizing the PCIe bandwidth. With that being said, to clean up my drive mess I want to put all these drives on one single HBA, move them into an old office computer (something like a 4th-gen i7), and test the existing components for a week to ensure long-term stability (and avoid bugs created by virtualization).

So, I did a few days of research and figured out that there are basically no HBAs with more than 24 internal ports (using SAS/SFF-to-SATA breakout cables). What I found instead are the following types of expanders, which are apparently used to fan one HBA port out to many more ports: /www.ebay.com/itm/165350815615

Or this: HBA Expander 2

And after finding these types of products, even more questions came up that I didn’t find a definitive answer to:

  1. Is the setup different on TrueNAS? (This one is easy :slight_smile: ) There was literally no setup needed for the LSI 9300 I am currently using. So my guess is it’s the same with e.g. this HBA: /www.ebay.com/itm/166481101887 + the expander card → just put them in their slots and they work?
  2. How do I wire these up? I couldn’t find dedicated wiring diagrams (only YT videos) that explain how to combine the expander with the HBA across PCIe slots. Based on the videos and the official documentation, I made a hypothetical wiring diagram, so as I understand it, it works like the following:

Basically, I want to be able to attach 32 SATA devices while keeping spare SFF ports for (potential) U.2 drives. Of these 32 SATA devices, I want to use 18 ports for HDDs (2 pools (one x15 + one x2) + a hot spare) and 13 ports for SSDs (1 pool + 1 hot spare), leaving one port free. Would the wiring work as I illustrated? Basically, my idea was to adapt the one big SFF (8654) connector to two smaller ones (8087) that are both connected to the expander. I read that these expanders basically work like unmanaged switches, so my idea was to use the ports as either inputs or outputs on this model; am I wrong about this? In another video I saw two slots marked as inputs, but not on the two models I linked above. Are there always two inputs? Can passthrough be a problem?
I only found one source for the SFF-8087 to 8x SATA cable. Are they reliable / a potential bottleneck? /www.datastoragecables.com/sas/internal/minisas68-satax8/ (A rough port-count sketch for my plan follows below.)
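To sanity-check the port math of this plan, here is a minimal sketch. The lane counts per connector type are standard, but the HBA and expander port counts (three SFF-8654 x8 ports on the HBA, a 36-lane expander with nine 4-lane connectors, two of them used as uplink) are assumptions on my part and would need to be checked against the actual models:

```python
# Hypothetical port budget for "one HBA wide port -> expander -> SATA breakouts".
# Lane counts per connector type are standard; the HBA and expander port counts
# below are assumptions, not confirmed specs of any particular model.

LANES_PER_SFF_8654_8I = 8   # one SlimSAS 8i wide port = 2x 4-lane mini-SAS
LANES_PER_SFF_8087 = 4      # each mini-SAS connector breaks out to 4 SATA drives

hba_wide_ports = 3          # assumed: a card with three SFF-8654 x8 connectors
expander_connectors = 9     # assumed: a 36-lane expander = 9x 4-lane connectors

uplink_connectors = 2       # the HBA wide port arrives as 2x SFF-8087 on the expander
downstream = expander_connectors - uplink_connectors
sata_behind_expander = downstream * LANES_PER_SFF_8087

spare_wide_ports = hba_wide_ports - 1
u2_on_spare_ports = spare_wide_ports * LANES_PER_SFF_8654_8I // 4   # U.2 assumed at x4 each

print(f"SATA drives behind the expander: {sata_behind_expander}")
print(f"Spare HBA wide ports: {spare_wide_ports} -> up to {u2_on_spare_ports} U.2 at x4")
```

With these assumed numbers, a single x8 uplink leaves 28 SATA ports behind the expander, so getting to 32 would need either a larger expander or a second HBA port feeding it.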

  3. So, talking speeds: assuming I have a lot of HDDs and a bunch of SSDs (as illustrated above), I come up with a required bandwidth of about 62 Gbit/s ≈ 7.75 GB/s. While the PCIe Gen4 x8 slot should have plenty of bandwidth left, I am curious about the cables, the expanders and, most importantly, the SFF-8654 port. Will the system bottleneck when connected to only one SFF-8654 port of the HBA? Hypothetically hooking up only SSDs would bottleneck the system by a factor of 2.25 at the PCIe link (36 GB/s needed vs. only 16 GB/s). Would the number be much lower because only one SFF port is used (so not 1/3 because of 3 ports, but for example only 1/4 of the PCIe speed available)? (I have added a rough back-of-the-envelope sketch below.)
  4. Why only one SFF port?? → I am currently using (oh boy, here it comes) SLOG, L2ARC and dedup vdevs in my system to tinker around with. I know this is a big discussion point where lots of experts just say: avoid it, you won’t benefit from it either way! BUT I want to make up my own mind on this, and I think there is a good chance of seeing a difference on a system with >30 drives. Especially dedup has good potential to save a few percent of storage without losing performance, because it only needs to process data that I already had on my system (argument taken from: /www.youtube.com/watch?v=KjjSJJLKS_s&t=666s).
    I want to use the SLOG for a bunch of other VMs I want to deploy and see if I benefit from it, but right now it is more experimental for me.
    For now, my dedup as well as my SLOG for the SSD mirror vdevs is placed on two Intel Optane 900P drives. I didn’t split them onto separate devices because I lack the PCIe lanes for that, and it would be a big waste of NVMe considering the SLOG “only” gets 10 Gbit/s of data in. The main problem is that this works well for my mirror vdevs but not for my RAIDZ2 main storage pool, because I am lacking the third NVMe for the redundancy rule. My idea was to switch to these: U.2 SSD
    which have the same speeds as these: /harddiskdirect.com/ssdped1d015tax1-intel-optane-ssd-905p-series-1-5tb-pci-express-nvme-3-0-x4-hhhl-solid-state-drive.html?srsltid=AfmBOoqnCqqRmN8-O0aVa2a2e5e84q5Xx0heQeDLwXljfk1BEHGSPvb63vQ, but are way cheaper.

I could attach these to the unused ports of the HBA easily with SFF-8654 to 4x U.2 cables. Could the U.2 form factor, compared to the HHHL card, be a problem? The speeds are the same, but what about IOPS (ARK says they are the same, if I didn’t read it wrong)? Are these types of drives not suitable for special vdevs, SLOGs, dedup vdevs? From what I saw they are good for this use case, but I am not sure.
So, considering the Intel U.2 drives, I could easily expand my special vdevs while keeping the same, if not better, integrity layout (potentially RAIDZ2 or a mirror with 4 drives). Next to this, the space requirement is also getting big for a capacity of potentially up to 150 TB for my future main pool (15x 20 TB in RAIDZ2 x3 ≈ 150 TB net), with 1 GB (of RAM) for dedup per 1 TB of space.
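To make the speed question concrete, here is a rough back-of-the-envelope sketch of how I compare the numbers. All per-drive rates and link figures are assumptions (sequential best case); with more conservative per-drive figures I land near the ~62 Gbit/s estimated above:

```python
# Rough bandwidth budget for the planned layout; all per-device rates are assumptions.
hdd_count, hdd_rate = 18, 0.25      # ~250 MB/s sequential per HDD (assumption)
ssd_count, ssd_rate = 13, 0.55      # ~550 MB/s per SATA SSD (assumption)

demand = hdd_count * hdd_rate + ssd_count * ssd_rate   # GB/s if every drive runs flat out

# Approximate usable link limits (GB/s):
sas3_x8_wide_port = 8 * 1.2    # one SFF-8654 x8 wide port, 8 lanes of 12 Gb/s SAS-3
pcie_gen3_x8 = 7.9             # what a Gen3 card like the 9300-16i gets
pcie_gen4_x8 = 15.8            # only relevant with a Gen4 HBA

print(f"Aggregate demand:       {demand:.1f} GB/s")
print(f"One SAS-3 x8 wide port: {sas3_x8_wide_port:.1f} GB/s")
print(f"PCIe Gen3 x8 ceiling:   {pcie_gen3_x8:.1f} GB/s")
print(f"PCIe Gen4 x8 ceiling:   {pcie_gen4_x8:.1f} GB/s")
```

With these assumptions, a single SFF-8654 x8 port is roughly in the same ballpark as the total demand, and the PCIe generation of the card decides whether the slot or the SAS link is the tighter limit.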

  5. Bottlenecks: These LSI cards can actually connect up to 240 SATA devices, which would bottleneck even with only HDDs attached (assuming 0.16 GB/s × 240 = 38.4 GB/s > 16 GB/s from the PCIe). Not even talking about all SSDs, or the supported 32x NVMe/SAS devices. In my setup I don’t mind bottlenecks as long as they are symmetric, i.e. all drives are throttled roughly in proportion to their native speeds. If I were, for example, to attach 4x or even 8x U.2 drives over the two unused SFF-8654 ports and run them under full load in VMs (not over the network), and that made the “main” HDD storage pool unusable, then this would render the whole idea useless for my case. Does someone have experience with the bottlenecks of these cards? How does ZFS/TrueNAS handle this? Is it bad for special vdevs to put them on shared PCIe connections? We are still talking about gigabytes per second that could potentially be reached for each separate pool and vdev (so still NVMe/PCIe speeds that would theoretically be available). (A quick oversubscription sketch follows after these questions.)

  6. Another simple one: do all of these SAS expanders only need power over PCIe? I don’t have any slots left, which is why I want to put this card into an externally powered PCIe slot. Could that lead to problems? Advice? Big no-go?

  7. How can I verify the source/quality of the HBA that I want to buy on e.g. eBay? Can I validate that it is not a cheap counterfeit? Can I test the hardware myself to verify that?

  8. Any suggestions for alternative approaches? What do you think of my general idea?
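Regarding question 5: the way I reason about worst-case contention behind a single expander uplink is sketched below. The per-device rates and the uplink figure are again assumptions, and the expander arbitrates in hardware, so this only bounds the worst case and says nothing about how fairly the bandwidth is shared:

```python
# Worst-case oversubscription behind one shared x8 SAS-3 uplink; all rates are assumptions.
uplink = 8 * 1.2    # GB/s usable on one SFF-8654 x8 wide port (assumption)

def oversubscription(n_devices: int, per_device_rate: float) -> float:
    """Aggregate device bandwidth divided by the shared uplink bandwidth."""
    return n_devices * per_device_rate / uplink

print(f"28 HDDs  @ 0.25 GB/s: {oversubscription(28, 0.25):.1f}x")   # < 1x, no contention
print(f"28 SSDs  @ 0.55 GB/s: {oversubscription(28, 0.55):.1f}x")   # moderately oversubscribed
print(f"240 HDDs @ 0.16 GB/s: {oversubscription(240, 0.16):.1f}x")  # the 240-device example
```

Anything above 1x just means the drives cannot all run at full speed simultaneously; whether that hurts in practice depends on how often they actually do.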

A few more comments:

  • I am currently very restricted with PCIe lanes because I am using an Intel consumer CPU. That is why I want to use only one x8 Gen4/Gen5 slot for all my storage (my guess is there is no Gen5 HBA yet, and it would also be unaffordable for me).
  • I want to try out the special vdev features of TrueNAS on bare metal by myself and see what I could potentially use long term. I can always switch back to a setup without SLOG or dedup by backing up to a second server and restoring afterwards, while keeping all my data and the file structure. And I will do a backup either way.
  • I would be glad about links/documentation etc. I have researched a lot already and want to tick off the boxes that experts here can answer quickly. Sorry if I overlooked something major, especially in this forum.

I know this is a long post, but I really want to know as much as possible about this type of HBA before I invest in hardware that will cost me hundreds of dollars short term (if not thousands long term). I would be glad about any flaws pointed out by experts, or better suggestions on how to solve this.

I really found the community very helpful in my last post, which is why I came here and didn’t just go to an arbitrary subreddit or other forums.

And btw here is the whole reason for the 1x HBA → a lot of drives story:
/forums.truenas.com/t/truenas-core-cant-execute-smart-check-not-capable-of-smart-self-check-resulting-in-bug/681/39

Thanks in advance :slight_smile:

Actually, the HBA linked above is just a Gen3 HBA, which would be way too slow for me. I also saw that there are already Gen5 HBAs. Can someone suggest Gen4 & Gen5 HBAs with compatible expanders, and also where to buy them? (Gen5 just for theory; they seem way too expensive.)

Too many questions in one post, and still not enough information about your hardware and use case.

But Tri-Mode HBAs are not favoured due to the way they run NVMe drives under the SCSI bus. If you want U.2 drives, you really should have the corresponding PCIe lanes, meaning a Xeon Scalable or EPYC system.
I also do not like the mention of dedup with consumer hardware, its limited RAM capacity, and a large storage pool.

They need power. Full stop.
Some expanders have a 4-pin Molex and can be powered from that instead of the PCIe bus.

Yeah, I figured that, but everything is connected in terms of logic and I didn’t want to make 4-6 posts just for one HBA that I don’t fully understand. Sorry btw if that is against community guidelines; I can delete the post if so.

The question is what information is most needed:
CPU: i9-13900 for my main server (VMs! good single-core performance needed / Xeon too expensive); i7-47xx on my test server
RAM: 128 GB main server / 16 GB test server
PCIe layout: x8 Gen5; x8 Gen5; x4 Gen3; x4 Gen3 (ASUS W680 ACE → again, a server motherboard is too expensive for me for NEWER systems)
What else?

OK, that is good to know. I would love to go with a 32- or 64-core Ryzen, but that wasn’t in the budget for me, which I obviously regret in terms of PCIe lanes; still, I want to solve that problem with my current hardware and avoid starting over with a whole new CPU/motherboard combination.

May I ask why? What exact problem did you encounter? I tried it for just a few hours and speeds stayed in the 10s of Gbit/s over a 10 Gbit network. 64 GB of RAM was plenty for a test VM, and the CPU spikes also weren’t that bad while basically running 3 VMs that only take the space of one. Can running them in parallel lead to problems? I am surprised the SLOG wasn’t jumped on first; for dedup I already saw the advantage on my system while trying it (not for SLOG yet). Did I miss a major flaw?

Thanks for verifying.

5. ANSWER: I found an answer while talking to a tech guy on eBay who uses this hardware on a daily basis. He mentioned that the controllers use round robin, so from what I understand it is fine for SAS & SATA but not for stalled NVMe drives. It gets even worse when the PCIe lanes are bottlenecked, so I will basically use the 3x M.2 slots of my motherboard with OCuLink adapters and attach the U.2 drives directly. Best “consumer” solution from what I have seen for now.

Try reading the post by dak180 at the link below regarding deduplication, and the links given on the subject.

https://forums.truenas.com/t/deduplication-with-n100-and-32gb/6268


Thanks, I went halfway through for now, and the example in this Dedup article is exactly my case! Only problem: my max theoretical RAM for Raptor Lake is 192 GB (close to 256 GB), and I would assign something like 128 GB in a real-world scenario.

Why the “x5 savings” criterion for storage? I am not an expert, but that doesn’t make sense to me. When we are in the magnitude of around 150 TB (also my goal, currently 40 TB), I am glad to have 0.2x (20%) savings and use 30 TB less space, which is potentially hundreds of dollars of HDD, thousands of SSD. That would mean dedup only works somewhat/well/perfectly in datacenters with petabytes and upwards, and nowhere else? Is ZFS only targeting those as potential customers? I highly doubt that.

So the experience with dedup is very mixed, as I see. I will do (and also report) a few more tests to compare my experience with the reactions here … maybe I did something wrong in my setup.

Not enough money for a “server motherboard”, but enough for a W680 “desktop workstation” motherboard that is twice as expensive as its Z690 consumer sibling (higher specifications overall, but for ECC support), plus a top-of-the-line-and-accordingly-overpriced Core i9? I do not quite grasp that.

Desktop “Ryzen” tops out at 16 cores, so 32-64 cores would be EPYC (which is simply referred to by its “Zen” architecture). If the lower maximum frequency of EPYC is still satisfying, the solution would have been to go for second-hand EPYC 7002/7003 or 1st/2nd generation Xeon Scalable rather than current EPYC 8004/9004 or 4th/5th gen. Scalable. With second-hand DDR4 RDIMM rather than DDR5 ECC UDIMM, that could even end up being cheaper for much more RAM.

ZFS dedup is a memory hog. Read and re-read slowly.
With default recordsize, the guideline is 5 GB RAM per data TB—and (much) more with smaller recordsize, as would be the case for VMs. This comes on top of other, basic, requirements.
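To put rough numbers on that guideline (a minimal sketch; the 5 GB/TB figure is just the rule of thumb above, and real needs scale with block counts rather than raw capacity):

```python
# Rule-of-thumb dedup RAM estimate based on the 5 GB per deduped TB guideline above.
# Treat the results as lower bounds: smaller recordsize (zvols, VMs) needs considerably more.
def dedup_ram_gb(data_tb: float, gb_per_tb: float = 5.0) -> float:
    return data_tb * gb_per_tb

for tb in (10, 40, 150):
    print(f"{tb:>4} TB deduped -> roughly {dedup_ram_gb(tb):,.0f} GB RAM, before ARC for anything else")
```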

My experience? I, unwisely, set up a 16 TB dataset with dedup within a larger pool, which serves as a dumping ground for all the external storage I have lying around. It currently holds about 10 TB. The primary NAS for this dataset has 64 GB and an Optane 900p persistent metadata SLOG (I had read quite some documentation and, typically for a newbie, thought I would get away with piling up fancy ZFS features). The backup system just has 128 GB RAM. Both are pure storage, no VMs, no apps.
Result: The primary does fine with its barely minimal RAM for such a setup. The backup does fine as long as it stays at 128 GB; but at one stage I removed half the RAM, and then a scrub of the pool would take several days :scream: rather than several hours.
Lesson: The guideline is accurate. 64 GB RAM is not enough to handle a mere 10 TB of deduped data. At an actual dedup ratio of less than 2x, that is an awful lot of resources thrown in for minimal space benefits.

That’s sensible guidance, given the hardware resources one has to throw in to make deduplication viable. Otherwise you’re better off just buying more storage space.

…for which you’ll spend tens of thousands of dollars on RAM!
40 TB of regular 128k-recordsize data would need at least 256 GB RAM, possibly more to account for your VMs. 150 TB of small-recordsize deduplicated block storage would require several terabytes of RAM. Or several 1.6 TB Optane P5800X (six in two 3-way mirrors?) as a dedup vdev to assist a mere half terabyte of RAM, maybe? Never mind, it just isn’t worth it for ridiculous 0.2x gains.

Wrong guess! Enterprise really is the target market for ZFS.
It’s great that we can enjoy ZFS in our home labs as well, but some exotic features (dedup, dRAID) really only make sense for special applications with mean big iron servers.

Back to basic questions:
How much (bulk) data? (The kind that would do fine on raidz2/3 at under 70% occupancy.)
How much VM data? (The kind that really wants mirrors at under 50% occupancy, and quite possibly SSDs at that.)
How much RAM for the VMs alone?
What drives do you own already?
Then we can see if there’s a way to fit that in a W680 system.


Just a heads-up.

What is your understanding of ZFS?