Advice on RAM needs

Hi there,

I’m setting up a new machine (an Intel Core Ultra 5 225) that will act mostly as a file server (maybe some Docker containers, but nothing fancy) on a 10GbE network. The initial setup will be 4 mechanical HDDs (RAID-Z1) with the OS on an SSD; it will work as second-level storage, but maybe in the near future I will think about an array of NVMe disks, since it will serve SMB shares and the vast majority of the workload will be editing photos and videos.

How much RAM do I need to get as close as possible to saturating the 10GbE connection, at least for file transfers?
I’ve read in many forums that ZFS performance is vastly inferior to EXT4, but I like the idea of having snapshots and scrubbing on my data. At the moment I’ve bought 32GB of DDR5 (not ECC).

No deduplication, no SLOG (separate ZIL device), no L2ARC. And I’m unsure about compression, but probably no compression.

Thanks in advance! :slight_smile:

This will NOT saturate 10Gb/s, but 32 GB RAM is a good starting point.

Compression is basically free with ZFS. Set at least LZ4, or some low level of zstd.
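
Something like this from the shell, assuming your pool is named `tank` (the pool name is just an example; new settings only apply to data written afterwards):

```
zfs set compression=lz4 tank
# or a low zstd level:
zfs set compression=zstd-3 tank
```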

2 Likes

I don’t have a 10Gbps network, but in my experience the hardware (CPU + RAM) hardly makes a difference on a 1Gbps network. I ran the same configuration with anywhere from 4GB to 256GB and there was no difference in file transfer speed.

1 Like

More RAM means a larger ARC (read cache), which increases the likelihood that data already in the cache will be requested again, and in that case even a 10GbE connection could actually be saturated.
However, if the workload practically rules out the same data being accessed again (before the ARC algorithm evicts it), then even a lot of RAM will make little to no difference.
In that case you’ll only get as much performance as your pool can deliver
(and as @etorix already pointed out: 4x HDD RAID-Z1 won’t saturate a 10GbE connection).
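
If you want to see how often the ARC is actually helping, you can watch the hit ratio live; `arcstat` ships with OpenZFS (the 5 is just a refresh interval in seconds):

```
# prints ARC size, hit/miss counts and miss percentage every 5 seconds
arcstat 5
```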

1 Like

I would say LZ4 compression is free.


With the Ultra 5 you would be fine for sure with zstd-3 (or even zstd-6). You can change the compression algorithm afterwards (already written data stays compressed with the “old” algorithm), so you can run your own tests.

Again, I think that zstd-3 is a good starting point, and maybe the Ultra 5 would even be able to sustain 10Gbps of compression on compressible data. Not sure where you would find that much compressible data, though.


The theoretical throughput of this would be 3x the throughput of a single drive. With a single (almost empty) drive delivering 200MB/s, that brings us to 600MB/s → 4800Mbps. Random read/write would have the performance of a single drive (you would probably only be able to saturate 100Mbps).

RAM is used as ARC (aka read cache), so the download speed depends on the ARC hit ratio. And the hit ratio depends on the ARC/RAM size AND the working set size. So if you retrieve the same, let’s say, 25GB file over and over again (did I ever tell you what the definition of insanity is?), you would saturate 10Gbps easily. Welp, you would probably even be able to saturate 400Gbps (with multiple sessions) if you have a dual-channel DDR5 setup. The bigger the working set (compared to the max ARC size), the closer the actual read performance gets to the underlying vdevs’ performance.

Also, ZFS stores async writes in RAM, up to 4GiB by default. The limit can be changed with zfs_dirty_data_max_max; perhaps other parameter tweaks would be required as well (in case you decide to change the default size). So a single write of a file up to 4GB should saturate 10Gbps for a few seconds.
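
A rough sketch of how to inspect and adjust this on Linux (the 8GiB value is just an arbitrary example; on TrueNAS you would normally persist such tunables through the UI rather than echoing them by hand):

```
# current dirty-data limits, in bytes
cat /sys/module/zfs/parameters/zfs_dirty_data_max
cat /sys/module/zfs/parameters/zfs_dirty_data_max_max
# example: raise the active limit to 8 GiB for this boot
# (zfs_dirty_data_max_max itself is normally applied at module load time)
echo 8589934592 > /sys/module/zfs/parameters/zfs_dirty_data_max
```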

Thanks @etorix, nice to “see” you again (from TonyMac :wink:)!

OK, so I will keep these two 16GB DDR5 sticks, and someday I will upgrade to 64GB. And OK for LZ4, but my storage is up to 90% photo raw files and video files, so compression is not that useful, I guess.

Another question (more theoretical at the present time): I’ve successfully tested both Thunderbolt 4 and 5 (I know you don’t like them, but anyway… :smiley: ) and I’ve reached their limits of 20Gbps and 40Gbps bi-directional in a P2P connection with my Mac Studio and MBP using iperf3. So the question is simple: if, in the near future, I’d like to build a PCIe Gen 4 NVMe array formatted as ZFS RAID-Z1, how many drives would I need to saturate the 40Gbps connection?

This raises an interesting question: the ARC algorithm evicts old data, maybe even before it is accessed again. So wouldn’t an L2ARC device much bigger than the ARC be a simple way to get more bandwidth, just because data is no longer evicted so quickly?

I work with photo raw files and video files … so all stuff which is barely compressible, I guess.

This part is not clear to me: 4 drives but the throughput of only 3, why? OK, there is parity overhead and metadata overhead… maybe that’s it. Anyway, my disks are rated at 268MB/s → 6432Mbps, which is not bad at all. And why is random write on par with a single drive’s performance?

I do have a dual-channel DDR5 setup (not ECC, unfortunately). But this other concept of yours is not clear to me. Can you elaborate a bit more, please?

Yes, setting up an L2ARC could indeed increase the probability that a requested data block is still in cache (although I have to admit, I’m not too familiar with the exact logic ZFS uses to decide which data the ARC and L2ARC consider cache-worthy, and which they evict…).

But as always in life, there’s a tradeoff: managing the L2ARC consumes RAM.
How much depends on the block size being cached, but as far as I know it’s roughly 1–4 GB of RAM per 100 GB of L2ARC.

Also, even with a fast NVMe drive, the L2ARC is still at least one order of magnitude slower than RAM (and in terms of latency, probably closer to three orders of magnitude slower.)
That’s why it’s generally recommended to max out your RAM before adding an L2ARC.

Of course, it really depends on your workload.
If you’re randomly pulling data from a 100 TB pool, any cache will struggle to make a noticeable difference.
So in the end, you may just have to experiment and see what works best for your setup.

One more note: both ARC and L2ARC are cleared after a reboot.
Technically, the data on the L2ARC NVMe still exists, but after a restart, ZFS loses the in-RAM index mapping, so it effectively starts over with an empty cache.

1 Like

H.264/H.265 videos are not further compressible, but raw photos/raw video streams are highly compressible. LZ4 is worth it, just to squeeze out the occasional padding of zeroes.

Thunderbolt 3/4 is four lanes worth of PCIe 3.0. So even the bare raidz1 minimum of 3 drives should do, and it won’t take Gen 4 speed.

That’s the maximum throughput on the outer tracks; do not count on it as a guaranteed rate for random data.

Yes, just that. A stripe of mirrors (2*2) would read as four drives, and write as two.

Because random is all about IOPS.

3 Likes

I have a dataset for my personal photo/video archive (mainly JPEGs and raws with the occasional MP4) with LZ4 compression. The stats show a 1.06x compression ratio. Perhaps I could get slightly more with zstd-3.
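
If you want to check your own, the ratio comes straight from `zfs get` (pool/dataset names below are just examples):

```
zfs get compressratio,compression tank/photos
# or recursively for the whole pool
zfs get -r compressratio tank
```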

I’ve checked some random files: JPEGs show no compression, MP4s show very little compression (like 0.2MB saved on a 180MB file), some raws show no compression (perhaps they were compressed in camera), and some show minor compression (like 0.4-0.8MB on a ~40MB raw). Well, these numbers left me wondering where I actually got this 6% compression…

I use a 4M recordsize for this dataset, btw. And IMO you should use at least 1M for media files. It will decrease the amount of metadata and marginally increase random read/write performance. I’ve seen reports that it can increase linear throughput too.

You can have different recordsize and compression settings on different datasets. Moreover, you can change them after creation (“old” data keeps the “old” settings). So I encourage you to run your own tests.
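
For example (the dataset name is hypothetical; only data written after the change uses the new values):

```
zfs set recordsize=1M tank/media
zfs set compression=zstd-3 tank/media
zfs get recordsize,compression tank/media   # verify the current settings
```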

I could be totally wrong, but it seems that SMB (and NFS) are designed for multi-user environments. SMB has some single-threaded limitations, and even 10Gbps over a single SMB session is a very good speed. You should hold your horses about 40Gbps if it’s not a multi-user NAS.

Regarding the actual numbers, 40Gbps is 5GB/s, which is easily reached as a read speed on a modern NVMe. So one drive can already be enough.

Probably. This video can be useful.

2 Likes

Because the actual data is only stored on the non-parity portion, i.e. the equivalent of 3 of the 4 drives. At least that’s how I understand it. Perhaps reads can be, or are, faster. Anyway, these are only theoretical estimations. You can read more here.

Because each (random) data record is spread across all the non-parity drives, thus using one IOP from every drive for one IOP on the entire vdev. At least that’s how I understand it.

They probably do have this speed, at the very beginning of the platters. My HC550s showed 270MB/s at the beginning and about 170MB/s at 80% into the disk (tested with the free version of HD Tune Pro).

I brought up DDR5 just as an example of saturating 400Gbps (a single DDR5-4800 module is 38.4GB/s). It doesn’t really matter (IMO) for a home/SOHO NAS whether it is dual channel or not. As always, I can be wrong on this.

Regarding the working set: the working set is a term used to describe the “active data” on the pool, i.e. data that is being accessed (yoinked from here).

So your working set (size) strongly depends on your workload. This is why I brought up that stupid single-25GB-file example. Well, it might not be that stupid in the case of some CDN, for one.

For example, if you use your NAS just as a media slop, with no particular pattern of access, your working set is probably equal to the entire pool size.

Another example (which could be your case): you just copied all your shooting-session raws to the NAS and then started to edit them. Your working set is probably all those raws (+ edited files). I’ve seen claims that a shooting session is usually about 500-1500 photos. If we consider one raw file to be 40MB, that brings us to ~20-60GB. So 32GB of RAM can be fine, or even very good, for this workload.

NB! TrueNAS will consume some RAM itself. From my (limited) experience, ~6-10GB.


FWIW, the YouTuber SpaceRex consults for photographers/videographers a lot (at least he says so), so maybe his insights can be helpful.

2 Likes

Well… it depends: for a short shooting session you could be right, but if we have to cover a wedding, things are much bigger. We are at around 300GB of raw files from 2-3 photographers, plus 1TB from the videographers.

But my idea is just to work on all that stuff on some speedy PCIe Gen 4 or 5 NVMe drives, connected through the Thunderbolt 5 ports of my Macs. I don’t want to suffer for speed, waiting for ZFS over TCP/IP over SMB to get things done. Definitely too slow.

This TrueNAS is the first level of “backup” storage for my NVMe drives, while an old and mighty Synology is the second one. When I finish editing all the stuff, I will wipe my NVMe drives; TrueNAS will hold the final 1st copy of the job, and the Synology the 2nd one.

As I cannot store everything on my NVMEs because they are too expensive, maybe some folders will be stored on TrueNAS only, but definitely not my shooting sessions.

So, ideally, you need 1+TB of RAM :slightly_smiling_face:. Actually, it is not even that expensive if you consider used servers. Or maybe a 1+TB L2ARC. But that probably only makes sense if you have more than one editor. These are only my assumptions, so don’t take them too seriously.

I have never ever tried to edit photos/videos over SMB. However, assuming H.264/ProRes/DNxHR 4K footage maxes out at 200-600Mbps (well, it can be beefier, of course), even 1Gbps should provide enough speed. As I’ve said, I’ve never tried it. And IMO, if you are the only editor, SMB is not needed.

Meh, if it’s only a (media) backup, then even 8GB can be OK. 16-32GB is the way to go IMO.

FWIW, there is a persistent L2ARC feature.
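
As far as I know, on recent OpenZFS it is governed by a module parameter (Linux path shown; 1 means the L2ARC contents are rebuilt from the cache device after a reboot):

```
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled
```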

1 Like

Thanks. Readings are always welcome! :slight_smile:

well 1TB of RAM :rofl: :rofl: :rofl: :rofl: :rofl: :rofl:

Maybe I can consider an L2ARC of that size, or similar. But to my understanding I need an enterprise-grade SSD or NVMe for this. Am I correct?

This is great! :slight_smile: Persistent is a word I like! Especially since this NAS will not be on all the time, but only when needed.

OK, then to recap things a bit: I will have a single zpool made of one single vdev, which consists of my actual 4 HDDs organised in RAID-Z1. Correct?
Now, what if I’d like to create a hybrid pool where metadata is separated from the actual data (I come from Synology Hybrid RAID)? Can I modify the (z)pool afterwards and separate them in the future, or do I have to do this at the very beginning?

Another question: in Synology’s Btrfs world there is a nice concept of a subvolume, which ultimately I’d like to have when I create SMB shares, and to my understanding this corresponds to a “dataset” in the ZFS world, right?

32GB of DDR5 is a solid start; for your use case (RAID-Z1, no dedup/SLOG/L2ARC) it is plenty. You will easily handle 10GbE transfers, especially for photo/video workloads. If you expand to NVMe or heavier Docker use later, consider bumping to 64GB. ZFS on that setup will perform great.

1 Like

That is a controversial topic. Some say L2ARC is the most gentle workload an SSD can have, so even consumer SSDs would be fine. I have never had an L2ARC, so I can’t really advise anything. IMO, you can be OK with any SSD, because even if you lose your L2ARC drive, you won’t lose any data.
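
That is also why the cache device is easy to experiment with; a sketch, with the pool name and device path as placeholders:

```
# attach an SSD/NVMe as L2ARC
zpool add tank cache /dev/disk/by-id/nvme-EXAMPLE
# it can be detached again at any time without touching your data
zpool remove tank /dev/disk/by-id/nvme-EXAMPLE
```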

Well, you do you. Sounds reasonable to me, if that’s what you’re asking.

I have zero experience with Synology, so I can’t say for sure. There is a concept of a special vdev (aka sVDEV or metadata vdev) in ZFS that can store the metadata. However, unlike the L2ARC, this vdev cannot be lost (or you lose the data of your entire pool), so it must be at least mirrored.

If you decide to go the special vdev route, you should do it from the very start, because already-written metadata won’t migrate to the sVDEV. Also, AFAIK, you can’t remove it from the pool afterwards, so it should be planned carefully.
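
Roughly like this at pool creation time (pool name and device paths are placeholders, not a recommendation):

```
zpool create tank \
  raidz1 /dev/sda /dev/sdb /dev/sdc /dev/sdd \
  special mirror /dev/nvme0n1 /dev/nvme1n1
```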

IMO, you don’t need it (and I’m on “team sVDEV” myself). With a big recordsize on your datasets (1-4M) you should not have much metadata, and it would probably already be cached in the ARC. OTOH, if your NAS is only occasionally on, your cache probably won’t be filled. Well, to answer your original topic-starting question: you don’t need much RAM for a non-24/7 NAS, because your ARC won’t be properly filled anyway. Maybe this discussion (of the special vdev) would be useful.

I have no clue about Btrfs in general, or Synology Btrfs in particular. In ZFS you can (and should) have different datasets (with different tunables like recordsize, compression, etc.) for different kinds of data (for the record, photos and videos can be considered the same kind of data, both being big, incompressible files). Then you create different SMB/NFS shares (in TrueNAS) that point to different datasets.
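
Something along these lines, assuming a pool called `bigdata` and dataset names that are just examples; the SMB shares in the TrueNAS UI would then point at the corresponding mountpoints:

```
zfs create -o recordsize=1M -o compression=lz4 bigdata/photos
zfs create -o recordsize=1M -o compression=lz4 bigdata/videos
# shares then map to /mnt/bigdata/photos and /mnt/bigdata/videos
```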

1 Like

Hi all,

in the end I followed all your advice, and this is the final result:

:~$ zpool status
  pool: bigdata
state: ONLINE
config:

	NAME                                      STATE     READ WRITE CKSUM
	bigdata                                   ONLINE       0     0     0
	  raidz1-0                                ONLINE       0     0     0
	    ata-TOSHIBA_MG09ACA18TE_53G0A2LUFJDH  ONLINE       0     0     0
	    ata-TOSHIBA_MG09ACA18TE_53T0A07FFJDH  ONLINE       0     0     0
	    ata-TOSHIBA_MG09ACA18TE_53T0A07KFJDH  ONLINE       0     0     0
	cache
	  nvme-CT500P3SSD8_2310E6B7145D           ONLINE       0     0     0
errors: No known data errors

with 32GB of RAM. And, when I can, I will buy 32GB more to have a full 64GB with 4 sticks on the mobo.

All the datasets have compression=LZ4 and recordsize=1M; the latter is not changeable afterwards, is it?

I’ve also added a 500GB NVMe as L2ARC. I know it probably won’t be beneficial; I did it just because it was literally an empty NVMe that I had forgotten about. Do I need to activate TRIM on it, or is it automatic?

Thanks to everyone! :slight_smile:

1 Like

You can use ``` for multi-line code snippets.


(Open)ZFS has a built-in utility for showing cache statistics. You can call it with sudo arc_summary.
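
For example, right after a few big transfers (I believe the section flag exists on recent versions, but I may be wrong):

```
sudo arc_summary            # full cache report
sudo arc_summary -s arc     # just the ARC section, if your version supports -s
```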

No, you can change it any time. However, it won’t affect already written data.

I personally follow @winnielinnie’s advice: autotrim disabled and a weekly zpool trim <your-pool-name> cron job. However, AIUI, L2ARC is a ring buffer, so your NVMe drive will eventually be fully filled, rewriting the “oldest” cached blocks. I’m not sure whether TRIM is needed in this scenario. Probably not.
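
Roughly like this, using your pool name `bigdata` (the weekly run would be set up as a cron job in the TrueNAS UI):

```
zpool get autotrim bigdata     # check the current setting
zpool set autotrim=off bigdata
zpool trim bigdata             # the weekly job
zpool status -t bigdata        # shows per-device trim state/progress
```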

1 Like

I would be interested to know what incremental gain you get out of the last 32GB of RAM.

1 Like