Tuning SCALE?

I run some larger hardware: 1TB RAM, 12TB L2ARC, SLOG, and NVMe for metadata.

We host terminal servers on this storage, and while SCALE was a HUGE improvement over my highly tuned Core configs, I'm just curious if there is anything to do for tuning on larger-than-average boxes?

That tells us nothing about:

  • Your pool layouts
  • Your storage and data access use cases (types of data, types of access, performance requirements, synchronous or asynchronous writes etc.)

But my gut reaction without knowing any details and based only on the general lack of knowledge of some people who throw hardware at a problem…

  • Not sure you need SLOG
  • Not sure you need L2ARC
  • Not sure you need 1TB of memory
  • Special allocation vDev for metadata might be useful - not sure.

The only generally applicable rules of thumb are:

  • Do not do synchronous writes unless you need them (regardless of having SLOG)
  • Use mirrors for high-IOPS data
  • Ensure that all your vDevs (except L2ARC perhaps) are redundant.
  • Ideally, metadata vDevs for mission-critical systems should be 3-way mirrors (see the sketch below).
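
As a rough sketch of that layout (pool and device names below are placeholders, not a recommendation):

# Striped 2-way mirrors for data, plus a 3-way mirror for the special (metadata) vDev
zpool create tank \
  mirror sda sdb \
  mirror sdc sdd \
  special mirror sde sdf sdg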

SLOG and L2ARC can be added and removed to see whether performance improves or not.

Special metadata vDevs can be added but sometimes cannot be removed, so these need more planning.
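
For reference, the add/remove cycle for log and cache devices is straightforward (pool and device names are placeholders):

zpool add tank log nvme0n1      # attach a SLOG
zpool add tank cache sdh        # attach an L2ARC device
zpool remove tank nvme0n1       # detach the SLOG again
zpool remove tank sdh           # detach the L2ARC device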

Is there a problem you need to solve, or do you just want to fiddle for fiddling’s sake?

Lol, thanks for not reading my post…

As I said, under Core it was highly tuned, but under SCALE I do not run any tunes and it works substantially better.

The issue I have is that I have almost 400GB of RAM free, so it's not really utilizing the resources very well.

My major problem would be that ARC and L2ARC aren't being utilized very well.
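
(For anyone following along, the ceiling and current size can be checked like this on SCALE; my understanding is that OpenZFS on Linux defaults zfs_arc_max to roughly half of RAM unless overridden:)

cat /sys/module/zfs/parameters/zfs_arc_max   # 0 means the built-in default applies
arc_summary -s arc                           # current ARC size vs. target size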

With that kind of hardware you should run TrueNAS Enterprise and put the tuning question directly to iX… In any case, this is totally out of the league of the amateurs in this forum.

Not knowing how much data you're serving and the size of active datasets, it's possible you've managed to supply ZFS with more RAM than it can actually find a use for.

I have noticed some interesting behavior re ARC on my own system. I can't quite put my finger on whether it's a problem yet or not. What sort of workload are you running? Is it possible your working set just fits in the memory that has been used thus far?

Perhaps it just doesn't need these resources. Without a reasonable amount more detail, how can we have any idea?

I have 4 servers running 106 virtual machines, 30TB of data, and around 1000 terminal server users. My ARC under Core stayed much higher, but again, I had more tunes in place. I have zero tunes so far in SCALE.

Still insufficient detail.

What are your pool layouts?

How are the pools used?
Are the VMs only using zvols + iSCSI? Or also NFS?
How are the terminal server users using the storage? Big SMB datasets?

What are the ARC and L2ARC caching stats?

What were the tunes you did have in place on Core?

I don’t believe @Protopia was the one that didn’t read.


11 mirrored vdevs, NFSv3; the terminal servers are stored on the 4 storage servers in a VMware load-balanced storage cluster. The metadata special drive is doing 8KB-and-smaller caching. 12x 800GB SAS SSDs for L2ARC, mirrored 800GB Intel NVMe overprovisioned to 100GB. Chelsio 40Gb networking.
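
(If it helps: the "8KB and smaller" part is, I assume, the special_small_blocks dataset property; the dataset name below is a placeholder:)

zfs set special_small_blocks=8K tank/vmstore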

My tunes were:
sysctl kern.ipc.maxsockbuf=157286400
sysctl net.inet.tcp.delacktime=20
sysctl net.inet.tcp.delayed_ack=1
sysctl net.inet.tcp.recvbuf_inc=65536
sysctl net.inet.tcp.recvbuf_max=4194304
sysctl net.inet.tcp.recvspace=65536
sysctl net.inet.tcp.rfc1323=1
sysctl net.inet.tcp.sendbuf_inc=65536
sysctl net.inet.tcp.sendbuf_max=4194304
sysctl net.inet.tcp.sendspace=65536
sysctl net.inet.tcp.tso=0
sysctl vfs.read_max=128
sysctl vfs.zfs.arc_max=231928233984
sysctl vfs.zfs.delay_min_dirty_percent=98
sysctl vfs.zfs.dirty_data_max=68719476736
sysctl vfs.zfs.dirty_data_sync_pct=95
sysctl vfs.zfs.l2arc_headroom=2
sysctl vfs.zfs.l2arc_noprefetch=1
sysctl vfs.zfs.l2arc_norw=0
sysctl vfs.zfs.l2arc_write_boost=536870912
sysctl vfs.zfs.l2arc_write_max=536870912

sysctl vfs.zfs.top_maxinflight=128
sysctl vfs.zfs.trim.txg_delay=2
sysctl vfs.zfs.txg.timeout=120
sysctl vfs.zfs.vdev.write_gap_limit=0

sysctl kern.ipc.soacceptqueue=1028
sysctl kern.ipc.maxsockbuf=33554432
sysctl net.inet.tcp.recvbuf_max=33554432
sysctl net.inet.tcp.recvspace=4194304
sysctl net.inet.tcp.recvbuf_inc=524288
sysctl net.inet.tcp.recvbuf_auto=1
sysctl net.inet.tcp.sendbuf_max=33554432
sysctl net.inet.tcp.sendspace=2097152
sysctl net.inet.tcp.sendbuf_inc=262144
sysctl net.inet.tcp.sendbuf_auto=1
sysctl vfs.zfs.arc_max=231928233984
sysctl vfs.zfs.delay_min_dirty_percent=98
sysctl vfs.zfs.dirty_data_max=68719476736
sysctl vfs.zfs.dirty_data_sync_pct=95
sysctl vfs.zfs.l2arc_headroom=2
sysctl vfs.zfs.l2arc_noprefetch=1
sysctl vfs.zfs.l2arc_norw=0
sysctl vfs.zfs.l2arc_write_boost=536870912
sysctl vfs.zfs.l2arc_write_max=536870912

The tunes in Core were much faster than stock, and stock SCALE blows Core out of the water, but when it comes to the ARC in SCALE, it will stay full after we migrate load on, and then will evacuate. So it's not impossible that there's too much available, and that's OK, I want redundancy. I was just looking to see if anyone else has tuned large boxes… there isn't necessarily a problem.

I just know there are tunes for 40Gb networking, there are tunes for large memory, etc., but not many people around here run large clusters or this much volume.
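
From what I can tell (not tested), the rough SCALE (Linux) equivalents of the Core sysctls above look like this; the values simply mirror mine and are not recommendations:

# ZFS module parameters live under /sys/module/zfs/parameters on Linux
echo 231928233984 > /sys/module/zfs/parameters/zfs_arc_max        # vfs.zfs.arc_max
echo 68719476736 > /sys/module/zfs/parameters/zfs_dirty_data_max  # vfs.zfs.dirty_data_max
echo 536870912 > /sys/module/zfs/parameters/l2arc_write_max       # vfs.zfs.l2arc_write_max
echo 536870912 > /sys/module/zfs/parameters/l2arc_write_boost     # vfs.zfs.l2arc_write_boost
echo 1 > /sys/module/zfs/parameters/l2arc_noprefetch              # vfs.zfs.l2arc_noprefetch

# TCP buffer tuning moves from net.inet.* to net.core.* / net.ipv4.*
sysctl -w net.core.rmem_max=33554432
sysctl -w net.core.wmem_max=33554432
sysctl -w net.ipv4.tcp_rmem="4096 4194304 33554432"
sysctl -w net.ipv4.tcp_wmem="4096 2097152 33554432"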

What is so difficult about providing a clear explanation of your setup? This is still essentially meaningless:

  1. What terminal servers? Windows or Linux? Do you boot from a TrueNAS iSCSI link? Do users access files on the TrueNAS server? If so, what type of files, what size, random or sequential?

  2. 11 mirrored vDevs of what? SAS, SATA, HDD, SSD, 5400rpm, 7200rpm, 1TB or 20TB, 2-way mirrors or 3-way? 1 pool or 11 pools?

  3. 4 storage servers - independent or clustered? If clustered how are disks shared?

  4. Metadata special vDev of what?

  5. 12x800GB SAS SSDs for L2ARC? Striped? Mirrored?

  6. 2x 800GB NVMe for what?


You are really overthinking this… It's VMware NFS and terminal servers. The VMs run on the NFS datastore. Why would I boot from an iSCSI link and run on NFS?
They are terminal servers with over a thousand users; it's loaded with random reads and writes.

The vdevs being SATA or SAS doesn't even matter. I'm asking about tuning the ARC and possibly the L2ARC.

you clearly don’t know/use the special meta drive. so move on from that. irrelevant from my question.

L2ARC is striped. Who mirrors it when it's meant for pure performance?
Yes, sorry, the 800GB NVMe, which you missed as being overprovisioned, is for SLOG.

Anyway, this is over your head. I'll figure this out another way.

The thing is, I needed 4 TrueNAS boxes on Core to support my load, but I could damn near put it all on one box now if I had enough space. It just doesn't want to keep that ARC full, but maybe I just have more RAM than it needs. Once it gets going and full, the IOPS mainly come from the L2ARC and RAM, and very little is read from the actual spinning disks… it's pretty amazing what SCALE has improved on over Core. ZFS as a whole needs to be tuned for this large of an ARC and L2ARC, but I'm just not sure how to tune SCALE! It's worked so well I've just left it alone. I don't need to create work for myself, but it just seems like ARC is evacuating data sooner than it needs to and could/should stay fuller.
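
(For reference, the eviction behaviour I'm describing is visible in the standard OpenZFS kstats:)

grep -E '^(size|c_max|evict_skip|evict_not_enough)' /proc/spl/kstat/zfs/arcstats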

What are you seeing for ARC and L2ARC statistics?
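
For example, something like this (arcstat ships with OpenZFS on SCALE) would show hit rates and how full the ARC actually stays under load:

arcstat -f time,read,hit%,arcsz,c,l2read,l2hit% 5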

No - I am not overthinking this. It may be over my head, and if it is I will say so, but I don't think it will be.

My experience tells me that the devil is in the detail.

You say this as if it is something obvious - but it really isn’t.

I really do not understand why you have been so reluctant, from the very start, to give details. I don't understand why you assume that your configuration is the only possible one, or that other people should be psychic enough to read your mind to determine what it is. Nor do I understand why your description is so fuzzy when you do give one.

Of course, how stupid of me - the way ZFS uses the ARC and decides what to keep and what to discard is nothing whatsoever to do with how the data is stored or the means of access or the patterns of access that your infrastructure and users create.

So, someone gives up their own time - for free - to try to help you, and they patiently ask several times for details, and all you can do is assume that you know better than they do what information they need to help you, and insult them several times?
