"Best" usage of my 3 types of disks? Also, what's the deal with swap on SCALE?

I have done a lot of reading here (I actually just read a thread with somewhat similar hardware to mine), but I’m still not really sure what to do.

Hardware is an iXSystems TrueNAS Mini X+ system (I have physical space constraints and the small but capable chassis was ideal). I’ve got the following:

  • Atom C3758 (8 cores, 2.2GHz)
  • 128GB ECC DDR4 - 4x32GB (self-upgraded from the stock 2x16GB)
  • Factory 250GB WD Blue NVMe (boot)
  • 5x Seagate Exos X18 18TB [ST18000NM000J] - SATA
  • 2x Crucial MX500 4TB [CT4000MX500SSD1] - SATA
  • 1x Radian Memory 8GB RMS-200 in PCIe slot (gift from a friend)

If you’re not familiar with the RMS-200, it’s basically a RAM drive: 8GB of DRAM with capacitor-backed flash behind it, designed for massive write abuse. The DRAM is automatically copied to flash in the event of power loss, so the flash on the card itself sees very few writes. It presents to the OS as an NVMe drive. The speed is somewhat limited because it’s an x8 card in an x4 slot here, but it’s still really quick.

Planned usage:

  1. macOS Time Machine backups from my laptop - it’s on Wi-Fi 6E, so I expect it to hit ~1Gbps even though it’s wireless most of the time
  2. SMB (or NFS??) file access for the Mac and my wife’s Windows laptop - Apple Lossless music from all of my CDs stored here; personal home folders/document storage; planning on ripping Blu-rays and DVDs to here as well (hence the relatively large amount of storage for a small system)
  3. Some sort of media/video sharing server (likely Jellyfin)? Due to the lack of transcoding hardware on the TrueNAS box, I may instead just use it as storage space for the media and run Jellyfin on a different system I have… meaning the media would likely just be an NFS or SMB mount to a different Linux system.
  4. Backups from one or two Proxmox systems running other VMs (e.g. Jellyfin, maybe a PiHole VM, my virtualized OPNsense router, etc)

I might consider running some smaller TrueNAS “Apps”/utility containers on the TrueNAS system (likely something like sonarr/radarr/etc to yarrr some media), but the lack of transcoding hardware means that a “media server” probably needs to run elsewhere and the NAS will mostly just be a NAS.

I was thinking:

  • RAIDZ2 of the 5 spinners - main pool (Z1 seems unwise with 18TB drives, no?)
  • Mirror of the two SATA SSDs - flash pool - “app” / VM storage?
  • Radian Memory as SLOG for the main pool? But I don’t know how much benefit I’d get out of the scenarios above, since I doubt there are a lot of sync writes… (rough zpool sketch of this layout below)
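
In plain zpool terms, what I’m picturing would look roughly like this (I’d actually build it through the UI; the device names here are just placeholders, not my real disks):

  # Rough sketch of the planned layout (placeholder device names)
  zpool create tank raidz2 sda sdb sdc sdd sde     # 5x 18TB Exos -> main pool
  zpool create flash mirror sdf sdg                # 2x 4TB MX500 -> apps/VMs
  zpool add tank log nvme1n1                       # RMS-200 as SLOG (removable later)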

TL;DR: I’m mostly concerned with whether I should do Z1 or Z2 on the main spinning pool of 18TB drives, and then how to best utilize the pair of 4TB SATA SSDs and what to do with the 8GB Radian card.

Additionally, what’s the deal with swap on SCALE? I actually created the main pool and flash pool already (no data, so I can redo them), and all of those disks had 2GB partitions created for swap, yet swap isn’t actually turned on. I feel like having swap on the spinners is kind of pointless, so maybe I should redo the pool so I can reclaim that 10GB of storage? 16-32GB carved off of each of the SATA SSDs would make more sense as swap… I doubt I need 128GB of swap to match the 128GB of RAM? Chris Down, who has worked on Linux kernel memory management, suggests that Linux machines should have swap…

I had a fairly similar setup.

You have it right re the allocation of drives, though you could add another SSD and then consider an sVDEV to host metadata and small files. If you set the small-file cutoff on a dataset equal to that dataset’s record size, you can force the entire dataset to be held by the sVDEV.
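
For example, something like this - the dataset name and sizes are placeholders; the only point is that the cutoff matches the record size:

  # Route every block of this dataset to the sVDEV by matching the cutoff to recordsize
  zfs set recordsize=128K tank/vms
  zfs set special_small_blocks=128K tank/vms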

That way, the datasets for your VMs and so on can live in the same pool as your data, and you also get the speed boost for small files and metadata from the sVDEV. That should all work since you have 8 slots: 5 HDDs and 3 SSDs. See my resource page re: sVDEV planning and implementation.

The big downside of an sVDEV is that if it goes, so does your pool. Hence my suggestion re: a 3-way mirror for your sVDEV. I use a 4-way mirror sVDEV because I run a Z3 pool here.
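
As a sketch with placeholder device names - and note that, unlike a SLOG, a special vdev cannot be removed later from a pool that contains RAIDZ vdevs:

  # Attach a 3-way mirrored special vdev to the pool (placeholder names)
  zpool add tank special mirror sdf sdg sdh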

I’d likely ditch the Radian and keep that slot empty for the future in case you ever want to go SFP+. Time Machine does use sync writes, but its usage is so light that I doubt you’ll ever see a benefit unless your VMs are sync-write heavy.

I have 7 slots total, they’re all full. 5x3.5", and 2x2.5". The Radian is internal. I’d prefer to keep anything that may fail accessible from the hotswap bays… The other limitation is that they are all SATA, not SAS.

The system has 2x10Gb RJ45 already; I’m in the process of upgrading my LAN to allow for that too.

Gotcha. The mini XL had 8 slots, IIRC, and some of the SSDs were attached using Velcro?

If you’re at all interested in the sVDEV route, I’d still consider securing an SSD inside, since they throw off little heat and usually do not fail.

Otherwise, go as planned: two SSDs for apps and VMs, 5 HDDs for the pool.

Do a scrub and see how high the temps go; if they get too high, consider upgrading the fan in the back to something more performant.

I’d still keep the PCIe slot in reserve. You could use it to house NVMe or whatever in the future. If you want to attach the Radian, it won’t hurt, and SLOGs can be removed without issues later, but I doubt you will see any benefit unless you start hosting VMs that do a lot of sync writes.
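
For reference, dropping the SLOG later is a one-liner (device name is a placeholder):

  # Detach the log device from the pool without affecting data
  zpool remove tank nvme1n1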

what’s the deal with swap on SCALE?

You did not specify which version of SCALE.
In the current Dragonfish release, 24.04.1.1, swap has been disabled, though in certain cases the swap partition will still be present for now. Swap will undergo additional work in a future version.

From the release notes:

Fixes to address issues involving ZFS ARC cache and excessive swap usage leading to performance degradation (NAS-128988, NAS-128788).

With these changes swap is disabled by default, vm.swappiness is set to 1, and Multi-Gen LRU is disabled. Additional related development is expected in the upcoming 24.10 major version of TrueNAS SCALE.

With 128GB of RAM you wouldn’t normally need swap. I have been running without swap since the update came out (the update disables it) without issues, and the cache system seems to be behaving just fine on my systems.
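
If you want to confirm what the update did on your own box, a quick read-only check looks something like this:

  swapon --show                        # no output means no active swap
  sysctl vm.swappiness                 # Dragonfish sets this to 1
  cat /sys/kernel/mm/lru_gen/enabled   # 0x0000 when Multi-Gen LRU is disabled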

The deal with swap is this:

  1. The OS swaps when it runs out of memory for the OS and apps.
  2. In a ZFS system, you need a lot more memory than the OS and apps require in order to hold the ZFS cache, and when more memory is needed the ZFS cache is released before the OS thinks about swapping. So swapping hardly ever happens. (I have swap enabled, and it has never gone above 512MB.)
  3. You really don’t want parts of the OS or apps swapped out on a performance-critical file server.

So iX has (quite reasonably, IMO) decided that swap should be turned off in Dragonfish and later versions of SCALE, and it has a phased implementation for turning it off and then recovering the 2GB of space per drive.

But with a new build, you can set the reserved swap space to zero before creating pools, and for Cobia and earlier you can also run swapoff -a as a post-boot command to turn swapping off completely.
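
For example, something along these lines (verify the advanced-settings field name on your SCALE version before relying on it):

  # Check and zero the per-disk swap reservation before creating pools
  midclt call system.advanced.config | jq .swapondrive
  midclt call system.advanced.update '{"swapondrive": 0}'

  # Cobia and earlier: add this as a post-init command to disable swapping entirely
  swapoff -a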

P.S. @ZPrime Your proposed disk usage seems sensible.

You won’t need a SLOG for Time Machine (because that is a background task) - it will likely be beneficial for SMB (or NFS) writes from your Mac (but not from Windows).

The only change I might think about would be to use the PCIe slot for an older GPU for Jellyfin transcoding rather than using it for SLOG and hosting Jellyfin elsewhere.

P.S. I run Plex without GPU transcoding - and for my own use, occasional batch transcoding on the CPU (which is far less powerful than yours) is perfectly fine. IMO you will really only need a GPU for real-time transcoding that needs more CPU than your Atom has, i.e. transcoding an 8K movie down to something smaller to watch on your phone - and if you plan ahead, you can pre-transcode it anyway.

It’s complicated.

In SCALE Cobia and earlier (this doesn’t apply to CORE), ARC was limited to half of your RAM, due to a variety of issues, some really there, some incorrectly understood to be there. In larger systems, this could mean that dozens or even hundreds of gigabytes of RAM were being wasted.

Dragonfish attempted to bring RAM usage more in line with CORE and removed this limit on ARC. This unexpectedly resulted in parts of the system being paged out to swap unnecessarily. To resolve that, 24.04.1 pretty well eliminates use of swap. It still, by default, creates the 2 GB swap partition on disks when you create a pool, though that was only really intended as “slop” in case you tried to replace a disk with one slightly smaller, so I don’t regard this as an inconsistency. What I do regard as an inconsistency is that it still offers to create a 16 GB swap partition on the boot device.
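
If you want to check whether your own pool members carry that partition, a read-only look at the layout is enough (swap in one of your pool disks for sda):

  lsblk -o NAME,SIZE,PARTTYPENAME,FSTYPE /dev/sda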

I could always remove the Radian and get an NVMe → PCIe card… but that could only host a single NVMe I assume. (Unless the slot on the Supermicro board that iXSystems used supports bifurcation to turn it into 2x2 instead of 1x4, which seems unlikely?) I would not feel safe with an sVDEV unless it was a 3 or 4-way mirror, since I know how important metadata is to a pool, and since I assume the amount of writes it sees has the potential to kill SSD flash relatively quickly. I do like the idea of being able to have the SSDs help with performance on the main array though, especially with the decent size of my music collection (~15k+ files across several top-level folders, broken down into further folders by artist and album). Storage-wise it’s not a lot of data, but bunches of relatively small files (1-5MB for the lossy library, and 20-50MB for the lossless) and I know directory traversal performance isn’t always ZFS’s strong suit in this scenario. Is a 4TB sVDEV “sane” for a ~45TB spinning array?
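
From what I’ve read, one way to ballpark that once data is on the pool is zdb’s block statistics (read-only, though it can take a while on a big pool); the block-size histogram shows how much data would fall at or below a given cutoff:

  # Block statistics for an existing pool (placeholder pool name)
  zdb -Lbbbs tank | less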

It’d be really cool if ZFS supported tiering, to put things with more frequent access on SSD while putting colder data on spinners.

Sorry! I’m on Dragonfish 24.04.1.1 (latest as of when I started the post).

Nothing personal, but when I see a Linux kernel dev who has specifically worked on the memory management architecture say that systems should have swap… let’s just say that I put a lot of weight on the kernel dev’s opinion. Even if the swap space is far less than the physical memory, it seems as though modern kernels can benefit from having it available rather than not.

Valid suggestion. Due to the physical size constraints of the system, and the fact that it only has an x4 slot, a GPU may be somewhat unrealistic though. Plus, I got the Radian card for free, I don’t have a suitable GPU just laying around, and I’ve already blown way too much money on this system. I’ve already got an Intel i5-8xxx passive system (Protectli) with an iGPU that I think could work for transcode, so I would just need to have that system mount the videos from the TrueNAS machine… The only problem is that the i5 doesn’t have 10Gb NICs, although it does have 6x1Gb that could potentially all be used together, either bonded or individually (I don’t know if that would help NFS performance, and I’m not sure if Linux can do SMB multichannel…)
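
(From what I can tell, the in-kernel CIFS client does have multichannel mount options on newer kernels; a hypothetical mount from the i5 box might look something like the following, with the share path and credentials file as placeholders, and Samba-side multichannel would need to be enabled too.)

  mount -t cifs //truenas.lan/media /mnt/media \
      -o credentials=/root/.smbcreds,multichannel,max_channels=4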

OK, this makes a fair amount of sense to me: a reservation to allow for disk-sizing concerns. If that is truly one of the intended reasons behind it, I feel like it would be more logical to use a different partition type so it causes less confusion!

This neatly summarizes my thoughts on swap.

A little bit is useful for recovering the memory used by long-running allocations that don’t need to stay live for now.

I think it’s a shame that swap has been totally disabled in 24.04.1.1, and AFAICT there is no easy way to re-enable it (swapon doesn’t seem to work).

As was discovered, the issue was not swap per se, but rather an lru_gen incompatibility with the ARC changes, and iX not dogfooding with swap enabled.

Me too; I prefer swap. While I agree that on a live system you don’t want a lot of paging, OTOH we’ve seen some OOM crashes reported here lately, which we knew were coming. Not many that we know of, but I’d rather have a slower-but-running system than a total crash myself. Either way, there is an issue to be solved. I am pretty sure they did not say they would never enable it again. On a correctly functioning system with a reasonable swappiness setting, there should be zero issues and no harm with swap enabled, only a benefit. I completely understand turning it off for now and deciding later how best to handle it.

As a long-term Linux user, I’d like to see a way to enable swap. For example, I partitioned my boot drive so I could create a mirrored mdadm device to use for swap, which gives me a bit of additional headroom. My TrueNAS box is basically SOHO grade, so it only has 32GB of RAM at present, and 8GB of swap can be useful.
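
Roughly what that looks like, assuming two spare partitions to mirror (device names are placeholders):

  # Mirrored swap across two spare partitions (placeholders)
  mdadm --create /dev/md/swap --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
  mkswap /dev/md/swap
  swapon /dev/md/swap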

I just upgraded to Dragonfish and now I can’t get my swap to re-enable.

Hi guys,

I’ve run into this thread because I’ve noticed that my system, upgraded to Dragonfish-24.04.2.5, is also not using any swap. I started researching swap configuration because I see OOM errors like the ones below in the messages file, which I think cause a restart of the web GUI (middlewared). I noticed this because I kept getting logged out of the GUI. It looks like the OOM condition occurs in particular when replication tasks are running, causing them to stop.

The system has 32 GB of ECC RAM and approx. 14 TB of storage. I’ve looked into re-enabling swap and I’ve managed to do so on another test system, but before I touch my main system, I thought I’d research this further.

Is anybody else seeing similar memory related errors?

Nov 11 13:03:46 nas3 kernel: DBENGINE invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=-900
Nov 11 13:03:47 nas3 kernel: CPU: 1 PID: 6635 Comm: DBENGINE Tainted: P           OE      6.6.32-production+truenas #1
Nov 11 13:03:47 nas3 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./E3C246D4I-2T, BIOS P2.30 06/06/2023
Nov 11 13:03:47 nas3 kernel: Call Trace:
Nov 11 13:03:47 nas3 kernel:  <TASK>
Nov 11 13:03:47 nas3 kernel:  dump_stack_lvl+0x47/0x60
Nov 11 13:03:47 nas3 kernel:  dump_header+0x4a/0x1d0
Nov 11 13:03:47 nas3 kernel:  oom_kill_process+0xf9/0x190
Nov 11 13:03:47 nas3 kernel:  out_of_memory+0x256/0x540
Nov 11 13:03:47 nas3 kernel:  __alloc_pages_slowpath.constprop.0+0xb23/0xe20
Nov 11 13:03:47 nas3 kernel:  __alloc_pages+0x32b/0x350
Nov 11 13:03:47 nas3 kernel:  folio_alloc+0x1b/0x50
Nov 11 13:03:47 nas3 kernel:  __filemap_get_folio+0x128/0x2c0
Nov 11 13:03:47 nas3 kernel:  filemap_fault+0x5d2/0xb50
Nov 11 13:03:47 nas3 kernel:  __do_fault+0x30/0x130
Nov 11 13:03:47 nas3 kernel:  do_fault+0x2b0/0x4f0
Nov 11 13:03:47 nas3 kernel:  __handle_mm_fault+0x790/0xd90
Nov 11 13:03:47 nas3 kernel:  ? __blk_mq_free_request+0x71/0xe0
Nov 11 13:03:47 nas3 kernel:  handle_mm_fault+0x182/0x370
Nov 11 13:03:47 nas3 kernel:  do_user_addr_fault+0x1fb/0x660
Nov 11 13:03:47 nas3 kernel:  exc_page_fault+0x77/0x170
Nov 11 13:03:47 nas3 kernel:  asm_exc_page_fault+0x26/0x30
Nov 11 13:03:47 nas3 kernel: RIP: 0033:0x7fc8a8d30e26
Nov 11 13:03:47 nas3 kernel: Code: Unable to access opcode bytes at 0x7fc8a8d30dfc.
Nov 11 13:03:47 nas3 kernel: RSP: 002b:00007fc8a834a2e0 EFLAGS: 00010293
Nov 11 13:03:47 nas3 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fc8a8d30e26
Nov 11 13:03:47 nas3 kernel: RDX: 0000000000000400 RSI: 00007fc8a834a3e0 RDI: 000000000000003d
Nov 11 13:03:47 nas3 kernel: RBP: 00000000000003e8 R08: 0000000000000000 R09: 0000000000000002
Nov 11 13:03:47 nas3 kernel: R10: 00000000000003e8 R11: 0000000000000293 R12: 00007fc8a834a3e0
Nov 11 13:03:47 nas3 kernel: R13: 0000000000000000 R14: 000055688c9c00a0 R15: 000055688c9c00a0
Nov 11 13:03:47 nas3 kernel:  </TASK>
Nov 11 13:03:47 nas3 kernel: Mem-Info:
Nov 11 13:03:47 nas3 kernel: active_anon:815 inactive_anon:239581 isolated_anon:0
 active_file:193 inactive_file:142 isolated_file:0
 unevictable:768 dirty:0 writeback:0
 slab_reclaimable:4638 slab_unreclaimable:2347600
 mapped:1103 shmem:2448 pagetables:1775
 sec_pagetables:0 bounce:0
 kernel_misc_reclaimable:0
 free:74202 free_pcp:118 free_cma:89
Nov 11 13:03:47 nas3 kernel: Node 0 active_anon:3260kB inactive_anon:958324kB active_file:772kB inactive_file:568kB unevictable:3072kB isolated(anon):0kB isolated(file):0kB mapped:4412kB dirty:0kB writeback:0kB shmem:9792kB shmem_thp:0kB shmem_pmdmapped:0kB anon_thp:268288kB writeback_tmp:0kB kernel_stack:9952kB pagetables:7100kB sec_pagetables:0kB all_unreclaimable? no
Nov 11 13:03:47 nas3 kernel: Node 0 DMA free:11264kB boost:0kB min:28kB low:40kB high:52kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Nov 11 13:03:47 nas3 kernel: lowmem_reserve[]: 0 1943 31787 31787 31787
Nov 11 13:03:47 nas3 kernel: Node 0 DMA32 free:127740kB boost:0kB min:4128kB low:6116kB high:8104kB reserved_highatomic:6144KB active_anon:0kB inactive_anon:31532kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:2129272kB managed:2062320kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Nov 11 13:03:47 nas3 kernel: lowmem_reserve[]: 0 0 29843 29843 29843
Nov 11 13:03:47 nas3 kernel: Node 0 Normal free:157804kB boost:0kB min:63420kB low:93980kB high:124540kB reserved_highatomic:94208KB active_anon:3260kB inactive_anon:926792kB active_file:1400kB inactive_file:608kB unevictable:3072kB writepending:0kB present:31178752kB managed:30567996kB mlocked:0kB bounce:0kB free_pcp:968kB local_pcp:0kB free_cma:356kB
Nov 11 13:03:47 nas3 kernel: lowmem_reserve[]: 0 0 0 0 0
Nov 11 13:03:47 nas3 kernel: Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (UM) 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 11264kB
Nov 11 13:03:47 nas3 kernel: Node 0 DMA32: 51*4kB (UM) 15*8kB (UM) 54*16kB (UM) 23*32kB (U) 649*64kB (U) 582*128kB (U) 38*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 127684kB
Nov 11 13:03:47 nas3 kernel: Node 0 Normal: 5395*4kB (UMHC) 2401*8kB (UMH) 5489*16kB (UMEH) 200*32kB (UMC) 239*64kB (UMC) 59*128kB (UC) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 0*8192kB 0*16384kB 0*32768kB 0*65536kB = 157860kB
Nov 11 13:03:47 nas3 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Nov 11 13:03:47 nas3 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Nov 11 13:03:47 nas3 kernel: 2689 total pagecache pages
Nov 11 13:03:47 nas3 kernel: 0 pages in swap cache
Nov 11 13:03:47 nas3 kernel: Free swap  = 0kB
Nov 11 13:03:47 nas3 kernel: Total swap = 0kB
Nov 11 13:03:47 nas3 kernel: 8331004 pages RAM
Nov 11 13:03:47 nas3 kernel: 0 pages HighMem/MovableOnly
Nov 11 13:03:47 nas3 kernel: 169585 pages reserved
Nov 11 13:03:47 nas3 kernel: 65536 pages cma reserved
Nov 11 13:03:47 nas3 kernel: 0 pages hwpoisoned

You could create a bug ticket.
It’s probably best to create your own thread or tie to a very recent one with the same SCALE version.

I was successful at enabling swap on a zvol on SCALE 24.04 and 24.10 using some custom pre-init shell scripts that set up zstd-compressed zswap (rather than regular swap); they get called by the UI middleware on boot or can be run manually.

Basically, the idea is to format a zvol as swap with a block size matching your kernel’s page size, relying on the Global SED setting for encryption. This is a one-time task, but it can be done on boot for sanity-checking. Then load the zswap module with modprobe and configure its various settings in the sys and proc filesystems to enable it. The script then has to manually call swapon with a fully qualified device path to the swap location in the pool rather than the bare zvol device (which won’t work right with zswap), and without relying on information in /etc/fstab, as the latter doesn’t appear to be retained by the middleware during reboots or upgrades. Remember that, except for calling mkswap on the backing store, you will need a way to perform these activities at each boot. You will also need to back those scripts up, as upgrades and boot-pool reinstalls will clobber all the usual places you might put them, e.g. /root/bin or /usr/local/sbin.
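
A condensed sketch of the whole thing - pool, zvol name, and size are placeholders, and you should check the zswap docs before copying anything:

  # One-time setup: a zvol with a volblocksize matching the 4K page size, formatted as swap
  # (ZFS compression off here since zswap already compresses; adjust to taste)
  zfs create -V 16G -o volblocksize=4K -o compression=off tank/swapvol
  mkswap /dev/zvol/tank/swapvol

  # Per-boot (pre-init script): enable and configure zswap, then activate the swap
  modprobe zswap
  echo zstd > /sys/module/zswap/parameters/compressor
  echo 1 > /sys/module/zswap/parameters/enabled
  # Use the fully qualified /dev/zvol/<pool>/<name> path rather than the bare /dev/zdX node
  swapon /dev/zvol/tank/swapvol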

I’ve also found that with Electric Eel I need to set sysctl -w vm.swappiness=133 to encourage the kernel to page out. This is documented by zswap, and basically just pushes the kernel to swap out inactive pages; the system still uses hardly any swap at all, but it is at least available so that non-enterprise systems aren’t inviting the OOM killer to dinner.

For compression, I’ve found that zstd works well for my use case. It’s slower to compress, which shouldn’t be an issue for moving low-usage or old pages to swap, but very fast to decompress on an in-memory cache miss. If you’re using swap more heavily you might opt for lz4 or rle instead, but zswap is meant to use in-memory and on-disk compression, and zstd seems to strike a good balance for me. Calling echo zstd > /sys/module/zswap/parameters/compressor gets that done, but you can use any kernel-supported compressor that you like. IIRC the zswap documentation tells you how to find out which ones are available to you, but the three I mentioned are available on 24.10.

There are lots of other tunable knobs with zswap, so I won’t cover them all, but so long as your backing store doesn’t enter a lower power mode, this has been reliable for me for close to a month. The only time I ever had problems was when I set hardware to spin down or enter lower power modes; when I did that, sometimes the backing store would appear offline and swap would throw errors, without crashing the kernel. YMMV.
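
If you want to see which knobs exist on your own system, they are all readable in one go:

  # Print each zswap parameter and its current value
  grep . /sys/module/zswap/parameters/*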

If you’re running a lot of apps or VMs that aren’t being actively used simultaneously, which seems like a common scenario in non-enterprise environments, then swap should have more upsides than downsides. However, if you’re actively using all those pages, then zswap may still help by optimizing in-memory compression but could also lead to disk thrashing if there’s too much paging in and out.

For a non-enterprise scenario with lots of VMs and Docker apps, I would highly recommend trying zswap. It might even be a good idea for enterprise environments where all the allocated memory isn’t actively in use and should be swapped out to make room for active processes. However, if you have mission-critical stuff running, the usual advice about upgrading your memory should apply if it’s within your budget. If not, and swapping causes you problems, then you may need to reconsider how many containers or heavyweight VMs you should be provisioning.

On 16GB of RAM I wouldn’t recommend more than one or two GUI-enabled VMs at once, or more than about 10 apps as a ballpark figure even with zswap enabled. That’s based on my personal experience; your mileage and particular usage patterns will certainly vary.
