Asking for recommendations on a new SETUP

Hi,

I’m planning to upgrade the capacity of my NAS:

OS: TrueNAS Scale

Case: Fractal Design Define 7 XL
CPU: Intel Core i5-12600K
Motherboard: ASUS Pro WS W680-ACE IPMI
LAN card: Chelsio TC520 - 10 Gigabit Ethernet
HBA card: LSI 9302-8i
RAM: Kingston Server 128 GB (4x32 GB) DDR5 ECC @ 4,800 MHz
Storage - M.2 SSD: Samsung 970 EVO Plus 500 GB
Storage - HDD: 16 x Toshiba MG10 20 TB in 2x RAIDZ2 (1 pool, 2 vdevs) for a total of 210 TB
PSU: Corsair 850W - 80 PLUS Gold

This server hosts the media for my streaming servers (dozens of video streams running at the same time).

I’m going to upgrade from 16 x 20 TB to 16 x 20 TB + 8 x 22 TB + 8 x 24 TB.
I’d like to know if you think I should upgrade the RAM?
If you think so, I’ll have to build a new NAS (and either keep the old one or move everything into the new one).

Also, when my pool is being scrubbed, everything slows down and things start lagging at some point. Would you have any idea why? Is there any solution?

Thanks a lot !

I don’t see any particular reason to upgrade RAM unless you’re getting a lot of ARC misses - for a media server I’m guessing ARC isn’t doing much, since folks aren’t watching the same movie/episode more than once in a row, so there isn’t much for the cache to reuse.
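If you want to check that before spending anything, something like this should show the ARC size and hit ratio (a rough sketch; both tools ship with OpenZFS on SCALE as far as I know, run from the shell):

    # Overall ARC size, hit/miss counters and efficiency percentages
    arc_summary | head -40

    # Or watch hits vs. misses live, sampled every 5 seconds
    arcstat 5

If the hit ratio is already low under your normal streaming load, more RAM for ARC probably won’t buy you much.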

You’ll also have a rough time finding a way to increase capacity - I don’t see any support for 4x48 GB ECC UDIMMs. If you’re talking about upgrading speed, it’s the same issue. I’m impressed you managed to get 4x32 GB running, as I’ve had a lot of issues with that motherboard and had to downgrade to a specific BIOS to boot with all 4 slots filled.

Slows down in what way? A scrub hits the storage and CPU (storage to read the data, CPU to validate checksums) - the best advice would be to schedule scrubs for when there are the fewest viewers, I guess. That would at least take some stress off the CPU.

If these streams are being transcoded, then IMO look into a GPU for hardware transcoding. Considering you’re running dozens of streams, either the NVIDIA session-limit patch or a Quadro may be necessary.

What are you running your apps on? I see you mention a single NVMe - is this for apps or the boot pool? Consider moving apps to NVMe in general.

If you’re doing transcoding, it might be an idea to move the transcode directory to tmpfs - you’ve got 128 GB of RAM and the main use case seems to be media, so trading a bit of ARC for fast temporary storage of transcoded media may work out better for performance.
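Very roughly what I mean (the mount point and size are just examples; you’d point the transcoder’s temporary directory at it in Plex/Jellyfin):

    # RAM-backed scratch space for transcode segments (example size: 32 GiB)
    mkdir -p /mnt/transcode
    mount -t tmpfs -o size=32G tmpfs /mnt/transcode

    # Optional: make it survive reboots
    echo 'tmpfs /mnt/transcode tmpfs size=32G 0 0' >> /etc/fstab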

The NAS is only here for storage; the PMS (Plex Media Server) instances are on 2 other servers (using the UHD 770 to transcode).
At the beginning the NAS was running on a 14900K, but since I decided not to run any apps on it, I switched to something less overkill, which also means I’m not concerned by the hardware/microcode “issue”.

When I was saying “upgrade”, I was thinking of changing the motherboard, because I couldn’t find any 48 GB sticks, so I’d need more slots to add extra 32 GB sticks if required.

I cannot schedule the scrub for a “good” time, since it takes several days (between 2 and 4 days, I think).

Slow in what way? Well, it takes a long time to load pages or to start a stream; once the stream is started, it’s mostly fine. Everything is also laggy because, for now, the VM drives are on the NAS - I plan to move them to local SSDs for better performance.

Here’s the full setup:

Server 1

OS: Proxmox VE 8.2 (clustered)

Case: NZXT H5 Flow
CPU: Intel Core i5-12600K
CPU cooler: Thermalright Peerless Assassin 120 SE
GPU: Nvidia Quadro RTX 4000
RAM: Crucial Pro 64 GB (4x16 GB) DDR4 @ 3,200 MHz
Motherboard: Gigabyte B760M DS3H DDR4
LAN card: TP-Link TX401 - 10 Gigabit Ethernet
Storage - M.2 SSD: Western Digital Black SN850X 500 GB
PSU: Corsair CX650 - 80 PLUS Bronze

Server 2

OS: Proxmox VE 8.3 (clustered)

Case: Corsair 4000D Airflow
CPU: Intel Core i9-14900K
RAM: Corsair 64 GB (4x16 GB) DDR4 @ 3,200 MHz
Motherboard: Gigabyte B760M DS3H DDR4
LAN card: Chelsio TC520 - 10 Gigabit Ethernet
Storage - M.2 SSD: Crucial P3 2 TB
PSU: Corsair CX550 - 80 PLUS Bronze

Cooling: Be quiet! Silent Loop 2 280mm

Server 3

Mini PC: Insoluxia Intel N100 CPU, 16 GB of RAM, 512 GB SSD running Proxmox 8.3 (clustered)

NAS

OS: TrueNAS Scale

Case: Fractal Design Define 7 XL
CPU: Intel Core i5-12600K
Motherboard: ASUS Pro WS W680-ACE IPMI
LAN card: Chelsio TC520 - 10 Gigabit Ethernet
HBA card: LSI 9302-8i
RAM: Kingston Server 128 GB (4x32 GB) DDR5 ECC @ 4,800 MHz
Storage - M.2 SSD: Samsung 970 EVO Plus 500 GB
Storage - HDD: 16 x Toshiba MG10 20 TB in 2x RAIDZ2 for a total of 210 TB
PSU: Corsair 850W - 80 PLUS Gold

Cooling: Be quiet! Silent Loop 2 280mm (with 2x Silent Wings Pro 4 140mm)

Switch: TP-Link TL-SX105 - 5-port 10G Ethernet
UPS: 3x Eaton Ellipse PRO 1600 FR

The ZFS report for the last month:


Oh wow - you have a pretty sweet setup. It (and likely your budget) is more than I expected; I thought there was a typo when you said ‘dozens’ of streams. If you’re willing to drop money on a new motherboard and CPU for additional RAM slots, then you’re more serious than I expected.

I’ll pass on giving further advice; my media feeds less than a half-dozen.

To be totally honest, right now my peak (record) is between 40 and 50 streams (I’m not expecting more than +/- 70 simultaneous streams in the future), and most of them are not transcoded. Since I’ve been forced to move away from the Google and Dropbox unlimited plans, I had to go local, which is more efficient in terms of storage price. But I can’t store unlimited amounts of heavy 4K media anymore, and my 210 TB are at 82 % used, which isn’t good performance-wise… That’s why I’m thinking about upgrading to a really large amount of storage (by adding 8 HDDs of 22 TB and 8 more 24 TB drives, which cost a fortune :sweat_smile:).

Almost my whole family and friends are on my servers, and most of them are willing to contribute to the cost of the infrastructure, so upgrading isn’t much of a concern for me :joy:


UP :smile:

Could anyone help, please? :grimacing:

Alright, I’ll hop back in and do my best. I’m curious: you’ve got dozens of streams going and mention things getting laggy - what do you mean exactly? Is the entire server laggy or just the end-user experience?

What are your upload speeds, and how much outgoing data are you seeing for ~50 active streams at the same time? If you’re not transcoding and running FIFTY 4K streams, that would need roughly a 3-gig upload to your ISP (at ~60 Mbps per remux-quality stream, 50 × 60 Mbps ≈ 3 Gbps) :open_mouth:

Maybe it’d be worth looking into transcoding + HDR tone mapping vs building a new storage solution?
Edit: after additional thinking, I also realize that you have ~50 clients sending requests to a single pool of hard drives - I’m wondering whether it would be worth adding another pool instead of just upgrading capacity, and spreading the media across the two.
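Very roughly, the second-pool idea looks like this (the pool name and device names are placeholders, and on TrueNAS you’d normally build it through the UI rather than the CLI):

    # A second, independent pool from 8 of the new drives, 8-wide RAIDZ2
    # like your existing vdevs (sdq..sdx are placeholder device names)
    zpool create tank2 raidz2 sdq sdr sds sdt sdu sdv sdw sdx

Two pools means two independent sets of spindles, so a scrub on one doesn’t touch the other.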

Maybe I explained it wrong.
It’s getting “laggy” only when a scrub of the pool is running, and it’s not lagging the whole time, it’s kinda random.

At the start, I had unlimited GDrive, and when that wasn’t an option anymore I switched to Dropbox (also unlimited), which then also stopped their unlimited plan. I had +/- 300 TB of data: half had a good quality-to-size ratio, meaning less than 10 Mbps bitrate, and the other half was remuxes, so mostly 50 Mbps and above. Since my storage capacity is limited, I almost entirely stopped storing heavy stuff, which I’d like to do again (not as much as before, obviously).

My bandwidth isn’t an issue though; I have 8 Gbps up and down.

The pool is already split into 2 vdevs, but yes, running 2 different pools (or 2 NAS) is an option.

My biggest concern and question is about the amount of RAM required or recommended for my use, and whether a bigger CPU would change anything for the scrub (I don’t think so, but still). All of this considering the planned capacity upgrade: I already have most of the new disks, I just have to decide whether I’m building a second NAS or upgrading this one.

Very jealous.

A scrub does add load to the pool, since you’re validating all the data on it, so you’re going to see I/O and latency impacts.

With a 12600K, I doubt a better CPU would improve anything - at worst you can check CPU usage during a scrub, but I doubt you’ll see crazy usage numbers.

I’m sure there are some ZFS tunables that could limit the speed of the scrub so there is less impact on an actively in-use pool (in theory); I don’t know them & don’t know the risks of changing defaults.

Not sure how often you run scrubs, but you could consider limiting them to once a month & just telling everyone ‘sorry, maintenance day; performance may be degraded’.

OK :blush:

I’d be more than happy to tweak and optimise the scrub, if possible, and without “danger”.

You don’t know about the RAM?

I just doubt that with your current capacity it’d make a difference. If you had, like… 16 GB of RAM, I’m certain there’d be a noticeable positive impact from going to 32 GB. But with 128 GB - even if you went and got 256 GB, I doubt the performance improvement would be noticeable or worth the cost.

If I had to guess, during a scrub & with >40 active 4k streams, you’re just bottle-necked by the throughput of your drives.

I’m wondering, @NickF1227 - any chance your benchmark tool could also compare performance during a scrub? Would it be as simple as running a scrub while the benchmark is active?

This would at least give some hard numbers and evidence of the performance impact during a scrub, and maybe also allow WoisWoi to compare changes in hardware/ZFS tunables.

There are ways to do it, I just don’t know enough about the tunables; I do know that you shouldn’t touch them unless you have a specific and well-defined use case (which you do, so it could be worth exploring).

Honestly, if there wasn’t also a major need for storage capacity, I’d just recommend going to an SSD pool.

Lol, any chance a pool of these is in the budget? Would rule out throughput issues AND give you 30TB of space per drive :stuck_out_tongue:

Yes that would work. I’ve not explored this topic actually, happy to join you in the experiment. I’ll report back. :stuck_out_tongue:

Also, zpool iostat -vyl 10 should give you what you need… it’s more like a screwdriver than a hammer. Also note, scrub impact is dynamic based on workload. Scrub I/Os have a low priority, so real workloads will take precedence.
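For example, run it during a scrub and again during a quiet period and compare the latency columns (the pool name here is a placeholder):

    # Per-vdev throughput and latency, sampled every 10 seconds:
    # -v per-vdev breakdown, -y skip the since-boot summary, -l add latency columns
    zpool iostat -vyl yourpool 10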

I’m 95% sure this is the culprit. I have some facts/pieces of evidence to support my POV:

  1. First of all, RAIDZ should not be used for VM disks. I advise you to read this.
  2. I have a (somewhat) similar architecture for Jellyfin. The data of the LXC container with JF is stored on the Proxmox node itself. The media directory is provided as a mount for the container; the mount itself is an SMB share. Despite not having your workloads, my interface is very snappy - in most cases it’s almost instant (meaning below 100 ms). Starting the stream, OTOH, can take some time.
  3. I once read that JF’s config database can have some performance bottlenecks specifically on ZFS that should be tweaked - something about write amplification with the standard record size (see the sketch after this list). Mine works OK (without tweaks) during JF metadata updates and great at all other times; I suspect that’s because an NVMe mirror can withstand any reasonable workload. While this point is specific to JF, it could also apply to PMS.
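For point 3, the tweak usually mentioned is a smaller recordsize on the dataset holding the app’s database (the dataset name below is just an example, and 16K is only a guess at a typical SQLite page size - check what your DB actually uses; the new recordsize only applies to newly written blocks):

    # Reduce write amplification for small random database writes
    zfs set recordsize=16K tank/apps/plex-db
    zfs get recordsize tank/apps/plex-db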

Thinking so too, but I’m pretty sure that the default workload is already pegging the drives hard in this one.

Completely missed that VMs are also running on spinning rust at the same time :open_mouth:


I can say from experience in production… it does try its best not to interfere with incoming writes, and there are tunables to “time shift” the scrub a bit.

As an example (a rough sketch of the knobs I mean follows this list),

  • Make the scrub last longer, but be less impactful… e.g. because I have a very sensitive workload or a niche use case.
  • Make it more impactful now, to get it over with sooner… e.g. because we are off prod hours…
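Something like the following is the kind of thing I mean (a sketch only - the values are illustrative, they reset on reboot, and on TrueNAS you’d want to set them the supported way rather than echoing into sysfs):

    # Check the current defaults first
    cat /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
    cat /sys/module/zfs/parameters/zfs_scan_vdev_limit

    # Gentler scrub: fewer concurrent scrub I/Os per vdev, smaller in-flight scan limit
    echo 1 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
    echo $((4*1024*1024)) > /sys/module/zfs/parameters/zfs_scan_vdev_limit

    # More aggressive scrub (off-hours): let it issue more I/O so it finishes sooner
    echo 4 > /sys/module/zfs/parameters/zfs_vdev_scrub_max_active
    echo $((64*1024*1024)) > /sys/module/zfs/parameters/zfs_scan_vdev_limit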

Now on a more personal note,
I’ve only ever had to modify them for very specific reasons though, and mostly for environmental reasons rather than technical ones.

If a scrub is painfully slow, though, there’s always the possibility of a PHY-layer (physical) issue with cabling or adapters… or, theoretically, a rogue bad disk with a particularly poor failure mode dragging you down by clogging up I/O with checksum errors…

But I don’t think that’s OP’s problem, as it’s an edge case and not something I’ve commonly seen personally.

@WoisWoi - can you confirm whether you have recently done a SMART short test?
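(If not, it’s quick to run from the shell - the device name below is just a placeholder for whichever member disk you pick:)

    # Start a short self-test on one drive
    smartctl -t short /dev/sda

    # A couple of minutes later, review the self-test log and health summary
    smartctl -a /dev/sda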

Can you post a screenshot of the disk reports screen in the Reporting section? Just include any of the drives in the pool and only Disk I/O

This is probably a bad example, because this pool was doing nothing all day until I started messing around with running tn-bench and scrubbing concurrently…

This information would be helpful for troubleshooting.

But, @Fleshmauler, side note.

Literally just for fun, I spent the past hour writing the world’s most ridiculous regex/awk madness with my local LLM, because I had to answer your question… you’ve got me curious about how I can measure this.

I just started the benchmark. I’ll probably still have to rework some things, but I should have an answer for you with meaningful-ish numbers, derived just from what zpool status reports, at some point.

I’ll post results and stuff in the tn-bench thread when I have some time.

Sorry for multiple edits :slight_smile:


Is the additional info (instead of plain sda, sdb, etc.) shown because your drives are SAS, or is there some setting on the reporting page?


Ah, that screenshot is from an old OOS Enterprise system - sorry, I didn’t realize.
I’m not sure if SAS reports like that in CE or if it’s the bus.

I’ll check, one sec. On CE here for NVMe. I know there was work done in Fangtooth for reporting; I’m not sure, but this looks to be a new feature.

These are all on 25.10.0

CE on SATA

Professional Nick:
Now that I’m looking at it, it shouldn’t literally say {serial_lunid}. I’ll have to go see if this was fixed in .1. Thanks for bringing this to my attention - I hadn’t noticed until you asked.


Thanks for your clarifications. I’m still on 24.10, so no wonder mine are plain sdX.


First of all, I’d like to thank everyone for your feedback! :heart:

SMART short tests are auto-run every day on all the drives, and long tests every month. No errors reported; I had to replace 1 HDD a few months ago, and now everything looks fine.

As I said, since I’m storing locally on my NAS I kinda stopped the 4K stuff, so I only have pretty light media (between 0 and 10 Mbps). Also, the ~45-stream peak is my all-time peak; I’m more likely between 20 and 30 streams every evening, and right now, for example, there are about 5 streams.


I don’t understand this:

Max in 1 hour is higher than max in 1 week (or 1 day, etc.) ??




Should I send the screenshots for all the drives? They look pretty much the same on all of them; also, a scrub is not running atm.