The intent is to build a quiet NAS providing around 80TB, or up to 96TB, of storage, capable of serving a 10-56GbE network, with features like snapshots, compression, and de-duplication.
The machine has to sit only 3-5 meters from my desk in a rather open space, and is not far from the bedroom either, so keeping the noise level down is a major challenge.
NVMe disks are silent and fast, but their per-TB cost is more than I can bear.
So this build will be a compromise between quietness and cost. I'll try to eliminate, or at least reduce the intensity of, random I/O on the HDDs, especially for small files and hot data, using a sVDEV, L2ARC, and SLOG sized adequately and cost-effectively.
To build around 80TB of space with raidz2, we can use either
- 7 x 16TB HDD
- 12 x 8TB HDD
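Both layouts keep two parity disks, so the usable space is roughly (N - 2) x disk size: 5 x 16TB or 10 x 8TB ≈ 80TB before ZFS overhead. A minimal creation sketch for the 7 x 16TB option, with placeholder pool and device names (in practice use /dev/disk/by-id paths):

```
# sketch only: a 7-wide raidz2 pool named "tank"; device names are placeholders
zpool create -o ashift=12 tank raidz2 \
    /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
zfs set compression=lz4 tank    # compression pool-wide; dedup stays a per-dataset decision
```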
Chassis:
The choice of chassis is the Fractal Design Define 7 (or XL). It is built with thick metal panels, has a front door, and its side and top panels are lined with sound-absorbing material. It also has enough front bays for all the HDDs and supports 2.5" SSD/U.2 drives cooled by the front fans. It can certainly reduce some noise, but I'd rather not be too optimistic.
In addition, it has slots for a vertical GPU mount, which can be used to install a carrier for 3 x 80mm or 2 x 120mm fans blowing air down onto the PCIe region. That makes cooling the HBA, NIC, and direct-mounted NVMe drives easier, and is quieter than attaching multiple small 40mm fans to them. I'm unsure about its effectiveness, but it is worth a try.
My only concern about this chassis is the rather shaky-looking three-point mount of the drive bays, i.e. the 8 o'clock corner is left dangling. But this chassis has been widely used for storage builds, so I assume it is stable enough and won't harm the spinning drives.
I've also heard that the XL has better build quality than the non-XL.
So if you know this case, I'd love to hear your advice.
HDDs:
The much quieter NAS-designated HDDs like the IronWolf Pro cost 1.5-2 times as much as normal drives and are not very affordable.
8TB disks like the Seagate 7E10 or WD HA340 are both less noisy than, say, a 16TB Seagate Exos X18, but they are definitely audible (a random-I/O test measured about 50+ dB). So it is unclear whether 7 x 16TB would be louder, or whether 12 quieter 8TB drives could resonate and end up producing more noise. Furthermore, the 16TB drives are helium-sealed, so while they measure more decibels, they produce deep, low-pitched sounds that may be better absorbed by the chassis and be less annoying at midnight.
Further, knowing that
- the major noise comes from intensive random IO,
- idle and sequential IO are less noticeable,
- these disks inevitably click periodically, and little can be done about that,
therefore the goal is to reduce intensive random I/O as much as possible, which happens to coincide with the goal of achieving high throughput.
The usage:
- about 50TB of the 80TB are rather cold media files larger than 1GB;
  access to these files alone is sequential.
- about 10TB are medium files, ranging from a bit over 1MB to far less than 1GB,
  including packages, documents, photos, etc.;
  after a while of usage, a large portion of them can sit in the L2ARC.
- 1 or 2 TB are hot VM images, virtual disks, etc.;
  the one in use is almost certainly in the L2ARC, and if not, its R/W is likely to be sequential.
- less than 1TB consists of tens of thousands of files smaller than 1MB that can be stored
  on the NVMe sVDEVs.
By "small files" we specifically mean the files that end up stored on the sVDEVs;
"medium files" are files larger than the sVDEV small-file threshold but likely able to stay in the L2ARC; the number of medium files can be large, and they generate some random I/O;
"large files" are too big to stay in the L2ARC permanently and will likely be evicted; reading or writing a large file is usually a sequential operation.
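As a rough sketch of how those per-dataset thresholds could be expressed (the dataset names and values here are made up for illustration and will need tuning):

```
# sketch: route small blocks to the sVDEV per dataset; names/values are placeholders
zfs set special_small_blocks=64K  tank/documents   # blocks up to 64K land on the sVDEV
zfs set special_small_blocks=128K tank/src         # with the default 128K recordsize, the whole dataset goes to the sVDEV
zfs set recordsize=1M             tank/media       # keep large, cold media in big sequential records on the HDDs
zfs set special_small_blocks=0    tank/media       # ...and entirely off the sVDEV
```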
The NAS itself will host one or two VMs with Docker running a few lightweight services; I do not expect to use the NAS as a hypervisor. There will be a separate hypervisor machine with 5-10 VMs connecting to this NAS; only a few of them have short bursts of heavy I/O, and they are unlikely to be highly concurrent. Then there is a workstation and three Macs, which are rarely used at the same time.
Although I'll install a 56GbE NIC, I'm not expecting this machine to saturate that bandwidth. As long as it can sustain 10GbE, the extra headroom is left for reads hitting the ARC/L2ARC and for small files that can be written directly to the NVMe sVDEV. Maybe later I'll add a small pool of 3-4 NVMes.
Cases that can have random I/O:
- metadata lookups, e.g. an rsync run, cause heavy random I/O;
  easily mitigated by the sVDEVs, which store all the metadata.
- the majority of random reads of small files can be handled by a pre-heated ARC/L2ARC,
  warmed by, for instance, `find . -size -10M -exec wc {} +`.
- the majority of random writes to small files can be handled by a large sVDEV and a generous small-file threshold, adjustable per dataset, e.g. storing the entire source-code and git-repo datasets on the sVDEV.
- medium files such as VM images and applications in use are often in the L2ARC.
- ! concurrent reads of many large files can make the hard-disk arm move rapidly between the file locations, similar to random I/O;
  this cannot be mitigated.
- ! a number of concurrent writes to large files have to be committed to the HDDs;
  likely cannot be mitigated.
- ? a number of concurrent writes to medium files need to be committed to the HDDs;
  I'm not sure how ZFS handles these writes: if ZFS serializes these medium files and writes them to the HDDs in a sequential fashion, the noise is low; otherwise it is similar to random writes.
- ! ZFS routine housekeeping such as scrubs, especially at midnight, causes heavy random R/W;
  we could reschedule it to run during the day (see the sketch after this list), but it cannot be eliminated.
- TBD
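A sketch of the two mitigations above that are under my control, assuming a pool named tank mounted at /mnt/tank; the plain cron syntax is only an illustration, since the appliance (TrueNAS etc.) usually exposes the scrub schedule in its own UI:

```
# sketch: move the periodic scrub away from midnight (crontab entry, placeholder schedule)
0 10 1 * * /sbin/zpool scrub tank        # scrub at 10:00 on the 1st of each month

# pre-heat ARC/L2ARC for the small/medium files by simply reading them once
find /mnt/tank -size -10M -exec cat {} + > /dev/null
```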
A summary of the preliminary build:
- M.board: Supermicro X11SPL-i
- CPU: 2nd-gen Xeon Scalable (Gold 6230), 20 cores, 2.1-3.9GHz, 27.5MB L3, and only 125W TDP
- Memory: 512GB total, from 8 x 64GB 2666MHz or 2400MHz DDR4 RDIMM ECC
- HBA: Broadcom LSI 9305-16i PCIe3.0 x8 (which uses a single controller chip, unlike the 9300-16i)
- NIC: Mellanox dual-port 40/56GbE MCX354A-FCBT PCIe3.0 x8
- NIC: optional Mellanox dual-port 10/25GbE MCX4121A-ACAT PCIe3.0 x8
- NIC: optional dual-port 10GbE Intel X550-T2 PCIe3.0 x4
- CPU Cooler: a Noctua LGA3647 cooler like the NH-U12S DX-3647 should be more than enough
- Chassis: Fractal Design Define 7(or XL)
- PSU: 850W Platinum (model TBD, probably some Seasonic model)
- UPS: SANTAK UPS TG-BOX850 850VA/510W, 200W~10min, 500W~5min
- HDD: 7 x 16TB or 12 x 8TB
- boot drive: mirrored
  - mboard SATADOM: Innodisk 128GB SATADOM-ML 3IE2-P, 8-pin, compatible with Supermicro SuperDOM
  - mboard PCH SATA 0: Micron BX500 240GB
- L2ARC: [TBD] 1 x U.2 of 4/8TB, likely bought used
- sVDEVs: 2 x 2TB U.2 of different brands, mirrored, connected to different PCIe adapters (see the zpool sketch after this list)
  - [TBD] 2TB U.2
  - [TBD] 2TB U.2
- slog: 1 or 2 x 120GB Optane [model TBD], M.2, U.2, or AIC PCIe mount
  - one M.2 can be installed in the board's PCH M.2 slot
  - one U.2 can be installed via an adapter in the PCH PCIe3.0 x4 slot
  - an AIC can be installed in the PCH PCIe3.0 x4 slot
  - ? is the added latency of the PCH M.2/PCIe slot critical for the SLOG?
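A sketch of how these support vdevs would be attached to the pool (device names are placeholders); note that the sVDEV mirror must be treated as carefully as the data vdevs, because losing it loses the pool:

```
# sketch: attach the support vdevs; device names are placeholders
zpool add tank special mirror /dev/nvme0n1 /dev/nvme1n1   # mirrored sVDEV on different adapters
zpool add tank log /dev/nvme2n1                           # SLOG ("log mirror ..." if using two Optanes)
zpool add tank cache /dev/nvme3n1                         # L2ARC, no redundancy needed
```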
Notice that the CPU only provides 6 memory channels at 2933MHz, but as per the X11SPL-i board manual, installing 8 memory sticks will still run them at 2666MHz.
De-duplication of 80TB of storage requires at least 5 x 80 = 400GB of memory, so 512GB is indeed tight, but a single 128GB stick costs 3-4 times as much as a 64GB stick. Therefore, if dedup performance drops too much because of the limited memory, I'll have to turn the feature off. However, since ZFS dedup is online (is it eager?), it may reduce random I/O in some scenarios: the L2ARC can cache more files at a higher density, and some concurrent or random writes are absorbed, for instance when multiple VMs from the hypervisor operate on the same datasets, or when backups sharing high similarity are written. If this turns out to be effective, then upgrading to 128GB sticks is a cost worth paying; 1TB is the maximum memory this CPU supports.
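Before committing that much RAM, the dedup ratio can be estimated with standard OpenZFS tooling; a sketch, with the pool and dataset names as placeholders:

```
# sketch: estimate how well the data would dedup before enabling it
zdb -S tank                        # simulates dedup and prints a DDT histogram plus the expected ratio
zpool status -D tank               # once dedup is on, shows the actual DDT size
zfs set dedup=on tank/vm-backups   # enable per dataset rather than pool-wide
```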
Metadata consumes at least about 0.3% of total pool capacity; for a 96TB pool that is roughly 288GB, and to be safe we could budget 500GB. The default sVDEV metadata-to-small-file split is 25/75, so for a 2TB sVDEV, 25% is 500GB reserved for metadata, leaving about 1.5TB for small files.
A 4TB L2ARC is a 1:8 ratio against the 512GB of main memory, with an ARC footprint of about 88GB for L2ARC headers if all blocks in the L2ARC are 4KiB.
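The 88GB figure comes from the roughly 88-byte ARC header kept for every L2ARC block, in the worst case of all-4KiB blocks:
header overhead ≈ (L2ARC size / block size) x 88 B = (4096 GB / 4 KiB) x 88 B = 1e9 x 88 B = 88 GB.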
| DIMM [GB] | # DIMMs | RAM [GB] | 1:8 L2ARC [TB] | ARC headers [GB] | 1:16 L2ARC [TB] | ARC headers [GB] |
|---|---|---|---|---|---|---|
| 64 | 8 | 512 | 4.096 | 88.0 | 8.192 | 176.0 |
The motherboard layout:
Slot 7 is the closest to the CPU; slots 1-4 should sit underneath the fans mounted in the chassis' vertical GPU slots.
The two mirrored sVDEV U.2 drives should be installed via different PCIe slots, in case one of the adapters fails.
x11spl-f:

| connector | source / lanes | assignment |
|---|---|---|
| 2 x 1GbE RJ45 | PCH | |
| 2 x SATADOM | PCH | SATADOM#0: boot |
| 8 x SATA | PCH | SATA#0: boot 2.5" SSD, the rest unused |
| M.2 | PCH PCIe 3.0 x4 | TBD |
| slot7 | CPU PCIe 3.0 x8 | TBD |
| slot6 | CPU PCIe 3.0 x8 (in x16) | 2 x PCIe-to-U.2 cables |
| slot5 | CPU PCIe 3.0 x8 | 2 x PCIe-to-U.2 cables |
| slot4 | CPU PCIe 3.0 x8 (in x16) | dual-port 56G NIC |
| slot3 | CPU PCIe 3.0 x8 | 2 x directly attached NVMe, M.2 or U.2 |
| slot2 | CPU PCIe 3.0 x8 | HBA |
| slot1 | PCH PCIe 3.0 x4 (in x8) | dual-port 10G NIC |
Appendix:
Just for a rough estimation:
RAIDZ memory and L2ARC sizing table.
For every data block in the L2ARC, the primary ARC needs an ~88-byte header entry.
We assume the L2ARC is filled with 4KiB blocks, which is the extreme case.
The commonly recommended RAM:L2ARC ratio is 1:4/5/8/10, and very large (TB-scale) memory can go to 1:16/20.
We only list 1:4, 1:8, and 1:16 for quick reference.
| DIMM [GB] | # DIMMs | RAM [GB] | 1:4 L2ARC [TB] | ARC headers [GB] | 1:8 L2ARC [TB] | ARC headers [GB] | 1:16 L2ARC [TB] | ARC headers [GB] |
|---|---|---|---|---|---|---|---|---|
| 16 | 2 | 32 | 0.128 | 2.75 | 0.256 | 5.5 | 0.512 | 11.0 |
| 32 | 2 | 64 | 0.256 | 5.5 | 0.512 | 11.0 | 1.024 | 22.0 |
| 64 | 2 | 128 | 0.512 | 11.0 | 1.024 | 22.0 | 2.048 | 44.0 |
| 128 | 2 | 256 | 1.024 | 22.0 | 2.048 | 44.0 | 4.096 | 88.0 |
| 16 | 4 | 64 | 0.256 | 5.5 | 0.512 | 11.0 | 1.024 | 22.0 |
| 32 | 4 | 128 | 0.512 | 11.0 | 1.024 | 22.0 | 2.048 | 44.0 |
| 64 | 4 | 256 | 1.024 | 22.0 | 2.048 | 44.0 | 4.096 | 88.0 |
| 128 | 4 | 512 | 2.048 | 44.0 | 4.096 | 88.0 | 8.192 | 176.0 |
| 16 | 6 | 96 | 0.384 | 8.25 | 0.768 | 16.5 | 1.536 | 33.0 |
| 32 | 6 | 192 | 0.768 | 16.5 | 1.536 | 33.0 | 3.072 | 66.0 |
| 64 | 6 | 384 | 1.536 | 33.0 | 3.072 | 66.0 | 6.144 | 132.0 |
| 128 | 6 | 768 | 3.072 | 66.0 | 6.144 | 132.0 | 12.288 | 264.0 |
| 16 | 8 | 128 | 0.512 | 11.0 | 1.024 | 22.0 | 2.048 | 44.0 |
| 32 | 8 | 256 | 1.024 | 22.0 | 2.048 | 44.0 | 4.096 | 88.0 |
| 64 | 8 | 512 | 2.048 | 44.0 | 4.096 | 88.0 | 8.192 | 176.0 |
| 128 | 8 | 1024 | 4.096 | 88.0 | 8.192 | 176.0 | 16.384 | 352.0 |
| 16 | 12 | 192 | 0.768 | 16.5 | 1.536 | 33.0 | 3.072 | 66.0 |
| 32 | 12 | 384 | 1.536 | 33.0 | 3.072 | 66.0 | 6.144 | 132.0 |
| 64 | 12 | 768 | 3.072 | 66.0 | 6.144 | 132.0 | 12.288 | 264.0 |
| 128 | 12 | 1536 | 6.144 | 132.0 | 12.288 | 264.0 | 24.576 | 528.0 |
| 16 | 16 | 256 | 1.024 | 22.0 | 2.048 | 44.0 | 4.096 | 88.0 |
| 32 | 16 | 512 | 2.048 | 44.0 | 4.096 | 88.0 | 8.192 | 176.0 |
| 64 | 16 | 1024 | 4.096 | 88.0 | 8.192 | 176.0 | 16.384 | 352.0 |
| 128 | 16 | 2048 | 8.192 | 176.0 | 16.384 | 352.0 | 32.768 | 704.0 |