Metadata VDEV impact is "not noticeable" / it "does not perform as expected"

Indeed, it is not meant as a revolutionary change. The main appeal of this board is that a German reseller is currently clearing out stock for cheap[1], so it is a good opportunity to get a genuine server board, with IPMI. It does have a TPM header. The single M.2 slot is strictly intended for boot, which is why one needs a riser in the x16 slot for more, but such risers are cheap, and all “data” M.2 drives would then be on CPU lanes rather than chipset lanes.

Irrespective of the board, I’d also suggest consolidating to a single 3- or 4-wide raidz1 pool for all bulk data, with various datasets for different uses. (And thus replacing the case… Two hard drives is not enough for a NAS.)
No special vdev. But you may consider a 512 GB persistent L2ARC for metadata if the workload is mostly reads.
If you have to boot from USB, put a cheap SSD on a USB adapter.
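
If you go the metadata-only persistent L2ARC route, a minimal sketch of the commands involved (assuming a pool named tank, a dataset tank/bulk and a spare SSD at /dev/nvme1n1, all placeholders) would be:

# Attach the SSD as an L2ARC cache device
zpool add tank cache nvme1n1

# Restrict the L2ARC to metadata only for the dataset(s) of interest
zfs set secondarycache=metadata tank/bulk

# L2ARC persistence across reboots is governed by a module parameter,
# enabled by default on recent OpenZFS; check it with:
cat /sys/module/zfs/parameters/l2arc_rebuild_enabled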


  1. they have advertised lower prices than the current €79.99 in the recent past… and accepted even lower offers (€50 or below) ↩︎


FWIW, I’ve had very good experiences with both an sVDEV and a metadata-only, persistent L2ARC. The former requires planning, etc.; the latter gives somewhat slower metadata performance than an sVDEV but is redundant. A metadata-only, persistent L2ARC is only advisable when RAM exceeds 64 GB, however, and there are some limits re: the size of SSD to dedicate to L2ARC vs. available RAM (L2ARC pointers cut into ARC RAM).
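
To see how much ARC RAM those L2ARC headers actually consume on a given box, something along these lines should work (a rough sketch; label wording varies a bit between OpenZFS versions):

# L2ARC section of the ARC report; the header size line is RAM spent on L2ARC pointers
arc_summary | grep -A 20 "L2ARC"

# Or read the raw counter straight from the kernel stats (Linux/SCALE)
awk '/^l2_hdr_size/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats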

Storing metadata on SSDs has a significant performance benefit for tasks that involve a lot of directory traversal, like rsync backups. I doubt you’d see a significant benefit unless the pool is at least 25% full, though: a very small amount of metadata will simply get read into the ARC, and from then on the ARC does all the work, not the SSD.

Even fuller pools may feature very little metadata if you consolidate large collections of files with tools like Apple sparsebundles. My pool’s metadata clocks in somewhere around 0.03% of pool capacity, 1/10th the rule of thumb of 0.3%. But that’s because most of the files are relatively ‘large’ images and videos. Your use case and file types may be quite different.
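
If you want to check that number for your own pool, zdb can walk the pool and break down space usage by block type; a hedged sketch (pool name is a placeholder, the walk is read-only but can take a long time on a large pool, and on TrueNAS you may need to point zdb at the system cachefile with -U):

# Traverse all blocks and print per-type statistics; the summary table
# near the end shows how much space the various metadata types occupy.
zdb -bbbs tank

# On TrueNAS, if zdb cannot find the pool, try:
zdb -U /data/zfs/zpool.cache -bbbs tank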

Lastly, the other reason to go sVDEV (which sets aside 75% of drive capacity for small files, 25% for metadata by default), is that the HDDs are rid of all files that are below the small file cutoff that you can set on a per-dataset basis. That in turn allows you to fine-tune what files / datasets get the full SSD benefit (i.e. databases, VMs, etc.) whereas for others only some small files will (archives). See the sVDEV resource for more info.
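
For reference, the per-dataset cutoff mentioned above is the special_small_blocks property; a minimal sketch with placeholder dataset names:

# Send all blocks up to 64K from this dataset to the special vdev;
# setting it to the recordsize or larger sends every data block to the SSDs.
zfs set special_small_blocks=64K tank/databases

# Other datasets can stay at the default of 0, i.e. only metadata on the sVDEV
zfs get special_small_blocks tank/archives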


This post is more “For Science” than it is a recommendation of any kind.
[image: script timing results from the two systems described below]

Benchmarking a special vdev isn’t exactly straightforward. It’s not like adding additional vdevs, where you can run standard benchmarking suites and understand the performance implications. We’re specifically modifying how metadata is stored in the pool, and if you want to see the impact without introducing extra variables, you should really be testing LOCALLY on the NAS first, before moving to SMB or another sharing protocol.

A really quick and dirty approach is to create a bunch of empty files, timing how long that takes, and then go back through and list the contents of each of those directories.

For giggles and laughs while reading this thread I made a script that does that. You’d want to go into the shell on your NAS and make a folder inside a dataset on your pool. From there use nano to make a file called test.sh, paste in the contents of the script below (Ctrl+O to save, then Ctrl+X to exit), and then type chmod +x test.sh. You can then run the script with ./test.sh

#!/bin/bash

# Number of directories and files
NUM_DIRS=5
NUM_FILES=50000 
MILESTONE=10000  # Milestone to report progress

# Base directory to create all folders in
BASE_DIR="./test_folders"

# Create a base directory
mkdir -p "$BASE_DIR"

# Function to get current time in milliseconds
current_time_ms() {
    date +%s%3N
}

# Measure time to create folders and files
START_CREATE_TIME_MS=$(current_time_ms)

# Initialize milestone times
PREVIOUS_MILESTONE_TIME_MS=$START_CREATE_TIME_MS

for ((i=1; i<=NUM_DIRS; i++)); do
    # Format the directory name with leading zeros
    DIR_NAME=$(printf "%05d" "$i")
    mkdir -p "$BASE_DIR/$DIR_NAME"
    
    for ((j=1; j<=NUM_FILES; j++)); do
        # Format the file name with leading zeros
        FILE_NAME=$(printf "%05d.txt" "$j")
        touch "$BASE_DIR/$DIR_NAME/$FILE_NAME"
        
        # Print elapsed time every 10000 files
        if (( j % MILESTONE == 0 )); then
            CURRENT_TIME_MS=$(current_time_ms)
            CURRENT_MILESTONE_TIME_MS=$((CURRENT_TIME_MS - PREVIOUS_MILESTONE_TIME_MS))
            
            echo "Time for creating $j files in directory $DIR_NAME: $((CURRENT_MILESTONE_TIME_MS / 1000)).$((CURRENT_MILESTONE_TIME_MS % 1000)) seconds."
            
            # Update the previous milestone time
            PREVIOUS_MILESTONE_TIME_MS=$CURRENT_TIME_MS
        fi
    done
done

# Silently track the total elapsed time
END_CREATE_TIME_MS=$(current_time_ms)
CREATE_ELAPSED_TIME_MS=$((END_CREATE_TIME_MS - START_CREATE_TIME_MS))

# Measure time to recursively perform 'ls' in the base directory
echo "Starting recursive 'ls' command timings..."
START_LS_TIME_MS=$(current_time_ms)

# Recursively list all files and subdirectories
(cd "$BASE_DIR" && ls -R > /dev/null)

END_LS_TIME_MS=$(current_time_ms)
LS_ELAPSED_TIME_MS=$((END_LS_TIME_MS - START_LS_TIME_MS))
echo "Recursive 'ls' completed in $((LS_ELAPSED_TIME_MS / 1000)).$((LS_ELAPSED_TIME_MS % 1000)) seconds."

# Measure time to remove and clean up all test files
echo "Starting cleanup..."
START_CLEANUP_TIME_MS=$(current_time_ms)

# Remove all test files and directories
rm -rf "$BASE_DIR"

END_CLEANUP_TIME_MS=$(current_time_ms)
CLEANUP_ELAPSED_TIME_MS=$((END_CLEANUP_TIME_MS - START_CLEANUP_TIME_MS))
echo "Cleanup completed in $((CLEANUP_ELAPSED_TIME_MS / 1000)).$((CLEANUP_ELAPSED_TIME_MS % 1000)) seconds."

# Calculate and print total elapsed time
TOTAL_ELAPSED_TIME_MS=$((CREATE_ELAPSED_TIME_MS + LS_ELAPSED_TIME_MS + CLEANUP_ELAPSED_TIME_MS))
TOTAL_ELAPSED_MINUTES=$((TOTAL_ELAPSED_TIME_MS / 60000))
TOTAL_ELAPSED_SECONDS=$(((TOTAL_ELAPSED_TIME_MS % 60000) / 1000))
TOTAL_ELAPSED_MILLISECONDS=$((TOTAL_ELAPSED_TIME_MS % 1000))

echo "Total elapsed time: $TOTAL_ELAPSED_MINUTES minutes, $TOTAL_ELAPSED_SECONDS seconds, and $TOTAL_ELAPSED_MILLISECONDS milliseconds."

This isn’t testing a real-world scenario, but it should be a way to tease out improvement directly associated with a metadata vdev, I think.

System on the left is an AMD EPYC 7F52 with 5x 7-wide RAIDZ2 vdevs of 8 TiB drives, and the system on the right is an AMD Ryzen 3700X with 2x 2-way mirrors of Optane 905P 960 GB. Neither has a special vdev. Both are on SCALE 24.04.2.

Posting just as a reference point. CPU (single-threaded) performance and other system factors such as RAM speed may play a role here, so direct comparison between different systems might be of limited value. Really the idea is to compare the pool with and without a special vdev for metadata, ON THE SAME SYSTEM, to see if this test can tease out some intelligence.

I’d be interested in seeing the results for a clean fresh pool both without a special vdev and with a metadata special vdev @louis
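
For that comparison, a hedged sketch of how the two throwaway test pools could be built (device names are placeholders, and -f will happily wipe disks, so triple-check them):

# Baseline: plain raidz1 test pool, no special vdev
zpool create -f testpool raidz1 sda sdb sdc sdd
# ... run ./test.sh in a dataset on testpool, note the numbers, then ...
zpool destroy testpool

# Variant: same data vdev plus a mirrored special (metadata) vdev on SSDs
zpool create -f testpool raidz1 sda sdb sdc sdd special mirror nvme0n1 nvme1n1
# ... rerun ./test.sh and compare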

If there’s interest in this topic I can probably come up with better stuff to do; like @Constantin mentioned, rsync is a good one to test.
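
A metadata-heavy rsync test could be as simple as timing a dry run over an existing dataset, which walks the whole tree and stats every file without copying anything; a sketch with placeholder paths:

# Dry run: traverses the source, stats every file, copies nothing
mkdir -p /tmp/rsync-null
time rsync -a --dry-run --stats /mnt/tank/bulk/ /tmp/rsync-null/
# Run it once cold and once warm to see how much the ARC (or sVDEV) helps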


Benchmarking - meaning scientific and repeatable measurements of performance - is actually much more difficult even than this.

For benchmarks to be repeatable they have to have the exact same starting conditions, and with a dynamic ARC that can only be achieved by:

  • flushing the ARC completely immediately before the test starts (see the sketch after this list); or
  • rebooting and running the script immediately after boot completes and load has stabilised; or
  • prerunning the test to populate the ARC with consistent stuff before running again to take the measurements. (Note: If you want to measure ARC effectiveness this is good, but if as in this case you want to measure device performance, this is bad.)
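
For the first option, exporting and re-importing the pool is one way to start from a cold cache for that pool without a full reboot; a hedged sketch (pool name is a placeholder, and anything using the pool has to be stopped first):

# Export drops the pool's cached data from the ARC; re-import starts cold
zpool export testpool
zpool import testpool

# Sanity-check the ARC size before and after
awk '/^size/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats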

I am thinking about how I could improve my system with reasonable effort and cost, and I do not yet know. The main problem always was, and still is, the lack of PCIe lanes.

The only way I could free up PCIe lanes is to give up the graphics card. However, my idea was to also use the machine as a virtual Linux / FreeBSD machine with a GUI (running in a VM). That this is not trivial is clear; e.g. NVIDIA is doing everything they can to restrict that kind of usage to expensive professional cards.

And of course, spending 16 PCIe lanes on the graphics card almost always means wasting costly PCIe lanes. But it is reality.

If I build the next-generation NAS sometime in the future, I will probably choose a motherboard with significantly more PCIe lanes and three x16 slots (one for an extra GPU, one for the 10G NIC and one for extra NVMe), and again a CPU with a built-in GPU. That will however be costly and imply a big case.

For now, however, that is two bridges too far …

Something like this https://www.asrockrack.com/general/productdetail.asp?Model=ROMED6U-2L2T#Specifications.

Too Expensive … :upside_down_face:

No … If I had to start in the near future, I would probably look into the cheapest top-end AM5 board (more PCIe) with the cheapest processor. But even that will be too expensive for a home NAS :frowning:

The PCIe lane count you want is found only on Xeon, EPYC, or Threadripper boards. Plus, using two or three GPUs is not going to be cheap.

In general I am happy with my current NAS, which is IMHO already a top-end home NAS.

If I can improve it with small changes, I will perhaps do that. However, within that limitation, I do understand the comments given.

And if I had needed an enterprise-level NAS, of course I would have used another MB, another case, and RAID arrays.
And perhaps I would have installed a multi-thousand-euro video board :slight_smile:

But for now I will live with my current NAS, perhaps with some upgrades or a changed setup. And my intention is surely not to install n video boards; I would be happy if I could share one cheap one across VMs, that is all.

Note that current AM5 motherboards, and certainly those combined with the top-end chipsets, do have more PCIe lanes than the B550 board I am using now. In fact, for a new build, an AM4 X570 motherboard with a 5700G CPU would IMHO not be a bad choice.

If you want enterprise features for consumer prices, you need to go a few CPU generations back. But consumer boards and CPUs don’t have the features you want, and probably won’t in the near future either.

You cannot share a GPU across VMs: What’s passed through to one VM is no longer available to others.

If what you want is one graphical VM, an MC12-LE0 with x8x4x4 riser can do it: x8 to half-height GPU (does not mind if the card is x16, it will work well enough with 8 lanes), 2 M.2 for apps/VMs, x4 electrical slot for 10 GbE NIC, onboard M.2 to boot, 6 SATA for bulk storage.
Or drop the GPU altogether and pass through the iGPU.
Same principle with an AM5 build.

Sure, but you won’t find any with the three 16x slots you want.

Well… it IS possible, but it will require slicing it up into vGPUs & that would require enabling dev mode & then either some work following a guide or a lot of work following a guide & actually getting it stable. Either way it is in ‘here there be dragons’ territory.

Not something I’d be excited to implement on a NAS, but clutch on a hypervisor that isn’t my main storage vault.


Also, IMHO that is not realistic :cold_face: It would only be realistic if companies like NVIDIA, AMD and Intel made that option available by default, which they don’t, for hardware and … commercial reasons.


Surely an interesting riser card!! From the functionality point of view it is exactly what I am looking for.

However:
It might work, but from a mechanical and electrical point of view … it probably would not make things more stable. Not in any way. Placing a small graphics card on top is probably possible, but whether it is stable …
I am also not sure whether the AM4 B550 BIOS supports this.

From the functional point of view, it is what I am looking for! Nevertheless, I am not convinced.

The 8x4x4 bifurcation cards are supposed to support using a low-profile PCIe card in a full-height slot, with the regular screw-down.

My 1030 is half-height, so in principle it should be possible. But it feels very, very hazy / tricky.

It is of course also a no-name product; I could destroy the whole system using such a card. At least it feels that way.

I read through this entire post.

To all of you who are trying to help Louis: well done. I think style points should be awarded for patience and kindness. Overall a very difficult task, taken on at no cost or revenue.

Bravo

Louis, listen closely to your teachers/gurus here that are assisting you with your issues.

My opinion is you may have started out in the wrong direction and are trying to recover by continuing in the same direction. I say this because it is my normal way of doing things. One day a friend told me to turn around and go back. Even though it is theoretically possible to walk around the world, it is easier to turn around and walk the 50 meters you need to get where you need to be.

I am hugely impressed with everyone involved with this discussion. On most forums there would have been rudeness, meanness and much negativity.


A lack of PCIe lanes is only a problem if you are using your PCIe lanes heavily and having to share them across multiple devices.

With 92 GB of RAM mostly for ARC you should be able to cache almost all metadata plus prefetch read-ahead, and so easily hit a 99.9% cache hit rate. With the remaining I/O going to 2 NVMe + 4x SATA, you probably won’t get anywhere close to maxing out the PCIe lanes you have.
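
If you want to check that on the running system, the OpenZFS tools report the hit rate directly; a quick sketch:

# Rolling view, one line per second for ten samples, including hit percentage
arcstat 1 10

# Or compute the overall ratio from the cumulative counters
awk '/^hits/ {h=$3} /^misses/ {m=$3} END {printf "ARC hit rate: %.2f%%\n", 100*h/(h+m)}' /proc/spl/kstat/zfs/arcstats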
