Sanity check, please! -- Best practices for cold-ish media server backup storage?

Gents, please lend me your ears – I need a sanity check / some advice on best practices for backing up a media library.

I have a media library of about 130 TB as it stands… I’ve been thoroughly cavalier for going on 10 years now, no backups, relying only on a pool of 3x 8-drive RAID-Z2 VDEVs to keep my data intact. So far, so good, but a recent bad ECC DIMM scare has me considering my options…

I’ve briefly considered LTO-7/8 as a solution, but the drives (and cartridges) just render that prohibitively expensive for incompressible media storage.

Having recently acquired 2x cheap Dell PowerVault MD1200 shelves, each populated with 12x 4TB SAS drives, I’ve started considering a few more of these as a backup solution…

At $350 a pop, for a 30TB RAID-Z3 pool, I’m failing to think of anything that could compete at <$12 per TB, while offering an element of redundancy as well. (Context - local pricing on 20TB HDD’s - ~$550)

So I’m thinking along the lines of doing the following:

  • Purchase 5x MD1200 units, each with 12x 4TB drives - $1750 or so.
    (Obviously Badblocks and SMART test the drives thoroughly, before committing any data in the first place)

  • Create 12-wide self-contained RAID-Z3 pool on each shelf (or perhaps 11-wide with 1x hot spare if need-be?)

  • Perform a backup, copying folders/files manually, scrub, export/disconnect, switch off the shelf and mothball it for ~6(?) months.

  • Switch on each individual shelf once every (6?) months, perform a scrub, copy across the latest additional media (if need be), scrub again. Export, unplug, mothball… Rinse and repeat, ad infinitum… (rough command sketch just after this list)
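Roughly the commands I have in mind per shelf, purely as a sketch - the pool name "coldtank", device names and paths are all placeholders, and on TrueNAS I’d obviously build the pool through the UI rather than with raw zpool commands:

    # Burn-in each disk first (destructive write test, then a long SMART self-test)
    badblocks -b 4096 -wsv /dev/sdX
    smartctl -t long /dev/sdX

    # One self-contained 12-wide RAID-Z3 pool per shelf (use /dev/disk/by-id names in practice)
    zpool create coldtank raidz3 sda sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl

    # Copy, verify, then mothball the shelf (wait for the scrub to finish before exporting)
    rsync -a /mnt/tank/media/ /mnt/coldtank/media/
    zpool scrub coldtank
    zpool export coldtank

    # ~6 months later: import, scrub, top up, scrub again, export
    zpool import coldtank
    zpool scrub coldtank
    rsync -a /mnt/tank/media/ /mnt/coldtank/media/
    zpool scrub coldtank
    zpool export coldtank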

If anyone has a more cost-effective solution, or I’m somehow barking up the wrong tree in my methodology - please help me out?

Oh, and bonus question… guaranteed to ruffle feathers:

I’m retiring 8x Seagate 8TB Archive SMR drives… given that they’re Host-Aware, am I completely off the mark to consider using them for a Z2 cold storage solution as well?

Thanks in advance!

If you ever have an incident with the cold storage pool, you’ll be in SMR Hell…
Frankly, I don’t understand the point of 8 TB HDDs nowadays. The $/TB sweet spot should be around 18-20 TB; anything smaller than 16 TB is just too small to consider.

At this price I assume the drives aren’t new, so consider them burnt-in and just check SMART.
So many small, and probably 10k rpm, drives will use a lot of power compared with a smaller number of high-capacity 3.5" drives, but if this is cold storage it may not matter much.

Hot spares are useful if the system is live and can react to failures. In cold storage, a “hot spare” would be… cold—and useless.

1 Like

As an owner of a Seagate 8TB SMR Archive drive, I think they are not Host Aware, but just Drive Managed. Could be wrong. However, even Host Aware are only useful IF the host driver software can work with the SMR drive to select areas to write or clear. Not the case with OpenZFS.
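If anyone wants to check what a given drive actually reports under Linux, something like this should tell (drive-managed disks just show "none" because they don’t expose zones to the host; sg_rep_zones is from sg3_utils and only returns anything useful if the drive speaks ZBC/ZAC):

    lsblk -o NAME,MODEL,ZONED    # "host-aware", "host-managed" or "none"
    sg_rep_zones /dev/sdX        # lists the drive's zones, if it exposes any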

In regards to using 8 of them in a RAID-Z2 pool, I think this fails the sanity check :slight_smile:

Part of the issue is that internally they fragment the data. With ZFS doing COW, Copy On Write, this leads to both internal and external fragmentation causing them to be quite slow after long use.

Some people in the past foolishly said “My pool of Seagate SMR Archive disks is just fine, been using it for months without serious problems…” Well, that is NOT the problem we are warning against. It’s after they have been in heavy use, or long use, that the fragmentation rears its ugly head and tanks performance. Even to the point of a SMR disk becoming OFFLINE due to timeouts. Causing DEGRADED pools.

2 Likes

Thanks for the feedback thus far folks.

etorix:
If you ever have an incident with the cold storage pool, you’ll be in SMR Hell…
Frankly, I don’t understand the point of 8 TB HDDs nowadays. The $/TB sweet spot should be around 18-20 TB; anything smaller than 16 TB is just too small to consider.

I’d probably resilver with a CMR drive at that point if it comes to it.
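(Meaning nothing fancier than the usual replace-and-resilver route - pool and device names below are just placeholders:

    zpool replace coldtank <failed-SMR-disk> <new-CMR-disk>
    zpool status coldtank    # keep an eye on the resilver

…assuming the resilver onto CMR completes in a sane time, of course.)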

As for the decision on 8TB’s, they can be had cheap these days… I got lucky with a helluva deal on 8x 18TB’s from a buddy who shelved his own NAS plans, and I’ve just finished upgrading one of my 8x 8TB VDEVs with them…

You’re right - sweet spot SHOULD be 18-20TB these days, but like I said, large capacity drives are eye-wateringly expensive in my neck of the woods, with hardly any deep discount sales. I can’t justify laying out nearly $5k on another 8x new 18-20TB drives… So I’ll roll with cheap and cheerful 8TB’s for now, until a decent deal comes along on another 8x 18-20TB’s.

etorix:
At this price I assume the drives aren’t new, so consider them burnt-in and just check SMART.
So many small, and probably 10k rpm, drives will use a lot of power compared with a smaller number of high-capacity 3.5" drives, but if this is cold storage it may not matter much.

Nope - ex-data-centre MD1200 shelves - BUT they were conventional RAID… They had a few SMART tests at <20 hours, and spent the rest of their lives relying on the RAID controller’s oversight. I did SMART tests when I fired up the shelves – all seemed well, and then one of the drives started acting up as soon as I chucked a few hundred GB of data onto a test pool to gauge some performance metrics… So now I’m running at least 1-2 passes of Badblocks and another long SMART test on them before deployment - just to be sure.

7200 RPM Seagate ST4000NM0063 drives - and if I’m effectively only going to run them for 3-4 days a year as a cold storage backup solution, their power consumption really makes little to no difference in the grand scheme of things - The comparative cost of larger drives will outweigh the cheap 4TB drives’ power consumption many times over.

etorix:
Hot spares are useful if the system is live and can react to failures. In cold storage, a “hot spare” would be… cold—and useless.

LOL - you say that… I was tempted to refer to it as a luke-warm spare in my post. :rofl:

My reasoning for a spare would be to enable an immediate replacement in situ, without having to pull the offending disk (or connect the replacement disk to the server first) - but now that I think about it, you’re right. I’m being daft on that front.

Arwen:
As an owner of a Seagate 8TB SMR Archive drive, I think they are not Host Aware, but just Drive Managed. Could be wrong. However, even Host Aware are only useful IF the host driver software can work with the SMR drive to select areas to write or clear. Not the case with OpenZFS.

I’d understood the 8TB Archives to be some of the earliest HA SMR examples. But yeah, ultimately moot if OpenZFS doesn’t know what to do with it.

Arwen:
In regards to using 8 of them in a RAID-Z2 pool, I think this fails the sanity check :slight_smile:
Part of the issue is that internally they fragment the data. With ZFS doing COW, Copy On Write, this leads to both internal and external fragmentation causing them to be quite slow after long use.
Some people in the past foolishly said “My pool of Seagate SMR Archive disks is just fine, been using it for months without serious problems…” Well, that is NOT the problem we are warning against. It’s after they have been in heavy use, or long use, that the fragmentation rears its ugly head and tanks performance. Even to the point of a SMR disk becoming OFFLINE due to timeouts. Causing DEGRADED pools.

Duly noted … :thinking:
Ay caramba, the considerations on SMR’s seem to be all over the place. After much searching, I’d basically understood them to behave just fine in a media hoarding environment, where all the data is written sequentially in large chunks, and hardly ever deleted / edited…

And I, anecdotally, kind of AM one of those people you refer to - the second of the 3x VDEVs on my server consisted of 8x 8TB Archives in Z2… These are the same ones I’m now replacing, after nigh-on 9 years of 24/7 availability, numerous 3G/s capacity-upgrade drive replacements, and 1.5G/s+ scrubs. I’ve had one or two of them go on the fritz, and replaced them with no hassles.

I’d spent a good few years inactive on the Free/TrueNAS forums, and only now did I see that my Archives are in fact SMR, and that SMR is considered the boogeyman-devil-babayaga-vengeful-John-Wick of mangling ZFS data - so I’m replacing them with CMR’s now…

It’d feel a bit wasteful to just throw them out / use them as paperweights though, hence thinking of employing them for cold storage backups…
The way I figured it - cold backup storage in Z1 / Z2 (perhaps even as a 2nd-tier backup?), and if a drive conks out, replace it with a CMR - if that goes painfully slowly, nuke the resilver and take my chances pulling all the data onto a spare disk shelf…

That said, I’m not arguing for their use - just spitballing and providing my own anecdotal considerations… if conventional wisdom dictates to keep them away from ZFS, then we keep them away from any and all forms of ZFS… so be it. If they’re not to be trusted in any way, shape or form, then so be that as well… :neutral_face:

1 Like

Which is correct, when using basically any file system and/or volume manager other than ZFS. SMR and CoW do not go well together. (Someone will probably quote this sentence, strike “well” and add “FTFY”.)
ServeTheHome tested and found that SMR drives are bad when rebuilding classical RAID5/6 arrays, but not nearly as bad as when resilvering a raidz#.

1 Like

SMR in a WORM, Write Once Read Many, use can be okay with ZFS if certain things are done:

  • Disable atime & relatime at top level dataset and let that be inherited to all child datasets
  • I think quotas and reservations require some Metadata updates, so leaving them as none may reduce extra Metadata writes.
  • Might want to avoid using:
    filesystem_limit
    snapshot_limit
    filesystem_count
    snapshot_count
  • Don’t go wild with snapshots
  • Limit changes to NFS & SMB share parameters, as they are stored in the pool

Basically anything that writes to the SMR disk pool will increase the amount of fragmentation. People think that data is the only thing written to disks, but changes to directory entries, free space lists, snapshots, pool history and many internal ZFS structures require writes too (at least 2 copies each, even with RAID-Zx or mirroring; see redundant_metadata).
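As a rough sketch of the property side of that list - the dataset name is just an example, set at the top level so children inherit:

    zfs set atime=off coldtank            # no access-time updates on reads
    zfs set relatime=off coldtank
    zfs set quota=none coldtank           # the defaults; avoids quota/reservation accounting writes
    zfs set reservation=none coldtank
    zfs get redundant_metadata coldtank   # check it, but think twice before changing from "all"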

Someone once asked how to take an internally fragmented SMR disk and clear it out. Not sure. But if you can figure it out, that would be a good starting point when re-using SMR disks.

1 Like

Hi again folks. Not quite a necro-bump… so let’s call it a deathbed thump…

At least I didn’t start a new thread :thinking:

Update:

Having slogged through 4x MD1200 units, and swapping out a few sketchy drives with the supplier, I’m now armed with 48x 4TB’s that have passed badblocks with no hassles.

I also happened across a Newisys NDS-4600 60-drive JBOD at the princely sum of ~$300, and snatched it up.

Oh, and picked up a cheap batch of 11x 14TB HGST SAS drives. (unfortunately all I could get - would have loved to buy more)

So, recap and new info…

I data hoard, mostly media… I rename and sort my stuff once… and there it stays. Hardly delete anything, aside from the occasional 4K release update to older 1080p content.

Plex server as it stands:
Main Pool of 3x 8-wide Z2 VDEVs, 128k record size
8x 8TB’s (now all CMR WD Red Pro / Purple)
8x 10TB’s (WD Gold / Red Pro mix)
8x 18/20TB (Skyhawk AI / Exos - 20TB’s supplied under warranty for faulty 18’s)
1x cold 20TB spare, just for in case.

My game plan from here:

  1. Create secondary pool on Plex server - 11-wide Z3 with the 14TB’s - for hoarding other incidental data, desktop backups, and other media, currently held on 2x Microservers (which will then be retired)

  2. Fit 2x 128GB SATA SSD’s for mirrored boot volume (part of TrueNAS Scale migration)

  3. Fit 2x 500GB SATA SSD’s for mirrored Plex Metadata volume (part of TrueNAS Scale migration)

  4. 60-drive JBOD configured for backup… cold storage, will realistically only really be fired up once every few months, for pulling across the latest data, and a scrub:

  • Primary backup pool: 3x 15-wide Z3 with the 4TB’s (or 4x 11-wide Z3 if need be), 1MB record size… Incidentally, this provides handy capacity for performing a move-and-move-back re-balance + shift to a 1MB record size on my main server pool (rough sketch below this list).
    Either layout also allows for 3-4 spare drives to be kept on hand. 4x 12-wide Z3 would also be tempting for that matter, but leaves no spares.

  • Secondary backup pool: 1x 11-wide Z3 8TB (Mix of 7x SMR Archive 8TB, 2x CMR WD Red 8TB, 2x CMR WD Red Pro 10TB — and 2x CMR WD Gold 10TB’s for resilver spares) 1MB record size, strictly WORM - or as close to it as possible - has a bunch of small files backed up, but will mostly have large chunks of data added in a single go going forward, probably once a year.
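For reference, the rough shape of that move-and-move-back / recordsize shuffle (names and paths are placeholders; recordsize only applies to newly written blocks, hence the copy-off-and-back):

    zfs set recordsize=1M tank/media               # only affects blocks written from now on
    rsync -a /mnt/tank/media/ /mnt/backup1/media/  # push everything to the primary backup pool
    # scrub / spot-check the backup, clear the originals on the main pool,
    # then pull it all back so it gets re-written as 1M records:
    rsync -a /mnt/backup1/media/ /mnt/tank/media/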

Comments, critiques & alternative suggestions welcome.

Thanks in advance!

Wouldn’t a full badblocks run (4x passes) achieve that, in theory?

It would eventually just bypass the short-term CMR cache zone, and force sequential writes directly to the SMR zones… That way, by the 2nd pass, the original batch of CMR-zone data would be rendered entirely redundant, and simply gets nuked instead of being flushed out to the SMR zones?

1 Like

That would be nice, but how can you tell / be certain?

My own thought is that badblocks, even run 4 times, would not clear out the fragmentation - just make it worse.

Now some have suggested that using the SATA secure erase might do a clear of the shingles, allowing a reset of the drive. But again, how can you tell / be certain?
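For anyone who wants to experiment, the ATA secure erase sequence looks roughly like this - device name is a placeholder, it wipes the whole drive, and I have no proof it resets the shingle mapping:

    hdparm -I /dev/sdX | grep -i frozen                       # must report "not frozen"
    hdparm --user-master u --security-set-pass Pass /dev/sdX  # set a temporary password
    hdparm --user-master u --security-erase Pass /dev/sdX     # issue the erase, then wait it out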

Unless the drive is host aware, meaning the server can request information about the shingles, I doubt any real progress can be made. Even with host aware drives, someone would have to write software to confirm that a SATA secure erase did the clear, or to clear the shingles some other way.

Last, SMR drives would almost need to be scheduled for a yearly (or periodic) clearing to prevent problems.

We have seen people create extra-wide RAID-Zx vDevs and it starts off okay. But then it gets slower and slower, to the point where it would take weeks to perform a scrub - or worse, they have to copy all the data off to re-build it with a sane configuration. This is similar to the SMR problem: too much fragmentation.

1 Like

Wouldn’t the TrueNAS dashboard Storage I/O graphs tell the story?

Here are 3x of the 8TB Archives during the first Badblocks pass:

But…but…but… Why? I may genuinely be missing something fundamental in Badblocks and/or SMR operation to not understand this logic…

How would writing blocks sequentially, from the platter perimeter inwards, with subsequent tracks overlapping previously written ones, worsen fragmentation? Append-only zone logic shouldn’t come into it, unless at some point during a subsequent Badblocks pass the drive I/O suddenly tanks for periods while it applies its own internal zone-management logic?

Worth a try, I guess - but only if I/O faceplants on Badblocks, no?

As a part of a backup pool, WORM for all intents and purposes (or is that, WORHN? Read Hopefully Never :rofl:), written in large sequential chunks, would clearing still be necessary in principle?

Granted - after a bit more reading up, I reckon I’ll shelve my 15-wide Z3 VDEVs idea, in favour of 11- or 12-wide VDEVs.
For that matter, I might be overthinking things with Z3 on a cold backup unit - where Z2 should actually suffice in principle?

Your points are interesting, and I have no serious counter arguments except this:

If I understand SMR HDDs, blocks are not stored sequentially. Like SSDs, they are stored wherever the drive thinks is appropriate. Meaning the SMR firmware selects a track to write and stores the data there, after any reads of the following track(s) needed to allow re-writing them.

My point is that while you write sequentially, internal pointers in the SMR HDD could put the data anywhere. The drive does not know the badblocks program is attempting to write completely sequentially.

Now can I prove my guess?
Nope.

As for your graphs, simple, non-RAID, non-ZFS writes and reads are exactly what SMR HDDs excel at. Throw in ZFS attempting to select proper sized areas for writes, and ZFS’ copy on write behavior, well, shingle fragmentation is the result. In my opinion.

By the very nature of shingle layers, it is implied that tracks must be written sequentially. There has to be an algorithm in the abstraction layer that at least attempts grouping blocks closely together - anything short of that would be insanity?

Had a bit of a ponder on this, and what you’re saying does make sense - but surely the abstraction layer is smart enough to cater for directly writing said data stream (mostly) sequentially? This assumption would be substantiated by the shape and consistency of the write I/O graph?

There, I can’t fault your logic… I suppose it’s all down to how clever the abstraction layer is at sequencing & managing the data blocks, versus what ZFS is dictating.

Assuming you’re on the money with COW and ZFS area targets causing fragmentation, what do you suppose would be the best writing approach to have the drive’s abstraction layer attempt to mitigate fragmentation?

  1. Large one-shot copy of the entire backup dataset, and let it default to writing sequentially directly to SMR? or…
  2. Break the copy job up into ~400-500GB chunks (and give the drives a breather in between to shift data written into the CMR cache across to SMR zones as they see fit)? Something like the loop sketched below.
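For option 2, I was picturing something as dumb as the following - purely illustrative, with paths, per-directory chunking (rather than exact 400-500GB slices) and the pause length all pulled out of thin air:

    for dir in /mnt/tank/media/*/ ; do
        rsync -a "$dir" /mnt/backup2/media/
        sleep 3600    # give the drive-managed firmware a chance to destage its CMR cache to SMR zones
    done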

PS - I’m still not arguing - just spitballing here - the subject’s piqued my curiosity now. :wink:

A thread on the old forum came to the conclusion that SMR might fit with some insane value of ashift, like 28 or 29. :stuck_out_tongue:

1 Like

I did actually stumble across that thread… :rofl:

Why stop even there? The CMR area consists of 64x 256MB zones, for a total of 16GB… Ashift=34 and a 16GB record size, and you pack the CMR area one chunk at a time. :stuck_out_tongue_closed_eyes:

I have no serious suggestion to reset SMR HDDs’ shingle table, other than what I already wrote:

  • SATA secure erase
  • Host aware SMR HDD, which allows using software on the host to reset shingles

To be clear, I have no proof (nor further argument) that writing sequentially does not clear the shingle indirect tables.