In light of the recent announcement from Western Digital regarding their 32TB SMR drive, it got me thinking: is SMR use in TrueNAS / ZFS going to be a no-no forever, or is there some hope / development going on behind the scenes to help them play nice together one day?
Realistically, I see SMR forever being the domain of hyperscalers who can throw resources at the problem. This applies to any FS, not just ZFS.
Wow! 32TB on a single drive. Amazing how far we have come. The earliest mass storage device I ever laid my hands on was an 8MB drum memory. It had a spinning drum, not a platter. I have no idea how it worked; I never got to see the internal components. I guess I should have opened it up when I had the opportunity. And then my first 5GB Seagate drive (used, of course).
Can you imagine making a 3-way mirror and, poof, 30TB of instant storage in a nice small box?
It would be nice if there were a way to force ZFS to accept SMR drives, or in other words, slow the writes down and cache them first if it must. But I can only see that happening in an environment where writing / changing data is a very small part of the NAS operation.
ZFS will no doubt have to become more compatible with SMR. At the very least, I expect some development in the direction of Host-Aware (HA-SMR) and Host-Managed (HM-SMR) to make ZFS fundamentally OK with an internally-cached HDD.
After all, even an HM-SMR system is not a true copy-on-write (COW) system until the contents of the internal HDD cache have been flushed, written to a shingled sector, and CRC'd as good.
Thus, there is potentially a lot more memory / flash needed for storing transaction records until they can be verified as having been written into a shingled sector. That goes for flash or HDD.
I don't see how Device-Managed SMR (DM-SMR) will ever really work well with ZFS in terms of COW, because even if ZFS is aware that it's dealing with a DM-SMR drive, it has no control / awareness / etc. of when the CMR cache is flushed to the SMR sectors. Thus, how to verify the writes as good?
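To make that concrete, here is a toy sketch (Python; every class and method name is invented for illustration) of the host's-eye view of a DM-SMR drive: the write is acknowledged once it lands in the CMR staging area, and the later migration into shingled zones happens entirely inside the firmware, with nothing for ZFS to observe or verify.

```python
# Illustrative sketch only; this is not how any real firmware is implemented,
# just a model of the visibility problem described above.

class ToyDMSMRDrive:
    """Device-managed SMR: the host sees one flat block device."""

    def __init__(self):
        self.cmr_cache = {}   # staging area (conventional zones)
        self.smr_zones = {}   # shingled bulk storage

    def write(self, lba, data):
        # The drive acknowledges as soon as data lands in the CMR staging area.
        self.cmr_cache[lba] = data
        return "ACK"          # the host treats the write as durable here

    def _internal_flush(self):
        # Runs whenever the firmware decides (idle time, cache pressure, ...).
        # There is no command by which the host learns this happened or can
        # verify the shingled copy afterwards.
        for lba, data in list(self.cmr_cache.items()):
            self.smr_zones[lba] = data
            del self.cmr_cache[lba]


drive = ToyDMSMRDrive()
assert drive.write(0x1000, b"txg record") == "ACK"
# From here on, ZFS has closed the transaction; the CMR -> SMR migration is
# entirely the firmware's business.
```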
Further, HA-SMR & HM-SMR drives are currently only sold into the B2B market, while DM-SMR drives are the only ones being sold B2C. DM-SMR drives are particularly problematic because they don't signal to the host that they are SMR or what they are doing (like flushing the cache), which can make the host think they're non-responsive drives.
The issue of COW dealing with internal two-tier storage devices also applies to SSDs. Some flash sticks / drives feature a fast cache up front, coupled with slower flash in the rear. Just like SMR, the fast stuff up front has to be flushed to the slower flash in the rear. In an SSD, chances are pretty good these flushes happen fairly regularly / quickly; in an HDD, these flushes can take a lot longer.
But either way, ZFS is potentially dealing with disaster if a cache flush to the slower storage in the rear is interrupted, etc. I have no doubt the drive OEMs have invested a lot of time and money to prevent this issue. But how to verify a transaction as "good" if it can only be verified as good inside a temporary cache vs. an actual, final resting place?
I wonder if the solution will be a new type of SSD cache whose sole job is to buffer every recent transaction, followed by verifying actual COW data after a "reasonable" amount of time / data has passed that would have triggered a cache flush. Easier to do with HM-SMR and HA-SMR vs. DM-SMR or 2-tier flash drives.
No matter how many band-aids are thrown at the problem, SMR drives are very problematic re: performance if a lot of writing is going on (see the WD Red DM-SMR ZFS rebuild performance article at ServeTheHome). They might become OK for home users who want a WORM repository with limited writes. But every drive failure will become that much more risky, since drive replacements will take a lot longer than with CMR.
But the first step would be transitioning the B2C market from DM-SMR to HA- or HM-SMR to make drives available / create demand for HA/HM compatibility. At the moment, HA/HM drives go straight into data centers where even the motherboards are custom.
You don't.
There is no difference between this scenario and a regular hard drive that develops a bad sector.
In both cases, you write something to a location and verify it reads back good (if you even do). After a while you read that same location and the data comes back damaged. There is no protection against that happening, even on a traditional drive.
In general, one does not rely on transactions being good, one relies on backups.
I don't agree, due to the COW model. There is a difference between writing something to its final resting place vs. writing to a cache that then has to be flushed on occasion. If protection against bit-rot is a design goal, then the data cannot be verified as good and the transaction closed out until it has been written to its final resting place, can it? Maybe I think ZFS is more paranoid than it actually is?
Check out Usagi Electric on YouTube; he's restoring a Bendix G15 with drum memory:
He describes the drum memory here:
I don't think ZFS verifies data after write, at least not in the default configuration. Anyway, let's assume it does. So the transaction is verified and the data becomes bad sometime later. There is no difference if the data becomes bad because the sector becomes bad or because the cache flush fails. In both cases, the checksum verification on read detects the data is bad and then there is an optional recovery attempt.
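A minimal sketch of the "verify on read" model being described (not ZFS code, just the idea, with sha256 standing in for whatever checksum is configured): the checksum recorded at write time lives with the parent block pointer and is recomputed on every read, so a bad sector and a botched internal cache flush get caught the same way.

```python
import hashlib

block_pointers = {}   # stand-in for parent block pointers: lba -> checksum
disk = {}             # stand-in for the drive (CMR staging area, shingles, whatever)

def zfs_like_write(lba, data):
    disk[lba] = data
    block_pointers[lba] = hashlib.sha256(data).digest()

def zfs_like_read(lba):
    data = disk[lba]
    if hashlib.sha256(data).digest() != block_pointers[lba]:
        raise IOError(f"checksum error at {lba:#x}; attempt repair from redundancy")
    return data

zfs_like_write(0x2000, b"some record")
disk[0x2000] = b"bit rot or failed flush"   # either failure mode looks the same
try:
    zfs_like_read(0x2000)
except IOError as err:
    print(err)
```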
There are actually multiple problems with DM-SMR drives and ZFS.
I have personally seen my old Seagate Archive 8TB drive that I use for backups (with ZFS) slow down over time, even though the amount of data has remained steady. This is because the internal data is becoming more and more fragmented. This mostly affects reads, but with the free space also being fragmented, it affects writes. THEN it also affects reading the metadata for the writes.
Note that this was one of the first 8TB drives at a reasonable price. I was aware that it would not be speedy, and could potentially not last long. But it was cheap enough and able to fully back up my 4 x 4TB RAID-Z2 pool.
Another problem, as has been pointed out above, is during the shingle move / re-writes: does the firmware of the DM-SMR device actually have power-fail-safe migration? Tested both by code reviews AND actual power losses?
Further, if there is some SSD storage (or even plain non-shingled disk cylinders) used as cache, does it do the proper thing? If it tells the host that the write is complete, even if it is just to cache, will it work correctly for the underlying shingled writes? Meaning it must not read the cache into memory, invalidate the disk cache entry, and then write the shingles. That would be WRONG. The cache entry needs to stay in place and active until the shingles are 100% written. Basically, ZFS-style write integrity inside the drive.
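A hedged sketch of the ordering being argued for here - write the shingles, verify them, and only then drop the staging copy. Whether real DM-SMR firmware actually works this way is exactly the open question; everything below is invented for illustration.

```python
import hashlib

class ToyFirmware:
    def __init__(self):
        self.cmr_cache = {}   # lba -> data (staging copy stays authoritative)
        self.shingles = {}    # lba -> data (final resting place)

    def migrate(self, lba):
        data = self.cmr_cache[lba]
        self.shingles[lba] = data                        # 1. write the shingled copy
        verified = (hashlib.sha256(self.shingles[lba]).digest()
                    == hashlib.sha256(data).digest())    # 2. stand-in for a read-back / ECC check
        if verified:
            del self.cmr_cache[lba]                      # 3. only now drop the cache copy
        # The WRONG order: drop the cache entry first, then write the shingles;
        # a power loss in between destroys the only good copy.

fw = ToyFirmware()
fw.cmr_cache[0x3000] = b"host-acknowledged write"
fw.migrate(0x3000)
assert fw.shingles[0x3000] == b"host-acknowledged write"
```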
That would require driver support in consumer-grade OSes, which is unlikely to come before consumers can really get their hands on HA-/HM-SMR. Chicken and egg.
With SMR drives offering only a few TB more than CMR drives, and SSDs now being much larger than HDDs if one can pay data centre prices, I do not see that happening.
Good point. I'm not aware to what extent HM/HA-SMR drives have a dumb mode to act as DM-SMR. One would think it's HA-SMR, i.e. the drive can tell the host what it's doing, but if the host doesn't ask, the HA-SMR drive will act just like a DM-SMR drive.
HM-SMR is a different beastie, but there are also SAS vs. SATA vs. U.2 whatever drives out there, so I don't consider it insurmountable. More than anything, I expect OEMs NOT to do the WD thing and surreptitiously and deliberately pollute extant CMR NAS drive channels with SMR drives.
I.e., clear labeling would fix this issue in a jiffy vs. hiding material engineering specifications in hard-to-find cut sheets, as opposed to the mess of drive platforms the OEMs currently cultivate to obfuscate what's in the drive. Why not make it simple: the box has to say what interface, what cache, what drive type, capacity, helium or not, etc., in bold letters. Sort of like nutritional labels on our cereal boxes, window stickers on cars, and so on, to cut through the marketing spin.
HM (Host Managed) SMR has no "dumb mode" - HA (Host Aware) SMR has a fallback.
Linking to a previous post of mine on this:
At present HM-SMR drives work on a granularity of 256M. Think of the problems of trying to use ashift=9 on devices that should be ashift=12 - so 512b into 4K - and then amplify it massively, as SMR zones are effectively ashift=28.
So SMR support in ZFS would begin with ashift=28 and a corresponding recordsize of, say, 1 GB or 4 GB. (Why stop at 16 MB anyway?)
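Some back-of-the-envelope numbers on why that granularity hurts, assuming the worst case where a small overwrite forces the whole 256 MiB zone to be rewritten:

```python
# Rough arithmetic only; real drives mitigate this with their CMR staging area
# and internal indirection, but the worst case is what resilvers run into.

record = 128 * 1024            # a typical 128 KiB ZFS record
sector_4k = 4 * 1024           # an ashift=12 device block
smr_zone = 256 * 1024 * 1024   # an HM-SMR zone, effectively "ashift=28"

print(smr_zone // sector_4k)   # 65536 -> one zone spans 64Ki 4K blocks
print(smr_zone // record)      # 2048  -> worst-case amplification for rewriting one record
```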
In the talk about SMR drives writing to a cache before a final resting place, I suspect this is a little different from the RAM cache on board (and equivalents elsewhere) with respect to being able to check that data made it safely across the end-to-end path.
As I understand it, an SMR drive writes to a region of CMR space (what I think is being called cache in this discussion) before the data makes it to the shingled bulk region, after the CMR space fills or the IOPS fall away enough to prompt the drive to do it on its own. But, here comes the important bit: I suspect that the transfer from CMR to SMR space is just another, more abstracted version of the COW that ZFS does. So long as the data makes it to the disk, regardless of whether that's CMR or SMR space inside each drive, you can still check and confirm the data exists. After "insert suitable dwell time for CMR to SMR transfer here", the only thing that changes is that the pointers get updated - you don't run a risk of losing data in cache because of the DM-SMR abstraction, I think.
Does anyone have any data/info on what happens to an SMR drive if it is powered off having successfully completed a write to CMR space (cache) but before the transfer to SMR space?
Additionally, does any drive on the market give write confirmation if the data has only made it to the mere MBs of RAM cache on board an HDD, or will it always wait until it has been written to the storage medium? If the latter, this COW "problem" isn't actually a problem, so long as the data is on the magnetic surface somewhere, whether in CMR or SMR space.
I think the issue is less about confirming a write via COW, and more about the permitted time to do so without calling a drive failed when it stops responding to new requests while sorting its laundry. That said, if it turns out that the drive firmware does not do a COW-and-check equivalent internally, then we have both a problem as users, as well as a few questions for the drive manufacturers.
For me, it comes down to how much I trust a drive OEM to get the CMR-SMR cache flush right every time. Is it possible that they CRC the blocks in question and do a quasi-SLOG sync-write equivalent? Even manage to ensure writes are good by only clearing the CMR cache after the drive verifies that the SMR shingle sector has been dutifully written 100%, under all operational conditions?
Yes, entirely possible. But it seems unlikely. I'm not a fan of tiered storage unless the OS / file system knows what's going on. So the only SMR drives I may ever get comfortable with would be the HM-SMR type for ZFS.
Just don't use SMR drives. They're not fit for purpose. The purpose being ZFS vdev members.
Do you trust a drive to remap bad sectors/flash cells?
Because if you do, this is pretty much the same philosophy - the file system doesn't map the individual drive storage sectors, only a higher-level abstraction. And ultimately, if the checksum is correct, why does it matter to the filesystem how/where that data is stored on the disk (i.e. shingled or in CMR cache)? If it isn't, a scrub will fix it or raise flags.
That's independent of the other reasons why SMR causes ZFS problems - don't think I'm advocating for SMR drives here, but if the write delays etc. can be solved by suitable software and engineering, then it's not out of the question that SMR drives become safe for current implementations of ZFS in future. Thought - how would things change for ZFS if you implemented write throttling for the drives themselves? I.e. the IOPS keep reducing as shingled re-writes take place, instead of just falling to zero as the drive enters a wait state that the filesystem is specifically designed to kick drives out of as a response.
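Purely as a thought experiment on that throttling idea (nothing ZFS does today, and the numbers are arbitrary): scale the outstanding write queue with observed service time instead of letting requests pile up until the drive looks dead.

```python
def allowed_queue_depth(recent_latency_ms, max_depth=32, target_ms=50):
    """Shrink the write queue as the drive's reshingling latency grows."""
    if recent_latency_ms <= target_ms:
        return max_depth
    # Roughly halve the queue for every doubling of latency beyond the target.
    return max(1, int(max_depth * target_ms / recent_latency_ms))

for latency in (10, 50, 200, 1600, 12800):
    print(latency, "ms ->", allowed_queue_depth(latency), "outstanding writes")
```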
The major issue is likely twofold:
- DM-SMR drives do not let the host system know what they are doing. Hence, ZFS may freak out as the SMR drive engages in lengthy cache flushes, garbage collection, and so on.
- DM-SMR drives are the only ones being sold into the consumer market at the moment.
I agree that in the future, HA/HM-SMR drives may become compatible, though HM-SMR is likely the only option that will allow ZFS to maintain even a modicum of performance, by synchronizing cache flushes and not having one drive in a pool keep the pool from being ready for the next write.
Between the write amplification associated with large SMR sectors and individual drives blocking the pool, the current insane resilver wait times associated with DM-SMR can be explained quite easily.
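A crude illustration of the scale involved - the throughput figures below are assumptions picked for the example, not measurements (the ServeTheHome article has real numbers):

```python
# Hypothetical sustained rates: ~180 MB/s for CMR, ~20 MB/s for a DM-SMR drive
# once its CMR staging area is saturated during a long resilver.

capacity = 8 * 10**12                          # resilvering an 8 TB drive
cmr_rate, smr_rate = 180e6, 20e6               # bytes per second (assumed)

print(capacity / cmr_rate / 3600, "hours")     # ~12 hours
print(capacity / smr_rate / 86400, "days")     # ~4.6 days, before any pool-blocking stalls
```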
SMR drives have their own indirection table, similar to how an SSD handles NAND mapping, so I'd assume that it would have to do atomic updates to the table during copies from CMR → SMR (or vice-versa, for reshingling operations) - however, see above re: drive vendors not really wanting to be an open book on that.
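A rough guess at what such an indirection table has to do (since the vendors publish little, this is a sketch of the principle, not any real firmware): the logical-to-physical map is only repointed after the new physical copy exists, so a crash leaves either the old mapping or the new one, never neither.

```python
class ToyIndirectionTable:
    def __init__(self):
        self.map = {}     # logical block -> ("cmr"|"smr", physical location)
        self.media = {}   # ("cmr"|"smr", location) -> data

    def write_cmr(self, lba, data, loc):
        self.media[("cmr", loc)] = data
        self.map[lba] = ("cmr", loc)

    def reshingle(self, lba, smr_loc):
        data = self.media[self.map[lba]]
        self.media[("smr", smr_loc)] = data    # 1. place the shingled copy first
        old = self.map[lba]
        self.map[lba] = ("smr", smr_loc)       # 2. atomically repoint the mapping
        del self.media[old]                    # 3. reclaim the CMR staging space

table = ToyIndirectionTable()
table.write_cmr(0x4000, b"payload", loc=7)
table.reshingle(0x4000, smr_loc=12345)
assert table.map[0x4000] == ("smr", 12345)
```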
Drives are only supposed to give a write confirmation when data has been written to non-volatile (in case of power failure) storage - RAM is volatile, but the CMR section on an SMR drive is stable, so you'll get the confirmation after the data is staged in the CMR area. Moving between CMR and SMR is up to drive firmware.
SSDs with power loss protection for in-flight data (e.g. via supercapacitors) are able to confirm after a write into RAM because they can still flush to NAND in case of power loss.
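For completeness, the host-side half of that contract is the ordinary durability dance any application does - a generic example, not anything SMR-specific: a write should only be treated as confirmed once fsync() returns, and what counts as "stable" behind that call (CMR staging area, PLP-protected SSD RAM) is the drive's promise to keep.

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), "durability-demo.bin")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
os.write(fd, b"transaction group payload")
os.fsync(fd)   # returns only once the kernel and drive report the data as durable
os.close(fd)
```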
I decided to check out Seagate's new 32TB SMR drive.
But, when I checked other sites to see about retail pricing, I saw that the 30TB disk was NOT SMR, but CMR:
Capacity:
- 32TB (SMR)
- 30TB
That is weird. Only 2TB more than the 30TB HAMR CMR, roughly a 7% gain. I would have hoped it would be at least 50% better, and ideally closer to 80%. (Assuming most of the space was shingled.)
With those numbers, it is a no brainer to prefer the 30TB HAMR CMR drive.
Note: I don't know if the 32TB SMR drive also uses HAMR technology. Seagate's web site is not clear in that regard. (HAMR is Heat-Assisted Magnetic Recording.)