ZFS vs ext4 for external backups

One last note on non-SAS/SATA storage interfaces:

Do they support “write barriers”?

This is one key feature for ZFS on-disk consistency. Basically, ZFS wants the data & metadata written completely to disk BEFORE making it live with the uberblock update. This maintains the pool in an always-consistent state, so no boot-time file system check is needed.
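To illustrate the ordering (a conceptual sketch of my own, not actual ZFS code; the helper functions are invented):

```c
/* Conceptual sketch of the ordering a copy-on-write pool needs.
 * The two helpers are hypothetical stand-ins for real block I/O. */
#include <unistd.h>

void write_new_data_and_metadata(int fd); /* hypothetical: writes to free space */
void write_uberblock(int fd);             /* hypothetical: activates the new tree */

void cow_transaction(int fd) {
    /* 1. Write all new data and metadata into unused space on disk.
     *    Nothing live points at it yet. */
    write_new_data_and_metadata(fd);

    /* 2. Barrier: everything above must be durable before step 3.
     *    fsync() waits for the writes and flushes the drive's write
     *    cache (on a storage stack that honors flushes). */
    fsync(fd);

    /* 3. Only now flip the pool to the new tree via the uberblock. */
    write_uberblock(fd);
    fsync(fd);

    /* A power cut before step 3 completes leaves the old uberblock
     * pointing at a fully consistent tree, so no fsck is needed. */
}
```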

If the storage driver, whether it is external software or hardware RAID, or a non-SAS/SATA interface, performs elevator seeks & reordered writes without honoring barriers, that works fine until a power loss. Then it is possible to get pool corruption. Unrepairable pool corruption.

Hey, I don’t even know if NVMe drives support write barriers!
My guess is that they do, but do they?

Post-backup activity on external RAID drive arrays has always been a bugaboo for me… so I wait until the array ceases all drive activity before shutting it down.

Since my inexpensive Oyen Digital Mobius 5 drive arrays do not allow meaningful communication between the RAID controller and the host CPU (the macOS software is from JMicron, was written for 32-bit only, etc.), there is no good way to time this, and sometimes the RAID controller merrily chugs along with disk activity for a while, without anything being wrong, after the “ejection” of the RAID from my desktop.

I presume it’s doing some parity work (Hardware RAID 5) or maybe even some primitive testing. No way to know as there is no interface to consult. Instead, I wait until the drives are eventually spun down by the RAID controller and presume it’s safe to turn off the RAID enclosure PSU when they do.

As best I can tell, this is not an issue with ZFS. When a pool is exported, everything in it is flushed and finalized such that once the export completes, power can be cut fairly quickly. Yet another argument not to let an extra protocol / layer / whatever come between your drives and the ZFS system (i.e. USB, RAID controller, Proxmox, etc.).

Isn’t that what NCQ was supposed to achieve, too? Notifying the host when data is committed to the physical platters.

You should probably quote the relevant part too:

Basically, I have questions on these storage protocols:

  • USB BOT
  • Thunderbolt
  • NVMe

In theory, USB Attached SCSI should support write barriers if the attached disk does too.

Yes, I misread that. Sorry!

The UASP specification document is available here if that is of interest.

Thank you.

Poking around the document, I was unable to find information about the write barriers feature. However, since that is more a SCSI-protocol matter, it is perfectly understandable that it would not be included.

It was interesting that UASP seems to support full-duplex traffic on USB 3 and higher. SATA is half-duplex, even though it has dedicated lines for each direction. The original SCSI was also half-duplex, but Fibre Channel and later SAS were full-duplex (as far as I know).

Having traffic flow in both directions at the same time reduces latency and allows the driver or the storage device to start the next round of traffic sooner.

I went down a rabbit hole, and it appears that NCQ does the opposite of write barriers:

NCQ allows the drive itself to determine the optimal order in which to retrieve outstanding requests.

However, on the drive side, it appears “write cache flush” is the more appropriate term. On the OS side, the terms are “write barrier” and “fsync”.

Of course, I am no expert on all this. I just understand the concept that certain writes must complete before others are started, in order for ZFS’ copy-on-write methodology to always maintain on-disk consistency. So it is possible that SATA NCQ does work in this context as a write barrier.
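To make the drive-side vs OS-side distinction concrete, here is a minimal sketch of my own (not ZFS code): on Linux, sync_file_range() waits for writes to reach the device but, per its man page, does not flush the drive’s volatile write cache, while fsync() does.

```c
/* Sketch: the difference between "the drive accepted the write"
 * and "the data is durable on the media". */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("demo.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "some data\n";
    if (write(fd, buf, strlen(buf)) < 0) { perror("write"); return 1; }

    /* Pushes the dirty pages to the device and waits for completion,
     * but per its man page does NOT flush the drive's write cache. */
    sync_file_range(fd, 0, 0,
                    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER);

    /* fsync() additionally issues a cache flush to the drive (on
     * stacks that honor flushes), so the data survives a power loss. */
    if (fsync(fd) != 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}
```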

This is what I dug up on “write barriers”:
http://web.archive.org/web/20220608144201/https://docs.fedoraproject.org/en-US/Fedora/14/html/Storage_Administration_Guide/writebarr.html


@Arwen I am also not an expert in this area. I have only my own experiences and various documentation to go on.

My understanding is that write barriers are on the kernel side, but I wouldn’t have put my name on that without the sources you’ve brought.

The Linux kernel UAS code is in the kernel tree at drivers/usb/storage/ (TrueNAS fork for kernel 6.12 here). [Edit: the driver spans a few other files; they all have “uas” in their names.]

It looks like it’s just detecting UAS-capable chipsets and drives, testing for specific quirks as needed, and then registering the drive as SCSI block storage. I’m not a kernel driver developer, and the C is not strong with this one, but otherwise the driver appears to just send the SCSI commands down to the drive, along with implementing various sanity checks and driver state management. (uas.c)

This tracks with my experience. When I connect a drive via UAS (including in TrueNAS), it’s detected as a SCSI block device by lsblk, and ZFS (and TrueNAS) sees it as such. To do anything with the drive, I have to treat it as I would any attached hard drive. I haven’t encountered anything “weird”. (I can discuss the brands and drives I’ve used, if others are interested.)

The code I linked to runs a series of checks which could serve as a source of information on what capabilities your hardware stack needs to have.

Speaking to write barriers, I would expect to find specifics in the kernel’s SCSI and/or filesystem code paths. A cursory reading of the UAS code (with all my prior caveats applied) didn’t suggest to me that the UAS code is even aware of the concept.

Yes, it is my understanding too that the USB UAS code is just a wrapper for using SCSI over a different physical layer.
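As a concrete illustration (my own sketch, not kernel code): because UAS just relays SCSI commands, a UAS-attached drive accepts raw SCSI the same way any SCSI disk does. For example, the Linux SG_IO ioctl can send a SYNCHRONIZE CACHE(10), which is the SCSI-level “write cache flush” discussed above.

```c
/* Issue a raw SCSI SYNCHRONIZE CACHE(10) to a disk (UAS-attached or
 * otherwise) via the Linux SG_IO ioctl. Typically needs root. */
#include <fcntl.h>
#include <scsi/sg.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s /dev/sdX\n", argv[0]); return 1; }
    int fd = open(argv[1], O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    unsigned char cdb[10] = { 0x35 }; /* 0x35 = SYNCHRONIZE CACHE(10) */
    unsigned char sense[32];
    struct sg_io_hdr hdr;
    memset(&hdr, 0, sizeof(hdr));
    hdr.interface_id = 'S';
    hdr.cmdp = cdb;
    hdr.cmd_len = sizeof(cdb);
    hdr.sbp = sense;
    hdr.mx_sb_len = sizeof(sense);
    hdr.dxfer_direction = SG_DXFER_NONE; /* no data transfer, just the flush */
    hdr.timeout = 20000;                 /* milliseconds */

    if (ioctl(fd, SG_IO, &hdr) < 0) { perror("SG_IO"); close(fd); return 1; }
    if (hdr.status != 0)
        fprintf(stderr, "SCSI status 0x%x\n", hdr.status);
    else
        printf("SYNCHRONIZE CACHE completed\n");
    close(fd);
    return 0;
}
```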

Reading and thinking about this more, I think ZFS does the “fsync” at the appropriate time. This implements the write barrier in that no further writes at the lower level are started until the “fsync” is complete.

There is a separate, and probably equally important, part on the drive side: the “write cache flush”. ZFS probably also does this. Or perhaps the “fsync” does it too.

I do know that Sun Microsystems (the original developer of ZFS) said they had trouble with drive write caches causing corruption in ZFS pools. Thus, early on, they recommended that drive write caches be disabled. Later, this was rescinded. The reason could be any of these:

  • They compensated with software, perhaps using “fsync” and drive “write cache flush”
  • Sun made a mistake, and it was never a problem with the drive write caches
  • Drive vendors fixed their implementation of drive write caches

Perhaps this last item is one culprit for the recent metadata corruption problems. Maybe a few storage drives do have “broken” implementations of drive write caches. Obviously not all storage drives, but enough non-enterprise models to bite TrueNAS users…

Oh, well.