ZFS pools & power loss - you won't lose data

There are sometimes misconceptions about ZFS losing data due to power loss. The subject of power loss (or an OS crash) followed by not being able to import a ZFS pool comes up here in the forums a few times a month. Here is the skinny on what should happen.

There are several things to unpack about power loss affecting ZFS and data loss.

  1. Any previously stored data in a ZFS pool can not be / is not lost on power loss. The exception is a hardware failure that defeats pool redundancy, like the loss of both disks in a 2-way Mirror vDev.
  2. Any data that has not yet been written / is still in flight is lost, just like with every other file system out there.
  3. Using a SLOG (Separate intent LOG) is a specialized case, good only for synchronous writes. Enterprises might mirror SLOGs “just in case” of a SLOG device failure during power loss or OS crash.
  4. ZFS attempts to be always consistent on disk, thus after a crash / power loss no file system check is needed. I say “attempts” because consumer hardware fails more often than Enterprise hardware, and that can lead to data loss.
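
To make points 2 and 3 concrete from the application's side, here is a minimal Python sketch of a synchronous write. It uses only standard `os` calls; the function name `durable_write` and the demo file are made up for illustration. Until `os.fsync()` returns, the data should be treated as in flight and may be lost on power loss. On ZFS, it is this kind of synchronous request that the intent log (and a SLOG, if present) is there to serve.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> None:
    """Write payload to path; return only after the OS reports it flushed.

    Hypothetical helper for illustration. A newly created file may also
    need an fsync on its parent directory on some file systems; that
    step is omitted here.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:                      # os.write may write only part of the buffer
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # before this returns, the data is "in flight"
    finally:
        os.close(fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
    durable_write(demo_path, b"this line survives a power cut once fsync returns\n")
```

An application that skips the `fsync` step is relying on the file system to write the data out later, which is exactly the window described in point 2.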

This specific power loss issue was a design criterion of ZFS: no data loss on power loss. Given no hardware failures (RAM, storage controller, storage device, etc.) and no bugs in ZFS (rare, but it has happened), there is zero chance of ZFS losing data on crash or power loss, except of course data in flight. Data is either completely written or not.
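
The “completely written or not” behaviour can be illustrated with a toy copy-on-write model. This is only a sketch of the general idea, not ZFS internals: `ToyPool`, `CrashNow` and the single `root` pointer are invented for the example, and real ZFS works with transaction groups and uberblocks rather than one Python attribute.

```python
class CrashNow(Exception):
    """Raised to simulate power loss at a chosen step (hypothetical)."""


class ToyPool:
    """Toy copy-on-write store: new data never overwrites live data."""

    def __init__(self):
        self.blocks = {0: "old data"}   # block id -> contents on "disk"
        self.root = 0                   # stand-in for the pool's root pointer
        self.next_id = 1

    def read(self):
        return self.blocks[self.root]

    def write(self, data, crash_before_commit=False):
        new_id = self.next_id
        self.next_id += 1
        self.blocks[new_id] = data      # step 1: new data goes to a free block
        if crash_before_commit:
            raise CrashNow()            # power fails; root still points at old data
        self.root = new_id              # step 2: one pointer switch activates it


if __name__ == "__main__":
    pool = ToyPool()
    try:
        pool.write("new data", crash_before_commit=True)
    except CrashNow:
        pass
    print(pool.read())                  # the in-flight write is gone, old data intact
    pool.write("new data")
    print(pool.read())
```

A crash before step 2 loses only the in-flight write; the previously stored data is untouched because nothing was overwritten in place.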

On the other hand, we have seen enough cases to identify some common causes of pool corruption:

  • Use of hardware RAID disk controllers (yes, even JBOD mode can be bad).
  • Using a proper LSI HBA but with really old firmware, which has led to problems, some appearing to cause data loss.
  • Use of USB attached drives (some with hardware RAID controllers set to JBOD mode).
  • Virtualizing TrueNAS without passing through the disk controller (passing just a virtual disk or the plain disk instead).

Any of these can lead to out-of-order disk writes, which during a crash can cause ZFS pool corruption. Since the corruption only occurs due to a crash (power loss or OS crash), people say “But it worked for months / years!”. That is the point. ZFS was designed to handle crashes IF you give it the proper hardware. If not, you may get lucky or not.

On the subject of UPSes: a UPS is a good thing to have, even if it only covers a few minutes, but it is not strictly necessary.

  • As a minimum, it will extend the life of your hardware through fewer surges, power dips, transient outages, etc.
  • Some hardware will fail or lose data because of graceless power loss.
  • When the cleaning lady comes through and plugs the vacuum into the wrong circuit, it doesn’t interrupt your workflow …

Comments?
Corrections?
Additions?

Looks right, other than some typos:

  • s,loose,lose,g;
  • s,their,there;

I would also add that a UPS, while not strictly necessary, is a Really Good Thing for a file server even if it only has a few minutes of run time. As a minimum, it’ll extend the life of your hardware through fewer surges, power dips, transient outages, etc. Plus when the cleaning lady comes through and plugs the vacuum into the wrong circuit, it doesn’t interrupt your workflow …

But that’s not specific to ZFS, no.

Done.

I also still recommend a UPS. Who wants any data loss, ideally? Power losses can definitely affect hardware, especially brownouts. People are always worried about power surges. Brownouts can kill too! And once you affect hardware, you can affect TrueNAS.

I agree: the way ZFS works, it should never cause a corrupted or lost pool just because of a power outage. Has it? Yes, there’s been a bug here and there. Pretty much, it’s solid.

I would add another to the list. There are many people using good HBAs, but ancient firmware too!

I’ve added the HBA firmware issue, and the UPS. Give those a read and let me know if something could be improved.

Looks good for the most part. I do still think a UPS is required if you want your machine to keep working. Low power (and I’ve had low power for 10 minutes before) can fry many electronics in your house, and it has. If it’s shorter, it can weaken components and make it seem like the “power failure” caused it. Now very strange things can happen with stuff perhaps not running to spec any more. I used to be in that business and low power situations cause so many issues.

My specific fear is the weakening of components as now it’s very hard to troubleshoot (if it doesn’t outright fry them). Maybe it doesn’t cause immediate loss but system gradually starts having various unknown issues.

Anyway, since the topic is losing a pool, perhaps your writeup is good. Maybe I am expanding beyond lost pools. I just think a UPS is absolutely necessary. It is for anything I run! But I’ve lost half a house of stuff from brownouts before, lol.

Perhaps a separate UPS resource could be created, explaining the various issues and possibly referencing this Resource for the ZFS side.

Thanks for writing this up for everyone to use. The community needs more of course. I am incapable at this time of writing things up, not my skill (though I have done a couple) and have other problems. People like you, Stux, etc. help move stuff along for folks trying to understand more. I try and pitch in comments where I can, but time is limited.

Now if someone had the time to write up a noob Plex install guide, the forum would have a lot fewer posts! :joy:

Looks good.

A separate UPS would be good too.

I have always used server hardware with redundant supplies and use dual UPS systems on the server rack, which includes the switches, router and other equipment. Each server is thus supplied by both UPS systems and the rest of the equipment is split between the two. I never have to worry about any data loss during a brownout or other power issue, or about waiting for a rack reboot after a power failure while the generator kicks in.

One other problem that I thought of: storage device write cache. ZFS was originally written when HDDs had write caches that could re-order writes (aka elevator seek). On power loss (or OS crash), that could cause data loss.

At one point, (around 2010), Sun Microsystems recommended disabling HDD write caches when using ZFS because of this problem.

Eventually this problem was overcome by implementing and using write barriers. This meant that all the preliminary writes for a ZFS transaction, (aka data write), could be known to be flushed to media, (HDD sectors, SSD / NVMe flash memory), before the final write that activated the prior writes. This meant that ZFS was still always consistent on stable media and would not lose data on power loss or OS crash.
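
Why the barrier matters can be shown with a small exhaustive toy model. Everything here is an assumption made for illustration (the `crash_outcomes` helper, the `root` and `block` names, and the idea that a volatile cache may persist any subset of its pending writes); it is not how ZFS or any real drive is implemented. The flush stands in for the write barrier described above.

```python
from itertools import combinations

# Starting on-media state: the root pointer refers to an existing block.
START = {"block0": "old data", "root": "block0"}
NEW_DATA = ("block1", "new data")   # preliminary write
NEW_ROOT = ("root", "block1")       # final write that activates the new data


def crash_outcomes(media, cached_writes):
    """Yield every on-media state possible if power fails while
    cached_writes sit in a cache that may persist any subset of them."""
    for k in range(len(cached_writes) + 1):
        for subset in combinations(cached_writes, k):
            state = dict(media)
            state.update(subset)    # each write is a (key, value) pair
            yield state


def is_consistent(state):
    """The root pointer must refer to a block that is really on media."""
    return state["root"] in state


def outcomes_without_barrier():
    # Both writes are still in the cache when power fails.
    return list(crash_outcomes(START, [NEW_DATA, NEW_ROOT]))


def outcomes_with_barrier():
    # Crash before the flush completes: only the data write is cached.
    before = list(crash_outcomes(START, [NEW_DATA]))
    # Crash after the flush: data is on media, only the root write is cached.
    flushed = dict(START)
    flushed.update([NEW_DATA])
    after = list(crash_outcomes(flushed, [NEW_ROOT]))
    return before + after


if __name__ == "__main__":
    bad = [s for s in outcomes_without_barrier() if not is_consistent(s)]
    print("inconsistent outcomes without barrier:", len(bad))
    print("all consistent with barrier:",
          all(is_consistent(s) for s in outcomes_with_barrier()))
```

Without the flush, one possible outcome has the root pointing at a block that never reached the media. With the flush between the two writes, every outcome is either the old state or the new one.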


However, USB controllers (with or without a hardware RAID controller) could still implement elevator seeking with their write cache. And it is possible that low end flash storage devices (whether that’s USB, SATA or NVMe) may not implement their write cache correctly. Expecting low cost / low end storage devices to be perfect would be asking too much.

So, caveat emptor, (buyer beware).
