Cold-Storage on Single ZFS Disk

Hello,

for a cold-storage backup I am thinking about using ZFS replication onto a single 16 TB hard disk. Is my understanding correct that with a single drive ZFS detects any faults (bit rot, etc.), but cannot correct them?

To ensure data integrity even in the presence of a fault, I need at least one more copy of the data, meaning that I should either use at least two hard disks for a cold-storage mirror or set the ZFS option “copies=2” on the single hard disk, right?

Just for my understanding: Is it possible with ZFS to read data even if it has a wrong checksum? I am thinking about photos/videos/text files where a few faulty bits might only result in some faulty pixels/characters, so the files could still be mostly recovered in the worst case?

Thanks a lot in advance,

Thomas

ZFS will throw an error upon reading data that it detects as corrupt but cannot repair.

Discussion on “copies” here: The ZFS "copies" features is useless: Change my mind

I definitely wouldn’t rely on it on a single drive. Perhaps consider purchasing another for use in a mirror in an external 2-disk enclosure?

Then you’re also into the realm of long-term archival in general. If it were me, I’d be thinking about cloud services such as AWS Glacier.

1 Like

Correct, unless you…

ZFS will not return a file that is known to be corrupt; its enterprise mindset assumes that there’s always another backup to restore from. And, while a not-further-compressible video file might still be mostly viewable with a corrupted frame somewhere, an LZ4/ZSTD-compressed RAW photo or text file may not be usable at all once it is corrupted. If you still want to be able to salvage something despite bit rot, you should keep a last backup outside of ZFS and without any compression.

1 Like

For cold storage:

Two individual 8-TiB hard drives, each with a copy of your replicated pool (full, then incremental), will safeguard your data better than a single 16-TiB hard drive with copies=2.[1]

The former protects against bit rot and complete drive failure, up to the point of the latest backup run. The latter only protects against bit rot.
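To make the comparison concrete, here is a rough sketch with hypothetical pool, device, and dataset names (adapt them to your own setup):

# Option 1: single drive with copies=2 (every data block written twice on the same disk)
zpool create coldpool /dev/sdX
zfs create -o copies=2 coldpool/backup

# Option 2: two independent single-drive pools, each receiving the same replication;
# this also survives the complete failure of either drive
zpool create coldpool1 /dev/sdX
zpool create coldpool2 /dev/sdY
zfs send -R tank/data@snap | zfs recv -F coldpool1/backup
zfs send -R tank/data@snap | zfs recv -F coldpool2/backup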


Another way to fix corrupted data is to use the new feature called “corrective receive”, introduced in OpenZFS 2.2.x.

This allows a target dataset to repair corrupt data blocks, using a (good) source dataset. I haven’t tried to use it myself, so I’m not sure about the caveats beyond “It only works on corrupted data, not metadata.”
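If you want to experiment with it, here is a minimal sketch of the released syntax, assuming a healthy copy of the same snapshot exists on a pool named backup and the damaged dataset is tank/data (both names hypothetical):

# Stream the known-good snapshot from the backup pool and use it to heal
# the matching, corrupted snapshot on the damaged pool
zfs send backup/data@snap1 | zfs recv -c tank/data@snap1

# Afterwards, scrub and re-check the error list
zpool scrub tank
zpool status -v tank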


  1. The same is also true if you configure the two drives into a mirror vdev. ↩︎

1 Like

Dual-drive enclosures are cheap, can run over USB, and present the drives inside as JBOD. ZFS will happily ingest said drives, make them into a mirrored vdev and presto, instant off-site backup solution.
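For instance, something like this, with hypothetical device names for the two enclosure drives:

# Create a mirrored pool from the two drives in the enclosure
zpool create offsite mirror /dev/sdX /dev/sdY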

Don’t some of these confuse ZFS / TrueNAS by reporting the same serial # for all the drives inside?

2 Likes

Yes, I think this would require more investigation into the controller in the dual enclosure and how it presents the drives. eSATA would be better.
But one drive is more convenient than two.

While mostly true, there are rare cases where errors can be corrected.

ZFS by default makes 2 copies of metadata (the directory entries, etc.) and 3 copies of critical metadata. This applies EVEN on single disks or redundant pools. ZFS considers the loss of metadata worse than the loss of a file block or an entire file, because the loss of a directory entry could take out hundreds of files.
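If you want to see what a given dataset is set to, the relevant properties can be inspected like this (hypothetical dataset name):

# copies controls extra copies of file data; redundant_metadata controls
# how aggressively metadata is duplicated
zfs get copies,redundant_metadata coldpool/backup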

I’ve seen this in action. A scrub on my non-redundant media pool found an error but corrected it. Puzzled me for weeks until I figured it out. (Other faults on that non-redundant media pool required me to restore the file(s) from backups.)


One thing about cold storage is that you probably want to bring the storage devices back in for regular ZFS scrubs (and SMART tests). Whether that is every 3 months or yearly is your choice. But if you wait too long and too much bit rot has accumulated, you might lose data, even if it is a 2-way mirror.
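A possible routine when the drives come back in, sketched with a hypothetical pool name and device path:

zpool import coldpool          # attach the drives and import the cold-storage pool
zpool scrub coldpool           # force every block ZFS is using to be read and verified
zpool status -v coldpool       # watch progress and review any errors found
smartctl -t long /dev/sdX      # kick off a long SMART self-test on the underlying drive
zpool export coldpool          # once the scrub and tests are done, export before shelving the drive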

One thing a scrub can find is a block that is failing, but has not yet failed. In theory, ZFS itself won’t do anything. The storage device (disk or SSD) would apply its error detection and correction code against the failing block and, if the result is good, supply the block to ZFS and spare out the block. Sparing out means writing the corrected block to a new location and updating the translation lookup table so all new references for that block use the new location.

None of that happens on a cold (un-powered) storage device. Nor does the storage device actually verify all blocks just because it has power. That is where the strength of a ZFS scrub comes into play: it forces the storage device to read all the blocks that ZFS is using, allowing the device to find and potentially spare out failing blocks.

I say potentially spare out. A completely failed block that the error detection and correction code can’t recover, on a pool where ZFS has no redundancy, means the block is gone. Thus, backups are useful.

One side note: a totally bad block will stay on a SATA disk forever, even if there are spare blocks available. The mechanism to force that sparing is a write to that block. Then it will be spared out, with the newly written data safely stored. That is what ZFS does IF it has redundancy available to re-create the missing data block(s).

Sorry for the epic response. But others in the future may want some of these details.

3 Likes

Another option, if you are willing to accept the performance hit, is to partition the drive and create a RAIDZ pool from the partitions on the single disk. I came across this video from the Art of Server describing just this option when I was researching ZFS and BTRFS on Linux before I made the move to TrueNAS.
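Roughly like this, assuming the drive has already been split into three equal partitions (hypothetical device and partition names):

# Build a RAIDZ1 vdev out of the three partitions on the same physical disk
zpool create -o ashift=12 coldpool raidz1 /dev/sdX1 /dev/sdX2 /dev/sdX3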

2 Likes

Yes.

Got a link?

It’s at the very end of this page.

There was also a discussion on their GitHub in 2019.

They don’t do a good job of showing real-world examples, and it’s not even stated clearly (though it is implied) that the -c flag cannot be combined with other zfs recv flags.

They also do not make clear the relationship between the source stream (snapshot? delta?) and the destination (dataset? snapshot?)

They also do not make it clear if the absence of corruption will simply skip the receive, since there is “nothing to be done if there are no corrupted blocks.”

There’s a point where I space out, since developers and engineers tend to speak like… developers and engineers.

This is one of those things you need a spare machine to try some tests with.


EDIT:

Here is a theoretical example by alek-p to generate a minimal stream, which can be used to quickly repair corrupted blocks on the target.

# dumps spa err list that are part of this snapshot and the snapshot guid
zfs send -C data/fs@snap > /tmp/errlist 

# on replica system generates healing sendfile based on the errors list
zfs send -cc /tmp/errlist backup_data > /tmp/healing_sendfile

# heal our data with the minimal healing sendfile
zfs recv -c data/fs@snap < /tmp/healing_sendfile

1 Like

Okay, that’s pretty cool.

It’s basically a way to reverse the flow of a backup.

So you back up to your backup pool, and instead of pulling files from the backup to replace corruption, or replacing a dataset with an older version from the backup, you can just heal the blocks from the blocks in the backup.

The neat thing is that this corrective receive doesn’t affect the history of the pool, i.e. it doesn’t generate changes to the metadata.

You can imagine that TrueNAS could have an option to “repair from replication” which would run a replication in reverse when a scrub finds an issue…

Which is why this is mentioned in the GitHub commit:

The next logical extension for part two of this work is to provide a way for a corrupted pool to tell a backup system to generate a minimal send stream in such a way as to enable the corrupted pool to be healed with this generated send stream.

Then it becomes super interesting for the “repair from backup” feature, as only the corrupted blocks would be pulled from backup.

2 Likes

It still requires a back-and-forth dance, but as you said, it’s not outside the realm of feasibility for this to be streamlined in an appliance like TrueNAS.

As an example, creating a jail in TrueNAS Core (previously FreeNAS) is not a “one step process”. It “seems” like it because the user only needs to click a button.

In reality, a lot of stuff happens in the background, including fetching a release’s root image, creating ZFS snapshots and clones, and issuing FreeBSD jail commands.

2 Likes

Trying to play around with corrective receives, I can’t seem to figure out how to make the target snapshot generate an “errors” list. Without that, it has to receive the full stream from the “good” dataset’s snapshot, which is not as efficient as sending only the repair blocks for the specific corrupt blocks.

The above example of zfs send -C seems to be from a debugger or devtool of some sort. It’s not a standard zfs send flag.

1 Like

It’s full streams currently.

But in theory, you could start from the most recent snapshot that is before the block was born. And that is how snapshots are identified.

Hence the quote above.

1 Like

The trick would be being able to specify block ranges instead of snapshot ranges to generate the send stream.

Then just request the list of blocks and correctively receive what you get.

I can’t imagine it’s that far away from being implemented.

:crossed_fingers:

ZFS is already aware of the corrupt blocks (as seen by checking a pool’s verbose status with zpool status -v)

Yes, you can do that. And for your own server, your choice.

My old laptop from early 2014 had only 1 x 2.5" SATA drive bay and came with a slowish 320 GB HDD. I later replaced it with a 1 TB SSD, which I did partition: two smaller partitions for the OS mirror and the rest un-mirrored. This allows me to have some data redundancy for important things, like bootability and my home directory, but still have lots of storage for misc. things like movies when traveling. (My new laptop was selected because it has 2 internal devices, 2.5" SATA & M.2 NVMe.)

Keep in mind that TrueNAS has zero support for such a thing. One of the big gotchas with TrueNAS is that it does not know that removing 1 disk will impact another (or all). Even scheduling SMART tests would have to be checked to make sure you only run 1 at a time on such a single-disk, multi-partitioned configuration.

Basically, I am saying that if you do such a thing, some tasks will be manual. And while some of the apps and sharing with TrueNAS are simpler to manage in a GUI, it may be better to roll your own server, maybe based on FreeBSD or Linux (the latter of which does not have good ZFS support in major distros).

1 Like

I do this already with a ZFS 2-disk stripe for 36 TB. The drives are set to wake on use, but are normally powered down to a low-power state. I use SSDs as my main ZFS storage and the spindles as warm/cold replication for backup. The replication runs daily at 1 am and usually takes < 30 mins, depending on how many writes the main SSD storage has taken.
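One way to approximate that setup, with hypothetical device names, pool names, and snapshot names:

# Let the backup spindles drop to standby after ~20 minutes idle (240 * 5 s)
hdparm -S 240 /dev/sdX /dev/sdY

# Daily incremental replication (e.g. from cron at 01:00); the drives wake on access
zfs send -I tank/data@yesterday tank/data@today | zfs recv -F coldpool/data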

Same chassis…works a treat