ZFS Errors - My Imagination or Not?

Is it my imagination or are we seeing a LOT more problems with ZFS reporting Read, Write, Chksum errors?

I feel like I am seeing them every day, and some days multiple postings. Of course, they all scream “Drive Failure” before actually looking into it.

Is this due to TrueNAS gaining in popularity, people building sub-standard systems, or both? I could also see it being due to people just ignoring a problem until it becomes a BIG problem.

I just wanted to know if I’m going crazy. Have reports of ZFS Errors been going up, or is it that, now that I’m retired and can spend more time on the forum, I’m only just noticing them?

Cheers!

1 Like

I think it’s this.

When the userbase grows, everything else grows with it, including bug reports, errors, and complaints.

If not for Free/TrueNAS, ZFS for home users would not be as popular and widely used as it is today. Home users are more likely to incorporate shortcuts and subpar hardware.

I too feel that the ZFS error reports here in the forums are going up.

But, I think it is a combination of 3 things:

  • User error
  • Sub-standard hardware
  • And last, increased TrueNAS adoption

Some of the new reports lack so much detail that we only find out later that they used SMR disks. Or hardware RAID controllers in JBOD mode. Or non-server hardware pressed into service as a server, without adequate cooling.
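
A couple of quick checks can rule out those usual hardware suspects before blaming ZFS. This is only a rough sketch, and /dev/sda below is a hypothetical device name:

  # Host-managed/host-aware SMR disks show up as zoned block devices;
  # drive-managed SMR usually does not, so check the model as well.
  lsblk -o NAME,MODEL,ZONED

  # Print the drive model and firmware, for cross-checking against
  # the manufacturer's SMR/CMR listings.
  smartctl -i /dev/sda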

3 Likes

Perish the thought… have they dug too deep…?

Oh, we are also starting to see people run TrueNAS as a VM. But without the previously well-understood configuration we had for VMware, some solutions are not a good choice.

For example, Proxmox seems to be the one that generates the most questions and problems. Part of it seems to be that VMware did not support ZFS, whereas Proxmox DOES natively support ZFS. Thus the need to blacklist the TrueNAS devices so that Proxmox does not touch them.
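
For what it is worth, the usual way to keep the hypervisor’s hands off the pool disks is to pass the whole HBA through to the TrueNAS guest so the host never sees them. A minimal sketch, assuming IOMMU is enabled, with a hypothetical VM ID of 100 and an HBA at PCI address 01:00.0:

  # Find the HBA's PCI address on the Proxmox host.
  lspci | grep -i -E 'SAS|LSI|HBA'

  # Hand the whole controller to the TrueNAS VM (the VM ID and PCI
  # address here are placeholders), so Proxmox never imports the pool.
  qm set 100 -hostpci0 01:00.0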

I’d put this one under “User error”. People think virtualization is the way to go. But a clear understanding of both the hypervisor AND the client is REQUIRED for reliable operation when the client needs direct access to devices. (Storage devices, in TrueNAS’s case.)

3 Likes

iX has said in a bunch of blog posts that adoption rates and installations have never been higher. I think @Captain_Morgan has posted metrics somewhere.

So given that, I would concur with @Arwen, but I think it’s mostly the last bullet point, because user error and sub-standard hardware have always been a thing we’ve seen in the forums.

Anecdotally (IIRC), the last few ZFS Errors I’ve encountered/engaged on in the forums were on some really old TN 11.x systems that had drives with 60k+ spindle hours (and multiple drives with errors, to boot).
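
For anyone wanting to run the same sanity check on an ageing system, a quick loop over the disks shows hours and error counters at a glance. A rough sketch, assuming the /dev/sd? glob actually matches your data disks:

  # Report power-on hours and sector error counters for each disk.
  for d in /dev/sd?; do
    echo "== $d =="
    smartctl -A "$d" | grep -E 'Power_On_Hours|Reallocated_Sector|Current_Pending'
  done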

I haven’t seen any trends that have been alarming, @joeschmuck, if that’s why you are asking.

1 Like

Okay, just making sure it isn’t only me.

I was saddened by VMware and the ZFS thing, and by free ESXi no longer being viable for new users. I have yet to try Proxmox, but why mess with something that is working? Still, I may put together a third, smaller system just to give Proxmox a trial run, so I can maybe understand what someone is talking about.

Of course when I started this thread, I was not insinuating there was any fault of the TrueNAS software and I’m glad no one went down that path.

Hey, on the other side of the coin, I got my first ZFS Cksum error on my spinning-rust drive with 52172 hours on it. Of course it wasn’t the drive; it was self-induced, and I’m glad it popped up when it did, as it helped with a script I’m working on. The scrub passed, so I’ll clear the error in a few days.
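
For anyone following along at home, resetting the counters after a clean scrub is just the following; the pool name tank is a stand-in:

  # Confirm the scrub completed without unrepaired errors, then
  # reset the read/write/checksum counters on the pool.
  zpool status -v tank
  zpool clear tank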

So, dang, I blame it on the Super Moon on 15 November, the timing seems about right :crazy_face:

1 Like

Not checksums, but I feel we’re seeing quite a few issues with pools disconnecting and not importing again.

Agreed. But unfortunately we also see people doing things without understanding what they are doing.

I recall a day when this project was called FreeNAS and iXsystems said that the end user would need to be someone with some basic knowledge of FreeBSD and basic command-line tasks. Somehow that philosophy changed to “Anyone can do it”, likely after the version that shall not be named, when clearly there are people who would rather not read a User Guide and “just wing it”, or worse, follow the advice of an AI.

That’s such hyperbole. When has an AI assistant or ChatGPT ever suggested doing anything dangerous, such as rm -rf?

Never. Not once. I can’t find even one example.

All this fear mongering about AI… :roll_eyes:

2 Likes

This particular case is happening to a user who knows his way around the command line AND knows how to ask for support…

I must be one of the few select lucky unicorns. I haven’t really had any ZFS error issues besides having to replace one faulty drive in the last 11 years of using Free/TrueNAS.

Of course, I also don’t run big arrays of drives. Just 6 drives at most.

I’ve been following that thread. I just asked a few questions. This is a strange problem. So yes, it is not my imagination, there are problems afoot.

1 Like

Hmmmm…
In cases where we have knowledgeable users with these reports, do we have a sense of the ratio of Core to Scale occurrences? Of course, any such analysis would be skewed by the “global” Core to Scale install ratio…
I do notice that so many first reports don’t give the TrueNAS version. I always suspect those are from Scale users, Core users likely being aware of the potential for confusion.

100% SCALE as far as I know.

Maybe we should encourage Bug Reports?

2 Likes

And now we have a case with the boot-pool! :scream: Electric Eel bare metal.

This looks more and more like a system bug.

1 Like

And then there is this:

We have also seen a lot of OpenZFS development, and bugs have happened (e.g., the block cloning race condition).

Scale erased my L2ARC disks upon an upgrade.

So, this is what happened: I had over-provisioned L2ARC disks.

Because they were made on Core, they appeared as sdt2 and sdw2, not by gptid, in zpool status.

Upon a reboot, sdt and sdw were assigned to HDDs.

Ergo my pool degraded :-/

But where was the label?

Searching the L2ARC disks for labels I found nothing.

Anyway… I removed the two L2ARC disks and re-added them.
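
In case it helps anyone hitting the same thing, the remove/re-add dance looks roughly like this; the pool name and the by-partuuid path are stand-ins:

  # Drop the cache device that is now pointing at the wrong disk.
  zpool remove tank sdt2

  # Re-add it by a persistent identifier so a reboot cannot
  # shuffle it onto another disk.
  zpool add tank cache /dev/disk/by-partuuid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx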

This matches up with what others have been seeing; I just happened to luck out in that the affected members were ones that were not required.

1 Like

That makes me nervous about upgrading from Core to SCALE.

I always assumed the upgrade translated from BSD paths (GPTID) to Linux paths (PARTUUID) when importing an existing pool, without any user intervention.
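
If anyone wants to check what their pool actually ended up using after the move, this shows the full device paths; the pool name tank is again a stand-in:

  # Show full vdev paths; /dev/disk/by-partuuid entries would confirm
  # that translation, bare sdX names would not.
  zpool status -P tank

  # Map partition UUIDs back to the sdX names the kernel assigned.
  lsblk -o NAME,PARTUUID,SIZE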