Pool unhealthy, all disks and cables working fine

rdcustom · June 13, 2024, 8:52am

Hello all.

I’m running TrueNAS Core with multiple pools.
One of them is 8x 4 TB WD Red, all connected to a PCI-to-SATA board.

Pool status is ONLINE (Unhealthy), I already changed all SATA cables and checked all connections.
Six disks are EFRX (CMR), two are EFAX (SMR).

These two SMR drives didn’t pass the S.M.A.R.T. test (extended offline test failed).

I know that SMR drives are not the best for TrueNAS, but can this be the issue?

Is there a way to solve this without changing the drives?
Thanks

Farout · June 13, 2024, 9:33am

Could be this. What kind of card is it?

rdcustom · June 13, 2024, 10:23am

Okay, sounds stupid but… I found the issue.

I ran zpool status -v and found an error on one file in the pool.
I tried to copy it to my Mac, and it was in fact unreadable.

I deleted it, restarted the server, and now the pool is OK.

I think TrueNAS should be more “verbose” about these things, instead of leaving you to search through cables, boards, shell, SMART, scrubs etc…
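
For anyone landing here later, the check described above can be automated. This is only a minimal sketch: it assumes the usual wording of the "Permanent errors" section printed by zpool status -v, and the pool name and file path in the sample are made up for illustration.

```python
# Sketch: pull the list of damaged files out of `zpool status -v` output.
# SAMPLE is invented for illustration; the pool name and path are hypothetical.
MARKER = "Permanent errors have been detected in the following files:"

SAMPLE = """\
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/movie.mkv
"""


def files_with_permanent_errors(status_text):
    """Return every path listed under a 'Permanent errors' heading."""
    paths = []
    in_list = False      # True while we are inside an error list
    seen_entry = False   # True once the current list has at least one path
    for line in status_text.splitlines():
        if MARKER in line:
            in_list, seen_entry = True, False
            continue
        if not in_list:
            continue
        entry = line.strip()
        if entry:
            paths.append(entry)
            seen_entry = True
        elif seen_entry:
            # A blank line after the entries closes this pool's list.
            in_list = False
    return paths


if __name__ == "__main__":
    for path in files_with_permanent_errors(SAMPLE):
        print("damaged:", path)
```

On a live system the text could come from subprocess.run(["zpool", "status", "-v"], capture_output=True, text=True).stdout instead of the sample string.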

winnielinnie · June 13, 2024, 4:33pm

A pool comprised of a vdev with 8 drives… and it still could not repair a single corrupted file?

rdcustom · June 13, 2024, 4:48pm

IDK, I ran many scrubs but the error never went away.
I tried ZPOOL STATUS today (shame on me for not trying it before) and found the issue.
Sounds strange, but now it has been solved, and this post may be useful in the future.

winnielinnie · June 13, 2024, 5:06pm

The concern is that one of the main reasons for using ZFS is to protect against this exact type of thing.

What if that was an irreplaceable file with sentimental value? (Leaving out the topic of backups for the time being.)

Stux · June 13, 2024, 10:33pm

Could be a stripe. OP doesn’t say.

rdcustom · June 14, 2024, 7:17am

I know, this time the file was just an mkv movie, but next time it could be something important.
My concern is that I cannot investigate what caused the corruption, and there wasn’t any notification or message about what caused the pool to show as “unhealthy”.

After deleting the file, I ran a long SMART test on all disks and a scrub on the pool; everything is 100% fine now.
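
The follow-up checks mentioned above could be scripted roughly like this. It is a dry-run sketch that only builds and prints the commands; the pool name and device paths are placeholders, not the actual ones from this system.

```python
# Sketch: build the commands for a long SMART self-test on each disk,
# a scrub, and a final status check. Nothing is executed here.
import shlex


def build_check_commands(pool, disks):
    """Return the command lines as argument lists, in the order to run them."""
    commands = [["smartctl", "-t", "long", disk] for disk in disks]
    commands.append(["zpool", "scrub", pool])
    commands.append(["zpool", "status", "-v", pool])
    return commands


if __name__ == "__main__":
    # "tank" and /dev/adaN are hypothetical names used for illustration.
    disks = ["/dev/ada%d" % n for n in range(8)]
    for command in build_check_commands("tank", disks):
        print(shlex.join(command))
```

Both the long self-test and the scrub only get started by these commands and then run in the background for hours, so the final status check is only meaningful once they have finished.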

etorix · June 14, 2024, 7:42am

Could be a SATA port multiplier on this card. OP doesn’t say either.
OP doesn’t share error messages.

My wet little finger says that OP does not really want help with his issues. And that the new forum needs a big red “Forum Rules” link on top of each page, with instructions for reporting issues and seeking help.

Fleshmauler · June 14, 2024, 7:49am

…Ouch. The reason TrueNAS isn’t more “verbose” is that quite a few things have to go wrong on ZFS for this to happen. Now it is a guessing game: is it the SMR drives, the pool configuration, or the unspecified “PCI to Sata board” (which I’m going to guess is a port multiplier, not an HBA, since you said “SATA cables” instead of SAS to SATA)?

I’ve seen a few posts of this nature & they either turn into someone realizing that their fundamentals were flawed & need to follow best practices instead, or it devolves into “but it worked for over 6 consecutive weeks without issues, it MUST be Truenas at fault! Why shouldn’t x,y,z just work like it did before it didn’t?!”

rdcustom · June 14, 2024, 11:00am

Let’s clear something up here:

This is a home server, so I don’t need fancy hardware. The PCI board is an ASM1064.
OP didn’t post the error because I didn’t take a screenshot; I don’t think I should be put on a cross like Jesus, right?

Maybe the controller caused the corruption, IDK.

As I said before, I thought it was an SMR drive issue, but the pool status reported “0 errors”.

I am not a network or Linux engineer, so I must do what I can, how I can. The fact that a corrupted file affected the pool status sounds strange to me, maybe not to you, but honestly I read the old and the new forum before writing and never saw a similar problem before.

Anyway, if this forum is just for those who have great skills, the forum itself is quite useless, isn’t it?

oxyde · June 14, 2024, 11:39am

Just to share my personal experience.
Same as you, I’m talking about a home server, nothing mission critical… but still important to me.
I used a cheap SATA port multiplier because my old mainboard lacked SATA ports. All I needed was one more SATA port, so I tried it, even though it wasn’t recommended.
And yes, it seemed to run fine with a stripe boot pool and a stripe app pool… The moment I attached a “storage” spinning drive, it corrupted a file… no warning from pool status or anything else, but I admit I didn’t investigate further → I just deleted the file, which luckily wasn’t important (an image backup of an unused PC), and ASAP swapped the mainboard for one that provides the SATA ports I need (6 instead of 4; this upgrade was cheaper than buying an HBA and brought more advantages anyway).
So… (sorry for the long post) if you care about your data, of course you don’t need fancy hardware, but you do need to watch out for hardware that is bad for your purpose.

rdcustom · June 14, 2024, 12:04pm

Thanks, I’m in fact aware of this and will soon find another solution.
The hardware-side problem is that I have a B550M MoBo with just 4 SATA ports in a Node 804.

Luckily the 8-disk pool runs entirely through the PCI board and hosts my Plex library.
Important files are on another pool connected directly to the motherboard.

essinghigh · June 14, 2024, 12:17pm

While I absolutely agree with you (check my dropdown), there’s a difference between fancy server hardware and bad hardware. I use consumer-grade hardware I’ve personally validated to be reliable; SATA port multipliers and the like fall firmly into the unreliable category.

Port multipliers cause problems, SMR drives cause problems. Both have already been discussed above so I’m not going to go into it.

More specifically, regarding how ZFS did not manage to recover this: what is your pool configuration? I don’t see it mentioned anywhere. RAIDZ? Mirror? Stripe?

rdcustom · June 14, 2024, 12:36pm

The affected pool is RAIDZ; technically the file should have been recovered…
Now I’m looking for a good PCI-to-SATA card, but I can’t find any in x1.

The only x16 slot is occupied by an ASUS Hyper M.2 with 3x NVMe; this is the major issue with MATX motherboards.

etorix · June 14, 2024, 12:52pm

The right solution is an HBA. LSI 2008, 2308, 3008. Nothing fancy, nothing overly expensive ($50-100 refurbished). But it takes an x8 physical slot, or at least an open PCIe slot.

At worst, you might use an adapter to get 4 lanes from an M.2 slot.
If even that is not possible, this particular B550M board with 4 SATA was not a good choice for a Node 804, which really demands at least 8 SATA ports.

winnielinnie · June 14, 2024, 1:49pm

That’s one of the main reasons to use ZFS. It’s concerning that ZFS (RAIDZ) couldn’t recover a single corrupted file?

Something doesn’t add up.
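
To illustrate the point with a toy model: with single parity, one bad block per stripe can be rebuilt from the others, but not two. This is a deliberate simplification using plain XOR parity, not the real RAIDZ on-disk layout, and the thread does not say which RAIDZ level the pool uses.

```python
# Toy model of single-parity reconstruction (plain XOR).
# Not the actual RAIDZ implementation; block contents are made up.
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, parity, bad_index):
    """Rebuild the block at bad_index from the other blocks plus parity."""
    survivors = [block for i, block in enumerate(stripe) if i != bad_index]
    return xor_blocks(survivors + [parity])


data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Case 1: only block 1 is bad. The survivors are intact, so it comes back.
one_bad = rebuild(data, parity, 1)

# Case 2: block 0 is silently damaged as well. The rebuilt block 1 is wrong,
# which is the kind of result a checksum mismatch would flag as unrecoverable.
damaged = [b"AAAX", b"????", b"CCCC"]
two_bad = rebuild(damaged, parity, 1)

if __name__ == "__main__":
    print("one bad block rebuilt correctly: ", one_bad == b"BBBB")
    print("two bad blocks rebuilt correctly:", two_bad == b"BBBB")
```

In this model a permanent error on a single-parity vdev would mean more than one copy of the data in the same stripe was bad at once, which is one reason a shared controller or port multiplier draws suspicion.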

rdcustom · June 14, 2024, 2:20pm

Open PCIe slots are not so easy to find on MATX; I looked at all B550 MATX boards, and nothing is available with more than 6 SATA ports.
Unfortunately, with the network and Hyper M.2 cards installed, I had just one x1 slot left; that’s why I chose that crappy SATA board (even if it seems to be working fine).
I’m not sure that the board is the culprit, and I think I never will be until a new error appears.

rdcustom · June 14, 2024, 2:21pm

Yes, exactly. That’s the strange point to me.

etorix · June 14, 2024, 3:35pm

Chipset limitation. You need to look at X570 to get 8 SATA, or Intel C2x6 chipsets.
With many NVMe lanes, the micro-ATX solution could even be a Supermicro X11SPM board (12 SATA, and onboard 10 GbE for the -TF or -TPF models).

Ah, there is a network card as well…
Not knowing more, my best suggestion is to use an x8x4x4 riser to get an HBA in there alongside two M.2 drives. The best use for the x1 is for boot, if you boot from NVMe to save SATA ports.