Whatever issue is going on, it's affecting multiple drives. Try a different HBA or connect your disks directly to your motherboard.
Well, I’ve watched the Lawrence Systems stream regarding this feature, I just didn’t know it was going to take so long to do all this. And the rewriting part is the other issue.
I do have a good backup; I rsynced everything to my second TrueNAS backup system.
Let me ask you this… The hard part with starting over is all the time I spent configuring users and permissions on each dataset and share. If I saved my config (System → General Settings → Manage Configuration), then destroyed my pool and recreated it with all the new drives, is there a way to use that config to recreate all the shares, permissions, etc. on a pool with the same name?
It’s really weird, I literally had no issues before. I’ll try swapping the cables for new ones and see if that helps. Otherwise I may try to recreate my pool from scratch with the new disks, using multiple VDEVs instead of one big fat one.
I’m not experienced enough to answer, nor have I used rsync. I recently added another HBA with 7x18TB and used ZFS replication. Whichever options I chose, the users/permissions are present on the datasets… That wouldn’t help with the shares, which would have to be re-created, though. Maybe the config backup you mentioned could be used?
Those CRITICAL TARGET ERROR messages I’m getting… I assume I would have had those before, right? I mean, while I use the NAS and write files to it… Is there something special it is doing when resilvering that could cause these write errors?
I’ve decided to create a replication… And I have a question for you, if you don’t mind. When you save your system’s configuration, I presume all these configuration details come back if you restore the settings, correct? What if I reinstall TrueNAS completely and restore the config, will the replication setup reappear for me to restore? Do you know?
I’m sorry, I don’t know; I haven’t tried a restore from config. I would hope all the shares would be there, and if the target dataset isn’t reachable at the time of the restore, the shares would just fail or be disabled. Then do a full ZFS replication back and enable the shares? Again, I haven’t tried it. Sometime in the next week I’ll have a new server to test these things out on, as I too am trying to start fresh without losing my users and shares.
So far I’ve swapped my HBA expansion card for another one, so I know that is not what’s causing the errors… Next the cables, and then the HBA. If that still doesn’t fix the issue, then clearly this expansion feature is not working so well. In any case, like you said, this is taking a long time, and then I have to use a script to copy the files over so they pick up the new parity… I’m not sure this is such a great feature after all.
I’m thinking that starting that 2nd expansion before the 1st finished (because it is NOT clear that’s happening in the background) may have triggered those errors. Since you have a backup, I’d say either let the scrub finish or cancel it. Clear out those ZFS errors with zpool clear, and restart the expansion. Heck, try out that zpool replace while you’re at it… that way, when a drive does fail someday, you’ll have experience with it. Might as well use this as a learning experience / experiment, because who cares if it gets messed up if you’re just going to end up restoring from the backup file and rsyncing the datasets back. I do recommend testing your rsync data first. Test anything encrypted that you can lock/unlock, access files, etc.
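In case it helps, here's a rough sketch of the commands I mean; the pool name tank and the device names are placeholders, so adjust them to your own system before running anything:

# show scrub/expansion progress and per-device error counters
zpool status -v tank

# stop the running scrub if you don't want to wait for it
zpool scrub -s tank

# reset the accumulated read/write/checksum error counters
zpool clear tank

# practice a drive replacement: swap the old device for the new one
zpool replace tank sdX sdY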
It looks like we’re slightly out of sync between py-libzfs and libzfs regarding errnos. This will get fixed in 24.10.1.
EZFS_RAIDZ_EXPAND_IN_PROGRESS, /* a raidz is currently expanding */
This is the error code.
- the expansion is still in progress
- you’ve experienced write errors in the meantime
- this has possibly triggered a scrub
- this has paused the expansion…
- you have hardware errors being reported
This is what I find most concerning:
[ 2495.690213] mce: [Hardware Error]: Machine check events logged
[ 2806.975827] mce: [Hardware Error]: Machine check events logged
Anyway, I think the raidz expansion is a very heavy process and triggered the errors because of that. Perhaps your HBA is overheating, perhaps you have memory errors… perhaps… etc.
You need to figure out the cause of your errors… I don’t think it’s actually the RAIDZ expansion.
Start by checking smartctl -a /dev/sdX for each drive… I’m wondering if you are seeing UDMA errors there. That would indicate a power, HBA, or cabling issue.
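Roughly, the check could look like this from the shell; the device names below (sda, etc.) are placeholders and the exact attribute name varies by drive model:

# list the drives the system currently sees
lsblk -d -o NAME,SIZE,MODEL

# full SMART report for one drive; repeat for each pool member
smartctl -a /dev/sda

# quick filter for the UDMA/CRC counters that usually point at cabling or power problems
smartctl -a /dev/sda | grep -iE 'udma|crc'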
Thanks for the help… Any idea where I could check for that MCE / hardware issue? What does MCE stand for?
I believe it stands for Machine Check Exception.
And no, I’m not really sure where you can check for it.
Kind of off topic from the rest of the thread, but to clarify, you don’t need to run the script. The old data retains the previous parity ratio until it is accessed and rewritten. The script is one way to rewrite the existing data to the new parity ratio if you want to do it all at once. Replicating the existing data and then moving it back into the pool is another option. So is just continuing to use the data and letting the parity ratio work itself out over time. It all depends on your preference and how often the data is accessed/modified.
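For anyone curious, the core idea behind those rewrite scripts is just copy-and-replace, so the data gets written again under the new data/parity ratio. A minimal sketch with a placeholder path, leaving out the safety checks (checksum verification, preserving attributes, handling snapshots and hardlinks) that the real scripts add:

# copy the file so its blocks are freshly written at the pool's new parity ratio
cp -a "/mnt/tank/data/file.bin" "/mnt/tank/data/file.bin.rebalance"

# then swap the fresh copy into place of the original
mv "/mnt/tank/data/file.bin.rebalance" "/mnt/tank/data/file.bin"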
Thanks for clarifying that and certainly good to know for those who will be doing this.
Sorry I’m late, but please check your power supply (PSU) and whether it can provide enough power to the drives and the rest of the system. A Machine Check Exception is not normally caused by faulty SATA/SAS cables; it’s something more central to the system (PCI bus / CPU problems). A faulty HBA could also be the issue.
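On the earlier question of where to check: one rough way to dig into those MCE lines from the shell (assuming a Linux-based TrueNAS SCALE install, and that the rasdaemon tooling is available, which it may not be) would be something like:

# pull any machine-check messages out of the kernel log
dmesg | grep -i mce

# if rasdaemon is installed, it keeps a decoded history of logged hardware error events
ras-mc-ctl --errors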
(I also want to mention that I’m quite happy with raidz expansion, despite its downsides. I expanded from 5 to 6 drives, and afterwards even rewrote a lot of my data to optimise the stripe size. It’s still nice to be able to do it online and not have to start over.)