Eel pool extend issue

Hey, back with another issue… so I've been playing with EE since the beta came out and finally was able to upgrade and extend my main pool. I had issues at first because the extend just stopped at 25% and said I needed to scrub/resilver or something, but I wasn't sure what it wanted from me, so I let it go for a few days to see if anything would happen on its own.

After nothing happened, I decided to scrub the whole pool to see if that was the command it was waiting for, and it seems it was. I had to wait about another day for the scrub, and then another day or two for the extend to complete…

So all in all it was a very long and boring process for me, but now my problem is that I don't see the added storage from the extra drive I added to the pool. It still shows 7 TB maximum for a 4-wide RAIDZ1 pool of 4 TB drives.

I just let it be after that and rebooted a couple of times over a few days to see if it was just being buggy, but it's still only showing the 7 TB maximum size. And now I obviously don't have the “assign disk” option anymore, because it apparently already added the drive… so I'm stuck unless you guys know a fix, or I'm out of luck until one comes out.

Some screenshots for reference, I suppose…


To quote Sir Humphrey, it seems extremely “courageous” to try such brand-new functionality in EE, especially since you appear not to have implemented standard data protection recommendations like regular scrubs etc.

If a reboot doesn't make the extra storage appear, I suspect your only solution is to move the data elsewhere, destroy the pool and recreate it, and then move the data back again.
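(If you go that route, ZFS replication is the usual way to move the data. A minimal sketch, with hypothetical pool/dataset names - adjust to your own layout:)

# snapshot everything recursively, then replicate it to another pool
sudo zfs snapshot -r source_pool@migrate
sudo zfs send -R source_pool@migrate | sudo zfs recv -F backup_pool/migrated

# after destroying and recreating the pool, send the data back the same way in reverse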

(ZFS has all sorts of self-correction mechanisms to ensure that it stays uncorrupted, but that does mean that if it ever does become corrupted, there is no fsck utility to attempt to fix it.)

TBH, you should probably be thanking your lucky stars that the pool is still online and not so corrupted that it won’t mount and your data is gone.

I do have regular scrubs, though…


…they used to run weekly, but I recently changed them to monthly. I also have snapshots.

But damn. Yeah, the stuff I had in this pool isn't really important at all, and I did take precautions: I backed up the most important data to a drive on my client PC and to a different pool that I haven't tried to upgrade or anything…

So I did a decent job of being somewhat careful before going ahead with the upgrade.

But alright, I'm probably just going to back up the data, kill the pool, and create a new one (glad, actually, because I want a shorter pool name)…

Thanks for the advice!

Well, that will teach me not to be a clever-dick and make assumptions. Apologies are in order - sorry!

It sounds like you took all reasonable precautions knowing it was a bit risky, and if it had worked it would have been a good decision, but unfortunately it didn’t pan out.


I'd like to see whether the output of zpool status TrueNas_Main_Storage shows the 5th drive. This could simply be a GUI bug that has yet to be squashed - possibly even an unknown one!

So @techdan91, if you have not destroyed the pool, please (can I beg?) give us the output of zpool status in code tags.
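(I.e. run something like the following from a shell on the NAS and paste the result - the -v variant adds verbose error detail:)

sudo zpool status TrueNas_Main_Storage
sudo zpool status -v TrueNas_Main_Storage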


Alright, so my system is probably bugging out over what to do - I've done too much to it and hope I can get it stable. But the command you asked me to run showed some interesting info…

I did in fact try exporting/disconnecting the pool the other day, but it doesn't seem to have deleted it or anything (though I do get those popup notifications saying there was an error in the path for etc/TrueNas blah blah blah)…

So I tried your command, and it shows that all 4 disks are there (before the expansion there were only 3, obviously - the issue isn't the disks showing, it's that the capacity shown is one disk too low in TB). But it also shows that the expansion is still in progress and will take another 7 days!!! At least that's how it reads to me. So maybe I'll just let it sit for another week, see if it “finishes”, and then see if the capacity pops up to ~10 TB. Thanks for the advice on the command to check the pool!
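(The line in question looks roughly like the schematic below - placeholders rather than my exact figures:)

expand: expansion of raidz1-0 in progress since <date>
        <copied> / <total> copied at <speed>, <percent>% done, 7 days hh:mm:ss to go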


Sounds like the UI needs to be improved…

It needs to recognize the expansion process, much like it recognizes a scrub/resilver process, and display the ETA.


Sorry, I got confused - I thought it was a 4-disk pool expanding to 5… Oh well, glad of the correction and the complete information.

The original post indicated this type of disk:
(3) 4TB WD Blue HDD, (1) WD Red (RAIDZ1) (Misc Storage/SMB)
The 4TB WD Blue HDDs are likely SMR, so SLOW behavior is EXPECTED. Taking 7 more days sounds about right.

One comment about OpenZFS RAID-Zx vdev expansion: if I understand it correctly, it does not verify the checksums or recalculate the parity of the data it reads off disk; this is to improve speed. A ZFS scrub therefore needs to be performed afterwards to catch any pre-existing data faults. So count on another week or two of work to be done.
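(Once the expansion finishes, kicking that scrub off from the shell is a one-liner - pool name as an example:)

sudo zpool scrub TrueNas_Main_Storage     # start the post-expansion scrub
sudo zpool status TrueNas_Main_Storage    # the scan: line shows progress and an ETA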

AFAICR the Jobs menu item showed the expansion process.

For me, this process took 12 hours at approximately 150 MB/s when adding an 8TB drive. No problems at all.

Actually, depending on the exact model, WD Reds can also be SMR (some models are CMR, some SMR). WD explicitly states in the WD Red specification on their website that plain WD Reds are unsuitable for ZFS and that you need WD Red Plus or Pro for ZFS.

Hmm, I wanted to see that for myself, so I went to Western Digital's website and tried to pull up the WD Reds. I got a 404 Page Not Found. The WD Red Plus and Pro pages were fine; it was just the plain Reds.

Seems they have a web page problem, which I won't bother reporting. Or perhaps it's a prelude to removing the Reds, because they still get (costly) returns due to SMR.


Yaaaay!!! A month after starting to expand the pool, it finally finished today! My 4th 4TB HDD is finally attached to my pool and I have almost 3 TB more free space.

…I don't know why it took so long, but over the past week or two I had to keep clearing the errors for the pool in the shell, because it would frequently pause and require a resilver or a clear to continue. Very annoying…
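(“Clearing the errors” here means running something like the following, which resets the pool's error counters so the paused expansion/resilver can continue:)

sudo zpool clear TrueNas_Main_Storage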

But I thought I'd post the outcome and some screenshots.

…before expansion

…after expansion

As previously stated, the lengthy time is likely due to using SMR drives.


Even the errors can be traced back to the SMR drives, which under extreme write loads can take a very long time to respond. ZFS may then think the drive is failing.

This effect can also be caused by desktop drives, like the WD Blue line, where the drive has either of these 2 features enabled and not tuned for NAS work (a sketch of how to check and adjust both from the shell follows the list):

  1. Aggressive head parking, e.g. at 5-second intervals. The drive then takes too long to un-park, causing ZFS to think it needs to retry the prior command. Thus, an error.

  2. TLER, Time Limited Error Recovery (Seagate has another name for it, but it's the same concept). Desktop drives, when they find a bad block, will spend a long time (like 1 minute by default on most desktop drives) retrying the read and applying ECC to it. Since NASes usually have redundancy at a higher level, a NAS (like one running ZFS) may declare an error on the drive while it is still retrying. Limiting recovery to 7 seconds allows ZFS to apply the pool-level redundancy and move on more quickly. Still an error, but a more “normal” one for ZFS.
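A rough sketch of inspecting and adjusting both behaviours (the device name is just an example; support varies by drive model, and SCT ERC settings may not survive a power cycle):

# 1. Head parking: read the current APM level, then set a less aggressive one if supported
sudo hdparm -B /dev/sda
sudo hdparm -B 254 /dev/sda

# 2. Error recovery: read the SCT ERC limits, then cap them at 7 seconds (values are tenths of a second)
sudo smartctl -l scterc /dev/sda
sudo smartctl -l scterc,70,70 /dev/sda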

PS - the WD Red page is still failing 🙂

Had this issue adding a new disk to the pool:
I have 3 16TB disks and added one 18TB disk.

The 3-disk pool was 1/3 full, with 10TB of data on it,
and the attach was stuck at 25%.
Finally this command explained what is happening:
sudo zpool status

expand: expansion of raidz1-0 in progress since Mon Nov 18 12:56:20 2024
318G / 15.8T copied at 51.1M/s, 1.97% done, 3 days 16:16:16 to go

In fact it is rewriting ~16TB of data across the four disks while the pool is actively in use.
At 51.1 MB/s the process will take about 4 days.
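(The math checks out: roughly 15.8T − 318G ≈ 15.5 TiB left to copy, and 15.5 TiB ÷ 51.1 MiB/s ≈ 318,000 seconds ≈ 3.7 days, which matches the “3 days 16:16:16” estimate above.)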

This information should be displayed on the main page in an understandable way,
because people will not understand why the process is stuck at “pool attach 25%” for 4 days.
They will reboot the NAS and try all sorts of things, interrupting the long process.

Best regards 🙂


Yes, having more complete information would help prevent some people from having to ask questions. RAID-Zx expansion likely needs some GUI work; it's still a new feature. Perhaps report it via “Report a Bug” at the top of the forum pages.

My earlier responses included a comment about the WD Red web page issues. Checking again today, I find no reference to WD Red under NAS drives: there are WD Red Plus and Pro, but no plain WD Reds. Nor does a Google or WD web page search find the WD Red HDDs (only the NVMe & SATA SSDs). Weird.

I am personally of the opinion that 7 seconds is way too long for a redundant array - if one drive gives up after 1 second, the redundancy will kick in. Consequently, I have my TLER set to 1 second by adding “-l scterc,10,10” to the SMART extra options for each drive in Storage / Disks. I have yet to see this cause any errors - in 16 months of production use, my HDD pool has yet to have a single error 🤞

I have already raised NAS-132559 (Improved Storage and Dataset available-space stats especially after RAIDZ expansion) as a suggestion for better reporting of pool / dataset available-space stats.

But I agree that better reporting of pool-expansion progress in the UI would be good. I am not sure how good the reporting of resilvering in the UI already is, but expansion-progress reporting needs to be at least as good (because it runs for longer).


Yes, it should be treated more like a “resilver” or disk replace.

I.e. the expansion-progress toast goes away once it's initiated, but an “expansion progress” bar is displayed wherever a resilver/scrub bar is normally displayed (there are a few such places).

The current system only makes sense for instant expansions, which only occur on dummy test systems.

Ah - the old “performance was great when we tested it with only a smattering of data” result.

I was once working for an outsourcer and refused not only to take on a new system, but also to let them connect the new system to an existing system we were responsible for, unless they gave us a performance waiver. Which was just as well, because they had tested it on a tiny database, and the call-centre agents dealing with customers saw their CRM response times go from 0.5 seconds to 30+ seconds - with the customer waiting on the line - because the new system was doing full database scans and blocking the old system from accessing the data. And why? Because they had tested it on a tiny database where full scans took a fraction of a second, but with a full-sized database they needed some extra indexes.
