That would explain why the resilver has stopped.
Yes. It’s perhaps a little unfair to suggest TrueNAS SCALE is unstable simply because you are seeing multiple drive failures. These drives would have failed on any other ZFS system no doubt.
The question is: how old are the drives? When did the first one fail, and why was it not replaced? Resilvers, and I dare say expansion, put considerable stress on drives, so if they are already creaking it can be enough to send them over the edge.
There is a difference with ZFS compared to EXT2/3/4. If there is bit-rot in a file on EXT2/3/4 (but not in a directory entry), you probably won’t know about it unless you notice the corruption when using the file. So some people rant about ZFS reporting errors when they “never” had errors with EXT2/3/4. It may just be that they never knew about the errors. ZFS will report any corruption on read and on scrub.
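If you want to see what that reporting looks like in practice, here is a minimal sketch (the pool name “tank” and the Python wrapper are just for illustration, not anything TrueNAS ships): it kicks off a scrub and then prints the per-device error counters ZFS keeps.

    import subprocess

    POOL = "tank"  # placeholder pool name, substitute your own

    # A scrub re-reads every allocated block and verifies it against its checksum,
    # so silent corruption gets reported (and repaired, if redundancy allows).
    subprocess.run(["zpool", "scrub", POOL], check=True)

    # Per-device read/write/checksum counters, plus any files ZFS could not repair.
    result = subprocess.run(
        ["zpool", "status", "-v", POOL],
        capture_output=True, text=True, check=True,
    )
    print(result.stdout)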
In general, TrueNAS SCALE with ZFS is stable. But in some cases, such as SMR disks, USB-attached data disks (not the boot disk), or consumer/home hardware without ECC memory, problems can occur.
We have such a wide variety of use cases and setups that, from an outside perspective, it can be hard to nail down the causes of problems.
I never thought of this, but when you add a drive to, say, a raidz1 via expansion, and the process is not yet finished, does that count as the one disk the vdev is allowed to lose?
I’m not very familiar with expansion but my guess would be until the expansion is complete the original pool config applies.
Do you have your apps on a 17-wide raidz2 of HDDs?
No. With raidz2 you’re at risk after losing two drives. Any further issue during the (long) resilver can be fatal.
It generally is, but you have been pushing your luck too far.
It goes without saying, but IF you can still access the data on this pool, then I suggest you back up any important data asap.
OK, let me clear that up:

- It wasn’t me who chose to put ix-applications on the RAIDZ2 pool; the system did it when the apps were installed, even though there was enough space on the system disk holding SCALE (only about a third of the boot disk is used, btw). I don’t know why SCALE didn’t put it there instead of choosing the RAID pool. And shouldn’t RAIDZ2 be strong enough not to break?
If the system trusts it, why shouldn’t I?

- I lost one disk because it really died on me; that’s the first ‘removed’ disk, the one being replaced.

- When that disk died, an expansion of the pool was just beginning, and the system stopped it so I could replace the dead disk. That’s the second ‘removed’ disk, so technically only one disk is lost for now.

- The ‘faulted’ disk is a new one that was only installed last month, and for which the expansion resilvering took a long time.

- While that resilvering ran, I saw disks stopped and ‘removed’ numerous times, with lots of wannabe errors reported against them by the pool, only for them to be brought back later and finish the resilver fine without any errors.
So I don’t see why it would be different this time…

As for the ext4 disks, I still haven’t lost a single file with them in ten years (and since I often copy and use these files from disk to disk, across other machines and the NAS, I can say they are fine, with no rot).

ZFS was said to be the strongest and safest filesystem in the universe; that’s why I chose TrueNAS SCALE for my needs (and for the added possibility of growing the pool when I have the money, without having to back up the whole beast and rebuild it each time). I don’t have much money, so it was the right solution for me, at least at the start.

I need a monstrously sized RAID array for my work, as some of the files I use can be very big and are better kept in one chunk than scattered in a multitude of small pieces across lots of devices, so they can be used without glitches…

I would like to know whether dRAID is possible with SCALE at all, as asked before…

So, since I just can’t back up 120 TB of data (I don’t have the space for that), I think I will shut down this NAS completely for now and look for another RAID solution to keep my files safe.
Well, looking at the bright side of things, your long line of less-than-ideal configuration choices, intentional or not, now means that you likely no longer have 120 TB of data to worry about. So that’s one weight off your chest, whew!
SCALE can use dRAID. That doesn’t mean it’s a good fit for you; I would say it’s more of a specialist option.
Thanks for the dRAID info, but I don’t recall being given that choice at install time.
Now, at 500 euros for an enterprise or NAS disk, I can only do what I can with my modest means ;p
BTW, I was wondering why so many people gossip about Unraid and hardly ever about TrueNAS, even though Unraid is paid software… Perhaps my biggest error was choosing FreeNAS in the first place? ;p
In no way is that expected or OK. Could be heat on that LSI, could be RAM since the system doesn’t have ECC; hard to say, but “wannabe errors” during a resilver point to a serious underlying problem.
For ZFS, planning ahead helps a great deal. I don’t think dRAID is right for you, but multiple vdevs would have been, as would ECC memory (just for peace of mind) and investigating any errors that crop up when they shouldn’t.
Typically with this many disks you’d see 8x10 TB, or maybe at most 10x10 TB, per vdev. Of course you can have multiple vdevs in the same pool; this doesn’t mean you split your data, it just means you design in a way that is friendly to resilvering. Strain on the disks and the risk of failure go up the wider the vdev is.
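Purely as a sketch (pool name and device names are placeholders; in real life you’d use /dev/disk/by-id paths), a single pool built from two 9-wide raidz2 vdevs would be created like this. ZFS stripes across both vdevs on its own, so you still see one pool.

    import subprocess

    # Two 9-wide raidz2 vdevs in one pool, instead of a single very wide vdev.
    vdev_a = [f"/dev/sd{c}" for c in "abcdefghi"]  # placeholder device names
    vdev_b = [f"/dev/sd{c}" for c in "jklmnopqr"]

    subprocess.run(
        ["zpool", "create", "tank", "raidz2", *vdev_a, "raidz2", *vdev_b],
        check=True,
    )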
Unraid is going to be much easier to administer. I don’t think it can save you from whatever underlying issue is causing the “wannabe errors” in this system, but it certainly needs a lot less planning.
Your apps on TrueNAS, btw, are best placed on an appropriately-sized mirror SSD pool. Not the boot drive. A separate pool.
I would have been happy to use ECC memory (I have some in stock), but I never got this box to boot with it installed, whatever the OS, even though that same RAM worked in other machines. I don’t know why, since the mobo is said to make good use of ECC memory if available, as are the OS and TrueNAS… The i7 in this machine should handle it too…
Unraid is not for me, as I support Free Software above all.
The only software I ever bought for Linux (a one-time purchase) is my pro scanner application: I can feed it any scanner I want, even very old SCSI or parallel-port ones, or very uncommon ones, and it just works. I had less luck with SANE… Since I have a lot of ‘safepu’ hardware in my collection, it’s perfect for using or testing it!
Not sure where you’re getting that from? What I can find about the Maximus IV gene-z is that it doesn’t support ECC. The i7-2600 as well does not support ECC, see https://www.intel.com/content/www/us/en/products/sku/52213/intel-core-i72600-processor-8m-cache-up-to-3-80-ghz/specifications.html
ECC in Intel land is Xeon, some i3 and some Pentium, depending on generation. Xeon and workstation chipsets only. With recent generations more CPUs support ECC (i5, i7, i9), but still require the workstation chipsets.
In AMD land the Ryzen desktop CPUs support ECC, but there’s only one vendor who consistently implements support: AsRock.
For ECC to work, the CPU has to support it, the chipset has to support it, the motherboard vendor has to put the traces down, and the motherboard vendor has to support it in BIOS/UEFI.
ECC is not a given; you have to build for it deliberately.
I use it in my main TrueNAS and our PCs; I don’t have it on the backup TrueNAS.
Ruling out RAM failure without ECC is alas a pain. Boot from memtest86+ (the FOSS one) and run for 5 days in continuous loop. That doesn’t rule out memory failure but makes it very very unlikely. A few runs, or even a day without errors, is not conclusive.
For my backup TrueNAS I did a 14 day memory test to be reasonably sure the memory was good at least at the point of testing.
All that ECC does is remove the guesswork and lengthy troubleshooting. You’ll know when it goes bad.
For drives, I use the HDD burn-in / badblocks script that’s floating around these forums, run from an Ubuntu Live USB, before placing a drive into a pool. I haven’t found any bad drives this way, but others have found bad drives after placing them into a pool, when they didn’t test ahead of time. Ouch. It adds a little bit of time but it’s not bad: the last test I did was for an 8 TB HDD and took about 4 days.
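I won’t reproduce the forum script here, but as a rough sketch of the idea (destructive! “/dev/sdX” is a placeholder; point it only at the drive under test, never at one holding data):

    import subprocess

    DRIVE = "/dev/sdX"  # placeholder; the -w test below ERASES the whole drive

    # Four-pattern destructive write/read pass over the entire drive.
    subprocess.run(["badblocks", "-b", "4096", "-wsv", DRIVE], check=True)

    # Check SMART attributes afterwards for reallocated or pending sectors.
    # smartctl uses a bitmask exit status, so don't treat non-zero as fatal.
    subprocess.run(["smartctl", "-a", DRIVE])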
Apps are about small transactions, for which raidz# is highly inefficient and mirrors are recommended. iXsystems further recommends to put apps on SSDs rather than HDD.
TrueNAS does automatically move the system dataset from the boot pool to the first created pool, which might not be optimal if one begins with the “bulk storage” pool, but it is always possible to move the system dataset through the GUI. As for the apps/instances/VM pool, as far as I know one has to declare it; TrueNAS does not choose, and in any case it can be moved.
Sh!t happens. That’s what resiliency aims to address.
As soon as expansion was initiated, the vdev had an extra member taking part in the redundancy scheme. Losing this disk was a second drive incident and a second lost drive.
Long expansion might be a consequence of excessive width.
Had you burnt the drive in before adding it?
This was not normal and should have been investigated…
Possibly, when used according to recommended practices.
ZFS was designed for enterprise use, and as most professional tools it can be a very efficient way to shoot oneself in the foot when used without proper care.
Yes, through the command line. Note that dRAID cannot be extended and relies on having sufficient spares. And “beyond an effective width of 40, dRAID3 gets pretty wacky” (read all posts by @jro in the thread).
If you haven’t shut down the NAS yet and can still access the pool, I’d strongly suggest looking for extra storage and backing up anything critical, as there’s no guarantee this pool will ever mount back. (Yes, it’s not my money here, but it’s also not my data…)
You should be able to configure dRAID through the web UI now as well.
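Purely to show the shape of the CLI command (placeholder pool and device names, and not a recommendation for this particular system), a dRAID vdev is described by a spec string giving the parity level, data disks per redundancy group, distributed spares, and total children, along these lines:

    import subprocess

    disks = [f"/dev/sd{c}" for c in "abcdefghijk"]  # 11 placeholder disks

    # draid2:4d:1s:11c = double parity, 4 data disks per redundancy group,
    # 1 distributed spare, 11 children in total.
    subprocess.run(
        ["zpool", "create", "tank", "draid2:4d:1s:11c", *disks],
        check=True,
    )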