My system is TrueNAS-13.0-U6.8 (Core, I believe). I upgraded from FreeNAS with a newer MB about 3 years ago. The processor (IIRC) is an i5 and the system reports 16G (although I thought I had 4 4G and 4 1G sticks making 20G). It is very lightly used, with just a few SMB volumes and Plex running in a jail. Just this week I put Home Assistant in a Linux VM. It was a learning curve to set up FreeNAS and again to set up TrueNAS. My documentation and memory are fragmented and I’ve spent very little time managing the system (though the uptime was almost 100%).
For myself, I’m a retired programmer. I have basic network skills and a long history of building my own systems.
ISSUES
I had built a Pool on 6 4T Seagate HDD in RAIDZ2. The drives were originally used and I’ve swapped 2 out in the last 3 years.
This week I belatedly noticed that the pool was degraded (although it can transfer 100MB/s over 1G network). One drive (ad4) was totally isolated from the pool, though I don’t recall it being degraded or faulted. I bought 2 new but unboxed 4T drives.
I set the bad drive (ad4) offline, physically replaced it with a new drive, and designated it to replace the bad drive.
It took a long time; during day 2 I noticed that ad3 was now FAULTED, while the other 4 were ONLINE.
After the replacement completed the pool was still DEGRADED. I set ad3 OFFLINE, physically replaced the drive, and set it to replace that 2nd ‘bad’ drive.
After a couple of hours I checked and now ad1 is FAULTED, ad3 is still being replaced, and the other four are now all DEGRADED. Completion time is now 20 hours and ticking upwards! I’m going to wait it out, maybe reboot, and see what TrueNAS says about the pool.
Of the 4 drives replaced now, I’ve formatted (not quick format) them in my workstation USB dock and they all pass and ‘seem’ to work. They are not in service but I have used a couple as a backup.
As stated, I’ve backed up all the SMB volumes (except for time machine) and saved the TrueNAS configuration. I’m also going to try to fix system:email since I don’t think it was able to email me status.
HELP
I’m open to suggestions. I hope there is enough information provided on my system. I looked for some function that would provide a generic dump but did not find that or any evident means of consolidating the configuration for human perusal.
I wonder, have you ever performed a full memtest run? If I were in a situation like this, that would be one of the first things on my list. If you already did it, sorry, it’s just better to make sure. So visit memtest.org, create a bootable USB stick, boot it up and let it run for a day. It’s better to do 2-3 full passes.
Then I guess I would also look at kernel messages, you might see something interesting there. Maybe a faulty sata controller… you never know! Just connect a screen. Kernel errors should appear on the console if I remember correctly.
It’s also a possibility that the other drives failed during the rebuild. It puts high stress on the disks. Did you enable a scheduled scrub? If not, that is a good idea to enable, let’s say a scrub every 2-3 months. That could help to avoid such surprises.
I forgot to mention. I think my pools are scrubbed once a week. Since email is hosed I’m not sure if I missed error reports. The Alerts on the main page only mentioned scrubs were completed.
That we can rule out then. I would look into the memory tests, kernel messages, everything. If you have a voltmeter also measure the 12V and 5V rails of the power supply.
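For the voltmeter check, here is a minimal sketch of the comparison. It assumes the commonly cited ATX tolerance of ±5% on the +12V and +5V rails; the readings are made-up placeholders, not measurements from this system.

```python
# Sketch: compare voltmeter readings against an assumed +/-5% ATX tolerance.
TOLERANCE = 0.05  # assumption: +/-5% on the +12V and +5V rails


def rail_limits(nominal, tolerance=TOLERANCE):
    """Return the (low, high) acceptable voltage for a rail."""
    return (nominal * (1 - tolerance), nominal * (1 + tolerance))


def rail_ok(nominal, measured, tolerance=TOLERANCE):
    """Return True if the measured voltage is within tolerance of nominal."""
    return abs(measured - nominal) <= nominal * tolerance


# Hypothetical readings: nominal voltage -> measured voltage.
readings = {12.0: 11.82, 5.0: 4.71}

for nominal, measured in readings.items():
    low, high = rail_limits(nominal)
    status = "ok" if rail_ok(nominal, measured) else "OUT OF RANGE"
    print(f"{nominal:>4.1f} V rail: measured {measured:.2f} V, "
          f"allowed {low:.2f}-{high:.2f} V -> {status}")
```

A reading taken at idle can look fine while the rail sags under load, so it is worth measuring while the disks are busy (for example during a resilver).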
My brain fog has cleared somewhat and I remembered the AsMedia 6 port SATA card I used (and was somewhat warned about when I rebuilt my system).
Since resilvering apparently puts heavy stress on the controller/drives, I’m going to go with the assumption that I should replace my controller.
Looking at the various LSI and Dell HBA (?) controllers, it’s not clear to me exactly how they work. I was under the assumption they were SAS only but some (?) seem to do SATA? Is that in conjunction with some special cables (which ones?) The cables are 1 (HBA) to 4 (SATA)?
I have to check my bargain motherboard, I think the AsMedia is either 1x or 2x. Have to check for a 4x or 16x PCI port?
Question
If I install a better HBA, and connect the 6 existing drives with SATA adapter cables, what are the chances my TrueNAS will come up and identify the existing pools? Would the drives all remain degraded/failed or generally hosed?
I’m not quite sure what the steps of recreating the pool would be (using system config and/or an exported pool?)
To my knowledge, SAS and SATA use the same command set. However, SAS uses higher signal levels than SATA (so that longer data cables between the controller and SAS disk are possible in the server area). However, every current SAS controller is able to adjust the signal level accordingly when connecting SATA disks.
You can therefore operate SATA disks on a SAS controller. However, you cannot connect a SAS disk to a SATA controller. That will not work.
Many HBAs can control 8 disks, and to my knowledge there are also some for 12 or 16 disks.
In your case, a model such as the IBM1015 might be sufficient. I used one myself for a while without any problems.
You need to check whether you have a suitable PCIe connection on your board. And when purchasing, order the appropriate SAS cables at the same time if possible. Many sellers offer suitable bundles.
Incidentally, I also used a PCIe card with eSATA and Asmedia chipset for a while. It worked perfectly, but only for importing or exporting data. However, these cards are designed for PCs, not servers.
Unfortunately, I can’t comment on importing your pools. When I installed the HBA, I had a completely different starting point.
SAS & SATA use totally different command protocols, (doing basically the same thing, storage). As you pointed out, they also use different signalling voltages & methods on the wires.
When a SATA disk is connected to a SAS HBA port, the chip and software cause the port to use SATA protocol. Then, the signalling voltages limit the cable length to SATA standards, (generally less than 1/2 of SAS).
Back to @Robert_Townsend - Can you supply the output of zpool status -v from the command line, in CODE tags?
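If the full output is too long to read comfortably, a small script can pull out just the rows that are not ONLINE. This is only a sketch: the sample text below is invented to resemble typical `zpool status` output, and the pool and device names (tank, ada0p2, ...) are hypothetical.

```python
# Sketch: list every row of the config table whose STATE is not ONLINE.
# SAMPLE_STATUS is invented for illustration, not output from this system.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME          STATE     READ WRITE CKSUM
        tank          DEGRADED     0     0     0
          raidz2-0    DEGRADED     0     0     0
            ada0p2    ONLINE       0     0     0
            ada1p2    FAULTED     12     0     3
            ada2p2    DEGRADED     0     0    41
            ada3p2    ONLINE       0     0     0

errors: No known data errors
"""


def problem_devices(status_text):
    """Return (name, state, read, write, cksum) for each non-ONLINE row."""
    rows = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if not in_config:
            # The table starts right after the NAME/STATE header line.
            in_config = fields[:2] == ["NAME", "STATE"]
            continue
        if not fields:
            break  # a blank line ends the table
        if len(fields) >= 5 and fields[1] != "ONLINE":
            # Counters stay as strings because zpool may abbreviate them.
            rows.append(tuple(fields[:5]))
    return rows


for name, state, read, write, cksum in problem_devices(SAMPLE_STATUS):
    print(f"{name}: {state} (read={read} write={write} cksum={cksum})")
```

On the NAS itself the text could come from `subprocess.run(["zpool", "status", "-v"], capture_output=True, text=True).stdout` instead of the sample string. The function stops at the first blank line after the header, so it only covers the first pool in the output.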
I have a PCI 16x slot free and identified a HBA with a LSI SAS2308 SAS chipset and SFF-8087 cables..
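On the earlier cable question, a quick sketch of the arithmetic, assuming each SFF-8087 connector carries four lanes and is fanned out to four SATA plugs by a forward breakout cable. The 8-port figure is an assumption about the card (two SFF-8087 connectors).

```python
import math

LANES_PER_SFF8087 = 4  # one SFF-8087 connector carries four lanes


def breakout_cables_needed(drive_count, lanes_per_connector=LANES_PER_SFF8087):
    """Number of SFF-8087-to-4x-SATA breakout cables for drive_count drives."""
    return math.ceil(drive_count / lanes_per_connector)


drives = 6     # drives in the RAIDZ2 pool
hba_ports = 8  # assumption: two SFF-8087 connectors on the HBA

cables = breakout_cables_needed(drives)
print(f"{drives} drives need {cables} breakout cables; "
      f"{hba_ports - drives} HBA ports stay free")
```

Breakout cables come in forward (HBA to drives) and reverse (motherboard SATA to backplane) variants, so check the listing before ordering.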
I’m thinking of destroying the pool, replacing the two (allegedly failed) drives, and wiping them to bare metal. That is, using whatever commands remove the partitions etc.
reasons for restoring the older 2 drives and starting over
all 4 drives removed (2 from a year or so ago) were restored on my workstation and seem to be ok
the first (new) drive now says degraded with a large number of read errors, I’m considering returning my 2 drive purchase. They are surveillance drives but my NAS is very lightly used and the 1G network only requires 120 MB/s performance
the last re-silvering was halted, and although I somehow reset/cleared the errors so the drives, except one, show ONLINE, I don’t trust it
I ran a command with -v and it listed a dozen files that were permanently corrupted
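To keep a record of those corrupted files before rebuilding, the list can be extracted from saved `zpool status -v` output. A minimal sketch follows; the sample text and paths are invented, and only the header wording is taken from what zpool normally prints.

```python
# Sketch: collect the entries listed under the "Permanent errors" header.
HEADER = "errors: Permanent errors have been detected in the following files:"

# Invented sample; the paths are hypothetical.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
errors: Permanent errors have been detected in the following files:

        /mnt/tank/media/example-movie.mkv
        tank/backups@auto-2024:/example/file.bin
"""


def damaged_files(status_text):
    """Return the entries listed after the permanent-errors header."""
    files = []
    collecting = False
    for line in status_text.splitlines():
        entry = line.strip()
        if entry == HEADER:
            collecting = True
            continue
        if entry.startswith("pool:"):
            collecting = False  # a new pool section begins
            continue
        if collecting and entry:
            files.append(entry)
    return files


for path in damaged_files(SAMPLE_STATUS):
    print(path)
```

Entries can be plain paths, `dataset@snapshot:/path` references, or hex object IDs, so the function returns them as raw strings without trying to interpret them.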
Question
I know I’ll have to create a new pool from the drives. I’ll also have to copy (slowly) the off system data back to the pool. I’d like to not have to recreate all the datasets/filesystems and set their ACLs manually. There doesn’t seem to be a way. I’ve stumbled through snapshots, config save, and export/import but none seem to be a match. These and (replication?) seem to be way overkill for my simple needs.
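One low-tech option is to save the dataset names before destroying the pool and generate the matching create commands afterwards. The sketch below assumes the names came from `zfs list -H -o name -t filesystem -r <pool>`; the pool and dataset names are hypothetical. It only prints commands, and it does not carry over properties or ACLs.

```python
import shlex


def create_commands(dataset_names, old_pool, new_pool):
    """Build one `zfs create -p` command per child dataset of old_pool."""
    commands = []
    prefix = old_pool + "/"
    for name in dataset_names:
        # Skip the pool's root dataset (it exists once the pool is created)
        # and anything outside old_pool.
        if not name.startswith(prefix):
            continue
        child = name[len(prefix):]
        commands.append("zfs create -p " + shlex.quote(f"{new_pool}/{child}"))
    return commands


# Hypothetical list, as saved from the old pool before it was destroyed.
saved_names = ["tank", "tank/smb", "tank/smb/photos", "tank/jails"]

for command in create_commands(saved_names, "tank", "newtank"):
    print(command)
```

Zvols (such as a VM disk) need a size via `-V` and are not handled here. On TrueNAS it may still be safer to create the datasets in the web UI so the share and ACL settings stay consistent with the saved configuration.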
BONUS ROUND
I got TrueNAS working with email – one of the first emails stated that one drive in my SSD mirrored pool was REMOVED! BTW it shows up on my workstation. The SSD drives are attached to the motherboard SATA ports. I physically installed another SSD drive and I’m waiting for the system to come back up so I can attempt to replace that drive now! It took 15 minutes to come up!
BONUS BONUS ROUND
The replace on the SSD mirror drive proceeded fast! I got distracted at 50% and about 3 minutes later I checked back. Upon completion only rotating blue circles showed up for Disks, pools, and many other selections for at least 10 minutes. The SSD pool at least is back.
I replaced the SATA controller with an 8 port SAS HBA. After wiping two alternate drives and installing them, I used them to replace the two (removed) drives via replace (resilver). After about 20 hours everything is up and somewhat running.
I moved on to removing/replacing any files marked bad - except for a timeserver child dataset which I’ve tried to remove.
I’ve opened another post describing the issues - the filesystem that won’t die.