Missing Disk from Members

I'm running TrueNAS Core 13u6 (recently upgraded from 11.2 > 11.3 > 12 > 13).

I have a RAIDZ2 legacy GELI-encrypted pool of 6 WD Red Plus drives; 2 of them are failing/failed.

My pool is currently degraded with 1 failure (the 2nd will surely follow soon); da0 is currently reporting FAULTED.
I have added a new disk to the system, which is detected as da2. It can be seen in Storage > Disks and shows up in 'dmesg' and 'camcontrol devlist', while 'gpart show da2' reports 'no such geom'.
I've run a short and a long SMART test on the new disk, both successfully.
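(For reference, the shell checks were along these lines; as far as I understand it, 'no such geom' from gpart on a brand-new blank disk just means it has no partition table yet:)

camcontrol devlist            # confirm the controller sees the new drive
dmesg | grep da2              # kernel messages for the new device
gpart show da2                # reports 'no such geom' until a partition table exists
smartctl -t short /dev/da2    # short self-test
smartctl -t long /dev/da2     # long self-test (several hours on a 6 TB drive)
smartctl -a /dev/da2          # review the results once the tests complete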

When I go to Storage > Pools > data1 > Status > da0 > Replace, the Member Disk section is empty; I was expecting to see da2 here.

As a quick test I tried to make a new pool with the new single disk:
storage > pools > add > create new pool > create pool
I'm given an error:
KeyError
'devname'

Any ideas how I can get this disk added to the existing pool to replace the faulted disk?

Thanks

How is the new drive connected to the system?

Please post your full hardware.

Supermicro X10SL7-F
Intel(R) Xeon(R) CPU E3-1271 v3
32GB ECC
LSI firmware in IT mode
The new disk is connected via SATA to the same port that the faulty disk was originally connected to.
The faulty disk has been moved to another spare SATA port.
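(If it's useful, the onboard LSI 2308's firmware and driver versions can be pulled from dmesg to double-check the IT-mode flash, something like:)

dmesg | grep -i mps               # mps driver lines show the controller firmware and driver versions
grep -i mps /var/run/dmesg.boot   # same info from the boot-time message buffer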

Some slight differences I missed:
My original disks are WD60EFRX-68L 0A82 and the new one is WD60EFPX-68C 0A81.
I think my original disks were normal CMR Reds from before they split the line into Red and Red Plus.
The other difference I've seen is that the rotation rate on the originals is 5700 rpm, whereas the new one reports 5400 rpm in Storage > Disks. The new one likely has more cache too (256 MB).

root@freenas:~ # camcontrol devlist
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 0 lun 0 (pass0,da0)
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 1 lun 0 (pass1,da1)
<ATA WDC WD60EFPX-68C 0A81>        at scbus0 target 2 lun 0 (pass2,da2)
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 3 lun 0 (pass3,da3)
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 4 lun 0 (pass4,da4)
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 5 lun 0 (pass5,da5)
<ATA WDC WD60EFRX-68L 0A82>        at scbus0 target 6 lun 0 (pass6,da6)
<ATA CT1000MX500SSD1 033>          at scbus0 target 7 lun 0 (pass7,da7)
<INTEL SSDSA2M080G2GC 2CV102M3>    at scbus1 target 0 lun 0 (ada0,pass8)
<INTEL SSDSA2M080G2GC 2CV102M3>    at scbus2 target 0 lun 0 (ada1,pass9)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus3 target 0 lun 0 (ada2,pass10)
<WDC WD40EFRX-68N32N0 82.00A82>    at scbus4 target 0 lun 0 (ada3,pass11)
<AHCI SGPIO Enclosure 2.00 0001>   at scbus7 target 0 lun 0 (ses0,pass12)

Could either of these differences stop the disk showing up in the member list? Although it shouldn't stop me from creating a new single-disk vdev with the new one :confused:
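(For reference, the same per-drive details can be pulled from the shell, e.g.:)

smartctl -i /dev/da2     # model, firmware and rotation rate as reported by the drive
diskinfo -v /dev/da2     # sector size and stripe size as seen by the OS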

Is it safe to try the command line to replace the disk in this scenario? (GELI encrypted)
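(For context, my rough understanding of what a command-line replace involves on a GELI pool is sketched below; the key file name and gptids are placeholders, and as far as I know the GUI replace also records the new encrypted provider in its config database, which a manual replace would skip. That's partly why I'm asking.)

gpart create -s gpt da2                       # new GPT partition table on the blank disk
gpart add -t freebsd-swap -a 4k -s 2g da2     # match the 2 GB swap partition the GUI creates
gpart add -t freebsd-zfs -a 4k da2            # rest of the disk for ZFS
glabel status | grep da2p2                    # note the gptid of the new zfs partition
geli init -s 4096 -K /data/geli/<pool>.key gptid/<new-uuid>      # prompts for the pool passphrase
geli attach -k /data/geli/<pool>.key gptid/<new-uuid>            # prompts again to attach
zpool replace data1 gptid/<old-uuid>.eli gptid/<new-uuid>.eli    # resilver onto the new .eli provider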

Are we sure it isn’t a faulty port and/or wire? Does it fail to show if changed to another port?

Does this drive show up if you connect it to another system?

For what it’s worth, I just contacted WD support. […] WD Support gave me the answer:

  • WD60EFAX models do not use SMR. The older WD60EFRX models do use SMR.

Drives dropping out of an array is what happens when you use SMR drives.

I dug out my last order to replace a faulty disk back in 2020:

WD 6TB Red Plus (CMR) 64MB 3.5IN SATA 6GB/S NAS Hard Drive
Exclusive NASware 3.0, built for optimum NAS compatibility, 3D Active Balance Plus, 24/7 reliability, cooler operations, high reliability, premium support and a 3-year limited warranty - WD60EFRX

According to the WD warranty checker (don't know if this helps with CMR/SMR determination):

WD40EFRX-68N32N0 APOLLO 5400 64M SATA3 6GB/S 4.0 TB 6HD NAS
WD40EFRX-68N32N0 APOLLO 5400 64M SATA3 6GB/S 4.0 TB 6HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS
WD60EFRX-68L0BN1 REMBRNDT 5400 64M SATA3 6GB/S 6.0 TB 10HD NAS

And one of the new WD Red Plus 6TB direct from WD this month:

WD60EFPX-68C5ZN0 VENRP2LP 5400 256M SATA3 6GB/S 6.0 TB 6HD NAS

I don't think the port/cable is faulty, as the disk did show up in Storage > Disks and survived the short and long SMART tests; I think my issue turned out to be a broken install/upgrade.
Somewhere along the 11.2 > 11.3 > 12 > 13 upgrade path something must have gone wrong.
I did a clean install of 13u6 and uploaded my config, but got the same errors as above when trying to make use of the new disk.

I then tried a config reset and imported my pools, and I could select the new disk in the Replace > Member Disk drop-down. The errors from Storage > Import Disk are also gone.

1 disk has now finished resilvering via the web GUI; I have changed my passphrase and added a new recovery key. On to the 2nd disk.
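(Resilver progress can also be watched from the shell, e.g.:)

zpool status -v data1    # the scan line shows resilver percentage and estimated time remaining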

Working out how many CMR drives I actually have is now baffling, thanks to the undisclosed SMR/CMR changes. Although according to the original author of the story on TechSpot:
https://www.techspot.com/news/84973-wd-publishes-complete-list-smr-drives-following-user.html

No. EFRX were always CMR
EFAX in 1-6TB are SMR
EFAX in 8TB and up are CMR

These drives were originally purchased in 2015, with a warranty replacement in 2019, a purchased replacement in 2020, and now 2 bought in 2024 as replacements.


Having shut down, removed the first faulty disk (already replaced) and attached the 2nd new disk, I'm now having the same issue as before.
The new disk is seen in Storage > Disks.
Storage > Pools > data1 > Pool Status > faulty disk > Replace > Member Disks does not display the new disk.
Storage > Pools > Add > Create new pool > error: KeyError 'devname'
Storage > Import Disk says 'Error getting disk data'.

If I export the data1 pool and try to reimport it, the Add process doesn't show the disks any more, yet they are still there in Storage > Disks.

I've a feeling this has something to do with the GELI encryption/rekeying/importing, as that's pretty much the only thing I've done on a clean config/database.

Did the following to confirm where the problem starts:

System > General > Reset config, restarted
Login, set root password
Storage > Pools > Add > Import existing pool > Yes, decrypt the disks > select disks for data1 pool + add key + passphrase > data1 > import

Storage > Import disk > still ok
Storage > Pools > Add pool - still ok

Storage > Pools > data1 > Cog > Status > Replace da6 with da0 > formatting disk > replacing disk > Successfully replaced disk da6.

Storage > Import disk > still ok
Storage > Pools > Add pool > still ok

Storage > Pools > data1 > cog > Encryption Key/Passphrase > changed passphrase and downloaded key
Storage > Pools > data1 > cog > Manage recovery key > add recovery key
System > General > Set timezone > Save & Save Config

Storage > Import disk > still ok
Storage > Pools > Add pool > still ok

Restart host - Python errors seen while importing the pool during the start-up sequence

Storage > Import disk > Error getting disk data
Storage > Pools > Add > Create new pool > Error: KeyError 'devname' - More Info provides the following:

Error: Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 139, in call_method
    result = await self.middleware._call(message['method'], serviceobj, methodobj, params, app=self)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1240, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 981, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/disk_/availability.py", line 21, in get_unused
    reserved = await self.middleware.call('disk.get_reserved')
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1283, in call
    return await self._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1240, in _call
    return await methodobj(*prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/disk_/availability.py", line 44, in get_reserved
    reserved += [i async for i in await self.middleware.call('pool.get_disks')]
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/disk_/availability.py", line 44, in <listcomp>
    reserved += [i async for i in await self.middleware.call('pool.get_disks')]
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool.py", line 1059, in get_disks
    disk_path = os.path.join('/dev', d['devname'])
KeyError: 'devname'

Storage > Pools > Add > Import an existing pool > Yes, decrypt the disks > Disks drop-down is empty; it should be populated with multiple disks.
All disks can be seen in Storage > Disks but will not appear in the drop-downs above after the restart.
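From the traceback it looks like pool.get_disks is tripping over a disk entry that has no 'devname'. In case it helps anyone hitting the same thing, the middleware's view of the disks can be dumped from the shell with midclt and eyeballed for an entry missing its devname (assuming that is indeed the field in play):

midclt call disk.query | python3.9 -m json.tool > /tmp/disks.json   # dump what the middleware reports per disk
grep -c '"devname"' /tmp/disks.json                                 # roughly, should match the number of disks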

Did a further test, since my pool had now recovered with the 2 replaced disks.

System > General > Reset config, restarted
Login, set root password
Storage > Pools > Add > Import existing pool > Yes, decrypt the disks > select disks for data1 pool + add key + passphrase > data1 > import > successful

Restart host - Python errors seen while importing the pool during the start-up sequence
Login - Same errors as previous post.

Storage > Pools > data1 > Unlock > successful, and the pool can be accessed via the shell at /mnt/data1
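(For the record, the unlock can be double-checked from the shell along these lines:)

geli status | grep '\.eli'    # each pool member should show an attached .eli provider
zpool status data1            # pool should be ONLINE (or resilvering/degraded as expected)
ls /mnt/data1                 # datasets mounted and readable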

Something about importing a GELI pool, or this particular pool, is causing the TrueNAS GUI to break, stopping me from creating/importing additional disks/pools.

Glad you got your pool recovered. Have you considered opening a ticket for the issues importing an encrypted pool?

Issue logged here:
https://ixsystems.atlassian.net/browse/NAS-128482
Edit - issue fixed in Core v13.3
