Issues replacing a SAS drive in my TrueNAS Core

I am running TrueNAS 13, and it works great. I have several pools made up of RaidZ2 vdevs. A drive started reporting high error counts, so I thought it would be simple enough to just replace it, as I have several spares on the shelf. Exact same model number, so everything should be identical. When I loaded the new drive I did the wipe and then let it create the multipath disk. All seemed fine until I went to replace the drive in the pool.

The replace would error out, so I looked at the error being presented and saw this:

Error: concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 283, in replace
    target.replace(newvdev)
  File "libzfs.pyx", line 392, in libzfs.ZFS.__exit__
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 283, in replace
    target.replace(newvdev)
  File "libzfs.pyx", line 2123, in libzfs.ZFSVdev.replace
libzfs.ZFSException: device is too small

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 111, in main_worker
    res = MIDDLEWARE._run(*call_args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 45, in _run
    return self._call(name, serviceobj, methodobj, args, job=job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/worker.py", line 39, in _call
    return methodobj(*params)
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 979, in nf
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/zfs.py", line 285, in replace
    raise CallError(str(e), e.code)
middlewared.service_exception.CallError: [EZFS_BADDEV] device is too small
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 355, in run
    await self.future
  File "/usr/local/lib/python3.9/site-packages/middlewared/job.py", line 391, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/local/lib/python3.9/site-packages/middlewared/schema.py", line 975, in nf
    return await f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/plugins/pool_/replace_disk.py", line 92, in replace
    await self.middleware.call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1278, in call
    return await self._call(
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1243, in _call
    return await self._call_worker(name, *prepared_call.args)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1249, in _call_worker
    return await self.run_in_proc(main_worker, name, args, job)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1168, in run_in_proc
    return await self.run_in_executor(self.__procpool, method, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/middlewared/main.py", line 1151, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
middlewared.service_exception.CallError: [EZFS_BADDEV] device is too small

I verified the drives are identical: same model, same size. Here is disk24, which is the one I am trying to install in the pool, and disk23, which is the next drive over and working fine. Same sector size and same block size, so how can it be too small? I also did a full wipe overnight on the drive, and it made no difference.

Here is the working disk:

Geom name: disk23
Type: AUTOMATIC
Mode: Active/Passive
UUID: d4ee1761-4736-11ed-9b58-1402ec6acf14
State: OPTIMAL
Providers:
1. Name: multipath/disk23
   Mediasize: 3000592981504 (2.7T)
   Sectorsize: 512
   Mode: r1w1e3
   State: OPTIMAL
Consumers:
1. Name: da48
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r2w2e4
   State: ACTIVE
2. Name: da24
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r2w2e4
   State: PASSIVE


Here is the one I am trying to put in to replace the failed drive, but the replace keeps getting refused.

Geom name: disk24
Type: AUTOMATIC
Mode: Active/Passive
UUID: 32850760-c945-11ef-88f8-1402ec6acf14
State: OPTIMAL
Providers:
1. Name: multipath/disk24
   Mediasize: 3000592981504 (2.7T)
   Sectorsize: 512
   Mode: r0w0e0
   State: OPTIMAL
Consumers:
1. Name: da25
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r1w1e1
   State: ACTIVE
2. Name: da49
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Mode: r1w1e1
   State: PASSIVE

Am I missing something? Any ideas appreciated, as I am scratching my head…

I do recall a sizing issue due to how TN handles partitions and similar… but unless I am remembering wrong it was related to SCALE.

Which CORE version are you running?

Have you verified there’s not a lingering partition there somewhere?
Just in case your wipe only cleared the contents of the main partition.
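
Something along these lines on both the raw path and the multipath device should show any leftover table (device names taken from your output, so adjust as needed):

gpart show da25
gpart show multipath/disk24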

Yes, it’s Core; I haven’t tried jumping to SCALE yet. I think I will try that on a less critical NAS.

It shows as TrueNAS-13.0-U3.1, which I know isn’t current as I haven’t rebooted it in a while, because it just works!

Well, I am learning something new; I thought a drive wipe really zeroed it out. I even did the all-zeros wipe, which took about a day to run.

Is there a preferred way, or should I say an assured way, to totally clear the drive in TrueNAS? Command line is fine. I did try a dd to /dev/multipath/disk24 to clear out the drive; I figured that would do it, but maybe I am wrong. I am up for trying just about anything on that drive, as it’s not in use…
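
For reference, the dd I ran was roughly this (block size from memory, so take it as approximate):

dd if=/dev/zero of=/dev/multipath/disk24 bs=1m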

Multipath support was removed long ago.

I just realised that your previous output is from gmultipath list rather than gpart list, so it wouldn’t have shown an existing partition table anyway.

As @etorix points out, your use of multipath is worrying, since it’s been deprecated.

As others have said, multipath support was deprecated a while ago.

Try running this command after you’ve added your disk, but before you try the replace:

/usr/local/bin/midclt call disk.multipath_sync
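
Then something like this should confirm both paths got picked up (adjust the name to whatever label the new disk ends up with):

gmultipath status disk24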

On multipath: I have had multipath on it for years, so it’s interesting that it has been removed, since it was used by default when I built this NAS. What do they do now to support drive shelves with multiple controllers? I’m guessing I could destroy the multipath and just use the active device path directly, but so much for redundancy.
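
If I go that route with the new disk, I assume it would be something along these lines (untested on my part), and then I would point the replace at da25 directly:

gmultipath destroy disk24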

I knew about the multipath sync; I used it to add the new drive back into a multipath, like the rest of the drives, before trying the replace.

Does anyone know how SCALE handles multiple SAS controllers to a shelf?

Looks like I need to build a new NAS, which is a pain, as this one serves a production cluster of nodes running a bunch of VMs…

In short, it was too easy for people to hang themselves with multipath, so the decision was made to pull it, as the potential cons outweighed the pros.

If I’m not wrong, this is a feature of SCALE’s ENTERPRISE edition; with enough skill, you can implement it (code is public on GitHub)… personally, I have no clue about how to do so.

Might be worth doing a gpart show on da48 and da25 and comparing the two.

Well, as a follow-up: I moved everything off the pool, deleted it completely, and then re-created it from scratch. I still left the multipath disks, as they had been set up that way since day 1 of the NAS. Set up fresh, it all just worked, so go figure.

I figure that with Core pretty much EOL now, I will just work on building a new NAS using SCALE, then move the data over and be done with Core…
