TL;DR: Look at the “Workaround” section at the bottom.
I think I have figured out the cause of the problem and a reliable workaround thanks to the previous research by
- Oleksandr Liutyi, author of
- @seekee (for finding Liutyi's post), and
- @dameb10129 (for bringing my focus to the /sys/block system).
Issue
I am also running an Odroid H4+ with an eMMC (64GB) and encountered exactly the same error behaviour as described by OP while trying to install TrueNAS SCALE 25.04.0.
Retrying about 30 times in total did not help, and neither did the other workarounds suggested by dameb and Liutyi.
Investigation
During my investigation I primarily looked at the python snippet identified by Liutyi (format_disk function at truenas-installer/truenas_installer/install.py at TS-25.04.0 · truenas/truenas-installer · GitHub) and the function get_partitions called from this code site (truenas-installer/truenas_installer/utils.py at TS-25.04.0 · truenas/truenas-installer · GitHub).
From what I can tell, it's a combination of two things: the long time it takes the new partitions to show up in /dev (which I have not observed directly but rather inferred from my knowledge of the code and the system behaviour during my debugging attempts) and the MMC boot partitions (special to eMMC devices), which evidently were not anticipated when the installer was implemented.
So the format_disk function looks fine - it relies on the get_partitions function to wait for up to 30s and to find the correct partition devices - neither of which it accomplishes in this case.
The issues seem to lie in the main loop of the get_partitions function:
    for _try in range(tries):
        if all((disk_partitions[i] is not None for i in disk_partitions)):
            # all partitions were found on disk
            return disk_partitions
        try:
            with os.scandir(f"/sys/block/{device}") as dir_contents:
                for partdir in filter(lambda x: x.is_dir() and x.name.startswith(device), dir_contents):
                    with open(os.path.join(partdir.path, 'partition')) as f:
                        try:
                            _part = int(f.read().strip())
                            if _part in partitions:
                                # looks like {1: '/dev/sda1', 2: '/dev/nvme0n1p2'}
                                disk_partitions[_part] = f'/dev/{partdir.name}'
                        except (OSError, ValueError):
                            # OSError: [Errno 19] No such device was seen on
                            # our internal CI/CD infrastructure for reasons
                            # not understood...
                            continue
        except FileNotFoundError:
            continue
        await asyncio.sleep(1)
A short context:
- tries is the number of retries, i.e. 30
- disk_partitions is a dictionary that is initialized as {1: None, 2: None, 3: None} in our case; it's supposed to end up as {1: "/dev/mmcblk0p1", 2: "/dev/mmcblk0p2", 3: "/dev/mmcblk0p3"} in our case
- device is a string - the base device name, in our case mmcblk0
- partitions is simply a numeric list of the partitions we are looking for - basically the keys of disk_partitions, i.e. [1,2,3]
Problems in the loop
The problem here is that for eMMCs with boot partitions there are /sys/block/mmcblkX/mmcblkXbootZ paths in addition to, and independent of, the normal, expected /sys/block/mmcblkX/mmcblkXpY paths.
Unlike the latter, these do not contain a partition file, although the inner loop that iterates over all /sys/block/mmcblkX/mmcblkX* entries expects one!
Unfortunately, the open call that reads the /sys/.../partition file is not protected by the inner try/except. If it were, the loop would merely skip the current /sys/block/mmcblkX/mmcblkXbootZ entry and eventually reach the .../mmcblkXpY paths it expects.
Instead, it is only protected by the outer try/except, which is probably intended as a catch-all in case /sys/block/mmcblkX is completely missing for whatever reason. That handler abandons the current scan and, because its continue skips the await asyncio.sleep(1) statement, the outer loop runs through its retries rapidly instead of waiting.
So whether the loop completes its mission successfully is semi-random: it depends on the order in which os.scandir returns the directory entries (favorable case: all mmcblkXpY first, all mmcblkXbootZ later).
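To make the order dependence concrete, here is a small simulation of a single scan pass against a fake sysfs tree in a temporary directory. It is only a sketch of the behaviour described above, not the installer's code: scan_once and make_fake_sysfs are hypothetical helpers, and the entry order is passed in explicitly because the order returned by os.scandir is not guaranteed.

```python
import os
import tempfile


def make_fake_sysfs(root, device):
    """Create <root>/<device>/ with three partition dirs and two boot dirs."""
    for n in (1, 2, 3):
        part_dir = os.path.join(root, device, f"{device}p{n}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "partition"), "w") as f:
            f.write(f"{n}\n")
    for z in (0, 1):
        # Boot partition dirs exist but contain no "partition" file
        os.makedirs(os.path.join(root, device, f"{device}boot{z}"))


def scan_once(root, device, names_in_order):
    """Mimic one pass of the quoted loop for a given directory entry order."""
    found = {}
    try:
        for name in names_in_order:
            # The open is only covered by the outer handler, as in the quoted loop
            with open(os.path.join(root, device, name, "partition")) as f:
                found[int(f.read().strip())] = f"/dev/{name}"
    except FileNotFoundError:
        # The whole pass is abandoned at the first boot dir
        pass
    return found


with tempfile.TemporaryDirectory() as root:
    make_fake_sysfs(root, "mmcblk0")
    parts = [f"mmcblk0p{n}" for n in (1, 2, 3)]
    boots = ["mmcblk0boot0", "mmcblk0boot1"]
    favorable = scan_once(root, "mmcblk0", parts + boots)
    unfavorable = scan_once(root, "mmcblk0", boots + parts)

print(favorable)    # {1: '/dev/mmcblk0p1', 2: '/dev/mmcblk0p2', 3: '/dev/mmcblk0p3'}
print(unfavorable)  # {}
```

With the boot entries first, the pass ends before a single partition is recorded, which matches the failing installs.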
The fallback can’t help
There is actually a backup code path in case the loop fails. It scans the /dev directory instead of looking in /sys/....
However, from analyzing the code it seems that this path would also get confused by the /dev/mmcblkXbootZ entries and even misidentify them as the /dev/mmcblkXpY devices it is actually looking for whenever Z and Y are the same, which is the case on my system for boot1 and p1.
Also, because the prior loop skips its sleep statements, it seems that /dev is not yet populated by the time this code path is reached.
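Whatever the fallback's exact matching logic is, one way to avoid confusing boot1 with p1 would be to match the complete partition name rather than only parts of it. The following is a minimal sketch, not taken from the installer: partition_number is a hypothetical helper, and it assumes the usual Linux naming scheme in which a p separator is inserted only when the base device name ends in a digit.

```python
import re


def partition_number(device, name):
    """Return the partition number if `name` is a partition of `device`, else None.

    Assumption: partitions are named <device><N>, or <device>p<N> when the
    base device name ends in a digit (mmcblk0 -> mmcblk0p1, sda -> sda1).
    """
    separator = "p" if device[-1].isdigit() else ""
    match = re.fullmatch(re.escape(device) + separator + r"(\d+)", name)
    return int(match.group(1)) if match else None


print(partition_number("mmcblk0", "mmcblk0p1"))     # 1
print(partition_number("mmcblk0", "mmcblk0boot1"))  # None
print(partition_number("sda", "sda1"))              # 1
```

Because the whole name has to match, mmcblk0boot1 is rejected even though it shares both the prefix and the trailing number with mmcblk0p1.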
Conclusion
To resolve this issue permanently, changes to the installer seem to be unavoidable.
Fixing the installer
Since I am not familiar with the code base and my knowledge of Linux internals is very limited, my suggestions will stay vague.
A fix might include:
- Making sure that the sleep interval between retries is respected even when exceptions occur
- Making sure that all the paths /sys/block/{device_name}/{device_name}* are traversed even if some do not behave as expected
- Potentially: Explicit handling for eMMC by skipping boot partitions (and maybe other device partitions if they exist)
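The first two suggestions could be sketched roughly as follows. This is an illustration under my own assumptions, not a tested patch for the installer: find_partitions, its parameters (delay, sys_block) and its return value on failure are hypothetical, and the demo runs against a fake sysfs tree in a temporary directory.

```python
import asyncio
import os
import tempfile


async def find_partitions(device, partitions, tries=30, delay=1.0, sys_block="/sys/block"):
    """Hypothetical variant of the quoted loop; not the upstream implementation."""
    found = {number: None for number in partitions}
    for _try in range(tries):
        try:
            with os.scandir(os.path.join(sys_block, device)) as entries:
                for entry in entries:
                    if not (entry.is_dir() and entry.name.startswith(device)):
                        continue
                    try:
                        with open(os.path.join(entry.path, "partition")) as f:
                            number = int(f.read().strip())
                    except (OSError, ValueError):
                        # e.g. mmcblkXbootZ has no "partition" file: skip this entry only
                        continue
                    if number in found:
                        found[number] = f"/dev/{entry.name}"
        except FileNotFoundError:
            # The device directory itself is missing; still fall through to the sleep
            pass
        if all(path is not None for path in found.values()):
            return found
        # Reached on every unsuccessful pass, including after exceptions
        await asyncio.sleep(delay)
    return found  # may still contain None values


# Demo against a fake sysfs tree that includes eMMC boot dirs
with tempfile.TemporaryDirectory() as root:
    for n in (1, 2, 3):
        part_dir = os.path.join(root, "mmcblk0", f"mmcblk0p{n}")
        os.makedirs(part_dir)
        with open(os.path.join(part_dir, "partition"), "w") as f:
            f.write(f"{n}\n")
    for z in (0, 1):
        os.makedirs(os.path.join(root, "mmcblk0", f"mmcblk0boot{z}"))
    result = asyncio.run(find_partitions("mmcblk0", [1, 2, 3], tries=2, delay=0, sys_block=root))

print(result)  # {1: '/dev/mmcblk0p1', 2: '/dev/mmcblk0p2', 3: '/dev/mmcblk0p3'}
```

The open now sits inside the per-entry try/except, so a directory without a partition file only costs that one entry, and the sleep is no longer bypassed by a continue.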
Workaround
For now I have found a workaround that allowed me to install TrueNAS successfully:
- Open the Shell in the installer
- Patch /lib/python3/dist-packages/truenas_installer/utils.py by:
  - moving the line await asyncio.sleep(1) right beneath for _try in range(tries):
- Hide the /dev/mmcblk0boot* devices that confuse the “last resort” path of the installer by:
  - running rm -f /dev/mmcblk0boot0 and rm -f /dev/mmcblk0boot1
- exit the shell and run the installer as normal (without rebooting in between)
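I made the patch by hand in an editor, but the line move itself could also be expressed as a small text transformation. The sketch below is hypothetical and only exercised on a sample string: move_sleep_to_loop_top is my own helper, and it assumes that the for line occurs once and that the first matching sleep line after it is the one belonging to that loop.

```python
def move_sleep_to_loop_top(source):
    """Move 'await asyncio.sleep(1)' to directly beneath 'for _try in range(tries):'."""
    lines = source.splitlines(keepends=True)
    loop_idx = next(
        i for i, line in enumerate(lines) if line.strip() == "for _try in range(tries):"
    )
    # Only consider sleep lines after the loop header
    sleep_idx = next(
        i for i in range(loop_idx + 1, len(lines))
        if lines[i].strip() == "await asyncio.sleep(1)"
    )
    sleep_line = lines.pop(sleep_idx)
    if not sleep_line.endswith("\n"):
        sleep_line += "\n"
    # The sleep line is already indented at loop-body level, so it can be reused as-is
    lines.insert(loop_idx + 1, sleep_line)
    return "".join(lines)


sample = (
    "    for _try in range(tries):\n"
    "        try:\n"
    "            scan()\n"
    "        except FileNotFoundError:\n"
    "            continue\n"
    "        await asyncio.sleep(1)\n"
)
patched = move_sleep_to_loop_top(sample)
print(patched)
```

Applying this to the real utils.py would mean reading the file, passing its contents through the function and writing the result back; keeping a copy of the original file first seems prudent.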