Problem with NVMe driver / boot-pool

Hi guys, I have an issue with the boot-pool: in the GUI, the Storage section shows it as an exported pool.


Also, when I start the multi_report script it crashes with the following message:

Any idea what the problem could be? I started using TrueNAS recently and I am currently on Electric Eel RC2.

Thanks in advance for any help.

Your SSD might be in read-only protection mode. Can you post the output of zpool status and zpool import?
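For reference, a minimal sketch of the commands being asked for, run from the TrueNAS shell:

```sh
# Show the health and layout of all currently imported pools
zpool status

# List pools that are exported or otherwise available for import
zpool import
```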

@Davvo This is the result of zpool status:

@Davvo If I try to import the pool it doesn’t even show up

Output of zpool import as well please.

Same thing.

@Davvo Any ideas come to mind? :frowning:

EE RC2 is still a release candidate; however, I have not heard of anyone having problems running Multi-Report on EE, and never a report against the boot-pool.

The error message does not list what caused the error; however, if you say running Multi-Report caused it, I will not argue that, as I can’t prove or disprove anything from my end. I would ask you one question: did you follow the guide on installing the script correctly? Okay, two questions… Where did you install the script? All I can suspect is that you are trying to write to the boot-pool, which is not desired and probably not possible without unlocking the destination, which I will not have the script perform.

However, I am thinking you want to troubleshoot the exported disk issue, so I will stay focused on that.

TrueNAS has not fully implemented NVMe support yet, not even in EE RC2, based on my recent testing and examination of the smartctl.py file used to build TrueNAS. However, that should not cause the pool issue you are seeing.

Some more questions:

  1. Did you create these pools before you installed EE?
  2. Did you upgrade the feature flags? (I am not suggesting that you do, I certainly do not myself)
  3. Have you rebooted (powered off, waited about 10 seconds, then powered back on)?
  4. Are you running TrueNAS on bare metal or in a VM?
  5. Add any other details you can think of, such as the hardware you are using; it may not seem important, but it could be the clue that is needed.

Run a long SMART test on your boot pool and then post the result; my guess is that there is something wonky with the drive… but you could also try a fresh install.

Thanks for your detailed answer @joeschmuck :slight_smile: Really appreciated.

  1. I don’t remember if I first installed TrueNAS SCALE Dragonfish and then upgraded to EE from there.
    So I propose to start with the obvious:
    I will try reinstalling TrueNAS SCALE with a clean Electric Eel RC2 or, since it is just 10 days away, wait for the official release of EE. After that I will check whether the boot-pool issue is still present (whether it still shows as exported), and if so, I will give Multi-Report another try to see if it crashes the system while the script runs.

I can’t really run a SMART test because I get:

Read Self-test Log failed: Invalid Field in Command (0x2002)

Therefore, for the time being, I’ll probably order a smaller SATA SSD drive. Then I’ll do a clean installation of TrueNAS SCALE when EE is released, and hopefully everything resolves itself. Thanks for the help.

So, two ways to test the NVMe drive:

NOTE: This is ONLY for SCALE (Linux). It will fail on CORE (FreeBSD).

  1. The device commands:

From the command line, enter (you can cut/paste if you like) nvme device-self-test /dev/nvme0 -s 2 to run a long test on drive nvme0 (your boot-pool). This will likely take less than 10 minutes to complete. If you are using EE RC2, you can then use smartctl -a /dev/nvme0 to read and view the test results.

Some extra info: the Short test command is nvme device-self-test /dev/nvme0 -s 1, which runs a short test on drive nvme0 and will take 2 minutes or less to complete. Of course you can run this command on your other NVMe drives at the same time; just replace nvme0 with nvme1 or nvme2 and send the command.

NOTE: This works on CORE and SCALE.
  2. The smartctl commands:

This should work fine on EE RC2: use smartctl -t long /dev/nvme0 and wait 10 minutes, then use smartctl -a /dev/nvme0 to read the results.

Running a short test is almost the exact same command: smartctl -t short /dev/nvme0. Of course you can test all the drives at the same time; just issue the command for each drive individually and wait the appropriate amount of time before checking the status of each drive.

If any of these fails to work, please provide a screen capture of the command you sent and the error message. You should not have any problems with these commands. If a drive fails, that is a different issue; post those results as well. A combined copy/paste sketch of both approaches is shown below.
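A rough sketch pulling the two methods above together, assuming the boot drive is on controller nvme0 (substitute your own device name):

```sh
# Method 1: nvme-cli (SCALE / Linux only)
nvme device-self-test /dev/nvme0 -s 2   # start a long (extended) self-test
# ...wait roughly 10 minutes...
smartctl -a /dev/nvme0                  # read SMART data and the self-test results

# Method 2: smartctl (CORE and SCALE)
smartctl -t long /dev/nvme0             # start a long self-test
# ...wait roughly 10 minutes...
smartctl -a /dev/nvme0                  # read the results
```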

This is my topology:
I have 3 NVMe drives.
Two of them (Crucial P3 2TB, nvme0n1 and nvme2n1) make up the AppsAndVmsPool.

result of nvme2n1:

result of nvme0n1:

nvme1n1 (Sabrent 1TB Rocket Q) (boot-pool)

First I got this result:

After some minutes it crashed TrueNAS, and to get everything running again I had to shut down my machine completely; otherwise the boot-pool drive is not even recognized and GRUB doesn’t start (logs from IPMI before restarting):

Stop using the namespace! Nowhere in my posting did I use “nvme0n1”, and nvme0n1 is not the same as nvme0. nvme0 is the controller for the drive, and that is what you want to communicate with.

This is the problem the TrueNAS code has: it is trying to use the namespace as well, and that is wrong.
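To illustrate the distinction (device names are just examples and vary per system): the controller and the namespace show up as separate device nodes, and the self-test commands should target the controller:

```sh
# Controllers are character devices, namespaces are block devices:
ls -l /dev/nvme0 /dev/nvme0n1
# crw------- ... /dev/nvme0     <- controller: this is what smartctl/nvme should talk to
# brw-rw---- ... /dev/nvme0n1   <- namespace: the block device that holds the data

nvme list                        # list namespaces and the drives behind them
smartctl -t long /dev/nvme0      # run the self-test against the controller, not nvme0n1
```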


Ah, OK! I did not know about the namespace thing. I’ll try to run smartctl again without the namespace once the server is running again.

While I understand you thought you were doing it properly, I feel that I must warn you about something… If someone provides you very specific commands to enter, do not deviate without asking first. It is okay to ask and verify before doing something; however, by just changing the command you could cause more harm than you realize, and none of us here want that. Thankfully this was not an issue in that respect, however there are some commands that will wipe your data slick as fast as you can hit the Enter key.

If you are running EE RC2, the smartctl commands I provided should work fine and you should not need to use the nvme commands.


No problem, I mistakenly assumed that nvme2n1 and nvme2 were just the same thing; I did not know there is such a thing as a namespace.
Anyway, I can’t test it anymore because it seems like the boot-pool drive is gone for good now :frowning:
So for the moment I will need to shut down the server and wait for another SSD to use as the boot-pool before testing again.

Grab the first USB stick you have and boot from there for now, so you don’t have to keep your NAS off while the new disk arrives.

I would not suggest doing so in SCALE due to the constant writes to the boot pool that have been introduced.

That is odd for the boot-pool to just die.

Any reason you are using a 2TB M.2 drive as the boot-pool? I recommend using something a bit less expensive. Maybe a nice slow 256GB NVMe drive? Gen 3? This will produce a lot less heat and the cost is minimal. A boot drive does not need to be fast; even if you use it as your SWAP space, it is faster than spinning rust.