Thoughts on data pool corruption from an unknown cause

Hello everyone!

A few days ago my home NAS suddenly crashed.
No matter how many times I restarted it, it would get stuck at the boot screen with the message: “sd? data cmplt err -32 uas-tag 2 inflight cmd”.

Here is the information for the system on which the problem occurred:

System: HPE ProLiant MicroServer Gen10 Plus v2
CPU: Intel Xeon E-2314
Memory: SK Hynix 64GB (2 x 32GB) DDR4-3200 ECC UDIMM
Network card: Intel I350 quad-port 1GbE (built-in)
Boot pool: WD SN740 1TB 2230 NVMe SSD (in an ITGZ JMS583 USB enclosure, on a USB 3.2 Gen 2 port)
HDD pool: two 2-way mirror vdevs of 16TB drives
HDD pool storage devices: 4 x WD Ultrastar DC HC550 16TB SATA
HDD pool SLOG device: none
UPS: APC SPM1K online UPS (apcsmart over USB)
OS: TrueNAS SCALE 24.10.2.1
Applications: intranet services (SMB, Time Machine, Vaultwarden, a wiki, etc.)
VM: one for development
Uptime: more than 600 days (running 24x7)

After the problem occurred, I shut down the machine, removed the four hard drives, and installed them in a Dell PowerEdge R730xd rack server that was running the same version of TrueNAS without issues.

When I tried to import the problematic data pool through the GUI, the server crashed and restarted immediately.
After searching the forums, I tried importing it from the shell instead.

Below are the outputs of the various commands (sensitive information has been redacted).

root@x[~]# zpool import
  pool: tank_x
    id: xxxxxxxxxxxxxxxxxxx
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
	the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:

	tank_x                                    ONLINE
	  indirect-0                              ONLINE
	  mirror-1                                ONLINE
	    xxxxxxxx-d937-41c0-b751-xxxxxxxxxxxx  ONLINE
	    xxxxxxxx-662b-4ef7-bb89-xxxxxxxxxxxx  ONLINE
	  mirror-2                                ONLINE
	    xxxxxxxx-2bf0-47f5-ad4a-xxxxxxxxxxxx  ONLINE
	    xxxxxxxx-d1f6-4357-853b-xxxxxxxxxxxx  ONLINE

Force-importing the problem pool from the shell in read-write (non-read-only) mode also makes the system crash and restart.

I then imported the problem pool in read-only mode, as suggested on the forum.

root@x[~]# zpool import -o readonly=on tank_x -R /mnt
cannot import 'tank_x': pool was previously in use from another system.
Last accessed by truenas (hostid=xxxxxxxx) at Sun May  4 22:22:53 2025
The pool can be imported, use 'zpool import -f' to import the pool.

root@x[~]# zpool import -o readonly=on tank_x -R /mnt -f

root@x[~]# zpool status -v
...
  pool: tank_x
 state: ONLINE
  scan: scrub repaired 0B in 06:10:31 with 0 errors on Sun Apr 27 06:10:33 2025
remove: Removal of vdev 0 copied 754G in 1h6m, completed on Thu Oct 19 11:31:49 2023
	6.08M memory used for removed device mappings
config:

	NAME                                      STATE     READ WRITE CKSUM
	tank_x                                    ONLINE       0     0     0
	  mirror-1                                ONLINE       0     0     0
	    xxxxxxxx-d937-41c0-b751-xxxxxxxxxxxx  ONLINE       0     0     0
	    xxxxxxxx-662b-4ef7-bb89-xxxxxxxxxxxx  ONLINE       0     0     0
	  mirror-2                                ONLINE       0     0     0
	    xxxxxxxx-2bf0-47f5-ad4a-xxxxxxxxxxxx  ONLINE       0     0     0
	    xxxxxxxx-d1f6-4357-853b-xxxxxxxxxxxx  ONLINE       0     0     0

errors: No known data errors

Once the problem pool was imported read-only, I immediately used “zfs send | zfs receive” to transfer the datasets one by one to a healthy pool on the working server.

Note: if you use the default app storage of TrueNAS SCALE, remember to back up the datasets under ix-apps as well!
Use “zfs list” to see all datasets.
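
For reference, the per-dataset replication looked roughly like the sketch below. The dataset name (photos), snapshot name (auto-2025-05-04), and destination pool name (backup_pool) are placeholders for my real ones. Because the pool is imported read-only, new snapshots cannot be created on it, so an existing (e.g. periodic) snapshot has to be sent:

root@x[~]# zfs list -r -t snapshot tank_x/photos
root@x[~]# zfs send -Rv tank_x/photos@auto-2025-05-04 | zfs receive -F backup_pool/photos

Here “zfs send -R” replicates the dataset together with its child datasets, snapshots, and properties, and “zfs receive -F” allows the destination to be rolled back or overwritten if it already exists.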

Fortunately, all of the data could be read out (a lot of family memories).
After completing the backup, I rebooted into WinPE and ran Victoria to scan the four hard drives.
The quick scan found no errors; the slow full scan has not finished yet, but judging by its progress so far, a fault in the disks themselves seems unlikely.
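
As an alternative to Victoria under WinPE, smartctl inside TrueNAS can run a similar health check; the device name below is just an example, so substitute your own disks:

root@x[~]# smartctl -a /dev/sda
root@x[~]# smartctl -t long /dev/sda

The long self-test runs in the background on the drive itself; its result shows up later in the self-test log section of “smartctl -a” (or via “smartctl -l selftest”).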

The recovery method I have found on the forum so far is: after backing up the data, destroy the original data pool, create a new pool, and then write the data back to it.
I will rebuild the pool this way once the disk scans are complete.
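
From the shell, the rebuild would look roughly like the sketch below; the disk names (sda ... sdd) and dataset names are placeholders, and on TrueNAS the export/wipe/create steps are normally better done through the GUI so the middleware stays in sync. Since the damaged pool cannot be imported read-write, a clean “zpool destroy” may not be possible, so creating over the old labels with -f (or using the GUI’s disconnect-and-wipe option) is the practical route; ashift=12 assumes 4K-sector drives like these Ultrastars:

root@x[~]# zpool export tank_x
root@x[~]# zpool create -f -o ashift=12 tank_x mirror sda sdb mirror sdc sdd
root@x[~]# zfs send -Rv backup_pool/photos@auto-2025-05-04 | zfs receive -F tank_x/photos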

I am posting this record for reference by other TrueNAS users who run into the same problem.


I still have several questions about this failure and would appreciate any advice.

  1. After importing the problem pool in read-only mode, what do the “remove” and “vdev 0” lines shown by “zpool status -v” mean?
remove: Removal of vdev 0 copied 754G in 1h6m, completed on Thu Oct 19 11:31:49 2023
	6.08M memory used for removed device mappings
  2. Is there any situation in which even a read-only import fails?
  3. As I understand it, when there is no SLOG device, the ZIL is striped across the data pool’s disks.
    If the boot pool crashes because the USB SSD becomes unstable from overheating, and a write happens to be in flight at that moment, can the ZIL be damaged in a way that brings down the whole pool?
    Would using two separate USB SSDs as a mirrored boot pool avoid this?
  4. If I install an Optane drive in the machine’s free PCIe 4.0 x16 slot as a SLOG for the data pool, would that avoid this problem?
    And if a pool with a SLOG device hits this problem, do I have to install the SLOG device in the working server as well before I can import the pool in read-only mode? (A small command sketch follows this list.)
  5. Are there other possible causes of this problem, and any suggestions for avoiding it?
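
For the SLOG part of question 4, the commands involved are simple in either direction; this is only a sketch with a hypothetical NVMe device name, not a recommendation:

root@x[~]# zpool add tank_x log nvme1n1
root@x[~]# zpool remove tank_x nvme1n1

A log vdev can be removed again at any time, and as far as I know a pool whose separate log device is missing can still be imported with “zpool import -m” (at the cost of any uncommitted ZIL records), so the SLOG should not have to travel with the data disks.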

You were probably hit by the OpenZFS bug related to vdev removal.

Here is the upstream bug report. A fix was implemented (thanks to @mav) to prevent future kernel panics. However, a way to recover a pool that has already triggered the panic has yet to be implemented, since it involves a deeper dive.

Thanks for your reply!

The data pool was originally on the Dell R730xd rack server, with a SLOG device configured.
More than 600 days ago I removed the SLOG device on the R730xd and exported the pool, then moved the hard disks into the HPE MicroServer Gen10 Plus v2 and imported the pool there.
After further searching, I believe the “remove” line might have been generated by that operation.

The system had been running normally for more than 600 days and had been upgraded to the latest version without any problems.
Then, a few days ago, it suddenly crashed, with no vdev operations having been performed in between, so I am not sure whether this crash was caused by the bug you mentioned.

I originally suspected a disk error, but the disk scans found none.
What worries me now is how to avoid this problem after rebuilding the data pool.

There does not need to be a recent vdev operation to trip this bug. It’s a combination of factors: block cloning plus a vdev removal at some point in the past.

Then, upon importing the pool at a later date, it kernel panics.
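
If it helps to check whether an existing pool has both ingredients, the device-removal and block-cloning feature flags, plus the block-clone accounting properties, can be queried (pool name as in this thread; properties as in OpenZFS 2.2):

root@x[~]# zpool get feature@device_removal,feature@block_cloning tank_x
root@x[~]# zpool get bcloneused,bclonesaved,bcloneratio tank_x

A pool that shows feature@device_removal active (i.e. it has an indirect vdev, like the indirect-0 in the import output above) together with a non-zero bcloneused would have the combination described here.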


The bugfix[1] (to prevent future triggers) has already been included with upstream OpenZFS and TrueNAS SCALE since version 24.10.2.1.[2] Any new pools should be safe going forward.
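
For anyone checking their own system, the OpenZFS build that a given SCALE install is running can be printed with:

root@x[~]# zfs version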


  1. See my previous post. ↩︎

  2. The reason you still fell victim to this bug, even on the “fixed” SCALE 24.10.2.1, is that the bugfix only prevents future triggers. You had already tripped the bug, likely because of the factors mentioned, so by then it was too late. ↩︎

Thank you!

Do you have any suggestions for my current hardware configuration?
Are there any remaining risks in the boot pool setup?
If the boot pool on the external USB SSD fails while the system is running, can it damage the data pool?