My TrueNAS failed and I don't know why

Hi!

Three days ago, I was working on Gramps Web. Everything was going well, and I shut down the system as usual. (I don't remember closing it the fast way rather than the safe way - you know, holding the power button.)

The next day I turned on the computer; the basic HP system started up, but when I went to the 'Network' tab in Windows, I didn't see the server. I decided to check what was going on in the web UI. I logged into the TrueNAS dashboard, and it turned out that it wasn't showing me basic data such as CPU usage, pools, etc.

Pic #1

When I restarted the server to find out what was going on inside, I saw this.

Pic #2

"Pool 'name of pool' has encountered an uncorrectable I/O failure and has been suspended." That was the first warning I saw. ix-zfs.service attempted to start and retried repeatedly until it reached its maximum start-up time; TrueNAS then booted, but again without the disks, etc.

The TrueNAS dashboard kept telling me that it couldn't download the update because it didn't have enough disk space (only 1.56 GB). Could the fact that I have two versions of TrueNAS on my SSD be the problem? I have both 24.10.2.2 and 24.10.2.4.

I had these two processes in my notifications: pool.dataset.sync_db_keys and pool.import_on_boot.

Pic #3

[EDIT: 2025-11-02 18:00]

I took out all the HDDs and reinstalled TrueNAS on the SSD. I imported my .tar file to restore all the settings from the previous version. Unfortunately, it again reported "Pool 'name of pool' has encountered an uncorrectable I/O failure and has been suspended." and again used up the full timeout on the ix-zfs.service process.

Pic #4

[EDIT: 2025-11-02 20:40]

I was trying to reconnect the HDDs pool from the web UI. 45 minutes - still nothing. I then requested a shutdown of the NAS and had to wait a full 4-5 minutes; it took a very long time to shut down with the HDDs attached.

Pic #4

[EDIT: 2025-11-02 23:00]

I took out all the HDDs and wanted to see how fast the system works without them. It shut down in seconds.


Link to photos: https://imgur.com/a/3W8VSaz

Any clues as to where I could start investigating what went wrong? I’m a complete novice when it comes to TN, so I need more detailed explanations of what could have gone wrong.


MY SETUP:
TrueNAS: I was on 24.10.2.4, but after reinstalling to the SSD it's 24.10.2.2
HP ProLiant MicroServer Gen8 G1610T 2.3 GHz
CPU: Intel Core i3-3240 3.40 GHz
RAM: 16 GiB
STORAGE: 3x Seagate IronWolf Pro 4 TB and a KIOXIA EXCERIA 480 GB 2.5" SATA SSD

Welcome to TrueNAS and its forums!

In general, pictures can be problematic to view here in the forums.

If you can supply the output of these commands, from user root via Linux Shell, this would help. Naturally we would need the pool disks re-inserted into the computer.

zpool import
lsblk -b

The first will show us your pool layout, and hopefully what errors are preventing its import. The second will show all the disks in the computer.


From zpool import:

pool: HDDs
    id: 4978944288524160861
 state: ONLINE
status: The pool was last accessed by another system.
action: The pool can be imported using its name or numeric identifier and
        the '-f' flag.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-EY
config:

        HDDs                                      ONLINE
          raidz1-0                                ONLINE
            0bcf2220-2d83-4202-b749-1696e5d64c4f  ONLINE
            1652031b-6ef6-48e6-a74c-60204cb3cb61  ONLINE
            6b3c1fb3-8f41-4066-a123-78e3407a3870  ONLINE

From lsblk -b:

NAME   MAJ:MIN RM          SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 4000787030016  0 disk 
└─sda1   8:1    0 4000785104896  0 part 
sdb      8:16   0 4000787030016  0 disk 
└─sdb1   8:17   0 4000785104896  0 part 
sdc      8:32   0  480103981056  0 disk 
├─sdc1   8:33   0       1048576  0 part 
├─sdc2   8:34   0     536870912  0 part 
└─sdc3   8:35   0  479563947520  0 part 
sdd      8:48   0 4000787030016  0 disk 
└─sdd1   8:49   0 4000785104896  0 part

Perfect, thank you.

Now we try what it says, using the -f flag. But, based on what error you got, you are likely going to get it again. Include the text of the command’s output:

zpool import -f -R /mnt HDDs

So yeah, I've got nothing back from the shell.
How long should I leave the server working on it right now?
(I put some images from the Reporting page on Imgur - pictures #2.2.1 and #2.2.2.)

Are you running TrueNAS on Proxmox?

Nope. It’s HexOS/TrueNAS.

What do you mean, nothing from the shell?

If the command line returned to a new prompt, that is perfect. The pool was imported without errors. You can check with zpool status -v HDDs. If it looks good, then export from the command line with zpool export HDDs, and import from the GUI.
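In case it helps, the shell-side sequence for that success path would look roughly like this (just a sketch, assuming the HDDs pool name from your zpool import output):

zpool status -v HDDs     # confirm the pool and all three disks show ONLINE with no errors
zpool export HDDs        # release the command-line import so the GUI can take over

After the export, the pool should be importable from the GUI (in TrueNAS that is the Import Pool option under Storage).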

If the import command hung, or gave some error(s), then that is useful information too.

Note that most of us here won’t know the HexOS GUI…


Since you are running TrueNAS CE, please edit the initial post of this thread and remove the “CORE” tag, if possible.

TrueNAS CORE is an entirely different product and not what you are running, so that is misleading.

I've done it for the user, added a few other tags like HexOS and TrueNAS-CE, and made a new one, "Import-problem". We seem to have a plethora of these, so having a dedicated tag would seem useful.


I mean that I entered the command and nothing came back - it hung.

Same here. It didn't give anything back. Null. Just a new line and blank space.

Yes, I know. I use the TrueNAS GUI more than HexOS, and it's the same in this situation. In the HexOS GUI you can do nothing in this kind of situation.

Well, that is not good. A simple import should not take more than a minute (really much less). So ZFS is likely still trying to see whether the pool can be imported.

If it comes back before you give up, let us know any error message or results.

If it does not import, then please supply the output of these 3 commands:

zdb -l /dev/sda1 | grep txg: | head -1
zdb -l /dev/sdb1 | grep txg: | head -1
zdb -l /dev/sdd1 | grep txg: | head -1

Prefix the commands with sudo if you are not user root.

Basically, the TXG (write transaction group) should be the same for all 3 disks. If they get way out of sync, it is bad. In some cases we have never figured out why the TXGs get out of sync.
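If it is easier to run, the same check can be done as a small loop (just a sketch, using the same three partitions as above; prefix with sudo if needed):

for d in sda1 sdb1 sdd1; do
    printf '%s: ' "$d"                       # label each line with the partition name
    zdb -l "/dev/$d" | grep txg: | head -1   # first txg: line from the ZFS label
done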

Known causes of extreme out of sync TXG are:

  • Hardware RAID
  • Poor implementation virtualizing TrueNAS
  • SMR disks
  • Disk failure during disk replacement with reboot

Pretty extreme cases, but we will see.

sda1: txg: 85837
sdb1: gave me nothing.
sdd1: txg: 85837 (the same as the first one)

Doesn't HexOS have its own support channels, which are not here?


I suggest that anyone who needs to troubleshoot TrueNAS do so in the Debian forums, since SCALE/CE is powered by Debian. It’s only fair.


HexOS still runs on top of a TrueNAS environment, so I'd rather get good answers about the underlying system than ask people who will probably just send me back to the TrueNAS GUI anyway. You can't do as much in HexOS as you can in the TrueNAS GUI. (That's only my lay opinion.)

I could be way off - but to me, if a single disk failed to return the txg while the other two agree on '85837', then something is wrong with that specific disk, and as long as the pool has enough redundancy the pool can be fixed.

What does smartctl -a /dev/sdb1 return? Had you set up SMART tests & scrubs to run periodically before this failure?
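As a side note, a long self-test can also be started manually from the shell (just a sketch, using the sdb name from the earlier lsblk output; the extended test runs inside the drive and typically takes several hours on a 4 TB disk):

smartctl -t long /dev/sdb    # start an extended self-test; it runs in the drive's background
smartctl -a /dev/sdb         # check again later; the result appears in the self-test log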

The one thing I'm not sure about is whether the pool can be imported simply for the purpose of removing/replacing that disk, or whether one should attempt to import the pool in a degraded state (without the problem disk attached) and then replace the missing drive & resilver the pool.

I think there was some worry about HexOS on the forums when it was announced, and the possible flood of [insert choice of adjective] users asking for help on the TrueNAS forums - some folks may be more opinionated on where support should be directed… As this seems to have nothing to do with HexOS at a glance, and seems to be a reasonable question about how the TrueNAS OS can be used to restore a failed pool, I think it is a reasonable ask for help.

Personally I keep calling Oracle - sadly they keep rejecting my calls.


This is bad, and likely the reason why the pool won't import, or is taking forever. However, it is confusing that the pool import scan showed all 3 disks as ONLINE.

First, make sure you don't have a typo (or that I had a typo). Try the command again, without the grep and head commands:

zdb -l /dev/sdb1

If it still shows no useful output, then see below.

One test could be to attempt to import the pool without the sdb disk. Power down the server, and physically remove the power & SATA cables from sdb. I'd also do the pool import read-only just to see if it works:

zpool import -f -R /mnt -o readonly=on HDDs

If R/O works and you see everything you expect to see, then we can look at exporting the pool and re-importing without the R/O option.
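If that read-only import does look healthy, the follow-up would be roughly (again just a sketch, with the HDDs pool name):

zpool export HDDs                 # release the read-only import
zpool import -f -R /mnt HDDs      # retry a normal read-write import
zpool status -v HDDs              # confirm the pool state and look for errors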

Note, when importing a pool from the Linux shell command line, the GUI and sharing won't know about the pool. This is why troubleshooting from the Linux shell command line is useful, but not the complete answer to getting a TrueNAS server functional.


As for HexOS support occurring here, supposedly people using HexOS pay for a license, so I don't see why they should not have their own forums and support system. TrueNAS CE is a Community Edition, thus no fee. No one here is paid, and everyone is free to ignore any problem, regardless of the reason.

To be clear, if we get a flood of HexOS users wanting help, I may not get involved in some of those requests compared to TrueNAS. But, so far that has not happened.


Speaking of such, we could just as easily point the HexOS users with pool problems to FreeBSD, because OpenZFS is the default file system.

But in reality, we almost need a serious ZFS troubleshooting forum, that is not FreeBSD, not TrueNAS / HexOS and not Linux with ZFS. We have started to see some unexplainable pool corruption that prevents pool import. It appears to have only cropped up since TrueNAS SCALE / CE became the primary free version of TrueNAS.

Thus, I really want to blame the Linux kernel for doing something it should not. Like caching writes it thinks are being updated too often, so it figures it can optimize out 90% of them. Except during a power loss or OS crash, that leaves ZFS labels or TXGs out of sync!!!

But, we should probably stick to the user’s issue.

=== START OF INFORMATION SECTION ===
Device Model:     ST4000NT001-3M2101
Serial Number:    WX11J1XR
LU WWN Device Id: 5 000c50 0faa1f9f9
Firmware Version: EN01
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Tue Nov  4 13:56:07 2025 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:         (0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  567) seconds.
                                        Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:           (1) minutes.
Extended self-test routine
recommended polling time:         (383) minutes.
Conveyance self-test routine
recommended polling time:           (2) minutes.
SCT capabilities:              (0x50bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate       0x000f   083   065   044    Pre-fail  Always       -       188492102
3 Spin_Up_Time              0x0003   092   092   000    Pre-fail  Always       -       0
4 Start_Stop_Count          0x0032   100   100   020    Old_age   Always       -       98
5 Reallocated_Sector_Ct     0x0033   100   100   010    Pre-fail  Always       -       0
7 Seek_Error_Rate           0x000f   069   060   045    Pre-fail  Always       -       7929758
9 Power_On_Hours            0x0032   100   100   000    Old_age   Always       -       224
10 Spin_Retry_Count         0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count        0x0032   100   100   020    Old_age   Always       -       99
18 Unknown_Attribute        0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   049   000    Old_age   Always       -       36 (Min/Max 23/36)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       56
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       482
194 Temperature_Celsius     0x0022   036   051   000    Old_age   Always       -       36 (0 20 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   100   000    Old_age   Offline      -       143 (33 217 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       15998332584
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       7824604792

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
1        0        0     Not_testing
2        0        0     Not_testing
3        0        0     Not_testing
4        0        0     Not_testing
5        0        0     Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

I don't know why, but currently the boot-pool is on the sda drive, so the naming order of the drives has changed. Is that normal?

Probably not…
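Either way, ZFS tracks the pool members by their partition UUIDs rather than by the /dev/sdX names (those are the long IDs shown in your zpool import output), so you can check which sdX currently maps to which member with:

ls -l /dev/disk/by-partuuid    # symlinks from partition UUIDs to the current sdX names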