ZFS Errors - what now?

FalseNAS · August 14, 2025, 7:45pm

Hi, I have 3 ZFS errors, but I don’t know what I should do now. I thought this was no problem, because on the other drive everything is good. So I assumed they would sync again and the counters would be reset to 0.

This happened after I got this alert a few days ago:

Pool state is DEGRADED: One or more devices has been removed by the administrator. Sufficient replicas exist for the pool to continue functioning in a degraded state.
The following devices are not healthy:

* Disk Samsung SSD 990 PRO 4TB S7DPNF0XXXXXX is REMOVED

I have TrueNAS-SCALE-24.10.2.4

I hope it’s not because the SSDs are always above 60 degrees. They never actually drop below 60 degrees…

Thanks…

winnielinnie · August 14, 2025, 7:50pm

Are you using m.2 heatsinks? Those temps are not ideal.

joeschmuck · August 14, 2025, 8:44pm

A few things to note here:

Never disregard a message that informs you of a problem. A few days ago is when you should have been looking into this problem. If you do not stay on top of messages like these, you are going to learn the very hard way about data loss.
As @winnielinnie said, 60C is way too hot (Upper limit is 70C for this drive). You should provide us with a list of components that you have making up this machine. Remember one thing when informing us, do not allow us to assume something. Things work much better if we don’t have to assume.
How/where are those M.2 drives mounted and do they have a heatsink? Samsung sells two models, one with and one without a heatsink.
What is on this Mirror to cause this kind of heat?

Not a good assumption. ZFS is not a file system you can just forget. It needs a little bit of attention periodically. Until you get those NVMe temps down, I would be afraid to run a scrub and superheat those drives. And those are not cheap drives.

In the meantime, post the output of zpool status so we can see those errors.

FalseNAS · August 15, 2025, 7:27am

Thank you for your responses!

Unfortunately, the temperature issue cannot be changed, as both SSDs are in a MiniPC (Gigabyte Brix Extreme) where there is no space for a heatsink (That might just barely fit, but I need to check more closely. However, the space is extremely limited). Additionally, the MiniPC doesn’t have the best airflow, and it’s currently around 24 to 30 degrees in my apartment. I had hoped that the connection through just one PCIe lane would significantly reduce the speed and thus the temperature. But it seems that is not the case. Are there perhaps other ways to throttle the SSD? It’s frustrating that the temperature is over 60 degrees even at idle. I’m not constantly writing data, yet it stays at 60 degrees.

Regarding the ZFS errors: Why can’t they be fixed? ZFS data shouldn’t affect the hardware, and the system should simply be able to check the other SSD for the data and correct everything. But apparently, that has failed. I just don’t understand it. Should I now reformat the SSD with the errors and synchronize it from scratch with the other SSD?

The SSDs were really quite expensive. It would be unfortunate if one of them is already broken. Maybe I should have chosen a different model instead of the 990 Pro? Would other models operate noticeably cooler? Perhaps PCIe 3.0 SSDs?

Thank you so much!

root@truenas:/home/truenas_admin# zpool status
  pool: Pool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 3.07G in 00:06:25 with 0 errors on Tue Aug 12 20:12:16 2025
config:

        NAME                                      STATE     READ WRITE CKSUM
        Pool                                      ONLINE       0     0     0
          mirror-0                                ONLINE       0     0     0
            eef38799-4abd-442b-815e-b7d3b2eccd1f  ONLINE       0     0     0
            38dd59db-0cce-4da2-ad93-2d2b97909cbb  ONLINE       0     0     3

errors: No known data errors

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:08 with 0 errors on Sat Aug  9 03:45:09 2025
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          nvme0n1p3  ONLINE       0     0     0

errors: No known data errors
root@truenas:/home/truenas_admin#

pmh · August 15, 2025, 8:15am

You can try to limit the power state of the devices.

Check the available states with:

root@truenas[~]# smartctl -a /dev/nvme0
[...]
Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000
[...]

and if your devices have a lower power state 2 like mine, then:

nvme set-feature /dev/nvme0 --feature-id=2 --value=2

If it works you can create an init script in System > Advanced Settings to have it set at every boot.
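For the init script, a minimal sketch might look like the one below. The function name `set_nvme_power_state` and the `DRY_RUN` switch are made up for illustration; only the `nvme set-feature` line comes from this thread. The device names are an assumption and have to match your own system.

```shell
#!/bin/sh
# Apply one NVMe power state to several controllers.
# Usage: set_nvme_power_state STATE DEVICE...
# With DRY_RUN=1 the commands are only printed, not executed.
set_nvme_power_state() {
    state=$1
    shift
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "nvme set-feature $dev --feature-id=2 --value=$state"
        else
            nvme set-feature "$dev" --feature-id=2 --value="$state"
        fi
    done
}

# Example call (hypothetical device names):
#   set_nvme_power_state 2 /dev/nvme0 /dev/nvme1
```

Running it once with `DRY_RUN=1` shows exactly what would be sent to the drives before anything is changed.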

HTH,
Patrick

pmh · August 15, 2025, 8:19am

ZFS cannot fix a failing SSD - the point of ZFS is to notice any hardware failure and inform the user so a failing device can be replaced.

joeschmuck · August 15, 2025, 12:58pm

Let’s be hypothetical for a moment: You have two NVMe drives, running very hot. They are in a mirror and all is working nicely. Now the hypothetical part, one drive has a bit error, or many bit errors. This is a hardware issue. ZFS warns you of the error but it cannot fix the hardware issue. That is very oversimplified but failures happen.

The zpool output you provided does state that the error was fixed, it resilvered and fixed it for you. You still have 3 errors on the second drive.

Run zpool clear Pool and those 3 errors should clear. This only removes the count of them happening, it does not fix anything. But if you have more errors, you will be able to see that.
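To see at a glance whether new errors show up after clearing, the counter columns of `zpool status` can be filtered. This is only a sketch: the helper name `zpool_error_rows` is made up, and it assumes the five-column NAME/STATE/READ/WRITE/CKSUM layout shown in the output posted above.

```shell
#!/bin/sh
# Read `zpool status` text on stdin and print every device row whose
# READ, WRITE or CKSUM counter is not zero.
zpool_error_rows() {
    awk '
        # Counter rows have exactly five fields and numeric counters.
        NF == 5 && $3 ~ /^[0-9]/ && $4 ~ /^[0-9]/ && $5 ~ /^[0-9]/ {
            if ($3 != "0" || $4 != "0" || $5 != "0")
                printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Example:
#   zpool status Pool | zpool_error_rows
```

An empty result after `zpool clear Pool` means the counters are back at zero; anything printed later is a new error.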

As for the overheated drives, @pmh had the same thought as I did: limit the power level the drives can enter.

If you can’t get those temps down, odds are you will continue to have problems. That is my opinion, not a fact, but hopefully that is the cause of your issue.

I want to say yes but honestly it depends on the drive. There are cooler drives on the market. If you can limit the power level that the drives enter, that should let your current drives cool down quite a bit.

What is the output of smartctl -a /dev/nvme0 (Post all of it please, you can remove the drive serial number if desired)

When looking at the power states, you have a few columns of data: St = State, Op = Operational, where a (+) means the drive runs in that state and a (-) means it is a sleep state, and Max is of course the maximum power the drive may consume at this level. I would try to set the drive to the lowest operational power level possible and see how the system works. Is it snappy enough? It should be. What are the drive temps? Try to get below 50C; I prefer under 40C myself, and under 30C when the system is idle. Of course, if your room is 30C, that will not happen.
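Picking the lowest operational state out of the table can be sketched like this. The helper name `lowest_operational_state` is made up, and the tie-break (prefer the higher state number when several states report the same Max) is my assumption, not something the drive documents.

```shell
#!/bin/sh
# Read `smartctl -a` output on stdin and print the operational (+) power
# state with the lowest Max value, as "STATE MAX".
lowest_operational_state() {
    awk '
        /^Supported Power States/ { in_table = 1; next }
        in_table && NF == 0       { in_table = 0 }   # blank line ends the table
        in_table && $2 == "+" {
            watts = $3
            sub(/W$/, "", watts)
            watts += 0
            # "<=" so that equal wattage resolves to the higher state number
            if (best == "" || watts <= min) { min = watts; best = $1; label = $3 }
        }
        END { if (best != "") printf "%s %s\n", best, label }
    '
}

# Example:
#   smartctl -a /dev/nvme0 | lowest_operational_state
```

If all operational states report the same Max, as some drives do, the result only tells you which state to try; whether it runs cooler still has to be measured.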

Post an update, I’d like to see it all work for you.

FalseNAS · August 16, 2025, 1:48pm

Thank you very much for your help! I executed the command, and I think there isn’t a lower power level for me, right? That would be the ones with 9.39W, and they are all the same size.

Could you perhaps say something about the error message in the first post? The message states that the SSD was removed by an administrator. But I didn’t do that. Is this the usual message you get when the SSD has too many ZFS errors? The message confused me a bit. Or can the ZFS errors also be caused by a faulty hardware connection that is not stable?

root@truenas:/home/truenas_admin# smartctl -a /dev/nvme2
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.44-production+truenas] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 990 PRO 4TB
Serial Number:                      S7DPNXXXXXXXXXX
Firmware Version:                   4B2QJXD7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 4,000,787,030,016 [4.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          4,000,787,030,016 [4.00 TB]
Namespace 1 Utilization:            3,974,017,331,200 [3.97 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 4b41b1f6e4
Local Time is:                      Sat Aug 16 15:40:33 2025 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0055):     Comp DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x2f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Log0_FISE_MI
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.39W       -        -    0  0  0  0        0       0
 1 +     9.39W       -        -    1  1  1  1        0       0
 2 +     9.39W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3     4200    2700
 4 -   0.0050W       -        -    4  4  4  4      500   21800

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        70 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    2%
Data Units Read:                    13,960,356 [7.14 TB]
Data Units Written:                 108,104,021 [55.3 TB]
Host Read Commands:                 82,769,116
Host Write Commands:                976,065,924
Controller Busy Time:               12,999
Power Cycles:                       18
Power On Hours:                     4,406
Unsafe Shutdowns:                   6
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    22233
Critical Comp. Temperature Time:    10051
Temperature Sensor 1:               70 Celsius
Temperature Sensor 2:               79 Celsius
Thermal Temp. 1 Transition Count:   296
Thermal Temp. 2 Transition Count:   114
Thermal Temp. 1 Total Time:         30555
Thermal Temp. 2 Total Time:         881605

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Extended          Completed without error                4365            -     -   -   -    -
 1   Extended          Completed without error                3869            -     -   -   -    -
 2   Extended          Completed without error                   0            -     -   -   -    -

root@truenas:/home/truenas_admin#

FalseNAS · August 16, 2025, 7:26pm

I believe my SSDs are never in idle because I have something running in a Windows VM that continuously writes logs. This seems to raise my temperatures to over 70 degrees… hmm.

I also just took apart the mini PC. Only a heatsink that is a maximum of 3mm thick would fit there. There aren’t really any good options for that… However, I could cut a hole in the cover at the bottom of the mini PC. Then a thicker heatsink would fit. I will probably have to do that if I want to keep the Windows VM running. Or I could run the Windows VM on another PC… we’ll see.

But it would also be good if I could somehow limit the SSD to 6 watts. I don’t need such extreme speeds. I was just looking for the fast access times of such SSDs. But I haven’t been able to set that limit so far.

awasb · August 16, 2025, 10:12pm

Adding to Patrick’s hints above:

https://wiki.archlinux.org/title/Solid_state_drive/NVMe#Power_saving

Before drilling holes, etc., I’d max out those settings. On my NVMe-driven system, which is reasonably cooled, this makes all the difference (20°C!).
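As a rough illustration of what those settings act on: the linked page deals with autonomous power state transitions (APST), where the kernel allows the drive to drop into its sleep states only if their latency fits under a configured limit (`nvme_core.default_ps_max_latency_us`). The sketch below lists the sleep states from a smartctl table that would fit under a given limit. Treat the rule it applies (entry plus exit latency must not exceed the limit) as my assumption about the exact check, and the helper name `apst_allowed_states` as made up.

```shell
#!/bin/sh
# Read `smartctl -a` output on stdin and print the non-operational (-)
# power states whose Ent_Lat + Ex_Lat fits under MAX_US microseconds.
# Usage: apst_allowed_states MAX_US
apst_allowed_states() {
    max_us=$1
    awk -v max_us="$max_us" '
        /^Supported Power States/ { in_table = 1; next }
        in_table && NF == 0       { in_table = 0 }   # blank line ends the table
        # The last two columns are entry and exit latency in microseconds.
        in_table && $2 == "-" && ($(NF-1) + $NF) <= max_us + 0 { print $1 }
    '
}

# Example, reading the currently configured limit from sysfs:
#   limit=$(cat /sys/module/nvme_core/parameters/default_ps_max_latency_us)
#   smartctl -a /dev/nvme0 | apst_allowed_states "$limit"
```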

joeschmuck · August 16, 2025, 11:05pm

@FalseNAS Those results are disturbing, the lower levels are all 9.39 Watts. But that is Maximum. Have you tried to lower the power level?

First check your SSD temp smartctl -a /dev/nvme0 | grep -i "temp" and note the value.
WARNING: Do not screw around with the nvme command. Consider it as dangerous as FORMAT or “Hit me with a hammer”. ONLY perform the commands listed.

Let’s set the maximum power level allowed.
At the CLI, logged in as root, enter the command nvme set-feature /dev/nvme0 --feature-id=2 --value=2

Let’s read the current power level and see if it is in level 2.
Next enter nvme get-feature /dev/nvme0 --feature-id=2 and it should return Current Value:00000002, we hope.

Watch your temp drop on nvme0.
If that is true, now check your SSD temp again smartctl -a /dev/nvme0 | grep -i "temp"

If this works, let us know and you can then set up the set-feature line for each nvme drive in the init scripts. That is two entries since you have two drives. K.I.S.S.
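The before/after temperature comparison in the steps above can be sketched as two small helpers. The names `nvme_temp_c` and `temp_delta` are made up; the parsing assumes the `Temperature:` line format that smartctl prints for NVMe drives, as seen earlier in this thread.

```shell
#!/bin/sh
# Read `smartctl -a` output on stdin and print the drive temperature in
# degrees Celsius (the "Temperature:" line, not the per-sensor lines).
nvme_temp_c() {
    awk '$1 == "Temperature:" { print $2; exit }'
}

# Print the change between two readings.
# Usage: temp_delta BEFORE AFTER
temp_delta() {
    before=$1
    after=$2
    echo "$before -> $after ($((after - before)) C)"
}

# Example:
#   before=$(smartctl -a /dev/nvme0 | nvme_temp_c)
#   nvme set-feature /dev/nvme0 --feature-id=2 --value=2
#   sleep 600
#   after=$(smartctl -a /dev/nvme0 | nvme_temp_c)
#   temp_delta "$before" "$after"
```

The ten-minute wait is an arbitrary choice; the point is to give the drive time to settle before taking the second reading.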

FalseNAS · August 17, 2025, 9:21am

I believe that didn’t work. The value was 3, and at some point, it jumps back to 4 and then back to 3 again

root@truenas:~# nvme set-feature /dev/nvme1 --feature-id=2 --value=2
set-feature:0x02 (Power Management), value:0x00000002, cdw12:00000000, save:0
root@truenas:~# nvme get-feature /dev/nvme1 --feature-id=2
get-feature:0x02 (Power Management), Current value:0x00000003
root@truenas:~# nvme get-feature /dev/nvme1 --feature-id=2
get-feature:0x02 (Power Management), Current value:0x00000003
root@truenas:~# nvme get-feature /dev/nvme1 --feature-id=2
get-feature:0x02 (Power Management), Current value:0x00000004
root@truenas:~#

Krill · August 17, 2025, 9:29am

You may want to look at the link below:

FalseNAS · August 17, 2025, 9:53am

But I don’t understand that. He wants to run the SSD permanently in mode 1 (or later he tried 0). However, we had already established that modes 0 and 2 do not differ at all. Am I misunderstanding something?

Krill · August 17, 2025, 10:03am

Just sharing the link as it may be useful information given you are using the Samsung 990 Pro drives. Depending on when you got them, you could consider returning them as faulty, as there are known challenges with using them.

joeschmuck · August 17, 2025, 2:29pm

Looks like it worked to me. Remember, you are looking to restrict the power level from going into state 1 or 2. Power Level 3 is really 2 (zero = 1). Power level 4 is a very low power state.

Linux will change the power states, going up and down. You only limited it to the third power state.
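To follow those changes over time, the number can be pulled out of the `get-feature` line. A small sketch, with the helper name `current_power_state` made up; it assumes the output format posted above and relies on the NVMe rule that the low five bits of the Power Management value hold the power state number.

```shell
#!/bin/sh
# Read `nvme get-feature ... --feature-id=2` output on stdin and print the
# current power state number as a decimal.
current_power_state() {
    hex=$(sed -n 's/.*Current value:\(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1)
    [ -n "$hex" ] || return 1
    # Bits 4:0 are the power state; higher bits carry a workload hint.
    echo $(( hex & 0x1f ))
}

# Example: sample the state once per second for ten seconds.
#   for i in 1 2 3 4 5 6 7 8 9 10; do
#       nvme get-feature /dev/nvme1 --feature-id=2 | current_power_state
#       sleep 1
#   done
```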

What about the NVMe temperatures? Did they drop at all? If all your power states are identical as indicated, then you may not see any temperature drop. This was a test, done in the hope that the lower levels actually use less power. If it didn’t work, then we have at least tried something. I’d like to see if the temps dropped.

What was reported by the NVMe drive may not have been accurate. That is what the best outcome would have been. This is why examining the temps is important. You may be screwed in purchasing high end professional server type drives, meant to run full speed all the time.

You could water cool the drives. Grab a bucket of cold water, submerge the computer into it. Shazam! Water Cooled! Ha Ha, don’t really do that. Sometimes a little levity is needed.

FalseNAS · August 17, 2025, 3:58pm

Unfortunately, nothing has changed. All 3 SSDs are now around 57 degrees without the Windows VM. This is significantly cooler, but you think that’s not enough, right? However, according to the NVMe commands, nothing has changed; they were already consistently at 57 degrees before. By the way, there is already a heatsink installed on the system SSD, and there are also air holes in the case at that location. It’s funny that there is no temperature difference compared to the SSDs that don’t have a heatsink and have no air holes, where the heat is extremely trapped. But the system SSD has minimal data to write once per minute. The other two SSDs are really continuously idle

joeschmuck · August 17, 2025, 4:22pm

I thought the NVMe drives were running at 60C to 70C. 57C is an improvement, just not a great improvement.

As for air holes, they only work if you have forced air flow, and then the airflow must flow across the NVMe chips. A small case with virtually no airflow and chips that create a lot of heat and were meant for more of a server or high-end use application, this is not good. I understand what you were trying to do but it didn’t work out.

I have no idea if you can do this but can you slow down your PCIe lanes? This will reduce the data transfer rate and should produce less heat.
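Before changing anything, it helps to know what link the drive has actually negotiated. On a typical Linux kernel this is exposed in sysfs; the sketch below reads it. The helper name `nvme_link_info` and the `SYSFS_ROOT` override are made up, and I am assuming the usual `/sys/class/nvme/<controller>/device` layout applies on TrueNAS SCALE as well. It only reads values; it does not change the link speed.

```shell
#!/bin/sh
# Print the negotiated and maximum PCIe link of one NVMe controller.
# Usage: nvme_link_info nvme0
# SYSFS_ROOT can be overridden to try the function against a fake tree.
nvme_link_info() {
    ctrl=$1
    dev="${SYSFS_ROOT:-/sys}/class/nvme/$ctrl/device"
    echo "$ctrl: $(cat "$dev/current_link_speed") x$(cat "$dev/current_link_width")" \
         "(max $(cat "$dev/max_link_speed") x$(cat "$dev/max_link_width"))"
}

# Example:
#   nvme_link_info nvme0
```

A drive that already shows a x1 link is running well below its maximum, so further limiting the link may not buy much.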

Short of telling you to purchase new hardware, I’m not sure what else you can do. How about a photo of the case and the holes you are talking about. Maybe you can obtain a small 5VDC fan (USB) and secure it to the holes to force air inside, it will come out somewhere. You may need to think outside the box.

FalseNAS · August 19, 2025, 6:51pm

So, I now have 57 degrees, and I see that I’m writing or reading almost no data. So this is the actual idle temperature… and it’s already significantly throttled with just one PCIe lane connected.

And yes… I have also thought about new hardware. A dream would be a compact computer with a server motherboard and ECC RAM. But the Gigabyte Brix Extreme would always be significantly smaller and nicer. Additionally, I would have to spend over €1800 to meet my requirements for a proper ECC RAM server. Maybe something for the future.

I will now simply remove the Windows VM and never run anything with continuous load again. I was able to delete the errors thanks to you. Now everything looks good again for the time being. If I have time, I might open up the mini PC and implement the heatsinks. But I won’t do more than that now. I mean, I also use RAID 1. Both drives are unlikely to fail simultaneously due to heat.

Thanks for your help!