Returning to healthy ZFS, can't delete errored files

Setup:
Proxmox VE 7.4-16 with HBA Passthrough with DELL PERC H310 PCIe-x8 (IT Mode)
Total RAM: 48GB
TrueNAS-13.0-U6.3
TrueNAS RAM: 16GB
PC: DELL T3620 Tower
CPU: i7 6700
Drives: 2x6TB WD mirrored, 1x3TB WD, 1x4TB SSD

This setup has run without problems for 18 months and never showed any errors, but two pools now show as unhealthy: the mirrored 6TB pool and the unmirrored 3TB pool.

I am trying to return these to healthy by running "zpool clear", scrubbing, and then deleting the files with permanent errors.
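For reference, a minimal sketch of that sequence as shell commands. It only prints the commands unless RUN=1 is set; the `run` and `pool_cleanup_plan` helpers are made up for illustration, while `zpool status -v`, `zpool clear` and `zpool scrub` are the standard subcommands. Deleting or restoring the listed files is left as a manual step.

```shell
# Print each command, or execute it when RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Hypothetical helper: the usual clean-up order for one pool.
pool_cleanup_plan() {
    pool=$1
    run zpool status -v "$pool"   # list the files with permanent errors
    # ... delete or restore the listed files here ...
    run zpool clear "$pool"       # reset the pool's error counters
    run zpool scrub "$pool"       # start a scrub (it runs in the background)
    run zpool status -v "$pool"   # re-check; repeat once the scrub has finished
}

pool_cleanup_plan Andys6TBPool
```

The error list can linger in `zpool status -v` until a later scrub completes, so one pass may not be enough to show a clean pool.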

When I try to delete the errored files, TrueNAS crashes on me. I have tried deleting them via SMB and then via the shell, and also adjusting permissions, but TrueNAS always crashes when I try to delete them.

I don't mind losing any of these files to get back to healthy. Why would TrueNAS crash when a delete attempt is made?

The output of zpool status -v is:

pool: Andys3TB_WD_Pool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 04:27:55 with 4 errors on Sun Dec 1 00:33:02 2024
config:

NAME                                          STATE     READ WRITE CKSUM
Andys3TB_WD_Pool                              ONLINE       0     0     0
  gptid/6c1e95e8-373d-11ec-a175-bb7c5929b21a  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    /mnt/Andys3TB_WD_Pool/WD Red 3TB/tv/adamstv/Arrow/Season 1/Arrow.S01E03.1080p.BluRay.AV1-PTNX.mkv
    /mnt/Andys3TB_WD_Pool/tv/andystv/Industry/Season 3/Industry.S03E07.1080p.HEVC.x265-MeGusta[EZTVx.to].mkv[eztvx.to].mkv

pool: Andys6TBPool
state: ONLINE
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: Message ID: ZFS-8000-8A — OpenZFS documentation
scan: scrub repaired 0B in 07:03:26 with 13 errors on Sun Dec 1 07:03:28 2024
config:

NAME                                            STATE     READ WRITE CKSUM
Andys6TBPool                                    ONLINE       0     0     0
  mirror-0                                      ONLINE       0     0     0
    gptid/2055b4c7-2e57-11ec-817f-09ccd2eaeaf2  ONLINE       0     0     0
    gptid/20cd049a-2e57-11ec-817f-09ccd2eaeaf2  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/02 - After The Love (10 Glorious Years Mix).mp3
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/04 - Inside Your Mind.mp3
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/06 - Throw.mp3
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/09 - Witch Doktor (Original Mix).mp3
    Andys6TBPool/Music:<0x13f28>
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/12 - 4 You (Jules and Skins Mix).mp3
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - Ministry Of Sound Classics - Various/CD2/16 - Voices In My Mind (Original Mix).mp3
    /mnt/Andys6TBPool/Music/music to be tagged/Various/mos/Various - MOS - Clubbers Guide Summer 2006
    Andys6TBPool/Music:<0x13f30>
    Andys6TBPool/Music:<0x13f34>
    Andys6TBPool/Music:<0x13f36>
    Andys6TBPool/Music:<0x13f3b>
    Andys6TBPool/Music:<0x13f3f>
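To work with a list like this without retyping paths, the entries can be pulled out of the `zpool status -v` text. The following is only a sketch: `list_permanent_errors` is a made-up helper name, and the sample input uses a hypothetical pool called tank with shortened paths.

```shell
# List the entries under "Permanent errors" from `zpool status -v` output on stdin.
# Entries of the form dataset:<0x...> are object IDs with no file path attached.
list_permanent_errors() {
    awk '
        /^errors: Permanent errors/ { in_list = 1; next }  # an error list starts here
        /^[[:space:]]*pool: /       { in_list = 0 }        # the next pool block begins
        in_list && NF {
            sub(/^[[:space:]]+/, "")                       # drop the leading indent
            print
        }
    '
}

# Example with a fragment shaped like the output above (hypothetical names).
list_permanent_errors <<'EOF'
  pool: tank
errors: Permanent errors have been detected in the following files:

        /mnt/tank/Music/CD2/06 - Throw.mp3
        tank/Music:<0x13f28>
EOF
```

On a live system the input would come from `zpool status -v | list_permanent_errors`. Paths containing spaces come out intact, one per line.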

I see a panic in messages at the time of the TrueNAS crash. Is this relevant?

Dec 1 00:00:00 truenas syslog-ng[1142]: Configuration reload request received, reloading configuration;
Dec 1 00:00:00 truenas syslog-ng[1142]: Configuration reload finished;
Dec 1 10:11:28 truenas syslog-ng[1156]: syslog-ng starting up; version='3.35.1'
Dec 1 10:11:28 truenas panic: VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) failed
Dec 1 10:11:28 truenas cpuid = 0
Dec 1 10:11:28 truenas time = 1733047837
Dec 1 10:11:28 truenas KDB: stack backtrace:
Dec 1 10:11:28 truenas db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe011df96ac0
Dec 1 10:11:28 truenas vpanic() at vpanic+0x17f/frame 0xfffffe011df96b10
Dec 1 10:11:28 truenas spl_panic() at spl_panic+0x3a/frame 0xfffffe011df96b70
Dec 1 10:11:28 truenas free_children() at free_children+0x402/frame 0xfffffe011df96bf0
Dec 1 10:11:28 truenas free_children() at free_children+0x2bd/frame 0xfffffe011df96c70
Dec 1 10:11:28 truenas dnode_sync_free_range() at dnode_sync_free_range+0x20c/frame 0xfffffe011df96cf0
Dec 1 10:11:28 truenas range_tree_walk() at range_tree_walk+0x80/frame 0xfffffe011df96d50
Dec 1 10:11:28 truenas dnode_sync() at dnode_sync+0x315/frame 0xfffffe011df96de0
Dec 1 10:11:28 truenas sync_dnodes_task() at sync_dnodes_task+0x89/frame 0xfffffe011df96e20
Dec 1 10:11:28 truenas taskq_run() at taskq_run+0x1f/frame 0xfffffe011df96e40
Dec 1 10:11:28 truenas taskqueue_run_locked() at taskqueue_run_locked+0x181/frame 0xfffffe011df96ec0
Dec 1 10:11:28 truenas taskqueue_thread_loop() at taskqueue_thread_loop+0xc2/frame 0xfffffe011df96ef0
Dec 1 10:11:28 truenas fork_exit() at fork_exit+0x7e/frame 0xfffffe011df96f30
Dec 1 10:11:28 truenas fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe011df96f30
Dec 1 10:11:28 truenas --- trap 0, rip = 0xffffffff80aa32ef, rsp = 0, rbp = 0xffffffff832f8fa0 ---
Dec 1 10:11:28 truenas mi_startup() at mi_startup+0xdf/frame 0xffffffff832f8fa0
Dec 1 10:11:28 truenas swapper() at swapper+0x69/frame 0xffffffff832f8ff0
Dec 1 10:11:28 truenas btext() at btext+0x22
Dec 1 10:11:28 truenas KDB: enter: panic
Dec 1 10:11:28 truenas ---<<BOOT>>---
Dec 1 10:11:28 truenas Copyright (c) 1992-2021 The FreeBSD Project.
Dec 1 10:11:28 truenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Dec 1 10:11:28 truenas The Regents of the University of California. All rights reserved.
Dec 1 10:11:28 truenas FreeBSD is a registered trademark of The FreeBSD Foundation.
Dec 1 10:11:28 truenas FreeBSD 13.1-RELEASE-p9 n245431-b8ec9bde091 TRUENAS amd64
Dec 1 10:11:28 truenas FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
Dec 1 10:11:28 truenas VT(vga): text 80x25
Dec 1 10:11:28 truenas CPU: Common KVM processor (3408.10-MHz K8-class CPU)
Dec 1 10:11:28 truenas Origin="GenuineIntel" Id=0xf61 Family=0xf Model=0x6 Stepping=1
Dec 1 10:11:28 truenas Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
Dec 1 10:11:28 truenas Features2=0x80202001<SSE3,CX16,x2APIC,HV>
Dec 1 10:11:28 truenas AMD Features=0x20100800<SYSCALL,NX,LM>
Dec 1 10:11:28 truenas AMD Features2=0x1
Dec 1 10:11:28 truenas Hypervisor: Origin = "KVMKVMKVM"
Dec 1 10:11:28 truenas real memory = 17179869184 (16384 MB)
Dec 1 10:11:28 truenas avail memory = 16605192192 (15835 MB)
Dec 1 10:11:28 truenas Event timer "LAPIC" quality 100
Dec 1 10:11:28 truenas ACPI APIC Table:
Dec 1 10:11:28 truenas FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
Dec 1 10:11:28 truenas FreeBSD/SMP: 1 package(s) x 2 core(s)
Dec 1 10:11:28 truenas random: unblocking device.
Dec 1 10:11:28 truenas ioapic0 <Version 1.1> irqs 0-23
Dec 1 10:11:28 truenas Launching APs: 1
Dec 1 10:11:28 truenas random: entropy device external interface
Dec 1 10:11:28 truenas kbd1 at kbdmux0
Dec 1 10:11:28 truenas vtvga0:
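To separate the panic and its backtrace from the boot messages that follow it, a simple filter helps. This is a sketch only: `show_panic` is a made-up helper, and /var/log/messages is assumed to be where these lines were read from.

```shell
# Print the panic message, KDB markers and backtrace frames from syslog-style text.
# With no argument it reads stdin; otherwise pass a log file such as /var/log/messages.
show_panic() {
    grep -E ' (panic: |KDB: |[A-Za-z_0-9]+\(\) at )' "$@"
}

# Example on a few lines taken from the log above.
show_panic <<'EOF'
Dec 1 10:11:28 truenas panic: VERIFY(BP_GET_FILL(db->db_blkptr) == 0 || db->db_dirtycnt > 0) failed
Dec 1 10:11:28 truenas cpuid = 0
Dec 1 10:11:28 truenas KDB: stack backtrace:
Dec 1 10:11:28 truenas free_children() at free_children+0x402/frame 0xfffffe011df96bf0
Dec 1 10:11:28 truenas real memory = 17179869184 (16384 MB)
EOF
```

In the example the cpuid and memory lines are dropped, leaving the VERIFY message and the frames (free_children, dnode_sync_free_range and so on) that show the panic happened while ZFS was freeing blocks.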

Should I do a clean install of TrueNAS and reload my saved config?

Posting links to two Proxmox posts. You need to protect boot as well.

I would start by running long SMART tests on the drives and checking all connections. I would also run a memory test such as Memtest.

Virtualize TrueNAS

Thanks for the links. I can run some long SMART and memory tests.

But could any of this be the cause of TrueNAS crashing when I try to delete the errored files?

Disk or memory problems are both possible. I’m hoping some others comment on your issue, but the SMART and RAM tests are a good start. Make sure your Proxmox / TrueNAS setup is doing the passthrough and blacklisting so Proxmox doesn’t interfere with the TrueNAS drives.

I am concerned because you have errors on two different pools. We can verify that the HBA is in IT mode and check its firmware version. Run the command below in a shell window and post the results back using Preformatted text (ctrl+e). It looks like </> on the toolbar where you type your replies.

sas2flash -list && sas3flash -list

A long SMART test on the 3.5" 3TB drive passed everything. I’ve started a long test on one of the mirrored 3.5" 6TB drives.
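For anyone following along, a rough sketch of the commands behind a long SMART test. The `smart_long_test_cmds` helper and the device names /dev/da0 and /dev/da1 are placeholders, not the devices in this system; `-t long` and `-l selftest` are standard smartctl options. The helper only prints the commands so they can be checked before being run.

```shell
# Print the smartctl commands for an extended self-test on each given device.
smart_long_test_cmds() {
    for dev in "$@"; do
        echo "smartctl -t long $dev"       # start the extended (long) self-test
        echo "smartctl -l selftest $dev"   # read the self-test log once it has finished
    done
}

# Placeholder device names; substitute the real ones.
smart_long_test_cmds /dev/da0 /dev/da1
```

A long test on a multi-terabyte spinning drive typically takes several hours, so the log is worth checking the next day.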

My original errors were on the 4TB SSD and the 3.5" 6TB drives. I RMA’d the SSD back to Samsung and it came back error free. The 3TB errors were probably caused by me running rsync to empty the 4TB SSD so I could send it off while also working on the permanent errors on the 6TB pool, which may have caused a reboot during the rsync.

I do pass the HBA through to the TrueNAS VM via PCI passthrough. I still need to look at blacklisting, but isn’t that a SCALE-only issue?

I’ll try and run the mem test tomorrow.

root@truenas[~]# sas2flash -list && sas3flash -list
LSI Corporation SAS2 Flash Utility
Version 16.00.00.00 (2013.03.01) 
Copyright (c) 2008-2013 LSI Corporation. All rights reserved 

	Adapter Selected is a LSI SAS: SAS2008(B2)   

	Controller Number              : 0
	Controller                     : SAS2008(B2)   
	PCI Address                    : 00:00:10:00
	SAS Address                    : 5c81f66-0-e411-f100
	NVDATA Version (Default)       : 14.01.00.08
	NVDATA Version (Persistent)    : 14.01.00.08
	Firmware Product ID            : 0x2213 (IT)
	Firmware Version               : 20.00.07.00
	NVDATA Vendor                  : LSI
	NVDATA Product ID              : SAS9211-8i
	BIOS Version                   : N/A
	UEFI BSD Version               : N/A
	FCODE Version                  : N/A
	Board Name                     : 6Gbps SAS HBA
	Board Assembly                 : N/A
	Board Tracer Number            : N/A

	Finished Processing Commands Successfully.
	Exiting SAS2Flash.
Avago Technologies SAS3 Flash Utility
Version 16.00.00.00 (2017.05.02) 
Copyright 2008-2017 Avago Technologies. All rights reserved.

	No Avago SAS adapters found! Limited Command Set Available!
	ERROR: Command Not allowed without an adapter!
	ERROR: Couldn't Create Command -list
	Exiting Program.
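The two fields that matter for the IT-mode check can be picked out of the `sas2flash -list` output with a short filter. This is a sketch only; `hba_firmware_summary` is a made-up helper name, and the sample lines are copied from the output above.

```shell
# Pull the firmware product ID and version out of `sas2flash -list` output on stdin.
hba_firmware_summary() {
    awk -F ':' '
        /Firmware Product ID/ { gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2); id = $2 }
        /Firmware Version/    { gsub(/^[[:space:]]+|[[:space:]]+$/, "", $2); ver = $2 }
        END { printf "product_id=%s version=%s\n", id, ver }
    '
}

# Example with the two relevant lines from the output above.
hba_firmware_summary <<'EOF'
	Firmware Product ID            : 0x2213 (IT)
	Firmware Version               : 20.00.07.00
EOF
```

Here the "(IT)" suffix on the product ID is the part that shows the card is running IT firmware, and the version field reads 20.00.07.00.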

Thanks for your help on this.

Memtest86 showed over 10,000 memory failures on the pair of 8GB Samsung DIMMs from 2015 and 2 errors on the pair of 16GB Crucial DIMMs from 2021. I have ordered two new 16GB Crucial DIMMs to replace the 8GB Samsung DIMMs and will retest the memory when they go in. The two Crucials only showed one error on the 2nd pass; I didn’t run the 3rd and 4th passes. I’m hoping I will be able to delete the errored files once the new memory is in.
