CORE: File corruption on SMB file writes, any particularly advanced tips to assist?

diskdiddler · July 17, 2024, 9:02am

Hi All,

My system has been great for years but I’m currently suffering a very serious problem and I’m running out of ideas of what it could be.

Several systems read files from TrueNAS and write them back over SMB, and the destination files end up corrupted. I have confirmed this 100% with a binary file compare.
(This is, obviously, very, very bad.)

Current testing involves the following:

cp testfile.bin /mnt/tnas-smb/compare/compare1 && cp testfile.bin /mnt/tnas-smb/compare/compare2 && cp testfile.bin /mnt/tnas-smb/compare/compare3 && cp testfile.bin /mnt/tnas-smb/compare/compare4

A full binary file compare confirms that, on most tries, the files are NOT all identical.
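
For anyone repeating this test, here is a minimal sketch of how the compare step could be scripted with plain `cmp`. The function name `compare_copies` is made up for illustration and is not what was actually run in this thread; it simply reports, per copy, whether it matches the original and roughly how many bytes differ.

```shell
#!/bin/sh
# Hypothetical helper: compare each copy against the original file.
# Usage: compare_copies ORIGINAL COPY...
compare_copies() {
    src=$1; shift
    for f in "$@"; do
        if cmp -s "$src" "$f"; then
            echo "identical: $f"
        else
            # cmp -l prints one line per differing byte (within the shared length)
            n=$(cmp -l "$src" "$f" 2>/dev/null | wc -l | tr -d ' ')
            echo "DIFFERS ($n bytes): $f"
        fi
    done
}
```

Called as `compare_copies testfile.bin /mnt/tnas-smb/compare/compare1 /mnt/tnas-smb/compare/compare2`, it prints one line per copy. The byte count gives a rough idea of whether the damage is a few flipped bytes or whole blocks.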

If I run the same command and copy the files from TrueNAS to TrueNAS directly, it seems to consistently work.
I can confirm that the problem occurs on both of my pools: one is 6x16TB drives, the other is 6x2TB SSDs (Z2).
The system has been dead stable for me.

Here’s where the problem gets significantly more frustrating and complicated.
(Seriously stop reading here, this is ridiculous stuff)

Machine1: Proxmox host, HP Mini 12th Gen machine running:
Proxmox (obviously)
DietPiVM
UbuntuLXC provided by proxmox template.
UbuntuVM 22.04
UbuntuVM 24.04
Windows10VM

Machine2:
Spare laptop, Ubuntu 22.04

The DietPi VM infrequently corrupts files
The Ubuntu 22.04 VM corrupts files very often and is very slow to copy
The UbuntuLXC infrequently corrupts files
Windows10 doesn’t seem to corrupt files.
Ubuntu 24.04 doesn’t seem to corrupt files
Even the proxmox host will corrupt files (!!!) if I SSH into it, mount TrueNAS and copy from and back to TrueNAS…
The spare laptop with Ubuntu 22.04 is not corrupting files.

I would be fine with all the VMs and LXCs corrupting files on proxmox, but I can’t seem to fault the 24.04 UbuntuVM. (built specifically to replace the 22.04 VM with the most issues)

None of this makes any sense to me at all.
What other testing should I be doing?

I’m going to loop in @pmh because I know he’s particularly skilled but at this point I’m quite lost and frustrated. I’m going to destroy the SSDs with writes at this rate just trying to isolate what systems do and do not cause this problem to occur.

pmh · July 17, 2024, 9:32am

I’m a skilled systems engineer, yes - but if the software itself (ZFS and/or Samba) does not behave as it should, I have no idea, sorry.

You could try to isolate SMB as the potential culprit by transferring files via SSH or sending them with netcat over the wire, then checking if they also get corrupted.

neofusion · July 17, 2024, 10:07am

Missing a lot of details here.
You barely have information on the hardware involved and, from what I can see, make no mention whatsoever of what TN Core is running on.

My go-to cause in this scenario would be memory corruption on the source, since you say local copy on your TN system works.

Test your boxes thoroughly: not a single memtest pass, but a proper stress test.

diskdiddler · July 17, 2024, 10:25am

Let me just clarify here:
The source IS the TrueNAS machine (read off, back to it)

I am reading a file from the NAS via SMB.
\\nas\share\SOURCEFILE
I am then copying the file back to the NAS, one path deeper for testing.
\\nas\share\foldercopy1\DESTINATIONFILE

Even if the “source” was corrupt, the corruption should be copied.

Copying the same file by using SSH directly to TrueNAS and using
cp /mnt/pool/real-actual/nas/filesystem/path/source
to
/mnt/pool/real-actual/nas/filesystem/path/destination

Will not result in corruption.

Hardware:

64GB ECC
Using SATA drives across the board
I have a 120GB Optane M.2 drive as cache for the SSD pool.
I have a 120GB Optane M.2 drive as cache for the HDD pool.

I have 0 crashes, hangs or any other issues.
This is a “server” system which will report ECC issues to me too.

diskdiddler · July 17, 2024, 10:26am

Do you mean via SCP? I just tried this and I’m getting
Permission denied (publickey,password)

For both the root account and mine, so I’ll have to fiddle.

Netcat, no idea. I’ll google it.

pmh · July 17, 2024, 10:39am

On some external system:

nc -l 4444 >somefile

On TrueNAS:

nc -N ip.of.external.system 4444 <somefile

Then to copy it back on the external system:

nc -l 4444 <somefile

On TrueNAS:

nc ip.of.external.system 4444 >somefile.copy

If this leaves the file intact, we can eliminate the network or memory of the TrueNAS host in my opinion.
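
Since the original and the round-tripped copy can sit on different machines, comparing checksums is one way to check that the file is intact. The sketch below is an assumption-laden helper, not something from this thread: `digest` and `same_file` are hypothetical names, and it assumes `sha256sum` is available on Linux and `sha256 -q` on the FreeBSD base system under TrueNAS CORE.

```shell
#!/bin/sh
# Hypothetical helper: print the SHA-256 of a file on Linux or FreeBSD.
digest() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum < "$1" | cut -d' ' -f1   # GNU coreutils
    else
        sha256 -q "$1"                     # FreeBSD base system (assumed)
    fi
}

# Succeeds only if both files have the same SHA-256.
same_file() {
    [ "$(digest "$1")" = "$(digest "$2")" ]
}
```

On TrueNAS, `same_file somefile somefile.copy && echo intact` would check the round trip. Running `digest somefile` on the external system as well shows whether the first leg already changed the data.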

neofusion · July 17, 2024, 10:53am

Okay, the data still passes through a second system in that case; even though the source and destination are both the TrueNAS system, a second system is handling the data.

If something is unstable on that second system, it can corrupt the data in transit.

There are new-ish ZFS features that let you do near instant copies between different datasets on the same pool, but I am not sure if those have been enabled in CORE.

diskdiddler · July 17, 2024, 11:00am

The secondary system has at least 1 VM which I simply can’t fault. I’m trying to fault it now, but I’m writing 16GB at a time to my SSD pool to test this.

I do agree, in theory it sounds like the second system has faults, except it runs 3 VMs and an LXC with no crashes, no file issues (except files it writes to TrueNAS) etc.

It runs multiple reliable docker containers in my network consistently.

I’m testing netcat now thanks to @pmh - SCP is throwing nasty errors at me, so I’ll start with NC

winnielinnie · July 17, 2024, 2:14pm

Sounds like a network card issue.

Read the linked thread if you want to assess it yourself and do some quick troubleshooting.

Consider replacing the network card (or adding a new NIC if the one you’re using is integrated.)

diskdiddler · July 17, 2024, 2:46pm

It’s def possible but I’d like to test fully.

Unfortunately netcat is giving me problems as well, I’m trying to see what the cause of that is too.
(Netcat / nc seems to transfer the file but does NOT drop back to the command prompt, as if it’s stuck waiting for a final packet, or for a cache to fill up, before finally committing it all to disk.)

I don’t feel confident testing binary file reliability when I’ve had to Ctrl-C at the end of the file copy to end the copy task.

Both Ubuntu and TrueNAS behave this way.

pmh · July 17, 2024, 4:22pm

BSD nc has the -N flag to close the network connection on EOF on stdin. Linux nc might have a similar one. And then iX might have installed nc from ports instead of using the one in the base system.

Check the respective docs, i.e. man pages.

winnielinnie · July 17, 2024, 9:37pm

I think this is the default behavior for Linux.

diskdiddler · August 1, 2024, 11:44pm

Thanks all for helping here. Just so others know: this was isolated to the TSO and GSO offload functionality of the e1000e driver on my Proxmox host.

This is the fix: ethtool -K eno1 tso off gso off
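
To confirm the setting took effect, the lowercase `ethtool -k` prints the current offload state. The sketch below filters that output for the two features named above; `offloads_still_on` is a hypothetical helper name, and it assumes the usual `feature-name: on|off` layout of `ethtool -k` output.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -k <iface>` output on stdin and
# prints "tso" and/or "gso" for each of those offloads that is still on.
offloads_still_on() {
    awk -F': ' '
        $1 == "tcp-segmentation-offload"     && $2 ~ /^on/ { print "tso" }
        $1 == "generic-segmentation-offload" && $2 ~ /^on/ { print "gso" }
    '
}
```

Running `ethtool -k eno1 | offloads_still_on` should print nothing once both are off. Settings made with `ethtool -K` are normally lost on reboot; one common way to reapply them on a Debian-based host is a `post-up` line in /etc/network/interfaces, though that is an assumption and not something confirmed in this thread.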

This fault was wildly intermittent and painful. To replicate it consistently and isolate the cause, I estimate I copied somewhere in the range of 1 to 2 TB of data to my NAS and to an SMB share on a desktop PC of mine.
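
For an intermittent fault like this, a copy-and-compare loop saves a lot of manual effort. The following is only a sketch of that idea, not the procedure actually used here: `repro_loop` is a hypothetical name, and it keeps any corrupted copy for later inspection while deleting good ones.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC into DESTDIR a number of times and
# count how many copies differ from the original.
# Usage: repro_loop SRC DESTDIR RUNS
repro_loop() {
    src=$1; dest=$2; runs=$3
    bad=0; i=1
    while [ "$i" -le "$runs" ]; do
        cp "$src" "$dest/copy_$i"
        if cmp -s "$src" "$dest/copy_$i"; then
            rm -f "$dest/copy_$i"          # good copy, discard it
        else
            bad=$((bad + 1))               # keep the bad copy for inspection
            echo "run $i: copy differs"
        fi
        i=$((i + 1))
    done
    echo "$bad of $runs copies corrupted"
    [ "$bad" -eq 0 ]
}
```

For example, `repro_loop testfile.bin /mnt/tnas-smb/compare 20` runs twenty rounds against the SMB mount. One caveat: `cmp` reads the copy back through the same client, so client-side caching could in principle hide or distort results with small files; checking the files on the NAS itself is the safer confirmation.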

Trying another network card earlier would obviously have been the logical move, but the fact that a few VMs simply didn’t seem to trip the fault made this no fun to chase.

What I am happy about is that I know what data my Proxmox machine interacts with on my NAS (non-important data) and, more importantly, my faith in TrueNAS reliability remains intact.

It’s a doozy!