I have just copied all my media from an SMB share on my old TrueNAS Core box onto a similar SMB share on my new TrueNAS SCALE box. I did this using ROBOCOPY within Windows to mirror (/MIR) the paths.
I have set up the share in, as far as I can see, the exact same way, and am doing some testing before I decommission the old box. The dataset settings also look the same.
I am finding some weird behaviour, whereby, when I play back a piece of media in VLC from a client, if I jump ahead to later timestamps in the video several times in a row, playback then stops. I have to restart the playback manually. I also sometimes see heavy artifacting when jumping ahead, but not all the time.
I can reproduce this pretty reliably on the new box.
If I run the same test against the old box/share, I do not see this behaviour.
If I copy the media to my local machine from both shares and try playback locally, the original server’s copy is absolutely fine, but the new server’s copy gives me massive artifacting, sometimes takes a very long time to complete the jump ahead, and can’t even generate a file preview thumbnail.
I could reattempt to copy all the media over again in the event it was corrupted during the copy, but I’m worried I’ll just see the same behaviour again.
Might anyone have any thoughts on what the root cause of this is, and is it more likely to be a media problem, or a TrueNAS problem?
Note: I’ve tried with several different pieces of content, but given how much I have, I can’t test it all!
The best way of transferring data from one ZFS pool to another ZFS pool (on the same or different boxes) is by using ZFS replication.
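For reference, a rough sketch of what that looks like from a shell. The pool/dataset names, snapshot name, and hostname below are placeholders, and TrueNAS can also drive the same thing through a Replication Task in the UI:

```bash
# Take a recursive snapshot of the source dataset (names are examples only):
zfs snapshot -r tank/media@migrate

# Send the full snapshot stream to the new box over SSH and receive it
# into the destination pool without mounting it yet:
zfs send -R tank/media@migrate | ssh root@new-nas zfs receive -u tank/media
```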
Also, I assume from other things you have said that you are running VLC over SMB (and not e.g. DLNA). I don’t have any knowledge which would help explain this behaviour, but the means by which VLC is accessing the stream is going to be important to know.
This could be a network card (or loose cable) issue.
We’ve seen this before in the old forums, in regards to image corruption, even though the files themselves (stored in the ZFS dataset) are fine.
What network cards are involved?
Have you run a checksum on the files (independent of what ZFS does)? Checksum the file on the server(s) locally (not over the network), and then again on the client locally.
Just remember not to rely only on a checksum run over the network. If you do run one that way, you must also do so locally (via an SSH session on the server), so that you can at least compare everything.
Generate a checksum of:
Local file on SCALE server
Local file on Core server
Local file on client
Over-the-network on SCALE server
Over-the-network on Core server
SHA1 is good enough for this. The method doesn’t matter, as long as you use the same hash function for all tests.
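For example, from an SSH session on each box, something along these lines (the path is a placeholder, so point it at one of your real files); on the Windows client, `certutil -hashfile <file> SHA1` or PowerShell’s `Get-FileHash -Algorithm SHA1` gives you the equivalent:

```bash
# Run this locally on both the Core and SCALE servers, pointing at the
# same file inside the dataset (example path only):
sha1sum /mnt/tank/media/TV/test-episode.mkv
```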
SHA1 hashes of a particular test file locally on both boxes match.
Copying to my local machine from both boxes: for the old box the hash matches the local one, but for the copy from the new box the hash is different.
Hashing over the network, the old box matches local, but I keep getting an unexpected network exception when hashing the new box.
Old Local - b99505ed63a84659e25209b1b7420f03e2541a7d
New Local - b99505ed63a84659e25209b1b7420f03e2541a7d
Old client copy - B99505ED63A84659E25209B1B7420F03E2541A7D
New client copy - 55A4B2D22B3424845EE7DFCFDB8F7066E4F29168
Old network - B99505ED63A84659E25209B1B7420F03E2541A7D
New network - Fails
Given I used the same adapter to complete the replication originally, and via SMB, I’m confused as to why the hash would be different when reading it back over SMB, especially when the local hashes appear to match.
Looking like a network card issue on the “New” server.
It’s the common variable.
Wrong checksum when copied from the New server to the client.
Hashing fails on the New server when you check it over the network.
Same adapter? As in the same make/model, or are you referring to the actual physical device?
It should also be noted that ZFS replication is different than file-based copying. The ZFS replication did not accept corrupt blocks, and likely retried them multiple times until they transferred and verified on the other end.
I would be wary about using this network adapter and would seriously consider replacing it. (Or at least check to see if it’s seated properly, and maybe even change out the cable, since the cable connected to the interface could be the issue.)
The good news is that the files that already reside on the New server are likely okay, as confirmed when you run a checksum on them locally. However, be careful about transferring files from this server to any other location, as you could indeed be copying over corrupt files. You need to remove the offender, whether it is the network card or the cabling.
These types of issues are usually only caught when something else triggers further investigation: video corruption, image corruption, streaming issues, and so on. So you should be glad that VLC was glitching out.
EDIT: Do you have another network interface on the New server that you could use in the meantime, until you resolve the current issue? (Even if it’s a downgrade in speeds.)
EDIT 2: I’m not well versed on “fake” cards, but from what I recall, the person who suffered corruption over-the-network (in the older forums) might have been scammed into buying a “legit” Intel NIC, when in fact it was likely an imitation.
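If you want some quick evidence before swapping hardware, the driver, model, and error counters are easy to check from a shell on the SCALE box. The interface name below is a placeholder, so substitute whatever `ip link` shows for your 10Gb port:

```bash
# Identify the card and driver behind the interface (replace enp3s0 with yours):
ethtool -i enp3s0
lspci | grep -i ethernet

# Watch for rx/tx error, drop, and CRC counters climbing while you repeat
# the SMB copy or the over-the-network hash:
ip -s link show enp3s0
ethtool -S enp3s0 | grep -iE 'err|drop|crc'
```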
When I say “same adapter”, what I mean is that the data was copied from the old server to the new server over the same Mellanox ConnectX card/port that I am currently using to access the files/share.
I should also pick my words more carefully, as when I say “replicated” I should have said “copied”. I did not use ZFS replication to copy this data from one server to another; I just copied directly between the SMB shares.
I do have some other 10Gb cards spare, and also some onboard 1Gb ports, so can quite easily test the same process over different interfaces/cards to be sure. I also have some more DAC cables I can use for testing too, although the ones I currently have plugged in are brand new, unlike the adapters.
Is there a simple way to do a checksum on every file in a dataset, and export the results? I’d like to checksum both shares to see if I need to re-copy the data, as it’s 20TB worth, so having to copy it again will take another age, haha.
That’s why it’s recommended to use ZFS replication between servers, since not only is it more efficient, but you are 100% assured that you have a block-for-block copy of the filesystem, and that each block was verified before the replication is considered “finished”.
To run a hash on 20TB worth of data is going to take a long time. Then you’ll need to generate a list of hashes (with paths and filenames) from both servers, so that you can remove the entries that match. Whatever remains on the “Old server” list, you can assume got corrupted on the New server.
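A minimal sketch of that, assuming both datasets are mounted at something like /mnt/tank/TV (the paths and filenames below are placeholders):

```bash
# On each server, hash every file in the dataset and save the list:
cd /mnt/tank/TV
find . -type f -print0 | xargs -0 sha1sum | sort -k 2 > /root/tv-hashes-$(hostname).txt

# Pull both lists onto one machine (renamed old/new here for clarity) and
# print only the entries unique to the old list; those are the suspect files:
comm -23 <(sort /root/tv-hashes-old.txt) <(sort /root/tv-hashes-new.txt)
```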
How come you didn’t use ZFS replication? Is this “New” server going to replace the Old server? Or is it a backup destination?
Fair one.
In this instance, the server is being replaced.
I didn’t even think about ZFS replication, as it wasn’t something I’d looked into or used before. Now that it’s been suggested, I will use it going forward when doing this kind of process again.
I appreciate the hashes will take a long time to calculate, but I assumed it would be quicker than re-copying the data.
Perhaps not though. Maybe it is easier to just do a new ZFS replication and start over.
I’ll get the NIC sorted first then weigh up my options.
What happens if the destination block (before being committed to disk on the destination) fails the ZFS checksum? ZFS doesn’t allow a “mostly good” replication. If it’s not 100% block-for-block, then you won’t have the snapshot on the other end.
That’s why if you did a completed ZFS replication, you’re 100% assured all the blocks are identical. If you copy over SMB, it might “finish”, but you have no assurances. (Especially if the network card is failing, as seen in the old forums, and likely what is being seen here.)
Just like reading a file from a ZFS dataset. If the block is corrupt (say on a member drive in a mirror vdev), it seamlessly reattempts from the other drive on the mirror (and tries to replace the corrupt block with a known good copy).
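One way to convince yourself of that after a replication finishes: a snapshot that was received intact keeps the same GUID as the source snapshot, so you can compare the two ends. The dataset and snapshot names below are placeholders:

```bash
# On the source box:
zfs get -H -o value guid tank/media@migrate
# On the destination box:
zfs get -H -o value guid tank/media@migrate
# If the snapshot exists on the destination at all and the GUIDs match,
# the stream was received and verified in full.
```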
Is that to imply that ZFS will just “fail” upon the first corrupt block, without retrying the same block from the source to the destination stream?
Wondering if I can get some additional advice as part of this:
The replication of my dataset has completed, so I opened the snapshot and cloned it to a new dataset.
This now shows under my pool name TV-Pool with a dataset name of TV.
I promoted the dataset to unlink from the snapshot, which I understand is correct procedure.
Oddly, if I select the dataset and click Add SMB share, it sets the path as /mnt/TV-Pool/TV.
If I go to shares and click add, then manually browse the pool, if I select the TV-Pool it does not show me any child datasets, and instead shows me all the folders that are/were in the TV dataset on the old box (almost as if everything has bumped up one level in the filesystem).
Have I done something wrong here? TrueNAS rightly warns me that adding a share on the root of the pool has potential negative consequences.
I was expecting it to show me a child dataset, on which I would create the share.
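For completeness, roughly the CLI equivalent of the steps I took, in case it helps spot where it went wrong (the replicated dataset and snapshot names are placeholders for whatever my replication task created):

```bash
# Clone the received snapshot to a new dataset and detach it from its origin:
zfs clone TV-Pool/replicated-TV@auto-snap TV-Pool/TV
zfs promote TV-Pool/TV

# Confirm TV really is a child dataset with its own mountpoint:
zfs list -r -o name,mountpoint TV-Pool
```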
Yikes. Is there a network device somewhere in the middle that could be at fault? Perhaps a particular port on a switch?
Otherwise, you could be looking at bad RAM. (Though that should also manifest when a hash is run locally, which is why I think it’s something on the network side that is at fault.)
I would let a memtest run overnight on the new machine.
EDIT: Double check to confirm that you get the same SHA1 hash on both servers when the command is run locally.
I will do another local hash of both in the morning.
I thought I may have had bad RAM at one point, so have already done a memtest, and it all came back as a pass.
My client is wirelessly connected to a UniFi AP, which connects to a UniFi access switch. This connects to a UniFi aggregation switch, which has the new machine on it.
The old machine is connected to another UniFi access switch, which in turn connects to the aggregation switch (however, the issue also existed when both old and new machines were connected directly to the aggregation switch).
So there is a maximum of 2 switches between the client and the new machine, and 3 between the client and the old machine. If either the first access switch or the aggregation switch were at fault, I’d expect to see issues with both old and new machines.
I think both machines have been moved to different ports since starting all this, so again that feels unlikely.
I don’t remember if I changed the DAC cables in the end, so may try that again.
I just did a local hash on both machines before going to bed, and they are identical when run locally.
Maybe, as well as swapping the DACs, I’ll connect my client directly to the new machine and rerun the SMB share hash. If that comes back OK, then you could be right, and it could be a network switch/device in between that is failing.