I have just copied all my media from an SMB share on my old TrueNAS Core box onto a similar SMB share on my new TrueNAS SCALE box. I did this using ROBOCOPY within Windows to mirror (/MIR) the paths.
I have set up the share in, as far as I can see, the exact same way, and am doing some testing before I decommission the old box. The dataset settings also look the same.
I am finding some weird behaviour, whereby, when I play back a piece of media in VLC from a client, if I jump ahead to later timestamps in the video several times in a row, playback then stops. I have to restart the playback manually. I also sometimes see heavy artifacting when jumping ahead, but not all the time.
I can reproduce this pretty reliably on the new box.
If I run the same test against the old box/share, I do not see this behaviour.
If I copy the media to my local machine from both shares and try playback locally, the original server’s copy is absolutely fine, but the new server’s copy gives me massive artifacting, sometimes takes a very long time to complete the jump ahead, and can’t even generate a file preview thumbnail.
I could reattempt to copy all the media over again in the event it was corrupted during the copy, but I’m worried I’ll just see the same behaviour again.
Might anyone have any thoughts on what the root cause of this is, and is it more likely to be a media problem, or a TrueNAS problem?
Note: I’ve tried with several different pieces of content, but given how much I have, I can’t test it all!
The best way of transferring data from one ZFS pool to another ZFS pool (on the same or different boxes) is by using ZFS replication.
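For reference, a rough sketch of what that looks like from a shell. The pool/dataset names, snapshot name, and hostname below are placeholders, and TrueNAS can also drive the same thing through a Replication Task in the UI:

```bash
# Take a recursive snapshot of the source dataset (names are examples only):
zfs snapshot -r tank/media@migrate

# Send the full snapshot stream to the new box over SSH and receive it
# into the destination pool without mounting it yet:
zfs send -R tank/media@migrate | ssh root@new-nas zfs receive -u tank/media
```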
Also, I assume from other things you have said that you are running VLC over SMB (and not e.g. DLNA). I don’t have any knowledge which would help explain this behaviour, but the means by which VLC is accessing the stream is going to be important to know.
This could be a network card (or loose cable) issue.
We’ve seen this before in the old forums, in regards to image corruption, even though the files themselves (stored in the ZFS dataset) are fine.
What network cards are involved?
Have you run a checksum on the files (independent of what ZFS does)? Checksum the file on the server(s) locally (not over the network), and then again on the client locally.
Just remember not to rely only on a checksum run over the network. If you do run one that way, you must also do so locally (via an SSH session on the server), so that you can at least compare everything.
Generate a checksum of:
Local file on SCALE server
Local file on Core server
Local file on client
Over-the-network on SCALE server
Over-the-network on Core server
SHA1 is good enough for this. The method doesn’t matter, as long as you use the same hash function for all tests.
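For example, from an SSH session on each box, something along these lines (the path is a placeholder, so point it at one of your real files); on the Windows client, `certutil -hashfile <file> SHA1` or PowerShell’s `Get-FileHash -Algorithm SHA1` gives you the equivalent:

```bash
# Run this locally on both the Core and SCALE servers, pointing at the
# same file inside the dataset (example path only):
sha1sum /mnt/tank/media/TV/test-episode.mkv
```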
SHA1 hashes of a particular test file locally on both boxes match.
Copying to my local machine from both boxes: for the old box the hash matches the local one, but for the copy from the new box the hash is different.
Hashing over the network, the old box matches local, but I keep getting an unexpected network exception when hashing the new box.
Old Local - b99505ed63a84659e25209b1b7420f03e2541a7d
New Local - b99505ed63a84659e25209b1b7420f03e2541a7d
Old client copy - B99505ED63A84659E25209B1B7420F03E2541A7D
New client copy - 55A4B2D22B3424845EE7DFCFDB8F7066E4F29168
Old network - B99505ED63A84659E25209B1B7420F03E2541A7D
New network - Fails
Given I used the same adapter to complete the replication originally, and via SMB, I’m confused as to why the hash would be different when reading it back over SMB, especially when the local hashes appear to match.
Looking like a network card issue on the “New” server.
It’s the common variable.
Wrong checksum when copied from the New server to the client.
Hashing fails on the New server when you check it over the network.
Same adapter? As in the same make/model, or are you referring to the actual physical device?
It should also be noted that ZFS replication is different than file-based copying. The ZFS replication did not accept corrupt blocks, and likely retried them multiple times until they transferred and verified on the other end.
I would be wary about using this network adapter and would seriously consider replacing it. (Or at least check to see if it’s seated properly, and maybe even change out the cable, since the cable connected to the interface could be the issue.)
The good news is that the files that already reside on the New server are likely okay, as confirmed when you run a checksum on them locally. However, be careful about transferring files from this server to any other location, as you could indeed be copying over corrupt files. You need to remove the offender, whether it is the network card or the cabling.
These types of issues are usually only caught when something else triggers further investigation: video corruption, image corruption, streaming issues, and so on. So you should be glad that VLC was glitching out.
EDIT: Do you have another network interface on the New server that you could use in the meantime, until you resolve the current issue? (Even if it’s a downgrade in speeds.)
EDIT 2: I’m not well versed on “fake” cards, but from what I recall, the person who suffered corruption over-the-network (in the older forums) might have been scammed into buying a “legit” Intel NIC, when in fact it was likely an imitation.
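If you want some quick evidence before swapping hardware, the driver, model, and error counters are easy to check from a shell on the SCALE box. The interface name below is a placeholder, so substitute whatever `ip link` shows for your 10Gb port:

```bash
# Identify the card and driver behind the interface (replace enp3s0 with yours):
ethtool -i enp3s0
lspci | grep -i ethernet

# Watch for rx/tx error, drop, and CRC counters climbing while you repeat
# the SMB copy or the over-the-network hash:
ip -s link show enp3s0
ethtool -S enp3s0 | grep -iE 'err|drop|crc'
```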
When I say “same adapter”, what I mean is that the data was copied from the old server to the new server over the same Mellanox ConnectX card/port that I am currently using to access the files/share.
I should also pick my words more carefully, as when I say “replicated” I should have said “copied”. I did not use ZFS replication to copy this data from one server to another; I just copied directly between the SMB shares.
I do have some other 10Gb cards spare, and also some onboard 1Gb ports, so can quite easily test the same process over different interfaces/cards to be sure. I also have some more DAC cables I can use for testing too, although the ones I currently have plugged in are brand new, unlike the adapters.
Is there a simple way to do a checksum on every file in a dataset, and export the results? I’d like to checksum both shares to see if I need to re-copy the data, as it’s 20TB worth, so having to copy it again will take another age, haha.
That’s why it’s recommended to use ZFS replication between servers, since not only is it more efficient, but you are 100% assured that you have a block-for-block copy of the filesystem, and that each block was verified before the replication is considered “finished”.
To run a hash on 20TB worth of data is going to take a long time. Then you’ll need to generate a list of hashes (with paths and filenames) from both servers, so that you can remove the entries that match. Whatever remains on the “Old server” list, you can assume got corrupted on the New server.
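A minimal sketch of that, assuming both datasets are mounted at something like /mnt/tank/TV (the paths and filenames below are placeholders):

```bash
# On each server, hash every file in the dataset and save the list:
cd /mnt/tank/TV
find . -type f -print0 | xargs -0 sha1sum | sort -k 2 > /root/tv-hashes-$(hostname).txt

# Pull both lists onto one machine (renamed old/new here for clarity) and
# print only the entries unique to the old list; those are the suspect files:
comm -23 <(sort /root/tv-hashes-old.txt) <(sort /root/tv-hashes-new.txt)
```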
How come you didn’t use ZFS replication? Is this “New” server going to replace the Old server? Or is it a backup destination?
Fair one.
In this instance, the server is being replaced.
I didn’t even think about ZFS replication, as it wasn’t something I’d looked into or used before. Now that it’s been suggested, I will use it going forward when doing this kind of process again.
I appreciate the hashes will take a long time to calculate, but I assumed it would be quicker than re-copying the data.
Perhaps not though. Maybe it is easier to just do a new ZFS replication and start over.
I’ll get the NIC sorted first then weigh up my options.
What happens if the destination block (before being committed to disk on the destination) fails the ZFS checksum? ZFS doesn’t allow a “mostly good” replication. If it’s not 100% block-for-block, then you won’t have the snapshot on the other end.
That’s why if you did a completed ZFS replication, you’re 100% assured all the blocks are identical. If you copy over SMB, it might “finish”, but you have no assurances. (Especially if the network card is failing, as seen in the old forums, and likely what is being seen here.)
Just like reading a file from a ZFS dataset. If the block is corrupt (say on a member drive in a mirror vdev), it seamlessly reattempts from the other drive on the mirror (and tries to replace the corrupt block with a known good copy).
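One way to convince yourself of that after a replication finishes: a snapshot that was received intact keeps the same GUID as the source snapshot, so you can compare the two ends. The dataset and snapshot names below are placeholders:

```bash
# On the source box:
zfs get -H -o value guid tank/media@migrate
# On the destination box:
zfs get -H -o value guid tank/media@migrate
# If the snapshot exists on the destination at all and the GUIDs match,
# the stream was received and verified in full.
```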
Is that to imply that ZFS will just “fail” upon the first corrupt block, without retrying the same block from the source to the destination stream?
Wondering if I can get some additional advice as part of this:
The replication of my dataset has completed, so I opened the snapshot and cloned it to a new dataset.
This now shows under my pool name TV-Pool with a dataset name of TV.
I promoted the dataset to unlink from the snapshot, which I understand is correct procedure.
Oddly, if I select the dataset and click Add SMB share, it sets the path as /mnt/TV-Pool/TV.
If I go to shares and click add, then manually browse the pool, if I select the TV-Pool it does not show me any child datasets, and instead shows me all the folders that are/were in the TV dataset on the old box (almost as if everything has bumped up one level in the filesystem).
Have I done something wrong here? TrueNAS rightly warns me that adding a share on the root of the pool has potential negative consequences.
I was expecting it to show me a child dataset, on which I would create the share.
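For completeness, roughly the CLI equivalent of the steps I took, in case it helps spot where it went wrong (the replicated dataset and snapshot names are placeholders for whatever my replication task created):

```bash
# Clone the received snapshot to a new dataset and detach it from its origin:
zfs clone TV-Pool/replicated-TV@auto-snap TV-Pool/TV
zfs promote TV-Pool/TV

# Confirm TV really is a child dataset with its own mountpoint:
zfs list -r -o name,mountpoint TV-Pool
```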
Yikes. Is there a network device somewhere in the middle that could be at fault? Perhaps a particular port on a switch?
Otherwise, you could be looking at bad RAM. (Though that should also manifest when a hash is run locally, which is why I think it’s something on the network side that is at fault.)
I would let a memtest run overnight on the new machine.
EDIT: Double check to confirm that you get the same SHA1 hash on both servers when the command is run locally.
I will do another local hash of both in the morning.
I thought I may have had bad RAM at one point, so have already done a memtest, and it all came back as a pass.
My client is wirelessly connected to a UniFi AP, which connects to a UniFi access switch. This connects to a UniFi aggregation switch, which has the new machine on it.
The old machine is connected to another UniFi access switch, which in turn connects to the aggregation switch (however, the issue also existed when both old and new machines were connected directly to the aggregation switch).
So there is a maximum of 2 switches between the client and the new machine, and 3 between the client and the old machine. If either the first access switch or the aggregation switch were at fault, I’d expect to see issues with both old and new machines.
I think both machines have been moved to different ports since starting all this, so again that feels unlikely.
I don’t remember if I changed the DAC cables in the end, so may try that again.
I just did a local hash on both machines before going to bed, and they are identical when run locally.
Maybe, as well as swapping the DACs, I’ll connect my client directly to the new machine and rerun the SMB share hash. If that comes back OK, then you could be right, and it could be a network switch/device in between that is failing.