Resilvering Stuck after changing disk array connector

Stux · October 5, 2024, 3:22pm

You have a great many read, write and checksum errors across all disks.

Something is not right with the eSATA setup. Perhaps the cable.

I/O has been suspended.

A restart would probably resume it… but perhaps you should restore the USB and rescrub until you can figure out the issue.

PS: You’re using RaidZ1. You don’t have 96TB, but rather 96TiB.

Protopia · October 5, 2024, 3:54pm

I disagree completely with this statement.

Losing your data is the most expensive part of running a storage server. Disks are the next most expensive part of a storage server. And the cost of administrator time spent reorganising the storage when you need to rebalance multiple small drives on a non-ZFS server is probably the third biggest cost.

Whilst ZFS does want more memory for its ARC, that gives you a lot of performance gain anyway, and you would need the same amount of memory for any other kind of caching - or just live with the performance without much ARC. Besides which, memory is relatively cheap these days.

So IMO using ZFS is actually the most cost effective storage technology.

Yes - I knew that my extrapolation was probably nuts, but I wanted to make it clear that it is likely to take an unreasonably long time due to the size of the pool combined with the multiplexed bandwidth.

If it is 1TB/hr when you have 8x SATA connections to 8x drives, I would expect it to be several times slower when these 8x SATA connections are multiplexed onto a single connection.

According to the specification sheet the Toshiba drives can sustain 268MiB/s. A SATA 3 channel is 6Gb/s, which at the raw line rate is 750MB/s = 715MiB/s, and so a single SATA channel can sustain roughly 2.7 drives. With 8 drives sharing it, that is about 3x slower than if every drive had its own SATA 3 channel.

96TB used at 1TB per 3 hours = 288 hours = 12 days to resilver (as opposed to 4 days if it were not multiplexed).

If the eSATA is SATA 2 instead, then you can double this time to 24 days.
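
The arithmetic above can be sketched as a small calculation. This is only a rough illustration using the figures quoted in this post (268MiB/s per drive, a 6Gb/s link taken at its raw line rate, 8 drives, a 1TB/hr baseline). The function names are made up for the example, and it ignores protocol and encoding overhead, so real-world numbers are likely to be worse.

```python
# Rough resilver-time estimate using the figures quoted in the post above.
# All names are illustrative; the inputs are assumptions, not measurements.

MIB = 1024 ** 2

def drives_per_link(link_bits_per_s: float, drive_mib_per_s: float) -> float:
    """How many drives one shared link can feed at full sequential speed."""
    link_mib_per_s = link_bits_per_s / 8 / MIB
    return link_mib_per_s / drive_mib_per_s

def resilver_days(pool_tb: float, baseline_tb_per_hr: float, slowdown: float) -> float:
    """Days to resilver if the baseline rate is divided by `slowdown`."""
    hours = pool_tb / baseline_tb_per_hr * slowdown
    return hours / 24

fed = drives_per_link(6e9, 268)      # about 2.7 drives per SATA 3 link
slowdown = round(8 / fed)            # 8 drives sharing one link -> about 3x
print(f"drives per link: {fed:.1f}, slowdown: {slowdown}x")
print(f"dedicated links:    {resilver_days(96, 1, 1):.0f} days")
print(f"multiplexed SATA 3: {resilver_days(96, 1, slowdown):.0f} days")
print(f"multiplexed SATA 2: {resilver_days(96, 1, slowdown * 2):.0f} days")
```

The SATA 2 case simply doubles the slowdown, on the assumption that halving the link speed halves the throughput available to the drives.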

Yes - I was going to mention that. It is unclear whether these errors pre-date the resilver attempt or were caused by it, but it is IMO highly indicative of the unreliability and dangers of using any sort of multiplexed connection.

In addition to this, ZFS relies heavily for its integrity on writes being made in the correct sequence, and RAID controllers and some multiplexing controllers resequence writes in order to try to improve performance. This may be the cause of the errors or may be on top of these errors.

Protopia · October 5, 2024, 4:05pm

I think we all understand that the 8GB drives are not part of THIS problem. Nevertheless, hiding them behind a RAID controller is NOT a good long-term idea. Just because it has been and is now working fine does not mean that it is a sound configuration when something goes wrong.

With this configuration, if you ever have a problem with that pool that takes it offline, you have a good chance of never recovering from it without losing all the data on those drives because ZFS has no idea of the real drive characteristics.

melonion · October 5, 2024, 5:13pm

Alright, I rebooted with USB connection and the data is accessible again, for now.
We learned our lesson and are conceptualizing an alternative setup.

Protopia · October 5, 2024, 5:35pm

Good oh.

To avoid a 2nd bad design, please be sure either to use a genuine TrueNAS / ZFS expert to do the design and build or run the proposed solution past the community here.

melonion · October 5, 2024, 5:53pm

Yes! Please give feedback on Updated ASRock Build Plan Feedback

ZFS is still running a resilver, though the data is accessible. Is there a way to safely abort it? It is kind of pointless on an array that is about to be decommissioned, and it slows down my migration of data off it. I have heard of export+import; would that work?
Or should I let it finish for now?

% sudo zpool status -x 
  pool: b
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Oct  4 18:15:06 2024
	5.15T scanned at 0B/s, 137G issued at 38.5M/s, 95.8T total
	7.77G resilvered, 0.14% done, 30 days 03:14:12 to go
config:

	NAME                                      STATE     READ WRITE CKSUM
	b                                         ONLINE       0     0     0
	 raidz1-0                                ONLINE       0     0     0
	   e654fda2-1bab-4dd7-8941-27b7c5399456  ONLINE       0     0     0
	   c3e92d12-8b7d-475c-ac04-2afd4887b551  ONLINE       0     0     0
	   89e8070b-c005-4d31-9159-c2368ffd4be3  ONLINE       0     0     0
	   5b7157fc-7996-4c6b-9600-2b3d93b90bd9  ONLINE       0     0     0
	   3e08e841-46a5-41b2-92d0-13a47c81d6d5  ONLINE       0     0     0
	   e961fda4-7c66-4d3c-9fee-d3346908da32  ONLINE       0     0     0
	   18f61b50-172e-4a17-b0ca-51e219491a8d  ONLINE       0     0     0  (resilvering)
	   8ab72240-b593-4094-be04-ab8482f30414  ONLINE       0     0     0
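
As a sanity check on the estimate in the output above, the time remaining can be approximated from the "issued" figures alone. This is a rough sketch, assuming the T/G/M suffixes are binary units (TiB/GiB/MiB) and that the issue rate stays constant. The helper names are hypothetical, and this is not necessarily how ZFS itself arrives at its figure.

```python
# Approximate the time remaining from the "issued" figures in the scan line.
# Assumes binary suffixes (K/M/G/T = powers of 1024) and a constant issue rate.

UNITS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

def to_bytes(text: str) -> float:
    """Convert a size such as '95.8T' or '38.5M' to bytes."""
    return float(text[:-1]) * UNITS[text[-1]]

def days_remaining(total: str, issued: str, rate_per_s: str) -> float:
    """Days left if the remaining data is issued at a constant rate."""
    seconds = (to_bytes(total) - to_bytes(issued)) / to_bytes(rate_per_s)
    return seconds / 86400

# Values copied from the status output: 137G issued at 38.5M/s, 95.8T total.
print(f"{days_remaining('95.8T', '137G', '38.5M'):.1f} days to go")
```

This comes out at roughly 30 days, in line with the estimate in the status output, which suggests the 38.5M/s issue rate over the eSATA link is what drives the month-long projection.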

melonion · October 15, 2024, 10:57am

After going back to USB, resilvering finished within a day. We are preparing for migration now. Thanks everyone!

Continuing on Updated ASRock Build Plan Feedback