Cold spares always make sense, especially if the server is easily reachable for intervention.
Hot spares make sense with multiple vdevs, as these require fewer drives than bumping the raidz level of all vdevs.
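To put rough numbers on that, here is a quick sketch. It assumes that bumping the raidz level costs one extra drive per vdev and that a single hot spare serves the whole pool. The function names are made up for illustration.

```python
def extra_drives_for_raidz_bump(num_vdevs: int) -> int:
    """Going from raidzN to raidz(N+1) adds one parity drive to every vdev."""
    return num_vdevs


def extra_drives_for_hot_spares(num_spares: int = 1) -> int:
    """Hot spares belong to the pool, so the cost does not grow with vdev count."""
    return num_spares


# Compare the two options for a few pool widths (assumed vdev counts).
for vdevs in (1, 2, 4, 8):
    print(
        f"{vdevs} vdev(s): raidz bump = {extra_drives_for_raidz_bump(vdevs)} drives, "
        f"one hot spare = {extra_drives_for_hot_spares()} drive"
    )
```

The two are not equivalent protection, of course: the extra parity is already there the moment a drive fails, while the spare has to resilver first.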
Well, yes they are running hot spares, eating up potential disk time. (I work in the Enterprise DC space, though not in Storage, just OS support.)
Compared to the number of active, in-use disks, the few hot spares are trivial. And so what if they don’t last as long as they could if they were cold spares? The disk arrays are under a support contract. (Plus, some sites have cold spares TOO, like mine.)
There is sort of a case for a cold spare.
Imagine your pool / vDev is made up of 4 x 10TB disks, either 2-way Mirrored or RAID-Z2. That is about 20TB of storage. So, you buy a 20TB disk that can act as an additional backup, off-line but on-site. This would then also be available as a cold spare (after you check your other backups for integrity).
It is an even simpler choice if you have a 2 way Mirrored pool / vDev. Just get another disk for backups and cold spare.
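The "about 20TB" figure above can be checked with a small sketch. It ignores ZFS overhead, padding and TB/TiB differences, and the layout labels are my own, not anything ZFS uses.

```python
def usable_tb(num_disks: int, disk_tb: float, layout: str) -> float:
    """Approximate usable capacity, ignoring ZFS overhead and padding."""
    if layout == "mirror2":
        # Striped 2-way mirrors: half of the raw space is usable.
        return num_disks // 2 * disk_tb
    if layout == "raidz2":
        # RAID-Z2: two disks' worth of space goes to parity.
        return (num_disks - 2) * disk_tb
    raise ValueError(f"unknown layout: {layout}")


# The 4 x 10TB example from the post.
for layout in ("mirror2", "raidz2"):
    print(layout, usable_tb(4, 10, layout), "TB")
```

Both layouts come out at 20TB for four 10TB disks, which is why a single 20TB disk is enough to hold a full backup copy.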
This is my take on things. I think that “hot spares” that are spun down would be called “warm spares”.
Hot spares I would consider to be fully initialized, on, spinning, assigned to a pool and ready to go, so they can be swapped into the pool immediately when the system detects a drive offline or failed. Ideally the hot spare would already be synchronized with the pool data it is assigned to, which would decrease the chances of an additional failure or server lag during resilver. This process is fully automatic and requires no immediate physical interaction.
Warm spares I would consider nearline, where they are powered and available, at idle or not active. This would be drives that are installed, spun down, sitting at idle and not synced with any pool. These warm spares may be unassigned drives, available to any pool where the system detects a failure, as long as they meet or exceed the requirements of the drive they replace. When a failure is detected, these drives are assigned to replace the failed drive; they would need to be spun up, and a resilver run to assimilate (resistance is futile) the drive into the pool. This could cause further failures and server lag while the resilver process completes. This process is also automatic and requires no immediate physical interaction.
Cold spares I consider to be drives that sit on a shelf or in a storage locker and have to be physically added to the server, either by replacing the failed drive or by adding it into a spare slot. This might entail powering the server off if the chassis is not hot swap, resulting in downtime. Other issues are possible too, such as an additional drive failure during resilver and server lag during resilver, and someone has to be physically near the server to make the swap.
I personally have full server chassis with no extra slots. I can replace the drive in a reasonable time period, and as a plus the chassis are hot swap so I don’t need to shut anything down. I can pop the bad one out by locating the drive with the serial-number-to-physical-slot map I have for the server, and complete the drive change procedure in TrueNAS. The map also reduces the chances of pulling the wrong drive and possibly creating a larger issue. I am willing to risk a possible additional drive failure during resilver, and the server is lightly loaded so possible lag during resilver is not an issue. The main server is fully backed up to a second server, which could be turned into the main server (which saved my butt when I had a server physically fail) without much trouble if it came to that.
Back to the other question: consensus seems to be that if a pool is made up of 2 vdevs and both vdevs use raidz2, then you can only lose 3 drives before pool loss occurs. I.e. vdev1 loses 2 drives and vdev2 can lose only one?
I would think vdev1 could lose 2 drives and vdev2 could lose 2 drives. Any additional drive loss in either vdev would result in total pool loss. If the pool consisted of two vdevs that are each 2 drive mirrors and one drive was lost from mirror0 and one drive lost from mirror1 then both vdevs are degraded but still fully operational and the pool while degraded still has full data availability.
You’ve misunderstood. A stripe of two raidz2 can lose up to four drives, if the losses are equally spread, but it is only guaranteed to “safely” survive the loss of any two drives. Losing three drives in the same vdev loses the pool.
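A small sketch of that reasoning, for anyone who wants to play with it. This is just a brute-force check, not how ZFS itself decides anything, and the 6-wide vdevs are an assumption since the thread does not say how wide they are.

```python
from itertools import combinations


def pool_survives(failed_per_vdev, parity=2):
    """A striped pool lives only if every vdev has lost no more drives than its parity."""
    return all(failed <= parity for failed in failed_per_vdev)


def guaranteed_tolerance(num_vdevs, drives_per_vdev, parity=2):
    """Largest k such that ANY k failed drives still leave the pool alive."""
    drives = [(v, d) for v in range(num_vdevs) for d in range(drives_per_vdev)]
    k = 0
    while k < len(drives):
        # Try every possible set of k + 1 failed drives.
        for failed in combinations(drives, k + 1):
            counts = [sum(1 for v, _ in failed if v == vdev) for vdev in range(num_vdevs)]
            if not pool_survives(counts, parity):
                return k  # some set of k + 1 failures kills the pool
        k += 1
    return k


print(pool_survives([2, 2]))       # two lost in each vdev: best case
print(pool_survives([3, 0]))       # three lost in one vdev: worst case
print(guaranteed_tolerance(2, 6))  # two raidz2 vdevs, assumed 6 drives each
```

So four losses are survivable only with the lucky 2 + 2 split, while the number you can count on regardless of which drives fail is the parity level of a single vdev.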
How is this possible?
Are you suggesting that ZFS starts to resilver the hot spare before a failing drive completely fails?
This is underappreciated for home users.
We have daily access to our NAS server. We don’t need a hot spare taking up a slot (or power) 24/7. On a moment’s notice, we can do the replacement ourselves.
I do recommend that anyone who has cold spare drives run an occasional long SMART self-test on them.
I’m not sure. Today was clean-the-laptop day, and it was in some old notes from a training I had, which I ran across and was deleting. I was not IT, but at one point I had to get a second level of security clearance to be able to enter the secured server room in the IT room for some upgrade work. Maybe it would be auto-synced with the pool when created, so essentially it is already there online and would not need to resilver. Not really practicable in most cases. I guess your thought, that a spare be put into service and begin resilvering if the system detects certain parameters of a failing drive, giving it a head start without interaction from an admin, might work?
This would be the same as running RAIDz3 or 3-way mirrors. Back when my pool consisted of 2 top-level vdevs, I would run 3-way mirrors. I often break in a new-to-me drive by adding it as a 3rd mirror for a week or two.
Hot Spare is a fairly well established term for having a device powered on and ready to assume the role of a failed component. If it is covering more than one possible device, then it almost can’t be pre-loaded with the data necessary.
Warm Spare is a term that is often used as a synonym for Hot Spare.
If you use terms based on their accepted definitions then there is less misunderstanding.
<rant>I still call a lectern a lectern and a podium a podium even if accepted use has changed</rant>
And for us folks with home systems who travel a lot, even with a z3 it might make sense to have a hot spare, depending on the number of drives. I was recently gone for a month. So while I normally have access, not from Bali!
My definition of Warm Spare is a disk that is both powered up, and ready for assignment. However, it would require manual intervention.
On the other hand, a Hot Spare is both powered up and does not need manual intervention to take over a failing / failed disk.
One use of a Warm Spare is an on-line backup. If needed as a replacement disk, check other backups, and if good, then convert the Warm Spare into the replacement disk for the failing / failed disk. You can’t do this with a Hot Spare.
In the bad old days of SunOS DiskSuite and Linux MD-RAID, Warm Spare(s) made some sense if you had multiple RAID sets that could use the replacement disk. A monitoring program / script would check those multiple RAID sets, and if a failing / failed disk was detected, the "Warm Spare" could be assigned as a replacement.
However, ZFS supports using Hot Spare disk(s) across multiple pools, with the caveat that only 1 pool can actively use it at a time. Plus, there can be problems exporting pools with shared Hot Spare disk(s). It is simple to overcome, by removing the active Hot Spare from the pools NOT using it. Then pools can be exported as desired.
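As a rough sketch of that workaround: the snippet below only builds the `zpool remove <pool> <device>` commands and prints them for review; it does not run anything. The pool names (`tank`, `backup`, `scratch`) and the device name (`da7`) are hypothetical placeholders, so substitute your own.

```python
def release_shared_spare(spare, pools, active_pool):
    """Build (not run) `zpool remove` commands for pools that list the spare but are not using it."""
    return [["zpool", "remove", pool, spare] for pool in pools if pool != active_pool]


# Hypothetical setup: spare da7 is shared by three pools and currently in use by "tank".
cmds = release_shared_spare("da7", ["tank", "backup", "scratch"], active_pool="tank")
for cmd in cmds:
    print(" ".join(cmd))
```

Printing the commands rather than executing them keeps a human in the loop, which seems prudent for anything that changes pool configuration.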
V. interesting! I was not aware of this usage change until now. X-referenced the two terms in Webster’s and American Heritage dictionaries. Both show “lectern” as a definition under “podium”, but neither has “podium” under “lectern”. I’m with you @PK1048!
Sun’s Online DiskSuite was nicknamed “Odius” (ODS) by many who had to manage it. I liked it. But ZFS is so much better.
DiskSuite was so much more flexible than Linux MD devices.
The original way of thinking about it was:
The Lectern is the box you stand behind.
The Podium is the box you stand on. Think “conductor’s podium”.
Podium has come to mean either.