In part 1 of this series on data protection, I discussed some general mechanisms Swarm has for protecting data. In part 2, I describe how Swarm automates the protection of individual objects during their lifetimes.
Swarm can protect your data in transit over the network using integrity seals (hashes) that guard against tampering and transmission errors. These seals are stored with each object and are re-verified when the object is retrieved from Swarm. A write or update request can even ask that the full complement of replicas be made before the request finishes.
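To make the idea of an integrity seal concrete, here is a minimal sketch in Python. It is not Swarm's implementation: the function names are hypothetical, and SHA-256 is an assumption chosen only for illustration, since this post does not say which hash Swarm uses.

```python
import hashlib


def make_seal(data: bytes) -> str:
    """Compute a digest at write time; it is stored alongside the object."""
    # SHA-256 is an assumption for this sketch, not a statement about Swarm.
    return hashlib.sha256(data).hexdigest()


def verify_seal(data: bytes, seal: str) -> bool:
    """Recompute the digest at read time and compare it with the stored seal."""
    return make_seal(data) == seal


# Write path: keep the seal with the object.
original = b"hello swarm"
stored_seal = make_seal(original)

# Read path: an intact copy verifies, a copy with one altered byte does not.
intact_ok = verify_seal(original, stored_seal)
corrupt_ok = verify_seal(b"hellp swarm", stored_seal)
print(intact_ok, corrupt_ok)
```

Because the seal travels with the object, the same check works whether the damage happened on the wire or at rest.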
Ultimately, Swarm data is stored on mechanical disks, which provide high data density, inexpensive persistence, and fast transfer speeds. Although it’s a rare thing, disks can drop data due to bad sectors. Those bad sectors may hold one or more of your objects, all of which become unreadable. The Swarm health processor periodically reads every object on disk to verify its data integrity. If this check fails, the replica is declared invalid and a new replica is made in its place using duplicate data residing elsewhere in the cluster. Here, we rely on the fact that we check the data far more often than such errors occur, so these sorts of disk read/write errors don’t result in lost data.
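The verify-and-repair cycle described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the health processor itself: the `Replica` class, the `scrub` function, and the node names are all hypothetical, SHA-256 stands in for whatever seal Swarm uses, and overwriting the bad copy in place stands in for making a new replica elsewhere in the cluster.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    node: str    # hypothetical name of the node holding this copy
    data: bytes  # bytes as read back from disk
    seal: str    # digest stored with the object at write time


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: List[Replica]) -> List[str]:
    """Verify every replica of one object and repair any that fail the check.

    Returns the nodes whose replicas had to be rebuilt.
    """
    healthy = [r for r in replicas if digest(r.data) == r.seal]
    if not healthy:
        raise RuntimeError("no valid replica left to repair from")
    repaired = []
    for r in replicas:
        if digest(r.data) != r.seal:
            # Stand-in for declaring the replica invalid and re-replicating
            # from a good copy held elsewhere in the cluster.
            r.data = healthy[0].data
            repaired.append(r.node)
    return repaired


payload = b"object-bytes"
seal = digest(payload)
copies = [
    Replica("node-a", payload, seal),
    Replica("node-b", b"object-bytez", seal),  # simulated bad sector
    Replica("node-c", payload, seal),
]
first_pass = scrub(copies)
second_pass = scrub(copies)
print(first_pass, second_pass)
```

The sketch only works while at least one good copy survives, which is why the scan has to run more often than errors accumulate.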
More commonly though, disks just go bad. They can do so slowly over time or quickly and catastrophically. For slow failures, Swarm watches for an accumulation of disk errors, which usually indicates an impending failure. When a disk is determined to be near failure, we “retire” it. Retiring involves the cluster performing an active recovery of the questionable disk without increasing the load on it. As sufficient replicas are made elsewhere in the cluster, the retiring disk can erase its replicas, ultimately resulting in an empty disk in the cluster that can be replaced at the administrator’s leisure. All of this is done automatically and without any potential for data loss.
Even when a disk failure is catastrophic, Swarm has a few tricks up its sleeve. The entire cluster participates in an active recovery that restores the cluster to full replication in a relatively short amount of time. By making the recovery fast, we shorten the window of time during which another disk failure might impact the cluster. If a customer chooses 3 replicas for an object, the two remaining replicas effectively guarantee that a third replica is rapidly made. Yes, these active recoveries have a small impact on cluster performance, but they are often just minutes long, which is a small price to pay for closing the window of vulnerability. Usually, the only thing that slows active recovery is a cluster without enough empty space left to replicate the content of the failed disks.
Larger clusters raise some interesting questions. It is true that larger clusters can expect to see more frequent disk failures, that is, a failure of some disk somewhere in the cluster. This is simply because larger clusters have more disks, and even highly reliable disks, taken as a large collection, will see some member of the collection fail relatively often.
But Swarm’s active recovery uses the entire cluster, which means recovery time in a larger cluster is shorter because more resources work on it in parallel. While the math is a bit complex, larger Swarm clusters offer comparable data protection to smaller ones, despite the higher likelihood of an individual disk failure in a larger cluster.
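A deliberately simplified model shows why the two effects roughly offset each other. Everything in it is an assumption made for illustration: the 2% annual failure rate, the 20-hour single-disk rebuild time, independent failures, and recovery work that spreads evenly over the surviving disks. It is not Swarm's actual math.

```python
import math

AFR = 0.02                    # assumed annual failure rate per disk
HOURS_PER_YEAR = 8760
SINGLE_DISK_REBUILD_H = 20.0  # assumed hours if one disk did all the recovery


def failures_per_year(disks: int) -> float:
    """Expected number of disk failures per year across the cluster."""
    return disks * AFR


def recovery_hours(disks: int) -> float:
    """Recovery time, assuming the work is shared evenly by surviving disks."""
    return SINGLE_DISK_REBUILD_H / (disks - 1)


def p_second_failure(disks: int) -> float:
    """Probability that any surviving disk fails during the recovery window."""
    rate_per_hour = (disks - 1) * AFR / HOURS_PER_YEAR
    return 1.0 - math.exp(-rate_per_hour * recovery_hours(disks))


for n in (10, 100, 1000):
    print(n, failures_per_year(n), round(recovery_hours(n), 3), p_second_failure(n))
```

In this model the expected number of failures per year grows with the number of disks, while the recovery window shrinks in the same proportion, so the chance of a second failure landing inside any one recovery window does not depend on cluster size. A real analysis also has to account for replica placement, replica count, and free space, which is where the math gets complicated.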
Next week, I’ll talk about how Swarm protects customer data through various catastrophic failures. In the meantime, if you missed our livestream during Tech Field Day (#TFD10), check out the videos for more insight into how Swarm works.