Data recovery in Swarm works hand-in-hand with its data resiliency features. In our last blog post, we discussed how Swarm distributes and protects data against the possibility of failure. When a failure actually happens, Swarm performs active recovery with the goal of restoring full protection for its data. We regularly perform a kind of “whack-a-mole” test with up to 14 sequential failures. A properly configured cluster can sustain multiple simultaneous hardware failures and return to full replication within minutes.
What Happens When a Storage Volume Fails?
Of all the types of failures, volume failures are perhaps the most common. Even with a low annualized volume failure rate of 2%, a large Swarm cluster (e.g., PB+) can see one or more failures a month. Because of Swarm’s parallel active recovery mechanisms, a larger cluster will also recover from such a failure faster than a smaller cluster will. Few storage solutions in the market scale their recovery mechanisms in this way. Swarm’s active recovery means data loss occurs only in truly catastrophic failures or cases of gross neglect.
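To see why recovery speeds up as the cluster grows, consider a back-of-the-envelope model (purely illustrative; the function name, rebuild rate, and node counts below are assumptions for the sketch, not Swarm specifications): when a volume fails, every surviving node rebuilds a share of the lost replicas in parallel, so aggregate rebuild bandwidth grows with node count.

```python
def recovery_time_hours(lost_tb, node_count, per_node_rebuild_mbps=200):
    """Rough model of parallel recovery: surviving nodes each contribute
    rebuild bandwidth, so total throughput scales with cluster size.
    All figures are illustrative assumptions."""
    surviving = node_count - 1          # the failed node contributes nothing
    aggregate_mbps = surviving * per_node_rebuild_mbps
    lost_mb = lost_tb * 1_000_000       # TB -> MB (decimal, for simplicity)
    return lost_mb / aggregate_mbps / 3600

# The same 10 TB loss recovers ~10x faster with ~10x the surviving nodes:
small_cluster = recovery_time_hours(10, 11)    # 10 surviving nodes
large_cluster = recovery_time_hours(10, 101)   # 100 surviving nodes
```

Under this simple model, doubling the node count roughly halves the recovery window, which is the scaling behavior described above.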
How Does Swarm Recover?
Volumes can be physically moved from one chassis to another, whether that chassis is newly introduced or already running in the cluster. The data on the volume then becomes available for use as soon as it is mounted. While Swarm can recover from the permanent loss of a chassis (along with its volumes), chassis loss is usually a temporary problem remedied by a new power supply or network card.
Performing full recovery after a chassis or subcluster loss may be counterproductive if the data is known to reside on viable disks. In such a situation, Swarm can alert administrators to the condition and accept their guidance on whether full recovery of all volumes should be performed. In the absence of this feedback, Swarm proceeds with the recovery. As described in part 1 of this blog series, Swarm can continue to read and write all cluster data in either of these temporary loss scenarios.
Best Practices for Cluster Configuration
Some Caringo customers keep multiple clusters for high availability and failover. Clusters can be easily configured for remote replication, and multiple clusters can be configured to mirror each other. With small network configuration changes, either cluster can serve as the primary access point for an application or user base, or both clusters can be load-balanced to serve out exactly the same objects. Should one of those clusters suffer a catastrophic failure, the other cluster can carry the load while the first one is repopulated.
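The failover half of that setup can be sketched in a few lines (a hypothetical client-side illustration; the endpoint URLs, the `pick_endpoint` helper, and the health-check callable are all invented for this example, and real deployments typically put this logic in DNS or a load balancer rather than application code):

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the first reachable cluster endpoint, falling back to its
    mirror(s) in order. Both arguments are illustrative placeholders."""
    for url in endpoints:
        if is_healthy(url):
            return url
    raise RuntimeError("no cluster endpoint is reachable")

# Example: the primary cluster is down, so traffic shifts to the mirror.
mirrors = ["http://swarm-a.example.com", "http://swarm-b.example.com"]
chosen = pick_endpoint(mirrors, lambda url: "swarm-b" in url)
```

Because mirrored clusters serve exactly the same objects, the application sees no difference in content, only a different network path.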
Take the Worry Out of Protecting Data
With its combined data resilience and recovery mechanisms, Swarm takes the worry out of protecting your data and making it always available. These mechanisms “just work” with little configuration or manual intervention. This is especially important with remote data centers that may only be periodically serviced.
Learn More About Data Resilience & Recovery
Register now for our August 20 Tech Tuesday webinar, Data Resilience & Recovery in Swarm Object Storage. It will feature T.W. Cook, VP Engineering, and John Bell, Sr. Consultant, and include live Q&A throughout the webcast.
Learn how object storage can provide continuous, built-in data resilience that can be configured to withstand massive hardware failure. More Details »
With Caringo Swarm & Marquis, you can optimize media & metadata workflows, offload Avid shared storage and simplify Avid backup and DR. More Details »