Your company’s data is in danger. Every day, those hidden little IT gremlins are on the hunt, waiting to poach your ability to access your files when you need them most. Routine headaches like drive and chassis failures seem to happen at the most inopportune times, and any downtime can prove severely disruptive to business processes. The good news is that you and your team aren’t alone in this fight to protect your data.
With the right storage, you can have an elite team of park rangers, ever vigilant for signs of failure and diligently working to keep your data fully available—before you even know any hardware failed. If you’ve followed this blog, you’ve read about how object storage can protect your data, even from catastrophic loss. Today I’d like to take you a little deeper into our philosophy on how to recover from hardware failures, especially as your data grows over time.
Hardware failures happen; it’s a fact of life. Lose too much hardware and you begin to irretrievably lose data, which is one of the worst things that can happen to a business. Any time hardware goes down, the clock starts ticking: the gremlins are hunting your data, and your storage solution has a finite window of time to restore the affected data before it becomes extinct.
Some solutions rely on a passive recovery scheme. In these scenarios, data recovery is only attempted when data is read. Unfortunately, this is a bit of a head-in-the-sand approach that does little to protect your data from the realities of life.
From our perspective at Caringo, an active recovery process needs to begin the moment a hardware failure is detected. If your storage is designed so that all volumes in the cluster participate, the desired protection level for anything that was affected can be restored almost immediately. During the active recovery, requests from the outside are serviced as usual, even though within the cluster the nodes are busily repairing the data. We coined the term Fast Volume Recovery for this process, and it is a key feature we provide in Swarm.
Fast Volume Recovery
Every node in a Swarm cluster checks to see if its volumes contain either replicas or erasure-coded (EC) segments of objects affected by the failure. If the node does not have any affected data, it goes back to its normal business. If affected data does exist on the node, however, it immediately tells the cluster to create a new replica or EC segment to restore the protection level for that object. This continues until all the data has been restored and the cluster is once again at full protection.
Simple, right? For any given volume, the data needed to restore that volume resides elsewhere in the cluster so there is no single point of failure. It may seem that the more data there is the longer this process would take, but there’s a really neat trick hidden in all of this: As your cluster grows, data recovery actually gets FASTER!
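To make the process concrete, here is a minimal sketch of that per-node check in Python. All of the names and data structures here are illustrative assumptions for this post, not Swarm’s actual internals: each surviving volume scans its own objects for anything it shared with the failed volume and asks the cluster to re-protect it elsewhere.

```python
from dataclasses import dataclass, field

# Hypothetical model of Fast Volume Recovery's per-node check.
# Obj, Volume, Cluster, and recover() are illustrative names only.

@dataclass
class Obj:
    name: str
    volume_ids: set  # volumes currently holding a replica or EC segment

@dataclass
class Volume:
    vid: str
    objects: list = field(default_factory=list)

@dataclass
class Cluster:
    volumes: list

    def create_replacement(self, obj, failed_vid):
        """Place a new replica/EC segment on a healthy volume that
        doesn't already hold one, restoring the protection level."""
        for vol in self.volumes:
            if vol.vid != failed_vid and vol.vid not in obj.volume_ids:
                vol.objects.append(obj)
                obj.volume_ids.discard(failed_vid)
                obj.volume_ids.add(vol.vid)
                return

def recover(cluster, failed_vid):
    """Each surviving volume checks its own objects for anything that
    was also on the failed volume and re-protects it. In a real
    cluster, every node does this independently and in parallel."""
    for vol in cluster.volumes:
        if vol.vid == failed_vid:
            continue  # the failed volume can't participate
        for obj in vol.objects:
            if failed_vid in obj.volume_ids:
                cluster.create_replacement(obj, failed_vid)
```

For example, an object with two replicas on volumes v1 and v2 would, after v1 fails and `recover` runs, end up re-protected on v2 and some third volume.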
Let’s walk through two examples that show how this works.
Strength in Numbers
How is it that the more your storage needs grow the safer your data becomes? Consider the following simplified scenario. Let’s say you have four 1 TB drives spread out across a few nodes, and for simplicity’s sake, let’s assume your cluster uses the default protection scheme of two replicas.
In this small cluster, each volume has its data spread out across the rest of the cluster. The actual distribution will vary over time, but on average we can expect each volume to be protecting about 33% (one in three) of the data on each of the other volumes.
If a given volume goes down, each of the other volumes will work in parallel to restore their portion of the affected data. Here’s what that participation looks like over time:
So that we can draw a clear contrast, let’s say the cluster eventually doubles in size to eight volumes. Now each volume is responsible for only about 14.3% (one in seven) of the data on each of the other volumes. Here’s what that participation looks like over time:
As you can see, the recovery is faster this time. But why is that?
In the first scenario, even though all the volumes were working in parallel, each had to do a max of roughly 333 GB worth of work (1 TB of affected data divided across three surviving volumes). In the second scenario, the data was more spread out: each volume only had to do a max of roughly 143 GB worth of work (1 TB across seven survivors). Since they were all working at the same time, the overall recovery completed much faster. As the cluster continues to grow, the benefit grows as well.
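The arithmetic above is easy to check. Here is a small sketch (with illustrative numbers, assuming 1 TB volumes and even distribution of the affected data) showing how the per-volume recovery workload shrinks as the cluster grows:

```python
def recovery_work_per_volume(num_volumes, volume_size_gb=1000):
    """Approximate GB each surviving volume must restore when one
    volume fails and its data is spread evenly across the rest.
    Simplified model: assumes 1 TB volumes and uniform distribution."""
    survivors = num_volumes - 1
    return volume_size_gb / survivors

# Four volumes: each of the 3 survivors restores ~333 GB.
print(round(recovery_work_per_volume(4)))   # 333
# Eight volumes: each of the 7 survivors restores ~143 GB.
print(round(recovery_work_per_volume(8)))   # 143
```

Because all survivors work in parallel, the wall-clock recovery time tracks this per-volume number, which is why a bigger cluster recovers faster.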
However long it takes for the volumes to restore their data, that is the cluster’s window for potential data loss. The smaller that window, the better your data is protected. Using stronger data protection schemes (more replicas or a different EC encoding) can further enhance your data protection.
Faster recoveries mean less opportunity for another hardware failure to occur while your data is at risk. And as your data needs grow, you can rest assured that your data protection is automatically growing along with them.
If you want to learn more about how you can use object storage to keep data safe and accessible for your organization, check out this whitepaper on protecting data. Also, join Caringo VP of Marketing Adrian Herrera tomorrow on Storage Megacast.io, where you can evaluate and compare six different enterprise storage solutions back to back.
If you have questions about whether object storage is the right solution for you, contact us. We would be happy to help!