You may have noticed the market buzz around erasure coding in object storage lately, essentially presenting sliced objects as the best thing since sliced bread.
Erasure coding (EC), often implemented with and sometimes simply referred to as Reed-Solomon encoding, is a RAID-like technique using Forward Error Correction codes: you slice an object into, say, 5 chunks and generate 2 additional parity chunks from those. You then store all 7 chunks on different devices to protect the content: even if devices fail, as long as any 5 of the 7 stored chunks survive, you can regenerate the full set. In this example you can sustain 2 device failures without data loss while incurring only 40% disk footprint overhead.
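To make the "regenerate from the survivors" idea concrete, here is a minimal sketch. It deliberately swaps in a much simpler code than Reed-Solomon: a single XOR parity chunk (a 5+1 scheme), which can rebuild exactly one lost chunk. The function names are made up for illustration; a real 5+2 scheme needs Reed-Solomon arithmetic, but the store-and-rebuild flow is the same.

```python
from functools import reduce

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data, k):
    """Cut data into k equal chunks (zero-padded) and append one XOR parity chunk."""
    size = -(-len(data) // k)              # ceiling division
    padded = data.ljust(size * k, b"\0")
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]

def rebuild(stored, missing):
    """Recompute the chunk at index `missing` from all the other chunks."""
    survivors = [c for i, c in enumerate(stored) if i != missing]
    return reduce(xor_bytes, survivors)

data = b"sliced objects, not sliced bread"
stored = encode(data, 5)                   # 5 data chunks + 1 parity chunk
lost = 2                                   # pretend the device holding chunk 2 died
recovered = rebuild(stored, lost)
print(recovered == stored[lost])
```

Because XOR is its own inverse, the same `rebuild` call works whether the lost chunk held data or parity.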
The appeal is clear: savings in disk space, data center footprint, power and cooling. Additionally, by increasing the number of parity slices, as in a 10+6 scheme for example, the level of protection against device failures can be raised at will, way beyond what replication-based schemes can offer.
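The arithmetic behind those claims is easy to check. The sketch below compares the footprint overhead and the number of tolerated device failures for the two EC schemes mentioned here against plain replication; the helper names are illustrative only.

```python
from fractions import Fraction

def ec_overhead(k, m):
    """Extra footprint of a k+m erasure code, as a fraction of the data size."""
    return Fraction(m, k)

def replication_overhead(replicas):
    """Extra footprint of keeping `replicas` full copies of the data."""
    return Fraction(replicas - 1, 1)

# scheme name -> (footprint overhead, device failures survived)
schemes = {
    "EC 5+2": (ec_overhead(5, 2), 2),
    "EC 10+6": (ec_overhead(10, 6), 6),
    "2 replicas": (replication_overhead(2), 1),
    "3 replicas": (replication_overhead(3), 2),
}

for name, (overhead, tolerated) in schemes.items():
    print(f"{name:>10}: {float(overhead):.0%} overhead, "
          f"survives {tolerated} device failures")
```

Matching the 6-failure tolerance of 10+6 with replication alone would take 7 full copies, i.e. 600% overhead instead of 60%.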
So, what do you think: a free lunch? Well, as usual, not really; or at least, not always. If you cut up an object into a number of slices, you increase the object count in your repository accordingly, and that comes with overhead, two kinds of it:
1. Index size: a RAM-based index (like in Caringo’s CAStor Object Storage SW) will need more RAM, while an SSD-based or hard-disk-based index will be even slower than it already was.
2. Minimum object size overhead: CAStor doesn’t use a file system to park its objects and can store on tiny 1KB boundaries, but some file-system-based offerings out there come with 64KB or even 128KB minimum object sizes! With the 10+6 erasure code scheme, 16 segments at 64KB each already add up to a full megabyte of minimum footprint (2MB at 128KB), regardless of object size. It’s quite obvious that EC simply does not apply to small objects, especially when riding on (or ridden by) file systems.
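The minimum-size effect in point 2 can be put into numbers. This is a rough sketch that assumes each stored segment is simply rounded up to a multiple of the store's minimum allocation unit; real file systems add further metadata overhead that is ignored here, and the function names are illustrative.

```python
KB = 1024

def round_up(size, unit):
    """Round size up to the next multiple of unit."""
    return -(-size // unit) * unit

def ec_footprint(object_size, k, m, alloc_unit):
    """Bytes on disk for a k+m erasure-coded object, one allocation per segment."""
    segment = -(-object_size // k)         # ceiling division
    return (k + m) * round_up(segment, alloc_unit)

def replica_footprint(object_size, replicas, alloc_unit):
    """Bytes on disk for `replicas` full copies of the object."""
    return replicas * round_up(object_size, alloc_unit)

size = 100 * KB                            # a smallish object
for unit in (1 * KB, 64 * KB, 128 * KB):
    ec = ec_footprint(size, 10, 6, unit)
    rep = replica_footprint(size, 3, unit)
    print(f"alloc unit {unit // KB:>3}KB: 10+6 EC = {ec // KB}KB, "
          f"3 replicas = {rep // KB}KB")
```

With 1KB boundaries the 10+6 footprint of this 100KB object stays at 160KB; with a 64KB minimum it balloons to a full megabyte, more than three replicas of the same object would take.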
In addition, to achieve its protection potential, EC needs a very large number of devices in the cluster at its disposal, to reduce the data loss risk associated with simultaneous failures. So it cannot effectively support smaller cluster sizes without compromising protection either. Which brings us to a point that one-trick-pony EC object storage companies keep very mum about: the mechanism may bring benefits in large-file, large-cluster use cases, but it isn’t nearly as versatile as replication.
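A back-of-the-envelope model shows why cluster size matters. The sketch below is a deliberately simplified assumption, not a real durability model: the k+m chunks of one object sit on distinct devices, a given number of devices fail at the same instant, the failed devices are picked uniformly at random, and no repair happens in between. The object is lost when more than m of its chunks are hit.

```python
from math import comb

def loss_probability(devices, k, m, failed):
    """Chance that `failed` simultaneous random device failures out of `devices`
    hit more than m of the k+m chunks of a single EC set."""
    n = k + m
    if devices < n:
        raise ValueError("need at least k+m devices to spread the chunks")
    total = comb(devices, failed)
    # Count failure patterns that take out j > m of the n chunk-holding devices.
    bad = sum(comb(n, j) * comb(devices - n, failed - j)
              for j in range(m + 1, min(n, failed) + 1))
    return bad / total

# 10+6 scheme, 7 devices failing at once, growing cluster sizes
for devices in (16, 32, 100, 1000):
    p = loss_probability(devices, 10, 6, 7)
    print(f"{devices:>4} devices: P(loss) = {p:.3g}")
```

In a 16-device cluster every device holds a chunk of every 10+6 object, so any 7 simultaneous failures are fatal; spread over many more devices, the same 7 failures rarely land on one object's set.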
For CAStor, we simply would not even think of sacrificing our wide application range, so we only introduced erasure coding when we saw a way of integrating it with our flexible replication and fast recovery capabilities. That makes CAStor 6 with Elastic Content Protection the only object store in the market that effectively combines the strengths of both approaches while compensating for their respective drawbacks. CAStor was already able to specify individual replication schemes (2, 3, 4, …) as lifepoint metadata on a per cluster, per domain, per bucket or even per object basis; now we have added erasure codes to that scheme as if they were just another form of replication (5+2, 7+3, 10+6, …). A settable threshold (e.g. 1 MB) specifies the automatic boundary below which replication will be used, and above which erasure coding. All of this inside the same infrastructure.
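The selection logic described above can be sketched in a few lines. To be clear, this is not CAStor's lifepoint syntax or API: the dictionary keys, function names and the rule that an object exactly at the threshold gets erasure coded are all assumptions made for illustration.

```python
MB = 1024 * 1024

def effective_policy(cluster, domain=None, bucket=None, obj=None):
    """Merge settings so that the most specific scope wins:
    object over bucket over domain over cluster."""
    policy = dict(cluster)
    for scope in (domain, bucket, obj):
        policy.update(scope or {})
    return policy

def choose_protection(size, policy):
    """Replicate below the threshold, erasure code at or above it (assumed rule)."""
    if size < policy["ec_threshold"]:
        return ("replicate", policy["replicas"])
    return ("erasure_code", policy["ec_scheme"])

# Hypothetical settings: cluster-wide defaults, one bucket asking for 10+6
cluster = {"replicas": 2, "ec_scheme": (5, 2), "ec_threshold": 1 * MB}
bucket = {"ec_scheme": (10, 6)}

policy = effective_policy(cluster, bucket=bucket)
print(choose_protection(200 * 1024, policy))   # small object
print(choose_protection(50 * MB, policy))      # large object
```

The point of the sketch is that one policy lookup covers both mechanisms, so small and large objects can live in the same cluster under the protection that suits each.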
Last but not least, our EC implementation brings another major exclusive: fast, active recovery of segments lost in a device failure. Where other EC offerings tend to rely on passive background or read-based discovery and repair of EC sets, CAStor leverages its “turbo” Fast Volume Recovery scheme as used in its replication; EC sets are back to spec minutes after a device failure, rather than hours, days or longer. This makes a huge difference in protection level, of course, and it means net footprint savings, since less redundancy may still guarantee a similar SLA.
Yes, I agree, object storage is becoming intricate stuff to keep track of. But it’s worthwhile to study it, as it is here to stay. And it pays to get into the details. Literally.