This week, I stepped out of the host seat and into the subject-matter expert role for our monthly Tech Tuesday educational webinar. The series is now almost 2 years old, and we’ve covered a broad swath of topics, from migrating your data from NAS, SAN or Tape to Object Storage to verifying content integrity in Swarm Object Storage (this week’s topic).
I thought that verifying content integrity would be the perfect follow-up to our August webinar on Data Resilience & Recovery in Swarm Object Storage. After all, maintaining accurate data is key to a solid storage strategy, along with making sure your data is protected, accessible and can be recovered in the case of a disaster.
What is content integrity?
Content integrity is the accuracy and consistency (validity) of content (data) over its lifecycle. Content integrity must be addressed at multiple levels:
- Human error and/or tampering
- Transfer errors
- System/Hardware compromise, etc.
How can I ensure the integrity of my stored content?
To ensure the integrity of your content over its lifespan, you can use a combination of the following methods:
- Input validation
- Error detection/data validation (data in transit)
- Access Policy/Constraints
What methods are available for validating and maintaining data integrity in Swarm?
There are several methods that can be used for validating data in Swarm Object Storage, including Content-MD5, Integrity Seals and Lifepoint Metadata Headers (WORM).
Which validation method is preferred?
That question is a bit more complex than it might appear. There are situations where certain content check approaches work better than others. For example, we routinely recommend Content-MD5 verification because it works with both our Content Gateway and direct-to-Swarm integrations.
Integrity Seals, by contrast, are specific to Swarm and are used when storing unnamed objects through direct integrations. It’s up to the storage admin to decide which path is best for them.
Content-MD5 validation is typically used to detect errors in data in transit or on write. The checksum can be provided by the client directly or generated by Swarm. Each client-provided “Content-MD5” header includes a base64-encoded MD5 sum (as shown in the above example).
Content-MD5 headers must contain base64-encoded values to comply with HTTP 1.1 and associated RFCs. Therefore, we always store a value in the format specified by RFC 1864 (the base64 representation of the 128-bit MD5 sum).
Auto-generation is supported through the gencontentmd5 client query argument and through a global cluster configuration parameter (e.g., scsp.autoContentMD5Computation).
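For a client that supplies its own value, here is a minimal Python sketch of how the header value described above can be computed. The function name `content_md5` and the sample body are illustrative only and are not part of Swarm or any client library.

```python
import base64
import hashlib


def content_md5(body: bytes) -> str:
    """Return the base64 encoding of the raw 128-bit MD5 digest of body."""
    # Use the 16 raw digest bytes, not the 32-character hex string.
    digest = hashlib.md5(body).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical request headers for a write; the body is sample data.
headers = {"Content-MD5": content_md5(b"hello world")}
```

A common mistake is to base64-encode the hex string instead of the raw digest, which produces a longer value that will not match what the server computes.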
An Integrity Seal is a URL containing the object name or UUID, its hash value, and the type of hash algorithm that was used for the computation. It can be requested on client write using the “hashtype” query argument. Success results in an HTTP 201 response that includes a Location header with a URL that can be used to retrieve the data. Failure results in an error message.
Supported hash types include:
Hashes can be upgraded in place (e.g., MD5 to SHA512) using the hash and newhashtype query arguments.
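To illustrate what a seal lets you check, here is a rough client-side sketch that recomputes the digest named in a seal URL and compares it with the hash the URL carries. It assumes the seal exposes the algorithm and a hex-encoded digest as `hashtype` and `hash` query arguments. The host name and UUID in the usage lines are made up, and Swarm performs its own validation on the server, so treat this only as a picture of the idea.

```python
import hashlib
from urllib.parse import parse_qs, urlsplit


def verify_seal(seal_url: str, body: bytes) -> bool:
    """Recompute the digest named in a seal URL and compare it to the stored hash."""
    query = parse_qs(urlsplit(seal_url).query)
    hash_type = query["hashtype"][0].lower()  # assumed argument name
    expected = query["hash"][0].lower()       # assumed to be a hex digest
    actual = hashlib.new(hash_type, body).hexdigest()
    return actual == expected


# Hypothetical seal URL: the host and UUID are placeholders.
seal = ("http://swarm.example.com/12bfea648c2697a56fd5618cae15537e"
        "?hashtype=md5&hash=5eb63bbbe01eeed093cb22bb8f5acdc3")
print(verify_seal(seal, b"hello world"))
```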
Lifepoint Metadata Headers
Lifepoint Metadata Headers specify the constraints associated with an object in Swarm (e.g., replica count, erasure coding scheme, etc.). They are evaluated by the Health Processor (HP) to determine what is “healthy” for the object at that point in its lifecycle.
From a content integrity perspective, there is one constraint of note which tells the HP whether or not the object should be deleted at a point in time. This is done with the “deletable” and “delete” directives in the Lifepoint header(s). The directive “deletable=no” is typically applied to implement “Write Once Read Many” (WORM) behavior for objects in Swarm storage.
Here are some examples:
Lifepoint: [Wed, 12 Dec 2015 15:59:02 GMT] reps=3, deletable=no
Lifepoint: [Sun, 08 Jun 2016 15:59:02 GMT] reps=2, deletable=yes
Lifepoint: [] delete
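To make the structure of these headers concrete, here is a small Python sketch that splits a Lifepoint value into its optional bracketed end date and its comma-separated constraints, then checks for the WORM-style “deletable=no” directive. The names `parse_lifepoint` and `is_worm` are hypothetical helpers for illustration. This is not how the Health Processor is implemented, and the date is kept as a raw string rather than validated.

```python
import re
from typing import Dict, Optional, Tuple

# Optional "[end date]" prefix, followed by the constraint list.
LIFEPOINT_RE = re.compile(r"^\s*(?:\[(?P<end>[^\]]*)\])?\s*(?P<constraints>.*)$")


def parse_lifepoint(value: str) -> Tuple[Optional[str], Dict[str, str]]:
    """Split a Lifepoint header value into (end_date, constraints)."""
    match = LIFEPOINT_RE.match(value)
    end_date = (match.group("end") or "").strip() or None
    constraints: Dict[str, str] = {}
    for item in match.group("constraints").split(","):
        item = item.strip()
        if not item:
            continue
        key, _, val = item.partition("=")
        constraints[key.strip()] = val.strip()  # bare directives map to ""
    return end_date, constraints


def is_worm(value: str) -> bool:
    """True if this lifepoint forbids deletion (deletable=no)."""
    _, constraints = parse_lifepoint(value)
    return constraints.get("deletable") == "no"


print(parse_lifepoint("[Wed, 12 Dec 2015 15:59:02 GMT] reps=3, deletable=no"))
print(is_worm("[Sun, 08 Jun 2016 15:59:02 GMT] reps=2, deletable=yes"))
```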
Swarm provides several mechanisms to confirm and maintain the integrity of the content stored in objects. Always keep in mind the integrity of the client itself; memory errors can happen, and Error Correction Code (ECC) is your friend. And last, but certainly not least, garbage in, garbage out (GIGO). Make sure that data you store initially is valid.
If you have questions about how Swarm maintains the integrity of your content, or any other questions related to object storage, feel free to contact us. We are also happy to schedule a custom demo for you.