Caringo: Fixed Content Storage
The Age of CAS:
Thoughts from the Inventor of Content Addressed Storage
by Paul Carpentier, CTO, Caringo, Inc.
 
CAS - Content Addressed Storage - is a surprisingly simple and straightforward concept, with many mainstream applications in all realms of information processing. Why, then, did it take until 2002 for the first true CAS architecture to come to market?

In the mid-1990s, while I was developing the initial peer-to-peer networking concepts that would eventually become CAS, it seemed natural to me to have a mechanism that allowed for the unique identification (and thus finding) of a piece of content, regardless of where it was located. Looking closer, I realized that all popular disk and network storage mechanisms worked in exactly the opposite way: they essentially used the location of a file - the "fully qualified path name" - as its identifier. Can you imagine using your street address as your Social Security number? Nor can I. Yet that was the state of storage technology at the end of the previous century.

Enter CAS, with its "content addressing" instead of "location addressing." When "hooking up" a digital X-ray image file to a patient record in a hospital information system, we want, for obvious reasons, to be 100 percent sure we're referring to the right information. Ideally, the image file would be identified using a unique "serial number" - a content address - that is entered into the patient record in the hospital database as a truly unbreakable link between the patient and the image.

Storing and retrieving the image is just as easy: think of CAS as a big valet parking lot for data objects. You hand in any file or data object and you get a ticket stub back: a globally unique identifier for exactly that object, with that specific content. Just as with valet parking, you don't need to know exactly where your data object is being stored, only that you can get it back by showing the ticket stub with the unique identifier.
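
The valet-parking analogy can be sketched in a few lines of Python. This is only a toy, in-memory illustration, not CAStor's interface: the class name `ValetStore`, its `put` and `get` methods, and the record fields are hypothetical, and the ticket is assumed here to be a random 128-bit number.

```python
import secrets


class ValetStore:
    """Toy in-memory content store: hand in bytes, get a ticket stub back."""

    def __init__(self):
        self._objects = {}  # ticket (hex string) -> stored bytes

    def put(self, data: bytes) -> str:
        # Issue a fresh 128-bit ticket; the caller never learns a location.
        ticket = secrets.token_hex(16)
        self._objects[ticket] = bytes(data)
        return ticket

    def get(self, ticket: str) -> bytes:
        # The ticket is the only way in: there is no listing or search call.
        return self._objects[ticket]


store = ValetStore()
xray_ticket = store.put(b"...digital X-ray image bytes...")

# The patient record only needs to hold the ticket (hypothetical fields).
patient_record = {"patient": "hypothetical-0001", "xray": xray_ticket}
image = store.get(patient_record["xray"])
```

Note that the store exposes no path, directory, or enumeration of its contents; the application keeps the ticket and nothing else.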

A brief note about the unique identifier: believe it or not, a 128-bit number allows for more than enough different unique content addresses to avoid accidental double-ups or brute-force enumeration exploits. So many, in fact, that if 10 billion people in the world each had 1 million files stored in CAS, and a patient hacker had a computer guess 1,000 times per second to try to match the content address of any single file in that collection, it would take on the order of a billion millennia to do so!
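
The arithmetic behind that figure can be checked directly. The sketch below uses only the numbers quoted above and assumes that guesses are drawn uniformly at random from the 128-bit space; the variable names are illustrative.

```python
ADDRESS_BITS = 128
PEOPLE = 10 * 10**9          # 10 billion people
FILES_PER_PERSON = 10**6     # 1 million files each
GUESSES_PER_SECOND = 1_000

address_space = 2 ** ADDRESS_BITS
stored_files = PEOPLE * FILES_PER_PERSON   # 10**16 addresses in use

# Each random guess hits some stored file with probability
# stored_files / address_space, so the expected number of guesses
# before the first hit is address_space / stored_files.
expected_guesses = address_space / stored_files

SECONDS_PER_YEAR = 365.25 * 24 * 3600
expected_seconds = expected_guesses / GUESSES_PER_SECOND
expected_millennia = expected_seconds / SECONDS_PER_YEAR / 1000

print(f"{expected_millennia:.2e} millennia")
```

Under these assumptions the expected wait comes out at a little over one billion millennia, which agrees with the estimate in the text.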

There are different ways to generate a unique identifier. One way is to use a so-called cryptographic hash algorithm: the result can be used both as a unique content address and as a digital fingerprint to validate the integrity of the stored content before handing it out again at retrieval time. While this seemed a nice idea at the time, the reality is that a hash algorithm has a limited, unpredictable life expectancy before being cracked, and is thus unsuitable for proof-of-integrity purposes well within the life cycle of the stored information.
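
As a minimal sketch of the hash-as-address approach: the function names below are hypothetical, and SHA-256 is chosen purely for illustration, since the article does not name a specific algorithm.

```python
import hashlib


def content_address(data: bytes) -> str:
    # The digest of the content doubles as its address.
    return hashlib.sha256(data).hexdigest()


def verify(address: str, data: bytes) -> bool:
    # At retrieval time, recompute the fingerprint and compare.
    return content_address(data) == address


original = b"scanned invoice, page 1"
address = content_address(original)

assert verify(address, original)             # intact content checks out
assert not verify(address, original + b"!")  # any change is detected
```

The weakness described above is visible in the design: the address is permanently tied to one algorithm, so once that algorithm is broken, both the address and the integrity proof lose their value together.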

Caringo's patent-pending answer to this dilemma is to use a simple random number as a lifetime, unique content address, transparently secured by a hash algorithm that can be upgraded over time when the need arises. Cryptographic operations linking identifiers and hashes provide auditable proof of integrity, even in the very long term.
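
The article gives no details of the patent-pending mechanism, so the following is not Caringo's design. It is only a rough illustration of the general idea: a random identifier that stays fixed for life, paired with a hash-based seal that can be re-issued under a newer algorithm. The names `Seal`, `seal`, `verify` and `upgrade` are hypothetical.

```python
import hashlib
import secrets
from dataclasses import dataclass


@dataclass
class Seal:
    algorithm: str  # a name accepted by hashlib.new, e.g. "sha256"
    digest: str     # hex digest over identifier + content


def seal(identifier: str, data: bytes, algorithm: str) -> Seal:
    # Bind the lifetime identifier and the content together in one digest.
    h = hashlib.new(algorithm)
    h.update(bytes.fromhex(identifier))
    h.update(data)
    return Seal(algorithm, h.hexdigest())


def verify(identifier: str, data: bytes, s: Seal) -> bool:
    return seal(identifier, data, s.algorithm).digest == s.digest


def upgrade(identifier: str, data: bytes, old: Seal, new_algorithm: str) -> Seal:
    # Re-seal under a newer algorithm only if the old seal still verifies.
    if not verify(identifier, data, old):
        raise ValueError("integrity check failed; refusing to re-seal")
    return seal(identifier, data, new_algorithm)


identifier = secrets.token_hex(16)  # random 128-bit address, never changes
data = b"medical image bytes"
old_seal = seal(identifier, data, "sha256")
new_seal = upgrade(identifier, data, old_seal, "sha3_256")
```

In this sketch the identifier stored in application records is untouched by the upgrade; only the seal changes. A real system would also need an auditable record linking the old and new seals, which is omitted here.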

While simple and straightforward in concept, CAS offers very unusual advantages that are often misunderstood precisely because they differ so much from "classical" storage architectures. The only way to retrieve information is by handing in the valet parking stub (the unique identifier). This is not a bug, but a very valuable feature: it means that each and every data object inside CAS has its own individual password protection. Having the identifier means getting to the data; otherwise, no way!

While embedding a 128-bit number in a data record is easy, the resulting functionality may be spectacular: hooking up a picture to an HRM record that was never programmed to hold one, attaching scanned invoices to text-only accounting records, attaching all kinds of extra documents or multimedia objects to existing workflows, ERP systems or intranet portals. Content in context, so to speak.

The above examples show clearly why true CAS should never embed any kind of search engine: providing alternate pathways into the stored data would spoil the magically simple and unbreakable security provided by the identifiers acting as keys. CAS and search together do make for a perfect couple at the application level, but definitely not below - for security reasons.

A CAS-powered architecture lends itself perfectly to a massively parallel implementation on any commodity x86 hardware. The ability to mix vendors and hardware specs without migration is an absolute requirement for any large infrastructure that needs to grow and evolve over time.

This promise of inexpensive, near-infinite storage really becomes significant if we revisit the true nature of content information in today's applications. Let's just look at which data types would not be suitable for content storage architectures like CAS: only those files that are subject to high-frequency, block-oriented read-write operations; in other words, database files. So just think of content storage as "Everything But Databases." Living databases, that is, because database backup/archiving is a perfect CAS application.

The economic consequences are staggering: only the small database fraction of corporate data can actually justify being stored on NAS and SAN infrastructures, which are so expensive to purchase, maintain, administer, expand and migrate, yet so fragile for longer-term storage. Almost all other data - the bulk of corporate storage - are really just content and could be stored in a CAS infrastructure with a TCO that is often a full order of magnitude lower: office documents or email archives, bulky multimedia files, voice messages, video streams, graphics files, pictures, scanned documents, medical images, web pages, intranet content, internet files like photos, music and video clips, anything but databases … more than 90 percent of all data. In addition, all that content is a lot simpler to share on an intranet or beyond by passing along a simple, unbreakable URL.

Unbreakable? On regular file or web servers, broken links are a fact of life, but the unique identifiers handed out by CAS remain valid for life, leaving no doubt as to the specific content version. That valuable characteristic allows for building compound document sets with robust, inter-document references that remain valid for the entire desired lifetime of the information, a scenario enabled by the low per-capacity cost of CAS. You can afford to leave the information stored there until it is truly not required anymore, and then have it disposed of, even shredded, by an automated policy.

By now it must be clear why you have not heard this story before. You could hardly imagine your father's SAN & NAS vendor sharing this news with you and endangering some of its most lucrative business. It will take new players without vested interests in incumbent complex storage mechanisms to help you reap the long-overdue storage benefits enabled by massively parallel commodity hardware.

The age of CAS is here. Enjoy!