Swarm is unified storage software, enabled by pure object storage, designed to store unstructured data, also referred to as fixed content or reference information. This includes documents, e-mail, images, audio, video, voice mails, ring tones, and medical images and records.
Swarm software runs on standard, commodity server hardware (x86), which enables organizations to implement affordable clustered storage that delivers high performance, scalability and reliability. Swarm may be used to store all of an organization’s content because it delivers the speed of primary storage while remaining secure and cost-effective enough to archive content for the duration of its useful life.
Caringo CAStor was the first software-only storage product launched in 2005. In May 2014, following Caringo’s continued innovation in the object storage space and the introduction of interfaces that enable unified storage on top of Caringo’s pure object storage approach, we changed the name of the product to Swarm. We kept the version number consistent to show the maturity of the product after almost a decade of development and market hardening.
Swarm is enabled by a massively parallel architecture, also known as a redundant array of independent nodes, in which nodes boot from bare metal, run completely from RAM, and can each perform every process. All legacy technology that constrains traditional storage, such as file systems and OS installs, is eliminated. Swarm is what makes Caringo the most efficient and easy-to-manage object storage solution on the planet.
The easiest way to understand object storage is to use the analogy of valet parking. When you hand over your car to have it parked, you are issued a ticket with a specific number on it. In order to retrieve the car, you simply provide the ticket/number to the valet. You don’t care where the car is parked, if it was moved or how it is stored as long as it is returned in the same condition.
Object storage functions similarly, but with digital content. A file is submitted/stored and a key/unique identifier (UUID) is returned to the application for future access. When that file is later requested for retrieval, the application passes the key back to the object storage system and the file is retrieved. There are no file hierarchies, folder names or disk locations associated with the stored file. If the file is moved in storage, the key never changes. The file is free to move system wide enabling highly automated system management and optimization which is the reason why object storage can scale easily in file count and capacity.
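The valet-ticket model can be sketched in a few lines of Python. This is a toy, in-memory illustration only, not Swarm code: the class name ToyObjectStore, its methods, and the node names are all hypothetical. It shows that the key handed back to the application stays valid even after the object is moved internally.

```python
import uuid


class ToyObjectStore:
    """In-memory stand-in for an object store: a flat key space, no paths."""

    def __init__(self):
        self._objects = {}   # key -> content bytes
        self._location = {}  # key -> node name (internal detail, hidden from callers)

    def put(self, data: bytes, node: str = "node-1") -> str:
        """Store content and hand back a unique key (the 'valet ticket')."""
        key = uuid.uuid4().hex
        self._objects[key] = data
        self._location[key] = node
        return key

    def get(self, key: str) -> bytes:
        """Retrieve content by key alone; the caller never supplies a location."""
        return self._objects[key]

    def move(self, key: str, new_node: str) -> None:
        """Relocate the object internally; the key does not change."""
        self._location[key] = new_node


store = ToyObjectStore()
ticket = store.put(b"quarterly report contents")
store.move(ticket, "node-7")  # the system is free to move the object
assert store.get(ticket) == b"quarterly report contents"
```

The application only ever holds the ticket, so nothing on the application side has to change when the store rebalances or replaces hardware.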
As with any technology there are different ways to implement technical concepts. Some object storage solutions are built on top of file systems or employ single points of failure like controller nodes, management nodes or metadata databases. Pure object storage does not have a file system, metadata database or any single points of failure. Pure object storage is the only way to provide uncomplicated, massive scale since all bottlenecks are removed and complexity is automatically managed by the software as the system grows.
Swarm protects data through Elastic Content Protection (ECP) that provides the storage industry’s most comprehensive data protection functionality and enables you to match the right storage durability and SLA to individual users or apps, optimize data center footprint, and enhance data accessibility regardless of content size, capacity or number of data centers. ECP is the only data protection scheme to provide the choice and movement between replication and erasure coding, simultaneously available on the same node with automated migration from one protection scheme to the other. The result is a storage solution that dynamically adapts storage capacity utilization and object count based on your business, retention or accessibility requirements.
Erasure coding provides enterprise-grade protection with a smaller storage footprint and is ideal for moderately sized files (1 MB+) and large content stores.
As file sizes and data sets grow, operational resources such as data center space, power, cooling and hardware become more difficult to manage, and copy-based protection schemes become less efficient. The solution is erasure coding, which breaks a file into multiple data segments and computes parity segments. The resulting total segments use less capacity and fewer operational resources than an additional copy but still provide enterprise-grade protection.
Files are split into multiple data segments (k), and additional parity segments (p) are computed from the content of the data segments. This results in m total segments (k + p = m) being distributed to m different Swarm nodes or sub-clusters.
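The k + p = m relationship can be illustrated with the simplest possible erasure code: a single XOR parity segment (p = 1). This is a sketch under stated assumptions, not Swarm's encoding, which the text above does not specify; the function names and the 4:1 scheme are chosen purely for illustration.

```python
from functools import reduce


def split(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal-length segments, zero-padding the tail."""
    size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(size * k, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(k)]


def xor_parity(segments: list[bytes]) -> bytes:
    """Compute one parity segment as the byte-wise XOR of all given segments."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*segments))


data = b"unstructured content"
k, p = 4, 1                    # illustrative scheme: 4 data + 1 parity segment
segments = split(data, k)
parity = xor_parity(segments)
m = k + p                      # total segments to distribute to m nodes

# Simulate losing data segment 2 and rebuilding it from the survivors.
survivors = [s for i, s in enumerate(segments) if i != 2] + [parity]
rebuilt = xor_parity(survivors)
assert rebuilt == segments[2]

# Raw capacity used per byte of content: m / k, versus 2.0 for one extra full copy.
overhead = m / k
```

With these illustrative numbers the segments occupy 1.25 times the size of the original content, compared with 2 times for a single additional replica. Production erasure codes use more than one parity segment so that several simultaneous losses can be tolerated.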
Replication provides enterprise-grade protection with faster access for small or large files and is ideal for content delivery and small files (KBs-MBs).
Responsiveness is enabled by rapid access to a file. This means that, to ensure snappy response times, replicating whole copies is a better solution than splitting a file into segments. In addition, there is overhead associated with splitting a file into segments, which is why erasure coding is less efficient for smaller files. For these use cases replication is ideal.
Replication uses copy-based data protection, where complete copies of a piece of content are made and distributed across nodes or sub-clusters. Because replication stores data contiguously on disk, once the first bit is located the content is delivered rapidly and efficiently without the need for rehydration.
Swarm employs a hash algorithm that computes a digest, sometimes referred to as a digital fingerprint, based on the bit sequence of each content object (file). Swarm’s Health Processor (HP) runs in the background and uses the digest to continuously check the content’s integrity. If an object is determined to be corrupt, a new replica is generated from another correct replica stored in the system. This ensures that the correct number of clean replicas is always available and accessible in Swarm.
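The digest-based integrity check can be sketched as follows. This is a simplified illustration, not the Health Processor itself: SHA-256 is an assumed hash algorithm, and the digest and repair functions are hypothetical names. A replica whose digest no longer matches the stored fingerprint is replaced with a copy of a replica that still matches.

```python
import hashlib


def digest(content: bytes) -> str:
    """Compute a digital fingerprint of the content (SHA-256 is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def repair(replicas: list[bytes], expected: str) -> list[bytes]:
    """Replace each replica whose digest mismatches with a copy of a clean one.

    Raises StopIteration if no clean replica remains.
    """
    clean = next(r for r in replicas if digest(r) == expected)
    return [r if digest(r) == expected else clean for r in replicas]


original = b"medical image bytes"
fingerprint = digest(original)                 # recorded when the object is written
replicas = [original, b"medical image bytez"]  # second replica has been corrupted
replicas = repair(replicas, fingerprint)
assert all(digest(r) == fingerprint for r in replicas)
```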
The hash digest is also used as the Content Integrity Seal, which is a method to prove the authenticity of a content object as an original for compliance and evidentiary purposes in an open, customer-auditable data structure. Swarm separates the content address (UUID) from the digital fingerprint (digest), allowing the hash algorithm to be seamlessly upgraded if the original algorithm is compromised. This has happened with both the MD5 and SHA-1 algorithms, and Swarm’s patented, transparently upgradeable hash assures the long-term integrity of content.
Yes, Swarm can provide WORM storage if specified when content is written. Once specified, WORM content can never be deleted. Swarm can also manage content lifecycle information automatically: you can specify that a file cannot be changed throughout its defined life and cannot be deleted until the retention period has expired. This addresses regulatory mandates such as SEC Rule 17a-4, which is the most stringent regulatory requirement defined for data storage.
Legal Hold creates a point-in-time snapshot of a specified set of objects. The snapshot is then stored immutably regardless of what happens to the original objects or cluster, which satisfies SEC Rule 17a-4(f).
Capacity can be increased in a Swarm cluster dynamically while the system is running. Simply add a new node to the cluster and the available capacity is automatically added to the available pool without the need to provision or configure the new storage. Upgrading the server hardware for Swarm nodes is similarly easy. Boot the new, updated server(s) into the cluster then gracefully retire the older node(s) to be removed. All the content on the server node being retired is replicated to other nodes in the cluster and when completed its disks are wiped clean and it can be removed. All this is done while the Swarm cluster is operational and without impact to applications or data availability. Caringo calls this hot scaling.
The Swarm cluster is easy to administer. It eliminates the need to provision or configure storage when new capacity is added. Its self-healing characteristics allow the Swarm cluster to seamlessly recover from a failed node or disk without impacting data availability. If a node goes down, the cluster immediately recognizes its loss and the rest of the cluster works together to replicate all the content on the impaired node. This occurs without administrator intervention or impact to applications and data availability. The Swarm cluster is also self-balancing such that it will automatically balance stored content evenly across nodes in the cluster for optimal performance and to eliminate hot spots. All of these actions require minimal administrative overhead and a Swarm cluster can be managed from a central browser interface that is the same whether there are a couple of nodes or thousands of nodes in the cluster.
Applications integrate with Swarm using standard HTTP 1.1. Swarm uses a simplified subset of the HTTP 1.1 standard called Simple Content Storage Protocol (SCSP) as the native interface to Swarm. It is an on-the-wire protocol that will never be outdated and will never require porting. Essentially, there is no proprietary API and any application or web service can be interfaced to Swarm in a matter of hours.
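Because the native interface is plain HTTP 1.1, any standard HTTP client library can be used. The sketch below builds, but does not send, a write request and a read request with Python's standard library. The host name, URL layout, and key value are assumptions made for illustration and are not taken from the SCSP specification; consult the SCSP documentation for the actual request and response details.

```python
from urllib.request import Request

CLUSTER = "http://swarm.example.com"  # hypothetical node address


def build_write(data: bytes, content_type: str) -> Request:
    """Build an HTTP POST that submits content (path layout is an assumption)."""
    return Request(CLUSTER + "/", data=data, method="POST",
                   headers={"Content-Type": content_type})


def build_read(key: str) -> Request:
    """Build an HTTP GET that asks for content by its key (layout is an assumption)."""
    return Request(f"{CLUSTER}/{key}", method="GET")


write_req = build_write(b"hello", "text/plain")
read_req = build_read("0123456789abcdef0123456789abcdef")  # placeholder key
print(write_req.get_method(), write_req.full_url)
print(read_req.get_method(), read_req.full_url)
```

Passing either request to urllib.request.urlopen would send it over the wire; that step is left out here so the sketch runs without a cluster.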
CloudScaler extends the standard Swarm API with authentication and authorization functionality in addition to support for the Amazon S3 API that enables the existing ecosystem of applications that have integrated S3 support to interface directly with Swarm.
Caringo also supports traditional file and storage protocols (SMB, NFS, FTP and WebDAV) through FileScaler, previously named Content File Server (CFS). FileScaler is not a classic file system. Rather, it is a thin mapping layer that looks like a file system to applications and speaks HTTP to Swarm. On the front end it presents a standard file system interface, and on the back end it delivers a vast flat address space, massive scalability, high performance, and reliability.
Unlike file systems that ride on top of block storage devices, Swarm provides a single, flat address space to store content, and information about the file is stored with the object as metadata. In file-system-based storage, information about the file is stored in the file system while the file itself is fused to a specific hardware location through file hierarchies, folder names and physical disk location (inodes and blocks). Swarm stores files as whole objects or object segments in contiguous disk space and only needs to manage a single UUID for each piece of content. This approach abstracts information from the hardware layer, which enables the movement of objects throughout the storage system and the continuous evolution of hardware while maintaining the integrity and durability of information.
Yes, Swarm allows custom metadata to be defined by applications to uniquely describe content objects. Swarm stores all metadata with the actual content, and the metadata persists throughout the object’s life cycle. Other metadata elements include the number of replicas to be maintained, erasure coding scheme, retention period, content type, file name, originating application and others. Swarm also supports a special metadata element called the Lifepoint™.
Lifepoints are system-enforced, system-managed content lifecycle policies. Swarm stores all operational and descriptive information needed to execute these policies with the object itself, eliminating the need for a vulnerable metadata database and associated database administration.
The Health Processor continuously runs in the background enforcing Lifepoints and ensuring optimal use of available resources system wide.
Yes, the Indexer is a scale-out application that can run on dedicated hardware or a VM and provides ad-hoc querying of a Swarm cluster. Results can be viewed through the web-based Indexer portal or delivered as raw XML or JSON for import into third-party analytics packages. Swarm can also be integrated with third-party indexing or search engine applications.