This article was originally published on Linux CloudExpo Journal.
Bridging the divide between legacy storage and new data management platforms can strain IT organizations and budgets, and it can keep businesses from adopting cost-effective, scalable storage infrastructures. Businesses can avoid some of these constraints, however, by evaluating their storage options objectively and asking themselves three important questions.
A decade ago, we were putting 250-gigabyte drives into servers. When people mentioned the cloud, they were talking about the weather, and a business was considered to be on the cutting edge if it needed to store a few million files.
Now, we have access to 10-terabyte drives, and grandparents are using the cloud to store pictures of their grandkids. It’s now common for businesses to need access to billions of files, so companies need to move to newer systems to keep track of everything. With so many options available today, what’s really the best solution for storage?
Common Data Storage Systems
To truly understand legacy storage systems, you need to know how storage has evolved beyond the hard drive. Here is a quick rundown of the most common solutions that have emerged over the years:
- Storage Area Network (SAN): A SAN is a dedicated network that connects storage devices with servers, typically using Fibre Channel, InfiniBand, or Ethernet. SANs are commonly used for database servers and other applications that require a low-latency block-level storage interface. Advanced setups allow for clustering and failover capabilities among the servers. The downsides of SANs are that they often require exotic network hardware, proprietary software tools, and specialized staff to deploy and manage them. For these reasons, membership in the storage area network is normally limited to a small number of servers.
- Network Attached Storage (NAS): The storage device in a NAS can be a purpose-built NAS appliance or a general-purpose server running Windows or Linux that delivers files to clients. While there have historically been many protocols that connect storage devices and clients, the market has settled upon a couple: Network File System (NFS) and Server Message Block (SMB). A NAS appliance is an all-in-one bundle of integrated hardware and software that is built for the sole purpose of delivering files to clients. Almost any general-purpose server can also deliver files and act like a NAS with the appropriate level of administrative configuration. Unfortunately, NAS systems have inherent disadvantages. With limited potential to scale, they can quickly become costly, complex, and labor-intensive to manage.
- Software-Defined Storage (SDS): SDS is still an evolving concept that can include file-based, object-based, block, cloud and storage management solutions. Software-defined storage essentially separates the data and services layers from the underlying hardware. Software-defined storage solutions typically involve storage virtualization, and they may provide features like search, organization, replication, distribution, thin provisioning, snapshots and backup to name a few.
- Cloud-Based Storage (Public and Private): As a multi-tenant environment, a public cloud storage system requires you to purchase a portion of a cloud-based computing environment that is shared with many other tenants. Public cloud storage is offered in an on-demand arrangement with monthly payments that can be advantageous; however, capacity and access costs are compounded monthly and won’t go down until data is deleted. Because you pay for the bandwidth to use your data, you may resist running analytics and other operations that would incur additional monthly charges. Private cloud storage solutions let you deploy storage as a service within your data center. You need to make an upfront capital investment in hardware and have the data center space and electrical power to run the service. If security is a priority, you are storing large amounts of data for long periods of time, or you are performing a lot of reads (such as analytics) on your data, a private cloud is almost always the best option. There are also hybrid solutions that provide a combination of private and public services.
- Object-Based Storage: One popular type of SDS is object-based storage, which is at the heart of many public and private cloud-based storage services. In this model, there is no hierarchical folder structure; however, object-based storage does provide a method for data organization using metadata (often defined as “data about data”). In object-based storage systems, the data is organized into self-contained entities (objects). This flat approach provides for greater scalability and can be less expensive than block or file-based storage systems. For businesses with a need to store and search through high volumes of data, this is often the ideal solution.
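To make the flat, metadata-driven model of object-based storage more concrete, here is a minimal in-memory sketch in Python. It is an illustration only and not the API of any real product: the class name `FlatObjectStore`, its methods, and the use of random UUIDs as object identifiers are all assumptions made for this example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """A self-contained entity: the payload plus its descriptive metadata."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class FlatObjectStore:
    """Hypothetical object store with a single flat namespace (no folders)."""

    def __init__(self):
        # One flat mapping from object ID to object; there is no hierarchy.
        self._objects = {}

    def put(self, data, **metadata):
        """Store a payload with arbitrary metadata and return its new ID."""
        object_id = uuid.uuid4().hex  # assumption: random IDs, for illustration
        self._objects[object_id] = StoredObject(data, dict(metadata))
        return object_id

    def get(self, object_id):
        """Retrieve a payload directly by ID."""
        return self._objects[object_id].data

    def search(self, **criteria):
        """Return IDs of objects whose metadata matches every criterion."""
        return [
            object_id
            for object_id, obj in self._objects.items()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        ]


store = FlatObjectStore()
photo_id = store.put(b"...jpeg bytes...", kind="image", owner="alice")
log_id = store.put(b"...log lines...", kind="log", owner="alice")
print(store.search(kind="image") == [photo_id])
```

Because objects are located by ID or by metadata rather than by a path, nothing in this sketch depends on a directory tree. A production system would index metadata rather than scan every object, which this sketch does not attempt.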
Building a cost-effective and scalable storage infrastructure is not a task to be taken lightly. Initiatives like this have the potential to impact IT resources and inflate budgets. So how do you bridge the divide between legacy storage systems and the new data management platforms?
Planning for the Future
Bridging the gap between legacy storage and newer technologies sometimes requires ensuring compatibility through protocols such as S3, RESTful HTTP, NFS, and SMB. As a result, business and IT leaders should consider a few important matters before determining the best data management platform to use for taking their businesses into the future.
- What type of data is being stored, and how quickly is it growing? SAN and NAS are still your best options for structured data; however, structured information is often less than 10 percent of an organization's total. Unstructured data is often 90 percent or more of the total capacity need. If you focus on your unstructured data growth rate year over year, you’ll most likely notice an acceleration. Some of this acceleration can be attributed to factors such as improvements in resolution for videos and images as well as new sources of unstructured data, such as log files, metrics, and data created by devices. Create a formula based on these considerations, and use it in conjunction with your historic storage capacity compound annual growth rate (CAGR) to estimate your needs three to five years out. Using your forecasted capacity need, select a storage solution that can expand to accommodate your expected growth.
- What are your access patterns? When you think about access, consider what (device, application, etc.) and who needs access and exactly how they will access it (e.g., geographical location and interface or search mechanism). When you have billions of files, how will you find what you need? Almost as important, how will you determine what you don’t need so you can confidently delete this data? When choosing your future storage platform, make sure your chosen solution supports your organization’s access requirements.
- How long must the data be retained? Data retention periods vary by industry from a few seconds to indefinitely. When you think about retention, consider the cost of different protection methods versus the value of the data and ease of migration (e.g., how easy it is to continue to evolve the underlying hardware infrastructure). If you factor ease of migration into your decisions today, you will make your life simpler when you one day find yourself needing to migrate petabytes or possibly exabytes of data. Beyond how long you are required to retain data, consider how long that data may be valuable to you from both an information and a monetary perspective.
- Preserving the relationships between data and keeping content accessible and instantly searchable increases profit and agility, something every forward-thinking business leader understands. If you can keep your data online, organize it, and search it, you can continue to extract value from it.
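The capacity estimate described in the first question above can be sketched in a few lines of Python. This is one possible formula under stated assumptions, not a prescribed method: the function names, the sample figures, and the choice to model acceleration as a fixed amount added to the growth rate each year are all hypothetical.

```python
def cagr(start_capacity, end_capacity, years):
    """Compound annual growth rate between two capacity measurements."""
    if start_capacity <= 0 or end_capacity <= 0 or years <= 0:
        raise ValueError("capacities and years must be positive")
    return (end_capacity / start_capacity) ** (1 / years) - 1


def forecast(current_capacity, rate, years_out, acceleration=0.0):
    """Project capacity for each future year.

    `acceleration` is an assumed modelling choice: it is added to the
    growth rate after each year to reflect growth that is speeding up.
    """
    projections = []
    capacity = current_capacity
    for year in range(1, years_out + 1):
        capacity *= 1 + rate
        projections.append((year, capacity))
        rate += acceleration
    return projections


# Hypothetical figures: 100 TB three years ago, 250 TB today.
historic_rate = cagr(100, 250, 3)
for year, capacity_tb in forecast(250, historic_rate, 5, acceleration=0.02):
    print(f"Year {year}: {capacity_tb:,.0f} TB")
```

Replace the sample numbers with your own measurements, and adjust the acceleration term to reflect the factors you identified, such as higher-resolution media or new device-generated data.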
In the information age, those who can leverage long-tail data will not only succeed but will also reap benefits orders of magnitude greater than those constrained by the limits of traditional technologies.