Benjamin Wootton

19 June 2018

Persistent Storage Strategies for Containers

Persistent storage for containers is challenging. But it’s necessary if you want to build stateful apps using containers.

Below is an overview of the challenges involved in persistent storage for containers, along with tips for addressing them.

Storage as an Afterthought

Persistent storage was not originally part of the container picture. Docker containers were initially conceived as here-one-moment, gone-the-next resources that didn't need persistent storage. A container would simply pop into existence, take whatever input it received, do its job, and pop back out of existence, handing off any output to whatever resource was supposed to make use of it next.

This is actually typical of the design and initial development phases of almost any kind of conceptually innovative technology. The designers quite reasonably assume that existing, well-established technology (such as persistent data storage) will be handled by existing, well-established resources, and does not need to be included in the design.

Usually, that’s how it works.

But as always, the devil is in the details. In the case of containers and persistent storage, the details include:

The practical need for some kind of storage for use by the container. Many common software tasks use temporary storage. While it may be possible to design programs that perform those tasks without the need for persistent storage, doing so introduces unnecessary complications.
Prolonged use of individual containers. In practice, some containers remain in use for hours, or even longer. The longer a container remains in use, the greater the likelihood that it will need storage (as a scratchpad, to save state information, or for more complex purposes).
The need for containers to share data. This is a big one. Containers frequently need to work together, and working together generally means sharing data. The easiest and best way to share data is often by means of shared storage. Lack of such storage makes data sharing difficult.

Adding Storage

Docker, third-party developers, and storage providers have taken a variety of approaches to the challenge of giving containers persistent storage capabilities.

Amazon Elastic Block Store

Amazon Elastic Block Store (Amazon EBS) provides persistent block storage volumes for use with Amazon EC2 instances in the AWS Cloud. Each Amazon EBS volume is automatically replicated within its Availability Zone to protect you from component failure, offering high availability and durability. Amazon EBS volumes offer the consistent and low-latency performance needed to run your workloads. With Amazon EBS, you can scale your usage up or down within minutes – all while paying a low price for only what you provision.

Amazon EBS is designed for application workloads that benefit from fine tuning for performance, cost and capacity. Typical use cases include Big Data analytics engines (like the Hadoop/HDFS ecosystem and Amazon EMR clusters), relational and NoSQL databases (like Microsoft SQL Server and MySQL or Cassandra and MongoDB), stream and log processing applications (like Kafka and Splunk), and data warehousing applications (like Vertica and Teradata).

Kubernetes

Kubernetes is currently one of the most popular container management tools. Its Persistent Volumes subsystem provides an API for users and administrators that abstracts the details of how storage is provided from how it is consumed.

Cheryl Hung of StorageOS has a great talk on persistent storage with Kubernetes in production here.

There are instructions on how to configure a Pod to use a PersistentVolumeClaim for storage here.
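As a rough sketch of what such a configuration involves, the Python below builds a PersistentVolumeClaim manifest and a Pod manifest that mounts the claim, then prints both as a single JSON list (kubectl accepts JSON as well as YAML). The names demo-claim and demo-pod, the nginx image, the 1Gi size, and the /data mount path are illustrative assumptions, not values taken from this article or from the linked instructions.

```python
import json


def build_claim(name, size="1Gi"):
    """Return a PersistentVolumeClaim manifest requesting `size` of storage."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteOnce: the volume is mounted read-write by a single node.
            "accessModes": ["ReadWriteOnce"],
            "resources": {"requests": {"storage": size}},
        },
    }


def build_pod(name, claim_name, image="nginx", mount_path="/data"):
    """Return a Pod manifest whose container mounts the named claim."""
    volume_name = "storage"  # hypothetical name linking the volume to its mount
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "volumes": [
                {
                    "name": volume_name,
                    "persistentVolumeClaim": {"claimName": claim_name},
                }
            ],
            "containers": [
                {
                    "name": "app",
                    "image": image,
                    "volumeMounts": [
                        {"name": volume_name, "mountPath": mount_path}
                    ],
                }
            ],
        },
    }


claim = build_claim("demo-claim")
pod = build_pod("demo-pod", claim["metadata"]["name"])

# Wrap both objects in a v1 List so they can be applied from one file.
manifest = {"apiVersion": "v1", "kind": "List", "items": [claim, pod]}
print(json.dumps(manifest, indent=2))
```

The Pod refers to the claim only by name. Which storage backend actually satisfies the claim is decided by the cluster, which is the separation of provision from consumption described above.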

Docker Data Volumes

For storage by individual containers, Docker offers data volumes. These allow a container to use a kind of virtualized persistent storage abstracted from the host system’s storage. This virtualized storage is integrated into the standard container file structure, which makes access easy. By default, however, a data volume is tied to the container that created it: an anonymous volume is not automatically shared with other containers, and it is not picked up by later instances of the same container. Sharing or reuse requires an explicit step, such as giving the volume a name and mounting it in each container that needs it.

Using the Host File System

There are other methods of persistent storage which make much more direct use of the host file system, setting aside some of the host system’s storage for use by the container without the layers of abstraction imposed by data volumes. This can, depending on the method used, allow data to be shared with other containers, or with later instances of the same container. If file and storage management are not fully coordinated between the container and the host system, however, data may be overwritten or corrupted.
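To make the contrast between the two approaches concrete, the sketch below assembles docker run argument lists for a Docker-managed volume and for a bind mount of a host directory, using Docker's --mount flag. The volume name app-data, the host path /srv/app-data, and the alpine image are hypothetical. The commands are only built and printed, not executed.

```python
import shlex


def volume_mount(volume_name, target):
    """Build a --mount value for a Docker-managed named volume."""
    return f"type=volume,source={volume_name},target={target}"


def bind_mount(host_path, target, read_only=False):
    """Build a --mount value that exposes a host directory to the container."""
    spec = f"type=bind,source={host_path},target={target}"
    # A read-only bind mount reduces the risk of the container
    # overwriting files that the host also manages.
    return spec + ",readonly" if read_only else spec


def docker_run(image, mount):
    """Return the argument list for running `image` with one mount."""
    return ["docker", "run", "--rm", "--mount", mount, image]


print(shlex.join(docker_run("alpine", volume_mount("app-data", "/data"))))
print(shlex.join(docker_run("alpine", bind_mount("/srv/app-data", "/data", read_only=True))))
```

With the bind mount, the container sees the host directory exactly as it is, which is why coordination between host and container matters. With the volume, Docker decides where the data lives on the host.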

Everybody Can Join the Storage Party

A much more sophisticated and versatile approach is to create a system of virtualized storage that can be shared by multiple containers, and which persists over time, without being destroyed when individual containers are destroyed. Storage of this kind is typically managed by a plugin, which acts as a kind of high-level storage-system driver, placing a layer (or several layers) of abstraction between the host file system(s) and the file system as seen by the containers.

There are currently a variety of persistent-storage plugins and plugin-driven packaged storage solutions available from third-party vendors and services. These plugins make it possible for containers to make use of the general range of cloud-based storage systems, including block storage and object storage.

Containers on the Move

A plugin typically bases its abstracted storage system on the host system’s storage. What happens when a container is automatically moved to another server? This can be a serious problem, because depending on the system, containers may be moved around frequently, as part of load-balancing, or for other purposes.

If a plugin is running on a specific server, and if it points to storage on that server, then a container which uses that plugin will not be able to access persistent storage when it is moved to another server. Plugins such as Rook or StorageOS (or NFS for on-premises deployments) allow abstracted persistent storage volumes to move with containers when they are relocated, so the container-storage link isn't broken.
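The difference can be illustrated with a toy model. The sketch below is not how Rook, StorageOS, or NFS are implemented. It is a deliberately simplified simulation in which the Cluster class, the host names, and the stored keys are all invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Data kept on each host's own disk, keyed by host name.
    local_disks: dict = field(default_factory=dict)
    # Data kept in a volume that every host can reach over the network.
    shared_volume: dict = field(default_factory=dict)

    def write(self, host, key, value, portable):
        """Store a value either in the portable volume or on the host's disk."""
        if portable:
            self.shared_volume[key] = value
        else:
            self.local_disks.setdefault(host, {})[key] = value

    def read(self, host, key, portable):
        """Read a value as seen by a container running on `host`."""
        if portable:
            return self.shared_volume.get(key)
        return self.local_disks.get(host, {}).get(key)


cluster = Cluster()

# A container on server-a writes one value to host-local storage
# and another to the portable volume.
cluster.write("server-a", "orders", 42, portable=False)
cluster.write("server-a", "sessions", 7, portable=True)

# The container is then rescheduled onto server-b.
print(cluster.read("server-b", "orders", portable=False))   # None: data stayed on server-a
print(cluster.read("server-b", "sessions", portable=True))  # 7: volume is reachable from server-b
```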

The Near-Future Picture

What does the near future hold for persistent storage and containers?

Greater Control

Persistent storage is likely to be increasingly abstracted and virtualized, with layers of sophisticated management of host-system storage resources down to the hardware level. A storage system might, for example, shift stored data between high-speed SSD storage and lower-speed archival storage, depending on the type of data and the containerized application’s need for quick access.
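A tiering decision of this kind could, under simple assumptions, look something like the sketch below. The one-hour "hot" window, the tier names, and the latency_sensitive flag are hypothetical choices made for illustration and do not describe any particular storage product.

```python
SSD = "ssd"
ARCHIVE = "archive"


def choose_tier(last_access, now, hot_window_seconds=3600, latency_sensitive=False):
    """Pick a storage tier for one piece of data.

    Data stays on SSD if the application has marked it latency-sensitive
    or if it was read within the hot window. Otherwise it moves to archive.
    """
    if latency_sensitive or (now - last_access) <= hot_window_seconds:
        return SSD
    return ARCHIVE


now = 1_000_000.0  # fixed timestamp in seconds, so the example is repeatable

print(choose_tier(now - 60, now))                              # read a minute ago
print(choose_tier(now - 86400, now))                           # read a day ago
print(choose_tier(now - 86400, now, latency_sensitive=True))   # old but latency-sensitive
```

A real system would weigh more signals (data type, access frequency, cost), but the shape of the decision is the same: the container sees one volume while the storage layer decides where the bytes live.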

Container-Based and Commodified

One current approach to managing storage in this way is to use a dedicated container to define and manage persistent storage. Such a container can be bundled with storage hardware to provide a standardised, generic storage solution for use with containers. Systems of this sort essentially offer persistent storage as a commodity. We can expect to see greater commodification of this sort in the near future.

Standardised Server Storage

We can also expect to see increasing virtualization of server-based storage. Virtualization of this sort would be less likely to make targeted use of underlying hardware resources. It would, however, have the effect of making such server-based storage even more generic and portable.

This is perhaps the basic near-future trend of persistent storage for containers. It will become more generic, more of a commodity, and more universal. This is likely to be a welcome development for enterprises making heavy use of container-based operations, since it will simplify the deployment and use of persistent storage.

For Now, Roadmaps

For now, however, the situation remains complex and fluid, with a variety of persistent storage solutions in play. This is one of those areas where any enterprise that is looking for the best and most durable solution is likely to see considerable benefit from expert consultation. A first-rate roadmap is indispensable when it comes to finding your way through the changing persistent storage landscape.