Storage

In Kubernetes, using storage in the right way is well-facilitated by the API.
However, unless you are running in a major public cloud, that API may not be hooked up to anything. This frequently sends users down a rabbit hole of researching all the various options for storage backends for their platform, for Kubernetes, and for their workloads. There are a lot of options out there, and it can be fairly bewildering.

For Talos, we try to limit the options somewhat to make the decision-making easier.

Public Cloud

If you are running on a major public cloud, use their block storage. It is easy and automatic.

Storage Clusters

Redundancy in storage is usually very important. Scaling capabilities, reliability, speed, maintenance load, and ease of use are all factors you must consider when managing your own storage.

Running a storage cluster can be a very good choice when managing your own storage, and there are two project we recommend, depending on your situation.

If you need vast amounts of storage composed of more than a dozen or so disks, just use Rook to manage Ceph. Also, if you need both mount-once and mount-many capabilities, Ceph is your answer. Ceph also bundles in an S3-compatible object store. The down side of Ceph is that there are a lot of moving parts.

Please note that most people should never use mount-many semantics. NFS is pervasive because it is old and easy, not because it is a good idea. While it may seem like a convenience at first, there are all manner of locking, performance, change control, and reliability concerns inherent in any mount-many situation, so we strongly recommend you avoid this method.

If your storage needs are small enough to not need Ceph, use Mayastor.

Rook/Ceph

Ceph is the grandfather of open source storage clusters. It is big, has a lot of pieces, and will do just about anything. It scales better than almost any other system out there, open source or proprietary, being able to easily add and remove storage over time with no downtime, safely and easily. It comes bundled with RadosGW, an S3-compatible object store. It comes with CephFS, a NFS-like clustered filesystem. And of course, it comes with RBD, a block storage system.

With the help of Rook, the vast majority of the complexity of Ceph is hidden away by a very robust operator, allowing you to control almost everything about your Ceph cluster from fairly simple Kubernetes CRDs.

So if Ceph is so great, why not use it for everything?

Ceph can be rather slow for small clusters. It relies heavily on CPUs and massive parallelisation to provide good cluster performance, so if you don’t have much of those dedicated to Ceph, it is not going to be well-optimised for you. Also, if your cluster is small, just running Ceph may eat up a significant amount of the resources you have available.

Troubleshooting Ceph can be difficult if you do not understand its architecture. There are lots of acronyms and the documentation assumes a fair level of knowledge. There are very good tools for inspection and debugging, but this is still frequently seen as a concern.

Mayastor

Mayastor is an OpenEBS project built in Rust utilising the modern NVMEoF system. (Despite the name, Mayastor does not require you to have NVME drives.) It is fast and lean but still cluster-oriented and cloud native. Unlike most of the other OpenEBS project, it is not built on the ancient iSCSI system.

Unlike Ceph, Mayastor is just a block store. It focuses on block storage and does it well. It is much less complicated to set up than Ceph, but you probably wouldn’t want to use it for more than a few dozen disks.

Mayastor is new, maybe too new. If you’re looking for something well-tested and battle-hardened, this is not it. If you’re looking for something lean, future-oriented, and simpler than Ceph, it might be a great choice.

Video Walkthrough

To see a live demo of this section, see the video below:

Prep Nodes

Either during initial cluster creation or on running worker nodes, several machine config values should be edited. This can be done with talosctl edit machineconfig or via config patches during talosctl gen config.

Under /machine/sysctls, add vm.nr_hugepages: "512"
Under /machine/kubelet/extraMounts, add /var/local like so:

yaml

...
extraMounts:
  - destination: /var/local
    type: bind
    source: /var/local
    options:
    - rbind
    - rshared
    - rw
...

Either using kubectl taint node in a pre-existing cluster or by updating /machine/kubelet/extraArgs in machine config, add openebs.io/engine=mayastor as a node label. If being done via machine config, extraArgs may look like:

yaml

...
extraArgs:
  node-labels: openebs.io/engine=mayastor
...

Deploy Mayastor

Using the Mayastor docs as a reference, apply all YAML files necessary. At the time of writing this looked like:

bash

kubectl create namespace mayastor

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/deploy/moac-rbac.yaml

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/deploy/nats-deployment.yaml

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/csi/moac/crds/mayastorpool.yaml

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/deploy/csi-daemonset.yaml

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/deploy/moac-deployment.yaml

kubectl apply -f https://raw.githubusercontent.com/openebs/Mayastor/master/deploy/mayastor-daemonset.yaml

Create Pools

Each “storage” node should have a “MayastorPool” that defines the local disks to use for storage. These are later considered during scheduling and replication of data. Create the pool by issuing the following, updating as necessary:

bash

cat <<EOF | kubectl create -f -
apiVersion: "openebs.io/v1alpha1"
kind: MayastorPool
metadata:
  name: pool-on-talos-xxx
  namespace: mayastor
spec:
  node: talos-xxx
  disks: ["/dev/sdx"]
EOF

Create StorageClass

With the pools created for each node, create a storage class that uses the nvmf protocol, updating the number of replicas as necessary:

bash

cat <<EOF | kubectl create -f -
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: mayastor-nvmf
parameters:
  repl: '1'
  protocol: 'nvmf'
provisioner: io.openebs.csi-mayastor
EOF

Consume Storage

The storage can now be consumed by creating a PersistentVolumeClaim (PVC) that references the StorageClass. The PVC can then be used by a Pod or Deployment. An example of creating a PersistentVolumeClaim may look like:

bash

cat <<EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mayastor-volume-claim
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: mayastor-nvmf
EOF

NFS

NFS is an old pack animal long past its prime. However, it is supported by a wide variety of systems. You don’t want to use it unless you have to, but unfortunately, that “have to” is too frequent.

NFS is slow, has all kinds of bottlenecks involving contention, distributed locking, single points of service, and more.

The NFS client is part of the kubelet image maintained by the Talos team. This means that the version installed in your running kubelet is the version of NFS supported by Talos. You can reduce some of the contention problems by parceling Persistent Volumes from separate underlying directories.

Object storage

Ceph comes with an S3-compatible object store, but there are other options, as well. These can often be built on top of other storage backends. For instance, you may have your block storage running with Mayastor but assign a Pod a large Persistent Volume to serve your object store.

One of the most popular open source add-on object stores is MinIO.

Others (iSCSI)

The most common remaining systems involve iSCSI in one form or another. This includes things like the original OpenEBS, Racher’s Longhorn, and many proprietary systems. Unfortunately, Talos does not support iSCSI-based systems. iSCSI in Linux is facilitated by open-iscsi. This system was designed long before containers caught on, and it is not well suited to the task, especially when coupled with a read-only host operating system.

One day, we hope to work out a solution for facilitating iSCSI-based systems, but this is not yet available.

Last modified March 18, 2022: docs: overhaul all the docs (e3fda049f)