This is part three of my small, high-level Kubernetes introduction. In the first post I gave an overview of the control plane, and in the previous post I explained how we can get traffic into our cluster.
In this post we will answer the basic question: When pods can spawn anywhere and die at any time, how do you manage persistent storage?
This question is a bit less subtle than the previous one. Sharing storage across multiple computers and pods immediately jumps out as a problem, whereas for networking people often assume that all the nice load balancer magic that proprietary cloud providers hand you comes automatically with Kubernetes.
We will start with the naive solutions and then move on to more sophisticated and usable ones.
With hostPath volumes you simply mount a fixed directory, file, socket or even device from the host. If your pod dies and respawns on a different host, it will use the same path on that other host, so it only finds its data again if the path is backed by something shared. Maybe you already have a shared network file system available on all hosts. In some instances this might even be a good idea, but usually it also means that you do not get nice things like quotas automatically.
Alternatively you could add node selectors to the Pod, such that it only respawns on the same host.
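To make the two ideas above a bit more concrete, here is a minimal sketch in Python that builds a Pod manifest with a hostPath volume and a node selector and prints it as JSON, which kubectl accepts just like YAML. The pod name, image, host directory and node name are made up for illustration.

```python
import json


def hostpath_pod(name, host_dir, node_name):
    """Build a Pod manifest that mounts host_dir and is pinned to one node."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            # Pin the pod to one node so it always sees the same directory.
            "nodeSelector": {"kubernetes.io/hostname": node_name},
            "containers": [
                {
                    "name": "app",
                    "image": "busybox:1.36",  # placeholder image
                    "command": ["sleep", "3600"],
                    "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                }
            ],
            "volumes": [
                {
                    "name": "data",
                    # DirectoryOrCreate creates the directory on the host if it is missing.
                    "hostPath": {"path": host_dir, "type": "DirectoryOrCreate"},
                }
            ],
        },
    }


pod = hostpath_pod("hostpath-demo", "/srv/demo-data", "node-1")
print(json.dumps(pod, indent=2))
```

Saved as a script (say pod.py, a hypothetical file name), the output can be piped into kubectl apply -f - to try it out on a test cluster.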
Actually there is another type of volume for exactly this purpose:
A local volume (the volume type is called local) is pretty much the same thing as a hostPath volume, but it records on which nodes the volume is available, and pods are then scheduled accordingly, without you having to manage the placement of the pods yourself.
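A volume of this kind is declared as a PersistentVolume object (the resource is introduced further down in this post), with the type spelled local in the spec and the node information stored under nodeAffinity. The following Python sketch builds such a manifest; the volume name, path, node name, size and storage class name are assumptions chosen for the example.

```python
import json


def local_pv(name, path, node_name, size, storage_class_name):
    """Build a PersistentVolume of type local that is tied to one node."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class_name,
            "local": {"path": path},
            # The scheduler uses this to place pods that use the volume.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "kubernetes.io/hostname",
                                    "operator": "In",
                                    "values": [node_name],
                                }
                            ]
                        }
                    ]
                }
            },
        },
    }


pv = local_pv("local-demo", "/mnt/disks/disk1", "node-1", "10Gi", "local-storage")
print(json.dumps(pv, indent=2))
```

In contrast to the node selector approach, the node information lives on the volume here, so every pod that uses the volume is placed correctly without carrying its own selector.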
For single-node "clusters", hostPath volumes are usually also a good choice.
Clearly we need an automated way to provision volumes. Before you can automate anything, you need to be able to declare what should happen. Towards this goal there are three resources:
PersistentVolumeClaim (PVC), which is used to request a new volume of a specific type
StorageClass (SC), which defines a specific type of volume and holds information about the capabilities of the underlying volumes, e.g. whether a volume can be used by only one pod or by multiple pods
PersistentVolume (PV), which ties an actual volume to an existing PersistentVolumeClaim
The obvious abbreviations of these terms also work with kubectl, so you can always just type kubectl get pv, kubectl get sc and kubectl get pvc.
Among these resources, only PVCs are namespaced.
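As a rough sketch of how two of these resources look when declared, the following Python snippet builds a StorageClass and a PersistentVolumeClaim and prints them wrapped in a List object as JSON. All names and the requested size are invented for the example; kubernetes.io/no-provisioner is the provisioner value used for classes whose volumes are created by hand.

```python
import json


def storage_class(name, provisioner):
    """Build a StorageClass; it is cluster-scoped, so it has no namespace."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": provisioner,
        # Delay binding until a pod actually uses the claim.
        "volumeBindingMode": "WaitForFirstConsumer",
    }


def volume_claim(name, namespace, storage_class_name, size):
    """Build a PersistentVolumeClaim, the only namespaced resource of the three."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "accessModes": ["ReadWriteOnce"],  # read-write on a single node
            "storageClassName": storage_class_name,
            "resources": {"requests": {"storage": size}},
        },
    }


sc = storage_class("manual", "kubernetes.io/no-provisioner")
pvc = volume_claim("demo-claim", "default", "manual", "5Gi")
bundle = {"apiVersion": "v1", "kind": "List", "items": [sc, pvc]}
print(json.dumps(bundle, indent=2))
```

Note that only the claim carries a namespace in its metadata, which matches what kubectl get sc and kubectl get pvc show.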
So far not much is happening automatically. One can create a PVC for a pod with a specific StorageClass, and then nothing happens. You still have to provision the volume yourself and then create a PersistentVolume that is bound to the PVC.
It is a good exercise to do this once manually.
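For the manual route, a hand-made PersistentVolume can be reserved for one particular claim through its claimRef field. The sketch below assumes a hostPath-backed volume purely for simplicity; the names, the directory and the size are hypothetical, and the storage class name has to match the one used in the claim.

```python
import json


def prebound_pv(name, host_dir, size, storage_class_name, claim_namespace, claim_name):
    """Build a hand-made PersistentVolume reserved for one specific claim."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class_name,  # must match the claim
            "persistentVolumeReclaimPolicy": "Retain",
            # Reserve the volume for exactly this claim.
            "claimRef": {"namespace": claim_namespace, "name": claim_name},
            "hostPath": {"path": host_dir},
        },
    }


pv = prebound_pv("manual-pv", "/srv/manual-pv", "5Gi", "manual", "default", "demo-claim")
print(json.dumps(pv, indent=2))
```

After applying both the claim and this volume, kubectl get pv and kubectl get pvc should report them as bound to each other.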
So we now have the resources to declare which volumes we need and where we need them. Now we can talk automation.
For automatic provisioning of volumes, we need something running that can both manage the specific type of volume you want and ingest the PVC resources, so it knows which volumes to provision. Such an aptly named provisioner is usually just a pod or a collection of pods running in the cluster and doing exactly this.
In general you have a provisioner running for each StorageClass, as they all need to do very different things. This usually means a DaemonSet is running, so that attaching and detaching storage can be handled on each node, and then you also have additional pods running that contain the actual control loop ingesting PVCs.
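The control loop part can be illustrated with a toy model. The following Python sketch is not how any real provisioner is implemented; it is an in-memory stand-in with made-up class names that only shows the idea: look at all claims, skip the ones that belong to another StorageClass or are already bound, and create a volume for the rest.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Claim:
    name: str
    storage_class: str
    size_gib: int
    bound_to: Optional[str] = None  # name of the bound volume, if any


@dataclass
class Volume:
    name: str
    storage_class: str
    size_gib: int


@dataclass
class ToyProvisioner:
    """In-memory stand-in for a provisioner that serves one StorageClass."""

    storage_class: str
    volumes: list = field(default_factory=list)

    def reconcile(self, claims):
        """One pass of the control loop over all known claims."""
        for claim in claims:
            if claim.storage_class != self.storage_class:
                continue  # another provisioner is responsible
            if claim.bound_to is not None:
                continue  # already handled, which keeps the loop idempotent
            # A real provisioner would talk to the storage backend here.
            volume = Volume(f"pv-{claim.name}", self.storage_class, claim.size_gib)
            self.volumes.append(volume)
            claim.bound_to = volume.name
        return self.volumes


claims = [Claim("db-data", "fast", 5), Claim("logs", "slow", 20)]
provisioner = ToyProvisioner("fast")
provisioner.reconcile(claims)
provisioner.reconcile(claims)  # a second pass changes nothing
print([(claim.name, claim.bound_to) for claim in claims])
```

Running the loop twice on purpose shows the property that matters most for such controllers: repeating a pass must not create a second volume for the same claim.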
And this is pretty much all there is to it in abstracto.
Not all storage is created equal. What your provisioner can and cannot do is pretty much determined by what your storage can and cannot do.
Ceph is made for multi-tenancy and has many nice features, like auth tokens for each device and a nice API to create actual devices and file systems. Consequently, provisioning Ceph RBD volumes and CephFS volumes works like a charm.
All the Kubernetes magic makes deployments easy, but it does not make your underlying devices better. An NFS storage will still be an NFS storage.
Provisioners have to make some choices, and sometimes these choices are not even that good. For example, you can set up a localPath provisioner and configure it to select LVM devices, which you can then provision somewhat automatically by other means such as Ansible. If you then request a 5 GB volume but the only device currently available for provisioning is 1 TB, that device ends up as the underlying PV for the 5 GB PVC, because a claim is bound to a whole volume that is at least as large as the request. This is clearly somewhat unfortunate.
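The arithmetic behind this can be sketched in a few lines of Python. The model assumes that a claim gets the smallest pre-created volume that is at least as large as the request, and that a volume is never split between claims; sizes are given in GiB and the function names are made up for the example.

```python
def pick_volume(request_gib, available_gib):
    """Return the size of the smallest available volume that fits, or None."""
    candidates = [size for size in available_gib if size >= request_gib]
    return min(candidates) if candidates else None


def wasted_gib(request_gib, available_gib):
    """Capacity tied up beyond what the claim asked for, or None if nothing fits."""
    chosen = pick_volume(request_gib, available_gib)
    return None if chosen is None else chosen - request_gib


print(wasted_gib(5, [1024]))      # 1019: the whole 1 TiB device is used
print(wasted_gib(5, [1024, 10]))  # 5: the 10 GiB device is preferred
print(wasted_gib(5, [2]))         # None: nothing fits, the claim stays pending
```

So the result depends entirely on which devices happen to exist at the time of the request, which is why pre-creating devices in sensible sizes matters with this kind of provisioner.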
Also, it is important to remember that containers are just Linux processes. There is no virtualization happening, so the containers usually see the devices or folders as they present themselves to the underlying host system. In particular, the user your container runs as has to match the respective permissions. Depending on the storage and the flexibility of your application, this might mean you need a chown via initContainers or some other automated way, whereas in other cases annotations can handle such things for you automatically.
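The chown via initContainers variant could look roughly like the following Python sketch, which builds a Pod manifest where a root init container fixes the ownership of the mounted volume before the application starts as an unprivileged user. The pod name, claim name, image and the user and group IDs are placeholders.

```python
import json


def pod_with_chown(name, claim_name, uid, gid):
    """Build a Pod whose init container hands the volume to uid:gid."""
    mount = {"name": "data", "mountPath": "/data"}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "initContainers": [
                {
                    "name": "fix-permissions",
                    "image": "busybox:1.36",  # placeholder image
                    "command": ["chown", "-R", f"{uid}:{gid}", "/data"],
                    # chown needs root, the application itself does not.
                    "securityContext": {"runAsUser": 0},
                    "volumeMounts": [mount],
                }
            ],
            "containers": [
                {
                    "name": "app",
                    "image": "busybox:1.36",  # placeholder image
                    "command": ["sleep", "3600"],
                    "securityContext": {"runAsUser": uid, "runAsGroup": gid},
                    "volumeMounts": [mount],
                }
            ],
            "volumes": [
                {"name": "data", "persistentVolumeClaim": {"claimName": claim_name}}
            ],
        },
    }


pod = pod_with_chown("chown-demo", "demo-claim", 1000, 1000)
print(json.dumps(pod, indent=2))
```

For volume types that support it, setting fsGroup in the pod-level securityContext is a lighter alternative, since the ownership change then happens at mount time without an extra container.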
Of course you do not want to end up as one of those people who still have a lot of toil and no nice security concepts just because you wanted to reuse your organization's old network file system. You want volumes and devices created with separate user tokens and only on demand, so everything integrates nicely with Kubernetes.
It is definitely possible in finite time to manually set up a volume provisioner for Ceph, as described in this nice blogpost about Ceph. But running any storage different from what you have run before is difficult, and the more you delve into it, the more you might want additional capabilities which are too painful to roll out manually.
Luckily there is Rook Ceph, and it is even deployable with nice Helm charts.
If you already have a Ceph cluster running somewhere, or you would rather have the data in the PVCs be independent of Kubernetes, you can use Rook Ceph just as a very robust and comfortable way to interact with said cluster. Alternatively, Rook Ceph can manage running a Ceph cluster for you inside the Kubernetes cluster, provisioned from devices in your nodes. Of course this means you need to take more care of backups if there is anything worth backing up. Another possibility would be running a separate Kubernetes cluster just for Rook Ceph on your storage nodes and then using Rook on your main cluster to interact with it.
Such decisions mainly depend on the cluster lifecycle you envision, the importance of the data, the hardware available and the workloads you want to run.
More importantly, if you do not yet know how to get some storage attached to your on-prem cluster, Rook Ceph is not a bad choice, as Ceph has all the nice features you would expect from a modern storage system, while still being simple enough to be managed by a single mortal.
In the next post I will discuss monitoring, where Ceph integrates nicely.