Persistent storage with Rook-Ceph

Bare-metal Kubernetes Homelab

Series overview
  1. Bare-metal Kubernetes for your Homelab — An overview of Kubernetes, some motivation and common hardware choices for such projects.
  2. Installing Fedora CoreOS — Learn how to install Fedora CoreOS, a minimal, container-optimized Linux distribution, as the base operating system for your Kubernetes cluster with Butane and Ignition.
  3. Deploying bare-metal Kubernetes with Kubespray — How to use Kubespray, an automation tool based on Ansible, to deploy Kubernetes on your bare-metal setup, providing detailed steps for configuring and launching a multi-node cluster.
  4. Using Flux for GitOps — How to integrate Flux, a GitOps tool, into your Kubernetes cluster, enabling automated deployment from Git repositories and streamlining your CI/CD processes.
  5. Persistent storage with Rook-Ceph — You are here.
  6. Observability — Learn how to monitor and log your cluster's performance and health for better insights and troubleshooting.
    1. Metrics: Prometheus and Grafana — Using the prometheus-stack, we set up observability for the cluster, by scraping metrics and displaying them in Grafana Dashboards.
    2. Logging: Centralized Logging with ElasticSearch — The other part of observability: centralized logging. We’ll discover how to set up FluentBit to scrape logs from all parts of your cluster and use ElasticSearch and Kibana to analyze them.
  7. Cert-Manager for automatic TLS certificates — We use Cert-Manager to automatically create and rotate TLS certificates in our cluster that we acquire using ACME and Let's Encrypt.
  8. Ingress — Guide to configuring ingress for your cluster, using various tools to manage traffic routing and external access.
    1. Ingress with Traefik — Learn how to deploy Traefik as an ingress controller for dynamic routing and load balancing across your services.
    2. Ingress with cloudflared (Cloudflare Tunnels) — Discover how to use cloudflared (Cloudflare Tunnels) as a secure ingress solution that integrates easily with Cloudflare’s edge network for added protection and performance.
    3. Ingress with TailScale — Finally, we can also integrate our Kubernetes cluster into our tailscale network.
  9. Cloud-Native Postgres — This project allows you to manage PostgreSQL Databases in a cloud-native way.
  10. Service Meshes — Explore the benefits of service meshes for secure and efficient communication between your services.
    1. Service Meshes: linkerd — An introduction to Linkerd, a lightweight and secure service mesh for your Kubernetes cluster.
    2. Service Meshes: Istio — A detailed guide to Istio, a robust service mesh offering advanced traffic management and security features.
  11. Security — Here, we discuss how to harden the cluster.
    1. Automatic K8s cluster scanning with Trivy — Trivy can scan clusters for vulnerabilities and misconfigurations.
    2. Use Renovate Bot for GitOps with Flux — We can use renovate bot to automatically update our Flux Deployments.

In this article, we’re going to set up persistent storage for our bare-metal Kubernetes cluster using Rook-Ceph, which is one of those things that sounds terrifying until you realize the operator does most of the heavy lifting for you and you’re mostly just writing YAML (as is tradition).

How Kubernetes manages storage

Before we get into the specifics of our storage solution, let’s talk about how Kubernetes thinks about storage in general, because the abstraction layers here are important to understand and they’ll inform why we need what we need.

At the base of all, CSI is a standardized interface that Kubernetes uses to communicate with storage providers. Before CSI existed, every storage vendor had to write their code directly into the Kubernetes source tree (called “in-tree” drivers), which was a maintenance nightmare for everyone involved and meant that adding support for a new storage system required changes to Kubernetes itself. CSI decouples this by defining a gRPC-based API that any storage provider can implement as a separate plugin (a set of pods running in your cluster), and Kubernetes talks to that plugin whenever it needs to manage storage in the cluster.

With CSI in place, Kubernetes separates the provisioning of storage from the consumption of storage through three key concepts that build on each other:

A StorageClass defines how storage is dynamically provisioned. It specifies which CSI driver to use (e.g., the Ceph CSI driver, the AWS EBS driver, or the local path provisioner), along with parameters like replication factor, filesystem type, and reclaim policy. You can think of it as a template that says “when someone asks for storage of this type, here’s how to create it.”
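To make the "template" idea concrete, here is what a generic StorageClass looks like. This is purely illustrative: the provisioner name and parameters are placeholders for whatever vendor-specific driver you'd actually use (we'll define the real Ceph-backed classes later in this article).

```yaml
# Illustrative sketch only — provisioner and parameters are made-up placeholders
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: example.csi.vendor.com   # which CSI driver handles provisioning
parameters:                           # driver-specific knobs
  type: ssd
reclaimPolicy: Delete                 # what happens to the PV when the PVC goes away
allowVolumeExpansion: true            # allow growing PVCs after creation
```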

A PersistentVolume (PV) is a piece of storage that has been provisioned according to a StorageClass (or manually by an administrator). It represents the actual “disk” that was created, whether that’s a physical drive, a network-attached block device, or a chunk of a distributed storage system. PVs have a lifecycle independent of any pod that uses them, which is what makes them persistent rather than ephemeral, and the StorageClass’s CSI driver is what creates them by talking to the underlying storage system.

A PersistentVolumeClaim (PVC) is a request for storage by a pod. It specifies how much storage is needed, what access mode is required, and which StorageClass to use. When a PVC is created, Kubernetes asks the StorageClass’s CSI driver to provision a new PV matching the request, and then binds the PVC to that PV. This separation means that the person writing the application deployment doesn’t need to know the details of how storage is provisioned, they just say “I need 10Gi of block storage” and the cluster figures out the rest.¹

PVCs also specify an access mode, which determines how many pods can mount the volume and whether they can write to it:

  • ReadWriteOnce (RWO): The volume can be mounted as read-write on a single node (in practice, usually by a single pod). This is what you use for databases, application state directories, and anything where having two writers would cause corruption. This maps to block storage in the underlying system.
  • ReadWriteMany (RWX): The volume can be mounted as read-write by multiple pods across multiple nodes simultaneously. This requires a distributed filesystem that handles concurrent access and locking. This maps to filesystem storage (like CephFS or NFS).
  • ReadOnlyMany (ROX): Multiple pods can mount the volume read-only. Less commonly used, but useful for things like shared configuration or static assets.

There’s also a fourth pattern that doesn’t go through PVCs at all: object storage (S3-compatible API), where applications talk directly to a storage endpoint over HTTP rather than mounting a filesystem. This is what you’d use for backups, media files, or any application that natively speaks the S3 protocol.

The storage solution we choose needs to support at least RWO (for databases) and ideally RWX (for shared filesystems) as well as S3 (for object storage), which narrows our options considerably.

The storage problem on bare-metal

So far in the series we haven’t talked about where data actually lives yet. On a managed Kubernetes service like EKS or GKE, you get PersistentVolumes backed by the cloud provider’s block storage (EBS, Persistent Disks, and so on) essentially for free, because someone at AWS or Google already solved this problem and is charging you handsomely for the privilege. Their CSI drivers talk to the cloud API, and when your PVC asks for 10Gi of storage, a network-attached volume is created and handed to your pod without you ever thinking about the underlying hardware. On bare-metal, there’s no cloud API to call, which means we need to provide our own storage backend with its own CSI driver that can dynamically provision PVs when PVCs are created.

Without persistent storage, any data written inside a container evaporates the moment the pod restarts, which is fine for stateless workloads but becomes a rather significant problem the moment you want to run a database, a document management system, or anything that needs to remember things between restarts (which, if you think about it, is most useful software).

If you’re running a single-node cluster, you can get away with local storage, because there’s only one node for pods to land on. In a multi-node cluster, this won’t work: The whole point of running multiple nodes is that workloads can be scheduled on any of them, and can migrate between them when a node goes down for maintenance or crashes. If your data is stuck on one specific node, your pod can only ever run on that node, which defeats the entire purpose of having a distributed system in the first place. We need storage that’s accessible from any node in the cluster, which means we need something distributed.

The options roughly break down like this:

  • Rancher Local Path Provisioner : The simplest approach, where you just use a directory on the host node. This is what k3s ships with by default, and it works great for single-node clusters or workloads that are pinned to a specific node. But it ties your data to that one machine, so if the node dies or the pod gets rescheduled elsewhere, your data is either gone or inaccessible, which is not what we want.
  • NFS: Works and is battle-tested, but is a single point of failure unless you set up HA NFS, which is its own multi-day adventure that I’d rather not embark on.
  • Distributed storage: Replicates data across multiple nodes so you can lose an entire machine without losing data. More complex to set up, but gives you the redundancy that makes sleeping at night possible.

We’re going with option three, for the reasons outlined above. If you’re already running a NAS, then I’d consider using NFS and skipping the rest of the article.

Let’s consider options for storage in a homelab scenario:

  • MinIO was once the go-to S3-compatible object storage system in the Kubernetes world, and it performed very well for object storage workloads. I say “was” because in February 2026, MinIO archived their GitHub repository with a README that simply said “THIS REPOSITORY IS NO LONGER MAINTAINED” and pointed users toward their commercial product. This was the final step in a long slide that started with the license change from Apache 2.0 to AGPL v3 in 2021, continued with removing features from the community edition (including the web console and precompiled binaries), and ended with the project being fully abandoned as open source. A community fork exists now under the Pigsty project, but building infrastructure on top of a project whose corporate steward actively tried to kill the open-source version is not something I’m interested in doing. Beyond the licensing disaster, MinIO only provides object storage (S3-compatible API), which means you can’t use it for block storage (ReadWriteOnce PVCs) or shared filesystem storage (ReadWriteMany), and those are exactly what most Kubernetes workloads need for things like databases and application data.
  • Garage is a lightweight, self-hosted distributed storage system that provides an S3-compatible API and is designed specifically for small-scale deployments and homelabs. It’s written in Rust, has minimal resource requirements, and the homelab community on Reddit seems to genuinely love it. I haven’t used it myself, so I can’t speak to the operational experience, but the documentation is refreshingly straightforward and the project has a clear philosophy of simplicity over features. That said, like MinIO, Garage only provides object storage (S3 API), so it can’t serve as a general-purpose storage backend for Kubernetes workloads that need block devices or POSIX filesystems. If your workloads are all S3-native (like backups, media storage, or data lakes), Garage might be worth a look.
  • Longhorn is a CNCF project by Rancher/SUSE that provides distributed block storage for Kubernetes. I haven’t used it personally, but from what I’ve read it’s simpler to set up than Ceph and has a nice web UI. The tradeoff is that it only provides block storage (no filesystem or object storage), and reports from the community suggest it performs worse than Ceph on NVMe drives due to its architecture (it replicates at the block level using iSCSI, which adds overhead compared to Ceph’s native RADOS protocol). If you want something potentially simpler than Ceph and only need block storage, it might be worth investigating.
  • Rook-Ceph is a Kubernetes operator that deploys and manages Ceph, a distributed storage system that has been around since 2006 and powers some of the largest storage clusters in the world (CERN runs a Ceph cluster measured in exabytes). Ceph provides all three storage interfaces from a single system: block storage (RWO via RBD), filesystem storage (RWX via CephFS), and object storage (S3-compatible via RGW). The tradeoff is operational complexity: Ceph has more moving parts than any of the other options (monitors, managers, OSDs, optionally metadata servers), and when something goes wrong the debugging experience involves concepts like placement groups and CRUSH maps that have a learning curve. That said, Rook abstracts away most of the day-to-day operations, and once it’s running it tends to just work. This is what we’re going with.

Unlike ZFS, which protects against disk failures within a single machine (via mirrors or RAIDZ) but is fundamentally a single-node solution, Ceph protects against entire machine failures by replicating data across multiple independent nodes. With a replication factor of 3 across 3 nodes, you can lose an entire server (motherboard, PSU, all drives, everything) and your data is still fully available on the remaining two nodes with zero downtime, while the cluster automatically re-replicates to restore redundancy. The tradeoff is operational complexity and resource overhead (you need at least 3 nodes, each running monitor and OSD daemons), while ZFS on a single server is simpler and gives you more usable storage per dollar. For a homelab with a single server, ZFS is probably the better choice, but the moment you have multiple nodes and want storage that survives node failures without manual intervention, Ceph’s architecture makes more sense.

A common question is what hardware you actually need. Ceph is surprisingly flexible here: you can mix different drive sizes in the same cluster because CRUSH assigns weight to each OSD proportional to its capacity, so a 12TB drive will simply store more data than a 2TB drive on the same node. That said, having wildly different sizes means your replication won’t be perfectly balanced across nodes, and if your smallest drive fills up while your largest is half-empty you’ll get health warnings, so keeping drives reasonably similar in size within a node makes life easier. For compute, Red Hat recommends a baseline of 16GB RAM per OSD host plus 5GB per OSD daemon (so a node with 4 NVMe drives wants at least 36GB), and each OSD will happily consume a CPU core or two during recovery operations. Network bandwidth matters more than you’d expect: 1Gbps works but becomes the bottleneck quickly during rebalancing or recovery, and 10Gbps is strongly recommended if you’re running NVMe drives (otherwise the network becomes the slowest link and your expensive SSDs sit idle waiting for packets). You also want your OSD drives to be dedicated to Ceph and separate from your OS drive, because Ceph will consume all available IOPS and you don’t want that competing with your kubelet writing logs.

Homelab Jank

Clearly, our homelab doesn’t fulfill Ceph’s hardware recommendations at all. The mini PCs I recommended at the beginning of the series are, strictly speaking, not suitable for Ceph. However, this is the kind of jank I’m willing to tolerate in my personal lab environment. Sure, performance might not be optimal, but also: Who cares? I’m not scaling to millions of users or storing exabytes of data.

A Ceph crash course

Now that we’ve established why Ceph and what the alternatives are, let’s understand the internals, because blindly applying YAML without understanding the system underneath is how you end up debugging storage issues at 3 AM with no idea what a “placement group” is.

Ceph stores everything as objects in a flat namespace within RADOS (Reliable Autonomic Distributed Object Store), and uses the CRUSH algorithm (Controlled Replication Under Scalable Hashing) to determine where each piece of data should live. The beauty of CRUSH is that clients can calculate the location of any object without asking a central authority, which means there’s no single bottleneck for data access and no single point of failure in the data path.

The key components you need to know about are:

  • Monitors (MON): These maintain the cluster map, which is essentially the source of truth about the cluster topology. You want an odd number (typically 3) so they can form a quorum and agree on the state of the world even if one of them goes down.
  • Managers (MGR): These provide monitoring, orchestration, and plugin interfaces like the dashboard. They’re the “nice to have” management layer on top of the core storage system.
  • OSDs (Object Storage Daemons): These are the workhorses that actually store your data. Each OSD manages a physical disk or partition, and they handle replication, recovery, and rebalancing amongst themselves.
  • MDS (Metadata Servers): Only needed if you use CephFS (the filesystem interface). They manage the filesystem metadata (directory structure, permissions, etc.) while the actual file data still lives on OSDs.

Data flows through the system like this: a client wants to write a block, it uses the CRUSH algorithm to determine which placement group the block belongs to, the placement group maps to a set of OSDs (determined by the replication factor), and the primary OSD coordinates the write to all replicas before acknowledging success to the client.

So what’s Rook?

Rook is a Kubernetes operator that deploys and manages Ceph on Kubernetes. Instead of manually configuring Ceph daemons, SSH-ing into nodes, and running ceph-deploy commands (which is how people used to do it and which is about as fun as it sounds), you describe your desired cluster state in a CephCluster custom resource, and Rook takes care of deploying monitors, managers, OSDs, handling upgrades, and generally keeping the cluster healthy.

Installing the Rook operator

As with everything in our GitOps setup, we’re going to deploy Rook using Flux. If you haven’t read the Flux article yet, now would be a good time, because the pattern we’re about to use (namespace, HelmRepository, ConfigMap for values, HelmRelease) is the same one we’ll use for every piece of infrastructure going forward.

Create a file infrastructure/controllers/rook-ceph.yml:

apiVersion: v1
kind: Namespace
metadata:
  name: rook-ceph
---
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: rook-release
  namespace: flux-system
spec:
  interval: 12h
  url: https://charts.rook.io/release
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-helm-chart-value-overrides
  namespace: rook-ceph
data:
  values.yaml: |-
    # <upstream values go here>
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  chart:
    spec:
      chart: rook-ceph
      version: 1.18.x
      sourceRef:
        kind: HelmRepository
        name: rook-release
        namespace: flux-system
  interval: 15m
  timeout: 5m
  releaseName: rook-ceph
  install:
    remediation:
      retries: 3
  upgrade:
    remediation:
      retries: -1 # keep trying to remediate
    crds: CreateReplace # Upgrade CRDs on package update
  valuesFrom:
    - kind: ConfigMap
      name: rook-ceph-helm-chart-value-overrides
      valuesKey: values.yaml

We pin to 1.18.x so that patch versions are automatically applied but we don’t accidentally jump major versions during a reconciliation (which would be the kind of surprise that ruins your weekend). The crds: CreateReplace setting ensures that CRDs are updated when the chart is upgraded, which is important because Rook relies heavily on CRDs for its custom resources like CephCluster, CephBlockPool, and CephFilesystem. The remediation settings make the installation resilient: it retries three times on initial install, and on upgrades it keeps trying indefinitely (-1) because a failed Ceph upgrade is not something you want to leave in a half-applied state.

Don’t forget to add rook-ceph.yml to your infrastructure/controllers/kustomization.yml.
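Assuming the repository layout from the Flux article, the entry is a one-line addition (the other resource names here are placeholders for whatever controllers you already deploy):

```yaml
# infrastructure/controllers/kustomization.yml — sketch, adjust to your layout
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - rook-ceph.yml
  # - other-controller.yml  # your existing entries stay as they are
```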

Configuring the CephCluster

On its own, the operator doesn’t do much: it just sits there watching for CephCluster resources and waiting patiently for you to tell it what to do. Now we need to describe what our cluster should look like. Create infrastructure/configs/ceph-cluster.yml:

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  cephVersion:
    image: quay.io/ceph/ceph:v18.2.4
    allowUnsupported: false
  dataDirHostPath: /var/lib/rook
  mon:
    count: 3
    allowMultiplePerNode: false
  mgr:
    count: 2
    allowMultiplePerNode: false
    modules:
      - name: rook
        enabled: true
  dashboard:
    enabled: true
    ssl: false
  storage:
    useAllNodes: true
    useAllDevices: true
    deviceFilter: "nvme0n1"

I’ve trimmed this to the essentials (the full version in my repository is about 200 lines of mostly comments from the upstream example), but let’s go through what each section does:

cephVersion.image: quay.io/ceph/ceph:v18.2.4 pins us to Ceph Reef (v18), a stable release. The Rook documentation lists which Ceph versions are supported by which Rook versions, and you should check this before upgrading either component.

dataDirHostPath: /var/lib/rook is where Ceph stores its configuration and metadata on the host filesystem. This is important to remember because if you ever need to completely tear down and recreate the cluster, you’ll need to delete this directory on each node, otherwise the monitors will refuse to start because they’ll find stale data from the old cluster and get confused.

mon.count: 3 gives us three monitors for quorum . With three nodes in our cluster, each gets one monitor, and we can tolerate one node going down without losing quorum (which would make the entire cluster read-only until quorum is restored).

mgr.count: 2 runs two managers for high availability, one active and one in standby, so that if the active manager crashes, the standby takes over immediately without any gap in monitoring or dashboard availability.

storage.deviceFilter: "nvme0n1" is the critical setting that tells Rook which disks to use as OSDs. In my case, each node has an NVMe drive at nvme0n1 that’s dedicated to Ceph storage. Rook will format these drives and take full ownership of them, so please triple-check that this filter matches the correct drives on your nodes and not, say, your OS drive (ask me how I know this is important).

Data loss warning

The deviceFilter setting will cause Rook to format and take ownership of the matching devices. If you point it at the wrong drive, you will have a very bad day. Run lsblk on each node to verify which device is which before deploying this.

For the full list of options available in the CephCluster CRD, check the Rook documentation, which goes into detail about network configuration, placement constraints, resource limits, disruption management, and all the other knobs you can turn.

Creating StorageClasses

With the cluster running, we need StorageClasses so that pods can actually request storage through PersistentVolumeClaims. We’ll create two: one for block storage (RBD) and one for filesystem storage (CephFS), because they serve different use cases and you’ll want both available.

Block storage (RBD)

RBD (RADOS Block Device) provides a raw block device that gets mounted into your pod, which you can think of as attaching a virtual hard drive. This is what you’ll use for most workloads: databases, application data, anything that needs a regular filesystem mounted at a path.

Create infrastructure/configs/ceph-block-storage.yml:

apiVersion: ceph.rook.io/v1
kind: CephBlockPool
metadata:
  name: replicapool
  namespace: rook-ceph
spec:
  failureDomain: host
  replicated:
    size: 3
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
allowVolumeExpansion: true

The CephBlockPool defines how data is replicated. Setting failureDomain: host means that the three replicas (from replicated.size: 3) are guaranteed to land on different hosts rather than just different OSDs on the same host, which means you can lose an entire node without losing a single byte of data.

The imageFeatures field enables several RBD image features that improve performance: fast-diff and object-map speed up operations like snapshots and exports by tracking which objects have changed, while exclusive-lock ensures only one client can write to the image at a time (which is what you want for block devices that are mounted ReadWriteOnce).

Setting allowVolumeExpansion: true means you can resize PVCs later without recreating them, which is one of those things you’ll be very grateful for the first time a database outgrows its initial allocation.

The reclaimPolicy controls what happens to the underlying storage when a PVC is deleted. Delete means the PV and its data are destroyed when the PVC is removed, which keeps things tidy but means accidental kubectl delete pvc is genuinely destructive. The alternative is Retain, which keeps the PV and its data around as an “orphan” that an administrator must manually clean up or re-attach, which is safer but means you’ll accumulate abandoned volumes if you’re not careful. For a homelab where I’d rather not have storage slowly filling up with forgotten volumes, Delete is the right default, but if you’re running something irreplaceable you might want a Retain-based StorageClass as well.
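If you do want that extra safety for specific workloads, a second StorageClass that differs only in its name and reclaim policy does the trick. The rook-ceph-block-retain name below is my own invention; everything else mirrors the class we just defined:

```yaml
# Hypothetical Retain variant of the block StorageClass above
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block-retain
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering,fast-diff,object-map,deep-flatten,exclusive-lock
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Retain          # PV and data survive PVC deletion
allowVolumeExpansion: true
```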

The csi.storage.k8s.io/* annotations in the parameters section tell the CSI driver which Kubernetes Secrets contain the credentials it needs to talk to Ceph. The provisioner-secret is used when creating or deleting volumes (the CSI controller needs to authenticate with the Ceph cluster to allocate storage), the node-stage-secret is used when mounting volumes on a node (the kubelet needs credentials to map the RBD image on the host), and the controller-expand-secret is used when resizing volumes. These secrets are created automatically by Rook when the cluster is deployed, so you don’t need to manage them yourself, but you do need to reference them correctly in the StorageClass or volume operations will fail with authentication errors.

Filesystem storage (CephFS)

CephFS provides a POSIX-compliant filesystem that can be mounted by multiple pods simultaneously (ReadWriteMany), which is useful for shared storage scenarios where multiple pods need to read and write the same files. Create infrastructure/configs/ceph-filesystem-storage.yml:

apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: ceph-filesystem
  namespace: rook-ceph
spec:
  metadataPool:
    replicated:
      size: 3
  dataPools:
    - name: replicated
      replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: ceph-filesystem
  pool: ceph-filesystem-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete

The CephFilesystem resource creates a filesystem with separate metadata and data pools (both replicated 3x), and deploys a Metadata Server with one active instance and one standby for failover. The preserveFilesystemOnDelete: true setting is a safety net that prevents the underlying data from being destroyed if you accidentally delete the CephFilesystem resource, which is the kind of guardrail that exists because someone, somewhere, learned this lesson the hard way.
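A PVC requesting shared storage from this class looks almost identical to a block PVC, except for the access mode and class name (the shared-data name here is just an example):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data        # hypothetical name
spec:
  storageClassName: rook-cephfs
  accessModes:
    - ReadWriteMany        # multiple pods, across nodes, read-write
  resources:
    requests:
      storage: 5Gi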

Using the storage

Now that we have our StorageClasses registered, any pod in the cluster can request persistent storage by creating a PersistentVolumeClaim that references one of them:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data
spec:
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Kubernetes will dynamically provision a 10Gi block volume backed by Ceph, replicated three times across your nodes, and you can mount it in your pod like any other volume. We’ll see this in action when we deploy stateful workloads, like PostgreSQL, that need persistent data.
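For completeness, mounting the claim in a pod is standard Kubernetes: you reference the PVC by name in the pod’s volumes and mount it into the container. The pod name, image, and mount path below are arbitrary examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo                     # hypothetical pod
spec:
  containers:
    - name: app
      image: nginx               # placeholder image
      volumeMounts:
        - name: data
          mountPath: /var/lib/data   # where the Ceph-backed volume appears
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: my-data       # the PVC defined above
```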

The Ceph Dashboard

Remember that we enabled the dashboard earlier? It gives you a web UI to monitor the health of your Ceph cluster, including OSD status, pool usage, IOPS, and recovery progress. We’ll set up proper ingress for it in a later article when we cover Traefik and cert-manager, but for now you can port-forward to it:

$ kubectl -n rook-ceph port-forward svc/rook-ceph-mgr-dashboard 7000:7000

To get the admin password (which Rook generates automatically and stores in a secret):

$ kubectl -n rook-ceph get secret rook-ceph-dashboard-password \
    -o jsonpath="{['data']['password']}" | base64 --decode

Verifying the cluster

After deploying everything, give it a few minutes to converge (Ceph needs to elect monitors, start OSDs, and create placement groups, which involves a fair amount of distributed consensus that takes time). Then check the status:

$ kubectl -n rook-ceph get cephcluster
NAME        DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH
rook-ceph   /var/lib/rook     3          10m   Ready   Cluster created successfully   HEALTH_OK

$ kubectl -n rook-ceph get pods
# You should see mon, mgr, osd, and csi pods running

If HEALTH shows HEALTH_OK, you have a working distributed storage cluster. If you see HEALTH_WARN, common causes are not having enough OSDs for the configured replication factor, or the cluster still rebalancing data after initial deployment. You can get more detail by running ceph status inside the Rook toolbox pod, which gives you direct access to the Ceph CLI.

Summary

We now have a fully functional distributed storage system running on our bare-metal cluster, with block storage (rook-ceph-block) for single-pod volumes like databases, filesystem storage (rook-cephfs) for shared volumes that multiple pods can mount simultaneously, three-way replication across nodes for fault tolerance, and a dashboard for monitoring cluster health. In the next articles, we’ll put this storage to use when we deploy databases and applications that need persistent data.

Footnotes

  1. A word on resizing: if you realize later that 10Gi wasn’t enough, you may be able to expand the volume by editing the PVC’s spec.resources.requests.storage field to a larger value, provided the StorageClass has allowVolumeExpansion: true set and the CSI driver supports it (Ceph’s does). Shrinking a volume, however, is never supported by Kubernetes, because truncating a filesystem that might have data written near the end is a recipe for corruption that no one wants to automate. If you over-provisioned and want the space back, you’ll need to create a new smaller volume, migrate the data, and delete the old one.
