Kubernetes News

The Kubernetes project blog
  1. Author: Chris Seto (Cockroach Labs)

    As long as you're willing to follow the rules, deploying on Kubernetes and air travel can be quite pleasant. More often than not, things will "just work". However, if one is interested in travelling with an alligator that must remain alive or scaling a database that must remain available, the situation is likely to become a bit more complicated. It may even be easier to build one's own plane or database for that matter. Travelling with reptiles aside, scaling a highly available stateful system is no trivial task.

    Scaling any system has two main components:

    1. Adding or removing infrastructure that the system will run on, and
    2. Ensuring that the system knows how to handle additional instances of itself being added and removed.

    Most stateless systems, web servers for example, are created without the need to be aware of peers. Stateful systems, which include databases like CockroachDB, have to coordinate with their peer instances and shuffle around data. As luck would have it, CockroachDB handles data redistribution and replication. The tricky part is tolerating failures during these operations by ensuring that data and instances are distributed across many failure domains (availability zones).

    One of Kubernetes' responsibilities is to place "resources" (e.g., a disk or container) into the cluster and satisfy the constraints they request. For example: "I must be in availability zone A" (see Running in multiple zones), or "I can't be placed onto the same node as this other Pod" (see Affinity and anti-affinity).

    As an addition to those constraints, Kubernetes offers StatefulSets, which provide identity to Pods as well as persistent storage that "follows" these identified pods. Identity in a StatefulSet is handled by an increasing integer at the end of a pod's name. It's important to note that these integers must always be contiguous: in a StatefulSet, if pods 1 and 3 exist, then pod 2 must also exist.
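
    As a rough sketch, the node-level anti-affinity described above can be declared in a StatefulSet's pod template along these lines (a hedged illustration, not CockroachCloud's actual manifest; the app label is made up):

    affinity:
      podAntiAffinity:
        # Forbid two CockroachDB pods from sharing a Kubernetes node.
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cockroachdb   # illustrative label
          topologyKey: kubernetes.io/hostname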

    Under the hood, CockroachCloud deploys each region of CockroachDB as a StatefulSet in its own Kubernetes cluster - see Orchestrate CockroachDB in a Single Kubernetes Cluster. In this article, I'll be looking at an individual region, one StatefulSet and one Kubernetes cluster which is distributed across at least three availability zones.

    A three-node CockroachCloud cluster would look something like this:

    3-node, multi-zone cockroachdb cluster

    When adding additional resources to the cluster we also distribute them across zones. For the speediest user experience, we add all Kubernetes nodes at the same time and then scale up the StatefulSet.

    illustration of phases: adding Kubernetes nodes to the multi-zone cockroachdb cluster

    Note that anti-affinities are satisfied no matter the order in which pods are assigned to Kubernetes nodes. In the example, pods 0, 1 and 2 were assigned to zones A, B, and C respectively, but pods 3 and 4 were assigned in a different order, to zones B and A respectively. The anti-affinity is still satisfied because the pods are still placed in different zones.

    To remove resources from a cluster, we perform these operations in reverse order.

    We first scale down the StatefulSet and then remove from the cluster any nodes lacking a CockroachDB pod.

    illustration of phases: scaling down pods in a multi-zone cockroachdb cluster in Kubernetes

    Now, remember that pods in a StatefulSet of size n must have ordinals in the range [0,n). When scaling down a StatefulSet by m, Kubernetes removes m pods, starting from the highest ordinals and moving towards the lowest - the reverse of the order in which they were added. Consider the cluster topology below:

    illustration: cockroachdb cluster: 6 nodes distributed across 3 availability zones

    As ordinals 5 through 3 are removed from this cluster, the StatefulSet continues to have a presence across all 3 availability zones.

    illustration: removing 3 nodes from a 6-node, 3-zone cockroachdb cluster

    However, Kubernetes' scheduler doesn't guarantee the placement above, as we first expected.

    Our combined knowledge of Kubernetes' zone spreading and StatefulSets' ordered pod removal is what led to this misconception.

    Consider the following topology:

    illustration: 6-node cockroachdb cluster distributed across 3 availability zones

    These pods were created in order and they are spread across all availability zones in the cluster. When ordinals 5 through 3 are terminated, this cluster will lose its presence in zone C!

    illustration: terminating 3 nodes in 6-node cluster spread across 3 availability zones, where 2/2 nodes in the same availability zone are terminated, knocking out that AZ

    Worse yet, our automation, at the time, would remove Nodes A-2, B-2, and C-2, leaving CRDB-1 in an unschedulable state, since persistent volumes are only available in the zone in which they were initially created.

    To correct the latter issue, we now employ a "hunt and peck" approach to removing machines from a cluster. Rather than blindly removing Kubernetes nodes from the cluster, only nodes without a CockroachDB pod are removed. The much more daunting task was to wrangle the Kubernetes scheduler.

    A session of brainstorming left us with 3 options:

    1. Upgrade to Kubernetes 1.18 and make use of Pod Topology Spread Constraints

    While this seems like it could have been the perfect solution, at the time of writing Kubernetes 1.18 was unavailable on the two most common managed Kubernetes services in public cloud, EKS and GKE. Furthermore, pod topology spread constraints were still a beta feature in 1.18 which meant that it wasn't guaranteed to be available in managed clusters even when v1.18 became available. The entire endeavour was concerningly reminiscent of checking caniuse.com when Internet Explorer 8 was still around.
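
    For reference, the kind of constraint we would have used looks roughly like this in a pod spec (a sketch against the 1.18 API; the label is illustrative):

    topologySpreadConstraints:
    - maxSkew: 1                                # zones may differ by at most one pod
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: cockroachdb                      # illustrative label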

    2. Deploy a StatefulSet per zone.

    Rather than having one StatefulSet distributed across all availability zones, a StatefulSet per zone, each pinned to its zone with a node affinity, would allow manual control over our zonal topology. Our team had considered this as an option in the past, which made it particularly appealing. Ultimately, we decided to forego this option as it would have required a massive overhaul to our codebase, and performing the migration on existing customer clusters would have been an equally large undertaking.

    3. Write a custom Kubernetes scheduler.

    Thanks to an example from Kelsey Hightower and a blog post from Banzai Cloud, we decided to dive in head first and write our own custom Kubernetes scheduler. Once our proof-of-concept was deployed and running, we quickly discovered that the Kubernetes scheduler is also responsible for mapping persistent volumes to the Pods that it schedules. The output of kubectl get events had led us to believe there was another system at play. In our journey to find the component responsible for storage claim mapping, we discovered the kube-scheduler plugin system. Our next POC was a Filter plugin that determined the appropriate availability zone by pod ordinal, and it worked flawlessly!

    Our custom scheduler plugin is open source and runs in all of our CockroachCloud clusters. Having control over how our StatefulSet pods are being scheduled has let us scale out with confidence. We may look into retiring our plugin once pod topology spread constraints are available in GKE and EKS, but the maintenance overhead has been surprisingly low. Better still: the plugin's implementation is orthogonal to our business logic. Deploying it, or retiring it for that matter, is as simple as changing the schedulerName field in our StatefulSet definitions.
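
    For illustration, assuming a hypothetical scheduler registered as crdb-scheduler, the swap amounts to a single field in the StatefulSet's pod template:

    spec:
      template:
        spec:
          schedulerName: crdb-scheduler   # hypothetical name; omit to use "default-scheduler"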


    Chris Seto is a software engineer at Cockroach Labs and works on Kubernetes automation for CockroachCloud, the managed CockroachDB service.

  2. Author: Shihang Zhang (Google)

    Typically, when a CSI driver mounts credentials such as secrets and certificates, it has to authenticate against storage providers to access them. However, access to those credentials is controlled on the basis of the pods' identities rather than the CSI driver's identity. CSI drivers therefore need some way to retrieve a pod's service account token.

    Currently, there are two suboptimal approaches to achieve this: either granting CSI drivers permission to use the TokenRequest API, or reading tokens directly from the host filesystem.

    Both of them exhibit the following drawbacks:

    • Violating the principle of least privilege
    • Every CSI driver needs to re-implement the logic of getting the pod’s service account token

    The second approach is more problematic due to:

    • The audience of the token defaults to the kube-apiserver
    • The token is not guaranteed to be available (e.g. AutomountServiceAccountToken=false)
    • The approach does not work for CSI drivers that run as a different (non-root) user from the pods. See the file permission section for service account tokens.
    • The token might be legacy Kubernetes service account token which doesn’t expire if BoundServiceAccountTokenVolume=false

    Kubernetes 1.20 introduces an alpha feature, CSIServiceAccountToken, to improve the security posture. The new feature allows CSI drivers to receive pods' bound service account tokens.

    This feature also provides a knob to re-publish volumes so that short-lived volumes can be refreshed.

    Pod Impersonation

    Using GCP APIs

    Using Workload Identity, a Kubernetes service account can authenticate as a Google service account when accessing Google Cloud APIs. If a CSI driver needs to access GCP APIs on behalf of the pods that it is mounting volumes for, it can exchange the pod's service account token for GCP tokens. The pod's service account token is plumbed through the volume context in NodePublishVolume RPC calls when the feature CSIServiceAccountToken is enabled. For example: accessing Google Secret Manager via a Secrets Store CSI driver.
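
    Conceptually, the driver then sees the token in the volume context under the csi.storage.k8s.io/serviceAccount.tokens key, keyed by audience; the shape below is an illustrative sketch with values elided:

    volume_context:
      csi.storage.k8s.io/serviceAccount.tokens: |
        {"<audience>": {"token": "<JWT>", "expirationTimestamp": "<RFC 3339 time>"}}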

    Using Vault

    If users configure Kubernetes as an auth method, Vault uses the TokenReview API to validate the Kubernetes service account token. CSI drivers that use Vault as their secrets provider need to present the pod's service account token to Vault; examples include the Secrets Store CSI driver and the cert-manager CSI driver.

    Short-lived Volumes

    To keep short-lived volumes such as certificates effective, CSI drivers can specify RequiresRepublish=true in their CSIDriver object to have the kubelet periodically call NodePublishVolume on mounted volumes. These republishes allow CSI drivers to ensure that the volume content is up-to-date.
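
    Both knobs are declared on the driver's CSIDriver object. A minimal sketch, assuming a hypothetical driver name and a "vault" token audience:

    apiVersion: storage.k8s.io/v1
    kind: CSIDriver
    metadata:
      name: secrets.csi.example.com   # hypothetical driver name
    spec:
      tokenRequests:                  # have the kubelet pass the pod's bound token
      - audience: "vault"
        expirationSeconds: 3600
      requiresRepublish: true         # kubelet periodically re-calls NodePublishVolume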

    Next steps

    This feature is alpha and projected to move to beta in 1.21. See the KEP and the CSI documentation for more details.

    Your feedback is always welcome!

  3. Authors: Renaud Gaubert (NVIDIA), David Ashpole (Google), and Pramod Ramarao (NVIDIA)

    With Kubernetes 1.20, infrastructure teams who manage large-scale Kubernetes clusters are seeing the graduation of two exciting and long-awaited features:

    • The Pod Resources API (introduced in 1.13) is finally graduating to GA. This allows Kubernetes plugins to obtain information about the node’s resource usage and assignment; for example: which pod/container consumes which device.
    • The DisableAcceleratorMetrics feature (introduced in 1.19) is graduating to beta and will be enabled by default. This removes device metrics reported by the kubelet in favor of the new plugin architecture.

    Many of the features related to fundamental device support (device discovery, plugin, and monitoring) are reaching a strong level of stability. Kubernetes users should see these features as stepping stones to enable more complex use cases (networking, scheduling, storage, etc.)!

    One such example is Non Uniform Memory Access (NUMA) placement where, when selecting a device, an application typically wants to ensure that data transfer between CPU Memory and Device Memory is as fast as possible. In some cases, incorrect NUMA placement can nullify the benefit of offloading compute to an external device.

    If these are topics of interest to you, consider joining the Kubernetes Node Special Interest Group (SIG) for all topics related to the Kubernetes node, the COD (container orchestrated device) workgroup for topics related to runtimes, or the resource management forum for topics related to resource management!

    The Pod Resources API - Why does it need to exist?

    Kubernetes is a vendor neutral platform. If we want it to support device monitoring, adding vendor-specific code in the Kubernetes code base is not an ideal solution. Ultimately, devices are a domain where deep expertise is needed and the best people to add and maintain code in that area are the device vendors themselves.

    The Pod Resources API was built as a solution to this issue. Each vendor can build and maintain their own out-of-tree monitoring plugin. This monitoring plugin, often deployed as a separate pod within a cluster, can then associate the metrics a device emits with the associated pod that's using it.

    For example, use the NVIDIA GPU dcgm-exporter to scrape metrics in Prometheus format:

    $ curl -sL http://127.0.0.1:8080/metrics
    # HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
    # TYPE DCGM_FI_DEV_SM_CLOCK gauge
    # HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
    # TYPE DCGM_FI_DEV_MEM_CLOCK gauge
    # HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
    # TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
    ...
    DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 139
    DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 405
    DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 9223372036854775794
    

    Each agent is expected to adhere to the node monitoring guidelines. In other words, plugins are expected to generate metrics in Prometheus format, and new metrics should not have any dependency on the Kubernetes code base directly.

    This allows consumers of the metrics to use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even if they are maintained by different vendors.

    Device metrics flowchart

    Disabling the NVIDIA GPU metrics - Warning

    With the graduation of the plugin monitoring system, Kubernetes is deprecating the NVIDIA GPU metrics that are being reported by the kubelet.

    With the DisableAcceleratorMetrics feature being enabled by default in Kubernetes 1.20, NVIDIA GPUs are no longer special citizens in Kubernetes. This is a good thing in the spirit of being vendor-neutral, and enables the most suited people to maintain their plugin on their own release schedule!

    Users will now need to either install the NVIDIA DCGM exporter or use bindings to gather more accurate and complete metrics about NVIDIA GPUs. This deprecation means that you can no longer rely on metrics that were reported by the kubelet, such as container_accelerator_duty_cycle or container_accelerator_memory_used_bytes, which were used to gather NVIDIA GPU memory utilization.

    This means that users who used to rely on the NVIDIA GPU metrics reported by the kubelet will need to update their references and deploy the NVIDIA plugin. Specifically, the metrics reported by Kubernetes map to the following dcgm-exporter metrics:

    • container_accelerator_duty_cycle → DCGM_FI_DEV_GPU_UTIL
    • container_accelerator_memory_used_bytes → DCGM_FI_DEV_FB_USED
    • container_accelerator_memory_total_bytes → DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED

    You might also be interested in other metrics such as DCGM_FI_DEV_GPU_TEMP (the GPU temperature) or DCGM_FI_DEV_POWER_USAGE (the power usage). The default set is available in NVIDIA's Data Center GPU Manager documentation.

    Note that for this release you can still set the DisableAcceleratorMetrics feature gate to false, effectively re-enabling the ability for the kubelet to report NVIDIA GPU metrics.
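
    For example, via the kubelet configuration file (a sketch; the same gate can also be set with the kubelet's --feature-gates command-line flag):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    featureGates:
      DisableAcceleratorMetrics: false   # keep kubelet-reported GPU metrics for now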

    Paired with the graduation of the Pod Resources API, these tools can be used to generate GPU telemetry that feeds visualization dashboards. Below is an example:

    Grafana visualization of device metrics

    The Pod Resources API - What can I go on to do with this?

    As soon as this interface was introduced, many vendors started using it for widely different use cases! To list a few examples:

    One example is the kuryr-kubernetes CNI plugin working in tandem with the intel-sriov-device-plugin. This allowed the CNI plugin to know which SR-IOV Virtual Functions (VFs) the kubelet allocated and use that information to correctly set up the container network namespace and use a device with the appropriate NUMA node. We also expect this interface to be used to track the allocated and available resources with information about the NUMA topology of the worker node.

    Another use-case is GPU telemetry, where GPU metrics can be associated with the containers and pods that the GPU is assigned to. One such example is the NVIDIA dcgm-exporter, but others can be easily built in the same paradigm.

    The Pod Resources API is a simple gRPC service which informs clients about the pods the kubelet knows of, the device assignments the kubelet made, and the CPU assignments. This information is obtained from the internal state of the kubelet's Device Manager and CPU Manager, respectively.

    Below is a sample of the API, and how a Go client could use it in a few lines:

    service PodResourcesLister {
        rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
        rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {}
        // Kubernetes 1.21
        rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
    }

    package main

    import (
        "context"
        "fmt"
        "net"
        "time"

        "google.golang.org/grpc"
        podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
    )

    const connectionTimeout = 10 * time.Second

    func main() {
        ctx, cancel := context.WithTimeout(context.Background(), connectionTimeout)
        defer cancel()
        // The kubelet serves the Pod Resources API on a local unix socket.
        socket := "/var/lib/kubelet/pod-resources/kubelet.sock"
        conn, err := grpc.DialContext(ctx, socket, grpc.WithInsecure(), grpc.WithBlock(),
            grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
                return net.DialTimeout("unix", addr, timeout)
            }),
        )
        if err != nil {
            panic(err)
        }
        defer conn.Close()
        client := podresourcesapi.NewPodResourcesListerClient(conn)
        resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
        if err != nil {
            panic(err)
        }
        fmt.Printf("%+v\n", resp)
    }

    Finally, note that you can watch the number of requests made to the Pod Resources endpoint by watching the new kubelet metric called pod_resources_endpoint_requests_total on the kubelet's /metrics endpoint.

    Is device monitoring suitable for production? Can I extend it? Can I contribute?

    Yes! This feature, released in 1.13 almost two years ago, has seen broad adoption, is already used by different cloud managed services, and with its graduation to GA in Kubernetes 1.20 is production ready!

    If you are a device vendor, you can start using it today! If you just want to monitor the devices in your cluster, go get the latest version of your monitoring plugin!

    If you feel passionate about that area, join the Kubernetes community, help improve the API, or contribute device monitoring plugins!

    Acknowledgements

    We thank the members of the community who have contributed to this feature or given feedback including members of WG-Resource-Management, SIG-Node and the Resource management forum!

  4. Authors: Hemant Kumar, Red Hat & Christian Huffman, Red Hat

    Kubernetes 1.20 brings two important beta features, allowing Kubernetes admins and users alike to have finer control over how volume permissions are applied when a volume is mounted inside a Pod.

    Allow users to skip recursive permission changes on mount

    Traditionally, if your pod runs as a non-root user (which it should), you must specify an fsGroup inside the pod's security context so that the volume is readable and writable by the Pod. This requirement is covered in more detail here.

    But one side-effect of setting fsGroup is that, each time a volume is mounted, Kubernetes must recursively chown() and chmod() all the files and directories inside the volume - with a few exceptions noted below. This happens even if group ownership of the volume already matches the requested fsGroup, and can be pretty expensive for larger volumes with lots of small files, which causes pod startup to take a long time. This scenario has been a known problem for a while, and in Kubernetes 1.20 we are providing knobs to opt-out of recursive permission changes if the volume already has the correct permissions.

    When configuring a pod's security context, set fsGroupChangePolicy to "OnRootMismatch" so that if the root of the volume already has the correct permissions, the recursive permission change can be skipped. Kubernetes ensures that permissions of the top-level directory are changed last the first time it applies permissions, so an interrupted change is still detected as a root mismatch on the next mount.

    securityContext:
      runAsUser: 1000
      runAsGroup: 3000
      fsGroup: 2000
      fsGroupChangePolicy: "OnRootMismatch"

    You can learn more about this in Configure volume permission and ownership change policy for Pods.

    Allow CSI Drivers to declare support for fsGroup based permissions

    Although the previous section implied that Kubernetes always recursively changes permissions of a volume if a Pod has an fsGroup, this is not strictly true. For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn't perform recursive permission changes even if the pod has an fsGroup. Other volume types may not even support chown()/chmod(), the Unix-style permission control primitives these changes rely on.

    So how do we know when to apply recursive permission changes and when we shouldn't? For in-tree storage drivers, this was relatively simple. For CSI drivers that could span a multitude of platforms and storage types, this problem can be a bigger challenge.

    Previously, whenever a CSI volume was mounted to a Pod, Kubernetes would attempt to automatically determine if the permissions and ownership should be modified. These methods were imprecise and could cause issues as we already mentioned, depending on the storage type.

    The CSIDriver custom resource now has a .spec.fsGroupPolicy field, allowing storage drivers to explicitly opt in to or out of these recursive modifications. By having the CSI driver specify a policy for the backing volumes, Kubernetes can avoid needless modification attempts. This optimization helps to reduce volume mount time and also cuts down on errors reported about modifications that would never succeed.

    CSIDriver FSGroupPolicy API

    Three FSGroupPolicy values are available as of Kubernetes 1.20, with more planned for future releases.

    • ReadWriteOnceWithFSType - This is the default policy, applied if no fsGroupPolicy is defined; this preserves the behavior from previous Kubernetes releases. Each volume is examined at mount time to determine if permissions should be recursively applied.
    • File - Always attempt to apply permission modifications, regardless of the filesystem type or PersistentVolumeClaim’s access mode.
    • None - Never apply permission modifications.

    How do I use it?

    The only configuration needed is defining fsGroupPolicy inside of the .spec for a CSIDriver. Once that element is defined, any subsequently mounted volumes will automatically use the defined policy. There’s no additional deployment required!
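
    For example, a driver whose backing store ignores Unix-style permissions could opt out entirely (the driver name is illustrative):

    apiVersion: storage.k8s.io/v1
    kind: CSIDriver
    metadata:
      name: testdriver.csi.k8s.io   # illustrative driver name
    spec:
      fsGroupPolicy: None           # never attempt recursive permission changes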

    What’s next?

    Depending on feedback and adoption, the Kubernetes team plans to push these implementations to GA in either 1.21 or 1.22.

    How can I learn more?

    This feature is explained in more detail in Kubernetes project documentation: CSI Driver fsGroup Support and Configure volume permission and ownership change policy for Pods.

    How do I get involved?

    The Kubernetes Slack channel #csi and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and the CSI team.

    If you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We're rapidly growing and always welcome new contributors.

  5. Authors: Xing Yang, VMware & Xiangqian Yu, Google

    The Kubernetes Volume Snapshot feature is now GA in Kubernetes v1.20. It was introduced as alpha in Kubernetes v1.12, followed by a second alpha with breaking changes in Kubernetes v1.13, and promoted to beta in Kubernetes 1.17. This blog post summarizes the changes made in taking the feature from beta to GA.

    What is a volume snapshot?

    Many storage systems (like Google Cloud Persistent Disks, Amazon Elastic Block Storage, and many on-premise storage systems) provide the ability to create a “snapshot” of a persistent volume. A snapshot represents a point-in-time copy of a volume. A snapshot can be used either to rehydrate a new volume (pre-populated with the snapshot data) or to restore an existing volume to a previous state (represented by the snapshot).

    Why add volume snapshots to Kubernetes?

    Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no “cluster-specific” knowledge.

    The Kubernetes Storage SIG identified snapshot operations as critical functionality for many stateful workloads. For example, a database administrator may want to snapshot a database’s volumes before starting a database operation.

    By providing a standard way to trigger volume snapshot operations in Kubernetes, this feature allows Kubernetes users to incorporate snapshot operations in a portable manner on any Kubernetes environment regardless of the underlying storage.

    Additionally, these Kubernetes snapshot primitives act as basic building blocks that unlock the ability to develop advanced enterprise-grade storage administration features for Kubernetes, including application or cluster level backup solutions.

    What’s new since beta?

    With the promotion of Volume Snapshot to GA, the feature is enabled by default on standard Kubernetes deployments and cannot be turned off.

    Many enhancements have been made to improve the quality of this feature and to make it production-grade.

    • The Volume Snapshot APIs and client library were moved to a separate Go module.

    • A snapshot validation webhook has been added to perform necessary validation on volume snapshot objects. More details can be found in the Volume Snapshot Validation Webhook Kubernetes Enhancement Proposal.

    • Along with the validation webhook, the volume snapshot controller will start labeling invalid snapshot objects that already existed. This allows users to identify and remove any invalid objects and to correct their workflows. Once the API is switched to the v1 type, those invalid objects will not be deletable from the system.

    • To provide better insights into how the snapshot feature is performing, an initial set of operation metrics has been added to the volume snapshot controller.

    • There are more end-to-end tests, running on GCP, that validate the feature in a real Kubernetes cluster. Stress tests (based on Google Persistent Disk and hostPath CSI Drivers) have been introduced to test the robustness of the system.

    Other than the tightened validation, there is no difference between the v1beta1 and v1 Kubernetes volume snapshot APIs. In this release (with Kubernetes 1.20), both v1 and v1beta1 are served, while the stored API version is still v1beta1. Future releases will switch the stored version to v1 and gradually remove v1beta1 support.

    Which CSI drivers support volume snapshots?

    Snapshots are only supported for CSI drivers, not for in-tree or FlexVolume drivers. Ensure the deployed CSI driver on your cluster has implemented the snapshot interfaces. For more information, see Container Storage Interface (CSI) for Kubernetes GA.

    Currently more than 50 CSI drivers support the Volume Snapshot feature. The GCE Persistent Disk CSI Driver has gone through the tests for upgrading from volume snapshots beta to GA. GA level support for other CSI drivers should be available soon.

    Who builds products using volume snapshots?

    As of the publishing of this blog, the following participants from the Kubernetes Data Protection Working Group are building products or have already built products using Kubernetes volume snapshots.

    How to deploy volume snapshots?

    The Volume Snapshot feature contains the following components: the volume snapshot CRDs, the common volume snapshot controller, the snapshot validation webhook, and snapshot support in the CSI driver (via the CSI external-snapshotter sidecar).

    It is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller, CRDs, and validation webhook as part of their Kubernetes cluster management process (independent of any CSI Driver).

    Warning: The snapshot validation webhook serves as a critical component of a smooth transition from the v1beta1 to the v1 API. Without the snapshot validation webhook, invalid volume snapshot objects cannot be prevented from being created or updated, which in turn will block the deletion of those invalid objects in coming upgrades.

    If your cluster does not come pre-installed with the correct components, you may manually install them. See the CSI Snapshotter README for details.

    How to use volume snapshots?

    Assuming all the required components (including CSI driver) have been already deployed and running on your cluster, you can create volume snapshots using the VolumeSnapshot API object, or use an existing VolumeSnapshot to restore a PVC by specifying the VolumeSnapshot data source on it. For more details, see the volume snapshot documentation.

    Note: The Kubernetes Snapshot API does not provide any application consistency guarantees. You have to prepare your application (pause application, freeze filesystem etc.) before taking the snapshot for data consistency either manually or using higher level APIs/controllers.

    Dynamically provision a volume snapshot

    To dynamically provision a volume snapshot, create a VolumeSnapshotClass API object first.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotClass
    metadata:
      name: test-snapclass
    driver: testdriver.csi.k8s.io
    deletionPolicy: Delete
    parameters:
      csi.storage.k8s.io/snapshotter-secret-name: mysecret
      csi.storage.k8s.io/snapshotter-secret-namespace: mysecretnamespace

    Then create a VolumeSnapshot API object from a PVC by specifying the volume snapshot class.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-snapshot
      namespace: ns1
    spec:
      volumeSnapshotClassName: test-snapclass
      source:
        persistentVolumeClaimName: test-pvc

    Importing an existing volume snapshot with Kubernetes

    To import a pre-existing volume snapshot into Kubernetes, manually create a VolumeSnapshotContent object first.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: test-content
    spec:
      deletionPolicy: Delete
      driver: testdriver.csi.k8s.io
      source:
        snapshotHandle: 7bdd0de3-xxx
      volumeSnapshotRef:
        name: test-snapshot
        namespace: default

    Then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object.

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: test-snapshot
    spec:
      source:
        volumeSnapshotContentName: test-content

    Rehydrate volume from snapshot

    A bound and ready VolumeSnapshot object can be used to rehydrate a new volume with data pre-populated from snapshotted data as shown here:

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-restore
      namespace: demo-namespace
    spec:
      storageClassName: test-storageclass
      dataSource:
        name: test-snapshot
        kind: VolumeSnapshot
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi

    How to add support for snapshots in a CSI driver?

    See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details on how to implement the snapshot feature in a CSI driver.

    What are the limitations?

    The GA implementation of volume snapshots for Kubernetes has the following limitations:

    • Does not support reverting an existing PVC to an earlier state represented by a snapshot (only supports provisioning a new volume from a snapshot).

    How to learn more?

    The code repository for snapshot APIs and controller is here: https://github.com/kubernetes-csi/external-snapshotter

    Check out additional documentation on the snapshot feature here: http://k8s.io/docs/concepts/storage/volume-snapshots and https://kubernetes-csi.github.io/docs/

    How to get involved?

    This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.

    We offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach GA. We want to thank Saad Ali, Michelle Au, Tim Hockin, and Jordan Liggitt for their insightful reviews and thorough consideration with the design, thank Andi Li for his work on adding the support of the snapshot validation webhook, thank Grant Griffiths on implementing metrics support in the snapshot controller and handling password rotation in the validation webhook, thank Chris Henzie, Raunak Shah, and Manohar Reddy for writing critical e2e tests to meet the scalability and stability requirements for graduation, thank Kartik Sharma for moving snapshot APIs and client lib to a separate go module, and thank Raunak Shah and Prafull Ladha for their help with upgrade testing from beta to GA.

    There are many more people who have helped to move the snapshot feature from beta to GA. We want to thank everyone who has contributed to this effort:

    For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.

    We also hold regular Data Protection Working Group meetings. New attendees are welcome to join in discussions.