Kubernetes News

-
Blog: A Custom Kubernetes Scheduler to Orchestrate Highly Available Applications
Author: Chris Seto (Cockroach Labs)
As long as you're willing to follow the rules, deploying on Kubernetes and air travel can be quite pleasant. More often than not, things will "just work". However, if one is interested in travelling with an alligator that must remain alive or scaling a database that must remain available, the situation is likely to become a bit more complicated. It may even be easier to build one's own plane or database for that matter. Travelling with reptiles aside, scaling a highly available stateful system is no trivial task.
Scaling any system has two main components:
- Adding or removing infrastructure that the system will run on, and
- Ensuring that the system knows how to handle additional instances of itself being added and removed.
Most stateless systems, web servers for example, are created without the need to be aware of peers. Stateful systems, which include databases like CockroachDB, have to coordinate with their peer instances and shuffle data around. As luck would have it, CockroachDB handles data redistribution and replication. The tricky part is being able to tolerate failures during these operations by ensuring that data and instances are distributed across many failure domains (availability zones).
One of Kubernetes' responsibilities is to place "resources" (e.g., a disk or container) into the cluster and satisfy the constraints they request. For example: "I must be in availability zone A" (see Running in multiple zones), or "I can't be placed onto the same node as this other Pod" (see Affinity and anti-affinity).
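As a sketch of the second kind of constraint, a Pod might declare a required anti-affinity so that no two CockroachDB Pods land on the same node (the names and image here are illustrative, not CockroachCloud's actual manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: crdb-example
  labels:
    app: cockroachdb
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: cockroachdb
          # One CockroachDB pod per node; using topology.kubernetes.io/zone
          # as the topologyKey would spread across zones instead.
          topologyKey: kubernetes.io/hostname
  containers:
    - name: cockroachdb
      image: cockroachdb/cockroach:v20.2.0
```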
In addition to those constraints, Kubernetes offers StatefulSets that provide identity to Pods as well as persistent storage that "follows" these identified Pods. Identity in a StatefulSet is handled by an increasing integer at the end of a Pod's name. It's important to note that these integers must always be contiguous: in a StatefulSet, if Pods 1 and 3 exist then Pod 2 must also exist.
Under the hood, CockroachCloud deploys each region of CockroachDB as a StatefulSet in its own Kubernetes cluster - see Orchestrate CockroachDB in a Single Kubernetes Cluster. In this article, I'll be looking at an individual region, one StatefulSet and one Kubernetes cluster which is distributed across at least three availability zones.
A three-node CockroachCloud cluster would look something like this:
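Roughly, with one Kubernetes node and one CockroachDB pod per availability zone (the original post shows this as a diagram; node and pod names are illustrative):

zone A: node a-1 → crdb-0
zone B: node b-1 → crdb-1
zone C: node c-1 → crdb-2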
When adding additional resources to the cluster we also distribute them across zones. For the speediest user experience, we add all Kubernetes nodes at the same time and then scale up the StatefulSet.
Note that anti-affinities are satisfied no matter the order in which pods are assigned to Kubernetes nodes. In the example, pods 0, 1 and 2 were assigned to zones A, B, and C respectively, but pods 3 and 4 were assigned in a different order, to zones B and A respectively. The anti-affinity is still satisfied because the pods are still placed in different zones.
To remove resources from a cluster, we perform these operations in reverse order.
We first scale down the StatefulSet and then remove from the cluster any nodes lacking a CockroachDB pod.
Now, remember that Pods in a StatefulSet of size n must have ids in the range [0, n). When scaling down a StatefulSet by m, Kubernetes removes m Pods, starting from the highest ordinals and moving towards the lowest, the reverse of the order in which they were added. Consider the cluster topology below: as ordinals 5 through 3 are removed from this cluster, the StatefulSet continues to have a presence across all 3 availability zones.
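Consistent with the scale-up example above (Pods 3 and 4 in zones B and A, Pod 5 in zone C; an illustrative rendering of the diagram from the original post):

zone A: crdb-0, crdb-4
zone B: crdb-1, crdb-3
zone C: crdb-2, crdb-5

Removing ordinals 5, 4, and 3 leaves crdb-0, crdb-1, and crdb-2: one Pod in each zone.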
However, the Kubernetes scheduler doesn't guarantee the placement above, as we initially expected it would.
Our combined knowledge of the following is what led to this misconception.
- Kubernetes' ability to automatically spread Pods across zones
- The behavior that, in a StatefulSet with n replicas, Pods are created sequentially, in order from {0..n-1}. See StatefulSet for more details.
Consider the following topology:
These pods were created in order and they are spread across all availability zones in the cluster. When ordinals 5 through 3 are terminated, this cluster will lose its presence in zone C!
Worse yet, our automation, at the time, would remove Nodes A-2, B-2, and C-2, leaving CRDB-1 in an unschedulable state, as persistent volumes are only available in the zone they are initially created in.
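One arrangement consistent with this description (illustrative; the original post shows it as a diagram):

zone A: node A-1 → crdb-0, node A-2 → crdb-3
zone B: node B-1 → crdb-2, node B-2 → crdb-1
zone C: node C-1 → crdb-4, node C-2 → crdb-5

Scaling down removes crdb-5, crdb-4, and crdb-3, leaving zone C without a CockroachDB pod; removing nodes A-2, B-2, and C-2 then evicts crdb-1, whose volume lives in zone B and whose anti-affinity rules out the only remaining zone B node.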
To correct the latter issue, we now employ a "hunt and peck" approach to removing machines from a cluster. Rather than blindly removing Kubernetes nodes from the cluster, only nodes without a CockroachDB pod are removed. The much more daunting task was to wrangle the Kubernetes scheduler.
A session of brainstorming left us with 3 options:
1. Upgrade to Kubernetes 1.18 and make use of Pod Topology Spread Constraints
While this seems like it could have been the perfect solution, at the time of writing Kubernetes 1.18 was unavailable on the two most common managed Kubernetes services in public cloud, EKS and GKE. Furthermore, pod topology spread constraints were still a beta feature in 1.18 which meant that it wasn't guaranteed to be available in managed clusters even when v1.18 became available. The entire endeavour was concerningly reminiscent of checking caniuse.com when Internet Explorer 8 was still around.
2. Deploy a StatefulSet per zone.
Rather than having one StatefulSet distributed across all availability zones, a StatefulSet per zone, each with node affinities, would allow manual control over our zonal topology. Our team had considered this as an option in the past, which made it particularly appealing. Ultimately, we decided to forgo this option as it would have required a massive overhaul to our codebase, and performing the migration on existing customer clusters would have been an equally large undertaking.
3. Write a custom Kubernetes scheduler.
Thanks to an example from Kelsey Hightower and a blog post from Banzai Cloud, we decided to dive in head first and write our own custom Kubernetes scheduler. Once our proof-of-concept was deployed and running, we quickly discovered that the Kubernetes scheduler is also responsible for mapping persistent volumes to the Pods that it schedules. The output of kubectl get events had led us to believe there was another system at play. In our journey to find the component responsible for storage claim mapping, we discovered the kube-scheduler plugin system. Our next POC was a Filter plugin that determined the appropriate availability zone by pod ordinal, and it worked flawlessly!

Our custom scheduler plugin is open source and runs in all of our CockroachCloud clusters. Having control over how our StatefulSet pods are being scheduled has let us scale out with confidence. We may look into retiring our plugin once pod topology spread constraints are available in GKE and EKS, but the maintenance overhead has been surprisingly low. Better still: the plugin's implementation is orthogonal to our business logic. Deploying it, or retiring it for that matter, is as simple as changing the schedulerName field in our StatefulSet definitions.
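As a sketch, pointing a StatefulSet at a custom scheduler looks something like the following (the scheduler name, image, and the rest of the manifest are illustrative, not CockroachCloud's actual configuration):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cockroachdb
spec:
  serviceName: cockroachdb
  replicas: 3  # creates Pods cockroachdb-0, cockroachdb-1, cockroachdb-2
  selector:
    matchLabels:
      app: cockroachdb
  template:
    metadata:
      labels:
        app: cockroachdb
    spec:
      # Hand these Pods to the custom scheduler plugin instead of the
      # default kube-scheduler; omitting this field restores the default.
      schedulerName: zone-aware-scheduler
      containers:
        - name: cockroachdb
          image: cockroachdb/cockroach:v20.2.0
```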
Chris Seto is a software engineer at Cockroach Labs and works on Kubernetes automation for CockroachCloud, the managed CockroachDB service.
-
Blog: Kubernetes 1.20: Pod Impersonation and Short-lived Volumes in CSI Drivers
Author: Shihang Zhang (Google)
Typically when a CSI driver mounts credentials such as secrets and certificates, it has to authenticate against storage providers to access the credentials. However, access to those credentials is controlled on the basis of the pods' identities rather than the CSI driver's identity. CSI drivers, therefore, need some way to retrieve a pod's service account token.
Currently there are two suboptimal approaches to achieve this: either granting CSI drivers permission to use the TokenRequest API, or reading tokens directly from the host filesystem.
Both of them exhibit the following drawbacks:
- Violating the principle of least privilege
- Every CSI driver needs to re-implement the logic of getting the pod’s service account token
The second approach is more problematic due to:
- The audience of the token defaults to the kube-apiserver
- The token is not guaranteed to be available (e.g. AutomountServiceAccountToken=false)
- The approach does not work for CSI drivers that run as a different (non-root) user from the pods (see the file permission section for service account tokens)
- The token might be a legacy Kubernetes service account token, which doesn't expire if BoundServiceAccountTokenVolume=false
Kubernetes 1.20 introduces an alpha feature, CSIServiceAccountToken, to improve the security posture. The new feature allows CSI drivers to receive pods' bound service account tokens. This feature also provides a knob to re-publish volumes so that short-lived volumes can be refreshed.
Pod Impersonation
Using GCP APIs
Using Workload Identity, a Kubernetes service account can authenticate as a Google service account when accessing Google Cloud APIs. If a CSI driver needs to access GCP APIs on behalf of the pods that it is mounting volumes for, it can use the pod's service account token to exchange for GCP tokens. The pod's service account token is plumbed through the volume context in NodePublishVolume RPC calls when the feature CSIServiceAccountToken is enabled. For example: accessing Google Secret Manager via a secret store CSI driver.
Using Vault
If users configure Kubernetes as an auth method, Vault uses the TokenReview API to validate the Kubernetes service account token. For CSI drivers using Vault as a resources provider, they need to present the pod's service account token to Vault. For example, the secrets store CSI driver and the cert-manager CSI driver.
Short-lived Volumes
To keep short-lived volumes such as certificates effective, CSI drivers can specify RequiresRepublish=true in their CSIDriver object to have the kubelet periodically call NodePublishVolume on mounted volumes. These republishes allow CSI drivers to ensure that the volume content is up-to-date.
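Taken together, a CSIDriver object opting into both behaviors might look like the following sketch (the driver name and audience are illustrative assumptions, not from the original post):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: secrets.csi.example.com  # hypothetical driver
spec:
  # Ask the kubelet to pass the pod's bound service account token
  # (scoped to this audience) in the NodePublishVolume volume context.
  tokenRequests:
    - audience: "vault"
  # Have the kubelet periodically re-call NodePublishVolume so the
  # driver can refresh short-lived content such as certificates.
  requiresRepublish: true
```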
Next steps
This feature is alpha and projected to move to beta in 1.21. See more in the following KEP and CSI documentation:
Your feedback is always welcome!
- SIG-Auth meets regularly and can be reached via Slack and the mailing list
- SIG-Storage meets regularly and can be reached via Slack and the mailing list.
-
Blog: Third Party Device Metrics Reaches GA
Authors: Renaud Gaubert (NVIDIA), David Ashpole (Google), and Pramod Ramarao (NVIDIA)
With Kubernetes 1.20, infrastructure teams who manage large-scale Kubernetes clusters are seeing the graduation of two exciting and long-awaited features:
- The Pod Resources API (introduced in 1.13) is finally graduating to GA. This allows Kubernetes plugins to obtain information about the node’s resource usage and assignment; for example: which pod/container consumes which device.
- The DisableAcceleratorMetrics feature (introduced in 1.19) is graduating to beta and will be enabled by default. This removes device metrics reported by the kubelet in favor of the new plugin architecture.
Many of the features related to fundamental device support (device discovery, plugin, and monitoring) are reaching a strong level of stability. Kubernetes users should see these features as stepping stones to enable more complex use cases (networking, scheduling, storage, etc.)!
One such example is Non Uniform Memory Access (NUMA) placement where, when selecting a device, an application typically wants to ensure that data transfer between CPU Memory and Device Memory is as fast as possible. In some cases, incorrect NUMA placement can nullify the benefit of offloading compute to an external device.
If these are topics of interest to you, consider joining the Kubernetes Node Special Interest Group (SIG) for all topics related to the Kubernetes node, the COD (container orchestrated device) workgroup for topics related to runtimes, or the resource management forum for topics related to resource management!
The Pod Resources API - Why does it need to exist?
Kubernetes is a vendor neutral platform. If we want it to support device monitoring, adding vendor-specific code in the Kubernetes code base is not an ideal solution. Ultimately, devices are a domain where deep expertise is needed and the best people to add and maintain code in that area are the device vendors themselves.
The Pod Resources API was built as a solution to this issue. Each vendor can build and maintain their own out-of-tree monitoring plugin. This monitoring plugin, often deployed as a separate pod within a cluster, can then associate the metrics a device emits with the associated pod that's using it.
For example, use the NVIDIA GPU dcgm-exporter to scrape metrics in Prometheus format:
```
$ curl -sL http://127.0.0.1:8080/metrics
# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
# HELP DCGM_FI_DEV_MEM_CLOCK Memory clock frequency (in MHz).
# TYPE DCGM_FI_DEV_MEM_CLOCK gauge
# HELP DCGM_FI_DEV_MEMORY_TEMP Memory temperature (in C).
# TYPE DCGM_FI_DEV_MEMORY_TEMP gauge
...
DCGM_FI_DEV_SM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 139
DCGM_FI_DEV_MEM_CLOCK{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 405
DCGM_FI_DEV_MEMORY_TEMP{gpu="0", UUID="GPU-604ac76c-d9cf-fef3-62e9-d92044ab6e52",container="foo",namespace="bar",pod="baz"} 9223372036854775794
```
Each agent is expected to adhere to the node monitoring guidelines. In other words, plugins are expected to generate metrics in Prometheus format, and new metrics should not have any dependency on the Kubernetes codebase directly.
This allows consumers of the metrics to use a compatible monitoring pipeline to collect and analyze metrics from a variety of agents, even if they are maintained by different vendors.
Disabling the NVIDIA GPU metrics - Warning
With the graduation of the plugin monitoring system, Kubernetes is deprecating the NVIDIA GPU metrics that are being reported by the kubelet.
With the DisableAcceleratorMetrics feature being enabled by default in Kubernetes 1.20, NVIDIA GPUs are no longer special citizens in Kubernetes. This is a good thing in the spirit of being vendor-neutral, and enables the most suited people to maintain their plugin on their own release schedule!
Users will now need to either install the NVIDIA DCGM exporter or use bindings to gather more accurate and complete metrics about NVIDIA GPUs. This deprecation means that you can no longer rely on metrics that were reported by the kubelet, such as container_accelerator_duty_cycle or container_accelerator_memory_used_bytes, which were used to gather NVIDIA GPU memory utilization.

This means that users who used to rely on the NVIDIA GPU metrics reported by the kubelet will need to update their references and deploy the NVIDIA plugin. Namely, the different metrics reported by Kubernetes map to the following metrics:
| Kubernetes metric | NVIDIA dcgm-exporter metric |
| --- | --- |
| container_accelerator_duty_cycle | DCGM_FI_DEV_GPU_UTIL |
| container_accelerator_memory_used_bytes | DCGM_FI_DEV_FB_USED |
| container_accelerator_memory_total_bytes | DCGM_FI_DEV_FB_FREE + DCGM_FI_DEV_FB_USED |
You might also be interested in other metrics such as DCGM_FI_DEV_GPU_TEMP (the GPU temperature) or DCGM_FI_DEV_POWER_USAGE (the power usage). The default set is available in NVIDIA's Data Center GPU Manager documentation.

Note that for this release you can still set the DisableAcceleratorMetrics feature gate to false, effectively re-enabling the ability for the kubelet to report NVIDIA GPU metrics.

Paired with the graduation of the Pod Resources API, these tools can be used to generate GPU telemetry that can be surfaced in visualization dashboards.
The Pod Resources API - What can I go on to do with this?
As soon as this interface was introduced, many vendors started using it for widely different use cases! To list a few examples:
The kuryr-kubernetes CNI plugin in tandem with intel-sriov-device-plugin. This allowed the CNI plugin to know which allocation of SR-IOV Virtual Functions (VFs) the kubelet made and use that information to correctly set up the container network namespace and use a device with the appropriate NUMA node. We also expect this interface to be used to track the allocated and available resources with information about the NUMA topology of the worker node.
Another use case is GPU telemetry, where GPU metrics can be associated with the containers and pods that the GPU is assigned to. One such example is the NVIDIA dcgm-exporter, but others can be easily built in the same paradigm.

The Pod Resources API is a simple gRPC service which informs clients of the pods the kubelet knows. The information concerns the device assignments the kubelet made and the assignment of CPUs. This information is obtained from the internal state of the kubelet's Device Manager and CPU Manager respectively.
Below is a sample of the API and how a Go client could use that information in a few lines:
```proto
service PodResourcesLister {
  rpc List(ListPodResourcesRequest) returns (ListPodResourcesResponse) {}
  rpc GetAllocatableResources(AllocatableResourcesRequest) returns (AllocatableResourcesResponse) {} // Kubernetes 1.21
  rpc Watch(WatchPodResourcesRequest) returns (stream WatchPodResourcesResponse) {}
}
```
```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"

	"google.golang.org/grpc"
	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
)

const connectionTimeout = 10 * time.Second

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), connectionTimeout)
	defer cancel()

	// The kubelet exposes the Pod Resources API on a local unix socket.
	socket := "/var/lib/kubelet/pod-resources/kubelet.sock"
	conn, err := grpc.DialContext(ctx, socket,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.WithDialer(func(addr string, timeout time.Duration) (net.Conn, error) {
			return net.DialTimeout("unix", addr, timeout)
		}),
	)
	if err != nil {
		panic(err)
	}

	client := podresourcesapi.NewPodResourcesListerClient(conn)
	resp, err := client.List(ctx, &podresourcesapi.ListPodResourcesRequest{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", resp)
}
```
Finally, note that you can watch the number of requests made to the Pod Resources endpoint by watching the new kubelet metric called pod_resources_endpoint_requests_total on the kubelet's /metrics endpoint.

Is device monitoring suitable for production? Can I extend it? Can I contribute?
Yes! This feature, released in 1.13, almost 2 years ago, has seen broad adoption, is already used by different cloud managed services, and with its graduation to GA in Kubernetes 1.20 is production ready!
If you are a device vendor, you can start using it today! If you just want to monitor the devices in your cluster, go get the latest version of your monitoring plugin!
If you feel passionate about that area, join the Kubernetes community, help improve the API, or contribute to the device monitoring plugins!
Acknowledgements
We thank the members of the community who have contributed to this feature or given feedback including members of WG-Resource-Management, SIG-Node and the Resource management forum!
-
Blog: Kubernetes 1.20: Granular Control of Volume Permission Changes
Authors: Hemant Kumar, Red Hat & Christian Huffman, Red Hat
Kubernetes 1.20 brings two important beta features, allowing Kubernetes admins and users alike to have finer control over how volume permissions are applied when a volume is mounted inside a Pod.
Allow users to skip recursive permission changes on mount
Traditionally if your pod is running as a non-root user (which you should), you must specify a fsGroup inside the pod's security context so that the volume can be readable and writable by the Pod. This requirement is covered in more detail here.

But one side-effect of setting fsGroup is that, each time a volume is mounted, Kubernetes must recursively chown() and chmod() all the files and directories inside the volume, with a few exceptions noted below. This happens even if group ownership of the volume already matches the requested fsGroup, and can be pretty expensive for larger volumes with lots of small files, which causes pod startup to take a long time. This scenario has been a known problem for a while, and in Kubernetes 1.20 we are providing knobs to opt out of recursive permission changes if the volume already has the correct permissions.

When configuring a pod's security context, set fsGroupChangePolicy to "OnRootMismatch" so that if the root of the volume already has the correct permissions, the recursive permission change can be skipped. Kubernetes ensures that permissions of the top-level directory are changed last the first time it applies permissions.

```yaml
securityContext:
  runAsUser: 1000
  runAsGroup: 3000
  fsGroup: 2000
  fsGroupChangePolicy: "OnRootMismatch"
```
You can learn more about this in Configure volume permission and ownership change policy for Pods.
Allow CSI Drivers to declare support for fsGroup based permissions
Although the previous section implied that Kubernetes always recursively changes permissions of a volume if a Pod has a fsGroup, this is not strictly true. For certain multi-writer volume types, such as NFS or Gluster, the cluster doesn't perform recursive permission changes even if the pod has a fsGroup. Other volume types may not even support chown()/chmod(), which rely on Unix-style permission control primitives.

So how do we know when to apply recursive permission changes and when we shouldn't? For in-tree storage drivers, this was relatively simple. For CSI drivers that could span a multitude of platforms and storage types, this problem can be a bigger challenge.
Previously, whenever a CSI volume was mounted to a Pod, Kubernetes would attempt to automatically determine if the permissions and ownership should be modified. These methods were imprecise and could cause issues as we already mentioned, depending on the storage type.
The CSIDriver custom resource now has a .spec.fsGroupPolicy field, allowing storage drivers to explicitly opt in or out of these recursive modifications. By having the CSI driver specify a policy for the backing volumes, Kubernetes can avoid needless modification attempts. This optimization helps to reduce volume mount time and also cuts down on errors reported about modifications that would never succeed.

CSIDriver FSGroupPolicy API
Three FSGroupPolicy values are available as of Kubernetes 1.20, with more planned for future releases.
- ReadWriteOnceWithFSType - This is the default policy, applied if no fsGroupPolicy is defined; this preserves the behavior from previous Kubernetes releases. Each volume is examined at mount time to determine if permissions should be recursively applied.
- File - Always attempt to apply permission modifications, regardless of the filesystem type or PersistentVolumeClaim's access mode.
- None - Never apply permission modifications.
How do I use it?
The only configuration needed is defining fsGroupPolicy inside of the .spec for a CSIDriver. Once that element is defined, any subsequently mounted volumes will automatically use the defined policy. There's no additional deployment required!
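For example, a CSIDriver object that opts out of recursive permission changes entirely might look like this minimal sketch (the driver name is a hypothetical placeholder):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: nfs.csi.example.com  # hypothetical multi-writer driver
spec:
  # Never attempt recursive chown()/chmod() on volumes from this driver.
  fsGroupPolicy: None
```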
What's next?
Depending on feedback and adoption, the Kubernetes team plans to push these implementations to GA in either 1.21 or 1.22.
How can I learn more?
This feature is explained in more detail in Kubernetes project documentation: CSI Driver fsGroup Support and Configure volume permission and ownership change policy for Pods.
How do I get involved?
The Kubernetes Slack channel #csi and any of the standard SIG Storage communication channels are great ways to reach the SIG Storage and CSI teams.
If you're interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We're rapidly growing and always welcome new contributors.
-
Blog: Kubernetes 1.20: Kubernetes Volume Snapshot Moves to GA
Authors: Xing Yang, VMware & Xiangqian Yu, Google
The Kubernetes Volume Snapshot feature is now GA in Kubernetes v1.20. It was introduced as alpha in Kubernetes v1.12, followed by a second alpha with breaking changes in Kubernetes v1.13, and promotion to beta in Kubernetes 1.17. This blog post summarizes the changes made in releasing the feature from beta to GA.
What is a volume snapshot?
Many storage systems (like Google Cloud Persistent Disks, Amazon Elastic Block Storage, and many on-premise storage systems) provide the ability to create a “snapshot” of a persistent volume. A snapshot represents a point-in-time copy of a volume. A snapshot can be used either to rehydrate a new volume (pre-populated with the snapshot data) or to restore an existing volume to a previous state (represented by the snapshot).
Why add volume snapshots to Kubernetes?
Kubernetes aims to create an abstraction layer between distributed applications and underlying clusters so that applications can be agnostic to the specifics of the cluster they run on and application deployment requires no “cluster-specific” knowledge.
The Kubernetes Storage SIG identified snapshot operations as critical functionality for many stateful workloads. For example, a database administrator may want to snapshot a database’s volumes before starting a database operation.
By providing a standard way to trigger volume snapshot operations in Kubernetes, this feature allows Kubernetes users to incorporate snapshot operations in a portable manner on any Kubernetes environment regardless of the underlying storage.
Additionally, these Kubernetes snapshot primitives act as basic building blocks that unlock the ability to develop advanced enterprise-grade storage administration features for Kubernetes, including application or cluster level backup solutions.
What’s new since beta?
With the promotion of Volume Snapshot to GA, the feature is enabled by default on standard Kubernetes deployments and cannot be turned off.
Many enhancements have been made to improve the quality of this feature and to make it production-grade.
- The Volume Snapshot APIs and client library were moved to a separate Go module.
- A snapshot validation webhook has been added to perform necessary validation on volume snapshot objects. More details can be found in the Volume Snapshot Validation Webhook Kubernetes Enhancement Proposal.
- Along with the validation webhook, the volume snapshot controller will start labeling invalid snapshot objects that already existed. This allows users to identify and remove any invalid objects and to correct their workflows. Once the API is switched to the v1 type, those invalid objects will no longer be deletable from the system.
- To provide better insights into how the snapshot feature is performing, an initial set of operation metrics has been added to the volume snapshot controller.
- There are more end-to-end tests, running on GCP, that validate the feature in a real Kubernetes cluster. Stress tests (based on Google Persistent Disk and hostPath CSI Drivers) have been introduced to test the robustness of the system.
Other than the tightened validation, there is no difference between the v1beta1 and v1 Kubernetes volume snapshot APIs. In this release (with Kubernetes 1.20), both v1 and v1beta1 are served while the stored API version is still v1beta1. Future releases will switch the stored version to v1 and gradually remove v1beta1 support.
Which CSI drivers support volume snapshots?
Snapshots are only supported for CSI drivers, not for in-tree or FlexVolume drivers. Ensure the deployed CSI driver on your cluster has implemented the snapshot interfaces. For more information, see Container Storage Interface (CSI) for Kubernetes GA.
Currently more than 50 CSI drivers support the Volume Snapshot feature. The GCE Persistent Disk CSI Driver has gone through the tests for upgrading from volume snapshots beta to GA. GA level support for other CSI drivers should be available soon.
Who builds products using volume snapshots?
As of the publishing of this blog, the following participants from the Kubernetes Data Protection Working Group are building products or have already built products using Kubernetes volume snapshots.
- Dell-EMC: PowerProtect
- Druva
- Kasten K10
- NetApp: Project Astra
- Portworx (PX-Backup)
- Pure Storage (Pure Service Orchestrator)
- Red Hat OpenShift Container Storage
- Robin Cloud Native Storage
- TrilioVault for Kubernetes
- Velero plugin for CSI
How to deploy volume snapshots?
The Volume Snapshot feature contains the following components:
- Kubernetes Volume Snapshot CRDs
- Volume snapshot controller
- Snapshot validation webhook
- CSI Driver along with CSI Snapshotter sidecar
It is strongly recommended that Kubernetes distributors bundle and deploy the volume snapshot controller, CRDs, and validation webhook as part of their Kubernetes cluster management process (independent of any CSI Driver).
Warning: The snapshot validation webhook serves as a critical component in the transition from the v1beta1 to the v1 API. If the snapshot validation webhook is not installed, invalid volume snapshot objects can still be created or updated, which in turn will block the deletion of those invalid objects in coming upgrades.

If your cluster does not come pre-installed with the correct components, you may manually install them. See the CSI Snapshotter README for details.
How to use volume snapshots?
Assuming all the required components (including the CSI driver) have already been deployed and are running on your cluster, you can create volume snapshots using the VolumeSnapshot API object, or use an existing VolumeSnapshot to restore a PVC by specifying the VolumeSnapshot data source on it. For more details, see the volume snapshot documentation.

Note: The Kubernetes Snapshot API does not provide any application consistency guarantees. You have to prepare your application (pause the application, freeze the filesystem, etc.) before taking the snapshot, either manually or using higher-level APIs/controllers, for data consistency.

Dynamically provision a volume snapshot
To dynamically provision a volume snapshot, create a VolumeSnapshotClass API object first.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: test-snapclass
driver: testdriver.csi.k8s.io
deletionPolicy: Delete
parameters:
  csi.storage.k8s.io/snapshotter-secret-name: mysecret
  csi.storage.k8s.io/snapshotter-secret-namespace: mysecretnamespace
```
Then create a VolumeSnapshot API object from a PVC by specifying the volume snapshot class.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
  namespace: ns1
spec:
  volumeSnapshotClassName: test-snapclass
  source:
    persistentVolumeClaimName: test-pvc
```
Importing an existing volume snapshot with Kubernetes
To import a pre-existing volume snapshot into Kubernetes, manually create a VolumeSnapshotContent object first.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: test-content
spec:
  deletionPolicy: Delete
  driver: testdriver.csi.k8s.io
  source:
    snapshotHandle: 7bdd0de3-xxx
  volumeSnapshotRef:
    name: test-snapshot
    namespace: default
```
Then create a VolumeSnapshot object pointing to the VolumeSnapshotContent object.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: test-snapshot
spec:
  source:
    volumeSnapshotContentName: test-content
```
Rehydrate volume from snapshot
A bound and ready VolumeSnapshot object can be used to rehydrate a new volume, pre-populated with the snapshotted data, as shown here:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-restore
  namespace: demo-namespace
spec:
  storageClassName: test-storageclass
  dataSource:
    name: test-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```
How to add support for snapshots in a CSI driver?
See the CSI spec and the Kubernetes-CSI Driver Developer Guide for more details on how to implement the snapshot feature in a CSI driver.
What are the limitations?
The GA implementation of volume snapshots for Kubernetes has the following limitations:
- Does not support reverting an existing PVC to an earlier state represented by a snapshot (only supports provisioning a new volume from a snapshot).
How to learn more?
The code repository for snapshot APIs and controller is here: https://github.com/kubernetes-csi/external-snapshotter
Check out additional documentation on the snapshot feature here: http://k8s.io/docs/concepts/storage/volume-snapshots and https://kubernetes-csi.github.io/docs/
How to get involved?
This project, like all of Kubernetes, is the result of hard work by many contributors from diverse backgrounds working together.
We offer a huge thank you to the contributors who stepped up these last few quarters to help the project reach GA. We want to thank Saad Ali, Michelle Au, Tim Hockin, and Jordan Liggitt for their insightful reviews and thorough consideration of the design; Andi Li for his work on adding support for the snapshot validation webhook; Grant Griffiths for implementing metrics support in the snapshot controller and handling password rotation in the validation webhook; Chris Henzie, Raunak Shah, and Manohar Reddy for writing critical e2e tests to meet the scalability and stability requirements for graduation; Kartik Sharma for moving the snapshot APIs and client library to a separate Go module; and Raunak Shah and Prafull Ladha for their help with upgrade testing from beta to GA.
There are many more people who have helped to move the snapshot feature from beta to GA. We want to thank everyone who has contributed to this effort:
- Andi Li
- Ben Swartzlander
- Chris Henzie
- Christian Huffman
- Grant Griffiths
- Humble Devassy Chirammal
- Jan Šafránek
- Jiawei Wang
- Jing Xu
- Jordan Liggitt
- Kartik Sharma
- Madhu Rajanna
- Manohar Reddy
- Michelle Au
- Patrick Ohly
- Prafull Ladha
- Prateek Pandey
- Raunak Shah
- Saad Ali
- Saikat Roychowdhury
- Tim Hockin
- Xiangqian Yu
- Xing Yang
- Zhu Can
For those interested in getting involved with the design and development of CSI or any part of the Kubernetes Storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.
We also hold regular Data Protection Working Group meetings. New attendees are welcome to join in discussions.