Kubernetes News

Blog: Consider All Microservices Vulnerable — And Monitor Their Behavior
Author: David Hadas (IBM Research Labs)
This post warns DevOps against a false sense of security. Following security best practices when developing and configuring microservices does not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, named here "Security-Behavior Analysis", can protect the deployed vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices presumed vulnerable.
As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.
Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:
➥ Admit that your services are vulnerable!
In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...
How to protect microservices from being exploited
Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in some ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can't be exploited represents a risk that can't be realized.
Figure 1. An Offender gaining foothold in a vulnerable service
The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.
More specifically, let's assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key "username" and value of "tom or 1=1", the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not be sending a string with spaces or with the equal sign character as a username; instead, they will normally send legal usernames, which for example may be defined as a short sequence of characters a-z. No legal username can trigger unplanned service behavior.
In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.
More generally:
- Monitoring the behavior of clients can help detect and block exploits against service API vulnerabilities. In fact, deploying efficient client behavior monitoring makes many vulnerabilities unexploitable and others very hard to exploit. To succeed, the offender needs to create an exploit that is indistinguishable from regular requests.
- Monitoring the behavior of services can help detect services as they are being exploited, regardless of the attack vector used. Efficient service behavior monitoring limits what an attacker may be able to achieve, as the offender needs to ensure the service behavior is indistinguishable from regular service behavior.
Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability that anyone will successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.
Use cases
One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:
- Normal (no known vulnerabilities): The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses. What you need: provide generic protection against any unknown, zero-day, service vulnerabilities - detect/block irregular patterns sent as part of incoming client requests that may be used as exploits.
- Vulnerable (an applicable CVE is published): The service owner is required to release a new non-vulnerable revision of the service. Research shows that in practice this process of removing a known vulnerability may take many weeks to accomplish (2 months on average). What you need: add protection based on the CVE analysis - detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability. Continue to offer the service, even though it has a known vulnerability.
- Exploitable (a known exploit is published): The service owner needs a way to filter incoming requests that contain the known exploit. What you need: add protection based on a known exploit signature - detect/block incoming client requests that carry signatures identifying the exploit. Continue to offer the service, despite the presence of a known exploit.
- Misused (an offender misuses pods backing the service): The offender can follow an attack pattern enabling him/her to misuse pods. The service owner needs to restart any compromised pods while using non-compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before he/she may again misuse it. What you need: identify and restart instances of the component that is being misused - at any given time, some backing pods may be compromised and misused, while others behave as designed. Detect/remove the misused pods while allowing other pods to continue servicing client requests.
Fortunately, microservice architecture is well suited to security-behavior monitoring, as discussed next.
Security-Behavior of microservices versus monoliths
Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.
Figure 2. Microservices are well suited for security-behavior monitoring
The diagram above clarifies how dividing a monolithic service to a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.
In a microservice environment, each microservice is expected by design to offer a more well-defined service and to serve a better-defined type of request. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes the internal requests and internal services, which offers more security-behavior data for an observer to identify irregularities. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.
Security-Behavior monitoring on Kubernetes
Kubernetes deployments seeking to add Security-Behavior may use Guard, developed under the CNCF project Knative. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.
See:
- Guard on Github, for using Guard as a standalone tool.
- The Knative automation suite - read about Knative in the blog post Opinionated Kubernetes, which describes how Knative simplifies and unifies the way web services are deployed on Kubernetes.
- You may contact Guard maintainers on the SIG Security Slack channel or on the Knative community security Slack channel. The Knative community channel will move soon to the CNCF Slack under the name #knative-security.
The goal of this post is to invite the Kubernetes community to action and introduce Security-Behavior monitoring and control to help secure Kubernetes-based deployments. Hopefully, as a follow-up, the community will:
- Analyze the cyber challenges presented for different Kubernetes use cases
- Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
- Consider how to integrate with tools that can help users monitor and control their vulnerable services.
Getting involved
You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.
Blog: Protect Your Mission-Critical Pods From Eviction With PriorityClass
Author: Sunny Bhambhani (InfraCloud Technologies)
Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.
Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.
Resource management in Kubernetes
The control plane consists of multiple components; the scheduler (usually the built-in kube-scheduler) is the component responsible for assigning a node to a pod.
Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.
In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod. If the scheduler cannot find any node, the pod remains in the Pending state, which is not ideal.
Note: To name a few, nodeSelector, taints and tolerations, nodeAffinity, the rank of nodes based on available resources (for example, CPU and memory), and several other criteria are used to determine the pod's placement.
The diagram below, from points 1 through 4, explains the request flow:
Scheduling in Kubernetes
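If you want to observe this flow on your own cluster, a quick way is to list Pods that the scheduler has not yet placed and then read the scheduling events for one of them. This is a minimal sketch using standard kubectl commands; the pod name is a placeholder you would substitute:

$ kubectl get pods --all-namespaces --field-selector=status.phase=Pending
$ kubectl describe pod <PENDING_POD_NAME>   # look for FailedScheduling events at the bottom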
Typical use cases
Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.
- Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run on every node before other pods can. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required or by adding a new node to the cluster. Both these approaches are unsuitable since the former would be tedious to execute, and the latter could involve an expenditure of time and money.
- Another use case could be a single cluster that holds the pods for the environments below, with associated priorities:
  - Production (prod): top priority
  - Preproduction (preprod): intermediate priority
  - Development (dev): least priority
In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn’t have any special information about which Pods to evict and which to keep.
- A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the queue or database can serve traffic again.
There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.
PriorityClasses in Kubernetes
PriorityClass is a cluster-wide API object in Kubernetes and part of the scheduling.k8s.io/v1 API group. It contains a mapping of the PriorityClass name (defined in .metadata.name) to an integer value (defined in .value). This represents the value that the scheduler uses to determine a Pod's relative priority.
Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.
This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.
$ kubectl get priorityclass
NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m
The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.
Pod scheduling and preemption
Pod priority and preemption
Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler tries to evict lower-priority pods over high-priority ones.
Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.
PriorityClass requirements
Before you set up PriorityClasses, there are a few things to consider.
- Decide which PriorityClasses are needed. For instance, based on environment, type of pods, type of applications, etc.
- Decide on a default PriorityClass for your cluster. Pods without a priorityClassName will be treated as priority 0.
- Use a consistent naming convention for all PriorityClasses.
- Make sure that the pods for your workloads are running with the right PriorityClass.
PriorityClass hands-on example
Let’s say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.
---
# development
apiVersion: v1
kind: Pod
metadata:
  name: dev-nginx
  labels:
    env: dev
spec:
  containers:
  - name: dev-nginx
    image: nginx
    resources:
      requests:
        memory: "256Mi"
        cpu: "0.2"
      limits:
        memory: ".5Gi"
        cpu: "0.5"
---
# preproduction
apiVersion: v1
kind: Pod
metadata:
  name: preprod-nginx
  labels:
    env: preprod
spec:
  containers:
  - name: preprod-nginx
    image: nginx
    resources:
      requests:
        memory: "1.5Gi"
        cpu: "1.5"
      limits:
        memory: "2Gi"
        cpu: "2"
---
# production
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"
You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:

$ kubectl get pods --show-labels
NAME            READY   STATUS    RESTARTS   AGE   LABELS
dev-nginx       1/1     Running   0          55s   env=dev
preprod-nginx   1/1     Running   0          55s   env=preprod
prod-nginx      0/1     Pending   0          55s   env=prod
Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.
Let's see why this is happening:
$ kubectl get events
...
...
5s   Warning   FailedScheduling   pod/prod-nginx   0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
In this example, there is only one worker node, and that node has a resource crunch.
Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.
PriorityClass API
Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: PRIORITYCLASS_NAME
value: 0 # any integer value between -1000000000 and 1000000000
description: >-
  (Optional) description goes here!
globalDefault: false # or true. Only one PriorityClass can be the global default.
Below are some prerequisites for PriorityClasses:
- The name of a PriorityClass must be a valid DNS subdomain name.
- When you make your own PriorityClass, the name should not start with system-, as those names are reserved by Kubernetes itself (for example, they are used for two built-in PriorityClasses).
- Its value should be between -1000000000 and 1000000000 (1 billion).
- Larger numbers are reserved for built-in PriorityClasses such as system-cluster-critical (this Pod is critically important to the cluster) and system-node-critical (the node critically relies on this Pod). system-node-critical is a higher priority than system-cluster-critical, because a cluster-critical Pod can only work well if the node where it is running has all its node-level critical requirements met.
- There are two optional fields:
  - globalDefault: When true, this PriorityClass is used for pods where a priorityClassName is not specified. Only one PriorityClass with globalDefault set to true can exist in a cluster. If there is no PriorityClass defined with globalDefault set to true, all the pods with no priorityClassName defined will be treated with 0 priority (i.e. the least priority).
  - description: A string with a meaningful value so that people know when to use this PriorityClass.
Note: Adding a PriorityClass with globalDefault set to true does not mean it will apply to existing pods that are already running. It applies only to pods created after the PriorityClass was created.
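As an illustration of the globalDefault field described above, here is a minimal sketch of a PriorityClass acting as the cluster-wide default; the name and value are made up for this example and are not part of the original walkthrough:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: default-pc          # hypothetical name
value: 1000                 # a modest value for ordinary workloads
globalDefault: true         # only one PriorityClass in the cluster may set this to true
description: >-
  Assumed example: default priority for pods that do not specify a priorityClassName.

As the note above says, such a default only affects pods created after the PriorityClass exists.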
PriorityClass in action
Next, create some environment-specific PriorityClasses:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: dev-pc
value: 1000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all development pods.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: preprod-pc
value: 2000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all preprod pods.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-pc
value: 4000000
globalDefault: false
description: >-
  (Optional) This priority class should only be used for all prod pods.
Use the kubectl create -f <FILE.YAML> command to create the PriorityClasses and kubectl get pc to check their status:

$ kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
dev-pc                    1000000      false            3m13s
preprod-pc                2000000      false            2m3s
prod-pc                   4000000      false            7s
system-cluster-critical   2000000000   false            82m
system-node-critical      2000001000   false            82m
The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).
First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.
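For reference, the updated production manifest could look like the sketch below; it is simply the earlier prod-nginx manifest with a priorityClassName added, so adjust it to match your own files:

---
# production (updated with a PriorityClass)
apiVersion: v1
kind: Pod
metadata:
  name: prod-nginx
  labels:
    env: prod
spec:
  priorityClassName: prod-pc   # the PriorityClass created above
  containers:
  - name: prod-nginx
    image: nginx
    resources:
      requests:
        memory: "2Gi"
        cpu: "2"
      limits:
        memory: "2Gi"
        cpu: "2"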
In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:
$ kubectl get pods --show-labels
NAME            READY   STATUS        RESTARTS   AGE   LABELS
dev-nginx       1/1     Terminating   0          55s   env=dev
preprod-nginx   1/1     Running       0          55s   env=preprod
prod-nginx      0/1     Pending       0          55s   env=prod
The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:
Warning   FailedScheduling   pod/prod-nginx   0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
Normal    Preempted          pod/dev-nginx    by default/prod-nginx on node node01
Normal    Killing            pod/dev-nginx    Stopping container dev-nginx
Normal    Scheduled          pod/prod-nginx   Successfully assigned default/prod-nginx to node01
Normal    Pulling            pod/prod-nginx   Pulling image "nginx"
Normal    Pulled             pod/prod-nginx   Successfully pulled image "nginx"
Normal    Created            pod/prod-nginx   Created container prod-nginx
Normal    Started            pod/prod-nginx   Started container prod-nginx
Enforcement
When you set up PriorityClasses, they exist just how you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. However, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.
As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.
However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.
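To make that idea a little more concrete, here is a rough sketch of such a policy. It is not taken from the original article: it assumes the alpha admissionregistration.k8s.io/v1alpha1 API available in Kubernetes 1.26 with the ValidatingAdmissionPolicy feature enabled, and the object names are made up, so treat it as a starting point to test in your own cluster.

apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-prod-priorityclass        # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: [""]
      apiVersions: ["v1"]
      operations: ["CREATE"]
      resources: ["pods"]
  validations:
  - expression: "object.spec.priorityClassName == 'prod-pc'"
    message: "Pods in this namespace must use the prod-pc PriorityClass."
---
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-prod-priorityclass-binding   # hypothetical name
spec:
  policyName: require-prod-priorityclass
  matchResources:
    namespaceSelector:
      matchLabels:
        kubernetes.io/metadata.name: prod

A second binding scoped to the preprod namespace (paired with a policy that checks for preprod-pc) would cover the other environment in the same way.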
Summary
The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.
It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get into a situation where the autoscaler has been preempted and there are no new nodes coming online.
If you have any queries or feedback, feel free to reach out to me on LinkedIn.
Blog: Kubernetes 1.26: Eviction policy for unhealthy pods guarded by PodDisruptionBudgets
Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)
Ensuring that disruptions to your application do not affect its availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.
What problems does this solve?
API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction of a Pod should not disrupt a guarded application, and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but eviction of these is only possible in case the application is not disrupted. This helps disrupted or not-yet-started applications to achieve availability as soon as possible without additional downtime that would be caused by evictions.
Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual intervention. Misbehaving applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration) or pods that are simply failing to become ready make this task much harder. Any eviction request will fail due to violation of a PDB when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case (see the drain example after the list below).
On the other hand, there are users that depend on the existing behavior, in order to:
- prevent data-loss that would be caused by deleting pods that are guarding an underlying resource or storage
- achieve the best availability possible for their application
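Here is roughly what the blocked drain described above looks like in practice. The node and pod names are placeholders, and the exact wording of the messages can differ between kubectl and Kubernetes versions, so treat this as an illustration only:

$ kubectl drain node-1 --ignore-daemonsets --delete-emptydir-data
evicting pod default/guarded-app-7c5f8d9b4-abcde
error when evicting pods/"guarded-app-7c5f8d9b4-abcde" -n "default" (will retry after 5s):
Cannot evict pod as it would violate the pod's disruption budget.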
Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.
How does it work?
API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using a kubectl drain command, or by other actors in the cluster. During this process, every pod removal is checked against the appropriate PDBs, to ensure that a sufficient number of pods is always running in the cluster.
The following policies allow PDB authors to have greater control over how the process deals with unhealthy pods.
There are two policies, IfHealthyBudget and AlwaysAllow, to choose from.
The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has a minimum available .status.desiredHealthy number of pods.
By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.
We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.
How do I use it?
This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.
Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nginx-pdb
spec:
  selector:
    matchLabels:
      app: nginx
  maxUnavailable: 1
  unhealthyPodEvictionPolicy: AlwaysAllow
How can I learn more?
- Read the KEP: Unhealthy Pod Eviction Policy for PDBs
- Read the documentation: Unhealthy Pod Eviction Policy for PodDisruptionBudgets
- Review the Kubernetes documentation for PodDisruptionBudgets, draining of Nodes and evictions
How do I get involved?
If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com
Blog: Kubernetes 1.26: Retroactive Default StorageClass
Author: Roman Bednář (Red Hat)
The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass was assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature was graduated to beta in Kubernetes 1.26.
You can read about retroactive default StorageClass assignment in the Kubernetes documentation for more details about how to use it, or you can read on to learn about why the Kubernetes project is making this change.
Why did StorageClass assignment need improvements
Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.
But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.
Changing default StorageClass
With the alpha feature enabled, there were two options admins had when they wanted to change the default StorageClass:
- Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.
- Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.
Resource ordering during cluster installation
If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.
What changed
We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.
Null storageClassName versus storageClassName: "" - does it matter?
Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".
With this new feature enabled we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null would translate to "Give me a default" and "" would mean "Give me a PersistentVolume that also has "" as its StorageClass name." In the absence of a default StorageClass, the behavior would remain unchanged.
Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.
The table below shows all these cases to better describe when a PVC binds and when its StorageClass gets updated.
PVC binding behavior with Retroactive default StorageClass:

                                  PVC storageClassName = ""   PVC storageClassName = null
Without default class
  PV storageClassName = ""        binds                       binds
  PV without storageClassName     binds                       binds
With default class
  PV storageClassName = ""        binds                       class updates
  PV without storageClassName     binds                       class updates

How to use it
If you want to test the feature whilst it's alpha, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:

--feature-gates="...,RetroactiveDefaultStorageClass=true"
Test drive
If you would like to see the feature in action and verify that it works fine in your cluster, here's what you can try:
- Define a basic PersistentVolumeClaim:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: pvc-1
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 1Gi
- Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.

  $ kc get pvc
  NAME    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
  pvc-1   Pending
- Configure one StorageClass as default.

  $ kc patch sc my-storageclass -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
  storageclass.storage.k8s.io/my-storageclass patched
- Verify that the PersistentVolumeClaim is now provisioned correctly and was updated retroactively with the new default StorageClass.

  $ kc get pvc
  NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
  pvc-1   Bound    pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8   1Gi        RWO            my-storageclass   87m
New metrics
To help you see that the feature is working as expected, we also introduced a new retroactive_storageclass_total metric to show how many times the PV controller attempted to update a PersistentVolumeClaim, and retroactive_storageclass_errors_total to show how many of those attempts failed.
Getting involved
We always welcome new contributors, so if you would like to get involved, you can join our Kubernetes Storage Special Interest Group (SIG).
If you would like to share feedback, you can do so on our public Slack channel.
Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):
- Deep Debroy (ddebroy)
- Divya Mohan (divya-mohan0209)
- Jan Šafránek (jsafrane)
- Joe Betz (jpbetz)
- Jordan Liggitt (liggitt)
- Michelle Au (msau42)
- Seokho Son (seokho-son)
- Shannon Kularathna (shannonxtreme)
- Tim Bannister (sftim)
- Tim Hockin (thockin)
- Wojciech Tyczynski (wojtek-t)
- Xing Yang (xing-yang)
Blog: Kubernetes v1.26: Alpha support for cross-namespace storage data sources
Author: Takafumi Takahashi (Hitachi Vantara)
Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even where the source data belongs to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace.
Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source in the same namespace. However, that only worked for a data source in the same namespace, so users couldn't provision a PersistentVolume with a claim in one namespace from a data source in another namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to the dataSourceRef field in the PersistentVolumeClaim API.
How it works
Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all reference grants within the namespace that's specified by the .spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.
The following things are required to use cross namespace volume provisioning:
- Enable the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates for the kube-apiserver and kube-controller-manager
- Install a CRD for the specific VolumeSnapshot controller
- Install the CSI Provisioner controller and enable the CrossNamespaceVolumeDataSource feature gate
- Install the CSI driver
- Install a CRD for ReferenceGrants
Putting it all together
To see how this works, you can install the sample and try it out. The sample creates a PVC in the dev namespace from a VolumeSnapshot in the prod namespace. That is a simple example; for real-world use, you might want to use a more complex approach.
Assumptions for this example
- Your Kubernetes cluster was deployed with the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates enabled
- There are two namespaces, dev and prod
- CSI driver is being deployed
- There is an existing VolumeSnapshot named new-snapshot-demo in the prod namespace
- The ReferenceGrant CRD (from the Gateway API project) is already deployed
Grant ReferenceGrants read permission to the CSI Provisioner
Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).

- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["referencegrants"]
  verbs: ["get", "list", "watch"]
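If your deployment does not already carry that rule, here is a hedged sketch of how it could be wired up with standard RBAC objects. The ClusterRole and ClusterRoleBinding names, the ServiceAccount name, and the namespace are assumptions for illustration; match them to however your CSI driver and external-provisioner are actually deployed.

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: csi-provisioner-referencegrants   # hypothetical name
rules:
- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["referencegrants"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: csi-provisioner-referencegrants   # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: csi-provisioner-referencegrants
subjects:
- kind: ServiceAccount
  name: csi-provisioner       # assumed ServiceAccount used by the external-provisioner
  namespace: kube-system      # assumed namespace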
Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner
Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:

- args:
  - -v=5
  - --csi-address=/csi/csi.sock
  - --feature-gates=Topology=true
  - --feature-gates=CrossNamespaceVolumeDataSource=true
  image: csi-provisioner:latest
  imagePullPolicy: IfNotPresent
  name: csi-provisioner
Create a ReferenceGrant
Here's a manifest for an example ReferenceGrant.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-prod-pvc
  namespace: prod
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: dev
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo
Create a PersistentVolumeClaim by using cross namespace data source
Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
  namespace: dev
spec:
  storageClassName: example
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: new-snapshot-demo
    namespace: prod
  volumeMode: Filesystem
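After applying the ReferenceGrant and the PersistentVolumeClaim above, you can watch the claim in the dev namespace to confirm that it provisions and binds (the output will vary by cluster and CSI driver):

$ kubectl -n dev get pvc example-pvc --watch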
How can I learn more?
The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.
Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!
Acknowledgments
It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSource feature:
- Michelle Au (msau42)
- Xing Yang (xing-yang)
- Masaki Kimura (mkimuram)
- Tim Hockin (thockin)
- Ben Swartzlander (bswartz)
- Rob Scott (robscott)
- John Griffith (j-griffith)
- Michael Henriksen (mhenriks)
- Mustafa Elbehery (Elbehery)
It’s been a joy to work with y'all on this.