Kubernetes News

The Kubernetes project blog
Kubernetes.io
  1. Author: David Hadas (IBM Research Labs)

    This post warns DevOps practitioners against a false sense of security. Following security best practices when developing and configuring microservices does not result in non-vulnerable microservices. The post shows that although all deployed microservices are vulnerable, there is much that can be done to ensure microservices are not exploited. It explains how analyzing the behavior of clients and services from a security standpoint, termed here "Security-Behavior Analysis", can protect deployed, vulnerable microservices. It points to Guard, an open source project offering security-behavior monitoring and control of Kubernetes microservices that are presumed vulnerable.

    As cyber attacks continue to intensify in sophistication, organizations deploying cloud services continue to grow their cyber investments aiming to produce safe and non-vulnerable services. However, the year-by-year growth in cyber investments does not result in a parallel reduction in cyber incidents. Instead, the number of cyber incidents continues to grow annually. Evidently, organizations are doomed to fail in this struggle - no matter how much effort is made to detect and remove cyber weaknesses from deployed services, it seems offenders always have the upper hand.

    Considering the current spread of offensive tools, sophistication of offensive players, and ever-growing cyber financial gains to offenders, any cyber strategy that relies on constructing a non-vulnerable, weakness-free service in 2023 is clearly too naïve. It seems the only viable strategy is to:

    Admit that your services are vulnerable!

    In other words, consciously accept that you will never create completely invulnerable services. If your opponents find even a single weakness as an entry-point, you lose! Admitting that in spite of your best efforts, all your services are still vulnerable is an important first step. Next, this post discusses what you can do about it...

    How to protect microservices from being exploited

    Being vulnerable does not necessarily mean that your service will be exploited. Though your services are vulnerable in ways unknown to you, offenders still need to identify these vulnerabilities and then exploit them. If offenders fail to exploit your service vulnerabilities, you win! In other words, having a vulnerability that can’t be exploited represents a risk that can’t be realized.


    Figure 1. An Offender gaining foothold in a vulnerable service

    The above diagram shows an example in which the offender does not yet have a foothold in the service; that is, it is assumed that your service does not run code controlled by the offender on day 1. In our example the service has vulnerabilities in the API exposed to clients. To gain an initial foothold the offender uses a malicious client to try and exploit one of the service API vulnerabilities. The malicious client sends an exploit that triggers some unplanned behavior of the service.

    More specifically, let’s assume the service is vulnerable to an SQL injection. The developer failed to sanitize the user input properly, thereby allowing clients to send values that would change the intended behavior. In our example, if a client sends a query string with key “username” and value of “tom or 1=1”, the client will receive the data of all users. Exploiting this vulnerability requires the client to send an irregular string as the value. Note that benign users will not send a string with spaces or with the equal-sign character as a username; instead, they will normally send legal usernames, which may for example be defined as a short sequence of the characters a-z. No legal username can trigger unplanned service behavior.

    In this simple example, one can already identify several opportunities to detect and block an attempt to exploit the vulnerability (un)intentionally left behind by the developer, making the vulnerability unexploitable. First, the malicious client behavior differs from the behavior of benign clients, as it sends irregular requests. If such a change in behavior is detected and blocked, the exploit will never reach the service. Second, the service behavior in response to the exploit differs from the service behavior in response to a regular request. Such behavior may include making subsequent irregular calls to other services such as a data store, taking irregular time to respond, and/or responding to the malicious client with an irregular response (for example, containing much more data than normally sent in case of benign clients making regular requests). Service behavioral changes, if detected, will also allow blocking the exploit in different stages of the exploitation attempt.

    More generally:

    • Monitoring the behavior of clients can help detect and block exploits against service API vulnerabilities. In fact, deploying efficient client behavior monitoring makes many vulnerabilities unexploitable and others very hard to exploit. To succeed, the offender needs to craft an exploit that is indistinguishable from regular requests.

    • Monitoring the behavior of services can help detect services as they are being exploited, regardless of the attack vector used. Efficient service behavior monitoring limits what an attacker can achieve, as the offender needs to ensure the exploited service's behavior is indistinguishable from regular service behavior.

    Combining both approaches may add a protection layer to the deployed vulnerable services, drastically decreasing the probability for anyone to successfully exploit any of the deployed vulnerable services. Next, let us identify four use cases where you need to use security-behavior monitoring.

    Use cases

    One can identify the following four different stages in the life of any service from a security standpoint. In each stage, security-behavior monitoring is required to meet different challenges:

    • Normal - no known vulnerabilities: The service owner is normally not aware of any known vulnerabilities in the service image or configuration. Yet, it is reasonable to assume that the service has weaknesses. What you need: generic protection against any unknown, zero-day service vulnerability - detect/block irregular patterns sent as part of incoming client requests that may be used as exploits.

    • Vulnerable - an applicable CVE is published: The service owner is required to release a new, non-vulnerable revision of the service. Research shows that, in practice, this process of removing a known vulnerability may take many weeks to accomplish (2 months on average). What you need: protection based on the CVE analysis - detect/block incoming requests that include specific patterns that may be used to exploit the discovered vulnerability - so you can continue to offer the service although it has a known vulnerability.

    • Exploitable - a known exploit is published: The service owner needs a way to filter incoming requests that contain the known exploit. What you need: protection based on a known exploit signature - detect/block incoming client requests that carry signatures identifying the exploit - so you can continue to offer the service despite the presence of a known exploit.

    • Misused - an offender misuses pods backing the service: The offender can follow an attack pattern enabling them to misuse pods. The service owner needs to restart any compromised pods while using non-compromised pods to continue offering the service. Note that once a pod is restarted, the offender needs to repeat the attack pattern before being able to misuse it again. What you need: to identify and restart instances of the component that is being misused - at any given time, some backing pods may be compromised and misused, while others behave as designed; detect/remove the misused pods while allowing other pods to continue servicing client requests.

    Fortunately, microservice architecture is well suited to security-behavior monitoring as discussed next.

    Security-Behavior of microservices versus monoliths

    Kubernetes is often used to support workloads designed with microservice architecture. By design, microservices aim to follow the UNIX philosophy of "Do One Thing And Do It Well". Each microservice has a bounded context and a clear interface. In other words, you can expect the microservice clients to send relatively regular requests and the microservice to present a relatively regular behavior as a response to these requests. Consequently, a microservice architecture is an excellent candidate for security-behavior monitoring.


    Figure 2. Microservices are well suited for security-behavior monitoring

    The diagram above clarifies how dividing a monolithic service into a set of microservices improves our ability to perform security-behavior monitoring and control. In a monolithic service approach, different client requests are intertwined, resulting in a diminished ability to identify irregular client behaviors. Without prior knowledge, an observer of the intertwined client requests will find it hard to distinguish between types of requests and their related characteristics. Further, internal client requests are not exposed to the observer. Lastly, the aggregated behavior of the monolithic service is a compound of the many different internal behaviors of its components, making it hard to identify irregular service behavior.

    In a microservice environment, each microservice is expected by design to offer a more well-defined service and to serve better-defined types of requests. This makes it easier for an observer to identify irregular client behavior and irregular service behavior. Further, a microservice design exposes internal requests and internal services, offering an observer more security-behavior data with which to identify irregularities. Overall, this makes the microservice design pattern better suited for security-behavior monitoring and control.

    Security-Behavior monitoring on Kubernetes

    Kubernetes deployments seeking to add security-behavior monitoring and control may use Guard, developed under the CNCF Knative project. Guard is integrated into the full Knative automation suite that runs on top of Kubernetes. Alternatively, you can deploy Guard as a standalone tool to protect any HTTP-based workload on Kubernetes.

    See:

    • Guard on Github, for using Guard as a standalone tool.
    • The Knative automation suite - Read about Knative in the blog post Opinionated Kubernetes, which describes how Knative simplifies and unifies the way web services are deployed on Kubernetes.
    • You may contact Guard maintainers on the SIG Security Slack channel or on the Knative community security Slack channel. The Knative community channel will move soon to the CNCF Slack under the name #knative-security.

    The goal of this post is to invite the Kubernetes community to action and to introduce Security-Behavior monitoring and control to help secure Kubernetes-based deployments. Hopefully, as a follow-up, the community will:

    1. Analyze the cyber challenges presented for different Kubernetes use cases
    2. Add appropriate security documentation for users on how to introduce Security-Behavior monitoring and control.
    3. Consider how to integrate with tools that can help users monitor and control their vulnerable services.

    Getting involved

    You are welcome to get involved and join the effort to develop security behavior monitoring and control for Kubernetes; to share feedback and contribute to code or documentation; and to make or suggest improvements of any kind.

  2. Author: Sunny Bhambhani (InfraCloud Technologies)

    Kubernetes has been widely adopted, and many organizations use it as their de-facto orchestration engine for running workloads that need to be created and deleted frequently.

    Therefore, proper scheduling of the pods is key to ensuring that application pods are up and running within the Kubernetes cluster without any issues. This article delves into the use cases around resource management by leveraging the PriorityClass object to protect mission-critical or high-priority pods from getting evicted and making sure that the application pods are up, running, and serving traffic.

    Resource management in Kubernetes

    The control plane consists of multiple components; the scheduler (usually the built-in kube-scheduler) is the component responsible for assigning a node to a pod.

    Whenever a pod is created, it enters a "pending" state, after which the scheduler determines which node is best suited for the placement of the new pod.

    In the background, the scheduler runs as an infinite loop looking for pods without a nodeName set that are ready for scheduling. For each Pod that needs scheduling, the scheduler tries to decide which node should run that Pod.

    If the scheduler cannot find any node, the pod remains in the pending state, which is not ideal.

    The diagram below, from points 1 through 4, explains the request flow:

    A diagram showing the scheduling of three Pods that a client has directly created.

    Scheduling in Kubernetes

    Typical use cases

    Below are some real-life scenarios where control over the scheduling and eviction of pods may be required.

    1. Let's say the pod you plan to deploy is critical, and you have some resource constraints. An example would be the DaemonSet of an infrastructure component like Grafana Loki. The Loki pods must run on every node before other pods can. In such cases, you could ensure resource availability by manually identifying and deleting the pods that are not required, or by adding a new node to the cluster. Both of these approaches are unsuitable: the former would be tedious to execute, and the latter could involve an expenditure of time and money.

    2. Another use case could be a single cluster that holds the pods for the below environments with associated priorities:

      • Production (prod): top priority
      • Preproduction (preprod): intermediate priority
      • Development (dev): least priority

    In the event of high resource consumption in the cluster, there is competition for CPU and memory resources on the nodes. While cluster-level autoscaling may add more nodes, it takes time. In the interim, if there are no further nodes to scale the cluster, some Pods could remain in a Pending state, or the service could be degraded as they compete for resources. If the kubelet does evict a Pod from the node, that eviction would be random because the kubelet doesn’t have any special information about which Pods to evict and which to keep.

    3. A third example could be a microservice backed by a queuing application or a database running into a resource crunch and the queue or database getting evicted. In such a case, all the other services would be rendered useless until the database can serve traffic again.

    There can also be other scenarios where you want to control the order of scheduling or order of eviction of pods.

    PriorityClasses in Kubernetes

    PriorityClass is a cluster-wide API object in Kubernetes, part of the scheduling.k8s.io/v1 API group. It maps a PriorityClass name (defined in .metadata.name) to an integer value (defined in .value), which the scheduler uses to determine a Pod's relative priority.

    Additionally, when you create a cluster using kubeadm or a managed Kubernetes service (for example, Azure Kubernetes Service), Kubernetes uses PriorityClasses to safeguard the pods that are hosted on the control plane nodes. This ensures that critical cluster components such as CoreDNS and kube-proxy can run even if resources are constrained.

    This availability of pods is achieved through the use of a special PriorityClass that ensures the pods are up and running and that the overall cluster is not affected.

    $ kubectl get priorityclass
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    system-cluster-critical   2000000000   false            82m
    system-node-critical      2000001000   false            82m
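
    For example, control-plane components installed by kubeadm reference these built-in classes through the priorityClassName field in their pod templates. The following is a minimal, illustrative sketch (not an exact copy of the kube-proxy manifest) assuming a kubeadm-style kube-proxy DaemonSet:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: kube-proxy
      namespace: kube-system
    spec:
      selector:
        matchLabels:
          k8s-app: kube-proxy
      template:
        metadata:
          labels:
            k8s-app: kube-proxy
        spec:
          priorityClassName: system-node-critical # built-in PriorityClass with value 2000001000
          containers:
          - name: kube-proxy
            image: registry.k8s.io/kube-proxy:v1.26.0 # illustrative image tag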
    

    The diagram below shows exactly how it works with the help of an example, which will be detailed in the upcoming section.

    A flow chart that illustrates how the kube-scheduler prioritizes new Pods and potentially preempts existing Pods

    Pod scheduling and preemption

    Pod priority and preemption

    Pod preemption is a Kubernetes feature that allows the cluster to preempt pods (removing an existing Pod in favor of a new Pod) on the basis of priority. Pod priority indicates the importance of a pod relative to other pods while scheduling. If there aren't enough resources to run all the current pods, the scheduler evicts lower-priority pods rather than higher-priority ones.

    Also, when a healthy cluster experiences a node failure, typically, lower-priority pods get preempted to create room for higher-priority pods on the available node. This happens even if the cluster can bring up a new node automatically since pod creation is usually much faster than bringing up a new node.

    PriorityClass requirements

    Before you set up PriorityClasses, there are a few things to consider.

    1. Decide which PriorityClasses are needed, for instance, based on environment, type of pods, type of applications, etc.
    2. Decide whether your cluster needs a default PriorityClass; pods without a priorityClassName are treated as priority 0 (see the sketch after this list).
    3. Use a consistent naming convention for all PriorityClasses.
    4. Make sure that the pods for your workloads are running with the right PriorityClass.
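
    For instance, a cluster-wide default could be declared with a manifest like the sketch below; the name default-pc and its value are assumptions made for illustration:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: default-pc # illustrative name
    value: 1000
    globalDefault: true # pods created without a priorityClassName get this class
    description: >-
      Cluster-wide default priority for pods that do not request a specific PriorityClass.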

    PriorityClass hands-on example

    Let’s say there are 3 application pods: one for prod, one for preprod, and one for development. Below are three sample YAML manifest files for each of those.

    ---
    # development
    apiVersion: v1
    kind: Pod
    metadata:
      name: dev-nginx
      labels:
        env: dev
    spec:
      containers:
      - name: dev-nginx
        image: nginx
        resources:
          requests:
            memory: "256Mi"
            cpu: "0.2"
          limits:
            memory: ".5Gi"
            cpu: "0.5"

    ---
    # preproduction
    apiVersion: v1
    kind: Pod
    metadata:
      name: preprod-nginx
      labels:
        env: preprod
    spec:
      containers:
      - name: preprod-nginx
        image: nginx
        resources:
          requests:
            memory: "1.5Gi"
            cpu: "1.5"
          limits:
            memory: "2Gi"
            cpu: "2"

    ---
    # production
    apiVersion: v1
    kind: Pod
    metadata:
      name: prod-nginx
      labels:
        env: prod
    spec:
      containers:
      - name: prod-nginx
        image: nginx
        resources:
          requests:
            memory: "2Gi"
            cpu: "2"
          limits:
            memory: "2Gi"
            cpu: "2"
    

    You can create these pods with the kubectl create -f <FILE.yaml> command, and then check their status using the kubectl get pods command. You can see if they are up and look ready to serve traffic:

    $ kubectl get pods --show-labels
    NAME            READY   STATUS    RESTARTS   AGE   LABELS
    dev-nginx       1/1     Running   0          55s   env=dev
    preprod-nginx   1/1     Running   0          55s   env=preprod
    prod-nginx      0/1     Pending   0          55s   env=prod
    

    Bad news. The pod for the Production environment is still Pending and isn't serving any traffic.

    Let's see why this is happening:

    $ kubectl get events
    ...
    ...
    5s Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
    

    In this example, there is only one worker node, and that node has a resource crunch.

    Now, let's look at how PriorityClass can help in this situation since prod should be given higher priority than the other environments.

    PriorityClass API

    Before creating PriorityClasses based on these requirements, let's see what a basic manifest for a PriorityClass looks like and outline some prerequisites:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: PRIORITYCLASS_NAME
    value: 0 # any integer value between -1000000000 and 1000000000
    description: >-
      (Optional) description goes here!
    globalDefault: false # or true. Only one PriorityClass can be the global default.
    

    Below are some prerequisites for PriorityClasses:

    • The name of a PriorityClass must be a valid DNS subdomain name.
    • When you make your own PriorityClass, the name should not start with system-, as those names are reserved by Kubernetes itself (for example, they are used for two built-in PriorityClasses).
    • Its value should be between -1000000000 and 1000000000 (1 billion).
    • Larger numbers are reserved for built-in PriorityClasses such as system-cluster-critical (this Pod is critically important to the cluster) and system-node-critical (the node critically relies on this Pod). system-node-critical is a higher priority than system-cluster-critical, because a cluster-critical Pod can only work well if the node where it is running has all its node-level critical requirements met.
    • There are two optional fields:
      • globalDefault: When true, this PriorityClass is used for pods where a priorityClassName is not specified. Only one PriorityClass with globalDefault set to true can exist in a cluster.
        If there is no PriorityClass defined with globalDefault set to true, all the pods with no priorityClassName defined will be treated with 0 priority (i.e. the least priority).
      • description: A string with a meaningful value so that people know when to use this PriorityClass.

    PriorityClass in action

    Here's an example. Next, create some environment-specific PriorityClasses:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: dev-pc
    value: 1000000
    globalDefault: false
    description: >-
      (Optional) This priority class should only be used for all development pods.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: preprod-pc
    value: 2000000
    globalDefault: false
    description: >-
      (Optional) This priority class should only be used for all preprod pods.

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: prod-pc
    value: 4000000
    globalDefault: false
    description: >-
      (Optional) This priority class should only be used for all prod pods.

    Use the kubectl create -f <FILE.YAML> command to create the PriorityClasses and kubectl get pc to check their status.

    $ kubectl get pc
    NAME                      VALUE        GLOBAL-DEFAULT   AGE
    dev-pc                    1000000      false            3m13s
    preprod-pc                2000000      false            2m3s
    prod-pc                   4000000      false            7s
    system-cluster-critical   2000000000   false            82m
    system-node-critical      2000001000   false            82m
    

    The new PriorityClasses are in place now. A small change is needed in the pod manifest or pod template (in a ReplicaSet or Deployment). In other words, you need to specify the priority class name at .spec.priorityClassName (which is a string value).

    First update the previous production pod manifest file to have a PriorityClass assigned, then delete the Production pod and recreate it. You can't edit the priority class for a Pod that already exists.
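
    For example, based on the earlier production manifest, the updated Pod could look like the following; only the priorityClassName line is new, referencing the prod-pc class created above:

    ---
    # production
    apiVersion: v1
    kind: Pod
    metadata:
      name: prod-nginx
      labels:
        env: prod
    spec:
      priorityClassName: prod-pc # newly added: use the prod PriorityClass
      containers:
      - name: prod-nginx
        image: nginx
        resources:
          requests:
            memory: "2Gi"
            cpu: "2"
          limits:
            memory: "2Gi"
            cpu: "2"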

    In my cluster, when I tried this, here's what happened. First, that change seems successful; the status of pods has been updated:

    $ kubectl get pods --show-labels
    NAME            READY   STATUS        RESTARTS   AGE   LABELS
    dev-nginx       1/1     Terminating   0          55s   env=dev
    preprod-nginx   1/1     Running       0          55s   env=preprod
    prod-nginx      0/1     Pending       0          55s   env=prod
    

    The dev-nginx pod is getting terminated. Once that is successfully terminated and there are enough resources for the prod pod, the control plane can schedule the prod pod:

    Warning FailedScheduling pod/prod-nginx 0/2 nodes are available: 1 Insufficient cpu, 2 Insufficient memory.
    Normal Preempted pod/dev-nginx by default/prod-nginx on node node01
    Normal Killing pod/dev-nginx Stopping container dev-nginx
    Normal Scheduled pod/prod-nginx Successfully assigned default/prod-nginx to node01
    Normal Pulling pod/prod-nginx Pulling image "nginx"
    Normal Pulled pod/prod-nginx Successfully pulled image "nginx"
    Normal Created pod/prod-nginx Created container prod-nginx
    Normal Started pod/prod-nginx Started container prod-nginx
    

    Enforcement

    When you set up PriorityClasses, they exist just as you defined them. However, people (and tools) that make changes to your cluster are free to set any PriorityClass, or to not set any PriorityClass at all. Fortunately, you can use other Kubernetes features to make sure that the priorities you wanted are actually applied.

    As an alpha feature, you can define a ValidatingAdmissionPolicy and a ValidatingAdmissionPolicyBinding so that, for example, Pods that go into the prod namespace must use the prod-pc PriorityClass. With another ValidatingAdmissionPolicyBinding you ensure that the preprod namespace uses the preprod-pc PriorityClass, and so on. In any cluster, you can enforce similar controls using external projects such as Kyverno or Gatekeeper, through validating admission webhooks.
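
    As a rough sketch only, using the v1alpha1 API available in Kubernetes 1.26 (the resource names are made up for illustration, and exact fields may differ in later versions), such a policy and binding could look like this:

    apiVersion: admissionregistration.k8s.io/v1alpha1
    kind: ValidatingAdmissionPolicy
    metadata:
      name: require-prod-priorityclass # illustrative name
    spec:
      failurePolicy: Fail
      matchConstraints:
        resourceRules:
        - apiGroups: [""]
          apiVersions: ["v1"]
          operations: ["CREATE"]
          resources: ["pods"]
      validations:
      - expression: "has(object.spec.priorityClassName) && object.spec.priorityClassName == 'prod-pc'"
        message: "Pods in the prod namespace must use the prod-pc PriorityClass."
    ---
    apiVersion: admissionregistration.k8s.io/v1alpha1
    kind: ValidatingAdmissionPolicyBinding
    metadata:
      name: require-prod-priorityclass-binding # illustrative name
    spec:
      policyName: require-prod-priorityclass
      matchResources:
        namespaceSelector:
          matchLabels:
            kubernetes.io/metadata.name: prod # restrict the policy to the prod namespace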

    However you do it, Kubernetes gives you options to make sure that the PriorityClasses are used how you wanted them to be, or perhaps just to warn users when they pick an unsuitable option.

    Summary

    The above example and its events show you what this feature of Kubernetes brings to the table, along with several scenarios where you can use this feature. To reiterate, this helps ensure that mission-critical pods are up and available to serve the traffic and, in the case of a resource crunch, determines cluster behavior.

    It gives you some power to decide the order of scheduling and order of preemption for Pods. Therefore, you need to define the PriorityClasses sensibly. For example, if you have a cluster autoscaler to add nodes on demand, make sure to run it with the system-cluster-critical PriorityClass. You don't want to get in a situation where the autoscaler has been preempted and there are no new nodes coming online.

    If you have any queries or feedback, feel free to reach out to me on LinkedIn.

  3. Authors: Filip Křepinský (Red Hat), Morten Torkildsen (Google), Ravi Gudimetla (Apple)

    Ensuring that disruptions to your applications do not affect their availability isn't a simple task. Last month's release of Kubernetes v1.26 lets you specify an unhealthy pod eviction policy for PodDisruptionBudgets (PDBs) to help you maintain that availability during node management operations. In this article, we will dive deeper into what modifications were introduced for PDBs to give application owners greater flexibility in managing disruptions.

    What problems does this solve?

    API-initiated eviction of pods respects PodDisruptionBudgets (PDBs). This means that a requested voluntary disruption via an eviction of a Pod should not disrupt a guarded application, and .status.currentHealthy of a PDB should not fall below .status.desiredHealthy. Running pods that are Unhealthy do not count towards the PDB status, but evicting them is only possible when the application is not disrupted. This helps disrupted or not-yet-started applications achieve availability as soon as possible, without the additional downtime that evictions would cause.

    Unfortunately, this poses a problem for cluster administrators that would like to drain nodes without any manual intervention. Misbehaving applications with pods in CrashLoopBackOff state (due to a bug or misconfiguration), or pods that simply fail to become ready, make this task much harder. Any eviction request will fail due to violation of a PDB when all pods of an application are unhealthy. Draining of a node cannot make any progress in that case.

    On the other hand there are users that depend on the existing behavior, in order to:

    • prevent data-loss that would be caused by deleting pods that are guarding an underlying resource or storage
    • achieve the best availability possible for their application

    Kubernetes 1.26 introduced a new experimental field to the PodDisruptionBudget API: .spec.unhealthyPodEvictionPolicy. When enabled, this field lets you support both of those requirements.

    How does it work?

    API-initiated eviction is the process that triggers graceful pod termination. The process can be initiated either by calling the API directly, by using the kubectl drain command, or by other actors in the cluster. During this process, every pod removal is checked against the appropriate PDBs to ensure that a sufficient number of pods is always running in the cluster.
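
    For illustration, an API-initiated eviction is requested by POSTing an Eviction object to the Pod's eviction subresource; a minimal sketch (the Pod name and namespace are placeholders) looks like this:

    apiVersion: policy/v1
    kind: Eviction
    metadata:
      name: nginx-6d4cf56db6-abcde # placeholder: the Pod you want to evict
      namespace: default
    # POST this body to /api/v1/namespaces/default/pods/nginx-6d4cf56db6-abcde/eviction;
    # the API server consults the matching PDBs before allowing the eviction.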

    The following policies allow PDB authors to have greater control over how the process deals with unhealthy pods.

    There are two policies to choose from: IfHealthyBudget and AlwaysAllow.

    The former, IfHealthyBudget, follows the existing behavior to achieve the best availability that you get by default. Unhealthy pods can be disrupted only if their application has at least .status.desiredHealthy pods available.

    By setting the spec.unhealthyPodEvictionPolicy field of your PDB to AlwaysAllow, you are choosing the best effort availability for your application. With this policy it is always possible to evict unhealthy pods. This will make it easier to maintain and upgrade your clusters.

    We think that AlwaysAllow will often be a better choice, but for some critical workloads you may still prefer to protect even unhealthy Pods from node drains or other forms of API-initiated eviction.

    How do I use it?

    This is an alpha feature, which means you have to enable the PDBUnhealthyPodEvictionPolicy feature gate, with the command line argument --feature-gates=PDBUnhealthyPodEvictionPolicy=true to the kube-apiserver.

    Here's an example. Assume that you've enabled the feature gate in your cluster, and that you already defined a Deployment that runs a plain webserver. You labelled the Pods for that Deployment with app: nginx. You want to limit avoidable disruption, and you know that best effort availability is sufficient for this app. You decide to allow evictions even if those webserver pods are unhealthy. You create a PDB to guard this application, with the AlwaysAllow policy for evicting unhealthy pods:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: nginx-pdb
    spec:
      selector:
        matchLabels:
          app: nginx
      maxUnavailable: 1
      unhealthyPodEvictionPolicy: AlwaysAllow
    

    How can I learn more?

    How do I get involved?

    If you have any feedback, please reach out to us in the #sig-apps channel on Slack (visit https://slack.k8s.io/ for an invitation if you need one), or on the SIG Apps mailing list: kubernetes-sig-apps@googlegroups.com

  4. Author: Roman Bednář (Red Hat)

    The v1.25 release of Kubernetes introduced an alpha feature to change how a default StorageClass was assigned to a PersistentVolumeClaim (PVC). With the feature enabled, you no longer need to create a default StorageClass first and PVC second to assign the class. Additionally, any PVCs without a StorageClass assigned can be updated later. This feature was graduated to beta in Kubernetes 1.26.

    You can read retroactive default StorageClass assignment in the Kubernetes documentation for more details about how to use that, or you can read on to learn about why the Kubernetes project is making this change.

    Why did StorageClass assignment need improvements

    Users might already be familiar with a similar feature that assigns default StorageClasses to new PVCs at the time of creation. This is currently handled by the admission controller.

    But what if there wasn't a default StorageClass defined at the time of PVC creation? Users would end up with a PVC that would never be assigned a class. As a result, no storage would be provisioned, and the PVC would be somewhat "stuck" at this point. Generally, two main scenarios could result in "stuck" PVCs and cause problems later down the road. Let's take a closer look at each of them.

    Changing default StorageClass

    Before this feature was introduced, admins had two options when they wanted to change the default StorageClass:

    1. Creating a new StorageClass as default before removing the old one associated with the PVC. This would result in having two defaults for a short period. At this point, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the newest default StorageClass would be chosen and assigned to this PVC.

    2. Removing the old default first and creating a new default StorageClass. This would result in having no default for a short time. Subsequently, if a user were to create a PersistentVolumeClaim with storageClassName set to null (implying default StorageClass), the PVC would be in Pending state forever. The user would have to fix this by deleting the PVC and recreating it once the default StorageClass was available.
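
    Both options revolve around which StorageClass carries the default marker. For reference, a StorageClass is marked as the default through the storageclass.kubernetes.io/is-default-class annotation; a minimal sketch (the class name and provisioner are illustrative) looks like this:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: my-storageclass
      annotations:
        storageclass.kubernetes.io/is-default-class: "true" # marks this class as the cluster default
    provisioner: example.com/my-csi-driver # illustrative provisioner name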

    Resource ordering during cluster installation

    If a cluster installation tool needed to create resources that required storage, for example, an image registry, it was difficult to get the ordering right. This is because any Pods that required storage would rely on the presence of a default StorageClass and would fail to be created if it wasn't defined.

    What changed

    We've changed the PersistentVolume (PV) controller to assign a default StorageClass to any unbound PersistentVolumeClaim that has the storageClassName set to null. We've also modified the PersistentVolumeClaim admission within the API server to allow the change of values from an unset value to an actual StorageClass name.

    Null storageClassName versus storageClassName: "" - does it matter?

    Before this feature was introduced, those values were equal in terms of behavior. Any PersistentVolumeClaim with the storageClassName set to null or "" would bind to an existing PersistentVolume resource with storageClassName also set to null or "".

    With this new feature enabled, we wanted to maintain this behavior but also be able to update the StorageClass name. With these constraints in mind, the feature changes the semantics of null. If a default StorageClass is present, null translates to "Give me a default", while "" means "Give me a PersistentVolume that also has "" as its StorageClass name." In the absence of a default StorageClass, the behavior remains unchanged.

    Summarizing the above, we've changed the semantics of null so that its behavior depends on the presence or absence of a definition of default StorageClass.

    The table below shows all these cases to better describe when a PVC binds and when its StorageClass gets updated.

    PVC binding behavior with retroactive default StorageClass:

    Without a default StorageClass:
    • PV with storageClassName = "" - a PVC with storageClassName = "" binds; a PVC with storageClassName = null binds
    • PV without storageClassName - a PVC with storageClassName = "" binds; a PVC with storageClassName = null binds

    With a default StorageClass:
    • PV with storageClassName = "" - a PVC with storageClassName = "" binds; a PVC with storageClassName = null gets its class updated retroactively
    • PV without storageClassName - a PVC with storageClassName = "" binds; a PVC with storageClassName = null gets its class updated retroactively
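
    To make the distinction concrete, here is a sketch of the two PersistentVolumeClaim variants (the names are illustrative): the first explicitly requests a PV with no class, while the second asks for the default class and, with this feature, gets it retroactively if the default appears later:

    ---
    # storageClassName: "" - binds only to PVs that also have storageClassName ""
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-no-class # illustrative name
    spec:
      storageClassName: ""
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    ---
    # storageClassName omitted (null) - gets the default StorageClass, retroactively if needed
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: pvc-default-class # illustrative name
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi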

    How to use it

    If you want to test the feature whilst it's not enabled by default in your cluster, you need to enable the relevant feature gate in the kube-controller-manager and the kube-apiserver. Use the --feature-gates command line argument:

    --feature-gates="...,RetroactiveDefaultStorageClass=true"
    

    Test drive

    If you would like to see the feature in action and verify it works fine in your cluster here's what you can try:

    1. Define a basic PersistentVolumeClaim:

      apiVersion: v1
      kind: PersistentVolumeClaim
      metadata:
        name: pvc-1
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 1Gi
      
    2. Create the PersistentVolumeClaim when there is no default StorageClass. The PVC won't provision or bind (unless there is an existing, suitable PV already present) and will remain in Pending state.

      $ kc get pvc
      NAME    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
      pvc-1   Pending
      
    3. Configure one StorageClass as default.

      $ kc patch sc my-storageclass -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
      storageclass.storage.k8s.io/my-storageclass patched
      
    4. Verify that the PersistentVolumeClaim is now provisioned correctly and was updated retroactively with the new default StorageClass.

      $ kc get pvc
      NAME    STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS      AGE
      pvc-1   Bound    pvc-06a964ca-f997-4780-8627-b5c3bf5a87d8   1Gi        RWO            my-storageclass   87m
      

    New metrics

    To help you see that the feature is working as expected, we also introduced a new retroactive_storageclass_total metric to show how many times the PV controller attempted to update a PersistentVolumeClaim, and retroactive_storageclass_errors_total to show how many of those attempts failed.

    Getting involved

    We always welcome new contributors so if you would like to get involved you can join our Kubernetes Storage Special-Interest-Group (SIG).

    If you would like to share feedback, you can do so on our public Slack channel.

    Special thanks to all the contributors that provided great reviews, shared valuable insight and helped implement this feature (alphabetical order):

  5. Author: Takafumi Takahashi (Hitachi Vantara)

    Kubernetes v1.26, released last month, introduced an alpha feature that lets you specify a data source for a PersistentVolumeClaim, even when the source data belongs to a different namespace. With the new feature enabled, you specify a namespace in the dataSourceRef field of a new PersistentVolumeClaim. Once Kubernetes checks that access is OK, the new PersistentVolume can populate its data from the storage source specified in that other namespace. Before Kubernetes v1.26, provided your cluster had the AnyVolumeDataSource feature enabled, you could already provision new volumes from a data source in the same namespace. However, because that only worked for data sources in the same namespace, users couldn't provision a PersistentVolume in one namespace from a data source in another namespace. To solve this problem, Kubernetes v1.26 added a new alpha namespace field to the dataSourceRef field of the PersistentVolumeClaim API.

    How it works

    Once the csi-provisioner finds that a data source is specified with a dataSourceRef that has a non-empty namespace name, it checks all reference grants within the namespace that's specified by the .spec.dataSourceRef.namespace field of the PersistentVolumeClaim, in order to see if access to the data source is allowed. If any ReferenceGrant allows access, the csi-provisioner provisions a volume from the data source.

    Trying it out

    The following things are required to use cross namespace volume provisioning:

    • Enable the AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates for the kube-apiserver and kube-controller-manager
    • Install a CRD for the specific VolumeSnapshot controller
    • Install the CSI Provisioner controller and enable the CrossNamespaceVolumeDataSource feature gate
    • Install the CSI driver
    • Install a CRD for ReferenceGrants

    Putting it all together

    To see how this works, you can install the sample and try it out. This sample creates a PVC in the dev namespace from a VolumeSnapshot in the prod namespace. It is a simple example; for real-world use, you might want to use a more complex approach.

    Assumptions for this example

    • Your Kubernetes cluster was deployed with AnyVolumeDataSource and CrossNamespaceVolumeDataSource feature gates enabled
    • There are two namespaces, dev and prod
    • CSI driver is being deployed
    • There is an existing VolumeSnapshot named new-snapshot-demo in the prod namespace (see the sketch after this list)
    • The ReferenceGrant CRD (from the Gateway API project) is already deployed
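
    For reference, the pre-existing VolumeSnapshot in the prod namespace might have been created with a manifest like the sketch below; the VolumeSnapshotClass and source PVC names are assumptions made for illustration:

    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: new-snapshot-demo
      namespace: prod
    spec:
      volumeSnapshotClassName: example-snapshot-class # illustrative class name
      source:
        persistentVolumeClaimName: prod-data # illustrative source PVC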

    Grant ReferenceGrants read permission to the CSI Provisioner

    Access to ReferenceGrants is only needed when the CSI driver has the CrossNamespaceVolumeDataSource controller capability. For this example, the external-provisioner needs get, list, and watch permissions for referencegrants (API group gateway.networking.k8s.io).

    - apiGroups: ["gateway.networking.k8s.io"]
      resources: ["referencegrants"]
      verbs: ["get", "list", "watch"]
    

    Enable the CrossNamespaceVolumeDataSource feature gate for the CSI Provisioner

    Add --feature-gates=CrossNamespaceVolumeDataSource=true to the csi-provisioner command line. For example, use this manifest snippet to redefine the container:

    - args:
      - -v=5
      - --csi-address=/csi/csi.sock
      - --feature-gates=Topology=true
      - --feature-gates=CrossNamespaceVolumeDataSource=true
      image: csi-provisioner:latest
      imagePullPolicy: IfNotPresent
      name: csi-provisioner
    

    Create a ReferenceGrant

    Here's a manifest for an example ReferenceGrant.

    apiVersion: gateway.networking.k8s.io/v1beta1
    kind: ReferenceGrant
    metadata:
      name: allow-prod-pvc
      namespace: prod
    spec:
      from:
      - group: ""
        kind: PersistentVolumeClaim
        namespace: dev
      to:
      - group: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: new-snapshot-demo
    

    Create a PersistentVolumeClaim by using cross namespace data source

    Kubernetes creates a PersistentVolumeClaim on dev and the CSI driver populates the PersistentVolume used on dev from snapshots on prod.

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc
      namespace: dev
    spec:
      storageClassName: example
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
      dataSourceRef:
        apiGroup: snapshot.storage.k8s.io
        kind: VolumeSnapshot
        name: new-snapshot-demo
        namespace: prod
      volumeMode: Filesystem
    

    How can I learn more?

    The enhancement proposal, Provision volumes from cross-namespace snapshots, includes lots of detail about the history and technical implementation of this feature.

    Please get involved by joining the Kubernetes Storage Special Interest Group (SIG) to help us enhance this feature. There are a lot of good ideas already and we'd be thrilled to have more!

    Acknowledgments

    It takes a wonderful group to make wonderful software. Special thanks to the following people for the insightful reviews, thorough consideration and valuable contribution to the CrossNamespaceVolumeDataSource feature:

    • Michelle Au (msau42)
    • Xing Yang (xing-yang)
    • Masaki Kimura (mkimuram)
    • Tim Hockin (thockin)
    • Ben Swartzlander (bswartz)
    • Rob Scott (robscott)
    • John Griffith (j-griffith)
    • Michael Henriksen (mhenriks)
    • Mustafa Elbehery (Elbehery)

    It’s been a joy to work with y'all on this.