Kubernetes News

The Kubernetes project blog
Kubernetes.io
  1. Author: Fabrizio Pandini (VMware)

    The Cluster API community is happy to announce the implementation of ClusterClass and Managed Topologies, a new feature that will greatly simplify how you can provision, upgrade, and operate multiple Kubernetes clusters in a declarative way.

    A little bit of context…

    Before getting into the details, let's take a step back and look at the history of Cluster API.

    The Cluster API project started three years ago, and the first releases focused on extensibility and implementing a declarative API that allows a seamless experience across infrastructure providers. This was a success with many cloud providers: AWS, Azure, Digital Ocean, GCP, Metal3, vSphere and still counting.

    With extensibility addressed, the focus shifted to features, like automatic control plane and etcd management, health-based machine remediation, machine rollout strategies and more.

    Fast forwarding to 2021, with lots of companies using Cluster API to manage fleets of Kubernetes clusters running workloads in production, the community focused its efforts on stabilizing the code, APIs, and documentation, and on the extensive test signals that inform Kubernetes releases.

    With solid foundations in place, and a vibrant and welcoming community that still continues to grow, it was time to plan another iteration on our UX for both new and advanced users.

    Enter ClusterClass and Managed Topologies, tada!

    ClusterClass

    As the name suggests, ClusterClass and managed topologies are built in two parts.

    The idea behind ClusterClass is simple: define the shape of your cluster once, and reuse it many times, abstracting the complexities and the internals of a Kubernetes cluster away.

    Defining a ClusterClass

    ClusterClass, at its heart, is a collection of Cluster and Machine templates. You can use it as a “stamp” that can be leveraged to create many clusters of a similar shape.

    ---
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: ClusterClass
    metadata:
      name: my-amazing-cluster-class
    spec:
      controlPlane:
        ref:
          apiVersion: controlplane.cluster.x-k8s.io/v1beta1
          kind: KubeadmControlPlaneTemplate
          name: high-availability-control-plane
        machineInfrastructure:
          ref:
            apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
            kind: DockerMachineTemplate
            name: control-plane-machine
      workers:
        machineDeployments:
        - class: type1-workers
          template:
            bootstrap:
              ref:
                apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
                kind: KubeadmConfigTemplate
                name: type1-bootstrap
            infrastructure:
              ref:
                apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
                kind: DockerMachineTemplate
                name: type1-machine
        - class: type2-workers
          template:
            bootstrap:
              ref:
                apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
                kind: KubeadmConfigTemplate
                name: type2-bootstrap
            infrastructure:
              ref:
                kind: DockerMachineTemplate
                apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
                name: type2-machine
      infrastructure:
        ref:
          apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
          kind: DockerClusterTemplate
          name: cluster-infrastructure
    
    

    The possibilities are endless; you can get a default ClusterClass from the community, “off-the-shelf” classes from your vendor of choice, “certified” classes from the platform admin in your company, or even create custom ones for advanced scenarios.

    Managed Topologies

    Managed Topologies let you put the power of ClusterClass into action.

    Given a ClusterClass, you can create many Clusters of a similar shape by providing a single resource, the Cluster.

    Create a Cluster with ClusterClass

    Here is an example:

    ---
    apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    metadata:
      name: my-amazing-cluster
      namespace: bar
    spec:
      topology: # define a managed topology
        class: my-amazing-cluster-class # use the ClusterClass mentioned earlier
        version: v1.21.2
        controlPlane:
          replicas: 3
        workers:
          machineDeployments:
          - class: type1-workers
            name: big-pool-of-machines
            replicas: 5
          - class: type2-workers
            name: small-pool-of-machines
            replicas: 1
    

    But there is more than simplified cluster creation. Now the Cluster acts as a single control point for your entire topology.

    All the power of Cluster API (extensibility, lifecycle automation, stability, and all the features required for managing an enterprise-grade Kubernetes cluster on the infrastructure provider of your choice) is now at your fingertips: you can create your Cluster, add new machines, and upgrade to the next Kubernetes version, all from a single place.

    It is just as simple as it looks!

    What’s next

    While the amazing Cluster API community is working hard to deliver the first version of ClusterClass and managed topologies later this year, we are already looking forward to what comes next for the project and its ecosystem.

    There are a lot of great ideas and opportunities ahead!

    We want to make managed topologies even more powerful and flexible, allowing users to dynamically change parts of a ClusterClass according to the specific needs of a Cluster. This will keep the UX simple and intuitive even for complex problems, such as selecting the machine image for a specific Kubernetes version and a specific region of your infrastructure provider, or injecting proxy configurations into the entire Cluster.

    Stay tuned for what comes next, and reach out to the community if you have any questions, comments, or suggestions.

  2. Authors: Jim Angel (Google), Pushkar Joglekar (VMware), and Savitha Raghunathan (Red Hat)

    Background

    USA's National Security Agency (NSA) and the Cybersecurity and Infrastructure Security Agency (CISA) released "Kubernetes Hardening Guidance" on August 3rd, 2021. The guidance details threats to Kubernetes environments and provides secure configuration guidance to minimize risk.

    The following sections of this blog correlate to the sections in the NSA/CISA guidance. Any missing sections are skipped because of limited opportunities to add anything new to the existing content.

    Note: This blog post is not a substitute for reading the guide. Reading the published guidance is recommended before proceeding as the following content is complementary.

    Introduction and Threat Model

    Note that the threats identified as important by the NSA/CISA, or the intended audience of this guidance, may be different from the threats that other enterprise users of Kubernetes consider important. This section is still useful for organizations that care about data, resource theft and service unavailability.

    The guidance highlights the following three sources of compromises:

    • Supply chain risks
    • Malicious threat actors
    • Insider threats (administrators, users, or cloud service providers)

    The threat model tries to take a step back and review threats that not only exist within the boundary of a Kubernetes cluster but also include the underlying infrastructure and surrounding workloads that Kubernetes does not manage.

    For example, when a workload outside the cluster shares the same physical network, it has access to the kubelet and to control plane components: etcd, controller manager, scheduler, and API server. Therefore, the guidance recommends having network-level isolation separating Kubernetes clusters from other workloads that do not need connectivity to the Kubernetes control plane nodes. Specifically, the scheduler, controller-manager, and etcd only need to be accessible to the API server. Any interaction with Kubernetes from outside the cluster can happen by providing access to the API server port.

    The list of ports and protocols for each of these components is defined in Ports and Protocols within the Kubernetes documentation.

    Special note: kube-scheduler and kube-controller-manager use different ports than the ones mentioned in the guidance.

    The Threat modelling section from the CNCF Cloud Native Security Whitepaper + Map provides another perspective on approaching threat modelling Kubernetes, from a cloud native lens.

    Kubernetes Pod security

    Kubernetes by default does not guarantee strict workload isolation between pods running in the same node in a cluster. However, the guidance provides several techniques to enhance existing isolation and reduce the attack surface in case of a compromise.

    "Non-root" containers and "rootless" container engines

    Several best practices related to the basic security principle of least privilege, i.e. provide only the permissions that are needed (no more, no less), are worth a second look.

    The guide recommends setting a non-root user at build time instead of relying on setting runAsUser at runtime in your Pod spec. This is a good practice and provides some level of defense in depth. For example, suppose the container image is built with user 10001 and the Pod spec does not add the runAsUser field in its Deployment object. In this case there are certain edge cases that are worth exploring for awareness:

    1. Pods can fail to start if the user defined at build time is different from the one defined in the pod spec and some files become inaccessible as a result.
    2. Pods can end up sharing User IDs unintentionally. This can be problematic even if the User IDs are non-zero, in a situation where a container escape to the host file system is possible. Once the attacker has access to the host file system, they get access to all the file resources owned by other unrelated pods that share the same UID.
    3. Pods can end up sharing User IDs with other node-level processes not managed by Kubernetes, e.g. node-level daemons for auditing, vulnerability scanning, or telemetry. The threat is similar to the one above, where host file system access can give the attacker full access to these node-level daemons without needing to be root on the node.

    However, none of these cases will have as severe an impact as a container running as root being able to escape as a root user on the host, which can provide an attacker with complete control of the worker node, further allowing lateral movement to other worker or control plane nodes.
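
    For reference, here is a minimal sketch of setting the user explicitly in a Pod spec, as discussed above; the image name and UID are illustrative assumptions, not taken from the guidance:

    apiVersion: v1
    kind: Pod
    metadata:
      name: non-root-demo
    spec:
      securityContext:
        runAsUser: 10001      # matches the non-root UID baked into the image at build time (assumed value)
        runAsNonRoot: true    # the kubelet refuses to start the container if it would run as root
      containers:
      - name: app
        image: example.com/my-app:1.0   # hypothetical image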

    Kubernetes 1.22 introduced an alpha feature that specifically reduces the impact of such a compromise: running components that would otherwise run as the root user as a non-root user, through user namespaces.

    That (alpha stage) support for user namespaces / rootless mode is available with the following container runtimes:

    Some distributions support running in rootless mode, like the following:

    Immutable container filesystems

    The NSA/CISA Kubernetes Hardening Guidance highlights an often overlooked feature, readOnlyRootFilesystem, with a working example in Appendix B. This example limits execution and tampering of containers at runtime. Any read/write activity can then be limited to a few directories by using tmpfs volume mounts.
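
    As a rough sketch (the image name and paths are assumptions), a Pod combining readOnlyRootFilesystem with a tmpfs-backed scratch volume could look like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: read-only-root-demo
    spec:
      containers:
      - name: app
        image: example.com/my-app:1.0          # hypothetical image
        securityContext:
          readOnlyRootFilesystem: true         # the container cannot modify its own filesystem
        volumeMounts:
        - name: scratch
          mountPath: /tmp                      # read/write activity limited to this mount
      volumes:
      - name: scratch
        emptyDir:
          medium: Memory                       # tmpfs-backed volume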

    However, some applications that modify the container filesystem at runtime, like exploding a WAR or JAR file at container startup, could face issues when enabling this feature. To avoid this issue, consider making minimal changes to the filesystem at runtime when possible.

    Building secure container images

    Kubernetes Hardening Guidance also recommends running a scanner at deploy time as an admission controller, to prevent vulnerable or misconfigured pods from running in the cluster. Theoretically, this sounds like a good approach but there are several caveats to consider before this can be implemented in practice:

    • Depending on network bandwidth, available resources and scanner of choice, scanning for vulnerabilities for an image can take an indeterminate amount of time. This could lead to slower or unpredictable pod start up times, which could result in spikes of unavailability when apps are serving peak load.
    • If the policy that allows or denies pod startup is made using incorrect or incomplete data, it could result in several false positive or false negative outcomes, like the following:
      • Inside a container image, the openssl package is detected as vulnerable. However, the application is written in Golang and uses the Go crypto package for TLS. Therefore, this vulnerability is not in the code execution path and as such has minimal impact if it remains unfixed.
      • A vulnerability is detected in the openssl package for a Debian base image. However, the upstream Debian community considers it a minor-impact vulnerability and as a result does not release a patch for it. The owner of this image is now stuck with a vulnerability that cannot be fixed and a cluster that does not allow the image to run because of a predefined policy that does not take into account whether a fix for the vulnerability is available.
      • A Golang app is built on top of a distroless image, but it is compiled with a Golang version that uses a vulnerable standard library. The scanner has no visibility into the Golang version, only into OS-level packages, so it allows the pod to run in the cluster in spite of the image containing an app binary built with a vulnerable Golang version.

    To be clear, relying on vulnerability scanners is absolutely a good idea but policy definitions should be flexible enough to allow:

    • Creation of exception lists for images or vulnerabilities through labelling
    • Overriding the severity with a risk score based on impact of a vulnerability
    • Applying the same policies at build time to catch vulnerable images with fixable vulnerabilities before they can be deployed into Kubernetes clusters

    Special considerations, like fetching the vulnerability database offline, may also be needed if the clusters run in an air-gapped environment and the scanners require internet access to update the vulnerability database.

    Pod Security Policies

    Since Kubernetes v1.21, the PodSecurityPolicy API and related features are deprecated, but some of the guidance in this section will still apply for the next few years, until cluster operators upgrade their clusters to newer Kubernetes versions.

    The Kubernetes project is working on a replacement for PodSecurityPolicy. Kubernetes v1.22 includes an alpha feature called Pod Security Admission that is intended to allow enforcing a minimum level of isolation between pods.

    The built-in isolation levels for Pod Security Admission are derived from Pod Security Standards, which is a superset of all the components mentioned in Table I page 10 of the guidance.
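
    As a hedged example of how that alpha mechanism is configured (assuming the PodSecurity admission plugin and its feature gate are enabled on your v1.22 cluster), isolation levels are applied per namespace via labels; the namespace name below is a hypothetical placeholder:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: team-a                                    # hypothetical namespace
      labels:
        pod-security.kubernetes.io/enforce: baseline  # reject pods that violate the baseline standard
        pod-security.kubernetes.io/warn: restricted   # warn about (but allow) pods that violate the restricted standard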

    Information about migrating from PodSecurityPolicy to the Pod Security Admission feature is available in Migrate from PodSecurityPolicy to the Built-In PodSecurity Admission Controller.

    One important behavior mentioned in the guidance that remains the same between Pod Security Policy and its replacement is that enforcing either of them does not affect pods that are already running. With both PodSecurityPolicy and Pod Security Admission, the enforcement happens during the pod creation stage.

    Hardening container engines

    Some container workloads are less trusted than others but may need to run in the same cluster. In those cases, running them on dedicated nodes that include hardened container runtimes that provide stricter pod isolation boundaries can act as a useful security control.

    Kubernetes supports an API called RuntimeClass that reached stable / GA (and is therefore enabled by default) as of Kubernetes v1.20. RuntimeClass allows you to ensure that Pods requiring strong isolation are scheduled onto nodes that can offer it.
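
    A minimal sketch of the API could look like the following; the handler and image names are assumptions and must match what your nodes' container runtime actually provides, and steering pods onto dedicated nodes additionally relies on the RuntimeClass scheduling fields or node selectors:

    apiVersion: node.k8s.io/v1
    kind: RuntimeClass
    metadata:
      name: sandboxed                    # hypothetical name
    handler: my-sandboxed-runtime        # CRI handler configured on the hardened nodes (assumed)
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: less-trusted-workload
    spec:
      runtimeClassName: sandboxed        # run this pod with the stricter runtime
      containers:
      - name: app
        image: example.com/untrusted:1.0 # hypothetical image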

    Some third-party projects that you can use in conjunction with RuntimeClass are:

    As discussed here and in the guidance, many features and tooling exist in and around Kubernetes that can enhance the isolation boundaries between pods. Based on relevant threats and your risk posture, you should pick and choose between them, instead of trying to apply all the recommendations. Having said that, cluster-level isolation, i.e. running workloads in dedicated clusters, remains the strictest workload isolation mechanism, in spite of the improvements mentioned earlier here and in the guide.

    Network Separation and Hardening

    Kubernetes Networking can be tricky and this section focuses on how to secure and harden the relevant configurations. The guide identifies the following as key takeaways:

    • Using NetworkPolicies to create isolation between resources,
    • Securing the control plane
    • Encrypting traffic and sensitive data

    Network Policies

    Network policies can be created with the help of network plugins. In order to make the creation and visualization easier for users, Cilium supports a web GUI tool. That web GUI lets you create Kubernetes NetworkPolicies (a generic API that nevertheless requires a compatible CNI plugin), and / or Cilium network policies (CiliumClusterwideNetworkPolicy and CiliumNetworkPolicy, which only work in clusters that use the Cilium CNI plugin). You can use these APIs to restrict network traffic between pods, and therefore minimize the attack vector.
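
    For instance, a common starting point (independent of the CNI plugin, as long as it enforces NetworkPolicy) is a default-deny-ingress policy per namespace; the namespace name below is an assumption:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-ingress
      namespace: my-app                  # hypothetical namespace
    spec:
      podSelector: {}                    # selects every pod in the namespace
      policyTypes:
      - Ingress                          # no ingress rules are listed, so all inbound traffic is denied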

    Another scenario that is worth exploring is the usage of external IPs. Some services, when misconfigured, can create random external IPs. An attacker can take advantage of this misconfiguration and easily intercept traffic. This vulnerability has been reported in CVE-2020-8554. Using externalip-webhook can mitigate this vulnerability by preventing the services from using random external IPs. externalip-webhook only allows creation of services that don't require external IPs or whose external IPs are within the range specified by the administrator.

    CVE-2020-8554 - Kubernetes API server in all versions allow an attacker who is able to create a ClusterIP service and set the spec.externalIPs field, to intercept traffic to that IP address. Additionally, an attacker who is able to patch the status (which is considered a privileged operation and should not typically be granted to users) of a LoadBalancer service can set the status.loadBalancer.ingress.ip to similar effect.

    Resource Policies

    In addition to configuring ResourceQuotas and limits, consider restricting how many process IDs (PIDs) a given Pod can use, and also reserving some PIDs for node-level use to avoid resource exhaustion. More details on how to apply these limits can be found in Process ID Limits And Reservations.
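
    As a sketch, these limits can be set through the kubelet configuration; the numbers below are placeholder assumptions and should be sized for your workloads:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    podPidsLimit: 1024          # maximum number of PIDs any single pod may use (assumed value)
    systemReserved:
      pid: "1000"               # PIDs reserved for node-level system daemons (assumed value)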

    Control Plane Hardening

    In the next section, the guide covers control plane hardening. It is worth noting that, as of Kubernetes 1.20, the insecure port of the API server has been removed.

    Etcd

    As a general rule, the etcd server should be configured to only trust certificates assigned to the API server. This limits the attack surface and prevents a malicious attacker from gaining access to the cluster. It might be beneficial to use a separate CA for etcd, since it by default trusts all the certificates issued by the root CA.

    Kubeconfig Files

    In addition to specifying the token and certificates directly, .kubeconfig supports dynamic retrieval of temporary tokens using auth provider plugins. Beware of the possibility of malicious shell code execution in a kubeconfig file. Once attackers gain access to the cluster, they can steal ssh keys/secrets or more.

    Secrets

    Kubernetes Secrets is the native way of managing secrets as a Kubernetes API object. However, in some scenarios, such as a desire to have a single source of truth for all app secrets irrespective of whether they run on Kubernetes or not, secrets can be managed loosely coupled with Kubernetes and consumed by pods through sidecars or init containers with minimal usage of the Kubernetes Secrets API.

    External secrets providers and csi-secrets-store are some of these alternatives to Kubernetes Secrets.

    Log Auditing

    The NSA/CISA guidance stresses monitoring and alerting based on logs. The key points include logging at the host level, application level, and on the cloud. When running Kubernetes in production, it's important to understand who's responsible, and who's accountable, for each layer of logging.

    Kubernetes API auditing

    One area that deserves more focus is what exactly should alert or be logged. The document outlines a sample policy in Appendix L: Audit Policy that logs everything at the RequestResponse level, including metadata and request/response bodies. While helpful for a demo, it may not be practical for production.

    Each organization needs to evaluate its own threat model and build an audit policy that complements incident response and troubleshooting. Think about how someone would attack your organization and what audit trail could identify it. Review more advanced options for tuning audit logs in the official audit logging documentation. It's crucial to tune your audit logs to only include events that meet your threat model. A minimal audit policy that logs everything at the Metadata level can also be a good starting point.
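
    For illustration only, a trimmed-down policy in that spirit might log most requests at the Metadata level while dropping one known-noisy read path; the specific rules are assumptions to adapt to your own threat model:

    apiVersion: audit.k8s.io/v1
    kind: Policy
    omitStages:
    - RequestReceived              # skip the stage that duplicates most events
    rules:
    - level: None                  # drop watch traffic from kube-proxy (example exclusion, an assumption)
      users: ["system:kube-proxy"]
      verbs: ["watch"]
    - level: Metadata              # log only metadata (user, verb, resource, timestamp) for everything else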

    Audit logging configurations can also be tested with kind following these instructions.

    Streaming logs and auditing

    Logging is important for threat and anomaly detection. As the document outlines, it's a best practice to scan and alert on logs as close to real time as possible and to protect logs from tampering if a compromise occurs. It's important to reflect on the various levels of logging and identify the critical areas such as API endpoints.

    Kubernetes API audit logging can stream to a webhook and there's an example in Appendix N: Webhook configuration. Using a webhook could be a method that stores logs off cluster and/or centralizes all audit logs. Once logs are centrally managed, look to enable alerting based on critical events. Also ensure you understand what the baseline is for normal activities.

    Alert identification

    While the guide stresses the importance of notifications, there is no blanket list of events to alert on. Alerting needs vary based on your own requirements and threat model. Examples include the following events:

    • Changes to the securityContext of a Pod
    • Updates to admission controller configs
    • Accessing certain files / URLs

    Additional logging resources

    Upgrading and Application Security practices

    Kubernetes releases three times per year, so upgrade-related toil is a common problem for people running production clusters. In addition to this, operators must regularly upgrade the underlying node's operating system and running applications. This is a best practice to ensure continued support and to reduce the likelihood of bugs or vulnerabilities.

    Kubernetes supports the three most recent stable releases. While each Kubernetes release goes through a large number of tests before being published, some teams aren't comfortable running the latest stable release until some time has passed. No matter what version you're running, ensure that patch upgrades happen frequently or automatically. More information can be found in the version skew policy pages.

    When thinking about how you'll manage node OS upgrades, consider ephemeral nodes. Having the ability to destroy and add nodes allows your team to respond quicker to node issues. In addition, having deployments that tolerate node instability (and a culture that encourages frequent deployments) allows for easier cluster upgrades.

    Additionally, it's worth reiterating from the guidance that periodic vulnerability scans and penetration tests can be performed on the various system components to proactively look for insecure configurations and vulnerabilities.

    Finding release & security information

    To find the most recent Kubernetes supported versions, refer to https://k8s.io/releases, which includes minor versions. It's good to stay up to date with your minor version patches.

    If you're running a managed Kubernetes offering, look for their release documentation and find their various security channels.

    Subscribe to the Kubernetes Announce mailing list. The Kubernetes Announce mailing list is searchable for terms such as "Security Advisories". You can set up alerts and email notifications as long as you know what key words to alert on.

    Conclusion

    In summary, it is fantastic to see security practitioners sharing this level of detailed guidance in public. This guidance further highlights Kubernetes going mainstream and how securing Kubernetes clusters and the application containers running on Kubernetes continues to need the attention and focus of practitioners. Only a few weeks after the guidance was published, kubescape, an open source tool for validating clusters against this guidance, became available.

    This tool can be a great starting point to check the current state of your clusters, after which you can use the information in this blog post and in the guidance to assess where improvements can be made.

    Finally, it is worth reiterating that not all controls in this guidance will make sense for all practitioners. The best way to know which controls matter is to rely on the threat model of your own Kubernetes environment.

    A special shout out and thanks to Rory McCune (@raesene) for his inputs to this blog post.

  3. Authors: Augustinas Stirbis (CAST AI)

    Why Duplicate Data?

    It’s convenient to create a copy of your application with a copy of its state for each team. For example, you might want a separate database copy to test some significant schema changes or develop other disruptive operations like bulk insert/delete/update...

    Duplicating data takes a lot of time. That's because you first need to download all the data from the source block storage provider to compute, and then send it back to a storage provider again. A lot of network traffic and CPU/RAM is used in this process. Hardware acceleration, offloading certain expensive operations to dedicated hardware, is always a huge performance boost: it reduces the time required to complete an operation by orders of magnitude.

    Volume Snapshots to the rescue

    Kubernetes introduced VolumeSnapshots as alpha in 1.12, beta in 1.17, and the Generally Available version in 1.20. VolumeSnapshots use specialized APIs from storage providers to duplicate a volume of data.

    Since the data is already on the same storage device (or array of devices), duplicating data is usually a metadata operation for storage providers with local snapshots (the majority of on-premises storage providers). All you need to do is point a new disk at an immutable snapshot and save only the deltas (or let it do a full-disk copy). As an operation inside the storage back-end, it's much quicker and usually doesn't involve sending traffic over the network. Public cloud storage providers work a bit differently under the hood: they save snapshots to object storage and then copy back from object storage to block storage when "duplicating" a disk. Technically, a lot of compute and network resources are spent on the cloud provider's side, but from the Kubernetes user's perspective VolumeSnapshots work the same way whether the snapshot storage provider is local or remote, and none of your compute and network resources are involved in this operation.

    Sounds like we have our solution, right?

    Actually, VolumeSnapshots are namespaced, and Kubernetes protects namespaced data from being shared between tenants (Namespaces). This Kubernetes limitation is a conscious design decision so that a Pod running in a different namespace can’t mount another application’s PersistentVolumeClaim (PVC).

    One way around it would be to create multiple volumes with duplicate data in one namespace. However, you could easily reference the wrong copy.

    So the idea is to separate teams/initiatives by namespaces to avoid that and generally limit access to the production namespace.

    Solution? Creating a Golden Snapshot externally

    Another way around this design limitation is to create a Snapshot externally (not through Kubernetes). This is also called pre-provisioning a snapshot manually. Next, I will import it as a multi-tenant golden snapshot that can be used in many namespaces. The illustration below uses the AWS EBS (Elastic Block Storage) and GCE PD (Persistent Disk) services.

    High-level plan for preparing the Golden Snapshot

    1. Identify Disk (EBS/Persistent Disk) that you want to clone with data in the cloud provider
    2. Make a Disk Snapshot (in cloud provider console)
    3. Get Disk Snapshot ID

    High-level plan for cloning data for each team

    1. Create Namespace “sandbox01”
    2. Import Disk Snapshot (ID) as VolumeSnapshotContent to Kubernetes
    3. Create VolumeSnapshot in the Namespace "sandbox01" mapped to VolumeSnapshotContent
    4. Create the PersistentVolumeClaim from VolumeSnapshot
    5. Install Deployment or StatefulSet with PVC

    Step 1: Identify Disk

    First, you need to identify your golden source. In my case, it’s a PostgreSQL database on PersistentVolumeClaim “postgres-pv-claim” in the “production” namespace.

    kubectl -n <namespace> get pvc <pvc-name> -o jsonpath='{.spec.volumeName}'
    

    The output will look similar to:

    pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9
    

    Step 2: Prepare your golden source

    You need to do this once or every time you want to refresh your golden data.

    Make a Disk Snapshot

    Go to the AWS EC2 or GCP Compute Engine console and search for an EBS volume (on AWS) or Persistent Disk (on GCP) that has a label matching the last output. In this case I saw: pvc-3096b3ba-38b6-4fd1-a42f-ec99176ed0d9.

    Click on Create snapshot and give it a name. You can do it in Console manually, in AWS CloudShell / Google Cloud Shell, or in the terminal. To create a snapshot in the terminal you must have the AWS CLI tool (aws) or Google's CLI (gcloud) installed and configured.

    Here’s the command to create snapshot on GCP:

    gcloud compute disks snapshot <cloud-disk-id> --project=<gcp-project-id> --snapshot-names=<set-new-snapshot-name> --zone=<availability-zone> --storage-location=<region>
    
    (Screenshot: GCP snapshot creation in the terminal)

    GCP identifies the disk by its PVC name, so it's a direct mapping. On AWS, you first need to find the volume by the CSIVolumeName AWS tag (whose value is the PVC name); that volume ID will then be used for snapshot creation.

    (Screenshot: identifying the EBS volume ID in the AWS web console)

    Take note of the volume ID (vol-00c7ecd873c6fb3ec in this case) and either create an EBS snapshot in the AWS Console, or use the aws CLI.

    aws ec2 create-snapshot --volume-id '<volume-id>' --description '<set-new-snapshot-name>' --tag-specifications 'ResourceType=snapshot'
    

    Step 3: Get your Disk Snapshot ID

    In AWS, the command above will output something similar to:

    "SnapshotId": "snap-09ed24a70bc19bbe4"
    

    If you’re using the GCP cloud, you can get the snapshot ID from the gcloud command by querying for the snapshot’s given name:

    gcloud compute snapshots --project=<gcp-project-id> describe <new-snapshot-name> | grep id:
    

    You should get similar output to:

    id: 6645363163809389170
    

    Step 4: Create a development environment for each team

    Now I have my Golden Snapshot, which is immutable data. Each team will get a copy of this data, and team members can modify it as they see fit, given that a new EBS/persistent disk will be created for each team.

    Below I will define a manifest for each namespace. To save time, you can replace the namespace name (such as changing “sandbox01” → “sandbox42”) using tools such as sed or yq, with Kubernetes-aware templating tools like Kustomize, or using variable substitution in a CI/CD pipeline.

    Here's an example manifest:

    ---
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshotContent
    metadata:
      name: postgresql-orders-db-sandbox01
      namespace: sandbox01
    spec:
      deletionPolicy: Retain
      driver: pd.csi.storage.gke.io
      source:
        snapshotHandle: 'gcp/projects/staging-eu-castai-vt5hy2/global/snapshots/6645363163809389170'
      volumeSnapshotRef:
        kind: VolumeSnapshot
        name: postgresql-orders-db-snap
        namespace: sandbox01
    ---
    apiVersion: snapshot.storage.k8s.io/v1
    kind: VolumeSnapshot
    metadata:
      name: postgresql-orders-db-snap
      namespace: sandbox01
    spec:
      source:
        volumeSnapshotContentName: postgresql-orders-db-sandbox01
    

    In Kubernetes, VolumeSnapshotContent (VSC) objects are not namespaced. However, I need a separate VSC for each different namespace to use, so the metadata.name of each VSC must also be different. To make that straightforward, I used the target namespace as part of the name.

    Now it's time to replace the driver field with the CSI (Container Storage Interface) driver installed in your K8s cluster. Major cloud providers have CSI drivers for block storage that support VolumeSnapshots, but quite often these CSI drivers are not installed by default; consult with your Kubernetes provider.

    The manifest above defines a VSC that works on GCP. On AWS, the driver and snapshotHandle values might look like:

     driver: ebs.csi.aws.com
     source:
       snapshotHandle: "snap-07ff83d328c981c98"
    

    At this point, I need to use the Retain policy, so that the CSI driver doesn’t try to delete my manually created EBS disk snapshot.

    For GCP, you will have to build this string by hand - add a full project ID and snapshot ID. For AWS, it’s just a plain snapshot ID.

    VSC also requires specifying which VolumeSnapshot (VS) will use it, so VSC and VS are referencing each other.

    Now I can create a PersistentVolumeClaim from the VS above. It's important to set the dataSource field:

    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgres-pv-claim
      namespace: sandbox01
    spec:
      dataSource:
        kind: VolumeSnapshot
        name: postgresql-orders-db-snap
        apiGroup: snapshot.storage.k8s.io
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 21Gi
    

    If the default StorageClass has the WaitForFirstConsumer binding mode, then the actual cloud disk will be created from the Golden Snapshot only when a Pod binds that PVC.
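
    For reference, a StorageClass with that binding mode might look like the sketch below; the class name is an assumption and the provisioner should match the CSI driver used earlier:

    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: standard-rwo                     # hypothetical name
    provisioner: pd.csi.storage.gke.io       # GCP PD CSI driver, as in the VSC manifest above
    volumeBindingMode: WaitForFirstConsumer  # delay disk creation until a Pod actually uses the PVC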

    Now I assign that PVC to my Pod (in my case, it's PostgreSQL) as I would with any other PVC.
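
    A minimal sketch of how the Deployment might reference that PVC (the image and labels are assumptions):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: postgres
      namespace: sandbox01
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: postgres
      template:
        metadata:
          labels:
            app: postgres
        spec:
          containers:
          - name: postgres
            image: postgres:13               # hypothetical image/version
            volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
          volumes:
          - name: data
            persistentVolumeClaim:
              claimName: postgres-pv-claim   # the PVC created above

    After applying it, you can check that everything is bound and running: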

    kubectl -n <namespace> get volumesnapshotcontent,volumesnapshot,pvc,pod
    

    Both VS and VSC should be READYTOUSE true, PVC bound, and the Pod (from Deployment or StatefulSet) running.

    To keep on using data from my Golden Snapshot, I just need to repeat this for the next namespace and voilà! No need to waste time and compute resources on the duplication process.

  4. Author: Dewan Ahmed, Red Hat

    Introduction

    In Kubernetes, a Node is a representation of a single machine in your cluster. SIG Node owns that very important Node component and supports various subprojects such as Kubelet, Container Runtime Interface (CRI) and more to support how the pods and host resources interact. In this blog, we have summarized our conversation with Elana Hashman (EH) & Sergey Kanzhelev (SK), who walk us through the various aspects of being a part of the SIG and share some insights about how others can get involved.

    A summary of our conversation

    Could you tell us a little about what SIG Node does?

    SK: SIG Node is a vertical SIG responsible for the components that support the controlled interactions between the pods and host resources. We manage the lifecycle of pods that are scheduled to a node. This SIG's focus is to enable a broad set of workload types, including workloads with hardware-specific or performance-sensitive requirements, all while maintaining isolation boundaries between pods on a node, as well as between the pod and the host. This SIG maintains quite a few components and has many external dependencies (like container runtimes or operating system features), which makes the complexity we deal with huge. We tame the complexity and aim to continuously improve node reliability.

    "SIG Node is a vertical SIG" could you explain a bit more?

    EH: There are two kinds of SIGs: horizontal and vertical. Horizontal SIGs are concerned with a particular function of every component in Kubernetes: for example, SIG Security considers security aspects of every component in Kubernetes, or SIG Instrumentation looks at the logs, metrics, traces and events of every component in Kubernetes. Such SIGs don't tend to own a lot of code.

    Vertical SIGs, on the other hand, own a single component, and are responsible for approving and merging patches to that code base. SIG Node owns the "Node" vertical, pertaining to the kubelet and its lifecycle. This includes the code for the kubelet itself, as well as the node controller, the container runtime interface, and related subprojects like the node problem detector.

    How did the CI subproject start? Is this specific to SIG Node and how does it help the SIG?

    SK: The subproject started as a follow-up after one of the releases was blocked by numerous failures of critical tests. These tests hadn't started failing all at once; rather, a continuous lack of attention led to a slow degradation of test quality. SIG Node was always prioritizing quality and reliability, and forming the subproject was a way to highlight this priority.

    As the 3rd largest SIG in terms of number of issues and PRs, how does your SIG juggle so much work?

    EH: It helps to be organized. When I increased my contributions to the SIG in January of 2021, I found myself overwhelmed by the volume of pull requests and issues and wasn't sure where to start. We were already tracking test-related issues and pull requests on the CI subproject board, but that was missing a lot of our bugfixes and feature work. So I began putting together a triage board for the rest of our pull requests, which allowed me to sort each one by status and what actions to take, and documented its use for other contributors. We closed or merged over 500 issues and pull requests tracked by our two boards in each of the past two releases. The Kubernetes devstats showed that we have significantly increased our velocity as a result.

    In June, we ran our first bug scrub event to work through the backlog of issues filed against SIG Node, ensuring they were properly categorized. We closed over 130 issues over the course of this 48 hour global event, but as of writing we still have 333 open issues.

    Why should new and existing contributors consider joining SIG Node?

    SK: Being a SIG Node contributor gives you skills and recognition that are rewarding and useful. Understanding how the kubelet works under the hood helps you architect better apps, tune and optimize those apps, and gives you a leg up in troubleshooting issues. If you are a new contributor, SIG Node gives you the foundational knowledge that is key to understanding why other Kubernetes components are designed the way they are. Existing contributors may benefit as many features will require SIG Node changes one way or another, so being a SIG Node contributor helps build features in other SIGs faster.

    SIG Node maintains numerous components, many of which have dependencies on external projects or OS features. This makes the onboarding process quite lengthy and demanding. But if you are up for a challenge, there is always a place for you, and a group of people to support you.

    What do you do to help new contributors get started?

    EH: Getting started in SIG Node can be intimidating, since there is so much work to be done, our SIG meetings are very large, and it can be hard to find a place to start.

    I always encourage new contributors to work on things that they have some investment in already. In SIG Node, that might mean volunteering to help fix a bug that you have personally been affected by, or helping to triage bugs you care about by priority.

    To come up to speed on any open source code base, there are two strategies you can take: start by exploring a particular issue deeply, and follow that to expand the edges of your knowledge as needed, or briefly review as many issues and change requests as you possibly can to get a higher level picture of how the component works. Ultimately, you will need to do both if you want to become a Node reviewer or approver.

    Davanum Srinivas and I each ran a cohort of group mentoring to help teach new contributors the skills to become Node reviewers, and if there's interest we can work to find a mentor to run another session. I also encourage new contributors to attend our Node CI Subproject meeting: it's a smaller audience and we don't record the triage sessions, so it can be a less intimidating way to get started with the SIG.

    Are there any particular skills you’d like to recruit for? What skills are contributors to SIG Node likely to learn?

    SK: SIG Node works on many workstreams in very different areas. All of these areas are at the system level. For typical code contributions you need to have a passion for building and utilizing low-level APIs and writing performant and reliable components. As a contributor you will learn how to debug and troubleshoot, profile, and monitor these components, as well as the user workloads that are run by these components, often with limited or no access to the Nodes themselves, as they are running production workloads.

    The other way of contribution is to help document SIG node features. This type of contribution requires a deep understanding of features, and ability to explain them in simple terms.

    Finally, we are always looking for feedback on how best to run your workload. Come and explain specifics of it, and what features in SIG Node components may help to run it better.

    What are you getting positive feedback on, and what’s coming up next for SIG Node?

    EH: Over the past year SIG Node has adopted some new processes to help manage our feature development and Kubernetes enhancement proposals, and other SIGs have looked to us for inspiration in managing large workloads. I hope that this is an area we can continue to provide leadership in and further iterate on.

    We have a great balance of new features and deprecations in flight right now. Deprecations of unused or difficult to maintain features help us keep technical debt and maintenance load under control, and examples include the dockershim and DynamicKubeletConfiguration deprecations. New features will unlock additional functionality in end users' clusters, and include exciting features like support for cgroups v2, swap memory, graceful node shutdowns, and device management policies.

    Any closing thoughts/resources you’d like to share?

    SK/EH: It takes time and effort to get into any open source community. SIG Node may overwhelm you at first with the number of participants, volume of work, and project scope. But it is totally worth it. Join our welcoming community! The SIG Node GitHub repo contains many useful resources, including Slack, mailing list, and other contact info.

    Wrap Up

    SIG Node hosted a KubeCon + CloudNativeCon Europe 2021 talk with an intro and deep dive to their awesome SIG. Join the SIG's meetings to find out about the most recent research results, what the plans are for the forthcoming year, and how to get involved in the upstream Node team as a contributor!

  5. Author: Chris Henzie (Google)

    Last month's release of Kubernetes v1.22 introduced a new ReadWriteOncePod access mode for PersistentVolumes and PersistentVolumeClaims. With this alpha feature, Kubernetes allows you to restrict volume access to a single pod in the cluster.

    What are access modes and why are they important?

    When using storage, there are different ways to model how that storage is consumed.

    For example, a storage system like a network file share can have many users all reading and writing data simultaneously. In other cases maybe everyone is allowed to read data but not write it. For highly sensitive data, maybe only one user is allowed to read and write data but nobody else.

    In the world of Kubernetes, access modes are the way you can define how durable storage is consumed. These access modes are a part of the spec for PersistentVolumes (PVs) and PersistentVolumeClaims (PVCs).

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: shared-cache
    spec:
      accessModes:
      - ReadWriteMany # Allow many pods to access shared-cache simultaneously.
      resources:
        requests:
          storage: 1Gi
    

    Before v1.22, Kubernetes offered three access modes for PVs and PVCs:

    • ReadWriteOnce – the volume can be mounted as read-write by a single node
    • ReadOnlyMany – the volume can be mounted read-only by many nodes
    • ReadWriteMany – the volume can be mounted as read-write by many nodes

    These access modes are enforced by Kubernetes components like the kube-controller-manager and kubelet to ensure only certain pods are allowed to access a given PersistentVolume.

    What is this new access mode and how does it work?

    Kubernetes v1.22 introduced a fourth access mode for PVs and PVCs, that you can use for CSI volumes:

    • ReadWriteOncePod – the volume can be mounted as read-write by a single pod

    If you create a pod with a PVC that uses the ReadWriteOncePod access mode, Kubernetes ensures that pod is the only pod across your whole cluster that can read that PVC or write to it.

    If you create another pod that references the same PVC with this access mode, the pod will fail to start because the PVC is already in use by another pod. For example:

    Events:
      Type     Reason            Age   From               Message
      ----     ------            ----  ----               -------
      Warning  FailedScheduling  1s    default-scheduler  0/1 nodes are available: 1 node has pod using PersistentVolumeClaim with the same name and ReadWriteOncePod access mode.
    

    How is this different than the ReadWriteOnce access mode?

    The ReadWriteOnce access mode restricts volume access to a single node, which means it is possible for multiple pods on the same node to read from and write to the same volume. This could potentially be a major problem for some applications, especially if they require at most one writer for data safety guarantees.

    With ReadWriteOncePod these issues go away. Set the access mode on your PVC, and Kubernetes guarantees that only a single pod has access.

    How do I use it?

    The ReadWriteOncePod access mode is in alpha for Kubernetes v1.22 and is only supported for CSI volumes. As a first step you need to enable the ReadWriteOncePod feature gate for kube-apiserver, kube-scheduler, and kubelet. You can enable the feature by setting command line arguments:

    --feature-gates="...,ReadWriteOncePod=true"
    

    You also need to update the following CSI sidecars to these versions or greater:

    Creating a PersistentVolumeClaim

    In order to use the ReadWriteOncePod access mode for your PVs and PVCs, you will need to create a new PVC with the access mode:

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: single-writer-only
    spec:
      accessModes:
      - ReadWriteOncePod # Allow only a single pod to access single-writer-only.
      resources:
        requests:
          storage: 1Gi
    

    If your storage plugin supports dynamic provisioning, new PersistentVolumes will be created with the ReadWriteOncePod access mode applied.

    Migrating existing PersistentVolumes

    If you have existing PersistentVolumes, they can be migrated to use ReadWriteOncePod.

    In this example, we already have a "cat-pictures-pvc" PersistentVolumeClaim that is bound to a "cat-pictures-pv" PersistentVolume, and a "cat-pictures-writer" Deployment that uses this PersistentVolumeClaim.

    As a first step, you need to edit your PersistentVolume's spec.persistentVolumeReclaimPolicy and set it to Retain. This ensures your PersistentVolume will not be deleted when we delete the corresponding PersistentVolumeClaim:

    kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
    

    Next you need to stop any workloads that are using the PersistentVolumeClaim bound to the PersistentVolume you want to migrate, and then delete the PersistentVolumeClaim.

    Once that is done, you need to clear your PersistentVolume's spec.claimRef.uid to ensure PersistentVolumeClaims can bind to it upon recreation:

    kubectl scale --replicas=0 deployment cat-pictures-writer
    kubectl delete pvc cat-pictures-pvc
    kubectl patch pv cat-pictures-pv -p '{"spec":{"claimRef":{"uid":""}}}'
    

    After that you need to replace the PersistentVolume's access modes with ReadWriteOncePod:

    kubectl patch pv cat-pictures-pv -p '{"spec":{"accessModes":["ReadWriteOncePod"]}}'
    
    Note: The ReadWriteOncePod access mode cannot be combined with other access modes. Make sure ReadWriteOncePod is the only access mode on the PersistentVolume when updating, otherwise the request will fail.

    Next you need to modify your PersistentVolumeClaim to set ReadWriteOncePod as the only access mode. You should also set your PersistentVolumeClaim's spec.volumeName to the name of your PersistentVolume.
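
    The edited PVC could look roughly like this (the storage request is an assumption; keep whatever your original claim specified):

    kind: PersistentVolumeClaim
    apiVersion: v1
    metadata:
      name: cat-pictures-pvc
    spec:
      accessModes:
      - ReadWriteOncePod           # the one and only access mode
      volumeName: cat-pictures-pv  # bind directly to the existing PersistentVolume
      resources:
        requests:
          storage: 1Gi             # assumed size; match your original request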

    Once this is done, you can recreate your PersistentVolumeClaim and start up your workloads:

    # IMPORTANT: Make sure to edit your PVC in cat-pictures-pvc.yaml before applying. You need to:
    # - Set ReadWriteOncePod as the only access mode
    # - Set spec.volumeName to "cat-pictures-pv"
    
    kubectl apply -f cat-pictures-pvc.yaml
    kubectl apply -f cat-pictures-writer-deployment.yaml
    

    Lastly, you may edit your PersistentVolume's spec.persistentVolumeReclaimPolicy and set it back to Delete if you previously changed it.

    kubectl patch pv cat-pictures-pv -p '{"spec":{"persistentVolumeReclaimPolicy":"Delete"}}'
    

    You can read Configure a Pod to Use a PersistentVolume for Storage for more details on working with PersistentVolumes and PersistentVolumeClaims.

    What volume plugins support this?

    The only volume plugins that support this are CSI drivers. SIG Storage does not plan to support this for in-tree plugins because they are being deprecated as part of CSI migration. Support may be considered for beta for users that prefer to use the legacy in-tree volume APIs with CSI migration enabled.

    As a storage vendor, how do I add support for this access mode to my CSI driver?

    The ReadWriteOncePod access mode will work out of the box without any required updates to CSI drivers, but does require updates to CSI sidecars. With that being said, if you would like to stay up to date with the latest changes to the CSI specification (v1.5.0+), read on.

    Two new access modes were introduced to the CSI specification in order to disambiguate the legacy SINGLE_NODE_WRITER access mode. They are SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER. In order to communicate to sidecars (like the external-provisioner) that your driver understands and accepts these two new CSI access modes, your driver will also need to advertise the SINGLE_NODE_MULTI_WRITER capability for the controller service and node service.

    If you'd like to read up on the motivation for these access modes and capability bits, you can also read the CSI Specification Changes, Volume Capabilities section of KEP-2485 (ReadWriteOncePod PersistentVolume Access Mode).

    Update your CSI driver to use the new interface

    As a first step you will need to update your driver's container-storage-interface dependency to v1.5.0+, which contains support for these new access modes and capabilities.

    Accept new CSI access modes

    If your CSI driver contains logic for validating CSI access modes for requests, it may need updating. If it currently accepts SINGLE_NODE_WRITER, it should be updated to also accept SINGLE_NODE_SINGLE_WRITER and SINGLE_NODE_MULTI_WRITER.

    Using the GCP PD CSI driver validation logic as an example, here is how it can be extended:

    diff --git a/pkg/gce-pd-csi-driver/utils.go b/pkg/gce-pd-csi-driver/utils.go
    index 281242c..b6c5229 100644
    --- a/pkg/gce-pd-csi-driver/utils.go
    +++ b/pkg/gce-pd-csi-driver/utils.go
    @@ -123,6 +123,8 @@ func validateAccessMode(am *csi.VolumeCapability_AccessMode) error {
            case csi.VolumeCapability_AccessMode_SINGLE_NODE_READER_ONLY:
            case csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY:
            case csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER:
    +       case csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER:
    +       case csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER:
            default:
                    return fmt.Errorf("%v access mode is not supported for for PD", am.GetMode())
            }
    

    Your CSI driver will also need to return the new SINGLE_NODE_MULTI_WRITER capability as part of the ControllerGetCapabilities and NodeGetCapabilities RPCs.

    Using the GCP PD CSI driver capability advertisement logic as an example, here is how it can be extended:

    diff --git a/pkg/gce-pd-csi-driver/gce-pd-driver.go b/pkg/gce-pd-csi-driver/gce-pd-driver.go
    index 45903f3..0d7ea26 100644
    --- a/pkg/gce-pd-csi-driver/gce-pd-driver.go
    +++ b/pkg/gce-pd-csi-driver/gce-pd-driver.go
    @@ -56,6 +56,8 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                    csi.VolumeCapability_AccessMode_SINGLE_NODE_WRITER,
                    csi.VolumeCapability_AccessMode_MULTI_NODE_READER_ONLY,
                    csi.VolumeCapability_AccessMode_MULTI_NODE_MULTI_WRITER,
    +               csi.VolumeCapability_AccessMode_SINGLE_NODE_SINGLE_WRITER,
    +               csi.VolumeCapability_AccessMode_SINGLE_NODE_MULTI_WRITER,
            }
            gceDriver.AddVolumeCapabilityAccessModes(vcam)
            csc := []csi.ControllerServiceCapability_RPC_Type{
    @@ -67,12 +69,14 @@ func (gceDriver *GCEDriver) SetupGCEDriver(name, vendorVersion string, extraVolu
                    csi.ControllerServiceCapability_RPC_EXPAND_VOLUME,
                    csi.ControllerServiceCapability_RPC_LIST_VOLUMES,
                    csi.ControllerServiceCapability_RPC_LIST_VOLUMES_PUBLISHED_NODES,
    +               csi.ControllerServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
            }
            gceDriver.AddControllerServiceCapabilities(csc)
            ns := []csi.NodeServiceCapability_RPC_Type{
                    csi.NodeServiceCapability_RPC_STAGE_UNSTAGE_VOLUME,
                    csi.NodeServiceCapability_RPC_EXPAND_VOLUME,
                    csi.NodeServiceCapability_RPC_GET_VOLUME_STATS,
    +               csi.NodeServiceCapability_RPC_SINGLE_NODE_MULTI_WRITER,
            }
            gceDriver.AddNodeServiceCapabilities(ns)
    

    Implement NodePublishVolume behavior

    The CSI spec outlines expected behavior for the NodePublishVolume RPC when called more than once for the same volume but with different arguments (like the target path). Please refer to the second table in the NodePublishVolume section of the CSI spec for more details on expected behavior when implementing in your driver.

    Update your CSI sidecars

    When deploying your CSI drivers, you must update the following CSI sidecars to versions that depend on CSI spec v1.5.0+ and the Kubernetes v1.22 API. The minimum required versions are:

    What’s next?

    As part of the beta graduation for this feature, SIG Storage plans to update the Kubernetes scheduler to support pod preemption in relation to ReadWriteOncePod storage. This means if two pods request a PersistentVolumeClaim with ReadWriteOncePod, the pod with the highest priority will gain access to the PersistentVolumeClaim and any pod with lower priority will be preempted from the node and be unable to access the PersistentVolumeClaim.

    How can I learn more?

    Please see KEP-2485 for more details on the ReadWriteOncePod access mode and motivations for CSI spec changes.

    How do I get involved?

    The Kubernetes #csi Slack channel and any of the standard SIG Storage communication channels are great mediums to reach out to the SIG Storage and the CSI teams.

    Special thanks to the following people for their insightful reviews and design considerations:

    • Abdullah Gharaibeh (ahg-g)
    • Aldo Culquicondor (alculquicondor)
    • Ben Swartzlander (bswartz)
    • Deep Debroy (ddebroy)
    • Hemant Kumar (gnufied)
    • Humble Devassy Chirammal (humblec)
    • James DeFelice (jdef)
    • Jan Šafránek (jsafrane)
    • Jing Xu (jingxu97)
    • Jordan Liggitt (liggitt)
    • Michelle Au (msau42)
    • Saad Ali (saad-ali)
    • Tim Hockin (thockin)
    • Xing Yang (xing-yang)

    If you’re interested in getting involved with the design and development of CSI or any part of the Kubernetes storage system, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.