Kubernetes News

The Kubernetes project blog
Kubernetes.io
  1. Author: Sascha Grunert

    The Security Profiles Operator (SPO) makes managing seccomp, SELinux and AppArmor profiles within Kubernetes easier than ever. It allows cluster administrators to define the profiles in a predefined custom resource YAML, which then gets distributed by the SPO into the whole cluster. Modification and removal of the security profiles are managed by the operator in the same way, but that’s a small subset of its capabilities.

    Another core feature of the SPO is being able to stack seccomp profiles. This means that users can define a baseProfileName in the YAML specification, which then gets automatically resolved by the operator and combines the syscall rules. If a base profile has another baseProfileName, then the operator will recursively resolve the profiles up to a certain depth. A common use case is to define base profiles for low level container runtimes (like runc or crun) which then contain syscalls which are required in any case to run the container. Alternatively, application developers can define seccomp base profiles for their standard distribution containers and stack dedicated profiles for the application logic on top. This way developers can focus on maintaining seccomp profiles which are way simpler and scoped to the application logic, without having a need to take the whole infrastructure setup into account.
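
    To make this concrete, here's a minimal sketch of profile stacking with an in-cluster base profile (the profile names and syscall lists are illustrative placeholders, not a complete runtime profile):

    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      # Hypothetical base profile for the container runtime.
      name: runtime-base
    spec:
      defaultAction: SCMP_ACT_ERRNO
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - execve
            - read
            - write
    ---
    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      # Application profile stacked on top of the base profile.
      name: my-app
    spec:
      defaultAction: SCMP_ACT_ERRNO
      baseProfileName: runtime-base
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - uname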

    But how do we maintain those base profiles? For example, the set of syscalls required by a runtime can change over its release cycle, in the same way it can change for the main application. Base profiles have to be available in the same cluster, otherwise the main seccomp profile will fail to deploy. This means that they're tightly coupled to the main application profiles, which works against the main idea of base profiles. Distributing and managing them as plain files feels like an additional burden.

    OCI artifacts to the rescue

    The v0.8.0 release of the Security Profiles Operator supports managing base profiles as OCI artifacts! Imagine OCI artifacts as lightweight container images, storing files in layers in the same way images do, but without a process to be executed. Those artifacts can be used to store security profiles like regular container images in compatible registries. This means they can be versioned, namespaced and annotated similar to regular container images.

    To see how that works in action, specify a baseProfileName prefixed with oci:// within a seccomp profile CRD, for example:

    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      name: test
    spec:
      defaultAction: SCMP_ACT_ERRNO
      baseProfileName: oci://ghcr.io/security-profiles/runc:v1.1.5
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - uname
    

    The operator will take care of pulling the content by using oras, as well as verifying the sigstore (cosign) signatures of the artifact. If the artifacts are not signed, then the SPO will reject them. The resulting profile test will then contain all base syscalls from the remote runc profile plus the additional allowed uname one. It is also possible to reference the base profile by its digest (SHA256), which pins the artifact to be pulled more precisely, for example by referencing oci://ghcr.io/security-profiles/runc@sha256:380….

    The operator internally caches pulled artifacts for up to 24 hours and up to 1000 profiles, meaning that they will be refreshed after that time period, when the cache is full, or when the operator daemon gets restarted.

    Because the overall resulting syscalls are hidden from the user (only the baseProfileName is listed in the SeccompProfile, not the syscalls themselves), the operator additionally annotates the SeccompProfile with the final syscalls.

    Here's how the SeccompProfile looks after deployment:

    > kubectl describe seccompprofile test
    Name: test
    Namespace: security-profiles-operator
    Labels: spo.x-k8s.io/profile-id=SeccompProfile-test
    Annotations: syscalls:
     [{"names":["arch_prctl","brk","capget","capset","chdir","clone","close",...
    API Version: security-profiles-operator.x-k8s.io/v1beta1
    

    The SPO maintainers provide all public base profiles as part of the “Security Profiles” GitHub organization.

    Managing OCI security profiles

    Alright, now the official SPO provides a bunch of base profiles, but how can I define my own? Well, first of all we have to choose a working registry; a number of registries already support OCI artifacts.

    The Security Profiles Operator ships a new command line interface called spoc, a little helper tool for managing OCI profiles (among various other things that are out of scope of this blog post). The command spoc push can be used to push a security profile to a registry:

    > export USERNAME=my-user
    > export PASSWORD=my-pass
    > spoc push -f ./examples/baseprofile-crun.yaml ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.899886 Pushing profile ./examples/baseprofile-crun.yaml to: ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.899939 Creating file store in: /tmp/push-3618165827
    16:35:43.899947 Adding profile to store: ./examples/baseprofile-crun.yaml
    16:35:43.900061 Packing files
    16:35:43.900282 Verifying reference: ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.900310 Using tag: v1.8.3
    16:35:43.900313 Creating repository for ghcr.io/security-profiles/crun
    16:35:43.900319 Using username and password
    16:35:43.900321 Copying profile to repository
    16:35:46.976108 Signing container image
    Generating ephemeral keys...
    Retrieving signed certificate...
    Note that there may be personally identifiable information associated with this signed artifact.
    This may include the email address associated with the account with which you authenticate.
    This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.
    By typing 'y', you attest that you grant (or have permission to grant) and agree to have this information stored permanently in transparency logs.
    Your browser will now be opened to:
    https://oauth2.sigstore.dev/auth/auth?access_type=…
    Successfully verified SCT...
    tlog entry created with index: 16520520
    Pushing signature to: ghcr.io/security-profiles/crun
    

    You can see that the tool automatically signs the artifact and pushes the ./examples/baseprofile-crun.yaml to the registry, which is then directly ready for usage within the SPO. If username and password authentication is required, either use the --username, -u flag or export the USERNAME environment variable. To set the password, export the PASSWORD environment variable.

    It is possible to add custom annotations to the security profile by using the --annotations / -a flag multiple times in KEY:VALUE format. Those have no effect for now, but at some later point additional features of the operator may rely on them.

    The spoc client is also able to pull security profiles from OCI artifact compatible registries. To do that, just run spoc pull:

    > spoc pull ghcr.io/security-profiles/runc:v1.1.5
    16:32:29.795597 Pulling profile from: ghcr.io/security-profiles/runc:v1.1.5
    16:32:29.795610 Verifying signature
    
    Verification for ghcr.io/security-profiles/runc:v1.1.5 --
    The following checks were performed on each of these signatures:
     - Existence of the claims in the transparency log was verified offline
     - The code-signing certificate was verified using trusted certificate authority certificates
    
    [{"critical":{"identity":{"docker-reference":"ghcr.io/security-profiles/runc"},…}}]
    16:32:33.208695 Creating file store in: /tmp/pull-3199397214
    16:32:33.208713 Verifying reference: ghcr.io/security-profiles/runc:v1.1.5
    16:32:33.208718 Creating repository for ghcr.io/security-profiles/runc
    16:32:33.208742 Using tag: v1.1.5
    16:32:33.208743 Copying profile from repository
    16:32:34.119652 Reading profile
    16:32:34.119677 Trying to unmarshal seccomp profile
    16:32:34.120114 Got SeccompProfile: runc-v1.1.5
    16:32:34.120119 Saving profile in: /tmp/profile.yaml
    

    The profile can now be found in /tmp/profile.yaml, or in the output file specified via --output-file / -o. We can specify a username and password in the same way as for spoc push.

    spoc makes it easy to manage security profiles as OCI artifacts, which can then be consumed directly by the operator itself.

    That was our compact journey through the latest possibilities of the Security Profiles Operator! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

  2. Author: Sascha Grunert

    The Security Profiles Operator (SPO) is a feature-rich operator for Kubernetes to make managing seccomp, SELinux and AppArmor profiles easier than ever. Recording those profiles from scratch is one of the key features of this operator, which usually involves the integration into large CI/CD systems. Being able to test the recording capabilities of the operator in edge cases is one of the recent development efforts of the SPO and makes it excitingly easy to play around with seccomp profiles.

    Recording seccomp profiles with spoc record

    The v0.8.0 release of the Security Profiles Operator shipped a new command line interface called spoc, a little helper tool for recording and replaying seccomp profiles among various other things that are out of scope of this blog post.

    Recording a seccomp profile requires a binary to be executed, which can be a simple Go application that just calls uname(2):

    package main

    import (
        "syscall"
    )

    func main() {
        utsname := syscall.Utsname{}
        if err := syscall.Uname(&utsname); err != nil {
            panic(err)
        }
    }
    

    Building a binary from that code can be done by:

    > go build -o main main.go
    > ldd ./main
     not a dynamic executable
    

    Now it's possible to download the latest binary of spoc from GitHub and run the application on Linux with it:

    > sudo ./spoc record ./main
    10:08:25.591945 Loading bpf module
    10:08:25.591958 Using system btf file
    libbpf: loading object 'recorder.bpf.o' from buffer
    libbpf: prog 'sys_enter': relo #3: patched insn #22 (ALU/ALU64) imm 16 -> 16
    10:08:25.610767 Getting bpf program sys_enter
    10:08:25.610778 Attaching bpf tracepoint
    10:08:25.611574 Getting syscalls map
    10:08:25.611582 Getting pid_mntns map
    10:08:25.613097 Module successfully loaded
    10:08:25.613311 Processing events
    10:08:25.613693 Running command with PID: 336007
    10:08:25.613835 Received event: pid: 336007, mntns: 4026531841
    10:08:25.613951 No container ID found for PID (pid=336007, mntns=4026531841, err=unable to find container ID in cgroup path)
    10:08:25.614856 Processing recorded data
    10:08:25.614975 Found process mntns 4026531841 in bpf map
    10:08:25.615110 Got syscalls: read, close, mmap, rt_sigaction, rt_sigprocmask, madvise, nanosleep, clone, uname, sigaltstack, arch_prctl, gettid, futex, sched_getaffinity, exit_group, openat
    10:08:25.615195 Adding base syscalls: access, brk, capget, capset, chdir, chmod, chown, close_range, dup2, dup3, epoll_create1, epoll_ctl, epoll_pwait, execve, faccessat2, fchdir, fchmodat, fchown, fchownat, fcntl, fstat, fstatfs, getdents64, getegid, geteuid, getgid, getpid, getppid, getuid, ioctl, keyctl, lseek, mkdirat, mknodat, mount, mprotect, munmap, newfstatat, openat2, pipe2, pivot_root, prctl, pread64, pselect6, readlink, readlinkat, rt_sigreturn, sched_yield, seccomp, set_robust_list, set_tid_address, setgid, setgroups, sethostname, setns, setresgid, setresuid, setsid, setuid, statfs, statx, symlinkat, tgkill, umask, umount2, unlinkat, unshare, write
    10:08:25.616293 Wrote seccomp profile to: /tmp/profile.yaml
    10:08:25.616298 Unloading bpf module
    

    I have to execute spoc as root because it internally runs an eBPF program, reusing the same code as the Security Profiles Operator itself. I can see that the BPF module got loaded successfully and spoc attached the required tracepoint to it. Then it tracks the main application by using its mount namespace and processes the recorded syscall data. eBPF programs see the whole context of the kernel, which means that spoc observes all syscalls of the system, but does not interfere with their execution.

    The logs indicate that spoc found the syscalls read, close, mmap and so on, including uname. All syscalls other than uname come from the Go runtime and its garbage collector, which already adds overhead to a basic application like the one in our demo. I can also see from the log line Adding base syscalls: … that spoc adds a bunch of base syscalls to the resulting profile. Those are used by the OCI runtime (like runc or crun) in order to be able to run a container. This means that spoc can be used to record seccomp profiles which can then be containerized directly. This behavior can be disabled in spoc by using the --no-base-syscalls / -n flag, or customized via the --base-syscalls / -b flag. This can be helpful in cases where OCI runtimes other than crun and runc are used, or if I just want to record the seccomp profile for the application and stack it with another base profile.

    The resulting profile is now available in /tmp/profile.yaml, but the default output location can be changed using the --output-file / -o flag:

    > cat /tmp/profile.yaml
    
    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      creationTimestamp: null
      name: main
    spec:
      architectures:
        - SCMP_ARCH_X86_64
      defaultAction: SCMP_ACT_ERRNO
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - access
            - arch_prctl
            - brk
            - …
            - uname
            - …
    status: {}
    

    The seccomp profile Custom Resource Definition (CRD) can be directly used together with the Security Profiles Operator for managing it within Kubernetes. spoc is also capable of producing raw seccomp profiles (as JSON), by using the --type/-t raw-seccomp flag:

    > sudo ./spoc record --type raw-seccomp ./main
    52.628827 Wrote seccomp profile to: /tmp/profile.json
    
    > jq . /tmp/profile.json
    
    {
     "defaultAction": "SCMP_ACT_ERRNO",
     "architectures": ["SCMP_ARCH_X86_64"],
     "syscalls": [
     {
     "names": ["access", "…", "write"],
     "action": "SCMP_ACT_ALLOW"
     }
     ]
    }
    

    The utility spoc record allows us to record complex seccomp profiles directly from binary invocations on any Linux system capable of running the eBPF code within the kernel. But it can do more: how about modifying the seccomp profile and then testing it by using spoc run?

    Running seccomp profiles with spoc run

    spoc is also able to run binaries with applied seccomp profiles, making it easy to test any modification to it. To do that, just run:

    > sudo ./spoc run ./main
    10:29:58.153263 Reading file /tmp/profile.yaml
    10:29:58.153311 Assuming YAML profile
    10:29:58.154138 Setting up seccomp
    10:29:58.154178 Load seccomp profile
    10:29:58.154189 Starting audit log enricher
    10:29:58.154224 Enricher reading from file /var/log/audit/audit.log
    10:29:58.155356 Running command with PID: 437880
    >
    

    It looks like the application exited successfully, which is expected because I have not modified the previously recorded profile yet. I can also specify a custom location for the profile by using the --profile / -p flag, but this was not necessary because I did not change the default output location of the recording. spoc automatically determines whether the seccomp profile is raw (JSON) or CRD-based (YAML) and then applies it to the process.

    The Security Profiles Operator supports a log enricher feature, which provides additional seccomp related information by parsing the audit logs. spoc run uses the enricher in the same way to provide more data to the end users when it comes to debugging seccomp profiles.

    Now I have to modify the profile to see anything valuable in the output. For example, I could remove the allowed uname syscall:

    > jq 'del(.syscalls[0].names[] | select(. == "uname"))' /tmp/profile.json > /tmp/no-uname-profile.json
    

    And then try to run it again with the new profile /tmp/no-uname-profile.json:

    > sudo ./spoc run -p /tmp/no-uname-profile.json ./main
    10:39:12.707798 Reading file /tmp/no-uname-profile.json
    10:39:12.707892 Setting up seccomp
    10:39:12.707920 Load seccomp profile
    10:39:12.707982 Starting audit log enricher
    10:39:12.707998 Enricher reading from file /var/log/audit/audit.log
    10:39:12.709164 Running command with PID: 480512
    panic: operation not permitted
    
    goroutine 1 [running]:
    main.main()
     /path/to/main.go:10 +0x85
    10:39:12.713035 Unable to run: launch runner: wait for command: exit status 2
    

    Alright, that was expected! The applied seccomp profile blocks the uname syscall, which results in an "operation not permitted" error. This error is pretty generic and does not provide any hint on what got blocked by seccomp. It is generally extremely difficult to predict how an application behaves if individual syscalls are forbidden by seccomp. The application might terminate, as in our simple demo, but blocked syscalls could also lead to strange misbehavior where the application does not stop at all.

    If I now change the default seccomp action of the profile from SCMP_ACT_ERRNO to SCMP_ACT_LOG like this:

    > jq '.defaultAction = "SCMP_ACT_LOG"' /tmp/no-uname-profile.json > /tmp/no-uname-profile-log.json
    

    Then the log enricher will give us a hint that the uname syscall got blocked when using spoc run:

    > sudo ./spoc run -p /tmp/no-uname-profile-log.json ./main
    10:48:07.470126 Reading file /tmp/no-uname-profile-log.json
    10:48:07.470234 Setting up seccomp
    10:48:07.470245 Load seccomp profile
    10:48:07.470302 Starting audit log enricher
    10:48:07.470339 Enricher reading from file /var/log/audit/audit.log
    10:48:07.470889 Running command with PID: 522268
    10:48:07.472007 Seccomp: uname (63)
    

    The application will not terminate any more, but seccomp will log the behavior to /var/log/audit/audit.log and spoc will parse the data to correlate it directly to our program. Generating the log messages to the audit subsystem comes with a large performance overhead and should be handled with care in production systems. It also comes with a security risk when running untrusted apps in audit mode in production environments.

    This demo should give you an impression how to debug seccomp profile issues with applications, probably by using our shiny new helper tool powered by the features of the Security Profiles Operator. spoc is a flexible and portable binary suitable for edge cases where resources are limited and even Kubernetes itself may not be available with its full capabilities.

    Thank you for reading this blog post! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

  3. Authors: Anish Ramasekar, Mo Khan, and Rita Zhang (Microsoft)

    With Kubernetes 1.27, we (SIG Auth) are moving Key Management Service (KMS) v2 API to beta.

    What is KMS?

    One of the first things to consider when securing a Kubernetes cluster is encrypting etcd data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.

    KMS v1 has been a feature of Kubernetes since v1.10, and has been in beta since v1.12. KMS v2 was introduced as alpha in v1.25.

    What’s new in v2beta1?

    The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The data is encrypted using a data encryption key (DEK). The DEKs are encrypted with a key encryption key (KEK) that is stored and managed in a remote KMS. With KMS v1, a new DEK is generated for each encryption. With KMS v2, a new DEK is only generated on server startup and when the KMS plugin informs the API server that a KEK rotation has occurred.
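
    As a rough sketch of how a KMS v2 provider is wired into the API server (the plugin name and socket path below are placeholders, not something from this post), the EncryptionConfiguration passed via --encryption-provider-config could look like this:

    apiVersion: apiserver.config.k8s.io/v1
    kind: EncryptionConfiguration
    resources:
      - resources:
          - secrets
        providers:
          - kms:
              # apiVersion v2 selects the KMS v2 API described in this post.
              apiVersion: v2
              name: my-kms-plugin                   # placeholder plugin name
              endpoint: unix:///tmp/kms-plugin.sock # placeholder socket path
              timeout: 3s
          - identity: {}

    See Using a KMS provider for data encryption for the authoritative configuration reference.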

    Sequence Diagram

    Encrypt Request

    Sequence diagram for KMSv2 beta Encrypt

    Decrypt Request

    Sequence diagram for KMSv2 beta Decrypt

    Status Request

    Sequence diagram for KMSv2 beta Status

    Generate Data Encryption Key (DEK)

    Sequence diagram for KMSv2 beta Generate DEK

    Performance Improvements

    With KMS v2, we have made significant improvements to the performance of the KMS encryption provider. In case of KMS v1, a new DEK is generated for every encryption. This means that for every write request, the API server makes a call to the KMS plugin to encrypt the DEK using the remote KEK. The API server also has to cache the DEKs to avoid making a call to the KMS plugin for every read request. When the API server restarts, it has to populate the cache by making a call to the KMS plugin for every DEK in the etcd store based on the cache size. This is a significant overhead for the API server. With KMS v2, the API server generates a DEK at startup and caches it. The API server also makes a call to the KMS plugin to encrypt the DEK using the remote KEK. This is a one-time call at startup and on KEK rotation. The API server then uses the cached DEK to encrypt the resources. This reduces the number of calls to the KMS plugin and improves the overall latency of the API server requests.

    We conducted a test that created 12k secrets and measured the time taken for the API server to encrypt the resources. The metric used was apiserver_storage_transformation_duration_seconds. For KMS v1, the test was run on a managed Kubernetes v1.25 cluster with 2 nodes. There was no additional load on the cluster during the test. For KMS v2, the test was run in the Kubernetes CI environment.

    KMS Provider   Time taken (95th percentile)
    KMS v1         160ms
    KMS v2         80μs

    The results show that the KMS v2 encryption provider is three orders of magnitude faster than the KMS v1 encryption provider.

    What's next?

    For Kubernetes v1.28, we expect the feature to stay in beta. In the coming releases we want to investigate:

    • Cryptographic changes to remove the limitation on VM state store.
    • Kubernetes REST API changes to enable a more robust story around key rotation.
    • Handling undecryptable resources. Refer to the KEP for details.

    You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.

    Call to action

    In this blog post, we have covered the improvements made to the KMS encryption provider in Kubernetes v1.27. We have also discussed the new KMS v2 API and how it works. We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes KMS plugin implementors as they go through the process of building their integrations with this new API. Please reach out to us on the #sig-auth-kms-dev channel on Kubernetes Slack.

    How to get involved

    If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

    You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.

    Acknowledgements

    This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.

  4. Authors: Paco Xu (DaoCloud), Sergey Kanzhelev (Google), Ruiwen Zhao (Google)

    How can Pod start-up be accelerated on nodes in large clusters? This is a common issue that cluster administrators may face.

    This blog post focuses on methods to speed up pod start-up from the kubelet side. It does not cover the time taken by the controller-manager to create pods through the kube-apiserver, nor does it include the pod's scheduling time or the time spent in webhooks executed on it.

    We have mentioned some important factors here to consider from the kubelet's perspective, but this is not an exhaustive list. As Kubernetes v1.27 is released, this blog highlights significant changes in v1.27 that aid in speeding up pod start-up.

    Parallel container image pulls

    Pulling images always takes some time and what's worse is that image pulls are done serially by default. In other words, kubelet will send only one image pull request to the image service at a time. Other image pull requests have to wait until the one being processed is complete.

    To enable parallel image pulls, set the serializeImagePulls field to false in the kubelet configuration. When serializeImagePulls is disabled, requests for image pulls are immediately sent to the image service and multiple images can be pulled concurrently.

    Maximum parallel image pulls will help secure your node from overloading on image pulling

    We introduced a new feature in kubelet that sets a limit on the number of parallel image pulls at the node level. This limit restricts the maximum number of images that can be pulled simultaneously. If there is an image pull request beyond this limit, it will be blocked until one of the ongoing image pulls finishes. Before enabling this feature, please ensure that your container runtime's image service can handle parallel image pulls effectively.

    To limit the number of simultaneous image pulls, you can configure the maxParallelImagePulls field in kubelet. By setting maxParallelImagePulls to a value of n, only n images will be pulled concurrently. Any additional image pulls beyond this limit will wait until at least one ongoing pull is complete.
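
    As a minimal kubelet configuration sketch (the limit of 5 is an arbitrary example value, not a recommendation):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Allow concurrent image pulls instead of the serial default.
    serializeImagePulls: false
    # Cap how many images may be pulled at the same time (example value).
    maxParallelImagePulls: 5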

    You can find more details in the associated KEP: Kubelet limit of Parallel Image Pulls (KEP-3673).

    Raised default API query-per-second limits for kubelet

    To improve pod startup in scenarios with multiple pods on a node, particularly sudden scaling situations, it is necessary for Kubelet to synchronize the pod status and prepare configmaps, secrets, or volumes. This requires a large bandwidth to access kube-apiserver.

    In versions prior to v1.27, the default kubeAPIQPS was 5 and kubeAPIBurst was 10. However, the kubelet in v1.27 has increased these defaults to 50 and 100 respectively for better performance during pod startup. It's worth noting that this isn't the only reason why we've bumped up the API QPS limits for Kubelet.

    1. It has the potential to be hugely throttled now (default QPS = 5)
    2. In large clusters they can generate significant load anyway as there are a lot of them
    3. They have a dedicated PriorityLevel and FlowSchema that we can easily control

    Previously, we often encountered volume mount timeouts on the kubelet on nodes with more than 50 pods during pod start-up. We suggest that cluster operators bump kubeAPIQPS to 20 and kubeAPIBurst to 40, especially if using bare metal nodes.
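
    On such older releases, the bump can be applied through the kubelet configuration, for example (a sketch using the values suggested above):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Raise the kubelet's client-side rate limits towards the kube-apiserver.
    kubeAPIQPS: 20
    kubeAPIBurst: 40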

    More details can be found in the KEP https://kep.k8s.io/1040 and the pull request #116121.

    Event triggered updates to container status

    Evented PLEG (PLEG is short for "Pod Lifecycle Event Generator") is set to be in beta for v1.27. Kubernetes offers two ways for the kubelet to detect Pod lifecycle events, such as the last process in a container shutting down. In Kubernetes v1.27, the event-based mechanism has graduated to beta but remains disabled by default. If you do explicitly switch to event-based lifecycle change detection, the kubelet is able to start Pods more quickly than with the default approach that relies on polling. The default mechanism, polling for lifecycle changes, adds noticeable overhead; this affects the kubelet's ability to handle different tasks in parallel, and leads to poor performance and reliability issues. For these reasons, we recommend that you switch your nodes to use event-based pod lifecycle change detection.
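
    Switching over amounts to enabling the feature gate in the kubelet configuration, assuming the container runtime on the node also supports CRI container events (a sketch, not a complete configuration):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    featureGates:
      # Beta in v1.27, disabled by default; the container runtime must also
      # emit CRI container events for this to work.
      EventedPLEG: true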

    Further details can be found in the KEP https://kep.k8s.io/3386 and Switching From Polling to CRI Event-based Updates to Container Status.

    Raise your pod resource limit if needed

    During start-up, some pods may consume a considerable amount of CPU or memory. If the CPU limit is low, this can significantly slow down the pod start-up process. To improve the memory management, Kubernetes v1.22 introduced a feature gate called MemoryQoS to kubelet. This feature enables kubelet to set memory QoS at container, pod, and QoS levels for better protection and guaranteed quality of memory when running with cgroups v2. Although it has benefits, it is possible that enabling this feature gate may affect the start-up speed of the pod if the pod startup consumes a large amount of memory.

    Kubelet configuration now includes memoryThrottlingFactor. This factor is multiplied by the memory limit or node allocatable memory to set the cgroupv2 memory.high value for enforcing MemoryQoS. Decreasing this factor sets a lower high limit for container cgroups, increasing reclaim pressure. Increasing this factor will put less reclaim pressure. The default value is 0.8 initially and will change to 0.9 in Kubernetes v1.27. This parameter adjustment can reduce the potential impact of this feature on pod startup speed.
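
    A minimal sketch of the corresponding kubelet configuration (the factor shown is simply the v1.27 default mentioned above):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    featureGates:
      # MemoryQoS requires cgroups v2 on the node.
      MemoryQoS: true
    # Fraction of the memory limit used to set cgroupv2 memory.high.
    memoryThrottlingFactor: 0.9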

    Further details can be found in the KEP https://kep.k8s.io/2570.

    What's more?

    In Kubernetes v1.26, a new histogram metric pod_start_sli_duration_seconds was added for Pod startup latency SLI/SLO details. Additionally, the kubelet log will now display more information about pod start-related timestamps, as shown below:

    Dec 30 15:33:13.375379 e2e-022435249c-674b9-minion-group-gdj4 kubelet[8362]: I1230 15:33:13.375359 8362 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/konnectivity-agent-gnc9k" podStartSLOduration=-9.223372029479458e+09 pod.CreationTimestamp="2022-12-30 15:33:06 +0000 UTC" firstStartedPulling="2022-12-30 15:33:09.258791695 +0000 UTC m=+13.029631711" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2022-12-30 15:33:13.375009262 +0000 UTC m=+17.145849275" watchObservedRunningTime="2022-12-30 15:33:13.375317944 +0000 UTC m=+17.146157970"

    The SELinux Relabeling with Mount Options feature moved to Beta in v1.27. This feature speeds up container startup by mounting volumes with the correct SELinux label instead of changing each file on the volumes recursively. Further details can be found in the KEP https://kep.k8s.io/1710.

    To identify the cause of slow pod startup, analyzing metrics and logs can be helpful. Other factors that may impact pod startup include container runtime, disk speed, CPU and memory resources on the node.

    SIG Node is responsible for ensuring fast Pod startup times, while addressing issues in large clusters falls under the purview of SIG Scalability as well.

  5. Author: Vinay Kulkarni (Kubescaler Labs)

    If you have deployed Kubernetes pods with CPU and/or memory resources specified, you may have noticed that changing the resource values involves restarting the pod. This has been a disruptive operation for running workloads... until now.

    In Kubernetes v1.27, we have added a new alpha feature that allows users to resize CPU/memory resources allocated to pods without restarting the containers. To facilitate this, the resources field in a pod's containers now allows mutation for cpu and memory resources. They can be changed simply by patching the running pod spec.

    This also means that the resources field in the pod spec can no longer be relied upon as an indicator of the pod's actual resources. Monitoring tools and other such applications must now look at new fields in the pod's status. Kubernetes queries the actual CPU and memory requests and limits enforced on the running containers via a CRI (Container Runtime Interface) API call to the runtime, such as containerd, which is responsible for running the containers. The response from the container runtime is reflected in the pod's status.

    In addition, a new restartPolicy for resize has been added. It gives users control over how their containers are handled when resources are resized.
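
    As an illustrative sketch (the container name and image are placeholders), a pod could declare per-resource resize restart policies like this:

    apiVersion: v1
    kind: Pod
    metadata:
      name: resize-demo
    spec:
      containers:
        - name: app
          image: registry.k8s.io/pause:3.9   # placeholder image
          resizePolicy:
            # CPU can be resized in place without restarting the container.
            - resourceName: cpu
              restartPolicy: NotRequired
            # Memory changes restart this container.
            - resourceName: memory
              restartPolicy: RestartContainer
          resources:
            requests:
              cpu: 500m
              memory: 128Mi
            limits:
              cpu: 500m
              memory: 128Mi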

    What's new in v1.27?

    Besides the addition of resize policy in the pod's spec, a new field named allocatedResources has been added to containerStatuses in the pod's status. This field reflects the node resources allocated to the pod's containers.

    In addition, a new field called resources has been added to the container's status. This field reflects the actual resource requests and limits configured on the running containers as reported by the container runtime.

    Lastly, a new field named resize has been added to the pod's status to show the status of the last requested resize. A value of Proposed is an acknowledgement of the requested resize and indicates that request was validated and recorded. A value of InProgress indicates that the node has accepted the resize request and is in the process of applying the resize request to the pod's containers. A value of Deferred means that the requested resize cannot be granted at this time, and the node will keep retrying. The resize may be granted when other pods leave and free up node resources. A value of Infeasible is a signal that the node cannot accommodate the requested resize. This can happen if the requested resize exceeds the maximum resources the node can ever allocate for a pod.

    When to use this feature

    Here are a few examples where this feature may be useful:

    • Pod is running on node but with either too much or too little resources.
    • Pods are not being scheduled due to a lack of sufficient CPU or memory in a cluster that is actually underutilized, because the running pods were overprovisioned.
    • Evicting stateful pods that need more resources in order to reschedule them on bigger nodes is an expensive or disruptive operation, when other lower-priority pods on the node could instead be resized down or moved.

    How to use this feature

    In order to use this feature in v1.27, the InPlacePodVerticalScaling feature gate must be enabled. A local cluster with this feature enabled can be started as shown below:

    root@vbuild:~/go/src/k8s.io/kubernetes# FEATURE_GATES=InPlacePodVerticalScaling=true ./hack/local-up-cluster.sh
    go version go1.20.2 linux/arm64
    +++ [0320 13:52:02] Building go targets for linux/arm64
    k8s.io/kubernetes/cmd/kubectl (static)
    k8s.io/kubernetes/cmd/kube-apiserver (static)
    k8s.io/kubernetes/cmd/kube-controller-manager (static)
    k8s.io/kubernetes/cmd/cloud-controller-manager (non-static)
    k8s.io/kubernetes/cmd/kubelet (non-static)
    ...
    ...
    Logs:
    /tmp/etcd.log
    /tmp/kube-apiserver.log
    /tmp/kube-controller-manager.log
    /tmp/kube-proxy.log
    /tmp/kube-scheduler.log
    /tmp/kubelet.log
    To start using your cluster, you can open up another terminal/tab and run:
    export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
    cluster/kubectl.sh
    Alternatively, you can write to the default kubeconfig:
    export KUBERNETES_PROVIDER=local
    cluster/kubectl.sh config set-cluster local --server=https://localhost:6443 --certificate-authority=/var/run/kubernetes/server-ca.crt
    cluster/kubectl.sh config set-credentials myself --client-key=/var/run/kubernetes/client-admin.key --client-certificate=/var/run/kubernetes/client-admin.crt
    cluster/kubectl.sh config set-context local --cluster=local --user=myself
    cluster/kubectl.sh config use-context local
    cluster/kubectl.sh
    

    Once the local cluster is up and running, Kubernetes users can schedule pods with resources, and resize the pods via kubectl. An example of how to use this feature is illustrated in the demo video.
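
    As a rough illustration (reusing the hypothetical resize-demo pod sketched earlier), a resize is just a patch of the container's resources, applied for example with kubectl patch pod resize-demo --patch-file cpu-resize.yaml:

    # cpu-resize.yaml: strategic merge patch raising the CPU request and limit.
    spec:
      containers:
        - name: app
          resources:
            requests:
              cpu: "1"
            limits:
              cpu: "1"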

    Example Use Cases

    Cloud-based Development Environment

    In this scenario, developers or development teams write their code locally but build and test their code in Kubernetes pods with consistent configs that reflect production use. Such pods need minimal resources when the developers are writing code, but need significantly more CPU and memory when they build their code or run a battery of tests. This use case can leverage the in-place pod resize feature (with a little help from eBPF) to quickly resize the pod's resources and prevent the kernel OOM (out of memory) killer from terminating their processes.

    This KubeCon North America 2022 conference talk illustrates the use case.

    Java processes initialization CPU requirements

    Some Java applications may need significantly more CPU during initialization than what is needed during normal process operation time. If such applications specify CPU requests and limits suited for normal operation, they may suffer from very long startup times. Such pods can request higher CPU values at the time of pod creation, and can be resized down to normal running needs once the application has finished initializing.

    Known Issues

    This feature enters v1.27 at alpha stage. Below are a few known issues users may encounter:

    • containerd versions below v1.6.9 do not have the CRI support needed for full end-to-end operation of this feature. Attempts to resize pods will appear to be stuck in the InProgress state, and the resources field in the pod's status is never updated even though the new resources may have been enacted on the running containers.
    • Pod resize may encounter a race condition with other pod updates, causing delayed enactment of pod resize.
    • Reflecting the resized container resources in pod's status may take a while.
    • Static CPU management policy is not supported with this feature.

    Credits

    This feature is a result of the efforts of a very collaborative Kubernetes community. Here's a little shoutout to just a few of the many many people that contributed countless hours of their time and helped make this happen.

    • @thockin for detail-oriented API design and air-tight code reviews.
    • @derekwaynecarr for simplifying the design and thorough API and node reviews.
    • @dchen1107 for bringing vast knowledge from Borg and helping us avoid pitfalls.
    • @ruiwen-zhao for adding containerd support that enabled full E2E implementation.
    • @wangchen615 for implementing comprehensive E2E tests and driving scheduler fixes.
    • @bobbypage for invaluable help getting CI ready and quickly investigating issues, covering for me on my vacation.
    • @Random-Liu for thorough kubelet reviews and identifying problematic race conditions.
    • @Huang-Wei, @ahg-g, @alculquicondor for helping get scheduler changes done.
    • @mikebrow @marosset for reviews on short notice that helped CRI changes make it into v1.25.
    • @endocrimes, @ehashman for helping ensure that the oft-overlooked tests are in good shape.
    • @mrunalp for reviewing cgroupv2 changes and ensuring clean handling of v1 vs v2.
    • @liggitt, @gjkim42 for tracking down, root-causing important missed issues post-merge.
    • @SergeyKanzhelev for supporting and shepherding various issues during the home stretch.
    • @pdgetrf for making the first prototype a reality.
    • @dashpole for bringing me up to speed on 'the Kubernetes way' of doing things.
    • @bsalamat, @kgolab for very thoughtful insights and suggestions in the early stages.
    • @sftim, @tengqm for ensuring docs are easy to follow.
    • @dims for being omnipresent and helping make merges happen at critical hours.
    • Release teams for ensuring that the project stayed healthy.

    And a big thanks to my very supportive management Dr. Xiaoning Ding and Dr. Ying Xiong for their patience and encouragement.
