Kubernetes News

The Kubernetes project blog
  1. Author: Sascha Grunert

    The Security Profiles Operator (SPO) makes managing seccomp, SELinux and AppArmor profiles within Kubernetes easier than ever. It allows cluster administrators to define the profiles in a predefined custom resource YAML, which then gets distributed by the SPO into the whole cluster. Modification and removal of the security profiles are managed by the operator in the same way, but that’s a small subset of its capabilities.

    Another core feature of the SPO is being able to stack seccomp profiles. This means that users can define a baseProfileName in the YAML specification, which then gets automatically resolved by the operator and combines the syscall rules. If a base profile has another baseProfileName, then the operator will recursively resolve the profiles up to a certain depth. A common use case is to define base profiles for low level container runtimes (like runc or crun) which then contain syscalls which are required in any case to run the container. Alternatively, application developers can define seccomp base profiles for their standard distribution containers and stack dedicated profiles for the application logic on top. This way developers can focus on maintaining seccomp profiles which are way simpler and scoped to the application logic, without having a need to take the whole infrastructure setup into account.

    But how to maintain those base profiles? For example, the amount of required syscalls for a runtime can change over its release cycle in the same way it can change for the main application. Base profiles have to be available in the same cluster, otherwise the main seccomp profile will fail to deploy. This means that they’re tightly coupled to the main application profiles, which acts against the main idea of base profiles. Distributing and managing them as plain files feels like an additional burden to solve.

    OCI artifacts to the rescue

    The v0.8.0 release of the Security Profiles Operator supports managing base profiles as OCI artifacts! Imagine OCI artifacts as lightweight container images, storing files in layers in the same way images do, but without a process to be executed. Those artifacts can be used to store security profiles like regular container images in compatible registries. This means they can be versioned, namespaced and annotated similar to regular container images.

    To see how that works in action, specify a baseProfileName prefixed with oci:// within a seccomp profile CRD, for example:

    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      name: test
    spec:
      defaultAction: SCMP_ACT_ERRNO
      baseProfileName: oci://ghcr.io/security-profiles/runc:v1.1.5
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - uname

    The operator will take care of pulling the content by using oras, as well as verifying the sigstore (cosign) signatures of the artifact. If the artifacts are not signed, then the SPO will reject them. The resulting profile test will then contain all base syscalls from the remote runc profile plus the additional allowed uname one. It is also possible to reference the base profile by its digest (SHA256), which pins the exact artifact to be pulled, for example oci://ghcr.io/security-profiles/runc@sha256:380….

    The operator internally caches pulled artifacts for up to 24 hours and up to 1000 profiles, meaning that they will be refreshed after that time period, when the cache is full, or when the operator daemon gets restarted.

    Because the overall resulting syscalls are hidden from the user (only the baseProfileName is listed in the SeccompProfile, not the syscalls themselves), I'll additionally annotate the SeccompProfile with the final syscalls.

    Here's how the SeccompProfile looks after I annotate it:

    > kubectl describe seccompprofile test
    Name: test
    Namespace: security-profiles-operator
    Labels: spo.x-k8s.io/profile-id=SeccompProfile-test
    Annotations: syscalls:
    API Version: security-profiles-operator.x-k8s.io/v1beta1

    The SPO maintainers provide all public base profiles as part of the “Security Profiles” GitHub organization.

    Managing OCI security profiles

    Alright, the official SPO now provides a bunch of base profiles, but how can I define my own? Well, first of all we have to choose a working registry; a number of registries already support OCI artifacts, including GitHub's ghcr.io used in the examples below.

    The Security Profiles Operator ships a new command line interface called spoc, a little helper tool for managing OCI profiles, among various other things that are out of scope of this blog post. The command spoc push can be used to push a security profile to a registry:

    > export USERNAME=my-user
    > export PASSWORD=my-pass
    > spoc push -f ./examples/baseprofile-crun.yaml ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.899886 Pushing profile ./examples/baseprofile-crun.yaml to: ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.899939 Creating file store in: /tmp/push-3618165827
    16:35:43.899947 Adding profile to store: ./examples/baseprofile-crun.yaml
    16:35:43.900061 Packing files
    16:35:43.900282 Verifying reference: ghcr.io/security-profiles/crun:v1.8.3
    16:35:43.900310 Using tag: v1.8.3
    16:35:43.900313 Creating repository for ghcr.io/security-profiles/crun
    16:35:43.900319 Using username and password
    16:35:43.900321 Copying profile to repository
    16:35:46.976108 Signing container image
    Generating ephemeral keys...
    Retrieving signed certificate...
    Note that there may be personally identifiable information associated with this signed artifact.
    This may include the email address associated with the account with which you authenticate.
    This information will be used for signing this artifact and will be stored in public transparency logs and cannot be removed later.
    By typing 'y', you attest that you grant (or have permission to grant) and agree to have this information stored permanently in transparency logs.
    Your browser will now be opened to:
    Successfully verified SCT...
    tlog entry created with index: 16520520
    Pushing signature to: ghcr.io/security-profiles/crun

    You can see that the tool automatically signs the artifact and pushes the ./examples/baseprofile-crun.yaml to the registry, which is then directly ready for usage within the SPO. If username and password authentication is required, either use the --username, -u flag or export the USERNAME environment variable. To set the password, export the PASSWORD environment variable.

    It is possible to add custom annotations to the security profile by using the --annotations / -a flag multiple times in KEY:VALUE format. Those have no effect for now, but at some later point additional features of the operator may rely on them.

    The spoc client is also able to pull security profiles from OCI artifact compatible registries. To do that, just run spoc pull:

    > spoc pull ghcr.io/security-profiles/runc:v1.1.5
    16:32:29.795597 Pulling profile from: ghcr.io/security-profiles/runc:v1.1.5
    16:32:29.795610 Verifying signature
    Verification for ghcr.io/security-profiles/runc:v1.1.5 --
    The following checks were performed on each of these signatures:
     - Existence of the claims in the transparency log was verified offline
     - The code-signing certificate was verified using trusted certificate authority certificates
    16:32:33.208695 Creating file store in: /tmp/pull-3199397214
    16:32:33.208713 Verifying reference: ghcr.io/security-profiles/runc:v1.1.5
    16:32:33.208718 Creating repository for ghcr.io/security-profiles/runc
    16:32:33.208742 Using tag: v1.1.5
    16:32:33.208743 Copying profile from repository
    16:32:34.119652 Reading profile
    16:32:34.119677 Trying to unmarshal seccomp profile
    16:32:34.120114 Got SeccompProfile: runc-v1.1.5
    16:32:34.120119 Saving profile in: /tmp/profile.yaml

    The profile can now be found in /tmp/profile.yaml, or in the output file specified via --output-file / -o. We can specify a username and password in the same way as for spoc push.

    spoc makes it easy to manage security profiles as OCI artifacts, which can then be consumed directly by the operator itself.

    That was our compact journey through the latest possibilities of the Security Profiles Operator! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

  2. Author: Sascha Grunert

    The Security Profiles Operator (SPO) is a feature-rich operator for Kubernetes to make managing seccomp, SELinux and AppArmor profiles easier than ever. Recording those profiles from scratch is one of the key features of this operator, which usually involves the integration into large CI/CD systems. Being able to test the recording capabilities of the operator in edge cases is one of the recent development efforts of the SPO and makes it excitingly easy to play around with seccomp profiles.

    Recording seccomp profiles with spoc record

    The v0.8.0 release of the Security Profiles Operator shipped a new command line interface called spoc, a little helper tool for recording and replaying seccomp profiles among various other things that are out of scope of this blog post.

    Recording a seccomp profile requires a binary to be executed, which can be a simple Go application that just calls uname(2):

    package main

    import (
        "syscall"
    )

    func main() {
        utsname := syscall.Utsname{}
        if err := syscall.Uname(&utsname); err != nil {
            panic(err)
        }
    }

    Building a binary from that code can be done by:

    > go build -o main main.go
    > ldd ./main
     not a dynamic executable

    Now it's possible to download the latest binary of spoc from GitHub and run the application on Linux with it:

    > sudo ./spoc record ./main
    10:08:25.591945 Loading bpf module
    10:08:25.591958 Using system btf file
    libbpf: loading object 'recorder.bpf.o' from buffer
    libbpf: prog 'sys_enter': relo #3: patched insn #22 (ALU/ALU64) imm 16 -> 16
    10:08:25.610767 Getting bpf program sys_enter
    10:08:25.610778 Attaching bpf tracepoint
    10:08:25.611574 Getting syscalls map
    10:08:25.611582 Getting pid_mntns map
    10:08:25.613097 Module successfully loaded
    10:08:25.613311 Processing events
    10:08:25.613693 Running command with PID: 336007
    10:08:25.613835 Received event: pid: 336007, mntns: 4026531841
    10:08:25.613951 No container ID found for PID (pid=336007, mntns=4026531841, err=unable to find container ID in cgroup path)
    10:08:25.614856 Processing recorded data
    10:08:25.614975 Found process mntns 4026531841 in bpf map
    10:08:25.615110 Got syscalls: read, close, mmap, rt_sigaction, rt_sigprocmask, madvise, nanosleep, clone, uname, sigaltstack, arch_prctl, gettid, futex, sched_getaffinity, exit_group, openat
    10:08:25.615195 Adding base syscalls: access, brk, capget, capset, chdir, chmod, chown, close_range, dup2, dup3, epoll_create1, epoll_ctl, epoll_pwait, execve, faccessat2, fchdir, fchmodat, fchown, fchownat, fcntl, fstat, fstatfs, getdents64, getegid, geteuid, getgid, getpid, getppid, getuid, ioctl, keyctl, lseek, mkdirat, mknodat, mount, mprotect, munmap, newfstatat, openat2, pipe2, pivot_root, prctl, pread64, pselect6, readlink, readlinkat, rt_sigreturn, sched_yield, seccomp, set_robust_list, set_tid_address, setgid, setgroups, sethostname, setns, setresgid, setresuid, setsid, setuid, statfs, statx, symlinkat, tgkill, umask, umount2, unlinkat, unshare, write
    10:08:25.616293 Wrote seccomp profile to: /tmp/profile.yaml
    10:08:25.616298 Unloading bpf module

    I have to execute spoc as root because it internally runs an eBPF program, reusing the same code parts from the Security Profiles Operator itself. I can see that the BPF module got loaded successfully and that spoc attached the required tracepoint to it. It then tracks the main application by its mount namespace and processes the recorded syscall data. The nature of eBPF programs is that they see the whole context of the kernel, which means that spoc observes all syscalls of the system but does not interfere with their execution.

    The logs indicate that spoc found the syscalls read, close, mmap and so on, including uname. All syscalls other than uname come from the Go runtime and its garbage collection, which already adds overhead to even a basic application like our demo. I can also see from the log line Adding base syscalls: … that spoc adds a bunch of base syscalls to the resulting profile. Those are used by the OCI runtime (like runc or crun) in order to be able to run a container at all. This means that spoc can be used to record seccomp profiles which can then be containerized directly. This behavior can be disabled in spoc by using the --no-base-syscalls / -n flag, or customized via the --base-syscalls / -b flag. This can be helpful in cases where OCI runtimes other than crun and runc are used, or if I just want to record the seccomp profile for the application and stack it with another base profile.

    The resulting profile is now available in /tmp/profile.yaml; the default output location can be changed using the --output-file / -o flag:

    > cat /tmp/profile.yaml
    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      name: main
    spec:
      architectures:
        - SCMP_ARCH_X86_64
      defaultAction: SCMP_ACT_ERRNO
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - access
            - arch_prctl
            - brk
            - …
            - uname
            - …

    The seccomp profile Custom Resource Definition (CRD) can be directly used together with the Security Profiles Operator for managing it within Kubernetes. spoc is also capable of producing raw seccomp profiles (as JSON), by using the --type/-t raw-seccomp flag:

    > sudo ./spoc record --type raw-seccomp ./main
    52.628827 Wrote seccomp profile to: /tmp/profile.json
    > jq . /tmp/profile.json
    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": ["SCMP_ARCH_X86_64"],
      "syscalls": [
        {
          "names": ["access", "…", "write"],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }

    The utility spoc record allows us to record complex seccomp profiles directly from binary invocations on any Linux system capable of running the eBPF code within the kernel. But it can do more: how about modifying the seccomp profile and then testing it using spoc run?

    Running seccomp profiles with spoc run

    spoc is also able to run binaries with applied seccomp profiles, making it easy to test any modification to it. To do that, just run:

    > sudo ./spoc run ./main
    10:29:58.153263 Reading file /tmp/profile.yaml
    10:29:58.153311 Assuming YAML profile
    10:29:58.154138 Setting up seccomp
    10:29:58.154178 Load seccomp profile
    10:29:58.154189 Starting audit log enricher
    10:29:58.154224 Enricher reading from file /var/log/audit/audit.log
    10:29:58.155356 Running command with PID: 437880

    It looks like the application exited successfully, which is expected because I have not modified the previously recorded profile yet. I could also specify a custom location for the profile by using the --profile / -p flag, but this was not necessary because I did not change the default output location of the recording. spoc automatically determines whether it is a raw (JSON) or CRD-based (YAML) seccomp profile and then applies it to the process.

    The Security Profiles Operator supports a log enricher feature, which provides additional seccomp related information by parsing the audit logs. spoc run uses the enricher in the same way to provide more data to the end users when it comes to debugging seccomp profiles.

    Now I have to modify the profile to see anything valuable in the output. For example, I could remove the allowed uname syscall:

    > jq 'del(.syscalls[0].names[] | select(. == "uname"))' /tmp/profile.json > /tmp/no-uname-profile.json

    And then try to run it again with the new profile /tmp/no-uname-profile.json:

    > sudo ./spoc run -p /tmp/no-uname-profile.json ./main
    10:39:12.707798 Reading file /tmp/no-uname-profile.json
    10:39:12.707892 Setting up seccomp
    10:39:12.707920 Load seccomp profile
    10:39:12.707982 Starting audit log enricher
    10:39:12.707998 Enricher reading from file /var/log/audit/audit.log
    10:39:12.709164 Running command with PID: 480512
    panic: operation not permitted
    goroutine 1 [running]:
     /path/to/main.go:10 +0x85
    10:39:12.713035 Unable to run: launch runner: wait for command: exit status 2

    Alright, that was expected! The applied seccomp profile blocks the uname syscall, which results in an "operation not permitted" error. This error is pretty generic and gives no hint about what seccomp actually blocked. It is generally very difficult to predict how applications behave when individual syscalls are forbidden by seccomp: the application may terminate, as in our simple demo, but blocked syscalls can also lead to strange misbehavior where the application does not stop at all.

    If I now change the default seccomp action of the profile from SCMP_ACT_ERRNO to SCMP_ACT_LOG like this:

    > jq '.defaultAction = "SCMP_ACT_LOG"' /tmp/no-uname-profile.json > /tmp/no-uname-profile-log.json

    Then the log enricher will give us a hint that the uname syscall got blocked when using spoc run:

    > sudo ./spoc run -p /tmp/no-uname-profile-log.json ./main
    10:48:07.470126 Reading file /tmp/no-uname-profile-log.json
    10:48:07.470234 Setting up seccomp
    10:48:07.470245 Load seccomp profile
    10:48:07.470302 Starting audit log enricher
    10:48:07.470339 Enricher reading from file /var/log/audit/audit.log
    10:48:07.470889 Running command with PID: 522268
    10:48:07.472007 Seccomp: uname (63)

    The application no longer terminates; instead, seccomp logs the behavior to /var/log/audit/audit.log and spoc parses the data to correlate it directly with our program. Generating log messages for the audit subsystem comes with a significant performance overhead and should be handled with care in production systems. Running untrusted applications in audit mode in production also carries a security risk.

    This demo should give you an impression how to debug seccomp profile issues with applications, probably by using our shiny new helper tool powered by the features of the Security Profiles Operator. spoc is a flexible and portable binary suitable for edge cases where resources are limited and even Kubernetes itself may not be available with its full capabilities.

    Thank you for reading this blog post! If you're interested in more, providing feedback or asking for help, then feel free to get in touch with us directly via Slack (#security-profiles-operator) or the mailing list.

  3. Authors: Anish Ramasekar, Mo Khan, and Rita Zhang (Microsoft)

    With Kubernetes 1.27, we (SIG Auth) are moving Key Management Service (KMS) v2 API to beta.

    What is KMS?

    One of the first things to consider when securing a Kubernetes cluster is encrypting etcd data at rest. KMS provides an interface for a provider to utilize a key stored in an external key service to perform this encryption.

    KMS v1 has been a feature of Kubernetes since version 1.10, and has been in beta since version v1.12. KMS v2 was introduced as alpha in v1.25.

    What’s new in v2beta1?

    The KMS encryption provider uses an envelope encryption scheme to encrypt data in etcd. The data is encrypted using a data encryption key (DEK). The DEKs are encrypted with a key encryption key (KEK) that is stored and managed in a remote KMS. With KMS v1, a new DEK is generated for each encryption. With KMS v2, a new DEK is only generated on server startup and when the KMS plugin informs the API server that a KEK rotation has occurred.

    Sequence Diagram

    Encrypt Request

    Sequence diagram for KMSv2 beta Encrypt

    Decrypt Request

    Sequence diagram for KMSv2 beta Decrypt

    Status Request

    Sequence diagram for KMSv2 beta Status

    Generate Data Encryption Key (DEK)

    Sequence diagram for KMSv2 beta Generate DEK

    Performance Improvements

    With KMS v2, we have made significant improvements to the performance of the KMS encryption provider. In case of KMS v1, a new DEK is generated for every encryption. This means that for every write request, the API server makes a call to the KMS plugin to encrypt the DEK using the remote KEK. The API server also has to cache the DEKs to avoid making a call to the KMS plugin for every read request. When the API server restarts, it has to populate the cache by making a call to the KMS plugin for every DEK in the etcd store based on the cache size. This is a significant overhead for the API server. With KMS v2, the API server generates a DEK at startup and caches it. The API server also makes a call to the KMS plugin to encrypt the DEK using the remote KEK. This is a one-time call at startup and on KEK rotation. The API server then uses the cached DEK to encrypt the resources. This reduces the number of calls to the KMS plugin and improves the overall latency of the API server requests.

    We conducted a test that created 12k secrets and measured the time taken for the API server to encrypt the resources. The metric used was apiserver_storage_transformation_duration_seconds. For KMS v1, the test was run on a managed Kubernetes v1.25 cluster with 2 nodes. There was no additional load on the cluster during the test. For KMS v2, the test was run in the Kubernetes CI environment with the following cluster configuration.

    KMS Provider | Time taken (95th percentile)
    ------------ | ----------------------------
    KMS v1       | 160ms
    KMS v2       | 80μs

    The results show that the KMS v2 encryption provider is three orders of magnitude faster than the KMS v1 encryption provider.

    What's next?

    For Kubernetes v1.28, we expect the feature to stay in beta. In the coming releases we want to investigate:

    • Cryptographic changes to remove the limitation on VM state store.
    • Kubernetes REST API changes to enable a more robust story around key rotation.
    • Handling undecryptable resources. Refer to the KEP for details.

    You can learn more about KMS v2 by reading Using a KMS provider for data encryption. You can also follow along on the KEP to track progress across the coming Kubernetes releases.

    Call to action

    In this blog post, we have covered the improvements made to the KMS encryption provider in Kubernetes v1.27. We have also discussed the new KMS v2 API and how it works. We would love to hear your feedback on this feature. In particular, we would like feedback from Kubernetes KMS plugin implementors as they go through the process of building their integrations with this new API. Please reach out to us on the #sig-auth-kms-dev channel on Kubernetes Slack.

    How to get involved

    If you are interested in getting involved in the development of this feature, share feedback, or participate in any other ongoing SIG Auth projects, please reach out on the #sig-auth channel on Kubernetes Slack.

    You are also welcome to join the bi-weekly SIG Auth meetings, held every other Wednesday.


    This feature has been an effort driven by contributors from several different companies. We would like to extend a huge thank you to everyone that contributed their time and effort to help make this possible.

  4. Authors: Paco Xu (DaoCloud), Sergey Kanzhelev (Google), Ruiwen Zhao (Google)

    How can Pod start-up be accelerated on nodes in large clusters? This is a common issue that cluster administrators may face.

    This blog post focuses on methods to speed up pod start-up from the kubelet side. It does not cover the time taken by the controller-manager (via kube-apiserver) to create pods, pod scheduling time, or the execution of webhooks on pod creation.

    We have mentioned some important factors here to consider from the kubelet's perspective, but this is not an exhaustive list. As Kubernetes v1.27 is released, this blog highlights significant changes in v1.27 that aid in speeding up pod start-up.

    Parallel container image pulls

    Pulling images always takes some time and what's worse is that image pulls are done serially by default. In other words, kubelet will send only one image pull request to the image service at a time. Other image pull requests have to wait until the one being processed is complete.

    To enable parallel image pulls, set the serializeImagePulls field to false in the kubelet configuration. When serializeImagePulls is disabled, requests for image pulls are immediately sent to the image service and multiple images can be pulled concurrently.

    Limiting maximum parallel image pulls will help protect your node from being overloaded by image pulls

    We introduced a new feature in kubelet that sets a limit on the number of parallel image pulls at the node level. This limit restricts the maximum number of images that can be pulled simultaneously. If there is an image pull request beyond this limit, it will be blocked until one of the ongoing image pulls finishes. Before enabling this feature, please ensure that your container runtime's image service can handle parallel image pulls effectively.

    To limit the number of simultaneous image pulls, you can configure the maxParallelImagePulls field in kubelet. By setting maxParallelImagePulls to a value of n, only n images will be pulled concurrently. Any additional image pulls beyond this limit will wait until at least one ongoing pull is complete.
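    Both settings live in the kubelet configuration file. A minimal sketch (the value chosen for maxParallelImagePulls is illustrative):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Send image pull requests concurrently instead of one at a time.
serializeImagePulls: false
# Cap the number of images pulled in parallel at the node level.
maxParallelImagePulls: 5
```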

    You can find more details in the associated KEP: Kubelet limit of Parallel Image Pulls (KEP-3673).

    Raised default API query-per-second limits for kubelet

    To improve pod startup in scenarios with many pods on a node, particularly sudden scaling situations, the kubelet needs to synchronize the pod status and prepare configmaps, secrets, and volumes. This requires significant bandwidth to kube-apiserver.

    In versions prior to v1.27, the default kubeAPIQPS was 5 and kubeAPIBurst was 10. However, the kubelet in v1.27 has increased these defaults to 50 and 100 respectively for better performance during pod startup. It's worth noting that this isn't the only reason why we've bumped up the API QPS limits for Kubelet.

    1. It has a potential to be hugely throttled now (default QPS = 5)
    2. In large clusters they can generate significant load anyway as there are a lot of them
    3. They have a dedicated PriorityLevel and FlowSchema that we can easily control

    Previously, we often encountered volume mount timeouts on the kubelet during pod start-up on nodes with more than 50 pods. We suggest that cluster operators bump kubeAPIQPS to 20 and kubeAPIBurst to 40, especially if using bare metal nodes.
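    On versions where the old defaults still apply, those suggested values can be set in the kubelet configuration file, for example:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Raised from the pre-v1.27 defaults of 5/10.
kubeAPIQPS: 20
kubeAPIBurst: 40
```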

    More details can be found in the KEP https://kep.k8s.io/1040 and the pull request #116121.

    Event triggered updates to container status

    Kubernetes offers two ways for the kubelet to detect Pod lifecycle events, such as the last process in a container shutting down. In Kubernetes v1.27, the event-based mechanism, Evented PLEG (PLEG is short for "Pod Lifecycle Event Generator"), has graduated to beta but remains disabled by default. If you explicitly switch to event-based lifecycle change detection, the kubelet can start Pods more quickly than with the default approach, which relies on polling. Polling for lifecycle changes adds noticeable overhead, hurts the kubelet's ability to handle different tasks in parallel, and leads to poor performance and reliability issues. For these reasons, we recommend that you switch your nodes to use event-based pod lifecycle change detection.
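    Since the feature is off by default in v1.27, switching a node to event-based detection means enabling the EventedPLEG feature gate in the kubelet configuration (the container runtime must also support CRI container events), for example:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  EventedPLEG: true
```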

    Further details can be found in the KEP https://kep.k8s.io/3386 and Switching From Polling to CRI Event-based Updates to Container Status.

    Raise your pod resource limit if needed

    During start-up, some pods may consume a considerable amount of CPU or memory. If the CPU limit is low, this can significantly slow down the pod start-up process. To improve memory management, Kubernetes v1.22 introduced a feature gate called MemoryQoS in kubelet. This feature enables kubelet to set memory QoS at the container, pod, and QoS levels for better protection and guaranteed memory quality when running with cgroups v2. Despite its benefits, enabling this feature gate may affect pod start-up speed if the pod consumes a large amount of memory during start-up.

    Kubelet configuration now includes memoryThrottlingFactor. This factor is multiplied by the memory limit or node allocatable memory to set the cgroupv2 memory.high value for enforcing MemoryQoS. Decreasing this factor sets a lower high limit for container cgroups and increases reclaim pressure; increasing it applies less reclaim pressure. The default value was initially 0.8 and changes to 0.9 in Kubernetes v1.27. This parameter adjustment reduces the potential impact of the feature on pod start-up speed.
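    A kubelet configuration sketch tying the two together (the factor value is illustrative; MemoryQoS requires cgroups v2 on the node):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
# Fraction of the memory limit used for the cgroupv2 memory.high value.
memoryThrottlingFactor: 0.9
```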

    Further details can be found in the KEP https://kep.k8s.io/2570.

    What's more?

    In Kubernetes v1.26, a new histogram metric pod_start_sli_duration_seconds was added for Pod startup latency SLI/SLO details. Additionally, the kubelet log will now display more information about pod start-related timestamps, as shown below:

    Dec 30 15:33:13.375379 e2e-022435249c-674b9-minion-group-gdj4 kubelet[8362]: I1230 15:33:13.375359 8362 pod_startup_latency_tracker.go:102] "Observed pod startup duration" pod="kube-system/konnectivity-agent-gnc9k" podStartSLOduration=-9.223372029479458e+09 pod.CreationTimestamp="2022-12-30 15:33:06 +0000 UTC" firstStartedPulling="2022-12-30 15:33:09.258791695 +0000 UTC m=+13.029631711" lastFinishedPulling="0001-01-01 00:00:00 +0000 UTC" observedRunningTime="2022-12-30 15:33:13.375009262 +0000 UTC m=+17.145849275" watchObservedRunningTime="2022-12-30 15:33:13.375317944 +0000 UTC m=+17.146157970"

    The SELinux Relabeling with Mount Options feature moved to Beta in v1.27. This feature speeds up container startup by mounting volumes with the correct SELinux label instead of changing each file on the volumes recursively. Further details can be found in the KEP https://kep.k8s.io/1710.

    To identify the cause of slow pod startup, analyzing metrics and logs can be helpful. Other factors that may impact pod startup include container runtime, disk speed, CPU and memory resources on the node.

    SIG Node is responsible for ensuring fast Pod startup times, while addressing issues in large clusters falls under the purview of SIG Scalability as well.

  5. Author: Vinay Kulkarni (Kubescaler Labs)

    If you have deployed Kubernetes pods with CPU and/or memory resources specified, you may have noticed that changing the resource values involves restarting the pod. This has been a disruptive operation for running workloads... until now.

    In Kubernetes v1.27, we have added a new alpha feature that allows users to resize CPU/memory resources allocated to pods without restarting the containers. To facilitate this, the resources field in a pod's containers now allows mutation for cpu and memory resources. Resources can be changed simply by patching the running pod spec.

    This also means that the resources field in the pod spec can no longer be relied upon as an indicator of the pod's actual resources. Monitoring tools and other such applications must now look at new fields in the pod's status. Kubernetes queries the actual CPU and memory requests and limits enforced on the running containers via a CRI (Container Runtime Interface) API call to the runtime, such as containerd, which is responsible for running the containers. The response from the container runtime is reflected in the pod's status.

    In addition, a new restartPolicy for resize has been added. It gives users control over how their containers are handled when resources are resized.
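    For illustration, the restart policy is declared per resource in the container spec via the resizePolicy field. The sketch below (hypothetical pod and container names) keeps the container running across CPU resizes but restarts it when memory is changed:

    ```yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: resize-demo            # hypothetical name
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        resizePolicy:
        - resourceName: cpu
          restartPolicy: NotRequired       # resize CPU in place, no restart
        - resourceName: memory
          restartPolicy: RestartContainer  # memory changes restart the container
        resources:
          requests:
            cpu: "500m"
            memory: "128Mi"
          limits:
            cpu: "1"
            memory: "256Mi"
    ```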

    What's new in v1.27?

    Besides the addition of resize policy in the pod's spec, a new field named allocatedResources has been added to containerStatuses in the pod's status. This field reflects the node resources allocated to the pod's containers.

    In addition, a new field called resources has been added to the container's status. This field reflects the actual resource requests and limits configured on the running containers as reported by the container runtime.

    Lastly, a new field named resize has been added to the pod's status to show the status of the last requested resize:

    • Proposed is an acknowledgement of the requested resize and indicates that the request was validated and recorded.
    • InProgress indicates that the node has accepted the resize request and is in the process of applying it to the pod's containers.
    • Deferred means that the requested resize cannot be granted at this time; the node will keep retrying, and the resize may be granted when other pods leave and free up node resources.
    • Infeasible is a signal that the node cannot accommodate the requested resize. This can happen if the requested resize exceeds the maximum resources the node can ever allocate for a pod.
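    Putting the new status fields together, the status of a pod mid-resize might look roughly like the fragment below (all values illustrative):

    ```yaml
    status:
      resize: InProgress              # last requested resize is being applied
      containerStatuses:
      - name: app
        allocatedResources:           # node resources allocated to the container
          cpu: "1"
          memory: 256Mi
        resources:                    # actual values enforced on the running
          requests:                   # container, as reported by the runtime
            cpu: "500m"
            memory: 128Mi
          limits:
            cpu: "1"
            memory: 256Mi
    ```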

    When to use this feature

    Here are a few examples where this feature may be useful:

    • A pod is running on a node with either too many or too few resources.
    • Pods are not being scheduled due to a lack of sufficient CPU or memory in a cluster that is underutilized because running pods were overprovisioned.
    • Evicting stateful pods that need more resources, in order to schedule them on bigger nodes, is expensive or disruptive when lower-priority pods on the node could instead be resized down or moved.

    How to use this feature

    In order to use this feature in v1.27, the InPlacePodVerticalScaling feature gate must be enabled. A local cluster with this feature enabled can be started as shown below:

    root@vbuild:~/go/src/k8s.io/kubernetes# FEATURE_GATES=InPlacePodVerticalScaling=true ./hack/local-up-cluster.sh
    go version go1.20.2 linux/arm64
    +++ [0320 13:52:02] Building go targets for linux/arm64
    k8s.io/kubernetes/cmd/kubectl (static)
    k8s.io/kubernetes/cmd/kube-apiserver (static)
    k8s.io/kubernetes/cmd/kube-controller-manager (static)
    k8s.io/kubernetes/cmd/cloud-controller-manager (non-static)
    k8s.io/kubernetes/cmd/kubelet (non-static)
    To start using your cluster, you can open up another terminal/tab and run:
    export KUBECONFIG=/var/run/kubernetes/admin.kubeconfig
    Alternatively, you can write to the default kubeconfig:
    export KUBERNETES_PROVIDER=local
    cluster/kubectl.sh config set-cluster local --server=https://localhost:6443 --certificate-authority=/var/run/kubernetes/server-ca.crt
    cluster/kubectl.sh config set-credentials myself --client-key=/var/run/kubernetes/client-admin.key --client-certificate=/var/run/kubernetes/client-admin.crt
    cluster/kubectl.sh config set-context local --cluster=local --user=myself
    cluster/kubectl.sh config use-context local

    Once the local cluster is up and running, Kubernetes users can schedule pods with resources, and resize the pods via kubectl. An example of how to use this feature is illustrated in the following demo video.
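    As a minimal sketch (hypothetical pod and container names), a running pod's CPU can then be resized with a strategic merge patch such as:

    ```yaml
    # patch.yaml -- applied with:
    #   kubectl patch pod resize-demo --patch "$(cat patch.yaml)"
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: "800m"
          limits:
            cpu: "800m"
    ```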

    Example Use Cases

    Cloud-based Development Environment

    In this scenario, developers or development teams write their code locally but build and test their code in Kubernetes pods with consistent configs that reflect production use. Such pods need minimal resources when the developers are writing code, but need significantly more CPU and memory when they build their code or run a battery of tests. This use case can leverage the in-place pod resize feature (with a little help from eBPF) to quickly resize the pod's resources and prevent the kernel OOM (out of memory) killer from terminating their processes.

    This KubeCon North America 2022 conference talk illustrates the use case.

    Java processes initialization CPU requirements

    Some Java applications may need significantly more CPU during initialization than what is needed during normal process operation time. If such applications specify CPU requests and limits suited for normal operation, they may suffer from very long startup times. Such pods can request higher CPU values at the time of pod creation, and can be resized down to normal running needs once the application has finished initializing.

    Known Issues

    This feature enters v1.27 at alpha stage. Below are a few known issues users may encounter:

    • containerd versions below v1.6.9 do not have the CRI support needed for full end-to-end operation of this feature. Attempts to resize pods will appear to be stuck in the InProgress state, and the resources field in the pod's status is never updated even though the new resources may have been enacted on the running containers.
    • Pod resize may encounter a race condition with other pod updates, causing delayed enactment of pod resize.
    • Reflecting the resized container resources in pod's status may take a while.
    • Static CPU management policy is not supported with this feature.


    This feature is a result of the efforts of a very collaborative Kubernetes community. Here's a little shoutout to just a few of the many, many people that contributed countless hours of their time and helped make this happen.

    • @thockin for detail-oriented API design and air-tight code reviews.
    • @derekwaynecarr for simplifying the design and thorough API and node reviews.
    • @dchen1107 for bringing vast knowledge from Borg and helping us avoid pitfalls.
    • @ruiwen-zhao for adding containerd support that enabled full E2E implementation.
    • @wangchen615 for implementing comprehensive E2E tests and driving scheduler fixes.
    • @bobbypage for invaluable help getting CI ready and quickly investigating issues, covering for me on my vacation.
    • @Random-Liu for thorough kubelet reviews and identifying problematic race conditions.
    • @Huang-Wei, @ahg-g, @alculquicondor for helping get scheduler changes done.
    • @mikebrow @marosset for reviews on short notice that helped CRI changes make it into v1.25.
    • @endocrimes, @ehashman for helping ensure that the oft-overlooked tests are in good shape.
    • @mrunalp for reviewing cgroupv2 changes and ensuring clean handling of v1 vs v2.
    • @liggitt, @gjkim42 for tracking down, root-causing important missed issues post-merge.
    • @SergeyKanzhelev for supporting and shepherding various issues during the home stretch.
    • @pdgetrf for making the first prototype a reality.
    • @dashpole for bringing me up to speed on 'the Kubernetes way' of doing things.
    • @bsalamat, @kgolab for very thoughtful insights and suggestions in the early stages.
    • @sftim, @tengqm for ensuring docs are easy to follow.
    • @dims for being omnipresent and helping make merges happen at critical hours.
    • Release teams for ensuring that the project stayed healthy.

    And a big thanks to my very supportive management Dr. Xiaoning Ding and Dr. Ying Xiong for their patience and encouragement.

