Kubernetes News

The Kubernetes project blog
Kubernetes.io
  1. Author: Rory McCune (Aqua Security)

    Admission control is a key part of Kubernetes security, alongside authentication and authorization. Webhook admission controllers are extensively used to help improve the security of Kubernetes clusters in a variety of ways, including restricting the privileges of workloads and ensuring that images deployed to the cluster meet an organization’s security requirements.

    However, as with any additional component added to a cluster, security risks can present themselves; for example, risks can arise if the deployment and management of the admission controller are not handled correctly. To help admission controller users and designers manage these risks appropriately, the security documentation subgroup of SIG Security has spent some time developing a threat model for admission controllers. This threat model looks at likely risks which may arise from the incorrect use of admission controllers, which could allow security policies to be bypassed, or even allow an attacker to get unauthorised access to the cluster.

    From the threat model, we developed a set of security best practices that should be adopted to ensure that cluster operators can get the security benefits of admission controllers whilst avoiding any risks from using them.

    Admission controllers and good practices for security

    From the threat model, a couple of themes emerged around how to ensure the security of admission controllers.

    Secure webhook configuration

    It’s important to ensure that any security component in a cluster is well configured, and admission controllers are no different here. There are a couple of security best practices to consider when using admission controllers:

    • Correctly configured TLS for all webhook traffic. Communications between the API server and the admission controller webhook should be authenticated and encrypted to ensure that attackers who may be in a network position to view or modify this traffic cannot do so. To achieve this, the API server and webhook must use certificates from a trusted certificate authority so that they can validate their mutual identities.
    • Only authenticated access allowed. If an attacker can send an admission controller large numbers of requests, they may be able to overwhelm the service, causing it to fail. Ensuring all access requires strong authentication should mitigate that risk.
    • Admission controller fails closed. This is a security practice that has a tradeoff, so whether a cluster operator wants to configure it will depend on the cluster’s threat model. If an admission controller fails closed, when the API server can’t get a response from it, all deployments will fail. This stops attackers bypassing the admission controller by disabling it, but it can disrupt the cluster’s operation. As clusters can have multiple webhooks, one approach to strike a middle ground might be to have critical controls in a fail-closed setup whilst allowing less critical controls to fail open (see the example configuration after this list).
    • Regular reviews of webhook configuration. Configuration mistakes can lead to security issues, so it’s important that the admission controller webhook configuration is checked to make sure the settings are correct. This kind of review could be done automatically by an Infrastructure as Code scanner or manually by an administrator.
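
    To make these points concrete, here is a minimal, hypothetical webhook registration fragment (the names and namespace are placeholders, not any particular product's configuration): the caBundle field is what lets the API server validate the webhook's TLS certificate, and failurePolicy: Fail makes the webhook fail closed.

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: example-policy-webhook
    webhooks:
      - name: policy.example.com
        # Fail closed: if the webhook is unreachable, the request is rejected
        failurePolicy: Fail
        sideEffects: None
        admissionReviewVersions: ["v1"]
        rules:
          - apiGroups: [""]
            apiVersions: ["v1"]
            operations: ["CREATE", "UPDATE"]
            resources: ["pods"]
        clientConfig:
          service:
            name: policy-webhook
            namespace: policy-system
            path: /validate
          # PEM-encoded CA bundle used to validate the webhook's serving certificate
          caBundle: <base64-encoded-CA-certificate>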

    Secure cluster configuration for admission control

    In most cases, the admission controller webhook used by a cluster will be installed as a workload in the cluster. As a result, it’s important to ensure that Kubernetes' security features that could impact its operation are well configured.

    • Restrict RBAC rights. Any user who has rights which would allow them to modify the configuration of the webhook objects or the workload that the admission controller uses could disrupt its operation. So it’s important to make sure that only cluster administrators have those rights.
    • Prevent privileged workloads. One of the realities of container systems is that if a workload is given certain privileges, it will be possible to break out to the underlying cluster node and impact other containers on that node. Where admission controller services run in the cluster they’re protecting, it’s important to ensure that any requirement for privileged workloads is carefully reviewed and restricted as much as possible.
    • Strictly control external system access. As a security service in a cluster, admission controller systems will have access to sensitive information like credentials. To reduce the risk of this information being sent outside the cluster, network policies should be used to restrict the admission controller services' access to external networks (a sketch follows this list).
    • Each cluster has a dedicated webhook. Whilst it may be possible to have admission controller webhooks that serve multiple clusters, there is a risk when using that model that an attack on the webhook service would have a larger impact where it’s shared. Also, where multiple clusters use an admission controller, there will be increased complexity and access requirements, making it harder to secure.
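
    As an illustration of the external-access point above, a NetworkPolicy along the following lines (the namespace and labels are placeholders) denies all egress from the admission controller's pods except in-cluster DNS; in practice you would also allow whatever in-cluster destinations the controller genuinely needs.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restrict-admission-controller-egress
      namespace: admission-system       # placeholder namespace
    spec:
      podSelector:
        matchLabels:
          app: admission-controller      # placeholder label
      policyTypes:
        - Egress
      egress:
        # allow DNS lookups inside the cluster; all other egress is denied
        - to:
            - namespaceSelector: {}
          ports:
            - protocol: UDP
              port: 53
            - protocol: TCP
              port: 53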

    Admission controller rules

    A key element of any admission controller used for Kubernetes security is the rulebase it uses. The rules need to accurately meet their goals, avoiding both false positive and false negative results.

    • Regularly test and review rules. Admission controller rules need to be tested to ensure their accuracy. They also need to be regularly reviewed as the Kubernetes API will change with each new version, and rules need to be assessed with each Kubernetes release to understand any changes that may be required to keep them up to date.
  2. Authors & Interviewers: Anubhav Vardhan, Atharva Shinde, Avinesh Tripathi, Debabrata Panigrahi, Kunal Verma, Pranshu Srivastava, Pritish Samal, Purneswar Prasad, Vedant Kakde

    Editor: Priyanka Saggu


    Good day, everyone 👋

    Welcome to the first episode of the APAC edition of the "Meet Our Contributors" blog post series.

    In this post, we'll introduce you to five amazing folks from the India region who have been actively contributing to the upstream Kubernetes projects in a variety of ways, as well as being the leaders or maintainers of numerous community initiatives.

    💫 So, without further ado, let's get started…

    Arsh Sharma

    Arsh is currently employed with Okteto as a Developer Experience engineer. As a new contributor, he realised that 1:1 mentorship opportunities were quite beneficial in getting him started with the upstream project.

    He is presently a CI Signal shadow on the Kubernetes 1.23 release team. He is also contributing to the SIG Testing and SIG Docs projects, as well as to the cert-manager tools development work that is being done under the aegis of SIG Architecture.

    Arsh helps newcomers plan their early contributions sustainably.

    I would encourage folks to contribute in a way that's sustainable. What I mean by that is that it's easy to be very enthusiastic early on and take up more stuff than one can actually handle. This can often lead to burnout in later stages. It's much more sustainable to work on things iteratively.

    Kunal Kushwaha

    Kunal Kushwaha is a core member of the Kubernetes marketing council. He is also a CNCF ambassador and one of the founders of the CNCF Students Program. He also served as a Communications role shadow during the 1.22 release cycle.

    At the end of his first year, Kunal began contributing to the fabric8io kubernetes-client project. He was then selected to work on the same project as part of Google Summer of Code. Kunal mentored people on the same project, first through Google Summer of Code, then through Google Code-in.

    As an open-source enthusiast, he believes that diverse participation in the community is beneficial since it introduces new perspectives and opinions, and fosters respect for one's peers. He has worked on various open-source projects, and his participation in communities has considerably assisted his development as a developer.

    I believe if you find yourself in a place where you do not know much about the project, that's a good thing because now you can learn while contributing and the community is there to help you. It has helped me a lot in gaining skills, meeting people from around the world and also helping them. You can learn on the go, you don't have to be an expert. Make sure to also check out no code contributions because being a beginner is a skill and you can bring new perspectives to the organisation.

    Madhav Jivarajani

    Madhav Jivarajani works on the VMware Upstream Kubernetes stability team. He began contributing to the Kubernetes project in January 2021 and has since made significant contributions to several areas of work under SIG Architecture, SIG API Machinery, and SIG ContribEx (contributor experience).

    Among several significant contributions are his recent efforts toward the archival of design proposals, refactoring the "groups" codebase under the k8s-infra repository to make it mockable and testable, and improving the functionality of the GitHub k8s bot.

    In addition to his technical efforts, Madhav oversees many projects aimed at assisting new contributors. He organises bi-weekly "KEP reading club" sessions to help newcomers understand the process of adding new features, deprecating old ones, and making other key changes to the upstream project. He has also worked on developing Katacoda scenarios to help new contributors become acquainted with the process of contributing to k/k. In addition to his current efforts to meet with community members every week, he has organised several new contributor workshops (NCW).

    I initially did not know much about Kubernetes. I joined because the community was super friendly. But what made me stay was not just the people, but the project itself. My solution to not feeling overwhelmed in the community was to gain as much context and knowledge into the topics that I was interested in and were being discussed. And as a result I continued to dig deeper into Kubernetes and the design of it. I am a systems nut & thus Kubernetes was an absolute goldmine for me.

    Rajas Kakodkar

    Rajas Kakodkar currently works at VMware as a Member of Technical Staff. He has been engaged in many aspects of the upstream Kubernetes project since 2019.

    He is now a key contributor to the Testing special interest group. He is also active in the SIG Network community. Lately, Rajas has contributed significantly to the NetworkPolicy++ and kpng sub-projects.

    One of the first challenges he ran across was that he was in a different time zone than the upstream project's regular meeting hours. However, async interactions on community forums progressively corrected that problem.

    I enjoy contributing to Kubernetes not just because I get to work on cutting edge tech but more importantly because I get to work with awesome people and help in solving real world problems.

    Rajula Vineet Reddy

    Rajula Vineet Reddy, a Junior Engineer at CERN, is a member of the Marketing Council team under SIG ContribEx. He also served as a release shadow for SIG Release during the 1.22 and 1.23 Kubernetes release cycles.

    He started looking at the Kubernetes project as part of a university project with the help of one of his professors. Over time, he spent a significant amount of time reading the project's documentation, Slack discussions, GitHub issues, and blogs, which helped him better grasp the Kubernetes project and piqued his interest in contributing upstream. One of his key contributions was his assistance with automation in the SIG ContribEx Upstream Marketing subproject.

    According to Rajula, attending project meetings and shadowing various project roles are vital for learning about the community.

    I find the community very helpful and it's always “you get back as much as you contribute”. The more involved you are, the more you will understand, get to learn and contribute new things.

    The first step to “come forward and start” is hard. But it's all gonna be smooth after that. Just take that jump.


    If you have any recommendations/suggestions for who we should interview next, please let us know in #sig-contribex. We're thrilled to have other folks assisting us in reaching out to even more wonderful individuals of the community. Your suggestions would be much appreciated.

    We'll see you all in the next one. Everyone, till then, have a happy contributing! 👋

  3. Authors: Sergey Kanzhelev (Google), Jim Angel (Google), Davanum Srinivas (VMware), Shannon Kularathna (Google), Chris Short (AWS), Dawn Chen (Google)

    Kubernetes is removing dockershim in the upcoming v1.24 release. We're excited to reaffirm our community values by supporting open source container runtimes, enabling a smaller kubelet, and increasing engineering velocity for teams using Kubernetes. If you use Docker Engine as a container runtime for your Kubernetes cluster, get ready to migrate in 1.24! To check if you're affected, refer to Check whether dockershim deprecation affects you.
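
    One quick way to see which runtime each of your nodes currently reports (the output below is illustrative) is:

    $ kubectl get nodes -o wide
    NAME     STATUS   ...   CONTAINER-RUNTIME
    node-1   Ready    ...   docker://20.10.7
    node-2   Ready    ...   containerd://1.5.9

    A docker:// prefix in the CONTAINER-RUNTIME column means the node still relies on dockershim; a containerd:// or cri-o:// prefix means it already talks to a CRI runtime directly.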

    Why we’re moving away from dockershim

    Docker was the first container runtime used by Kubernetes. This is one of the reasons why Docker is so familiar to many Kubernetes users and enthusiasts. Docker support was hardcoded into Kubernetes – a component the project refers to as dockershim. As containerization became an industry standard, the Kubernetes project added support for additional runtimes. This culminated in the implementation of the container runtime interface (CRI), letting system components (like the kubelet) talk to container runtimes in a standardized way. As a result, dockershim became an anomaly in the Kubernetes project. Dependencies on Docker and dockershim have crept into various tools and projects in the CNCF ecosystem, resulting in fragile code.

    By removing dockershim, we're embracing the first value of CNCF: "Fast is better than slow". Stay tuned for future communications on the topic!

    Deprecation timeline

    We formally announced the dockershim deprecation in December 2020. Full removal is targeted in Kubernetes 1.24, in April 2022. This timeline aligns with our deprecation policy, which states that deprecated behaviors must function for at least 1 year after their announced deprecation.

    We'll support Kubernetes version 1.23, which includes dockershim, for another year in the Kubernetes project. For managed Kubernetes providers, vendor support is likely to last even longer, but this is dependent on the companies themselves. Regardless, we're confident all cluster operators will have time to migrate. If you have more questions about the dockershim removal, refer to the Dockershim Deprecation FAQ.

    We asked you whether you feel prepared for the migration from dockershim in this survey: Are you ready for Dockershim removal. We had over 600 responses. To everybody who took time filling out the survey, thank you.

    The results show that we still have a lot of ground to cover to help you to migrate smoothly. Other container runtimes exist, and have been promoted extensively. However, many users told us they still rely on dockershim, and sometimes have dependencies that need to be re-worked. Some of these dependencies are outside of your control. Based on your feedback, here are some of the steps we are taking to help.

    Our next steps

    Based on the feedback you provided:

    • CNCF and the 1.24 release team are committed to delivering documentation in time for the 1.24 release. This includes more informative blog posts like this one, updating existing code samples, tutorials, and tasks, and producing a migration guide for cluster operators.
    • We are reaching out to the rest of the CNCF community to help prepare them for this change.

    If you're part of a project with dependencies on dockershim, or if you're interested in helping with the migration effort, please join us! There's always room for more contributors, whether to our transition tools or to our documentation. To get started, say hello in the #sig-node channel on Kubernetes Slack!

    Final thoughts

    As a project, we've already seen cluster operators increasingly adopt other container runtimes through 2021. We believe there are no major blockers to migration. The steps we're taking to improve the migration experience will light the path more clearly for you.

    We understand that migration from dockershim is yet another action you may need to do to keep your Kubernetes infrastructure up to date. For most of you, this step will be straightforward and transparent. In some cases, you will encounter hiccups or issues. The community has discussed at length whether postponing the dockershim removal would be helpful. For example, we recently talked about it in the SIG Node discussion on November 11th and in the Kubernetes Steering committee meeting held on December 6th. We already postponed it once in 2021 because the adoption rate of other runtimes was lower than we wanted, which also gave us more time to identify potential blocking issues.

    At this point, we believe that the value that you (and Kubernetes) gain from dockershim removal makes up for the migration effort you'll have. Start planning now to avoid surprises. We'll have more updates and guides before Kubernetes 1.24 is released.

  4. Author: Andrei Kvapil (WEDOS)

    When you own two data centers, thousands of physical servers, virtual machines, and hosting for hundreds of thousands of sites, Kubernetes can actually simplify the management of all these things. As practice has shown, by using Kubernetes, you can declaratively describe and manage not only applications, but also the infrastructure itself. I work for the largest Czech hosting provider, WEDOS Internet a.s., and today I'll show you two of my projects — Kubernetes-in-Kubernetes and Kubefarm.

    With their help, you can deploy a fully working Kubernetes cluster inside another Kubernetes cluster using Helm in just a couple of commands. How and why?

    Let me introduce you to how our infrastructure works. All our physical servers can be divided into two groups: control-plane and compute nodes. Control-plane nodes are usually set up manually, have a stable OS installed, and are designed to run all cluster services, including the Kubernetes control plane. The main task of these nodes is to ensure the smooth operation of the cluster itself. Compute nodes do not have any operating system installed by default; instead, they boot the OS image over the network directly from the control-plane nodes. Their job is to carry out the workload.

    Kubernetes cluster layout

    Once nodes have downloaded their image, they can continue to work without keeping a connection to the PXE server. That is, the PXE server just serves the rootfs image and does not hold any other complex logic. After our nodes have booted, we can safely restart the PXE server; nothing critical will happen to them.

    Kubernetes cluster after bootstrapping

    After booting, the first thing our nodes do is join the existing Kubernetes cluster, namely, execute the kubeadm join command so that kube-scheduler can schedule pods on them and launch various workloads afterwards. From the beginning, we used a scheme where the nodes joined the same cluster that was used for the control-plane nodes.

    Kubernetes scheduling containers to the compute nodes
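
    The join step itself is the standard kubeadm flow; a typical invocation (the endpoint, token, and hash below are placeholders) looks like:

    kubeadm join 10.28.36.1:6443 --token <token> \
        --discovery-token-ca-cert-hash sha256:<hash>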

    This scheme worked stably for over two years. However, later we decided to add containerized Kubernetes to it. And now we can spawn new Kubernetes clusters very easily right on our control-plane nodes, which are now members of special admin clusters. Now, compute nodes can be joined directly to their own clusters - depending on the configuration.

    Multiple clusters are running in single Kubernetes, compute nodes joined to them

    Kubefarm

    This project came with the goal of enabling anyone to deploy such an infrastructure in just a couple of commands using Helm and end up with roughly the same setup.

    At this point, we moved away from the idea of a monocluster, because it turned out to be not very convenient for managing the work of several development teams in the same cluster. The fact is that Kubernetes was never designed as a multi-tenant solution, and at the moment it does not provide sufficient means of isolation between projects. Therefore, running separate clusters for each team turned out to be a good idea. However, there should not be too many clusters, so that they remain convenient to manage, nor too few, so that development teams have sufficient independence from each other.

    The scalability of our clusters became noticeably better after that change. The more clusters you have per number of nodes, the smaller the failure domain and the more stably they work. And as a bonus, we got a fully declaratively described infrastructure. Thus, you can now deploy a new Kubernetes cluster in the same way as you deploy any other application in Kubernetes.

    It uses Kubernetes-in-Kubernetes as a basis, LTSP as the PXE server from which the nodes are booted, and automates the DHCP server configuration using dnsmasq-controller:

    Kubefarm

    How it works

    Now let's see how it works. In general, if you look at Kubernetes from an application perspective, you can note that it follows all the principles of The Twelve-Factor App and is actually written very well. Thus, running Kubernetes as an application in a different Kubernetes cluster shouldn't be a big deal.

    Running Kubernetes in Kubernetes

    Now let's take a look at the Kubernetes-in-Kubernetes project, which provides a ready-made Helm chart for running Kubernetes in Kubernetes.

    Here are the parameters that you can pass to Helm in the values file:

    Kubernetes is just five binaries

    Besides persistence (storage parameters for the cluster), the Kubernetes control-plane components are described here: namely, the etcd cluster, apiserver, controller-manager, and scheduler. These are pretty much standard Kubernetes components. There is a light-hearted saying that “Kubernetes is just five binaries”. So here is where the configuration for these binaries is located.
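
    Purely as a rough sketch of the structure just described (the field names below are illustrative assumptions, not necessarily the chart's actual schema), the values might look something like:

    persistence:
      enabled: true
      storageClassName: local-path    # illustrative
      size: 10Gi
    etcd:
      replicas: 3
    apiserver:
      replicas: 2
    controllerManager:
      replicas: 1
    scheduler:
      replicas: 1
    admin:
      enabled: true                   # admin container with kubectl and kubeadm inside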

    If you have ever tried to bootstrap a cluster using kubeadm, then this config will remind you of its configuration. But in addition to the Kubernetes entities, you also have an admin container. In fact, it is a container which holds two binaries inside: kubectl and kubeadm. They are used to generate the kubeconfig for the above components and to perform the initial configuration of the cluster. Also, in an emergency, you can always exec into it to check and manage your cluster.

    After the release has been deployed, you can see a list of pods: the admin container, the apiserver in two replicas, the controller-manager, the etcd cluster, the scheduler, and the initial job that initializes the cluster. In the end, you get a command which allows you to get a shell into the admin container; you can use it to see what is happening inside:

    Also, let's take a look at the certificates. If you've ever installed Kubernetes, then you know that it has a scary directory /etc/kubernetes/pki with a bunch of certificates. In the case of Kubernetes-in-Kubernetes, you have fully automated management of them with cert-manager. Thus, it is enough to pass all certificate parameters to Helm during installation, and all the certificates will automatically be generated for your cluster.

    Looking at one of the certificates, e.g. the apiserver one, you can see that it has a list of DNS names and IP addresses. If you want to make this cluster accessible from outside, then just describe the additional DNS names in the values file and update the release. This will update the certificate resource, and cert-manager will regenerate the certificate. You'll no longer need to think about this. Whereas kubeadm certificates need to be renewed at least once a year, here cert-manager will take care of them and renew them automatically.

    Now let's log into the admin container and look at the cluster and nodes. Of course, there are no nodes yet, because at the moment you have deployed just a blank control plane for Kubernetes. But in the kube-system namespace you can see some coredns pods waiting for scheduling and some configmaps that have already appeared. That is, you can conclude that the cluster is working:
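
    Inside the admin container, that check might look roughly like this (the output is illustrative):

    $ kubectl get nodes
    No resources found
    $ kubectl get pods -n kube-system
    NAME                       READY   STATUS    RESTARTS   AGE
    coredns-558bd4d5db-abcde   0/1     Pending   0          2m
    coredns-558bd4d5db-fghij   0/1     Pending   0          2m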

    Here is the diagram of the deployed cluster. You can see services for all the Kubernetes components: apiserver, controller-manager, etcd-cluster, and scheduler, and, on the right side, the pods to which they forward traffic.

    By the way, the diagram is drawn in ArgoCD — the GitOps tool we use to manage our clusters, and cool diagrams are one of its features.

    Orchestrating physical servers

    OK, now you can see how our Kubernetes control plane is deployed, but what about the worker nodes? How do we add them? As I already said, all our servers are bare metal. We do not use virtualization to run Kubernetes; instead, we orchestrate all physical servers ourselves.

    We also use the Linux network boot feature very actively. Moreover, this is genuine network booting, not some kind of installation automation. When the nodes are booting, they just run a ready-made system image. That is, to update any node, we just need to reboot it - and it will download a new image. It is very easy, simple and convenient.

    The Kubefarm project was created to automate this. The most commonly used examples can be found in the examples directory. The most standard of them is named generic. Let's take a look at values.yaml:

    Here you can specify the parameters which are passed into the upstream Kubernetes-in-Kubernetes chart. In order for your control plane to be accessible from outside, it is enough to specify the IP address here, but if you wish, you can specify some DNS name here.

    In the PXE server configuration you can specify a timezone. You can also add an SSH key for logging in without a password (but you can also specify a password), as well as kernel modules and parameters that should be applied during booting the system.

    Next comes the nodePools configuration, i.e. the nodes themselves. If you've ever used a Terraform module for GKE, then this logic will remind you of it. Here you statically describe all the nodes with a set of parameters:

    • Name (hostname);

    • MAC addresses — we have nodes with two network cards, and each one can boot from any of the MAC addresses specified here;

    • IP address, which the DHCP server should issue to this node.

    In this example, you have two pools: the first has five nodes, the second has only one, and the second pool also has two tags assigned. Tags are the way to describe configuration for specific nodes. For example, you can add specific DHCP options for some pools, options for the PXE server for booting (e.g. here the debug option is enabled), and a set of kubernetesLabels and kubernetesTaints options. What does that mean?
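
    Purely as an illustration of the structure just described (the exact field names are assumptions rather than the authoritative Kubefarm schema), such a values file might look something like:

    nodePools:
      - name: generic
        nodes:
          - name: m1c1
            mac: ["02:00:00:00:01:01", "02:00:00:00:01:02"]
            ip: 10.28.36.101
          # ... four more nodes like this one ...
      - name: debug-pool
        tags: ["debug", "foo"]
        nodes:
          - name: m1c43
            mac: ["02:00:00:00:43:01", "02:00:00:00:43:02"]
            ip: 10.28.36.143

    tags:
      foo:
        kubernetesLabels:
          environment: test
          team: foo
        kubernetesTaints:
          - key: foo
            value: "true"
            effect: NoSchedule
      debug:
        # e.g. extra PXE or DHCP options to enable verbose booting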

    For example, in this configuration you have a second nodePool with one node. The pool has the debug and foo tags assigned. Now look at the options for the foo tag in kubernetesLabels. This means that the m1c43 node will boot with these two labels and this taint assigned. Everything seems to be simple. Now let's try this in practice.

    Demo

    Go to the examples directory and update the previously deployed chart to Kubefarm. Just use the generic parameters and look at the pods. You can see that a PXE server and one more job were added. This job essentially goes to the deployed Kubernetes cluster and creates a new token. It will now run every 12 hours to generate a fresh token, so that the nodes can connect to your cluster.

    In a graphical representation, it looks about the same, but now the apiserver is exposed to the outside.

    In the diagram, the IP through which the PXE server can be reached is highlighted in green. At the moment, Kubernetes does not allow creating a single LoadBalancer service for both the TCP and UDP protocols by default, so you have to create two different services with the same IP address: one for TFTP, and the second for HTTP, through which the system image is downloaded.

    But this simple example is not always enough; sometimes you might need to modify the logic at boot. For example, here is a directory advanced_network, inside which there is a values file with a simple shell script. Let's call it network.sh:
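
    A simplified sketch of what such a script could look like (interface names, addresses, and environment variable names are placeholders, not the actual example's contents) is:

    #!/bin/sh
    # Runs at boot; the environment variables below are illustrative and would be
    # provided by the boot environment (e.g. via DHCP options or kernel parameters).
    mkdir -p /etc/netplan
    cat > /etc/netplan/01-bond.yaml <<EOF
    network:
      version: 2
      ethernets:
        eth0: {}
        eth1: {}
      bonds:
        bond0:
          interfaces: [eth0, eth1]
          parameters:
            mode: 802.3ad
          addresses: ["${NODE_IP}/24"]
          gateway4: "${NODE_GATEWAY}"
    EOF
    netplan apply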

    All this script does is take environment variables at boot time and generate a network configuration based on them. It creates a directory and puts the netplan config inside. For example, a bonding interface is created here. Basically, this script can contain everything you need. It can hold the network configuration, generate system services, add some hooks, or describe any other logic. Anything that can be described in bash or shell will work here, and it will be executed at boot time.

    Let's see how it can be deployed. Let's pass the generic values file as the first parameter and an additional values file as the second parameter. This is a standard Helm feature. This way you can also pass secrets, but in this case the configuration is just extended by the second file:

    Let's look at the configmap foo-kubernetes-ltsp for the netboot server and make sure that the network.sh script is really there. These commands are used to configure the network at boot time:

    Here you can see how it works in principle. The chassis (we use HPE Moonshot 1500) holds the nodes; you can enter the show node list command to get a list of all the nodes. Now you can see the booting process.

    You can also get their MAC addresses with the show node macaddr all command. We have a clever operator that collects MAC addresses from the chassis automatically and passes them to the DHCP server. Actually, it just creates custom configuration resources for dnsmasq-controller, which is running in the same admin Kubernetes cluster. Also, through this interface you can control the nodes themselves, e.g. turn them on and off.

    If you have no way to access the chassis through iLO and collect a list of MAC addresses for your nodes, you can consider using a catch-all cluster pattern. Strictly speaking, it is just a cluster with a dynamic DHCP pool. Thus, all nodes that are not described in the configuration of other clusters will automatically join this cluster.

    For example, you can see a special cluster with some nodes. They are joined to the cluster with an auto-generated name based on their MAC address. Starting from this point, you can connect to them and see what happens there. Here you can prepare them as needed, for example set up the file system, and then rejoin them to another cluster.

    Now let's try connecting to the node terminal and see how it boots. After the BIOS, the network card is configured; it sends a request to the DHCP server from a specific MAC address, and the DHCP server redirects it to a specific PXE server. Later, the kernel and initrd image are downloaded from the server using the standard HTTP protocol:

    After loading the kernel, the node downloads the rootfs image and transfers control to systemd. Then the booting proceeds as usual, and after that the node joins Kubernetes:

    If you take a look at fstab, you can see only two entries there: /var/lib/docker and /var/lib/kubelet; they are mounted as tmpfs (in fact, from RAM). At the same time, the root partition is mounted as overlayfs, so all changes that you make to the system will be lost on the next reboot.

    Looking at the block devices on the node, you can see an NVMe disk, but it has not yet been mounted anywhere. There is also a loop device - this is the exact rootfs image downloaded from the server. At the moment it is located in RAM, occupies 653 MB, and is mounted with the loop option.

    If you look in /etc/ltsp, you will find the network.sh file that was executed at boot. Among the containers, you can see kube-proxy running along with its pause container.

    Details

    Network Boot Image

    But where does the main image come from? There is a little trick here. The image for the nodes is built with a Dockerfile along with the server. The Docker multi-stage build feature allows you to easily add any packages and kernel modules exactly at the image build stage. It looks like this:
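
    The following is only a rough, illustrative sketch of the multi-stage structure described below; the package names and paths are assumptions, not the project's actual Dockerfile:

    # Stage 1: the rootfs that the compute nodes will boot (illustrative package set)
    FROM ubuntu:20.04 AS rootfs
    # In practice the Kubernetes apt repository must be added before kubelet/kubeadm can be installed
    RUN apt-get update && apt-get install -y \
        linux-image-generic lvm2 systemd openssh-server \
        docker.io kubelet kubeadm
    # ... any additional node configuration goes here ...

    # Stage 2: the server that hands the image out over the network
    FROM ubuntu:20.04
    RUN apt-get update && apt-get install -y tftpd-hpa nginx grub-pc-bin squashfs-tools
    # Copy the rootfs from the previous stage and squash it into a single bootable image
    COPY --from=rootfs / /rootfs
    RUN mksquashfs /rootfs /srv/ltsp/images/ubuntu.img -noappend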

    What's going on here? First, we take a regular Ubuntu 20.04 and install all the packages we need. First of all, we install the kernel, lvm, systemd and ssh. In general, everything that you want to see on the final node should be described here. Here we also install docker with kubelet and kubeadm, which are used to join the node to the cluster.

    And then we perform additional configuration. In the last stage, we simply install tftp and nginx (which serves our image to clients) and grub (the bootloader). Then the root of the previous stages is copied into the final image, and a squashed image is generated from it. That is, in fact, we get a docker image which contains both the server and the boot image for our nodes. At the same time, it can be easily updated by changing the Dockerfile.

    Webhooks and API aggregation layer

    I want to pay special attention to the problem of webhooks and the aggregation layer. In general, webhooks are a Kubernetes feature that allows you to respond to the creation or modification of any resources. Thus, you can add a handler so that when resources are applied, Kubernetes sends a request to some pod to check whether the configuration of this resource is correct or to make additional changes to it.

    But the point is that, in order for the webhooks to work, the apiserver must have direct access to the cluster for which it is running. And if it is started in a separate cluster, as in our case, or even separately from any cluster, then the Konnectivity service can help us here. Konnectivity is one of the optional but officially supported Kubernetes components.

    Let's take a cluster of four nodes as an example; each of them is running a kubelet, and we have the other Kubernetes components running outside: kube-apiserver, kube-scheduler and kube-controller-manager. By default, all these components interact with the apiserver directly - this is the best-known part of the Kubernetes logic. But in fact, there is also a reverse connection. For example, when you want to view the logs or run a kubectl exec command, the API server establishes a connection to the specific kubelet independently:

    Kubernetes apiserver reaching kubelet

    But the problem is that if we have a webhook, it usually runs as a standard pod with a service in our cluster. And when the apiserver tries to reach it, it will fail, because it will try to access an in-cluster service named webhook.namespace.svc while being outside of the cluster where that service actually runs:

    Kubernetes apiserver can't reach webhook

    And here Konnectivity comes to our rescue. Konnectivity is a tricky proxy server developed especially for Kubernetes. It can be deployed as a server next to the apiserver, and the Konnectivity agent is deployed in several replicas directly in the cluster you want to access. The agent establishes a connection to the server and sets up a stable channel so the apiserver can reach all webhooks and all kubelets in the cluster. Thus, all communication with the cluster will now take place through the Konnectivity server:

    Kubernetes apiserver reaching webhook via konnectivity
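
    For reference, the apiserver is pointed at a Konnectivity server through an egress selector configuration passed via the --egress-selector-config-file flag. Based on the upstream Konnectivity setup documentation, a minimal configuration (the socket path is a conventional default, not something specific to this project) looks roughly like:

    apiVersion: apiserver.k8s.io/v1beta1
    kind: EgressSelectorConfiguration
    egressSelections:
      - name: cluster
        connection:
          proxyProtocol: GRPC
          transport:
            uds:
              udsName: /etc/kubernetes/konnectivity-server/konnectivity-server.socket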

    Our plans

    Of course, we are not going to stop at this stage. People interested in the project often write to me, and if there are enough interested people, I hope to move the Kubernetes-in-Kubernetes project under Kubernetes SIGs by presenting it in the form of an official Kubernetes Helm chart. Perhaps, by making this project independent, we'll gather an even larger community.

    I am also thinking of integrating it with the Machine Controller Manager, which would allow creating worker nodes not only from physical servers but also, for example, from virtual machines created with KubeVirt and running in the same Kubernetes cluster. By the way, it would also allow spawning virtual machines in the clouds while having a control plane deployed locally.

    I am also considering the option of integrating with the Cluster-API so that you can create physical Kubefarm clusters directly through the Kubernetes environment. But at the moment I'm not completely sure about this idea. If you have any thoughts on this matter, I'll be happy to listen to them.

  5. Author: Saifuding Diliyaer (Box)

    Introductory illustration

    Illustration by Munire Aireti

    At Box, we use Kubernetes (K8s) to manage hundreds of micro-services that enable Box to stream data at a petabyte scale. When it comes to the deployment process, we run kube-applier as part of our GitOps workflows with declarative configuration and automated deployment. Developers declare their K8s app manifests in a Git repository that requires code reviews and automatic checks to pass before any changes can get merged and applied inside our K8s clusters. With kubectl exec and other similar commands, however, developers are able to directly interact with running containers and alter them from their deployed state. This interaction could then subvert the change control and code review processes that are enforced in our CI/CD pipelines. Further, it allows such impacted containers to continue receiving traffic long-term in production.

    To solve this problem, we developed our own K8s component called kube-exec-controller along with its corresponding kubectl plugin. They function together in detecting and terminating potentially mutated containers (caused by interactive kubectl commands), as well as revealing the interaction events directly to the target Pods for better visibility.

    Admission control for interactive kubectl commands

    Once a request is sent to K8s, it needs to be authenticated and authorized by the API server to proceed. Additionally, K8s has a separate layer of protection called admission controllers, which can intercept the request before an object is persisted in etcd. There are various predefined admission controllers compiled into the API server binary (e.g. ResourceQuota to enforce hard resource usage limits per namespace). In addition, there are two dynamic admission controllers named MutatingAdmissionWebhook and ValidatingAdmissionWebhook, used for mutating or validating K8s requests respectively. The latter is what we adopted to detect container drift at runtime caused by interactive kubectl commands. This whole process can be divided into three steps, explained in detail below.

    1. Admit interactive kubectl command requests

    First of all, we needed to enable a validating webhook that sends qualified requests to kube-exec-controller. To add the new validation mechanism applying specifically to interactive kubectl commands, we configured the webhook’s rules with resources as [pods/exec, pods/attach] and operations as CONNECT. These rules tell the cluster's API server that all exec and attach requests should be subject to our admission control webhook. In the ValidatingAdmissionWebhook that we configured, we specified a service reference (which could also be replaced with a url field that gives the location of the webhook) and a caBundle to allow validating its X.509 certificate, both under the clientConfig stanza.

    Here is a short example of what our ValidatingWebhookConfiguration object looks like:

    apiVersion: admissionregistration.k8s.io/v1
    kind: ValidatingWebhookConfiguration
    metadata:
      name: example-validating-webhook-config
    webhooks:
      - name: validate-pod-interaction.example.com
        sideEffects: None
        rules:
          - apiGroups: ["*"]
            apiVersions: ["*"]
            operations: ["CONNECT"]
            resources: ["pods/exec", "pods/attach"]
        failurePolicy: Fail
        clientConfig:
          service:
            # reference to kube-exec-controller service deployed inside the K8s cluster
            name: example-service
            namespace: kube-exec-controller
            path: "/admit-pod-interaction"
          caBundle: "{{VALUE}}"  # PEM encoded CA bundle to validate kube-exec-controller's certificate
        admissionReviewVersions: ["v1", "v1beta1"]
    

    2. Label the target Pod with potentially mutated containers

    Once a request of kubectl exec comes in, kube-exec-controller makes an internal note to label the associated Pod. The added labels mean that we can not only query all the affected Pods, but also enable the security mechanism to retrieve previously identified Pods, in case the controller service itself gets restarted.

    The admission control process cannot directly modify the target Pod in its admission response. This is because the pods/exec request is against a subresource of the Pod API, and the API kind for that subresource is PodExecOptions. As a result, there is a separate process in kube-exec-controller that patches the labels asynchronously. The admission control always permits the exec request, then acts as a client of the K8s API to label the target Pod and to log related events. Developers can check whether their Pods are affected or not using kubectl or similar tools. For example:

    $ kubectl get pod --show-labels
    NAME READY STATUS RESTARTS AGE LABELS
    test-pod 1/1 Running 0 2s box.com/podInitialInteractionTimestamp=1632524400,box.com/podInteractorUsername=username-1,box.com/podTTLDuration=1h0m0s
    $ kubectl describe pod test-pod
    ...
    Events:
    Type Reason Age    From                            Message
    ----       ------       ----   ----                            -------
    Warning PodInteraction 5s admission-controller-service Pod was interacted with 'kubectl exec' command by user 'username-1' initially at time 2021-09-24 16:00:00 -0800 PST
    Warning PodInteraction 5s admission-controller-service Pod will be evicted at time 2021-09-24 17:00:00 -0800 PST (in about 1h0m0s).
    

    3. Evict the target Pod after a predefined period

    As you can see in the above event messages, the affected Pod is not evicted immediately. At times, developers might legitimately need to get into their running containers to debug live issues. Therefore, we define a time to live (TTL) for affected Pods based on the environment of the clusters they are running in. In particular, we allow a longer time in our dev clusters, as it is more common there to run kubectl exec or other interactive commands for active development.

    For our production clusters, we specify a lower time limit so as to avoid the impacted Pods serving traffic indefinitely. The kube-exec-controller internally sets and tracks a timer for each Pod that matches the associated TTL. Once the timer is up, the controller evicts that Pod using the K8s API. The eviction (rather than deletion) is done to ensure service availability, since the cluster respects any configured PodDisruptionBudget (PDB). Let's say a user has defined x Pods as critical in their PDB; the eviction (as requested by kube-exec-controller) does not continue when the target workload has fewer than x Pods running.
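
    For example, a PodDisruptionBudget along these lines (the names are placeholders) ensures that evictions, including those requested by kube-exec-controller, never take the matching workload below two available replicas:

    apiVersion: policy/v1
    kind: PodDisruptionBudget
    metadata:
      name: test-service-pdb
    spec:
      minAvailable: 2          # evictions are refused if they would drop availability below this
      selector:
        matchLabels:
          app: test-service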

    Here comes a sequence diagram of the entire workflow mentioned above: 

    Workflow Diagram

    A new kubectl plugin for better user experience

    Our admission controller component works great for solving the container drift issue we had on the platform. It is also able to submit all related Events to the affected target Pod. However, K8s clusters don't retain Events for very long (the default retention period is one hour). We needed to provide other ways for developers to get their Pod interaction activity, and a kubectl plugin was a perfect choice for us to expose this information. We named our plugin kubectl pi (short for pod-interaction) and provide two subcommands: get and extend.

    When the get subcommand is called, the plugin checks the metadata attached by our admission controller and translates it into human-readable information. Here is an example output from running kubectl pi get:

    $ kubectl pi get test-pod
    POD-NAME INTERACTOR POD-TTL EXTENSION EXTENSION-REQUESTER EVICTION-TIME
    test-pod  username-1  1h0m0s   /          /                    2021-09-24 17:00:00 -0800 PST
    

    The plugin can also be used to extend the TTL of a Pod that is marked for future eviction. This is useful in case developers need extra time to debug ongoing issues. To achieve this, a developer uses the kubectl pi extend subcommand, and the plugin patches the relevant annotations onto the given Pod. These annotations include the requested duration and the username of whoever made the extension request, for transparency (displayed in the table returned from the kubectl pi get command).

    Correspondingly, there is another webhook defined in kube-exec-controller which admits valid annotation updates. Once admitted, those updates reset the eviction timer of the target Pod as requested. An example of requesting the extension from the developer side would be:

    $ kubectl pi extend test-pod --duration=30m
    Successfully extended the termination time of pod/test-pod with a duration=30m
     
    $ kubectl pi get test-pod
    POD-NAME  INTERACTOR  POD-TTL  EXTENSION  EXTENSION-REQUESTER  EVICTION-TIME
    test-pod  username-1  1h0m0s   30m        username-2           2021-09-24 17:30:00 -0800 PST
    

    Future improvement

    Although our admission controller service works great in handling interactive requests to a Pod, it can also evict a Pod even when the actual commands in those requests are no-ops. For instance, developers sometimes run kubectl exec merely to check their service logs stored on hosts. Nevertheless, the target Pods would still get bounced despite the state of their containers not changing at all. One improvement here could be adding the ability to distinguish the commands that are passed in the interactive requests, so that no-op commands do not force a Pod eviction. However, this becomes challenging when developers get a shell to a running container and execute commands inside the shell, since those commands will no longer be visible to our admission controller service.

    Another item worth pointing out here is the choice of using K8s labels and annotations. In our design, we decided to have all immutable metadata attached as labels to better enforce immutability in our admission control. Yet some of this metadata could fit better as annotations. For instance, we had a label with the key box.com/podInitialInteractionTimestamp used to list all affected Pods in the kube-exec-controller code, although its value is unlikely to be queried for. A more idiomatic design in the K8s world might be a single label for identification, with the other metadata applied as annotations instead.

    Summary

    With the power of admission controllers, we are able to secure our K8s clusters by detecting potentially mutated containers at runtime and evicting their Pods without affecting service availability. We also utilize a kubectl plugin to provide flexibility around the eviction time and hence bring a better, more self-sufficient experience to service owners. We are proud to announce that we have open-sourced the whole project for the community to leverage in their own K8s clusters. Any contribution is more than welcome and appreciated. You can find the project hosted on GitHub at https://github.com/box/kube-exec-controller

    Special thanks to Ayush Sobti and Ethan Goldblum for their technical guidance on this project.