Kubernetes News

The Kubernetes project blog
  1. Authors: Nuno do Carmo Docker Captain and WSL Corsair; Ihor Dvoretskyi, Developer Advocate, Cloud Native Computing Foundation

    Introduction

    New to Windows 10 and WSL2, or new to Docker and Kubernetes? Welcome to this blog post, where we will install, from scratch, Kubernetes in Docker (KinD) and Minikube.

    Why Kubernetes on Windows?

    For the last few years, Kubernetes has become a de facto standard platform for running containerized services and applications in distributed environments. While a wide variety of distributions and installers exist to deploy Kubernetes in cloud environments (public, private or hybrid) or on bare metal, there is still a need to deploy and run Kubernetes locally, for example, on the developer’s workstation.

    Kubernetes was originally designed to be deployed and used in Linux environments. However, a good number of users (and not only application developers) use Windows as their daily driver. When Microsoft revealed WSL - the Windows Subsystem for Linux - the line between Windows and Linux environments became even less visible.

    WSL also brought the ability to run Kubernetes on Windows almost seamlessly!

    Below, we will cover in brief how to install and use various solutions to run Kubernetes locally.

    Prerequisites

    Since we will explain how to install KinD, we won’t go into too much detail around the installation of KinD’s dependencies.

    However, here is the list of the prerequisites needed and their version/lane:

    • OS: Windows 10 version 2004, Build 19041
    • WSL2 enabled
      • In order to install the distros as WSL2 by default, once WSL2 is installed, run the command wsl.exe --set-default-version 2 in PowerShell
    • WSL2 distro installed from the Windows Store - the distro used is Ubuntu-18.04
    • Docker Desktop for Windows, stable channel - the version used is 2.2.0.4
    • [Optional] Microsoft Terminal installed from the Windows Store
      • Open the Windows Store and type “Terminal” in the search box; it will (normally) be the first option

    Windows Store Terminal

    And that’s actually it. For Docker Desktop for Windows, there is no need to configure anything yet, as we will explain it in the next section.

    WSL2: First contact

    Once everything is installed, we can launch the WSL2 terminal from the Start menu by typing “Ubuntu” to search the applications and documents:

    Start Menu Search

    Once found, click on the name and it will launch the default Windows console with the Ubuntu bash shell running.

    As with any normal Linux distro, you need to create a user and set a password:

    User-Password

    [Optional] Update the sudoers

    As we are normally working on our local computer, it might be nice to update the sudoers file and make the %sudo group password-less:

    # Edit the sudoers with the visudo command
    sudo visudo
    # Change the %sudo group to be password-less
    %sudo ALL=(ALL:ALL) NOPASSWD: ALL
    # Press CTRL+X to exit
    # Press Y to save
    # Press Enter to confirm
    

    visudo

    Update Ubuntu

    Before we move to the Docker Desktop settings, let’s update our system and ensure we start in the best conditions:

    # Update the repositories and list of the packages available
    sudo apt update
    # Update the system based on the packages installed > the "-y" will approve the change automatically
    sudo apt upgrade -y
    

    apt-update-upgrade

    Docker Desktop: faster with WSL2

    Before we move into the settings, let’s do a small test; it will really show how cool the new integration with Docker Desktop is:

    # Try to see if the docker cli and daemon are installed
    docker version
    # Same for kubectl
    kubectl version
    

    kubectl-error

    You got an error? Perfect! It’s actually good news, so let’s now move on to the settings.

    Docker Desktop settings: enable WSL2 integration

    First, let’s start Docker Desktop for Windows if it’s not already running. Open the Windows Start menu, type “docker”, and click on the name to start the application:

    docker-start

    You should now see the Docker icon with the other taskbar icons near the clock:

    docker-taskbar

    Now click on the Docker icon and choose settings. A new window will appear:

    docker-settings-general

    By default, the WSL2 integration is not active, so click “Enable the experimental WSL 2 based engine” and then click “Apply & Restart”:

    docker-settings-wsl2

    What this feature did behind the scenes was create two new distros in WSL2, containing and running all the needed backend sockets, daemons, and also the CLI tools (read: the docker and kubectl commands).
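
    If you are curious, you can list the WSL distros from PowerShell and should see two new entries, typically named docker-desktop and docker-desktop-data (this quick check is our own addition, not part of the original walkthrough):

    # List the WSL distros and the WSL version they run on
    wsl.exe --list --verbose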

    Still, this first setting is not enough to run the commands inside our distro. If we try, we will get the same error as before.

    In order to fix it, and finally be able to use the commands, we need to tell Docker Desktop to “attach” itself to our distro as well:

    docker-resources-wsl

    Let’s now switch back to our WSL2 terminal and see if we can (finally) launch the commands:

    # Try to see if the docker cli and daemon are installed
    docker version
    # Same for kubectl
    kubectl version
    

    docker-kubectl-success

    Tip: if nothing happens, restart Docker Desktop and restart the WSL process in PowerShell with Restart-Service LxssManager, then launch a new Ubuntu session

    And success! The basic settings are now done, so let’s move on to the installation of KinD.

    KinD: Kubernetes made easy in a container

    Right now, we have Docker installed and configured, and the last test worked fine.

    However, if we look carefully at the kubectl command, it found the “Client Version” (1.15.5), but it didn’t find any server.

    This is normal as we didn’t enable the Docker Kubernetes cluster. So let’s install KinD and create our first cluster.

    And as sources are always important to mention, we will follow (partially) the how-to on the official KinD website:

    # Download the latest version of KinD
    curl -Lo ./kind https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-$(uname)-amd64
    # Make the binary executable
    chmod +x ./kind
    # Move the binary to your executable path
    sudo mv ./kind /usr/local/bin/
    

    kind-install

    KinD: the first cluster

    We are ready to create our first cluster:

    # Check if the KUBECONFIG is not set
    echo $KUBECONFIG
    # Check if the .kube directory is created > if not, no need to create it
    ls $HOME/.kube
    # Create the cluster and give it a name (optional)
    kind create cluster --name wslkind
    # Check if the .kube has been created and populated with files
    ls $HOME/.kube
    

    kind-cluster-create

    Tip: as you can see, the Terminal was changed so the nice icons are all displayed

    The cluster has been successfully created, and because we are using Docker Desktop, the network is all set for us to use “as is”.

    So we can open the Kubernetes master URL in our Windows browser:

    kind-browser-k8s-master
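
    Tip: if you don’t have the URL at hand, kubectl can print it for us (a small addition on top of the original steps):

    # Print the cluster endpoints > the Kubernetes master URL can be opened in the Windows browser
    kubectl cluster-info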

    And this is the real strength of Docker Desktop for Windows with the WSL2 backend. Docker really did an amazing integration.

    KinD: counting 1 - 2 - 3

    Our first cluster was created, and it’s the “normal” one-node cluster:

    # Check how many nodes it created
    kubectl get nodes
    # Check the services for the whole cluster
    kubectl get all --all-namespaces
    

    kind-list-nodes-services

    While this will be enough for most people, let’s leverage one of the coolest features: multi-node clusters:

    # Delete the existing cluster
    kind delete cluster --name wslkind
    # Create a config file for a 3 nodes cluster
    cat << EOF > kind-3nodes.yaml
    kind: Cluster
    apiVersion: kind.x-k8s.io/v1alpha4
    nodes:
     - role: control-plane
     - role: worker
     - role: worker
    EOF
    # Create a new cluster with the config file
    kind create cluster --name wslkindmultinodes --config ./kind-3nodes.yaml
    # Check how many nodes it created
    kubectl get nodes
    

    kind-cluster-create-multinodes

    Tip: depending on how fast we run the “get nodes” command, not all the nodes may be ready yet; wait a few seconds and run it again, and everything should be ready

    And that’s it: we have created a three-node cluster, and if we look at the services one more time, we will see several that now have three replicas:

    # Check the services for the whole cluster
    kubectl get all --all-namespaces
    

    wsl2-kind-list-services-multinodes

    KinD: can I see a nice dashboard?

    Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview.

    For that, the Kubernetes Dashboard project has been created. The installation and first connection test is quite fast, so let’s do it:

    # Install the Dashboard application into our cluster
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.0.0-rc6/aio/deploy/recommended.yaml
    # Check the resources it created based on the new namespace created
    kubectl get all -n kubernetes-dashboard
    

    kind-install-dashboard

    As it created a service with a ClusterIP (read: internal network address), we cannot reach it by typing the URL in our Windows browser:

    kind-browse-dashboard-error

    That’s because we need to create a temporary proxy:

    # Start a kubectl proxy
    kubectl proxy
    # Enter the URL on your browser: http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
    

    kind-browse-dashboard-success

    Finally, to log in, we can either enter a Token, which we didn’t create, or provide the kubeconfig file from our cluster.

    If we try to log in with the kubeconfig, we will get the error “Internal error (500): Not enough data to create auth info structure”. This is due to the lack of credentials in the kubeconfig file.

    So, to avoid ending up with the same error, let’s follow the recommended RBAC approach.

    Let’s open a new WSL2 session:

    # Create a new ServiceAccount
    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: admin-user
      namespace: kubernetes-dashboard
    EOF
    # Create a ClusterRoleBinding for the ServiceAccount
    kubectl apply -f - <<EOF
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: admin-user
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin
    subjects:
    - kind: ServiceAccount
      name: admin-user
      namespace: kubernetes-dashboard
    EOF
    

    kind-browse-dashboard-rbac-serviceaccount

    # Get the Token for the ServiceAccount
    kubectl -n kubernetes-dashboard describe secret $(kubectl -n kubernetes-dashboard get secret | grep admin-user | awk '{print $1}')
    # Copy the token and paste it into the Dashboard login, then press "Sign in"
    

    kind-browse-dashboard-login-success

    Success! And let’s see our nodes listed also:

    kind-browse-dashboard-browse-nodes

    Three nice and shiny nodes appear.

    Minikube: Kubernetes from everywhere

    Right now, we have Docker installed and configured, and the last test worked fine.

    However, if we look carefully at the kubectl command, it found the “Client Version” (1.15.5), but it didn’t find any server.

    This is normal as we didn’t enable the Docker Kubernetes cluster. So let’s install Minikube and create our first cluster.

    And as sources are always important to mention, we will follow (partially) the how-to from the Kubernetes.io website:

    # Download the latest version of Minikube
    curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
    # Make the binary executable
    chmod +x ./minikube
    # Move the binary to your executable path
    sudo mv ./minikube /usr/local/bin/
    

    minikube-install

    Minikube: updating the host

    If we follow the how-to, it states that we should use the --driver=none flag in order to run Minikube directly on the host with Docker.

    Unfortunately, we will get an error about “conntrack” being required to run Kubernetes v1.18:

    # Create a minikube one node cluster
    minikube start --driver=none
    

    minikube-start-error

    Tip: as you can see, the Terminal was changed so the nice icons are all displayed

    So let’s fix the issue by installing the missing package:

    # Install the conntrack package
    sudo apt install -y conntrack
    

    minikube-install-conntrack

    Let’s try to launch it again:

    # Create a minikube one node cluster
    minikube start --driver=none
    # We got a permissions error > try again with sudo
    sudo minikube start --driver=none
    

    minikube-start-error-systemd

    Ok, this error could have been problematic … in the past. Luckily for us, there’s a solution.

    Minikube: enabling SystemD

    In order to enable SystemD on WSL2, we will apply the scripts from Daniel Llewellyn.

    I invite you to read the full blog post to see how he came to this solution, and the various iterations he went through to fix several issues.

    So in a nutshell, here are the commands:

    # Install the needed packages
    sudo apt install -yqq daemonize dbus-user-session fontconfig
    

    minikube-systemd-packages

    # Create the start-systemd-namespace script
    sudo vi /usr/sbin/start-systemd-namespace
    #!/bin/bash
    SYSTEMD_PID=$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')
    if [ -z "$SYSTEMD_PID" ] || [ "$SYSTEMD_PID" != "1" ]; then
    export PRE_NAMESPACE_PATH="$PATH"
    (set -o posix; set) | \
     grep -v "^BASH" | \
     grep -v "^DIRSTACK=" | \
     grep -v "^EUID=" | \
     grep -v "^GROUPS=" | \
     grep -v "^HOME=" | \
     grep -v "^HOSTNAME=" | \
     grep -v "^HOSTTYPE=" | \
     grep -v "^IFS='.*"$'\n'"'" | \
     grep -v "^LANG=" | \
     grep -v "^LOGNAME=" | \
     grep -v "^MACHTYPE=" | \
     grep -v "^NAME=" | \
     grep -v "^OPTERR=" | \
     grep -v "^OPTIND=" | \
     grep -v "^OSTYPE=" | \
     grep -v "^PIPESTATUS=" | \
     grep -v "^POSIXLY_CORRECT=" | \
     grep -v "^PPID=" | \
     grep -v "^PS1=" | \
     grep -v "^PS4=" | \
     grep -v "^SHELL=" | \
     grep -v "^SHELLOPTS=" | \
     grep -v "^SHLVL=" | \
     grep -v "^SYSTEMD_PID=" | \
     grep -v "^UID=" | \
     grep -v "^USER=" | \
     grep -v "^_=" | \
     cat - > "$HOME/.systemd-env"
    echo "PATH='$PATH'" >> "$HOME/.systemd-env"
    exec sudo /usr/sbin/enter-systemd-namespace "$BASH_EXECUTION_STRING"
    fi
    if [ -n "$PRE_NAMESPACE_PATH" ]; then
    export PATH="$PRE_NAMESPACE_PATH"
    fi
    
    # Create the enter-systemd-namespace
    sudo vi /usr/sbin/enter-systemd-namespace
    #!/bin/bash
    if [ "$UID" != 0 ]; then
    echo "You need to run$0 through sudo"
    exit 1
    fi
    SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')"
    if [ -z "$SYSTEMD_PID" ]; then
    /usr/sbin/daemonize /usr/bin/unshare --fork --pid --mount-proc /lib/systemd/systemd --system-unit=basic.target
    while [ -z "$SYSTEMD_PID" ]; do
    SYSTEMD_PID="$(ps -ef | grep '/lib/systemd/systemd --system-unit=basic.target$' | grep -v unshare | awk '{print $2}')"
    done
    fi
    if [ -n "$SYSTEMD_PID" ] && [ "$SYSTEMD_PID" != "1" ]; then
    if [ -n "$1" ] && [ "$1" != "bash --login" ] && [ "$1" != "/bin/bash --login" ]; then
    exec /usr/bin/nsenter -t "$SYSTEMD_PID" -a \
     /usr/bin/sudo -H -u "$SUDO_USER" \
     /bin/bash -c 'set -a; source "$HOME/.systemd-env"; set +a; exec bash -c '"$(printf "%q" "$@")"
    else
    exec /usr/bin/nsenter -t "$SYSTEMD_PID" -a \
     /bin/login -p -f "$SUDO_USER" \
     $(/bin/cat "$HOME/.systemd-env" | grep -v "^PATH=")
    fi
    echo "Existential crisis"
    fi
    
    # Edit the permissions of the enter-systemd-namespace script
    sudo chmod +x /usr/sbin/enter-systemd-namespace
    # Edit the bash.bashrc file
    sudo sed -i 2a"# Start or enter a PID namespace in WSL2\nsource /usr/sbin/start-systemd-namespace\n" /etc/bash.bashrc
    

    minikube-systemd-files

    Finally, exit and launch a new session. You do not need to stop WSL2; a new session is enough:

    minikube-systemd-enabled
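
    As a quick sanity check (our own addition, not part of the original scripts), PID 1 inside the new session should now be systemd:

    # Show the name of the process running as PID 1 > it should print "systemd"
    ps -p 1 -o comm=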

    Minikube: the first cluster

    We are ready to create our first cluster:

    # Check if the KUBECONFIG is not set
    echo $KUBECONFIG
    # Check if the .kube directory is created > if not, no need to create it
    ls $HOME/.kube
    # Check if the .minikube directory is created > if yes, delete it
    ls $HOME/.minikube
    # Create the cluster with sudo
    sudo minikube start --driver=none
    

    In order to be able to use kubectl with our user, and not sudo, Minikube recommends running the chown command:

    # Change the owner of the .kube and .minikube directories
    sudo chown -R $USER $HOME/.kube $HOME/.minikube
    # Check the access and if the cluster is running
    kubectl cluster-info
    # Check the resources created
    kubectl get all --all-namespaces
    

    minikube-start-fixed

    The cluster has been successfully created, and Minikube used the WSL2 IP, which is great for several reasons, and one of them is that we can open the Kubernetes master URL in our Windows browser:

    minikube-browse-k8s-master

    And here is the real strength of the WSL2 integration: once port 8443 is open on the WSL2 distro, it actually gets forwarded to Windows, so instead of having to remember the IP address, we can also reach the Kubernetes master URL via localhost:

    minikube-browse-k8s-master-localhost

    Minikube: can I see a nice dashboard?

    Working on the command line is always good and very insightful. However, when dealing with Kubernetes we might want, at some point, to have a visual overview.

    For that, Minikube embeds the Kubernetes Dashboard. Thanks to it, running and accessing the Dashboard is very simple:

    # Enable the Dashboard service
    sudo minikube dashboard
    # Access the Dashboard from a browser on Windows side
    

    minikube-browse-dashboard

    The command also creates a proxy, which means that once we end the command by pressing CTRL+C, the Dashboard will no longer be accessible.

    Still, if we look at the namespace kubernetes-dashboard, we will see that the service is still created:

    # Get all the services from the dashboard namespace
    kubectl get all --namespace kubernetes-dashboard
    

    minikube-dashboard-get-all

    Let’s edit the service and change its type to LoadBalancer:

    # Edit the Dashboard service
    kubectl edit service/kubernetes-dashboard --namespace kubernetes-dashboard
    # Go to the very end and remove the last 2 lines
    status:
      loadBalancer: {}
    # Change the type from ClusterIP to LoadBalancer
    type: LoadBalancer
    # Save the file
    

    minikube-dashboard-type-loadbalancer
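
    If you prefer a non-interactive alternative to kubectl edit, a one-line kubectl patch should achieve the same result (our own suggestion, not part of the original steps):

    # Patch the Dashboard service type from ClusterIP to LoadBalancer in a single command
    kubectl patch service kubernetes-dashboard --namespace kubernetes-dashboard -p '{"spec": {"type": "LoadBalancer"}}'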

    Check again the Dashboard service and let’s access the Dashboard via the LoadBalancer:

    # Get all the services from the dashboard namespace
    kubectl get all --namespace kubernetes-dashboard
    # Access the Dashboard from a browser on Windows side with the URL: localhost:<port exposed>
    

    minikube-browse-dashboard-loadbalancer

    Conclusion

    It’s clear that we are far from done as we could have some LoadBalancing implemented and/or other services (storage, ingress, registry, etc…).

    Concerning Minikube on WSL2, since it required enabling SystemD, we can consider it an intermediate-level setup.

    So, with two solutions available, which one is “best for you”? Both bring their own advantages and inconveniences, so here is an overview from our point of view only:

    Criteria               KinD                            Minikube
    Installation on WSL2   Very Easy                       Medium
    Multi-node             Yes                             No
    Plugins                Manual install                  Yes
    Persistence            Yes, however not designed for   Yes
    Alternatives           K3d                             Microk8s

    We hope you got a real taste of the integration between the different components: WSL2 - Docker Desktop - KinD/Minikube. And we hope it gave you some ideas or, even better, some answers to your Kubernetes workflows with KinD and/or Minikube on Windows and WSL2.

    See you soon for other adventures in the Kubernetes ocean.

    Nuno & Ihor

  2. Author: Zach Corleissen, Cloud Native Computing Foundation

    Editor’s note: Zach is one of the chairs for the Kubernetes documentation special interest group (SIG Docs).

    Late last summer, SIG Docs started a community conversation about third party content in Kubernetes docs. This conversation became a Kubernetes Enhancement Proposal (KEP) and, after five months of review and comment, SIG Architecture approved the KEP as a content guide for Kubernetes docs.

    Here’s how Kubernetes docs handle third party content now:

    Links to active content in the Kubernetes project (projects in the kubernetes and kubernetes-sigs GitHub orgs) are always allowed.

    Kubernetes requires some third party content to function. Examples include container runtimes (containerd, CRI-O, Docker), networking policy (CNI plugins), Ingress controllers, and logging.

    Docs can link to third party open source software (OSS) outside the Kubernetes project if it’s necessary for Kubernetes to function.

    These common sense guidelines make sure that Kubernetes docs document Kubernetes.

    Keeping the docs focused

    Our goal is for Kubernetes docs to be a trustworthy guide to Kubernetes features. To achieve this goal, SIG Docs is tracking third party content and removing any third party content that isn’t both in the Kubernetes project and required for Kubernetes to function.

    Re-homing content

    Some content that readers may find helpful will be removed. To make sure readers have continuous access to information, we’re giving stakeholders until the 1.19 release deadline for docs, July 9th, 2020, to re-home any content slated for removal.

    Over the next few months you’ll see less third party content in the docs as contributors open PRs to remove content.

    Background

    Over time, SIG Docs observed increasing vendor content in the docs. Some content took the form of vendor-specific implementations that aren’t in the project and aren’t required for Kubernetes to function. Other content was thinly-disguised advertising with minimal to no feature content. Some vendor content was new; other content had been in the docs for years. It became clear that the docs needed clear, well-bounded guidelines for what kind of third party content is and isn’t allowed. The content guide emerged from an extensive period of review and comment from the community.

    Docs work best when they’re accurate, helpful, trustworthy, and remain focused on features. In our experience, vendor content dilutes trust and accuracy.

    Put simply: feature docs aren’t a place for vendors to advertise their products. Our content policy keeps the docs focused on helping developers and cluster admins, not on marketing.

    Dual sourced content

    Less impactful but also important is how Kubernetes docs handle dual-sourced content. Dual-sourced content is content published in more than one location, or from a non-canonical source.

    From the Kubernetes content guide:

    Wherever possible, Kubernetes docs link to canonical sources instead of hosting dual-sourced content.

    Minimizing dual-sourced content streamlines the docs and makes content across the Web more searchable. We’re working to consolidate and redirect dual-sourced content in the Kubernetes docs as well.

    Ways to contribute

    We’re tracking third-party content in an issue in the Kubernetes website repository. If you see third party content that’s out of project and isn’t required for Kubernetes to function, please comment on the tracking issue.

    Feel free to open a PR that removes non-conforming content once you’ve identified it!

    Want to know more?

    For more information, read the issue description for tracking third party content.

  3. Author: Wei Huang (IBM), Aldo Culquicondor (Google)

    Managing Pod distribution across a cluster is hard. The well-known Kubernetes features for Pod affinity and anti-affinity allow some control of Pod placement in different topologies. However, these features only resolve part of the Pod distribution use cases: either place unlimited Pods in a single topology, or disallow two Pods from co-locating in the same topology. In between these two extreme cases, there is a common need to distribute Pods evenly across topologies, so as to achieve better cluster utilization and high availability of applications.

    The PodTopologySpread scheduling plugin (originally proposed as EvenPodsSpread) was designed to fill that gap. We promoted it to beta in 1.18.

    API changes

    A new field topologySpreadConstraints is introduced in the Pod’s spec API:

    spec:
      topologySpreadConstraints:
      - maxSkew: <integer>
        topologyKey: <string>
        whenUnsatisfiable: <string>
        labelSelector: <object>
    

    As this API is embedded in Pod’s spec, you can use this feature in all the high-level workload APIs, such as Deployment, DaemonSet, StatefulSet, etc.

    Let’s see an example of a cluster to understand this API.

    API

    • labelSelector is used to find matching Pods. For each topology, we count the number of Pods that match this label selector. In the above example, given the labelSelector as “app: foo”, the matching number in “zone1” is 2; while the number in “zone2” is 0.
    • topologyKey is the key that defines a topology in the Nodes’ labels. In the above example, some Nodes are grouped into “zone1” if they have the label “zone=zone1”, while the others are grouped into “zone2”.
    • maxSkew describes the maximum degree to which Pods can be unevenly distributed. In the above example:
      • if we put the incoming Pod to “zone1”, the skew on “zone1” will become 3 (3 Pods matched in “zone1”; global minimum of 0 Pods matched on “zone2”), which violates the “maxSkew: 1” constraint.
      • if the incoming Pod is placed to “zone2”, the skew on “zone2” is 0 (1 Pod matched in “zone2”; global minimum of 1 Pod matched on “zone2” itself), which satisfies the “maxSkew: 1” constraint. Note that the skew is calculated per each qualified Node, instead of a global skew.
    • whenUnsatisfiable specifies, when “maxSkew” can’t be satisfied, what action should be taken:
      • DoNotSchedule (default) tells the scheduler not to schedule it. It’s a hard constraint.
      • ScheduleAnyway tells the scheduler to still schedule it while prioritizing Nodes that reduce the skew. It’s a soft constraint.
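
    Putting these fields together, a complete Pod manifest for the example above could look like the following sketch (the pause image and the exact label values are illustrative assumptions):

    kind: Pod
    apiVersion: v1
    metadata:
      name: mypod
      labels:
        app: foo
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.2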

    Advanced usage

    As the feature name “PodTopologySpread” implies, the basic usage of this feature is to run your workload in an absolutely even manner (maxSkew=1), or a relatively even manner (maxSkew>=2). See the official document for more details.

    In addition to this basic usage, there are some advanced usage examples that enable your workloads to benefit in terms of high availability and cluster utilization.

    Usage along with NodeSelector / NodeAffinity

    You may have noticed that there is no “topologyValues” field to limit which topologies the Pods are going to be scheduled to. By default, the scheduler will search all Nodes and group them by “topologyKey”. Sometimes this may not be ideal. For instance, suppose there is a cluster with Nodes tagged with “env=prod”, “env=staging” and “env=qa”, and you now want to place Pods evenly across zones in the “qa” environment only. Is that possible?

    The answer is yes. You can leverage the NodeSelector or NodeAffinity API spec. Under the hood, the PodTopologySpread feature will honor that and calculate the spread constraints among the nodes that satisfy the selectors.

    Advanced-Usage-1

    As illustrated above, you can specify spec.affinity.nodeAffinity to limit the “searching scope” to be “qa” environment, and within that scope, the Pod will be scheduled to one zone which satisfies the topologySpreadConstraints. In this case, it’s “zone2”.
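
    A sketch of what that could look like in the Pod spec (the “env” label key, “zone” topology key, and label values are assumptions taken from the example above):

    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: env
                operator: In
                values:
                - qa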

    Multiple TopologySpreadConstraints

    It’s intuitive to understand how one single TopologySpreadConstraint works. What’s the case for multiple TopologySpreadConstraints? Internally, each TopologySpreadConstraint is calculated independently, and the result sets will be merged to generate the eventual result set - i.e., suitable Nodes.

    In the following example, we want to schedule a Pod to a cluster with 2 requirements at the same time:

    • place the Pod evenly with Pods across zones
    • place the Pod evenly with Pods across nodes

    Advanced-Usage-2

    For the first constraint, there are 3 Pods in zone1 and 2 Pods in zone2, so the incoming Pod can be only put to zone2 to satisfy the “maxSkew=1” constraint. In other words, the result set is nodeX and nodeY.

    For the second constraint, there are too many Pods in nodeB and nodeX, so the incoming Pod can be only put to nodeA and nodeY.

    Now we can conclude the only qualified Node is nodeY - from the intersection of the sets {nodeX, nodeY} (from the first constraint) and {nodeA, nodeY} (from the second constraint).

    Multiple TopologySpreadConstraints are powerful, but be sure to understand the difference from the preceding “NodeSelector/NodeAffinity” example: one calculates the result sets independently and then intersects them, while the other calculates the topologySpreadConstraints based on the filtering results of the node constraints.

    Instead of using “hard” constraints in all topologySpreadConstraints, you can also combine “hard” constraints and “soft” constraints to adapt to more diverse cluster situations.
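
    For illustration, a Pod could declare a hard constraint across zones together with a soft constraint across nodes like this (the topology keys and labels are assumptions; kubernetes.io/hostname is the standard node hostname label):

    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: foo
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: foo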

    Note: If two TopologySpreadConstraints are being applied for the same {topologyKey, whenUnsatisfiable} tuple, the Pod creation will be blocked returning a validation error.

    PodTopologySpread defaults

    PodTopologySpread is a Pod-level API. As such, to use the feature, workload authors need to be aware of the underlying topology of the cluster, and then specify proper topologySpreadConstraints in the Pod spec for every workload. While the Pod-level API gives the most flexibility, it is also possible to specify cluster-level defaults.

    The default PodTopologySpread constraints allow you to specify spreading for all the workloads in the cluster, tailored for its topology. The constraints can be specified by an operator/admin as PodTopologySpread plugin arguments in the scheduling profile configuration API when starting kube-scheduler.

    A sample configuration could look like this:

    apiVersion: kubescheduler.config.k8s.io/v1alpha2
    kind: KubeSchedulerConfiguration
    profiles:
    pluginConfig:
    - name: PodTopologySpread
    args:
    defaultConstraints:
    - maxSkew: 1
    topologyKey: example.com/rack
    whenUnsatisfiable: ScheduleAnyway
    

    When configuring default constraints, label selectors must be left empty. kube-scheduler will deduce the label selectors from the membership of the Pod to Services, ReplicationControllers, ReplicaSets or StatefulSets. Pods can always override the default constraints by providing their own through the PodSpec.

    Note: When using default PodTopologySpread constraints, it is recommended to disable the old DefaultTopologySpread plugin.

    Wrap-up

    PodTopologySpread allows you to define spreading constraints for your workloads with a flexible and expressive Pod-level API. In the past, workload authors used Pod AntiAffinity rules to force or hint the scheduler to run a single Pod per topology domain. In contrast, the new PodTopologySpread constraints allow Pods to specify skew levels that can be required (hard) or desired (soft). The feature can be paired with Node selectors and Node affinity to limit the spreading to specific domains. Pod spreading constraints can be defined for different topologies such as hostnames, zones, regions, racks, etc.

    Lastly, cluster operators can define default constraints to be applied to all Pods. This way, Pods don’t need to be aware of the underlying topology of the cluster.

  4. Author: Rick Ducott | GitHub | Twitter

    Every day, my colleagues and I are talking to platform owners, architects, and engineers who are using Gloo as an API gateway to expose their applications to end users. These applications may span legacy monoliths, microservices, managed cloud services, and Kubernetes clusters. Fortunately, Gloo makes it easy to set up routes to manage, secure, and observe application traffic while supporting a flexible deployment architecture to meet the varying production needs of our users.

    Beyond the initial set up, platform owners frequently ask us to help design the operational workflows within their organization: How do we bring a new application online? How do we upgrade an application? How do we divide responsibilities across our platform, ops, and development teams?

    In this post, we’re going to use Gloo to design a two-phased canary rollout workflow for application upgrades:

    • In the first phase, we’ll do canary testing by shifting a small subset of traffic to the new version. This allows you to safely perform smoke and correctness tests.
    • In the second phase, we’ll progressively shift traffic to the new version, allowing us to monitor the new version under load, and eventually, decommission the old version.

    To keep it simple, we’re going to focus on designing the workflow using open source Gloo, and we’re going to deploy the gateway and application to Kubernetes. At the end, we’ll talk about a few extensions and advanced topics that could be interesting to explore in a follow up.

    Initial setup

    To start, we need a Kubernetes cluster. This example doesn’t take advantage of any cloud specific features, and can be run against a local test cluster such as minikube. This post assumes a basic understanding of Kubernetes and how to interact with it using kubectl.

    We’ll install the latest open source Gloo to the gloo-system namespace and deploy version v1 of an example application to the echo namespace. We’ll expose this application outside the cluster by creating a route in Gloo, to end up with a picture like this:

    Setup

    Deploying Gloo

    We’ll install gloo with the glooctl command line tool, which we can download and add to the PATH with the following commands:

    curl -sL https://run.solo.io/gloo/install | sh
    export PATH=$HOME/.gloo/bin:$PATH
    

    Now, you should be able to run glooctl version to see that it is installed correctly:

    ➜ glooctl version
    Client: {"version":"1.3.15"}
    Server: version undefined, could not find any version of gloo running
    

    Now we can install the gateway to our cluster with a simple command:

    glooctl install gateway
    

    The console should indicate the install finishes successfully:

    Creating namespace gloo-system... Done.
    Starting Gloo installation...
    Gloo was successfully installed!
    

    Before long, we can see all the Gloo pods running in the gloo-system namespace:

    ➜ kubectl get pod -n gloo-system
    NAME READY STATUS RESTARTS AGE
    discovery-58f8856bd7-4fftg 1/1 Running 0 13s
    gateway-66f86bc8b4-n5crc 1/1 Running 0 13s
    gateway-proxy-5ff99b8679-tbp65 1/1 Running 0 13s
    gloo-66b8dc8868-z5c6r 1/1 Running 0 13s
    

    Deploying the application

    Our echo application is a simple container (thanks to our friends at HashiCorp) that will respond with the application version, to help demonstrate our canary workflows as we start testing and shifting traffic to a v2 version of the application.

    Kubernetes gives us a lot of flexibility in terms of modeling this application. We’ll adopt the following conventions:

    • We’ll include the version in the deployment name so we can run two versions of the application side-by-side and manage their lifecycle differently.
    • We’ll label pods with an app label (app: echo) and a version label (version: v1) to help with our canary rollout.
    • We’ll deploy a single Kubernetes Service for the application to set up networking. Instead of updating this or using multiple services to manage routing to different versions, we’ll manage the rollout with Gloo configuration.

    The following is our v1 echo application:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: echo-v1
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: echo
          version: v1
      template:
        metadata:
          labels:
            app: echo
            version: v1
        spec:
          containers:
            # Shout out to our friends at Hashi for this useful test server
            - image: hashicorp/http-echo
              args:
                - "-text=version:v1"
                - -listen=:8080
              imagePullPolicy: Always
              name: echo-v1
              ports:
                - containerPort: 8080
    

    And here is the echo Kubernetes Service object:

    apiVersion: v1
    kind: Service
    metadata:
      name: echo
    spec:
      ports:
        - port: 80
          targetPort: 8080
          protocol: TCP
      selector:
        app: echo

    For convenience, we’ve published this yaml in a repo so we can deploy it with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/1-setup/echo.yaml
    

    We should see the following output:

    namespace/echo created
    deployment.apps/echo-v1 created
    service/echo created
    

    And we should be able to see all the resources healthy in the echo namespace:

    ➜ kubectl get all -n echo
    NAME READY STATUS RESTARTS AGE
    pod/echo-v1-66dbfffb79-287s5 1/1 Running 0 6s
    NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
    service/echo ClusterIP 10.55.252.216 <none> 80/TCP 6s
    NAME READY UP-TO-DATE AVAILABLE AGE
    deployment.apps/echo-v1 1/1 1 1 7s
    NAME DESIRED CURRENT READY AGE
    replicaset.apps/echo-v1-66dbfffb79 1 1 1 7s
    

    Exposing outside the cluster with Gloo

    We can now expose this service outside the cluster with Gloo. First, we’ll model the application as a Gloo Upstream, which is Gloo’s abstraction for a traffic destination:

    apiVersion: gloo.solo.io/v1
    kind: Upstream
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      kube:
        selector:
          app: echo
        serviceName: echo
        serviceNamespace: echo
        servicePort: 8080
        subsetSpec:
          selectors:
            - keys:
                - version
    

    Here, we’re setting up subsets based on the version label. We don’t have to use this in our routes, but later we’ll start to use it to support our canary workflow.

    We can now create a route to this upstream in Gloo by defining a Virtual Service:

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system

    We can apply these resources with the following commands:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/1-setup/upstream.yaml
    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/1-setup/vs.yaml
    

    Once we apply these two resources, we can start to send traffic to the application through Gloo:

    ➜ curl $(glooctl proxy url)/
    version:v1
    

    Our setup is complete, and our cluster now looks like this:

    Setup

    Two-Phased Rollout Strategy

    Now we have a new version v2 of the echo application that we wish to roll out. We know that when the rollout is complete, we are going to end up with this picture:

    End State

    However, to get there, we may want to perform a few rounds of testing to ensure the new version of the application meets certain correctness and/or performance acceptance criteria. In this post, we’ll introduce a two-phased approach to canary rollout with Gloo, that could be used to satisfy the vast majority of acceptance tests.

    In the first phase, we’ll perform smoke and correctness tests by routing a small segment of the traffic to the new version of the application. In this demo, we’ll use a header stage: canary to trigger routing to the new service, though in practice it may be desirable to make this decision based on another part of the request, such as a claim in a verified JWT.

    In the second phase, we’ve already established correctness, so we are ready to shift all of the traffic over to the new version of the application. We’ll configure weighted destinations, and shift the traffic while monitoring certain business metrics to ensure the service quality remains at acceptable levels. Once 100% of the traffic is shifted to the new version, the old version can be decommissioned.

    In practice, it may be desirable to only use one of the phases for testing, in which case the other phase can be skipped.

    Phase 1: Initial canary rollout of v2

    In this phase, we’ll deploy v2, and then use a header stage: canary to start routing a small amount of specific traffic to the new version. We’ll use this header to perform some basic smoke testing and make sure v2 is working the way we’d expect:

    Subset Routing

    Setting up subset routing

    Before deploying our v2 service, we’ll update our virtual service to only route to pods that have the subset label version: v1, using a Gloo feature called subset routing.

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v1
    

    We can apply this to the cluster with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/2-initial-subset-routing-to-v2/vs-1.yaml
    

    The application should continue to function as before:

    ➜ curl $(glooctl proxy url)/
    version:v1
    

    Deploying echo v2

    Now we can safely deploy v2 of the echo application:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: echo-v2
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: echo
          version: v2
      template:
        metadata:
          labels:
            app: echo
            version: v2
        spec:
          containers:
            - image: hashicorp/http-echo
              args:
                - "-text=version:v2"
                - -listen=:8080
              imagePullPolicy: Always
              name: echo-v2
              ports:
                - containerPort: 8080
    

    We can deploy with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/2-initial-subset-routing-to-v2/echo-v2.yaml
    

    Since our gateway is configured to route specifically to the v1 subset, this should have no effect. However, it does enable v2 to be routable from the gateway if the v2 subset is configured for a route.

    Make sure v2 is running before moving on:

    ➜ kubectl get pod -n echo
    NAME READY STATUS RESTARTS AGE
    echo-v1-66dbfffb79-2qw86 1/1 Running 0 5m25s
    echo-v2-86584fbbdb-slp44 1/1 Running 0 93s
    

    The application should continue to function as before:

    ➜ curl $(glooctl proxy url)/
    version:v1
    

    Adding a route to v2 for canary testing

    We’ll route to the v2 subset when the stage: canary header is supplied on the request. If the header isn’t provided, we’ll continue to route to the v1 subset as before.

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - headers:
                  - name: stage
                    value: canary
                prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v2
          - matchers:
              - prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v1
    

    We can deploy with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/2-initial-subset-routing-to-v2/vs-2.yaml
    

    Canary testing

    Now that we have this route, we can do some testing. First let’s ensure that the existing route is working as expected:

    ➜ curl $(glooctl proxy url)/
    version:v1
    

    And now we can start to canary test our new application version:

    ➜ curl $(glooctl proxy url)/ -H "stage: canary"
    version:v2
    

    Advanced use cases for subset routing

    We may decide that this approach, using user-provided request headers, is too open. Instead, we may want to restrict canary testing to a known, authorized user.

    A common implementation of this that we’ve seen is for the canary route to require a valid JWT that contains a specific claim to indicate the subject is authorized for canary testing. Enterprise Gloo has out of the box support for verifying JWTs, updating the request headers based on the JWT claims, and recomputing the routing destination based on the updated headers. We’ll save that for a future post covering more advanced use cases in canary testing.

    Phase 2: Shifting all traffic to v2 and decommissioning v1

    At this point, we’ve deployed v2, and created a route for canary testing. If we are satisfied with the results of the testing, we can move on to phase 2 and start shifting the load from v1 to v2. We’ll use weighted destinations in Gloo to manage the load during the migration.

    Setting up the weighted destinations

    We can change the Gloo route to route to both of these destinations, with weights to decide how much of the traffic should go to the v1 versus the v2 subset. To start, we’re going to set it up so 100% of the traffic continues to get routed to the v1 subset, unless the stage: canary header was provided as before.

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          # We'll keep our route from before if we want to continue testing with this header
          - matchers:
              - headers:
                  - name: stage
                    value: canary
                prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v2
          # Now we'll route the rest of the traffic to the upstream, load balanced across the two subsets.
          - matchers:
              - prefix: /
            routeAction:
              multi:
                destinations:
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v1
                    weight: 100
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v2
                    weight: 0
    

    We can apply this virtual service update to the cluster with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/3-progressive-traffic-shift-to-v2/vs-1.yaml
    

    Now the cluster looks like this, for any request that doesn’t have the stage: canary header:

    Initialize Traffic Shift

    With the initial weights, we should see the gateway continue to serve v1 for all traffic.

    ➜ curl $(glooctl proxy url)/
    version:v1
    

    Commence rollout

    To simulate a load test, let’s shift half the traffic to v2:

    Load Test

    This can be expressed on our virtual service by adjusting the weights:

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - headers:
                  - name: stage
                    value: canary
                prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v2
          - matchers:
              - prefix: /
            routeAction:
              multi:
                destinations:
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v1
                    # Update the weight so 50% of the traffic hits v1
                    weight: 50
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v2
                    # And 50% is routed to v2
                    weight: 50
    

    We can apply this to the cluster with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/3-progressive-traffic-shift-to-v2/vs-2.yaml
    

    Now when we send traffic to the gateway, we should see half of the requests return version:v1 and the other half return version:v2.

    ➜ curl $(glooctl proxy url)/
    version:v1
    ➜ curl $(glooctl proxy url)/
    version:v2
    ➜ curl $(glooctl proxy url)/
    version:v1
    

    In practice, during this process it’s likely you’ll be monitoring some performance and business metrics to ensure the traffic shift isn’t resulting in a decline in the overall quality of service. We can even leverage operators like Flagger to help automate this Gloo workflow. Gloo Enterprise integrates with your metrics backend and provides out of the box and dynamic, upstream-based dashboards that can be used to monitor the health of the rollout. We will save these topics for a future post on advanced canary testing use cases with Gloo.
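
    To get a rough feel for the traffic split from the command line, a small loop like this can help (our own convenience, not part of the original walkthrough):

    # Send 20 requests through the gateway and count how many hit each version
    for i in $(seq 1 20); do curl -s $(glooctl proxy url)/; done | sort | uniq -c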

    Finishing the rollout

    We will continue adjusting weights until eventually, all of the traffic is now being routed to v2:

    Final Shift

    Our virtual service will look like this:

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - headers:
                  - name: stage
                    value: canary
                prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v2
          - matchers:
              - prefix: /
            routeAction:
              multi:
                destinations:
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v1
                    # No traffic will be sent to v1 anymore
                    weight: 0
                  - destination:
                      upstream:
                        name: echo
                        namespace: gloo-system
                      subset:
                        values:
                          version: v2
                    # Now all the traffic will be routed to v2
                    weight: 100
    

    We can apply that to the cluster with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/3-progressive-traffic-shift-to-v2/vs-3.yaml
    

    Now when we send traffic to the gateway, we should see all of the requests return version:v2.

    ➜ curl $(glooctl proxy url)/
    version:v2
    ➜ curl $(glooctl proxy url)/
    version:v2
    ➜ curl $(glooctl proxy url)/
    version:v2
    

    Decommissioning v1

    At this point, we have deployed the new version of our application, conducted correctness tests using subset routing, conducted load and performance tests by progressively shifting traffic to the new version, and finished the rollout. The only remaining task is to clean up our v1 resources.

    First, we’ll clean up our routes. We’ll leave the subset specified on the route so we are all set up for future upgrades.

    apiVersion: gateway.solo.io/v1
    kind: VirtualService
    metadata:
      name: echo
      namespace: gloo-system
    spec:
      virtualHost:
        domains:
          - "*"
        routes:
          - matchers:
              - prefix: /
            routeAction:
              single:
                upstream:
                  name: echo
                  namespace: gloo-system
                subset:
                  values:
                    version: v2
    

    We can apply this update with the following command:

    kubectl apply -f https://raw.githubusercontent.com/solo-io/gloo-ref-arch/blog-30-mar-20/platform/prog-delivery/two-phased-with-os-gloo/4-decommissioning-v1/vs.yaml
    

    And we can delete the v1 deployment, which is no longer serving any traffic.

    kubectl delete deploy -n echo echo-v1
    

    Now our cluster looks like this:

    End State

    And requests to the gateway return this:

    ➜ curl $(glooctl proxy url)/
    version:v2
    

    We have now completed our two-phased canary rollout of an application update using Gloo!

    Other Advanced Topics

    Over the course of this post, we collected a few topics that could be a good starting point for advanced exploration:

    • Using the JWT filter to verify JWTs, extract claims onto headers, and route to canary versions depending on a claim value.
    • Looking at Prometheus metrics and Grafana dashboards created by Gloo to monitor the health of the rollout.
    • Automating the rollout by integrating Flagger with Gloo.

    A few other topics that warrant further exploration:

    • Supporting self-service upgrades by giving teams ownership over their upstream and route configuration
    • Utilizing Gloo’s delegation feature and Kubernetes RBAC to decentralize the configuration management safely
    • Fully automating the continuous delivery process by applying GitOps principles and using tools like Flux to push config to the cluster
    • Supporting hybrid or non-Kubernetes application use-cases by setting up Gloo with a different deployment pattern
    • Utilizing traffic shadowing to begin testing the new version with realistic data before shifting production traffic to it

    Get Involved in the Gloo Community

    Gloo has a large and growing community of open source users, in addition to an enterprise customer base. To learn more about Gloo:

    • Check out the repo, where you can see the code and file issues
    • Check out the docs, which have an extensive collection of guides and examples
    • Join the slack channel and start chatting with the Solo engineering team and user community

    If you’d like to get in touch with me (feedback is always appreciated!), you can find me on the Solo slack or email me at rick.ducott@solo.io.

  5. Author: Daniel Lipovetsky (D2IQ)

    Cluster API Logo: Turtles All The Way Down

    The Cluster API is a Kubernetes project to bring declarative, Kubernetes-style APIs to cluster creation, configuration, and management. It provides optional, additive functionality on top of core Kubernetes to manage the lifecycle of a Kubernetes cluster.

    Following the v1alpha2 release in October 2019, many members of the Cluster API community met in San Francisco, California, to plan the next release. The project had just gone through a major transformation, delivering a new architecture that promised to make the project easier for users to adopt, and faster for the community to build. Over the course of those two days, we found our common goals: To implement the features critical to managing production clusters, to make its user experience more intuitive, and to make it a joy to develop.

    The v1alpha3 release of Cluster API brings significant features for anyone running Kubernetes in production and at scale. Among the highlights:

    For anyone who wants to understand the API, or prizes a simple, but powerful, command-line interface, the new release brings:

    Finally, for anyone extending the Cluster API for their custom infrastructure or software needs:

    All this was possible thanks to the hard work of many contributors.

    Declarative Control Plane Management

    Special thanks to Jason DeTiberus, Naadir Jeewa, and Chuck Ha

    The Kubeadm-based Control Plane (KCP) provides a declarative API to deploy and scale the Kubernetes control plane, including etcd. This is the feature many Cluster API users have been waiting for! Until now, to deploy and scale up the control plane, users had to create specially-crafted Machine resources. To scale down the control plane, they had to manually remove members from the etcd cluster. KCP automates deployment, scaling, and upgrades.

    What is the Kubernetes Control Plane? The Kubernetes control plane is, at its core, kube-apiserver and etcd. If either of these are unavailable, no API requests can be handled. This impacts not only core Kubernetes APIs, but APIs implemented with CRDs. Other components, like kube-scheduler and kube-controller-manager, are also important, but do not have the same impact on availability.

    In the beginning, the control plane mattered mainly because it scheduled workloads, and some workloads could continue to run through a control plane outage. Today, workloads also depend on operators, service meshes, and API gateways, which all use the control plane as a platform. Therefore, the control plane’s availability is more important than ever.

    Managing the control plane is one of the most complex parts of cluster operation. Because the typical control plane includes etcd, it is stateful, and operations must be done in the correct sequence. Control plane replicas can and do fail, and maintaining control plane availability means being able to replace failed nodes.

    The control plane can suffer a complete outage (e.g. permanent loss of quorum in etcd), and recovery (along with regular backups) is sometimes the only feasible option.

    For more details, read about Kubernetes Components in the Kubernetes documentation.

    Here’s an example of a 3-replica control plane for the Cluster API Docker Infrastructure Provider, which the project maintains for testing and development. For brevity, other required resources, such as the Cluster and the Infrastructure Template (referenced below by name and namespace), are not shown.

    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: KubeadmControlPlane
    metadata:
      name: example
    spec:
      infrastructureTemplate:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: DockerMachineTemplate
        name: example
        namespace: default
      kubeadmConfigSpec:
        clusterConfiguration: {}
      replicas: 3
      version: 1.16.3

    Deploy this control plane with kubectl:

    kubectl apply -f example-docker-control-plane.yaml
    

    Scale the control plane the same way you scale other Kubernetes resources:

    kubectl scale kubeadmcontrolplane example --replicas=5
    
    kubeadmcontrolplane.controlplane.cluster.x-k8s.io/example scaled
    

    Upgrade the control plane to a newer patch of the Kubernetes release:

    kubectl patch kubeadmcontrolplane example --type=json -p '[{"op": "replace", "path": "/spec/version", "value": "1.16.4"}]'
    

    Number of Control Plane Replicas: By default, KCP is configured to manage etcd, and requires an odd number of replicas. If KCP is configured to not manage etcd, an odd number is recommended, but not required. An odd number of replicas ensures optimal etcd configuration. To learn why your etcd cluster should have an odd number of members, see the etcd FAQ.

    Because it is a core Cluster API component, KCP can be used with any v1alpha3-compatible Infrastructure Provider that provides a fixed control plane endpoint, i.e., a load balancer or virtual IP. This endpoint enables requests to reach multiple control plane replicas.

    What is an Infrastructure Provider? A source of computational resources (e.g. machines, networking, etc.). The community maintains providers for AWS, Azure, Google Cloud, and VMWare. For details, see the list of providers in the Cluster API Book.

    Distributing Control Plane Nodes To Reduce Risk

    Special thanks to Vince Prignano, and Chuck Ha

    Cluster API users can now deploy nodes in different failure domains, reducing the risk of a cluster failing due to a domain outage. This is especially important for the control plane: If nodes in one domain fail, the cluster can continue to operate as long as the control plane is available to nodes in other domains.

    What is a Failure Domain? A failure domain is a way to group the resources that would be made unavailable by some failure. For example, in many public clouds, an “availability zone” is the default failure domain. A zone corresponds to a data center. So, if a specific data center is brought down by a power outage or natural disaster, all resources in that zone become unavailable. If you run Kubernetes on your own hardware, your failure domain might be a rack, a network switch, or power distribution unit.

    The Kubeadm-based ControlPlane distributes nodes across failure domains. To minimize the chance of losing multiple nodes in the event of a domain outage, it tries to distribute them evenly: it deploys a new node in the failure domain with the fewest existing nodes, and it removes an existing node in the failure domain with the most existing nodes.

    MachineDeployments and MachineSets do not distribute nodes across failure domains. To deploy your worker nodes across multiple failure domains, create a MachineDeployment or MachineSet for each failure domain.

    The Failure Domain API works on any infrastructure. That’s because every Infrastructure Provider maps failure domains in its own way. The API is optional, so if your infrastructure is not complex enough to need failure domains, you do not need to support it. This example is for the Cluster API Docker Infrastructure Provider. Note that two of the domains are marked as suitable for control plane nodes, while a third is not. The Kubeadm-based ControlPlane will only deploy nodes to domains marked suitable.

    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: DockerCluster
    metadata:
      name: example
    spec:
      controlPlaneEndpoint:
        host: 172.17.0.4
        port: 6443
      failureDomains:
        domain-one:
          controlPlane: true
        domain-two:
          controlPlane: true
        domain-three:
          controlPlane: false

    The AWS Infrastructure Provider (CAPA), maintained by the Cluster API project, maps failure domains to AWS Availability Zones. Using CAPA, you can deploy a cluster across multiple Availability Zones. First, define subnets for multiple Availability Zones. The CAPA controller will define a failure domain for each Availability Zone. Deploy the control plane with the KubeadmControlPlane: it will distribute replicas across the failure domains. Finally, create a separate MachineDeployment for each failure domain.
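
    To make that last step concrete, here is a minimal sketch of a worker MachineDeployment pinned to a single failure domain. The names, the Availability Zone, and the AWS template references are illustrative; the key detail is the failureDomain field in the Machine template.

    apiVersion: cluster.x-k8s.io/v1alpha3
    kind: MachineDeployment
    metadata:
      name: example-workers-us-east-1a
    spec:
      clusterName: example
      replicas: 3
      selector:
        matchLabels:
          nodepool: workers-us-east-1a
      template:
        metadata:
          labels:
            nodepool: workers-us-east-1a
        spec:
          clusterName: example
          version: v1.16.3
          # Pin this node group to one failure domain; create one MachineDeployment per domain
          failureDomain: us-east-1a
          bootstrap:
            configRef:
              apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
              kind: KubeadmConfigTemplate
              name: example-workers
          infrastructureRef:
            apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
            kind: AWSMachineTemplate
            name: example-workers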

    Automated Replacement of Unhealthy Nodes

    Special thanks to Alberto García Lamela, and Joel Speed

    There are many reasons why a node might be unhealthy. The kubelet process may stop. The container runtime might have a bug. The kernel might have a memory leak. The disk may run out of space. CPU, disk, or memory hardware may fail. A power outage may happen. Failures like these are especially common in larger clusters.

    Kubernetes is designed to tolerate them, and to help your applications tolerate them as well. Nevertheless, only a finite number of nodes can be unhealthy before the cluster runs out of resources, and Pods are evicted or not scheduled in the first place. Unhealthy nodes should be repaired or replaced at the earliest opportunity.

    The Cluster API now includes a MachineHealthCheck resource, and a controller that monitors node health. When it detects an unhealthy node, it removes it. (Another Cluster API controller detects the node has been removed and replaces it.) You can configure the controller to suit your needs. You can configure how long to wait before removing the node. You can also set a threshold for the number of unhealthy nodes. When the threshold is reached, no more nodes are removed. The wait can be used to tolerate short-lived outages, and the threshold to prevent too many nodes from being replaced at the same time.

    The controller will remove only nodes managed by a Cluster API MachineSet. The controller does not remove control plane nodes, whether managed by the Kubeadm-based Control Plane, or by the user, as in v1alpha2. For more, see Limits and Caveats of a MachineHealthCheck.

    Here is an example of a MachineHealthCheck. For more details, see Configure a MachineHealthCheck in the Cluster API book.

    apiVersion: cluster.x-k8s.io/v1alpha3
    kind: MachineHealthCheck
    metadata:
      name: example-node-unhealthy-5m
    spec:
      clusterName: example
      maxUnhealthy: 33%
      nodeStartupTimeout: 10m
      selector:
        matchLabels:
          nodepool: nodepool-0
      unhealthyConditions:
        - type: Ready
          status: Unknown
          timeout: 300s
        - type: Ready
          status: "False"
          timeout: 300s

    Infrastructure-Managed Node Groups

    Special thanks to Juan-Lee Pang and Cecile Robert-Michon

    If you run large clusters, you need to create and destroy hundreds of nodes, sometimes in minutes. Although public clouds make it possible to work with large numbers of nodes, having to make a separate API request to create or delete every node may scale poorly. For example, API requests may have to be delayed to stay within rate limits.

    Some public clouds offer APIs to manage groups of nodes as one single entity. For example, AWS has AutoScaling Groups, Azure has Virtual Machine Scale Sets, and GCP has Managed Instance Groups. With this release of Cluster API, Infrastructure Providers can add support for these APIs, and users can deploy groups of Cluster API Machines by using the MachinePool Resource. For more information, see the proposal in the Cluster API repository.

    Experimental Feature: The MachinePool API is an experimental feature that is not enabled by default. Users are encouraged to try it and report on how well it meets their needs.
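
    As an illustration only, a MachinePool manifest might look like the sketch below. It assumes the experimental exp.cluster.x-k8s.io/v1alpha3 API group described in the proposal, and the Azure-style infrastructure reference is purely hypothetical; check your Infrastructure Provider’s documentation for the schema it actually supports.

    apiVersion: exp.cluster.x-k8s.io/v1alpha3
    kind: MachinePool
    metadata:
      name: example-pool-0
    spec:
      clusterName: example
      # The whole group scales as one entity via the cloud's node-group API
      replicas: 3
      template:
        spec:
          clusterName: example
          version: v1.16.3
          bootstrap:
            configRef:
              apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
              kind: KubeadmConfig
              name: example-pool-0
          infrastructureRef:
            # Hypothetical reference to a provider-specific machine pool resource
            apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
            kind: AzureMachinePool
            name: example-pool-0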

    The Cluster API User Experience, Reimagined

    clusterctl

    Special thanks to Fabrizio Pandini

    If you are new to Cluster API, your first experience will probably be with the project’s command-line tool, clusterctl. And with the new Cluster API release, it has been re-designed to be more pleasing to use than before. The tool is all you need to deploy your first workload cluster in just a few steps.

    First, use clusterctl init to fetch the configuration for your Infrastructure and Bootstrap Providers and deploy all of the components that make up the Cluster API. Second, use clusterctl config cluster to create the workload cluster manifest. This manifest is just a collection of Kubernetes objects. To create the workload cluster, just kubectl apply the manifest. Don’t be surprised if this workflow looks familiar: Deploying a workload cluster with Cluster API is just like deploying an application workload with Kubernetes!
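
    As a minimal sketch of that workflow (the Docker provider, cluster name, and version flag below are illustrative choices, not requirements):

    # Install the Cluster API core components and a chosen Infrastructure Provider
    clusterctl init --infrastructure docker
    # Generate a workload cluster manifest: a plain collection of Kubernetes objects
    clusterctl config cluster example --kubernetes-version v1.16.3 > example.yaml
    # Create the workload cluster
    kubectl apply -f example.yaml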

    Clusterctl also helps with “day 2” operations. Use clusterctl move to migrate Cluster API custom resources, such as Clusters and Machines, from one Management Cluster to another. This step, also known as a pivot, is necessary to create a workload cluster that manages itself with Cluster API. Finally, use clusterctl upgrade to upgrade all of the installed components when a new Cluster API release becomes available.
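
    For example, with a hypothetical kubeconfig path for the new workload cluster:

    # Pivot Cluster API resources into the workload cluster so it can manage itself
    clusterctl move --to-kubeconfig=./example.kubeconfig
    # List the provider upgrades available to the management cluster
    clusterctl upgrade plan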

    One more thing! Clusterctl is not only a command-line tool. It is also a Go library! Think of the library as an integration point for projects that build on top of Cluster API. All of clusterctl’s command-line functionality is available in the library, making it easy to integrate into your stack. To get started with the library, please read its documentation.

    The Cluster API Book

    Thanks to many contributors!

    The project’s documentation is extensive. New users should get some background on the architecture, and then create a cluster of their own with the Quick Start. The clusterctl tool has its own reference. The Developer Guide has plenty of information for anyone interested in contributing to the project.

    Above and beyond the content itself, the project’s documentation site is a pleasure to use. It is searchable, has an outline, and even supports different color themes. If you think the site looks a lot like the documentation for a different community project, Kubebuilder, that is no coincidence! Many thanks to the Kubebuilder authors for creating a great example of documentation. And many thanks to the mdBook authors for creating a great tool for building documentation.

    Integrate & Customize

    End-to-End Test Framework

    Special thanks to Chuck Ha

    The Cluster API project is designed to be extensible. For example, anyone can develop their own Infrastructure and Bootstrap Providers. However, it’s important that Providers work in a uniform way. And, because the project is still evolving, it takes work to make sure that Providers are up-to-date with new releases of the core.

    The End-to-End Test Framework provides a set of standard tests for developers to verify that their Providers integrate correctly with the current release of Cluster API, and helps identify any regressions that happen after a new release of the Cluster API or the Provider.

    For more details on the Framework, see Testing in the Cluster API Book, and the README in the repository.

    Provider Implementer’s Guide

    Thanks to many contributors!

    The community maintains Infrastructure Providers for many popular infrastructures. However, if you want to build your own Infrastructure or Bootstrap Provider, the Provider Implementer’s Guide explains the entire process, from creating a git repository, to creating CustomResourceDefinitions for your Providers, to designing, implementing, and testing the controllers.

    Under Active Development: The Provider Implementer’s Guide is actively under development, and may not yet reflect all of the changes in the v1alpha3 release.

    Join Us!

    The Cluster API project is a very active project, and covers many areas of interest. If you are an infrastructure expert, you can contribute to one of the Infrastructure Providers. If you like building controllers, you will find opportunities to innovate. If you’re curious about testing distributed systems, you can help develop the project’s end-to-end test framework. Whatever your interests and background, you can make a real impact on the project.

    Come introduce yourself to the community at our weekly meeting, where we dedicate a block of time for a Q&A session. You can also find maintainers and users on the Kubernetes Slack, and in the Kubernetes forum. Please check out the links below. We look forward to seeing you!