Kubernetes News

The Kubernetes project blog
Kubernetes.io
  1. Authors: Mark Church (Google), Harry Bagdi (Kong), Daneyon Hansen (Red Hat), Nick Young (VMware), Manuel Zapf (Traefik Labs)

    The Ingress resource is one of the many Kubernetes success stories. It created a diverse ecosystem of Ingress controllers which were used across hundreds of thousands of clusters in a standardized and consistent way. This standardization helped users adopt Kubernetes. However, five years after the creation of Ingress, there are signs of fragmentation into different but strikingly similar CRDs and overloaded annotations. The same portability that made Ingress pervasive also limited its future.

    It was at KubeCon 2019 in San Diego that a passionate group of contributors gathered to discuss the evolution of Ingress. The discussion overflowed into the hotel lobby across the street, and what came out of it would later be known as the Gateway API. This discussion was based on a few key assumptions:

    1. The API standards underlying route matching, traffic management, and service exposure are commoditized and provide little value to their implementers and users as custom APIs
    2. It’s possible to represent L4/L7 routing and traffic management through common core API resources
    3. It’s possible to provide extensibility for more complex capabilities in a way that does not sacrifice the user experience of the core API

    Introducing the Gateway API

    This led to design principles that allow the Gateway API to improve upon Ingress:

    • Expressiveness - In addition to HTTP host/path matching and TLS, Gateway API can express capabilities like HTTP header manipulation, traffic weighting & mirroring, TCP/UDP routing, and other capabilities that were only possible in Ingress through custom annotations.
    • Role-oriented design - The API resource model reflects the separation of responsibilities that is common in routing and Kubernetes service networking.
    • Extensibility - The resources allow arbitrary configuration attachment at various layers within the API. This makes granular customization possible at the most appropriate places.
    • Flexible conformance - The Gateway API defines varying conformance levels - core (mandatory support), extended (portable if supported), and custom (no portability guarantee), known together as flexible conformance. This promotes a highly portable core API (like Ingress) that still gives flexibility for Gateway controller implementers.

    What does the Gateway API look like?

    The Gateway API introduces a few new resource types:

    • GatewayClasses are cluster-scoped resources that act as templates to explicitly define behavior for Gateways derived from them. This is similar in concept to StorageClasses, but for networking data-planes. (A minimal GatewayClass sketch appears just after this list.)
    • Gateways are the deployed instances of GatewayClasses. They are the logical representation of the data-plane which performs routing, which may be in-cluster proxies, hardware LBs, or cloud LBs.
    • Routes are not a single resource, but represent many different protocol-specific Route resources. The HTTPRoute has matching, filtering, and routing rules that get applied to Gateways that can process HTTP and HTTPS traffic. Similarly, there are TCPRoutes, UDPRoutes, and TLSRoutes which also have protocol-specific semantics. This model also allows the Gateway API to incrementally expand its protocol support in the future.
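
    To make the first two resources concrete, here is a minimal sketch of a GatewayClass that an infrastructure provider might install. The field names follow the v1alpha1 API used in the examples below; the acme-lb name matches the Gateway example later in this post, while the controller string is purely illustrative:

    kind: GatewayClass
    apiVersion: networking.x-k8s.io/v1alpha1
    metadata:
      name: acme-lb
    spec:
      # Identifies the controller that manages Gateways of this class (illustrative value)
      controller: acme.io/gateway-controller

    A Gateway then references this class by name through its gatewayClassName field, as the prod-web example further down does.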

    The resources of the Gateway API

    Gateway Controller Implementations

    The good news is that although Gateway is in Alpha, there are already several Gateway controller implementations that you can run. Since it’s a standardized spec, the following example could be run on any of them and should function the exact same way. Check out getting started to see how to install and use one of these Gateway controllers.

    Getting Hands-on with the Gateway API

    In the following example, we’ll demonstrate the relationships between the different API Resources and walk you through a common use case:

    • Team foo has their app deployed in the foo Namespace. They need to control the routing logic for the different pages of their app.
    • Team bar is running in the bar Namespace. They want to be able to do blue-green rollouts of their application to reduce risk.
    • The platform team is responsible for managing the load balancer and network security of all the apps in the Kubernetes cluster.

    The following foo-route does path matching to various Services in the foo Namespace and also has a default route to a 404 server. This exposes the foo-auth and foo-home Services via foo.example.com/login and foo.example.com/home respectively:

    kind: HTTPRoute
    apiVersion: networking.x-k8s.io/v1alpha1
    metadata:
      name: foo-route
      namespace: foo
      labels:
        gateway: external-https-prod
    spec:
      hostnames:
      - "foo.example.com"
      rules:
      - matches:
        - path:
            type: Prefix
            value: /login
        forwardTo:
        - serviceName: foo-auth
          port: 8080
      - matches:
        - path:
            type: Prefix
            value: /home
        forwardTo:
        - serviceName: foo-home
          port: 8080
      - matches:
        - path:
            type: Prefix
            value: /
        forwardTo:
        - serviceName: foo-404
          port: 8080
    

    The bar team, operating in the bar Namespace of the same Kubernetes cluster, also wishes to expose their application to the internet, but they also want to control their own canary and blue-green rollouts. The following HTTPRoute is configured for the following behavior:

    • For traffic to bar.example.com:

      • Send 90% of the traffic to bar-v1
      • Send 10% of the traffic to bar-v2
    • For traffic to bar.example.com with the HTTP header env: canary:

      • Send all the traffic to bar-v2

    The routing rules configured for the bar-v1 and bar-v2 Services

    kind: HTTPRoute
    apiVersion: networking.x-k8s.io/v1alpha1
    metadata:
      name: bar-route
      namespace: bar
      labels:
        gateway: external-https-prod
    spec:
      hostnames:
      - "bar.example.com"
      rules:
      - forwardTo:
        - serviceName: bar-v1
          port: 8080
          weight: 90
        - serviceName: bar-v2
          port: 8080
          weight: 10
      - matches:
        - headers:
            values:
              env: canary
        forwardTo:
        - serviceName: bar-v2
          port: 8080
    

    Route and Gateway Binding

    So we have two HTTPRoutes matching and routing traffic to different Services. You might be wondering, where are these Services accessible? Through which networks or IPs are they exposed?

    How Routes are exposed to clients is governed by Route binding, which describes how Routes and Gateways create a bidirectional relationship between each other. When Routes are bound to a Gateway it means their collective routing rules are configured on the underlying load balancers or proxies and the Routes are accessible through the Gateway. Thus, a Gateway is a logical representation of a networking data plane that can be configured through Routes.

    How Routes bind with Gateways

    Administrative Delegation

    The split between Gateway and Route resources allows the cluster administrator to delegate some of the routing configuration to individual teams while still retaining centralized control. The following Gateway resource exposes HTTPS on port 443 and terminates all traffic on the port with a certificate controlled by the cluster administrator.

    kind: Gateway
    apiVersion: networking.x-k8s.io/v1alpha1
    metadata:
      name: prod-web
    spec:
      gatewayClassName: acme-lb
      listeners:
      - protocol: HTTPS
        port: 443
        routes:
          kind: HTTPRoute
          selector:
            matchLabels:
              gateway: external-https-prod
          namespaces:
            from: All
        tls:
          certificateRef:
            name: admin-controlled-cert
    

    The following HTTPRoute shows how the Route can ensure it matches the Gateway's selector via its kind (HTTPRoute) and resource labels (gateway=external-https-prod).

    # Matches the required kind selector on the Gateway
    kind: HTTPRoute
    apiVersion: networking.x-k8s.io/v1alpha1
    metadata:
      name: foo-route
      namespace: foo-ns
      labels:
        # Matches the required label selector on the Gateway
        gateway: external-https-prod
    ...
    

    Role Oriented Design

    When you put it all together, you have a single load balancing infrastructure that can be safely shared by multiple teams. The Gateway API is not only a more expressive API for advanced routing, but is also a role-oriented API, designed for multi-tenant infrastructure. Its extensibility ensures that it will evolve for future use-cases while preserving portability. Ultimately these characteristics will allow the Gateway API to adapt to different organizational models and implementations well into the future.

    Try it out and get involved

    There are many resources to check out to learn more.

  2. Authors: David Porter (Google), Mrunal Patel (Red Hat), and Tim Bannister (The Scale Factory)

    Graceful node shutdown, beta in 1.21, enables kubelet to gracefully evict pods during a node shutdown.

    Kubernetes is a distributed system and as such we need to be prepared for inevitable failures — nodes will fail, containers might crash or be restarted, and - ideally - your workloads will be able to withstand these catastrophic events.

    One of the common classes of issues is workload failure on node shutdown or restart. The best practice before bringing your node down is to safely cordon and drain it. This ensures that all pods running on the node can be evicted safely. An eviction lets your pods follow the expected pod termination lifecycle: receiving a SIGTERM in your container and/or running preStop hooks.

    Prior to Kubernetes 1.20 (when graceful node shutdown was introduced as an alpha feature), safe node draining was not easy: it required users to manually take action and drain the node beforehand. If someone or something shut down your node without draining it first, most likely your pods would not be safely evicted from your node and would shut down abruptly. Other services talking to those pods might see errors due to the pods exiting abruptly. This situation can be caused, for example, by a reboot to apply security patches or by the preemption of short-lived cloud compute instances.

    Kubernetes 1.21 brings graceful node shutdown to beta. Graceful node shutdown gives you more control over some of those unexpected shutdown situations. With graceful node shutdown, the kubelet is aware of underlying system shutdown events and can propagate these events to pods, ensuring containers can shut down as gracefully as possible. This gives the containers a chance to checkpoint their state or release back any resources they are holding.

    Note, that for the best availability, even with graceful node shutdown, you should still design your deployments to be resilient to node failures.

    How does it work?

    On Linux, your system can shut down in many different situations. For example:

    • A user or script running shutdown -h now or systemctl poweroff or systemctl reboot.
    • Physically pressing a power button on the machine.
    • Stopping a VM instance on a cloud provider, e.g. gcloud compute instances stop on GCP.
    • A Preemptible VM or Spot Instance that your cloud provider can terminate unexpectedly, but with a brief warning.

    Many of these situations can be unexpected and there is no guarantee that a cluster administrator drained the node prior to these events. With the graceful node shutdown feature, kubelet uses a systemd mechanism called "Inhibitor Locks" to allow draining in most cases. Using Inhibitor Locks, kubelet instructs systemd to postpone system shutdown for a specified duration, giving a chance for the node to drain and evict pods on the system.

    Kubelet makes use of this mechanism to ensure your pods will be terminated cleanly. When the kubelet starts, it acquires a systemd delay-type inhibitor lock. When the system is about to shut down, the kubelet can delay that shutdown for a configurable, short duration utilizing the delay-type inhibitor lock it acquired earlier. This gives your pods extra time to terminate. As a result, even during unexpected shutdowns, your application will receive a SIGTERM, preStop hooks will execute, and kubelet will properly update the node's Ready condition and the respective pod statuses in the API server.

    For example, on a node with graceful node shutdown enabled, you can see that the inhibitor lock is taken by the kubelet:

    kubelet-node ~ # systemd-inhibit --list
    Who: kubelet (UID 0/root, PID 1515/kubelet)
    What: shutdown
    Why: Kubelet needs time to handle node shutdown
    Mode: delay
    1 inhibitors listed.
    

    One important consideration we took when designing this feature is that not all pods are created equal. For example, some of the pods running on a node such as a logging related daemonset should stay running as long as possible to capture important logs during the shutdown itself. As a result, pods are split into two categories: "regular" and "critical". Critical pods are those that have priorityClassName set to system-cluster-critical or system-node-critical; all other pods are considered regular.

    In our example, the logging DaemonSet would run as a critical pod. During the graceful node shutdown, regular pods are terminated first, followed by critical pods. This allows a critical pod associated with a logging DaemonSet to continue functioning and collecting logs while the regular pods are terminated.
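
    As a rough sketch of what that looks like in practice (the names and image below are illustrative), marking a logging DaemonSet's pods as critical only requires setting priorityClassName in the Pod template:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: log-collector
    spec:
      selector:
        matchLabels:
          app: log-collector
      template:
        metadata:
          labels:
            app: log-collector
        spec:
          # Pods with this priority class are treated as "critical" during graceful node shutdown
          priorityClassName: system-node-critical
          containers:
          - name: collector
            image: registry.example/log-collector:latest  # placeholder image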

    We will evaluate during the beta phase whether we need more flexibility for different pod priority classes and add support if needed. Please let us know if you have some scenarios in mind.

    How do I use it?

    Graceful node shutdown is controlled with the GracefulNodeShutdown feature gate and is enabled by default in Kubernetes 1.21.

    You can configure the graceful node shutdown behavior using two kubelet configuration options: ShutdownGracePeriod and ShutdownGracePeriodCriticalPods. To configure these options, you edit the kubelet configuration file that is passed to kubelet via the --config flag; for more details, refer to Set kubelet parameters via a configuration file.

    During a shutdown, kubelet terminates pods in two phases. You can configure how long each of these phases lasts.

    1. Terminate regular pods running on the node.
    2. Terminate critical pods running on the node.

    The settings that control the duration of shutdown are:

    • ShutdownGracePeriod
      • Specifies the total duration that the node should delay the shutdown by. This is the total grace period for pod termination for both regular and critical pods.
    • ShutdownGracePeriodCriticalPods
      • Specifies the duration used to terminate critical pods during a node shutdown. This should be less than ShutdownGracePeriod.

    For example, if ShutdownGracePeriod=30s, and ShutdownGracePeriodCriticalPods=10s, kubelet will delay the node shutdown by 30 seconds. During this time, the first 20 seconds (30-10) would be reserved for gracefully terminating normal pods, and the last 10 seconds would be reserved for terminating critical pods.
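
    As a minimal sketch, that example corresponds to a kubelet configuration file (the one passed via --config) along the following lines, using the camelCase field names of the KubeletConfiguration API:

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    # Total time the node shutdown is delayed for pod termination
    shutdownGracePeriod: 30s
    # Portion of that time reserved for terminating critical pods
    shutdownGracePeriodCriticalPods: 10s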

    Note that by default, both configuration options described above, ShutdownGracePeriod and ShutdownGracePeriodCriticalPods, are set to zero, so you will need to configure them as appropriate for your environment to activate the graceful node shutdown functionality.

    How can I learn more?

    How do I get involved?

    Your feedback is always welcome! SIG Node meets regularly and can be reached via Slack (channel #sig-node) or the SIG's mailing list.

  3. Author: Richard Li, Ambassador Labs

    Have you ever been asked to troubleshoot a failing Kubernetes service and struggled to find basic information about the service such as the source repository and owner?

    One of the problems as Kubernetes applications grow is the proliferation of services. As the number of services grows, developers start to specialize in working with specific services. When it comes to troubleshooting, however, developers need to be able to find the source, understand the service and dependencies, and chat with the owning team for any service.

    Human service discovery

    Troubleshooting always begins with information gathering. While much attention has been paid to centralizing machine data (e.g., logs, metrics), much less attention has been given to the human aspect of service discovery. Who owns a particular service? What Slack channel does the team work on? Where is the source for the service? What issues are currently known and being tracked?

    Kubernetes annotations

    Kubernetes annotations are designed to solve exactly this problem. Oft-overlooked, Kubernetes annotations are designed to add metadata to Kubernetes objects. The Kubernetes documentation says annotations can “attach arbitrary non-identifying metadata to objects.” This means that annotations should be used for attaching metadata that is external to Kubernetes (i.e., metadata that Kubernetes won’t use to identify objects). As such, annotations can contain any type of data. This is in contrast to labels, which are designed for uses internal to Kubernetes. As such, label structure and values are constrained so they can be efficiently used by Kubernetes.

    Kubernetes annotations in action

    Here is an example. Imagine you have a Kubernetes service for quoting, called the quote service. You can do the following:

    kubectl annotate service quote a8r.io/owner="@sally"
    

    In this example, we've just added an annotation called a8r.io/owner with the value of @sally. Now, we can use kubectl describe to get the information.

    Name: quote
    Namespace: default
    Labels: <none>
    Annotations: a8r.io/owner: @sally
    Selector: app=quote
    Type: ClusterIP
    IP: 10.109.142.131
    Port: http 80/TCP
    TargetPort: 8080/TCP
    Endpoints: <none>
    Session Affinity: None
    Events: <none>
    

    If you’re practicing GitOps (and you should be!) you’ll want to code these values directly into your Kubernetes manifest, e.g.,

    apiVersion: v1
    kind: Service
    metadata:
      name: quote
      annotations:
        a8r.io/owner: "@sally"
    spec:
      ports:
      - name: http
        port: 80
        targetPort: 8080
      selector:
        app: quote
    

    A Convention for Annotations

    Adopting a common convention for annotations ensures consistency and understandability. Typically, you’ll want to attach the annotation to the service object, as services are the high-level resource that maps most clearly to a team’s responsibility. Namespacing your annotations is also very important. Here is one set of conventions, documented at a8r.io, and reproduced below:

    Annotation convention for human-readable services:

    • a8r.io/description: Unstructured text description of the service for humans.
    • a8r.io/owner: SSO username (GitHub), email address (linked to GitHub account), or unstructured owner description.
    • a8r.io/chat: Slack channel, or link to external chat system.
    • a8r.io/bugs: Link to external bug tracker.
    • a8r.io/logs: Link to external log viewer.
    • a8r.io/documentation: Link to external project documentation.
    • a8r.io/repository: Link to external VCS repository.
    • a8r.io/support: Link to external support center.
    • a8r.io/runbook: Link to external project runbook.
    • a8r.io/incidents: Link to external incident dashboard.
    • a8r.io/uptime: Link to external uptime dashboard.
    • a8r.io/performance: Link to external performance dashboard.
    • a8r.io/dependencies: Unstructured text describing the service dependencies for humans.
    Visualizing annotations: Service Catalogs

    As the number of microservices and annotations proliferate, running kubectl describe can get tedious. Moreover, using kubectl describe requires every developer to have some direct access to the Kubernetes cluster. Over the past few years, service catalogs have gained greater visibility in the Kubernetes ecosystem. Popularized by tools such as Shopify's ServicesDB and Spotify's System Z, service catalogs are internally-facing developer portals that present critical information about microservices.

    Note that these service catalogs should not be confused with the Kubernetes Service Catalog project. Built on the Open Service Broker API, the Kubernetes Service Catalog enables Kubernetes operators to plug in different services (e.g., databases) to their cluster.

    Annotate your services now and thank yourself later

    Much like implementing observability within microservice systems, you often don’t realize that you need human service discovery until it’s too late. Don't wait until something is on fire in production to start wishing you had implemented better metrics and also documented how to get in touch with the part of your organization that looks after it.

    There are enormous benefits to building an effective “version 0” service: a dancing skeleton application with a thin slice of complete functionality that can be deployed to production with a minimal yet effective continuous delivery pipeline.

    Adding service annotations should be an essential part of your “version 0” for all of your services. Add them now, and you’ll thank yourself later.

  4. Authors: Matt Fenwick (Synopsys), Jay Vyas (VMWare), Ricardo Katz, Amim Knabben (Loadsmart), Douglas Schilling Landgraf (Red Hat), Christopher Tomkins (Tigera)

    Special thanks to Tim Hockin and Bowie Du (Google), Dan Winship and Antonio Ojea (Red Hat), Casey Davenport and Shaun Crampton (Tigera), and Abhishek Raut and Antonin Bas (VMware) for being supportive of this work, and working with us to resolve issues in different Container Network Interfaces (CNIs) over time.

    A brief conversation around "node local" Network Policies in April of 2020 inspired the creation of a NetworkPolicy subproject from SIG Network. It became clear that as a community, we need a rock-solid story around how to do pod network security on Kubernetes, and this story needed a community around it, so as to grow the cultural adoption of enterprise security patterns in K8s.

    In this post we'll discuss:

    • Why we created a subproject for Network Policies
    • How we changed the Kubernetes e2e framework to visualize NetworkPolicy implementation of your CNI provider
    • The initial results of our comprehensive NetworkPolicy conformance validator, Cyclonus, built around these principles
    • Improvements subproject contributors have made to the NetworkPolicy user experience

    Why we created a subproject for NetworkPolicies

    In April of 2020 it was becoming clear that many CNIs were emerging, and many vendors implement these CNIs in subtly different ways. Users were beginning to express a little bit of confusion around how to implement policies for different scenarios, and asking for new features. It was clear that we needed to begin unifying the way we think about Network Policies in Kubernetes, to avoid API fragmentation and unnecessary complexity.

    For example:

    • In order to be flexible to the user’s environment, Calico as a CNI provider can be run using IPIP or VXLAN mode, or without encapsulation overhead. CNIs such as Antrea and Cilium offer similar configuration options as well.
    • Some CNI plugins offer iptables for NetworkPolicies amongst other options, whereas other CNIs use a completely different technology stack (for example, the Antrea project uses Open vSwitch rules).
    • Some CNI plugins only implement a subset of the Kubernetes NetworkPolicy API, and some a superset. For example, certain plugins don't support the ability to target a named port; others don't work with certain IP address types, and there are diverging semantics for similar policy types.
    • Some CNI plugins combine with OTHER CNI plugins in order to implement NetworkPolicies (canal), some CNIs might mix implementations (multus), and some clouds do routing separately from NetworkPolicy implementation.

    Although this complexity is to some extent necessary to support different environments, end-users find that they need to follow a multistep process to implement Network Policies to secure their applications:

    • Confirm that their network plugin supports NetworkPolicies (some don't, such as Flannel)
    • Confirm that their cluster's network plugin supports the specific NetworkPolicy features that they are interested in (again, the named port or port range examples come to mind here)
    • Confirm that their application's Network Policy definitions are doing the right thing
    • Find out the nuances of a vendor's implementation of policy, and check whether or not that implementation has a CNI neutral implementation (which is sometimes adequate for users)

    The NetworkPolicy project in upstream Kubernetes aims at providing a community where people can learn about, and contribute to, the Kubernetes NetworkPolicy API and the surrounding ecosystem.

    The First step: A validation framework for NetworkPolicies that was intuitive to use and understand

    The Kubernetes end to end suite has always had NetworkPolicy tests, but these weren't run in CI, and the way they were implemented didn't provide holistic, easily consumable information about how a policy was working in a cluster. This is because the original tests didn't provide any kind of visual summary of connectivity across a cluster. We thus initially set out to make it easy to confirm CNI support for NetworkPolicies by making the end to end tests (which are often used by administrators or users to diagnose cluster conformance) easy to interpret.

    To solve the problem of confirming that CNIs support the basic features most users care about for a policy, we built a new NetworkPolicy validation tool into the Kubernetes e2e framework which allows for visual inspection of policies and their effect on a standard set of pods in a cluster. For example, take the following test output, from a bug we found in OVN Kubernetes (it has since been resolved). With this tool the bug was easy to characterize: certain policies caused a state modification that, later on, caused traffic to be incorrectly blocked (even after all Network Policies were deleted from the cluster).

    This is the network policy for the test in question:

    metadata:
      creationTimestamp: null
      name: allow-ingress-port-80
    spec:
      ingress:
      - ports:
        - port: serve-80-tcp
      podSelector: {}
    

    These are the expected connectivity results. The test setup is 9 pods (3 namespaces: x, y, and z; and 3 pods in each namespace: a, b, and c); each pod runs a server on the same port and protocol that can be reached through HTTP calls in the absence of network policies. Connectivity is verified by using the agnhost network utility to issue HTTP calls on a port and protocol that other pods are expected to be serving. A test scenario first runs a connectivity check to ensure that each pod can reach each other pod, for 81 (= 9 x 9) data points. This is the "control". Then perturbations are applied, depending on the test scenario: policies are created, updated, and deleted; labels are added and removed from pods and namespaces, and so on. After each change, the connectivity matrix is recollected and compared to the expected connectivity.

    These results give a visual indication of connectivity in a simple matrix. Going down the leftmost column is the "source" pod, or the pod issuing the request; going across the topmost row is the "destination" pod, or the pod receiving the request. A . means that the connection was allowed; an X means the connection was blocked. For example:

    Nov 4 16:58:43.449: INFO: expected:
    - x/a x/b x/c y/a y/b y/c z/a z/b z/c
    x/a . . . . . . . . .
    x/b . . . . . . . . .
    x/c . . . . . . . . .
    y/a . . . . . . . . .
    y/b . . . . . . . . .
    y/c . . . . . . . . .
    z/a . . . . . . . . .
    z/b . . . . . . . . .
    z/c . . . . . . . . .
    

    Below are the observed connectivity results in the case of the OVN Kubernetes bug. Notice how the top three rows indicate that all requests from namespace x regardless of pod and destination were blocked. Since these experimental results do not match the expected results, a failure will be reported. Note how the specific pattern of failure provides clear insight into the nature of the problem -- since all requests from a specific namespace fail, we have a clear clue to start our investigation.

    Nov 4 16:58:43.449: INFO: observed:
    - x/a x/b x/c y/a y/b y/c z/a z/b z/c
    x/a X X X X X X X X X
    x/b X X X X X X X X X
    x/c X X X X X X X X X
    y/a . . . . . . . . .
    y/b . . . . . . . . .
    y/c . . . . . . . . .
    z/a . . . . . . . . .
    z/b . . . . . . . . .
    z/c . . . . . . . . .
    

    This was one of our earliest wins in the Network Policy group, as we were able to identify and work with the OVN Kubernetes group to fix a bug in egress policy processing.

    However, even though this tool has made it easy to validate roughly 30 common scenarios, it doesn't validate all Network Policy scenarios - because there are an enormous number of possible permutations that one might create (technically, we might say this number is infinite given that there's an infinite number of possible namespace/pod/port/protocol variations one can create).

    Once these tests were in play, we worked with the Upstream SIG Network and SIG Testing communities (thanks to Antonio Ojea and Ben Elder) to put a testgrid Network Policy job in place. This job continuously runs the entire suite of Network Policy tests against GCE with Calico as a Network Policy provider.

    Part of our role as a subproject is to help make sure that, when these tests break, we can help triage them effectively.

    Cyclonus: The next step towards Network Policy conformance

    Around the time that we were finishing the validation work, it became clear from the community that, in general, we needed to solve the overall problem of testing ALL possible Network Policy implementations. For example, a KEP was recently written by Dan Winship which introduced the concept of micro versioning to Network Policies to accommodate describing this at the API level.

    In response to this increasingly obvious need to comprehensively evaluate Network Policy implementations from all vendors, Matt Fenwick decided to evolve our approach to Network Policy validation again by creating Cyclonus.

    Cyclonus is a comprehensive Network Policy fuzzing tool which verifies a CNI provider against hundreds of different Network Policy scenarios, by defining similar truth table/policy combinations as demonstrated in the end to end tests, while also providing a hierarchical representation of policy "categories". We've found some interesting nuances and issues in almost every CNI we've tested so far, and have even contributed some fixes back.

    To perform a Cyclonus validation run, you create a Job manifest similar to:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: cyclonus
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - command:
            - ./cyclonus
            - generate
            - --perturbation-wait-seconds=15
            - --server-protocol=tcp,udp
            name: cyclonus
            imagePullPolicy: IfNotPresent
            image: mfenwick100/cyclonus:latest
          serviceAccount: cyclonus
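
    The Job above references a cyclonus ServiceAccount, so you also need to create that account and grant it enough access to create and delete the namespaces, pods, and policies it tests with. The sketch below is one way to do that, using a broad cluster-admin binding for simplicity and assuming the Job runs in the default namespace; check the Cyclonus documentation for the exact RBAC it expects:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: cyclonus
      namespace: default   # assumes the Job runs in the default namespace
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: cyclonus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: cluster-admin   # broad grant for illustration; narrow this in real clusters
    subjects:
    - kind: ServiceAccount
      name: cyclonus
      namespace: default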
    

    Cyclonus outputs a report of all the test cases it will run:

    test cases to run by tag:
    - target: 6
    - peer-ipblock: 4
    - udp: 16
    - delete-pod: 1
    - conflict: 16
    - multi-port/protocol: 14
    - ingress: 51
    - all-pods: 14
    - egress: 51
    - all-namespaces: 10
    - sctp: 10
    - port: 56
    - miscellaneous: 22
    - direction: 100
    - multi-peer: 0
    - any-port-protocol: 2
    - set-namespace-labels: 1
    - upstream-e2e: 0
    - allow-all: 6
    - namespaces-by-label: 6
    - deny-all: 10
    - pathological: 6
    - action: 6
    - rule: 30
    - policy-namespace: 4
    - example: 0
    - tcp: 16
    - target-namespace: 3
    - named-port: 24
    - update-policy: 1
    - any-peer: 2
    - target-pod-selector: 3
    - IP-block-with-except: 2
    - pods-by-label: 6
    - numbered-port: 28
    - protocol: 42
    - peer-pods: 20
    - create-policy: 2
    - policy-stack: 0
    - any-port: 14
    - delete-namespace: 1
    - delete-policy: 1
    - create-pod: 1
    - IP-block-no-except: 2
    - create-namespace: 1
    - set-pod-labels: 1
    testing 112 cases
    

    Note that Cyclonus tags its tests based on the type of policy being created, because the policies themselves are auto-generated, and thus have no meaningful names to be recognized by.

    For each test, Cyclonus outputs a truth table, which is again similar to that of the E2E tests, along with the policy being validated:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      creationTimestamp: null
      name: base
      namespace: x
    spec:
      egress:
      - ports:
        - port: 81
        to:
        - namespaceSelector:
            matchExpressions:
            - key: ns
              operator: In
              values:
              - "y"
              - z
          podSelector:
            matchExpressions:
            - key: pod
              operator: In
              values:
              - a
              - b
      - ports:
        - port: 53
          protocol: UDP
      ingress:
      - from:
        - namespaceSelector:
            matchExpressions:
            - key: ns
              operator: In
              values:
              - x
              - "y"
          podSelector:
            matchExpressions:
            - key: pod
              operator: In
              values:
              - b
              - c
        ports:
        - port: 80
          protocol: TCP
      podSelector:
        matchLabels:
          pod: a
      policyTypes:
      - Ingress
      - Egress
    0 wrong, 0 ignored, 81 correct
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | TCP/80 | X/A | X/B | X/C | Y/A | Y/B | Y/C | Z/A | Z/B | Z/C |
    | TCP/81 | | | | | | | | | |
    | UDP/80 | | | | | | | | | |
    | UDP/81 | | | | | | | | | |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | x/a | X | X | X | X | X | X | X | X | X |
    | | X | X | X | . | . | X | . | . | X |
    | | X | X | X | X | X | X | X | X | X |
    | | X | X | X | X | X | X | X | X | X |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | x/b | . | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | x/c | . | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | y/a | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | y/b | . | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | y/c | . | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | z/a | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | z/b | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    | z/c | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    | | X | . | . | . | . | . | . | . | . |
    +--------+-----+-----+-----+-----+-----+-----+-----+-----+-----+
    

    Both Cyclonus and the e2e tests use the same strategy to validate a Network Policy - probing pods over TCP or UDP, with SCTP support available as well for CNIs that support it (such as Calico).

    As examples of how we use Cyclonus to help make CNI implementations better from a Network Policy perspective, you can see the following issues:

    The good news is that Antrea and Calico have already merged fixes for all the issues found and other CNI providers are working on it, with the support of SIG Network and the Network Policy subproject.

    Are you interested in verifying NetworkPolicy functionality on your cluster? (If you care about security or offer multi-tenant SaaS, you should be.) If so, you can run the upstream end to end tests, or Cyclonus, or both.

    • If you're just getting started with NetworkPolicies and want to simply verify the "common" NetworkPolicy cases that most CNIs should be implementing correctly, in a way that is quick to diagnose, then you're better off running the e2e tests only.
    • If you are deeply curious about your CNI provider's NetworkPolicy implementation, and want to verify it: use Cyclonus.
    • If you want to test hundreds of policies, and evaluate your CNI plugin for comprehensive functionality, for deep discovery of potential security holes: use Cyclonus, and also consider running end-to-end cluster tests.
    • If you're thinking of getting involved with the upstream NetworkPolicy efforts: use Cyclonus, and read at least an outline of which e2e tests are relevant.

    Where to start with NetworkPolicy testing?

    • Cyclonus is easy to run on your cluster, check out the instructions on github, and determine whether your specific CNI configuration is fully conformant to the hundreds of different Kubernetes Network Policy API constructs.
    • Alternatively, you can use a tool like sonobuoy to run the existing E2E tests in Kubernetes, with the --ginkgo.focus=NetworkPolicy flag. Make sure that you use the K8s conformance image for K8s 1.21 or above (for example, by using the --kube-conformance-image-version v1.21.0 flag), as older images will not have the new Network Policy tests in them.

    Improvements to the NetworkPolicy API and user experience

    In addition to cleaning up the validation story for CNI plugins that implement NetworkPolicies, subproject contributors have also spent some time improving the Kubernetes NetworkPolicy API for a few commonly requested features. After months of deliberation, we eventually settled on a few core areas for improvement:

    • Port Range policies: We now allow you to specify a range of ports for a policy. This allows users interested in scenarios like FTP or virtualization to enable advanced policies. The port range option for network policies will be available to use in Kubernetes 1.21. Read more in targeting a range of ports. (A sketch of such a policy appears after the namespace example below.)

    • Namespace as name policies: Allowing users in Kubernetes >= 1.21 to target namespaces using names, when building Network Policy objects. This was done in collaboration with Jordan Liggitt and Tim Hockin on the API Machinery side. This change allowed us to improve the Network Policy user experience without actually changing the API! For more details, you can read Automatic labelling in the page about Namespaces. The TL;DR is that for Kubernetes 1.21 and later, all namespaces have the following label added by default:

      kubernetes.io/metadata.name: <name-of-namespace>
      

    This means you can write a namespace policy against this namespace, even if you can't edit its labels. For example, this policy will 'just work', without needing to run a command such as kubectl edit namespace. In fact, it will even work if you can't edit or view this namespace's data at all, because of the magic of API server defaulting.

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: test-network-policy
      namespace: default
    spec:
      podSelector:
        matchLabels:
          role: db
      policyTypes:
      - Ingress
      # Allow inbound traffic to Pods labelled role=db, in the namespace 'default'
      # provided that the source is a Pod in the namespace 'my-namespace'
      ingress:
      - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: my-namespace
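
    Coming back to port ranges, a sketch of a policy using the new endPort field (alpha in Kubernetes 1.21; the policy name and pod selector below are illustrative) looks like this:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: allow-port-range
      namespace: default
    spec:
      podSelector:
        matchLabels:
          role: ftp
      policyTypes:
      - Ingress
      ingress:
      - ports:
        # Allows inbound TCP traffic on any port from 30000 through 32767
        - protocol: TCP
          port: 30000
          endPort: 32767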
    

    Results

    In our tests, we found that:

    • Antrea and Calico are at a point where they support all of cyclonus's scenarios, modulo a few very minor tweaks which we've made.
    • Cilium also conformed to the majority of the policies, outside known features that aren't fully supported (for example, related to the way Cilium deals with pod CIDR policies).

    If you are a CNI provider and interested in helping us to do a better job curating large tests of network policies, please reach out! We are continuing to curate the Network Policy conformance results from Cyclonus here, but we are not capable of maintaining all of the subtleties in NetworkPolicy testing data on our own. For now, we use GitHub Actions and kind to test in CI.

    The Future

    We're also working on some improvements for the future of Network Policies, including:

    • Fully qualified Domain policies: The Google Cloud team created a prototype (which we are really excited about) of FQDN policies. This tool uses the Network Policy API to enforce policies against L7 URLs, by finding their IPs and blocking them proactively when requests are made.
    • Cluster Administrative policies: We're working hard at enabling administrative or cluster scoped Network Policies for the future. These are being presented iteratively to the NetworkPolicy subproject. You can read about them here in Cluster Scoped Network Policy.

    The Network Policy subproject meets on Mondays at 4 PM EST. For details, check out the SIG Network community repo. We'd love to hang out with you, hack on stuff, and help you adopt K8s Network Policies for your cluster wherever possible.

    A quick note on User Feedback

    We've gotten a lot of ideas and feedback from users on Network Policies. A lot of people have interesting ideas about Network Policies, but we've found that as a subproject, very few people were deeply interested in implementing these ideas to the full extent.

    Almost every change to the NetworkPolicy API includes weeks or months of discussion to cover different cases, and ensure no CVEs are being introduced. Thus, long term ownership is the biggest impediment in improving the NetworkPolicy user experience for us, over time.

    • We've documented a lot of the history of the Network Policy dialogue here.
    • We've also taken a poll of users, for what they'd like to see in the Network Policy API here.

    We encourage anyone to provide us with feedback, but our most pressing issues right now involve finding long term owners to help us drive changes.

    This doesn't require a lot of technical knowledge, but rather, just a long term commitment to helping us stay organized, do paperwork, and iterate through the many stages of the K8s feature process. If you want to help us and get involved, please reach out on the SIG Network mailing list, or in the SIG Network room in the k8s.io slack channel!

    Anyone can put an oar in the water and help make NetworkPolicies better!

  5. Author: Aldo Culquicondor (Google)

    Once you have containerized a non-parallel Job, it is quite easy to get it up and running on Kubernetes without modifications to the binary. In most cases, when running parallel distributed Jobs, you had to set up a separate system to partition the work among the workers. For example, you could set up a task queue to assign one work item to each Pod, or multiple items to each Pod until the queue is emptied.

    The Kubernetes 1.21 release introduces a new field to control Job completion mode, a configuration option that allows you to control how Pod completions affect the overall progress of a Job, with two possible options (for now):

    • NonIndexed (default): the Job is considered complete when there has been a number of successfully completed Pods equal to the specified number in .spec.completions. In other words, each Pod completion is homologous to each other. Any Job you might have created before the introduction of completion modes is implicitly NonIndexed.
    • Indexed: the Job is considered complete when there is one successfully completed Pod associated with each index from 0 to .spec.completions-1. The index is exposed to each Pod in the batch.kubernetes.io/job-completion-index annotation and the JOB_COMPLETION_INDEX environment variable.

    You can start using Jobs with Indexed completion mode, or Indexed Jobs, for short, to easily start parallel Jobs. Then, each worker Pod can have a statically assigned partition of the data based on the index. This saves you from having to set up a queuing system or even having to modify your binary!

    Creating an Indexed Job

    To create an Indexed Job, you just have to add completionMode: Indexed to the Job spec and make use of the JOB_COMPLETION_INDEX environment variable.

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: 'sample-job'
    spec:
      completions: 3
      parallelism: 3
      completionMode: Indexed
      template:
        spec:
          restartPolicy: Never
          containers:
          - command:
            - 'bash'
            - '-c'
            - 'echo "My partition: ${JOB_COMPLETION_INDEX}"'
            image: 'docker.io/library/bash'
            name: 'sample-load'
    

    Note that completion mode is an alpha feature in the 1.21 release. To be able to use it in your cluster, make sure to enable the IndexedJob feature gate on the API server and the controller manager.

    When you run the example, you will see that each of the three created Pods gets a different completion index. For the user's convenience, the control plane sets the JOB_COMPLETION_INDEX environment variable, but you can choose to set your own or expose the index as a file.
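
    If you would rather read the index from the annotation yourself, one option (a sketch; the MY_JOB_INDEX variable name and the modified Job name are arbitrary) is to project the annotation into an environment variable with the downward API:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: 'sample-job-annotation'
    spec:
      completions: 3
      parallelism: 3
      completionMode: Indexed
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: 'sample-load'
            image: 'docker.io/library/bash'
            command: ['bash', '-c', 'echo "My partition: ${MY_JOB_INDEX}"']
            env:
            # Reads the completion index from the annotation set by the Job controller
            - name: MY_JOB_INDEX
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']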

    See Indexed Job for parallel processing with static work assignment for a step-by-step guide, and a few more examples.

    Future plans

    SIG Apps envisions that there might be more completion modes that enable more use cases for the Job API. We welcome you to open issues in kubernetes/kubernetes with your suggestions.

    In particular, we are considering an IndexedAndUnique mode where the indexes are not just available as an annotation, but are also part of the Pod names, similar to StatefulSet. This should facilitate inter-Pod communication for tightly coupled Pods. You can join the discussion in the open issue.

    Wrap-up

    Indexed Jobs allow you to statically partition work among the workers of your parallel Jobs. SIG Apps hopes that this feature facilitates the migration of more batch workloads to Kubernetes.