Introduction to kubernetes pt. 2: Ingress networking

In the previous post I gave a rough overview of the control plane of a kubernetes cluster, as well as some explanation of how to think about it.

Why a post about this? Can't I just create a LoadBalancer Service and everything works? Or just create an Ingress with the hostname I want? Well, if you buy your cloud, this almost certainly will work, but you still need to have a rough idea of what kind of magic is happening there. And more importantly, if you run your own cloud, this is not something that works out of the box in most instances. [1]

In this post I want to talk about the question: How do I get traffic into my cluster? How do I expose services to the outside world?

It might be helpful to point out that in this post capitalization is significant: to differentiate between our application as a service and the homonymous Kubernetes resource, the former starts with a lower-case letter, while the resource is capitalized as in the resource files, i.e. Service.

Understanding networking inside a cluster

As in the previous post, I will only handwave the in-cluster networking and, for simplicity, make the following assumptions:

  • the pod network is a flat, cluster-internal network in which every pod can reach every other pod
  • the Service network is a separate virtual IP range that is only routable inside the cluster
  • I try not to assume calico

As pods are ephemeral and each pod has its own IP address, we need a way to find them by some means other than their IP. This is done by Services, a resource designed to do exactly that. They provide a DNS name inside the cluster, which usually is not exported to the outside world.

Many people struggle with correctly configuring Services at first, as the amount of metadata is just something one has to get used to. Consider a slight modification of the example Service from the documentation, where I added a second selector label.

apiVersion: v1
kind: Service
metadata:
  name: my-service-cache
spec:
  selector:
    app: MyApp
    type: Cache
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376

The thing we need to know, and that is easy to miss in the documentation, is that this Service will now only select pods that have both the app: MyApp and the type: Cache labels. So we have to think of an AND operator between the lines of the selector. You can always check which pods should be selected via kubectl, in this case kubectl get pods -l app=MyApp,type=Cache. Also note that Services only select pods in their own namespace and which are ready, although the latter can be overridden. [2]
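To make the AND semantics concrete, here is a minimal sketch of a pod this Service would select; the pod name and image are made up, the labels match the example above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-cache-pod
  labels:
    app: MyApp    # matches the first selector label
    type: Cache   # matches the second; dropping either label deselects the pod
spec:
  containers:
  - name: cache
    image: example/cache:latest   # hypothetical image listening on port 9376
    ports:
    - containerPort: 9376
```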

So to summarize: Services with label selectors select pods that

  • live in the same namespace
  • match all of the labels
  • are ready

To see if your Service has now selected the pods it should have, you can check with kubectl describe, in this case kubectl describe service my-service-cache. The pod IPs then show up as endpoints and should match the IPs of the pods you've seen before. To see pod IPs in the output you can for example append -o wide to the kubectl get pods command.

Bridging the gap between the cluster and the outside world

So far all our pods and Services are only accessible from inside the cluster, i.e. pod-to-pod. Even with calico, where you usually can access pods from the nodes themselves, you won't be able to resolve my-service-cache.default.svc without further intervention.

The simplest but worst solution to this problem is NodePort Services. Simply by changing the Service resource to type NodePort, the service will be exposed on all nodes on a specific port, usually somewhere beyond port 30000. There are certainly situations in which this is appropriate, but in most cases this is not what we want. Of course we could then loadbalance everything with a separate machine in front of it, but it also means we always get additional network hops. It also means that, unless we actually expose our ingress controller this way, we would have to build many routes by hand.
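As a sketch, turning the Service from above into a NodePort Service only requires setting the type; the concrete nodePort value is optional, and if it is omitted Kubernetes picks one from the default range:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service-cache
spec:
  type: NodePort
  selector:
    app: MyApp
    type: Cache
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
      nodePort: 30080   # optional; must lie in the node port range (default 30000-32767)
```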

So the solution I like to propose is the following: we run a pod with its ports open on the host network (hostNetwork) on some node(s) at the edge of our nodes' network, pinned there via nodeSelector terms. This way we solve two problems:

  • it is clear where traffic gets into the network
  • the pod running inside the network already knows how to resolve in-cluster hostnames like my-service-cache.default.svc
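The pattern described above could be sketched as a DaemonSet like the following; the node label marking edge nodes and the container image are hypothetical stand-ins:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-ingress
spec:
  selector:
    matchLabels:
      app: edge-ingress
  template:
    metadata:
      labels:
        app: edge-ingress
    spec:
      hostNetwork: true   # bind directly to the node's network interfaces
      nodeSelector:
        node-role.kubernetes.io/edge: ""   # hypothetical label marking the edge node(s)
      containers:
      - name: ingress-controller
        image: example/ingress-controller:latest   # stand-in for nginx/traefik/haproxy
        ports:
        - containerPort: 80
        - containerPort: 443
```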

Before we discuss some of the options, we have to take a brief look at Ingress resources and controllers.

Ingress resources and controllers

Usually one has to expose more than a single service to the outside world. I will focus on HTTP, as it is the most important protocol to expose. Everything else can also be managed with ingress controllers, but you might choose different ingress controllers depending on the kind of traffic you want to expose.

In traditional deployments there are several ways to achieve this, most of which are interpolations of the following two approaches: The first is to loadbalance traffic across all webservers and then use name-based virtual hosting on the servers themselves to discern different services by the requested host. The second is to put a reverse proxy in front of the webserver fleet that then handles name-based virtual hosting and routes traffic to the respective servers in the backend.

The first approach is very simple to manage and scale, as in a web server fleet of, say, twenty machines, all of them are interchangeable. Clearly it requires a large amount of homogeneity in your service, so what one often ends up with is a main fleet of homogeneous webservers serving the main application and then some additional servers that require more work to manage.

The second approach is of course more complicated than the first one. Servers are not completely interchangeable anymore, more care has to be taken to keep services running, and one has to reconfigure the reverse proxy if something in the backend changes. This introduces toil, but, well, homogeneity is a strong assumption and only holds in the most basic cases.

So the general pattern is to route traffic via name-based virtual hosting to the corresponding backend fleet. And herein lies the beauty of ingress controllers: they do the name-based virtual hosting for you, but you do not have to configure them manually. You simply feed them Ingress resources, which they watch for and then reconfigure themselves accordingly. And they can even handle TLS termination for you, including integrations with automatic certificate providers such as Let's Encrypt.

The skeleton of an Ingress resource usually looks like the following:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-cache-ingress
spec:
  rules:
  - host: cache.example.com   # example hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service-cache
            port:
              number: 80

This would send all traffic directed at that host back to our Service my-service-cache. Note that the name of the Ingress does not have to match anything in the Service. Ingresses are namespaced resources and they only select Services from their own namespace, by name.
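TLS termination is configured through an additional tls section; the certificate and key live in a Secret in the same namespace. A sketch, with hostname and secret name made up:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-cache-ingress
spec:
  tls:
  - hosts:
    - cache.example.com            # hypothetical hostname
    secretName: cache-example-tls  # a Secret of type kubernetes.io/tls in the same namespace
  rules:
  - host: cache.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-service-cache
            port:
              number: 80
```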

The official documentation already outlines plentiful configuration possibilities, but as most Ingress controllers are based on tools like nginx, traefik or haproxy, they usually expose even more configuration options via annotations.
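As an example, the community nginx ingress controller understands annotations like the following fragment; this is a sketch, and the controller's own documentation lists the full set:

```yaml
metadata:
  name: my-service-cache-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"   # raise the request body size limit
    nginx.ingress.kubernetes.io/ssl-redirect: "true"     # redirect plain HTTP to HTTPS
```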

Making it work

We are now in a position to talk about how to set things up. First of all I want to mention that you probably want to use pre-existing Helm charts rather than roll your own Kubernetes resources for deploying the Ingress controller.

The edge node(s)

First we want to have at least one node at the edge, which is both part of the cluster and also has external IPs. For redundancy reasons it makes sense to have several, and how you realize this mostly depends on your ambient network. Round-robin DNS might be a possibility, but I would suggest using a pair of edge nodes which share a virtual service IP via keepalived. All public DNS names for our services would then get mapped to this IP. Downstream we then do not really care about the number of edge nodes, so everything below applies to both multiple edge nodes and a single one.

Ingress with HostNetwork on edge node

This would be the simplest setup. The edge node simply runs the Ingress controller (e.g. nginx, traefik, haproxy, whatever you like) and opens all ports you need to the outside world.

The downside to this is that you might run into CPU issues with TLS termination. A rule of thumb I've read somewhere on the HFT guy's blog was around one core per 0.1 - 1 GBit/s, so for starters you can just look at the hoses that run into your datacenter.

Loadbalancer on edge in front of Ingresses

So when you need to scale the Ingresses, you actually need to loadbalance them with a proper loadbalancer like HAProxy. There you then have some decisions to make, and all of them are equally bad.

You can just run it in the cluster, again with hostNetwork on the edge node. You run the Ingress controllers as a StatefulSet with a headless Service and then use these names to point at the pods. This is elegant, but sadly DNS resolution in HAProxy is not really made for short-lived DNS names and this might blow up. If you use a loadbalancer other than HAProxy, this might not even be an issue for you. Or you can use nodeSelectors to put the Ingress controllers on a specific set of nodes and then use hostNetwork to expose them on the internal node network, so you can just point the loadbalancer at these nodes.
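The headless Service variant could be sketched like this; with clusterIP set to None, the Service DNS name resolves directly to the individual pod IPs instead of a single virtual IP (the selector label is a hypothetical one on the StatefulSet pods):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-controllers
spec:
  clusterIP: None   # headless: DNS returns the pod IPs directly
  selector:
    app: ingress-controller   # hypothetical label on the StatefulSet pods
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
```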

This all introduces a bit more complexity, but at least now you won't be bottlenecked by CPU when TLS traffic hits.

Further reading

Of course these are not all the possibilities, and you can quickly devise interpolations between these examples, depending on what you need, what you can offload to other teams and, well, just about any other factor you can think of.

Official docs aside, the one nugget of gold in the internet of not-so-useful posts was the ProSiebenSat.1 blog post series on the topic.

We really see here the beauty of kubernetes, as it is very flexible and can adjust to pretty much every situation. But there are always choices to be made, and it helps to have someone with some experience who can help you make them decisively and shed light on the downstream effects. Moving confidently is important, but with central infrastructure like kubernetes, you had better back that confidence up. Hopefully this post helps you a little bit to do exactly that, and if you liked it, you might also like the upcoming posts about storage and monitoring.

[1] As far as I understand, some network plugins like MetalLB already come with LoadBalancer-type capabilities, given that the ambient network allows for such configurations, but I did not have the chance to play with this yet.
[2] This behaviour can be overridden by setting publishNotReadyAddresses: true on the Service.