Introduction to Kubernetes pt. 2: Ingress networking

In the previous post I gave a rough overview of the control plane of a Kubernetes cluster, as well as some explanation of how to think about it.

Why a post about this? Can't I just create a LoadBalancer Service and everything works? Or just create an Ingress with the hostname I want? Well, if you buy your cloud, this will almost certainly work, but you still need a rough idea of what kind of magic is happening there. And more importantly, if you run your own cloud, this does not work out of the box in most cases. ¹

In this post I want to talk about the question: How do I get traffic into my cluster? How do I expose services to the outside world?

It might be helpful to point out that in this post capitalization is significant: To further differentiate between our application as a service and the homonymous Kubernetes resource, the former starts with lower case, and the resource is capitalized as in the resource files as Service.

Understanding networking inside a cluster

As in the previous post, I only handwave the in-cluster networking and, for simplicity, make the following assumptions:

the pod network is 192.168.0.0/16
the Service network is 10.96.0.0/12
I try not to assume Calico

As pods are ephemeral and each pod has its own IP address, we need a way to find them by some means other than their IP. This is done by Services, a resource designed to do exactly that. They provide a DNS name inside the cluster, which usually is not exported to the outside world.

Many people struggle with correctly configuring Services at first, as the amount of metadata is just something one has to get used to. Consider a slight modification of the example Service from the documentation, where I added a second selector label.

apiVersion: v1
kind: Service
metadata:
  name: my-service-cache
spec:
  selector:
    app: MyApp
    type: Cache
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376

The thing we need to know, and that is easy to miss in the documentation, is that this Service will now only select pods that have both the app: MyApp and the type: Cache label. So we have to think of an AND operator between the selector lines. You can always check which pods should be selected via kubectl, in this case kubectl get pods -l app=MyApp,type=Cache. Also note that Services only select pods in their own namespace, and only those that are ready, although the latter can be overridden. ²

So to summarize: Services with label selectors select pods that

live in the same namespace
match all of the labels
are ready
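To make the three conditions concrete, here is a minimal sketch of a pod that the Service above would pick up. The pod name, the image and the readiness probe are assumptions made up for illustration; only the labels and port 9376 come from the Service example.

```yaml
# Hypothetical pod that my-service-cache would select.
apiVersion: v1
kind: Pod
metadata:
  name: my-cache-pod            # assumed name
  # no namespace set: the pod must end up in the same namespace as the Service
  labels:
    app: MyApp                  # both labels have to be present,
    type: Cache                 # one of them alone is not enough
spec:
  containers:
    - name: cache
      image: registry.example.com/my-cache:1.0   # placeholder image
      ports:
        - containerPort: 9376   # matches targetPort of the Service
      readinessProbe:           # the pod only counts as ready once this passes
        tcpSocket:
          port: 9376
```

If you drop one of the two labels or put the pod into another namespace, it should disappear from the output of the kubectl get pods command above.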

To see whether your Service has selected the pods it should, you can check with kubectl describe, in this case kubectl describe service my-service-cache. The pod IPs then show up as endpoints and should match the IPs of the pods you've seen before. To see pod IPs in the output you can, for example, append -o wide to the kubectl get pods command.

Bridging the gap between the cluster and the outside world

So far all our pods and Services are only accessible from inside the cluster, i.e. pod-to-pod. Even with Calico, where you can usually access pods from the nodes themselves, you won't be able to resolve my-service-cache.default.svc without further intervention.

The simplest but worst solution to this problem is a NodePort Service. Simply by changing the Service resource to type NodePort, the service will be exposed on all nodes on a specific port, by default somewhere in the range 30000-32767. There are certainly situations in which this is appropriate, but in most cases this is not what we want. Of course we could then loadbalance everything with a separate machine in front of it, but that means we always get additional network hops. It also means that, unless we expose our ingress controller this way, we would have to build many routes by hand.
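For reference, a NodePort variant of the earlier Service could look roughly like this. The Service name and the concrete nodePort value are assumptions; if you leave nodePort out, Kubernetes picks a free port from the configured range.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service-cache-nodeport   # assumed name
spec:
  type: NodePort
  selector:
    app: MyApp
    type: Cache
  ports:
    - protocol: TCP
      port: 80          # port on the cluster-internal Service IP
      targetPort: 9376  # port the pods listen on
      nodePort: 30080   # assumed value; opened on every node
```

The service would then be reachable on port 30080 of any node, which is exactly why you end up needing something in front of the nodes again.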

So the solution I would like to propose is the following: We run a pod with hostNetwork enabled, so that its ports are open on the node itself, on some node(s) at the edge of our node network, pinned there via a nodeSelector. This way we solve two problems:

it is clear where traffic gets into the network
the pod running inside the network already knows how to resolve in-cluster hostnames like my-service-cache.default.svc
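A minimal sketch of such an edge pod, written as a DaemonSet so that every labeled edge node gets one copy. The names, the namespace, the image and the node label example.com/edge are all made up for illustration. One detail that is easy to overlook: with hostNetwork enabled a pod uses the node's DNS configuration by default, so dnsPolicy has to be set to ClusterFirstWithHostNet for in-cluster names to resolve.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: edge-proxy              # hypothetical name
  namespace: ingress            # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: edge-proxy
  template:
    metadata:
      labels:
        app: edge-proxy
    spec:
      hostNetwork: true                    # bind directly to the node's interfaces
      dnsPolicy: ClusterFirstWithHostNet   # keep using the cluster DNS
      nodeSelector:
        example.com/edge: "true"           # assumed label, set on the edge node(s)
      containers:
        - name: proxy
          image: registry.example.com/edge-proxy:1.0   # placeholder image
          ports:
            - containerPort: 80
            - containerPort: 443
```

The edge nodes would need the label first, for example via kubectl label node followed by the node name and example.com/edge=true.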

Before we discuss some of the options, we have to take a brief look at Ingress resources and controllers.

Ingress resources and controllers

Usually one has to expose more than a single service to the outside world. I will focus on HTTP as it is the most important to expose. Everything else can also be managed with ingress controllers, but you might choose different ingress controllers depending on the traffic you want to expose.

In traditional deployments there are several ways to achieve this, most of which are interpolations of the following two approaches: The first is to loadbalance traffic across all webservers and then use name-based virtual hosting on the servers themselves to discern different services by the requested host. The second is to put a reverse proxy in front of the webserver fleet that then handles name-based virtual hosting and routes traffic to the respective servers in the backend.

The first approach is very simple to manage and scale, as in a web server farm of, say, twenty machines, all of them are interchangeable. Clearly it requires a large amount of homogeneity in your service, so what one often ends up with is a main fleet of homogeneous webservers serving the main application and then some additional servers that require more work to manage.

The second approach is of course more complicated than the first one. Servers are not completely interchangeable anymore, more care has to be taken to keep services running, and the reverse proxy has to be reconfigured whenever something in the backend changes. This introduces toil, but, well, homogeneity is a strong assumption and only works in the most basic cases.

So the general pattern is to route traffic via name-based virtual hosting to the corresponding backend fleet. And herein lies the beauty of ingress controllers: They do the name-based virtual hosting for you, and you do not have to configure them manually. You simply feed them Ingress resources; they watch for these and reconfigure themselves accordingly. They can even handle TLS termination for you, including integrations with automatic certificate providers.

The skeleton of an Ingress resource usually looks like the following:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-cache-ingress
spec:
  rules:
  - host: cache.example.com
    http:
      paths:
      - path: /
        pathType: Prefix  # required in networking.k8s.io/v1
        backend:
          service:
            name: my-service-cache
            port:
              number: 80

This would send all traffic directed at cache.example.com to our Service my-service-cache. Note that the name of the Ingress does not have to match the name of the Service. Ingresses are namespaced resources and they only reference Services from their own namespace, by name.
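Since TLS termination came up as one of the selling points, here is a sketch of the same Ingress with a tls section. The Ingress name, the ingressClassName nginx and the Secret name are assumptions; the Secret would have to be of type kubernetes.io/tls and live in the same namespace, whether you create it by hand or let some certificate automation do it.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-cache-ingress-tls   # assumed name
spec:
  ingressClassName: nginx              # assumed; selects which controller handles this Ingress
  tls:
    - hosts:
        - cache.example.com
      secretName: cache-example-com-tls   # assumed Secret holding tls.crt and tls.key
  rules:
    - host: cache.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service-cache
                port:
                  number: 80
```

The controller terminates TLS and still talks plain HTTP to the Service on port 80, so nothing changes for the pods behind it.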

The official documentation already outlines plenty of configuration options, but as most Ingress controllers are based on tools like nginx, traefik or haproxy, they usually expose even more configuration options via annotations.
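As an illustration of controller-specific annotations, here is a sketch that assumes the ingress-nginx controller. The two annotations are taken from that controller; other controllers use their own annotation prefixes and will silently ignore these. The Ingress name and the values are made up.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service-cache-ingress-tuned   # assumed name
  annotations:
    # only understood by ingress-nginx (assumption: that is the controller in use)
    nginx.ingress.kubernetes.io/proxy-body-size: "8m"   # raise the upload size limit
    nginx.ingress.kubernetes.io/ssl-redirect: "true"    # redirect HTTP to HTTPS when TLS is configured
spec:
  rules:
    - host: cache.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service-cache
                port:
                  number: 80
```

Annotation values are always strings, hence the quotes around "true".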

Making it work

We are now in a position to talk about how to set things up. First of all I want to mention that you probably want to use pre-existing Helm charts for the deployment of Ingress controllers and not roll your own Kubernetes resources.

The edge node(s)

First we want to have at least one node at the edge, which is part of the cluster but also has external IPs. For redundancy it makes sense to have several of them, and how you realize this mostly depends on your ambient network. Round-robin DNS might be a possibility, but I would suggest using a pair of edge nodes that share a virtual service IP via keepalived. All public DNS names for our services would then get mapped to this IP. Downstream we do not really care about the number of edge nodes, so everything applies to multiple edge nodes as well as to a single one.

Ingress with HostNetwork on edge node

This would be the simplest setup. The edge node simply runs the Ingress controller (e.g. nginx, traefik, haproxy, whatever you like) and opens all ports you need to the outside world.
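If you go the Helm route, this setup mostly boils down to a handful of values. The following is a sketch assuming the ingress-nginx chart; the keys reflect my understanding of that chart and should be checked against the values.yaml of the chart version you actually install. The node label example.com/edge is again a made-up assumption.

```yaml
# values-edge.yaml (hypothetical file name), assuming the ingress-nginx Helm chart
controller:
  kind: DaemonSet                      # one controller pod per selected node
  hostNetwork: true                    # listen on the edge node's own ports 80/443
  dnsPolicy: ClusterFirstWithHostNet   # keep in-cluster DNS working with hostNetwork
  nodeSelector:
    example.com/edge: "true"           # assumed label on the edge node(s)
  service:
    enabled: false                     # no extra Service needed, traffic enters via the host
```

Other controllers' charts tend to offer comparable switches under different names.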

The downside to this is that TLS termination might become a CPU bottleneck. A rule of thumb I've read somewhere on the HFT guy's blog was around one core per 0.1 - 1 GBit/s, so for starters you can probably just look at the hoses that run into your datacenter.

Loadbalancer on edge in front of Ingresses

So when you need to scale the Ingress controllers, you need to actually loadbalance them with a proper loadbalancer like HAProxy. There you then have some decisions to make, and all of them are equally bad.

You can just run it in the cluster, again with hostNetwork on the edge node. You run the Ingress controllers as a StatefulSet with a headless Service and then use the resulting stable DNS names to point at the pods. This is elegant, but sadly DNS resolution in HAProxy is not really made for short-lived DNS records and this might blow up. If you use a loadbalancer other than HAProxy, this might not even be an issue for you.

Or you can use nodeSelectors to put the Ingress controllers on a specific set of nodes and then use hostNetwork to expose them on the internal node network, so you can just point the loadbalancer at these nodes.
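The StatefulSet variant could be sketched as follows. All names, the namespace, the replica count and the image are placeholders; a real controller also needs its RBAC setup and arguments, which is one more reason to start from a Helm chart.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ingress-headless         # hypothetical name
  namespace: ingress             # hypothetical namespace
spec:
  clusterIP: None                # headless: DNS returns the pod IPs directly
  selector:
    app: ingress-controller
  ports:
    - name: http
      port: 80
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ingress
  namespace: ingress
spec:
  serviceName: ingress-headless  # gives every pod a stable DNS name
  replicas: 3                    # assumed count
  selector:
    matchLabels:
      app: ingress-controller
  template:
    metadata:
      labels:
        app: ingress-controller
    spec:
      containers:
        - name: controller
          image: registry.example.com/ingress-controller:1.0   # placeholder image
          ports:
            - containerPort: 80
```

With these names the pods become resolvable as ingress-0.ingress-headless.ingress.svc.cluster.local, ingress-1 and so on (assuming the default cluster domain), and those are the names the loadbalancer on the edge node would use as backends.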

This all introduces a bit more complexity, but at least now you won't be bottlenecked by CPU when TLS traffic hits.

deaddy.net