An incomplete and incorrect guide to the Kubernetes ecosystem

… from someone who doesn’t know anything.

I’ve been using Kubernetes professionally for about seven months, which makes now the ideal time to write this post. There are only a few known unknowns, enough known knowns that Dunning-Kruger is in full effect, and presumably the space of unknown unknowns remains approximately infinite (but, luckily, by definition I see no evidence of that.)

This post is flippant and unfair. The fact that any of this works at all is a miracle, and is only possible thanks to an army of maintainers (many of whom are volunteers and some of whom no longer even use what they're maintaining), inheriting constraints and contexts that I can't even imagine.

Full disclosure: I once tried to join the Kubernetes core team and was rejected. I leave it as an exercise for the reader to determine whether that was for the best.

The first thing to know about Kubernetes is that it’s a multiplier. If you can make some software run in a container on one machine, you can easily make that software run on 1,000 machines. On the other hand, if you currently have 1 problem (presumably what you’re using Kubernetes to solve), tomorrow you will have 1,000 problems.

Anyway, let’s work our way up the stack.

The platform

There are two high level ways to get a cluster: you can run it yourself or you can pay someone else to run it.

When deciding between the two options, just think of Gentoo Linux. Do you enjoy constantly recompiling everything, setting compiler flags (-funroll-loops anyone?), and watching passively as your life force slowly leaves your body? Run it yourself using Kubeadm. Ignore kind unless you want a development cluster, ignore Minikube because kind is more representative, and ignore K3S because are you serious? Why would you want to run a weird bespoke environment on your weird bespoke hardware?
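If a development cluster really is all you want, kind barely needs any configuration. A minimal sketch (the node layout is just an example, and you’d feed it to kind create cluster --config):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker   # add more of these to pretend you have a real cluster
```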

If you’re the type of person who would rather not recompile your kernel, you will enjoy paying a platform to manage your Kubernetes cluster for you. The only platform I have experience with is AWS’s EKS, but please allow me to make broad sweeping generalizations. While I don’t know what mania feels like, the experience of using EKS is probably similar. When it’s good, it’s really good: for example IRSA is phenomenal. When it’s bad, it’s really bad: the EBS CSI plugin is basically unmaintained, the control plane appears to be cobbled together with sticks and is incredibly slow, “EKS addons” are terrible, you can’t downgrade if you find a breakage when you upgrade to a newer EKS version, who even knows what the AWS VPC plugin is doing or which CNIs will work with it (just kidding, I do know what the AWS VPC plugin is doing: it’s causing you to spin up extra nodes because it enforces hard and low limits on the number of pods you can run per node.) I can’t imagine how GKE could be as bad as EKS, but I also can’t imagine the heinous things that I will read about in the news tomorrow, the day after, the day after that, and for the rest of my life.
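For a sense of how low those limits are, here’s the arithmetic, using the formula AWS documents for the VPC CNI (the instance type is just an example):

```yaml
# max pods per node = ENIs × (IPv4 addresses per ENI − 1) + 2
# e.g. an m5.large has 3 ENIs with 10 IPv4 addresses each:
#   3 × (10 − 1) + 2 = 29 pods, no matter how much CPU and memory is left over
```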

Infrastructure as code

You can manage your cluster with a pile of YAML files and Helm charts. You don’t want to do this for your production environment. Are you still thinking about it? Just don’t.

Instead, you can manage your cluster using Pulumi, which is pretty cool and uses nice languages you already know. Their engineers have a sense of humor. Unfortunately, basically no one uses Pulumi. When you have a problem (and as already mentioned, you’re about to have 1,000 problems), you are much more likely to be on your own. Also, I don’t really understand why Pulumi is built around the model of replica environments. I don’t want a production cluster and an exact replica dev cluster. Though, if I’m being honest, I don’t even want a production cluster. I want to go hiking. It makes so little sense to me that I almost feel like I’ve misunderstood something core to Pulumi, but, thanks to Dunning-Kruger, I am immune to actually experiencing that feeling.

Therefore, you will use Terraform. To be clear, Terraform is terra-ble. But a terrible thing that everyone uses is much better than a slightly less terrible thing that far fewer people use. There are many pre-built modules that make using Terraform easier, for example this EKS one. These will appear attractive, and you should use them as a beginner because they’ll help you get up and running when you understand none of the individual pieces, but they are bad for non-obvious reasons and are tech debt and there will be a time when you reap what you’ve sown. Have fun!

Overall, the situation makes me sad. Though, if you want a laugh, you might consider kubecfg.

Continuously integrating your infrastructure

This section is a little off-topic since it isn’t really Kubernetes specific. However, I have strong and likely wrong opinions so you’re going to hear about it anyway.

You have a few different source control systems you can put your code in. You can use Mercurial, which has nice UX and will make the most sense to everyone. You can use Perforce, which is very enterprise-y and deals very well with large binary files. Or, you can use Git, which is neither user-friendly nor deals well with large binary files. After a careful analysis of the aforementioned pros and cons, you will obviously pick Git.

You could self-host your repository, which requires a level of hubris I have yet to experience. You could use GitLab, which I quite like, has nice UX, and I’ve heard has good code review tooling. Or you could use GitHub, which is generally an abomination (good luck trying to use their code review tooling or code search or tracking blame after a file move or seeing what time a commit was made on mobile or really doing literally anything), but all your users already have accounts, it’s 75% cheaper than GitLab, and the GitHub Actions ecosystem is remarkably large.

Now that you’ve chosen GitHub, you find yourself becoming more familiar with GitHub Actions. Similar to how a dam breaks, you start to develop a creeping feeling that the Actions security model is horrifically broken but push the thought out of your mind. Suddenly, without warning, you are forced to accept that anyone with the ability to push a branch to your repository has the ability to execute any code they want with any capability the repository has. This is particularly lovely in the context of a Terraform repository, because in a naive integration it means any user can easily escalate their privileges by modifying the workflow in their branch and making a PR. There are solutions to this. In the general case, you can use Actions solely to trigger BuildKite, which doesn’t suffer from this problem. In the particular case of Terraform, I assume their cloud product has a solution. You can also come up with your own custom solution. Or, you could close your eyes, hail Satan, and pretend there is no problem. Regardless of which path you choose, I assure you that you will cry yourself to sleep every night.
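To make the naive-integration failure mode concrete, here’s a sketch of one custom-ish mitigation (not a fix for the underlying model): applies only ever run from the default branch, and the real credentials live in a protected environment that a random branch can’t deploy to. The workflow, environment name, and omitted credential wiring are all hypothetical.

```yaml
name: terraform-apply
on:
  push:
    branches: [main]          # applies never run for random branches
jobs:
  apply:
    runs-on: ubuntu-latest
    environment: production   # restrict this environment to main and add required
                              # reviewers; its secrets aren't handed to other branches
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      # cloud credentials intentionally omitted; OIDC, or whatever you hate least
      - run: terraform init && terraform apply -auto-approve
```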

Somewhere along the way you will discover Reviewable, which is not perfect but is a million times better than doing code reviews in the GitHub interface. Unfortunately, you will not have the energy to force everyone in your team to sign up for it and so you’ll give it up when the trial period ends.

CSIs: I assure you a crime occurred, but I can’t put my finger on it

I’m already bored just thinking about this topic, so we’ll keep it short. You want your services to be stateless, and you want all of the data they need to already be in the container or available from an API (such as a managed database or S3.) So, logically, you avoid thinking about container storage interfaces and live happily ever after.

Unfortunately, this isn’t a fairytale and you live in hell world (don’t try to argue, you’re using Kubernetes), so sometimes you need to interface with storage and maintain state. I only have three pieces of advice.

  • Don’t worry! You likely only have one good choice for which storage interface to use, so that will make this easier
  • If you are a control freak and taint all of your nodes by default (hello, my name is April), don’t be surprised when the CSI daemonsets don’t create any pods and nothing works and you have no idea why (see the sketch after this list)
  • Don’t believe the Kubernetes documentation’s lies. Look at how well explained this entry is, now look at whatever this is, now tell me whether it’s a good idea to use awsElasticBlockStore. You aren’t sure? Same. (The answer is no, it’s not a good idea. But neither was using AWS and look at you now.)
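About that second bullet: the fix is a toleration on the CSI driver’s node DaemonSet, as sketched below. Everything here (name, labels, image) is a hypothetical stand-in for whatever your driver actually ships; the only part that matters is the tolerations block.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: csi-node                    # stand-in for e.g. the EBS CSI node DaemonSet
spec:
  selector:
    matchLabels: {app: csi-node}
  template:
    metadata:
      labels: {app: csi-node}
    spec:
      tolerations:
        - operator: Exists          # tolerate every taint, including all of yours
      containers:
        - name: csi-driver
          image: registry.example.com/csi-driver:v0   # hypothetical
```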

Whose CNI Is It Anyway?

If I had to list the top five things that confuse me in Kubernetes, they would be as follows.

  1. CNIs
  2. CNIs
  3. CNIs
  4. CNIs
  5. Why are groups so close to being useful but so bad

The most important thing to know about any CNI is that the marketing pages and the documentation will make absolutely zero sense to you unless you already understand what the CNI does. This presents a bit of a chicken and egg problem. I will group and summarize a few here.

On a “bare metal” (your self-hosted Kubeadm) cluster, you need to expose your cluster-internal IPs to the outside world. You will do this by installing MetalLB. You could also use PureLB I guess, it has a nice logo and this guy wrote a lot of words that look intelligent, but you could also just install MetalLB and go for a hike instead. Alternatively, if you are using a cloud provider’s Kubernetes platform, congratulations: you have just solved this problem (having to think about Linux networking) by throwing money at it.
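For what it’s worth, once MetalLB is installed the configuration really is just a couple of resources. A minimal layer-2 sketch, assuming a recent CRD-based MetalLB (older versions used a ConfigMap) and a made-up slice of your network:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.240-192.168.1.250   # a range your DHCP server won't fight over
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool
```

After that, any Service of type LoadBalancer gets an address from the pool, and you can go for that hike.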

On a “bare metal” cluster with multiple nodes, you likely want a way for a pod on one node to interact with a pod on another node. While I’ve never really tried self-hosting a multi-node cluster, I can extrapolate from my experience with Kubernetes generally: typically literally nothing works by default, so I assume that this doesn’t either. To make it work, you’ll install flannel. As an alternative, in the event you want to control where a pod can receive data from and where it can send data to, you can use Calico. I’ll admit I have a soft spot in my heart for Calico, but if you’re reading this post the reality is that Calico is YAGNI. If you have a strong fear of missing out, you can also use flannel with Calico (but why?) And if you like being special, you can use Weave (but why?) As with the load balancer CNIs, if you’re using a cloud provider’s Kubernetes platform then this whole problem is already solved for you. But don’t celebrate too much: if you actually want to use flannel or Calico networking (for example, to get around the AWS VPC CNI’s low limit of pods per node), you may be about to experience chaos. Luckily you’re already used to throwing money at problems, so maybe just give up and throw money at AWS for a few more nodes than necessary?
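If you’re wondering what “control where a pod can receive data from” looks like in practice, it’s a standard NetworkPolicy, which Calico enforces and plain flannel quietly ignores. A minimal sketch with made-up labels and port:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
spec:
  podSelector:
    matchLabels: {app: api}         # the pods being protected
  policyTypes: [Ingress]
  ingress:
    - from:
        - podSelector:
            matchLabels: {app: frontend}
      ports:
        - protocol: TCP
          port: 8080
```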

Moving up the stack from these bare metal concerns, you’ll find another set of confusing CNI options. I have never used it, but I am fairly certain you want to install Linkerd. It allows you to automatically encrypt traffic between pods (saving you the trouble of having to ensure every container you use supports encryption natively), it has a dashboard that seems to be very useful for observability, and it has a very pretty website. The primary alternative is Istio, which has a pleasant logo but seems to have lost the meme wars and feels like choosing Nomad over Linkerd’s Kubernetes. You can also use Traefik Mesh, which I would bet is good since the Traefik folks seem smart and their website is also nice. But I don’t think it offers encryption and I just don’t know why you would pick it over Linkerd.
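As far as I can tell (again, I’ve never used it), the day-to-day Linkerd experience amounts to annotating a namespace so the proxy gets injected into every pod in it, which is where the automatic mTLS comes from. Something like this, with a hypothetical namespace:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  annotations:
    linkerd.io/inject: enabled      # new pods in this namespace get the sidecar proxy
```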

Continuous deployment

I have nothing good to say about continuous deployment with Kubernetes. I have no good solution, and I know very little, but both Argo CD and Weave GitOps seem awful. Why would I want to use a frontend UI to configure weird values and Helm charts and then use weird systems to run the deployments? Why can’t I just do this with a Git PR?

Here’s what I want: I want to use Terraform to provision some stuff in Kubernetes. I want something like Dependabot that will automatically make PRs bumping the versions in my Terraform code when a newer image is available. I will probably give you money for this.

One bummer with Terraform is you miss out on the amazing things that systems like Argo Rollouts can do, but I’m already drowning in enough complexity. Maybe next year.

Container registries

This topic is boring, and we shall dwell on it only long enough for me to complain: I hate Docker-style registries and the Kubernetes integration. I don’t understand why my options are imagePullPolicy: Always or imagePullPolicy: IfNotPresent. Why can’t I choose imagePullPolicy: IfLabelMoved? Yeah, yeah, yeah, reproducibility, blah, blah, blah. (Update: since writing this post I learned that this is exactly what Always does.)
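For reference, this is the entirety of the knob I’m complaining about (the image and tag are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
    - name: app
      image: registry.example.com/app:live   # a tag someone keeps moving
      # Always: re-check the tag against the registry every time the container
      #   starts, which (per the update above) is the IfLabelMoved I wanted.
      # IfNotPresent: once an image with this tag is on the node, never ask again.
      imagePullPolicy: Always
```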

Also, I want my images to be deleted thirty days after their last use. Why are my options with AWS ECR instead “thirty days after upload, but only if it isn’t one of the five most recent and doesn’t have the live label”? Or with Google’s GCR: “this is literally impossible”. Yes, I know a huge part of the problem is that imagePullPolicy: IfNotPresent won’t even contact the registry if the image has already been downloaded, and so there will not be any ping to reset the TTL, but also the Kubernetes folks have already rewritten enormous pieces and solved incredibly difficult problems. Is forking bazel-remote and adding a heartbeat call when a node starts a container really beyond their abilities? Obviously this is an ideological difference, but thanks to Dunning-Kruger I will assure you that the Kubernetes core team doesn’t know anything about Kubernetes and is objectively wrong.

Workflow orchestration

As an aware newbie (versus being unaware that I’m a newbie, as I am now), I tried my best to be thoughtful about orchestration choices and was starstruck by Airflow. Look at that website. Look at all the things it can do. Look at the graph of GitHub stars. Airflow must be amazing!

Unfortunately, it’s really not. If you are willing to buy in to the Airflow ecosystem 100% (writing all of your logic in Python in DAGs), and if you want to run one instance of your pipeline at a time on a cron-like schedule, I guess it’s fine. But I am using Kubernetes for a reason: I just want to trigger pods with pre-existing containers and sometimes I want 1 pod and sometimes I want to kick off 1,000. I quickly discovered that all of the cool integrations weren’t relevant, and the parts of Airflow I was dealing with (execution_date, I see you) were in my way.

Alternatively, you can use Prefect (undergoing a major rewrite, definitely a good sign), Luigi (I’ve heard this is dead), and Kubeflow (interesting but I am too afraid to buy in to the ecosystem, this feels on the verge of dying, and I have non-ML jobs.)

I finally ended up with Argo Workflows. I like it, I really do, but it’s an ugly beast to love. It’s buggy, it will absolutely destroy your control plane if you aren’t careful, and it has some design choices that just seem plain bad (I really think Argo should just create jobs and have a central work server, rather than creating pods directly.) Despite trying pretty hard, I’ve failed to fix either of the two issues that sent me diving into the Argo source code. Nonetheless, it’s pretty rad that you can just submit YAML to the Kubernetes API server and magically workflows will start.
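To be fair to the beast, this is roughly all it takes to get a workflow running; kubectl create -f it (or argo submit it) and a pod appears. The image and command are just placeholders:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-              # the controller appends a random suffix
spec:
  entrypoint: main
  templates:
    - name: main
      container:
        image: alpine:3.19
        command: [sh, -c, "echo hello from a workflow pod"]
```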

Ingresses

I’ve heard that happiness is when what you think, what you say, and what you do are all aligned. I acknowledge that none of those are aligned for me when it comes to ingresses.

I think that unless you have a compelling reason otherwise, Traefik is the best solution. It seems really solid and I wish I had learned about it first.

If someone asked me for advice, I would probably tell them to use Caddy. The main reason is because only a newbie would be silly enough to ask me for advice, and I feel like Caddy is going to be easy to set up and do the right thing and they won’t have to learn much.

I use ingress-nginx, mostly because it’s the first thing I learned about and because I like nginx and because I’ve already wrapped my head around it. Since SSL certificate generation isn’t built in, I use cert-manager. Please note that ingress-nginx and NGINX Ingress Controller are completely different (though also note that the ingress-nginx website is titled “NGINX Ingress Controller”, lovely.)
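Concretely, my setup amounts to an Ingress with a cert-manager annotation on it: cert-manager mints and renews the TLS secret, ingress-nginx serves it. The hostname, issuer, and service below are all made up:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # a ClusterIssuer you set up elsewhere
spec:
  ingressClassName: nginx
  tls:
    - hosts: [app.example.com]
      secretName: app-tls           # cert-manager creates and renews this secret
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```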

If you’re using ingress-nginx and want to put your frontends behind SSO, you have a few options.

  1. You can put native SSO support into each frontend
  2. You can instead use ingress-nginx to proxy to a frontend that only does SSO, and then have that frontend proxy to the actual frontend when the request is authorized
  3. Or you can use ingress-nginx in forward auth mode, which means that nginx will make a request to a frontend to check the request’s authorization. If the authorization fails, the user will be redirected to that frontend to log in. If it succeeds, ingress-nginx itself will then proxy the request to the actual frontend for the request. (There’s a sketch of this after the list.)
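Here’s the sketch promised above for option 3. Forward auth is mostly two annotations on the Ingress, pointed at whichever SSO frontend you picked (oauth2-proxy here, with made-up hostnames):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-sso
  annotations:
    # nginx checks every request against this endpoint...
    nginx.ingress.kubernetes.io/auth-url: "https://sso.example.com/oauth2/auth"
    # ...and redirects the user here to log in when the check fails.
    nginx.ingress.kubernetes.io/auth-signin: "https://sso.example.com/oauth2/start?rd=$scheme://$host$request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```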

Having every frontend implement SSO itself seems really bad, both because it’s a tremendous waste of developer time and because you have to configure each one independently. I’m already invested in ingress-nginx in forward auth mode, so I just stick with that. If you pick an SSO frontend that supports it, I imagine the proxy approach is fine too. Though I would worry that you’ll end up in situations that need a specific configuration customization only nginx can do (request headers too large? Who knows.)

You have a number of similar but different options for SSO proxies. You can use oauth2-proxy, which is simple and works pretty well and whose maintainers don’t even use the software anymore. You can use Pomerium, which I think is what I am supposed to use but at this point feels like a lateral change from oauth2-proxy (I don’t want to learn their ingress controller and I don’t like their device identity implementation.) You can use authentik but then you’re also getting an authentication provider. I disapprove of you self-hosting your own authentication provider, but I will defend to the death your right to do so.

For self-hosted authentication providers you might also consider Keycloak, which is the most enterprise community software I’ve ever used. There’s also Dex (I cannot tell if this project is on deathwatch), Ory, and probably a million other providers. If you’re reading this post, I assure you that thinking about any of this is a complete waste of your time.

Monitoring

You’re going to use Prometheus. Are there alternatives? I have no idea. Do I care that I don’t know if there are alternatives? Not at all. Prometheus is great. Yes, I’m familiar with this hilarious bug that produces white graph lines on a white background. It’s endearing.

Grafana, on the other hand, seems like such an afterthought. It makes me sad.

Logs

I have absolutely no idea what logging infrastructure is available or good. Maybe you can tell me.

Cross cluster service discovery

I have no idea what a good way to do this is. Probably you should start by being careful about which CIDR ranges your pods get in each cluster, which I haven’t done, though through blind faith I believe everything will be fine anyway. Then I think people generally decide that the solution to complexity is even more complexity (called Consul.) Somehow this magically works.

Then, eventually, your company explodes.

Conclusion

The world outside is beautiful, take a hike.

Go for a hike