Incomplete and incorrect, part 3: I learned some things

This follows part 2 of An Incomplete and Incorrect Guide to the Kubernetes Ecosystem.

Five months have passed since my last post, suggesting I have learned five months’ worth of knowledge. Unfortunately, I think I’ve gained one month of knowledge and used the remaining four months to become a luddite.

Argo Workflows

In a previous post I said I liked it. I no longer like it. I now have several concerns.

  • It’s architecturally flawed at scale. The primary thing it does is launch pods individually, and when you have thousands of pods it burns a tremendous amount of CPU constantly polling the Kubernetes control plane. I imagine there are knobs to tweak this behavior, perhaps telling it to poll less often, but that’s a band-aid on a bad design choice. Additionally, the state of the workflow gets saved back into the Workflow custom resource, which causes yet more load (there are some details here around status offloading which I’ve forgotten now that I’ve abandoned Argo, and that feels good.)
  • It starts from the assumption that users are repeatedly running ETL-style workflows. My users are generally running one-off workflows, and while parts get reused, the workflow as a whole is often only run once (aside from debugging/test runs while they iterate.) The overhead of pushing a YAML workflow into the cluster just to run it once confused them and provided no value.
  • It tries to abstract away the internals of your steps, but in practice it just duplicates those internals into YAML. We had use cases that would download a folder from S3, run a step on each file in parallel, and then aggregate the results. This was a nightmare using Argo’s primitives, because we hit bugs like one step expecting a trailing slash on a directory name and the workflow input (or worse, a workflow string transformation) forgetting to include it. Doing it in code lets us just use types, an enormous improvement; see the sketch after this list.
  • It’s buggy and poorly documented. Under load I’ve experienced unretryable state corruption, and when I dove into the code in order to fix it I found that it was just a total mess.
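
To give a sense of what “doing it in code” looks like for that S3 fan-out case, here’s a rough sketch. It’s hypothetical (the bucket layout, the per-file work, and the helper names are all made up), but it’s the general shape: ordinary typed Python instead of YAML templates and workflow string transformations.

```python
# Hypothetical fan-out/fan-in pipeline expressed as ordinary typed Python,
# rather than as Argo templates plus string-manipulated workflow parameters.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

import boto3  # assumes the inputs live in S3, as in the example above


def list_keys(bucket: str, prefix: str) -> list[str]:
    """List every object under an S3 prefix (no trailing-slash guesswork)."""
    paginator = boto3.client("s3").get_paginator("list_objects_v2")
    keys: list[str] = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys


def process_file(bucket: str, key: str) -> int:
    """Hypothetical per-file step: download one object and do some work on it."""
    local = Path("/tmp") / Path(key).name
    boto3.client("s3").download_file(bucket, key, str(local))
    return len(local.read_bytes())  # stand-in for the real computation


def run(bucket: str, prefix: str) -> int:
    keys = list_keys(bucket, prefix)
    with ProcessPoolExecutor() as pool:
        results = pool.map(process_file, [bucket] * len(keys), keys)
        return sum(results)  # the aggregation step, in plain Python
```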

So what do I recommend? Well, ultimately I wrote a simple client-side library that runs pipelines (and schedules work in the cluster by starting Kubernetes Jobs.) It’s great because you don’t need to maintain a central server, and users are totally isolated. But if you’re okay with some compromises and you’re still looking for an off-the-shelf solution, I would consider Flyte.
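
To make the client-side approach concrete, here’s a minimal sketch of the core idea using the official kubernetes Python client. Everything here (the function name, the image, the namespace) is a placeholder, and the real library layers retries, log streaming, and cleanup on top of something like this.

```python
# Minimal sketch: run one pipeline step as a Kubernetes Job, launched directly
# from the client. No central workflow server involved.
from kubernetes import client, config


def run_step(name: str, image: str, command: list[str], namespace: str = "default") -> None:
    config.load_kube_config()  # or load_incluster_config() if the launcher runs in-cluster
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1JobSpec(
            backoff_limit=2,  # arbitrary retry budget for the example
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(name=name, image=image, command=command)],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace=namespace, body=job)


# Hypothetical usage:
# run_step("extract-2024-01-01", "registry.example.com/pipeline:latest",
#          ["python", "-m", "steps.extract"])
```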

Cluster Autoscaler

There is one pathological behavior that should be front-and-center for anyone thinking about using it: if you have pods stuck in Pending anywhere in your cluster (for example, waiting for spot capacity that will never become available), it will block scale-down for your entire cluster. I am not sure whether the solution is to fork the autoscaler or to add alerts for pending pods, but I do know that this behavior is atrocious.
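
If you go the alerting route, you don’t need anything fancy. A periodic check like the hypothetical one below (the 15-minute threshold is an arbitrary assumption, and you’d wire the output into whatever pages you) is enough to catch the pods that are quietly holding scale-down hostage.

```python
# Quick-and-dirty check for pods stuck in Pending long enough that they're
# probably blocking Cluster Autoscaler scale-down.
from datetime import datetime, timedelta, timezone

from kubernetes import client, config


def stuck_pending_pods(max_age: timedelta = timedelta(minutes=15)) -> list[str]:
    config.load_kube_config()
    v1 = client.CoreV1Api()
    now = datetime.now(timezone.utc)
    stuck = []
    pending = v1.list_pod_for_all_namespaces(field_selector="status.phase=Pending")
    for pod in pending.items:
        age = now - pod.metadata.creation_timestamp
        if age > max_age:
            stuck.append(f"{pod.metadata.namespace}/{pod.metadata.name} pending for {age}")
    return stuck


if __name__ == "__main__":
    for line in stuck_pending_pods():
        print(line)
```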

EKS and GKE

Being an ex-Googler, knowing that Google built Kubernetes, and considering that the primary idea was to provide the world’s best Kubernetes implementation so people would flock to Google Cloud, I assumed that GKE would be stellar. Unfortunately, upon review, I have to say: it’s bad, folks.

It mostly feels like they offered an amazing service for Kubernetes 1.12. But as the platform moved on, they basically didn’t. There are some cool checkboxes, like automatically installing the Kubernetes Dashboard and pre-installing the Cluster Autoscaler. But it turns out you don’t actually want it to do these things, because the Dashboard is deprecated (so you’ll install it yourself unless you’re only using GCP and are okay teaching your users about the Cloud Console) and because you almost certainly want the power to tweak the Cluster Autoscaler’s flags. And meanwhile Kubernetes itself got better; for example, the control plane now natively speaks OIDC. GKE will tell you that they support OIDC, but it’s via an Anthos identity proxy that none of your pods will know about.

Meanwhile, EKS is just a step above bare-metal hosting (most notably providing control plane hosting, cluster identity, networking with aliased IPs, and node provisioning.) It offers very little, but that’s fine! There are still some features I wish it had, for example support for trusting external certificate authorities, but it mostly just doesn’t get in your way. Ambition is great if you’re successful, but it’s a negative if you aren’t.

I do have one good thing to say about GCP: having multi-zonal subnets makes networking delightfully easy. But I’ll admit that it also makes me scared, because being able to do routing or DNS per-zone on AWS is an incredibly powerful feature.

Pulumi

In a previous post I made the point that Terraform was better because more people used it. I still think that’s generally true, but it’s hard to overstate how awful Terraform is. Passing data around requires a lot of copy-paste, and multi-cluster setups often devolve into piles of Terragrunt. The final straw was when I wanted a module to initialize cloud-provider-specific resources, output them to a higher-level module, and then have those outputs passed into another module that initialized cloud-provider-independent resources. While it’s technically possible, provider limitations turned it into an enormous amount of copy-paste.

And while I stand by everything I said about Pulumi, and while I will add that it’s extremely buggy, it at least allows me to express my ideas and complete my goals. Sometimes I tell people that it’s like a tech demo that no one has actually used: yes, it’s broken; yes, you’ll get cut if you touch that corner; but just think about the potential!
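
To show why I keep tolerating it anyway, here’s a structural sketch of the layering I wanted in the Terraform story above, expressed as plain Python the way Pulumi allows. Every name and field here is hypothetical and there’s no real infrastructure code in it; the point is that provider-specific outputs are just values returned from a function and handed to the next layer.

```python
# Structural sketch only: hypothetical names, no real cloud resources.
from dataclasses import dataclass


@dataclass
class ClusterOutputs:
    """Whatever the cloud-provider-specific layer hands to the layers above it."""
    kubeconfig: str
    node_service_account: str
    network_id: str


def create_gcp_cluster(name: str) -> ClusterOutputs:
    # In the real program this would create pulumi_gcp resources; what matters
    # is that its outputs are just a value returned from a function.
    return ClusterOutputs(
        kubeconfig=f"<kubeconfig for {name}>",
        node_service_account=f"{name}-nodes@example.iam.gserviceaccount.com",
        network_id=f"projects/example/global/networks/{name}",
    )


def create_platform_resources(cluster: ClusterOutputs) -> None:
    # Cloud-provider-independent layer: it consumes the outputs directly,
    # with no variable plumbing or copy-paste across module boundaries.
    print(f"installing platform components as {cluster.node_service_account}")


if __name__ == "__main__":
    create_platform_resources(create_gcp_cluster("primary"))
```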

I’ve since realized one major benefit of using Pulumi in a predominantly Python environment: it means I can embed my production configuration directly into my applications. If I maintain a list of which zones support which machine types when creating node pools for GKE, my users’ launchers can reference that same list, figure out which clusters can provide a given machine, and make the right API calls completely magically.
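
A hypothetical sketch of what I mean (the machine types, zones, and helper name are all made up): one Python module holds the mapping, the Pulumi program imports it to decide which node pools to create, and the launchers import the very same module to decide where work can run.

```python
# zones.py -- shared between the Pulumi program and user-facing launchers.
# Machine types, zones, and the helper below are illustrative, not prescriptive.
MACHINE_ZONES: dict[str, list[str]] = {
    "n1-standard-8": ["us-central1-a", "us-central1-b", "us-central1-c"],
    "a2-highgpu-1g": ["us-central1-a", "us-central1-f"],
}

# The Pulumi program iterates MACHINE_ZONES and creates one GKE node pool per
# machine type, restricted to the zones listed for it. Because the launchers
# import this exact dict, infrastructure and application code can't drift apart.


def pick_zone(machine_type: str) -> str:
    """Pick a zone where a node pool for this machine type actually exists."""
    try:
        return MACHINE_ZONES[machine_type][0]
    except KeyError:
        raise ValueError(f"no node pool is configured for {machine_type!r}") from None
```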