Incomplete and incorrect, part 2: logging in Kubernetes

This follows part 1 of An Incomplete and Incorrect Guide to the Kubernetes Ecosystem. I still don’t know anything, but with a day of logging experience under my belt I’m prepared to pretend I do.

Have you ever heard the story of the paperclip maximizer? The theory goes that, if an intelligence’s sole purpose is to make paperclips, it will do everything possible to make more paperclips. That “everything” might be quite undesirable from a human perspective. Humans contain plenty of atoms that can be used to make paperclips, so such an intelligence would probably try to kill all humans. Similarly, the intelligence would likely also try to destroy Earth, because Earth contains lots of buried material that could be turned into paperclips. Overall, seems kind of bad.

While it remains to be seen whether artificial intelligence will be so myopic, it turns out that even non-artificial intelligence is capable of such atrocities. For one example, consider the folks pushing Elasticsearch as the solution to all your Kubernetes logging needs. Given a goal (“make a powerful way to collect and filter logs”), they executed so relentlessly that the result is so powerful, so featureful, and so enterprisey that it’s unusable by ordinary humans. It’s a bit unfortunate.

But before we get there, let’s step back for a second and figure out how Fluentd, Fluent Bit, Elastic, Grafana, Kibana, Filebeat, Logstash, and Loki all relate.

The ecosystem

(Diagram: the logging ecosystem)

Many systems, often with overlapping optional functionality, compose into a logging solution, which is why there’s such a huge diversity of stacks in the wild. You may even see multiple forwarders forwarding to each other, or multiple forwarders shipping to the same database.

My initial forays into logging left me deeply confused. Sometimes you would see people talking about an Elastic stack, but then their articles completely glossed over Elastic and mostly talked about Fluentd. Looking into Fluentd, I found plenty of people talking about log processing and forwarding using Fluent Bit. Some people talked about using kubernetes_metadata_filter with Fluentd, while others completely ignored it. Then you look at Elastic’s website and they totally ignore Fluent, instead talking about Logstash. Sometimes you see Fluent as a log scraper, and other times you see it as a way to process logs and simply dump them to S3. And why do some people run Fluent as a sidecar, some as a daemonset, and some in a way that is kind-of in between? All I wanted to do was get logs from pods into a UI! Why is it so complicated?

So here’s the (probably) technically inaccurate yet (hopefully) helpful summary you need to know.

  • In my mind, the solution I wanted was “upload the plain text log from this pod into a plain text file on S3.” That’s not what you’re going to get, so just forget about it.
  • You should probably start by selecting whether you want to use Elastic or Loki, rather than fixating on your log collector/processor/forwarder choices.
  • Fluentd has been replaced by Fluent Bit, so ignore Fluentd. I read somewhere that Fluent Bit is being replaced by something else that isn’t ready yet, which seems extremely on brand.
  • Ignore people running in a sidecar configuration. They’re supporting legacy systems and need obscene Fluent configuration; nothing good will come of following their example.
  • Use Prometheus metrics for metrics, not your logs. If you treat your logs as a debugging aid, not an alerting system, then you don’t need crazy log processing and your life will be much simpler. Just aim for something that collects logs and puts them in your log database.
  • Don’t log personally identifiable information and your life will be much easier. Kibana supports a dizzying array of ACL options, but if your logs aren’t sensitive then you can ignore all of it.

Why is everything terrible?

In my infinite ignorance, I think there are three reasons this space is so complicated.

  1. A lot of the solutions are paperclip maximizing. Using Elastic for logging is like using a Saturn V rocket to go to the grocery store, but from an enterprise checklist standpoint it’s difficult to justify selling anything less. I wanted something to put a file in S3 so I could look at it later, but what I got was an indexed full text search over every possible field in any log with join support.
  2. When I approach this problem I think “I have containers that output to stdout and stderr, how do I get these logs somewhere useful?” But when the Fluentd authors approached this problem, they considered an ecosystem where every application might log to a different place and they needed a solution for every case.
  3. I am incapable of thinking about parsing log lines in order to derive metrics, it just seems so terrible. But a lot of the complexity in these solutions is required to support exactly that (anti-)pattern.

What do

So what should you do? Well, you should think about your goals. Is your goal to set up the most amazing log infrastructure you’ve ever seen and optimize it for the rest of your life? Or is your job to set up something good enough and move on to solving your actual problem?

If you want to spend the rest of your life thinking about logs, I think you should set up Fluent Bit in a daemonset configuration with no processing and use it to ship logs straight into Elastic with Kibana as your UI. Setting up a user for Fluent and figuring out the right TLS options and dot renaming will be a fun adventure you can look forward to. After a few hours of messing with it I finally saw logs in Kibana (yay!), but the UI was so complicated and had so many options that the screen real estate dedicated to showing the actual log message seemed to be about 40 pixels wide.
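For a sense of what that adventure looks like, here’s roughly the shape of the Fluent Bit config involved (a sketch in Fluent Bit’s classic config format; the hostname, user, and password are placeholders, and the exact output options you need will vary with your Elasticsearch version):

```
[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Tag               kube.*
    Parser            docker

[FILTER]
    # Enrich each record with pod/namespace metadata from the API server
    Name              kubernetes
    Match             kube.*
    Merge_Log         On

[OUTPUT]
    Name              es
    Match             *
    Host              elasticsearch.logging.svc    # placeholder
    Port              9200
    HTTP_User         fluentbit                    # placeholder
    HTTP_Passwd       changeme                     # placeholder
    tls               On
    # The aforementioned dot renaming: Elastic treats dots in field
    # names as nested objects, so Fluent Bit can rewrite them.
    Replace_Dots      On
```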

If you’re looking for something that solves the problem well enough that you can forget about it, then set up Promtail shipping logs into Loki with Grafana as your UI. This stack is a delight to use, especially because you’re likely already using Grafana with Prometheus.
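In practice you’d let the Helm chart generate this for you, but for the curious, a Promtail config is roughly this shape (a sketch; the Loki URL is a placeholder, and the relabeling mirrors what the official chart does to map pods to their log files on each node):

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /run/promtail/positions.yaml

clients:
  - url: http://loki.logging.svc:3100/loki/api/v1/push  # placeholder

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Attach namespace/pod labels so you can filter in Grafana
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Point Promtail at the log files the kubelet writes for each pod
      - source_labels: [__meta_kubernetes_pod_uid, __meta_kubernetes_pod_container_name]
        separator: /
        target_label: __path__
        replacement: /var/log/pods/*$1/*.log
```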

Stack             Complexity  Innovation  Usability  Weighted score
Elastic + Kibana  14/10       93/10       -397/10    -4/10
Loki + Grafana    2/10        2/10        10/10      9.34/10