Rinaldo Rex • January 4, 2026

The more I develop applications these days, the more I'm inclined to enable meaningful logging and put an observability stack in place, as modern web apps tend to become more and more complex. What are your insights into some of the common mistakes, or some of your tips on the same?

Hey Rinaldo, long time no chat! (Rinaldo used to work at one of REVSYS’s former clients and it was clear he was a kindred spirit with me in tech.)

Like most things in tech, it depends. You certainly want robust logging, but it's easy to go overboard on observability in terms of costs. Let me elaborate on both a little bit... ok, a lot, as I have lots of thoughts in this area.

Logging Best Practices

How to log

You want lots of structured logging.

By structured I mean, instead of:

text
User PK=14 user@example.com logged in successfully

You want logs that look like:

json
{
  "event": "user_login", 
  "user_pk": 14, 
  "user_email": "[email protected]", 
  "success": true
}

See the difference? One is text and one is data. You want data because you’re going to be consuming these logs with something like Kibana or Grafana and ain’t nobody got time to write the logs and then write parsers and transforms for the logs to be useful again.

Just use JSON logging, always. For Python you want to be using Hynek's amazing structlog library. If you use something else, you should have a damn good reason why.
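To make that concrete, here's a minimal sketch of a structlog setup that emits JSON. The processor list is just one reasonable starting point, not the one true config:

python
import structlog

# A minimal structlog configuration that renders every event as JSON.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,    # include any request-scoped fields you bind
        structlog.processors.add_log_level,         # include "level" in the output
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),        # emit one JSON object per line
    ]
)

logger = structlog.get_logger()

# Events are data: an event name plus key/value pairs, no string formatting needed.
logger.info("user_login", user_pk=14, user_email="user@example.com", success=True)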

Reading JSON isn't the most fun, but if your logs are data, you can have your favorite LLM whip up a quick script to pull out EXACTLY what you need to be seeing for the debugging situation you're currently faced with. Prompt Claude for five minutes, tell it to use Typer or Click plus Rich, and you've got a custom, purpose-fit CLI viewer for the situation you're currently facing.
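Here's a hypothetical sketch of what one of those throwaway viewers might look like, assuming your logs are one JSON object per line and you just want to filter on a single event name:

python
# Hypothetical purpose-built viewer for JSON-lines logs: filter by event name
# and show only the fields you care about in a Rich table.
import json
from pathlib import Path

import typer
from rich.console import Console
from rich.table import Table

app = typer.Typer()
console = Console()


@app.command()
def show(logfile: Path, event: str = "user_login"):
    """Print matching events from LOGFILE as a table."""
    table = Table("timestamp", "event", "user_pk", "user_email")
    for line in logfile.read_text().splitlines():
        entry = json.loads(line)
        if entry.get("event") == event:
            table.add_row(
                str(entry.get("timestamp", "")),
                entry["event"],
                str(entry.get("user_pk", "")),
                str(entry.get("user_email", "")),
            )
    console.print(table)


if __name__ == "__main__":
    app()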

Small tangent, but here is an example of using Python Rich in pytest. It's a similar idea to what I'm talking about here: making JSON logs easier for your brain to consume.

As you can imagine, many of these "throwaway tools" become reusable over time and/or the basis of a new dashboard in your observability stack.

Be sure to pick a standard for how you name your events. Try not to get into a situation where Bob writes everything singular and Stacey uses plurals, as it makes writing these scripts and Grafana dashboards a huge pain to sort out. Avoid mixing user_login and logged_out_user; use user_login and user_logout, for example.
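One low-tech way to enforce that (purely a suggestion, and the module name here is hypothetical) is to keep the event names in a single module and import them everywhere, so the convention lives in exactly one place:

python
# events.py -- hypothetical single source of truth for event names.
# Convention: object_action, singular, snake_case (user_login, not logged_out_user).
USER_LOGIN = "user_login"
USER_LOGOUT = "user_logout"
EMAIL_REFRESH = "email_refresh"
EMAIL_VIEWED = "email_viewed"
EMAIL_ARCHIVED = "email_archived"
EMAIL_DELETED = "email_deleted"

# Elsewhere in the codebase:
# logger.info(events.USER_LOGIN, user_pk=user.pk)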

When to log

Typically you want to log just after something happened or should have happened. You can also choose to log just before, but whichever you pick, stay consistent across the entire codebase!

Generally you want to do something like this:

python
try:
    do_something_for_user(user)
    # Log just after the thing happened (or should have happened)
    logger.info(
        "something",
        user_pk=user.pk,
        user_email=user.email,
    )
except Exception as e:
    # Log the failure with the same identifying fields, plus the error itself
    logger.error(
        "something_failure",
        user_pk=user.pk,
        user_email=user.email,
        error=str(e),
    )
    # ... continue to handle the error case appropriately ...

You can also, obviously, push these logs down into do_something_for_user() instead if that's appropriate, but often you're doing a multi-step thing and it's better to log this stuff one level above.

What to log

It’s honestly hard to get too verbose with structured logging, except in terms of what it costs to store, index, and search which I’ll cover next.

Often you want to log a user’s or some data’s journey. If we were building an email client, you want events like:

  • user_login
  • email_refresh
  • email_viewed
  • email_archived
  • email_viewed
  • email_deleted

To represent a user logging in, getting fresh email, viewing two messages, archiving one of them and deleting the other.
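If you're using structlog, one convenient way (an assumption about your setup, not a requirement) to make a journey like this easy to follow is to bind the user once per request, so every event automatically carries the same identifying fields:

python
import structlog

logger = structlog.get_logger()

# Bind the user once, e.g. in middleware after authentication. This requires
# the structlog.contextvars.merge_contextvars processor in your configuration.
structlog.contextvars.bind_contextvars(user_pk=14, user_email="user@example.com")

# Every subsequent event in this request now includes user_pk and user_email.
logger.info("user_login", success=True)
logger.info("email_refresh", new_messages=3)
logger.info("email_viewed", email_pk=101)
logger.info("email_archived", email_pk=101)

# Clear it at the end of the request so nothing leaks between requests.
structlog.contextvars.clear_contextvars()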

Now, *_viewed is a choice here. This might not be something you want to log, and you could instead rely on web or product analytics, but maybe it's important for your use case. Again, consistency is the key.

While you do want consistency, you likely want MORE logging and details in the areas where you have or expect more bugs, where the data and situations are complicated and you're likely going to need to do a bunch of digging in the future.

Got a complex RAG process for airplane manuals where if you return incorrect data you get fined by the government or someone dies? LOTS of logs.

Code that handles recording someone clicking the “Like” button on your personal blog? Maybe one log message or even none.

What to keep

This is where you can start to control the costs of your logging. Contrary to many people’s beliefs, there is no hard and fast rule that you need to keep ALL logs for the same duration.

You might keep successful login history for 24 months and maybe errors only for a month. You likely want to keep logs that are useful business metrics for the trailing 13 months so you can see yearly trends, but you don't likely need to keep them for 10 years.

And for tooling that is super chatty, like kube-proxy, you might just keep logs for 72 hours (or less) to debug live situations like outages.

Balance the amount of logs being generated with how useful they are over time.

Think of them like memories. I want to keep the memory of my wedding forever, but I only need to remember the grocery store shopping list for today and MAYBE tomorrow to prove to my wife she did not, in fact, put strawberries on the list. 🤣

Observability and Metrics

The best stack for this is always evolving. If you had asked me a few years ago I would have said the ELK stack. Last year, I would have said Prometheus, Loki, and Grafana. Lately we've been moving all of our clients to VictoriaMetrics, VictoriaLogs, and Grafana as it's more resource efficient in general.

But don’t sleep on hosted services like Logfire, Axiom, Grafana Cloud, Datadog, and Honeycomb. They are a great way to get started in your observability journey, but some of them, Datadog especially, can get expensive quickly as you scale.

What metrics

You want your general “system” level metrics of CPU, RAM usage, disk I/O, and network I/O whether you’re deploying on a VPS or in Kubernetes.

The rest of what you collect is likely tooling specific (nginx metrics are different than other web servers for example) and your business logic.

It’s helpful to have business object counts and CRUD operations around them. If you see the CPU spiking and it correlates with a huge spike in CustomWidget updates, you have a better idea of what is going on than just “there is a CPU spike.” You have an idea of where to start looking to determine whether this is legit traffic, a bot, a DDoS, or some sort of error condition.
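For example, with the Prometheus client library (just one possible approach; it works with Prometheus or VictoriaMetrics, and the metric name here is hypothetical), a single labeled counter can cover CRUD operations across your business objects:

python
from prometheus_client import Counter

# Hypothetical metric: one counter, labeled by model and operation, so you can
# graph "CustomWidget updates per minute" next to your CPU graph in Grafana.
BUSINESS_OBJECT_OPS = Counter(
    "business_object_operations_total",
    "CRUD operations on business objects",
    ["model", "operation"],
)

# Wherever the update actually happens:
BUSINESS_OBJECT_OPS.labels(model="CustomWidget", operation="update").inc()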

What to keep

If you can, keep a rolling 13 months or more of your metrics so you can see trends over time. If that's too expensive, try to keep a rolling 31 days, again to see trends. Being able to answer "is this normal for Tuesday afternoons?" is very helpful in the moment.

Some stacks allow you to keep rolled-up or aggregated metrics for longer time periods. You don't need 10-second granularity of your CPU usage from 10 months ago; per minute or even per hour is perfectly fine. If your stack can do that, be sure to take advantage of it.

To summarize my thoughts…

If I had to pick, I’d pick logs over detailed metrics. And I’d trade away long retention times before I traded away being chatty with either of them.

It’s often a good idea, every few months or at least once a year, to dig around and turn off any logs and/or metrics that are expensive (to generate or to store) and provide little value.

Pro-Tip: it’s NOT hard to set up a stack like VictoriaMetrics, VictoriaLogs, and Grafana in docker compose right alongside your application, so you can better develop your logs, and so those logs and metrics become useful WHILE you’re developing and not just when deployed.

Hope this helps!
