Rinaldo Rex • January 4, 2026

The more I develop applications these days, the more I'm inclined to enable meaningful logging and put an observability stack in place, as modern web apps tend to become more and more complex. What are your insights into some of the common mistakes, or some of your tips on the same?

Hey Rinaldo, long time no chat! (Rinaldo used to work at one of REVSYS’s former clients and it was clear he was a kindred spirit with me in tech.)

Like most things in tech, it depends. You certainly want robust logging, but it's easy to go overboard on observability in terms of costs. Let me elaborate on both a little bit... ok, a lot, as I have lots of thoughts in this area.

Logging Best Practices

How to log

You want lots of structured logging.

By structured I mean, instead of:

text
User PK=14 user@example.com logged in successfully

You want logs that look like:

json
{
  "event": "user_login", 
  "user_pk": 14, 
  "user_email": "[email protected]", 
  "success": true
}

See the difference? One is text and one is data. You want data because you’re going to be consuming these logs with something like Kibana or Grafana and ain’t nobody got time to write the logs and then write parsers and transforms for the logs to be useful again.

Just use JSON logging, always. For Python you want to be using Hynek's amazing structlog library. If you use something else, you should have a damn good reason why.
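To make that concrete, here's a minimal sketch of a structlog setup that emits JSON. The processor list is just one reasonable starting point, not the one true config:

python
import structlog

# A minimal structlog configuration that renders every event as JSON.
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,    # include any request-scoped fields you bind
        structlog.processors.add_log_level,         # include "level" in the output
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),        # emit one JSON object per line
    ]
)

logger = structlog.get_logger()

# Events are data: an event name plus key/value pairs, no string formatting needed.
logger.info("user_login", user_pk=14, user_email="user@example.com", success=True)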

Reading JSON isn't the most fun, but if your logs are data, you can have your favorite LLM whip up a quick script to pull out EXACTLY what you need to be seeing for the debugging situation you're currently faced with. Prompt Claude for five minutes, tell it to use Typer or Click plus Rich, and you've got a custom, purpose-fit CLI viewer for the situation you're currently facing.
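Here's a hypothetical sketch of what one of those throwaway viewers might look like, assuming your logs are one JSON object per line and you just want to filter on a single event name:

python
# Hypothetical purpose-built viewer for JSON-lines logs: filter by event name
# and show only the fields you care about in a Rich table.
import json
from pathlib import Path

import typer
from rich.console import Console
from rich.table import Table

app = typer.Typer()
console = Console()


@app.command()
def show(logfile: Path, event: str = "user_login"):
    """Print matching events from LOGFILE as a table."""
    table = Table("timestamp", "event", "user_pk", "user_email")
    for line in logfile.read_text().splitlines():
        entry = json.loads(line)
        if entry.get("event") == event:
            table.add_row(
                str(entry.get("timestamp", "")),
                entry["event"],
                str(entry.get("user_pk", "")),
                str(entry.get("user_email", "")),
            )
    console.print(table)


if __name__ == "__main__":
    app()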

Small tangent, but here is an example of using Python Rich in pytest. It's a similar idea to what I'm talking about here: making JSON logs easier for your brain to consume.

As you can imagine, many of these "throwaway tools" become reusable over time and/or the basis of a new dashboard in your observability stack.

Be sure to pick a standard for how you name your events. Try not to get into a situation where Bob writes everything singular and Stacey uses plurals, as it makes writing these scripts and Grafana dashboards a huge pain to sort out. Avoid mixing user_login and logged_out_user; use user_login and user_logout, for example.
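One low-tech way to enforce that (purely a suggestion, and the module name here is hypothetical) is to keep the event names in a single module and import them everywhere, so the convention lives in exactly one place:

python
# events.py -- hypothetical single source of truth for event names.
# Convention: object_action, singular, snake_case (user_login, not logged_out_user).
USER_LOGIN = "user_login"
USER_LOGOUT = "user_logout"
EMAIL_REFRESH = "email_refresh"
EMAIL_VIEWED = "email_viewed"
EMAIL_ARCHIVED = "email_archived"
EMAIL_DELETED = "email_deleted"

# Elsewhere in the codebase:
# logger.info(events.USER_LOGIN, user_pk=user.pk)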

When to log

Typically you want to log just after something happened or should have happened. You can also choose to log just before, but whichever you pick, stay consistent across the entire codebase!

Generally you want to do something like this:

python
try:
    do_something_for_user(user)
    # Log just after the thing happened (or should have happened)
    logger.info(
        "something",
        user_pk=user.pk,
        user_email=user.email,
    )
except Exception as e:
    # Log the failure with the same identifying fields, plus the error itself
    logger.error(
        "something_failure",
        user_pk=user.pk,
        user_email=user.email,
        error=str(e),
    )
    # ... continue to handle the error case appropriately ...

You can also, obviously, push these logs down into do_something_for_user() instead if that's appropriate, but often you're doing a multi-step thing and it's better to log this stuff one level above.

What to log

It’s honestly hard to get too verbose with structured logging, except in terms of what it costs to store, index, and search which I’ll cover next.

Often you want to log a user’s or some data’s journey. If we were building an email client, you want events like:

  • user_login
  • email_refresh
  • email_viewed
  • email_archived
  • email_viewed
  • email_deleted

To represent a user logging in, getting fresh email, viewing two messages, archiving one of them and deleting the other.
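If you're using structlog, one convenient way (an assumption about your setup, not a requirement) to make a journey like this easy to follow is to bind the user once per request, so every event automatically carries the same identifying fields:

python
import structlog

logger = structlog.get_logger()

# Bind the user once, e.g. in middleware after authentication. This requires
# the structlog.contextvars.merge_contextvars processor in your configuration.
structlog.contextvars.bind_contextvars(user_pk=14, user_email="user@example.com")

# Every subsequent event in this request now includes user_pk and user_email.
logger.info("user_login", success=True)
logger.info("email_refresh", new_messages=3)
logger.info("email_viewed", email_pk=101)
logger.info("email_archived", email_pk=101)

# Clear it at the end of the request so nothing leaks between requests.
structlog.contextvars.clear_contextvars()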

Now, *_viewed is a choice here. This might not be something you want to log, and you could instead rely on web or product analytics, but maybe it's important for your use case. Again, consistency is the key.

While you do want consistency, you likely want MORE logging and details in the areas where you have or expect more bugs, where the data and situations are complicated and you're likely going to need to do a bunch of digging in the future.

Got a complex RAG process for airplane manuals where if you return incorrect data you get fined by the government or someone dies? LOTS of logs.

Code that handles recording someone clicking the “Like” button on your personal blog? Maybe one log message or even none.

What to keep

This is where you can start to control the costs of your logging. Contrary to many people’s beliefs, there is no hard and fast rule that you need to keep ALL logs for the same duration.

You might keep successful login history for 24 months and maybe errors only for a month. You likely want to keep logs that are useful business metrics for the trailing 13 months so you can see yearly trends, but you don't likely need to keep them for 10 years.

And for tooling that is super chatty, like kube-proxy, you might just keep logs for 72 hours (or less) to debug live situations like outages.

Balance the amount of logs being generated with how useful they are over time.

Think of them like memories. I want to keep the memory of my wedding forever, but I only need to remember the grocery store shopping list for today and MAYBE tomorrow to prove to my wife she did not, in fact, put strawberries on the list. 🤣

Observability and Metrics

The best stack for this is always evolving. If you had asked me a few years ago I would have said the ELK stack. Last year, I would have said Prometheus, Loki, and Grafana. Lately we've been moving all of our clients to VictoriaMetrics, VictoriaLogs, and Grafana as it's more resource efficient in general.

But don’t sleep on hosted services like Logfire, Axiom, Grafana Cloud, Datadog, and Honeycomb. They are a great way to get started in your observability journey, but some of them, Datadog especially, can get expensive quickly as you scale.

What metrics

You want your general “system” level metrics of CPU, RAM usage, disk I/O, and network I/O whether you’re deploying on a VPS or in Kubernetes.

The rest of what you collect is likely tooling specific (nginx metrics are different than other web servers for example) and your business logic.

It’s helpful to have business object counts and CRUD operations around them. If you see the CPU spiking and it correlates with a huge spike in CustomWidget updates, you have a better idea of what is going on than just “there is a CPU spike.” You have an idea of where to start looking to determine whether this is legit traffic, a bot, a DDoS, or some sort of error condition.
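For example, with the Prometheus client library (just one possible approach; it works with Prometheus or VictoriaMetrics, and the metric name here is hypothetical), a single labeled counter can cover CRUD operations across your business objects:

python
from prometheus_client import Counter

# Hypothetical metric: one counter, labeled by model and operation, so you can
# graph "CustomWidget updates per minute" next to your CPU graph in Grafana.
BUSINESS_OBJECT_OPS = Counter(
    "business_object_operations_total",
    "CRUD operations on business objects",
    ["model", "operation"],
)

# Wherever the update actually happens:
BUSINESS_OBJECT_OPS.labels(model="CustomWidget", operation="update").inc()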

What to keep

If you can, keep a rolling 13 months or more of your metrics so you can see trends over time. If that's too expensive, try to keep a rolling 31 days, again to see trends. Being able to answer "is this normal for Tuesday afternoons?" is very helpful in the moment.

Some stacks allow you to keep rolled-up or aggregated metrics for longer time periods. You don't need 10-second granularity of your CPU usage from 10 months ago; per minute or even per hour is perfectly fine. If your stack can do that, be sure to take advantage of it.

To summarize my thoughts…

If I had to pick, I’d pick logs over detailed metrics. And I’d trade away long retention times before I traded away being chatty with either of them.

It’s often a good idea, every few months or at least once a year, to dig around and turn off any logs and/or metrics that are expensive (to generate or to store) and provide little value.

Pro-Tip: it’s NOT hard to set up a stack like VictoriaMetrics, VictoriaLogs, and Grafana in docker compose right alongside your application, so you can better develop your logs, and so those logs and metrics become useful WHILE you’re developing and not just when deployed.

Hope this helps!
