Getting your money’s worth: making runtime logging more valuable

“Get your money’s worth”

I like this phrase. I hadn’t really stopped to think about it until I wrote this blog post. I unpack it as:

“Get the amount of value you expect to receive for the cost.”

Today I want to write about something that I’ve been thinking about for a long time. Companies spend a lot of money on their logs. A lot. Whether you pay per gigabyte processed, stored, or queried, there exists the universal truth that somehow, despite it being 2022, simply collecting or utilizing logs can be one of the most costly infrastructure tasks out there.

Regardless of how you process or store your logs, the important question is this:
Do you get the amount of value you expect for the cost?

$2,000 USD (2,631 CAD / 2,023 EUR)

This is the average amount a previous employer of mine spent each day on collecting and retaining runtime logs back when I ran the numbers many years ago. Historically, those expenses basically doubled each year, so by now it’s probably much, much higher.

While we did have a serious quantity-of-unhelpful-logs problem that we later addressed, I’m not going to focus on reducing costs at all in this blog post. Instead, I want us to focus on the value we are getting from existing logging. I want us to get our $2,000’s worth each day. Let’s find out how we can!

Avoid formatting variables into your log message

How you usually do it

logging.info("User's favorite ice cream is %s." % flavor)

Imagine someone walks up to you and says,

Can you go into Splunk and chart, over time, the preferences in flavor of our users?

You think for a moment about this, and decide the easiest way to pull this information out of your log messages is to use a regex match in your Splunk search.

| rex field=message "^User's favorite ice cream is (?<flavor>.*).$"

That’s not the worst solution, but it’s also not particularly clean. Now imagine that you log a lot more data than just flavor:

logging.info('{} from company {}, favorite ice cream is {}, referral source is {}.'.format(username, company, flavor, channel))

Your Splunk search to pull out these items now needs regular expressions and looks like this:
| rex field=message "^(?<username>[^\s]+) from company (?<company>.*), favorite ice cream is (?<flavor>.*), referral source is (?<channel>.*)$"

Gross

How you should be doing it

Pretty much all logging solutions support hydrating each log event’s metadata with as many fields as you’d like. The implementation is language-specific, but here’s a Python example:

context = {'flavor': flavor}
logging.info("User's favorite ice cream recorded.", extra=context)

Some languages can natively output JSON from their loggers. Others, like Python, need a library to do it. In any case, research what it takes to log JSON and push the resulting output to your logging tool of choice.
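To make this concrete, here’s a minimal sketch of what “log JSON” can mean in Python, using only the standard library (real projects often reach for a package like python-json-logger instead; the formatter below is illustrative, not the post’s original setup). Anything passed via `extra` becomes a first-class field in the output:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Minimal JSON formatter: standard fields plus anything passed via `extra`."""

    # Attributes every LogRecord carries by default; anything else came from `extra`.
    _STANDARD = set(vars(logging.makeLogRecord({})))

    def format(self, record):
        event = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "metadata": {
                "func": record.funcName,
                "lineno": record.lineno,
                # Merge in the custom fields supplied through `extra`.
                **{k: v for k, v in vars(record).items()
                   if k not in self._STANDARD},
            },
        }
        return json.dumps(event)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("User's favorite ice cream recorded.", extra={"flavor": "vanilla"})
```

Point that handler’s stream at stdout or a file and your log shipper picks up one JSON object per line, ready to be indexed field-by-field.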

Now that we’ve utilized the extra dictionary in the above example, set up JSON log output, and set up ingestion of those logs, let’s take a look at the results in Splunk:

{
    level: info
    message: "User's favorite ice cream recorded."
    metadata: {
        func: create_or_update_preference
        lineno: 45
        flavor: "vanilla"
    }
    timestamp: 2022-06-14T20:57:32.330888Z
    type: log
    version: 2.0.0
}

Now it is much easier to utilize the metadata in Splunk. Let’s find the distinct number of users who have a favorite flavor other than vanilla, and report the results back by flavor:
metadata.flavor!=vanilla | stats dc(message) by metadata.flavor

dc(message)    metadata.flavor
6              chocolate
2              mint
4              cherry

Obviously the above examples are Python-specific, but Go has several libraries for structured logging, such as the very popular Logrus, and Java has these features built in via the MDC. A Google search like “structured logging ruby” should get you started no matter what language you are using.

Avoid dumping object representations straight to logs

One of the things that greatly increases log volume without increasing their value is the dumping of objects directly into the message field.

logging.debug("Found matching user: %s" % str(user))

>>> Found matching user: User{details=Details{requestId=1387225,
usertype=1, accountId=QWNjb3bnRvIGh1bWFuLXJlYWRhYm,
organizationId=QWNjb3vcm1hdC4KCkJhc2U2NCBlbmNvZGl,
solutionType=QWNjb3b1tb25seSB1c2VkIHdoZW4gdGh},
firstName=Matthew, lastName=Sullivan}

Something roughly similar to the log above was sent through our logging pipeline millions of times per day in production. We can assume that this log data helped the product team debug production issues, but with such a large volume of unparsed data, zeroing in on a specific user or organization would be extraordinarily difficult, and Splunk search performance suffers substantially because of the large message field. Additionally, producing a report or creating a scheduled alert would require significant regular-expression work.
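The fix follows directly from the previous section: instead of dumping the whole object’s repr, log only the identifiers you would actually search on, as structured fields. A sketch (the `user` object and its attribute names below are hypothetical stand-ins, not the real model):

```python
import logging
from types import SimpleNamespace

# Hypothetical stand-in for the real user object; names/values are illustrative only.
user = SimpleNamespace(
    details=SimpleNamespace(request_id=1387225, account_id="acct-123",
                            organization_id="org-456"),
    first_name="Matthew",
)

# Log only the searchable identifiers, not the entire object dump.
logging.getLogger(__name__).debug(
    "Found matching user.",
    extra={
        "request_id": user.details.request_id,
        "account_id": user.details.account_id,
        "organization_id": user.details.organization_id,
    },
)
```

The message field stays tiny and constant, and each identifier becomes its own indexed field you can filter on directly.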

Avoid using log data as performance telemetry

We utilized a number of services for collecting metrics around application performance. Sometimes, developers were also sending timing/tracing and telemetry-type data through the runtime logging pipeline. This was problematic because that data clearly should have been going to a purpose-built tool, such as New Relic or Datadog.

logging.info("Query ran for a total of %d ms" % query_time)
>>> Query ran for a total of 43 ms

The log above was sent through the pipeline tens of millions of times per day in production. While Splunk has a number of very powerful visualization tools, teams should have been using a tool better suited to this type of mass data collection and visualization. If you really must log this type of data, consider sending only a small representative sample:

import logging
import random

LOG_SAMPLE_RATE = 0.01  # 1%

def sample(message, level='debug', extra=None):
    # random.random() returns a float in [0.0, 1.0)
    if random.random() > LOG_SAMPLE_RATE:
        return

    # Copy rather than mutate the caller's dict (and avoid a mutable default argument)
    extra = dict(extra or {})
    extra['sample_rate'] = LOG_SAMPLE_RATE
    getattr(logging, level)(message, extra=extra)

Even this is kind of strange and gross, so if you are going to do it, do it on a temporary basis and then eventually iterate it out of existence.
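If you want the sampling without a special `sample()` helper at every call site, a somewhat cleaner variant is to hang it off a standard `logging.Filter`, so callers keep using ordinary logging calls. This is a sketch of that approach, not the original pipeline’s code:

```python
import logging
import random

class SampleFilter(logging.Filter):
    """Randomly drops records so only roughly `rate` of them get through."""

    def __init__(self, rate=0.01):
        super().__init__()
        self.rate = rate

    def filter(self, record):
        # Annotate survivors so downstream analysis knows the sampling rate.
        record.sample_rate = self.rate
        return random.random() < self.rate

timing_logger = logging.getLogger("timing")
timing_logger.addFilter(SampleFilter(rate=0.01))

# Roughly 1 in 100 of these calls will actually be emitted.
timing_logger.info("Query finished.", extra={"query_ms": 43})
```

Because the filter is attached to one dedicated logger, the rest of your logging is unaffected, and deleting the filter later is a one-line change.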

Parting thoughts

As applications and cloud workloads continue to grow in both their size and complexity, it’s critical that you have the right tools in place, and that you know how to maximize the value you can derive from those tools. Consider how you can add more value to your runtime logs in order to detect problems and glean valuable data about customer interactions with your platform. A day spent investing in log value will pay dividends to your teams, your support engineers, and your customers.

I’d like to quote something a colleague of mine mentioned while reviewing this post, which I think is a very valuable insight:

Another dimension of cost is the time it takes to diagnose an issue in production. We spend money and time on logging to reduce time (and, by extension, money) spent in the future. Good logs ensure production issues are diagnosed quickly, and that errors encountered during development are obvious. The engineering trade-off is minimizing the total number of log messages per request while maximizing visibility into execution.

As in all application or system development, tooling will only take you so far. It’s the quality of the data going into those tools that will make the biggest impact at the end of the day. Don’t be afraid to push back on your product or project managers; it’s your job to help educate them about the value good log hygiene will provide in the long term. Maybe share this post with them 😉
