
March 2023 - General Purpose Troubleshooting Principles

Written by Brie Carranza

This is the written version of my Troubleshooting like Batman talk.

Hello, world!

This month, I took some time to write a version of my Troubleshooting like Batman talk I gave at ElevateCX in Denver in autumn 2022. If you are a support engineer or someone who works in technical support, these should sound familiar as I describe them. With my talk and this blog post, I want to encourage a conversation about troubleshooting principles. By identifying these tools and understanding how and when to use them, you can go from New to Solved more effectively! 🎫 I can’t make it easy but I hope this helps to make it easier.

— Brie

Troubleshooting like Batman

I realized that I reach for the same set of general-purpose troubleshooting tools when solving problems in everything from OpenAFS to ZFS, in bzr, cvs, hg, git, AND svn, as well as a great many other things. In no particular order, here are the principles I chose:

  • Clarify the problem statement
  • Ask questions
  • Reproduce the behavior
  • Reduce complexity
  • Increase verbosity
  • Reset expectations

A Few Notes

  • These are six of the tools in your troubleshooting toolbox
    • there are others
  • Your job is figuring out:
    • when to use which principle
    • how to apply it to the specific situation in front of you

These tools work best when used in conjunction with one another, much like a hammer and a saw. As you diagnose and work towards resolving a problem, you’ll need to adjust, putting one tool down to pick up another. Oh, these are my thoughts and opinions and not those of $EMPLOYER.

Clarify the problem statement

Onstage, I put it this way:

If you cannot clearly articulate the problem: how will you know that you solved it?

Make sure you can clearly describe the problem that you are trying to solve and make sure the customer agrees with your problem statement.

  • “Here’s what I think the problem is… is that right?”
  • “It sounds like the problem is… let me know if I have misunderstood.”

This principle is closely linked with reproduce the behavior. The process of reproducing the behavior should be one of the things that helps you to clarify the problem statement.

Ask questions

I am a staunch advocate of a healthy sense of curiosity.

  • How do you know that?
  • Why do you think that?
  • How is this supposed to work?
  • What impact should this have?
  • How do we expect this to fail?
  • Can I reproduce this behavior?

If you don’t understand why something isn’t working: ask. If you don’t understand why something is working: ask.

There are a lot of people, places and things to ask:

  • The docs
  • Yourself
  • Your team
  • Another team in your organization
  • The customer
  • The Internet
    • Your search engine of choice
    • The appropriate community forum, mailing list, IRC, Mattermost, Slack, Gitter, Discord, etcetera.
    • ChatGPT (keep in mind just how confidently wrong it can be)

As you do research, refine your questions. The better you get at asking questions, the easier it is for other people to help you.

It’s OK to have a lot of questions: prioritize them!

When you don’t know what to do

It is OK to not know the answer! When faced with a ticket, case or issue, remember that the person who is asking doesn’t know the answer either: that’s OK! Let them know that you’re in it together:

I don’t know but here is what I am going to do to find out.

Remember:

✨ What you know matters; what you can figure out will get you home!

Reproduce the behavior

Can you break the system in the same way as the person you’re helping?

  • Yes? Good! Now…can you fix it?
  • No? Hmm. Why not?
  • Sometimes? Intermittent, yay!

Sometimes, reproducing the problem is pretty straightforward:

  1. Attempt to browse to https://whatever.example.con
  2. Observe that the page never loads

It’s not always that obvious! If you cannot immediately reproduce the undesired behavior that you are seeking to change: it’s OK!

In those situations, imagine you are writing the “steps to reproduce” section of a good bug report. At a minimum, you should assemble an ordered list of the steps a colleague would need to take to replicate the undesired behavior described in a ticket.

Something interesting (and good) happens to me often when I do this: I realize that the problem statement was missing some pretty critical information!

🎫 hi, I get limited results back from the API, I’m using this call but it’s not returning all the data can you help?

You go to the docs and…the API call that the customer is using is structured correctly. Now what?

This is where our jobs are fun: experiment with some technology! Get hands-on! As you start to test the call yourself, you realize that you need to know what kind of token is being used to authenticate that API call. Hold on: is any data being returned? What is the HTTP response status code? Perhaps the permissions associated with the token only permit access to a limited set of data. We don’t really know anything about the data that is being returned: maybe it’s an error message and that’s not clear to the customer. (This happens more than you might think but I don’t know you.)

You now have a few good questions to ask the customer to help clarify the problem statement.
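To make those questions concrete, here is a minimal Python sketch of the facts worth capturing when you test the call yourself. Everything in it is an assumption for illustration: `summarize_response` is a hypothetical helper, and the `error` and `message` keys are just a common convention, not something any particular API guarantees.

```python
import json


def summarize_response(status_code, body_text):
    """Summarize an HTTP response so the key facts are easy to paste into a ticket."""
    summary = {"status": status_code, "ok": 200 <= status_code < 300}
    try:
        data = json.loads(body_text)
    except ValueError:
        # Not JSON at all: possibly an HTML error page or an empty body.
        summary["kind"] = "non-json"
        return summary
    if isinstance(data, list):
        # A list of results: the count tells you whether "limited" means few or none.
        summary["kind"] = "list"
        summary["count"] = len(data)
    elif isinstance(data, dict) and ("error" in data or "message" in data):
        # Assumed convention: errors come back as an object with one of these keys.
        summary["kind"] = "error-object"
        summary["detail"] = data.get("error") or data.get("message")
    else:
        summary["kind"] = "object"
    return summary
```

If the summary comes back as an error object, or as a list that is shorter than expected, you have something specific to show the customer instead of a vague "it looks fine to me".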

Reduce complexity

When people describe problems with technology, a lot of detail and complexity can be involved. A useful technique is to reduce complexity by removing some of the moving parts and checking whether you can still reproduce the behavior.

  • If you can reproduce the reported problem in the simpler environment: awesome! Fix it there and then check how that solution applies in the real environment.
  • If you can’t reproduce the reported problem in the simpler environment: it’s time to start methodically exploring the differences between the two environments. This is still a good thing: you have eliminated some possible causes already.
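One way to explore those differences methodically is to write both environments down as key/value settings and compare them. This is only a sketch under assumptions: `diff_environments` is a hypothetical helper, and the setting names in the usage note are invented for illustration.

```python
def diff_environments(simple, real):
    """Return the settings that differ between two environments, keyed by name."""
    missing = object()  # sentinel so a setting explicitly set to None still compares correctly
    differences = {}
    for key in sorted(set(simple) | set(real)):
        a = simple.get(key, missing)
        b = real.get(key, missing)
        if a != b:
            differences[key] = (
                "<absent>" if a is missing else a,
                "<absent>" if b is missing else b,
            )
    return differences
```

For example, comparing `{"auth": "token", "tls": True}` with `{"auth": "ssh-key", "tls": True, "lb": "enabled"}` reports only `auth` and `lb`. Each entry in the result is a candidate cause to test one at a time as you build the complexity back in.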

Here’s an example that I think is pretty common:

If you can reproduce the problem without the keys: cool. Keep investigating but take note of this deviation from the problem statement and keep it in mind as you proceed.

  • 🔖 Plan to build the complexity back in.

I am honored to work alongside a group of incredible engineers every day. I have to take a moment to address our humanity a bit here. Sometimes, I am absolutely in awe at the challenge before many of us who troubleshoot:

Find and solve a tricky problem in a complicated environment that one person set up and another person maintains. You are neither of those people and oh yeah: fix it yesterday.

Give it your all and give yourself grace. On a long enough timeline, every ticket is either Closed or not yet Closed. This too shall pass.

Increase verbosity

-vvv

The goal is to get more information from the system that is misbehaving. Think about this broadly: look at the client, the server and the relevant systems and networks in-between. Adjust your scope over time.

This is an extension of ask questions.

Adjust the configuration so that the log verbosity is increased. 🎗 Save yourself or the customer a bad day: remember to return the verbosity to the original setting as you close things out.
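When the system you are poking at happens to be a Python program, one way to make the "put it back" step automatic is a context manager around the standard `logging` module. This is a minimal sketch, not a recommendation for any specific product: `temporary_verbosity` is a hypothetical helper name and the logger name in the usage note is made up.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_verbosity(logger, level=logging.DEBUG):
    """Raise a logger's verbosity for the duration of a block, then restore it."""
    original = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the block raises, so the original setting always comes back.
        logger.setLevel(original)
```

Wrapping the investigation in `with temporary_verbosity(logging.getLogger("demo-app")):` means nobody has to remember the cleanup step. For a real service, the equivalent is usually a config change plus a note in the ticket recording the original value.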

A few notes on log analysis

Plan to spend some time looking through logs. Take your time. Take notes.

Stop and ask: am I down a rabbit hole? Should I zoom out and follow another thread?

🎵 An old infosec and sysadmin pro tip: make sure you know what normal logs look like. It’ll make log analysis during an incident or outage much smoother. (It can be incredibly frustrating to spend a lot of time chasing after a spurious error message.)
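Here is a rough Python sketch of that pro tip: keep a sample of logs from a normal day and use it to filter out messages that show up all the time. The function names are hypothetical, and the normalization (replacing hex values and digits with placeholders) is a deliberately crude assumption that you would tune for your own log format.

```python
import re


def normalize(line):
    """Collapse the variable parts of a log line so similar messages compare equal."""
    line = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", line)
    line = re.sub(r"\d+", "<N>", line)
    return line.strip()


def unusual_lines(baseline_lines, incident_lines):
    """Return incident lines whose normalized form never appears in the baseline."""
    known = {normalize(line) for line in baseline_lines}
    return [line for line in incident_lines if normalize(line) not in known]
```

A warning that appears in both the baseline and the incident logs is probably not your culprit. It is still worth a note, but it drops down the priority list.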

Reset Expectations

In improv comedy, there’s the concept of “yes, and”. When helping to solve problems, it is sometimes helpful to think about “no, but…”.

Note: I know there are lots of opinions about whether or how to say “no” to a customer. This is not about what you actually communicate to the customer: this is a way to frame things. The goal here is to be clear while continuing to be helpful. I think the most important thing is what comes after the “but”. It’s a chance to start over: we can’t do that but here’s what we can do. Doesn’t the “here’s what we can do” part sound promising?

It usually comes out looking more like this:

With the current configuration, it is not possible to submit those GET requests via HTTP. However, you could either:

  • Configure the Web server to accept incoming requests on port 80
  • Modify the code to submit the requests via HTTPS on port 443

The first option would require performing some system maintenance. The second option would be less impactful in production and has the added benefit of an encrypted connection. Please let me know how you would like to proceed.

I also like to think about it in terms of today and tomorrow:

While it is not possible to do x today, here are a few options for how we could solve this in a slightly different way “tomorrow”.

There are some situations where there are no good or easy answers presently available. Helping to (re)set expectations by letting folks know what we can do today, what we cannot do today and how we might be able to do it tomorrow (in the future) can be super helpful.

This requires some creativity and may involve getting more information about the business impact and larger context for the problem you are trying to solve.


George Barris, Public domain, via Wikimedia Commons

Troubleshooting like Batman

If you got this far, you might be wondering:

…but what does this all have to do with Batman?

Your patience will be rewarded shortly.

I opened my talk by asking the audience:

Batman is (arguably) the greatest superhero. Yet, he has no super powers. How does he do it?

The money certainly helps but he’s been effective despite losing his money in the past. I think the answer is his toolbelt and how he uses it.

Batman is (arguably) the greatest superhero because of how effectively he uses the various tools and resources that are available to him in his toolbelt. Like me and you, Batman has the ability to call on teammates so awesome, they just might be superheroes.

Get out there and troubleshoot like Batman.

If you want to keep reading on this topic, please see:

I had to draw a line somewhere for the talk and these are the six I chose. I am available and around if you want to tell me what you would add or remove from this list. This is one of my absolute favorite topics so please do feel free to reach out to discuss troubleshooting. Send me interesting links, write-ups, toots and tweets.

As always: be well!

— Brie 🌈 🦄

Lee and Sarah: thank you.