The devil is in the details – How to find it

Have you ever been in a position where, after trying to track down a bug, the best outcome you could report was ‘it does not reproduce’, ‘it works on my machine’, ‘it does not make sense’, or ‘we’ll check once more if it happens again’?

Every developer has, at some point, had to figure out why the application they were working on did not behave as expected. Regardless of the nature of the application, there are a few steps that you should follow.

Where to start?

Make sure you fully understand the issue. If it is a new topic for you, take your time to familiarize yourself with the prerequisites and the underlying conditions that could potentially trigger the behavior. If you go into it blind, your chances of success are already slim.

Do not jump headfirst into the code, even though the most common bugs do come from mistakes made during development. If you cannot find the source there, you might be tempted to give up out of haste or frustration. And if a root cause investigation was opened for the issue, it is probably a deeper concern than a simple coding mistake.

Collect information

Once, I remember we had an issue with some bulk inserts failing due to a schema change. It was very frustrating because it required a lot of manual work to fix corrupted data. We tried adding extra logging, isolating the operation inside a high-priority transaction, but nothing seemed to work or provide us any insight into what was happening, and, of course, the issue could not be reproduced locally.

It turned out that the maintenance plan that backed up the transaction logs of the database also enabled and disabled CDC on the database every hour and locked the system login table to recreate its user, which resulted in a schema change. We only managed to link the events together after we performed another investigation into some deadlocks, but we certainly learned an important lesson:

Try to find as much information as you can about the circumstances of the incident, even if it seems insignificant. Potential information might include steps to reproduce, logs, time and date, values of resources, database entries or flags, activity inside or outside of the application, and, of course, the code.

Always keep your eyes open for any activity that took place at the time of the event, whether it came from a user or from other running processes. The interleaving of threads is common and difficult to diagnose.

Classify your information

Now that you have a stack of information about everything that took place, you must filter it. Some pieces might not be part of your puzzle. Do not try to formulate a hypothesis just yet.

During the evaluation of records, you need to categorize your information into groups: circumstantial, documentary, demonstrative, unreliable, and real evidence.

Circumstantial: an account of events that might have triggered a behavior, such as the steps to reproduce an issue. This type should not be taken at face value, as it relies on individual experience (a user says he clicked a button and the application crashed; was it really the button?). However, do not disregard it! (The user clicked a button, which started a process that deadlocked somewhere, and the application crashed.)

Documentary: information that is recorded in some way: logs, traces, activity monitoring. It will provide a better insight into the flow of the event, based on which you might deduce other actions that were triggered.

Demonstrative: evidence that helps support the context of other evidence, such as graphs of CPU or memory usage (from which you might suspect an out-of-memory exception).

Unreliable: information that can change over time, such as a database entry that was updated between the event and the time of the investigation. One of the traps you can fall into is basing your theory on unreliable information. It will most likely derail your investigation.

Real: an indisputable piece of evidence that points to the source of the problem, such as a line of code or a misconfiguration.

Piece it together

Your information is now grouped by reliability. Go through it from top to bottom, look for possible scenarios that might fit, and examine each one.
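
As a rough illustration, the grouping can be kept as tagged notes and sorted before you start reading through them. This is only a sketch: the ranking of the categories (real first, unreliable last) is an assumption, as are the `Evidence` and `Finding` names and the sample findings.

```python
from dataclasses import dataclass
from enum import IntEnum


class Evidence(IntEnum):
    # Lower value = examined first. This ranking is an assumption,
    # not something the categories themselves dictate.
    REAL = 1
    DOCUMENTARY = 2
    DEMONSTRATIVE = 3
    CIRCUMSTANTIAL = 4
    UNRELIABLE = 5


@dataclass(frozen=True)
class Finding:
    description: str
    kind: Evidence


def by_reliability(findings):
    """Return the findings ordered from most to least reliable."""
    return sorted(findings, key=lambda finding: finding.kind)


# Hypothetical findings, for illustration only.
findings = [
    Finding("User reports a crash after clicking a button", Evidence.CIRCUMSTANTIAL),
    Finding("Status flag that may have changed since the incident", Evidence.UNRELIABLE),
    Finding("Application log shows a deadlock error", Evidence.DOCUMENTARY),
    Finding("Memory graph climbs steadily before the crash", Evidence.DEMONSTRATIVE),
]

ordered = by_reliability(findings)
for finding in ordered:
    print(f"{finding.kind.name:<15} {finding.description}")
```

Keeping the category next to each note makes it harder to quietly promote an unreliable item into the foundation of a theory.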

Envision the events as individual elements and make a note of each one. When you combine them, you will gain a better understanding of how they depend on each other.

Try to create a timeline of the incidents that led to a specific behavior. Place the proof that supports an event below it to create a picture of what happened.
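
A timeline like this can be sketched by merging events from every source into one list ordered by time. The events below are hypothetical, loosely modeled on the bulk insert story above, and the sketch assumes that the clocks of the different systems are synchronized.

```python
from datetime import datetime

# Hypothetical events as (timestamp, source, what happened) tuples.
app_log = [
    (datetime(2024, 1, 10, 14, 0, 7), "app log", "bulk insert failed"),
    (datetime(2024, 1, 10, 13, 59, 58), "app log", "bulk insert started"),
]
job_history = [
    (datetime(2024, 1, 10, 14, 0, 0), "job history", "log backup job started"),
    (datetime(2024, 1, 10, 14, 0, 3), "job history", "schema change recorded"),
]


def build_timeline(*sources):
    """Merge events from all sources into one list ordered by timestamp."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


timeline = build_timeline(app_log, job_history)
for when, source, what in timeline:
    print(f"{when:%H:%M:%S}  [{source}] {what}")
```

Seeing the maintenance activity land between the start and the failure of the insert is exactly the kind of link that is easy to miss when each source is read on its own.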

If the issue presents a repetitive behavior, reconstruct other occurrences as well to see if you can identify a pattern between the scenarios.
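
One simple way to look for such a pattern is to bucket the occurrences by some property of their timestamps. The sketch below counts failures per minute of the hour; the timestamps are hypothetical, and the choice of bucket is an assumption that fits an hourly trigger but not every recurring cause.

```python
from collections import Counter
from datetime import datetime

# Hypothetical failure timestamps collected from several occurrences.
failures = [
    datetime(2024, 1, 10, 9, 0, 12),
    datetime(2024, 1, 10, 11, 0, 5),
    datetime(2024, 1, 10, 14, 0, 7),
    datetime(2024, 1, 11, 8, 0, 31),
    datetime(2024, 1, 11, 16, 23, 40),
]


def minute_histogram(timestamps):
    """Count how many occurrences fall on each minute of the hour."""
    return Counter(ts.minute for ts in timestamps)


histogram = minute_histogram(failures)
for minute, count in histogram.most_common():
    print(f"minute :{minute:02d} -> {count} failure(s)")
```

A cluster at the top of the hour would suggest looking at anything scheduled hourly, while the single outlier is a reminder that one pattern may not explain every occurrence.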

Try to reproduce it

Once you have all the circumstances of the issue, you can try to reproduce it. Make sure you replicate the environment as accurately as possible. If you are working with a batch of data, do not take the shortcut of investigating only the specific element that is causing trouble; the issue might lie within the group, not the individual.

Follow the code line by line, even if you have an idea of what is happening. Sometimes default settings might trick you.

If you think you still do not have a clear vision, do not be afraid to introduce more logging and revisit the investigation after you have additional information.
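
As a minimal sketch of what extra logging could look like, the example below uses Python's standard `logging` module with timestamps and thread names in every record, so the new entries can later be placed on a timeline. The logger name, the `insert_batch` function, and the `do_insert` callable are hypothetical stand-ins for whatever operation you are investigating.

```python
import logging

# Timestamps and thread names make the records usable in a timeline later.
logging.basicConfig(
    level=logging.DEBUG,
    format="%(asctime)s %(levelname)s [%(threadName)s] %(name)s: %(message)s",
)
log = logging.getLogger("bulk_insert")  # hypothetical logger name


def insert_batch(rows, do_insert):
    """Run do_insert(rows) and leave a trail before, after, and on failure.

    do_insert is a hypothetical callable standing in for the real operation.
    """
    log.debug("starting batch, row_count=%d", len(rows))
    try:
        inserted = do_insert(rows)
    except Exception:
        # log.exception records the full traceback along with the message.
        log.exception("batch failed, row_count=%d", len(rows))
        raise
    log.debug("finished batch, inserted=%d", inserted)
    return inserted
```

Calling `insert_batch(rows, do_insert)` behaves like calling `do_insert(rows)` directly; the exception is re-raised after being logged, so the added logging does not change what the caller sees.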

It’s ok to ask for help

If you have struggled for days to no avail, reach out to your colleagues. Sometimes a fresh pair of eyes, or a more experienced one, can help.

Collaboration is key to solving problems!

Anti-patterns for investigations

Geronimo: keep your steps and facts organized. Once you find a piece of the puzzle that you believe is the smoking gun, do not put all your bets on it. Keep your line of thought and investigate all the way, otherwise you might jump to the wrong conclusion.

Mad chicken: do not run around pulling information from everywhere, even if it is unrelated, just because you lack solid proof. Your conclusions might go completely off track.

Peacock: use your logic. Do not formulate a theory before you even begin, based only on your experience or on hearsay, otherwise you might see only the evidence that supports your opinion and disregard other clear tracks.

Sloth: even if it takes more time, do not assume anything. Verify everything with your own eyes, or you might walk right past the culprit.

Passer: own your investigation. Do not throw it to someone else just because you could not figure it out. It is not supposed to be easy; if it were, it would have already been solved.

Denier: do not proceed by taking a specific functionality for granted. Every event is a potential suspect, even if it comes from a third party.

But he just knew!

Have you ever met someone who could just smell the smoke and tell you exactly where it was coming from? With experience comes instinct. By completing multiple investigations like this, you learn to filter the information better and go straight to the source.

Yes, performing all the steps described can be time-consuming, but as you exercise this pattern of thinking, you will become proficient, and it will become second nature. With enough practice, you will start doing all of this in your head. Until then, be thorough and relentless.
