The Art of Troubleshooting¶
As technicians, we often have to solve some difficult problems. I've had my fair share of whoppers, where I thought the world was coming crashing down. After a while, I started learning a process for dealing with technical problems.
Understand your problem¶
You can't begin to solve an issue if you're not yet clear on what the actual problem is. Assumptions are often made about the issue, so spend a few minutes to clarify exactly what it is.
WHAT happened?¶
Spend a few minutes to clarify what the actual issue is. Teams are quick to focus on the impact, and while that's important, it's not the issue we are trying to address. Stay focussed. Server X crashed, website Y went down - be specific about the actual thing that occurred that led to this investigation.
HOW did it happen?¶
What do you know about the event? Maybe someone was performing a routine change, and upon loading the change, the system died. Maybe it was completely random. If you don't know how it happened, that's OK. We'll circle back to this as we start to uncover more details about the issue.
WHEN did it happen?¶
Timelines are key. Start recording a log of what happened when. Stay factual. On this day, change X was applied; then on this day, we observed Y in the log. Keep it concise and to the point. Reference actual dates and times - there will be cases where one specific action led to another, and having an accurate timeline of what happened when is important.
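A timeline doesn't need fancy tooling - even a small script or spreadsheet that keeps events in chronological order with real timestamps will do. A minimal sketch, using made-up events:

```python
from datetime import datetime

# Hypothetical incident timeline: (timestamp, what was observed or done).
timeline = [
    (datetime(2024, 3, 4, 14, 5), "Change X applied to the server"),
    (datetime(2024, 3, 4, 14, 22), "Error Y first observed in the application log"),
    (datetime(2024, 3, 4, 14, 30), "Incident raised, investigation started"),
]

# Keep the log sorted so the cause-and-effect ordering stays obvious.
for timestamp, event in sorted(timeline):
    print(f"{timestamp:%Y-%m-%d %H:%M} - {event}")
```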
Warning
When working with different systems, make sure that your system clocks are synchronised. There's nothing worse than trying to correlate logs and finding that times are out by a few minutes, or that system timezones do not line up.
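If you're not sure whether a host's clock is trustworthy, a quick comparison against an NTP server settles it. A minimal sketch, assuming the third-party ntplib package is installed and pool.ntp.org is reachable from the host:

```python
# Minimal clock-drift check. Assumes the third-party "ntplib" package is
# installed (pip install ntplib) and that pool.ntp.org is reachable.
import ntplib

client = ntplib.NTPClient()
response = client.request("pool.ntp.org", version=3)

# response.offset is the difference (in seconds) between the local clock
# and the NTP server's clock.
print(f"Local clock offset: {response.offset:.3f} seconds")
if abs(response.offset) > 1:
    print("Warning: clock drift is large enough to skew log correlation")
```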
What was the IMPACT?¶
You had some sort of business impact, but how big was it? Assessing the impact is an important step, as this will help drive priority and remediation tasks.
You can list things like:
- Total outage time
- What portion of the business was impacted
- What was the perceived revenue loss
- How many users were impacted
- What was the total cost of recovery / restoration
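Rough numbers beat adjectives when describing impact. A minimal sketch with entirely hypothetical figures - the outage duration, user count, revenue rate, and recovery cost are placeholders you'd swap for your own:

```python
# Rough impact estimate with hypothetical numbers; replace with real figures.
outage_hours = 6.5                 # total outage time
affected_users = 1_200             # how many users were impacted
revenue_per_user_per_hour = 4.0    # perceived revenue per user per hour
recovery_cost = 8_000.0            # staff time, vendor callouts, etc.

estimated_revenue_loss = outage_hours * affected_users * revenue_per_user_per_hour
total_estimated_cost = estimated_revenue_loss + recovery_cost

print(f"Estimated revenue loss: {estimated_revenue_loss:,.2f}")
print(f"Total estimated cost of the incident: {total_estimated_cost:,.2f}")
```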
Something changed¶
What led to this issue? The primary root cause in most scenarios is that something changed. Something that was previously working correctly is now not working anymore.
Defective component¶
It is entirely possible that a defective component is leading to the issue. Defective components are usually easy to spot. There will be obvious signs that something is not quite working as it should.
Design flaw¶
An architectural design flaw in your solution could also be the culprit. As an example, a system that crashes due to a lack of disk space because it failed to clear its temporary cache folder can be seen as a design flaw.
What changed?¶
Work with your Change Management team to isolate any changes that may have occurred in the lead-up to the incident. You can also review the logs of the system in question to see who logged on around the time of the outage, what processes were running at the time, and what could have led to this issue.
Mortal enemies¶
Check the obvious¶
Is everything working the way it should? I've learnt to use commands like df, top, ps, iostat, vmstat, ping, nslookup and tracert effectively to see if the issue is with the infrastructure, a process, the OS, or the network. So check the obvious.
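If you find yourself running the same handful of commands at the start of every incident, it's worth scripting them. A minimal sketch that shells out to a few of the usual suspects - the command list and the test address are assumptions, and flags like ping -c are Linux/macOS specific:

```python
import shutil
import subprocess

# Quick first-pass health check: run the usual commands and dump their output.
# The commands and the test address (8.8.8.8) are assumptions; adjust per platform.
checks = [
    ["df", "-h"],                    # disk space
    ["ps", "aux"],                   # running processes
    ["ping", "-c", "3", "8.8.8.8"],  # basic network reachability
]

for command in checks:
    if shutil.which(command[0]) is None:
        print(f"--- {command[0]}: not installed, skipping ---")
        continue
    print(f"--- {' '.join(command)} ---")
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    print(result.stdout or result.stderr)
```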
Has This Happened Before?¶
Before diving into troubleshooting, check your documentation and knowledge base. A quick search may reveal that this issue has occurred previously and that a solution or workaround already exists. Leveraging past experiences can save time and effort. Do not overlook the possibility that someone else has encountered and resolved this problem before. Reviewing historical records can also help identify recurring patterns and prevent similar issues in the future.
Assumptions¶
Don't make any assumptions. Just because you think it's not DNS, don't assume it's not DNS. Go check and confirm it. The same goes for those firewall rules, or any other config for that matter. While you may not have all the information yet, you do know that there are some system dependencies your solution is relying on. Make sure that they are all working as expected.
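Checking a dependency usually takes less time than arguing about whether it could be the problem. A minimal sketch that confirms a name resolves and a port answers - the hostname and port below are placeholders:

```python
import socket

# Placeholders: substitute the dependency you are actually relying on.
host = "db.example.internal"
port = 5432

# Don't assume DNS is fine; resolve the name and confirm the port answers.
try:
    addresses = {info[4][0] for info in socket.getaddrinfo(host, port)}
    print(f"{host} resolves to: {', '.join(sorted(addresses))}")
    with socket.create_connection((host, port), timeout=5):
        print(f"TCP connection to {host}:{port} succeeded")
except OSError as error:
    print(f"Dependency check failed for {host}:{port} -> {error}")
```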
Fixating on one thing¶
During your investigation, you may identify a possible cause for your issue. Be careful that you do not get so fixated on proving that cause that you lose sight of the bigger issue. I have had cases where two unrelated things went wrong at the same time, causing confusion during the process.
Keep it factual. If you think a specific component is at fault, test it, and if proven right, work on resolving it.
Analysis Paralysis¶
Being overwhelmed and stressed in a critical incident can lead to a lack of decisions and action. When presented with too much data and too many options, the team may fall into the trap of not being able to move forward at all. Prioritise what makes sense, and start eliminating possibilities one by one.
Being a Cowboy¶
Being a "cowboy" refers to the process of just logging to the system, and doing things in a reckless manner, like backing up and restoring databases without any real evidence, simply taking action because you "think" it is the right thing to do. Cowboy-mode can lead to more problems than you started with. Plan your remediation actions carefully.
Stakeholder Interference¶
We understand the impact - no one has to tell you how important this is. Stakeholders breathing down your neck to resolve the issue while you're trying to stay technical can be a real challenge. If you're the leader managing the team responsible for the remediation, provide that cover by engaging with the stakeholders yourself, allowing the technical team to focus on the technical work.
No decision makers¶
Certain technical decisions can have significant consequences, potentially affecting system stability, security, or compliance. When critical decisions impact remediation efforts, they must be clearly defined, well-documented, and reviewed to ensure alignment with business objectives. Approval from the appropriate business owners is essential to mitigate risks and ensure accountability in the decision-making process.
Root cause analysis¶
5-Why¶
5-Why Analysis is a process where you keep asking WHY questions (about five times) until you isolate why a particular problem exists. Here's an example showing why a car won't start.
graph TD
A[Problem: Car won't start] --> B[Why 1: Battery is dead]
A --> B1[Why 1: Out of fuel]
A --> B2[Why 1: Starter motor failure]
B --> C[Why 2: Alternator is not charging the battery]
B --> C3[Why 2: Battery is old and needs replacement]
C --> D[Why 3: Alternator belt is broken]
D --> E[Why 4: Belt was worn out and not replaced]
E --> F[Why 5: Lack of regular maintenance]
B1 --> C1[Why 2: Forgot to add fuel]
B1 --> C2[Why 2: Someone stole the fuel!]
C2 --> D1[Why 3: No security measures in place]
B2 --> C4[Why 2: Electrical wiring issue]
B2 --> C5[Why 2: Worn-out starter motor]
C5 --> D2[Why 3: Excessive cranking over time]
D2 --> E2[Why 4: Ignored early warning signs]
Note
Learn more about 5-Why from the Learn Lean Sigma site
Fishbone¶
Another favourite is the Fishbone diagram. Unlike the 5-Why, the Fishbone focuses more on cause and effect.
graph LR
A[Problem: Car won't start]
A --- B[Fuel Problems]
A --- C[Mechanical Failures]
A --- D[Electrical Issues]
B --> B1[Out of fuel]
B --> B2[Fuel line blockage]
C --> C1[Broken alternator belt]
C --> C3[Engine component failure]
C --> C2[Carburettor failure]
D --> D1[Battery is dead]
D --> D2[Alternator not charging]
D --> D3[Solenoid is worn out]
D --> D4[Electrical wiring fault]
D --> D5[Faulty starter motor]
D --> D6[Ignition switch failure]
Note
Learn more about Fishbone diagrams from the ASQ site
Brainstorming¶
Both the 5-Why and the Fishbone diagram are tools in your toolbox for isolating the problem. Neither is really useful if you do not engage your team.
Make sure you have the right people in the room. Share your thoughts and ideas with them. Let them contribute to your diagrams with their ideas of what they think it might be. With more ideas on the board, you can start eliminating what it is not, until you find the main issue contributing to the outage.
Testing it¶
You may come up with a hypothesis of what the problem might be. Just like in science, go and test it. In many cases I ended up writing little Python scripts that would emulate the same behaviour we were seeing on the system, so we could reproduce the problem.
Reproducing the problem is a way to isolate it to a specific component that is not behaving the way it should. Being able to build a small solution that emulates the same behaviour is also a good way to get the vendor involved (where necessary) to show them the behaviour being experienced.
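For example, when slow name resolution was the suspect (more on that in the print server story below), a few lines of Python were enough to demonstrate the delay on demand. A minimal sketch, with a hypothetical hostname:

```python
import socket
import time

# Hypothetical reproduction script: time repeated name lookups to show
# whether resolution is the slow step. Replace the hostname with your own.
hostname = "printserver.example.internal"

for attempt in range(5):
    start = time.perf_counter()
    try:
        socket.gethostbyname(hostname)
        outcome = "resolved"
    except socket.gaierror as error:
        outcome = f"failed ({error})"
    elapsed = time.perf_counter() - start
    print(f"Attempt {attempt + 1}: {outcome} in {elapsed:.2f}s")
```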
Learning from it¶
Once you've resolved the issue, it is important to learn from it. With all the detail you have uncovered, there are some key lessons to be learnt. Lessons like:
- What worked well?
- What did not work so well?
- What did we do wrong?
- What can we do to prevent this from happening again?
- What can we do next time to reduce the time taken on similar investigations?
Document this in your document management system. Make sure your team knows about it and is able to search for this information the next time they face a similar issue.
Real world example¶
Print Server Outage¶
A number of years ago, I was responsible for a team managing a global print system for a large manufacturer. One fine day, something happened - the system slowed down, to the point where a single print job that would normally take 5 seconds was now taking 15 minutes. It may not seem like much, until you consider that this system was managing invoices, delivery notices, and everything else in between. The impact was huge.
I did my normal troubleshooting, like I always did. Checked the OS, checked the network, rebooted the system multiple times, even diverted all traffic from the production system to the non-prod system - no change. I even considered reinstalling the environment from scratch just to get things up and running.
I had a team from around the world that I could lean on. Everyone was doing their best to troubleshoot the problem. We even engaged the vendor (who, sad to say, was quite useless in this case).
Towards the end of the afternoon, Charles (the Unix expert I had been leaning on the entire day) came walking past my desk. He just said: "It's DNS"... and kept walking. I thought: no, can't be. He was right. It was DNS. He was on his way to the Infrastructure team to sort out the problem with the DNS server. While he was doing that, I did some more checking to confirm what he was on about.
On a Linux system, your DNS servers are defined in the /etc/resolv.conf file. In that file, there were 2 DNS servers. When you perform an nslookup, the OS will try to query the first DNS server; if it's unable to reach that, it will try the second. Windows does the same. In the data center, no other system had any issues - only this print system. Why?
The primary DNS server in the data center had a faulty network cable. The network card would detect the fault and terminate the connection. The server was technically still up, because its secondary network card was still operational and hence still communicating back to the management system. No other system in the data center felt the issue because they failed over to the secondary DNS server. This print server, on the other hand, kept using the primary DNS server.
We solved the problem by updating /etc/resolv.conf and simply swapping the DNS servers around. We restarted the application, and everything started working. A full day's backlog got cleared in about 5 minutes. It was amazing to see the system recover so quickly.
A few things came out of our investigation.
- The Print Server had a weird behaviour in that it caches the primary DNS server name on startup, so if the primary DNS failed, it would not fail over to the secondary. I raised this with the vendor. Their response was that it was by design: they did not want the server to switch between DNS servers because it would cause performance delays. As a workaround, I built an explicit DNS monitor for the primary interface (something like nslookup google.com $(grep nameserver /etc/resolv.conf | head -1 | awk '{print $2}')), so we would be alerted the next time the primary DNS server failed. A rough sketch of such a monitor is shown after this list.
- The faulty network cable on the primary DNS server was a tricky one. There were apparently 2 network cables for the same interface, and one of them was faulty. For some reason this caused the NIC to terminate the connection. It did not isolate the server (if it had, maybe the infrastructure team would've picked up on it). There were events in the Windows Event Log indicating something was up with the NIC, but with so many alerts, no one noticed it (or was ever inclined to look for it).
- No one else noticed. All other systems were happily working as expected. If more than one system had been impacted that day, it might have shifted our focus towards something infrastructure related as the cause.
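For what it's worth, that monitor idea translates easily into a small script. A rough sketch, assuming a Linux host with /etc/resolv.conf and using google.com purely as a test name:

```python
import subprocess

# Rough sketch of the primary-DNS monitor: read the first nameserver from
# /etc/resolv.conf and check that it can still resolve a known name.
# Assumes a Linux host; google.com is used purely as a test record.
def primary_nameserver(path="/etc/resolv.conf"):
    with open(path) as handle:
        for line in handle:
            if line.strip().startswith("nameserver"):
                return line.split()[1]
    return None

server = primary_nameserver()
if server is None:
    print("No nameserver entry found in /etc/resolv.conf")
else:
    result = subprocess.run(
        ["nslookup", "google.com", server],
        capture_output=True, text=True, timeout=10,
    )
    status = "OK" if result.returncode == 0 else "FAILED - alert someone!"
    print(f"Primary DNS {server}: {status}")
```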
Database Server Outage¶
At the start of my IT career, I was sent to Mosselbay to replace a Microsoft SQL Server with new hardware. Three of us occupied the computer room: myself, the local site IT administrator, and the database administrator. I backed up the old server, removed it from the domain, replaced the hardware with the same hostname, put it back into the domain, and restored the database. Everything worked. I flew back to Johannesburg the next morning.
I had hardly arrived in Johannesburg when my phone rang. It was the local site. Their application was not working. Something was wrong. I rushed back to the head office and logged onto the server I had just left behind. It turned out we had a major performance issue. The old server, seemingly busy dying, was running much faster than the brand new Compaq ProLiant server (yup, back when Compaq was still a thing) we had just put in its place. This did not make sense.
"It's indexes" exclaimed the DBA. I remember he was working in SQL Studio the whole day trying to rebuild indexes. Even after rebuilding the indexes, everything was still slow. It was not indexes.
I cracked open Performance Monitor, and I spotted some odd behaviour - while the CPU and memory were idling, the disk was working very hard to keep up. What was going on with the disk? I told my manager that something was up with the disk - she didn't believe me, because "It's a brand new server. It is fast. It can handle the load." Yet the evidence said otherwise.
I flew back to Mosselbay that same day to revert what we did and bring the old hardware back into service. It worked - the old hardware worked perfectly fine.
So what went wrong with the new hardware?
It turned out there was one big change between the two servers. The original server had 2 disks, with no redundancy: one for the OS and transaction log, the other for the data file. The new server, however, had 3 disks in a RAID-5 configuration, because redundancy is a good thing. What we learnt about RAID-5 (the hard way) is that while it is great for redundancy and quite fast on data READS, it is terribly slow on data WRITES. It is slow because, for every write, the RAID controller has to work out what to write to each of the 3 disks to maintain redundancy of the data. I also learnt that there is literature for SQL Server that says: "Do not put your data and transaction log files on the same disks on a RAID controller". Oops... That's exactly what we did.
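A back-of-the-envelope calculation shows why the writes hurt so much. The per-disk IOPS figure below is made up purely for illustration; the write penalty of 4 (read old data, read old parity, write new data, write new parity) is the standard RAID-5 figure for small writes:

```python
# Back-of-the-envelope RAID-5 write penalty. The disk count matches the story;
# the per-disk IOPS figure is a made-up number purely for illustration.
disks = 3
iops_per_disk = 150          # hypothetical random IOPS for one spindle
raid5_write_penalty = 4      # read data, read parity, write data, write parity

raw_iops = disks * iops_per_disk
effective_write_iops = raw_iops / raid5_write_penalty

print(f"Raw IOPS across {disks} disks: {raw_iops}")
print(f"Effective RAID-5 write IOPS:   {effective_write_iops:.0f}")
# A busy transaction log is almost pure writes, so it feels every bit of
# that penalty.
```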
Fortunately the new ProLiant server was quite big, so it had extra disk slots. We got 2 more disks, slapped them in, and configured them to hold the transaction logs, thus separating the data files from the transaction logs.
Once we restored the data files again, the new server was flying. All the issues we experienced were gone.
The lessons we learnt:
- You're not as smart as you think. New technology, like RAID-5 was for me at the time, can have an adverse effect if you do not understand its impact on your application.
- Test, test, test. Do not try to do this straight in production if you don't know what you're doing.
Tips for success¶
- Test your assumptions. Just because someone says it is true doesn't make it true. Go check.
- Check the obvious. It may sound silly, but check that power cord, check that network link, check your DNS server. A lot of troubleshooting time has been wasted because teams assume something must be right, and then it's not.
- Draw it out. Nothing beats writing out the architecture on a piece of paper or a whiteboard to get a better understanding.
- Seek input from others - you're not as smart as you think! More brains work better than one.
- Leave your ego at the door. This is about solving a technical issue - not trying to find blame.
- The same goes for emotion. Upset users can be listed as part of the impact, but upset users and management screaming at you will not help solve the problem any faster.
- Communicate your findings. Whether you document it, do a showcase, or just share it within the technical team, talking about and sharing your experiences is a great way to build resilience within the team, and helps the team avoid such issues in future.
- Update your disaster recovery plan. Apply what you've learnt to your DR plan so that similar disasters can be avoided in the future.