On a daily basis I battle memory leaks, GC contention, .NET crashes, and performance issues. The escalations I see arrive with beautiful titles, like:
- “The migration is stuck at 6%”
- “The service is not responding to pings”
- “The IIS Servers are reporting HTTP 500”
- “The workers are consuming 90% CPU”
My team works on investigating these issues, finding the root cause and working with the product group on how to avoid them in the future. If the fix is quick to implement, we do that ourselves. We do have access to the source code and use it for most of the investigations.
Most of our time is dedicated to reactive issues. When we have free time, we do proactive work such as creating tools, scripts, or Kibana queries, which I’ll talk more about below. We also depend heavily on tools to do our work. Some of them are free, powerful, and not very well known; I’ll share more on these in a moment.
Also keep in mind that troubleshooting and debugging are part art, part science. Some people use the same tools in different ways; others use different tools altogether for the same task. What is important is to follow a scientific approach instead of taking random actions with no clear purpose, and to be methodical when investigating software problems.
Here are three tools I value highly in troubleshooting and debugging:
Kibana / ElasticSearch
This is the pair of tools we use to analyze our logs. ElasticSearch centralizes logs from different servers in one place and provides powerful query capabilities. Kibana lets us visualize the information from the logs using charts and queries.
We use these tools to analyze IIS logs, our application logs, and event logs from several servers at the same time. We can find patterns: for example, seeing whether specific first-chance exceptions are related, or finding the pages that cause HTTP 404. We can easily find the time of day when our applications log more of a specific exception. This saves us a lot of time: consider the manual work of collecting IIS logs and Event Logs from, say, 8 remote computers and analyzing them by hand.
Tip: Using Python (and Visual Studio in our case) it’s possible to create alerts. These alerts run queries that extract specific information from the logs and notify the appropriate person when something appears abnormal.
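As a rough illustration of what such an alert might look like, here is a minimal Python sketch. It assumes the log query has already run and returned a list of records. The field names `timestamp` and `exception`, the function names, and the threshold are all hypothetical, and the query and notification steps are left out.

```python
from collections import Counter
from datetime import datetime


def exceptions_per_hour(records, exception_type):
    """Count how often one exception type was logged in each hour of the day.

    `records` is assumed to be a list of dicts with hypothetical keys
    "timestamp" (ISO 8601 string) and "exception" (type name).
    """
    counts = Counter()
    for rec in records:
        if rec["exception"] == exception_type:
            hour = datetime.fromisoformat(rec["timestamp"]).hour
            counts[hour] += 1
    return counts


def hours_over_threshold(counts, threshold):
    """Return the hours whose count is strictly above the threshold."""
    return sorted(hour for hour, n in counts.items() if n > threshold)


def build_alert(records, exception_type, threshold):
    """Return an alert message, or None when nothing looks abnormal."""
    counts = exceptions_per_hour(records, exception_type)
    noisy_hours = hours_over_threshold(counts, threshold)
    if not noisy_hours:
        return None
    return f"{exception_type} exceeded {threshold}/hour during hours: {noisy_hours}"
```

A scheduled job could call `build_alert` on the latest query results and send the message to the appropriate person whenever it is not `None`.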
Performance Analysis of Logs (PAL)
Performance Analysis of Logs is a tool for analyzing Performance Monitor logs. Performance Monitor, or PerfMon, ships with the Windows operating system and is helpful for finding trends and bottlenecks.
For example, suppose your application is a mixed application, so it uses both native code (C/C++) and managed code (.NET). Perhaps you suspect that it consumes more and more memory over time without releasing it. Using PerfMon, you can collect a log for, say, 5 hours, then analyze it manually to confirm whether memory consumption really grows over time, and whether the growth comes from the managed or the native code.
You won’t get the root cause via PerfMon, but you can collect valuable information about the symptom to focus and continue your investigation using the right set of tools.
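To make that manual analysis concrete, here is a minimal Python sketch of one way to do it. It assumes the PerfMon log has been exported to CSV (for example with the relog utility) and that it contains the Private Bytes counter for the process and the .NET CLR Memory counter # Bytes in all Heaps. The function names and the 50% cutoff are my own illustrative choices, not something PerfMon or PAL defines.

```python
import csv
import io


def read_counter(csv_text, counter_suffix):
    """Return the samples of the first column whose header ends with counter_suffix."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    idx = next(i for i, name in enumerate(header) if name.endswith(counter_suffix))
    samples = []
    for row in reader:
        cell = row[idx].strip()
        if cell:  # skip blank cells (missed samples); this sketch ignores the gap
            samples.append(float(cell))
    return samples


def slope(samples):
    """Least-squares slope of the samples, in units per sample interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def classify_growth(private_bytes, managed_heap_bytes, managed_share=0.5):
    """Guess whether memory growth is mostly managed or mostly native.

    Heuristic (an assumption of this sketch): if the managed heaps grow at
    least `managed_share` as fast as the whole process, call it "managed".
    """
    total = slope(private_bytes)
    managed = slope(managed_heap_bytes)
    if total <= 0:
        return "no growth"
    return "managed" if managed / total >= managed_share else "native"
```

The result is only a hint about where to look next. As with PerfMon itself, it describes the symptom and not the root cause.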
In the memory scenario above, if you have collected a PerfMon log, PAL can analyze it automatically and produce a report that clearly highlights problems or potential problems.
The biggest problem engineers have when analyzing PerfMon logs is knowing the threshold to use for each counter. PAL has this knowledge, so if you collect a PerfMon log from your ASP.NET application, for example, you can select the ASP.NET filter, and it will analyze the log following the thresholds for the product. Same for SharePoint, SQL Server, Exchange, etc.
Tip: When using PAL to analyze Performance Monitor logs that you’ve created, select the Objects and Counters you want and on the Threshold File tab select Auto-detect. That way PAL can detect which threshold files it should use to perform the analysis based on the Objects/Counters you’ve collected.
Tip 2: If you’re not familiar with PerfMon and need to collect a PerfMon log, do this: run PAL, go to the Threshold File tab, and select the Threshold file title according to the product you want to monitor. Then click Export to Perfmon template file… so you can load this template from Performance Monitor and use it. This template has all the Objects and Counters you need to collect information from your server. If you don’t know what to use, just use System Overview. You can learn more about PAL by reading the author’s blog.
DebugDiag
Often when you encounter a problem related to an application, the best thing to do is to debug the application. However, many problems don’t happen in the development or test environment but rather on the production servers. This is a tricky situation because you can’t just install Visual Studio there and debug it. This approach is too invasive.
In situations where applications running on production servers need to be debugged, we need to tread carefully. Our approach, then, is to collect dump files from the application and copy them to a place where we can debug them.
Full user mode dump files are large binary files that contain the memory contents of a specific process at a certain moment in time. The process of debugging dump files is called “post-mortem analysis,” and it is very time consuming and technically complex.
Enter DebugDiag! DebugDiag is a tool for troubleshooting issues, which has a nice user interface and can be used to collect dump files based on specific rules. Moreover, DebugDiag can be used to analyze those dump files!
Many tools can collect dump files, and my favorites are ProcDump and Process Explorer. However, DebugDiag has some advantages: