March 24, 2020

Move Your Development Forward With Reverse Debugging

Debugging Best Practices

Some of the most vexing software bugs to solve, in both serial and parallel applications, are those where the failure happens long after the root cause of the bug, or in a section of the program completely unrelated to it.

Scientists and engineers working "backward" from the crash to the root cause often have to apply tedious, brute-force analysis to examine the program, because most debuggers offer only rudimentary debugging techniques.

Let’s explore how TotalView’s advanced reverse debugging capabilities can radically improve the speed and accuracy, and reduce the difficulty, of troubleshooting this common and challenging class of defects.

How Does a Debugger Work?
The Need for Reverse Debugging
Reverse Debugging on Parallel Applications
Reverse Debugging Solutions
Reverse Debugging With TotalView
Try for Free in Your Application

How Does a Debugger Work?

Debuggers provide the ability to dynamically analyze a running application by attaching to it, allowing the user to control the execution of their program and examine the values of variables. Across different languages and operating systems, all debuggers provide similar features, including setting breakpoints in code, stepping through code statements, and examining variable values. Developers use these features to investigate how their application is running and to gather information on why it might crash or compute an incorrect result. For tough problems, developers may have to run their applications multiple times to reproduce an error and place breakpoints at the proper locations before a crash occurs or before incorrect data is generated.
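
As a rough illustration of the breakpoint-and-inspect mechanism described above, here is a minimal sketch in Python using the standard `sys.settrace` hook. The helper `run_with_breakpoint` and the sample function `running_total` are hypothetical names made up for this example. Debuggers for compiled languages work quite differently under the hood, so treat this only as a model of the idea.

```python
import sys

def run_with_breakpoint(func, line_offset, *args):
    """Run func(*args) and copy its local variables each time execution
    reaches the line `line_offset` lines below the `def` line."""
    code = func.__code__
    target_line = code.co_firstlineno + line_offset
    hits = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None                    # do not trace other functions
        if event == "line" and frame.f_lineno == target_line:
            hits.append(dict(frame.f_locals))   # "examine variables"
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)             # restore any earlier tracer
    return result, hits

def running_total(values):
    total = 0
    for v in values:
        total += v    # offset 3: the "breakpoint" line
    return total

result, hits = run_with_breakpoint(running_total, 3, [10, 20, 30])
for snapshot in hits:
    print(snapshot["v"], snapshot["total"])
```

Each snapshot is taken just before the breakpoint line runs, so `total` shows the value from before the current addition.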

The Need for Reverse Debugging

Much of the frustration and wasted time in debugging comes from the fact that developers currently have to work from the point of a crash or a bad data value without any easy way to go “backwards” to locate what caused the problem to begin with. Anything that allows developers to work in a straightforward and predictable way towards the root cause of a bug greatly improves and simplifies debugging.

In many cases, getting the program to crash or otherwise misbehave is only the beginning of the troubleshooting process. In the most trivial bugs, the crash happens on the same line or immediately after the bug and the error becomes obvious to developers if they are simply directed to the point where the program crashes.

In many other cases, a wide gulf exists between the point in the program where it does something obviously wrong (the point of failure) and the point where the error occurs (the site of the bug). Once the developer understands the failure on one side of the gulf, they must find a way to identify the error on the other side. The developer naturally examines the state of the program after the failure, looking for clues to help generate a troubleshooting hypothesis.
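
The gulf between the site of the bug and the point of failure can be shown with a small, contrived Python sketch. The function names and the inventory data are hypothetical. The point is that the traceback names only the function that failed, and not the function that planted the bad value.

```python
import traceback

inventory = {"bolts": 40, "nuts": 25}

def normalize_counts(stock):
    # Site of the bug: this pass silently replaces one count with None.
    for name in stock:
        if name.startswith("n"):
            stock[name] = None

def total_units(stock):
    # Point of failure: adding an int and None raises TypeError here.
    return sum(stock.values())

normalize_counts(inventory)     # the error happens here, silently

frames = []
try:
    total_units(inventory)      # the crash happens here, later
except TypeError as exc:
    frames = [f.name for f in traceback.extract_tb(exc.__traceback__)]

print(frames)   # the buggy function is absent from the traceback
```

By the time the exception is raised, `normalize_counts` has long since returned, so nothing in the crash state points back to it.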

Sometimes failures actually erase important clues.

For example, failures may overwrite previous key variable values that explain how the program got to where it is, stack backtraces may be gone or indecipherable, and memory allocations may potentially be corrupted.

Failure often means the program is unable to continue running. Restarting the program from scratch is often the first step after examining the failure. At that point, the developer is looking at a fresh run of the program, before the error. The unspoken challenge is now to run the program, using only "forward" commands, to the point right before the error happens. If this is accomplished, then they can very carefully step through the relevant code and identify the error.

The challenge of finding errors in complex code.

Even if the developer knows where the error is, the challenge of finding it can be daunting because the code may have complex iterative and conditional structure. These multiple levels mean that getting to a desired point in the program is not as simple as just setting a breakpoint on the right line. Conditional breakpoints can be used effectively in these cases, but setting them up to get the desired effects often takes some trial and error. Any mistake in this sequence can mean that the program needs to be restarted in the debugger.
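
To make the idea of a conditional breakpoint in nested loops concrete, here is a small Python sketch. It relies on the built-in `breakpoint()` and the `sys.breakpointhook` hook (both available since Python 3.7). The hook `record_stop` is a hypothetical stand-in for an interactive debugger prompt and `simulate` is an invented example function. In a real debugger the condition would be attached to the breakpoint rather than written into the code.

```python
import sys

stops = []

def record_stop():
    # Hypothetical stand-in for a debugger prompt: instead of pausing,
    # note the loop indices of the frame that called breakpoint().
    caller = sys._getframe(1)
    stops.append((caller.f_locals["step"], caller.f_locals["cell"]))

def simulate(steps, cells):
    visited = 0
    for step in range(steps):
        for cell in range(cells):
            visited += 1
            # The condition: stop only on the one iteration of interest.
            if step == 4 and cell == 2:
                breakpoint()
    return visited

original_hook = sys.breakpointhook
sys.breakpointhook = record_stop
try:
    visited = simulate(6, 5)
finally:
    sys.breakpointhook = original_hook

print(visited, stops)
```

An unconditional breakpoint on the same line would stop on all 30 iterations. If the condition is wrong, the run overshoots and has to be restarted.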

The degree of difficulty and frustration only goes up if the program is unfamiliar to the user or if the program contains concurrency. Doing this kind of troubleshooting with unfamiliar code means that the developer may find themselves spending time to understand details of the program simply for the purpose of finding out how to "drive" the program without losing control.

Concurrency, such as that introduced by multiple threads or multiple processes, may mean that the behavior of the program depends on the order and timing of the threads' execution on the hardware. In such a case, restarting the program may simply result in a run where the failure does not occur or occurs in a different way.

The only way to eliminate this stochasticity is to carefully control the execution of the program, starting and stopping processes or threads, such that critical sequences happen the same way every time. All of this means time spent focusing on code that relates only in the most incidental way to the problem at hand.

Sometimes the code works fine but you need to learn it.

Developers new to a codebase will find that a debugger is an invaluable tool for following how a program works during execution. Using the debugger's ability to stop at specific functions and examine data enables developers to quickly understand how the program actually runs before they begin refactoring and enhancing it. This forward debugging ability is valuable, but what if the developer needs to examine something that occurred during the startup of the program? They’ll be forced to waste time rerunning the application and setting breakpoints at the locations they want to investigate.

What would troubleshooting and code exploration be like if all of these challenges could be sidestepped?

If the developer could go backward as part of the same debugging session through the execution of their program, revisiting how the program got to where it is, and the data it generated along the way, then there is no need to restart, and therefore no risk that a stochastic problem will be gone. Nor is the developer required to focus on the challenge of repeatedly driving the program carefully forward over the same ground to just the right point.

Because the developer isn't required to focus on driving the program precisely forward, they are much less likely to spend time dwelling on uninteresting swaths of the program. Opening up the potential to "go directly backward" radically simplifies the task at hand when a developer is troubleshooting an error in their program.

Reverse Debugging on Parallel Applications

The radical simplification of the troubleshooting process outlined above would seem to apply with very little change to parallel contexts. Many of the problems that occur in parallel programs are serial bugs that occur on one or many of the parallel processes. In that case, the general idea will simply be to capture the bug in the reverse debugger and focus on analyzing the history of that specific process to work backward from the failure to the error.

If that is not possible, it is still likely that an error on process A occurred generating data that when transmitted to process B caused the failure to occur there. In that case, the hunt backward from the failure in process B should end at the point that it received the suspect data from process A. The focus of investigation can then simply switch to process A and proceed from the sending of the suspect data to the point of the original error.
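
The hand-off from process B back to process A can be sketched as a search over per-process event histories. Everything in this Python example is hypothetical: the event records, the `msg_id` field, and the helper `trace_to_sender` are illustrative and do not describe how any particular debugger stores its history.

```python
# Hypothetical recorded histories, oldest event first, one list per process.
histories = {
    "A": [
        {"kind": "compute", "note": "x = 4"},
        {"kind": "compute", "note": "x = -x  (the original error)"},
        {"kind": "send", "peer": "B", "msg_id": 17, "payload": -4},
    ],
    "B": [
        {"kind": "recv", "peer": "A", "msg_id": 17, "payload": -4},
        {"kind": "compute", "note": "y = sqrt(payload)"},
        {"kind": "fail", "note": "math domain error"},
    ],
}

def trace_to_sender(histories, failed_process):
    """Walk backward from the failure to the most recent receive, then
    locate the matching send in the sender's history."""
    for event in reversed(histories[failed_process]):
        if event["kind"] == "recv":
            sender = event["peer"]
            for index, sent in enumerate(histories[sender]):
                if sent["kind"] == "send" and sent["msg_id"] == event["msg_id"]:
                    return sender, index
    return failed_process, None   # no receive found: the bug is local

sender, send_index = trace_to_sender(histories, "B")
print(sender, send_index)
```

From the returned index, the investigation continues backward through process A alone. No global clock is needed, only a way to match each receive to its send.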

Tracking down a serial error in a parallel program based on message passing, therefore, may not require the parallel debugger to construct an absolute mapping of execution trajectory to an absolute time across the cluster. It is not yet clear whether a synchronized parallel clock will prove relevant for errors that stem more directly from the parallel nature of these programs.

Reverse Debugging Solutions

There are a number of products, papers, and initiatives in the broader software development market that address or support the general idea of reverse debugging.

Microsoft has time travel debugging.
Virtual machines provide an opportunity to record and replay the state not just of a single process but of an entire operating system.
Green Hills Software provides a hardware-level reverse debugging tool called TimeMachine.
For Java programming, there is a tool called the Omniscient Debugger that appears to provide very transparent reverse debugging functionality.
GDB reverse debugging is available for only certain target debugging environments.

However, none of these tools focuses, as TotalView does, on the needs of scientists and computer scientists working on multithreaded and multiprocess compiled applications.

Reverse Debugging With TotalView

The TotalView Debugger is a source code debugger and dynamic analysis tool for troubleshooting complex, multithreaded, or multiprocess C, C++, and Fortran programs. It simplifies and shortens the troubleshooting process needed to understand bugs and ultimately resolve defects in desktop applications, programs running on servers, and scientific simulations running on clusters.

Directly integrated into TotalView is its reverse debugging engine, which enables developers to easily record and deterministically replay the execution of programs under the debugger's control. With the click of a button, developers can record the execution of the program and at any point go back through the recorded history. Traversing recorded history uses the same controls as going forward, only in reverse: developers can step backward, step backward out of a function call, and run backward until a breakpoint is hit. At any point in recorded history, developers can examine the values of variables and even use advanced capabilities like watchpoints to stop reverse execution of the program when a variable's value changes. All of TotalView’s other debugging technologies, such as memory debugging, are fully compatible with its reverse debugging engine.
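
The idea of recording history and then running backward to a watchpoint can be modeled with a toy Python recorder. This is only a conceptual sketch and says nothing about how TotalView's engine is implemented. The helpers `record` and `reverse_watch` and the sample function `normalize` are hypothetical. Snapshotting every local at every line is far too costly for real programs.

```python
import sys

def record(func, *args):
    """Run func(*args), saving (line offset, copy of locals) before each
    line executes. Returns the history and any exception that was raised."""
    code = func.__code__
    history = []

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None
        if event == "line":
            offset = frame.f_lineno - code.co_firstlineno
            history.append((offset, dict(frame.f_locals)))
        return tracer

    error = None
    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    except Exception as exc:
        error = exc
    finally:
        sys.settrace(previous)
    return history, error

def reverse_watch(history, name):
    """Step backward from the last snapshot until `name` differs from its
    final value. Return the snapshot taken just before the line that wrote it."""
    final = history[-1][1].get(name)
    for index in range(len(history) - 1, 0, -1):
        if history[index - 1][1].get(name) != final:
            return history[index - 1]
    return history[0]

def normalize(readings):
    scale = 1.0
    for r in readings:
        if r < 0:
            scale = 0.0    # offset 4: the bug
    return sum(readings) / scale    # offset 5: the failure

history, error = record(normalize, [3, -1, 5])
culprit_offset, culprit_locals = reverse_watch(history, "scale")
print(type(error).__name__, culprit_offset, culprit_locals["r"])
```

Starting from the crash, the backward search lands on the assignment at offset 4 and shows that the reading `-1` triggered it, with no restart and no new breakpoints.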


Recording Program Execution

Recording the execution of a program being debugged by TotalView is as simple as pressing the “record” button. Recording can be enabled before the program begins execution or at any point later, which enables developers to optimize performance and record only the segments of execution that they are really interested in. At any point during the recorded debugging session developers can navigate back through recorded history or proceed forward with normal live execution.

Developers will quickly appreciate the power of reverse debugging and adopt new debugging workflows to solve issues in their code faster and learn new code more quickly. To diagnose a program crash, instead of rerunning the program repeatedly, setting new breakpoints, and slowly stepping through code, developers can simply enable reverse debugging, hit the crash, and begin stepping backwards through the recorded history to understand what led to it. For developers learning new code, reverse debugging with TotalView lets them move forwards and backwards through the executed code, giving them a thorough understanding of how the program actually works.

Deterministic Replay

TotalView’s reverse debugging technology allows users to freely replay execution of the program during the live debugging session. By running forwards and backwards, they can see exactly how the program ran and what values its variables held, and so fully understand what happened during execution within a single debugging session. But that is not the only way developers can revisit the execution history. At any point during the debugging session, developers can save the recorded history to a “replay” file. You can think of this replay file as a “super” core file that contains the full execution history of the program. A replay file can be loaded later for further analysis and even shared with colleagues.
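
One common way to think about deterministic replay is to log every nondeterministic input during the live run, then feed the log back on replay. The Python sketch below illustrates that idea under simplifying assumptions. The `Recorder` and `Replayer` classes, the `simulation` function, and the JSON file layout are all hypothetical and are not TotalView's replay file format.

```python
import json
import os
import random
import tempfile

class Recorder:
    """Live mode: pass values through from a nondeterministic source
    and log each one."""
    def __init__(self, source):
        self.source = source
        self.log = []

    def next_value(self):
        value = self.source()
        self.log.append(value)
        return value

    def save(self, path):
        with open(path, "w") as fh:
            json.dump(self.log, fh)

class Replayer:
    """Replay mode: return the logged values in their original order."""
    def __init__(self, path):
        with open(path) as fh:
            self.log = json.load(fh)
        self.position = 0

    def next_value(self):
        value = self.log[self.position]
        self.position += 1
        return value

def simulation(inputs, steps):
    # The program under study: its behavior depends only on what it reads.
    state, trace = 0, []
    for _ in range(steps):
        state = (state * 31 + inputs.next_value()) % 1009
        trace.append(state)
    return trace

live = Recorder(lambda: random.randrange(1000))
live_trace = simulation(live, 20)

with tempfile.TemporaryDirectory() as tmp:
    replay_path = os.path.join(tmp, "run.replay.json")
    live.save(replay_path)                      # the saved "replay" file
    replay_trace = simulation(Replayer(replay_path), 20)

print(replay_trace == live_trace)
```

Because the replayed run sees exactly the inputs of the live run, it follows the same path every time, and the saved file can be handed to a colleague to reproduce it.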

Try for Free in Your Application

Looking at troubleshooting and debugging as a process, reverse debugging seeks to eliminate or shorten the repeated cycles of that process by allowing the developer to work backwards from failure to error.

See for yourself how TotalView helps you intuitively diagnose and understand your complex code.

