April 16, 2020

Debugging in High Performance Cluster Computing

High performance computing (HPC) adds challenges to every stage of development, and debugging can be especially difficult. Debugging high performance cluster computing programs, for example, can mean adopting entirely different debugging approaches to accommodate the scale and complexity of the environment.

In this article, we’ll look at what high performance cluster computing is, why debugging programs that run on these clusters can be difficult, and which debugging techniques and tools can make the process more manageable for development teams.

What Is High Performance Cluster Computing?

High performance cluster computing is the practice of aggregating computing power into an organized system that allows users to achieve higher computational performance than standalone computers can deliver.

Today’s HPC clusters integrate a wide variety of hardware and software components, support a mixture of compiled and interpreted programming languages, scale from dozens to thousands of compute nodes, and run applications with huge data sets and millions of lines of code. These clusters provide multiple levels of parallelism: process-level parallelism across multiple nodes, with communication supported by the Message Passing Interface (MPI); thread- or core-level parallelism within a process, using POSIX threads (pthreads) or OpenMP (Open Multi-Processing) for shared-memory multiprocessing; and accelerator-level parallelism offloaded onto a GPU or other specialized device.

For example, it is not unusual for an HPC application to be written in Python, C/C++, and/or Fortran, and to use MPI for process-level parallelism, OpenMP for CPU thread-level parallelism, and OpenMP v5+, OpenCL, OpenACC, CUDA, or HIP for GPU offloading.

Compounding the complexity, many applications are designed to be portable across multiple platforms and programming models, and therefore use an HPC performance portability layer such as Kokkos or RAJA. These layers make existing parallel applications portable with minimal disruption and provide a portability model for new applications. They rely on advanced C++ language features and introduce their own programming abstractions.

How Is High Performance Cluster Computing Used?

High performance computing clusters are typically used for data-heavy computations that would not be feasible on conventional computers. Organizations of all sizes and disciplines use them for data modeling, simulation, analysis, and other data processing functions. Here are a few examples of how HPC is used:

Academic and Government Research - Academic and government research laboratories use HPC to model physical systems in a variety of fields, including energy, astrophysics, weather, and materials science.
Oil and Gas - The oil and gas industry uses HPC to locate drill sites and maximize production.
Entertainment - The entertainment industry uses HPC to render and edit films, create animations and special effects, and stream live content.
AI and Machine Learning - HPC is used by artificial intelligence and machine learning systems for autonomous vehicles, medical diagnostics, fraud detection, and much more.

Costs for High Performance Cluster Computing

As the scale and complexity of an HPC cluster grow, so does the complexity of the application as it attempts to make efficient use of the cluster’s hardware components. Inevitably, it becomes necessary to modify, reorganize, and optimize the application to exploit the system, and each of those changes can introduce bugs.

HPC clusters are not only expensive resources to buy, own, and operate; they are also typically shared resources, and time on the system can be scarce, especially as the demand for compute nodes increases. When the demand for nodes greatly exceeds the cluster’s capacity, developers may need to wait for hours, or even days, to run their applications.

The systems are often partitioned into interactive nodes used for debugging and development and non-interactive batch nodes dedicated to production or mission-critical runs. The majority of the nodes are often dedicated to production use, which further limits the amount of time a developer has to debug an application and causes development delays.

Therefore, developer productivity and time-to-solution are critical when debugging HPC applications.

Debugging Techniques for High Performance Cluster Computing Applications

While debugging an application executing across a large cluster of nodes may seem like a daunting task for development teams, there are ways to make debugging HPC applications more manageable and cost-effective.

1. Fault Isolation in HPC Applications

Finding a single fault point in a single-process program can be difficult enough. But in high performance computing, those faults can be buried underneath hundreds of thousands of lines of code, spread across thousands of nodes, each running multiple processes containing multiple threads, with computations offloaded onto accelerator devices. Further, the “fault” in an HPC application may be buried in the data, which is typically stored as multi-dimensional arrays of scalar or aggregate values, organized into complex hierarchies of objects.

Tools like TotalView can help teams isolate those faults and act on them quickly, without the manual work of digging through mounds of code and data spread across the cluster.

2. Dynamic Visualization of HPC Applications

When debugging an HPC application, sometimes a telescope is needed, sometimes a microscope is needed, and sometimes something in between. For example, when a large MPI job hangs, it is often useful to view all of the processes and threads in the job at once and look for outliers, that is, processes and threads that differ in some way from most of the others.

Aggregation and reduction techniques allow the developer to form equivalence classes that define which program or data attributes to compare. For example, an aggregated stack backtrace of threads in a job can help locate outliers, where some threads are blocked in functions or at source lines waiting for an event or executing unexpected code.
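
The idea of grouping threads into equivalence classes by their stack backtrace can be sketched in a few lines of Python. This is only an illustration of the technique, not how any particular debugger implements it; the function name aggregate_backtraces, the snapshot data, and the function names inside the stacks are all hypothetical.

```python
from collections import defaultdict

def aggregate_backtraces(backtraces):
    """Group (rank, thread) ids by identical call stacks.

    `backtraces` maps a (rank, thread_id) pair to a tuple of function
    names, outermost frame first. Returns a list of (stack, members)
    pairs sorted so the smallest groups (the likely outliers) come first.
    """
    groups = defaultdict(list)
    for who, stack in backtraces.items():
        groups[tuple(stack)].append(who)
    # Sort by group size, then by stack, so the ordering is deterministic.
    return sorted(
        ((stack, sorted(members)) for stack, members in groups.items()),
        key=lambda item: (len(item[1]), item[0]),
    )

# Hypothetical snapshot of a hung 8-rank job, one thread per rank.
snapshot = {(rank, 0): ("main", "solve", "MPI_Allreduce") for rank in range(8)}
snapshot[(5, 0)] = ("main", "solve", "exchange_halo", "MPI_Recv")  # the odd one out

classes = aggregate_backtraces(snapshot)
for stack, members in classes:
    print(f"{len(members):3d} thread(s): {' > '.join(stack)}")
```

In this made-up snapshot, seven ranks collapse into a single class waiting in the collective, while rank 5 stands alone in a class of one, which is where a developer would look first.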

Once an outlier is found, a developer often wants to drill down or “dive” into the details of what the process or thread is doing: blocked on a mutex, waiting for an MPI message, stuck in a spin loop, etc. It is this ability to visualize the big picture, identify anomalies, and dive into the details that makes HPC debugging tools a valuable time saver for HPC developers.

When debugging HPC applications, the ability to easily visualize, analyze, filter, and transform data is critical. The specific features required at any given time depend on the situation, so it is important that the debugging tool provides a rich set of features. Examples include built-in 2-D and 3-D array visualization, array data slicing and filtering, array statistics, and the ability to transform C++ class objects to their conceptual form. For example, a C++ std::vector<double> is more easily viewed as an array of double values than as the underlying class implementation.
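
To make slicing, filtering, and array statistics concrete, here is a minimal Python sketch using only the standard library. The helper names array_stats and filter_cells and the 4x4 temperature field are assumptions made up for illustration; a debugger would apply the same operations to arrays read from the running process.

```python
import math
import statistics

def array_stats(values):
    """Summarize finite values and count the non-finite ones (NaN, inf).

    Assumes at least one finite value is present.
    """
    finite = [v for v in values if math.isfinite(v)]
    return {
        "count": len(values),
        "non_finite": len(values) - len(finite),
        "min": min(finite),
        "max": max(finite),
        "mean": statistics.mean(finite),
    }

def filter_cells(grid, predicate):
    """Return (row, col, value) for every cell that satisfies predicate."""
    return [
        (r, c, v)
        for r, row in enumerate(grid)
        for c, v in enumerate(row)
        if predicate(v)
    ]

# Hypothetical 4x4 temperature field with one corrupted cell.
grid = [[300.0 + r + c for c in range(4)] for r in range(4)]
grid[2][1] = float("nan")

row_slice = grid[2]                    # slice: one row of the array
col_slice = [row[1] for row in grid]   # slice: one column of the array
bad_cells = filter_cells(grid, lambda v: not math.isfinite(v))
summary = array_stats([v for row in grid for v in row])

print("non-finite cells:", [(r, c) for r, c, _ in bad_cells])
print("summary:", summary)
```

The filter pinpoints the corrupted cell by its indices, and the statistics show at a glance that one of the sixteen values is not finite, which is often faster than scanning the raw array.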

Tool developers cannot anticipate every data analysis or visualization feature a user might need. The debugger should therefore support exporting data for analysis in external tools, or transforming data using functions integrated into the application code itself.

3. Memory Correctness in HPC Clusters

Most HPC applications dynamically allocate memory for data structures and arrays, and thus are vulnerable to a number of memory errors including memory leaks, heap fragmentation, dangling pointers, deallocating a memory block multiple times, deallocating an invalid memory address, accessing uninitialized heap memory, and corrupting the memory heap, for example due to writing off the end of an array.
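
As a rough illustration of how some of these errors can be detected, the following Python sketch models the bookkeeping a memory-checking tool might keep for each process. It is a toy model under stated assumptions, not a description of any real tool: the HeapTracker class is hypothetical, addresses are fake integers, and freed addresses are never reused, which a real allocator would not guarantee.

```python
class HeapTracker:
    """Toy model of per-process heap bookkeeping (hypothetical)."""

    def __init__(self):
        self.live = {}        # address -> size of blocks currently allocated
        self.freed = set()    # addresses released at least once
        self.errors = []      # (kind, address) pairs, in detection order
        self._next = 0x1000   # fake address counter; addresses are never reused

    def malloc(self, size):
        addr = self._next
        self._next += size
        self.live[addr] = size
        return addr

    def free(self, addr):
        if addr in self.live:
            del self.live[addr]
            self.freed.add(addr)
        elif addr in self.freed:
            self.errors.append(("double free", addr))
        else:
            self.errors.append(("invalid free", addr))

    def leaks(self):
        """Blocks still allocated; at program exit these are leaks."""
        return dict(self.live)

heap = HeapTracker()
a = heap.malloc(64)
b = heap.malloc(128)
heap.free(a)
heap.free(a)        # deallocating the same block twice
heap.free(0xdead)   # deallocating an address that was never allocated

print("errors:", heap.errors)
print("leaked blocks:", heap.leaks())
```

The second block is never freed, so it is reported as a leak, while the repeated and invalid deallocations are recorded as they happen rather than surfacing later as heap corruption.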

HPC debuggers that support memory correctness must be scalable and lightweight, and must be able to operate in parallel across thousands of processes. Memory leaks may be inconsequential for some applications, but they can be fatal for HPC applications that can take many hours or days to run. A small memory leak in a frequently called function, or heap fragmentation, can exhaust available memory. This is especially true in HPC clusters, where the application must fit within the physical memory limits of the node because compute nodes typically do not swap memory to disk the way most traditional computers do.
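
A quick back-of-the-envelope calculation shows why a small leak matters for a long run. The sketch below assumes a steady leak and uses invented numbers (64 bytes leaked per call, 50,000 calls per second, 16 GiB of memory available to the process); the function name hours_until_exhausted is hypothetical.

```python
def hours_until_exhausted(leak_bytes_per_call, calls_per_second, free_bytes):
    """Hours until a steady leak consumes `free_bytes` of memory."""
    leak_rate = leak_bytes_per_call * calls_per_second  # bytes per second
    if leak_rate <= 0:
        return float("inf")  # no leak, memory is never exhausted
    return free_bytes / leak_rate / 3600

# Assumed figures for illustration only.
hours = hours_until_exhausted(
    leak_bytes_per_call=64,
    calls_per_second=50_000,
    free_bytes=16 * 2**30,  # 16 GiB
)
print(f"memory exhausted after about {hours:.2f} hours")
```

With these assumed figures the process runs out of memory after roughly an hour and a half, well short of a job that is expected to run for many hours or days.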

Additional Resources

For further reading on techniques for debugging high performance computing applications, or to learn more about how TotalView can help, please visit our blog or resources page via the links below.

Resources

TotalView Product Brief (PDF)
Debugging Mixed Language Applications (Video)
Multithreaded Analysis and Debugging (Video)

Documentation

Basic Debugging Tasks (Guide)
Using the TotalView CUDA Debugger (Guide)
Debugging Python With TotalView (Guide)

Using TotalView to Debug High Performance Computing Clusters

TotalView simplifies the debugging process for high performance computing clusters, allowing developers to easily resolve complex code issues across clusters and sites.
