February 24, 2021

How to Solve CUDA Programming Challenges With TotalView

CUDA introduces developers to a number of new concepts that are not encountered in serial or other parallel programming paradigms. Visibility into these elements is critical for troubleshooting and tuning applications that make use of CUDA. In this blog, we explore CUDA and how TotalView helps users deal with new CUDA-specific constructs. 

What Is CUDA Programming?

CUDA is a parallel computing platform and programming model created by NVIDIA. It is used for general-purpose computing on NVIDIA’s own graphics processing units (GPUs), accelerating the runtime performance of computationally intensive applications.

Challenges of CUDA

While the benefit of CUDA is clear, namely dramatically faster execution of computationally intensive code through massive parallelism, there are several challenges for high performance computing (HPC) applications using CUDA.

Lack of Abstraction

CUDA generally exposes quite a bit about how the hardware works to the programmer. In this regard, it is often compared to assembly language for parallel programming. Explicit concepts such as thread arrays map directly to the way the hardware groups computational units and local memory.

The memory model requires explicit data movement between the host processor and the device. CUDA makes explicit use of a hierarchy of memory address spaces, each of which obeys different sharing rules. This is more low-level detail than the typical application or scientific programmer has to deal with when programming in C, C++, or Fortran. It also raises significant concerns with regard to portability and maintainability of code.
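As a rough illustration of that hierarchy, consider the hypothetical kernel sketched below (the names applyCoeff and coeff are ours, not from any particular application). The comments mark which address space each variable lives in, and the host-side calls show the explicit copies required before and after the kernel runs; this is exactly the kind of low-level detail referred to above.

#include <cuda_runtime.h>

__constant__ float coeff;                  // constant address space: read-only, visible to all threads

__global__ void applyCoeff(const float *in, float *out, int n) {
    __shared__ float tile[256];            // shared address space: one copy per thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;      // 'v' lives in a per-thread register
    tile[threadIdx.x] = v * coeff;         // 'in' and 'out' point into global device memory
    __syncthreads();                       // barrier: all threads in the block reach this point
    if (i < n) out[i] = tile[threadIdx.x];
}

// Host side: data is not visible on the device until it is explicitly copied there.
//   cudaMemcpyToSymbol(coeff, &hostCoeff, sizeof(float));        // host -> constant memory
//   cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);    // host -> global memory
//   applyCoeff<<<(n + 255) / 256, 256>>>(devIn, devOut, n);      // launch blocks of 256 threads
//   cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);  // global memory -> host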

Data vs. Task Parallelism

The second challenge is that the programming model for CUDA is one of data parallelism rather than task parallelism. When divvying up work across the nodes of a cluster, HPC programmers are used to looking for and exploiting parallelism at a relatively coarse-grained level. The amount of data to assign to each node depends on a number of factors, including the computational speed of the node, the available memory, the bandwidth and latency of the interconnect, and the frequency with which results need to be shared with other processes.

Since processors are fast and network bandwidth is relatively scarce, the balance is typically to put quite a bit of data on each compute node and to move data as infrequently as possible. CUDA invites the programmer to think about parallelism of a completely different order, encouraging the developer to break the problem apart into units that are much smaller. This is often many orders of magnitude more parallelism than was previously expressed in the code and requires reasoning somewhat differently about the computations themselves.
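To make that shift concrete, here is a minimal SAXPY sketch (a standard textbook example, not drawn from TotalView’s documentation) contrasting a serial CPU loop with its data-parallel CUDA counterpart. The host-side setup around such a kernel is sketched in the next section.

// Serial CPU version: the work is expressed as one loop over n elements.
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// CUDA version: the loop disappears. Each of the n logical GPU threads handles
// exactly one element; the launch configuration expresses the parallelism.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element index
    if (i < n)                                      // guard threads past the end of the array
        y[i] = a * x[i] + y[i];
}

// Launching over one million elements creates roughly a million logical threads,
// grouped into blocks of 256 (d_x and d_y are device pointers):
//   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);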

Data Movement And Multiple Address Spaces

NVIDIA GPUs are external accelerators attached to the host system via the PCIe bus. Each GPU has its own onboard memory as well as smaller, faster memories attached to each of its compute elements. While there are now mechanisms to address regions of host memory from device kernels, such access is slower than accessing device memory.

CUDA programs therefore use a model in which code running on the host processor prepares and explicitly dispatches work to the GPU, waits for the GPU to complete that work, and then reads the resulting data back from the device. Both the computational kernels and the data on which they will execute are dispatched to the device. The data is moved, in whole or in part, to the device over the PCIe bus for execution. As results are produced, they need to be moved, just as explicitly, back from the device to the host computer’s main memory.
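The sketch below walks through that prepare/dispatch/wait/read-back pattern end to end. The kernel (scaleKernel) is a trivial placeholder of our own; the runtime calls (cudaMalloc, cudaMemcpy, cudaDeviceSynchronize) are the standard CUDA ones, and error checking is omitted for brevity.

#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;          // each thread scales one element in device memory
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    std::vector<float> host(n, 1.0f);      // prepare the data in host memory

    float *dev = nullptr;
    cudaMalloc(&dev, bytes);                                       // allocate device (global) memory
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);   // explicit host -> device transfer

    scaleKernel<<<(n + 255) / 256, 256>>>(dev, 3.0f, n);           // dispatch work to the GPU
    cudaDeviceSynchronize();                                       // host waits for the GPU to finish

    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);   // explicit device -> host transfer
    printf("host[0] = %f\n", host[0]);                             // expect 3.0

    cudaFree(dev);
    return 0;
}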

TotalView For CUDA Programming

TotalView has provided CUDA debugging support for many years. The current version of TotalView builds on and extends its model of processes made up of threads to incorporate CUDA threads. There are two major differences: CUDA threads are heterogeneous, and TotalView shows a representative CUDA thread from among all those created as part of running a routine on the GPU. The GPU thread is shown as part of the process alongside the other CPU threads, including OpenMP threads. Threads created by OpenMP, like any other threads within a single process, share an executable image, a single memory address space, and other process-level resources such as file and network handles.

CUDA threads, on the other hand, execute a different image, on a set of GPU processors with a different instruction set, and in a completely distinct memory address space. Because there are so many CUDA threads, and because only a small fraction of this large set of logical threads is likely to be instantiated on the hardware at any given moment, the TotalView UI displays the thread-specific data for a user-selectable, representative CUDA device thread. CUDA is also often used heavily in conjunction with distributed multi-process programming paradigms such as MPI.

As such, the CUDA support in TotalView is complementary to and works alongside its MPI support. Users debugging an MPI+CUDA job see the same view of all their MPI processes running across the cluster, and when they focus on any individual process, they can easily see both what is happening on the host processor and what is happening in the CUDA kernels running on the GPU.

Related reading: CUDA Debugging Support For Apps Using NVIDIA GPUs on ARM64

Conclusion

We’ve covered how the TotalView process and thread model has been extended to allow users to easily switch back and forth between what is happening on the host processors, where MPI-level communication occurs and the program sets the stage for the data-parallel work, and what is happening in the kernels running on the GPU.

For an in-depth look at how TotalView addresses additional challenges presented by CUDA, download our white paper, Debugging CUDA-Accelerated Parallel Applications.

Download White Paper