Helping Lawrence Livermore National Laboratory Developers Scale Applications
As a national security laboratory, Lawrence Livermore National Laboratory (LLNL) is responsible for ensuring that the nation’s nuclear weapons remain safe, secure and reliable by applying the latest advances in science and engineering. With its special capabilities, the laboratory also meets other pressing national and international security needs, such as countering the threat of nuclear proliferation and terrorism, strengthening homeland security, and enhancing military effectiveness.
Given the critical role that LLNL plays in national security, it is no surprise that the laboratory uses some of the world’s fastest supercomputers, the IBM Blue Gene series, to develop its mission-critical applications.
At LLNL, Blue Gene/L is optimized to run molecular dynamics applications at extreme speeds to address materials aging issues confronting the Stockpile Stewardship Program. Blue Gene/L is also used to explore the potential of system-on-a-chip technologies to achieve extreme speed while minimizing floor space and electrical power consumption.
TotalView Helps LLNL…
Debug code scaled to a very large number of processors.
Understand the complexity of developing applications.
Reduce overall application development time for involved teams.
Solving Application Development Challenges
Researchers at LLNL develop mission-critical Grand Challenge applications using the IBM Blue Gene/L, one of the fastest supercomputers in the world. Applications written for the IBM Blue Gene/L are highly complex, using thousands of processors and consuming gigabytes of memory. Developing efficient code for such an advanced supercomputer presents great challenges for developers.
Applications being developed range from “simple” scalable linear solvers to large hydrodynamic and simulation codes that use multiple languages and network communication patterns.
One developer at LLNL described his application: “The code is a large, highly portable hydrodynamics code. It is a mixture of C, C++, Fortran and Fortran 90. It compiles to a 37-MB executable on Blue Gene/L when optimized. It has a variety of network communication patterns that are dynamic over time. It incorporates many different third-party libraries and thus must embrace a large number of coding styles with different language features used. It will soon be able to run on all 12,800 processors of the Blue Gene/L machine.”
For this developer, the biggest challenge is debugging code that crashes only when scaled to a very large number of processors. “We don’t always have the luxury of scaling back to 2048 processors,” he said. “On the other hand, the debugger needs to work fairly quickly at this scale to be of real use.”
Another programmer is developing scalable linear solvers, mainly algebraic multi-grid, written in C and consisting of short programs designed to be used by bigger applications. The goal is scalability across a large number of processors, which is difficult since the algorithms require a large amount of communication across many processors.
Other applications developed on Blue Gene/L at LLNL include a large multiphysics code written in C, which runs on a variety of platforms and has been used for scaling studies up to 12,000 processors.
How TotalView Helps
LLNL developers use the TotalView debugger to understand and reduce the complexity of developing applications on Blue Gene/L. TotalView is the most proven scalable debugging product of its kind, able to handle from one to thousands of processes. The advanced debugging capabilities of TotalView, including independent thread control, multi-platform support, register and instruction level debugging, and a built-in memory debugger, have been proven to reduce development time in some areas by more than 20 percent.
For the LLNL programmer developing scalable linear solvers, using TotalView has yielded great benefits. He says,
“TotalView has been extremely helpful as part of my development process in finding bugs...I like the fact that I can look at all jobs in parallel and really see what is going on — on all processors at the same time. Also, being able to set conditional breaks has been helpful.”
The developer working on the large hydrodynamics code lauds TotalView’s breakpoint management capabilities, as well as its ability to scale transparently up to thousands of processors or processes while remaining easy to use.
“The ability to step through individual processes in addition to aggregations of processors is useful, as is the fact that breakpoints are saved when a job exits. The fact that TotalView understands C++ method calls is also useful.”
According to a third LLNL developer,
“Typically, we debug on no more than 64–128 processors. It is easy to use TotalView on 4,096 processors and this has helped get the code scaled up. The nicest feature is the ability to trap on memory writes to specific locations. That’s irreplaceable and can save time on nasty bugs.”
Quickly Scale to Meet Your Application Needs
From one process to thousands, TotalView helps organizations like LLNL intuitively diagnose and understand their complex code. See for yourself how TotalView will help you do the same.