Changhee Jung, an assistant professor of computer science in Virginia Tech's College of Engineering, has received a National Science Foundation Faculty Early Career Development (CAREER) award to study resilience in high-performance computing.

Advancing high-performance computing systems is integral to achieving the next steps in computer modeling for industry and academia worldwide. Running the billions of calculations necessary to confront some of society’s most pressing challenges depends upon reliable next-generation high-performance computer systems.

Discovering cancer-fighting drugs, modeling climate change, and even extending humanity’s reach into space all depend upon computer systems that will be able to perform at the exascale — systems that are capable of at least one exaFLOPS or a billion billion calculations per second.

Jung’s research award, totaling $521,718 over five years, will focus on reducing the ever-increasing error rate in high-performance computers without sacrificing energy efficiency or increasing the complexity of hardware.

The bane of high-performance computing systems is the soft error. When showers of cosmic rays from outer space hit computer chips, a single ionizing particle can cause a single event upset, a change of state that alters an instruction in a program or flips a data value.

“Resilience is one of the key challenges in achieving exascale computing,” said Jung. “We are living in an era where we need to think seriously not just about high-performance computing but error-tolerant computing. The question is how to keep the supercomputers running efficiently without system failures due to soft errors. If exascale systems are not resilient enough to correctly execute long-running simulations in the presence of a soft error, then it doesn’t matter how large and powerful they are, the simulation results will be useless unless you can trust the computer system is outputting correct information.”

Jung's research seeks to achieve lightweight soft-error resilience by leveraging novel compiler optimizations that can generate error-tolerant code without significant performance degradation.

His project has four goals: achieving low-cost soft-error resilience for CPUs; compiler-directed soft-error resilience for commodity GPUs; lightweight nonvolatile memory persistence; and low-cost timing-error resilience for aggressive voltage scaling to maximize energy efficiency with program correctness guarantees.

His research will eventually yield algorithms, models, and software to combat soft errors; testing infrastructure, including simulators and evaluation benchmarks and their traces; and educational materials.

Commercially, the research in soft-error resilience has the potential to lower manufacturing costs for companies that produce and sell processors, and to give computer manufacturers a cheaper resilience solution. Overall, Jung’s work could make the execution of current and emerging applications much more reliable at a low cost.

Established in 1995, the NSF CAREER Award is the most prestigious award given in support of junior faculty who demonstrate the potential to effectively integrate research and education.

— Written by Amy Loeffler
