Virginia Tech Creates A Groundbreaking Supercomputer Cluster With Industrial Assistance
Virginia Tech, teaming with Apple Computer, Cisco, Liebert, and Mellanox Technologies, is introducing a turnkey approach to building supercomputing clusters: a large 64-bit InfiniBand cluster assembled entirely from off-the-shelf industry components. The supercomputer will belong to Virginia Tech, significantly enhancing its research capabilities.
"Virginia Tech's idea was to develop a supercomputer of national prominence based upon a homegrown cluster," says Hassan Aref, dean of Virginia Tech's College of Engineering and a former chief scientist at the San Diego Supercomputer Center.
The Virginia Tech team of engineers, computer experts, and officials selected Apple's new Power Mac G5 as the framework for the cluster. For months, the university worked with Apple to purchase and adapt the new machines, billed as the world's fastest personal computers, as they rolled off the manufacturing line in August.
While waiting for the machines, Virginia Tech identified Mellanox, the leading provider of InfiniBand semiconductor technology, to supply the primary communications fabric, drivers, cards, and switches for the project. The university also asked Cisco Systems to join the enterprising effort. Cisco's Gigabit Ethernet switches were chosen as the secondary communications fabric to interconnect the cluster, and Cisco provided a significant educational discount to support the project.
Virginia Tech also needed a cooling system, so it worked with Liebert, a division of Emerson Network Power known for its comprehensive range of protection systems for sensitive electronics. Given the heat load of the system, normal air-conditioning units were insufficient. Liebert was able to provide its new high-density, rack-mounted cooling system within the budget and time constraints of the project. Liebert also custom-designed the computer racks along with power-distribution equipment.
Weekly conference calls among the various players were organized to keep the build on its record pace. Geographically, the operation was international in scope, with experts as far away as Israel and Japan taking part in the project.
This collaborative effort represents a "groundbreaking project," Aref says. The people working on this project "pulled off miracles, raising glass ceilings and opening locked doors."
The new facility will be located at Virginia Tech's computing center. Plans call for a future installation to be housed in a building dedicated to the Institute for Critical Technology and Applied Science (ICTAS) at Virginia Tech. ICTAS is a new venture of the university that allows organized research units to cluster together on synergistic research.
Srinidhi Varadarajan, an assistant professor of computer science at Virginia Tech, and Jason Lockhart, director of the College of Engineering's High Performance Computing and Technology Innovation, initiated the venture at Virginia Tech. Varadarajan is an expert in reliability, a key issue in successfully exploiting terascale computing.
Component failures are endemic to any large-scale computational resource. While previous generations of supercomputers engineered reliability into systems hardware, today's high-performance computing environments are based on inexpensive clusters of commodity components, with no systemic solution for the reliability of the total machine.
Virginia Tech has the first comprehensive solution to the problem of transparent fault tolerance, which enables large-scale supercomputers to mask hardware, operating-system, and software failures, a decades-old problem. The solution is a software program called Déjà vu, designed by Varadarajan, who also integrated it with Apple's G5s. This work will enable the terascale computing facility to operate as the first reliable supercomputing facility, according to Varadarajan, a National Science Foundation Faculty Early Career Development Program (CAREER) Award recipient.
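The article does not describe how Déjà vu works internally, but the general idea behind this kind of fault tolerance, periodically checkpointing a computation so it can resume after a failure rather than restart from scratch, can be sketched in a few lines. The sketch below is a hypothetical, application-level illustration (function names and the pickle-based format are the author's assumptions, not Déjà vu's actual system-level design):

```python
import os
import pickle
import tempfile

# Hypothetical checkpoint/restart sketch. Real transparent fault tolerance
# (as in Déjà vu) checkpoints at the system level, invisibly to applications;
# this toy version makes the idea explicit inside a single computation.

CHECKPOINT = os.path.join(tempfile.gettempdir(), "ckpt.pkl")

def save_checkpoint(state, path=CHECKPOINT):
    # Write to a temporary file, then atomically rename, so a crash
    # mid-write never corrupts the last good checkpoint.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    # Return the last saved state, or None if no checkpoint exists.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return None

def long_computation(n_steps, fail_at=None):
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    state = load_checkpoint() or {"step": 0, "total": 0}
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(state)
    return state["total"]
```

If the computation crashes partway through, rerunning it picks up from the last completed step: the failure is masked from the final result, which is the essence of the technique, if not of Déjà vu's transparent implementation.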
Virginia Tech researchers are already active in a number of areas that will benefit from the new supercomputing facilities, says Kevin Shinpaugh, director of research and cluster computing for the university. These include: nanoscale electronics, quantum chemistry, computational chemistry, aerodynamics through multidisciplinary design optimization, molecular statics, computational acoustics, and the molecular modeling of proteins.
Terascale computing is motivated by problems too large to be solved by any individual computer, the majority of which arise in computational science. Traditionally, progress in science and engineering relied on a combination of theory and experiment. In recent decades, however, a third paradigm has emerged: computational science. The idea is to use computers to simulate the behavior of natural or human-engineered systems, rather than to observe the system or build a physical model of it.
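A toy example makes the simulation paradigm concrete. The snippet below (an illustration by this writer, not drawn from any Virginia Tech project) numerically simulates an object falling under gravity and compares the result with the closed-form answer; real computational-science codes apply the same step-the-equations idea to systems far too complex for closed-form solutions:

```python
def simulate_fall(t_end, dt=1e-4, g=9.81):
    # Euler integration of dv/dt = g and dy/dt = v,
    # starting from rest at y = 0.
    v, y, t = 0.0, 0.0, 0.0
    while t < t_end:
        y += v * dt
        v += g * dt
        t += dt
    return y

# For this simple system an exact answer exists: y = g * t^2 / 2.
# After 2 seconds that is 0.5 * 9.81 * 4 = 19.62 meters, and the
# simulation converges to it as the time step dt shrinks.
```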
"Virginia Tech will have one of the top-ranked supercomputing facilities in the world, supporting significant 'big science' research. It is anticipated that Virginia Tech will realize at least a five-to-one return on this investment in terms of annual research grant and contract activity," says Glenda Scales, assistant dean of computing and distance learning at Virginia Tech.
To help keep the ambitious job on schedule, "we used an assembly line of volunteer students to unpack computers and perform many of the routine but time-consuming functions," says Patricia Arvin, associate vice president of information systems and computing. Arvin also credits the many disparate parts of the university, from electrical services to purchasing to facilities planners, for the success of this project.
"Mellanox embraces Virginia Tech's decision to deploy one of the top supercomputers in the world based completely on off-the-shelf industry standard components," said Eyal Waldman, CEO of Mellanox Technologies. "As evidenced by Virginia Tech's cluster, the combination of industry standard servers, Linux and InfiniBand creates a new standard in clustering and is changing the way compute power is deployed."