Research at Virginia Tech is about to get a boost through a new high-performance computing system available through Advanced Research Computing, a unit of the Division of Information Technology.

As the university’s central resource for high-performance computing (HPC), Advanced Research Computing provides computing systems known as “clusters,” along with storage and visualization resources. The unit is staffed by a team of computational scientists, systems engineers, and developers who offer consulting services to help researchers use its systems and software.

“Virginia Tech cultivates a broad research profile, and we see it as our responsibility to host capable, scalable computational resources that enable researchers across the disciplinary spectrum to tackle cutting-edge discovery,” said Matthew Brown, computational scientist for Advanced Research Computing (ARC). “ARC’s clusters are constantly running close to their top capacity, reflecting the ongoing and expanding computational work being conducted at Virginia Tech. Our latest computing architecture makes it possible to create simulations and analyses from traditional HPC workloads in far greater detail than we ever could before.”

Meet Owl

Owl is Advanced Research Computing's newest CPU cluster — CPU stands for central processing unit, often dubbed the “brains” behind a computer. CPU clusters are optimal for researchers who need to perform a series of calculations on their data because they excel at completing a task and moving on to the next one very quickly.

Owl contains 84 nodes, the individual computers within the cluster, with a total of 8,064 processing cores and 768 gigabytes of DDR5 memory per node. The cluster also includes three huge-memory nodes: two with 4 terabytes of memory and one with 8 terabytes.

With high memory per core, computations can fly

For a computer to work really fast, it needs a lot of processing power, which comes from the number and speed of the computing cores. But you also need excellent memory — in terms of speed, quantity, and connectivity — to handle the workload at hand.

Think of it like a highway, where the cores set your speed limit and memory provides lanes for your data to travel in. With powerful cores, calculations can be done extremely quickly, but if there’s only one lane, only so much data can move through in any given time period. Increasing memory is like increasing the number of lanes on the highway.

Owl is an eight-lane highway, so to speak. Compared to Advanced Research Computing’s other large CPU cluster, TinkerCliffs, which has 2 gigabytes of memory per core, Owl has 8 gigabytes. This allows researchers using Owl to

  • Conduct more types of calculations simultaneously
  • Increase the level of detail in data simulations for finer-grained results
  • Run jobs quickly and make any needed adjustments sooner in the research process 
  • Turn around the results of research more quickly
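For readers curious where the eight-gigabytes-per-core figure comes from, here is a minimal back-of-the-envelope sketch using the node and core counts given above. It assumes the 768 gigabytes of DDR5 memory is per node, as described earlier.

```python
# Back-of-the-envelope check of Owl's memory-per-core figure,
# assuming the 768 GB of DDR5 memory described above is per node.
nodes = 84
total_cores = 8_064
memory_per_node_gb = 768

cores_per_node = total_cores // nodes            # 96 cores per node
memory_per_core_gb = memory_per_node_gb / cores_per_node

print(f"{cores_per_node} cores per node")
print(f"{memory_per_core_gb:.0f} GB of memory per core")  # 8 GB, vs. 2 GB per core on TinkerCliffs
```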

Direct cooling improves performance

Owl is the first cluster on Virginia Tech’s campus to use direct-to-node cooling. With this setup, a network of small ducts carrying liquid coolant runs throughout each node alongside the components that create the most heat, providing near-instant cooling via conduction. This eliminates the need for bulky, loud fans while providing the most efficient cooling possible for Owl’s hard-working cores. It also eliminates thermal throttling, which happens when the cluster reduces its computing speed to prevent overheating.

“The effect on the power usage effectiveness of a data center utilizing direct-to-node cooling is significant,” said Jeremy Johnson, Advanced Research Computing's IT operations manager.

Power usage effectiveness (PUE) measures how efficiently a data center uses power. It is expressed as the ratio of the total energy required to run the facility to the energy used for computing. The lower the PUE, the more energy efficient a high-performance computing cluster is.

Advanced Research Computing Systems Engineer Jessie Bowman points out the direct-to-node cooling channels in one of Owl’s nodes. Photo by Angela Correa for Virginia Tech.

An ideal PUE score is 1.0, which would mean that all of the energy consumed in running the data center goes directly to the computing process. A lower PUE score means less wasted energy, which is better for the environment and lowers energy costs.

“An air-cooled data center typically has a PUE of 1.5 to 2.0, while a rear door heat exchanger cooling system, such as that utilized by TinkerCliffs, can reach efficiencies of 1.2 to 1.4,” Johnson said. “By eliminating the power required for cooling fans, direct-to-node cooling can provide a PUE of 1.1, with the added benefit of allowing the processors to run at maximum speed with no thermal throttling.”
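To make those ratios concrete, here is a small, purely hypothetical sketch of how PUE is calculated. The kilowatt-hour figures below are invented for illustration, not measurements from Virginia Tech’s data centers; they are simply chosen so the resulting ratios land near the values Johnson cited.

```python
# Hypothetical illustration of power usage effectiveness (PUE):
#   PUE = total facility energy / energy used for computing
# The kilowatt-hour figures below are made up for illustration only.
def pue(total_facility_kwh: float, computing_kwh: float) -> float:
    return total_facility_kwh / computing_kwh

computing_kwh = 1_000  # energy that actually reaches the processors

cooling_overheads = {
    "air-cooled":               800,  # roughly PUE 1.8
    "rear-door heat exchanger": 300,  # roughly PUE 1.3
    "direct-to-node":           100,  # roughly PUE 1.1
}

for setup, overhead_kwh in cooling_overheads.items():
    total_kwh = computing_kwh + overhead_kwh
    print(f"{setup}: PUE = {pue(total_kwh, computing_kwh):.1f}")
```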

What Owl’s first users have to say

A research team led by Wu Feng, a professor in computer science, was among the first to use Owl this spring, while Advanced Research Computing was finalizing its installation. Feng’s team is testing the scalability of new code for graph clustering, a technique that identifies patterns and commonalities across complex data sets. Graph clustering is used in research areas ranging from biomedical research to social science, providing a way to analyze extremely large graphs quickly and accurately.
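For readers unfamiliar with the term, the toy sketch below is not the Feng team’s code; it is just a minimal, self-contained example of one common clustering approach, label propagation, to show what grouping related nodes in a graph looks like in practice.

```python
# Toy illustration of graph clustering via label propagation.
# This is NOT the research team's algorithm -- just a minimal sketch of the
# general idea: each node repeatedly adopts the label most common among its
# neighbors, so densely connected groups converge to a shared label.
from collections import Counter

# A small undirected graph: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

neighbors = {}
for u, v in edges:
    neighbors.setdefault(u, set()).add(v)
    neighbors.setdefault(v, set()).add(u)

labels = {node: node for node in neighbors}  # every node starts in its own cluster

for _ in range(10):  # a handful of sweeps is plenty for this tiny graph
    changed = False
    for node in neighbors:
        counts = Counter(labels[n] for n in neighbors[node])
        best = max(counts, key=lambda lab: (counts[lab], lab))  # break ties deterministically
        if labels[node] != best:
            labels[node] = best
            changed = True
    if not changed:
        break

print(labels)  # nodes in the same triangle end up sharing a label
```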

Feng’s team is building upon previous graph clustering work its members conducted using TinkerCliffs. That earlier work won the Grand Champion award at the 2023 GraphChallenge, an international competition from the Massachusetts Institute of Technology, Amazon Web Services, and IEEE that recognizes innovation in graph analysis solutions.

Given the increased amount of CPU bandwidth team members required to run their algorithm at scale, Advanced Research Computing identified them as excellent candidates to move their work to Owl.

“The Owl cluster delivers up to twice the performance per node and can analyze graphs up to three times larger per node than before with TinkerCliffs. This increased capacity enabled us to complete an unprecedented study of graph clustering algorithms that span a multitude of research domains, including bioinformatics, health, networking, and social media,” said Feng.

Feng, who is also co-founder and technical lead of the Green500, which ranks the most energy-efficient supercomputers worldwide, said Owl’s relatively low energy consumption is important. “The addition of Owl to ARC’s infrastructure should be lauded as it represents Virginia Tech’s commitment toward improving energy efficiency and sustainability with respect to high-performance computing.”

“With the addition of Owl, we can now offer researchers access to two powerful CPU clusters, each with its own advantages,” said Brown.

While Owl is more efficient in many respects, TinkerCliffs remains Advanced Research Computing’s largest CPU cluster by far. As Owl takes on jobs that benefit from its memory speed and capacity, it will free up space on TinkerCliffs, giving more researchers access to the high-performance computing resources they need, when they need them.

As a result, Brown said, “We can get more science done each day.”

Owl is currently in the final stages of testing and will be available to the Virginia Tech research community by August.
