Jyotishka Datta receives CAREER award to create improved statistical models
These new models can be utilized in a wide range of interdisciplinary applications.
Jyotishka Datta. Photo by Liz McNeil for Virginia Tech.
As a statistician, Associate Professor Jyotishka Datta collaborates on research projects across a variety of domains. Whether he’s working with criminologists examining environmental factors of crime, ecologists tracking species diversity, computational biologists analyzing cancer genomics, or astronomers trying to calculate the distance of a faraway star, Datta has seen a common need for better statistical models to handle “count data” – particularly high-dimensional discrete data.
Datta’s quest to build better Bayesian models for these particular datasets is now being supported by a five-year National Science Foundation Faculty Early Career Development Program (CAREER) award.
Fast facts
- Primary researcher: Datta, Department of Statistics
- Project: Heavy-Tailed Priors for Robust Bayesian Inference in Ecology, Machine Learning, and Astronomy
- Award amount: $450,000
- Collaborators: Phyllis Newbill, Center for Educational Networks and Impacts
- Research goal: Developing scalable Bayesian statistical methods using heavy-tailed priors to extract meaningful patterns from complex, high-dimensional data across astronomy, ecology, machine learning, and genomics
Building better models
“Count data” essentially refers to information that represents how many times an event or object occurs. For example, count data could include the number of species found in a particular location, the number of violent crimes committed in a certain area, or the number of genes mutated at a specific site on a chromosome. While there may be pockets with high counts, these datasets often contain large numbers of zero or low counts, a feature also known as zero-inflation, which can make statistical models less reliable.
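To make zero-inflation concrete, the following is a minimal sketch in Python, not taken from Datta's project. It simulates counts in which most sites record nothing and the rest follow an ordinary Poisson distribution. The function names, the 60% zero probability, and the rate of 3 are all illustrative assumptions.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) count using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_zero_inflated(n, zero_prob, rate, seed=0):
    """Simulate n counts: a structural zero with probability zero_prob,
    otherwise an ordinary Poisson(rate) draw (hypothetical parameters)."""
    rng = random.Random(seed)
    return [
        0 if rng.random() < zero_prob else sample_poisson(rate, rng)
        for _ in range(n)
    ]


counts = sample_zero_inflated(10_000, zero_prob=0.6, rate=3.0)
zero_share = counts.count(0) / len(counts)
poisson_zero_share = math.exp(-3.0)  # share of zeros a plain Poisson(3) expects

print(f"share of zeros in simulated data: {zero_share:.3f}")
print(f"share of zeros under plain Poisson: {poisson_zero_share:.3f}")
```

In this toy setup, a plain Poisson model would expect only about 5% zeros, far fewer than the simulated data contain. That gap is the kind of mismatch that zero-inflation creates for standard count models.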
“One of the things that I understood as I worked with many different primary investigators across many different disciplines is that there are some similarities in terms of the statistical model that you apply” to count data, said Datta. “The statistical model that you apply to learn the structure, or the underlying factors, for such a dataset needs to have something called a heavy-tailed prior distribution.”
Datta posits that by employing some sort of heavy-tailed prior distribution – which allows for the possibility of extreme values – these new models can address some of the challenges that come along with high-dimensional count and compositional data, including sparsity, zero-inflation, and complex dependence structures.
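As a rough illustration of what “allowing for extreme values” means, the sketch below compares the probability of landing far from zero under a standard normal distribution with the same probability under a standard Cauchy distribution. The Cauchy is used here only as a familiar heavy-tailed example. The article does not name the specific priors Datta's project uses, so that choice is an assumption, as are the function names.

```python
import math


def normal_tail(t):
    """P(|X| > t) for a standard normal variable X."""
    return math.erfc(t / math.sqrt(2.0))


def cauchy_tail(t):
    """P(|X| > t) for a standard Cauchy variable X (heavy-tailed)."""
    return 1.0 - (2.0 / math.pi) * math.atan(t)


for t in (2.0, 5.0, 10.0):
    print(
        f"P(|x| > {t:>4}): normal {normal_tail(t):.2e}, "
        f"Cauchy {cauchy_tail(t):.2e}"
    )
```

At a threshold of 5, the normal distribution assigns less than a one-in-a-million chance to such a value, while the Cauchy still gives it a chance of roughly one in eight. A prior with tails like this can leave room for the occasional very large signal instead of shrinking it toward zero.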
Statistics is ‘inherently interdisciplinary’
The end goal of Datta’s project is to develop computationally efficient methods that can lead to new insights in applications across multiple domains.
While his research grant specifically names the fields of ecology, machine learning, and astronomy, Datta’s improved statistical models could be applied in just about any field that relies on count data, such as cancer genomics.
“Statistics is an inherently interdisciplinary field,” said Datta. “I think that it is imperative that we make more of an effort to be familiar with other disciplines and research across different quantitative fields, and try to forge and foster those relationships.”
Building for the future
In addition to creating more efficient modeling methods that can be used in a variety of applications, Datta’s project aims to promote public scientific literacy and quantitative reasoning, preparing students — specifically K-12 students — for a data-driven society.
A key to reaching this goal? Educational outreach.
In collaboration with Radford City Schools and Virginia Tech’s Center for Educational Networks and Impacts, Datta plans to create a “data course” that will incorporate real-world examples to promote statistical thinking. The course will emphasize the usage of storytelling as an effective strategy to communicate through data.
Datta, who has been working with high school students around the state for years, said that when real-world statistical examples are shared through anecdotes, and not just through numbers, students are more likely to become engaged with what they’re learning.
“One of the things that I really want them to do is to be able to think critically with data,” said Datta. “What I did was explain these different types of paradoxes or biases or hidden patterns, but using stories, not dry formulas.”
Not only does Datta plan to incorporate his research findings into educational activities for K-12 students, but he will also provide research opportunities and training for graduate students. Furthermore, both undergraduate and graduate curricula will be enhanced by the publication of a book, tentatively titled “Bayesian Inference for High-dimensional Data,” to be co-written with colleague Sayantan Banerjee from the Indian Institute of Management, Indore.
“All that we do in statistics or in data science can only survive if we have excellent applied problems that could motivate new methodology from other disciplines,” said Datta. “All the methods that I've developed, and that I will develop, they come from different fields.”