Q&A with Director of Center for Causal Discovery, Gregory Cooper

By Liberty Ferda

Issue Date: November 16, 2015

Last October, the National Institutes of Health awarded an $11 million, four-year grant to Pitt to lead a Big Data to Knowledge Center of Excellence called the Center for Causal Discovery. The center comprises several partners, including researchers at Pitt, Carnegie Mellon University, the Pitt-CMU Pittsburgh Supercomputing Center, Yale University, and four other universities. Its goal is to help scientists better analyze vast amounts of data to discover biomedical knowledge such as causes of cancer. The term “Big Data” refers to huge sets of complex data that may be analyzed to reveal patterns and associations not available otherwise. Professor Gregory Cooper is the center’s director and also vice chair of the Department of Biomedical Informatics in Pitt’s School of Medicine.

Why should the average person care about Big Data?

In biomedicine, Big Data come from genomic and other molecular analyses, electronic health records, imaging, social media, wearable devices, and many other sources. The analysis of such data will help researchers uncover the causal mechanisms of disease and eventually allow physicians to provide personalized medicine that is tailored to address the specific disease-contributing factors of each patient.

What are some practical applications of Big Data?

The Center for Causal Discovery uses three primary biomedical projects to drive the development of our algorithms and software and to demonstrate the use of Big Data for causal discovery. One project involves using the data to discover the molecular causes of an individual patient’s cancer. Another project is concentrating on the analysis of data about lung diseases, such as chronic obstructive pulmonary disease (COPD), to help better understand how and why these diseases start and worsen over time. A third project focuses on uncovering how functional brain activity patterns differ between patients with autism or schizophrenia and those without these conditions, with the goal of being able to better diagnose and classify these conditions. We anticipate that these projects will increase our understanding of these three disease areas, which could lead to improved methods for diagnosis, treatment, and prevention.

Is the center virtual or is it tied to a physical site?

The center does not require a central physical location and is distributed across several departments at the University of Pittsburgh, Carnegie Mellon University, the Pittsburgh Supercomputing Center, and Yale University. We frequently meet in person, with those outside of Pittsburgh joining by video or teleconference. The main administrative office is located in the Department of Biomedical Informatics at the University of Pittsburgh.

How many people work for the center?

More than 50 scientists are working within the Center for Causal Discovery either full-time or part-time. We have faculty at all levels plus staff, postdoctoral fellows, graduate students, and undergraduate students working on center projects.

What kind of computer is used to crunch the data?

We use a variety of existing computers, from desktop machines to the supercomputers at the Pittsburgh Supercomputing Center. Our software is intended to run on any computer that can meet the demands of a given data-analysis task. For example, we have an algorithm that requires 18 hours on a supercomputer to model data on one million variables, or 13 minutes on a laptop for 50,000 variables. Our center software will include a gateway at the Pittsburgh Supercomputing Center for outside biomedical researchers who need access to such high-performance computing resources to analyze their own Big Data.

Have any discoveries been made since the grant was issued last year?

Yes. We are in the process of submitting a paper for publication related to a new discovery from our cancer project. The algorithm identified that mutations in a group of four proteins not previously studied together are causally associated with the development of cancer, and biochemical experiments confirmed that these four proteins do interact as predicted by the algorithm.

We have also increased the speed and capabilities of the algorithms used to search Big Data for causal associations and have developed user-friendly software for biomedical scientists who are not computing experts.

Is it fair to say that harnessing Big Data is a primary challenge of the 21st century?

Harnessing the potential of Big Data is a great challenge and a great opportunity. Both the amount and variety of data are increasing rapidly in biomedicine. At the same time, the capabilities of algorithms to better analyze such data and uncover new knowledge are also improving steadily, and, of course, available computing power continues to advance rapidly. The convergence of these trends suggests that we will be learning more and more biomedical knowledge from large datasets.

What’s the most exciting thing about this venture in your opinion?

Causal discovery is at the core of biomedical science and the search for new ways to improve human health. Tools for finding causal associations in big complex datasets have the potential to dramatically increase the rate of scientific discovery. The Center for Causal Discovery is focused on supporting biomedical scientists in making causal discoveries from their data using software tools that are effective, easy to use, and free.