Raul Castro Fernandez

Assistant Professor in the Department of Computer Science, Committee for Data Science, The University of Chicago

I am interested in data: how to think about it, what it means to make good use of it, what theory, algorithms, and systems we need to exploit it, and how to leverage it to better our lives. I use a variety of approaches to study these questions.

I am looking for PhD students, postdocs, and research assistants. Take a look at this brief writeup if you are interested in working with us.

Research Overview

Data Ecology

Data Ecology is the name I give to the study of the principles, algorithms, systems, and methodologies to understand and design data ecosystems. Here's a one-pager overview. Here's a link to a Data Science Institute Research Initiative on this topic. There are some data ecosystems to which we have dedicated more time to date:

Data Markets. We study data markets, which are an important data ecosystem. You can see our vision for internal (read: within organizations) data markets, a survey about marketplaces, and some more technical work.

Data Sharing. We work a lot on data-sharing markets. One question that motivates much of this research is how to incentivize data sharing when it is beneficial. My NSF CAREER award studies this type of ecosystem. We have designed and built a data escrow, which permits multiple agents to pool and operate on their data. We have models for incentivizing the formation of data-sharing consortia.

Data Discovery

I define data discovery as the problem of identifying and retrieving documents that satisfy an information need. There's a strong connection to information retrieval, but we have concentrated primarily on tabular data and on data augmentation techniques. This all started with Aurum. ARDA is an application of Aurum to do feature engineering from external repositories. Continuations of Aurum include Ver, and continuations of ARDA include Leva and Metam. Right now, Metam says all I want to say about data augmentation. Ver is still evolving. And while we are at it, we have been exploring the role of LLMs in this context: first with Solo, a RAG-style system that uses a self-supervised approach to train automatically, and more recently with Pneuma, a work in progress. We have started to explore connections between data discovery and causal inference (really correlation discovery over large repositories) with Nexus.

NEWS

FEBRUARY'25 Named Sloan Research Fellow

JANUARY'25 Talk on data ecology at the AI/ML Affinity group at the US Census Bureau

JANUARY'25 Manuel Cebrian visited our group

NOVEMBER'24 Talks on data ecology at the Harris School of Public Policy, on data discovery at GSL@Microsoft, and on data sharing at an IDEAL workshop.

OCTOBER'24 New Safeinsights project and project kick-off meeting

See a log of all updates

PUBLICATIONS

Here you will find a list of my latest publications.

2024

2023

2022

2020

STUDENTS

Below I include postdocs, PhD students, and Master's students. In addition to these, I’m fortunate to work with great undergraduate students and occasionally with external students.

Postdocs and PhD Students

  • Qiming Wang
  • Yue Gong
  • Zhiru Zhu
  • Tapan Srivastava
  • Steven Xia
  • Chris Zhu
  • Hrishee Shastri

Master's and Undergraduate Students

  • Joyce Chen
  • Alena Zeng
  • Chirag Kawediya

Alumni

  • Kevin Dharmawan (external collaborator, to SBU PhD program)
  • Zach Hempstead (to Anthropic)
  • Sainyam Galhotra (to Cornell, assistant professor)
  • Stanley Zhu (to Google)
  • Alex Zhao (to Citadel)
  • Jenny Long
  • Yintong Ma (to ByteDance)
  • Ipsita Mohanty (to UWaterloo MSc program)
  • Ryan Wong (to UMichigan Undergraduate program)

TEACHING

SERVICE