Raul Castro Fernandez

Assistant Professor in the Department of Computer Science, Committee for Data Science, The University of Chicago

I am interested in data: how to think about it, what it means to make good use of it, what theory, algorithms, and systems we need to exploit it, and how to leverage it to better our lives. I use a variety of approaches to study these questions.

I am looking for PhD students, postdocs, and research assistants. Take a look at this brief writeup if you are interested in working with us.

The main area of research my group explores is data ecology. This is the name I give to the principles, theory, and methodology that analyze and synthesize data ecosystems. Here's a one-pager that gives an overview of the work. And here's a link to a Research Initiative on this topic.

There are a number of specific areas we are actively interested in:

Data Markets. We are studying data markets, as these are an important type of data ecosystem. You can see our vision for internal (read: within organizations) data markets, a survey about marketplaces, and some more technical work.

Data Sharing. We work a lot on data-sharing markets. My NSF CAREER project studies this type of ecosystem. We have designed and built a data escrow, which permits multiple agents to pool and operate on their data. We have models for incentivizing the formation of data-sharing consortia.

Data Discovery. I define data discovery as the problem of identifying and retrieving documents that satisfy an information need. There's a strong connection to information retrieval, but we have concentrated primarily on tabular data and on data augmentation techniques. This all started with Aurum. ARDA is an application of Aurum to do feature engineering from external repositories. Continuations to Aurum include Ver, and continuations to ARDA include Leva and Metam. Right now, Metam says all I want to say about data augmentation. Ver is still evolving. And while we are at it, we have been exploring the role of LLMs in this context: first with Solo, a RAG-style system that uses a self-supervised approach to train automatically, and more recently with Pneuma, a work in progress. We have also started to explore connections between data discovery and causal inference (really correlation discovery over large repositories) with Nexus.

NEWS

NOVEMBER'24 Talks on data ecology at the Harris School of Public Policy, on data discovery at GSL@Microsoft, and on data sharing at an IDEAL workshop.

OCTOBER'24 New Safeinsights project: Safeinsights project kick-off meeting.

SEPTEMBER'24 New Members Join the Group: Joyce Chen and Hrishee Shastri join the group.

AUGUST'24 VLDB'24: Tapan presents Arachne at VLDB.

See a log of all updates

PUBLICATIONS

Here you will find a list of my latest publications.

2024

2023

2022

2020

STUDENTS

Below I include postdocs, PhD students, and Master's students. In addition to these, I’m fortunate to work with great undergraduate students and occasionally with external students.

Postdocs and PhD Students

  • Qiming Wang
  • Yue Gong
  • Zhiru Zhu
  • Tapan Srivastava
  • Steven Xia
  • Chris Zhu
  • Hrishee Shastri

Master's and Undergraduate Students

  • Joyce Chen
  • Alena Zeng
  • Chirag Kawediya

Alumni

  • Kevin Dharmawan (external collaborator, to SBU PhD program)
  • Zach Hempstead (to Anthropic)
  • Sainyam Galhotra (to Cornell, assistant professor)
  • Stanley Zhu (to Google)
  • Alex Zhao (to Citadel)
  • Jenny Long
  • Yintong Ma (to ByteDance)
  • Ipsita Mohanty (to UWaterloo MSc program)
  • Ryan Wong (to UMichigan Undergraduate program)

TEACHING

SERVICE