Assistant Professor in the Department of Computer Science, Committee for Data Science, The University of Chicago
I am interested in data: how to think about it, what it means to make good use of it, what theory, algorithms, and systems we need to exploit it, and how to leverage it to better our lives. I use a variety of approaches to study these questions.
I am looking for PhD students, postdocs, and research assistants. Take a look at this brief writeup if you are interested in working with us.
Data Ecology is the name I give to the study of the principles, algorithms, systems, and methodologies to understand and design data ecosystems. Here's a one-pager overview. Here's a link to a Data Science Institute Research Initiative on this topic. There are some data ecosystems to which we have dedicated more time to date:
Data Markets. We study data markets, which are an important data ecosystem. You can see our vision for internal (read: within organizations) data markets, a survey about marketplaces, and some more technical work.
Data Sharing. We work a lot on data-sharing markets. One question that motivates much of this research is how to incentivize data sharing when it is beneficial. My NSF CAREER project studies this type of ecosystem. We have designed and built a data escrow, which permits multiple agents to pool and operate on their data. We have models for incentivizing the formation of data-sharing consortia.
I define data discovery as the problem of identifying and retrieving documents that satisfy an information need. There's a strong connection to information retrieval, but we have concentrated primarily on tabular data and on data augmentation techniques. This all started with Aurum. ARDA is an application of Aurum to do feature engineering from external repositories. Continuations to Aurum include Ver, and continuations to ARDA include Leva and Metam. Right now, Metam says all I want to say about data augmentation. Ver is still evolving. And while we are at it, we have been exploring the role of LLMs in this context: first with Solo, a RAG-style system that uses a self-supervised approach to train automatically, and more recently with Pneuma, a work in progress. We have started to explore connections between data discovery and causal inference (really correlation discovery over large repositories) with Nexus.
FEBRUARY'25 Named Sloan Research Fellow
JANUARY'25 Talk on data ecology at the AI/ML Affinity group at the US Census Bureau
JANUARY'25 Manuel Cebrian visited our group
NOVEMBER'24 Talks on data ecology at the Harris School of Public Policy, on data discovery at GSL@Microsoft, and on data sharing at an IDEAL workshop.
OCTOBER'24 New Safeinsights project: kick-off meeting
See a log of all updates
Here you will find a list of my latest publications.
Below I include Postdocs, PhD, and Master students. In addition to these, I’m fortunate to work with great undergraduate students and occasionally with external students.