Assistant Professor in the Department of Computer Science, Committee for Data Science, The University of Chicago
I am interested in data: how to think about it, what it means to make good use of it, what theory, algorithms, and systems we need to exploit it, and how to leverage it to better our lives. I use a variety of approaches to study these questions.
Data Ecology is the name I give to the study of the principles, algorithms, systems, and methodologies needed to understand and design data ecosystems. Here's a one-page overview. Here's a link to a Data Science Institute Research Initiative on this topic. There are some data ecosystems to which we have dedicated more time to date:
Data Markets. We study data markets, which are an important data ecosystem. You can see our vision for internal (read: within organizations) data markets, a survey about marketplaces, and some more technical work.
Data Sharing. We work a lot on data-sharing markets. One question that motivates much of this research is how to incentivize data sharing when it is beneficial. My NSF CAREER award studies this type of ecosystem. We have designed and built a data escrow, which permits multiple agents to pool and operate on their data. We have models for incentivizing the formation of data-sharing consortia.
I define data discovery as the problem of identifying and retrieving documents that satisfy an information need. There's a strong connection to information retrieval, but we have concentrated primarily on tabular data and on data augmentation techniques. This all started with Aurum. ARDA is an application of Aurum to do feature engineering from external repositories. Continuations of Aurum include Ver, and continuations of ARDA include Leva and Metam. Right now, Metam says all I want to say about data augmentation. Ver is still evolving. And while we are at it, we have been exploring the role of LLMs in this context: first with Solo, a RAG-style system that uses a self-supervised approach to train automatically, and more recently with Pneuma, a work in progress. We have started to explore connections between data discovery and causal inference (really correlation discovery over large repositories) with Nexus.