I am interested in data: how to think about it, what it means to make good use of it, what theory, algorithms, and systems we need to exploit it, and how to leverage it to better our lives. I use a variety of approaches to study these questions.

Research Overview

Data Ecology

Data Ecology is the name I give to the study of the principles, algorithms, systems, and methodologies for understanding and designing data ecosystems. Here's a one-pager overview. Here's a link to the Data Science Institute's Research Initiative on this topic. Below are some example data ecosystems we have studied:

Data Markets are an important data ecosystem. You can see our vision for internal (within-organization) data markets, a survey about marketplaces, and some more technical work.

Data Sharing. We work extensively on data-sharing markets. One question that motivates much of this research is how to incentivize data sharing when it is beneficial. My NSF CAREER award supports the study of this type of ecosystem. We have designed and built a data escrow, which permits multiple agents to pool and operate on their data. We have also developed models for incentivizing the formation of data-sharing consortia.

Data Discovery

I define data discovery as the problem of identifying and retrieving documents that satisfy an information need. There's a strong connection to information retrieval, but we have concentrated primarily on tabular data and data augmentation techniques. This all started with Aurum. ARDA applies Aurum to feature engineering from external repositories. Continuations of Aurum include Ver, and continuations of ARDA include Leva and Metam. We have also been exploring the role of LLMs in this context: first with Solo, a RAG-style system using a self-supervised approach, and more recently with Pneuma. We have started to explore connections between data discovery and causal inference (correlation discovery over large repositories) with Nexus.