In my research I build systems for discovering, preparing, and processing data. The goal of my research is to understand and exploit the value of data. I often use techniques from data management, statistics, and machine learning. My main effort these days is on building platforms to support markets of data. This is part of a larger research effort on understanding the Economics of Data. I'm part of ChiData, the data systems research group at The University of Chicago.
If you are interested in doing a PhD or Postdoc with me, send me an email explaining your interest, and attach an up-to-date CV. If you are currently an undergraduate at UChicago and are interested in participating in a research project, send me an up-to-date CV and let's chat.
Data generates value only for the few organizations with the expertise and resources to make it shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and have no incentive to prepare the data in a way that is easy for others to consume. In this project we study how to design markets that incentivize participants to behave in ways that increase data's value, and we are designing platforms to support this vision.
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it diminishes dramatically, because no single person knows about all the existing data sources. One big challenge is discovering the data sources that are relevant to answering a particular question. Aurum is a data discovery system that answers "discovery queries" over large volumes of data.
In addition to structured sources such as relational tables, organizations are plagued with unstructured sources such as PDFs, text files, and emails. Integrating both kinds of sources has been a cornerstone of multiple research communities for decades. It is challenging because it demands extracting structure from the unstructured sources and then finding a common schema to represent both. In this line of research, we advocate a different approach: rather than trying to infer a common schema, we aim to find a common representation for both structured and unstructured data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness, and we learn the embedding so that it serves different downstream applications, from filling in missing values to data discovery and verification, among many others. In this project we are building new abstractions for data management, such as a relational embedding, as well as the next generation of open-domain question answering systems and information extractors. The end goal is to organize knowledge from different sources in a way that is easy to consume.
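To make the idea of a shared vector space concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in this project: the vectors are hand-made rather than learned, the keys (`employees.salary`, `hr_policy.pdf#3`, and so on) are hypothetical, and cosine similarity is just one assumed way to measure relatedness between points.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more related."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical, hand-made vectors. In a real system these would be learned.
# Columns, rows, and paragraphs all live in the same three-dimensional space.
embedding = {
    ("column", "employees.salary"): [0.9, 0.1, 0.0],
    ("row", "employees#42"): [0.7, 0.3, 0.1],
    ("paragraph", "hr_policy.pdf#3"): [0.8, 0.2, 0.1],
    ("paragraph", "cafeteria_menu.txt#1"): [0.0, 0.1, 0.9],
}


def most_related(query_key, k=2):
    """Return the k items closest to query_key, regardless of their kind."""
    query = embedding[query_key]
    scored = [
        (cosine_similarity(query, vector), key)
        for key, vector in embedding.items()
        if key != query_key
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


if __name__ == "__main__":
    for key in most_related(("column", "employees.salary"), k=3):
        print(key)
```

Because structured and unstructured items share one space, a query that starts from a table column can return a paragraph from a PDF as its nearest neighbor, with no common schema involved.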
Large-scale data processing systems depend on stateless dataflows to extract data parallelism and to execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance of traditional dataflow systems. In addition, with state in the applications, we can translate imperative programs into stateful dataflow graphs that execute on a stateful data-parallel processing system.
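As a rough illustration of what "state in the dataflow" can mean, the Python sketch below partitions mutable state by key across operator instances and snapshots it for recovery. It is a toy under stated assumptions, not the design of any system described here: `StatefulOperator`, `partition_of`, and `run` are hypothetical names, the instances run sequentially in one process, and a checkpoint is just an in-memory copy.

```python
import copy
from collections import defaultdict


class StatefulOperator:
    """One parallel instance that owns a disjoint partition of the state."""

    def __init__(self):
        self.state = defaultdict(int)

    def process(self, key, value):
        # Explicit, mutable state: update the running total for this key.
        self.state[key] += value
        return key, self.state[key]

    def checkpoint(self):
        # Snapshot the partition so it can be restored after a failure.
        return copy.deepcopy(dict(self.state))

    def restore(self, snapshot):
        self.state = defaultdict(int, snapshot)


def partition_of(key, num_partitions):
    # Deterministic routing (Python's built-in hash() is salted per process).
    return sum(key.encode("utf-8")) % num_partitions


def run(records, operators):
    """Route each record to the operator instance that owns its key."""
    for key, value in records:
        operators[partition_of(key, len(operators))].process(key, value)


if __name__ == "__main__":
    operators = [StatefulOperator(), StatefulOperator()]
    run([("a", 1), ("b", 2), ("a", 3)], operators)
    snapshots = [op.checkpoint() for op in operators]

    run([("a", 100)], operators)      # work done after the checkpoint
    for op, snapshot in zip(operators, snapshots):
        op.restore(snapshot)          # simulated recovery: roll back
    print([dict(op.state) for op in operators])
```

Each key is owned by exactly one instance, so instances can update their partitions independently (data parallelism), and restoring a snapshot returns a partition to a known point (fault tolerance).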
Ethics, Fairness, Responsibility, and Privacy in Data Science (Spring'20)
Introduction to Databases (Winter'20)
VLDB'21 PC Member
ICDE'21 PC Member
VLDB'20 PC Member
SoCC'20 PC Member
SIGMOD'19 PC Member