Assistant Professor, Computer Science, The University of Chicago

Chicago, IL, USA

Office 245, John Crerar Library

raulcf at uchicago.edu

Picture by Jason Dorfman @ CSAIL

In my research, I build systems for discovering, preparing, and processing data. The goal of my research is to understand and exploit the value of data. I often use techniques from data management, statistics, and machine learning.

If you are interested in doing a PhD or Postdoc with me, send me an email explaining your interest, and attach an up-to-date CV.

If you are currently an undergraduate at UChicago and are interested in participating in a research project, send me an up-to-date CV and let's chat.


Preprint/Work in Progress


Data Discovery

Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it declines dramatically, because no single person knows about all the existing data sources. One big challenge is discovering the data sources that are relevant to answering a particular question. Aurum is a data discovery system that answers "discovery queries" over large volumes of data.

Fabric of Data: Termite

In addition to structured sources such as relational tables, organizations are also plagued with unstructured sources such as PDFs, text files, and emails. Integrating both kinds of sources has been a cornerstone of multiple research communities for decades. It is challenging because it demands extracting structure from the unstructured sources and then finding a common schema to represent both. In this line of research, we advocate a different approach: rather than trying to infer a common schema, we aim to find a common representation for both structured and unstructured data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness, and we learn the embedding so that it serves different downstream applications, from filling in missing values to data discovery and verification, among many others.
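
To make the idea of "distance as relatedness" concrete, here is a minimal sketch. It is not Termite's implementation: the vectors are hand-picked rather than learned, and the names (EMBEDDING, cosine_distance, most_related, and the row/column/paragraph identifiers) are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical, hand-picked 3-d vectors. A real system would learn these
# from the data; they are assumptions made up for this sketch.
EMBEDDING = {
    "row:employees/42": [0.9, 0.1, 0.0],
    "column:employees.salary": [0.8, 0.2, 0.1],
    "paragraph:hr_policy.pdf#3": [0.7, 0.3, 0.2],
    "paragraph:cafeteria_menu.txt#1": [0.0, 0.1, 0.9],
}


def cosine_distance(u, v):
    """Return 1 - cosine similarity; smaller means more related."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def most_related(key, k=2):
    """Return the k items closest to `key`, structured or unstructured."""
    query = EMBEDDING[key]
    scored = [
        (cosine_distance(query, vec), name)
        for name, vec in EMBEDDING.items()
        if name != key
    ]
    return [name for _, name in sorted(scored)[:k]]


print(most_related("row:employees/42"))
```

Because rows, columns, and paragraphs all live in the same space, one nearest-neighbor lookup can return a mix of structured and unstructured items without a shared schema.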

Stateful Data Processing

Large-scale data processing systems depend on stateless dataflows to extract data parallelism and execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, with state available to applications, we can translate imperative programs into stateful dataflow graphs that execute on a stateful data-parallel processing system.
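
As a rough illustration of what "stateful and data-parallel" can mean, the sketch below partitions a word count across operator instances that each own a slice of the state and can checkpoint and restore it. It is a toy under stated assumptions, not the design of any particular system: CounterOperator, route, and run are hypothetical names, and everything runs in a single process.

```python
import copy
from collections import defaultdict


class CounterOperator:
    """One operator instance that owns one partition of the state."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, word):
        # Explicit, mutable state: each record updates the running count.
        self.counts[word] += 1
        return word, self.counts[word]

    def checkpoint(self):
        # Snapshot of the state, used here to mimic recovery after a failure.
        return copy.deepcopy(dict(self.counts))

    def restore(self, snapshot):
        self.counts = defaultdict(int, snapshot)


def route(word, num_partitions):
    # Deterministic key partitioning: the same word always reaches the same
    # operator instance, so instances never share state.
    return sum(word.encode("utf-8")) % num_partitions


def run(words, num_partitions=2):
    operators = [CounterOperator() for _ in range(num_partitions)]
    for word in words:
        operators[route(word, num_partitions)].process(word)
    return operators


operators = run(["a", "b", "a", "c"])
print([dict(op.counts) for op in operators])
```

Partitioning state by key is what preserves data parallelism here, and restoring a fresh instance from a checkpoint is a simplified stand-in for fault tolerance.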


I completed a postdoc at MIT working with Sam Madden. Before that, I obtained my PhD at Imperial College London working with Peter Pietzuch. In the past, I've started two companies and I'm always interested in doing tech transfer of my research.

Raul Castro Fernandez © 2019