RAUL CASTRO FERNANDEZ

Assistant Professor, Computer Science, The University of Chicago

Chicago, IL, USA

Office 245, John Crerar Library

raulcf at uchicago.edu

Picture by Jason Dorfman@CSAIL

In my research I build systems for discovering, preparing, and processing data. The goal of my research is to understand and exploit the value of data. I often use techniques from data management, statistics, and machine learning. My main effort these days is on building platforms to support markets of data. This is part of a larger research effort on understanding the Economics of Data. I'm part of ChiData, the data systems research group at The University of Chicago.

If you are interested in doing a PhD or Postdoc with me, send me an email explaining your interest, and attach an up-to-date CV.

If you are currently an undergraduate at UChicago and are interested in participating in a research project, send me an up-to-date CV and let's chat.

PUBLICATIONS

Preprint/Work in Progress
2020
2019
2018
2017
2016
2015
2014
2013

RESEARCH

Economics of Data

Data generates value only for the few organizations with the expertise and resources to make data shareable, discoverable, and easy to integrate. Sharing data that is easy to discover and integrate is hard because data owners lack information (who needs what data) and have no incentive to prepare the data in a way that is easy for others to consume. In this project we are studying how to design markets that incentivize participants to behave in a way that increases data's value, and we are designing platforms to support this vision.

Data Discovery

Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can benefit multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it diminishes dramatically, as no single person knows about all the existing data sources. One big challenge is to discover the data sources that are relevant to answering a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data.

Fabric of Data

In addition to structured sources such as relational tables, organizations are plagued with unstructured sources such as PDFs, text files, and emails. Integrating both kinds of sources has been a cornerstone of multiple research communities for decades. It is challenging because it demands extracting structure from the unstructured sources and then finding a common schema to represent both. In this line of research, we advocate a different approach: rather than trying to infer a common schema, we aim to find a common representation for both structured and unstructured data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness, and we learn the embedding so that it serves different downstream applications, from filling in missing values to data discovery and verification, among many others. In this project we are building new abstractions for data management, such as a relational embedding, but also the next generation of open domain question answering systems and information extractors. The end goal is to be able to organize knowledge from different sources in a way that is easy to consume.
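
To make the idea of a shared vector space concrete, here is a minimal sketch in Python. It is only an illustration, not the project's implementation: the vectors are hand-written rather than learned, and the names (EMBEDDING, cosine_similarity, most_related, and every key in the dictionary) are hypothetical. It shows a row, a column, and two paragraphs living in one space, with cosine similarity standing in for "degree of relatedness".

```python
import math

# ASSUMPTION: hand-written, 3-dimensional vectors for illustration only.
# A real system would learn these from the structured and unstructured sources.
EMBEDDING = {
    "row:employees/42": [0.9, 0.1, 0.0],
    "column:employees.salary": [0.8, 0.2, 0.1],
    "paragraph:hr_policy.pdf#3": [0.7, 0.3, 0.0],
    "paragraph:cafeteria_menu.txt#1": [0.0, 0.1, 0.9],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; higher means more related."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_related(key, embedding, k=2):
    """Return the k points closest to `key`, regardless of their kind."""
    query = embedding[key]
    scored = [
        (cosine_similarity(query, vector), name)
        for name, vector in embedding.items()
        if name != key
    ]
    scored.sort(reverse=True)  # most similar first
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    # A table row retrieves both a column and a text paragraph.
    print(most_related("row:employees/42", EMBEDDING))
```

Because every kind of object is a point in the same space, a single nearest-neighbor lookup can return a column and a paragraph for a row query, without first agreeing on a common schema.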

Stateful Data Processing

Large-scale data processing systems depend on stateless dataflows to extract data parallelism and to execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, with explicit state in applications, we can translate imperative programs into stateful dataflow graphs that can execute on a stateful data-parallel processing system.
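
As a rough illustration of what a stateful operator in such a dataflow might look like, here is a small Python sketch. It is an assumption-laden toy, not the design of any system described here: the names (StatefulCounter, partition, run) are hypothetical, everything runs in one process, and checkpointing is a plain in-memory copy. The state is partitioned by key, so each parallel instance owns a disjoint slice of it, and a checkpoint lets a replacement instance resume after a failure.

```python
import copy
from collections import defaultdict


class StatefulCounter:
    """One parallel instance of a stateful operator (hypothetical name).

    It keeps a running sum per key and owns the state for the keys routed to it.
    """

    def __init__(self):
        self.state = defaultdict(int)

    def process(self, key, value):
        # Update the state explicitly and emit the new running sum.
        self.state[key] += value
        return key, self.state[key]

    def checkpoint(self):
        # ASSUMPTION: an in-memory copy stands in for a durable snapshot.
        return copy.deepcopy(dict(self.state))

    def restore(self, snapshot):
        self.state = defaultdict(int, snapshot)


def partition(key, num_instances):
    """Deterministically map a key to one operator instance."""
    return sum(key.encode("utf-8")) % num_instances


def run(records, instances):
    """Route each (key, value) record to the instance that owns its key."""
    outputs = []
    for key, value in records:
        owner = instances[partition(key, len(instances))]
        outputs.append(owner.process(key, value))
    return outputs


if __name__ == "__main__":
    instances = [StatefulCounter() for _ in range(2)]
    print(run([("a", 1), ("b", 2), ("a", 3)], instances))

    # Simulate a failure of the instance that owns "a" and recover from a snapshot.
    snapshot = instances[partition("a", 2)].checkpoint()
    replacement = StatefulCounter()
    replacement.restore(snapshot)
    print(replacement.process("a", 1))
```

Partitioning the state by key is what preserves data parallelism: instances never share state, so they can run independently, and each one can be checkpointed and recovered on its own.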

TEACHING, SERVICE, BIO

Teaching
Service
Bio
I completed a postdoc at MIT working with Sam Madden. Before that, I obtained my PhD at Imperial College London working with Peter Pietzuch. In the past, I've started two companies and I'm always interested in doing tech transfer of my research.

Raul Castro Fernandez © 2020