Postdoc MIT, CSAIL

Cambridge, MA, USA

raulcf at csail.mit.edu

There is a gap between our capacity to generate data and our capacity to benefit from it. To bridge this gap, we must make data: i) easy to access (where can I find these records?); ii) easy to understand (what questions can I answer with this data?); and iii) easy to use, so that it can support ever more sophisticated algorithms.

In my research I use techniques from data management, statistics and machine learning to build data-centric systems that help access, understand and use data. At MIT I work with professors Sam Madden and Mike Stonebraker. Before MIT, I completed my PhD at Imperial College London with Peter Pietzuch.




Data Discovery

Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it declines dramatically, because no single person knows about all the existing data sources. One big challenge is to discover the data sources that are relevant to answering a particular question. Aurum is a data discovery system that answers "discovery queries" over large volumes of data.

Fabric of Data: Termite

In addition to structured sources such as relational tables, organizations are also plagued with unstructured sources such as PDFs, text files, and emails. Integrating both kinds of sources has been a cornerstone of multiple research communities for decades. It is challenging because it demands extracting structure from the unstructured sources and then finding a common schema to represent both. In this line of research, we advocate a different approach: rather than trying to infer a common schema, we aim to find a common representation for both structured and unstructured data. Specifically, we argue for an embedding (i.e., a vector space) in which all entities, rows, columns, and paragraphs are represented as points. In the embedding, the distance between points indicates their degree of relatedness, and we learn the embedding so that it serves different downstream applications, from filling in missing values to data discovery and verification, among many others.
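
As a rough illustration of the idea that distance in the embedding indicates relatedness, the sketch below ranks items by cosine similarity. It is not Termite's implementation: the vectors are hand-made rather than learned, the keys are hypothetical names, and cosine similarity is only one possible choice of relatedness measure.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical, hand-made vectors. In a real system these would be learned;
# here they only show rows, columns, and paragraphs sharing one vector space.
embedding = {
    "row:employees/42": [0.9, 0.1, 0.0],
    "column:employees.name": [0.8, 0.2, 0.1],
    "paragraph:handbook.pdf#3": [0.7, 0.3, 0.1],
    "paragraph:menu.txt#1": [0.0, 0.1, 0.9],
}


def most_related(query_key, embedding, k=2):
    """Return the k keys whose vectors are most similar to the query's vector."""
    query = embedding[query_key]
    scored = [
        (cosine_similarity(query, vector), key)
        for key, vector in embedding.items()
        if key != query_key
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


related = most_related("row:employees/42", embedding)
```

With these toy vectors, the structured column and the handbook paragraph rank above the unrelated menu paragraph, even though one is structured and the other is unstructured.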

Stateful Data Processing

Large-scale data processing systems depend on stateless dataflows to extract data parallelism and execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, with state in the applications, we can translate imperative programs into stateful dataflow graphs that can execute on a stateful data-parallel processing system.
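
To make the notion of explicit operator state concrete, here is a minimal single-process sketch, assuming a counting operator whose state is partitioned by key (so partitions could be processed in parallel) and can be checkpointed and restored (so it could recover after a failure). The class and method names are hypothetical and do not correspond to any particular system's API.

```python
import copy
import zlib


class PartitionedCounter:
    """Toy stateful operator: keeps a running count per key.

    State is split into partitions by key so that, in a real system, each
    partition could live on a different worker. Everything here runs in a
    single process for illustration only.
    """

    def __init__(self, num_partitions):
        self.partitions = [{} for _ in range(num_partitions)]

    def partition_of(self, key):
        # Deterministic hash so the same key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def process(self, key):
        """Update the state for one input item and return the new count."""
        state = self.partitions[self.partition_of(key)]
        state[key] = state.get(key, 0) + 1
        return state[key]

    def count(self, key):
        return self.partitions[self.partition_of(key)].get(key, 0)

    def checkpoint(self):
        """Return an independent snapshot of all partitions."""
        return copy.deepcopy(self.partitions)

    def restore(self, snapshot):
        """Roll the operator back to a previously taken snapshot."""
        self.partitions = copy.deepcopy(snapshot)


operator = PartitionedCounter(num_partitions=4)
for word in ["a", "b", "a"]:
    operator.process(word)

snapshot = operator.checkpoint()   # state so far: a=2, b=1
operator.process("a")              # a=3, not covered by the snapshot
count_before_failure = operator.count("a")

operator.restore(snapshot)         # simulate recovery after a failure
count_after_recovery = operator.count("a")
```

After recovery, the input processed since the snapshot would have to be replayed to reach the pre-failure state again; the sketch leaves that step out.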


Work Experience

Raul Castro Fernandez © 2018