Raul Castro Fernandez

Postdoctoral Associate. MIT, CSAIL

Today we generate more data than we know how to comprehend. To benefit from the value hidden in data, our capacity to explore and understand it must match our capacity to generate it. In my research I work on problems geared towards bridging the gap. I like designing and building systems to solve practical problems; my research interests lie at the intersection of databases, systems and distributed systems. At MIT I work with professors Sam Madden and Mike Stonebraker. Before MIT, I completed my PhD at Imperial College London with Peter Pietzuch.

 

Publications

2018

Aurum: A Data Discovery System

Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, Michael Stonebraker

34th IEEE International Conference on Data Engineering, 04/2018, Paris, France

ICDE

Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery

Raul Castro Fernandez, Essam Mansour, Abdulhakim Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

34th IEEE International Conference on Data Engineering, 04/2018, Paris, France

ICDE

Meta-Dataflows: Efficient Exploratory Dataflow Jobs

Raul Castro Fernandez, William Culhane, Pijika Watcharapichat, Matthias Weidlich, Victoria Lopez Morales, Peter Pietzuch

ACM International Conference on Management of Data (SIGMOD), 06/2018, Houston, TX.

SIGMOD
2017

Extracting Syntactical Patterns from Databases

Andrew Ilyas, Joana M. F. da Trindade, Raul Castro Fernandez, Samuel Madden

34th IEEE International Conference on Data Engineering, 04/2018, Paris, France

ICDE

Building Data Civilizer Pipelines with an Advanced Workflow Engine

Essam Mansour, Dong Deng, Raul Castro Fernandez, Abdulhakim Qahtan, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

34th IEEE International Conference on Data Engineering, 04/2018, Paris, France

ICDE

A Demo of the Data Civilizer System

Raul Castro Fernandez, Dong Deng, Essam Mansour, Abdulhakim A Qahtan, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang

ACM International Conference on Management of Data (SIGMOD), 05/2017, Chicago, IL.

SIGMOD
2016

Quill: Efficient, Transferable, and Rich Analytics at Scale

Badrish Chandramouli, Raul Castro Fernandez, Jonathan Goldstein, Ahmed Eldawy, Abdul Quamar

43rd International Conference on Very Large DataBases (VLDB), 09/2017, Munich, Germany

VLDB

Ako: Decentralised Deep Learning with Partial Gradient Exchange

Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, Peter Pietzuch

ACM Symposium on Cloud Computing (SOCC) 10/2016, Santa Clara, CA.

SOCC

Detecting Data Errors: Where are we and what needs to be done?

Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang

42nd International Conference on Very Large DataBases (VLDB), 09/2016, New Delhi, India.

VLDB

Towards Large-Scale Data Discovery

Raul Castro Fernandez, Ziawasch Abedjan, Samuel Madden, Michael Stonebraker

ExploreDB Workshop collocated with ACM SIGMOD, 06/2016, San Francisco, CA.

ExploreDB@SIGMOD

SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures

Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Paolo Costa, Alexander Wolf, Peter Pietzuch

ACM International Conference on Management of Data (SIGMOD), 06/2016, San Francisco, CA.

SIGMOD

Java2SDG: Stateful Big Data Processing for the Masses

Raul Castro Fernandez, Panagiotis Garefalakis, Peter Pietzuch.

32nd IEEE International Conference on Data Engineering, 05/2016, Helsinki, Finland

ICDE
2015
2014

Liquid: Unifying Nearline and Offline Big Data Integration

Raul Castro Fernandez, Peter Pietzuch, Joel Koshy, Jay Kreps, Dong Lin, Neha Narkhede, Jun Rao, Chris Riccomini, Guozhang Wang.

In 7th Biennial Conference on Innovative Data Systems Research, Monterey, CA, USA.

CIDR

Making State Explicit for Imperative Big Data Processing

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki and Peter Pietzuch.

USENIX Annual Technical Conference, 06/2014, Philadelphia, PA, USA.

USENIX ATC

Grand Challenge Scalable Stateful Stream Processing for Smart Grids

Raul Castro Fernandez, Matthias Weidlich, Peter Pietzuch and Avigdor Gal.

8th ACM International Conference on Distributed Event Based Systems, 05/2014, Mumbai, India.

DEBS
2013

Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki and Peter Pietzuch.

ACM International Conference on Management of Data (SIGMOD), 06/2013, New York, NY.

SIGMOD

Towards Low-Latency and In-Memory Large-Scale Data Processing

Raul Castro Fernandez and Peter Pietzuch.

Doctoral Workshop of the 7th ACM International Conference on Distributed Event Based Systems, 06/2013, Arlington, Texas, USA.

DEBS
 

Projects

Data Discovery

Data is stored everywhere: in relational databases, files and hundreds of other data sources. These sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it drops dramatically: no single person in the organization knows about all the existing data sources, so the relevant ones get lost in the crowd. One big challenge is discovering the data sources that are relevant to answering a particular question. We are building Aurum, a data discovery system that answers "discovery queries" over large volumes of data.
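As a toy illustration of what a discovery query might look like, the sketch below ranks columns by how much their values overlap with a query column. It is only an assumption-laden example, not Aurum's implementation: the column names, the Jaccard measure and the `related_columns` helper are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_columns(columns, query, threshold=0.5):
    """Return (column, score) pairs whose value overlap with `query` meets the threshold.

    `columns` maps a hypothetical fully-qualified column name to its values.
    Results are sorted by descending score, then by name.
    """
    target = columns[query]
    scored = [
        (name, jaccard(target, values))
        for name, values in columns.items()
        if name != query
    ]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical columns scattered across different sources.
columns = {
    "hr.employees.dept": ["sales", "eng", "ops"],
    "finance.budget.department": ["sales", "eng", "ops", "legal"],
    "crm.contacts.city": ["boston", "madrid"],
}

matches = related_columns(columns, "hr.employees.dept")
print(matches)
```

Here the query surfaces the budget column in another source as a likely join candidate, while the unrelated city column is filtered out.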

Topology-Aware Dataflows

The dataflow abstraction implemented as part of large-scale data processing engines has reduced the time required to run analytical queries over large volumes of data by exploiting data parallelism, while permitting users to write their algorithms in a high-level language. However, many applications require both data and task parallelism, such as complex physical and biological simulations that depend on linear algebra operations. These applications are typically expressed as SPMD programs that intertwine algorithm logic and topology information, which produces hard-to-understand and error-prone code. We are exploring a new abstraction to bring the benefits of dataflows to HPC-like programs.

Metadataflows

Some applications require both input data and fine-tuning of a set of parameters (such as learning rate, smoothing factor and optimizer type for machine learning applications) to produce results: we call these applications exploratory queries. Users spend a long time orchestrating the different parameters they want to try, which is time-consuming and resource-inefficient: each instantiation becomes a separate dataflow that executes in a dataflow system. Instead, we propose metadataflows as a new dataflow abstraction that lets users represent exploratory queries succinctly. Metadataflows expose characteristics that allow us to execute these kinds of queries more efficiently, such as sharing intermediate results, avoiding redundant computation and using more sophisticated memory management mechanisms.
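The idea of sharing intermediate results across parameter settings can be sketched in a few lines. This is a minimal illustration under my own assumptions, not the metadataflow system itself: the `explore` function, the exponential smoothing step and the toy gradient-descent "training" are hypothetical stand-ins for real pipeline stages.

```python
from itertools import product


def explore(data, smoothing_factors, learning_rates, steps=20):
    """Run every (smoothing, learning rate) combination, reusing smoothed data.

    Returns the per-configuration results and how often the smoothing stage ran.
    """
    cache = {}
    preprocess_runs = 0

    def smooth(alpha):
        # The smoothing stage depends only on alpha, so its output is
        # computed once and shared by every learning rate.
        nonlocal preprocess_runs
        if alpha not in cache:
            preprocess_runs += 1
            out, prev = [], data[0]
            for x in data:
                prev = alpha * x + (1 - alpha) * prev
                out.append(prev)
            cache[alpha] = out
        return cache[alpha]

    results = {}
    for alpha, lr in product(smoothing_factors, learning_rates):
        smoothed = smooth(alpha)
        # Toy downstream stage: fit a constant w by gradient descent
        # on the mean squared error against the smoothed series.
        w = 0.0
        for _ in range(steps):
            grad = sum(2 * (w - x) for x in smoothed) / len(smoothed)
            w -= lr * grad
        results[(alpha, lr)] = w
    return results, preprocess_runs


results, preprocess_runs = explore(
    data=[1.0, 2.0, 3.0, 4.0],
    smoothing_factors=[0.2, 0.8],
    learning_rates=[0.01, 0.1, 0.3],
)
print(len(results), "configurations,", preprocess_runs, "smoothing runs")
```

Running each configuration as an independent job would repeat the smoothing stage six times; with sharing it runs once per smoothing factor.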

Stateful Data-Parallel Processing

Large-scale data processing systems depend on stateless dataflows to extract data parallelism and execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while keeping the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, with state in the applications we can translate imperative programs into stateful dataflow graphs that can execute on a stateful data-parallel processing system.
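A rough sketch of the combination described above, explicit state that is partitioned for parallelism and checkpointed for fault tolerance, might look as follows. It is an illustrative assumption rather than the design of any of the systems mentioned: the `Partition` class, `partition_of` and `run` are hypothetical names, and everything runs in a single process.

```python
import copy
import zlib


class Partition:
    """One shard of mutable state, owned by a single (simulated) worker."""

    def __init__(self):
        self.state = {}

    def update(self, key, value):
        # Explicit state access: accumulate a running sum per key.
        self.state[key] = self.state.get(key, 0) + value

    def checkpoint(self):
        # Snapshot the state so it can be restored after a failure.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)


def partition_of(key, n):
    # crc32 gives a partitioning that is stable across processes,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % n


def run(records, partitions):
    """Route each record to the partition that owns its key."""
    for key, value in records:
        partitions[partition_of(key, len(partitions))].update(key, value)


partitions = [Partition() for _ in range(3)]
run([("a", 1), ("b", 2), ("a", 3)], partitions)

# Take checkpoints, simulate losing one partition, then recover it.
snapshots = [p.checkpoint() for p in partitions]
failed = partition_of("a", len(partitions))
partitions[failed].state = {}
partitions[failed].restore(snapshots[failed])

merged = {}
for p in partitions:
    merged.update(p.state)
print(merged)
```

Because every key lives in exactly one partition, workers can update their shards independently, and a failed shard can be rebuilt from its own snapshot without touching the others.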

 

Experience

MIT: Postdoctoral Associate

Period: October 2015 - Current

Cambridge, MA, USA

Microsoft Research: Research Intern

Period: July 2015 - September 2015

Redmond, WA, USA

Imperial College London: PhD student

Period: October 2011 - September 2015

London, UK

LinkedIn: Software Engineer Intern

Period: June 2014 - August 2014

Mountain View, CA, USA

UC3M: Researcher at FP7 Project

Period: September 2009 - September 2011

Madrid, Spain

Others

contxt.in

contxt helps you discuss news with people you care about. Information overload means that we do not have time to process the seemingly infinite streams of news we receive every day. contxt helps tame this overload by curating news from the many different data sources that interest you (Facebook, Twitter, LinkedIn, feeds, etc.) and offering them as concise summaries. You can then start private conversations about the most interesting pieces of news with the people you choose. By curating multiple sources of news and trusting your friends, contxt helps you stay up to date without effort.

Ecana

I co-founded this company (Ecana Sistemas de Informacion SL) to help wineries improve their production processes. Ecana acquires data from sensors deployed in vineyards, weather stations and human-provided knowledge. We then transform that data into valuable information for humans and finally visualise the information in dashboards. The goal was to keep winemakers up to date on what is going on in their winery and alert them when important events occur, such as a rising probability of freezing or disease.