Postdoctoral Associate. MIT, CSAIL
Today we generate more data than we know how to comprehend. To benefit from the value hidden in data, our capacity to explore and understand it must match our capacity to generate it. My research tackles problems aimed at bridging this gap. I like designing and building systems that solve practical problems; my research interests lie at the intersection of databases, systems, and distributed systems. At MIT I work with professors Sam Madden and Mike Stonebraker. Before MIT, I completed my PhD at Imperial College London with Peter Pietzuch.
43rd International Conference on Very Large Data Bases (VLDB), 09/2017, Munich, Germany.
ACM Symposium on Cloud Computing (SoCC), 10/2016, Santa Clara, CA, USA.
42nd International Conference on Very Large Data Bases (VLDB), 09/2016, New Delhi, India.
ExploreDB Workshop, co-located with ACM SIGMOD, 06/2016, San Francisco, CA, USA.
ACM International Conference on Management of Data (SIGMOD), 06/2016, San Francisco, CA, USA.
32nd IEEE International Conference on Data Engineering (ICDE), 05/2016, Helsinki, Finland.
7th Biennial Conference on Innovative Data Systems Research (CIDR), Monterey, CA, USA.
USENIX Annual Technical Conference (ATC), 06/2014, Philadelphia, PA, USA.
8th ACM International Conference on Distributed Event-Based Systems (DEBS), 05/2014, Mumbai, India.
ACM International Conference on Management of Data (SIGMOD), 06/2013, New York, NY, USA.
Doctoral Workshop of the 7th ACM International Conference on Distributed Event-Based Systems (DEBS), 06/2013, Arlington, TX, USA.
Data is stored everywhere: in relational databases, files, and hundreds of other data sources. These sources contain valuable information and insights that can benefit many aspects of modern data-driven organizations. However, as more data is produced, our ability to use it diminishes dramatically: no single person in the organization knows about all the existing data sources, so they are lost in the crowd. One big challenge is to discover the data sources that are relevant to answering a particular question. We are building Aurum, a data discovery system that answers "discovery queries" over large volumes of data.
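To make the idea concrete, here is a toy sketch of one kind of discovery query: given lightweight profiles of many sources, find those whose columns look relevant to a keyword. The data, the `discover` function, and its behavior are illustrative assumptions, not Aurum's actual interface.

```python
# Hypothetical column profiles for a handful of data sources.
SOURCES = {
    "hr/employees.csv": ["employee_id", "name", "salary", "dept"],
    "finance/payroll.db": ["emp_id", "monthly_salary", "tax_code"],
    "sales/leads.csv": ["lead_id", "company", "contact_email"],
}

def discover(keyword):
    """Return sources with a column whose name mentions the keyword."""
    keyword = keyword.lower()
    return sorted(
        src for src, cols in SOURCES.items()
        if any(keyword in col.lower() for col in cols)
    )

# Both the HR file and the payroll database mention salary data.
print(discover("salary"))  # ['finance/payroll.db', 'hr/employees.csv']
```

A real discovery system must of course scale this idea to thousands of sources and richer signals (content similarity, not just column names), but the shape of the query is the same.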
The dataflow abstraction implemented by large-scale data processing engines has reduced the time required to run analytical queries over large volumes of data by exploiting data parallelism, while letting users write their algorithms in a high-level language. However, many applications require both data and task parallelism, such as complex physical and biological simulations that depend on linear algebra operations. These applications are typically expressed as single-program multiple-data (SPMD) programs that intertwine algorithm logic with topology information, producing code that is hard to understand and error-prone. We are exploring a new abstraction that brings the benefits of dataflows to HPC-like programs.
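The following toy sketch (my own illustration, not code from the project) shows the intertwining problem in an SPMD-style one-dimensional averaging step: the actual algorithm is one line, but it is tangled with rank arithmetic and neighbor exchange, exactly the bookkeeping a dataflow abstraction would hide.

```python
def spmd_average(chunks):
    """Each 'rank' averages its value with its left/right neighbors."""
    n = len(chunks)
    out = []
    for rank, value in enumerate(chunks):            # one pass per "process"
        # Topology bookkeeping: fetch halo values from neighboring ranks,
        # with boundary ranks falling back to their own value.
        left = chunks[rank - 1] if rank > 0 else value
        right = chunks[rank + 1] if rank < n - 1 else value
        # The actual algorithm logic is just this one line.
        out.append((left + value + right) / 3)
    return out

print(spmd_average([0.0, 3.0, 6.0]))  # [1.0, 3.0, 5.0]
```

In a real MPI program the neighbor exchange becomes explicit sends and receives, which makes the tangling (and the opportunity for bugs) considerably worse.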
Some applications require both input data and the fine-tuning of a set of parameters (such as the learning rate, smoothing factor, and optimizer type in machine learning applications) to produce results: we call these applications exploratory queries. Users spend a long time orchestrating the different parameter combinations they want to try, which is both time-consuming and resource-inefficient: each instantiation becomes a separate dataflow that executes in a dataflow system. Instead, we propose metadataflows, a new dataflow abstraction that lets users represent exploratory queries succinctly. Metadataflows let us exploit characteristics that make these kinds of queries more efficient to execute, such as sharing intermediate results, avoiding redundant computation, and using more sophisticated memory management mechanisms.
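A minimal sketch of the sharing opportunity, under assumed stand-in stages (`preprocess` and `train` are hypothetical): every configuration in the parameter grid shares the same preprocessing, so a system that sees the whole grid can compute it once instead of once per instantiated dataflow.

```python
import itertools

calls = {"preprocess": 0}

def preprocess(data):
    """Stand-in for an expensive stage shared by every configuration."""
    calls["preprocess"] += 1
    return [x / max(data) for x in data]

def train(features, lr, optimizer):
    """Stand-in for a training stage that depends on the parameters."""
    return sum(features) * lr

data = [1, 2, 4]
grid = list(itertools.product([0.1, 0.01], ["sgd", "adam"]))

shared = preprocess(data)  # computed once, reused by all 4 configurations
results = {(lr, opt): train(shared, lr, opt) for lr, opt in grid}

print(calls["preprocess"], len(results))  # 1 4
```

Naively instantiating one dataflow per configuration would run `preprocess` four times here; the point of a metadataflow is that the system, not the user, discovers and exploits this sharing.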
Large-scale data processing systems depend on stateless dataflows to extract data parallelism and execute programs with fault tolerance. Many applications that require explicit access to state cannot be executed efficiently in such systems. Stateful data-parallel processing makes it possible to execute stateful programs efficiently while preserving the data parallelism and fault tolerance properties of traditional dataflow systems. In addition, once state is explicit in applications, we can translate imperative programs into stateful dataflow graphs that execute on a stateful data-parallel processing system.
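As a minimal illustration (the class and method names are my own, not the system's API), a stateful operator makes its state explicit so the runtime can partition it by key for data parallelism and snapshot it for fault tolerance:

```python
class StatefulCounter:
    """A stateful dataflow operator: counts occurrences per key."""

    def __init__(self):
        self.state = {}  # explicit state, partitionable by key

    def process(self, key):
        # Because state is keyed, the runtime can shard this operator
        # across workers, each owning a disjoint set of keys.
        self.state[key] = self.state.get(key, 0) + 1
        return self.state[key]

    def checkpoint(self):
        # An explicit snapshot is what enables recovery after a failure.
        return dict(self.state)

op = StatefulCounter()
for word in ["a", "b", "a"]:
    op.process(word)
print(op.checkpoint())  # {'a': 2, 'b': 1}
```

Hiding this state inside opaque user code, as imperative programs do, is what prevents a stateless dataflow engine from parallelizing or recovering it; making it explicit restores both properties.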
Period: October 2015 - Present
Period: July 2015 - September 2015
Period: October 2011 - September 2015
contxt helps you discuss news with the people you care about. Information overload means that we do not have time to process the seemingly infinite streams of news we receive every day. contxt tames this overload by curating news from the many different sources that interest you (Facebook, Twitter, LinkedIn, feeds, etc.) and offering them as concise summaries. You can then start private conversations about the most interesting pieces of news with the people you choose. By curating multiple sources of news and relying on the friends you trust, contxt helps you stay up to date without effort.
I co-founded this company (Ecana Sistemas de Informacion SL) to help wineries improve their production processes. Ecana acquires data from sensors deployed in vineyards, from weather stations, and from human-provided knowledge. We transform that data into information valuable to winemakers and visualize it in dashboards. The goal was to keep winemakers up to date on what is going on in their winery and alert them when important events occur, such as a rising risk of frost or disease.