Raul Castro Fernandez

Assistant Professor of Computer Science · The University of Chicago · Chief Research Officer, invocate

Data ecology asks what data does and what we can make it do.

About

I'm an Assistant Professor of Computer Science at the University of Chicago and co-founder and Chief Research Officer at invocate. I originated the concept of data ecology, which frames my research. Data ecology studies how data moves through and transforms technological, economic, and social systems, and how we can design technical, economic, and social interventions to steer those ecosystems toward goals we deem worthy.

I lead work in data ecology, data discovery, and related areas such as data markets and data integration. My group develops both theory and systems that help people and organizations find, evaluate, and use data effectively.

I co-lead the Data Ecology research initiative at the Data Science Institute, and I co-run Chicago Data Night, a forum that brings together industry and academia in Chicago.

★ Sloan Research Fellowship (2025) ★ NSF CAREER Award (2024) ★ SIGMOD Test-of-Time Award (2023)


News

APR'26 Presented at the AImpact Workshop at UIUC

FEB'26 Presented Data Ecology and Data Markets work at M3 Workshop

JAN'26 Presented the Pneuma Project at CIDR

SEP'25 Named Distinguished Associate Editor for VLDB. Co-chaired the PhD Workshop at VLDB'25

AUG'25 Speaker at the Incentives for Collaborative Learning and Data Sharing — TTIC Summer Workshop

FEB'25 Named Sloan Research Fellow

JAN'25 Talk on data ecology at the AI/ML Affinity group at the US Census Bureau


Publications

2026

How AI Companies Can Pay Fair Rates for the Content They Need. E. Glen Weyl, Raul Castro Fernandez. Harvard Business Review 2026 New
Optimal Pricing for Data-Augmented AutoML Marketplaces. Steven Xia, Minbiao Han, Jonathan Light, Sainyam Galhotra, Raul Castro Fernandez, Haifeng Xu. ICML 2026 New
Demonstration of Pneuma-Seeker: Agentic System for Reifying and Fulfilling Information Needs on Tabular Data. Muhammad Imam Luthfi Balaka, Raul Castro Fernandez. CAIS 2026 (demo) New
Programmable Dataflows: Abstraction and Programming Model for Data Sharing. Siyuan Xia, Chris Zhu, Tapan Srivastava, Bridget Fahey, Raul Castro Fernandez. PVLDB Journal 2026 New
The Pneuma Project: Reifying Information Needs as Relational Schemas to Automate Discovery, Guide Preparation, and Align Data with Intent. Muhammad Imam Luthfi Balaka, Raul Castro Fernandez. CIDR 2026
The Structural Law of Data. Bridget Fahey, Raul Castro Fernandez. The University of Chicago Law Review 2026

2025

What is the Value of Data?: A Theory and Systematization. Raul Castro Fernandez. ACM/IMS Journal of Data Science 2025
Data Discovery is a Socio-Technical Problem: the Path from Document Identification and Retrieval to Data Ecology. Raul Castro Fernandez. IEEE Data Engineering Bulletin 2025
Core Hours and Carbon Credits: Incentivizing Sustainability in HPC. Alok Kamatar, Maxime Gonthier, Valérie Hayot-Sasson, André Bauer, Marcin Copik, Torsten Hoefler, Raul Castro Fernandez, Kyle Chard, Ian T. Foster. SC 2025
Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System. Muhammad Imam Luthfi Balaka, David Alexander, Qiming Wang, Yue Gong, Adila Krisnadhi, Raul Castro Fernandez. SIGMOD 2025
Data Ecology: Understanding and Designing Data Ecosystems. Raul Castro Fernandez. SIGMOD Record (DBrainstorming) 2025
Mass-Scale Analysis of In-the-Wild Conversations Reveals Complexity Bounds on LLM Jailbreaking. Aldan Creo, Raul Castro Fernandez, Manuel Cebrian. SGAI-AI 2025
Not-So-Bitter Pill to Swallow: Slipstreaming Memory Safe Programming via Rust as part of a Database Systems Course. Mohammed Suhail Rehman, Aaron Elmore, Raul Castro Fernandez. SIGMOD 2025

2024

Saving Money for Analytical Workloads in the Cloud. Tapan Srivastava, Raul Castro Fernandez. VLDB 2024
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach. Qiming Wang, Raul Castro Fernandez. SIGMOD 2024
Nexus: Correlation Discovery over Collections of Spatio-Temporal Tabular Data. Yue Gong, Sainyam Galhotra, Raul Castro Fernandez. SIGMOD 2024
Cackle: Analytical Workload Cost and Performance Stability with Elastic Pools. Matthew Perron, Raul Castro Fernandez, David DeWitt, Michael Cafarella, Samuel Madden. SIGMOD 2024
Responsible Sharing of Spatiotemporal Data. Raul Castro Fernandez, Arnab Nandi. SIGMOD 2024 (Tutorial)
Demonstration of Ver: View Discovery in the Wild. Kevin Dharmawan, Chirag Kawediya, Yue Gong, Zaki Indra Yudhistira, Zhiru Zhu, Sainyam Galhotra, Adila Alfa Krisnadhi, Raul Castro Fernandez. SIGMOD 2024 (Demo)
Demonstrating Nexus for Correlation Discovery over Collections of Spatio-Temporal Tabular Data. Yue Gong, Raul Castro Fernandez. SIGMOD 2024 (Demo)

2023

How Large Language Models Will Disrupt Data Management. Raul Castro Fernandez, Aaron Elmore, Michael Franklin, Sanjay Krishnan, Chenhao Tan. VLDB 2023
Data and AI Model Markets: Grand Opportunities for Data and Model Sharing, Discovery, and Integration. Jian Pei, Raul Castro Fernandez, Xiaohui Yu. VLDB 2023 (Tutorial)
Saibot: A Differentially Private Data Search Platform. Zezhou Huang, Jiaxiang Liu, Daniel Gbenga Alabi, Raul Castro Fernandez, Eugene Wu. VLDB 2023
Addressing Budget Allocation and Revenue Allocation in Data Market Environments Using an Adaptive Sampling Algorithm. Boxin Zhao, Boxiang Lyu, Raul Castro Fernandez, Mladen Kolar. ICML 2023
Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia. Raul Castro Fernandez. SIGMOD 2023
Metam: Goal-Oriented Data Discovery. Sainyam Galhotra, Yue Gong, Raul Castro Fernandez. ICDE 2023
Ver: View-Discovery in the Wild. Yue Gong, Zhiru Zhu, Sainyam Galhotra, Raul Castro Fernandez. ICDE 2023

2022

Data Station: Delegated, Trustworthy, and Auditable Computation to Enable Data-Sharing Consortia with a Data Escrow. Siyuan Xia, Zhiru Zhu, Chris Zhu, Jinjin Zhao, Kyle Chard, Aaron Elmore, Ian Foster, Michael Franklin, Sanjay Krishnan, Raul Castro Fernandez. VLDB 2022
Revisiting Online Data Markets in 2022: A Seller and Buyer Perspective. Javen Kennedy, Pranav Subramaniam, Sainyam Galhotra, Raul Castro Fernandez. SIGMOD Record
Enabling AI Innovation via Data and Model Sharing: An Overview of the NSF Convergence Accelerator Track D. Several authors. AI Magazine
Protecting Data Markets from Strategic Buyers. Raul Castro Fernandez. SIGMOD 2022
Leva: Boosting Machine Learning Performance with Relational Embedding Data Augmentation. Alex Zhao, Raul Castro Fernandez. SIGMOD 2022

2020

Data Market Platforms: Trading Data Assets to Solve Data Problems. Raul Castro Fernandez, Pranav Subramaniam, Michael Franklin. VLDB 2020
ARDA: Automatic Relational Data Augmentation for Machine Learning. Nadiia Chepurko, Ryan Marcus, Emanuel Zgraggen, Raul Castro Fernandez, Tim Kraska, David Karger. VLDB 2020
Starling: A Scalable Query Engine on Cloud Function Services. Matt Perron, Raul Castro Fernandez, David DeWitt, Samuel Madden. SIGMOD 2020
A System for Studying Deep Network Training. Raul Castro Fernandez. CIDR 2020 (Abstract)

2019

Lazo: A Cardinality-Based Method for Coupled Estimation of Jaccard Similarity and Containment. Raul Castro Fernandez, Jisoo Min, Demitri Devada, Samuel Madden. ICDE 2019
Termite: A System for Tunneling Through Heterogeneous Data. Raul Castro Fernandez, Samuel Madden. AIDM@SIGMOD 2019
Raha: A Configuration-Free Error Detection System. Mohammad Mahdavi, Ziawasch Abedjan, Raul Castro Fernandez, Sam Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang. SIGMOD 2019
Aurum: A Story About Research Taste. Raul Castro Fernandez. Making Databases Work. ACM Morgan & Claypool. 2019

2018

Aurum: A Data Discovery System. Raul Castro Fernandez, Ziawasch Abedjan, Famien Koko, Gina Yuan, Samuel Madden, Michael Stonebraker. ICDE 2018
Seeping Semantics: Linking Datasets using Word Embeddings for Data Discovery. Raul Castro Fernandez, Essam Mansour, Abdulhakim Qahtan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang. ICDE 2018
Meta-Dataflows: Efficient Exploratory Dataflow Jobs. Raul Castro Fernandez, William Culhane, Pijika Watcharapichat, Matthias Weidlich, Victoria Lopez Morales, Peter Pietzuch. SIGMOD 2018
Extracting Syntactical Patterns from Databases. Andrew Ilyas, Joana M. F. da Trindade, Raul Castro Fernandez, Samuel Madden. ICDE 2018
FAHES: A Robust Disguised Missing Values Detector. Mourad Ouzzani, Nan Tang, Ahmed Elmagarmid, Raul Castro Fernandez, Abdulhakim A. Qahtan. KDD 2018
Building Data Civilizer Pipelines with an Advanced Workflow Engine. Essam Mansour, Dong Deng, Raul Castro Fernandez, Abdulhakim Qahtan, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang. ICDE 2018 (Demo)

2017

Quill: Efficient, Transferable, and Rich Analytics at Scale. Badrish Chandramouli, Raul Castro Fernandez, Jonathan Goldstein, Ahmed Eldawy, Abdul Quamar. VLDB 2017
The Data Civilizer System. Dong Deng, Raul Castro Fernandez, Ziawasch Abedjan, Sibo Wang, Michael Stonebraker, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Nan Tang. CIDR 2017
A Demo of the Data Civilizer System. Raul Castro Fernandez, Dong Deng, Essam Mansour, Abdulhakim A Qahtan, Wenbo Tao, Ziawasch Abedjan, Ahmed Elmagarmid, Ihab Ilyas, Samuel Madden, Mourad Ouzzani, Michael Stonebraker, Nan Tang. SIGMOD 2017 (Demo)

2016

Ako: Decentralised Deep Learning with Partial Gradient Exchange. Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, Peter Pietzuch. SoCC 2016
Detecting Data Errors: Where are we and what needs to be done? Ziawasch Abedjan, Xu Chu, Dong Deng, Raul Castro Fernandez, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Michael Stonebraker, Nan Tang. VLDB 2016
Towards Large-Scale Data Discovery. Raul Castro Fernandez, Ziawasch Abedjan, Samuel Madden, Michael Stonebraker. ExploreDB@SIGMOD 2016
SABER: Window-Based Hybrid Stream Processing for Heterogeneous Architectures. Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Paolo Costa, Alexander Wolf, Peter Pietzuch. SIGMOD 2016
Java2SDG: Stateful Big Data Processing for the Masses. Raul Castro Fernandez, Panagiotis Garefalakis, Peter Pietzuch. ICDE 2016 (Demo)

2015

Liquid: Unifying Nearline and Offline Big Data Integration. Raul Castro Fernandez, Peter Pietzuch, Joel Koshy, Jay Kreps, Dong Lin, Neha Narkhede, Jun Rao, Chris Riccomini, Guozhang Wang. CIDR 2015

2014

Making State Explicit for Imperative Big Data Processing. Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch. USENIX ATC 2014
Grand Challenge: Scalable Stateful Stream Processing for Smart Grids. Raul Castro Fernandez, Matthias Weidlich, Peter Pietzuch, Avigdor Gal. DEBS 2014

2013

Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management. Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, Peter Pietzuch. SIGMOD 2013 ★ SIGMOD 2023 Test-of-Time Award
Towards Low-Latency and In-Memory Large-Scale Data Processing. Raul Castro Fernandez, Peter Pietzuch. PhD Workshop@DEBS 2013

Students

I work with postdocs and PhD students. In the past I have also worked with master's, undergraduate, and visiting students.

Postdocs & PhD Students

Chris Zhu
Hrishee Shastri
Luthfi Balaka
Danni Liu
Steven Kochevar
Pooja Kulkarni

Alumni

Zhiru Zhu
Tapan Srivastava
Steven Xia (Meta)
Alena Zeng (Notion)
Yue Gong (Amazon AWS)
Qiming Wang (stealth startup)
Sainyam Galhotra (Cornell, Asst. Professor)
Kevin Dharmawan (Stony Brook PhD)
David Alexander (UWashington PhD)
Chirag Kawediya (startup)
Zach Hempstead (Anthropic)
Stanley Zhu (Google)
Alex Zhao (Citadel)
Jenny Long
Yintong Ma (ByteDance)
Ipsita Mohanty (UWaterloo MSc)
Ryan Wong (UMichigan undergrad)

Teaching

Databases for Public Policy — Spring 2025, 2026
The Value of Data — Fall 2020, 2021, 2022, 2023; Spring 2024; Fall 2024, 2025
Ethics, Fairness, Responsibility, and Privacy in Data Science — Spring 2020, 2021, 2022, 2023, 2024
Introduction to Databases — Winter 2020, 2021, 2022, 2023

Service

VLDB 2026 — Metareviewer
CIDR 2026 — PC Member
VLDB 2025 — Metareviewer, PhD Workshop Co-Chair
Tabular Representation Workshop, NeurIPS 2025
SIGMOD 2025 — PC Member
CIDR 2025 — PC Member
SIGMOD 2024 — PC Member
CIDR 2024 — PC Member
SIGMOD 2023 — PC Member, Mentorship Co-Chair
VLDB 2023 — PC Member, Publicity Chair
HPTS 2022 — PC Member
SIGMOD 2022 — PC Member, Publicity Chair
VLDB 2022 — PC Member, Workshop Co-Chair
KDD 2021 — PC Member
SIGMOD 2021 — PC Member (Demo track)
VLDB 2021 — PC Member (Distinguished Reviewer Award)
ICDE 2021 — PC Member
VLDB 2020 — PC Member
SoCC 2020 — PC Member
SIGMOD 2019 — PC Member (Distinguished Reviewer Award)
Journals: VLDBJ, TKDE, TODS, SIGMOD Record

Awards & Recognition

Sloan Research Fellowship (2025) — Alfred P. Sloan Foundation
NSF CAREER Award (2024) — National Science Foundation
SIGMOD Test-of-Time Award (2023) — ACM SIGMOD, for the 2013 paper on stream processing state management
VLDB Distinguished Associate Editor (2025)
VLDB Distinguished Reviewer Award (2021)
SIGMOD Distinguished Reviewer Award (2019)