This is what we are presenting at SIGMOD'24

2024, Jun 03    

SIGMOD’24 will take place in Santiago (Chile) next week, and, as usual, I thought I’d share some of the work our group will be presenting then. We have three research papers, two demos, and one tutorial. Here’s a sneak peek of each of these in case you cannot make it to the conference.

How do we Incorporate Knowledge into LLMs? The Solo System

If you could incorporate your organization’s data into LLMs, it’d be much easier to find what you need, right? I’m not sure; I think there are many other issues that the premise does not capture, but let’s say “maybe.” In any case, figuring out how to do that is a great research topic, and there are already a few alternative approaches exploring this question: RAG, fine-tuning, etc. If you want to read a bit more about the differences (as of ~6 months ago), you can take a look at this post.

Solo, the work that my PhD student Qiming will present at SIGMOD, is a system that answers questions written in English over collections of tabular data. You point to the data and get the answer to your questions (detail: you get the table that contains the answer to your question and not the answer directly; there’s a good reason for this design). Crucially, you don’t need to “retrain” the system whenever you change the input data. This is due to a self-supervised approach that we implement as part of the system, and that is one of the main contributions of the paper. Not having to think about obtaining training data is the main advantage of this approach.

Compared to a RAG architecture (which typically also does not need retraining), Solo uses more “structure”: structure present in the data and structure present in the question, i.e., schema, data types, etc. Incorporating this structure carefully, instead of just hoping the LLM would do the right thing, was another innovation of Solo. This worked so well that I think we’ll see a lot more of this moving forward. The intuitive reason why this works well is that it reduces the space for the LLM to make mistakes, including what some call “hallucinations.” It is not obvious what the right way is to incorporate this “structure” into these kinds of pipelines, but I think it’s increasingly clear that carefully including some structure helps. Among the approaches being tested, some advocate representing structure as controlled vocabularies (think of ontologies and knowledge graphs). Those seem promising. What we did in Solo is simpler and, I believe, complementary.
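To make “using structure” a bit more concrete, here is a toy sketch. It is not Solo’s actual pipeline: the serialization format, the `infer_type` and `score` helpers, and the example tables are all assumptions made up for illustration. The idea it shows is only that each cell can carry its table name, column name, and a rough type tag, so that a retriever matches a question against the schema as well as the values.

```python
from datetime import date

def infer_type(value):
    """Very rough type tag for a cell value (illustrative only)."""
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, date):
        return "date"
    return "text"

def serialize_table(name, columns, rows):
    """Emit one string per cell, carrying table name, column name, type, and value."""
    out = []
    for row in rows:
        for col, val in zip(columns, row):
            out.append(f"table={name} | column={col} | type={infer_type(val)} | value={val}")
    return out

def score(question, table_strings):
    """Count distinct question words that also appear in the serialized table."""
    words = set(question.lower().replace("?", "").split())
    text = " ".join(table_strings).lower()
    tokens = set(text.replace("|", " ").replace("=", " ").split())
    return len(words & tokens)

# Hypothetical tables: (column names, rows).
tables = {
    "schools": (["district", "attendance"], [["north", 0.93], ["south", 0.88]]),
    "budgets": (["department", "amount"], [["parks", 1200], ["roads", 5400]]),
}
serialized = {n: serialize_table(n, cols, rows) for n, (cols, rows) in tables.items()}

question = "What is the attendance in the north district?"
# Return the table most likely to contain the answer, not the answer itself.
best = max(serialized, key=lambda n: score(question, serialized[n]))
print(best)
```

A real system would replace the word-overlap score with a learned retriever; the point here is only that column names and types are part of what gets matched.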

Nexus: Discovering Correlations in Data

The first step in many statistical analyses is understanding if two (or more) variables are related. For example, if a researcher wants to understand whether school attendance and crime are related, they may start by checking whether there is a correlation between these variables. If there is, they may form a hypothesis and then run an experiment or conduct some other statistical analysis to (in the best-case scenario and under some assumptions) establish causality. Nexus is a software system designed to help researchers identify correlations between attributes when the number of attributes is very large. Where do these attributes come from? In the most common scenarios, researchers will have a (more or less small) curated list of plausibly relevant attributes. This is good because it simplifies the analysis they have to conduct. At the same time, they may miss important variables that would affect their analysis (confounders); they may even miss relationships between other variables that suggest interesting alternative hypotheses. So what do we do about this?

Nexus takes as input a collection of tables (each containing multiple variables represented as attributes) and, optionally, an existing hypothesis between two variables. Then, Nexus can: 1) identify variables related to those in the hypothesis, e.g., variables that help explain the strength of the relationship; 2) list all pairs of correlated variables in the input collection.

Nexus will automatically align rows from disparate tables by exploiting temporal and spatial semantics. For example, Nexus will align all events (think of rows of tables) that happen within a temporal and spatial range. One challenge of spatiotemporally aligning data is dealing with different time and space granularities, e.g., one dataset contains only hourly data while others contain events at the second level.
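As a rough illustration of the granularity problem (a minimal sketch, not how Nexus implements alignment), one can coarsen every event to a shared key, here the hour plus latitude/longitude rounded to one decimal, aggregate each table per key, and keep the keys both tables share. The function names, the rounding-based grid, and the sample rows are assumptions for the example.

```python
from collections import defaultdict
from datetime import datetime

def aggregate(events, decimals=1):
    """Average values per (hour, rounded lat, rounded lon) key.

    Each event is a (timestamp, lat, lon, value) tuple.
    """
    groups = defaultdict(list)
    for ts, lat, lon, value in events:
        key = (
            ts.replace(minute=0, second=0, microsecond=0),  # truncate to the hour
            round(lat, decimals),
            round(lon, decimals),
        )
        groups[key].append(value)
    return {k: sum(v) / len(v) for k, v in groups.items()}

def align(table_a, table_b):
    """Pair up aggregated values for keys present in both tables."""
    a, b = aggregate(table_a), aggregate(table_b)
    return {k: (a[k], b[k]) for k in a.keys() & b.keys()}

# Second-level events (hypothetical sensor readings).
sensor = [
    (datetime(2024, 6, 3, 9, 0, 5), 41.78, -87.60, 10.0),
    (datetime(2024, 6, 3, 9, 30, 40), 41.81, -87.62, 14.0),
    (datetime(2024, 6, 3, 10, 15, 0), 41.78, -87.60, 20.0),
]
# Hourly data (hypothetical counts).
hourly = [
    (datetime(2024, 6, 3, 9, 0), 41.8, -87.6, 3.0),
    (datetime(2024, 6, 3, 11, 0), 41.8, -87.6, 7.0),
]

aligned = align(sensor, hourly)
for key, pair in aligned.items():
    print(key, pair)
```

Only the 9:00 bucket survives, because it is the only hour and cell that both tables cover. Choosing the aggregation function (mean, sum, count) and the cell size is itself a design decision that changes which correlations show up later.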

After aligning datasets, Nexus will find all correlations between those variables (at the right granularity). When we got this running, we immediately ran into a problem. The number of correlations identified in even small datasets (~300 tables) can be huge. No analyst wants to check out 30K relationships manually. An innovation of Nexus is that, as a system, it incorporates different strategies to help analysts filter and navigate those correlations, including exploiting the potential structure of the causal graph, a causal graph we don’t have access to!
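To see why the number of candidate relationships explodes, note that n attributes yield n(n-1)/2 pairs. The sketch below enumerates every pair, computes Pearson’s correlation, and keeps those above a threshold. This is the naive baseline, not the filtering strategies Nexus uses; the threshold of 0.8 and the sample columns are assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated in this sketch
    return cov / sqrt(vx * vy)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold, strongest first."""
    pairs = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical, already-aligned columns.
columns = {
    "attendance": [0.95, 0.90, 0.85, 0.80, 0.70],
    "incidents": [10, 14, 18, 25, 31],
    "rainfall": [3, 9, 1, 7, 4],
}

pairs = correlated_pairs(columns)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

With three columns there are only three pairs to check; with thousands of attributes across hundreds of tables, a fixed threshold alone leaves far too many survivors, which is where smarter filtering becomes necessary.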

If you want to read about all this in more detail, take a look at the paper. Oh, and there’ll be a demo about Nexus as well – that’s one of the two demos we are presenting at SIGMOD.

By the way, the other demo is about Ver, a system I have described in the past and that we put together in collaboration with Universitas Indonesia and undergraduate researchers at The University of Chicago.

Cackle: Analytical Workload Cost and Performance Stability with Elastic Pools

This is Matt Perron’s last work for his PhD before heading to MSR (what a great hire!). Cackle is, in some ways, an evolution of Starling, a data processing system built on top of serverless functions and blob storage. Unlike Starling, Cackle considers the availability of traditional VMs in addition to serverless functions. Then it asks: given the unpredictability of future workloads, how should we provision compute (VMs or serverless functions) to strike a balance between performance and cost?
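A back-of-the-envelope version of that trade-off, as a sketch only: assume a fixed number of always-on VM workers, with any demand above that spilling over to serverless functions at a higher unit price. The prices, the demand trace, and both function names are invented for illustration, and the sketch cheats by knowing the whole demand trace in advance, which is exactly the information a real system does not have.

```python
def cost(demand, vms, vm_price=1.0, fn_price=3.0):
    """Total cost when `vms` workers stay on for every time step and
    any demand above `vms` is served by serverless functions."""
    vm_cost = vms * vm_price * len(demand)
    overflow = sum(max(0, d - vms) for d in demand)
    return vm_cost + overflow * fn_price

def best_vm_count(demand, **prices):
    """Pick the VM count that minimizes cost for a known demand trace."""
    candidates = range(max(demand) + 1)
    return min(candidates, key=lambda k: cost(demand, k, **prices))

# Hypothetical workers needed per time step: mostly quiet, with two bursts.
demand = [2, 2, 3, 10, 2, 2, 9, 2]

k = best_vm_count(demand)
print(k, cost(demand, k))
print("all serverless:", cost(demand, 0))
print("provision for peak:", cost(demand, max(demand)))
```

In this made-up trace, a small always-on pool plus serverless overflow for bursts is cheaper than either extreme. The hard part, and the research question, is making that choice without seeing the future workload.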

These two projects have been extremely enjoyable to work on because of the problems, the ideas, and because of the great team behind them. I am looking forward to seeing what’s next for Matt. Check out the paper if you want to know more, and try to catch Matt at the conference.

Data Sharing

I think a lot about data sharing. In fact, I think about it so much that, for a while now, I have not had time to talk about the effort the group has been putting into this line of work. Data sharing is a socio-technical problem, not a purely technical one, even though technology can clearly help. I like to think about data-sharing scenarios as data markets. That way, I can apply the “data ecology” lens that our group has been developing since it started. Anyway, none of this will make much sense right now: I’ll say much more about this soon.

Until then: together with Arnab Nandi from Ohio State University, we put together a tutorial that lays out some of the challenges of sharing spatio-temporal data (think of intelligence, adtech, research applications) responsibly. It turns out that the data escrow system we have developed has much to contribute to this space and even has a few advantages compared to other “privacy enhancing technologies” (PETs), such as multi-party computation, federated learning, and others. In the tutorial, we will motivate the importance of spatio-temporal data, give an overview of several PETs, and explain some of the considerations one must weigh when building sharing applications. While I think it makes a lot of sense to concentrate on concrete domains (spatio-temporal data in our case), I believe the insights from the tutorial will be applicable in other domains, too.

Wrapping Up

Drop me your thoughts and questions at, or even better, catch me at SIGMOD and let’s chat. Based on what we’ve learned over the last year and a half, we are already working on Solo2 and Nexus2. For Nexus, we are seeking active users, and while we have identified a few, we would love to talk to you if you think any of what it does may be helpful to your work.

Last, I have open postdoc positions and am looking to hire a “research manager” as well. If you are wrapping up your PhD and are interested in discussing postdoc positions, send me a note, and let’s chat at the conference. If you are looking for a job in academia without the pressures from industry, let’s chat, too, and I’ll explain more about what this “research manager” position is about.