How Large Language Models Will Disrupt Data Management

2023, Oct 02    

Large language models are exciting. Since OpenAI made ChatGPT widely available last November, this class of models has captured the public imagination, featuring prominently in social media, mass media, and, of course, the research community. It's been a while since a technology generated so much interest across such broad communities, and that's extremely exciting. Better yet, people are focused not only on the upsides of the technology but also on its actual and potential downsides; for example, the discourse around regulating data and AI has accelerated tremendously over the last few months. Regardless of whether regulation is the right way to address the problem, that discourse is happening only because people have been responsible and responsive in pointing out those downsides. It's not time to claim victory, and some damage has already been done, but I remain optimistic.

It is so much fun to think about how LLMs change the game that we could not resist speculating about it here at UChicago. After a few brainstorming sessions, we put together our ideas on how we think LLMs will impact the field of data management (and, by extension, many applications, given data management's vast footprint) in a vision paper that we recently presented at VLDB'23 in Vancouver, Canada. The paper lays out our vision for what changes will unfold (it's funny how many of the ideas we wrote down in March have already materialized!) and what kinds of interventions would ease the deployment of LLMs moving forward. In this post, I want to chat about that latter part, because the primary mechanism we propose to address those issues is data markets, which, as I've written before on this blog, is one of the main research thrusts in my group.

What do markets have to do with LLMs?

Section 4 of the paper explains this in much more detail, but here’s the gist. These are two key LLM-adjacent problems where markets can help:

Incentivizing data sharing.

Much of an LLM's quality comes from its data, as evidenced by the reluctance of some big vendors to release details about what data they used and how it was prepared. In a market where those with ideas on how to prepare data can contribute those ideas to train a model and benefit from that model's usage, we elicit competition that drives models to improve. Beyond data, LLMs remain resource-hungry (although they are quickly becoming more efficient). So, if you are a large company with in-house data centers, you can quickly repurpose them for training and serving LLMs. But what if you are a small organization without the resources to pull that off? In that case, sharing mechanisms that facilitate cost-sharing let groups of organizations "split the bill" and get an LLM they can all share.
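To make the "split the bill" idea a bit more concrete, here is a minimal sketch of one possible pricing rule: each participant pays in proportion to its expected usage of the shared model. The function, names, and numbers are illustrative assumptions, not a mechanism from the paper.

```python
# Illustrative sketch: proportional cost-sharing for a jointly trained LLM.
# The pricing rule, names, and numbers are assumptions for exposition,
# not the mechanism from the paper.

def split_training_bill(total_cost: float,
                        expected_usage: dict[str, float]) -> dict[str, float]:
    """Charge each participant in proportion to its expected share of usage."""
    total_usage = sum(expected_usage.values())
    return {org: total_cost * usage / total_usage
            for org, usage in expected_usage.items()}

# Three small organizations pool resources to train one shared model.
bill = split_training_bill(
    total_cost=900_000,  # e.g., compute plus data preparation, in dollars
    expected_usage={"org_a": 50.0, "org_b": 30.0, "org_c": 20.0},  # relative query volume
)
print(bill)  # {'org_a': 450000.0, 'org_b': 270000.0, 'org_c': 180000.0}
```

In practice, a consortium would weigh contributions of data and skills alongside money, which is exactly the kind of accounting a market mechanism must orchestrate.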

Now, put it all together. We highlight two interesting outcomes in the paper. First, we speculate on a push toward "public commons"-style models, where the public provides the resources and data with the intention of serving the public, instead of the whole process being operated by private companies. Despite the potential inefficiencies of this approach, if the outcome is a well-performing model, anyone who contributed anything to the model's creation could enjoy its benefits. Second, we envision the creation of data-sharing consortia markets where organizations pitch in whatever they have (data, skills, resources...) and the market orchestrates their interactions so resources are allocated where they are most valuable. The end product is participants with access to a better LLM than they could have obtained individually.

Controlling data’s provenance.

At the same time, the reluctance of builders to disclose their data sources means they may use (and sometimes it's clear they did use) copyrighted materials, e.g., the art of an independent artist or the writings of an aspiring journalist. When we query an LLM and observe the output, we have very few ways to track who and what contributed to that output, which means that some of the value we perceive, value that stems from the work of many others, goes unrecognized. This is a massive problem. There are a few regulatory proposals today that would require including provenance along with model outputs, essentially revealing when content has been generated by a model. That sounds good, but more is needed. Folks whose work and efforts are being absorbed by these models will likely want those models to: i) not read their data; ii) read their data but credit it as a source; or iii) compensate them when their data is used. Those are just three options among many, and interested parties continue to envision new licenses to protect all stakeholders so this does not end in disaster.

Anyway, tracking provenance through a network with many billions of parameters seems, frankly, quite difficult. Plus, what kinds of contributions count? Should we value a piece of text whose style the LLM leverages the same as a piece that supplies a key idea the LLM uses? How do we even think about these questions? And if we disagree on our answers, how do we reach an agreement? And if we did reach an agreement, how do we ensure the LLM creators enforce it? In short, we can't force that last one. These are all difficult questions. In the paper, we give some ideas on how to go about "valuing data's contributions" to the LLM, through a fun analogy with how grain is valued: you do not aim to value each individual grain, but rather group grain by origin and a few other coarse dimensions. It's been fun to think about this attribution problem in the context of LLMs, and we hope to have something more concrete to say soon.
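To illustrate the grain analogy, here is a sketch under our own assumptions (not the method in the paper): value coarse groups of data rather than individual items, estimate each group's contribution to model quality (say, via ablation studies), and then split that value within the group.

```python
# Illustrative sketch of coarse-grained data valuation, in the spirit of the
# grain analogy: value groups of data (by origin, genre, ...) rather than
# individual items. The group scores are hypothetical inputs, e.g., from
# ablations that retrain and evaluate the model without each group.

from collections import defaultdict

def value_by_group(items: list[tuple[str, str]],
                   group_scores: dict[str, float]) -> dict[str, float]:
    """Split each group's estimated contribution evenly among its items.

    items: (item_id, group) pairs, e.g., ("doc_17", "news/2021").
    group_scores: estimated contribution of each group to model quality.
    """
    members = defaultdict(list)
    for item_id, group in items:
        members[group].append(item_id)
    return {item_id: group_scores[group] / len(ids)
            for group, ids in members.items()
            for item_id in ids}

corpus = [("doc_1", "fiction"), ("doc_2", "fiction"), ("doc_3", "code")]
scores = {"fiction": 0.4, "code": 0.6}  # hypothetical per-group contributions
print(value_by_group(corpus, scores))
# {'doc_1': 0.2, 'doc_2': 0.2, 'doc_3': 0.6}
```

The appeal of this coarse approach is that it sidesteps per-item attribution inside the network, which is exactly the part that seems intractable today.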

So what?

LLMs feed on data, and their quality depends significantly on that data. LLM builders operate dataflows to bring in the data necessary to build better models. Those dataflows move data from point A to point B without always asking A for permission. If you are A today, you have very little say in how LLM creators use your data (no different from how big tech and the data broker industry manage your data today). This dataflow governance question is the core problem our data markets thrust aims to address. The VLDB'23 vision paper on LLMs was a fun way of envisioning how these problems, which we have grown accustomed to in the Internet era, will transfer to the generative AI era. Stay tuned for research updates soon!

Other notes: The VLDB'23 vision paper discusses many other problems and how we think LLMs will impact them. One of those problems is data discovery. We lay out our vision for how this problem will change with LLM technology, informed by ongoing work in our group: over the last year and a half, we have been building Solo, a discovery system that leverages large language models to let users query repositories of tables without any supervision (the self-supervised strategy Solo uses to bootstrap is one of the key contributions). Solo can be seen as a kind of RAG-style system, although it has several significant differences, and we are currently exploring how these two approaches complement each other.
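For a flavor of what RAG-style table discovery looks like, here is a generic sketch: serialize each table, embed it, and retrieve the table closest to a natural-language question. This is not Solo's actual architecture; the embedding model and the serialization scheme are assumptions for illustration only.

```python
# Generic sketch of RAG-style table discovery: serialize tables, embed them,
# and retrieve the closest match to a natural-language question. This is NOT
# Solo's architecture; the model choice and serialization are assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy "repository": each table serialized as its name plus column headers.
tables = {
    "flights": "flights | origin, destination, departure_time, carrier",
    "payroll": "payroll | employee_id, salary, department, start_date",
}

table_vecs = model.encode(list(tables.values()), normalize_embeddings=True)

def discover(question: str) -> str:
    """Return the table whose serialization is closest to the question."""
    q = model.encode([question], normalize_embeddings=True)[0]
    scores = table_vecs @ q  # cosine similarity, since embeddings are normalized
    return list(tables)[int(np.argmax(scores))]

print(discover("Which airline should I take to get to Vancouver?"))  # flights
```

Part of what makes Solo interesting is precisely that it does not need labeled question-table pairs to get a pipeline like this off the ground.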

Last, if you are based in Chicagoland and find these topics interesting, come to the first-ever Chicago Data Night, which we are organizing at the Merchandise Mart.