Adding New Knowledge to LLMs, RAG-style systems, Solo, and More
LLMs are great tools for querying unstructured and structured documents, but the knowledge (it's just data) they encode is mostly static because incorporating new data sources is difficult and expensive. For example, if you want to expose your organization-specific databases through an LLM's chat interface, there are not yet really good options, despite rapid progress. In this post, I i) give a high-level overview of some strategies for incorporating new data (others call this knowledge) into an LLM, and ii) explain how Solo, a recent data discovery system from my group, follows one of those strategies to facilitate querying tabular data.
There are a few ways of incorporating new data into an LLM:
- Retrain. Simply retrain the LLM from scratch, including the new data alongside the previous training dataset. No one wants to do this because it is too expensive, and the process would need to be repeated every time the underlying data changes to keep the model fresh.
- Fine-tuning. Adjust the weights of the LLM using the new training data. The popular proprietary LLM companies offer APIs to facilitate this fine-tuning process. It remains expensive and cumbersome to deploy in production when the underlying data changes continuously. Furthermore, with this approach, you cannot easily control how the new data is incorporated or how it will be retrieved.
- Keep the data in a database, and have the LLM learn how to query the database. The context length of modern LLMs (the number of input tokens you can include in a query) keeps expanding fast. You can use this growing context to teach the LLM to query an external data source, e.g., by asking it to call an API and incorporate the result into the response; see Toolformer for an early example of this approach, and the first sketch after this list.
- Retrieval-Augmented Generation (RAG) style. In this case, you keep the data in the database. When you receive a query, you query the database and obtain a series of candidate results, say K. You then forward these K results to the LLM along with the query. Essentially, you exploit the LLM's context to incorporate candidate answers obtained from a database (which can change and be updated as usual); the second sketch after this list illustrates the flow. This space is exciting, with approaches such as llama-index, LangChain, and vector databases adapting their APIs to fit this architecture well (see, for example, Chroma).
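To make the third option concrete, here is a minimal sketch of the tool-calling loop, in the spirit of Toolformer. Everything in it is hypothetical: `llm` is a stand-in for any LLM call, and `lookup_employee` is an invented tool; the point is only the control flow, in which the model emits a call, the system executes it, and the result is fed back into the context.

```python
import json
import re

def lookup_employee(name: str) -> str:
    """Hypothetical 'tool': a stand-in for querying an organizational database."""
    fake_db = {"Ada": "Ada, Data Platform team, joined 2021"}
    return fake_db.get(name, "no record found")

TOOLS = {"lookup_employee": lookup_employee}

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call. A tool-using model would be prompted
    (or fine-tuned) to emit markers like [CALL lookup_employee("Ada")]."""
    if "[RESULT" not in prompt:
        return 'I need to check the database. [CALL lookup_employee("Ada")]'
    return "Ada is on the Data Platform team and joined in 2021."

def answer(question: str) -> str:
    prompt = question
    response = llm(prompt)
    # If the model emitted a tool call, run the tool and feed the result back.
    match = re.search(r'\[CALL (\w+)\("([^"]*)"\)\]', response)
    if match:
        tool, arg = match.group(1), match.group(2)
        result = TOOLS[tool](arg)
        prompt += f'\n[RESULT {json.dumps(result)}]'
        response = llm(prompt)
    return response

print(answer("Which team is Ada on?"))
```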
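And here is the RAG-style flow from the last option, stripped to its essentials. Again a sketch under stated assumptions: `embed` stands in for a real text-embedding model and `llm` for a real LLM call, and in practice a vector database such as Chroma would replace the brute-force similarity search.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model (e.g., a sentence encoder).
    Hash-seeded random vectors: consistent within a run, not semantic."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def llm(prompt: str) -> str:
    """Stand-in for a real LLM call."""
    return f"(answer grounded in the retrieved context)\n{prompt[:80]}..."

documents = [
    "Q3 revenue grew 12% year over year.",
    "The Berlin office opened in 2019.",
    "Our retention policy keeps logs for 90 days.",
]
index = np.stack([embed(d) for d in documents])  # offline: index the corpus

def rag_answer(question: str, k: int = 2) -> str:
    # Online: retrieve the top-K most similar documents by cosine similarity,
    # then hand them to the LLM together with the original question.
    scores = index @ embed(question)
    top_k = [documents[i] for i in np.argsort(-scores)[:k]]
    context = "\n".join(top_k)
    return llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

print(rag_answer("How long do we keep logs?"))
```

The database remains the system of record and can keep being updated as usual; only the index needs refreshing when documents change.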
There is a lot of work going on in each of the four options above: research to make LLMs smaller (and thus easier and cheaper to retrain), approaches to improve the fine-tuning process, a plethora of work teaching LLMs how to query external sources (you've probably noticed that ChatGPT has recently started citing websites in its responses, an indication of the model using an external tool, in this case the web), and a lot of activity in the RAG space. All this means we don't know which of these architectures (or others not covered here) will prevail. However, the RAG-style architecture has several characteristics that make it quite attractive. Organizations can keep their data in their local databases without having to send it anywhere (assuming they host the LLM or run it in isolation), which is good for compliance reasons, e.g., with privacy regulations. The data can keep changing through the same pipelines and processes used today. All that changes is the querying interface: in addition to the existing ones, there is now an alternative RAG-style interface in which an LLM helps produce the right answer.
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach (SIGMOD’24)
We started exploring the last architecture before the industry gave it a name. Solo is a data discovery system for tabular data: you point it at a collection of tables, ask a question in natural language, and the system returns the top-k tables most relevant to answering the question. Crucially, it does not return an answer directly, because the analyst querying the system will want to inspect what data is being used first. Nor does it translate the natural language question into SQL, which means one can include unstructured data (which cannot easily be queried with SQL) and still get an answer.
In the offline stage, Solo takes the input tables, serializes them into a format conducive to retrieval, and indexes them in a vector database. In the online stage, the query is used to retrieve the top-k tables from the retrieval system, and these are handed to an LLM module that ranks them before delivering them to the user (see the sketch below). For the system to work well, the retrieval module must be trained for the target table collection, which is extremely inconvenient! If we take Solo (or any table question answering system) trained on dataset A and apply it to a new dataset B, performance tends to suffer greatly. But retraining is inconvenient because it requires training data in the form of <question, table-with-answer> pairs, and such a dataset is time-consuming to procure. This is a major impediment to using systems of this kind, and it is precisely one of the key challenges we address in the paper.
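Here is a schematic of the two stages as described above. This is not Solo's actual code: `serialize_table`, `embed`, and `rank_with_llm` are invented stand-ins for the real components.

```python
import numpy as np

def serialize_table(name: str, header: list[str], rows: list[list[str]]) -> str:
    """Hypothetical serialization: flatten a table into retrieval-friendly text."""
    lines = [" | ".join(header)] + [" | ".join(map(str, r)) for r in rows]
    return f"table: {name}\n" + "\n".join(lines)

def embed(text: str) -> np.ndarray:
    """Stand-in for the trained retrieval encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(128)
    return v / np.linalg.norm(v)

def rank_with_llm(question: str, tables: list[str]) -> list[str]:
    """Stand-in for the LLM-based ranking module."""
    return tables  # a real ranker would reorder the candidates by relevance

# Offline stage: serialize every table and index its vector.
tables = {
    "employees": (["name", "team"], [["Ada", "Data Platform"]]),
    "offices": (["city", "opened"], [["Berlin", "2019"]]),
}
serialized = {n: serialize_table(n, h, r) for n, (h, r) in tables.items()}
index = {n: embed(t) for n, t in serialized.items()}

# Online stage: retrieve the top-k tables, then let the LLM rank them.
def discover(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    candidates = sorted(index, key=lambda n: -(index[n] @ q))
    return rank_with_llm(question, candidates[:k])

print(discover("Which team is Ada on?"))
```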
Solo is a self-supervised system, which means it synthesizes its own training data; no one needs to provide it externally, saving a lot of time and trouble. It does this with a somewhat intricate pipeline of stages that generates questions from the input tables, each paired with its ground-truth table. It then serializes the tables and questions into a format amenable to training and uses the resulting data to fine-tune the retrieval module (the toy example below conveys the flavor). This works great! Now we can point Solo at a new table collection and, without any further handholding, have the system train itself and get ready to answer queries.
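To give a feel for the self-supervision idea, here is a toy generator that turns table rows into <question, ground-truth-table> training pairs using simple templates. This is my illustration, not the paper's actual pipeline, which is considerably more sophisticated.

```python
# Toy self-supervision: synthesize <question, ground-truth-table> pairs
# from the tables themselves, so no human labeling is needed.
tables = {
    "employees": {"header": ["name", "team"],
                  "rows": [["Ada", "Data Platform"], ["Grace", "Systems"]]},
    "offices": {"header": ["city", "opened"],
                "rows": [["Berlin", "2019"]]},
}

def synthesize_pairs(tables: dict) -> list[tuple[str, str]]:
    pairs = []
    for name, t in tables.items():
        key_col = t["header"][0]
        for row in t["rows"]:
            for col, val in zip(t["header"][1:], row[1:]):
                # Template question whose answer lives in this table.
                q = f"What is the {col} for {key_col} {row[0]!r}?"
                pairs.append((q, name))  # ground truth: the source table
    return pairs

training_data = synthesize_pairs(tables)
for question, table in training_data:
    print(f"{question}  ->  {table}")
# These pairs would then be used to fine-tune the retrieval encoder.
```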
Solo belongs to the data discovery line of research my group has been exploring for a few years. LLMs are a new, exciting technology to study in the context of this research line. We discussed what LLMs could mean for data management in this paper (and in this post), and we explored an architecture for querying tabular data with natural language in Solo. With what we have learned from these efforts, we are actively working on a few directions that we hope will make it easier to explore structured data using LLMs. And I'm looking for postdocs and students to spearhead the effort. Reach out if any of this sounds interesting.