Over the last year, our group has spent a lot of time pushing on our ideas about data markets, including designing and publishing a data escrow system, a system foundation for deploying data markets and other applications that require controlling data flows. There are many exciting ideas and new designs that I’m looking forward to sharing soon. But today’s post is about data discovery.
I have mentioned the problem of data discovery before. Some of my students are asking interesting questions in this area and building systems to address discovery problems in various scenarios. Concretely, today’s post is about two systems, Ver and Metam, that we are presenting in Los Angeles next week. We have big plans for both, and I’ll share some of them.
Data discovery is the problem of identifying and retrieving data that satisfies an information need. A lot of what is tricky about this problem is the “identification” part. When someone has a problem that may be solved using data, they still need to articulate their information need. We are used to this process when searching the web: Google (and many advances in information retrieval) made it easy to describe information needs with keywords. But there are many other scenarios where articulating such needs takes much more work. Think of a data analyst searching several data lakes and databases in their organization for the data they need to complete a report. Or a citizen journalist who suspects open data repositories may contain helpful information to back up a claim. There is a lot of data, and it is hard to explain precisely what one wants. To make matters worse, even once the need is articulated, the heterogeneity and volume of the data, semantic ambiguity, and inexact and noisy results make extracting the information one wants tedious and time-consuming.
Ver and Metam are two systems that tackle this problem from two different angles. Ver (Yue, Zhiru, and Sainyam) recognizes that users may not know their exact information need, and it engages directly with this problem by giving them a series of “tools” that let them interact with the system, understand and refine their information need, and go from the vast volumes of data out there to a handful of relevant datasets. Metam (Sainyam and Yue) recognizes that people often want to find data to do “something else”. That “something else”, which we call the “goal”, could be training a machine learning model to solve a prediction problem with high accuracy or answering a what-if query. Whatever it is, the goal determines the information need: anything that accomplishes the goal counts, and we rank options by the utility they yield on that goal. In that sense, Metam is a “goal-oriented data discovery” system: instead of asking you to articulate an information need, Metam takes your goal definition and automatically identifies the data that is relevant to it. Let me say more about each of these systems.
Ver: View Discovery in the Wild
Ver is a data discovery system for tabular data. The premise is that if you can envision the table you would like to have in front of you, i.e., you can write down the attributes you would like the table to contain, then the system will find it for you. This type of interface is called query-by-example (QBE) in data management. Traditionally, QBE was envisioned to navigate a single database with a well-designed schema and information about how tables can be combined. In Ver, however, we apply QBE “in the wild”, i.e., over collections of tables without schemas and without information on how tables can be combined with each other. We call these setups pathless table collections. The challenge of pathless table collections is that, for a given input table, there may be many ways to assemble tables that resemble what you asked for. Many of those will be wrong because of semantic ambiguity, incorrect join paths (which we must find automatically, since pathless table collections contain no information on how to combine tables), and other problems.
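To make the QBE-over-pathless-collections setting concrete, here is a toy sketch of the idea, not Ver’s actual algorithm. All names are hypothetical, tables are just attribute lists, and matching attributes by exact name sidesteps the semantic ambiguity that makes the real problem hard:

```python
from itertools import combinations

def discover_candidate_views(example_attrs, tables):
    """Toy QBE over a pathless table collection (not Ver's algorithm).

    example_attrs: attribute names of the table the user envisions.
    tables: {table_name: [column_names]} with no schema or join info.
    Returns candidate table combinations that could cover the request.
    """
    wanted = set(example_attrs)
    # Tables contributing at least one requested attribute.
    partial = {name: set(cols) & wanted
               for name, cols in tables.items()
               if set(cols) & wanted}
    candidates = []
    # Single tables that cover the whole request on their own.
    for name, covered in partial.items():
        if covered == wanted:
            candidates.append((name,))
    # Table pairs: a shared column is a *guessed* join path, which is
    # exactly the part that may be semantically wrong in the wild.
    for (a, ca), (b, cb) in combinations(partial.items(), 2):
        shared = set(tables[a]) & set(tables[b])
        if shared and ca | cb == wanted:
            candidates.append((a, b))
    return candidates
```

Even this naive version shows where the candidate explosion comes from: every overlapping column name proposes another join path, and only some of those joins are semantically valid.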
Ver takes the many views that result from a search and applies a distillation and presentation algorithm to narrow them down to the ones that match the user’s original information need. Distillation automatically identifies patterns among the resulting views and summarizes them: views can be compatible, contained, complementary, or contradictory, and each category gives us a strategy to summarize the views. Presentation then works with the user, aiming to understand the information need and thus obtain and deliver the final view.
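The four categories above can be illustrated with a minimal sketch, under the simplifying assumptions (mine, not the paper’s) that each view is a set of row tuples over the same attributes and that one column acts as a key:

```python
def classify_view_pair(v1, v2, key_index=0):
    """Toy classification of two candidate views into the four
    distillation categories. Views are sets of tuples over the same
    attributes; key_index marks a hypothetical key column."""
    if v1 == v2:
        return "compatible"      # identical content
    if v1 <= v2 or v2 <= v1:
        return "contained"       # one view subsumes the other
    keys1 = {t[key_index]: t for t in v1}
    keys2 = {t[key_index]: t for t in v2}
    # Same key mapped to different rows -> the views disagree.
    for k in keys1.keys() & keys2.keys():
        if keys1[k] != keys2[k]:
            return "contradictory"
    return "complementary"       # distinct rows that can be unioned
```

Each label suggests a summarization strategy: compatible views collapse to one, contained views keep the superset, complementary views can be unioned, and contradictory views surface a conflict the user must resolve.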
While we envisioned and designed Ver to work with a QBE interface, the techniques we introduced to solve the problem end-to-end apply to many other discovery scenarios. We are looking forward to exploring those soon.
Metam: Goal-Oriented Data Discovery
A data-driven task takes an input dataset and produces a result with a given utility. For example, a machine learning model takes a training dataset as input and yields a certain utility (say, accuracy) on a test set. If you can define your task in those terms, i.e., accompany it with a utility function, then Metam will help you identify data from table repositories that can augment the input data and increase the task’s utility. It does this automatically.
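To show the shape of the interface, here is a deliberately naive greedy loop, a sketch of goal-oriented augmentation rather than Metam’s actual search strategy. Datasets are modeled as plain dicts and augmentation as a dict merge, both assumptions of mine for illustration:

```python
def goal_oriented_augmentation(base, candidates, utility, max_rounds=10):
    """Greedy sketch of goal-oriented discovery (not Metam's algorithm).

    base: the input dataset, modeled as a dict of column -> value.
    candidates: candidate augmentations (dicts) found in a repository.
    utility: scores a dataset; the task definition the user supplies.
    Repeatedly adds the candidate that most improves utility, stopping
    when no remaining candidate helps.
    """
    current = dict(base)
    score = utility(current)
    remaining = list(candidates)
    for _ in range(max_rounds):
        best, best_score = None, score
        for cand in remaining:
            trial = {**current, **cand}   # try one augmentation
            s = utility(trial)
            if s > best_score:
                best, best_score = cand, s
        if best is None:                  # nothing improves the goal
            break
        current, score = {**current, **best}, best_score
        remaining.remove(best)
    return current, score
```

The key point is the contract, not the loop: the user supplies only the goal (the utility function), and the system decides which data is worth adding. Evaluating utility on every candidate is exactly what a real system must avoid at scale.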
A different way of thinking about Metam is as a “data augmentation” engine. This line of work started with ARDA, where we used Aurum to identify data that could augment training datasets for supervised ML tasks. There has been a lot of work since then. I think of Metam as a more general and principled approach to automatic data augmentation. It is more general in that it accepts any task for which one can write a utility function. And it is more principled in that, with its architecture, we have found a sweet spot: an architecture that we can implement easily, understand in detail, and extend to other types of data and tasks.
The implications of “goal-oriented” data discovery are immense. We are excited to explore how to cast more problems as “tasks with a utility function” and how to make Metam’s inner workings faster (right now, a typical query takes a few minutes on modest dataset sizes) and more efficient. More soon.
In addition to continuing the work on Ver and Metam, the group is exploring two other approaches to data discovery. Related to this line, we are working (in collaboration with Haifeng Xu) on a “market of augmentations” that uses data discovery techniques in the context of a data market. I anticipate some interesting results soon.