Data, data, data

2022, Jan 30    

“Data is the new oil” may not be a great analogy if taken literally. But it implies that data is a valuable asset, and this is crucial to understanding the data-driven world we live in today. Data is so valuable, in fact, that it is the main actor in important socio-technical trends. Data is at the center of the machine learning revolution of the last years and of the tech-driven economy: the largest companies in the US today are all built on data. Just take a look at the record-high IPOs of the last years and the increasingly data-centric revenue streams (think the metaverse, the new IoT-centric startups, and even web3) that make many of these companies valuable. Data also drives social change, and not always in a good direction. As with any other valuable asset, it can be used for good and for evil. Now, despite all the promise and potential value of data, the truth is that today only a few have the means and expertise to reap the benefits. This, unfortunately, means that most of that value remains untapped. I have been interested in understanding data as a valuable asset for many years, but only after arriving at UChicago have I made understanding the value and economics of data the central theme of my research agenda. My research goal is to devise ways of exploiting this value more widely, and of helping others exploit it.

After completing my PhD building systems for data processing at Imperial College London, my transition to MIT took me in a slightly different direction. Instead of focusing on processing data faster, cheaper, or in more resource-efficient ways, I focused on how to identify data that would be useful to solve a problem in the first place. Because of the sheer number of data repositories, the variety of sources, and the often hard-to-define notion of “relevance”, this data discovery problem is quite challenging and took most of my time and effort at MIT. I built Aurum, a prototype data discovery system that built on a lot of cool research that many had done before. One of the most exciting things about Aurum is that I got to see it in action in the hands of professionals who faced really exciting data discovery problems on a daily basis. I saw it used in pharma to integrate public and private databases, in finance to find features for machine learning models, at big enterprise data warehouses to find duplicate data, and by sustainability teams who wanted to identify external data to help their internal analysis. It was during this time that I saw how much data can help when used well, and I got really excited about the larger question of the value of data: what is it, how do we measure it, how do we boost it, and how do we unleash it?

So when I went on the tenure-track academic job market I was pretty sure I wanted to explore that topic in depth, and that is what I have been doing ever since. I want to build software to facilitate sharing data, and bring its benefits to many. I want to build software to better understand how our personal data is used, by whom, and with what intentions. And I want to challenge current data sharing practices when they are arguably not beneficial for those who participate in the exchange, for example, when data brokers trade data from subjects without their explicit consent, thus disregarding subjects’ privacy preferences and sometimes sowing unrest in entire communities. The more I think about these questions, the clearer it becomes that we are surrounded by many data markets: environments where data is allocated among interested parties. One data market is formed when consortia of participants pool their data to derive more value (think federated learning on the technical side or data-sharing agreements on the legal side). Another example is the barter market formed by users giving their data away to online platforms to obtain services such as search, social networks, entertainment, and more. And perhaps the most familiar data markets are the online platforms where sellers list datasets they have collected and offer them to buyers, aiming to make a profit off the transaction. Many of these markets have a tremendous impact on those who participate and, in many cases, on many others who do not participate directly but are affected nevertheless. Given the increasing impact data markets have on individuals and the economy, it is important to understand how they work and when they fail. Only then can we design better data markets that truly create value instead of more problems. Designing a data market requires a fundamental understanding of the value of the asset exchanged: data.

Despite the raging pandemic of the last two years, I managed to assemble a great team at UChicago composed of postdocs, PhD students, master's students, and undergraduate researchers, with backgrounds in CS, statistics, economics, and other fields. The team is making great progress, and I am planning to use this space to share some of our findings with anyone who is interested (admittedly, I am also doing this to keep track of how my thinking about the problem changes over time, which is always interesting to see). I’m planning to share content beyond the topics above, as the group is working on a diverse set of problems including systems, cost optimization, discovery, integration, and data management for ML.