As the leaders of a developing field, data scientists must often deal with a frustratingly slippery question: What is data science, precisely, and what is it good for?
Alfred Spector is a visiting scholar in the MIT Department of Electrical Engineering and Computer Science (EECS), an influential developer of distributed computing systems and applications, and a successful tech executive with companies including IBM and Google. Along with three co-authors (Peter Norvig at Stanford University and Google, Chris Wiggins at Columbia University and The New York Times, and Jeannette M. Wing at Columbia), Spector recently published “Data Science in Context: Foundations, Challenges, Opportunities” (Cambridge University Press), which provides a broad, conversational overview of the wide-ranging field driving change in sectors from health care to transportation to commerce to entertainment.
Here, Spector talks about data-driven life, what makes a good data scientist, and how his book came together during the height of the Covid-19 pandemic.
Q: One of the most common buzzwords Americans hear is “data-driven,” but many might not know what that term is supposed to mean. Can you unpack it for us?
A: Data-driven broadly refers to techniques or algorithms powered by data; they either provide insight or reach conclusions, such as a recommendation or a prediction. The algorithms power models that are increasingly woven into the fabric of science, commerce, and life, and they often provide excellent results. Their successes are really too numerous to list. However, one concern is that the proliferation of data makes it easy for us as students, scientists, or just members of the public to jump to erroneous conclusions. As just one example, our own confirmation biases make us prone to believing some data elements or insights “prove” something we already believe to be true. Additionally, we often tend to see causal relationships where the data shows only correlation. It might seem paradoxical, but data science makes critical reading and analysis of data all the more important.
Q: What, to your mind, makes a good data scientist?
A: [In talking to students and colleagues] I optimistically emphasize the power of data science and the importance of gaining the computational, statistical, and machine learning skills to apply it. But I also remind students that we are obligated to solve problems well. In our book, Chris [Wiggins] paraphrases danah boyd, who says that a successful application of data science is not one that merely meets some technical goal, but one that actually improves lives. More specifically, I exhort practitioners to provide real solutions to problems, or else clearly identify what we are not solving so that people see the limitations of our work. We should be extremely clear so that we do not generate harmful results or lead others to erroneous conclusions. I also remind people that all of us, including scientists and engineers, are human and subject to the same human foibles as everyone else, such as various biases.
Q: You discuss Covid-19 in your book. While some short-range models for mortality were very accurate during the heart of the pandemic, you note the failure of long-range models to predict any of 2020’s four major geotemporal Covid waves in the United States. Do you feel Covid was a uniquely hard situation to model?
A: Covid was particularly difficult to predict over the long term because of many factors: the virus was changing, human behavior was changing, and political entities changed their minds. Also, we didn’t have fine-grained mobility data (perhaps for good reasons), and we lacked sufficient scientific understanding of the virus, particularly in the first year.
I think there are many other domains that are similarly difficult. Our book teases out many reasons why data-driven models may not be applicable. Perhaps it’s too difficult to get or hold the necessary data. Perhaps the past doesn’t predict the future. If data models are being used in life-and-death situations, we may not be able to make them sufficiently dependable; this is particularly true given all the motivations that bad actors have to find vulnerabilities. So, as we continue to apply data science, we need to think through all the requirements we have, and the capability of the field to meet them. They often align, but not always. And as data science seeks to solve problems in ever more important areas such as human health, education, and transportation safety, there will be many challenges.
Q: Let’s talk about the power of good visualization. You mention the popular early-2000s Baby Name Voyager website as one that changed your view on the importance of data visualization. Tell us how that happened.
A: That website, recently reborn as the Name Grapher, had two characteristics that I thought were brilliant. First, it had a really natural interface, where you type the initial characters of a name and it shows a frequency graph of all the names beginning with those letters, and their popularity over time. Second, it’s so much better than a spreadsheet with 140 columns representing years and rows representing names, despite the fact that it contains no extra information. It also provided instantaneous feedback, with its display graph changing dynamically as you type. To me, this showed the power of a very simple transformation that is done correctly.
Q: When you and your co-authors began planning “Data Science in Context,” what did you hope to offer?
A: We portray present-day data science as a field that has already had enormous benefits and that offers even more future opportunities, but one that requires equally enormous care in its use. Referencing the word “context” in the title, we explain that the proper use of data science must consider the specifics of the application, the laws and norms of the society in which the application is used, and even the time period of its deployment. And, importantly for an MIT audience, the practice of data science must go beyond just the data and the model to the careful consideration of an application’s objectives, its security, privacy, abuse, and resilience risks, and even the understandability it conveys to humans. Within this expansive notion of context, we finally explain that data scientists must also carefully consider ethical trade-offs and societal implications.
Q: How did you keep focus throughout the process?
A: Much like in open-source projects, I played both the coordinating author role and the role of overall librarian of all the material, but we all made significant contributions. Chris Wiggins is very knowledgeable on the Belmont principles and applied ethics; he was the major contributor of those sections. Peter Norvig, as the co-author of a bestselling AI textbook, was particularly involved in the sections on building models and causality. Jeannette Wing worked with me very closely on our seven-element Analysis Rubric and recognized that a checklist for data science practitioners would end up being one of our book’s most important contributions.
From a nuts-and-bolts perspective, we wrote the book during Covid, using one large shared Google Doc with weekly video conferences. Amazingly enough, Chris, Jeannette, and I didn’t meet in person at all, and Peter and I met only once, sitting outdoors on a wooden bench on the Stanford campus.
Q: That is an unusual way to write a book! Do you recommend it?
A: It would be nice to have had more social interaction, but a shared document, at least with a coordinating author, worked pretty well for something up to this size. The benefit is that we always had a single, coherent textual base, not dissimilar to how a programming team works together.
This is a condensed, edited version of a longer interview that originally appeared on the MIT EECS website.