More stories

  • in

    When should data scientists try a new technique?

    If a scientist wanted to forecast ocean currents to understand how pollution travels after an oil spill, she could use a common approach that looks at currents traveling between 10 and 200 kilometers. Or, she could choose a newer model that also includes shorter currents. This might be more accurate, but it could also require learning new software or running new computational experiments. How to know if it will be worth the time, cost, and effort to use the new method?

    A new approach developed by MIT researchers could help data scientists answer this question, whether they are looking at statistics on ocean currents, violent crime, children’s reading ability, or any number of other types of datasets.

    The team created a new measure, known as the “c-value,” that helps users choose between techniques based on the chance that a new method is more accurate for a specific dataset. This measure answers the question “is it likely that the new method is more accurate for this data than the common approach?”

    Traditionally, statisticians compare methods by averaging a method’s accuracy across all possible datasets. But just because a new method is better for all datasets on average doesn’t mean it will actually provide a better estimate using one particular dataset. Averages are not application-specific.
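
    The gap between "better on average" and "better on this dataset" can be illustrated with a small simulation. The sketch below is not the c-value and is not taken from the paper. As a stand-in, it assumes the default method is the raw observation and the "new" method is the classical James-Stein shrinkage estimator; the parameter vector `theta`, the number of simulated datasets, and all function names are made up for illustration.

```python
import random


def squared_error(estimate, truth):
    """Loss: squared distance between an estimate and the true parameter."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))


def james_stein(y):
    """Stand-in 'new method': shrink the observation toward zero (unit noise variance assumed)."""
    p = len(y)
    norm_sq = sum(v * v for v in y)
    shrink = 1.0 - (p - 2) / norm_sq
    return [shrink * v for v in y]


def compare(theta, n_datasets=5000, seed=0):
    """Simulate many datasets and compare the two methods on each one."""
    rng = random.Random(seed)
    total_default = 0.0
    total_alt = 0.0
    alt_worse = 0
    for _ in range(n_datasets):
        # One simulated dataset: the truth plus standard normal noise.
        y = [t + rng.gauss(0.0, 1.0) for t in theta]
        loss_default = squared_error(y, theta)  # default method: use y as is
        loss_alt = squared_error(james_stein(y), theta)
        total_default += loss_default
        total_alt += loss_alt
        alt_worse += loss_alt > loss_default
    return (total_default / n_datasets,
            total_alt / n_datasets,
            alt_worse / n_datasets)


# Hypothetical ground truth; in a real problem this is unknown.
theta = [1.0] * 10
avg_default, avg_alt, frac_alt_worse = compare(theta)
print("average loss, default method:", round(avg_default, 2))
print("average loss, new method:    ", round(avg_alt, 2))
print("share of datasets where the new method is worse:", round(frac_alt_worse, 3))
```

    In this toy setting the new method has lower loss on average, yet it still loses on a share of individual datasets. The simulation can count those cases only because it knows `theta`; a real analyst does not, which is the situation a dataset-specific measure is meant to address.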

    So, researchers from MIT and elsewhere created the c-value, which is a dataset-specific tool. A high c-value means it is unlikely a new method will be less accurate than the original method on a specific data problem.

    In their proof-of-concept paper, the researchers describe and evaluate the c-value using real-world data analysis problems: modeling ocean currents, estimating violent crime in neighborhoods, and approximating student reading ability at schools. They show how the c-value could help statisticians and data analysts achieve more accurate results by indicating when to use alternative estimation methods they otherwise might have ignored.

    “What we are trying to do with this particular work is come up with something that is data specific. The classical notion of risk is really natural for someone developing a new method. That person wants their method to work well for all of their users on average. But a user of a method wants something that will work on their individual problem. We’ve shown that the c-value is a very practical proof-of-concept in that direction,” says senior author Tamara Broderick, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems and the Institute for Data, Systems, and Society.

    She’s joined on the paper by Brian Trippe PhD ’22, a former graduate student in Broderick’s group who is now a postdoc at Columbia University; and Sameer Deshpande ’13, a former postdoc in Broderick’s group who is now an assistant professor at the University of Wisconsin at Madison. An accepted version of the paper is posted online in the Journal of the American Statistical Association.

    Evaluating estimators

    The c-value is designed to help with data problems in which researchers seek to estimate an unknown parameter using a dataset, such as estimating average student reading ability from a dataset of assessment results and student survey responses. A researcher has two estimation methods and must decide which to use for this particular problem.

    The better estimation method is the one that results in less “loss,” which means the estimate will be closer to the ground truth. Consider again the forecasting of ocean currents: Perhaps being off by a few meters per hour isn’t so bad, but being off by many kilometers per hour makes the estimate useless. The ground truth is unknown, though; the scientist is trying to estimate it. Therefore, one can never actually compute the loss of an estimate for their specific data. That’s what makes comparing estimates challenging. The c-value helps a scientist navigate this challenge.

    The c-value equation uses a specific dataset to compute the estimate with each method, and then once more to compute the c-value between the methods. If the c-value is large, it is unlikely that the alternative method is going to be worse and yield less accurate estimates than the original method.

    “In our case, we are assuming that you conservatively want to stay with the default estimator, and you only want to go to the new estimator if you feel very confident about it. With a high c-value, it’s likely that the new estimate is more accurate. If you get a low c-value, you can’t say anything conclusive. You might have actually done better, but you just don’t know,” Broderick explains.

    Probing the theory

    The researchers put that theory to the test by evaluating three real-world data analysis problems.

    For one, they used the c-value to help determine which approach is best for modeling ocean currents, a problem Trippe has been tackling. Accurate models are important for predicting the dispersion of contaminants, like pollution from an oil spill. The team found that estimating ocean currents using multiple scales, one larger and one smaller, likely yields higher accuracy than using only larger scale measurements.

    “Oceans researchers are studying this, and the c-value can provide some statistical ‘oomph’ to support modeling the smaller scale,” Broderick says.

    In another example, the researchers sought to predict violent crime in census tracts in Philadelphia, an application Deshpande has been studying. Using the c-value, they found that one could get better estimates about violent crime rates by incorporating information about census-tract-level nonviolent crime into the analysis. They also used the c-value to show that additionally leveraging violent crime data from neighboring census tracts in the analysis isn’t likely to provide further accuracy improvements.

    “That doesn’t mean there isn’t an improvement, that just means that we don’t feel confident saying that you will get it,” she says.

    Now that they have proven the c-value in theory and shown how it could be used to tackle real-world data problems, the researchers want to expand the measure to more types of data and a wider set of model classes.

    The ultimate goal is to create a measure that is general enough for many more data analysis problems, and while there is still a lot of work to do to realize that objective, Broderick says this is an important and exciting first step in the right direction.

    This research was supported, in part, by an Advanced Research Projects Agency-Energy grant, a National Science Foundation CAREER Award, the Office of Naval Research, and the Wisconsin Alumni Research Foundation.

  • in

    Q&A: A fresh look at data science

    As the leaders of a developing field, data scientists must often deal with a frustratingly slippery question: What is data science, precisely, and what is it good for?

    Alfred Spector is a visiting scholar in the MIT Department of Electrical Engineering and Computer Science (EECS), an influential developer of distributed computing systems and applications, and a successful tech executive with companies including IBM and Google. Along with three co-authors — Peter Norvig at Stanford University and Google, Chris Wiggins at Columbia University and The New York Times, and Jeannette M. Wing at Columbia — Spector recently published “Data Science in Context: Foundations, Challenges, Opportunities” (Cambridge University Press), which provides a broad, conversational overview of the wide-ranging field driving change in sectors ranging from health care to transportation to commerce to entertainment. 

    Here, Spector talks about data-driven life, what makes a good data scientist, and how his book came together during the height of the Covid-19 pandemic.

    Q: One of the most common buzzwords Americans hear is “data-driven,” but many might not know what that term is supposed to mean. Can you unpack it for us?

    A: Data-driven broadly refers to techniques or algorithms powered by data — they either provide insight or reach conclusions, say, a recommendation or a prediction. The algorithms power models which are increasingly woven into the fabric of science, commerce, and life, and they often provide excellent results. The list of their successes is really too long to even begin to list. However, one concern is that the proliferation of data makes it easy for us as students, scientists, or just members of the public to jump to erroneous conclusions. As just one example, our own confirmation biases make us prone to believing some data elements or insights “prove” something we already believe to be true. Additionally, we often tend to see causal relationships where the data only shows correlation. It might seem paradoxical, but data science makes critical reading and analysis of data all the more important.

    Q: What, to your mind, makes a good data scientist?

    A: [In talking to students and colleagues] I optimistically emphasize the power of data science and the importance of gaining the computational, statistical, and machine learning skills to apply it. But, I also remind students that we are obligated to solve problems well. In our book, Chris [Wiggins] paraphrases danah boyd, who says that a successful application of data science is not one that merely meets some technical goal, but one that actually improves lives. More specifically, I exhort practitioners to provide a real solution to problems, or else clearly identify what we are not solving so that people see the limitations of our work. We should be extremely clear so that we do not generate harmful results or lead others to erroneous conclusions. I also remind people that all of us, including scientists and engineers, are human and subject to the same human foibles as everyone else, such as various biases. 

    Q: You discuss Covid-19 in your book. While some short-range models for mortality were very accurate during the heart of the pandemic, you note the failure of long-range models to predict any of 2020’s four major geotemporal Covid waves in the United States. Do you feel Covid was a uniquely hard situation to model? 

    A: Covid was particularly difficult to predict over the long term because of many factors — the virus was changing, human behavior was changing, political entities changed their minds. Also, we didn’t have fine-grained mobility data (perhaps, for good reasons), and we lacked sufficient scientific understanding of the virus, particularly in the first year.

    I think there are many other domains which are similarly difficult. Our book teases out many reasons why data-driven models may not be applicable. Perhaps it’s too difficult to get or hold the necessary data. Perhaps the past doesn’t predict the future. If data models are being used in life-and-death situations, we may not be able to make them sufficiently dependable; this is particularly true as we’ve seen all the motivations that bad actors have to find vulnerabilities. So, as we continue to apply data science, we need to think through all the requirements we have, and the capability of the field to meet them. They often align, but not always. And, as data science seeks to solve problems in ever more important areas such as human health, education, transportation safety, etc., there will be many challenges.

    Q: Let’s talk about the power of good visualization. You mention the popular, early-2000s Baby Name Voyager website as one that changed your view on the importance of data visualization. Tell us how that happened.

    A: That website, recently reborn as the Name Grapher, had two characteristics that I thought were brilliant. First, it had a really natural interface, where you type the initial characters of a name and it shows a frequency graph of all the names beginning with those letters, and their popularity over time. Second, it’s so much better than a spreadsheet with 140 columns representing years and rows representing names, despite the fact it contains no extra information. It also provided instantaneous feedback with its display graph dynamically changing as you type. To me, this showed the power of a very simple transformation that is done correctly.

    Q: When you and your co-authors began planning “Data Science In Context,” what did you hope to offer?

    A: We portray present data science as a field that’s already had enormous benefits, that provides even more future opportunities, but one that requires equally enormous care in its use. Referencing the word “context” in the title, we explain that the proper use of data science must consider the specifics of the application, the laws and norms of the society in which the application is used, and even the time period of its deployment. And, importantly for an MIT audience, the practice of data science must go beyond just the data and the model to the careful consideration of an application’s objectives, its security, privacy, abuse, and resilience risks, and even the understandability it conveys to humans. Within this expansive notion of context, we finally explain that data scientists must also carefully consider ethical trade-offs and societal implications.

    Q: How did you keep focus throughout the process?

    A: Much like in open-source projects, I played both the coordinating author role and also the role of overall librarian of all the material, but we all made significant contributions. Chris Wiggins is very knowledgeable on the Belmont principles and applied ethics; he was the major contributor of those sections. Peter Norvig, as the coauthor of a bestselling AI textbook, was particularly involved in the sections on building models and causality. Jeannette Wing worked with me very closely on our seven-element Analysis Rubric and recognized that a checklist for data science practitioners would end up being one of our book’s most important contributions. 

    From a nuts-and-bolts perspective, we wrote the book during Covid, using one large shared Google doc with weekly video conferences. Amazingly enough, Chris, Jeannette, and I didn’t meet in person at all, and Peter and I met only once — sitting outdoors on a wooden bench on the Stanford campus.

    Q: That is an unusual way to write a book! Do you recommend it?

    A: It would be nice to have had more social interaction, but a shared document, at least with a coordinating author, worked pretty well for something up to this size. The benefit is that we always had a single, coherent textual base, not dissimilar to how a programming team works together.

    This is a condensed, edited version of a longer interview that originally appeared on the MIT EECS website.

  • in

    Unpacking the “black box” to build better AI models

    When deep learning models are deployed in the real world, perhaps to detect financial fraud from credit card activity or identify cancer in medical images, they are often able to outperform humans.

    But what exactly are these deep learning models learning? Does a model trained to spot skin cancer in clinical images, for example, actually learn the colors and textures of cancerous tissue, or is it flagging some other features or patterns?

    These powerful machine-learning models are typically based on artificial neural networks that can have millions of nodes that process data to make predictions. Due to their complexity, researchers often call these models “black boxes” because even the scientists who build them don’t understand everything that is going on under the hood.

    Stefanie Jegelka isn’t satisfied with that “black box” explanation. A newly tenured associate professor in the MIT Department of Electrical Engineering and Computer Science, Jegelka is digging deep into deep learning to understand what these models can learn and how they behave, and how to build certain prior information into these models.

    “At the end of the day, what a deep-learning model will learn depends on so many factors. But building an understanding that is relevant in practice will help us design better models, and also help us understand what is going on inside them so we know when we can deploy a model and when we can’t. That is critically important,” says Jegelka, who is also a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Institute for Data, Systems, and Society (IDSS).

    Jegelka is particularly interested in optimizing machine-learning models when input data are in the form of graphs. Graph data pose specific challenges: For instance, information in the data consists of both information about individual nodes and edges, as well as the structure — what is connected to what. In addition, graphs have mathematical symmetries that need to be respected by the machine-learning model so that, for instance, the same graph always leads to the same prediction. Building such symmetries into a machine-learning model is usually not easy.
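
    The symmetry requirement can be made concrete with a toy example. The sketch below is not one of Jegelka’s models; the function names, node features, and graph are hypothetical. It contrasts a readout built from sums over neighbors and nodes, which ignores how the nodes happen to be numbered, with a readout that depends on node numbering and therefore can give two different answers for the same graph.

```python
def symmetric_readout(features, edges):
    """One round of sum-based message passing, then a sum over all nodes.

    features: dict mapping node label -> integer feature
    edges: list of undirected (u, v) pairs
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    # Each node combines its own feature with the sum of its neighbors' features.
    updated = {
        node: 2 * features[node] + sum(features[m] for m in neighbors[node])
        for node in features
    }
    # Summing over nodes makes the result independent of node order and labels.
    return sum(h * h for h in updated.values())


def label_dependent_readout(features, edges):
    """A readout that weights each feature by its node label (breaks the symmetry)."""
    return sum(node * value for node, value in features.items())


def relabel(features, edges, mapping):
    """Return the same graph with its nodes renamed according to mapping."""
    new_features = {mapping[node]: value for node, value in features.items()}
    new_edges = [(mapping[u], mapping[v]) for u, v in edges]
    return new_features, new_edges


# A three-node path graph with made-up integer features.
features = {0: 6, 1: 8, 2: 1}
edges = [(0, 1), (1, 2)]

# The same graph, with the nodes numbered differently.
same_features, same_edges = relabel(features, edges, {0: 2, 1: 0, 2: 1})

print(symmetric_readout(features, edges) ==
      symmetric_readout(same_features, same_edges))
print(label_dependent_readout(features, edges) ==
      label_dependent_readout(same_features, same_edges))
```

    Sums over neighbors and over nodes are one simple way to get this invariance; integer features are used here only so that the equality check is exact.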

    Take molecules, for instance. Molecules can be represented as graphs, with vertices that correspond to atoms and edges that correspond to chemical bonds between them. Drug companies may want to use deep learning to rapidly predict the properties of many molecules, narrowing down the number they must physically test in the lab.

    Jegelka studies methods to build mathematical machine-learning models that can effectively take graph data as an input and output something else, in this case a prediction of a molecule’s chemical properties. This is particularly challenging since a molecule’s properties are determined not only by the atoms within it, but also by the connections between them.  

    Other examples of machine learning on graphs include traffic routing, chip design, and recommender systems.

    Designing these models is made even more difficult by the fact that data used to train them are often different from data the models see in practice. Perhaps the model was trained using small molecular graphs or traffic networks, but the graphs it sees once deployed are larger or more complex.

    In this case, what can researchers expect this model to learn, and will it still work in practice if the real-world data are different?

    “Your model is not going to be able to learn everything because of some hardness problems in computer science, but what you can learn and what you can’t learn depends on how you set the model up,” Jegelka says.

    She approaches this question by combining her passion for algorithms and discrete mathematics with her excitement for machine learning.

    From butterflies to bioinformatics

    Jegelka grew up in a small town in Germany and became interested in science when she was a high school student; a supportive teacher encouraged her to participate in an international science competition. She and her teammates from the U.S. and Singapore won an award for a website they created about butterflies, in three languages.

    “For our project, we took images of wings with a scanning electron microscope at a local university of applied sciences. I also got the opportunity to use a high-speed camera at Mercedes Benz — this camera usually filmed combustion engines — which I used to capture a slow-motion video of the movement of a butterfly’s wings. That was the first time I really got in touch with science and exploration,” she recalls.

    Intrigued by both biology and mathematics, Jegelka decided to study bioinformatics at the University of Tübingen and the University of Texas at Austin. She had a few opportunities to conduct research as an undergraduate, including an internship in computational neuroscience at Georgetown University, but wasn’t sure what career to follow.

    When she returned for her final year of college, Jegelka moved in with two roommates who were working as research assistants at the Max Planck Institute in Tübingen.

    “They were working on machine learning, and that sounded really cool to me. I had to write my bachelor’s thesis, so I asked at the institute if they had a project for me. I started working on machine learning at the Max Planck Institute and I loved it. I learned so much there, and it was a great place for research,” she says.

    She stayed on at the Max Planck Institute to complete a master’s thesis, and then embarked on a PhD in machine learning at the Max Planck Institute and the Swiss Federal Institute of Technology.

    During her PhD, she explored how concepts from discrete mathematics can help improve machine-learning techniques.

    Teaching models to learn

    The more Jegelka learned about machine learning, the more intrigued she became by the challenges of understanding how models behave, and how to steer this behavior.

    “You can do so much with machine learning, but only if you have the right model and data. It is not just a black-box thing where you throw it at the data and it works. You actually have to think about it, its properties, and what you want the model to learn and do,” she says.

    After completing a postdoc at the University of California at Berkeley, Jegelka was hooked on research and decided to pursue a career in academia. She joined the faculty at MIT in 2015 as an assistant professor.

    “What I really loved about MIT, from the very beginning, was that the people really care deeply about research and creativity. That is what I appreciate the most about MIT. The people here really value originality and depth in research,” she says.

    That focus on creativity has enabled Jegelka to explore a broad range of topics.

    In collaboration with other faculty at MIT, she studies machine-learning applications in biology, imaging, computer vision, and materials science.

    But what really drives Jegelka is probing the fundamentals of machine learning, and most recently, the issue of robustness. Often, a model performs well on training data, but its performance deteriorates when it is deployed on slightly different data. Building prior knowledge into a model can make it more reliable, but understanding what information the model needs to be successful and how to build it in is not so simple, she says.

    She is also exploring methods to improve the performance of machine-learning models for image classification.

    Image classification models are everywhere, from the facial recognition systems on mobile phones to tools that identify fake accounts on social media. These models need massive amounts of data for training, but since it is expensive for humans to hand-label millions of images, researchers often use unlabeled datasets to pretrain models instead.

    These models then reuse the representations they have learned when they are fine-tuned later for a specific task.

    Ideally, researchers want the model to learn as much as it can during pretraining, so it can apply that knowledge to its downstream task. But in practice, these models often learn only a few simple correlations — like that one image has sunshine and one has shade — and use these “shortcuts” to classify images.

    “We showed that this is a problem in ‘contrastive learning,’ which is a standard technique for pre-training, both theoretically and empirically. But we also show that you can influence the kinds of information the model will learn to represent by modifying the types of data you show the model. This is one step toward understanding what models are actually going to do in practice,” she says.

    Researchers still don’t understand everything that goes on inside a deep-learning model, or details about how they can influence what a model learns and how it behaves, but Jegelka looks forward to continuing to explore these topics.

    “Often in machine learning, we see something happen in practice and we try to understand it theoretically. This is a huge challenge. You want to build an understanding that matches what you see in practice, so that you can do better. We are still just at the beginning of understanding this,” she says.

    Outside the lab, Jegelka is a fan of music, art, traveling, and cycling. But these days, she enjoys spending most of her free time with her preschool-aged daughter.

  • in

    Simulating discrimination in virtual reality

    Have you ever been advised to “walk a mile in someone else’s shoes”? Considering another person’s perspective can be a challenging endeavor — but recognizing our errors and biases is key to building understanding across communities. By challenging our preconceptions, we confront prejudice, such as racism and xenophobia, and potentially develop a more inclusive perspective about others.

    To assist with perspective-taking, MIT researchers have developed “On the Plane,” a virtual reality role-playing game (VR RPG) that simulates discrimination. In this case, the game portrays xenophobia directed against a Malaysian American woman, but the approach can be generalized. Situated on an airplane, players can take on the role of characters from different backgrounds, engaging in dialogue with others while making in-game choices in response to a series of prompts. In turn, players’ decisions control the outcome of a tense conversation between the characters about cultural differences.

    As a VR RPG, “On the Plane” encourages players to take on new roles that may be outside of their personal experiences in the first person, allowing them to confront in-group/out-group bias by incorporating new perspectives into their understanding of different cultures. Players can play as one of three characters: Sarah, a first-generation Muslim American of Malaysian ancestry who wears a hijab; Marianne, a white woman from the Midwest with little exposure to other cultures and customs; or a flight attendant. Sarah represents the out group, Marianne is a member of the in group, and the flight staffer is a bystander witnessing an exchange between the two passengers.

    “This project is part of our efforts to harness the power of virtual reality and artificial intelligence to address social ills, such as discrimination and xenophobia,” says Caglar Yildirim, an MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) research scientist who is a co-author and co-game designer on the project. “Through the exchange between the two passengers, players experience how one passenger’s xenophobia manifests itself and how it affects the other passenger. The simulation engages players in critical reflection and seeks to foster empathy for the passenger who was ‘othered’ due to her outfit being not so ‘prototypical’ of what an American should look like.”

    Yildirim worked alongside the project’s principal investigator, D. Fox Harrell, MIT professor of digital media and AI at CSAIL, the Program in Comparative Media Studies/Writing (CMS), and the Institute for Data, Systems, and Society (IDSS), and founding director of the MIT Center for Advanced Virtuality. “It is not possible for a simulation to give someone the life experiences of another person, but while you cannot ‘walk in someone else’s shoes’ in that sense, a system like this can help people recognize and understand the social patterns at work when it comes to issues like bias,” says Harrell, who is also co-author and designer on this project. “An engaging, immersive, interactive narrative can also impact people emotionally, opening the door for users’ perspectives to be transformed and broadened.”

    This simulation also utilizes an interactive narrative engine that creates several options for responses to in-game interactions based on a model of how people are categorized socially. The tool grants players a chance to alter their standing in the simulation through their reply choices to each prompt, affecting their affinity toward the other two characters. For example, if you play as the flight attendant, you can react to Marianne’s xenophobic expressions and attitudes toward Sarah, changing your affinities. The engine will then provide you with a different set of narrative events based on your changes in standing with others.

    To animate each avatar, “On the Plane” incorporates artificial intelligence knowledge representation techniques controlled by probabilistic finite state machines, a tool commonly used in machine learning systems for pattern recognition. With the help of these machines, characters’ body language and gestures are customizable: if you play as Marianne, the game will customize her mannerisms toward Sarah based on user inputs, impacting how comfortable she appears in front of a member of a perceived out group. Similarly, players can do the same from Sarah’s or the flight attendant’s point of view.

    In a 2018 paper based on work done in a collaboration between MIT CSAIL and the Qatar Computing Research Institute, Harrell and co-author Sercan Şengün advocated for virtual system designers to be more inclusive of Middle Eastern identities and customs. They claimed that if designers allowed users to customize virtual avatars more representative of their background, it might empower players to engage in a more supportive experience. Four years later, “On the Plane” accomplishes a similar goal, incorporating a Muslim’s perspective into an immersive environment.

    “Many virtual identity systems, such as avatars, accounts, profiles, and player characters, are not designed to serve the needs of people across diverse cultures. We have used statistical and AI methods in conjunction with qualitative approaches to learn where the gaps are,” they note. “Our project helps engender perspective transformation so that people will treat each other with respect and enhanced understanding across diverse cultural avatar representations.”

    Harrell and Yildirim’s work is part of the MIT IDSS’s Initiative on Combatting Systemic Racism (ICSR). Harrell is on the initiative’s steering committee and is the leader of the newly forming Antiracism, Games, and Immersive Media vertical, which studies behavior, cognition, social phenomena, and computational systems related to race and racism in video games and immersive experiences.

    The researchers’ latest project is part of the ICSR’s broader goal to launch and coordinate cross-disciplinary research that addresses racially discriminatory processes across American institutions. Using big data, members of the research initiative develop and employ computing tools that drive racial equity. Yildirim and Harrell accomplish this goal by depicting a frequent, problematic scenario that illustrates how bias creeps into our everyday lives.

    “In a post-9/11 world, Muslims often experience ethnic profiling in American airports. ‘On the Plane’ builds off of that type of in-group favoritism, a well-established finding in psychology,” says MIT Professor Fotini Christia, director of the Sociotechnical Systems Research Center (SSRC) and associate director of IDSS. “This game also takes a novel approach to analyzing hardwired bias by utilizing VR instead of field experiments to simulate prejudice. Excitingly, this research demonstrates that VR can be used as a tool to help us better measure bias, combating systemic racism and other forms of discrimination.”

    “On the Plane” was developed on the Unity game engine using the XR Interaction Toolkit and Harrell’s Chimeria platform for authoring interactive narratives that involve social categorization. The game will be deployed for research studies later this year on both desktop computers and the standalone, wireless Meta Quest headsets. A paper on the work was presented in December at the 2022 IEEE International Conference on Artificial Intelligence and Virtual Reality.

  • in

    Subtle biases in AI can influence emergency decisions

    It’s no secret that people harbor biases — some unconscious, perhaps, and others painfully overt. The average person might suppose that computers — machines typically made of plastic, steel, glass, silicon, and various metals — are free of prejudice. While that assumption may hold for computer hardware, the same is not always true for computer software, which is programmed by fallible humans and can be fed data that is, itself, compromised in certain respects.

    Artificial intelligence (AI) systems — those based on machine learning, in particular — are seeing increased use in medicine for diagnosing specific diseases, for example, or evaluating X-rays. These systems are also being relied on to support decision-making in other areas of health care. Recent research has shown, however, that machine learning models can encode biases against minority subgroups, and the recommendations they make may consequently reflect those same biases.

    A new study by researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT Jameel Clinic, which was published last month in Communications Medicine, assesses the impact that discriminatory AI models can have, especially for systems that are intended to provide advice in urgent situations. “We found that the manner in which the advice is framed can have significant repercussions,” explains the paper’s lead author, Hammaad Adam, a PhD student at MIT’s Institute for Data Systems and Society. “Fortunately, the harm caused by biased models can be limited (though not necessarily eliminated) when the advice is presented in a different way.” The other co-authors of the paper are Aparna Balagopalan and Emily Alsentzer, both PhD students, and the professors Fotini Christia and Marzyeh Ghassemi.

    AI models used in medicine can suffer from inaccuracies and inconsistencies, in part because the data used to train the models are often not representative of real-world settings. Different kinds of X-ray machines, for instance, can record things differently and hence yield different results. Models trained predominantly on white people, moreover, may not be as accurate when applied to other groups. The Communications Medicine paper does not focus on issues of that sort; instead, it addresses problems that stem from biases and ways to mitigate their adverse consequences.

    A group of 954 people (438 clinicians and 516 nonexperts) took part in an experiment to see how AI biases can affect decision-making. The participants were presented with call summaries from a fictitious crisis hotline, each involving a male individual undergoing a mental health emergency. The summaries contained information as to whether the individual was Caucasian or African American and would also mention his religion if he happened to be Muslim. A typical call summary might describe a circumstance in which an African American man was found at home in a delirious state, indicating that “he has not consumed any drugs or alcohol, as he is a practicing Muslim.” Study participants were instructed to call the police if they thought the patient was likely to turn violent; otherwise, they were encouraged to seek medical help.

    The participants were randomly divided into a control or “baseline” group plus four other groups designed to test responses under slightly different conditions. “We want to understand how biased models can influence decisions, but we first need to understand how human biases can affect the decision-making process,” Adam notes. What they found in their analysis of the baseline group was rather surprising: “In the setting we considered, human participants did not exhibit any biases. That doesn’t mean that humans are not biased, but the way we conveyed information about a person’s race and religion, evidently, was not strong enough to elicit their biases.”

    The other four groups in the experiment were given advice that either came from a biased or unbiased model, and that advice was presented in either a “prescriptive” or a “descriptive” form. A biased model would be more likely to recommend police help in a situation involving an African American or Muslim person than would an unbiased model. Participants in the study, however, did not know which kind of model their advice came from, or even that models delivering the advice could be biased at all. Prescriptive advice spells out what a participant should do in unambiguous terms, telling them they should call the police in one instance or seek medical help in another. Descriptive advice is less direct: A flag is displayed to show that the AI system perceives a risk of violence associated with a particular call; no flag is shown if the threat of violence is deemed small.  
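
    The five study conditions and the two advice formats can be sketched in a few lines of Python. This is only an illustration of the design described above: the arm labels, the function names, and the exact wording of the advice strings are assumptions, not the materials the researchers used.

```python
import random

# One baseline arm plus a 2x2 grid of (model type) x (advice style).
ARMS = [
    "baseline",
    "biased/prescriptive",
    "biased/descriptive",
    "unbiased/prescriptive",
    "unbiased/descriptive",
]

def assign_arm(rng):
    """Randomly place a participant in one of the five arms."""
    return rng.choice(ARMS)

def render_advice(arm, model_flags_violence):
    """Return the text a participant would see, or None if nothing is shown."""
    if arm == "baseline":
        return None  # baseline participants receive no AI advice
    _, style = arm.split("/")
    if style == "prescriptive":
        # Prescriptive advice tells the participant what to do.
        return "Call the police." if model_flags_violence else "Seek medical help."
    # Descriptive advice only shows a flag; no flag means nothing is displayed.
    return "FLAG: risk of violence" if model_flags_violence else None

print(assign_arm(random.Random(0)))
print(render_advice("biased/prescriptive", True))
print(render_advice("biased/descriptive", True))
```

    Note that the function never reveals whether the model is biased, which mirrors the fact that participants did not know which kind of model produced their advice.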

    A key takeaway of the experiment is that participants “were highly influenced by prescriptive recommendations from a biased AI system,” the authors wrote. But they also found that “using descriptive rather than prescriptive recommendations allowed participants to retain their original, unbiased decision-making.” In other words, the bias incorporated within an AI model can be diminished by appropriately framing the advice that’s rendered. Why the different outcomes, depending on how advice is posed? When someone is told to do something, like call the police, that leaves little room for doubt, Adam explains. However, when the situation is merely described — classified with or without the presence of a flag — “that leaves room for a participant’s own interpretation; it allows them to be more flexible and consider the situation for themselves.”

    Second, the researchers found that the language models that are typically used to offer advice are easy to bias. Language models represent a class of machine learning systems that are trained on text, such as the entire contents of Wikipedia and other web material. When these models are “fine-tuned” by relying on a much smaller subset of data for training purposes — just 2,000 sentences, as opposed to 8 million web pages — the resultant models can be readily biased.  

    Third, the MIT team discovered that decision-makers who are themselves unbiased can still be misled by the recommendations provided by biased models. Medical training (or the lack thereof) did not change responses in a discernible way. “Clinicians were influenced by biased models as much as non-experts were,” the authors stated.

    “These findings could be applicable to other settings,” Adam says, and are not necessarily restricted to health care situations. When it comes to deciding which people should receive a job interview, a biased model could be more likely to turn down Black applicants. The results could be different, however, if instead of explicitly (and prescriptively) telling an employer to “reject this applicant,” a descriptive flag is attached to the file to indicate the applicant’s “possible lack of experience.”

    The implications of this work are broader than just figuring out how to deal with individuals in the midst of mental health crises, Adam maintains. “Our ultimate goal is to make sure that machine learning models are used in a fair, safe, and robust way.”

  • in

    Meet the 2022-23 Accenture Fellows

    Launched in October 2020, the MIT and Accenture Convergence Initiative for Industry and Technology underscores the ways in which industry and technology can collaborate to spur innovation. The five-year initiative aims to achieve its mission through research, education, and fellowships. To that end, Accenture has once again awarded five annual fellowships to MIT graduate students working on research in industry and technology convergence who are underrepresented, including by race, ethnicity, and gender.

    This year’s Accenture Fellows work across research areas including telemonitoring, human-computer interactions, operations research, AI-mediated socialization, and chemical transformations. Their research covers a wide array of projects, including designing low-power processing hardware for telehealth applications; applying machine learning to streamline and improve business operations; improving mental health care through artificial intelligence; and using machine learning to understand the environmental and health consequences of complex chemical reactions.

    As part of the application process, student nominations were invited from each unit within the School of Engineering, as well as from the Institute’s four other schools and the MIT Schwarzman College of Computing. Five exceptional students were selected as fellows for the initiative’s third year.

    Drew Buzzell is a doctoral candidate in electrical engineering and computer science whose research concerns telemonitoring, a fast-growing sphere of telehealth in which information is collected through internet-of-things (IoT) connected devices and transmitted to the cloud. Currently, the high volume of information involved in telemonitoring, along with the time and energy costs of processing it, makes data analysis difficult. Buzzell’s work is focused on edge computing, a new computing architecture that seeks to address these challenges by managing data closer to the source, in a distributed network of IoT devices. Buzzell earned his BS in physics and engineering science and his MS in engineering science from the Pennsylvania State University.

    Mengying (Cathy) Fang is a master’s student in the MIT School of Architecture and Planning. Her research focuses on augmented reality and virtual reality platforms. Fang is developing novel sensors and machine components that combine computation, materials science, and engineering. Moving forward, she will explore topics including soft robotics techniques that could be integrated with clothes and wearable devices and haptic feedback in order to develop interactions with digital objects. Fang earned a BS in mechanical engineering and human-computer interaction from Carnegie Mellon University.

    Xiaoyue Gong is a doctoral candidate in operations research at the MIT Sloan School of Management. Her research aims to harness the power of machine learning and data science to reduce inefficiencies in the operation of businesses, organizations, and society. With the support of an Accenture Fellowship, Gong seeks to design reinforcement learning methods and other machine learning techniques that solve embedded operational problems. Gong earned a BS in honors mathematics and interactive media arts from New York University.

    Ruby Liu is a doctoral candidate in medical engineering and medical physics. Their research addresses the growing pandemic of loneliness among older adults, which leads to poor health outcomes and presents particularly high risks for historically marginalized people, including members of the LGBTQ+ community and people of color. Liu is designing a network of interconnected AI agents that foster connections between user and agent, offering mental health care while strengthening and facilitating human-human connections. Liu received a BS in biomedical engineering from Johns Hopkins University.

    Joules Provenzano is a doctoral candidate in chemical engineering. Their work integrates machine learning and liquid chromatography-high resolution mass spectrometry (LC-HRMS) to improve our understanding of complex chemical reactions in the environment. As an Accenture Fellow, Provenzano will build upon recent advances in machine learning and LC-HRMS, including novel algorithms for processing real, experimental HR-MS data and new approaches in extracting structure-transformation rules and kinetics. Their research could speed the pace of discovery in the chemical sciences and benefit industries including oil and gas, pharmaceuticals, and agriculture. Provenzano earned a BS in chemical engineering and international and global studies from the Rochester Institute of Technology.

  • in

    A faster way to preserve privacy online

    Searching the internet can reveal information a user would rather keep private. For instance, when someone looks up medical symptoms online, they could reveal their health conditions to Google, an online medical database like WebMD, and perhaps hundreds of these companies’ advertisers and business partners.

    For decades, researchers have been crafting techniques that enable users to search for and retrieve information from a database privately, but these methods remain too slow to be effectively used in practice.

    MIT researchers have now developed a scheme for private information retrieval that is about 30 times faster than other comparable methods. Their technique enables a user to search an online database without revealing their query to the server. Moreover, it is driven by a simple algorithm that would be easier to implement than the more complicated approaches from previous work.

    Their technique could enable private communication by preventing a messaging app from knowing what users are saying or who they are talking to. It could also be used to fetch relevant online ads without advertising servers learning a user’s interests.

    “This work is really about giving users back some control over their own data. In the long run, we’d like browsing the web to be as private as browsing a library. This work doesn’t achieve that yet, but it starts building the tools to let us do this sort of thing quickly and efficiently in practice,” says Alexandra Henzinger, a computer science graduate student and lead author of a paper introducing the technique.

    Co-authors include Matthew Hong, an MIT computer science graduate student; Henry Corrigan-Gibbs, the Douglas Ross Career Development Professor of Software Technology in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Sarah Meiklejohn, a professor in cryptography and security at University College London and a staff research scientist at Google; and senior author Vinod Vaikuntanathan, an EECS professor and principal investigator in CSAIL. The research will be presented at the 2023 USENIX Security Symposium. 

    Preserving privacy

    The first schemes for private information retrieval were developed in the 1990s, partly by researchers at MIT. These techniques enable a user to communicate with a remote server that holds a database, and read records from that database without the server knowing what the user is reading.

    To preserve privacy, these techniques force the server to touch every single item in the database, so it can’t tell which entry a user is searching for. If one area is left untouched, the server would learn that the client is not interested in that item. But touching every item when there may be millions of database entries slows down the query process.

    To speed things up, the MIT researchers developed a protocol, known as Simple PIR, in which the server performs much of the underlying cryptographic work in advance, before a client even sends a query. This preprocessing step produces a data structure that holds compressed information about the database contents, and which the client downloads before sending a query.

    In a sense, this data structure is like a hint for the client about what is in the database.

    “Once the client has this hint, it can make an unbounded number of queries, and these queries are going to be much smaller in both the size of the messages you are sending and the work that you need the server to do. This is what makes Simple PIR so much faster,” Henzinger explains.

    But the hint can be relatively large in size. For example, to query a 1-gigabyte database, the client would need to download a 124-megabyte hint. This drives up communication costs, which could make the technique difficult to implement on real-world devices.

    To reduce the size of the hint, the researchers developed a second technique, known as Double PIR, that basically involves running the Simple PIR scheme twice. This produces a much more compact hint that is fixed in size for any database.

    Using Double PIR, the hint for a 1-gigabyte database would be only 16 megabytes.
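
    A quick calculation puts the two hint sizes side by side. The figures come from the text above; treating 1 gigabyte as 1,000 megabytes is an assumption made here for simplicity.

```python
MB_PER_GB = 1000  # assumption: decimal units

db_mb = 1 * MB_PER_GB      # 1-gigabyte database
simple_hint_mb = 124       # Simple PIR hint for that database
double_hint_mb = 16        # Double PIR hint for that database

simple_fraction = simple_hint_mb / db_mb          # hint size relative to the database
shrink_factor = simple_hint_mb / double_hint_mb   # how much smaller the Double PIR hint is

print(f"Simple PIR hint: {simple_fraction:.1%} of the database")
print(f"Double PIR hint is {shrink_factor:.2f}x smaller")
```

    Under that assumption, the Simple PIR hint is roughly an eighth of the database, and Double PIR cuts the download by a factor of nearly eight.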

    “Our Double PIR scheme runs a little bit slower, but it will have much lower communication costs. For some applications, this is going to be a desirable tradeoff,” Henzinger says.

    Hitting the speed limit

    They tested the Simple PIR and Double PIR schemes by applying them to a task in which a client seeks to audit a specific piece of information about a website to ensure that website is safe to visit. To preserve privacy, the client cannot reveal the website it is auditing.

    The researchers’ fastest technique was able to successfully preserve privacy while running at about 10 gigabytes per second. Previous schemes could only achieve a throughput of about 300 megabytes per second.
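
    The throughput figures translate directly into per-query server time, as the short calculation below shows. It assumes decimal units (1 gigabyte = 1,000 megabytes) and that server time is simply database size divided by throughput, ignoring network transfer.

```python
def scan_seconds(db_gb, throughput_gb_per_s):
    """Time for the server to process every record once."""
    return db_gb / throughput_gb_per_s

new_gb_per_s = 10.0   # fastest new scheme, from the text
old_gb_per_s = 0.3    # previous schemes: about 300 megabytes per second

speedup = new_gb_per_s / old_gb_per_s
print(f"Speedup: about {speedup:.0f}x")
print(f"1 GB database: {scan_seconds(1, new_gb_per_s):.2f} s vs {scan_seconds(1, old_gb_per_s):.2f} s")
```

    The ratio works out to roughly 33, consistent with the “about 30 times faster” figure quoted earlier.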

    They show that their method approaches the theoretical speed limit for private information retrieval — it is nearly the fastest possible scheme one can build in which the server touches every record in the database, adds Corrigan-Gibbs.

    In addition, their method only requires a single server, making it much simpler than many top-performing techniques that require two separate servers with identical databases. Their method outperformed these more complex protocols.
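
    For contrast, here is a minimal sketch of the classic two-server idea that dates back to the 1990s. It is a textbook construction, not one of the specific protocols the researchers benchmarked, and the function names are invented for this example. Privacy holds only if the two servers do not compare notes.

```python
import secrets

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def make_queries(n, index):
    """Client: two random-looking bit masks that differ only at `index`."""
    mask1 = [secrets.randbits(1) for _ in range(n)]
    mask2 = list(mask1)
    mask2[index] ^= 1
    return mask1, mask2

def server_answer(db, mask):
    """Server: walk the whole database and XOR together the selected records."""
    acc = bytes(len(db[0]))
    for record, bit in zip(db, mask):
        if bit:
            acc = xor_bytes(acc, record)
    return acc

def reconstruct(answer1, answer2):
    """Client: everything cancels except the record at the secret index."""
    return xor_bytes(answer1, answer2)

db = [b"rec0", b"rec1", b"rec2", b"rec3"]   # identical copy on both servers
m1, m2 = make_queries(len(db), 2)
print(reconstruct(server_answer(db, m1), server_answer(db, m2)))
```

    Each server sees only a uniformly random mask, so neither learns the index on its own. The cost is operational: two non-colluding servers must hold identical copies of the database, which is the requirement a single-server scheme avoids.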

    “I’ve been thinking about these schemes for some time, and I never thought this could be possible at this speed. The folklore was that any single-server scheme is going to be really slow. This work turns that whole notion on its head,” Corrigan-Gibbs says.

    While the researchers have shown that they can make PIR schemes much faster, there is still work to do before they would be able to deploy their techniques in real-world scenarios, says Henzinger. They would like to cut the communication costs of their schemes while still enabling them to achieve high speeds. In addition, they want to adapt their techniques to handle more complex queries, such as general SQL queries, and more demanding applications, such as a general Wikipedia search. And in the long run, they hope to develop better techniques that can preserve privacy without requiring a server to touch every database item. 

    “I’ve heard people emphatically claiming that PIR will never be practical. But I would never bet against technology. That is an optimistic lesson to learn from this work. There are always ways to innovate,” Vaikuntanathan says.

    “This work makes a major improvement to the practical cost of private information retrieval. While it was known that low-bandwidth PIR schemes imply public-key cryptography, which is typically orders of magnitude slower than private-key cryptography, this work develops an ingenious method to bridge the gap. This is done by making a clever use of special properties of a public-key encryption scheme due to Regev to push the vast majority of the computational work to a precomputation step, in which the server computes a short ‘hint’ about the database,” says Yuval Ishai, a professor of computer science at Technion (the Israel Institute of Technology), who was not involved in the study. “What makes their approach particularly appealing is that the same hint can be used an unlimited number of times, by any number of clients. This renders the (moderate) cost of computing the hint insignificant in a typical scenario where the same database is accessed many times.”

    This work is funded, in part, by the National Science Foundation, Google, Facebook, MIT’s Fintech@CSAIL Initiative, an NSF Graduate Research Fellowship, an EECS Great Educators Fellowship, the National Institutes of Health, the Defense Advanced Research Projects Agency, the MIT-IBM Watson AI Lab, Analog Devices, Microsoft, and a Thornton Family Faculty Research Innovation Fellowship.

  • in

    Large language models help decipher clinical notes

    Electronic health records (EHRs) need a new public relations manager. Ten years ago, the U.S. government passed a law that required hospitals to digitize their health records with the intent of improving and streamlining care. The enormous amount of information in these now-digital records could be used to answer very specific questions beyond the scope of clinical trials: What’s the right dose of this medication for patients with this height and weight? What about patients with a specific genomic profile?

    Unfortunately, most of the data that could answer these questions is trapped in doctors’ notes, full of jargon and abbreviations. These notes are hard for computers to understand using current techniques: extracting information requires training multiple machine learning models. Models trained for one hospital also don’t work well at others, and training each model requires domain experts to label lots of data, a time-consuming and expensive process.

    An ideal system would use a single model that can extract many types of information, work well at multiple hospitals, and learn from a small amount of labeled data. But how? Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) believed that to disentangle the data, they needed to call on something bigger: large language models. To pull that important medical information, they used a very big, GPT-3 style model to do tasks like expand overloaded jargon and acronyms and extract medication regimens. 

    For example, the system takes an input, which in this case is a clinical note, and “prompts” the model with a question about the note, such as “expand this abbreviation, C-T-A.” The system returns an output such as “clear to auscultation,” as opposed to, say, a CT angiography. The objective of extracting this clean data, the team says, is to eventually enable more personalized clinical recommendations.
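
    The input-prompt-output loop described above might look something like the following Python sketch. The prompt wording, the function names, and the `complete` callable (a stand-in for whatever language model is used) are all assumptions made for illustration; the sample note is invented, and no real model API is called.

```python
def build_prompt(note, abbreviation):
    """Combine the clinical note with a question about one abbreviation."""
    return (
        f"Clinical note:\n{note}\n\n"
        f'Expand the abbreviation "{abbreviation}" as it is used in this note.\n'
        "Expansion:"
    )

def expand_abbreviation(note, abbreviation, complete):
    """`complete` is any function that maps a prompt string to generated text."""
    return complete(build_prompt(note, abbreviation)).strip()

def fake_model(prompt):
    # Hypothetical stand-in so the sketch runs without a real model.
    return " clear to auscultation\n"

note = "Lungs CTA bilaterally."   # invented example note
print(expand_abbreviation(note, "CTA", fake_model))
```

    Because the whole note travels with the question, a model has the surrounding context it needs to choose between competing expansions of the same abbreviation.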

    Medical data is, understandably, a pretty tricky resource to navigate freely. There’s plenty of red tape around using public resources for testing the performance of large models because of data use restrictions, so the team decided to scrape together their own. Using a set of short, publicly available clinical snippets, they cobbled together a small dataset to enable evaluation of the extraction performance of large language models. 

    “It’s challenging to develop a single general-purpose clinical natural language processing system that will solve everyone’s needs and be robust to the huge variation seen across health datasets. As a result, until today, most clinical notes are not used in downstream analyses or for live decision support in electronic health records. These large language model approaches could potentially transform clinical natural language processing,” says David Sontag, MIT professor of electrical engineering and computer science, principal investigator in CSAIL and the Institute for Medical Engineering and Science, and supervising author on a paper about the work, which will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction makes scaling possible. Even if you have hundreds of different use cases, no problem — you can build each model with a few minutes of work, versus having to label a ton of data for that particular task.”

    For example, without any labels at all, the researchers found these models could achieve 86 percent accuracy at expanding overloaded acronyms, and the team developed additional methods to boost this further to 90 percent accuracy, with still no labels required.

    Imprisoned in an EHR 

    Experts have been steadily building up large language models (LLMs) for quite some time, but they burst into the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on a huge amount of text from the internet to finish sentences and predict the next most likely word.

    While previous, smaller models like earlier GPT iterations or BERT have pulled off a good performance for extracting medical data, they still require substantial manual data-labeling effort. 

    For example, a note, “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but experienced nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medication, side effects from the record, disambiguating common abbreviations, etc.). In addition to expanding abbreviations, they investigated four other tasks, including whether the models could parse clinical trials and extract detail-rich medication regimens.
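
    To make the shorthand concrete, the sketch below expands the sample note with a fixed lookup table. This is deliberately not the team’s method: it is the kind of naive baseline that language models are meant to improve on, since a fixed table cannot choose between competing meanings of an abbreviation such as CTA. The glossary contains only the four expansions given above.

```python
# Expansions taken from the example in the text.
GLOSSARY = {
    "pt": "patient",
    "dc": "discontinue",
    "vanco": "vancomycin",
    "n/v": "nausea and vomiting",
}

def expand_note(note):
    """Replace each whitespace-separated token that appears in the glossary."""
    return " ".join(GLOSSARY.get(token.lower(), token) for token in note.split())

print(expand_note("pt will dc vanco due to n/v"))
# prints "patient will discontinue vancomycin due to nausea and vomiting"
```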

    “Prior work has shown that these models are sensitive to the prompt’s precise phrasing. Part of our technical contribution is a way to format the prompt so that the model gives you outputs in the correct format,” says Hunter Lang, CSAIL PhD student and author on the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string. It can be a list. It can be a quote from the original input. So there’s more structure than just free text. Part of our research contribution is encouraging the model to give you an output with the correct structure. That significantly cuts down on post-processing time.”
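
    One way to picture a structured output space is a model that has been prompted to answer with one item per line. The sketch below parses such an answer into a Python list. The dash-prefixed format and the function name are assumptions chosen for this example, not the format described in the paper.

```python
def parse_bulleted_list(output):
    """Turn '- item' lines into a list, ignoring anything else the model wrote."""
    items = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("- "):
            items.append(line[2:].strip())
    return items

model_output = "- vancomycin\n- aspirin\nI hope this helps!"   # invented output
print(parse_bulleted_list(model_output))
# prints ['vancomycin', 'aspirin']
```

    When the prompt reliably produces this shape, the post-processing step shrinks to a few lines like these.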

    The approach can’t be applied to out-of-the-box health data at a hospital: that requires sending private patient information across the open internet to an LLM provider like OpenAI. The authors showed that it’s possible to work around this by distilling the model into a smaller one that could be used on-site.

    The model — sometimes just like humans — is not always beholden to the truth. Here’s what a potential problem might look like: Let’s say you’re asking the reason why someone took medication. Without proper guardrails and checks, the model might just output the most common reason for that medication, if nothing is explicitly mentioned in the note. This led to the team’s efforts to force the model to extract more quotes from data and less free text.
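
    A simple guardrail in this spirit is to accept an extracted reason only if it appears verbatim in the note. The check below is a sketch of that idea under stated assumptions: the function name and the case-insensitive matching rule are choices made here, not details from the paper.

```python
def grounded_spans(note, candidates):
    """Keep only candidate answers that are literal quotes from the note."""
    note_lc = note.lower()
    return [c for c in candidates if c.strip() and c.strip().lower() in note_lc]

note = "pt will dc vanco due to n/v"
candidates = ["due to n/v", "infection"]   # second item is a plausible-sounding guess
print(grounded_spans(note, candidates))
# prints ['due to n/v']
```

    A guess such as “infection” may be the most common reason for a prescription, but it is discarded because the note never says it.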

    Future work for the team includes extending to languages other than English, creating additional methods for quantifying uncertainty in the model, and pulling off similar results with open-sourced models. 

    “Clinical information buried in unstructured clinical notes has unique challenges compared to general domain text mostly due to large use of acronyms, and inconsistent textual patterns used across different health care facilities,” says Sadid Hasan, AI lead at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work sets forth an interesting paradigm of leveraging the power of general domain large language models for several important zero-/few-shot clinical NLP tasks. Specifically, the proposed guided prompt design of LLMs to generate more structured outputs could lead to further developing smaller deployable models by iteratively utilizing the model generated pseudo-labels.”

    “AI has accelerated in the last five years to the point at which these large models can predict contextualized recommendations with benefits rippling out across a variety of domains such as suggesting novel drug formulations, understanding unstructured text, code recommendations or create works of art inspired by any number of human artists or styles,” says Parminder Bhatia, who was formerly Head of Machine Learning at AWS Health AI and is currently Head of ML for low-code applications leveraging large language models at AWS AI Labs. “One of the applications of these large models [the team has] recently launched is Amazon CodeWhisperer, which is [an] ML-powered coding companion that helps developers in building applications.”

    As part of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, Agrawal, Sontag, and Lang wrote the paper alongside Yoon Kim, MIT assistant professor and CSAIL principal investigator, and Stefan Hegselmann, a visiting PhD student from the University of Muenster. First-author Agrawal’s research was supported by a Takeda Fellowship, the MIT Deshpande Center for Technological Innovation, and the MLA@CSAIL Initiatives.