More stories

  • in

    Exploring emerging topics in artificial intelligence policy

    Members of the public sector, private sector, and academia convened for the second AI Policy Forum Symposium last month to explore critical directions and questions posed by artificial intelligence in our economies and societies.

    The virtual event, hosted by the AI Policy Forum (AIPF) — an undertaking by the MIT Schwarzman College of Computing to bridge high-level principles of AI policy with the practices and trade-offs of governing — brought together an array of distinguished panelists to delve into four cross-cutting topics: law, auditing, health care, and mobility.

    In the last year there have been substantial changes in the regulatory and policy landscape around AI in several countries — most notably in Europe with the development of the European Union Artificial Intelligence Act, the first attempt by a major regulator to propose a law on artificial intelligence. In the United States, the National AI Initiative Act of 2020, which became law in January 2021, is providing a coordinated program across federal government to accelerate AI research and application for economic prosperity and security gains. Finally, China recently advanced several new regulations of its own.

    Each of these developments represents a different approach to legislating AI, but what makes a good AI law? And when should AI legislation be based on binding rules with penalties versus establishing voluntary guidelines?

    Jonathan Zittrain, professor of international law at Harvard Law School and director of the Berkman Klein Center for Internet and Society, says the self-regulatory approach taken during the expansion of the internet had its limitations with companies struggling to balance their interests with those of their industry and the public.

    “One lesson might be that actually having representative government take an active role early on is a good idea,” he says. “It’s just that they’re challenged by the fact that there appears to be two phases in this environment of regulation. One, too early to tell, and two, too late to do anything about it. In AI I think a lot of people would say we’re still in the ‘too early to tell’ stage but given that there’s no middle zone before it’s too late, it might still call for some regulation.”

    A theme that came up repeatedly throughout the first panel on AI laws — a conversation moderated by Dan Huttenlocher, dean of the MIT Schwarzman College of Computing and chair of the AI Policy Forum — was the notion of trust. “If you told me the truth consistently, I would say you are an honest person. If AI could provide something similar, something that I can say is consistent and is the same, then I would say it’s trusted AI,” says Bitange Ndemo, professor of entrepreneurship at the University of Nairobi and the former permanent secretary of Kenya’s Ministry of Information and Communication.

    Eva Kaili, vice president of the European Parliament, adds that “In Europe, whenever you use something, like any medication, you know that it has been checked. You know you can trust it. You know the controls are there. We have to achieve the same with AI.” Kalli further stresses that building trust in AI systems will not only lead to people using more applications in a safe manner, but that AI itself will reap benefits as greater amounts of data will be generated as a result.

    The rapidly increasing applicability of AI across fields has prompted the need to address both the opportunities and challenges of emerging technologies and the impact they have on social and ethical issues such as privacy, fairness, bias, transparency, and accountability. In health care, for example, new techniques in machine learning have shown enormous promise for improving quality and efficiency, but questions of equity, data access and privacy, safety and reliability, and immunology and global health surveillance remain at large.

    MIT’s Marzyeh Ghassemi, an assistant professor in the Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, and David Sontag, an associate professor of electrical engineering and computer science, collaborated with Ziad Obermeyer, an associate professor of health policy and management at the University of California Berkeley School of Public Health, to organize AIPF Health Wide Reach, a series of sessions to discuss issues of data sharing and privacy in clinical AI. The organizers assembled experts devoted to AI, policy, and health from around the world with the goal of understanding what can be done to decrease barriers to access to high-quality health data to advance more innovative, robust, and inclusive research results while being respectful of patient privacy.

    Over the course of the series, members of the group presented on a topic of expertise and were tasked with proposing concrete policy approaches to the challenge discussed. Drawing on these wide-ranging conversations, participants unveiled their findings during the symposium, covering nonprofit and government success stories and limited access models; upside demonstrations; legal frameworks, regulation, and funding; technical approaches to privacy; and infrastructure and data sharing. The group then discussed some of their recommendations that are summarized in a report that will be released soon.

    One of the findings calls for the need to make more data available for research use. Recommendations that stem from this finding include updating regulations to promote data sharing to enable easier access to safe harbors such as the Health Insurance Portability and Accountability Act (HIPAA) has for de-identification, as well as expanding funding for private health institutions to curate datasets, amongst others. Another finding, to remove barriers to data for researchers, supports a recommendation to decrease obstacles to research and development on federally created health data. “If this is data that should be accessible because it’s funded by some federal entity, we should easily establish the steps that are going to be part of gaining access to that so that it’s a more inclusive and equitable set of research opportunities for all,” says Ghassemi. The group also recommends taking a careful look at the ethical principles that govern data sharing. While there are already many principles proposed around this, Ghassemi says that “obviously you can’t satisfy all levers or buttons at once, but we think that this is a trade-off that’s very important to think through intelligently.”

    In addition to law and health care, other facets of AI policy explored during the event included auditing and monitoring AI systems at scale, and the role AI plays in mobility and the range of technical, business, and policy challenges for autonomous vehicles in particular.

    The AI Policy Forum Symposium was an effort to bring together communities of practice with the shared aim of designing the next chapter of AI. In his closing remarks, Aleksander Madry, the Cadence Designs Systems Professor of Computing at MIT and faculty co-lead of the AI Policy Forum, emphasized the importance of collaboration and the need for different communities to communicate with each other in order to truly make an impact in the AI policy space.

    “The dream here is that we all can meet together — researchers, industry, policymakers, and other stakeholders — and really talk to each other, understand each other’s concerns, and think together about solutions,” Madry said. “This is the mission of the AI Policy Forum and this is what we want to enable.” More

  • in

    Robots play with play dough

    The inner child in many of us feels an overwhelming sense of joy when stumbling across a pile of the fluorescent, rubbery mixture of water, salt, and flour that put goo on the map: play dough. (Even if this happens rarely in adulthood.)

    While manipulating play dough is fun and easy for 2-year-olds, the shapeless sludge is hard for robots to handle. Machines have become increasingly reliable with rigid objects, but manipulating soft, deformable objects comes with a laundry list of technical challenges, and most importantly, as with most flexible structures, if you move one part, you’re likely affecting everything else. 

    Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University recently let robots take their hand at playing with the modeling compound, but not for nostalgia’s sake. Their new system learns directly from visual inputs to let a robot with a two-fingered gripper see, simulate, and shape doughy objects. “RoboCraft” could reliably plan a robot’s behavior to pinch and release play dough to make various letters, including ones it had never seen. With just 10 minutes of data, the two-finger gripper rivaled human counterparts that teleoperated the machine — performing on-par, and at times even better, on the tested tasks. 

    “Modeling and manipulating objects with high degrees of freedom are essential capabilities for robots to learn how to enable complex industrial and household interaction tasks, like stuffing dumplings, rolling sushi, and making pottery,” says Yunzhu Li, CSAIL PhD student and author on a new paper about RoboCraft. “While there’s been recent advances in manipulating clothes and ropes, we found that objects with high plasticity, like dough or plasticine — despite ubiquity in those household and industrial settings — was a largely underexplored territory. With RoboCraft, we learn the dynamics models directly from high-dimensional sensory data, which offers a promising data-driven avenue for us to perform effective planning.” 

    Play video

    With undefined, smooth material, the whole structure needs to be accounted for before you can do any type of efficient and effective modeling and planning. By turning the images into graphs of little particles, coupled with algorithms, RoboCraft, using a graph neural network as the dynamics model, makes more accurate predictions about the material’s change of shapes. 

    Typically, researchers have used complex physics simulators to model and understand force and dynamics being applied to objects, but RoboCraft simply uses visual data. The inner-workings of the system relies on three parts to shape soft material into, say, an “R.” 

    The first part — perception — is all about learning to “see.” It uses cameras to collect raw, visual sensor data from the environment, which are then turned into little clouds of particles to represent the shapes. A graph-based neural network then uses said particle data to learn to “simulate” the object’s dynamics, or how it moves. Then, algorithms help plan the robot’s behavior so it learns to “shape” a blob of dough, armed with the training data from the many pinches. While the letters are a bit loose, they’re indubitably representative. 

    Besides cutesy shapes, the team is (actually) working on making dumplings from dough and a prepared filling. Right now, with just a two finger gripper, it’s a big ask. RoboCraft would need additional tools (a baker needs multiple tools to cook; so do robots) — a rolling pin, a stamp, and a mold. 

    A more far in the future domain the scientists envision is using RoboCraft for assistance with household tasks and chores, which could be of particular help to the elderly or those with limited mobility. To accomplish this, given the many obstructions that could take place, a much more adaptive representation of the dough or item would be needed, and as well as exploration into what class of models might be suitable to capture the underlying structural systems. 

    “RoboCraft essentially demonstrates that this predictive model can be learned in very data-efficient ways to plan motion. In the long run, we are thinking about using various tools to manipulate materials,” says Li. “If you think about dumpling or dough making, just one gripper wouldn’t be able to solve it. Helping the model understand and accomplish longer-horizon planning tasks, such as, how the dough will deform given the current tool, movements and actions, is a next step for future work.” 

    Li wrote the paper alongside Haochen Shi, Stanford master’s student; Huazhe Xu, Stanford postdoc; Zhiao Huang, PhD student at the University of California at San Diego; and Jiajun Wu, assistant professor at Stanford. They will present the research at the Robotics: Science and Systems conference in New York City. The work is in part supported by the Stanford Institute for Human-Centered AI (HAI), the Samsung Global Research Outreach (GRO) Program, the Toyota Research Institute (TRI), and Amazon, Autodesk, Salesforce, and Bosch. More

  • in

    Hallucinating to better text translation

    As babies, we babble and imitate our way to learning languages. We don’t start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas.

    Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.

    This is the basis of a new machine learning model, called VALHALLA, by researchers from MIT, IBM, and the University of California at San Diego, in which a trained neural network sees a source sentence in one language, hallucinates an image of what it looks like, and then uses both to translate into a target language. The team found that their method demonstrates improved accuracy of machine translation over text-only translation. Further, it provided an additional boost for cases with long sentences, under-resourced languages, and instances where part of the source sentence is inaccessible to the machine translator.

    As a core task within the AI field of natural language processing (NLP), machine translation is an “eminently practical technology that’s being used by millions of people every day,” says study co-author Yoon Kim, assistant professor in MIT’s Department of Electrical Engineering and Computer Science with affiliations in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT-IBM Watson AI Lab. With recent, significant advances in deep learning, “there’s been an interesting development in how one might use non-text information — for example, images, audio, or other grounding information — to tackle practical tasks involving language” says Kim, because “when humans are performing language processing tasks, we’re doing so within a grounded, situated world.” The pairing of hallucinated images and text during inference, the team postulated, imitates that process, providing context for improved performance over current state-of-the-art techniques, which utilize text-only data.

    This research will be presented at the IEEE / CVF Computer Vision and Pattern Recognition Conference this month. Kim’s co-authors are UC San Diego graduate student Yi Li and Professor Nuno Vasconcelos, along with research staff members Rameswar Panda, Chun-fu “Richard” Chen, Rogerio Feris, and IBM Director David Cox of IBM Research and the MIT-IBM Watson AI Lab.

    Learning to hallucinate from images

    When we learn new languages and to translate, we’re often provided with examples and practice before venturing out on our own. The same is true for machine-translation systems; however, if images are used during training, these AI methods also require visual aids for testing, limiting their applicability, says Panda.

    “In real-world scenarios, you might not have an image with respect to the source sentence. So, our motivation was basically: Instead of using an external image during inference as input, can we use visual hallucination — the ability to imagine visual scenes — to improve machine translation systems?” says Panda.

    To do this, the team used an encoder-decoder architecture with two transformers, a type of neural network model that’s suited for sequence-dependent data, like language, that can pay attention key words and semantics of a sentence. One transformer generates a visual hallucination, and the other performs multimodal translation using outputs from the first transformer.

    During training, there are two streams of translation: a source sentence and a ground-truth image that is paired with it, and the same source sentence that is visually hallucinated to make a text-image pair. First the ground-truth image and sentence are tokenized into representations that can be handled by transformers; for the case of the sentence, each word is a token. The source sentence is tokenized again, but this time passed through the visual hallucination transformer, outputting a hallucination, a discrete image representation of the sentence. The researchers incorporated an autoregression that compares the ground-truth and hallucinated representations for congruency — e.g., homonyms: a reference to an animal “bat” isn’t hallucinated as a baseball bat. The hallucination transformer then uses the difference between them to optimize its predictions and visual output, making sure the context is consistent.

    The two sets of tokens are then simultaneously passed through the multimodal translation transformer, each containing the sentence representation and either the hallucinated or ground-truth image. The tokenized text translation outputs are compared with the goal of being similar to each other and to the target sentence in another language. Any differences are then relayed back to the translation transformer for further optimization.

    For testing, the ground-truth image stream drops off, since images likely wouldn’t be available in everyday scenarios.

    “To the best of our knowledge, we haven’t seen any work which actually uses a hallucination transformer jointly with a multimodal translation system to improve machine translation performance,” says Panda.

    Visualizing the target text

    To test their method, the team put VALHALLA up against other state-of-the-art multimodal and text-only translation methods. They used public benchmark datasets containing ground-truth images with source sentences, and a dataset for translating text-only news articles. The researchers measured its performance over 13 tasks, ranging from translation on well-resourced languages (like English, German, and French), under-resourced languages (like English to Romanian) and non-English (like Spanish to French). The group also tested varying transformer model sizes, how accuracy changes with the sentence length, and translation under limited textual context, where portions of the text were hidden from the machine translators.

    The team observed significant improvements over text-only translation methods, improving data efficiency, and that smaller models performed better than the larger base model. As sentences became longer, VALHALLA’s performance over other methods grew, which the researchers attributed to the addition of more ambiguous words. In cases where part of the sentence was masked, VALHALLA could recover and translate the original text, which the team found surprising.

    Further unexpected findings arose: “Where there weren’t as many training [image and] text pairs, [like for under-resourced languages], improvements were more significant, which indicates that grounding in images helps in low-data regimes,” says Kim. “Another thing that was quite surprising to me was this improved performance, even on types of text that aren’t necessarily easily connectable to images. For example, maybe it’s not so surprising if this helps in translating visually salient sentences, like the ‘there is a red car in front of the house.’ [However], even in text-only [news article] domains, the approach was able to improve upon text-only systems.”

    While VALHALLA performs well, the researchers note that it does have limitations, requiring pairs of sentences to be annotated with an image, which could make it more expensive to obtain. It also performs better in its ground domain and not the text-only news articles. Moreover, Kim and Panda note, a technique like VALHALLA is still a black box, with the assumption that hallucinated images are providing helpful information, and the team plans to investigate what and how the model is learning in order to validate their methods.

    In the future, the team plans to explore other means of improving translation. “Here, we only focus on images, but there are other types of a multimodal information — for example, speech, video or touch, or other sensory modalities,” says Panda. “We believe such multimodal grounding can lead to even more efficient machine translation models, potentially benefiting translation across many low-resource languages spoken in the world.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation. More

  • in

    In bias we trust?

    When the stakes are high, machine-learning models are sometimes used to aid human decision-makers. For instance, a model could predict which law school applicants are most likely to pass the bar exam to help an admissions officer determine which students should be accepted.

    These models often have millions of parameters, so how they make predictions is nearly impossible for researchers to fully understand, let alone an admissions officer with no machine-learning experience. Researchers sometimes employ explanation methods that mimic a larger model by creating simple approximations of its predictions. These approximations, which are far easier to understand, help users determine whether to trust the model’s predictions.

    But are these explanation methods fair? If an explanation method provides better approximations for men than for women, or for white people than for Black people, it may encourage users to trust the model’s predictions for some people but not for others.

    MIT researchers took a hard look at the fairness of some widely used explanation methods. They found that the approximation quality of these explanations can vary dramatically between subgroups and that the quality is often significantly lower for minoritized subgroups.

    In practice, this means that if the approximation quality is lower for female applicants, there is a mismatch between the explanations and the model’s predictions that could lead the admissions officer to wrongly reject more women than men.

    Once the MIT researchers saw how pervasive these fairness gaps are, they tried several techniques to level the playing field. They were able to shrink some gaps, but couldn’t eradicate them.

    “What this means in the real-world is that people might incorrectly trust predictions more for some subgroups than for others. So, improving explanation models is important, but communicating the details of these models to end users is equally important. These gaps exist, so users may want to adjust their expectations as to what they are getting when they use these explanations,” says lead author Aparna Balagopalan, a graduate student in the Healthy ML group of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Balagopalan wrote the paper with CSAIL graduate students Haoran Zhang and Kimia Hamidieh; CSAIL postdoc Thomas Hartvigsen; Frank Rudzicz, associate professor of computer science at the University of Toronto; and senior author Marzyeh Ghassemi, an assistant professor and head of the Healthy ML Group. The research will be presented at the ACM Conference on Fairness, Accountability, and Transparency.

    High fidelity

    Simplified explanation models can approximate predictions of a more complex machine-learning model in a way that humans can grasp. An effective explanation model maximizes a property known as fidelity, which measures how well it matches the larger model’s predictions.

    Rather than focusing on average fidelity for the overall explanation model, the MIT researchers studied fidelity for subgroups of people in the model’s dataset. In a dataset with men and women, the fidelity should be very similar for each group, and both groups should have fidelity close to that of the overall explanation model.

    “When you are just looking at the average fidelity across all instances, you might be missing out on artifacts that could exist in the explanation model,” Balagopalan says.

    They developed two metrics to measure fidelity gaps, or disparities in fidelity between subgroups. One is the difference between the average fidelity across the entire explanation model and the fidelity for the worst-performing subgroup. The second calculates the absolute difference in fidelity between all possible pairs of subgroups and then computes the average.

    With these metrics, they searched for fidelity gaps using two types of explanation models that were trained on four real-world datasets for high-stakes situations, such as predicting whether a patient dies in the ICU, whether a defendant reoffends, or whether a law school applicant will pass the bar exam. Each dataset contained protected attributes, like the sex and race of individual people. Protected attributes are features that may not be used for decisions, often due to laws or organizational policies. The definition for these can vary based on the task specific to each decision setting.

    The researchers found clear fidelity gaps for all datasets and explanation models. The fidelity for disadvantaged groups was often much lower, up to 21 percent in some instances. The law school dataset had a fidelity gap of 7 percent between race subgroups, meaning the approximations for some subgroups were wrong 7 percent more often on average. If there are 10,000 applicants from these subgroups in the dataset, for example, a significant portion could be wrongly rejected, Balagopalan explains.

    “I was surprised by how pervasive these fidelity gaps are in all the datasets we evaluated. It is hard to overemphasize how commonly explanations are used as a ‘fix’ for black-box machine-learning models. In this paper, we are showing that the explanation methods themselves are imperfect approximations that may be worse for some subgroups,” says Ghassemi.

    Narrowing the gaps

    After identifying fidelity gaps, the researchers tried some machine-learning approaches to fix them. They trained the explanation models to identify regions of a dataset that could be prone to low fidelity and then focus more on those samples. They also tried using balanced datasets with an equal number of samples from all subgroups.

    These robust training strategies did reduce some fidelity gaps, but they didn’t eliminate them.

    The researchers then modified the explanation models to explore why fidelity gaps occur in the first place. Their analysis revealed that an explanation model might indirectly use protected group information, like sex or race, that it could learn from the dataset, even if group labels are hidden.

    They want to explore this conundrum more in future work. They also plan to further study the implications of fidelity gaps in the context of real-world decision making.

    Balagopalan is excited to see that concurrent work on explanation fairness from an independent lab has arrived at similar conclusions, highlighting the importance of understanding this problem well.

    As she looks to the next phase in this research, she has some words of warning for machine-learning users.

    “Choose the explanation model carefully. But even more importantly, think carefully about the goals of using an explanation model and who it eventually affects,” she says.

    This work was funded, in part, by the MIT-IBM Watson AI Lab, the Quanta Research Institute, a Canadian Institute for Advanced Research AI Chair, and Microsoft Research. More

  • in

    Emery Brown wins a share of 2022 Gruber Neuroscience Prize

    The Gruber Foundation announced on May 17 that Emery N. Brown, the Edward Hood Taplin Professor of Medical Engineering and Computational Neuroscience at MIT, has won the 2022 Gruber Neuroscience Prize along with neurophysicists Laurence Abbott of Columbia University, Terrence Sejnowski of the Salk Institute for Biological Studies, and Haim Sompolinsky of the Hebrew University of Jerusalem.

    The foundation says it honored the four recipients for their influential contributions to the fields of computational and theoretical neuroscience. As datasets have grown ever larger and more complex, these fields have increasingly helped scientists unravel the mysteries of how the brain functions in both health and disease. The prize, which includes a total $500,000 award, will be presented in San Diego, California, on Nov. 13 at the annual meeting of the Society for Neuroscience.

    “These four remarkable scientists have applied their expertise in mathematical and statistical analysis, physics, and machine learning to create theories, mathematical models, and tools that have greatly advanced how we study and understand the brain,” says Joshua Sanes, professor of molecular and cellular biology and founding director of the Center for Brain Science at Harvard University and member of the selection advisory board to the prize. “Their insights and research have not only transformed how experimental neuroscientists do their research, but also are leading to promising new ways of providing clinical care.”

    Brown, who is an investigator in The Picower Institute for Learning and Memory and the Institute for Medical Engineering and Science at MIT, an anesthesiologist at Massachusetts General Hospital, and a professor at Harvard Medical School, says: “It is a pleasant surprise and tremendous honor to be named a co-recipient of the 2022 Gruber Prize in Neuroscience. I am especially honored to share this award with three luminaries in computational and theoretical neuroscience.”

    Brown’s early groundbreaking findings in neuroscience included a novel algorithm that decodes the position of an animal by observing the activity of a small group of place cells in the animal’s brain, a discovery he made while working with fellow Picower Institute investigator Matt Wilson in the 1990s. The resulting state-space algorithm for point processes not only offered much better decoding with fewer neurons than previous approaches, but it also established a new framework for specifying dynamically the relationship between the spike trains (the timing sequence of firing neurons) in the brain and factors from the outside world.

    “One of the basic questions at the time was whether an animal holds a representation of where it is in its mind — in the hippocampus,” Brown says. “We were able to show that it did, and we could show that with only 30 neurons.”

    After introducing this state-space paradigm to neuroscience, Brown went on to refine the original idea and apply it to other dynamic situations — to simultaneously track neural activity and learning, for example, and to define with precision anesthesia-induced loss of consciousness, as well as its subsequent recovery. In the early 2000s, Brown put together a team to specifically study anesthesia’s effects on the brain.

    Through experimental research and mathematical modeling, Brown and his team showed that the altered arousal states produced by the main classes of anesthesia medications can be characterized by analyzing the oscillatory patterns observed in the EEG along with the locations of their molecular targets, and the anatomy and physiology of the neural circuits that connect those locations. He has established, including in recent papers with Picower Professor Earl K. Miller, that a principal way in which anesthetics produce unconsciousness is by producing oscillations that impair how different brain regions communicate with each other.

    The result of Brown’s research has been a new paradigm for brain monitoring during general anesthesia for surgery, one that allows an anesthesiologist to dose the patient based on EEG readouts (neural oscillations) of the patient’s anesthetic state rather than purely on vital sign responses. This pioneering approach promises to revolutionize how anesthesia medications are delivered to patients, and also shed light on other altered states of arousal such as sleep and coma.

    To advance that vision, Brown recently discussed how he is working to develop a new research center at MIT and MGH to further integrate anesthesiology with neuroscience research. The Brain Arousal State Control Innovation Center, he said, would not only advance anesthesiology care but also harness insights gained from anesthesiology research to improve other aspects of clinical neuroscience.

    “By demonstrating that physics and mathematics can make an enormous contribution to neuroscience, doctors Abbott, Brown, Sejnowski, and Sompolinsky have inspired an entire new generation of physicists and other quantitative scientists to follow in their footsteps,” says Frances Jensen, professor and chair of the Department of Neurology and co-director of the Penn Medicine Translational Neuroscience Center within the Perelman School of Medicine at the University of Pennsylvania, and chair of the Selection Advisory Board to the prize. “The ramifications for neuroscience have been broad and profound. It is a great pleasure to be honoring each of them with this prestigious award.”

    This report was adapted from materials provided by the Gruber Foundation. More

  • in

    Living better with algorithms

    Laboratory for Information and Decision Systems (LIDS) student Sarah Cen remembers the lecture that sent her down the track to an upstream question.

    At a talk on ethical artificial intelligence, the speaker brought up a variation on the famous trolley problem, which outlines a philosophical choice between two undesirable outcomes.

    The speaker’s scenario: Say a self-driving car is traveling down a narrow alley with an elderly woman walking on one side and a small child on the other, and no way to thread between both without a fatality. Who should the car hit?

    Then the speaker said: Let’s take a step back. Is this the question we should even be asking?

    That’s when things clicked for Cen. Instead of considering the point of impact, a self-driving car could have avoided choosing between two bad outcomes by making a decision earlier on — the speaker pointed out that, when entering the alley, the car could have determined that the space was narrow and slowed to a speed that would keep everyone safe.

    Recognizing that today’s AI safety approaches often resemble the trolley problem, focusing on downstream regulation such as liability after someone is left with no good choices, Cen wondered: What if we could design better upstream and downstream safeguards to such problems? This question has informed much of Cen’s work.

    “Engineering systems are not divorced from the social systems on which they intervene,” Cen says. Ignoring this fact risks creating tools that fail to be useful when deployed or, more worryingly, that are harmful.

    Cen arrived at LIDS in 2018 via a slightly roundabout route. She first got a taste for research during her undergraduate degree at Princeton University, where she majored in mechanical engineering. For her master’s degree, she changed course, working on radar solutions in mobile robotics (primarily for self-driving cars) at Oxford University. There, she developed an interest in AI algorithms, curious about when and why they misbehave. So, she came to MIT and LIDS for her doctoral research, working with Professor Devavrat Shah in the Department of Electrical Engineering and Computer Science, for a stronger theoretical grounding in information systems.

    Auditing social media algorithms

    Together with Shah and other collaborators, Cen has worked on a wide range of projects during her time at LIDS, many of which tie directly to her interest in the interactions between humans and computational systems. In one such project, Cen studies options for regulating social media. Her recent work provides a method for translating human-readable regulations into implementable audits.

    To get a sense of what this means, suppose that regulators require that any public health content — for example, on vaccines — not be vastly different for politically left- and right-leaning users. How should auditors check that a social media platform complies with this regulation? Can a platform be made to comply with the regulation without damaging its bottom line? And how does compliance affect the actual content that users do see?

    Designing an auditing procedure is difficult in large part because there are so many stakeholders when it comes to social media. Auditors have to inspect the algorithm without accessing sensitive user data. They also have to work around tricky trade secrets, which can prevent them from getting a close look at the very algorithm that they are auditing because these algorithms are legally protected. Other considerations come into play as well, such as balancing the removal of misinformation with the protection of free speech.

    To meet these challenges, Cen and Shah developed an auditing procedure that does not need more than black-box access to the social media algorithm (which respects trade secrets), does not remove content (which avoids issues of censorship), and does not require access to users (which preserves users’ privacy).

    In their design process, the team also analyzed the properties of their auditing procedure, finding that it ensures a desirable property they call decision robustness. As good news for the platform, they show that a platform can pass the audit without sacrificing profits. Interestingly, they also found the audit naturally incentivizes the platform to show users diverse content, which is known to help reduce the spread of misinformation, counteract echo chambers, and more.

    Who gets good outcomes and who gets bad ones?

    In another line of research, Cen looks at whether people can receive good long-term outcomes when they not only compete for resources, but also don’t know upfront what resources are best for them.

    Some platforms, such as job-search platforms or ride-sharing apps, are part of what is called a matching market, which uses an algorithm to match one set of individuals (such as workers or riders) with another (such as employers or drivers). In many cases, individuals have matching preferences that they learn through trial and error. In labor markets, for example, workers learn their preferences about what kinds of jobs they want, and employers learn their preferences about the qualifications they seek from workers.

    But learning can be disrupted by competition. If workers with a particular background are repeatedly denied jobs in tech because of high competition for tech jobs, for instance, they may never get the knowledge they need to make an informed decision about whether they want to work in tech. Similarly, tech employers may never see and learn what these workers could do if they were hired.

    Cen’s work examines this interaction between learning and competition, studying whether it is possible for individuals on both sides of the matching market to walk away happy.

    Modeling such matching markets, Cen and Shah found that it is indeed possible to get to a stable outcome (workers aren’t incentivized to leave the matching market), with low regret (workers are happy with their long-term outcomes), fairness (happiness is evenly distributed), and high social welfare.

    Interestingly, it’s not obvious that it’s possible to get stability, low regret, fairness, and high social welfare simultaneously.  So another important aspect of the research was uncovering when it is possible to achieve all four criteria at once and exploring the implications of those conditions.

    What is the effect of X on Y?

    For the next few years, though, Cen plans to work on a new project, studying how to quantify the effect of an action X on an outcome Y when it’s expensive — or impossible — to measure this effect, focusing in particular on systems that have complex social behaviors.

    For instance, when Covid-19 cases surged in the pandemic, many cities had to decide what restrictions to adopt, such as mask mandates, business closures, or stay-home orders. They had to act fast and balance public health with community and business needs, public spending, and a host of other considerations.

    Typically, in order to estimate the effect of restrictions on the rate of infection, one might compare the rates of infection in areas that underwent different interventions. If one county has a mask mandate while its neighboring county does not, one might think comparing the counties’ infection rates would reveal the effectiveness of mask mandates. 

    But of course, no county exists in a vacuum. If, for instance, people from both counties gather to watch a football game in the maskless county every week, people from both counties mix. These complex interactions matter, and Sarah plans to study questions of cause and effect in such settings.

    “We’re interested in how decisions or interventions affect an outcome of interest, such as how criminal justice reform affects incarceration rates or how an ad campaign might change the public’s behaviors,” Cen says.

    Cen has also applied the principles of promoting inclusivity to her work in the MIT community.

    As one of three co-presidents of the Graduate Women in MIT EECS student group, she helped organize the inaugural GW6 research summit featuring the research of women graduate students — not only to showcase positive role models to students, but also to highlight the many successful graduate women at MIT who are not to be underestimated.

    Whether in computing or in the community, a system taking steps to address bias is one that enjoys legitimacy and trust, Cen says. “Accountability, legitimacy, trust — these principles play crucial roles in society and, ultimately, will determine which systems endure with time.”  More

  • in

    On the road to cleaner, greener, and faster driving

    No one likes sitting at a red light. But signalized intersections aren’t just a minor nuisance for drivers; vehicles consume fuel and emit greenhouse gases while waiting for the light to change.

    What if motorists could time their trips so they arrive at the intersection when the light is green? While that might be just a lucky break for a human driver, it could be achieved more consistently by an autonomous vehicle that uses artificial intelligence to control its speed.

    In a new study, MIT researchers demonstrate a machine-learning approach that can learn to control a fleet of autonomous vehicles as they approach and travel through a signalized intersection in a way that keeps traffic flowing smoothly.

    Using simulations, they found that their approach reduces fuel consumption and emissions while improving average vehicle speed. The technique gets the best results if all cars on the road are autonomous, but even if only 25 percent use their control algorithm, it still leads to substantial fuel and emissions benefits.

    “This is a really interesting place to intervene. No one’s life is better because they were stuck at an intersection. With a lot of other climate change interventions, there is a quality-of-life difference that is expected, so there is a barrier to entry there. Here, the barrier is much lower,” says senior author Cathy Wu, the Gilbert W. Winslow Career Development Assistant Professor in the Department of Civil and Environmental Engineering and a member of the Institute for Data, Systems, and Society (IDSS) and the Laboratory for Information and Decision Systems (LIDS).

    The lead author of the study is Vindula Jayawardana, a graduate student in LIDS and the Department of Electrical Engineering and Computer Science. The research will be presented at the European Control Conference.

    Intersection intricacies

    While humans may drive past a green light without giving it much thought, intersections can present billions of different scenarios depending on the number of lanes, how the signals operate, the number of vehicles and their speeds, the presence of pedestrians and cyclists, etc.

    Typical approaches for tackling intersection control problems use mathematical models to solve one simple, ideal intersection. That looks good on paper, but likely won’t hold up in the real world, where traffic patterns are often about as messy as they come.

    Wu and Jayawardana shifted gears and approached the problem using a model-free technique known as deep reinforcement learning. Reinforcement learning is a trial-and-error method where the control algorithm learns to make a sequence of decisions. It is rewarded when it finds a good sequence. With deep reinforcement learning, the algorithm leverages assumptions learned by a neural network to find shortcuts to good sequences, even if there are billions of possibilities.

    This is useful for solving a long-horizon problem like this; the control algorithm must issue upwards of 500 acceleration instructions to a vehicle over an extended time period, Wu explains.

    “And we have to get the sequence right before we know that we have done a good job of mitigating emissions and getting to the intersection at a good speed,” she adds.

    But there’s an additional wrinkle. The researchers want the system to learn a strategy that reduces fuel consumption and limits the impact on travel time. These goals can be conflicting.

    “To reduce travel time, we want the car to go fast, but to reduce emissions, we want the car to slow down or not move at all. Those competing rewards can be very confusing to the learning agent,” Wu says.

    While it is challenging to solve this problem in its full generality, the researchers employed a workaround using a technique known as reward shaping. With reward shaping, they give the system some domain knowledge it is unable to learn on its own. In this case, they penalized the system whenever the vehicle came to a complete stop, so it would learn to avoid that action.

    Traffic tests

    Once they developed an effective control algorithm, they evaluated it using a traffic simulation platform with a single intersection. The control algorithm is applied to a fleet of connected autonomous vehicles, which can communicate with upcoming traffic lights to receive signal phase and timing information and observe their immediate surroundings. The control algorithm tells each vehicle how to accelerate and decelerate.

    Their system didn’t create any stop-and-go traffic as vehicles approached the intersection. (Stop-and-go traffic occurs when cars are forced to come to a complete stop due to stopped traffic ahead). In simulations, more cars made it through in a single green phase, which outperformed a model that simulates human drivers. When compared to other optimization methods also designed to avoid stop-and-go traffic, their technique resulted in larger fuel consumption and emissions reductions. If every vehicle on the road is autonomous, their control system can reduce fuel consumption by 18 percent and carbon dioxide emissions by 25 percent, while boosting travel speeds by 20 percent.

    “A single intervention having 20 to 25 percent reduction in fuel or emissions is really incredible. But what I find interesting, and was really hoping to see, is this non-linear scaling. If we only control 25 percent of vehicles, that gives us 50 percent of the benefits in terms of fuel and emissions reduction. That means we don’t have to wait until we get to 100 percent autonomous vehicles to get benefits from this approach,” she says.

    Down the road, the researchers want to study interaction effects between multiple intersections. They also plan to explore how different intersection set-ups (number of lanes, signals, timings, etc.) can influence travel time, emissions, and fuel consumption. In addition, they intend to study how their control system could impact safety when autonomous vehicles and human drivers share the road. For instance, even though autonomous vehicles may drive differently than human drivers, slower roadways and roadways with more consistent speeds could improve safety, Wu says.

    While this work is still in its early stages, Wu sees this approach as one that could be more feasibly implemented in the near-term.

    “The aim in this work is to move the needle in sustainable mobility. We want to dream, as well, but these systems are big monsters of inertia. Identifying points of intervention that are small changes to the system but have significant impact is something that gets me up in the morning,” she says.  

    This work was supported, in part, by the MIT-IBM Watson AI Lab. More

  • in

    Artificial intelligence system learns concepts shared across video, audio, and text

    Humans observe the world through a combination of different modalities, like vision, hearing, and our understanding of language. Machines, on the other hand, interpret the world through data that algorithms can process.

    So, when a machine “sees” a photo, it must encode that photo into data it can use to perform a task like image classification. This process becomes more complicated when inputs come in multiple formats, like videos, audio clips, and images.

    “The main challenge here is, how can a machine align those different modalities? As humans, this is easy for us. We see a car and then hear the sound of a car driving by, and we know these are the same thing. But for machine learning, it is not that straightforward,” says Alexander Liu, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper tackling this problem. 

    Liu and his collaborators developed an artificial intelligence technique that learns to represent data in a way that captures concepts which are shared between visual and audio modalities. For instance, their method can learn that the action of a baby crying in a video is related to the spoken word “crying” in an audio clip.

    Using this knowledge, their machine-learning model can identify where a certain action is taking place in a video and label it.

    It performs better than other machine-learning methods at cross-modal retrieval tasks, which involve finding a piece of data, like a video, that matches a user’s query given in another form, like spoken language. Their model also makes it easier for users to see why the machine thinks the video it retrieved matches their query.

    This technique could someday be utilized to help robots learn about concepts in the world through perception, more like the way humans do.

    Joining Liu on the paper are CSAIL postdoc SouYoung Jin; grad students Cheng-I Jeff Lai and Andrew Rouditchenko; Aude Oliva, senior research scientist in CSAIL and MIT director of the MIT-IBM Watson AI Lab; and senior author James Glass, senior research scientist and head of the Spoken Language Systems Group in CSAIL. The research will be presented at the Annual Meeting of the Association for Computational Linguistics.

    Learning representations

    The researchers focus their work on representation learning, which is a form of machine learning that seeks to transform input data to make it easier to perform a task like classification or prediction.

    The representation learning model takes raw data, such as videos and their corresponding text captions, and encodes them by extracting features, or observations about objects and actions in the video. Then it maps those data points in a grid, known as an embedding space. The model clusters similar data together as single points in the grid. Each of these data points, or vectors, is represented by an individual word.

    For instance, a video clip of a person juggling might be mapped to a vector labeled “juggling.”

    The researchers constrain the model so it can only use 1,000 words to label vectors. The model can decide which actions or concepts it wants to encode into a single vector, but it can only use 1,000 vectors. The model chooses the words it thinks best represent the data.

    Rather than encoding data from different modalities onto separate grids, their method employs a shared embedding space where two modalities can be encoded together. This enables the model to learn the relationship between representations from two modalities, like video that shows a person juggling and an audio recording of someone saying “juggling.”

    To help the system process data from multiple modalities, they designed an algorithm that guides the machine to encode similar concepts into the same vector.

    “If there is a video about pigs, the model might assign the word ‘pig’ to one of the 1,000 vectors. Then if the model hears someone saying the word ‘pig’ in an audio clip, it should still use the same vector to encode that,” Liu explains.

    A better retriever

    They tested the model on cross-modal retrieval tasks using three datasets: a video-text dataset with video clips and text captions, a video-audio dataset with video clips and spoken audio captions, and an image-audio dataset with images and spoken audio captions.

    For example, in the video-audio dataset, the model chose 1,000 words to represent the actions in the videos. Then, when the researchers fed it audio queries, the model tried to find the clip that best matched those spoken words.

    “Just like a Google search, you type in some text and the machine tries to tell you the most relevant things you are searching for. Only we do this in the vector space,” Liu says.

    Not only was their technique more likely to find better matches than the models they compared it to, it is also easier to understand.

    Because the model could only use 1,000 total words to label vectors, a user can more see easily which words the machine used to conclude that the video and spoken words are similar. This could make the model easier to apply in real-world situations where it is vital that users understand how it makes decisions, Liu says.

    The model still has some limitations they hope to address in future work. For one, their research focused on data from two modalities at a time, but in the real world humans encounter many data modalities simultaneously, Liu says.

    “And we know 1,000 words works on this kind of dataset, but we don’t know if it can be generalized to a real-world problem,” he adds.

    Plus, the images and videos in their datasets contained simple objects or straightforward actions; real-world data are much messier. They also want to determine how well their method scales up when there is a wider diversity of inputs.

    This research was supported, in part, by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, and by the MIT Lincoln Laboratory. More