More stories

  • in

    Statistics, operations research, and better algorithms

    In this day and age, many companies and institutions are not just data-driven, but data-intensive. Insurers, health providers, government agencies, and social media platforms are all heavily dependent on data-rich models and algorithms to identify the characteristics of the people who use them, and to nudge their behavior in various ways.

    That doesn’t mean organizations are always using optimal models, however. Determining efficient algorithms is a research area of its own — and one where Rahul Mazumder happens to be a leading expert.

    Mazumder, an associate professor in the MIT Sloan School of Management and an affiliate of the Operations Research Center, works both to expand the techniques of model-building and to refine models that apply to particular problems. His work pertains to a wealth of areas, including statistics and operations research, with applications in finance, health care, advertising, online recommendations, and more.

    “There is engineering involved, there is science involved, there is implementation involved, there is theory involved, it’s at the junction of various disciplines,” says Mazumder, who is also affiliated with the Center for Statistics and Data Science and the MIT-IBM Watson AI Lab.

    There is also a considerable amount of practical-minded judgment, logic, and common-sense decision-making at play, in order to bring the right techniques to bear on any individual task.

    “Statistics is about having data coming from a physical system, or computers, or humans, and you want to make sense of the data,” Mazumder says. “And you make sense of it by building models because that gives some pattern to a dataset. But of course, there is a lot of subjectivity in that. So, there is subjectivity in statistics, but also mathematical rigor.”

    Over roughly the last decade, Mazumder, often working with co-authors, has published about 40 peer-reviewed papers, won multiple academic awards, collaborated with major companies on their work, and helped advise graduate students. For his research and teaching, Mazumder was granted tenure by MIT last year.

    From deep roots to new tools

    Mazumder grew up in Kolkata, India, where his father was a professor at the Indian Statistical Institute and his mother was a schoolteacher. Mazumder received his undergraduate and master’s degrees from the Indian Statistical Institute as well, although without really focusing on the same areas as his father, whose work was in fluid mechanics.

    For his doctoral work, Mazumder attended Stanford University, where he earned his PhD in 2012. After a year as a postdoc at MIT’s Operations Research Center, he joined the faculty at Columbia University, then moved to MIT in 2015.

    While Mazumder’s work has many facets, his research portfolio does have notable central achievements. Mazumder has helped combine ideas from two branches of optimization to facilitate addressing computational problems in statistics. One of these branches, discrete optimization, uses discrete variables — integers — to find the best candidate among a finite set of options. This can relate to operational efficiency: What is the shortest route someone might take while making a designated set of stops? Convex optimization, on the other hand, encompasses an array of algorithms that can obtain the best solution for what Mazumder calls “nicely behaved” mathematical functions. They are typically applied to optimize continuous decisions in financial portfolio allocation and health care outcomes, among other things.
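
    The contrast between the two branches can be made concrete with a toy sketch. It is only an illustration under stated assumptions, not a method from Mazumder's papers: the function names `shortest_route` and `minimize_convex`, the sample stops, and the quadratic objective are all hypothetical. The discrete half tries every visiting order of a small set of stops (an open route that does not return to the start), while the convex half follows the gradient of a "nicely behaved" function down to its minimum.

```python
from itertools import permutations
from math import dist


def shortest_route(start, stops):
    """Discrete optimization by brute force: try every visiting order."""
    best_order, best_length = None, float("inf")
    for order in permutations(stops):
        path = [start, *order]
        # Total length of the open path start -> stop -> ... -> last stop.
        length = sum(dist(a, b) for a, b in zip(path, path[1:]))
        if length < best_length:
            best_order, best_length = order, length
    return list(best_order), best_length


def minimize_convex(gradient, x0, step=0.1, iterations=200):
    """Convex optimization by plain gradient descent on a smooth function."""
    x = x0
    for _ in range(iterations):
        x -= step * gradient(x)
    return x


# Hypothetical data: three stops on a line, visited from the origin.
route, length = shortest_route((0, 0), [(5, 0), (1, 0), (3, 0)])

# Hypothetical objective f(x) = (x - 3)^2, whose gradient is 2 * (x - 3).
x_star = minimize_convex(lambda x: 2 * (x - 3), x0=0.0)
```

    The brute-force search grows factorially with the number of stops, which is one reason discrete problems are typically harder to scale than convex ones.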

    In some recent papers, such as “Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms,” co-authored with Hussein Hazimeh and published in Operations Research in 2020, and in “Sparse regression at scale: branch-and-bound rooted in first-order optimization,” co-authored with Hazimeh and A. Saab and published in Mathematical Programming in 2022, Mazumder has found ways to combine ideas from the two branches.

    “The tools and techniques we are using are new for the class of statistical problems because we are combining different developments in convex optimization and exploring that within discrete optimization,” Mazumder says.

    As new as these tools are, however, Mazumder likes working on techniques that “have old roots,” as he puts it. The two types of optimization methods were considered less separate in the 1950s or 1960s, he says, then grew apart.

    “I like to go back and see how things developed,” Mazumder says. “If I look back in history at [older] papers, it’s actually very fascinating. One thing was developed, another was developed, another was developed kind of independently, and after a while you see connections across them. If I go back, I see some parallels. And that actually helps in my thought process.”

    Predictions and parsimony

    Mazumder’s work is often aimed at simplifying the model or algorithm being applied to a problem. In some instances, bigger models would require enormous amounts of processing power, while simpler methods can provide equally good results using fewer resources. In other cases, including work with the finance and tech firms Mazumder has sometimes collaborated with, simpler models may work better by having fewer moving parts.

    “There is a notion of parsimony involved,” Mazumder says. Genomic studies aim to find particularly influential genes; similarly, tech giants may benefit from simpler models of consumer behavior, not more complex ones, when they are recommending a movie to you.

    Very often, Mazumder says, modeling “is a very large-scale prediction problem. But we don’t think all the features or attributes are going to be important. A small collection is going to be important. Why? Because if you think about movies, there are not really 20,000 different movies; there are genres of movies. If you look at individual users, there are hundreds of millions of users, but really they are grouped together into cliques. Can you capture the parsimony in a model?”
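
    One way to picture this kind of parsimony is a regression that pays a fixed price for every feature it keeps. The sketch below is a textbook-style illustration, assuming a least-squares loss plus a per-feature penalty; it is not the algorithm from the papers mentioned above, and the function name `l0_coordinate_descent`, the penalty value, and the tiny dataset are hypothetical. It updates one coefficient at a time and keeps a coefficient only when it reduces the squared error by more than the penalty.

```python
def l0_coordinate_descent(X, y, penalty, sweeps=50):
    """Cyclic coordinate descent for 0.5 * ||y - X b||^2 + penalty * (#nonzero b)."""
    p = len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, because b starts at zero
    for _ in range(sweeps):
        changed = False
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(v * v for v in column)
            if scale == 0.0:
                continue
            # Best unpenalized value for coordinate j with the others held fixed.
            candidate = beta[j] + sum(c * r for c, r in zip(column, residual)) / scale
            # Keep it only if the error reduction exceeds the per-feature penalty.
            new_value = candidate if 0.5 * scale * candidate ** 2 > penalty else 0.0
            if new_value != beta[j]:
                delta = new_value - beta[j]
                residual = [r - c * delta for c, r in zip(column, residual)]
                beta[j] = new_value
                changed = True
        if not changed:
            break
    return beta


# Hypothetical data: four features, of which only the first and third matter much.
X = [
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
y = [1.1, 0.9, 4.9, 5.1]  # built as 3*x0 - 2*x2 + 0.1*x3

beta = l0_coordinate_descent(X, y, penalty=0.5)
```

    On this toy data the two strong coefficients survive and the weak ones are set exactly to zero, so the fitted model uses two features instead of four.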

    One part of his career that does not lend itself to parsimony, Mazumder feels, is crediting others. In conversation he emphasizes how grateful he is to his mentors in academia, and how much of his work is developed in concert with collaborators and, in particular, his students at MIT. 

    “I really, really like working with my students,” Mazumder says. “I perceive my students as my colleagues. Some of these problems, I thought they could not be solved, but then we just made it work. Of course, no method is perfect. But the fact we can use ideas from different areas in optimization with very deep roots, to address problems of core statistics and machine learning interest, is very exciting.”

    Teaching and doing research at MIT, Mazumder says, allows him to push forward on difficult problems — while also being pushed along by the interest and work of others around him.

    “MIT is a very vibrant community,” Mazumder says. “The thing I find really fascinating is, people here are very driven. They want to make a change in whatever area they are working in. And I also feel motivated to do this.”

  • in

    Learning the language of molecules to predict their properties

    Discovering new materials and drugs typically involves a manual, trial-and-error process that can take decades and cost millions of dollars. To streamline this process, scientists often use machine learning to predict molecular properties and narrow down the molecules they need to synthesize and test in the lab.

    Researchers from MIT and the MIT-IBM Watson AI Lab have developed a new, unified framework that can simultaneously predict molecular properties and generate new molecules much more efficiently than popular deep-learning approaches.

    To teach a machine-learning model to predict a molecule’s biological or mechanical properties, researchers must show it millions of labeled molecular structures — a process known as training. Due to the expense of discovering molecules and the challenges of hand-labeling millions of structures, large training datasets are often hard to come by, which limits the effectiveness of machine-learning approaches.

    By contrast, the system created by the MIT researchers can effectively predict molecular properties using only a small amount of data. Their system has an underlying understanding of the rules that dictate how building blocks combine to produce valid molecules. These rules capture the similarities between molecular structures, which helps the system generate new molecules and predict their properties in a data-efficient manner.

    This method outperformed other machine-learning approaches on both small and large datasets, and was able to accurately predict molecular properties and generate viable molecules when given a dataset with fewer than 100 samples.

    “Our goal with this project is to use some data-driven methods to speed up the discovery of new molecules, so you can train a model to do the prediction without all of these cost-heavy experiments,” says lead author Minghao Guo, a computer science and electrical engineering (EECS) graduate student.

    Guo’s co-authors include MIT-IBM Watson AI Lab research staff members Veronika Thost, Payel Das, and Jie Chen; recent MIT graduates Samuel Song ’23 and Adithya Balachandran ’23; and senior author Wojciech Matusik, a professor of electrical engineering and computer science and a member of the MIT-IBM Watson AI Lab, who leads the Computational Design and Fabrication Group within the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). The research will be presented at the International Conference on Machine Learning.

    Learning the language of molecules

    To achieve the best results with machine-learning models, scientists need training datasets with millions of molecules that have similar properties to those they hope to discover. In reality, these domain-specific datasets are usually very small. So, researchers use models that have been pretrained on large datasets of general molecules, which they apply to a much smaller, targeted dataset. However, because these models haven’t acquired much domain-specific knowledge, they tend to perform poorly.

    The MIT team took a different approach. They created a machine-learning system that automatically learns the “language” of molecules — what is known as a molecular grammar — using only a small, domain-specific dataset. It uses this grammar to construct viable molecules and predict their properties.

    In language theory, one generates words, sentences, or paragraphs based on a set of grammar rules. You can think of a molecular grammar the same way. It is a set of production rules that dictate how to generate molecules or polymers by combining atoms and substructures.

    Just like a language grammar, which can generate a plethora of sentences using the same rules, one molecular grammar can represent a vast number of molecules. Molecules with similar structures use the same grammar production rules, and the system learns to understand these similarities.
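
    The idea of production rules can be sketched with a toy example. It is a simplified illustration, not the grammar learned by the researchers' system: the rule names `Molecule` and `Chain`, the `generate` function, and the use of SMILES-style strings are assumptions made here for readability. Four rules are enough to derive a family of small alcohols and amines, and structurally similar molecules reuse the same rules.

```python
from collections import deque

# Hypothetical production rules. Upper-case single letters are atoms (terminals);
# "Molecule" and "Chain" are nonterminals that still need to be expanded.
RULES = {
    "Molecule": [["Chain", "O"], ["Chain", "N"]],
    "Chain": [["C"], ["C", "Chain"]],
}


def generate(start="Molecule", max_atoms=3):
    """Enumerate every string the toy grammar derives with at most max_atoms atoms."""
    results = []
    queue = deque([[start]])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal still awaiting expansion.
        index = next((i for i, s in enumerate(form) if s in RULES), None)
        if index is None:
            results.append("".join(form))
            continue
        for replacement in RULES[form[index]]:
            expanded = form[:index] + replacement + form[index + 1:]
            # Every symbol yields at least one atom, so longer forms can be pruned.
            if len(expanded) <= max_atoms:
                queue.append(expanded)
    return sorted(results)


molecules = generate(max_atoms=3)
# "CO" and "CCO" are SMILES strings for methanol and ethanol;
# "CN" and "CCN" are methylamine and ethylamine.
```

    Raising `max_atoms` yields longer chains from the same four rules, which mirrors how one compact grammar can stand for a large set of related molecules.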

    Since structurally similar molecules often have similar properties, the system uses its underlying knowledge of molecular similarity to predict properties of new molecules more efficiently. 

    “Once we have this grammar as a representation for all the different molecules, we can use it to boost the process of property prediction,” Guo says.

    The system learns the production rules for a molecular grammar using reinforcement learning — a trial-and-error process where the model is rewarded for behavior that gets it closer to achieving a goal.

    But because there could be billions of ways to combine atoms and substructures, the process to learn grammar production rules would be too computationally expensive for anything but the tiniest dataset.

    The researchers decoupled the molecular grammar into two parts. The first part, called a metagrammar, is a general, widely applicable grammar they design manually and give the system at the outset. Then it only needs to learn a much smaller, molecule-specific grammar from the domain dataset. This hierarchical approach speeds up the learning process.

    Big results, small datasets

    In experiments, the researchers’ new system simultaneously generated viable molecules and polymers, and predicted their properties more accurately than several popular machine-learning approaches, even when the domain-specific datasets had only a few hundred samples. Some other methods also required a costly pretraining step that the new system avoids.

    The technique was especially effective at predicting physical properties of polymers, such as the glass transition temperature, which is the temperature required for a material to transition from solid to liquid. Obtaining this information manually is often extremely costly because the experiments require extremely high temperatures and pressures.

    To push their approach further, the researchers cut one training set down by more than half — to just 94 samples. Their model still achieved results that were on par with methods trained using the entire dataset.

    “This grammar-based representation is very powerful. And because the grammar itself is a very general representation, it can be deployed to different kinds of graph-form data. We are trying to identify other applications beyond chemistry or material science,” Guo says.

    In the future, they also want to extend their current molecular grammar to include the 3D geometry of molecules and polymers, which is key to understanding the interactions between polymer chains. They are also developing an interface that would show a user the learned grammar production rules and solicit feedback to correct rules that may be wrong, boosting the accuracy of the system.

    This work is funded, in part, by the MIT-IBM Watson AI Lab and its member company, Evonik.

  • in

    Day of AI curriculum meets the moment

    MIT Responsible AI for Social Empowerment and Education (RAISE) recently celebrated the second annual Day of AI with two flagship local events. The Edward M. Kennedy Institute for the U.S. Senate in Boston hosted a human rights and data policy-focused event that was streamed worldwide. Dearborn STEM Academy in Roxbury, Massachusetts, hosted a student workshop in collaboration with Amazon Future Engineer. With over 8,000 registrations across all 50 U.S. states and 108 countries in 2023, participation in Day of AI has more than doubled since its inaugural year.

    Day of AI is a free curriculum of lessons and hands-on activities, created by researchers at MIT RAISE, that teaches kids of all ages and backgrounds the basics and responsible use of artificial intelligence. This year, resources were available for educators to run at any time and in any increments they chose. The curriculum included five new modules to address timely topics like ChatGPT in School, Teachable Machines, AI and Social Media, Data Science and Me, and more. A collaboration with the International Society for Technology in Education also introduced modules for early elementary students. Educators across the world shared photos, videos, and stories of their students’ engagement, expressing excitement and even relief over the accessible lessons.

    Professor Cynthia Breazeal, director of RAISE, dean for digital learning at MIT, and head of the MIT Media Lab’s Personal Robots research group, said, “It’s been a year of extraordinary advancements in AI, and with that comes necessary conversations and concerns about who and what this technology is for. With our Day of AI events, we want to celebrate the teachers and students who are putting in the work to make sure that AI is for everyone.”

    Reflecting community values and protecting digital citizens

    On May 18, 2023, MIT RAISE hosted a global Day of AI celebration featuring a flagship local event focused on human rights and data policy at the Edward M. Kennedy Institute for the U.S. Senate. Students from the Warren Prescott Middle School and New Mission High School heard from speakers from the City of Boston, Liberty Mutual, and MIT, who discussed the many benefits and challenges of artificial intelligence education. Video: MIT Open Learning

    MIT President Sally Kornbluth welcomed students from Warren Prescott Middle School and New Mission High School to the Day of AI program at the Edward M. Kennedy Institute. Kornbluth reflected on the exciting potential of AI, along with the ethical considerations society needs to be responsible for.

    “AI has the potential to do all kinds of fantastic things, including driving a car, helping us with the climate crisis, improving health care, and designing apps that we can’t even imagine yet. But what we have to make sure it doesn’t do is cause harm to individuals, to communities, to us — society as a whole,” she said.

    This theme resonated with each of the event speakers, whose jobs spanned the sectors of education, government, and business. Yo Deshpande, technologist for the public realm, and Michael Lawrence Evans, program director of new urban mechanics from the Boston Mayor’s Office, shared how Boston thinks about using AI to improve city life in ways that are “equitable, accessible, and delightful.” Deshpande said, “We have the opportunity to explore not only how AI works, but how using AI can line up with our values, the way we want to be in the world, and the way we want to be in our community.”

    Adam L’Italien, chief innovation officer at Liberty Mutual Insurance (one of Day of AI’s founding sponsors), compared our present moment with AI technologies to the early days of personal computers and internet connection. “Exposure to emerging technologies can accelerate progress in the world and in your own lives,” L’Italien said, while recognizing that the AI development process needs to be inclusive and mitigate biases.

    Human policies for artificial intelligence

    So how does society address these human rights concerns about AI? Marc Aidinoff ’21, former White House Office of Science and Technology Policy chief of staff, led a discussion on how government policy can influence the parameters of how technology is developed and used, like the Blueprint for an AI Bill of Rights. Aidinoff said, “The work of building the world you want to see is far harder than building the technical AI system … How do you work with other people and create a collective vision for what we want to do?” Warren Prescott Middle School students described how AI could be used to solve problems that humans couldn’t. But they also shared their concerns that AI could undermine data privacy and contribute to learning deficits, social media addiction, job displacement, and propaganda.

    In a mock U.S. Senate trial activity designed by Daniella DiPaola, PhD student at the MIT Media Lab, the middle schoolers investigated what rights might be undermined by AI in schools, hospitals, law enforcement, and corporations. Meanwhile, New Mission High School students workshopped the ideas behind bill S.2314, the Social Media Addiction Reduction Technology (SMART) Act, in an activity designed by Raechel Walker, graduate research assistant in the Personal Robots Group, and Matt Taylor, research assistant at the Media Lab. They discussed what level of control could or should be introduced at the parental, educational, and governmental levels to reduce the risks of internet addiction.

    “Alexa, how do I program AI?”

    The 2023 Day of AI celebration featured a flagship local event at the Dearborn STEM Academy in Roxbury in collaboration with Amazon Future Engineer. Students participated in a hands-on activity using MIT App Inventor as part of Day of AI’s Alexa lesson. Video: MIT Open Learning

    At Dearborn STEM Academy, Amazon Future Engineer helped students work through the Intro to Voice AI curriculum module in real time. Students used MIT App Inventor to code basic commands for Alexa. In an interview with WCVB, Principal Darlene Marcano said, “It’s important that we expose our students to as many different experiences as possible. The students that are participating are on track to be future computer scientists and engineers.”

    Breazeal told Dearborn students, “We want you to have an informed voice about how you want AI to be used in society. We want you to feel empowered that you can shape the world. You can make things with AI to help make a better world and a better community.”

    Rohit Prasad ’08, senior vice president and head scientist for Alexa at Amazon, and Victor Reinoso ’97, global director of philanthropic education initiatives at Amazon, also joined the event. “Amazon and MIT share a commitment to helping students discover a world of possibilities through STEM and AI education,” said Reinoso. “There’s a lot of current excitement around the technological revolution with generative AI and large language models, so we’re excited to help students explore careers of the future and navigate the pathways available to them.” To highlight their continued investment in the local community and the school program, Amazon donated a $25,000 Innovation and Early College Pathways Program Grant to the Boston Public School system.

    Day of AI down under

    The Day of AI program was not only widely adopted across the globe; Australian educators were also inspired to develop their own regionally specific curriculum. An estimated 161,000 AI professionals will be needed in Australia by 2030, according to the National Artificial Intelligence Center in the Commonwealth Scientific and Industrial Research Organization (CSIRO), an Australian government agency and Day of AI Australia project partner. CSIRO worked with the University of New South Wales to develop supplementary educational resources on AI ethics and machine learning. Day of AI Australia reached 85,000 students at 400-plus secondary schools this year, sparking curiosity in the next generation of AI experts.

    The interest in AI is accelerating as fast as the technology is being developed. Day of AI offers a unique opportunity for K-12 students to shape our world’s digital future and their own.

    “I hope that some of you will decide to be part of this bigger effort to help us figure out the best possible answers to questions that are raised by AI,” Kornbluth told students at the Edward M. Kennedy Institute. “We’re counting on you, the next generation, to learn how AI works and help make sure it’s for everyone.”

  • in

    Bringing the social and ethical responsibilities of computing to the forefront

    There has been a remarkable surge in the use of algorithms and artificial intelligence to address a wide range of problems and challenges. While their adoption, particularly with the rise of AI, is reshaping nearly every industry sector, discipline, and area of research, such innovations often expose unexpected consequences that involve new norms, new expectations, and new rules and laws.

    To facilitate deeper understanding, the Social and Ethical Responsibilities of Computing (SERC), a cross-cutting initiative in the MIT Schwarzman College of Computing, recently brought together social scientists and humanists with computer scientists, engineers, and other computing faculty for an exploration of the ways in which the broad applicability of algorithms and AI has presented both opportunities and challenges in many aspects of society.

    “The very nature of our reality is changing. AI has the ability to do things that until recently were solely the realm of human intelligence — things that can challenge our understanding of what it means to be human,” remarked Daniel Huttenlocher, dean of the MIT Schwarzman College of Computing, in his opening address at the inaugural SERC Symposium. “This poses philosophical, conceptual, and practical questions on a scale not experienced since the start of the Enlightenment. In the face of such profound change, we need new conceptual maps for navigating the change.”

    The symposium offered a glimpse into the vision and activities of SERC in both research and education. “We believe our responsibility with SERC is to educate and equip our students and enable our faculty to contribute to responsible technology development and deployment,” said Georgia Perakis, the William F. Pounds Professor of Management in the MIT Sloan School of Management, co-associate dean of SERC, and the lead organizer of the symposium. “We’re drawing from the many strengths and diversity of disciplines across MIT and beyond and bringing them together to gain multiple viewpoints.”

    Through a succession of panels and sessions, the symposium delved into a variety of topics related to the societal and ethical dimensions of computing. In addition, 37 undergraduate and graduate students from a range of majors, including urban studies and planning, political science, mathematics, biology, electrical engineering and computer science, and brain and cognitive sciences, participated in a poster session to exhibit their research in this space, covering such topics as quantum ethics, AI collusion in storage markets, computing waste, and empowering users on social platforms for better content credibility.

    Showcasing a diversity of work

    In three sessions devoted to themes of beneficent and fair computing, equitable and personalized health, and algorithms and humans, the SERC Symposium showcased work by 12 faculty members across these domains.

    One such project from a multidisciplinary team of archaeologists, architects, digital artists, and computational social scientists aimed to preserve endangered heritage sites in Afghanistan with digital twins. The project team produced highly detailed interrogable 3D models of the heritage sites, in addition to extended reality and virtual reality experiences, as learning resources for audiences that cannot access these sites.

    In a project for the United Network for Organ Sharing, researchers showed how they used applied analytics to optimize various facets of an organ allocation system in the United States that is currently undergoing a major overhaul in order to make it more efficient, equitable, and inclusive for different racial, age, and gender groups, among others.

    Another talk discussed an area that has not yet received adequate public attention: the broader implications for equity that biased sensor data holds for the next generation of models in computing and health care.

    A talk on bias in algorithms considered both human bias and algorithmic bias, and the potential for improving results by taking into account differences in the nature of the two kinds of bias.

    Other highlighted research included the interaction between online platforms and human psychology; a study on whether decision-makers make systemic prediction mistakes on the available information; and an illustration of how advanced analytics and computation can be leveraged to inform supply chain management, operations, and regulatory work in the food and pharmaceutical industries.

    Improving the algorithms of tomorrow

    “Algorithms are, without question, impacting every aspect of our lives,” said Asu Ozdaglar, deputy dean of academics for the MIT Schwarzman College of Computing and head of the Department of Electrical Engineering and Computer Science, in kicking off a panel she moderated on the implications of data and algorithms.

    “Whether it’s in the context of social media, online commerce, automated tasks, and now a much wider range of creative interactions with the advent of generative AI tools and large language models, there’s little doubt that much more is to come,” Ozdaglar said. “While the promise is evident to all of us, there’s a lot to be concerned as well. This is very much time for imaginative thinking and careful deliberation to improve the algorithms of tomorrow.”

    Turning to the panel, Ozdaglar asked experts from computing, social science, and data science for insights on how to understand what is to come and shape it to enrich outcomes for the majority of humanity.

    Sarah Williams, associate professor of technology and urban planning at MIT, emphasized the critical importance of comprehending the process of how datasets are assembled, as data are the foundation for all models. She also stressed the need for research to address the potential implication of biases in algorithms that often find their way in through their creators and the data used in their development. “It’s up to us to think about our own ethical solutions to these problems,” she said. “Just as it’s important to progress with the technology, we need to start the field of looking at these questions of what biases are in the algorithms? What biases are in the data, or in that data’s journey?”

    Shifting focus to generative models and whether the development and use of these technologies should be regulated, the panelists, who also included MIT’s Srini Devadas, professor of electrical engineering and computer science; John Horton, professor of information technology; and Simon Johnson, professor of entrepreneurship, all concurred that regulating open-source algorithms, which are publicly accessible, would be difficult given that regulators are still catching up and struggling to even set guardrails for technology that is now 20 years old.

    Returning to the question of how to effectively regulate the use of these technologies, Johnson proposed a progressive corporate tax system as a potential solution. He recommended basing companies’ tax payments on their profits, especially for large corporations whose massive earnings go largely untaxed due to offshore banking. Such a system, Johnson said, can serve as a regulatory mechanism that discourages companies from trying to “own the entire world.”

    The role of ethics in computing education

    As computing continues to advance with no signs of slowing down, it is critical to educate students to be intentional about the social impact of the technologies they will be developing and deploying into the world. But can one actually be taught such things? If so, how?

    Caspar Hare, professor of philosophy at MIT and co-associate dean of SERC, posed this looming question to faculty on a panel he moderated on the role of ethics in computing education. All of the panelists are experienced in teaching ethics and thinking about the social implications of computing, and each shared their perspective and approach.

    A strong advocate for the importance of learning from history, Eden Medina, associate professor of science, technology, and society at MIT, said that “often the way we frame computing is that everything is new. One of the things that I do in my teaching is look at how people have confronted these issues in the past and try to draw from them as a way to think about possible ways forward.” Medina regularly uses case studies in her classes. As an example of how decisions around technology and data can grow out of very specific contexts, she referred to a paper by Yale University science historian Joanna Radin on the Pima Indian Diabetes Dataset, which raised ethical issues about the history of that particular collection of data that many don’t consider.

    Milo Phillips-Brown, associate professor of philosophy at Oxford University, talked about the Ethical Computing Protocol that he co-created while he was a SERC postdoc at MIT. The protocol, a four-step approach to building technology responsibly, is designed to train computer science students to think in a better and more accurate way about the social implications of technology by breaking the process down into more manageable steps. “The basic approach that we take very much draws on the fields of value-sensitive design, responsible research and innovation, participatory design as guiding insights, and then is also fundamentally interdisciplinary,” he said.

    Fields such as biomedicine and law have an ethics ecosystem that distributes the function of ethical reasoning in these areas. Oversight and regulation are provided to guide front-line stakeholders and decision-makers when issues arise, as are training programs and access to interdisciplinary expertise that they can draw from. “In this space, we have none of that,” said John Basl, associate professor of philosophy at Northeastern University. “For current generations of computer scientists and other decision-makers, we’re actually making them do the ethical reasoning on their own.” Basl commented further that teaching core ethical reasoning skills across the curriculum, not just in philosophy classes, is essential, and that the goal shouldn’t be for every computer scientist to be a professional ethicist, but for them to know enough of the landscape to be able to ask the right questions and seek out the relevant expertise and resources that exist.

    After the final session, interdisciplinary groups of faculty, students, and researchers engaged in animated discussions related to the issues covered throughout the day during a reception that marked the conclusion of the symposium.

  • in

    MIT researchers make language models scalable self-learners

    Socrates once said: “It is not the size of a thing, but the quality that truly matters. For it is in the nature of substance, not its volume, that true value is found.”

    Does size always matter for large language models (LLMs)? In a technological landscape bedazzled by LLMs taking center stage, a team of MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers think smaller models shouldn’t be overlooked, especially for natural language understanding products widely deployed in the industry.

    To that end, the researchers cooked up an approach to long-standing problems of inefficiency and privacy associated with big, text-based AI models — a logic-aware model that outperforms 500-times-bigger counterparts on some language understanding tasks without human-generated annotations, while preserving privacy and robustness with high performance.

    LLMs, which have shown some promising skills in generating language, art, and code, are computationally expensive, and their data requirements can risk privacy leaks when using application programming interfaces for data upload. Smaller models have been historically less capable, particularly in multitasking and weakly supervised tasks, compared to their larger counterparts.

    So what’s helping these smaller models act so mighty, then? Something called “textual entailment,” a way to help these models understand a variety of language tasks, where if one sentence (the premise) is true, then the other sentence (the hypothesis) is likely to be true as well. For example, if the premise is, “all cats have tails” then the hypothesis “a tabby cat has a tail” would be entailed by the premise. This concept is used to train an “entailment model” that proved to be less biased than other language models, from the team’s previous research. They then created “prompts” that the models can use to figure out if certain information is entailed by a given sentence or phrase according to different tasks. This method improved the model’s ability to adapt to different tasks without any additional training, known as zero-shot adaptation.

    In the realm of “natural language understanding,” there are various applications that hinge on determining the relationship between two pieces of text. For example, in sentiment classification, a statement like “I think the movie is good” can be inferred or entailed from a movie review that says, “I like the story and the acting is great,” indicating a positive sentiment. Another is news classification, where the topic of a news article can be inferred from its content. For example, a statement like “the news article is about sports” can be entailed if the main content of the article reports on an NBA game. The key insight was that many existing natural language understanding tasks could be recast as an entailment (i.e., logical inference in natural language) task. 
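
    The recasting described above can be sketched in a few lines. This is only an illustration: the scorer is a toy word-overlap stand-in, and the names `TOY_CUES`, `toy_entailment_score`, and `classify_by_entailment` are hypothetical, not taken from the paper. A real system would plug a trained entailment model in as the scoring function.

```python
# Toy sketch: classification recast as entailment.
# The scorer is a word-overlap stand-in (an assumption for illustration),
# not a trained entailment model.

# Hypothetical cue words for each hypothesis sentence.
TOY_CUES = {
    "The news article is about sports.": {"nba", "game", "score", "team"},
    "The news article is about finance.": {"stocks", "market", "bank", "earnings"},
}


def toy_entailment_score(premise: str, hypothesis: str) -> float:
    """Share of the hypothesis's cue words that appear in the premise."""
    words = {w.strip(".,!?\"'").lower() for w in premise.split()}
    cues = TOY_CUES.get(hypothesis, set())
    return len(words & cues) / len(cues) if cues else 0.0


def classify_by_entailment(premise, hypotheses, score_fn):
    """Return the label whose hypothesis is most strongly entailed by the premise."""
    return max(hypotheses, key=lambda label: score_fn(premise, hypotheses[label]))


# One hypothesis sentence per candidate label.
hypotheses = {
    "sports": "The news article is about sports.",
    "finance": "The news article is about finance.",
}
article = "The team won the NBA game with a last-second shot."
label = classify_by_entailment(article, hypotheses, toy_entailment_score)
print(label)
```

    Because the label set lives entirely in the hypothesis sentences, switching to a different task only means writing new hypotheses, which is the sense in which no additional training is needed.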

    “Our research is about improving the ability of computer programs to understand and process natural language — the way humans speak and write. Our self-trained, 350-million-parameter entailment models, without human-generated labels, outperform supervised language models with 137 to 175 billion parameters,” says MIT CSAIL postdoc Hongyin Luo, lead author on a new paper about the study. “This has potential to reshape the landscape of AI and machine learning, providing a more scalable, trustworthy, and cost-effective solution to language modeling,” says Luo. “By proving that smaller models can perform at the same level as larger ones for language understanding, this work paves the way for more sustainable and privacy-preserving AI technologies.” 

    The team discovered that they could improve the model’s performance even more by using a technique called “self-training,” where the model uses its own predictions to teach itself, effectively learning without human supervision and additional annotated training data. The self-training method significantly improved performance on a bunch of downstream tasks, including sentiment analysis, question-answering, and news classification. It outperformed both Google’s LaMDA and FLAN in zero-shot capabilities, GPT models, and other supervised algorithms.

    However, one challenge with self-training is that the model can sometimes generate incorrect or noisy labels that harm performance. To overcome this, they developed a new algorithm called ‘SimPLE’ (Simple Pseudo-Label Editing), a process to review and modify the pseudo-labels made in initial rounds of learning. By correcting any mislabeled instances, it improved the overall quality of the self-generated labels. This not only made the models more effective at understanding language, but also more robust when faced with adversarial data.
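
    To make the idea of cleaning up self-generated labels concrete, here is a minimal sketch of generic confidence-thresholded pseudo-labeling. It is not the SimPLE algorithm itself, whose details are not given here; the function name `select_pseudo_labels` and the 0.9 threshold are assumptions chosen for illustration.

```python
# Generic confidence filter for pseudo-labels (an illustrative assumption,
# not the SimPLE procedure described in the article).

def select_pseudo_labels(probabilities, threshold=0.9):
    """Keep (example_index, label) pairs whose top probability meets the threshold."""
    selected = []
    for i, probs in enumerate(probabilities):
        # The predicted label is the class with the highest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        if probs[label] >= threshold:
            selected.append((i, label))
    return selected


# Made-up model outputs for three unlabeled examples (binary task).
preds = [[0.97, 0.03], [0.55, 0.45], [0.08, 0.92]]
kept = select_pseudo_labels(preds)
print(kept)
```

    Only the confident predictions would be fed back as training targets in the next round; the uncertain middle example is left out rather than risk teaching the model a wrong label.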

    As with most research, there are some limitations. The self-training on multi-class classification tasks didn’t perform as well as on binary natural language understanding tasks, indicating the challenge of applying entailment models to multi-choice tasks.

    “This research presents an efficient and effective way to train large language models (LLMs) by formulating natural language understanding tasks as contextual entailment problems and employing a pseudo-labeling self-training mechanism to incorporate large quantities of unlabelled text data in the training process,” adds CSAIL Senior Research Scientist James Glass, who is also an author on the paper. “While the field of LLMs is undergoing rapid and dramatic changes, this research shows that it is possible to produce relatively compact language models that perform very well on benchmark understanding tasks compared to their peers of roughly the same size, or even much larger language models.”

    “Entailment task is a popular proxy to evaluate ‘understanding’ of a given context by an AI model,” says Leonid Karlinsky, research staff member at the MIT-IBM Watson AI Lab. “It is used in many areas analyzing models with unimodal, like LLMs, and multi-modal, like VLMs [visual language models] inputs, simplifying the task of question-answering about a given input context to a binary classification problem: does this context entail a certain (e.g., text) conclusion or not? This paper makes two contributions in this space. First, it proposes a way to improve the zero-shot (without additional tuning) NLU performance and robustness to adversarial attacks via tuning with synthesized (specialized) entailment tasks generated for the primal NLU task. Second, it offers a self-supervised SimPLE method including pseudo-labeling and confidence-based filtering to further improve large LLMs’ NLU performance.”

    Luo and Glass wrote the paper with Yoon Kim, a CSAIL member and assistant professor in MIT’s Department of Electrical Engineering and Computer Science, and Jiaxin Ge of Peking University. Their work will be presented at the meeting of the Association for Computational Linguistics in Toronto, Ontario this July. This research was supported by a grant from the Hong Kong Innovation AI program.

  • in

    Scaling audio-visual learning without labels

    Researchers from MIT, the MIT-IBM Watson AI Lab, IBM Research, and elsewhere have developed a new technique for analyzing unlabeled audio and visual data that could improve the performance of machine-learning models used in applications like speech recognition and object detection. The work, for the first time, combines two architectures of self-supervised learning, contrastive learning and masked data modeling, in an effort to scale machine-learning tasks like event classification in single- and multimodal data without the need for annotation, thereby replicating how humans understand and perceive our world.

    “A larger portion of human knowledge is learned in a self-supervised way, because we don’t always get supervision signals, and we want to enable the machine-learning model to have the same ability,” says Yuan Gong, an MIT postdoc in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    “So, another way to put it is that self-supervised learning often forms the foundation of an initial model, because it can learn on vast amounts of unlabeled data. And then you can use classical, supervised learning or reinforcement learning to fine tune the model to something particular if you want to,” says Jim Glass, an MIT senior research scientist and member of the MIT-IBM Watson AI Lab.

    The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a type of neural network that can learn to extract and map meaningful latent representations into high-dimensional space from acoustic and visual data by training on large YouTube datasets of 10-second audio and video clips. The researchers say the technique is more effective than previous approaches because it explicitly models the relationships between audio and visual data in a way that other methods do not.

    Joining Gong and Glass on the study are graduate students Andrew Rouditchenko and Alexander H. Liu of MIT, David Harwath PhD ’18 of the University of Texas at Austin, and MIT-IBM Watson AI Lab members Leonid Karlinsky and Hilde Kuehne. Kuehne is also affiliated with Goethe University Frankfurt. The method was recently presented at the International Conference on Learning Representations.

    A joint and coordinated approach

    The CAV-MAE works by “learning by prediction” and “learning by comparison,” says Gong. The masked data modeling, or the prediction method, takes a video along with its coordinated audio waveform, converts the audio to a spectrogram, and masks 75 percent of both. The unmasked data is tokenized, then fed into separate audio and visual encoders before entering a joint encoder/decoder, where the model is asked to recover the missing data. The difference (reconstruction loss) between the resulting reconstructed prediction and the original audio-visual combination is then used to train the model for better performance. An example of this would be covering part of a video of a piano and part of a spectrogram of piano music, and then asking the model to try to determine the masked inputs. Unfortunately, this method may not capture the association between the video and audio pair, whereas contrastive learning leverages this, but may discard some modality-unique information, like the background in a video.
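
    The masking and reconstruction steps can be sketched as follows. This is a simplified illustration, not the CAV-MAE implementation: each token is a single number rather than a patch embedding, the "model" is a trivial mean predictor, and scoring only the masked positions is an assumption borrowed from common masked-autoencoder practice. The names `random_mask` and `reconstruction_loss` are hypothetical.

```python
import random


def random_mask(num_tokens, mask_ratio=0.75, seed=0):
    """Return the sorted indices of the tokens to hide."""
    rng = random.Random(seed)
    num_masked = int(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def reconstruction_loss(original, reconstructed, masked_idx):
    """Mean squared error computed over the masked positions only."""
    squared = [(original[i] - reconstructed[i]) ** 2 for i in masked_idx]
    return sum(squared) / len(masked_idx)


# Stand-in for tokenized audio or video patches (one number per token).
tokens = [0.1 * i for i in range(16)]
masked = random_mask(len(tokens))
visible = [i for i in range(len(tokens)) if i not in masked]

# Trivial placeholder "model": fill every hidden token with the visible mean.
visible_mean = sum(tokens[i] for i in visible) / len(visible)
prediction = [visible_mean if i in masked else tokens[i] for i in range(len(tokens))]

loss = reconstruction_loss(tokens, prediction, masked)
print(len(masked), len(visible), loss)
```

    With 16 tokens and a 75 percent ratio, 12 tokens are hidden and only 4 reach the encoder, which is also why this style of pre-training is comparatively cheap per example.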

    Contrastive learning aims to map representations that are similar close to each other. For example, the model will attempt to place different video and audio data of different parrots close to each other and further away from pairs of video and audio of guitars playing. In a similar fashion to masked autoencoding, audio-visual pairs are passed into separate modality encoders; however, the audio and visual components are kept separate within the joint encoder before the model performs pooling and contrastive loss. In this way, contrastive learning tries to identify the parts of each audio or video that are most relevant to the other. For example, if a video shows someone speaking and the corresponding audio clip contains speech, the autoencoder will learn to associate the mouth movements of the speaker with the words being spoken. It will then adjust the model’s parameters so that those inputs are represented close to each other. Ultimately, the CAV-MAE method combines both techniques with multiple forward data streams with masking as a first step, modality-specific encoders, and layer normalization so that the representation strengths are similar.
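
    A contrastive objective of this kind is often written as an InfoNCE-style loss. The sketch below is one common formulation under stated assumptions, not necessarily the exact loss used in CAV-MAE: it scores only the audio-to-video direction, the temperature of 0.1 is arbitrary, and the two-dimensional embeddings are made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(audio_embs, video_embs, temperature=0.1):
    """Average cross-entropy in which audio clip i should match video clip i."""
    audio = [normalize(a) for a in audio_embs]
    video = [normalize(v) for v in video_embs]
    total = 0.0
    for i, a in enumerate(audio):
        # Cosine similarity of this audio clip to every video in the batch.
        logits = [dot(a, v) / temperature for v in video]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(audio)


# Made-up embeddings: parrots point one way, guitars another.
parrot_audio, guitar_audio = [1.0, 0.1], [0.1, 1.0]
parrot_video, guitar_video = [0.9, 0.2], [0.2, 0.9]

aligned = contrastive_loss([parrot_audio, guitar_audio], [parrot_video, guitar_video])
shuffled = contrastive_loss([parrot_audio, guitar_audio], [guitar_video, parrot_video])
print(aligned, shuffled)
```

    The loss is small when each audio clip sits closest to its own video and large when the pairs are swapped, which is the pressure that pulls matching parrot clips together and pushes guitar clips away.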

    “We [then] wanted to compare the proposed CAV-MAE with a model trained only with a masked autoencoder and a model trained only with contrastive learning, because we want to show that by combining masked autoencoder and contrastive learning, we can get some performance improvement,” says Gong, “and the results support our hypothesis that there’s obvious improvement.”

    The researchers tested CAV-MAE — as well as their method without contrastive loss or a masked autoencoder — against other state-of-the-art methods on audio-visual retrieval and audio-visual event classification tasks using standard AudioSet (20K and 2M) and VGGSound datasets — labeled, realistic short clips, which could include multiple sounds. Audio-visual retrieval means that the model sees either the audio or visual component of a query pair and searches for the missing one; event classification includes identifying actions or sounds within data, like a person singing or a car driving.

    Overall, they found that contrastive learning and masked data modeling are complementary methods. CAV-MAE was able to outperform previous techniques (with fully self-supervised pre-training) by about 2 percent for event classification performance versus models with comparable computation and, more impressively, kept pace with or outperformed models with industry-level computational resources. The team’s model ranked similarly to models trained with only the contrastive loss. And surprisingly, the team says, the incorporation of multi-modal data into CAV-MAE pre-training greatly improves the fine-tuning of single-modality representation via supervised learning (with some labeled data) and performance on audio-only event classification tasks. This demonstrates that, as with humans, multi-modal information provides an additional “soft label” boost even for audio or visual only tasks; for instance, it helps the model to understand if it’s looking for an electric or acoustic guitar, a richer supervision signal.

    “I think people like the elegance of this model for combining information in the different audio and visual streams. It has the contrastive and the reconstruction loss, and compared to models that have been evaluated with similar data, it clearly does very well across a range of these tasks,” says Glass.

    Building on this, “one special thing is, our model can do both classification and the retrieval, which is not common,” Gong adds. “Before this work, these methods are used separately, but after this work, I see that most of the audio-visual learning frameworks use contrastive loss and the masked autoencoder together, implicitly or explicitly.”

    Bringing self-supervised audio-visual learning into our world

    The researchers see their contribution of the contrastive audio-visual masked autoencoder (CAV-MAE) as an important milestone and a step forward for applications, which are increasingly moving from single modality to multi-modality and which require or leverage audio-visual fusion. They hypothesize that one day it could be used for action recognition in realms like sports, education, entertainment, motor vehicles, and public safety. It could also, one day, extend to other modalities. At this time, the fact that “this only applies to audio-visual data may be a limitation, but we are targeting multi-modal learning, which is [a] trend of machine learning,” says Gong. “As humans, we have multi-modalities: we have smell, touch, many more things than just audio-visual. So, when we try to build AI, we try to mimic humans somehow, not necessarily from the biological perspective, and this method could [potentially be] generalized to other unexplored modalities.”

    As machine-learning models continue to play an increasingly important role in our lives, techniques like this one will become increasingly valuable.

    This research was supported by the MIT-IBM Watson AI Lab.

  • in

    Using data to write songs for progress

    A three-year recipient of MIT’s Emerson Classical Vocal Scholarships, senior Ananya Gurumurthy recalls getting ready to step onto the Carnegie Hall stage to sing a Mozart opera that she once sang with the New York All-State Choir. The choir conductor reminded her to articulate her words and to engage her diaphragm.

    “If you don’t project your voice, how are people going to hear you when you perform?” Gurumurthy recalls her conductor telling her. “This is your moment, your chance to connect with such a tremendous audience.”

    Gurumurthy reflects on the universal truth of those words as she adds her musical talents to her math and computer science studies to campaign for social and economic justice.

    The daughter of immigrants

    Growing up in Edgemont, New York, she was inspired to fight on behalf of others by her South Asian immigrant parents, who came to the United States in the 1980s. Her father is a management consultant and her mother has experience as an investment banker.

    “They came barely 15 years after the passage of the 1965 Immigration and Nationality Act, which removed national origin quotas from the American immigration system,” she says. “I would not be here if it had not been for the Civil Rights Movement, which preceded both me and my parents.”

    Her parents told her about their new home’s anti-immigrant sentiments; for example, her father was a graduate student in Dallas exiting a store when he was pelted with glass bottles and racial slurs.

    “I often consider the amount of bravery that it must have taken them to abandon everything they knew to immigrate to a new, but still imperfect, country in search of something better,” she says. “As a result, I have always felt so grounded in my identity both as a South Asian American and a woman of color. These identities have allowed me to think critically about how I can most effectively reform the institutions surrounding me.”

    Gurumurthy has been singing since she was 11, but in high school, she decided to also build her political voice by working for New York Senator Andrea Stewart-Cousins. At one point, Gurumurthy noted a log was kept for the subjects of constituent calls, such as “affordable housing” and “infrastructure,” and it was then that she became aware that Stewart-Cousins would address the most pressing of these callers’ issues before the Senate.

    “This experience was my first time witnessing how powerful the mobilization of constituents in vast numbers was for influencing meaningful legislative change,” says Gurumurthy.

    After she began applying her math skills to political campaigns, Gurumurthy was soon tapped to run analytics for the Democratic National Committee’s (DNC) midterm election initiative. As a lead analyst for the New York DNC, she adapted an interactive activation-competition (IAC) model to understand voting patterns in the 2018 and 2020 elections. She collected data from public voting records to predict how constituents would cast their ballots and used an IAC algorithm to strategize alongside grassroots organizations and allocate resources to empower historically disenfranchised groups and encourage them to vote in municipal, state, and federal elections.

    Research and student organizing at MIT

    When she arrived at MIT in 2019 to study mathematics with computer science, along with minors in music and economics, she admits she was saddled with the naïve notion that she would “build digital tools that could single-handedly alleviate all of the collective pressures of systemic injustice in this country.” 

    Since then, she has learned to create what she calls “a more nuanced view.” She picked up data analytics skills to build mobilization platforms for organizations that pursued social and economic justice, including working in Fulton County, Georgia, with Fair Fight Action (through the Kelly-Douglas Fund Scholarship) to analyze patterns of voter suppression, and MIT’s ethics laboratories in the Computer Science and Artificial Intelligence Laboratory to build symbolic artificial intelligence protocols to better understand bias in artificial intelligence algorithms. For her work on the International Monetary Fund (through the MIT Washington Summer Internship Program), Gurumurthy was awarded second place for the 2022 S. Klein Prize in Technical Writing for her paper “The Rapid Rise of Cryptocurrency.”

    “The outcomes of each project gave me more hope to begin the next because I could see the impact of these digital tools,” she says. “I saw people feel empowered to use their voices whether it was voting for the first time, protesting exploitative global monetary policy, or fighting gender discrimination. I’ve been really fortunate to see the power of mathematical analysis firsthand.”

    “I have come to realize that the constructive use of technology could be a powerful voice of resistance against injustice,” she says. “Because numbers matter, and when people bear witness to them, they are pushed to take action in meaningful ways.”

    Hoping to make a difference in her own community, she joined several Institute committees. As co-chair of the Undergraduate Association’s education committee, she propelled MIT’s first-ever digital petition for grade transparency and worked with faculty members on Institute committees to ensure that all students were being provided adequate resources to participate in online education in the wake of the Covid-19 pandemic. The digital petition inspired her to begin a project, called Insite, to develop a more centralized digital means of data collection on student life at MIT to better inform policies made by its governing bodies. As Ring Committee chair, she ensured that the special traditions of the “Brass Rat” were made economically accessible to all class members by helping the committee nearly triple its financial aid budget. For her efforts at MIT, last May she received the William L. Stewart, Jr. Award for “[her] contributions [as] an individual student at MIT to extracurricular activities and student life.”

    Ananya plans on going to law school after graduation, to study constitutional law so that she can use her technical background to build quantitative evidence in cases pertaining to voting rights, social welfare, and ethical technology, and set legal standards “for the humane use of data,” she says.

    “In building digital tools for a variety of social and economic justice organizations, I hope that we can challenge our existing systems of power and realize the progress we so dearly need to witness. There is strength in numbers, both algorithmically and organizationally. I believe it is our responsibility to simultaneously use these strengths to change the world.”

    Her ambitions, however, took root when she started singing lessons at age 11; without her background as a vocalist, she says she would be voiceless.

    “Operatic performance has given me the ability to truly step into my character and convey powerful emotions in my performance. In the process, I have realized that my voice is most powerful when it reflects my true convictions, whether I am performing or publicly speaking. I truly believe that this honesty has allowed me to become an effective community organizer. I’d like to believe that this voice is what compels those around me to act.”

    Private musical study is available for students through the Emerson/Harris Program, which offers merit-based financial awards to students of outstanding achievement on their instruments or voice in classical, jazz, or world music. The Emerson/Harris Program is funded by the late Cherry L. Emerson Jr. SM ’41, in response to an appeal from Associate Provost Ellen T. Harris (Class of 1949 professor emeritus of music).

  • in

    A better way to study ocean currents

    To study ocean currents, scientists release GPS-tagged buoys in the ocean and record their velocities to reconstruct the currents that transport them. These buoy data are also used to identify “divergences,” which are areas where water rises up from below the surface or sinks beneath it.

    By accurately predicting currents and pinpointing divergences, scientists can more precisely forecast the weather, approximate how oil will spread after a spill, or measure energy transfer in the ocean. A new model that incorporates machine learning makes more accurate predictions than conventional models do, a new study reports.

    A multidisciplinary research team including computer scientists at MIT and oceanographers has found that a standard statistical model typically used on buoy data can struggle to accurately reconstruct currents or identify divergences because it makes unrealistic assumptions about the behavior of water.

    The researchers developed a new model that incorporates knowledge from fluid dynamics to better reflect the physics at work in ocean currents. They show that their method, which only requires a small amount of additional computational expense, is more accurate at predicting currents and identifying divergences than the traditional model.

    This new model could help oceanographers make more accurate estimates from buoy data, which would enable them to more effectively monitor the transportation of biomass (such as Sargassum seaweed), carbon, plastics, oil, and nutrients in the ocean. This information is also important for understanding and tracking climate change.

    “Our method captures the physical assumptions more appropriately and more accurately. In this case, we know a lot of the physics already. We are giving the model a little bit of that information so it can focus on learning the things that are important to us, like what are the currents away from the buoys, or what is this divergence and where is it happening?” says senior author Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems and the Institute for Data, Systems, and Society.

    Broderick’s co-authors include lead author Renato Berlinghieri, an electrical engineering and computer science graduate student; Brian L. Trippe, a postdoc at Columbia University; David R. Burt and Ryan Giordano, MIT postdocs; Kaushik Srinivasan, an assistant researcher in atmospheric and ocean sciences at the University of California at Los Angeles; Tamay Özgökmen, professor in the Department of Ocean Sciences at the University of Miami; and Junfei Xia, a graduate student at the University of Miami. The research will be presented at the International Conference on Machine Learning.

    Diving into the data

    Oceanographers use data on buoy velocity to predict ocean currents and identify “divergences” where water rises to the surface or sinks deeper.

    To estimate currents and find divergences, oceanographers have used a machine-learning technique known as a Gaussian process, which can make predictions even when data are sparse. To work well in this case, the Gaussian process must make assumptions about the data to generate a prediction.
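
    For readers unfamiliar with Gaussian processes, here is a minimal one-dimensional sketch of how such a model predicts away from sparse observations. It is a generic textbook formulation, not the oceanographers' model: the squared-exponential kernel, the zero prior mean, the noise level, and the sample values are all assumptions, and real buoy data would involve two-dimensional positions and velocities.

```python
import math


def rbf(x1, x2, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel: nearby inputs get highly correlated outputs."""
    return variance * math.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a single test input."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(x_train)]
         for i, a in enumerate(x_train)]
    k_star = [rbf(x_test, a) for a in x_train]
    mean = sum(k * w for k, w in zip(k_star, solve(K, y_train)))
    var = rbf(x_test, x_test) - sum(k * w for k, w in zip(k_star, solve(K, k_star)))
    return mean, var


# Made-up sparse observations (position, measured value).
x_obs = [0.0, 1.0, 2.5]
y_obs = [0.2, 0.9, -0.3]

mean_at_obs, var_at_obs = gp_posterior(x_obs, y_obs, 1.0)
mean_far, var_far = gp_posterior(x_obs, y_obs, 10.0)
print(mean_at_obs, var_at_obs, mean_far, var_far)
```

    Close to an observation the prediction matches the data with little uncertainty; far from all observations it falls back to the prior mean with the full prior variance. The kernel is where the modeling assumptions live, which is why the choice of assumptions matters so much for buoy data.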

    A standard way of applying a Gaussian process to ocean data assumes the latitude and longitude components of the current are unrelated. But this assumption isn’t physically accurate. For instance, this existing model implies that a current’s divergence and its vorticity (a whirling motion of fluid) operate on the same magnitude and length scales. Ocean scientists know this is not true, Broderick says. The previous model also assumes the frame of reference matters, which means fluid would behave differently in the latitude versus the longitude direction.

    “We were thinking we could address these problems with a model that incorporates the physics,” she says.

    They built a new model that uses what is known as a Helmholtz decomposition to accurately represent the principles of fluid dynamics. This method models an ocean current by breaking it down into a vorticity component (which captures the whirling motion) and a divergence component (which captures water rising or sinking).
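
    The decomposition itself can be illustrated numerically. The sketch below builds a two-dimensional velocity field from a scalar potential and a stream function and then measures divergence and vorticity with finite differences. It only illustrates the Helmholtz idea and is not the researchers' model; the sign convention, the step sizes, and the example functions are assumptions.

```python
def grad(f, x, y, h=1e-4):
    """Central-difference gradient of a scalar function f(x, y)."""
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))


def velocity(phi, psi, x, y):
    """Helmholtz form in 2-D: grad(phi) + rot(psi), with rot(psi) = (dpsi/dy, -dpsi/dx)."""
    dphi_dx, dphi_dy = grad(phi, x, y)
    dpsi_dx, dpsi_dy = grad(psi, x, y)
    return (dphi_dx + dpsi_dy, dphi_dy - dpsi_dx)


def divergence(field, x, y, h=1e-3):
    """du/dx + dv/dy: positive where the field spreads outward."""
    du_dx = (field(x + h, y)[0] - field(x - h, y)[0]) / (2 * h)
    dv_dy = (field(x, y + h)[1] - field(x, y - h)[1]) / (2 * h)
    return du_dx + dv_dy


def vorticity(field, x, y, h=1e-3):
    """dv/dx - du/dy: nonzero where the field whirls."""
    dv_dx = (field(x + h, y)[1] - field(x - h, y)[1]) / (2 * h)
    du_dy = (field(x, y + h)[0] - field(x, y - h)[0]) / (2 * h)
    return dv_dx - du_dy


def zero(x, y):
    return 0.0


def bowl(x, y):
    return 0.5 * (x * x + y * y)


def vortex_only(x, y):
    # Stream function only: velocity (y, -x), a clockwise whirl.
    return velocity(zero, bowl, x, y)


def source_only(x, y):
    # Potential only: velocity (x, y), water spreading outward.
    return velocity(bowl, zero, x, y)


point = (0.3, -0.7)
print(divergence(vortex_only, *point), vorticity(vortex_only, *point))
print(divergence(source_only, *point), vorticity(source_only, *point))
```

    The whirl has vorticity but no divergence, and the outflow has divergence but no vorticity. Keeping the two pieces separate is what lets a model treat them on different magnitude and length scales.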

    In this way, they give the model some basic physics knowledge that it uses to make more accurate predictions.

    This new model utilizes the same data as the old model. And while their method can be more computationally intensive, the researchers show that the additional cost is relatively small.

    Buoyant performance

    They evaluated the new model using synthetic and real ocean buoy data. Because the synthetic data were fabricated by the researchers, they could compare the model’s predictions to ground-truth currents and divergences. But simulation involves assumptions that may not reflect real life, so the researchers also tested their model using data captured by real buoys released in the Gulf of Mexico.

    Image caption: The trajectories of approximately 300 buoys released during the Grand LAgrangian Deployment (GLAD) in the Gulf of Mexico in the summer of 2013, to learn about ocean surface currents around the Deepwater Horizon oil spill site. The small, regular clockwise rotations are due to Earth’s rotation. Credit: Consortium of Advanced Research for Transport of Hydrocarbons in the Environment

    In each case, their method demonstrated superior performance for both tasks, predicting currents and identifying divergences, when compared to the standard Gaussian process and another machine-learning approach that used a neural network. For example, in one simulation that included a vortex adjacent to an ocean current, the new method correctly predicted no divergence while the previous Gaussian process method and the neural network method both predicted a divergence with very high confidence.

    The technique is also good at identifying vortices from a small set of buoys, Broderick adds.

    Now that they have demonstrated the effectiveness of using a Helmholtz decomposition, the researchers want to incorporate a time element into their model, since currents can vary over time as well as space. In addition, they want to better capture how noise impacts the data, such as winds that sometimes affect buoy velocity. Separating that noise from the data could make their approach more accurate.

    “Our hope is to take this noisily observed field of velocities from the buoys, and then say what is the actual divergence and actual vorticity, and predict away from those buoys, and we think that our new technique will be helpful for this,” she says.

    “The authors cleverly integrate known behaviors from fluid dynamics to model ocean currents in a flexible model,” says Massimiliano Russo, an associate biostatistician at Brigham and Women’s Hospital and instructor at Harvard Medical School, who was not involved with this work. “The resulting approach retains the flexibility to model the nonlinearity in the currents but can also characterize phenomena such as vortices and connected currents that would only be noticed if the fluid dynamic structure is integrated into the model. This is an excellent example of where a flexible model can be substantially improved with a well thought and scientifically sound specification.”

    This research is supported, in part, by the Office of Naval Research, a National Science Foundation (NSF) CAREER Award, and the Rosenstiel School of Marine, Atmospheric, and Earth Science at the University of Miami.