More stories

  • in

    Multiple AI models help robots execute complex plans more transparently

    Your daily to-do list is likely pretty straightforward: wash the dishes, buy groceries, and other minutiae. It’s unlikely you wrote out “pick up the first dirty dish,” or “wash that plate with a sponge,” because each of these miniature steps within the chore feels intuitive. While we can routinely complete each step without much thought, a robot requires a complex plan that involves more detailed outlines.

    MIT’s Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans with the expertise of three different foundation models. Like OpenAI’s GPT-4, the foundation model that ChatGPT and Bing Chat were built upon, these foundation models are trained on massive quantities of data for applications like generating images, translating text, and robotics.

    Unlike RT2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models, each trained on different data modalities. Each foundation model captures a different part of the decision-making process, and the models then work together when it’s time to make decisions. HiP removes the need for access to paired vision, language, and action data, which is difficult to obtain. HiP also makes the reasoning process more transparent.

    What’s considered a daily chore for a human can be a robot’s “long-horizon goal” — an overarching objective that involves completing many smaller steps first — requiring sufficient data to plan, understand, and execute objectives. While computer vision researchers have attempted to build monolithic foundation models for this problem, pairing language, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

    “Foundation models do not have to be monolithic,” says NVIDIA AI researcher Jim Fan, who was not involved in the paper. “This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent.”

    The team believes that their system could help these machines accomplish household chores, such as putting away a book or placing a bowl in the dishwasher. Additionally, HiP could assist with multistep construction and manufacturing tasks, like stacking and placing different materials in specific sequences.

    Evaluating HiP

    The CSAIL team tested HiP’s acuity on three manipulation tasks, outperforming comparable frameworks. The system reasoned by developing intelligent plans that adapt to new information.

    First, the researchers requested that it stack different-colored blocks on each other and then place others nearby. The catch: Some of the correct colors weren’t present, so the robot had to place white blocks in a color bowl to paint them. HiP often adjusted to these changes accurately, especially compared to state-of-the-art task planning systems like Transformer BC and Action Diffuser, by adjusting its plans to stack and place each square as needed.

    Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the objects it needed to move were dirty, so HiP adjusted its plans to place them in a cleaning box, and then into the brown container. In a third demonstration, the bot was able to ignore unnecessary objects to complete kitchen sub-goals such as opening a microwave, clearing a kettle out of the way, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.

    A three-pronged hierarchy

    HiP’s three-pronged planning process operates as a hierarchy, with the ability to pre-train each of its components on different sets of data, including information outside of robotics. At the bottom of that order is a large language model (LLM), which starts to ideate by capturing all the symbolic information needed and developing an abstract task plan. Applying the common sense knowledge it finds on the internet, the model breaks its objective into sub-goals. For example, “making a cup of tea” turns into “filling a pot with water,” “boiling the pot,” and the subsequent actions required.

    “All we want to do is take existing pre-trained models and have them successfully interface with each other,” says Anurag Ajay, a PhD student in the MIT Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Instead of pushing for one model to do everything, we combine multiple ones that leverage different modalities of internet data. When used in tandem, they help with robotic decision-making and can potentially aid with tasks in homes, factories, and construction sites.”

    These models also need some form of “eyes” to understand the environment they’re operating in and correctly execute each sub-goal. The team used a large video diffusion model to augment the initial planning completed by the LLM, which collects geometric and physical information about the world from footage on the internet. In turn, the video model generates an observation trajectory plan, refining the LLM’s outline to incorporate new physical knowledge.

    This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more practical outline. The flow of feedback is similar to writing an article, where an author may send their draft to an editor, and with those revisions incorporated in, the publisher reviews for any last changes and finalizes.
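    The feedback loop described above can be pictured with a small sketch. This is not the HiP implementation: the function names, the toy planner, and the idea of reducing the lower level's feedback to a single feasibility score are all assumptions made for illustration.

```python
from typing import Callable, List

Plan = List[str]

def refine_plan(goal: str,
                propose_subgoals: Callable[[str], List[Plan]],
                score_feasibility: Callable[[Plan], float]) -> Plan:
    """Keep the candidate sub-goal sequence that the next level rates most feasible."""
    candidates = propose_subgoals(goal)
    return max(candidates, key=score_feasibility)

# Hypothetical stand-in for the language model: it proposes candidate orderings.
def toy_language_planner(goal: str) -> List[Plan]:
    return [
        ["boil the pot", "fill a pot with water"],
        ["fill a pot with water", "boil the pot"],
    ]

# Hypothetical stand-in for visual feedback: a pot must be filled before it is boiled.
def toy_visual_feedback(plan: Plan) -> float:
    filled_first = plan.index("fill a pot with water") < plan.index("boil the pot")
    return 1.0 if filled_first else 0.0

chosen_plan = refine_plan("make a cup of tea", toy_language_planner, toy_visual_feedback)
print(chosen_plan)
```

    In this toy version the planner only proposes and the feedback function only scores, which keeps the two models independent: either stand-in could be swapped for a different pre-trained model without touching the other.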

    In this case, the top of the hierarchy is an egocentric action model, or a sequence of first-person images that infer which actions should take place based on its surroundings. During this stage, the observation plan from the video model is mapped over the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and begin completing each sub-goal.

    Still, the multimodal work is limited by the lack of high-quality video foundation models. Once available, they could interface with HiP’s small-scale video models to further enhance visual sequence prediction and robot action generation. A higher-quality version would also reduce the current data requirements of the video models.

    That being said, the CSAIL team’s approach only used a tiny bit of data overall. Moreover, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-horizon tasks. “What Anurag has demonstrated is proof-of-concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, MIT assistant professor in EECS and director of the Improbable AI Lab. The group is also considering applying HiP to solving real-world long-horizon tasks in robotics.

    Ajay and Agrawal are lead authors on a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du ’19; former postdoc Abhishek Gupta, who is now assistant professor at University of Washington; and former graduate student Shuang Li PhD ’23.

    The team’s work was supported, in part, by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).

  • in

    Technique could efficiently solve partial differential equations for numerous applications

    In fields such as physics and engineering, partial differential equations (PDEs) are used to model complex physical processes to generate insight into how some of the most complicated physical and natural systems in the world function.

    To solve these difficult equations, researchers use high-fidelity numerical solvers, which can be very time-consuming and computationally expensive to run. The current simplified alternative is data-driven surrogate models, which compute the goal property of a solution to PDEs rather than the whole solution. These models are trained on data generated by the high-fidelity solver so that they can predict the output of the PDEs for new inputs. This approach is data-intensive and expensive because complex physical systems require a large number of simulations to generate enough data.

    In a new paper, “Physics-enhanced deep surrogates for partial differential equations,” published in December in Nature Machine Intelligence, a new method is proposed for developing data-driven surrogate models for complex physical systems in such fields as mechanics, optics, thermal transport, fluid dynamics, physical chemistry, and climate models.

    The paper was authored by MIT’s professor of applied mathematics Steven G. Johnson along with Payel Das and Youssef Mroueh of the MIT-IBM Watson AI Lab and IBM Research; Chris Rackauckas of Julia Lab; and Raphaël Pestourie, a former MIT postdoc who is now at Georgia Tech. The authors call their method “physics-enhanced deep surrogate” (PEDS), which combines a low-fidelity, explainable physics simulator with a neural network generator. The neural network generator is trained end-to-end to match the output of the high-fidelity numerical solver.
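    To make the structure concrete, here is a toy sketch of the idea: a trainable generator feeds a cheap solver, and the pair is trained end-to-end against reference outputs. Everything in it is an assumption for illustration. The two "solvers" are closed-form stand-ins, the generator is a single linear unit rather than a neural network, and the gradient is written by hand instead of using automatic differentiation. It is not the authors' code.

```python
# Toy stand-ins (assumptions for illustration, not the paper's solvers).
def high_fidelity(p):
    """Pretend-expensive reference solver that produces the training labels."""
    return (1.5 * p + 0.5) ** 2

def low_fidelity(g):
    """Cheap, differentiable approximate solver acting on a generated input g."""
    return g ** 2

def low_fidelity_grad(g):
    """Derivative of low_fidelity with respect to g (hand-written here)."""
    return 2.0 * g

def surrogate(p, w, b):
    """The generator (one linear unit, w * p + b) feeds the low-fidelity solver."""
    return low_fidelity(w * p + b)

def mean_squared_error(points, targets, w, b):
    errors = [(surrogate(p, w, b) - y) ** 2 for p, y in zip(points, targets)]
    return sum(errors) / len(errors)

points = [i / 10 for i in range(11)]           # small training set
targets = [high_fidelity(p) for p in points]   # labels from the reference solver

w, b = 1.0, 0.0                                # initial generator parameters
initial_loss = mean_squared_error(points, targets, w, b)
learning_rate = 0.01

for _ in range(2000):
    grad_w = grad_b = 0.0
    for p, y in zip(points, targets):
        g = w * p + b
        # Chain rule through the solver: d(loss)/dg = 2 * (lf(g) - y) * lf'(g)
        dloss_dg = 2.0 * (low_fidelity(g) - y) * low_fidelity_grad(g)
        grad_w += dloss_dg * p / len(points)
        grad_b += dloss_dg / len(points)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_loss = mean_squared_error(points, targets, w, b)
print(initial_loss, final_loss)
```

    The point of the sketch is the training signal: the error is measured at the output of the cheap solver, and the gradient flows back through that solver into the generator's parameters.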

    “My aspiration is to replace the inefficient process of trial and error with systematic, computer-aided simulation and optimization,” says Pestourie. “Recent breakthroughs in AI like the large language model of ChatGPT rely on hundreds of billions of parameters and require vast amounts of resources to train and evaluate. In contrast, PEDS is affordable to all because it is incredibly efficient in computing resources and has a very low barrier in terms of infrastructure needed to use it.”

    In the article, they show that PEDS surrogates can be up to three times more accurate than an ensemble of feedforward neural networks with limited data (approximately 1,000 training points), and reduce the training data needed by at least a factor of 100 to achieve a target error of 5 percent. Developed using the MIT-designed Julia programming language, this scientific machine-learning method is thus efficient in both computing and data.

    The authors also report that PEDS provides a general, data-driven strategy to bridge the gap between a vast array of simplified physical models with corresponding brute-force numerical solvers modeling complex systems. This technique offers accuracy, speed, data efficiency, and physical insights into the process.

    Says Pestourie, “Since the 2000s, as computing capabilities improved, the trend of scientific models has been to increase the number of parameters to fit the data better, sometimes at the cost of a lower predictive accuracy. PEDS does the opposite by choosing its parameters smartly. It leverages the technology of automatic differentiation to train a neural network that makes a model with few parameters accurate.”

    “The main challenge that prevents surrogate models from being used more widely in engineering is the curse of dimensionality — the fact that the needed data to train a model increases exponentially with the number of model variables,” says Pestourie. “PEDS reduces this curse by incorporating information from the data and from the field knowledge in the form of a low-fidelity model solver.”
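    A quick arithmetic illustration of the exponential growth Pestourie describes, assuming (purely for illustration) that training data is collected on a uniform grid with a fixed number of sample points per model variable:

```python
def grid_samples(points_per_variable: int, num_variables: int) -> int:
    """Number of simulations needed to cover a full uniform grid."""
    return points_per_variable ** num_variables

# With 10 sample points per variable, each extra variable multiplies the cost by 10.
for num_variables in (1, 2, 5, 10):
    print(num_variables, grid_samples(10, num_variables))
```

    Real surrogate training rarely uses a full grid, but the same multiplicative growth is what makes high-dimensional problems data-hungry.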

    The researchers say that PEDS has the potential to revive a whole body of the pre-2000 literature dedicated to minimal models — intuitive models that PEDS could make more accurate while also being predictive for surrogate model applications.

    “The application of the PEDS framework is beyond what we showed in this study,” says Das. “Complex physical systems governed by PDEs are ubiquitous, from climate modeling to seismic modeling and beyond. Our physics-inspired fast and explainable surrogate models will be of great use in those applications, and play a complementary role to other emerging techniques, like foundation models.”

    The research was supported by the MIT-IBM Watson AI Lab and the U.S. Army Research Office through the Institute for Soldier Nanotechnologies.

  • in

    Leveraging language to understand machines

    Natural language conveys ideas, actions, information, and intent through context and syntax; further, there are volumes of it contained in databases. This makes it an excellent source of data to train machine-learning systems on. Two master’s of engineering students in the 6A MEng Thesis Program at MIT, Irene Terpstra ’23 and Rujul Gandhi ’22, are working with mentors in the MIT-IBM Watson AI Lab to use this power of natural language to build AI systems.

    As computing is becoming more advanced, researchers are looking to improve the hardware that they run on; this means innovating to create new computer chips. And, since there is literature already available on modifications that can be made to achieve certain parameters and performance, Terpstra and her mentors and advisors Anantha Chandrakasan, MIT School of Engineering dean and the Vannevar Bush Professor of Electrical Engineering and Computer Science, and IBM’s researcher Xin Zhang, are developing an AI algorithm that assists in chip design.

    “I’m creating a workflow to systematically analyze how these language models can help the circuit design process. What reasoning powers do they have, and how can it be integrated into the chip design process?” says Terpstra. “And then on the other side, if that proves to be useful enough, [we’ll] see if they can automatically design the chips themselves, attaching it to a reinforcement learning algorithm.”

    To do this, Terpstra’s team is creating an AI system that can iterate on different designs. It means experimenting with various pre-trained large language models (like ChatGPT, Llama 2, and Bard), using an open-source circuit simulator language called NGspice, which has the parameters of the chip in code form, and a reinforcement learning algorithm. With text prompts, researchers will be able to ask the language model how the physical chip should be modified to achieve a certain goal, and the model will produce guidance for adjustments. This guidance is then passed to a reinforcement learning algorithm that updates the circuit design and outputs new physical parameters of the chip.

    “The final goal would be to combine the reasoning powers and the knowledge base that is baked into these large language models and combine that with the optimization power of the reinforcement learning algorithms and have that design the chip itself,” says Terpstra.

    Rujul Gandhi works with the raw language itself. As an undergraduate at MIT, Gandhi explored linguistics and computer science, putting them together in her MEng work. “I’ve been interested in communication, both between just humans and between humans and computers,” Gandhi says.

    Robots or other interactive AI systems are one area where communication needs to be understood by both humans and machines. Researchers often write instructions for robots using formal logic. This helps ensure that commands are being followed safely and as intended, but formal logic can be difficult for users to understand, while natural language comes easily. To ensure this smooth communication, Gandhi and her advisors Yang Zhang of IBM and MIT assistant professor Chuchu Fan are building a parser that converts natural language instructions into a machine-friendly form. Leveraging the linguistic structure encoded by the pre-trained encoder-decoder model T5, and a dataset of annotated, basic English commands for performing certain tasks, Gandhi’s system identifies the smallest logical units, or atomic propositions, which are present in a given instruction.

    “Once you’ve given your instruction, the model identifies all the smaller sub-tasks you want it to carry out,” Gandhi says. “Then, using a large language model, each sub-task can be compared against the available actions and objects in the robot’s world, and if any sub-task can’t be carried out because a certain object is not recognized, or an action is not possible, the system can stop right there to ask the user for help.”
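    The checking step Gandhi describes can be sketched as follows. This is a simplified illustration, not her system: sub-tasks are represented as hypothetical (action, object) pairs, and plain set membership stands in for the large language model comparison mentioned in the quote.

```python
from typing import List, Optional, Set, Tuple

SubTask = Tuple[str, str]  # (action, object): a hypothetical representation

def first_problem(sub_tasks: List[SubTask],
                  known_actions: Set[str],
                  known_objects: Set[str]) -> Optional[str]:
    """Return a help request for the first sub-task that cannot be carried out."""
    for action, obj in sub_tasks:
        if action not in known_actions:
            return f"I cannot perform the action '{action}'. Can you help?"
        if obj not in known_objects:
            return f"I do not recognize the object '{obj}'. Can you help?"
    return None  # every sub-task is supported

example_tasks = [("pick up", "cup"), ("place", "tray"), ("open", "oven")]
message = first_problem(example_tasks, {"pick up", "place", "open"}, {"cup", "tray"})
print(message)
```

    Stopping at the first unsupported sub-task mirrors the behavior in the quote: the system asks for help right away instead of attempting a plan it cannot finish.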

    This approach of breaking instructions into sub-tasks also allows her system to understand logical dependencies expressed in English, like, “do task X until event Y happens.” Gandhi uses a dataset of step-by-step instructions across robot task domains like navigation and manipulation, with a focus on household tasks. Using data that are written just the way humans would talk to each other has many advantages, she says, because it means a user can be more flexible about how they phrase their instructions.

    Another of Gandhi’s projects involves developing speech models. In the context of speech recognition, some languages are considered “low resource” since they might not have a lot of transcribed speech available, or might not have a written form at all. “One of the reasons I applied to this internship at the MIT-IBM Watson AI Lab was an interest in language processing for low-resource languages,” she says. “A lot of language models today are very data-driven, and when it’s not that easy to acquire all of that data, that’s when you need to use the limited data efficiently.” 

    Speech is just a stream of sound waves, but humans having a conversation can easily figure out where words and thoughts start and end. In speech processing, both humans and language models use their existing vocabulary to recognize word boundaries and understand the meaning. In low- or no-resource languages, a written vocabulary might not exist at all, so researchers can’t provide one to the model. Instead, the model can make note of what sound sequences occur together more frequently than others, and infer that those might be individual words or concepts. In Gandhi’s research group, these inferred words are then collected into a pseudo-vocabulary that serves as a labeling method for the low-resource language, creating labeled data for further applications.
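    One simple way to picture the pseudo-vocabulary idea is to count which short runs of sound units recur across utterances. The sketch below assumes the audio has already been turned into sequences of discrete unit symbols, and it uses plain frequency counting with hypothetical thresholds; the article does not specify the method used in Gandhi's group.

```python
from collections import Counter
from typing import List, Set, Tuple

def pseudo_vocabulary(utterances: List[List[str]],
                      min_len: int = 2,
                      max_len: int = 3,
                      min_count: int = 2) -> Set[Tuple[str, ...]]:
    """Collect unit sequences that occur at least min_count times."""
    counts: Counter = Counter()
    for units in utterances:
        for n in range(min_len, max_len + 1):
            for start in range(len(units) - n + 1):
                counts[tuple(units[start:start + n])] += 1
    return {seq for seq, count in counts.items() if count >= min_count}

# Made-up unit sequences for illustration only.
example_utterances = [
    ["k", "a", "t", "s", "i"],
    ["m", "o", "k", "a", "t"],
    ["s", "i", "m", "o"],
]
vocabulary = pseudo_vocabulary(example_utterances)
print(sorted(vocabulary))
```

    Each recurring sequence can then serve as a label for the stretches of audio where it occurs, which is the sense in which a pseudo-vocabulary yields labeled data.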

    The applications for language technology are “pretty much everywhere,” Gandhi says. “You could imagine people being able to interact with software and devices in their native language, their native dialect. You could imagine improving all the voice assistants that we use. You could imagine it being used for translation or interpretation.”

  • in

    Image recognition accuracy: An unseen challenge confounding today’s AI

    Imagine you are scrolling through the photos on your phone and you come across an image that at first you can’t recognize. It looks like maybe something fuzzy on the couch; could it be a pillow or a coat? After a couple of seconds it clicks — of course! That ball of fluff is your friend’s cat, Mocha. While some of your photos could be understood in an instant, why was this cat photo much more difficult?

    MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers were surprised to find that despite the critical importance of understanding visual data in pivotal areas ranging from health care to transportation to household devices, the notion of an image’s recognition difficulty for humans has been almost entirely ignored. One of the major drivers of progress in deep learning-based AI has been datasets, yet we know little about how data drives progress in large-scale deep learning beyond that bigger is better.

    In real-world applications that require understanding visual data, humans outperform object recognition models despite the fact that models perform well on current datasets, including those explicitly designed to challenge machines with debiased images or distribution shifts. This problem persists, in part, because we have no guidance on the absolute difficulty of an image or dataset. Without controlling for the difficulty of images used for evaluation, it’s hard to objectively assess progress toward human-level performance, to cover the range of human abilities, and to increase the challenge posed by a dataset.

    To fill in this knowledge gap, David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, delved into the deep world of image datasets, exploring why certain images are more difficult for humans and machines to recognize than others. “Some images inherently take longer to recognize, and it’s essential to understand the brain’s activity during this process and its relation to machine learning models. Perhaps there are complex neural circuits or unique mechanisms missing in our current models, visible only when tested with challenging visual stimuli. This exploration is crucial for comprehending and enhancing machine vision models,” says Mayo, a lead author of a new paper on the work.

    This led to the development of a new metric, the “minimum viewing time” (MVT), which quantifies the difficulty of recognizing an image based on how long a person needs to view it before making a correct identification. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed images to participants for varying durations from as short as 17 milliseconds to as long as 10 seconds, and asked them to choose the correct object from a set of 50 options. After over 200,000 image presentation trials, the team found that existing test sets, including ObjectNet, appeared skewed toward easier, shorter MVT images, with the vast majority of benchmark performance derived from images that are easy for humans.
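    As a rough sketch of how such a metric could be computed for a single image, the function below takes the shortest viewing duration at which more than half of the responses were correct. The majority threshold and the data layout are assumptions for illustration; the team's released tools may define MVT differently.

```python
from typing import Iterable, Optional, Tuple

def minimum_viewing_time(trials: Iterable[Tuple[int, bool]],
                         threshold: float = 0.5) -> Optional[int]:
    """trials holds (duration_ms, correct) pairs for one image.

    Returns the shortest duration whose accuracy exceeds threshold,
    or None if no duration does.
    """
    by_duration = {}
    for duration_ms, correct in trials:
        by_duration.setdefault(duration_ms, []).append(correct)
    for duration_ms in sorted(by_duration):
        outcomes = by_duration[duration_ms]
        if sum(outcomes) / len(outcomes) > threshold:
            return duration_ms
    return None

# Made-up responses for one image.
example_trials = [(17, False), (17, False), (150, True), (150, False),
                  (150, True), (10000, True)]
example_mvt = minimum_viewing_time(example_trials)
print(example_mvt)
```

    Under this definition a larger value means a harder image, and an image that is never reliably recognized gets no value at all.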

    The project identified interesting trends in model performance — particularly in relation to scaling. Larger models showed considerable improvement on simpler images but made less progress on more challenging images. The CLIP models, which incorporate both language and vision, stood out as they moved in the direction of more human-like recognition.

    “Traditionally, object recognition datasets have been skewed towards less-complex images, a practice that has led to an inflation in model performance metrics, not truly reflective of a model’s robustness or its ability to tackle complex visual tasks. Our research reveals that harder images pose a more acute challenge, causing a distribution shift that is often not accounted for in standard evaluations,” says Mayo. “We released image sets tagged by difficulty along with tools to automatically compute MVT, enabling MVT to be added to existing benchmarks and extended to various applications. These include measuring test set difficulty before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition techniques to close the gap between benchmark and real-world performance.”

    “One of my biggest takeaways is that we now have another dimension to evaluate models on. We want models that are able to recognize any image even if — perhaps especially if — it’s hard for a human to recognize. We’re the first to quantify what this would mean. Our results show that not only is this not the case with today’s state of the art, but also that our current evaluation methods don’t have the ability to tell us when it is the case because standard datasets are so skewed toward easy images,” says Jesse Cummings, an MIT graduate student in electrical engineering and computer science and co-first author with Mayo on the paper.

    From ObjectNet to MVT

    A few years ago, the team behind this project identified a significant challenge in the field of machine learning: Models were struggling with out-of-distribution images, or images that were not well-represented in the training data. Enter ObjectNet, a dataset composed of images collected from real-life settings. By eliminating spurious correlations present in other benchmarks (for example, between an object and its background), the dataset exposed the gap between how machine vision models perform on datasets and how they perform in real-world applications, relative to human recognition abilities. Many researchers and developers adopted it, which subsequently improved model performance.

    Fast forward to the present, and the team has taken their research a step further with MVT. Unlike traditional methods that focus on absolute performance, this new approach assesses how models perform by contrasting their responses to the easiest and hardest images. The study further explored how image difficulty could be explained and tested for similarity to human visual processing. Using metrics like c-score, prediction depth, and adversarial robustness, the team found that harder images are processed differently by networks. “While there are observable trends, such as easier images being more prototypical, a comprehensive semantic explanation of image difficulty continues to elude the scientific community,” says Mayo.
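    The contrast between easy and hard images can be sketched as a simple accuracy gap. The split used here (the lower and upper halves of images ranked by viewing-time difficulty) is an assumption for illustration, not the paper's evaluation protocol.

```python
from typing import List, Tuple

def accuracy_gap(results: List[Tuple[int, bool]]) -> float:
    """results holds (mvt_ms, model_correct) pairs, one per image.

    Returns model accuracy on the easiest half minus accuracy on the hardest half.
    """
    if len(results) < 2:
        raise ValueError("need at least two images to compare easy and hard")
    ordered = sorted(results, key=lambda row: row[0])
    half = len(ordered) // 2
    easy, hard = ordered[:half], ordered[len(ordered) - half:]

    def accuracy(rows):
        return sum(correct for _, correct in rows) / len(rows)

    return accuracy(easy) - accuracy(hard)

# Made-up model results for six images.
example_results = [(17, True), (17, True), (150, True),
                   (150, False), (10000, False), (10000, False)]
example_gap = accuracy_gap(example_results)
print(example_gap)
```

    A gap near zero would mean the model handles hard images about as well as easy ones; a large positive gap means its benchmark score is carried mostly by easy images.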

    In the realm of health care, for example, the pertinence of understanding visual complexity becomes even more pronounced. The ability of AI models to interpret medical images, such as X-rays, is subject to the diversity and difficulty distribution of the images. The researchers advocate for a meticulous analysis of difficulty distribution tailored for professionals, ensuring AI systems are evaluated based on expert standards, rather than layperson interpretations.

    Mayo and Cummings are currently looking at neurological underpinnings of visual recognition as well, probing into whether the brain exhibits differential activity when processing easy versus challenging images. The study aims to unravel whether complex images recruit additional brain areas not typically associated with visual processing, hopefully helping demystify how our brains accurately and efficiently decode the visual world.

    Toward human-level performance

    Looking ahead, the researchers are focused on more than exploring ways to enhance AI’s predictive capabilities regarding image difficulty. The team is also working on identifying correlations with viewing-time difficulty in order to generate harder or easier versions of images.

    Despite the study’s significant strides, the researchers acknowledge limitations, particularly in terms of the separation of object recognition from visual search tasks. The current methodology does concentrate on recognizing objects, leaving out the complexities introduced by cluttered images.

    “This comprehensive approach addresses the long-standing challenge of objectively assessing progress towards human-level performance in object recognition and opens new avenues for understanding and advancing the field,” says Mayo. “With the potential to adapt the Minimum Viewing Time difficulty metric for a variety of visual tasks, this work paves the way for more robust, human-like performance in object recognition, ensuring that models are truly put to the test and are ready for the complexities of real-world visual understanding.”

    “This is a fascinating study of how human perception can be used to identify weaknesses in the ways AI vision models are typically benchmarked, which overestimate AI performance by concentrating on easy images,” says Alan L. Yuille, Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University, who was not involved in the paper. “This will help develop more realistic benchmarks leading not only to improvements to AI but also make fairer comparisons between AI and human perception.” 

    “It’s widely claimed that computer vision systems now outperform humans, and on some benchmark datasets, that’s true,” says Anthropic technical staff member Simon Kornblith PhD ’17, who was also not involved in this work. “However, a lot of the difficulty in those benchmarks comes from the obscurity of what’s in the images; the average person just doesn’t know enough to classify different breeds of dogs. This work instead focuses on images that people can only get right if given enough time. These images are generally much harder for computer vision systems, but the best systems are only a bit worse than humans.”

    Mayo, Cummings, and Xinyu Lin MEng ’22 wrote the paper alongside CSAIL Research Scientist Andrei Barbu, CSAIL Principal Research Scientist Boris Katz, and MIT-IBM Watson AI Lab Principal Researcher Dan Gutfreund. The researchers are affiliates of the MIT Center for Brains, Minds, and Machines.

    The team is presenting their work at the 2023 Conference on Neural Information Processing Systems (NeurIPS).

  • in

    Automated system teaches users when to collaborate with an AI assistant

    Artificial intelligence models that pick out patterns in images can often do so better than human eyes — but not always. If a radiologist is using an AI model to help her determine whether a patient’s X-rays show signs of pneumonia, when should she trust the model’s advice and when should she ignore it?

    A customized onboarding process could help this radiologist answer that question, according to researchers at MIT and the MIT-IBM Watson AI Lab. They designed a system that teaches a user when to collaborate with an AI assistant.

    In this case, the training method might find situations where the radiologist trusts the model’s advice — except she shouldn’t because the model is wrong. The system automatically learns rules for how she should collaborate with the AI, and describes them with natural language.
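    A heavily simplified sketch of the rule-finding idea follows. It assumes each recorded example already carries a human-readable description of its situation (the "region"), whereas the researchers' system discovers such situations automatically; the thresholds and the wording of the rules are likewise hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def overtrust_rules(records: Iterable[Tuple[str, bool, bool]],
                    min_trust: float = 0.5,
                    max_ai_accuracy: float = 0.5) -> List[str]:
    """records holds (region, human_followed_ai, ai_correct) triples.

    Flags regions where the human usually follows the AI although the AI is usually wrong.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # followed, ai_correct, total
    for region, followed, ai_correct in records:
        entry = stats[region]
        entry[0] += followed
        entry[1] += ai_correct
        entry[2] += 1
    rules = []
    for region, (followed, ai_right, total) in sorted(stats.items()):
        if followed / total > min_trust and ai_right / total < max_ai_accuracy:
            rules.append(f"Do not rely on the AI when {region}.")
    return rules

# Made-up interaction records.
example_records = [
    ("the image is blurry", True, False),
    ("the image is blurry", True, False),
    ("the image is blurry", True, True),
    ("the image is sharp", True, True),
    ("the image is sharp", False, True),
]
example_rules = overtrust_rules(example_records)
print(example_rules)
```

    The same counting could be mirrored to flag situations where the human ignores an AI that is usually right.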

    During onboarding, the radiologist practices collaborating with the AI using training exercises based on these rules, receiving feedback about her performance and the AI’s performance.

    The researchers found that this onboarding procedure led to about a 5 percent improvement in accuracy when humans and AI collaborated on an image prediction task. Their results also show that just telling the user when to trust the AI, without training, led to worse performance.

    Importantly, the researchers’ system is fully automated, so it learns to create the onboarding process based on data from the human and AI performing a specific task. It can also adapt to different tasks, so it can be scaled up and used in many situations where humans and AI models work together, such as in social media content moderation, writing, and programming.

    “So often, people are given these AI tools to use without any training to help them figure out when it is going to be helpful. That’s not what we do with nearly every other tool that people use — there is almost always some kind of tutorial that comes with it. But for AI, this seems to be missing. We are trying to tackle this problem from a methodological and behavioral perspective,” says Hussein Mozannar, a graduate student in the Social and Engineering Systems doctoral program within the Institute for Data, Systems, and Society (IDSS) and lead author of a paper about this training process.

    The researchers envision that such onboarding will be a crucial part of training for medical professionals.

    “One could imagine, for example, that doctors making treatment decisions with the help of AI will first have to do training similar to what we propose. We may need to rethink everything from continuing medical education to the way clinical trials are designed,” says senior author David Sontag, a professor of EECS, a member of the MIT-IBM Watson AI Lab and the MIT Jameel Clinic, and the leader of the Clinical Machine Learning Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Mozannar, who is also a researcher with the Clinical Machine Learning Group, is joined on the paper by Jimin J. Lee, an undergraduate in electrical engineering and computer science; Dennis Wei, a senior research scientist at IBM Research; and Prasanna Sattigeri and Subhro Das, research staff members at the MIT-IBM Watson AI Lab. The paper will be presented at the Conference on Neural Information Processing Systems.

    Training that evolves

    Existing onboarding methods for human-AI collaboration are often composed of training materials produced by human experts for specific use cases, making them difficult to scale up. Some related techniques rely on explanations, where the AI tells the user its confidence in each decision, but research has shown that explanations are rarely helpful, Mozannar says.

    “The AI model’s capabilities are constantly evolving, so the use cases where the human could potentially benefit from it are growing over time. At the same time, the user’s perception of the model continues changing. So, we need a training procedure that also evolves over time,” he adds.

    To accomplish this, their onboarding method is automatically learned from data. It is built from a dataset that contains many instances of a task, such as detecting the presence of a traffic light from a blurry image.

    The system’s first step is to collect data on the human and AI performing this task. In this case, the human would try to predict, with the help of AI, whether blurry images contain traffic lights.

    The system embeds these data points onto a latent space, which is a representation of data in which similar data points are closer together. It uses an algorithm to discover regions of this space where the human collaborates incorrectly with the AI. These regions capture instances where the human trusted the AI’s prediction but the prediction was wrong, and vice versa.
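
    The region-discovery step can be illustrated with a small sketch. This is not the researchers' algorithm, which the article does not specify; it is a minimal neighborhood scan under stated assumptions. The `Interaction` record, the 2-D embeddings, and the `radius`, `min_points`, and `min_error_rate` parameters are all hypothetical.

```python
from dataclasses import dataclass
from math import dist


@dataclass
class Interaction:
    embedding: tuple   # hypothetical 2-D latent coordinates
    followed_ai: bool  # did the human accept the AI's prediction?
    ai_correct: bool   # was the AI's prediction right?


def miscollaborated(x):
    # Wrong call either way: trusted a wrong AI, or overrode a correct one.
    return x.followed_ai != x.ai_correct


def find_problem_regions(data, radius=1.0, min_points=3, min_error_rate=0.6):
    """Return (center, error_rate) for neighborhoods with frequent miscollaboration."""
    regions = []
    for center in data:
        # Nearby points in latent space are assumed to be similar inputs.
        neighbors = [x for x in data
                     if dist(x.embedding, center.embedding) <= radius]
        if len(neighbors) < min_points:
            continue
        rate = sum(miscollaborated(x) for x in neighbors) / len(neighbors)
        if rate >= min_error_rate:
            regions.append((center.embedding, rate))
    return regions
```

    Each returned neighborhood is a candidate region that a later step could describe in natural language, such as a cluster of nighttime highway images where the human tends to trust a wrong prediction.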

    Perhaps the human mistakenly trusts the AI when images show a highway at night.

    After discovering the regions, a second algorithm utilizes a large language model to describe each region as a rule, using natural language. The algorithm iteratively fine-tunes that rule by finding contrasting examples. It might describe this region as “ignore AI when it is a highway during the night.”

    These rules are used to build training exercises. The onboarding system shows an example to the human, in this case a blurry highway scene at night, as well as the AI’s prediction, and asks the user if the image shows traffic lights. The user can answer yes, no, or use the AI’s prediction.

    If the human is wrong, they are shown the correct answer and performance statistics for the human and AI on these instances of the task. The system does this for each region, and at the end of the training process, repeats the exercises the human got wrong.

    “After that, the human has learned something about these regions that we hope they will take away in the future to make more accurate predictions,” Mozannar says.

    Onboarding boosts accuracy

    The researchers tested this system with users on two tasks — detecting traffic lights in blurry images and answering multiple choice questions from many domains (such as biology, philosophy, computer science, etc.).

    They first showed users a card with information about the AI model, how it was trained, and a breakdown of its performance on broad categories. Users were split into five groups: Some were only shown the card, some went through the researchers’ onboarding procedure, some went through a baseline onboarding procedure, some went through the researchers’ onboarding procedure and were given recommendations of when they should or should not trust the AI, and others were only given the recommendations.

    Only the researchers’ onboarding procedure without recommendations improved users’ accuracy significantly, boosting their performance on the traffic light prediction task by about 5 percent without slowing them down. However, onboarding was not as effective for the question-answering task. The researchers believe this is because the AI model, ChatGPT, provided explanations with each answer that convey whether it should be trusted.

    But providing recommendations without onboarding had the opposite effect — users not only performed worse, they took more time to make predictions.

    “When you only give someone recommendations, it seems like they get confused and don’t know what to do. It derails their process. People also don’t like being told what to do, so that is a factor as well,” Mozannar says.

    Providing recommendations alone could harm the user if those recommendations are wrong, he adds. With onboarding, on the other hand, the biggest limitation is the amount of available data. If there aren’t enough data, the onboarding stage won’t be as effective, he says.

    In the future, he and his collaborators want to conduct larger studies to evaluate the short- and long-term effects of onboarding. They also want to leverage unlabeled data for the onboarding process, and find methods to effectively reduce the number of regions without omitting important examples.

    “People are adopting AI systems willy-nilly, and indeed AI offers great potential, but these AI agents still sometimes make mistakes. Thus, it’s crucial for AI developers to devise methods that help humans know when it’s safe to rely on the AI’s suggestions,” says Dan Weld, professor emeritus at the Paul G. Allen School of Computer Science and Engineering at the University of Washington, who was not involved with this research. “Mozannar et al. have created an innovative method for identifying situations where the AI is trustworthy, and (importantly) to describe them to people in a way that leads to better human-AI team interactions.”

    This work is funded, in part, by the MIT-IBM Watson AI Lab.

  • in

    AI accelerates problem-solving in complex scenarios

    While Santa Claus may have a magical sleigh and nine plucky reindeer to help him deliver presents, for companies like FedEx, the optimization problem of efficiently routing holiday packages is so complicated that they often employ specialized software to find a solution.

    This software, called a mixed-integer linear programming (MILP) solver, splits a massive optimization problem into smaller pieces and uses generic algorithms to try to find the best solution. However, the solver could take hours, or even days, to arrive at a solution.

    The process is so onerous that a company often must stop the software partway through, accepting a solution that is not ideal but the best that could be generated in a set amount of time.

    Researchers from MIT and ETH Zurich used machine learning to speed things up.

    They identified a key intermediate step in MILP solvers that has so many potential solutions it takes an enormous amount of time to unravel, which slows the entire process. The researchers employed a filtering technique to simplify this step, then used machine learning to find the optimal solution for a specific type of problem.

    Their data-driven approach enables a company to use its own data to tailor a general-purpose MILP solver to the problem at hand.

    This new technique sped up MILP solvers between 30 and 70 percent, without any drop in accuracy. One could use this method to obtain an optimal solution more quickly or, for especially complex problems, a better solution in a tractable amount of time.

    This approach could be used wherever MILP solvers are employed, such as by ride-hailing services, electric grid operators, vaccination distributors, or any entity faced with a thorny resource-allocation problem.

    “Sometimes, in a field like optimization, it is very common for folks to think of solutions as either purely machine learning or purely classical. I am a firm believer that we want to get the best of both worlds, and this is a really strong instantiation of that hybrid approach,” says senior author Cathy Wu, the Gilbert W. Winslow Career Development Assistant Professor in Civil and Environmental Engineering (CEE), and a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS).

    Wu wrote the paper with co-lead authors Sirui Li, an IDSS graduate student, and Wenbin Ouyang, a CEE graduate student; as well as Max Paulus, a graduate student at ETH Zurich. The research will be presented at the Conference on Neural Information Processing Systems.

    Tough to solve

    MILP problems have an exponential number of potential solutions. For instance, say a traveling salesperson wants to find the shortest path to visit several cities and then return to their city of origin. If there are many cities which could be visited in any order, the number of potential solutions might be greater than the number of atoms in the universe.  
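
    A quick calculation shows how fast this count grows. If the starting city is fixed and a route and its reverse count as the same tour, n cities admit (n - 1)!/2 distinct round trips. The sketch below compares that count with 10^80, a commonly cited rough estimate of the number of atoms in the observable universe; both the estimate and the function names are assumptions introduced for illustration.

```python
from math import factorial

ATOMS_IN_UNIVERSE = 10**80  # commonly cited order-of-magnitude estimate


def tour_count(n_cities):
    """Distinct round trips through n cities: fix the start, ignore direction."""
    if n_cities < 3:
        return 1
    return factorial(n_cities - 1) // 2


def cities_needed_to_exceed(limit):
    """Smallest number of cities whose tour count is larger than `limit`."""
    n = 3
    while tour_count(n) <= limit:
        n += 1
    return n


print(tour_count(5))                               # 12 tours for 5 cities
print(cities_needed_to_exceed(ATOMS_IN_UNIVERSE))  # 61 cities
```

    Under these assumptions, about 61 cities are enough for the number of possible tours to exceed the atom estimate, which is why exhaustive search is not an option.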

    “These problems are called NP-hard, which means it is very unlikely there is an efficient algorithm to solve them. When the problem is big enough, we can only hope to achieve some suboptimal performance,” Wu explains.

    An MILP solver employs an array of techniques and practical tricks that can achieve reasonable solutions in a tractable amount of time.

    A typical solver uses a divide-and-conquer approach, first splitting the space of potential solutions into smaller pieces with a technique called branching. Then, the solver employs a technique called cutting to tighten up these smaller pieces so they can be searched faster.

    Cutting uses a set of rules that tighten the search space without removing any feasible solutions. These rules are generated by a few dozen algorithms, known as separators, that have been created for different kinds of MILP problems. 

    Wu and her team found that the process of identifying the ideal combination of separator algorithms to use is, in itself, a problem with an exponential number of solutions.

    “Separator management is a core part of every solver, but this is an underappreciated aspect of the problem space. One of the contributions of this work is identifying the problem of separator management as a machine learning task to begin with,” she says.

    Shrinking the solution space

    She and her collaborators devised a filtering mechanism that reduces this separator search space from more than 130,000 potential combinations to around 20 options. This filtering mechanism draws on the principle of diminishing marginal returns, which says that the most benefit would come from a small set of algorithms, and adding additional algorithms won’t bring much extra improvement.
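
    The diminishing-returns idea can be sketched as a greedy selection: repeatedly keep the configuration that adds the most benefit on top of what is already kept, and stop once the extra benefit becomes negligible. This is an illustrative sketch, not the researchers' filtering procedure. The `scores` table, the configuration names, and the `min_gain` threshold are hypothetical. For a sense of scale, 17 on/off choices would give 2**17 = 131,072 combinations, which is the same order as the figure above; whether that is where the figure comes from is an assumption.

```python
def marginal_gain(config_scores, best_so_far):
    """Average extra improvement a configuration adds over the current best."""
    extra = [max(s - b, 0.0) for s, b in zip(config_scores, best_so_far)]
    return sum(extra) / len(extra)


def greedy_filter(scores, max_keep=20, min_gain=1e-3):
    """scores[c][i]: improvement of configuration c on instance i (made-up data)."""
    n_instances = len(next(iter(scores.values())))
    best = [0.0] * n_instances  # best improvement per instance so far
    kept = []
    while len(kept) < max_keep:
        candidates = [c for c in scores if c not in kept]
        if not candidates:
            break
        choice = max(candidates, key=lambda c: marginal_gain(scores[c], best))
        if marginal_gain(scores[choice], best) < min_gain:
            break  # diminishing returns: nothing left adds much
        kept.append(choice)
        best = [max(s, b) for s, b in zip(scores[choice], best)]
    return kept


scores = {"A": [0.5, 0.0], "B": [0.0, 0.4], "C": [0.3, 0.1]}
print(greedy_filter(scores))  # ['A', 'B']
```

    In this toy table, configuration C is dropped because A and B together already cover both instances at least as well.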

    Then they use a machine-learning model to pick the best combination of algorithms from among the 20 remaining options.

    This model is trained with a dataset specific to the user’s optimization problem, so it learns to choose algorithms that best suit the user’s particular task. Since a company like FedEx has solved routing problems many times before, using real data gleaned from past experience should lead to better solutions than starting from scratch each time.

    The model’s iterative learning process, known as contextual bandits, a form of reinforcement learning, involves picking a potential solution, getting feedback on how good it was, and then trying again to find a better solution.
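
    The pick, observe, and retry loop can be made concrete with a minimal epsilon-greedy contextual bandit. This is a generic textbook sketch rather than the model used in the research. The contexts, configuration names, and the `true_reward` function are invented for illustration; in practice the reward would come from measured solver performance.

```python
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.2, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.count = defaultdict(int)   # pulls per (context, arm)
        self.mean = defaultdict(float)  # running average reward per (context, arm)

    def choose(self, context):
        # Explore occasionally; otherwise exploit the best-known arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda a: self.mean[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.count[key] += 1
        self.mean[key] += (reward - self.mean[key]) / self.count[key]


def true_reward(context, arm):
    # Made-up environment: each problem family favors a different configuration.
    best = {"routing": "config_A", "scheduling": "config_B"}
    return 1.0 if best[context] == arm else 0.2


def train(bandit, rounds=1000):
    contexts = ["routing", "scheduling"]
    for t in range(rounds):
        context = contexts[t % 2]
        arm = bandit.choose(context)
        bandit.update(context, arm, true_reward(context, arm))
    return bandit
```

    After `train(EpsilonGreedyBandit(["config_A", "config_B"]))`, the running averages favor a different configuration for each context, which is the sense in which the feedback tailors the choice to the problem at hand.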

    This data-driven approach accelerated MILP solvers between 30 and 70 percent without any drop in accuracy. Moreover, the speedup was similar when they applied it to a simpler, open-source solver and a more powerful, commercial solver.

    In the future, Wu and her collaborators want to apply this approach to even more complex MILP problems, where gathering labeled data to train the model could be especially challenging. Perhaps they can train the model on a smaller dataset and then tweak it to tackle a much larger optimization problem, she says. The researchers are also interested in interpreting the learned model to better understand the effectiveness of different separator algorithms.

    This research is supported, in part, by MathWorks, the National Science Foundation (NSF), the MIT Amazon Science Hub, and MIT’s Research Support Committee.

  • in

    Search algorithm reveals nearly 200 new kinds of CRISPR systems

    Microbial sequence databases contain a wealth of information about enzymes and other molecules that could be adapted for biotechnology. But these databases have grown so large in recent years that they’ve become difficult to search efficiently for enzymes of interest.

    Now, scientists at the McGovern Institute for Brain Research at MIT, the Broad Institute of MIT and Harvard, and the National Center for Biotechnology Information (NCBI) at the National Institutes of Health have developed a new search algorithm that has identified 188 kinds of new rare CRISPR systems in bacterial genomes, encompassing thousands of individual systems. The work appears today in Science.

    The algorithm, which comes from the lab of pioneering CRISPR researcher Professor Feng Zhang, uses big-data clustering approaches to rapidly search massive amounts of genomic data. The team used their algorithm, called Fast Locality-Sensitive Hashing-based clustering (FLSHclust) to mine three major public databases that contain data from a wide range of unusual bacteria, including ones found in coal mines, breweries, Antarctic lakes, and dog saliva. The scientists found a surprising number and diversity of CRISPR systems, including ones that could make edits to DNA in human cells, others that can target RNA, and many with a variety of other functions.

    The new systems could potentially be harnessed to edit mammalian cells with fewer off-target effects than current Cas9 systems. They could also one day be used as diagnostics or serve as molecular records of activity inside cells.

    The researchers say their search highlights an unprecedented level of diversity and flexibility of CRISPR and that there are likely many more rare systems yet to be discovered as databases continue to grow.

    “Biodiversity is such a treasure trove, and as we continue to sequence more genomes and metagenomic samples, there is a growing need for better tools, like FLSHclust, to search that sequence space to find the molecular gems,” says Zhang, a co-senior author on the study and the James and Patricia Poitras Professor of Neuroscience at MIT with joint appointments in the departments of Brain and Cognitive Sciences and Biological Engineering. Zhang is also an investigator at the McGovern Institute for Brain Research at MIT, a core institute member at the Broad, and an investigator at the Howard Hughes Medical Institute. Eugene Koonin, a distinguished investigator at the NCBI, is co-senior author on the study as well.

    Searching for CRISPR

    CRISPR, which stands for clustered regularly interspaced short palindromic repeats, is a bacterial defense system that has been engineered into many tools for genome editing and diagnostics.

    To mine databases of protein and nucleic acid sequences for novel CRISPR systems, the researchers developed an algorithm based on an approach borrowed from the big data community. This technique, called locality-sensitive hashing, clusters together objects that are similar but not exactly identical. Using this approach allowed the team to probe billions of protein and DNA sequences — from the NCBI, its Whole Genome Shotgun database, and the Joint Genome Institute — in weeks, whereas previous methods that look for identical objects would have taken months. They designed their algorithm to look for genes associated with CRISPR.
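
    To show what locality-sensitive hashing does in general, here is a minimal MinHash sketch with banding, a standard textbook form of the technique. It is not FLSHclust, whose internals the article does not describe. The k-mer size, the number of bands and rows, and all function names are assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(seq, k=3):
    """Set of overlapping k-letter substrings (k-mers) of a sequence."""
    if len(seq) < k:
        return {seq}
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def stable_hash(seed, text):
    # hashlib gives the same value on every run, unlike Python's built-in hash().
    digest = hashlib.blake2b(f"{seed}:{text}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(seq, num_hashes, k=3):
    kmers = shingles(seq, k)
    return tuple(min(stable_hash(seed, s) for s in kmers)
                 for seed in range(num_hashes))


def candidate_pairs(seqs, bands=8, rows=2):
    """Names of sequences that share at least one signature band."""
    buckets = defaultdict(set)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, bands * rows)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs
```

    Sequences that share most of their k-mers are likely to land in a common bucket, while unrelated sequences almost never do, so only candidate pairs need a detailed comparison instead of every pair in the database.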

    “This new algorithm allows us to parse through data in a time frame that’s short enough that we can actually recover results and make biological hypotheses,” says Soumya Kannan PhD ’23, who is a co-first author on the study. Kannan was a graduate student in Zhang’s lab when the study began and is currently a postdoc and Junior Fellow at Harvard University. Han Altae-Tran PhD ’23, a graduate student in Zhang’s lab during the study and currently a postdoc at the University of Washington, was the study’s other co-first author.

    “This is a testament to what you can do when you improve on the methods for exploration and use as much data as possible,” says Altae-Tran. “It’s really exciting to be able to improve the scale at which we search.”

    New systems

    In their analysis, Altae-Tran, Kannan, and their colleagues noticed that the thousands of CRISPR systems they found fell into a few existing and many new categories. They studied several of the new systems in greater detail in the lab.

    They found several new variants of known Type I CRISPR systems, which use a guide RNA that is 32 base pairs long rather than the 20-nucleotide guide of Cas9. Because of their longer guide RNAs, these Type I systems could potentially be used to develop more precise gene-editing technology that is less prone to off-target editing. Zhang’s team showed that two of these systems could make short edits in the DNA of human cells. And because these Type I systems are similar in size to CRISPR-Cas9, they could likely be delivered to cells in animals or humans using the same gene-delivery technologies being used today for CRISPR.

    One of the Type I systems also showed “collateral activity” — broad degradation of nucleic acids after the CRISPR protein binds its target. Scientists have used similar systems to make infectious disease diagnostics such as SHERLOCK, a tool capable of rapidly sensing a single molecule of DNA or RNA. Zhang’s team thinks the new systems could be adapted for diagnostic technologies as well.

    The researchers also uncovered new mechanisms of action for some Type IV CRISPR systems, and a Type VII system that precisely targets RNA, which could potentially be used in RNA editing. Other systems could potentially be used as recording tools — a molecular document of when a gene was expressed — or as sensors of specific activity in a living cell.

    Mining data

    The scientists say their algorithm could aid in the search for other biochemical systems. “This search algorithm could be used by anyone who wants to work with these large databases for studying how proteins evolve or discovering new genes,” Altae-Tran says.

    The researchers add that their findings illustrate not only how diverse CRISPR systems are, but also that most are rare and only found in unusual bacteria. “Some of these microbial systems were exclusively found in water from coal mines,” Kannan says. “If someone hadn’t been interested in that, we may never have seen those systems. Broadening our sampling diversity is really important to continue expanding the diversity of what we can discover.”

    This work was supported by the Howard Hughes Medical Institute; the K. Lisa Yang and Hock E. Tan Molecular Therapeutics Center at MIT; Broad Institute Programmable Therapeutics Gift Donors; The Pershing Square Foundation, William Ackman and Neri Oxman; James and Patricia Poitras; BT Charitable Foundation; Asness Family Foundation; Kenneth C. Griffin; the Phillips family; David Cheng; and Robert Metcalfe.

  • in

    Synthetic imagery sets new bar in AI training efficiency

    Data is the new soil, and in this fertile new ground, MIT researchers are planting more than just pixels. By using synthetic images to train machine learning models, a team of scientists recently surpassed results obtained from traditional “real-image” training methods. 

    At the core of the approach is a system called StableRep, which doesn’t just use any synthetic images; it generates them through ultra-popular text-to-image models like Stable Diffusion. It’s like creating worlds with words. 

    So what’s in StableRep’s secret sauce? A strategy called “multi-positive contrastive learning.”

    “We’re teaching the model to learn more about high-level concepts through context and variance, not just feeding it data,” says Lijie Fan, MIT PhD student in electrical engineering, affiliate of the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL), and lead researcher on the work. “When multiple images, all generated from the same text, are all treated as depictions of the same underlying thing, the model dives deeper into the concepts behind the images, say the object, not just their pixels.”

    This approach treats multiple images generated from the same text prompt as positive pairs. That provides additional information during training: it not only adds diversity but also tells the vision system which images are alike and which are different. Remarkably, StableRep outperformed top-tier models trained on real images, such as SimCLR and CLIP, on extensive datasets.
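
    One common way to write a loss with several positives per anchor is a cross-entropy between a softmax over similarities and a target that is uniform over the images sharing the anchor's caption. The sketch below follows that general form in plain Python. It is an assumption about the shape of such a loss, not StableRep's training code, and the `temperature` value and the toy embeddings are invented.

```python
from math import exp, log, sqrt


def normalize(v):
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def multi_positive_loss(embeddings, caption_ids, temperature=0.1):
    """Average cross-entropy between similarity softmax and same-caption targets."""
    z = [normalize(v) for v in embeddings]
    n = len(z)
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Cosine similarity of the anchor to every other image, scaled.
        logits = [sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in others]
        top = max(logits)
        log_norm = top + log(sum(exp(l - top) for l in logits))
        positives = [k for k, j in enumerate(others)
                     if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an image with no same-caption partner adds no loss
        total -= sum(logits[k] - log_norm for k in positives) / len(positives)
    return total / n


images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = multi_positive_loss(images, [0, 0, 1, 1])   # similar images share a caption
shuffled = multi_positive_loss(images, [0, 1, 0, 1])  # captions paired with dissimilar images
print(matched < shuffled)  # True
```

    The loss is lower when images from the same caption sit close together in embedding space, which is the signal that pushes the model toward the shared concept instead of pixel-level detail.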

    “While StableRep helps mitigate the challenges of data acquisition in machine learning, it also ushers in a stride towards a new era of AI training techniques. The capacity to produce high-caliber, diverse synthetic images on command could help curtail cumbersome expenses and resources,” says Fan. 

    The process of data collection has never been straightforward. Back in the 1990s, researchers had to manually capture photographs to assemble datasets for objects and faces. The 2000s saw individuals scouring the internet for data. However, this raw, uncurated data often contained discrepancies when compared to real-world scenarios and reflected societal biases, presenting a distorted view of reality. The task of cleansing datasets through human intervention is not only expensive, but also exceedingly challenging. Imagine, though, if this arduous data collection could be distilled down to something as simple as issuing a command in natural language. 

    A pivotal aspect of StableRep’s triumph is the adjustment of the “guidance scale” in the generative model, which ensures a delicate balance between the synthetic images’ diversity and fidelity. When finely tuned, synthetic images used in training these self-supervised models were found to be as effective, if not more so, than real images.

    Taking it a step further, language supervision was added to the mix, creating an enhanced variant: StableRep+. When trained with 20 million synthetic images, StableRep+ not only achieved superior accuracy but also displayed remarkable efficiency compared to CLIP models trained with a staggering 50 million real images.

    Yet, the path ahead isn’t without its potholes. The researchers candidly address several limitations, including the current slow pace of image generation, semantic mismatches between text prompts and the resultant images, potential amplification of biases, and complexities in image attribution, all of which are imperative to address for future advancements. Another issue is that StableRep requires first training the generative model on large-scale real data. The team acknowledges that starting with real data remains a necessity; however, when you have a good generative model, you can repurpose it for new tasks, like training recognition models and visual representations. 

    While StableRep offers a good solution by diminishing the dependency on vast real-image collections, it brings to the fore concerns regarding hidden biases within the uncurated data used for these text-to-image models. The choice of text prompts, integral to the image synthesis process, is not entirely free from bias, “indicating the essential role of meticulous text selection or possible human curation,” says Fan. 

    “Using the latest text-to-image models, we’ve gained unprecedented control over image generation, allowing for a diverse range of visuals from a single text input. This surpasses real-world image collection in efficiency and versatility. It proves especially useful in specialized tasks, like balancing image variety in long-tail recognition, presenting a practical supplement to using real images for training,” says Fan. “Our work signifies a step forward in visual learning, towards the goal of offering cost-effective training alternatives while highlighting the need for ongoing improvements in data quality and synthesis.”

    “One dream of generative model learning has long been to be able to generate data useful for discriminative model training,” says Google DeepMind researcher and University of Toronto professor of computer science David Fleet, who was not involved in the paper. “While we have seen some signs of life, the dream has been elusive, especially on large-scale complex domains like high-resolution images. This paper provides compelling evidence, for the first time to my knowledge, that the dream is becoming a reality. They show that contrastive learning from massive amounts of synthetic image data can produce representations that outperform those learned from real data at scale, with the potential to improve myriad downstream vision tasks.”

    Fan and Yonglong Tian PhD ’22 are lead authors of the paper. They are joined by MIT associate professor of electrical engineering and computer science and CSAIL principal investigator Phillip Isola; Google researcher and OpenAI technical staff member Huiwen Chang; and Google staff research scientist Dilip Krishnan. The team will present StableRep at the 2023 Conference on Neural Information Processing Systems (NeurIPS) in New Orleans.