More stories

  • in

    Researchers use large language models to help robots navigate

    Someday, you may want your home robot to carry a load of dirty clothes downstairs and deposit them in the washing machine in the far-left corner of the basement. The robot will need to combine your instructions with its visual observations to determine the steps it should take to complete this task.For an AI agent, this is easier said than done. Current approaches often utilize multiple hand-crafted machine-learning models to tackle different parts of the task, which require a great deal of human effort and expertise to build. These methods, which use visual representations to directly make navigation decisions, demand massive amounts of visual data for training, which are often hard to come by.To overcome these challenges, researchers from MIT and the MIT-IBM Watson AI Lab devised a navigation method that converts visual representations into pieces of language, which are then fed into one large language model that achieves all parts of the multistep navigation task.Rather than encoding visual features from images of a robot’s surroundings as visual representations, which is computationally intensive, their method creates text captions that describe the robot’s point-of-view. A large language model uses the captions to predict the actions a robot should take to fulfill a user’s language-based instructions.Because their method utilizes purely language-based representations, they can use a large language model to efficiently generate a huge amount of synthetic training data.While this approach does not outperform techniques that use visual features, it performs well in situations that lack enough visual data for training. The researchers found that combining their language-based inputs with visual signals leads to better navigation performance.“By purely using language as the perceptual representation, ours is a more straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this approach.Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and others at the MIT-IBM Watson AI Lab and Dartmouth College. The research will be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.Solving a vision problem with languageSince large language models are the most powerful machine-learning models available, the researchers sought to incorporate them into the complex task known as vision-and-language navigation, Pan says.But such models take text-based inputs and can’t process visual data from a robot’s camera. So, the team needed to find a way to use language instead.Their technique utilizes a simple captioning model to obtain text descriptions of a robot’s visual observations. These captions are combined with language-based instructions and fed into a large language model, which decides what navigation step the robot should take next.The large language model outputs a caption of the scene the robot should see after completing that step. This is used to update the trajectory history so the robot can keep track of where it has been.The model repeats these processes to generate a trajectory that guides the robot to its goal, one step at a time.To streamline the process, the researchers designed templates so observation information is presented to the model in a standard form — as a series of choices the robot can make based on its surroundings.For instance, a caption might say “to your 30-degree left is a door with a potted plant beside it, to your back is a small office with a desk and a computer,” etc. The model chooses whether the robot should move toward the door or the office.“One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way to make the agent understand what the task is and how they should respond,” Pan says.Advantages of languageWhen they tested this approach, while it could not outperform vision-based techniques, they found that it offered several advantages.First, because text requires fewer computational resources to synthesize than complex image data, their method can be used to rapidly generate synthetic training data. In one test, they generated 10,000 synthetic trajectories based on 10 real-world, visual trajectories.The technique can also bridge the gap that can prevent an agent trained with a simulated environment from performing well in the real world. This gap often occurs because computer-generated images can appear quite different from real-world scenes due to elements like lighting or color. But language that describes a synthetic versus a real image would be much harder to tell apart, Pan says. Also, the representations their model uses are easier for a human to understand because they are written in natural language.“If the agent fails to reach its goal, we can more easily determine where it failed and why it failed. Maybe the history information is not clear enough or the observation ignores some important details,” Pan says.In addition, their method could be applied more easily to varied tasks and environments because it uses only one type of input. As long as data can be encoded as language, they can use the same model without making any modifications.But one disadvantage is that their method naturally loses some information that would be captured by vision-based models, such as depth information.However, the researchers were surprised to see that combining language-based representations with vision-based methods improves an agent’s ability to navigate.“Maybe this means that language can capture some higher-level information than cannot be captured with pure vision features,” he says.This is one area the researchers want to continue exploring. They also want to develop a navigation-oriented captioner that could boost the method’s performance. In addition, they want to probe the ability of large language models to exhibit spatial awareness and see how this could aid language-based navigation.This research is funded, in part, by the MIT-IBM Watson AI Lab. More

  • in

    Looking for a specific action in a video? This AI-based method can find it for you

    The internet is awash in instructional videos that can teach curious viewers everything from cooking the perfect pancake to performing a life-saving Heimlich maneuver.But pinpointing when and where a particular action happens in a long video can be tedious. To streamline the process, scientists are trying to teach computers to perform this task. Ideally, a user could just describe the action they’re looking for, and an AI model would skip to its location in the video.However, teaching machine-learning models to do this usually requires a great deal of expensive video data that have been painstakingly hand-labeled.A new, more efficient approach from researchers at MIT and the MIT-IBM Watson AI Lab trains a model to perform this task, known as spatio-temporal grounding, using only videos and their automatically generated transcripts.The researchers teach a model to understand an unlabeled video in two distinct ways: by looking at small details to figure out where objects are located (spatial information) and looking at the bigger picture to understand when the action occurs (temporal information).Compared to other AI approaches, their method more accurately identifies actions in longer videos with multiple activities. Interestingly, they found that simultaneously training on spatial and temporal information makes a model better at identifying each individually.In addition to streamlining online learning and virtual training processes, this technique could also be useful in health care settings by rapidly finding key moments in videos of diagnostic procedures, for example.“We disentangle the challenge of trying to encode spatial and temporal information all at once and instead think about it like two experts working on their own, which turns out to be a more explicit way to encode the information. Our model, which combines these two separate branches, leads to the best performance,” says Brian Chen, lead author of a paper on this technique.Chen, a 2023 graduate of Columbia University who conducted this research while a visiting student at the MIT-IBM Watson AI Lab, is joined on the paper by James Glass, senior research scientist, member of the MIT-IBM Watson AI Lab, and head of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Hilde Kuehne, a member of the MIT-IBM Watson AI Lab who is also affiliated with Goethe University Frankfurt; and others at MIT, Goethe University, the MIT-IBM Watson AI Lab, and Quality Match GmbH. The research will be presented at the Conference on Computer Vision and Pattern Recognition.Global and local learningResearchers usually teach models to perform spatio-temporal grounding using videos in which humans have annotated the start and end times of particular tasks.Not only is generating these data expensive, but it can be difficult for humans to figure out exactly what to label. If the action is “cooking a pancake,” does that action start when the chef begins mixing the batter or when she pours it into the pan?“This time, the task may be about cooking, but next time, it might be about fixing a car. There are so many different domains for people to annotate. But if we can learn everything without labels, it is a more general solution,” Chen says.For their approach, the researchers use unlabeled instructional videos and accompanying text transcripts from a website like YouTube as training data. These don’t need any special preparation.They split the training process into two pieces. For one, they teach a machine-learning model to look at the entire video to understand what actions happen at certain times. This high-level information is called a global representation.For the second, they teach the model to focus on a specific region in parts of the video where action is happening. In a large kitchen, for instance, the model might only need to focus on the wooden spoon a chef is using to mix pancake batter, rather than the entire counter. This fine-grained information is called a local representation.The researchers incorporate an additional component into their framework to mitigate misalignments that occur between narration and video. Perhaps the chef talks about cooking the pancake first and performs the action later.To develop a more realistic solution, the researchers focused on uncut videos that are several minutes long. In contrast, most AI techniques train using few-second clips that someone trimmed to show only one action.A new benchmarkBut when they came to evaluate their approach, the researchers couldn’t find an effective benchmark for testing a model on these longer, uncut videos — so they created one.To build their benchmark dataset, the researchers devised a new annotation technique that works well for identifying multistep actions. They had users mark the intersection of objects, like the point where a knife edge cuts a tomato, rather than drawing a box around important objects.“This is more clearly defined and speeds up the annotation process, which reduces the human labor and cost,” Chen says.Plus, having multiple people do point annotation on the same video can better capture actions that occur over time, like the flow of milk being poured. All annotators won’t mark the exact same point in the flow of liquid.When they used this benchmark to test their approach, the researchers found that it was more accurate at pinpointing actions than other AI techniques.Their method was also better at focusing on human-object interactions. For instance, if the action is “serving a pancake,” many other approaches might focus only on key objects, like a stack of pancakes sitting on a counter. Instead, their method focuses on the actual moment when the chef flips a pancake onto a plate.Next, the researchers plan to enhance their approach so models can automatically detect when text and narration are not aligned, and switch focus from one modality to the other. They also want to extend their framework to audio data, since there are usually strong correlations between actions and the sounds objects make.“AI research has made incredible progress towards creating models like ChatGPT that understand images. But our progress on understanding video is far behind. This work represents a significant step forward in that direction,” says Kate Saenko, a professor in the Department of Computer Science at Boston University who was not involved with this work.This research is funded, in part, by the MIT-IBM Watson AI Lab. More

  • in

    This tiny chip can safeguard user data while enabling efficient computing on a smartphone

    Health-monitoring apps can help people manage chronic diseases or stay on track with fitness goals, using nothing more than a smartphone. However, these apps can be slow and energy-inefficient because the vast machine-learning models that power them must be shuttled between a smartphone and a central memory server.

    Engineers often speed things up using hardware that reduces the need to move so much data back and forth. While these machine-learning accelerators can streamline computation, they are susceptible to attackers who can steal secret information.

    To reduce this vulnerability, researchers from MIT and the MIT-IBM Watson AI Lab created a machine-learning accelerator that is resistant to the two most common types of attacks. Their chip can keep a user’s health records, financial information, or other sensitive data private while still enabling huge AI models to run efficiently on devices.

    The team developed several optimizations that enable strong security while only slightly slowing the device. Moreover, the added security does not impact the accuracy of computations. This machine-learning accelerator could be particularly beneficial for demanding AI applications like augmented and virtual reality or autonomous driving.

    While implementing the chip would make a device slightly more expensive and less energy-efficient, that is sometimes a worthwhile price to pay for security, says lead author Maitreyi Ashok, an electrical engineering and computer science (EECS) graduate student at MIT.

    “It is important to design with security in mind from the ground up. If you are trying to add even a minimal amount of security after a system has been designed, it is prohibitively expensive. We were able to effectively balance a lot of these tradeoffs during the design phase,” says Ashok.

    Her co-authors include Saurav Maji, an EECS graduate student; Xin Zhang and John Cohn of the MIT-IBM Watson AI Lab; and senior author Anantha Chandrakasan, MIT’s chief innovation and strategy officer, dean of the School of Engineering, and the Vannevar Bush Professor of EECS. The research will be presented at the IEEE Custom Integrated Circuits Conference.

    Side-channel susceptibility

    The researchers targeted a type of machine-learning accelerator called digital in-memory compute. A digital IMC chip performs computations inside a device’s memory, where pieces of a machine-learning model are stored after being moved over from a central server.

    The entire model is too big to store on the device, but by breaking it into pieces and reusing those pieces as much as possible, IMC chips reduce the amount of data that must be moved back and forth.

    But IMC chips can be susceptible to hackers. In a side-channel attack, a hacker monitors the chip’s power consumption and uses statistical techniques to reverse-engineer data as the chip computes. In a bus-probing attack, the hacker can steal bits of the model and dataset by probing the communication between the accelerator and the off-chip memory.

    Digital IMC speeds computation by performing millions of operations at once, but this complexity makes it tough to prevent attacks using traditional security measures, Ashok says.

    She and her collaborators took a three-pronged approach to blocking side-channel and bus-probing attacks.

    First, they employed a security measure where data in the IMC are split into random pieces. For instance, a bit zero might be split into three bits that still equal zero after a logical operation. The IMC never computes with all pieces in the same operation, so a side-channel attack could never reconstruct the real information.

    But for this technique to work, random bits must be added to split the data. Because digital IMC performs millions of operations at once, generating so many random bits would involve too much computing. For their chip, the researchers found a way to simplify computations, making it easier to effectively split data while eliminating the need for random bits.

    Second, they prevented bus-probing attacks using a lightweight cipher that encrypts the model stored in off-chip memory. This lightweight cipher only requires simple computations. In addition, they only decrypted the pieces of the model stored on the chip when necessary.

    Third, to improve security, they generated the key that decrypts the cipher directly on the chip, rather than moving it back and forth with the model. They generated this unique key from random variations in the chip that are introduced during manufacturing, using what is known as a physically unclonable function.

    “Maybe one wire is going to be a little bit thicker than another. We can use these variations to get zeros and ones out of a circuit. For every chip, we can get a random key that should be consistent because these random properties shouldn’t change significantly over time,” Ashok explains.

    They reused the memory cells on the chip, leveraging the imperfections in these cells to generate the key. This requires less computation than generating a key from scratch.

    “As security has become a critical issue in the design of edge devices, there is a need to develop a complete system stack focusing on secure operation. This work focuses on security for machine-learning workloads and describes a digital processor that uses cross-cutting optimization. It incorporates encrypted data access between memory and processor, approaches to preventing side-channel attacks using randomization, and exploiting variability to generate unique codes. Such designs are going to be critical in future mobile devices,” says Chandrakasan.

    Safety testing

    To test their chip, the researchers took on the role of hackers and tried to steal secret information using side-channel and bus-probing attacks.

    Even after making millions of attempts, they couldn’t reconstruct any real information or extract pieces of the model or dataset. The cipher also remained unbreakable. By contrast, it took only about 5,000 samples to steal information from an unprotected chip.

    The addition of security did reduce the energy efficiency of the accelerator, and it also required a larger chip area, which would make it more expensive to fabricate.

    The team is planning to explore methods that could reduce the energy consumption and size of their chip in the future, which would make it easier to implement at scale.

    “As it becomes too expensive, it becomes harder to convince someone that security is critical. Future work could explore these tradeoffs. Maybe we could make it a little less secure but easier to implement and less expensive,” Ashok says.

    The research is funded, in part, by the MIT-IBM Watson AI Lab, the National Science Foundation, and a Mathworks Engineering Fellowship. More

  • in

    Multiple AI models help robots execute complex plans more transparently

    Your daily to-do list is likely pretty straightforward: wash the dishes, buy groceries, and other minutiae. It’s unlikely you wrote out “pick up the first dirty dish,” or “wash that plate with a sponge,” because each of these miniature steps within the chore feels intuitive. While we can routinely complete each step without much thought, a robot requires a complex plan that involves more detailed outlines.

    MIT’s Improbable AI Lab, a group within the Computer Science and Artificial Intelligence Laboratory (CSAIL), has offered these machines a helping hand with a new multimodal framework: Compositional Foundation Models for Hierarchical Planning (HiP), which develops detailed, feasible plans with the expertise of three different foundation models. Like OpenAI’s GPT-4, the foundation model that ChatGPT and Bing Chat were built upon, these foundation models are trained on massive quantities of data for applications like generating images, translating text, and robotics.Unlike RT2 and other multimodal models that are trained on paired vision, language, and action data, HiP uses three different foundation models each trained on different data modalities. Each foundation model captures a different part of the decision-making process and then works together when it’s time to make decisions. HiP removes the need for access to paired vision, language, and action data, which is difficult to obtain. HiP also makes the reasoning process more transparent.

    What’s considered a daily chore for a human can be a robot’s “long-horizon goal” — an overarching objective that involves completing many smaller steps first — requiring sufficient data to plan, understand, and execute objectives. While computer vision researchers have attempted to build monolithic foundation models for this problem, pairing language, visual, and action data is expensive. Instead, HiP represents a different, multimodal recipe: a trio that cheaply incorporates linguistic, physical, and environmental intelligence into a robot.

    “Foundation models do not have to be monolithic,” says NVIDIA AI researcher Jim Fan, who was not involved in the paper. “This work decomposes the complex task of embodied agent planning into three constituent models: a language reasoner, a visual world model, and an action planner. It makes a difficult decision-making problem more tractable and transparent.”The team believes that their system could help these machines accomplish household chores, such as putting away a book or placing a bowl in the dishwasher. Additionally, HiP could assist with multistep construction and manufacturing tasks, like stacking and placing different materials in specific sequences.Evaluating HiP

    The CSAIL team tested HiP’s acuity on three manipulation tasks, outperforming comparable frameworks. The system reasoned by developing intelligent plans that adapt to new information.

    First, the researchers requested that it stack different-colored blocks on each other and then place others nearby. The catch: Some of the correct colors weren’t present, so the robot had to place white blocks in a color bowl to paint them. HiP often adjusted to these changes accurately, especially compared to state-of-the-art task planning systems like Transformer BC and Action Diffuser, by adjusting its plans to stack and place each square as needed.

    Another test: arranging objects such as candy and a hammer in a brown box while ignoring other items. Some of the objects it needed to move were dirty, so HiP adjusted its plans to place them in a cleaning box, and then into the brown container. In a third demonstration, the bot was able to ignore unnecessary objects to complete kitchen sub-goals such as opening a microwave, clearing a kettle out of the way, and turning on a light. Some of the prompted steps had already been completed, so the robot adapted by skipping those directions.

    A three-pronged hierarchy

    HiP’s three-pronged planning process operates as a hierarchy, with the ability to pre-train each of its components on different sets of data, including information outside of robotics. At the bottom of that order is a large language model (LLM), which starts to ideate by capturing all the symbolic information needed and developing an abstract task plan. Applying the common sense knowledge it finds on the internet, the model breaks its objective into sub-goals. For example, “making a cup of tea” turns into “filling a pot with water,” “boiling the pot,” and the subsequent actions required.

    “All we want to do is take existing pre-trained models and have them successfully interface with each other,” says Anurag Ajay, a PhD student in the MIT Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Instead of pushing for one model to do everything, we combine multiple ones that leverage different modalities of internet data. When used in tandem, they help with robotic decision-making and can potentially aid with tasks in homes, factories, and construction sites.”

    These models also need some form of “eyes” to understand the environment they’re operating in and correctly execute each sub-goal. The team used a large video diffusion model to augment the initial planning completed by the LLM, which collects geometric and physical information about the world from footage on the internet. In turn, the video model generates an observation trajectory plan, refining the LLM’s outline to incorporate new physical knowledge.This process, known as iterative refinement, allows HiP to reason about its ideas, taking in feedback at each stage to generate a more practical outline. The flow of feedback is similar to writing an article, where an author may send their draft to an editor, and with those revisions incorporated in, the publisher reviews for any last changes and finalizes.

    In this case, the top of the hierarchy is an egocentric action model, or a sequence of first-person images that infer which actions should take place based on its surroundings. During this stage, the observation plan from the video model is mapped over the space visible to the robot, helping the machine decide how to execute each task within the long-horizon goal. If a robot uses HiP to make tea, this means it will have mapped out exactly where the pot, sink, and other key visual elements are, and begin completing each sub-goal.Still, the multimodal work is limited by the lack of high-quality video foundation models. Once available, they could interface with HiP’s small-scale video models to further enhance visual sequence prediction and robot action generation. A higher-quality version would also reduce the current data requirements of the video models.That being said, the CSAIL team’s approach only used a tiny bit of data overall. Moreover, HiP was cheap to train and demonstrated the potential of using readily available foundation models to complete long-horizon tasks. “What Anurag has demonstrated is proof-of-concept of how we can take models trained on separate tasks and data modalities and combine them into models for robotic planning. In the future, HiP could be augmented with pre-trained models that can process touch and sound to make better plans,” says senior author Pulkit Agrawal, MIT assistant professor in EECS and director of the Improbable AI Lab. The group is also considering applying HiP to solving real-world long-horizon tasks in robotics.Ajay and Agrawal are lead authors on a paper describing the work. They are joined by MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du ’19; former postdoc Abhishek Gupta, who is now assistant professor at University of Washington; and former graduate student Shuang Li PhD ’23.

    The team’s work was supported, in part, by the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS). More

  • in

    Technique could efficiently solve partial differential equations for numerous applications

    In fields such as physics and engineering, partial differential equations (PDEs) are used to model complex physical processes to generate insight into how some of the most complicated physical and natural systems in the world function.

    To solve these difficult equations, researchers use high-fidelity numerical solvers, which can be very time-consuming and computationally expensive to run. The current simplified alternative, data-driven surrogate models, compute the goal property of a solution to PDEs rather than the whole solution. Those are trained on a set of data that has been generated by the high-fidelity solver, to predict the output of the PDEs for new inputs. This is data-intensive and expensive because complex physical systems require a large number of simulations to generate enough data. 

    In a new paper, “Physics-enhanced deep surrogates for partial differential equations,” published in December in Nature Machine Intelligence, a new method is proposed for developing data-driven surrogate models for complex physical systems in such fields as mechanics, optics, thermal transport, fluid dynamics, physical chemistry, and climate models.

    The paper was authored by MIT’s professor of applied mathematics Steven G. Johnson along with Payel Das and Youssef Mroueh of the MIT-IBM Watson AI Lab and IBM Research; Chris Rackauckas of Julia Lab; and Raphaël Pestourie, a former MIT postdoc who is now at Georgia Tech. The authors call their method “physics-enhanced deep surrogate” (PEDS), which combines a low-fidelity, explainable physics simulator with a neural network generator. The neural network generator is trained end-to-end to match the output of the high-fidelity numerical solver.

    “My aspiration is to replace the inefficient process of trial and error with systematic, computer-aided simulation and optimization,” says Pestourie. “Recent breakthroughs in AI like the large language model of ChatGPT rely on hundreds of billions of parameters and require vast amounts of resources to train and evaluate. In contrast, PEDS is affordable to all because it is incredibly efficient in computing resources and has a very low barrier in terms of infrastructure needed to use it.”

    In the article, they show that PEDS surrogates can be up to three times more accurate than an ensemble of feedforward neural networks with limited data (approximately 1,000 training points), and reduce the training data needed by at least a factor of 100 to achieve a target error of 5 percent. Developed using the MIT-designed Julia programming language, this scientific machine-learning method is thus efficient in both computing and data.

    The authors also report that PEDS provides a general, data-driven strategy to bridge the gap between a vast array of simplified physical models with corresponding brute-force numerical solvers modeling complex systems. This technique offers accuracy, speed, data efficiency, and physical insights into the process.

    Says Pestourie, “Since the 2000s, as computing capabilities improved, the trend of scientific models has been to increase the number of parameters to fit the data better, sometimes at the cost of a lower predictive accuracy. PEDS does the opposite by choosing its parameters smartly. It leverages the technology of automatic differentiation to train a neural network that makes a model with few parameters accurate.”

    “The main challenge that prevents surrogate models from being used more widely in engineering is the curse of dimensionality — the fact that the needed data to train a model increases exponentially with the number of model variables,” says Pestourie. “PEDS reduces this curse by incorporating information from the data and from the field knowledge in the form of a low-fidelity model solver.”

    The researchers say that PEDS has the potential to revive a whole body of the pre-2000 literature dedicated to minimal models — intuitive models that PEDS could make more accurate while also being predictive for surrogate model applications.

    “The application of the PEDS framework is beyond what we showed in this study,” says Das. “Complex physical systems governed by PDEs are ubiquitous, from climate modeling to seismic modeling and beyond. Our physics-inspired fast and explainable surrogate models will be of great use in those applications, and play a complementary role to other emerging techniques, like foundation models.”

    The research was supported by the MIT-IBM Watson AI Lab and the U.S. Army Research Office through the Institute for Soldier Nanotechnologies.  More

  • in

    Leveraging language to understand machines

    Natural language conveys ideas, actions, information, and intent through context and syntax; further, there are volumes of it contained in databases. This makes it an excellent source of data to train machine-learning systems on. Two master’s of engineering students in the 6A MEng Thesis Program at MIT, Irene Terpstra ’23 and Rujul Gandhi ’22, are working with mentors in the MIT-IBM Watson AI Lab to use this power of natural language to build AI systems.

    As computing is becoming more advanced, researchers are looking to improve the hardware that they run on; this means innovating to create new computer chips. And, since there is literature already available on modifications that can be made to achieve certain parameters and performance, Terpstra and her mentors and advisors Anantha Chandrakasan, MIT School of Engineering dean and the Vannevar Bush Professor of Electrical Engineering and Computer Science, and IBM’s researcher Xin Zhang, are developing an AI algorithm that assists in chip design.

    “I’m creating a workflow to systematically analyze how these language models can help the circuit design process. What reasoning powers do they have, and how can it be integrated into the chip design process?” says Terpstra. “And then on the other side, if that proves to be useful enough, [we’ll] see if they can automatically design the chips themselves, attaching it to a reinforcement learning algorithm.”

    To do this, Terpstra’s team is creating an AI system that can iterate on different designs. It means experimenting with various pre-trained large language models (like ChatGPT, Llama 2, and Bard), using an open-source circuit simulator language called NGspice, which has the parameters of the chip in code form, and a reinforcement learning algorithm. With text prompts, researchers will be able to query how the physical chip should be modified to achieve a certain goal in the language model and produced guidance for adjustments. This is then transferred into a reinforcement learning algorithm that updates the circuit design and outputs new physical parameters of the chip.

    “The final goal would be to combine the reasoning powers and the knowledge base that is baked into these large language models and combine that with the optimization power of the reinforcement learning algorithms and have that design the chip itself,” says Terpstra.

    Rujul Gandhi works with the raw language itself. As an undergraduate at MIT, Gandhi explored linguistics and computer sciences, putting them together in her MEng work. “I’ve been interested in communication, both between just humans and between humans and computers,” Gandhi says.

    Robots or other interactive AI systems are one area where communication needs to be understood by both humans and machines. Researchers often write instructions for robots using formal logic. This helps ensure that commands are being followed safely and as intended, but formal logic can be difficult for users to understand, while natural language comes easily. To ensure this smooth communication, Gandhi and her advisors Yang Zhang of IBM and MIT assistant professor Chuchu Fan are building a parser that converts natural language instructions into a machine-friendly form. Leveraging the linguistic structure encoded by the pre-trained encoder-decoder model T5, and a dataset of annotated, basic English commands for performing certain tasks, Gandhi’s system identifies the smallest logical units, or atomic propositions, which are present in a given instruction.

    “Once you’ve given your instruction, the model identifies all the smaller sub-tasks you want it to carry out,” Gandhi says. “Then, using a large language model, each sub-task can be compared against the available actions and objects in the robot’s world, and if any sub-task can’t be carried out because a certain object is not recognized, or an action is not possible, the system can stop right there to ask the user for help.”

    This approach of breaking instructions into sub-tasks also allows her system to understand logical dependencies expressed in English, like, “do task X until event Y happens.” Gandhi uses a dataset of step-by-step instructions across robot task domains like navigation and manipulation, with a focus on household tasks. Using data that are written just the way humans would talk to each other has many advantages, she says, because it means a user can be more flexible about how they phrase their instructions.

    Another of Gandhi’s projects involves developing speech models. In the context of speech recognition, some languages are considered “low resource” since they might not have a lot of transcribed speech available, or might not have a written form at all. “One of the reasons I applied to this internship at the MIT-IBM Watson AI Lab was an interest in language processing for low-resource languages,” she says. “A lot of language models today are very data-driven, and when it’s not that easy to acquire all of that data, that’s when you need to use the limited data efficiently.” 

    Speech is just a stream of sound waves, but humans having a conversation can easily figure out where words and thoughts start and end. In speech processing, both humans and language models use their existing vocabulary to recognize word boundaries and understand the meaning. In low- or no-resource languages, a written vocabulary might not exist at all, so researchers can’t provide one to the model. Instead, the model can make note of what sound sequences occur together more frequently than others, and infer that those might be individual words or concepts. In Gandhi’s research group, these inferred words are then collected into a pseudo-vocabulary that serves as a labeling method for the low-resource language, creating labeled data for further applications.

    The applications for language technology are “pretty much everywhere,” Gandhi says. “You could imagine people being able to interact with software and devices in their native language, their native dialect. You could imagine improving all the voice assistants that we use. You could imagine it being used for translation or interpretation.” More

  • in

    Image recognition accuracy: An unseen challenge confounding today’s AI

    Imagine you are scrolling through the photos on your phone and you come across an image that at first you can’t recognize. It looks like maybe something fuzzy on the couch; could it be a pillow or a coat? After a couple of seconds it clicks — of course! That ball of fluff is your friend’s cat, Mocha. While some of your photos could be understood in an instant, why was this cat photo much more difficult?

    MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers were surprised to find that despite the critical importance of understanding visual data in pivotal areas ranging from health care to transportation to household devices, the notion of an image’s recognition difficulty for humans has been almost entirely ignored. One of the major drivers of progress in deep learning-based AI has been datasets, yet we know little about how data drives progress in large-scale deep learning beyond that bigger is better.

    In real-world applications that require understanding visual data, humans outperform object recognition models despite the fact that models perform well on current datasets, including those explicitly designed to challenge machines with debiased images or distribution shifts. This problem persists, in part, because we have no guidance on the absolute difficulty of an image or dataset. Without controlling for the difficulty of images used for evaluation, it’s hard to objectively assess progress toward human-level performance, to cover the range of human abilities, and to increase the challenge posed by a dataset.

    To fill in this knowledge gap, David Mayo, an MIT PhD student in electrical engineering and computer science and a CSAIL affiliate, delved into the deep world of image datasets, exploring why certain images are more difficult for humans and machines to recognize than others. “Some images inherently take longer to recognize, and it’s essential to understand the brain’s activity during this process and its relation to machine learning models. Perhaps there are complex neural circuits or unique mechanisms missing in our current models, visible only when tested with challenging visual stimuli. This exploration is crucial for comprehending and enhancing machine vision models,” says Mayo, a lead author of a new paper on the work.

    This led to the development of a new metric, the “minimum viewing time” (MVT), which quantifies the difficulty of recognizing an image based on how long a person needs to view it before making a correct identification. Using a subset of ImageNet, a popular dataset in machine learning, and ObjectNet, a dataset designed to test object recognition robustness, the team showed images to participants for varying durations from as short as 17 milliseconds to as long as 10 seconds, and asked them to choose the correct object from a set of 50 options. After over 200,000 image presentation trials, the team found that existing test sets, including ObjectNet, appeared skewed toward easier, shorter MVT images, with the vast majority of benchmark performance derived from images that are easy for humans.

    The project identified interesting trends in model performance — particularly in relation to scaling. Larger models showed considerable improvement on simpler images but made less progress on more challenging images. The CLIP models, which incorporate both language and vision, stood out as they moved in the direction of more human-like recognition.

    “Traditionally, object recognition datasets have been skewed towards less-complex images, a practice that has led to an inflation in model performance metrics, not truly reflective of a model’s robustness or its ability to tackle complex visual tasks. Our research reveals that harder images pose a more acute challenge, causing a distribution shift that is often not accounted for in standard evaluations,” says Mayo. “We released image sets tagged by difficulty along with tools to automatically compute MVT, enabling MVT to be added to existing benchmarks and extended to various applications. These include measuring test set difficulty before deploying real-world systems, discovering neural correlates of image difficulty, and advancing object recognition techniques to close the gap between benchmark and real-world performance.”

    “One of my biggest takeaways is that we now have another dimension to evaluate models on. We want models that are able to recognize any image even if — perhaps especially if — it’s hard for a human to recognize. We’re the first to quantify what this would mean. Our results show that not only is this not the case with today’s state of the art, but also that our current evaluation methods don’t have the ability to tell us when it is the case because standard datasets are so skewed toward easy images,” says Jesse Cummings, an MIT graduate student in electrical engineering and computer science and co-first author with Mayo on the paper.

    From ObjectNet to MVT

    A few years ago, the team behind this project identified a significant challenge in the field of machine learning: Models were struggling with out-of-distribution images, or images that were not well-represented in the training data. Enter ObjectNet, a dataset comprised of images collected from real-life settings. The dataset helped illuminate the performance gap between machine learning models and human recognition abilities, by eliminating spurious correlations present in other benchmarks — for example, between an object and its background. ObjectNet illuminated the gap between the performance of machine vision models on datasets and in real-world applications, encouraging use for many researchers and developers — which subsequently improved model performance.

    Fast forward to the present, and the team has taken their research a step further with MVT. Unlike traditional methods that focus on absolute performance, this new approach assesses how models perform by contrasting their responses to the easiest and hardest images. The study further explored how image difficulty could be explained and tested for similarity to human visual processing. Using metrics like c-score, prediction depth, and adversarial robustness, the team found that harder images are processed differently by networks. “While there are observable trends, such as easier images being more prototypical, a comprehensive semantic explanation of image difficulty continues to elude the scientific community,” says Mayo.

    In the realm of health care, for example, the pertinence of understanding visual complexity becomes even more pronounced. The ability of AI models to interpret medical images, such as X-rays, is subject to the diversity and difficulty distribution of the images. The researchers advocate for a meticulous analysis of difficulty distribution tailored for professionals, ensuring AI systems are evaluated based on expert standards, rather than layperson interpretations.

    Mayo and Cummings are currently looking at neurological underpinnings of visual recognition as well, probing into whether the brain exhibits differential activity when processing easy versus challenging images. The study aims to unravel whether complex images recruit additional brain areas not typically associated with visual processing, hopefully helping demystify how our brains accurately and efficiently decode the visual world.

    Toward human-level performance

    Looking ahead, the researchers are not only focused on exploring ways to enhance AI’s predictive capabilities regarding image difficulty. The team is working on identifying correlations with viewing-time difficulty in order to generate harder or easier versions of images.

    Despite the study’s significant strides, the researchers acknowledge limitations, particularly in terms of the separation of object recognition from visual search tasks. The current methodology does concentrate on recognizing objects, leaving out the complexities introduced by cluttered images.

    “This comprehensive approach addresses the long-standing challenge of objectively assessing progress towards human-level performance in object recognition and opens new avenues for understanding and advancing the field,” says Mayo. “With the potential to adapt the Minimum Viewing Time difficulty metric for a variety of visual tasks, this work paves the way for more robust, human-like performance in object recognition, ensuring that models are truly put to the test and are ready for the complexities of real-world visual understanding.”

    “This is a fascinating study of how human perception can be used to identify weaknesses in the ways AI vision models are typically benchmarked, which overestimate AI performance by concentrating on easy images,” says Alan L. Yuille, Bloomberg Distinguished Professor of Cognitive Science and Computer Science at Johns Hopkins University, who was not involved in the paper. “This will help develop more realistic benchmarks leading not only to improvements to AI but also make fairer comparisons between AI and human perception.” 

    “It’s widely claimed that computer vision systems now outperform humans, and on some benchmark datasets, that’s true,” says Anthropic technical staff member Simon Kornblith PhD ’17, who was also not involved in this work. “However, a lot of the difficulty in those benchmarks comes from the obscurity of what’s in the images; the average person just doesn’t know enough to classify different breeds of dogs. This work instead focuses on images that people can only get right if given enough time. These images are generally much harder for computer vision systems, but the best systems are only a bit worse than humans.”

    Mayo, Cummings, and Xinyu Lin MEng ’22 wrote the paper alongside CSAIL Research Scientist Andrei Barbu, CSAIL Principal Research Scientist Boris Katz, and MIT-IBM Watson AI Lab Principal Researcher Dan Gutfreund. The researchers are affiliates of the MIT Center for Brains, Minds, and Machines.

    The team is presenting their work at the 2023 Conference on Neural Information Processing Systems (NeurIPS). More

  • in

    Technique enables AI on edge devices to keep learning over time

    Personalized deep-learning models can enable artificial intelligence chatbots that adapt to understand a user’s accent or smart keyboards that continuously update to better predict the next word based on someone’s typing history. This customization requires constant fine-tuning of a machine-learning model with new data.

    Because smartphones and other edge devices lack the memory and computational power necessary for this fine-tuning process, user data are typically uploaded to cloud servers where the model is updated. But data transmission uses a great deal of energy, and sending sensitive user data to a cloud server poses a security risk.  

    Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere developed a technique that enables deep-learning models to efficiently adapt to new sensor data directly on an edge device.

    Their on-device training method, called PockEngine, determines which parts of a huge machine-learning model need to be updated to improve accuracy, and only stores and computes with those specific pieces. It performs the bulk of these computations while the model is being prepared, before runtime, which minimizes computational overhead and boosts the speed of the fine-tuning process.    

    When compared to other methods, PockEngine significantly sped up on-device training, performing up to 15 times faster on some hardware platforms. Moreover, PockEngine didn’t cause models to have any dip in accuracy. The researchers also found that their fine-tuning method enabled a popular AI chatbot to answer complex questions more accurately.

    “On-device fine-tuning can enable better privacy, lower costs, customization ability, and also lifelong learning, but it is not easy. Everything has to happen with a limited number of resources. We want to be able to run not only inference but also training on an edge device. With PockEngine, now we can,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, a distinguished scientist at NVIDIA, and senior author of an open-access paper describing PockEngine.

    Han is joined on the paper by lead author Ligeng Zhu, an EECS graduate student, as well as others at MIT, the MIT-IBM Watson AI Lab, and the University of California San Diego. The paper was recently presented at the IEEE/ACM International Symposium on Microarchitecture.

    Layer by layer

    Deep-learning models are based on neural networks, which comprise many interconnected layers of nodes, or “neurons,” that process data to make a prediction. When the model is run, a process called inference, a data input (such as an image) is passed from layer to layer until the prediction (perhaps the image label) is output at the end. During inference, each layer no longer needs to be stored after it processes the input.

    But during training and fine-tuning, the model undergoes a process known as backpropagation. In backpropagation, the output is compared to the correct answer, and then the model is run in reverse. Each layer is updated as the model’s output gets closer to the correct answer.

    Because each layer may need to be updated, the entire model and intermediate results must be stored, making fine-tuning more memory demanding than inference

    However, not all layers in the neural network are important for improving accuracy. And even for layers that are important, the entire layer may not need to be updated. Those layers, and pieces of layers, don’t need to be stored. Furthermore, one may not need to go all the way back to the first layer to improve accuracy — the process could be stopped somewhere in the middle.

    PockEngine takes advantage of these factors to speed up the fine-tuning process and cut down on the amount of computation and memory required.

    The system first fine-tunes each layer, one at a time, on a certain task and measures the accuracy improvement after each individual layer. In this way, PockEngine identifies the contribution of each layer, as well as trade-offs between accuracy and fine-tuning cost, and automatically determines the percentage of each layer that needs to be fine-tuned.

    “This method matches the accuracy very well compared to full back propagation on different tasks and different neural networks,” Han adds.

    A pared-down model

    Conventionally, the backpropagation graph is generated during runtime, which involves a great deal of computation. Instead, PockEngine does this during compile time, while the model is being prepared for deployment.

    PockEngine deletes bits of code to remove unnecessary layers or pieces of layers, creating a pared-down graph of the model to be used during runtime. It then performs other optimizations on this graph to further improve efficiency.

    Since all this only needs to be done once, it saves on computational overhead for runtime.

    “It is like before setting out on a hiking trip. At home, you would do careful planning — which trails are you going to go on, which trails are you going to ignore. So then at execution time, when you are actually hiking, you already have a very careful plan to follow,” Han explains.

    When they applied PockEngine to deep-learning models on different edge devices, including Apple M1 Chips and the digital signal processors common in many smartphones and Raspberry Pi computers, it performed on-device training up to 15 times faster, without any drop in accuracy. PockEngine also significantly slashed the amount of memory required for fine-tuning.

    The team also applied the technique to the large language model Llama-V2. With large language models, the fine-tuning process involves providing many examples, and it’s crucial for the model to learn how to interact with users, Han says. The process is also important for models tasked with solving complex problems or reasoning about solutions.

    For instance, Llama-V2 models that were fine-tuned using PockEngine answered the question “What was Michael Jackson’s last album?” correctly, while models that weren’t fine-tuned failed. PockEngine cut the time it took for each iteration of the fine-tuning process from about seven seconds to less than one second on a NVIDIA Jetson Orin, an edge GPU platform.

    In the future, the researchers want to use PockEngine to fine-tune even larger models designed to process text and images together.

    “This work addresses growing efficiency challenges posed by the adoption of large AI models such as LLMs across diverse applications in many different industries. It not only holds promise for edge applications that incorporate larger models, but also for lowering the cost of maintaining and updating large AI models in the cloud,” says Ehry MacRostie, a senior manager in Amazon’s Artificial General Intelligence division who was not involved in this study but works with MIT on related AI research through the MIT-Amazon Science Hub.

    This work was supported, in part, by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT-Amazon Science Hub, the National Science Foundation (NSF), and the Qualcomm Innovation Fellowship. More