More stories

  • Busy GPUs: Sampling and pipelining method speeds up deep learning on large graphs

    Graphs, potentially extensive webs of nodes connected by edges, can be used to express and interrogate relationships between data, like social connections, financial transactions, traffic, energy grids, and molecular interactions. As researchers collect more data and build out these graphical pictures, they will need faster and more efficient methods, as well as more computational power, to conduct deep learning on them in the form of graph neural networks (GNNs).

    Now, a new method, called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves GNN training and inference performance by addressing three key computational bottlenecks. This dramatically cuts down on the runtime of GNNs on large datasets, which, for example, contain on the scale of 100 million nodes and 1 billion edges. Further, the team found that the technique scales well when computational power is added from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.

    “We started to look at the challenges current systems experienced when scaling state-of-the-art machine learning techniques for graphs to really big datasets. It turned out there was a lot of work to be done, because a lot of the existing systems were achieving good performance primarily on smaller datasets that fit into GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

    By vast datasets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships could spell out trends or foul play. “There are nearly a billion Bitcoin transactions on the blockchain, and if we want to identify illicit activities inside such a joint network, then we are facing a graph of such a scale,” says co-author Jie Chen, senior research scientist and manager of IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that is able to handle that kind of graph and allows processing to be as efficient as possible, because every day we want to keep up with the pace of the new data that are generated.”

    Kaler and Chen’s co-authors include Nickolas Stathas MEng ’21 of Jump Trading, who developed SALIENT as part of his graduate work; former MIT-IBM Watson AI Lab intern and MIT graduate student Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; MIT CSAIL Research Scientist Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher with the MIT-IBM Watson AI Lab.     

    For this problem, the team took a systems-oriented approach in developing their method, SALIENT, says Kaler. To do this, the researchers implemented what they saw as important, basic optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building a machine-learning model. Stathas says the process is like swapping out engines to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their fields to expedite model training and tease out insights during inference faster. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that will then be transferred through the data link, the more critical GPU is working to train the machine-learning model or conduct inference.

    The researchers began by analyzing the performance of a commonly used machine-learning library for GNNs (PyTorch Geometric), which showed a startlingly low utilization of available GPU resources. Applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, resulting in a 1.4 to two times performance improvement relative to public benchmark codes. This fast baseline code could execute one complete pass over a large training dataset through the algorithm (an epoch) in 50.4 seconds.                          

    Seeking further performance improvements, the researchers set out to examine the bottlenecks that occur at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information present in other nearby nodes in the graph — for example, in a social network graph, information from friends of friends of a user. As the number of layers in the GNN increases, the number of nodes the network has to reach out to for information can explode, exceeding the limits of a computer. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of these were too slow to keep up with the processing speed of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and so forth that improved sampling speed, ultimately improving the sampling operation alone by about three times, taking the per-epoch runtime from 50.4 to 34.6 seconds. They also found that sampling, at an appropriate rate, can be done during inference, improving overall energy efficiency and performance, a point that had been overlooked in the literature, the team notes.
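
    To make the fanout idea concrete, the sketch below (plain Python, not the researchers’ optimized implementation) keeps at most a fixed number of randomly chosen neighbors per node at each GNN layer; the adjacency list and fanout values are toy examples.

    ```python
    # A toy illustration of fanout-limited neighborhood sampling, the kind of
    # mini-batch preparation described above (not SALIENT's implementation).
    import random

    def sample_neighborhood(adj, seed_nodes, fanouts):
        """adj: node -> list of neighbors; fanouts: max neighbors kept per GNN layer."""
        layers = [set(seed_nodes)]
        frontier = set(seed_nodes)
        for fanout in fanouts:                       # one sampling round per GNN layer
            next_frontier = set()
            for node in frontier:
                neighbors = adj.get(node, [])
                k = min(fanout, len(neighbors))
                next_frontier.update(random.sample(neighbors, k))  # random subset, not all
            layers.append(next_frontier)
            frontier = next_frontier
        return layers                                # nodes needed at each hop

    # Example: a 2-layer GNN that keeps at most 3, then 2, neighbors per node.
    adj = {0: [1, 2, 3, 4], 1: [0, 5], 2: [0, 6, 7], 3: [0], 4: [0], 5: [1], 6: [2], 7: [2]}
    print(sample_neighborhood(adj, seed_nodes=[0], fanouts=[3, 2]))
    ```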

    In previous systems, this sampling step was a multi-process approach, creating extra data and unnecessary data movement between the processes. The researchers made their SALIENT method more nimble by creating a single process with lightweight threads that kept the data on the CPU in shared memory. Further, SALIENT takes advantage of the caches of modern processors, says Stathas, parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime from 34.6 to 27.8 seconds.

    The last bottleneck the researchers addressed was to pipeline mini-batch data transfers between the CPU and GPU using a prefetching step, which would prepare data just before it’s needed. The team calculated that this would maximize bandwidth usage in the data link and bring the method up to perfect utilization; however, they only saw around 90 percent. They identified and fixed a performance bug in a popular PyTorch library that caused unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a 16.5 second per-epoch runtime with SALIENT.
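
    The shape of that pipeline can be sketched in a few lines of PyTorch-style Python: batches are prepared and pinned on the CPU, copied to the GPU on a separate CUDA stream, and the next copy is queued while the current batch trains. This is a minimal illustration of the general prefetching pattern, not SALIENT’s code; the synthetic make_batch function stands in for the sampling and slicing steps described above.

    ```python
    # Minimal sketch of overlapping CPU-side batch preparation and host-to-GPU
    # copies with GPU training (illustrative only; not SALIENT's implementation).
    import torch

    device = torch.device("cuda")
    model = torch.nn.Linear(128, 16).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    copy_stream = torch.cuda.Stream()

    def make_batch():
        # Stand-in for CPU-side sampling and feature slicing; pinned memory
        # lets the host-to-device copy run asynchronously on copy_stream.
        x = torch.randn(1024, 128).pin_memory()
        y = torch.randint(0, 16, (1024,)).pin_memory()
        with torch.cuda.stream(copy_stream):
            return x.to(device, non_blocking=True), y.to(device, non_blocking=True)

    next_batch = make_batch()                        # queue the first transfer
    for step in range(100):
        torch.cuda.current_stream().wait_stream(copy_stream)  # wait for the copy
        x, y = next_batch
        next_batch = make_batch()                    # start copying the next batch now
        loss = torch.nn.functional.cross_entropy(model(x), y)
        loss.backward()                              # GPU trains while the copy proceeds
        optimizer.step()
        optimizer.zero_grad()
    ```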

    “Our work showed, I think, that the devil is in the details,” says Kaler. “When you pay close attention to the details that impact performance when training a graph neural network, you can resolve a huge number of performance issues. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”

    SALIENT’s speed was evaluated on three standard datasets, ogbn-arxiv, ogbn-products, and ogbn-papers100M, as well as in multi-machine settings, with different levels of fanout (the amount of data that the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In each setting, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, containing 100 million nodes and over a billion edges. Here, running on one GPU, it was three times faster than the optimized baseline that was originally created for this work; with 16 GPUs, SALIENT was an additional eight times faster.

    Other systems had slightly different hardware and experimental setups, so it wasn’t always a direct comparison, but SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance numbers include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. In contrast, SALIENT’s runtime using one GPU and 20 CPUs was 16.5 seconds, and it was just two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that prior work reports, our 16 GPU runtime (two seconds) is an order of magnitude faster than other numbers that have been reported previously on this dataset,” says Kaler. The researchers attributed their performance improvements, in part, to their approach of optimizing their code for a single machine before moving to the distributed setting. Stathas says that the lesson here is that for your money, “it makes more sense to use the hardware you have efficiently, and to its extreme, before you start scaling up to multiple computers,” which can provide significant savings on cost and the carbon emissions that can come with model training.

    This new capacity will now allow researchers to tackle and dig deeper into bigger and bigger graphs. For example, the Bitcoin network that was mentioned earlier contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.

    “In the future, we would be looking at not just running this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but we also want to do more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be actually interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in a sense that they possibly would be corresponding to the same bad actor in a financial crime. These tasks would require developing additional algorithms, and possibly also neural network architectures.”

    This research was supported by the MIT-IBM Watson AI Lab and in part by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator.

  • Learning on the edge

    Microcontrollers, miniature computers that can run simple commands, are the basis for billions of connected devices, from internet-of-things (IoT) devices to sensors in automobiles. But cheap, low-power microcontrollers have extremely limited memory and no operating system, making it challenging to train artificial intelligence models on “edge devices” that work independently from central computing resources.

    Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. For instance, training a model on a smart keyboard could enable the keyboard to continually learn from the user’s writing. However, the training process requires so much memory that it is typically done using powerful computers at a data center, before the model is deployed on a device. This is more costly and raises privacy issues since user data must be sent to a central server.

    To address this problem, researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory. Other training solutions designed for connected devices can use more than 500 megabytes of memory, greatly exceeding the 256-kilobyte capacity of most microcontrollers (there are 1,024 kilobytes in one megabyte).

    The intelligent algorithms and framework the researchers developed reduce the amount of computation required to train a model, which makes the process faster and more memory efficient. Their technique can be used to train a machine-learning model on a microcontroller in a matter of minutes.

    This technique also preserves privacy by keeping data on the device, which could be especially beneficial when data are sensitive, such as in medical applications. It also could enable customization of a model based on the needs of users. Moreover, the framework preserves or improves the accuracy of the model when compared to other training approaches.

    “Our study enables IoT devices to not only perform inference but also continuously update the AI models to newly collected data, paving the way for lifelong on-device learning. The low resource utilization makes deep learning more accessible and can have a broader reach, especially for low-power edge devices,” says Song Han, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the MIT-IBM Watson AI Lab, and senior author of the paper describing this innovation.

    Joining Han on the paper are co-lead authors and EECS PhD students Ji Lin and Ligeng Zhu, as well as MIT postdocs Wei-Ming Chen and Wei-Chen Wang, and Chuang Gan, a principal research staff member at the MIT-IBM Watson AI Lab. The research will be presented at the Conference on Neural Information Processing Systems.

    Han and his team previously addressed the memory and computational bottlenecks that exist when trying to run machine-learning models on tiny edge devices, as part of their TinyML initiative.

    Lightweight training

    A common type of machine-learning model is known as a neural network. Loosely based on the human brain, these models contain layers of interconnected nodes, or neurons, that process data to complete a task, such as recognizing people in photos. The model must be trained first, which involves showing it millions of examples so it can learn the task. As it learns, the model increases or decreases the strength of the connections between neurons, which are known as weights.

    The model may undergo hundreds of updates as it learns, and the intermediate activations must be stored during each round. In a neural network, activations are the intermediate results produced by the middle layers. Because there may be millions of weights and activations, training a model requires much more memory than running a pre-trained model, Han explains.

    Han and his collaborators employed two algorithmic solutions to make the training process more efficient and less memory-intensive. The first, known as sparse update, uses an algorithm that identifies the most important weights to update at each round of training. The algorithm freezes the weights one at a time until it sees the accuracy dip to a set threshold, and then it stops. The remaining weights are updated, while the activations corresponding to the frozen weights don’t need to be stored in memory.

    “Updating the whole model is very expensive because there are a lot of activations, so people tend to update only the last layer, but as you can imagine, this hurts the accuracy. For our method, we selectively update those important weights and make sure the accuracy is fully preserved,” Han says.
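
    In PyTorch terms, the effect of a sparse update can be sketched by switching off gradients for weights deemed less important, so their gradient buffers and the activations needed to compute them are never kept. This is a rough sketch only; the importance scores below are hypothetical placeholders, not the selection algorithm from the paper.

    ```python
    # Rough sketch of a sparse update: freeze less-important parameters so they
    # accumulate no gradients. The "importance" scores are hypothetical placeholders.
    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

    # Pretend importance scores (e.g., from measuring the accuracy impact of freezing).
    importance = {name: float(p.abs().mean()) for name, p in model.named_parameters()}
    keep = sorted(importance, key=importance.get, reverse=True)[:2]   # update only top-2 tensors

    for name, p in model.named_parameters():
        p.requires_grad = name in keep               # frozen weights get no gradient buffers

    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.SGD(trainable, lr=0.01)

    x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()                                  # gradients flow only to the kept weights
    optimizer.step()
    ```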

    Their second solution involves quantized training and simplifying the weights, which are typically 32 bits. An algorithm rounds the weights so they are only eight bits, through a process known as quantization, which cuts the amount of memory for both training and inference. Inference is the process of applying a model to a dataset and generating a prediction. Then the algorithm applies a technique called quantization-aware scaling (QAS), which acts like a multiplier to adjust the ratio between weight and gradient, to avoid any drop in accuracy that may come from quantized training.
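
    The rounding step can be illustrated with simple symmetric int8 quantization, sketched below. The final lines only gesture at the role quantization-aware scaling plays, adjusting the weight-to-gradient ratio, and are not the QAS algorithm itself.

    ```python
    # Symmetric int8 "fake quantization" of a weight tensor, illustrating the
    # rounding-to-8-bits step described above (not the paper's training engine).
    import torch

    def quantize_int8(w: torch.Tensor):
        scale = w.abs().max() / 127.0                # map the weight range onto [-127, 127]
        q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
        return q, scale

    def dequantize(q: torch.Tensor, scale: torch.Tensor):
        return q.float() * scale                     # back to real values for computation

    w = torch.randn(64, 32)
    q, scale = quantize_int8(w)
    w_hat = dequantize(q, scale)
    print("max quantization error:", (w - w_hat).abs().max().item())

    # Schematic stand-in for QAS: rescale a gradient so the weight/gradient ratio
    # is roughly preserved after quantization (not the actual QAS formula).
    grad = torch.randn_like(w)
    grad_scaled = grad * (w_hat.norm() / w.norm())
    ```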

    The researchers developed a system, called a tiny training engine, that can run these algorithmic innovations on a simple microcontroller that lacks an operating system. This system changes the order of steps in the training process so more work is completed in the compilation stage, before the model is deployed on the edge device.

    “We push a lot of the computation, such as auto-differentiation and graph optimization, to compile time. We also aggressively prune the redundant operators to support sparse updates. Once at runtime, we have much less workload to do on the device,” Han explains.

    A successful speedup

    Their optimization only required 157 kilobytes of memory to train a machine-learning model on a microcontroller, whereas other techniques designed for lightweight training would still need between 300 and 600 megabytes.

    They tested their framework by training a computer vision model to detect people in images. After only 10 minutes of training, it learned to complete the task successfully. Their method was able to train a model more than 20 times faster than other approaches.

    Now that they have demonstrated the success of these techniques for computer vision models, the researchers want to apply them to language models and different types of data, such as time-series data. At the same time, they want to use what they’ve learned to shrink the size of larger models without sacrificing accuracy, which could help reduce the carbon footprint of training large-scale machine-learning models.

    “AI model adaptation/training on a device, especially on embedded controllers, is an open challenge. This research from MIT has not only successfully demonstrated the capabilities, but also opened up new possibilities for privacy-preserving device personalization in real-time,” says Nilesh Jain, a principal engineer at Intel who was not involved with this work. “Innovations in the publication have broader applicability and will ignite new systems-algorithm co-design research.”

    “On-device learning is the next major advance we are working toward for the connected intelligent edge. Professor Song Han’s group has shown great progress in demonstrating the effectiveness of edge devices for training,” adds Jilei Hou, vice president and head of AI research at Qualcomm. “Qualcomm has awarded his team an Innovation Fellowship for further innovation and advancement in this area.”

    This work is funded by the National Science Foundation, the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, Amazon, Intel, Qualcomm, Ford Motor Company, and Google.

  • Neurodegenerative disease can progress in newly identified patterns

    Neurodegenerative diseases — like amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease), Alzheimer’s, and Parkinson’s — are complicated, chronic ailments that can present with a variety of symptoms, worsen at different rates, and have many underlying genetic and environmental causes, some of which are unknown. ALS, in particular, affects voluntary muscle movement and is always fatal, but while most people survive for only a few years after diagnosis, others live with the disease for decades. Manifestations of ALS can also vary significantly; slower disease development often correlates with onset in the limbs, affecting fine motor skills, while the more serious, bulbar ALS impacts swallowing, speaking, breathing, and mobility. Therefore, understanding the progression of diseases like ALS is critical to enrollment in clinical trials, analysis of potential interventions, and discovery of root causes.

    However, assessing disease evolution is far from straightforward. Current clinical studies typically assume that health declines on a downward linear trajectory on a symptom rating scale, and use these linear models to evaluate whether drugs are slowing disease progression. However, data indicate that ALS often follows nonlinear trajectories, with periods where symptoms are stable alternating with periods when they are rapidly changing. Since data can be sparse, and health assessments often rely on subjective rating metrics measured at uneven time intervals, comparisons across patient populations are difficult. This heterogeneity in data and progression, in turn, complicates analyses of intervention effectiveness and potentially masks disease origin.

    Now, a new machine-learning method developed by researchers from MIT, IBM Research, and elsewhere aims to better characterize ALS disease progression patterns to inform clinical trial design.

    “There are groups of individuals that share progression patterns. For example, some seem to have really fast-progressing ALS and others that have slow-progressing ALS that varies over time,” says Divya Ramamoorthy PhD ’22, a research specialist at MIT and lead author of a new paper on the work that was published this month in Nature Computational Science. “The question we were asking is: can we use machine learning to identify if, and to what extent, those types of consistent patterns across individuals exist?”

    Their technique, indeed, identified discrete and robust clinical patterns in ALS progression, many of which are non-linear. Further, these disease progression subtypes were consistent across patient populations and disease metrics. The team additionally found that their method can be applied to Alzheimer’s and Parkinson’s diseases as well.

    Joining Ramamoorthy on the paper are MIT-IBM Watson AI Lab members Ernest Fraenkel, a professor in the MIT Department of Biological Engineering; Research Scientist Soumya Ghosh of IBM Research; and Principal Research Scientist Kenney Ng, also of IBM Research. Additional authors include Kristen Severson PhD ’18, a senior researcher at Microsoft Research and former member of the Watson Lab and of IBM Research; Karen Sachs PhD ’06 of Next Generation Analytics; a team of researchers with Answer ALS; Jonathan D. Glass and Christina N. Fournier of the Emory University School of Medicine; the Pooled Resource Open-Access ALS Clinical Trials Consortium; ALS/MND Natural History Consortium; Todd M. Herrington of Massachusetts General Hospital (MGH) and Harvard Medical School; and James D. Berry of MGH.

    Video: MIT Professor Ernest Fraenkel describes early stages of his research looking at root causes of amyotrophic lateral sclerosis (ALS).

    Reshaping health decline

    After consulting with clinicians, the team of machine learning researchers and neurologists let the data speak for itself. They designed an unsupervised machine-learning model that employed two methods: Gaussian process regression and Dirichlet process clustering. These inferred the health trajectories directly from patient data and automatically grouped similar trajectories together without prescribing the number of clusters or the shape of the curves, forming ALS progression “subtypes.” Their method incorporated prior clinical knowledge in the form of a bias toward negative trajectories — consistent with expectations for neurodegenerative disease progression — but did not assume any linearity. “We know that linearity is not reflective of what’s actually observed,” says Ng. “The methods and models that we use here were more flexible, in the sense that they capture what was seen in the data,” without the need for expensive labeled data and prescription of parameters.
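
    As a rough analogue of those two ingredients, the sketch below fits each synthetic patient’s sparse, noisy scores with an off-the-shelf Gaussian process and then clusters the smoothed trajectories with a Dirichlet-process mixture from scikit-learn. It illustrates the flavor of the approach on made-up data; it is not the authors’ joint model.

    ```python
    # Loose analogue of Gaussian process regression + Dirichlet process clustering
    # using scikit-learn, on synthetic trajectories (not the authors' model).
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel
    from sklearn.mixture import BayesianGaussianMixture

    rng = np.random.default_rng(0)
    t_grid = np.linspace(0, 24, 25)                  # months since baseline

    def smooth_trajectory(times, scores):
        """Fit one patient's noisy scores with a GP and evaluate on a common time grid."""
        gp = GaussianProcessRegressor(kernel=RBF(length_scale=6.0) + WhiteKernel(1.0))
        gp.fit(times.reshape(-1, 1), scores)
        return gp.predict(t_grid.reshape(-1, 1))

    # Synthetic patients: some decline quickly, some slowly, with sparse noisy visits.
    curves = []
    for _ in range(60):
        rate = rng.choice([0.5, 2.0])                # slow vs. fast decline
        times = np.sort(rng.uniform(0, 24, size=8))
        scores = 48 - rate * times + rng.normal(0, 1.5, size=8)
        curves.append(smooth_trajectory(times, scores))

    # Cluster the smoothed trajectories; the Dirichlet-process prior lets the data
    # decide how many of the candidate clusters are actually used.
    dpmm = BayesianGaussianMixture(n_components=8,
                                   weight_concentration_prior_type="dirichlet_process",
                                   random_state=0).fit(np.array(curves))
    print("cluster assignments:", dpmm.predict(np.array(curves))[:10])
    ```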

    Primarily, they applied the model to five longitudinal datasets from ALS clinical trials and observational studies. These used the gold standard for measuring symptom development: the ALS functional rating scale revised (ALSFRS-R), which captures a global picture of patient neurological impairment but can be a bit of a “messy metric.” Additionally, the model was evaluated on survival probabilities, forced vital capacity (a measurement of respiratory function), and subscores of ALSFRS-R, which look at individual bodily functions.

    New regimes of progression and utility

    When their population-level model was trained and tested on these metrics, four dominant patterns of disease popped out of the many trajectories — sigmoidal fast progression, stable slow progression, unstable slow progression, and unstable moderate progression — many with strong nonlinear characteristics. Notably, it captured trajectories where patients experienced a sudden loss of ability, called a functional cliff, which would significantly impact treatments, enrollment in clinical trials, and quality of life.

    The researchers compared their method against other commonly used linear and nonlinear approaches in the field to separate the contribution of clustering and linearity to the model’s accuracy. The new work outperformed them, even patient-specific models, and found that subtype patterns were consistent across measures. Impressively, when data were withheld, the model was able to interpolate missing values, and, critically, could forecast future health measures. The model could also be trained on one ALSFRS-R dataset and predict cluster membership in others, making it robust, generalizable, and accurate with scarce data. So long as 6-12 months of data were available, health trajectories could be inferred with higher confidence than conventional methods.

    The researchers’ approach also provided insights into Alzheimer’s and Parkinson’s diseases, both of which can have a range of symptom presentations and progression. For Alzheimer’s, the new technique could identify distinct disease patterns, in particular variations in the rates of conversion of mild to severe disease. The Parkinson’s analysis demonstrated a relationship between progression trajectories for off-medication scores and disease phenotypes, such as the tremor-dominant or postural instability/gait difficulty forms of Parkinson’s disease.

    The work makes significant strides to find the signal amongst the noise in the time-series of complex neurodegenerative disease. “The patterns that we see are reproducible across studies, which I don’t believe had been shown before, and that may have implications for how we subtype the [ALS] disease,” says Fraenkel. As the FDA has been considering the impact of non-linearity in clinical trial designs, the team notes that their work is particularly pertinent.

    As new ways to understand disease mechanisms come online, this model provides another tool to pick apart illnesses like ALS, Alzheimer’s, and Parkinson’s from a systems biology perspective.

    “We have a lot of molecular data from the same patients, and so our long-term goal is to see whether there are subtypes of the disease,” says Fraenkel, whose lab looks at cellular changes to understand the etiology of diseases and possible targets for cures. “One approach is to start with the symptoms … and see if people with different patterns of disease progression are also different at the molecular level. That might lead you to a therapy. Then there’s the bottom-up approach, where you start with the molecules” and try to reconstruct biological pathways that might be affected. “We’re going [to be tackling this] from both ends … and finding if something meets in the middle.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab, the Muscular Dystrophy Association, the Department of Veterans Affairs Office of Research and Development, the Department of Defense, the NSF Graduate Research Fellowship Program, the Siebel Scholars Fellowship, Answer ALS, the United States Army Medical Research Acquisition Activity, the National Institutes of Health, and the NIH/NINDS.

  • Hallucinating to better text translation

    As babies, we babble and imitate our way to learning languages. We don’t start off reading raw text, which requires fundamental knowledge and understanding about the world, as well as the advanced ability to interpret and infer descriptions and relationships. Rather, humans begin our language journey slowly, by pointing and interacting with our environment, basing our words and perceiving their meaning through the context of the physical and social world. Eventually, we can craft full sentences to communicate complex ideas.

    Similarly, when humans begin learning and translating into another language, the incorporation of other sensory information, like multimedia, paired with the new and unfamiliar words, like flashcards with images, improves language acquisition and retention. Then, with enough practice, humans can accurately translate new, unseen sentences in context without the accompanying media; however, imagining a picture based on the original text helps.

    This is the basis of a new machine learning model, called VALHALLA, by researchers from MIT, IBM, and the University of California at San Diego, in which a trained neural network sees a source sentence in one language, hallucinates an image of what it looks like, and then uses both to translate into a target language. The team found that their method demonstrates improved accuracy of machine translation over text-only translation. Further, it provided an additional boost for cases with long sentences, under-resourced languages, and instances where part of the source sentence is inaccessible to the machine translator.

    As a core task within the AI field of natural language processing (NLP), machine translation is an “eminently practical technology that’s being used by millions of people every day,” says study co-author Yoon Kim, assistant professor in MIT’s Department of Electrical Engineering and Computer Science with affiliations in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT-IBM Watson AI Lab. With recent, significant advances in deep learning, “there’s been an interesting development in how one might use non-text information — for example, images, audio, or other grounding information — to tackle practical tasks involving language,” says Kim, because “when humans are performing language processing tasks, we’re doing so within a grounded, situated world.” The pairing of hallucinated images and text during inference, the team postulated, imitates that process, providing context for improved performance over current state-of-the-art techniques, which utilize text-only data.

    This research will be presented at the IEEE / CVF Computer Vision and Pattern Recognition Conference this month. Kim’s co-authors are UC San Diego graduate student Yi Li and Professor Nuno Vasconcelos, along with research staff members Rameswar Panda, Chun-fu “Richard” Chen, Rogerio Feris, and IBM Director David Cox of IBM Research and the MIT-IBM Watson AI Lab.

    Learning to hallucinate from images

    When we learn new languages and to translate, we’re often provided with examples and practice before venturing out on our own. The same is true for machine-translation systems; however, if images are used during training, these AI methods also require visual aids for testing, limiting their applicability, says Panda.

    “In real-world scenarios, you might not have an image with respect to the source sentence. So, our motivation was basically: Instead of using an external image during inference as input, can we use visual hallucination — the ability to imagine visual scenes — to improve machine translation systems?” says Panda.

    To do this, the team used an encoder-decoder architecture with two transformers, a type of neural network model that’s suited for sequence-dependent data, like language, and that can pay attention to key words and the semantics of a sentence. One transformer generates a visual hallucination, and the other performs multimodal translation using outputs from the first transformer.

    During training, there are two streams of translation: a source sentence and a ground-truth image that is paired with it, and the same source sentence that is visually hallucinated to make a text-image pair. First the ground-truth image and sentence are tokenized into representations that can be handled by transformers; for the case of the sentence, each word is a token. The source sentence is tokenized again, but this time passed through the visual hallucination transformer, outputting a hallucination, a discrete image representation of the sentence. The researchers incorporated an autoregression that compares the ground-truth and hallucinated representations for congruency — e.g., homonyms: a reference to an animal “bat” isn’t hallucinated as a baseball bat. The hallucination transformer then uses the difference between them to optimize its predictions and visual output, making sure the context is consistent.

    The two sets of tokens are then simultaneously passed through the multimodal translation transformer, each containing the sentence representation and either the hallucinated or ground-truth image. The tokenized text translation outputs are compared with the goal of being similar to each other and to the target sentence in another language. Any differences are then relayed back to the translation transformer for further optimization.
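
    A schematic of that two-stream training step, with loss terms implied by the description above, might look like the following. The tokenizer and transformer objects are hypothetical stand-ins, and the loss weighting and consistency term are simplifications rather than the actual VALHALLA objective.

    ```python
    # Schematic of the two-stream training step described above. The model objects
    # passed in are hypothetical stand-ins; this is not the released VALHALLA code.
    import torch
    import torch.nn.functional as F

    def training_step(src_sentence, ground_truth_image, target_sentence,
                      text_tokenizer, image_tokenizer, hallucination_tf, translation_tf):
        src_tokens = text_tokenizer(src_sentence)              # one token per word
        tgt_tokens = text_tokenizer(target_sentence)
        true_img_tokens = image_tokenizer(ground_truth_image)  # discrete image representation

        # Stream 1: hallucinate discrete image tokens from the source sentence alone.
        hallu_logits = hallucination_tf(src_tokens)
        hallu_loss = F.cross_entropy(hallu_logits.transpose(1, 2), true_img_tokens)

        # Stream 2: translate with ground-truth image tokens and with hallucinated ones.
        hallu_img_tokens = hallu_logits.argmax(dim=-1)
        logits_true = translation_tf(src_tokens, true_img_tokens)
        logits_hallu = translation_tf(src_tokens, hallu_img_tokens)
        trans_loss = (F.cross_entropy(logits_true.transpose(1, 2), tgt_tokens) +
                      F.cross_entropy(logits_hallu.transpose(1, 2), tgt_tokens))

        # Consistency: the two translation outputs should agree with each other.
        consistency = F.kl_div(F.log_softmax(logits_hallu, dim=-1),
                               F.softmax(logits_true, dim=-1), reduction="batchmean")

        return hallu_loss + trans_loss + consistency
    ```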

    For testing, the ground-truth image stream drops off, since images likely wouldn’t be available in everyday scenarios.

    “To the best of our knowledge, we haven’t seen any work which actually uses a hallucination transformer jointly with a multimodal translation system to improve machine translation performance,” says Panda.

    Visualizing the target text

    To test their method, the team put VALHALLA up against other state-of-the-art multimodal and text-only translation methods. They used public benchmark datasets containing ground-truth images with source sentences, and a dataset for translating text-only news articles. The researchers measured its performance over 13 tasks, ranging from translation between well-resourced languages (like English, German, and French), to under-resourced languages (like English to Romanian), to non-English pairs (like Spanish to French). The group also tested varying transformer model sizes, how accuracy changes with the sentence length, and translation under limited textual context, where portions of the text were hidden from the machine translators.

    The team observed significant improvements over text-only translation methods, improving data efficiency, and that smaller models performed better than the larger base model. As sentences became longer, VALHALLA’s performance over other methods grew, which the researchers attributed to the addition of more ambiguous words. In cases where part of the sentence was masked, VALHALLA could recover and translate the original text, which the team found surprising.

    Further unexpected findings arose: “Where there weren’t as many training [image and] text pairs, [like for under-resourced languages], improvements were more significant, which indicates that grounding in images helps in low-data regimes,” says Kim. “Another thing that was quite surprising to me was this improved performance, even on types of text that aren’t necessarily easily connectable to images. For example, maybe it’s not so surprising if this helps in translating visually salient sentences, like the ‘there is a red car in front of the house.’ [However], even in text-only [news article] domains, the approach was able to improve upon text-only systems.”

    While VALHALLA performs well, the researchers note that it does have limitations, requiring pairs of sentences to be annotated with an image, which could make training data more expensive to obtain. It also performs better in its grounded domain than on text-only news articles. Moreover, Kim and Panda note, a technique like VALHALLA is still a black box, with the assumption that hallucinated images are providing helpful information, and the team plans to investigate what and how the model is learning in order to validate their methods.

    In the future, the team plans to explore other means of improving translation. “Here, we only focus on images, but there are other types of a multimodal information — for example, speech, video or touch, or other sensory modalities,” says Panda. “We believe such multimodal grounding can lead to even more efficient machine translation models, potentially benefiting translation across many low-resource languages spoken in the world.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab and the National Science Foundation.

  • Artificial intelligence system learns concepts shared across video, audio, and text

    Humans observe the world through a combination of different modalities, like vision, hearing, and our understanding of language. Machines, on the other hand, interpret the world through data that algorithms can process.

    So, when a machine “sees” a photo, it must encode that photo into data it can use to perform a task like image classification. This process becomes more complicated when inputs come in multiple formats, like videos, audio clips, and images.

    “The main challenge here is, how can a machine align those different modalities? As humans, this is easy for us. We see a car and then hear the sound of a car driving by, and we know these are the same thing. But for machine learning, it is not that straightforward,” says Alexander Liu, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and first author of a paper tackling this problem. 

    Liu and his collaborators developed an artificial intelligence technique that learns to represent data in a way that captures concepts which are shared between visual and audio modalities. For instance, their method can learn that the action of a baby crying in a video is related to the spoken word “crying” in an audio clip.

    Using this knowledge, their machine-learning model can identify where a certain action is taking place in a video and label it.

    It performs better than other machine-learning methods at cross-modal retrieval tasks, which involve finding a piece of data, like a video, that matches a user’s query given in another form, like spoken language. Their model also makes it easier for users to see why the machine thinks the video it retrieved matches their query.

    This technique could someday be utilized to help robots learn about concepts in the world through perception, more like the way humans do.

    Joining Liu on the paper are CSAIL postdoc SouYoung Jin; grad students Cheng-I Jeff Lai and Andrew Rouditchenko; Aude Oliva, senior research scientist in CSAIL and MIT director of the MIT-IBM Watson AI Lab; and senior author James Glass, senior research scientist and head of the Spoken Language Systems Group in CSAIL. The research will be presented at the Annual Meeting of the Association for Computational Linguistics.

    Learning representations

    The researchers focus their work on representation learning, which is a form of machine learning that seeks to transform input data to make it easier to perform a task like classification or prediction.

    The representation learning model takes raw data, such as videos and their corresponding text captions, and encodes them by extracting features, or observations about objects and actions in the video. Then it maps those data points in a grid, known as an embedding space. The model clusters similar data together as single points in the grid. Each of these data points, or vectors, is represented by an individual word.

    For instance, a video clip of a person juggling might be mapped to a vector labeled “juggling.”

    The researchers constrain the model so it can only use 1,000 words to label vectors. The model can decide which actions or concepts it wants to encode into a single vector, but it can only use 1,000 vectors. The model chooses the words it thinks best represent the data.

    Rather than encoding data from different modalities onto separate grids, their method employs a shared embedding space where two modalities can be encoded together. This enables the model to learn the relationship between representations from two modalities, like video that shows a person juggling and an audio recording of someone saying “juggling.”

    To help the system process data from multiple modalities, they designed an algorithm that guides the machine to encode similar concepts into the same vector.

    “If there is a video about pigs, the model might assign the word ‘pig’ to one of the 1,000 vectors. Then if the model hears someone saying the word ‘pig’ in an audio clip, it should still use the same vector to encode that,” Liu explains.
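
    A simplified sketch of that shared, discrete embedding space is shown below: encoder outputs from either modality are snapped to the nearest of 1,000 shared codeword vectors. The random features stand in for real video and audio encoders, and the snapping rule is only an illustration of the idea, not the researchers’ model.

    ```python
    # Simplified sketch of a shared discrete embedding space: features from either
    # modality are assigned to the nearest of 1,000 shared codeword vectors, so a
    # video of pigs and the spoken word "pig" can land on the same vector.
    import torch

    num_codewords, dim = 1000, 256
    codebook = torch.nn.Embedding(num_codewords, dim)    # the 1,000 shared vectors

    def quantize(features: torch.Tensor) -> torch.Tensor:
        """Assign each feature vector (from video or audio) to its nearest codeword."""
        dists = torch.cdist(features, codebook.weight)   # (batch, 1000) distances
        return dists.argmin(dim=-1)                      # index of the chosen codeword

    video_features = torch.randn(4, dim)                 # stand-ins for encoder outputs
    audio_features = torch.randn(4, dim)
    print("video codewords:", quantize(video_features).tolist())
    print("audio codewords:", quantize(audio_features).tolist())
    ```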

    A better retriever

    They tested the model on cross-modal retrieval tasks using three datasets: a video-text dataset with video clips and text captions, a video-audio dataset with video clips and spoken audio captions, and an image-audio dataset with images and spoken audio captions.

    For example, in the video-audio dataset, the model chose 1,000 words to represent the actions in the videos. Then, when the researchers fed it audio queries, the model tried to find the clip that best matched those spoken words.

    “Just like a Google search, you type in some text and the machine tries to tell you the most relevant things you are searching for. Only we do this in the vector space,” Liu says.

    Not only was their technique more likely to find better matches than the models they compared it to, it was also easier to understand.

    Because the model could only use 1,000 total words to label vectors, a user can more easily see which words the machine used to conclude that the video and spoken words are similar. This could make the model easier to apply in real-world situations where it is vital that users understand how it makes decisions, Liu says.

    The model still has some limitations they hope to address in future work. For one, their research focused on data from two modalities at a time, but in the real world humans encounter many data modalities simultaneously, Liu says.

    “And we know 1,000 words works on this kind of dataset, but we don’t know if it can be generalized to a real-world problem,” he adds.

    Plus, the images and videos in their datasets contained simple objects or straightforward actions; real-world data are much messier. They also want to determine how well their method scales up when there is a wider diversity of inputs.

    This research was supported, in part, by the MIT-IBM Watson AI Lab and its member companies, Nexplore and Woodside, and by the MIT Lincoln Laboratory.

  • Does this artificial intelligence think like a human?

    In machine learning, understanding why a model makes certain decisions is often just as important as whether those decisions are correct. For instance, a machine-learning model might correctly predict that a skin lesion is cancerous, but it could have done so using an unrelated blip on a clinical photo.

    While tools exist to help experts make sense of a model’s reasoning, often these methods only provide insights on one decision at a time, and each must be manually evaluated. Models are commonly trained using millions of data inputs, making it almost impossible for a human to evaluate enough decisions to identify patterns.

    Now, researchers at MIT and IBM Research have created a method that enables a user to aggregate, sort, and rank these individual explanations to rapidly analyze a machine-learning model’s behavior. Their technique, called Shared Interest, incorporates quantifiable metrics that compare how well a model’s reasoning matches that of a human.

    Shared Interest could help a user easily uncover concerning trends in a model’s decision-making — for example, perhaps the model often becomes confused by distracting, irrelevant features, like background objects in photos. Aggregating these insights could help the user quickly and quantitatively determine whether a model is trustworthy and ready to be deployed in a real-world situation.

    “In developing Shared Interest, our goal is to be able to scale up this analysis process so that you could understand on a more global level what your model’s behavior is,” says lead author Angie Boggust, a graduate student in the Visualization Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Boggust wrote the paper with her advisor, Arvind Satyanarayan, an assistant professor of computer science who leads the Visualization Group, as well as Benjamin Hoover and senior author Hendrik Strobelt, both of IBM Research. The paper will be presented at the Conference on Human Factors in Computing Systems.

    Boggust began working on this project during a summer internship at IBM, under the mentorship of Strobelt. After returning to MIT, Boggust and Satyanarayan expanded on the project and continued the collaboration with Strobelt and Hoover, who helped deploy the case studies that show how the technique could be used in practice.

    Human-AI alignment

    Shared Interest leverages popular techniques that show how a machine-learning model made a specific decision, known as saliency methods. If the model is classifying images, saliency methods highlight areas of an image that are important to the model when it made its decision. These areas are visualized as a type of heatmap, called a saliency map, that is often overlaid on the original image. If the model classified the image as a dog, and the dog’s head is highlighted, that means those pixels were important to the model when it decided the image contains a dog.

    Shared Interest works by comparing saliency methods to ground-truth data. In an image dataset, ground-truth data are typically human-generated annotations that surround the relevant parts of each image. In the previous example, the box would surround the entire dog in the photo. When evaluating an image classification model, Shared Interest compares the model-generated saliency data and the human-generated ground-truth data for the same image to see how well they align.

    The technique uses several metrics to quantify that alignment (or misalignment) and then sorts a particular decision into one of eight categories. The categories run the gamut from perfectly human-aligned (the model makes a correct prediction and the highlighted area in the saliency map is identical to the human-generated box) to completely distracted (the model makes an incorrect prediction and does not use any image features found in the human-generated box).

    “On one end of the spectrum, your model made the decision for the exact same reason a human did, and on the other end of the spectrum, your model and the human are making this decision for totally different reasons. By quantifying that for all the images in your dataset, you can use that quantification to sort through them,” Boggust explains.
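
    The kind of quantification Boggust describes can be sketched with simple mask-overlap metrics: how much of the human-annotated region the saliency map covers, and how much of the salient area falls inside it. The thresholds and category names below are illustrative stand-ins, not the paper’s exact metric definitions or its eight categories.

    ```python
    # Illustrative alignment metrics between a saliency mask and a human-drawn
    # ground-truth mask, in the spirit of Shared Interest (thresholds are made up).
    import numpy as np

    def alignment_scores(saliency_mask: np.ndarray, ground_truth_mask: np.ndarray):
        s, g = saliency_mask.astype(bool), ground_truth_mask.astype(bool)
        inter = np.logical_and(s, g).sum()
        union = np.logical_or(s, g).sum()
        return {
            "iou": inter / max(union, 1),                   # overall overlap
            "gt_coverage": inter / max(g.sum(), 1),         # how much of the human region is used
            "saliency_precision": inter / max(s.sum(), 1),  # how much salient area is inside it
        }

    def categorize(scores, prediction_correct: bool):
        if prediction_correct and scores["iou"] > 0.8:
            return "human-aligned"
        if not prediction_correct and scores["gt_coverage"] < 0.05:
            return "distracted"
        return "partially aligned"

    saliency = np.zeros((8, 8)); saliency[2:5, 2:5] = 1     # toy 8x8 masks
    truth = np.zeros((8, 8)); truth[2:6, 2:6] = 1
    print(categorize(alignment_scores(saliency, truth), prediction_correct=True))
    ```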

    The technique works similarly with text-based data, where key words are highlighted instead of image regions.

    Rapid analysis

    The researchers used three case studies to show how Shared Interest could be useful to both nonexperts and machine-learning researchers.

    In the first case study, they used Shared Interest to help a dermatologist determine if he should trust a machine-learning model designed to help diagnose cancer from photos of skin lesions. Shared Interest enabled the dermatologist to quickly see examples of the model’s correct and incorrect predictions. Ultimately, the dermatologist decided he could not trust the model because it made too many predictions based on image artifacts, rather than actual lesions.

    “The value here is that using Shared Interest, we are able to see these patterns emerge in our model’s behavior. In about half an hour, the dermatologist was able to make a confident decision of whether or not to trust the model and whether or not to deploy it,” Boggust says.

    In the second case study, they worked with a machine-learning researcher to show how Shared Interest can evaluate a particular saliency method by revealing previously unknown pitfalls in the model. Their technique enabled the researcher to analyze thousands of correct and incorrect decisions in a fraction of the time required by typical manual methods.

    In the third case study, they used Shared Interest to dive deeper into a specific image classification example. By manipulating the ground-truth area of the image, they were able to conduct a what-if analysis to see which image features were most important for particular predictions.   

    The researchers were impressed by how well Shared Interest performed in these case studies, but Boggust cautions that the technique is only as good as the saliency methods it is based upon. If those techniques contain bias or are inaccurate, then Shared Interest will inherit those limitations.

    In the future, the researchers want to apply Shared Interest to different types of data, particularly tabular data which is used in medical records. They also want to use Shared Interest to help improve current saliency techniques. Boggust hopes this research inspires more work that seeks to quantify machine-learning model behavior in ways that make sense to humans.

    This work is funded, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

  • Generating new molecules with graph grammar

    Chemical engineers and materials scientists are constantly looking for the next revolutionary material, chemical, and drug. The rise of machine-learning approaches is expediting the discovery process, which could otherwise take years. “Ideally, the goal is to train a machine-learning model on a few existing chemical samples and then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all these components, you can build new molecules with optimal properties, and you also know how to synthesize them. That’s the overall vision that people in that space want to achieve.”

    However, current techniques, mainly deep learning, require extensive datasets for training models, and many class-specific chemical datasets contain a handful of example compounds, limiting their ability to generalize and generate physical molecules that could be created in the real world.

    Now, a new paper from researchers at MIT and IBM tackles this problem using a generative graph model to build new synthesizable molecules within the same chemical class as their training data. To do this, they treat the formation of atoms and chemical bonds as a graph and develop a graph grammar — a linguistics analogy of systems and structures for word ordering — that contains a sequence of rules for building molecules, such as monomers and polymers. Using the grammar and production rules that were inferred from the training set, the model can not only reverse engineer its examples, but can create new compounds in a systematic and data-efficient way. “We basically built a language for creating molecules,” says Matusik. “This grammar essentially is the generative model.”

    Matusik’s co-authors include MIT graduate students Minghao Guo, who is the lead author, and Beichen Li as well as Veronika Thost, Payal Das, and Jie Chen, research staff members with IBM Research. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they’ve called data-efficient graph grammar (DEG), will be presented at the International Conference on Learning Representations.

    “We want to use this grammar representation for monomer and polymer generation, because this grammar is explainable and expressive,” says Guo. “With only a few number of the production rules, we can generate many kinds of structures.”

    A molecular structure can be thought of as a symbolic representation in a graph — a string of atoms (nodes) joined together by chemical bonds (edges). In this method, the researchers allow the model to take the chemical structure and collapse a substructure of the molecule down to one node; this may be two atoms connected by a bond, a short sequence of bonded atoms, or a ring of atoms. This is done repeatedly, creating the production rules as it goes, until a single node remains. The rules and grammar then could be applied in the reverse order to recreate the training set from scratch or combined in different combinations to produce new molecules of the same chemical class.
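
    A toy version of that bottom-up collapse, using networkx, is sketched below: a bonded pair of atoms is repeatedly contracted into a single node and the collapsed substructure is recorded as a production rule, which could then be replayed in reverse. This illustrates the general mechanism only; it is not the DEG implementation, and the real grammar operates on richer substructures than single bonds.

    ```python
    # Toy sketch of the bottom-up collapse step: contract one bonded pair at a
    # time and record it as a production rule (illustrative only; not DEG's code).
    import networkx as nx

    def learn_rules(molecule: nx.Graph):
        g = molecule.copy()
        rules = []
        while g.number_of_nodes() > 1:
            u, v = next(iter(g.edges()))                 # pick a bonded pair to collapse
            rules.append((g.nodes[u]["atom"], g.nodes[v]["atom"]))   # record the rule
            g = nx.contracted_nodes(g, u, v, self_loops=False)       # collapse into one node
        return rules                                     # apply in reverse to regenerate

    # Toy "molecule": a short C-C-O chain with a hydrogen on the first carbon.
    mol = nx.Graph()
    mol.add_nodes_from([(0, {"atom": "C"}), (1, {"atom": "C"}),
                        (2, {"atom": "O"}), (3, {"atom": "H"})])
    mol.add_edges_from([(0, 1), (1, 2), (0, 3)])
    print(learn_rules(mol))
    ```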

    “Existing graph generation methods would produce one node or one edge sequentially at a time, but we are looking at higher-level structures and, specifically, exploiting chemistry knowledge, so that we don’t treat the individual atoms and bonds as the unit. This simplifies the generation process and also makes it more data-efficient to learn,” says Chen.

    Further, the researchers optimized the technique so that the bottom-up grammar was relatively simple and straightforward, such that it fabricated molecules that could be made.

    “If we switch the order of applying these production rules, we would get another molecule; what’s more, we can enumerate all the possibilities and generate tons of them,” says Chen. “Some of these molecules are valid and some of them not, so the learning of the grammar itself is actually to figure out a minimal collection of production rules, such that the percentage of molecules that can actually be synthesized is maximized.” While the researchers concentrated on three training sets of less than 33 samples each — acrylates, chain extenders, and isocyanates — they note that the process could be applied to any chemical class.

    To see how their method performed, the researchers tested DEG against other state-of-the-art models and techniques, looking at percentages of chemically valid and unique molecules, diversity of those created, success rate of retrosynthesis, and percentage of molecules belonging to the training data’s monomer class.

    “We clearly show that, for the synthesizability and membership, our algorithm outperforms all the existing methods by a very large margin, while it’s comparable for some other widely-used metrics,” says Guo. Further, “what is amazing about our algorithm is that we only need about 0.15 percent of the original dataset to achieve very similar results compared to state-of-the-art approaches that train on tens of thousands of samples. Our algorithm can specifically handle the problem of data sparsity.”

    In the immediate future, the team plans to address scaling up this grammar learning process to be able to generate large graphs, as well as produce and identify chemicals with desired properties.

    Down the road, the researchers see many applications for the DEG method, as it’s adaptable beyond generating new chemical structures, the team points out. A graph is a very flexible representation, and many entities can be symbolized in this form — robots, vehicles, buildings, and electronic circuits, for example. “Essentially, our goal is to build up our grammar, so that our graphic representation can be widely used across many different domains,” says Guo, as “DEG can automate the design of novel entities and structures,” says Chen.

    This research was supported, in part, by the MIT-IBM Watson AI Lab and Evonik.

  • Unlocking new doors to artificial intelligence

    Artificial intelligence research is constantly developing new hypotheses that have the potential to benefit society and industry; however, sometimes these benefits are not fully realized due to a lack of engineering tools. To help bridge this gap, graduate students in the MIT Department of Electrical Engineering and Computer Science’s 6-A Master of Engineering (MEng) Thesis Program work with some of the most innovative companies in the world and collaborate on cutting-edge projects, while contributing to and completing their MEng thesis.

    During a portion of the last year, four 6-A MEng students teamed up and completed an internship with IBM Research’s advanced prototyping team through the MIT-IBM Watson AI Lab on AI projects, often developing web applications to solve a real-world issue or business use case. Here, the students worked alongside AI engineers, user experience engineers, full-stack researchers, and generalists to accommodate project requests and receive thesis advice, says Lee Martie, IBM research staff member and 6-A manager. The students’ projects ranged from generating synthetic data to allow for privacy-sensitive data analysis to using computer vision to identify actions in video, allowing for monitoring human safety and tracking build progress on a construction site.

    “I appreciated all of the expertise from the team and the feedback,” says 6-A graduate Violetta Jusiega ’21, who participated in the program. “I think that working in industry gives the lens of making sure that the project’s needs are satisfied and [provides the opportunity] to ground research and make sure that it is helpful for some use case in the future.”

    Jusiega’s research intersected the fields of computer vision and design to focus on data visualization and user interfaces for the medical field. Working with IBM, she built an application programming interface (API) that let clinicians interact with a medical treatment strategy AI model, which was deployed in the cloud. Her interface provided a medical decision tree, as well as some prescribed treatment plans. After receiving feedback on her design from physicians at a local hospital, Jusiega developed iterations of the API and of how the results were displayed visually, so that it would be user-friendly and understandable for clinicians, who don’t usually code. She says that “these tools are often not acquired into the field because they lack some of these API principles which become more important in an industry where everything is already very fast paced, so there’s little time to incorporate a new technology.” But this project might eventually allow for industry deployment. “I think this application has a bunch of potential, whether it does get picked up by clinicians or whether it’s simply used in research. It’s very promising and very exciting to see how technology can help us modify, or I can improve, the health-care field to be even more custom-tailored towards patients and giving them the best care possible,” she says.

    Another 6-A graduate student, Spencer Compton, was also considering aiding professionals to make more informed decisions, for use in settings including health care, but he was tackling it from a causal perspective. When given a set of related variables, Compton was investigating if there was a way to determine not just correlation, but the cause-and-effect relationship between them (the direction of the interaction), from the data alone. For this, he and his collaborators from IBM Research and Purdue University turned to a field of math called information theory. With the goal of designing an algorithm to learn complex networks of causal relationships, Compton used ideas relating to entropy, the randomness in a system, to help determine if a causal relationship is present and how variables might be interacting. “When judging an explanation, people often default to Occam’s razor,” says Compton. “We’re more inclined to believe a simpler explanation than a more complex one.” In many cases, he says, it seemed to perform well. For instance, they were able to consider variables such as lung cancer, pollution, and X-ray findings. He was pleased that his research allowed him to help create a framework of “entropic causal inference” that could aid in safe and smart decisions in the future, in a satisfying way. “The math is really surprisingly deep, interesting, and complex,” says Compton. “We’re basically asking, ‘when is the simplest explanation correct?’ but as a math question.”
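
    A toy illustration of the entropy intuition, far simpler than the entropic causal inference framework itself, is sketched below: for a pair of discrete variables, compare the conditional entropy in each direction and prefer the direction whose implied mechanism looks simpler. The variables and the mechanism are made up for illustration.

    ```python
    # Toy illustration of an entropy-based intuition for causal direction: a
    # deterministic, lossy mechanism looks "simple" from cause to effect and
    # "complex" in reverse. This is a simplistic proxy, not the thesis framework.
    import numpy as np
    from collections import Counter

    def conditional_entropy(x, y):
        """H(Y | X) for discrete samples, in bits."""
        n = len(x)
        h = 0.0
        for xv, count in Counter(x).items():
            ys = [yv for xi, yv in zip(x, y) if xi == xv]
            probs = np.array(list(Counter(ys).values())) / len(ys)
            h += (count / n) * -(probs * np.log2(probs)).sum()
        return h

    rng = np.random.default_rng(0)
    x = rng.integers(0, 4, size=5000)       # cause: four equally likely states
    y = x // 2                              # effect: deterministic, lossy function of the cause

    print("H(Y|X) =", round(conditional_entropy(x, y), 3))   # 0.0: forward mechanism is simple
    print("H(X|Y) =", round(conditional_entropy(y, x), 3))   # 1.0: reverse direction looks complex
    ```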

    Determining relationships within data can sometimes require large volumes of it to suss out patterns, but for data that may contain sensitive information, this may not be available. For her master’s work, Ivy Huang worked with IBM Research to generate synthetic tabular data using a natural language processing tool called a transformer model, which can learn and predict future values from past values. Trained on real data, the model can produce new data with similar patterns, properties, and relationships without the restrictions on privacy, availability, and access that might come with real data in financial transactions and electronic medical records. Further, she created an API and deployed the model in an IBM cluster, which allowed users increased access to the model and the ability to query it without compromising the original data.

    Working with the advanced prototyping team, MEng candidate Brandon Perez also considered how to gather and investigate data with restrictions, but in his case it was to use computer vision frameworks, centered on an action recognition model, to identify construction site happenings. The team based their work on the Moments in Time dataset, which contains over a million three-second video clips with about 300 attached classification labels, and has performed well during AI training. However, the group needed more construction-based video data. For this, they used YouTube-8M. Perez built a framework for testing and fine-tuning existing object detection models and action recognition models that could plug into an automatic spatial and temporal localization tool — how they would identify and label particular actions in a video timeline. “I was satisfied that I was able to explore what made me curious, and I was grateful for the autonomy that I was given with this project,” says Perez. “I felt like I was always supported, and my mentor was a great support to the project.”

    “The kind of collaborations that we have seen between our MEng students and IBM researchers are exactly what the 6-A MEng Thesis program at MIT is all about,” says Tomas Palacios, professor of electrical engineering and faculty director of the MIT 6-A MEng Thesis program. “For more than 100 years, 6-A has been connecting MIT students with industry to solve together some of the most important problems in the world.”