More stories

  • in

    From physics to generative AI: An AI model for advanced pattern generation

    Generative AI, which is currently riding a crest of popular discourse, promises a world where the simple transforms into the complex — where a simple distribution evolves into intricate patterns of images, sounds, or text, rendering the artificial startlingly real. 

    The realms of imagination no longer remain as mere abstractions, as researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have brought an innovative AI model to life. Their new technology integrates two seemingly unrelated physical laws that underpin the best-performing generative models to date: diffusion, which typically illustrates the random motion of elements, like heat permeating a room or a gas expanding into space, and Poisson Flow, which draws on the principles governing the activity of electric charges.

    This harmonious blend has resulted in superior performance in generating new images, outpacing existing state-of-the-art models. Since its inception, the “Poisson Flow Generative Model ++” (PFGM++) has found potential applications in various fields, from antibody and RNA sequence generation to audio production and graph generation.

    The model can generate complex patterns, like creating realistic images or mimicking real-world processes. PFGM++ builds off of PFGM, the team’s work from the prior year. PFGM takes inspiration from the means behind the mathematical equation known as the “Poisson” equation, and then applies it to the data the model tries to learn from. To do this, the team used a clever trick: They added an extra dimension to their model’s “space,” kind of like going from a 2D sketch to a 3D model. This extra dimension gives more room for maneuvering, places the data in a larger context, and helps one approach the data from all directions when generating new samples. 

    “PFGM++ is an example of the kinds of AI advances that can be driven through interdisciplinary collaborations between physicists and computer scientists,” says Jesse Thaler, theoretical particle physicist in MIT’s Laboratory for Nuclear Science’s Center for Theoretical Physics and director of the National Science Foundation’s AI Institute for Artificial Intelligence and Fundamental Interactions (NSF AI IAIFI), who was not involved in the work. “In recent years, AI-based generative models have yielded numerous eye-popping results, from photorealistic images to lucid streams of text. Remarkably, some of the most powerful generative models are grounded in time-tested concepts from physics, such as symmetries and thermodynamics. PFGM++ takes a century-old idea from fundamental physics — that there might be extra dimensions of space-time — and turns it into a powerful and robust tool to generate synthetic but realistic datasets. I’m thrilled to see the myriad of ways ‘physics intelligence’ is transforming the field of artificial intelligence.”

    The underlying mechanism of PFGM isn’t as complex as it might sound. The researchers compared the data points to tiny electric charges placed on a flat plane in a dimensionally expanded world. These charges produce an “electric field,” with the charges looking to move upwards along the field lines into an extra dimension and consequently forming a uniform distribution on a vast imaginary hemisphere. The generation process is like rewinding a videotape: starting with a uniformly distributed set of charges on the hemisphere and tracking their journey back to the flat plane along the electric lines, they align to match the original data distribution. This intriguing process allows the neural model to learn the electric field, and generate new data that mirrors the original. 

    The PFGM++ model extends the electric field in PFGM to an intricate, higher-dimensional framework. When you keep expanding these dimensions, something unexpected happens — the model starts resembling another important class of models, the diffusion models. This work is all about finding the right balance. The PFGM and diffusion models sit at opposite ends of a spectrum: one is robust but complex to handle, the other simpler but less sturdy. The PFGM++ model offers a sweet spot, striking a balance between robustness and ease of use. This innovation paves the way for more efficient image and pattern generation, marking a significant step forward in technology. Along with adjustable dimensions, the researchers proposed a new training method that enables more efficient learning of the electric field. 

    To bring this theory to life, the team resolved a pair of differential equations detailing these charges’ motion within the electric field. They evaluated the performance using the Frechet Inception Distance (FID) score, a widely accepted metric that assesses the quality of images generated by the model in comparison to the real ones. PFGM++ further showcases a higher resistance to errors and robustness toward the step size in the differential equations.

    Looking ahead, they aim to refine certain aspects of the model, particularly in systematic ways to identify the “sweet spot” value of D tailored for specific data, architectures, and tasks by analyzing the behavior of estimation errors of neural networks. They also plan to apply the PFGM++ to the modern large-scale text-to-image/text-to-video generation.

    “Diffusion models have become a critical driving force behind the revolution in generative AI,” says Yang Song, research scientist at OpenAI. “PFGM++ presents a powerful generalization of diffusion models, allowing users to generate higher-quality images by improving the robustness of image generation against perturbations and learning errors. Furthermore, PFGM++ uncovers a surprising connection between electrostatics and diffusion models, providing new theoretical insights into diffusion model research.”

    “Poisson Flow Generative Models do not only rely on an elegant physics-inspired formulation based on electrostatics, but they also offer state-of-the-art generative modeling performance in practice,” says NVIDIA Senior Research Scientist Karsten Kreis, who was not involved in the work. “They even outperform the popular diffusion models, which currently dominate the literature. This makes them a very powerful generative modeling tool, and I envision their application in diverse areas, ranging from digital content creation to generative drug discovery. More generally, I believe that the exploration of further physics-inspired generative modeling frameworks holds great promise for the future and that Poisson Flow Generative Models are only the beginning.”

    Authors on a paper about this work include three MIT graduate students: Yilun Xu of the Department of Electrical Engineering and Computer Science (EECS) and CSAIL, Ziming Liu of the Department of Physics and the NSF AI IAIFI, and Shangyuan Tong of EECS and CSAIL, as well as Google Senior Research Scientist Yonglong Tian PhD ’23. MIT professors Max Tegmark and Tommi Jaakkola advised the research.

    The team was supported by the MIT-DSTA Singapore collaboration, the MIT-IBM Grand Challenge project, National Science Foundation grants, The Casey and Family Foundation, the Foundational Questions Institute, the Rothberg Family Fund for Cognitive Science, and the ML for Pharmaceutical Discovery and Synthesis Consortium. Their work was presented at the International Conference on Machine Learning this summer. More

  • in

    Helping computer vision and language models understand what they see

    Powerful machine-learning algorithms known as vision and language models, which learn to match text with images, have shown remarkable results when asked to generate captions or summarize videos.

    While these models excel at identifying objects, they often struggle to understand concepts, like object attributes or the arrangement of items in a scene. For instance, a vision and language model might recognize the cup and table in an image, but fail to grasp that the cup is sitting on the table.

    Researchers from MIT, the MIT-IBM Watson AI Lab, and elsewhere have demonstrated a new technique that utilizes computer-generated data to help vision and language models overcome this shortcoming.

    The researchers created a synthetic dataset of images that depict a wide range of scenarios, object arrangements, and human actions, coupled with detailed text descriptions. They used this annotated dataset to “fix” vision and language models so they can learn concepts more effectively. Their technique ensures these models can still make accurate predictions when they see real images.

    When they tested models on concept understanding, the researchers found that their technique boosted accuracy by up to 10 percent. This could improve systems that automatically caption videos or enhance models that provide natural language answers to questions about images, with applications in fields like e-commerce or health care.

    “With this work, we are going beyond nouns in the sense that we are going beyond just the names of objects to more of the semantic concept of an object and everything around it. Our idea was that, when a machine-learning model sees objects in many different arrangements, it will have a better idea of how arrangement matters in a scene,” says Khaled Shehada, a graduate student in the Department of Electrical Engineering and Computer Science and co-author of a paper on this technique.

    Shehada wrote the paper with lead author Paola Cascante-Bonilla, a computer science graduate student at Rice University; Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); senior author Leonid Karlinsky, a research staff member in the MIT-IBM Watson AI Lab; and others at MIT, the MIT-IBM Watson AI Lab, Georgia Tech, Rice University, École des Ponts, Weizmann Institute of Science, and IBM Research. The paper will be presented at the International Conference on Computer Vision.

    Focusing on objects

    Vision and language models typically learn to identify objects in a scene, and can end up ignoring object attributes, such as color and size, or positional relationships, such as which object is on top of another object.

    This is due to the method with which these models are often trained, known as contrastive learning. This training method involves forcing a model to predict the correspondence between images and text. When comparing natural images, the objects in each scene tend to cause the most striking differences. (Perhaps one image shows a horse in a field while the second shows a sailboat on the water.)

    “Every image could be uniquely defined by the objects in the image. So, when you do contrastive learning, just focusing on the nouns and objects would solve the problem. Why would the model do anything differently?” says Karlinsky.

    The researchers sought to mitigate this problem by using synthetic data to fine-tune a vision and language model. The fine-tuning process involves tweaking a model that has already been trained to improve its performance on a specific task.

    They used a computer to automatically create synthetic videos with diverse 3D environments and objects, such as furniture and luggage, and added human avatars that interacted with the objects.

    Using individual frames of these videos, they generated nearly 800,000 photorealistic images, and then paired each with a detailed caption. The researchers developed a methodology for annotating every aspect of the image to capture object attributes, positional relationships, and human-object interactions clearly and consistently in dense captions.

    Because the researchers created the images, they could control the appearance and position of objects, as well as the gender, clothing, poses, and actions of the human avatars.

    “Synthetic data allows a lot of diversity. With real images, you might not have a lot of elephants in a room, but with synthetic data, you could actually have a pink elephant in a room with a human, if you want,” Cascante-Bonilla says.

    Synthetic data have other advantages, too. They are cheaper to generate than real data, yet the images are highly photorealistic. They also preserve privacy because no real humans are shown in the images. And, because data are produced automatically by a computer, they can be generated quickly in massive quantities.

    By using different camera viewpoints, or slightly changing the positions or attributes of objects, the researchers created a dataset with a far wider variety of scenarios than one would find in a natural dataset.

    Fine-tune, but don’t forget

    However, when one fine-tunes a model with synthetic data, there is a risk that model might “forget” what it learned when it was originally trained with real data.

    The researchers employed a few techniques to prevent this problem, such as adjusting the synthetic data so colors, lighting, and shadows more closely match those found in natural images. They also made adjustments to the model’s inner-workings after fine-tuning to further reduce any forgetfulness.

    Their synthetic dataset and fine-tuning strategy improved the ability of popular vision and language models to accurately recognize concepts by up to 10 percent. At the same time, the models did not forget what they had already learned.

    Now that they have shown how synthetic data can be used to solve this problem, the researchers want to identify ways to improve the visual quality and diversity of these data, as well as the underlying physics that makes synthetic scenes look realistic. In addition, they plan to test the limits of scalability, and investigate whether model improvement starts to plateau with larger and more diverse synthetic datasets.

    This research is funded, in part, by the U.S. Defense Advanced Research Projects Agency, the National Science Foundation, and the MIT-IBM Watson AI Lab. More

  • in

    A new dataset of Arctic images will spur artificial intelligence research

    As the U.S. Coast Guard (USCG) icebreaker Healy takes part in a voyage across the North Pole this summer, it is capturing images of the Arctic to further the study of this rapidly changing region. Lincoln Laboratory researchers installed a camera system aboard the Healy while at port in Seattle before it embarked on a three-month science mission on July 11. The resulting dataset, which will be one of the first of its kind, will be used to develop artificial intelligence tools that can analyze Arctic imagery.

    “This dataset not only can help mariners navigate more safely and operate more efficiently, but also help protect our nation by providing critical maritime domain awareness and an improved understanding of how AI analysis can be brought to bear in this challenging and unique environment,” says Jo Kurucar, a researcher in Lincoln Laboratory’s AI Software Architectures and Algorithms Group, which led this project.

    As the planet warms and sea ice melts, Arctic passages are opening up to more traffic, both to military vessels and ships conducting illegal fishing. These movements may pose national security challenges to the United States. The opening Arctic also leaves questions about how its climate, wildlife, and geography are changing.

    Today, very few imagery datasets of the Arctic exist to study these changes. Overhead images from satellites or aircraft can only provide limited information about the environment. An outward-looking camera attached to a ship can capture more details of the setting and different angles of objects, such as other ships, in the scene. These types of images can then be used to train AI computer-vision tools, which can help the USCG plan naval missions and automate analysis. According to Kurucar, USCG assets in the Arctic are spread thin and can benefit greatly from AI tools, which can act as a force multiplier.

    The Healy is the USCG’s largest and most technologically advanced icebreaker. Given its current mission, it was a fitting candidate to be equipped with a new sensor to gather this dataset. The laboratory research team collaborated with the USCG Research and Development Center to determine the sensor requirements. Together, they developed the Cold Region Imaging and Surveillance Platform (CRISP).

    “Lincoln Laboratory has an excellent relationship with the Coast Guard, especially with the Research and Development Center. Over a decade, we’ve established ties that enabled the deployment of the CRISP system,” says Amna Greaves, the CRISP project lead and an assistant leader in the AI Software Architectures and Algorithms Group. “We have strong ties not only because of the USCG veterans working at the laboratory and in our group, but also because our technology missions are complementary. Today it was deploying infrared sensing in the Arctic; tomorrow it could be operating quadruped robot dogs on a fast-response cutter.”

    The CRISP system comprises a long-wave infrared camera, manufactured by Teledyne FLIR (for forward-looking infrared), that is designed for harsh maritime environments. The camera can stabilize itself during rough seas and image in complete darkness, fog, and glare. It is paired with a GPS-enabled time-synchronized clock and a network video recorder to record both video and still imagery along with GPS-positional data.  

    The camera is mounted at the front of the ship’s fly bridge, and the electronics are housed in a ruggedized rack on the bridge. The system can be operated manually from the bridge or be placed into an autonomous surveillance mode, in which it slowly pans back and forth, recording 15 minutes of video every three hours and a still image once every 15 seconds.

    “The installation of the equipment was a unique and fun experience. As with any good project, our expectations going into the install did not meet reality,” says Michael Emily, the project’s IT systems administrator who traveled to Seattle for the install. Working with the ship’s crew, the laboratory team had to quickly adjust their route for running cables from the camera to the observation station after they discovered that the expected access points weren’t in fact accessible. “We had 100-foot cables made for this project just in case of this type of scenario, which was a good thing because we only had a few inches to spare,” Emily says.

    The CRISP project team plans to publicly release the dataset, anticipated to be about 4 terabytes in size, once the USCG science mission concludes in the fall.

    The goal in releasing the dataset is to enable the wider research community to develop better tools for those operating in the Arctic, especially as this region becomes more navigable. “Collecting and publishing the data allows for faster and greater progress than what we could accomplish on our own,” Kurucar adds. “It also enables the laboratory to engage in more advanced AI applications while others make more incremental advances using the dataset.”

    On top of providing the dataset, the laboratory team plans to provide a baseline object-detection model, from which others can make progress on their own models. More advanced AI applications planned for development are classifiers for specific objects in the scene and the ability to identify and track objects across images.

    Beyond assisting with USCG missions, this project could create an influential dataset for researchers looking to apply AI to data from the Arctic to help combat climate change, says Paul Metzger, who leads the AI Software Architectures and Algorithms Group.

    Metzger adds that the group was honored to be a part of this project and is excited to see the advances that come from applying AI to novel challenges facing the United States: “I’m extremely proud of how our group applies AI to the highest-priority challenges in our nation, from predicting outbreaks of Covid-19 and assisting the U.S. European Command in their support of Ukraine to now employing AI in the Arctic for maritime awareness.”

    Once the dataset is available, it will be free to download on the Lincoln Laboratory dataset website. More

  • in

    Putting clear bounds on uncertainty

    In science and technology, there has been a long and steady drive toward improving the accuracy of measurements of all kinds, along with parallel efforts to enhance the resolution of images. An accompanying goal is to reduce the uncertainty in the estimates that can be made, and the inferences drawn, from the data (visual or otherwise) that have been collected. Yet uncertainty can never be wholly eliminated. And since we have to live with it, at least to some extent, there is much to be gained by quantifying the uncertainty as precisely as possible.

    Expressed in other terms, we’d like to know just how uncertain our uncertainty is.

    That issue was taken up in a new study, led by Swami Sankaranarayanan, a postdoc at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and his co-authors — Anastasios Angelopoulos and Stephen Bates of the University of California at Berkeley; Yaniv Romano of Technion, the Israel Institute of Technology; and Phillip Isola, an associate professor of electrical engineering and computer science at MIT. These researchers succeeded not only in obtaining accurate measures of uncertainty, they also found a way to display uncertainty in a manner the average person could grasp.

    Their paper, which was presented in December at the Neural Information Processing Systems Conference in New Orleans, relates to computer vision — a field of artificial intelligence that involves training computers to glean information from digital images. The focus of this research is on images that are partially smudged or corrupted (due to missing pixels), as well as on methods — computer algorithms, in particular — that are designed to uncover the part of the signal that is marred or otherwise concealed. An algorithm of this sort, Sankaranarayanan explains, “takes the blurred image as the input and gives you a clean image as the output” — a process that typically occurs in a couple of steps.

    First, there is an encoder, a kind of neural network specifically trained by the researchers for the task of de-blurring fuzzy images. The encoder takes a distorted image and, from that, creates an abstract (or “latent”) representation of a clean image in a form — consisting of a list of numbers — that is intelligible to a computer but would not make sense to most humans. The next step is a decoder, of which there are a couple of types, that are again usually neural networks. Sankaranarayanan and his colleagues worked with a kind of decoder called a “generative” model. In particular, they used an off-the-shelf version called StyleGAN, which takes the numbers from the encoded representation (of a cat, for instance) as its input and then constructs a complete, cleaned-up image (of that particular cat). So the entire process, including the encoding and decoding stages, yields a crisp picture from an originally muddied rendering.

    But how much faith can someone place in the accuracy of the resultant image? And, as addressed in the December 2022 paper, what is the best way to represent the uncertainty in that image? The standard approach is to create a “saliency map,” which ascribes a probability value — somewhere between 0 and 1 — to indicate the confidence the model has in the correctness of every pixel, taken one at a time. This strategy has a drawback, according to Sankaranarayanan, “because the prediction is performed independently for each pixel. But meaningful objects occur within groups of pixels, not within an individual pixel,” he adds, which is why he and his colleagues are proposing an entirely different way of assessing uncertainty.

    Their approach is centered around the “semantic attributes” of an image — groups of pixels that, when taken together, have meaning, making up a human face, for example, or a dog, or some other recognizable thing. The objective, Sankaranarayanan maintains, “is to estimate uncertainty in a way that relates to the groupings of pixels that humans can readily interpret.”

    Whereas the standard method might yield a single image, constituting the “best guess” as to what the true picture should be, the uncertainty in that representation is normally hard to discern. The new paper argues that for use in the real world, uncertainty should be presented in a way that holds meaning for people who are not experts in machine learning. Rather than producing a single image, the authors have devised a procedure for generating a range of images — each of which might be correct. Moreover, they can set precise bounds on the range, or interval, and provide a probabilistic guarantee that the true depiction lies somewhere within that range. A narrower range can be provided if the user is comfortable with, say, 90 percent certitude, and a narrower range still if more risk is acceptable.

    The authors believe their paper puts forth the first algorithm, designed for a generative model, which can establish uncertainty intervals that relate to meaningful (semantically-interpretable) features of an image and come with “a formal statistical guarantee.” While that is an important milestone, Sankaranarayanan considers it merely a step toward “the ultimate goal. So far, we have been able to do this for simple things, like restoring images of human faces or animals, but we want to extend this approach into more critical domains, such as medical imaging, where our ‘statistical guarantee’ could be especially important.”

    Suppose that the film, or radiograph, of a chest X-ray is blurred, he adds, “and you want to reconstruct the image. If you are given a range of images, you want to know that the true image is contained within that range, so you are not missing anything critical” — information that might reveal whether or not a patient has lung cancer or pneumonia. In fact, Sankaranarayanan and his colleagues have already begun working with a radiologist to see if their algorithm for predicting pneumonia could be useful in a clinical setting.

    Their work may also have relevance in the law enforcement field, he says. “The picture from a surveillance camera may be blurry, and you want to enhance that. Models for doing that already exist, but it is not easy to gauge the uncertainty. And you don’t want to make a mistake in a life-or-death situation.” The tools that he and his colleagues are developing could help identify a guilty person and help exonerate an innocent one as well.

    Much of what we do and many of the things happening in the world around us are shrouded in uncertainty, Sankaranarayanan notes. Therefore, gaining a firmer grasp of that uncertainty could help us in countless ways. For one thing, it can tell us more about exactly what it is we do not know.

    Angelopoulos was supported by the National Science Foundation. Bates was supported by the Foundations of Data Science Institute and the Simons Institute. Romano was supported by the Israel Science Foundation and by a Career Advancement Fellowship from Technion. Sankaranarayanan’s and Isola’s research for this project was sponsored by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2- 1000. MIT SuperCloud and the Lincoln Laboratory Supercomputing Center also provided computing resources that contributed to the results reported in this work. More

  • in

    Computing for the health of the planet

    The health of the planet is one of the most important challenges facing humankind today. From climate change to unsafe levels of air and water pollution to coastal and agricultural land erosion, a number of serious challenges threaten human and ecosystem health.

    Ensuring the health and safety of our planet necessitates approaches that connect scientific, engineering, social, economic, and political aspects. New computational methods can play a critical role by providing data-driven models and solutions for cleaner air, usable water, resilient food, efficient transportation systems, better-preserved biodiversity, and sustainable sources of energy.

    The MIT Schwarzman College of Computing is committed to hiring multiple new faculty in computing for climate and the environment, as part of MIT’s plan to recruit 20 climate-focused faculty under its climate action plan. This year the college undertook searches with several departments in the schools of Engineering and Science for shared faculty in computing for health of the planet, one of the six strategic areas of inquiry identified in an MIT-wide planning process to help focus shared hiring efforts. The college also undertook searches for core computing faculty in the Department of Electrical Engineering and Computer Science (EECS).

    The searches are part of an ongoing effort by the MIT Schwarzman College of Computing to hire 50 new faculty — 25 shared with other academic departments and 25 in computer science and artificial intelligence and decision-making. The goal is to build capacity at MIT to help more deeply infuse computing and other disciplines in departments.

    Four interdisciplinary scholars were hired in these searches. They will join the MIT faculty in the coming year to engage in research and teaching that will advance physical understanding of low-carbon energy solutions, Earth-climate modeling, biodiversity monitoring and conservation, and agricultural management through high-performance computing, transformational numerical methods, and machine-learning techniques.

    “By coordinating hiring efforts with multiple departments and schools, we were able to attract a cohort of exceptional scholars in this area to MIT. Each of them is developing and using advanced computational methods and tools to help find solutions for a range of climate and environmental issues,” says Daniel Huttenlocher, dean of the MIT Schwarzman College of Computing and the Henry Warren Ellis Professor of Electrical Engineering and Computer Science. “They will also help strengthen cross-departmental ties in computing across an important, critical area for MIT and the world.”

    “These strategic hires in the area of computing for climate and the environment are an incredible opportunity for the college to deepen its academic offerings and create new opportunity for collaboration across MIT,” says Anantha P. Chandrakasan, dean of the MIT School of Engineering and the Vannevar Bush Professor of Electrical Engineering and Computer Science. “The college plays a pivotal role in MIT’s overarching effort to hire climate-focused faculty — introducing the critical role of computing to address the health of the planet through innovative research and curriculum.”

    The four new faculty members are:

    Sara Beery will join MIT as an assistant professor in the Faculty of Artificial Intelligence and Decision-Making in EECS in September 2023. Beery received her PhD in computing and mathematical sciences at Caltech in 2022, where she was advised by Pietro Perona. Her research focuses on building computer vision methods that enable global-scale environmental and biodiversity monitoring across data modalities, tackling real-world challenges including strong spatiotemporal correlations, imperfect data quality, fine-grained categories, and long-tailed distributions. She partners with nongovernmental organizations and government agencies to deploy her methods in the wild worldwide and works toward increasing the diversity and accessibility of academic research in artificial intelligence through interdisciplinary capacity building and education.

    Priya Donti will join MIT as an assistant professor in the faculties of Electrical Engineering and Artificial Intelligence and Decision-Making in EECS in academic year 2023-24. Donti recently finished her PhD in the Computer Science Department and the Department of Engineering and Public Policy at Carnegie Mellon University, co-advised by Zico Kolter and Inês Azevedo. Her work focuses on machine learning for forecasting, optimization, and control in high-renewables power grids. Specifically, her research explores methods to incorporate the physics and hard constraints associated with electric power systems into deep learning models. Donti is also co-founder and chair of Climate Change AI, a nonprofit initiative to catalyze impactful work at the intersection of climate change and machine learning that is currently running through the Cornell Tech Runway Startup Postdoc Program.

    Ericmoore Jossou will join MIT as an assistant professor in a shared position between the Department of Nuclear Science and Engineering and the faculty of electrical engineering in EECS in July 2023. He is currently an assistant scientist at the Brookhaven National Laboratory, a U.S. Department of Energy-affiliated lab that conducts research in nuclear and high energy physics, energy science and technology, environmental and bioscience, nanoscience, and national security. His research at MIT will focus on understanding the processing-structure-properties correlation of materials for nuclear energy applications through advanced experiments, multiscale simulations, and data science. Jossou obtained his PhD in mechanical engineering in 2019 from the University of Saskatchewan.

    Sherrie Wang will join MIT as an assistant professor in a shared position between the Department of Mechanical Engineering and the Institute for Data, Systems, and Society in academic year 2023-24. Wang is currently a Ciriacy-Wantrup Postdoctoral Fellow at the University of California at Berkeley, hosted by Solomon Hsiang and the Global Policy Lab. She develops machine learning for Earth observation data. Her primary application areas are improving agricultural management and forecasting climate phenomena. She obtained her PhD in computational and mathematical engineering from Stanford University in 2021, where she was advised by David Lobell. More

  • in

    Researchers release open-source photorealistic simulator for autonomous driving

    Hyper-realistic virtual worlds have been heralded as the best driving schools for autonomous vehicles (AVs), since they’ve proven fruitful test beds for safely trying out dangerous driving scenarios. Tesla, Waymo, and other self-driving companies all rely heavily on data to enable expensive and proprietary photorealistic simulators, since testing and gathering nuanced I-almost-crashed data usually isn’t the most easy or desirable to recreate. 

    To that end, scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) created “VISTA 2.0,” a data-driven simulation engine where vehicles can learn to drive in the real world and recover from near-crash scenarios. What’s more, all of the code is being open-sourced to the public. 

    “Today, only companies have software like the type of simulation environments and capabilities of VISTA 2.0, and this software is proprietary. With this release, the research community will have access to a powerful new tool for accelerating the research and development of adaptive robust control for autonomous driving,” says MIT Professor and CSAIL Director Daniela Rus, senior author on a paper about the research. 

    Play video

    VISTA is a data-driven, photorealistic simulator for autonomous driving. It can simulate not just live video but LiDAR data and event cameras, and also incorporate other simulated vehicles to model complex driving situations. VISTA is open source and the code can be found below.

    VISTA 2.0 builds off of the team’s previous model, VISTA, and it’s fundamentally different from existing AV simulators since it’s data-driven — meaning it was built and photorealistically rendered from real-world data — thereby enabling direct transfer to reality. While the initial iteration supported only single car lane-following with one camera sensor, achieving high-fidelity data-driven simulation required rethinking the foundations of how different sensors and behavioral interactions can be synthesized. 

    Enter VISTA 2.0: a data-driven system that can simulate complex sensor types and massively interactive scenarios and intersections at scale. With much less data than previous models, the team was able to train autonomous vehicles that could be substantially more robust than those trained on large amounts of real-world data. 

    “This is a massive jump in capabilities of data-driven simulation for autonomous vehicles, as well as the increase of scale and ability to handle greater driving complexity,” says Alexander Amini, CSAIL PhD student and co-lead author on two new papers, together with fellow PhD student Tsun-Hsuan Wang. “VISTA 2.0 demonstrates the ability to simulate sensor data far beyond 2D RGB cameras, but also extremely high dimensional 3D lidars with millions of points, irregularly timed event-based cameras, and even interactive and dynamic scenarios with other vehicles as well.” 

    The team was able to scale the complexity of the interactive driving tasks for things like overtaking, following, and negotiating, including multiagent scenarios in highly photorealistic environments. 

    Training AI models for autonomous vehicles involves hard-to-secure fodder of different varieties of edge cases and strange, dangerous scenarios, because most of our data (thankfully) is just run-of-the-mill, day-to-day driving. Logically, we can’t just crash into other cars just to teach a neural network how to not crash into other cars.

    Recently, there’s been a shift away from more classic, human-designed simulation environments to those built up from real-world data. The latter have immense photorealism, but the former can easily model virtual cameras and lidars. With this paradigm shift, a key question has emerged: Can the richness and complexity of all of the sensors that autonomous vehicles need, such as lidar and event-based cameras that are more sparse, accurately be synthesized? 

    Lidar sensor data is much harder to interpret in a data-driven world — you’re effectively trying to generate brand-new 3D point clouds with millions of points, only from sparse views of the world. To synthesize 3D lidar point clouds, the team used the data that the car collected, projected it into a 3D space coming from the lidar data, and then let a new virtual vehicle drive around locally from where that original vehicle was. Finally, they projected all of that sensory information back into the frame of view of this new virtual vehicle, with the help of neural networks. 

    Together with the simulation of event-based cameras, which operate at speeds greater than thousands of events per second, the simulator was capable of not only simulating this multimodal information, but also doing so all in real time — making it possible to train neural nets offline, but also test online on the car in augmented reality setups for safe evaluations. “The question of if multisensor simulation at this scale of complexity and photorealism was possible in the realm of data-driven simulation was very much an open question,” says Amini. 

    With that, the driving school becomes a party. In the simulation, you can move around, have different types of controllers, simulate different types of events, create interactive scenarios, and just drop in brand new vehicles that weren’t even in the original data. They tested for lane following, lane turning, car following, and more dicey scenarios like static and dynamic overtaking (seeing obstacles and moving around so you don’t collide). With the multi-agency, both real and simulated agents interact, and new agents can be dropped into the scene and controlled any which way. 

    Taking their full-scale car out into the “wild” — a.k.a. Devens, Massachusetts — the team saw  immediate transferability of results, with both failures and successes. They were also able to demonstrate the bodacious, magic word of self-driving car models: “robust.” They showed that AVs, trained entirely in VISTA 2.0, were so robust in the real world that they could handle that elusive tail of challenging failures. 

    Now, one guardrail humans rely on that can’t yet be simulated is human emotion. It’s the friendly wave, nod, or blinker switch of acknowledgement, which are the type of nuances the team wants to implement in future work. 

    “The central algorithm of this research is how we can take a dataset and build a completely synthetic world for learning and autonomy,” says Amini. “It’s a platform that I believe one day could extend in many different axes across robotics. Not just autonomous driving, but many areas that rely on vision and complex behaviors. We’re excited to release VISTA 2.0 to help enable the community to collect their own datasets and convert them into virtual worlds where they can directly simulate their own virtual autonomous vehicles, drive around these virtual terrains, train autonomous vehicles in these worlds, and then can directly transfer them to full-sized, real self-driving cars.” 

    Amini and Wang wrote the paper alongside Zhijian Liu, MIT CSAIL PhD student; Igor Gilitschenski, assistant professor in computer science at the University of Toronto; Wilko Schwarting, AI research scientist and MIT CSAIL PhD ’20; Song Han, associate professor at MIT’s Department of Electrical Engineering and Computer Science; Sertac Karaman, associate professor of aeronautics and astronautics at MIT; and Daniela Rus, MIT professor and CSAIL director. The researchers presented the work at the IEEE International Conference on Robotics and Automation (ICRA) in Philadelphia. 

    This work was supported by the National Science Foundation and Toyota Research Institute. The team acknowledges the support of NVIDIA with the donation of the Drive AGX Pegasus. More

  • in

    A new state of the art for unsupervised vision

    Labeling data can be a chore. It’s the main source of sustenance for computer-vision models; without it, they’d have a lot of difficulty identifying objects, people, and other important image characteristics. Yet producing just an hour of tagged and labeled data can take a whopping 800 hours of human time. Our high-fidelity understanding of the world develops as machines can better perceive and interact with our surroundings. But they need more help.

    Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), Microsoft, and Cornell University have attempted to solve this problem plaguing vision models by creating “STEGO,” an algorithm that can jointly discover and segment objects without any human labels at all, down to the pixel.

    STEGO learns something called “semantic segmentation” — fancy speak for the process of assigning a label to every pixel in an image. Semantic segmentation is an important skill for today’s computer-vision systems because images can be cluttered with objects. Even more challenging is that these objects don’t always fit into literal boxes; algorithms tend to work better for discrete “things” like people and cars as opposed to “stuff” like vegetation, sky, and mashed potatoes. A previous system might simply perceive a nuanced scene of a dog playing in the park as just a dog, but by assigning every pixel of the image a label, STEGO can break the image into its main ingredients: a dog, sky, grass, and its owner.

    Play video

    A new state of the art for unsupervised computer vision

    Assigning every single pixel of the world a label is ambitious — especially without any kind of feedback from humans. The majority of algorithms today get their knowledge from mounds of labeled data, which can take painstaking human-hours to source. Just imagine the excitement of labeling every pixel of 100,000 images! To discover these objects without a human’s helpful guidance, STEGO looks for similar objects that appear throughout a dataset. It then associates these similar objects together to construct a consistent view of the world across all of the images it learns from.

    Seeing the world

    Machines that can “see” are crucial for a wide array of new and emerging technologies like self-driving cars and predictive modeling for medical diagnostics. Since STEGO can learn without labels, it can detect objects in many different domains, even those that humans don’t yet understand fully. 

    “If you’re looking at oncological scans, the surface of planets, or high-resolution biological images, it’s hard to know what objects to look for without expert knowledge. In emerging domains, sometimes even human experts don’t know what the right objects should be,” says Mark Hamilton, a PhD student in electrical engineering and computer science at MIT, research affiliate of MIT CSAIL, software engineer at Microsoft, and lead author on a new paper about STEGO. “In these types of situations where you want to design a method to operate at the boundaries of science, you can’t rely on humans to figure it out before machines do.”

    STEGO was tested on a slew of visual domains spanning general images, driving images, and high-altitude aerial photographs. In each domain, STEGO was able to identify and segment relevant objects that were closely aligned with human judgments. STEGO’s most diverse benchmark was the COCO-Stuff dataset, which is made up of diverse images from all over the world, from indoor scenes to people playing sports to trees and cows. In most cases, the previous state-of-the-art system could capture a low-resolution gist of a scene, but struggled on fine-grained details: A human was a blob, a motorcycle was captured as a person, and it couldn’t recognize any geese. On the same scenes, STEGO doubled the performance of previous systems and discovered concepts like animals, buildings, people, furniture, and many others.

    STEGO not only doubled the performance of prior systems on the COCO-Stuff benchmark, but made similar leaps forward in other visual domains. When applied to driverless car datasets, STEGO successfully segmented out roads, people, and street signs with much higher resolution and granularity than previous systems. On images from space, the system broke down every single square foot of the surface of the Earth into roads, vegetation, and buildings. 

    Connecting the pixels

    STEGO — which stands for “Self-supervised Transformer with Energy-based Graph Optimization” — builds on top of the DINO algorithm, which learned about the world through 14 million images from the ImageNet database. STEGO refines the DINO backbone through a learning process that mimics our own way of stitching together pieces of the world to make meaning. 

    For example, you might consider two images of dogs walking in the park. Even though they’re different dogs, with different owners, in different parks, STEGO can tell (without humans) how each scene’s objects relate to each other. The authors even probe STEGO’s mind to see how each little, brown, furry thing in the images are similar, and likewise with other shared objects like grass and people. By connecting objects across images, STEGO builds a consistent view of the word.

    “The idea is that these types of algorithms can find consistent groupings in a largely automated fashion so we don’t have to do that ourselves,” says Hamilton. “It might have taken years to understand complex visual datasets like biological imagery, but if we can avoid spending 1,000 hours combing through data and labeling it, we can find and discover new information that we might have missed. We hope this will help us understand the visual word in a more empirically grounded way.”

    Looking ahead

    Despite its improvements, STEGO still faces certain challenges. One is that labels can be arbitrary. For example, the labels of the COCO-Stuff dataset distinguish between “food-things” like bananas and chicken wings, and “food-stuff” like grits and pasta. STEGO doesn’t see much of a distinction there. In other cases, STEGO was confused by odd images — like one of a banana sitting on a phone receiver — where the receiver was labeled “foodstuff,” instead of “raw material.” 

    For upcoming work, they’re planning to explore giving STEGO a bit more flexibility than just labeling pixels into a fixed number of classes as things in the real world can sometimes be multiple things at the same time (like “food”, “plant” and “fruit”). The authors hope this will give the algorithm room for uncertainty, trade-offs, and more abstract thinking.

    “In making a general tool for understanding potentially complicated datasets, we hope that this type of an algorithm can automate the scientific process of object discovery from images. There’s a lot of different domains where human labeling would be prohibitively expensive, or humans simply don’t even know the specific structure, like in certain biological and astrophysical domains. We hope that future work enables application to a very broad scope of datasets. Since you don’t need any human labels, we can now start to apply ML tools more broadly,” says Hamilton.

    “STEGO is simple, elegant, and very effective. I consider unsupervised segmentation to be a benchmark for progress in image understanding, and a very difficult problem. The research community has made terrific progress in unsupervised image understanding with the adoption of transformer architectures,” says Andrea Vedaldi, professor of computer vision and machine learning and a co-lead of the Visual Geometry Group at the engineering science department of the University of Oxford. “This research provides perhaps the most direct and effective demonstration of this progress on unsupervised segmentation.” 

    Hamilton wrote the paper alongside MIT CSAIL PhD student Zhoutong Zhang, Assistant Professor Bharath Hariharan of Cornell University, Associate Professor Noah Snavely of Cornell Tech, and MIT professor William T. Freeman. They will present the paper at the 2022 International Conference on Learning Representations (ICLR).  More

  • in

    Security tool guarantees privacy in surveillance footage

    Surveillance cameras have an identity problem, fueled by an inherent tension between utility and privacy. As these powerful little devices have cropped up seemingly everywhere, the use of machine learning tools has automated video content analysis at a massive scale — but with increasing mass surveillance, there are currently no legally enforceable rules to limit privacy invasions. 

    Security cameras can do a lot — they’ve become smarter and supremely more competent than their ghosts of grainy pictures past, the ofttimes “hero tool” in crime media. (“See that little blurry blue blob in the right hand corner of that densely populated corner — we got him!”) Now, video surveillance can help health officials measure the fraction of people wearing masks, enable transportation departments to monitor the density and flow of vehicles, bikes, and pedestrians, and provide businesses with a better understanding of shopping behaviors. But why has privacy remained a weak afterthought? 

    The status quo is to retrofit video with blurred faces or black boxes. Not only does this prevent analysts from asking some genuine queries (e.g., Are people wearing masks?), it also doesn’t always work; the system may miss some faces and leave them unblurred for the world to see. Dissatisfied with this status quo, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), in collaboration with other institutions, came up with a system to better guarantee privacy in video footage from surveillance cameras. Called “Privid,” the system lets analysts submit video data queries, and adds a little bit of noise (extra data) to the end result to ensure that an individual can’t be identified. The system builds on a formal definition of privacy — “differential privacy” — which allows access to aggregate statistics about private data without revealing personally identifiable information.

    Typically, analysts would just have access to the entire video to do whatever they want with it, but Privid makes sure the video isn’t a free buffet. Honest analysts can get access to the information they need, but that access is restrictive enough that malicious analysts can’t do too much with it. To enable this, rather than running the code over the entire video in one shot, Privid breaks the video into small pieces and runs processing code over each chunk. Instead of getting results back from each piece, the segments are aggregated, and that additional noise is added. (There’s also information on the error bound you’re going to get on your result — maybe a 2 percent error margin, given the extra noisy data added). 

    For example, the code might output the number of people observed in each video chunk, and the aggregation might be the “sum,” to count the total number of people wearing face coverings, or the “average” to estimate the density of crowds. 

    Privid allows analysts to use their own deep neural networks that are commonplace for video analytics today. This gives analysts the flexibility to ask questions that the designers of Privid did not anticipate. Across a variety of videos and queries, Privid was accurate within 79 to 99 percent of a non-private system.

    “We’re at a stage right now where cameras are practically ubiquitous. If there’s a camera on every street corner, every place you go, and if someone could actually process all of those videos in aggregate, you can imagine that entity building a very precise timeline of when and where a person has gone,” says MIT CSAIL PhD student ​​Frank Cangialosi, the lead author on a paper about Privid. “People are already worried about location privacy with GPS — video data in aggregate could capture not only your location history, but also moods, behaviors, and more at each location.” 

    Privid introduces a new notion of “duration-based privacy,” which decouples the definition of privacy from its enforcement — with obfuscation, if your privacy goal is to protect all people, the enforcement mechanism needs to do some work to find the people to protect, which it may or may not do perfectly. With this mechanism, you don’t need to fully specify everything, and you’re not hiding more information than you need to. 

    Let’s say we have a video overlooking a street. Two analysts, Alice and Bob, both claim they want to count the number of people that pass by each hour, so they submit a video processing module and ask for a sum aggregation.

    The first analyst is the city planning department, which hopes to use this information to understand footfall patterns and plan sidewalks for the city. Their model counts people and outputs this count for each video chunk.

    The other analyst is malicious. They hope to identify every time “Charlie” passes by the camera. Their model only looks for Charlie’s face, and outputs a large number if Charlie is present (i.e., the “signal” they’re trying to extract), or zero otherwise. Their hope is that the sum will be non-zero if Charlie was present. 

    From Privid’s perspective, these two queries look identical. It’s hard to reliably determine what their models might be doing internally, or what the analyst hopes to use the data for. This is where the noise comes in. Privid executes both of the queries, and adds the same amount of noise for each. In the first case, because Alice was counting all people, this noise will only have a small impact on the result, but likely won’t impact the usefulness. 

    In the second case, since Bob was looking for a specific signal (Charlie was only visible for a few chunks), the noise is enough to prevent them from knowing if Charlie was there or not. If they see a non-zero result, it might be because Charlie was actually there, or because the model outputs “zero,” but the noise made it non-zero. Privid didn’t need to know anything about when or where Charlie appeared, the system just needed to know a rough upper bound on how long Charlie might appear for, which is easier to specify than figuring out the exact locations, which prior methods rely on. 

    The challenge is determining how much noise to add — Privid wants to add just enough to hide everyone, but not so much that it would be useless for analysts. Adding noise to the data and insisting on queries over time windows means that your result isn’t going to be as accurate as it could be, but the results are still useful while providing better privacy. 

    Cangialosi wrote the paper with Princeton PhD student Neil Agarwal, MIT CSAIL PhD student Venkat Arun, assistant professor at the University of Chicago Junchen Jiang, assistant professor at Rutgers University and former MIT CSAIL postdoc Srinivas Narayana, associate professor at Rutgers University Anand Sarwate, and assistant professor at Princeton University and Ravi Netravali SM ’15, PhD ’18. Cangialosi will present the paper at the USENIX Symposium on Networked Systems Design and Implementation Conference in April in Renton, Washington. 

    This work was partially supported by a Sloan Research Fellowship and National Science Foundation grants. More