More stories

  •

    Generating new molecules with graph grammar

    Chemical engineers and materials scientists are constantly looking for the next revolutionary material, chemical, and drug. The rise of machine-learning approaches is expediting the discovery process, which could otherwise take years. “Ideally, the goal is to train a machine-learning model on a few existing chemical samples and then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all these components, you can build new molecules with optimal properties, and you also know how to synthesize them. That’s the overall vision that people in that space want to achieve.”

    However, current techniques, mainly deep learning, require extensive datasets for training models, and many class-specific chemical datasets contain only a handful of example compounds. That scarcity limits the models’ ability to generalize and to generate molecules that could be created in the real world.

    Now, a new paper from researchers at MIT and IBM tackles this problem using a generative graph model to build new synthesizable molecules within the same chemical class as their training data. To do this, they treat the formation of atoms and chemical bonds as a graph and develop a graph grammar (a linguistics analogy of systems and structures for word ordering) that contains a sequence of rules for building molecules, such as monomers and polymers. Using the grammar and production rules that were inferred from the training set, the model can not only reverse engineer its examples, but can also create new compounds in a systematic and data-efficient way. “We basically built a language for creating molecules,” says Matusik. “This grammar essentially is the generative model.”

    Matusik’s co-authors include MIT graduate students Minghao Guo, who is the lead author, and Beichen Li as well as Veronika Thost, Payal Das, and Jie Chen, research staff members with IBM Research. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they’ve called data-efficient graph grammar (DEG), will be presented at the International Conference on Learning Representations.

    “We want to use this grammar representation for monomer and polymer generation, because this grammar is explainable and expressive,” says Guo. “With only a few number of the production rules, we can generate many kinds of structures.”

    A molecular structure can be thought of as a symbolic representation in a graph: a string of atoms (nodes) joined together by chemical bonds (edges). In this method, the researchers allow the model to take the chemical structure and collapse a substructure of the molecule down to one node; this may be two atoms connected by a bond, a short sequence of bonded atoms, or a ring of atoms. This is done repeatedly, creating the production rules as it goes, until a single node remains. The rules and grammar can then be applied in the reverse order to recreate the training set from scratch, or combined in different ways to produce new molecules of the same chemical class.
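
    The collapse-and-record loop described above can be illustrated with a minimal sketch. This is not the DEG algorithm itself: the function names (`contract`, `learn_rules`, `expand`), the choice to always collapse a single bonded pair, and the three-atom example are all assumptions made for illustration. DEG learns which substructures to collapse, and that learning step is not modeled here.

```python
# Toy sketch only: a hypothetical contraction scheme, not the DEG algorithm.

def contract(edges, group, new_node):
    """Collapse every node in `group` into `new_node` and rewire the edges."""
    rewired = set()
    for edge in edges:
        a, b = tuple(edge)
        a = new_node if a in group else a
        b = new_node if b in group else b
        if a != b:  # bonds inside the collapsed substructure disappear
            rewired.add(frozenset((a, b)))
    return rewired


def learn_rules(nodes, edges):
    """Collapse one bonded pair at a time until a single node remains.

    Assumes the molecule graph is connected. Returns the recorded rules as
    (new_node, (part_a, part_b)) tuples, in the order they were created.
    """
    nodes = set(nodes)
    edges = {frozenset(e) for e in edges}
    rules = []
    while len(nodes) > 1:
        a, b = sorted(min(edges, key=sorted))  # deterministic choice of bond
        new_node = f"R{len(rules)}"
        rules.append((new_node, (a, b)))
        nodes = (nodes - {a, b}) | {new_node}
        edges = contract(edges, {a, b}, new_node)
    return rules


def expand(rules):
    """Replay the rules in reverse order to recover the original atoms."""
    atoms = {rules[-1][0]}
    for new_node, parts in reversed(rules):
        atoms.remove(new_node)
        atoms.update(parts)
    return atoms


# Hypothetical three-atom chain: C1 - C2 - O
rules = learn_rules(["C1", "C2", "O"], [("C1", "C2"), ("C2", "O")])
recovered = expand(rules)
```

    In this toy version, `expand` recovers only which atoms were present, not their bonds; a real production rule would also record how the collapsed substructure attaches to its neighbors.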

    “Existing graph generation methods would produce one node or one edge sequentially at a time, but we are looking at higher-level structures and, specifically, exploiting chemistry knowledge, so that we don’t treat the individual atoms and bonds as the unit. This simplifies the generation process and also makes it more data-efficient to learn,” says Chen.

    Further, the researchers optimized the technique so that the bottom-up grammar stayed relatively simple and straightforward, and so that the molecules it generated could actually be made.

    “If we switch the order of applying these production rules, we would get another molecule; what’s more, we can enumerate all the possibilities and generate tons of them,” says Chen. “Some of these molecules are valid and some of them not, so the learning of the grammar itself is actually to figure out a minimal collection of production rules, such that the percentage of molecules that can actually be synthesized is maximized.” While the researchers concentrated on three training sets of fewer than 33 samples each (acrylates, chain extenders, and isocyanates), they note that the process could be applied to any chemical class.

    To see how their method performed, the researchers tested DEG against other state-of-the-art models and techniques, looking at percentages of chemically valid and unique molecules, diversity of those created, success rate of retrosynthesis, and percentage of molecules belonging to the training data’s monomer class.

    “We clearly show that, for the synthesizability and membership, our algorithm outperforms all the existing methods by a very large margin, while it’s comparable for some other widely-used metrics,” says Guo. Further, “what is amazing about our algorithm is that we only need about 0.15 percent of the original dataset to achieve very similar results compared to state-of-the-art approaches that train on tens of thousands of samples. Our algorithm can specifically handle the problem of data sparsity.”

    In the immediate future, the team plans to address scaling up this grammar learning process to be able to generate large graphs, as well as produce and identify chemicals with desired properties.

    Down the road, the researchers see many applications for the DEG method, as it is adaptable beyond generating new chemical structures. A graph is a very flexible representation, and many entities can be symbolized in this form: robots, vehicles, buildings, and electronic circuits, for example. “Essentially, our goal is to build up our grammar, so that our graphic representation can be widely used across many different domains,” says Guo. “DEG can automate the design of novel entities and structures,” adds Chen.

    This research was supported, in part, by the MIT-IBM Watson AI Lab and Evonik.

  •

    Study: Global cancer risk from burning organic matter comes from unregulated chemicals

    Whenever organic matter is burned, such as in a wildfire, a power plant, a car’s exhaust, or in daily cooking, the combustion releases polycyclic aromatic hydrocarbons (PAHs) — a class of pollutants that is known to cause lung cancer.

    There are more than 100 known types of PAH compounds emitted daily into the atmosphere. Regulators, however, have historically relied on measurements of a single compound, benzo(a)pyrene, to gauge a community’s risk of developing cancer from PAH exposure. Now MIT scientists have found that benzo(a)pyrene may be a poor indicator of this type of cancer risk.

    In a modeling study appearing today in the journal GeoHealth, the team reports that benzo(a)pyrene plays a small part — about 11 percent — in the global risk of developing PAH-associated cancer. Instead, 89 percent of that cancer risk comes from other PAH compounds, many of which are not directly regulated.

    Interestingly, about 17 percent of PAH-associated cancer risk comes from “degradation products” — chemicals that are formed when emitted PAHs react in the atmosphere. Many of these degradation products can in fact be more toxic than the emitted PAH from which they formed.

    The team hopes the results will encourage scientists and regulators to look beyond benzo(a)pyrene, to consider a broader class of PAHs when assessing a community’s cancer risk.

    “Most of the regulatory science and standards for PAHs are based on benzo(a)pyrene levels. But that is a big blind spot that could lead you down a very wrong path in terms of assessing whether cancer risk is improving or not, and whether it’s relatively worse in one place than another,” says study author Noelle Selin, a professor in MIT’s Institute for Data, Systems and Society, and the Department of Earth, Atmospheric and Planetary Sciences.

    Selin’s MIT co-authors include Jesse Kroll, Amy Hrdina, Ishwar Kohale, Forest White, Bevin Engelward, and Jamie Kelly (who is now at University College London). Peter Ivatt and Mathew Evans at the University of York are also co-authors.

    Chemical pixels

    Benzo(a)pyrene has historically been the poster chemical for PAH exposure. The compound’s indicator status is largely based on early toxicology studies. But recent research suggests the chemical may not be the PAH representative that regulators have long relied upon.   

    “There has been a bit of evidence suggesting benzo(a)pyrene may not be very important, but this was from just a few field studies,” says Kelly, a former postdoc in Selin’s group and the study’s lead author.

    Kelly and his colleagues instead took a systematic approach to evaluate benzo(a)pyrene’s suitability as a PAH indicator. The team began by using GEOS-Chem, a global, three-dimensional chemical transport model that breaks the world into individual grid boxes and simulates within each box the reactions and concentrations of chemicals in the atmosphere.

    They extended this model to include chemical descriptions of how various PAH compounds, including benzo(a)pyrene, would react in the atmosphere. The team then plugged in recent data from emissions inventories and meteorological observations, and ran the model forward to simulate the concentrations of various PAH chemicals around the world over time.

    Risky reactions

    In their simulations, the researchers started with 16 relatively well-studied PAH chemicals, including benzo(a)pyrene, and traced the concentrations of these chemicals, plus the concentration of their degradation products over two generations, or chemical transformations. In total, the team evaluated 48 PAH species.

    They then compared these concentrations with actual concentrations of the same chemicals, recorded by monitoring stations around the world. This comparison was close enough to show that the model’s concentration predictions were realistic.

    Then, within each of the model’s grid boxes, the researchers related the concentration of each PAH chemical to its associated cancer risk; to do this, they had to develop a new method, based on previous studies in the literature, to avoid double-counting risk from the different chemicals. Finally, they overlaid population density maps to predict the number of cancer cases globally, based on the concentration and toxicity of a specific PAH chemical in each location.

    Dividing the cancer cases by population produced the cancer risk associated with that chemical. In this way, the team calculated the cancer risk for each of the 48 compounds, then determined each chemical’s individual contribution to the total risk.
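
    The arithmetic in this last step can be sketched in a few lines. The function name `risk_shares`, the case counts, and the population below are placeholders rather than values from the study; the counts were chosen only so that the resulting shares line up with the 11 and 17 percent figures quoted in this article.

```python
def risk_shares(cases_by_compound, population):
    """Return per-compound risk (cases / population) and each compound's share of total risk."""
    risks = {name: cases / population for name, cases in cases_by_compound.items()}
    total = sum(risks.values())
    shares = {name: risk / total for name, risk in risks.items()}
    return risks, shares


# Placeholder inputs for illustration only, not data from the study.
cases = {
    "benzo(a)pyrene": 11.0,
    "other emitted PAHs": 72.0,
    "degradation products": 17.0,
}
risks, shares = risk_shares(cases, population=1_000_000)
```

    Because every compound is divided by the same population, the shares depend only on the relative case counts; the population matters for the absolute risk, not for each chemical’s contribution.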

    This analysis revealed that benzo(a)pyrene had a surprisingly small contribution, of about 11 percent, to the overall risk of developing cancer from PAH exposure globally. Eighty-nine percent of cancer risk came from other chemicals. And 17 percent of this risk arose from degradation products.

    “We see places where you can find concentrations of benzo(a)pyrene are lower, but the risk is higher because of these degradation products,” Selin says. “These products can be orders of magnitude more toxic, so the fact that they’re at tiny concentrations doesn’t mean you can write them off.”

    When the researchers compared calculated PAH-associated cancer risks around the world, they found significant differences depending on whether that risk calculation was based solely on concentrations of benzo(a)pyrene or on a region’s broader mix of PAH compounds.

    “If you use the old method, you would find the lifetime cancer risk is 3.5 times higher in Hong Kong versus southern India, but taking into account the differences in PAH mixtures, you get a difference of 12 times,” Kelly says. “So, there’s a big difference in the relative cancer risk between the two places. And we think it’s important to expand the group of compounds that regulators are thinking about, beyond just a single chemical.”

    The team’s study “provides an excellent contribution to better understanding these ubiquitous pollutants,” says Elisabeth Galarneau, an air quality expert and PhD research scientist in Canada’s Department of the Environment. “It will be interesting to see how these results compare to work being done elsewhere … to pin down which (compounds) need to be tracked and considered for the protection of human and environmental health.”

    This research was conducted in MIT’s Superfund Research Center and is supported in part by the National Institute of Environmental Health Sciences Superfund Basic Research Program, and the National Institutes of Health.

  •

    MIT welcomes nine MLK Visiting Professors and Scholars for 2021-22

    In its 31st year, the Martin Luther King Jr. (MLK) Visiting Professors and Scholars Program will host nine outstanding scholars from across the Americas. The flagship program honors the life and legacy of Martin Luther King Jr. by increasing the presence and recognizing the contributions of underrepresented minority scholars at MIT. Throughout the year, the cohort will enhance their scholarship through intellectual engagement with the MIT community and enrich the cultural, academic, and professional experience of students.

    The 2021-22 scholars

    Sanford Biggers is an interdisciplinary artist hosted by the Department of Architecture. His work is an interplay of narrative, perspective, and history that speaks to current social, political, and economic happenings while examining their contexts. His diverse practice positions him as a collaborator with the past through explorations of often-overlooked cultural and political narratives from American history. Through collaboration with his faculty host, Brandon Clifford, he will spend the year contributing to projects with Architecture; Art, Culture and Technology; the Transmedia Storytelling initiatives; and community workshops and engagement with local K-12 education.

    Kristen Dorsey is an assistant professor of engineering at Smith College. She will be hosted by the Program in Media Arts and Sciences at the MIT Media Lab. Her research focuses on the fabrication and characterization of microscale sensors and microelectromechanical systems. Dorsey tries to understand “why things go wrong” by investigating device reliability and stability. At MIT, Dorsey is interested in forging collaborations to consider issues of access and equity as they apply to wearable health care devices.

    Omolola “Lola” Eniola-Adefeso is the associate dean for graduate and professional education and associate professor of chemical engineering at the University of Michigan. She will join MIT’s Department of Chemical Engineering (ChemE). Eniola-Adefeso will work with Professor Paula Hammond on developing electrostatically assembled nanoparticle coatings that enable targeting of specific immune cell types. A co-founder and chief scientific officer of Asalyxa Bio, she is interested in the interactions between blood leukocytes and endothelial cells in vessel lumen lining, and how they change during inflammation response. Eniola-Adefeso will also work with the Diversity in Chemical Engineering (DICE) graduate student group in ChemE and the National Organization of Black Chemists and Chemical Engineers.

    Robert Gilliard Jr. is an assistant professor of chemistry at the University of Virginia and will join the MIT chemistry department, working closely with faculty host Christopher Cummins. His research focuses on various aspects of group 15 element chemistry. He was a founding member of the National Organization of Black Chemists and Chemical Engineers UGA section, and he has served as an American Chemical Society (ACS) Bridge Program mentor as well as an ACS Project Seed mentor. Gilliard has also collaborated with the Cleveland Public Library to expose diverse young scholars to STEM fields.

    Valencia Joyner Koomson ’98, MNG ’99 will return for the second semester of her appointment this fall in MIT’s Department of Electrical Engineering and Computer Science. Based at Tufts University, where she is an associate professor in the Department of Electrical and Computer Engineering, Koomson has focused her research on microelectronic systems for cell analysis and biomedical applications. In the past semester, she has served as a judge for the Black Alumni/ae of MIT Research Slam and worked closely with faculty host Professor Akintunde Akinwande.

    Luis Gilberto Murillo-Urrutia will continue his appointment in MIT’s Environmental Solutions Initiative. He has 30 years of experience in public policy design, implementation, and advocacy, most notably in the areas of sustainable regional development, environmental protection and management of natural resources, social inclusion, and peace building. At MIT, he has continued his research on environmental justice, with a focus on carbon policy and its impacts on Afro-descendant communities in Colombia.

    Sonya T. Smith was the first female professor of mechanical engineering at Howard University. She will join the Department of Aeronautics and Astronautics at MIT. Her research involves computational fluid dynamics and thermal management of electronics for air and space vehicles. She is looking forward to serving as a mentor to underrepresented students across MIT and fostering new research collaborations with her home lab at Howard.

    Lawrence Udeigwe is an associate professor of mathematics at Manhattan College and will join MIT’s Department of Brain and Cognitive Sciences. He plans to co-teach a graduate seminar course with Professor James DiCarlo to explore practical and philosophical questions regarding the use of simulations to build theories in neuroscience. Udeigwe also leads the Lorens Chuno group; as a singer-songwriter, his work tackles intersectionality issues faced by contemporary Africans.

    S. Craig Watkins is an internationally recognized expert in media and a professor at the University of Texas at Austin. He will join MIT’s Institute for Data, Systems, and Society to assist in researching the role of big data in enabling deep structural changes with regard to systemic racism. He will continue to expand on his work as founding director of the Institute for Media Innovation at the University of Texas at Austin, exploring the intersections of critical AI studies, critical race studies, and design. He will also work with MIT’s Center for Advanced Virtuality to develop computational systems that support social perspective-taking.

    Community engagement

    Throughout the 2021-22 academic year, MLK professors and scholars will be presenting their research at a monthly speaker series. Events will be held in an in-person/Zoom hybrid environment. All members of the MIT community are encouraged to attend and hear directly from this year’s cohort of outstanding scholars. To hear more about upcoming events, subscribe to their mailing list.

    On Sept. 15, all are invited to join the Institute Community and Equity Office in welcoming the scholars to campus by attending a welcome luncheon.

  •

    Using adversarial attacks to refine molecular energy predictions

    Neural networks (NNs) are increasingly being used to predict new materials, the rate and yield of chemical reactions, and drug-target interactions, among others. For these applications, they are orders of magnitude faster than traditional methods such as quantum mechanical simulations. 

    The price for this agility, however, is reliability. Because machine learning models only interpolate, they may fail when used outside the domain of training data.

    But the part that worried Rafael Gómez-Bombarelli, the Jeffrey Cheah Career Development Professor in the MIT Department of Materials Science and Engineering, and graduate students Daniel Schwalbe-Koda and Aik Rui Tan was that establishing the limits of these machine learning (ML) models is tedious and labor-intensive. 

    This is particularly true for predicting “potential energy surfaces” (PES), or the map of a molecule’s energy in all its configurations. These surfaces encode the complexities of a molecule into flatlands, valleys, peaks, troughs, and ravines. The most stable configurations of a system are usually in the deep pits, quantum mechanical chasms from which atoms and molecules typically do not escape.

    In a recent Nature Communications paper, the research team presented a way to demarcate the “safe zone” of a neural network by using “adversarial attacks.” Adversarial attacks have been studied for other classes of problems, such as image classification, but this is the first time that they are being used to sample molecular geometries in a PES. 

    “People have been using uncertainty for active learning for years in ML potentials. The key difference is that they need to run the full ML simulation and evaluate if the NN was reliable, and if it wasn’t, acquire more data, retrain and re-simulate. Meaning that it takes a long time to nail down the right model, and one has to run the ML simulation many times,” explains Gómez-Bombarelli.

    The Gómez-Bombarelli lab at MIT works on a synergistic synthesis of first-principles simulation and machine learning that greatly speeds up this process. The actual simulations are run only for a small fraction of these molecules, and all those data are fed into a neural network that learns how to predict the same properties for the rest of the molecules. They have successfully demonstrated these methods for a growing class of novel materials that includes catalysts for producing hydrogen from water, cheaper polymer electrolytes for electric vehicles, zeolites for molecular sieving, magnetic materials, and more.

    The challenge, however, is that these neural networks are only as smart as the data they are trained on. Considering the PES map, 99 percent of the data may fall into one pit, totally missing valleys that are of more interest.

    Such wrong predictions can have disastrous consequences — think of a self-driving car that fails to identify a person crossing the street.

    One way to find out the uncertainty of a model is to run the same data through multiple versions of it. 

    For this project, the researchers had multiple neural networks predict the potential energy surface from the same data. Where the network is fairly sure of the prediction, the variation between the outputs of different networks is minimal and the surfaces largely converge. When the network is uncertain, the predictions of different models vary widely, producing a range of outputs, any of which could be the correct surface. 

    The spread in the predictions of a “committee of neural networks” is the “uncertainty” at that point. A good model should not just indicate the best prediction, but also indicate the uncertainty about each of these predictions. It’s like the neural network says “this property for material A will have a value of X and I’m highly confident about it.”
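
    As a rough sketch of the committee idea, the spread can be measured as the standard deviation of the members’ predictions at a point. The three toy “models” below are hand-written functions standing in for trained neural networks, and they are built by assumption to agree on the interval from -1 to 1 and to disagree outside it; none of this comes from the paper.

```python
import statistics


def committee_predict(models, x):
    """Return the committee's mean prediction and its spread (uncertainty) at x."""
    predictions = [model(x) for model in models]
    return statistics.mean(predictions), statistics.pstdev(predictions)


def make_toy_model(k):
    """Hypothetical stand-in for a trained network: members differ only outside |x| <= 1."""
    return lambda x: x ** 2 + k * max(0.0, abs(x) - 1.0) ** 2


models = [make_toy_model(k) for k in (-0.5, 0.0, 0.5)]

inside_mean, inside_spread = committee_predict(models, 0.5)    # members agree
outside_mean, outside_spread = committee_predict(models, 3.0)  # members disagree
```

    Inside the assumed training range the spread is zero, while at x = 3.0 the members return 7, 9, and 11, so the committee flags that point as uncertain even though its mean prediction looks plausible.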

    This could have been an elegant solution but for the sheer scale of the combinatorial space. “Each simulation (which is ground feed for the neural network) may take from tens to thousands of CPU hours,” explains Schwalbe-Koda. For the results to be meaningful, multiple models must be run over a sufficient number of points in the PES, an extremely time-consuming process. 

    Instead, the new approach only samples data points from regions of low prediction confidence, corresponding to specific geometries of a molecule. These molecules are then stretched or deformed slightly so that the uncertainty of the neural network committee is maximized. Additional data are computed for these molecules through simulations and then added to the initial training pool. 

    The neural networks are trained again, and a new set of uncertainties is calculated. This process is repeated until the uncertainty associated with various points on the surface becomes well-defined and cannot be decreased any further.
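
    The “stretch or deform until the committee disagrees most” step can be sketched as gradient ascent on the committee’s uncertainty. Everything here is a simplification under stated assumptions: the geometry is a single coordinate, the committee members are hand-written toy functions, the gradient is taken by finite differences, and the displacement is simply clipped to `max_shift`, which stands in for whatever constraint the paper uses to keep geometries physically reasonable.

```python
import statistics


def uncertainty(models, x):
    """Committee spread at geometry x (population standard deviation)."""
    return statistics.pstdev([model(x) for model in models])


def adversarial_geometry(models, x0, max_shift=0.5, step=0.05, iters=200, eps=1e-4):
    """Displace x0 by gradient ascent on the committee uncertainty.

    The displacement is clipped to [-max_shift, max_shift]; this bound is an
    assumption of the sketch, not the constraint used in the paper.
    """
    delta = 0.0
    for _ in range(iters):
        x = x0 + delta
        # Central finite-difference estimate of d(uncertainty)/dx.
        grad = (uncertainty(models, x + eps) - uncertainty(models, x - eps)) / (2 * eps)
        delta = max(-max_shift, min(max_shift, delta + step * grad))
    return x0 + delta


def make_toy_model(k):
    """Hypothetical committee member: members disagree only outside |x| <= 1."""
    return lambda x: x ** 2 + k * max(0.0, abs(x) - 1.0) ** 2


models = [make_toy_model(k) for k in (-0.5, 0.0, 0.5)]
x_start = 1.2
x_adv = adversarial_geometry(models, x_start)
```

    In the full loop described above, the geometry found this way would be sent to the expensive simulation, the result added to the training pool, and the committee retrained before the next round of attacks.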

    Gómez-Bombarelli explains, “We aspire to have a model that is perfect in the regions we care about (i.e., the ones that the simulation will visit) without having had to run the full ML simulation, by making sure that we make it very good in high-likelihood regions where it isn’t.”

    The paper presents several examples of this approach, including predicting complex supramolecular interactions in zeolites. These materials are cavernous crystals that act as molecular sieves with high shape selectivity. They find applications in catalysis, gas separation, and ion exchange, among others.

    Because performing simulations of large zeolite structures is very costly, the researchers show how their method can provide significant savings in computational simulations. They used more than 15,000 examples to train a neural network to predict the potential energy surfaces for these systems. Despite the large cost required to generate the dataset, the final results are mediocre, with only around 80 percent of the neural network-based simulations being successful. To improve the performance of the model using traditional active learning methods, the researchers calculated an additional 5,000 data points, which improved the performance of the neural network potentials to 92 percent.

    However, when the adversarial approach was used to retrain the neural networks, the authors saw a performance jump to 97 percent using only 500 extra points. That’s a remarkable result, the researchers say, especially considering that each of these extra points takes hundreds of CPU hours.

    This could be the most realistic method to probe the limits of models that researchers use to predict the behavior of materials and the progress of chemical reactions.