More stories

  • New CRISPR-based map ties every human gene to its function

    The Human Genome Project was an ambitious initiative to sequence every piece of human DNA. The project drew together collaborators from research institutions around the world, including MIT’s Whitehead Institute for Biomedical Research, and was finally completed in 2003. Now, over two decades later, MIT Professor Jonathan Weissman and colleagues have gone beyond the sequence to present the first comprehensive functional map of genes that are expressed in human cells. The data from this project, published online June 9 in Cell, ties each gene to its job in the cell, and is the culmination of years of collaboration on the single-cell sequencing method Perturb-seq.

    The data are available for other scientists to use. “It’s a big resource in the way the human genome is a big resource, in that you can go in and do discovery-based research,” says Weissman, who is also a member of the Whitehead Institute and an investigator with the Howard Hughes Medical Institute. “Rather than defining ahead of time what biology you’re going to be looking at, you have this map of the genotype-phenotype relationships and you can go in and screen the database without having to do any experiments.”

    The screen allowed the researchers to delve into diverse biological questions. They used it to explore the cellular effects of genes with unknown functions, to investigate the response of mitochondria to stress, and to screen for genes that cause chromosomes to be lost or gained, a phenotype that has proved difficult to study in the past. “I think this dataset is going to enable all sorts of analyses that we haven’t even thought up yet by people who come from other parts of biology, and suddenly they just have this available to draw on,” says former Weissman Lab postdoc Tom Norman, a co-senior author of the paper.

    Pioneering Perturb-seq

    The project takes advantage of the Perturb-seq approach, which makes it possible to follow the impact of turning genes on or off in unprecedented depth. This method was first published in 2016 by a group of researchers including Weissman and fellow MIT professor Aviv Regev, but could only be used on small sets of genes and at great expense.

    The massive Perturb-seq map was made possible by foundational work from Joseph Replogle, an MD-PhD student in Weissman’s lab and co-first author of the present paper. Replogle, in collaboration with Norman, who now leads a lab at Memorial Sloan Kettering Cancer Center; Britt Adamson, an assistant professor in the Department of Molecular Biology at Princeton University; and a group at 10x Genomics, set out to create a new version of Perturb-seq that could be scaled up. The researchers published a proof-of-concept paper in Nature Biotechnology in 2020. 

    The Perturb-seq method uses CRISPR-Cas9 genome editing to introduce genetic changes into cells, and then uses single-cell RNA sequencing to capture information about the RNAs that are expressed as a result of a given genetic change. Because RNAs control all aspects of how cells behave, this method can help decode the many cellular effects of genetic changes.

    Since their initial proof-of-concept paper, Weissman, Regev, and others have used this sequencing method on smaller scales. For example, the researchers used Perturb-seq in 2021 to explore how human and viral genes interact over the course of an infection with HCMV, a common herpesvirus.

    In the new study, Replogle and collaborators including Reuben Saunders, a graduate student in Weissman’s lab and co-first author of the paper, scaled up the method to the entire genome. Using human blood cancer cell lines as well as noncancerous cells derived from the retina, they performed Perturb-seq across more than 2.5 million cells, and used the data to build a comprehensive map tying genotypes to phenotypes.

    Delving into the data

    Upon completing the screen, the researchers decided to put their new dataset to use and examine a few biological questions. “The advantage of Perturb-seq is it lets you get a big dataset in an unbiased way,” says Tom Norman. “No one knows entirely what the limits are of what you can get out of that kind of dataset. Now, the question is, what do you actually do with it?”

    The first, most obvious application was to look into genes with unknown functions. Because the screen also read out phenotypes of many known genes, the researchers could use the data to compare unknown genes to known ones and look for similar transcriptional outcomes, which could suggest the gene products worked together as part of a larger complex.
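
    The comparison described above can be illustrated with a small sketch. It assumes each perturbation has already been summarized as a vector of average expression changes, and it uses Pearson correlation as one plausible similarity measure; the gene names, the numbers, and the functions pearson and most_similar are invented for illustration and are not taken from the study.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def most_similar(query, profiles):
    """Rank annotated perturbations by similarity to the query profile."""
    scores = {gene: pearson(query, prof) for gene, prof in profiles.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy data (invented): mean expression change of four readout genes
# after knocking down each named gene.
known_profiles = {
    "known_gene_A": [1.0, -0.5, 0.2, 2.0],
    "known_gene_B": [-1.0, 0.8, 0.1, -1.5],
}
unknown_profile = [0.9, -0.4, 0.3, 1.8]

ranking = most_similar(unknown_profile, known_profiles)
print(ranking[0][0])  # the annotated gene whose profile is most similar
```

    A high correlation with an annotated gene is only a hint that the two gene products may act together; it still needs experimental confirmation.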

    The mutation of one gene called C7orf26 in particular stood out. Researchers noticed that genes whose removal led to a similar phenotype were part of a protein complex called Integrator that played a role in creating small nuclear RNAs. The Integrator complex is made up of many smaller subunits — previous studies had suggested 14 individual proteins — and the researchers were able to confirm that C7orf26 made up a 15th component of the complex.

    They also discovered that the 15 subunits worked together in smaller modules to perform specific functions within the Integrator complex. “Absent this thousand-foot-high view of the situation, it was not so clear that these different modules were so functionally distinct,” says Saunders.

    Another perk of Perturb-seq is that because the assay focuses on single cells, the researchers could use the data to look at more complex phenotypes that become muddied when they are studied together with data from other cells. “We often take all the cells where ‘gene X’ is knocked down and average them together to look at how they changed,” Weissman says. “But sometimes when you knock down a gene, different cells that are losing that same gene behave differently, and that behavior may be missed by the average.”

    The researchers found that a subset of genes whose removal led to different outcomes from cell to cell were responsible for chromosome segregation. Their removal was causing cells to lose a chromosome or pick up an extra one, a condition known as aneuploidy. “You couldn’t predict what the transcriptional response to losing this gene was because it depended on the secondary effect of what chromosome you gained or lost,” Weissman says. “We realized we could then turn this around and create this composite phenotype looking for signatures of chromosomes being gained and lost. In this way, we’ve done the first genome-wide screen for factors that are required for the correct segregation of DNA.”
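
    One way to picture such a composite phenotype is to compare, for each cell, the total expression of the genes on every chromosome with the same total in unperturbed control cells. The sketch below is an assumption-laden illustration, not the study’s pipeline: the thresholds, the toy expression values, and the function chromosome_calls are hypothetical, and real data would first need normalization for sequencing depth.

```python
def chromosome_calls(cell_expr, control_expr, gene_to_chrom,
                     gain=1.25, loss=0.75):
    """Flag chromosomes whose total expression in one cell deviates from control.

    The gain/loss thresholds are illustrative assumptions, not published values.
    """
    totals_cell, totals_ctrl = {}, {}
    for gene, chrom in gene_to_chrom.items():
        totals_cell[chrom] = totals_cell.get(chrom, 0.0) + cell_expr[gene]
        totals_ctrl[chrom] = totals_ctrl.get(chrom, 0.0) + control_expr[gene]

    calls = {}
    for chrom, total in totals_cell.items():
        ratio = total / totals_ctrl[chrom]
        if ratio >= gain:
            calls[chrom] = "gain"
        elif ratio <= loss:
            calls[chrom] = "loss"
    return calls

# Toy data (invented): two genes on each of three chromosomes.
gene_to_chrom = {"g1": "chr1", "g2": "chr1", "g3": "chr2",
                 "g4": "chr2", "g5": "chr3", "g6": "chr3"}
control = {gene: 10.0 for gene in gene_to_chrom}
cell = {"g1": 15.0, "g2": 14.0, "g3": 10.0,
        "g4": 11.0, "g5": 5.0, "g6": 6.0}

calls = chromosome_calls(cell, control, gene_to_chrom)
print(calls)  # {'chr1': 'gain', 'chr3': 'loss'}
```

    Because the call is made per cell, two cells with the same gene knocked down can receive different calls, which is the variability an averaged readout would hide.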

    “I think the aneuploidy study is the most interesting application of this data so far,” Norman says. “It captures a phenotype that you can only get using a single-cell readout. You can’t go after it any other way.”

    The researchers also used their dataset to study how mitochondria responded to stress. Mitochondria, which evolved from free-living bacteria, carry 13 genes in their genomes. Within the nuclear DNA, around 1,000 genes are somehow related to mitochondrial function. “People have been interested for a long time in how nuclear and mitochondrial DNA are coordinated and regulated in different cellular conditions, especially when a cell is stressed,” Replogle says.

    The researchers found that when they perturbed different mitochondria-related genes, the nuclear genome responded similarly to many different genetic changes. However, the mitochondrial genome responses were much more variable. 

    “There’s still an open question of why mitochondria still have their own DNA,” says Replogle. “A big-picture takeaway from our work is that one benefit of having a separate mitochondrial genome might be having localized or very specific genetic regulation in response to different stressors.”

    “If you have one mitochondria that’s broken, and another one that is broken in a different way, those mitochondria could be responding differentially,” Weissman says.

    In the future, the researchers hope to use Perturb-seq on different types of cells besides the cancer cell line they started in. They also hope to continue to explore their map of gene functions, and hope others will do the same. “This really is the culmination of many years of work by the authors and other collaborators, and I’m really pleased to see it continue to succeed and expand,” says Norman.

  • Ocean vital signs

    Without the ocean, the climate crisis would be even worse than it is. Each year, the ocean absorbs billions of tons of carbon from the atmosphere, preventing warming that greenhouse gas would otherwise cause. Scientists estimate about 25 to 30 percent of all carbon released into the atmosphere by both human and natural sources is absorbed by the ocean.

    “But there’s a lot of uncertainty in that number,” says Ryan Woosley, a marine chemist and a principal research scientist in the Department of Earth, Atmospheric and Planetary Sciences (EAPS) at MIT. Different parts of the ocean take in different amounts of carbon depending on many factors, such as the season and the amount of mixing from storms. Current models of the carbon cycle don’t adequately capture this variation.

    To close the gap, Woosley and a team of other MIT scientists developed a research proposal for the MIT Climate Grand Challenges competition — an Institute-wide campaign to catalyze and fund innovative research addressing the climate crisis. The team’s proposal, “Ocean Vital Signs,” involves sending a fleet of sailing drones to cruise the oceans taking detailed measurements of how much carbon the ocean is really absorbing. Those data would be used to improve the precision of global carbon cycle models and improve researchers’ ability to verify emissions reductions claimed by countries.

    “If we start to enact mitigation strategies, either through removing CO2 from the atmosphere or reducing emissions, we need to know where CO2 is going in order to know how effective they are,” says Woosley. Without more precise models, there’s no way to confirm whether observed carbon reductions were thanks to policy and people, or thanks to the ocean.

    “So that’s the trillion-dollar question,” says Woosley. “If countries are spending all this money to reduce emissions, is it enough to matter?”

    In February, the team’s Climate Grand Challenges proposal was named one of 27 finalists out of the almost 100 entries submitted. From among this list of finalists, MIT will announce in April the selection of five flagship projects to receive further funding and support.

    Woosley is leading the team along with Christopher Hill, a principal research engineer in EAPS. The team includes physical and chemical oceanographers, marine microbiologists, biogeochemists, and experts in computational modeling from across the department, in addition to collaborators from the Media Lab and the departments of Mathematics, Aeronautics and Astronautics, and Electrical Engineering and Computer Science.

    Today, data on the flux of carbon dioxide between the air and the oceans are collected in a piecemeal way. Research ships intermittently cruise out to gather data. Some commercial ships are also fitted with sensors. But these present a limited view of the entire ocean, and include biases. For instance, commercial ships usually avoid storms, which can increase the turnover of water exposed to the atmosphere and cause a substantial increase in the amount of carbon absorbed by the ocean.

    “It’s very difficult for us to get to it and measure that,” says Woosley. “But these drones can.”

    If funded, the team’s project would begin by deploying a few drones in a small area to test the technology. The wind-powered drones — made by a California-based company called Saildrone — would autonomously navigate through an area, collecting data on air-sea carbon dioxide flux continuously with solar-powered sensors. This would then scale up to more than 5,000 drone-days’ worth of observations, spread over five years, and in all five ocean basins.

    Those data would be used to feed neural networks to create more precise maps of how much carbon is absorbed by the oceans, shrinking the uncertainties involved in the models. These models would continue to be verified and improved by new data. “The better the models are, the more we can rely on them,” says Woosley. “But we will always need measurements to verify the models.”

    Improved carbon cycle models are relevant beyond climate warming as well. “CO2 is involved in so much of how the world works,” says Woosley. “We’re made of carbon, and all the other organisms and ecosystems are as well. What does the perturbation to the carbon cycle do to these ecosystems?”

    One of the best understood impacts is ocean acidification. Carbon dioxide absorbed by the ocean reacts with seawater to form carbonic acid. A more acidic ocean can have dire impacts on marine organisms like coral and oysters, whose calcium carbonate shells and skeletons can dissolve at the lower pH. Since the Industrial Revolution, the ocean has become about 30 percent more acidic on average.
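
    The 30 percent figure can be related to the pH scale with a short calculation. The sketch below assumes that “more acidic” refers to a rise in hydrogen ion concentration and uses the standard definition pH = -log10[H+]; the function name ph_change is illustrative.

```python
from math import log10

def ph_change(fractional_increase):
    """pH shift implied by a fractional rise in hydrogen ion concentration."""
    return -log10(1.0 + fractional_increase)

delta = ph_change(0.30)
print(round(delta, 3))  # -0.114
```

    Under that assumption, a 30 percent rise in hydrogen ion concentration corresponds to a drop of roughly 0.11 pH units. The number looks small only because the pH scale is logarithmic.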

    “So while it’s great for us that the oceans have been taking up the CO2, it’s not great for the oceans,” says Woosley. “Knowing how this uptake affects the health of the ocean is important as well.”

  • Generating new molecules with graph grammar

    Chemical engineers and materials scientists are constantly looking for the next revolutionary material, chemical, and drug. The rise of machine-learning approaches is expediting the discovery process, which could otherwise take years. “Ideally, the goal is to train a machine-learning model on a few existing chemical samples and then allow it to produce as many manufacturable molecules of the same class as possible, with predictable physical properties,” says Wojciech Matusik, professor of electrical engineering and computer science at MIT. “If you have all these components, you can build new molecules with optimal properties, and you also know how to synthesize them. That’s the overall vision that people in that space want to achieve.”

    However, current techniques, mainly deep learning, require extensive datasets for training models, and many class-specific chemical datasets contain only a handful of example compounds, limiting the models’ ability to generalize and to generate physical molecules that could be created in the real world.

    Now, a new paper from researchers at MIT and IBM tackles this problem using a generative graph model to build new synthesizable molecules within the same chemical class as their training data. To do this, they treat the formation of atoms and chemical bonds as a graph and develop a graph grammar (a linguistics analogy of systems and structures for word ordering) that contains a sequence of rules for building molecules, such as monomers and polymers. Using the grammar and production rules that were inferred from the training set, the model can not only reverse engineer its examples, but can create new compounds in a systematic and data-efficient way. “We basically built a language for creating molecules,” says Matusik. “This grammar essentially is the generative model.”

    Matusik’s co-authors include MIT graduate students Minghao Guo, who is the lead author, and Beichen Li as well as Veronika Thost, Payal Das, and Jie Chen, research staff members with IBM Research. Matusik, Thost, and Chen are affiliated with the MIT-IBM Watson AI Lab. Their method, which they’ve called data-efficient graph grammar (DEG), will be presented at the International Conference on Learning Representations.

    “We want to use this grammar representation for monomer and polymer generation, because this grammar is explainable and expressive,” says Guo. “With only a few number of the production rules, we can generate many kinds of structures.”

    A molecular structure can be thought of as a symbolic representation in a graph: a string of atoms (nodes) joined together by chemical bonds (edges). In this method, the researchers allow the model to take the chemical structure and collapse a substructure of the molecule down to one node; this may be two atoms connected by a bond, a short sequence of bonded atoms, or a ring of atoms. This is done repeatedly, creating the production rules as it goes, until a single node remains. The rules and grammar could then be applied in the reverse order to recreate the training set from scratch, or combined in different orders to produce new molecules of the same chemical class.
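
    The collapse-and-record idea can be sketched on a toy graph. The code below is a simplified illustration and not the DEG algorithm: it always contracts a single bond (the smallest-numbered one) rather than learning which substructure to collapse, and its rules record only node labels, not the attachment points a real grammar must track. The functions contract_to_rules and expand and the symbols X1, X2 are hypothetical.

```python
def contract_to_rules(atoms, bonds):
    """Collapse a molecular graph one bond at a time, recording a rule per step.

    atoms: dict mapping node id -> label; bonds: iterable of (u, v) pairs.
    Returns (rules, remaining_atoms); each rule is (new_symbol, left, right).
    """
    atoms = dict(atoms)
    bonds = {frozenset(b) for b in bonds}
    rules = []
    while bonds:
        # Deterministic stand-in for a learned choice of substructure.
        u, v = min(sorted(b) for b in bonds)
        symbol = f"X{len(rules) + 1}"
        rules.append((symbol, atoms[u], atoms[v]))
        merged = set()
        for bond in bonds:
            # Bonds that touched v now touch the merged node u.
            redirected = frozenset(u if n == v else n for n in bond)
            if len(redirected) == 2:  # drops the contracted bond itself
                merged.add(redirected)
        bonds = merged
        atoms[u] = symbol
        del atoms[v]
    return rules, atoms

def expand(symbol, rules):
    """Apply the rules in reverse to recover the atom labels."""
    table = {new: (left, right) for new, left, right in rules}
    if symbol not in table:
        return [symbol]
    left, right = table[symbol]
    return expand(left, rules) + expand(right, rules)

# Heavy atoms of ethanol: C-C-O.
rules, remaining = contract_to_rules({1: "C", 2: "C", 3: "O"},
                                     [(1, 2), (2, 3)])
print(rules)                # [('X1', 'C', 'C'), ('X2', 'X1', 'O')]
print(expand("X2", rules))  # ['C', 'C', 'O']
```

    Running the rules backward from the final symbol regenerates the original atoms. Mixing rules learned from several molecules is what would let a grammar of this kind produce structures that were not in its training set.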

    “Existing graph generation methods would produce one node or one edge sequentially at a time, but we are looking at higher-level structures and, specifically, exploiting chemistry knowledge, so that we don’t treat the individual atoms and bonds as the unit. This simplifies the generation process and also makes it more data-efficient to learn,” says Chen.

    Further, the researchers optimized the technique so that the bottom-up grammar was relatively simple and straightforward, and so that it generated molecules that could actually be made.

    “If we switch the order of applying these production rules, we would get another molecule; what’s more, we can enumerate all the possibilities and generate tons of them,” says Chen. “Some of these molecules are valid and some of them not, so the learning of the grammar itself is actually to figure out a minimal collection of production rules, such that the percentage of molecules that can actually be synthesized is maximized.” While the researchers concentrated on three training sets of fewer than 33 samples each (acrylates, chain extenders, and isocyanates), they note that the process could be applied to any chemical class.

    To see how their method performed, the researchers tested DEG against other state-of-the-art models and techniques, looking at percentages of chemically valid and unique molecules, diversity of those created, success rate of retrosynthesis, and percentage of molecules belonging to the training data’s monomer class.

    “We clearly show that, for the synthesizability and membership, our algorithm outperforms all the existing methods by a very large margin, while it’s comparable for some other widely-used metrics,” says Guo. Further, “what is amazing about our algorithm is that we only need about 0.15 percent of the original dataset to achieve very similar results compared to state-of-the-art approaches that train on tens of thousands of samples. Our algorithm can specifically handle the problem of data sparsity.”

    In the immediate future, the team plans to address scaling up this grammar learning process to be able to generate large graphs, as well as produce and identify chemicals with desired properties.

    Down the road, the researchers see many applications for the DEG method, which is adaptable beyond generating new chemical structures. A graph is a very flexible representation, and many entities can be symbolized in this form: robots, vehicles, buildings, and electronic circuits, for example. “Essentially, our goal is to build up our grammar, so that our graphic representation can be widely used across many different domains,” says Guo. Chen adds that “DEG can automate the design of novel entities and structures.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab and Evonik.

  • An “oracle” for predicting the evolution of gene regulation

    Despite the sheer number of genes that each human cell contains, these so-called “coding” DNA sequences comprise just 1 percent of our entire genome. The remaining 99 percent is made up of “non-coding” DNA — which, unlike coding DNA, does not carry the instructions to build proteins.

    One vital function of this non-coding DNA, also called “regulatory” DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these non-coding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as Type 2 diabetes, or more life-threatening ones, including cancer.

    To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer datasets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.

    In a new study published on March 9 in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these non-coding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of non-coding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.

    “We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” says Aviv Regev, a professor of biology at MIT (on leave), core member of the Broad Institute of Harvard and MIT (on leave), head of Genentech Research and Early Development, and the study’s senior author. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology.”
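
    The “all possible mutations” query can be sketched in a few lines, assuming a model that maps a DNA sequence to a predicted expression level. The scoring function below is a hypothetical stand-in (it simply counts TATA motifs) and has nothing to do with the trained neural network; single_mutants, toy_expression_model, and best_mutant are illustrative names.

```python
BASES = "ACGT"

def single_mutants(seq):
    """Yield every sequence that differs from seq at exactly one position."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield seq[:i] + alt + seq[i + 1:]

def toy_expression_model(seq):
    """Hypothetical stand-in for a trained model: counts 'TATA' motifs."""
    return sum(seq[i:i + 4] == "TATA" for i in range(len(seq) - 3))

def best_mutant(seq, model):
    """Return the single-base mutant with the highest predicted expression."""
    return max(single_mutants(seq), key=model)

promoter = "GCTAAAGC"  # invented example sequence
mutants = list(single_mutants(promoter))
print(len(mutants))                                 # 24 (8 positions x 3 bases)
print(best_mutant(promoter, toy_expression_model))  # GCTATAGC
```

    Swapping the toy scorer for a real sequence-to-expression model leaves the query logic unchanged, which is the sense in which such a model can be consulted like an oracle.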

    Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.

    To accomplish this goal, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author; Carl de Boer, now an assistant professor at the University of British Columbia; and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random non-coding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.

    “This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” Regev says. “In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules.”

    Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. “Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” Vaishnav explains.

    First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population dataset from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.

    But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of non-coding sequences even without such a comprehensive population dataset. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any non-coding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.

    “One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences,” Vaishnav explains. “I really wanted to find a way to fill that gap, and contribute to the long-standing vision of creating a complete fitness landscape.”

    Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, says the study shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

    Despite the fact that the model was trained on just a fraction of yeast regulatory DNA in a few growth conditions, he’s impressed that it’s capable of making such useful predictions about the evolution of gene regulation in mammals.

    “There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” he explains. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse datasets.”

    Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise non-coding DNA sequences for use in gene therapies.

    “People have been studying regulatory evolution and fitness landscapes for decades now,” Vaishnav says. “I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications.”

  • Deep-learning technique predicts clinical treatment outcomes

    When it comes to treatment strategies for critically ill patients, clinicians want to be able to consider all their options and timing of administration, and make the optimal decision for their patients. While clinician experience and study have helped them to be successful in this effort, not all patients are the same, and treatment decisions at this crucial time could mean the difference between patient improvement and quick deterioration. Therefore, it would be helpful for doctors to be able to take a patient’s previously known health status and received treatments and use that to predict that patient’s health outcome under different treatment scenarios, in order to pick the best path.

    Now, a deep-learning technique, called G-Net, from researchers at MIT and IBM provides a window into causal counterfactual prediction, affording physicians the opportunity to explore how a patient might fare under different treatment plans. The foundation of G-Net is the g-computation algorithm, a causal inference method that estimates the effect of dynamic exposures in the presence of measured confounding variables, ones that may influence both treatments and outcomes. Unlike previous implementations of the g-computation framework, which have used linear modeling approaches, G-Net uses recurrent neural networks (RNNs), which have node connections that allow them to better model temporal sequences with complex and nonlinear dynamics, like those found in physiological and clinical time series data. In this way, physicians can develop alternative plans based on patient history and test them before making a decision.

    “Our ultimate goal is to develop a machine learning technique that would allow doctors to explore various ‘What if’ scenarios and treatment options,” says Li-wei Lehman, MIT research scientist in the MIT Institute for Medical Engineering and Science and an MIT-IBM Watson AI Lab project lead. “A lot of work has been done in terms of deep learning for counterfactual prediction but [it’s] been focusing on a point exposure setting,” or a static, time-varying treatment strategy, which doesn’t allow for adjustment of treatments as patient history changes. However, her team’s new prediction approach provides for treatment plan flexibility and chances for treatment alteration over time as patient covariate history and past treatments change. “G-Net is the first deep-learning approach based on g-computation that can predict both the population-level and individual-level treatment effects under dynamic and time varying treatment strategies.”

    The research, which was recently published in the Proceedings of Machine Learning Research, was co-authored by Rui Li MEng ’20, Stephanie Hu MEng ’21, former MIT postdoc Mingyu Lu MD, graduate student Yuria Utsumi, IBM research staff member Prithwish Chakraborty, IBM Research director of Hybrid Cloud Services Daby Sow, IBM data scientist Piyush Madan, IBM research scientist Mohamed Ghalwash, and IBM research scientist Zach Shahn.

    Tracking disease progression

    To build, validate, and test G-Net’s predictive abilities, the researchers considered the circulatory system in septic patients in the ICU. During critical care, doctors need to make trade-offs and judgement calls, such as ensuring the organs are receiving adequate blood supply without overworking the heart. For this, they could give intravenous fluids to patients to increase blood pressure; however, too much can cause edema. Alternatively, physicians can administer vasopressors, which act to contract blood vessels and raise blood pressure.

    To mimic this and provide a proof of concept for G-Net, the team used CVSim, a mechanistic model of the human cardiovascular system governed by 28 input variables that characterize the system’s current state, such as arterial pressure, central venous pressure, total blood volume, and total peripheral resistance. They modified it to simulate various disease processes (e.g., sepsis or blood loss) and the effects of interventions (e.g., fluids and vasopressors). The researchers used CVSim to generate observational patient data for training and for “ground truth” comparison against counterfactual prediction. In their G-Net architecture, the researchers ran two RNNs to handle and predict continuous variables, which can take on a range of values, like blood pressure, and categorical variables, which have discrete values, like the presence or absence of pulmonary edema. The researchers simulated the health trajectories of thousands of “patients” exhibiting symptoms under one treatment regime, call it A, for 66 timesteps, and used them to train and validate their model.

    To test G-Net’s prediction capability, the team generated two counterfactual datasets. Each contained roughly 1,000 known patient health trajectories, which were created from CVSim using the same “patient” condition as the starting point under treatment A. Then at timestep 33, treatment changed to plan B or C, depending on the dataset. The team then performed 100 prediction trajectories for each of these 1,000 patients, whose treatment and medical history were known up until timestep 33, when a new treatment was administered. In these cases, the prediction agreed well with the “ground-truth” observations for individual patients and averaged population-level trajectories.
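
    The evaluation loop described above can be illustrated with a minimal sketch. It is not the team's code: the one-step model `toy_step`, the treatment effects, and the noise level are hypothetical stand-ins for the trained RNNs and for CVSim. It only shows the general idea of rolling a model forward many times from timestep 33 under an alternative treatment, averaging the simulated trajectories, and comparing the average with a known trajectory.

```python
import random


def toy_step(state, treatment, rng):
    """Hypothetical stand-in for a learned one-step model: returns the next state."""
    effect = {"A": 0.0, "B": 1.0, "C": -1.0}[treatment]  # assumed treatment effects
    return state + effect + rng.gauss(0.0, 0.5)


def simulate_counterfactual(history, treatment, horizon, n_samples=100, seed=0):
    """Roll the model forward n_samples times from the last observed state
    and return the per-timestep average of the simulated trajectories."""
    rng = random.Random(seed)
    totals = [0.0] * horizon
    for _ in range(n_samples):
        state = history[-1]
        for t in range(horizon):
            state = toy_step(state, treatment, rng)
            totals[t] += state
    return [total / n_samples for total in totals]


def rmse(predicted, truth):
    """Root-mean-square error between two equal-length trajectories."""
    return (sum((p - q) ** 2 for p, q in zip(predicted, truth)) / len(truth)) ** 0.5


# A toy "patient" observed under treatment A for 33 timesteps, then switched to B.
history = [80.0] * 33
mean_b = simulate_counterfactual(history, "B", horizon=33)

# Noise-free trajectory of the toy model under B, used here as "ground truth".
truth_b = [80.0 + (t + 1) for t in range(33)]
error_b = rmse(mean_b, truth_b)
```

    In this toy setting the averaged prediction stays close to the noise-free trajectory because the simulator and the "truth" share the same dynamics; with a learned model, the error would also reflect how well the model was fit.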

    A cut above the rest

    Since the g-computation framework is flexible, the researchers wanted to examine G-Net’s prediction using different nonlinear models, in this case long short-term memory (LSTM) models, a type of RNN that can learn from previous data patterns or sequences, against the more classical linear models and a multilayer perceptron (MLP), a type of neural network that can make predictions using a nonlinear approach. Following a similar setup as before, the team found that the error between the known and predicted cases was smallest in the LSTM models compared to the others. Since G-Net is able to model the temporal patterns of the patient’s ICU history and past treatment, whereas a linear model and MLP cannot, it was better able to predict the patient’s outcome.

    The team also compared G-Net’s prediction in a static, time-varying treatment setting against two state-of-the-art deep-learning based counterfactual prediction approaches, a recurrent marginal structural network (rMSN) and a counterfactual recurrent neural network (CRN), as well as a linear model and an MLP. For this, they investigated a model for tumor growth under no treatment, radiation, chemotherapy, and both radiation and chemotherapy scenarios. “Imagine a scenario where there’s a patient with cancer, and an example of a static regime would be if you only give a fixed dosage of chemotherapy, radiation, or any kind of drug, and wait until the end of your trajectory,” comments Lu. For these investigations, the researchers generated simulated observational data using tumor volume as the primary influence dictating treatment plans and demonstrated that G-Net outperformed the other models. One potential reason is that g-computation is known to be more statistically efficient than rMSN and CRN when models are correctly specified.

    While G-Net has done well with simulated data, more needs to be done before it can be applied to real patients. Since neural networks can be thought of as “black boxes” for prediction results, the researchers are beginning to investigate the uncertainty in the model to help ensure safety. In contrast to these approaches that recommend an “optimal” treatment plan without any clinician involvement, “as a decision support tool, I believe that G-Net would be more interpretable, since the clinicians would input treatment strategies themselves,” says Lehman, and “G-Net will allow them to be able to explore different hypotheses.” Further, the team has moved on to using real data from ICU patients with sepsis, bringing it one step closer to implementation in hospitals.

    “I think it is pretty important and exciting for real-world applications,” says Hu. “It’d be helpful to have some way to predict whether or not a treatment might work or what the effects might be — a quicker iteration process for developing these hypotheses for what to try, before actually trying to implement them in a years-long, potentially very involved and very invasive type of clinical trial.”

    This research was funded by the MIT-IBM Watson AI Lab.

  • in

    Research aims to mitigate chemical and biological airborne threats

    When the air harbors harmful matter, such as a virus or toxic chemical, it’s not always easy to promptly detect this danger. Whether spread maliciously or accidentally, how fast and how far could hazardous plumes travel through a city? What could emergency managers do in response?

    These were questions that scientists, public health officials, and government agencies probed with an air flow study conducted recently in New York City. At 120 locations across all five boroughs of the city, a team led by MIT Lincoln Laboratory collected safe test particles and gases released earlier in subway stations and on streets, tracking their journeys. The exercise measured how far the materials traveled and what their concentrations were when detected.

    The results are expected to improve air dispersion models, and in turn, help emergency planners improve response protocols if a real chemical or biological event were to take place. 

    The study was performed under the Department of Homeland Security (DHS) Science and Technology Directorate’s (S&T) Urban Threat Dispersion Project. The project is largely driven by Lincoln Laboratory’s Counter–Weapons of Mass Destruction (CWMD) Systems Group to improve homeland defenses against airborne threats. This exercise followed a similar, though much smaller, study in 2016 that focused mainly on the subway system within Manhattan.

    “The idea was to look at how particles and gases move through urban environments, starting with a focus on subways,” says Mandeep Virdi, a researcher in the CWMD Systems Group who helped lead both studies.

    The particles and gases used in the study are safe to disperse. The particulates are primarily composed of maltodextrin sugar, and have been used in prior public safety exercises. To enable researchers to track the particles, the particles are modified with small amounts of synthetic DNA that acts as a unique “barcode.” This barcode corresponds to the location from which the particle was released and the day of release. When these particles are later collected and analyzed, researchers can know exactly where they came from.
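
    The barcode idea amounts to a lookup from a tag to its release site and day. The sketch below is purely illustrative: the tag names, site labels, and the `trace_samples` helper are hypothetical and do not come from the study.

```python
from collections import Counter

# Hypothetical registry: each synthetic DNA tag maps to (release site, release day).
BARCODES = {
    "TAG-01": ("subway station (hypothetical site 1)", 1),
    "TAG-02": ("street corner (hypothetical site 2)", 1),
    "TAG-03": ("subway station (hypothetical site 1)", 2),
}


def trace_samples(detected_tags):
    """Count detections per (release site, release day).
    Tags missing from the registry are grouped under the key None."""
    counts = Counter()
    for tag in detected_tags:
        counts[BARCODES.get(tag)] += 1
    return counts


# Tags recovered from one filter at one collection site (made-up values).
filter_hits = ["TAG-01", "TAG-01", "TAG-03", "TAG-99"]
origins = trace_samples(filter_hits)
```

    Because each tag encodes both a location and a day, a single filter can separate material released at the same place on different days.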

    The laboratory’s team led the process of releasing the particles and collecting the particle samples for analysis. A small sprayer is used to aerosolize the particles into the air. As the particles flow throughout the city, some get trapped in filters set up at the many dispersed collection sites. 

    To make processes more efficient for this large study, the team built special filter heads that rotated through multiple filters, saving time spent revisiting a collection site. They also developed a system using NFC (near-field communication) tags to simplify the cataloging and tracking of samples and equipment through a mobile app. 

    The researchers are still processing the approximately 5,000 samples that were collected over the five-day measurement campaign. The data will feed into existing particle dispersion models to improve simulations. One of these models, from Argonne National Laboratory, focuses on subway environments, and another model from Los Alamos National Laboratory simulates above-ground city environments, taking into account buildings and urban canyon air flows.

    Together, these models can show how a plume would travel from the subway to the streets, for example. These insights will enable emergency managers in New York City to develop more informed response strategies, as they did following the 2016 subway study.

    “The big question has always been, if there is a release and law enforcement can detect it in time, what do you actually do? Do you shut down the subway system? What can you do to mitigate those effects? Knowing that is the end goal,” Virdi says. 

    A new program, called the Chemical and Biological Defense Testbed, has just kicked off to further investigate those questions. Trina Vian at Lincoln Laboratory is leading this program, also under S&T funding.

    “Now that we’ve learned more about how material transports through the subway system, this test bed is looking at ways that we can mitigate that transport in a low-regret way,” Vian says.

    According to Vian, emergency managers don’t have many options other than to evacuate the area when a biological or chemical sensor is triggered. Yet current sensors tend to have high false-alarm rates, particularly in dirty environments. “You really can’t afford to make that evacuation call in error. Not only do you undermine people’s trust in the system, but also people can become injured, and it may actually be a non-threatening situation.”

    The goal of this test bed is to develop architectures and technologies that could allow for a range of appropriate response activities. For example, the team will be looking at ways through which air flow could be constrained or filtered in place, without disrupting traffic, while responders validate an alarm. They’ll also be testing the performance of new chemical and biological sensor technologies.

    Both Vian and Virdi stress the importance of collaboration for carrying out these large-scale studies, and in tackling the problem of airborne dangers in general. The test bed program is already benefiting by using equipment provided through the CWMD Alliance, a partnership of DHS and the Joint Program Executive Office for Chemical, Biological, Radiological and Nuclear Defense.

    A team of nearly 175 personnel worked together on the air flow exercise, spanning the Metropolitan Transportation Authority, New York City Transit, New York City Police Department, Port Authority of New York and New Jersey, New Jersey Transit, New York City Department of Environmental Protection, the New York City Department of Health and Mental Hygiene, the National Guard Weapons of Mass Destruction Civil Support Teams, the Environmental Protection Agency, and Department of Energy National Laboratories, in addition to S&T and Lincoln Laboratory.

    “It really was all about teamwork,” Virdi reflects. “Programs like this are why I came to Lincoln Laboratory. Seeing how the science is applied in a way that has real actionable results and how appreciative agencies are of what we’re doing has been rewarding. It’s exciting to see your program through, especially one as intense as this.”

  • in

    Physics and the machine-learning “black box”

    Machine-learning algorithms are often referred to as a “black box.” Once data are put into an algorithm, it’s not always known exactly how the algorithm arrives at its prediction. This can be particularly frustrating when things go wrong. A new mechanical engineering (MechE) course at MIT teaches students how to tackle the “black box” problem, through a combination of data science and physics-based engineering.

    In class 2.C161 (Physical Systems Modeling and Design Using Machine Learning), Professor George Barbastathis demonstrates how mechanical engineers can use their unique knowledge of physical systems to keep algorithms in check and develop more accurate predictions.

    “I wanted to take 2.C161 because machine-learning models are usually a ‘black box,’ but this class taught us how to construct a system model that is informed by physics so we can peek inside,” explains Crystal Owens, a mechanical engineering graduate student who took the course in spring 2021.

    As chair of the Committee on the Strategic Integration of Data Science into Mechanical Engineering, Barbastathis has had many conversations with mechanical engineering students, researchers, and faculty to better understand the challenges and successes they’ve had using machine learning in their work.

    “One comment we heard frequently was that these colleagues can see the value of data science methods for problems they are facing in their mechanical engineering-centric research; yet they are lacking the tools to make the most out of it,” says Barbastathis. “Mechanical, civil, electrical, and other types of engineers want a fundamental understanding of data principles without having to convert themselves to being full-time data scientists or AI researchers.”

    Additionally, as mechanical engineering students move on from MIT to their careers, many will need to manage data scientists on their teams someday. Barbastathis hopes to set these students up for success with class 2.C161.

    Bridging MechE and the MIT Schwartzman College of Computing

    Class 2.C161 is part of the MIT Schwartzman College of Computing “Computing Core.” The goal of these classes is to connect data science and physics-based engineering disciplines, like mechanical engineering. Students take the course alongside 6.C402 (Modeling with Machine Learning: from Algorithms to Applications), taught by professors of electrical engineering and computer science Regina Barzilay and Tommi Jaakkola.

    The two classes are taught concurrently during the semester, exposing students to both fundamentals in machine learning and domain-specific applications in mechanical engineering.

    In 2.C161, Barbastathis highlights how complementary physics-based engineering and data science are. Physical laws present a number of ambiguities and unknowns, ranging from temperature and humidity to electromagnetic forces. Data science can be used to predict these physical phenomena. Meanwhile, having an understanding of physical systems helps ensure the resulting output of an algorithm is accurate and explainable.

    “What’s needed is a deeper combined understanding of the associated physical phenomena and the principles of data science, machine learning in particular, to close the gap,” adds Barbastathis. “By combining data with physical principles, the new revolution in physics-based engineering is relatively immune to the ‘black box’ problem facing other types of machine learning.”

    Equipped with a working knowledge of machine-learning topics covered in class 6.C402 and a deeper understanding of how to pair data science with physics, students are charged with developing a final project that solves for an actual physical system.

    Developing solutions for real-world physical systems

    For their final project, students in 2.C161 are asked to identify a real-world problem that requires data science to address the ambiguity inherent in physical systems. After obtaining all relevant data, students are asked to select a machine-learning method, implement their chosen solution, and present and critique the results.

    Topics this past semester ranged from weather forecasting to the flow of gas in combustion engines, with two student teams drawing inspiration from the ongoing Covid-19 pandemic.

    Owens and her teammates, fellow graduate students Arun Krishnadas and Joshua David John Rathinaraj, set out to develop a model for the Covid-19 vaccine rollout.

    “We developed a method of combining a neural network with a susceptible-infected-recovered (SIR) epidemiological model to create a physics-informed prediction system for the spread of Covid-19 after vaccinations started,” explains Owens.

    The team accounted for various unknowns including population mobility, weather, and political climate. This combined approach resulted in a prediction of Covid-19’s spread during the vaccine rollout that was more reliable than using either the SIR model or a neural network alone.
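
    For readers unfamiliar with the SIR model, the sketch below integrates its standard equations with a simple forward-Euler scheme. It is not the students' implementation: the function names, parameter values, and the choice to pass the transmission rate as a function of time are assumptions made for illustration. The hook `beta_fn` marks one place where a learned component, such as a neural network, could in principle supply a time-varying rate.

```python
def simulate_sir(beta_fn, gamma, s0, i0, r0, days, dt=0.1):
    """Forward-Euler integration of the standard SIR equations:
        dS/dt = -beta(t) * S * I / N
        dI/dt =  beta(t) * S * I / N - gamma * I
        dR/dt =  gamma * I
    beta_fn(t) returns the transmission rate at time t (in days).
    Returns one (S, I, R) tuple per day, including day 0."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r)]
    for day in range(days):
        for k in range(steps_per_day):
            t = day + k * dt
            new_infections = beta_fn(t) * s * i / n * dt
            new_recoveries = gamma * i * dt
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        trajectory.append((s, i, r))
    return trajectory


def constant_beta(t):
    """Placeholder transmission rate; a learned model could replace this function."""
    return 0.3


# Made-up population and rates, chosen only to produce a visible epidemic curve.
trajectory = simulate_sir(constant_beta, gamma=0.1, s0=990, i0=10, r0=0, days=60)
peak_infected = max(i for _, i, _ in trajectory)
```

    Keeping the epidemiological equations fixed and letting data inform only the uncertain rate is one common way to make a prediction "physics-informed"; the source does not specify which coupling the team used.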

    Another team, including graduate student Yiwen Hu, developed a model to predict mutation rates in Covid-19, a topic that became all too pertinent as the delta variant began its global spread.

    “We used machine learning to predict the time-series-based mutation rate of Covid-19, and then incorporated that as an independent parameter into the prediction of pandemic dynamics to see if it could help us better predict the trend of the Covid-19 pandemic,” says Hu.

    Hu, who had previously conducted research into how vibrations on coronavirus protein spikes affect infection rates, hopes to apply the physics-based machine-learning approaches he learned in 2.C161 to his research on de novo protein design.

    Whatever the physical system students addressed in their final projects, Barbastathis was careful to stress one unifying goal: the need to assess ethical implications in data science. While more traditional computing methods like face or voice recognition have proven to be rife with ethical issues, there is an opportunity to combine physical systems with machine learning in a fair, ethical way.

    “We must ensure that collection and use of data are carried out equitably and inclusively, respecting the diversity in our society and avoiding well-known problems that computer scientists in the past have run into,” says Barbastathis.

    Barbastathis hopes that by encouraging mechanical engineering students to be both ethics-literate and well-versed in data science, they can move on to develop reliable, ethically sound solutions and predictions for physics-based engineering challenges.

  • in

    Understanding air pollution from space

    Climate change and air pollution are interlocking crises that threaten human health. Reducing emissions of some air pollutants can help achieve climate goals, and some climate mitigation efforts can in turn improve air quality.

    One part of MIT Professor Arlene Fiore’s research program is to investigate the fundamental science in understanding air pollutants — how long they persist and move through our environment to affect air quality.

    “We need to understand the conditions under which pollutants, such as ozone, form. How much ozone is formed locally and how much is transported long distances?” says Fiore, who notes that Asian air pollution can be transported across the Pacific Ocean to North America. “We need to think about processes spanning local to global dimensions.”

    Fiore, the Peter H. Stone and Paola Malanotte Stone Professor in Earth, Atmospheric and Planetary Sciences, analyzes data from on-the-ground readings and from satellites, along with models, to better understand the chemistry and behavior of air pollutants — which ultimately can inform mitigation strategies and policy setting.

    A global concern

    At the United Nations’ most recent climate change conference, COP26, air quality management was a topic discussed over two days of presentations.

    “Breathing is vital. It’s life. But for the vast majority of people on this planet right now, the air that they breathe is not giving life, but cutting it short,” said Sarah Vogel, senior vice president for health at the Environmental Defense Fund, at the COP26 session.

    “We need to confront this twin challenge now through both a climate and clean air lens, of targeting those pollutants that warm both the air and harm our health.”

    Earlier this year, the World Health Organization (WHO) updated the global air quality guidelines it had issued 15 years earlier for six key pollutants including ozone (O3), nitrogen dioxide (NO2), sulfur dioxide (SO2), and carbon monoxide (CO). The new guidelines are more stringent based on what the WHO stated is the “quality and quantity of evidence” of how these pollutants affect human health. WHO estimates that roughly 7 million premature deaths are attributable to the joint effects of air pollution.

    “We’ve had all these health-motivated reductions of aerosol and ozone precursor emissions. What are the implications for the climate system, both locally but also around the globe? How does air quality respond to climate change? We study these two-way interactions between air pollution and the climate system,” says Fiore.

    But fundamental science is still required to understand how gases, such as ozone and nitrogen dioxide, linger and move throughout the troposphere — the lowermost layer of our atmosphere, containing the air we breathe.

    “We care about ozone in the air we’re breathing where we live at the Earth’s surface,” says Fiore. “Ozone reacts with biological tissue, and can be damaging to plants and human lungs. Even if you’re a healthy adult, if you’re out running hard during an ozone smog event, you might feel an extra weight on your lungs.”

    Telltale signs from space

    Ozone is not emitted directly, but instead forms through chemical reactions catalyzed by radiation from the sun interacting with nitrogen oxides (pollutants released in large part from burning fossil fuels) and volatile organic compounds. However, current satellite instruments cannot sense ground-level ozone.

    “We can’t retrieve surface- or even near-surface ozone from space,” says Fiore of the satellite data, “although the anticipated launch of a new instrument looks promising for new advances in retrieving lower-tropospheric ozone.” Instead, scientists can look at signatures from other gas emissions to get a sense of ozone formation. “Nitrogen dioxide and formaldehyde are a heavy focus of our research because they serve as proxies for two of the key ingredients that go on to form ozone in the atmosphere.”

    To understand ozone formation via these precursor pollutants, scientists have gathered data for more than two decades using spectrometer instruments aboard satellites that measure sunlight in ultraviolet and visible wavelengths that interact with these pollutants in the Earth’s atmosphere — known as solar backscatter radiation.

    Satellites, such as NASA’s Aura, carry instruments like the Ozone Monitoring Instrument (OMI). OMI, along with European-launched instruments such as the Global Ozone Monitoring Experiment (GOME) and the Scanning Imaging Absorption spectroMeter for Atmospheric CartograpHY (SCIAMACHY), and the newest generation TROPOspheric Monitoring Instrument (TROPOMI), all orbit the Earth, collecting data during daylight hours when sunlight is interacting with the atmosphere over a particular location.

    In a recent paper from Fiore’s group, former graduate student Xiaomeng Jin (now a postdoc at the University of California at Berkeley) demonstrated that she could bring together and “beat down the noise in the data,” as Fiore says, to identify trends in ozone formation chemistry over several U.S. metropolitan areas that “are consistent with our on-the-ground understanding from in situ ozone measurements.”

    “This finding implies that we can use these records to learn about changes in surface ozone chemistry in places where we lack on-the-ground monitoring,” says Fiore. Extracting these signals by stringing together satellite data — OMI, GOME, and SCIAMACHY — to produce a two-decade record required reconciling the instruments’ differing orbit days, times, and fields of view on the ground, or spatial resolutions. 
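
    One ingredient of reconciling records with different spatial resolutions is putting them on a common grid. The sketch below shows only that generic step, block-averaging a finer grid onto a coarser one while skipping missing pixels. It is an assumption-laden illustration, not the procedure used in the paper; the function name and the sample values are hypothetical.

```python
def block_average(grid, factor):
    """Average non-overlapping factor x factor blocks of a 2D grid.
    Missing pixels are marked None and skipped; a block with no valid
    pixels yields None."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r0 in range(0, rows, factor):
        coarse_row = []
        for c0 in range(0, cols, factor):
            values = [
                grid[r][c]
                for r in range(r0, r0 + factor)
                for c in range(c0, c0 + factor)
                if grid[r][c] is not None
            ]
            coarse_row.append(sum(values) / len(values) if values else None)
        coarse.append(coarse_row)
    return coarse


# Made-up 4 x 4 field with two missing pixels, reduced to a 2 x 2 grid.
fine = [
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [None, 2, 3, 4],
    [1, 2, 3, None],
]
coarse = block_average(fine, 2)
```

    Averaging many pixels or many overpasses together is also a simple way to reduce random noise, at the cost of spatial or temporal detail.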

    Currently, spectrometer instruments aboard satellites are retrieving data once per day. However, newer instruments, such as the Geostationary Environment Monitoring Spectrometer launched in February 2020 by the National Institute of Environmental Research in the Ministry of Environment of South Korea, will monitor a particular region continuously, providing much more data in real time.

    Over North America, the Tropospheric Emissions: Monitoring of Pollution (TEMPO) collaboration between NASA and the Smithsonian Astrophysical Observatory, led by Kelly Chance of Harvard University, will provide not only a stationary view of the atmospheric chemistry over the continent, but also a finer-resolution view, with the instrument recording pollution data from only a few square miles per pixel. Its launch is anticipated in 2022.

    “What we’re very excited about is the opportunity to have continuous coverage where we get hourly measurements that allow us to follow pollution from morning rush hour through the course of the day and see how plumes of pollution are evolving in real time,” says Fiore.

    Data for the people

    Providing Earth-observing data to people in addition to scientists — namely environmental managers, city planners, and other government officials — is the goal for the NASA Health and Air Quality Applied Sciences Team (HAQAST).

    Since 2016, Fiore has been part of HAQAST, including collaborative “tiger teams” — projects that bring together scientists, nongovernment entities, and government officials — to bring data to bear on real issues.

    For example, in 2017, Fiore led a tiger team that provided guidance to state air management agencies on how satellite data can be incorporated into state implementation plans (SIPs). “Submission of a SIP is required for any state with a region in non-attainment of U.S. National Ambient Air Quality Standards to demonstrate their approach to achieving compliance with the standard,” says Fiore. “What we found is that small tweaks in, for example, the metrics we use to convey the science findings, can go a long way to making the science more usable, especially when there are detailed policy frameworks in place that must be followed.”

    Now, in 2021, Fiore is part of two tiger teams announced by HAQAST in late September. One team is looking at data to address environmental justice issues, by providing data to assess communities disproportionately affected by environmental health risks. Such information can be used to estimate the benefits of governmental investments in environmental improvements for disproportionately burdened communities. The other team is looking at urban emissions of nitrogen oxides to try to better quantify and communicate uncertainties in the estimates of anthropogenic sources of pollution.

    “For our HAQAST work, we’re looking at not just the estimate of the exposure to air pollutants, or in other words their concentrations,” says Fiore, “but how confident are we in our exposure estimates, which in turn affect our understanding of the public health burden due to exposure. We have stakeholder partners at the New York Department of Health who will pair exposure datasets with health data to help prioritize decisions around public health.

    “I enjoy working with stakeholders who have questions that require science to answer and can make a difference in their decisions,” Fiore says.