    Scientists preserve DNA in an amber-like polymer

    In the movie “Jurassic Park,” scientists extracted DNA that had been preserved in amber for millions of years, and used it to create a population of long-extinct dinosaurs.

    Inspired partly by that film, MIT researchers have developed a glassy, amber-like polymer that can be used for long-term storage of DNA, whether entire human genomes or digital files such as photos.

    Most current methods for storing DNA require freezing temperatures, so they consume a great deal of energy and are not feasible in many parts of the world. In contrast, the new amber-like polymer can store DNA at room temperature while protecting the molecules from damage caused by heat or water.

    The researchers showed that they could use this polymer to store DNA sequences encoding the theme music from “Jurassic Park,” as well as an entire human genome. They also demonstrated that the DNA can be easily removed from the polymer without damaging it.

    “Freezing DNA is the number one way to preserve it, but it’s very expensive, and it’s not scalable,” says James Banal, a former MIT postdoc. “I think our new preservation method is going to be a technology that may drive the future of storing digital information on DNA.”

    Banal and Jeremiah Johnson, the A. Thomas Geurtin Professor of Chemistry at MIT, are the senior authors of the study, published yesterday in the Journal of the American Chemical Society. Former MIT postdoc Elizabeth Prince and MIT postdoc Ho Fung Cheng are the lead authors of the paper.

    Capturing DNA

    DNA, a very stable molecule, is well-suited for storing massive amounts of information, including digital data. Digital storage systems encode text, photos, and other kinds of information as a series of 0s and 1s. This same information can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G, and C. For example, G and C could be used to represent 0 while A and T represent 1.

    DNA offers a way to store this digital information at very high density: In theory, a coffee mug full of DNA could store all of the world’s data. DNA is also very stable and relatively easy to synthesize and sequence.

    In 2021, Banal and his postdoc advisor, Mark Bathe, an MIT professor of biological engineering, developed a way to store DNA in particles of silica, which could be labeled with tags that revealed the particles’ contents. That work led to a spinout called Cache DNA.

    One downside to that storage system is that it takes several days to embed DNA into the silica particles. Furthermore, removing the DNA from the particles requires hydrofluoric acid, which can be hazardous to workers handling the DNA.

    To come up with alternative storage materials, Banal began working with Johnson and members of his lab. Their idea was to use a type of polymer known as a degradable thermoset, which consists of polymers that form a solid when heated. The material also includes cleavable links that can be easily broken, allowing the polymer to be degraded in a controlled way.

    “With these deconstructable thermosets, depending on what cleavable bonds we put into them, we can choose how we want to degrade them,” Johnson says.

    For this project, the researchers decided to make their thermoset polymer from styrene and a cross-linker, which together form an amber-like thermoset called cross-linked polystyrene. This thermoset is also very hydrophobic, so it can prevent moisture from getting in and damaging the DNA. To make the thermoset degradable, the styrene monomers and cross-linkers are copolymerized with monomers called thionolactones. These links can be broken by treating them with a molecule called cysteamine.

    Because styrene is so hydrophobic, the researchers had to come up with a way to entice DNA (a hydrophilic, negatively charged molecule) into the styrene.

    To do that, they identified a combination of three monomers that they could turn into polymers that dissolve DNA by helping it interact with styrene. Each of the monomers has different features that cooperate to get the DNA out of water and into the styrene. There, the DNA forms spherical complexes, with charged DNA in the center and hydrophobic groups forming an outer layer that interacts with styrene. When heated, this solution becomes a solid glass-like block, embedded with DNA complexes.

    The researchers dubbed their method T-REX (Thermoset-REinforced Xeropreservation). The process of embedding DNA into the polymer network takes a few hours, but that could become shorter with further optimization, the researchers say.

    To release the DNA, the researchers first add cysteamine, which cleaves the bonds holding the polystyrene thermoset together, breaking it into smaller pieces. Then, a detergent called SDS can be added to remove the DNA from polystyrene without damaging it.

    Storing information

    Using these polymers, the researchers showed that they could encapsulate DNA of varying length, from tens of nucleotides up to an entire human genome (more than 50,000 base pairs). They were able to store DNA encoding the Emancipation Proclamation and the MIT logo, in addition to the theme music from “Jurassic Park.”

    After storing the DNA and then removing it, the researchers sequenced it and found that no errors had been introduced, which is a critical feature of any digital data storage system.

    The researchers also showed that the thermoset polymer can protect DNA from temperatures up to 75 degrees Celsius (167 degrees Fahrenheit). They are now working on ways to streamline the process of making the polymers and forming them into capsules for long-term storage.

    Cache DNA, a company started by Banal and Bathe, with Johnson as a member of the scientific advisory board, is now working on further developing DNA storage technology. The earliest application they envision is storing genomes for personalized medicine, and they also anticipate that these stored genomes could undergo further analysis as better technology is developed in the future.

    “The idea is, why don’t we preserve the master record of life forever?” Banal says. “Ten years or 20 years from now, when technology has advanced way more than we could ever imagine today, we could learn more and more things. We’re still in the very infancy of understanding the genome and how it relates to disease.”

    The research was funded by the National Science Foundation.
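
    The example mapping mentioned in the article above (G and C for 0, A and T for 1) can be sketched in a few lines of Python. This is only an illustration of that one-bit-per-nucleotide idea, not the encoding the researchers used: the function names `encode` and `decode` and the random choice between the two bases for each bit are assumptions made for the sketch, and practical DNA storage schemes add error correction and other constraints.

```python
import random

# Assumed mapping, taken from the example in the text:
# G or C stands for bit 0, A or T stands for bit 1.
BIT_TO_BASES = {"0": "GC", "1": "AT"}
BASE_TO_BIT = {"G": "0", "C": "0", "A": "1", "T": "1"}


def encode(data: bytes, rng: random.Random) -> str:
    """Turn bytes into a nucleotide string, one base per bit."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Either base of the pair encodes the same bit; pick one at random.
    return "".join(rng.choice(BIT_TO_BASES[bit]) for bit in bits)


def decode(strand: str) -> bytes:
    """Recover the original bytes from a nucleotide string."""
    bits = "".join(BASE_TO_BIT[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"MIT", random.Random(0))
print(strand)          # 24 nucleotides, 8 per byte
print(decode(strand))  # the original bytes
```

    Because two bases map to each bit, many different strands decode to the same data; a real scheme could use that freedom to avoid sequences that are hard to synthesize or sequence.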

    Making genetic prediction models more inclusive

    While any two human genomes are about 99.9 percent identical, genetic variation in the remaining 0.1 percent plays an important role in shaping human diversity, including a person’s risk for developing certain diseases.

    Measuring the cumulative effect of these small genetic differences can provide an estimate of an individual’s genetic risk for a particular disease or their likelihood of having a particular trait. However, the majority of models used to generate these “polygenic scores” are based on studies done in people of European descent, and do not accurately gauge the risk for people of non-European ancestry or people whose genomes contain a mixture of chromosome regions inherited from previously isolated populations, also known as admixed ancestry.

    In an effort to make these genetic scores more inclusive, MIT researchers have created a new model that takes into account genetic information from people from a wider diversity of genetic ancestries across the world. Using this model, they showed that they could increase the accuracy of genetics-based predictions for a variety of traits, especially for people from populations that have been traditionally underrepresented in genetic studies.

    “For people of African ancestry, our model proved to be about 60 percent more accurate on average,” says Manolis Kellis, a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and a member of the Broad Institute of MIT and Harvard. “For people of admixed genetic backgrounds more broadly, who have been excluded from most previous models, the accuracy of our model increased by an average of about 18 percent.”

    The researchers hope their more inclusive modeling approach could help improve health outcomes for a wider range of people and promote health equity by spreading the benefits of genomic sequencing more widely across the globe.

    “What we have done is created a method that allows you to be much more accurate for admixed and ancestry-diverse individuals, and ensure the results and the benefits of human genetics research are equally shared by everyone,” says MIT postdoc Yosuke Tanigawa, the lead and co-corresponding author of the paper, which appears today in open-access form in the American Journal of Human Genetics. The researchers have made all of their data publicly available for the broader scientific community to use.

    More inclusive models

    The work builds on the Human Genome Project, which mapped all of the genes found in the human genome, and on subsequent large-scale, cohort-based studies of how genetic variants in the human genome are linked to disease risk and other differences between individuals.

    These studies showed that the effect of any individual genetic variant on its own is typically very small. Together, these small effects add up and influence the risk of developing heart disease or diabetes, having a stroke, or being diagnosed with psychiatric disorders such as schizophrenia.

    “We have hundreds of thousands of genetic variants that are associated with complex traits, each of which is individually playing a weak effect, but together they are beginning to be predictive for disease predispositions,” Kellis says.
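
    A polygenic score of the kind described here is, in its simplest form, a weighted sum: each variant's allele count is multiplied by its estimated effect size, and the products are added up. The sketch below shows only that arithmetic. The effect sizes and genotypes are made-up illustrative numbers, and the function name `polygenic_score` is an assumption; it does not reproduce the MIT team's model, which is fit on individual-level data.

```python
def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages (0, 1, or 2 copies per variant)."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("need one effect size per variant")
    return sum(d * beta for d, beta in zip(dosages, effect_sizes))


# Hypothetical per-variant effect sizes; each one is individually small.
effect_sizes = [0.02, -0.01, 0.03, 0.005]

# Hypothetical genotypes: copies of the effect allele at each variant.
person_a = [0, 1, 2, 1]
person_b = [2, 0, 0, 1]

print(polygenic_score(person_a, effect_sizes))
print(polygenic_score(person_b, effect_sizes))
```

    With hundreds of thousands of variants the same sum applies; what differs between models is how the effect sizes are estimated and from whom.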

    However, most of these genome-wide association studies included few people of non-European descent, so polygenic risk models based on them translate poorly to non-European populations. People from different geographic areas can have different patterns of genetic variation, shaped by stochastic drift, population history, and environmental factors — for example, in people of African descent, genetic variants that protect against malaria are more common than in other populations. Those variants also affect other traits involving the immune system, such as counts of neutrophils, a type of immune cell. That variation would not be well-captured in a model based on genetic analysis of people of European ancestry alone.

    “If you are an individual of African descent, of Latin American descent, of Asian descent, then you are currently being left out by the system,” Kellis says. “This inequity in the utilization of genetic information for predicting risk of patients can cause unnecessary burden, unnecessary deaths, and unnecessary lack of prevention, and that’s where our work comes in.”

    Some researchers have begun trying to address these disparities by creating distinct models for people of European descent, of African descent, or of Asian descent. These emerging approaches assign individuals to distinct genetic ancestry groups, aggregate the data to create an association summary, and make genetic prediction models. However, these approaches still don’t represent people of admixed genetic backgrounds well.

    “Our approach builds on the previous work without requiring researchers to assign individuals or local genomic segments of individuals to predefined distinct genetic ancestry groups,” Tanigawa says. “Instead, we develop a single model for everybody by directly working on individuals across the continuum of their genetic ancestries.”

    In creating their new model, the MIT team used computational and statistical techniques that enabled them to study each individual’s unique genetic profile instead of grouping individuals by population. This methodological advancement allowed the researchers to include people of admixed ancestry, who made up nearly 10 percent of the UK Biobank dataset used for this study and currently account for about one in seven newborns in the United States.

    “Because we work at the individual level, there is no need for computing summary-level data for different populations,” Kellis says. “Thus, we did not need to exclude individuals of admixed ancestry, increasing our power by including more individuals and representing contributions from all populations in our combined model.”

    Better predictions

    To create their new model, the researchers used genetic data from more than 280,000 people, which was collected by UK Biobank, a large-scale biomedical database and research resource containing de-identified genetic, lifestyle, and health information from half a million U.K. participants. Using another set of about 81,000 held-out individuals from the UK Biobank, the researchers evaluated their model across 60 traits, which included traits related to body size and shape, such as height and body mass index, as well as blood traits such as white blood cell count and red blood cell count, which also have a genetic basis.

    The researchers found that, compared to models trained only on European-ancestry individuals, their model’s predictions are more accurate for all genetic ancestry groups. The most notable gain was for people of African ancestry, who showed 61 percent average improvements, even though they only made up about 1.5 percent of samples in UK Biobank. The researchers also saw improvements of 11 percent for people of South Asian descent and 5 percent for white British people. Predictions for people of admixed ancestry improved by about 18 percent.

    “When you bring all the individuals together in the training set, everybody contributes to the training of the polygenic score modeling on equal footing,” Tanigawa says. “Combined with increasingly more inclusive data collection efforts, our method can help leverage these efforts to improve predictive accuracy for all.”

    The MIT team hopes its approach can eventually be incorporated into tests of an individual’s risk of a variety of diseases. Such tests could be combined with conventional risk factors and used to help doctors diagnose disease or to help people manage their risk for certain diseases before they develop.

    “Our work highlights the power of diversity, equity, and inclusion efforts in the context of genomics research,” Tanigawa says.

    The researchers now hope to add even more data to their model, including data from the United States, and to apply it to additional traits that they didn’t analyze in this study.

    “This is just the start,” Kellis says. “We can’t wait to see more people join our effort to propel inclusive human genetics research.”

    The research was funded by the National Institutes of Health.

    A more effective experimental design for engineering a cell into a new state

    A strategy for cellular reprogramming involves using targeted genetic interventions to engineer a cell into a new state. The technique holds great promise in immunotherapy, for instance, where researchers could reprogram a patient’s T-cells so they are more potent cancer killers. Someday, the approach could also help identify life-saving cancer treatments or regenerative therapies that repair disease-ravaged organs.

    But the human body has about 20,000 genes, and a genetic perturbation could be on a combination of genes or on any of the over 1,000 transcription factors that regulate the genes. Because the search space is vast and genetic experiments are costly, scientists often struggle to find the ideal perturbation for their particular application.   

    Researchers from MIT and Harvard University developed a new, computational approach that can efficiently identify optimal genetic perturbations based on a much smaller number of experiments than traditional methods.

    Their algorithmic technique leverages the cause-and-effect relationship between factors in a complex system, such as genome regulation, to prioritize the best intervention in each round of sequential experiments.

    The researchers conducted a rigorous theoretical analysis to determine that their technique did, indeed, identify optimal interventions. With that theoretical framework in place, they applied the algorithms to real biological data designed to mimic a cellular reprogramming experiment. Their algorithms were the most efficient and effective.

    “Too often, large-scale experiments are designed empirically. A careful causal framework for sequential experimentation may allow identifying optimal interventions with fewer trials, thereby reducing experimental costs,” says co-senior author Caroline Uhler, a professor in the Department of Electrical Engineering and Computer Science (EECS) who is also co-director of the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard, and a researcher at MIT’s Laboratory for Information and Decision Systems (LIDS) and Institute for Data, Systems and Society (IDSS).

    Joining Uhler on the paper, which appears today in Nature Machine Intelligence, are lead author Jiaqi Zhang, a graduate student and Eric and Wendy Schmidt Center Fellow; co-senior author Themistoklis P. Sapsis, professor of mechanical and ocean engineering at MIT and a member of IDSS; and others at Harvard and MIT.

    Active learning

    When scientists try to design an effective intervention for a complex system, like in cellular reprogramming, they often perform experiments sequentially. Such settings are ideally suited for the use of a machine-learning approach called active learning. Data samples are collected and used to learn a model of the system that incorporates the knowledge gathered so far. From this model, an acquisition function is designed — an equation that evaluates all potential interventions and picks the best one to test in the next trial.

    This process is repeated until an optimal intervention is identified (or resources to fund subsequent experiments run out).
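
    The loop described above (fit a model, score every candidate with an acquisition function, run the top-scoring experiment, repeat) can be sketched as follows. This is a generic illustration, not the researchers' method: it uses a plain running-average model and an upper-confidence-bound acquisition function rather than a causal model, and the hidden effects in `TRUE_EFFECT`, the noise level, and all function names are assumptions invented for the example.

```python
import math
import random

# Hypothetical hidden effect of each candidate intervention (unknown to the loop).
TRUE_EFFECT = [0.1, 0.4, 0.9, 0.3, 0.6]


def run_experiment(intervention, rng):
    """Stand-in for a costly experiment: true effect plus measurement noise."""
    return TRUE_EFFECT[intervention] + rng.gauss(0.0, 0.1)


def acquisition(mean, count, round_index, c=1.0):
    """Upper confidence bound: favor high estimates and rarely tried options."""
    if count == 0:
        return math.inf  # always try an untested intervention first
    return mean + c * math.sqrt(math.log(round_index + 1) / count)


def active_learning(n_rounds, seed=0):
    rng = random.Random(seed)
    n = len(TRUE_EFFECT)
    counts = [0] * n    # how often each intervention was tested
    means = [0.0] * n   # the "model": running mean outcome per intervention
    for t in range(n_rounds):
        scores = [acquisition(means[i], counts[i], t) for i in range(n)]
        choice = max(range(n), key=scores.__getitem__)
        outcome = run_experiment(choice, rng)
        counts[choice] += 1
        means[choice] += (outcome - means[choice]) / counts[choice]
    best = max(range(n), key=means.__getitem__)
    return best, counts


best, counts = active_learning(n_rounds=200)
print("best intervention:", best)
print("experiments per intervention:", counts)
```

    The budget of `n_rounds` plays the role of the resources mentioned above: the loop stops when it runs out, and reports the best intervention found so far.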

    “While there are several generic acquisition functions to sequentially design experiments, these are not effective for problems of such complexity, leading to very slow convergence,” Sapsis explains.

    Acquisition functions typically consider correlation between factors, such as which genes are co-expressed. But focusing only on correlation ignores the regulatory relationships or causal structure of the system. For instance, a genetic intervention can only affect the expression of downstream genes, but a correlation-based approach would not be able to distinguish between genes that are upstream or downstream.

    “You can learn some of this causal knowledge from the data and use that to design an intervention more efficiently,” Zhang explains.

    The MIT and Harvard researchers leveraged this underlying causal structure for their technique. First, they carefully constructed an algorithm so it can only learn models of the system that account for causal relationships.

    Then the researchers designed the acquisition function so it automatically evaluates interventions using information on these causal relationships. They crafted this function so it prioritizes the most informative interventions, meaning those most likely to lead to the optimal intervention in subsequent experiments.

    “By considering causal models instead of correlation-based models, we can already rule out certain interventions. Then, whenever you get new data, you can learn a more accurate causal model and thereby further shrink the space of interventions,” Uhler explains.

    This smaller search space, coupled with the acquisition function’s special focus on the most informative interventions, is what makes their approach so efficient.

    The researchers further improved their acquisition function using a technique known as output weighting, inspired by the study of extreme events in complex systems. This method carefully emphasizes interventions that are likely to be closer to the optimal intervention.

    “Essentially, we view an optimal intervention as an ‘extreme event’ within the space of all possible, suboptimal interventions and use some of the ideas we have developed for these problems,” Sapsis says.    

    Enhanced efficiency

    They tested their algorithms using real biological data in a simulated cellular reprogramming experiment. For this test, they sought a genetic perturbation that would result in a desired shift in average gene expression. Their acquisition functions consistently identified better interventions than baseline methods through every step in the multi-stage experiment.

    “If you cut the experiment off at any stage, ours would still be more efficient than the baselines. This means you could run fewer experiments and get the same or better results,” Zhang says.

    The researchers are currently working with experimentalists to apply their technique toward cellular reprogramming in the lab.

    Their approach could also be applied to problems outside genomics, such as identifying optimal prices for consumer products or enabling optimal feedback control in fluid mechanics applications.

    In the future, they plan to enhance their technique for optimizations beyond those that seek to match a desired mean. In addition, their method assumes that scientists already understand the causal relationships in their system, but future work could explore how to use AI to learn that information, as well.

    This work was funded, in part, by the Office of Naval Research, the MIT-IBM Watson AI Lab, the MIT J-Clinic for Machine Learning and Health, the Eric and Wendy Schmidt Center at the Broad Institute, a Simons Investigator Award, the Air Force Office of Scientific Research, and a National Science Foundation Graduate Fellowship.

    Neurodegenerative disease can progress in newly identified patterns

    Neurodegenerative diseases — like amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease), Alzheimer’s, and Parkinson’s — are complicated, chronic ailments that can present with a variety of symptoms, worsen at different rates, and have many underlying genetic and environmental causes, some of which are unknown. ALS, in particular, affects voluntary muscle movement and is always fatal, but while most people survive for only a few years after diagnosis, others live with the disease for decades. Manifestations of ALS can also vary significantly; often slower disease development correlates with onset in the limbs and affecting fine motor skills, while the more serious, bulbar ALS impacts swallowing, speaking, breathing, and mobility. Therefore, understanding the progression of diseases like ALS is critical to enrollment in clinical trials, analysis of potential interventions, and discovery of root causes.

    However, assessing disease evolution is far from straightforward. Current clinical studies typically assume that health declines on a downward linear trajectory on a symptom rating scale, and use these linear models to evaluate whether drugs are slowing disease progression. Yet data indicate that ALS often follows nonlinear trajectories, with periods where symptoms are stable alternating with periods when they are rapidly changing. Since data can be sparse, and health assessments often rely on subjective rating metrics measured at uneven time intervals, comparisons across patient populations are difficult. These heterogeneous data and progression patterns, in turn, complicate analyses of intervention effectiveness and potentially mask disease origin.

    Now, a new machine-learning method developed by researchers from MIT, IBM Research, and elsewhere aims to better characterize ALS disease progression patterns to inform clinical trial design.

    “There are groups of individuals that share progression patterns. For example, some seem to have really fast-progressing ALS and others that have slow-progressing ALS that varies over time,” says Divya Ramamoorthy PhD ’22, a research specialist at MIT and lead author of a new paper on the work that was published this month in Nature Computational Science. “The question we were asking is: can we use machine learning to identify if, and to what extent, those types of consistent patterns across individuals exist?”

    Their technique, indeed, identified discrete and robust clinical patterns in ALS progression, many of which are non-linear. Further, these disease progression subtypes were consistent across patient populations and disease metrics. The team additionally found that their method can be applied to Alzheimer’s and Parkinson’s diseases as well.

    Joining Ramamoorthy on the paper are MIT-IBM Watson AI Lab members Ernest Fraenkel, a professor in the MIT Department of Biological Engineering; Research Scientist Soumya Ghosh of IBM Research; and Principal Research Scientist Kenney Ng, also of IBM Research. Additional authors include Kristen Severson PhD ’18, a senior researcher at Microsoft Research and former member of the Watson Lab and of IBM Research; Karen Sachs PhD ’06 of Next Generation Analytics; a team of researchers with Answer ALS; Jonathan D. Glass and Christina N. Fournier of the Emory University School of Medicine; the Pooled Resource Open-Access ALS Clinical Trials Consortium; ALS/MND Natural History Consortium; Todd M. Herrington of Massachusetts General Hospital (MGH) and Harvard Medical School; and James D. Berry of MGH.

    MIT Professor Ernest Fraenkel describes early stages of his research looking at root causes of amyotrophic lateral sclerosis (ALS).

    Reshaping health decline

    After consulting with clinicians, the team of machine learning researchers and neurologists let the data speak for itself. They designed an unsupervised machine-learning model that employed two methods: Gaussian process regression and Dirichlet process clustering. These inferred the health trajectories directly from patient data and automatically grouped similar trajectories together without prescribing the number of clusters or the shape of the curves, forming ALS progression “subtypes.” Their method incorporated prior clinical knowledge in the way of a bias for negative trajectories — consistent with expectations for neurodegenerative disease progressions — but did not assume any linearity. “We know that linearity is not reflective of what’s actually observed,” says Ng. “The methods and models that we use here were more flexible, in the sense that, they capture what was seen in the data,” without the need for expensive labeled data and prescription of parameters.
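
    One way to see how Dirichlet process clustering avoids fixing the number of clusters in advance is the Chinese restaurant process, a standard way of sampling cluster assignments from a Dirichlet process prior. The sketch below samples from that prior only. It is not the team's model, which also scores how well each patient's trajectory fits a cluster using Gaussian process regression; the function name `crp_assignments` and the concentration value `alpha` are assumptions chosen for illustration.

```python
import random


def crp_assignments(n_items, alpha, rng):
    """Sample cluster labels from a Chinese restaurant process prior.

    Each new item joins an existing cluster with probability proportional
    to that cluster's size, or opens a new cluster with probability
    proportional to alpha. The number of clusters is not set beforehand.
    """
    sizes = []        # current size of each cluster
    assignments = []  # cluster label for each item, in arrival order
    for _ in range(n_items):
        weights = sizes + [alpha]  # last slot means "open a new cluster"
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(sizes):
            sizes.append(1)
        else:
            sizes[k] += 1
        assignments.append(k)
    return assignments, sizes


assignments, sizes = crp_assignments(n_items=100, alpha=1.0, rng=random.Random(0))
print("number of clusters:", len(sizes))
print("cluster sizes:", sizes)
```

    Larger values of `alpha` tend to produce more clusters; in a full model the data, not just this prior, determine how many groups are supported.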

    Primarily, they applied the model to five longitudinal datasets from ALS clinical trials and observational studies. These used the gold standard to measure symptom development: the ALS functional rating scale revised (ALSFRS-R), which captures a global picture of patient neurological impairment but can be a bit of a “messy metric.” Additionally, the team incorporated performance on survivability probabilities, forced vital capacity (a measurement of respiratory function), and subscores of ALSFRS-R, which look at individual bodily functions.

    New regimes of progression and utility

    When their population-level model was trained and tested on these metrics, four dominant patterns of disease popped out of the many trajectories — sigmoidal fast progression, stable slow progression, unstable slow progression, and unstable moderate progression — many with strong nonlinear characteristics. Notably, it captured trajectories where patients experienced a sudden loss of ability, called a functional cliff, which would significantly impact treatments, enrollment in clinical trials, and quality of life.

    The researchers compared their method against other commonly used linear and nonlinear approaches in the field to separate the contribution of clustering and linearity to the model’s accuracy. The new work outperformed them, even patient-specific models, and found that subtype patterns were consistent across measures. Impressively, when data were withheld, the model was able to interpolate missing values, and, critically, could forecast future health measures. The model could also be trained on one ALSFRS-R dataset and predict cluster membership in others, making it robust, generalizable, and accurate with scarce data. So long as 6-12 months of data were available, health trajectories could be inferred with higher confidence than conventional methods.

    The researchers’ approach also provided insights into Alzheimer’s and Parkinson’s diseases, both of which can have a range of symptom presentations and progression. For Alzheimer’s, the new technique could identify distinct disease patterns, in particular variations in the rates of conversion of mild to severe disease. The Parkinson’s analysis demonstrated a relationship between progression trajectories for off-medication scores and disease phenotypes, such as the tremor-dominant or postural instability/gait difficulty forms of Parkinson’s disease.

    The work makes significant strides to find the signal amongst the noise in the time-series of complex neurodegenerative disease. “The patterns that we see are reproducible across studies, which I don’t believe had been shown before, and that may have implications for how we subtype the [ALS] disease,” says Fraenkel. As the FDA has been considering the impact of non-linearity in clinical trial designs, the team notes that their work is particularly pertinent.

    As new ways to understand disease mechanisms come online, this model provides another tool to pick apart illnesses like ALS, Alzheimer’s, and Parkinson’s from a systems biology perspective.

    “We have a lot of molecular data from the same patients, and so our long-term goal is to see whether there are subtypes of the disease,” says Fraenkel, whose lab looks at cellular changes to understand the etiology of diseases and possible targets for cures. “One approach is to start with the symptoms … and see if people with different patterns of disease progression are also different at the molecular level. That might lead you to a therapy. Then there’s the bottom-up approach, where you start with the molecules” and try to reconstruct biological pathways that might be affected. “We’re going [to be tackling this] from both ends … and finding if something meets in the middle.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab, the Muscular Dystrophy Association, Department of Veterans Affairs of Research and Development, the Department of Defense, NSF Graduate Research Fellowship Program, Siebel Scholars Fellowship, Answer ALS, the United States Army Medical Research Acquisition Activity, National Institutes of Health, and the NIH/NINDS.

    New CRISPR-based map ties every human gene to its function

    The Human Genome Project was an ambitious initiative to sequence every piece of human DNA. The project drew together collaborators from research institutions around the world, including MIT’s Whitehead Institute for Biomedical Research, and was finally completed in 2003. Now, over two decades later, MIT Professor Jonathan Weissman and colleagues have gone beyond the sequence to present the first comprehensive functional map of genes that are expressed in human cells. The data from this project, published online June 9 in Cell, ties each gene to its job in the cell, and is the culmination of years of collaboration on the single-cell sequencing method Perturb-seq.

    The data are available for other scientists to use. “It’s a big resource in the way the human genome is a big resource, in that you can go in and do discovery-based research,” says Weissman, who is also a member of the Whitehead Institute and an investigator with the Howard Hughes Medical Institute. “Rather than defining ahead of time what biology you’re going to be looking at, you have this map of the genotype-phenotype relationships and you can go in and screen the database without having to do any experiments.”

    The screen allowed the researchers to delve into diverse biological questions. They used it to explore the cellular effects of genes with unknown functions, to investigate the response of mitochondria to stress, and to screen for genes that cause chromosomes to be lost or gained, a phenotype that has proved difficult to study in the past. “I think this dataset is going to enable all sorts of analyses that we haven’t even thought up yet by people who come from other parts of biology, and suddenly they just have this available to draw on,” says former Weissman Lab postdoc Tom Norman, a co-senior author of the paper.

    Pioneering Perturb-seq

    The project takes advantage of the Perturb-seq approach, which makes it possible to follow the impact of turning genes on or off with unprecedented depth. This method was first published in 2016 by a group of researchers including Weissman and fellow MIT professor Aviv Regev, but could only be used on small sets of genes and at great expense.

    The massive Perturb-seq map was made possible by foundational work from Joseph Replogle, an MD-PhD student in Weissman’s lab and co-first author of the present paper. Replogle, in collaboration with Norman, who now leads a lab at Memorial Sloan Kettering Cancer Center; Britt Adamson, an assistant professor in the Department of Molecular Biology at Princeton University; and a group at 10x Genomics, set out to create a new version of Perturb-seq that could be scaled up. The researchers published a proof-of-concept paper in Nature Biotechnology in 2020. 

    The Perturb-seq method uses CRISPR-Cas9 genome editing to introduce genetic changes into cells, and then uses single-cell RNA sequencing to capture information about the RNAs that are expressed as a result of a given genetic change. Because RNAs control all aspects of how cells behave, this method can help decode the many cellular effects of genetic changes.

    Since their initial proof-of-concept paper, Weissman, Regev, and others have used this sequencing method on smaller scales. For example, the researchers used Perturb-seq in 2021 to explore how human and viral genes interact over the course of an infection with HCMV, a common herpesvirus.

    In the new study, Replogle and collaborators including Reuben Saunders, a graduate student in Weissman’s lab and co-first author of the paper, scaled up the method to the entire genome. Using human blood cancer cell lines as well as noncancerous cells derived from the retina, he performed Perturb-seq across more than 2.5 million cells, and used the data to build a comprehensive map tying genotypes to phenotypes.

    Delving into the data

    Upon completing the screen, the researchers decided to put their new dataset to use and examine a few biological questions. “The advantage of Perturb-seq is it lets you get a big dataset in an unbiased way,” says Tom Norman. “No one knows entirely what the limits are of what you can get out of that kind of dataset. Now, the question is, what do you actually do with it?”

    The first, most obvious application was to look into genes with unknown functions. Because the screen also read out phenotypes of many known genes, the researchers could use the data to compare unknown genes to known ones and look for similar transcriptional outcomes, which could suggest the gene products worked together as part of a larger complex.

    The mutation of one gene called C7orf26 in particular stood out. Researchers noticed that genes whose removal led to a similar phenotype were part of a protein complex called Integrator that played a role in creating small nuclear RNAs. The Integrator complex is made up of many smaller subunits — previous studies had suggested 14 individual proteins — and the researchers were able to confirm that C7orf26 made up a 15th component of the complex.

    They also discovered that the 15 subunits worked together in smaller modules to perform specific functions within the Integrator complex. “Absent this thousand-foot-high view of the situation, it was not so clear that these different modules were so functionally distinct,” says Saunders.

    Another perk of Perturb-seq is that because the assay focuses on single cells, the researchers could use the data to look at more complex phenotypes that become muddied when they are studied together with data from other cells. “We often take all the cells where ‘gene X’ is knocked down and average them together to look at how they changed,” Weissman says. “But sometimes when you knock down a gene, different cells that are losing that same gene behave differently, and that behavior may be missed by the average.”
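    The point about averaging can be illustrated with a minimal sketch. The numbers below are made up for illustration and are not data from the study; `summarize` is a hypothetical helper. Two hypothetical knockdowns produce the same average change in a readout gene, yet only the per-cell spread shows that one of them splits the cells into two opposite responses.

```python
from statistics import mean, pstdev


def summarize(changes):
    """Return (mean, population standard deviation) of per-cell changes."""
    return mean(changes), pstdev(changes)


# Hypothetical per-cell expression changes for one readout gene.
uniform_response = [0.0, 0.0, 0.0, 0.0]    # every cell barely changes
mixed_response = [2.0, -2.0, 2.0, -2.0]    # cells respond in opposite ways

uniform_mean, uniform_spread = summarize(uniform_response)
mixed_mean, mixed_spread = summarize(mixed_response)

# Both knockdowns look identical on average; only the spread differs.
print(uniform_mean, uniform_spread)
print(mixed_mean, mixed_spread)
```

    A bulk measurement would report only the two means, which are equal here, so the second knockdown would look as if it had no effect.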

    The researchers found that a subset of genes whose removal led to different outcomes from cell to cell were responsible for chromosome segregation. Their removal was causing cells to lose a chromosome or pick up an extra one, a condition known as aneuploidy. “You couldn’t predict what the transcriptional response to losing this gene was because it depended on the secondary effect of what chromosome you gained or lost,” Weissman says. “We realized we could then turn this around and create this composite phenotype looking for signatures of chromosomes being gained and lost. In this way, we’ve done the first genome-wide screen for factors that are required for the correct segregation of DNA.”

    “I think the aneuploidy study is the most interesting application of this data so far,” Norman says. “It captures a phenotype that you can only get using a single-cell readout. You can’t go after it any other way.”

    The researchers also used their dataset to study how mitochondria responded to stress. Mitochondria, which evolved from free-living bacteria, carry 13 protein-coding genes in their genomes. Within the nuclear DNA, around 1,000 genes are somehow related to mitochondrial function. “People have been interested for a long time in how nuclear and mitochondrial DNA are coordinated and regulated in different cellular conditions, especially when a cell is stressed,” Replogle says.

    The researchers found that when they perturbed different mitochondria-related genes, the nuclear genome responded similarly to many different genetic changes. However, the mitochondrial genome responses were much more variable. 

    “There’s still an open question of why mitochondria still have their own DNA,” says Replogle. “A big-picture takeaway from our work is that one benefit of having a separate mitochondrial genome might be having localized or very specific genetic regulation in response to different stressors.”

    “If you have one mitochondria that’s broken, and another one that is broken in a different way, those mitochondria could be responding differentially,” Weissman says.

    In the future, the researchers hope to use Perturb-seq on different types of cells besides the cancer cell line they started in. They also hope to continue to explore their map of gene functions, and hope others will do the same. “This really is the culmination of many years of work by the authors and other collaborators, and I’m really pleased to see it continue to succeed and expand,” says Norman.


    An “oracle” for predicting the evolution of gene regulation

    Despite the sheer number of genes that each human cell contains, these so-called “coding” DNA sequences comprise just 1 percent of our entire genome. The remaining 99 percent is made up of “non-coding” DNA — which, unlike coding DNA, does not carry the instructions to build proteins.

    One vital function of this non-coding DNA, also called “regulatory” DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these non-coding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as Type 2 diabetes, or more life-threatening ones, including cancer.

    To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer datasets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.

    In a new study published on March 9 in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these non-coding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of non-coding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.

    “We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” says Aviv Regev, a professor of biology at MIT (on leave), core member of the Broad Institute of Harvard and MIT (on leave), head of Genentech Research and Early Development, and the study’s senior author. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology.”
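    The first question in that quote, trying every possible mutation of a sequence, can be sketched in a few lines. This is only an illustration under stated assumptions: `toy_expression` is a made-up placeholder scorer (GC fraction) standing in for a trained model, not the study's neural network, and `single_mutants` and `best_single_mutant` are hypothetical helper names.

```python
NUCLEOTIDES = "ACGT"


def single_mutants(sequence):
    """Yield every sequence differing from `sequence` at exactly one position."""
    for i, original in enumerate(sequence):
        for base in NUCLEOTIDES:
            if base != original:
                yield sequence[:i] + base + sequence[i + 1:]


def toy_expression(sequence):
    """Placeholder scorer (GC fraction); an assumption, not the real model."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def best_single_mutant(sequence, predict=toy_expression):
    """Return the single-base mutant with the highest predicted score."""
    return max(single_mutants(sequence), key=predict)


mutants = list(single_mutants("ACGT"))
print(len(mutants))                 # 3 alternatives at each of 4 positions
print(best_single_mutant("AAAA"))
```

    A sequence of length n has 3n single-base mutants, so scoring all of them with a model is cheap compared with measuring each one at the bench.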

    Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.

    To accomplish this goal, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author; Carl de Boer, now an assistant professor at the University of British Columbia; and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random non-coding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.

    “This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” Regev says. “In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules.”

    Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. “Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” Vaishnav explains.

    First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population dataset from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.

    But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of non-coding sequences even without such a comprehensive population dataset. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any non-coding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.

    “One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences,” Vaishnav explains. “I really wanted to find a way to fill that gap, and contribute to the long-standing vision of creating a complete fitness landscape.”

    Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, says the study shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

    Despite the fact that the model was trained on just a fraction of yeast regulatory DNA in a few growth conditions, he’s impressed that it’s capable of making such useful predictions about the evolution of gene regulation in mammals.

    “There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” he explains. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse datasets.”

    Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise non-coding DNA sequences for use in gene therapies.

    "People have been studying regulatory evolution and fitness landscapes for decades now," Vaishnav says. "I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications."


    Probing how proteins pair up inside cells

    Despite its minute size, a single cell contains billions of molecules that bustle around and bind to one another, carrying out vital functions. The human genome encodes about 20,000 proteins, most of which interact with partner proteins to mediate upwards of 400,000 distinct interactions. These partners don’t just latch onto one another haphazardly; they only bind to very specific companions that they must recognize inside the crowded cell. If they create the wrong pairings — or even the right pairings at the wrong place or wrong time — cancer or other diseases can ensue. Scientists are hard at work investigating these protein-protein relationships, in order to understand how they work, and potentially create drugs that disrupt or mimic them to treat disease.

    The average human protein is composed of approximately 400 building blocks called amino acids, which are strung together and folded into a complex 3D structure. Within this long string of building blocks, some proteins contain stretches of four to six amino acids called short linear motifs (SLiMs), which mediate protein-protein interactions. Despite their simplicity and small size, SLiMs and their binding partners facilitate key cellular processes. However, it’s been historically difficult to devise experiments to probe how SLiMs recognize their specific binding partners.

    To address this problem, a group led by Theresa Hwang PhD ’21 designed a screening method to understand how SLiMs selectively bind to certain proteins, and even distinguish between those with similar structures. Using the detailed information they gleaned from studying these interactions, the researchers created their own synthetic molecule capable of binding extremely tightly to a protein called ENAH, which is implicated in cancer metastasis. The team shared their findings in a pair of eLife studies, one published on Dec. 2, 2021, and the other published Jan. 25.

    “The ability to test hundreds of thousands of potential SLiMs for binding provides a powerful tool to explore why proteins prefer specific SLiM partners over others,” says Amy Keating, professor of biology and biological engineering and the senior author on both studies. “As we gain an understanding of the tricks that a protein uses to select its partners, we can apply these in protein design to make our own binders to modulate protein function for research or therapeutic purposes.”

    Most existing screens for SLiMs simply select for short, tight binders, while neglecting SLiMs that don’t grip their partner proteins quite as strongly. To survey SLiMs with a wide range of binding affinities, Keating, Hwang, and their colleagues developed their own screen called MassTitr.

    The researchers also suspected that the amino acids on either side of the SLiM’s core four-to-six amino acid sequence might play an underappreciated role in binding. To test their theory, they used MassTitr to screen the human proteome in longer chunks of 36 amino acids each, in order to see which “extended” SLiMs would associate with the protein ENAH.
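    As a rough illustration of what cutting a protein into 36-amino-acid chunks involves, the sketch below tiles a sequence into fixed-length windows. The article does not say whether the chunks overlapped, so the `step` parameter is an assumption, and `tile_sequence` is a hypothetical helper rather than part of MassTitr.

```python
def tile_sequence(sequence, window=36, step=36):
    """Split `sequence` into windows of length `window`, dropping any short tail.

    `step` sets how far each window starts after the previous one;
    a step smaller than `window` gives overlapping windows.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    last_start = len(sequence) - window
    return [sequence[i:i + window] for i in range(0, last_start + 1, step)]


# Made-up 80-residue sequence, used only to show the window arithmetic.
protein = "ACDEFGHIKLMNPQRSTVWY" * 4

non_overlapping = tile_sequence(protein)            # starts at 0 and 36
overlapping = tile_sequence(protein, step=12)       # starts at 0, 12, 24, 36

print(len(non_overlapping), len(overlapping))
```

    Overlapping windows cost more chunks but reduce the chance that a motif and its flanking residues are split across a chunk boundary.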

    ENAH, sometimes referred to as Mena, helps cells to move. This ability to migrate is critical for healthy cells, but cancer cells can co-opt it to spread. Scientists have found that reducing the amount of ENAH decreases the cancer cell’s ability to invade other tissues — suggesting that formulating drugs to disrupt this protein and its interactions could treat cancer.

    Thanks to MassTitr, the team identified 33 SLiM-containing proteins that bound to ENAH — 19 of which are potentially novel binding partners. They also discovered three distinct patterns of amino acids flanking core SLiM sequences that helped the SLiMs bind even tighter to ENAH. Of these extended SLiMs, one found in a protein called PCARE bound to ENAH with the highest known affinity of any SLiM to date.

    Next, the researchers combined a computer program called dTERMen with X-ray crystallography in order to understand how and why PCARE binds to ENAH over ENAH’s two nearly identical sister proteins (VASP and EVL). Hwang and her colleagues saw that the amino acids flanking PCARE’s core SLiM caused ENAH to change shape slightly when the two made contact, allowing the binding sites to latch onto one another. VASP and EVL, by contrast, could not undergo this structural change, so the PCARE SLiM did not bind to either of them as tightly.

    Inspired by this unique interaction, Hwang designed her own protein that bound to ENAH with unprecedented affinity and specificity. “It was exciting that we were able to come up with such a specific binder,” she says. “This work lays the foundation for designing synthetic molecules with the potential to disrupt protein-protein interactions that cause disease — or to help scientists learn more about ENAH and other SLiM-binding proteins.”  

    Ylva Ivarsson, a professor of biochemistry at Uppsala University who was not involved with the study, says that understanding how proteins find their binding partners is a question of fundamental importance to cell function and regulation. The two eLife studies, she explains, show that extended SLiMs play an underappreciated role in determining the affinity and specificity of these binding interactions.

    “The studies shed light on the idea that context matters, and provide a screening strategy for a variety of context-dependent binding interactions,” she says. “Hwang and co-authors have created valuable tools for dissecting the cellular function of proteins and their binding partners. Their approach could even inspire ENAH-specific inhibitors for therapeutic purposes.”

    Hwang’s biggest takeaway from the project is that things are not always as they seem: even short, simple protein segments can play complex roles in the cell. As she puts it: “We should really appreciate SLiMs more.”


    Differences in T cells’ functional states determine resistance to cancer therapy

    Non-small cell lung cancer (NSCLC) is the most common type of lung cancer in humans. Some patients with NSCLC receive a therapy called immune checkpoint blockade (ICB) that helps kill cancer cells by reinvigorating a subset of immune cells called T cells, which are “exhausted” and have stopped working. However, only about 35 percent of NSCLC patients respond to ICB therapy. Stefani Spranger’s lab at the MIT Department of Biology explores the mechanisms behind this resistance, with the goal of inspiring new therapies to better treat NSCLC patients. In a new study published on Oct. 29 in Science Immunology, a team led by Spranger lab postdoc Brendan Horton revealed what causes T cells to be non-responsive to ICB and suggested a possible solution.

    Scientists have long thought that the conditions within a tumor were responsible for determining when T cells stop working and become exhausted after being overstimulated or working for too long to fight a tumor. That’s why physicians prescribe ICB to treat cancer — ICB can invigorate the exhausted T cells within a tumor. However, Horton’s new experiments show that some ICB-resistant T cells stop working before they even enter the tumor. These T cells are not actually exhausted, but rather they become dysfunctional due to changes in gene expression that arise early during the activation of a T cell, which occurs in lymph nodes. Once activated, T cells differentiate into certain functional states, which are distinguishable by their unique gene expression patterns.

    The notion that the dysfunctional state that leads to ICB resistance arises before T cells enter the tumor is quite novel, says Spranger, the Howard S. and Linda B. Stern Career Development Professor, a member of the Koch Institute for Integrative Cancer Research, and the study’s senior author.

    “We show that this state is actually a preset condition, and that the T cells are already non-responsive to therapy before they enter the tumor,” she says. As a result, she explains, ICB therapies that work by reinvigorating exhausted T cells within the tumor are less likely to be effective. This suggests that combining ICB with other forms of immunotherapy that target T cells differently might be a more effective approach to help the immune system combat this subset of lung cancer.

    In order to determine why some tumors are resistant to ICB, Horton and the research team studied T cells in murine models of NSCLC. The researchers sequenced messenger RNA from the responsive and non-responsive T cells in order to identify any differences between the T cells. Supported in part by the Koch Institute Frontier Research Program, they used a technique called Seq-Well, developed in the lab of fellow Koch Institute member J. Christopher Love, the Raymond A. (1921) and Helen E. St. Laurent Professor of Chemical Engineering and a co-author of the study. The technique allows for the rapid gene expression profiling of single cells, which permitted Spranger and Horton to get a very granular look at the gene expression patterns of the T cells they were studying.

    Seq-Well revealed distinct patterns of gene expression between the responsive and non-responsive T cells. These differences, which are determined when the T cells assume their specialized functional states, may be the underlying cause of ICB resistance.

    Now that Horton and his colleagues had a possible explanation for why some T cells did not respond to ICB, they decided to see if they could help the ICB-resistant T cells kill the tumor cells. When analyzing the gene expression patterns of the non-responsive T cells, the researchers had noticed that these T cells had a lower expression of receptors for certain cytokines, small proteins that control immune system activity. To counteract this, the researchers treated lung tumors in murine models with extra cytokines. As a result, the previously non-responsive T cells were then able to fight the tumors — meaning that the cytokine therapy prevented, and potentially even reversed, the dysfunctionality.

    Administering cytokine therapy to human patients is not currently safe, because cytokines can cause serious side effects as well as a reaction called a “cytokine storm,” which can produce severe fevers, inflammation, fatigue, and nausea. However, there are ongoing efforts to figure out how to safely administer cytokines to specific tumors. In the future, Spranger and Horton suspect that cytokine therapy could be used in combination with ICB.

    “This is potentially something that could be translated into a therapeutic that could increase the therapy response rate in non-small cell lung cancer,” Horton says.

    Spranger agrees that this work will help researchers develop more innovative cancer therapies, especially because researchers have historically focused on T cell exhaustion rather than the earlier role that T cell functional states might play in cancer.

    “If T cells are rendered dysfunctional early on, ICB is not going to be effective, and we need to think outside the box,” she says. “There’s more evidence, and other labs are now showing this as well, that the functional state of the T cell actually matters quite substantially in cancer therapies.” To Spranger, this means that cytokine therapy “might be a therapeutic avenue” for NSCLC patients beyond ICB.

    Jeffrey Bluestone, the A.W. and Mary Margaret Clausen Distinguished Professor of Metabolism and Endocrinology at the University of California-San Francisco, who was not involved with the paper, agrees. “The study provides a potential opportunity to ‘rescue’ immunity in the NSCLC non-responder patients with appropriate combination therapies,” he says.

    This research was funded by the Pew-Stewart Scholars for Cancer Research, the Ludwig Center for Molecular Oncology, the Koch Institute Frontier Research Program through the Kathy and Curt Mable Cancer Research Fund, and the National Cancer Institute.