More stories

  • in

    Search algorithm reveals nearly 200 new kinds of CRISPR systems

    Microbial sequence databases contain a wealth of information about enzymes and other molecules that could be adapted for biotechnology. But these databases have grown so large in recent years that they’ve become difficult to search efficiently for enzymes of interest.

    Now, scientists at the McGovern Institute for Brain Research at MIT, the Broad Institute of MIT and Harvard, and the National Center for Biotechnology Information (NCBI) at the National Institutes of Health have developed a new search algorithm that has identified 188 kinds of rare, previously unknown CRISPR systems in bacterial genomes, encompassing thousands of individual systems. The work appears today in Science.

    The algorithm, which comes from the lab of pioneering CRISPR researcher Professor Feng Zhang, uses big-data clustering approaches to rapidly search massive amounts of genomic data. The team used their algorithm, called Fast Locality-Sensitive Hashing-based clustering (FLSHclust), to mine three major public databases that contain data from a wide range of unusual bacteria, including ones found in coal mines, breweries, Antarctic lakes, and dog saliva. The scientists found a surprising number and diversity of CRISPR systems, including ones that could make edits to DNA in human cells, others that can target RNA, and many with a variety of other functions.

    The new systems could potentially be harnessed to edit mammalian cells with fewer off-target effects than current Cas9 systems. They could also one day be used as diagnostics or serve as molecular records of activity inside cells.

    The researchers say their search highlights an unprecedented level of diversity and flexibility of CRISPR and that there are likely many more rare systems yet to be discovered as databases continue to grow.

    “Biodiversity is such a treasure trove, and as we continue to sequence more genomes and metagenomic samples, there is a growing need for better tools, like FLSHclust, to search that sequence space to find the molecular gems,” says Zhang, a co-senior author on the study and the James and Patricia Poitras Professor of Neuroscience at MIT with joint appointments in the departments of Brain and Cognitive Sciences and Biological Engineering. Zhang is also an investigator at the McGovern Institute for Brain Research at MIT, a core institute member at the Broad, and an investigator at the Howard Hughes Medical Institute. Eugene Koonin, a distinguished investigator at the NCBI, is co-senior author on the study as well.

    Searching for CRISPR

    CRISPR, which stands for clustered regularly interspaced short palindromic repeats, is a bacterial defense system that has been engineered into many tools for genome editing and diagnostics.

    To mine databases of protein and nucleic acid sequences for novel CRISPR systems, the researchers developed an algorithm based on an approach borrowed from the big data community. This technique, called locality-sensitive hashing, clusters together objects that are similar but not exactly identical. Using this approach allowed the team to probe billions of protein and DNA sequences — from the NCBI, its Whole Genome Shotgun database, and the Joint Genome Institute — in weeks, whereas previous methods that look for identical objects would have taken months. They designed their algorithm to look for genes associated with CRISPR.
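
    The general idea of locality-sensitive hashing can be sketched in a few lines. The example below is not FLSHclust itself; it is a generic MinHash-with-banding sketch in which the function names, the k-mer length, and the number of hashes and bands are all illustrative assumptions. Sequences that share most of their short subsequences tend to receive matching signature bands and therefore land in the same bucket, so only those candidates need a detailed comparison.

```python
import hashlib
from collections import defaultdict


def kmers(seq, k=3):
    """Return the set of overlapping length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def minhash_signature(seq, num_hashes=32, k=3):
    """MinHash signature: for each seeded hash, keep the smallest k-mer hash.

    Assumes len(seq) >= k; an empty k-mer set would make min() fail.
    """
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{km}".encode(), digest_size=8).digest(),
                "big",
            )
            for km in kmers(seq, k)
        ))
    return tuple(signature)


def lsh_candidate_groups(seqs, num_hashes=32, bands=16, k=3):
    """Bucket sequences by signature bands; return buckets with 2+ members.

    seqs maps a (hypothetical) sequence name to its sequence string.
    num_hashes must be divisible by bands.
    """
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for name, seq in seqs.items():
        sig = minhash_signature(seq, num_hashes, k)
        for b in range(bands):
            # Sequences agreeing on every row of a band share this bucket.
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return [members for members in buckets.values() if len(members) > 1]
```

    In this sketch, two identical sequences always share every bucket, while sequences with no k-mers in common almost never do. Each sequence is hashed once and no all-against-all comparison is made, which is the property that lets this family of methods scale to very large collections.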

    “This new algorithm allows us to parse through data in a time frame that’s short enough that we can actually recover results and make biological hypotheses,” says Soumya Kannan PhD ’23, who is a co-first author on the study. Kannan was a graduate student in Zhang’s lab when the study began and is currently a postdoc and Junior Fellow at Harvard University. Han Altae-Tran PhD ’23, a graduate student in Zhang’s lab during the study and currently a postdoc at the University of Washington, was the study’s other co-first author.

    “This is a testament to what you can do when you improve on the methods for exploration and use as much data as possible,” says Altae-Tran. “It’s really exciting to be able to improve the scale at which we search.”

    New systems

    In their analysis, Altae-Tran, Kannan, and their colleagues noticed that the thousands of CRISPR systems they found fell into a few existing and many new categories. They studied several of the new systems in greater detail in the lab.

    They found several new variants of known Type I CRISPR systems, which use a guide RNA that is 32 base pairs long rather than the 20-nucleotide guide of Cas9. Because of their longer guide RNAs, these Type I systems could potentially be used to develop more precise gene-editing technology that is less prone to off-target editing. Zhang’s team showed that two of these systems could make short edits in the DNA of human cells. And because these Type I systems are similar in size to CRISPR-Cas9, they could likely be delivered to cells in animals or humans using the same gene-delivery technologies being used today for CRISPR.

    One of the Type I systems also showed “collateral activity” — broad degradation of nucleic acids after the CRISPR protein binds its target. Scientists have used similar systems to make infectious disease diagnostics such as SHERLOCK, a tool capable of rapidly sensing a single molecule of DNA or RNA. Zhang’s team thinks the new systems could be adapted for diagnostic technologies as well.

    The researchers also uncovered new mechanisms of action for some Type IV CRISPR systems, and a Type VII system that precisely targets RNA, which could potentially be used in RNA editing. Other systems could potentially be used as recording tools — a molecular document of when a gene was expressed — or as sensors of specific activity in a living cell.

    Mining data

    The scientists say their algorithm could aid in the search for other biochemical systems. “This search algorithm could be used by anyone who wants to work with these large databases for studying how proteins evolve or discovering new genes,” Altae-Tran says.

    The researchers add that their findings illustrate not only how diverse CRISPR systems are, but also that most are rare and only found in unusual bacteria. “Some of these microbial systems were exclusively found in water from coal mines,” Kannan says. “If someone hadn’t been interested in that, we may never have seen those systems. Broadening our sampling diversity is really important to continue expanding the diversity of what we can discover.”

    This work was supported by the Howard Hughes Medical Institute; the K. Lisa Yang and Hock E. Tan Molecular Therapeutics Center at MIT; Broad Institute Programmable Therapeutics Gift Donors; The Pershing Square Foundation, William Ackman and Neri Oxman; James and Patricia Poitras; BT Charitable Foundation; Asness Family Foundation; Kenneth C. Griffin; the Phillips family; David Cheng; and Robert Metcalfe.

  • in

    Making genetic prediction models more inclusive

    While any two human genomes are about 99.9 percent identical, genetic variation in the remaining 0.1 percent plays an important role in shaping human diversity, including a person’s risk for developing certain diseases.

    Measuring the cumulative effect of these small genetic differences can provide an estimate of an individual’s genetic risk for a particular disease or their likelihood of having a particular trait. However, the majority of models used to generate these “polygenic scores” are based on studies done in people of European descent, and do not accurately gauge the risk for people of non-European ancestry or people whose genomes contain a mixture of chromosome regions inherited from previously isolated populations, also known as admixed ancestry.
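
    In its simplest textbook form, a polygenic score is a weighted sum: each variant's estimated effect size is multiplied by the number of copies of that variant a person carries (0, 1, or 2), and the products are added up. The sketch below illustrates only that arithmetic. It is not the MIT team's model, and the variant names and effect sizes are made-up placeholders.

```python
def polygenic_score(dosages, effect_sizes):
    """Sum of effect size x allele count over the variants in the model.

    dosages: variant id -> copies of the effect allele carried (0, 1 or 2).
    effect_sizes: variant id -> per-allele weight from an association study.
    Variants missing from dosages are treated as 0 copies (an assumption).
    """
    return sum(weight * dosages.get(variant, 0)
               for variant, weight in effect_sizes.items())


# Hypothetical weights and genotype, for illustration only.
effect_sizes = {"variant_a": 0.5, "variant_b": -0.25, "variant_c": 0.125}
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

print(polygenic_score(person, effect_sizes))  # 0.75
```

    Because the weights are estimated from a particular study population, a score built this way inherits that population's patterns of genetic variation.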

    In an effort to make these genetic scores more inclusive, MIT researchers have created a new model that takes into account genetic information from people from a wider diversity of genetic ancestries across the world. Using this model, they showed that they could increase the accuracy of genetics-based predictions for a variety of traits, especially for people from populations that have been traditionally underrepresented in genetic studies.

    “For people of African ancestry, our model proved to be about 60 percent more accurate on average,” says Manolis Kellis, a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and a member of the Broad Institute of MIT and Harvard. “For people of admixed genetic backgrounds more broadly, who have been excluded from most previous models, the accuracy of our model increased by an average of about 18 percent.”

    The researchers hope their more inclusive modeling approach could help improve health outcomes for a wider range of people and promote health equity by spreading the benefits of genomic sequencing more widely across the globe.

    “What we have done is created a method that allows you to be much more accurate for admixed and ancestry-diverse individuals, and ensure the results and the benefits of human genetics research are equally shared by everyone,” says MIT postdoc Yosuke Tanigawa, the lead and co-corresponding author of the paper, which appears today in open-access form in the American Journal of Human Genetics. The researchers have made all of their data publicly available for the broader scientific community to use.

    More inclusive models

    The work builds on the Human Genome Project, which mapped all of the genes found in the human genome, and on subsequent large-scale, cohort-based studies of how genetic variants in the human genome are linked to disease risk and other differences between individuals.

    These studies showed that the effect of any individual genetic variant on its own is typically very small. Together, these small effects add up and influence the risk of developing heart disease or diabetes, having a stroke, or being diagnosed with psychiatric disorders such as schizophrenia.

    “We have hundreds of thousands of genetic variants that are associated with complex traits, each of which is individually playing a weak effect, but together they are beginning to be predictive for disease predispositions,” Kellis says.

    However, most of these genome-wide association studies included few people of non-European descent, so polygenic risk models based on them translate poorly to non-European populations. People from different geographic areas can have different patterns of genetic variation, shaped by stochastic drift, population history, and environmental factors — for example, in people of African descent, genetic variants that protect against malaria are more common than in other populations. Those variants also affect other traits involving the immune system, such as counts of neutrophils, a type of immune cell. That variation would not be well-captured in a model based on genetic analysis of people of European ancestry alone.

    “If you are an individual of African descent, of Latin American descent, of Asian descent, then you are currently being left out by the system,” Kellis says. “This inequity in the utilization of genetic information for predicting risk of patients can cause unnecessary burden, unnecessary deaths, and unnecessary lack of prevention, and that’s where our work comes in.”

    Some researchers have begun trying to address these disparities by creating distinct models for people of European descent, of African descent, or of Asian descent. These emerging approaches assign individuals to distinct genetic ancestry groups, aggregate the data to create an association summary, and make genetic prediction models. However, these approaches still don’t represent people of admixed genetic backgrounds well.

    “Our approach builds on the previous work without requiring researchers to assign individuals or local genomic segments of individuals to predefined distinct genetic ancestry groups,” Tanigawa says. “Instead, we develop a single model for everybody by directly working on individuals across the continuum of their genetic ancestries.”

    In creating their new model, the MIT team used computational and statistical techniques that enabled them to study each individual’s unique genetic profile instead of grouping individuals by population. This methodological advancement allowed the researchers to include people of admixed ancestry, who made up nearly 10 percent of the UK Biobank dataset used for this study and currently account for about one in seven newborns in the United States.

    “Because we work at the individual level, there is no need for computing summary-level data for different populations,” Kellis says. “Thus, we did not need to exclude individuals of admixed ancestry, increasing our power by including more individuals and representing contributions from all populations in our combined model.”

    Better predictions

    To create their new model, the researchers used genetic data from more than 280,000 people, which was collected by UK Biobank, a large-scale biomedical database and research resource containing de-identified genetic, lifestyle, and health information from half a million U.K. participants. Using another set of about 81,000 held-out individuals from the UK Biobank, the researchers evaluated their model across 60 traits, which included traits related to body size and shape, such as height and body mass index, as well as blood traits such as white blood cell count and red blood cell count, which also have a genetic basis.

    The researchers found that, compared to models trained only on European-ancestry individuals, their model’s predictions are more accurate for all genetic ancestry groups. The most notable gain was for people of African ancestry, whose predictions improved by 61 percent on average, even though they made up only about 1.5 percent of samples in UK Biobank. The researchers also saw improvements of 11 percent for people of South Asian descent and 5 percent for white British people. Predictions for people of admixed ancestry improved by about 18 percent.

    “When you bring all the individuals together in the training set, everybody contributes to the training of the polygenic score modeling on equal footing,” Tanigawa says. “Combined with increasingly more inclusive data collection efforts, our method can help leverage these efforts to improve predictive accuracy for all.”

    The MIT team hopes its approach can eventually be incorporated into tests of an individual’s risk of a variety of diseases. Such tests could be combined with conventional risk factors and used to help doctors diagnose disease or to help people manage their risk for certain diseases before they develop.

    “Our work highlights the power of diversity, equity, and inclusion efforts in the context of genomics research,” Tanigawa says.

    The researchers now hope to add even more data to their model, including data from the United States, and to apply it to additional traits that they didn’t analyze in this study.

    “This is just the start,” Kellis says. “We can’t wait to see more people join our effort to propel inclusive human genetics research.”

    The research was funded by the National Institutes of Health.

  • in

    Making sense of cell fate

    Despite the proliferation of novel therapies such as immunotherapy or targeted therapies, radiation and chemotherapy remain the frontline treatment for cancer patients. About half of all patients still receive radiation and 60-80 percent receive chemotherapy.

    Both radiation and chemotherapy work by damaging DNA, taking advantage of a vulnerability specific to cancer cells. Healthy cells are more likely to survive radiation and chemotherapy since their mechanisms for identifying and repairing DNA damage are intact. In cancer cells, these repair mechanisms are compromised by mutations. When cancer cells cannot adequately respond to the DNA damage caused by radiation and chemotherapy, ideally, they undergo apoptosis or die by other means.

    However, there is another fate for cells after DNA damage: senescence — a state where cells survive, but stop dividing. Senescent cells’ DNA has not been damaged enough to induce apoptosis but is too damaged to support cell division. While senescent cancer cells themselves are unable to proliferate and spread, they are bad actors in the fight against cancer because they seem to enable other cancer cells to develop more aggressively. Although a cancer cell’s fate is not apparent until a few days after treatment, the decision to survive, die, or enter senescence is made much earlier. But, precisely when and how that decision is made has not been well understood.

    In an open-access study of ovarian cancer and osteosarcoma cells appearing July 19 in Cell Systems, MIT researchers show that cell signaling proteins commonly associated with cell proliferation and apoptosis instead commit cancer cells to senescence within 12 hours of treatment with low doses of certain kinds of chemotherapy.

    “When it comes to treating cancer, this study underscores that it’s important not to think too linearly about cell signaling,” says Michael Yaffe, who is a David H. Koch Professor of Science at MIT, the director of the MIT Center for Precision Cancer Medicine, a member of MIT’s Koch Institute for Integrative Cancer Research, and the senior author of the study. “If you assume that a particular treatment will always affect cancer cell signaling in the same way — you may be setting yourself up for many surprises, and treating cancers with the wrong combination of drugs.”

    Using a combination of experiments with cancer cells and computational modeling, the team investigated the cell signaling mechanisms that prompt cancer cells to enter senescence after treatment with a commonly used anti-cancer agent. Their efforts singled out two protein kinases and a component of the AP-1 transcription factor complex as highly associated with the induction of senescence after DNA damage, despite the well-established roles for all of these molecules in promoting cell proliferation in cancer.

    The researchers treated cancer cells with low and high doses of doxorubicin, a chemotherapy that interferes with the function of topoisomerase II, an enzyme that breaks and then repairs DNA strands during replication to fix tangles and other topological problems.

    By measuring the effects of DNA damage on single cells at several time points ranging from six hours to four days after the initial exposure, the team created two datasets. In one dataset, the researchers tracked cell fate over time. For the second set, researchers measured relative cell signaling activity levels across a variety of proteins associated with responses to DNA damage or cellular stress, determination of cell fate, and progress through cell growth and division.

    The two datasets were used to build a computational model that identifies correlations between time, dosage, signal, and cell fate. The model identified the activities of the MAP kinases Erk and JNK, and of the transcription factor c-Jun, a key component of the AP-1 complex, as involved in the induction of senescence. The researchers then validated these computational findings by showing that inhibition of JNK and Erk after DNA damage successfully prevented cells from entering senescence.

    The researchers leveraged JNK and Erk inhibition to pinpoint exactly when cells made the decision to enter senescence. Surprisingly, they found that the decision to enter senescence was made within 12 hours of DNA damage, even though it took days to actually see the senescent cells accumulate. The team also found that with the passage of more time, these MAP kinases took on a different function: promoting the secretion of proinflammatory proteins called cytokines that are responsible for making other cancer cells proliferate and develop resistance to chemotherapy.

    “Proteins like cytokines encourage ‘bad behavior’ in neighboring tumor cells that lead to more aggressive cancer progression,” says Tatiana Netterfield, a graduate student in the Yaffe lab and the lead author of the study. “Because of this, it is thought that senescent cells that stay near the tumor for long periods of time are detrimental to treating cancer.”

    This study’s findings apply to cancer cells treated with a commonly used type of chemotherapy that stalls DNA replication after repair. But more broadly, the study emphasizes that “when treating cancer, it’s extremely important to understand the molecular characteristics of cancer cells and the contextual factors such as time and dosing that determine cell fate,” explains Netterfield.

    The study, however, has more immediate implications for treatments that are already in use. One class of Erk inhibitors, MEK inhibitors, is used in the clinic with the expectation that it will curb cancer growth.

    “We must be cautious about administering MEK inhibitors together with chemotherapies,” says Yaffe. “The combination may have the unintended effect of driving cells into proliferation, rather than senescence.”

    In future work, the team will perform studies to understand how and why individual cells choose to proliferate instead of entering senescence. Additionally, the team is employing next-generation sequencing to understand which genes c-Jun is regulating in order to push cells toward senescence.

    This study was funded, in part, by the Charles and Marjorie Holloway Foundation and the MIT Center for Precision Cancer Medicine.

  • in

    New CRISPR-based map ties every human gene to its function

    The Human Genome Project was an ambitious initiative to sequence every piece of human DNA. The project drew together collaborators from research institutions around the world, including MIT’s Whitehead Institute for Biomedical Research, and was finally completed in 2003. Now, over two decades later, MIT Professor Jonathan Weissman and colleagues have gone beyond the sequence to present the first comprehensive functional map of genes that are expressed in human cells. The data from this project, published online June 9 in Cell, ties each gene to its job in the cell, and is the culmination of years of collaboration on the single-cell sequencing method Perturb-seq.

    The data are available for other scientists to use. “It’s a big resource in the way the human genome is a big resource, in that you can go in and do discovery-based research,” says Weissman, who is also a member of the Whitehead Institute and an investigator with the Howard Hughes Medical Institute. “Rather than defining ahead of time what biology you’re going to be looking at, you have this map of the genotype-phenotype relationships and you can go in and screen the database without having to do any experiments.”

    The screen allowed the researchers to delve into diverse biological questions. They used it to explore the cellular effects of genes with unknown functions, to investigate the response of mitochondria to stress, and to screen for genes that cause chromosomes to be lost or gained, a phenotype that has proved difficult to study in the past. “I think this dataset is going to enable all sorts of analyses that we haven’t even thought up yet by people who come from other parts of biology, and suddenly they just have this available to draw on,” says former Weissman Lab postdoc Tom Norman, a co-senior author of the paper.

    Pioneering Perturb-seq

    The project takes advantage of the Perturb-seq approach that makes it possible to follow the impact of turning on or off genes with unprecedented depth. This method was first published in 2016 by a group of researchers including Weissman and fellow MIT professor Aviv Regev, but could only be used on small sets of genes and at great expense.

    The massive Perturb-seq map was made possible by foundational work from Joseph Replogle, an MD-PhD student in Weissman’s lab and co-first author of the present paper. Replogle, in collaboration with Norman, who now leads a lab at Memorial Sloan Kettering Cancer Center; Britt Adamson, an assistant professor in the Department of Molecular Biology at Princeton University; and a group at 10x Genomics, set out to create a new version of Perturb-seq that could be scaled up. The researchers published a proof-of-concept paper in Nature Biotechnology in 2020. 

    The Perturb-seq method uses CRISPR-Cas9 genome editing to introduce genetic changes into cells, and then uses single-cell RNA sequencing to capture information about the RNAs that are expressed resulting from a given genetic change. Because RNAs control all aspects of how cells behave, this method can help decode the many cellular effects of genetic changes.

    Since their initial proof-of-concept paper, Weissman, Regev, and others have used this sequencing method on smaller scales. For example, the researchers used Perturb-seq in 2021 to explore how human and viral genes interact over the course of an infection with HCMV, a common herpesvirus.

    In the new study, Replogle and collaborators including Reuben Saunders, a graduate student in Weissman’s lab and co-first author of the paper, scaled up the method to the entire genome. Using human blood cancer cell lines as well as noncancerous cells derived from the retina, he performed Perturb-seq across more than 2.5 million cells, and used the data to build a comprehensive map tying genotypes to phenotypes.

    Delving into the data

    Upon completing the screen, the researchers decided to put their new dataset to use and examine a few biological questions. “The advantage of Perturb-seq is it lets you get a big dataset in an unbiased way,” says Tom Norman. “No one knows entirely what the limits are of what you can get out of that kind of dataset. Now, the question is, what do you actually do with it?”

    The first, most obvious application was to look into genes with unknown functions. Because the screen also read out phenotypes of many known genes, the researchers could use the data to compare unknown genes to known ones and look for similar transcriptional outcomes, which could suggest the gene products worked together as part of a larger complex.

    The mutation of one gene called C7orf26 in particular stood out. Researchers noticed that genes whose removal led to a similar phenotype were part of a protein complex called Integrator that played a role in creating small nuclear RNAs. The Integrator complex is made up of many smaller subunits — previous studies had suggested 14 individual proteins — and the researchers were able to confirm that C7orf26 made up a 15th component of the complex.

    They also discovered that the 15 subunits worked together in smaller modules to perform specific functions within the Integrator complex. “Absent this thousand-foot-high view of the situation, it was not so clear that these different modules were so functionally distinct,” says Saunders.

    Another perk of Perturb-seq is that because the assay focuses on single cells, the researchers could use the data to look at more complex phenotypes that become muddied when they are studied together with data from other cells. “We often take all the cells where ‘gene X’ is knocked down and average them together to look at how they changed,” Weissman says. “But sometimes when you knock down a gene, different cells that are losing that same gene behave differently, and that behavior may be missed by the average.”
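
    The point about averaging can be made concrete with a toy calculation. The numbers below are invented for illustration and do not come from the study; they stand for the change in one gene's expression in individual cells after the same knockdown. Two populations can have the same average change while behaving very differently cell to cell, which is visible only when the per-cell spread is kept.

```python
from statistics import mean, pstdev


def summarize(per_cell_changes):
    """Return the average change and the cell-to-cell spread."""
    return {"mean": mean(per_cell_changes), "spread": pstdev(per_cell_changes)}


# Hypothetical per-cell expression changes after knocking down "gene X".
uniform_response = [0.0, 0.0, 0.0, 0.0]   # every cell is unaffected
mixed_response = [2.0, 2.0, -2.0, -2.0]   # half go up, half go down

print(summarize(uniform_response))
print(summarize(mixed_response))
```

    Both lists average to zero, so a bulk measurement would report no effect in either case; the spread of the second list is what reveals that individual cells responded in opposite directions.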

    The researchers found that a subset of genes whose removal led to different outcomes from cell to cell were responsible for chromosome segregation. Their removal was causing cells to lose a chromosome or pick up an extra one, a condition known as aneuploidy. “You couldn’t predict what the transcriptional response to losing this gene was because it depended on the secondary effect of what chromosome you gained or lost,” Weissman says. “We realized we could then turn this around and create this composite phenotype looking for signatures of chromosomes being gained and lost. In this way, we’ve done the first genome-wide screen for factors that are required for the correct segregation of DNA.”

    “I think the aneuploidy study is the most interesting application of this data so far,” Norman says. “It captures a phenotype that you can only get using a single-cell readout. You can’t go after it any other way.”

    The researchers also used their dataset to study how mitochondria responded to stress. Mitochondria, which evolved from free-living bacteria, carry 13 protein-coding genes in their genomes. Within the nuclear DNA, around 1,000 genes are somehow related to mitochondrial function. “People have been interested for a long time in how nuclear and mitochondrial DNA are coordinated and regulated in different cellular conditions, especially when a cell is stressed,” Replogle says.

    The researchers found that when they perturbed different mitochondria-related genes, the nuclear genome responded similarly to many different genetic changes. However, the mitochondrial genome responses were much more variable. 

    “There’s still an open question of why mitochondria still have their own DNA,” says Replogle. “A big-picture takeaway from our work is that one benefit of having a separate mitochondrial genome might be having localized or very specific genetic regulation in response to different stressors.”

    “If you have one mitochondria that’s broken, and another one that is broken in a different way, those mitochondria could be responding differentially,” Weissman says.

    In the future, the researchers hope to use Perturb-seq on different types of cells besides the cancer cell line they started in. They also hope to continue to explore their map of gene functions, and hope others will do the same. “This really is the culmination of many years of work by the authors and other collaborators, and I’m really pleased to see it continue to succeed and expand,” says Norman.

  • in

    An “oracle” for predicting the evolution of gene regulation

    Despite the sheer number of genes that each human cell contains, these so-called “coding” DNA sequences comprise just 1 percent of our entire genome. The remaining 99 percent is made up of “non-coding” DNA — which, unlike coding DNA, does not carry the instructions to build proteins.

    One vital function of this non-coding DNA, also called “regulatory” DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these non-coding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as Type 2 diabetes, or more life-threatening ones, including cancer.

    To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer datasets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.

    In a new study published on March 9 in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these non-coding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of non-coding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.

    “We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” says Aviv Regev, a professor of biology at MIT (on leave), core member of the Broad Institute of MIT and Harvard (on leave), head of Genentech Research and Early Development, and the study’s senior author. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology.”
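    To make the first of those questions concrete, here is a minimal Python sketch of what "trying all possible mutations" of a sequence can look like when only single-base substitutions are considered. It is an illustration, not the study's code: the function names are hypothetical, and `toy_expression_model` is a stand-in that simply returns the GC fraction where a trained neural network would return predicted expression.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def single_base_mutants(seq: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, new_base, mutated_sequence) for every single-base substitution."""
    for i, ref in enumerate(seq):
        for alt in BASES:
            if alt != ref:
                yield i, alt, seq[:i] + alt + seq[i + 1:]


def toy_expression_model(seq: str) -> float:
    """Hypothetical stand-in for a trained model: returns the GC fraction of the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def rank_mutants(
    seq: str, model: Callable[[str], float]
) -> List[Tuple[float, int, str]]:
    """Score every single-base mutant and sort by predicted change, largest increase first."""
    baseline = model(seq)
    scored = [
        (model(mutant) - baseline, pos, alt)
        for pos, alt, mutant in single_base_mutants(seq)
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # A sequence of length n has 3 * n single-base mutants.
    for delta, pos, alt in rank_mutants("ACGTTGCA", toy_expression_model)[:3]:
        print(f"position {pos} -> {alt}: predicted change {delta:+.3f}")
```

    Any callable that maps a sequence to a number can be passed in place of the toy model, which is the sense in which a trained predictor can be queried like an oracle.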

    Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.

    To accomplish this goal, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author; Carl de Boer, now an assistant professor at the University of British Columbia; and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random non-coding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.

    “This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” Regev says. “In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules.”

    Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. “Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” Vaishnav explains.

    First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population dataset from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.

    But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of non-coding sequences even without such a comprehensive population dataset. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any non-coding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.

    “One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences,” Vaishnav explains. “I really wanted to find a way to fill that gap, and contribute to the long-standing vision of creating a complete fitness landscape.”

    Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, says the study shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

    Although the model was trained on just a fraction of yeast regulatory DNA in a few growth conditions, Taylor says he is impressed that it can make such useful predictions about the evolution of gene regulation in mammals.

    “There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” he explains. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse datasets.”

    Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise non-coding DNA sequences for use in gene therapies.

    “People have been studying regulatory evolution and fitness landscapes for decades now,” Vaishnav says. “I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications.”

  • in

    Probing how proteins pair up inside cells

    Despite its minute size, a single cell contains billions of molecules that bustle around and bind to one another, carrying out vital functions. The human genome encodes about 20,000 proteins, most of which interact with partner proteins to mediate upwards of 400,000 distinct interactions. These partners don’t just latch onto one another haphazardly; they only bind to very specific companions that they must recognize inside the crowded cell. If they create the wrong pairings — or even the right pairings at the wrong place or wrong time — cancer or other diseases can ensue. Scientists are hard at work investigating these protein-protein relationships, in order to understand how they work, and potentially create drugs that disrupt or mimic them to treat disease.

    The average human protein is composed of approximately 400 building blocks called amino acids, which are strung together and folded into a complex 3D structure. Within this long string of building blocks, some proteins contain stretches of four to six amino acids called short linear motifs (SLiMs), which mediate protein-protein interactions. Despite their simplicity and small size, SLiMs and their binding partners facilitate key cellular processes. However, it’s been historically difficult to devise experiments to probe how SLiMs recognize their specific binding partners.

    To address this problem, a group led by Theresa Hwang PhD ’21 designed a screening method to understand how SLiMs selectively bind to certain proteins, and even distinguish between those with similar structures. Using the detailed information they gleaned from studying these interactions, the researchers created their own synthetic molecule capable of binding extremely tightly to a protein called ENAH, which is implicated in cancer metastasis. The team shared their findings in a pair of eLife studies, one published on Dec. 2, 2021, and the other published Jan. 25.

    “The ability to test hundreds of thousands of potential SLiMs for binding provides a powerful tool to explore why proteins prefer specific SLiM partners over others,” says Amy Keating, professor of biology and biological engineering and the senior author on both studies. “As we gain an understanding of the tricks that a protein uses to select its partners, we can apply these in protein design to make our own binders to modulate protein function for research or therapeutic purposes.”

    Most existing screens for SLiMs simply select for short, tight binders, while neglecting SLiMs that don’t grip their partner proteins quite as strongly. To survey SLiMs with a wide range of binding affinities, Keating, Hwang, and their colleagues developed their own screen called MassTitr.

    The researchers also suspected that the amino acids on either side of the SLiM’s core four-to-six amino acid sequence might play an underappreciated role in binding. To test their theory, they used MassTitr to screen the human proteome in longer chunks composed of 36 amino acids, in order to see which “extended” SLiMs would associate with the protein ENAH.
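    As a rough illustration of what screening a proteome "in chunks" involves computationally, the Python sketch below splits one protein sequence into 36-residue windows. Only the 36-amino-acid length comes from the article; the 12-residue step between windows, the function name, and the example sequence are assumptions made for this sketch, and the actual library design used with MassTitr may differ.

```python
from typing import List, Tuple


def tile_protein(sequence: str, window: int = 36, step: int = 12) -> List[Tuple[int, str]]:
    """Split a protein sequence into fixed-length, overlapping windows.

    Returns (start_index, peptide) pairs. The step size is an assumed
    value chosen for illustration, not a parameter taken from the study.
    """
    if len(sequence) <= window:
        return [(0, sequence)]
    last_start = len(sequence) - window
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # add a final window so the C-terminus is covered
    return [(s, sequence[s:s + window]) for s in starts]


if __name__ == "__main__":
    # Made-up 400-residue sequence, roughly the average human protein length.
    protein = "ACDEFGHIKL" * 40
    tiles = tile_protein(protein)
    print(len(tiles), "windows; first:", tiles[0][1])
```

    Overlapping windows are used here so that a short motif and its flanking residues are likely to land intact in at least one chunk, which matters when the flanks themselves are the object of study.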

    ENAH, sometimes referred to as Mena, helps cells to move. This ability to migrate is critical for healthy cells, but cancer cells can co-opt it to spread. Scientists have found that reducing the amount of ENAH decreases the cancer cell’s ability to invade other tissues — suggesting that formulating drugs to disrupt this protein and its interactions could treat cancer.

    Thanks to MassTitr, the team identified 33 SLiM-containing proteins that bound to ENAH — 19 of which are potentially novel binding partners. They also discovered three distinct patterns of amino acids flanking core SLiM sequences that helped the SLiMs bind even tighter to ENAH. Of these extended SLiMs, one found in a protein called PCARE bound to ENAH with the highest known affinity of any SLiM to date.

    Next, the researchers combined a computer program called dTERMen with X-ray crystallography in order to understand how and why PCARE binds to ENAH over ENAH’s two nearly identical sister proteins (VASP and EVL). Hwang and her colleagues saw that the amino acids flanking PCARE’s core SLiM caused ENAH to change shape slightly when the two made contact, allowing the binding sites to latch onto one another. VASP and EVL, by contrast, could not undergo this structural change, so the PCARE SLiM did not bind to either of them as tightly.

    Inspired by this unique interaction, Hwang designed her own protein that bound to ENAH with unprecedented affinity and specificity. “It was exciting that we were able to come up with such a specific binder,” she says. “This work lays the foundation for designing synthetic molecules with the potential to disrupt protein-protein interactions that cause disease — or to help scientists learn more about ENAH and other SLiM-binding proteins.”  

    Ylva Ivarsson, a professor of biochemistry at Uppsala University who was not involved with the study, says that understanding how proteins find their binding partners is a question of fundamental importance to cell function and regulation. The two eLife studies, she explains, show that extended SLiMs play an underappreciated role in determining the affinity and specificity of these binding interactions.

    “The studies shed light on the idea that context matters, and provide a screening strategy for a variety of context-dependent binding interactions,” she says. “Hwang and co-authors have created valuable tools for dissecting the cellular function of proteins and their binding partners. Their approach could even inspire ENAH-specific inhibitors for therapeutic purposes.”

    Hwang’s biggest takeaway from the project is that things are not always as they seem: even short, simple protein segments can play complex roles in the cell. As she puts it: “We should really appreciate SLiMs more.”