More stories

  • Making genetic prediction models more inclusive

    While any two human genomes are about 99.9 percent identical, genetic variation in the remaining 0.1 percent plays an important role in shaping human diversity, including a person’s risk for developing certain diseases.

    Measuring the cumulative effect of these small genetic differences can provide an estimate of an individual’s genetic risk for a particular disease or their likelihood of having a particular trait. However, the majority of models used to generate these “polygenic scores” are based on studies done in people of European descent, and do not accurately gauge the risk for people of non-European ancestry or people whose genomes contain a mixture of chromosome regions inherited from previously isolated populations, also known as admixed ancestry.

    In an effort to make these genetic scores more inclusive, MIT researchers have created a new model that takes into account genetic information from people from a wider diversity of genetic ancestries across the world. Using this model, they showed that they could increase the accuracy of genetics-based predictions for a variety of traits, especially for people from populations that have been traditionally underrepresented in genetic studies.

    “For people of African ancestry, our model proved to be about 60 percent more accurate on average,” says Manolis Kellis, a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and a member of the Broad Institute of MIT and Harvard. “For people of admixed genetic backgrounds more broadly, who have been excluded from most previous models, the accuracy of our model increased by an average of about 18 percent.”

    The researchers hope their more inclusive modeling approach could help improve health outcomes for a wider range of people and promote health equity by spreading the benefits of genomic sequencing more widely across the globe.

    “What we have done is created a method that allows you to be much more accurate for admixed and ancestry-diverse individuals, and ensure the results and the benefits of human genetics research are equally shared by everyone,” says MIT postdoc Yosuke Tanigawa, the lead and co-corresponding author of the paper, which appears today in open-access form in the American Journal of Human Genetics. The researchers have made all of their data publicly available for the broader scientific community to use.

    More inclusive models

    The work builds on the Human Genome Project, which mapped all of the genes found in the human genome, and on subsequent large-scale, cohort-based studies of how genetic variants in the human genome are linked to disease risk and other differences between individuals.

    These studies showed that the effect of any individual genetic variant on its own is typically very small. Together, these small effects add up and influence the risk of developing heart disease or diabetes, having a stroke, or being diagnosed with psychiatric disorders such as schizophrenia.

    “We have hundreds of thousands of genetic variants that are associated with complex traits, each of which is individually playing a weak effect, but together they are beginning to be predictive for disease predispositions,” Kellis says.
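
    The additive idea Kellis describes can be sketched in a few lines of Python. This is only an illustration of how a polygenic score sums many weak effects, not the MIT team’s model: the function name `polygenic_score`, the four effect sizes, and the example genotype are hypothetical values chosen for readability.

```python
def polygenic_score(dosages, effect_sizes):
    """Sum each variant's allele dosage (0, 1, or 2 copies) times its effect size."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("need exactly one effect size per variant")
    return sum(d * b for d, b in zip(dosages, effect_sizes))


# Hypothetical per-variant effect sizes; real models use far more variants.
effect_sizes = [0.02, -0.01, 0.03, 0.005]

# Hypothetical individual: number of copies of the effect allele at each variant.
person = [2, 0, 1, 1]

score = polygenic_score(person, effect_sizes)
print(f"polygenic score: {score:.3f}")
```

    No single term dominates the sum, which is why the choice of effect sizes, and the population they were estimated in, matters so much for the final score.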

    However, most of these genome-wide association studies included few people of non-European descent, so polygenic risk models based on them translate poorly to non-European populations. People from different geographic areas can have different patterns of genetic variation, shaped by stochastic drift, population history, and environmental factors — for example, in people of African descent, genetic variants that protect against malaria are more common than in other populations. Those variants also affect other traits involving the immune system, such as counts of neutrophils, a type of immune cell. That variation would not be well-captured in a model based on genetic analysis of people of European ancestry alone.

    “If you are an individual of African descent, of Latin American descent, of Asian descent, then you are currently being left out by the system,” Kellis says. “This inequity in the utilization of genetic information for predicting risk of patients can cause unnecessary burden, unnecessary deaths, and unnecessary lack of prevention, and that’s where our work comes in.”

    Some researchers have begun trying to address these disparities by creating distinct models for people of European descent, of African descent, or of Asian descent. These emerging approaches assign individuals to distinct genetic ancestry groups, aggregate the data to create an association summary, and make genetic prediction models. However, these approaches still don’t represent people of admixed genetic backgrounds well.

    “Our approach builds on the previous work without requiring researchers to assign individuals or local genomic segments of individuals to predefined distinct genetic ancestry groups,” Tanigawa says. “Instead, we develop a single model for everybody by directly working on individuals across the continuum of their genetic ancestries.”

    In creating their new model, the MIT team used computational and statistical techniques that enabled them to study each individual’s unique genetic profile instead of grouping individuals by population. This methodological advancement allowed the researchers to include people of admixed ancestry, who made up nearly 10 percent of the UK Biobank dataset used for this study and currently account for about one in seven newborns in the United States.

    “Because we work at the individual level, there is no need for computing summary-level data for different populations,” Kellis says. “Thus, we did not need to exclude individuals of admixed ancestry, increasing our power by including more individuals and representing contributions from all populations in our combined model.”

    Better predictions

    To create their new model, the researchers used genetic data from more than 280,000 people, which was collected by UK Biobank, a large-scale biomedical database and research resource containing de-identified genetic, lifestyle, and health information from half a million U.K. participants. Using another set of about 81,000 held-out individuals from the UK Biobank, the researchers evaluated their model across 60 traits, which included traits related to body size and shape, such as height and body mass index, as well as blood traits such as white blood cell count and red blood cell count, which also have a genetic basis.

    The researchers found that, compared to models trained only on European-ancestry individuals, their model’s predictions are more accurate for all genetic ancestry groups. The most notable gain was for people of African ancestry, whose predictions improved by 61 percent on average, even though they made up only about 1.5 percent of samples in UK Biobank. The researchers also saw improvements of 11 percent for people of South Asian descent and 5 percent for white British people. Predictions for people of admixed ancestry improved by about 18 percent.

    “When you bring all the individuals together in the training set, everybody contributes to the training of the polygenic score modeling on equal footing,” Tanigawa says. “Combined with increasingly more inclusive data collection efforts, our method can help leverage these efforts to improve predictive accuracy for all.”

    The MIT team hopes its approach can eventually be incorporated into tests of an individual’s risk of a variety of diseases. Such tests could be combined with conventional risk factors and used to help doctors diagnose disease or to help people manage their risk for certain diseases before they develop.

    “Our work highlights the power of diversity, equity, and inclusion efforts in the context of genomics research,” Tanigawa says.

    The researchers now hope to add even more data to their model, including data from the United States, and to apply it to additional traits that they didn’t analyze in this study.

    “This is just the start,” Kellis says. “We can’t wait to see more people join our effort to propel inclusive human genetics research.”

    The research was funded by the National Institutes of Health.

  • How an archeological approach can help leverage biased data in AI to improve medicine

    The classic computer science adage “garbage in, garbage out” lacks nuance when it comes to understanding biased medical data, argue computer science and bioethics professors from MIT, Johns Hopkins University, and the Alan Turing Institute in a new opinion piece published in a recent edition of the New England Journal of Medicine (NEJM). The rising popularity of artificial intelligence has brought increased scrutiny to the matter of biased AI models resulting in algorithmic discrimination, which the White House Office of Science and Technology Policy identified as a key issue in its recent Blueprint for an AI Bill of Rights.

    When encountering biased data, particularly for AI models used in medical settings, the typical response is to either collect more data from underrepresented groups or generate synthetic data making up for missing parts to ensure that the model performs equally well across an array of patient populations. But the authors argue that this technical approach should be augmented with a sociotechnical perspective that takes both historical and current social factors into account. By doing so, researchers can be more effective in addressing bias in public health. 

    “The three of us had been discussing the ways in which we often treat issues with data from a machine learning perspective as irritations that need to be managed with a technical solution,” recalls co-author Marzyeh Ghassemi, an assistant professor in electrical engineering and computer science and an affiliate of the Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic), the Computer Science and Artificial Intelligence Laboratory (CSAIL), and Institute of Medical Engineering and Science (IMES). “We had used analogies of data as an artifact that gives a partial view of past practices, or a cracked mirror holding up a reflection. In both cases the information is perhaps not entirely accurate or favorable: Maybe we think that we behave in certain ways as a society — but when you actually look at the data, it tells a different story. We might not like what that story is, but once you unearth an understanding of the past you can move forward and take steps to address poor practices.” 

    Data as artifact 

    In the paper, titled “Considering Biased Data as Informative Artifacts in AI-Assisted Health Care,” Ghassemi, Kadija Ferryman, and Maxine Mackintosh make the case for viewing biased clinical data as “artifacts” in the same way anthropologists or archeologists would view physical objects: pieces of civilization that reveal practices, belief systems, and cultural values. In the case of the paper, these are specifically the practices and values that have led to existing inequities in the health care system.

    For example, a 2019 study showed that an algorithm widely considered to be an industry standard used health-care expenditures as an indicator of need, leading to the erroneous conclusion that sicker Black patients require the same level of care as healthier white patients. What researchers found was algorithmic discrimination failing to account for unequal access to care.  

    In this instance, rather than viewing biased datasets or lack of data as problems that only require disposal or fixing, Ghassemi and her colleagues recommend the “artifacts” approach as a way to raise awareness around social and historical elements influencing how data are collected and alternative approaches to clinical AI development. 

    “If the goal of your model is deployment in a clinical setting, you should engage a bioethicist or a clinician with appropriate training reasonably early on in problem formulation,” says Ghassemi. “As computer scientists, we often don’t have a complete picture of the different social and historical factors that have gone into creating data that we’ll be using. We need expertise in discerning when models generalized from existing data may not work well for specific subgroups.” 

    When more data can actually harm performance 

    The authors acknowledge that one of the more challenging aspects of implementing an artifact-based approach is assessing whether data have been racially corrected, i.e., adjusted using white, male bodies as the conventional standard against which other bodies are measured. The opinion piece cites an example from the Chronic Kidney Disease Epidemiology Collaboration, which in 2021 developed a new equation to measure kidney function because the old equation had been “corrected” under the blanket assumption that Black people have higher muscle mass. Ghassemi says that researchers should be prepared to investigate race-based correction as part of the research process.

    In another recent paper, accepted to this year’s International Conference on Machine Learning and co-authored by Ghassemi’s PhD student Vinith Suriyakumar and University of California at San Diego Assistant Professor Berk Ustun, the researchers found that assuming the inclusion of personalized attributes like self-reported race improves the performance of ML models can actually lead to worse risk scores, models, and metrics for minority and minoritized populations.

    “There’s no single right solution for whether or not to include self-reported race in a clinical risk score. Self-reported race is a social construct that is both a proxy for other information, and deeply proxied itself in other medical data. The solution needs to fit the evidence,” explains Ghassemi. 

    How to move forward 

    This is not to say that biased datasets should be enshrined, or biased algorithms don’t require fixing — quality training data is still key to developing safe, high-performance clinical AI models, and the NEJM piece highlights the role of the National Institutes of Health (NIH) in driving ethical practices.  

    “Generating high-quality, ethically sourced datasets is crucial for enabling the use of next-generation AI technologies that transform how we do research,” NIH acting director Lawrence Tabak stated in a press release when the NIH announced its $130 million Bridge2AI Program last year. Ghassemi agrees, pointing out that the NIH has “prioritized data collection in ethical ways that cover information we have not previously emphasized the value of in human health — such as environmental factors and social determinants. I’m very excited about their prioritization of, and strong investments towards, achieving meaningful health outcomes.” 

    Elaine Nsoesie, an associate professor at the Boston University School of Public Health, believes there are many potential benefits to treating biased datasets as artifacts rather than garbage, starting with the focus on context. “Biases present in a dataset collected for lung cancer patients in a hospital in Uganda might be different from a dataset collected in the U.S. for the same patient population,” she explains. “In considering local context, we can train algorithms to better serve specific populations.” Nsoesie says that understanding the historical and contemporary factors shaping a dataset can make it easier to identify discriminatory practices that might be coded in algorithms or systems in ways that are not immediately obvious. She also notes that an artifact-based approach could lead to the development of new policies and structures ensuring that the root causes of bias in a particular dataset are eliminated.

    “People often tell me that they are very afraid of AI, especially in health. They’ll say, ‘I’m really scared of an AI misdiagnosing me,’ or ‘I’m concerned it will treat me poorly,’” Ghassemi says. “I tell them, you shouldn’t be scared of some hypothetical AI in health tomorrow, you should be scared of what health is right now. If we take a narrow technical view of the data we extract from systems, we could naively replicate poor practices. That’s not the only option — realizing there is a problem is our first step towards a larger opportunity.”

  • A faster way to preserve privacy online

    Searching the internet can reveal information a user would rather keep private. For instance, when someone looks up medical symptoms online, they could reveal their health conditions to Google, an online medical database like WebMD, and perhaps hundreds of these companies’ advertisers and business partners.

    For decades, researchers have been crafting techniques that enable users to search for and retrieve information from a database privately, but these methods remain too slow to be effectively used in practice.

    MIT researchers have now developed a scheme for private information retrieval that is about 30 times faster than other comparable methods. Their technique enables a user to search an online database without revealing their query to the server. Moreover, it is driven by a simple algorithm that would be easier to implement than the more complicated approaches from previous work.

    Their technique could enable private communication by preventing a messaging app from knowing what users are saying or who they are talking to. It could also be used to fetch relevant online ads without advertising servers learning a user’s interests.

    “This work is really about giving users back some control over their own data. In the long run, we’d like browsing the web to be as private as browsing a library. This work doesn’t achieve that yet, but it starts building the tools to let us do this sort of thing quickly and efficiently in practice,” says Alexandra Henzinger, a computer science graduate student and lead author of a paper introducing the technique.

    Co-authors include Matthew Hong, an MIT computer science graduate student; Henry Corrigan-Gibbs, the Douglas Ross Career Development Professor of Software Technology in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); Sarah Meiklejohn, a professor in cryptography and security at University College London and a staff research scientist at Google; and senior author Vinod Vaikuntanathan, an EECS professor and principal investigator in CSAIL. The research will be presented at the 2023 USENIX Security Symposium. 

    Preserving privacy

    The first schemes for private information retrieval were developed in the 1990s, partly by researchers at MIT. These techniques enable a user to communicate with a remote server that holds a database, and read records from that database without the server knowing what the user is reading.

    To preserve privacy, these techniques force the server to touch every single item in the database, so it can’t tell which entry a user is searching for. If one area is left untouched, the server would learn that the client is not interested in that item. But touching every item when there may be millions of database entries slows down the query process.

    To speed things up, the MIT researchers developed a protocol, known as Simple PIR, in which the server performs much of the underlying cryptographic work in advance, before a client even sends a query. This preprocessing step produces a data structure that holds compressed information about the database contents, and which the client downloads before sending a query.

    In a sense, this data structure is like a hint for the client about what is in the database.

    “Once the client has this hint, it can make an unbounded number of queries, and these queries are going to be much smaller in both the size of the messages you are sending and the work that you need the server to do. This is what makes Simple PIR so much faster,” Henzinger explains.

    But the hint can be relatively large in size. For example, to query a 1-gigabyte database, the client would need to download a 124-megabyte hint. This drives up communication costs, which could make the technique difficult to implement on real-world devices.

    To reduce the size of the hint, the researchers developed a second technique, known as Double PIR, that basically involves running the Simple PIR scheme twice. This produces a much more compact hint that is fixed in size for any database.

    Using Double PIR, the hint for a 1-gigabyte database would only be 16 megabytes.

    “Our Double PIR scheme runs a little bit slower, but it will have much lower communication costs. For some applications, this is going to be a desirable tradeoff,” Henzinger says.

    Hitting the speed limit

    They tested the Simple PIR and Double PIR schemes by applying them to a task in which a client seeks to audit a specific piece of information about a website to ensure that website is safe to visit. To preserve privacy, the client cannot reveal the website it is auditing.

    The researchers’ fastest technique was able to successfully preserve privacy while running at about 10 gigabytes per second. Previous schemes could only achieve a throughput of about 300 megabytes per second.

    They show that their method approaches the theoretical speed limit for private information retrieval — it is nearly the fastest possible scheme one can build in which the server touches every record in the database, adds Corrigan-Gibbs.

    In addition, their method only requires a single server, making it much simpler than many top-performing techniques that require two separate servers with identical databases. Their method outperformed these more complex protocols.

    “I’ve been thinking about these schemes for some time, and I never thought this could be possible at this speed. The folklore was that any single-server scheme is going to be really slow. This work turns that whole notion on its head,” Corrigan-Gibbs says.

    While the researchers have shown that they can make PIR schemes much faster, there is still work to do before they would be able to deploy their techniques in real-world scenarios, says Henzinger. They would like to cut the communication costs of their schemes while still enabling them to achieve high speeds. In addition, they want to adapt their techniques to handle more complex queries, such as general SQL queries, and more demanding applications, such as a general Wikipedia search. And in the long run, they hope to develop better techniques that can preserve privacy without requiring a server to touch every database item. 

    “I’ve heard people emphatically claiming that PIR will never be practical. But I would never bet against technology. That is an optimistic lesson to learn from this work. There are always ways to innovate,” Vaikuntanathan says.

    “This work makes a major improvement to the practical cost of private information retrieval. While it was known that low-bandwidth PIR schemes imply public-key cryptography, which is typically orders of magnitude slower than private-key cryptography, this work develops an ingenious method to bridge the gap. This is done by making a clever use of special properties of a public-key encryption scheme due to Regev to push the vast majority of the computational work to a precomputation step, in which the server computes a short ‘hint’ about the database,” says Yuval Ishai, a professor of computer science at Technion (the Israel Institute of Technology), who was not involved in the study. “What makes their approach particularly appealing is that the same hint can be used an unlimited number of times, by any number of clients. This renders the (moderate) cost of computing the hint insignificant in a typical scenario where the same database is accessed many times.”

    This work is funded, in part, by the National Science Foundation, Google, Facebook, MIT’s Fintech@CSAIL Initiative, an NSF Graduate Research Fellowship, an EECS Great Educators Fellowship, the National Institutes of Health, the Defense Advanced Research Projects Agency, the MIT-IBM Watson AI Lab, Analog Devices, Microsoft, and a Thornton Family Faculty Research Innovation Fellowship.

  • Study finds the risks of sharing health care data are low

    In recent years, scientists have made great strides in their ability to develop artificial intelligence algorithms that can analyze patient data and come up with new ways to diagnose disease or predict which treatments work best for different patients.

    The success of those algorithms depends on access to patient health data that has been stripped of personal information that could be used to identify individuals in the dataset. However, the possibility that individuals could be identified through other means has raised concerns among privacy advocates.

    In a new study, a team of researchers led by MIT Principal Research Scientist Leo Anthony Celi has quantified the potential risk of this kind of patient re-identification and found that it is currently extremely low relative to the risk of data breach. In fact, between 2016 and 2021, the period examined in the study, there were no reports of patient re-identification through publicly available health data.

    The findings suggest that the potential risk to patient privacy is greatly outweighed by the gains for patients, who benefit from better diagnosis and treatment, says Celi. He hopes that in the near future, these datasets will become more widely available and include a more diverse group of patients.

    “We agree that there is some risk to patient privacy, but there is also a risk of not sharing data,” he says. “There is harm when data is not shared, and that needs to be factored into the equation.”

    Celi, who is also an instructor at the Harvard T.H. Chan School of Public Health and an attending physician with the Division of Pulmonary, Critical Care and Sleep Medicine at the Beth Israel Deaconess Medical Center, is the senior author of the new study. Kenneth Seastedt, a thoracic surgery fellow at Beth Israel Deaconess Medical Center, is the lead author of the paper, which appears today in PLOS Digital Health.

    Risk-benefit analysis

    Large health record databases created by hospitals and other institutions contain a wealth of information on diseases such as heart disease, cancer, macular degeneration, and Covid-19, which researchers use to try to discover new ways to diagnose and treat disease.

    Celi and others at MIT’s Laboratory for Computational Physiology have created several publicly available databases, including the Medical Information Mart for Intensive Care (MIMIC), which they recently used to develop algorithms that can help doctors make better medical decisions. Many other research groups have also used the data, and others have created similar databases in countries around the world.

    Typically, when patient data is entered into this kind of database, certain types of identifying information are removed, including patients’ names, addresses, and phone numbers. This is intended to prevent patients from being re-identified and having information about their medical conditions made public.

    However, concerns about privacy have slowed the development of more publicly available databases with this kind of information, Celi says. In the new study, he and his colleagues set out to ask what the actual risk of patient re-identification is. First, they searched PubMed, a database of scientific papers, for any reports of patient re-identification from publicly available health data, but found none.

    To expand the search, the researchers then examined media reports from September 2016 to September 2021, using Media Cloud, an open-source global news database and analysis tool. In a search of more than 10,000 U.S. media publications during that time, they did not find a single instance of patient re-identification from publicly available health data.

    In contrast, they found that during the same time period, health records of nearly 100 million people were stolen through data breaches of information that was supposed to be securely stored.

    “Of course, it’s good to be concerned about patient privacy and the risk of re-identification, but that risk, although it’s not zero, is minuscule compared to the issue of cyber security,” Celi says.

    Better representation

    More widespread sharing of de-identified health data is necessary, Celi says, to help expand the representation of minority groups in the United States, who have traditionally been underrepresented in medical studies. He is also working to encourage the development of more such databases in low- and middle-income countries.

    “We cannot move forward with AI unless we address the biases that lurk in our datasets,” he says. “When we have this debate over privacy, no one hears the voice of the people who are not represented. People are deciding for them that their data need to be protected and should not be shared. But they are the ones whose health is at stake; they’re the ones who would most likely benefit from data-sharing.”

    Instead of asking for patient consent to share data, which he says may exacerbate the exclusion of many people who are now underrepresented in publicly available health data, Celi recommends enhancing the existing safeguards that are in place to protect such datasets. One new strategy that he and his colleagues have begun using is to share the data in a way that it can’t be downloaded, and all queries run on it can be monitored by the administrators of the database. This allows them to flag any user inquiry that seems like it might not be for legitimate research purposes, Celi says.

    “What we are advocating for is performing data analysis in a very secure environment so that we weed out any nefarious players trying to use the data for some other reasons apart from improving population health,” he says. “We’re not saying that we should disregard patient privacy. What we’re saying is that we have to also balance that with the value of data sharing.”

    The research was funded by the National Institutes of Health through the National Institute of Biomedical Imaging and Bioengineering.

  • Four from MIT receive NIH New Innovator Awards for 2022

    The National Institutes of Health (NIH) has awarded grants to four MIT faculty members as part of its High-Risk, High-Reward Research program.

    The program supports unconventional approaches to challenges in biomedical, behavioral, and social sciences. Each year, NIH Director’s Awards are granted to program applicants who propose high-risk, high-impact research in areas relevant to the NIH’s mission. In doing so, the NIH encourages innovative proposals that, due to their inherent risk, might struggle in the traditional peer-review process.

    This year, Lindsay Case, Siniša Hrvatin, Deblina Sarkar, and Caroline Uhler have been chosen to receive the New Innovator Award, which funds exceptionally creative research from early-career investigators. The award, which was established in 2007, supports researchers who are within 10 years of their final degree or clinical residency and have not yet received a research project grant or equivalent NIH grant.

    Lindsay Case, the Irwin and Helen Sizer Department of Biology Career Development Professor and an extramural member of the Koch Institute for Integrative Cancer Research, uses biochemistry and cell biology to study the spatial organization of signal transduction. Her work focuses on understanding how signaling molecules assemble into compartments with unique biochemical and biophysical properties to enable cells to sense and respond to information in their environment. Earlier this year, Case was one of two MIT assistant professors named as Searle Scholars.

    Siniša Hrvatin, who joined the School of Science faculty this past winter, is an assistant professor in the Department of Biology and a core member at the Whitehead Institute for Biomedical Research. He studies how animals and cells enter, regulate, and survive states of dormancy such as torpor and hibernation, aiming to harness the potential of these states therapeutically.

    Deblina Sarkar is an assistant professor and AT&T Career Development Chair Professor at the MIT Media Lab​. Her research combines the interdisciplinary fields of nanoelectronics, applied physics, and biology to invent disruptive technologies for energy-efficient nanoelectronics and merge such next-generation technologies with living matter to create a new paradigm for life-machine symbiosis. Her high-risk, high-reward proposal received the rare perfect impact score of 10, which is the highest score awarded by NIH.

    Caroline Uhler is a professor in the Department of Electrical Engineering and Computer Science and the Institute for Data, Systems, and Society. In addition, she is a core institute member at the Broad Institute of MIT and Harvard, where she co-directs the Eric and Wendy Schmidt Center. By combining machine learning, statistics, and genomics, she develops representation learning and causal inference methods to elucidate gene regulation in health and disease.

    The High-Risk, High-Reward Research program is supported by the NIH Common Fund, which oversees programs that pursue major opportunities and gaps in biomedical research that require collaboration across NIH Institutes and Centers. In addition to the New Innovator Award, the NIH also issues three other awards each year: the Pioneer Award, which supports bold and innovative research projects with unusually broad scientific impact; the Transformative Research Award, which supports risky and untested projects with transformative potential; and the Early Independence Award, which allows especially impressive junior scientists to skip the traditional postdoctoral training program to launch independent research careers.

    This year, the High-Risk, High-Reward Research program is granting 103 awards, including eight Pioneer Awards, 72 New Innovator Awards, nine Transformative Research Awards, and 14 Early Independence Awards. These 103 awards total approximately $285 million in support from the institutes, centers, and offices across NIH over five years. “The science advanced by these researchers is poised to blaze new paths of discovery in human health,” says Lawrence A. Tabak DDS, PhD, who is performing the duties of the director of NIH. “This unique cohort of scientists will transform what is known in the biological and behavioral world. We are privileged to support this innovative science.”

  • in

    Neurodegenerative disease can progress in newly identified patterns

    Neurodegenerative diseases — like amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease), Alzheimer’s, and Parkinson’s — are complicated, chronic ailments that can present with a variety of symptoms, worsen at different rates, and have many underlying genetic and environmental causes, some of which are unknown. ALS, in particular, affects voluntary muscle movement and is always fatal, but while most people survive for only a few years after diagnosis, others live with the disease for decades. Manifestations of ALS can also vary significantly; slower disease development often correlates with onset in the limbs, affecting fine motor skills, while the more serious bulbar ALS impacts swallowing, speaking, breathing, and mobility. Therefore, understanding the progression of diseases like ALS is critical to enrollment in clinical trials, analysis of potential interventions, and discovery of root causes.

    However, assessing disease evolution is far from straightforward. Current clinical studies typically assume that health declines on a downward linear trajectory on a symptom rating scale, and use these linear models to evaluate whether drugs are slowing disease progression. Yet data indicate that ALS often follows nonlinear trajectories, with periods where symptoms are stable alternating with periods when they are rapidly changing. Since data can be sparse, and health assessments often rely on subjective rating metrics measured at uneven time intervals, comparisons across patient populations are difficult. These heterogeneous data and progression patterns, in turn, complicate analyses of intervention effectiveness and potentially mask disease origin.

    Now, a new machine-learning method developed by researchers from MIT, IBM Research, and elsewhere aims to better characterize ALS disease progression patterns to inform clinical trial design.

    “There are groups of individuals that share progression patterns. For example, some seem to have really fast-progressing ALS and others that have slow-progressing ALS that varies over time,” says Divya Ramamoorthy PhD ’22, a research specialist at MIT and lead author of a new paper on the work that was published this month in Nature Computational Science. “The question we were asking is: can we use machine learning to identify if, and to what extent, those types of consistent patterns across individuals exist?”

    Their technique, indeed, identified discrete and robust clinical patterns in ALS progression, many of which are non-linear. Further, these disease progression subtypes were consistent across patient populations and disease metrics. The team additionally found that their method can be applied to Alzheimer’s and Parkinson’s diseases as well.

    Joining Ramamoorthy on the paper are MIT-IBM Watson AI Lab members Ernest Fraenkel, a professor in the MIT Department of Biological Engineering; Research Scientist Soumya Ghosh of IBM Research; and Principal Research Scientist Kenney Ng, also of IBM Research. Additional authors include Kristen Severson PhD ’18, a senior researcher at Microsoft Research and former member of the Watson Lab and of IBM Research; Karen Sachs PhD ’06 of Next Generation Analytics; a team of researchers with Answer ALS; Jonathan D. Glass and Christina N. Fournier of the Emory University School of Medicine; the Pooled Resource Open-Access ALS Clinical Trials Consortium; ALS/MND Natural History Consortium; Todd M. Herrington of Massachusetts General Hospital (MGH) and Harvard Medical School; and James D. Berry of MGH.

    MIT Professor Ernest Fraenkel describes early stages of his research looking at root causes of amyotrophic lateral sclerosis (ALS).

    Reshaping health decline

    After consulting with clinicians, the team of machine learning researchers and neurologists let the data speak for itself. They designed an unsupervised machine-learning model that employed two methods: Gaussian process regression and Dirichlet process clustering. These inferred the health trajectories directly from patient data and automatically grouped similar trajectories together without prescribing the number of clusters or the shape of the curves, forming ALS progression “subtypes.” Their method incorporated prior clinical knowledge in the form of a bias for negative trajectories — consistent with expectations for neurodegenerative disease progression — but did not assume any linearity. “We know that linearity is not reflective of what’s actually observed,” says Ng. “The methods and models that we use here were more flexible, in the sense that, they capture what was seen in the data,” without the need for expensive labeled data and prescription of parameters.
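
    To illustrate how a clustering method can avoid prescribing the number of clusters, here is a minimal sketch of the Chinese restaurant process, a standard way to sample cluster assignments from a Dirichlet process prior. It is not the researchers' implementation: it samples from the prior only and omits the Gaussian process model of each trajectory, and the function name, the concentration value `alpha`, and the seed are assumptions chosen for illustration.

```python
import random


def chinese_restaurant_process(n_items, alpha, seed=0):
    """Sample cluster labels from a Chinese restaurant process prior.

    Each new item joins an existing cluster with probability proportional
    to that cluster's size, or opens a new cluster with probability
    proportional to alpha. The number of clusters is never fixed in advance.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] = cluster index assigned to item i
    counts = []  # counts[k] = number of items currently in cluster k
    for _ in range(n_items):
        # Existing clusters are weighted by size; the last slot is "new cluster".
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts


labels, counts = chinese_restaurant_process(n_items=200, alpha=1.0)
print("clusters found:", len(counts))
print("cluster sizes:", counts)
```

    In a full model of this kind, the prior weights would be combined with how well each cluster's trajectory curve explains a patient's data, so that patients with similar trajectories end up grouped together.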

    Primarily, they applied the model to five longitudinal datasets from ALS clinical trials and observational studies. These used the gold standard to measure symptom development: the revised ALS functional rating scale (ALSFRS-R), which captures a global picture of patient neurological impairment but can be a bit of a “messy metric.” Additionally, the team incorporated survival probabilities, forced vital capacity (a measurement of respiratory function), and subscores of ALSFRS-R, which look at individual bodily functions.

    New regimes of progression and utility

    When their population-level model was trained and tested on these metrics, four dominant patterns of disease popped out of the many trajectories — sigmoidal fast progression, stable slow progression, unstable slow progression, and unstable moderate progression — many with strong nonlinear characteristics. Notably, it captured trajectories where patients experienced a sudden loss of ability, called a functional cliff, which would significantly impact treatments, enrollment in clinical trials, and quality of life.
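
    A small numerical sketch can show why a linear model misrepresents a trajectory with a functional cliff. The sigmoidal curve below is synthetic: its starting score, size of drop, midpoint, and steepness are hypothetical values chosen for illustration, not figures from the study.

```python
import math


def sigmoid_decline(t, start=48.0, drop=30.0, midpoint=12.0, steepness=1.0):
    """Synthetic score that stays stable, falls sharply near `midpoint`, then levels off."""
    return start - drop / (1.0 + math.exp(-steepness * (t - midpoint)))


def fit_line(ts, ys):
    """Ordinary least-squares fit of y = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_y = sum(ys) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    slope = cov / var
    return slope, mean_y - slope * mean_t


months = list(range(25))                       # monthly visits over two years
scores = [sigmoid_decline(t) for t in months]  # hypothetical patient trajectory
slope, intercept = fit_line(months, scores)

# Residuals show where the straight line disagrees with the curve.
residuals = [y - (slope * t + intercept) for t, y in zip(months, scores)]
worst_error = max(abs(r) for r in residuals)
print(f"fitted slope: {slope:.2f} points per month")
print(f"largest error of the linear fit: {worst_error:.1f} points")
```

    The straight line spreads the decline evenly over the whole period, so it overstates the loss during the stable phase and understates how abrupt the drop is.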

    The researchers compared their method against other commonly used linear and nonlinear approaches in the field to separate the contribution of clustering and linearity to the model’s accuracy. The new method outperformed them, including patient-specific models, and the team found that subtype patterns were consistent across measures. Impressively, when data were withheld, the model was able to interpolate missing values, and, critically, could forecast future health measures. The model could also be trained on one ALSFRS-R dataset and predict cluster membership in others, making it robust, generalizable, and accurate with scarce data. So long as 6-12 months of data were available, health trajectories could be inferred with higher confidence than conventional methods.

    The researchers’ approach also provided insights into Alzheimer’s and Parkinson’s diseases, both of which can have a range of symptom presentations and progression. For Alzheimer’s, the new technique could identify distinct disease patterns, in particular variations in the rates of conversion of mild to severe disease. The Parkinson’s analysis demonstrated a relationship between progression trajectories for off-medication scores and disease phenotypes, such as the tremor-dominant or postural instability/gait difficulty forms of Parkinson’s disease.

    The work makes significant strides to find the signal amongst the noise in the time-series of complex neurodegenerative disease. “The patterns that we see are reproducible across studies, which I don’t believe had been shown before, and that may have implications for how we subtype the [ALS] disease,” says Fraenkel. As the FDA has been considering the impact of non-linearity in clinical trial designs, the team notes that their work is particularly pertinent.

    As new ways to understand disease mechanisms come online, this model provides another tool to pick apart illnesses like ALS, Alzheimer’s, and Parkinson’s from a systems biology perspective.

    “We have a lot of molecular data from the same patients, and so our long-term goal is to see whether there are subtypes of the disease,” says Fraenkel, whose lab looks at cellular changes to understand the etiology of diseases and possible targets for cures. “One approach is to start with the symptoms … and see if people with different patterns of disease progression are also different at the molecular level. That might lead you to a therapy. Then there’s the bottom-up approach, where you start with the molecules” and try to reconstruct biological pathways that might be affected. “We’re going [to be tackling this] from both ends … and finding if something meets in the middle.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab, the Muscular Dystrophy Association, Department of Veterans Affairs Office of Research and Development, the Department of Defense, NSF Graduate Research Fellowship Program, Siebel Scholars Fellowship, Answer ALS, the United States Army Medical Research Acquisition Activity, National Institutes of Health, and the NIH/NINDS.

  • in

    In-home wireless device tracks disease progression in Parkinson’s patients

    Parkinson’s disease is the fastest-growing neurological disease, now affecting more than 10 million people worldwide, yet clinicians still face huge challenges in tracking its severity and progression.

    Clinicians typically evaluate patients by testing their motor skills and cognitive functions during clinic visits. These semisubjective measurements are often skewed by outside factors — perhaps a patient is tired after a long drive to the hospital. More than 40 percent of individuals with Parkinson’s are never treated by a neurologist or Parkinson’s specialist, often because they live too far from an urban center or have difficulty traveling.

    In an effort to address these problems, researchers from MIT and elsewhere demonstrated an in-home device that can monitor a patient’s movement and gait speed, which can be used to evaluate Parkinson’s severity, the progression of the disease, and the patient’s response to medication.

    The device, which is about the size of a Wi-Fi router, gathers data passively using radio signals that reflect off the patient’s body as they move around their home. The patient does not need to wear a gadget or change their behavior. (A recent study, for example, showed that this type of device could be used to detect Parkinson’s from a person’s breathing patterns while sleeping.)

    The researchers used these devices to conduct a one-year at-home study with 50 participants. They showed that, by using machine-learning algorithms to analyze the troves of data they passively gathered (more than 200,000 gait speed measurements), a clinician could track Parkinson’s progression and medication response more effectively than they would with periodic, in-clinic evaluations.

    “By being able to have a device in the home that can monitor a patient and tell the doctor remotely about the progression of the disease, and the patient’s medication response so they can attend to the patient even if the patient can’t come to the clinic — now they have real, reliable information — that actually goes a long way toward improving equity and access,” says senior author Dina Katabi, the Thuan and Nicole Pham Professor in the Department of Electrical Engineering and Computer Science (EECS), and a principal investigator in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT Jameel Clinic.

    The co-lead authors are EECS graduate students Yingcheng Liu and Guo Zhang. The research is published today in Science Translational Medicine.

    A human radar

    This work utilizes a wireless device previously developed in the Katabi lab that analyzes radio signals that bounce off people’s bodies. It transmits signals that use a tiny fraction of the power of a Wi-Fi router — these super-low-power signals don’t interfere with other wireless devices in the home. While radio signals pass through walls and other solid objects, they are reflected off humans due to the water in our bodies.  

    This creates a “human radar” that can track the movement of a person in a room. Radio waves always travel at the same speed, so the length of time it takes the signals to reflect back to the device indicates how the person is moving.
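
    The timing argument can be made concrete with a short calculation. This is only a sketch of the underlying physics, assuming an idealized single reflection straight back to the device; the function names and sample timings are hypothetical, and the signal processing used in the actual device is not described in this article.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # radio waves travel at the speed of light


def range_from_round_trip(round_trip_s):
    """Distance to the reflector: the signal covers the path twice (out and back)."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_s / 2.0


def radial_speed(round_trip_a_s, round_trip_b_s, dt_s):
    """Speed toward or away from the device from two echoes taken dt_s seconds apart."""
    range_a = range_from_round_trip(round_trip_a_s)
    range_b = range_from_round_trip(round_trip_b_s)
    return (range_b - range_a) / dt_s


# Hypothetical echoes: person at 3.0 m, then at 4.0 m one second later.
echo_a = 2 * 3.0 / SPEED_OF_LIGHT_M_PER_S
echo_b = 2 * 4.0 / SPEED_OF_LIGHT_M_PER_S
print(f"range at first echo: {range_from_round_trip(echo_a):.2f} m")
print(f"radial speed: {radial_speed(echo_a, echo_b, dt_s=1.0):.2f} m/s")
```

    The round-trip times involved are on the order of tens of nanoseconds, which is why the device needs precise timing to resolve how a person is moving.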

    The device incorporates a machine-learning classifier that can pick out the precise radio signals reflected off the patient even when there are other people moving around the room. Advanced algorithms use these movement data to compute gait speed — how fast the person is walking.

    Because the device operates in the background and runs all day, every day, it can collect a massive amount of data. The researchers wanted to see if they could apply machine learning to these datasets to gain insights about the disease over time.

    They gathered 50 participants, 34 of whom had Parkinson’s, and conducted a one-year study of in-home gait measurements. Through the study, the researchers collected more than 200,000 individual measurements that they averaged to smooth out variability due to conditions irrelevant to the disease. (For example, a patient may hurry up to answer an alarm or walk slower when talking on the phone.)

    They used statistical methods to analyze the data and found that in-home gait speed can be used to effectively track Parkinson’s progression and severity. For instance, they showed that gait speed declined almost twice as fast for individuals with Parkinson’s, compared to those without. 
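
    The following sketch shows how averaging many noisy measurements can expose a slow trend in gait speed. The data are simulated: the baseline speed, rate of decline, noise level, and number of measurements per day are hypothetical, and the least-squares fit stands in for whatever statistical methods the researchers actually used.

```python
import random
from statistics import fmean

TRUE_SLOPE = -0.0002  # hypothetical decline, in m/s per day


def simulate_measurements(days=365, per_day=20, noise_sd=0.1, seed=0):
    """Simulate (day, gait speed) pairs with a slow decline plus everyday noise."""
    rng = random.Random(seed)
    data = []
    for day in range(days):
        for _ in range(per_day):
            data.append((day, 1.0 + TRUE_SLOPE * day + rng.gauss(0.0, noise_sd)))
    return data


def daily_averages(data):
    """Average all measurements from the same day to smooth out variability."""
    by_day = {}
    for day, speed in data:
        by_day.setdefault(day, []).append(speed)
    days = sorted(by_day)
    return days, [fmean(by_day[d]) for d in days]


def fit_slope(ts, ys):
    """Least-squares slope of y against t."""
    mean_t, mean_y = fmean(ts), fmean(ys)
    cov = sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys))
    var = sum((t - mean_t) ** 2 for t in ts)
    return cov / var


days, averages = daily_averages(simulate_measurements())
estimated_slope = fit_slope(days, averages)
print(f"estimated decline: {estimated_slope * 365:.3f} m/s per year")
```

    A single measurement here is far noisier than the yearly change, yet the trend is recoverable once thousands of measurements are aggregated.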

    “Monitoring the patient continuously as they move around the room enabled us to get really good measurements of their gait speed. And with so much data, we were able to perform aggregation that allowed us to see very small differences,” Zhang says.

    Better, faster results

    Drilling down on these variabilities offered some key insights. For instance, the researchers showed that daily fluctuations in a patient’s walking speed correspond with how they are responding to their medication — walking speed may improve after a dose and then begin to decline after a few hours, as the medication impact wears off.

    “This enables us to objectively measure how your mobility responds to your medication. Previously, this was very cumbersome to do because this medication effect could only be measured by having the patient keep a journal,” Liu says.

    A clinician could use these data to adjust medication dosage more effectively and accurately. This is especially important since drugs used to treat disease symptoms can cause serious side effects if the patient receives too much.

    The researchers were able to demonstrate statistically significant results regarding Parkinson’s progression after studying 50 people for just one year. By contrast, an often-cited study by the Michael J. Fox Foundation involved more than 500 individuals and monitored them for more than five years, Katabi says.

    “For a pharmaceutical company or a biotech company trying to develop medicines for this disease, this could greatly reduce the burden and cost and speed up the development of new therapies,” she adds.

    Katabi credits much of the study’s success to the dedicated team of scientists and clinicians who worked together to tackle the many difficulties that arose along the way. For one, they began the study before the Covid-19 pandemic, so team members initially visited people’s homes to set up the devices. When that was no longer possible, they developed a user-friendly phone app to remotely help participants as they deployed the device at home.

    Through the course of the study, they learned to automate processes and reduce effort, especially for the participants and clinical team.

    This knowledge will prove useful as they look to deploy devices in at-home studies of other neurological disorders, such as Alzheimer’s, ALS, and Huntington’s. They also want to explore how these methods could be used, in conjunction with other work from the Katabi lab showing that Parkinson’s can be diagnosed by monitoring breathing, to collect a holistic set of markers that could diagnose the disease early and then be used to track and treat it.

    “This radio-wave sensor can enable more care (and research) to migrate from hospitals to the home where it is most desired and needed,” says Ray Dorsey, a professor of neurology at the University of Rochester Medical Center, co-author of Ending Parkinson’s, and a co-author of this research paper. “Its potential is just beginning to be seen. We are moving toward a day where we can diagnose and predict disease at home. In the future, we may even be able to predict and ideally prevent events like falls and heart attacks.”

    This work is supported, in part, by the National Institutes of Health and the Michael J. Fox Foundation.

  • in

    Avoiding shortcut solutions in artificial intelligence

    If your Uber driver takes a shortcut, you might get to your destination faster. But if a machine learning model takes a shortcut, it might fail in unexpected ways.

    In machine learning, a shortcut solution occurs when the model relies on a simple characteristic of a dataset to make a decision, rather than learning the true essence of the data, which can lead to inaccurate predictions. For example, a model might learn to identify images of cows by focusing on the green grass that appears in the photos, rather than the more complex shapes and patterns of the cows.  

    A new study by researchers at MIT explores the problem of shortcuts in a popular machine-learning method and proposes a solution that can prevent shortcuts by forcing the model to use more data in its decision-making.

    By removing the simpler characteristics the model is focusing on, the researchers force it to focus on more complex features of the data that it hadn’t been considering. Then, by asking the model to solve the same task two ways — once using those simpler features, and then also using the complex features it has now learned to identify — they reduce the tendency for shortcut solutions and boost the performance of the model.

    One potential application of this work is to enhance the effectiveness of machine learning models that are used to identify disease in medical images. Shortcut solutions in this context could lead to false diagnoses and have dangerous implications for patients.

    “It is still difficult to tell why deep networks make the decisions that they do, and in particular, which parts of the data these networks choose to focus upon when making a decision. If we can understand how shortcuts work in further detail, we can go even farther to answer some of the fundamental but very practical questions that are really important to people who are trying to deploy these networks,” says Joshua Robinson, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

    Robinson wrote the paper with his advisors, senior author Suvrit Sra, the Esther and Harold E. Edgerton Career Development Associate Professor in the Department of Electrical Engineering and Computer Science (EECS) and a core member of the Institute for Data, Systems, and Society (IDSS) and the Laboratory for Information and Decision Systems; and Stefanie Jegelka, the X-Consortium Career Development Associate Professor in EECS and a member of CSAIL and IDSS; as well as University of Pittsburgh assistant professor Kayhan Batmanghelich and PhD students Li Sun and Ke Yu. The research will be presented at the Conference on Neural Information Processing Systems in December. 

    The long road to understanding shortcuts

    The researchers focused their study on contrastive learning, which is a powerful form of self-supervised machine learning. In self-supervised machine learning, a model is trained using raw data that do not have label descriptions from humans. It can therefore be used successfully for a larger variety of data.

    A self-supervised learning model learns useful representations of data, which are used as inputs for different tasks, like image classification. But if the model takes shortcuts and fails to capture important information, these tasks won’t be able to use that information either.

    For example, if a self-supervised learning model is trained to classify pneumonia in X-rays from a number of hospitals, but it learns to make predictions based on a tag that identifies the hospital the scan came from (because some hospitals have more pneumonia cases than others), the model won’t perform well when it is given data from a new hospital.     

    For contrastive learning models, an encoder algorithm is trained to discriminate between pairs of similar inputs and pairs of dissimilar inputs. This process encodes rich and complex data, like images, in a way that the contrastive learning model can interpret.
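
    One common way to express this training objective is the InfoNCE loss, sketched below in plain Python. The article does not say which loss the researchers used, so this choice is an assumption, as are the toy vectors and the temperature value. The loss is small when the anchor is closer to its similar ("positive") input than to the dissimilar ("negative") ones.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE loss for one anchor: -log softmax probability of the positive pair."""
    a = normalize(anchor)
    pos_logit = dot(a, normalize(positive)) / temperature
    neg_logits = [dot(a, normalize(n)) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with the maximum subtracted for numerical stability.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - pos_logit


anchor = [1.0, 0.0]
# Good embedding: the similar input sits next to the anchor.
good = info_nce(anchor, positive=[0.9, 0.1], negatives=[[0.0, 1.0], [-1.0, 0.0]])
# Poor embedding: the similar input sits opposite the anchor.
poor = info_nce(anchor, positive=[-1.0, 0.0], negatives=[[0.9, 0.1], [0.0, 1.0]])
print(f"loss with a good embedding: {good:.3f}")
print(f"loss with a poor embedding: {poor:.3f}")
```

    Training an encoder amounts to adjusting it so that this loss decreases across many such anchors.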

    The researchers tested contrastive learning encoders with a series of images and found that, during this training procedure, they also fall prey to shortcut solutions. The encoders tend to focus on the simplest features of an image to decide which pairs of inputs are similar and which are dissimilar. Ideally, the encoder should focus on all the useful characteristics of the data when making a decision, Jegelka says.

    So, the team made it harder to tell the difference between the similar and dissimilar pairs, and found that this changes which features the encoder will look at to make a decision.

    “If you make the task of discriminating between similar and dissimilar items harder and harder, then your system is forced to learn more meaningful information in the data, because without learning that it cannot solve the task,” she says.

    But increasing this difficulty resulted in a tradeoff — the encoder got better at focusing on some features of the data but became worse at focusing on others. It almost seemed to forget the simpler features, Robinson says.

    To avoid this tradeoff, the researchers asked the encoder to discriminate between the pairs the same way it had originally, using the simpler features, and also after the researchers removed the information it had already learned. Solving the task both ways simultaneously caused the encoder to improve across all features.

    Their method, called implicit feature modification, adaptively modifies samples to remove the simpler features the encoder is using to discriminate between the pairs. The technique does not rely on human input, which is important because real-world data sets can have hundreds of different features that could combine in complex ways, Sra explains.
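
    The idea of solving the task two ways at once can be sketched as an average of two contrastive losses, one on the original similarities and one on a harder version. The sketch assumes the modification can be approximated by a fixed budget `epsilon` that lowers the similarity of the similar pair and raises the similarity of dissimilar pairs. The function names, the value of `epsilon`, and the equal weighting are illustrative assumptions, not details given in the article.

```python
import math


def contrastive_loss(pos_sim, neg_sims, temperature=0.5, epsilon=0.0):
    """Softmax contrastive loss on similarity scores.

    With epsilon > 0 the task is made harder: the positive pair looks less
    similar and every negative pair looks more similar.
    """
    pos_logit = (pos_sim - epsilon) / temperature
    neg_logits = [(s + epsilon) / temperature for s in neg_sims]
    logits = [pos_logit] + neg_logits
    top = max(logits)  # subtract the maximum for numerical stability
    return top + math.log(sum(math.exp(x - top) for x in logits)) - pos_logit


def two_way_loss(pos_sim, neg_sims, temperature=0.5, epsilon=0.1):
    """Average the original task and the harder, modified task."""
    standard = contrastive_loss(pos_sim, neg_sims, temperature)
    harder = contrastive_loss(pos_sim, neg_sims, temperature, epsilon)
    return 0.5 * (standard + harder), standard, harder


# Hypothetical cosine similarities for one anchor.
combined, standard, harder = two_way_loss(pos_sim=0.9, neg_sims=[0.2, -0.3])
print(f"standard: {standard:.3f}  harder: {harder:.3f}  combined: {combined:.3f}")
```

    Keeping the standard term in the objective is what guards against the tradeoff described above, since the encoder is still rewarded for using the simpler features.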

    From cars to COPD

    The researchers ran one test of this method using images of vehicles. They used implicit feature modification to adjust the color, orientation, and vehicle type to make it harder for the encoder to discriminate between similar and dissimilar pairs of images. The encoder improved its accuracy across all three features — texture, shape, and color — simultaneously.

    To see if the method would stand up to more complex data, the researchers also tested it with samples from a medical image database of chronic obstructive pulmonary disease (COPD). Again, the method led to simultaneous improvements across all features they evaluated.

    While this work takes some important steps forward in understanding the causes of shortcut solutions and working to solve them, the researchers say that continuing to refine these methods and applying them to other types of self-supervised learning will be key to future advancements.

    “This ties into some of the biggest questions about deep learning systems, like ‘Why do they fail?’ and ‘Can we know in advance the situations where your model will fail?’ There is still a lot farther to go if you want to understand shortcut learning in its full generality,” Robinson says.

    This research is supported by the National Science Foundation, National Institutes of Health, and the Pennsylvania Department of Health’s SAP SE Commonwealth Universal Research Enhancement (CURE) program.