More stories

  • in

    Large language models help decipher clinical notes

    Electronic health records (EHRs) need a new public relations manager. Ten years ago, the U.S. government passed a law that required hospitals to digitize their health records with the intent of improving and streamlining care. The enormous amount of information in these now-digital records could be used to answer very specific questions beyond the scope of clinical trials: What’s the right dose of this medication for patients with this height and weight? What about patients with a specific genomic profile?

    Unfortunately, most of the data that could answer these questions is trapped in doctor’s notes, full of jargon and abbreviations. These notes are hard for computers to understand using current techniques — extracting information requires training multiple machine learning models. Models trained for one hospital, also, don’t work well at others, and training each model requires domain experts to label lots of data, a time-consuming and expensive process. 

    An ideal system would use a single model that can extract many types of information, work well at multiple hospitals, and learn from a small amount of labeled data. But how? Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) believed that to disentangle the data, they needed to call on something bigger: large language models. To pull that important medical information, they used a very big, GPT-3 style model to do tasks like expand overloaded jargon and acronyms and extract medication regimens. 

    For example, the system takes an input, which in this case is a clinical note, “prompts” the model with a question about the note, such as “expand this abbreviation, C-T-A.” The system returns an output such as “clear to auscultation,” as opposed to say, a CT angiography. The objective of extracting this clean data, the team says, is to eventually enable more personalized clinical recommendations. 

    Medical data is, understandably, a pretty tricky resource to navigate freely. There’s plenty of red tape around using public resources for testing the performance of large models because of data use restrictions, so the team decided to scrape together their own. Using a set of short, publicly available clinical snippets, they cobbled together a small dataset to enable evaluation of the extraction performance of large language models. 

    “It’s challenging to develop a single general-purpose clinical natural language processing system that will solve everyone’s needs and be robust to the huge variation seen across health datasets. As a result, until today, most clinical notes are not used in downstream analyses or for live decision support in electronic health records. These large language model approaches could potentially transform clinical natural language processing,” says David Sontag, MIT professor of electrical engineering and computer science, principal investigator in CSAIL and the Institute for Medical Engineering and Science, and supervising author on a paper about the work, which will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction makes scaling possible. Even if you have hundreds of different use cases, no problem — you can build each model with a few minutes of work, versus having to label a ton of data for that particular task.”

    For example, without any labels at all, the researchers found these models could achieve 86 percent accuracy at expanding overloaded acronyms, and the team developed additional methods to boost this further to 90 percent accuracy, with still no labels required.

    Imprisoned in an EHR 

    Experts have been steadily building up large language models (LLMs) for quite some time, but they burst onto the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on a huge amount of text from the internet to finish sentences and predict the next most likely word. 

    While previous, smaller models like earlier GPT iterations or BERT have pulled off a good performance for extracting medical data, they still require substantial manual data-labeling effort. 

    For example, a note, “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but experienced nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medication, side effects from the record, disambiguating common abbreviations, etc). In addition to expanding abbreviations, they investigated four other tasks, including if the models could parse clinical trials and extract detail-rich medication regimens.  

    “Prior work has shown that these models are sensitive to the prompt’s precise phrasing. Part of our technical contribution is a way to format the prompt so that the model gives you outputs in the correct format,” says Hunter Lang, CSAIL PhD student and author on the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string. It can be a list. It can be a quote from the original input. So there’s more structure than just free text. Part of our research contribution is encouraging the model to give you an output with the correct structure. That significantly cuts down on post-processing time.”

    The approach can’t be applied to out-of-the-box health data at a hospital: that requires sending private patient information across the open internet to an LLM provider like OpenAI. The authors showed that it’s possible to work around this by distilling the model into a smaller one that could be used on-site.

    The model — sometimes just like humans — is not always beholden to the truth. Here’s what a potential problem might look like: Let’s say you’re asking the reason why someone took medication. Without proper guardrails and checks, the model might just output the most common reason for that medication, if nothing is explicitly mentioned in the note. This led to the team’s efforts to force the model to extract more quotes from data and less free text.

    Future work for the team includes extending to languages other than English, creating additional methods for quantifying uncertainty in the model, and pulling off similar results with open-sourced models. 

    “Clinical information buried in unstructured clinical notes has unique challenges compared to general domain text mostly due to large use of acronyms, and inconsistent textual patterns used across different health care facilities,” says Sadid Hasan, AI lead at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work sets forth an interesting paradigm of leveraging the power of general domain large language models for several important zero-/few-shot clinical NLP tasks. Specifically, the proposed guided prompt design of LLMs to generate more structured outputs could lead to further developing smaller deployable models by iteratively utilizing the model generated pseudo-labels.”

    “AI has accelerated in the last five years to the point at which these large models can predict contextualized recommendations with benefits rippling out across a variety of domains such as suggesting novel drug formulations, understanding unstructured text, code recommendations or create works of art inspired by any number of human artists or styles,” says Parminder Bhatia, who was formerly Head of Machine Learning at AWS Health AI and is currently Head of ML for low-code applications leveraging large language models at AWS AI Labs. “One of the applications of these large models [the team has] recently launched is Amazon CodeWhisperer, which is [an] ML-powered coding companion that helps developers in building applications.”

    As part of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, Agrawal, Sontag, and Lang wrote the paper alongside Yoon Kim, MIT assistant professor and CSAIL principal investigator, and Stefan Hegselmann, a visiting PhD student from the University of Muenster. First-author Agrawal’s research was supported by a Takeda Fellowship, the MIT Deshpande Center for Technological Innovation, and the MLA@CSAIL Initiatives. More

  • in

    Study finds the risks of sharing health care data are low

    In recent years, scientists have made great strides in their ability to develop artificial intelligence algorithms that can analyze patient data and come up with new ways to diagnose disease or predict which treatments work best for different patients.

    The success of those algorithms depends on access to patient health data, which has been stripped of personal information that could be used to identify individuals from the dataset. However, the possibility that individuals could be identified through other means has raised concerns among privacy advocates.

    In a new study, a team of researchers led by MIT Principal Research Scientist Leo Anthony Celi has quantified the potential risk of this kind of patient re-identification and found that it is currently extremely low relative to the risk of data breach. In fact, between 2016 and 2021, the period examined in the study, there were no reports of patient re-identification through publicly available health data.

    The findings suggest that the potential risk to patient privacy is greatly outweighed by the gains for patients, who benefit from better diagnosis and treatment, says Celi. He hopes that in the near future, these datasets will become more widely available and include a more diverse group of patients.

    “We agree that there is some risk to patient privacy, but there is also a risk of not sharing data,” he says. “There is harm when data is not shared, and that needs to be factored into the equation.”

    Celi, who is also an instructor at the Harvard T.H. Chan School of Public Health and an attending physician with the Division of Pulmonary, Critical Care and Sleep Medicine at the Beth Israel Deaconess Medical Center, is the senior author of the new study. Kenneth Seastedt, a thoracic surgery fellow at Beth Israel Deaconess Medical Center, is the lead author of the paper, which appears today in PLOS Digital Health.

    Risk-benefit analysis

    Large health record databases created by hospitals and other institutions contain a wealth of information on diseases such as heart disease, cancer, macular degeneration, and Covid-19, which researchers use to try to discover new ways to diagnose and treat disease.

    Celi and others at MIT’s Laboratory for Computational Physiology have created several publicly available databases, including the Medical Information Mart for Intensive Care (MIMIC), which they recently used to develop algorithms that can help doctors make better medical decisions. Many other research groups have also used the data, and others have created similar databases in countries around the world.

    Typically, when patient data is entered into this kind of database, certain types of identifying information are removed, including patients’ names, addresses, and phone numbers. This is intended to prevent patients from being re-identified and having information about their medical conditions made public.

    However, concerns about privacy have slowed the development of more publicly available databases with this kind of information, Celi says. In the new study, he and his colleagues set out to ask what the actual risk of patient re-identification is. First, they searched PubMed, a database of scientific papers, for any reports of patient re-identification from publicly available health data, but found none.

    To expand the search, the researchers then examined media reports from September 2016 to September 2021, using Media Cloud, an open-source global news database and analysis tool. In a search of more than 10,000 U.S. media publications during that time, they did not find a single instance of patient re-identification from publicly available health data.

    In contrast, they found that during the same time period, health records of nearly 100 million people were stolen through data breaches of information that was supposed to be securely stored.

    “Of course, it’s good to be concerned about patient privacy and the risk of re-identification, but that risk, although it’s not zero, is minuscule compared to the issue of cyber security,” Celi says.

    Better representation

    More widespread sharing of de-identified health data is necessary, Celi says, to help expand the representation of minority groups in the United States, who have traditionally been underrepresented in medical studies. He is also working to encourage the development of more such databases in low- and middle-income countries.

    “We cannot move forward with AI unless we address the biases that lurk in our datasets,” he says. “When we have this debate over privacy, no one hears the voice of the people who are not represented. People are deciding for them that their data need to be protected and should not be shared. But they are the ones whose health is at stake; they’re the ones who would most likely benefit from data-sharing.”

    Instead of asking for patient consent to share data, which he says may exacerbate the exclusion of many people who are now underrepresented in publicly available health data, Celi recommends enhancing the existing safeguards that are in place to protect such datasets. One new strategy that he and his colleagues have begun using is to share the data in a way that it can’t be downloaded, and all queries run on it can be monitored by the administrators of the database. This allows them to flag any user inquiry that seems like it might not be for legitimate research purposes, Celi says.

    “What we are advocating for is performing data analysis in a very secure environment so that we weed out any nefarious players trying to use the data for some other reasons apart from improving population health,” he says. “We’re not saying that we should disregard patient privacy. What we’re saying is that we have to also balance that with the value of data sharing.”

    The research was funded by the National Institutes of Health through the National Institute of Biomedical Imaging and Bioengineering. More

  • in

    Neurodegenerative disease can progress in newly identified patterns

    Neurodegenerative diseases — like amyotrophic lateral sclerosis (ALS, or Lou Gehrig’s disease), Alzheimer’s, and Parkinson’s — are complicated, chronic ailments that can present with a variety of symptoms, worsen at different rates, and have many underlying genetic and environmental causes, some of which are unknown. ALS, in particular, affects voluntary muscle movement and is always fatal, but while most people survive for only a few years after diagnosis, others live with the disease for decades. Manifestations of ALS can also vary significantly; often slower disease development correlates with onset in the limbs and affecting fine motor skills, while the more serious, bulbar ALS impacts swallowing, speaking, breathing, and mobility. Therefore, understanding the progression of diseases like ALS is critical to enrollment in clinical trials, analysis of potential interventions, and discovery of root causes.

    However, assessing disease evolution is far from straightforward. Current clinical studies typically assume that health declines on a downward linear trajectory on a symptom rating scale, and use these linear models to evaluate whether drugs are slowing disease progression. However, data indicate that ALS often follows nonlinear trajectories, with periods where symptoms are stable alternating with periods when they are rapidly changing. Since data can be sparse, and health assessments often rely on subjective rating metrics measured at uneven time intervals, comparisons across patient populations are difficult. These heterogenous data and progression, in turn, complicate analyses of invention effectiveness and potentially mask disease origin.

    Now, a new machine-learning method developed by researchers from MIT, IBM Research, and elsewhere aims to better characterize ALS disease progression patterns to inform clinical trial design.

    “There are groups of individuals that share progression patterns. For example, some seem to have really fast-progressing ALS and others that have slow-progressing ALS that varies over time,” says Divya Ramamoorthy PhD ’22, a research specialist at MIT and lead author of a new paper on the work that was published this month in Nature Computational Science. “The question we were asking is: can we use machine learning to identify if, and to what extent, those types of consistent patterns across individuals exist?”

    Their technique, indeed, identified discrete and robust clinical patterns in ALS progression, many of which are non-linear. Further, these disease progression subtypes were consistent across patient populations and disease metrics. The team additionally found that their method can be applied to Alzheimer’s and Parkinson’s diseases as well.

    Joining Ramamoorthy on the paper are MIT-IBM Watson AI Lab members Ernest Fraenkel, a professor in the MIT Department of Biological Engineering; Research Scientist Soumya Ghosh of IBM Research; and Principal Research Scientist Kenney Ng, also of IBM Research. Additional authors include Kristen Severson PhD ’18, a senior researcher at Microsoft Research and former member of the Watson Lab and of IBM Research; Karen Sachs PhD ’06 of Next Generation Analytics; a team of researchers with Answer ALS; Jonathan D. Glass and Christina N. Fournier of the Emory University School of Medicine; the Pooled Resource Open-Access ALS Clinical Trials Consortium; ALS/MND Natural History Consortium; Todd M. Herrington of Massachusetts General Hospital (MGH) and Harvard Medical School; and James D. Berry of MGH.

    Play video

    MIT Professor Ernest Fraenkel describes early stages of his research looking at root causes of amyotrophic lateral sclerosis (ALS).

    Reshaping health decline

    After consulting with clinicians, the team of machine learning researchers and neurologists let the data speak for itself. They designed an unsupervised machine-learning model that employed two methods: Gaussian process regression and Dirichlet process clustering. These inferred the health trajectories directly from patient data and automatically grouped similar trajectories together without prescribing the number of clusters or the shape of the curves, forming ALS progression “subtypes.” Their method incorporated prior clinical knowledge in the way of a bias for negative trajectories — consistent with expectations for neurodegenerative disease progressions — but did not assume any linearity. “We know that linearity is not reflective of what’s actually observed,” says Ng. “The methods and models that we use here were more flexible, in the sense that, they capture what was seen in the data,” without the need for expensive labeled data and prescription of parameters.

    Primarily, they applied the model to five longitudinal datasets from ALS clinical trials and observational studies. These used the gold standard to measure symptom development: the ALS functional rating scale revised (ALSFRS-R), which captures a global picture of patient neurological impairment but can be a bit of a “messy metric.” Additionally, performance on survivability probabilities, forced vital capacity (a measurement of respiratory function), and subscores of ALSFRS-R, which looks at individual bodily functions, were incorporated.

    New regimes of progression and utility

    When their population-level model was trained and tested on these metrics, four dominant patterns of disease popped out of the many trajectories — sigmoidal fast progression, stable slow progression, unstable slow progression, and unstable moderate progression — many with strong nonlinear characteristics. Notably, it captured trajectories where patients experienced a sudden loss of ability, called a functional cliff, which would significantly impact treatments, enrollment in clinical trials, and quality of life.

    The researchers compared their method against other commonly used linear and nonlinear approaches in the field to separate the contribution of clustering and linearity to the model’s accuracy. The new work outperformed them, even patient-specific models, and found that subtype patterns were consistent across measures. Impressively, when data were withheld, the model was able to interpolate missing values, and, critically, could forecast future health measures. The model could also be trained on one ALSFRS-R dataset and predict cluster membership in others, making it robust, generalizable, and accurate with scarce data. So long as 6-12 months of data were available, health trajectories could be inferred with higher confidence than conventional methods.

    The researchers’ approach also provided insights into Alzheimer’s and Parkinson’s diseases, both of which can have a range of symptom presentations and progression. For Alzheimer’s, the new technique could identify distinct disease patterns, in particular variations in the rates of conversion of mild to severe disease. The Parkinson’s analysis demonstrated a relationship between progression trajectories for off-medication scores and disease phenotypes, such as the tremor-dominant or postural instability/gait difficulty forms of Parkinson’s disease.

    The work makes significant strides to find the signal amongst the noise in the time-series of complex neurodegenerative disease. “The patterns that we see are reproducible across studies, which I don’t believe had been shown before, and that may have implications for how we subtype the [ALS] disease,” says Fraenkel. As the FDA has been considering the impact of non-linearity in clinical trial designs, the team notes that their work is particularly pertinent.

    As new ways to understand disease mechanisms come online, this model provides another tool to pick apart illnesses like ALS, Alzheimer’s, and Parkinson’s from a systems biology perspective.

    “We have a lot of molecular data from the same patients, and so our long-term goal is to see whether there are subtypes of the disease,” says Fraenkel, whose lab looks at cellular changes to understand the etiology of diseases and possible targets for cures. “One approach is to start with the symptoms … and see if people with different patterns of disease progression are also different at the molecular level. That might lead you to a therapy. Then there’s the bottom-up approach, where you start with the molecules” and try to reconstruct biological pathways that might be affected. “We’re going [to be tackling this] from both ends … and finding if something meets in the middle.”

    This research was supported, in part, by the MIT-IBM Watson AI Lab, the Muscular Dystrophy Association, Department of Veterans Affairs of Research and Development, the Department of Defense, NSF Gradate Research Fellowship Program, Siebel Scholars Fellowship, Answer ALS, the United States Army Medical Research Acquisition Activity, National Institutes of Health, and the NIH/NINDS. More

  • in

    In-home wireless device tracks disease progression in Parkinson’s patients

    Parkinson’s disease is the fastest-growing neurological disease, now affecting more than 10 million people worldwide, yet clinicians still face huge challenges in tracking its severity and progression.

    Clinicians typically evaluate patients by testing their motor skills and cognitive functions during clinic visits. These semisubjective measurements are often skewed by outside factors — perhaps a patient is tired after a long drive to the hospital. More than 40 percent of individuals with Parkinson’s are never treated by a neurologist or Parkinson’s specialist, often because they live too far from an urban center or have difficulty traveling.

    In an effort to address these problems, researchers from MIT and elsewhere demonstrated an in-home device that can monitor a patient’s movement and gait speed, which can be used to evaluate Parkinson’s severity, the progression of the disease, and the patient’s response to medication.

    The device, which is about the size of a Wi-Fi router, gathers data passively using radio signals that reflect off the patient’s body as they move around their home. The patient does not need to wear a gadget or change their behavior. (A recent study, for example, showed that this type of device could be used to detect Parkinson’s from a person’s breathing patterns while sleeping.)

    The researchers used these devices to conduct a one-year at-home study with 50 participants. They showed that, by using machine-learning algorithms to analyze the troves of data they passively gathered (more than 200,000 gait speed measurements), a clinician could track Parkinson’s progression and medication response more effectively than they would with periodic, in-clinic evaluations.

    “By being able to have a device in the home that can monitor a patient and tell the doctor remotely about the progression of the disease, and the patient’s medication response so they can attend to the patient even if the patient can’t come to the clinic — now they have real, reliable information — that actually goes a long way toward improving equity and access,” says senior author Dina Katabi, the Thuan and Nicole Pham Professor in the Department of Electrical Engineering and Computer Science (EECS), and a principle investigator in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the MIT Jameel Clinic.

    The co-lead authors are EECS graduate students Yingcheng Liu and Guo Zhang. The research is published today in Science Translational Medicine.

    A human radar

    This work utilizes a wireless device previously developed in the Katabi lab that analyzes radio signals that bounce off people’s bodies. It transmits signals that use a tiny fraction of the power of a Wi-Fi router — these super-low-power signals don’t interfere with other wireless devices in the home. While radio signals pass through walls and other solid objects, they are reflected off humans due to the water in our bodies.  

    This creates a “human radar” that can track the movement of a person in a room. Radio waves always travel at the same speed, so the length of time it takes the signals to reflect back to the device indicates how the person is moving.

    The device incorporates a machine-learning classifier that can pick out the precise radio signals reflected off the patient even when there are other people moving around the room. Advanced algorithms use these movement data to compute gait speed — how fast the person is walking.

    Because the device operates in the background and runs all day, every day, it can collect a massive amount of data. The researchers wanted to see if they could apply machine learning to these datasets to gain insights about the disease over time.

    They gathered 50 participants, 34 of whom had Parkinson’s, and conducted a one-year study of in-home gait measurements Through the study, the researchers collected more than 200,000 individual measurements that they averaged to smooth out variability due to the conditions irrelevant to the disease. (For example, a patient may hurry up to answer an alarm or walk slower when talking on the phone.)

    They used statistical methods to analyze the data and found that in-home gait speed can be used to effectively track Parkinson’s progression and severity. For instance, they showed that gait speed declined almost twice as fast for individuals with Parkinson’s, compared to those without. 

    “Monitoring the patient continuously as they move around the room enabled us to get really good measurements of their gait speed. And with so much data, we were able to perform aggregation that allowed us to see very small differences,” Zhang says.

    Better, faster results

    Drilling down on these variabilities offered some key insights. For instance, the researchers showed that daily fluctuations in a patient’s walking speed correspond with how they are responding to their medication — walking speed may improve after a dose and then begin to decline after a few hours, as the medication impact wears off.

    “This enables us to objectively measure how your mobility responds to your medication. Previously, this was very cumbersome to do because this medication effect could only be measured by having the patient keep a journal,” Liu says.

    A clinician could use these data to adjust medication dosage more effectively and accurately. This is especially important since drugs used to treat disease symptoms can cause serious side effects if the patient receives too much.

    The researchers were able to demonstrate statistically significant results regarding Parkinson’s progression after studying 50 people for just one year. By contrast, an often-cited study by the Michael J. Fox Foundation involved more than 500 individuals and monitored them for more than five years, Katabi says.

    “For a pharmaceutical company or a biotech company trying to develop medicines for this disease, this could greatly reduce the burden and cost and speed up the development of new therapies,” she adds.

    Katabi credits much of the study’s success to the dedicated team of scientists and clinicians who worked together to tackle the many difficulties that arose along the way. For one, they began the study before the Covid-19 pandemic, so team members initially visited people’s homes to set up the devices. When that was no longer possible, they developed a user-friendly phone app to remotely help participants as they deployed the device at home.

    Through the course of the study, they learned to automate processes and reduce effort, especially for the participants and clinical team.

    This knowledge will prove useful as they look to deploy devices in at-home studies of other neurological disorders, such as Alzheimer’s, ALS, and Huntington’s. They also want to explore how these methods could be used, in conjunction with other work from the Katabi lab showing that Parkinson’s can be diagnosed by monitoring breathing, to collect a holistic set of markers that could diagnose the disease early and then be used to track and treat it.

    “This radio-wave sensor can enable more care (and research) to migrate from hospitals to the home where it is most desired and needed,” says Ray Dorsey, a professor of neurology at the University of Rochester Medical Center, co-author of Ending Parkinson’s, and a co-author of this research paper. “Its potential is just beginning to be seen. We are moving toward a day where we can diagnose and predict disease at home. In the future, we may even be able to predict and ideally prevent events like falls and heart attacks.”

    This work is supported, in part, by the National Institutes of Health and the Michael J. Fox Foundation. More

  • in

    Emma Gibson: Optimizing health care logistics in Africa

    Growing up in South Africa at the turn of the century, Emma Gibson saw the rise of the HIV/AIDS epidemic and its devastating impact on her home country, where many people lacked life-saving health care. At the time, Gibson was too young to understand what a sexually transmitted infection was, but she knew that HIV was infecting millions of South Africans and AIDS was taking hundreds of thousands of lives. “As a child, I was terrified by this monster that was HIV and felt so powerless to do anything about it,” she says.

    Now, as an adult, her childhood fear of the HIV epidemic has evolved into a desire to fight it. Gibson seeks to improve health care for HIV and other diseases in regions with limited resources, including South Africa. She wants to help health care facilities in these areas to use their resources more effectively so that patients can more easily obtain care.

    To help reach her goal, Gibson sought mathematics and logistics training through higher education in South Africa. She first earned her bachelor’s degree in mathematical sciences at the University of the Witwatersrand, and then her master’s degree in operations research at Stellenbosch University. There, she learned to tackle complex decision-making problems using math, statistics, and computer simulations.

    During her master’s, Gibson studied the operational challenges faced in rural South African health care facilities by working with staff at Zithulele Hospital in the Eastern Cape, one of the country’s poorest provinces. Her research focused on ways to reduce hours-long wait times for patients seeking same-day care. In the end, she developed a software tool to model patient congestion throughout the day and optimize staff schedules accordingly, enabling the hospital to care for its patients more efficiently.

    After completing her master’s, Gibson wanted to further her education outside of South Africa and left to pursue a PhD in operations research at MIT. Upon arrival, she branched out in her research and worked on a project to improve breast cancer treatment in U.S. health care, a very different environment from what she was used to.

    Two years later, Gibson had the opportunity to return to researching health care in resource-limited settings and began working with Jónas Jónasson, an associate professor at the MIT Sloan School of Management, on a new project to improve diagnostic services in sub-Saharan Africa. For the past four years, she has been working diligently on this project in collaboration with researchers at the Indian School of Business and Northwestern University. “My love language is time,” she says. “If I’m investing a lot of time in something, I really value it.”

    Scheduling sample transport

    Diagnostic testing is an essential tool that allows medical professionals to identify new diagnoses in patients and monitor patients’ conditions as they undergo treatment. For example, people living with HIV require regular blood tests to ensure that their prescribed treatments are working effectively and provide an early warning of potential treatment failures.

    For Gibson’s current project, she’s trying to improve diagnostic services in Malawi, a landlocked country in southeast Africa. “We have the tools” to diagnose and treat diseases like HIV, she says. “But in resource-limited settings, we often lack the money, the staff, and the infrastructure to reach every patient that needs them.”

    When diagnostic testing is needed, clinicians collect samples from patients and send the samples to be tested at a laboratory, which then returns the results to the facility where the patient is treated. To move these items between facilities and laboratories, Malawi has developed a national sample transportation network. The transportation system plays an important role in linking remote, rural facilities to laboratory services and ensuring that patients in these areas can access diagnostic testing through community clinics. Samples collected at these clinics are first transported to nearby district hubs, and then forwarded to laboratories located in urban areas. Since most facilities do not have computers or communications infrastructure, laboratories print copies of test results and send them back to facilities through the same transportation process.

    The sample transportation cycle is onerous, but it’s a practical solution to a difficult problem. “During the Covid pandemic, we saw how hard it was to scale up diagnostic infrastructure,” Gibson says. Diagnostic services in sub-Saharan Africa face “similar challenges, but in a much poorer setting.”

    In Malawi, sample transportation is managed by a  nongovernment organization called Riders 4 Health. The organization has around 80 couriers on motorcycles who transport samples and test results between facilities. “When we started working with [Riders], the couriers operated on fixed weekly schedules, visiting each site once or twice a week,” Gibson says. But that led to “a lot of unnecessary trips and delays.”

    To make sample transportation more efficient, Gibson developed a dynamic scheduling system that adapts to the current demand for diagnostic testing. The system consists of two main parts: an information sharing platform that aggregates sample transportation data, and an algorithm that uses the data to generate optimized routes and schedules for sample transport couriers.

    In 2019, Gibson ran a four-month-long pilot test for this system in three out of the 27 districts in Malawi. During the pilot study, six couriers transported over 20,000 samples and results across 51 health care facilities, and 150 health care workers participated in data sharing.

    The pilot was a success. Gibson’s dynamic scheduling system eliminated about half the unnecessary trips and reduced transportation delays by 25 percent — a delay that used to be four days was reduced to three. Now, Riders 4 Health is developing their own version of Gibson’s system to operate nationally in Malawi. Throughout this project, “we focused on making sure this was something that could grow with the organization,” she says. “It’s gratifying to see that actually happening.”

    Leveraging patient data

    Gibson is completing her MIT degree this September but will continue working to improve health care in Africa. After graduation, she will join the technology and analytics health care practice of an established company in South Africa. Her initial focus will be on public health care institutions, including Chris Hani Baragwanath Academic Hospital in Johannesburg, the third-largest hospital in the world.

    In this role, Gibson will work to fill in gaps in African patient data for medical operational research and develop ways to use this data more effectively to improve health care in resource-limited areas. For example, better data systems can help to monitor the prevalence and impact of different diseases, guiding where health care workers and researchers put their efforts to help the most people. “You can’t make good decisions if you don’t have all the information,” Gibson says.

    To best leverage patient data for improving health care, Gibson plans to reevaluate how data systems are structured and used in the hospital. For ideas on upgrading the current system, she’ll look to existing data systems in other countries to see what works and what doesn’t, while also drawing upon her past research experience in U.S. health care. Ultimately, she’ll tailor the new hospital data system to South African needs to accurately inform future directions in health care.

    Gibson’s new job — her “dream job” — will be based in the United Kingdom, but she anticipates spending a significant amount of time in Johannesburg. “I have so many opportunities in the wider world, but the ones that appeal to me are always back in the place I came from,” she says. More

  • in

    Teaching AI to ask clinical questions

    Physicians often query a patient’s electronic health record for information that helps them make treatment decisions, but the cumbersome nature of these records hampers the process. Research has shown that even when a doctor has been trained to use an electronic health record (EHR), finding an answer to just one question can take, on average, more than eight minutes.

    The more time physicians must spend navigating an oftentimes clunky EHR interface, the less time they have to interact with patients and provide treatment.

    Researchers have begun developing machine-learning models that can streamline the process by automatically finding information physicians need in an EHR. However, training effective models requires huge datasets of relevant medical questions, which are often hard to come by due to privacy restrictions. Existing models struggle to generate authentic questions — those that would be asked by a human doctor — and are often unable to successfully find correct answers.

    To overcome this data shortage, researchers at MIT partnered with medical experts to study the questions physicians ask when reviewing EHRs. Then, they built a publicly available dataset of more than 2,000 clinically relevant questions written by these medical experts.

    When they used their dataset to train a machine-learning model to generate clinical questions, they found that the model asked high-quality and authentic questions, as compared to real questions from medical experts, more than 60 percent of the time.

    With this dataset, they plan to generate vast numbers of authentic medical questions and then use those questions to train a machine-learning model which would help doctors find sought-after information in a patient’s record more efficiently.

    “Two thousand questions may sound like a lot, but when you look at machine-learning models being trained nowadays, they have so much data, maybe billions of data points. When you train machine-learning models to work in health care settings, you have to be really creative because there is such a lack of data,” says lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    The senior author is Peter Szolovits, a professor in the Department of Electrical Engineering and Computer Science (EECS) who heads the Clinical Decision-Making Group in CSAIL and is also a member of the MIT-IBM Watson AI Lab. The research paper, a collaboration between co-authors at MIT, the MIT-IBM Watson AI Lab, IBM Research, and the doctors and medical experts who helped create questions and participated in the study, will be presented at the annual conference of the North American Chapter of the Association for Computational Linguistics.

    “Realistic data is critical for training models that are relevant to the task yet difficult to find or create,” Szolovits says. “The value of this work is in carefully collecting questions asked by clinicians about patient cases, from which we are able to develop methods that use these data and general language models to ask further plausible questions.”

    Data deficiency

    The few large datasets of clinical questions the researchers were able to find had a host of issues, Lehman explains. Some were composed of medical questions asked by patients on web forums, which are a far cry from physician questions. Other datasets contained questions produced from templates, so they are mostly identical in structure, making many questions unrealistic.

    “Collecting high-quality data is really important for doing machine-learning tasks, especially in a health care context, and we’ve shown that it can be done,” Lehman says.

    To build their dataset, the MIT researchers worked with practicing physicians and medical students in their last year of training. They gave these medical experts more than 100 EHR discharge summaries and told them to read through a summary and ask any questions they might have. The researchers didn’t put any restrictions on question types or structures in an effort to gather natural questions. They also asked the medical experts to identify the “trigger text” in the EHR that led them to ask each question.

    For instance, a medical expert might read a note in the EHR that says a patient’s past medical history is significant for prostate cancer and hypothyroidism. The trigger text “prostate cancer” could lead the expert to ask questions like “date of diagnosis?” or “any interventions done?”

    They found that most questions focused on symptoms, treatments, or the patient’s test results. While these findings weren’t unexpected, quantifying the number of questions about each broad topic will help them build an effective dataset for use in a real, clinical setting, says Lehman.

    Once they had compiled their dataset of questions and accompanying trigger text, they used it to train machine-learning models to ask new questions based on the trigger text.

    Then the medical experts determined whether those questions were “good” using four metrics: understandability (Does the question make sense to a human physician?), triviality (Is the question too easily answerable from the trigger text?), medical relevance (Does it makes sense to ask this question based on the context?), and relevancy to the trigger (Is the trigger related to the question?).

    Cause for concern

    The researchers found that when a model was given trigger text, it was able to generate a good question 63 percent of the time, whereas a human physician would ask a good question 80 percent of the time.

    They also trained models to recover answers to clinical questions using the publicly available datasets they had found at the outset of this project. Then they tested these trained models to see if they could find answers to “good” questions asked by human medical experts.

    The models were only able to recover about 25 percent of answers to physician-generated questions.

    “That result is really concerning. What people thought were good-performing models were, in practice, just awful because the evaluation questions they were testing on were not good to begin with,” Lehman says.

    The team is now applying this work toward their initial goal: building a model that can automatically answer physicians’ questions in an EHR. For the next step, they will use their dataset to train a machine-learning model that can automatically generate thousands or millions of good clinical questions, which can then be used to train a new model for automatic question answering.

    While there is still much work to do before that model could be a reality, Lehman is encouraged by the strong initial results the team demonstrated with this dataset.

    This research was supported, in part, by the MIT-IBM Watson AI Lab. Additional co-authors include Leo Anthony Celi of the MIT Institute for Medical Engineering and Science; Preethi Raghavan and Jennifer J. Liang of the MIT-IBM Watson AI Lab; Dana Moukheiber of the University of Buffalo; Vladislav Lialin and Anna Rumshisky of the University of Massachusetts at Lowell; Katelyn Legaspi, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, and Pia Gabrielle I. Alfonso of the University of the Philippines; Anne Janelle R. Sy and Patricia Therese S. Pile of the University of the East Ramon Magsaysay Memorial Medical Center; Marianne Taliño of the Ateneo de Manila University School of Medicine and Public Health; and Byron C. Wallace of Northeastern University. More

  • in

    Exploring emerging topics in artificial intelligence policy

    Members of the public sector, private sector, and academia convened for the second AI Policy Forum Symposium last month to explore critical directions and questions posed by artificial intelligence in our economies and societies.

    The virtual event, hosted by the AI Policy Forum (AIPF) — an undertaking by the MIT Schwarzman College of Computing to bridge high-level principles of AI policy with the practices and trade-offs of governing — brought together an array of distinguished panelists to delve into four cross-cutting topics: law, auditing, health care, and mobility.

    In the last year there have been substantial changes in the regulatory and policy landscape around AI in several countries — most notably in Europe with the development of the European Union Artificial Intelligence Act, the first attempt by a major regulator to propose a law on artificial intelligence. In the United States, the National AI Initiative Act of 2020, which became law in January 2021, is providing a coordinated program across federal government to accelerate AI research and application for economic prosperity and security gains. Finally, China recently advanced several new regulations of its own.

    Each of these developments represents a different approach to legislating AI, but what makes a good AI law? And when should AI legislation be based on binding rules with penalties versus establishing voluntary guidelines?

    Jonathan Zittrain, professor of international law at Harvard Law School and director of the Berkman Klein Center for Internet and Society, says the self-regulatory approach taken during the expansion of the internet had its limitations with companies struggling to balance their interests with those of their industry and the public.

    “One lesson might be that actually having representative government take an active role early on is a good idea,” he says. “It’s just that they’re challenged by the fact that there appears to be two phases in this environment of regulation. One, too early to tell, and two, too late to do anything about it. In AI I think a lot of people would say we’re still in the ‘too early to tell’ stage but given that there’s no middle zone before it’s too late, it might still call for some regulation.”

    A theme that came up repeatedly throughout the first panel on AI laws — a conversation moderated by Dan Huttenlocher, dean of the MIT Schwarzman College of Computing and chair of the AI Policy Forum — was the notion of trust. “If you told me the truth consistently, I would say you are an honest person. If AI could provide something similar, something that I can say is consistent and is the same, then I would say it’s trusted AI,” says Bitange Ndemo, professor of entrepreneurship at the University of Nairobi and the former permanent secretary of Kenya’s Ministry of Information and Communication.

    Eva Kaili, vice president of the European Parliament, adds that “In Europe, whenever you use something, like any medication, you know that it has been checked. You know you can trust it. You know the controls are there. We have to achieve the same with AI.” Kalli further stresses that building trust in AI systems will not only lead to people using more applications in a safe manner, but that AI itself will reap benefits as greater amounts of data will be generated as a result.

    The rapidly increasing applicability of AI across fields has prompted the need to address both the opportunities and challenges of emerging technologies and the impact they have on social and ethical issues such as privacy, fairness, bias, transparency, and accountability. In health care, for example, new techniques in machine learning have shown enormous promise for improving quality and efficiency, but questions of equity, data access and privacy, safety and reliability, and immunology and global health surveillance remain at large.

    MIT’s Marzyeh Ghassemi, an assistant professor in the Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science, and David Sontag, an associate professor of electrical engineering and computer science, collaborated with Ziad Obermeyer, an associate professor of health policy and management at the University of California Berkeley School of Public Health, to organize AIPF Health Wide Reach, a series of sessions to discuss issues of data sharing and privacy in clinical AI. The organizers assembled experts devoted to AI, policy, and health from around the world with the goal of understanding what can be done to decrease barriers to access to high-quality health data to advance more innovative, robust, and inclusive research results while being respectful of patient privacy.

    Over the course of the series, members of the group presented on a topic of expertise and were tasked with proposing concrete policy approaches to the challenge discussed. Drawing on these wide-ranging conversations, participants unveiled their findings during the symposium, covering nonprofit and government success stories and limited access models; upside demonstrations; legal frameworks, regulation, and funding; technical approaches to privacy; and infrastructure and data sharing. The group then discussed some of their recommendations that are summarized in a report that will be released soon.

    One of the findings calls for the need to make more data available for research use. Recommendations that stem from this finding include updating regulations to promote data sharing to enable easier access to safe harbors such as the Health Insurance Portability and Accountability Act (HIPAA) has for de-identification, as well as expanding funding for private health institutions to curate datasets, amongst others. Another finding, to remove barriers to data for researchers, supports a recommendation to decrease obstacles to research and development on federally created health data. “If this is data that should be accessible because it’s funded by some federal entity, we should easily establish the steps that are going to be part of gaining access to that so that it’s a more inclusive and equitable set of research opportunities for all,” says Ghassemi. The group also recommends taking a careful look at the ethical principles that govern data sharing. While there are already many principles proposed around this, Ghassemi says that “obviously you can’t satisfy all levers or buttons at once, but we think that this is a trade-off that’s very important to think through intelligently.”

    In addition to law and health care, other facets of AI policy explored during the event included auditing and monitoring AI systems at scale, and the role AI plays in mobility and the range of technical, business, and policy challenges for autonomous vehicles in particular.

    The AI Policy Forum Symposium was an effort to bring together communities of practice with the shared aim of designing the next chapter of AI. In his closing remarks, Aleksander Madry, the Cadence Designs Systems Professor of Computing at MIT and faculty co-lead of the AI Policy Forum, emphasized the importance of collaboration and the need for different communities to communicate with each other in order to truly make an impact in the AI policy space.

    “The dream here is that we all can meet together — researchers, industry, policymakers, and other stakeholders — and really talk to each other, understand each other’s concerns, and think together about solutions,” Madry said. “This is the mission of the AI Policy Forum and this is what we want to enable.” More

  • in

    Artificial intelligence predicts patients’ race from their medical images

    The miseducation of algorithms is a critical problem; when artificial intelligence mirrors unconscious thoughts, racism, and biases of the humans who generated these algorithms, it can lead to serious harm. Computer programs, for example, have wrongly flagged Black defendants as twice as likely to reoffend as someone who’s white. When an AI used cost as a proxy for health needs, it falsely named Black patients as healthier than equally sick white ones, as less money was spent on them. Even AI used to write a play relied on using harmful stereotypes for casting. 

    Removing sensitive features from the data seems like a viable tweak. But what happens when it’s not enough? 

    Examples of bias in natural language processing are boundless — but MIT scientists have investigated another important, largely underexplored modality: medical images. Using both private and public datasets, the team found that AI can accurately predict self-reported race of patients from medical images alone. Using imaging data of chest X-rays, limb X-rays, chest CT scans, and mammograms, the team trained a deep learning model to identify race as white, Black, or Asian — even though the images themselves contained no explicit mention of the patient’s race. This is a feat even the most seasoned physicians cannot do, and it’s not clear how the model was able to do this. 

    In an attempt to tease out and make sense of the enigmatic “how” of it all, the researchers ran a slew of experiments. To investigate possible mechanisms of race detection, they looked at variables like differences in anatomy, bone density, resolution of images — and many more, and the models still prevailed with high ability to detect race from chest X-rays. “These results were initially confusing, because the members of our research team could not come anywhere close to identifying a good proxy for this task,” says paper co-author Marzyeh Ghassemi, an assistant professor in the MIT Department of Electrical Engineering and Computer Science and the Institute for Medical Engineering and Science (IMES), who is an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and of the MIT Jameel Clinic. “Even when you filter medical images past where the images are recognizable as medical images at all, deep models maintain a very high performance. That is concerning because superhuman capacities are generally much more difficult to control, regulate, and prevent from harming people.”

    In a clinical setting, algorithms can help tell us whether a patient is a candidate for chemotherapy, dictate the triage of patients, or decide if a movement to the ICU is necessary. “We think that the algorithms are only looking at vital signs or laboratory tests, but it’s possible they’re also looking at your race, ethnicity, sex, whether you’re incarcerated or not — even if all of that information is hidden,” says paper co-author Leo Anthony Celi, principal research scientist in IMES at MIT and associate professor of medicine at Harvard Medical School. “Just because you have representation of different groups in your algorithms, that doesn’t guarantee it won’t perpetuate or magnify existing disparities and inequities. Feeding the algorithms with more data with representation is not a panacea. This paper should make us pause and truly reconsider whether we are ready to bring AI to the bedside.” 

    The study, “AI recognition of patient race in medical imaging: a modeling study,” was published in Lancet Digital Health on May 11. Celi and Ghassemi wrote the paper alongside 20 other authors in four countries.

    To set up the tests, the scientists first showed that the models were able to predict race across multiple imaging modalities, various datasets, and diverse clinical tasks, as well as across a range of academic centers and patient populations in the United States. They used three large chest X-ray datasets, and tested the model on an unseen subset of the dataset used to train the model and a completely different one. Next, they trained the racial identity detection models for non-chest X-ray images from multiple body locations, including digital radiography, mammography, lateral cervical spine radiographs, and chest CTs to see whether the model’s performance was limited to chest X-rays. 

    The team covered many bases in an attempt to explain the model’s behavior: differences in physical characteristics between different racial groups (body habitus, breast density), disease distribution (previous studies have shown that Black patients have a higher incidence for health issues like cardiac disease), location-specific or tissue specific differences, effects of societal bias and environmental stress, the ability of deep learning systems to detect race when multiple demographic and patient factors were combined, and if specific image regions contributed to recognizing race. 

    What emerged was truly staggering: The ability of the models to predict race from diagnostic labels alone was much lower than the chest X-ray image-based models. 

    For example, the bone density test used images where the thicker part of the bone appeared white, and the thinner part appeared more gray or translucent. Scientists assumed that since Black people generally have higher bone mineral density, the color differences helped the AI models to detect race. To cut that off, they clipped the images with a filter, so the model couldn’t color differences. It turned out that cutting off the color supply didn’t faze the model — it still could accurately predict races. (The “Area Under the Curve” value, meaning the measure of the accuracy of a quantitative diagnostic test, was 0.94–0.96). As such, the learned features of the model appeared to rely on all regions of the image, meaning that controlling this type of algorithmic behavior presents a messy, challenging problem. 

    The scientists acknowledge limited availability of racial identity labels, which caused them to focus on Asian, Black, and white populations, and that their ground truth was a self-reported detail. Other forthcoming work will include potentially looking at isolating different signals before image reconstruction, because, as with bone density experiments, they couldn’t account for residual bone tissue that was on the images. 

    Notably, other work by Ghassemi and Celi led by MIT student Hammaad Adam has found that models can also identify patient self-reported race from clinical notes even when those notes are stripped of explicit indicators of race. Just as in this work, human experts are not able to accurately predict patient race from the same redacted clinical notes.

    “We need to bring social scientists into the picture. Domain experts, which are usually the clinicians, public health practitioners, computer scientists, and engineers are not enough. Health care is a social-cultural problem just as much as it’s a medical problem. We need another group of experts to weigh in and to provide input and feedback on how we design, develop, deploy, and evaluate these algorithms,” says Celi. “We need to also ask the data scientists, before any exploration of the data, are there disparities? Which patient groups are marginalized? What are the drivers of those disparities? Is it access to care? Is it from the subjectivity of the care providers? If we don’t understand that, we won’t have a chance of being able to identify the unintended consequences of the algorithms, and there’s no way we’ll be able to safeguard the algorithms from perpetuating biases.”

    “The fact that algorithms ‘see’ race, as the authors convincingly document, can be dangerous. But an important and related fact is that, when used carefully, algorithms can also work to counter bias,” says Ziad Obermeyer, associate professor at the University of California at Berkeley, whose research focuses on AI applied to health. “In our own work, led by computer scientist Emma Pierson at Cornell, we show that algorithms that learn from patients’ pain experiences can find new sources of knee pain in X-rays that disproportionately affect Black patients — and are disproportionately missed by radiologists. So just like any tool, algorithms can be a force for evil or a force for good — which one depends on us, and the choices we make when we build algorithms.”

    The work is supported, in part, by the National Institutes of Health. More