More stories

  • in

    System helps severely motor-impaired individuals type more quickly and accurately

    In 1995, French fashion magazine editor Jean-Dominique Bauby suffered a seizure while driving a car, which left him with a condition known as locked-in syndrome, a neurological disease in which the patient is completely paralyzed and can only move muscles that control the eyes.

    Bauby, who had signed a book contract shortly before his accident, wrote the memoir “The Diving Bell and the Butterfly” using a dictation system in which his speech therapist recited the alphabet and he would blink when she said the correct letter. They wrote the 130-page book one blink at a time.

    Technology has come a long way since Bauby’s accident. Many individuals with severe motor impairments caused by locked-in syndrome, cerebral palsy, amyotrophic lateral sclerosis, or other conditions can communicate using computer interfaces where they select letters or words in an onscreen grid by activating a single switch, often by pressing a button, releasing a puff of air, or blinking.

    But these row-column scanning systems are very rigid, and, similar to the technique used by Bauby’s speech therapist, they highlight each option one at a time, making them frustratingly slow for some users. And they are not suitable for tasks where options can’t be arranged in a grid, like drawing, browsing the web, or gaming.

    A more flexible system being developed by researchers at MIT places individual selection indicators next to each option on a computer screen. The indicators can be placed anywhere — next to anything someone might click with a mouse — so a user does not need to cycle through a grid of choices to make selections. The system, called Nomon, incorporates probabilistic reasoning to learn how users make selections, and then adjusts the interface to improve their speed and accuracy.

    Participants in a user study were able to type faster using Nomon than with a row-column scanning system. The users also performed better on a picture selection task, demonstrating how Nomon could be used for more than typing.

    “It is so cool and exciting to be able to develop software that has the potential to really help people. Being able to find those signals and turn them into communication as we are used to it is a really interesting problem,” says senior author Tamara Broderick, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems and the Institute for Data, Systems, and Society.

    Joining Broderick on the paper are lead author Nicholas Bonaker, an EECS graduate student; Emli-Mari Nel, head of innovation and machine learning at Averly and a visiting lecturer at the University of Witwatersrand in South Africa; and Keith Vertanen, an associate professor at Michigan Tech. The research is being presented at the ACM Conference on Human Factors in Computing Systems.

    On the clock

    In the Nomon interface, a small analog clock is placed next to every option the user can select. (A gnomon is the part of a sundial that casts a shadow.) The user looks at one option and then clicks their switch when that clock’s hand passes a red “noon” line. After each click, the system changes the phases of the clocks to separate the most probable next targets. The user clicks repeatedly until their target is selected.

    When used as a keyboard, Nomon’s machine-learning algorithms try to guess the next word based on previous words and each new letter as the user makes selections.

    Broderick developed a simplified version of Nomon several years ago but decided to revisit it to make the system easier for motor-impaired individuals to use. She enlisted the help of then-undergraduate Bonaker to redesign the interface.

    They first consulted nonprofit organizations that work with motor-impaired individuals, as well as a motor-impaired switch user, to gather feedback on the Nomon design.

    Then they designed a user study that would better represent the abilities of motor-impaired individuals. They wanted to make sure to thoroughly vet the system before using much of the valuable time of motor-impaired users, so they first tested on non-switch users, Broderick explains.

    Switching up the switch

    To gather more representative data, Bonaker devised a webcam-based switch that was harder to use than simply clicking a key. The non-switch users had to lean their bodies to one side of the screen and then back to the other side to register a click.

    “And they have to do this at precisely the right time, so it really slows them down. We did some empirical studies which showed that they were much closer to the response times of motor-impaired individuals,” Broderick says.

    They ran a 10-session user study with 13 non-switch participants and one single-switch user with an advanced form of spinal muscular dystrophy. In the first nine sessions, participants used Nomon and a row-column scanning interface for 20 minutes each to perform text entry, and in the 10th session they used the two systems for a picture selection task.

    Non-switch users typed 15 percent faster using Nomon, while the motor-impaired user typed even faster than the non-switch users. When typing unfamiliar words, the users were 20 percent faster overall and made half as many errors. In their final session, they were able to complete the picture selection task 36 percent faster using Nomon.

    “Nomon is much more forgiving than row-column scanning. With row-column scanning, even if you are just slightly off, now you’ve chosen B instead of A and that’s an error,” Broderick says.

    Adapting to noisy clicks

    With its probabilistic reasoning, Nomon incorporates everything it knows about where a user is likely to click to make the process faster, easier, and less error-prone. For instance, if the user selects “Q,” Nomon will make it as easy as possible for the user to select “U” next.

    Nomon also learns how a user clicks. So, if the user always clicks a little after the clock’s hand strikes noon, the system adapts to that in real time. It also adapts to noisiness. If a user’s click is often off the mark, the system requires extra clicks to ensure accuracy.

    This probabilistic reasoning makes Nomon powerful but also requires a higher click-load than row-column scanning systems. Clicking multiple times can be a trying task for severely motor-impaired users.

    Broderick hopes to reduce the click-load by incorporating gaze tracking into Nomon, which would give the system more robust information about what a user might choose next based on which part of the screen they are looking at. The researchers also want to find a better way to automatically adjust the clock speeds to help users be more accurate and efficient.

    They are working on a new series of studies in which they plan to partner with more motor-impaired users.

    “So far, the feedback from motor-impaired users has been invaluable to us; we’re very grateful to the motor-impaired user who commented on our initial interface and the separate motor-impaired user who participated in our study. We’re currently extending our study to work with a bigger and more diverse group of our target population. With their help, we’re already making further improvements to our interface and working to better understand the performance of Nomon,” she says.

    “Nonspeaking individuals with motor disabilities are currently not provided with efficient communication solutions for interacting with either speaking partners or computer systems. This ‘communication gap’ is a known unresolved problem in human-computer interaction, and so far there are no good solutions. This paper demonstrates that a highly creative approach underpinned by a statistical model can provide tangible performance gains to the users who need it the most: nonspeaking individuals reliant on a single switch to communicate,” says Per Ola Kristensson, professor of interactive systems engineering at Cambridge University, who was not involved with this research. “The paper also demonstrates the value of complementing insights from computational experiments with the involvement of end-users and other stakeholders in the design process. I find this a highly creative and important paper in an area where it is notoriously difficult to make significant progress.”

    This research was supported, in part, by the Seth Teller Memorial Fund to Advanced Technology for People with Disabilities, a Peter J. Eloranta Summer Undergraduate Research Fellowship, the MIT Quest for Intelligence, and the National Science Foundation. More

  • in

    When should someone trust an AI assistant’s predictions?

    In a busy hospital, a radiologist is using an artificial intelligence system to help her diagnose medical conditions based on patients’ X-ray images. Using the AI system can help her make faster diagnoses, but how does she know when to trust the AI’s predictions?

    She doesn’t. Instead, she may rely on her expertise, a confidence level provided by the system itself, or an explanation of how the algorithm made its prediction — which may look convincing but still be wrong — to make an estimation.

    To help people better understand when to trust an AI “teammate,” MIT researchers created an onboarding technique that guides humans to develop a more accurate understanding of those situations in which a machine makes correct predictions and those in which it makes incorrect predictions.

    By showing people how the AI complements their abilities, the training technique could help humans make better decisions or come to conclusions faster when working with AI agents.

    “We propose a teaching phase where we gradually introduce the human to this AI model so they can, for themselves, see its weaknesses and strengths,” says Hussein Mozannar, a graduate student in the Social and Engineering Systems doctoral program within the Institute for Data, Systems, and Society (IDSS) who is also a researcher with the Clinical Machine Learning Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Institute for Medical Engineering and Science. “We do this by mimicking the way the human will interact with the AI in practice, but we intervene to give them feedback to help them understand each interaction they are making with the AI.”

    Mozannar wrote the paper with Arvind Satyanarayan, an assistant professor of computer science who leads the Visualization Group in CSAIL; and senior author David Sontag, an associate professor of electrical engineering and computer science at MIT and leader of the Clinical Machine Learning Group. The research will be presented at the Association for the Advancement of Artificial Intelligence in February.

    Mental models

    This work focuses on the mental models humans build about others. If the radiologist is not sure about a case, she may ask a colleague who is an expert in a certain area. From past experience and her knowledge of this colleague, she has a mental model of his strengths and weaknesses that she uses to assess his advice.

    Humans build the same kinds of mental models when they interact with AI agents, so it is important those models are accurate, Mozannar says. Cognitive science suggests that humans make decisions for complex tasks by remembering past interactions and experiences. So, the researchers designed an onboarding process that provides representative examples of the human and AI working together, which serve as reference points the human can draw on in the future. They began by creating an algorithm that can identify examples that will best teach the human about the AI.

    “We first learn a human expert’s biases and strengths, using observations of their past decisions unguided by AI,” Mozannar says. “We combine our knowledge about the human with what we know about the AI to see where it will be helpful for the human to rely on the AI. Then we obtain cases where we know the human should rely on the AI and similar cases where the human should not rely on the AI.”

    The researchers tested their onboarding technique on a passage-based question answering task: The user receives a written passage and a question whose answer is contained in the passage. The user then has to answer the question and can click a button to “let the AI answer.” The user can’t see the AI answer in advance, however, requiring them to rely on their mental model of the AI. The onboarding process they developed begins by showing these examples to the user, who tries to make a prediction with the help of the AI system. The human may be right or wrong, and the AI may be right or wrong, but in either case, after solving the example, the user sees the correct answer and an explanation for why the AI chose its prediction. To help the user generalize from the example, two contrasting examples are shown that explain why the AI got it right or wrong.

    For instance, perhaps the training question asks which of two plants is native to more continents, based on a convoluted paragraph from a botany textbook. The human can answer on her own or let the AI system answer. Then, she sees two follow-up examples that help her get a better sense of the AI’s abilities. Perhaps the AI is wrong on a follow-up question about fruits but right on a question about geology. In each example, the words the system used to make its prediction are highlighted. Seeing the highlighted words helps the human understand the limits of the AI agent, explains Mozannar.

    To help the user retain what they have learned, the user then writes down the rule she infers from this teaching example, such as “This AI is not good at predicting flowers.” She can then refer to these rules later when working with the agent in practice. These rules also constitute a formalization of the user’s mental model of the AI.

    The impact of teaching

    The researchers tested this teaching technique with three groups of participants. One group went through the entire onboarding technique, another group did not receive the follow-up comparison examples, and the baseline group didn’t receive any teaching but could see the AI’s answer in advance.

    “The participants who received teaching did just as well as the participants who didn’t receive teaching but could see the AI’s answer. So, the conclusion there is they are able to simulate the AI’s answer as well as if they had seen it,” Mozannar says.

    The researchers dug deeper into the data to see the rules individual participants wrote. They found that almost 50 percent of the people who received training wrote accurate lessons of the AI’s abilities. Those who had accurate lessons were right on 63 percent of the examples, whereas those who didn’t have accurate lessons were right on 54 percent. And those who didn’t receive teaching but could see the AI answers were right on 57 percent of the questions.

    “When teaching is successful, it has a significant impact. That is the takeaway here. When we are able to teach participants effectively, they are able to do better than if you actually gave them the answer,” he says.

    But the results also show there is still a gap. Only 50 percent of those who were trained built accurate mental models of the AI, and even those who did were only right 63 percent of the time. Even though they learned accurate lessons, they didn’t always follow their own rules, Mozannar says.

    That is one question that leaves the researchers scratching their heads — even if people know the AI should be right, why won’t they listen to their own mental model? They want to explore this question in the future, as well as refine the onboarding process to reduce the amount of time it takes. They are also interested in running user studies with more complex AI models, particularly in health care settings.

    “When humans collaborate with other humans, we rely heavily on knowing what our collaborators’ strengths and weaknesses are — it helps us know when (and when not) to lean on the other person for assistance. I’m glad to see this research applying that principle to humans and AI,” says Carrie Cai, a staff research scientist in the People + AI Research and Responsible AI groups at Google, who was not involved with this research. “Teaching users about an AI’s strengths and weaknesses is essential to producing positive human-AI joint outcomes.” 

    This research was supported, in part, by the National Science Foundation. More

  • in

    Making machine learning more useful to high-stakes decision makers

    The U.S. Centers for Disease Control and Prevention estimates that one in seven children in the United States experienced abuse or neglect in the past year. Child protective services agencies around the nation receive a high number of reports each year (about 4.4 million in 2019) of alleged neglect or abuse. With so many cases, some agencies are implementing machine learning models to help child welfare specialists screen cases and determine which to recommend for further investigation.

    But these models don’t do any good if the humans they are intended to help don’t understand or trust their outputs.

    Researchers at MIT and elsewhere launched a research project to identify and tackle machine learning usability challenges in child welfare screening. In collaboration with a child welfare department in Colorado, the researchers studied how call screeners assess cases, with and without the help of machine learning predictions. Based on feedback from the call screeners, they designed a visual analytics tool that uses bar graphs to show how specific factors of a case contribute to the predicted risk that a child will be removed from their home within two years.

    The researchers found that screeners are more interested in seeing how each factor, like the child’s age, influences a prediction, rather than understanding the computational basis of how the model works. Their results also show that even a simple model can cause confusion if its features are not described with straightforward language.

    These findings could be applied to other high-risk fields where humans use machine learning models to help them make decisions, but lack data science experience, says senior author Kalyan Veeramachaneni, principal research scientist in the Laboratory for Information and Decision Systems (LIDS) and senior author of the paper.

    “Researchers who study explainable AI, they often try to dig deeper into the model itself to explain what the model did. But a big takeaway from this project is that these domain experts don’t necessarily want to learn what machine learning actually does. They are more interested in understanding why the model is making a different prediction than what their intuition is saying, or what factors it is using to make this prediction. They want information that helps them reconcile their agreements or disagreements with the model, or confirms their intuition,” he says.

    Co-authors include electrical engineering and computer science PhD student Alexandra Zytek, who is the lead author; postdoc Dongyu Liu; and Rhema Vaithianathan, professor of economics and director of the Center for Social Data Analytics at the Auckland University of Technology and professor of social data analytics at the University of Queensland. The research will be presented later this month at the IEEE Visualization Conference.

    Real-world research

    The researchers began the study more than two years ago by identifying seven factors that make a machine learning model less usable, including lack of trust in where predictions come from and disagreements between user opinions and the model’s output.

    With these factors in mind, Zytek and Liu flew to Colorado in the winter of 2019 to learn firsthand from call screeners in a child welfare department. This department is implementing a machine learning system developed by Vaithianathan that generates a risk score for each report, predicting the likelihood the child will be removed from their home. That risk score is based on more than 100 demographic and historic factors, such as the parents’ ages and past court involvements.

    “As you can imagine, just getting a number between one and 20 and being told to integrate this into your workflow can be a bit challenging,” Zytek says.

    They observed how teams of screeners process cases in about 10 minutes and spend most of that time discussing the risk factors associated with the case. That inspired the researchers to develop a case-specific details interface, which shows how each factor influenced the overall risk score using color-coded, horizontal bar graphs that indicate the magnitude of the contribution in a positive or negative direction.

    Based on observations and detailed interviews, the researchers built four additional interfaces that provide explanations of the model, including one that compares a current case to past cases with similar risk scores. Then they ran a series of user studies.

    The studies revealed that more than 90 percent of the screeners found the case-specific details interface to be useful, and it generally increased their trust in the model’s predictions. On the other hand, the screeners did not like the case comparison interface. While the researchers thought this interface would increase trust in the model, screeners were concerned it could lead to decisions based on past cases rather than the current report.   

    “The most interesting result to me was that, the features we showed them — the information that the model uses — had to be really interpretable to start. The model uses more than 100 different features in order to make its prediction, and a lot of those were a bit confusing,” Zytek says.

    Keeping the screeners in the loop throughout the iterative process helped the researchers make decisions about what elements to include in the machine learning explanation tool, called Sibyl.

    As they refined the Sibyl interfaces, the researchers were careful to consider how providing explanations could contribute to some cognitive biases, and even undermine screeners’ trust in the model.

    For instance, since explanations are based on averages in a database of child abuse and neglect cases, having three past abuse referrals may actually decrease the risk score of a child, since averages in this database may be far higher. A screener may see that explanation and decide not to trust the model, even though it is working correctly, Zytek explains. And because humans tend to put more emphasis on recent information, the order in which the factors are listed could also influence decisions.

    Improving interpretability

    Based on feedback from call screeners, the researchers are working to tweak the explanation model so the features that it uses are easier to explain.

    Moving forward, they plan to enhance the interfaces they’ve created based on additional feedback and then run a quantitative user study to track the effects on decision making with real cases. Once those evaluations are complete, they can prepare to deploy Sibyl, Zytek says.

    “It was especially valuable to be able to work so actively with these screeners. We got to really understand the problems they faced. While we saw some reservations on their part, what we saw more of was excitement about how useful these explanations were in certain cases. That was really rewarding,” she says.

    This work is supported, in part, by the National Science Foundation. More