    Avoiding shortcut solutions in artificial intelligence

    If your Uber driver takes a shortcut, you might get to your destination faster. But if a machine learning model takes a shortcut, it might fail in unexpected ways.

    In machine learning, a shortcut solution occurs when the model relies on a simple characteristic of a dataset to make a decision, rather than learning the true essence of the data, which can lead to inaccurate predictions. For example, a model might learn to identify images of cows by focusing on the green grass that appears in the photos, rather than the more complex shapes and patterns of the cows.  

    A new study by researchers at MIT explores the problem of shortcuts in a popular machine-learning method and proposes a solution that can prevent shortcuts by forcing the model to use more data in its decision-making.

    By removing the simpler characteristics the model is focusing on, the researchers force it to focus on more complex features of the data that it hadn’t been considering. Then, by asking the model to solve the same task two ways — once using those simpler features, and then also using the complex features it has now learned to identify — they reduce the tendency for shortcut solutions and boost the performance of the model.

    One potential application of this work is to enhance the effectiveness of machine learning models that are used to identify disease in medical images. Shortcut solutions in this context could lead to false diagnoses and have dangerous implications for patients.

    “It is still difficult to tell why deep networks make the decisions that they do, and in particular, which parts of the data these networks choose to focus upon when making a decision. If we can understand how shortcuts work in further detail, we can go even farther to answer some of the fundamental but very practical questions that are really important to people who are trying to deploy these networks,” says Joshua Robinson, a PhD student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

    Robinson wrote the paper with his advisors, senior author Suvrit Sra, the Esther and Harold E. Edgerton Career Development Associate Professor in the Department of Electrical Engineering and Computer Science (EECS) and a core member of the Institute for Data, Systems, and Society (IDSS) and the Laboratory for Information and Decision Systems; and Stefanie Jegelka, the X-Consortium Career Development Associate Professor in EECS and a member of CSAIL and IDSS; as well as University of Pittsburgh assistant professor Kayhan Batmanghelich and PhD students Li Sun and Ke Yu. The research will be presented at the Conference on Neural Information Processing Systems in December. 

    The long road to understanding shortcuts

    The researchers focused their study on contrastive learning, which is a powerful form of self-supervised machine learning. In self-supervised machine learning, a model is trained using raw data that do not have label descriptions from humans. It can therefore be used successfully for a larger variety of data.

    A self-supervised learning model learns useful representations of data, which are used as inputs for different tasks, like image classification. But if the model takes shortcuts and fails to capture important information, these tasks won’t be able to use that information either.

    For example, if a self-supervised learning model is trained to classify pneumonia in X-rays from a number of hospitals, but it learns to make predictions based on a tag that identifies the hospital the scan came from (because some hospitals have more pneumonia cases than others), the model won’t perform well when it is given data from a new hospital.     

    For contrastive learning models, an encoder algorithm is trained to discriminate between pairs of similar inputs and pairs of dissimilar inputs. This process encodes rich and complex data, like images, in a way that the contrastive learning model can interpret.
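
To make the idea of discriminating between similar and dissimilar pairs concrete, here is a minimal sketch of a common contrastive objective (an InfoNCE-style loss). The article does not say which loss the researchers used, so the function name `info_nce_loss`, the `temperature` value, and the toy embeddings are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor embedding (illustrative only).

    The loss is small when the anchor is much closer to its similar
    (positive) example than to any dissimilar (negative) example.
    """
    anchor = normalize(anchor)
    pos_logit = dot(anchor, normalize(positive)) / temperature
    logits = [pos_logit] + [
        dot(anchor, normalize(n)) / temperature for n in negatives
    ]
    # Numerically stable log-sum-exp over the positive and all negatives.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - pos_logit

# Hypothetical 2-D embeddings: the same anchor and positive, scored against
# negatives that are far from the anchor and negatives that are close to it.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
easy_loss = info_nce_loss(anchor, positive, [[-1.0, 0.0], [0.0, -1.0]])
hard_loss = info_nce_loss(anchor, positive, [[0.8, 0.6], [0.7, -0.7]])
```

In this toy setup, negatives that sit close to the anchor produce a larger loss than distant ones, which is one way to picture what a harder discrimination task means.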

    The researchers tested contrastive learning encoders with a series of images and found that, during this training procedure, they also fall prey to shortcut solutions. The encoders tend to focus on the simplest features of an image to decide which pairs of inputs are similar and which are dissimilar. Ideally, the encoder should focus on all the useful characteristics of the data when making a decision, Jegelka says.

    So, the team made it harder to tell the difference between the similar and dissimilar pairs, and found that this changes which features the encoder will look at to make a decision.

    “If you make the task of discriminating between similar and dissimilar items harder and harder, then your system is forced to learn more meaningful information in the data, because without learning that it cannot solve the task,” she says.

    But increasing this difficulty resulted in a tradeoff — the encoder got better at focusing on some features of the data but became worse at focusing on others. It almost seemed to forget the simpler features, Robinson says.

    To avoid this tradeoff, the researchers asked the encoder to discriminate between the pairs in two ways: as it had originally, using the simpler features, and again after the researchers removed the information it had already learned. Solving the task both ways simultaneously caused the encoder to improve across all features.

    Their method, called implicit feature modification, adaptively modifies samples to remove the simpler features the encoder is using to discriminate between the pairs. The technique does not rely on human input, which is important because real-world data sets can have hundreds of different features that could combine in complex ways, Sra explains.
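
The two-way training described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the researchers' implementation: a fixed margin `epsilon` applied to similarity scores stands in for the adaptive removal of simple features, and the equal weighting of the two losses, the `temperature`, and all function names are hypothetical.

```python
import math

def contrastive_loss(pos_sim, neg_sims, temperature=0.5, epsilon=0.0):
    """Contrastive loss computed from precomputed similarity scores.

    With epsilon > 0 the task is made harder: the similar pair looks
    less alike and each dissimilar pair looks more alike. This is an
    assumed stand-in for modifying the samples themselves.
    """
    pos_logit = (pos_sim - epsilon) / temperature
    logits = [pos_logit] + [(s + epsilon) / temperature for s in neg_sims]
    # Numerically stable log-sum-exp.
    m = max(logits)
    return m + math.log(sum(math.exp(l - m) for l in logits)) - pos_logit

def two_way_loss(pos_sim, neg_sims, temperature=0.5, epsilon=0.1):
    """Average of the original task and the harder, modified task."""
    original = contrastive_loss(pos_sim, neg_sims, temperature)
    modified = contrastive_loss(pos_sim, neg_sims, temperature, epsilon)
    return 0.5 * (original + modified)

# Hypothetical similarity scores for one anchor.
pos_sim, neg_sims = 0.9, [0.2, -0.1, 0.4]
original = contrastive_loss(pos_sim, neg_sims)
modified = contrastive_loss(pos_sim, neg_sims, epsilon=0.1)
combined = two_way_loss(pos_sim, neg_sims, epsilon=0.1)
```

Averaging the two terms keeps the original task in the objective, which matches the article's point that the encoder should not forget the simpler features while learning the harder ones.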

    From cars to COPD

    The researchers ran one test of this method using images of vehicles. They used implicit feature modification to adjust the color, orientation, and vehicle type to make it harder for the encoder to discriminate between similar and dissimilar pairs of images. The encoder improved its accuracy across all three features — texture, shape, and color — simultaneously.

    To see if the method would stand up to more complex data, the researchers also tested it with samples from a medical image database of chronic obstructive pulmonary disease (COPD). Again, the method led to simultaneous improvements across all features they evaluated.

    While this work takes some important steps forward in understanding the causes of shortcut solutions and working to solve them, the researchers say that continuing to refine these methods and applying them to other types of self-supervised learning will be key to future advancements.

    “This ties into some of the biggest questions about deep learning systems, like ‘Why do they fail?’ and ‘Can we know in advance the situations where your model will fail?’ There is still a lot farther to go if you want to understand shortcut learning in its full generality,” Robinson says.

    This research is supported by the National Science Foundation, National Institutes of Health, and the Pennsylvania Department of Health’s SAP SE Commonwealth Universal Research Enhancement (CURE) program.

    Study: Global cancer risk from burning organic matter comes from unregulated chemicals

    Whenever organic matter is burned, such as in a wildfire, a power plant, a car’s exhaust, or in daily cooking, the combustion releases polycyclic aromatic hydrocarbons (PAHs) — a class of pollutants that is known to cause lung cancer.

    There are more than 100 known types of PAH compounds emitted daily into the atmosphere. Regulators, however, have historically relied on measurements of a single compound, benzo(a)pyrene, to gauge a community’s risk of developing cancer from PAH exposure. Now MIT scientists have found that benzo(a)pyrene may be a poor indicator of this type of cancer risk.

    In a modeling study appearing today in the journal GeoHealth, the team reports that benzo(a)pyrene plays a small part — about 11 percent — in the global risk of developing PAH-associated cancer. Instead, 89 percent of that cancer risk comes from other PAH compounds, many of which are not directly regulated.

    Interestingly, about 17 percent of PAH-associated cancer risk comes from “degradation products” — chemicals that are formed when emitted PAHs react in the atmosphere. Many of these degradation products can in fact be more toxic than the emitted PAH from which they formed.

    The team hopes the results will encourage scientists and regulators to look beyond benzo(a)pyrene, to consider a broader class of PAHs when assessing a community’s cancer risk.

    “Most of the regulatory science and standards for PAHs are based on benzo(a)pyrene levels. But that is a big blind spot that could lead you down a very wrong path in terms of assessing whether cancer risk is improving or not, and whether it’s relatively worse in one place than another,” says study author Noelle Selin, a professor in MIT’s Institute for Data, Systems and Society, and the Department of Earth, Atmospheric and Planetary Sciences.

    Selin’s MIT co-authors include Jesse Kroll, Amy Hrdina, Ishwar Kohale, Forest White, Bevin Engelward, and Jamie Kelly (who is now at University College London). Peter Ivatt and Mathew Evans at the University of York are also co-authors.

    Chemical pixels

    Benzo(a)pyrene has historically been the poster chemical for PAH exposure. The compound’s indicator status is largely based on early toxicology studies. But recent research suggests the chemical may not be the PAH representative that regulators have long relied upon.   

    “There has been a bit of evidence suggesting benzo(a)pyrene may not be very important, but this was from just a few field studies,” says Kelly, a former postdoc in Selin’s group and the study’s lead author.

    Kelly and his colleagues instead took a systematic approach to evaluate benzo(a)pyrene’s suitability as a PAH indicator. The team began by using GEOS-Chem, a global, three-dimensional chemical transport model that breaks the world into individual grid boxes and simulates within each box the reactions and concentrations of chemicals in the atmosphere.

    They extended this model to include chemical descriptions of how various PAH compounds, including benzo(a)pyrene, would react in the atmosphere. The team then plugged in recent data from emissions inventories and meteorological observations, and ran the model forward to simulate the concentrations of various PAH chemicals around the world over time.

    Risky reactions

    In their simulations, the researchers started with 16 relatively well-studied PAH chemicals, including benzo(a)pyrene, and traced the concentrations of these chemicals, plus the concentration of their degradation products over two generations, or chemical transformations. In total, the team evaluated 48 PAH species.

    They then compared these concentrations with actual concentrations of the same chemicals, recorded by monitoring stations around the world. This comparison was close enough to show that the model’s concentration predictions were realistic.

    Then, within each of the model’s grid boxes, the researchers related the concentration of each PAH chemical to its associated cancer risk. To do this, they had to develop a new method, based on previous studies in the literature, that avoids double-counting risk from the different chemicals. Finally, they overlaid population density maps to predict the number of cancer cases globally, based on the concentration and toxicity of a specific PAH chemical in each location.

    Dividing the cancer cases by population produced the cancer risk associated with that chemical. In this way, the team calculated the cancer risk for each of the 48 compounds, then determined each chemical’s individual contribution to the total risk.
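
The arithmetic in the last two paragraphs can be sketched as below. This is a simplified illustration, not the study's method: it assumes risk scales linearly with concentration through a per-compound `unit_risk` factor, it ignores the double-counting correction the team developed, and every name and number (`compound_A`, `compound_B`, the grid boxes, the populations) is made up for the example.

```python
def population_weighted_risk(concentrations, unit_risk, populations):
    """Predicted cases summed over grid boxes, divided by total population."""
    cases = sum(
        c * unit_risk * p for c, p in zip(concentrations, populations)
    )
    return cases / sum(populations)

def risk_shares(risks):
    """Each compound's fraction of the total risk."""
    total = sum(risks.values())
    return {name: r / total for name, r in risks.items()}

# Three hypothetical grid boxes and their populations.
populations = [1_000_000, 250_000, 4_000_000]

# Toy values only: compound_B is ten times less concentrated than
# compound_A in every box, but twenty times more toxic per unit.
compounds = {
    "compound_A": {"unit_risk": 1e-4, "conc": [0.5, 0.1, 1.0]},
    "compound_B": {"unit_risk": 2e-3, "conc": [0.05, 0.02, 0.1]},
}

risks = {
    name: population_weighted_risk(d["conc"], d["unit_risk"], populations)
    for name, d in compounds.items()
}
shares = risk_shares(risks)
```

With these toy inputs the less concentrated but more toxic compound accounts for the larger share of the risk, the same kind of effect the researchers report for degradation products.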

    This analysis revealed that benzo(a)pyrene had a surprisingly small contribution, of about 11 percent, to the overall risk of developing cancer from PAH exposure globally. Eighty-nine percent of cancer risk came from other chemicals. And 17 percent of this risk arose from degradation products.

    “We see places where you can find concentrations of benzo(a)pyrene are lower, but the risk is higher because of these degradation products,” Selin says. “These products can be orders of magnitude more toxic, so the fact that they’re at tiny concentrations doesn’t mean you can write them off.”

    When the researchers compared calculated PAH-associated cancer risks around the world, they found significant differences depending on whether that risk calculation was based solely on concentrations of benzo(a)pyrene or on a region’s broader mix of PAH compounds.

    “If you use the old method, you would find the lifetime cancer risk is 3.5 times higher in Hong Kong versus southern India, but taking into account the differences in PAH mixtures, you get a difference of 12 times,” Kelly says. “So, there’s a big difference in the relative cancer risk between the two places. And we think it’s important to expand the group of compounds that regulators are thinking about, beyond just a single chemical.”

    The team’s study “provides an excellent contribution to better understanding these ubiquitous pollutants,” says Elisabeth Galarneau, an air quality expert and PhD research scientist in Canada’s Department of the Environment. “It will be interesting to see how these results compare to work being done elsewhere … to pin down which (compounds) need to be tracked and considered for the protection of human and environmental health.”

    This research was conducted in MIT’s Superfund Research Center and is supported in part by the National Institute of Environmental Health Sciences Superfund Basic Research Program, and the National Institutes of Health.