  • Taming the data deluge

    An oncoming tsunami of data threatens to overwhelm huge data-rich research projects in areas ranging from the tiny neutrino to an exploding supernova, as well as the mysteries deep within the brain.

    When LIGO picks up a gravitational-wave signal from a distant collision of black holes and neutron stars, a clock starts ticking for capturing the earliest possible light that may accompany it: time is of the essence in this race. Data collected from electrical sensors monitoring brain activity are outpacing computing capacity. Information from the particle beams smashed together at the Large Hadron Collider (LHC) will soon exceed 1 petabit per second.

    To tackle this approaching data bottleneck in real time, a team of researchers from nine institutions led by the University of Washington, including MIT, has received $15 million in funding to establish the Accelerated AI Algorithms for Data-Driven Discovery (A3D3) Institute. From MIT, the research team includes Philip Harris, assistant professor of physics, who will serve as the deputy director of the A3D3 Institute; Song Han, assistant professor of electrical engineering and computer science, who will serve as A3D3’s co-PI; and Erik Katsavounidis, senior research scientist with the MIT Kavli Institute for Astrophysics and Space Research.

    Supported by this five-year Harnessing the Data Revolution Big Idea grant, and jointly funded by the Office of Advanced Cyberinfrastructure, A3D3 will focus on three data-rich fields: multi-messenger astrophysics, high-energy particle physics, and brain imaging neuroscience. By pairing AI algorithms with new processors, A3D3 seeks to speed up the algorithms used to solve fundamental problems in collider physics, neutrino physics, astronomy, gravitational-wave physics, computer science, and neuroscience.

    “I am very excited about the new Institute’s opportunities for research in nuclear and particle physics,” says Laboratory for Nuclear Science Director Boleslaw Wyslouch. “Modern particle detectors produce an enormous amount of data, and we are looking for extraordinarily rare signatures. The application of extremely fast processors to sift through these mountains of data will make a huge difference in what we will measure and discover.”

    The seeds of A3D3 were planted in 2017, when Harris and his colleagues at Fermilab and CERN decided to integrate real-time AI algorithms to process the incredible rates of data at the LHC. Through email correspondence with Han, Harris’ team built a compiler, HLS4ML, that could run an AI algorithm in nanoseconds.

    “Before the development of HLS4ML, the fastest processing that we knew of was roughly a millisecond per AI inference, maybe a little faster,” says Harris. “We realized all the AI algorithms were designed to solve much slower problems, such as image and voice recognition. To get to nanosecond inference timescales, we recognized we could make smaller algorithms and rely on custom implementations with Field Programmable Gate Array (FPGA) processors in an approach that was largely different from what others were doing.”

    A few months later, Harris presented their research at a physics faculty meeting, where Katsavounidis became intrigued. Over coffee in Building 7, they discussed combining Harris’ FPGA approach with Katsavounidis’s use of machine learning for finding gravitational waves. FPGAs and other new processor types, such as graphics processing units (GPUs), accelerate AI algorithms so that huge amounts of data can be analyzed more quickly.

    “I had worked with the first FPGAs that were out in the market in the early ’90s and have witnessed first-hand how they revolutionized front-end electronics and data acquisition in big high-energy physics experiments I was working on back then,” recalls Katsavounidis. “The ability to have them crunch gravitational-wave data has been in the back of my mind since joining LIGO over 20 years ago.”

    Two years ago they received their first grant, and the University of Washington’s Shih-Chieh Hsu joined in. The team initiated the Fast Machine Lab, published about 40 papers on the subject, built the group to about 50 researchers, and “launched a whole industry of how to explore a region of AI that has not been explored in the past,” says Harris. “We basically started this without any funding. We’ve been getting small grants for various projects over the years. A3D3 represents our first large grant to support this effort.”  

    “What makes A3D3 so special and suited to MIT is its exploration of a technical frontier, where AI is implemented not in high-level software, but rather in lower-level firmware, reconfiguring individual gates to address the scientific question at hand,” says Rob Simcoe, director of MIT Kavli Institute for Astrophysics and Space Research and the Francis Friedman Professor of Physics. “We are in an era where experiments generate torrents of data. The acceleration gained from tailoring reprogrammable, bespoke computers at the processor level can advance real-time analysis of these data to new levels of speed and sophistication.”

    The Huge Data from the Large Hadron Collider 

    With data rates already exceeding 500 terabits per second, the LHC processes more data than any other scientific instrument on Earth. Its aggregate data rates will soon exceed 1 petabit per second, the biggest data rate in the world.

    “Through the use of AI, A3D3 aims to perform advanced analyses, such as anomaly detection, and particle reconstruction on all collisions happening 40 million times per second,” says Harris.

    The goal is to find, within all of this data, a way to identify the few collisions out of the 3.2 billion collisions per second that could reveal new forces, explain how dark matter is formed, and complete the picture of how fundamental forces interact with matter. Processing all of this information requires a customized computing system capable of interpreting the collider information with ultra-low latency.
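
    The rates quoted above can be turned into a rough per-crossing budget. The sketch below uses only figures from the article, and it assumes (the article does not say so explicitly) that the 40 million figure counts beam crossings while the 3.2 billion figure counts individual collisions. The function name is hypothetical.

```python
# Figures quoted in the article.
CROSSINGS_PER_SECOND = 40_000_000          # "40 million times per second"
COLLISIONS_PER_SECOND = 3_200_000_000      # "3.2 billion collisions per second"
FUTURE_BITS_PER_SECOND = 10**15            # "1 petabit per second"


def per_crossing_budget():
    """Derive per-crossing quantities from the quoted aggregate rates.

    Assumption (not stated in the article): the 40 million figure counts
    beam crossings, and the 3.2 billion figure counts individual collisions.
    """
    return {
        "ns_between_crossings": 1e9 / CROSSINGS_PER_SECOND,
        "collisions_per_crossing": COLLISIONS_PER_SECOND / CROSSINGS_PER_SECOND,
        "megabits_per_crossing": FUTURE_BITS_PER_SECOND / CROSSINGS_PER_SECOND / 1e6,
    }


if __name__ == "__main__":
    for name, value in per_crossing_budget().items():
        print(f"{name}: {value:g}")
```

    Under these assumptions, a new crossing arrives every 25 nanoseconds, carrying about 80 collisions and roughly 25 megabits of data at the future 1 petabit-per-second rate.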

    “The challenge of running this on all of the 100s of terabits per second in real-time is daunting and requires a complete overhaul of how we design and implement AI algorithms,” says Harris. “With large increases in the detector resolution leading to data rates that are even larger, the challenge of finding the one collision, among many, will become even more daunting.”

    The Brain and the Universe

    Thanks to advances in techniques such as medical imaging and electrical recordings from implanted electrodes, neuroscience is also gathering larger amounts of data on how the brain’s neural networks respond to stimuli and process motor information. A3D3 plans to develop and implement high-throughput, low-latency AI algorithms to process, organize, and analyze massive neural datasets in real time, probing brain function to enable new experiments and therapies.

    With Multi-Messenger Astrophysics (MMA), A3D3 aims to quickly identify astronomical events by efficiently processing data from gravitational waves, gamma-ray bursts, and neutrinos picked up by telescopes and detectors. 

    A3D3 also includes a multidisciplinary group of 15 other researchers from the project lead, the University of Washington, along with Caltech, Duke University, Purdue University, UC San Diego, University of Illinois Urbana-Champaign, University of Minnesota, and the University of Wisconsin-Madison. The institute will include neutrino research at IceCube and DUNE and visible astronomy at the Zwicky Transient Facility, and it will organize deep-learning workshops and boot camps to train students and researchers on how to contribute to the framework and widen the use of fast AI strategies.

    “We have reached a point where detector network growth will be transformative, both in terms of event rates and in terms of astrophysical reach and ultimately, discoveries,” says Katsavounidis. “‘Fast’ and ‘efficient’ is the only way to fight the ‘faint’ and ‘fuzzy’ that is out there in the universe, and the path for getting the most out of our detectors. A3D3 on one hand is going to bring production-scale AI to gravitational-wave physics and multi-messenger astronomy; but on the other hand, we aspire to go beyond our immediate domains and become the go-to place across the country for applications of accelerated AI to data-driven disciplines.”

  • Making data visualizations more accessible

    In the early days of the Covid-19 pandemic, the Centers for Disease Control and Prevention produced a simple chart to illustrate how measures like mask wearing and social distancing could “flatten the curve” and reduce the peak of infections.

    The chart was amplified by news sites and shared on social media platforms, but it often lacked a corresponding text description to make it accessible for blind individuals who use a screen reader to navigate the web, shutting out many of the 253 million people worldwide who have visual disabilities.

    This alternative text is often missing from online charts, and even when it is included, it is frequently uninformative or even incorrect, according to qualitative data gathered by scientists at MIT.

    These researchers conducted a study with blind and sighted readers to determine which text is useful to include in a chart description, which text is not, and why. Ultimately, they found that captions for blind readers should focus on the overall trends and statistics in the chart, not its design elements or higher-level insights.

    They also created a conceptual model that can be used to evaluate a chart description, whether the text was generated automatically by software or manually by a human author. Their work could help journalists, academics, and communicators create descriptions that are more effective for blind individuals and guide researchers as they develop better tools to automatically generate captions.

    “Ninety-nine-point-nine percent of images on Twitter lack any kind of description — and that is not hyperbole, that is the actual statistic,” says Alan Lundgard, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper. “Having people manually author those descriptions seems to be difficult for a variety of reasons. Perhaps semiautonomous tools could help with that. But it is crucial to do this preliminary participatory design work to figure out what is the target for these tools, so we are not generating content that is either not useful to its intended audience or, in the worst case, erroneous.”

    Lundgard wrote the paper with senior author Arvind Satyanarayan, an assistant professor of computer science who leads the Visualization Group in CSAIL. The research will be presented at the Institute of Electrical and Electronics Engineers Visualization Conference in October.

    Evaluating visualizations

    To develop the conceptual model, the researchers planned to begin by studying graphs featured by popular online publications such as FiveThirtyEight and NYTimes.com, but they ran into a problem — those charts mostly lacked any textual descriptions. So instead, they collected descriptions for these charts from graduate students in an MIT data visualization class and through an online survey, then grouped the captions into four categories.

    Level 1 descriptions focus on the elements of the chart, such as its title, legend, and colors. Level 2 descriptions describe statistical content, like the minimum, maximum, or correlations. Level 3 descriptions cover perceptual interpretations of the data, like complex trends or clusters. Level 4 descriptions include subjective interpretations that go beyond the data and draw on the author’s knowledge.
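
    The four levels can be sketched as a small data structure plus a generator that assembles a description from whichever levels are requested. This is an illustrative sketch only, not the researchers' model: the `Level` enum, the `describe` helper, and the sentence templates are all hypothetical, and the defaults of levels 2 and 3 reflect the article's summary that descriptions for blind readers should focus on trends and statistics.

```python
from enum import IntEnum
from statistics import mean


class Level(IntEnum):
    ELEMENTS = 1     # chart title, legend, colors
    STATISTICS = 2   # minimum, maximum, correlations
    PERCEPTUAL = 3   # trends, clusters
    CONTEXTUAL = 4   # subjective interpretation drawing on the author's knowledge


def describe(title, x_label, xs, ys, levels=(Level.STATISTICS, Level.PERCEPTUAL)):
    """Build a description of a single-series chart from the requested levels.

    Hypothetical helper for illustration; the sentence templates are assumptions.
    """
    if Level.CONTEXTUAL in levels:
        # Level 4 depends on outside knowledge, so it cannot be derived from the data here.
        raise ValueError("level 4 content must be written by an author")
    parts = []
    if Level.ELEMENTS in levels:
        parts.append(f"A chart titled '{title}', plotted by {x_label}.")
    if Level.STATISTICS in levels:
        parts.append(f"Values range from {min(ys)} to {max(ys)} (mean {mean(ys):.1f}).")
    if Level.PERCEPTUAL in levels:
        # Locate the x position of the largest value and compare the endpoints.
        peak_x = xs[ys.index(max(ys))]
        if ys[-1] > ys[0]:
            trend = "rise overall"
        elif ys[-1] < ys[0]:
            trend = "fall overall"
        else:
            trend = "end where they started"
        parts.append(f"Values {trend}, peaking at {x_label} {peak_x}.")
    return " ".join(parts)


if __name__ == "__main__":
    # Made-up numbers, used only to exercise the function.
    print(describe("Weekly cases", "week", [1, 2, 3, 4, 5], [10, 40, 90, 60, 20]))
```

    With the made-up series above, the default call yields a level 2 sentence about the range and mean followed by a level 3 sentence about the overall trend and peak, and it omits the level 1 chart-element sentence unless that level is requested.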

    In a study with blind and sighted readers, the researchers presented visualizations with descriptions at different levels and asked participants to rate how useful they were. While both groups agreed that level 1 content on its own was not very helpful, sighted readers gave level 4 content the highest marks while blind readers ranked that content among the least useful.

    Survey results revealed that a majority of blind readers were emphatic that descriptions should not contain an author’s editorialization, but rather stick to straight facts about the data. On the other hand, most sighted readers preferred a description that told a story about the data.

    “For me, a surprising finding about the lack of utility for the highest-level content is that it ties very closely to feelings about agency and control as a disabled person. In our research, blind readers specifically didn’t want the descriptions to tell them what to think about the data. They want the data to be accessible in a way that allows them to interpret it for themselves, and they want to have the agency to do that interpretation,” Lundgard says.

    A more inclusive future

    This work could have implications as data scientists continue to develop and refine machine learning methods for autogenerating captions and alternative text.

    “We are not able to do it yet, but it is not inconceivable to imagine that in the future we would be able to automate the creation of some of this higher-level content and build models that target level 2 or level 3 in our framework. And now we know what the research questions are. If we want to produce these automated captions, what should those captions say? We are able to be a bit more directed in our future research because we have these four levels,” Satyanarayan says.

    In the future, the four-level framework could also help researchers develop machine learning models that can automatically suggest effective visualizations as part of the data analysis process, or models that can extract the most useful information from a chart.

    This research could also inform future work in Satyanarayan’s group that seeks to make interactive visualizations more accessible for blind readers who use a screen reader to access and interpret the information. 

    “The question of how to ensure that charts and graphs are accessible to screen reader users is both a socially important equity issue and a challenge that can advance the state-of-the-art in AI,” says Meredith Ringel Morris, director and principal scientist of the People + AI Research team at Google Research, who was not involved with this study. “By introducing a framework for conceptualizing natural language descriptions of information graphics that is grounded in end-user needs, this work helps ensure that future AI researchers will focus their efforts on problems aligned with end-users’ values.”

    Morris adds: “Rich natural-language descriptions of data graphics will not only expand access to critical information for people who are blind, but will also benefit a much wider audience as eyes-free interactions via smart speakers, chatbots, and other AI-powered agents become increasingly commonplace.”

    This research was supported by the National Science Foundation.