More stories

  • in

    Security tool guarantees privacy in surveillance footage

    Surveillance cameras have an identity problem, fueled by an inherent tension between utility and privacy. As these powerful little devices have cropped up seemingly everywhere, the use of machine learning tools has automated video content analysis at a massive scale — but with increasing mass surveillance, there are currently no legally enforceable rules to limit privacy invasions. 

    Security cameras can do a lot — they’ve become smarter and supremely more competent than their ghosts of grainy pictures past, the ofttimes “hero tool” in crime media. (“See that little blurry blue blob in the right hand corner of that densely populated corner — we got him!”) Now, video surveillance can help health officials measure the fraction of people wearing masks, enable transportation departments to monitor the density and flow of vehicles, bikes, and pedestrians, and provide businesses with a better understanding of shopping behaviors. But why has privacy remained a weak afterthought? 

    The status quo is to retrofit video with blurred faces or black boxes. Not only does this prevent analysts from asking some genuine queries (e.g., Are people wearing masks?), it also doesn’t always work; the system may miss some faces and leave them unblurred for the world to see. Dissatisfied with this status quo, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), in collaboration with other institutions, came up with a system to better guarantee privacy in video footage from surveillance cameras. Called “Privid,” the system lets analysts submit video data queries, and adds a little bit of noise (extra data) to the end result to ensure that an individual can’t be identified. The system builds on a formal definition of privacy — “differential privacy” — which allows access to aggregate statistics about private data without revealing personally identifiable information.

    Typically, analysts would just have access to the entire video to do whatever they want with it, but Privid makes sure the video isn’t a free buffet. Honest analysts can get access to the information they need, but that access is restrictive enough that malicious analysts can’t do too much with it. To enable this, rather than running the code over the entire video in one shot, Privid breaks the video into small pieces and runs processing code over each chunk. Instead of getting results back from each piece, the segments are aggregated, and that additional noise is added. (There’s also information on the error bound you’re going to get on your result — maybe a 2 percent error margin, given the extra noisy data added). 

    For example, the code might output the number of people observed in each video chunk, and the aggregation might be the “sum,” to count the total number of people wearing face coverings, or the “average” to estimate the density of crowds. 
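
    The chunk-then-aggregate flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Privid's actual code: the function names, the toy frame format (a list of person IDs per frame), the clamping bound `max_per_chunk`, the duration bound `max_chunks_per_person`, and the privacy parameter `epsilon` are all assumptions made for the example. The noise comes from the standard Laplace mechanism used in the differential privacy literature.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def split_into_chunks(frames, chunk_len):
    """Break the video (a list of frames) into consecutive fixed-size chunks."""
    return [frames[i:i + chunk_len] for i in range(0, len(frames), chunk_len)]


def count_people(chunk):
    """Hypothetical analyst code: count distinct person IDs seen in one chunk."""
    return len({person for frame in chunk for person in frame})


def private_sum(frames, chunk_len, process_chunk, max_per_chunk,
                max_chunks_per_person, epsilon, rng):
    """Run analyst code per chunk, sum the clamped outputs, then add noise."""
    chunks = split_into_chunks(frames, chunk_len)
    # Clamp every chunk output so no single chunk can dominate the sum.
    outputs = [min(max(process_chunk(c), 0), max_per_chunk) for c in chunks]
    # Assumed sensitivity: the most one person can change the clamped sum.
    sensitivity = max_per_chunk * max_chunks_per_person
    scale = sensitivity / epsilon
    return sum(outputs) + laplace_noise(scale, rng), scale


# Toy video: six frames, each listing the (made-up) IDs of visible people.
frames = [["a", "b"], ["b"], ["c"], [], ["d", "e"], ["e"]]
noisy_total, noise_scale = private_sum(
    frames, chunk_len=2, process_chunk=count_people, max_per_chunk=10,
    max_chunks_per_person=1, epsilon=1.0, rng=random.Random(0))
```

    Only the noisy total and its noise scale leave the function. The per-chunk outputs stay internal, which mirrors the idea that analysts receive aggregates rather than individual results.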

    Privid allows analysts to use their own deep neural networks that are commonplace for video analytics today. This gives analysts the flexibility to ask questions that the designers of Privid did not anticipate. Across a variety of videos and queries, Privid was accurate within 79 to 99 percent of a non-private system.

    “We’re at a stage right now where cameras are practically ubiquitous. If there’s a camera on every street corner, every place you go, and if someone could actually process all of those videos in aggregate, you can imagine that entity building a very precise timeline of when and where a person has gone,” says MIT CSAIL PhD student Frank Cangialosi, the lead author on a paper about Privid. “People are already worried about location privacy with GPS — video data in aggregate could capture not only your location history, but also moods, behaviors, and more at each location.”

    Privid introduces a new notion of “duration-based privacy,” which decouples the definition of privacy from its enforcement. With obfuscation, if your privacy goal is to protect all people, the enforcement mechanism needs to do some work to find the people to protect, which it may or may not do perfectly. With duration-based privacy, you don’t need to fully specify everything, and you’re not hiding more information than you need to.

    Let’s say we have a video overlooking a street. Two analysts, Alice and Bob, both claim they want to count the number of people that pass by each hour, so they submit a video processing module and ask for a sum aggregation.

    The first analyst is the city planning department, which hopes to use this information to understand footfall patterns and plan sidewalks for the city. Their model counts people and outputs this count for each video chunk.

    The other analyst is malicious. They hope to identify every time “Charlie” passes by the camera. Their model only looks for Charlie’s face, and outputs a large number if Charlie is present (i.e., the “signal” they’re trying to extract), or zero otherwise. Their hope is that the sum will be non-zero if Charlie was present. 

    From Privid’s perspective, these two queries look identical. It’s hard to reliably determine what their models might be doing internally, or what the analyst hopes to use the data for. This is where the noise comes in. Privid executes both of the queries, and adds the same amount of noise for each. In the first case, because Alice was counting all people, this noise will have only a small impact on the result and likely won’t affect its usefulness.

    In the second case, since Bob was looking for a specific signal (Charlie was only visible for a few chunks), the noise is enough to prevent them from knowing if Charlie was there or not. If they see a non-zero result, it might be because Charlie was actually there, or because the model output zero and the noise made the result non-zero. Privid didn’t need to know anything about when or where Charlie appeared; the system just needed to know a rough upper bound on how long Charlie might appear for, which is easier to specify than figuring out the exact locations, which prior methods rely on.
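
    The contrast between the two analysts can be made concrete with a small simulation. Everything here is an assumption chosen for illustration rather than a value from the Privid paper: the chunk count, the cap on each chunk's output, the bound of two chunks on how long one person stays visible, the privacy parameter, and the use of Laplace noise scaled to that bound.

```python
import random


def laplace_noise(scale, rng):
    """Sample zero-mean Laplace noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


# All numbers below are made up for illustration.
NUM_CHUNKS = 1000            # chunks in the queried video
MAX_PER_CHUNK = 50           # cap on any single chunk's output
MAX_CHUNKS_PER_PERSON = 2    # assumed bound on how long one person is visible
EPSILON = 1.0                # assumed privacy parameter

# The same noise scale applies to both analysts.
noise_scale = MAX_PER_CHUNK * MAX_CHUNKS_PER_PERSON / EPSILON

# Alice counts everyone: about 20 people in every chunk.
alice_outputs = [20] * NUM_CHUNKS

# Bob outputs the maximum value only in chunks where Charlie appears.
bob_if_present = [0] * NUM_CHUNKS
bob_if_present[10] = bob_if_present[11] = MAX_PER_CHUNK
bob_if_absent = [0] * NUM_CHUNKS

rng = random.Random(0)
alice_noisy = sum(alice_outputs) + laplace_noise(noise_scale, rng)
bob_noisy = sum(bob_if_present) + laplace_noise(noise_scale, rng)

# Noise relative to Alice's total is tiny; relative to Bob's signal it is not.
alice_relative_noise = noise_scale / sum(alice_outputs)
bob_signal_gap = sum(bob_if_present) - sum(bob_if_absent)
bob_relative_noise = noise_scale / bob_signal_gap
```

    Under these assumed numbers, the typical noise is about 0.5 percent of Alice's total, but it is as large as the entire gap between Bob's "Charlie present" and "Charlie absent" sums, so a single noisy answer tells Bob very little.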

    The challenge is determining how much noise to add — Privid wants to add just enough to hide everyone, but not so much that it would be useless for analysts. Adding noise to the data and insisting on queries over time windows means that your result isn’t going to be as accurate as it could be, but the results are still useful while providing better privacy. 

    Cangialosi wrote the paper with Princeton PhD student Neil Agarwal, MIT CSAIL PhD student Venkat Arun, assistant professor at the University of Chicago Junchen Jiang, assistant professor at Rutgers University and former MIT CSAIL postdoc Srinivas Narayana, associate professor at Rutgers University Anand Sarwate, and assistant professor at Princeton University Ravi Netravali SM ’15, PhD ’18. Cangialosi will present the paper at the USENIX Symposium on Networked Systems Design and Implementation in April in Renton, Washington.

    This work was partially supported by a Sloan Research Fellowship and National Science Foundation grants.

  • in

    A tool for predicting the future

    Whether someone is trying to predict tomorrow’s weather, forecast future stock prices, identify missed opportunities for sales in retail, or estimate a patient’s risk of developing a disease, they will likely need to interpret time-series data, which are a collection of observations recorded over time.

    Making predictions using time-series data typically requires several data-processing steps and the use of complex machine-learning algorithms, which have such a steep learning curve they aren’t readily accessible to nonexperts.

    To make these powerful tools more user-friendly, MIT researchers developed a system that directly integrates prediction functionality on top of an existing time-series database. Their simplified interface, which they call tspDB (time series predict database), does all the complex modeling behind the scenes so a nonexpert can easily generate a prediction in only a few seconds.

    The new system is more accurate and more efficient than state-of-the-art deep learning methods when performing two tasks: predicting future values and filling in missing data points.

    One reason tspDB is so successful is that it incorporates a novel time-series-prediction algorithm, explains electrical engineering and computer science (EECS) graduate student Abdullah Alomar, an author of a recent research paper in which he and his co-authors describe the algorithm. This algorithm is especially effective at making predictions on multivariate time-series data, which are data that have more than one time-dependent variable. In a weather database, for instance, temperature, dew point, and cloud cover each depend on their past values.

    The algorithm also estimates the volatility of a multivariate time series to provide the user with a confidence level for its predictions.

    “Even as the time-series data becomes more and more complex, this algorithm can effectively capture any time-series structure out there. It feels like we have found the right lens to look at the model complexity of time-series data,” says senior author Devavrat Shah, the Andrew and Erna Viterbi Professor in EECS and a member of the Institute for Data, Systems, and Society and of the Laboratory for Information and Decision Systems.

    Joining Alomar and Shah on the paper is lead author Anish Agarwal, a former EECS graduate student who is currently a postdoc at the Simons Institute at the University of California at Berkeley. The research will be presented at the ACM SIGMETRICS conference.

    Adapting a new algorithm

    Shah and his collaborators have been working on the problem of interpreting time-series data for years, adapting different algorithms and integrating them into tspDB as they built the interface.

    About four years ago, they learned about a particularly powerful classical algorithm, called singular spectrum analysis (SSA), that imputes and forecasts single time series. Imputation is the process of replacing missing values or correcting past values. While this algorithm required manual parameter selection, the researchers suspected it could enable their interface to make effective predictions using time series data. In earlier work, they removed the need for this manual intervention.

    The algorithm for a single time series transformed it into a matrix and utilized matrix estimation procedures. The key intellectual challenge was how to adapt it to utilize multiple time series. After a few years of struggle, they realized the answer was something very simple: “Stack” the matrices for each individual time series, treat them as one big matrix, and then apply the single time-series algorithm on it.

    This approach, a multivariate variant of SSA (mSSA), naturally utilizes information across multiple time series, both across the time series and across time, as they describe in their new paper.
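
    To make the stacking idea concrete, here is a small pure-Python sketch. It is not the tspDB implementation: the helper names are invented for the example, the low-rank step keeps only a single component (computed by power iteration) where a real implementation would presumably keep several via a truncated SVD, and the zero-fill-and-rescale treatment of missing values is an assumption borrowed from the general matrix estimation literature.

```python
import math


def page_matrix(series, window):
    """Cut a series into consecutive segments; each segment becomes a column."""
    n_cols = len(series) // window
    return [[series[c * window + r] for c in range(n_cols)]
            for r in range(window)]


def stack_columns(matrices):
    """Place matrices with equal row counts side by side as one big matrix."""
    return [sum((m[r] for m in matrices), []) for r in range(len(matrices[0]))]


def rank_one_approximation(matrix, iterations=100):
    """Approximate rank-one fit via power iteration (stand-in for an SVD)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    v = [1.0] * n_cols
    for _ in range(iterations):
        u = [sum(matrix[i][j] * v[j] for j in range(n_cols))
             for i in range(n_rows)]
        norm = math.sqrt(sum(x * x for x in u))
        u = [x / norm for x in u]
        v = [sum(matrix[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


def impute(series_list, window):
    """Stack all series, fill gaps (None) with 0, denoise, and rescale."""
    stacked = stack_columns([page_matrix(s, window) for s in series_list])
    flat = [x for row in stacked for x in row]
    observed_fraction = sum(x is not None for x in flat) / len(flat)
    filled = [[0.0 if x is None else float(x) for x in row] for row in stacked]
    approx = rank_one_approximation(filled)
    return [[x / observed_fraction for x in row] for row in approx]


# Two made-up series sharing one repeating pattern; the second is twice the first.
pattern = [1.0, 2.0, 3.0, 4.0]
series_a = pattern * 3
series_b = [2 * x for x in pattern] * 3
series_a_with_gap = list(series_a)
series_a_with_gap[2] = None  # pretend one reading is missing

estimate = impute([series_a_with_gap, series_b], window=4)
imputed_value = estimate[2][0]  # estimate for the missing reading (true value 3.0)
```

    Because the second series sits in the same stacked matrix, its columns help pin down the shared pattern, so the gap in the first series is estimated from both series rather than from the first one alone.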

    This recent publication also discusses interesting alternatives, where instead of transforming the multivariate time series into a big matrix, it is viewed as a three-dimensional tensor. A tensor is a multi-dimensional array, or grid, of numbers. This established a promising connection between the classical field of time series analysis and the growing field of tensor estimation, Alomar says.

    “The variant of mSSA that we introduced actually captures all of that beautifully. So, not only does it provide the most likely estimation, but a time-varying confidence interval, as well,” Shah says.

    The simpler, the better

    They tested the adapted mSSA against other state-of-the-art algorithms, including deep-learning methods, on real-world time-series datasets with inputs drawn from the electricity grid, traffic patterns, and financial markets.

    Their algorithm outperformed all the others on imputation and it outperformed all but one of the other algorithms when it came to forecasting future values. The researchers also demonstrated that their tweaked version of mSSA can be applied to any kind of time-series data.

    “One reason I think this works so well is that the model captures a lot of time series dynamics, but at the end of the day, it is still a simple model. When you are working with something simple like this, instead of a neural network that can easily overfit the data, you can actually perform better,” Alomar says.

    The impressive performance of mSSA is what makes tspDB so effective, Shah explains. Now, their goal is to make this algorithm accessible to everyone.

    Once a user installs tspDB on top of an existing database, they can run a prediction query with just a few keystrokes in about 0.9 milliseconds, as compared to 0.5 milliseconds for a standard search query. The confidence intervals are also designed to help nonexperts make more informed decisions by incorporating the degree of uncertainty of the predictions into their decision making.

    For instance, the system could enable a nonexpert to predict future stock prices with high accuracy in just a few minutes, even if the time-series dataset contains missing values.

    Now that the researchers have shown why mSSA works so well, they are targeting new algorithms that can be incorporated into tspDB. One of these algorithms utilizes the same model to automatically enable change point detection, so if the user believes their time series will change its behavior at some point, the system will automatically detect that change and incorporate that into its predictions.

    They also want to continue gathering feedback from current tspDB users to see how they can improve the system’s functionality and user-friendliness, Shah says.

    “Our interest at the highest level is to make tspDB a success in the form of a broadly utilizable, open-source system. Time-series data are very important, and this is a beautiful concept of actually building prediction functionalities directly into the database. It has never been done before, and so we want to make sure the world uses it,” he says.

    “This work is very interesting for a number of reasons. It provides a practical variant of mSSA which requires no hand tuning, they provide the first known analysis of mSSA, and the authors demonstrate the real-world value of their algorithm by being competitive with or out-performing several known algorithms for imputations and predictions in (multivariate) time series for several real-world data sets,” says Vishal Misra, a professor of computer science at Columbia University who was not involved with this research. “At the heart of it all is the beautiful modeling work where they cleverly exploit correlations across time (within a time series) and space (across time series) to create a low-rank spatiotemporal factor representation of a multivariate time series. Importantly this model connects the field of time series analysis to that of the rapidly evolving topic of tensor completion, and I expect a lot of follow-on research spurred by this paper.”

  • in

    Study: With masking and distancing in place, NFL stadium openings in 2020 had no impact on local Covid-19 infections

    As with most everything in the world, football looked very different in 2020. As the Covid-19 pandemic unfolded, many National Football League (NFL) games were played in empty stadiums, while other stadiums opened to fans at significantly reduced capacity, with strict safety protocols in place.

    At the time it was unclear what impact such large sporting events would have on Covid-19 case counts, particularly at a time when vaccination against the virus was not widely available.

    Now, MIT engineers have taken a look back at the NFL’s 2020 regular season and found that for this specific period during the pandemic, opening stadiums to fans while requiring face coverings, social distancing, and other measures had no impact on the number of Covid-19 infections in those stadiums’ local counties.

    As they write in a new paper appearing this week in the Proceedings of the National Academy of Sciences, “the benefits of providing a tightly controlled outdoor spectating environment — including masking and distancing requirements — counterbalanced the risks associated with opening.”

    The study concentrates on the NFL’s 2020 regular season (September 2020 to early January 2021), at a time when earlier strains of the virus dominated, before the rise of more transmissible Delta and Omicron variants. Nevertheless, the results may inform decisions on whether and how to hold large outdoor gatherings in the face of future public health crises.

    “These results show that the measures adopted by the NFL were effective in safely opening stadiums,” says study author Anette “Peko” Hosoi, the Neil and Jane Pappalardo Professor of Mechanical Engineering at MIT. “If case counts start to rise again, we know what to do: mask people, put them outside, and distance them from each other.”

    The study’s co-authors are members of MIT’s Institute for Data, Systems, and Society (IDSS), and include Bernardo García Bulle, Dennis Shen, and Devavrat Shah, the Andrew and Erna Viterbi Professor in the Department of Electrical Engineering and Computer Science (EECS).

    Preseason patterns

    Last year a group led by the University of Southern Mississippi compared Covid-19 case counts in the counties of NFL stadiums that allowed fans in, versus those that did not. Their analysis showed that stadiums that opened to large numbers of fans led to “tangible increases” in the local county’s number of Covid-19 cases.

    But there are a number of factors in addition to a stadium’s opening that can affect case counts, including local policies, mandates, and attitudes. As the MIT team writes, “it is not at all obvious that one can attribute the differences in case spikes to the stadiums given the enormous number of confounding factors.”

    To truly isolate the effects of a stadium’s opening, one could imagine tracking Covid cases in a county with an open stadium through the 2020 season, then turning back the clock, closing the stadium, then tracking that same county’s Covid cases through the same season, all things being equal.

    “That’s the perfect experiment, with the exception that you would need a time machine,” Hosoi says.

    As it turns out, the next best thing is synthetic control — a statistical method that is used to determine the effect of an “intervention” (such as the opening of a stadium) compared with the exact same scenario without that intervention.

    In synthetic control, researchers use a weighted combination of groups to construct a “synthetic” version of an actual scenario. In this case, the actual scenario is a county such as Dallas that hosts an open stadium. A synthetic version would be a county that looks similar to Dallas, only without a stadium. In the context of this study, a county that “looks” like Dallas has a similar preseason pattern of Covid-19 cases.

    To construct a synthetic Dallas, the researchers looked for surrounding counties without stadiums that had similar Covid-19 trajectories leading up to the 2020 football season. They combined these counties in a way that best fit Dallas’ actual case trajectory. They then used data from the combined counties to calculate the number of Covid cases for this synthetic Dallas through the season, and compared these counts to the real Dallas.
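
    The construction above can be sketched with a toy example. This follows the classic formulation of synthetic control, in which donor weights are nonnegative and sum to 1. Whether the study fit its weights this way is not stated here, so treat that constraint, the brute-force grid search, the function names, and all case counts below as assumptions made purely for illustration.

```python
from itertools import product


def combine(donors, weights):
    """Weighted sum of donor trajectories, one value per time step."""
    return [sum(w * d[t] for w, d in zip(weights, donors))
            for t in range(len(donors[0]))]


def fit_weights(target_pre, donors_pre, steps=20):
    """Grid-search nonnegative weights summing to 1 that best fit the target."""
    best_weights, best_error = None, float("inf")
    for combo in product(range(steps + 1), repeat=len(donors_pre) - 1):
        if sum(combo) > steps:
            continue
        weights = [c / steps for c in combo] + [(steps - sum(combo)) / steps]
        fitted = combine(donors_pre, weights)
        error = sum((f - y) ** 2 for f, y in zip(fitted, target_pre))
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights


# Made-up weekly case counts before the season (used for fitting) ...
stadium_pre = [15.0, 20.0, 25.0, 30.0]
donors_pre = [[10.0, 20.0, 30.0, 40.0],   # hypothetical donor county A
              [30.0, 20.0, 10.0, 0.0]]    # hypothetical donor county B
# ... and during the season (used for comparison).
stadium_season = [38.0, 44.0]
donors_season = [[50.0, 60.0], [2.0, 0.0]]

weights = fit_weights(stadium_pre, donors_pre)
synthetic_season = combine(donors_season, weights)
gap = [actual - synth for actual, synth in zip(stadium_season, synthetic_season)]
```

    The weights are fit only on preseason data and then held fixed, so any gap that opens up during the season can be attributed to something that changed after the fit, such as the stadium opening. A gap that stays near zero would be read as no detectable effect.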

    The team carried out this analysis for every “stadium county.” They determined a county to be a stadium county if more than 10 percent of a stadium’s fans came from that county, which the researchers estimated based on attendance data provided by the NFL.

    “Go outside”

    Of the stadiums included in the study, 13 were closed through the regular season, while 16 opened with reduced capacity and multiple pandemic requirements in place, such as required masking, distanced seating, mobile ticketing, and enhanced cleaning protocols.

    The researchers found the trajectory of infections in all stadium counties mirrored that of synthetic counties, showing that the number of infections would have been the same if the stadiums had remained closed. In other words, they found no evidence that NFL stadium openings led to any increase in local Covid case counts.

    To check that their method wasn’t missing any case spikes, they tested it on a known superspreader: the Sturgis Motorcycle Rally, which was held in August of 2020. The analysis successfully picked up an increase in cases in Meade, the host county, compared to a synthetic counterpart, in the two weeks following the rally.

    Surprisingly, the researchers found that several stadium counties’ case counts dipped slightly compared to their synthetic counterparts. In these counties — including Hamilton, Ohio, home of the Cincinnati Bengals — it appeared that opening the stadium to fans was tied to a dip in Covid-19 infections. Hosoi has a guess as to why:

    “These are football communities with dedicated fans. Rather than stay home alone, those fans may have gone to a sports bar or hosted indoor football gatherings if the stadium had not opened,” Hosoi proposes. “Opening the stadium under those circumstances would have been beneficial to the community because it makes people go outside.”

    The team’s analysis also revealed another connection: Counties with similar Covid trajectories also shared similar politics. To illustrate this point, the team mapped the county-wide temporal trajectories of Covid case counts in Ohio in 2020 and found them to be a strong predictor of the state’s 2020 electoral map.

    “That is not a coincidence,” Hosoi notes. “It tells us that local political leanings determined the temporal trajectory of the pandemic.”

    The team plans to apply their analysis to see how other factors may have influenced the pandemic.

    “Covid is a different beast [today],” she says. “Omicron is more transmissive, and more of the population is vaccinated. It’s possible we’d find something different if we ran this analysis on the upcoming season, and I think we probably should try.”

  • in

    Jonathan Schwarz appointed director of MIT Institutional Research

    Former Provost Martin A. Schmidt named Jonathan D. Schwarz as the new director of MIT Institutional Research — a group within the Office of the Provost that provides high-quality data and analysis to the Institute, government entities, news organizations, and the broader community. 

    Over its 35-year history, Institutional Research has provided consistent, verifiable, and high-quality data. The group was established in 1986 as part of the MIT Office of Campus Planning to support MIT’s academic budget process and space planning studies. The Institute established the group to provide a central source of dependable data for departments, units, research labs, and administrators. 

    Institutional Research conducts campus-wide surveys on topics that affect the community including commuting, wellness, and diversity and inclusion. Additionally, the group submits data on behalf of MIT to the U.S. Department of Education, the Commonwealth of Massachusetts, the National Science Foundation, and national and international higher education rankings such as U.S. News & World Report. Institutional Research also works with peer institutions, consortia, government agencies, and rankings groups to establish the criteria that define how students, faculty, and research dollars are counted.

    “At its core, Institutional Research is about counting people, money, and space,” says Schwarz. “Once Institutional Research established valid and reliable metrics in these areas, it was able to apply its deep understanding of data and the Institute to a broader range of topics using surveys, interviews, and focus groups. We collect, maintain, analyze, and report data so people can make data-informed decisions.”

    One of the group’s most data-rich surveys, the 2022 MIT Quality of Life Survey, launched earlier this month. Administered every two years to the entire MIT community on campus and at Lincoln Laboratory, the Quality of Life Survey gathers information about the workload and well-being of MIT’s community members as well as the general atmosphere and climate at MIT. Findings from previous Institutional Research surveys helped to inspire several campus-wide initiatives, including expanded childcare benefits, protocols for flexible work arrangements, upgrades to commuting services, and measures to address student hunger.

    “Surveys give us an idea of where to shine a flashlight, but they are blunt instruments that don’t tell the whole story,” says Schwarz, who most recently served as associate director of Institutional Research, where he has worked since 2017. “We also need to sit down and talk to people and take a deeper dive to get nuance, rich detail, and context to better understand the data we’re collecting.”

    As associate director, Schwarz led an initiative to integrate qualitative data collection and analysis, and played an active role in work around issues of diversity, equity and inclusion. Schwarz joined MIT as an intern and later served as a researcher in MIT’s Office of Minority Education and Admissions Office. He earned a bachelor’s degree in political science from Wabash College and served as the college’s mascot, Wally Wabash. He also earned a master’s degree in education from the Harvard Graduate School of Education, and a PhD in sociology from the University of Notre Dame.

    Schwarz takes over the post from his mentor and Institutional Research’s founding director Lydia Snover, who is retiring after serving MIT in various roles for more than 50 years. 

    “We are blessed at MIT to have a community with an engineering culture — measuring is what we do,” says Snover. “You can’t fix something if you don’t know what’s wrong.”

    Snover will serve as the senior advisor to the director through 2022. A dedicated and valuable member of the MIT community, she started her career at MIT working in administrative positions in the departments of Psychology (now Brain and Cognitive Sciences) and Nutrition and Food Science/Applied Biological Sciences and served as a cook at MIT’s Kappa Sigma fraternity before she officially joined MIT. Snover has a bachelor of arts in philosophy and an MBA from Boston University.

    In her capacity as director of Institutional Research, Snover was awarded the 2019 John Stecklein Distinguished Member Award by the Association for Institutional Research, and the 2007 Lifetime Achievement Award from the Association of American Universities Data Exchange.

    Schwarz began his new role on Jan. 3.

  • in

    How artificial intelligence can help combat systemic racism

    In 2020, Detroit police arrested a Black man for shoplifting almost $4,000 worth of watches from an upscale boutique. He was handcuffed in front of his family and spent a night in lockup. After some questioning, however, it became clear that they had the wrong man. So why did they arrest him in the first place?

    The reason: a facial recognition algorithm had matched the photo on his driver’s license to grainy security camera footage.

    Facial recognition algorithms — which have repeatedly been demonstrated to be less accurate for people with darker skin — are just one example of how racial bias gets replicated within and perpetuated by emerging technologies.

    “There’s an urgency as AI is used to make really high-stakes decisions,” says MLK Visiting Professor S. Craig Watkins, whose academic home for his time at MIT is the Institute for Data, Systems, and Society (IDSS). “The stakes are higher because new systems can replicate historical biases at scale.”

    Watkins, a professor at the University of Texas at Austin and the founding director of the Institute for Media Innovation, researches the impacts of media and data-based systems on human behavior, with a specific concentration on issues related to systemic racism. “One of the fundamental questions of the work is: how do we build AI models that deal with systemic inequality more effectively?”

    Artificial Intelligence and the Future of Racial Justice | S. Craig Watkins | TEDxMIT

    Ethical AI

    Inequality is perpetuated by technology in many ways across many sectors. One broad domain is health care, where Watkins says inequity shows up in both quality of and access to care. The demand for mental health care, for example, far outstrips the capacity for services in the United States. That demand has been exacerbated by the pandemic, and access to care is harder for communities of color.

    For Watkins, taking the bias out of the algorithm is just one component of building more ethical AI. He also works to develop tools and platforms that can address inequality outside of tech head-on. In the case of mental health access, this entails developing a tool to help mental health providers deliver care more efficiently.

    “We are building a real-time data collection platform that looks at activities and behaviors and tries to identify patterns and contexts in which certain mental states emerge,” says Watkins. “The goal is to provide data-informed insights to care providers in order to deliver higher-impact services.”

    Watkins is no stranger to the privacy concerns such an app would raise. He takes a user-centered approach to the development that is grounded in data ethics. “Data rights are a significant component,” he argues. “You have to give the user complete control over how their data is shared and used and what data a care provider sees. No one else has access.”

    Combating systemic racism

    Here at MIT, Watkins has joined the newly launched Initiative on Combatting Systemic Racism (ICSR), an IDSS research collaboration that brings together faculty and researchers from the MIT Stephen A. Schwarzman College of Computing and beyond. The aim of the ICSR is to develop and harness computational tools that can help effect structural and normative change toward racial equity.

    The ICSR collaboration has separate project teams researching systemic racism in different sectors of society, including health care. Each of these “verticals” addresses different but interconnected issues, from sustainability to employment to gaming. Watkins is a part of two ICSR groups, policing and housing, that aim to better understand the processes that lead to discriminatory practices in both sectors. “Discrimination in housing contributes significantly to the racial wealth gap in the U.S.,” says Watkins.

    The policing team examines patterns in how different populations get policed. “There is obviously a significant and charged history to policing and race in America,” says Watkins. “This is an attempt to understand, to identify patterns, and note regional differences.”

    Watkins and the policing team are building models using data that details police interventions, responses, and race, among other variables. The ICSR is a good fit for this kind of research, says Watkins, who notes the interdisciplinary focus of both IDSS and the Schwarzman College of Computing (SCC).

    “Systemic change requires a collaborative model and different expertise,” says Watkins. “We are trying to maximize influence and potential on the computational side, but we won’t get there with computation alone.”

    Opportunities for change

    Models can also predict outcomes, but Watkins is careful to point out that no algorithm alone will solve racial challenges.

    “Models in my view can inform policy and strategy that we as humans have to create. Computational models can inform and generate knowledge, but that doesn’t equate with change.” It takes additional work — and additional expertise in policy and advocacy — to use knowledge and insights to strive toward progress.

    One important lever of change, he argues, will be building a more AI-literate society through access to information and opportunities to understand AI and its impact in a more dynamic way. He hopes to see greater data rights and greater understanding of how societal systems impact our lives.

    “I was inspired by the response of younger people to the murders of George Floyd and Breonna Taylor,” he says. “Their tragic deaths shine a bright light on the real-world implications of structural racism and has forced the broader society to pay more attention to this issue, which creates more opportunities for change.”

  • in

    When it comes to AI, can we ditch the datasets?

    Huge amounts of data are needed to train machine-learning models to perform image classification tasks, such as identifying damage in satellite photos following a natural disaster. However, these data are not always easy to come by. Datasets may cost millions of dollars to generate, if usable data exist in the first place, and even the best datasets often contain biases that negatively impact a model’s performance.

    To circumvent some of the problems presented by datasets, MIT researchers developed a method for training a machine learning model that, rather than using a dataset, uses a special type of machine-learning model to generate extremely realistic synthetic data that can train another model for downstream vision tasks.

    Their results show that a contrastive representation learning model trained using only these synthetic data is able to learn visual representations that rival or even outperform those learned from real data.

    This special machine-learning model, known as a generative model, requires far less memory to store or share than a dataset. Using synthetic data also has the potential to sidestep some concerns around privacy and usage rights that limit how some real data can be distributed. A generative model could also be edited to remove certain attributes, like race or gender, which could address some biases that exist in traditional datasets.

    “We knew that this method should eventually work; we just needed to wait for these generative models to get better and better. But we were especially pleased when we showed that this method sometimes does even better than the real thing,” says Ali Jahanian, a research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL) and lead author of the paper.

    Jahanian wrote the paper with CSAIL grad students Xavier Puig and Yonglong Tian, and senior author Phillip Isola, an assistant professor in the Department of Electrical Engineering and Computer Science. The research will be presented at the International Conference on Learning Representations.

    Generating synthetic data

    Once a generative model has been trained on real data, it can generate synthetic data that are so realistic they are nearly indistinguishable from the real thing. The training process involves showing the generative model millions of images that contain objects in a particular class (like cars or cats), and then it learns what a car or cat looks like so it can generate similar objects.

    Essentially by flipping a switch, researchers can use a pretrained generative model to output a steady stream of unique, realistic images that are based on those in the model’s training dataset, Jahanian says.

    But generative models are even more useful because they learn how to transform the underlying data on which they are trained, he says. If the model is trained on images of cars, it can “imagine” how a car would look in different situations — situations it did not see during training — and then output images that show the car in unique poses, colors, or sizes.

    Having multiple views of the same image is important for a technique called contrastive learning, where a machine-learning model is shown many unlabeled images to learn which pairs are similar or different.

    The researchers connected a pretrained generative model to a contrastive learning model in a way that allowed the two models to work together automatically. The contrastive learner could tell the generative model to produce different views of an object, and then learn to identify that object from multiple angles, Jahanian explains.

    “This was like connecting two building blocks. Because the generative model can give us different views of the same thing, it can help the contrastive method to learn better representations,” he says.
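
    The pairing described above can be illustrated with a toy sketch. It is not the researchers' implementation: the "generator" here is a hypothetical fixed random map standing in for a pretrained generative model, a second view is assumed to come from nudging the latent code slightly, and the loss is the widely used InfoNCE contrastive objective, which the article does not name. All function names and parameters are illustrative assumptions.

```python
import math
import random


def toy_generator(z, weights):
    """Hypothetical stand-in for a pretrained generative model (fixed linear map + tanh)."""
    return [math.tanh(sum(w * x for w, x in zip(row, z))) for row in weights]


def nearby_latent(z, rng, scale=0.1):
    """Assumed view-generation scheme: perturb the latent code slightly."""
    return [x + rng.gauss(0.0, scale) for x in z]


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: each anchor should pick out its own positive among all positives."""
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


def make_batch(batch_size=16, latent_dim=8, image_dim=32, seed=0):
    """Sample latents, then render an anchor view and a nearby view of each."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(image_dim)]
    latents = [[rng.gauss(0, 1) for _ in range(latent_dim)] for _ in range(batch_size)]
    anchors = [toy_generator(z, weights) for z in latents]
    positives = [toy_generator(nearby_latent(z, rng), weights) for z in latents]
    return anchors, positives


anchors, positives = make_batch()
matched_loss = info_nce(anchors, positives)
# Rotating the positives pairs every anchor with a view of a different sample.
shuffled_loss = info_nce(anchors, positives[1:] + positives[:1])
print(f"matched views: {matched_loss:.3f}, mismatched views: {shuffled_loss:.3f}")
```

    In a real system an encoder network would be trained to drive this loss down; the sketch only shows that views generated from nearby latent codes score as matching pairs while views of different samples do not.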

    Even better than the real thing

    The researchers compared their method to several other image classification models that were trained using real data and found that their method performed as well as, and sometimes better than, the other models.

    One advantage of using a generative model is that it can, in theory, create an infinite number of samples. So, the researchers also studied how the number of samples influenced the model’s performance. They found that, in some instances, generating larger numbers of unique samples led to additional improvements.

    “The cool thing about these generative models is that someone else trained them for you. You can find them in online repositories, so everyone can use them. And you don’t need to intervene in the model to get good representations,” Jahanian says.

    But he cautions that there are some limitations to using generative models. In some cases, these models can reveal source data, which can pose privacy risks, and they could amplify biases in the datasets they are trained on if they aren’t properly audited.

    He and his collaborators plan to address those limitations in future work. Another area they want to explore is using this technique to generate corner cases that could improve machine learning models. Corner cases often can’t be learned from real data. For instance, if researchers are training a computer vision model for a self-driving car, real data wouldn’t contain examples of a dog and his owner running down a highway, so the model would never learn what to do in this situation. Generating that corner case data synthetically could improve the performance of machine learning models in some high-stakes situations.

    The researchers also want to continue improving generative models so they can compose images that are even more sophisticated, he says.

    This research was supported, in part, by the MIT-IBM Watson AI Lab, the United States Air Force Research Laboratory, and the United States Air Force Artificial Intelligence Accelerator.

  • in

    An “oracle” for predicting the evolution of gene regulation

    Despite the sheer number of genes that each human cell contains, these so-called “coding” DNA sequences comprise just 1 percent of our entire genome. The remaining 99 percent is made up of “non-coding” DNA — which, unlike coding DNA, does not carry the instructions to build proteins.

    One vital function of this non-coding DNA, also called “regulatory” DNA, is to help turn genes on and off, controlling how much (if any) of a protein is made. Over time, as cells replicate their DNA to grow and divide, mutations often crop up in these non-coding regions — sometimes tweaking their function and changing the way they control gene expression. Many of these mutations are trivial, and some are even beneficial. Occasionally, though, they can be associated with increased risk of common diseases, such as Type 2 diabetes, or more life-threatening ones, including cancer.

    To better understand the repercussions of such mutations, researchers have been hard at work on mathematical maps that allow them to look at an organism’s genome, predict which genes will be expressed, and determine how that expression will affect the organism’s observable traits. These maps, called fitness landscapes, were conceptualized roughly a century ago to understand how genetic makeup influences one common measure of organismal fitness in particular: reproductive success. Early fitness landscapes were very simple, often focusing on a limited number of mutations. Much richer datasets are now available, but researchers still require additional tools to characterize and visualize such complex data. This ability would not only facilitate a better understanding of how individual genes have evolved over time, but would also help to predict what sequence and expression changes might occur in the future.

    In a new study published on March 9 in Nature, a team of scientists has developed a framework for studying the fitness landscapes of regulatory DNA. They created a neural network model that, when trained on hundreds of millions of experimental measurements, was capable of predicting how changes to these non-coding sequences in yeast affected gene expression. They also devised a unique way of representing the landscapes in two dimensions, making it easy to understand the past and forecast the future evolution of non-coding sequences in organisms beyond yeast — and even design custom gene expression patterns for gene therapies and industrial applications.

    “We now have an ‘oracle’ that can be queried to ask: What if we tried all possible mutations of this sequence? Or, what new sequence should we design to give us a desired expression?” says Aviv Regev, a professor of biology at MIT (on leave), core member of the Broad Institute of Harvard and MIT (on leave), head of Genentech Research and Early Development, and the study’s senior author. “Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways. I am also excited about the possibilities for machine learning researchers interested in interpretability; they can ask their questions in reverse, to better understand the underlying biology.”

    Prior to this study, many researchers had simply trained their models on known mutations (or slight variations thereof) that exist in nature. However, Regev’s team wanted to go a step further by creating their own unbiased models capable of predicting an organism’s fitness and gene expression based on any possible DNA sequence — even sequences they’d never seen before. This would also enable researchers to use such models to engineer cells for pharmaceutical purposes, including new treatments for cancer and autoimmune disorders.

    To accomplish this goal, Eeshit Dhaval Vaishnav, a graduate student at MIT and co-first author; Carl de Boer, now an assistant professor at the University of British Columbia; and their colleagues created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of totally random non-coding DNA sequences into yeast, and observing how each random sequence affected gene expression. They focused on a particular subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that can switch nearby genes on or off.
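
    As a rough illustration of what "querying" such a model with every possible single-base change could look like, here is a minimal sketch. The scoring function is a placeholder that simply measures GC fraction; it is not the study's neural network, and the names `one_hot`, `single_mutants`, `toy_expression_model`, and `rank_mutations` are hypothetical.

```python
from typing import Callable, Iterator, List, Tuple

BASES = "ACGT"


def one_hot(sequence: str) -> List[List[int]]:
    """Encode a DNA string as one row per base, one column per letter in BASES."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def single_mutants(sequence: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (position, new_base, mutated_sequence) for every single-base substitution."""
    for position, original in enumerate(sequence):
        for base in BASES:
            if base != original:
                yield position, base, sequence[:position] + base + sequence[position + 1:]


def toy_expression_model(sequence: str) -> float:
    """Placeholder score (GC fraction). NOT the trained model from the study."""
    return sum(base in "GC" for base in sequence) / len(sequence)


def rank_mutations(
    sequence: str, model: Callable[[str], float]
) -> List[Tuple[float, int, str]]:
    """Score every single mutant and sort by predicted change relative to the original."""
    baseline = model(sequence)
    effects = [
        (model(mutant) - baseline, position, base)
        for position, base, mutant in single_mutants(sequence)
    ]
    return sorted(effects, reverse=True)


promoter = "ATGCATTA"  # made-up example sequence
effects = rank_mutations(promoter, toy_expression_model)
print(len(effects), "single-base mutants scored")
print("largest predicted increase:", effects[0])
```

    Swapping `toy_expression_model` for a trained predictor is the only change needed to turn this loop into an actual query; the enumeration itself is model-agnostic.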

    “This work highlights what possibilities open up when we design new kinds of experiments to generate the right data to train models,” Regev says. “In the broader sense, I believe these kinds of approaches will be important for many problems — like understanding genetic variants in regulatory regions that confer disease risk in the human genome, but also for predicting the impact of combinations of mutations, or designing new molecules.”

    Regev, Vaishnav, de Boer, and their coauthors went on to test their model’s predictive abilities in a variety of ways, in order to show how it could help demystify the evolutionary past — and possible future — of certain promoters. “Creating an accurate model was certainly an accomplishment, but, to me, it was really just a starting point,” Vaishnav explains.

    First, to determine whether their model could help with synthetic biology applications like producing antibiotics, enzymes, and food, the researchers practiced using it to design promoters that could generate desired expression levels for any gene of interest. They then scoured other scientific papers to identify fundamental evolutionary questions, in order to see if their model could help answer them. The team even went so far as to feed their model a real-world population dataset from one existing study, which contained genetic information from yeast strains around the world. In doing so, they were able to delineate thousands of years of past selection pressures that sculpted the genomes of today’s yeast.

    But, in order to create a powerful tool that could probe any genome, the researchers knew they’d need to find a way to forecast the evolution of non-coding sequences even without such a comprehensive population dataset. To address this goal, Vaishnav and his colleagues devised a computational technique that allowed them to plot the predictions from their framework onto a two-dimensional graph. This helped them show, in a remarkably simple manner, how any non-coding DNA sequence would affect gene expression and fitness, without needing to conduct any time-consuming experiments at the lab bench.

    “One of the unsolved problems in fitness landscapes was that we didn’t have an approach for visualizing them in a way that meaningfully captured the evolutionary properties of sequences,” Vaishnav explains. “I really wanted to find a way to fill that gap, and contribute to the long-standing vision of creating a complete fitness landscape.”

    Martin Taylor, a professor of genetics at the University of Edinburgh’s Medical Research Council Human Genetics Unit who was not involved in the research, says the study shows that artificial intelligence can not only predict the effect of regulatory DNA changes, but also reveal the underlying principles that govern millions of years of evolution.

    Despite the fact that the model was trained on just a fraction of yeast regulatory DNA in a few growth conditions, he’s impressed that it’s capable of making such useful predictions about the evolution of gene regulation in mammals.

    “There are obvious near-term applications, such as the custom design of regulatory DNA for yeast in brewing, baking, and biotechnology,” he explains. “But extensions of this work could also help identify disease mutations in human regulatory DNA that are currently difficult to find and largely overlooked in the clinic. This work suggests there is a bright future for AI models of gene regulation trained on richer, more complex, and more diverse datasets.”

    Even before the study was formally published, Vaishnav began receiving queries from other researchers hoping to use the model to devise non-coding DNA sequences for use in gene therapies.

    “People have been studying regulatory evolution and fitness landscapes for decades now,” Vaishnav says. “I think our framework will go a long way in answering fundamental, open questions about the evolution and evolvability of gene regulatory DNA — and even help us design biological sequences for exciting new applications.”

  • in

    Computational modeling guides development of new materials

    Metal-organic frameworks, a class of materials with porous molecular structures, have a variety of possible applications, such as capturing harmful gases and catalyzing chemical reactions. Made of metal atoms linked by organic molecules, they can be configured in hundreds of thousands of different ways.

    To help researchers sift through all of the possible metal-organic framework (MOF) structures and help identify the ones that would be most practical for a particular application, a team of MIT computational chemists has developed a model that can analyze the features of a MOF structure and predict if it will be stable enough to be useful.

    The researchers hope that these computational predictions will help cut the development time of new MOFs.

    “This will allow researchers to test the promise of specific materials before they go through the trouble of synthesizing them,” says Heather Kulik, an associate professor of chemical engineering at MIT.

    The MIT team is now working to develop MOFs that could be used to capture methane gas and convert it to useful compounds such as fuels.

    The researchers described their new model in two papers, one in the Journal of the American Chemical Society and one in Scientific Data. Graduate students Aditya Nandy and Gianmarco Terrones are the lead authors of the Scientific Data paper, and Nandy is also the lead author of the JACS paper. Kulik is the senior author of both papers.

    Modeling structure

    MOFs consist of metal atoms joined by organic molecules called linkers to create a rigid, cage-like structure. The materials also have many pores, which makes them useful for catalyzing reactions involving gases but can also make them less structurally stable.

    “The limitation in seeing MOFs realized at industrial scale is that although we can control their properties by controlling where each atom is in the structure, they’re not necessarily that stable, as far as materials go,” Kulik says. “They’re very porous and they can degrade under realistic conditions that we need for catalysis.”

    Scientists have been working on designing MOFs for more than 20 years, and thousands of possible structures have been published. A centralized repository contains about 10,000 of these structures but is not linked to any of the published findings on the properties of those structures.

    Kulik, who specializes in using computational modeling to discover structure-property relationships of materials, wanted to take a more systematic approach to analyzing and classifying the properties of MOFs.

    “When people make these now, it’s mostly trial and error. The MOF dataset is really promising because there are so many people excited about MOFs, so there’s so much to learn from what everyone’s been working on, but at the same time, it’s very noisy and it’s not systematic the way it’s reported,” she says.

    Kulik and her colleagues set out to analyze published reports of MOF structures and properties using a natural-language-processing algorithm. Using this algorithm, they scoured nearly 4,000 published papers, extracting information on the temperature at which a given MOF would break down. They also pulled out data on whether particular MOFs can withstand the conditions needed to remove solvents used to synthesize them and make sure they become porous.
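
    To give a sense of the kind of extraction involved, the sketch below pulls breakdown temperatures out of free text with a regular expression. This is a deliberately simple stand-in, not the natural-language-processing algorithm the team used; the pattern, the function name, and the sample sentences are all assumptions made for illustration.

```python
import re
from typing import List

# Hypothetical pattern: a stability/decomposition phrase followed by a number and a unit.
TEMPERATURE_PATTERN = re.compile(
    r"(?:stable up to|decompos\w*\s+(?:at|above|around|near))\s+"
    r"(?:about\s+|ca\.\s*|~\s*)?(\d{2,4}(?:\.\d+)?)\s*°?\s*(C|K)\b",
    re.IGNORECASE,
)


def extract_breakdown_temperatures_celsius(text: str) -> List[float]:
    """Return every matched temperature, converted to degrees Celsius."""
    temperatures = []
    for value, unit in TEMPERATURE_PATTERN.findall(text):
        degrees = float(value)
        if unit.upper() == "K":
            degrees -= 273.15  # convert kelvin to Celsius
        temperatures.append(round(degrees, 2))
    return temperatures


# Made-up sentences in the style of a materials paper.
sample = (
    "Thermogravimetric analysis shows the framework is stable up to 350 °C. "
    "A related material decomposes at about 623 K under nitrogen."
)
print(extract_breakdown_temperatures_celsius(sample))
```

    A pattern like this misses many phrasings and cannot tell which material a number belongs to, which is part of why mining thousands of papers calls for more capable language-processing tools.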

    Once the researchers had this information, they used it to train two neural networks to predict MOFs’ thermal stability and stability during solvent removal, based on the molecules’ structure.

    “Before you start working with a material and thinking about scaling it up for different applications, you want to know will it hold up, or is it going to degrade in the conditions I would want to use it in?” Kulik says. “Our goal was to get better at predicting what makes a stable MOF.”

    Better stability

    Using the model, the researchers were able to identify certain features that influence stability. In general, simpler linkers with fewer chemical groups attached to them are more stable. Pore size is also important: Before the researchers did their analysis, it had been thought that MOFs with larger pores might be too unstable. However, the MIT team found that large-pore MOFs can be stable if other aspects of their structure counteract the large pore size.

    “Since MOFs have so many things that can vary at the same time, such as the metal, the linkers, the connectivity, and the pore size, it is difficult to nail down what governs stability across different families of MOFs,” Nandy says. “Our models enable researchers to make predictions on existing or new materials, many of which have yet to be made.”

    The researchers have made their data and models available online. Scientists interested in using the models can get recommendations for strategies to make an existing MOF more stable, and they can also add their own data and feedback on the predictions of the models.

    The MIT team is now using the model to try to identify MOFs that could be used to catalyze the conversion of methane gas to methanol, which could be used as fuel. Kulik also plans to use the model to create a new dataset of hypothetical MOFs that haven’t been built before but are predicted to have high stability. Researchers could then screen this dataset for a variety of properties.

    “People are interested in MOFs for things like quantum sensing and quantum computing, all sorts of different applications where you need metals distributed in this atomically precise way,” Kulik says.

    The research was funded by DARPA, the U.S. Office of Naval Research, the U.S. Department of Energy, a National Science Foundation Graduate Research Fellowship, a Career Award at the Scientific Interface from the Burroughs Wellcome Fund, and an AAAS Marion Milligan Mason Award.