More stories

  • in

    Strengthening trust in machine-learning models

    Probabilistic machine learning methods are becoming increasingly powerful tools in data analysis, informing a range of critical decisions across disciplines and applications, from forecasting election results to predicting the impact of microloans on addressing poverty.

    This class of methods uses sophisticated concepts from probability theory to handle uncertainty in decision-making. But the math is only one piece of the puzzle in determining their accuracy and effectiveness. In a typical data analysis, researchers make many subjective choices, or potentially introduce human error, that must also be assessed in order to cultivate users’ trust in the quality of decisions based on these methods.

    To address this issue, MIT computer scientist Tamara Broderick, associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Laboratory for Information and Decision Systems (LIDS), and a team of researchers have developed a classification system — a “taxonomy of trust” — that defines where trust might break down in a data analysis and identifies strategies to strengthen trust at each step. The other researchers on the project are Professor Anna Smith at the University of Kentucky, professors Tian Zheng and Andrew Gelman at Columbia University, and Professor Rachael Meager at the London School of Economics. The team’s hope is to highlight concerns that are already well-studied and those that need more attention.

    In their paper, published in February in Science Advances, the researchers begin by detailing the steps in the data analysis process where trust might break down: Analysts make choices about what data to collect and which models, or mathematical representations, most closely mirror the real-life problem or question they are aiming to answer. They select algorithms to fit the model and use code to run those algorithms. Each of these steps poses unique challenges around building trust. Some components can be checked for accuracy in measurable ways. “Does my code have bugs?”, for example, is a question that can be tested against objective criteria. Other times, problems are more subjective, with no clear-cut answers; analysts are confronted with numerous strategies to gather data and decide whether a model reflects the real world.

    “What I think is nice about making this taxonomy, is that it really highlights where people are focusing. I think a lot of research naturally focuses on this level of ‘are my algorithms solving a particular mathematical problem?’ in part because it’s very objective, even if it’s a hard problem,” Broderick says.

    “I think it’s really hard to answer ‘is it reasonable to mathematize an important applied problem in a certain way?’ because it’s somehow getting into a harder space, it’s not just a mathematical problem anymore.”

    Capturing real life in a model

    The researchers’ work in categorizing where trust breaks down, though it may seem abstract, is rooted in real-world application.

    Meager, a co-author on the paper, analyzed whether microfinance can have a positive effect in a community. The project became a case study for where trust could break down, and for ways to reduce this risk.

    At first look, measuring the impact of microfinancing might seem like a straightforward endeavor. But like any analysis, researchers meet challenges at each step in the process that can affect trust in the outcome. Microfinancing — in which individuals or small businesses receive small loans and other financial services in lieu of conventional banking — can offer different services, depending on the program. For the analysis, Meager gathered datasets from microfinance programs in countries across the globe, including in Mexico, Mongolia, Bosnia, and the Philippines.

    When combining conspicuously distinct datasets, in this case from multiple countries and across different cultures and geographies, researchers must evaluate whether specific case studies can reflect broader trends. It is also important to contextualize the data on hand. For example, in rural Mexico, owning goats may be counted as an investment.

    “It’s hard to measure the quality of life of an individual. People measure things like, ‘What’s the business profit of the small business?’ Or ‘What’s the consumption level of a household?’ There’s this potential for mismatch between what you ultimately really care about, and what you’re measuring,” Broderick says. “Before we get to the mathematical level, what data and what assumptions are we leaning on?”

    With data on hand, analysts must define the real-world questions they seek to answer. In the case of evaluating the benefits of microfinancing, analysts must define what they consider a positive outcome. It is standard in economics, for example, to measure the average financial gain per business in communities where a microfinance program is introduced. But reporting an average might suggest a net positive effect even if only a few people (or even one person) benefited, instead of the community as a whole.
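    To make the gap between an average and broad benefit concrete, here is a minimal sketch in Python. The profit figures are invented for illustration and are not taken from Meager's study, and `summarize_gains` is a hypothetical helper, not part of any tool the researchers describe.

```python
from statistics import mean, median

def summarize_gains(gains):
    """Summarize per-business profit changes three different ways."""
    share_benefiting = sum(1 for g in gains if g > 0) / len(gains)
    return {
        "mean": mean(gains),
        "median": median(gains),
        "share_benefiting": share_benefiting,
    }

# Hypothetical community: nine businesses see no change, one gains 1,000.
gains = [0.0] * 9 + [1000.0]
summary = summarize_gains(gains)
print(summary)
```

    In this made-up community the mean gain is positive even though the median business gained nothing and only one in ten benefited, which is the kind of mismatch between a reported proxy and the quantity of interest described above.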

    “What you really wanted was that a lot of people are benefiting,” Broderick says. “It sounds simple. Why didn’t we measure the thing that we cared about? But I think it’s really common that practitioners use standard machine learning tools, for a lot of reasons. And these tools might report a proxy that doesn’t always agree with the quantity of interest.”

    Analysts may consciously or subconsciously favor models they are familiar with, especially after investing a great deal of time learning their ins and outs. “Someone might be hesitant to try a nonstandard method because they might be less certain they will use it correctly. Or peer review might favor certain familiar methods, even if a researcher might like to use nonstandard methods,” Broderick says. “There are a lot of reasons, sociologically. But this can be a concern for trust.”

    Final step, checking the code 

    While distilling a real-life problem into a model can be a big-picture, amorphous problem, checking the code that runs an algorithm can feel “prosaic,” Broderick says. But it is another potentially overlooked area where trust can be strengthened.

    In some cases, checking a coding pipeline that executes an algorithm might be considered outside the purview of an analyst’s job, especially when there is the option to use standard software packages.

    One way to catch bugs is to test whether code is reproducible. Depending on the field, however, sharing code alongside published work is not always a requirement or the norm. As models increase in complexity over time, it becomes harder to recreate code from scratch. Reproducing a model becomes difficult or even impossible.

    “Let’s just start with every journal requiring you to release your code. Maybe it doesn’t get totally double-checked, and everything isn’t absolutely perfect, but let’s start there,” Broderick says, as one step toward building trust.

    Paper co-author Gelman worked on an analysis that forecast the 2020 U.S. presidential election using state and national polls in real time. The team published daily updates in The Economist magazine, while also publishing their code online for anyone to download and run themselves. Throughout the season, outsiders pointed out both bugs and conceptual problems in the model, ultimately contributing to a stronger analysis.

    The researchers acknowledge that while there is no single solution to create a perfect model, analysts and scientists have the opportunity to reinforce trust at nearly every turn.

    “I don’t think we expect any of these things to be perfect,” Broderick says, “but I think we can expect them to be better or to be as good as possible.”

  • in

    Learning to grow machine-learning models

    It’s no secret that OpenAI’s ChatGPT has some incredible capabilities — for instance, the chatbot can write poetry that resembles Shakespearean sonnets or debug code for a computer program. These abilities are made possible by the massive machine-learning model that ChatGPT is built upon. Researchers have found that when these types of models become large enough, extraordinary capabilities emerge.

    But bigger models also require more time and money to train. The training process involves showing hundreds of billions of examples to a model. Gathering so much data is an involved process in itself. Then come the monetary and environmental costs of running many powerful computers for days or weeks to train a model that may have billions of parameters. 

    “It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars, just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained,” says Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Rather than discarding a previous version of a model, Kim and his collaborators use it as the building blocks for a new model. Using machine learning, their method learns to “grow” a larger model from a smaller model in a way that encodes knowledge the smaller model has already gained. This enables faster training of the larger model.

    Their technique saves about 50 percent of the computational cost required to train a large model, compared to methods that train a new model from scratch. Plus, the models trained using the MIT method performed as well as, or better than, models trained with other techniques that also use smaller models to enable faster training of larger models.

    Reducing the time it takes to train huge models could help researchers make advancements faster with less expense, while also reducing the carbon emissions generated during the training process. It could also enable smaller research groups to work with these massive models, potentially opening the door to many new advances.

    “As we look to democratize these types of technologies, making training faster and less expensive will become more important,” says Kim, senior author of a paper on this technique.

    Kim and his graduate student Lucas Torroba Hennigen wrote the paper with lead author Peihao Wang, a graduate student at the University of Texas at Austin, as well as others at the MIT-IBM Watson AI Lab and Columbia University. The research will be presented at the International Conference on Learning Representations.

    The bigger the better

    Large language models like GPT-3, which is at the core of ChatGPT, are built using a neural network architecture called a transformer. A neural network, loosely based on the human brain, is composed of layers of interconnected nodes, or “neurons.” Each neuron contains parameters, which are variables learned during the training process that the neuron uses to process data.

    Transformer architectures are unique because, as these types of neural network models get bigger, they achieve much better results.

    “This has led to an arms race of companies trying to train larger and larger transformers on larger and larger datasets. More so than other architectures, it seems that transformer networks get much better with scaling. We’re just not exactly sure why this is the case,” Kim says.

    These models often have hundreds of millions or billions of learnable parameters. Training all these parameters from scratch is expensive, so researchers seek to accelerate the process.

    One effective technique is known as model growth. Using the model growth method, researchers can increase the size of a transformer by copying neurons, or even entire layers of a previous version of the network, then stacking them on top. They can make a network wider by adding new neurons to a layer or make it deeper by adding additional layers of neurons.
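    The copy-and-stack idea can be sketched with plain Python lists standing in for weight matrices. This is a toy illustration under simplifying assumptions, not the researchers' code: `grow_deeper` and `grow_wider` are hypothetical names, each inner list holds one neuron's incoming weights, and the widening step ignores the resizing that the following layer would also need in a real network.

```python
import copy

def grow_deeper(layers):
    """Double the depth by stacking a copy of every layer on top."""
    return layers + copy.deepcopy(layers)

def grow_wider(weights, extra):
    """Add `extra` neurons to a layer by copying its first rows.

    Each row holds one neuron's incoming weights. A real network would
    also need the next layer resized to accept the new outputs.
    """
    return weights + [list(row) for row in weights[:extra]]

small = [
    [[0.1, 0.2], [0.3, 0.4]],  # layer 1: two neurons, two inputs each
    [[0.5, 0.6], [0.7, 0.8]],  # layer 2
]
deeper = grow_deeper(small)                         # four layers
wider_first_layer = grow_wider(small[0], extra=1)   # three neurons
```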

    In contrast to previous approaches for model growth, parameters associated with the new neurons in the expanded transformer are not just copies of the smaller network’s parameters, Kim explains. Rather, they are learned combinations of the parameters of the smaller model.

    Learning to grow

    Kim and his collaborators use machine learning to learn a linear mapping of the parameters of the smaller model. This linear map is a mathematical operation that transforms a set of input values, in this case the smaller model’s parameters, to a set of output values, in this case the parameters of the larger model.

    Their method, which they call a learned Linear Growth Operator (LiGO), learns to expand the width and depth of a larger network from the parameters of a smaller network in a data-driven way.

    But the smaller model may actually be quite large — perhaps it has a hundred million parameters — and researchers might want to make a model with a billion parameters. So the LiGO technique breaks the linear map into smaller pieces that a machine-learning algorithm can handle.
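    The article does not spell out how LiGO factors its map, so the following is only a sketch of one way a large linear map can be broken into smaller pieces: a weight matrix is expanded by two small matrices rather than by a single dense map over every parameter. The names `grow_weight` and `expand` are hypothetical, and the expansion matrix is fixed by hand here, whereas a learned operator would train its entries.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def grow_weight(w_small, out_map, in_map):
    """Apply a factored linear map: W_large = out_map @ W_small @ in_map^T."""
    return matmul(matmul(out_map, w_small), transpose(in_map))

# Small layer: 2 x 2. Target layer: 3 x 3.
w_small = [[1.0, 2.0],
           [3.0, 4.0]]
# Hypothetical 3 x 2 expansion matrix; its third row averages the first two.
expand = [[1.0, 0.0],
          [0.0, 1.0],
          [0.5, 0.5]]
w_large = grow_weight(w_small, expand, expand)

d_small, d_large = 2, 3
full_map_entries = (d_large * d_large) * (d_small * d_small)  # one dense map
factored_entries = 2 * d_large * d_small                      # two factors
print(w_large, full_map_entries, factored_entries)
```

    Every entry of the larger matrix is a linear combination of the smaller matrix's entries, yet the factored form needs 12 numbers where a single dense map would need 36. The gap widens quickly as the layers grow.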

    LiGO also expands width and depth simultaneously, which makes it more efficient than other methods. A user can tune how wide and deep they want the larger model to be when they input the smaller model and its parameters, Kim explains.

    When they compared their technique to the process of training a new model from scratch, as well as to model-growth methods, it was faster than all the baselines. Their method saves about 50 percent of the computational costs required to train both vision and language models, while often improving performance.

    The researchers also found they could use LiGO to accelerate transformer training even when they didn’t have access to a smaller, pretrained model.

    “I was surprised by how much better all the methods, including ours, did compared to the random initialization, train-from-scratch baselines,” Kim says.

    In the future, Kim and his collaborators are looking forward to applying LiGO to even larger models.

    The work was funded, in part, by the MIT-IBM Watson AI Lab, Amazon, the IBM Research AI Hardware Center, the Center for Computational Innovation at Rensselaer Polytechnic Institute, and the U.S. Army Research Office.

  • in

    New method accelerates data retrieval in huge databases

    Hashing is a core operation in most online databases, like a library catalogue or an e-commerce website. A hash function generates codes that directly determine the location where data would be stored. So, using these codes, it is easier to find and retrieve the data.

    However, because traditional hash functions generate codes randomly, sometimes two pieces of data can be hashed with the same value. This causes collisions — when searching for one item points a user to many pieces of data with the same hash value. It takes much longer to find the right one, resulting in slower searches and reduced performance.

    Certain types of hash functions, known as perfect hash functions, are designed to place the data in a way that prevents collisions. But they are time-consuming to construct for each dataset and take more time to compute than traditional hash functions.

    Since hashing is used in so many applications, from database indexing to data compression to cryptography, fast and efficient hash functions are critical. So, researchers from MIT and elsewhere set out to see if they could use machine learning to build better hash functions.

    They found that, in certain situations, using learned models instead of traditional hash functions could result in half as many collisions. These learned models are created by running a machine-learning algorithm on a dataset to capture specific characteristics. The team’s experiments also showed that learned models were often more computationally efficient than perfect hash functions.

    “What we found in this work is that in some situations we can come up with a better tradeoff between the computation of the hash function and the collisions we will face. In these situations, the computation time for the hash function can be increased a bit, but at the same time its collisions can be reduced very significantly,” says Ibrahim Sabek, a postdoc in the MIT Data Systems Group of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Their research, which will be presented at the 2023 International Conference on Very Large Databases, demonstrates how a hash function can be designed to significantly speed up searches in a huge database. For instance, their technique could accelerate computational systems that scientists use to store and analyze DNA, amino acid sequences, or other biological information.

    Sabek is the co-lead author of the paper with Department of Electrical Engineering and Computer Science (EECS) graduate student Kapil Vaidya. They are joined by co-authors Dominick Horn, a graduate student at the Technical University of Munich; Andreas Kipf, an MIT postdoc; Michael Mitzenmacher, professor of computer science at the Harvard John A. Paulson School of Engineering and Applied Sciences; and senior author Tim Kraska, associate professor of EECS at MIT and co-director of the Data, Systems, and AI Lab.

    Hashing it out

    Given a data input, or key, a traditional hash function generates a random number, or code, that corresponds to the slot where that key will be stored. To use a simple example, if there are 10 keys to be put into 10 slots, the function would generate a random integer between 1 and 10 for each input. It is highly probable that two keys will end up in the same slot, causing collisions.
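    The 10-keys-into-10-slots example can be tried directly. The sketch below uses SHA-256 as a stand-in for a traditional hash function so that results are repeatable; the key names and the helpers `hash_slot` and `colliding_keys` are made up for illustration, and slots are numbered 0 to 9 rather than 1 to 10.

```python
import hashlib
import math
from collections import Counter

def hash_slot(key, num_slots):
    """Map a key to a slot using a fixed, random-looking hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_slots

def colliding_keys(keys, num_slots):
    """Count keys that share their slot with at least one other key."""
    counts = Counter(hash_slot(k, num_slots) for k in keys)
    return sum(c for c in counts.values() if c > 1)

keys = [f"key-{i}" for i in range(10)]
print("colliding keys:", colliding_keys(keys, 10))

# Chance that 10 uniformly random slot choices out of 10 never repeat.
p_no_collision = math.factorial(10) / 10 ** 10
print(f"P(no collision) = {p_no_collision:.8f}")
```

    Under the idealized assumption of uniformly random slots, the chance of avoiding every collision is 10!/10^10, or about 0.04 percent, which is why a collision is all but certain in this example.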

    Perfect hash functions provide a collision-free alternative. Researchers give the function some extra knowledge, such as the number of slots the data are to be placed into. Then it can perform additional computations to figure out where to put each key to avoid collisions. However, these added computations make the function harder to create and less efficient.

    “We were wondering, if we know more about the data — that it will come from a particular distribution — can we use learned models to build a hash function that can actually reduce collisions?” Vaidya says.

    A data distribution shows all possible values in a dataset, and how often each value occurs. The distribution can be used to calculate the probability that a particular value is in a data sample.

    The researchers took a small sample from a dataset and used machine learning to approximate the shape of the data’s distribution, or how the data are spread out. The learned model then uses the approximation to predict the location of a key in the dataset.

    They found that learned models were easier to build and faster to run than perfect hash functions and that they led to fewer collisions than traditional hash functions if data are distributed in a predictable way. But if the data are not predictably distributed because gaps between data points vary too widely, using learned models might cause more collisions.
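    A rough sketch of the idea, using the simplest possible stand-in for a learned model: a straight-line estimate of the data's cumulative distribution, fitted from the smallest and largest values in a small sample. The paper's actual models are not described here, and `fit_linear_cdf` and `learned_slot` are hypothetical names. Integer arithmetic keeps the slot calculation exact.

```python
from collections import Counter

def fit_linear_cdf(sample):
    """'Learn' a straight-line CDF from a sample: its offset and span."""
    lo, hi = min(sample), max(sample)
    return lo, hi - lo + 1

def learned_slot(key, model, num_slots):
    """Predict a slot from the key's estimated position in the data."""
    lo, span = model
    slot = (key - lo) * num_slots // span
    return max(0, min(num_slots - 1, slot))

def colliding_keys(keys, slot_fn):
    """Count keys that share their slot with at least one other key."""
    counts = Counter(slot_fn(k) for k in keys)
    return sum(c for c in counts.values() if c > 1)

# Evenly spread keys: the straight line matches the data.
even_keys = list(range(0, 1000, 10))
even_model = fit_linear_cdf(even_keys[::11])
even_collisions = colliding_keys(
    even_keys, lambda k: learned_slot(k, even_model, len(even_keys)))

# Keys whose gaps double each time: the same kind of model fits poorly.
skewed_keys = [2 ** j for j in range(20)]
skewed_model = fit_linear_cdf(skewed_keys[::19])
skewed_collisions = colliding_keys(
    skewed_keys, lambda k: learned_slot(k, skewed_model, len(skewed_keys)))

print(even_collisions, skewed_collisions)
```

    With evenly spaced keys every key lands in its own slot, while the keys with widely varying gaps pile into the first slot, mirroring the two cases described above.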

    “We may have a huge number of data inputs, and the gaps between consecutive inputs are very different, so learning a model to capture the data distribution of these inputs is quite difficult,” Sabek explains.

    Fewer collisions, faster results

    When data were predictably distributed, learned models could reduce the ratio of colliding keys in a dataset from 30 percent to 15 percent, compared with traditional hash functions. They were also able to achieve better throughput than perfect hash functions. In the best cases, learned models reduced the runtime by nearly 30 percent.

    As they explored the use of learned models for hashing, the researchers also found that throughput was impacted most by the number of sub-models. Each learned model is composed of smaller linear models that approximate the data distribution for different parts of the data. With more sub-models, the learned model produces a more accurate approximation, but it takes more time.

    “At a certain threshold of sub-models, you get enough information to build the approximation that you need for the hash function. But after that, it won’t lead to more improvement in collision reduction,” Sabek says.
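    The effect of adding sub-models can be sketched by splitting the key range into segments and giving each segment its own straight line. This is an illustrative toy, not the researchers' implementation: the names are hypothetical, the boundaries are picked from the full sorted data rather than from a sample, and each segment simply receives an equal share of the slots.

```python
from bisect import bisect_right
from collections import Counter

def pick_boundaries(sorted_keys, num_submodels):
    """Choose segment boundaries at evenly spaced ranks of the data."""
    last = len(sorted_keys) - 1
    return [sorted_keys[i * last // num_submodels]
            for i in range(num_submodels + 1)]

def piecewise_slot(key, boundaries, num_slots):
    """Route the key to its segment, then apply that segment's line."""
    num_submodels = len(boundaries) - 1
    seg = min(max(bisect_right(boundaries, key) - 1, 0), num_submodels - 1)
    slots_per_seg = num_slots // num_submodels
    lo, hi = boundaries[seg], boundaries[seg + 1]
    slot = seg * slots_per_seg + (key - lo) * slots_per_seg // (hi - lo)
    return min(slot, num_slots - 1)

def colliding_keys(keys, boundaries, num_slots):
    """Count keys that share their slot with at least one other key."""
    counts = Counter(piecewise_slot(k, boundaries, num_slots) for k in keys)
    return sum(c for c in counts.values() if c > 1)

keys = [2 ** j for j in range(20)]  # gaps double at every step
results = {m: colliding_keys(keys, pick_boundaries(keys, m), len(keys))
           for m in (1, 4, 19)}
print(results)
```

    On this toy data, more segments mean fewer colliding keys, but each lookup must first find the right segment and the model has more boundaries to store, which is the accuracy-versus-time tradeoff the researchers describe.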

    Building off this analysis, the researchers want to use learned models to design hash functions for other types of data. They also plan to explore learned hashing for databases in which data can be inserted or deleted. When data are updated in this way, the model needs to change accordingly, but changing the model while maintaining accuracy is a difficult problem.

    “We want to encourage the community to use machine learning inside more fundamental data structures and algorithms. Any kind of core data structure presents us with an opportunity to use machine learning to capture data properties and get better performance. There is still a lot we can explore,” Sabek says.

    “Hashing and indexing functions are core to a lot of database functionality. Given the variety of users and use cases, there is no one size fits all hashing, and learned models help adapt the database to a specific user. This paper is a great balanced analysis of the feasibility of these new techniques and does a good job of talking rigorously about the pros and cons, and helps us build our understanding of when such methods can be expected to work well,” says Murali Narayanaswamy, a principal machine learning scientist at Amazon, who was not involved with this work. “Exploring these kinds of enhancements is an exciting area of research both in academia and industry, and the kind of rigor shown in this work is critical for these methods to have large impact.”

    This work was supported, in part, by Google, Intel, Microsoft, the U.S. National Science Foundation, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator.

  • in

    Large language models are biased. Can logic help save them?

    Turns out, even language models “think” they’re biased. When asked, ChatGPT responded as follows: “Yes, language models can have biases, because the training data reflects the biases present in society from which that data was collected. For example, gender and racial biases are prevalent in many real-world datasets, and if a language model is trained on that, it can perpetuate and amplify these biases in its predictions.” A well-known but dangerous problem.

    Humans (typically) can dabble with both logical and stereotypical reasoning when learning. Still, language models mainly mimic the latter, an unfortunate narrative we’ve seen play out ad nauseam when the ability to employ reasoning and critical thinking is absent. So would injecting logic into the fray be enough to mitigate such behavior? 

    Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) had an inkling that it might, so they set off to examine whether logic-aware language models could avoid the more harmful stereotypes. They trained a language model to predict the relationship between two sentences, based on context and semantic meaning, using a dataset with labels for text snippets detailing if a second phrase “entails,” “contradicts,” or is neutral with respect to the first one. Using this natural language inference dataset, they found that the newly trained models were significantly less biased than other baselines, without any extra data, data editing, or additional training algorithms.

    For example, with the premise “the person is a doctor” and the hypothesis “the person is masculine,” using these logic-trained models, the relationship would be classified as “neutral,” since there’s no logic that says the person is a man. With more common language models, two sentences might seem to be correlated due to some bias in training data, like “doctor” might be pinged with “masculine,” even when there’s no evidence that the statement is true. 
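    The premise-and-hypothesis setup can be written down as a small data structure plus a probe that counts how often a classifier reads something into a pair that logic does not support. Everything here is a hypothetical sketch: `Pair`, `stereotype_rate`, and the two stub classifiers are invented for illustration and stand in for real models.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pair:
    premise: str
    hypothesis: str

LABELS = ("entails", "contradicts", "neutral")

def stereotype_rate(classify, probes):
    """Share of probe pairs that a classifier does NOT label 'neutral'."""
    flagged = sum(1 for p in probes if classify(p) != "neutral")
    return flagged / len(probes)

probes = [
    Pair("The person is a doctor.", "The person is masculine."),
    Pair("The person is a doctor.", "The person is feminine."),
]

def always_neutral(pair):
    """Stand-in for a logic-aware model: neither hypothesis follows."""
    return "neutral"

def biased_stub(pair):
    """Stand-in for a biased model that links 'doctor' with 'masculine'."""
    return "entails" if "masculine" in pair.hypothesis else "neutral"

print(stereotype_rate(always_neutral, probes))
print(stereotype_rate(biased_stub, probes))
```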

    At this point, the omnipresent nature of language models is well-known: Applications in natural language processing, speech recognition, conversational AI, and generative tasks abound. Although the field is not new, growing pains can take a front seat as these models increase in complexity and capability.

    “Current language models suffer from issues with fairness, computational resources, and privacy,” says MIT CSAIL postdoc Hongyin Luo, the lead author of a new paper about the work. “Many estimates say that the CO2 emission of training a language model can be higher than the lifelong emission of a car. Running these large language models is also very expensive because of the amount of parameters and the computational resources they need. With privacy, state-of-the-art language models developed by places like ChatGPT or GPT-3 have their APIs where you must upload your language, but there’s no place for sensitive information regarding things like health care or finance. To solve these challenges, we proposed a logical language model that we qualitatively measured as fair, is 500 times smaller than the state-of-the-art models, can be deployed locally, and with no human-annotated training samples for downstream tasks. Our model uses 1/400 the parameters compared with the largest language models, has better performance on some tasks, and significantly saves computation resources.” 

    This model, which has 350 million parameters, outperformed some very large-scale language models with 100 billion parameters on logic-language understanding tasks. The team compared, for example, popular BERT pretrained language models with their “textual entailment” models on stereotype, profession, and emotion bias tests. The latter showed significantly lower bias than the other models while preserving language modeling ability. Fairness was evaluated with ideal context association (iCAT) tests, where higher iCAT scores mean fewer stereotypes. The model scored higher than 90 percent on iCAT, while other strong language understanding models ranged from 40 to 80.
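    The article does not give the iCAT formula. In the StereoSet benchmark, where the score originates, it is usually stated as the language modeling score scaled by how close the stereotype score is to an even 50 percent; treat that definition, and the parameter names `lms` and `ss`, as assumptions in the sketch below.

```python
def icat(lms, ss):
    """iCAT score from a language modeling score and a stereotype score.

    Both inputs are percentages in [0, 100]. ss = 50 means the model shows
    no preference between stereotypical and anti-stereotypical completions.
    """
    return lms * min(ss, 100 - ss) / 50

print(icat(100, 50))   # perfect language modeling, no stereotype preference
print(icat(90, 60))    # good language modeling, mild stereotype preference
print(icat(90, 100))   # always picks the stereotype
```

    Under this definition a model can only score high by being both a capable language model and even-handed, which matches the article's point that the bias reduction came while preserving language modeling ability.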

    Luo wrote the paper alongside MIT Senior Research Scientist James Glass. They will present the work at the Conference of the European Chapter of the Association for Computational Linguistics in Croatia. 

    Unsurprisingly, the original pretrained language models the team examined were teeming with bias, confirmed by a slew of reasoning tests demonstrating how professional and emotion terms are significantly biased toward the feminine or masculine words in the gender vocabulary.

    With professions, a language model (which is biased) thinks that “flight attendant,” “secretary,” and “physician’s assistant” are feminine jobs, while “fisherman,” “lawyer,” and “judge” are masculine. Concerning emotions, a language model thinks that “anxious,” “depressed,” and “devastated” are feminine.

    While we may still be far away from a neutral language model utopia, this research is ongoing in that pursuit. Currently, the model is just for language understanding, so it’s based on reasoning among existing sentences. Unfortunately, it can’t generate sentences for now, so the next step for the researchers would be targeting the uber-popular generative models built with logical learning to ensure more fairness with computational efficiency. 

    “Although stereotypical reasoning is a natural part of human recognition, fairness-aware people conduct reasoning with logic rather than stereotypes when necessary,” says Luo. “We show that language models have similar properties. A language model without explicit logic learning makes plenty of biased reasoning, but adding logic learning can significantly mitigate such behavior. Furthermore, with demonstrated robust zero-shot adaptation ability, the model can be directly deployed to different tasks with more fairness, privacy, and better speed.”

  • in

    Efficient technique improves machine-learning models’ reliability

    Powerful machine-learning models are being used to help people tackle tough problems such as identifying disease in medical images or detecting road obstacles for autonomous vehicles. But machine-learning models can make mistakes, so in high-stakes settings it’s critical that humans know when to trust a model’s predictions.

    Uncertainty quantification is one tool that improves a model’s reliability; the model produces a score along with the prediction that expresses a confidence level that the prediction is correct. While uncertainty quantification can be useful, existing methods typically require retraining the entire model to give it that ability. Training involves showing a model millions of examples so it can learn a task. Retraining then requires millions of new data inputs, which can be expensive and difficult to obtain, and also uses huge amounts of computing resources.

    Researchers at MIT and the MIT-IBM Watson AI Lab have now developed a technique that enables a model to perform more effective uncertainty quantification, while using far fewer computing resources than other methods, and no additional data. Their technique, which does not require a user to retrain or modify a model, is flexible enough for many applications.

    The technique involves creating a simpler companion model that assists the original machine-learning model in estimating uncertainty. This smaller model is designed to identify different types of uncertainty, which can help researchers drill down on the root cause of inaccurate predictions.

    “Uncertainty quantification is essential for both developers and users of machine-learning models. Developers can utilize uncertainty measurements to help develop more robust models, while for users, it can add another layer of trust and reliability when deploying models in the real world. Our work leads to a more flexible and practical solution for uncertainty quantification,” says Maohao Shen, an electrical engineering and computer science graduate student and lead author of a paper on this technique.

    Shen wrote the paper with Yuheng Bu, a former postdoc in the Research Laboratory of Electronics (RLE) who is now an assistant professor at the University of Florida; Prasanna Sattigeri, Soumya Ghosh, and Subhro Das, research staff members at the MIT-IBM Watson AI Lab; and senior author Gregory Wornell, the Sumitomo Professor in Engineering who leads the Signals, Information, and Algorithms Laboratory in RLE and is a member of the MIT-IBM Watson AI Lab. The research will be presented at the AAAI Conference on Artificial Intelligence.

    Quantifying uncertainty

    In uncertainty quantification, a machine-learning model generates a numerical score with each output to reflect its confidence in that prediction’s accuracy. Incorporating uncertainty quantification by building a new model from scratch or retraining an existing model typically requires a large amount of data and expensive computation, which is often impractical. What’s more, existing methods sometimes have the unintended consequence of degrading the quality of the model’s predictions.
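    As a baseline picture of "a numerical score with each output," the sketch below turns a classifier's raw outputs into probabilities with a softmax and reports the top probability as the confidence. This is the common default rather than the researchers' technique, and the example logits are made up.

```python
import math

def softmax(logits):
    """Turn raw model outputs into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_with_confidence(logits):
    """Return the predicted class index and the probability assigned to it."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]

label, confidence = predict_with_confidence([2.0, 0.5, -1.0])
print(label, round(confidence, 3))
```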

    The MIT and MIT-IBM Watson AI Lab researchers have thus zeroed in on the following problem: Given a pretrained model, how can they enable it to perform effective uncertainty quantification?

    They solve this by creating a smaller and simpler model, known as a metamodel, that attaches to the larger, pretrained model and uses the features that larger model has already learned to help it make uncertainty quantification assessments.

    “The metamodel can be applied to any pretrained model. It is better to have access to the internals of the model, because we can get much more information about the base model, but it will also work if you just have a final output. It can still predict a confidence score,” Sattigeri says.
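    The general shape of the approach, a small model trained on a frozen model's outputs to predict whether that model is right, can be sketched with a one-feature logistic regression. This is a deliberately simple stand-in and not the metamodel from the paper: the feature (the gap between the base model's top two class probabilities), the training records, and all function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_metamodel(features, was_correct, steps=2000, lr=0.5):
    """Fit a one-feature logistic regression by gradient descent.

    features: one number per example taken from the frozen base model
              (here, the gap between its top two class probabilities).
    was_correct: 1 if the base model's prediction was right, else 0.
    """
    w, b = 0.0, 0.0
    n = len(features)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y
                  for x, y in zip(features, was_correct)]
        w -= lr * sum(e * x for e, x in zip(errors, features)) / n
        b -= lr * sum(errors) / n
    return w, b

def confidence(metamodel, feature):
    """Estimated probability that the base model is right on this input."""
    w, b = metamodel
    return sigmoid(w * feature + b)

# Hypothetical records from a base model that is never retrained.
margins = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
correct = [0, 0, 0, 1, 1, 1]
metamodel = train_metamodel(margins, correct)
print(confidence(metamodel, 0.9), confidence(metamodel, 0.1))
```

    Only the two metamodel parameters are trained; the base model is used as-is, which is what keeps this style of approach cheap compared with retraining.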

    They design the metamodel to produce the uncertainty quantification output using a technique that includes both types of uncertainty: data uncertainty and model uncertainty. Data uncertainty is caused by corrupted data or inaccurate labels and can only be reduced by fixing the dataset or gathering new data. In model uncertainty, the model is not sure how to explain the newly observed data and might make incorrect predictions, most likely because it hasn’t seen enough similar training examples. This issue is an especially challenging but common problem when models are deployed. In real-world settings, they often encounter data that are different from the training dataset.
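
    One common way to separate the two kinds of uncertainty, shown here only as an illustration and not necessarily the decomposition used in the paper, is to compare several plausible predictions for the same input. The entropy of the averaged prediction is treated as total uncertainty, the average entropy of the individual predictions as data uncertainty, and the gap between them as model uncertainty. The function names and the two-member "ensemble" below are assumptions made for the example.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)

def decompose(member_probs):
    """Split total predictive uncertainty into data and model components.

    member_probs: list of class-probability lists, one per ensemble member
    (or per sample from whatever distribution over predictions is available).
    """
    n = len(member_probs)
    k = len(member_probs[0])
    mean = [sum(m[i] for m in member_probs) / n for i in range(k)]
    total = entropy(mean)                             # uncertainty of the averaged prediction
    data = sum(entropy(m) for m in member_probs) / n  # average per-member uncertainty
    model = total - data                              # disagreement between members
    return total, data, model

# Members agree that the input is ambiguous: all uncertainty is data uncertainty.
agree = decompose([[0.5, 0.5], [0.5, 0.5]])

# Members are individually confident but contradict each other: model uncertainty.
disagree = decompose([[1.0, 0.0], [0.0, 1.0]])
```

    Both cases have the same total uncertainty, but only the second would be expected to shrink with more training examples similar to the input.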

    “Has the reliability of your decisions changed when you use the model in a new setting? You want some way to have confidence in whether it is working in this new regime or whether you need to collect training data for this particular new setting,” Wornell says.

    Validating the quantification

    Once a model produces an uncertainty quantification score, the user still needs some assurance that the score itself is accurate. Researchers often validate accuracy by creating a smaller dataset, held out from the original training data, and then testing the model on the held-out data. However, this technique does not work well in measuring uncertainty quantification because the model can achieve good prediction accuracy while still being over-confident, Shen says.
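
    The gap between accuracy and confidence can be made concrete with expected calibration error, a widely used metric that the article does not name and that is used here only as an assumed example. The sketch compares two hypothetical models with identical accuracy on held-out data, one of which reports 99 percent confidence on every prediction.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    Predictions are grouped into equal-width confidence bins (lo, hi];
    a confidence of exactly 0 falls outside every bin and is ignored.
    """
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(acc - conf)
    return ece

# Both hypothetical models get 75 of 100 held-out predictions right.
correct = [1] * 75 + [0] * 25

overconfident_ece = expected_calibration_error([0.99] * 100, correct)
calibrated_ece = expected_calibration_error([0.75] * 100, correct)
```

    Held-out accuracy alone cannot tell these two models apart; the calibration gap (about 0.24 versus 0) can.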

    They created a new validation technique by adding noise to the data in the validation set — this noisy data is more like out-of-distribution data that can cause model uncertainty. The researchers use this noisy dataset to evaluate uncertainty quantifications.
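
    A rough sketch of this kind of check, under several assumptions: the inputs are plain feature vectors, the noise is Gaussian, the uncertainty score is a toy stand-in, and the summary statistic is the probability that a noisy input receives a higher uncertainty score than a clean one. None of these specifics come from the paper.

```python
import random

def add_noise(dataset, sigma, rng):
    """Return a copy of the dataset with Gaussian noise added to every feature."""
    return [[v + rng.gauss(0, sigma) for v in x] for x in dataset]

def auroc(scores_clean, scores_noisy):
    """Probability that a noisy input gets a higher uncertainty score than a clean one."""
    wins = 0.0
    for s_noisy in scores_noisy:
        for s_clean in scores_clean:
            if s_noisy > s_clean:
                wins += 1.0
            elif s_noisy == s_clean:
                wins += 0.5
    return wins / (len(scores_clean) * len(scores_noisy))

def toy_uncertainty(x):
    """Hypothetical uncertainty score: squared distance from the origin,
    where the toy validation data below is centred."""
    return sum(v * v for v in x)

rng = random.Random(0)
clean = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(200)]
noisy = add_noise(clean, sigma=3.0, rng=rng)

score = auroc([toy_uncertainty(x) for x in clean],
              [toy_uncertainty(x) for x in noisy])
```

    A score near 1 means the uncertainty estimate reliably flags the noisy inputs; a score near 0.5 means it cannot tell them apart from clean ones.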

    They tested their approach by seeing how well a metamodel could capture different types of uncertainty for various downstream tasks, including out-of-distribution detection and misclassification detection. Their method not only outperformed all the baselines in each downstream task but also required less training time to achieve those results.

    This technique could help researchers enable more machine-learning models to effectively perform uncertainty quantification, ultimately aiding users in making better decisions about when to trust predictions.

    Moving forward, the researchers want to adapt their technique for newer classes of models, such as large language models that have a different structure than a traditional neural network, Shen says.

    The work was funded, in part, by the MIT-IBM Watson AI Lab and the U.S. National Science Foundation.

  • in

    Helping companies deploy AI models more responsibly

    Companies today are incorporating artificial intelligence into every corner of their business. The trend is expected to continue until machine-learning models are incorporated into most of the products and services we interact with every day.

    As those models become a bigger part of our lives, ensuring their integrity becomes more important. That’s the mission of Verta, a startup that spun out of MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Verta’s platform helps companies deploy, monitor, and manage machine-learning models safely and at scale. Data scientists and engineers can use Verta’s tools to track different versions of models, audit them for bias, test them before deployment, and monitor their performance in the real world.

    “Everything we do is to enable more products to be built with AI, and to do that safely,” Verta founder and CEO Manasi Vartak SM ’14, PhD ’18 says. “We’re already seeing with ChatGPT how AI can be used to generate data, artefacts — you name it — that look correct but aren’t correct. There needs to be more governance and control in how AI is being used, particularly for enterprises providing AI solutions.”

    Verta is currently working with large companies in health care, finance, and insurance to help them understand and audit their models’ recommendations and predictions. It’s also working with a number of high-growth tech companies looking to speed up deployment of new, AI-enabled solutions while ensuring those solutions are used appropriately.

    Vartak says the company has been able to decrease the time it takes customers to deploy AI models by orders of magnitude while ensuring those models are explainable and fair — an especially important factor for companies in highly regulated industries.

    Health care companies, for example, can use Verta to improve AI-powered patient monitoring and treatment recommendations. Such systems need to be thoroughly vetted for errors and biases before they’re used on patients.

    “Whether it’s bias or fairness or explainability, it goes back to our philosophy on model governance and management,” Vartak says. “We think of it like a preflight checklist: Before an airplane takes off, there’s a set of checks you need to do before you get your airplane off the ground. It’s similar with AI models. You need to make sure you’ve done your bias checks, you need to make sure there’s some level of explainability, you need to make sure your model is reproducible. We help with all of that.”

    From project to product

    Before coming to MIT, Vartak worked as a data scientist for a social media company. In one project, after spending weeks tuning machine-learning models that curated content to show in people’s feeds, she learned an ex-employee had already done the same thing. Unfortunately, there was no record of what they did or how it affected the models.

    For her PhD at MIT, Vartak decided to build tools to help data scientists develop, test, and iterate on machine-learning models. Working in CSAIL’s Database Group, Vartak recruited a team of graduate students and participants in MIT’s Undergraduate Research Opportunities Program (UROP).

    “Verta would not exist without my work at MIT and MIT’s ecosystem,” Vartak says. “MIT brings together people on the cutting edge of tech and helps us build the next generation of tools.”

    The team worked with data scientists in the CSAIL Alliances program to decide what features to build and iterated based on feedback from those early adopters. Vartak says the resulting project, named ModelDB, was the first open-source model management system.

    Vartak also took several business classes at the MIT Sloan School of Management during her PhD and worked with classmates on startups that recommended clothing and tracked health, spending countless hours in the Martin Trust Center for MIT Entrepreneurship and participating in the center’s delta v summer accelerator.

    “What MIT lets you do is take risks and fail in a safe environment,” Vartak says. “MIT afforded me those forays into entrepreneurship and showed me how to go about building products and finding first customers, so by the time Verta came around I had done it on a smaller scale.”

    ModelDB helped data scientists train and track models, but Vartak quickly saw the stakes were higher once models were deployed at scale. At that point, trying to improve (or accidentally breaking) models can have major implications for companies and society. That insight led Vartak to begin building Verta.

    “At Verta, we help manage models, help run models, and make sure they’re working as expected, which we call model monitoring,” Vartak explains. “All of those pieces have their roots back to MIT and my thesis work. Verta really evolved from my PhD project at MIT.”

    Verta’s platform helps companies deploy models more quickly, ensure they continue working as intended over time, and manage the models for compliance and governance. Data scientists can use Verta to track different versions of models and understand how they were built, answering questions like how data were used and which explainability or bias checks were run. They can also vet them by running them through deployment checklists and security scans.

    “Verta’s platform takes the data science model and adds half a dozen layers to it to transform it into something you can use to power, say, an entire recommendation system on your website,” Vartak says. “That includes performance optimizations, scaling, and cycle time, which is how quickly you can take a model and turn it into a valuable product, as well as governance.”

    Supporting the AI wave

    Vartak says large companies often use thousands of different models that influence nearly every part of their operations.

    “An insurance company, for example, will use models for everything from underwriting to claims, back-office processing, marketing, and sales,” Vartak says. “So, the diversity of models is really high, there’s a large volume of them, and the level of scrutiny and compliance companies need around these models are very high. They need to know things like: Did you use the data you were supposed to use? Who were the people who vetted it? Did you run explainability checks? Did you run bias checks?”

    Vartak says companies that don’t adopt AI will be left behind. The companies that ride AI to success, meanwhile, will need well-defined processes in place to manage their ever-growing list of models.

    “In the next 10 years, every device we interact with is going to have intelligence built in, whether it’s a toaster or your email programs, and it’s going to make your life much, much easier,” Vartak says. “What’s going to enable that intelligence are better models and software, like Verta, that help you integrate AI into all of these applications very quickly.”

  • in

    Putting clear bounds on uncertainty

    In science and technology, there has been a long and steady drive toward improving the accuracy of measurements of all kinds, along with parallel efforts to enhance the resolution of images. An accompanying goal is to reduce the uncertainty in the estimates that can be made, and the inferences drawn, from the data (visual or otherwise) that have been collected. Yet uncertainty can never be wholly eliminated. And since we have to live with it, at least to some extent, there is much to be gained by quantifying the uncertainty as precisely as possible.

    Expressed in other terms, we’d like to know just how uncertain our uncertainty is.

    That issue was taken up in a new study, led by Swami Sankaranarayanan, a postdoc at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and his co-authors — Anastasios Angelopoulos and Stephen Bates of the University of California at Berkeley; Yaniv Romano of Technion, the Israel Institute of Technology; and Phillip Isola, an associate professor of electrical engineering and computer science at MIT. These researchers succeeded not only in obtaining accurate measures of uncertainty, they also found a way to display uncertainty in a manner the average person could grasp.

    Their paper, which was presented in December at the Neural Information Processing Systems Conference in New Orleans, relates to computer vision — a field of artificial intelligence that involves training computers to glean information from digital images. The focus of this research is on images that are partially smudged or corrupted (due to missing pixels), as well as on methods — computer algorithms, in particular — that are designed to uncover the part of the signal that is marred or otherwise concealed. An algorithm of this sort, Sankaranarayanan explains, “takes the blurred image as the input and gives you a clean image as the output” — a process that typically occurs in a couple of steps.

    First, there is an encoder, a kind of neural network specifically trained by the researchers for the task of de-blurring fuzzy images. The encoder takes a distorted image and, from that, creates an abstract (or “latent”) representation of a clean image in a form — consisting of a list of numbers — that is intelligible to a computer but would not make sense to most humans. The next step is a decoder, of which there are a couple of types, that are again usually neural networks. Sankaranarayanan and his colleagues worked with a kind of decoder called a “generative” model. In particular, they used an off-the-shelf version called StyleGAN, which takes the numbers from the encoded representation (of a cat, for instance) as its input and then constructs a complete, cleaned-up image (of that particular cat). So the entire process, including the encoding and decoding stages, yields a crisp picture from an originally muddied rendering.

    But how much faith can someone place in the accuracy of the resultant image? And, as addressed in the December 2022 paper, what is the best way to represent the uncertainty in that image? The standard approach is to create a “saliency map,” which ascribes a probability value — somewhere between 0 and 1 — to indicate the confidence the model has in the correctness of every pixel, taken one at a time. This strategy has a drawback, according to Sankaranarayanan, “because the prediction is performed independently for each pixel. But meaningful objects occur within groups of pixels, not within an individual pixel,” he adds, which is why he and his colleagues are proposing an entirely different way of assessing uncertainty.

    Their approach is centered around the “semantic attributes” of an image — groups of pixels that, when taken together, have meaning, making up a human face, for example, or a dog, or some other recognizable thing. The objective, Sankaranarayanan maintains, “is to estimate uncertainty in a way that relates to the groupings of pixels that humans can readily interpret.”

    Whereas the standard method might yield a single image, constituting the “best guess” as to what the true picture should be, the uncertainty in that representation is normally hard to discern. The new paper argues that for use in the real world, uncertainty should be presented in a way that holds meaning for people who are not experts in machine learning. Rather than producing a single image, the authors have devised a procedure for generating a range of images — each of which might be correct. Moreover, they can set precise bounds on the range, or interval, and provide a probabilistic guarantee that the true depiction lies somewhere within that range. A narrower range can be provided if the user is comfortable with, say, 90 percent certitude, and a narrower range still if more risk is acceptable.
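
    The trade-off between the width of the range and the accepted risk can be illustrated with split conformal prediction, a standard calibration technique swapped in here for illustration; it is not the paper's algorithm, which works with semantic attributes of a generative model. The sketch assumes a single scalar attribute and a hypothetical list of absolute errors measured on held-out calibration images.

```python
import math

def conformal_halfwidth(residuals, alpha):
    """Half-width q such that a fresh absolute error is at most q with
    probability >= 1 - alpha, assuming calibration and test data are exchangeable."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibrated quantile
    if k > n:
        return float("inf")  # too few calibration points for this risk level
    return sorted(residuals)[k - 1]

# Hypothetical absolute errors of one predicted attribute on 99 calibration images.
calibration_errors = list(range(1, 100))

q_90 = conformal_halfwidth(calibration_errors, alpha=0.1)  # 90 percent guarantee
q_80 = conformal_halfwidth(calibration_errors, alpha=0.2)  # more risk accepted

# Share of calibration errors that the 90 percent interval would have covered.
coverage_90 = sum(e <= q_90 for e in calibration_errors) / len(calibration_errors)
```

    Reporting a prediction plus or minus `q_90` gives the wider interval; accepting more risk with `alpha=0.2` yields a narrower one.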

    The authors believe their paper puts forth the first algorithm, designed for a generative model, which can establish uncertainty intervals that relate to meaningful (semantically-interpretable) features of an image and come with “a formal statistical guarantee.” While that is an important milestone, Sankaranarayanan considers it merely a step toward “the ultimate goal. So far, we have been able to do this for simple things, like restoring images of human faces or animals, but we want to extend this approach into more critical domains, such as medical imaging, where our ‘statistical guarantee’ could be especially important.”

    Suppose that the film, or radiograph, of a chest X-ray is blurred, he adds, “and you want to reconstruct the image. If you are given a range of images, you want to know that the true image is contained within that range, so you are not missing anything critical” — information that might reveal whether or not a patient has lung cancer or pneumonia. In fact, Sankaranarayanan and his colleagues have already begun working with a radiologist to see if their algorithm for predicting pneumonia could be useful in a clinical setting.

    Their work may also have relevance in the law enforcement field, he says. “The picture from a surveillance camera may be blurry, and you want to enhance that. Models for doing that already exist, but it is not easy to gauge the uncertainty. And you don’t want to make a mistake in a life-or-death situation.” The tools that he and his colleagues are developing could help identify a guilty person and help exonerate an innocent one as well.

    Much of what we do and many of the things happening in the world around us are shrouded in uncertainty, Sankaranarayanan notes. Therefore, gaining a firmer grasp of that uncertainty could help us in countless ways. For one thing, it can tell us more about exactly what it is we do not know.

    Angelopoulos was supported by the National Science Foundation. Bates was supported by the Foundations of Data Science Institute and the Simons Institute. Romano was supported by the Israel Science Foundation and by a Career Advancement Fellowship from Technion. Sankaranarayanan’s and Isola’s research for this project was sponsored by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator and was accomplished under Cooperative Agreement Number FA8750-19-2-1000. MIT SuperCloud and the Lincoln Laboratory Supercomputing Center also provided computing resources that contributed to the results reported in this work.

  • in

    Research, education, and connection in the face of war

    When Russian forces invaded Ukraine in February 2022, Tetiana Herasymova had several decisions to make: What should she do, where should she live, and should she take her MITx MicroMasters capstone exams? She had registered for the Statistics and Data Science Program’s final exams just days prior to moving out of her apartment and into a bomb shelter. Although it was difficult to focus on studying and preparations with air horns sounding overhead and uncertainty lingering around her, she was determined to try. “I wouldn’t let the aggressor in the war squash my dreams,” she says.

    A love of research and the desire to improve teaching 

    An early love of solving puzzles and problems for fun piqued Herasymova’s initial interest in mathematics. When she later pursued her PhD in mathematics at Kiev National Taras Shevchenko University, Herasymova’s love of math evolved into a love of research. Throughout her career, Herasymova has worked to close the gap between scientific researchers and educators. She started as a math tutor at MBA Strategy, a company that prepares Ukrainian leaders for the standardized tests required by MBA programs, and was later promoted to head of its test preparation department. Afterward, she moved on to an equivalent position at ZNOUA, a new project that prepared high school students for Ukraine’s standardized test, and she eventually became ZNOUA’s CEO.

    In 2018, she founded Prosteer, a “self-learning community” of educators who share research, pedagogy, and experience to learn from one another. “It’s really interesting to have a community of teachers from different domains,” she says, speaking of educators and researchers whose specialties range across language, mathematics, physics, music, and more.

    Implementing new pedagogical research in the classroom is often up to educators who seek out studies on an individual basis, Herasymova has found. “Lots of scientists are not practitioners,” she says, and the reverse is also true. She only became more determined to build these connections once she was promoted to head of test preparation at MBA Strategy because she wanted to share more effective pedagogy with the tutors she was mentoring.

    First, Herasymova knew she needed a way to measure the teachers’ effectiveness. She was able to determine whether students who received the company’s tutoring services improved their scores. Moreover, Ukraine keeps an open-access database of national standardized test scores, so anyone could analyze the data in hopes of improving the level of education in the country. She says, “I could do some analytics because I am a mathematician, but I knew I could do much more with this data if I knew data science and machine learning knowledge.”

    That’s why Herasymova sought out the MITx MicroMasters Program in Statistics and Data Science offered by the MIT Institute for Data, Systems, and Society (IDSS). “I wanted to learn the fundamentals so I could join the Learning Analytics domain,” she says. She was looking for a comprehensive program that covered the foundations without being overly basic. “I had some knowledge from the ground, so I could see the deepness of that course,” she says. Because of her background as an instructional designer, she thought the MicroMasters curriculum was well-constructed, calling the variety of videos, practice problems, and homework assignments that encouraged learners to approach the course material in different ways, “a perfect experience.”

    Another benefit of the MicroMasters program was its online format. “I had my usual work, so it was impossible to study in a stationary way,” she says. She found the structure to be more flexible than other programs. “It’s really great that you can construct your course schedule your own way, especially with your own adult life,” she says.

    Determination and support in the midst of war

    When the war first forced Herasymova to flee her apartment, she had already registered to take the exams for her four courses. “It was quite hard to prepare for exams when you could hear explosions outside of the bomb shelter,” she says. She and other Ukrainians were invited to postpone their exams until the following session, but the next available testing period wouldn’t be held until October. “It was a hard decision, but I had to allow myself to try,” she says. “For all people in Ukraine, when you don’t know if you’re going to live or die, you try to live in the now. You have to appreciate every moment and what life brings to you. You don’t say, ‘Someday.’ You do it today or tomorrow.”

    In addition to emotional support from her boyfriend, Herasymova had a group of friends who had also enrolled in the program, and they supported each other through study sessions and an ongoing chat. Herasymova’s personal support network helped her accomplish what she set out to do with her MicroMasters program, and in turn, she was able to support her professional network. While Prosteer halted its regular work during the early stages of the war, Herasymova was determined to support the community of educators and scientists that she had built. They continued meeting weekly to exchange ideas as usual. “It’s intrinsic motivation,” she says. They managed to restore all of their activities by October.

    Despite the factors stacked against her, Herasymova’s determination paid off — she passed all of her exams in May, the final step to earning her MicroMasters certificate in statistics and data science. “I just couldn’t believe it,” she says. “It was definitely a bifurcation point. The moment when you realize that you have something to rely on, and that life is just beginning to show all its diversity despite the fact that you live in war.” With her newly minted certificate in hand, Herasymova has continued her research on the effectiveness of educational models — analyzing the data herself — with a summer research program at New York University. 

    The student becomes the master

    After moving seven times between February and October, heading west from Kyiv until most recently settling near the border of Poland, Herasymova hopes she’s moved for the last time. Ukrainian Catholic University offered her a position teaching both mathematics and programming. Before enrolling in the MicroMasters Program in Statistics and Data Science, she had some prior knowledge of programming languages and mathematical algorithms, but she didn’t know Python. She took MITx’s Introduction to Computer Science and Programming Using Python to prepare. “It gave me a huge step forward,” she says. “I learned a lot. Now, not only can I work with Python machine learning models in programming language R, I also have knowledge of the big picture of the purpose and the point to do so.”

    In addition to the skills the MicroMasters Program trained her in, she gained firsthand experience in learning new subjects and exploring topics more deeply. She will be sharing that practice with the community of students and teachers she’s built, and she plans on guiding them through this course during the next year. As a continuation of her own educational growth, she says she’s looking forward to her next MITx course this year, Data Analysis.

    Herasymova advises that the best way to keep progressing is investing a lot of time. “Adults don’t want to hear this, but you need one or two years,” she says. “Allow yourself to be stupid. If you’re an expert in one domain and want to switch to another, or if you want to understand something new, a lot of people don’t ask questions or don’t ask for help. But from this point, if I don’t know something, I know I should ask for help because that’s the start of learning. With a fixed mindset, you won’t grow.”

    July 2022 MicroMasters Program Joint Completion Celebration. Ukrainian student Tetiana Herasymova, who completed her program amid war in her home country, speaks at 43:55.