  •

    Enabling AI-driven health advances without sacrificing patient privacy

    There’s a lot of excitement at the intersection of artificial intelligence and health care. AI has already been used to improve disease treatment and detection, discover promising new drugs, identify links between genes and diseases, and more.

    By analyzing large datasets and finding patterns, virtually any new algorithm has the potential to help patients — AI researchers just need access to the right data to train and test those algorithms. Hospitals, understandably, are hesitant to share sensitive patient information with research teams. When they do share data, it’s difficult to verify that researchers are only using the data they need and deleting it after they’re done.

    Secure AI Labs (SAIL) is addressing those problems with a technology that lets AI algorithms run on encrypted datasets that never leave the data owner’s system. Health care organizations can control how their datasets are used, while researchers can protect the confidentiality of their models and search queries. Neither party needs to see the data or the model to collaborate.

    SAIL’s platform can also combine data from multiple sources, creating rich insights that fuel more effective algorithms.

    “You shouldn’t have to schmooze with hospital executives for five years before you can run your machine learning algorithm,” says SAIL co-founder and MIT Professor Manolis Kellis, who co-founded the company with CEO Anne Kim ’16, SM ’17. “Our goal is to help patients, to help machine learning scientists, and to create new therapeutics. We want new algorithms — the best algorithms — to be applied to the biggest possible data set.”

    SAIL has already partnered with hospitals and life science companies to unlock anonymized data for researchers. In the next year, the company hopes to be working with about half of the top 50 academic medical centers in the country.

    Unleashing AI’s full potential

    As an undergraduate at MIT studying computer science and molecular biology, Kim worked with researchers in the Computer Science and Artificial Intelligence Laboratory (CSAIL) to analyze data from clinical trials, gene association studies, hospital intensive care units, and more.

    “I realized there is something severely broken in data sharing, whether it was hospitals using hard drives, ancient file transfer protocol, or even sending stuff in the mail,” Kim says. “It was all just not well-tracked.”

    Kellis, who is also a member of the Broad Institute of MIT and Harvard, has spent years establishing partnerships with hospitals and consortia across a range of diseases including cancers, heart disease, schizophrenia, and obesity. He knew that smaller research teams would struggle to get access to the same data his lab was working with.

    In 2017, Kellis and Kim decided to commercialize technology they were developing to allow AI algorithms to run on encrypted data.

    In the summer of 2018, Kim participated in the delta v startup accelerator run by the Martin Trust Center for MIT Entrepreneurship. The founders also received support from the Sandbox Innovation Fund and the Venture Mentoring Service, and made various early connections through their MIT network.

    To participate in SAIL’s program, hospitals and other health care organizations make parts of their data available to researchers by setting up a node behind their firewall. SAIL then sends encrypted algorithms to the servers where the datasets reside, in a process called federated learning. The algorithms crunch the data locally on each server and transmit the results back to a central model, which updates itself. Neither the researchers, nor the data owners, nor even SAIL has access to the models or the datasets.
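    The federated learning loop described above can be illustrated with a minimal sketch. This is not SAIL's implementation: the one-parameter linear model, the function names, and the two "hospital" datasets are all hypothetical, and encryption is omitted entirely. It only shows the idea that each site trains on its own data and sends back model weights, never the records themselves.

```python
# Hypothetical sketch of federated averaging on a 1-D linear model y = w * x.
# Not SAIL's code: encryption and access control are left out on purpose.

def local_update(w, data, lr=0.01, steps=50):
    """Run a few gradient-descent steps on one site's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(w_global, sites):
    """Each site trains locally; only updated weights leave the site."""
    updates = [(local_update(w_global, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    # Central model: average of local weights, weighted by local sample count.
    return sum(w * n for w, n in updates) / total

def train(sites, rounds=20):
    """Repeat the send-train-aggregate cycle for a fixed number of rounds."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, sites)
    return w

# Two made-up sites whose data both follow y = 3x.
hospital_a = [(1.0, 3.0), (2.0, 6.0)]
hospital_b = [(3.0, 9.0), (4.0, 12.0), (5.0, 15.0)]
w_final = train([hospital_a, hospital_b])
print(round(w_final, 6))
```

    The central server in this sketch only ever sees the numbers returned by `local_update`, which is the property the paragraph above describes.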

    The approach allows a much broader set of researchers to apply their models to large datasets. To further engage the research community, Kellis’ lab at MIT has begun holding competitions in which it gives access to datasets in areas like protein function and gene expression, and challenges researchers to predict results.

    “We invite machine learning researchers to come and train on last year’s data and predict this year’s data,” says Kellis. “If we see there’s a new type of algorithm that is performing best in these community-level assessments, people can adopt it locally at many different institutions and level the playing field. So, the only thing that matters is the quality of your algorithm rather than the power of your connections.”

    By enabling a large number of datasets to be anonymized into aggregate insights, SAIL’s technology also allows researchers to study rare diseases, in which small pools of relevant patient data are often spread out among many institutions. That fragmentation has historically made it difficult to apply AI models to such data.

    “We’re hoping that all of these datasets will eventually be open,” Kellis says. “We can cut across all the silos and enable a new era where every patient with every rare disorder across the entire world can come together in a single keystroke to analyze data.”

    Enabling the medicine of the future

    To work with large amounts of data around specific diseases, SAIL has increasingly sought to partner with patient associations and consortia of health care groups, including an international health care consulting company and the Kidney Cancer Association. The partnerships also align SAIL with patients, the group they’re most trying to help.

    Overall, the founders are happy to see SAIL solving, for researchers around the world, the problems they once faced in their own labs.

    “The right place to solve this is not an academic project. The right place to solve this is in industry, where we can provide a platform not just for my lab but for any researcher,” Kellis says. “It’s about creating an ecosystem of academia, researchers, pharma, biotech, and hospital partners. I think it’s the blending all of these different areas that will make that vision of medicine of the future become a reality.”

  •

    3 Questions: Kalyan Veeramachaneni on hurdles preventing fully automated machine learning

    The proliferation of big data across domains, from banking to health care to environmental monitoring, has spurred increasing demand for machine learning tools that help organizations make decisions based on the data they gather.

    That growing industry demand has driven researchers to explore the possibilities of automated machine learning (AutoML), which seeks to automate the development of machine learning solutions in order to make them accessible for nonexperts, improve their efficiency, and accelerate machine learning research. For example, an AutoML system might enable doctors to use their expertise interpreting electroencephalography (EEG) results to build a model that can predict which patients are at higher risk for epilepsy — without requiring the doctors to have a background in data science.

    Yet, despite more than a decade of work, researchers have been unable to fully automate all steps in the machine learning development process. Even the most efficient commercial AutoML systems still require a prolonged back-and-forth between a domain expert, like a marketing manager or mechanical engineer, and a data scientist, making the process inefficient.

    Kalyan Veeramachaneni, a principal research scientist in the MIT Laboratory for Information and Decision Systems who has been studying AutoML since 2010, has co-authored a paper in the journal ACM Computing Surveys that details a seven-tiered schematic to evaluate AutoML tools based on their level of autonomy.

    A system at level zero has no automation and requires a data scientist to start from scratch and build models by hand, while a tool at level six is completely automated and can be easily and effectively used by a nonexpert. Most commercial systems fall somewhere in the middle.

    Veeramachaneni spoke with MIT News about the current state of AutoML, the hurdles that prevent truly automatic machine learning systems, and the road ahead for AutoML researchers.

    Q: How has automatic machine learning evolved over the past decade, and what is the current state of AutoML systems?

    A: In 2010, we started to see a shift, with enterprises wanting to invest in getting value out of their data beyond just business intelligence. So then came the question, maybe there are certain things in the development of machine learning-based solutions that we can automate? The first iteration of AutoML was to make our own jobs as data scientists more efficient. Can we take away the grunt work that we do on a day-to-day basis and automate that by using a software system? That area of research ran its course until about 2015, when we realized we still weren’t able to speed up this development process.

    Then another thread emerged. There are a lot of problems that could be solved with data, and they come from experts who know those problems, who live with them on a daily basis. These individuals have very little to do with machine learning or software engineering. How do we bring them into the fold? That is really the next frontier.

    There are three areas where these domain experts have strong input in a machine learning system. The first is defining the problem itself and then helping to formulate it as a prediction task to be solved by a machine learning model. Second, they know how the data have been collected, so they also know intuitively how to process that data. And then third, at the end, machine learning models only give you a very tiny part of a solution — they just give you a prediction. The output of a machine learning model is just one input to help a domain expert get to a decision or action.

    Q: What steps of the machine learning pipeline are the most difficult to automate, and why has automating them been so challenging?

    A: The problem-formulation part is extremely difficult to automate. For example, if I am a researcher who wants to get more government funding, and I have a lot of data about the content of the research proposals that I write and whether or not I receive funding, can machine learning help there? We don’t know yet. In problem formulation, I use my domain expertise to translate the problem into something that is more tangible to predict, and that requires somebody who knows the domain very well. And he or she also knows how to use that information post-prediction. That problem is refusing to be automated.

    There is one part of problem-formulation that could be automated. It turns out that we can look at the data and mathematically express several possible prediction tasks automatically. Then we can share those prediction tasks with the domain expert to see if any of them would help in the larger problem they are trying to tackle. Then once you pick the prediction task, there are a lot of intermediate steps you do, including feature engineering, modeling, etc., that are very mechanical steps and easy to automate.

    But defining the prediction tasks has typically been a collaborative effort between data scientists and domain experts because, unless you know the domain, you can’t translate the domain problem into a prediction task. And then sometimes domain experts don’t know what is meant by “prediction.” That leads to the major, significant back and forth in the process. If you automate that step, then machine learning penetration and the use of data to create meaningful predictions will increase tremendously.

    Then what happens after the machine learning model gives a prediction? We can automate the software and technology part of it, but at the end of the day, it is root cause analysis and human intuition and decision making. We can augment them with a lot of tools, but we can’t fully automate that.

    Q: What do you hope to achieve with the seven-tiered framework for evaluating AutoML systems that you outlined in your paper?

    A: My hope is that people start to recognize that some levels of automation have already been achieved and some still need to be tackled. In the research community, we tend to focus on what we are comfortable with. We have gotten used to automating certain steps, and then we just stick to it. Automating these other parts of the machine learning solution development is very important, and that is where the biggest bottlenecks remain.

    My second hope is that researchers will very clearly understand what domain expertise means. A lot of this AutoML work is still being conducted by academics, and the problem is that we often don’t do applied work. There is not a crystal-clear definition of what a domain expert is and in itself, “domain expert,” is a very nebulous phrase. What we mean by domain expert is the expert in the problem you are trying to solve with machine learning. And I am hoping that everyone unifies around that because that would make things so much clearer.

    I still believe that we are not able to build that many models for that many problems, but even for the ones that we are building, the majority of them are not getting deployed and used in day-to-day life. The output of machine learning is just going to be another data point, an augmented data point, in someone’s decision making. How they make those decisions, based on that input, how that will change their behavior, and how they will adapt their style of working, that is still a big, open question. Once we automate everything, that is what’s next.

    We have to determine what has to fundamentally change in the day-to-day workflow of someone giving loans at a bank, or an educator trying to decide whether he or she should change the assignments in an online class. How are they going to use machine learning’s outputs? We need to focus on the fundamental things we have to build out to make machine learning more usable.

  •

    Using adversarial attacks to refine molecular energy predictions

    Neural networks (NNs) are increasingly being used to predict new materials, the rate and yield of chemical reactions, and drug-target interactions, among others. For these applications, they are orders of magnitude faster than traditional methods such as quantum mechanical simulations. 

    The price for this agility, however, is reliability. Because machine learning models only interpolate, they may fail when used outside the domain of training data.

    But the part that worried Rafael Gómez-Bombarelli, the Jeffrey Cheah Career Development Professor in the MIT Department of Materials Science and Engineering, and graduate students Daniel Schwalbe-Koda and Aik Rui Tan was that establishing the limits of these machine learning (ML) models is tedious and labor-intensive. 

    This is particularly true for predicting “potential energy surfaces” (PES), or the map of a molecule’s energy in all its configurations. These surfaces encode the complexities of a molecule into flatlands, valleys, peaks, troughs, and ravines. The most stable configurations of a system are usually in the deep pits, quantum mechanical chasms from which atoms and molecules typically do not escape.

    In a recent Nature Communications paper, the research team presented a way to demarcate the “safe zone” of a neural network by using “adversarial attacks.” Adversarial attacks have been studied for other classes of problems, such as image classification, but this is the first time that they are being used to sample molecular geometries in a PES. 

    “People have been using uncertainty for active learning for years in ML potentials. The key difference is that they need to run the full ML simulation and evaluate if the NN was reliable, and if it wasn’t, acquire more data, retrain and re-simulate. Meaning that it takes a long time to nail down the right model, and one has to run the ML simulation many times,” explains Gómez-Bombarelli.

    The Gómez-Bombarelli lab at MIT works on a synergistic synthesis of first-principles simulation and machine learning that greatly speeds up this process. The actual simulations are run only for a small fraction of these molecules, and all those data are fed into a neural network that learns how to predict the same properties for the rest of the molecules. They have successfully demonstrated these methods for a growing class of novel materials that includes catalysts for producing hydrogen from water, cheaper polymer electrolytes for electric vehicles, zeolites for molecular sieving, magnetic materials, and more.

    The challenge, however, is that these neural networks are only as smart as the data they are trained on. Considering the PES map, 99 percent of the data may fall into one pit, totally missing valleys that are of more interest.

    Such wrong predictions can have disastrous consequences — think of a self-driving car that fails to identify a person crossing the street.

    One way to find out the uncertainty of a model is to run the same data through multiple versions of it. 

    For this project, the researchers had multiple neural networks predict the potential energy surface from the same data. Where the network is fairly sure of the prediction, the variation between the outputs of different networks is minimal and the surfaces largely converge. When the network is uncertain, the predictions of different models vary widely, producing a range of outputs, any of which could be the correct surface. 

    The spread in the predictions of a “committee of neural networks” is the “uncertainty” at that point. A good model should not just indicate the best prediction, but also indicate the uncertainty about each of these predictions. It’s as if the neural network says, “This property for material A will have a value of X and I’m highly confident about it.”
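    As a rough illustration of committee-based uncertainty, the sketch below uses three hand-written functions as stand-ins for trained networks. The functions and the sample points are invented for this example. They agree closely near the "training region" around zero and diverge farther out, so the standard deviation of their outputs plays the role of the uncertainty described above.

```python
import statistics

def committee_predict(models, x):
    """Return the committee's mean prediction and its spread at point x."""
    preds = [model(x) for model in models]
    return statistics.mean(preds), statistics.pstdev(preds)

# Hypothetical committee: members agree near x = 0 and disagree far from it.
def member_a(x):
    return x ** 2

def member_b(x):
    return x ** 2 + 0.1 * x ** 3

def member_c(x):
    return x ** 2 - 0.1 * x ** 3

committee = [member_a, member_b, member_c]

mean_in, spread_in = committee_predict(committee, 0.1)   # well-sampled region
mean_out, spread_out = committee_predict(committee, 3.0)  # extrapolation

print(f"near training data: spread = {spread_in:.5f}")
print(f"far from training data: spread = {spread_out:.5f}")
```

    A small spread corresponds to the "highly confident" case in the text, while a large spread flags a point where any one member's prediction should not be trusted.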

    This could have been an elegant solution but for the sheer scale of the combinatorial space. “Each simulation (which is ground feed for the neural network) may take from tens to thousands of CPU hours,” explains Schwalbe-Koda. For the results to be meaningful, multiple models must be run over a sufficient number of points in the PES, an extremely time-consuming process. 

    Instead, the new approach only samples data points from regions of low prediction confidence, corresponding to specific geometries of a molecule. These molecules are then stretched or deformed slightly so that the uncertainty of the neural network committee is maximized. Additional data are computed for these molecules through simulations and then added to the initial training pool. 

    The neural networks are trained again, and a new set of uncertainties is calculated. This process is repeated until the uncertainty associated with various points on the surface becomes well-defined and cannot be decreased any further.
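    The "stretch or deform slightly to maximize uncertainty" step can be sketched in one dimension. This is a toy stand-in, not the method from the paper: the committee members are made-up functions, the gradient is estimated by finite differences rather than backpropagation, and `max_shift` is a hypothetical bound standing in for "slightly."

```python
import statistics

def committee_variance(models, x):
    """Variance of committee predictions at x, used here as the uncertainty."""
    return statistics.pvariance([model(x) for model in models])

def adversarial_step(models, x, step=1.0, eps=1e-4):
    """Move x in the direction that increases committee variance."""
    grad = (committee_variance(models, x + eps)
            - committee_variance(models, x - eps)) / (2 * eps)
    return x + step * grad

def attack(models, x0, n_steps=20, max_shift=0.5):
    """Gradient ascent on uncertainty, keeping x within max_shift of x0."""
    x = x0
    for _ in range(n_steps):
        x = adversarial_step(models, x)
        x = min(max(x, x0 - max_shift), x0 + max_shift)
    return x

# Hypothetical committee of three "networks" over a 1-D geometry coordinate.
committee = [
    lambda x: x ** 2,
    lambda x: x ** 2 + 0.1 * x ** 3,
    lambda x: x ** 2 - 0.1 * x ** 3,
]

x0 = 1.0
x_adv = attack(committee, x0)
print(x_adv, committee_variance(committee, x_adv))
```

    In the workflow described above, the point `x_adv` would then be sent to the expensive simulation, its result added to the training pool, and the committee retrained before the next round.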

    Gómez-Bombarelli explains, “We aspire to have a model that is perfect in the regions we care about (i.e., the ones that the simulation will visit) without having had to run the full ML simulation, by making sure that we make it very good in high-likelihood regions where it isn’t.”

    The paper presents several examples of this approach, including predicting complex supramolecular interactions in zeolites. These materials are cavernous crystals that act as molecular sieves with high shape selectivity. They find applications in catalysis, gas separation, and ion exchange, among others.

    Because performing simulations of large zeolite structures is very costly, the researchers show how their method can provide significant savings in computational simulations. They used more than 15,000 examples to train a neural network to predict the potential energy surfaces for these systems. Despite the large cost required to generate the dataset, the final results were mediocre, with only around 80 percent of the neural network-based simulations being successful. To improve the performance of the model using traditional active learning methods, the researchers calculated an additional 5,000 data points, which improved the performance of the neural network potentials to 92 percent.

    However, when the adversarial approach was used to retrain the neural networks, the authors saw a performance jump to 97 percent using only 500 extra points. That’s a remarkable result, the researchers say, especially considering that each of these extra points takes hundreds of CPU hours.
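    A quick back-of-the-envelope comparison makes the difference concrete. The sketch below uses only the figures quoted above and treats "around 80 percent" as exactly 80, which is an assumption. The variable and function names are illustrative.

```python
# Figures quoted in the article; the 80 percent baseline is approximate.
BASELINE_SUCCESS = 80.0

traditional = {"extra_points": 5000, "success": 92.0}
adversarial = {"extra_points": 500, "success": 97.0}

def gain_per_1000_points(run):
    """Percentage points of success gained per 1,000 extra simulations."""
    gain = run["success"] - BASELINE_SUCCESS
    return gain / run["extra_points"] * 1000

traditional_gain = gain_per_1000_points(traditional)
adversarial_gain = gain_per_1000_points(adversarial)
ratio = adversarial_gain / traditional_gain

print(f"traditional active learning: {traditional_gain:.1f} points per 1,000")
print(f"adversarial sampling: {adversarial_gain:.1f} points per 1,000")
print(f"ratio: {ratio:.1f}x")
```

    Under these assumptions, the adversarial approach gains about 34 percentage points per 1,000 extra simulations versus about 2.4 for traditional active learning, roughly a 14-fold difference per simulated point.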

    This could be the most realistic method to probe the limits of models that researchers use to predict the behavior of materials and the progress of chemical reactions.

  •

    Last-mile routing research challenge awards $175,000 to three winning teams

    Routing is one of the most studied problems in operations research; even small improvements in routing efficiency can save companies money and result in energy savings and reduced environmental impacts. Now, three teams of researchers from universities around the world have received prize money totaling $175,000 for their innovative route optimization models.

    The three teams were the winners of the Amazon Last-Mile Routing Research Challenge, through which the MIT Center for Transportation & Logistics (MIT CTL) and Amazon engaged with a global community of researchers across a range of disciplines, from computer science to business operations to supply chain management, challenging them to build data-driven route optimization models leveraging massive historical route execution data.

    First announced in February, the research challenge attracted more than 2,000 participants from around the world. Two hundred twenty-nine research teams formed during the spring to independently develop solutions that incorporated driver know-how into route optimization models with the intent that they would outperform traditional optimization approaches. Out of the 48 teams whose models qualified for the final round of the challenge, three teams’ work stood out above the rest. Amazon provided real operational training data for the models and evaluated submissions, with technical support from MIT CTL scientists.

    In real life, drivers frequently deviate from planned and mathematically optimized route sequences. Drivers carry information about which roads are hard to navigate when traffic is bad, when and where they can easily find parking, which stops can be conveniently served together, and many other factors that existing optimization models simply don’t capture.

    Each model addressed the challenge data in a unique way. The methodological approaches chosen by the participants frequently combined traditional exact and heuristic optimization approaches with nontraditional machine learning methods. On the machine learning side, the most commonly adopted methods were different variants of artificial neural networks, as well as inverse reinforcement learning approaches.

    There were 45 submissions that reached the finalist phase, with team members hailing from 29 countries. Entrants spanned all levels of higher education from final-year undergraduate students to retired faculty. Entries were assessed in a double-blind review process so that the judges would not know what team was attached to each entry.

    The third-place prize of $25,000 was awarded to Okan Arslan and Rasit Abay. Arslan is a professor at HEC Montréal, and Abay is a doctoral student at the University of New South Wales in Australia. The runner-up prize of $50,000 was awarded to MIT’s own Xiaotong Guo, Qingyi Wang, and Baichuan Mo, all doctoral students. The top prize of $100,000 was awarded to Professor William Cook of the University of Waterloo in Canada, Professor Stephan Held of the University of Bonn in Germany, and Professor Emeritus Keld Helsgaun of Roskilde University in Denmark. A webinar congratulating all winners and contestants was held on July 30.

    Top-performing teams may be interviewed by Amazon for research roles in the company’s Last Mile organization. MIT CTL will publish and promote short technical papers written by all finalists and might invite top-performing teams to present at MIT. Further, a team led by Matthias Winkenbach, director of the MIT Megacity Logistics Lab, will guest-edit a special issue of Transportation Science, one of the most renowned academic journals in this field, featuring academic papers on topics related to the problem tackled by the research challenge.

  •

    Helping companies optimize their websites and mobile apps

    Creating a good customer experience increasingly means creating a good digital experience. But metrics like pageviews and clicks offer limited insight into how much customers actually like a digital product.

    That’s the problem the digital optimization company Amplitude is solving. Amplitude gives companies a clearer picture of how users interact with their digital products to help them understand exactly which features to promote or improve.

    “It’s all about using product data to drive your business,” says Amplitude CEO Spenser Skates ’10, who co-founded the company with Curtis Liu ’10 and Stanford University graduate Jeffrey Wang. “Mobile apps and websites are really complex. The average app or website will have thousands of things you can do with it. The question is how you know which of those things are driving a great user experience and which parts are really frustrating for users.”

    Amplitude’s database can gather millions of details about how users behave inside an app or website and allow customers to explore that information without needing data science degrees.

    “It provides an interface for very easy, accessible ways of looking at your data, understanding your data, and asking questions of that data,” Skates says.

    Amplitude, which recently announced it will be going public, is already helping 23 of the 100 largest companies in the U.S. Customers include media companies like NBC, tech companies like Twitter, and retail companies like Walmart.

    “Our platform helps businesses understand how people are using their apps and websites so they can create better versions of their products,” Skates says. “It’s all about creating a really compelling product.”

    Learning entrepreneurship

    The founders say their years at MIT were among the best of their lives. Skates and Liu were undergraduates from 2006 to 2010. Skates majored in biological engineering while Liu majored in mathematics and electrical engineering and computer science. The two first met as opponents in MIT’s Battlecode competition, in which students use artificial intelligence algorithms to control teams of robots that compete in a strategy game against other teams. The following year they teamed up.

    “There are a lot of parallels between what you’re trying to do in Battlecode and what you end up having to do in the early stages of a startup,” Liu says. “You have limited resources, limited time, and you’re trying to accomplish a goal. What we found is trying a lot of different things, putting our ideas out there and testing them with real data, really helped us focus on the things that actually mattered. That method of iteration and continual improvement set the foundation for how we approach building products and startups.”

    Liu and Skates next participated in the MIT $100K Entrepreneurship Competition with an idea for a cloud-based music streaming service. After graduation, Skates began working in finance and Liu got a job at Google, but they continued pursuing startup ideas on the side, including a website that let alumni see where their classmates ended up and a marketplace for finding photographers.

    A year after graduation, the founders decided to quit their jobs and work on a startup full time. Skates moved into Liu’s apartment in San Francisco, setting up a mattress on the floor, and they began working on a project that became Sonalight, a voice recognition app. As part of the project, the founders built an internal system to understand where users got stuck in the app and what features were used the most.

    Despite getting over 100,000 downloads, the founders decided Sonalight was a little too early for its time and started thinking their analytics feature could be useful to other companies. They spoke with about 30 different product teams to learn more about what companies wanted from their digital analytics. Amplitude was officially founded in 2012.

    Amplitude gathers fine details about digital product usage, parsing out individual features and actions to give customers a better view of how their products are being used. Using the data in Amplitude’s intuitive, no-code interface, customers can make strategic decisions like whether to launch a feature or change a distribution channel.

    The platform is designed to ease the bottlenecks that arise when executives, product teams, salespeople, and marketers want to answer questions about customer experience or behavior but need the data science team to crunch the numbers for them.

    “It’s a very collaborative interface to encourage customers to work together to understand how users are engaging with their apps,” Skates says.

    Amplitude’s database also uses machine learning to segment users, predict user outcomes, and uncover novel correlations. Earlier this year, the company unveiled a service called Recommend that helps companies create personalized user experiences across their entire platform in minutes. The service goes beyond demographics to personalize customer experiences based on what users have done or seen before within the product.

    “We’re very conscious on the privacy front,” Skates says. “A lot of analytics companies will resell your data to third parties or use it for advertising purposes. We don’t do any of that. We’re only here to provide product insights to our customers. We’re not using data to track you across the web. Everyone expects Netflix to use the data on what you’ve watched before to recommend what to watch next. That’s effectively what we’re helping other companies do.”

    Optimizing digital experiences

    The meditation app Calm is on a mission to help users build habits that improve their mental wellness. Using Amplitude, the company learned that users most often use the app to get better sleep and reduce stress. The insights helped Calm’s team double down on content geared toward those goals, launching “sleep stories” to help users unwind at the end of each day and adding content around anxiety relief and relaxation. Sleep stories are now Calm’s most popular type of content, and Calm has grown rapidly to reach millions of people around the world.

    Calm’s story shows the power of letting user behavior drive product decisions. Amplitude has also helped the online fundraising site GoFundMe increase donations by showing users more compelling campaigns and the exercise bike company Peloton realize the importance of social features like leaderboards.

    Moving forward, the founders believe Amplitude’s platform will continue helping companies adapt to an increasingly digital world in which users expect more compelling, personalized experiences.

    “If you think about the online experience for companies today compared to 10 years ago, now [digital] is the main point of contact, whether you’re a media company streaming content, a retail company, or a finance company,” Skates says. “That’s only going to continue. That’s where we’re trying to help.”

  •

    Exact symbolic artificial intelligence for faster, better assessment of AI fairness

    The justice system, banks, and private companies use algorithms to make decisions that have profound impacts on people’s lives. Unfortunately, those algorithms are sometimes biased — disproportionately impacting people of color as well as individuals in lower income classes when they apply for loans or jobs, or even when courts decide what bail should be set while a person awaits trial.

    MIT researchers have developed a new artificial intelligence programming language that can assess the fairness of algorithms more exactly, and more quickly, than available alternatives.

    Their Sum-Product Probabilistic Language (SPPL) is a probabilistic programming system. Probabilistic programming is an emerging field at the intersection of programming languages and artificial intelligence that aims to make AI systems much easier to develop, with early successes in computer vision, common-sense data cleaning, and automated data modeling. Probabilistic programming languages make it much easier for programmers to define probabilistic models and carry out probabilistic inference — that is, work backward to infer probable explanations for observed data.

    “There are previous systems that can solve various fairness questions. Our system is not the first; but because our system is specialized and optimized for a certain class of models, it can deliver solutions thousands of times faster,” says Feras Saad, a PhD student in electrical engineering and computer science (EECS) and first author on a recent paper describing the work. Saad adds that the speedups are not insignificant: The system can be up to 3,000 times faster than previous approaches.

    SPPL gives fast, exact solutions to probabilistic inference questions such as “How likely is the model to recommend a loan to someone over age 40?” or “Generate 1,000 synthetic loan applicants, all under age 30, whose loans will be approved.” These inference results are based on SPPL programs that encode probabilistic models of what kinds of applicants are likely, a priori, and also how to classify them. Fairness questions that SPPL can answer include “Is there a difference between the probability of recommending a loan to an immigrant and nonimmigrant applicant with the same socioeconomic status?” or “What’s the probability of a hire, given that the candidate is qualified for the job and from an underrepresented group?”
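
    To make the idea of an exact fairness query concrete, here is a minimal sketch in plain Python. It does not use the SPPL API; the applicant prior, the `approve` decision tree, and the `probability` helper are all hypothetical, and brute-force enumeration like this only works for tiny discrete models. It shows what "exact" means: the answers are fractions with no sampling error.

```python
from fractions import Fraction
from itertools import product

# Hypothetical prior over applicants: two independent discrete attributes.
AGE = {"under_30": Fraction(3, 10), "30_to_40": Fraction(3, 10), "over_40": Fraction(2, 5)}
INCOME = {"low": Fraction(1, 2), "high": Fraction(1, 2)}


def approve(age, income):
    """Toy decision tree classifier (illustrative only)."""
    if income == "high":
        return True
    return age == "over_40"


def probability(event, given=lambda age, income: True):
    """Exact P(event | given), computed by enumerating every outcome."""
    numerator = Fraction(0)
    denominator = Fraction(0)
    for (age, p_age), (income, p_income) in product(AGE.items(), INCOME.items()):
        weight = p_age * p_income
        if given(age, income):
            denominator += weight
            if event(age, income):
                numerator += weight
    return numerator / denominator


# "How likely is the model to recommend a loan to someone over age 40?"
p_over_40 = probability(approve, given=lambda age, income: age == "over_40")
# The same question for applicants under 30, for comparison.
p_under_30 = probability(approve, given=lambda age, income: age == "under_30")
print(p_over_40, p_under_30)
```

    In this toy model the two conditional probabilities differ, which is the kind of gap a fairness analysis would flag for closer inspection.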

    SPPL is different from most probabilistic programming languages, as SPPL only allows users to write probabilistic programs for which it can automatically deliver exact probabilistic inference results. SPPL also makes it possible for users to check how fast inference will be, and therefore avoid writing slow programs. In contrast, other probabilistic programming languages such as Gen and Pyro allow users to write down probabilistic programs where the only known ways to do inference are approximate — that is, the results include errors whose nature and magnitude can be hard to characterize.

    Error from approximate probabilistic inference is tolerable in many AI applications. But it is undesirable to have inference errors corrupting results in socially impactful applications of AI, such as automated decision-making, and especially in fairness analysis.

    Jean-Baptiste Tristan, associate professor at Boston College and former research scientist at Oracle Labs, who was not involved in the new research, says, “I’ve worked on fairness analysis in academia and in real-world, large-scale industry settings. SPPL offers improved flexibility and trustworthiness over other PPLs on this challenging and important class of problems due to the expressiveness of the language, its precise and simple semantics, and the speed and soundness of the exact symbolic inference engine.”

    SPPL avoids errors by restricting to a carefully designed class of models that still includes a broad class of AI algorithms, including the decision tree classifiers that are widely used for algorithmic decision-making. SPPL works by compiling probabilistic programs into a specialized data structure called a “sum-product expression.” SPPL further builds on the emerging theme of using probabilistic circuits as a representation that enables efficient probabilistic inference. This approach extends prior work on sum-product networks to models and queries expressed via a probabilistic programming language. However, Saad notes that this approach comes with limitations: “SPPL is substantially faster for analyzing the fairness of a decision tree, for example, but it can’t analyze models like neural networks. Other systems can analyze both neural networks and decision trees, but they tend to be slower and give inexact answers.”
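
    The following sketch illustrates the general idea behind sum-product representations, under simplifying assumptions. It is not SPPL's implementation: the `Leaf`, `Product`, and `Sum` classes and the two-component applicant model are hypothetical. Leaves hold distributions over single variables, product nodes combine independent variables, and sum nodes mix components, so the probability of an event can be computed exactly in one pass over the structure.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class Leaf:
    var: str
    dist: dict  # maps each value of `var` to its probability

    def prob(self, event):
        allowed = event.get(self.var)
        if allowed is None:  # variable is unconstrained by the event
            return Fraction(1)
        return sum((p for v, p in self.dist.items() if v in allowed), Fraction(0))


@dataclass
class Product:
    children: list  # children must cover disjoint, independent variables

    def prob(self, event):
        result = Fraction(1)
        for child in self.children:
            result *= child.prob(event)
        return result


@dataclass
class Sum:
    weighted_children: list  # list of (weight, node); weights sum to 1

    def prob(self, event):
        return sum((w * c.prob(event) for w, c in self.weighted_children), Fraction(0))


def component(p_old, p_approve):
    """Hypothetical subpopulation with independent age and decision."""
    return Product([
        Leaf("age", {"young": 1 - p_old, "old": p_old}),
        Leaf("decision", {"deny": 1 - p_approve, "approve": p_approve}),
    ])


model = Sum([
    (Fraction(3, 5), component(Fraction(1, 2), Fraction(1, 4))),
    (Fraction(2, 5), component(Fraction(4, 5), Fraction(3, 4))),
])

p_joint = model.prob({"decision": {"approve"}, "age": {"old"}})
p_old = model.prob({"age": {"old"}})
p_approve_given_old = p_joint / p_old  # conditioning is a ratio of two exact queries
print(p_approve_given_old)
```

    Each query visits every node once, so its cost grows with the size of the expression rather than with the number of possible outcomes.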

    “SPPL shows that exact probabilistic inference is practical, not just theoretically possible, for a broad class of probabilistic programs,” says Vikash Mansinghka, an MIT principal research scientist and senior author on the paper. “In my lab, we’ve seen symbolic inference driving speed and accuracy improvements in other inference tasks that we previously approached via approximate Monte Carlo and deep learning algorithms. We’ve also been applying SPPL to probabilistic programs learned from real-world databases, to quantify the probability of rare events, generate synthetic proxy data given constraints, and automatically screen data for probable anomalies.”

    The new SPPL probabilistic programming language was presented in June at the ACM SIGPLAN International Conference on Programming Language Design and Implementation (PLDI), in a paper that Saad co-authored with MIT EECS Professor Martin Rinard and Mansinghka. SPPL is implemented in Python and is available open source.

  • in

    A comprehensive study of technological change

    The societal impacts of technological change can be seen in many domains, from messenger RNA vaccines and automation to drones and climate change. The pace of that technological change can affect its impact, and how quickly a technology improves in performance can be an indicator of its future importance. For decision-makers like investors, entrepreneurs, and policymakers, predicting which technologies are fast improving (and which are overhyped) can mean the difference between success and failure.

    New research from MIT aims to assist in the prediction of technology performance improvement using U.S. patents as a dataset. The study describes 97 percent of the U.S. patent system as a set of 1,757 discrete technology domains, and quantitatively assesses each domain for its improvement potential.

    “The rate of improvement can only be empirically estimated when substantial performance measurements are made over long time periods,” says Anuraag Singh SM ’20, lead author of the paper. “In some large technological fields, including software and clinical medicine, such measures have rarely, if ever, been made.”

    A previous MIT study provided empirical measures for 30 technological domains, but the patent sets identified for those technologies cover less than 15 percent of the patents in the U.S. patent system. The major purpose of this new study is to provide predictions of the performance improvement rates for the thousands of domains that have not been assessed through empirical measurement. To accomplish this, the researchers developed a method using a new probability-based algorithm, machine learning, natural language processing, and patent network analytics.

    Overlap and centrality

    A technology domain, as the researchers define it, consists of sets of artifacts fulfilling a specific function using a specific branch of scientific knowledge. To find the patents that best represent a domain, the team built on previous research conducted by co-author Chris Magee, a professor of the practice of engineering systems within the Institute for Data, Systems, and Society (IDSS). Magee and his colleagues found that by looking for patent overlap between the U.S. and international patent-classification systems, they could quickly identify patents that best represent a technology. The researchers ultimately created a correspondence of all patents within the U.S. patent system to a set of 1,757 technology domains.

    To estimate performance improvement, Singh employed a method refined by co-authors Magee and Giorgio Triulzi, a researcher with the Sociotechnical Systems Research Center (SSRC) within IDSS and an assistant professor at Universidad de los Andes in Colombia. Their method is based on the average “centrality” of patents in the patent citation network. Centrality refers to multiple criteria for determining the ranking or importance of nodes within a network.
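
    As a rough illustration of averaging centrality over a domain's patents, the sketch below uses normalized in-degree (citations received) as the centrality measure. That choice is an assumption made for simplicity; the article does not say which centrality measure the researchers use, and the patent IDs, citation edges, and domain labels here are invented.

```python
from collections import defaultdict

# Hypothetical citation edges: (citing_patent, cited_patent).
CITATIONS = [("P3", "P1"), ("P4", "P1"), ("P4", "P2"), ("P5", "P4"), ("P5", "P1")]
# Hypothetical assignment of patents to technology domains.
DOMAIN = {"P1": "domain_a", "P4": "domain_a", "P5": "domain_a",
          "P2": "domain_b", "P3": "domain_b"}


def in_degree_centrality(edges, nodes):
    """Citations received by each patent, normalized by (number of patents - 1)."""
    counts = {node: 0 for node in nodes}
    for _citing, cited in edges:
        counts[cited] += 1
    scale = len(counts) - 1
    return {node: count / scale for node, count in counts.items()}


def mean_centrality_by_domain(centrality, domain_of):
    """Average the per-patent centrality scores within each domain."""
    scores = defaultdict(list)
    for patent, score in centrality.items():
        scores[domain_of[patent]].append(score)
    return {domain: sum(vals) / len(vals) for domain, vals in scores.items()}


centrality = in_degree_centrality(CITATIONS, DOMAIN)
domain_scores = mean_centrality_by_domain(centrality, DOMAIN)
print(domain_scores)
```

    A domain whose patents are cited more heavily across the network ends up with a higher average score, which is the signal the prediction method builds on.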

    “Our method provides predictions of performance improvement rates for nearly all definable technologies for the first time,” says Singh.

    Those rates vary — from a low of 2 percent per year for the “Mechanical skin treatment — Hair removal and wrinkles” domain to a high of 216 percent per year for the “Dynamic information exchange and support systems integrating multiple channels” domain. The researchers found that most technologies improve slowly; more than 80 percent of technologies improve at less than 25 percent per year. Notably, the number of patents in a technological area was not a strong indicator of a higher improvement rate.
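
    To get a feel for how far apart these rates are, the short sketch below treats each figure as a compound annual improvement rate, which is an assumption about how the percentages should be read. The function names are illustrative.

```python
import math


def improvement_factor(annual_rate, years):
    """Total performance multiple after compounding `annual_rate` for `years`."""
    return (1 + annual_rate) ** years


def doubling_time(annual_rate):
    """Years needed for performance to double at a compound annual rate."""
    return math.log(2) / math.log(1 + annual_rate)


for label, rate in [("slowest domain", 0.02), ("25 percent threshold", 0.25),
                    ("fastest domain", 2.16)]:
    print(f"{label}: x{improvement_factor(rate, 5):.2f} in 5 years, "
          f"doubles every {doubling_time(rate):.1f} years")
```

    Under this reading, a domain improving at 2 percent per year takes roughly 35 years to double its performance, while one improving at 216 percent per year doubles in well under a year.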

    “Fast-improving domains are concentrated in a few technological areas,” says Magee. “The domains that show improvement rates greater than the predicted rate for integrated chips — 42 percent, from Moore’s law — are predominantly based upon software and algorithms.”

    TechNext Inc.

    The researchers built an online interactive system where domains corresponding to technology-related keywords can be found along with their improvement rates. Users can input a keyword describing a technology and the system returns a prediction of improvement for the technological domain, an automated measure of the quality of the match between the keyword and the domain, and patent sets so that the reader can judge the semantic quality of the match.

    Moving forward, the researchers have founded a new MIT spinoff called TechNext Inc. to further refine this technology and use it to help leaders make better decisions, from budgets to investment priorities to technology policy. Like any inventors, Magee and his colleagues want to protect their intellectual property rights. To that end, they have applied for a patent for their novel system and its unique methodology.

    “Technologies that improve faster win the market,” says Singh. “Our search system enables technology managers, investors, policymakers, and entrepreneurs to quickly look up predictions of improvement rates for specific technologies.”

    Adds Magee: “Our goal is to bring greater accuracy, precision, and repeatability to the as-yet fuzzy art of technology forecasting.”

  • in

    Lincoln Laboratory convenes top network scientists for Graph Exploitation Symposium

    As the Covid-19 pandemic has shown, we live in a richly connected world, facilitating not only the efficient spread of a virus but also of information and influence. What can we learn by analyzing these connections? This is a core question of network science, a field of research that models interactions across physical, biological, social, and information systems to solve problems.

    The 2021 Graph Exploitation Symposium (GraphEx), hosted by MIT Lincoln Laboratory, brought together top network science researchers to share the latest advances and applications in the field.

    “We explore and identify how exploitation of graph data can offer key technology enablers to solve the most pressing problems our nation faces today,” says Edward Kao, a symposium organizer and technical staff in Lincoln Laboratory’s AI Software Architectures and Algorithms Group.

    The themes of the virtual event revolved around some of the year’s most relevant issues, such as analyzing disinformation on social media, modeling the pandemic’s spread, and using graph-based machine learning models to speed drug design.

    “The special sessions on influence operations and Covid-19 at GraphEx reflect the relevance of network and graph-based analysis for understanding the phenomenology of these complicated and impactful aspects of modern-day life, and also may suggest paths forward as we learn more and more about graph manipulation,” says William Streilein, who co-chaired the event with Rajmonda Caceres, both of Lincoln Laboratory.

    Social networks

    Several presentations at the symposium focused on the role of network science in analyzing influence operations (IO), or organized attempts by state and/or non-state actors to spread disinformation narratives.  

    Lincoln Laboratory researchers have been developing tools to classify and quantify the influence of social media accounts that are likely IO accounts, such as those willfully spreading false Covid-19 treatments to vulnerable populations.

    “A cluster of IO accounts acts as an echo chamber to amplify the narrative. The vulnerable population is then engaging in these narratives,” says Erika Mackin, a researcher developing the tool, called RIO or Reconnaissance of Influence Operations.

    To classify IO accounts, Mackin and her team trained an algorithm to detect probable IO accounts in Twitter networks based on a specific hashtag or narrative. One example they studied was #MacronLeaks, a disinformation campaign targeting Emmanuel Macron during the 2017 French presidential election. The algorithm is trained to label accounts within this network as being IO on the basis of several factors, such as the number of interactions with foreign news accounts, the number of links tweeted, or the number of languages used. Their model then uses a statistical approach to score an account’s level of influence in spreading the narrative within that network.

    The team has found that their classifier outperforms existing detectors of IO accounts, because it can identify both bot accounts and human-operated ones. They’ve also discovered that IO accounts that pushed the 2017 French election disinformation narrative largely overlap with accounts influentially spreading Covid-19 pandemic disinformation today. “This suggests that these accounts will continue to transition to disinformation narratives,” Mackin says.

    Pandemic modeling

    Throughout the Covid-19 pandemic, leaders have been looking to epidemiological models, which predict how disease will spread, to make sound decisions. Alessandro Vespignani, director of the Network Science Institute at Northeastern University, has been leading Covid-19 modeling efforts in the United States, and shared a keynote on this work at the symposium.

    Besides taking into account the biological facts of the disease, such as its incubation period, Vespignani’s model is especially powerful in its inclusion of community behavior. To run realistic simulations of disease spread, he develops “synthetic populations” that are built by using publicly available, highly detailed datasets about U.S. households. “We create a population that is not real, but is statistically real, and generate a map of the interactions of those individuals,” he says. This information feeds back into the model to predict the spread of the disease. 
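
    The sketch below is a toy version of this idea, not Vespignani's model. It assumes a population built from a list of household sizes (members of a household are all in contact), adds a few random community contacts per person, and runs a simple susceptible-infected-recovered (SIR) simulation over the resulting contact map. All function names, parameters, and probabilities are hypothetical.

```python
import random


def build_synthetic_population(household_sizes, community_contacts, rng):
    """Return a contact map: person -> set of people they interact with."""
    contacts = {}
    next_id = 0
    for size in household_sizes:
        members = list(range(next_id, next_id + size))
        for member in members:
            contacts[member] = set(members) - {member}  # full household mixing
        next_id += size
    people = list(contacts)
    for person in people:
        for other in rng.sample(people, community_contacts):
            if other != person:
                contacts[person].add(other)
                contacts[other].add(person)
    return contacts


def simulate_sir(contacts, p_transmit, p_recover, days, rng, seeds=(0,)):
    """Daily-step SIR simulation; returns the S/I/R counts for each day."""
    state = {person: "S" for person in contacts}
    for seed in seeds:
        state[seed] = "I"
    history = []
    for _ in range(days):
        new_state = dict(state)
        for person, status in state.items():
            if status != "I":
                continue
            for other in contacts[person]:
                if state[other] == "S" and rng.random() < p_transmit:
                    new_state[other] = "I"
            if rng.random() < p_recover:
                new_state[person] = "R"
        state = new_state
        history.append({k: sum(1 for v in state.values() if v == k) for k in "SIR"})
    return history


rng = random.Random(42)
contacts = build_synthetic_population([4, 3, 2, 5, 1, 4], community_contacts=2, rng=rng)
history = simulate_sir(contacts, p_transmit=0.2, p_recover=0.1, days=30, rng=rng)
print(history[-1])
```

    A realistic model would draw household structure from census-style data and calibrate the transmission parameters to the disease, but the flow is the same: generate a statistically plausible population, map its interactions, and simulate spread over that map.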

    Today, Vespignani is considering how to integrate genomic analysis of the virus into this kind of population modeling in order to understand how variants are spreading. “It’s still a work in progress that is extremely interesting,” he says, adding that this approach has been useful in modeling the dispersal of the Delta variant of SARS-CoV-2. 

    As researchers model the virus’ spread, Lucas Laird at Lincoln Laboratory is considering how network science can be used to design effective control strategies. He and his team are developing a model for customizing strategies for different geographic regions. The effort was spurred by the differences in Covid-19 spread across U.S. communities, and what the researchers found to be a gap in intervention modeling to address those differences.

    As examples, they applied their planning algorithm to three counties in Florida, Massachusetts, and California. Taking into account the characteristics of a specific geographic center, such as the number of susceptible individuals and number of infections there, their planner institutes different strategies in those communities throughout the outbreak duration.

    “Our approach eradicates disease in 100 days, but it also is able to do it with much more targeted interventions than any of the global interventions. In other words, you don’t have to shut down a full country,” Laird says. He adds that their planner offers a “sandbox environment” for exploring intervention strategies in the future.

    Machine learning with graphs

    Graph-based machine learning is receiving increasing attention for its potential to “learn” the complex relationships within graph-structured data, and thus extract new insights or predictions about these relationships. This interest has given rise to a new class of algorithms called graph neural networks. Today, graph neural networks are being applied in areas such as drug discovery and material design, with promising results.

    “We can now apply deep learning much more broadly, not only to medical images and biological sequences. This creates new opportunities in data-rich biology and medicine,” says Marinka Zitnik, an assistant professor at Harvard University who presented her research at GraphEx.

    Zitnik’s research focuses on the rich networks of interactions between proteins, drugs, disease, and patients, at the scale of billions of interactions. One application of this research is discovering drugs to treat diseases with no or few approved drug treatments, such as for Covid-19. In April, Zitnik’s team published a paper on their research that used graph neural networks to rank 6,340 drugs for their expected efficacy against SARS-CoV-2, identifying four that could be repurposed to treat Covid-19.

    At Lincoln Laboratory, researchers are similarly applying graph neural networks to the challenge of designing advanced materials, such as those that can withstand extreme radiation or capture carbon dioxide. Like the process of designing drugs, the trial-and-error approach to materials design is time-consuming and costly. The laboratory’s team is developing graph neural networks that can learn relationships between a material’s crystalline structure and its properties. This network can then be used to predict a variety of properties from any new crystal structure, greatly speeding up the process of screening materials with desired properties for specific applications.
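
    The core operation in a graph neural network is message passing: each node updates its features using those of its neighbors. The sketch below shows one such round on a tiny hypothetical "crystal" graph, followed by a readout that pools node features into a single vector. It is an illustration only and not the laboratory's model: it has no learned weights, and the atoms, bonds, and feature values are invented.

```python
def message_passing_round(features, edges, self_weight=0.5):
    """Blend each node's features with the mean of its neighbors' features."""
    neighbors = {node: [] for node in features}
    for a, b in edges:  # undirected bonds
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for node, own in features.items():
        nbrs = neighbors[node]
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs)
                    for i in range(len(own))]
        else:
            mean = list(own)  # isolated node keeps its own features
        updated[node] = [self_weight * o + (1 - self_weight) * m
                         for o, m in zip(own, mean)]
    return updated


def readout(features):
    """Average node features into one graph-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(f[i] for f in features.values()) / len(features) for i in range(dim)]


# Hypothetical three-atom structure with one-hot element features.
atom_features = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 0.0]}
bonds = [("A", "B"), ("B", "C")]

updated = message_passing_round(atom_features, bonds)
graph_vector = readout(updated)
print(updated, graph_vector)
```

    In a trained network, the blending step would use learned weight matrices and nonlinearities, and the graph-level vector would feed a model that predicts a material property.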

    “Graph representation learning has emerged as a rich and thriving research area for incorporating inductive bias and structured priors during the machine learning process, with broad applications such as drug design, accelerated scientific discovery, and personalized recommendation systems,” Caceres says. 

    A vibrant community

    Lincoln Laboratory has hosted the GraphEx Symposium annually since 2010, with the exception of last year’s cancellation due to Covid-19. “One key takeaway is that despite the postponement from last year and the need to be virtual, the GraphEx community is as vibrant and active as it’s ever been,” Streilein says. “Network-based analysis continues to expand its reach and is applied to ever-more important areas of science, society, and defense with increasing impact.”

    In addition to those from Lincoln Laboratory, technical committee members and co-chairs of the GraphEx Symposium included researchers from Harvard University, Arizona State University, Stanford University, Smith College, Duke University, the U.S. Department of Defense, and Sandia National Laboratories.