More stories

  • Translating lost languages using machine learning

    Recent research suggests that most languages that have ever existed are no longer spoken. Dozens of these dead languages are also considered to be lost, or “undeciphered” — that is, we don’t know enough about their grammar, vocabulary, or syntax to be able to actually understand their texts.
    Lost languages are more than a mere academic curiosity; without them, we miss an entire body of knowledge about the people who spoke them. Unfortunately, most of them have such minimal records that scientists can’t decipher them by using machine-translation algorithms like Google Translate. Some don’t have a well-researched “relative” language to be compared to, and often lack traditional dividers like white space and punctuation. (To illustrate, imaginetryingtodecipheraforeignlanguagewrittenlikethis.)
    However, researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) recently made a major advance in this area: a new system that has been shown to automatically decipher a lost language, without needing advance knowledge of its relation to other languages. They also showed that their system can itself determine relationships between languages, and they used it to corroborate recent scholarship suggesting that Iberian is not actually related to Basque.
    The team’s ultimate goal is for the system to be able to decipher lost languages that have eluded linguists for decades, using just a few thousand words.
    The system, spearheaded by MIT Professor Regina Barzilay, relies on several principles grounded in insights from historical linguistics, such as the fact that languages generally evolve only in certain predictable ways. For instance, while a given language rarely adds or deletes an entire sound, certain sound substitutions are likely to occur. A word with a “p” in the parent language may change into a “b” in the descendant language, but changing to a “k” is less likely due to the significant pronunciation gap.
    By incorporating these and other linguistic constraints, Barzilay and MIT PhD student Jiaming Luo developed a decipherment algorithm that can handle the vast space of possible transformations and the scarcity of a guiding signal in the input. The algorithm learns to embed language sounds into a multidimensional space where differences in pronunciation are reflected in the distance between corresponding vectors. This design enables them to capture pertinent patterns of language change and express them as computational constraints. The resulting model can segment words in an ancient language and map them to counterparts in a related language.  
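    To make the embedding idea concrete, here is a hand-crafted toy in Python, not the team’s learned model: each sound is a small feature vector, and substitution plausibility falls out of vector distance. The feature values and weighting are assumptions made purely for illustration.
    ```python
    import numpy as np

    # Toy, hand-picked sound vectors: [voicing, place of articulation].
    # These values and weights are illustrative assumptions only; the MIT system
    # learns its sound embeddings from data rather than hard-coding them.
    sounds = {
        "p": np.array([0.0, 0.0]),  # voiceless bilabial
        "b": np.array([1.0, 0.0]),  # voiced bilabial
        "k": np.array([0.0, 2.0]),  # voiceless velar (place shift weighted heavier)
    }

    def sound_distance(a: str, b: str) -> float:
        """Euclidean distance between two sound vectors."""
        return float(np.linalg.norm(sounds[a] - sounds[b]))

    # A p -> b substitution comes out "cheaper" than p -> k, mirroring the
    # constraint that small pronunciation shifts are more plausible over time.
    print(sound_distance("p", "b"))  # 1.0
    print(sound_distance("p", "k"))  # 2.0
    ```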
    The project builds on a paper Barzilay and Luo wrote last year that deciphered the dead languages of Ugaritic and Linear B, the latter of which had previously taken decades for humans to decode. However, a key difference with that project was that the team knew that these languages were related to early forms of Hebrew and Greek, respectively.
    With the new system, the relationship between languages is inferred by the algorithm. This question is one of the biggest challenges in decipherment. In the case of Linear B, it took several decades to discover the correct known descendant. For Iberian, the scholars still cannot agree on the related language: Some argue for Basque, while others refute this hypothesis and claim that Iberian doesn’t relate to any known language. 
    The proposed algorithm can assess the proximity between two languages; in fact, when tested on known languages, it can even accurately identify language families. The team applied their algorithm to Iberian considering Basque, as well as less-likely candidates from Romance, Germanic, Turkic, and Uralic families. While Basque and Latin were closer to Iberian than other languages, they were still too different to be considered related. 
    In future work, the team hopes to expand their work beyond the act of connecting texts to related words in a known language — an approach referred to as “cognate-based decipherment.” This paradigm assumes that such a known language exists, but the example of Iberian shows that this is not always the case. The team’s new approach would involve identifying semantic meaning of the words, even if they don’t know how to read them. 
    “For instance, we may identify all the references to people or locations in the document, which can then be further investigated in light of the known historical evidence,” says Barzilay. “These methods of ‘entity recognition’ are commonly used in various text processing applications today and are highly accurate, but the key research question is whether the task is feasible without any training data in the ancient language.”
    The project was supported, in part, by the Intelligence Advanced Research Projects Activity (IARPA).

  • Bringing construction projects to the digital world

    People who work behind a computer screen all day take it for granted that everyone’s work will be tracked and accessible when they collaborate with others. But if your job takes place out in the real world, managing projects can require a lot more effort.
    In construction, for example, general contractors and real estate developers often need someone to be physically present on a job site to verify work is done correctly and on time. They might also rely on a photographer or smartphone images to document a project’s progress. Those imperfect solutions can lead to accountability issues, unnecessary change orders, and project delays.
    Now the startup OpenSpace is bringing some of the benefits of digital work to the real world with a solution that uses 360-degree cameras and computer vision to create comprehensive, time-stamped digital replicas of construction sites.
    All customers need to do is walk their job site with a small 360-degree camera on their hard hat. The OpenSpace Vision Engine maps the photos to work plans automatically, creating a Google Street View-like experience for people to remotely tour work sites at different times as if they were physically present.
    The company is also deploying analytics solutions that help customers track progress and search for objects on their job sites. To date, OpenSpace has helped customers map more than 1.5 billion square feet of construction projects, including bridges, hospitals, football stadiums, and large residential buildings.
    The solution is helping workers in the construction industry improve accountability, minimize travel, reduce risks, and more.
    “The core product we have today is a simple idea: It allows our customers to have a complete visual record of any space, indoor or outdoor, so they can see what’s there from anywhere at any point in time,” says OpenSpace cofounder and CEO Jeevan Kalanithi SM ’07. “They can teleport into the site to inspect the actual reality, but they can also see what was there yesterday or a week ago or five years ago. It brings this ground truth record to the site.”
    Shining a light on construction sites
    The founders of OpenSpace originally met during their time at MIT. At the Media Lab, Kalanithi and David Merrill SM ’06, PhD ’09 built a gaming system based on small cubes that used LCD touch screens and motion sensors to encourage kids to develop critical thinking skills. They spun the idea into a company, Sifteo, which created multiple generations of its toys.
    In 2014, Sifteo was bought by 3D Robotics, then a drone company that would go on to focus on drone inspection software for construction, engineering, and mining firms. Kalanithi stayed with 3D Robotics for over two years, eventually serving as president of the company.
    In the summer of 2016, Kalanithi left 3D Robotics with the intention of spending more time with friends and family. He reconnected with two friends from MIT, Philip DeCamp ’05, SM ’08, PhD ’13 and Michael Fleischman PhD ’08, both of whom had explored new machine vision and AI techniques in their PhD research. Fleischman had gone on to start a social media analytics company that he sold to Twitter.
    At the time, DeCamp and Fleischman were considering ways to use machine vision advances with 360-degree cameras. Kalanithi, who had helped guide 3D Robotics toward the construction industry, thought he had the perfect application.
    People have long used photographs to document construction projects, and contracts for large projects often require progress photos to be taken. But the photos never document the entire site, and they aren’t taken frequently enough to capture every phase of work.
    Early versions of the OpenSpace solution required someone to set up a tripod in every space of a construction project. A breakthrough came when one early user, a straight-talking project manager, gave the founders some useful feedback.
    “I was showing him the output of our product at the time, which looks similar to now, and he says, ‘This is great. How long did it take you?’ When I told him he said, ‘Well that’s cool Jeevan, but there’s no way we’re going to use that,’” Kalanithi recalls. “I thought maybe this idea isn’t so good after all. But then he gave us the idea. He said, ‘What would be great is if I could just wear that little camera and walk around. I walk around the job site all the time.’”
    The founders took the advice and repurposed their solution to work with off-the-shelf 360-degree cameras and slightly modified hard hats. The cameras take pictures every half second and use artificial intelligence techniques to identify the camera’s precise location, even indoors. Once a few tours of the job site have been uploaded to OpenSpace’s platform, it can map pictures onto site plans within 15 minutes.
    Kalanithi still remembers the excitement the founders felt the first time they saved a customer money, helping to settle a dispute between a general contractor and a drywall specialist. Since then they’ve gotten a lot of those calls, in some cases saving companies millions of dollars. Kalanithi says cutting builders’ costs helps the construction industry meet growing needs related to aging infrastructure and housing shortages.
    Helping nondigital workers
    OpenSpace’s analytics solutions, which the company calls its ClearSight suite of products, have not been rolled out to every customer yet. But Kalanithi believes they will bring even more value to people managing work sites.
    “If you have someone walking around the project all the time, we can start classifying and computing what they’re seeing,” Kalanithi says. “So, we can see how much framing and drywall is being installed, how quickly, how much material was used. That’s the basis for how people get paid in this industry: How much work did you do?”
    Kalanithi believes ClearSight is the beginning of a new phase for OpenSpace, in which the company can use AI and computer vision to give customers a new perspective on what’s going on at their job sites.
    “The product experience today, where you look around to see the site, will be something people sometimes do on OpenSpace, but they may be spending more time looking at productivity charts and little OpenSpace verified payment buttons, and maybe sometimes they’ll drill down to look at the actual images,” Kalanithi says.
    The Covid-19 pandemic accelerated some companies’ adoption of digital solutions to help cut down on travel and physical contact. But even in states that have resumed construction, Kalanithi says customers are continuing to use OpenSpace, a key indicator of the value it brings.
    Indeed, the vast majority of the information captured by OpenSpace was never available before, and it brings with it the potential for major improvements in the construction industry and beyond.
    “If the last decade was defined by the cloud and mobile technology being the real enabling technologies, I think this next decade will be innovations that affect people in the real physical world,” Kalanithi says. “Because cameras and computer vision are getting better, so for a lot of people who have been ignored or left behind by technology based on the work they do, we’ll have the opportunity to make some amends and build some stuff that will make those folks’ lives easier.”

  • Neural pathway crucial to successful rapid object recognition in primates

    MIT researchers have identified a brain pathway critical in enabling primates to effortlessly identify objects in their field of vision. The findings enrich existing models of the neural circuitry involved in visual perception and help to further unravel the computational code for solving object recognition in the primate brain.
    Led by Kohitij Kar, a postdoc at the McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences, the study looked at an area called the ventrolateral prefrontal cortex (vlPFC), which sends feedback signals to the inferior temporal (IT) cortex via a network of neurons. The main goal of this study was to test how the back-and-forth information processing of this circuitry — that is, this recurrent neural network — is essential to rapid object identification in primates.
    The current study, published in Neuron and available via open access, is a follow-up to prior work published by Kar and James DiCarlo, the Peter de Florez Professor of Neuroscience, the head of MIT’s Department of Brain and Cognitive Sciences, and an investigator in the McGovern Institute and the Center for Brains, Minds, and Machines.
    Monkey versus machine
    In 2019, Kar, DiCarlo, and colleagues identified that primates must use some recurrent circuits during rapid object recognition. Monkey subjects in that study were able to identify objects more accurately than engineered “feed-forward” computational models, called deep convolutional neural networks, that lacked recurrent circuitry.
    Interestingly, the specific images for which the models performed poorly compared to monkeys in object identification also took longer to be solved in the monkeys’ brains — suggesting that the additional time might be due to recurrent processing in the brain. Based on the 2019 study, though, it remained unclear exactly which recurrent circuits were responsible for the delayed information boost in the IT cortex. That’s where the current study picks up.
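    The architectural difference at issue, a purely feed-forward stack versus one that iterates over its own output, can be sketched in a few lines of PyTorch. The block below is only a conceptual toy, not one of the deep convolutional networks evaluated in the study:
    ```python
    import torch
    import torch.nn as nn

    class RecurrentConvBlock(nn.Module):
        """Toy illustration of recurrence in a vision model: the same convolution
        is applied several times, with the block's previous output fed back in.
        With steps=1 this reduces to a purely feed-forward layer."""
        def __init__(self, channels: int, steps: int = 3):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.steps = steps

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            h = torch.zeros_like(x)
            for _ in range(self.steps):            # extra passes ~ extra processing time
                h = torch.relu(self.conv(x + h))   # feedback of the previous state
            return h

    block = RecurrentConvBlock(channels=8, steps=3)
    out = block(torch.randn(1, 8, 32, 32))
    print(out.shape)  # torch.Size([1, 8, 32, 32])
    ```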
    “In this new study, we wanted to find out: Where are these recurrent signals in IT coming from?” Kar says. “Which areas reciprocally connected to IT are functionally the most critical part of this recurrent circuit?”
    To determine this, researchers used a pharmacological agent to temporarily block the activity in parts of the vlPFC in macaques while they engaged in an object discrimination task. During these tasks, monkeys viewed images that contained an object, such as an apple, a car, or a dog; then, researchers used eye tracking to determine if the monkeys could correctly indicate what object they had previously viewed when given two object choices.
    “We observed that if you use pharmacological agents to partially inactivate the vlPFC, then both the monkeys’ behavior and IT cortex activity deteriorates, but more so for certain specific images. These images were the same ones we identified in the previous study — ones that were poorly solved by ‘feed-forward’ models and took longer to be solved in the monkey’s IT cortex,” says Kar.
    “These results provide evidence that this recurrently connected network is critical for rapid object recognition, the behavior we’re studying. Now, we have a better understanding of how the full circuit is laid out, and what are the key underlying neural components of this behavior.”
    The full study, entitled “Fast recurrent processing via ventrolateral prefrontal cortex is needed by the primate ventral stream for robust core visual object recognition,” will run in print Jan. 6, 2021.
    “This study demonstrates the importance of prefrontal cortical circuits in automatically boosting object recognition performance in a very particular way,” DiCarlo says. “These results were obtained in nonhuman primates and thus are highly likely to also be relevant to human vision.”
    The present study makes clear the integral role of the recurrent connections between the vlPFC and the primate ventral visual cortex during rapid object recognition. The results will be helpful to researchers designing future studies that aim to develop accurate models of the brain, and to researchers who seek to develop more human-like artificial intelligence.

  • Eight Lincoln Laboratory technologies named 2020 R&D 100 Award winners

    Eight technologies developed by MIT Lincoln Laboratory researchers, either wholly or in collaboration with researchers from other organizations, were among the winners of the 2020 R&D 100 Awards. Presented annually since 1963, these international R&D awards recognize the 100 technologies that a panel of expert judges selects as the most revolutionary of the past year.
    Six of the laboratory’s winning technologies are software systems, a number of which take advantage of artificial intelligence techniques. The software technologies are solutions to difficulties inherent in analyzing large volumes of data and to problems in maintaining cybersecurity. Another technology is a process designed to assure secure fabrication of integrated circuits, and the eighth winner is an optical communications technology that may enable future space missions to transmit error-free data to Earth at significantly higher rates than currently possible.
    CyberPow
    To enable timely, effective responses to post-disaster large-scale power outages, Lincoln Laboratory created a system that rapidly estimates and maps the extent and location of power outages across geographic boundaries. Cyber Sensing for Power Outage Detection, nicknamed CyberPow, uses pervasive, internet-connected devices to produce near-real-time situational awareness to inform decisions about allocating personnel and resources.
    The system performs active scanning of the IP networks in a targeted region to identify changes in network device availability as an indicator of power loss. CyberPow is a low-cost, efficient alternative to current approaches that rely on piecing together outage data from disparate electric utilities with varying ability to assess their own outages.
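    As a rough illustration of the underlying idea (probing internet-connected devices and treating a drop in reachability as an outage signal), here is a minimal Python sketch; the addresses are hypothetical, and the real CyberPow system is far more sophisticated about scanning and baselining:
    ```python
    import socket

    def reachable(host: str, port: int = 80, timeout: float = 1.0) -> bool:
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def outage_fraction(hosts: list[str]) -> float:
        """Fraction of probed devices that are unreachable; a rising fraction in a
        region is treated here as a crude proxy for power loss."""
        down = sum(not reachable(h) for h in hosts)
        return down / len(hosts) if hosts else 0.0

    # Hypothetical device addresses for one neighborhood; a real deployment would
    # scan many IP ranges and compare results against a pre-disaster baseline.
    devices = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]
    print(f"Estimated outage fraction: {outage_fraction(devices):.0%}")
    ```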
    FOVEA
    The Forensic Video Exploitation and Analysis (FOVEA) tool suite, developed by the laboratory under the sponsorship of the Department of Homeland Security Science and Technology Directorate, enables users to efficiently analyze video captured by existing large-scale closed-circuit television systems. The FOVEA tools expedite daily tasks, such as searching through video, investigating abandoned objects, or piecing together activity from multiple cameras. A video summarization tool condenses all motion activity within a long time frame into a very short visual summary, transforming, for example, one hour of raw video into a three-minute summary that also acts as a clickable index into the original video sequence.
    To allow analysts to track the onsite history of a suspicious object, a “jump back” feature automatically scans to the segment of video in which an idle or suspicious object first appeared. Because analysts can quickly navigate a camera network through the use of transition zones — clickable overlays that mark common entry/exit zones — FOVEA makes it easy to follow a person of interest through many camera views. Video data from multiple cameras can be combined on the fly in chronological order and exported easily. Highly efficient algorithms mean that no specialized hardware is required; thus, FOVEA software tools can add strong forensic capabilities to any video streaming system.
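    The motion-based summarization idea can be approximated with simple frame differencing, as in the OpenCV sketch below; this is a generic illustration rather than FOVEA’s actual algorithms, and the video file name is hypothetical:
    ```python
    import cv2

    def motion_frame_indices(path: str, threshold: float = 5.0) -> list[int]:
        """Return indices of frames whose mean absolute difference from the previous
        frame exceeds a threshold, i.e., frames that contain motion. A summary built
        from only these frames compresses long stretches of inactivity."""
        cap = cv2.VideoCapture(path)
        keep, prev, idx = [], None, 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev is not None and cv2.absdiff(gray, prev).mean() > threshold:
                keep.append(idx)
            prev, idx = gray, idx + 1
        cap.release()
        return keep

    # "lobby_cam.mp4" is a hypothetical file name used only for illustration.
    print(len(motion_frame_indices("lobby_cam.mp4")))
    ```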
    Keylime
    An open-source key bootstrapping and integrity management software architecture, Keylime is designed to increase the security and privacy of edge, cloud, and internet-of-things devices. Keylime enables users to securely upload cryptographic keys, passwords, and certificates into their machines without unnecessarily divulging these secrets. In addition, Keylime enables users to continuously verify trust in their computing resources without relying on their service providers to guarantee security.
    Keylime leverages the Trusted Platform Module, an industry-standard hardware security chip, but eliminates the complexity, compatibility, and performance issues that the module introduces. Keylime has fostered a vibrant, growing open-source community with the help of Red Hat, a multinational software company, and has been accepted as a Sandbox technology in the Cloud Native Computing Foundation, a Linux Foundation project.
    LAVA
    A team from Lincoln Laboratory, New York University, and Northeastern University developed the Large-scale Vulnerability Addition (LAVA) technique, which injects numerous bugs into a program to create ground truth for evaluating bug-finding systems. The technique inserts bugs at known locations in a program and constructs triggering inputs for each bug. A bug finder’s ability to discover LAVA bugs in a program can be used to estimate the finder’s false negative and false positive rates.
    LAVA addresses the critical need for technology that can evaluate new approaches to finding bugs in software programs. Despite decades of research into building stable software, bugs still plague modern programs, and current approaches to evaluating bug finders have relied on testing them against programs that have either no known bugs or only previously discovered bugs. Manually creating programs with known bugs is laborious and cannot be done at a large scale. LAVA is the only system capable of injecting an essentially unlimited number of bugs into real programs, given the program’s source code. It has been used to evaluate bug finders, both human and automated, since 2017.
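    Because the injected bugs provide ground truth, scoring a bug finder against them reduces to simple set arithmetic. The sketch below uses hypothetical bug IDs; note that a report outside the injected set is not necessarily a false positive, since it could be a genuine pre-existing bug:
    ```python
    def finder_error_rates(injected: set[str], reported: set[str]) -> dict[str, float]:
        """Given the ground-truth set of injected (LAVA-style) bug IDs and the set of
        bugs a tool reported, estimate its false negative rate (missed injected bugs)
        and the share of its reports that do not match any injected bug."""
        missed = injected - reported
        unmatched = reported - injected
        return {
            "false_negative_rate": len(missed) / len(injected) if injected else 0.0,
            "unmatched_report_rate": len(unmatched) / len(reported) if reported else 0.0,
        }

    injected = {"bug-001", "bug-002", "bug-003", "bug-004"}
    reported = {"bug-002", "bug-004", "bug-999"}  # bug-999 is not in the ground truth
    print(finder_error_rates(injected, reported))
    # {'false_negative_rate': 0.5, 'unmatched_report_rate': 0.333...}
    ```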
    RIO
    The Reconnaissance of Influence Operations (RIO) software system automates the detection of disinformation narratives, networks, and influential actors. The system is designed to address the growing threat posed by adversarial countries that exploit social media and digital communications to achieve political objectives.
    The unprecedented scales, speeds, and reach of disinformation campaigns present a rising threat to global stability, and especially to democratic societies. The RIO system integrates natural language processing, machine learning, graph analytics, and novel network causal inference to quantify the impact of individual actors in spreading the disinformation narrative. By providing situational awareness of influence campaigns and knowledge of the mechanisms behind social influence, RIO offers capabilities that can aid in crafting responses to dangerous influence operations.
    TRACER
    Many popular closed-source computer applications, such as browsers or productivity applications running on the Windows operating system, are vulnerable to large-scale cyber attacks in which adversaries use previously discovered entry points into the applications’ data to take control of a computer. What makes these attacks so severe is the homogeneity of the targets: Because all installations of an application look alike, it can be easy for attackers to simultaneously compromise millions of computers, remotely exfiltrating sensitive information or stealing user data.
    Lincoln Laboratory’s Timely Randomization Applied to Commodity Executables at Runtime (TRACER) technique protects closed-source Windows applications against sophisticated attacks by automatically and transparently re-randomizing the applications’ sensitive internal data and layout every time any output is generated. TRACER thereby ensures that leaked information quickly becomes stale and that attacks cannot bypass the randomization, as they can with one-time randomization defenses.
    Defensive Wire Routing for Untrusted Integrated Circuit Fabrication
    Lincoln Laboratory researchers developed the Defensive Wire Routing for Untrusted Integrated Circuit Fabrication techniques to deter an outsourced foundry from maliciously tampering with or modifying the security-critical components of a digital circuit design. For example, a trusted design could be changed by a fabricator who inserts a “hardware Trojan” or “backdoor” that can compromise the downstream system security.
    The Defensive Wire Routing technology augments standard wire routing processes to make complex integrated circuits (ICs) inspectable and/or tamper-evident post fabrication. The need for such defensive techniques has arisen because of increasing commercial and government use of outsourced third-party IC foundries for advanced high-performance IC fabrication.
    TBIRD
    Sponsored by NASA, Lincoln Laboratory developed TeraByte InfraRed Delivery (TBIRD), a technology that enables error-free transmission of data from satellites in low Earth orbit (LEO) at a rate of 200 gigabits per second. The current approach to LEO data delivery, generally a combination of RF communication, networks of ground stations, and onboard data compression, will become less able to efficiently and accurately handle the volume of data sent from the increasing numbers of LEO satellites sharing a crowded RF spectrum.
    TBIRD is an optical communications alternative that leverages the high bandwidths and unregulated spectrum available at optical frequencies. It combines a custom-designed transmit/receive system with commercial transceivers to provide high-rate, error-free data links through the dynamic atmosphere. Because TBIRD enables extremely high data-volume transfers under varying atmospheric conditions (horizontal-link or LEO-to-ground), it has the potential to transform satellite operations across scientific, commercial, and defense applications.
    Because of the Covid-19 pandemic, editors of R&D World, an online publication that promotes the award program, announced the winners at virtual ceremonies broadcast on Sept. 29-30 and Oct. 1. Since 2010, Lincoln Laboratory has had 66 technologies recognized with R&D 100 Awards.

  • A global collaboration to move artificial intelligence principles to practice

    Today, artificial intelligence — and the computing systems that underlie it — are more than just matters of technology; they are matters of state and society, of governance and the public interest. The choices that technologists, policymakers, and communities make in the next few years will shape the relationship between machines and humans for decades to come.
    The rapidly increasing applicability of AI has prompted a number of organizations to develop high-level principles on social and ethical issues such as privacy, fairness, bias, transparency, and accountability. Building on those broader principles, the AI Policy Forum, a global effort convened by the MIT Stephen A. Schwarzman College of Computing, will provide an overarching policy framework and tools for governments and companies to implement in concrete ways.
    “Our goal is to help policymakers in making practical decisions about AI policy,” says Daniel Huttenlocher, dean of the MIT Schwarzman College of Computing. “We are not trying to develop another set of principles around AI, several of which already exist, but rather provide context and guidelines specific to a field of use of AI to help policymakers around the world with implementation.”
    “Moving beyond principles means understanding trade-offs and identifying the technical tools and the policy levers to address them. We created the college to examine and address these types of issues, but this can’t be a siloed effort. We need this to be a global collaboration and to engage scientists, technologists, policymakers, and business leaders,” says MIT Provost Martin Schmidt. “This is a challenging and complex process for which we need all hands on deck.”
    The AI Policy Forum is designed as a yearlong process. Activities associated with this effort will be distinguished by their focus on tangible outcomes, their engagement with key government officials at the local, national, and international levels who are charged with designing those public policies, and their deep technical grounding in the latest advances in the science of AI. The measure of success will be whether these efforts have bridged the gap between these communities, translated principled agreement into actionable outcomes, and helped create the conditions for deeper trust between humans and machines.
    The global collaboration will begin in late 2020 and early 2021 with a series of AI Policy Forum Task Forces, chaired by MIT researchers and bringing together the world’s leading technical and policy experts on some of the most pressing issues of AI policy, starting with AI in finance and mobility. Further task forces throughout 2021 will convene more communities of practice with the shared aim of designing the next chapter of AI: one that both delivers on AI’s innovative potential and responds to society’s needs.
    Each task force will produce results that inform concrete public policies and frameworks for the next chapter of AI, and help define the roles that the academic and business communities, civil society, and governments will need to play in making it a reality. Research from the task forces will feed into the development of the AI Policy Framework, a dynamic assessment tool that will help governments gauge their own progress on AI policy-making goals and guide application of best practices appropriate to their own national priorities.
    On May 6–7, 2021, MIT will host — most likely online — the first AI Policy Forum Summit, a two-day collaborative gathering to discuss the progress of the task forces toward equipping high-level decision-makers with a deeper understanding of the tools at their disposal — and the trade-offs to be made — to produce better public policy around AI, and better AI systems with concern for public policy. Then, in fall 2021, a follow-on event at MIT will bring together leaders from across sectors and countries; built atop the leading research from the task forces, the forum will provide a focal point for work to move from AI principles to AI practice, and will serve as a springboard for global efforts to design the future of AI.

  • The real promise of synthetic data

    Each year, the world generates more data than the previous year. In 2020 alone, an estimated 59 zettabytes of data will be “created, captured, copied, and consumed,” according to the International Data Corporation — enough to fill about a trillion 64-gigabyte hard drives.
    But just because data are proliferating doesn’t mean everyone can actually use them. Companies and institutions, rightfully concerned with their users’ privacy, often restrict access to datasets — sometimes within their own teams. And now that the Covid-19 pandemic has shut down labs and offices, preventing people from visiting centralized data stores, sharing information safely is even more difficult.
    Without access to data, it’s hard to make tools that actually work. Enter synthetic data: artificial information developers and engineers can use as a stand-in for real data.
    Synthetic data is a bit like diet soda. To be effective, it has to resemble the “real thing” in certain ways. Diet soda should look, taste, and fizz like regular soda. Similarly, a synthetic dataset must have the same mathematical and statistical properties as the real-world dataset it’s standing in for. “It looks like it, and has formatting like it,” says Kalyan Veeramachaneni, principal investigator of the Data to AI (DAI) Lab and a principal research scientist in MIT’s Laboratory for Information and Decision Systems. If it’s run through a model, or used to build or test an application, it performs like that real-world data would.
    But — just as diet soda should have fewer calories than the regular variety — a synthetic dataset must also differ from a real one in crucial aspects. If it’s based on a real dataset, for example, it shouldn’t contain or even hint at any of the information from that dataset.
    Threading this needle is tricky. After years of work, Veeramachaneni and his collaborators recently unveiled a set of open-source data generation tools — a one-stop shop where users can get as much data as they need for their projects, in formats from tables to time series. They call it the Synthetic Data Vault.
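    As a rough sketch of how such a toolkit is used, a developer fits a model to a real table and then samples fresh rows. The module path below follows the SDV tabular API as released around this time and may differ in later versions; the table itself is a tiny stand-in:
    ```python
    import pandas as pd
    from sdv.tabular import CTGAN  # SDV 0.x-era module path; newer releases reorganize this

    # A tiny stand-in table; real use would load a sensitive dataset instead.
    real = pd.DataFrame({
        "age": [34, 52, 41, 29, 67, 45],
        "blood_pressure": [118, 135, 124, 110, 142, 128],
        "heart_rate": [72, 80, 75, 68, 88, 77],
    })

    model = CTGAN()                # GAN-based tabular model from the SDV toolkit
    model.fit(real)
    synthetic = model.sample(100)  # 100 new rows with similar statistical structure
    print(synthetic.head())
    ```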
    Maximizing access while maintaining privacy
    Veeramachaneni and his team first tried to create synthetic data in 2013. They had been tasked with analyzing a large amount of information from the online learning program edX, and wanted to bring in some MIT students to help. The data were sensitive, and couldn’t be shared with these new hires, so the team decided to create artificial data that the students could work with instead — figuring that “once they wrote the processing software, we could use it on the real data,” Veeramachaneni says.
    This is a common scenario. Imagine you’re a software developer contracted by a hospital. You’ve been asked to build a dashboard that lets patients access their test results, prescriptions, and other health information. But you aren’t allowed to see any real patient data, because it’s private.
    Most developers in this situation will make “a very simplistic version” of the data they need, and do their best, says Carles Sala, a researcher in the DAI lab. But when the dashboard goes live, there’s a good chance that “everything crashes,” he says, “because there are some edge cases they weren’t taking into account.”
    High-quality synthetic data — as complex as what it’s meant to replace — would help to solve this problem. Companies and institutions could share it freely, allowing teams to work more collaboratively and efficiently. Developers could even carry it around on their laptops, knowing they weren’t putting any sensitive information at risk.
    Perfecting the formula — and handling constraints
    Back in 2013, Veeramachaneni’s team gave themselves two weeks to create a data pool they could use for that edX project. The timeline “seemed really reasonable,” Veeramachaneni says. “But we failed completely.” They soon realized that if they built a series of synthetic data generators, they could make the process quicker for everyone else.
    In 2016, the team completed an algorithm that accurately captures correlations between the different fields in a real dataset — think a patient’s age, blood pressure, and heart rate — and creates a synthetic dataset that preserves those relationships, without any identifying information. When data scientists were asked to solve problems using this synthetic data, their solutions were as effective as those made with real data 70 percent of the time. The team presented this research at the 2016 IEEE International Conference on Data Science and Advanced Analytics.
    For the next go-around, the team reached deep into the machine learning toolbox. In 2019, PhD student Lei Xu presented his new algorithm, CTGAN, at the 33rd Conference on Neural Information Processing Systems in Vancouver. CTGAN (for “conditional tabular generative adversarial networks”) uses GANs to build and perfect synthetic data tables. GANs are pairs of neural networks that “play against each other,” Xu says. The first network, called a generator, creates something — in this case, a row of synthetic data — and the second, called the discriminator, tries to tell if it’s real or not.
    “Eventually, the generator can generate perfect [data], and the discriminator cannot tell the difference,” says Xu. GANs are more often used in artificial image generation, but they work well for synthetic data, too: CTGAN outperformed classic synthetic data creation techniques in 85 percent of the cases tested in Xu’s study.
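    The adversarial setup can be sketched in a few lines of PyTorch. The toy below is a bare generator/discriminator pair for numeric rows, not CTGAN itself, which adds conditional sampling and special handling for mixed discrete and continuous columns:
    ```python
    import torch
    import torch.nn as nn

    # Minimal generator/discriminator pair for rows with 3 numeric fields.
    n_fields, noise_dim = 3, 8
    G = nn.Sequential(nn.Linear(noise_dim, 16), nn.ReLU(), nn.Linear(16, n_fields))
    D = nn.Sequential(nn.Linear(n_fields, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())

    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    loss = nn.BCELoss()
    real_rows = torch.randn(256, n_fields) * 2 + 5  # stand-in for a real table

    for step in range(200):
        # Discriminator: label real rows 1, generated rows 0.
        fake = G(torch.randn(64, noise_dim)).detach()
        real = real_rows[torch.randint(0, 256, (64,))]
        d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: try to make the discriminator output 1 on generated rows.
        gen = G(torch.randn(64, noise_dim))
        g_loss = loss(D(gen), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    print(G(torch.randn(5, noise_dim)))  # five synthetic rows
    ```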
    Statistical similarity is crucial. But depending on what they represent, datasets also come with their own vital context and constraints, which must be preserved in synthetic data. DAI lab researcher Sala gives the example of a hotel ledger: a guest always checks out after he or she checks in. The dates in a synthetic hotel reservation dataset must follow this rule, too: “They need to be in the right order,” he says.
    Large datasets may contain a number of different relationships like this, each strictly defined. “Models cannot learn the constraints, because those are very context-dependent,” says Veeramachaneni. So the team recently finalized an interface that allows people to tell a synthetic data generator where those bounds are. “The data is generated within those constraints,” Veeramachaneni says.
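    One generic way to impose the hotel-ledger rule is a post-hoc filter on generated rows, as in the sketch below; this illustrates the kind of bound a user might specify rather than the vault’s actual constraint interface, and the column names are hypothetical:
    ```python
    import pandas as pd

    def enforce_checkout_after_checkin(df: pd.DataFrame) -> pd.DataFrame:
        """Keep only synthetic reservations whose checkout date follows checkin.
        Rejecting invalid rows is one simple, generic way to impose such a rule."""
        checkin = pd.to_datetime(df["checkin"])
        checkout = pd.to_datetime(df["checkout"])
        return df[checkout > checkin].reset_index(drop=True)

    # Hypothetical synthetic output, including one invalid row to be filtered out.
    synthetic = pd.DataFrame({
        "guest_id": [101, 102, 103],
        "checkin":  ["2021-03-01", "2021-03-04", "2021-03-10"],
        "checkout": ["2021-03-03", "2021-03-02", "2021-03-12"],  # row 2 violates the rule
    })
    print(enforce_checkout_after_checkin(synthetic))
    ```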
    Such precise data could aid companies and organizations in many different sectors. One example is banking, where increased digitization, along with new data privacy rules, have “triggered a growing interest in ways to generate synthetic data,” says Wim Blommaert, a team leader at ING financial services. Current solutions, like data-masking, often destroy valuable information that banks could otherwise use to make decisions, he said. A tool like SDV has the potential to sidestep the sensitive aspects of data while preserving these important constraints and relationships.
    One vault to rule them all
    The Synthetic Data Vault combines everything the group has built so far into “a whole ecosystem,” says Veeramachaneni. The idea is that stakeholders — from students to professional software developers — can come to the vault and get what they need, whether that’s a large table, a small amount of time-series data, or a mix of many different data types.
    The vault is open-source and expandable. “There are a whole lot of different areas where we are realizing synthetic data can be used as well,” says Sala. For example, if a particular group is underrepresented in a sample dataset, synthetic data can be used to fill in those gaps — a sensitive endeavor that requires a lot of finesse. Or companies might also want to use synthetic data to plan for scenarios they haven’t yet experienced, like a huge bump in user traffic.
    As use cases continue to come up, more tools will be developed and added to the vault, Veeramachaneni says. It may occupy the team for another seven years at least, but they are ready: “We’re just touching the tip of the iceberg.”

  • Machine learning uncovers potential new TB drugs

    Machine learning is a computational tool used by many biologists to analyze huge amounts of data, helping them to identify potential new drugs. MIT researchers have now incorporated a new feature into these types of machine-learning algorithms, improving their prediction-making ability.
    Using this new approach, which allows computer models to account for uncertainty in the data they’re analyzing, the MIT team identified several promising compounds that target a protein required by the bacteria that cause tuberculosis.
    This method, which has previously been used by computer scientists but has not taken off in biology, could also prove useful in protein design and many other fields of biology, says Bonnie Berger, the Simons Professor of Mathematics and head of the Computation and Biology group in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL).
    “This technique is part of a known subfield of machine learning, but people have not brought it to biology,” Berger says. “This is a paradigm shift, and is absolutely how biological exploration should be done.”
    Berger and Bryan Bryson, an assistant professor of biological engineering at MIT and a member of the Ragon Institute of MGH, MIT, and Harvard, are the senior authors of the study, which appears today in Cell Systems. MIT graduate student Brian Hie is the paper’s lead author.
    Better predictions
    Machine learning is a type of computer modeling in which an algorithm learns to make predictions based on data that it has already seen. In recent years, biologists have begun using machine learning to scour huge databases of potential drug compounds to find molecules that interact with particular targets.
    One limitation of this method is that while the algorithms perform well when the data they’re analyzing are similar to the data they were trained on, they’re not very good at evaluating molecules that are very different from the ones they have already seen.
    To overcome that, the researchers used a technique called a Gaussian process to assign uncertainty values to the data that the algorithms are trained on. That way, when the models are analyzing the training data, they also take into account how reliable those predictions are.
    For example, if the data going into the model predict how strongly a particular molecule binds to a target protein, as well as the uncertainty of those predictions, the model can use that information to make predictions for protein-target interactions that it hasn’t seen before. The model also estimates the certainty of its own predictions. When analyzing new data, the model’s predictions may have lower certainty for molecules that are very different from the training data. Researchers can use that information to help them decide which molecules to test experimentally.
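    A generic version of this idea can be written with scikit-learn’s Gaussian process regressor, which returns a standard deviation alongside each prediction; the features and binding scores below are made up for illustration and are not the representation used in the study:
    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Hypothetical training data: 1-D "molecule features" and measured binding scores.
    X_train = np.array([[0.1], [0.4], [0.5], [0.9]])
    y_train = np.array([0.2, 0.8, 0.9, 0.3])

    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.2), alpha=1e-2)
    gp.fit(X_train, y_train)

    # Candidates far from the training data come back with larger predictive
    # uncertainty, which can guide which molecules to test experimentally.
    X_new = np.array([[0.45], [2.0]])  # one familiar candidate, one very different
    mean, std = gp.predict(X_new, return_std=True)
    for x, m, s in zip(X_new.ravel(), mean, std):
        print(f"x={x:.2f}: predicted binding {m:.2f} +/- {s:.2f}")
    ```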
    Another advantage of this approach is that the algorithm requires only a small amount of training data. In this study, the MIT team trained the model with a dataset of 72 small molecules and their interactions with more than 400 proteins called protein kinases. They were then able to use this algorithm to analyze nearly 11,000 small molecules, which they took from the ZINC database, a publicly available repository that contains millions of chemical compounds. Many of these molecules were very different from those in the training data.
    Using this approach, the researchers were able to identify molecules with very strong predicted binding affinities for the protein kinases they put into the model. These included three human kinases, as well as one kinase found in Mycobacterium tuberculosis. That kinase, PknB, is critical for the bacteria to survive, but is not targeted by any frontline TB antibiotics.
    The researchers then experimentally tested some of their top hits to see how well they actually bind to their targets, and found that the model’s predictions were very accurate. Among the molecules that the model assigned the highest certainty, about 90 percent proved to be true hits — much higher than the 30 to 40 percent hit rate of existing machine learning models used for drug screens.
    The researchers also used the same training data to train a traditional machine-learning algorithm, which does not incorporate uncertainty, and then had it analyze the same 11,000 molecule library. “Without uncertainty, the model just gets horribly confused and it proposes very weird chemical structures as interacting with the kinases,” Hie says.
    The researchers then took some of their most promising PknB inhibitors and tested them against Mycobacterium tuberculosis grown in bacterial culture media, and found that they inhibited bacterial growth. The inhibitors also worked in human immune cells infected with the bacterium.
    A good starting point
    Another important element of this approach is that once the researchers get additional experimental data, they can add it to the model and retrain it, further improving the predictions. Even a small amount of data can help the model get better, the researchers say.
    “You don’t really need very large data sets on each iteration,” Hie says. “You can just retrain the model with maybe 10 new examples, which is something that a biologist can easily generate.”
    This study is the first in many years to propose new molecules that can target PknB, and should give drug developers a good starting point to try to develop drugs that target the kinase, Bryson says. “We’ve now provided them with some new leads beyond what has been already published,” he says.
    The researchers also showed that they could use this same type of machine learning to boost the fluorescent output of a green fluorescent protein, which is commonly used to label molecules inside living cells. It could also be applied to many other types of biological studies, says Berger, who is now using it to analyze mutations that drive tumor development.
    The research was funded by the U.S. Department of Defense through the National Defense Science and Engineering Graduate Fellowship; the National Institutes of Health; the Ragon Institute of MGH, MIT, and Harvard; and MIT’s Department of Biological Engineering.

  • MIT Proto Ventures program readies new startups for launch

    Powered by the MIT Innovation Initiative (MITii) and launched in October 2019, the MIT Proto Ventures program takes an entirely new approach to venture formation from within MIT. It oversees the accelerated emergence of new ventures along a full life cycle: from discovery of ideas and resources at MIT, to exploration of the problem-solution space, to a methodical de-risking process, to helping build a “proto venture” with internal and external support that demonstrates the venture’s viability.
    Under the leadership of MITii Venture Builder Luis Ruben Soenksen PhD ’19, the program announced last week that two new MIT startups will launch as part of Proto Ventures.
    “The Proto Ventures program has nurtured everything I know is relevant to developing high-impact scientific ventures in today’s world,” says Soenksen. “This time has been an extraordinary complement to my PhD studies and research activities at MIT and is leading me to pursue my passion around artificial intelligence and health care in the form of multiple startups. What more could I have hoped for?”
    In its first year, the Proto Ventures program has generated hundreds of ecosystem interactions and a log of 319 screened concepts, the volume of screening needed to identify the kind of high-impact proto ventures that would attract long-term collaboration with multiple faculty, students, and staff across MIT.
    The program hosted its AI+Healthcare forum in February, bringing together 48 representatives of corporations, startups, venture capital firms, local hospitals, the pharmaceutical industry, and MIT researchers. Thirty-eight curated “proto venture” ideas were discussed and screened through roundtables, expert discussions and a voting process to gather multi-sectoral input on the anticipated value and impact of the proposed ventures, initiatives, and organizations at the intersection of artificial intelligence and health care.
    Two startups emerged from within the program’s AI+Healthcare track, which was sponsored by MIT’s Abdul Latif Jameel Clinic for Machine Learning in Health (Jameel Clinic):
    TOCI develops state-of-the-art AI tools that help time-constrained doctors and caregivers effortlessly regain quality time and human connection in front of their patients; and 
    Medicall offers enjoyable one-shot telemedicine using its unique AI that fully automates the repetitive process of asynchronous information collection and differential pre-reporting even before the visit begins.
    Going forward, Jameel Clinic will assume day-to-day management of this AI+Healthcare channel, promoting and accelerating these proto ventures along with further launches of the program.
    MIT Innovation Initiative actively seeks new Venture Builders to develop additional channels within the MIT Proto Ventures program.
    “The role of MIT Venture Builder is one of the most exciting opportunities for an entrepreneur in the MIT community,” says MIT Innovation Initiative Executive Director Gene Keselman. “It’s a unique license to explore and tap into the resources of the entire Institute, to bring together people and organizations that may not otherwise have collaborated, and to rapidly explore, invent, iterate, and launch.”
    Visit the MIT Proto Ventures Program or email proto.ventures@mit.edu to learn more, including how to become a channel sponsor or the next Venture Builder.