More stories

  • Improving the way videos are organized

    At any given moment, many thousands of new videos are being posted to sites like YouTube, TikTok, and Instagram. An increasing number of those videos are being recorded and streamed live. But tech and media companies still struggle to understand what’s going on in all that content.

    Now MIT alumnus-founded Netra is using artificial intelligence to improve video analysis at scale. The company’s system can identify activities, objects, emotions, locations, and more to organize and provide context to videos in new ways.

    Companies are using Netra’s solution to group similar content into highlight reels or news segments, flag nudity and violence, and improve ad placement. In advertising, Netra is helping ensure videos are paired with relevant ads so brands can move away from tracking individual people, which has led to privacy concerns.

    “The industry as a whole is pivoting toward content-based advertising, or what they call affinity advertising, and away from cookie-based, pixel-based tracking, which was always sort of creepy,” Netra co-founder and CTO Shashi Kant SM ’06 says.

    Netra also believes it is improving the searchability of video content. Once videos are processed by Netra’s system, users can start a search with a keyword. From there, they can click on results to see similar content and find increasingly specific events.

    For instance, Netra’s system can process a baseball season’s worth of video and help users find all the singles. By clicking on certain plays to see more like them, users can also find all the singles that were almost outs and led the fans to boo angrily.

    “Video is by far the biggest information resource today,” Kant says. “It dwarfs text by orders of magnitude in terms of information richness and size, yet no one’s even touched it with search. It’s the whitest of white space.”

    Pursuing a vision

    Internet pioneer and MIT professor Sir Tim Berners-Lee has long worked to improve machines’ ability to make sense of data on the internet. Kant researched under Berners-Lee as a graduate student and was inspired by his vision for improving the way information is stored and used by machines.

    “The holy grail to me is a new paradigm in information retrieval,” Kant says. “I feel web search is still 1.0. Even Google is 1.0. That’s been the vision of Sir Tim Berners-Lee’s semantic web initiative and that’s what I took from that experience.”

    Kant was also a member of the winning team in the MIT $100K Entrepreneurship Competition (the MIT $50K back then). He helped write the computer code for a solution called the Active Joint Brace, which was an electromechanical orthotic device for people with disabilities.

    After graduating in 2006, Kant started a company called Cognika that used AI in its solution. AI still had a bad reputation from being overhyped, so Kant would use terms like cognitive computing when pitching his company to investors and customers.

    Kant started Netra in 2013 to use AI for video analysis. These days he has to deal with the opposite end of the hype spectrum, with so many startups claiming they use AI in their solutions.

    Netra tries to cut through the hype with demonstrations of its system. The system can quickly analyze videos and organize the content based on what’s going on in different clips, including scenes where people are doing similar things, expressing similar emotions, using similar products, and more. Netra’s analysis generates metadata for different scenes, but Kant says the system provides much more than keyword tagging.

    “What we work with are embeddings,” Kant explains, referring to how his system classifies content. “If there’s a scene of someone hitting a home run, there’s a certain signature to that, and we generate an embedding for that. An embedding is a sequence of numbers, or a ‘vector,’ that captures the essence of a piece of content. Tags are just human readable representations of that. So, we’ll train a model that detects all the home runs, but underneath the cover there’s a neural network, and it’s creating an embedding of that video, and that differentiates the scene in other ways from an out or a walk.”
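    To make the embedding idea concrete, here is a minimal Python sketch, with invented vectors and an invented eight-dimensional embedding size, of how scenes represented as vectors can be compared by cosine similarity; this is the mechanism behind the click-to-see-similar search described above, not Netra’s actual code or API.

    ```python
    import numpy as np

    # Hypothetical illustration: each vector stands in for the output of a
    # neural network run on a video clip. Names and dimensions are invented.
    rng = np.random.default_rng(0)
    scene_embeddings = {
        "home_run_clip": rng.normal(loc=1.0, scale=0.1, size=8),
        "another_home_run": rng.normal(loc=1.0, scale=0.1, size=8),
        "walk_clip": rng.normal(loc=-1.0, scale=0.1, size=8),
    }

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Clips of the same kind of play land near each other in embedding space,
    # which makes "click to see more like this" a nearest-neighbor search.
    query = scene_embeddings["home_run_clip"]
    for name, emb in scene_embeddings.items():
        print(name, round(cosine_similarity(query, emb), 3))
    ```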

    By defining the relationships between different clips, Netra’s system allows customers to organize and search their content in new ways. Media companies can determine the most exciting moments of sporting events based on fans’ emotions. They can also group content by subject, location, or by whether or not clips include sensitive or disturbing content.

    Those abilities have major implications for online advertising. An advertising company representing a brand like the outdoor apparel company Patagonia could use Netra’s system to place Patagonia’s ads next to hiking content. Media companies could offer brands like Nike advertising space around clips of sponsored athletes.

    Those capabilities are helping advertisers adhere to new privacy regulations around the world that put restrictions on gathering data on individual people, especially children. Targeting certain groups of people with ads and tracking them across the web has also become controversial.

    Kant believes Netra’s AI engine is a step toward giving consumers more control over their data, an idea long championed by Berners-Lee.

    “It’s not the implementation of my CSAIL work, but I’d say the conceptual ideas I was pursuing at CSAIL come through in Netra’s solution,” Kant says.

    Transforming the way information is stored

    Netra currently counts some of the country’s largest media and advertising companies as customers. Kant believes Netra’s system could one day help anyone search through and organize the growing ocean of video content on the internet. To that end, he sees Netra’s solution continuing to evolve.

    “Search hasn’t changed much since it was invented for web 1.0,” Kant says. “Right now there’s lots of link-based search. Links are obsolete in my view. You don’t want to visit different documents. You want information from those documents aggregated into something contextual and customizable, including just the information you need.”

    Kant believes such contextualization would greatly improve the way information is organized and shared on the internet.

    “It’s about relying less and less on keywords and more and more on examples,” Kant explains. “For instance, in this video, if Shashi makes a statement, is that because he’s a crackpot or is there more to it? Imagine a system that could say, ‘This other scientist said something similar to validate that statement and this scientist responded similarly to that question.’ To me, those types of things are the future of information retrieval, and that’s my life’s passion. That’s why I came to MIT. That’s why I’ve spent one and a half decades of my life fighting this battle of AI, and that’s what I’ll continue to do.”

  • Behind Covid-19 vaccine development

    When starting a vaccine program, scientists generally have an anecdotal understanding of the disease they’re aiming to target. When Covid-19 surfaced over a year ago, there were so many unknowns about the fast-moving virus that scientists had to act quickly, relying on new methods and techniques just to begin understanding the basics of the disease.

    Scientists at Janssen Research & Development, developers of the Johnson & Johnson-Janssen Covid-19 vaccine, leveraged real-world data and, working with MIT researchers, applied artificial intelligence and machine learning to help guide the company’s research efforts into a potential vaccine.

    “Data science and machine learning can be used to augment scientific understanding of a disease,” says Najat Khan, chief data science officer and global head of strategy and operations for Janssen Research & Development. “For Covid-19, these tools became even more important because our knowledge was rather limited. There was no hypothesis at the time. We were developing an unbiased understanding of the disease based on real-world data using sophisticated AI/ML algorithms.”

    In preparing for clinical studies of Janssen’s lead vaccine candidate, Khan put out a call for collaborators on predictive modeling efforts to partner with her data science team to identify key locations for trial sites. Through Regina Barzilay, the MIT School of Engineering Distinguished Professor for AI and Health, faculty lead of AI for MIT’s Abdul Latif Jameel Clinic for Machine Learning in Health, and a member of Janssen’s scientific advisory board, Khan connected with Dimitris Bertsimas, the Boeing Leaders for Global Operations Professor of Management at MIT. Bertsimas had developed a leading machine learning model that tracks Covid-19 spread in communities and predicts patient outcomes, and Khan brought him on as the primary technical partner on the project.

    DELPHI

    When the World Health Organization declared Covid-19 a pandemic in March 2020 and forced much of the world into lockdown, Bertsimas, who is also the faculty lead of entrepreneurship for the Jameel Clinic, brought his group of 25-plus doctoral and master’s students together to discuss how they could use their collective skills in machine learning and optimization to create new tools to aid the world in combating the spread of the disease.

    The group started tracking their efforts on the COVIDAnalytics platform, where their models are generating accurate real-time insight into the pandemic. One of the group’s first projects was charting the progression of Covid-19 with an epidemiological model they developed named DELPHI, which predicts state-by-state infection and mortality rates based upon each state’s policy decisions.

    DELPHI is based on the standard SEIR model, a compartmental model that simplifies the mathematical modeling of infectious diseases by dividing populations into four categories: susceptible, exposed, infectious, and recovered. The ordering of the labels is intentional, showing the flow patterns between the compartments. DELPHI expands on this model with a system that looks at 11 possible states of being to account for realistic effects of the pandemic, such as comparing the length of time those who recovered from Covid-19 spent in the hospital versus those who died.
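    For readers who want the SEIR backbone spelled out, here is a minimal Python sketch of the four-compartment model that DELPHI builds on, integrated with simple Euler steps; the parameter values are illustrative stand-ins, not DELPHI’s fitted ones, and DELPHI itself adds seven more states and time-varying infection rates.

    ```python
    # Minimal SEIR sketch (illustrative parameters, not DELPHI's fitted values).
    beta, sigma, gamma = 0.4, 1 / 5.2, 1 / 10  # transmission, incubation, recovery rates
    N = 1_000_000                              # population size
    S, E, I, R = N - 100.0, 0.0, 100.0, 0.0    # susceptible, exposed, infectious, recovered
    dt, days = 0.1, 180

    for _ in range(int(days / dt)):
        new_exposed = beta * S * I / N         # susceptible people meeting infectious ones
        dS = -new_exposed
        dE = new_exposed - sigma * E           # exposed become infectious after incubation
        dI = sigma * E - gamma * I             # infectious leave the pool at rate gamma
        dR = gamma * I
        S, E, I, R = S + dS * dt, E + dE * dt, I + dI * dt, R + dR * dt

    print(f"After {days} days: {R / N:.1%} of the population has recovered")
    ```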

    “The model has some values that are hardwired, such as how long a person stays in the hospital, but we went deeper to account for the nonlinear change of infection rates, which we found were not constant and varied over different periods and locations,” says Bertsimas. “This gave us more modeling flexibility, which led the model to make more accurate predictions.”

    A key innovation of the model is capturing the behaviors of people related to measures put into place during the pandemic, such as lockdowns, mask-wearing, and social distancing, and the impact these had on infection rates.

    “By June or July, we were able to augment the model with these data. The model then became even more accurate,” says Bertsimas. “We also considered different scenarios for how various governments might respond with policy decisions, from implementing serious restrictions to no restrictions at all, and compared them to what we were seeing happening in the world. This gave us the ability to make a spectrum of predictions. One of the advantages of the DELPHI model is that it makes predictions on 120 countries and all 50 U.S. states on a daily basis.”

    A vaccine for today’s pandemic

    Being able to determine where Covid-19 is likely to spike next proved to be critical to the success of Janssen’s clinical trials, which were “event-based” — meaning that “we figure out efficacy based on how many ‘events’ are in our study population, events such as becoming sick with Covid-19,” explains Khan.

    “To run a trial like this, which is very, very large, it’s important to go to hot spots where we anticipate the disease transmission to be high so that you can accumulate those events quickly. If you can, then you can run the trial faster, bring the vaccine to market more quickly, and also, most importantly, have a very rich dataset where you can make statistically sound analysis.”

    Bertsimas assembled a core group of researchers to work with him on the project, including two doctoral students from MIT’s Operations Research Center, where he is a faculty member: Michael Li, who led implementation efforts, and Omar Skali Lami. Other members included Hamza Tazi MBAn ’20, a former master of business analytics student, and Ali Haddad, a data research scientist at Dynamic Ideas LLC.

    The MIT team began collaborating with Khan and her team last May to forecast where the next surge in cases might happen. Their goal was to identify Covid-19 hot spots where Janssen could conduct clinical trials and recruit participants who were most likely to get exposed to the virus.

    With clinical trials due to start last September, the teams had to immediately hit the ground running and make predictions four months in advance of when the trials would actually take place. “We started meeting daily with the Janssen team. I’m not exaggerating — we met on a daily basis … sometimes over the weekend, and sometimes more than once a day,” says Bertsimas.

    To understand how the virus was moving around the world, data scientists at Janssen continuously monitored and scouted global data sources. The team built a surveillance dashboard that pulled in data on case numbers, hospitalizations, and mortality and testing rates at the country, state, and even county level, depending on data availability.

    The DELPHI model integrated these data with additional information about local policies and behaviors, such as whether people were complying with mask-wearing, and was making daily predictions in the 300-400 range. “We were getting constant feedback from the Janssen team, which helped to improve the quality of the model. The model eventually became quite central to the clinical trial process,” says Bertsimas.

    Remarkably, the vast majority of Janssen’s clinical trial sites that DELPHI predicted to be Covid-19 hot spots ultimately had extremely high numbers of cases, including in South Africa and Brazil, where new variants of the virus had surfaced by the time the trials began. According to Khan, high incidence rates typically indicate variant involvement.

    “All of the predictions the model made are publicly available, so one can go back and see how accurate the model really is. It held its own. To this day, DELPHI is one of the most accurate models the scientific community has produced,” says Bertsimas.

    “As a result of this model, we were able to have a highly data-rich package at the time of submission of our vaccine candidate,” says Khan. “We are one of the few trials that had clinical data in South Africa and Brazil. That became critical because we were able to develop a vaccine that became relevant for today’s needs, today’s world, and today’s pandemic, which consists of so many variants, unfortunately.” 

    Khan points out that the DELPHI model was further evolved with diversity in mind, taking into account biological risk factors, patient demographics, and other characteristics. “Covid-19 impacts people in different ways, so it was important to go to areas where we were able to recruit participants from different races, ethnic groups, and genders. Due to this effort, we had one of the most diverse Covid-19 trials that’s been run to date,” she says. “If you start with the right data, unbiased, and go to the right places, we can actually change a lot of the paradigms that are limiting us today.”

    In April, the MIT and Janssen R&D data science teams were jointly recognized by the Institute for Operations Research and the Management Sciences (INFORMS) with the 2021 Innovative Applications in Analytics Award for their innovative and highly impactful work on Covid-19. Building on this success, the teams are continuing their collaboration to apply their data-driven approach and technical rigor in tackling other infectious diseases. “This was not a partnership in name only. Our teams really came together in this and continue to work together on various data science efforts across the pipeline,” says Khan. The team further appreciates the role of investigators on the ground, who contributed to site selection in combination with the model.

    “It was a very satisfying experience,” concurs Bertsimas. “I’m proud to have contributed to this effort and help the world in the fight against the pandemic.”

  • Helping students of all ages flourish in the era of artificial intelligence

    A new cross-disciplinary research initiative at MIT aims to promote the understanding and use of AI across all segments of society. The effort, called Responsible AI for Social Empowerment and Education (RAISE), will develop new teaching approaches and tools to engage learners in settings from preK-12 to the workforce.

    “People are using AI every day in our workplaces and our private lives. It’s in our apps, devices, social media, and more. It’s shaping the global economy, our institutions, and ourselves. Being digitally literate is no longer enough. People need to be AI-literate to understand the responsible use of AI and create things with it at individual, community, and societal levels,” says RAISE Director Cynthia Breazeal, a professor of media arts and sciences at MIT.

    “But right now, if you want to learn about AI to make AI-powered applications, you pretty much need to have a college degree in computer science or related topic,” Breazeal adds. “The educational barrier is still pretty high. The vision of this initiative is: AI for everyone else — with an emphasis on equity, access, and responsible empowerment.”

    Headquartered in the MIT Media Lab, RAISE is a collaboration with the MIT Schwarzman College of Computing and MIT Open Learning. The initiative will engage in research coupled with education and outreach efforts to advance new knowledge and innovative technologies to support how diverse people learn about AI as well as how AI can help to better support human learning. Through Open Learning and the Abdul Latif Jameel World Education Lab (J-WEL), RAISE will also extend its reach into a global network where equity and justice are key.

    The initiative draws on MIT’s history as both a birthplace of AI technology and a leader in AI pedagogy. “MIT already excels at undergraduate and graduate AI education,” says Breazeal, who heads the Media Lab’s Personal Robots group and is an associate director of the Media Lab. “Now we’re building on those successes. We’re saying we can take a leadership role in educational research, the science of learning, and technological innovation to broaden AI education and empower society writ large to shape our future with AI.”

    In addition to Breazeal, RAISE co-directors are Hal Abelson, professor of computer science and education; Eric Klopfer, professor and director of the Scheller Teacher Education Program; and Hae Won Park, a research scientist at the Media Lab. Other principal leaders include Professor Sanjay Sarma, vice president for open learning. RAISE draws additional participation from dozens of faculty, staff, and students across the Institute.

    “In today’s rapidly changing economic and technological landscape, a core challenge nationally and globally is to improve the effectiveness, availability, and equity of preK-12 education, community college, and workforce development. AI offers tremendous promise for new pedagogies and platforms, as well as for new content. Developing and deploying advances in computing for the public good is core to the mission of the Schwarzman College of Computing, and I’m delighted to have the College playing a role in this initiative,” says Daniel Huttenlocher, dean of the MIT Schwarzman College of Computing.

    The new initiative will engage in research, education, and outreach activities to advance four strategic impact areas: diversity and inclusion in AI, AI literacy in preK-12 education, AI workforce training, and AI-supported learning. Success means that the new knowledge, materials, technological innovations, and programs developed by RAISE are leveraged by other stakeholder AI education programs across MIT and beyond, adding value to their efficacy, experience, equity, and impact.

    RAISE will develop AI-augmented tools to support human learning across a variety of topics. “We’ve done a lot of work in the Media Lab around companion AI,” says Park. “Personalized learning companion AI agents such as social robots support individual students’ learning and motivation to learn. This work provides an effective and safe space for students to practice and explore topics such as early childhood literacy and language development.”

    Diversity and inclusion will be embedded throughout RAISE’s work, to help correct historic inequities in the field of AI. “We’re seeing story after story of unintended bias and inequities that are arising because of these AI systems,” says Breazeal. “So, a mission of our initiative is to educate a far more diverse and inclusive group of people in the responsible design and use of AI technologies, who will ultimately be more representative of the communities they will be developing these products and services for.”

    This spring, RAISE is piloting a K-12 outreach program called Future Makers. The program brings engaging, hands-on learning experiences about AI fundamentals and critical thinking about societal implications to teachers and students, primarily from underserved or under-resourced communities, such as schools receiving Title I services.

    To bring AI to young people within and beyond the classroom, RAISE is developing and distributing curricula, teacher guides, and student-friendly AI tools that enable anyone, even those with no programming background, to create original applications for desktop and mobile computing. “Scratch and App Inventor are already in the hands of millions of learners worldwide,” explains Abelson. “RAISE is enhancing these platforms and making powerful AI accessible to all people for increased creativity and personal expression.”

    Ethics and AI will be a central component of the initiative’s curricula and teaching tools. “Our philosophy is, have kids learn about the technical concepts right alongside the ethical design practices,” says Breazeal. “Thinking through the societal implications can’t be an afterthought.”

    “AI is changing the way we interact with computers as consumers as well as designers and developers of technology,” Klopfer says. “It is creating a new paradigm for innovation and change. We want to make sure that all people are empowered to use this technology in constructive, creative, and beneficial ways.”

    “Connecting this initiative not only to [MIT’s schools of] engineering and computing, but also to the School of Humanities, Arts and Social Sciences recognizes the multidimensional nature of this effort,” Klopfer adds.

    Sarma says RAISE also aims to boost AI literacy in the workforce, in part by adapting some of their K-12 techniques. “Many of these tools — when made somewhat more sophisticated and more germane to the adult learner — will make a tremendous difference,” says Sarma. For example, he envisions a program to train radiology technicians in how AI programs interpret diagnostic imagery and, vitally, how they can err.

    “AI is having a truly transformative effect across broad swaths of society,” says Breazeal. “Children today are not only digital natives, they’re AI natives. And adults need to understand AI to be able to engage in a democratic dialogue around how we want these systems deployed.”

  • Helping robots collaborate to get the job done

    Sometimes, one robot isn’t enough.

    Consider a search-and-rescue mission to find a hiker lost in the woods. Rescuers might want to deploy a squad of wheeled robots to roam the forest, perhaps with the aid of drones scouring the scene from above. The benefits of a robot team are clear. But orchestrating that team is no simple matter. How do you ensure the robots aren’t duplicating each other’s efforts or wasting energy on convoluted search trajectories?

    MIT researchers have designed an algorithm to ensure the fruitful cooperation of information-gathering robot teams. Their approach relies on balancing a tradeoff between data collected and energy expended — which eliminates the chance that a robot might execute a wasteful maneuver to gain just a smidgeon of information. The researchers say this assurance is vital for robot teams’ success in complex, unpredictable environments. “Our method provides comfort, because we know it will not fail, thanks to the algorithm’s worst-case performance,” says Xiaoyi Cai, a PhD student in MIT’s Department of Aeronautics and Astronautics (AeroAstro).

    The research will be presented at the IEEE International Conference on Robotics and Automation in May. Cai is the paper’s lead author. His co-authors include Jonathan How, the R.C. Maclaurin Professor of Aeronautics and Astronautics at MIT; Brent Schlotfeldt and George J. Pappas, both of the University of Pennsylvania; and Nikolay Atanasov of the University of California at San Diego.

    Robot teams have often relied on one overarching rule for gathering information: The more the merrier. “The assumption has been that it never hurts to collect more information,” says Cai. “If there’s a certain battery life, let’s just use it all to gain as much as possible.” This objective is often executed sequentially — each robot evaluates the situation and plans its trajectory, one after another. It’s a straightforward procedure, and it generally works well when information is the sole objective. But problems arise when energy efficiency becomes a factor.

    Cai says the benefits of gathering additional information often diminish over time. For example, if you already have 99 pictures of a forest, it might not be worth sending a robot on a miles-long quest to snap the 100th. “We want to be cognizant of the tradeoff between information and energy,” says Cai. “It’s not always good to have more robots moving around. It can actually be worse when you factor in the energy cost.”

    The researchers developed a robot team planning algorithm that optimizes the balance between energy and information. The algorithm’s “objective function,” which determines the value of a robot’s proposed task, accounts for the diminishing benefits of gathering additional information and the rising energy cost. Unlike prior planning methods, it doesn’t just assign tasks to the robots sequentially. “It’s more of a collaborative effort,” says Cai. “The robots come up with the team plan themselves.”

    Cai’s method, called Distributed Local Search, is an iterative approach that improves the team’s performance by adding or removing individual robots’ trajectories from the group’s overall plan. First, each robot independently generates a set of potential trajectories it might pursue. Next, each robot proposes its trajectories to the rest of the team. Then the algorithm accepts or rejects each individual’s proposal, depending on whether it increases or decreases the team’s objective function. “We allow the robots to plan their trajectories on their own,” says Cai. “Only when they need to come up with the team plan, we let them negotiate. So, it’s a rather distributed computation.”
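    The paper’s actual algorithm comes with formal guarantees; the toy Python sketch below only illustrates the propose/accept loop described above, using an invented objective (total information minus total energy) and random candidate trajectories.

    ```python
    import random

    # Toy version of the propose/accept loop, with invented numbers: each
    # candidate trajectory is a (information gained, energy spent) pair, and
    # the team objective is total information minus total energy. The real
    # objective uses diminishing returns on information, which is what makes
    # genuine coordination (not just per-robot greed) matter.
    random.seed(1)
    NUM_ROBOTS = 4

    candidates = {
        robot: [(random.uniform(0, 10), random.uniform(0, 6)) for _ in range(5)]
        for robot in range(NUM_ROBOTS)
    }
    plan = {robot: None for robot in range(NUM_ROBOTS)}  # current team plan

    def objective(team_plan):
        chosen = [t for t in team_plan.values() if t is not None]
        info = sum(gain for gain, _ in chosen)
        energy = sum(cost for _, cost in chosen)
        return info - energy

    # Keep sweeping over robots, accepting any single-trajectory change that
    # improves the team objective, until no proposal helps.
    improved = True
    while improved:
        improved = False
        for robot, options in candidates.items():
            for trajectory in options + [None]:  # None = sit this round out
                trial = dict(plan)
                trial[robot] = trajectory
                if objective(trial) > objective(plan):
                    plan, improved = trial, True

    print("team objective:", round(objective(plan), 2))
    ```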

    Distributed Local Search proved its mettle in computer simulations. The researchers ran their algorithm against competing ones in coordinating a simulated team of 10 robots. While Distributed Local Search took slightly more computation time, it guaranteed successful completion of the robots’ mission, in part by ensuring that no team member got mired in a wasteful expedition for minimal information. “It’s a more expensive method,” says Cai. “But we gain performance.”

    The advance could one day help robot teams solve real-world information gathering problems where energy is a finite resource, according to Geoff Hollinger, a roboticist at Oregon State University, who was not involved with the research. “These techniques are applicable where the robot team needs to trade off between sensing quality and energy expenditure. That would include aerial surveillance and ocean monitoring.”

    Cai also points to potential applications in mapping and search-and-rescue — activities that rely on efficient data collection. “Improving this underlying capability of information gathering will be quite impactful,” he says. The researchers next plan to test their algorithm on robot teams in the lab, including a mix of drones and wheeled robots.

    This research was funded in part by Boeing and the Army Research Laboratory’s Distributed and Collaborative Intelligent Systems and Technology Collaborative Research Alliance (DCIST CRA).

  • Building robots to expand access to cell therapies

    Over the last two years, Multiply Labs has helped pharmaceutical companies produce biologic drugs with its robotic manufacturing platform. The robots can work around the clock, precisely formulating small batches of drugs to help companies run clinical trials more quickly.

    Now Multiply Labs, which was founded by Fred Parietti PhD ’16 and Alice Melocchi, a former visiting PhD student at MIT, is hoping to bring the speed and precision of its robots to a new type of advanced treatment.

    In a recently announced project, Multiply Labs is developing a new robotic manufacturing platform to ease bottlenecks in the creation of cell therapies. These therapies have proven to be a powerful tool in the fight against cancer, but their production is incredibly labor intensive, contributing to their high cost. CAR-T cell therapy, for example, requires scientists to extract blood from a patient, isolate immune cells, genetically engineer those cells, grow the new cells, and inject them back into the patient. In many cases, each of those steps must be repeated for each patient.

    Multiply Labs is attempting to automate many processes that can currently only be done by highly trained scientists, reducing the potential for human error. The platform will also perform some of the most time-consuming tasks of cell therapy production in parallel. For instance, the company’s system will contain multiple bioreactors, which are used to grow the genetically modified cells that will be injected back into the patient. Some labs today only use one bioreactor in each clean room because of the specific environmental conditions that have to be met to optimize cell growth. By running multiple reactors simultaneously in a space about a quarter of the size of a basketball court, the company believes it can multiply the throughput of cell therapy production.

    Multiply Labs has partnered with global life sciences company Cytiva, which provides cell therapy equipment and services, as well as researchers at the University of California San Francisco to bring the platform to market.

    Multiply Labs’ efforts come at a time when demand for cell therapy treatment is expected to explode: There are currently more than 1,000 clinical trials underway to explore the treatment’s potential in a range of diseases. In the few areas where cell therapies are already approved, they have helped cancer patients when other treatment options had failed.

    “These [cell therapy] treatments are needed by millions of people, but only dozens of them can be administered by many centers,” Parietti says. “The real potential we see is enabling pharmaceutical companies to get these treatments approved and manufactured quicker so they can scale to hundreds of thousands — or millions — of patients.”

    A force multiplier

    Multiply Labs’ move into cell therapy is just the latest pivot for the company. The original idea for the startup came from Melocchi, who was a visiting PhD candidate in MIT’s chemical engineering department in 2013 and 2014. Melocchi had been creating drugs by hand in the MIT-Novartis Center for Continuous Manufacturing when she toured Parietti’s space at MIT. Parietti was building robotic limbs for factory workers and people with disabilities at the time, and his workspace was littered with robotic appendages and 3-D printers. Melocchi saw the machines as a way to make personalized drug capsules.

    Parietti developed the first robotic prototype in the kitchen of his Cambridge apartment, and the founders received early funding from the MIT Sandbox Innovation Fund Program.

    After going through the Y Combinator startup accelerator, the founders realized their biggest market would be pharmaceutical companies running clinical trials. Early trials often involve testing drugs of different potencies.

    “Every clinical trial is essentially personalized, because drug developers don’t know the right dosage,” Parietti says.

    Today Multiply Labs’ robotic clusters are being deployed on the production floors of leading pharmaceutical companies. The cloud-based platforms can produce 30,000 drug capsules a day and are modular, so companies can purchase as many systems as they need and run them together. Each system is contained in 15 square feet.

    “Our goal is to be the gold standard for the manufacturing of individualized drugs,” Parietti says. “We believe the future of medicine is going to be individualized drugs made on demand for single patients, and the only way to make those is with robots.”

    Multiply Labs robots handle each step of the drug formulation process.

    Roboticists enter cell therapy

    The move to cell therapy comes after Parietti’s small team of mostly MIT-trained roboticists and engineers spent the last two years learning about cell therapy production separately from its drug capsule work. Earlier this month, the company raised $20 million and is expecting to triple its team.

    Multiply Labs is already working with Cytiva to incorporate the company’s bioreactors into its platform.

    “[Multiply Labs’] automation has broad implications for the industry that include expanding patient access to existing treatments and accelerating the next generation of treatments,” says Cytiva’s Parker Donner, the company’s head of business development for cell and gene therapy.

    Multiply Labs aims to ship a demo to a cell therapy manufacturing facility at UCSF for clinical validation in the next nine months.

    “It really is a great adventure for someone like me, a physician-scientist, to interact with mechanical engineers and see how they think and solve problems,” says Jonathan Esensten, an assistant adjunct professor at UCSF whose research group is being sponsored by Multiply Labs for the project. “I think they have complementary ways of approaching problems compared to my team, and I think it’s going to lead to great things. I’m hopeful we’ll build technologies that push this field forward and bend the cost curve to allow us to do things better, faster, and cheaper. That’s what we need if these really exciting therapies are going to be made widely available.”

    Esensten, whose workspace is also an FDA-compliant cell therapy manufacturing facility, says his research group struggles to produce more than approximately six cell therapies per month.

    “The beauty of the Multiply Labs concept is that it’s modular,” Esensten says. “You could imagine a robot where there are no bottlenecks: You have as much capacity as you need at every step, no matter how long it takes. Of course, there are theoretical limits, but for a given footprint the robot will be able to manufacture many more products than we could do using manual processes in our clean rooms.”

    Parietti thinks Esensten’s lab is a great partner to prove robots can be a game changer for a nascent field with a lot of promise.

    “Cell therapies are amazing in terms of efficacy,” Parietti says. “But right now, they’re made by hand. Scientists are being used for manufacturing; it’s essentially artisanal. That’s not the way to scale. The way we think about it, the more successful we are, the more patients we help.”

  • New system cleans messy data tables automatically

    MIT researchers have created a new system that automatically cleans “dirty data” — the typos, duplicates, missing values, misspellings, and inconsistencies dreaded by data analysts, data engineers, and data scientists. The system, called PClean, is the latest in a series of domain-specific probabilistic programming languages written by researchers at the Probabilistic Computing Project that aim to simplify and automate the development of AI applications (others include one for 3D perception via inverse graphics and another for modeling time series and databases).

    According to surveys conducted by Anaconda and Figure Eight, data cleaning can take a quarter of a data scientist’s time. Automating the task is challenging because different datasets require different types of cleaning, and common-sense judgment calls about objects in the world are often needed (e.g., which of several cities called “Beverly Hills” someone lives in). PClean provides generic common-sense models for these kinds of judgment calls that can be customized to specific databases and types of errors.

    PClean uses a knowledge-based approach to automate the data cleaning process: Users encode background knowledge about the database and what sorts of issues might appear. Take, for instance, the problem of cleaning state names in a database of apartment listings. What if someone said they lived in Beverly Hills but left the state column empty? Though there is a well-known Beverly Hills in California, there’s also one in Florida, Missouri, and Texas … and there’s a neighborhood of Baltimore known as Beverly Hills. How can you know which one the person lives in? This is where PClean’s expressive scripting language comes in. Users can give PClean background knowledge about the domain and about how data might be corrupted. PClean combines this knowledge via common-sense probabilistic reasoning to come up with the answer. For example, given additional knowledge about typical rents, PClean infers the correct Beverly Hills is in California because of the high cost of rent where the respondent lives.
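    As a rough illustration of that kind of common-sense probabilistic reasoning, the Python sketch below scores the five candidate Beverly Hills locations with Bayes’ rule, combining a made-up prior with a made-up rent model; PClean’s actual scripting language and inference are far richer than this.

    ```python
    from math import exp

    # Invented priors and rent figures for the five candidate locations; the
    # point is only to show Bayes' rule combining prior knowledge with data.
    priors = {"CA": 0.70, "FL": 0.10, "MO": 0.05, "TX": 0.10, "MD": 0.05}
    typical_rent = {"CA": 5000, "FL": 1500, "MO": 900, "TX": 1300, "MD": 1200}

    def likelihood(observed_rent, mean, scale=800.0):
        # Crude bell-shaped likelihood of seeing this rent in that location.
        return exp(-((observed_rent - mean) / scale) ** 2)

    observed_rent = 4800  # the rent listed on the ambiguous record

    # Posterior is proportional to prior times likelihood; normalize to sum to 1.
    unnormalized = {
        state: priors[state] * likelihood(observed_rent, typical_rent[state])
        for state in priors
    }
    total = sum(unnormalized.values())
    posterior = {state: p / total for state, p in unnormalized.items()}

    best = max(posterior, key=posterior.get)
    print(best, round(posterior[best], 3))  # CA, with near-certainty
    ```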

    Alex Lew, the lead author of the paper and a PhD student in the Department of Electrical Engineering and Computer Science (EECS), says he’s most excited that PClean gives a way to enlist help from computers in the same way that people seek help from one another. “When I ask a friend for help with something, it’s often easier than asking a computer. That’s because in today’s dominant programming languages, I have to give step-by-step instructions, which can’t assume that the computer has any context about the world or task — or even just common-sense reasoning abilities. With a human, I get to assume all those things,” he says. “PClean is a step toward closing that gap. It lets me tell the computer what I know about a problem, encoding the same kind of background knowledge I’d explain to a person helping me clean my data. I can also give PClean hints, tips, and tricks I’ve already discovered for solving the task faster.”

    Co-authors are Monica Agrawal, a PhD student in EECS; David Sontag, an associate professor in EECS; and Vikash K. Mansinghka, a principal research scientist in the Department of Brain and Cognitive Sciences.

    What innovations allow this to work? 

    The idea that probabilistic cleaning based on declarative, generative knowledge could potentially deliver much greater accuracy than machine learning was previously suggested in a 2003 paper by Hanna Pasula and others from Stuart Russell’s lab at the University of California at Berkeley. “Ensuring data quality is a huge problem in the real world, and almost all existing solutions are ad-hoc, expensive, and error-prone,” says Russell, professor of computer science at UC Berkeley. “PClean is the first scalable, well-engineered, general-purpose solution based on generative data modeling, which has to be the right way to go. The results speak for themselves.” Co-author Agrawal adds that “existing data cleaning methods are more constrained in their expressiveness, which can be more user-friendly, but at the expense of being quite limiting. Further, we found that PClean can scale to very large datasets that have unrealistic runtimes under existing systems.”

    PClean builds on recent progress in probabilistic programming, including a new AI programming model built at MIT’s Probabilistic Computing Project that makes it much easier to apply realistic models of human knowledge to interpret data. PClean’s repairs are based on Bayesian reasoning, an approach that weighs alternative explanations of ambiguous data by applying probabilities based on prior knowledge to the data at hand. “The ability to make these kinds of uncertain decisions, where we want to tell the computer what kind of things it is likely to see, and have the computer automatically use that in order to figure out what is probably the right answer, is central to probabilistic programming,” says Lew.

    PClean is the first Bayesian data-cleaning system that can combine domain expertise with common-sense reasoning to automatically clean databases of millions of records. PClean achieves this scale via three innovations. First, PClean’s scripting language lets users encode what they know. This yields accurate models, even for complex databases. Second, PClean’s inference algorithm uses a two-phase approach, based on processing records one-at-a-time to make informed guesses about how to clean them, then revisiting its judgment calls to fix mistakes. This yields robust, accurate inference results. Third, PClean provides a custom compiler that generates fast inference code. This allows PClean to run on million-record databases with greater speed than multiple competing approaches. “PClean users can give PClean hints about how to reason more effectively about their database, and tune its performance — unlike previous probabilistic programming approaches to data cleaning, which relied primarily on generic inference algorithms that were often too slow or inaccurate,” says Mansinghka. 

    As with all probabilistic programs, the lines of code needed for the tool to work are far fewer than in alternative state-of-the-art options: PClean programs need only about 50 lines of code to outperform benchmarks in terms of accuracy and runtime. For comparison, a simple snake cellphone game takes twice as many lines of code to run, and Minecraft comes in at well over 1 million lines of code.

    In their paper, just presented at the 2021 Society for Artificial Intelligence and Statistics conference, the authors show PClean’s ability to scale to datasets containing millions of records by using PClean to detect errors and impute missing values in the 2.2 million-row Medicare Physician Compare National dataset. Running for just seven-and-a-half hours, PClean found more than 8,000 errors. The authors then verified by hand (via searches on hospital websites and doctor LinkedIn pages) that for more than 96 percent of them, PClean’s proposed fix was correct. 

    Since PClean is based on Bayesian probability, it can also give calibrated estimates of its uncertainty. “It can maintain multiple hypotheses — give you graded judgments, not just yes/no answers. This builds trust and helps users override PClean when necessary. For example, you can look at a judgment where PClean was uncertain, and tell it the right answer. It can then update the rest of its judgments in light of your feedback,” says Mansinghka. “We think there’s a lot of potential value in that kind of interactive process that interleaves human judgment with machine judgment. We see PClean as an early example of a new kind of AI system that can be told more of what people know, report when it is uncertain, and reason and interact with people in more useful, human-like ways.”

    David Pfau, a senior research scientist at DeepMind, noted in a tweet that PClean meets a business need: “When you consider that the vast majority of business data out there is not images of dogs, but entries in relational databases and spreadsheets, it’s a wonder that things like this don’t yet have the success that deep learning has.”

    Benefits, risks, and regulation

    PClean makes it cheaper and easier to join messy, inconsistent databases into clean records, without the massive investments in human and software systems that data-centric companies currently rely on. This has potential social benefits — but also risks, among them that PClean may make it cheaper and easier to invade people’s privacy, and potentially even to de-anonymize them, by joining incomplete information from multiple public sources.

    “We ultimately need much stronger data, AI, and privacy regulation, to mitigate these kinds of harms,” says Mansinghka. Lew adds, “As compared to machine-learning approaches to data cleaning, PClean might allow for finer-grained regulatory control. For example, PClean can tell us not only that it merged two records as referring to the same person, but also why it did so — and I can come to my own judgment about whether I agree. I can even tell PClean only to consider certain reasons for merging two entries.” Unfortunately, the researchers say, privacy concerns persist no matter how fairly a dataset is cleaned.

    Mansinghka and Lew are excited to help people pursue socially beneficial applications. They have been approached by people who want to use PClean to improve the quality of data for journalism and humanitarian applications, such as anticorruption monitoring and consolidating donor records submitted to state boards of elections. Agrawal says she hopes PClean will free up data scientists’ time, “to focus on the problems they care about instead of data cleaning. Early feedback and enthusiasm around PClean suggest that this might be the case, which we’re excited to hear.”

  • A comprehensive map of the SARS-CoV-2 genome

    In early 2020, a few months after the Covid-19 pandemic began, scientists were able to sequence the full genome of SARS-CoV-2, the virus that causes Covid-19. While many of its genes were already known at that point, the full complement of protein-coding genes was unresolved.

    Now, after performing an extensive comparative genomics study, MIT researchers have generated what they describe as the most accurate and complete gene annotation of the SARS-CoV-2 genome. In their study, which appears today in Nature Communications, they confirmed several protein-coding genes and found that a few others that had been suggested as genes do not code for any proteins.

    “We were able to use this powerful comparative genomics approach for evolutionary signatures to discover the true functional protein-coding content of this enormously important genome,” says Manolis Kellis, who is the senior author of the study and a professor of computer science in MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) as well as a member of the Broad Institute of MIT and Harvard.

    The research team also analyzed nearly 2,000 mutations that have arisen in different SARS-CoV-2 isolates since it began infecting humans, allowing them to rate how important those mutations may be in changing the virus’ ability to evade the immune system or become more infectious.

    Comparative genomics

    The SARS-CoV-2 genome consists of nearly 30,000 RNA bases. Scientists have identified several regions known to encode protein-coding genes, based on their similarity to protein-coding genes found in related viruses. A few other regions were suspected to encode proteins, but they had not been definitively classified as protein-coding genes.

    To nail down which parts of the SARS-CoV-2 genome actually contain genes, the researchers performed a type of study known as comparative genomics, in which they compare the genomes of similar viruses. The SARS-CoV-2 virus belongs to a subgenus of viruses called Sarbecovirus, most of which infect bats. The researchers performed their analysis on SARS-CoV-2, SARS-CoV (which caused the 2003 SARS outbreak), and 42 strains of bat sarbecoviruses.

    Kellis has previously developed computational techniques for doing this type of analysis, which his team has also used to compare the human genome with genomes of other mammals. The techniques are based on analyzing whether certain DNA or RNA bases are conserved between species, and comparing their patterns of evolution over time.
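    A toy version of the core conservation signal appears below: score each alignment column by how strongly the species agree. The sequences are invented three-species fragments; the real analysis spans 44 sarbecovirus genomes and uses much richer evolutionary models than simple column agreement.

    ```python
    # Invented three-sequence alignment fragment; the real analysis compares
    # SARS-CoV-2, SARS-CoV, and 42 bat sarbecovirus genomes.
    alignment = [
        "AUGGCUUAC",  # stand-in for SARS-CoV-2
        "AUGGCUUAU",  # stand-in for SARS-CoV
        "AUGGCAUAC",  # stand-in for a bat sarbecovirus
    ]

    def conservation(column):
        # Fraction of sequences agreeing with the most common base.
        return max(column.count(base) for base in set(column)) / len(column)

    scores = [
        conservation([seq[i] for seq in alignment])
        for i in range(len(alignment[0]))
    ]
    print([round(s, 2) for s in scores])  # perfectly conserved columns score 1.0
    ```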

    Using these techniques, the researchers confirmed six protein-coding genes in the SARS-CoV-2 genome in addition to the five that are well established in all coronaviruses. They also determined that the region encoding a gene called ORF3a encodes an additional gene, which they named ORF3c. The gene has RNA bases that overlap with ORF3a but occur in a different reading frame. This gene-within-a-gene is rare in large genomes but common in many viruses, whose genomes are under selective pressure to stay compact. The role of this new gene, as well as that of several other SARS-CoV-2 genes, is not yet known.
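    To see what a gene-within-a-gene means mechanically, the Python sketch below translates one invented RNA string in two reading frames, yielding two different peptides from the same bases; the sequence and the trimmed-down codon table are ours, not the actual ORF3a/ORF3c region.

    ```python
    # Invented RNA string and trimmed codon table (real codon assignments, but
    # only the codons this example needs). Reading the same bases in two frames
    # yields two different peptides, which is what "gene-within-a-gene" means.
    CODON_TABLE = {
        "AUG": "M", "GCU": "A", "UUU": "F", "CUU": "L", "GGA": "G",
        "UGC": "C", "AAA": "K", "UAA": "*",  # '*' marks a stop codon
        "UGG": "W", "UUC": "F", "UUG": "L", "GAU": "D", "GCA": "A", "AAU": "N",
    }

    def translate(rna, frame):
        peptide = []
        for i in range(frame, len(rna) - 2, 3):
            amino_acid = CODON_TABLE[rna[i:i + 3]]
            if amino_acid == "*":
                break
            peptide.append(amino_acid)
        return "".join(peptide)

    rna = "AUGGCUUUUCUUGGAUGCAAAUAA"
    print("frame 0:", translate(rna, 0))  # MAFLGCK
    print("frame 1:", translate(rna, 1))  # WLFLDAN, from the very same bases
    ```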

    The researchers also showed that five other regions that had been proposed as possible genes do not encode functional proteins, and they also ruled out the possibility that there are any more conserved protein-coding genes yet to be discovered.

    “We analyzed the entire genome and are very confident that there are no other conserved protein-coding genes,” says Irwin Jungreis, lead author of the study and a CSAIL research scientist. “Experimental studies are needed to figure out the functions of the uncharacterized genes, and by determining which ones are real, we allow other researchers to focus their attention on those genes rather than spend their time on something that doesn’t even get translated into protein.”

    The researchers also recognized that many previous papers used not only incorrect gene sets, but sometimes also conflicting gene names. To remedy the situation, they brought together the SARS-CoV-2 community and presented a set of recommendations for naming SARS-CoV-2 genes, in a separate paper published a few weeks ago in Virology.

    Fast evolution

    In the new study, the researchers also analyzed more than 1,800 mutations that have arisen in SARS-CoV-2 since it was first identified. For each gene, they compared how rapidly that particular gene has evolved in the past with how much it has evolved since the current pandemic began.

    They found that in most cases, genes that evolved rapidly for long periods of time before the current pandemic have continued to do so, and those that tended to evolve slowly have maintained that trend. However, the researchers also identified exceptions to these patterns, which may shed light on how the virus has evolved as it has adapted to its new human host, Kellis says.

    In one example, the researchers identified a region of the nucleocapsid protein, which surrounds the viral genetic material, that had many more mutations than expected from its historical evolution patterns. This protein region is also classified as a target of human B cells. Therefore, mutations in that region may help the virus evade the human immune system, Kellis says.

    “The most accelerated region in the entire genome of SARS-CoV-2 is sitting smack in the middle of this nucleocapsid protein,” he says. “We speculate that those variants that don’t mutate that region get recognized by the human immune system and eliminated, whereas those variants that randomly accumulate mutations in that region are in fact better able to evade the human immune system and remain in circulation.”

    The researchers also analyzed mutations that have arisen in variants of concern, such as the B.1.1.7 strain from England, the P.1 strain from Brazil, and the B.1.351 strain from South Africa. Many of the mutations that make those variants more dangerous are found in the spike protein, and help the virus spread faster and avoid the immune system. However, each of those variants carries other mutations as well.

    “Each of those variants has more than 20 other mutations, and it’s important to know which of those are likely to be doing something and which aren’t,” Jungreis says. “So, we used our comparative genomics evidence to get a first-pass guess at which of these are likely to be important based on which ones were in conserved positions.”

    This data could help other scientists focus their attention on the mutations that appear most likely to have significant effects on the virus’ infectivity, the researchers say. They have made the annotated gene set and their mutation classifications available in the University of California at Santa Cruz Genome Browser for other researchers who wish to use it.

    “We can now go and actually study the evolutionary context of these variants and understand how the current pandemic fits in that larger history,” Kellis says. “For strains that have many mutations, we can see which of these mutations are likely to be host-specific adaptations, and which mutations are perhaps nothing to write home about.”

    The research was funded by the National Human Genome Research Institute and the National Institutes of Health. Rachel Sealfon, a research scientist at the Flatiron Institute Center for Computational Biology, is also an author of the paper.

  • A robot that can help you untangle your hair

    With rapidly growing demands on health care systems, nurses typically spend 18 to 40 percent of their time performing direct patient care tasks, oftentimes for many patients and with little time to spare. Personal care robots that brush hair could provide substantial help and relief. 

    This may seem like a truly radical form of “self-care,” but crafty robots for things like shaving, hair-washing, and makeup are not new. In 2011, the tech giant Panasonic developed a robot that could wash, massage, and even blow-dry hair, explicitly designed to help support “safe and comfortable living of the elderly and people with limited mobility, while reducing the burden of caregivers.” 

    Hair-combing bots, however, have been less explored, leading scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and the Soft Math Lab at Harvard University to develop a robotic arm setup with a sensorized soft brush. The robot is equipped with a camera that helps it “see” and assess curliness, so it can plan a delicate and time-efficient brush-out.

    [Video: Robotic Hair Brushing]

    The team’s control strategy is adaptive to the degree of tangling in the fiber bunch, and they put “RoboWig” to the test by brushing wigs ranging from straight to very curly hair.

    While the hardware setup of RoboWig looks futuristic and shiny, the underlying model of the hair fibers is what makes it tick. CSAIL postdoc Josie Hughes and her team opted to represent the entangled hair as sets of entwined double helices — think classic DNA strands. This level of granularity provided key insights into mathematical models and control systems for manipulating bundles of soft fibers, with a wide range of applications in the textile industry, animal care, and other fibrous systems.

    “By developing a model of tangled fibers, we understand from a model-based perspective how hairs must be entangled: starting from the bottom and slowly working the way up to prevent ‘jamming’ of the fibers,” says Hughes, the lead author on a paper about RoboWig. “This is something everyone who has brushed hair has learned from experience, but is now something we can demonstrate through a model, and use to inform a robot.”  

    The task at hand is a tangled one. Every head of hair is different, and the intricate interplay between hairs when combing can easily lead to knots. What’s more, if the incorrect brushing strategy is used, the process can be very painful and damaging to the hair.

    Previous research in the brushing domain has mostly been on the mechanical, dynamic, and visual properties of hair, as opposed to RoboWig’s refined focus on tangling and combing behavior. 

    To brush and manipulate the hair, the researchers added a soft-bristled sensorized brush to the robot arm, allowing the forces applied during brushing to be measured. They combined this setup with something called a “closed-loop control system,” which takes feedback from an output and automatically performs an action without human intervention. This created “force feedback” from the brush — a control method that lets the user feel what the device is doing — so the length of each stroke could be optimized to take into account both the potential “pain” and the time taken to brush.

    Initial tests preserved the human head — for now — and instead were done on a number of wigs of various hair styles and types. The model provided insight into the behaviors of the combing, related to the number of entanglements, and how those could be efficiently and effectively brushed out by choosing appropriate brushing lengths. For example, for curlier hair, the pain cost would dominate, so shorter brush lengths were optimal. 
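    As a back-of-the-envelope illustration of that tradeoff, the Python sketch below chooses a stroke length by minimizing an invented cost that adds total “pain” (which grows with stroke length and tangle density) to total brushing time (which shrinks as strokes get longer); the cost model and numbers are ours, not the paper’s, but the qualitative result matches: curlier hair favors shorter strokes.

    ```python
    import numpy as np

    # Invented cost model: total pain grows with stroke length and tangle
    # density, while total time shrinks as strokes get longer (fewer strokes).
    hair_length = 30.0  # cm of hair to brush out

    def total_cost(stroke_len, tangle_density):
        num_strokes = hair_length / stroke_len
        pain_per_stroke = tangle_density * stroke_len ** 2  # long strokes hurt disproportionately
        time_cost = num_strokes                             # each stroke takes roughly fixed time
        return num_strokes * pain_per_stroke + time_cost

    candidate_lengths = np.linspace(1.0, 15.0, 200)
    for tangle_density in (0.05, 0.5):  # straight-ish hair vs. very curly hair
        costs = [total_cost(s, tangle_density) for s in candidate_lengths]
        best = candidate_lengths[int(np.argmin(costs))]
        print(f"tangle density {tangle_density}: best stroke length ~ {best:.1f} cm")
    ```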

    The team wants to eventually perform more realistic experiments on humans, to better understand the performance of the robot with respect to their experience of pain — a metric that is obviously highly subjective, as one person’s “two” could be another’s “eight.”

    “To allow robots to extend their task-solving abilities to more complex tasks such as hair brushing, we need not only novel safe hardware, but also an understanding of the complex behavior of the soft hair and tangled fibers,” says Hughes. “In addition to hair brushing, the insights provided by our approach could be applied to brushing of fibers for textiles, or animal fibers.” 

    Hughes wrote the paper alongside Harvard University School of Engineering and Applied Sciences PhD students Thomas Bolton Plumb-Reyes and Nicholas Charles; Professor L. Mahadevan of Harvard’s School of Engineering and Applied Sciences, Department of Physics, and Organismic and Evolutionary Biology; and MIT professor and CSAIL Director Daniela Rus. They presented the paper virtually at the IEEE Conference on Soft Robotics (RoboSoft) earlier this month. 

    The project was supported, in part, by the National Science Foundation’s Emerging Frontiers in Research and Innovation program between MIT CSAIL and the Soft Math Lab at Harvard.