More stories

  • in

    New techniques efficiently accelerate sparse tensors for massive AI models

    Researchers from MIT and NVIDIA have developed two techniques that accelerate the processing of sparse tensors, a type of data structure that’s used for high-performance computing tasks. The complementary techniques could result in significant improvements to the performance and energy-efficiency of systems like the massive machine-learning models that drive generative artificial intelligence.

    Tensors are data structures used by machine-learning models. Both of the new methods seek to efficiently exploit what’s known as sparsity — zero values — in the tensors. When processing these tensors, one can skip over the zeros and save on both computation and memory. For instance, anything multiplied by zero is zero, so it can skip that operation. And it can compress the tensor (zeros don’t need to be stored) so a larger portion can be stored in on-chip memory.

    However, there are several challenges to exploiting sparsity. Finding the nonzero values in a large tensor is no easy task. Existing approaches often limit the locations of nonzero values by enforcing a sparsity pattern to simplify the search, but this limits the variety of sparse tensors that can be processed efficiently.

    Another challenge is that the number of nonzero values can vary in different regions of the tensor. This makes it difficult to determine how much space is required to store different regions in memory. To make sure the region fits, more space is often allocated than is needed, causing the storage buffer to be underutilized. This increases off-chip memory traffic, which increases energy consumption.

    The MIT and NVIDIA researchers crafted two solutions to address these problems. For one, they developed a technique that allows the hardware to efficiently find the nonzero values for a wider variety of sparsity patterns.

    For the other solution, they created a method that can handle the case where the data do not fit in memory, which increases the utilization of the storage buffer and reduces off-chip memory traffic.

    Both methods boost the performance and reduce the energy demands of hardware accelerators specifically designed to speed up the processing of sparse tensors.

    “Typically, when you use more specialized or domain-specific hardware accelerators, you lose the flexibility that you would get from a more general-purpose processor, like a CPU. What stands out with these two works is that we show that you can still maintain flexibility and adaptability while being specialized and efficient,” says Vivienne Sze, associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Research Laboratory of Electronics (RLE), and co-senior author of papers on both advances.

    Her co-authors include lead authors Yannan Nellie Wu PhD ’23 and Zi Yu Xue, an electrical engineering and computer science graduate student; and co-senior author Joel Emer, an MIT professor of the practice in computer science and electrical engineering and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), as well as others at NVIDIA. Both papers will be presented at the IEEE/ACM International Symposium on Microarchitecture.

    HighLight: Efficiently finding zero values

    Sparsity can arise in the tensor for a variety of reasons. For example, researchers sometimes “prune” unnecessary pieces of the machine-learning models by replacing some values in the tensor with zeros, creating sparsity. The degree of sparsity (percentage of zeros) and the locations of the zeros can vary for different models.

    To make it easier to find the remaining nonzero values in a model with billions of individual values, researchers often restrict the location of the nonzero values so they fall into a certain pattern. However, each hardware accelerator is typically designed to support one specific sparsity pattern, limiting its flexibility.  

    By contrast, the hardware accelerator the MIT researchers designed, called HighLight, can handle a wide variety of sparsity patterns and still perform well when running models that don’t have any zero values.

    They use a technique they call “hierarchical structured sparsity” to efficiently represent a wide variety of sparsity patterns that are composed of several simple sparsity patterns. This approach divides the values in a tensor into smaller blocks, where each block has its own simple, sparsity pattern (perhaps two zeros and two nonzeros in a block with four values).

    Then, they combine the blocks into a hierarchy, where each collection of blocks also has its own simple, sparsity pattern (perhaps one zero block and three nonzero blocks in a level with four blocks). They continue combining blocks into larger levels, but the patterns remain simple at each step.

    This simplicity enables HighLight to more efficiently find and skip zeros, so it can take full advantage of the opportunity to cut excess computation. On average, their accelerator design had about six times better energy-delay product (a metric related to energy efficiency) than other approaches.

    “In the end, the HighLight accelerator is able to efficiently accelerate dense models because it does not introduce a lot of overhead, and at the same time it is able to exploit workloads with different amounts of zero values based on hierarchical structured sparsity,” Wu explains.

    In the future, she and her collaborators want to apply hierarchical structured sparsity to more types of machine-learning models and different types of tensors in the models.

    Tailors and Swiftiles: Effectively “overbooking” to accelerate workloads

    Researchers can also leverage sparsity to more efficiently move and process data on a computer chip.

    Since the tensors are often larger than what can be stored in the memory buffer on chip, the chip only grabs and processes a chunk of the tensor at a time. The chunks are called tiles.

    To maximize the utilization of that buffer and limit the number of times the chip must access off-chip memory, which often dominates energy consumption and limits processing speed, researchers seek to use the largest tile that will fit into the buffer.

    But in a sparse tensor, many of the data values are zero, so an even larger tile can fit into the buffer than one might expect based on its capacity. Zero values don’t need to be stored.

    But the number of zero values can vary across different regions of the tensor, so they can also vary for each tile. This makes it difficult to determine a tile size that will fit in the buffer. As a result, existing approaches often conservatively assume there are no zeros and end up selecting a smaller tile, which results in wasted blank spaces in the buffer.

    To address this uncertainty, the researchers propose the use of “overbooking” to allow them to increase the tile size, as well as a way to tolerate it if the tile doesn’t fit the buffer.

    The same way an airline overbooks tickets for a flight, if all the passengers show up, the airline must compensate the ones who are bumped from the plane. But usually all the passengers don’t show up.

    In a sparse tensor, a tile size can be chosen such that usually the tiles will have enough zeros that most still fit into the buffer. But occasionally, a tile will have more nonzero values than will fit. In this case, those data are bumped out of the buffer.

    The researchers enable the hardware to only re-fetch the bumped data without grabbing and processing the entire tile again. They modify the “tail end” of the buffer to handle this, hence the name of this technique, Tailors.

    Then they also created an approach for finding the size for tiles that takes advantage of overbooking. This method, called Swiftiles, swiftly estimates the ideal tile size so that a specific percentage of tiles, set by the user, are overbooked. (The names “Tailors” and “Swiftiles” pay homage to Taylor Swift, whose recent Eras tour was fraught with overbooked presale codes for tickets).

    Swiftiles reduces the number of times the hardware needs to check the tensor to identify an ideal tile size, saving on computation. The combination of Tailors and Swiftiles more than doubles the speed while requiring only half the energy demands of existing hardware accelerators which cannot handle overbooking.

    “Swiftiles allows us to estimate how large these tiles need to be without requiring multiple iterations to refine the estimate. This only works because overbooking is supported. Even if you are off by a decent amount, you can still extract a fair bit of speedup because of the way the non-zeros are distributed,” Xue says.

    In the future, the researchers want to apply the idea of overbooking to other aspects in computer architecture and also work to improve the process for estimating the optimal level of overbooking.

    This research is funded, in part, by the MIT AI Hardware Program. More

  • in

    A new chip for decoding data transmissions demonstrates record-breaking energy efficiency

    Imagine using an online banking app to deposit money into your account. Like all information sent over the internet, those communications could be corrupted by noise that inserts errors into the data.

    To overcome this problem, senders encode data before they are transmitted, and then a receiver uses a decoding algorithm to correct errors and recover the original message. In some instances, data are received with reliability information that helps the decoder figure out which parts of a transmission are likely errors.

    Researchers at MIT and elsewhere have developed a decoder chip that employs a new statistical model to use this reliability information in a way that is much simpler and faster than conventional techniques.

    Their chip uses a universal decoding algorithm the team previously developed, which can unravel any error correcting code. Typically, decoding hardware can only process one particular type of code. This new, universal decoder chip has broken the record for energy-efficient decoding, performing between 10 and 100 times better than other hardware.

    This advance could enable mobile devices with fewer chips, since they would no longer need separate hardware for multiple codes. This would reduce the amount of material needed for fabrication, cutting costs and improving sustainability. By making the decoding process less energy intensive, the chip could also improve device performance and lengthen battery life. It could be especially useful for demanding applications like augmented and virtual reality and 5G networks.

    “This is the first time anyone has broken below the 1 picojoule-per-bit barrier for decoding. That is roughly the same amount of energy you need to transmit a bit inside the system. It had been a big symbolic threshold, but it also changes the balance in the receiver of what might be the most pressing part from an energy perspective — we can move that away from the decoder to other elements,” says Muriel Médard, the School of Science NEC Professor of Software Science and Engineering, a professor in the Department of Electrical Engineering and Computer Science, and a co-author of a paper presenting the new chip.

    Médard’s co-authors include lead author Arslan Riaz, a graduate student at Boston University (BU); Rabia Tugce Yazicigil, assistant professor of electrical and computer engineering at BU; and Ken R. Duffy, then director of the Hamilton Institute at Maynooth University and now a professor at Northeastern University, as well as others from MIT, BU, and Maynooth University. The work is being presented at the International Solid-States Circuits Conference.

    Smarter sorting

    Digital data are transmitted over a network in the form of bits (0s and 1s). A sender encodes data by adding an error-correcting code, which is a redundant string of 0s and 1s that can be viewed as a hash. Information about this hash is held in a specific code book. A decoding algorithm at the receiver, designed for this particular code, uses its code book and the hash structure to retrieve the original information, which may have been jumbled by noise. Since each algorithm is code-specific, and most require dedicated hardware, a device would need many chips to decode different codes.

    The researchers previously demonstrated GRAND (Guessing Random Additive Noise Decoding), a universal decoding algorithm that can crack any code. GRAND works by guessing the noise that affected the transmission, subtracting that noise pattern from the received data, and then checking what remains in a code book. It guesses a series of noise patterns in the order they are likely to occur.

    Data are often received with reliability information, also called soft information, that helps a decoder figure out which pieces are errors. The new decoding chip, called ORBGRAND (Ordered Reliability Bits GRAND), uses this reliability information to sort data based on how likely each bit is to be an error.

    But it isn’t as simple as ordering single bits. While the most unreliable bit might be the likeliest error, perhaps the third and fourth most unreliable bits together are as likely to be an error as the seventh-most unreliable bit. ORBGRAND uses a new statistical model that can sort bits in this fashion, considering that multiple bits together are as likely to be an error as some single bits.

    “If your car isn’t working, soft information might tell you that it is probably the battery. But if it isn’t the battery alone, maybe it is the battery and the alternator together that are causing the problem. This is how a rational person would troubleshoot — you’d say that it could actually be these two things together before going down the list to something that is much less likely,” Médard says.

    This is a much more efficient approach than traditional decoders, which would instead look at the code structure and have a performance that is generally designed for the worst-case.

    “With a traditional decoder, you’d pull out the blueprint of the car and examine each and every piece. You’ll find the problem, but it will take you a long time and you’ll get very frustrated,” Médard explains.

    ORBGRAND stops sorting as soon as a code word is found, which is often very soon. The chip also employs parallelization, generating and testing multiple noise patterns simultaneously so it finds the code word faster. Because the decoder stops working once it finds the code word, its energy consumption stays low even though it runs multiple processes simultaneously.

    Record-breaking efficiency

    When they compared their approach to other chips, ORBGRAND decoded with maximum accuracy while consuming only 0.76 picojoules of energy per bit, breaking the previous performance record. ORBGRAND consumes between 10 and 100 times less energy than other devices.

    One of the biggest challenges of developing the new chip came from this reduced energy consumption, Médard says. With ORBGRAND, generating noise sequences is now so energy-efficient that other processes the researchers hadn’t focused on before, like checking the code word in a code book, consume most of the effort.

    “Now, this checking process, which is like turning on the car to see if it works, is the hardest part. So, we need to find more efficient ways to do that,” she says.

    The team is also exploring ways to change the modulation of transmissions so they can take advantage of the improved efficiency of the ORBGRAND chip. They also plan to see how their technique could be utilized to more efficiently manage multiple transmissions that overlap.

    The research is funded, in part, by the U.S. Defense Advanced Research Projects Agency (DARPA) and Science Foundation Ireland. More

  • in

    A new way for quantum computing systems to keep their cool

    Heat causes errors in the qubits that are the building blocks of a quantum computer, so quantum systems are typically kept inside refrigerators that keep the temperature just above absolute zero (-459 degrees Fahrenheit).

    But quantum computers need to communicate with electronics outside the refrigerator, in a room-temperature environment. The metal cables that connect these electronics bring heat into the refrigerator, which has to work even harder and draw extra power to keep the system cold. Plus, more qubits require more cables, so the size of a quantum system is limited by how much heat the fridge can remove.

    To overcome this challenge, an interdisciplinary team of MIT researchers has developed a wireless communication system that enables a quantum computer to send and receive data to and from electronics outside the refrigerator using high-speed terahertz waves.

    A transceiver chip placed inside the fridge can receive and transmit data. Terahertz waves generated outside the refrigerator are beamed in through a glass window. Data encoded onto these waves can be received by the chip. That chip also acts as a mirror, delivering data from the qubits on the terahertz waves it reflects to their source.

    This reflection process also bounces back much of the power sent into the fridge, so the process generates only a minimal amount of heat. The contactless communication system consumes up to 10 times less power than systems with metal cables.

    “By having this reflection mode, you really save the power consumption inside the fridge and leave all those dirty jobs on the outside. While this is still just a preliminary prototype and we have some room to improve, even at this point, we have shown low power consumption inside the fridge that is already better than metallic cables. I believe this could be a way to build largescale quantum systems,” says senior author Ruonan Han, an associate professor in the Department of Electrical Engineering and Computer Sciences (EECS) who leads the Terahertz Integrated Electronics Group.

    Han and his team, with expertise in terahertz waves and electronic devices, joined forces with associate professor Dirk Englund and the Quantum Photonics Laboratory team, who provided quantum engineering expertise and joined in conducting the cryogenic experiments.

    Joining Han and Englund on the paper are first author and EECS graduate student Jinchen Wang; Mohamed Ibrahim PhD ’21; Isaac Harris, a graduate student in the Quantum Photonics Laboratory; Nathan M. Monroe PhD ’22; Wasiq Khan PhD ’22; and Xiang Yi, a former postdoc who is now a professor at the South China University of Technology. The paper will be presented at the International Solid-States Circuits Conference.

    Tiny mirrors

    The researchers’ square transceiver chip, measuring about 2 millimeters on each side, is placed on a quantum computer inside the refrigerator, which is called a cryostat because it maintains cryogenic temperatures. These super-cold temperatures don’t damage the chip; in fact, they enable it to run more efficiently than it would at room temperature.

    The chip sends and receives data from a terahertz wave source outside the cryostat using a passive communication process known as backscatter, which involves reflections. An array of antennas on top of the chip, each of which is only about 200 micrometers in size, act as tiny mirrors. These mirrors can be “turned on” to reflect waves or “turned off.”

    The terahertz wave generation source encodes data onto the waves it sends into the cryostat, and the antennas in their “off” state can receive those waves and the data they carry.

    When the tiny mirrors are turned on, they can be set so they either reflect a wave in its current form or invert its phase before bouncing it back. If the reflected wave has the same phase, that represents a 0, but if the phase is inverted, that represents a 1. Electronics outside the cryostat can interpret those binary signals to decode the data.

    “This backscatter technology is not new. For instance, RFIDs are based on backscatter communication. We borrow that idea and bring it into this very unique scenario, and I think this leads to a good combination of all these technologies,” Han says.

    Terahertz advantages

    The data are transmitted using high-speed terahertz waves, which are located on the electromagnetic spectrum between radio waves and infrared light.

    Because terahertz waves are much smaller than radio waves, the chip and its antennas can be smaller, too, which would make the device easier to manufacture at scale. Terahertz waves also have higher frequencies than radio waves, so they can transmit data much faster and move larger amounts of information.

    But because terahertz waves have lower frequencies than the light waves used in photonic systems, the terahertz waves carry less quantum noise, which leads to less interference with quantum processors.

    Importantly, the transceiver chip and terahertz link can be fully constructed with standard fabrication processes on a CMOS chip, so they can be integrated into many current systems and techniques.

    “CMOS compatibility is important. For example, one terahertz link could deliver a large amount of data and feed it to another cryo-CMOS controller, which can split the signal to control multiple qubits simultaneously, so we can reduce the quantity of RF cables dramatically. This is very promising.” Wang says.

    The researchers were able to transmit data at 4 gigabits per second with their prototype, but Han says the sky is nearly the limit when it comes to boosting that speed. The downlink of the contactless system posed about 10 times less heat load than a system with metallic cables, and the temperature of the cryostat fluctuated up to a few millidegrees during experiments.

    Now that the researchers have demonstrated this wireless technology, they want to improve the system’s speed and efficiency using special terahertz fibers, which are only a few hundred micrometers wide. Han’s group has shown that these plastic wires can transmit data at a rate of 100 gigabits per second and have much better thermal insulation than fatter, metal cables.

    The researchers also want to refine the design of their transceiver to improve scalability and continue boosting its energy efficiency. Generating terahertz waves requires a lot of power, but Han’s group is studying more efficient methods that utilize low-cost chips. Incorporating this technology into the system could make the device more cost-effective.

    The transceiver chip was fabricated through the Intel University Shuttle Program. More

  • in

    Efficient technique improves machine-learning models’ reliability

    Powerful machine-learning models are being used to help people tackle tough problems such as identifying disease in medical images or detecting road obstacles for autonomous vehicles. But machine-learning models can make mistakes, so in high-stakes settings it’s critical that humans know when to trust a model’s predictions.

    Uncertainty quantification is one tool that improves a model’s reliability; the model produces a score along with the prediction that expresses a confidence level that the prediction is correct. While uncertainty quantification can be useful, existing methods typically require retraining the entire model to give it that ability. Training involves showing a model millions of examples so it can learn a task. Retraining then requires millions of new data inputs, which can be expensive and difficult to obtain, and also uses huge amounts of computing resources.

    Researchers at MIT and the MIT-IBM Watson AI Lab have now developed a technique that enables a model to perform more effective uncertainty quantification, while using far fewer computing resources than other methods, and no additional data. Their technique, which does not require a user to retrain or modify a model, is flexible enough for many applications.

    The technique involves creating a simpler companion model that assists the original machine-learning model in estimating uncertainty. This smaller model is designed to identify different types of uncertainty, which can help researchers drill down on the root cause of inaccurate predictions.

    “Uncertainty quantification is essential for both developers and users of machine-learning models. Developers can utilize uncertainty measurements to help develop more robust models, while for users, it can add another layer of trust and reliability when deploying models in the real world. Our work leads to a more flexible and practical solution for uncertainty quantification,” says Maohao Shen, an electrical engineering and computer science graduate student and lead author of a paper on this technique.

    Shen wrote the paper with Yuheng Bu, a former postdoc in the Research Laboratory of Electronics (RLE) who is now an assistant professor at the University of Florida; Prasanna Sattigeri, Soumya Ghosh, and Subhro Das, research staff members at the MIT-IBM Watson AI Lab; and senior author Gregory Wornell, the Sumitomo Professor in Engineering who leads the Signals, Information, and Algorithms Laboratory RLE and is a member of the MIT-IBM Watson AI Lab. The research will be presented at the AAAI Conference on Artificial Intelligence.

    Quantifying uncertainty

    In uncertainty quantification, a machine-learning model generates a numerical score with each output to reflect its confidence in that prediction’s accuracy. Incorporating uncertainty quantification by building a new model from scratch or retraining an existing model typically requires a large amount of data and expensive computation, which is often impractical. What’s more, existing methods sometimes have the unintended consequence of degrading the quality of the model’s predictions.

    The MIT and MIT-IBM Watson AI Lab researchers have thus zeroed in on the following problem: Given a pretrained model, how can they enable it to perform effective uncertainty quantification?

    They solve this by creating a smaller and simpler model, known as a metamodel, that attaches to the larger, pretrained model and uses the features that larger model has already learned to help it make uncertainty quantification assessments.

    “The metamodel can be applied to any pretrained model. It is better to have access to the internals of the model, because we can get much more information about the base model, but it will also work if you just have a final output. It can still predict a confidence score,” Sattigeri says.

    They design the metamodel to produce the uncertainty quantification output using a technique that includes both types of uncertainty: data uncertainty and model uncertainty. Data uncertainty is caused by corrupted data or inaccurate labels and can only be reduced by fixing the dataset or gathering new data. In model uncertainty, the model is not sure how to explain the newly observed data and might make incorrect predictions, most likely because it hasn’t seen enough similar training examples. This issue is an especially challenging but common problem when models are deployed. In real-world settings, they often encounter data that are different from the training dataset.

    “Has the reliability of your decisions changed when you use the model in a new setting? You want some way to have confidence in whether it is working in this new regime or whether you need to collect training data for this particular new setting,” Wornell says.

    Validating the quantification

    Once a model produces an uncertainty quantification score, the user still needs some assurance that the score itself is accurate. Researchers often validate accuracy by creating a smaller dataset, held out from the original training data, and then testing the model on the held-out data. However, this technique does not work well in measuring uncertainty quantification because the model can achieve good prediction accuracy while still being over-confident, Shen says.

    They created a new validation technique by adding noise to the data in the validation set — this noisy data is more like out-of-distribution data that can cause model uncertainty. The researchers use this noisy dataset to evaluate uncertainty quantifications.

    They tested their approach by seeing how well a meta-model could capture different types of uncertainty for various downstream tasks, including out-of-distribution detection and misclassification detection. Their method not only outperformed all the baselines in each downstream task but also required less training time to achieve those results.

    This technique could help researchers enable more machine-learning models to effectively perform uncertainty quantification, ultimately aiding users in making better decisions about when to trust predictions.

    Moving forward, the researchers want to adapt their technique for newer classes of models, such as large language models that have a different structure than a traditional neural network, Shen says.

    The work was funded, in part, by the MIT-IBM Watson AI Lab and the U.S. National Science Foundation. More

  • in

    Breaking the scaling limits of analog computing

    As machine-learning models become larger and more complex, they require faster and more energy-efficient hardware to perform computations. Conventional digital computers are struggling to keep up.

    An analog optical neural network could perform the same tasks as a digital one, such as image classification or speech recognition, but because computations are performed using light instead of electrical signals, optical neural networks can run many times faster while consuming less energy.

    However, these analog devices are prone to hardware errors that can make computations less precise. Microscopic imperfections in hardware components are one cause of these errors. In an optical neural network that has many connected components, errors can quickly accumulate.

    Even with error-correction techniques, due to fundamental properties of the devices that make up an optical neural network, some amount of error is unavoidable. A network that is large enough to be implemented in the real world would be far too imprecise to be effective.

    MIT researchers have overcome this hurdle and found a way to effectively scale an optical neural network. By adding a tiny hardware component to the optical switches that form the network’s architecture, they can reduce even the uncorrectable errors that would otherwise accumulate in the device.

    Their work could enable a super-fast, energy-efficient, analog neural network that can function with the same accuracy as a digital one. With this technique, as an optical circuit becomes larger, the amount of error in its computations actually decreases.  

    “This is remarkable, as it runs counter to the intuition of analog systems, where larger circuits are supposed to have higher errors, so that errors set a limit on scalability. This present paper allows us to address the scalability question of these systems with an unambiguous ‘yes,’” says lead author Ryan Hamerly, a visiting scientist in the MIT Research Laboratory for Electronics (RLE) and Quantum Photonics Laboratory and senior scientist at NTT Research.

    Hamerly’s co-authors are graduate student Saumil Bandyopadhyay and senior author Dirk Englund, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), leader of the Quantum Photonics Laboratory, and member of the RLE. The research is published today in Nature Communications.

    Multiplying with light

    An optical neural network is composed of many connected components that function like reprogrammable, tunable mirrors. These tunable mirrors are called Mach-Zehnder Inferometers (MZI). Neural network data are encoded into light, which is fired into the optical neural network from a laser.

    A typical MZI contains two mirrors and two beam splitters. Light enters the top of an MZI, where it is split into two parts which interfere with each other before being recombined by the second beam splitter and then reflected out the bottom to the next MZI in the array. Researchers can leverage the interference of these optical signals to perform complex linear algebra operations, known as matrix multiplication, which is how neural networks process data.

    But errors that can occur in each MZI quickly accumulate as light moves from one device to the next. One can avoid some errors by identifying them in advance and tuning the MZIs so earlier errors are cancelled out by later devices in the array.

    “It is a very simple algorithm if you know what the errors are. But these errors are notoriously difficult to ascertain because you only have access to the inputs and outputs of your chip,” says Hamerly. “This motivated us to look at whether it is possible to create calibration-free error correction.”

    Hamerly and his collaborators previously demonstrated a mathematical technique that went a step further. They could successfully infer the errors and correctly tune the MZIs accordingly, but even this didn’t remove all the error.

    Due to the fundamental nature of an MZI, there are instances where it is impossible to tune a device so all light flows out the bottom port to the next MZI. If the device loses a fraction of light at each step and the array is very large, by the end there will only be a tiny bit of power left.

    “Even with error correction, there is a fundamental limit to how good a chip can be. MZIs are physically unable to realize certain settings they need to be configured to,” he says.

    So, the team developed a new type of MZI. The researchers added an additional beam splitter to the end of the device, calling it a 3-MZI because it has three beam splitters instead of two. Due to the way this additional beam splitter mixes the light, it becomes much easier for an MZI to reach the setting it needs to send all light from out through its bottom port.

    Importantly, the additional beam splitter is only a few micrometers in size and is a passive component, so it doesn’t require any extra wiring. Adding additional beam splitters doesn’t significantly change the size of the chip.

    Bigger chip, fewer errors

    When the researchers conducted simulations to test their architecture, they found that it can eliminate much of the uncorrectable error that hampers accuracy. And as the optical neural network becomes larger, the amount of error in the device actually drops — the opposite of what happens in a device with standard MZIs.

    Using 3-MZIs, they could potentially create a device big enough for commercial uses with error that has been reduced by a factor of 20, Hamerly says.

    The researchers also developed a variant of the MZI design specifically for correlated errors. These occur due to manufacturing imperfections — if the thickness of a chip is slightly wrong, the MZIs may all be off by about the same amount, so the errors are all about the same. They found a way to change the configuration of an MZI to make it robust to these types of errors. This technique also increased the bandwidth of the optical neural network so it can run three times faster.

    Now that they have showcased these techniques using simulations, Hamerly and his collaborators plan to test these approaches on physical hardware and continue driving toward an optical neural network they can effectively deploy in the real world.

    This research is funded, in part, by a National Science Foundation graduate research fellowship and the U.S. Air Force Office of Scientific Research. More

  • in

    Deep learning with light

    Ask a smart home device for the weather forecast, and it takes several seconds for the device to respond. One reason this latency occurs is because connected devices don’t have enough memory or power to store and run the enormous machine-learning models needed for the device to understand what a user is asking of it. The model is stored in a data center that may be hundreds of miles away, where the answer is computed and sent to the device.

    MIT researchers have created a new method for computing directly on these devices, which drastically reduces this latency. Their technique shifts the memory-intensive steps of running a machine-learning model to a central server where components of the model are encoded onto light waves.

    The waves are transmitted to a connected device using fiber optics, which enables tons of data to be sent lightning-fast through a network. The receiver then employs a simple optical device that rapidly performs computations using the parts of a model carried by those light waves.

    This technique leads to more than a hundredfold improvement in energy efficiency when compared to other methods. It could also improve security, since a user’s data do not need to be transferred to a central location for computation.

    This method could enable a self-driving car to make decisions in real-time while using just a tiny percentage of the energy currently required by power-hungry computers. It could also allow a user to have a latency-free conversation with their smart home device, be used for live video processing over cellular networks, or even enable high-speed image classification on a spacecraft millions of miles from Earth.

    “Every time you want to run a neural network, you have to run the program, and how fast you can run the program depends on how fast you can pipe the program in from memory. Our pipe is massive — it corresponds to sending a full feature-length movie over the internet every millisecond or so. That is how fast data comes into our system. And it can compute as fast as that,” says senior author Dirk Englund, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and member of the MIT Research Laboratory of Electronics.

    Joining Englund on the paper is lead author and EECS grad student Alexander Sludds; EECS grad student Saumil Bandyopadhyay, Research Scientist Ryan Hamerly, as well as others from MIT, the MIT Lincoln Laboratory, and Nokia Corporation. The research is published today in Science.

    Lightening the load

    Neural networks are machine-learning models that use layers of connected nodes, or neurons, to recognize patterns in datasets and perform tasks, like classifying images or recognizing speech. But these models can contain billions of weight parameters, which are numeric values that transform input data as they are processed. These weights must be stored in memory. At the same time, the data transformation process involves billions of algebraic computations, which require a great deal of power to perform.

    The process of fetching data (the weights of the neural network, in this case) from memory and moving them to the parts of a computer that do the actual computation is one of the biggest limiting factors to speed and energy efficiency, says Sludds.

    “So our thought was, why don’t we take all that heavy lifting — the process of fetching billions of weights from memory — move it away from the edge device and put it someplace where we have abundant access to power and memory, which gives us the ability to fetch those weights quickly?” he says.

    The neural network architecture they developed, Netcast, involves storing weights in a central server that is connected to a novel piece of hardware called a smart transceiver. This smart transceiver, a thumb-sized chip that can receive and transmit data, uses technology known as silicon photonics to fetch trillions of weights from memory each second.

    It receives weights as electrical signals and imprints them onto light waves. Since the weight data are encoded as bits (1s and 0s) the transceiver converts them by switching lasers; a laser is turned on for a 1 and off for a 0. It combines these light waves and then periodically transfers them through a fiber optic network so a client device doesn’t need to query the server to receive them.

    “Optics is great because there are many ways to carry data within optics. For instance, you can put data on different colors of light, and that enables a much higher data throughput and greater bandwidth than with electronics,” explains Bandyopadhyay.

    Trillions per second

    Once the light waves arrive at the client device, a simple optical component known as a broadband “Mach-Zehnder” modulator uses them to perform super-fast, analog computation. This involves encoding input data from the device, such as sensor information, onto the weights. Then it sends each individual wavelength to a receiver that detects the light and measures the result of the computation.

    The researchers devised a way to use this modulator to do trillions of multiplications per second, which vastly increases the speed of computation on the device while using only a tiny amount of power.   

    “In order to make something faster, you need to make it more energy efficient. But there is a trade-off. We’ve built a system that can operate with about a milliwatt of power but still do trillions of multiplications per second. In terms of both speed and energy efficiency, that is a gain of orders of magnitude,” Sludds says.

    They tested this architecture by sending weights over an 86-kilometer fiber that connects their lab to MIT Lincoln Laboratory. Netcast enabled machine-learning with high accuracy — 98.7 percent for image classification and 98.8 percent for digit recognition — at rapid speeds.

    “We had to do some calibration, but I was surprised by how little work we had to do to achieve such high accuracy out of the box. We were able to get commercially relevant accuracy,” adds Hamerly.

    Moving forward, the researchers want to iterate on the smart transceiver chip to achieve even better performance. They also want to miniaturize the receiver, which is currently the size of a shoe box, down to the size of a single chip so it could fit onto a smart device like a cell phone.

    “Using photonics and light as a platform for computing is a really exciting area of research with potentially huge implications on the speed and efficiency of our information technology landscape,” says Euan Allen, a Royal Academy of Engineering Research Fellow at the University of Bath, who was not involved with this work. “The work of Sludds et al. is an exciting step toward seeing real-world implementations of such devices, introducing a new and practical edge-computing scheme whilst also exploring some of the fundamental limitations of computation at very low (single-photon) light levels.”

    The research is funded, in part, by NTT Research, the National Science Foundation, the Air Force Office of Scientific Research, the Air Force Research Laboratory, and the Army Research Office. More

  • in

    A technique to improve both fairness and accuracy in artificial intelligence

    For workers who use machine-learning models to help them make decisions, knowing when to trust a model’s predictions is not always an easy task, especially since these models are often so complex that their inner workings remain a mystery.

    Users sometimes employ a technique, known as selective regression, in which the model estimates its confidence level for each prediction and will reject predictions when its confidence is too low. Then a human can examine those cases, gather additional information, and make a decision about each one manually.

    But while selective regression has been shown to improve the overall performance of a model, researchers at MIT and the MIT-IBM Watson AI Lab have discovered that the technique can have the opposite effect for underrepresented groups of people in a dataset. As the model’s confidence increases with selective regression, its chance of making the right prediction also increases, but this does not always happen for all subgroups.

    For instance, a model suggesting loan approvals might make fewer errors on average, but it may actually make more wrong predictions for Black or female applicants. One reason this can occur is due to the fact that the model’s confidence measure is trained using overrepresented groups and may not be accurate for these underrepresented groups.

    Once they had identified this problem, the MIT researchers developed two algorithms that can remedy the issue. Using real-world datasets, they show that the algorithms reduce performance disparities that had affected marginalized subgroups.

    “Ultimately, this is about being more intelligent about which samples you hand off to a human to deal with. Rather than just minimizing some broad error rate for the model, we want to make sure the error rate across groups is taken into account in a smart way,” says senior MIT author Greg Wornell, the Sumitomo Professor in Engineering in the Department of Electrical Engineering and Computer Science (EECS) who leads the Signals, Information, and Algorithms Laboratory in the Research Laboratory of Electronics (RLE) and is a member of the MIT-IBM Watson AI Lab.

    Joining Wornell on the paper are co-lead authors Abhin Shah, an EECS graduate student, and Yuheng Bu, a postdoc in RLE; as well as Joshua Ka-Wing Lee SM ’17, ScD ’21 and Subhro Das, Rameswar Panda, and Prasanna Sattigeri, research staff members at the MIT-IBM Watson AI Lab. The paper will be presented this month at the International Conference on Machine Learning.

    To predict or not to predict

    Regression is a technique that estimates the relationship between a dependent variable and independent variables. In machine learning, regression analysis is commonly used for prediction tasks, such as predicting the price of a home given its features (number of bedrooms, square footage, etc.) With selective regression, the machine-learning model can make one of two choices for each input — it can make a prediction or abstain from a prediction if it doesn’t have enough confidence in its decision.

    When the model abstains, it reduces the fraction of samples it is making predictions on, which is known as coverage. By only making predictions on inputs that it is highly confident about, the overall performance of the model should improve. But this can also amplify biases that exist in a dataset, which occur when the model does not have sufficient data from certain subgroups. This can lead to errors or bad predictions for underrepresented individuals.

    The MIT researchers aimed to ensure that, as the overall error rate for the model improves with selective regression, the performance for every subgroup also improves. They call this monotonic selective risk.

    “It was challenging to come up with the right notion of fairness for this particular problem. But by enforcing this criteria, monotonic selective risk, we can make sure the model performance is actually getting better across all subgroups when you reduce the coverage,” says Shah.

    Focus on fairness

    The team developed two neural network algorithms that impose this fairness criteria to solve the problem.

    One algorithm guarantees that the features the model uses to make predictions contain all information about the sensitive attributes in the dataset, such as race and sex, that is relevant to the target variable of interest. Sensitive attributes are features that may not be used for decisions, often due to laws or organizational policies. The second algorithm employs a calibration technique to ensure the model makes the same prediction for an input, regardless of whether any sensitive attributes are added to that input.

    The researchers tested these algorithms by applying them to real-world datasets that could be used in high-stakes decision making. One, an insurance dataset, is used to predict total annual medical expenses charged to patients using demographic statistics; another, a crime dataset, is used to predict the number of violent crimes in communities using socioeconomic information. Both datasets contain sensitive attributes for individuals.

    When they implemented their algorithms on top of a standard machine-learning method for selective regression, they were able to reduce disparities by achieving lower error rates for the minority subgroups in each dataset. Moreover, this was accomplished without significantly impacting the overall error rate.

    “We see that if we don’t impose certain constraints, in cases where the model is really confident, it could actually be making more errors, which could be very costly in some applications, like health care. So if we reverse the trend and make it more intuitive, we will catch a lot of these errors. A major goal of this work is to avoid errors going silently undetected,” Sattigeri says.

    The researchers plan to apply their solutions to other applications, such as predicting house prices, student GPA, or loan interest rate, to see if the algorithms need to be calibrated for those tasks, says Shah. They also want to explore techniques that use less sensitive information during the model training process to avoid privacy issues.

    And they hope to improve the confidence estimates in selective regression to prevent situations where the model’s confidence is low, but its prediction is correct. This could reduce the workload on humans and further streamline the decision-making process, Sattigeri says.

    This research was funded, in part, by the MIT-IBM Watson AI Lab and its member companies Boston Scientific, Samsung, and Wells Fargo, and by the National Science Foundation. More

  • in

    A universal system for decoding any type of data sent across a network

    Every piece of data that travels over the internet — from paragraphs in an email to 3D graphics in a virtual reality environment — can be altered by the noise it encounters along the way, such as electromagnetic interference from a microwave or Bluetooth device. The data are coded so that when they arrive at their destination, a decoding algorithm can undo the negative effects of that noise and retrieve the original data.

    Since the 1950s, most error-correcting codes and decoding algorithms have been designed together. Each code had a structure that corresponded with a particular, highly complex decoding algorithm, which often required the use of dedicated hardware.

    Researchers at MIT, Boston University, and Maynooth University in Ireland have now created the first silicon chip that is able to decode any code, regardless of its structure, with maximum accuracy, using a universal decoding algorithm called Guessing Random Additive Noise Decoding (GRAND). By eliminating the need for multiple, computationally complex decoders, GRAND enables increased efficiency that could have applications in augmented and virtual reality, gaming, 5G networks, and connected devices that rely on processing a high volume of data with minimal delay.

    The research at MIT is led by Muriel Médard, the Cecil H. and Ida Green Professor in the Department of Electrical Engineering and Computer Science, and was co-authored by Amit Solomon and Wei Ann, both graduate students at MIT; Rabia Tugce Yazicigil, assistant professor of electrical and computer engineering at Boston University; Arslan Riaz and Vaibhav Bansal, both graduate students at Boston University; Ken R. Duffy, director of the Hamilton Institute at the National University of Ireland at Maynooth; and Kevin Galligan, a Maynooth graduate student. The research will be presented at the European Solid-States Device Research and Circuits Conference next week.

    Focus on noise

    One way to think of these codes is as redundant hashes (in this case, a series of 1s and 0s) added to the end of the original data. The rules for the creation of that hash are stored in a specific codebook.

    As the encoded data travel over a network, they are affected by noise, or energy that disrupts the signal, which is often generated by other electronic devices. When that coded data and the noise that affected them arrive at their destination, the decoding algorithm consults its codebook and uses the structure of the hash to guess what the stored information is.

    Instead, GRAND works by guessing the noise that affected the message, and uses the noise pattern to deduce the original information. GRAND generates a series of noise sequences in the order they are likely to occur, subtracts them from the received data, and checks to see if the resulting codeword is in a codebook.

    While the noise appears random in nature, it has a probabilistic structure that allows the algorithm to guess what it might be.

    “In a way, it is similar to troubleshooting. If someone brings their car into the shop, the mechanic doesn’t start by mapping the entire car to blueprints. Instead, they start by asking, ‘What is the most likely thing to go wrong?’ Maybe it just needs gas. If that doesn’t work, what’s next? Maybe the battery is dead?” Médard says.

    Novel hardware

    The GRAND chip uses a three-tiered structure, starting with the simplest possible solutions in the first stage and working up to longer and more complex noise patterns in the two subsequent stages. Each stage operates independently, which increases the throughput of the system and saves power.

    The device is also designed to switch seamlessly between two codebooks. It contains two static random-access memory chips, one that can crack codewords, while the other loads a new codebook and then switches to decoding without any downtime.

    The researchers tested the GRAND chip and found it could effectively decode any moderate redundancy code up to 128 bits in length, with only about a microsecond of latency.

    Médard and her collaborators had previously demonstrated the success of the algorithm, but this new work showcases the effectiveness and efficiency of GRAND in hardware for the first time.

    Developing hardware for the novel decoding algorithm required the researchers to first toss aside their preconceived notions, Médard says.

    “We couldn’t go out and reuse things that had already been done. This was like a complete whiteboard. We had to really think about every single component from scratch. It was a journey of reconsideration. And I think when we do our next chip, there will be things with this first chip that we’ll realize we did out of habit or assumption that we can do better,” she says.

    A chip for the future

    Since GRAND only uses codebooks for verification, the chip not only works with legacy codes but could also be used with codes that haven’t even been introduced yet.

    In the lead-up to 5G implementation, regulators and communications companies struggled to find consensus as to which codes should be used in the new network. Regulators ultimately chose to use two types of traditional codes for 5G infrastructure in different situations. Using GRAND could eliminate the need for that rigid standardization in the future, Médard says.

    The GRAND chip could even open the field of coding to a wave of innovation.

    “For reasons I’m not quite sure of, people approach coding with awe, like it is black magic. The process is mathematically nasty, so people just use codes that already exist. I’m hoping this will recast the discussion so it is not so standards-oriented, enabling people to use codes that already exist and create new codes,” she says.

    Moving forward, Médard and her collaborators plan to tackle the problem of soft detection with a retooled version of the GRAND chip. In soft detection, the received data are less precise.

    They also plan to test the ability of GRAND to crack longer, more complex codes and adjust the structure of the silicon chip to improve its energy efficiency.

    The research was funded by the Battelle Memorial Institute and Science Foundation of Ireland. More