More stories

  • Learning to grow machine-learning models

    It’s no secret that OpenAI’s ChatGPT has some incredible capabilities — for instance, the chatbot can write poetry that resembles Shakespearean sonnets or debug code for a computer program. These abilities are made possible by the massive machine-learning model that ChatGPT is built upon. Researchers have found that when these types of models become large enough, extraordinary capabilities emerge.

    But bigger models also require more time and money to train. The training process involves showing hundreds of billions of examples to a model. Gathering so much data is an involved process in itself. Then come the monetary and environmental costs of running many powerful computers for days or weeks to train a model that may have billions of parameters. 

    “It’s been estimated that training models at the scale of what ChatGPT is hypothesized to run on could take millions of dollars, just for a single training run. Can we improve the efficiency of these training methods, so we can still get good models in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained,” says Yoon Kim, an assistant professor in MIT’s Department of Electrical Engineering and Computer Science and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

    Rather than discarding a previous version of a model, Kim and his collaborators use it as the building blocks for a new model. Using machine learning, their method learns to “grow” a larger model from a smaller model in a way that encodes knowledge the smaller model has already gained. This enables faster training of the larger model.

    Their technique saves about 50 percent of the computational cost required to train a large model, compared to methods that train a new model from scratch. Plus, the models trained using the MIT method performed as well as, or better than, models trained with other techniques that also use smaller models to enable faster training of larger models.

    Reducing the time it takes to train huge models could help researchers make advancements faster with less expense, while also reducing the carbon emissions generated during the training process. It could also enable smaller research groups to work with these massive models, potentially opening the door to many new advances.

    “As we look to democratize these types of technologies, making training faster and less expensive will become more important,” says Kim, senior author of a paper on this technique.

    Kim and his graduate student Lucas Torroba Hennigen wrote the paper with lead author Peihao Wang, a graduate student at the University of Texas at Austin, as well as others at the MIT-IBM Watson AI Lab and Columbia University. The research will be presented at the International Conference on Learning Representations.

    The bigger the better

    Large language models like GPT-3, which is at the core of ChatGPT, are built using a neural network architecture called a transformer. A neural network, loosely based on the human brain, is composed of layers of interconnected nodes, or “neurons.” Each neuron contains parameters, which are variables learned during the training process that the neuron uses to process data.

    Transformer architectures are unique because, as these types of neural network models get bigger, they achieve much better results.

    “This has led to an arms race of companies trying to train larger and larger transformers on larger and larger datasets. More so than other architectures, it seems that transformer networks get much better with scaling. We’re just not exactly sure why this is the case,” Kim says.

    These models often have hundreds of millions or billions of learnable parameters. Training all these parameters from scratch is expensive, so researchers seek to accelerate the process.

    One effective technique is known as model growth. Using the model growth method, researchers can increase the size of a transformer by copying neurons, or even entire layers of a previous version of the network, then stacking them on top. They can make a network wider by adding new neurons to a layer or make it deeper by adding additional layers of neurons.
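
    As an illustration of this copy-based growth (a generic sketch, not the specific methods compared in the paper), the snippet below widens a toy weight matrix by duplicating neurons and deepens a network by stacking copied layers; the rescaling needed to preserve the network's function is noted but omitted.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def widen(W, new_width):
        """Widen a square weight matrix by copying rows (output neurons) and
        columns (input neurons). In practice the copies are rescaled so the
        widened layer computes the same function; that detail is omitted here."""
        out_dim, in_dim = W.shape
        rows = np.concatenate([np.arange(out_dim), rng.integers(0, out_dim, new_width - out_dim)])
        cols = np.concatenate([np.arange(in_dim), rng.integers(0, in_dim, new_width - in_dim)])
        return W[np.ix_(rows, cols)]

    def deepen(layers, new_depth):
        """Deepen a network (a list of weight matrices) by stacking copied layers."""
        grown = list(layers)
        while len(grown) < new_depth:
            grown.append(grown[len(grown) % len(layers)].copy())
        return grown

    # Grow a 2-layer, 4-unit toy network into a 3-layer, 6-unit one.
    small = [rng.normal(size=(4, 4)) for _ in range(2)]
    large = deepen([widen(W, 6) for W in small], 3)
    print([W.shape for W in large])   # [(6, 6), (6, 6), (6, 6)]
    ```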

    In contrast to previous approaches for model growth, parameters associated with the new neurons in the expanded transformer are not just copies of the smaller network’s parameters, Kim explains. Rather, they are learned combinations of the parameters of the smaller model.

    Learning to grow

    Kim and his collaborators use machine learning to learn a linear mapping of the parameters of the smaller model. This linear map is a mathematical operation that transforms a set of input values, in this case the smaller model’s parameters, to a set of output values, in this case the parameters of the larger model.

    Their method, which they call a learned Linear Growth Operator (LiGO), learns to expand the width and depth of a larger network from the parameters of a smaller network in a data-driven way.

    But the smaller model may actually be quite large — perhaps it has a hundred million parameters — and researchers might want to make a model with a billion parameters. So the LiGO technique breaks the linear map into smaller pieces that a machine-learning algorithm can handle.
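
    Below is a hedged, simplified sketch of the idea of a learned linear growth operator: instead of one enormous dense map over all parameters, small width-expansion factors (A, B) and a depth-mixing matrix (D) are learned, here by gradient descent on a stand-in objective. Shapes, names, and the training objective are illustrative, not the paper's exact formulation.

    ```python
    import torch

    d_small, d_large = 4, 8       # hidden sizes of the small and grown models
    L_small, L_large = 2, 4       # depths of the small and grown models

    # Pretrained small-model weights (random stand-ins here).
    small_W = [torch.randn(d_small, d_small) for _ in range(L_small)]

    # Learnable pieces of the growth operator.
    A = torch.randn(d_large, d_small, requires_grad=True)   # expands output width
    B = torch.randn(d_large, d_small, requires_grad=True)   # expands input width
    D = torch.randn(L_large, L_small, requires_grad=True)   # mixes layers (depth)

    def grow():
        """Build large-model weights as learned linear combinations of small ones."""
        stacked = torch.stack(small_W)                       # (L_small, d_s, d_s)
        mixed = torch.einsum('ls,sij->lij', D, stacked)      # depth expansion
        return torch.einsum('oi,lij,pj->lop', A, mixed, B)   # width expansion

    # The factors would be trained so the grown network fits the data well;
    # here a dummy objective just checks that everything is differentiable.
    opt = torch.optim.Adam([A, B, D], lr=1e-2)
    x = torch.randn(16, d_large)
    for _ in range(10):
        h = x
        for W in grow():                                     # run the grown toy MLP
            h = torch.tanh(h @ W.T)
        loss = h.pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    print(grow().shape)                                      # torch.Size([4, 8, 8])
    ```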

    LiGO also expands width and depth simultaneously, which makes it more efficient than other methods. A user can tune how wide and deep they want the larger model to be when they input the smaller model and its parameters, Kim explains.

    When they compared their technique to the process of training a new model from scratch, as well as to model-growth methods, it was faster than all the baselines. Their method saves about 50 percent of the computational costs required to train both vision and language models, while often improving performance.

    The researchers also found they could use LiGO to accelerate transformer training even when they didn’t have access to a smaller, pretrained model.

    “I was surprised by how much better all the methods, including ours, did compared to the random initialization, train-from-scratch baselines,” Kim says.

    In the future, Kim and his collaborators plan to apply LiGO to even larger models.

    The work was funded, in part, by the MIT-IBM Watson AI Lab, Amazon, the IBM Research AI Hardware Center, the Center for Computational Innovation at Rensselaer Polytechnic Institute, and the U.S. Army Research Office.

  • A new chip for decoding data transmissions demonstrates record-breaking energy efficiency

    Imagine using an online banking app to deposit money into your account. Like all information sent over the internet, those communications could be corrupted by noise that inserts errors into the data.

    To overcome this problem, senders encode data before they are transmitted, and then a receiver uses a decoding algorithm to correct errors and recover the original message. In some instances, data are received with reliability information that helps the decoder figure out which parts of a transmission are likely errors.

    Researchers at MIT and elsewhere have developed a decoder chip that employs a new statistical model to use this reliability information in a way that is much simpler and faster than conventional techniques.

    Their chip uses a universal decoding algorithm the team previously developed, which can unravel any error-correcting code. Typically, decoding hardware can only process one particular type of code. This new, universal decoder chip has broken the record for energy-efficient decoding, performing between 10 and 100 times better than other hardware.

    This advance could enable mobile devices with fewer chips, since they would no longer need separate hardware for multiple codes. This would reduce the amount of material needed for fabrication, cutting costs and improving sustainability. By making the decoding process less energy intensive, the chip could also improve device performance and lengthen battery life. It could be especially useful for demanding applications like augmented and virtual reality and 5G networks.

    “This is the first time anyone has broken below the 1 picojoule-per-bit barrier for decoding. That is roughly the same amount of energy you need to transmit a bit inside the system. It had been a big symbolic threshold, but it also changes the balance in the receiver of what might be the most pressing part from an energy perspective — we can move that away from the decoder to other elements,” says Muriel Médard, the School of Science NEC Professor of Software Science and Engineering, a professor in the Department of Electrical Engineering and Computer Science, and a co-author of a paper presenting the new chip.

    Médard’s co-authors include lead author Arslan Riaz, a graduate student at Boston University (BU); Rabia Tugce Yazicigil, assistant professor of electrical and computer engineering at BU; and Ken R. Duffy, then director of the Hamilton Institute at Maynooth University and now a professor at Northeastern University, as well as others from MIT, BU, and Maynooth University. The work is being presented at the International Solid-State Circuits Conference.

    Smarter sorting

    Digital data are transmitted over a network in the form of bits (0s and 1s). A sender encodes data by adding an error-correcting code, which is a redundant string of 0s and 1s that can be viewed as a hash. Information about this hash is held in a specific code book. A decoding algorithm at the receiver, designed for this particular code, uses its code book and the hash structure to retrieve the original information, which may have been jumbled by noise. Since each algorithm is code-specific, and most require dedicated hardware, a device would need many chips to decode different codes.
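
    As a concrete, deliberately tiny illustration of the code book idea (a generic linear code, not the codes targeted by the chip), the sketch below encodes a 4-bit message into a 7-bit code word and checks membership in the code book with a parity-check computation.

    ```python
    import numpy as np

    # A tiny (7,4) linear block code in systematic form: G encodes, H checks.
    P = np.array([[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]])
    G = np.hstack([np.eye(4, dtype=int), P])          # generator matrix
    H = np.hstack([P.T, np.eye(3, dtype=int)])        # parity-check matrix ("code book")

    def encode(msg):
        return msg @ G % 2                             # append redundant parity bits

    def in_codebook(word):
        return not np.any(H @ word % 2)                # zero syndrome => valid code word

    msg = np.array([1, 0, 1, 1])
    codeword = encode(msg)
    noisy = codeword.copy(); noisy[2] ^= 1             # channel noise flips one bit
    print(codeword, in_codebook(codeword))             # ... True
    print(noisy, in_codebook(noisy))                   # ... False
    ```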

    The researchers previously demonstrated GRAND (Guessing Random Additive Noise Decoding), a universal decoding algorithm that can crack any code. GRAND works by guessing the noise that affected the transmission, subtracting that noise pattern from the received data, and then checking what remains in a code book. It guesses a series of noise patterns in the order they are likely to occur.
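
    A minimal software sketch of the guessing loop just described, reusing the toy code above: noise patterns are tried from most likely to least likely (fewest bit flips first, as for a channel that flips bits rarely), each guess is subtracted from the received word, and the first result found in the code book is returned. This illustrates the idea only; it is not the hardware implementation.

    ```python
    from itertools import combinations

    def grand_decode(received, max_weight=3):
        """Guess noise patterns from most to least likely (fewest flips first)."""
        n = len(received)
        for w in range(max_weight + 1):
            for flips in combinations(range(n), w):
                guess = received.copy()
                guess[list(flips)] ^= 1                # subtract the guessed noise
                if in_codebook(guess):                 # membership test in the code book
                    return guess                       # first hit is the best guess
        return None                                    # give up (declare an erasure)

    print(grand_decode(noisy))                         # recovers the original code word
    ```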

    Data are often received with reliability information, also called soft information, that helps a decoder figure out which pieces are errors. The new decoding chip, called ORBGRAND (Ordered Reliability Bits GRAND), uses this reliability information to sort data based on how likely each bit is to be an error.

    But it isn’t as simple as ordering single bits. While the most unreliable bit might be the likeliest error, perhaps the third and fourth most unreliable bits together are as likely to be an error as the seventh-most unreliable bit. ORBGRAND uses a new statistical model that can sort bits in this fashion, considering that multiple bits together are as likely to be an error as some single bits.
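
    To show how soft information changes the guessing order, here is a simplified, hypothetical version of reliability-ordered guessing, again reusing the toy code above: bits are ranked from least to most reliable, and flip patterns are tried in increasing order of the summed ranks of the flipped bits, so flipping the two least reliable bits can be tried before flipping one fairly reliable bit. This captures the spirit of the ordering described in the article, not the chip's exact schedule.

    ```python
    import numpy as np
    from itertools import combinations

    def orbgrand_decode(received, reliability, max_weight=3):
        """Order guesses by total 'unreliability' instead of just flip count."""
        n = len(received)
        order = np.argsort(reliability)                    # least reliable bit first
        rank = {bit: r + 1 for r, bit in enumerate(order)} # rank 1 = least reliable
        patterns = []
        for w in range(1, max_weight + 1):
            for flips in combinations(range(n), w):
                patterns.append((sum(rank[b] for b in flips), flips))
        patterns.sort()                                    # smallest rank-sum first
        if in_codebook(received):                          # the zero-flip guess
            return received
        for _, flips in patterns:
            guess = received.copy()
            guess[list(flips)] ^= 1
            if in_codebook(guess):
                return guess
        return None

    # Reliabilities (e.g. magnitudes of soft channel outputs); bit 2 is least trustworthy.
    reliability = np.array([2.0, 1.9, 0.1, 1.7, 2.2, 1.5, 0.3])
    print(orbgrand_decode(noisy, reliability))
    ```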

    “If your car isn’t working, soft information might tell you that it is probably the battery. But if it isn’t the battery alone, maybe it is the battery and the alternator together that are causing the problem. This is how a rational person would troubleshoot — you’d say that it could actually be these two things together before going down the list to something that is much less likely,” Médard says.

    This is a much more efficient approach than that of traditional decoders, which instead look at the code structure and have performance that is generally designed for the worst case.

    “With a traditional decoder, you’d pull out the blueprint of the car and examine each and every piece. You’ll find the problem, but it will take you a long time and you’ll get very frustrated,” Médard explains.

    ORBGRAND stops sorting as soon as a code word is found, which is often very soon. The chip also employs parallelization, generating and testing multiple noise patterns simultaneously so it finds the code word faster. Because the decoder stops working once it finds the code word, its energy consumption stays low even though it runs multiple processes simultaneously.

    Record-breaking efficiency

    When they compared their approach to other chips, ORBGRAND decoded with maximum accuracy while consuming only 0.76 picojoules of energy per bit, breaking the previous performance record. ORBGRAND consumes between 10 and 100 times less energy than other devices.

    One of the biggest challenges of developing the new chip came from this reduced energy consumption, Médard says. With ORBGRAND, generating noise sequences is now so energy-efficient that other processes the researchers hadn’t focused on before, like checking the code word in a code book, consume most of the effort.

    “Now, this checking process, which is like turning on the car to see if it works, is the hardest part. So, we need to find more efficient ways to do that,” she says.

    The team is also exploring ways to change the modulation of transmissions so they can take advantage of the improved efficiency of the ORBGRAND chip. They also plan to see how their technique could be utilized to more efficiently manage multiple transmissions that overlap.

    The research is funded, in part, by the U.S. Defense Advanced Research Projects Agency (DARPA) and Science Foundation Ireland.

  • Busy GPUs: Sampling and pipelining method speeds up deep learning on large graphs

    Graphs, potentially extensive webs of nodes connected by edges, can be used to express and interrogate relationships in data, like social connections, financial transactions, traffic, energy grids, and molecular interactions. As researchers collect more data and build out these graphical pictures, they will need faster and more efficient methods, as well as more computational power, to conduct deep learning on them in the form of graph neural networks (GNNs).

    Now, a new method, called SALIENT (SAmpling, sLIcing, and data movemeNT), developed by researchers at MIT and IBM Research, improves training and inference performance by addressing three key bottlenecks in computation. This dramatically cuts down on the runtime of GNNs on large datasets, which, for example, contain on the scale of 100 million nodes and 1 billion edges. Further, the team found that the technique scales well when computational power is added, from one to 16 graphics processing units (GPUs). The work was presented at the Fifth Conference on Machine Learning and Systems.

    “We started to look at the challenges current systems experienced when scaling state-of-the-art machine learning techniques for graphs to really big datasets. It turned out there was a lot of work to be done, because a lot of the existing systems were achieving good performance primarily on smaller datasets that fit into GPU memory,” says Tim Kaler, the lead author and a postdoc in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).

    By vast datasets, experts mean scales like the entire Bitcoin network, where certain patterns and data relationships could spell out trends or foul play. “There are nearly a billion Bitcoin transactions on the blockchain, and if we want to identify illicit activities inside such a joint network, then we are facing a graph of such a scale,” says co-author Jie Chen, senior research scientist and manager of IBM Research and the MIT-IBM Watson AI Lab. “We want to build a system that is able to handle that kind of graph and allows processing to be as efficient as possible, because every day we want to keep up with the pace of the new data that are generated.”

    Kaler and Chen’s co-authors include Nickolas Stathas MEng ’21 of Jump Trading, who developed SALIENT as part of his graduate work; former MIT-IBM Watson AI Lab intern and MIT graduate student Anne Ouyang; MIT CSAIL postdoc Alexandros-Stavros Iliopoulos; MIT CSAIL Research Scientist Tao B. Schardl; and Charles E. Leiserson, the Edwin Sibley Webster Professor of Electrical Engineering at MIT and a researcher with the MIT-IBM Watson AI Lab.     

    In developing their method, SALIENT, the team took a systems-oriented approach, says Kaler. The researchers implemented what they saw as important, basic optimizations of components that fit into existing machine-learning frameworks, such as PyTorch Geometric and the Deep Graph Library (DGL), which are interfaces for building machine-learning models. Stathas says the process is like swapping out engines to build a faster car. Their method was designed to fit into existing GNN architectures, so that domain experts could easily apply this work to their own fields to expedite model training and tease out insights during inference faster. The trick, the team determined, was to keep all of the hardware (CPUs, data links, and GPUs) busy at all times: while the CPU samples the graph and prepares mini-batches of data that will then be transferred through the data link, the more critical GPU is working to train the machine-learning model or conduct inference.
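
    To make the keep-everything-busy idea concrete, here is a generic producer-consumer sketch (not SALIENT's code): CPU threads keep a small queue of prepared mini-batches full while the GPU drains it, so neither side sits idle. The batch-preparation function and the model are placeholders.

    ```python
    import queue
    import threading
    import torch

    def cpu_prepare_batch(i):
        """Placeholder for CPU work: sample a subgraph and slice its features."""
        return torch.randn(1024, 128), torch.randint(0, 10, (1024,))

    def producer(q, num_batches):
        for i in range(num_batches):
            q.put(cpu_prepare_batch(i))   # blocks whenever the buffer is full
        q.put(None)                       # sentinel: no more batches

    def train(num_batches=100):
        device = "cuda" if torch.cuda.is_available() else "cpu"
        model = torch.nn.Linear(128, 10).to(device)
        opt = torch.optim.SGD(model.parameters(), lr=0.01)
        q = queue.Queue(maxsize=4)        # small buffer of ready-to-go mini-batches
        threading.Thread(target=producer, args=(q, num_batches), daemon=True).start()
        while (item := q.get()) is not None:
            x, y = (t.to(device, non_blocking=True) for t in item)
            loss = torch.nn.functional.cross_entropy(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

    train()
    ```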

    The researchers began by analyzing the performance of a commonly used machine-learning library for GNNs (PyTorch Geometric), which showed a startlingly low utilization of available GPU resources. Applying simple optimizations, the researchers improved GPU utilization from 10 to 30 percent, resulting in a 1.4 to two times performance improvement relative to public benchmark codes. This fast baseline code could execute one complete pass over a large training dataset through the algorithm (an epoch) in 50.4 seconds.                          

    Seeking further performance improvements, the researchers set out to examine the bottlenecks that occur at the beginning of the data pipeline: the algorithms for graph sampling and mini-batch preparation. Unlike other neural networks, GNNs perform a neighborhood aggregation operation, which computes information about a node using information present in other nearby nodes in the graph — for example, in a social network graph, information from friends of friends of a user. As the number of layers in the GNN increases, the number of nodes the network has to reach out to for information can explode, exceeding the limits of a computer. Neighborhood sampling algorithms help by selecting a smaller random subset of nodes to gather; however, the researchers found that current implementations of this were too slow to keep up with the processing speed of modern GPUs. In response, they identified a mix of data structures, algorithmic optimizations, and so forth that improved sampling speed, ultimately improving the sampling operation alone by about three times, taking the per-epoch runtime from 50.4 to 34.6 seconds. They also found that sampling, at an appropriate rate, can be done during inference, improving overall energy efficiency and performance, a point that had been overlooked in the literature, the team notes.
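
    The sketch below is an illustrative, self-contained version of fanout-limited neighborhood sampling for a mini-batch, in the spirit of the samplers discussed here (libraries such as PyTorch Geometric provide optimized implementations): each GNN layer draws at most a fixed number of neighbors per node, so the computation does not explode with depth.

    ```python
    import random

    def sample_neighborhood(adj, seed_nodes, fanouts):
        """adj: dict node -> list of neighbors; fanouts: neighbors to keep per layer."""
        layers, frontier = [], set(seed_nodes)
        for fanout in fanouts:                         # one hop per GNN layer
            sampled_edges, next_frontier = [], set()
            for node in frontier:
                neighbors = adj.get(node, [])
                picked = random.sample(neighbors, min(fanout, len(neighbors)))
                sampled_edges += [(nbr, node) for nbr in picked]
                next_frontier |= set(picked)
            layers.append(sampled_edges)
            frontier = next_frontier                   # expand outward, bounded by fanout
        return layers

    # Toy graph: with fanouts (2, 2), each seed pulls in at most 2 + 2*2 = 6 neighbors.
    adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 5], 3: [0], 4: [0, 5], 5: [2, 4]}
    print(sample_neighborhood(adj, seed_nodes=[0], fanouts=(2, 2)))
    ```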

    In previous systems, this sampling step was a multi-process approach, creating extra data and unnecessary data movement between the processes. The researchers made their SALIENT method more nimble by creating a single process with lightweight threads that kept the data on the CPU in shared memory. Further, SALIENT takes advantage of a cache of modern processors, says Stathas, parallelizing feature slicing, which extracts relevant information from nodes of interest and their surrounding neighbors and edges, within the shared memory of the CPU core cache. This again reduced the overall per-epoch runtime from 34.6 to 27.8 seconds.

    The last bottleneck the researchers addressed was to pipeline mini-batch data transfers between the CPU and GPU using a prefetching step, which would prepare data just before it’s needed. The team calculated that this would maximize bandwidth usage in the data link and bring the method up to perfect utilization; however, they only saw around 90 percent. They identified and fixed a performance bug in a popular PyTorch library that caused unnecessary round-trip communications between the CPU and GPU. With this bug fixed, the team achieved a 16.5 second per-epoch runtime with SALIENT.
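
    Here is a hedged sketch of the prefetching step in generic PyTorch terms (not the library code that was fixed): the next mini-batch is copied to the GPU from pinned host memory, with non-blocking transfers on a separate CUDA stream, while the current batch is still being processed, keeping the data link busy.

    ```python
    import torch

    def prefetching_loader(batches, device):
        """Yield GPU batches, overlapping the copy of batch i+1 with compute on batch i."""
        stream = torch.cuda.Stream(device)
        def to_gpu(batch):
            with torch.cuda.stream(stream):            # issue the copy on a side stream
                return [t.pin_memory().to(device, non_blocking=True) for t in batch]
        it = iter(batches)
        nxt = to_gpu(next(it))
        for batch in it:
            torch.cuda.current_stream(device).wait_stream(stream)  # copy must finish
            current, nxt = nxt, to_gpu(batch)          # start copying the next batch
            yield current
        torch.cuda.current_stream(device).wait_stream(stream)
        yield nxt

    # Usage (requires a CUDA device):
    # for x, y in prefetching_loader(cpu_batches, torch.device("cuda")):
    #     train_step(x, y)
    ```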

    “Our work showed, I think, that the devil is in the details,” says Kaler. “When you pay close attention to the details that impact performance when training a graph neural network, you can resolve a huge number of performance issues. With our solutions, we ended up being completely bottlenecked by GPU computation, which is the ideal goal of such a system.”

    SALIENT’s speed was evaluated on three standard datasets: ogbn-arxiv, ogbn-products, and ogbn-papers100M, as well as in multi-machine settings, with different levels of fanout (the amount of data that the CPU would prepare for the GPU), and across several architectures, including the most recent state-of-the-art one, GraphSAGE-RI. In each setting, SALIENT outperformed PyTorch Geometric, most notably on the large ogbn-papers100M dataset, containing 100 million nodes and over a billion edges. Here, it was three times faster, running on one GPU, than the optimized baseline that was originally created for this work; with 16 GPUs, SALIENT was an additional eight times faster.

    While other systems had slightly different hardware and experimental setups, so it wasn’t always a direct comparison, SALIENT still outperformed them. Among systems that achieved similar accuracy, representative performance numbers include 99 seconds using one GPU and 32 CPUs, and 13 seconds using 1,536 CPUs. In contrast, SALIENT’s runtime using one GPU and 20 CPUs was 16.5 seconds and was just two seconds with 16 GPUs and 320 CPUs. “If you look at the bottom-line numbers that prior work reports, our 16 GPU runtime (two seconds) is an order of magnitude faster than other numbers that have been reported previously on this dataset,” says Kaler. The researchers attributed their performance improvements, in part, to their approach of optimizing their code for a single machine before moving to the distributed setting. Stathas says that the lesson here is that for your money, “it makes more sense to use the hardware you have efficiently, and to its extreme, before you start scaling up to multiple computers,” which can provide significant savings on cost and carbon emissions that can come with model training.

    This new capacity will now allow researchers to tackle and dig deeper into bigger and bigger graphs. For example, the Bitcoin network that was mentioned earlier contained 100,000 nodes; the SALIENT system can capably handle a graph 1,000 times (or three orders of magnitude) larger.

    “In the future, we would be looking at not just running this graph neural network training system on the existing algorithms that we implemented for classifying or predicting the properties of each node, but we also want to do more in-depth tasks, such as identifying common patterns in a graph (subgraph patterns), [which] may be actually interesting for indicating financial crimes,” says Chen. “We also want to identify nodes in a graph that are similar in a sense that they possibly would be corresponding to the same bad actor in a financial crime. These tasks would require developing additional algorithms, and possibly also neural network architectures.”

    This research was supported by the MIT-IBM Watson AI Lab and in part by the U.S. Air Force Research Laboratory and the U.S. Air Force Artificial Intelligence Accelerator.

  • Breaking the scaling limits of analog computing

    As machine-learning models become larger and more complex, they require faster and more energy-efficient hardware to perform computations. Conventional digital computers are struggling to keep up.

    An analog optical neural network could perform the same tasks as a digital one, such as image classification or speech recognition, but because computations are performed using light instead of electrical signals, optical neural networks can run many times faster while consuming less energy.

    However, these analog devices are prone to hardware errors that can make computations less precise. Microscopic imperfections in hardware components are one cause of these errors. In an optical neural network that has many connected components, errors can quickly accumulate.

    Even with error-correction techniques, due to fundamental properties of the devices that make up an optical neural network, some amount of error is unavoidable. A network that is large enough to be implemented in the real world would be far too imprecise to be effective.

    MIT researchers have overcome this hurdle and found a way to effectively scale an optical neural network. By adding a tiny hardware component to the optical switches that form the network’s architecture, they can reduce even the uncorrectable errors that would otherwise accumulate in the device.

    Their work could enable a super-fast, energy-efficient, analog neural network that can function with the same accuracy as a digital one. With this technique, as an optical circuit becomes larger, the amount of error in its computations actually decreases.  

    “This is remarkable, as it runs counter to the intuition of analog systems, where larger circuits are supposed to have higher errors, so that errors set a limit on scalability. This present paper allows us to address the scalability question of these systems with an unambiguous ‘yes,’” says lead author Ryan Hamerly, a visiting scientist in the MIT Research Laboratory of Electronics (RLE) and Quantum Photonics Laboratory and senior scientist at NTT Research.

    Hamerly’s co-authors are graduate student Saumil Bandyopadhyay and senior author Dirk Englund, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), leader of the Quantum Photonics Laboratory, and member of the RLE. The research is published today in Nature Communications.

    Multiplying with light

    An optical neural network is composed of many connected components that function like reprogrammable, tunable mirrors. These tunable mirrors are called Mach-Zehnder interferometers (MZIs). Neural network data are encoded into light, which is fired into the optical neural network from a laser.

    A typical MZI contains two mirrors and two beam splitters. Light enters the top of an MZI, where it is split into two parts which interfere with each other before being recombined by the second beam splitter and then reflected out the bottom to the next MZI in the array. Researchers can leverage the interference of these optical signals to perform complex linear algebra operations, known as matrix multiplication, which is how neural networks process data.
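
    For intuition about how these devices multiply by a matrix, the sketch below composes two ideal 50:50 beam splitters and two tunable phase shifters into the 2x2 transfer matrix of a single MZI and applies it to a vector of optical amplitudes. The matrix conventions are one common textbook choice and are only assumed here, not taken from the paper.

    ```python
    import numpy as np

    def beam_splitter(eta=np.pi / 4):
        """Lossless beam splitter; eta = pi/4 gives an ideal 50:50 split."""
        return np.array([[np.cos(eta), 1j * np.sin(eta)],
                         [1j * np.sin(eta), np.cos(eta)]])

    def phase_shift(theta):
        return np.diag([np.exp(1j * theta), 1.0])

    def mzi(theta, phi):
        """2x2 transfer matrix of a standard MZI: splitter, phase, splitter, phase."""
        return phase_shift(phi) @ beam_splitter() @ phase_shift(theta) @ beam_splitter()

    T = mzi(theta=0.7, phi=1.3)
    print(np.allclose(T.conj().T @ T, np.eye(2)))     # unitary: True
    amplitudes_in = np.array([1.0, 0.0])              # light enters the top port
    print(np.abs(T @ amplitudes_in) ** 2)             # output power split between ports
    ```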

    But errors that can occur in each MZI quickly accumulate as light moves from one device to the next. One can avoid some errors by identifying them in advance and tuning the MZIs so earlier errors are cancelled out by later devices in the array.

    “It is a very simple algorithm if you know what the errors are. But these errors are notoriously difficult to ascertain because you only have access to the inputs and outputs of your chip,” says Hamerly. “This motivated us to look at whether it is possible to create calibration-free error correction.”

    Hamerly and his collaborators previously demonstrated a mathematical technique that went a step further. They could successfully infer the errors and correctly tune the MZIs accordingly, but even this didn’t remove all the error.

    Due to the fundamental nature of an MZI, there are instances where it is impossible to tune a device so all light flows out the bottom port to the next MZI. If the device loses a fraction of light at each step and the array is very large, by the end there will only be a tiny bit of power left.

    “Even with error correction, there is a fundamental limit to how good a chip can be. MZIs are physically unable to realize certain settings they need to be configured to,” he says.

    So, the team developed a new type of MZI. The researchers added an additional beam splitter to the end of the device, calling it a 3-MZI because it has three beam splitters instead of two. Due to the way this additional beam splitter mixes the light, it becomes much easier for an MZI to reach the setting it needs to send all of the light out through its bottom port.
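
    A toy numerical experiment, reusing the conventions from the sketch above and assuming every splitter has the same exaggerated splitting error, illustrates the benefit: the standard two-splitter MZI can no longer route all of the light to its cross (bottom) port, while a three-splitter version with two tunable phases recovers essentially full cross-port transmission.

    ```python
    def max_cross_power(num_splitters, error, steps=200):
        """Best achievable power in the cross port, sweeping the internal phases."""
        bs = beam_splitter(np.pi / 4 + error)          # all splitters share the same error
        phases = np.linspace(0, 2 * np.pi, steps)
        best = 0.0
        if num_splitters == 2:                         # standard MZI: one internal phase
            for t1 in phases:
                M = bs @ phase_shift(t1) @ bs
                best = max(best, abs(M[1, 0]) ** 2)
        else:                                          # 3-MZI: two internal phases
            for t1 in phases:
                for t2 in phases:
                    M = bs @ phase_shift(t2) @ bs @ phase_shift(t1) @ bs
                    best = max(best, abs(M[1, 0]) ** 2)
        return best

    err = 0.2                                          # exaggerated splitting error (radians)
    print(max_cross_power(2, err))                     # ~0.85: some light stuck in the top port
    print(max_cross_power(3, err))                     # ~1.0: the extra splitter compensates
    ```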

    Importantly, the additional beam splitter is only a few micrometers in size and is a passive component, so it doesn’t require any extra wiring. Adding additional beam splitters doesn’t significantly change the size of the chip.

    Bigger chip, fewer errors

    When the researchers conducted simulations to test their architecture, they found that it can eliminate much of the uncorrectable error that hampers accuracy. And as the optical neural network becomes larger, the amount of error in the device actually drops — the opposite of what happens in a device with standard MZIs.

    Using 3-MZIs, they could potentially create a device big enough for commercial uses with error that has been reduced by a factor of 20, Hamerly says.

    The researchers also developed a variant of the MZI design specifically for correlated errors. These occur due to manufacturing imperfections — if the thickness of a chip is slightly wrong, the MZIs may all be off by about the same amount, so the errors are all about the same. They found a way to change the configuration of an MZI to make it robust to these types of errors. This technique also increased the bandwidth of the optical neural network so it can run three times faster.

    Now that they have showcased these techniques using simulations, Hamerly and his collaborators plan to test these approaches on physical hardware and continue driving toward an optical neural network they can effectively deploy in the real world.

    This research is funded, in part, by a National Science Foundation graduate research fellowship and the U.S. Air Force Office of Scientific Research.