Accelerating Deep Neural Networks at Datacenter Scale
with the BrainWave Architecture
Author/Presenter
Event Type: Workshop
Time: Friday, November 17th, 10:30am - 11:30am
Location: 402-403-404
Description: In the last several years, advances in deep neural networks (DNNs) have led to many breakthroughs in machine learning, spanning diverse fields such as computer vision, speech recognition, and natural language processing. Since then, the size and complexity of DNNs have significantly outpaced the growth of CPU performance, leading to an explosion of DNN accelerators.
Recently, major cloud vendors have turned to ASICs and FPGAs to accelerate DNN serving in latency-sensitive online services. While significant attention has been devoted to deep convolutional neural networks (CNNs), much less has been given to the more difficult problems of memory-intensive Multilayer Perceptrons (MLPs) and Recurrent Neural Networks (RNNs).
Improving the performance of memory-intensive DNNs (e.g., LSTMs) can be approached in several ways. Batching is one popular technique for driving up device utilization, but it can harm tail latencies (e.g., the 99.9th percentile) in the context of online services. Increasing off-chip bandwidth is another option, but it incurs high cost and power, and still may not be sufficient to saturate an accelerator with tens or even hundreds of TFLOPS of raw performance.
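The tension between batching and bandwidth can be made concrete with a back-of-envelope roofline calculation. The sketch below is illustrative only: the device numbers (10 TFLOPS peak, 100 GB/s off-chip bandwidth) and the LSTM dimensions are hypothetical, not figures from the talk.

```python
# Back-of-envelope roofline for a batch-b LSTM step.
# All device numbers and model sizes here are hypothetical.

def lstm_arith_intensity(hidden, inputs, batch, bytes_per_weight=4):
    """FLOPs per byte of off-chip weight traffic for one LSTM step.

    An LSTM step multiplies the concatenated (input, state) vector by
    four gate matrices: ~4*h*(h+d) weights, 2 FLOPs per weight per
    batch element. With no reuse across steps, each weight is read once.
    """
    weights = 4 * hidden * (hidden + inputs)
    flops = 2 * weights * batch
    bytes_moved = weights * bytes_per_weight
    return flops / bytes_moved

# Hypothetical accelerator: 10 TFLOPS peak, 100 GB/s off-chip bandwidth.
peak_flops = 10e12
bandwidth = 100e9
needed_intensity = peak_flops / bandwidth  # 100 FLOPs per byte

# batch=1 inference: only 0.5 FLOPs/byte -- far below saturation.
print(lstm_arith_intensity(hidden=1024, inputs=1024, batch=1))  # 0.5
# Batch needed before compute (not memory) becomes the bottleneck:
print(needed_intensity / lstm_arith_intensity(1024, 1024, 1))   # 200.0
```

Under these assumed numbers, a single request runs the device at a fraction of a percent of peak, and only a batch of hundreds saturates it, which is exactly the batch size that inflates tail latency in an online service.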
In this work, we present BrainWave, a scalable, distributed DNN acceleration architecture that enables inference on MLPs and RNNs at near-peak device efficiency without the need for input batching. The BrainWave architecture is powered by FPGAs and reduces latency 10-100X relative to well-tuned software.
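One way a distributed design can sidestep the off-chip bandwidth wall at batch size 1 is to partition a model's weights across the on-chip memories of a pool of devices, so that matrix-vector work streams at on-chip rather than DDR bandwidth. The abstract does not spell out BrainWave's mechanism, so the sketch below is a generic sizing exercise with made-up parameters (16-bit weights, 8 MB of usable SRAM per device), not a description of the actual system.

```python
import math

def devices_to_pin_weights(total_params, bytes_per_param, sram_bytes_per_device):
    """Devices needed to keep every weight resident in on-chip SRAM.

    Once the whole model fits across the pool, no weight ever crosses
    the off-chip memory interface during inference.
    """
    return math.ceil(total_params * bytes_per_param / sram_bytes_per_device)

# Hypothetical model: a 4-layer LSTM with hidden size 2048
# (~4 gate matrices of 2048 x 4096 weights per layer).
params = 4 * 4 * 2048 * 4096

# 16-bit weights, 8 MB of usable on-chip SRAM per device (assumed).
print(devices_to_pin_weights(params, 2, 8 * 2**20))  # 32
```

The trade-off is that a single request's work is now spread thinly across many devices, which is why such a design needs a low-latency interconnect between them to keep end-to-end latency down.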