Low Communication FMM-Accelerated FFT on GPUs
Author
Event Type: Paper
Time: Thursday, November 16th, 2pm - 2:30pm
Location: 405-406-407
Description: Communication-avoiding algorithms have been the subject
of growing interest in the last decade due to the growth
of distributed memory systems and the disproportionate
increase of computational throughput relative to communication
bandwidth. For distributed 1D FFTs, communication costs
quickly dominate execution time, as all industry-standard
implementations perform three all-to-all transpositions
of the data.
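
To illustrate where three transpositions can arise, the following is a minimal single-process sketch of the textbook "six-step" decomposition of a 1D FFT of length N = n1 * n2. It is an assumption that this is the scheme the description refers to; the function names (dft, transpose, six_step_fft) are hypothetical, and a naive O(n^2) DFT stands in for a real FFT library call. In a distributed setting, each transpose below would correspond to an all-to-all exchange between devices.

```python
import cmath


def dft(v):
    """Naive DFT of a list of complex numbers (stand-in for a library FFT)."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """Length n1*n2 DFT of x computed via three transposes and two batches of short DFTs."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 row-major matrix: a[r][c] = x[n2 * r + c].
    a = [x[r * n2:(r + 1) * n2] for r in range(n1)]
    b = transpose(a)                      # transpose 1: n2 x n1
    c = [dft(row) for row in b]           # n2 DFTs of length n1
    # Twiddle factors: multiply entry (i, k) by exp(-2*pi*i * i*k / n).
    c = [
        [c[i][k] * cmath.exp(-2j * cmath.pi * i * k / n) for k in range(n1)]
        for i in range(n2)
    ]
    d = transpose(c)                      # transpose 2: n1 x n2
    e = [dft(row) for row in d]           # n1 DFTs of length n2
    f = transpose(e)                      # transpose 3: n2 x n1, natural output order
    return [value for row in f for value in row]
```

For example, six_step_fft(x, 4, 8) on a 32-element input should agree with dft(x) up to floating-point rounding. The sketch does not show the FMM-based reformulation itself, only the baseline communication pattern it aims to reduce.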
In this work, we reformulate an existing algorithm that employs the Fast Multipole Method to reduce the communication requirements to approximately a single all-to-all transpose. We present a detailed and clear implementation strategy that relies on existing library primitives, demonstrate that this implementation achieves consistent speed-ups between 1.3x and 2.2x against cuFFTXT on 2xP100 and 8xP100 GPUs, and develop an accurate compute model to analyze the performance dependencies.