Data Reduction and Partitioning in an Extreme Scale
GPU-Based Clustering Algorithm
Author/Presenter
Event Type
Workshop
TimeFriday, November 17th9:30am -
9:50am
Location302-303
DescriptionThe emergence of leadership-class systems with
GPU-equipped nodes has the potential to vastly increase
the performance of existing distributed applications. An
increasing number of applications that are converted to
run on these systems are reliant on algorithms that
perform computations on spatial data. Algorithms that
operate on spatial data, such as density-based
clustering algorithms, present unique challenges in data
partitioning and result aggregation when porting to
extreme scale environments. These algorithms require
that spatially-near data points are partitioned to the
same node and that the output from individual nodes
needs to be aggregated to ensure that relationships
between partition boundaries are discovered. In the
development of an extreme scale density-based clustering
algorithm, called Mr. Scan, we leveraged the increased
computational power provided by GPUs to overcome these
challenges. Three main techniques allowed Mr. Scan to
cluster 6.5 billion points from the Twitter dataset
using 8,192 GPU nodes on Cray Titan in 7.5 minutes: (1)
a data partitioner scheme that ensures correctness by
using shadow regions to selectively duplicate
computation between nodes to detect clusters that lie
in-between partition boundaries, (2) a dense box
algorithm that reduces the computational complexity of
clustering dense spatial regions, and (3) a new tree
based method for merging results with spatial
dependencies that requires only a fraction of the
intermediate results to produce an accurate final
result. The pure computational power of the GPUs allowed
us to effectively summarize the initial data, making the
scalable use of these techniques possible.
Author/Presenter




