Intro: I'm Will, a second year EECS undergrad, math enthusiast, and sailor. I'm interested in research about one and two-dimensional foil design (and hence fluid mechanics), self-supervised machine learning, and really anything interesting. I'm writing this for CS267 ("Applications of Parallel Computing"), a class I'm taking to see what it's like to write things that must run at scale and to harness the power of scientific computing.
A great example of this power is its use in large scale graph algorithms. Graph algorithms such as Breadth-First search is used in bioinformatics to determine the locality of areas of the brain, social network analysis, recommender systems, any many other applications. It is a simple algorithm used to determine the number of edges from one vertex to another, its depth in the BFS tree. While the algorithm is simple and the naive implementation has a fairly good runtime, this algorithm is commonly used on the scale of billions (or even trillions) of vertices in a graph. Thus, performance and parallelization is key.
A common practice in graph algorithms is to use a standard node-edge based model for computation. While this has many different representations, the model in which computational steps stays very similar. I stumbled upon this PhD showcase from SC18 entitled "Linear Algebra is the Right Way to Think About Graphs". It argues that, instead of a vertex-edge based representation of graph computations, linear algebra-based methods are more concise, expressible, portable, and potentially more performant than traditional models.
Each step of a graph traversal can be cleverly represented as a sparse matrix-vector multiplication, where the vector represents nodes to be traversed and the matrix represents the edges in the graph:
by Carl Yang (UC Davis; LBNL)
Here each $row,column$ entry in the matrix represents an edge from $column \rightarrow row$
By combining this concept with Push-Pull variant of BFS presented in Direction-Optimizing Breadth-First Search, the potential for fast, parallel BFS algorithms arises. One nice side-effect of this model is that Push and Pull are duals of each other. Push represents the column $i$ as the nodes which vertex $i$ goes to and row $j$ as the nodes which go into vertex $j$. Thus, to implement pull, we need only swap the row-column representation and run the same algorithm!
This formulation does require some additional optimizations relative to the standard formulations however. As the place we start from is a single point in a massive vector, we have both a sparse matrix and a sparse vector. Since modern SpMV algorithms are not optimized for a sparse vector, what we need is a sparse matrix sparse vector multiply (SpMSpV) algorithm, which Fast Sparse Matrix and Sparse Vector Multiplication Algorithm on the GPU implements quite cleverly and in a similar vein to many SpMV algorithms.
Additionally, once a vertex is reached, there is no need to explore it. Using masking techniques which skip some areas of computation available in hardware level SpMV techniques or in algorithmic tweaks , we can further reduce the amount of work required for the BFS in the later stages of its search and approach the efficiency of standard graph computation methods.
The performance of these linear algebra single-gpu based algorithms are mixed. They rival similar algorithms 2-CPU performance, but do not currently scale to multiple GPUs/multiple machines. In order for these linear algebra-based graph algorithms to be truly useful, they must be able to scale to multiple GPUs and multiple systems. There are GPU-based alternatives, however, such as Gunrock , which scale to multiple GPUs and multiple machines. Although these do not process data in a linear algebraic way, they are still quite fast. These techniques still acheive nearly the performance of libraries such as Gunrock in most areas though:
The main advantage of centering an algorithm around these common linear algebra routines is that if a fast vendor library or a new algorithm for arises for the routine which a graph algorithm depends on, the algorithm naturally runs faster. The portability of these methods is their true value and if they can be parallelized from a single Shared Memory machine (like a gpu) to Distributed Memory systems (like a top500 supercomputer), they will see their maximum potential.