EuroSys '24 Contigra: Graph Mining with Containment Constraints
Joanna Che, Kasra Jamshidi, and Keval Vora
European Conference on Computer Systems,
14 pages, Athens, Greece, April 2024.
PPoPP '24 OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics
Kasra Jamshidi and Keval Vora
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
Pages 94–108, Edinburgh, UK, March 2024.     [ Abstract ] [ Paper ]
OsirisBFT: Say No to Task Replication for Scalable Byzantine Fault Tolerant Analytics
We present a verification-based Byzantine Fault Tolerant processing system, called OsirisBFT, for distributed task-parallel applications. OsirisBFT treats computation tasks differently from state update tasks, allowing the application to scale independently of the number of expected failures. OsirisBFT captures application-specific verification semantics via generic verification operators and employs lightweight verification strategies with little coordination during graceful execution. Evaluation across multiple applications and workloads shows that OsirisBFT delivers high processing throughput and scalability compared to replicated processing. Importantly, the scalable nature of OsirisBFT enables it to reduce the performance gap compared to the baseline with no fault tolerance by simply scaling out.
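To make the idea of application-supplied verification concrete, the minimal Python sketch below shows one way a verification operator might check a worker's task output using invariants and spot checks that are cheaper than full re-execution. The function names and the specific checks are hypothetical illustrations, not OsirisBFT's API.

# Hypothetical sketch: verifying a task's output with a cheap application-specific
# check instead of re-executing the task on replicas. Names and structure are
# illustrative only; they are not OsirisBFT's actual API.
from collections import Counter

def compute_task(records):
    """Untrusted worker task: per-key counts over a partition of records."""
    return Counter(records)

def verify_wordcount(records, claimed):
    """Verification operator: accept the claimed result only if it satisfies
    application-specific invariants that are cheaper than full recomputation."""
    if sum(claimed.values()) != len(records):      # totals must match
        return False
    if any(c <= 0 for c in claimed.values()):      # counts must be positive
        return False
    sample = set(list(claimed)[:3])                # spot-check a few keys
    expected = Counter(r for r in records if r in sample)
    return all(claimed[k] == expected[k] for k in sample)

records = ["a", "b", "a", "c", "a", "b"]
honest = compute_task(records)
byzantine = Counter(honest); byzantine["a"] += 1   # a tampered result

print(verify_wordcount(records, honest))      # True
print(verify_wordcount(records, byzantine))   # False (totals mismatch)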
EuroSys '23 Accelerating Graph Mining Systems with Subgraph Morphing
Kasra Jamshidi, Guoqing Harry Xu, and Keval Vora
European Conference on Computer Systems,
Pages 162–181, Rome, Italy, May 2023.     [ Abstract ] [ Paper ] [ Presentation ]
Accelerating Graph Mining Systems with Subgraph Morphing
Graph mining applications analyze the structural properties of large graphs. These applications are computationally expensive because finding structural patterns requires checking subgraph isomorphism, which is NP-complete.

This paper exploits the sub-structural similarities across different patterns by employing Subgraph Morphing to accurately infer the results for a given set of patterns from the results of a completely different set of patterns that are less expensive to compute. To enable Subgraph Morphing in practice, we develop efficient query transformation techniques as well as automatic result conversion strategies for different application scenarios. We have implemented Subgraph Morphing in four state-of-the-art graph mining and subgraph matching systems: Peregrine, AutoMine/GraphZero, GraphPi, and BigJoin; a thorough evaluation demonstrates that Subgraph Morphing improves the performance of these four systems by 34x, 10x, 18x, and 13x, respectively.
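As a concrete illustration of converting results between related patterns, the Python sketch below infers the count of induced wedges (2-paths whose endpoints are not adjacent) from two cheaper quantities using the standard identity induced wedges = wedges − 3 × triangles. This only conveys the flavor of result conversion between pattern sets; it is not the paper's query transformation machinery.

# Illustration of result conversion in the spirit of Subgraph Morphing: the count
# of an expensive pattern (induced wedges, i.e., 2-paths whose endpoints are NOT
# adjacent) is inferred from two cheaper quantities: all wedges and triangles.
# This uses the standard identity  induced_wedges = wedges - 3 * triangles;
# it is an illustrative example, not the paper's transformation algorithm.
from itertools import combinations
from math import comb

edges = {(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

wedges = sum(comb(len(nbrs), 2) for nbrs in adj.values())          # all 2-paths
triangles = sum(1 for a, b, c in combinations(sorted(adj), 3)
                if b in adj[a] and c in adj[a] and c in adj[b])
inferred = wedges - 3 * triangles                                   # conversion

# Brute-force check: enumerate induced wedges directly.
direct = sum(1 for a, b, c in combinations(sorted(adj), 3)
             if [b in adj[a], c in adj[a], c in adj[b]].count(True) == 2)
print(inferred, direct)   # both print the same count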
GRADES-NDA '22 Anti-Vertex for Neighborhood Constraints in Subgraph Queries
Kasra Jamshidi, Mugilan Mariappan, and Keval Vora
Proceedings of the ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA),
Article No. 5, Pages 1-9, Philadelphia, Pennsylvania, June 2022.     [ Abstract ] [ Paper ]
Anti-Vertex for Neighborhood Constraints in Subgraph Queries
This paper focuses on subgraph queries where constraints are present in the neighborhood of the explored subgraphs. We describe anti-vertex, a declarative construct that indicates the absence of a vertex, i.e., the resulting subgraph should not have a vertex in its specified neighborhood that matches the anti-vertex. We formalize the semantics of anti-vertex to benefit from automatic reasoning and optimization, and to enable standardized implementation across query languages and runtimes. The semantics are defined for various matching semantics that are commonly employed in subgraph querying (isomorphism, homomorphism, and no-repeated-edge matching) and for the widely adopted property graph model. We illustrate several examples where anti-vertices can be employed to help familiarize readers with the anti-vertex concept. We further showcase how anti-vertex support can be added to existing graph query languages by developing prototype extensions of the Cypher language. Finally, we study how anti-vertices interact with the symmetry breaking technique in subgraph matching frameworks so that their meaning remains consistent with the expected neighborhood constraints on connected vertices.
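The brute-force Python sketch below illustrates the intended semantics on a toy graph: a path a-b-c is matched only if no other vertex d (the anti-vertex) is adjacent to both a and c. It is an illustration of the semantics only, not the Cypher extension or any implementation from the paper.

# Minimal sketch of anti-vertex semantics on a tiny graph: match a path a-b-c and
# require that NO vertex d (the anti-vertex) is adjacent to both a and c. This is
# a brute-force illustration of the semantics, not the Cypher extension or any
# system's implementation from the paper.
from itertools import permutations

edges = {(0, 1), (1, 2), (0, 3), (2, 3), (1, 4)}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def matches(with_anti_vertex):
    out = []
    for a, b, c in permutations(adj, 3):          # isomorphism: distinct vertices
        if b not in adj[a] or c not in adj[b]:
            continue                              # must form the path a-b-c
        if with_anti_vertex and any(d not in (a, b, c) and
                                    a in adj[d] and c in adj[d] for d in adj):
            continue                              # a common neighbor d exists: reject
        out.append((a, b, c))
    return out

print(len(matches(False)), len(matches(True)))    # plain path vs. constrained matches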
ICFP '21 Grafs: Declarative Graph Analytics
Farzin Houshmand, Mohsen Lesani, and Keval Vora
Proceedings of the ACM on Programming Languages,
Volume 5, Issue ICFP, Article 83, Pages 83:1–83:32, Online, August 2021.     [ Abstract ] [ Paper ]
Grafs: Declarative Graph Analytics
Graph analytics elicits insights from large graphs to inform critical decisions for business, safety and security. Several large-scale graph processing frameworks feature efficient runtime systems; however, they often provide programming models that are low-level and subtly different from each other. Therefore, end users can find implementation and especially optimization of graph analytics error-prone and time-consuming. This paper regards the abstract interface of the graph processing frameworks as the instruction set for graph analytics, and presents Grafs, a high-level declarative specification language for graph analytics and a synthesizer that automatically generates efficient code for five high-performance graph processing frameworks. It features novel semantics-preserving fusion transformations that optimize the specifications and reduce them to three primitives: reduction over paths, mapping over vertices and reduction over vertices. Reductions over paths are commonly calculated based on push or pull models that iteratively apply kernel functions at the vertices. This paper presents conditions, parametric in terms of the kernel functions, for the correctness and termination of the iterative models, and uses these conditions as specifications to automatically synthesize the kernel functions. Experimental results show that the generated code matches or outperforms handwritten code, and that fusion accelerates execution.
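The following Python sketch illustrates the "reduction over paths" primitive as an iterative kernel: single-source shortest paths computed by repeatedly applying a min-plus kernel at each vertex until a fixpoint. The kernel and driver are illustrative stand-ins, not Grafs' specification syntax or its synthesized code.

# Sketch of the "reduction over paths" primitive as an iterative kernel: single-source
# shortest paths computed by repeatedly applying a min-plus kernel at each vertex
# until a fixpoint. The kernel function and driver below are illustrative; they are
# not Grafs' specification syntax or its synthesized code.
import math

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0), (2, 3, 8.0)]

def kernel(current, incoming):
    """Pull-style kernel: reduce over in-neighbor contributions with min."""
    return min([current] + incoming)

def reduce_over_paths(n, edges, source):
    dist = [math.inf] * n
    dist[source] = 0.0
    changed = True
    while changed:                                   # iterate to fixpoint
        changed = False
        for v in range(n):
            incoming = [dist[u] + w for u, _v, w in edges if _v == v]
            new = kernel(dist[v], incoming)
            if new < dist[v]:
                dist[v], changed = new, True
    return dist

print(reduce_over_paths(4, edges, 0))   # [0.0, 3.0, 1.0, 8.0]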
OSDI '21 Dorylus: Affordable, Scalable, and Accurate GNN Training over Billion-Edge Graphs
John Thorpe, Yifan Qiao, Jonathan Eyolfson, Shen Teng, Guanzhou Hu, Zhihao Jia, Jinliang Wei, Keval Vora, Ravi Netravali, Miryung Kim, and Guoqing Harry Xu
USENIX Symposium on Operating Systems Design and Implementation,
Pages 495-514, Online, July 2021.     [ Abstract ] [ Paper ] [ Presentation ]
Dorylus: Affordable, Scalable, and Accurate GNN Training over Billion-Edge Graphs
A graph neural network (GNN) enables deep learning on structured graph data. There are two major obstacles to GNN training: 1) it relies on high-end servers with many GPUs, which are expensive to purchase and maintain, and 2) the limited memory on GPUs cannot scale to today's billion-edge graphs. This paper presents Dorylus: a distributed system for training GNNs. Uniquely, Dorylus can take advantage of serverless computing to increase scalability at a low cost.

The key insight guiding our design is computation separation. Computation separation makes it possible to construct a deep, bounded-asynchronous pipeline where graph and tensor parallel tasks can fully overlap, effectively hiding the network latency incurred by Lambdas. With the help of thousands of Lambda threads, Dorylus scales GNN training to billion-edge graphs. Currently, for large graphs, CPU servers offer the best performance per dollar over GPU servers. Just using Lambdas on top of Dorylus offers up to 2.75x more performance-per-dollar than CPU-only servers. Concretely, Dorylus is 1.22x faster and 4.83x cheaper than GPU servers for massive sparse graphs. Dorylus is up to 3.8x faster and 10.7x cheaper compared to existing sampling-based systems.
USENIX ATC '21 Controlling Memory Footprint of Stateful Streaming Graph Processing
Pourya Vaziri and Keval Vora
USENIX Annual Technical Conference,
Pages 269-283, Online, July 2021.     [ Abstract ] [ Paper ] [ Video ] [ Presentation ]
Controlling Memory Footprint of Stateful Streaming Graph Processing
With growing interest in efficiently analyzing dynamic graphs, streaming graph processing systems rely on stateful iterative models where they track the intermediate state as execution progresses in order to incrementally adjust the results upon graph mutation. We observe that the intermediate state tracked by these stateful iterative models significantly increases the memory footprint of these systems, which limits their scalability on large graphs.

In this paper, we develop memory-efficient stateful iterative models that demand much less memory capacity to efficiently process streaming graphs and deliver the same results as provided by existing stateful iterative models. First, we propose a Selective Stateful Iterative Model where the memory footprint is controlled by selecting a small portion of the intermediate state to be maintained throughout execution. Then, we propose a Minimal Stateful Iterative Model that further reduces the memory footprint by exploiting key properties of graph algorithms. We develop incremental processing strategies for both of our models in order to correctly compute the effects of graph mutations on the final results even when intermediate states are not available. Evaluation shows our memory-efficient models are effective in limiting the memory footprint while still retaining most of the performance benefits of traditional stateful iterative models, hence being able to scale on larger graphs that could not be handled by the traditional models.
SIGOPS OSR '21 A Deeper Dive into Pattern-Aware Subgraph Exploration with Peregrine
Kasra Jamshidi and Keval Vora
ACM SIGOPS Operating Systems Review,
Volume 55, Issue 1, Pages 1-10, July 2021.     [ Abstract ] [ Paper ]
A Deeper Dive into Pattern-Aware Subgraph Exploration with Peregrine
Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. Peregrine is a general-purpose graph mining system that provides a generic runtime to efficiently explore subgraph structures of interest and perform various graph mining analyses. It takes a 'pattern-aware' approach by incorporating a pattern-based programming model along with efficient pattern matching strategies. The programming model enables easier expression of complex graph mining use cases and enables Peregrine to extract the semantics of patterns. By analyzing the patterns, Peregrine generates efficient exploration plans which it uses to guide its subgraph exploration.

In this paper, we present an in-depth view of the pattern-analysis techniques powering the matching engine of Peregrine. Beyond the theoretical foundations from prior research, we expose opportunities based on how the exploration plans are evaluated, and develop key techniques for computation reuse, enumeration depth reduction, and branch elimination. Our experiments show the importance of pattern-awareness for scalable and performant graph mining, where the presented techniques speed up performance by up to two orders of magnitude on top of the benefits achieved from the prior theoretical foundations that generate the initial exploration plans.
EuroSys '21 DZiG: Sparsity-Aware Incremental Processing of Streaming Graphs
Mugilan Mariappan, Joanna Che, and Keval Vora
European Conference on Computer Systems,
Pages 83–98, Online, United Kingdom, April 2021.     [ Abstract ] [ Paper ] [ Video ] [ Presentation ]
DZiG: Sparsity-Aware Incremental Processing of Streaming Graphs
State-of-the-art streaming graph processing systems that provide Bulk Synchronous Parallel (BSP) guarantees remain oblivious to the computation sparsity present in iterative graph algorithms, which severely limits their performance. In this paper we propose DZiG, a high-performance streaming graph processing system that retains efficiency in the presence of sparse computations while still guaranteeing BSP semantics. At the heart of DZiG are: (1) a sparsity-aware incremental processing technique that expresses computations in a recursive manner to be able to safely identify and prune updates (hence retaining sparsity); (2) a simple change-driven programming model that naturally exposes sparsity in iterative computations; and, (3) an adaptive processing model that automatically changes the incremental computation strategy to limit its overheads when computations become very sparse. DZiG outperforms state-of-the-art streaming graph processing systems, and pushes the boundary of dependency-driven processing for streaming graphs to over 10 million simultaneous mutations, which is orders of magnitude higher than state-of-the-art systems.
EuroSys '20 Peregrine: A Pattern-Aware Graph Mining System
Kasra Jamshidi, Rakesh Mahadasa, and Keval Vora
European Conference on Computer Systems,
Pages 13:1-13:16, Heraklion, Greece, April 2020.     [ Abstract ] [ Paper ] [ Long Version ] [ Video ] [ Presentation ]
Peregrine on GitHub
Peregrine: A Pattern-Aware Graph Mining System
Graph mining workloads aim to extract structural properties of a graph by exploring its subgraph structures. General purpose graph mining systems provide a generic runtime to explore subgraph structures of interest with the help of user-defined functions that guide the overall exploration process. However, the state-of-the-art graph mining systems remain largely oblivious to the shape (or pattern) of the subgraphs that they mine. This causes them to: (a) explore unnecessary subgraphs; (b) perform expensive computations on the explored subgraphs; and, (c) hold intermediate partial subgraphs in memory; all of which affect their overall performance. Furthermore, their programming models are often tied to their underlying exploration strategies, which makes it difficult for domain users to express complex mining tasks.

In this paper, we develop Peregrine, a pattern-aware graph mining system that directly explores the subgraphs of interest while avoiding exploration of unnecessary subgraphs, and simultaneously bypassing expensive computations throughout the mining process. We design a pattern-based programming model that treats graph patterns as first-class constructs and enables Peregrine to extract the semantics of patterns, which it uses to guide its exploration. Our evaluation shows that Peregrine outperforms state-of-the-art distributed and single-machine graph mining systems, and scales to complex mining tasks on larger graphs, while retaining simplicity and expressivity with its "pattern-first" programming approach.
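The Python sketch below mirrors the pattern-first programming style at a conceptual level: the user declares a pattern and a per-match callback, and a (here, brute-force) runtime drives exploration. Peregrine itself is a C++ system with its own API and pattern-aware exploration plans; this sketch only conveys the programming model, not its implementation.

# Conceptual sketch of a "pattern-first" programming model: the user declares a
# pattern and a per-match callback, and the runtime drives exploration. Peregrine
# itself is a C++ system with its own API; this brute-force matcher only mirrors
# the programming style, not Peregrine's pattern-aware exploration plans.
from itertools import permutations

def match(pattern_edges, graph_edges, on_match):
    adj = {}
    for u, v in graph_edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    k = len({x for e in pattern_edges for x in e})
    for cand in permutations(adj, k):                 # injective vertex mapping
        if all(cand[b] in adj[cand[a]] for a, b in pattern_edges):
            on_match(cand)

triangle = [(0, 1), (1, 2), (0, 2)]
graph = [(0, 1), (1, 2), (0, 2), (2, 3)]

count = 0
def bump(_m):
    global count
    count += 1

match(triangle, graph, bump)
print(count // 6)   # 1 triangle (each undirected triangle is matched 6 times)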
OOPSLA '19 DProf: Distributed Profiler with Strong Guarantees
Zachary Benavides, Keval Vora, and Rajiv Gupta
Proceedings of the ACM on Programming Languages,
Volume 3, Issue OOPSLA, Article 156, Pages 156:1–156:24, Athens, Greece, October 2019.     [ Abstract ] [ Paper ]
DProf: Distributed Profiler with Strong Guarantees
Performance analysis of a distributed system is typically achieved by collecting profiles whose underlying events are timestamped with the unsynchronized clocks of multiple machines in the system. To allow comparison of timestamps taken at different machines, several timestamp synchronization algorithms have been developed. However, the inaccuracies associated with these algorithms can lead to inaccuracies in the final results of performance analysis. To address this problem, in this paper, we develop DProf, a system for constructing distributed performance profiles. At the core of DProf is a new timestamp synchronization algorithm, FreeZer, that tightly bounds the inaccuracy in a converted timestamp to a time interval. This not only allows timestamps from different machines to be compared, but also enables maintaining strong guarantees throughout the comparison, which can be carefully transformed into guarantees for the analysis results.

To demonstrate the utility of DProf, we use it to implement dCSP and dCOZ, accuracy-bounded distributed versions of Context Sensitive Profiles and Causal Profiles developed for shared-memory systems. While dCSP enables the user to ascertain the existence of a performance bottleneck, dCOZ estimates the expected performance benefit from eliminating that bottleneck. Experiments with three distributed applications on a cluster of heterogeneous machines validate that inferences via dCSP and dCOZ are highly accurate. Moreover, if FreeZer is replaced by two existing timestamp algorithms (linear regression and convex hull), the inferences provided by dCSP and dCOZ are severely degraded.
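The Python sketch below illustrates the underlying idea of interval-bounded timestamp comparison: a remote timestamp is converted into a local-time interval known to contain the true instant, and an ordering is reported only when two intervals do not overlap. It is an illustration of the concept with made-up offset bounds, not the FreeZer algorithm.

# Sketch of interval-bounded timestamp comparison: a remote timestamp is converted
# into an interval [lo, hi] of local time that is guaranteed to contain the true
# instant, and two events are ordered only when their intervals do not overlap.
# This illustrates the idea of bounded-inaccuracy conversion; it is not the FreeZer
# algorithm itself.
from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float

def convert(remote_ts, offset_lo, offset_hi):
    """Map a remote timestamp into local time given bounds on the clock offset."""
    return Interval(remote_ts + offset_lo, remote_ts + offset_hi)

def happens_before(a: Interval, b: Interval):
    if a.hi < b.lo:
        return True          # guaranteed: a before b
    if b.hi < a.lo:
        return False         # guaranteed: b before a
    return None              # overlapping intervals: ordering cannot be guaranteed

e1 = convert(100.0, offset_lo=-2.0, offset_hi=1.0)   # event on machine A
e2 = convert( 98.0, offset_lo= 4.0, offset_hi=6.0)   # event on machine B
print(happens_before(e1, e2))   # True: [98, 101] lies entirely before [102, 104]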
ASPLOS '19 PnP: Pruning and Prediction for Point-To-Point Iterative Graph Analytics
Chengshuo Xu, Keval Vora, and Rajiv Gupta
ACM 24th International Conference on Architectural Support for Programming Languages and Operating Systems,
Pages 587-600, Providence, Rhode Island, April 2019.     [ Abstract ] [ Paper ]
PnP: Pruning and Prediction for Point-To-Point Iterative Graph Analytics
Frequently used parallel iterative graph analytics algorithms are computationally expensive. However, researchers have observed that applications often require point-to-point versions of these analytics algorithms that are less demanding. In this paper we introduce PnP, a parallel framework for iterative graph analytics that processes a stream of point-to-point queries, each involving a single source and destination vertex pair. The efficiency of our framework is derived from two novel features: online Pruning of graph exploration, which eliminates propagation from vertices that are determined not to contribute to a query's final solution; and dynamic direction Prediction for solving the query in either the forward (from source) or backward (from destination) direction, as their costs can differ greatly. PnP employs a two-phase algorithm: Phase 1 briefly traverses the graph in both directions to predict the faster direction and enable pruning; Phase 2 then completes query evaluation by running the algorithm in the chosen direction until it converges. Our experiments show that PnP responds to queries rapidly because of accurate direction selection and effective pruning that often offsets the runtime overhead of direction prediction. PnP substantially outperforms Quegel, the only other point-to-point query evaluation framework. Our experiments on multiple benchmarks and graphs show that PnP on a single machine is 8.2x to 3116x faster than Quegel on 4 machines.
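The simplified Python sketch below conveys the two-phase idea for an unweighted point-to-point query: short probes from both endpoints predict the cheaper direction, and the full search then runs in that direction only. It omits PnP's pruning machinery and is not the framework's implementation.

# Simplified sketch of the two-phase idea behind point-to-point queries: Phase 1
# expands a few BFS levels from both endpoints and picks the direction whose
# frontier grows more slowly; Phase 2 finishes the query in that direction only.
# This illustrates direction prediction for an unweighted shortest-path query and
# omits PnP's pruning machinery.
from collections import deque

def bfs_distance(adj, src, dst):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return None

def probe_frontier(adj, start, levels=2):
    frontier, seen = {start}, {start}
    for _ in range(levels):
        frontier = {v for u in frontier for v in adj.get(u, ()) if v not in seen}
        seen |= frontier
    return len(frontier)

def point_to_point(fwd_adj, bwd_adj, src, dst):
    # Phase 1: predict the cheaper direction from short probes.
    forward = probe_frontier(fwd_adj, src) <= probe_frontier(bwd_adj, dst)
    # Phase 2: run the full query in the chosen direction (both give the same answer).
    return bfs_distance(fwd_adj, src, dst) if forward else bfs_distance(bwd_adj, dst, src)

fwd = {0: [1, 2, 3, 4], 1: [5], 2: [5], 3: [5], 4: [5], 5: [6]}
bwd = {}
for u, vs in fwd.items():
    for v in vs:
        bwd.setdefault(v, []).append(u)
print(point_to_point(fwd, bwd, 0, 6))   # 3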
EuroSys '19 GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs
Mugilan Mariappan and Keval Vora
European Conference on Computer Systems,
Pages 25:1-25:16, Dresden, Germany, March 2019.     [ Abstract ] [ Paper ] [ Video ] [ Presentation ]
GraphBolt on GitHub
GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs
Efficient streaming graph processing systems leverage incremental processing by updating computed results to reflect the change in graph structure for the latest graph snapshot. Although certain monotonic path-based algorithms produce correct results by refining intermediate values via numerical comparisons, directly reusing values that were computed before mutation does not work correctly for algorithms that require BSP semantics. Since structural mutations in streaming graphs render the intermediate results unusable, exploiting incremental computation while simultaneously providing synchronous processing guarantees is challenging.

In this paper we develop GraphBolt, which incrementally processes streaming graphs while guaranteeing BSP semantics. GraphBolt incorporates dependency-driven incremental processing where it first tracks dependencies to capture how intermediate values get computed, and then uses this information to incrementally propagate the impact of change across intermediate values. To support a wide variety of graph-based analytics, GraphBolt provides a generalized incremental programming model that enables development of incremental versions of complex aggregations. Our evaluation shows that GraphBolt's incremental processing eliminates redundant computations and efficiently processes streaming graphs with varying mutation rates, starting from just a single edge mutation all the way up to 1 million edge mutations at a time. Furthermore, being specialized for graph computations, GraphBolt extracts high performance compared to Differential Dataflow.
USENIX ATC '19 Lumos: Dependency-Driven Disk-based Graph Processing
Keval Vora
USENIX Annual Technical Conference,
Pages 429-442, Renton, Washington, July 2019.     [ Abstract ] [ Paper ] [ Video ] [ Presentation ]
Lumos on GitHub
Lumos: Dependency-Driven Disk-based Graph Processing
Out-of-core graph processing systems are well-optimized to maintain sequential locality on disk and minimize the amount of disk I/O per iteration. Even though the sparsity in real-world graphs provides opportunities for out-of-order execution, these systems often process graphs iteration-by-iteration, hence providing the Bulk Synchronous Parallel (synchronous for short) mode of processing, which is also the preferred choice for easier programmability. Since the out-of-core setting limits the view of the entire graph and constrains the processing order to maintain disk locality, exploiting out-of-order execution while simultaneously providing synchronous processing guarantees is challenging. In this paper we develop a generic dependency-driven out-of-core graph processing technique, called Lumos, that performs out-of-order execution to proactively propagate values across iterations while simultaneously providing synchronous processing guarantees. Our cross-iteration value propagation technique identifies future dependencies that can be safely satisfied, and actively computes values across those dependencies without sacrificing disk locality. This eliminates the need to load the corresponding portions of the graph in future iterations, hence reducing disk I/O and accelerating the overall processing.
ISMM '18 OMR: Out-of-Core MapReduce for Large Data Sets
Gurneet Kaur, Keval Vora, Sai Charan Koduru, and Rajiv Gupta
International Symposium on Memory Management,
Pages 71-83, Philadelphia, Pennsylvania, June 2018.     [ Abstract ] [ Paper ]
OMR: Out-of-Core MapReduce for Large Data Sets
While single-machine MapReduce systems can squeeze out maximum performance from available multi-cores, they are often limited by the size of main memory and can thus only process small datasets. Our experience shows that the state-of-the-art single-machine in-memory MapReduce system Metis frequently experiences out-of-memory crashes. Even though today's computers are equipped with efficient secondary storage devices, the frameworks do not utilize these devices mainly because disk access latencies are much higher than those of main memory. Therefore, the single-machine setup of the Hadoop system performs much slower when presented with datasets that are larger than main memory. Moreover, such frameworks also require tuning many parameters, which puts an added burden on the programmer. In this paper we present OMR, an Out-of-core MapReduce system that not only successfully handles datasets far larger than the size of main memory, but also guarantees linear scaling with growing data sizes. OMR actively minimizes the amount of data to be read/written to/from disk via on-the-fly aggregation, and it uses block-sequential disk read/write operations whenever disk accesses become necessary to avoid running out of memory. We theoretically prove OMR's linear scalability and empirically demonstrate it by processing datasets that are up to 5x larger than main memory. Our experiments show that in comparison to the standalone single-machine setup of the Hadoop system, OMR delivers far higher performance. Also, in contrast to Metis, OMR avoids out-of-memory crashes for large datasets and delivers higher performance when datasets are small enough to fit in main memory.
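The Python sketch below illustrates the general pattern of on-the-fly aggregation with block-sequential spills for an out-of-core word count: map output is combined in a bounded in-memory table, spilled to disk when the budget is exceeded, and merged in a final pass. The memory budget and file handling are illustrative and do not reflect OMR's design.

# Sketch of out-of-core map-side aggregation: map outputs are combined on the fly
# in a bounded in-memory table, spilled sequentially to disk when the table fills,
# and merged in a final reduce pass. This illustrates the general idea only; it is
# not OMR's data layout or its scheduling.
import os, pickle, tempfile
from collections import Counter

def mapreduce_wordcount(lines, max_keys_in_memory=2):
    spills, table = [], Counter()
    for line in lines:                              # map + on-the-fly combine
        for word in line.split():
            table[word] += 1
            if len(table) > max_keys_in_memory:     # memory budget exceeded: spill
                with tempfile.NamedTemporaryFile(suffix=".spill", delete=False) as f:
                    pickle.dump(table, f)           # one sequential block write
                spills.append(f.name)
                table = Counter()
    result = Counter(table)                         # reduce: merge all spills
    for path in spills:
        with open(path, "rb") as f:
            result.update(pickle.load(f))
        os.remove(path)
    return result

print(mapreduce_wordcount(["a b a", "c a b", "d c"]))   # Counter with a:3 b:2 c:2 d:1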
IJPP '18 Software Speculation on Caching DSMs
Sai Charan Koduru, Keval Vora, and Rajiv Gupta
International Journal of Parallel Programming,
Volume 46, Issue 2, Pages 313-332, April 2018.     [ Abstract ] [ Paper ]
Software Speculation on Caching DSMs
Clusters with caching DSMs deliver programmability and performance by supporting a shared-memory programming model and tolerating the communication latency of remote fetches via caching. The input of a data-parallel program is partitioned across machines in the cluster while the DSM transparently fetches and caches remote data as needed by the application. Irregular applications are challenging to parallelize because the input-related data dependences that manifest at runtime require the use of speculation for efficiently exploiting parallelism. By speculating that there are no cross-iteration dependences, multiple iterations of a data-parallel loop are executed in parallel using locally cached copies of data; the absence of dependences is validated before committing the speculatively computed results. In this paper we show that in irregular data-parallel applications, while caching helps tolerate long communication latencies, using a value read from the cache in a computation can lead to misspeculation, and thus aggressive caching can degrade performance due to an increased misspeculation rate. To limit the misspeculation rate we present optimizations for distributed speculation on caching-based DSMs that decrease the cost of the misspeculation check and speed up the re-execution of misspeculated recomputations. These optimizations give speedups of 2.24x for graph coloring, 1.71x for connected components, 1.88x for community detection, 1.32x for shortest path, and 1.74x for pagerank over baseline parallel executions.
IA3 '17 Enabling Work-Efficiency for High Performance Vertex-Centric Graph Analytics on GPUs
Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan
Seventh Workshop on Irregular Applications: Architectures and Algorithms,
Article No. 11, 4 pages, Denver, Colorado, November 2017.     [ Abstract ] [ Paper ]
Enabling Work-Efficiency for High Performance Vertex-Centric Graph Analytics on GPUs
The massive parallel processing power of GPUs has attracted researchers to develop iterative vertex-centric graph processing frameworks for GPUs. Enabling work-efficiency in these solutions, however, is not straightforward and comes at the cost of SIMD-inefficiency and load imbalance. This paper offers techniques that overcome these challenges when processing the graph on a GPU. For a SIMD-efficient kernel operation involving gathering of neighbors and performing reduction on them, we employ an effective task expansion strategy that avoids intra-warp thread underutilization. As recording vertex activeness requires additional data structures, to attenuate the graph storage overhead on limited GPU DRAM, we introduce vertex grouping as a technique that enables a trade-off between memory consumption and work efficiency in our solution. Our experiments show that these techniques provide speedups of up to 5.46x over the recently proposed WS-VR framework across multiple algorithms and inputs.
ASPLOS '17 KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations
Keval Vora, Rajiv Gupta, and Guoqing Xu
ACM 22nd International Conference on Architectural Support for Programming Languages and Operating Systems,
Pages 237-251, Xi'an, China, April 2017.     [ Abstract ] [ Paper ] [ Presentation ]
KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations
Continuous processing of a streaming graph iteratively maintains an approximate result of the computation on a recent version of the graph. Upon a user query, the accurate result on the current graph can be quickly computed by feeding the approximate results to the iterative computation - a form of incremental computation that corrects the (small amount of) error in the approximate result. Despite the effectiveness of this approach in processing growing graphs, it is not generally applicable when edge deletions are present - existing approximations can lead to either incorrect results (e.g., for monotonic algorithms the computation terminates at an incorrect minimum/maximum) or poor performance (e.g., with approximations, convergence takes longer than performing the computation from scratch).

In this paper we present KickStarter, which, for a general class of monotonic graph algorithms, trims the approximations to a subset of vertex values whose use preserves correctness of results and yet allows a majority of existing approximations to be directly used for efficiency. Our experiments with four streaming algorithms on five real-world graphs demonstrate that trimming not only produces correct results but also accelerates these algorithms by 8.5-23.7x.
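The Python sketch below illustrates the flavor of trimming for a monotonic computation (single-source shortest paths) under an edge deletion: each vertex remembers which in-neighbor its value came from, values that transitively depended on the deleted edge are reset, and iterative relaxation resumes from the remaining, still-safe approximations. It is a conceptual illustration, not KickStarter's algorithm.

# Simplified illustration of trimming approximations for a monotonic computation
# (single-source shortest paths) after an edge deletion: each vertex remembers the
# in-neighbor its value came from, the subtree that depended on the deleted edge is
# reset (trimmed), and iterative relaxation resumes from the remaining values. This
# is a conceptual sketch, not KickStarter's algorithm.
import math

def relax(n, edges, dist, parent):
    changed = True
    while changed:
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v], parent[v], changed = dist[u] + w, u, True
    return dist, parent

def delete_edge(n, edges, dist, parent, dead):
    edges = [e for e in edges if (e[0], e[1]) != dead]
    # Trim: reset every vertex whose current value transitively depends on `dead`.
    affected = {dead[1]} if parent[dead[1]] == dead[0] else set()
    grew = True
    while grew:
        grew = False
        for v in range(n):
            if v not in affected and parent[v] in affected:
                affected.add(v); grew = True
    for v in affected:
        dist[v], parent[v] = math.inf, None
    return relax(n, edges, dist, parent)

edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 5.0), (2, 3, 1.0)]
dist, parent = [0.0] + [math.inf] * 3, [None] * 4
dist, parent = relax(4, edges, dist, parent)          # distances [0, 1, 2, 3]
dist, parent = delete_edge(4, edges, dist, parent, (1, 2))
print(dist)                                           # [0.0, 1.0, 5.0, 6.0]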
ASPLOS '17 CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
Keval Vora, Chen Tian, Rajiv Gupta, and Ziang Hu
ACM 22nd International Conference on Architectural Support for Programming Languages and Operating Systems,
Pages 223-236, Xi'an, China, April 2017.     [ Abstract ] [ Paper ] [ Presentation ]
CoRAL: Confined Recovery in Distributed Asynchronous Graph Processing
Existing distributed asynchronous graph processing systems employ checkpointing to capture globally consistent snapshots and roll back all machines to the most recent checkpoint to recover from machine failures. In this paper we argue that recovery in distributed asynchronous graph processing does not require the entire execution state to be rolled back to a globally consistent state due to the relaxed asynchronous execution semantics. We define the properties required in the recovered state for it to be usable for correct asynchronous processing and develop CoRAL, a lightweight checkpointing and recovery algorithm. First, this algorithm carries out confined recovery that only rolls back the graph execution states of the failed machines to effect recovery. Second, it relies upon lightweight checkpoints that capture locally consistent snapshots with a reduced peak network bandwidth requirement. Our experiments using real-world graphs show that our technique recovers from failures and finishes processing 1.5-3.2x faster compared to the traditional asynchronous checkpointing and recovery mechanism when failures impact 6-37% of the machines in the cluster. Moreover, capturing locally consistent snapshots significantly reduces intermittent high bandwidth usage required to save the snapshots -- the average reduction in 99th percentile bandwidth is 22-51% to maintain 1-6 snapshot replicas.
USENIX ATC '16 Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing
Keval Vora, Guoqing Xu, and Rajiv Gupta
USENIX Annual Technical Conference,
Pages 507-522, Denver, Colorado, June 2016.     [ Abstract ] [ Paper ] [ Presentation ]
Load the Edges You Need: A Generic I/O Optimization for Disk-based Graph Processing
Single-PC, disk-based processing of big graphs has recently gained much popularity. At the core of an efficient disk-based system is a well-designed partition structure that can minimize random disk accesses. All existing systems use static partitions that are created before processing starts. These partitions have static layouts and are loaded entirely into memory in every single iteration even though much of the edge data is not changed across many iterations, causing these unchanged edges to have zero new impact on the computation of vertex values.

This work provides a general optimization that removes this I/O inefficiency by employing dynamic partitions whose layouts are dynamically adjustable. We have implemented this optimization in GraphChi -- a representative out-of-core vertex-centric graph system -- which sped up GraphChi by 1.5-2.8x on six large graphs. Our idea is generally applicable to other systems as well.
TACO '16 Synergistic Analysis of Evolving Graphs
Keval Vora, Rajiv Gupta, and Guoqing Xu
ACM Transactions on Architecture and Code Optimization,
Volume 13, Issue 4, Article No. 32, Pages 32:1–32:27, December 2016.     [ Abstract ] [ Paper ]
Synergistic Analysis of Evolving Graphs
Evolving graph processing involves repeating analyses, which are often iterative, over multiple snapshots of the graph corresponding to different points in time. Since the snapshots of an evolving graph share a great number of vertices and edges, traditional approaches that process these snapshots one at a time without exploiting this overlap contain much wasted effort on both data loading and computation, making them extremely inefficient. In this paper, we identify major sources of inefficiencies and present two optimization techniques to address them. First, we propose a technique for amortizing the fetch cost by merging the fetching of values for different snapshots of the same vertex. Second, we propose a technique for amortizing the processing cost by feeding values computed by earlier snapshots into later snapshots. We have implemented these optimizations in two distributed graph processing systems, namely GraphLab and ASPIRE. Our experiments with multiple real evolving graphs and algorithms show that, on average, fetch amortization speeds up the execution of GraphLab and ASPIRE by 5.2x and 4.1x respectively. Amortizing the processing cost yields additional average speedups of 2x and 7.9x respectively.
HPDC '16 Efficient Processing of Large Graphs via Input Reduction
Amlan Kusum, Keval Vora, Rajiv Gupta, and Iulian Neamtiu
25th ACM International Symposium on High-Performance Parallel and Distributed Computing,
Pages 245-257, Kyoto, Japan, May-June 2016.     [ Abstract ] [ Paper ] [ Presentation ]
Efficient Processing of Large Graphs via Input Reduction
Large-scale parallel graph analytics involves executing iterative algorithms (e.g., PageRank, Shortest Paths, etc.) that are both data- and compute-intensive. In this work we construct faster versions of iterative graph algorithms from their original counterparts using input graph reduction. A large input graph is transformed into a small graph using a sequence of input reduction transformations. Savings in execution time are achieved using our two-phase processing model that effectively runs the original iterative algorithm in two phases: first, using the reduced input graph to gain savings in execution time; and second, using the original input graph along with the results from the first phase for computing precise results. We propose several input reduction transformations and identify the structural and non-structural properties that they guarantee, which in turn are used to ensure the correctness of results while using our two-phase processing model. We further present a unified input reduction algorithm that efficiently applies a non-interfering sequence of simple local input reduction transformations. Our experiments show that our transformation techniques enable significant reductions in execution time (1.25x-2.14x) while achieving precise final results for most of the algorithms. For cases where precise results cannot be achieved, the relative error remains very small (at most 0.065).
OOPSLA '15 RAIVE: Runtime Assessment of Floating-Point Instability by Vectorization
Wen-Chuan Lee, Tao Bao, Yunhui Zheng, Xiangyu Zhang, Keval Vora, and Rajiv Gupta
ACM SIGPLAN International Conference on Object Oriented Programming Systems, Languages and Applications,
Pages 623-638, Pittsburgh, Pennsylvania, October 2015.     [ Abstract ] [ Paper ]
RAIVE: Runtime Assessment of Floating-Point Instability by Vectorization
Floating point representation has limited precision and inputs to floating point programs may also have errors. Consequently, during execution, errors are introduced, propagated, and accumulated, leading to unreliable outputs. We call this the instability problem. We propose RAIVE, a technique that identifies output variations of a floating point execution in the presence of instability. RAIVE transforms every floating point value to a vector of multiple values – the values added to create the vector are obtained by introducing artificial errors that are upper bounds of actual errors. The propagation of artificial errors models the propagation of actual errors. When values in vectors result in discrete execution differences (e.g., following different paths), the execution is forked to capture the resulting output variations. Our evaluation shows that RAIVE can precisely capture output variations. Its overhead (340%) is 2.43 times lower than the state of the art.
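The Python sketch below conveys the idea of propagating artificial errors: each input is carried as a small vector of slightly perturbed replicas, the same computation runs on every replica, and a large spread across the outputs flags potential instability. The perturbation patterns and threshold are illustrative; this is not RAIVE's vectorization or its error model.

# Sketch of instability detection by propagating artificial perturbations: each
# input is carried as a small vector of slightly perturbed replicas, the same
# computation is applied to every replica, and a wide spread in the outputs flags
# potential floating-point instability. This only illustrates the idea of error
# propagation via replicas; it is not RAIVE's vectorization or error model.
def replicas(x, pattern, rel_err=1e-12):
    """Carry x as several copies, each perturbed by an artificial relative error."""
    return [x * (1 + s * rel_err) for s in pattern]

def spread(values):
    return max(values) - min(values)

def unstable_expression(x_reps, y_reps):
    # Catastrophic cancellation: (x - y) amplifies the tiny input perturbations.
    return [(x - y) * 1e12 for x, y in zip(x_reps, y_reps)]

x = replicas(1.000000000001, pattern=(0, +1, -1))
y = replicas(1.0,            pattern=(0, -1, +1))
out = unstable_expression(x, y)
print(out)                  # the replicas diverge noticeably
print(spread(out) > 1e-3)   # True: the result is flagged as unstable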
CLUSTER '15 Optimizing Caching DSM for Distributed Software Speculation
Sai Charan Koduru, Keval Vora, and Rajiv Gupta
IEEE International Conference on Cluster Computing,
Pages 452-455, Chicago, Illinois, September 2015.     [ Abstract ] [ Paper ]
Optimizing Caching DSM for Distributed Software Speculation
Clusters with caching DSMs deliver programmability and performance by supporting shared-memory programming and tolerating remote I/O latencies via caching. The input to a data-parallel program is partitioned across the cluster while the DSM transparently fetches and caches remote data as needed. Irregular applications, however, are challenging to parallelize because the input-related data dependences that manifest at runtime require the use of speculation for correct parallel execution. By speculating that there are no input-related cross-iteration dependences, private copies of the input can be processed by parallelizing the loop; the absence of dependences is validated before committing the computed results. We show that while caching helps tolerate long communication latencies in irregular data-parallel applications, using a cached value in a computation can lead to misspeculation, and thus aggressive caching can degrade performance due to an increased misspeculation rate. We present optimizations for distributed speculation on caching-based DSMs that decrease the cost of the misspeculation check and speed up the re-execution of misspeculated recomputations. Optimized distributed speculation achieves speedups of 2.24x for coloring, 1.71x for connected components, 1.88x for community detection, 1.32x for shortest path, and 1.74x for pagerank over unoptimized speculation.
OOPSLA '14 ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM
Keval Vora, Sai Charan Koduru, and Rajiv Gupta
ACM SIGPLAN International Conference on Object Oriented Programming Systems, Languages and Applications,
Pages 861-878, Portland, Oregon, October 2014.     [ Abstract ] [ Paper ] [ Presentation ]
ASPIRE: Exploiting Asynchronous Parallelism in Iterative Algorithms using a Relaxed Consistency based DSM
Many vertex-centric graph algorithms can be expressed using asynchronous parallelism by relaxing certain read-after-write data dependences and allowing threads to compute vertex values using stale (i.e., not the most recent) values of their neighboring vertices. We observe that on distributed shared memory systems, by converting synchronous algorithms into their asynchronous counterparts, algorithms can be made tolerant to high inter-node communication latency. However, high inter-node communication latency can lead to excessive use of stale values, causing an increase in the number of iterations required by the algorithms to converge. Although by using bounded staleness we can restrict the slowdown in the rate of convergence, this also restricts the ability to tolerate communication latency. In this paper we design a relaxed memory consistency model and consistency protocol that simultaneously tolerate communication latency and minimize the use of stale values. This is achieved via a coordinated use of a best-effort refresh policy and bounded staleness. We demonstrate that for a range of asynchronous graph algorithms and PDE solvers, on average, our approach outperforms algorithms based upon prior relaxed memory models that allow stale values by at least 2.27x, and the Bulk Synchronous Parallel (BSP) model by 4.2x. We also show that our approach frequently outperforms GraphLab, a popular distributed graph processing framework.
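The Python sketch below illustrates a bounded-staleness read policy in isolation: a cached copy of a remote value may be used only while it is at most a fixed number of iterations old, otherwise the reader fetches a fresh copy. The class and its methods are illustrative, not ASPIRE's consistency protocol.

# Minimal sketch of a bounded-staleness read policy: a locally cached copy of a
# remote vertex value may be used only if it is at most `bound` iterations old;
# otherwise the reader blocks on a fresh fetch. Best-effort background refreshes
# would keep most reads on the fast path. Illustrative only; not ASPIRE's protocol.
class BoundedStalenessCache:
    def __init__(self, bound, fetch):
        self.bound, self.fetch = bound, fetch
        self.value, self.fetched_at = {}, {}

    def refresh(self, key, iteration):
        """Refresh the cached copy (e.g., piggybacked on other communication)."""
        self.value[key] = self.fetch(key)
        self.fetched_at[key] = iteration

    def read(self, key, iteration):
        stale = key not in self.value or iteration - self.fetched_at[key] > self.bound
        if stale:                      # staleness bound exceeded: forced remote fetch
            self.refresh(key, iteration)
        return self.value[key]

remote = {"v1": 10}
cache = BoundedStalenessCache(bound=2, fetch=lambda k: remote[k])

print(cache.read("v1", iteration=0))   # 10 (initial fetch)
remote["v1"] = 42                       # remote value changes
print(cache.read("v1", iteration=2))   # 10 (within the staleness bound, stale read ok)
print(cache.read("v1", iteration=3))   # 42 (bound exceeded, fetches the fresh value)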
HPDC '14 CuSha: Vertex-Centric Graph Processing on GPUs
Farzad Khorasani, Keval Vora, Rajiv Gupta, and Laxmi N. Bhuyan
23rd ACM International Symposium on High Performance Parallel and Distributed Computing,
Pages 239-251, Vancouver, Canada, June 2014.     [ Abstract ] [ Paper ] [ Presentation ]
CuSha: Vertex-Centric Graph Processing on GPUs
Vertex-centric graph processing is employed by many popular algorithms (e.g., PageRank) due to its simplicity and efficient use of asynchronous parallelism. The high compute power provided by the SIMT architecture presents an opportunity for accelerating these algorithms using GPUs. Prior works on GPU graph processing employ the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. In this paper, we present CuSha, a CUDA-based graph processing framework that overcomes the above obstacle via use of two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards uses a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources on to shards allows fully coalesced memory accesses. CW is a novel representation that enhances the use of shards to achieve higher GPU utilization for processing sparse graphs. Finally, CuSha fully utilizes the GPU power by processing multiple shards in parallel on GPU's streaming multiprocessors. For ease of programming, CuSha allows the user to define the vertex-centric computation and plug it into its framework for parallel processing of large graphs. Our experiments show that CuSha provides significant speedups over the state-of-the-art CSR-based virtual warp-centric method for processing graphs on GPUs.
HIPS '14 ABC2: Adaptively Balancing Computation & Communication in a DSM cluster of Multicores for Irregular Applications
Sai Charan Koduru, Keval Vora, and Rajiv Gupta
Workshop on High-Level Parallel Programming Models and Supportive Environments,
Pages 391-400, in IEEE IPDPSW Proceedings, Phoenix, May 2014.     [ Abstract ] [ Paper ]
ABC2: Adaptively Balancing Computation & Communication in a DSM cluster of Multicores for Irregular Applications
Graph-based applications have become increasingly important in many application domains. The large graph sizes offer data level parallelism at a scale that makes it attractive to run such applications on distributed shared memory (DSM) based modern clusters composed of multicore machines. Our analysis of several graph applications that rely on speculative parallelism or asynchronous parallelism shows that the balance between computation and communication differs between applications. In this paper, we study this balance in the context of DSMs and exploit the multiple cores present in modern multicore machines by creating three kinds of threads which allows us to dynamically balance computation and communication: compute threads to exploit data level parallelism in the computation; fetch threads that replicate data into object-stores before it is accessed by compute threads; and update threads that make results computed by compute threads visible to all compute threads by writing them to DSM. We observe that the best configuration for above mechanisms varies across different inputs in addition to the variation across different applications. To this end, we design ABC2: a runtime algorithm that automatically configures the DSM using simple runtime information such as: observed object prefetch and update queue lengths. This runtime algorithm achieves speedups close to that of the best hand-optimized configurations.