Project Proposal Format: The proposal should be in PDF format. It should be no longer than two pages, single-column, single-spaced, in 10pt Times font. Margins are one inch on each side (top, bottom, left, and right).
Submit the following on Canvas:
For each of the projects described below, students will be graded on the tasks accomplished in the project. Each project has two goals, and the grade will be based only on finished goals: if only one goal is accomplished, the maximum project score is 50%; if both goals are accomplished, the maximum project score is 100%. Note that these maxima are caps, not guarantees: the actual score also depends on the project proposal, progress report, presentation, and final report/code.
Implement a branch predictor in gem5 that can predict two, three, or four conditional branches every cycle. Choose a technique from the literature that can predict multiple branches per cycle. Compare its performance and prediction accuracy to those of the bimodal predictor implemented in gem5. Some references to consider:
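As a concrete baseline for the comparison, the bimodal predictor is essentially a table of 2-bit saturating counters indexed by the branch PC. The minimal standalone sketch below illustrates that baseline; the class name, table size, and indexing are illustrative choices, not gem5's actual implementation, and a multi-branch predictor would have to produce several such predictions per lookup.

```cpp
#include <cstdint>
#include <vector>

// Minimal 2-bit-counter bimodal predictor sketch (illustrative, not gem5's class).
// Each entry is a saturating counter: 0-1 predict not-taken, 2-3 predict taken.
class BimodalPredictor {
public:
    explicit BimodalPredictor(size_t entries = 4096)
        : table_(entries, 2) {}          // initialize to "weakly taken"

    bool predict(uint64_t pc) const {
        return table_[index(pc)] >= 2;
    }

    void update(uint64_t pc, bool taken) {
        uint8_t &ctr = table_[index(pc)];
        if (taken && ctr < 3)  ++ctr;    // saturate at 3 (strongly taken)
        if (!taken && ctr > 0) --ctr;    // saturate at 0 (strongly not-taken)
    }

private:
    size_t index(uint64_t pc) const {
        return (pc >> 2) % table_.size();   // drop low PC bits, wrap into the table
    }
    std::vector<uint8_t> table_;
};
```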
Loads that miss in the cache significantly reduce performance. One mechanism proposed to reduce their impact is to predict the values of load instructions before they execute and forward those values to dependent instructions. When successful, load value prediction improves performance by allowing instructions that depend on load misses to execute early. However, incorrectly predicted loads require flushing the pipeline, which incurs a significant penalty (similar to branch or memory-dependence mispredictions). In this project, you will implement a Markov Load Value Predictor in gem5 that uses the history of previously executed loads to predict future load values. Model both the benefit of a correctly predicted value and the cost of an incorrectly predicted one, and compare the value predictor against a baseline without value prediction. For more information about Markov predictors, consider this paper:
Y. Sazeides and J.E. Smith, “The Predictability of Data Values,” MICRO 1997. Paper: https://ieeexplore.ieee.org/abstract/document/645815 Talk: https://ftp1.cs.wisc.edu/sohi/talks/1997/micro.predictability.pdf
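To make the predictor concrete, the sketch below shows an order-1 Markov (context-based) load value predictor: the last value produced by a static load is the context used to look up the value that followed it before. Table organization, the hash, and the confidence thresholds are assumptions for illustration, not the exact design from Sazeides and Smith, and a gem5 implementation would use fixed-size tables rather than hash maps.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative order-1 Markov (finite-context) load value predictor sketch.
struct MarkovLVP {
    // Level 1: per-load-PC value history (here, just the last value).
    std::unordered_map<uint64_t, uint64_t> history;      // pc -> last value

    // Level 2: per-(pc, last value) prediction with a small confidence counter.
    struct Pred { uint64_t next = 0; unsigned conf = 0; };
    std::unordered_map<uint64_t, Pred> pattern;           // hash(pc, last) -> next

    static uint64_t key(uint64_t pc, uint64_t last) {
        return pc * 0x9E3779B97F4A7C15ULL ^ last;         // simple mixing hash
    }

    // Predict only when the context has been seen and confidence is high.
    bool predict(uint64_t pc, uint64_t &value) const {
        auto h = history.find(pc);
        if (h == history.end()) return false;
        auto p = pattern.find(key(pc, h->second));
        if (p == pattern.end() || p->second.conf < 2) return false;
        value = p->second.next;
        return true;
    }

    // Train with the actual value once the load completes.
    void update(uint64_t pc, uint64_t actual) {
        auto h = history.find(pc);
        if (h != history.end()) {
            Pred &p = pattern[key(pc, h->second)];
            if (p.next == actual) { if (p.conf < 3) ++p.conf; }  // transition confirmed
            else { p.next = actual; p.conf = 0; }                // learn new transition
        }
        history[pc] = actual;                                    // advance the context
    }
};
```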
Effective cache replacement policies can significantly reduce cache miss rates and improve performance. Recent research showed that it is beneficial to bypass the insertion of some blocks in the cache if they are not predicted to be reused. An example of such research is the winner of the cache replacement championship (http://www.jilp.org/jwac-1/online/papers/005_gao.pdf). In this project, you should implement a cache replacement policy with bypass in gem5, and compare its performance to the default processor and cache replacement policy already implemented in gem5.
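To make the bypass decision concrete, here is an illustrative fill-time reuse predictor sketch (a generic design, not the championship-winning policy from the linked paper): a small table of saturating counters, indexed by a hash of the requesting PC, decides whether an incoming block is allocated in the cache at all.

```cpp
#include <cstdint>
#include <cstddef>
#include <array>

// Illustrative bypass-on-fill sketch: blocks predicted dead are never inserted.
class BypassPredictor {
public:
    BypassPredictor() { counters_.fill(2); }   // start biased toward "will be reused"

    // Decide at fill time: true = bypass (do not allocate the block).
    bool shouldBypass(uint64_t pc) const {
        return counters_[index(pc)] == 0;      // strong "no reuse" prediction
    }

    // Train when a block leaves the cache (or when a bypassed block would have
    // been reused): wasReused indicates whether allocation paid off.
    void train(uint64_t pc, bool wasReused) {
        uint8_t &c = counters_[index(pc)];
        if (wasReused) { if (c < 3) ++c; }
        else           { if (c > 0) --c; }
    }

private:
    static constexpr size_t kEntries = 1024;
    static size_t index(uint64_t pc) { return (pc >> 2) & (kEntries - 1); }
    std::array<uint8_t, kEntries> counters_;
};
```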
Effective cache replacement and dead block prediction mechanisms can greatly reduce cache misses and improve performance. A recent paper published in HPCA 2022 presented a mechanism (Mockingjay) that mimics Belady's optimal replacement policy to approach optimal performance. In this project, you should implement Mockingjay in gem5 and compare its performance to existing replacement policies already implemented in gem5.
Reference: Ishan Shah, Akanksha Jain and Calvin Lin, Effective Mimicry of Belady’s MIN Policy, HPCA 2022. Link: https://www.cs.utexas.edu/~lin/papers/hpca22.pdf
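Since Mockingjay's goal is to mimic Belady's MIN, it helps to be precise about the oracle being mimicked. The sketch below computes MIN's miss count offline for a fully associative cache of a given capacity by evicting the resident block whose next use lies furthest in the future; Mockingjay approximates the same decision online using predicted reuse distances. The sketch is illustrative and is not part of Mockingjay itself.

```cpp
#include <cstdint>
#include <vector>
#include <unordered_map>
#include <limits>
#include <iostream>

// Offline Belady's MIN for a fully associative cache over an access trace.
size_t beladyMisses(const std::vector<uint64_t> &trace, size_t capacity) {
    // Precompute, for each access, the index of the next access to the same block.
    std::vector<size_t> nextUse(trace.size());
    std::unordered_map<uint64_t, size_t> lastSeen;
    for (size_t i = trace.size(); i-- > 0; ) {
        auto it = lastSeen.find(trace[i]);
        nextUse[i] = (it == lastSeen.end()) ? std::numeric_limits<size_t>::max()
                                            : it->second;
        lastSeen[trace[i]] = i;
    }

    std::unordered_map<uint64_t, size_t> resident;   // block -> its next-use index
    size_t misses = 0;
    for (size_t i = 0; i < trace.size(); ++i) {
        uint64_t blk = trace[i];
        if (resident.count(blk)) {                   // hit: refresh next-use info
            resident[blk] = nextUse[i];
            continue;
        }
        ++misses;
        if (resident.size() == capacity) {           // evict furthest-reused block
            auto victim = resident.begin();
            for (auto it = resident.begin(); it != resident.end(); ++it)
                if (it->second > victim->second) victim = it;
            resident.erase(victim);
        }
        resident[blk] = nextUse[i];
    }
    return misses;
}

int main() {
    std::vector<uint64_t> trace = {1, 2, 3, 1, 4, 2, 1, 3};
    std::cout << beladyMisses(trace, 2) << " misses\n";   // optimal miss count for 2 lines
}
```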
Cache prefetching mechanisms can greatly reduce compulsory and capacity misses and therefore improve performance. However, aggressive prefetching can evict useful blocks from the cache, which can be counter-productive. An accurate prefetcher can improve performance while avoiding these extra misses. In this project, you should implement the Signature Path Prefetcher (SPP) in gem5 and compare its performance to existing prefetchers already implemented in gem5, and to no prefetching.
Reference: J. Kim, S. Pugsley, P. V. Gratz, A. Reddy, C. Wilkerson, Z. Chishti, Path Confidence based Lookahead Prefetching, MICRO 2016. Link: https://ieeexplore.ieee.org/document/7783763
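A highly simplified sketch of SPP's core loop is shown below: a per-page signature compresses recent block-offset deltas, a pattern table maps each signature to a predicted next delta with a confidence estimate, and lookahead follows the predicted delta chain while the accumulated path confidence stays above a threshold. The table sizes, the signature hash, the single delta per signature, and the confidence arithmetic are illustrative simplifications of the paper's design.

```cpp
#include <cstdint>
#include <vector>
#include <unordered_map>

// Simplified Signature Path Prefetcher (SPP) sketch; real SPP uses
// set-associative tables, multiple delta candidates, and a global history register.
class SimpleSPP {
public:
    // Called on every access: pageAddr identifies the 4KB page, blockOffset is
    // the cache-block offset within that page (0..63). Returns offsets to prefetch.
    std::vector<int> access(uint64_t pageAddr, int blockOffset) {
        PageState &ps = pages_[pageAddr];
        std::vector<int> prefetches;

        if (ps.valid) {
            int delta = blockOffset - ps.lastOffset;
            train(ps.signature, delta);
            ps.signature = nextSignature(ps.signature, delta);
        }
        ps.valid = true;
        ps.lastOffset = blockOffset;

        // Lookahead: follow the predicted delta chain while confidence holds.
        uint32_t sig = ps.signature;
        int offset = blockOffset;
        double pathConf = 1.0;
        for (int depth = 0; depth < 8; ++depth) {
            auto it = patterns_.find(sig);
            if (it == patterns_.end()) break;
            pathConf *= it->second.confidence();
            if (pathConf < 0.5) break;              // confidence threshold
            offset += it->second.delta;
            if (offset < 0 || offset >= 64) break;  // stay within the page
            prefetches.push_back(offset);
            sig = nextSignature(sig, it->second.delta);
        }
        return prefetches;
    }

private:
    struct PageState { bool valid = false; int lastOffset = 0; uint32_t signature = 0; };
    struct Pattern {
        int delta = 0; unsigned hits = 0; unsigned uses = 0;
        double confidence() const { return uses ? double(hits) / uses : 0.0; }
    };

    static uint32_t nextSignature(uint32_t sig, int delta) {
        return ((sig << 3) ^ static_cast<uint32_t>(delta)) & 0xFFF;  // 12-bit signature
    }

    void train(uint32_t sig, int delta) {
        Pattern &p = patterns_[sig];
        ++p.uses;
        if (p.delta == delta) ++p.hits;
        else if (p.hits == 0) { p.delta = delta; p.hits = 1; }   // replace a weak entry
    }

    std::unordered_map<uint64_t, PageState> pages_;
    std::unordered_map<uint32_t, Pattern> patterns_;
};
```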
Recent ISCA, MICRO, and HPCA conferences have included multiple kernel accelerators. Kernel accelerators are those with minimal programmability that offload a particular kernel end-to-end. Implement one of these accelerators in the gem5-SALAM framework and evaluate the design. Some suggested accelerators:
Publishable result
While Domain-Specific Architectures (DSAs) differ significantly across application domains, they share common underlying principles. As noted by Dally et al. and Hennessy and Patterson, DSAs must exploit data locality and minimize global memory accesses to achieve high performance and energy efficiency. Consequently, most DSA resources (both area and energy) are dedicated to organizing on-chip memory hierarchies and efficiently fetching data from DRAM.
DSAs leverage three key optimizations:
DSA-specific data types: Early DSAs primarily handled dense data in regular loop nests. Modern DSAs now support complex, non-indexed metadata-based structures such as compressed sparse matrices, graph nodes, and database indexes.
DSA-specific walkers: Like CPUs, DSAs use hardwired address generators and DRAM fetchers to maximize memory bandwidth. While simple base+offset addressing suffices for dense arrays, state-of-the-art DSAs require sophisticated walkers that reference multiple elements with complex access patterns (a concrete example is sketched after this list).
DSA-specific orchestration: DSAs explicitly orchestrate data movement, overlap computation with memory access, and maximize DRAM channel utilization by leveraging domain-specific knowledge to efficiently pack and unpack data on-chip.
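As a concrete example of the access patterns the walker item refers to, consider a sparse matrix-vector multiply over a CSR matrix: every nonzero forces an indirect, metadata-dependent load of the input vector, which a plain base+stride address generator cannot produce. The sketch below is only meant to exhibit that address stream, not to represent any particular DSA.

```cpp
#include <cstddef>
#include <vector>

// Sparse matrix-vector multiply over a CSR matrix: the gather of x[colIdx[k]]
// is the kind of metadata-driven access a DSA-specific walker must generate.
struct CSRMatrix {
    std::vector<size_t> rowPtr;   // rowPtr[i]..rowPtr[i+1] bound row i's nonzeros
    std::vector<size_t> colIdx;   // column index of each nonzero
    std::vector<double> values;   // value of each nonzero
};

std::vector<double> spmv(const CSRMatrix &A, const std::vector<double> &x) {
    std::vector<double> y(A.rowPtr.size() - 1, 0.0);
    for (size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        for (size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            // Indirect access: the address stream depends on the matrix metadata.
            y[i] += A.values[k] * x[A.colIdx[k]];
        }
    }
    return y;
}
```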
Multiple research directions can be explored in this context:
Objective: Replicate and extend techniques from Phi and Coup to determine whether DSAs can benefit from approximate parallel memory operations. Create a general framework that DSAs can exploit for scatter/gather and reduction operations.
Tools: zsim simulator
Objective: Develop a systematic framework for design space exploration to optimize dynamic updates in graph applications.
Tools: GraphBolt
Objectives:
Relevant Papers:
Objectives:
Relevant Papers:
Benchmarks: MachSuite
Many custom DSAs have been designed to target specific algorithm kernels. However, the fundamental challenge in DSA design is determining what components should remain programmable or reconfigurable.
Choose an application domain and analyze the trade-offs between fixed-function hardware and programmability. Evaluate the costs and benefits of making different DSA components programmable.
Overheads of Programmability:
Benefits of Programmability:
Large Language Models (LLMs) have shown remarkable capabilities in generating code, but they often struggle to produce performance-optimized implementations. This project explores fine-tuning LLMs to generate high-performance code by training them on architecture-specific optimizations and performance patterns.
Fine-tune an existing LLM (such as GPT, CodeLlama, or StarCoder) to generate optimized code for specific computational kernels or domains. The project focuses on teaching the model to apply performance optimizations such as vectorization, loop tiling, memory access patterns, and parallelization techniques.
Dataset Creation: Curate or generate a dataset of code pairs showing unoptimized and optimized versions of computational kernels, annotated with performance characteristics and optimization techniques applied.
Fine-tuning Strategy: Implement a fine-tuning approach using techniques such as instruction tuning, reinforcement learning from performance feedback, or supervised learning on optimized code examples.
Evaluation Methodology: Develop metrics to evaluate both functional correctness and performance improvements, including execution time, memory efficiency, and resource utilization.
Domain Focus: Select a specific application domain (e.g., linear algebra kernels, image processing, graph algorithms, or numerical computations) to specialize the model’s optimization capabilities.
Spectre v1 exploits speculative execution to leak private data from memory. An expensive defense against such attacks is to disable speculative execution entirely; however, recent research has explored mechanisms with much lower performance impact. In this project, you will compare the performance impact in gem5 of two such strategies: InvisiSpec and Speculative Taint Tracking.
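For context, the canonical Spectre v1 (bounds-check bypass) gadget is sketched below: when the branch mispredicts for an out-of-bounds index, the two dependent loads still execute speculatively, and the address of the second load encodes a secret byte in the cache, which an attacker can later recover with a timing probe. This is exactly the speculative side effect that InvisiSpec and Speculative Taint Tracking aim to make unobservable; the array names follow the convention of the original proof of concept and are illustrative.

```cpp
#include <cstdint>
#include <cstddef>

uint8_t array1[16];            // in-bounds data the victim legitimately accesses
size_t  array1_size = 16;
uint8_t array2[256 * 64];      // probe array: one cache line per possible byte value

void victim(size_t idx) {
    if (idx < array1_size) {                          // predicted taken, resolves late
        uint8_t secret = array1[idx];                 // speculative out-of-bounds load
        volatile uint8_t tmp = array2[secret * 64];   // leaks 'secret' via the cache
        (void)tmp;                                    // volatile keeps the load from being optimized away
    }
}
```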
…more ideas may be posted later.