SFU Architecture I Class Projects

Project Guidelines
Project Grading Guidelines
Project Ideas:
Branch predictor
Caches
Domain-Specific Architectures
- Kernel Accelerators (Default: Grad)
  - Grading scheme:
DSA memory hierarchy
DSA Definition
- Programmable Accelerators
- Building hardware accelerators on FPGA.
Security
- Impact of Spectre Defense Mechanisms on Performance:

Project Guidelines

The project is a significant part of your grade (25%). You should use the project to gain experience in developing and evaluating ideas in computer architecture.
You are encouraged to come up with your own project ideas. Bonus points will be awarded to project with interesting and/or novel proposals. However, your proposed projects should be at the same level as other projects in the course. Instructor approval is required for all proposed project ideas. Please discuss project ideas with the instructors during office hours before submitting a proposal.
You should form project teams, 2-3 student each. You should attempt to have people with diverse skill sets on your team. Please form teams as soon as you are able to. Please email your instructors by Oct 1st with names of students on each team.
Project Proposal [3 points]: (Due Oct 18 at midnight) You should submit a project proposal containing the following information: Project title, project team members’ names and email addresses, a description of your project topic, a statement explaining why this topic is important, a description of the methods you’ll use to evaluate your ideas, and references to at least two papers related to your topic. Important: Your project proposal should clearly describe your project goals, and some milestones you need to achieve to reach those goals. Project topics can be chosen from the list of project ideas below, or could be an idea pre-approved by the instructor. You submit your project proposal on canvas (link to be posted before the deadline)

Project Proposal Format: The proposal should be in pdf format. It should be no longer than two pages in a single-column single-spaced 10pt Times font. Margins are one inch each (top, bottom and both sides).

Instructors will grade and provide feedback on your proposal by Oct 22.
Project Progress Report [3 points]: (Due Nov 15 at midnight) This should be an updated version of the project proposal that describes the progress you’ve made so far in your project. You should focus on the tasks that you have completed and not on tasks that you’ve just started or that you have planned to start on. The report should provide at least four references for papers related to your project. Your report will be graded based on how close you came to reaching 50% of your project milestones. You should submit your progress report on canvas. Project Progress Report Format: Format for the progress report is the same as the project proposal but with a five-page limit instead of two.
Project Presentations [5 points]: (Dec 3/7/8) Each team is required to give a presentation to the whole class on Dec 3rd (in class, after last quiz) or on zoom (Dec 7/8 10-11:30AM). Each presentation should be 15 minutes long (including 3 minutes for Q&A). Each member in the team is required to present a part of the presentation. The presentation should higlight important findings from the project, explain the project idea and key results. All students are invited and encouraged to attend all project presentations. However, the target audience for your presentation is both instructors who will grade your presentation.
Final Project Report and Code [14 points]: (Due Dec 15 at midnight) Your report should mimic a conference paper similar to the ones we covered in class. The report should include a title, author names, abstract, introduction, background, explanation of your project, evaluation results, a conclusion, references, and optional appendices. Your report should be no longer than 10 pages using the same format as your project proposal and progress report. Any extra material beyond 10 pages should be put in appendices. Please note that any material beyond 10 pages may not be considered for grading. As part of your final project submission, you also need to submit your project code. Instructions on final project submissions will be provided as we get closer to the project deadline. Important: Your project report Appendix should include a description of each project member’s contribution to the project code, report and presentation.
Project submission instructions: Submit the following on Canvas

(1) Project report (in pdf).

(2) A README text file with a pointer to your project code and detailed instructions for how to run it. You need to give both instructors and both TAs permission to access your code.

(3) A pdf version of your project presentation.

Project Grading Guidelines

For each of the projects described below, students will be graded based on the tasks that have been accomplished in the project. Each project will have two goals, and the grade will be based only on finished goals. If only one goal is accomplished, the maximum project score will be 50%. If both goals are accomplished, the maximum project score will be 100%. Note that these maximum levels do not mean that students will always achieve the maximum. The actual student score will be based on other aspects: Project proposal, progress report, presentation and final report/code.

Project Ideas:

Branch predictor

Branch Prediction Confidence:

Accurate branch prediction is critical to improve performance in superscalar processors. However, if a certain branch is frequently mispredicted, it could be better to stall instruction fetch until that branch is resolved. Explore existing work in evaluating branch prediction confidence, and implement branch confidence predictor techniques in gem5. Evaluate the performance for the configuration where you predict all branches, and different configurations where you place a confidence threshold on predicting branches and only predict branches above a certain confidence level. You need to implement and evaluate three of the branch confidence prediction techniques presented in prior work. Some references to consider:

https://people.engr.ncsu.edu/ericro/publications/conference_MICRO-29_jrs.pdf
https://hal.inria.fr/file/index/docid/521098/filename/RR-7371.pdf

Multi-Branch Predictor:

Implement a branch predictor in gem5 that could predict two, three, and four conditional branches every cycle. Choose a technique from literature that could be used to predict many branches. Compare its performance and prediction accuracy to that of the bimodal predictor implemented in gem5. Some references to consider:

https://hps.ece.utexas.edu/pub/yeh_ics7.pdf
https://ftp.cs.wisc.edu/pub/sohi/papers/1997/micro.trace-prediction.pdf
http://web.cecs.pdx.edu/~alaa/ece587/papers/rotenberg_micro_1996.pdf

Perceptron Branch Predictor:

Implement the Perceptron branch predictor (Jimenez) in gem5. Compare its accuracy to the other predictors currently implemented in gem5. References to consider:

https://www.cs.utexas.edu/~lin/papers/hpca01.pdf

PC-based vs. Data-address-based Data Cache Prefetching (450):

Implement a stride-based prefetcher (Chen and Baer) or any other data prefetcher you choose. You should use two implementations of the prefetcher: PC-based (where you issue prefetches when you detect patterns for the same PC), and data-address-based (where you issue prefetches based on strided addresses that miss in the cache). Compare performance, prefetching accuracy and coverage for different benchmarks.

Caches

Cache Replacement Policy with Bypassing:

Effective cache replacement policies can significantly reduce cache miss rates and improve performance. Recent research showed that it is beneficial to bypass the insertion of some blocks in the cache if they are not predicted to be reused. An example of such research is the winner of the cache replacement championship (http://www.jilp.org/jwac-1/online/papers/005_gao.pdf). In this project, you should implement a cache replacement policy with bypass in gem5, and compare its performance to the default processor and cache replacement policy already implemented in gem5.

Cache Dead Block Predictor:

Some recent research on caches have pointed out that many cache lines are DOA, and shouldn’t be allowed to stay in the cache for a long time. An example is in the paper “Sampling Dead Block Prediction for Last-Level Caches” . In this project, you need to implement a simplified version of a dead block predictor that predicts such blocks and either bypasses the cache or assigns low priority for them. You should compare your algorithm to the default cache replacement policy in gem5.

Potential for Data Cache Ports (450):

Some architectures use a single read and a single write port to the L1 data cache which might limit the amount of instruction-level parallelism available in a program in order to achieve fast cache access times. You should explore the performance of different benchmarks when the number of read and write ports is limited to 1, 2, 3, …etc. for a six-wide superscalar processor. Following this limit study, you should factor in the impact of access latency increase as you increase the number of ports.

Domain-Specific Architectures

Kernel Accelerators (Default: 750)

Recent conferences at ISCA, MICRO, HPCA have included multiple kernel accelerators. Kernel accelerators are those that include minimal programmability and offload a particular kernel end-to-end. Implement one of these accelerators on the gem5 SALAM framework and evaluate the design. Some suggested accelerators

Grading scheme: (will be updated)

30% for creating accelerator based on DMA
70% for creating accelerator for incorporating streaming
100% for sweeping, finding optimal parameters, modeling RAM energy using cacti

DSA memory hierarchy

Publishable result

By definition, while DSAs seem incomparable to each other, they do adopt a common underlying common theme. Dally et al. and Hennessy and Patterson comment that while computation parallelism and density are important, DSAs must exploit locality and make few global accesses. Thus, most of the resources (area and energy) in current DSAs tend to be dedicated to organizing on-chip memory and fetching data from DRAM. DSAs exploit three main optimizations: i) DSA-specific data types The early wave of DSAs predominantly needed to supply regular loop nests with dense data. However, emerging DSAs work on non-indexed metadata-based data structures e.g., compressed sparse matrix~\cite{sparch}, graph nodes\cite{graphpulse}, or database indexes~\cite{walker}. ii) DSA-specific walkerslike CPUs, DSAs employ hardwired address generators and DRAM fetchers that maximize channel bandwidth. While address generators for DMAing dense arrays tend to be simple \code{base+offset}, state-of-the-art DSAs require complex walkers making referencing multiple elements. iii) DSA-specific orchestration Finally, DSAs explicitly orchestrate movement, overlap computation and maximize DRAM channel utilization. DSAs leverage domain knowledge to pack/unpack data on-chip.

Multiple ideas can be explored in this context

Approximate, parallel scatter/gather or reduction.

Phi, Coup. Replicate the works in these papers and figure out if DSAs can benefit from this. Create a general framework for DSAs to exploit. Simulators

Graph accelerators

Develop a framework for design space exploration to optimize dynamic updates in applications such as graphs. graphbolt

Scratchpad hierarchies

Figure out a methodical way to convert pull-based to push-based models.
Implement a state machine controller to move data between the different levels of scratchpad. Create a flexible model that can support sparse linear algebra.
Relevant papers to read. Buffets Hotpads[https://people.csail.mit.edu/sanchez/papers/2018.hotpads.micro.pdf], ExTensor

Design space exploration

Create a framework to explore the organization of memory banks in a DSA.
Create an analytical and mathematical model based on DSA compute schedule to organize the scratchpads. e.g., Scratchpads in Capsule DNNs Spatial memory banking

Benchmarks Machsuite(https://github.com/harvard-acc/ALADDIN/tree/master/MachSuite)

DSA Definition

Programmable Accelerators

A number of custom DSAs have been created targetting specific algorithm kernels. However, the key challenge in a DSA is to figure out what to leave programmable. Pick your favorite application domain, study the cost of keeping something programmable or reconfigurable. The typical overheads of making a DSA programmable involves i) tcost of storing and retreiving the instructions from associated RAM ii) the cost of dynamically scheduling instructions to spatial resources. iii) the cost of transfering operands to the scheduled resources. The key benefit of making a DSA programmable is reusability and elimination of dark silicon i.e., many parts of the DSA remain active and are not underutilized during phases. For instance in a heterogeneous CGRA if the specialized PE is not-utilized the other associated components such as register files and routers also go underutilized. In a homogeneous CGRA such components tend to be shared and this enables better utilization. See this paper Databases. Queries

Prior references

Suggested application domains, image processing, tensor processing, security, databases.

Building hardware accelerators on FPGA.

Security

Impact of Spectre Defense Mechanisms on Performance:

Spectre v1 (https://spectreattack.com/spectre.pdf) exploits speculative execution to leak private data from memory. An expensive mechanism to defend against such attacks is by disabling speculative execution. However, recent research has explored mechanisms with much lower performance impact. In this project, you need to compare the performance impact in gem5 of two such strategies: Invisispec (http://iacoma.cs.uiuc.edu/iacoma-papers/corrected_micro18.pdf) and Speculative Taint Tracking (http://iacoma.cs.uiuc.edu/iacoma-papers/micro19_2.pdf).

…more ideas may be posted later.