
CMPT 450/750 Course Project

Project Guidelines

  • The project is a significant part of your grade (30%). You should use the project to gain experience in developing and evaluating ideas in computer architecture.
  • You are encouraged to come up with your own interesting and/or novel project ideas. However, your proposed projects should be at the same level as other projects in the course. Instructor approval is required for all proposed project ideas. Please discuss project ideas with the instructors during office hours before submitting a proposal.
  • You should form project teams of 2-3 students each, and attempt to have people with diverse skill sets on your team. Please form teams as soon as possible (deadline: March 2), and use “Project Groups” on Canvas to register the names of the students on each team by March 2.
  • Project Proposal [40 points]: (Due March 5 at midnight) You should submit a project proposal containing the following information: project title, project team members’ names and email addresses, a description of your project topic, a statement explaining why this topic is important, a description of the methods you will use to evaluate your ideas, and references to at least two papers related to your topic. Important: Your project proposal should clearly describe your project goals and some milestones you need to achieve to reach those goals. Project topics can be chosen from the list of project ideas below, or can be an idea pre-approved by the instructor. Submit your project proposal on Canvas (link to be posted before the deadline).

Project Proposal Format: The proposal should be in PDF format. It should be no longer than two pages in a single-column, single-spaced, 10pt Times font. Margins are one inch each (top, bottom, left, and right).

  • Instructors will grade and provide feedback on your proposal by March 7.
  • Project Progress Report [40 points]: (Due March 23 at midnight) This should be an updated version of the project proposal that describes the progress you have made so far. Focus on the tasks you have completed rather than tasks you have just started or plan to start. The report should provide at least four references to papers related to your project. Your report will be graded based on how close you came to reaching 50% of your project milestones. You should submit your progress report on Canvas. Project Progress Report Format: The format for the progress report is the same as for the project proposal, but with a five-page limit instead of two.
  • Project Presentations [60 points]: (April 10) Each team is required to record a presentation that will be made available to the whole class. Each presentation should be at most 15 minutes long, and every member of the team is required to present part of it. The presentation should explain the project idea and highlight key findings and results. All students are invited and encouraged to watch all project presentations and ask questions on Piazza. However, the target audience for your presentation is the instructors, who will grade it.
  • Final Project Report and Code [160 points]: (Due April 10 at midnight) Your report should mimic a conference paper similar to the ones we covered in class. The report should include a title, author names, abstract, introduction, background, explanation of your project, evaluation results, a conclusion, references, and optional appendices. Your report should be no longer than 10 pages using the same format as your project proposal and progress report. Any extra material beyond 10 pages should be put in appendices; please note that material beyond 10 pages may not be considered for grading. As part of your final project submission, you also need to submit your project code. Grading will be based on project quality and difficulty (30 points), implementation and results (60 points), report quality (50 points), and documentation and code instructions (20 points). Important: Your project report’s appendix should include a description of each project member’s contribution to the project code, report, and presentation.

Project submission instructions

Submit the following on Canvas:

  • (1) Project report (in PDF).
  • (2) A README text file with a pointer to your project code and detailed instructions for how to run it. You need to give all instructors and TAs permission to access your code.
  • (3) A PDF version of your project presentation.

Project Grading Guidelines

For each of the projects described below, students will be graded based on the tasks that have been accomplished in the project. Each project has two goals, and the grade will be based only on finished goals: if only one goal is accomplished, the maximum project score is 50%; if both goals are accomplished, the maximum project score is 100%. Note that these maxima are caps, not guarantees; the actual score also depends on the project proposal, progress report, presentation, and final report/code.

Project Ideas:

Microarchitecture

Branch Prediction Championship Submission:

Many architecture competitions have been held over the past 20 years to advance the state of the art in different areas of computer architecture, e.g., branch prediction, cache replacement, prefetching, and value prediction. Branch prediction is an area where advancements lead to big benefits in performance and energy efficiency. Six branch prediction championships have been held since 2004, the latest in 2025. Another championship is ongoing, with the goal of proposing, implementing, and evaluating ideas that improve branch prediction performance without increasing energy. Please visit this webpage for more details: https://cbp-ng.bpchamp.com/

In this project, you need to learn the simulation tool used for the CBP-NG, propose a novel branch prediction idea or an improvement over existing ideas (e.g., TAGE), implement your idea in the championship simulator, and evaluate its performance and energy. If you manage to propose a novel predictor and improve performance/energy over existing baselines, you can actually submit your work to the competition (please speak with your professors about logistics).

Multi-Branch Predictor:

Implement a branch predictor in gem5 that can predict two, three, or four conditional branches per cycle. Choose a technique from the literature that can predict multiple branches at once, and compare its performance and prediction accuracy to that of the bimodal predictor implemented in gem5. Some references to consider:

  • https://hps.ece.utexas.edu/pub/yeh_ics7.pdf
  • https://ftp.cs.wisc.edu/pub/sohi/papers/1997/micro.trace-prediction.pdf
  • http://web.cecs.pdx.edu/~alaa/ece587/papers/rotenberg_micro_1996.pdf
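As a warm-up, the bimodal baseline you will compare against can be sketched in a few lines of Python; the table size, index hash, and counter initialization below are illustrative assumptions rather than gem5's exact implementation.

```python
class BimodalPredictor:
    """Bimodal predictor sketch: one 2-bit saturating counter per entry,
    indexed by low-order PC bits. Illustrative only, not gem5's code."""

    def __init__(self, entries=4096):
        self.entries = entries
        self.counters = [2] * entries        # start weakly taken

    def _index(self, pc):
        return (pc >> 2) % self.entries      # drop byte offset, wrap into table

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # True = taken

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


# A multi-branch front end would perform several such lookups per cycle,
# one per branch in the predicted fetch stream, instead of a single lookup.
bp = BimodalPredictor(entries=16)
for _ in range(4):                   # train on a loop branch that is always taken
    bp.update(0x400100, True)
print(bp.predict(0x400100))          # True: the counter has saturated at 3
```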

Value Prediction:

Loads that result in cache misses significantly reduce performance. One proposed mechanism to reduce their impact is to predict the values of load instructions before they execute and forward these values to dependent instructions. When successful, load value prediction improves performance by allowing instructions dependent on load misses to execute early. However, incorrectly predicted loads require flushing the pipeline, which incurs a significant penalty (similar to branch mispredictions or memory dependence mispredictions). In this project, implement a Markov load value predictor in gem5 that uses the history of previously executed loads to predict future load values. Model both the benefit of a correctly predicted value and the cost of an incorrectly predicted one, and compare this predictor against a baseline without value prediction. For more information about Markov predictors, consider this paper:

Y. Sazeides and J.E. Smith, “The Predictability of Data Values,” MICRO 1997. Paper: https://ieeexplore.ieee.org/abstract/document/645815 Talk: https://ftp1.cs.wisc.edu/sohi/talks/1997/micro.predictability.pdf
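To make the mechanism concrete, here is a minimal Python sketch of a first-order Markov value predictor in the spirit of Sazeides and Smith; real hardware would use finite, tagged tables rather than the unbounded dictionaries assumed here.

```python
from collections import Counter, defaultdict

class MarkovValuePredictor:
    """First-order Markov load value predictor sketch. Per load PC, it counts
    which value followed which, and predicts the most frequent successor of
    the last observed value. Table capacity and tags are omitted for clarity."""

    def __init__(self):
        self.last = {}                        # pc -> last value produced
        self.trans = defaultdict(Counter)     # (pc, prev value) -> next-value counts

    def predict(self, pc):
        prev = self.last.get(pc)
        if prev is None:
            return None                       # no history yet: no prediction
        nexts = self.trans[(pc, prev)]
        return nexts.most_common(1)[0][0] if nexts else prev  # fall back to last value

    def train(self, pc, value):
        prev = self.last.get(pc)
        if prev is not None:
            self.trans[(pc, prev)][value] += 1
        self.last[pc] = value

# A load alternating between two values becomes fully predictable:
vp = MarkovValuePredictor()
for v in (1, 2, 1, 2):
    vp.train(0x500, v)
print(vp.predict(0x500))   # 1: after seeing 2, the value 1 always followed
```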

Agents, LLMs, and CPU Bottlenecks

A CPU-Centric Perspective on Agentic AI

https://arxiv.org/abs/2511.00739

https://github.com/ritikraj7/cpu-centric-agentic-ai

An Example C++ Agent (you can set it up in VS Code).

Math Agent

A tool/agent is a simple function that takes some input and produces some output; a calculator is one example. To interact with LLMs, tools/agents are wrapped in a REST API, and LLMs call that API with JSON payloads.
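The wrapping described above boils down to parse-JSON, dispatch, serialize-JSON. A minimal dependency-free Python sketch follows; the payload schema is an assumption, and a real deployment would put `handle_request` behind an HTTP framework such as Flask.

```python
import json

def calculator(payload):
    """A 'tool' in the agent sense: a pure function over a JSON-style payload."""
    ops = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}
    return {"result": ops[payload["op"]](payload["a"], payload["b"])}

def handle_request(body):
    """The REST wrapper: parse the JSON body, dispatch to the tool, serialize
    the JSON response. (The 'tool' field would select among tools; only the
    calculator exists in this sketch.)"""
    request = json.loads(body)
    response = calculator(request["arguments"])
    return json.dumps(response)

print(handle_request('{"tool": "calculator", "arguments": {"op": "add", "a": 2, "b": 3}}'))
# prints {"result": 5}
```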

Our infrastructure

  • 24 GB RTX 4090 GPU (can run coding models up to 14B-20B parameters, such as Qwen)
  • 64 GB of DDR4 memory and a 16-core CPU.

This paper

  • In representative agentic workloads, CPU tool processing can dominate end-to-end latency (up to ~90.6%), not GPU inference.
  • Throughput saturation often comes from CPU-side limits (core over-subscription, cache-coherence, synchronization) or GPU-side limits (HBM capacity/bandwidth).
  • At large batch sizes, CPU dynamic energy can be a big slice (reported up to ~44%).
  • Two concrete scheduling ideas help: CGAM (CPU/GPU-aware micro-batching) and MAWS (mixed workload scheduling), with notable P50/P99 gains.

Thus: “the CPU is the bottleneck, now what do we do about it—at runtime, OS, and microarchitecture?”


1) Tool-call microarchitecture: treat tools as first-class kernels

Idea: Build an Agentic Tool Kernel Suite and optimize the “boring” CPU work: JSON parsing/serialization, HTTP fetch, decompression, embedding pre/post, tokenization, vector DB lookups, file-system walks, sandbox/VM overhead.

Research questions

  • Which sub-kernels dominate per tool class (retrieval vs web vs code exec)? Does dominance shift with agent flow type (single-step vs multi-step)?
  • Can ISA/microarch features (string ops, SIMD gather, prefetch hints, compression assist) move the needle more than “more cores”?

Method

  • Trace tool invocations + CPU profiles (cycles, LLC misses, syscalls, context switches).
  • Build a microbenchmark suite that replays tool-kernel traces.
  • Evaluate with “cycles per tool step”, tail latency, and Joules/query.
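The metrics above can be computed directly from per-step traces. A small Python sketch, assuming each trace entry is a (cycles, joules) pair; that record format is invented for illustration:

```python
import math

def tail_latency(samples, pct):
    """Nearest-rank percentile over per-tool-step latency samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def summarize(trace, queries):
    """trace: list of (cycles, joules) per tool step -- an assumed format."""
    cycles = [c for c, _ in trace]
    joules = sum(j for _, j in trace)
    return {
        "cycles_per_tool_step": sum(cycles) / len(cycles),
        "p99_cycles": tail_latency(cycles, 99),
        "joules_per_query": joules / queries,
    }

print(summarize([(100, 1.0), (200, 2.0), (300, 3.0), (400, 4.0)], queries=2))
```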

Publishable angle: “Agentic tool kernels resemble datacenter RPC + analytics primitives more than ML kernels.”


2) Generalize CGAM from “two-stage” to arbitrary agent graphs

CGAM is presented with a tools→LLM pattern and a batch cap selection rule.

Extension: Agents in practice are graphs: retrieve→rerank→summarize→code-exec→retrieve again…

Research question

  • How do you do micro-batching when you have branching + dynamic paths (their characterization includes static vs dynamic paths)?

Approach

  • Model the agent as a DAG of stages with (latency, CPU cores, GPU mem, fanout).
  • Solve a scheduling problem: choose micro-batch sizes per stage + overlap policy to minimize tail latency under resource constraints.
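As a starting point, the scheduling problem can be prototyped with a toy analytic model: each stage has an assumed fixed per-batch overhead plus a per-item cost, and we exhaustively pick one micro-batch size per stage. The stage names and constants are placeholders, and stages run sequentially here rather than as a true overlapping DAG.

```python
from itertools import product

# Toy stage model: per-item service time plus fixed per-batch overhead.
# Real CGAM would use measured CPU/GPU profiles; these numbers are assumptions.
stages = {
    "retrieve":  {"per_item": 2.0, "overhead": 5.0},
    "rerank":    {"per_item": 1.0, "overhead": 8.0},
    "summarize": {"per_item": 4.0, "overhead": 3.0},
}

def stage_latency(cfg, batch):
    return cfg["overhead"] + cfg["per_item"] * batch

def best_batches(stages, items=32, choices=(1, 2, 4, 8, 16, 32)):
    """Exhaustively pick a micro-batch size per stage to minimize total
    pipeline latency for `items` requests. A real Graph-CGAM would add
    overlap, branching, and resource constraints on top of this."""
    best = None
    for sizes in product(choices, repeat=len(stages)):
        total = sum(
            stage_latency(cfg, b) * (items // b)   # batches needed at this stage
            for cfg, b in zip(stages.values(), sizes)
        )
        if best is None or total < best[0]:
            best = (total, dict(zip(stages, sizes)))
    return best

total, sizes = best_batches(stages)
print(sizes)   # under this model, larger batches always amortize the overhead
```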

Publishable output: “Graph-CGAM: micro-batching for dynamic agent DAGs.”


3) Heterogeneous CPU scheduling: big.LITTLE and SMT-aware MAWS

MAWS separates CPU-heavy vs LLM-heavy and adapts threading vs multiprocessing.

Extensions

  • Core-type aware MAWS: run latency-sensitive orchestration + parsing on “big” cores; background fetch/decompress on “little” cores.
  • SMT partitioning: avoid tool threads sharing a physical core with latency-critical orchestration.
  • Interference-aware admission control: learn a contention model (LLC/TLB/memory BW) and throttle tool parallelism to protect tail latency.
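On Linux, the core-type-aware placement in the first extension can be prototyped from user space with CPU affinity masks. The big/little core numbering below is an assumption; real topologies should be read from /sys/devices/system/cpu/ or hwloc, and the worker bodies are left as stubs.

```python
import os

# Assumed topology for illustration: cores 0-3 are "big", 4-15 are "little".
BIG_CORES = {0, 1, 2, 3}
LITTLE_CORES = set(range(4, 16))

def pin_current_thread(cores):
    """Restrict the calling thread to a core set (Linux-only; with pid 0,
    sched_setaffinity applies to the calling thread)."""
    os.sched_setaffinity(0, cores)

def orchestrate():
    # Latency-critical orchestration and parsing stay on big cores.
    pin_current_thread(BIG_CORES)
    ...

def background_fetch():
    # Throughput-oriented fetch/decompress work moves to little cores,
    # keeping it off the physical cores serving the critical path.
    pin_current_thread(LITTLE_CORES)
    ...
```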

4) Agentic workload benchmarks + traces (that the community will use)

A very practical extension is to make a standard benchmark suite that includes:

  • Orchestrator types (LLM vs host), path (static vs dynamic), flow (single vs multi-step).
  • Metrics beyond throughput: tool-step latency distribution, context-switch rate, LLC miss rate, Joules/query.

Why it’s publishable: right now, many “agent” evaluations are opaque; a rigorous suite would shape future work.


Caches

Cache Replacement Policy with Bypassing:

Effective cache replacement policies can significantly reduce cache miss rates and improve performance. Recent research has shown that it is beneficial to bypass insertion of blocks into the cache when they are not predicted to be reused. An example of such research is the winner of the cache replacement championship (http://www.jilp.org/jwac-1/online/papers/005_gao.pdf). In this project, you should implement a cache replacement policy with bypassing in gem5 and compare its performance to the default cache replacement policy already implemented in gem5.
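To illustrate the decision being added, here is a toy Python model of LRU replacement with bypassing driven by per-PC 2-bit reuse counters. This shows the idea only; it is not the championship-winning policy from the linked paper, and the table sizes and counter policy are assumptions.

```python
class BypassingCache:
    """Set-associative cache with LRU replacement plus bypassing: fills whose
    requesting PC has shown no reuse are not inserted. The per-PC 2-bit reuse
    counters are an illustrative predictor, assumed for this sketch."""

    def __init__(self, sets=4, ways=2):
        self.sets = [dict() for _ in range(sets)]   # line addr -> last-use tick
        self.ways = ways
        self.reuse = {}                             # pc -> 2-bit reuse counter
        self.tick = 0

    def access(self, pc, line):
        self.tick += 1
        s = self.sets[line % len(self.sets)]
        ctr = self.reuse.get(pc, 2)                 # default: weakly reusing
        if line in s:                               # hit: reward the PC
            s[line] = self.tick
            self.reuse[pc] = min(3, ctr + 1)
            return "hit"
        self.reuse[pc] = max(0, ctr - 1)            # miss: penalize the PC
        if ctr == 0:
            return "bypass"                         # predicted dead: don't insert
        if len(s) >= self.ways:
            del s[min(s, key=s.get)]                # evict the LRU line
        s[line] = self.tick
        return "miss"

# A streaming PC that never re-touches its lines is eventually bypassed:
c = BypassingCache()
print([c.access(2, line) for line in (1, 5, 9)])   # ['miss', 'miss', 'bypass']
```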

Cache Replacement Imitating Belady’s OPT Policy:

Effective cache replacement and dead block prediction mechanisms can greatly reduce cache misses and improve performance. A recent paper published in HPCA 2022 presented a mechanism (Mockingjay) that mimics Belady’s optimal replacement policy to approach optimal performance. In this project, you should implement Mockingjay in gem5 and compare its performance to existing replacement policies already implemented in gem5.

Reference: Ishan Shah, Akanksha Jain and Calvin Lin, Effective Mimicry of Belady’s MIN Policy, HPCA 2022. Link: https://www.cs.utexas.edu/~lin/papers/hpca22.pdf
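A useful first step is computing Belady's MIN decisions offline on an address trace, since Mockingjay is evaluated against exactly this oracle: on eviction, drop the resident line whose next use is farthest in the future. A straightforward (unoptimized) Python sketch for a fully associative cache:

```python
def belady_misses(trace, capacity):
    """Count misses under Belady's MIN policy for a fully associative cache
    of `capacity` lines. On eviction, the victim is the resident line reused
    farthest in the future (or never again). O(n * capacity) for clarity."""
    cache, misses = set(), 0
    for i, addr in enumerate(trace):
        if addr in cache:
            continue                                 # hit: nothing to do
        misses += 1
        if len(cache) >= capacity:
            def next_use(line):
                for j in range(i + 1, len(trace)):
                    if trace[j] == line:
                        return j
                return float("inf")                  # never reused: ideal victim
            cache.remove(max(cache, key=next_use))
        cache.add(addr)
    return misses

# 3 cold misses, then line 3 (never reused) is evicted for line 4: 4 misses.
print(belady_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=3))
```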

Cache Prefetching Policy:

Cache prefetching mechanisms can greatly reduce compulsory and capacity misses and therefore improve performance. However, aggressive prefetching can evict useful blocks from the cache, which can be counter-productive; an accurate prefetcher improves performance while avoiding these extra misses. In this project, you should implement the Signature Path Prefetcher (SPP) in gem5 and compare its performance to the existing prefetchers already implemented in gem5, and to no prefetching.

Reference: J. Kim, S. Pugsley, P. V. Gratz, A. Reddy, C. Wilkerson, and Z. Chishti, Path Confidence based Lookahead Prefetching, MICRO 2016. Link: https://ieeexplore.ieee.org/document/7783763
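The core SPP mechanism, compressing a page's recent delta history into a signature that indexes a pattern table, can be sketched as follows. The signature width, hash function, and single-step lookahead here are simplifications and assumptions; the full design adds path confidence, multi-step lookahead, and throttling.

```python
class SPPSketch:
    """Stripped-down Signature Path Prefetcher sketch. A per-page signature
    hashes the recent block-delta history; a pattern table maps signature ->
    counts of the delta that followed. Constants are illustrative, and
    negative deltas are folded naively by the 6-bit mask."""

    def __init__(self):
        self.sig = {}        # page -> (signature, last block offset)
        self.pattern = {}    # signature -> {delta: count}

    def _update_sig(self, sig, delta):
        return ((sig << 3) ^ (delta & 0x3F)) & 0xFFF   # 12-bit signature

    def access(self, addr, block=64, page=4096):
        page_id, offset = addr // page, (addr % page) // block
        sig, last = self.sig.get(page_id, (0, None))
        prediction = None
        if last is not None:
            delta = offset - last
            self.pattern.setdefault(sig, {})
            self.pattern[sig][delta] = self.pattern[sig].get(delta, 0) + 1
            sig = self._update_sig(sig, delta)         # advance the signature
        if sig in self.pattern and self.pattern[sig]:
            best = max(self.pattern[sig], key=self.pattern[sig].get)
            prediction = page_id * page + (offset + best) * block
        self.sig[page_id] = (sig, offset)
        return prediction      # next address to prefetch, or None

# On a stride-1 stream the signature converges and prediction kicks in:
p = SPPSketch()
print([p.access(a) for a in range(0, 384, 64)])
```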

Domain-Specific Architectures

Kernel Accelerators (Default: 750)

Recent ISCA, MICRO, and HPCA programs have included multiple kernel accelerators, i.e., accelerators with minimal programmability that offload a particular kernel end-to-end. Implement one of these accelerators in the gem5-SALAM framework and evaluate the design. Some suggested accelerators:

Grading scheme:

  • 30% for creating a DMA-based accelerator
  • 70% for extending the accelerator to incorporate streaming
  • 100% for sweeping the design space, finding optimal parameters, and modeling RAM energy using CACTI

DSA Memory Hierarchy

Publishable result

Background

While Domain-Specific Architectures (DSAs) differ significantly across application domains, they share common underlying principles. As noted by Dally et al. and Hennessy and Patterson, DSAs must exploit data locality and minimize global memory accesses to achieve high performance and energy efficiency. Consequently, most DSA resources (both area and energy) are dedicated to organizing on-chip memory hierarchies and efficiently fetching data from DRAM.

DSAs leverage three key optimizations:

  1. DSA-specific data types: Early DSAs primarily handled dense data in regular loop nests. Modern DSAs now support complex, non-indexed metadata-based structures such as compressed sparse matrices, graph nodes, and database indexes.

  2. DSA-specific walkers: Like CPUs, DSAs use hardwired address generators and DRAM fetchers to maximize memory bandwidth. While simple base+offset addressing suffices for dense arrays, state-of-the-art DSAs require sophisticated walkers that reference multiple elements with complex access patterns.

  3. DSA-specific orchestration: DSAs explicitly orchestrate data movement, overlap computation with memory access, and maximize DRAM channel utilization by leveraging domain-specific knowledge to efficiently pack and unpack data on-chip.
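As a concrete instance of a DSA-specific walker (optimization 2), here is a sketch of an address generator for one row of a compressed-sparse-row (CSR) matrix. The 4-byte element size and the separate column-index and value base addresses are assumed layout details for illustration.

```python
def csr_row_walker(row, row_ptr, base_cols, base_vals, elem=4):
    """Address generator for one CSR row: yields the (col_index, value)
    byte-address pairs a hardware walker would issue. row_ptr is the usual
    CSR row-pointer array; elem is the assumed element size in bytes."""
    start, end = row_ptr[row], row_ptr[row + 1]
    for i in range(start, end):
        yield base_cols + i * elem, base_vals + i * elem

# Example: a 3x3 matrix with nonzeros at (0,0), (0,2), and (2,1).
row_ptr = [0, 2, 2, 3]
addrs = list(csr_row_walker(0, row_ptr, base_cols=0x1000, base_vals=0x2000))
print([(hex(c), hex(v)) for c, v in addrs])   # two pairs: row 0 has 2 nonzeros
```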

Project Ideas

Multiple research directions can be explored in this context:

1. Approximate Parallel Scatter/Gather and Reduction

Objective: Replicate and extend techniques from Phi and Coup to determine whether DSAs can benefit from approximate parallel memory operations. Create a general framework that DSAs can exploit for scatter/gather and reduction operations.

Tools: zsim simulator

2. Graph Accelerator Design Space Exploration

Objective: Develop a systematic framework for design space exploration to optimize dynamic updates in graph applications.

Tools: GraphBolt

3. Scratchpad Hierarchies

Objectives:

  • Develop a methodical approach to convert pull-based memory models to push-based models
  • Implement a state machine controller to orchestrate data movement between scratchpad levels
  • Create a flexible model supporting sparse linear algebra operations

Relevant Papers:

4. Design Space Exploration for Memory Bank Organization

Objectives:

  • Create a framework to systematically explore memory bank organization in DSAs
  • Develop an analytical and mathematical model based on DSA compute schedules to optimize scratchpad organization

Relevant Papers:

Benchmarks: MachSuite

DSA Definition

Programmable Accelerators

Background

Many custom DSAs have been designed to target specific algorithm kernels. However, the fundamental challenge in DSA design is determining what components should remain programmable or reconfigurable.

Project Objective

Choose an application domain and analyze the trade-offs between fixed-function hardware and programmability. Evaluate the costs and benefits of making different DSA components programmable.

Key Considerations

Overheads of Programmability:

  1. Cost of storing and retrieving instructions from associated RAM
  2. Cost of dynamically scheduling instructions to spatial resources
  3. Cost of transferring operands to scheduled resources

Benefits of Programmability:

  • Reusability: Hardware can be repurposed for different workloads
  • Elimination of dark silicon: In heterogeneous CGRAs, when a specialized processing element (PE) is idle, associated components (register files, routers) also remain underutilized. Homogeneous CGRAs enable component sharing, improving overall utilization across execution phases.

Relevant Papers

Suggested Application Domains

  • Image processing
  • Tensor processing
  • Security
  • Databases
LLM-based High Performance Code Generation

Large Language Models (LLMs) have shown remarkable capabilities in generating code, but they often struggle to produce performance-optimized implementations. This project explores fine-tuning LLMs to generate high-performance code by training them on architecture-specific optimizations and performance patterns.

Project Objective

Fine-tune an existing LLM (such as GPT, CodeLlama, or StarCoder) to generate optimized code for specific computational kernels or domains. The project focuses on teaching the model to apply performance optimizations such as vectorization, loop tiling, memory access patterns, and parallelization techniques.

Key Tasks

  1. Dataset Creation: Curate or generate a dataset of code pairs showing unoptimized and optimized versions of computational kernels, annotated with performance characteristics and optimization techniques applied.

  2. Fine-tuning Strategy: Implement a fine-tuning approach using techniques such as instruction tuning, reinforcement learning from performance feedback, or supervised learning on optimized code examples.

  3. Evaluation Methodology: Develop metrics to evaluate both functional correctness and performance improvements, including execution time, memory efficiency, and resource utilization.

  4. Domain Focus: Select a specific application domain (e.g., linear algebra kernels, image processing, graph algorithms, or numerical computations) to specialize the model’s optimization capabilities.
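For Task 1, the code-pair dataset could be stored as JSON Lines, one record per kernel pair. The schema below is one possible convention rather than a standard, and the saxpy example and speedup number are made up for illustration.

```python
import json

# One possible (assumed) record schema for the optimization-pair dataset.
record = {
    "kernel": "saxpy",
    "domain": "linear_algebra",
    "unoptimized": "for (int i = 0; i < n; i++) y[i] += a * x[i];",
    "optimized": "#pragma omp simd\nfor (int i = 0; i < n; i++) y[i] += a * x[i];",
    "optimizations": ["vectorization"],
    "speedup": 3.1,          # would be measured on the target machine (Task 3)
}

def to_jsonl(records):
    """Serialize records to JSON Lines, a format commonly used for
    fine-tuning datasets in the Hugging Face ecosystem."""
    return "\n".join(json.dumps(r) for r in records)

line = to_jsonl([record])
print(line)
```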

Expected Outcomes

  • A fine-tuned LLM capable of generating performance-optimized code
  • Comparative analysis of generated code performance against baseline implementations
  • Documentation of which optimization patterns the model successfully learned
  • Analysis of the model’s limitations and failure cases

Relevant Benchmarks and Datasets

Tools and Frameworks

  • Fine-tuning: Hugging Face Transformers, DeepSpeed, LoRA/QLoRA
  • Performance profiling: perf, Intel VTune, or similar tools
  • Code generation evaluation: CodeBLEU, execution-based metrics
Security

Impact of Spectre Defense Mechanisms on Performance:

Spectre v1 exploits speculative execution to leak private data from memory. An expensive way to defend against such attacks is to disable speculative execution. However, recent research has explored mechanisms with much lower performance impact. In this project, you need to compare the performance impact in gem5 of two such strategies: InvisiSpec and Speculative Taint Tracking.

…more ideas may be posted later.