In this assignment you will explore the effectiveness of hardware prefetchers on an actual program. Your task is to use the SPEC2017 benchmarks’ traces to evaluate the effectiveness of a few real-life prefetching schemes. The objective is to compare different prefetching algorithms in a common framework.
ChampSim is a trace-based, function-first simulator used for microarchitecture studies (trace-based, as in ChampSim, as opposed to timing-first, as in gem5).
ChampSim takes six parameters: branch predictor, L1D prefetcher, L2C prefetcher, LLC prefetcher, LLC replacement policy, and the number of cores (we will only conduct single-core runs). For example,
```sh
$ cd $REPO
# Builds a single-core processor with the perceptron branch predictor, a
# next_line L1D prefetcher, no L2 or LLC prefetcher, and the baseline LRU
# replacement policy for the LLC.
$ ./build_champsim.sh perceptron next_line no no lru 1
# Builds with an L2 next_line prefetcher and the L1 prefetcher disabled.
$ ./build_champsim.sh perceptron no next_line no lru 1
# Link to the SPEC 2017 traces.
$ export TRACE_DIR=/data/dpc3_traces/
$ ./run_champsim.sh perceptron-next_line-no-no-lru-1core 5 50 607.cactuBSSN_s-2421B.champsimtrace.xz
# This run takes about 1-2 minutes. For the actual experiments, use 50 million
# warmup and 150 million simulation instructions. When debugging, try at most
# 1 million instructions before moving on to longer runs. The result file is
# written to results_${N_SIM}M/ with a name of the form
# "${TRACE}-${BINARY}-${OPTION}.txt".
```
| Param | Description |
|---|---|
| 5 | Number of warmup instructions, in millions (here, 5 million). Use 50 for the questions below. |
| 50 | Number of detailed-simulation instructions, in millions (here, 50 million). Use 150 for the questions below. |
Results will be stored in results_${N_SIM}M/ directory. For example, to see the results of the above command:
```sh
$ cat results_50M/607.cactuBSSN_s-2421B.champsimtrace.xz-perceptron-next_line-no-no-lru-1core.txt
```
The /prefetcher directory is where you’ll implement custom prefetchers. ChampSim supports prefetchers at all cache levels, but this assignment focuses on L2 prefetchers. The L2 prefetcher API is defined in l2c_prefetcher.cc and consists of five key functions:
l2c_prefetcher_initialize: Called when the L2 cache is created. Use this to initialize internal prefetcher data structures.
l2c_prefetcher_operate: Called for each L2 lookup (load/store, hit/miss). The first and second arguments provide the cache-line-aligned address and the PC of the memory access, respectively; the third and fourth arguments identify the type of lookup. In this assignment, you will focus on finding patterns only in L2 load accesses.
l2c_prefetcher_cache_fill: Called for each L2 fill operation. Argument names are self-explanatory.
l2c_prefetcher_final_stats: Called at simulation end. Use this to print prefetcher statistics.
prefetch_line: You do not need to implement this function; use it to inject prefetch requests. The first two arguments are the PC and the cache-line address triggering the prefetch; the third is the prefetch address. By default, prefetch requests look up cache level N and higher, then main memory on a miss. The fourth argument (the fill level) controls where prefetched data is filled: for an L2 prefetcher, it is either FILL_L2 (prefetched data is filled into the L2) or FILL_LLC (prefetched data is filled into the LLC).
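For example, a next-line prefetcher requests the block after the one currently being accessed. Below is a minimal sketch of such a call inside l2c_prefetcher_operate; it assumes the DPC-3-style signature with a trailing metadata argument (the provided next_line.l2c_pref shows the authoritative usage):

```cpp
// Hypothetical fragment inside l2c_prefetcher_operate: inject a prefetch
// for the next cache line. LOG2_BLOCK_SIZE and FILL_L2 come from ChampSim.
// The trailing 0 is the prefetch metadata used in the DPC-3 API; drop it if
// your prefetch_line takes only the four arguments described above.
uint64_t pf_addr = ((addr >> LOG2_BLOCK_SIZE) + 1) << LOG2_BLOCK_SIZE;
prefetch_line(ip, addr, pf_addr, FILL_L2, 0); // use FILL_LLC to fill into the LLC instead
```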
To implement your own prefetcher, create a new file following the prefetcher_name.l2c_pref naming format and define the four API functions in C++. If you do not need a particular function, simply define it with an empty body. Please do not modify l2c_prefetcher.cc directly: the build script automatically generates the appropriate l2c_prefetcher.cc from your .l2c_pref file and links it with the ChampSim model.
To help you get started, we have provided two simple prefetcher implementations taken from the 3rd Data Prefetching Championship: a next-line prefetcher in next_line.l2c_pref and an IP-stride prefetcher in ip_stride.l2c_pref. These files should give you a brief understanding of how to implement a prefetcher. We have also provided an empty L2 prefetcher in no.l2c_pref to simulate a system without any prefetcher.
```sh
$ cp prefetcher/no.l2c_pref prefetcher/my_prefetcher.l2c_pref
```
Each prefetcher needs to define the four handlers shown below (leave the body empty if you do not need one):
```cpp
// Class definition for the prefetcher.
// Class definition for stats collection.
// Global variable for prefetching state.
// Global variable for stats.

// Initialize the prefetcher state, e.g., LRU counters for entries,
// hysteresis counters for entries, etc.
void CACHE::l2c_prefetcher_initialize()
{
}

// Called on every L2 cache access (both hits and misses).
// Parameters:
//   addr: the cache block address being accessed
//   ip: the program counter (PC) of the load/store instruction
//   cache_hit: 1 if the access hits in the L2 cache, 0 if it misses
//   type: the access type (0 = load, 1 = store)
//   metadata_in: metadata associated with the cache block (ignore, simply return it)
// Return: metadata_in (must be returned as-is)
uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip,
                                       uint8_t cache_hit, uint8_t type,
                                       uint32_t metadata_in)
{
    return metadata_in;
}

// Called when a cache line is filled (whether by prefetch or demand access).
// Parameters:
//   addr: address of the cache line being filled
//   set: cache set index where the line is placed
//   way: cache way index where the line is placed
//   prefetch: indicates whether this fill was due to a prefetch (1) or a demand access (0)
//   evicted_addr: address of the cache line that was evicted (if any)
//   metadata_in: metadata associated with the cache line (ignore, simply return it)
// Note: you typically don't need to modify this function, but you can use it
// to track prefetch effectiveness and gather statistics.
uint32_t CACHE::l2c_prefetcher_cache_fill(uint64_t addr, uint32_t set,
                                          uint32_t way, uint8_t prefetch,
                                          uint64_t evicted_addr, uint32_t metadata_in)
{
    return metadata_in;
}

// Print the stats of the prefetcher at the end of the simulation. You can use
// this function to print any overall statistics of the prefetcher that you
// have collected during the simulation.
void CACHE::l2c_prefetcher_final_stats()
{
}
```
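As a concrete starting point, here is a minimal sketch of a filled-in .l2c_pref file: a next-line prefetcher that also counts issued prefetches and prefetch fills. The type encoding (0 = load) follows the comments above, and the trailing metadata argument and nonzero return value of prefetch_line follow the DPC-3 API; treat both as assumptions and check the provided next_line.l2c_pref for the authoritative usage.

```cpp
#include "cache.h"

#include <iostream>

// Global stats for this sketch.
static uint64_t pf_issued = 0; // prefetches accepted by prefetch_line
static uint64_t pf_fills  = 0; // L2 fills caused by a prefetch

void CACHE::l2c_prefetcher_initialize()
{
    std::cout << "L2C next-line sketch initialized" << std::endl;
}

uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip,
                                       uint8_t cache_hit, uint8_t type,
                                       uint32_t metadata_in)
{
    if (type == 0) { // per the comments above, 0 denotes a load
        uint64_t pf_addr = ((addr >> LOG2_BLOCK_SIZE) + 1) << LOG2_BLOCK_SIZE;
        if (prefetch_line(ip, addr, pf_addr, FILL_L2, 0)) // nonzero if accepted
            pf_issued++;
    }
    return metadata_in;
}

uint32_t CACHE::l2c_prefetcher_cache_fill(uint64_t addr, uint32_t set,
                                          uint32_t way, uint8_t prefetch,
                                          uint64_t evicted_addr, uint32_t metadata_in)
{
    if (prefetch)
        pf_fills++;
    return metadata_in;
}

void CACHE::l2c_prefetcher_final_stats()
{
    std::cout << "Prefetches issued: " << pf_issued
              << ", prefetch fills: " << pf_fills << std::endl;
}
```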
The report will be prepared in Markdown. For each of the questions below, experiment with 50 million warmup instructions and 150 million detailed-simulation instructions.
There are two formats for the report file: REPORT.md and REPORT.pdf. Fill out REPORT.md, convert it to REPORT.pdf, organize all your plots into the plots folder, and submit your REPORT.pdf on Canvas at https://canvas.sfu.ca/courses/1369/assignments/70678.
You should run simulations using the following traces:
Do not proceed without comprehending these bullets. Review the material below and attend an instructor's office hours if you are not able to answer any of these questions.
As we discussed in depth, prefetching is a speculation technique to predict a future memory request and fetch it into the caches before the processor core actually demands it. This prediction helps to hide the long latency of a DRAM-based main memory access and makes the program run faster.
The effectiveness of a prefetcher can be measured using four metrics:
In the first experiment, we will examine whether next-line prefetching at the L1 data cache is more effective than next-line prefetching at the L2 cache. To do this, you will run two simulations:

Next-line prefetcher at the L1 data cache and no prefetcher at the L2 cache:

```sh
$ ./build_champsim.sh perceptron next_line no no lru 1
```

No prefetcher at the L1 data cache and a next-line prefetcher at the L2 cache:

```sh
$ ./build_champsim.sh perceptron no next_line no lru 1
```
The stride prefetcher described in this section is a simplified version of the basic stride prefetcher described in https://ieeexplore.ieee.org/document/381947?arnumber=381947. Several commercial processors (e.g., IBM Power 4) implement some version of this prefetcher. If you are interested in reading more, the paper is available on IEEE Xplore (accessible from SFU library). Stride prefetchers try to predict future references based on the past history for the same memory instructions (i.e., loads/stores) that exhibit a constant stride in their memory addresses. The core hardware structure employed by a stride prefetcher is a prefetch tracker table (PT) that keeps track of data access patterns in the form of effective address and stride for memory operations (i.e., loads and stores).
Figure 2 in the paper shows the design of the PT. The PT keeps track of previous addresses and the associated strides for memory instructions; prefetches are generated based on how stable the observed stride pattern is for a particular memory instruction. The PT is organized as a direct-mapped cache structure. Each PT entry, indexed by the instruction address (i.e., the PC), includes the address of the last data access, the stride information, and the regularity status for the instruction (i.e., an indication of how stable the stride pattern is). Each entry has the following format (see Figure 2 in the paper).
The four states and the transitions between them are depicted in the figure. The four states are initial, transient, steady, and no-prediction.
On each access, the PT records the address, computes the stride for that access, and sets a state controlling prefetching by comparing the previously recorded stride with the one just computed. The stride is the difference between the addresses of the two most recent accesses by the same instruction.
When a load/store instruction is encountered for the first time, the instruction is entered in the PT in the initial state. When the stride is obtained for the first time, i.e., at the second access, the state is set to transient since it is not known yet whether the pattern will be regular or irregular. When a further stride is computed and, if it is equal to the one previously recorded, i.e., if the same stride has occurred twice in a row, then the entry will be set to the steady state. It will remain in the steady state until a different stride is detected. At that point, the state is reset to initial.
If a different stride is computed while in the transient state, the entry is set to the no-prediction state since the pattern is not regular and erroneous prefetching should be prevented. In the presence of irregular patterns, the state of the entry will either stay in the no-prediction state or oscillate between the transient and no-prediction states until a regular stride is detected.
The PT is updated for an instruction that accesses a memory location at address addr as follows:
| State | New stride == recorded stride? | New state | Recorded stride |
|---|---|---|---|
| initial | true | steady | no change |
| initial | false | transient | update |
| transient | true | steady | no change |
| transient | false | no-prediction | update |
| steady | true | steady | no change |
| steady | false | initial | no change |
| no-prediction | true | transient | no change |
| no-prediction | false | no-prediction | update |
Following the update, a prefetch is generated depending on the state of the entry. If the entry is in the initial, transient, or steady state, a prefetch for the next address in the stride pattern (i.e., for address addr + stride) is generated, provided the cache line/block is not already present in the cache.
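To make the update rule concrete, here is an illustrative sketch of a PT entry and the update function implied by the table above. The 256-entry table size, the 4-byte instruction alignment assumed when discarding low PC bits, and all names are our assumptions, not a required design.

```cpp
#include <cstdint>

enum State { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

struct PTEntry {
    uint64_t tag;       // PC tag; 0 marks an empty entry (see NOTE below)
    uint64_t last_addr; // address of the last access by this PC
    int64_t  stride;    // last recorded stride
    State    state;
};

const int PT_ENTRIES = 256;    // assumed table size
static PTEntry pt[PT_ENTRIES]; // zero-initialized: all entries start empty

// Update the PT for a PC accessing addr; return the prefetch address,
// or 0 if no prefetch should be issued.
uint64_t pt_update(uint64_t pc, uint64_t addr)
{
    uint64_t idx = (pc >> 2) % PT_ENTRIES; // discard always-zero low PC bits
    PTEntry &e = pt[idx];

    if (e.tag != pc) { // first encounter (or a conflict): (re)allocate
        e.tag = pc;
        e.last_addr = addr;
        e.stride = 0;
        e.state = INITIAL;
        return 0;      // no stride known yet, so no prefetch
    }

    int64_t new_stride = (int64_t)(addr - e.last_addr);
    bool match = (new_stride == e.stride); // the table's stride condition

    switch (e.state) { // transitions from the table above
    case INITIAL:       e.state = match ? STEADY : TRANSIENT;        break;
    case TRANSIENT:     e.state = match ? STEADY : NO_PREDICTION;    break;
    case STEADY:        e.state = match ? STEADY : INITIAL;          break;
    case NO_PREDICTION: e.state = match ? TRANSIENT : NO_PREDICTION; break;
    }
    if (!match && e.state != INITIAL) // the table's "update" cases
        e.stride = new_stride;
    e.last_addr = addr;

    if (e.state != NO_PREDICTION) // initial, transient, or steady
        return addr + e.stride;   // caller should skip if the line is cached
    return 0;
}
```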
NOTE Assume that on initialization, all tags in the PT are zero to indicate that the entries in the PT are empty. The executable to be simulated starts at higher addresses, so this assumption is safe. When indexing in the PT, make sure you discard the lower bits in the PC that are always zero.
NOTE Implement the prefetcher at the L2 cache level. In all your simulations for this task, use the next-line prefetcher at the L1 data cache to capture the spatial locality and your stride prefetcher at the L2 cache to capture the temporal locality. This will help you to evaluate the effectiveness of your stride prefetcher in capturing the temporal patterns that are not captured by the next-line prefetcher.
Your goal is to implement a Global History Buffer (GHB) based stride prefetcher (see the GHB paper) at the L2 cache by using the prefetcher API. The GHB design decouples table indexing from history storage: an index table maps each PC to a linked chain of that PC's recent accesses, which are held in a global circular history buffer.
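As a rough sketch of the two structures (the sizes, names, and PC-localization here are our assumptions; see the GHB paper for the full design):

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

const int IT_SIZE  = 256; // index table entries (assumed)
const int GHB_SIZE = 256; // circular history buffer entries (assumed)

struct GHBEntry {
    uint64_t addr; // cache-line address of the access
    int      prev; // index of the previous access by the same PC, or -1
};

static GHBEntry ghb[GHB_SIZE];
static int ghb_head = 0;         // next slot to write (wraps around)
static int index_table[IT_SIZE]; // PC hash -> most recent GHB entry

void ghb_initialize()
{
    std::fill(std::begin(index_table), std::end(index_table), -1);
}

// On each L2 load, push the access into the GHB and link it to the previous
// access by the same PC. Walking the prev chain yields that PC's recent
// addresses, from which strides can be computed and prefetches issued.
// Caveat: once the buffer wraps, prev links can point at overwritten
// entries; a real implementation must detect such stale links (e.g., with
// a global sequence number per entry).
void ghb_insert(uint64_t pc, uint64_t addr)
{
    int it_idx = (int)((pc >> 2) % IT_SIZE);
    ghb[ghb_head].addr = addr;
    ghb[ghb_head].prev = index_table[it_idx];
    index_table[it_idx] = ghb_head;
    ghb_head = (ghb_head + 1) % GHB_SIZE;
}
```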
NOTE Just like the stride prefetcher, implement the GHB prefetcher at the L2 cache level. In all your simulations for this task, use the next-line prefetcher at the L1 data cache to capture the spatial locality and your GHB prefetcher at the L2 cache to capture the temporal locality.
Your second task is to incorporate feedback information to evaluate the effectiveness of the prefetcher in the system and accordingly adjust the prefetcher’s aggressiveness. Srinath et al. propose a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching.
You need to extend the prefetcher you designed in the GHB task with a simplified version of Srinath et al.'s feedback-directed mechanism. The two dynamic feedback metrics you need to incorporate into the GHB prefetcher are described below. The counters are implemented within the cache struct (see cache.h) and are accessed as ooo_cpu[i].L2C.....; see main.cc for the stats printed out with the pf_... prefix.
Prefetch Accuracy: Prefetch accuracy is a measure of how accurately the prefetcher can predict the memory addresses that will be accessed by the program: $\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$, where Number of Useful Prefetches is the number of prefetched cache blocks that are used by demand requests while they are resident in the L2 cache. Your goal is to implement the mechanism that calculates prefetch accuracy as described in Section 3.1.1 of Srinath et al.

Prefetch Lateness: Prefetch lateness is a measure of how timely the prefetch requests generated by the prefetcher are with respect to the demand accesses that need the prefetched data. A prefetch is defined to be late if the prefetched data has not yet returned from main memory by the time a load or store instruction requests it: $\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}}$
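A small sketch of how the two interval metrics might be computed from running counters (the counter names here are hypothetical stand-ins for the pf_ counters in cache.h; the sampling cadence is described next):

```cpp
#include <cstdint>

// Hypothetical running totals, incremented elsewhere as prefetches are
// sent to memory, used by demand requests, and detected as late.
struct PfCounters { uint64_t sent, useful, late; };
static PfCounters total = {0, 0, 0};
static PfCounters last  = {0, 0, 0}; // snapshot at the previous sample

// Compute accuracy and lateness over the most recent interval.
void sample_feedback(double &accuracy, double &lateness)
{
    uint64_t sent   = total.sent   - last.sent;
    uint64_t useful = total.useful - last.useful;
    uint64_t late_n = total.late   - last.late;
    accuracy = sent   ? (double)useful / (double)sent   : 0.0;
    lateness = useful ? (double)late_n / (double)useful : 0.0;
    last = total;
}
```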
The feedback information should be collected every 200 memory operations. Based on the collected feedback, you need to update a counter that tunes the aggressiveness of the GHB prefetcher using the information provided in the tables below. The prefetch distance is defined in Section 2.1 of [Srinath et al.]. The threshold values defining the accuracy and the lateness of the prefetcher can be found in Section 4.3 of [Srinath et al.]. DO NOT implement the changes to the cache insertion policy proposed in [Srinath et al.].
Table 1: Prefetcher aggressiveness configurations
| Counter | Aggressiveness | Distance | Degree |
|---|---|---|---|
| 1 | Very Conservative | 4 | 1 |
| 2 | Conservative | 8 | 1 |
| 3 | Middle-of-the-Road | 16 | 2 |
| 4 | Aggressive | 32 | 4 |
| 5 | Very Aggressive | 48 | 4 |
Table 2: Feedback counter updates based on prefetch accuracy and lateness.
| Case | Prefetch Accuracy | Prefetch Lateness | Counter Update |
|---|---|---|---|
| 1 | High | Late | Increment |
| 2 | High | Not-Late | No Change |
| 3 | Medium | Late | Increment |
| 4 | Medium | Not-Late | No Change |
| 5 | Low | Late | Decrement |
| 6 | Low | Not-Late | No Change |
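A sketch of the counter update implied by Tables 1 and 2; the threshold variables are placeholders for the values in Section 4.3 of Srinath et al., and the initial counter value is our assumption:

```cpp
#include <algorithm>

struct AggrConfig { int distance; int degree; };
// Table 1, indexed by the counter value (index 0 unused).
static const AggrConfig CONFIGS[6] = {
    {0, 0}, {4, 1}, {8, 1}, {16, 2}, {32, 4}, {48, 4}
};

static int aggr_counter = 3; // start Middle-of-the-Road (assumed)

// In Table 2, High and Medium accuracy trigger the same update, so only the
// low-accuracy threshold (acc_low) and the lateness threshold matter here;
// both are placeholders for the Section 4.3 values.
void update_aggressiveness(double accuracy, double lateness,
                           double acc_low, double late_thresh)
{
    bool is_late = (lateness > late_thresh);
    if (is_late) {               // Table 2: only "Late" cases change the counter
        if (accuracy >= acc_low) // High or Medium accuracy -> increment
            aggr_counter++;
        else                     // Low accuracy -> decrement
            aggr_counter--;
    }
    aggr_counter = std::max(1, std::min(5, aggr_counter));
    // The prefetcher should now use CONFIGS[aggr_counter].distance and
    // CONFIGS[aggr_counter].degree until the next feedback sample.
}
```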
This assignment has been modified by the CMPT 450/750 instructors.