In this assignment you will explore the effectiveness of hardware prefetchers on an actual program. Your task is to use the SPEC 2017 benchmark traces to evaluate the effectiveness of a few real-life prefetching schemes. The objective is to compare different prefetching algorithms in a common framework.
ChampSim is a trace-based, function-first simulator used for microarchitecture studies, in contrast to timing-first simulators such as gem5.
ChampSim takes five parameters: the branch predictor, the L1D prefetcher, the L2C prefetcher, the LLC replacement policy, and the number of cores (we will only be conducting single-core runs). For example:
```bash
cd $REPO
# Builds a single-core processor with the perceptron branch predictor, the
# next_line L1 prefetcher, no L2 or LLC prefetcher, and the baseline LRU
# replacement policy for the LLC.
./build_champsim.sh perceptron next_line no no lru 1
# Builds the next_line L2 prefetcher instead, with the L1 prefetcher disabled.
./build_champsim.sh perceptron no next_line no lru 1
# Link to the SPEC 2017 traces.
export TRACE_DIR=/data/dpc3_traces/
./run_champsim.sh perceptron-next_line-no-no-lru-1core 1 10 600.perlbench_s-210B.champsimtrace.xz
# This run takes about 1-2 minutes. For the actual runs you will have to use
# 10 million warmup and 50 million simulation instructions.
# When debugging, try at most a 1 million instruction trace before moving on
# to longer traces.
# Result files are written to results_${N_SIM}M/ with names of the form
# ${TRACE}-${BINARY}-${OPTION}.txt.
```
| Param | Description |
|---|---|
| 1 | Number of warmup instructions, in millions (here, 1 million). For the actual questions below, use 10. |
| 10 | Number of detailed-simulation instructions, in millions (here, 10 million). For the actual questions below, use 50. |
The results file:

```bash
$ cat results_10M/600.perlbench_s-210B.champsimtrace.xz-perceptron-next_line-no-no-lru-1core.txt
```
`l2c_prefetcher_initialize`
This function is called when the cache is created and should be used to initialize any internal data structures of the prefetcher.

`l2c_prefetcher_final_stats`
This is the final function that is called at the end of simulation. It is a good place to print overall statistics of the prefetcher.

`l2c_prefetcher_operate`
This function is called for each L2 lookup operation. This means it is called for both L2 load and store accesses that can either hit or miss in the cache. The third and fourth arguments of this function help in identifying the type of lookup. In this assignment, you will focus on finding patterns only in L2 load accesses. The first and second arguments provide the cacheline-aligned address and the PC of the memory access, respectively.

`l2c_prefetcher_cache_fill`
This function is called for each L2 fill operation. The function argument names are self-explanatory.

`prefetch_line`
You do NOT need to implement this function; use it when a prefetch request needs to be injected into the memory hierarchy. (Tip: see how this function is used in next_line.l2c_pref.) The first two arguments are the PC and the cacheline address of the access that triggers this prefetch request. The third argument is the cacheline address of the prefetch request. By default, a prefetch request generated by the prefetcher at cache level N first looks up the level-N cache. On a miss, the prefetch request looks up the next cache levels (N+1, N+2, etc.) and eventually goes to main memory if it misses in the LLC. Once memory supplies the corresponding data, the prefetched cacheline gets filled in all cache levels from the LLC up to level N. However, you can modify the prefetcher to send a hint to the memory subsystem to refrain from filling the cacheline in all cache levels if you think filling a prefetched line in all cache levels might hurt performance. This hint is passed through the fourth argument, the fill level. For an L2 prefetcher, this argument can take one of two values, FILL_L2 or FILL_LLC; here you will set it to FILL_L2.

To implement your own prefetcher, create a new file with
```bash
cp prefetcher/no.l2c_pref prefetcher/my.l2c_pref
```
Each prefetcher needs to implement at least the following handlers:
```cpp
// Class definition for the prefetcher.
// Class definition for stats collection.
// Global variable for prefetching state.
// Global variable for stats.

// Initialize the prefetcher state, e.g., LRU counters for entries,
// hysteresis counters for entries, etc.
void CACHE::l2c_prefetcher_initialize()
{
}

// Invoked on each L2 lookup.
// addr: cacheline-aligned address of the access
// ip: PC of the load/store
// cache_hit: 1 if the load/store hits in the L2 cache
// type: 0 for load, 1 for store
// metadata_in: read-only metadata of the cache line; always return it from
// this function.
uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip, uint8_t cache_hit, uint8_t type, uint32_t metadata_in)
{
    return metadata_in;
}

// Invoked when the cache is refilled (whether by a prefetch or not).
// addr: address of the cache line that was filled
// set, way: location of the allocated entry
// prefetch: 1 if the fill was caused by a prefetch, 0 otherwise
// evicted_addr: address of the block that was evicted, if any
// You shouldn't need to modify this, but you can use it to track whether
// your prefetching is working or not.
uint32_t CACHE::l2c_prefetcher_cache_fill(uint64_t addr, uint32_t set, uint32_t way, uint8_t prefetch, uint64_t evicted_addr, uint32_t metadata_in)
{
    return metadata_in;
}

// Print the stats.
void CACHE::l2c_prefetcher_final_stats()
{
}
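```

For orientation, here is a minimal sketch of what a working `l2c_prefetcher_operate` can look like, modeled on the shipped next_line.l2c_pref: on every L2 load access it issues a single prefetch for the next cache block and fills it into the L2. LOG2_BLOCK_SIZE and FILL_L2 come from the ChampSim headers; the load-only gate reflects this assignment's focus. Treat this as a sketch, not a reference implementation.

```cpp
#include "cache.h"

// Next-line prefetch on every L2 load access (sketch; compare with the
// shipped next_line.l2c_pref).
uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip, uint8_t cache_hit, uint8_t type, uint32_t metadata_in)
{
    if (type == 0) { // per the assignment, look for patterns in loads only
        // addr is cacheline aligned; compute the next cache block.
        uint64_t pf_addr = ((addr >> LOG2_BLOCK_SIZE) + 1) << LOG2_BLOCK_SIZE;
        // Inject the prefetch; FILL_L2 fills the returned line into the L2.
        prefetch_line(ip, addr, pf_addr, FILL_L2, 0);
    }
    return metadata_in;
}
```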
The report will be prepared in Markdown. For each of the questions below, experiment with 10 million instructions from the benchmarks listed.

- There are two formats for the report file: REPORT.md and REPORT.pdf. REPORT.md is the file you fill out.
- Convert REPORT.md to REPORT.pdf.
- Organize all your plots into the plots folder.
- Submit your REPORT.pdf on Canvas at https://canvas.sfu.ca/courses/86832/assignments/1010536
You should run simulations using the following traces:
Do not proceed without understanding these points. Review the material below and attend an instructor office hour if you are not able to answer any of these questions.
As we discussed in depth, prefetching is a speculation technique to predict a future memory request and fetch it into the caches before the processor core actually demands it. This prediction helps to hide the long latency of a DRAM-based main memory access and makes the program run faster. The effectiveness of a prefetcher can be measured using four metrics:
In the first experiment, we will examine whether prefetching at the L1 cache level improves program performance.
The stride prefetcher described in this section is a simplified version of the basic stride prefetcher described in https://ieeexplore.ieee.org/document/381947?arnumber=381947. Several commercial processors (e.g., IBM Power 4) implement some version of this prefetcher. If you are interested in reading more, the paper is available on IEEE Xplore (accessible from SFU library). Stride prefetchers try to predict future references based on the past history for the same memory instructions (i.e., loads/stores) that exhibit a constant stride in their memory addresses. The core hardware structure employed by a stride prefetcher is a prefetch tracker table (PT) that keeps track of data access patterns in the form of effective address and stride for memory operations (i.e., loads and stores).
Figure 2 in the paper shows the design of the PT. The PT keeps track of previous addresses and associated strides for memory instructions. Prefetches are generated based on how stable the observed stride pattern is for a particular memory instruction. The PT is organized as a direct-mapped cache structure. A PT entry, indexed by the instruction address (i.e., PC), includes the address of the last data access, the stride information, and the regularity status for the instruction (i.e., an indication of how stable the stride pattern is). Each entry thus has the following format (see Figure 2 in the paper): a tag (the instruction address), the address of the last access, the stride, and the state.
The four states and the transitions between them are depicted in the figure. The four states are: initial, transient, steady, and no-prediction.
The PT records the address of the memory instruction, computes the stride for that access, and sets a state controlling the prefetching by comparing the previously recorded stride with the one just computed. The stride is obtained by taking the difference between the addresses of the two most recent accesses by the same instruction (e.g., a load that touches 0x1000, 0x1040, 0x1080 on successive executions has a constant stride of 0x40).
When a load/store instruction is encountered for the first time, it is entered in the PT in the initial state. When the stride is obtained for the first time, i.e., at the second access, the state is set to transient since it is not known yet whether the pattern will be regular or irregular. When a further stride is computed, if it is equal to the one previously recorded (i.e., the same stride has occurred twice in a row), the entry is set to the steady state. It remains in the steady state until a different stride is detected, at which point the state is reset to initial.
If a different stride is computed while in the transient state, the entry is set to the no-prediction state since the pattern is not regular and erroneous prefetching should be prevented. In the presence of irregular patterns, the state of the entry will either stay in the no-prediction state or oscillate between the transient and no-prediction states until a regular stride is detected.
The PT is updated for an instruction that accesses a memory location at address addr as follows (the stride condition is true when the newly computed stride equals the recorded stride):
| State | Stride condition | New state | New stride |
|---|---|---|---|
| initial | true | steady | no change |
| initial | false | transient | update |
| transient | true | steady | no change |
| transient | false | no-prediction | update |
| steady | true | steady | no change |
| steady | false | initial | no change |
| no-prediction | true | transient | no change |
| no-prediction | false | no-prediction | update |
Following the update, a prefetch is generated depending on the state of the entry. If the entry is in one of the initial, transient, or steady states, a prefetch for the next address in the stride pattern (i.e., a prefetch for address addr + stride) is generated, provided the cache line/block is not already present in the cache.
NOTE: Assume that on initialization, all tags in the PT are zero, indicating that the entries in the PT are empty. The executable being simulated starts at higher addresses, so this assumption is safe. When indexing into the PT, make sure you discard the lower bits of the PC that are always zero.
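One possible shape for the PT and its update rule is sketched below, assuming a 256-entry direct-mapped table and a 2-bit PC shift when indexing; both are illustrative choices, not requirements. The transition logic follows the table above, and a prefetch for addr + stride is issued unless the entry is in the no-prediction state.

```cpp
#include "cache.h"

enum PTState { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

struct PTEntry {
    uint64_t tag;        // PC tag; zero marks an empty entry (see NOTE above)
    uint64_t last_addr;  // address of the most recent access
    int64_t  stride;     // most recently recorded stride
    PTState  state;
};

const int PT_SIZE = 256;   // illustrative size
PTEntry PT[PT_SIZE];       // global array: zero-initialized, so all tags
                           // start out zero, as the NOTE requires

// Call from l2c_prefetcher_operate (passing `this`) for each L2 load access.
void pt_operate(CACHE *cache, uint64_t ip, uint64_t addr)
{
    // Discard the low PC bits that are always zero before indexing; the
    // shift amount here is an illustrative choice.
    PTEntry &e = PT[(ip >> 2) % PT_SIZE];

    if (e.tag != ip) {     // PT miss: allocate the entry in the initial state
        e.tag = ip;
        e.last_addr = addr;
        e.stride = 0;
        e.state = INITIAL;
        return;
    }

    int64_t stride = (int64_t)addr - (int64_t)e.last_addr;
    bool match = (stride == e.stride);

    // State and stride transitions, following the table above.
    switch (e.state) {
    case INITIAL:
        e.state = match ? STEADY : TRANSIENT;
        if (!match) e.stride = stride;
        break;
    case TRANSIENT:
        e.state = match ? STEADY : NO_PREDICTION;
        if (!match) e.stride = stride;
        break;
    case STEADY:
        e.state = match ? STEADY : INITIAL;  // stride is left unchanged
        break;
    case NO_PREDICTION:
        e.state = match ? TRANSIENT : NO_PREDICTION;
        if (!match) e.stride = stride;
        break;
    }
    e.last_addr = addr;

    // Prefetch addr + stride unless the pattern looks irregular; prefetch_line
    // itself looks up the L2 before the request goes down the hierarchy.
    if (e.state != NO_PREDICTION && e.stride != 0)
        cache->prefetch_line(ip, addr, addr + e.stride, FILL_L2, 0);
}
```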
Your goal is to implement a Global History Buffer (GHB) based stride prefetcher (see the GHB paper) at the L2 cache by using the prefetcher API. The GHB decouples the PC-indexed table from the storage of the access history: an index table points into a FIFO history buffer of recent accesses.
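One way to organize the two structures is sketched below: an index table (IT) keyed by PC, whose entries point into a circular global history buffer in which each entry stores an address plus a link to the previous entry from the same PC. The sizes, the sequence-number scheme for detecting stale links, and the three-address stride check are illustrative assumptions, not the required design.

```cpp
#include <cstdint>

const int IT_SIZE  = 256;  // index table size (illustrative)
const int GHB_SIZE = 256;  // history buffer size (illustrative)

// Links are kept as global sequence numbers: the entry for sequence s lives
// in slot s % GHB_SIZE and stays valid only until the circular buffer wraps
// around and overwrites it.
struct GHBEntry {
    uint64_t addr;      // cacheline address of the access
    uint64_t prev_seq;  // sequence of the previous access by the same PC
    bool     has_prev;
};

struct ITEntry {
    uint64_t pc;        // tag
    uint64_t last_seq;  // sequence of the most recent access by this PC
    bool     valid;
};

GHBEntry ghb[GHB_SIZE];
ITEntry  it[IT_SIZE];
uint64_t ghb_seq = 0;   // total number of insertions so far

// Insert an L2 load access into the GHB, linking it to the previous access
// by the same PC when that entry has not yet been overwritten.
void ghb_insert(uint64_t pc, uint64_t addr)
{
    ITEntry  &e = it[pc % IT_SIZE];
    GHBEntry &g = ghb[ghb_seq % GHB_SIZE];

    g.addr     = addr;
    g.has_prev = e.valid && e.pc == pc && (ghb_seq - e.last_seq) < GHB_SIZE;
    g.prev_seq = e.last_seq;

    e.pc = pc;
    e.last_seq = ghb_seq;
    e.valid = true;
    ++ghb_seq;
}

// Walk the link chain to recover the three most recent addresses for this
// PC; report a stride only if the two resulting strides match. Issuing the
// prefetches (degree, distance) is left to the caller.
bool ghb_detect_stride(uint64_t pc, int64_t &stride)
{
    ITEntry &e = it[pc % IT_SIZE];
    if (!e.valid || e.pc != pc) return false;

    uint64_t a[3];
    uint64_t seq = e.last_seq;
    for (int i = 0; i < 3; i++) {
        if (ghb_seq - seq > GHB_SIZE) return false;  // entry overwritten
        GHBEntry &g = ghb[seq % GHB_SIZE];
        a[i] = g.addr;
        if (i < 2) {
            if (!g.has_prev) return false;
            seq = g.prev_seq;
        }
    }
    int64_t s1 = (int64_t)(a[0] - a[1]);
    int64_t s2 = (int64_t)(a[1] - a[2]);
    if (s1 == 0 || s1 != s2) return false;
    stride = s1;
    return true;
}
```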
Your second task is to incorporate feedback information to evaluate the effectiveness of the prefetcher in the system and accordingly adjust the prefetcher’s aggressiveness. Srinath et al. propose a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching.
You need to extend the prefetcher you designed in GHB using a simplified version of Srinath et al.'s feedback-directed mechanism. The two dynamic feedback metrics you need to incorporate into the GHB prefetcher are described below. The counters are implemented within the cache class (see cache.h) and are accessed as ooo_cpu[i].L2C.... See main.cc for the stats printed out with the pf_ prefix.
Prefetch Accuracy: Prefetch accuracy is a measure of how accurately the prefetcher can predict the memory addresses that will be accessed by the program: $\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$, where Number of Useful Prefetches is the number of prefetched cache blocks that are used by demand requests while they are resident in the L2 cache. Your goal is to implement the mechanism that calculates the prefetch accuracy as described in Section 3.1.1 of Srinath et al.

Prefetch Lateness: Prefetch lateness is a measure of how timely the prefetch requests generated by the prefetcher are with respect to the demand accesses that need the prefetched data. A prefetch is defined to be late if the prefetched data has not yet returned from main memory by the time a load or store instruction requests it: $\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}}$
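A sketch of the per-interval computation of these two ratios is shown below; pf_sent, pf_useful, and pf_late are placeholder names for the counters you maintain (the real pf_... counters live in cache.h, as noted above), and the divisions are guarded against zero denominators.

```cpp
#include <cstdint>

// Compute the interval feedback metrics; the counter names here are
// illustrative placeholders for the pf_... statistics you maintain.
void compute_feedback(uint64_t pf_sent, uint64_t pf_useful, uint64_t pf_late,
                      double &accuracy, double &lateness)
{
    accuracy = (pf_sent   > 0) ? (double)pf_useful / pf_sent   : 0.0;
    lateness = (pf_useful > 0) ? (double)pf_late   / pf_useful : 0.0;
}
```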
The feedback information should be collected every 200 memory operations. Based on the collected feedback, you need to update a counter that tunes the aggressiveness of the GHB prefetcher using the information provided in the tables below. The prefetch distance is defined in Section 2.1 of [Srinath et al.]. The threshold values that define the accuracy and the lateness of the prefetcher can be found in Section 4.3 of [Srinath et al.]. DO NOT implement the changes to the cache insertion policy proposed in [Srinath et al.].
Table 1: Prefetcher aggressiveness configurations
Counter | Aggressiveness | Distance | Degree |
---|---|---|---|
1 | Very Conservative | 4 | 1 |
2 | Conservative | 8 | 1 |
3 | Middle-of-the-Road | 16 | 2 |
4 | Aggressive | 32 | 4 |
5 | Very Aggressive | 48 | 4 |
Table 2: Feedback counter updates based on prefetch accuracy and lateness.
Case | Prefetch Accuracy | Prefetch Lateness | Counter Update |
---|---|---|---|
1 | High | Late | Increment |
2 | High | Not-Late | No Change |
3 | Medium | Late | Increment |
4 | Medium | Not-Late | No Change |
5 | Low | Late | Decrement |
6 | Low | Not-Late | No Change |
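Putting Tables 1 and 2 together, the per-interval update can look like the sketch below. ACC_HIGH, ACC_LOW, and LATE_THRESHOLD stand for the threshold values you look up in Section 4.3 of Srinath et al. (deliberately left undefined here), the starting counter value is an assumption, and the counter indexes Table 1 while clamped to [1, 5].

```cpp
// Thresholds: fill in from Section 4.3 of Srinath et al.
extern const double ACC_HIGH, ACC_LOW, LATE_THRESHOLD;

// Distance and degree per aggressiveness level, indexed by counter (Table 1).
const int PF_DISTANCE[6] = {0, 4, 8, 16, 32, 48};
const int PF_DEGREE[6]   = {0, 1, 1, 2, 4, 4};

int aggr_counter = 3;  // starting at Middle-of-the-Road is an assumption

// Run every 200 L2 memory operations, after computing the interval metrics.
void update_aggressiveness(double accuracy, double lateness)
{
    bool late = (lateness > LATE_THRESHOLD);

    if (accuracy >= ACC_HIGH) {
        if (late) aggr_counter++;      // Case 1; Case 2: no change
    } else if (accuracy >= ACC_LOW) {
        if (late) aggr_counter++;      // Case 3; Case 4: no change
    } else {
        if (late) aggr_counter--;      // Case 5; Case 6: no change
    }

    if (aggr_counter < 1) aggr_counter = 1;
    if (aggr_counter > 5) aggr_counter = 5;
    // Use PF_DISTANCE[aggr_counter] and PF_DEGREE[aggr_counter] in the GHB
    // prefetcher for the next interval.
}
```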
WARNING: Do not simply include a data dump
This assignment has been modified by the CMPT 450/750 instructors