In this assignment you will explore the effectiveness of branch direction prediction (taken vs. not taken) on an actual program. Your task is to use the SPEC 2017 benchmark traces to evaluate the effectiveness of a few real-life branch prediction schemes. The objective is to compare different branch prediction algorithms in a common framework. Predictors will be evaluated for conditional branches.
ChampSim is a trace-based, function-first simulator used for microarchitecture studies (trace-based, in contrast to timing-first simulators such as gem5).
ChampSim takes five parameters: branch predictor, L1D prefetcher, L2C prefetcher, LLC replacement policy, and the number of cores (we will only run single-core simulations). For example,
cd $REPO
# builds a single-core processor with bimodal branch predictor, no L1/L2 data prefetchers, and the baseline LRU replacement policy for the LLC
./build_champsim.sh bimodal no no no no lru 1
# Link to spec 2017 traces
export TRACE_DIR=/data/dpc3_traces/
./run_champsim.sh bimodal-no-no-no-no-lru-1core 1 10 600.perlbench_s-210B.champsimtrace.xz
# This will take about 1-2 minutes
# When debugging, try at most a 1 million instruction trace before moving on to longer traces.
# Results are written to "results_${N_SIM}M/${TRACE}-${BINARY}-${OPTION}.txt".
Param | Description |
---|---|
1 | number of instructions for warmup (1 million) |
10 | number of instructions for detailed simulation (10 million) |
Results file
$ cat results_10M/600.perlbench_s-210B.champsimtrace.xz-bimodal-no-no-no-no-lru-1core.txt
CPU 0 Bimodal branch predictor
Warmup complete CPU 0 instructions: 1000001 cycles: 401571 (Simulation time: 0 hr 0 min 4 sec)
Heartbeat CPU 0 instructions: 10000000 cycles: 7410810 heartbeat IPC: 1.34938 cumulative IPC: 1.28402 (Simulation time : 0 hr 0 min 32 sec)
CPU 0 Branch Prediction Accuracy: 97.771% MPKI: 3.658 Average ROB Occupancy at Mispredict: 83.3361
Branch types
NOT_BRANCH: 8358572 83.5857%
BRANCH_DIRECT_JUMP: 200075 2.00075%
BRANCH_INDIRECT: 64447 0.64447%
BRANCH_CONDITIONAL: 1196085 11.9609%
BRANCH_DIRECT_CALL: 83309 0.83309%
BRANCH_INDIRECT_CALL: 6936 0.06936%
Branch misprediction for individual branch types
BRANCH_DIRECT_JUMP: 245 0.122454%
BRANCH_INDIRECT: 20 0.0310333%
BRANCH_CONDITIONAL: 45785 3.82792% # <-- We use the misprediction rate of conditional branches only
BRANCH_DIRECT_CALL: 141 0.169249%
BRANCH_INDIRECT_CALL: 10 0.144175%
BRANCH_RETURN: 121 0.134078%
BRANCH_OTHER: 0 -nan%
cp branch/branch_predictor.cc branch/mybranch.bpred
Each branch predictor needs to implement at least three handlers:
// bimodal.pred
// Initialize the state of your branch predictor
void O3_CPU::initialize_branch_predictor()
// Initialize values to not taken
// Invoked on a conditional branch by the pipeline
uint8_t O3_CPU::predict_branch(uint64_t ip)
// return 1: predict taken
// return 0: predict not taken
// A useful C++ std function is std::clamp.
// Update/Repair
void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
// Change the counter values.
// Deallocate/allocate entries if required.
# Every time you change the predictor file, remember to rebuild
./build_champsim.sh mybranch no no no no lru 1
The report will be prepared in Markdown. For each of the questions below, experiment with 10 million instructions from the benchmarks listed.
There are two formats for the report file: REPORT.md and REPORT.pdf. REPORT.md is the file you fill out.
Convert REPORT.md to REPORT.pdf
Organize all your plots into the plots folder
Submit your REPORT.pdf on canvas at https://canvas.sfu.ca/courses/86832/assignments/1010535
We will be using the following traces to experiment with different branch predictors and configurations:
Do not proceed without understanding these bullets. Review the material below and attend instructor office hours (OH) if you are not able to answer any of these questions.
Before looking at dynamic branch prediction, we are going to implement two simple static branch prediction policies:
always predict taken (taken.pred)
always predict not taken (not-taken.pred)
(You can implement these two prediction policies by following the instructions under the Add your own branch predictor section; a minimal sketch follows below.)
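For reference, here is a minimal sketch of what taken.pred could look like, assuming the legacy ChampSim handler interface shown above; not-taken.pred is identical except that predict_branch returns 0.

```cpp
#include "ooo_cpu.h"

// taken.pred (sketch): statically predict every conditional branch as taken.

void O3_CPU::initialize_branch_predictor()
{
    // A static policy keeps no state, so there is nothing to initialize.
}

uint8_t O3_CPU::predict_branch(uint64_t ip)
{
    return 1; // always taken (return 0 here for not-taken.pred)
}

void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
{
    // Static predictors do not learn, so there is nothing to update.
}
```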
You can use Excel or matplotlib to generate the plots. See the matplotlib documentation for examples: Grouped bar charts in Matplotlib and the Simple plot tutorial using Matplotlib.
The simplest dynamic branch direction predictor is an array of $2^n$ two-bit saturating counters. Each counter holds one of four values: strongly taken (T), weakly taken (t), weakly not taken (n), and strongly not taken (N).
This has already been provided to you at branch/bimodal.pred
Counters are initialized to 0 (strongly not taken).
Prediction: To make a prediction, the predictor selects a counter from the table using the lower-order n bits of the instruction's address (its program counter value). The direction prediction is made based on the value of the counter.
Training: After each branch (correctly predicted or not), the hardware increments or decrements the corresponding counter to bias it toward the actual branch outcome (the outcome given in the trace file). As these are two-bit saturating counters, decrementing a counter at its minimum or incrementing a counter at its maximum has no effect.
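The provided branch/bimodal.pred already implements this. The sketch below only illustrates the idea, assuming the handler interface above and an assumed table of $2^{14}$ counters encoded as 0..3 (N, n, t, T):

```cpp
#include "ooo_cpu.h"
#include <algorithm> // std::clamp
#include <array>

// Bimodal sketch: 2^14 two-bit saturating counters, indexed by low PC bits.
constexpr std::size_t BIMODAL_BITS = 14;                // n index bits (example value)
constexpr std::size_t BIMODAL_SIZE = 1 << BIMODAL_BITS; // number of counters
std::array<int, BIMODAL_SIZE> bimodal_table;            // 0..3: N, n, t, T

void O3_CPU::initialize_branch_predictor()
{
    bimodal_table.fill(0); // initialize every counter to strongly not taken
}

uint8_t O3_CPU::predict_branch(uint64_t ip)
{
    std::size_t index = ip & (BIMODAL_SIZE - 1); // low-order n bits of the PC
    return bimodal_table[index] >= 2;            // 2,3 => taken; 0,1 => not taken
}

void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
{
    std::size_t index = ip & (BIMODAL_SIZE - 1);
    int delta = taken ? 1 : -1;
    bimodal_table[index] = std::clamp(bimodal_table[index] + delta, 0, 3); // saturate
}
```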
How large does the bimodal predictor need to be to beat both always taken and always not taken (cutting the misprediction rate in half)? Give the predictor size both in terms of number of counters as well as bytes.

The bimodal predictor maintains a 2-bit saturating counter for each entry (e.g., it uses 10 branch PC bits to index into one of 1024 counters) and so captures the recent ``common case'' for each branch. Here we try to take advantage of additional information: the history of recent branch outcomes, encoded as 0 (not taken) and 1 (taken).
Your task is to build correlating predictors, also known as two-level branch predictors. We will build two kinds: those that take advantage of global history and those that take advantage of local history, and then conduct a tournament between the two.
Instead of maintaining a counter for each branch to capture its common case, that is, instead of using only the PC to predict whether a branch is taken, the global predictor uses the outcomes of the last N branches. For instance, if the last N branches were TNTNTN, you might predict the next branch will be taken. The rest of the implementation is similar to the bimodal predictor.
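A minimal sketch of this global predictor, assuming the handler interface above, a history length equal to the number of index bits, and the history register alone used as the table index (you can also experiment with hashing it with the PC):

```cpp
#include "ooo_cpu.h"
#include <algorithm> // std::clamp
#include <array>

// Global-history sketch: index a table of 2-bit counters with the outcomes
// of the last GLOBAL_BITS branches. GLOBAL_BITS is an example value.
constexpr std::size_t GLOBAL_BITS = 14;
constexpr std::size_t GLOBAL_SIZE = 1 << GLOBAL_BITS;
std::array<int, GLOBAL_SIZE> global_table; // 2-bit saturating counters (0..3)
uint64_t global_history;                   // most recent outcome in the low bit

void O3_CPU::initialize_branch_predictor()
{
    global_table.fill(0); // not taken
    global_history = 0;
}

uint8_t O3_CPU::predict_branch(uint64_t ip)
{
    std::size_t index = global_history & (GLOBAL_SIZE - 1);
    return global_table[index] >= 2;
}

void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
{
    std::size_t index = global_history & (GLOBAL_SIZE - 1);
    int delta = taken ? 1 : -1;
    global_table[index] = std::clamp(global_table[index] + delta, 0, 3);
    global_history = (global_history << 1) | (taken & 1); // shift in the new outcome
}
```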
A two-level predictor that only keeps local histories at the first level. There are two tables.
Table 1: Local history table, indexed by PC bits. It maintains $LH$ bits of local history for each PC; with $PC$ index bits, the table has $2^{PC}$ entries. Each entry tracks the most recent $LH$ taken/not-taken outcomes of branches at that PC.
Table 2: Indexed by the history read out of the first table. Total entries = $2^{LH}$. Each entry is a 2-bit counter.
Check yourself: what is the total size of the local history table?
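A minimal sketch of the two-level local predictor, again assuming the handler interface above; PC_BITS and LH here are arbitrary example values, not required parameters:

```cpp
#include "ooo_cpu.h"
#include <algorithm> // std::clamp
#include <array>

// Two-level local sketch. PC_BITS and LH are example values.
constexpr std::size_t PC_BITS = 12;            // PC bits used to index Table 1
constexpr std::size_t LH = 12;                 // local history length
constexpr std::size_t LHT_SIZE = 1 << PC_BITS; // Table 1: local history table
constexpr std::size_t PHT_SIZE = 1 << LH;      // Table 2: pattern table of counters
std::array<uint64_t, LHT_SIZE> local_history;  // per-PC history of the last LH outcomes
std::array<int, PHT_SIZE> local_counters;      // 2-bit saturating counters

void O3_CPU::initialize_branch_predictor()
{
    local_history.fill(0);
    local_counters.fill(0);
}

uint8_t O3_CPU::predict_branch(uint64_t ip)
{
    uint64_t hist = local_history[ip & (LHT_SIZE - 1)] & (PHT_SIZE - 1);
    return local_counters[hist] >= 2;
}

void O3_CPU::last_branch_result(uint64_t ip, uint8_t taken)
{
    uint64_t &hist_entry = local_history[ip & (LHT_SIZE - 1)];
    uint64_t hist = hist_entry & (PHT_SIZE - 1);
    int delta = taken ? 1 : -1;
    local_counters[hist] = std::clamp(local_counters[hist] + delta, 0, 3);
    hist_entry = ((hist_entry << 1) | (taken & 1)) & (PHT_SIZE - 1); // keep LH bits
}
```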
This part of the homework is optional. Extra credit will be given to students who complete this part.
From the previous data, you can see that neither the global nor the local predictor is always best. To remedy this situation, we can use a hybrid predictor that tries to capture the best of both styles of prediction. A tournament predictor consists of three tables. The first and second tables are just normal global and local predictors. The third table is a chooser table that predicts whether the global or local predictor will be more accurate.
The basic idea is that some branches do better with a global predictor and some do better with a local predictor. So, the chooser is a table of two-bit saturating counters, indexed by the low-order bits of the PC, that determines which of the other two tables' predictions to return. For the chooser, the two-bit counter encodes: strongly prefer global (B), weakly prefer global (b), weakly prefer local (g), and strongly prefer local (G).
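The sketch below shows only the chooser logic; global_pred and local_pred stand for the predictions produced by the two component predictors above, which are still trained on every branch as before. The numeric encoding of B/b/g/G and the choice to train the chooser only when the two components disagree are assumptions of this sketch, not requirements:

```cpp
#include <algorithm> // std::clamp
#include <array>
#include <cstdint>

// Chooser sketch: 2-bit counters indexed by low PC bits.
// Encoding assumed here: 0,1 => prefer global (B, b); 2,3 => prefer local (g, G).
constexpr std::size_t CHOOSER_BITS = 12;            // example value
constexpr std::size_t CHOOSER_SIZE = 1 << CHOOSER_BITS;
std::array<int, CHOOSER_SIZE> chooser;              // zero-initialized: strongly prefer global

// Pick which component's prediction to return.
uint8_t tournament_predict(uint64_t ip, uint8_t global_pred, uint8_t local_pred)
{
    std::size_t index = ip & (CHOOSER_SIZE - 1);
    return (chooser[index] >= 2) ? local_pred : global_pred;
}

// Train the chooser toward whichever component was right, only when they disagree.
void tournament_update(uint64_t ip, uint8_t taken, uint8_t global_pred, uint8_t local_pred)
{
    if (global_pred == local_pred)
        return;                                     // no information about which is better
    std::size_t index = ip & (CHOOSER_SIZE - 1);
    int delta = (local_pred == taken) ? 1 : -1;     // local was right: move toward local
    chooser[index] = std::clamp(chooser[index] + delta, 0, 3);
}
```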
Let’s compare the tournament predictor’s accuracy versus the data from its two constituent predictors (using the data from the previous question). For now, let’s compare a $2^n$ counter local (or global) versus a tournament predictor with three $2^n$ counter tables. (We’ll make the comparison more fair in terms of a ``same number of bits’’ comparison in a moment.) As above for the local predictor, the local component of the tournament predictor should use a history length equal to the number of index bits (log of the table size). The graph will have three lines total.
In the previous question, the tournament predictor was given the unfair advantage of having three times the storage. In this question, run another set of experiments in which all the predictors at the same location on the x-axis have the same storage capacity. To do this, size the tournament predictor's three tables so that the total number of entries is $\simeq 2^n$, matching a $2^n$-counter global or local predictor.
Vary $n$ over $12, 16, 20, 24$.
Compare the old and new tournament predictor data. What was the impact of moving to the smaller (fairer) table sizes?
For the same table sizes ($2^{12}$, $2^{16}$, $2^{20}$, $2^{24}$), which branch predictor performed better: bimodal or global?
You will implement a variant of one of the best branch predictors of all time and the winner of three consecutive branch prediction championships, TAGE (see TAGE ref1 and TAGE ref2). (Figure: illustration of a TAGE predictor.)
The TAGE predictor features a base predictor $T_0$ in charge of providing a basic prediction and a set of (partially) tagged predictors $T_i$ (i = 1..M). These tagged components $T_i$ are indexed using different history lengths that form a geometric series, i.e., $L(i) = \alpha^{i-1} \cdot L(1) + 0.5$ (rounded down to an integer). The base predictor $T_0$ is a simple PC-indexed 2-bit counter bimodal table. Note that we essentially maintain a single global history of maximal width and simply pick the relevant shorter history prefix where required. An entry in a tagged component consists of a signed counter ctr whose sign provides the prediction, a (partial) tag, and an unsigned useful counter u. Both u and ctr are 2-bit counters.
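As a concrete example, the history lengths for the configurations below can be generated from this geometric formula. The sketch uses $\alpha \approx 1.586$ and $L(1) = 4$, values chosen here so the output matches the Configuration 2 history lengths; they are illustrative, not mandated:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Generate geometric history lengths L(i) = alpha^(i-1) * L(1) + 0.5 (truncated).
std::vector<int> geometric_history_lengths(int num_tables, double alpha, double L1)
{
    std::vector<int> lengths;
    for (int i = 1; i <= num_tables; ++i)
        lengths.push_back(static_cast<int>(std::pow(alpha, i - 1) * L1 + 0.5));
    return lengths;
}

int main()
{
    // With alpha = 1.586 and L(1) = 4, this prints: 4 6 10 16 25 40 64 101 160
    for (int len : geometric_history_lengths(9, 1.586, 4.0))
        std::printf("%d ", len);
    std::printf("\n");
}
```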
Similar to a tournament predictor, multiple tables can provide a prediction. The provider $T_{pred}$ is the matching table with the longest history. The alternate prediction $T_{alt}$ is the prediction that would have occurred on a miss in the provider component. If there is no matching tagged component, $T_{alt}$ is the $T_0$ (bimodal) prediction.
Rule | Description
---|---
Rule 1: Prediction | Find the table $T_{pred}$ with the longest matching history. If its ctr is weak or its u is 0, then $T_{alt}$ provides the prediction.
Rule 2: ctr | On a correct prediction, the ctr of the provider $T_{pred}$ is incremented; on a misprediction it is decremented.
Rule 3: useful | The useful counter u of the $T_{pred}$ entry is incremented when $T_{alt} \neq T_{pred}$ and the prediction is correct. u is also used as an age counter and is gracefully reset every 512K branches. useful counters exist only in the tagged TAGE tables.
Rule 4: Allocating entries | On a misprediction, at most one entry is allocated. If the provider component $T_{pred}$ is not the component with the longest history, we try to allocate an entry in a component $T_k$ with a longer history ($pred < k \leq M$). The allocation process is described below.
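A sketch of the Rule 1 provider/alternate selection is shown below. The table sizes and the placeholder index/tag helpers are assumptions (a better XOR-folded hash is sketched after the next paragraph), and the fall-back condition follows Rule 1 as written above; note that some TAGE descriptions instead require both a weak ctr and u == 0:

```cpp
#include <cstdint>
#include <vector>

// Entry of a tagged component, following the description above.
struct TageEntry {
    int ctr = 0;      // signed 2-bit counter (-2..1); its sign gives the prediction
    uint64_t tag = 0; // partial tag
    int u = 0;        // 2-bit useful counter (0..3)
};

constexpr int NUM_TABLES = 4;            // tagged components T1..T4 (as in Configuration 1)
constexpr std::size_t TABLE_SIZE = 1024; // entries per tagged table (assumed)
std::vector<std::vector<TageEntry>> tage(NUM_TABLES, std::vector<TageEntry>(TABLE_SIZE));
std::vector<int> bimodal(16384, 0);      // T0: 16K-entry 2-bit base predictor

// Placeholder hashes (just PC bits); replace with the XOR folding sketched below.
std::size_t table_index(uint64_t ip, int t) { return ip & (TABLE_SIZE - 1); }
uint64_t    table_tag(uint64_t ip, int t)   { return (ip >> 10) & 0x7f; }

uint8_t tage_predict(uint64_t ip)
{
    int provider = -1, alt = -1;
    for (int t = NUM_TABLES - 1; t >= 0; --t) {                 // longest history first
        if (tage[t][table_index(ip, t)].tag == table_tag(ip, t)) {
            if (provider < 0) { provider = t; }                 // longest matching table
            else              { alt = t; break; }               // next-longest match
        }
    }

    uint8_t base_pred = bimodal[ip & (bimodal.size() - 1)] >= 2;
    if (provider < 0)
        return base_pred;                                       // no tag match: T0 predicts

    const TageEntry &p = tage[provider][table_index(ip, provider)];
    bool weak = (p.ctr == 0 || p.ctr == -1);                    // weakly taken / weakly not taken
    if (weak || p.u == 0)                                       // Rule 1: fall back to the alternate
        return (alt >= 0) ? static_cast<uint8_t>(tage[alt][table_index(ip, alt)].ctr >= 0)
                          : base_pred;
    return p.ctr >= 0;                                          // sign of ctr gives the prediction
}
```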
Finally, the hash and tag functions. The index hash determines which entry of a table a particular branch maps to. We use an XOR function to compute the index: split the PC and the global history used by that table into index-width segments and XOR them with each other (XOR folding mixes all the history and PC bits while remaining cheap to compute). The tags are computed similarly, except that the width of the segments is the tag width.
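One possible way to implement the folding is sketched below as free functions. The exact segmentation scheme is up to you, and histories longer than 64 bits would need a wider container such as std::bitset; this sketch only handles histories that fit in a uint64_t:

```cpp
#include <cstdint>

// Fold `value`, of which the low `in_bits` bits are meaningful, down to
// `out_bits` bits by XORing together out_bits-wide segments.
uint64_t fold(uint64_t value, unsigned in_bits, unsigned out_bits)
{
    uint64_t mask = (1ULL << out_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < in_bits; shift += out_bits)
        result ^= (value >> shift) & mask; // XOR one segment at a time
    return result;
}

// Index into table i: XOR the folded PC with the folded history for that table.
uint64_t tage_table_index(uint64_t ip, uint64_t history, unsigned hist_len, unsigned index_bits)
{
    return fold(ip, 64, index_bits) ^ fold(history, hist_len, index_bits);
}

// Partial tag: same idea, but the segments are tag-width wide.
uint64_t tage_table_tag(uint64_t ip, uint64_t history, unsigned hist_len, unsigned tag_bits)
{
    return fold(ip, 64, tag_bits) ^ fold(history, hist_len, tag_bits);
}
```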
You can compare your results to the championship-winning TAGE here. Note that the championship winner uses loop predictors and other refinements, and different hash and index functions, so your results may not match exactly, but you should be in the ballpark.
Run the following benchmarks. You will fast-forward for 1 million instructions and simulate for 10 million instructions.
Configuration 1: ~70K bits. 4 tables + 1 bimodal TAGE.

| | T1, T2 | T3, T4 |
|---|---|---|
| History length | 4, 6 | 10, 16 |
| Table entries | 1K | 2K |
| Tag width (bits) | 7 | 8 |
| Storage budget | 11K | 24K |
Configuration 2: ~171K bits. 9 tables + 1 bimodal TAGE.

| | T1, T2 | T3, T4 | T5 | T6 | T7 | T8, T9 |
|---|---|---|---|---|---|---|
| History length | 4, 6 | 10, 16 | 25 | 40 | 64 | 101, 160 |
| Table entries | 1K | 2K | 2K | 2K | 1K | 1K |
| Tag width (bits) | 7 | 8 | 9 | 10 | 11 | 12 |
| Storage budget | 11K | 24K | 26K | 28K | 15K | 16K |
Configuration 3: ~204.5K bits. 12 tables + 1 bimodal TAGE.

| | T1, T2 | T3, T4 | T5 | T6 | T7 | T8, T9 | T10 | T11 | T12 |
|---|---|---|---|---|---|---|---|---|---|
| History length | 4, 6 | 10, 16 | 25 | 40 | 64 | 101, 160 | 254 | 403 | 640 |
| Table entries | 1K | 2K | 2K | 2K | 1K | 1K | 1K | 0.5K | 0.5K |
| Tag width (bits) | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| Storage budget | 11K | 24K | 26K | 28K | 15K | 16K | 17K | 9K | 9.5K |
Use the same base bimodal predictor in all configs: 16K entries, 2-bit counters.
WARNING: Do not simply include a data dump
This assignment has been modified by the CMPT 750 instructors