

Goals

In this assignment you will explore the effectiveness of hardware prefetchers on actual programs. Your task is to use traces of the SPEC2017 benchmarks to evaluate the effectiveness of a few real-life prefetching schemes. The objective is to compare different prefetching algorithms in a common framework.

ChampSim Simulator

ChampSim is a trace-based, function-first simulator used for microarchitecture studies (trace-based, like ChampSim, as opposed to timing-first, like gem5).

ChampSim takes six parameters: branch predictor, L1D prefetcher, L2C prefetcher, LLC prefetcher, LLC replacement policy, and the number of cores (we will only run single-core configurations). For example,

$ cd $REPO
#  builds a single-core processor with perceptron predictor, L1 prefetcher
#  next_line, no L2 or L3 prefetcher, and baseline LRU policy for LLC
$ ./build_champsim.sh perceptron next_line no no lru 1
# Builds with the next_line prefetcher at L2 and no L1 prefetcher
$ ./build_champsim.sh perceptron no next_line no lru 1

# Link to spec 2017 traces
$ export TRACE_DIR=/data/dpc3_traces/
$ ./run_champsim.sh perceptron-next_line-no-no-lru-1core 5 50 607.cactuBSSN_s-2421B.champsimtrace.xz
# This will take about 1-2 minutes. For the actual runs below you will use
# 50 million warmup and 150 million simulation instructions. When debugging,
# try at most 1 million instructions before moving on to longer runs.
# Results are written to "results_${N_SIM}M" as "${TRACE}-${BINARY}-${OPTION}.txt".
| Param | Description |
|-------|-------------|
| 5     | Number of instructions for warmup (5 million). Use 50 for the actual questions below. |
| 50    | Number of instructions for detailed simulation (50 million). Use 150 for the actual questions below. |

Results will be stored in results_${N_SIM}M/ directory. For example, to see the results of the above command:

$ cat results_50M/607.cactuBSSN_s-2421B.champsimtrace.xz-perceptron-next_line-no-no-lru-1core.txt

Add Your Own Prefetcher

The /prefetcher directory is where you’ll implement custom prefetchers. ChampSim supports prefetchers at all cache levels, but this assignment focuses on L2 prefetchers. The L2 prefetcher API is defined in l2c_prefetcher.cc and consists of five key functions:

  • l2c_prefetcher_initialize: Called when the L2 cache is created. Use this to initialize internal prefetcher data structures.

  • l2c_prefetcher_operate: Called for each L2 lookup (load/store, hit/miss). The first and second arguments provide the cacheline-aligned address and the PC of the memory access, respectively; the third and fourth arguments identify the type of lookup. In this assignment, you will focus on finding patterns only in L2 load accesses.

  • l2c_prefetcher_cache_fill: Called for each L2 fill operation. Argument names are self-explanatory.

  • l2c_prefetcher_final_stats: Called at simulation end. Use this to print prefetcher statistics.

  • prefetch_line: You do not need to implement this function, but use it to inject prefetch requests. The first two arguments are the PC and cacheline address triggering the prefetch; the third is the prefetch address. By default, prefetch requests look up cache level N and higher, then main memory on a miss. The fourth argument (the fill level) controls where prefetched data is filled: for an L2 prefetcher, it can be either FILL_L2 (prefetched data is filled into the L2) or FILL_LLC (prefetched data is filled into the LLC).
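
As a concrete illustration, here is a minimal sketch of an operate hook that issues a next-line prefetch on every L2 load, mirroring the provided next_line.l2c_pref. It assumes the DPC-3 ChampSim constants LOAD and LOG2_BLOCK_SIZE and a trailing metadata argument on prefetch_line; adjust to match your checkout.

// Minimal sketch: next-line prefetch into L2 on every L2 load lookup.
uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip,
                                       uint8_t cache_hit, uint8_t type,
                                       uint32_t metadata_in)
{
    if (type == LOAD) { // find patterns only in load accesses, per the assignment
        // addr is cacheline aligned; step one cache block ahead
        uint64_t pf_addr = ((addr >> LOG2_BLOCK_SIZE) + 1) << LOG2_BLOCK_SIZE;
        prefetch_line(ip, addr, pf_addr, FILL_L2, 0); // 0 = metadata (unused here)
    }
    return metadata_in;
}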

Your Prefetcher

To implement your own prefetcher, create a new file following the prefetcher_name.l2c_pref naming format and define the four API functions in C++ in your own way. If you do not need a specific function, simply define it with an empty body. Please do not change the file l2c_prefetcher.cc yourself. The build script will automatically generate the appropriate l2c_prefetcher.cc file from your l2c_pref file and link it with the ChampSim model. To help you get started, we have provided two simple prefetcher implementations: (1) a next-line prefetcher and (2) an IP-stride prefetcher, taken from the 3rd Data Prefetching Championship; they can be found in next_line.l2c_pref and ip_stride.l2c_pref. These files should give you a brief understanding of how to implement a prefetcher. We have also provided an empty L2 prefetcher in no.l2c_pref to simulate a system without any prefetcher.

cp prefetcher/no.l2c_pref prefetcher/my_prefetcher.l2c_pref

Each prefetcher needs to define at least the four handlers below (empty bodies are fine for unused ones):

// Class definition for prefetcher.
// Class definition for stats collection
// Global variable for prefetching
// Global variable for stats

// Initialize the prefetcher state, e.g., LRU counters for entries, hysteresis
// counters for entries, etc.
void CACHE::l2c_prefetcher_initialize()
{
}

// Called on every L2 cache access (both hits and misses)
// Parameters:
//   addr: The cache block address being accessed
//   ip: The program counter (PC) of the load/store instruction
//   cache_hit: 1 if the access hits in L2 cache, 0 if it misses
//   type: The access type (0 = load, 1 = store)
//   metadata_in: metadata associated with the cache block (ignore, simply return it)
// Return: metadata_in (must be returned as-is)
uint32_t CACHE::l2c_prefetcher_operate(uint64_t addr, uint64_t ip,
                                       uint8_t cache_hit, uint8_t type,
                                       uint32_t metadata_in)
{
  return metadata_in;
}

// Called when a cache line is filled (whether by prefetch or demand access)
// Parameters:
//   addr: address of the cache line being filled
//   set: cache set index where the line is placed
//   way: cache way index where the line is placed
//   prefetch: indicates if this fill was due to a prefetch (1) or demand access (0)
//   evicted_addr: address of the cache line that was evicted (if any)
//   metadata_in: metadata associated with the cache line (ignore, simply return it)
// Note: You typically don't need to modify this function, but you can use it
// to track prefetch effectiveness and gather statistics.
uint32_t CACHE::l2c_prefetcher_cache_fill(uint64_t addr, uint32_t set,
                                          uint32_t way, uint8_t prefetch,
                                          uint64_t evicted_addr, uint32_t metadata_in)
{
  return metadata_in;
}

// Print the stats of the prefetcher at the end of simulation. You can use this
// function to print any overall statistics of the prefetcher that you have
// collected during the simulation.
void CACHE::l2c_prefetcher_final_stats()
{
}

450/750 Students

The report will be prepared in markdown. For each of the questions below, experiment with 50 million warmup instructions and 150 million detailed simulation instructions.

  • 450 students can work in groups of up to two.
  • Only one person has to create the repository. You can add the other person in settings after you create the repository.

Report Preparation

  • Reports will be prepared in markdown; REPORT.md is the file you fill out.
  • Convert REPORT.md to REPORT.pdf (you can export markdown to PDF from Visual Studio Code).
  • Organize all your plots into the plots folder.
  • Submit REPORT.pdf on Canvas at https://canvas.sfu.ca/courses/1369/assignments/70678.

You should run simulations using the following traces:

  • 602.gcc_s-734B.champsimtrace.xz
  • 605.mcf_s-665B.champsimtrace.xz
  • 607.cactuBSSN_s-2421B.champsimtrace.xz
  • 628.pop2_s-17B.champsimtrace.xz
  • 638.imagick_s-10316B.champsimtrace.xz
  • 649.fotonik3d_s-1176B.champsimtrace.xz

Check Yourself

Do not proceed without understanding these bullets. Review the material below and attend an instructor OH if you are not able to answer any of these questions.

  • Do you know how to mask and extract bits from an address? If not familiar, watch the 295 video.
  • Do you know what a direct-mapped cache is?
  • If a hardware table is indexed by a K-bit index, how many entries does the table have?
  • Do you know how to implement a saturating counter (e.g., a 2-bit up/down saturating counter)? A sketch follows this list.
  • Do you know how to implement a state machine like the one on Slide 7 of Week2 Lecture Notes?
  • Why do we skip the least significant two bits in the branch PC when indexing into a branch prediction table?
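
If the saturating-counter question is unfamiliar, here is a minimal C++ sketch of a 2-bit up/down saturating counter (the struct name and interface are ours, purely for illustration):

#include <cstdint>

// A 2-bit up/down saturating counter: the value stays in [0, 3] and sticks at the ends.
struct SatCounter2 {
    uint8_t value = 0;                            // 0 = strongly "no" ... 3 = strongly "yes"
    void increment() { if (value < 3) value++; }  // saturate at 3
    void decrement() { if (value > 0) value--; }  // saturate at 0
    bool predict() const { return value >= 2; }   // the high bit is the prediction
};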

Prefetcher Stats

As we discussed in depth, prefetching is a speculation technique to predict a future memory request and fetch it into the caches before the processor core actually demands it. This prediction helps to hide the long latency of a DRAM-based main memory access and makes the program run faster.

The effectiveness of a prefetcher can be measured using four metrics:

  • Performance: the execution cycles saved due to the prefetcher (the higher the better).
  • Coverage: the fraction of memory accesses saved by the prefetcher (i.e., the reduction in the number of memory accesses).
  • Accuracy: the fraction of prefetched addresses that are actually needed later by the program (e.g., measured as the number of cache hits to prefetched lines).
  • Timeliness: the fraction of the memory-access latency hidden by the prefetcher (the higher the better).

In a processor core with a traditional three-level cache hierarchy, a prefetcher can be employed at any level. For example, a prefetcher at the L2 cache level tries to capture the program's access pattern by observing the accesses coming out of the L1 data cache, and fills the prefetched cachelines into the caches up to the L2 level.
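
For concreteness, the coverage and accuracy metrics above are commonly written as follows (exact formulations vary across papers):

$\text{Coverage} = \frac{\text{Cache Misses Eliminated by Prefetching}}{\text{Cache Misses Without Prefetching}}$, $\text{Accuracy} = \frac{\text{Useful Prefetches}}{\text{Total Prefetches Issued}}$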

Task 1. L1 vs L2 Prefetching

In the first experiment we will determine whether next-line prefetching at the L1 data cache is more effective than next-line prefetching at the L2 cache. To do this, you will run two simulations:

Next-line prefetcher at L1 data cache and no prefetcher at L2 cache:

$ ./build_champsim.sh perceptron next_line no no lru 1

No prefetcher at L1 data cache and next-line prefetcher at L2 cache:

$ ./build_champsim.sh perceptron no next_line no lru 1

Q1. How much does performance improve if prefetching is implemented at L1 vs. L2? Why? (Hint: check average miss latency, execution time, or IPC.)

Q2. Do the numbers of useful and useless prefetches vary between L1 prefetching and L2 prefetching?

Task 2. Baseline Strided Prefetcher

What you need to implement

The stride prefetcher described in this section is a simplified version of the basic stride prefetcher described in https://ieeexplore.ieee.org/document/381947?arnumber=381947. Several commercial processors (e.g., IBM Power 4) implement some version of this prefetcher. If you are interested in reading more, the paper is available on IEEE Xplore (accessible from SFU library). Stride prefetchers try to predict future references based on the past history for the same memory instructions (i.e., loads/stores) that exhibit a constant stride in their memory addresses. The core hardware structure employed by a stride prefetcher is a prefetch tracker table (PT) that keeps track of data access patterns in the form of effective address and stride for memory operations (i.e., loads and stores).

Figure 2 in the paper shows the design of the PT. The PT keeps track of previous addresses and the associated strides for memory instructions; prefetches are generated based on how stable the observed stride pattern is for a particular memory instruction. The PT is organized as a direct-mapped cache structure. A PT entry, indexed by the instruction address (i.e., the PC), includes the address of the last data access, the stride information, and the regularity status of the instruction (i.e., an indication of how stable the stride pattern is). Each entry has the following format (see Figure 2 in the paper):

  • tag: the tag corresponding to the address of the memory instruction (i.e., PC)
  • prev-addr: the last address referenced by the corresponding instruction
  • stride: the difference between the last two addresses that were generated by the corresponding PC (can be negative)
  • state: a two-bit encoding (four states) of the past history

The four states and the transitions between them are depicted in Figure 1b. The four states are:

  • initial: set on first insertion into the PT, or after the entry experienced a stride different from the steady-state one.
  • transient: corresponds to the case when the system is not yet sure whether the current stride is stable.
  • steady: indicates that the prediction should be stable for a while.
  • no-prediction: disables prefetching for this entry for the time being.

The PT records the address of the memory instruction, computes the stride for that access, and sets a state controlling prefetching by comparing the previously recorded stride with the one just computed. The stride is obtained by taking the difference between the addresses of the two most recent accesses by the same instruction.

When a load/store instruction is encountered for the first time, it is entered in the PT in the initial state. When the stride is obtained for the first time, i.e., at the second access, the state is set to transient, since it is not yet known whether the pattern will be regular or irregular. When a further stride is computed and it is equal to the one previously recorded, i.e., the same stride has occurred twice in a row, the entry is set to the steady state. It remains in the steady state until a different stride is detected, at which point the state is reset to initial.

If a different stride is computed while in the transient state, the entry is set to the no-prediction state since the pattern is not regular and erroneous prefetching should be prevented. In the presence of irregular patterns, the state of the entry will either stay in the no-prediction state or oscillate between the transient and no-prediction states until a regular stride is detected.

The PT is updated for an instruction that accesses a memory location at address addr as follows:

  • Scenario 1. There is no corresponding entry in the PT. The instruction is entered in the PT, the prev-addr field is set to addr, the stride to 0, and the state to initial. This may involve replacing an existing entry (with a different tag) that maps to the same PT entry.
  • Scenario 2. There is a corresponding entry. The new stride is computed as addr - prev-addr. Depending on the current state of the entry and whether the new stride is equal to the stride in the PT, the transitions depicted in Figure 1b are performed. The following table further explains what happens in each transition. In the table, stride condition "false" means the new stride differs from the stride recorded in the PT, and "true" means the same stride is observed again. Once the transition to the new state is performed, prev-addr is set to addr.
| State         | Stride Condition | New State     | New Stride |
|---------------|------------------|---------------|------------|
| initial       | true             | steady        | no change  |
| initial       | false            | transient     | update     |
| transient     | true             | steady        | no change  |
| transient     | false            | no-prediction | update     |
| steady        | true             | steady        | no change  |
| steady        | false            | initial       | no change  |
| no-prediction | true             | transient     | no change  |
| no-prediction | false            | no-prediction | update     |

Following the update, a prefetch is generated depending on the state of the entry. If the entry is in the initial, transient, or steady state, a prefetch for the next address in the stride pattern (i.e., for address addr + stride) is generated, provided the cache block is not already present in the cache.

NOTE Assume that on initialization, all tags in the PT are zero to indicate that the entries in the PT are empty. The executable to be simulated starts at higher addresses, so this assumption is safe. When indexing in the PT, make sure you discard the lower bits in the PC that are always zero.
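
Putting the transition table and the notes together, here is a minimal C++ sketch of a PT update routine. It is a sketch only: the table size, the 2-bit PC shift, and all names are illustrative assumptions, and tag 0 marks an empty entry per the NOTE above.

#include <cstdint>

enum PTState : uint8_t { INITIAL, TRANSIENT, STEADY, NO_PREDICTION };

struct PTEntry {
    uint64_t tag = 0;        // PC tag; 0 means the entry is empty
    uint64_t prev_addr = 0;  // last address referenced by this PC
    int64_t  stride = 0;     // difference of the last two addresses (can be negative)
    PTState  state = INITIAL;
};

constexpr int PT_SIZE = 256;  // one of the sizes swept in Q3
PTEntry pt[PT_SIZE];

// Called for each L2 load with its PC and cacheline-aligned address.
// Returns true and sets pf_addr when a prefetch should be issued.
bool pt_update(uint64_t pc, uint64_t addr, uint64_t &pf_addr)
{
    uint64_t idx = (pc >> 2) % PT_SIZE;  // discard the always-zero low PC bits
    PTEntry &e = pt[idx];

    if (e.tag != pc) {                   // Scenario 1: no matching entry
        e.tag = pc;                      // allocate (possibly replacing another PC)
        e.prev_addr = addr;
        e.stride = 0;
        e.state = INITIAL;
        return false;
    }

    // Scenario 2: matching entry; apply the transition table above.
    int64_t new_stride = (int64_t)(addr - e.prev_addr);
    bool same = (new_stride == e.stride);
    switch (e.state) {
        case INITIAL:       e.state = same ? STEADY : TRANSIENT;        break;
        case TRANSIENT:     e.state = same ? STEADY : NO_PREDICTION;    break;
        case STEADY:        e.state = same ? STEADY : INITIAL;          break;
        case NO_PREDICTION: e.state = same ? TRANSIENT : NO_PREDICTION; break;
    }
    if (!same && e.state != INITIAL)     // the "update" rows of the table
        e.stride = new_stride;
    e.prev_addr = addr;

    if (e.state != NO_PREDICTION && e.stride != 0) {
        pf_addr = addr + e.stride;       // next address in the stride pattern
        return true;                     // caller issues prefetch_line(...)
    }
    return false;
}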

NOTE Implement the prefetcher at the L2 cache level. In all your simulations for this task, use the next-line prefetcher at the L1 data cache to capture the spatial locality and your stride prefetcher at the L2 cache to capture the temporal locality. This will help you to evaluate the effectiveness of your stride prefetcher in capturing the temporal patterns that are not captured by the next-line prefetcher.

Q3. Vary the PT size across 64, 128, 256, 512, and 1024 entries. How does it influence performance and prefetch accuracy?

Q4. Vary prefetch depth (or degree) from 1-16 (powers of 2, i.e., use degrees 1, 2, 4, 8, 16). How does it influence performance and prefetch accuracy?

Q5. Plot a histogram of strides. Does it change over time?

  • Every million accesses, record the strides held in the valid PT entries, then build a histogram of these stride values (counting how many times each stride occurs); see the sketch below.
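
One possible way to collect this, reusing the PTEntry/pt/PT_SIZE names from the sketch in Task 2 (a hypothetical helper; call it from l2c_prefetcher_operate):

#include <cstdio>
#include <map>

uint64_t access_count = 0;  // bump on every L2 access the prefetcher sees

// Every million accesses, snapshot the strides held in the valid PT entries.
void maybe_dump_stride_histogram()
{
    if (++access_count % 1000000 != 0)
        return;
    std::map<int64_t, uint64_t> histogram;  // stride value -> occurrence count
    for (int i = 0; i < PT_SIZE; i++)
        if (pt[i].tag != 0)                 // tag 0 marks an empty entry
            histogram[pt[i].stride]++;
    std::printf("--- stride histogram after %llu accesses ---\n",
                (unsigned long long)access_count);
    for (const auto &kv : histogram)
        std::printf("stride %lld: %llu\n",
                    (long long)kv.first, (unsigned long long)kv.second);
}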

Task 3. Global History Stride Prefetcher

Your goal is to implement a Global History Buffer (GHB) based stride prefetcher (see the GHB paper) at the L2 cache using the prefetcher API. The GHB design decouples the index table from the history storage.

  • Index Table (IT): A table that is indexed by program properties, such as the PC (program counter), and that stores a pointer to a GHB entry.
  • Global History Buffer (GHB): A circular queue that stores a time-ordered sequence of the observed cacheline addresses. Each GHB entry also stores a pointer (called prev_ptr) that points to the previous GHB entry with the same IT index. By traversing prev_ptr, one can retrieve the temporal sequence of cacheline addresses that map to the same IT entry.

Algorithm

  • For each L2 cache access (both hit and miss), the algorithm uses the PC of the access to index into the IT and insert the cacheline address (say A) into the GHB.
  • Using the PC and the link pointers in GHB entries, the algorithm retrieves the sequence of last 3 addresses by this PC that accessed the L2 cache.
  • The stride is computed by taking the difference between two consecutive addresses in the sequence.
  • If two strides match (say d), the prefetcher simply issues prefetch requests to cachelines A + l·d, A + (l+1)·d, A + (l+2)·d, …, A + (l+n)·d, where l is the prefetch look-ahead and n is the prefetch degree (see the sketch after this list).
  • For your design, please statically set both l and n to 4.
  • Please also size the IT and GHB so that they are 256 entries each.
  • For further details read the paper.
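
Here is a rough C++ sketch of the IT/GHB mechanics described above. It is illustrative only: after the circular buffer wraps, prev_ptr links can become stale, and a full implementation must detect that (e.g., with insertion timestamps), which we omit here.

#include <algorithm>
#include <cstdint>

constexpr int IT_SIZE  = 256;  // per the spec above
constexpr int GHB_SIZE = 256;

struct GHBEntry {
    uint64_t addr;     // observed cacheline address
    int      prev_ptr; // previous GHB entry with the same IT index (-1 if none)
};

int      it[IT_SIZE];     // IT: newest GHB slot per PC hash
GHBEntry ghb[GHB_SIZE];   // circular history buffer
int      ghb_head = 0;    // next slot to overwrite

void ghb_initialize()     // call from l2c_prefetcher_initialize
{
    std::fill(std::begin(it), std::end(it), -1);
}

void ghb_access(uint64_t pc, uint64_t addr)  // call on each L2 access
{
    int idx = (pc >> 2) % IT_SIZE;           // simple PC-based IT index

    int slot = ghb_head;                     // insert address A at the head
    ghb[slot].addr = addr;
    ghb[slot].prev_ptr = it[idx];            // link to the previous access by this PC
    it[idx] = slot;
    ghb_head = (ghb_head + 1) % GHB_SIZE;

    // Walk prev_ptr to recover the last 3 addresses by this PC.
    uint64_t seq[3];
    int n = 0;
    for (int p = slot; p != -1 && n < 3; p = ghb[p].prev_ptr)
        seq[n++] = ghb[p].addr;
    if (n < 3)
        return;

    int64_t d1 = (int64_t)(seq[0] - seq[1]); // the two most recent strides
    int64_t d2 = (int64_t)(seq[1] - seq[2]);
    if (d1 == d2 && d1 != 0) {
        const int l = 4, degree = 4;         // look-ahead and degree, per the spec
        for (int k = 0; k < degree; k++) {   // A+l*d ... A+(l+degree-1)*d; extend to
            uint64_t pf_addr = addr + (uint64_t)((int64_t)(l + k) * d1);  // A+(l+n)*d per your reading
            (void)pf_addr;                   // issue prefetch_line(pc, addr, pf_addr, FILL_L2, 0)
        }
    }
}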

NOTE Just like the stride prefetcher, implement the GHB prefetcher at the L2 cache level. In all your simulations for this task, use the next-line prefetcher at the L1 data cache to capture the spatial locality and your GHB prefetcher at the L2 cache to capture the temporal locality.

Q6. Compare the strided prefetcher (from Q3) against the global history prefetcher (same size: 256 entries) in terms of performance and prefetch accuracy. What is the size of the strided prefetcher vs. the next-line prefetcher (in bytes)?

Q7. Increase prefetch depth 1-16 (powers of 2, i.e., use degrees 1, 2, 4, 8, 16). How does it influence performance? How does it influence prefetch accuracy?

Task 4. Feedback GHB (750 students only)

Your second task is to incorporate feedback information to evaluate the effectiveness of the prefetcher in the system and accordingly adjust the prefetcher’s aggressiveness. Srinath et al. propose a mechanism that incorporates dynamic feedback into the design of the prefetcher to increase the performance improvement provided by prefetching as well as to reduce the negative performance and bandwidth impact of prefetching.

You need to extend the prefetcher you designed in GHB with a simplified version of Srinath et al.'s feedback-directed mechanism. The two dynamic feedback metrics you need to incorporate into the GHB prefetcher are described below. The counters are implemented within the cache struct (see cache.h) and are accessed as ooo_cpu[i].L2C.....; see main.cc for the stats printed with pf_...

  • Prefetch Accuracy: Prefetch accuracy is a measure of how accurately the prefetcher can predict the memory addresses that will be accessed by the program: $\text{Prefetch Accuracy} = \frac{\text{Number of Useful Prefetches}}{\text{Number of Prefetches Sent to Memory}}$ where

    • Number of Useful Prefetches is the number of prefetched cache blocks that are used by demand requests while they are resident in the L2 cache. Your goal is to implement the mechanism that calculates the prefetch accuracy as described in Section 3.1.1 of Srinath et al.
  • Prefetch Lateness: Prefetch lateness is a measure of how timely the prefetch requests generated by the prefetcher are with respect to the demand accesses that need the prefetched data. A prefetch is defined to be late if the prefetched data has not yet returned from main memory by the time a load or store instruction requests it: $\text{Prefetch Lateness} = \frac{\text{Number of Late Prefetches}}{\text{Number of Useful Prefetches}}$

    • In general, prefetch lateness decreases as the prefetcher becomes more aggressive. Your goal is to implement the mechanism that calculates the prefetch lateness as described in Section 3.1.2 of Srinath et al.

The feedback information should be collected every 200 memory operations. Based on the collected feedback, you need to update a counter that tunes the aggressiveness of the GHB prefetcher according to the tables below. The prefetch distance is defined in Section 2.1 of [Srinath et al.]. The threshold values that define the accuracy and the lateness of the prefetcher can be found in Section 4.3 of [Srinath et al.]. DO NOT implement the changes to the cache insertion policy proposed in [Srinath et al.].

Table 1: Prefetcher aggressiveness configurations

| Counter | Aggressiveness     | Distance | Degree |
|---------|--------------------|----------|--------|
| 1       | Very Conservative  | 4        | 1      |
| 2       | Conservative       | 8        | 1      |
| 3       | Middle-of-the-Road | 16       | 2      |
| 4       | Aggressive         | 32       | 4      |
| 5       | Very Aggressive    | 48       | 4      |

Table 2. Feedback counter updates based on prefetch accuracy and lateness.

| Case | Prefetch Accuracy | Prefetch Lateness | Counter Update |
|------|-------------------|-------------------|----------------|
| 1    | High              | Late              | Increment      |
| 2    | High              | Not-Late          | No Change      |
| 3    | Medium            | Late              | Increment      |
| 4    | Medium            | Not-Late          | No Change      |
| 5    | Low               | Late              | Decrement      |
| 6    | Low               | Not-Late          | No Change      |
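
To make the two tables concrete, here is a sketch of the per-interval update. The threshold values below are placeholders; substitute the accuracy and lateness thresholds from Section 4.3 of Srinath et al.

// Aggressiveness counter from Table 1; start Middle-of-the-Road.
int aggressiveness = 3;

struct AggrConfig { int distance; int degree; };
const AggrConfig aggr_table[6] = {
    {0, 0},    // index 0 unused; the counter ranges over 1..5
    {4, 1},    // 1: Very Conservative
    {8, 1},    // 2: Conservative
    {16, 2},   // 3: Middle-of-the-Road
    {32, 4},   // 4: Aggressive
    {48, 4},   // 5: Very Aggressive
};

// Call every 200 memory operations with the accuracy and lateness measured
// over the interval (Sections 3.1.1 and 3.1.2 of the paper).
void update_aggressiveness(double accuracy, double lateness)
{
    // Placeholder thresholds -- take the real values from Section 4.3.
    // High and Medium accuracy behave identically in Table 2, so one
    // accuracy threshold (Low vs. the rest) suffices here.
    const double ACC_LOW = 0.40, LATE_THRESH = 0.01;
    bool late = (lateness > LATE_THRESH);

    if (late) {
        if (accuracy >= ACC_LOW) {                // cases 1 and 3: High/Medium + Late
            if (aggressiveness < 5) aggressiveness++;
        } else {                                  // case 5: Low + Late
            if (aggressiveness > 1) aggressiveness--;
        }
    }
    // Cases 2, 4, and 6 (Not-Late): no change.

    // Apply the chosen configuration to your GHB look-ahead (distance) and degree.
    int distance = aggr_table[aggressiveness].distance;
    int degree   = aggr_table[aggressiveness].degree;
    (void)distance; (void)degree;
}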

Q8. Compare feedback prefetcher against Global history prefetcher (same size: 256 entries) in terms of performance, prefetch accuracy, and usefulness (i.e., fraction of useful prefetches).

Q9. Create a histogram of prefetch distance in a table: how many entries have a distance between 4-8, 8-12, etc.? Compare against GHB with fixed degree (2, 4) and distance (4, 16, 32).

Q10. Plot the useless prefetches for GHB with distance 16 and 32 vs. the feedback GHB. How does this impact performance?

Grading criteria

  • Answer the questions in REPORT.md and convert it to REPORT.pdf. Submit REPORT.pdf on canvas and commit REPORT.md to your repository.
  • If you do not include plots for a question, it will simply be zeroed out. Please do not include a data dump without any analysis.
  • The total assignment is worth 100 points, split equally among the mandatory questions.
  • You also need to include a README in your repo with instructions on which commands to run to generate results and plots.

Acknowledgments

This assignment has been modified by the CMPT 450/750 instructors.