In this lab and accelerator assignment we will be using a version of gem5 developed at the University of North Carolina at Charlotte. We thank the authors of gem5-SALAM for making their tool available as open source. The CMPT 750 version of gem5 includes additional changes and benchmarks and may not be backwards compatible with the upstream SALAM version.
gem5 ACC extends gem5 to model domain-specific architectures and heterogeneous SoCs. When creating your accelerator there are many considerations to be made which have a first-order effect on overall performance and energy.
These include:
In this case, the code we are going to accelerate is the following:
char input[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char coeff[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8}; // renamed from 'const', which is a reserved keyword in C
char output[8];
for (int i = 0; i < 8; i++)
{
  output[i] = input[i] + coeff[i];
  output[i] = output[i]*2;
  output[i] = output[i]*2;
}
We will illustrate accelerator creation using the DMA model.
$ git clone git@github.com:CMPT-7ARCH-SFU/gem5-lab2.git
$ cd gem5-lab2/benchmarks/vector_dma
$ ls
defines.h host hw Makefile
File | Description |
---|---|
host/ | Defines the main that runs on the CPU and invokes the accelerator |
hw/ | Defines the accelerator datapath |
hw/config.ini | Defines the accelerator configuration |
inputs/m0.bin, m1.bin | Input files that contain the data |
defines.h | Common defines used by both the datapath and the host code |
Our SoC has two main sections.
Here we focus on the host code and its interactions with the accelerator. For this application, we are using a bare-metal kernel. This means that we have a load file and an assembly file, and we must generate an ELF file for execution.
# Pseudocode for host code
1. Set up addresses for the scratchpad
2. Copy data from DRAM into the scratchpad
3. Start the accelerator
4. Wait for the accelerator to finish
5. Copy data from the scratchpad back to DRAM
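Putting these steps together, a condensed sketch of the host flow is shown below. It uses the dmacpy/pollDma/resetDma helpers from common/dma.h and the pointers and memory-mapped defines introduced in the rest of this section; treat it as an outline rather than the exact main.cpp.
// Sketch only: step 1 (pointer and MMR setup) is shown in the subsections below.
dmacpy(spm1, m1, sizeof(TYPE) * N);    // 2. DRAM -> scratchpad (input 1)
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);    // 2. DRAM -> scratchpad (input 2)
while (!pollDma());
resetDma();
acc = 0x01;                            // 3. start the accelerator
while (acc != 0x0);                    // 4. wait; the ISR clears acc on completion
dmacpy(m3, spm3, sizeof(TYPE) * N);    // 5. scratchpad -> DRAM (result)
while (!pollDma());
resetDma();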
defines.h
Top-level definitions for the memory-mapped addresses. Since this is a bare-metal SoC without any access to virtual memory or memory allocators, we have to define the memory space ourselves. The overall memory space looks like the following:
#define acc *(char *)0x2f000000
#define val_a *(int *)0x2f000001
#define val_b *(int *)0x2f000009
#define val_c *(int *)0x2f000011
0x0 | 0x2f000000 | 0x2f000001 - 0x2f000011 (accelerator parameters, 8 bytes each) | 0x2f100000 - 0x2FFFFFFF | Limit |
---|---|---|---|---|
Host DRAM coherent address space | Accelerator status (0: inactive, 1: start, 4: running) | 3 parameters (see bench/ code) | Scratchpad memories | Host DRAM space |
These memory spaces are set in the following files
Accelerator range
Any access to this range, whether from the host code or the accelerator, is routed to the accelerator cluster.
gem5-config/HWAcc.py
local_low = 0x2F000000
local_high = 0x2FFFFFFF
Accelerator Start Address and Parameters
config.ini
[CommInterface]
pio_addr = 0x2f000000
pio_size = 64
pio_size is in bytes
Scratchpad address
config.ini
[Memory]
addr_range = 0x2f100000
In this instance we want to have the DMAs and the accelerator controlled by an additional device to reduce the overhead on the CPU. We define the helper functions in common/dma.h.
$ cd benchmarks/vector_dma
$ xxd m0.bin
00000000: 0100 0000 0200 0000 0300 0000 0400 0000 ................
00000010: 0500 0000 0600 0000 0700 0000 0800 0000 ................
00000020: 0900 0000 0a00 0000 0b00 0000 0c00 0000 ................
00000030: 0c00 0000 0d00 0000 0e00 0000 0f00 0000 ................
# fs_vector_input.py
test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m1.bin"]
main.cpp
// DRAM address where m0.bin and m1.bin are loaded (see kernel_extras above)
uint64_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;                          // input 1
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);     // input 2
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N); // output
We then set up the DMA to perform the memory copy between DRAM and the scratchpad memory. dmacpy is similar to memcpy. Note the address ranges used for performing the copy. The destination uses the scratchpad range specified in config.ini and the gem5 config scripts. This space is carved out of the global memory space, and the host CPU routes any reads and writes in this address range to the scratchpad.
// Define scratchpad addresses.
uint64_t spm_base = 0x2f100000;
TYPE *spm1 = (TYPE *)spm_base;
TYPE *spm2 = (TYPE *)(spm_base + sizeof(TYPE) * N);
TYPE *spm3 = (TYPE *)(spm_base + 2 * sizeof(TYPE) * N);
// spm1 is the destination address
// m1 is the source address
// Size in bytes.
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
Set up the parameters in the accelerator's memory-mapped registers. The accelerator status byte sits at pio_addr = 0x2f000000 (set in config.ini), and pio_size = 64 reserves 64 bytes of memory-mapped register space: the status byte followed by the parameter registers starting at address 0x2f000001, each 8 bytes wide. The parameters are automatically derived from the accelerator function definition.
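Before setting the start flag, the host also writes the scratchpad addresses of the operands into these parameter registers. A sketch of what this might look like, using the val_a/val_b/val_c defines from defines.h and mirroring the pattern used later in vector_dma_2x (the exact argument order depends on the accelerator function signature):
// Sketch: pass the scratchpad locations of the two inputs and the output to the accelerator
val_a = (uint64_t)spm_base;
val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);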
// Accelerator status values. 0x0: inactive, 0x1: start the accelerator, 0x4: running.
acc = 0x01;
printf("%d\n", acc);
// Check for accelerator end.
while (acc != 0x0) {
printf("%d\n", acc);
}
In our boot code, we set up an Interrupt Service Routine (ISR) in isr.c that the accelerator triggers at the end of its execution. The ISR resets the accelerator status to 0x0, which the host code spins on.
// ISR.c. Invoked when accelerator is complete
void isr(void)
{
printf("Interrupt\n\r");
// Helps break the for loop in the host code
acc = 0;
}
We copy the results back from the accelerator to DRAM so that the host code can access and check them.
dmacpy(m3, spm3, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
We will first start with creating the code for our accelerator.
In bench/vector_dma.c there is a vector loop application. To expose parallelism for computation and memory access, we fully unroll the innermost loop of the application; the simulator will natively pipeline the remaining loop iterations for us. To accomplish the loop unrolling we can use clang compiler pragmas, such as the one on line 18 of vector_dma.c.
// Unrolls loop and creates instruction parallelism
#pragma clang loop unroll_count(8)
for(i=0;i<N;i++) {
prod[i] = 4*(m1[i] + m2[i]);
}
With unrolling
The hardware ends up being a circuit that implements the above dataflow graph. The unrolling creates 8-way parallelism: the loads of m1[i] and m2[i] can happen in parallel, and the adds and multiplies can happen in parallel. The figures show the compiler representation, or view, that gets mapped down to hardware. Each node in the graph is an LLVM IR instruction. This is just an intermediate, RISC-like ISA representation with certain important differences.
Benefits of Compiler IR view of Accelerator
Infinite registers. Typical object code for CPUs is limited by the architectural registers. This causes unnecessary memory operations to spill and refill registers, which hides the available parallelism. Compiler IR has no such limitation, since it simply tries to capture the available parallelism and locality.
Dataflow semantics. While object code is laid out linearly and relies on a program counter, compiler IR inherently supports dataflow semantics with no specific program counter.
Without unrolling
We are generating a hardware datapath from the C code specified, hence we have a number of rules. If these rules are violated, the compiler may complain, you may encounter a runtime error from the LLVM runtime engine of SALAM, or you may even get a silent failure. It is very important that you follow them.
Rule 1: SINGLE FUNCTION. Only a single function is permitted per accelerator .c file.
Rule 2: NO LIBRARIES. Cannot use standard library functions. Cannot call into other functions.
Rule 3: NO I/O. No printfs or writes to files. Either use traces or write back to CPU memory to debug.
Rule 4: ONLY LOCALS OR ARGS. Can only work with variables declared within the function or the input arrays.
Read here for more details on LLVM.
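For reference, a minimal datapath function that satisfies all four rules might look like the following sketch (not the exact vector_dma.c source):
// A single function, no library calls, no I/O, only arguments and locals.
void vadd(int *m1, int *m2, int *prod) {
  for (int i = 0; i < 8; i++) {
    prod[i] = 4 * (m1[i] + m2[i]);
  }
}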
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
# Building gem5-SALAM
git clone git@github.com:CMPT-7ARCH-SFU/gem5-SALAM.git
cd gem5-SALAM; scons build/ARM/gem5.opt -j`nproc`
# Build benchmark
cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
module load llvm-38
# Build datapath and host binary
make clean; make
# If you are on the 227 (IP) machine, follow the instructions below
# Running without docker
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma --caches --l2cache
# OR in a single line
./runvector.sh -b vector_dma
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_dma -p"
For each of our accelerators we also need to generate an INI file. In each INI file we can define the number of cycles for each IR instruction and provide any limitations on the number of Functional Units (FUs) associated with IR instructions.
Additionally, there are options for setting the FU clock periods and controls for pipelining of the accelerator. Below is an example with a few IR instructions and their respective cycle counts:
[CycleCounts]
counter = 1
gep = 0
phi = 0
select = 1
ret = 1
br = 0
switch = 1
indirectbr = 1
invoke = 1
Importantly, under the AccConfig section, we set MMR specific details such as the size of the flags register, memory address, interrupt line number, and the accelerator’s clock.
[AccConfig]
flags_size = 1
config_size = 0
int_num = -1
clock_period = 10
premap_data = 0
data_bases = 0
In the Memory section, you can define the scratchpad's memory address, size, response latency, and number of ports. Also, if you want the accelerator to verify that data exists in the scratchpad prior to accessing it, you can set ready_mode to true.
[Memory]
addr_range = 0x2f100000
size = 98304
latency = 2ns
ports = 4
ready_mode = True
reset_on_private_read = False
Lastly, we need to set the memory address of the accelerator's MMR under CommInterface, as well as the overall size of the MMR, which must account for all variables that need to be passed (8 bytes per variable) plus the flags. The pio_size is in bytes: 1 byte for the start/stop flag plus 64 bytes for the input arguments, i.e. 8 arguments of 8 bytes each.
[CommInterface]
pio_addr = 0x2f000000
pio_size = 65
We are now going to leverage and modify the example scripts for gem5's full system simulation. In gem5-config/fs_vector_input.py we have a modified version of the script located in gem5's default configs folder. The main difference in our configuration is that there are two additional parameters.
Line 242: HWAcc.makeHWAcc(options, test_sys)
Here we invoke our own function, HWAcc.makeHWAcc. This adds gem5-config/HWAcc.py to the overall system configuration. In order to simplify the organization of accelerator-related resources, we define an accelerator cluster. This accelerator cluster contains any shared resources between the accelerators as well as the accelerators themselves. It has several functions associated with it that help with attaching accelerators to it and with hooking the cluster into the system.
The _attach_bridges function (line 19) connects the accelerator cluster into the larger system, and connects the memory bus to the cluster. This gives devices outside the cluster master access to cluster resources.
system.acctest._attach_bridges(system, local_range, external_range)
We then invoke the _connect_caches function (line 20) in order to connect any cache hierarchy that exists between the cluster and the memory bus or the l2xbar of the CPU, depending on the design. This gives the accelerator cluster master access to resources outside of itself. It also establishes coherency between the cluster and other resources via caches. If no caches are needed, this will merely attach the cluster to the memory bus without a cache.
system.acctest._connect_caches(system, options, l2coherent=True, cache_size = "32kB")
These functions are defined in gem5-SALAM/src/hwacc/AccCluster.py
The DMA control address defined here has to match common/dma.h. The default value is 0x2ff00000, with the DMA control registers occupying 24 bytes (pio_size) starting from that address.
system.acctest.dma = NoncoherentDma(pio_addr=0x2ff00000, pio_size=24, gic=system.realview.gic, max_pending=32, int_num=95)
system.acctest._connect_cluster_dma(system, system.acctest.dma)
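For reference, the control-register layout that common/dma.h is expected to match looks roughly like the sketch below; the same offsets are used by hw/source/top.c later in this document, but the exact header contents may differ.
// Sketch of the noncoherent DMA MMR layout at 0x2ff00000
#define DMA 0x2ff00000
volatile uint8_t  *DmaFlags   = (uint8_t  *)(DMA);      // control/status flags
volatile uint64_t *DmaRdAddr  = (uint64_t *)(DMA + 1);  // source address
volatile uint64_t *DmaWrAddr  = (uint64_t *)(DMA + 9);  // destination address
volatile uint32_t *DmaCopyLen = (uint32_t *)(DMA + 17); // copy length in bytes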
Next, we are going to create a CommInterface (Line 30), which is the communications portion of our Top accelerator. We will then configure Top and generate its LLVM interface by passing the CommInterface, a config file, and an IR file to AccConfig (Line 31). This will generate the LLVM interface, configure any hardware limitations, and establish the static Control and Dataflow Graph (CDFG).
We then connect the accelerator to the cluster (Line 32). This will attach the PIO port of the accelerator to the cluster’s local bus that is associated with MMRs.
For our Hardware, we follow the same steps.
This can be seen in
# gem5-config/HWAcc.py
# Add the benchmark function
acc_bench = options.accpath + "/" + options.accbench + "/bench/" + options.accbench + ".ll"
# Specify the path to the config file for an accelerator
# acc_config = <Absolute path to the config file>
acc_config = options.accpath + "/" + options.accbench + "/config.ini"
......
.......
# Add an accelerator attribute to the cluster
setattr(system.acctest, options.accbench, CommInterface(devicename=options.accbench))
ACC = getattr(system.acctest,options.accbench)
AccConfig(ACC, acc_config, acc_bench)
# Add an SPM attribute to the cluster
setattr(system.acctest, options.accbench+"_spm", ScratchpadMemory())
ACC_SPM = getattr(system.acctest,options.accbench + "_spm")
AccSPMConfig(ACC, ACC_SPM, acc_config)
system.acctest._connect_spm(ACC_SPM)
# Connect the accelerator to the system's interrupt controller
ACC.gic = system.realview.gic
# Connect HWAcc to cluster buses
system.acctest._connect_hwacc(ACC)
ACC.local = system.acctest.local_bus.slave
ACC.acp = system.acctest.coherency_bus.slave
Because we want our hardware accelerator to be managed by the host, we connect the PIO directly to the coherent cross bar.
We then define a scratchpad memory and configure it using AccSPMConfig, which points to our accelerator’s config file (Line 42).
Lastly, we connect the scratchpad memory to the cluster (Line 43), which allows all accelerators in the cluster to access it.
Lines 46-63 configure different buffer sizes for the DMA. These are optional, but are presented to demonstrate how you can impose additional limitations on the DMA to control how data is transferred.
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/
# On 227
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma --caches --l2cache
# In short
$ ./runvector.sh -b vector_dma
# This will create a BM_ARM_OUT/vector_dma (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator
# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_dma -p"
Do you understand what these stats are?
cat BM_ARM_OUT/vector_dma/debug-trace.txt
Cycle Counts Loaded!
**** REAL SIMULATION ****
1238719500: system.acctest.dma: SRC:0x0000000080c00000, DST:0x000000002f100000, LEN:64
1238849500: system.acctest.dma: Transfer completed in 0.13 us
1242711500: system.acctest.dma: SRC:0x0000000080c00040, DST:0x000000002f100040, LEN:64
1242831500: system.acctest.dma: Transfer completed in 0.12 us
1
********************************************************************************
system.acctest.vector_dma.compute
========= Performance Analysis =============
Setup Time: 0.000363995seconds
Simulation Time: 0.00181577seconds
System Clock: 0.1GHz
Transistor Latency: 10ns
Runtime: 23 cycles .
Runtime: 0.23 us
Stalls: 10 cycles
Executed Nodes: 12 cycles
********************************************************************************
========= Performance Analysis =================
Setup Time: 363995ns
Simulation Time: 1.81577e+06ns
System Clock: 0.1GHz
Transistor Latency: 102ns
Runtime: 23 cycles
Runtime: 2.3e-09 seconds
Stalls: 10 cycles
Load Only: 0 cycles
Store Only: 1 cycles
Compute Only: 0 cycles
Compute & Store: 2 cycles
Load & Store: 0 cycles
Load & Compute: 1 cycles
Load & Compute & Store: 6 cycles
Executed Nodes: 12 cycles
Load Only: 0 cycles
Store Only: 1 cycles
Compute Only: 0 cycles
Compute & Store: 2 cycles
Load & Store: 0 cycles
Load & Compute: 3 cycles
Load & Compute & Store: 6 cycles
========= Runtime FU's ========= (Max | Avg) ===
Counter FU's: 1 | 0.347826
Integer Add/Sub FU's: 3 | 0.260870
Integer Mul/Div FU's: 0 | 0.000000
Integer Shifter FU's: 4 | 0.206522
Integer Bitwise FU's: 1 | 0.347826
Floating Point Float Add/Sub: 0 | 0.000000
Floating Point Double Add/Sub: 0 | 0.000000
Floating Point Float Mul/Div: 0 | 0.000000
Floating Point Double Mul/Div: 0 | 0.000000
0 Cycle Compare FU's: 3 | 0.347826
GEP Instruction FU's: 6 | 0.347826
Type Conversion FU's: 0 | 0.000000
========= Static FU's ==========================
Counter FU's: 0
Integer Add/Sub FU's: 0
Integer Mul/Div FU's: 0
Integer Shifter FU's: 0
Integer Bitwise FU's: 0
Floating Point Float Add/Sub: 0
Floating Point Double Add/Sub: 0
Floating Point Float Mul/Div: 0
Floating Point Double Mul/Div: 0
0 Cycle Compare FU's: 0
GEP Instruction FU's: 0
Type Conversion FU's: 0
Other: 0
========= Pipeline Register Usage =============
Total Number of Registers: 23
Max Register Usage Per Cycle: 14
Avg Register Usage Per Cycle: 6.260870
Avg Register Size (Bytes): 5.833333
========= Memory Configuration =================
Cache Bus Ports: 117
Shared Cache Size: 0kB
Local Bus Ports: 46
Private SPM Size: 0kB
Private Read Ports: 0
Private Write Ports: 0
Private Read Bus Width: 0
Private Write Bus Width: 0
Memory Reads: 0
Memory Writes: 0
========= Power Analysis ======================
FU Leakage Power: 0.014534 mW
FU Dynamic Power: 0.000000 mW
FU Total Power: 0.014534 mW
Registers Leakage Power: 0.002576 mW
Registers Dynamic Power: 0.000000 mW
Register Reads (Bits): 144
Register Writes (Bits): 144
Registers Total Power: 0.002576 mW
SPM Leakage Power: 0.000000 mW
SPM Read Dynamic Power: 0.000000 mW
SPM Write Dynamic Power: 0.000000 mW
SPM Total Power: 0.000000 mW
Cache Leakage Power: 0.000000 mW
Cache Read Dynamic Power: 0.000000 mW
Cache Write Dynamic Power: 0.000000 mW
Cache Total Power: 0.000000 mW
Accelerator Power: 0.017110 mW
Accelerator Power (SPM): 0.017110 mW
Accelerator Power (Cache): 0.017110 mW
========= Area Analysis =======================
FU Area: 1048.954346 um^2 (0.001049 mm^2)
Register Area: 209.350159 um^2 (0.000209 mm^2)
SPM Area: 0.000000 um^2 (0.000000 mm^2)
Cache Area: 0.000000 um^2 (0.000000 mm^2)
Accelerator Area: 1258.304443 um^2 (0.001258 mm^2)
Accelerator Area (SPM): 1258.304443 um^2 (0.001258 mm^2)
Accelerator Area (Cache): 1258.304443 um^2 (0.001258 mm^2)
========= SPM Resizing =======================
SPM Optimized Leakage Power: 0.000000 mW
SPM Opt Area: 0.000000 um^2
1354939000: system.acctest.dma: SRC:0x000000002f100080, DST:0x0000000080c00080, LEN:64
1354989000: system.acctest.dma: Transfer completed in 0.05 us
Exiting @ tick 1368457000 because m5_exit instruction encountered
Exercises:
- Vary the unroll factor in benchmarks/vector_dma/hw/vector_dma.c from 1-16 and see what happens to the runtime cycles each time. Also look at the stats for Total Number of Registers, Max Register Usage Per Cycle, Runtime, Runtime FU's, and the Power Analysis. WARNING: Remember you have to rebuild the .ll and main.elf each time.
- Vary the number of ports in benchmarks/vector_dma/hw/config.ini [Memory] from 1-8 and see what happens to the cycles. Why does changing the number of ports to 1 increase stalls? To try and understand, follow the steps below.
- Change the CPU type in runvector.sh to MinorCPU and see the difference in overall simulation time.
- Set FLAGS="HWACC,LLVMRuntime" in run-vector.sh. Re-run and check debug-trace.txt. Try to comprehend what the trace says. This will include the step-by-step execution of the hardware. Disable the flags for the assignments; otherwise the traces will consume too much space.

Comments on trace
Open BM_ARM_OUT/vector_dma/debug-trace.txt and look for lines of the type:
Trying to read addr: 0x000000002f100004, 4 bytes through port:
When changing the number of ports to 1, check how many reads occur in tick 1271935000.
1476840000: system.acctest.vector_dma: Checking MMR to see if Run bit set
1476840000: system.acctest.vector_dma.compute: Initializing LLVM Runtime Engine!
1476840000: system.acctest.vector_dma.compute: Constructing Static Dependency Graph
1476840000: system.acctest.vector_dma.compute: Parsing: (/data/src/gem5-lab2/benchmarks/vector_dma/hw/vector_dma.ll)
A read from 0x2f10000c indicates a read from that address. Depending on the address range, this refers either to the scratchpad or to global memory.
Check the computation operations. Open the $REPO/benchmarks/vector_dma/hw/vector_dma.ll file and identify these instructions.
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): Performing shl Operation
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): 2 << 2
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): shl Complete. Result = 8
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): Performing add Operation (13)
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): 2 + 2
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): add Complete. Result = 4
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): Performing add Operation (6)
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): 3 + 3
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): add Complete. Result = 6
Dataflow graph Visualizer
See if you can spot the difference between the parallel and serial versions.
module unload llvm-38
module load llvm-10
clang --version
# Should be 10.
cd $REPO/benchmarks/vector_dma/hw/
# Dataflow graph without optimization (serial)
clang -emit-llvm -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.serial.pdf
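# Dataflow graph with -O3 (parallel)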
clang -emit-llvm -O3 -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.parallel.pdf
benchmarks/vector_dma_2x
In Model 1, we moved all the data we needed into the scratchpad and then kickstarted the computation. However, scratchpads are finite and accelerators can only work with data in the scratchpad. Hence we may need to restrict the size of the accelerator and process data in multiple batches. In this example we are going to restrict the accelerator to process only 8 elements; however, we have 16 elements in the array. To process all the elements we have to process the data in two batches. The modifications are to the host code that manages the accelerator.
// Modified config.ini to set scratchpad size
// Modified defines.h
#define N 8
// The accelerator datapath will work on 8 elements at a time
// Modified top.c to process 16 elements as two batches.
// Batch 0 DMAs elements 0-7 to the scratchpad
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
// Invoke accelerator on scratchpad address range
val_a = (uint64_t)spm_base;
val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);
// Batch 1 DMAs elements 8-15 to the scratchpad
dmacpy(spm1, m1 + N, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2 + N, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
// Notice that in both cases we are passing the scratchpad base address to the accelerator datapath to work on. This is redundant and we can hardcode it into the accelerator datapath in hw/vector_dma_2x.c if we want to.
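For completeness, the remainder of each batch follows the same pattern as the single-batch example: kickstart the accelerator, wait for the ISR to clear the status byte, then DMA that batch's results back to the correct DRAM offset. A sketch for batch 1, assuming the acc MMR define from defines.h:
// Sketch of the tail of batch 1: run the accelerator and copy its results back
acc = 0x01;                               // kickstart the accelerator on batch 1
while (acc != 0x0);                       // ISR clears this when the accelerator finishes
dmacpy(m3 + N, spm3, sizeof(TYPE) * N);   // copy batch-1 results to the second half of m3
while (!pollDma());
resetDma();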
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
cd $REPO/benchmarks/vector_dma_2x
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/
# In short
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma_2x gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma_2x/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma_2x --caches --l2cache
# On 227
$ ./runvector.sh -b vector_dma_2x
# This will create a BM_ARM_OUT/vector_dma_2x (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator
The cache model hooks the accelerator up to global memory through a coherence crossbar. It is OK if you are not familiar with the coherence crossbar when reading this document. You only need to understand that, with coherence available, the accelerators can directly reference the DRAM space mapped to the CPU.
To enable the accelerator cache, set CACHE_OPTS as below and connect the cluster cache in gem5-config/HWAcc.py:
CACHE_OPTS="--caches --l2cache --acc_cache"
clstr._connect_caches(system, options, l2coherent=True, cache_size = "32kB")
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_input.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache
# In short
$ ./runvector.sh -b vector_cache -p
The image below compares the system organization with an accelerator cache and without.
The primary difference between the cache and DMA versions is in the host code. The pointers passed to the accelerator point to the global memory space (base). The load and store operations directly touch these locations and access them through the coherence crossbar.
// benchmarks/vector_cache/host/main.cpp
uint64_t base = 0x80c00000;
uint64_t spm_base = 0x2f100000;
val_a = (uint64_t)base;
val_b = (uint64_t)(base + sizeof(TYPE) * N);
val_c = (uint64_t)(base + 2 * sizeof(TYPE) * N);
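Since the accelerator now works directly on these DRAM addresses, the datapath itself can look essentially the same as in the DMA version. A sketch of what the vector_cache datapath might look like (assumed, not the exact source):
// Sketch: loads and stores dereference the DRAM pointers passed via val_a/val_b/val_c
// and go through the accelerator cache and coherence crossbar instead of a scratchpad.
void vector_cache(TYPE *m1, TYPE *m2, TYPE *m3) {
  #pragma clang loop unroll_count(8)
  for (int i = 0; i < N; i++) {
    m3[i] = 4 * (m1[i] + m2[i]);
  }
}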
cd $REPO/benchmarks/vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make
# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_input.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache
# In short.
$ ./runvector.sh -b vector_cache -p
# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_cache -p"
Modify the parameter N in defines.h and the input files loaded in gem5-config/fs_vector_input.py:
test_sys.kernel = binary(options.kernel)
test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
test_sys.kernel_extras_addrs = [0x80c00000,0x80c00000+os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin")]
print("Loading file m0 at" + str(hex(0x80c00000)))
print("Loading file m1 at" + str(hex(0x80c00000 + os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin"))))
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
cd $REPO/benchmarks/multi_vector
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make
# In short on 227.
$ ./runmulti.sh -b multi_vector -p
# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/multi_vector gem5-config/run_multi.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/multi_vector/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=multi_vector --caches --l2cache --acc_cache
# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runmulti.sh -b multi_vector -p"
For larger applications we may need to include multiple accelerators in the cluster. For this, we need to include a top accelerator to coordinate each of the other accelerators. The figure below shows the system model. The top now offloads the DMA and accelerator kickstart logic from the host; it also initiates the DMA movement between the accelerators. The host in this case simply passes the address pointers of the inputs and the output zone. There are two accelerators, vector and vector2.
// host/main.cpp
volatile uint8_t * top = (uint8_t *)0x2f000000;
volatile uint32_t * val_a = (uint32_t *)0x2f000001;
volatile uint32_t * val_b = (uint32_t *)0x2f000009;
volatile uint32_t * val_c = (uint32_t *)0x2f000011;
int main(void) {
// Pointers in DRAM. m1 and m2 are inputs.
// m3 is the output
uint32_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);
// MMRegs of the top accelerator.
// Argument 1 to top
*val_a = (uint32_t)(void *)m1;
// Argument 2 to top
*val_b = (uint32_t)(void *)m2;
// Argument 3 to top
*val_c = (uint32_t)(void *)m3;
File | Description |
---|---|
hw/source/top.c | Code for the top accelerator coordinator. This is itself an accelerator |
hw/configs/top.ini | Configuration for the top accelerator |
hw/source/vector.c, hw/configs/vector.ini | Code for the first stage of the vector accelerator |
hw/source/vector2.c, hw/configs/vector2.ini | Code for the second stage of the vector accelerator |
hw/ir | LLVM files after the compiler generates the dataflow graph |
Start address | Description |
---|---|
0x2f000000 | Memory mapped args for top (defined in top.ini) |
0x2f0000F0 | Memory mapped args for vector (defined in vector.ini) |
0x2f000100 | Memory mapped args for vector2 (defined in vector2.ini) |
0x2f100000 | Scratchpad for vector |
0x2f200000 | scratchpad for vector2 |
// Accelerator 1: vector.c
for(i=0;i<N;i++) {
tmp_m3[i] = (m1[i] + m2[i]);
}
// Accelerator 2: vector2.c
for(i=0;i<N;i++) {
m3[i] = tmp_m3[i] * 8;
}
Top manages the accelerators itself:
// hw/source/top.c
volatile uint8_t *DmaFlags = (uint8_t *)(DMA);
volatile uint64_t *DmaRdAddr = (uint64_t *)(DMA + 1);
volatile uint64_t *DmaWrAddr = (uint64_t *)(DMA + 9);
volatile uint32_t *DmaCopyLen = (uint32_t *)(DMA + 17);
Transfer data from DRAM to Scratchpad S1 of Accelerator Vector.
// hw/source/top.c
// Global memory address
*DmaRdAddr = m1_addr;
// Scratchpad address of vector. Defined in hw_defines.h
// 0x2f100000. This is for input 1
*DmaWrAddr = M1ADDR;
// Number of bytes
*DmaCopyLen = vector_size;
// Copy bytes
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR);
// Transfer M2 to scratchpad now.
Scratchpad memory is laid out in the following manner:
M1 | M2 | M3 |
---|---|---|
0x2f100000 (N*sizeof(int) bytes) | 0x2f100040 (N*sizeof(int) bytes) | 0x2f100080 (N*sizeof(int) bytes) |
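The M1ADDR/M2ADDR/M3ADDR macros used by top.c below are assumed to encode this layout in hw/source/hw_defines.h, roughly as in the sketch below (names taken from top.c; the exact header contents may differ):
#define M1ADDR     0x2f100000   // vector: input 1
#define M2ADDR     0x2f100040   // vector: input 2
#define M3ADDR     0x2f100080   // vector: output (becomes the input of vector2)
#define M1ADDR_V2  0x2f200000   // vector2: input, at the base of vector2's scratchpad
// ...analogous macros (e.g. M3ADDR_V2) cover the rest of vector2's scratchpad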
Set up arguments if required. The accelerator can only work with data in the scratchpad or local registers. These are fixed memory ranges in the DMA space. In this case, the V1 vector accelerator does not require any additional arguments. To start the accelerator from TOP, it is important to follow the steps below (in particular, checking whether the accelerator is ready for kickstart) after the arguments are set up.
// Write to argument MMR of V1 accelerator
// First, check if accelerator ready for kickstarting
while (*V1Flags != 0x0);
// Start the accelerated function
*V1Flags = DEV_INIT;
// Poll function for finish
while ((*V1Flags & DEV_INTR) != DEV_INTR);
// Reset accelerator for next time.
*V1Flags = 0x0;
The output of accelerator V1 is the input of V2. We need to copy N*4 bytes from 0x2f100080 to 0x2f200000.
// Transfer the output of V1 to V2.
*DmaRdAddr = M3ADDR;
*DmaWrAddr = M1ADDR_V2;
*DmaCopyLen = vector_size;
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR)
;
// Write to argument MMR of V2 accelerator
// First, check if accelerator ready for kickstarting
while (*V2Flags != 0x0);
// Start the accelerated function
*V2Flags = DEV_INIT;
// Poll function for finish
while ((*V2Flags & DEV_INTR) != DEV_INTR);
// Reset accelerator for next time.
*V2Flags = 0x0;
// Transfer M3
// Scratchpad address
*DmaRdAddr = M3ADDR_V2;
// Global address the host wants the final result in
*DmaWrAddr = m3_addr;
// Number of bytes
*DmaCopyLen = vector_size;
// Start DMA
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR)
;
We now create a multi-accelerator system with a shared cache. We do not need to explicitly transfer data between accelerators; all data is implicitly transferred between the accelerators through the shared cluster cache. The top only has to set up the appropriate arguments and invoke the accelerators in sequence. Each accelerator reads and writes back to the global memory space, and the cluster cache captures the locality.
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
cd $REPO/benchmarks/multi_vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make
# In short on 227.
$ ./runmulti.sh -b multi_vector_cache -p
# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/multi_vector_cache gem5-config/run_multi.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/multi_vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=multi_vector_cache --caches --l2cache --acc_cache
# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runmulti.sh -b multi_vector_cache -p"
// benchmarks/multi_vector_cache/hw/source/top.c
// Pass on host address arguments to accelerator
*V1Arg1 = m1_addr;
*V1Arg2 = m2_addr;
*V1Arg3 = m3_addr;
// Start V1
*V1Flag = DEV_INIT;
// Poll function for finish
while ((*V1Flag & DEV_INTR) != DEV_INTR);
*V1Flag = 0x0;
// Start V2
*V2Flag = DEV_INIT;
while ((*V2Flag & DEV_INTR) != DEV_INTR);
*V2Flag = 0x0;
Streaming pipelines introduce a FIFO interface to the memory system. If you take a look at the datapath in vector_dma/hw/vector_dma.c, you will notice that the memory access pattern is highly regular, with no ordering requirements between the elements of the array. We simply sequence through the elements of the vector, applying an operation at each individual location.
This can be concisely described as a stream of values. A stream simply provides a FIFO interface to the data.
Memory map. The term MMR refers to the memory-mapped registers and flags used to control the DMAs and the accelerators.
Address | Description |
---|---|
0x2F000000 | TOP MMR |
0x2F000100 | S1 MMR |
0x2F000200 | S2 MMR |
0x2F000300 | S3 MMR |
0x2fe00000 | StreamDMA MMR |
0x2ff00000 | Noncoherent DMA |
0x2F001000 | Stream DMA FIFO port (DRAM->S1, S3->DRAM) |
0x2F003000 | S1->S2 FIFO port |
0x2F004000 | S2->S3 FIFO port |
This streams data from DRAM in chunks of stream_size (bits). The figure illustrates a stream DMA.
We need to create a new configuration and modify top to initiate the stream.
The stream DMA includes a control pio (similar to the other accelerators). top can write to it to control where in DRAM the data is streamed from. The out port of the StreamDMA engine is wired up to the stream ports of one of the accelerators. Each stream is a single-input, single-output FIFO. Each accelerator has a .stream interface into which all the required streams are wired. In this case, i) we read from DRAM and send the data to accelerator S1, and ii) we read data from accelerator S3 and write it back to DRAM through the stream DMA.
- The accelerators use the stream address 0x2f001000 to read/write the DRAM stream.
- stream_size: each port access reads or writes 8 bits worth of data.
- StrDmaRdFrameSize: the stream DMA transfers this many bytes of data in chunks of stream_size.
- The total number of dataflow tokens generated will be $\frac{RdFrameSize \times 8}{stream\_size}$; for example, a 64-byte read frame with stream_size = 8 produces 64 tokens.
# Configuration in gem5-config/vector_stream.py
# Control address for setting up stream
addr = 0x2fe00000
clstr.stream_dma0 = StreamDma(pio_addr=addr, pio_size=32, gic=gic, max_pending=32)
# Address for reading/writing to stream from accelerator
clstr.stream_dma0.stream_addr= local_low + 0x1000
# Number of bits per FIFO access.
clstr.stream_dma0.stream_size=8
clstr.stream_dma0.pio_delay='1ns'
clstr.stream_dma0.rd_int = 210
clstr.stream_dma0.wr_int = 211
clstr._connect_dma(system, clstr.stream_dma0)
# DRAM->Accelerator S1
clstr.S1.stream = clstr.stream_dma0.stream_out
# Accelerator S3->DRAM
clstr.S3.stream = clstr.stream_dma0.stream_in
// vector_stream/hw/source/top.c
// Define Stream control config
volatile uint8_t *StrDmaFlags = (uint8_t *)(STREAM_DMA_MMR);
volatile uint64_t *StrDmaRdAddr = (uint64_t *)(STREAM_DMA_MMR + 4);
volatile uint64_t *StrDmaWrAddr = (uint64_t *)(STREAM_DMA_MMR + 12);
volatile uint32_t *StrDmaRdFrameSize = (uint32_t *)(STREAM_DMA_MMR + 20);
volatile uint8_t *StrDmaNumRdFrames = (uint8_t *)(STREAM_DMA_MMR + 24);
volatile uint8_t *StrDmaRdFrameBuffSize = (uint8_t *)(STREAM_DMA_MMR + 25);
volatile uint32_t *StrDmaWrFrameSize = (uint32_t *)(STREAM_DMA_MMR + 26);
volatile uint8_t *StrDmaNumWrFrames = (uint8_t *)(STREAM_DMA_MMR + 30);
volatile uint8_t *StrDmaWrFrameBuffSize = (uint8_t *)(STREAM_DMA_MMR + 31);
// Initiate Stream from DRAM to FIFO port
*StrDmaRdAddr = in_addr;
*StrDmaRdFrameSize = INPUT_SIZE; // Specifies number of bytes
*StrDmaNumRdFrames = 1;
*StrDmaRdFrameBuffSize = 1;
// Start Stream
*StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;
Stream buffers establish ports directly between accelerators. They do not need to be set up at runtime: the configuration is fixed in the gem5 config, and the accelerators simply read from the address that controls the port.
For example, here we have set up a stream buffer between accelerator v1 and v2.
Each accelerator uses this address to read from and write to the FIFO. The stream buffer only supports a single input and a single output port.
┌──────────────────────┐ ┌───────────────┐
│ Accelerator V1 │ ┌─────────────┐ │ Acclerator │
│ ├─────►│ FIFO Buffer ├────────► V2 │
└──────────────────────┘ └─────────────┘ └───────────────┘
# Address accelerator v1 and v2 can read and write to.
addr = local_low + 0x3000
clstr.S1Out = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
# # of bits read on each access
clstr.S1Out.stream_size = 8
# Input to the buffer from accelerator S1
clstr.S1.stream = clstr.S1Out.stream_in
# Output of buffer sent to accelerator S2.
clstr.S2.stream = clstr.S1Out.stream_out
Each stream buffer only supports a single input and a single output port. However, multiple stream buffers can be wired to a single accelerator, i.e. each accelerator can have multiple stream-buffer ports.
┌───────────────┐
┌─────────────┐ │ Acclerator │
│ 0x2f003000 ├────┬───► │
└─────────────┘ ├───► V2 │
│ └───────────────┘
┌─────────────┐ │
│ 0x2f004000 ├────┘
└─────────────┘
# Address accelerator v1 and v2 can read and write to access FIFO.
addr = local_low + 0x3000
clstr.B1 = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
addr = local_low + 0x4000
clstr.B2 = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
clstr.S2.stream = clstr.B1.stream_out
clstr.S2.stream = clstr.B2.stream_out
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM
cd $REPO/benchmarks/vector_stream
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make
# In short on 227.
$ ./runvector_stream.sh -p
# Full command
/data/src/750-SALAM/build/ARM/gem5.opt --outdir=BM_ARM_OUT/vector_stream gem5-config/run_vector_stream.py --mem-size=4GB --kernel=/data/src/gem5-lab2/benchmarks/vector_stream/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2/benchmarks --accbench=vector_stream --caches --l2cache --acc_cache
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector_stream.sh -p"
The purpose of top is to kickstart the stream DMA from memory. Completion is detected by checking whether the output stream is complete. The overall execution is data-driven: when the FIFO port empties out, the top accelerator triggers the completion of the stream.
// Start Stream DMAs
*StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;
// Start all accelerators
// Start S1
*S1 = 0x01;
// Start S2
*S2 = 0x01;
// Start S3
*S3 = 0x01;
// Wait for all accelerators to finish before sending interrupt to CPU
while ((*StrDmaFlags & 0x08) == 0x08);
As each accelerator fills its stream buffer ports, it automatically triggers the operations in the neighboring accelerators in a dataflow manner. Each accelerator has to know how many tokens are going to be generated and has to read its stream buffer port. The S1 stage writes to the FIFO stream buffer between S1 and S2, using the appropriate stream-buffer memory-mapped port.
// hw/source/hw_defines.h
#define BASE 0x2F000000
#define StreamIn BASE + 0x1000
#define S1Out BASE + 0x3000
// hw/source/S1.c
volatile dType_8u * STR_IN = (dType_8u *)(StreamIn);
volatile dType_8u * STR_OUT = (dType_8u *)(S1Out);
......
for (dType_Reg i = 0; i < INPUT_SIZE; i++) {
*STR_OUT = (*STR_IN) + BUFFER[i];
}
}
Complete the configuration for a new S4 stage: write hw/source/S4.c and hw/configs/S4.ini. Modify top.c to define the memory map for its MMR and stream ports. Modify hw/gem5-config/vector_stream.py. You will need to make S4 the final stage that writes to the stream DMA, and you will have to define a new stream buffer that connects S3 and S4. You may also need to make Makefile modifications. Figure it out.

This document has been put together by your CMPT 750/450 instructors.