In this lab and accelerator assignment we will be using a specific version of gem5 developed at the University of North Carolina at Charlotte. We thank the authors of gem5-SALAM for making their tool available as open source. The CMPT 750 version of gem5 includes additional changes and benchmarks and may not be backwards compatible with the upstream SALAM version.
gem5 ACC extends gem5 to model domain-specific architectures and heterogeneous SoCs. When creating your accelerator there are many considerations to be made which have a first-order effect on overall performance and energy.
In this case, the code we are going to accelerate is the following:
char input[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char coeff[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char output[8];
for (int i = 0; i < 8; i++)
{
    output[i] = input[i] + coeff[i];
    output[i] = output[i]*2;
    output[i] = output[i]*2;
}
We will illustrate accelerator creation using the DMA model.
$ git clone git@github.com:CMPT-7ARCH-SFU/gem5-lab2-v2.git
$ export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
$ cd gem5-lab2-v2/benchmarks/vector_dma
$ ls
defines.h host hw Makefile
File | Description |
---|---|
host/ | Defines the main that runs on the CPU and invokes the accelerator |
hw/ | Defines the accelerator datapath |
hw/config.yml | Defines the accelerator configuration |
inputs/m0.bin, m1.bin | Input files that contain the data |
defines.h | Common defines used by both the datapath and host code |
gem5-config/fs_vector_dma.py | Top-level initialization of the Arm SoC system |
gem5-config/vector_dma.py | gem5 configuration file for the vector_dma accelerator. Defines the accelerator components |
Our SoC has two main sections.
Here we focus on the host code and its interactions with the accelerator. For this application, we are using a bare-metal kernel. This means that we have a load file and an assembly boot file, and must generate ELF files for execution.
# Pseudocode for host code
1. Set up addresses for the scratchpads
2. Copy data from DRAM into the scratchpads
3. Start the accelerator
4. Wait for the accelerator to finish (the interrupt resets the status flag)
5. Copy data from the scratchpads back to DRAM
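Putting the steps together, here is a condensed sketch of the host flow, assuming the addresses, TYPE, N, and DEV_INIT come from defines.h and the dmacpy/pollDma/resetDma helpers from common/dma.h; the actual host/main.cpp (walked through below) may differ in details.

#include <stdint.h>
#include "defines.h"   // VECTOR_DMA, MATRIX1..3, TYPE, N, DEV_INIT (assumed to be defined here)

void run_vector_dma(void) {
    uint64_t base = 0x80c00000;                        // inputs preloaded here by the gem5 config
    TYPE *m1 = (TYPE *)base;
    TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
    TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);
    TYPE *spm1 = (TYPE *)MATRIX1;                      // scratchpad views
    TYPE *spm2 = (TYPE *)MATRIX2;
    TYPE *spm3 = (TYPE *)MATRIX3;
    volatile uint8_t *acc = (uint8_t *)VECTOR_DMA;     // accelerator status MMR

    dmacpy(spm1, m1, sizeof(TYPE) * N);                // steps 1-2: DRAM -> scratchpad
    while (!pollDma());
    resetDma();
    dmacpy(spm2, m2, sizeof(TYPE) * N);
    while (!pollDma());
    resetDma();

    // (if the datapath takes arguments, write them to the parameter MMRs before starting)
    *acc = DEV_INIT;                                   // step 3: start the accelerator
    while (*acc != 0);                                 // step 4: the ISR clears the status on completion

    dmacpy(m3, spm3, sizeof(TYPE) * N);                // step 5: scratchpad -> DRAM
    while (!pollDma());
    resetDma();
}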
defines.h
Top-level definitions for memory-mapped addresses. Since this is a bare-metal SoC without any access to virtual memory or memory allocators, we have to define the memory space ourselves. The overall memory space looks like the following:
// Accelerator name VECTOR_DMA. Base address for
// interacting with memory mapped registers
#define VECTOR_DMA 0x10020040
// Address for interacting with DMA
#define DMA_Flags 0x10020000
// 3 scratchpads
#define MATRIX1 0x10020080
#define MATRIX2 0x10020100
#define MATRIX3 0x10020180
Address | Description |
---|---|
0x0 onwards | Host DRAM coherent address space |
0x10020040 | Accelerator status (0: inactive, 1: start, 4: running) |
0x10020041 onwards | Accelerator parameters, 8 bytes each (3 parameters; see the hw/ code) |
0x10020000-0x10030000 | Accelerator cluster range (scratchpad memories and MMRs) |
Up to the memory limit | Host DRAM space |
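To make the table concrete, here is a sketch of how the host could view these MMRs as volatile pointers. The 1-byte-status-then-8-byte-parameter layout follows the table above, but treat the exact addresses and the start/poll protocol as assumptions; the generated *_clstr_hw_defines.h and host/main.cpp are authoritative.

#include <stdint.h>

// Hypothetical MMR view for vector_dma: a 1-byte status flag at the base
// address followed by 8-byte parameter slots.
static volatile uint8_t  *acc_status = (uint8_t  *)0x10020040; // 0: inactive, 1: start, 4: running
static volatile uint64_t *acc_arg1   = (uint64_t *)0x10020041; // parameter 1
static volatile uint64_t *acc_arg2   = (uint64_t *)0x10020049; // parameter 2
static volatile uint64_t *acc_arg3   = (uint64_t *)0x10020051; // parameter 3

static void start_and_wait(void) {
    *acc_status = 0x1;              // request start
    while (*acc_status != 0x0);     // the ISR clears the flag when the accelerator finishes
}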
These memory spaces are set in the following files:
Accelerator range
Any access to this range, whether from the host code or the accelerator, is routed to the accelerator cluster.
gem5-config/vector_dma.py
line 24: local_low = 0x10020000
line 25: local_high = 0x10030000
Accelerator Start Address and Parameters
gem5-config/vector_dma.py
clstr.vector_dma = CommInterface(devicename=acc, gic=gic, pio_addr=0x10020040, pio_size=64, int_num=68)
pio_size is in bytes
Scratchpad address
gem5-config/vector_dma.py
# MATRIX1 (Variable)
addr = 0x10020080
spmRange = AddrRange(addr, addr + 0x40)
Connecting scratchpads to accelerator
# Connecting MATRIX1 to vector_dma. The range(2)
# here controls how many accesses you can issue
# to the scratchpad in one cycle (one connection per port).
for i in range(2):
    clstr.vector_dma.spm = clstr.matrix1.spm_ports
# MATRIX2 (Variable)
addr = 0x10020100
spmRange = AddrRange(addr, addr + 0x40)
# MATRIX3 (Variable)
addr = 0x10020180
spmRange = AddrRange(addr, addr + 0x40)
In this instance we want to have the DMA and accelerator controlled through memory-mapped devices to reduce the overhead on the CPU. We define the helper functions in common/dma.h.
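As a rough sketch of what those helpers could look like: the offsets follow the DMA register layout used later in hw/source/top.c (a 1-byte flags register, 64-bit read/write addresses, and a 32-bit copy length), and the flag values DMA_DEV_INIT/DMA_DEV_INTR are assumed names and encodings; check common/dma.h for the real ones.

#include <stdint.h>
#include <stddef.h>
#include "defines.h"       // provides DMA_Flags

// Assumed flag encodings; the real values live in common/dma.h.
#define DMA_DEV_INIT 0x01
#define DMA_DEV_INTR 0x04

static volatile uint8_t  *dmaFlags   = (uint8_t  *)(DMA_Flags);
static volatile uint64_t *dmaRdAddr  = (uint64_t *)(DMA_Flags + 1);
static volatile uint64_t *dmaWrAddr  = (uint64_t *)(DMA_Flags + 9);
static volatile uint32_t *dmaCopyLen = (uint32_t *)(DMA_Flags + 17);

// Start a DMA copy of len bytes from src to dst (either side may be a scratchpad range).
static inline void dmacpy(void *dst, void *src, size_t len) {
    *dmaRdAddr  = (uint64_t)(uintptr_t)src;
    *dmaWrAddr  = (uint64_t)(uintptr_t)dst;
    *dmaCopyLen = (uint32_t)len;
    *dmaFlags   = DMA_DEV_INIT;          // kick off the transfer
}

// Returns non-zero once the DMA has raised its completion flag.
static inline int pollDma(void) {
    return (*dmaFlags & DMA_DEV_INTR) == DMA_DEV_INTR;
}

// Clear the flags so the DMA can be reused.
static inline void resetDma(void) {
    *dmaFlags = 0x0;
}

dmacpy, pollDma, and resetDma are used in exactly this way by the host code shown below.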
$ cd benchmarks/inputs
$ xxd m0.bin
00000000: 0100 0000 0200 0000 0300 0000 0400 0000 ................
00000010: 0500 0000 0600 0000 0700 0000 0800 0000 ................
00000020: 0900 0000 0a00 0000 0b00 0000 0c00 0000 ................
00000030: 0c00 0000 0d00 0000 0e00 0000 0f00 0000 ................
# fs_vector_dma.py
test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m1.bin"]
main.cpp
uint64_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);
We then set up the DMA to perform the memory copy between DRAM and the scratchpad memory. dmacpy is similar to memcpy. Note the address ranges used for performing the copy: the destination uses the scratchpad range specified in the config.ini and gem5 scripts. This space is carved out of the global memory space, and the host CPU knows to route any reads and writes in this address range to the scratchpad.
// Define scratchpad addresses.
// Note that N = 16; 4-byte ints; 64 bytes total (0x40 bytes)
TYPE *spm1 = (TYPE *)MATRIX1;
TYPE *spm2 = (TYPE *)MATRIX2;
TYPE *spm3 = (TYPE *)MATRIX3;
// spm1 is the destination address
// m1 is the source address
// Size in bytes
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
Set up the parameters in the accelerator's memory-mapped registers. The accelerator status byte sits at pio_addr = 0x10020040 (see config.ini and the gem5 config) with pio_size = 64. This reserves 64 bytes of memory-mapped register space; the parameter entries start at the next address (0x10020041) and are 8 bytes each. The parameters are automatically derived from the accelerator function definition.
// Possible status values: 0x0 inactive, 0x1 start the accelerator, 0x4 running.
// Start the accelerated function
*ACC = DEV_INIT;
while (*ACC != 0);
In our boot code, we set up an Interrupt Service Routine (ISR) in isr.c that the accelerator triggers at the end of its execution. The ISR resets the accelerator status to 0x0, which the host code spins on.
// isr.c. Invoked when accelerator is complete
void isr(void)
{
printf("Interrupt\n\r");
// Helps break the for loop in the host code
*ACC = 0x00;
printf("Interrupt\n\r");
}
We copy the results back from the accelerator's scratchpad to DRAM so that the host code can access and check them.
dmacpy(m3, spm3, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
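After the copy-back, the host can verify the result directly in DRAM. A minimal sketch of such a check, assuming TYPE and N come from defines.h and that the datapath computes prod[i] = 4*(m1[i] + m2[i]) as shown in the next section:

// Hypothetical correctness check run by the host after the copy-back.
// Returns the number of mismatches (0 means the output matches).
int check_result(TYPE *m1, TYPE *m2, TYPE *m3) {
    int errors = 0;
    for (int i = 0; i < N; i++) {
        TYPE expected = 4 * (m1[i] + m2[i]);   // mirrors the datapath in hw/vector_dma.c
        if (m3[i] != expected)
            errors++;
    }
    return errors;
}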
We will first start by creating the code for our accelerator.
In hw/vector_dma.c
there is a vector loop application. To expose parallelism in computation and memory access we fully unroll the innermost loop of the application; the simulator will natively pipeline the remaining loop iterations for us. To accomplish the loop unrolling we can use clang compiler pragmas such as the one on line 18 of vector_dma.c.
// Unrolls loop and creates instruction parallelism
#pragma clang loop unroll_count(8)
for(i=0;i<N;i++) {
prod[i] = 4*(m1[i] + m2[i]);
}
With unrolling
The hardware ends up being a circuit that implements the above dataflow graph. The unrolling creates 8-way parallelism: the loads of m1[i] and m2[i] can happen in parallel, and the adds and multiplies can happen in parallel. The figures show the compiler representation, or view, that gets mapped down to hardware. Each node in the graph is an LLVM IR instruction. LLVM IR is an intermediate, RISC-like representation with certain important differences.
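Conceptually, unroll_count(8) replicates the loop body eight times per iteration, and it is these independent copies that supply the parallel loads, adds, and multiplies. A hand-unrolled sketch of the equivalent C (illustrative only; the real transformation happens on the LLVM IR):

// Hand-unrolled equivalent of one iteration group (illustrative only).
// All eight statements are independent, so their loads, adds, and multiplies
// can be scheduled in the same cycles, subject to FU and port limits.
for (i = 0; i < N; i += 8) {
    prod[i + 0] = 4 * (m1[i + 0] + m2[i + 0]);
    prod[i + 1] = 4 * (m1[i + 1] + m2[i + 1]);
    prod[i + 2] = 4 * (m1[i + 2] + m2[i + 2]);
    prod[i + 3] = 4 * (m1[i + 3] + m2[i + 3]);
    prod[i + 4] = 4 * (m1[i + 4] + m2[i + 4]);
    prod[i + 5] = 4 * (m1[i + 5] + m2[i + 5]);
    prod[i + 6] = 4 * (m1[i + 6] + m2[i + 6]);
    prod[i + 7] = 4 * (m1[i + 7] + m2[i + 7]);
}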
Benefits of Compiler IR view of Accelerator
Infinite registers. Typical object code for CPUs is limited by the architectural registers. This causes unnecessary memory operations (spills and fills) that hide the available parallelism. Compiler IR has no such limitation, since it simply captures the available parallelism and locality.
Dataflow semantics. Object code is laid out linearly and relies on a program counter, whereas compiler IR inherently supports dataflow semantics with no specific program counter.
Without unrolling
We are generating a hardware datapath from the C code specified, hence we have a number of rules. If these rules are violated, the compiler may complain, you may encounter a runtime error from the LLVM runtime engine of SALAM, or you may even get a silent failure. It is very important that you follow them.
Rule 1: SINGLE FUNCTION. Only a single function is permitted per accelerator .c file.
Rule 2: NO LIBRARIES. Cannot use standard library functions. Cannot call into other functions.
Rule 3: NO I/O. No printfs or writes to files. Either use traces or write back to CPU memory to debug.
Rule 4: ONLY LOCALS OR ARGS. Can only work with variables declared within the function or the input arrays passed as arguments.
Read here for more details on LLVM.
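As an example of a datapath that respects these rules, here is a sketch of a SALAM-friendly accelerator function: a single function, no library calls, no I/O, and only arguments and locals. It is modelled on the vector kernel used in this lab, not a verbatim copy of hw/vector_dma.c.

// A single accelerator function: the arguments are scratchpad/DRAM pointers,
// everything else is a local. No library calls, no printf, no globals.
void vector_kernel(int *m1, int *m2, int *prod) {
    #pragma clang loop unroll_count(8)
    for (int i = 0; i < 16; i++) {   // 16 = N in this benchmark
        int sum = m1[i] + m2[i];     // local temporary is fine
        prod[i] = sum * 4;           // results written back to memory visible to the host
    }
}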
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Building gem5-SALAM. WE PREBUILD. YOU DO NOT NEED TO BUILD GEM5
git clone git@github.com:CMPT-7ARCH-SFU/gem5-SALAM.git
cd gem5-SALAM; scons build/ARM/gem5.opt -j`nproc`
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
# Build benchmark
cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
module load llvm-10
# Build datapath and host binary
make clean; make
$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_dma gem5-config/fs_vector_dma.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_dma/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_dma --caches --l2cache --acc_cache
# OR in a single line
./runvector.sh -b vector_dma
A config.yml is added to all the benchmarks. The systembuilder.py script reads the config for each benchmark and creates the python files (fs_$BENCHMARK.py and $BENCHMARK.py) containing the setup of the accelerator. By default, the generated python files can be found in the config/SALAM/generated/ directory. There is a commented line in the bash files:
mkdir -p $LAB_PATH/config/SALAM/generated
mkdir benchmarks/gemm
cp benchmarks/vector_dma/config.yml benchmarks/gemm/config.yml
export BENCH=gemm
${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName $BENCH --benchDir "benchmarks/${BENCH}"
This line generates the python files for us. After the config file is generated, we manually change it to read the input files (m0.bin, m1.bin).
fs_$BENCHMARK.py with:
elif args.kernel is not None:
    test_sys.workload.object_file = binary(args.kernel)
    test_sys.workload.extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin", os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
    test_sys.workload.extras_addrs = [0x80c00000, 0x80c00000+8*8]
systembuilder.py also creates a new header containing the SPM and DMA addresses. This header is stored in the benchmark directory under the name $BENCHMARK_clstr_hw_defines.h. We include this header in the defines.h of every benchmark. For each of our accelerators we also need a yaml configuration (if you use the system builder these are generated for you). In each yaml file we can define the number of cycles for each IR instruction and provide any limitations on the number of Functional Units (FUs) associated with IR instructions.
Additionally, there are options for setting the FU clock periods and controls for pipelining of the accelerator. Below is an example with a few IR instructions and their respective cycle counts:
instructions:
add:
functional_unit: 1
functional_unit_limit: 5
opcode_num: 13
runtime_cycles: 1
Importantly, under the AccConfig section, we set MMR specific details such as the size of the flags register, memory address, interrupt line number, and the accelerator’s clock.
- Accelerator:
- Name: vector_dma
IrPath: benchmarks/vector_dma/hw/vector_dma.ll
ConfigPath: benchmarks/vector_dma/hw/vector_dma.ini
Debug: True
PIOSize: 25
PIOMaster: LocalBus
InterruptNum: 68
In the Memory section, you can define the scratchpad's memory address, size, response latency, and number of ports. Also, if you want the accelerator to verify that data is present in the scratchpad before accessing it, set ready mode to true.
- Var:
- Name: MATRIX1
Type: SPM
Size: 64
Ports: 2
- Var:
- Name: MATRIX2
Type: SPM
Size: 64
Ports: 2
- Var:
- Name: MATRIX3
Type: SPM
Size: 64
Ports: 2
Hardware has functional units; LLVM IR has ops. There is a many-to-one mapping between LLVM IR ops and functional units, i.e., different instruction types can be scheduled on the same functional unit (taking cycle and pipelining constraints into consideration). SALAM enables the designer to control this mapping, typically laid out in the config.yml file. Below is the full list of ops and functional units supported by SALAM; see the "functional_unit" field under instructions.
Function Unit | ID |
---|---|
INTADDER | 1 |
INTMULTI | 2 |
INTSHIFTER | 3 |
INTBITWISE | 4 |
FPSPADDER | 5 |
FPDPADDER | 6 |
FPSPMULTI | 7 |
FPSPDIVID | 8 |
FPDPMULTI | 9 |
FPDPDIVID | 10 |
COMPARE | 11 |
GETELEMENTPTR | 12 |
CONVERSION | 13 |
OTHERINST | 14 |
REGISTER | 15 |
COUNTER | 16 |
TRIG_SINE | 17 |
This mapping is used for tracking activity factors and estimating the power and area of the datapath. The opcode numbers referenced by opcode_num are listed below.
Opcode | Number |
---|---|
Add | 13 |
Addrspac | 50 |
Alloca | 31 |
AndInst | 28 |
Ashr | 27 |
Bitcast | 49 |
Br | 2 |
Call | 56 |
Fadd | 14 |
Fcmp | 54 |
Fdiv | 21 |
Fence | 35 |
Fmul | 18 |
Fpext | 46 |
Fptosi | 42 |
Fptoui | 41 |
Fptrunc | 45 |
Frem | 24 |
Fsub | 16 |
Gep | 34 |
Icmp | 53 |
Indirect | 4 |
Inttoptr | 48 |
Invoke | 5 |
Landingp | 66 |
Load | 32 |
Lshr | 26 |
Mul | 17 |
OrInst | 29 |
Phi | 55 |
Ptrtoint | 47 |
Resume | 6 |
Ret | 1 |
Sdiv | 20 |
Select | 57 |
Sext | 40 |
Shl | 25 |
Srem | 23 |
Store | 33 |
Sub | 15 |
SwitchIn | 3 |
Trunc | 38 |
Udiv | 19 |
Uitofp | 43 |
Unreacha | 7 |
Urem | 22 |
Vaarg | 60 |
XorInst | 30 |
Zext | 39 |
HWAccConfig.py
acc.hw_interface.functional_units.integer_adder.limit = 5
We are now going to leverage and modify the example scripts for gem5's full-system simulation. In gem5-config/fs_vector_dma.py we have a modified version of the script located in gem5's default configs folder. The main difference in our configuration is that there are two additional parameters (--accpath and --accbench).
fs_vector_dma.py: Connects the accelerator cluster to the Arm system. Accelerator connections start at line 231: vector_dma.makeHWAcc(args, test_sys)
vector_dma.py: Sets up the specific accelerator system.
HWAccConfig.py: Provides accelerator-independent helper functions.
vector_dma.py
In order to simplify the organization of accelerator-related resources, we define an accelerator cluster. This cluster contains any resources shared between the accelerators as well as the accelerators themselves. It has several functions associated with it that help with attaching accelerators to it and with hooking the cluster into the system.
# Allocate cluster and build it
def makeHWAcc(args, system):
system.vector_dma_clstr = AccCluster()
buildvector_dma_clstr(args, system, system.vector_dma_clstr)
def buildvector_dma_clstr(args, system, clstr):
# Define memory map. Any read/write from cpu to this range sent to accelerator cluster
local_low = 0x10020000
local_high = 0x10030000
local_range = AddrRange(local_low, local_high)
# Mutually exclusive range. Any accelerator access to this range sent to CPU's L2 and DRAM
external_range = [AddrRange(0x00000000, local_low-1), AddrRange(local_high+1, 0xFFFFFFFF)]
system.iobus.mem_side_ports = clstr.local_bus.cpu_side_ports
# Connect caches if any, if cache_size !=0
clstr._connect_caches(system, args, l2coherent=True)
gic = system.realview.gic
We then invoke the _connect_caches function (line 20) in order to connect any cache hierarchy that exists between the cluster and the memory bus or the L2 crossbar of the CPU, depending on the design. This gives the accelerator cluster master access to resources outside of itself, and it establishes coherency between the cluster and other resources via caches. If no caches are needed, this merely attaches the cluster to the memory bus without a cache.
system.acctest._connect_caches(system, options, l2coherent=True, cache_size = "32kB")
These functions are defined in gem5-SALAM/src/hwacc/AccCluster.py
The DMA control address defined here has to match common/dma.h. The memory mapped control, pio, int_num all have to match the values set in config.yml.
# Noncoherent DMA
clstr.dma = NoncoherentDma(pio_addr=0x10020000, pio_size = 21, gic=gic, int_num=95)
clstr.dma.cluster_dma = clstr.local_bus.cpu_side_ports
clstr.dma.max_req_size = 64
clstr.dma.buffer_size = 128
clstr.dma.dma = clstr.coherency_bus.cpu_side_ports
clstr.local_bus.mem_side_ports = clstr.dma.pio
Next, we create a CommInterface (line 30), which is the communications portion of our accelerator. We then configure the accelerator and generate its LLVM interface by passing the CommInterface, a config file, and an IR file to AccConfig (line 31). This generates the LLVM interface, configures any hardware limitations, and establishes the static Control and Dataflow Graph (CDFG).
We then connect the accelerator to the cluster (Line 32). This will attach the PIO port of the accelerator to the cluster’s local bus that is associated with MMRs.
# vector_dma Definition
acc = "vector_dma"
# Datapath definition
ir = os.environ["LAB_PATH"]+"/benchmarks/vector_dma/hw/vector_dma.ll"
# Configs for picking up memory maps and instruction to FU mapping
config = os.environ["LAB_PATH"]+"/benchmarks/vector_dma/config.yml"
yaml_file = open(config, 'r')
yaml_config = yaml.safe_load(yaml_file)
debug = False
for component in yaml_config["acc_cluster"]:
    if "Accelerator" in component.keys():
        for axc in component["Accelerator"]:
            print(axc)
            if axc.get("Name","") == acc:
                debug = axc["Debug"]
# Communication interface to accelerator.
clstr.vector_dma = CommInterface(devicename=acc, gic=gic, pio_addr=0x10020040, pio_size=64, int_num=68)
AccConfig(clstr.vector_dma, ir, config)
# vector_dma Config
clstr.vector_dma.pio = clstr.local_bus.mem_side_ports
# Activate/Deactivate debug messages
clstr.vector_dma.enable_debug_msgs = debug
# MATRIX1 (Variable)
# Base address. Accelerator/CPU reach into scratchpad starting at this address
addr = 0x10020080
# Set range. 0x40 = 64 bytes
spmRange = AddrRange(addr, addr + 0x40)
# Create scratchpad object
clstr.matrix1 = ScratchpadMemory(range = spmRange)
# Report in debug table?
clstr.matrix1.conf_table_reported = False
# If ready (True) then pay attention to flags below
clstr.matrix1.ready_mode = False
# Zero out after first read
clstr.matrix1.reset_on_scratchpad_read = True
# Read succeeds only if previously init (False)
clstr.matrix1.read_on_invalid = False
# Write succeeds only if previously read (True)
clstr.matrix1.write_on_valid = True
# Connect scratchpad to local bus
clstr.matrix1.port = clstr.local_bus.mem_side_ports
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
#
make clean; make
# In short
$ ./runvector.sh -b vector_dma -p
# This will create a BM_ARM_OUT/vector_dma (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator
Do you understand what these stats are?
cat BM_ARM_OUT/vector_dma/debug-trace.txt
system.vector_dma_clstr.vector_dma.llvm_interface
Total Area: 12290.3
Total Power Static: 0.154364
Total Power Dynamic: 2.43911
Function Unit - Limit (0=inf) - Units/Cycle
double_multiplier - 0 0
bitwise_operations - 0 0
bit_shifter - 0 0
double_adder - 0 0
float_divider - 0 0
bit_shifter - 0 0
integer_multiplier - 0 0
integer_adder - 0 16
double_divider - 0 0
float_adder - 0 0
float_multiplier - 0 0
========= Performance Analysis =============
Setup Time: 0h 0m 0s 1ms 117us
Simulation Time (Total): 0h 0m 0s 1ms
Simulation Time (Active): 0h 0m 0s 1ms
Queue Processing Time: 0h 0m 0s 0ms
Scheduling Time: 0h 0m 0s 0ms
Computation Time: 0h 0m 0s 0ms
System Clock: 0.1GHz
Runtime: 20 cycles
Runtime: 0.2 us
Stalls: 0 cycles
Executed Nodes: 19 cycles
The metrics of interest are dynamic power, area, and Runtime.
Vary the unroll factor in benchmarks/vector_dma/hw/vector_dma.c from 1 to 16 and see what happens to the runtime cycles each time. Also look at the stats for Total Number of Registers, Max Register Usage Per Cycle, Runtime, Runtime FUs, and the Power Analysis. Why are there no stalls?
WARNING: Remember you have to make clean and rebuild the .ll and main.elf each time.
In vector_dma.py, change the number of scratchpad ports (see the range loops, e.g., line 72; this must be modified for each scratchpad) and see what happens to the cycle count. Why does changing the number of ports to 1 increase stalls? To try to understand, follow the steps below.
Change the CPU type in runvector.sh to MinorCPU and see the difference in overall simulation time.
Set FLAGS="HWACC,LLVMRuntime" in runvector.sh. Re-run and check debug-trace.txt. Try to comprehend what the trace says; it includes the step-by-step execution of the hardware. Disable these flags for the assignments; otherwise the traces will consume too much space.
Comments on the trace
BM_ARM_OUT/vector_dma/debug-trace.txt
Look for lines of the type:
Trying to read addr: 0x0000000102..., 4 bytes through port:
When changing the number of ports to 1, check how many reads occur in a single tick.
1476840000: system.acctest.vector_dma: Checking MMR to see if Run bit set
1476840000: system.acctest.vector_dma.compute: Initializing LLVM Runtime Engine!
1476840000: system.acctest.vector_dma.compute: Constructing Static Dependency Graph
1476840000: system.acctest.vector_dma.compute: Parsing: (/data/src/gem5-lab2/benchmarks/vector_dma/hw/vector_dma.ll)
A read from 0x2f10000c indicates a read from that address. Depending on the address range, this refers either to a scratchpad or to global memory.
Check the computation operations. Open the $REPO/benchmarks/vector_dma/hw/vector_dma.ll
file and identify these instructions.
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): Performing shl Operation
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): 2 << 2
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): shl Complete. Result = 8
1476910000: system.acctest.vector_dma.compute.i( %7 = shl i32 %6, 2): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): Performing add Operation (13)
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): 2 + 2
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): add Complete. Result = 4
1476910000: system.acctest.vector_dma.compute.i( %13 = add i32 %12, %10): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): Performing add Operation (6)
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): 3 + 3
1476910000: system.acctest.vector_dma.compute.i( %6 = add i32 %5, %3): add Complete. Result = 6
Dataflow graph visualizer. See if you can spot the difference between the parallel and serial versions.
module load llvm-10
clang --version
# Should be 10.
cd $REPO/benchmarks/vector_dma/hw/
# Dataflow graph without optimization (serial)
clang -emit-llvm -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.serial.pdf
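# Dataflow graph with -O3 (parallel)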
clang -emit-llvm -O3 -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.parallel.pdf
benchmarks/vector_dma_2
In Model 1, we moved all the data we need into the scratchpad and then kickstarted the computation. However, scratchpads are finite and accelerators can only work with data in the scratchpad, so we may need to restrict the size of the accelerator and process data in multiple batches. In this example we restrict the accelerator to process only 8 elements, but the array has 16 elements, so we have to process the data in 2 batches. The modifications are in the managing host code.
// Modified config.ini to set scratchpad size
// Modified defines.h
#define N 8
// The accelerator datapath will work on 8 elements at a time
// Modified top.c to process 16 elements as two batches.
// Batch 0 DMAs elements 0-7 to the scratchpad
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
// Invoke accelerator on scratchpad address range
val_a = (uint64_t)spm_base;
val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);
// Batch 1 DMAs elements 8-15
dmacpy(spm1, m1 + N, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2 + N, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
// Notice that in both cases we are passing the scratchpad base address to the accelerator datapath. This is redundant, and we could instead hardcode it into the accelerator datapath in hw/ (see vector_dma_2x.c).
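The same pattern generalizes to more than two batches. Below is a sketch of a batching loop, assuming a hypothetical TOTAL element count that is a multiple of the per-batch size N, and reusing the spm/acc pointers and DMA helpers defined earlier in this benchmark (the real host code unrolls the two batches by hand).

// Illustrative batching loop: process TOTAL elements N at a time.
// TOTAL is assumed to be a multiple of N; acc points at the accelerator
// status MMR, and the scratchpad-address arguments can be set once before the loop.
for (int batch = 0; batch < TOTAL / N; batch++) {
    // Stage this batch's inputs into the scratchpads
    dmacpy(spm1, m1 + batch * N, sizeof(TYPE) * N);
    while (!pollDma());
    resetDma();
    dmacpy(spm2, m2 + batch * N, sizeof(TYPE) * N);
    while (!pollDma());
    resetDma();

    // Kick the accelerator and wait for the ISR to clear the status byte
    *acc = DEV_INIT;
    while (*acc != 0);

    // Copy this batch's results back to DRAM
    dmacpy(m3 + batch * N, spm3, sizeof(TYPE) * N);
    while (!pollDma());
    resetDma();
}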
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
cd $REPO/benchmarks/vector_dma_2
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/
# Full command for gem5 simulation
$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_dma_2 gem5-config/fs_vector_dma_2.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_dma_2/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_dma_2 --caches --l2cache --acc_cache
# Short command
$ ./runvector.sh -p -b vector_dma_2
# This will create a BM_ARM_OUT/vector_dma_2 (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator
The cache model hooks the accelerator up to global memory through a coherent crossbar. It is OK if you are not familiar with coherent crossbars when reading this document; you only need to understand that with coherence available, the accelerators can directly reference the DRAM space mapped to the CPU.
To enable the accelerator cache:
CACHE_OPTS="--caches --l2cache --acc_cache"
# In gem5-config/vector_cache.py
clstr._connect_caches(system, options, l2coherent=True, cache_size = "32kB")
$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_cache.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_cache/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache
# In short
$ ./runvector.sh -b vector_cache -p
The image below compares the system organization with an accelerator cache and without.
The primary difference between the cache and DMA versions is in the host code. The pointers passed to the accelerator point to the global memory space (base). The load and store operations directly touch these locations and access them through the coherent crossbar.
// benchmarks/vector_cache/host/main.cpp
uint64_t base = 0x80c00000;
uint64_t spm_base = 0x2f100000;
val_a = (uint64_t)base;
val_b = (uint64_t)(base + sizeof(TYPE) * N);
val_c = (uint64_t)(base + 2 * sizeof(TYPE) * N);
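Because the arguments are DRAM addresses, the cache-model datapath simply dereferences them; every load and store goes through the accelerator cache and the coherent crossbar. An illustrative sketch of such a datapath (see benchmarks/vector_cache/hw/ for the real one):

// Illustrative cache-model datapath: the pointers are DRAM addresses,
// and every load/store goes through the accelerator cache + coherent crossbar.
void vector_cache_kernel(int *m1, int *m2, int *m3) {
    #pragma clang loop unroll_count(8)
    for (int i = 0; i < 16; i++) {   // 16 = N elements
        m3[i] = 4 * (m1[i] + m2[i]);
    }
}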
cd $REPO/benchmarks/vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make
$ ./runvector.sh -b vector_cache -p
Modify the parameter N in defines.h and the input loading in the gem5 config:
test_sys.kernel = binary(options.kernel)
test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
test_sys.kernel_extras_addrs = [0x80c00000,0x80c00000+os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin")]
print("Loading file m0 at" + str(hex(0x80c00000)))
print("Loading file m1 at" + str(hex(0x80c00000 + os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin"))))
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
cd $REPO/benchmarks/multi_vector
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make
$ ./runmulti.sh -b multi_vector -p
$ $M5_PATH/build/ARM/gem5.opt --debug-flags=HWACC,Runtime --outdir=BM_ARM_OUT/multi_vector gem5-config/fs_multi_vector.py --mem-size=8GB --kernel=$LAB_PATH/benchmarks/multi_vector/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=multi_vector --caches --l2cache --acc_cache
For larger applications we may need to include multiple accelerators in the cluster. For this, we include a top accelerator to coordinate the other accelerators.
The figure below shows the system model. The top accelerator now takes over the DMA and accelerator kickstart logic from the CPU; it also initiates the DMA movement between the accelerators. The host in this case simply passes the address pointers for the input and output zones. There are two worker accelerators, vector and vector2.
// host/main.cpp
volatile uint8_t * top = (uint8_t *)0x2f000000;
volatile uint32_t * val_a = (uint32_t *)0x2f000001;
volatile uint32_t * val_b = (uint32_t *)0x2f000009;
volatile uint32_t * val_c = (uint32_t *)0x2f000011;
int main(void) {
// Pointers in DRAM. m1 and m2 are inputs.
// m3 is the output
uint32_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);
// MMRegs of the top accelerator.
// Argument 1 to top
*val_a = (uint32_t)(void *)m1;
// Argument 2 to top
*val_b = (uint32_t)(void *)m2;
// Argument 3 to top
*val_c = (uint32_t)(void *)m3;
File | Description |
---|---|
hw/source/top.c | Code for the top accelerator coordinator. This is itself an accelerator |
config.yml | Configuration for the accelerators |
hw/source/vector.c | Code for the first stage of the vector accelerator |
hw/source/vector2.c | Code for the second stage of the vector accelerator |
hw/ir | LLVM files after the compiler generates the dataflow graph |
Start address | Description |
---|---|
0x10020040 | Memory mapped args for top |
0x10020080 | Memory mapped args for vector |
0x10020780 | Memory mapped args for vector2 |
0x100200c0, 0x10020300, 0x10020540 | Scratchpad for vector |
0x100207c0, 0x10020a00, 0x10020c40 | scratchpad for vector2 |
// Accelerator 1: vector.c
for(i=0;i<N;i++) {
tmp_m3[i] = (m1[i] + m2[i]);
}
// Accelerator 2: vector2.c
for(i=0;i<N;i++) {
m3[i] = tmp_m3[i] * 8;
}
The top accelerator manages the other accelerators itself:
// hw/source/top.c
volatile uint8_t *DmaFlags = (uint8_t *)(DMA);
volatile uint64_t *DmaRdAddr = (uint64_t *)(DMA + 1);
volatile uint64_t *DmaWrAddr = (uint64_t *)(DMA + 9);
volatile uint32_t *DmaCopyLen = (uint32_t *)(DMA + 17);
Transfer data from DRAM to the first scratchpad (MATRIX1) of the vector accelerator:
// multi_vector/hw/top.c
// Global Memory Address
*_DMARdAddr = (uint32_t)m1;
// Scratchpad address
*_DMAWrAddr = (uint32_t)MATRIX1;
// Vector Len
*_DMACopyLen = vector_size;
// Fence it
*_DMAFlags = 0;
while (*_DMAFlags != 0x0);
*_DMAFlags = DEVINIT;
// Poll DMA for finish
while ((*_DMAFlags & DEVINTR) != DEVINTR);
// Reset DMA
*_DMAFlags = 0x0;
Scratchpad memory is laid out in the following manner:
M1 | M2 | M3 |
---|---|---|
0x100200c0, N*sizeof(int) bytes | 0x10020300, N*sizeof(int) bytes | 0x10020540, N*sizeof(int) bytes |
Set up arguments if required. The accelerator can only work with data in the scratchpads or local registers; these are fixed memory ranges in the accelerator cluster's address space. In this case, the V1 vector accelerator does not require any additional arguments. To start an accelerator from top, it is important to follow the steps below (in particular, checking whether the accelerator is ready for kickstart) after the arguments are set up.
// Write to argument MMR of V1 accelerator
// First, check if accelerator ready for kickstarting
while (*V1Flags != 0x0);
// Start the accelerated function
*V1Flags = DEV_INIT;
// Poll function for finish
while ((*V1Flags & DEV_INTR) != DEV_INTR);
// Reset accelerator for next time.
*V1Flags = 0x0;
The output of accelerator V1 is the input of V2, so we need to copy N*4 bytes from V1's output scratchpad to V2's input scratchpad:
// Transfer the output of V1 to V2.
*DmaRdAddr = M3ADDR;
*DmaWrAddr = M1ADDR_V2;
*DmaCopyLen = vector_size;
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR)
;
// Write to argument MMR of V2 accelerator
// First, check if accelerator ready for kickstarting
while (*V2Flags != 0x0);
// Start the accelerated function
*V2Flags = DEV_INIT;
// Poll function for finish
while ((*V2Flags & DEV_INTR) != DEV_INTR);
// Reset accelerator for next time.
*V2Flags = 0x0;
// Transfer M3
// Scratchpad address
*DmaRdAddr = M3ADDR_V2;
// Global address the host wants the final result in
*DmaWrAddr = m3_addr;
// Number of bytes
*DmaCopyLen = vector_size;
// Start DMA
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR)
;
We now create a multi-accelerator system with a shared cache. We do not need to explicitly transfer data between accelerators; all data is implicitly shared through the cluster cache. The top accelerator only has to set up the appropriate arguments and invoke the accelerators in sequence. Each accelerator reads and writes the global memory space, and the cluster cache captures the locality.
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
cd $REPO/benchmarks/multi_vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make
# In short.
$ ./runmulti.sh -b multi_vector_cache -p
# Full command
$ $M5_PATH/build/ARM/gem5.opt --debug-flags=HWACC,Runtime --outdir=BM_ARM_OUT/multi_vector_cache gem5-config/fs_multi_vector_cache.py --mem-size=8GB --kernel=$LAB_PATH/benchmarks/multi_vector_cache/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=multi_vector_cache --caches --l2cache --acc_cache
// benchmarks/multi_vector_cache/hw/source/top.c
// Pass on host address arguments to accelerator
*V1Arg1 = m1_addr;
*V1Arg2 = m2_addr;
*V1Arg3 = m3_addr;
// Start V1
*V1Flag = DEV_INIT;
// Poll function for finish
while ((*V1Flag & DEV_INTR) != DEV_INTR);
*V1Flag = 0x0;
// Start V2
*V2Flag = DEV_INIT;
while ((*V2Flag & DEV_INTR) != DEV_INTR);
*V2Flag = 0x0;
Streaming pipelines introduce a FIFO interface to the memory system. If you take a look at the datapath in vector_dma/hw/vector_dma.c,
you will notice that the memory access pattern is highly regular, with no ordering requirements between the elements of the array: we simply sequence through the elements of the vector, applying an operation at each location.
This can be concisely described as a stream of values. A stream simply provides a FIFO interface to the data.
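In code, a stream port looks like an ordinary memory-mapped address, but every read pops the next token from the FIFO and every write pushes one. A conceptual sketch follows; the port addresses here are placeholders, and the real ones come from the generated vector_stream_clstr_hw_defines.h shown later.

#include <stdint.h>

// Placeholder port addresses -- in the real benchmark these come from
// vector_stream_clstr_hw_defines.h (e.g. the S1 input and S1Out ports).
#define STREAM_IN_ADDR  0x10020020u
#define STREAM_OUT_ADDR 0x10020380u

static volatile uint8_t *stream_in  = (uint8_t *)STREAM_IN_ADDR;
static volatile uint8_t *stream_out = (uint8_t *)STREAM_OUT_ADDR;

// Every read of *stream_in pops the next stream_size-sized token from the FIFO;
// every write to *stream_out pushes one token downstream.
void stream_stage(int n) {
    for (int i = 0; i < n; i++) {
        uint8_t token = *stream_in;
        *stream_out = (uint8_t)(token + 1);   // example per-token operation
    }
}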
Memory map. The term MMR refers to the memory-mapped registers and flags used to control the DMA engines and the accelerators.
MMRs: 0x100200c0 (TOP), 0x10020100 (S1), 0x10020400 (S2), 0x100204c0 (S3), 0x10020000 (StreamDMA), 0x10020080 (Noncoherent DMA)
Stream ports: 0x10020000 (DRAM->S1, S3->DRAM), 0x100203c0 (S1->S2 FIFO port), 0x10020480 (S2->S3 FIFO port)
This streams data from DRAM in chunks of stream_size (bits). The figure illustrates a stream DMA.
We need to create a new configuration and modify top to initiate the stream.
The stream DMA includes a control PIO interface (similar to the other accelerators). Top writes to it to control where in DRAM data is streamed from or to. The out port of the StreamDMA engine is wired up to the stream port of one of the accelerators. Each stream is a single-input, single-output FIFO. Each accelerator has a .stream
interface into which all the required streams are wired. In this case (i) we read from DRAM and send the data to accelerator S1, and (ii) we read data from accelerator S3 and write it back to DRAM through the stream DMA.
Accelerators use the address 0x2f0001000 to read/write to the stream addresses.
stream_size: 8 — each access to the port reads stream_size bits worth of data.
The stream DMA reads StrDmaRdFrameSize bytes of data in chunks of stream_size.
The total number of dataflow tokens generated will be $\frac{RdFrameSize \times 8}{stream\_size}$; for example, a 64-byte read frame with an 8-bit stream_size produces 64 tokens.
# Configuration in gem5-config/vector_stream.py
# 0x10020000 DMA control address
clstr.streamdma = StreamDma(pio_addr=0x10020000, status_addr=0x10020040, pio_size = 32, gic=gic, max_pending = 32)
# Stream read/write address
clstr.streamdma.stream_addr = 0x10020000 + 32
clstr.streamdma.stream_size = 128
clstr.streamdma.pio_delay = '1ns'
clstr.streamdma.rd_int = 210
clstr.streamdma.wr_int = 211
clstr.streamdma.dma = clstr.coherency_bus.cpu_side_ports
clstr.local_bus.mem_side_ports = clstr.streamdma.pio
# DRAM->Accelerator S1
clstr.s1.stream = clstr.streamdma.stream_out
# Accelerator S3->DRAM
clstr.s3.stream = clstr.streamdma.stream_in
// vector_stream/hw/source/top.c
// StreamDma
volatile uint8_t *StrDmaFlags = (uint8_t *)(STREAMDMA_Flags);
volatile uint64_t *StrDmaRdAddr = (uint64_t *)(STREAMDMA_Flags + 4);
volatile uint64_t *StrDmaWrAddr = (uint64_t *)(STREAMDMA_Flags + 12);
volatile uint32_t *StrDmaRdFrameSize = (uint32_t *)(STREAMDMA_Flags + 20);
volatile uint8_t *StrDmaNumRdFrames = (uint8_t *)(STREAMDMA_Flags + 24);
volatile uint8_t *StrDmaRdFrameBuffSize = (uint8_t *)(STREAMDMA_Flags + 25);
volatile uint32_t *StrDmaWrFrameSize = (uint32_t *)(STREAMDMA_Flags + 26);
volatile uint8_t *StrDmaNumWrFrames = (uint8_t *)(STREAMDMA_Flags + 30);
volatile uint8_t *StrDmaWrFrameBuffSize = (uint8_t *)(STREAMDMA_Flags + 31);
// Initiate Stream from DRAM to FIFO port
*StrDmaRdAddr = in_addr;
*StrDmaRdFrameSize = INPUT_SIZE; // Specifies number of bytes
*StrDmaNumRdFrames = 1;
*StrDmaRdFrameBuffSize = 1;
// Start Stream
*StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;
Stream buffers establish ports directly between accelerators. They do not need to be set up at runtime:
the configuration is fixed, and the accelerators simply read from and write to the address that controls the port.
For example, here we have set up a stream buffer between accelerators V1 and V2.
Each accelerator uses the address to read and write the FIFO. A stream buffer supports only a single input and a single output port.
┌──────────────────────┐ ┌───────────────┐
│ Accelerator V1 │ ┌─────────────┐ │ Acclerator │
│ ├─────►│ FIFO Buffer ├────────► V2 │
└──────────────────────┘ └─────────────┘ └───────────────┘
# Address accelerator v1 and v2 can read and write to.
# S1Out (Stream Variable)
addr = 0x10020380
# stream_size. # bits read on each access
clstr.s1out = StreamBuffer(stream_address = addr, status_address= 0x100203c0, stream_size = 8, buffer_size = 8)
# Input to the buffer from accelerator S1
clstr.s1.stream = clstr.s1out.stream_in
# Output of buffer sent to accelerator S2.
clstr.s2.stream = clstr.s1out.stream_out
Each stream buffer supports only one input and one output port. However, multiple stream buffers can be wired to a single accelerator, i.e., each accelerator can have multiple stream-buffer ports.
┌───────────────┐
┌─────────────┐ │ Acclerator │
│ 0x10020380 ├────┬───► │
└─────────────┘ ├───► V2 │
│ └───────────────┘
┌─────────────┐ │
│ 0x10020440 ├────┘
└─────────────┘
# Address accelerator v1 and v2 can read and write to access FIFO.
addr = 0x10020380
clstr.s1out = StreamBuffer(stream_address = addr, status_address= 0x100203c0, stream_size = 8, buffer_size = 8)
clstr.s1.stream = clstr.s1out.stream_in
clstr.s2.stream = clstr.s1out.stream_out
cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
cd $REPO/benchmarks/vector_stream
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make
# In short on 227.
$ ./runvector_stream.sh -p
# Full command
$M5_PATH/build/ARM/gem5.opt --outdir=BM_ARM_OUT/vector_stream gem5-config/fs_vector_stream.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_stream/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_stream --caches --l2cache --acc_cache
The purpose of top is to kickstart the stream DMA from memory. Completion is detected by checking whether the output stream is complete. The overall execution is data-driven: when the FIFO port empties out, the top accelerator triggers the completion of the stream.
// Start Stream DMAs
*StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;
// Start all accelerators
// Start S1
*S1 = 0x01;
// Start S2
*S2 = 0x01;
// Start S3
*S3 = 0x01;
// Wait for all accelerators to finish before sending interrupt to CPU
while ((*StrDmaFlags & 0x08) == 0x08);
As each accelerator fills its stream buffer ports, it automatically triggers the operations in the neighboring accelerators in a dataflow manner. Each accelerator has to know how many tokens will be generated and has to read its stream buffer port. The S1 stage writes to the FIFO stream buffer between S1 and S2, using the appropriate memory-mapped stream-buffer port.
// vector_stream_clstr_hw_defines.h
//Accelerator: TOP
#define TOP 0x100200c0
//Accelerator: S1
#define S1 0x10020100
#define S1Buffer 0x10020140
#define S1Out 0x10020380
#define S1Out_Status 0x100203c0
//Accelerator: S2
#define S2 0x10020400
#define S2Out 0x10020440
#define S2Out_Status 0x10020480
// hw/S1.c
volatile dType_8u * STR_IN = (dType_8u *)(S1In);
volatile dType_8u * BUFFER = (dType_8u *)(S1Buffer);
volatile dType_8u * STR_OUT = (dType_8u *)(S1Out);
for (dType_Reg i = 0; i < INPUT_SIZE; i++) {
*STR_OUT = (*STR_IN) + BUFFER[i];
}
}
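The downstream stages follow the same pattern: read from the upstream stream buffer, write into the next one. Here is a sketch of an S2-style stage using the S1Out/S2Out port addresses from the generated header; the per-token operation is only an example, and the real hw/S2.c may differ.

// hw/S2.c-style stage (illustrative): consume tokens from the S1->S2 buffer
// and produce tokens into the S2->S3 buffer.
// dType_8u, dType_Reg, INPUT_SIZE, S1Out and S2Out come from the benchmark headers.
volatile dType_8u *S2_IN  = (dType_8u *)(S1Out);   // upstream FIFO port
volatile dType_8u *S2_OUT = (dType_8u *)(S2Out);   // downstream FIFO port

void s2_stage(void) {
    for (dType_Reg i = 0; i < INPUT_SIZE; i++) {
        *S2_OUT = (dType_8u)((*S2_IN) * 2);        // example per-token operation
    }
}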
Complete the configuration: hw/source/S4.c, hw/configs/S4.ini. Modify top.c to define the memory map for the MMRs and stream ports. Modify hw/gem5-config/vector_stream.py. You will need to make S4 the final stage that writes to the stream DMA, and you will have to define a new stream buffer that connects S3 and S4. You may also need Makefile modifications. Figure it out.
A key part of the gem5 infrastructure is the ability to generate SoC configurations. This is done using the config.yml file. The config.yml file is processed by a python script (SALAM-Configurator/systembuilder.py).
cd $REPO
export LAB_PATH=$PWD
export BENCH=gemm
# benchmarks/gemm/config.yml : Top level config file
# Includes all the required components with their sizes
${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName gemm --benchDir "benchmarks/gemm"
# Two outputs:
benchmarks/gemm/gemm_clstr_hw_defines.h # Defines the memory map of the accelerators
config/SALAM/generated/fs_gemm.py and gemm.py # Generated gem5 configuration files
---
acc_cluster:
# Name of header to be generated
- Name: multi_vector_clstr
# Define DMA
- DMA:
- Name: dma
MaxReqSize: 64 # Max request size
BufferSize: 128 # Buffer size
PIOMaster: LocalBus # Bus on which requests are invoked
Type: NonCoherent # Coherent or NonCoherent
InterruptNum: 95 # Do not change. Interrupt number. Check boot.s if interrupt number is changed
- Accelerator: # Define accelerators. Multiple defined here
- Name: Top # Name of accelerator
IrPath: benchmarks/multi_vector/hw/top.ll # Datapath definition
ConfigPath: benchmarks/multi_vector/hw/top.ini # Configuration file. For future extensions
PIOSize: 25 # Number of bytes of memory mapped registers. 1 Byte flag. 8 bytes for each registers
InterruptNum: 68 # Interrupt number. DO NOT CHANGE. Check boot.s if interrupt number is changed. Only Top has interrupt
PIOMaster: LocalBus # Bus on which requests are invoked
# Local to PIO
LocalSlaves: LocalBus # Local bus to which the accelerator is connected
Debug: False # Debug. False or True. Make sure its enabled if you want to see what's going on within the accelerator
- Accelerator: # 2nd accelerator
- Name: vector
IrPath: benchmarks/multi_vector/hw/vector.ll
ConfigPath: benchmarks/multi_vector/hw/vector.ini
Debug: False
PIOSize: 1
PIOMaster: LocalBus
- Var: # Add-ons to accelerator
- Name: MATRIX1 # Scratchpad name
Type: SPM # Scratchpad
Size: 512 # size in bytes
Ports: 2 # Number of ports. Parallel accesses to scratchpad.
- Var:
- Name: MATRIX2
Type: SPM
Size: 512
Ports: 2
- Var:
- Name: MATRIX3
Type: SPM
Size: 512
Ports: 2
- Accelerator:
- Name: vector2
IrPath: benchmarks/multi_vector/hw/vector2.ll
ConfigPath: benchmarks/multi_vector/hw/vector2.ini
Debug: False
PIOSize: 1
PIOMaster: LocalBus
- Var:
- Name: V2_MAT1
Type: SPM
Size: 512
Ports: 2
- Var:
- Name: V2_MAT2
Type: SPM
Size: 512
Ports: 2
- Var:
- Name: V2_MAT3
Type: SPM
Size: 512
Ports: 2
hw_config: # Always include below configuration. Defines the function unit spec.
top:
vector2:
vector:
instructions:
add:
functional_unit: 1
functional_unit_limit: 0
opcode_num: 13
runtime_cycles: 0
............
This document was put together by your CMPT 750/450 instructors and Milad Hakimi.