Acknowledgments

In this lab and accelerator assignment we will be using a specific version of gem5 developed at the University of North Carolina at Charlotte. We thank the authors of gem5 SALAM for making their tool available as open source. The CMPT 750 version of gem5 includes additional changes and benchmarks and may not be backwards compatible with the upstream SALAM version.

Gem5 ACC Overview

gem5 ACC extends gem5 to model domain-specific architectures and heterogeneous SoCs. When creating your accelerator there are many design decisions to be made that have a first-order effect on overall performance and energy.

These include:

  • How to integrate the accelerator into the system and define it
  • How to receive data? Is it coupled to main memory?
  • Does the accelerator need DMAs?
  • How to control the accelerator
  • How much parallelism is desired

In this case, the code we are going to accelerate is the following:

char input[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char coeff[8] = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char output[8];
for (int i = 0; i < 8 ; i++)
{
  output[i] = input[i] + coeff[i];
  output[i] = output[i]*2;
  output[i] = output[i]*2;
}

We will illustrate accelerator creation using the DMA model.

Vector_DMA SoC

$ git clone git@github.com:CMPT-7ARCH-SFU/gem5-lab2-v2.git
$ export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin
$ cd gem5-lab2-v2/benchmarks/vector_dma
$ ls
defines.h  host  hw  Makefile
   
  • host/: Defines the main program that runs on the CPU and invokes the accelerator
  • hw/: Defines the accelerator datapath
  • hw/config.yml: Defines the accelerator configuration
  • inputs/m0.bin, m1.bin: Input files that contain the data
  • defines.h: Common defines used by both the datapath and the host code
  • gem5-config/fs_vector_dma.py: Top-level initialization of the Arm SoC system
  • gem5-config/vector_dma.py: gem5 configuration file for the vector_dma accelerator; defines the accelerator components

Our SoC has two main sections.

  • Host
  • Accelerator

Here we focus on the host code and its interactions with the accelerator. For this application, we are using a bare-metal kernel. This means we have a load file and an assembly boot file, and we must generate an ELF file for execution.

# Pseudocode for host code
1. Set up addresses for the scratchpads
2. Copy data from DRAM into the scratchpads
3. Start the accelerator
4. Wait for the accelerator to complete
5. Copy data from the scratchpads back to DRAM
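
Putting these steps together, the host-side flow looks roughly like the sketch below. This is a condensed, illustrative version of the fragments shown in the numbered steps that follow; dmacpy, pollDma and resetDma come from common/dma.h, and TYPE, N, DEV_INIT and the MATRIX*/VECTOR_DMA addresses come from defines.h (names assumed as used elsewhere in this document).

// Sketch of the overall host flow (see the numbered steps below for the real code)
volatile uint8_t *ACC = (uint8_t *)VECTOR_DMA;        // accelerator status byte
TYPE *m1 = (TYPE *)0x80c00000;                        // step 1: inputs/output in DRAM
TYPE *m2 = m1 + N;
TYPE *m3 = m2 + N;
TYPE *spm1 = (TYPE *)MATRIX1, *spm2 = (TYPE *)MATRIX2, *spm3 = (TYPE *)MATRIX3;

dmacpy(spm1, m1, sizeof(TYPE) * N);  while (!pollDma()); resetDma();   // step 2
dmacpy(spm2, m2, sizeof(TYPE) * N);  while (!pollDma()); resetDma();

// (If the datapath takes arguments, they are written to the parameter MMRs
//  starting just past VECTOR_DMA before kickstart; see the vector_dma_2 host code.)
*ACC = DEV_INIT;                 // step 3: kickstart the accelerator
while (*ACC != 0);               // step 4: the ISR clears the status byte on completion

dmacpy(m3, spm3, sizeof(TYPE) * N);  while (!pollDma()); resetDma();   // step 5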

1. Address Mapping

  • defines.h Top-level definition for memory mapped addresses. Since this is a bare-metal SoC without any access to virtual memory and memory allocators, we have to define the memory space. The overall memory space looks like the following:
// Accelerator Name VECTOR_DMA. Base address for
// interacting with memory mapped registers
#define VECTOR_DMA 0x10020040

// Address for interacting with DMA
#define DMA_Flags 0x10020000

// 3 scratchpads
#define MATRIX1 0x10020080
#define MATRIX2 0x10020100
#define MATRIX3 0x10020180


To summarize the layout:

  • 0x10020040: accelerator status byte (0: inactive, 1: start, 4: running)
  • 0x10020041 onwards: accelerator parameters, 8 bytes each (3 parameters; see the host code)
  • 0x10020080, 0x10020100, 0x10020180: scratchpad memories
  • 0x10020000-0x10030000: limit of the accelerator cluster address range
  • Everything outside this range: host DRAM coherent address space

These memory spaces are set in the following files

  • Accelerator range

Any access to this range, either from the host code or the accelerator, is routed to the accelerator cluster.

gem5-config/vector_dma.py
line 24:	local_low = 0x10020000
line 25:	local_high = 0x10030000
  • Accelerator Start Address and Parameters
gem5-config/vector_dma.py
clstr.vector_dma = CommInterface(devicename=acc, gic=gic, pio_addr=0x10020040, pio_size=64, int_num=68)

pio_size is in bytes

  • Scratchpad address
gem5-config/vector_dma.py
# MATRIX1 (Variable)
	addr = 0x10020080
	spmRange = AddrRange(addr, addr + 0x40)

Connecting scratchpads to accelerator
# Connecting MATRIX1 to vector_dma. The range
# here controls how many accesses you can issue
# to scratchpad in 1 cycle.
	for i in range(2):
		clstr.vector_dma.spm = clstr.matrix1.spm_ports

# MATRIX2 (Variable)
	addr = 0x10020100
	spmRange = AddrRange(addr, addr + 0x40)


# MATRIX3 (Variable)
	addr = 0x10020180
	spmRange = AddrRange(addr, addr + 0x40)

In this instance we want the DMAs and the accelerator to be controlled by an additional device to reduce the overhead on the CPU. We define the helper functions in common/dma.h.
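
The exact helpers live in common/dma.h; the sketch below shows one plausible way dmacpy, pollDma and resetDma could be built on top of the DMA MMRs, using the register offsets and the DEV_INIT/DEV_INTR flag protocol that appear later in this document (hw/source/top.c). Treat it as illustrative rather than the actual library code.

// Illustrative only: register offsets (flags +0, read addr +1, write addr +9,
// copy length +17) and DEV_INIT/DEV_INTR follow the layout used in hw/source/top.c.
#include <stdint.h>

static volatile uint8_t  *DmaFlags   = (volatile uint8_t  *)(DMA_Flags);
static volatile uint64_t *DmaRdAddr  = (volatile uint64_t *)(DMA_Flags + 1);
static volatile uint64_t *DmaWrAddr  = (volatile uint64_t *)(DMA_Flags + 9);
static volatile uint32_t *DmaCopyLen = (volatile uint32_t *)(DMA_Flags + 17);

static inline void dmacpy(void *dst, void *src, uint32_t len) {
    *DmaRdAddr  = (uint64_t)(uintptr_t)src;   // source address (DRAM or scratchpad)
    *DmaWrAddr  = (uint64_t)(uintptr_t)dst;   // destination address
    *DmaCopyLen = len;                        // bytes to move
    *DmaFlags   = DEV_INIT;                   // kick off the transfer
}

static inline int  pollDma(void)  { return (*DmaFlags & DEV_INTR) == DEV_INTR; }
static inline void resetDma(void) { *DmaFlags = 0x0; }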

  • Specifies the loading of binary files for the host code. Here we have a 16 x 4-byte binary file with integers 0x1-0xf. The data is stored in little-endian format (LSB first).
$ cd benchmarks/inputs
$ xxd m0.bin
00000000: 0100 0000 0200 0000 0300 0000 0400 0000  ................
00000010: 0500 0000 0600 0000 0700 0000 0800 0000  ................
00000020: 0900 0000 0a00 0000 0b00 0000 0c00 0000  ................
00000030: 0c00 0000 0d00 0000 0e00 0000 0f00 0000  ................
# fs_vector_dma.py
  test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m1.bin"]
  • Define the DRAM addresses where the input data is loaded
main.cpp

uint64_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);

2. Copy data from DRAM to Scratchpad

We then set up the DMA to perform the memory copy between DRAM and the scratchpad memory. dmacpy is similar to memcpy. Note the address ranges used for performing the copy. The destination uses the scratchpad range specified in the config.ini and gem5 scripts. This space is carved out of the global memory space, and the host CPU routes any reads and writes within this address range to the scratchpad.

// Define scratchpad addresses.
// Note that N = 16. 4-byte int. 64 bytes total (or 0x40 bytes)
TYPE *spm1 = (TYPE *)MATRIX1;
TYPE *spm2 = (TYPE *)MATRIX2;
TYPE *spm3 = (TYPE *)MATRIX3;
// spm1 is the destination address
// m1 is the source address
// Size in bytes.
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();

3. Start accelerator

Set up parameters in the accelerator memory-mapped registers. The accelerator status byte sits at the base of the MMR range (pio_addr = 0x10020040, as set in gem5-config/vector_dma.py), and pio_size = 64 reserves 64 bytes of memory-mapped register space starting at that address. The parameter registers begin at 0x10020041, and each entry is 8 bytes. The parameters are automatically derived from the accelerator function definition.

 // 4 possible values. 0x0: inactive 0x1: to start the accelerator. 0x4 active.
	// Start the accelerated function
	*ACC = DEV_INIT;
	while (*ACC != 0);

In our boot code, we set up an Interrupt Service Routine (ISR) in isr.c that the accelerator triggers at the end of its execution. The ISR resets the accelerator status to 0x0, which we spin on in the host code.

// isr.c. Invoked when accelerator is complete
void isr(void)
{
	printf("Interrupt\n\r");
	// Helps break the for loop in the host code
  *ACC = 0x00;
	printf("Interrupt\n\r");

}

4. Copy result from accelerator.

We copy the results back from the accelerator scratchpad to DRAM so that the host code can access and check them.

 dmacpy(m3, spm3, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();

Check yourself

  • In which file is the base address of the scratchpad defined?
  • Where are the low and high marks of the accelerator address range defined?
  • How did we detect that the accelerator completed execution?

Accelerator datapath definition

We will first start with creating the code for our accelerator.

In hw/vector_dma.c there is a vector loop application. To expose parallelism for computation and memory access, we fully unroll the innermost loop of the application. The simulator will natively pipeline the other loop iterations for us. To accomplish the loop unrolling we can use clang compiler pragmas such as the one on line 18 of vector_dma.c.

  // Unrolls loop and creates instruction parallelism
    #pragma clang loop unroll_count(8)
    for(i=0;i<N;i++) {
            prod[i]  = 4*(m1[i] + m2[i]);

    }

Dataflow graph with unrolling (figure)

The hardware ends up being a circuit that implements the above dataflow graph. The unrolling creates 8-way parallelism: the loads of m1[i] and m2[i] can happen in parallel, and the adds and multiplies can happen in parallel. The figures show the compiler representation, or view, that gets mapped down to hardware. Each node in the graph is an LLVM IR instruction. This is just an intermediate, RISC-like ISA representation with certain important differences.

Benefits of Compiler IR view of Accelerator

  • Infinite registers. Typical object code for CPUs is limited by the architectural registers. This causes unnecessary memory operations (register spills and fills) that hide the available parallelism. Compiler IR does not have such limitations, since it is simply trying to capture the available parallelism and locality.

  • Dataflow Semantics. Object code is laid out linearly and relies on a program counter, whereas compiler IR inherently supports dataflow semantics with no specific program counter.

Dataflow graph without unrolling (figure)

Rules for Accelerator Datapath

We are generating a hardware datapath from the C code, hence there are a number of rules. If these rules are violated, the compiler may complain, you may encounter a runtime error from the LLVM runtime engine of SALAM, or you may even see a silent failure. It is very important that you follow them.

  • Rule 1: SINGLE FUNCTION. Only a single function is permitted per accelerator .c file.
  • Rule 2: NO LIBRARIES. You cannot use standard library functions or call into other functions.
  • Rule 3: NO I/O. No printfs or writes to files. Either use traces or write back to CPU memory to debug.
  • Rule 4: ONLY LOCALS OR ARGS. You can only work with variables declared within the function or the input arrays.

Read here for more details on LLVM

Run

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2


# Building gem5-SALAM. WE PREBUILD. YOU DO NOT NEED TO BUILD GEM5
git clone git@github.com:CMPT-7ARCH-SFU/gem5-SALAM.git
cd gem5-SALAM; scons build/ARM/gem5.opt -j`nproc`

# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin

# Build benchmark

cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
module load llvm-10
# Build datapath and host binary
make clean; make

$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_dma gem5-config/fs_vector_dma.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_dma/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_dma --caches --l2cache --acc_cache

# OR in a single line
./runvector.sh -b vector_dma

Scripts

How are gem5 configs generated?

  • config.yml is added to all the benchmarks. The systembuilder.py script reads the config for each benchmark and creates the python files (fs_$BENCHMARK.py and $BENCHMARK.py) containing the setup of the accelerator. By default, the generated python files can be found in the config/SALAM/generated/ directory.

There are commented-out lines in the bash scripts:

mkdir -p $LAB_PATH/config/SALAM/generated
mkdir benchmarks/gemm
cp benchmarks/vector_dma/config.yml benchmarks/gemm/config.yml
export BENCH=gemm
${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName $BENCH --benchDir "benchmarks/${BENCH}"

These lines generate the python files for us. After the config file is generated, we manually change it to read the input files (m0.bin, m1.bin).

  • Reading from input files: Replace line 138 in fs_$BENCHMARK.py with:
elif args.kernel is not None:
    test_sys.workload.object_file = binary(args.kernel)
    test_sys.workload.extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
test_sys.workload.extras_addrs = [0x80c00000,0x80c00000+8*8]
  • Auto generated header: systembuilder.py creates a new header containing the SPM and DMA addresses. This header is stored in the benchmark directory under the name of $BENCHMARK_clstr_hw_defines.h. We include this header in the defines.h of every benchmark.

YAML files

For each of our accelerators we also need to generate a YAML file (if you use the system builder these are generated for you). In each YAML file we can define the number of cycles for each IR instruction and provide any limitations on the number of Functional Units (FUs) associated with IR instructions.

Additionally, there are options for setting the FU clock periods and controls for pipelining of the accelerator. Below is an example with a few IR instructions and their respective cycle counts:

instructions:
  add:
    functional_unit: 1
    functional_unit_limit: 5
    opcode_num: 13
    runtime_cycles: 1

Importantly, under the AccConfig section, we set MMR specific details such as the size of the flags register, memory address, interrupt line number, and the accelerator’s clock.

- Accelerator:
    - Name: vector_dma
      IrPath: benchmarks/vector_dma/hw/vector_dma.ll
      ConfigPath: benchmarks/vector_dma/hw/vector_dma.ini
      Debug: True
      PIOSize: 25
      PIOMaster: LocalBus
      InterruptNum: 68
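
As a quick arithmetic check on the PIOSize of 25 above: it corresponds to the 1-byte status flag plus three 8-byte argument registers (1 + 3*8 = 25), matching the comments in the config.yml shown later in this document.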

In the Memory section, you can define the scratchpad's memory address, size, response latency and number of ports. Also, if you want the accelerator to verify that data exists in the scratchpad prior to accessing it, you can set ready mode to true.

- Var:
    - Name: MATRIX1
      Type: SPM
      Size: 64
      Ports: 2
- Var:
    - Name: MATRIX2
      Type: SPM
      Size: 64
      Ports: 2
- Var:
    - Name: MATRIX3
      Type: SPM
      Size: 64
      Ports: 2

Function Unit to LLVM mapping

Hardware has functional units. LLVM IR has ops. There is a many:1 mapping between LLVM IR ops and functional units, i.e., different instruction types can be scheduled on the same functional unit (taking into account cycle and pipelining constraints). SALAM enables the designer to control this mapping, typically laid out in the config.yml file. The complete list of ops and functional units supported by SALAM is given below; see the "functional_unit" field under the instructions section.

  • You will typically not need to modify the mapping. However, it is useful to know how it is tracked.

Functional Unit List

  • This is used for calculating area.
  • When set to 0, SALAM will adjust to match the maximum dynamic value, i.e., the smallest circuit that can run the kernel without sacrificing ILP.
  • getLimit sets the maximum value.
  • getAvailable varies as functional units are used.
Function Unit ID
INTADDER 1
INTMULTI 2
INTSHIFTER 3
INTBITWISE 4
FPSPADDER 5
FPDPADDER 6
FPSPMULTI 7
FPSPDIVID 8
FPDPMULTI 9
FPDPDIVID 10
COMPARE 11
GETELEMENTPTR 12
CONVERSION 13
OTHERINST 14
REGISTER 15
COUNTER 16
TRIG_SINE 17

Opcode List

This is used for tracking activity factors.

  • When the functional unit limit is set to 0, getUsage will tell us the minimum required to ensure performance is not affected.
  • Multiple opcodes can (and will) map to a common functional unit, depending on the config.yml assignment. For example, 13 (add) and 15 (sub) map to the integer adder unit. Ops like gep (34) map to the default unit id 0 and do not count towards the area calculation, since they are a software artifact; in hardware they map to wires or control.
Opcode Number
Add 13
Addrspac 50
Alloca 31
AndInst 28
Ashr 27
Bitcast 49
Br 2
Call 56
Fadd 14
Fcmp 54
Fdiv 21
Fence 35
Fmul 18
Fpext 46
Fptosi 42
Fptoui 41
Fptrunc 45
Frem 24
Fsub 16
Gep 34
Icmp 53
Indirect 4
Inttoptr 48
Invoke 5
Landingp 66
Load 32
Lshr 26
Mul 17
OrInst 29
Phi 55
Ptrtoint 47
Resume 6
Ret 1
Sdiv 20
Select 57
Sext 40
Shl 25
Srem 23
Store 33
Sub 15
SwitchIn 3
Trunc 38
Udiv 19
Uitofp 43
Unreacha 7
Urem 22
Vaarg 60
XorInst 30
Zext 39
  • The number of functional units can be set in HWAccConfig.py.
  • By default it is set to zero. This means use as many as required by the IR, e.g., if the IR requires 3 adds in a cycle, use 3 adders; if it requires 5, use 5. When the circuit is unrolled more, the number of functional units is automatically bumped up.
  • Setting it explicitly controls how many ops can be issued in a single cycle.
acc.hw_interface.functional_units.integer_adder.limit = 5

Constructing the System

We are now going to leverage and modify the example scripts for gem5's full-system simulation. In gem5-config/fs_vector_dma.py we have a modified version of the script located in gem5's default folder. The main difference in our configuration is the two additional accelerator parameters (--accpath and --accbench).

Adding accelerators to cluster

  • fs_vector_dma.py : Connects the accelerator cluster to the Arm system. Accelerator connections start at line 231: vector_dma.makeHWAcc(args, test_sys)
  • vector_dma.py : Sets up specific accelerator system
  • HWAccConfig.py : Provides accelerator independent helper functions

  • vector_dma.py

In order to simplify the organization of accelerator-related resources, we define an accelerator cluster. This accelerator cluster contains any shared resources between the accelerators as well as the accelerators themselves. It has several functions associated with it that help with attaching accelerators to it and with hooking the cluster into the system.


# Allocate cluster and build it
def makeHWAcc(args, system):
	system.vector_dma_clstr = AccCluster()
	buildvector_dma_clstr(args, system, system.vector_dma_clstr)

def buildvector_dma_clstr(args, system, clstr):
# Define memory map. Any read/write from cpu to this range sent to accelerator cluster
	local_low = 0x10020000
	local_high = 0x10030000
	local_range = AddrRange(local_low, local_high)
# Mutually exclusive range. Any accelerator access to this range sent to CPU's L2 and DRAM
	external_range = [AddrRange(0x00000000, local_low-1), AddrRange(local_high+1, 0xFFFFFFFF)]
	system.iobus.mem_side_ports = clstr.local_bus.cpu_side_ports
# Connect caches if any, if cache_size !=0
	clstr._connect_caches(system, args, l2coherent=True)
	gic = system.realview.gic

We then invoke the _connect_caches function (line 20) in order to connect any cache hierarchy that exists between the cluster and the memory bus or the l2xbar of the CPU, depending on the design. This gives the accelerator cluster master access to resources outside of itself. It also establishes coherency between the cluster and other resources via caches. If no caches are needed, this merely attaches the cluster to the memory bus without a cache.

    system.acctest._connect_caches(system, options, l2coherent=True, cache_size = "32kB")

These functions are defined in gem5-SALAM/src/hwacc/AccCluster.py

Define communication

  • Add DMA

The DMA control address defined here has to match common/dma.h. The memory mapped control, pio, int_num all have to match the values set in config.yml.

	# Noncoherent DMA
	clstr.dma = NoncoherentDma(pio_addr=0x10020000, pio_size = 21, gic=gic, int_num=95)
	clstr.dma.cluster_dma = clstr.local_bus.cpu_side_ports
	clstr.dma.max_req_size = 64
	clstr.dma.buffer_size = 128
	clstr.dma.dma = clstr.coherency_bus.cpu_side_ports
	clstr.local_bus.mem_side_ports = clstr.dma.pio

Next, we create a CommInterface (Line 30), which is the communications portion of our accelerator. We then configure the accelerator and generate its LLVM interface by passing the CommInterface, a config file, and an IR file to AccConfig (Line 31). This generates the LLVM interface, configures any hardware limitations, and establishes the static Control and Dataflow Graph (CDFG).

We then connect the accelerator to the cluster (Line 32). This will attach the PIO port of the accelerator to the cluster’s local bus that is associated with MMRs.

Accelerator

  • Create a CommInterface
  • Configure it using AccConfig
  • Attach it to the accelerator cluster
  • Because we want our hardware accelerator to be managed by the host, we connect the PIO directly to the coherent cross bar.
	# vector_dma Definition
	acc = "vector_dma"
  # Datapath definition
	ir = os.environ["LAB_PATH"]+"/benchmarks/vector_dma/hw/vector_dma.ll"

	# Configs for picking up memory maps and instruction to FU mapping
  config = os.environ["LAB_PATH"]+"/benchmarks/vector_dma/config.yml"
	yaml_file = open(config, 'r')
	yaml_config = yaml.safe_load(yaml_file)
	debug = False
	for component in yaml_config["acc_cluster"]:
		if "Accelerator" in component.keys():
			for axc in component["Accelerator"]:
				print(axc)
				if axc.get("Name","") == acc:
						debug = axc["Debug"]
  # Communication interface to accelerator.
	clstr.vector_dma = CommInterface(devicename=acc, gic=gic, pio_addr=0x10020040, pio_size=64, int_num=68)
	AccConfig(clstr.vector_dma, ir, config)
	# vector_dma Config
	clstr.vector_dma.pio = clstr.local_bus.mem_side_ports
	# Activate/Deactivate debug messages
  clstr.vector_dma.enable_debug_msgs = debug
  • Connecting scratchpads
# MATRIX1 (Variable)
# Base address. Accelerator/CPU reach into scratchpad starting at this address
	addr = 0x10020080
# Set range. 0x40 = 64 bytes
	spmRange = AddrRange(addr, addr + 0x40)
# Create scratchpad object
	clstr.matrix1 = ScratchpadMemory(range = spmRange)
# Report in debug table?
	clstr.matrix1.conf_table_reported = False
# If ready (True) then pay attention to flags below
	clstr.matrix1.ready_mode = False
# Zero out after first read
	clstr.matrix1.reset_on_scratchpad_read = True
# Read succeeds only if previously init (False)
	clstr.matrix1.read_on_invalid = False
# Write succeeds only if previously read (True)
	clstr.matrix1.write_on_valid = True
# Connect scratchpad to local bus
	clstr.matrix1.port = clstr.local_bus.mem_side_ports

Run and Stats

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2

cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
#
make clean; make

# In short
$ ./runvector.sh -b -p vector_dma
# This will create a BM_ARM_OUT/vector_dma (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator

Do you understand what these stats are?

  • Runtime in cycles and wall-clock time (based on the system clock). This is for the accelerator only and does not include the host DMA.
  • Stalls
  • Accelerator Power
  • Leakage/Dynamic Power
cat BM_ARM_OUT/vector_dma/debug-trace.txt

system.vector_dma_clstr.vector_dma.llvm_interface
Total Area: 12290.3
Total Power Static: 0.154364

Total Power Dynamic: 2.43911
       Function Unit - Limit (0=inf)Units/Cycle
   double_multiplier -        0       0
  bitwise_operations -        0       0
         bit_shifter -        0       0
        double_adder -        0       0
       float_divider -        0       0
         bit_shifter -        0       0
  integer_multiplier -        0       0
       integer_adder -        0      16
      double_divider -        0       0
         float_adder -        0       0
    float_multiplier -        0       0
   ========= Performance Analysis =============
   Setup Time:                      0h 0m 0s 1ms 117us
   Simulation Time (Total):         0h 0m 0s 1ms
   Simulation Time (Active):        0h 0m 0s 1ms
        Queue Processing Time:      0h 0m 0s 0ms
             Scheduling Time:       0h 0m 0s 0ms
             Computation Time:      0h 0m 0s 0ms
   System Clock:                    0.1GHz
   Runtime:                         20 cycles
   Runtime:                         0.2 us
   Stalls:                          0 cycles
   Executed Nodes:                  19 cycles

The metrics of interest are dynamic power, area, and Runtime.

  • Dynamic power is calculated as $\sum_{i=1}^{N} E_{i} \times N_{i}$, where $E_{i}$ is the energy cost of a single event and $N_{i}$ is the number of such events.
  • Refer to the Power Model for the $E_i$ constants; the table there lists the costs of different events.
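
As a toy example with made-up energy numbers (the real $E_i$ values come from the power model): if a kernel issues 128 integer adds at $E_{add} = 0.1$ pJ each and 128 integer multiplies at $E_{mul} = 0.5$ pJ each, the dynamic term is $128 \times 0.1 + 128 \times 0.5 = 76.8$ pJ.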

TODOs

  • Change the unroll count in benchmarks/vector_dma/hw/vector_dma.c from 1 to 16 and see what happens to the runtime cycles each time. Also look at the stats Total Number of Registers, Max Register Usage Per Cycle, Runtime, the runtime FUs, and the power analysis.
  • Why are there no stalls?
    WARNING: Remember you have to make clean and rebuild .ll and main.elf each time.

  • Change the number of ports for the scratchpads from 2 to 8 (you need to modify vector_dma.py; see the range loops, e.g., line 72, and modify them for each scratchpad) and see what happens to the cycle count. Why does changing the number of ports to 1 increase stalls? To try and understand, follow the steps below.
  • Change the host CPU type in runvector.sh to MinorCPU and see the difference in overall simulation time.
  • Set FLAGS="HWACC,LLVMRuntime" in run-vector.sh. Re-run and check debug-trace.txt. Try to comprehend what the trace says; it includes the step-by-step execution of the hardware. Disable the flags for assignments; otherwise the traces will consume too much space.
  • Try and draw the dataflow by hand

Comments on trace

  • Open BM_ARM_OUT/vector_dma/debug-trace.txt and look for lines of the form: Trying to read addr: 0x0000000102..., 4 bytes through port:
  • Check how many such reads occur in one tick.
  • When changing the number of ports to 1, check how many reads occur in one tick.

  • The lines below indicate whether the .ll file was loaded and the LLVM runtime initialized
  • MMR refers to the start/stop flag
1476840000: system.acctest.vector_dma: Checking MMR to see if Run bit set
1476840000: system.acctest.vector_dma.compute: Initializing LLVM Runtime Engine!
1476840000: system.acctest.vector_dma.compute: Constructing Static Dependency Graph
1476840000: system.acctest.vector_dma.compute: Parsing: (/data/src/gem5-lab2/benchmarks/vector_dma/hw/vector_dma.ll)
  • Read from 0x2f10000c indicates a read from that address. Depending on the address range, this refers either to a scratchpad or to global memory.

  • Check the computation operations. Open the $REPO/benchmarks/vector_dma/hw/vector_dma.ll file and identify these instructions.

1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): Performing shl Operation
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): 2 << 2
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): shl Complete. Result = 8
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): Performing add Operation (13)
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): 2 + 2
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): add Complete. Result = 4
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): Performing add Operation (6)
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): 3 + 3
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): add Complete. Result = 6
  • Dataflow graph Visualizer. See if you can spot the difference between parallel and serial.
module load llvm-10
clang --version
# Should be 10.
cd $REPO/benchmarks/vector_dma/hw/
# Dataflow graph with
clang -emit-llvm -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.serial.pdf

clang -emit-llvm -O3 -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.parallel.pdf

Model 1.5 : Batched DMA.

benchmarks/vector_dma_2

In Model 1, we moved all the data we needed into the scratchpad and then kickstarted the computation. However, scratchpads are finite and accelerators can only work with data in the scratchpad. Hence we may need to restrict the size of the accelerator and process the data in multiple batches. In this example we restrict the accelerator to process only 8 elements at a time, but we have 16 elements in the array, so we have to process the data in 2 batches. The modifications are in the host code that manages the batches.

// Modified config.ini to set scratchpad size
// Modified defines.h
#define N 8
// The accelerator datapath will work on 8 elements at a time
// Modified top.c to process 16 elements as two batches.
// Batch 0 DMAs elements 0-7 to the scratchpad
  dmacpy(spm1, m1, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
  dmacpy(spm2, m2, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
// Invoke accelerator on scratchpad address range
val_a = (uint64_t)spm_base;
  val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
  val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);



// Batch 1 DMAs elements 8-15
  dmacpy(spm1, m1 + N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
  dmacpy(spm2, m2 + N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();

// Notice that in both cases we are passing the scratchpad base address to the accelerator
// datapath to work on. This is redundant, and we can hardcode it into the accelerator
// datapath (hw/vector_dma_2x.c) if we want.

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2

cd $REPO/benchmarks/vector_dma_2
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/

# Full command for gem5 simulation
$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_dma_2 gem5-config/fs_vector_dma_2.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_dma_2/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_dma_2 --caches --l2cache --acc_cache

# Short command
$ ./runvector.sh -p -b vector_dma_2
# This will create a BM_ARM_OUT/vector_dma_2 (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator

TODO

  • Change N to 4 and perform computation in 4 batches.

Model 2: Cache

The cache model hooks up the accelerator to global memory through a coherence crossbar. It is OK not to be familiar with the coherence crossbar when reading this document. You only need to understand that, with coherence available, the accelerators can directly reference the DRAM space mapped to the CPU.

To enable the accelerator cache:

  • First change line 49 to CACHE_OPTS="--caches --l2cache --acc_cache"
  • The cache size can be changed in gem5-config/vector_cache.py: clstr._connect_caches(system, options, l2coherent=True, cache_size = "32kB")
$M5_PATH/build/ARM/gem5.opt --debug-flags=DeviceMMR,LLVMInterface,AddrRanges,NoncoherentDma,RuntimeCompute --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_cache.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_cache/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache

# In short
$ ./runvector.sh -b vector_cache -p

The image below compares the system organization with an accelerator cache and without.

The primary difference between the cache and DMA versions is in the host code.

Host code for cache model

The pointers passed to the accelerator point to the global memory space (base). The load and store operations directly touch these locations and access them through the coherence cross bar.

// benchmarks/vector_cache/host/main.cpp
  uint64_t base = 0x80c00000;
  uint64_t spm_base = 0x2f100000;
  val_a = (uint64_t)base;
  val_b = (uint64_t)(base + sizeof(TYPE) * N);
  val_c = (uint64_t)(base + 2 * sizeof(TYPE) * N);


  • Remember to set the cache size correctly (clstr._connect_caches(..., cache_size = "32kB") in gem5-config/vector_cache.py, as above); if you don't, your application may misbehave.
cd $REPO/benchmarks/vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make


$ ./runvector.sh -b vector_cache -p


TODOs

    1. Try out a larger vector. You need to modify the host code to initialize bigger data, a 32-element array with the numbers [0x1 to 0x20].
    • Directly initialize in the host code starting from the base address. Note that you do not have allocators and initializers in bare metal; you will need to set up the pointers starting from base.
    • Modify arg1, arg2 and arg3 to point to the new base addresses.

    Modify the parameter N in defines.h.

    2. Modify inputs/m0.bin and m1.bin to include the additional 16 numbers (use hexedit). Check the lines below to ensure m0 and m1 are loaded in the appropriate place. Note that the numbers are stored in little-endian format and in hex (e.g., 16 = 0x10). If m0, a 32-int array, starts at 0x80c00000, what address does m1 start at?
  test_sys.kernel = binary(options.kernel)
        test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
        test_sys.kernel_extras_addrs = [0x80c00000,0x80c00000+os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin")]
        print("Loading file m0 at" + str(hex(0x80c00000)))
        print("Loading file m1 at" + str(hex(0x80c00000 + os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin"))))

Model 3: Multi-accelerator with Top manager

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin

cd $REPO/benchmarks/multi_vector
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make


$ ./runmulti.sh -b multi_vector -p


$ $M5_PATH/build/ARM/gem5.opt --debug-flags=HWACC,Runtime --outdir=BM_ARM_OUT/multi_vector gem5-config/fs_multi_vector.py --mem-size=8GB --kernel=$LAB_PATH/benchmarks/multi_vector/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=multi_vector --caches --l2cache --acc_cache

For larger applications we may need to include multiple accelerators in the cluster. For this, we include a top accelerator to coordinate the other accelerators. The figure below shows the system model. The top accelerator now takes over the DMA and accelerator kickstart logic from the host; it also initiates the DMA movement between the accelerators. The host in this case simply passes the address pointers for the inputs and the output zone. There are two accelerators: vector and vector2.

// host/main.cpp
volatile uint8_t  * top   = (uint8_t  *)0x2f000000;
volatile uint32_t * val_a = (uint32_t *)0x2f000001;
volatile uint32_t * val_b = (uint32_t *)0x2f000009;
volatile uint32_t * val_c = (uint32_t *)0x2f000011;

int main(void) {

  // Pointers in DRAM. m1 and m2 are inputs.
  // m3 is the output
    uint32_t base = 0x80c00000;
    TYPE *m1 = (TYPE *)base;
    TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
    TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);

  // MMRegs of the top accelerator.
   // Argument 1 to top
    *val_a = (uint32_t)(void *)m1;
   // Argument 2 to top
    *val_b = (uint32_t)(void *)m2;
   // Argument 3 to top
    *val_c = (uint32_t)(void *)m3;
   
  • hw/source/top.c: Code for the top accelerator coordinator. This is itself an accelerator.
  • config.yml: Configuration for the accelerators
  • hw/source/vector.c: Code for the first stage of the vector accelerator
  • hw/source/vector2.c: Code for the second stage of the vector accelerator
  • hw/ir: LLVM files produced after the compiler generates the dataflow graph

Memory map

  • Defined in multi_vector.py (generated from config.yml by scripts/systembuilder.py)
Start address Description
0x10020040 Memory mapped args for top
0x10020080 Memory mapped args for vector
0x10020780 Memory mapped args for vector2
0x100200c0, 0x10020300, 0x10020540 Scratchpad for vector
0x100207c0, 0x10020a00, 0x10020c40 scratchpad for vector2
  // Accelerator 1: vector.c
    for(i=0;i<N;i++) {
            tmp_m3[i]  = (m1[i] + m2[i]);

    }
  // Accelerator 2: vector2.c
    for(i=0;i<N;i++) {
            m3[i]  = tmp_m3[i] * 8;

    }

Top manages the accelerators itself

  • Step 0: Obtain DMA control reg address
// hw/source/top.c
  volatile uint8_t *DmaFlags = (uint8_t *)(DMA);
  volatile uint64_t *DmaRdAddr = (uint64_t *)(DMA + 1);
  volatile uint64_t *DmaWrAddr = (uint64_t *)(DMA + 9);
  volatile uint32_t *DmaCopyLen = (uint32_t *)(DMA + 17);

Step 1: DMA DRAM->Scratchpad

Transfer data from DRAM to Scratchpad S1 of Accelerator Vector.

// multi_vector/hw/top.c
// Global Memory Address
  *_DMARdAddr = (uint32_t)m1;
// Scratchpad address
  *_DMAWrAddr = (uint32_t)MATRIX1;
// Vector Len
  *_DMACopyLen = vector_size;
  // // // Fence it
  *_DMAFlags = 0;
  while (*_DMAFlags != 0x0);
  *_DMAFlags = DEVINIT;
  // Poll DMA for finish
  while ((*_DMAFlags & DEVINTR) != DEVINTR);
  // // Reset DMA
  *_DMAFlags = 0x0;

Scratchpad memory is laid out in the following manner:

  • M1: 0x100200c0, N*sizeof(int) bytes
  • M2: 0x10020300, N*sizeof(int) bytes
  • M3: 0x10020540, N*sizeof(int) bytes

Step 2: Start accelerator V1.

Set up arguments if required. The accelerator can only work with data in the scratchpad or local registers; these are fixed memory ranges in the DMA space. In this case, the V1 vector accelerator does not require any additional arguments. To start the accelerator from TOP, it is important to follow the steps below (in particular, checking whether the accelerator is ready for kickstart) after the arguments are set up.

// Write to argument MMR of V1 accelerator

// First, check if accelerator ready for kickstarting
while (*V1Flags != 0x0);

// Start the accelerated function
*V1Flags = DEV_INIT;

// Poll function for finish
while ((*V1Flags & DEV_INTR) != DEV_INTR);

// Reset accelerator for next time.
*V1Flags = 0x0;

Step 3: DMA accelerator V1 -> V2.

The output of accelerator V1 is the input of V2. We need to copy N*4 bytes from V1's output scratchpad (M3ADDR) to V2's input scratchpad (M1ADDR_V2).

  // Transfer the output of V1 to V2.
  *DmaRdAddr = M3ADDR;
  *DmaWrAddr = M1ADDR_V2;
  *DmaCopyLen = vector_size;
  *DmaFlags = DEV_INIT;
  // //Poll DMA for finish
  while ((*DmaFlags & DEV_INTR) != DEV_INTR)
    ;

Step 4: Kickstart accelerator V2

// Write to argument MMR of V2 accelerator

// First, check if accelerator ready for kickstarting
while (*V2Flags != 0x0);

// Start the accelerated function
*V2Flags = DEV_INIT;

// Poll function for finish
while ((*V2Flags & DEV_INTR) != DEV_INTR);

// Reset accelerator for next time.
*V2Flags = 0x0;

Step 5: Copy data from V2 to CPU space

  // Transfer M3
  // Scratchpad address
  *DmaRdAddr = M3ADDR_V2;
  // Global address the host wants the final result in
  *DmaWrAddr = m3_addr;
  // Number of bytes
  *DmaCopyLen = vector_size;
  // Start DMA
  *DmaFlags = DEV_INIT;
  // Poll DMA for finish
  while ((*DmaFlags & DEV_INTR) != DEV_INTR)
    ;

TODO

  • The DMA between accelerators V1 and V2 has high overhead. Ask yourself why. Merge V2 into V1 and modify top. Compare cycles. Why did the cycle count not reduce as much as you expected? What about the power of the merged accelerator?

Model 3.5: Multi-accelerator with accelerator cache

We now create a multi-accelerator system with a shared cache. We do not need to explicitly transfer data between accelerators and all data is implicitly transferred between the accelerators through the shared cluster cache. The top only has to set up the appropriate arguments and invoke the accelerators in sequence. Each accelerator reads and writes back to the global memory space and the cluster cache captures the locality.

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
#  Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin

cd $REPO/benchmarks/multi_vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make


# In short.
$ ./runmulti.sh -b multi_vector_cache -p

# Full command
$ $M5_PATH/build/ARM/gem5.opt --debug-flags=HWACC,Runtime --outdir=BM_ARM_OUT/multi_vector_cache gem5-config/fs_multi_vector_cache.py --mem-size=8GB --kernel=$LAB_PATH/benchmarks/multi_vector_cache/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=multi_vector_cache --caches --l2cache --acc_cache


Step: Activate accelerator v1 and v2

// benchmarks/multi_vector_cache/hw/source/top.c
// Pass on host address arguments to accelerator
  *V1Arg1 = m1_addr;
  *V1Arg2 = m2_addr;
  *V1Arg3 = m3_addr;
// Start V1
  *V1Flag = DEV_INIT;
  // Poll function for finish
  while ((*V1Flag & DEV_INTR) != DEV_INTR);
  *V1Flag = 0x0;

// Start V2
  *V2Flag = DEV_INIT;
  while ((*V2Flag & DEV_INTR) != DEV_INTR);
  *V2Flag = 0x0;

TODO

  • Merge V2 into V1 and modify top. Compare cycles. What about the number of accesses to the cluster cache ?

Model 4: Streaming Pipelines

Streaming pipelines introduce a FIFO interface to the memory system. If you take a look at the datapath in vector_dma/hw/vector_dma.c, you will notice that the memory access pattern is highly regular with no ordering requirements between the elements of the array. We simply sequence through the elements of the vector, applying an operation at each location. This can be concisely described as a stream of values; a stream simply provides a FIFO interface to the data.

Memory map. The term MMR refers to the memory-mapped registers and flags used to control the DMA and the accelerators.

MMRs: 0x100200c0 (TOP), 0x10020100 (S1), 0x10020400 (S2), 0x100204c0 (S3), 0x10020000 (StreamDMA), 0x10020080 (Noncoherent DMA)

Stream/FIFO ports: 0x10020000 (DRAM->S1, S3->DRAM), 0x100203c0 (S1->S2), 0x10020480 (S2->S3)

Step 1: StreamDMA

This streams data from DRAM in chunks of stream_size (bits). The figure illustrates a stream DMA. We need to create a new configuration and modify top to initiate the stream. The stream DMA includes a control pio (similar to other accelerators) that top writes to in order to control where in DRAM data is streamed from. The out port of the StreamDMA engine is wired up to the stream ports of one of the accelerators. Each stream is a single-input, single-output FIFO. Each accelerator has a .stream interface into which all the required streams are wired. In this case we i) read from DRAM and send the data to accelerator S1, and ii) read data from accelerator S3 and write it back through the stream DMA.

  • Each of the accelerators will use memory mapped address 0x2f0001000 to read/write to the stream addresses.
  • Each access will read stream_size:8 bits worth of data from the port.
  • In total the FIFO will supply StrDmaRdFrameSize bytes of data in chunks of stream_size. The total number of dataflow tokens generated will be $\frac{\mathrm{RdFrameSize} \times 8}{\mathrm{stream\_size}}$ (a worked example follows the configuration below).
    # Configuration in gem5-config/vector_stream.py
  # 0x10020000 DMA control address
	clstr.streamdma = StreamDma(pio_addr=0x10020000, status_addr=0x10020040, pio_size = 32, gic=gic, max_pending = 32)
	# Stream read/write address
  clstr.streamdma.stream_addr = 0x10020000 + 32
	clstr.streamdma.stream_size = 128
	clstr.streamdma.pio_delay = '1ns'
	clstr.streamdma.rd_int = 210
	clstr.streamdma.wr_int = 211
	clstr.streamdma.dma = clstr.coherency_bus.cpu_side_ports
	clstr.local_bus.mem_side_ports = clstr.streamdma.pio


    # DRAM->Accelerator S1
   	clstr.s1.stream = clstr.streamdma.stream_out

    # Accelerator S3->DRAM	
    clstr.s3.stream = clstr.streamdma.stream_in
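
As a quick check of the token-count formula above: with the stream_size of 128 bits configured here and a hypothetical 64-byte read frame (StrDmaRdFrameSize = 64), the stream would deliver $\frac{64 \times 8}{128} = 4$ dataflow tokens.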

// vector_stream/hw/source/top.c
 // StreamDma
  volatile uint8_t *StrDmaFlags = (uint8_t *)(STREAMDMA_Flags);
  volatile uint64_t *StrDmaRdAddr = (uint64_t *)(STREAMDMA_Flags + 4);
  volatile uint64_t *StrDmaWrAddr = (uint64_t *)(STREAMDMA_Flags + 12);
  volatile uint32_t *StrDmaRdFrameSize = (uint32_t *)(STREAMDMA_Flags + 20);
  volatile uint8_t *StrDmaNumRdFrames = (uint8_t *)(STREAMDMA_Flags + 24);
  volatile uint8_t *StrDmaRdFrameBuffSize = (uint8_t *)(STREAMDMA_Flags + 25);
  volatile uint32_t *StrDmaWrFrameSize = (uint32_t *)(STREAMDMA_Flags + 26);
  volatile uint8_t *StrDmaNumWrFrames = (uint8_t *)(STREAMDMA_Flags + 30);
  volatile uint8_t *StrDmaWrFrameBuffSize = (uint8_t *)(STREAMDMA_Flags + 31);

// Initiate Stream from DRAM to FIFO port
  *StrDmaRdAddr = in_addr;
  *StrDmaRdFrameSize = INPUT_SIZE; // Specifies number of bytes
  *StrDmaNumRdFrames = 1;
  *StrDmaRdFrameBuffSize = 1;
// Start Stream
  *StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;

Step 2: Streambuffers

Stream buffers establish ports directly between accelerators. They do not need to be set up at runtime: the configuration is fixed, and the accelerators simply read from and write to the address that controls the port. For example, here we have set up a stream buffer between accelerators V1 and V2. Each accelerator uses the address to read or write the FIFO. The stream buffer only supports a single input and a single output port.

  ┌──────────────────────┐                             ┌───────────────┐
  │    Accelerator V1    │      ┌─────────────┐        │   Acclerator  │
  │                      ├─────►│ FIFO Buffer ├────────►     V2        │
  └──────────────────────┘      └─────────────┘        └───────────────┘

# Address accelerator v1 and v2 can read and write to.
# S1Out (Stream Variable)
addr = 0x10020380
# stream_size. # bits read on each access
clstr.s1out = StreamBuffer(stream_address = addr, status_address= 0x100203c0, stream_size = 8, buffer_size = 8)
# Input to the buffer from accelerator S1
clstr.s1.stream = clstr.s1out.stream_in
# Output of buffer sent to accelerator S2.
clstr.s2.stream = clstr.s1out.stream_out

Each stream buffer supports only a single input port and a single output port. However, multiple stream buffers can be wired to a single accelerator, i.e., each accelerator can have multiple stream-buffer ports.

                       ┌───────────────┐
┌─────────────┐        │   Acclerator  │
│  0x10020380 ├────┬───►               │
└─────────────┘    ├───►      V2       │
                   │   └───────────────┘
┌─────────────┐    │
│ 0x10020440  ├────┘
└─────────────┘
# Address accelerator v1 and v2 can read and write to access FIFO.
addr = 0x10020380
clstr.s1out = StreamBuffer(stream_address = addr, status_address= 0x100203c0, stream_size = 8, buffer_size = 8)
clstr.s1.stream = clstr.s1out.stream_in
clstr.s2.stream = clstr.s1out.stream_out
cd $REPO

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/gem5-salam-v2
# Set compiler
export CROSS_COMPILE_DIR=/data/arm/gcc-arm-none-eabi-10.3-2021.10/bin

cd $REPO/benchmarks/vector_stream
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-10
# Build datapath and host binary
make clean; make


# In short on 227.
$ ./runvector_stream.sh  -p

# Full command

$M5_PATH/build/ARM/gem5.opt --outdir=BM_ARM_OUT/vector_stream gem5-config/fs_vector_stream.py --mem-size=4GB --kernel=$LAB_PATH/benchmarks/vector_stream/host/main.elf --disk-image=$M5_PATH/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=$LAB_PATH/benchmarks --accbench=vector_stream --caches --l2cache --acc_cache


Step 3: Top

The purpose of top is to kickstart the stream DMA from memory. Completion is detected by checking whether the output stream is complete. The overall execution is data-driven: when the FIFO port empties out, the top accelerator triggers the completion of the stream.

// Start Stream DMAs
  *StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;

  // Start all accelerators
  // Start S1
  *S1 = 0x01;
  // Start S2
  *S2 = 0x01;
  // Start S3
  *S3 = 0x01;

// Wait for all accelerators to finish before sending interrupt to CPU
while ((*StrDmaFlags & 0x08) == 0x08);

Step 4: Accelerator stages S1-S3

As each accelerator fills its stream buffer ports, it automatically triggers the operations in the neighboring accelerators in a dataflow manner. Each accelerator has to know how many tokens are going to be generated and has to read its stream buffer port. The S1 stage writes to the FIFO stream buffer between S1 and S2, using the appropriate memory-mapped stream buffer port.

// vector_stream_clstr_hw_defines.h
//Accelerator: TOP
#define TOP 0x100200c0
//Accelerator: S1
#define S1 0x10020100
#define S1Buffer 0x10020140
#define S1Out 0x10020380
#define S1Out_Status 0x100203c0
//Accelerator: S2
#define S2 0x10020400
#define S2Out 0x10020440
#define S2Out_Status 0x10020480


// hw/S1.c
	volatile dType_8u * STR_IN  	= (dType_8u *)(S1In);
	volatile dType_8u * BUFFER 		= (dType_8u *)(S1Buffer); 
	volatile dType_8u * STR_OUT		= (dType_8u *)(S1Out);

	
	for (dType_Reg i = 0; i < INPUT_SIZE; i++) {
			*STR_OUT = (*STR_IN) + BUFFER[i];
        }
}

Complete configuration

TODO

  • Add another stage, S4, to the streaming accelerators. You will need to add hw/source/S4.c and hw/configs/S4.ini. Modify top.c to define the memory map for the MMR and stream ports. Modify hw/gem5-config/vector_stream.py. You will need to make S4 the final stage writing to the stream DMA, and you will have to define a new stream buffer that connects S3 and S4. You may also need to make Makefile modifications; figure those out.

Generating gem5 SoC configs

A key part of the gem5 infrastructure is the ability to generate SoC configurations. This is done using the config.yml file, which is processed by a python script (SALAM-Configurator/systembuilder.py).

cd $REPO
export LAB_PATH=$PWD
export BENCH=gemm
# benchmarks/gemm/config.yml : Top level config file
# Includes all the required components with their sizes
 ${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName gemm --benchDir "benchmarks/gemm"
# Two outputs:
benchmarks/gemm/gemm_clstr_hw_defines.h # Defines the memory map of accelerators
# plus the generated gem5 config files (fs_gemm.py and gemm.py) under config/SALAM/generated/
---
acc_cluster:
# Name of header to be generated
  - Name: multi_vector_clstr
# Define DMA
  - DMA:
    - Name: dma
      MaxReqSize: 64  # Max request size
      BufferSize: 128 # Buffer size
      PIOMaster: LocalBus # Bus on which requests are invoked
      Type: NonCoherent # Coherent or NonCoherent
      InterruptNum: 95 # Do not change. Interrupt number. Check boot.s if interrupt number is changed
  - Accelerator:  # Define accelerators. Multiple defined here
    - Name: Top  # Name of accelerator
      IrPath: benchmarks/multi_vector/hw/top.ll # Datapath definition
      ConfigPath: benchmarks/multi_vector/hw/top.ini # Configuration file. For future extensions
      PIOSize: 25 # Number of bytes of memory mapped registers. 1 Byte flag. 8 bytes for each registers
      InterruptNum: 68 # Interrupt number. DO NOT CHANGE. Check boot.s if interrupt number is changed. Only Top has interrupt
      PIOMaster: LocalBus # Bus on which requests are invoked
      # Local to PIO
      LocalSlaves: LocalBus # Local bus to which the accelerator is connected
      Debug: False # Debug. False or True. Make sure its enabled if you want to see what's going on within the accelerator
  - Accelerator: # 2nd accelerator
    - Name: vector
      IrPath: benchmarks/multi_vector/hw/vector.ll
      ConfigPath: benchmarks/multi_vector/hw/vector.ini
      Debug: False
      PIOSize: 1
      PIOMaster: LocalBus
    - Var: # Add-ons to accelerator
      - Name: MATRIX1 # Scratchpad name
        Type: SPM # Scratchpad
        Size: 512 # size in bytes
        Ports: 2 # Number of ports. Parallel accesses to scratchpad.
    - Var:
      - Name: MATRIX2
        Type: SPM
        Size: 512
        Ports: 2
    - Var:
      - Name: MATRIX3
        Type: SPM
        Size: 512
        Ports: 2
  - Accelerator:
    - Name: vector2
      IrPath: benchmarks/multi_vector/hw/vector2.ll
      ConfigPath: benchmarks/multi_vector/hw/vector2.ini
      Debug: False
      PIOSize: 1
      PIOMaster: LocalBus
    - Var:
      - Name: V2_MAT1
        Type: SPM
        Size: 512
        Ports: 2
    - Var:
      - Name: V2_MAT2
        Type: SPM
        Size: 512
        Ports: 2
    - Var:
      - Name: V2_MAT3
        Type: SPM
        Size: 512
        Ports: 2
hw_config:   # Always include below configuration. Defines the function unit spec.
  top:
  vector2:
  vector:
    instructions:
      add:
        functional_unit: 1
        functional_unit_limit: 0
        opcode_num: 13
        runtime_cycles: 0
        ............

Acknowledgement

This document has been put together by your CMPT 750/450 instructors and Milad Hakimi.