Acknowledgments

In this lab and accelerator assignment we will be using a specific version of gem5 developed at the University of North Carolina at Charlotte. We thank the authors of gem5-SALAM for making their tool available as open source. The CMPT 750 version of gem5 includes additional changes and benchmarks and may not be backwards compatible with the upstream SALAM version.

Gem5 ACC Overview

gem5 ACC extends gem5 to model domain-specific architectures and heterogeneous SoCs. When creating your accelerator there are many design decisions to make that have a first-order effect on overall performance and energy.

These include:

  • How to integrate the accelerator into the system and define it
  • How does it receive data? Is it coupled to main memory?
  • Does the accelerator need DMAs?
  • How is the accelerator controlled?
  • How much parallelism is desired?

In this case, the code we are going to accelerate is the following:

char input[8]  = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8};
char coeff[8]  = {0x1,0x2,0x3,0x4,0x5,0x6,0x7,0x8}; // renamed from "const", which is a C keyword
char output[8];
for (int i = 0; i < 8; i++)
{
  output[i] = input[i] + coeff[i];
  output[i] = output[i]*2;
  output[i] = output[i]*2;
}

We will illustrate accelerator creation using the DMA model.

Vector_DMA SoC

$ git clone git@github.com:CMPT-7ARCH-SFU/gem5-lab2.git
$ cd gem5-lab2/benchmarks/vector_dma
$ ls
defines.h  host  hw  Makefile
   
host/           Defines the main() that runs on the CPU and invokes the accelerator
hw/             Defines the accelerator datapath
hw/config.ini   Defines the accelerator configuration
inputs/m0.bin, m1.bin   Input files that contain the data
defines.h       Common defines used by both the datapath and the host code

Our SoC has two main sections.

  • Host
  • Accelerator

Here we focus on the host code and its interactions with the accelerator. For this application, we are using a bare-metal kernel. This means that we have a load file and an assembly startup file, and we must generate ELF files for execution.

# Pseudocode for host code
1. Set up addresses for the scratchpad
2. Copy data from DRAM into the scratchpad
3. Start the accelerator
4. Copy data from the scratchpad back to DRAM

1. Address Mapping

  • defines.h Top-level definition of the memory-mapped addresses. Since this is a bare-metal SoC without virtual memory or memory allocators, we have to define the memory map ourselves. The overall memory space looks like the following:
#define acc        *(char *)0x2f000000
#define val_a      *(int *)0x2f000001
#define val_b      *(int *)0x2f000009
#define val_c      *(int *)0x2f000011
Address range            Description
0x0                      Host DRAM coherent address space
0x2f000000               Accelerator status (0: inactive, 1: start, 4: running)
0x2f000001-0x2f000011    Accelerator parameters (3 parameters, 8 bytes each; see the bench/ code)
0x2f100000-0x2FFFFFFF    Scratchpad memories (limit of the accelerator range)
Above 0x2FFFFFFF         Host DRAM space

These memory spaces are set in the following files

  • Accelerator range

Any access to this range, whether from the host code or from the accelerator, is routed to the accelerator cluster.

gem5-config/HWAcc.py
local_low       = 0x2F000000
local_high      = 0x2FFFFFFF
  • Accelerator Start Address and Parameters
config.ini
[CommInterface]
pio_addr = 0x2f000000
pio_size = 64

pio_size is in bytes

  • Scratchpad address
config.ini
[Memory]
addr_range = 0x2f100000

In this instance we want the data movement handled by a dedicated DMA engine rather than by the CPU, to reduce the overhead on the CPU. The helper functions for driving the DMA are defined in common/dma.h.
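To make the later snippets concrete, here is a minimal sketch of what such helpers can look like. It is an illustration only, assuming the noncoherent DMA control registers at 0x2ff00000 described later in this document (a flags byte followed by 64-bit read/write addresses and a 32-bit copy length) and assumed flag values; the authoritative definitions are in common/dma.h.

// Illustrative sketch only; the real helpers live in common/dma.h.
#include <stdint.h>

#define DMA_BASE 0x2ff00000UL
static volatile uint8_t  *const DmaFlags   = (uint8_t  *)(DMA_BASE);      // control/status byte
static volatile uint64_t *const DmaRdAddr  = (uint64_t *)(DMA_BASE + 1);  // source address
static volatile uint64_t *const DmaWrAddr  = (uint64_t *)(DMA_BASE + 9);  // destination address
static volatile uint32_t *const DmaCopyLen = (uint32_t *)(DMA_BASE + 17); // length in bytes

#define DEV_INIT 0x01   // assumed "start transfer" flag value
#define DEV_INTR 0x04   // assumed "transfer complete" flag value

// Queue a copy of len bytes from src to dst (DRAM or scratchpad addresses).
static inline void dmacpy(void *dst, void *src, uint32_t len) {
  *DmaRdAddr  = (uint64_t)(uintptr_t)src;
  *DmaWrAddr  = (uint64_t)(uintptr_t)dst;
  *DmaCopyLen = len;
  *DmaFlags   = DEV_INIT;
}

// Non-zero once the engine signals completion.
static inline int pollDma(void) { return (*DmaFlags & DEV_INTR) == DEV_INTR; }

// Clear the flags so the engine can accept the next transfer.
static inline void resetDma(void) { *DmaFlags = 0; }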

  • Loading of binary input files for the host code. Here we have a 16 * 4 byte binary file containing small integers (shown in the dump below), stored in little-endian format (LSB first).
$ cd benchmarks/vector_dma
$ xxd m0.bin
00000000: 0100 0000 0200 0000 0300 0000 0400 0000  ................
00000010: 0500 0000 0600 0000 0700 0000 0800 0000  ................
00000020: 0900 0000 0a00 0000 0b00 0000 0c00 0000  ................
00000030: 0c00 0000 0d00 0000 0e00 0000 0f00 0000  ................
# fs_vector_input.py
  test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/vector_dma/m1.bin"]
  • Define the DRAM addresses where the input data is loaded
main.cpp

uint64_t base = 0x80c00000;
TYPE *m1 = (TYPE *)base;
TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);

2. Copy data from DRAM to Scratchpad

We then set up the DMA to perform the memory copy between DRAM and the scratchpad memory. dmacpy is similar to memcpy. Note the address ranges used for the copy: the destination uses the scratchpad range specified in config.ini and the gem5 scripts. This space is carved out of the global memory space, and the host CPU routes any reads and writes in this address range to the scratchpad.

// Define scratchpad addresses.
uint64_t spm_base = 0x2f100000;
TYPE *spm1 = (TYPE *)spm_base;
TYPE *spm2 = (TYPE *)(spm_base + sizeof(TYPE) * N);
TYPE *spm3 = (TYPE *)(spm_base + 2 * sizeof(TYPE) * N);

// spm1 is the destination address
// m1 is the source address
// Size is in bytes.
dmacpy(spm1, m1, sizeof(TYPE) * N);
while (!pollDma());
resetDma();
dmacpy(spm2, m2, sizeof(TYPE) * N);
while (!pollDma());
resetDma();

3. Start accelerator

Set up the parameters in the accelerator's memory-mapped registers. The accelerator status byte is at pio_addr = 0x2f000000 (set in config.ini), and pio_size = 64 reserves 64 bytes of memory-mapped register space: the status byte followed by the parameter registers starting at 0x2f000001, each 8 bytes. The parameters are automatically derived from the accelerator function definition.
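As a concrete illustration, for the DMA version the three parameters are the scratchpad addresses of the two inputs and the output. A minimal sketch of the argument setup, assuming the val_a/val_b/val_c macros from defines.h and the spm_base pointer defined above (the kickstart itself is shown in the snippet that follows); the repository's main.cpp has the authoritative version.

// Pass the scratchpad locations of the operands through the accelerator's
// memory-mapped parameter registers before setting the status byte.
val_a = (uint64_t)spm_base;                          // input 1 in scratchpad
val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);     // input 2 in scratchpad
val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N); // output slot in scratchpad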

 // Status values: 0x0: inactive, 0x1: start the accelerator, 0x4: running.
  acc = 0x01;
  printf("%d\n", acc);
  // Check for accelerator end.
  while (acc != 0x0) {
    printf("%d\n", acc);
  }

In our boot code, we set up an Interrupt Service Routine (ISR) in isr.c that the accelerator triggers at the end of its execution. The ISR resets the accelerator status to 0x0, which the host code spins on.

// ISR.c. Invoked when accelerator is complete
void isr(void)
{
	printf("Interrupt\n\r");
  // Helps break the for loop in the host code
	acc = 0;
}

4. Copy result from accelerator.

We copy the results back from the accelerator to DRAM so that the host code can access and check them.

 dmacpy(m3, spm3, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
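At this point the host can verify the output directly in DRAM. A minimal check sketch, assuming the datapath computes prod[i] = 4*(m1[i] + m2[i]) as shown later; the actual host code in the repository may check the result differently.

// Verify the result the accelerator wrote back to DRAM.
int fails = 0;
for (int i = 0; i < N; i++) {
  if (m3[i] != 4 * (m1[i] + m2[i]))
    fails++;
}
printf("%d mismatches\n", fails);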

Check yourself

  • In which file is the base address of the scratchpad defined?
  • Where are the low and high marks of the accelerator address range defined?
  • How did we detect that the accelerator completed execution?

Accelerator datapath definition

We first start by creating the code for our accelerator.

In bench/vector_dma.c there is a vector loop application. To expose parallelism for computation and memory access we fully unroll the innermost loop of the application; the simulator will natively pipeline the remaining loop iterations for us. To unroll the loop we use clang compiler pragmas such as the one on line 18 of vector_dma.c.

  // Unrolls loop and creates instruction parallelism
    #pragma clang loop unroll_count(8)
    for(i=0;i<N;i++) {
            prod[i]  = 4*(m1[i] + m2[i]);

    }

Figure: dataflow graph with unrolling.

The hardware ends up being a circuit that implements the above dataflow graph. The unrolling creates 8-way parallelism: the loads of m1[i] and m2[i] can happen in parallel, and the adds and multiplies can happen in parallel. The figures show the compiler representation, or view, that gets mapped down to hardware. Each node in the graph is an LLVM IR instruction. This is an intermediate, RISC-like representation with certain important differences.

Benefits of Compiler IR view of Accelerator

  • Infinite registers. Typical object code for CPUs is limited by the architectural registers. This causes unnecessary memory operations to spill and refill registers, which hides the available parallelism. Compiler IR has no such limitation since it simply captures the available parallelism and locality.

  • Dataflow semantics. Object code is laid out linearly and relies on a program counter; compiler IR inherently supports dataflow semantics with no program counter.

Figure: dataflow graph without unrolling.

Rules for Accelerator Datapath

We are generating a hardware datapath from the C code, hence there are a number of rules. If these rules are violated, the compiler may complain, you may encounter a runtime error from the LLVM runtime engine of SALAM, or you may even get a silent failure. It is very important that you follow them; a rule-compliant sketch follows the list.

  • Rule 1: SINGLE FUNCTION. Only a single function is permitted per accelerator .c file.
  • Rule 2: NO LIBRARIES. Cannot use standard library functions. Cannot call into other functions.
  • Rule 3: NO I/O. No printfs or writes to files. Either use traces or write back to CPU memory to debug.
  • Rule 4: ONLY LOCALS OR ARGS. Can only work with variables declared within the function or the input arrays.
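As an illustration of these rules, a compliant datapath is a single self-contained function that only touches its arguments and locals. The sketch below mirrors the vector_dma kernel; TYPE and N stand in for the definitions in defines.h.

// Single function, no library calls, no I/O; only arguments and locals.
typedef int TYPE;
#define N 16

void vector_kernel(TYPE *m1, TYPE *m2, TYPE *prod) {
  int i;
  #pragma clang loop unroll_count(8)  // expose parallelism, as described above
  for (i = 0; i < N; i++) {
    prod[i] = 4 * (m1[i] + m2[i]);
  }
}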

Read here for more details on LLVM

Run

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM


# Building gem5-SALAM
git clone git@github.com:CMPT-7ARCH-SFU/gem5-SALAM.git
cd gem5-SALAM; scons build/ARM/gem5.opt -j`nproc`


# Build benchmark

cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh 
module load llvm-38
# Build datapath and host binary
make clean; make



# If you are on the 227 machine, follow the instructions below


# Running without docker
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma --caches --l2cache

# OR in a single line
./runvector.sh -b vector_dma


# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_dma -p"

Scripts

INI files

For each of our accelerators we also need to generate an INI file. In each INI file we can define the number of cycles for each IR instruction and provide any limitations on the number of Functional Units (FUs) associated with IR instructions.

Additionally, there are options for setting the FU clock periods and controls for pipelining of the accelerator. Below is an example with a few IR instructions and their respective cycle counts:

[CycleCounts]
counter = 1
gep = 0
phi = 0
select = 1
ret = 1
br = 0
switch = 1
indirectbr = 1
invoke = 1

Importantly, under the AccConfig section, we set MMR-specific details such as the size of the flags register, the memory address, the interrupt line number, and the accelerator's clock.

[AccConfig]
flags_size = 1
config_size = 0
int_num = -1
clock_period = 10
premap_data = 0
data_bases = 0

In the Memory section, you can define the scratchpad's memory address, size, response latency, and number of ports. Also, if we want the accelerator to verify that data exists in the scratchpad prior to accessing it, we can set ready_mode to true.

[Memory]
addr_range = 0x2f100000
size = 98304
latency = 2ns
ports = 4
ready_mode = True
reset_on_private_read = False

Lastly, under CommInterface we set the memory address of the accelerator's MMR as well as its overall size, which must account for all variables that need to be passed (8 bytes per variable) plus the flags. The pio_size is in bytes: 1 byte for the start/stop flag and 64 bytes for the input arguments, 8 bytes per argument, so 8 arguments in total (1 + 8 × 8 = 65).

[CommInterface]
pio_addr = 0x2f000000
pio_size = 65

Constructing the System

We are now going to leverage and modify the example scripts for gem5's full system simulation. In gem5-config/fs_vector_input.py we have a modified version of the script located in gem5's default configs folder. The main difference in our configuration is that there are two additional parameters.

Adding accelerators to cluster - HWAcc.py

  • Line 242: HWAcc.makeHWAcc(options, test_sys). Here we invoke our own function, which adds the accelerator hardware defined in gem5-config/HWAcc.py to the overall system configuration.

In order to simplify the organization of accelerator-related resources, we define an accelerator cluster. This cluster contains any resources shared between the accelerators as well as the accelerators themselves. It has several functions associated with it that help with attaching accelerators to it and with hooking the cluster into the system.

The _attach_bridges function (line 19) connects the accelerator cluster into the larger system, and connects the memory bus to the cluster. This gives devices outside the cluster master access to cluster resources.

system.acctest._attach_bridges(system, local_range, external_range)

We then invoke the _connect_caches function (line 20) to connect any cache hierarchy that exists between the cluster and the memory bus or the l2xbar of the CPU, depending on the design. This gives the accelerator cluster master access to resources outside of itself. It also establishes coherency between the cluster and other resources via the caches. If no caches are needed, this merely attaches the cluster to the memory bus without a cache.

    system.acctest._connect_caches(system, options, l2coherent=True, cache_size = "32kB")

These functions are defined in gem5-SALAM/src/hwacc/AccCluster.py

Define communication

  • Add DMA

The DMA control address defined here has to match common/dma.h. The default base address is 0x2ff00000, and the DMA control registers occupy the pio_size bytes starting from that address.

system.acctest.dma = NoncoherentDma(pio_addr=0x2ff00000, pio_size=24, gic=system.realview.gic, max_pending=32, int_num=95)
system.acctest._connect_cluster_dma(system, system.acctest.dma)

Next, we create a CommInterface (line 30), which is the communications portion of our Top accelerator. We then configure Top and generate its LLVM interface by passing the CommInterface, a config file, and an IR file to AccConfig (line 31). This generates the LLVM interface, configures any hardware limitations, and establishes the static Control and Dataflow Graph (CDFG).

We then connect the accelerator to the cluster (Line 32). This will attach the PIO port of the accelerator to the cluster’s local bus that is associated with MMRs.

Hardware

For our Hardware, we follow the same steps.

  • Create a CommInterface
  • Configure it using AccConfig
  • Attach it to the accelerator cluster

This can be seen in

# gem5-config/HWAcc.py
# Add the benchmark function
 acc_bench = options.accpath + "/" + options.accbench + "/bench/" + options.accbench + ".ll"

    # Specify the path to the config file for an accelerator
    # acc_config = <Absolute path to the config file>
    acc_config = options.accpath + "/" + options.accbench + "/config.ini"
......
.......
 # Add an accelerator attribute to the cluster
    setattr(system.acctest, options.accbench, CommInterface(devicename=options.accbench))
    ACC = getattr(system.acctest,options.accbench)
    AccConfig(ACC, acc_config, acc_bench)

    # Add an SPM attribute to the cluster
    setattr(system.acctest, options.accbench+"_spm", ScratchpadMemory())
    ACC_SPM = getattr(system.acctest,options.accbench + "_spm")
    AccSPMConfig(ACC, ACC_SPM, acc_config)
    system.acctest._connect_spm(ACC_SPM)

    # Connect the accelerator to the system's interrupt controller
    ACC.gic = system.realview.gic

    # Connect HWAcc to cluster buses
    system.acctest._connect_hwacc(ACC)
    ACC.local = system.acctest.local_bus.slave
    ACC.acp = system.acctest.coherency_bus.slave

Because we want our hardware accelerator to be managed by the host, we connect its PIO directly to the coherent crossbar.

We then define a scratchpad memory and configure it using AccSPMConfig, which points to our accelerator’s config file (Line 42).

Lastly, we connect the scratchpad memory to the cluster (line 43); this allows all accelerators in the cluster to access it.

Lines 46-63 configure different buffer sizes for the DMA. These are optional, but are presented to demonstrate how you can impose additional limitations on the DMA to control how data is transferred.

Run and Stats

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM

cd $REPO/benchmarks/vector_dma
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/
# On 227 
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma --caches --l2cache
# In short
$ ./runvector.sh -b vector_dma 
# This will create a BM_ARM_OUT/vector_dma (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator




# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_dma -p"



Do you understand what these stats are?

  • Runtime: cycles × system clock period (wall-clock time). This covers only the accelerator and does not include the host DMA transfers.
  • Stalls
  • Accelerator Power
  • Leakage/Dynamic Power
cat BM_ARM_OUT/vector_dma/debug-trace.txt


Cycle Counts Loaded!
**** REAL SIMULATION ****
1238719500: system.acctest.dma: SRC:0x0000000080c00000, DST:0x000000002f100000, LEN:64
1238849500: system.acctest.dma: Transfer completed in 0.13 us
1242711500: system.acctest.dma: SRC:0x0000000080c00040, DST:0x000000002f100040, LEN:64
1242831500: system.acctest.dma: Transfer completed in 0.12 us
1
********************************************************************************
system.acctest.vector_dma.compute
   ========= Performance Analysis =============
   Setup Time:                      0.000363995seconds
   Simulation Time:                 0.00181577seconds
   System Clock:                    0.1GHz
   Transistor Latency:              10ns
   Runtime:                         23 cycles .
   Runtime:                         0.23 us
   Stalls:                          10 cycles
   Executed Nodes:                  12 cycles

********************************************************************************
   ========= Performance Analysis =================
   Setup Time:                      363995ns
   Simulation Time:                 1.81577e+06ns
   System Clock:                    0.1GHz
   Transistor Latency:              102ns
   Runtime:                         23 cycles
   Runtime:                         2.3e-09 seconds
   Stalls:                          10 cycles
       Load Only:                   0 cycles
       Store Only:                  1 cycles
       Compute Only:                0 cycles
       Compute & Store:             2 cycles
       Load & Store:                0 cycles
       Load & Compute:              1 cycles
       Load & Compute & Store:      6 cycles
   Executed Nodes:                  12 cycles
       Load Only:                   0 cycles
       Store Only:                  1 cycles
       Compute Only:                0 cycles
       Compute & Store:             2 cycles
       Load & Store:                0 cycles
       Load & Compute:              3 cycles
       Load & Compute & Store:      6 cycles

   ========= Runtime FU's ========= (Max | Avg) ===
   Counter FU's:                       1 | 0.347826
   Integer Add/Sub FU's:               3 | 0.260870
   Integer Mul/Div FU's:               0 | 0.000000
   Integer Shifter FU's:               4 | 0.206522
   Integer Bitwise FU's:               1 | 0.347826
   Floating Point Float Add/Sub:       0 | 0.000000
   Floating Point Double Add/Sub:      0 | 0.000000
   Floating Point Float Mul/Div:       0 | 0.000000
   Floating Point Double Mul/Div:      0 | 0.000000
   0 Cycle Compare FU's:               3 | 0.347826
   GEP Instruction FU's:               6 | 0.347826
   Type Conversion FU's:               0 | 0.000000

   ========= Static FU's ==========================
   Counter FU's:                    0
   Integer Add/Sub FU's:            0
   Integer Mul/Div FU's:            0
   Integer Shifter FU's:            0
   Integer Bitwise FU's:            0
   Floating Point Float Add/Sub:    0
   Floating Point Double Add/Sub:   0
   Floating Point Float Mul/Div:    0
   Floating Point Double Mul/Div:   0
   0 Cycle Compare FU's:            0
   GEP Instruction FU's:            0
   Type Conversion FU's:            0
   Other:                           0

   ========= Pipeline Register Usage =============
   Total Number of Registers:       23
   Max Register Usage Per Cycle:    14
   Avg Register Usage Per Cycle:    6.260870
   Avg Register Size (Bytes):       5.833333

   ========= Memory Configuration =================
   Cache Bus Ports:                 117
   Shared Cache Size:               0kB
   Local Bus Ports:                 46
   Private SPM Size:                0kB
   Private Read Ports:              0
   Private Write Ports:             0
   Private Read Bus Width:          0
   Private Write Bus Width:         0
       Memory Reads:                0
       Memory Writes:               0
   ========= Power Analysis ======================
   FU Leakage Power:                0.014534 mW
   FU Dynamic Power:                0.000000 mW
   FU Total Power:                  0.014534 mW

   Registers Leakage Power:          0.002576 mW
   Registers Dynamic Power:          0.000000 mW
       Register Reads (Bits):        144
       Register Writes (Bits):       144
   Registers Total Power:            0.002576 mW

   SPM Leakage Power:               0.000000 mW
   SPM Read Dynamic Power:          0.000000 mW
   SPM Write Dynamic Power:         0.000000 mW
   SPM Total Power:                 0.000000 mW

   Cache Leakage Power:             0.000000 mW
   Cache Read Dynamic Power:        0.000000 mW
   Cache Write Dynamic Power:       0.000000 mW
   Cache Total Power:               0.000000 mW

   Accelerator Power:               0.017110 mW
   Accelerator Power (SPM):         0.017110 mW
   Accelerator Power (Cache):       0.017110 mW

   ========= Area Analysis =======================
   FU Area:                         1048.954346 um^2 (0.001049 mm^2)
   Register Area:                   209.350159 um^2 (0.000209 mm^2)
   SPM Area:                        0.000000 um^2 (0.000000 mm^2)
   Cache Area:                      0.000000 um^2 (0.000000 mm^2)

   Accelerator Area:                1258.304443 um^2 (0.001258 mm^2)
   Accelerator Area (SPM):          1258.304443 um^2 (0.001258 mm^2)
   Accelerator Area (Cache):        1258.304443 um^2 (0.001258 mm^2)

   ========= SPM Resizing  =======================
   SPM Optimized Leakage Power:     0.000000 mW
   SPM Opt Area:                    0.000000 um^2

1354939000: system.acctest.dma: SRC:0x000000002f100080, DST:0x0000000080c00080, LEN:64
1354989000: system.acctest.dma: Transfer completed in 0.05 us
Exiting @ tick 1368457000 because m5_exit instruction encountered

TODOs

  • Change the unroll count in benchmarks/vector_dma/hw/vector_dma.c from 1 to 16 and see what happens to the runtime cycles each time. Also look at the stats for Total Number of Registers, Max Register Usage Per Cycle, Runtime, the Runtime FU's, and the Power Analysis.

WARNING: Remember you have to rebuild .ll and main.elf each time

  • Change ports in benchmarks/vector_dma/hw/config.ini under [Memory] from 1 to 8 and see what happens to the cycle count. Why does setting the number of ports to 1 increase the stalls? To understand this, follow the steps under "Comments on trace" below.
  • Change the host CPU type in runvector.sh to MinorCPU and see the difference in overall simulation time.
  • Set FLAGS="HWACC,LLVMRuntime" in run-vector.sh. Re-run and check debug-trace.txt. Try to comprehend what the trace says; it includes the step-by-step execution of the hardware. Disable the flags for assignments, otherwise the traces will consume too much space.
  • Try to draw the dataflow graph by hand.

Comments on trace

  • Open BM_ARM_OUT/vector_dma/debug-trace.txt and look for lines of the form Trying to read addr: 0x000000002f100004, 4 bytes through port:
  • Check how many such reads occur in tick 1271825000.
  • When changing ports to 1, check how many reads occur in tick 1271935000.

  • The lines below indicate that the LLVM file was loaded and the runtime was initialized
  • MMR here refers to the start/stop flag
1476840000: system.acctest.vector_dma: Checking MMR to see if Run bit set
1476840000: system.acctest.vector_dma.compute: Initializing LLVM Runtime Engine!
1476840000: system.acctest.vector_dma.compute: Constructing Static Dependency Graph
1476840000: system.acctest.vector_dma.compute: Parsing: (/data/src/gem5-lab2/benchmarks/vector_dma/hw/vector_dma.ll)
  • Read from 0x2f10000c indicates a read from that address. Depending on the address range, this refers either to the scratchpad or to global memory.

  • Check the computation operations. Open the $REPO/benchmarks/vector_dma/hw/vector_dma.ll file and identify these instructions.

1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): Performing shl Operation
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): 2 << 2
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): shl Complete. Result = 8
1476910000: system.acctest.vector_dma.compute.i(  %7 = shl i32 %6, 2): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): Performing add Operation (13)
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): 2 + 2
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): add Complete. Result = 4
1476910000: system.acctest.vector_dma.compute.i(  %13 = add i32 %12, %10): Operation Will Commit in 1 Cycle(s)
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): Performing add Operation (6)
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): 3 + 3
1476910000: system.acctest.vector_dma.compute.i(  %6 = add i32 %5, %3): add Complete. Result = 6
  • Dataflow graph visualizer. See if you can spot the difference between the parallel and serial versions.
module unload llvm-38
module load llvm-10
clang --version
# Should be 10.
cd $REPO/vector_dma/hw/
# Dataflow graph without optimization (serial version)
clang -emit-llvm -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.serial.pdf

# Dataflow graph with -O3 (parallel version)
clang -emit-llvm -O3 -S vector_dma.c -o vector_dma-10.ll
opt -load /data/PDG/build/libpdg.so --dot-pdg --dot-only-ddg vector_dma-10.ll
dot -Tpdf pdgragh.vadd.dot -o pdgragh.vadd.parallel.pdf

Model 1.5 : Batched DMA.

benchmarks/vector_dma_2x

In Model 1, we moved all the data we need into the scratchpad and then kickstarted the computation. However, scratchpads are finite and accelerators can only work with data in the scratchpad, so we may need to restrict the size of the accelerator and process the data in multiple batches. In this example we restrict the accelerator to process only 8 elements at a time, but we have 16 elements in the array, so we have to process the data in 2 batches. The modifications are in the managing host code.
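A minimal end-to-end sketch of the two-batch flow is shown first. It assumes the dmacpy/pollDma/resetDma helpers, the acc/val_a/val_b/val_c macros from defines.h, and the m1/m2/m3 and spm1/spm2/spm3/spm_base pointers from the earlier host code, and it folds the copy-back of each batch's result into the loop. The repository's actual modifications follow below.

// Sketch: process 16 elements as two batches of N = 8 on a fixed-size scratchpad.
for (int batch = 0; batch < 2; batch++) {
  // Stage this batch's inputs into the scratchpad.
  dmacpy(spm1, m1 + batch * N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
  dmacpy(spm2, m2 + batch * N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();

  // Point the accelerator at the scratchpad slots and kickstart it.
  val_a = (uint64_t)spm_base;
  val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
  val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);
  acc = 0x01;
  while (acc != 0x0);   // the ISR clears the status byte on completion

  // Copy this batch's result back to DRAM.
  dmacpy(m3 + batch * N, spm3, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
}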

// Modified config.ini to set scratchpad size
// Modified defines.h
#define N 8
// The accelerator datapath will work on 8 elements at a time
// Modified top.c to process 16 elements as two batches.
// Batch 0 DMAs elements 0-7 to the scratchpad
  dmacpy(spm1, m1, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
  dmacpy(spm2, m2, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
// Invoke accelerator on scratchpad address range
val_a = (uint64_t)spm_base;
  val_b = (uint64_t)(spm_base + sizeof(TYPE) * N);
  val_c = (uint64_t)(spm_base + 2 * sizeof(TYPE) * N);



// Batch 1 DMAs elements 8-15
  dmacpy(spm1, m1 + N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();
  dmacpy(spm2, m2 + N, sizeof(TYPE) * N);
  while (!pollDma());
  resetDma();

// Notice that in both cases we pass the scratchpad base address for the accelerator
// datapath to work on. This is redundant, and we could instead hardcode it into the
// accelerator datapath in hw/vector_dma_2x.c.

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM

cd $REPO/benchmarks/vector_dma_2x
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/Modules/3.2.10/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
#
make clean; make
# This should create a .ll file in your hw/
# and main.elf file in host/
# Full command (on 227)
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=NoncoherentDma --outdir=BM_ARM_OUT/vector_dma_2x gem5-config/run_vector.py --mem-size=4GB --kernel=/data/src/gem5-lab2.0/benchmarks/vector_dma_2x/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2.0/benchmarks --accbench=vector_dma_2x --caches --l2cache

# In short
$ ./runvector.sh -b vector_dma_2x 
# This will create a BM_ARM_OUT/vector_dma_2x (this is your m5_out folder)
# The debug-trace.txt will contain stats for your accelerator

TODO

  • Change N to 4 and perform computation in 4 batches.

Model 2: Cache

The cache model hooks up the accelerator to global memory through a coherent crossbar. It is OK not to be familiar with the coherent crossbar when reading this document; you only need to understand that with coherence available the accelerators can directly reference the DRAM space mapped to the CPU.

To enable the accelerator cache:

  • First, change line 49 to CACHE_OPTS="--caches --l2cache --acc_cache"
  • The cache size can be changed in the clstr._connect_caches(system, options, l2coherent=True, cache_size = "32kB") call in gem5-config/HWAcc.py
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_input.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache
# In short
$ ./runvector.sh -b vector_cache -p

The image below compares the system organization with an accelerator cache and without.

The primary difference between the cache and DMA versions is in the host code.

Host code for cache model

The pointers passed to the accelerator point to the global memory space (base). The load and store operations directly touch these locations and access them through the coherence cross bar.

// benchmarks/vector_cache/host/main.cpp
  uint64_t base = 0x80c00000;
  uint64_t spm_base = 0x2f100000;
  val_a = (uint64_t)base;
  val_b = (uint64_t)(base + sizeof(TYPE) * N);
  val_c = (uint64_t)(base + 2 * sizeof(TYPE) * N);

If you don't set these cache options correctly, your application may misbehave.
cd $REPO/benchmarks/vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make


# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/vector_cache gem5-config/fs_vector_input.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=vector_cache --caches --l2cache --acc_cache
# In short. 
$ ./runvector.sh -b vector_cache -p



# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector.sh -b vector_cache -p"



TODOs

    1. Try out a larger vector. You need to modify the host code to initialize bigger data, a 32-element array with the numbers 0x1 to 0x20.
      • Directly initialize the data in host code starting from the base address. Note that you do not have allocators or initializers in bare metal; you will need to set up the pointers starting from base.
      • Modify arg1, arg2 and arg3 to point to the new base addresses.

    Also modify the parameter N in defines.h.

    2. Modify inputs/m0.bin and m1.bin to include the additional 16 numbers; use hexedit. Check the lines below to ensure m0 and m1 are loaded in the appropriate place. Note that the numbers are in little-endian format and in hex (e.g., 16 = 0x10). If m0, a 32-int array, starts at 0x80c00000, at what address does m1 start?
  test_sys.kernel = binary(options.kernel)
        test_sys.kernel_extras = [os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin",os.environ["LAB_PATH"]+"/benchmarks/inputs/m1.bin"]
        test_sys.kernel_extras_addrs = [0x80c00000,0x80c00000+os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin")]
        print("Loading file m0 at" + str(hex(0x80c00000)))
        print("Loading file m1 at" + str(hex(0x80c00000 + os.path.getsize(os.environ["LAB_PATH"]+"/benchmarks/inputs/m0.bin"))))

Model 3: Multi-Accelerator with Top manager

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM

cd $REPO/benchmarks/multi_vector
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make


# In short on 227. 
$ ./runmulti.sh -b multi_vector -p

# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/multi_vector gem5-config/run_multi.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/multi_vector/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=multi-cache --caches --l2cache --acc_cache


# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runmulti.sh -b multi-cache -p"

For larger applications we may need to include multiple accelerators in the cluster. For this, we include a top accelerator to coordinate the other accelerators. The figure below shows the system model. The top accelerator now offloads the DMA and accelerator kickstart logic from the host; it also initiates the DMA movement between the accelerators. The host simply passes the address pointers for the inputs and the output region. There are two worker accelerators, vector and vector2.

// host/main.cpp
volatile uint8_t  * top   = (uint8_t  *)0x2f000000;
volatile uint32_t * val_a = (uint32_t *)0x2f000001;
volatile uint32_t * val_b = (uint32_t *)0x2f000009;
volatile uint32_t * val_c = (uint32_t *)0x2f000011;

int main(void) {

  // Pointers in DRAM. m1 and m2 are inputs.
  // m3 is the output
    uint32_t base = 0x80c00000;
    TYPE *m1 = (TYPE *)base;
    TYPE *m2 = (TYPE *)(base + sizeof(TYPE) * N);
    TYPE *m3 = (TYPE *)(base + 2 * sizeof(TYPE) * N);

  // MMRegs of the top accelerator.
   // Argument 1 to top
    *val_a = (uint32_t)(void *)m1;
   // Argument 2 to top
    *val_b = (uint32_t)(void *)m2;
   // Argument 3 to top
    *val_c = (uint32_t)(void *)m3;
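  // Sketch of how main() can continue (the authoritative code is in host/main.cpp):
  // kickstart the top coordinator and wait for it to signal completion, much like
  // the single-accelerator DMA model. The exact completion check used by the
  // repository may differ.
  *top = 0x01;
  while (*top != 0x0);   // assumed: cleared once top finishes

  return 0;
}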
   
hw/source/top.c         Code for the top accelerator coordinator; this is itself an accelerator
hw/configs/top.ini      Configuration for the top accelerator
hw/source/vector.c, hw/configs/vector.ini     Code and configuration for the first vector accelerator stage
hw/source/vector2.c, hw/configs/vector2.ini   Code and configuration for the second vector accelerator stage
hw/ir                   LLVM .ll files produced after the compiler generates the dataflow graph

Memory map

Start address Description
0x2f000000 Memory mapped args for top (defined in top.ini)
0x2f0000F0 Memory mapped args for vector (defined in vector.ini)
0x2f000100 Memory mapped args for vector2 (defined in vector2.ini)
0x2f100000 Scratchpad for vector
0x2f200000 scratchpad for vector2
  // Accelerator 1: vector.c
    for(i=0;i<N;i++) {
            tmp_m3[i]  = (m1[i] + m2[i]);

    }
  // Accelerator 2: vector2.c
    for(i=0;i<N;i++) {
            m3[i]  = tmp_m3[i] * 8;

    }

The top accelerator manages the other accelerators itself.

  • Step 0: Obtain DMA control reg address
// hw/source/top.c
  volatile uint8_t *DmaFlags = (uint8_t *)(DMA);
  volatile uint64_t *DmaRdAddr = (uint64_t *)(DMA + 1);
  volatile uint64_t *DmaWrAddr = (uint64_t *)(DMA + 9);
  volatile uint32_t *DmaCopyLen = (uint32_t *)(DMA + 17);

Step 1: DMA DRAM->Scratchpad

Transfer data from DRAM to Scratchpad S1 of Accelerator Vector.

// hw/source/top.c
// Global memory address
*DmaRdAddr = m1_addr;
// Scratchpad address of vector. Defined in hw_defines.h
// 0x2f100000. This is for input 1
*DmaWrAddr = M1ADDR;
// Number of bytes
*DmaCopyLen = vector_size;
// Copy bytes
*DmaFlags = DEV_INIT;
// Poll DMA for finish
while ((*DmaFlags & DEV_INTR) != DEV_INTR);

// Transfer M2 to scratchpad now.
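// Sketch of the second transfer (M2), mirroring the M1 transfer above.
// M2ADDR is a hypothetical name for the second scratchpad slot
// (0x2f100040 in the layout below; the real define is in hw_defines.h),
// and m2_addr stands for the second input's DRAM address passed to top.
*DmaRdAddr = m2_addr;
*DmaWrAddr = M2ADDR;
*DmaCopyLen = vector_size;
*DmaFlags = DEV_INIT;
while ((*DmaFlags & DEV_INTR) != DEV_INTR);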

Scratchpad memory is laid out in the following manner:

Region   Start address   Size
M1       0x2f100000      N*sizeof(int) bytes
M2       0x2f100040      N*sizeof(int) bytes
M3       0x2f100080      N*sizeof(int) bytes

Step 2: Start accelerator V1.

Set up arguments if required. The accelerator can only work with data in the scratchpad or in local registers; these are fixed memory ranges. In this case, the V1 vector accelerator does not require any additional arguments. To start an accelerator from top, it is important to follow the steps below (in particular, checking whether the accelerator is ready for kickstart) after the arguments are set up.

// Write to argument MMR of V1 accelerator

// First, check if accelerator ready for kickstarting
while (*V1Flags != 0x0);

// Start the accelerated function
*V1Flags = DEV_INIT;
  
// Poll function for finish
while ((*V1Flags & DEV_INTR) != DEV_INTR);

// Reset accelerator for next time.
*V1Flags = 0x0;

Step 3: DMA accelerator V1 -> V2.

The output of accelerator V1 is the input of V2. We need to copy N*4 bytes from 0x2f100080 to 0x2f200000.

  // Transfer the output of V1 to V2.
  *DmaRdAddr = M3ADDR;
  *DmaWrAddr = M1ADDR_V2;
  *DmaCopyLen = vector_size;
  *DmaFlags = DEV_INIT;
  // //Poll DMA for finish
  while ((*DmaFlags & DEV_INTR) != DEV_INTR)
    ;

Step 4: Kickstart accelerator V2

// Write to argument MMR of V2 accelerator

// First, check if accelerator ready for kickstarting
while (*V2Flags != 0x0);

// Start the accelerated function
*V2Flags = DEV_INIT;
  
// Poll function for finish
while ((*V2Flags & DEV_INTR) != DEV_INTR);

// Reset accelerator for next time.
*V2Flags = 0x0;

Step 5: Copy data from V2 to CPU space

  // Transfer M3
  // Scratchpad addresss
  *DmaRdAddr = M3ADDR_V2;
  // Global address the host wants the final result in
  *DmaWrAddr = m3_addr;
  // Number of bytes
  *DmaCopyLen = vector_size;
  // Start DMA
  *DmaFlags = DEV_INIT;
  // Poll DMA for finish
  while ((*DmaFlags & DEV_INTR) != DEV_INTR)
    ;

TODO

  • The DMA between accelerators V1 and V2 is high overhead. Ask yourself why. Merge V2 into V1 and modify top. Compare cycles. Why did the cycle count not reduce as much as you expected? What about the power of the merged accelerator?

Model 3.5 Multi accelerator with accelerator cache.

We now create a multi-accelerator system with a shared cache. We do not need to explicitly transfer data between accelerators: all data is implicitly transferred between them through the shared cluster cache. The top accelerator only has to set up the appropriate arguments and invoke the accelerators in sequence. Each accelerator reads from and writes back to the global memory space, and the cluster cache captures the locality.

cd $REPO
export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM

cd $REPO/benchmarks/multi_vector_cache
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make


# In short on 227. 
$ ./runmulti.sh -b multi_vector_cache -p

# Run on 227 machine
/data/src/750-SALAM/build/ARM/gem5.opt --debug-flags=HWACC --outdir=BM_ARM_OUT/multi_vector_cache gem5-config/run_multi.py --mem-size=4GB --kernel=/data/src/gem5-lab-acc/benchmarks/multi_vector_cache/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab-acc/benchmarks --accbench=multi-cache --caches --l2cache --acc_cache


# Run on 236 machine
# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runmulti.sh -b multi-cache -p"

Step: Activate accelerators V1 and V2

// benchmarks/multi_vector_cache/hw/source/top.c
// Pass on host address arguments to accelerator
  *V1Arg1 = m1_addr;
  *V1Arg2 = m2_addr;
  *V1Arg3 = m3_addr;
// Start V1
  *V1Flag = DEV_INIT;
  // Poll function for finish
  while ((*V1Flag & DEV_INTR) != DEV_INTR);
  *V1Flag = 0x0;

// Start V2
  *V2Flag = DEV_INIT;
  while ((*V2Flag & DEV_INTR) != DEV_INTR);
  *V2Flag = 0x0;

TODO

  • Merge V2 into V1 and modify top. Compare cycles. What about the number of accesses to the cluster cache ?

Model 4: Streaming Pipelines

Streaming pipelines introduce a FIFO interface to the memory system. If you look at the datapath in vector_dma/hw/vector_dma.c, you notice that the memory access pattern is highly regular, with no ordering requirements between the elements of the array. We simply sequence through the elements of the vector, applying an operation to each location. This can be concisely described as a stream of values: a stream simply provides a FIFO interface to the data.

Memory map. The term MMR refers to the memory-mapped registers and flags used to control the DMA engines and the accelerators.

0x2F000000   TOP MMR
0x2F000100   S1 MMR
0x2F000200   S2 MMR
0x2F000300   S3 MMR
0x2fe00000   StreamDMA MMR
0x2ff00000   Noncoherent DMA MMR
0x2F001000   Stream DMA FIFO port (DRAM->S1, S3->DRAM)
0x2F003000   S1->S2 FIFO port
0x2F004000   S2->S3 FIFO port

Step 1: Stream DMA

This streams data from DRAM in chunks of stream_size (bits). The figure illustrates a stream DMA. We need to create a new configuration and modify top to initiate the stream. The stream DMA includes a control PIO region (similar to the other accelerators) that top writes to in order to control where in DRAM the data is streamed from. The out port of the StreamDMA engine is wired up to the stream port of one of the accelerators. Each stream is a single-input, single-output FIFO. Each accelerator has a .stream interface into which all the required streams are wired. In this case, i) we read from DRAM and send the data to accelerator S1, and ii) we read data from accelerator S3 and write it back through the stream DMA.

  • Each of the accelerators uses the memory-mapped address 0x2f001000 to read/write the stream.
  • Each access reads stream_size (here 8) bits worth of data from the port.
  • In total the FIFO will supply StrDmaRdFrameSize bytes of data in chunks of stream_size. The total number of dataflow tokens generated will be $\frac{RdFrameSize*8}{stream_size}$; for example, a 64-byte read frame with an 8-bit stream_size produces 64 tokens.
    # Configuration in gem5-config/vector_stream.py
    # Control address for setting up stream
    addr = 0x2fe00000
    clstr.stream_dma0 = StreamDma(pio_addr=addr, pio_size=32, gic=gic, max_pending=32)
    # Address for reading/writing to stream from accelerator
    clstr.stream_dma0.stream_addr= local_low + 0x1000
    # Number of bits per FIFO access.
    clstr.stream_dma0.stream_size=8
    clstr.stream_dma0.pio_delay='1ns'
    clstr.stream_dma0.rd_int = 210
    clstr.stream_dma0.wr_int = 211
    clstr._connect_dma(system, clstr.stream_dma0)

    # DRAM->Accelerator S1
    clstr.S1.stream = clstr.stream_dma0.stream_out

    # Accelerator S3->DRAM
    clstr.S3.stream = clstr.stream_dma0.stream_in
// vector_stream/hw/source/top.c
// Define Stream control config
  volatile uint8_t *StrDmaFlags = (uint8_t *)(STREAM_DMA_MMR);
  volatile uint64_t *StrDmaRdAddr = (uint64_t *)(STREAM_DMA_MMR + 4);
  volatile uint64_t *StrDmaWrAddr = (uint64_t *)(STREAM_DMA_MMR + 12);
  volatile uint32_t *StrDmaRdFrameSize = (uint32_t *)(STREAM_DMA_MMR + 20);
  volatile uint8_t *StrDmaNumRdFrames = (uint8_t *)(STREAM_DMA_MMR + 24);
  volatile uint8_t *StrDmaRdFrameBuffSize = (uint8_t *)(STREAM_DMA_MMR + 25);
  volatile uint32_t *StrDmaWrFrameSize = (uint32_t *)(STREAM_DMA_MMR + 26);
  volatile uint8_t *StrDmaNumWrFrames = (uint8_t *)(STREAM_DMA_MMR + 30);
  volatile uint8_t *StrDmaWrFrameBuffSize = (uint8_t *)(STREAM_DMA_MMR + 31);

// Initiate Stream from DRAM to FIFO port
  *StrDmaRdAddr = in_addr;
  *StrDmaRdFrameSize = INPUT_SIZE; // Specifies number of bytes
  *StrDmaNumRdFrames = 1;
  *StrDmaRdFrameBuffSize = 1;
// Start Stream
  *StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;

Step 2: Stream buffers

Stream buffers establish FIFO ports directly between accelerators. They do not need to be set up at runtime: the configuration is fixed, and the accelerators simply read from and write to the address that fronts the port. For example, here we set up a stream buffer between the two accelerators (S1 and S2 in the configuration below); each accelerator uses the address to read from or write to the FIFO. A stream buffer supports only a single input and a single output port.

  ┌──────────────────────┐                             ┌───────────────┐
  │    Accelerator V1    │      ┌─────────────┐        │  Accelerator  │
  │                      ├─────►│ FIFO Buffer ├────────►     V2        │
  └──────────────────────┘      └─────────────┘        └───────────────┘
# Address accelerator v1 and v2 can read and write to.
addr = local_low + 0x3000
clstr.S1Out = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
# # of bits read on each access
clstr.S1Out.stream_size = 8
# Input to the buffer from accelerator S1
clstr.S1.stream = clstr.S1Out.stream_in
# Output of buffer sent to accelerator S2.
clstr.S2.stream = clstr.S1Out.stream_out

Each stream buffer supports only a single input port and a single output port. However, multiple stream buffers can be wired to a single accelerator, so each accelerator can have multiple stream-buffer ports.

                       ┌───────────────┐
┌─────────────┐        │  Accelerator  │
│ 0x2f003000  ├────┬───►               │
└─────────────┘    ├───►      V2       │
                   │   └───────────────┘
┌─────────────┐    │
│ 0x2f004000  ├────┘
└─────────────┘
# Address accelerator v1 and v2 can read and write to access FIFO.
addr = local_low + 0x3000
clstr.B1 = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
addr = local_low + 0x4000
clstr.B2 = StreamBuffer(stream_address=addr, stream_size=1, buffer_size=8)
clstr.S2.stream = clstr.B1.stream_out
clstr.S2.stream = clstr.B2.stream_out
cd $REPO

export LAB_PATH=$PWD
# Use prebuilt gem5
export M5_PATH=/data/src/750-SALAM

cd $REPO/benchmarks/vector_stream
# Load the LLVM and clang compilers in your path
# Load modules
source /data/.local/modules/init/zsh
export LD_LIBRARY_PATH=/data/.local/Tcl/lib
module load llvm-38
# Build datapath and host binary
make clean; make


# In short on 227. 
$ ./runvector_stream.sh  -p

# Full command
/data/src/750-SALAM/build/ARM/gem5.opt --outdir=BM_ARM_OUT/vector_stream gem5-config/run_vector_stream.py --mem-size=4GB --kernel=/data/src/gem5-lab2/benchmarks/vector_stream/host/main.elf --disk-image=/data/src/750-SALAM/baremetal/common/fake.iso --machine-type=VExpress_GEM5_V1 --dtb-file=none --bare-metal --cpu-type=DerivO3CPU --accpath=/data/src/gem5-lab2/benchmarks --accbench=vector_stream --caches --l2cache --acc_cache

# Running inside docker if on 236,
docker run -v /data:/data/ -v $HOME/gem5-lab2:/repo --env M5_PATH=/data/src/750-SALAM --env LAB_PATH=/repo --user $(id -u):$(id -g) -it gem5-salam:latest bash -c "cd /repo; /repo/runvector_stream.sh -p"

Step 3: Top

The purpose of top is to kickstart the stream DMA from memory and start the accelerators. Completion is detected by checking whether the output stream is complete. The overall execution is data-driven: when the output FIFO empties out, top treats the stream as complete and signals the CPU.

// Start Stream DMAs
  *StrDmaFlags = STR_DMA_INIT_RD | STR_DMA_INIT_WR;

  // Start all accelerators
  // Start S1
  *S1 = 0x01;
  // Start S2
  *S2 = 0x01;
  // Start S3
  *S3 = 0x01;

// Wait for all accelerators to finish before sending interrupt to CPU
while ((*StrDmaFlags & 0x08) == 0x08);

Step 4: Accelerator stages S1-S3

As each accelerator fills its stream buffer ports, it automatically triggers operations in the neighboring accelerators in a dataflow manner. Each accelerator has to know how many tokens will be generated and has to read its stream buffer port. The S1 stage writes to the FIFO stream buffer between S1 and S2, using the appropriate memory-mapped stream buffer port.

// hw/source/hw_defines.h
#define BASE			0x2F000000
#define StreamIn  BASE + 0x1000
#define S1Out			BASE + 0x3000

// hw/source/S1.c
	volatile dType_8u * STR_IN  	= (dType_8u *)(S1In);
	volatile dType_8u * STR_OUT		= (dType_8u *)(S1Out);
  ......
	
	for (dType_Reg i = 0; i < INPUT_SIZE; i++) {
			*STR_OUT = (*STR_IN) + BUFFER[i];
        }
}

Complete configuration

TODO

  • Add another stage S4 to the streaming accelerators. You will need to add hw/source/S4.c and hw/configs/S4.ini. Modify top.c to define the memory map for the MMR and stream ports. Modify hw/gem5-config/vector_stream.py. You will need to make S4 the final stage that writes to the stream DMA, and you will have to define a new stream buffer that connects S3 and S4. You may also need to make Makefile modifications; figure it out.

Acknowledgement

This document was put together by your CMPT 750/450 instructors.