Errata: Unrolling Loops

  • Use the llvm-11 compiler instead of llvm-10:
 source /data/.local/modules/init/zsh
 module load llvm-11 # (instead of llvm-10)
  • With llvm-11 you can use #pragma unroll() to unroll the inner loop as much as you desire.
  • However, the outer loop has to be unrolled manually to achieve complete unrolling.
  • To achieve best results, get rid of the middle loop (since it is always a single iteration), e.g., convert the gemm loop from

    for (i = ...; i++)
      for (j = ...; j++)
        for (k = ...; k++)

    to

    for (i = 0; ...; i = i + 2)
      // i = 0 copy of the body
      #pragma unroll
      for (k = ...; k++)
      // i = 1 copy of the body
      #pragma unroll
      for (k = ...; k++)
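As a concrete illustration, here is a minimal sketch of this structure (the function name, the dimensions N and M, and the assumption that N is even are all illustrative, not the project's actual code):

// Sketch only: outer (i) loop manually unrolled by 2, middle (j) loop removed
// because p == 1, inner (k) loop left to the compiler via #pragma unroll.
// a is N x M (row-major), b is M x 1, c is N x 1; N is assumed even.
void gemm_unrolled(const int *a, const int *b, int *c, int N, int M) {
  for (int i = 0; i < N; i += 2) {
    // i + 0 copy of the body
    int acc0 = 0;
    #pragma unroll
    for (int k = 0; k < M; k++)
      acc0 += a[(i + 0) * M + k] * b[k];
    c[i + 0] = acc0;

    // i + 1 copy of the body
    int acc1 = 0;
    #pragma unroll
    for (int k = 0; k < M; k++)
      acc1 += a[(i + 1) * M + k] * b[k];
    c[i + 1] = acc1;
  }
}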

Unrolling example

Github clone

Goals

  • The purpose of this project is to have you implement simple, yet extremely useful, hardware accelerators.
  • You will learn to use registers efficiently, write functions, and use accelerator conventions for calling your functions.
  • You will learn how to define datapaths, optimize them and connect them all up to create an SoC.
  • You will learn how to move data between different accelerators.

This can be done in groups of up to 2 students, for both graduates and undergraduates. You can choose to complete either Task 1 or Task 2. Meet and discuss with the TA during OH and explain your design. Also cross-check with Arrvindh on Nov 23rd.

Overview

You will implement functions which operate on matrices and vectors – for example, matrix multiplication. You will then use these functions to construct a simple Artificial Neural Net (ANN), which will be able to classify handwritten digits to their actual number! You will see that ANNs can be simply implemented using basic numerical operations, such as vector inner product, matrix multiplications, and thresholding.

Note: Although the spec is quite long, please make sure to read through the whole thing as it contains a lot of important information about the functions provided, running Venus, and testing your code.

Background

At a basic level, a neural network tries to approximate a (non-linear) function that maps your input to a desired output. A basic neuron consists of a weighted linear combination of the input, followed by a non-linearity – for example, a threshold. Consider the following neuron, which implements the logical AND operation:

It is easy to see that for A=0, B=0, the linear combination is 0*0.6 + 0*0.6 = 0, which is less than the threshold of 1 and will result in a 0 output. With an input A=0, B=1 or A=1, B=0, the linear combination is 1*0.6 + 0*0.6 = 0.6, which is still less than 1 and results in a 0 output. Similarly, A=1, B=1 results in 1*0.6 + 1*0.6 = 1.2, which is greater than the threshold and results in a 1 output! What is interesting is that this simple neuron operation can also be described as an inner product between the vector [A,B]^T and the weights vector [0.6,0.6]^T, followed by a thresholding, non-linear operation.

More complex functions cannot be described by a simple neuron alone. We can extend the system into a network of neurons in order to approximate such functions. For example, the following 2-layer network approximates the logical function XOR (What is XOR?).

The above is a 2-layer network. The network takes 2 inputs, computes 2 intermediate values, and finally computes a single final output. It can be written as matrix multiplications with matrices m_0 and m_1, with thresholding operations in between, as shown below:

You are probably wondering how the weights of the network were determined. This is beyond the scope of this project; we encourage you to take a course on machine learning. We will only say that the weights can be trained by giving the network pairs of correct inputs and outputs and changing the weights such that the error between the outputs of the network and the correct outputs is minimized. Learning the weights is called "training". Using the weights on inputs is called "inference". We will only perform inference, and you will be given weights that were pre-trained by your dedicated TAs.

Handwritten Digit Classification

In this project we will implement a similar, but slightly more complex network which is able to classify handwritten digits. As inputs, we will use the MNIST data set, which is a dataset of 60,000 28x28 images containing handwritten digits ranging from 0-9. We will treat these images as “flattened” input vectors sized 784x1.

In a similar way to the example before, we will perform matrix multiplications with pre-trained weight matrices m_0 and m_1. Instead of thresholding we will use two different non-linearities: The ReLU and ArgMax functions.

  # Notation: matrix[m:n] denotes a matrix with m rows and n columns.
  # Stage 1: Inputs: m0[128:784] * mnist_input[784:1], Output: [128:1]
    hidden_layer = gemm(m0, input)
  # Stage 2: Recall that relu can be performed in place. Input: [128:1], Output: [128:1]
    relu(hidden_layer)
  # Stage 3: Input: [10:128] * [128:1], Output: [10:1]
    scores = gemm(m1, hidden_layer)
  # Stage 4: Input: [10:1], Output: [1:1] (scalar)
    result = argmax(scores)

Let's first implement MNIST in software, in Python:

cd $REPO
python3 mnist_in_python.py $PWD/inputs/mnist/bin/mnist_input0.bin $PWD/inputs/mnist/bin/m0.bin $PWD/inputs/mnist/bin/m1.bin

Check Yourself

  • Look at the python code and identify the statements you will be accelerating.
  • Have you gone over the gem5-accelerator lab?
  • Do you know the difference in system model between DMA, Cache, and Streaming?
  • Do you know how the systembuilder works? What is the output of systembuilder?
  • How is the memory space of the accelerator laid out?
  • How does the host code detect that the accelerator has run to completion?
  • How does the host code change between DMA and Cache?
  • How does the accelerator datapath change between DMA/Cache and Streaming?

Inputs, Output and Algorithm steps

  • input/ : the various input files. There are four sets of inputs: mnist, simple0, simple1, and simple2.
# e.g.,  listing of mnist
$ ls inputs/mnist
bin txt

The bin subfolder contains the binary input files that you’ll run cache/mnist/host/host.c on, while the txt subfolder contains the plaintext versions for debugging and calculating the expected output.

Within the bin and txt subfolders, you’ll find files for m0 and m1 which define the network, and a folder containing several inputs.

cd $REPO/inputs
$ ls mnist/bin/inputs
mnist_input0.bin mnist_input3.bin mnist_input6.bin
mnist_input1.bin mnist_input4.bin mnist_input7.bin
mnist_input2.bin mnist_input5.bin mnist_input8.bin
$ ls $REPO/inputs/mnist/bin/
m0.bin m1.bin
  • ref/ : The reference outputs.
 Stage 1:  hidden_layer = gemm(m0, input)   # Input: [128,784] * [784,1], Output: [128,1]
 Stage 2:  relu(hidden_layer)               # Recall that relu is performed in place. Input: [128,1], Output: [128,1]
 Stage 3:  scores = gemm(m1, hidden_layer)  # Input: [10,128] * [128,1], Output: [10,1]
 Stage 4:  result = argmax(scores)          # Input: [10,1], Output: scalar
ls ./ref/
mnist
simple0
simple1
simple2

We include the reference outputs for each of the inputs. The traces display the out array produced by each stage of main, for verification.

$ ls ./ref/mnist/
mnist_input0.trace mnist_input3.trace mnist_input6.trace
mnist_input1.trace mnist_input4.trace mnist_input7.trace
mnist_input2.trace mnist_input5.trace mnist_input8.trace
$ python3 mnist_in_numpy.py ./inputs/mnist/bin/inputs/mnist_input0.bin ./inputs/mnist/bin/m0.bin ./inputs/mnist/bin/m1.bin
# It has identified input0 as 6.
6

Matrix File Format

Our matrix files come in two forms: binary and plaintext. The binary format is the one we will be using for all our programs, as it is easier to read programmatically. The plaintext version has been provided to help with debugging. The usage is as follows:

We recommend the xxd command for inspecting the binary files (DO NOT USE YOUR EDITOR). You can find its man page here; its default functionality is to output the raw bits of the file in a hex representation.

The first 8 bytes of the binary file represent two 4-byte integers: the number of rows and the number of columns of the matrix. Every following 4 bytes represents an integer that is an element of the matrix, in row-major order. In this case each 4-byte value represents the value of a pixel. There are no gaps between elements. It is important to note that the bytes are in little-endian order: the least significant byte is placed at the lowest memory address. For files, the start of the file is considered the lower address. This relates to how we read files into memory, and the fact that the start/first element of an array is usually at the lowest memory address.
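As a sketch of how these bytes map to values, assuming the file has already been read into a byte buffer (the helper names here are ours, not part of the provided code):

#include <stdint.h>

// Sketch: interpret the first 8 bytes as two little-endian 4-byte integers
// (rows, cols); the matrix elements follow in row-major order.
static int32_t read_le32(const uint8_t *p) {
  uint32_t v = (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
  return (int32_t)v;
}

static void read_matrix_header(const uint8_t *buf, int32_t *rows, int32_t *cols) {
  *rows = read_le32(buf);      // bytes 0-3
  *cols = read_le32(buf + 4);  // bytes 4-7
  // Element (i, j) then lives at buf + 8 + (i * (*cols) + j) * 4.
}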

$ xxd ./inputs/mnist/bin/inputs/mnist_input0.bin | more
# hit q to exit the viewer

We’ve included a python script called convert.py to convert between the text and binary formats

  • python convert.py file.bin file.txt --to-ascii to go from binary to plaintext
  • python convert.py file.txt file.bin --to-binary to go from plaintext to binary

For example, let’s say the plaintext example in the previous section is stored in file.txt. We can run python convert.py file.txt file.bin --to-binary to convert it to a binary format.

Generating Your Own MNIST Inputs

Just for fun, you can also draw your own handwritten digits and pass them to the neural net. First, open up any basic drawing program like Microsoft Paint. Next, resize the image to 28x28 pixels, draw your digit, and save it as a .bmp file in the directory /inputs/mnist/student_inputs/.

Inside that directory, we’ve provided bmp_to_bin.py to turn this .bmp file into a .bin file for the neural net, as well as an example.bmp file. To convert it, run the following from inside the /inputs/mnist/student_inputs directory:

python bmp_to_bin.py example

This will read in the example.bmp file, and create an example.bin file. We can then input it into our neural net, alongside the provided m0 and m1 matrices.

You can convert and run your own .bmp files in the same way. You should be able to achieve a reasonable accuracy with your own input images.

The Accelerators

In this project, all two-dimensional matrices will be stored as a one-dimensional vector in row-major order. One way to think about it is that we can create a 1D vector from a 2D matrix by concatenating together all the rows in the matrix. Alternatively, we could concatenate all the columns together instead, which is known as column-major order.

For a more in-depth look at row-major vs. column-major order, see this Wikipedia page.

The stride of a vector is the number of memory locations between consecutive elements of our vector, measured in the size of our vector’s elements. If our stride is n, then the memory addresses of our vector elements are n * sizeof(element) bytes apart.

So far, all the arrays/vectors we’ve worked with have had stride 1, meaning there is no gap between consecutive elements. Now, to do the row * column dot products with our row-major matrices that we’ll need for matrix multiplication, we will need to consider vectors with varying strides. Specifically, we’ll need to do this when accessing a column vector inside a flattened, row-major representation of a 2D matrix.

Let’s take a look at a practical example. We have the vector int *a with 3 elements.

  • If the stride is 1, then our vector elements are *(a), *(a + 1), and *(a + 2), in other words a[0], a[1], and a[2].
  • However, if our stride is 4, then our elements are at *(a), *(a + 4), and *(a + 8) or in other words a[0], a[4], and a[8].

To summarize in C code, to access the ith element of a vector int *a with stride s, we use *(a + i * s), or a[i * s]. We leave it up to you to translate this memory access into RISC-V.

For a closer look at strides in vectors/arrays, see this Wikipedia page.

Task 1: ReLU

Implement our relu function to apply the mathematical ReLU function on every element of the input array. This ReLU function is defined as $\text{ReLU}(a) = \max(a, 0)$, and applying it elementwise on our matrix is equivalent to setting every negative value equal to 0.

Additionally, notice that our relu function operates on a 1D vector, not a 2D matrix. We can do this because we’re applying the function individually on every element of the matrix, and our 2D matrix is stored in memory as a row-major 1D vector.

    v0 = [3, -42, 432, 7, -5, 6, 5, -114, 2]
    relu(v0) = zero out all negative values = [3, 0, 432, 7, 0, 6, 5, 0, 2]
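A minimal C sketch of this in-place behaviour (the signature is illustrative; match it to the provided relu skeleton):

// Sketch: apply ReLU in place to a length-n vector stored contiguously.
void relu(int *v, int n) {
  for (int i = 0; i < n; i++) {
    if (v[i] < 0)
      v[i] = 0;
  }
}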

Task 2: ArgMax

Near the end of our neural net, we’ll be provided with scores for every possible classification. For MNIST, we’ll be given a vector of length 10 containing scores for every digit ranging from 0 to 9. The larger the score for a digit, the more confident we are that our handwritten input image contained that digit. Thus, to classify our handwritten image, we pick the digit with the highest score.

The score for the digit i is stored in the i-th element of the array, so to pick the digit with the highest score we find the array index with the highest value. Implement the argmax function to return the index of the largest element in the array. If there are multiple largest elements, return the smallest index.

Just like relu, this function takes in a 1D vector. The index you’re expected to return is the index of the largest element in this 1D vector.

Testing: Argmax

We should get the following:

    v0 = [3, -42, 432, 7, -5, 6, 5, -114, 2]
    argmax(v0) = index of 432  = 2
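A minimal C sketch of the behaviour described above (the signature is illustrative; match it to the provided argmax skeleton):

// Sketch: return the index of the largest element; ties resolve to the
// smallest index because we only update on a strictly greater value.
int argmax(const int *v, int n) {
  int best = 0;
  for (int i = 1; i < n; i++) {
    if (v[i] > v[best])
      best = i;
  }
  return best;
}

For the example above, argmax(v0, 9) returns 2.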

Task 3: Matrix Multiplication

We will implement matrix multiplication as a collection of dot products. The dot product of two vectors a and b is defined as dot(a, b) = $\sum_{i=0}^{n-1} a_ib_i = a_0 * b_0 + a_1 * b_1 + \cdots + a_{n-1} * b_{n-1}$, where a_i is the ith element of a. Notice that this function takes in a stride as a variable for each of the two vectors; make sure you’re considering this when calculating your memory addresses. We’ve described strides in more detail in the background section above, which also contains a detailed example of how stride affects memory addresses for vector elements.

For a closer look at dot products, see this Wikipedia page.

e.g.,

    v0 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    v1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    dot(v0, v1) = 1 * 1 + 2 * 2 + ... + 9 * 9 = 285

What if we changed the length to 3 and the stride of the second vector v1 to 2, without changing the values in static memory? Now, the vectors contain the following:

    v0 = [1, 2, 3]
    v1 = [1, 3, 5]
    dot(v0, v1) = 1 * 1 + 2 * 3 + 3 * 5 = 22

Note that v1 now has stride 2, so we skip over elements in memory when calculating the dot product. However, the pointer v1 still points to the same place as before: the start of the sequence of integers 1 to 9 in memory.
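A minimal C sketch of such a stride-aware dot product (the signature is illustrative; adapt it to the provided skeleton):

// Sketch: dot product of two length-n vectors with independent strides,
// measured in elements (so stride 1 means contiguous).
int dot(const int *a, int stride_a, const int *b, int stride_b, int n) {
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i * stride_a] * b[i * stride_b];
  return sum;
}

With this sketch, dot(v0, 1, v1, 2, 3) over the data above reproduces the value 22.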

Now that we have a dot product function that can take in varying strides, we can use it to do matrix multiplication. Implement the gemm function to compute the matrix multiplication of two matrices. The matrix multiplication of two matrices A and B results in the output matrix C = AB, where C[i][j] is equal to the dot product of the i-th row of A and the j-th column of B. Note that if we let the dimensions of A be (n * m) and the dimensions of B be (m * p), then the dimensions of C must be (n * p). Additionally, unlike integer multiplication, matrix multiplication is not commutative: AB != BA.

// A[n][m], B[m][p]
// A - base address, B - base address
// Unroll
for (i = 0; i < n; i++) {
  // Note that for mnist p = 1 always, for both stage 1 and stage 3
  for (j = 0; j < p; j++) {
    // Dot product
    c[i][j] = 0;
    for (k = 0; k < m; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
}

For a closer look at matrix multiplication, see this Wikipedia page.

Check yourself

  • There are two different matrix shapes in GEMM. What are they? What are the n,m,p factors for Stage1:GEMM and Stage3:GEMM in MNIST?
  • What is the total size of the input and output (in bytes) of Stage1:GEMM and Stage3:GEMM?
  • If we wish to use a single hardware datapath for both Stage1:GEMM and Stage3:GEMM, what are the challenges? Hint: Think about what happens if the total number of accesses made by an accelerator dynamically changes based on the input.

Step 0 : Generate gem5 configs for different accelerators

After generating the Python script, set up an office-hours slot and run it by the TA: Parker Tian.

## Fill config.yml with appropriate components (scratchpads, accelerators)
# Warning 1: Only the top accelerator has an interrupt port: always hardcode it to 68
# Warning 2: DMA interrupts are always hardcoded to 95
# Warning 3: Set up scratchpads to appropriate sizes based on the inputs
# Warning 4: Relu, Argmax, and Mnist are all separate systems, so you need to generate 3 different configs.
${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName ${benchmark} --benchDir "[dma|cache]/${benchmark}"

WARNING: DO NOT PROCEED TO SUBSEQUENT STEPS WITHOUT CHECKING WITH TA ON GENERATED PYTHON FILES AND CONFIG

Things to triple check in gem5-config/[$bench].py

  • Do you have coherent DMA?
  • Do you have non coherent DMA?
  • Do you have scratchpads? (If yes, double check memory map against .h file)
  • Do you have top’s local port connected? (see multi_vector.py for a hint)
  • Do you have accelerators? (Is the scratchpad comm interface connected to a scratchpad with many ports?)
  • Is the accelerator’s acp port connected to the coherent xbar?

If you have questions refer to the lab.

Baremetal input

In mnist/cache/host/host.cpp, implement the main function. This will bring together everything you’ve written so far and create a basic sequence of functions that will allow you to classify the preprocessed MNIST inputs using the pretrained matrices we’ve provided. You may need to allocate space when reading in matrices and computing the layers of the network, but remember that we work bare metal.

The inputs are loaded directly into the global DRAM space; see gem5-config/fs_cache.py, lines 130-150:

  Base 0x80c00000: Input (1*784), followed by M0 (128*784), followed by M1 (10*128)

  • Recall that the first 8 bytes contain the two 4-byte dimensions of the matrix, which will tell you how many bytes to read from the rest of the file. Additionally, recall that the binary matrix file is already in row-major order. A sketch of walking this layout is shown below.
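The sketch below assumes the three matrix blobs sit back to back starting at 0x80c00000, each preceded by its 8-byte (rows, cols) header; the base address and layout must be verified against gem5-config/fs_cache.py, and the struct and helper names are ours:

#include <stdint.h>

// Sketch only: walking the back-to-back matrix blobs in DRAM.
#define MNIST_BASE 0x80c00000UL

typedef struct {
  int32_t rows;
  int32_t cols;
  int32_t *data;   // rows * cols ints, row-major
} matrix_t;

static uint8_t *load_matrix(uint8_t *p, matrix_t *m) {
  m->rows = ((int32_t *)p)[0];
  m->cols = ((int32_t *)p)[1];
  m->data = (int32_t *)(p + 8);
  return p + 8 + m->rows * m->cols * sizeof(int32_t);
}

// In main():
//   matrix_t in, m0, m1;
//   uint8_t *p = (uint8_t *)MNIST_BASE;
//   p = load_matrix(p, &in);
//   p = load_matrix(p, &m0);
//   p = load_matrix(p, &m1);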

Read gem5-config/fs_[BENCH].py: 3 additional parameters can be passed to gem5.opt.

$M5_PATH/build/arm/gem5.opt ..... --input inputs/mnist/bin/inputs/mnist_input0.bin --m0 inputs/mnist/bin/m0.bin --m1 inputs/mnist/bin/m1.bin 

Task 1: Cache-based MNIST (100 points)

BASE Folder: cache/

Source

ls cache/
# Prefilled sample Relu accelerator
relu:
defines.h  host  hw  Makefile

# Prefilled sample argmax accelerator
argmax:
defines.h  host  hw  Makefile

# Common interface calls. Typically you should not have to read these files, apart from knowing what the interface calls do.
common:
dma.h  m5ops.h  Makefile  queue_dma.c  queue_dma.h  syscalls.c  syscalls.o

# TODO: MNIST accelerator
mnist:
defines.h  host  hw  Makefile

Each of the accelerators is organized in the following manner:

host        Contains the host code. TODO: fill in main.cpp
hw/*.c      TODO: datapath definitions for the hardware accelerators
config.yml  TODO: configs for each of the accelerators. Fill in accelerator and scratchpad.

Helpers

To help you with your task we have provided some sample layouts for the project

cd $REPO

Check yourself

WARNING: DO NOT PROCEED IF YOU ARE UNABLE TO ANSWER ALL OF THE QUESTIONS BELOW.

  1. What are the n,m,p dimensions for stage1:gemm and stage3:gemm?
  2. Read the config files. What are the MMR (memory-mapped register) argument addresses for each of the accelerators? (see mnist/hw/hw_defines.h)
  3. What is the address range of the input, m0, and m1 matrices in DRAM? What is the size of the matrices in bytes? Hint: Read mnist/host/main.cpp and gem5-config/run_mnist.py.
  4. What is the address range read by the gemm accelerator?
  5. How many arrays do you need to allocate space for? Hint: Note that Relu can modify in place.
  6. How many bytes do you need to allocate for the hidden_layer and scores matrices?
  7. If you unrolled the m0 loop by 128, how many multipliers would you need to multiply m0 and input? How many cycles would it take to calculate a single element of hidden_layer?
  • mnist/host/main.cpp Draw the memory layout. You will need temporary matrices to hold intermediate results, the hidden_layer, scores, and final result
  • mnist/hw/source/top.c Invoke the accelerators in sequence, setting up pointers to the main memory.
  • mnist/hw/source/relu.c, gemm.c, argmax.c Implement the accelerator functions.

REPORT

  • Vary the unroll factor of the m loop in mnist/hw/gemm.c over 4, 8, 16.
  • Vary the cache size over 32K, 64K, 128K, 256K, 1M in the gem5 config (line 23).
Metric                            Description
top:Runtime (debug-trace.txt)     Total execution time
gemm:Runtime (debug-trace.txt)    gemm execution time. WARNING: you will see a stat dump for each invocation
relu:Runtime (debug-trace.txt)    relu execution time. WARNING: you will see a stat dump for each invocation
argmax:Runtime (debug-trace.txt)  argmax execution time. WARNING: you will see a stat dump for each invocation
  • Plot 1 Performance: Create a scatter plot of top:runtime for each combination of unroll factor and cache size, [4,8,16] x [32K,64K,128K,256K,1M]. Which configuration performs best? Why?
  • Plot 2 Stall Cycles: Plot a stacked bar plot of stall and compute cycles (runtime - stall) for each accelerator (gemm, relu, argmax) for the unroll factors (64,128) and cache sizes (32K, 64K, 128K). Which configuration minimizes stall cycles? Why?
  • Plot 3 Energy

Calculate the energy as follows: $E =$ (stats.txt: system.acctest.cluster_cache.demand_accesses::total) $\times$ $Cache_{Energy}$ $+$ $\sum_{i=0}^{\text{All FUs}}$ (debug-trace.txt: FU$_i$ Total Power) $\times$ Cycles

Config  Energy
32K     22 pJ
64K     24 pJ
128K    27 pJ
256K    30 pJ
1M      32 pJ
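As a purely hypothetical worked example (the access count below is made up for illustration, not a measured value): with the 64K cache,

$E_{cache} = 1{,}000{,}000 \times 24\,\text{pJ} = 24\,\mu\text{J}$,

to which you add the per-FU term $\sum_{i} (\text{FU}_i\ \text{Total Power} \times \text{Cycles})$ read from debug-trace.txt.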

Opt 1 : One of the matrix inputs can be held entirely in registers.

To reduce cache accesses, one of the matrices in gemm can be held in registers to prevent repeated accesses to the cache, e.g., the input in Stage 1, or the hidden_layer in Stage 3.

// Datapath definition
void reg_ex(int a[], int b[], int N) {
  int tmp[N];
  for (int i = 0; i < N; i++)
    tmp[i] = b[i];
  // Use tmp ...
  // tmp is a local declaration and is allocated to registers.
  // Subsequent references to tmp will use the values in registers,
  // while b remains in memory.
}
  • Plot 4 Energy: at 1M cache and 256 unrolling, plot the energy and cycles with and without the optimization.

Task 2: DMA-based Accelerators (100 points)

BASE Folder: dma/

The cache-based model permits the accelerators to work with data in global DRAM. This increases the energy cost of every memory access in the accelerators (even if they are cached in the cluster cache). We introduce a local scratchpad for each accelerator to make data accesses more efficient. However, an important constraint is that accelerators can only touch and work with data locally stored in their scratchpad. This makes it challenging to support Stage1:gemm and Stage3:gemm in the same hardware since they work with matrices of entirely different shapes.

  • Stage1:gemm Inputs: m0[128:784] * mnist_input[784:1], Output = [128:1]
  • Stage3:gemm Input: [10:128] * [128:1] Output : [10:1].

This means we will have to create two different gemm accelerators in the cluster. Model 1 in figure below.

  • The size of the scratchpad: to calculate an element/row of the hidden layer we need 784 ints (a row of the m0 matrix) and 1 int (for the result). The 784 ints of the input matrix are shared between all the rows. The total scratchpad required for calculating N rows of the hidden layer in a batch is (784N + 784 + 1) × 4 bytes (see Model 3 below).

Model 1 : 1 Tile

In the baseline DMA model we assume that the scratchpad can hold the entire working set of all the required input and output matrices, i.e., 1 tile encompasses the entire matrix. Here, once the data is in the scratchpad, you are only limited by the number of ALUs used for the computation.

The stage1:gemm is the most arithmetically intensive, and we will be exploring multiplier utilization for this stage. You can explore this by manually unrolling the outer loop for (i = 0; i<n; i++) {. What is n for Stage1:gemm?

  • The number of multipliers/adders/subtractors controls how many multiplications and additions can be performed in a cycle.

Report

  • Your mileage may vary with loop unrolling in clang 10, so you will have to manually unroll the loops.
  • Manually unroll the outermost loop for (i = 0; i<n; i++) { by 4 and 8 and report your results. Unroll the innermost loop by 392 and 784 (calculate how many multipliers you need for each case).
  • VS Code's multi-cursor may come in handy. Vary the unroll factor of stage1:gemm. Compare performance and energy between the different factors and against the cache model.

Check yourself

  • How many multipliers does the inner (m) loop require in stage1:gemm and stage3:gemm (assume it is fully unrolled)?
  • With an unroll factor of 10, how many multipliers are required? (assume the m loop is fully unrolled)
  • What is the working set (number of bytes of input for a single element of output) for stage1:gemm, stage2:relu, stage3:gemm, and stage4:argmax? e.g., Stage1:gemm: calculating hidden_layer[0][0] requires 1 row from m0 (784*4 bytes), the entire mnist_input (784*4 bytes), and 4 bytes for the result. Note that mnist_input is shared/reused for calculating all of hidden_layer[0:128][0].

Model 2 : 1 Tile Merged accelerators

Note that relu and argmax can be folded into their preceding stages, since they are pure functions that can run on each element of the matrix independently. This means we can eliminate the DMAing/scratchpads for relu and argmax, as sketched below.
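A minimal C sketch of the folding idea for Stage 1 + 2 (names and signature are illustrative; p = 1 as in MNIST):

// Sketch: GEMM with ReLU folded into the result write-back.
// a is n x m (row-major), b is m x 1, c is n x 1.
void gemm_relu(const int *a, const int *b, int *c, int n, int m) {
  for (int i = 0; i < n; i++) {
    int acc = 0;
    for (int k = 0; k < m; k++)
      acc += a[i * m + k] * b[k];
    c[i] = acc > 0 ? acc : 0;   // ReLU applied as each element is produced
  }
}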

Select Model 2

Report

  • Fix the number of multipliers at 32K. Calculate the energy savings as a result of this. Compare latency and energy against Model 1 (same multiplier configuration).
  • Change the number of multipliers to 16K.
  # Notation: matrix[m:n] has m rows and n columns.
  # Stage 1 + 2: Inputs: m0[128:784] * mnist_input[784:1], Output: [128:1]
  # Recall that relu can be performed in place. Input: [128:1], Output: [128:1]
    hidden_layer = relu(gemm(m0, input))

  # Input: [10:128] * [128:1], Output: [10:1]
  # Note that argmax can check as soon as a single output element of the second gemm is generated.
    result = argmax(gemm(m1, hidden_layer))

Model 3: Bounded scratchpad size

To increase the computation-to-memory ratio,

  1. Tiling/batching can be applied. Each accelerator works on a tile of the resultant matrix at a time.
  2. Instead of writing data back to global memory after each accelerator, we can DMA and forward data directly between the accelerators on-chip. Each batch computes one tile of the resultant matrix. The figure shows the matrix divided into tiles/batches. To compute this, the stage1:gemm loads a batch of N rows at a time and computes hidden_layer in multiple tiles. In each batch/tile (see the sketch after this list):
  • The input is loaded at the beginning from global memory into the scratchpad and then into local temp registers.
  • Hint: if you think smart, you will recognize that once the input is loaded into registers, you are free to use that space in the scratchpad to store your results.
  • Also, you can wait for all batches to complete in stage1:gemm before bulk-DMAing the hidden layer to stage 2.
  • The accelerator loads one tile of m0 from memory into the scratchpad.
  • The accelerator performs the computation and DMAs the temporary result from Stage1:gemm to relu. After all the batches are done, the relu is kickstarted. Similarly, we connect all the accelerators.
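A minimal C sketch of the batch loop for stage1:gemm (BATCH_ROWS, the function name, and the pointer arguments are illustrative assumptions; the DMA interface calls are omitted):

// Sketch: Stage 1 GEMM processed in tiles of BATCH_ROWS rows of m0.
// Assumes the shared 784-int input has already been copied into the
// scratchpad (or local registers) once, before the batch loop.
#define M 784
#define N 128
#define BATCH_ROWS 8   // placeholder; derive it from the scratchpad size

void stage1_gemm_tiled(const int *m0_spm, const int *input_spm, int *hidden_spm) {
  for (int base = 0; base < N; base += BATCH_ROWS) {
    int rows = (N - base < BATCH_ROWS) ? (N - base) : BATCH_ROWS;
    // DMA of rows [base, base + rows) of m0 into m0_spm happens here
    // (interface calls omitted; see common/dma.h).
    for (int i = 0; i < rows; i++) {
      int acc = 0;
      for (int k = 0; k < M; k++)
        acc += m0_spm[i * M + k] * input_spm[k];
      hidden_spm[base + i] = acc;
    }
    // The result tile is either kept in the scratchpad or DMAed onward here.
  }
}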

Check yourself

  • WARNING: YOU SHOULD NOT TILE BETWEEN STAGE 2 AND STAGE 3,
    i.e., collect all the tiles in Stage 2 before starting the Stage 3 computation.

  • (784N + 784 + 1) × 4 bytes. If the scratchpad size is 32 KB, how many rows can you process in a batch? Note that the number of rows may not be an even multiple of the batch size; you may have to process fewer rows in the final iteration. (A worked example follows.)
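A quick worked check of this formula, assuming 32 KB means 32768 bytes:

$(784N + 784 + 1) \times 4 \le 32768 \;\Rightarrow\; 784N \le 7407 \;\Rightarrow\; N \le 9.4$,

so at most 9 rows fit in a batch, and the 128 rows of hidden_layer take $\lceil 128 / 9 \rceil = 15$ batches, with the final batch covering only 2 rows.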

Report

  • Set the scratchpad size to 32KB, 64KB, 128KB, 256KB, 512KB, 1MB for Stage1:GEMM. You can assume that the stage2:gemm has sufficient scratchpad space to store all tiles (how many bytes are required?).
  • Compare against model 2 and model 1.
Scratchpad Config  Energy
32K                9 pJ
64K                10 pJ
128K               12 pJ
256K               15 pJ
1M                 16 pJ
Stage2:GEMM        2 pJ

Report

  • Compare against Task 1 and Task 2 designs, performance and energy.

Generating your own inputs

For MNIST, there are two additional folders:

  • txt/labels/ contains the true labels for each input, which are the actual digits that each corresponding input image contains.
  • student_inputs/ contains a script to help you run your own input images, as well as an example.

All the files for testing the mnist network are contained in inputs/mnist. There are both binary and plaintext versions of m0, m1, and 9 input files (inputs/mnist/bin/inputs/mnist_input[0-9]*.bin).

To test on the first input file for example, run the following:

This should print out the digit with the highest score. You can compare the printed digit versus the one in inputs/mnist/txt/labels/label0.txt. It will also print out the output of each stage.

Not all inputs will classify properly. A neural network will practically never have 100% accuracy in its predictions. In our test cases specifically, mnist_input2 and mnist_input7 will be misclassified to 9 and 8 respectively. All other test cases should classify correctly.

To check the final label, we have also included a Python script that implements MNIST:

python3 mnist.py ./inputs/mnist/bin/inputs/mnist_input0.bin ./inputs/mnist/bin/m0.bin ./inputs/mnist/bin/m1.bin
6

You can check the digit printed by main against the plaintext labels for each of the input files in the mnist/txt/labels folder.

We’ve also included a script inputs/mnist/txt/print_mnist.py, which will allow you to view the actual image for every mnist input. For example, you can run the following command from the directory inputs/mnist/txt to print the actual image for mnist_input8 as ASCII art alongside the true label.

cd inputs/mnist/txt/
python3 print_mnist.py 8

Grading

Test Points  
cache 100 OR
DMA 100  

Acknowledgments

This assignment has been modified by the CMPT 750 instructor.