source /data/.local/modules/init/zsh
module load llvm-11 #(instead of llvm-10)
Use #pragma unroll() to unroll the inner loop by however much you desire. For example, you can manually transform
for (i = ...; i++)
  for (j = ...; j++)
    for (k = ...; k++)
into
for (i = 0; ...; i = i + 2)
  // i + 0 copy
  #pragma unroll(...)
  for (k = ...; k++)
  // i + 1 copy
  #pragma unroll(...)
  for (k = ...; k++)
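As a purely illustrative sketch of this transformation (hypothetical bounds n, m, p, an inner unroll factor of 4, and the assumption that n is even and c is zero-initialized by the caller; your actual bounds and factors come from your design):

// Outer loop manually unrolled by 2, inner loops unrolled by the compiler.
void gemm_unrolled(int n, int m, int p, int a[n][m], int b[m][p], int c[n][p]) {
    for (int i = 0; i < n; i += 2) {
        for (int j = 0; j < p; j++) {
            #pragma unroll(4)                 /* i + 0 copy */
            for (int k = 0; k < m; k++)
                c[i][j] += a[i][k] * b[k][j];
            #pragma unroll(4)                 /* i + 1 copy */
            for (int k = 0; k < m; k++)
                c[i + 1][j] += a[i + 1][k] * b[k][j];
        }
    }
}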
This can be done in groups of up to 2 students, for both graduates and undergraduates.
You can choose to complete either Task 1 or Task 2.
Discuss and explain your design with the TAs during office hours. Also cross-check it with Arrvindh on Nov 23rd.
You will implement functions which operate on matrices and vectors – for example, matrix multiplication. You will then use these functions to construct a simple Artificial Neural Net (ANN), which will be able to classify handwritten digits to their actual number! You will see that ANNs can be simply implemented using basic numerical operations, such as vector inner product, matrix multiplications, and thresholding.
Note: Although the spec is quite long, please make sure to read through the whole thing as it contains a lot of important information about the functions provided, running Venus, and testing your code.
At a basic level, a neural network tries to approximate a (non-linear) function that maps your input to a desired output. A basic neuron consists of a weighted linear combination of the input, followed by a non-linearity, for example a threshold. Consider the following neuron, which implements the logical AND operation:
It is easy to see that for A=0, B=0, the linear combination 0*0.6 + 0*0.6 = 0 is less than the threshold of 1 and will result in a 0 output. With an input of A=0, B=1 or A=1, B=0, the linear combination is 1*0.6 + 0*0.6 = 0.6, which is still less than 1 and results in a 0 output. Finally, A=1, B=1 gives 1*0.6 + 1*0.6 = 1.2, which is greater than the threshold and results in a 1 output! What is interesting is that this simple neuron operation can also be described as an inner product between the vector [A,B]^T and the weights vector [0.6,0.6]^T, followed by a thresholding (non-linear) operation.
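As a minimal illustration (plain C, not part of the project code), the AND neuron is just an inner product followed by a threshold:

#include <stdio.h>

/* Inner product of [A, B] with the weights [0.6, 0.6], then threshold at 1. */
static int and_neuron(int A, int B) {
    double sum = A * 0.6 + B * 0.6;  /* weighted linear combination */
    return sum >= 1.0;               /* non-linearity: threshold */
}

int main(void) {
    printf("%d %d %d %d\n", and_neuron(0, 0), and_neuron(0, 1),
           and_neuron(1, 0), and_neuron(1, 1));  /* prints: 0 0 0 1 */
    return 0;
}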
More complex functions cannot be described by a simple neuron alone. We can extend the system into a network of neurons in order to approximate such complex functions. For example, the following 2-layer network approximates the logical function XOR:
What is XOR?
The above is a 2-layer network. The network takes 2 inputs, computes 2 intermediate values, and finally computes a single final output. It can be written as matrix multiplications with matrices m_0 and m_1, with thresholding operations in between, as shown below:
You are probably wondering how the weights of the network were determined. That is beyond the scope of this project; we encourage you to take a course on machine learning. We will only say that the weights can be trained by giving the network pairs of correct inputs and outputs and adjusting the weights so that the error between the network's outputs and the correct outputs is minimized. Learning the weights is called "training"; using the weights on inputs is called "inference". We will only perform inference, and you will be given weights that were pre-trained by your dedicated TAs.
In this project we will implement a similar, but slightly more complex network which is able to classify handwritten digits. As inputs, we will use the MNIST data set, which is a dataset of 60,000 28x28 images containing handwritten digits ranging from 0-9. We will treat these images as “flattened” input vectors sized 784x1.
In a similar way to the example before, we will perform matrix multiplications with pre-trained weight matrices m_0 and m_1. Instead of thresholding, we will use two different non-linearities: the ReLU and ArgMax functions.
# Notation: matrix[m:n] denotes a matrix with m rows and n columns.
# Stage 1 Inputs: m0[128:784] * mnist_input[784:1], Output: [128:1]
hidden_layer = gemm(m0, input)
# Stage 2 Recall that relu can be performed in-place. Input: [128:1], Output: [128:1]
relu(hidden_layer)
# Stage 3 Inputs: m1[10:128] * hidden_layer[128:1], Output: [10:1]
scores = gemm(m1, hidden_layer)
# Stage 4 Input: [10:1], Output: [1:1] (scalar)
result = argmax(scores)
Let's first implement mnist in software, in Python.
cd $REPO
python3 mnist_in_python.py $PWD/inputs/mnist/bin/mnist_input0.bin $PWD/inputs/mnist/bin/m0.bin $PWD/inputs/mnist/bin/m1.bin
# e.g., listing of mnist
$ ls inputs/mnist
bin txt
The bin subfolder contains the binary input files that you'll run cache/mnist/host/host.c on, while the txt subfolder contains the plaintext versions for debugging and for calculating the expected output.
Within the bin and txt subfolders, you'll find files for m0 and m1, which define the network, and a folder containing several inputs.
cd $REPO/inputs
$ ls mnist/bin/inputs
mnist_input0.bin mnist_input3.bin mnist_input6.bin
mnist_input1.bin mnist_input4.bin mnist_input7.bin
mnist_input2.bin mnist_input5.bin mnist_input8.bin
$ ls $REPO/inputs/mnist/bin/
m0.bin m1.bin
Stage 1: hidden_layer = gemm(m0, input). Input: [128,784] * [784,1], Output: [128,1]
Stage 2: relu(hidden_layer). Recall that relu is performed in-place. Input: [128,1], Output: [128,1]
Stage 3: scores = gemm(m1, hidden_layer). Input: [10,128] * [128,1], Output: [10,1]
Stage 4: result = argmax(scores). Input: [10,1], Output: scalar
ls ./ref/
mnist
simple0
simple1
simple2
We include the reference outputs for each of the inputs. The traces display the out array produced by each stage of main, so you can verify the output of every stage.
$ ls ./ref/mnist/
mnist_input0.trace mnist_input3.trace mnist_input6.trace
mnist_input1.trace mnist_input4.trace mnist_input7.trace
mnist_input2.trace mnist_input5.trace mnist_input8.trace
$ python3 mnist_in_numpy.py ./inputs/mnist/bin/inputs/mnist_input0.bin ./inputs/mnist/bin/m0.bin ./inputs/mnist/bin/m1.bin
# It has identified input0 as 6.
6
Our matrix files come in two forms: binary and plaintext. The binary format is the one we will use for all our programs, as it is easier for programs to parse. The plaintext version has been provided to help with debugging. The usage is as follows:
We recommend the xxd command to inspect the binary files (DO NOT USE YOUR EDITOR). You can find its man page here; its default functionality is to output the raw bytes of the file in a hex representation.
The first 8 bytes of the binary file represent two 4-byte integers: the number of rows and the number of columns of the matrix. Each following group of 4 bytes represents an integer element of the matrix, in row-major order; in this case, each 4-byte element is the value of one pixel. There are no gaps between elements. It is important to note that the bytes are in little-endian order: the least significant byte is placed at the lowest memory address. For files, the start of the file is considered the lowest address. This relates to how we read files into memory, and to the fact that the start/first element of an array is usually at the lowest memory address.
$ xxd ./inputs/mnist/bin/inputs/mnist_input0.bin | more
# hit q to exit the viewer
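As a purely hypothetical illustration of the layout (not one of the provided files): a 1x3 matrix holding the values 1, 2, 3 would occupy 20 bytes, which would appear in the hex dump roughly as:

01 00 00 00  03 00 00 00                 <- header: rows = 1, cols = 3 (little-endian)
01 00 00 00  02 00 00 00  03 00 00 00    <- the three elements, in row-major order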
We've included a python script called convert.py to convert between the text and binary formats:
python convert.py file.bin file.txt --to-ascii to go from binary to plaintext
python convert.py file.txt file.bin --to-binary to go from plaintext to binary
For example, let's say the plaintext example in the previous section is stored in file.txt. We can run python convert.py file.txt file.bin --to-binary to convert it to the binary format.
Generating Your Own MNIST Inputs
Just for fun, you can also draw your own handwritten digits and pass them to the neural net. First, open up any basic drawing program like Microsoft Paint. Next, resize the image to 28x28 pixels, draw your digit, and save it as a .bmp file in the directory /inputs/mnist/student_inputs/.
Inside that directory, we've provided bmp_to_bin.py to turn this .bmp file into a .bin file for the neural net, as well as an example.bmp file. To convert it, run the following from inside the /inputs/mnist/student_inputs directory:
python bmp_to_bin.py example
This will read in the example.bmp file and create an example.bin file. We can then feed it to our neural net, alongside the provided m0 and m1 matrices.
You can convert and run your own .bmp files in the same way. You should be able to achieve reasonable accuracy with your own input images.
In this project, all two-dimensional matrices will be stored as a one-dimensional vector in row-major order. One way to think about it is that we can create a 1D vector from a 2D matrix by concatenating together all the rows in the matrix. Alternatively, we could concatenate all the columns together instead, which is known as column-major order.
For a more in-depth look at row-major vs. column-major order, see this Wikipedia page.
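For example (a generic illustration, not one of the project matrices), a 2x3 matrix {{1,2,3},{4,5,6}} flattens as follows:

int matrix[6] = {1, 2, 3, 4, 5, 6};  /* rows concatenated: row 0, then row 1 */
/* Element (r, c) of an R x C row-major matrix is matrix[r * C + c];
   e.g., (1, 2) -> matrix[1 * 3 + 2] = matrix[5] = 6.
   In column-major order the same matrix would be {1, 4, 2, 5, 3, 6}. */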
The stride of a vector is the number of memory locations between consecutive elements of our vector, measured in the size of our vector’s elements. If our stride is n, then the memory addresses of our vector elements are n * sizeof(element) bytes apart.
So far, all the arrays/vectors we've worked with have had stride 1, meaning there is no gap between consecutive elements. Now, to do the row * column dot products with our row-major matrices that we'll need for matrix multiplication, we will need to consider vectors with varying strides. Specifically, we'll need to do this when considering a column vector in a flattened, row-major representation of a 2D matrix.
Let's take a look at a practical example. We have the vector int *a with 3 elements.
If the stride is 1, the elements are *(a), *(a + 1), and *(a + 2), in other words a[0], a[1], and a[2].
If the stride is 4, the elements are *(a), *(a + 4), and *(a + 8), or in other words a[0], a[4], and a[8].
To summarize in C code: to access the i-th element of a vector int *a with stride s, we use *(a + i * s), or a[i * s]. We leave it up to you to translate this memory access into RISC-V.
For a closer look at strides in vectors/arrays, see this Wikipedia page.
Implement our relu function to apply the mathematical ReLU function on every element of the input array. The ReLU function is defined as \text{ReLU}(a) = \max(a, 0), and applying it elementwise on our matrix is equivalent to setting every negative value to 0.
Additionally, notice that our relu function operates on a 1D vector, not a 2D matrix. We can do this because we're applying the function individually to every element of the matrix, and our 2D matrix is stored in memory as a row-major 1D vector.
v0 = [3, -42, 432, 7, -5, 6, 5, -114, 2]
relu(v0) = zero out all negative values = [3, 0, 432, 7, 0, 6, 5, 0, 2]
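A minimal C sketch of this behaviour (the exact signature you use in hw/ and host/ may differ):

/* Apply ReLU in place on a 1D, row-major view of the matrix. */
void relu(int *v, int length) {
    for (int i = 0; i < length; i++)
        if (v[i] < 0)
            v[i] = 0;   /* ReLU(a) = max(a, 0) */
}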
Near the end of our neural net, we’ll be provided with scores for every possible classification. For MNIST, we’ll be given a vector of length 10 containing scores for every digit ranging from 0 to 9. The larger the score for a digit, the more confident we are that our handwritten input image contained that digit. Thus, to classify our handwritten image, we pick the digit with the highest score.
The score for digit i is stored in the i-th element of the array, so to pick the digit with the highest score we find the array index holding the highest value. Implement the argmax function to return the index of the largest element in the array. If there are multiple largest elements, return the smallest index.
Just like relu, this function takes in a 1D vector. The index you're expected to return is the index of the largest element in this 1D vector.
We should get the following:
v0 = [3, -42, 432, 7, -5, 6, 5, -114, 2]
argmax(v0) = index of 432 = 2
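A minimal C sketch, assuming a plain int array (your actual signature may differ):

/* Return the index of the largest element; ties keep the smallest index
   because we only update on a strictly larger value. */
int argmax(int *v, int length) {
    int best = 0;
    for (int i = 1; i < length; i++)
        if (v[i] > v[best])
            best = i;
    return best;
}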
We will implement matrix multiplication as a collection of dot products. The dot product of two vectors a and b is defined as dot(a, b) = $\sum_{i=0}^{n-1} a_ib_i = a_0 * b_0 + a_1 * b_1 + \cdots + a_{n-1} * b_{n-1}$, where a_i is the i-th element of a. Notice that this function takes in a stride as a variable for each of the two vectors; make sure you're considering this when calculating your memory addresses. We've described strides in more detail in the background section above, which also contains a detailed example of how stride affects memory addresses for vector elements.
For a closer look at dot products, see this Wikipedia page.
e.g.,
v0 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
v1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]
dot(v0, v1) = 1 * 1 + 2 * 2 + ... + 9 * 9 = 285
What if we changed the length to 3 and the stride of the second vector v1 to 2, without changing the values in static memory? Now, the vectors contain the following:
v0 = [1, 2, 3]
v1 = [1, 3, 5]
dot(v0, v1) = 1 * 1 + 2 * 3 + 3 * 5 = 22
Note that v1 now has stride 2, so we skip over elements in memory when calculating the dot product. However, the pointer v1 still points to the same place as before: the start of the sequence of integers 1 to 9 in memory.
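A minimal C sketch of a strided dot product, with hypothetical parameter names (strides are measured in elements, as described above):

int dot(int *a, int *b, int length, int stride_a, int stride_b) {
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += a[i * stride_a] * b[i * stride_b];
    return sum;
}
/* The example above is then dot(v0, v1, 3, 1, 2) = 22. */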
Now that we have a dot product function that can take in varying strides, we can use it to do matrix multiplication. Implement the gemm function to compute the matrix multiplication of two matrices. The matrix multiplication of two matrices A and B results in the output matrix C = AB, where C[i][j] is equal to the dot product of the i-th row of A and the j-th column of B. Note that if we let the dimensions of A be (n * m) and the dimensions of B be (m * p), then the dimensions of C must be (n * p). Additionally, unlike scalar multiplication, matrix multiplication is not commutative: in general, AB != BA.
// A[n][m], B[m][p], C[n][p]
// A - Base address, B - Base address
// Unroll
for (i = 0; i < n; i++) {
  // Note that for mnist p = 1 always, for both stage 1 and stage 3
  for (j = 0; j < p; j++) {
    // Dot product
    c[i][j] = 0;
    for (k = 0; k < m; k++)
      c[i][j] += a[i][k] * b[k][j];
  }
}
For a closer look at matrix multiplication, see this Wikipedia page.
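One possible way to structure gemm on top of a strided dot product like the sketch above (only a sketch; the loop structure of your accelerator may look quite different): row i of A starts at a + i*m with stride 1, while column j of B starts at b + j with stride p.

void gemm(int *a, int *b, int *c, int n, int m, int p) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < p; j++)
            c[i * p + j] = dot(a + i * m, b + j, m, 1, p);  /* row i . column j */
}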
Hint: Think about what happens if the total number of accesses made by an accelerator changes dynamically based on the input.
After generating the python config, go over it with the TA (Parker Tian) during office hours.
## Fill config.yml with appropriate components (scratchpads, accelerators)
# Warning 1: Only top accelerator has interrupt port : always hardcode to 68
# Warning 2: DMA interrupts always hardcoded to 95
# Warning 3: Set up scratchpads to appropriate sizes based on the inputs
# Warning 4: Relu, Argmax, Mnist are all separate systems. So, you need to generate 3 different configs.
${LAB_PATH}/SALAM-Configurator/systembuilder.py --sysName ${benchmark} --benchDir "[dma|cache]/${benchmark}"
WARNING: DO NOT PROCEED TO SUBSEQUENT STEPS WITHOUT CHECKING
WITH TA ON GENERATED PYTHON FILES AND CONFIG
Things to triple check in gem5-config/[$bench].py (see multi_vector.py for a hint). If you have questions, refer to the lab.
In mnist/cache/host/host.cpp, implement the main function. This will bring together everything you've written so far into a basic sequence of functions that will allow you to classify the preprocessed MNIST inputs using the pretrained matrices we've provided. You may need to allocate space when reading in matrices and computing the layers of the network, but remember that we work bare-metal.
The inputs are directly loaded into the global DRAM space (see gem5-config/fs_cache.py, lines 130-150):
0x80c00000: Input (1*784) | M0 (128*784) | M1 (10*128) |
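For illustration only, a minimal sketch of how the host might locate these regions, assuming 4-byte integers and that the three blobs are packed back-to-back exactly as drawn above (verify both assumptions, and whether any row/column headers are present, against gem5-config/fs_cache.py):

#define DRAM_BASE 0x80c00000UL
int *input = (int *)DRAM_BASE;                            /* 1   x 784 */
int *m0    = (int *)(DRAM_BASE + 784 * 4);                /* 128 x 784 */
int *m1    = (int *)(DRAM_BASE + (784 + 128 * 784) * 4);  /* 10  x 128 */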
Read gem5-config/fs_[BENCH].py. 3 additional parameters can be passed to gem5.opt:
$M5_PATH/build/arm/gem5.opt ..... --input inputs/mnist/bin/inputs/mnist_input0.bin --m0 inputs/mnist/bin/m0.bin --m1 inputs/mnist/bin/m1.bin
BASE Folder: cache/
ls cache/
# Prefilled sample Relu accelerator
relu:
defines.h host hw Makefile
# Prefilled sample argmax accelerator
argmax:
defines.h host hw Makefile
# Common interface calls. Typically you should not have to read these files, apart from knowing what the interface calls do.
common:
dma.h m5ops.h Makefile queue_dma.c queue_dma.h syscalls.c syscalls.o
# TODO: MNIST accelerator
mnist:
defines.h host hw Makefile
Each of the accelerators is organized in the following manner:
Path | Description |
---|---|
host | Contains the host code. TODO: fill in main.cpp |
hw/*.c | TODO: datapath definitions for the hardware accelerators |
config.yml | TODO: configs for each of the accelerators. Fill in the accelerator and scratchpad. |
To help you with your task, we have provided some sample layouts for the project.
cd $REPO
WARNING: DO NOT PROCEED IF YOU ARE UNABLE TO ANSWER
ALL OF THE QUESTIONS BELOW.
mnist/hw/hw_defines.h: what should go here? Hint: Read mnist/host/main.cpp. Hint: Read gem5-config/run_mnist.py. Hint: Note that Relu can modify in place.
mnist/host/main.cpp: Draw the memory layout. You will need temporary matrices to hold the intermediate results: the hidden_layer, scores, and final result.
mnist/hw/source/top.c: Invoke the accelerators in sequence, setting up pointers to the main memory.
mnist/hw/source/relu.c, gemm.c, argmax.c: Implement the accelerator functions.
Vary the unroll factor of the m loop in mnist/hw/gemm.c among 4, 8, 16.
Metric | Description |
---|---|
top: Runtime (debug-trace.txt) | Total execution time |
gemm: Runtime (debug-trace.txt) | gemm execution time. WARNING: For each invocation of gemm you will see a stat dump |
relu: Runtime (debug-trace.txt) | relu execution time. WARNING: For each invocation of relu you will see a stat dump |
argmax: Runtime (debug-trace.txt) | argmax execution time. WARNING: For each invocation of argmax you will see a stat dump |
Calculate the cache energy as follows:
(system.acctest.cluster_cache.demand_accesses::total from stats.txt) $\times$ $Cache_{Energy}$ + $\sum_{i=0}^{All}$ (FU Total Power $\times$ Cycles, from debug-trace.txt)
Config | Energy |
---|---|
32K | 22 pJ |
64K | 24 pJ |
128K | 27 pJ |
256K | 30 pJ |
1M | 32 pJ |
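As a purely hypothetical illustration of the formula above (the numbers are made up, not from any provided run): if stats.txt reported 1,000,000 cluster-cache demand accesses with a 32K cache, the cache component would be 1,000,000 $\times$ 22 pJ = 22 $\mu$J; to that you add, for every accelerator stat dump in debug-trace.txt, its FU Total Power $\times$ Cycles.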
To reduce cache accesses, one of the matrices in gemm can be held in registers to prevent repeated accesses to the cache, e.g., the input in Stage 1 or the hidden_layer in Stage 3.
// Datapath definition
void reg_ex(int a[], int b[], int N) {
  int tmp[N];
  for (int i = 0; i < N; i++)
    tmp[i] = b[i];
  // Use tmp.
  // tmp is a local declaration and is allocated to registers;
  // subsequent references to tmp will use the values held in registers,
  // while b remains in memory.
}
BASE Folder: dma/
The cache-based model permits the accelerators to work with data in global DRAM. This increases the energy cost of every memory access in the accelerators (even if they are cached in the cluster cache). We introduce a local scratchpad for each accelerator to make data accesses more efficient. However, an important constraint is that accelerators can only touch and work with data locally stored in their scratchpad. This makes it challenging to support Stage1:gemm and Stage3:gemm in the same hardware since they work with matrices of entirely different shapes.
This means we will have to create two different gemm accelerators in the cluster (Model 1 in the figure below).
In the baseline DMA model we assume that the scratchpad can hold the entire working set of all the required input and output matrices, i.e., one tile encompasses the entire matrix. Here, once the data is in the scratchpad, you are only limited by the number of ALUs used for the computation.
Stage1:gemm is the most arithmetically intensive, and we will explore multiplier utilization for this stage.
You can explore this by manually unrolling the outer loop for (i = 0; i < n; i++) {. What is n for Stage1:gemm?
Manually unroll for (i = 0; i < m; i++) { and report your results. Unroll the innermost loop by 392 and 784 (calculate how many multipliers you need for each case). VSCode's multi-cursor may come in handy.
Vary the unroll factor of stage1:gemm. Compare performance and energy between the different factors and the cache model.
How many multipliers does fully unrolling the m loop require in stage1:gemm and stage3:gemm?
How much data is needed to compute a single element of the output for stage1:gemm, stage2:relu, stage3:gemm and stage4:argmax?
e.g., Stage1:gemm: calculating hidden_layer[0][0] requires one row of m0 (784*4 bytes), the entire mnist_input (784*4 bytes), and 4 bytes for the result. Note that mnist_input is shared/reused for calculating all of hidden_layer[0:128][0]. Note that relu and argmax can be folded into their previous stages, as they are pure functions that can run on each element of the matrix independently. This means we can eliminate the DMAing/scratchpads for relu and argmax.
Select Model 2
# In python matrix[m:n] has m rows and n columns.
# Stage 1 + 2 Inputs: m0[128:784] * mnist_input[784:1], Output = [128:1]
# Recall that relu can be performed in-place Inputs [128,1] = Output [128,1]
hidden_layer = relu(gemm(m0, input))
# Input: [10:128] * [128:1] Output : [10:1].
# Note that argmax can check as soon as single element of hidden_layer is generated.
result = argmax(gemm(m1, hidden_layer))
To increase the computation-to-memory ratio, stage1:gemm computes hidden_layer in multiple tiles. In each batch/tile you process a subset of the rows of m0. Hint: If you think smart, you will recognize that once the input is loaded into registers, you are free to use that space in the scratchpad to store your results.
WARNING: YOU SHOULD NOT TILE BETWEEN STAGE2 and STAGE3.
i.e., collect all the tiles in STAGE 2 before starting the stage 3 computation.
Each batch/tile of N rows requires (784N + 784 + 1) * 4 bytes. If the scratchpad size is 32 KB, how many rows can you process in a batch? Note that the total number of rows may not be an even multiple of the batch size; you may have to process fewer rows in the final iteration.
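As a sanity check of the arithmetic only (assuming 4-byte integers and the expression above): $(784N + 784 + 1) \times 4 \le 32768$ gives $N \le 9.4$, i.e., at most 9 rows per batch; the 128 rows would then take 14 full batches of 9 rows plus a final batch of only 2 rows. Your actual tiling may differ.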
Scratchpad Config | Energy |
---|---|
32K | 9 pJ |
64K | 10 pJ |
128K | 12 pJ |
256K | 15 pJ |
1M | 16 pJ |
Stage2:GEMM | 2 pJ |
For MNIST, there are two additional folders:
txt/labels/ contains the true labels for each input, which are the actual digits that each corresponding input image contains.
student_inputs/ contains a script to help you run your own input images, as well as an example.
All the files for testing the mnist network are contained in inputs/mnist. There are both binary and plaintext versions of m0, m1, and 9 input files (inputs/mnist/bin/inputs/mnist_input[0-9]*.bin).
To test on the first input file for example, run the following:
This should print out the digit with the highest score. You can compare the printed digit against the one in inputs/mnist/txt/labels/label0.txt. It will also print out the output of each stage.
Not all inputs will classify properly; a neural network will practically never have 100% accuracy in its predictions. In our test cases specifically, mnist_input2 and mnist_input7 will be misclassified as 9 and 8 respectively. All other test cases should classify correctly.
To check the final label, we have also included a python script that implements MNIST:
python3 mnist.py ./inputs/mnist/bin/inputs/mnist_input0.bin ./inputs/mnist/bin/m0.bin ./inputs/mnist/bin/m1.bin
6
You can check the digit printed by main against the plaintext labels for each of the input files in the mnist/txt/labels folder.
We've also included a script, inputs/mnist/txt/print_mnist.py, which will allow you to view the actual image for every mnist input. For example, you can run the following commands from the directory inputs/mnist/txt to print the actual image for mnist_input8 as ASCII art alongside the true label.
cd inputs/mnist/txt/
python3 print_mnist.py 8
Test | Points | |
---|---|---|
cache | 100 | OR |
DMA | 100 | |
This assignment has been modified by the CMPT 750 instructor.