The GEMM (General Matrix-Matrix Multiply) operation multiplies two matrices, $A$ and $B$, of dimensions $M$×$K$ and $K$×$N$, respectively; the dimension $K$ is shared between the two matrices. Implementing GEMM therefore involves a triple-nested loop that iterates over the independent dimensions $M$ and $N$ and co-iterates over the shared dimension $K$.
“Co-iteration” means that, as we loop through the process of multiplication, we’re iterating together over that shared dimension $K$ for both matrices. Depending on which level of the nested loops we choose to place this co-iteration, we get different dataflows for performing the multiplication:
Inner Product (IP): Places the co-iteration at the innermost loop. Think of it as focusing on a single element of the resulting matrix and accumulating its value by iterating over the shared dimension. This method generates a single full sum of each resulting element at a time without merging partial sums.
Outer Product (OP): Puts the co-iteration at the outermost loop. This involves multiplying a single column of $A$ with a single row of $B$ to produce an entire partial-result matrix, and then aggregating these matrices to get the final result. This method generates many partial sums at a time, requiring merging hardware and potentially increased memory traffic.
Gustavson’s (Gust): Places the co-iteration at the middle loop. This method generates partial sums only within the current output fiber (i.e., the current row of $C$) and aggregates them before moving on to the next row.
Beyond just co-iterating over $K$, we also loop over dimensions $M$ and $N$. The order in which we loop over these independent dimensions $M$ and $N$ can also affect our computation pattern. In fact, for each of the three dataflows (IP, OP, Gustavson’s), changing the loop order for $M$ and $N$ can give us two distinct variants. That’s why we end up with six possible variations.
The importance of choosing one variant over another largely comes down to data reuse. In computing, it is efficient to reuse data that is already loaded into fast memory (such as a cache). By keeping a particular operand stationary (i.e., close to the processing elements), we can better reuse its data, making the computation more efficient. The three dataflows are depicted in the figure below. Image Credit
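As a rough illustration (plain C++ with illustrative names, not the assignment's required interface), the three dataflows differ only in where the shared $k$ loop sits:

```cpp
// Sketch: the same GEMM with the shared k loop placed at three different levels.
// A is M x K, B is K x N, C is M x N (all row-major); C is assumed zero-initialized.

// Inner Product (MNK): k innermost -- one complete sum per element of C.
void gemm_mnk(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n)
            for (int k = 0; k < K; ++k)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
}

// Outer Product (KMN): k outermost -- each k contributes a rank-1 update to all of C.
void gemm_kmn(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int k = 0; k < K; ++k)
        for (int m = 0; m < M; ++m)
            for (int n = 0; n < N; ++n)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
}

// Gustavson's (MKN): k in the middle -- partial sums stay within one row of C.
void gemm_mkn(const float *A, const float *B, float *C, int M, int N, int K) {
    for (int m = 0; m < M; ++m)
        for (int k = 0; k < K; ++k)
            for (int n = 0; n < N; ++n)
                C[m * N + n] += A[m * K + k] * B[k * N + n];
}
```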
cd $REPO
git clone https://github.com/CMPT-7ARCH-SFU/Ass-Locality-Inputs.git inputs
cd inputs; git lfs pull
# The sample includes smaller inputs. The input formats are:
# bcsr/ -> sparse matrices
# dense/ -> dense matrices
# sample/ -> smaller matrices
# File naming format:
# square matrices: [bcsr|dense]_[nrows]_[density in %].bin
# e.g., bcsr_4_1.bin : 4x4 bcsr, 1% density. Tile size for all matrices is 2.
# rectangular matrices: [bcsr|dense]_[nrows]_[ncols]_[density in %].bin
You can also generate your own test inputs:
# Create dense matrix of size 4x4
python3 bcsr/generate_blocked.py --create-dense 1 -nrows 4 -ncols 4 -output dense.bin
# Print dense matrix
python3 bcsr/generate_blocked.py --print-dense 1 -input dense.bin
Files | Description |
---|---|
gemm/tiled_gemm.cpp | You will implement the GEMM methods for dense matrices in this file. |
gemm/dense_dense_mm.py | This is a Python interface for invoking the dense GEMM methods. |
bcsr/tiled_spmm.cpp | You will implement the GEMM methods for sparse-dense matrices in this file. |
bcsr/bcsr_dense_mm.py | This is a Python interface for invoking the sparse-dense GEMM methods. |
cd $REPO
export PYTHONPATH=$PWD/bcsr/:$PYTHONPATH
# Generate a small dense matrix of dimensions 4x4
python3 bcsr/generate_blocked.py --create-dense 1 -nrows 4 -ncols 4 -output dense_small.bin
# Print the generated matrix
python3 bcsr/generate_blocked.py --print-dense 1 -input dense_small.bin
# Compile the CPP file containing GEMM code
g++ -O3 -std=c++11 -shared -o dense_mult.so -fPIC gemm/tiled_gemm.cpp
# Multiply the generated matrix by itself.
python3 gemm/dense_dense_mm.py --bin ./dense_mult.so --input_a dense_small.bin --input_b dense_small.bin -o result.bin
REPORT.md and REPORT.pdf: REPORT.md should contain answers to all the questions and plots in markdown format; convert REPORT.md to REPORT.pdf (you can use the VSCode Markdown PDF extension to convert markdown to PDF). Submit REPORT.pdf on Canvas at https://canvas.sfu.ca/courses/86832/assignments/1010537.

| Dataflow | Stationary | Tiled | Streaming | A format | B preferred format | C format |
| -------------------- | ---------- | ----- | --------- | -------- | ------------------ | -------- |
| MNK Inner Product(M) | C | A | B | Row | Col | Row |
| KMN Outer Product(M) | A | B | C | Column | Row | Row |
| MKN Gustavson’s(M) | A | C | B | Row | Row | Row |
To improve the performance of our transpose, we can use a technique called cache blocking. Cache blocking attempts to reduce the cache miss rate by improving the temporal and/or spatial locality of memory accesses. To achieve this, we will transpose the matrix one “block” at a time, making sure that the width of a block, in bytes, fits within a single cache line. The following figures compare transposition without blocking (the base case) and with blocking. It is worth noting that the effectiveness of tiled matrix transposition depends on factors such as the size of the matrix, cache sizes, and memory access patterns. For example, if the matrix is small enough to fit in the cache, the tiled approach may not offer any performance benefit. Similarly, if the matrix is already stored in a contiguous memory region, the tiled approach may not be necessary.
The figures below illustrate the process of tiled matrix transposition and the resulting cache-miss patterns:
Naive Transpose
Miss Pattern of the Naive Transpose
Blocked Transpose
Miss Pattern of the Blocked Transpose
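A minimal sketch of a cache-blocked transpose, assuming a square row-major matrix of floats and a hypothetical blockSize chosen so that a block row fits in a cache line (the function name and signature are illustrative):

```cpp
#include <algorithm>

// Sketch: cache-blocked transpose of an N x N row-major matrix (hypothetical names).
void transpose_blocked(const float *in, float *out, int N, int blockSize) {
    for (int ii = 0; ii < N; ii += blockSize) {
        for (int jj = 0; jj < N; jj += blockSize) {
            // Transpose one blockSize x blockSize tile; clamp at the matrix edges.
            int iMax = std::min(ii + blockSize, N);
            int jMax = std::min(jj + blockSize, N);
            for (int i = ii; i < iMax; ++i)
                for (int j = jj; j < jMax; ++j)
                    out[j * N + i] = in[i * N + j];
        }
    }
}
```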
In this task, you will parallelize the outer loop of the GEMM operation. The idea is to distribute the rows of the output matrix among the available threads. This distribution can be viewed as decomposing the problem into smaller sub-tasks. Please refer to the OpenMP tutorials below for more details.
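As a rough sketch, the row loop can be distributed with a single OpenMP pragma (the function name and flat row-major layout are assumptions, not the assignment's required interface):

```cpp
#include <omp.h>

// Sketch: naive GEMM with the outer (row) loop distributed across threads.
// A is M x K, B is K x N, C is M x N, all row-major.
void gemm_parallel(const float *A, const float *B, float *C, int M, int N, int K) {
    #pragma omp parallel for   // each thread computes a disjoint set of rows of C
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k)
                sum += A[m * K + k] * B[k * N + n];
            C[m * N + n] = sum;
        }
}
```

Remember to add -fopenmp to the g++ command shown earlier when compiling OpenMP code.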
Tiling, also known as blocking, is a common optimization technique used in matrix multiplication, especially in the context of General Matrix Multiply (GEMM) in numerical linear algebra. The idea behind tiling is to improve data locality by dividing matrices into smaller sub-matrices or “tiles”, and then performing the multiplication operation on these smaller chunks. Tiling is one method to make the multiplication more cache-friendly by ensuring that the data being operated on fits inside the fast cache memory and can be reused before it’s evicted.
Modern processors have a memory hierarchy that consists of different types of memory/storage units. This includes registers, different levels of cache (L1, L2, L3, etc.), main memory (RAM), and possibly others. Access time varies greatly between these units; for instance, accessing the L1 cache is much faster than accessing the main memory.
In this task, you will implement tiling for the inner loops of the GEMM operation. The idea is to divide the matrices into smaller tiles and then perform the multiplication on these tiles. The following pseudo-code and figure illustrate the concept of tiling for the inner loops of GEMM. You need to also parallelize the outer loop using either OpenMP or pthreads.
function Tiled_Inner(A, B, C, M, N, K, blockSize):
for m from 0 to M step blockSize:
for n from 0 to N step blockSize:
for k from 0 to K step blockSize:
// Calculate upper bounds for each block, clamped to the matrix edges
mUpper = min(m + blockSize, M)
nUpper = min(n + blockSize, N)
kUpper = min(k + blockSize, K)
for i from m to mUpper:
for j from n to nUpper:
temp = 0
for l from k to kUpper:
temp += A[i][l] * B[l][j]
C[i][j] += temp
end function
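A minimal C++ translation of this pseudo-code might look like the following (the function name and flat row-major layout are assumptions; adapt it to the interface in gemm/tiled_gemm.cpp). The outer m loop can be parallelized with the same OpenMP pragma shown earlier.

```cpp
#include <algorithm>

// Sketch: tiled GEMM over flat row-major arrays (hypothetical signature).
// A is M x K, B is K x N, C is M x N and assumed zero-initialized.
void tiled_gemm(const float *A, const float *B, float *C,
                int M, int N, int K, int blockSize) {
    for (int m = 0; m < M; m += blockSize)
        for (int n = 0; n < N; n += blockSize)
            for (int k = 0; k < K; k += blockSize) {
                int mUpper = std::min(m + blockSize, M);
                int nUpper = std::min(n + blockSize, N);
                int kUpper = std::min(k + blockSize, K);
                for (int i = m; i < mUpper; ++i)
                    for (int j = n; j < nUpper; ++j) {
                        float temp = 0.0f;
                        for (int l = k; l < kUpper; ++l)
                            temp += A[i * K + l] * B[l * N + j];
                        C[i * N + j] += temp;   // accumulate partial sums across k-blocks
                    }
            }
}
```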
Sparse matrices are matrices that are mostly filled with zero (or another default value) and have only a few non-zero elements. Representing them in the conventional dense format is often inefficient in terms of both memory and computation. That’s where sparse storage formats, such as BSR, come in handy.
Here’s a breakdown of the BSR format. In the BSR format, the sparse matrix is divided into small dense sub-matrices called blocks. These blocks have a fixed size $R \times C$ , where $R$ is the row block size and $C$ is the column block size. These blocks are then stored in a compressed sparse row format.
Use Cases: BSR is particularly efficient when non-zero elements are clustered into small dense blocks. If the matrix exhibits this kind of structure, the BSR format can offer better performance than other formats. The figure below illustrates the BCSR format.
BSR format uses three main data arrays:
Members | Description |
---|---|
blockValues | This array stores the dense blocks. It has a shape of (nnz_blocks, R, C), where nnz_blocks is the number of non-zero blocks. |
Colidx | This is a 1D array of block-column indices that specify the column ids of the non-zero blocks within each block row. |
RowPtr | Similar to the CSR format, the RowPtr array defines the starting and ending index of each block row within the Colidx and blockValues arrays. |
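As a concrete, hand-worked illustration (not a file produced by the provided scripts), a hypothetical 4×4 matrix with 2×2 blocks and two non-zero blocks would be stored as:

```cpp
// Hand-worked BSR example for a hypothetical 4x4 matrix with 2x2 blocks.
// Dense matrix:        Block view (2x2 block grid):
//   1 2 0 0              [B00] [ 0 ]
//   3 4 0 0              [ 0 ] [B11]
//   0 0 0 0
//   0 0 5 6
float blockValues[2][2][2] = { {{1, 2}, {3, 4}},     // B00 at block row 0, block col 0
                               {{0, 0}, {5, 6}} };   // B11 at block row 1, block col 1
int   Colidx[2] = {0, 1};      // block-column of each stored block
int   RowPtr[3] = {0, 1, 2};   // block row r owns blocks [RowPtr[r], RowPtr[r+1])
```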
We have provided a python script for generating and transforming BSR matrices. You can use this script to generate BSR matrices from dense matrices and vice versa. You can also print the generated BSR matrices. The script also supports generating BSR matrices with a specific block size. The following commands illustrate how to use the script:
# Create BCSR matrix. 50% density. Block size : 2
python3 bcsr/generate_blocked.py --create-csr 1 -nrows 4 -ncols 4 -density 0.5 -blocksize 2 -output bcsr_half_density.bin
# Print BCSR matrix
python3 bcsr/generate_blocked.py --print-csr 1 -input bcsr_half_density.bin
# Convert to block size 4. Input block size is not required. It is encoded in the file.
python3 bcsr/generate_blocked.py --transform-bcsr 1 -input bcsr_half_density.bin -blocksize 4 -output bcsr_4.bin
# Convert Dense to BCSR matrix with block size 2.
python3 bcsr/generate_blocked.py --transform-dense 1 -input dense_small.bin -blocksize 2 -output bcsr_2.bin
To run the testbench:
# Make sure the bcsr and dense have compatible columns and rows. This Python
# script multiplies a Blocked CSR matrix by a dense matrix (in that order), and
# stores the result in result.bin
python3 bcsr/bcsr_dense_mm.py -bcsr bcsr.bin -dense dense.bin -o result.bin
# To run the C++ version (which you have to implement in bcsr/tiled_spmm.cpp), run:
g++ -shared -o bcsr_mult.so -fPIC bcsr/tiled_spmm.cpp
python3 bcsr/bcsr_dense_mm.py -bin bcsr_mult.so -bcsr bcsr.bin -dense dense.bin -o result.bin
Let’s break down why Sparse Matrix Multiplication (SpMM) can be more favorable than dense general matrix-matrix multiplication (GEMM) for certain matrices.
Your task is to implement Gustavson’s dataflow for SpMM. We provide an implementation in bcsr/bcsr_dense_mm.py that you can use to test your implementation.
The code provided in bcsr/bcsr_dense_mm.py iterates over block rows. Given that each block row can be processed independently, there’s a clear opportunity for parallelization.
Here’s a conceptual breakdown of how to parallelize the outer loop using OpenMP:
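The sketch below conveys the idea, assuming a hypothetical routine over the BSR arrays described above; the actual loop structure and interface in bcsr/tiled_spmm.cpp may differ:

```cpp
// Sketch: BSR x dense SpMM with the block-row loop parallelized (illustrative names).
// blockValues holds nnz_blocks dense R x Cblk blocks; B is a dense K x N matrix;
// Cout is the dense M x N result, assumed zero-initialized.
void bsr_spmm(const float *blockValues, const int *Colidx, const int *RowPtr,
              int numBlockRows, int R, int Cblk,
              const float *B, float *Cout, int N) {
    #pragma omp parallel for   // each block row writes a disjoint set of output rows
    for (int br = 0; br < numBlockRows; ++br) {
        for (int idx = RowPtr[br]; idx < RowPtr[br + 1]; ++idx) {  // non-zero blocks in this row
            const float *blk = blockValues + idx * R * Cblk;
            int bc = Colidx[idx];                                   // block-column index
            for (int i = 0; i < R; ++i)                             // row within the block
                for (int k = 0; k < Cblk; ++k)                      // column within the block
                    for (int j = 0; j < N; ++j)
                        Cout[(br * R + i) * N + j] +=
                            blk[i * Cblk + k] * B[(bc * Cblk + k) * N + j];
        }
    }
}
```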
For load balancing, the default behavior of OpenMP (which uses equal-sized chunks) might be sufficient. However, if the blocks have widely varying numbers of non-zeros, then a dynamic scheduling approach may be more appropriate. Refer to the OpenMP tutorials linked above for more details about scheduling patterns.
Convolution filters, commonly known as kernels, play an integral role in image processing tasks such as blurring, sharpening, embossing, and edge detection. The operation involves convolving a kernel with an image. These kernels are often 3x3 matrices, and their mathematical representation in the convolution process is:

$$g(x,y) = (w * f)(x,y) = \sum_{i=-a}^{a} \sum_{j=-b}^{b} w(i,j)\, f(x-i, y-j)$$

where $g(x,y)$ is the resulting filtered image, $f(x,y)$ represents the original image, and $w$ is the kernel.
To elucidate this operation, consider a set of pixels from the image’s top-left corner. When applying a 3×3 convolution kernel to the pixel at position (1,1):
The filter evaluates values surrounding the target pixel, spanning from (x-1, y-1) to (x+1, y+1). These values are then multiplied by their respective elements in the kernel matrix. The results are summed up to produce the new pixel value.
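For a single interior pixel, the computation looks roughly like this (illustrative names; convolution.cpp defines its own interface):

```cpp
// Sketch: one output pixel of a 3x3 filter on a grayscale, row-major image.
// (Written as cross-correlation; true convolution flips the kernel, which only
//  matters for asymmetric kernels.)
float filter_pixel(const float *img, int W, const float kernel[3][3], int x, int y) {
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)        // neighbours from (x-1, y-1) to (x+1, y+1)
        for (int dx = -1; dx <= 1; ++dx)
            sum += kernel[dy + 1][dx + 1] * img[(y + dy) * W + (x + dx)];
    return sum;
}
```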
This convolution is executed across the image, save for certain edge pixels. A question arises: How do we handle edge pixels, such as the one at (0,0)? There are several techniques to address this, including extending, wrapping, mirroring, or cropping the image (or occasionally the kernel). If this piques your curiosity, a detailed explanation is available here.
Kernels can produce varied effects depending on their values. Many of these filters are identifiable by specific names. For a visual illustration, refer to the Wikipedia article linked above.
In the following tasks, we implement convolutions using two approaches.
We provide a Python script for testing your implementation. You can use the following commands to generate and print the input images. We also provide a Canny edge detection implementation, a common image-processing task that relies heavily on convolutions.
cd $REPO/conv
# Here is a Python testbench for your implementation in C++ (convolution.cpp)
# convolution.cpp contains the C++ code you will be updating.
# convolution.cpp will be compiled to a shared library, similar to previous tasks.
# conv_test.py expects a shared library called "convolution.so"
g++ -shared -o convolution.so -fPIC convolution.cpp
python3 conv_test.py
# Run Canny Edge Detection on F.jpg. This will output an image "F_gray.jpg".
python3 main.py --path ./F.jpg
Convolution can be tiled, similar to matrix multiplication, to improve data reuse and reduce memory traffic. The idea is to divide the image into smaller tiles and perform the convolution on these tiles. In this task, you are required to modify the implementation we provide in convolution.cpp to perform tiled convolutions. The following gif illustrates the concept of tiled convolutions.
A special case of tiling can be employed where the tile size is 1×N; this is called a line buffer. The line buffer is a common optimization technique used in convolutional neural networks (CNNs) to improve data reuse and reduce memory traffic. The idea is to store lines of pixels from the image in a buffer and reuse them for multiple convolutions. Line buffers permit the flat image array to be streamed in one row at a time, which improves prefetching and caching. This is illustrated in the figure below. The height of the line buffer depends on the size of the filter: a 3x3 filter requires a line buffer of height 3, while a 5x5 filter requires a line buffer of height 5. Implement line buffering and compare it to the tiled convolutions.
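A rough sketch of the idea for a 3x3 kernel on a grayscale, row-major image (names and the circular-buffer layout are assumptions, not the required implementation):

```cpp
#include <algorithm>
#include <vector>

// Sketch: 3x3 filtering using a 3-row line buffer (correlation form, edges skipped).
void conv_linebuffer(const float *img, float *out, int H, int W, const float k[3][3]) {
    std::vector<float> lines(3 * W);                 // circular buffer of the last 3 rows
    for (int y = 0; y < H; ++y) {
        // Stream in row y, overwriting the oldest buffered row.
        std::copy(img + y * W, img + (y + 1) * W, lines.begin() + (y % 3) * W);
        if (y < 2) continue;                         // need 3 rows before producing output
        int cy = y - 1;                              // center row of the buffered window
        const float *r0 = &lines[((cy - 1) % 3) * W];
        const float *r1 = &lines[( cy      % 3) * W];
        const float *r2 = &lines[((cy + 1) % 3) * W];
        for (int x = 1; x < W - 1; ++x)
            out[cy * W + x] = k[0][0]*r0[x-1] + k[0][1]*r0[x] + k[0][2]*r0[x+1]
                            + k[1][0]*r1[x-1] + k[1][1]*r1[x] + k[1][2]*r1[x+1]
                            + k[2][0]*r2[x-1] + k[2][1]*r2[x] + k[2][2]*r2[x+1];
    }
}
```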
Finally, convolutions can be implemented as a gemm by preprocessing the input image. im2col is a common technique used to convert the image into a matrix. The idea is to extract patches from the image and store them as columns in a matrix. The size of the patches is determined by the kernel size. The resulting matrix is then multiplied with the flattened kernel matrix to produce the output image. This is illustrated in Figure below.
The illustration is for an RGB image; however, in this assignment you are expected to handle only grayscale images, i.e., you can assume the number of channels is always 1. Furthermore, the illustration shows multiple filters (as in typical DNNs); here we only have a single filter, which eliminates the need for a filter matrix.
You can assume the filter matrices are fixed and known at compile time. You don’t have to measure the time taken for them; we only use a single filter for any given invocation.
You can leverage the GEMM implementation from Module 1. However, note that in Module 1 you may have tested with square matrices; here the matrix can be tall and narrow. Furthermore, the filter matrix does not need to be stored explicitly as a matrix, since the filter values simply repeat: you can store the filter values in a small array and use it to multiply with the im2col matrix.
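A minimal sketch of im2col for a single-channel image and a 3x3 kernel (hypothetical names; “valid” output size with no padding):

```cpp
#include <vector>

// Sketch: im2col for a grayscale H x W image and a 3x3 kernel (no padding, stride 1).
// Each output pixel becomes one column of 9 patch values; the result has
// 9 rows and (H-2)*(W-2) columns, stored row-major.
std::vector<float> im2col_3x3(const float *img, int H, int W) {
    int outH = H - 2, outW = W - 2;
    std::vector<float> cols(9 * outH * outW);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int ky = 0; ky < 3; ++ky)
                for (int kx = 0; kx < 3; ++kx)
                    cols[(ky * 3 + kx) * (outH * outW) + y * outW + x] =
                        img[(y + ky) * W + (x + kx)];
    return cols;
}
// The convolution then reduces to a (1 x 9) x (9 x outH*outW) GEMM:
// out[p] = sum_k flatKernel[k] * cols[k * outH*outW + p]
```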
Your task is to implement im2col and compare the performance of convolution as a GEMM with the sequential convolution.
You may want to run your GEMM-based convolution through the provided Canny Edge Detection implementation to verify correctness.
Task | Points |
---|---|
GEMM | 30 |
BCSR | 60 |
CONV | 60 |