_cs295_vload_int()
Have you read common/intrin.h and familiarized yourself with the vector API?
for (int i = 0; i < N; i++) {
    c[i] = a[i] * b[i];
}
WARNING: DO NOT ATTEMPT THIS ASSIGNMENT WITHOUT COMPLETING THE ABOVE STEPS.
We will be vectorizing five applications. Each kernel is organized into a separate folder.
CAXPY:
caxpy.h dataset.h gendata.py main.cpp
Relu:
Relu.h dataset.h gendata.py main.cpp
ReluEXT:
ReluEXT.h dataset.h gendata.py main.cpp
SoA:
SoA.h dataset.h gendata.py main.cpp
IMAX:
imax.h dataset.h gendata.py main.cpp
common:
helpers.h intrin.cpp intrin.h logger.cpp logger.h
File | Description |
---|---|
dataset.h | Contains the input arrays for the particular benchmark. DO NOT MODIFY. |
main.cpp | Driver program that sets up inputs and checks output. DO NOT MODIFY. |
Relu.h, ReluEXT.h, SoA.h, caxpy.h, imax.h | Contain the serial implementations that you will be vectorizing. |
intrin.h | Header declaring the vector operations. READ AND UNDERSTAND THE FUNCTION ARGUMENTS. |
intrin.cpp | Implementation of the vector library. NOT REQUIRED TO READ, but you do need to understand the API. |
The relu computation is similar to the relu function in assignment 3. It passes non-negative values through unchanged and replaces negative values with 0. Here we will be writing the output back to a different array.
Files to modify
Serial Relu
void ReluSerial(int *values, int *output, int N)
// Assumes N % VECTOR_WIDTH == 0
{
    for (int i = 0; i < N; i++)
    {
        int x = values[i];
        if (x < 0)
        {
            output[i] = 0;
        }
        else
        {
            output[i] = x;
        }
    }
}
Vectorization suggestion
Debugging Hint
cd $REPO/Relu
make Relu.bin
./Relu.bin
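One way to picture the vectorized Relu is as four steps per chunk: a load, a compare that produces a mask, a masked write of zero, and a store. The sketch below emulates those steps with plain arrays. `kWidth`, `Vec`, `Mask`, `lanes_less_than`, `masked_set`, and `relu_blocked` are hypothetical names invented for this illustration; they are not the functions declared in common/intrin.h, so you still need to map each step onto the real API yourself.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 4;                // hypothetical stand-in for VECTOR_WIDTH
using Vec  = std::array<int, kWidth>;    // one emulated vector register
using Mask = std::array<bool, kWidth>;   // one emulated mask register

// Emulated compare: mask[l] is true where v[l] < s.
Mask lanes_less_than(const Vec& v, int s) {
    Mask m{};
    for (int l = 0; l < kWidth; ++l) m[l] = v[l] < s;
    return m;
}

// Emulated masked set: overwrite only the lanes whose mask bit is on.
void masked_set(Vec& v, int s, const Mask& m) {
    for (int l = 0; l < kWidth; ++l) {
        if (m[l]) v[l] = s;
    }
}

// Relu over whole chunks; assumes n is a multiple of kWidth.
void relu_blocked(const int* values, int* output, int n) {
    assert(n % kWidth == 0);
    for (int i = 0; i < n; i += kWidth) {
        Vec x{};
        for (int l = 0; l < kWidth; ++l) x[l] = values[i + l];   // "vector load"
        Mask negative = lanes_less_than(x, 0);                   // which lanes are < 0?
        masked_set(x, 0, negative);                              // zero only those lanes
        for (int l = 0; l < kWidth; ++l) output[i + l] = x[l];   // "vector store"
    }
}
```

Note that the branch in the serial loop disappears: every lane takes the same path, and the mask decides which lanes are actually changed.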
Here we extend Relu to support arbitrary N, i.e., you cannot assume N is a multiple of VECTOR_WIDTH.
Files to modify
Serial Relu
void ReluSerial(int *values, int *output, int N)
// N is arbitrary; it need not be a multiple of VECTOR_WIDTH
{
    for (int i = 0; i < N; i++)
    {
        int x = values[i];
        if (x < 0)
        {
            output[i] = 0;
        }
        else
        {
            output[i] = x;
        }
    }
}
Vectorization suggestion
Hint: Combine the masks. Think about how the masks should be combined.
Debugging Hint
cd $REPO/ReluEXT
make ReluEXT.bin
./ReluEXT.bin
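The mask hint can be illustrated with a small emulation: one mask marks the lanes that still fall inside the array, another marks the negative lanes, and their logical AND selects the lanes to zero. All names here (`kWidth`, `Vec`, `Mask`, `tail_mask`, `mask_and`, `relu_any_n`) are hypothetical and are not taken from common/intrin.h. Treat this as a sketch of the idea, not as the expected solution.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 4;                // hypothetical stand-in for VECTOR_WIDTH
using Vec  = std::array<int, kWidth>;    // one emulated vector register
using Mask = std::array<bool, kWidth>;   // one emulated mask register

// Lanes that still map to a valid array index: i + l < n.
Mask tail_mask(int i, int n) {
    Mask m{};
    for (int l = 0; l < kWidth; ++l) m[l] = (i + l) < n;
    return m;
}

// Lane-wise logical AND of two masks.
Mask mask_and(const Mask& a, const Mask& b) {
    Mask m{};
    for (int l = 0; l < kWidth; ++l) m[l] = a[l] && b[l];
    return m;
}

// Relu for any n >= 0; the last chunk may be only partially valid.
void relu_any_n(const int* values, int* output, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i += kWidth) {
        Mask valid = tail_mask(i, n);
        Vec x{};
        for (int l = 0; l < kWidth; ++l) {
            if (valid[l]) x[l] = values[i + l];       // masked load: never read past n
        }
        Mask negative{};
        for (int l = 0; l < kWidth; ++l) negative[l] = x[l] < 0;
        Mask zero_these = mask_and(valid, negative);  // combine: in range AND negative
        for (int l = 0; l < kWidth; ++l) {
            if (zero_these[l]) x[l] = 0;
        }
        for (int l = 0; l < kWidth; ++l) {
            if (valid[l]) output[i + l] = x[l];       // masked store: never write past n
        }
    }
}
```

In this emulation the inactive lanes are zero-filled, so the AND is not strictly required for correctness. It is kept because, in a vector API where inactive lanes may hold unspecified values, the compare result for those lanes cannot be trusted on its own.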
As a running example, we use a conditionalized AXPY kernel, CAXPY. Figure 1 shows CAXPY expressed in C as a serial loop. CAXPY takes as input an array of conditions, a scalar a, and vectors x and y, and computes y += a * x only for the elements whose condition is true; all other elements of y are left unchanged.
Files to modify
Serial Caxpy
void CAXPYSerial(int N, int cond[], int a, int X[], int Y[]) {
    int i;
    for (i = 0; i < N; i++) {
        if (cond[i]) Y[i] = a * X[i] + Y[i];
    }
}
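To see how the `if (cond[i])` branch turns into a mask, the following sketch emulates a chunked CAXPY with plain arrays. The names `kWidth`, `Vec`, `Mask`, and `caxpy_blocked` are hypothetical and do not come from common/intrin.h. The sketch also assumes N may not be a multiple of the vector width, which the text above does not state either way.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 4;                // hypothetical stand-in for VECTOR_WIDTH
using Vec  = std::array<int, kWidth>;    // one emulated vector register
using Mask = std::array<bool, kWidth>;   // one emulated mask register

// Chunked CAXPY: y[i] += a * x[i] wherever cond[i] is non-zero.
void caxpy_blocked(int n, const int* cond, int a, const int* x, int* y) {
    assert(n >= 0);
    for (int i = 0; i < n; i += kWidth) {
        // A lane is active only if it is in range AND its condition is true.
        Mask active{};
        for (int l = 0; l < kWidth; ++l) {
            active[l] = (i + l < n) && (cond[i + l] != 0);
        }
        Vec xv{}, yv{};
        for (int l = 0; l < kWidth; ++l) {           // masked loads
            if (active[l]) { xv[l] = x[i + l]; yv[l] = y[i + l]; }
        }
        for (int l = 0; l < kWidth; ++l) {           // masked multiply-add
            if (active[l]) yv[l] = a * xv[l] + yv[l];
        }
        for (int l = 0; l < kWidth; ++l) {           // masked store
            if (active[l]) y[i + l] = yv[l];
        }
    }
}
```

Lanes whose condition is false are never stored, so the corresponding elements of `y` keep their original values, which matches the serial loop.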
Vectorization suggestion
Hint: Combine the masks. Think about how the masks should be combined.
Debugging hints
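Vectorizing this kernel is slightly more challenging than the previous ones because every iteration reads and updates the same accumulator, sum, so there is a loop-carried dependency, and the elements within a vector must eventually be combined with each other. As you can see, the loop is not data parallel.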
int SoASerial(int *values, int N)
{
    int sum = 0;
    for (int i = 0; i < N; i++)
    {
        sum += values[i];
    }
    return sum;
}
You can assume the size of the array is a multiple of VECTOR_WIDTH.
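A common way around the dependency is to keep one partial sum per lane and combine the lanes only once, after the main loop. The sketch below emulates that with plain arrays. `kWidth`, `Vec`, and `sum_blocked` are hypothetical names, and the halving loop at the end is just one possible way to combine lanes; the course API may offer different primitives for this step.

```cpp
#include <array>
#include <cassert>

constexpr int kWidth = 4;                // hypothetical stand-in for VECTOR_WIDTH
using Vec = std::array<int, kWidth>;     // one emulated vector register

static_assert((kWidth & (kWidth - 1)) == 0, "halving reduction needs a power of two");

// Sum of values[0..n); assumes n is a multiple of kWidth.
int sum_blocked(const int* values, int n) {
    assert(n % kWidth == 0);
    Vec acc{};                                        // per-lane partial sums, start at 0
    for (int i = 0; i < n; i += kWidth) {
        for (int l = 0; l < kWidth; ++l) {
            acc[l] += values[i + l];                  // one "vector add" per chunk
        }
    }
    // Combine lanes by repeatedly folding the upper half onto the lower half:
    // kWidth lanes -> kWidth/2 -> ... -> 1, i.e. log2(kWidth) steps.
    for (int half = kWidth / 2; half >= 1; half /= 2) {
        for (int l = 0; l < half; ++l) {
            acc[l] += acc[l + half];
        }
    }
    return acc[0];
}
```

The main loop is now data parallel across lanes; only the final fold involves interaction between elements of the same vector.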
You will vectorize a non-traditional vector application, imax, which finds the index of the maximum-valued element in an array.
// pseudo code
max = l[0];
index = 0;
for (i = 0; i < n; i++) {
    if (l[i] > max) {
        max = l[i];
        index = i;
    }
}
What does the _cs295_firstbit function do?
// GLOBAL_MAX, GLOBAL_INDEX
for (i = 0; i < n; i += VLEN) {
    for (j = 0; j < VLEN; j++) {
        // find max in VLEN elements or less
        // update VLEN_MAX and VLEN_INDEX
    }
    // Compare and update GLOBAL_MAX and GLOBAL_INDEX
    // Move on to the next VLEN elements. Caution: n may not be a multiple of VLEN
}
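Before replacing the inner loop with vector operations, it can help to check the index bookkeeping of the blocked pseudocode in plain C++. The sketch below is a scalar rendering of that structure only. `kWidth` and `imax_blocked` are hypothetical names, and the choice to return the first index on ties is an assumption made to match the strict `>` in the serial pseudocode.

```cpp
#include <algorithm>
#include <cassert>

constexpr int kWidth = 4;   // hypothetical stand-in for VLEN / VECTOR_WIDTH

// Index of the maximum element of l[0..n); on ties, the first occurrence wins.
int imax_blocked(const int* l, int n) {
    assert(n > 0);
    int global_max = l[0];
    int global_index = 0;
    for (int i = 0; i < n; i += kWidth) {
        int lanes = std::min(kWidth, n - i);     // last chunk may be short
        int vlen_max = l[i];
        int vlen_index = i;
        for (int j = 1; j < lanes; ++j) {        // max within this chunk
            if (l[i + j] > vlen_max) {
                vlen_max = l[i + j];
                vlen_index = i + j;
            }
        }
        if (vlen_max > global_max) {             // strict '>' keeps the earliest index
            global_max = vlen_max;
            global_index = vlen_index;
        }
    }
    return global_index;
}
```

Because chunks are visited in order and both comparisons are strict, the result agrees with the serial loop even when the maximum value appears more than once.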
Test | Points |
---|---|
relu | 10 |
reluEXT | 10 |
caxpy | 10 |
soa | 10 |
imax | 20 |
$ bash ./scripts/localci.sh
# if you see SUCCESS and *.log.sucess then you passed. You can also check your *_Grade.json to see your tentative grade.
# If you see FAILED, then inspect *.log.failed. Check the failed section to see what tests failed.
Remember from the testing framework section that these sanity tests are not comprehensive, and you should rely on your own tests to decide whether your code is correct. Your score will be determined mostly by hidden tests that will be run after the submission deadline has passed.