Skip to main content

Github Clone Link

Errata

  • Change line 5 in ReluEXT/main.cpp to #include “ReluEXT.h”

Check Yourself

  • Do you know what a vector is ?
  • Have you completed lab 9?
  • Have you read the intrinsics doc
  • What does this function do _cs295_vload_int()?
  • Have you read common/intrin.h and familiarized yourself with vector API?

  • Can you vectorize this code
for (int i = 0; i < N ; i++) {
  c[i] = a[i] * b[i]
}

WARNING: DO NOT ATTEMPT THIS ASSIGNMENT WITHOUT COMPLETING ABOVE STEPS

Goals

  • The goal of this assignment is to allow you to explore the vector ISA using its functional simulator library we have provided.
  • Students will vectorize short functions code to gain a better understanding of how data-level parallel code maps to vector-style processors, and to practice optimizing vector code for a given implementation.
  • Further students will understand how to efficiently perform computation on 1D array data structures.

Source

We will be vectorizing five applications. Each kernel is organized into a separate folder.

CAXPY:
caxpy.h    dataset.h  gendata.py main.cpp

Relu:
Relu.h     dataset.h  gendata.py main.cpp

ReluEXT:
ReluEXT.h  dataset.h  gendata.py main.cpp

SoA:
SoA.h      dataset.h  gendata.py main.cpp

IMAX:
imax.h      dataset.h  gendata.py main.cpp

common:
helpers.h  intrin.cpp intrin.h   logger.cpp logger.h
File Description
dataset.h Contains the input arrays for the particular benchmark. DO NOT MODIFY
main.cpp Driver program that sets up inputs and checks output. DO NOT MODIFY.
Relu.h,ReluEXT.h,SoA.h,caxpy.h,imax.h Contains serial implementations that you will be vectorizing
intrin.h Header definition of vector operations. READ AND UNDERSTAND FUNCTION ARGUMENT
intrin.cpp Implementation of vector library; NOT REQUIRED tO READ. You do need to understand API

TODO 1: Relu

The relu computation is similar to the relu function in assignment 3. It filters out the positive values in an array. Here we will be writing the output vack to a different array.

Files to modify

  • Relu/Relu.h

Serial Relu

void ReluSerial(int *values, int *output, int N) 
// N%VECTOR_WIDTH = 0
{
  for (int i = 0; i < N; i++)
  {
    int x = values[i];
    if (x < 0)
    {
      output[i] = 0;
    }
    else
    {
      output[i] = x;
    }
  }
}

Vectorization suggestion

  • Load array VECTOR_WIDTH into a vector register at a time.
  • Create masks for condition.
  • Use masks to store the appropriate value to the result

Debugging Hint

  • Think about which elements do you need to store and what value you need to store to them.
cd $REPO/Relu
make Relu.bin
./Relu.bin"

TODO 2: Relu-Ext

Here we will Relu extend that to support arbitrary N. i.e., you cannot assume N is a multiple of VECTOR_WIDTH.

Files to modify

  • ReluEXT/ReluEXT.h

Serial Relu

void ReluSerial(int *values, int *output, int N) 
// N%VECTOR_WIDTH = 0
{
  for (int i = 0; i < N; i++)
  {
    int x = values[i];
    if (x < 0)
    {
      output[i] = 0;
    }
    else
    {
      output[i] = x;
    }
  }
}

Vectorization suggestion

  • Load array VECTOR_WIDTH into a vector register at a time.
  • Create masks for condition. Create masks for checking how many lanes are required. Hint: Combine the masks. Think how masks should be combined.
  • Use masks to store the appropriate value to the result

Debugging Hint

  • Only in the final iteration, if N is not a multiple of vector width then we need to combine the masks.
cd $REPO/ReluEXT
make ReluEXT.bin
./ReluEXT.bin"

TODO3: CAXPY

As a running example, we use a conditionalized AXPY kernel, CAXPY. Figure 1 shows CAXPY expressed in C as a serial loop. CAXPY takes as input an array of conditions, a scalar a, and vectors x and y, and then it computes y += ax for the elements for which the condition is true.

Files to modify

  • Caxpy/Caxpy.h

Serial Caxpy

void CAXPYSerial(int N, int cond[], int a, int X[], int Y[]) {
  int i;
  for (i = 0; i < N; i++) {
    if (cond[i]) Y[i] = a * X[i] + Y[i];
  }
}

Vectorization suggestion

  • Load array VECTOR_WIDTH into a vector register at a time.
  • Create masks for condition. Create masks for checking how many lanes are required. Hint: Combine the masks. Think how masks should be combined.
  • Use masks to perform computation and store.

Debugging hints

  • Masks have to be used by multiple operations. Performing Y[i] = a * X[i] + Y[i] requires multiple instructions.

TODO4: Sum-of-Array

Vectorizing this kernel is slightly more challenging than prior since there is a dependency and there is interaction between elements in the vector. As you can see the loop is not data parallel.

int SoASerial(int *values, int N)
{
  int sum = 0;
  for (int i = 0; i < N; i++)
  {
    sum += values[i];
  }

  return sum;
}

Check yourself

  • Why is the above code hard to vectorize
  • What is the variable which is introducing a dependency across loop iterations ?

Vectorization hints

Debugging Hints

  • In this test you will be permuting values so when debugging you may be better off checking the actual .data field in each vector register
  • If you find yourself using masks then you most likely have an incorrect implementation. This program should not require masks.

TODO5: IMAX

You will vectorize a non-traditional vector application, imax, which finds the index of max value element in an array.

// pseudo code
max = l[0];
for ( i = 0 ; i < n ; i ++) {
if ( l [ i ] > max ) {
  max = l [ i ];
  index = i
}
}

Check yourself

  • Why is the above code hard to vectorize ?
  • What is the variable which is introducing a dependency across loop iterations ?
  • Do you know what _cs295_firstbit function does ?

Vectorization hints

  • We will split the outer loop into two loops. Outer loop segments array into VLEN elements at a time
  • Inner loop finds max amongst the VLEN elements.
// GLOBAL_MAX, GLOBAL_INDEX
for (i = 0; i < n; i += VLEN) {
    for (j = 0; j < VLEN; j++) {
        find max in VLEN elements or less
        update VLEN_MAX and VLEN_INDEX
    }
  // Compare and update GLOBAL_MAX AND GLOBAL_INDEX
  // Move onto next VLEN elements. Caution: n may not be multiple of VLEN
}
  • We will be vectorizing the inner loop and find VLEN_MAX and VLEN_INDEX
  • Keep the current max value to a scalar vector register, which is initialized zero.
  • Also, keep the current index in a scalar register whose initial value is zero.
  • Load a vector and compare every element against the max.
  • For values greater than the max, pick the one of them.
  • update mask to only include elements greater than max
  • update the max and index and continue until no more elements left to compare

Grading

Test Points
relu 10
reluEXT 10
caxpy 10
soa 10
imax 20

End-to-End tests

$ bash ./scripts/localci.sh
# if you see SUCCESS and *.log.sucess then you passed. You can also check your *_Grade.json to see your tentative grade.
# If you see FAILED, then inspect *.log.failed. Check the failed section to see what tests failed.

Remember from the testing framework section that these sanity tests are not comprehensive, and you should rely on your own tests to decide whether your code is correct. Your score will be determined mostly by hidden tests that will be ran after the submission deadline has passed.