General references
https://github.com/drichmond/PYNQ-HLS
To find an optimal design, one needs to perform design space exploration over the hardware parameters VEC_SIZE, LANE_NUM, and CONV_GP_SIZE_X, searching for the configuration that maximizes throughput or minimizes execution time. PipeCNN provides such a framework. Study the various configurations of PipeCNN on an Altera DE1-SoC FPGA that we will provide. To set up PipeCNN on the Altera board, refer to the references below.
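As a starting point, a design space sweep can be as simple as enumerating parameter tuples and measuring each build. The sketch below is a minimal illustration of that loop; `measure_throughput` is a toy stand-in (in practice you would rebuild the PipeCNN bitstream with each knob setting and time it on the DE1-SoC), and the parameter ranges are assumptions, not PipeCNN's legal values.

```python
import itertools

# Hypothetical parameter ranges -- consult the PipeCNN docs for legal values.
VEC_SIZES = [4, 8, 16]
LANE_NUMS = [8, 16, 32]
CONV_GP_SIZES_X = [7, 14]

def measure_throughput(vec, lanes, gp_x):
    # Stand-in cost model for illustration only; in practice, rebuild the
    # PipeCNN OpenCL kernel with these knobs and benchmark on the board.
    return vec * lanes / (1.0 + 0.01 * gp_x)

best = None
for vec, lanes, gp_x in itertools.product(VEC_SIZES, LANE_NUMS, CONV_GP_SIZES_X):
    tput = measure_throughput(vec, lanes, gp_x)
    if best is None or tput > best[0]:
        best = (tput, vec, lanes, gp_x)

print("best throughput %.1f at VEC_SIZE=%d LANE_NUM=%d CONV_GP_SIZE_X=%d" % best)
```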
References
https://github.com/thinkoco/c5soc_opencl
https://github.com/thinkoco/c5soc_opencl/issues/4
Xilinx’s DNNDK is a machine learning kit for running deep neural networks efficiently on FPGAs. DNNDK’s core hardware component is the DPU (essentially a tensor arithmetic core). The main techniques that allow neural nets to run efficiently on this hardware are compression and quantization, which reduce the networks’ block RAM requirements so they fit on the FPGA. The goal of this project is to explore these techniques and the hardware designs they make feasible.
References
https://www.hackster.io/adam-taylor/machine-learning-at-the-edge-with-xilinx-dnn-developer-kit-68c672
Quantization is an important technique for improving both compute and memory energy. It reduces bitwidth, which improves compute density and lets data fit in the BRAMs available on the FPGA. Explore the design tradeoffs of quantization across different networks.
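As a concrete example, the sketch below shows plain symmetric linear quantization of a weight tensor at several bitwidths with NumPy. It is a minimal illustration of the bitwidth/accuracy tradeoff, not the scheme FINN or BNN-PYNQ actually implements.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric linear quantization of tensor w to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = np.abs(w).max() / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
for bits in (8, 4, 2):
    q, s = quantize_symmetric(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.4f}")
```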
References
https://github.com/Xilinx/FINN/blob/master/docs/FPGA2018_Tutorial.pdf (slide 13)
https://github.com/Xilinx/BNN-PYNQ
https://github.com/Xilinx/FINN
Tangram is a Python tool for exploring energy-efficient dataflow schedules. A schedule is a specific organization of loops and parallel iterations on a 2D grid of processing elements that minimizes data movement. Sweep the various configurations for MobileNet V2 or another network and report the architectural tradeoffs in data movement and compute.
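To make "schedule" concrete, the toy model below sweeps tile sizes for a matmul and scores each schedule by its off-chip traffic. This is only meant to convey what the search looks like; nn_dataflow's actual cost model and API are far richer (loop blocking, PE-array mapping, NoC costs) and the buffer size here is an assumption.

```python
import itertools

# Toy off-chip traffic model for a tiled matmul C[M,N] += A[M,K] * B[K,N].
M, N, K = 512, 512, 512
BUF_WORDS = 16 * 1024  # assumed on-chip buffer size, in words

def traffic(tm, tn, tk):
    # A is reloaded once per N-tile, B once per M-tile; the C tile stays
    # resident, so tk only matters for the buffer constraint in this model.
    loads_a = (M * K) * (N // tn)
    loads_b = (K * N) * (M // tm)
    rw_c = 2 * M * N
    return loads_a + loads_b + rw_c

best = None
for tm, tn, tk in itertools.product([32, 64, 128], repeat=3):
    if tm * tk + tk * tn + tm * tn > BUF_WORDS:
        continue  # all three tiles must fit on chip
    t = traffic(tm, tn, tk)
    if best is None or t < best[0]:
        best = (t, tm, tn, tk)

print("min traffic %d words with tile (Tm,Tn,Tk)=%s" % (best[0], best[1:]))
```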
References
https://github.com/stanford-mast/nn_dataflow
At their core, many DNN layers can be represented as a set of deeply nested loops iterating over multiple tensors. Casting this as a compiler problem means the hardware design can be treated as a search space of loop reordering, tiling, and unrolling strategies. MAESTRO is a tool that explores exactly this space. Specify various networks such as ResNet-50 and MobileNet using MAESTRO (you may need to extend the tool) and evaluate the optimization opportunities.
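The sketch below shows two points in that schedule space for a 1-D convolution: the naive loop nest, and the same computation with the input-channel loop tiled and the loops reordered. These Python loops are purely illustrative; MAESTRO's real input is a mapping specification, not code.

```python
import numpy as np

C_IN, C_OUT, W, K, TILE = 8, 4, 32, 3, 4
x = np.random.randn(C_IN, W + K - 1)
w = np.random.randn(C_OUT, C_IN, K)

def conv_naive():
    y = np.zeros((C_OUT, W))
    for co in range(C_OUT):
        for ci in range(C_IN):
            for ox in range(W):
                for k in range(K):
                    y[co, ox] += w[co, ci, k] * x[ci, ox + k]
    return y

def conv_tiled():
    y = np.zeros((C_OUT, W))
    for ci0 in range(0, C_IN, TILE):        # tiled input-channel loop
        for ox in range(W):                 # reordered: ox hoisted above co
            for co in range(C_OUT):
                for ci in range(ci0, ci0 + TILE):
                    for k in range(K):
                        y[co, ox] += w[co, ci, k] * x[ci, ox + k]
    return y

# Same result, different schedule -- and very different data reuse on hardware.
assert np.allclose(conv_naive(), conv_tiled())
```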
References
https://github.com/georgia-tech-synergy-lab/MAESTRO
http://synergy.ece.gatech.edu/tools/maestro/
At SFU we have created our own high-level synthesis framework that translates C/C++ parallel programs to Chisel (a Scala-based hardware language from Berkeley). We have an extensive set of core machine learning operators implemented in hardware (e.g., GEMM or element-wise operations) on varied tensor shapes (currently 1D and 2D). The following set of projects deals with SFU’s ML synthesis framework.
Currently our generator produces each kernel individually (a kernel being a set of operations on tensors). This work can be extended to connect different kernels together and define the overall system design. For such a generator, we need to define a secondary, system-level IR that specifies which kernels are instantiated and how they are connected; a hypothetical sketch follows.
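Purely as a strawman (nothing below is the framework's actual IR, and all names are made up), a minimal system-level IR could name kernel instances and the streams connecting them:

```python
from dataclasses import dataclass, field

# Hypothetical system-level IR sketch -- NOT the actual SFU framework IR.
@dataclass
class Kernel:
    name: str            # instance name
    op: str              # operator from the ML library, e.g. "gemm", "relu"
    shape: tuple         # tensor shape the kernel is specialized for

@dataclass
class Edge:
    src: str             # producer kernel instance
    dst: str             # consumer kernel instance
    depth: int = 2       # FIFO/buffer depth between the two kernels

@dataclass
class System:
    kernels: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# A two-stage pipeline: a GEMM kernel feeding an element-wise ReLU kernel.
sys_ir = System(
    kernels=[Kernel("mm0", "gemm", (64, 64)), Kernel("act0", "relu", (64, 64))],
    edges=[Edge("mm0", "act0", depth=4)],
)
```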
RISC-V is an open instruction set architecture that supports the addition of custom operations. Connect one of the ML operations to a RISC-V core.
To connect uIR to the RoCC interface, two pieces of work are needed. First, implement the RoCC interface inside Dandelion so that a uIR accelerator can talk with the RISC-V core. For this part, the following tasks need to be done:
In the second phase, a clean software interface needs to be defined so that users can invoke the accelerator easily instead of writing assembly.
References
https://github.com/chipsalliance/rocket-chip
https://github.com/ucb-bar/rocc-template
https://github.com/seldridge/rocket-rocc-examples
https://bitbucket.org/taylor-bsg/bsg_riscv_rocc/src/master/
Tensors are the primitive data type used in many machine learning projects. There is even a move to standardize [tensor](https://github.com/dmlc/dlpack) formats across the community.
Suppose we have an input mathematical expression over tensors, e.g. in [taco](http://tensor-compiler.org/) notation. We want to build a statically scheduled dataflow graph for the input expression using the operators provided by SFU’s ML library; a parsing sketch follows.
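The sketch below turns an infix tensor expression into postfix and then into a static list of operator nodes, in the spirit of the ply-playground reference. It is a hand-rolled shunting-yard illustration only; the actual repo uses a real parser (PLY), and the operator names "add"/"mul" are stand-ins for whatever the SFU ML library exposes.

```python
# Minimal shunting-yard sketch: infix tensor expression -> postfix -> dataflow.
PREC = {"+": 1, "*": 2}

def to_postfix(tokens):
    out, ops = [], []
    for t in tokens:
        if t in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()
        else:                      # tensor operand, e.g. "A"
            out.append(t)
    return out + ops[::-1]

def to_dataflow(postfix):
    """Turn postfix into (node_id, op, inputs) triples -- a static dataflow."""
    stack, nodes = [], []
    names = {"+": "add", "*": "mul"}
    for t in postfix:
        if t in names:
            rhs, lhs = stack.pop(), stack.pop()
            nid = f"n{len(nodes)}"
            nodes.append((nid, names[t], [lhs, rhs]))
            stack.append(nid)
        else:
            stack.append(t)
    return nodes

# A * B + D  ->  [('n0', 'mul', ['A', 'B']), ('n1', 'add', ['n0', 'D'])]
print(to_dataflow(to_postfix(["A", "*", "B", "+", "D"])))
```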
References
https://csil-git1.cs.surrey.sfu.ca/ashriram/ply-playground (convert infix to postfix)
http://tensor-compiler.org/ (tensor expression compiler)
References (a project that generates systolic-array hardware using Chisel)
https://github.com/hngenc/systolic-array
A prominent recent optimization technique for DNNs is quantization. Many language-level frameworks are interested in exposing opportunities for quantization and auto-tuning the network. Try to replicate the work linked below, or build on it, to quantize networks using the TVM framework; a sketch appears after the links.
https://github.com/uwsampl/tutorial
https://tvm.ai/2019/04/29/opt-cuda-quantized
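As a starting point, TVM's Relay provides an automatic quantization pass. The sketch below shows roughly what that flow looks like; the qconfig knobs and build API have shifted across TVM releases, so treat this as a guide to the shape of the code rather than a recipe, and check the docs for the version you install.

```python
from tvm import relay
from tvm.relay import testing

# Build a small FP32 reference workload to quantize. relay.testing ships
# reference models; ResNet-18 here is just a convenient example.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

# Relay's automatic quantization pass rewrites the graph to int8 compute.
# The qconfig knobs (calibration mode, global scale, skipped layers) vary
# by TVM version -- consult the docs for the release you are using.
with relay.quantize.qconfig(global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params=params)

print(qmod["main"])  # inspect the quantized Relay IR

# Compile for a target (e.g. "cuda" to follow the blog post, "llvm" for CPU)
# and benchmark against the FP32 baseline to measure the speedup.
lib = relay.build(qmod, target="llvm")
```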