General references
https://github.com/drichmond/PYNQ-HLS
To find an optimal design, one needs to perform design space exploration over the hardware parameters VEC_SIZE, LANE_NUM, and CONV_GP_SIZE_X, searching for the configuration that maximizes throughput or minimizes execution time. PipeCNN provides such a framework. Study the various configurations of PipeCNN on an Altera DE1-SoC FPGA that we will provide. To set up PipeCNN on the Altera board, refer to the references below.
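As a starting point, a design space sweep can be as simple as enumerating parameter tuples and measuring each build. The sketch below is a minimal illustration of that loop; `measure_throughput` is a toy stand-in (in practice you would rebuild the PipeCNN bitstream with each knob setting and time it on the DE1-SoC), and the parameter ranges are assumptions, not PipeCNN's legal values.

```python
import itertools

# Hypothetical parameter ranges -- consult the PipeCNN docs for legal values.
VEC_SIZES = [4, 8, 16]
LANE_NUMS = [8, 16, 32]
CONV_GP_SIZES_X = [7, 14]

def measure_throughput(vec, lanes, gp_x):
    # Stand-in cost model for illustration only; in practice, rebuild the
    # PipeCNN OpenCL kernel with these knobs and benchmark on the board.
    return vec * lanes / (1.0 + 0.01 * gp_x)

best = None
for vec, lanes, gp_x in itertools.product(VEC_SIZES, LANE_NUMS, CONV_GP_SIZES_X):
    tput = measure_throughput(vec, lanes, gp_x)
    if best is None or tput > best[0]:
        best = (tput, vec, lanes, gp_x)

print("best throughput %.1f at VEC_SIZE=%d LANE_NUM=%d CONV_GP_SIZE_X=%d" % best)
```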
References
https://github.com/thinkoco/c5soc_opencl
https://github.com/thinkoco/c5soc_opencl/issues/4
Xilinx’s DNNDK is a machine learning kit for running deep neural networks efficiently on FPGAs. DNNDK’s core hardware component is the DPU (essentially a tensor arithmetic core). The main techniques that allow neural nets to run efficiently on this hardware are compression and quantization, which reduce the networks’ block RAM requirements so they fit on the FPGA. The goal of this project is to explore these techniques and the hardware designs they make feasible.
References
https://www.hackster.io/adam-taylor/machine-learning-at-the-edge-with-xilinx-dnn-developer-kit-68c672
Quantization is an important technique for improving both compute and memory energy. It reduces bitwidth, which improves compute density and lets data fit in the BRAMs available on the FPGA. Explore the design tradeoffs of quantization across different networks.
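As a concrete example, the sketch below shows plain symmetric linear quantization of a weight tensor at several bitwidths with NumPy. It is a minimal illustration of the bitwidth/accuracy tradeoff, not the scheme FINN or BNN-PYNQ actually implements.

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Symmetric linear quantization of tensor w to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 127 for int8
    scale = np.abs(w).max() / qmax      # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)
for bits in (8, 4, 2):
    q, s = quantize_symmetric(w, bits)
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"{bits}-bit: mean abs reconstruction error {err:.4f}")
```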
References
https://github.com/Xilinx/FINN/blob/master/docs/FPGA2018_Tutorial.pdf (slide 13)
https://github.com/Xilinx/BNN-PYNQ
https://github.com/Xilinx/FINN
Tangram is a Python tool for exploring energy-efficient dataflow schedules. A schedule is a specific organization of loops and parallel iterations on a 2D grid of processing elements that minimizes data movement. Sweep the various configurations for MobileNet V2 or another network and report the architectural tradeoffs in data movement and compute.
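To make "schedule" concrete, the toy model below sweeps tile sizes for a matmul and scores each schedule by its off-chip traffic. This is only meant to convey what the search looks like; nn_dataflow's actual cost model and API are far richer (loop blocking, PE-array mapping, NoC costs) and the buffer size here is an assumption.

```python
import itertools

# Toy off-chip traffic model for a tiled matmul C[M,N] += A[M,K] * B[K,N].
M, N, K = 512, 512, 512
BUF_WORDS = 16 * 1024  # assumed on-chip buffer size, in words

def traffic(tm, tn, tk):
    # A is reloaded once per N-tile, B once per M-tile; the C tile stays
    # resident, so tk only matters for the buffer constraint in this model.
    loads_a = (M * K) * (N // tn)
    loads_b = (K * N) * (M // tm)
    rw_c = 2 * M * N
    return loads_a + loads_b + rw_c

best = None
for tm, tn, tk in itertools.product([32, 64, 128], repeat=3):
    if tm * tk + tk * tn + tm * tn > BUF_WORDS:
        continue  # all three tiles must fit on chip
    t = traffic(tm, tn, tk)
    if best is None or t < best[0]:
        best = (t, tm, tn, tk)

print("min traffic %d words with tile (Tm,Tn,Tk)=%s" % (best[0], best[1:]))
```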
References
https://github.com/stanford-mast/nn_dataflow
At their core, many DNN layers can be represented as a set of deeply nested loops iterating over multiple tensors. Casting this as a compiler problem means the hardware design can be treated as a search space of loop reordering, tiling, and unrolling strategies. MAESTRO is a tool that explores exactly this space. Specify various networks such as ResNet-50 and MobileNet using MAESTRO (you may need to extend the tool) and evaluate the optimization opportunities.
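The sketch below shows two points in that schedule space for a 1-D convolution: the naive loop nest, and the same computation with the input-channel loop tiled and the loops reordered. These Python loops are purely illustrative; MAESTRO's real input is a mapping specification, not code.

```python
import numpy as np

C_IN, C_OUT, W, K, TILE = 8, 4, 32, 3, 4
x = np.random.randn(C_IN, W + K - 1)
w = np.random.randn(C_OUT, C_IN, K)

def conv_naive():
    y = np.zeros((C_OUT, W))
    for co in range(C_OUT):
        for ci in range(C_IN):
            for ox in range(W):
                for k in range(K):
                    y[co, ox] += w[co, ci, k] * x[ci, ox + k]
    return y

def conv_tiled():
    y = np.zeros((C_OUT, W))
    for ci0 in range(0, C_IN, TILE):        # tiled input-channel loop
        for ox in range(W):                 # reordered: ox hoisted above co
            for co in range(C_OUT):
                for ci in range(ci0, ci0 + TILE):
                    for k in range(K):
                        y[co, ox] += w[co, ci, k] * x[ci, ox + k]
    return y

# Same result, different schedule -- and very different data reuse on hardware.
assert np.allclose(conv_naive(), conv_tiled())
```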
References
https://github.com/georgia-tech-synergy-lab/MAESTRO
http://synergy.ece.gatech.edu/tools/maestro/
At SFU we have created our own high-level synthesis framework that translates C/C++ parallel programs to Chisel (a Scala-based hardware language from Berkeley). We have an extensive set of core machine learning operators implemented in hardware (e.g., GEMM or element-wise operations) on varied tensor shapes (currently 1D and 2D). The following set of projects deals with SFU’s ML synthesis framework.
Currently our generator produces each kernel individually (a kernel being a set of operations on tensors). This work can be extended to connect different kernels together and define the overall system design. For such a generator, we need to define a secondary, system-level IR that specifies which kernels are instantiated and how they are connected; a hypothetical sketch follows.
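Purely as a strawman (nothing below is the framework's actual IR, and all names are made up), a minimal system-level IR could name kernel instances and the streams connecting them:

```python
from dataclasses import dataclass, field

# Hypothetical system-level IR sketch -- NOT the actual SFU framework IR.
@dataclass
class Kernel:
    name: str            # instance name
    op: str              # operator from the ML library, e.g. "gemm", "relu"
    shape: tuple         # tensor shape the kernel is specialized for

@dataclass
class Edge:
    src: str             # producer kernel instance
    dst: str             # consumer kernel instance
    depth: int = 2       # FIFO/buffer depth between the two kernels

@dataclass
class System:
    kernels: list = field(default_factory=list)
    edges: list = field(default_factory=list)

# A two-stage pipeline: a GEMM kernel feeding an element-wise ReLU kernel.
sys_ir = System(
    kernels=[Kernel("mm0", "gemm", (64, 64)), Kernel("act0", "relu", (64, 64))],
    edges=[Edge("mm0", "act0", depth=4)],
)
```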
RISC-V is an open instruction set architecture that supports the addition of custom operations. Connect one of the ML operations to a RISC-V core.
To connect uIR to the RoCC interface, two pieces of work are needed. First, implement the RoCC interface inside Dandelion so that a uIR accelerator can talk with the RISC-V core. For this part, the following tasks need to be done:
In the second phase, a clean software interface needs to be defined so that users can invoke the accelerator easily instead of writing assembly.
References
https://github.com/chipsalliance/rocket-chip
https://github.com/ucb-bar/rocc-template
https://github.com/seldridge/rocket-rocc-examples
https://bitbucket.org/taylor-bsg/bsg_riscv_rocc/src/master/
Tensors are the primitive data type used in many machine learning projects. There is even a move to standardize [tensor](https://github.com/dmlc/dlpack) formats across the community.
Suppose we have an input mathematical expression over tensors, e.g. in [taco](http://tensor-compiler.org/) notation. We want to build a statically scheduled dataflow graph for the input expression using the operators provided by SFU’s ML library; a parsing sketch follows.
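The sketch below turns an infix tensor expression into postfix and then into a static list of operator nodes, in the spirit of the ply-playground reference. It is a hand-rolled shunting-yard illustration only; the actual repo uses a real parser (PLY), and the operator names "add"/"mul" are stand-ins for whatever the SFU ML library exposes.

```python
# Minimal shunting-yard sketch: infix tensor expression -> postfix -> dataflow.
PREC = {"+": 1, "*": 2}

def to_postfix(tokens):
    out, ops = [], []
    for t in tokens:
        if t in PREC:
            while ops and ops[-1] in PREC and PREC[ops[-1]] >= PREC[t]:
                out.append(ops.pop())
            ops.append(t)
        elif t == "(":
            ops.append(t)
        elif t == ")":
            while ops[-1] != "(":
                out.append(ops.pop())
            ops.pop()
        else:                      # tensor operand, e.g. "A"
            out.append(t)
    return out + ops[::-1]

def to_dataflow(postfix):
    """Turn postfix into (node_id, op, inputs) triples -- a static dataflow."""
    stack, nodes = [], []
    names = {"+": "add", "*": "mul"}
    for t in postfix:
        if t in names:
            rhs, lhs = stack.pop(), stack.pop()
            nid = f"n{len(nodes)}"
            nodes.append((nid, names[t], [lhs, rhs]))
            stack.append(nid)
        else:
            stack.append(t)
    return nodes

# A * B + D  ->  [('n0', 'mul', ['A', 'B']), ('n1', 'add', ['n0', 'D'])]
print(to_dataflow(to_postfix(["A", "*", "B", "+", "D"])))
```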
References
https://csil-git1.cs.surrey.sfu.ca/ashriram/ply-playground (convert infix to postfix)
http://tensor-compiler.org/ (tensor expression compiler)
References (a project that generates systolic-array hardware using Chisel)
https://github.com/hngenc/systolic-array
A prominent recent optimization technique for DNNs is quantization. Many language-level frameworks are interested in exposing opportunities for quantization and auto-tuning the network. Try to replicate the work linked below, or build on it, to quantize networks using the TVM framework; a sketch appears after the links.
https://github.com/uwsampl/tutorial
https://tvm.ai/2019/04/29/opt-cuda-quantized
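As a starting point, TVM's Relay provides an automatic quantization pass. The sketch below shows roughly what that flow looks like; the qconfig knobs and build API have shifted across TVM releases, so treat this as a guide to the shape of the code rather than a recipe, and check the docs for the version you install.

```python
from tvm import relay
from tvm.relay import testing

# Build a small FP32 reference workload to quantize. relay.testing ships
# reference models; ResNet-18 here is just a convenient example.
mod, params = testing.resnet.get_workload(num_layers=18, batch_size=1)

# Relay's automatic quantization pass rewrites the graph to int8 compute.
# The qconfig knobs (calibration mode, global scale, skipped layers) vary
# by TVM version -- consult the docs for the release you are using.
with relay.quantize.qconfig(global_scale=8.0):
    qmod = relay.quantize.quantize(mod, params=params)

print(qmod["main"])  # inspect the quantized Relay IR

# Compile for a target (e.g. "cuda" to follow the blog post, "llvm" for CPU)
# and benchmark against the FP32 baseline to measure the speedup.
lib = relay.build(qmod, target="llvm")
```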