Gitlab : https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p1.git
git clone https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p1.git
The first part and the second part are each worth 25 points, and the third part is worth 50 points. The instructions for each part of the lab can be found in the README files in each of the corresponding directories.
answers.txt should use the following format:
Part 1 Design Latency: XXX
Part 2 Design Latency: XXX
Part 3 PROGRAM_LOOP Iteration Latency: XXX
You can directly obtain a VM image with the Xilinx Vivado toolchains pre-installed here. This VM image will run in VMWare Fusion (OSX) or VMWare Workstation Player (Windows) which you can obtain for free via the CS SFU VMWare Academic program. To log in to the VM, the password is “deeplearning”.
In order to update your environment to point to the Xilinx toolchain executables, source the settings script with the following command upon starting your VM.
source ~/Xilinx/Vivado/2017.1/settings64.sh
If you wish to set up your own VM and Xilinx toolchains on a Windows or Linux system, you can continue reading.
If you don’t have a 64-bit Linux OS installed on your machine, we recommend VirtualBox (free), VMWare (free under CSE VMWare Academic Program), or dual booting your machine.
Make sure to allocate at least 32GB of disk drive space for your VM’s main partition. In addition, compilation jobs can be resource-intensive, so allocating at least 4GB of DRAM for your VM would be wise. We’ve tested the tools under Ubuntu 16.04.2 but any of the following OSes or newer should work:
Note: If you’re using VMWare, do not place your source and work directories on a drive shared with your host OS. VMWare directory sharing can be slow to propagate file changes from the host OS to the guest OS, which can lead to compilation bugs.
You’ll need to install Xilinx’s FPGA compilation toolchain, Vivado HL WebPACK 2017.1 (PAY ATTENTION TO THE VERSION), which is the license-free version of the Vivado HLx toolchain.
Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
chmod u+x Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
./Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
Update your ~/.bashrc with the following lines:
# Xilinx Vivado 2017.1 environment
source <install_path>/Vivado/2017.1/settings64.sh
This tutorial gives a brief hello-world-like introduction to Vivado HLS.
In `vadd.cc` you’ll find the function definition of the vector add module. You’ll notice immediately that everything looks like standard C++ except for a few differences. At first glance, we find a lot of pragmas scattered around that provide necessary control to the programmer over how the software gets synthesized by HLS into hardware.
First off, the output of the HLS compiler is a hardware module as opposed to a binary (which is what a compiler like `gcc` would produce). As a result you are expected to specify to the compiler how you intend to connect your module to the outside world. You’ll notice the use of interface pragmas such as the one below:
#pragma HLS INTERFACE m_axi port = a offset = slave bundle = a_port
This pragma tells the compiler that the `a` argument is a master interface that uses the AXI bus protocol. A master port differs from a slave port by being able to initiate memory requests. This tells us that our vector add module will be able to perform read/write requests from each port: `a`, `b`, `c`.
You can find more information on how HLS synthesizes interfaces under the Managing Interfaces section of Chapter 1 of the Vivado HLS User Manual.
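For context, here is a minimal sketch of what the full set of interface pragmas might look like for a module with three memory ports. Only the pragma on `a` above comes from the tutorial; the function signature, the `len` argument, and the bundle names for `b`, `c`, and the control interface are assumptions for illustration, not the actual contents of `vadd.cc`.

```cpp
// Sketch only: signature, the len argument, and most bundle names are illustrative.
void vadd(const int *a, const int *b, int *c, int len) {
#pragma HLS INTERFACE m_axi port = a offset = slave bundle = a_port
#pragma HLS INTERFACE m_axi port = b offset = slave bundle = b_port
#pragma HLS INTERFACE m_axi port = c offset = slave bundle = c_port
#pragma HLS INTERFACE s_axilite port = len bundle = CONTROL_BUS
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS
  for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];   // element-wise vector addition
  }
}
```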
Let’s take the design as it is, and synthesize a hardware module with HLS. In order to compile the program, you’ll need a `.tcl` script that provides a ‘recipe’ to the compiler on how to compile your sources (you can think of it as a Makefile). Looking into `hls.tcl` you’ll find that it defines the FPGA part that is being targeted (`xc7z020clg484-1`), as well as the target clock period in ns for the compiler (10ns clock period target). The latter will tell the compiler how to insert pipeline registers in the design so it can meet timing constraints. You’ll also notice that we are passing a test file, `vadd_test.cc`, which contains test cases to make sure that the vector add source behaves as intended.
In order to compile the design with HLS, execute the following command:
cd hls-tutorials/part1
vivado_hls -f hls.tcl
After a few seconds, it will synthesize a hardware design as Verilog, VHDL and SystemC files, which you’ll find under `vadd/solution0/syn/`. It will also produce a report file that provides timing closure, resource utilization, and performance metrics. You will find the report file under `vadd/solution0/syn/report/vadd_csynth.rpt`.
Looking into it, you will find that the performance of our vector add design out of the box is a little underwhelming: 3084 cycles to add two vectors of 1024 elements (roughly three cycles per element, since without pipelining successive loop iterations do not overlap).
Try to optimize the hardware design by inserting an HLS pragma to tell the compiler to pipeline the vector addition loop. You’ll find how to do this in the Optimizing the Design section of Chapter 1 of the Vivado HLS User Manual.
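As a hint of what the pragma placement looks like, a pipeline directive goes inside the body of the loop it applies to. The loop label, bounds, and variable names below are assumptions for illustration, not the actual `vadd.cc` code.

```cpp
// Sketch only: names and bounds are illustrative.
ADD_LOOP:
for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1   // ask HLS to start a new loop iteration every cycle
  c[i] = a[i] + b[i];
}
```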
Deliverable: Pipeline the Vector Add design and report the design latency you’ve achieved.
In this tutorial, we’ll optimize a slightly more interesting design that performs matrix-matrix multiplication.
It takes two matrices A and B, that have (M, N) and (N, O) shapes respectively, and produces an output matrix C, with shape (M, O).
For simplicity, these have all been set to 64 in the `gemm.h` header file. As a result this GEMM design has to perform 64 x 64 x 64 = 262144 multiplications.
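In plain C++, the computation being synthesized is the familiar triple loop sketched below; this is only a reference to show where the multiplication count comes from, and the provided GEMM source may organize its loops and buffers differently.

```cpp
// Reference GEMM loop nest: M x O output elements, each needing N multiplies.
for (int i = 0; i < M; i++) {
  for (int j = 0; j < O; j++) {
    int acc = 0;
    for (int k = 0; k < N; k++) {
      acc += a[i * N + k] * b[k * O + j];   // 64 x 64 x 64 multiply-accumulates in total
    }
    c[i * O + j] = acc;
  }
}
```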
In the previous vector add design, the module would stream inputs in and stream outputs out as it performed the addition. This was acceptable since inputs are used only once, and outputs written only once, in vector addition.
For matrix multiplication however there is a fair amount of data reuse to take advantage of. Every element of A is read O times, every element of B is read M times, and every element of C is written to N times.
Consequently, in order to optimize for data access, it’s generally a good idea to store values on-chip (i.e. on the FPGA) to facilitate data re-use. In our simple example, we are lucky to be able to store the entire matrices on chip. With HLS, we can instantiate local SRAM buffers by declaring arrays. These will consume BRAM resources on the FPGA, so we need to be careful not to exceed resource constraints.
int a_buff[M][N];
int b_buff[O][N];
int c_buff[M][O];
You can find more information on how HLS synthesizes local SRAM buffers under the Arrays section of Chapter 3 of the Vivado HLS User Manual.
Now that we have instantiated local buffers in our design, we need to load them with input matrix data stored in DRAM, and eventually store the results back to DRAM after the computation is over.
One easy trick is to use `memcpy`; HLS will synthesize this into an efficient FSM that can initiate data transfers between DRAM and SRAM:
memcpy(&a_buff[0][0], const_cast<int*>(a), sizeof(int) * M * N);
As with the standard `memcpy` from `<cstring>`, the size of the memory transfer needs to be specified in bytes.
You can find more information on how to use `memcpy` in the AXI4 Master Interface subsection of Chapter 1, pages 124-130 of the Vivado HLS User Manual.
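The same construct works in the other direction to write the result buffer back to DRAM once the computation is done. Treat the line below as a sketch: the exact pointer qualifiers and casts depend on how the `c` port is declared in the provided source.

```cpp
// Sketch: copy the on-chip result buffer back out to the c DRAM port.
memcpy(c, &c_buff[0][0], sizeof(int) * M * O);
```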
In order to compile the design with HLS, execute the following command:
cd hls-tutorials/part2
vivado_hls -f hls.tcl
By default the design is not optimized and it takes over 1M cycles to complete the matrix multiplication.
Try to optimize the hardware design by inserting appropriate HLS pragmas in the design to tell the compiler to pipeline matrix multiplication so that you can achieve 1 vector dot product per cycle. You will need to re-partition your on-chip input buffers in order to increase the number of read ports.
You’ll find how to do this in the Optimizing the Design section of Chapter 1 of the Vivado HLS User Manual - look for the `ARRAY_PARTITION` pragma description.
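To give a feel for the pragma syntax, a partitioned and pipelined loop nest might look roughly like the sketch below. The partition dimensions, loop labels, and loop structure are assumptions to illustrate the idea, not a prescribed solution; consult the `ARRAY_PARTITION` documentation for the exact options.

```cpp
// Sketch only: dimensions, factors, and loop structure are illustrative.
int a_buff[M][N];
#pragma HLS ARRAY_PARTITION variable=a_buff complete dim=2   // expose more read ports
int b_buff[O][N];
#pragma HLS ARRAY_PARTITION variable=b_buff complete dim=2

ROW_LOOP: for (int i = 0; i < M; i++) {
  COL_LOOP: for (int j = 0; j < O; j++) {
#pragma HLS PIPELINE II=1   // pipelining here also unrolls the inner dot-product loop
    int acc = 0;
    DOT_LOOP: for (int k = 0; k < N; k++) {
      acc += a_buff[i][k] * b_buff[j][k];
    }
    c_buff[i][j] = acc;
  }
}
```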
Deliverable: Optimize the GEMM design and report the design latency you’ve achieved.
In this tutorial, we’ll showcase how to use HLS to synthesize a simple pipelined CPU.
The CPU uses a simple 32-bit RISC ISA. It includes an explicitly managed instruction cache and data cache, and a register file.
The instruction cache and data cache both have 1024 32-bit entries, and the register file contains 16 32-bit registers named `R0` to `R15`.
The ISA is described below:
Opcode | Fields | Description | Operation |
---|---|---|---|
FINISH | None | Indicates end of program | end |
WRITE_IMMEDIATE | DST, IMM | Writes the immediate value IMM to DST register | r[DST] <- IMM |
LOAD | REG, ADDR | Loads data memory at address pointed by ADDR register into REG register | r[REG] <- data(r[ADDR]) |
STORE | REG, ADDR | Stores REG register to data memory at index pointed by the ADDR register | data(r[ADDR]) <- r[REG] |
BEQ | SRC0, SRC1, NEW_PC | Sets PC to NEW_PC if SRC0 register and SRC1 register are equal | PC <- NEW_PC if r[SRC0] == r[SRC1] |
BNE | SRC0, SRC1, NEW_PC | Sets PC to NEW_PC if SRC0 register and SRC1 register are not equal | PC <- NEW_PC if r[SRC0] != r[SRC1] |
ADD | DST, SRC0, SRC1 | Sets DST register to the sum of SRC0 and SRC1 registers | r[DST] <- r[SRC0] + r[SRC1] |
The entire CPU specification is described in `cpu.h`. You’ll note the use of arbitrary-precision integers (`ap_int`, `ap_uint`), which are HLS-specific data types that allow for non-standard precision integers.
You can find more information on special data types in HLS under the Data Types for Efficient Hardware section of Chapter 1 of the Vivado HLS User Manual.
In addition, we are relying on bit-fields for convenient data-packing into 32-bit values. This software construct allows us to pack instruction fields into a single 32-bit instruction. Depending on the instruction, the fields will be interpreted differently. We achieve this using a C union, which lets us interpret the same generic instruction struct as a different instruction format depending on the opcode field.
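To make the idea concrete, here is a minimal sketch of how bit-fields inside a union can pack a 32-bit instruction. The field names and widths below are illustrative assumptions; the actual layout is defined in `cpu.h`.

```cpp
// Sketch only: field names and widths are illustrative; see cpu.h for the real layout.
struct GenericInsn {
  unsigned int opcode  : 3;   // selects how the remaining bits are interpreted
  unsigned int payload : 29;
};

struct BinaryInsn {           // e.g. ADD DST, SRC0, SRC1
  unsigned int opcode : 3;
  unsigned int dst    : 4;    // 16 registers, so 4 bits per register field
  unsigned int src0   : 4;
  unsigned int src1   : 4;
  unsigned int unused : 17;
};

union Insn {                  // the same 32 bits, reinterpreted per opcode
  GenericInsn  generic;
  BinaryInsn   binary;
  unsigned int raw;
};
```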
Next let’s look at the CPU source code in `cpu.cc`.
In terms of interface, the CPU has two memory ports: one for instruction memory, and the other for data memory (similar to the Harvard Architecture from the old days).
These two memory ports allow for data transfer between DRAM and the on-chip instruction and data caches using the `memcpy` construct we saw in Part 2.
This CPU is fairly simple, so it won’t have implicitly managed caches. Instead it will initialize its instruction and data caches from DRAM at the beginning of a given program, and dump its data cache back to DRAM at the end of the program.
The main CPU loop executes the program in the instruction cache until it hits a `FINISH` instruction, which asserts the `finish` flag.
We have structured our code to perform CPU actions in stages defined by the good old-fashioned RISC 5-stage pipeline:
- instruction fetch
- instruction decode
- execute
- memory access for `LOAD` or `STORE` operations
- register write-back

Note however that this won’t necessarily generate a 5-stage pipeline (we haven’t even told HLS to pipeline our design!). This structure is merely for legibility and to reflect the organization of classic CPUs.
The `cpu_test.cc` file lets us test the CPU on ‘hand-assembled’ programs.
We don’t have a compiler, so we need to generate the assembly line by line.
Thankfully we have helper functions that make it a little easier to generate an assembly instruction:
- `getFinishInsn()`: generates a `FINISH` instruction
- `getWriteImmediateInsn()`: generates a `WRITE_IMMEDIATE` instruction
- `getMemoryInsn()`: generates a `LOAD` or `STORE` instruction
- `getBranchInsn()`: generates a `BEQ` or `BNE` instruction
- `getBinaryInsn()`: generates an `ADD` instruction (but can be extended to generate binary arithmetic operations)

To test that the CPU works as intended, we have provided you with a fully unrolled vector add program.
In order to compile the design with HLS and run the test program, execute the following command:
cd hls-tutorials/part3
vivado_hls -f hls.tcl
If you look at the report, you’ll notice that there is no design latency estimate. This is because the execution latency depends on the length of the assembly program, which is only known at run time.
Although we don’t have a full-design latency and initiation interval breakdown, we can analyze the CPU’s program loop (labeled `PROGRAM_LOOP` in the source code).
Since that loop is currently not pipelined, we won’t get an initiation interval but we can already assess that it has an iteration latency of 5 cycles.
The CPU design currently does not support the branching instructions `BEQ` and `BNE`, which makes it impossible to implement loops in assembly programs.
Extend the CPU design to handle those branching instructions (you can insert your code in the region marked with the `TODO` comment in `cpu.cc`).
Also, in `cpu_test.cc`, implement an assembly program that takes advantage of branches to perform vector addition. We’ve already provided the scaffolding to implement a loop-based program. You can extend the loop body of that assembly program, and enable the execution of that program in the test by setting `USE_BRANCH` to `true`.
Re-run the HLS compilation to make sure that the design passes simulation:
cd hls-tutorials/part3
vivado_hls -f hls.tcl
Deliverable: The cpu.cc and cpu_test.cc that implement the branch instructions and test for control flow branches.
Last but not least, we’ll need to pipeline our CPU. So far it has been an unpipelined design that executes one instruction at a time, which as we know is sub-optimal.
The beauty of HLS is that we don’t need to change our source to implement pipelining. You should be able to pipeline this design using the HLS pragmas that you used in Parts 1 and 2.
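Concretely, the directive again goes inside the loop it targets. The condition and body below are placeholders for what already exists in `cpu.cc`; only the pragma placement matters here.

```cpp
// Sketch only: the real loop body (fetch/decode/execute/memory/write-back) is in cpu.cc.
PROGRAM_LOOP: while (!finish) {
#pragma HLS PIPELINE II=1   // report the II and iteration latency HLS achieves here
  // ... CPU stages ...
}
```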
Deliverable: Pipeline the PROGRAM_LOOP and report the initiation-interval (II) and iteration latency that you’ve achieved.
Change to the top-level directory of your lab repository.
Then, issue:
$ tar cvf Ass1.tar *
which creates a tarball (i.e., a single file) that contains the contents of the folder.
Compress your file using gzip:
$ gzip Ass1.tar