Gitlab : https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p1.git
git clone https://csil-git1.cs.surrey.sfu.ca/ashriram/hls-p1.git
The first part and the second part are each worth 25 points, and the third part is worth 50 points. The instructions for each part of the lab can be found in the README files in each of the corresponding directories.
answers.txt should use the following format:
Part 1 Design Latency: XXX
Part 2 Design Latency: XXX
Part 3 PROGRAM_LOOP Iteration Latency: XXX
You can directly obtain a VM image with the Xilinx Vivado toolchains pre-installed here. This VM image will run in VMWare Fusion (OSX) or VMWare Workstation Player (Windows) which you can obtain for free via the CS SFU VMWare Academic program. To log in to the VM, the password is “deeplearning”.
In order to update your environment to point to the Xilinx toolchain executables, source the settings script with the following command upon starting your VM.
source ~/Xilinx/Vivado/2017.1/settings64.sh
If you wish to set up your own VM and Xilinx toolchains on a Windows or Linux system, you can continue reading.
If you don’t have a 64-bit Linux OS installed on your machine, we recommend VirtualBox (free), VMWare (free under CSE VMWare Academic Program), or dual booting your machine.
Make sure to allocate at least 32GB of disk drive space for your VM’s main partition. In addition, compilation jobs can be resource-intensive, so allocating at least 4GB of DRAM for your VM would be wise. We’ve tested the tools under Ubuntu 16.04.2 but any of the following OSes or newer should work:
Note: If you’re using VMWare, do not place your source and work directories on a drive shared with your host OS. VMWare directory sharing can be slow to propagate file changes from the host OS to the guest OS, which can lead to compilation bugs.
You’ll need to install Xilinx’s FPGA compilation toolchain, Vivado HL WebPACK 2017.1 (PAY ATTENTION TO THE VERSION), which is the license-free version of the Vivado HLx toolchain.
Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
chmod u+x Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
./Xilinx_Vivado_SDK_2017.1_0415_1_Lin64.bin
Update your ~/.bashrc with the following lines:
# Xilinx Vivado 2017.1 environment
source <install_path>/Vivado/2017.1/settings64.sh
This tutorial gives a brief hello-world-like introduction to Vivado HLS.
In `vadd.cc` you’ll find the function definition of the vector add module. You’ll notice immediately that everything looks like standard C++ except for a few differences. At first glance, we find a lot of pragmas scattered around that provide necessary control to the programmer over how the software gets synthesized by HLS into hardware.
First off, the output of the HLS compiler is a hardware module as opposed to a binary (which is what a compiler like `gcc` would produce). As a result you are expected to specify to the compiler how you intend to connect your module to the outside world. You’ll notice the use of interface pragmas such as the one below:
#pragma HLS INTERFACE m_axi port = a offset = slave bundle = a_port
This pragma tells the compiler that the `a` argument is a master interface that uses the AXI bus protocol. A master port differs from a slave port by being able to initiate memory requests. This tells us that our vector add module will be able to perform read/write requests from each port: `a`, `b`, `c`.
You can find more information on how HLS synthesizes interfaces under the Managing Interfaces section of Chapter 1 of the Vivado HLS User Manual.
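For context, here is a minimal sketch of what the full set of interface pragmas might look like for a module with three memory ports. Only the pragma on `a` above comes from the tutorial; the function signature, the `len` argument, and the bundle names for `b`, `c`, and the control interface are assumptions for illustration, not the actual contents of `vadd.cc`.

```cpp
// Sketch only: signature, the len argument, and most bundle names are illustrative.
void vadd(const int *a, const int *b, int *c, int len) {
#pragma HLS INTERFACE m_axi port = a offset = slave bundle = a_port
#pragma HLS INTERFACE m_axi port = b offset = slave bundle = b_port
#pragma HLS INTERFACE m_axi port = c offset = slave bundle = c_port
#pragma HLS INTERFACE s_axilite port = len bundle = CONTROL_BUS
#pragma HLS INTERFACE s_axilite port = return bundle = CONTROL_BUS
  for (int i = 0; i < len; i++) {
    c[i] = a[i] + b[i];   // element-wise vector addition
  }
}
```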
Let’s take the design as it is, and synthesize a hardware module with HLS. In order to compile the program, you’ll need a `.tcl` script that provides a ‘recipe’ to the compiler on how to compile your sources (you can think of it as a Makefile). Looking into `hls.tcl` you’ll find that it defines the FPGA part that is being targeted (`xc7z020clg484-1`), as well as the target clock period in ns for the compiler (10ns clock period target). The latter will tell the compiler how to insert pipeline registers in the design so it can meet timing constraints. You’ll also notice that we are passing a test file, `vadd_test.cc`, which contains test cases to make sure that the vector add source behaves as intended.
In order to compile the design with HLS, execute the following command:
cd hls-tutorials/part1
vivado_hls -f hls.tcl
After a few seconds, it will synthesize a hardware design as Verilog, VHDL and SystemC files, which you’ll find under `vadd/solution0/syn/`. It will also produce a report file that provides timing closure, resource utilization, and performance metrics. You will find the report file under `vadd/solution0/syn/report/vadd_csynth.rpt`.
Looking into it, you will find that the performance of our vector add design out of the box is a little underwhelming: 3084 cycles to add two vectors of 1024 elements (roughly three cycles per element, since without pipelining successive loop iterations do not overlap).
Try to optimize the hardware design by inserting an HLS pragma to tell the compiler to pipeline the vector addition loop. You’ll find how to do this in the Optimizing the Design section of Chapter 1 of the Vivado HLS User Manual.
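As a hint of what the pragma placement looks like, a pipeline directive goes inside the body of the loop it applies to. The loop label, bounds, and variable names below are assumptions for illustration, not the actual `vadd.cc` code.

```cpp
// Sketch only: names and bounds are illustrative.
ADD_LOOP:
for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1   // ask HLS to start a new loop iteration every cycle
  c[i] = a[i] + b[i];
}
```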
Deliverable: Pipeline the Vector Add design and report the design latency you’ve achieved.
In this tutorial, we’ll optimize a slightly more interesting design that performs matrix-matrix multiplication.
It takes two matrices A and B, that have (M, N) and (N, O) shapes respectively, and produces an output matrix C, with shape (M, O).
For simplicity, these have all been set to 64 in the `gemm.h` header file. As a result this GEMM design has to perform 64 x 64 x 64 = 262144 multiplications.
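In plain C++, the computation being synthesized is the familiar triple loop sketched below; this is only a reference to show where the multiplication count comes from, and the provided GEMM source may organize its loops and buffers differently.

```cpp
// Reference GEMM loop nest: M x O output elements, each needing N multiplies.
for (int i = 0; i < M; i++) {
  for (int j = 0; j < O; j++) {
    int acc = 0;
    for (int k = 0; k < N; k++) {
      acc += a[i * N + k] * b[k * O + j];   // 64 x 64 x 64 multiply-accumulates in total
    }
    c[i * O + j] = acc;
  }
}
```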
In the previous vector add design, the module would stream inputs in and stream outputs out as it performed the addition. This was acceptable since inputs are used only once, and outputs written only once, in vector addition.
For matrix multiplication however there is a fair amount of data reuse to take advantage of. Every element of A is read O times, every element of B is read M times, and every element of C is written to N times.
Consequently, in order to optimize for data access, it’s generally a good idea to store values on-chip (i.e. on the FPGA) to facilitate data re-use. In our simple example, we are lucky to be able to store the entire matrices on chip. With HLS, we can instantiate local SRAM buffers by declaring arrays. These will consume BRAM resources on the FPGA, so we need to be careful not to exceed resource constraints.
int a_buff[M][N];
int b_buff[O][N];
int c_buff[M][O];
You can find more information on how HLS synthesizes local SRAM buffers under the Arrays section of Chapter 3 of the Vivado HLS User Manual.
Now that we have instantiated local buffers in our design, we need to load them with input matrix data stored in DRAM, and eventually store the results back to DRAM after the computation is over.
One easy trick is to use `memcpy`; HLS will synthesize this into an efficient FSM that can initiate data transfers between DRAM and SRAM:
memcpy(&a_buff[0][0], const_cast<int*>(a), sizeof(int) * M * N);
As with the standard `memcpy` from `<cstring>`, the size of the memory transfer needs to be specified in bytes.
You can find more information on how to use `memcpy` in the AXI4 Master Interface subsection of Chapter 1, pages 124-130 of the Vivado HLS User Manual.
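The same construct works in the other direction to write the result buffer back to DRAM once the computation is done. Treat the line below as a sketch: the exact pointer qualifiers and casts depend on how the `c` port is declared in the provided source.

```cpp
// Sketch: copy the on-chip result buffer back out to the c DRAM port.
memcpy(c, &c_buff[0][0], sizeof(int) * M * O);
```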
In order to compile the design with HLS, execute the following command:
cd hls-tutorials/part2
vivado_hls -f hls.tcl
By default the design is not optimized and it takes over 1M cycles to complete the matrix multiplication.
Try to optimize the hardware design by inserting appropriate HLS pragmas in the design to tell the compiler to pipeline matrix multiplication so that you can achieve 1 vector dot product per cycle. You will need to re-partition your on-chip input buffers in order to increase the number of read ports.
You’ll find how to do this in the Optimizing the Design section of Chapter 1 of the Vivado HLS User Manual - look for the `ARRAY_PARTITION` pragma description.
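To give a feel for the pragma syntax, a partitioned and pipelined loop nest might look roughly like the sketch below. The partition dimensions, loop labels, and loop structure are assumptions to illustrate the idea, not a prescribed solution; consult the `ARRAY_PARTITION` documentation for the exact options.

```cpp
// Sketch only: dimensions, factors, and loop structure are illustrative.
int a_buff[M][N];
#pragma HLS ARRAY_PARTITION variable=a_buff complete dim=2   // expose more read ports
int b_buff[O][N];
#pragma HLS ARRAY_PARTITION variable=b_buff complete dim=2

ROW_LOOP: for (int i = 0; i < M; i++) {
  COL_LOOP: for (int j = 0; j < O; j++) {
#pragma HLS PIPELINE II=1   // pipelining here also unrolls the inner dot-product loop
    int acc = 0;
    DOT_LOOP: for (int k = 0; k < N; k++) {
      acc += a_buff[i][k] * b_buff[j][k];
    }
    c_buff[i][j] = acc;
  }
}
```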
Deliverable: Optimize the GEMM design and report the design latency you’ve achieved.
In this tutorial, we’ll showcase how to use HLS to synthesize a simple pipelined CPU.
The CPU uses a simple 32-bit RISC ISA. It includes an explicitly managed instruction cache and data cache, and a register file.
The instruction cache and data cache both have 1024 32-bit entries, and the register file contains 16 32-bit registers named `R0` to `R15`.
The ISA is described below:
Opcode | Fields | Description | Operation |
---|---|---|---|
FINISH | None | Indicates end of program | end |
WRITE_IMMEDIATE | DST, IMM | Writes the immediate value IMM to DST register | r[DST] <- IMM |
LOAD | REG, ADDR | Loads data memory at address pointed by ADDR register into REG register | r[REG] <- data(r[ADDR]) |
STORE | REG, ADDR | Stores REG register to data memory at index pointed by the ADDR register | data(r[ADDR]) <- r[REG] |
BEQ | SRC0, SRC1, NEW_PC | Sets PC to NEW_PC if SRC0 register and SRC1 register are equal | PC <- NEW_PC if r[SRC0] == r[SRC1] |
BNE | SRC0, SRC1, NEW_PC | Sets PC to NEW_PC if SRC0 register and SRC1 register are not equal | PC <- NEW_PC if r[SRC0] != r[SRC1] |
ADD | DST, SRC0, SRC1 | Sets DST register to the sum of SRC0 and SRC1 registers | r[DST] <- r[SRC0] + r[SRC1] |
The entire CPU specification is described in `cpu.h`. You’ll note the use of arbitrary-precision integers (`ap_int`, `ap_uint`), which are HLS-specific data types that allow for non-standard precision integers.
You can find more information on special data types in HLS under the Data Types for Efficient Hardware section of Chapter 1 of the Vivado HLS User Manual.
In addition, we are relying on bit-fields for convenient data-packing into 32-bit values. This software construct allows us to pack instruction fields into a single 32-bit instruction. Depending on the instruction, the fields will be interpreted differently. We achieve this using a C union, which lets us interpret the same generic instruction struct as a different instruction format depending on the opcode field.
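To make the idea concrete, here is a minimal sketch of how bit-fields inside a union can pack a 32-bit instruction. The field names and widths below are illustrative assumptions; the actual layout is defined in `cpu.h`.

```cpp
// Sketch only: field names and widths are illustrative; see cpu.h for the real layout.
struct GenericInsn {
  unsigned int opcode  : 3;   // selects how the remaining bits are interpreted
  unsigned int payload : 29;
};

struct BinaryInsn {           // e.g. ADD DST, SRC0, SRC1
  unsigned int opcode : 3;
  unsigned int dst    : 4;    // 16 registers, so 4 bits per register field
  unsigned int src0   : 4;
  unsigned int src1   : 4;
  unsigned int unused : 17;
};

union Insn {                  // the same 32 bits, reinterpreted per opcode
  GenericInsn  generic;
  BinaryInsn   binary;
  unsigned int raw;
};
```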
Next let’s look at the CPU source code in `cpu.cc`.
In terms of interface, the CPU has two memory ports: one for instruction memory, and the other for data memory (similar to the Harvard Architecture from the old days).
These two memory ports allow for data transfer between DRAM and the on-chip instruction and data caches using the `memcpy` construct we saw in Part 2.
This CPU is fairly simple, so it won’t have implicitly managed caches. Instead it will initialize its instruction and data caches from DRAM at the beginning of a given program, and dump its data cache back to DRAM at the end of the program.
The main CPU loop executes the program in the instruction cache until it hits a `FINISH` instruction, which asserts the `finish` flag.
We have structured our code to perform CPU actions in stages defined by the good old-fashioned RISC 5-stage pipeline:
- instruction fetch
- instruction decode
- execute
- memory access for `LOAD` or `STORE` operations
- register write-back

Note however that this won’t necessarily generate a 5-stage pipeline (we haven’t even told HLS to pipeline our design!). This structure is merely for legibility and to reflect the organization of classic CPUs.
The `cpu_test.cc` file lets us test the CPU on ‘hand-assembled’ programs.
We don’t have a compiler, so we need to generate the assembly line by line.
Thankfully we have helper functions that make it a little easier to generate an assembly instruction:
- `getFinishInsn()`: generates a `FINISH` instruction
- `getWriteImmediateInsn()`: generates a `WRITE_IMMEDIATE` instruction
- `getMemoryInsn()`: generates a `LOAD` or `STORE` instruction
- `getBranchInsn()`: generates a `BEQ` or `BNE` instruction
- `getBinaryInsn()`: generates an `ADD` instruction (but can be extended to generate binary arithmetic operations)

To test that the CPU works as intended, we have provided you with a fully unrolled vector add program.
In order to compile the design with HLS and run the test program, execute the following command:
cd hls-tutorials/part3
vivado_hls -f hls.tcl
If you look at the report, you’ll notice that there is no design latency estimate. This is because the execution latency depends on the length of the assembly program, which is only known at run time.
Although we don’t have a full-design latency and initiation interval breakdown, we can analyze the CPU’s program loop (labeled `PROGRAM_LOOP` in the source code).
Since that loop is currently not pipelined, we won’t get an initiation interval but we can already assess that it has an iteration latency of 5 cycles.
The CPU design currently does not support the branching instructions `BEQ` and `BNE`, which makes it impossible to implement loops in assembly programs.
Extend the CPU design to handle those branching instructions (you can insert your code in the region marked with the `TODO` comment in `cpu.cc`).
Also, in `cpu_test.cc`, implement an assembly program that takes advantage of branches to perform vector addition. We’ve already provided the scaffolding to implement a loop-based program. You can extend the loop body of that assembly program, and enable the execution of that program in the test by setting `USE_BRANCH` to `true`.
Re-run the HLS compilation to make sure that the design passes simulation:
cd hls-tutorials/part3
vivado_hls -f hls.tcl
Deliverable: The cpu.cc and cpu_test.cc that implement the branch instructions and test for control flow branches.
Last but not least, we’ll need to pipeline our CPU. So far it has been an unpipelined design that executes one instruction at a time, which as we know is sub-optimal.
The beauty of HLS is that we don’t need to change our source to implement pipelining. You should be able to pipeline this design using the HLS pragmas that you used in Parts 1 and 2.
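Concretely, the directive again goes inside the loop it targets. The condition and body below are placeholders for what already exists in `cpu.cc`; only the pragma placement matters here.

```cpp
// Sketch only: the real loop body (fetch/decode/execute/memory/write-back) is in cpu.cc.
PROGRAM_LOOP: while (!finish) {
#pragma HLS PIPELINE II=1   // report the II and iteration latency HLS achieves here
  // ... CPU stages ...
}
```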
Deliverable: Pipeline the PROGRAM_LOOP and report the initiation-interval (II) and iteration latency that you’ve achieved.
Change to the top-level directory of your lab repository.
Then, issue:
$ tar cvf Ass1.tar *
which creates a tarball (i.e., a single file) that contains the contents of the folder.
Compress your file using gzip:
$ gzip Ass1.tar