CMPT 250 Assignment 4

Due April 2, 2004.

In this assignment, we will explore pipelining and the single-cycle architecture from Assignment 3.

For pipelining, you can use the provided implementations of a pipeline register and a single-bit pipeline register. These are simple registers that load on each rising edge of the clock. [The single-bit version is necessary for std_logic signals that need to be pipelined; the regular pipeline register is used for std_logic_vector signals.]
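If you are curious what such registers look like inside, the following is a minimal sketch of the behaviour described above. It is not the provided code: the entity names pipereg_sketch and pipebit_sketch, the port names d and q, and the WIDTH generic are all assumptions made for illustration, so check the provided files for the real port lists before instantiating anything.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical vector pipeline register; the provided pipereg may use
-- different port names or set its width differently.
entity pipereg_sketch is
  generic (WIDTH : positive := 16);
  port (
    clock : in  std_logic;
    d     : in  std_logic_vector(WIDTH - 1 downto 0);
    q     : out std_logic_vector(WIDTH - 1 downto 0));
end pipereg_sketch;

architecture behavioural of pipereg_sketch is
begin
  -- Load the input on every rising clock edge; no enable, no reset.
  process (clock)
  begin
    if rising_edge(clock) then
      q <= d;
    end if;
  end process;
end behavioural;

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical single-bit version for std_logic signals such as RW or MW.
entity pipebit_sketch is
  port (
    clock : in  std_logic;
    d     : in  std_logic;
    q     : out std_logic);
end pipebit_sketch;

architecture behavioural of pipebit_sketch is
begin
  process (clock)
  begin
    if rising_edge(clock) then
      q <= d;
    end if;
  end process;
end behavioural;
```

Until the first rising edge, the outputs of both sketches are 'U', which is why uninitialized values show up in the pipe while it fills.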

Below is an overview of the construction of the pipelined CPU: [figure: pipelined CPU overview, in PDF]

Not shown here are the pipeline registers: every time a signal crosses a stage boundary, a pipeline register must be added. See also the diagram on p. 452 of the text (Figure 8-24).

Hint: When creating the pipeline, you will end up with several signals for the same values at different stages of the pipe. Keep yourself sane by coming up with some consistent naming scheme. For example, append a number indicating which stage each signal is in. So, in the control unit, you might have signals named DA2 (for the DOF stage DA signal), DA3 (for the EX stage DA signal) and DA (the output port, needed in the WB stage).
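As an illustration of this naming scheme, here is a small behavioural sketch that carries DA through the pipe. The entity name da_pipe_sketch and the DA_decoded input are hypothetical. Your pcontrol.vhd should be structural (instantiating the provided pipeline registers), not a clocked process as in this sketch; the point here is only how the names line up with the stages.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical example: the DA field as it moves from DOF to WB.
entity da_pipe_sketch is
  port (
    clock      : in  std_logic;
    DA_decoded : in  std_logic_vector(2 downto 0);   -- assumed decoder output
    DA         : out std_logic_vector(2 downto 0));  -- value needed in WB
end da_pipe_sketch;

architecture behavioural of da_pipe_sketch is
  signal DA2 : std_logic_vector(2 downto 0);  -- DA in stage 2 (DOF)
  signal DA3 : std_logic_vector(2 downto 0);  -- DA in stage 3 (EX)
begin
  DA2 <= DA_decoded;

  -- Each assignment stands in for one pipeline register.
  process (clock)
  begin
    if rising_edge(clock) then
      DA3 <= DA2;  -- DOF/EX boundary
      DA  <= DA3;  -- EX/WB boundary
    end if;
  end process;
end behavioural;
```

A value presented on DA_decoded reaches the DA output after two rising clock edges, one per stage boundary crossed.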

Along with the solutions to previous assignments, there are many other files provided for this assignment. You can download all of them in a TAR file if you like. This command will compile all of the provided files (so you can make a Makefile a little faster):

vhdlan instrrom pipereg pipebit extend clock pc id bc regfile psr mux binput logic alu fu

(Pay attention to which instruction ROM file you want to compile when running that command.)

On all assignments, there will be marks allocated for the style of your code. You should just make sure you use appropriate variable/signal names, comment hard-to-understand parts, etc.

Pipelined Control

In a file named pcontrol.vhd create a structural description of the control unit for the example architecture, with a four-stage pipeline, as described above and in the text. You should use this entity declaration:

entity pcontrol is
  port (
    clock, V, C, N, Z : in std_logic;
    DA, AA, BA, const : out std_logic_vector(2 downto 0);
    FS                : out std_logic_vector(4 downto 0);
    MB, MD, RW, MW    : out std_logic);
end pcontrol;

You can (and probably should) start with the control unit from Assignment 3.

Note that the provided instruction decoder is slightly different from the one used in previous assignments. The instructions that were undefined have been converted to NOPs (no-operation instructions). We will need the NOPs to avoid hazards. You can download an updated description of the instruction set: [PS] [PDF].

We also need to be a little more careful with the branch control. Until the pipeline is filled, the branch control's inputs will be uninitialized ('U'). We need to make sure the branch control increments the program counter in this case. The provided branch control does this.
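One way to get this behaviour is to test explicitly for '1' and let everything else fall through to incrementing. The sketch below is not the provided branch control. It assumes control inputs named PL (load the PC), JB (jump rather than conditional branch) and BC (which status bit to test, with '0' meaning Z and '1' meaning N), plus a hypothetical load_pc output. Check the provided file for the real names and encodings.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical branch decision logic that tolerates 'U' inputs.
entity bc_sketch is
  port (
    PL, JB, BC : in  std_logic;   -- assumed control inputs
    N, Z       : in  std_logic;   -- status bits
    load_pc    : out std_logic);  -- '1' = load target, '0' = increment
end bc_sketch;

architecture behavioural of bc_sketch is
begin
  process (PL, JB, BC, N, Z)
  begin
    -- Default to incrementing. A comparison such as PL = '1' is false
    -- when PL is 'U', so uninitialized inputs never load the PC.
    load_pc <= '0';
    if PL = '1' then
      if JB = '1' then
        load_pc <= '1';  -- unconditional jump
      elsif JB = '0' then
        if (BC = '1' and N = '1') or (BC = '0' and Z = '1') then
          load_pc <= '1';  -- conditional branch taken
        end if;
      end if;
    end if;
  end process;
end behavioural;
```

Building the same decision from std_logic gates (and, or) would instead propagate 'U' to the output, which is the situation the paragraph above warns about.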

The signals for branch control aren't detailed in the text. All of the control inputs to the branch control should be pipelined to stage 3 (EX). The status bits should be piped to stage 4 (WB)—the status bits will be coming from the previous instruction (the instruction just before the branch), but the other signals come from the branch instruction. Since those instructions are in different stages when the branch control does its job, the inputs must come from the different stages.

Note that there are several instruction ROMs provided:

  - A ROM that contains a few simple instructions, just to see if things are working.
  - A ROM that computes the sum 1...10, with plenty of space between instructions to avoid hazards. Note that the conditional branch must come immediately after the instruction that creates the status bits it is examining. Also note that the branch must target two instructions earlier than you might expect: by the time the branch happens, the PC has incremented twice.
  - A ROM that computes the sum 1...10, with NOPs between instructions only where necessary to avoid hazards.
  - A ROM that tries to compute the sum 1...10 but ignores the pipeline, so hazards are left to happen. It doesn't work (or even come close).

In all of the instruction ROMs, the default value for a memory location is a NOP instruction. So, any addresses that aren't explicitly assigned a different value are NOPs.

Pipelined Datapath

In a file named pdp.vhd create a structural description of the datapath unit for the example architecture, with a four-stage pipeline, as described above and in the text. You should use this entity declaration:

entity pdp is
  port (
    clock, RW, MB, MD  : in std_logic;
    DA, AA, BA         : in std_logic_vector(2 downto 0);
    FS                 : in std_logic_vector(4 downto 0);
    const_in, data_in  : in std_logic_vector(15 downto 0);
    V, C, N, Z         : out std_logic;
    addr_out, data_out : out std_logic_vector(15 downto 0));
end pdp;

Note that the register file has been modified for the pipelined CPU. It is now a "read-after-write" register file. This is necessary to allow the WB and DOF stages to work on the same register in the same cycle. This implementation actually writes to the register file on the falling edge of the previous clock cycle. See p. 549 (second paragraph) for more details.

We also must make sure that the register file does not write when the RW signal is uninitialized. The provided register file does this properly.
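The two properties above (writing on the falling edge, and ignoring an uninitialized RW) might look roughly like this. This is only a sketch, not the provided register file: the entity name regfile_sketch, the port names D, A and B, and the use of ieee.numeric_std are assumptions.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical 8 x 16 register file that writes mid-cycle.
entity regfile_sketch is
  port (
    clock      : in  std_logic;
    RW         : in  std_logic;
    DA, AA, BA : in  std_logic_vector(2 downto 0);
    D          : in  std_logic_vector(15 downto 0);
    A, B       : out std_logic_vector(15 downto 0));
end regfile_sketch;

architecture behavioural of regfile_sketch is
  type reg_array is array (0 to 7) of std_logic_vector(15 downto 0);
  signal regs : reg_array;
begin
  -- Write on the falling edge, and only when RW is exactly '1', so an
  -- uninitialized ('U') RW never changes a register.
  process (clock)
  begin
    if falling_edge(clock) then
      if RW = '1' then
        regs(to_integer(unsigned(DA))) <= D;
      end if;
    end if;
  end process;

  -- Reads are combinational, so a value written at the falling edge is
  -- visible to the DOF stage during the second half of the same cycle.
  A <= regs(to_integer(unsigned(AA)));
  B <= regs(to_integer(unsigned(BA)));
end behavioural;
```

Note that numeric_std may print a warning if AA or BA is still 'U' when the read expressions are evaluated; that is a simulation message, not a failure.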

Pipelined CPU

From your control and datapath, create a pipelined CPU in a file named pcpu.vhd with this entity declaration:

entity pcpu is
  port (
    data_in      : in std_logic_vector(15 downto 0);
    A_out, B_out : out std_logic_vector(15 downto 0);
    MW           : out std_logic);
end pcpu;

You should be able to reuse the CPU from Assignment 3, with the entity names changed.

Programming the Pipelined CPU

There is no circuitry in the processor we have constructed here to deal with hazards (neither data nor control). So, when programming it, you must take hazards into account yourself.

The problem here will be to write a program that does unsigned integer multiplication. The values to multiply will be read from "memory". We won't really connect a memory unit; it will be simulated by a testbench. The provided testbench will give the appearance that M[0]=5 and M[1]=7. (You can, and probably should, change this while testing.)
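To give a feel for how memory can be faked in simulation, here is a minimal sketch of a read-only stand-in. It is not the provided testbench, which may work quite differently: the entity name fake_mem_sketch and its ports are hypothetical, and it assumes the CPU's A_out carries the address and that the result would be fed back to data_in.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical stand-in for memory reads: M[0] = 5, M[1] = 7.
entity fake_mem_sketch is
  port (
    addr     : in  std_logic_vector(15 downto 0);   -- assumed: from A_out
    data_out : out std_logic_vector(15 downto 0));  -- assumed: to data_in
end fake_mem_sketch;

architecture behavioural of fake_mem_sketch is
begin
  -- Purely combinational lookup; every other address reads as zero.
  with addr select
    data_out <= x"0005" when x"0000",
                x"0007" when x"0001",
                x"0000" when others;
end behavioural;
```

Changing the two constants is then all it takes to try other operands. Writes (when MW is asserted) are not modelled in this sketch.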

The values from these memory locations will be loaded to the register file, multiplied and stored back to memory location 0. A skeleton instruction ROM has been provided for you to start with. It does the memory accesses; you have to fill in the multiplication. (Leave the file name as instrrom-mult.vhd and entity name as instrrom.)

How you do this is entirely up to you. This section of the assignment will be worth more than the others. Half of the marks will be given for completing the multiplication; the other half for the speed of the code. You can assume that the multiplication doesn't overflow—that the result of the multiplication will fit in 16 bits.

Create a text file named about.txt and describe your multiplication algorithm and how the code works. You can also indicate what you did to speed it up.

Doing the multiplication in under 180 cycles (in the worst case) is good. Less than 100 cycles is possible (but not easy).

Please also submit a sim.scr file that traces at least these signals when simulating the tb_mult entity:

You can probably use the provided sim.scr file, at least as a start, for this.

Clock Speed

Create a file clock.txt and answer these questions in it:

  1. What is the minimum clock period that could be used in the CPU without pipelining (as created in Assignment 3)? How do you know this? ["I experimented until it stopped working" is probably not a good answer.]
  2. What is the minimum clock period that can be used in the pipelined CPU created here? [You have to be a little careful here about the read-after-write register file used in the datapath. The writing inputs (DA, RW, D) must get to the register file by the falling edge of the clock.]
  3. What are the propagation delays in the various stages? Which one is limiting the clock speed?

Note: If t is the clock period (in nanoseconds), then the clock frequency (in megahertz) is 1000/t. For example, a clock period of 20 ns gives a frequency of 1000/20 = 50 MHz.


You have to use the Submission server to submit your work. You should submit the files pdp.vhd, pcontrol.vhd, pcpu.vhd, about.txt, sim.scr, instrrom-mult.vhd, clock.txt.

You can do this by typing these commands (change into your assignment directory if you haven't already):

tar cvf a4.tar  pdp.vhd pcontrol.vhd pcpu.vhd about.txt sim.scr clock.txt instrrom-mult.vhd
gzip a4.tar

Then, submit the file a4.tar.gz. If you want to submit a ZIP file instead, you can do that but figuring out how is your problem.

Copyright © Greg Baker, last modified 2004-03-16.