#### **Parallelism and Vector Instructions**

CMPT 295 Week 9

### **WARNING**: Lab 9 and Assignment 5 use fixed-length vector intrinsics, not RISC-V vectors

- Most concepts carry over, even if the programming details do not.
- RISC-V supports variable-length vectors; Lab 9 and Assignment 5 do not.

#### Roadmap



#### What is a computer program?

```
for (int i = 0; i < N; i++){
    output[i] = x[i] * y[i];
}
```

#### What is a computer program?

Processor executes instruction referenced by the program counter (PC)

(executing the instruction will modify machine state: contents of registers, memory, CPU state, etc.)

Move to next instruction ...

Then execute it...



And so on...



#### Scalar Loop

    for (i = 0; i < N; i++) {
        output[i] = x[i] * y[i];
    }

#### Vector Loop (data parallelism)

    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN-1] = x[i:i+VLEN-1] * y[i:i+VLEN-1];
    }
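
The slice notation in the vector loop is not valid C. A minimal sketch of what it means in plain C follows, assuming a hypothetical fixed `VLEN` of 4 elements and an `N` that is a multiple of `VLEN`. The function names are illustrative only. The inner loop stands in for the lanes that vector hardware would run at the same time.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed number of elements per vector operation */

/* Scalar reference: one element per loop iteration. */
void mul_scalar(const int *x, const int *y, int *output, size_t n) {
    for (size_t i = 0; i < n; i++) {
        output[i] = x[i] * y[i];
    }
}

/* Vector-style loop: each outer iteration handles VLEN elements.
 * The inner loop models the lanes of a single vector multiply. */
void mul_vector_style(const int *x, const int *y, int *output, size_t n) {
    assert(n % VLEN == 0);  /* sketch only: no tail handling here */
    for (size_t i = 0; i < n; i += VLEN) {
        for (size_t lane = 0; lane < VLEN; lane++) {
            output[i + lane] = x[i + lane] * y[i + lane];
        }
    }
}
```

Both functions produce the same output; the vector-style version simply runs the outer loop `N / VLEN` times instead of `N` times.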

#### Scalar Execution



#### Vector Execution

Figure: one vector multiply covers x[0..7] \* y[0..7]; the next covers x[8..15] \* y[8..15].

    for (i = 0; i < N; i++) {
        output[i] = x[i] * y[i];
    }

Compiled RISC-V loop:

    # a0: &x[0], a1: &y[0], a2: &result[0], a5: N
    # t1 = 0: loop index i
    loop:
        # load x[i] and y[i]
        lw   a4, 0(a0)
        lw   a3, 0(a1)
        # multiplication
        mul  a4, a4, a3
        # store word
        sw   a4, 0(a2)
        # bump pointers
        addi a0, a0, 4
        addi a1, a1, 4
        addi a2, a2, 4
        addi t1, t1, 1
        bne  t1, a5, loop

- How many total instructions? 9 \* N
- How many useful instructions? 4 \* N (LD, LD, MUL, ST)
- How many useless (maintenance) instructions? 5 \* N

    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN-1] = x[i:i+VLEN-1] * y[i:i+VLEN-1];
    }

- How many total instructions? 9 \* N / VLEN
- How many useful instructions? 4 \* N / VLEN
- How many useless (maintenance) instructions? 5 \* N / VLEN
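
The counts above are simple arithmetic, and a small C sketch makes the scaling easy to check. It assumes the slide's per-iteration mix (4 useful plus 5 maintenance instructions) and that `N` is a multiple of `VLEN`. The function names are illustrative.

```c
#include <assert.h>

/* Per-iteration instruction mix taken from the slide:
 * 4 useful (LD, LD, MUL, ST) + 5 maintenance (pointer bumps, index, branch). */
enum { USEFUL_PER_ITER = 4, MAINTENANCE_PER_ITER = 5 };

/* Total dynamic instructions when each iteration covers vlen elements.
 * vlen == 1 gives the scalar loop. Assumes n is a multiple of vlen. */
unsigned long total_instructions(unsigned long n, unsigned long vlen) {
    assert(vlen > 0 && n % vlen == 0);
    return (n / vlen) * (USEFUL_PER_ITER + MAINTENANCE_PER_ITER);
}

/* Maintenance (loop overhead) instructions only. */
unsigned long maintenance_instructions(unsigned long n, unsigned long vlen) {
    assert(vlen > 0 && n % vlen == 0);
    return (n / vlen) * MAINTENANCE_PER_ITER;
}
```

For example, with N = 1024 the scalar loop executes 9216 instructions, while a VLEN of 8 brings that down to 1152, of which 640 are maintenance.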

### Why Parallelism? Why Efficiency?

A parallel computer is a collection of processing elements that cooperate to solve problems quickly

We care about performance. We care about efficiency.

We're going to use multiple processing elements to get them.

### Speedup

One major motivation of using parallel processing: Speedup

For a given problem:

$ \text{speedup}(P) = \dfrac{\text{execution time (using 1 processing element)}}{\text{execution time (using } P \text{ processing elements)}} $
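
As a worked instance of the definition, here is a minimal C helper. The function name and the timings in the usage note are made up for illustration.

```c
#include <assert.h>

/* speedup = time on 1 processing element / time on P processing elements */
double speedup(double time_one_element, double time_p_elements) {
    assert(time_one_element > 0.0 && time_p_elements > 0.0);
    return time_one_element / time_p_elements;
}
```

If a hypothetical run takes 12 seconds on one processing element and 3 seconds on eight, `speedup(12.0, 3.0)` is 4, which is half of the ideal speedup of 8.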

#### **Parallel Model: Vector Processing**

- Vector processors have high-level operations that work on linear arrays of numbers: "vectors"
  - SCALAR (1 operation): add r3, r1, r2
  - VECTOR (one operation per element): add.vv v3, v1, v2


Scalar, one element per instruction:

    out[0] = x[0] + y[0]
    out[1] = x[1] + y[1]

Vector, VLEN elements per instruction:

    out[0:VLEN-1] = x[0:VLEN-1] + y[0:VLEN-1]
    out[VLEN:2*VLEN-1] = x[VLEN:2*VLEN-1] + y[VLEN:2*VLEN-1]

#### **Vector Registers**

- 32 vector data registers, v0-v31, each VLEN bits long
- Vector length register vl
- Vector type register vtype
- Vector register file
  - Each register is an array of elements
  - Size of each register determines maximum vector length
  - Vector length register determines vector length for a particular operation
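
One way to picture this state is as a C struct. The sketch below is a simplified model, not a description of real hardware: it assumes 32-bit elements and a hypothetical `VLMAX` of 8 elements per register (VLEN = 256 bits), and the type and function names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define VLMAX 8  /* assumed: elements that fit in one register */

/* One vector register: an array of elements. */
typedef struct {
    int32_t elem[VLMAX];
} vreg_t;

/* Minimal vector state: 32 data registers plus the vector length. */
typedef struct {
    vreg_t v[32];
    size_t vl;  /* number of elements the next operation processes */
} vstate_t;

/* Model of "add.vv vd, vs1, vs2": only the first vl elements are written.
 * Elements past vl are left untouched here, which is a simplification. */
void vadd_vv(vstate_t *s, int vd, int vs1, int vs2) {
    assert(s->vl <= VLMAX);
    assert(vd >= 0 && vd < 32 && vs1 >= 0 && vs1 < 32 && vs2 >= 0 && vs2 < 32);
    for (size_t i = 0; i < s->vl; i++) {
        s->v[vd].elem[i] = s->v[vs1].elem[i] + s->v[vs2].elem[i];
    }
}
```

The register size caps `vl` at `VLMAX`, while the current value of `vl` decides how many elements a particular add actually touches.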

#### Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes")

Figure: vector data registers (VLEN bits per vector register, implementation-dependent), vector length register, and vector type register.



#### **Baseline CPU**

Figure: baseline CPU with fetch (PC), the current instruction mul a4,a4,a3, scalar decode, one scalar ALU (ALU 0), and the scalar register file.
## Vector CPU: Add arithmetic units to increase compute capability



Single instruction, multiple data

- Parallelism: Multiple data elements
- Efficiency: Fetch single instruction

The same instruction is broadcast to all ALUs. Each instruction reads and updates multiple elements of a vector register.

#### **Virtual Processor Vector Model**

- Vector operations are SIMD (single instruction, multiple data) operations
- Each element is computed by a virtual processor (VP)
- Number of VPs given by vector length
  - The vector length is held in a vector control register

#### **Vector Architectural State**



#### Scalar Code

    for (i = 0; i < N; i++) {
        output[i] = x[i] + y[i];
    }

Compiled RISC-V loop (a0: N, a1: &output[0], a2: &x[0], a3: &y[0]):

    loop:
        # load x[i] and y[i]
        lw   a5, 0(a2)
        lw   a6, 0(a3)
        # addition
        add  a5, a5, a6
        # store word
        sw   a5, 0(a1)
        # bump pointers
        addi a1, a1, 4
        addi a2, a2, 4
        addi a3, a3, 4
        # decrement the remaining element count
        addi a0, a0, -1
        bnez a0, loop

#### **Vector Code**

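    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN-1] = x[i:i+VLEN-1] + y[i:i+VLEN-1];
    }

Vector RISC-V loop (same register roles as the scalar code; assumes N is a multiple of VLEN):

    # t0 = VLEN (elements per vector)
    loop:
        # load x[i:i+VLEN-1] and y[i:i+VLEN-1]
        vle32.v  v8, (a2)
        vle32.v  v16, (a3)
        # addition
        vadd.vv  v24, v8, v16
        # store res[i:i+VLEN-1]
        vse32.v  v24, (a1)
        # bump pointers by VLEN * 4 bytes
        slli     t1, t0, 2
        add      a2, a2, t1
        add      a3, a3, t1
        add      a1, a1, t1
        # decrement the remaining element count by VLEN
        sub      a0, a0, t0
        bnez     a0, loop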

#### **Masking and Conditional Ops**



- Disable unwanted vector lanes
- Conditional branches, where different vector elements need different operations
- Handling tail (left-over) elements when the software array length is not a multiple of the vector width

#### Tail Processing

    i = 0;
    while (i < N) {
        int VLEN;
        if (N - i > MAX_VLEN)
            VLEN = MAX_VLEN;    // full strip
        else
            VLEN = N - i;       // tail: fewer than MAX_VLEN elements left
        setvl(VLEN);
        res[i:i+VLEN-1] = x[i:i+VLEN-1] + y[i:i+VLEN-1];
        i = i + VLEN;
    }
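
The tail-processing loop can be written as runnable C. This is a sketch under assumptions: `MAX_VLEN` is a hypothetical hardware maximum of 8 elements, `setvl` is a software stand-in for the hardware instruction, and the inner loop models the lanes of one vector add.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VLEN 8  /* assumed hardware maximum, in elements */

/* Stand-in for setvl: the hardware grants at most MAX_VLEN elements. */
static size_t setvl(size_t requested) {
    return requested < MAX_VLEN ? requested : MAX_VLEN;
}

/* Strip-mined add; returns the number of strips executed. */
size_t vec_add(const int *x, const int *y, int *res, size_t n) {
    size_t strips = 0;
    for (size_t i = 0; i < n; ) {
        size_t vlen = setvl(n - i);   /* the last strip may be short */
        assert(vlen > 0);             /* guarantees forward progress */
        for (size_t lane = 0; lane < vlen; lane++) {
            res[i + lane] = x[i + lane] + y[i + lane];
        }
        i += vlen;
        strips++;
    }
    return strips;
}
```

With N = 19 the loop runs three strips of 8, 8, and 3 elements, so no separate scalar clean-up loop is needed for the tail.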









Mask: 0 0 0 0 1 1 1 1

#### What about conditional branches?

Assume the logic below is to be executed for each element in input array 'A', producing output into array 'result'.

```
<unconditional code>
float x = A[i];
if (x > 0) {
    float tmp = exp(x, 5.f);
    tmp *= kMyConst1;
    x = tmp + kMyConst2;
} else {
    float tmp = kMyConst1;
    x = 2.f * tmp;
}
<resume unconditional code>
result[i] = x;
```


#### Masks discard the output of ALUs

Not all ALUs do useful work. Worst case: 1/8 of peak performance.
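
A rough C model of masked execution may help. It assumes 8 lanes and simplified branch bodies with made-up constants `K1` and `K2`, not the exact arithmetic from the slide. Every lane steps through both sides of the branch, and the mask decides which lanes keep each result.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8                            /* assumed SIMD width */
static const float K1 = 3.0f, K2 = 1.0f;   /* hypothetical constants */

/* Process one group of LANES elements the way masked SIMD hardware would.
 * Returns how many lanes took the "then" path. */
size_t masked_if_else(const float *a, float *result) {
    int mask[LANES];
    size_t active = 0;
    assert(a != NULL && result != NULL);
    for (size_t l = 0; l < LANES; l++) {   /* evaluate the condition */
        mask[l] = a[l] > 0.0f;
        active += (size_t)mask[l];
    }
    for (size_t l = 0; l < LANES; l++) {   /* "then" pass, all lanes compute */
        float tmp = a[l] * K1 + K2;
        if (mask[l]) result[l] = tmp;      /* masked-off lanes discard tmp */
    }
    for (size_t l = 0; l < LANES; l++) {   /* "else" pass, inverted mask */
        float tmp = 2.0f * K1;
        if (!mask[l]) result[l] = tmp;
    }
    return active;
}
```

If only one of the eight lanes takes the "then" path, seven ALUs compute results that are thrown away during that pass, which is the 1/8 worst case.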

#### After branch continue normal execution





### Terminology

- Instruction stream coherence ("coherent execution")
  - Same instruction sequence applies to all elements operated upon simultaneously
  - Coherent execution is necessary for efficient use of SIMD processing resources
  - Coherent execution IS NOT necessary for efficient parallelization across cores, since each core has the capability to fetch/decode a different instruction stream
- "Divergent" execution
  - A lack of instruction stream coherence

#### **New RISC-V "V" Vector Extension**

- Standard extension to the RISC-V ISA
  - An updated form of Cray-style vectors for modern microprocessors
  - Appearing in commercial implementations from Alibaba, Andes, Semidynamics, SiFive, ...
  - Basis of European supercomputer initiative (EPI)
- Following slides present short tutorial on current standard
  - https://github.com/riscv/riscv-v-spec

#### **RISC-V Scalar State**

Program counter (pc)

32 integer registers (x0-x31), each 32 or 64 bits wide; x0 always contains 0

Floating-point (FP) adds 32 registers (**f0-f31**)

- each can contain a single- or double-precision FP value (32-bit or 64-bit IEEE FP)

FP status register (**fcsr**), used for FP rounding mode & exception reporting

ISA string options:

- RV32I (XLEN=32, no FP)
- RV32IF (XLEN=32, FLEN=32)
- RV32ID (XLEN=32, FLEN=64)
- RV64I (XLEN=64, no FP)
- RV64IF (XLEN=64, FLEN=32)
- RV64ID (XLEN=64, FLEN=64)

Figure: the floating-point register file f0-f31, each FLEN bits wide (bits FLEN-1 down to 0), and the 32-bit **fcsr** (bits 31 down to 0).

#### **Vector Extension Additional State**

- Vector length register vl
- Vector type register vtype
- Other control registers:
  - vstart
    - For trap handling
  - vxrm/vxsat
    - Fixed-point rounding mode/saturation
    - Also appear in separate vcsr
  - vlenb
    - Gives vector length in bytes (read-only)

Figure: vector data registers **v0**-**v31** (VLEN bits per vector register, implementation-dependent), vector length register **vl**, and vector type register **vtype**.

#### Vector Type Register (vtype)

*Ideally, info would be in instruction encoding, but no space in 32-bit instructions. Planned 64-bit encoding extension would add these as instruction bits.* 

| Bits | Field      | Meaning                                   |
|------|------------|-------------------------------------------|
| 31   | vill       | illegal value if set                      |
| 30:8 | reserved   | write 0                                   |
| 7    | vma        | mask-agnostic                             |
| 6    | vta        | tail-agnostic                             |
| 5:3  | vsew[2:0]  | standard element width (SEW)              |
| 2:0  | vlmul[2:0] | vector register length multiplier (LMUL)  |

**vsew[2:0]** field encodes standard element width (SEW) in bits of elements in vector register (SEW = $8 \cdot 2^{vsew}$)

| vsew[2:0] | SEW  |
|-----------|------|
| 0 0 0     | 8    |
| 0 0 1     | 16   |
| 0 1 0     | 32   |
| 0 1 1     | 64   |
| 1 0 0     | 128  |
| 1 0 1     | 256  |
| 1 1 0     | 512  |
| 1 1 1     | 1024 |

**vlmul[2:0]** encodes vector register length multiplier (LMUL = $2^{vlmul}$, from 1/8 to 8)

**vta** specifies *tail-agnostic*

**vma** specifies *mask-agnostic*
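
The field layout can be exercised with a short decoder. This is a sketch that follows the bit positions and the SEW formula given on the slide; the struct and function names are invented, and `vlmul` is treated as a signed 3-bit value so that fractional LMUL comes out as a negative exponent.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    unsigned sew;        /* element width in bits */
    int      vlmul;      /* signed exponent: LMUL = 2^vlmul */
    unsigned vta, vma;   /* tail-agnostic / mask-agnostic flags */
} vtype_fields_t;

/* Decode the low 8 bits of vtype:
 * vlmul = bits 2:0, vsew = bits 5:3, vta = bit 6, vma = bit 7. */
vtype_fields_t decode_vtype(uint32_t vtype) {
    vtype_fields_t f;
    unsigned vlmul_raw = vtype & 0x7u;
    unsigned vsew      = (vtype >> 3) & 0x7u;
    /* Sign-extend the 3-bit field: 0b111 -> -1 (LMUL = 1/2), and so on. */
    f.vlmul = (vlmul_raw & 0x4u) ? (int)vlmul_raw - 8 : (int)vlmul_raw;
    assert(f.vlmul != -4);   /* 0b100 is a reserved encoding */
    f.sew = 8u << vsew;      /* SEW = 8 * 2^vsew */
    f.vta = (vtype >> 6) & 1u;
    f.vma = (vtype >> 7) & 1u;
    return f;
}
```

For instance, `0xD1` decodes to SEW = 32, LMUL = 2, tail-agnostic, mask-agnostic, which is what `e32,m2,ta,ma` asks for in a `vsetvli`.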

#### **Example Vector Register Data Layouts (LMUL=1)**

| VLEN | SEW  | Element indices (most significant first) |
|------|------|------------------------------------------|
| 32b  | 8b   | 3 2 1 0 |
| 32b  | 16b  | 1 0 |
| 32b  | 32b  | 0 |
| 64b  | 8b   | 7 6 5 4 3 2 1 0 |
| 64b  | 16b  | 3 2 1 0 |
| 64b  | 32b  | 1 0 |
| 64b  | 64b  | 0 |
| 128b | 8b   | F E D C B A 9 8 7 6 5 4 3 2 1 0 |
| 128b | 16b  | 7 6 5 4 3 2 1 0 |
| 128b | 32b  | 3 2 1 0 |
| 128b | 64b  | 1 0 |
| 128b | 128b | 0 |
| 256b | 8b   | 1F 1E 1D 1C 1B 1A 19 18 17 16 15 14 13 12 11 10 F E D C B A 9 8 7 6 5 4 3 2 1 0 |
| 256b | 16b  | F E D C B A 9 8 7 6 5 4 3 2 1 0 |
| 256b | 32b  | 7 6 5 4 3 2 1 0 |
| 256b | 64b  | 3 2 1 0 |
| 256b | 128b | 1 0 |
| 256b | 256b | 0 |

#### Setting vector configuration, vsetvli/vsetivli/vsetvl

The **vset**{**i**}**vl**{**i**} configuration instructions set the **vtype** register, and also set the **vl** register, returning the **vl** value in a scalar register.

- The register-immediate form, **vsetvli**, is the usual way to set **vtype** parameters.
- The immediate-immediate form, **vsetivli**, is used when the vector length is known statically.
- The register-register form, **vsetvl**, is usually used only for context save/restore.

#### **Vector Length Multiplier, LMUL**

- Gives fewer but longer vector registers
  - These "vector register groups" operate as single vectors
  - Must use even register names only for LMUL=2 (v0, v2, ...), and every fourth register for LMUL=4 (v0, v4, ...), etc.
- Used to
  - 1) increase efficiency by using longer vectors
  - 2) accommodate mixed-width operations (e.g., masks)

Element indices for 32-bit elements in 16-byte vector registers:

| LMUL | Register | Bytes F-C | Bytes B-8 | Bytes 7-4 | Bytes 3-0 |
|------|----------|-----------|-----------|-----------|-----------|
| 2    | v2\*n+0  | 3         | 2         | 1         | 0         |
| 2    | v2\*n+1  | 7         | 6         | 5         | 4         |
| 4    | v4\*n+0  | 9         | 8         | 1         | 0         |
| 4    | v4\*n+1  | B         | A         | 3         | 2         |
| 4    | v4\*n+2  | D         | C         | 5         | 4         |
| 4    | v4\*n+3  | F         | E         | 7         | 6         |
### Simple stripmined vector memcpy example

    # void *memcpy(void* dest, const void* src, size_t n)
    # a0=dest, a1=src, a2=n
    #
    memcpy:
        mv a3, a0                        # Copy destination
    loop:
        vsetvli t0, a2, e8, m8, ta, ma   # Vectors of 8b
        vle8.v v0, (a1)                  # Load bytes
        add a1, a1, t0                   # Bump pointer
        sub a2, a2, t0                   # Decrement count
        vse8.v v0, (a3)                  # Store bytes
        add a3, a3, t0                   # Bump pointer
        bnez a2, loop                    # Any more?
        ret                              # Return

- vsetvli: set configuration, calculate vector strip length
- vle8.v: unit-stride vector load of elements (bytes)
- vse8.v: unit-stride vector store of elements (bytes)

Same binary machine code can run on machines with any VLEN!
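The stripmining loop can be modeled in plain C to see why the code does not depend on the hardware vector length. This is a minimal sketch, not the hardware behavior: `VLMAX_BYTES` is a hypothetical constant standing in for the machine's limit, and `set_vl` is an illustrative helper that imitates only the simplest choice `vsetvli` can make (grant the smaller of the request and the limit).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hardware limit (assumption): bytes per vector register group. */
#define VLMAX_BYTES 256u

/* Simplified stand-in for vsetvli: grant min(requested, VLMAX). */
static size_t set_vl(size_t avl) {
    return avl < VLMAX_BYTES ? avl : VLMAX_BYTES;
}

/* Scalar model of the stripmined loop; each iteration copies one strip. */
void *memcpy_stripmined(void *dest, const void *src, size_t n) {
    assert(n == 0 || (dest != NULL && src != NULL));
    unsigned char *d = dest;            /* a3: working copy of dest     */
    const unsigned char *s = src;       /* a1: source pointer           */
    while (n != 0) {                    /* bnez a2, loop                */
        size_t vl = set_vl(n);          /* vsetvli t0, a2, e8,m8,ta,ma  */
        for (size_t i = 0; i < vl; i++) /* vle8.v followed by vse8.v    */
            d[i] = s[i];
        s += vl;                        /* add a1, a1, t0               */
        n -= vl;                        /* sub a2, a2, t0               */
        d += vl;                        /* add a3, a3, t0               */
    }
    return dest;
}
```

Changing `VLMAX_BYTES` changes only how many iterations run, not the result, which is the point of the slide: the loop asks the hardware how much it can do per strip instead of hard-coding it.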

#### **Vector Unit-Stride Loads/Stores**

`vd` destination, `rs1` base address, `vm` is mask encoding (`v0.t` or `<missing>`)

| Instruction | Operands        | Description             |
|-------------|-----------------|-------------------------|
| `vle8.v`    | `vd, (rs1), vm` | 8-bit unit-stride load  |
| `vle16.v`   | `vd, (rs1), vm` | 16-bit unit-stride load |
| `vle32.v`   | `vd, (rs1), vm` | 32-bit unit-stride load |
| `vle64.v`   | `vd, (rs1), vm` | 64-bit unit-stride load |

`vs3` store data, `rs1` base address, `vm` is mask encoding (`v0.t` or `<missing>`)

| Instruction | Operands         | Description              |
|-------------|------------------|--------------------------|
| `vse8.v`    | `vs3, (rs1), vm` | 8-bit unit-stride store  |
| `vse16.v`   | `vs3, (rs1), vm` | 16-bit unit-stride store |
| `vse32.v`   | `vs3, (rs1), vm` | 32-bit unit-stride store |
| `vse64.v`   | `vs3, (rs1), vm` | 64-bit unit-stride store |

for i = 0 to VLEN - 1
vd[i] = load(rs1 + i)
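The pseudocode can be written as a scalar C loop. This is a minimal sketch for 32-bit elements only; the function name is illustrative, and `vl` stands for the number of active elements, the role the loop bound plays in the slide's pseudocode. Consecutive elements sit one element size (here 4 bytes) apart in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of vle32.v: load vl consecutive 32-bit elements from base. */
void unit_stride_load32(int32_t *vd, const int32_t *base, size_t vl) {
    assert(vd != NULL && base != NULL);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = base[i];    /* byte address = rs1 + i * 4 */
    }
}
```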

#### **Vector Strided Load/Store Instructions**

`vd` destination, `rs1` base address, `rs2` byte stride

| Instruction | Operands             | Description         |
|-------------|----------------------|---------------------|
| `vlse8.v`   | `vd, (rs1), rs2, vm` | 8-bit strided load  |
| `vlse16.v`  | `vd, (rs1), rs2, vm` | 16-bit strided load |
| `vlse32.v`  | `vd, (rs1), rs2, vm` | 32-bit strided load |
| `vlse64.v`  | `vd, (rs1), rs2, vm` | 64-bit strided load |

`vs3` store data, `rs1` base address, `rs2` byte stride

| Instruction | Operands              | Description          |
|-------------|-----------------------|----------------------|
| `vsse8.v`   | `vs3, (rs1), rs2, vm` | 8-bit strided store  |
| `vsse16.v`  | `vs3, (rs1), rs2, vm` | 16-bit strided store |
| `vsse32.v`  | `vs3, (rs1), rs2, vm` | 32-bit strided store |
| `vsse64.v`  | `vs3, (rs1), rs2, vm` | 64-bit strided store |

for i = 0 to VLEN - 1
 vd[i] = load(rs1 + i\*rs2)
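A strided load is what you would use to read one column of a row-major matrix. The sketch below models the 32-bit case in C under the assumption that the stride is given in bytes, as the operand description states; the function name is illustrative and `vl` is the number of active elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vlse32.v: element i comes from byte address base + i*stride. */
void strided_load32(int32_t *vd, const void *base,
                    ptrdiff_t stride_bytes, size_t vl) {
    assert(vd != NULL && base != NULL);
    const unsigned char *p = base;
    for (size_t i = 0; i < vl; i++) {
        /* memcpy avoids alignment assumptions about the computed address */
        memcpy(&vd[i], p + (ptrdiff_t)i * stride_bytes, sizeof vd[i]);
    }
}
```

For a matrix `int32_t m[3][4]`, calling `strided_load32(col, &m[0][1], 4 * sizeof(int32_t), 3)` gathers column 1, because each row starts 16 bytes after the previous one.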

#### **Vector Indexed Loads/Stores**

Vector unordered indexed load instructions (`vd` destination, `rs1` base address, `vs2` indices)

| Instruction  | Operands             | Description                                |
|--------------|----------------------|--------------------------------------------|
| `vluxei8.v`  | `vd, (rs1), vs2, vm` | unordered 8-bit indexed load of SEW data   |
| `vluxei16.v` | `vd, (rs1), vs2, vm` | unordered 16-bit indexed load of SEW data  |
| `vluxei32.v` | `vd, (rs1), vs2, vm` | unordered 32-bit indexed load of SEW data  |
| `vluxei64.v` | `vd, (rs1), vs2, vm` | unordered 64-bit indexed load of SEW data  |

Vector ordered indexed load instructions (`vd` destination, `rs1` base address, `vs2` indices)

| Instruction  | Operands             | Description                              |
|--------------|----------------------|------------------------------------------|
| `vloxei8.v`  | `vd, (rs1), vs2, vm` | ordered 8-bit indexed load of SEW data   |
| `vloxei16.v` | `vd, (rs1), vs2, vm` | ordered 16-bit indexed load of SEW data  |
| `vloxei32.v` | `vd, (rs1), vs2, vm` | ordered 32-bit indexed load of SEW data  |
| `vloxei64.v` | `vd, (rs1), vs2, vm` | ordered 64-bit indexed load of SEW data  |

Vector unordered indexed store instructions (`vs3` store data, `rs1` base address, `vs2` indices)

| Instruction  | Operands              | Description                                 |
|--------------|-----------------------|---------------------------------------------|
| `vsuxei8.v`  | `vs3, (rs1), vs2, vm` | unordered 8-bit indexed store of SEW data   |
| `vsuxei16.v` | `vs3, (rs1), vs2, vm` | unordered 16-bit indexed store of SEW data  |
| `vsuxei32.v` | `vs3, (rs1), vs2, vm` | unordered 32-bit indexed store of SEW data  |
| `vsuxei64.v` | `vs3, (rs1), vs2, vm` | unordered 64-bit indexed store of SEW data  |

Vector ordered indexed store instructions (`vs3` store data, `rs1` base address, `vs2` indices)

| Instruction  | Operands              | Description                               |
|--------------|-----------------------|-------------------------------------------|
| `vsoxei8.v`  | `vs3, (rs1), vs2, vm` | ordered 8-bit indexed store of SEW data   |
| `vsoxei16.v` | `vs3, (rs1), vs2, vm` | ordered 16-bit indexed store of SEW data  |
| `vsoxei32.v` | `vs3, (rs1), vs2, vm` | ordered 32-bit indexed store of SEW data  |
| `vsoxei64.v` | `vs3, (rs1), vs2, vm` | ordered 64-bit indexed store of SEW data  |

for i = 0 to VLEN - 1
 vd[i] = load(rs1 + vs2[i])

Index data width encoded in instruction, while data size encoded in vtype.vsew field
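The split between index width and data width is easy to see in a scalar model. The sketch below imitates `vluxei16.v` with SEW=32: the indices are 16-bit values (from the instruction name) while the loaded data is 32 bits wide (from `vtype.vsew`). It assumes the indices are unsigned byte offsets from the base address; the function name is illustrative and `vl` is the number of active elements.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of an indexed load (gather): 16-bit byte offsets, 32-bit data. */
void indexed_load32_idx16(int32_t *vd, const void *base,
                          const uint16_t *vs2, size_t vl) {
    assert(vd != NULL && base != NULL && vs2 != NULL);
    const unsigned char *p = base;
    for (size_t i = 0; i < vl; i++) {
        /* vd[i] = load(rs1 + vs2[i]) */
        memcpy(&vd[i], p + vs2[i], sizeof vd[i]);
    }
}
```

Because the offsets are in bytes, picking element `k` of an `int32_t` table means using the offset `4 * k`.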

VLEN=256b, SLEN=128b

#### SEW=8b, LMUL=1, VLMAX=32

| 1F | 1E | 1D | 1C | 1B | 1A | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | F | E | D | C | B | A | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | Byte   |
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|--------|
| 1F | 1E | 1D | 1C | 1B | 1A | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | F | E | D | C | B | A | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 | v1*n+0 |

#### SEW=16b, LMUL=2, VLMAX=32

| 1F 1E | 1D 1C | 1B 1A | 19 18 | 17 16 | 15 14 | 13 12 | 11 10 | F E | D C | B A | 9 8 | 7 6 | 5 4 | 3 2 | 1 0 | Byte   |
|-------|-------|-------|-------|-------|-------|-------|-------|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 17    | 16    | 15    | 14    | 13    | 12    | 11    | 10    | 7   | 6   | 5   | 4   | 3   | 2   | 1   | 0   | v2*n+0 |
| 1F    | 1E    | 1D    | 1C    | 1B    | 1A    | 19    | 18    | F   | E   | D   | C   | B   | A   | 9   | 8   | v2*n+1 |

#### SEW=32b, LMUL=4, VLMAX=32

| 1F 1E 1D 1C | 1B 1A 19 18 | 17 16 15 14 | 13 12 11 10 | F E D C | B A 9 8 | 7 6 5 4 | 3 2 1 0 | Byte       |
|-------------|-------------|-------------|-------------|------|---------|------|---------|------------|
| 13          | 12          | 11          | 10          | 3    | 2       | 1    | 0       | v4 * n + 0 |
| 17          | 16          | 15          | 14          | 7    | 6       | 5    | 4       | v4 * n + 1 |
| 1B          | 1A          | 19          | 18          | В    | A       | 9    | 8       | v4 * n + 2 |
| 1F          | 1E          | 1D          | 1C          | F    | E       | D    | С       | v4 * n + 3 |

#### SEW=64b, LMUL=8, VLMAX=32

| 1F 1E 1D 1C 1B 1A 19 18 | 17 16 15 14 13 12 11 10 | F E D C B A 9 8 | 7 6 5 4 3 2 1 0 | Byte       |
|-------------------------|-------------------------|-----------------|-----------------|------------|
| 11                      | 10                      | 1               | 0               | v8 * n + 0 |
| 13                      | 12                      | 3               | 2               | v8 * n + 1 |
| 15                      | 14                      | 5               | 4               | v8 * n + 2 |
| 17                      | 16                      | 7               | 6               | v8 * n + 3 |
| 19                      | 18                      | 9               | 8               | v8 * n + 4 |
| 1B                      | 1A                      | B               | A               | v8 * n + 5 |
| 1D                      | 1C                      | D               | C               | v8 * n + 6 |
| 1F                      | 1E                      | F               | E               | v8 * n + 7 |
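All four layouts above report VLMAX=32 because doubling SEW is offset by doubling LMUL. The relationship VLMAX = LMUL × VLEN / SEW can be checked with a few lines of C. This is a small sketch with an illustrative helper name; it models integer LMUL values only.

```c
#include <assert.h>

/* VLMAX = LMUL * VLEN / SEW, with VLEN and SEW in bits (integer LMUL only). */
unsigned vlmax(unsigned vlen_bits, unsigned sew_bits, unsigned lmul) {
    assert(sew_bits != 0 && vlen_bits % sew_bits == 0);
    return lmul * (vlen_bits / sew_bits);
}
```

With VLEN=256b, each of the configurations on the slides (SEW=8/LMUL=1, SEW=16/LMUL=2, SEW=32/LMUL=4, SEW=64/LMUL=8) evaluates to 32 elements.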

### LMUL=8 stripmined vector memcpy example

`void *memcpy(void* dest, const void* src, size_t n)` with `a0=dest`, `a1=src`, `a2=n`

| Assembly                      | Comment          | Annotation                                                                                                     |
|-------------------------------|------------------|----------------------------------------------------------------------------------------------------------------|
| `memcpy:`                     |                  |                                                                                                                |
| `mv a3, a0`                   | Copy destination |                                                                                                                |
| `loop:`                       |                  |                                                                                                                |
| `vsetvli t0, a2, e8,m8,ta,ma` | Vectors of 8b    | Set configuration, calculate vector strip length; combine eight vector registers into group (v0, v1, ..., v7)  |
| `vle8.v v0, (a1)`             | Load bytes       | Unit-stride vector load bytes                                                                                  |
| `add a1, a1, t0`              | Bump pointer     |                                                                                                                |
| `sub a2, a2, t0`              | Decrement count  |                                                                                                                |
| `vse8.v v0, (a3)`             | Store bytes      | Unit-stride vector store bytes                                                                                 |
| `add a3, a3, t0`              | Bump pointer     |                                                                                                                |
| `bnez a2, loop`               | Any more?        |                                                                                                                |
| `ret`                         | Return           |                                                                                                                |

Binary machine code can run on machines with any VLEN!

#### Masking

- Nearly all operations can optionally execute under a mask (or predicate) held in vector register v0
- A single vm bit in the instruction encoding selects whether the operation is unmasked or under control of v0

- Integer and FP compare instructions are provided to write masks into any vector register
- Mask logical operations can be performed between any vector registers
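What "under a mask" means for a single instruction can be sketched as a scalar loop. The model below is a simplification under stated assumptions: it stores one mask byte per element and uses its least-significant bit as the mask bit (mirroring the `.LSB` notation on the mask logical operations slide), and it assumes inactive elements keep their old value. The function name is illustrative, not an intrinsic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked vector add: vd[i] = vs2[i] + vs1[i] where mask is set. */
void masked_add(uint32_t *vd, const uint32_t *vs2, const uint32_t *vs1,
                const uint8_t *v0, size_t vl) {
    assert(vd != NULL && vs2 != NULL && vs1 != NULL && v0 != NULL);
    for (size_t i = 0; i < vl; i++) {
        if (v0[i] & 1u) {               /* element is active */
            vd[i] = vs2[i] + vs1[i];
        }                               /* inactive: vd[i] left unchanged (assumption) */
    }
}
```

A compare instruction would typically produce `v0`, so a scalar `if` inside a loop body becomes a compare followed by masked operations.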

#### **Vector Integer Add Instructions**

```
# Integer adds.
vadd.vv vd, vs2, vs1, vm # Vector-vector
vadd.vx vd, vs2, rs1, vm # vector-scalar
vadd.vi vd, vs2, imm, vm # vector-immediate
# Integer subtract
vsub.vv vd, vs2, vs1, vm # Vector-vector
vsub.vx vd, vs2, rs1, vm # vector-scalar
# Integer reverse subtract
vrsub.vx vd, vs2, rs1, vm # vd[i] = rs1 - vs2[i]
vrsub.vi vd, vs2, imm, vm # vd[i] = imm - vs2[i]
```

#### **Integer Compare Instructions**

| Comparison | Assembler Mapping               | Assembler Pseudoinstruction  |
|------------|---------------------------------|------------------------------|
| va < vb    | `vmslt{u}.vv vd, va, vb, vm`    |                              |
| va <= vb   | `vmsle{u}.vv vd, va, vb, vm`    |                              |
| va > vb    | `vmslt{u}.vv vd, vb, va, vm`    | `vmsgt{u}.vv vd, va, vb, vm` |
| va >= vb   | `vmsle{u}.vv vd, vb, va, vm`    | `vmsge{u}.vv vd, va, vb, vm` |
| va < x     | `vmslt{u}.vx vd, va, x, vm`     |                              |
| va <= x    | `vmsle{u}.vx vd, va, x, vm`     |                              |
| va > x     | `vmsgt{u}.vx vd, va, x, vm`     |                              |
| va >= x    | see below                       |                              |
| va < i     | `vmsle{u}.vi vd, va, i-1, vm`   | `vmslt{u}.vi vd, va, i, vm`  |
| va <= i    | `vmsle{u}.vi vd, va, i, vm`     |                              |
| va > i     | `vmsgt{u}.vi vd, va, i, vm`     |                              |
| va >= i    | `vmsgt{u}.vi vd, va, i-1, vm`   | `vmsge{u}.vi vd, va, i, vm`  |

- va, vb: vector register groups
- x: scalar integer register
- i: immediate
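The immediate rows of the table rely on the integer identities `va < i` ⇔ `va <= i-1` and `va >= i` ⇔ `va > i-1`, which is why only "less-or-equal" and "greater-than" immediate forms need real encodings. The sketch below models the signed case in C. The function names only echo the mnemonics and are not intrinsics; the model writes one result byte per element and ignores the limited immediate range of the real encoding (so `imm - 1` is assumed not to overflow).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Real encodings in the table: set mask where va[i] <= imm / va[i] > imm. */
void vmsle_vi(uint8_t *vd, const int32_t *va, int32_t imm, size_t vl) {
    assert(vd != NULL && va != NULL);
    for (size_t i = 0; i < vl; i++) vd[i] = (uint8_t)(va[i] <= imm);
}

void vmsgt_vi(uint8_t *vd, const int32_t *va, int32_t imm, size_t vl) {
    assert(vd != NULL && va != NULL);
    for (size_t i = 0; i < vl; i++) vd[i] = (uint8_t)(va[i] > imm);
}

/* Pseudoinstructions: rewritten by the assembler using imm - 1. */
void vmslt_vi(uint8_t *vd, const int32_t *va, int32_t imm, size_t vl) {
    vmsle_vi(vd, va, imm - 1, vl);      /* va < i   <=>  va <= i-1 */
}

void vmsge_vi(uint8_t *vd, const int32_t *va, int32_t imm, size_t vl) {
    vmsgt_vi(vd, va, imm - 1, vl);      /* va >= i  <=>  va > i-1  */
}
```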

#### **Mask Logical Operations**

| Instruction                | Effect                                     |
|----------------------------|--------------------------------------------|
| `vmand.mm vd, vs2, vs1`    | `vd[i] = vs2[i].LSB && vs1[i].LSB`         |
| `vmnand.mm vd, vs2, vs1`   | `vd[i] = !(vs2[i].LSB && vs1[i].LSB)`      |
| `vmandnot.mm vd, vs2, vs1` | `vd[i] = vs2[i].LSB && !vs1[i].LSB`        |
| `vmxor.mm vd, vs2, vs1`    | `vd[i] = vs2[i].LSB ^^ vs1[i].LSB`         |
| `vmor.mm vd, vs2, vs1`     | `vd[i] = vs2[i].LSB \|\| vs1[i].LSB`       |
| `vmnor.mm vd, vs2, vs1`    | `vd[i] = !(vs2[i].LSB \|\| vs1[i].LSB)`    |
| `vmornot.mm vd, vs2, vs1`  | `vd[i] = vs2[i].LSB \|\| !vs1[i].LSB`      |
| `vmxnor.mm vd, vs2, vs1`   | `vd[i] = !(vs2[i].LSB ^^ vs1[i].LSB)`      |

Several assembler pseudoinstructions are defined as shorthand for common uses of mask logical operations:

| Pseudoinstruction | Expansion               | Purpose             |
|-------------------|-------------------------|---------------------|
| `vmcpy.m vd, vs`  | `vmand.mm vd, vs, vs`   | Copy mask register  |
| `vmclr.m vd`      | `vmxor.mm vd, vd, vd`   | Clear mask register |
| `vmset.m vd`      | `vmxnor.mm vd, vd, vd`  | Set mask register   |
| `vmnot.m vd, vs`  | `vmnand.mm vd, vs, vs`  | Invert bits         |
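Each pseudoinstruction works because of a simple Boolean identity: `a AND a = a`, `a XOR a = 0`, `NOT (a XOR a) = 1`, and `NOT (a AND a) = NOT a`. The sketch below checks these in C, assuming for illustration that a mask register is modeled as a 32-bit word with one bit per element; the type and function names only echo the mnemonics.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t mask32;    /* illustrative: 32 mask bits packed in one word */

/* The four real mask instructions used by the expansions. */
static mask32 vmand(mask32 a, mask32 b)  { return a & b; }
static mask32 vmnand(mask32 a, mask32 b) { return ~(a & b); }
static mask32 vmxor(mask32 a, mask32 b)  { return a ^ b; }
static mask32 vmxnor(mask32 a, mask32 b) { return ~(a ^ b); }

/* Pseudoinstructions, expanded exactly as in the table. */
mask32 vmcpy(mask32 vs) { return vmand(vs, vs); }    /* copy   */
mask32 vmclr(mask32 vd) { return vmxor(vd, vd); }    /* clear  */
mask32 vmset(mask32 vd) { return vmxnor(vd, vd); }   /* set    */
mask32 vmnot(mask32 vs) { return vmnand(vs, vs); }   /* invert */

/* Self-check: the identities hold for any mask value m. */
void check_pseudo_ops(mask32 m) {
    assert(vmcpy(m) == m);
    assert(vmclr(m) == 0u);
    assert(vmset(m) == 0xFFFFFFFFu);
    assert(vmnot(m) == (mask32)~m);
}
```

Note that `vmclr` and `vmset` ignore the old contents of `vd`, so no separate "load constant mask" instruction is needed.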

#### Agnostic vs Undisturbed

vsetvli t0, a0, e32, m1, ta, ma




