# **Parallelism and Vector Instructions**

CMPT 295 Week 9


**WARNING**: Lab 9 and Assignment 5 work with fixed-length vector intrinsics, not RISC-V

- Most concepts carry over, even if the programming details do not
- RISC-V supports variable length vectors, but Lab 9 and Assignment 5 do not

### Roadmap



# What is a computer program?

    for (int i = 0; i < N; i++) {
        output[i] = x[i] * y[i];
    }

# What is a (sequential) computer program?

Processor executes instruction referenced by the program counter (PC)

(executing the instruction will modify machine state: contents of registers, memory, CPU state, etc.)

Move to next instruction ...

Then execute it...

And so on...

```
# a0: &x[0], a1: &y[0],
# a2: &output[0], a5: N (assumed > 0)
    li   t1, 0          # t1 = 0: loop index i
loop:
    # load x[i] and y[i]
    lw   a4, 0(a0)
    lw   a3, 0(a1)
    # multiplication
    mul  a4, a4, a3
    # store word
    sw   a4, 0(a2)
    # Bump pointers
    addi a0, a0, 4
    addi a1, a1, 4
    addi a2, a2, 4
    addi t1, t1, 1
    bne  t1, a5, loop
```

# We don't have to do ops one-at-a-time

Scalar Loop

    for (i = 0; i < N; i++) {
        output[i] = x[i] * y[i];
    }

Vector Loop (data parallelism)

    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN-1] = x[i:i+VLEN-1] * y[i:i+VLEN-1];
    }

#### Scalar Execution



#### Vector Execution

| x[0] | x[1] | x[2] | x[3] | ... | x[7] |
|------|------|------|------|-----|------|
| *    | *    | *    | *    | *   | *    |
| y[0] | y[1] | y[2] | y[3] | ... | y[7] |

| x[8] | x[9] | x[10] | x[11] | ... | x[15] |
|------|------|-------|-------|-----|-------|
| *    | *    | *     | *     | *   | *     |
| y[8] | y[9] | y[10] | y[11] | ... | y[15] |

For the scalar loop `output[i] = x[i] * y[i]`, using the 9-instruction RISC-V loop body shown earlier (lw, lw, mul, sw, four addi, bne):

- How many total inst?
  - 9 \* N
- How many useful inst?
  - 4 \* N (LD, LD, MUL, ST)
- How many useless (maintenance) inst?
  - 5 \* N

    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN-1] = x[i:i+VLEN-1] * y[i:i+VLEN-1];
    }

- How many total inst?
  - 9 \* N / VLEN
- How many useful inst?
  - 4 \* N / VLEN
- How many useless inst?
  - 5 \* N / VLEN
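The counts above can be checked with a small calculation. The sketch below is only illustrative: the function names are made up for this example, and it rounds the iteration count up so that a partial last vector still costs a full iteration, which the slide's `9 * N / VLEN` glosses over by assuming N is a multiple of VLEN.

```c
#include <assert.h>

/* Instructions per loop iteration, taken from the loops above:
 * 4 useful (load, load, multiply, store) + 5 maintenance = 9. */
enum { USEFUL_PER_ITER = 4, MAINT_PER_ITER = 5 };

/* Number of loop iterations when each iteration handles `vlen` elements.
 * vlen = 1 models the scalar loop. Rounds up for a partial last vector. */
long iterations(long n, long vlen) {
    assert(n >= 0 && vlen > 0);
    return (n + vlen - 1) / vlen;
}

/* Total dynamic instruction count for n elements. */
long total_instructions(long n, long vlen) {
    return iterations(n, vlen) * (USEFUL_PER_ITER + MAINT_PER_ITER);
}

/* Maintenance ("useless") instruction count for n elements. */
long maintenance_instructions(long n, long vlen) {
    return iterations(n, vlen) * MAINT_PER_ITER;
}
```

For N = 1024, `total_instructions(1024, 1)` gives 9216 for the scalar loop and `total_instructions(1024, 8)` gives 1152 for VLEN = 8.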

## **Parallel Model: Vector Processing**

- Vector processors have high-level operations that work on linear arrays of numbers: "vectors"

Scalar instruction (one operation): `add r3, r1, r2`

Vector instruction (one operation per vector element): `add.vv v3, v1, v2`



# Why Parallelism? Why Efficiency?

A parallel computer is a collection of processing elements that cooperate to solve problems quickly

We care about performance and efficiency

We're going to use multiple processing elements to get it

# Speedup

One major motivation for using parallel processing: speedup

For a given problem:

$$\text{speedup}(P) = \frac{\text{execution time using 1 element}}{\text{execution time using } P \text{ elements}}$$
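As a quick worked instance of the definition, the sketch below computes speedup from two measured times. The function name and the sample timings are assumptions made for illustration only.

```c
#include <assert.h>

/* speedup = (execution time using 1 element) / (execution time using P elements) */
double speedup(double time_one, double time_p) {
    assert(time_one > 0.0 && time_p > 0.0);
    return time_one / time_p;
}
```

For example, if a problem takes 12 s on one processing element and 3 s on eight, `speedup(12.0, 3.0)` is 4, which is well short of the ideal value of 8.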

#### **Vector Registers**

- Vector length register vl
- Vector type register vtype

#### Vector register file

- Each register is an array of elements
- Size of each register determines maximum vector length
- Vector length register determines vector length for a particular operation

#### Multiple parallel execution units = "lanes" (sometimes called "pipelines" or "pipes")




# **RISC-V Scalar State**

Program counter (pc)

32 integer registers, each 32 or 64 bits wide (**x0-x31**)

- **x0** always contains a 0

Floating-point (FP), adds 32 registers (**f0-f31**)

- each can contain a single- or double-precision FP value (32-bit or 64-bit IEEE FP)

FP status register (**fcsr**), used for FP rounding mode & exception reporting

ISA string options:

- RV32I (XLEN=32, no FP)
- RV32IF (XLEN=32, FLEN=32)
- RV32ID (XLEN=32, FLEN=64)
- RV64I (XLEN=64, no FP)
- RV64IF (XLEN=64, FLEN=32)
- RV64ID (XLEN=64, FLEN=64)

| XLEN-1    | 0 | FLEN-1 | 0 |
|-----------|---|--------|---|
| x0 / zero |   | f0     |   |
| x1        |   | f1     |   |
| x2        |   | f2     |   |
| x3        |   | f3     |   |
| x4        |   | f4     |   |
| x5        |   | f5     |   |
| x6        |   | f6     |   |
| x7        |   | f7     |   |
| x8        |   | f8     |   |
| x9        |   | f9     |   |
| x10       |   | f10    |   |
| x11       |   | f11    |   |
| x12       |   | f12    |   |
| x13       |   | f13    |   |
| x14       |   | f14    |   |
| x15       |   | f15    |   |
| x16       |   | f16    |   |
| x17       |   | f17    |   |
| x18       |   | f18    |   |
| x19       |   | f19    |   |
| x20       |   | f20    |   |
| x21       |   | f21    |   |
| x22       |   | f22    |   |
| x23       |   | f23    |   |
| x24       |   | f24    |   |
| x25       |   | f25    |   |
| x26       |   | f26    |   |
| x27       |   | f27    |   |
| x28       |   | f28    |   |
| x29       |   | f29    |   |
| x30       |   | f30    |   |
| x31       |   | f31    |   |
| XLEN      |   | FLEN   |   |
| XLEN-1    | 0 | 31     | 0 |
| pc        |   | fcsr   |   |
| XLEN      |   | 32     |   |

## **Vector Extension Additional State**

- 32 vector data registers, v0-v31, each VLEN bits long
- Vector length register vl
- Vector type register vtype
- Other control registers:
  - vstart
    - For trap handling
  - vxrm/vxsat
    - Fixed-point rounding mode/saturation
    - Also appear in separate vcsr
  - vlenb
    - Gives vector register length in bytes, VLEN/8 (read-only)



#### **Virtual Processor Vector Model**

- Vector operations are SIMD (single instruction multiple data) operations
- Each element is computed by a virtual processor (VP)
- Number of VPs given by vector length
  - Vector control register

## Scalar Code

    for (i = 0; i < N; i++) {
        output[i] = x[i] + y[i];
    }

RISC-V assembly (`a0` = N, `a1` = &output[0], `a2` = &x[0], `a3` = &y[0]):

    loop:
        # load x[i] and y[i]
        lw   a5, 0(a2)
        lw   a6, 0(a3)
        # addition
        add  a5, a5, a6
        # store word
        sw   a5, 0(a1)
        # Bump pointers
        addi a1, a1, 4
        addi a2, a2, 4
        addi a3, a3, 4
        # Decrement remaining count
        addi a0, a0, -1
        bnez a0, loop

#### **Vector Code**

    for (i = 0; i < N; i = i + VLEN) {
        output[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];
    }

RISC-V vector assembly (`t0` = VLEN, the number of elements per vector):

    loop:
        # load x[i:i+VLEN], y[i:i+VLEN]
        vle32.v v8, (a2)
        vle32.v v16, (a3)
        # addition
        vadd.vv v24, v8, v16
        # store res[i:i+VLEN]
        vse32.v v24, (a1)
        # Bump pointers
        slli t1, t0, 2
        add  a2, a2, t1
        add  a3, a3, t1
        add  a1, a1, t1
        # Bump loop by vlen
        sub  a0, a0, t0
        bnez a0, loop

#### **Agnostic vs Undisturbed**





tu - Tail undisturbed: elements past the vector length keep the values they held before the instruction.

| x[8] | x[9] | x[10] | x[11] | old | old | old | old |
|------|------|-------|-------|-----|-----|-----|-----|

ta - Tail agnostic: elements past the vector length may keep their old values or be overwritten with all 1s, so software must not rely on them.

| x[8] | x[9] | x[10] | x[11] | ? | ? | ? | ? |
|------|------|-------|-------|---|---|---|---|
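The two tail policies can be modelled in plain C. This is a minimal sketch, not an RVV API: `vadd_model`, `enum tail_policy`, and the choice to fill an agnostic tail with all 1s are assumptions for illustration (real hardware is also allowed to leave an agnostic tail unchanged).

```c
#include <assert.h>
#include <stdint.h>

enum tail_policy { TAIL_UNDISTURBED, TAIL_AGNOSTIC };

/* Model vd = vs1 + vs2 on a register of `vlmax` 32-bit elements,
 * of which only the first `vl` are active. */
void vadd_model(int32_t *vd, const int32_t *vs1, const int32_t *vs2,
                int vl, int vlmax, enum tail_policy policy) {
    assert(0 <= vl && vl <= vlmax);
    for (int i = 0; i < vl; i++)
        vd[i] = vs1[i] + vs2[i];      /* active (body) elements */
    if (policy == TAIL_AGNOSTIC) {
        /* One permitted behaviour: overwrite the tail with all 1s. */
        for (int i = vl; i < vlmax; i++)
            vd[i] = -1;
    }
    /* TAIL_UNDISTURBED: elements vl..vlmax-1 keep their old values. */
}
```

With `vl = 4` and `vlmax = 8`, the undisturbed policy leaves `vd[4..7]` exactly as they were, while this model of the agnostic policy sets them to -1.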

# **Tail Processing 1: VLEN**

    // Remaining = N
    for (i = 0; i < N; ) {
        int VLEN;
        if (N - i > MAX_VLEN)
            VLEN = MAX_VLEN;
        else
            VLEN = N - i;
        setvl(VLEN);
        res[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];
        i += VLEN;
    }

RISC-V vector assembly (`a0` = number of elements remaining):

    loop:
        vsetvli t0, a0, e32    # Set VLEN: t0 = elements handled this iteration
        # load x[i:i+VLEN], y[i:i+VLEN]
        vle32.v v8, (a2)
        vle32.v v16, (a3)
        # addition
        vadd.vv v24, v8, v16
        # store res[i:i+VLEN]
        vse32.v v24, (a1)
        # Bump pointers
        slli t1, t0, 2
        add  a2, a2, t1
        add  a3, a3, t1
        add  a1, a1, t1
        # Bump loop by vlen
        sub  a0, a0, t0
        bnez a0, loop
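The strip-mined loop above can be mirrored in portable C to see how the vector length shrinks on the last pass. This is a sketch under stated assumptions: `VLMAX` is an assumed hardware limit of 8 elements, and `setvl_model` is a simplified stand-in for `vsetvli` that always grants `min(remaining, VLMAX)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLMAX 8  /* assumed hardware maximum elements per vector */

/* Simplified model of vsetvli: grant min(avl, VLMAX) elements. */
static size_t setvl_model(size_t avl) {
    return avl < VLMAX ? avl : VLMAX;
}

/* res[i] = x[i] + y[i] for n elements; returns the number of strips. */
size_t vec_add_stripmined(int32_t *res, const int32_t *x,
                          const int32_t *y, size_t n) {
    assert(n == 0 || (res && x && y));
    size_t strips = 0;
    while (n > 0) {
        size_t vl = setvl_model(n);      /* vsetvli t0, a0, e32 */
        for (size_t i = 0; i < vl; i++)  /* vle32.v, vadd.vv, vse32.v */
            res[i] = x[i] + y[i];
        x += vl;                         /* bump pointers */
        y += vl;
        res += vl;
        n -= vl;                         /* sub a0, a0, t0 */
        strips++;
    }
    return strips;
}
```

For n = 12 the model runs two strips, with vl = 8 and then vl = 4. Unlike the assembly loop, which tests the count only at the bottom, this version also handles n = 0.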

# **Masking and Conditional Ops**



#### Why?

- Disable unwanted vector lanes
- Conditional branches where different operations are executed for different vector elements
- Handling tail/left-over elements when the software array length is not a multiple of the vector width.

#### **Tail Processing 2 : Masks**

```
for (i = 0; i + VLEN < N; i = i+VLEN) {
    res[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];
}
// Tail: enable only the N - i left-over lanes
Bool msk1[VLEN] = {0};
for (j = 0; j < N - i; j++) {
    msk1[j] = 1;
}
res[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];  // under msk1
```
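The same idea in runnable form: full-width iterations first, then one final iteration whose inactive lanes are switched off by a mask. A minimal sketch, assuming a fixed width of 8 lanes (`VLEN`) and a hypothetical function name; each inner `for` over `j` stands for one vector instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLEN 8  /* assumed fixed number of lanes */

void vec_add_masked_tail(int32_t *res, const int32_t *x,
                         const int32_t *y, int n) {
    assert(n >= 0);
    int i;
    /* Full vectors: every lane is active. */
    for (i = 0; i + VLEN < n; i += VLEN)
        for (int j = 0; j < VLEN; j++)
            res[i + j] = x[i + j] + y[i + j];

    /* Tail: enable only the n - i left-over lanes (at most VLEN). */
    bool msk[VLEN] = {false};
    for (int j = 0; j < n - i; j++)
        msk[j] = true;

    /* Masked vector add: lanes with msk[j] == false touch no memory. */
    for (int j = 0; j < VLEN; j++)
        if (msk[j])
            res[i + j] = x[i + j] + y[i + j];
}
```

With n = 12 the first loop handles elements 0 to 7 and the masked pass handles 8 to 11 with mask 1 1 1 1 0 0 0 0, so nothing past the end of the arrays is read or written.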

Vector Execution

First iteration (all lanes active):

| x[0] | x[1] | x[2] | x[3] | ... | x[7] |
|------|------|------|------|-----|------|
| *    | *    | *    | *    | *   | *    |
| y[0] | y[1] | y[2] | y[3] | ... | y[7] |

| Mask | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
|------|---|---|---|---|---|---|---|---|

Second iteration (only the first four lanes active):

| x[8] | x[9] | x[10] | x[11] |  |  |  |  |
|------|------|-------|-------|--|--|--|--|
| *    | *    | *     | *     |  |  |  |  |
| y[8] | y[9] | y[10] | y[11] |  |  |  |  |

| Mask | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
|------|---|---|---|---|---|---|---|---|

Mask lanes based on condition

```
for (i = 0; i < N; i++) {
    if (x[i] != y[i])
        res[i] = x[i] + y[i];
    else
        res[i] = x[i] * 2;
}
```

| Lane         | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7    |
|--------------|------|------|------|------|------|------|------|------|
| Input        | x[0] | x[1] | x[2] | x[3] | x[4] | x[5] | x[6] | x[7] |
| Msk1 (`+`)   |      |      | +    | +    |      |      |      | +    |
| Msk2 (`*2`)  | *2   | *2   |      |      | *2   | *2   | *2   |      |
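A scalar C model makes the two-pass structure explicit: one compare builds the masks, then each side of the `if` runs as its own masked vector operation. This is a sketch for illustration only, with an assumed width of 8 lanes and hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VLEN 8  /* assumed fixed number of lanes */

/* One vector's worth of: res = (x != y) ? x + y : x * 2 */
void vec_conditional(int32_t res[VLEN], const int32_t x[VLEN],
                     const int32_t y[VLEN]) {
    assert(res && x && y);
    bool msk1[VLEN], msk2[VLEN];
    for (int j = 0; j < VLEN; j++) {   /* vector compare sets the masks */
        msk1[j] = (x[j] != y[j]);
        msk2[j] = !msk1[j];
    }
    for (int j = 0; j < VLEN; j++)     /* "then" side, under msk1 */
        if (msk1[j])
            res[j] = x[j] + y[j];
    for (int j = 0; j < VLEN; j++)     /* "else" side, under msk2 */
        if (msk2[j])
            res[j] = x[j] * 2;
}
```

Both passes occupy all eight lanes in time, but each lane produces a result in only one of them, which is why divergent conditions lower utilization.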

# Utilization

```
for (i = 0; i + VLEN < N; i = i+VLEN) {
    res[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];
}
Bool msk1[VLEN] = {0};
// Calculate mask
res[i:i+VLEN] = x[i:i+VLEN] + y[i:i+VLEN];  // under msk1
```

#### VLEN (hardware) = 8. N (Array size) = 12.



Second Iteration: 50% utilization.
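The 50% figure follows from simple arithmetic on N and VLEN. The helper names below are hypothetical and the percentages are truncated to integers; treat this as a sketch of the calculation rather than a measurement.

```c
#include <assert.h>

/* Active lanes in the last (possibly partial) vector iteration. */
int last_iteration_active(int n, int vlen) {
    assert(n > 0 && vlen > 0);
    int rem = n % vlen;
    return rem == 0 ? vlen : rem;
}

/* Lane utilization of the last iteration, in percent. */
int last_iteration_utilization_pct(int n, int vlen) {
    return 100 * last_iteration_active(n, vlen) / vlen;
}

/* Lane utilization over the whole loop, in percent. */
int overall_utilization_pct(int n, int vlen) {
    assert(n > 0 && vlen > 0);
    int iters = (n + vlen - 1) / vlen;   /* round up */
    return 100 * n / (iters * vlen);
}
```

For VLEN = 8 and N = 12, the second iteration has 4 of 8 lanes active (50%), and the loop as a whole uses 12 of 16 lane slots (75%).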

### What about conditional branches?

Assume logic below is to be executed for each element in input array 'A' producing output into array 'result'

```
<unconditional code>
float x = A[i];
if (x > 0) {
    float tmp = exp(x, 5.f);
    tmp *= kMyConst1;
    x = tmp + kMyConst2;
} else {
    float tmp = kMyConst1;
    x = 2.f * tmp;
}
<resume unconditional code>
result[i] = x;
```


# Masks discard output of ALUs

- Not all ALUs do useful work
- Worst case: 1/8 peak performance (only one of the 8 lanes does useful work)

## After the branch, continue normal execution



# Terminology

Instruction stream coherence ("coherent execution")

- Same instruction sequence applies to all elements operated upon simultaneously
- Coherent execution is necessary for efficient use of SIMD processing resources
- Coherent execution IS NOT necessary for efficient parallelization across cores, since each core has the capability to fetch/decode a different instruction stream

#### "Divergent" execution

- A lack of instruction stream coherence

#### **New RISC-V "V" Vector Extension**

- Standard extension to the RISC-V ISA
  - An updated form of Cray-style vectors for modern microprocessors
  - Appearing in commercial implementations from Alibaba, Andes, Semidynamics, SiFive, ...
  - Basis of European supercomputer initiative (EPI)
- Following slides present short tutorial on current standard
  - https://github.com/riscv/riscv-v-spec

### Vector Type Register (vtype)

*Ideally, info would be in instruction encoding, but no space in 32-bit instructions. Planned 64-bit encoding extension would add these as instruction bits.* 

| Bits  | 31   | 30:8               | 7   | 6   | 5:3       | 2:0        |
|-------|------|--------------------|-----|-----|-----------|------------|
| Field | vill | reserved (write 0) | vma | vta | vsew[2:0] | vlmul[2:0] |

- **vsew[2:0]** field encodes standard element width (SEW) in bits of elements in vector register ($SEW = 8 \times 2^{vsew}$)
- **vlmul[2:0]** encodes vector register length multiplier ($LMUL = 2^{vlmul}$, ranging from 1/8 to 8)
- **vta** specifies *tail-agnostic*
- **vma** specifies *mask-agnostic*

| vsew[2:0] | SEW  |
|-----------|------|
| 000       | 8    |
| 001       | 16   |
| 010       | 32   |
| 011       | 64   |
| 100       | 128  |
| 101       | 256  |
| 110       | 512  |
| 111       | 1024 |
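The field encodings can be exercised with a small decoder. This sketch assumes the layout `vlmul` in bits 2:0, `vsew` in bits 5:3, `vta` in bit 6, and `vma` in bit 7, and treats `vlmul` as a signed 3-bit exponent with encoding 100 reserved; the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

struct vtype_fields {
    unsigned vlmul;  /* bits 2:0 */
    unsigned vsew;   /* bits 5:3 */
    unsigned vta;    /* bit 6: tail agnostic */
    unsigned vma;    /* bit 7: mask agnostic */
    unsigned sew;    /* element width in bits: 8 * 2^vsew */
};

struct vtype_fields decode_vtype(uint32_t vtype) {
    struct vtype_fields f;
    f.vlmul = vtype & 0x7u;
    f.vsew  = (vtype >> 3) & 0x7u;
    f.vta   = (vtype >> 6) & 0x1u;
    f.vma   = (vtype >> 7) & 0x1u;
    f.sew   = 8u << f.vsew;
    return f;
}

/* Returns log2(LMUL): 0..3 for LMUL = 1..8, -3..-1 for LMUL = 1/8..1/2. */
int log2_lmul(unsigned vlmul) {
    assert(vlmul < 8 && vlmul != 4);   /* encoding 100 is reserved */
    return vlmul < 4 ? (int)vlmul : (int)vlmul - 8;
}
```

Under these assumptions, a configuration of `e32, m1, ta, ma` corresponds to the value 0xD0, which decodes to SEW = 32 and log2(LMUL) = 0.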

#### **Example Vector Register Data Layouts (LMUL=1)**

Each vector register holds VLEN/SEW elements. Cells list the element indices (hexadecimal) stored in one register, with element 0 in the least-significant bits.

| VLEN | SEW=8b | SEW=16b | SEW=32b | SEW=64b | SEW=128b | SEW=256b |
|------|--------|---------|---------|---------|----------|----------|
| 32b  | 3..0   | 1..0    | 0       |         |          |          |
| 64b  | 7..0   | 3..0    | 1..0    | 0       |          |          |
| 128b | F..0   | 7..0    | 3..0    | 1..0    | 0        |          |
| 256b | 1F..0  | F..0    | 7..0    | 3..0    | 1..0     | 0        |

#### Setting vector configuration, vsetvli/vsetivli/vsetvl

The **vset**{**i**}**vl**{**i**} configuration instructions set the **vtype** register, and also set the **vl** register, returning the **vl** value in a scalar register

- Usually use the register-immediate form, **vsetvli**, to set **vtype** parameters
- The immediate-immediate form, **vsetivli**, is used when the vector length is known statically
- The register-register version, **vsetvl**, is usually used only for context save/restore

# **Vector Length Multiplier, LMUL**

- Gives fewer but longer vector registers
  - Called "vector register groups"; each group operates as a single vector
  - Must use even register names only for LMUL=2 (v0,v2,..), and every fourth register for LMUL=4 (v0,v4, ...), etc.
- What is LMUL used for?
  - 1) To increase efficiency by using longer vectors
  - 2) To accommodate mixed-width operations (e.g., masks)

LMUL=2:

| Byte   | F E D C | B A 9 8 | 7 6 5 4 | 3 2 1 0 |
|--------|---------|---------|---------|---------|
| v2*n+0 | 3       | 2       | 1       | 0       |
| v2*n+1 | 7       | 6       | 5       | 4       |

LMUL=4:

| Byte   | F E D C | B A 9 8 | 7 6 5 4 | 3 2 1 0 |
|--------|---------|---------|---------|---------|
| v4*n+0 | 9       | 8       | 1       | 0       |
| v4*n+1 | B       | A       | 3       | 2       |
| v4*n+2 | D       | C       | 5       | 4       |
| v4*n+3 | F       | E       | 7       | 6       |
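The register-naming rule and the "fewer but longer" trade-off reduce to a few lines of arithmetic. The sketch below covers only integer LMUL values (1, 2, 4, 8); the function names are hypothetical, and the maximum element count uses the relation VLMAX = LMUL × VLEN / SEW.

```c
#include <assert.h>
#include <stdbool.h>

/* A register group must start at a register number that is a
 * multiple of LMUL (v0, v2, ... for LMUL=2; v0, v4, ... for LMUL=4). */
bool is_valid_group_base(int vreg, int lmul) {
    assert(0 <= vreg && vreg < 32);
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return vreg % lmul == 0;
}

/* Number of usable register groups out of the 32 vector registers. */
int num_groups(int lmul) {
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return 32 / lmul;
}

/* Maximum elements in one group: VLMAX = LMUL * VLEN / SEW. */
int vlmax(int vlen_bits, int sew_bits, int lmul) {
    assert(vlen_bits > 0 && sew_bits > 0 && lmul > 0);
    return lmul * vlen_bits / sew_bits;
}
```

For example, with VLEN = 128 bits and SEW = 32 bits, LMUL = 2 gives 16 groups of up to 8 elements each, and `v3` is not a valid group name.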

# Simple stripmined vector memcpy example

    # void *memcpy(void* dest, const void* src, size_t n)
    # a0=dest, a1=src, a2=n
    #
    memcpy:
        mv a3, a0                     # Copy destination
    loop:
        # Set configuration, calculate vector strip length
        vsetvli t0, a2, e8,m8,ta,ma   # Vectors of 8b
        vle8.v v0, (a1)               # Load bytes (unit-stride vector load)
        add a1, a1, t0                # Bump pointer
        sub a2, a2, t0                # Decrement count
        vse8.v v0, (a3)               # Store bytes (unit-stride vector store)
        add a3, a3, t0                # Bump pointer
        bnez a2, loop                 # Any more?
        ret                           # Return

Same binary machine code can run on machines with any VLEN!

#### **Vector Unit-Stride Loads/Stores**

    # vd destination, rs1 base address, vm is mask encoding (v0.t or <missing>)
    vle8.v   vd, (rs1), vm   #  8-bit unit-stride load
    vle16.v  vd, (rs1), vm   # 16-bit unit-stride load
    vle32.v  vd, (rs1), vm   # 32-bit unit-stride load
    vle64.v  vd, (rs1), vm   # 64-bit unit-stride load

    # vs3 store data, rs1 base address, vm is mask encoding (v0.t or <missing>)
    vse8.v   vs3, (rs1), vm  #  8-bit unit-stride store
    vse16.v  vs3, (rs1), vm  # 16-bit unit-stride store
    vse32.v  vs3, (rs1), vm  # 32-bit unit-stride store
    vse64.v  vs3, (rs1), vm  # 64-bit unit-stride store

    for i = 0 to VLEN - 1
        vd[i] = load(rs1 + i)

#### **Vector Strided Load/Store Instructions**

    # vd destination, rs1 base address, rs2 byte stride
    vlse8.v   vd, (rs1), rs2, vm   #  8-bit strided load
    vlse16.v  vd, (rs1), rs2, vm   # 16-bit strided load
    vlse32.v  vd, (rs1), rs2, vm   # 32-bit strided load
    vlse64.v  vd, (rs1), rs2, vm   # 64-bit strided load

    # vs3 store data, rs1 base address, rs2 byte stride
    vsse8.v   vs3, (rs1), rs2, vm  #  8-bit strided store
    vsse16.v  vs3, (rs1), rs2, vm  # 16-bit strided store
    vsse32.v  vs3, (rs1), rs2, vm  # 32-bit strided store
    vsse64.v  vs3, (rs1), rs2, vm  # 64-bit strided store

    for i = 0 to VLEN - 1
        vd[i] = load(rs1 + i*rs2)
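The strided access pattern can be modelled in C as a loop over byte addresses. This is an illustrative sketch of what a 32-bit strided load computes, not an intrinsic; `strided_load32` is a hypothetical name. A unit-stride load is the special case where the stride equals the element size (4 bytes here).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of vlse32.v: vd[i] = 32-bit load from (base + i * stride_bytes). */
void strided_load32(int32_t *vd, const void *base,
                    ptrdiff_t stride_bytes, size_t vl) {
    assert(vd != NULL && base != NULL);
    const unsigned char *p = base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], p + (ptrdiff_t)i * stride_bytes, sizeof vd[i]);
}
```

A typical use is reading one column of a row-major matrix: for a matrix with 4 `int32_t` values per row, a stride of 16 bytes starting at the second element picks out column 1.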

#### **Vector Indexed Loads/Stores**

    # Vector unordered indexed load instructions
    # vd destination, rs1 base address, vs2 indices
    vluxei8.v   vd, (rs1), vs2, vm   # unordered  8-bit indexed load of SEW data
    vluxei16.v  vd, (rs1), vs2, vm   # unordered 16-bit indexed load of SEW data
    vluxei32.v  vd, (rs1), vs2, vm   # unordered 32-bit indexed load of SEW data
    vluxei64.v  vd, (rs1), vs2, vm   # unordered 64-bit indexed load of SEW data

    # Vector ordered indexed load instructions
    # vd destination, rs1 base address, vs2 indices
    vloxei8.v   vd, (rs1), vs2, vm   # ordered  8-bit indexed load of SEW data
    vloxei16.v  vd, (rs1), vs2, vm   # ordered 16-bit indexed load of SEW data
    vloxei32.v  vd, (rs1), vs2, vm   # ordered 32-bit indexed load of SEW data
    vloxei64.v  vd, (rs1), vs2, vm   # ordered 64-bit indexed load of SEW data

    # Vector unordered-indexed store instructions
    # vs3 store data, rs1 base address, vs2 indices
    vsuxei8.v   vs3, (rs1), vs2, vm  # unordered  8-bit indexed store of SEW data
    vsuxei16.v  vs3, (rs1), vs2, vm  # unordered 16-bit indexed store of SEW data
    vsuxei32.v  vs3, (rs1), vs2, vm  # unordered 32-bit indexed store of SEW data
    vsuxei64.v  vs3, (rs1), vs2, vm  # unordered 64-bit indexed store of SEW data

    # Vector ordered indexed store instructions
    # vs3 store data, rs1 base address, vs2 indices
    vsoxei8.v   vs3, (rs1), vs2, vm  # ordered  8-bit indexed store of SEW data
    vsoxei16.v  vs3, (rs1), vs2, vm  # ordered 16-bit indexed store of SEW data
    vsoxei32.v  vs3, (rs1), vs2, vm  # ordered 32-bit indexed store of SEW data
    vsoxei64.v  vs3, (rs1), vs2, vm  # ordered 64-bit indexed store of SEW data

    for i = 0 to VLEN - 1
        vd[i] = load(rs1 + vs2[i])

Index data width encoded in instruction, while data size encoded in vtype.vsew field
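An indexed load is a gather: each element comes from its own byte offset relative to the base. The sketch below models the case of 32-bit indices and 32-bit data in plain C; `indexed_load32` is a hypothetical name, and the offsets are byte offsets as in the pseudocode above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of an indexed load with 32-bit indices and SEW=32:
 * vd[i] = 32-bit load from (base + byte_offsets[i]). */
void indexed_load32(int32_t *vd, const void *base,
                    const uint32_t *byte_offsets, size_t vl) {
    assert(vd != NULL && base != NULL && byte_offsets != NULL);
    const unsigned char *p = base;
    for (size_t i = 0; i < vl; i++)
        memcpy(&vd[i], p + byte_offsets[i], sizeof vd[i]);
}
```

Gathering from a table `{10, 20, 30, 40, 50}` with byte offsets `{16, 0, 8}` yields `{50, 10, 30}`. In this model the index width (32 bits) is fixed by the function signature, just as the instruction encodes it, while the data width plays the role of SEW.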

VLEN=256b, SLEN=128b

#### SEW=8b, LMUL=1, VLMAX=32

| Byte   | 1F 1E 1D 1C 1B 1A 19 18 17 16 15 14 13 12 11 10 | F E D C B A 9 8 7 6 5 4 3 2 1 0 |
|--------|-------------------------------------------------|---------------------------------|
| v1*n+0 | 1F 1E 1D 1C 1B 1A 19 18 17 16 15 14 13 12 11 10 | F E D C B A 9 8 7 6 5 4 3 2 1 0 |

#### SEW=16b, LMUL=2, VLMAX=32

| Byte   | 1F 1E | 1D 1C | 1B 1A | 19 18 | 17 16 | 15 14 | 13 12 | 11 10 | F E | D C | B A | 9 8 | 7 6 | 5 4 | 3 2 | 1 0 |
|--------|-------|-------|-------|-------|-------|-------|-------|-------|-----|-----|-----|-----|-----|-----|-----|-----|
| v2*n+0 | 17    | 16    | 15    | 14    | 13    | 12    | 11    | 10    | 7   | 6   | 5   | 4   | 3   | 2   | 1   | 0   |
| v2*n+1 | 1F    | 1E    | 1D    | 1C    | 1B    | 1A    | 19    | 18    | F   | E   | D   | C   | B   | A   | 9   | 8   |

#### SEW=32b, LMUL=4, VLMAX=32

| 1F 1E 1D 1C | 1B 1A 19 18 | 17 16 15 14 | 13 12 11 10 | F E D C | B A 9 8 | 7 6 5 4 | 3 2 1 0 | Byte       |
|-------------|-------------|-------------|-------------|------|---------|---------|---------|------------|
| 13          | 12          | 11          | 10          | 3    | 2       | 1       | 0       | v4 * n + 0 |
| 17          | 16          | 15          | 14          | 7    | 6       | 5       | 4       | v4 * n + 1 |
| 1B          | 1A          | 19          | 18          | В    | A       | 9       | 8       | v4 * n + 2 |
| 1F          | 1E          | 1D          | 1C          | F    | E       | D       | С       | v4 * n + 3 |

#### SEW=64b, LMUL=8, VLMAX=32

| 1F 1E 1D 1C 1B 1A 19 18 | 17 16 15 14 13 12 11 10 | F E D C B A 9 8 | 7 6 5 4 3 2 1 0 | Byte   |
|-------------------------|-------------------------|-----------------|-----------------|--------|
| 11                      | 10                      | 1               | 0               | v8*n+0 |
| 13                      | 12                      | 3               | 2               | v8*n+1 |
| 15                      | 14                      | 5               | 4               | v8*n+2 |
| 17                      | 16                      | 7               | 6               | v8*n+3 |
| 19                      | 18                      | 9               | 8               | v8*n+4 |
| 1B                      | 1A                      | B               | A               | v8*n+5 |
| 1D                      | 1C                      | D               | C               | v8*n+6 |
| 1F                      | 1E                      | F               | E               | v8*n+7 |

# LMUL=8 stripmined vector memcpy example

    # void *memcpy(void* dest, const void* src, size_t n)
    # a0=dest, a1=src, a2=n
    #
    memcpy:
        mv a3, a0                     # Copy destination
    loop:
        # Set configuration, calculate vector strip length.
        # m8: combine eight vector registers into group (v0,v1,...,v7)
        vsetvli t0, a2, e8,m8,ta,ma   # Vectors of 8b
        vle8.v v0, (a1)               # Load bytes (unit-stride vector load)
        add a1, a1, t0                # Bump pointer
        sub a2, a2, t0                # Decrement count
        vse8.v v0, (a3)               # Store bytes (unit-stride vector store)
        add a3, a3, t0                # Bump pointer
        bnez a2, loop                 # Any more?
        ret                           # Return

Binary machine code can run on machines with any VLEN!
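The strip-mined loop can be modeled in plain C to see why one binary works for any VLEN. This is only a sketch under stated assumptions: the function names and the `vlmax_bytes` parameter are hypothetical (for `e8,m8`, VLMAX = LMUL × VLEN / SEW = VLEN bytes), and `vsetvli` is simplified to `vl = min(AVL, VLMAX)`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified model of vsetvli: grant min(requested, VLMAX) elements. */
static size_t model_vsetvli(size_t avl, size_t vlmax) {
    return avl < vlmax ? avl : vlmax;
}

/* Model of the strip-mined memcpy loop (hypothetical helper, not real RVV).
 * vlmax_bytes stands in for VLMAX; the loop body never hard-codes it.
 * The number of strips executed is written to *strips if non-NULL. */
static void *model_vector_memcpy(void *dest, const void *src, size_t n,
                                 size_t vlmax_bytes, size_t *strips) {
    assert(vlmax_bytes > 0);
    unsigned char *d = dest;                 /* mv a3, a0 */
    const unsigned char *s = src;
    size_t count = 0;
    do {
        size_t vl = model_vsetvli(n, vlmax_bytes); /* vsetvli t0, a2, ... */
        memcpy(d, s, vl);                    /* vle8.v v0,(a1); vse8.v v0,(a3) */
        s += vl;                             /* add a1, a1, t0 */
        n -= vl;                             /* sub a2, a2, t0 */
        d += vl;                             /* add a3, a3, t0 */
        count++;
    } while (n != 0);                        /* bnez a2, loop */
    if (strips != NULL) {
        *strips = count;
    }
    return dest;
}
```

Calling `model_vector_memcpy(dst, src, 1000, 128, &strips)` and then again with `256` copies the same bytes both times; only the strip count changes, which is the sense in which the code is independent of VLEN.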

#### Masking

- Nearly all operations can optionally execute under a mask (or predicate) held in vector register v0
- A single vm bit in the instruction encoding selects whether the operation is unmasked or under control of v0

- Integer and FP compare instructions are provided to write masks into any vector register
- Mask logical operations can be performed between any vector registers
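The effect of executing under a mask can be sketched in C. The model below is an illustration, not hardware behavior in full: the function name is hypothetical, the mask is stored as one byte per element (real hardware packs one bit per element into v0), and inactive elements are assumed to keep their old value (the mask-undisturbed policy).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of a masked "vadd.vv vd, vs2, vs1, v0.t" (hypothetical helper).
 * Element i is written only when mask element i has its low bit set;
 * other elements of vd are left untouched (mask-undisturbed assumed). */
static void model_vadd_vv_masked(int32_t *vd, const int32_t *vs2,
                                 const int32_t *vs1, const uint8_t *v0,
                                 size_t vl) {
    assert(vd != NULL && vs2 != NULL && vs1 != NULL && v0 != NULL);
    for (size_t i = 0; i < vl; i++) {
        if (v0[i] & 1) {
            /* Add in unsigned arithmetic so overflow wraps like hardware. */
            vd[i] = (int32_t)((uint32_t)vs2[i] + (uint32_t)vs1[i]);
        }
    }
}
```

With `vd = {9, 9, 9, 9}`, `vs2 = {1, 2, 3, 4}`, `vs1 = {10, 20, 30, 40}` and mask `{1, 0, 1, 0}`, only elements 0 and 2 are updated.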

#### **Vector Integer Add Instructions**

```
# Integer adds
vadd.vv vd, vs2, vs1, vm # Vector-vector
vadd.vx vd, vs2, rs1, vm # Vector-scalar
vadd.vi vd, vs2, imm, vm # Vector-immediate
# Integer subtract
vsub.vv vd, vs2, vs1, vm # Vector-vector
vsub.vx vd, vs2, rs1, vm # Vector-scalar
# Integer reverse subtract
vrsub.vx vd, vs2, rs1, vm # vd[i] = rs1 - vs2[i]
vrsub.vi vd, vs2, imm, vm # vd[i] = imm - vs2[i]
```
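The difference between subtract and reverse subtract is only the operand order, which a small C model makes concrete. The helper names are hypothetical, masking is omitted, and the immediate is assumed to be a 5-bit signed value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Wrapping 32-bit subtract, so overflow behaves like the hardware. */
static int32_t wrap_sub(int32_t a, int32_t b) {
    return (int32_t)((uint32_t)a - (uint32_t)b);
}

/* Model of vsub.vx: vd[i] = vs2[i] - rs1 */
static void model_vsub_vx(int32_t *vd, const int32_t *vs2, int32_t rs1,
                          size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vd[i] = wrap_sub(vs2[i], rs1);
    }
}

/* Model of vrsub.vx: vd[i] = rs1 - vs2[i] (operands reversed) */
static void model_vrsub_vx(int32_t *vd, const int32_t *vs2, int32_t rs1,
                           size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vd[i] = wrap_sub(rs1, vs2[i]);
    }
}

/* Model of vrsub.vi: vd[i] = imm - vs2[i], imm assumed 5-bit signed. */
static void model_vrsub_vi(int32_t *vd, const int32_t *vs2, int imm,
                           size_t vl) {
    assert(imm >= -16 && imm <= 15);
    model_vrsub_vx(vd, vs2, (int32_t)imm, vl);
}
```

One consequence: `model_vrsub_vi(vd, vs2, 0, vl)` negates every element, so reverse subtract covers negation without a dedicated instruction.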

#### **Integer Compare Instructions**

| Comparison | Assembler Mapping             | Assembler Pseudoinstruction  |
|------------|-------------------------------|------------------------------|
| va < vb    | `vmslt{u}.vv vd, va, vb, vm`  |                              |
| va <= vb   | `vmsle{u}.vv vd, va, vb, vm`  |                              |
| va > vb    | `vmslt{u}.vv vd, vb, va, vm`  | `vmsgt{u}.vv vd, va, vb, vm` |
| va >= vb   | `vmsle{u}.vv vd, vb, va, vm`  | `vmsge{u}.vv vd, va, vb, vm` |
| va < x     | `vmslt{u}.vx vd, va, x, vm`   |                              |
| va <= x    | `vmsle{u}.vx vd, va, x, vm`   |                              |
| va > x     | `vmsgt{u}.vx vd, va, x, vm`   |                              |
| va >= x    | see below                     |                              |
| va < i     | `vmsle{u}.vi vd, va, i-1, vm` | `vmslt{u}.vi vd, va, i, vm`  |
| va <= i    | `vmsle{u}.vi vd, va, i, vm`   |                              |
| va > i     | `vmsgt{u}.vi vd, va, i, vm`   |                              |
| va >= i    | `vmsgt{u}.vi vd, va, i-1, vm` | `vmsge{u}.vi vd, va, i, vm`  |

va, vb: vector register groups; x: scalar integer register; i: immediate
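Two of the mappings in the table are worth checking by hand: `va > vb` is the same compare as `vb < va`, and `va < i` is the same as `va <= i-1` for integers. The C sketch below models only the signed forms; the helper names are hypothetical, masks are stored as one byte per element, and the encoded immediate is assumed to be 5-bit signed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of vmslt.vv vd, va, vb: vd[i] = (va[i] < vb[i]), signed. */
static void model_vmslt_vv(uint8_t *vd, const int32_t *va,
                           const int32_t *vb, size_t vl) {
    for (size_t i = 0; i < vl; i++) {
        vd[i] = (uint8_t)(va[i] < vb[i]);
    }
}

/* Model of vmsle.vi vd, va, imm: vd[i] = (va[i] <= imm), signed. */
static void model_vmsle_vi(uint8_t *vd, const int32_t *va, int imm,
                           size_t vl) {
    assert(imm >= -16 && imm <= 15); /* assumed 5-bit signed immediate */
    for (size_t i = 0; i < vl; i++) {
        vd[i] = (uint8_t)(va[i] <= imm);
    }
}

/* Pseudoinstruction vmsgt.vv vd, va, vb: vmslt.vv with operands swapped. */
static void model_vmsgt_vv(uint8_t *vd, const int32_t *va,
                           const int32_t *vb, size_t vl) {
    model_vmslt_vv(vd, vb, va, vl);
}

/* Pseudoinstruction vmslt.vi vd, va, imm: vmsle.vi with imm - 1. */
static void model_vmslt_vi(uint8_t *vd, const int32_t *va, int imm,
                           size_t vl) {
    model_vmsle_vi(vd, va, imm - 1, vl);
}
```

For `va = {-3, 0, 4, 7}`, `model_vmslt_vi(vd, va, 4, 4)` sets the mask for the first two elements only, exactly as a direct `va[i] < 4` test would.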

#### **Mask Logical Operations**

| Instruction                | Operation                                   |
|----------------------------|---------------------------------------------|
| `vmand.mm vd, vs2, vs1`    | `vd[i] = vs2[i].LSB && vs1[i].LSB`          |
| `vmnand.mm vd, vs2, vs1`   | `vd[i] = !(vs2[i].LSB && vs1[i].LSB)`       |
| `vmandnot.mm vd, vs2, vs1` | `vd[i] = vs2[i].LSB && !vs1[i].LSB`         |
| `vmxor.mm vd, vs2, vs1`    | `vd[i] = vs2[i].LSB ^^ vs1[i].LSB`          |
| `vmor.mm vd, vs2, vs1`     | `vd[i] = vs2[i].LSB \|\| vs1[i].LSB`        |
| `vmnor.mm vd, vs2, vs1`    | `vd[i] = !(vs2[i].LSB \|\| vs1[i].LSB)`     |
| `vmornot.mm vd, vs2, vs1`  | `vd[i] = vs2[i].LSB \|\| !vs1[i].LSB`       |
| `vmxnor.mm vd, vs2, vs1`   | `vd[i] = !(vs2[i].LSB ^^ vs1[i].LSB)`       |

Several assembler pseudoinstructions are defined as shorthand for common uses of mask logical operations:

| Pseudoinstruction | Expansion                | Comment             |
|-------------------|--------------------------|---------------------|
| `vmcpy.m vd, vs`  | `vmand.mm vd, vs, vs`    | Copy mask register  |
| `vmclr.m vd`      | `vmxor.mm vd, vd, vd`    | Clear mask register |
| `vmset.m vd`      | `vmxnor.mm vd, vd, vd`   | Set mask register   |
| `vmnot.m vd, vs`  | `vmnand.mm vd, vs, vs`   | Invert bits         |
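Each pseudoinstruction works because of a Boolean identity: `a AND a = a`, `a XOR a = 0`, `a XNOR a = 1`, and `a NAND a = NOT a`. A minimal C sketch can check this exhaustively, assuming a hypothetical 8-element mask held in one byte (`mask8_t`, one bit per element); all names here are illustrative.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t mask8_t; /* hypothetical: 8 mask bits, one per element */

/* Models of the underlying mask logical instructions. */
static mask8_t model_vmand(mask8_t vs2, mask8_t vs1)  { return (mask8_t)(vs2 & vs1); }
static mask8_t model_vmnand(mask8_t vs2, mask8_t vs1) { return (mask8_t)~(vs2 & vs1); }
static mask8_t model_vmxor(mask8_t vs2, mask8_t vs1)  { return (mask8_t)(vs2 ^ vs1); }
static mask8_t model_vmxnor(mask8_t vs2, mask8_t vs1) { return (mask8_t)~(vs2 ^ vs1); }

/* Pseudoinstructions, expanded as in the table above. */
static mask8_t model_vmcpy_m(mask8_t vs) { return model_vmand(vs, vs); }
static mask8_t model_vmclr_m(mask8_t vd) { return model_vmxor(vd, vd); }
static mask8_t model_vmset_m(mask8_t vd) { return model_vmxnor(vd, vd); }
static mask8_t model_vmnot_m(mask8_t vs) { return model_vmnand(vs, vs); }

/* Exhaustively check the four identities for every 8-bit mask value. */
static void check_pseudoinstruction_identities(void) {
    for (unsigned m = 0; m < 256; m++) {
        mask8_t v = (mask8_t)m;
        assert(model_vmcpy_m(v) == v);
        assert(model_vmclr_m(v) == 0x00);
        assert(model_vmset_m(v) == 0xFF);
        assert(model_vmnot_m(v) == (mask8_t)~v);
    }
}
```

Note that `vmclr.m` and `vmset.m` produce the same result regardless of the old contents of `vd`, which is why the register can safely appear as both source and destination.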

#### **Agnostic vs Undisturbed**





tu - Tail undisturbed (tail elements keep their previous values)

| x[8] | x[9] | x[10] | x[11] |  |  |  |  |
|------|------|-------|-------|--|--|--|--|

ta - Tail agnostic (tail elements may be overwritten; software must not rely on their contents)

| x[8] | x[9] | x[10] | x[11] | ? | ? | ? | ? |
|------|------|-------|-------|---|---|---|---|
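The two tail policies can be contrasted with a small C model of a vector load that writes only `vl` of the `vlmax` elements in a register. The names are hypothetical, and the agnostic case is modeled by overwriting the tail with all 1s, which is only one outcome an implementation may choose (it may also leave the tail unchanged).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum tail_policy { TAIL_UNDISTURBED, TAIL_AGNOSTIC };

/* Model of a unit-stride 32-bit load into a register of vlmax elements
 * (hypothetical helper). Elements [0, vl) are loaded from memory;
 * elements [vl, vlmax) form the tail. */
static void model_vle32(int32_t *vd, size_t vlmax, const int32_t *mem,
                        size_t vl, enum tail_policy policy) {
    assert(vl <= vlmax);
    for (size_t i = 0; i < vl; i++) {
        vd[i] = mem[i];
    }
    if (policy == TAIL_AGNOSTIC) {
        /* Assumption: model the "overwrite with all 1s" choice. */
        for (size_t i = vl; i < vlmax; i++) {
            vd[i] = -1;
        }
    }
    /* TAIL_UNDISTURBED: the tail is deliberately left as it was. */
}
```

Loading four elements into an eight-element register with `TAIL_UNDISTURBED` preserves elements 4 to 7, while `TAIL_AGNOSTIC` frees the hardware from having to preserve them, which is why code that does not need the old tail typically requests `ta`.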