# **Caches and Virtual Memory**

CMPT 295 Week 8

#### **Processes**

- A process is an instance of a running program
  - Provided by the OS
    - OS uses a data structure (PCB) to represent each process
  - Maintains the *interface* between the program and the underlying hardware (CPU + memory)
- Process provides each program with two key abstractions:
  - Logical control flow
    - Each program seems to have exclusive use of the CPU
    - Provided by kernel mechanism called context switching
  - Private address space
    - Each program seems to have exclusive use of main memory
    - Provided by kernel mechanism called virtual memory



L21 Virtual Memory CMPT 295

### What is a process?

#### It's an illusion!



L21 Virtual Memory CMPT 295

### What is a process?

#### It's an illusion!



### Multiprocessing: The Illusion



- Computer runs many processes simultaneously
  - Applications for one or more users
    - Web browsers, email clients, editors, ...
  - Background tasks
    - Monitoring network & I/O devices

### Multiprocessing: The Reality



- Single processor executes multiple processes concurrently
  - Process executions interleaved, CPU runs one at a time
  - Address spaces managed by virtual memory system (today's lecture)
  - Execution context (register values, stack, ...) for other processes saved in memory

### Multiprocessing



- Context switch
  - 1) Save current registers in memory

### Multiprocessing



- Context switch
  - 1) Save current registers in memory
  - 2) Schedule next process for execution

### Multiprocessing



#### Context switch

- 1) Save current registers in memory
- 2) Schedule next process for execution
- 3) Load saved registers and switch address space

#### Multiprocessing on Multicore Processors



- Multicore processors
  - Multiple CPUs ("cores") on single chip
  - Share main memory (and some of the caches)
  - Each can execute a separate process
    - Kernel schedules processes to cores
    - **Still** constantly swapping processes

 $0xFF\cdots\cdots F$ 

### Memory as we know it so far... is virtual!

- Programs refer to virtual memory addresses
  - Conceptually memory is just a very large array of bytes
  - System provides private address space to each process
- Allocation: Compiler and run-time system
  - Where different program objects should be stored
  - All allocation within single virtual address space
- But...
  - We probably don't have 2<sup>w</sup> bytes of physical memory
  - We certainly don't have 2<sup>w</sup> bytes of physical memory for every process
  - Processes should not interfere with one another
    - Except in certain cases where they want to share code or  $dat_a^{0x00\cdots\cdots0}$

## Why Virtual Memory (VM)?

- Efficient use of limited main memory (RAM)
  - Use RAM as a cache for the parts of a virtual address space
    - Some non-cached parts stored on disk
    - Some (unallocated) non-cached parts stored nowhere
  - Keep only active areas of virtual address space in memory
    - Transfer data back and forth as needed
- Simplifies memory management for programmers
  - Each process "gets" the same full, private linear address space
- Isolates address spaces (protection)
  - One process can't interfere with another's memory
    - They operate in different address spaces
  - User process cannot access privileged information
    - Different sections of address spaces have different permissions

### A System Using Virtual Addressing



- Physical addresses are completely invisible to programs
  - Used in all modern desktops, laptops, servers, smartphones...
  - One of the great ideas in computer science

Indirection and Address Spaces
Virtual memory



- \* Virtual address space: Set of  $N = 2^n$  virtual addr
  - {0, 1, 2, 3, ..., N-1}
- \* Physical address space: Set of  $M = 2^m$  physical addr
  - {0, 1, 2, 3, ..., M-1}
- Every byte in main memory has:
  - one physical address (PA)
  - zero, one, or more virtual addresses (VAs)

#### **Memory Overview**









#### **Address Translation**

How do we perform the virtual → physical address translation?



#### **Address Translation: Page Tables**

CPU-generated address can be split into:

```
n-bit address: Virtual Page Number Page Offset
```

- Request is Virtual Address (VA), want Physical Address (PA)
- Note that Physical Offset = Virtual Offset (page-aligned)
- Use lookup table that we call the page table (PT)
  - Replace Virtual Page Number (VPN) for Physical Page Number (PPN) to generate Physical Address
  - Index PT using VPN: page table entry (PTE) stores the PPN plus management bits (e.g. Valid, Dirty, access rights)
  - Has an entry for every virtual page

#### **Address Translation: Page Hit**



- 1) Processor sends virtual address to MMU (memory management unit)
- 2-3) MMU fetches PTE from page table in cache/memory (Uses PTBR to find beginning of page table for current process)
- 4) MMU sends physical address to cache/memory requesting data
- 5) Cache/memory sends data to processor

```
VA = Virtual Address PTEA = Page Table Entry Address PTE= Page Table Entry
PA = Physical Address Data = Contents of memory stored at VA originally requested by CPU
```

#### **Address Translation: Page Fault**



- 1) Processor sends virtual address to MMU
- 2-3) MMU fetches PTE from page table in cache/memory
- 4) Valid bit is zero, so MMU triggers page fault exception
- 5) Handler identifies victim (and, if dirty, pages it out to disk)
- 6) Handler pages in new page and updates PTE in memory
- 7) Handler returns to original process, restarting faulting instruction

### Page Hit

Page hit: VM reference is in physical memory



## Page Fault

Page fault: VM reference is NOT in physical memory



#### **Address Translation Is Slow**

- The MMU accesses memory twice: once to get the PTE for translation, and then again for the actual memory request
  - The PTEs may be cached in L1 like any other memory word
    - But they may be evicted by other data references
    - And a hit in the L1 cache still requires 3-4 cycles

- What can we do to make this faster?
  - Solution: add another cache!

#### **Speeding up Translation with a TLB**

- Translation Lookaside Buffer (TLB):
  - Small hardware cache in MMU
    - Split VPN into TLB Tag and TLB Index based on # of sets in TLB
  - Maps virtual page numbers to physical page numbers
  - Stores page table entries for a small number of pages
  - Much faster than a page table lookup in cache/memory







A TLB hit eliminates a memory access!

#### **Address Translation**



### **Address Manipulation**



### **Summary of Address Translation Symbols**

- Basic Parameters
  - $N = 2^n$  Number of addresses in virtual address space
  - $M = 2^m$  Number of addresses in physical address space
  - $P = 2^p$  Page size (bytes)
- Components of the virtual address (VA)
  - VPO Virtual page offset
  - VPN Virtual page number
  - **TLBI** TLB index
  - TLBT TLB tag
- Components of the physical address (PA)
  - PPO Physical page offset (same as VPO)
  - PPN Physical page number

#### **Peer Question**

- How many bits wide are the following fields?
  - 16 KiB pages
  - 48-bit virtual addresses
  - 16 GiB physical memory

|     | VPN       | PPN |
|-----|-----------|-----|
| (A) | <b>34</b> | 24  |
| (B) | <b>32</b> | 18  |
| (C) | <b>30</b> | 20  |
| (D) | 34        | 20  |

#### **Some VM Topics Not Covered in 295**

- Supporting more than one page size
  - Small pages lead to more TLB misses; large pages lead to fragmentation
  - Modern CPUs support >1 page size (e.g., 4KB, 2MB, 1GB)
- Multi-level page tables
  - Needed to support large VM address space
  - https://en.wikipedia.org/wiki/Intel 5-level paging
- TLB hierarchy
  - Modern CPUs have more than one level in TLB
  - L1 usually split into two structures, one for instructions (iTLB) and another for data (dTLB). Entries 32-256
  - L2 TLB unified (instructions and data), 512-2048 entries

### Simple Memory System Example (small)

- Addressing
  - 14-bit virtual addresses
  - 12-bit physical address
  - Page size = 64 bytes



## Simple Memory System: Page Table

- Only showing first 16 entries (out of \_\_\_\_\_)
  - **Note:** showing 2 hex digits for PPN even though only 6 bits
  - Note: other management bits not shown, but part of PTE

| VPN | PPN | Valid |
|-----|-----|-------|
| 0   | 28  | 1     |
| 1   | _   | 0     |
| 2   | 33  | 1     |
| 3   | 02  | 1     |
| 4   | -   | 0     |
| 5   | 16  | 1     |
| 6   | _   | 0     |
| 7   | _   | 0     |

| VPN | PPN | Valid |
|-----|-----|-------|
| 8   | 13  | 1     |
| 9   | 17  | 1     |
| Α   | 09  | 1     |
| В   | _   | 0     |
| C   | _   | 0     |
| D   | 2D  | 1     |
| Ε   | _   | 0     |
| F   | 0D  | 1     |
|     |     |       |

Why does the

TLB ignore the

page offset?

### **Simple Memory System: TLB**

- 16 entries total
- 4-way set associative

| Set | Tag | PPN | Valid |
|-----|-----|-----|-------|-----|-----|-------|-----|-----|-------|-----|-----|-------|
| 0   | 03  | _   | 0     | 09  | 0D  | 1     | 00  | _   | 0     | 07  | 02  | 1     |
| 1   | 03  | 2D  | 1     | 02  | _   | 0     | 04  | _   | 0     | 0A  | _   | 0     |
| 2   | 02  | _   | 0     | 08  | _   | 0     | 06  | _   | 0     | 03  | _   | 0     |
| 3   | 07  | _   | 0     | 03  | 0D  | 1     | 0A  | 34  | 1     | 02  | _   | 0     |

## Simple Memory System: Cache

**Note:** It is just coincidence that the PPN is the same width as the cache Tag

- Direct-mapped with K = 4 B, C/K = 16
- Physically addressed



| Index | Tag | Valid | <i>B0</i> | B1 | B2 | В3 |
|-------|-----|-------|-----------|----|----|----|
| 0     | 19  | 1     | 99        | 11 | 23 | 11 |
| 1     | 15  | 0     | -         | -  | -  | _  |
| 2     | 1B  | 1     | 00        | 02 | 04 | 80 |
| 3     | 36  | 0     | _         | -  | _  | _  |
| 4     | 32  | 1     | 43        | 6D | 8F | 09 |
| 5     | 0D  | 1     | 36        | 72 | F0 | 1D |
| 6     | 31  | 0     | _         | -  | _  | _  |
| 7     | 16  | 1     | 11        | C2 | DF | 03 |

| Index | Tag | V |
|-------|-----|---|
| 8     | 24  |   |
| 9     | 2D  |   |
| Α     | 2D  |   |
| В     | 0B  |   |
| C     | 12  |   |
| D     | 16  |   |
| Ε     | 13  |   |
| F     | 14  |   |

| Tag | Valid | В0 | B1 | <i>B2</i> | В3 |
|-----|-------|----|----|-----------|----|
| 24  | 1     | 3A | 00 | 51        | 89 |
| 2D  | 0     | _  | _  | _         | _  |
| 2D  | 1     | 93 | 15 | DA        | 3B |
| OB  | 0     | _  | _  | _         | _  |
| 12  | 0     | _  | -  | _         | _  |
| 16  | 1     | 04 | 96 | 34        | 15 |
| 13  | 1     | 83 | 77 | 1B        | D3 |
| 14  | 0     | _  | _  | _         | _  |

### **Current State of Memory System**

#### TLB:

| Set | Tag | PPN | V |
|-----|-----|-----|---|-----|-----|---|-----|-----|---|-----|-----|---|
| 0   | 03  | _   | 0 | 09  | 0D  | 1 | 00  | _   | 0 | 07  | 02  | 1 |
| 1   | 03  | 2D  | 1 | 02  | _   | 0 | 04  | _   | 0 | 0A  | _   | 0 |
| 2   | 02  | _   | 0 | 08  | _   | 0 | 06  | _   | 0 | 03  | _   | 0 |
| 3   | 07  | _   | 0 | 03  | 0D  | 1 | 0A  | 34  | 1 | 02  | _   | 0 |

#### Page table (partial):

| 'PN | PPN | V | VPN | PPI |
|-----|-----|---|-----|-----|
| 0   | 28  | 1 | 8   | 13  |
| 1   | _   | 0 | 9   | 17  |
| 2   | 33  | 1 | Α   | 09  |
| 3   | 02  | 1 | В   | _   |
| 4   | _   | 0 | С   | _   |
| 5   | 16  | 1 | D   | 2D  |
| 6   | _   | 0 | Е   | _   |
| 7   | -   | 0 | F   | 0D  |
|     |     |   |     |     |

#### Cache:

| Index | Tag | V | В0 | <b>B1</b> | B2 | В3 |
|-------|-----|---|----|-----------|----|----|
| 0     | 19  | 1 | 99 | 11        | 23 | 11 |
| 1     | 15  | 0 | -  | -         | -  | _  |
| 2     | 1B  | 1 | 00 | 02        | 04 | 80 |
| 3     | 36  | 0 | _  | -         | _  | _  |
| 4     | 32  | 1 | 43 | 6D        | 8F | 09 |
| 5     | 0D  | 1 | 36 | 72        | F0 | 1D |
| 6     | 31  | 0 | _  | _         | _  | _  |
| 7     | 16  | 1 | 11 | C2        | DF | 03 |

| Index |
|-------|
| 8     |
| 9     |
| Α     |
| В     |
| C     |
| D     |
| E     |

F

| K | Tag | V | В0 | B1 | B2 | В3 |
|---|-----|---|----|----|----|----|
|   | 24  | 1 | 3A | 00 | 51 | 89 |
|   | 2D  | 0 | _  | _  | -  | _  |
|   | 2D  | 1 | 93 | 15 | DA | 3B |
|   | OB  | 0 | _  | _  | _  | _  |
|   | 12  | 0 | _  | _  | -  | _  |
|   | 16  | 1 | 04 | 96 | 34 | 15 |
|   | 13  | 1 | 83 | 77 | 1B | D3 |
|   | 14  | 0 | _  | _  | _  | _  |

Note: It is just coincidence that the PPN is the same width as the cache Tag

❖ Virtual Address: 0x03D4



VPN \_\_\_\_\_ TLBT \_\_\_\_ TLBI \_\_\_\_ TLB Hit? \_\_\_ Page Fault? \_\_\_ PPN \_\_\_\_

Physical Address:



CT \_\_\_\_\_ CI \_\_\_\_ CO \_\_\_\_ Cache Hit? \_\_\_ Data (byte) \_\_\_\_\_

Note: It is just coincidence that the PPN is the same width as the cache Tag

❖ Virtual Address: 0x038F



VPN \_\_\_\_\_ TLBT \_\_\_\_ TLBI \_\_\_\_ TLB Hit? \_\_\_ Page Fault? \_\_\_ PPN \_\_\_\_

Physical Address:



CT \_\_\_\_\_ CI \_\_\_\_ CO \_\_\_\_ Cache Hit? \_\_\_ Data (byte) \_\_\_\_\_

Note: It is just coincidence that the PPN is the same width as the cache Tag

❖ Virtual Address: 0x0020



VPN \_\_\_\_\_ TLBT \_\_\_\_ TLBI \_\_\_\_ TLB Hit? \_\_\_ Page Fault? \_\_\_ PPN \_\_\_\_

Physical Address:



Note: It is just coincidence that the PPN is the same width as the cache Tag

❖ Virtual Address: 0x036B



VPN \_\_\_\_\_ TLBT \_\_\_\_ TLBI \_\_\_\_ TLB Hit? \_\_\_ Page Fault? \_\_\_ PPN \_\_\_\_

Physical Address:



#### **Practice VM Question**

- Our system has the following properties
  - 1 MiB of physical address space
  - 4 GiB of virtual address space
  - 32 KiB page size
  - 4-entry fully associative TLB with LRU replacement
- a) Fill in the following blanks:

| <br>Entries in a page table | <br>Minimum bit-width o |
|-----------------------------|-------------------------|
|                             | PTBR                    |

\_\_\_\_\_ TLBT bits \_\_\_\_\_ Max # of valid entries in a page table

#### **Practice VM Question**

One process uses a page-aligned square matrix mat [] of 32-bit integers in the code shown below:

```
#define MAT_SIZE = 2048
for(int i = 0; i < MAT_SIZE; i++)
  mat[i*(MAT_SIZE+1)] = i;</pre>
```

b) What is the largest stride (in bytes) between successive memory accesses (in the VA space)?

#### **Practice VM Question**

One process uses a page-aligned square matrix mat [] of 32bit integers in the code shown below:

```
#define MAT_SIZE = 2048
for(int i = 0; i < MAT_SIZE; i++)
  mat[i*(MAT_SIZE+1)] = i;</pre>
```

c) Assuming all of mat[] starts on disk, what are the following hit rates for the execution of the for-loop?

\_\_\_\_\_ TLB Hit Rate \_\_\_\_\_ Page Table Hit Rate