In this document we try to understand how gem5 models the performance of systems. We trace the events occurring within the simulator to understand gem5's internals.
git clone git@github.com:CMPT-7ARCH-SFU/gem5-lab.git
cd gem5-lab
Folder | Description |
---|---|
gem5-config | Python scripts for setting up system and running gem5 simulation |
benchmarks | Benchmarks that we will be running on top of gem5 |
For this lab we have chosen the RISC-V ISA and use simple hand-written assembly to control the instructions being modelled. With a typical C/C++ program, the compiler pulls in a lot of other code related to the C runtime and libc, making it hard to obtain a full trace of events. With small assembly microbenchmarks we can obtain complete execution traces for analysis.
# Load modules
$ source /data/.local/modules/init/zsh
# If you are running this on the CMPT 7ARCH machines, you should not need to load modules manually.
# Path is hardcoded in the makefile
module load rv-newlib # Check that riscv64 compiler suite is in your path.
cd benchmarks
make
gem5 is a large piece of software (builds are typically tens of GB). We do not include the source in gem5-lab or gem5-ass; it is also listed in .gitignore so that you do not check it in by mistake. If you need your own build, clone and build it as described below.
Use the preinstalled gem5 if your disk space quota is a problem for building gem5
This applies only to the labs and assignments. For final projects request extra quota
and build the gem5 binaries yourself
# gem5 comes preinstalled at /data/gem5-baseline
export M5_PATH=/data/gem5-baseline
DO NOT BUILD FOR LABS AND ASSIGNMENTS USE PRE-INSTALLED ABOVE
Follow the build instructions below only after you have requested at least 35 GB of disk space from helpdesk@cs.sfu.ca. This applies only to those doing projects on gem5.
git clone git@github.com:CMPT-7ARCH-SFU/gem5-baseline.git
cd gem5-baseline
## Hide git status info in the oh-my-zsh prompt; computing it is slow in a repo this large. Add --global to turn it off for all repos.
git config oh-my-zsh.hide-info 1
# First build might take a long time
# on 10 cores it may take 15 minutes
# Open a tmux session.
scons -j 8 build/RISCV/gem5.opt CPU_MODELS='AtomicSimpleCPU,O3CPU,TimingSimpleCPU,MinorCPU' --gold-linker
The gem5 configuration scripts are included in the gem5-config folder. There are two files: run-micro.py and system.py.
ls gem5-config/
run-micro.py
system.py
File | Purpose |
---|---|
run-micro.py | Specifies the CPU model to be used, interacts with the simulation, dumps stats, etc. |
system.py | Puts together the system; connects the ports between the CPU and memory models |
Let's inspect run-micro.py to understand what's happening. Open run-micro.py in your favorite editor; we only show code snippets here.
class IntALU(FUDesc):
    opList = [ OpDesc(opClass='IntAlu', opLat=1) ]
    count = 32

class IntMultDiv(FUDesc):
    opList = [ OpDesc(opClass='IntMult', opLat=1),
               OpDesc(opClass='IntDiv', opLat=20, pipelined=False) ]
    if buildEnv['TARGET_ISA'] in ('x86',):
        opList[1].opLat = 1
    count = 32
There are four parameters that we care about:
Parameter | Description |
---|---|
opClass | Each instruction in your program specifies an opClass. You can find these defined in the .isa files under gem5/arch/riscv/isa/decoder.isa (look for Op). By default everything is IntAlu unless specified otherwise. Here we define the opClass so that decoded instructions can be forwarded to the appropriate queue. More than one functional unit can serve the same opClass. |
opLat | The latency of the functional unit for one instruction, in cycles. E.g., IntAlu has a latency of 1 cycle; IntDiv has a latency of 20 cycles, unless the ISA is x86, in which case it is 1 cycle because on x86 this is the latency of the micro-ops. |
pipelined | Controls the throughput, i.e., the delay between subsequent operations issued to the same functional unit. E.g., for IntDiv, pipelined=False means back-to-back operations can only be issued 20 cycles apart. |
count | The number of functional units of this type (e.g., count = 32 above means 32 identical units). |
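The interaction of opLat and pipelined can be sketched with plain Python (a toy model of our own, not gem5 code): a pipelined unit accepts a new operation every cycle, while a non-pipelined unit stays busy for opLat cycles.

```python
def finish_times(op_lat, pipelined, n_ops):
    """Toy model: cycle at which each of n_ops back-to-back operations
    completes on a single functional unit (first issue at cycle 0)."""
    times = []
    issue = 0
    for _ in range(n_ops):
        times.append(issue + op_lat)
        # A pipelined unit accepts a new op on the next cycle; a
        # non-pipelined unit is busy until the current op finishes.
        issue += 1 if pipelined else op_lat
    return times

# IntAlu-like unit: opLat=1, pipelined -> one result per cycle
print(finish_times(1, True, 3))    # [1, 2, 3]
# IntDiv-like unit: opLat=20, not pipelined -> results 20 cycles apart
print(finish_times(20, False, 3))  # [20, 40, 60]
```

Note that with pipelined=False, throughput degrades to one result every opLat cycles even though the latency of a single operation is unchanged.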
The FU list is initialized for all CPUs except TimingSimple, which assumes each instruction takes a single cycle and does not model functional units. To account for operations that take more than one cycle, we define functional units that inherit from the MinorFU interface provided by gem5.
class Minor4CPU(MinorCPU):
    ...
    fuPool = Minor4_FUPool()

class O3_W256CPU(DerivO3CPU):
    ...
    fuPool = Ideal_FUPool()
The figure below shows the system performance model. Each CPU includes multiple FU classes, and each FU class includes multiple FUs. Each FU is a queue; if pipelined, its throughput is one operation per cycle. The latency is set by opLat.
class O3_W2KCPU(DerivO3CPU):
    branchPred = BranchPredictor()
    fuPool = Ideal_FUPool()
    fetchWidth = 32
    decodeWidth = 32
    renameWidth = 32
    dispatchWidth = 32
    issueWidth = 32
    wbWidth = 32
    commitWidth = 32
    squashWidth = 32
    fetchQueueSize = 256
    LQEntries = 250
    SQEntries = 250
    numPhysIntRegs = 1024
    numPhysFloatRegs = 1024
    numIQEntries = 2096
    numROBEntries = 2096
We only set a few of the parameters and leave the others at their default values. We typically do not modify the delay parameters, only the buffer and width parameters. Descriptions of some of the parameters follow; for the full set, follow along with the class lectures. We include a figure here to illustrate.
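As a sketch of how such overrides compose, a narrower variant could be declared in run-micro.py in the same style; the class name O3_W2CPU and the particular values below are our own illustration, not part of the lab scripts.

```python
# Hypothetical narrow OOO core, following the same pattern as O3_W2KCPU.
# Any parameter not set here keeps gem5's default value.
class O3_W2CPU(DerivO3CPU):
    fetchWidth = 2
    decodeWidth = 2
    renameWidth = 2
    dispatchWidth = 2
    issueWidth = 2
    wbWidth = 2
    commitWidth = 2
    numROBEntries = 64   # a much smaller instruction window
```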
These parameters control the CPU front end and back end. You can find the parameter definitions for the O3CPU in $M5_PATH/src/cpu/o3/O3CPU.py, and for the Minor CPU in $M5_PATH/src/cpu/minor/MinorCPU.py.
The two main components of the gem5 system model that we will deal with are the CPU model and the memory model. gem5 is highly configurable, permitting any CPU model to be connected to any memory model. Here are two examples:

- An OOO CPU connected to an Inf memory model, i.e., no cache misses; all memory accesses complete instantaneously.
- An OOO CPU connected to L1 instruction and data caches in a two-level hierarchy (including a shared L2), with a detailed memory controller (different DDR parameters can be set in MemCtrl).
gem5 has a few CPU models for different purposes.
In gem5-config/run-micro.py, those are the first command line parameter option.
The information needed to build highly accurate models generally isn't public for proprietary CPUs, so you must either rely on vendor-provided models or on experiments and reverse engineering.
There is no simple answer to "which CPU model is best": in theory you have to understand each model and decide which one is closest to your target system.
CPU | 1-liner |
---|---|
AtomicSimpleCPU | The default. All operations (including memory accesses) happen instantaneously. The fastest simulation, but not realistic at all. |
TimingSimpleCPU | Non-memory instructions take 1 cycle. Memory instructions depend on the memory model (see Memory model). |
MinorCPU | Generic in-order multi-issue core. Honors all dependencies (true and false). Can issue instructions in parallel. Fixed 4-stage pipeline. |
DerivO3CPU | The most detailed OOO core in gem5. |
See https://en.wikipedia.org/wiki/Classic_RISC_pipeline. gem5's Minor CPU implements a similar, 4-stage pipeline. MinorCPU was initially designed to mimic the classic pipeline, but as folks added bells and whistles it morphed into an in-order/OOO combination.
https://en.wikipedia.org/wiki/Superscalar_processor and http://www.lighterra.com/papers/modernmicroprocessors/ explain it well.
You basically decode multiple instructions in one go and run them at the same time if they can go to separate functional units and have no conflicts.
This is where the branch predictor comes in: when a conditional branch is reached, you have to decide which side to execute before knowing for sure.
This is why it is called a form of instruction-level parallelism.
Although this is a microarchitectural feature, it is so important that it is publicly documented by vendors.
gem5 calls these execution resources "functional units". Functional units are explicitly modelled, and they are used by both MinorCPU and DerivO3CPU. (TimingSimpleCPU does not model the latency of functional units; every instruction completes in 1 cycle.)
Each instruction is marked with a class, and each class can execute on a given functional unit.
Which units are available is visible, for example, in the config.ini of a MinorCPU run. Functional units are not present in simple CPUs like TimingSimpleCPU.
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-flags=Event,ExecAll,FmtFlag --debug-file=trace.out --debug-start=1000 --debug-end=1000 ./gem5-config/run_micro.py Minor4 SingleCycle ./benchmarks/hello.riscv.bin --clock 1GHz
For example, the m5out/config.ini of a Minor run contains:
....
[system.cpu]
type=MinorCPU
children=branchPred dcache dtb executeFuncUnits icache interrupts isa itb power_state tracer workload
....
executeInputWidth=2
executeIssueLimit=2
....
Here also note executeInputWidth=2 and executeIssueLimit=2, suggesting that this is a dual-issue superscalar processor.
The system.cpu section points to:
[system.cpu.executeFuncUnits]
type=MinorFUPool
children=funcUnits0 funcUnits1 funcUnits2 funcUnits3 funcUnits4 funcUnits5 funcUnits6 funcUnits7
and the first two units are, in full:
[system.cpu.executeFuncUnits.funcUnits0]
type=MinorFU
children=opClasses timings
opLat=3
[system.cpu.executeFuncUnits.funcUnits0.opClasses]
type=MinorOpClassSet
children=opClasses
[system.cpu.executeFuncUnits.funcUnits0.opClasses.opClasses]
type=MinorOpClass
opClass=IntAlu
and:
[system.cpu.executeFuncUnits.funcUnits1]
type=MinorFU
children=opClasses timings
opLat=3
[system.cpu.executeFuncUnits.funcUnits1.opClasses]
type=MinorOpClassSet
children=opClasses
[system.cpu.executeFuncUnits.funcUnits1.opClasses.opClasses]
type=MinorOpClass
opClass=IntAlu
So we understand that a single functional unit can serve more than one opClass; those first two units just happen to serve a single one. The full list is:

- IntAlu, opLat=3
- IntMult, opLat=3, and IntDiv, opLat=9. So we see that a more complex operation such as division has a higher latency.
- FloatAdd, FloatCmp, and a gazillion other floating-point related things, opLat=6.
- SimdPredAlu (vector predication unit), opLat=3.
- MemRead, MemWrite, FloatMemRead, FloatMemWrite, opLat=1.
- IprAccess (Interrupt Priority Register access) and InstPrefetch (instruction prefetching unit), opLat=1.
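Since config.ini is plain INI text, the functional-unit sections can be extracted mechanically. Below is a small helper of our own (standard library only; it assumes the file parses cleanly as INI, which may not hold for every gem5 version):

```python
import configparser

def list_minor_fus(ini_text):
    """Collect (section, opLat, opClasses) for every MinorFU section
    found in gem5 config.ini text."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(ini_text)
    units = []
    for section in cfg.sections():
        if cfg.get(section, "type", fallback="") != "MinorFU":
            continue
        op_lat = cfg.get(section, "opLat", fallback="?")
        # opClass values live in child sections under <section>.opClasses
        classes = [cfg.get(s, "opClass")
                   for s in cfg.sections()
                   if s.startswith(section + ".opClasses")
                   and cfg.has_option(s, "opClass")]
        units.append((section, op_lat, classes))
    return units

# Usage against a real run (m5out is gem5's default output directory):
# print(list_minor_fus(open("m5out/config.ini").read()))
```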
The top-level configuration can be found in gem5-config/run-micro.py. The complete list can be found in $M5_PATH/src/cpu/minor/MinorCPU.py:
class MinorDefaultFUPool(MinorFUPool):
    funcUnits = [MinorDefaultIntFU(), MinorDefaultIntFU(),
                 MinorDefaultIntMulFU(), MinorDefaultIntDivFU(),
                 MinorDefaultFloatSimdFU(), MinorDefaultPredFU(),
                 MinorDefaultMemFU(), MinorDefaultMiscFU()]
In gem5, each instruction has an opClass that determines which unit it can run on. For example, class AddImm, which is what we get for a simple add x1, x2, 1, sets itself as an IntAluOp in its constructor, as expected:
AddImm::AddImm(ExtMachInst machInst,
               IntRegIndex _dest,
               IntRegIndex _op1,
               uint32_t _imm,
               bool _rotC)
    : DataImmOp("add", machInst, IntAluOp, _dest, _op1, _imm, _rotC)
For a DerivO3 run, config.ini shows:
[system.cpu]
type=DerivO3CPU
children=branchPred dcache dtb fuPool icache interrupts isa itb power_state tracer workload
and following fuPool:
[system.cpu.fuPool]
type=FUPool
children=FUList0 FUList1 FUList2 FUList3 FUList4 FUList5 FUList6 FUList7 FUList8 FUList9
so, for example, FUList0 is:
[system.cpu.fuPool.FUList0]
type=FUDesc
children=opList
count=6
eventq_index=0
opList=system.cpu.fuPool.FUList0.opList
[system.cpu.fuPool.FUList0.opList]
type=OpDesc
eventq_index=0
opClass=IntAlu
opLat=1
pipelined=true
and FUList1:
[system.cpu.fuPool.FUList1.opList0]
type=OpDesc
eventq_index=0
opClass=IntMult
opLat=3
pipelined=true
[system.cpu.fuPool.FUList1.opList1]
type=OpDesc
eventq_index=0
opClass=IntDiv
opLat=20
pipelined=false
So, summarizing all the units, we have:

- IntAlu with opLat=1
- IntMult with opLat=3 and IntDiv with opLat=20
- FloatAdd, FloatCmp, FloatCvt with opLat=2
export M5_PATH=/data/gem5-baseline
cd $REPO
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag --debug-file=trace.out --debug-start=1000 --debug-end=1000 ./gem5-config/run_micro.py Simple SingleCycle ./benchmarks/hello.riscv.bin --clock 1GHz
# Pay attention to how the clock is set, case matters
Option | Description |
---|---|
--outdir | The output directory |
--redirect-stdout | Redirects the stdout of the program simulated by gem5 |
--debug-start, --debug-end | Tick window for which tracing is switched on. ALWAYS USE FOR PROGRAMS OUTSIDE THE LAB |
./gem5-config/run_micro.py | Top-level script. Sets up the simulation model and interacts with simulator events |
Simple SingleCycle ./benchmarks/hello.riscv.bin --clock 1GHz | Parameters passed to the simulation script |
hello.riscv.bin | The binary to be simulated |
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --outdir=O3_m5out --redirect-stdout --debug-flags=O3PipeView --debug-file=trace.out ./gem5-config/run_micro.py DefaultO3 Slow ./benchmarks/array.riscv.bin
Option | Description |
---|---|
--debug-flags=O3PipeView --debug-file=trace.out | Switches on tracing; dumps trace.out in O3_m5out/ that can be read by Konata and the O3 pipeline viewer |
--redirect-stdout | Redirects the stdout of the program simulated by gem5 |
--debug-start, --debug-end | Tick window for which tracing is switched on. ALWAYS USE FOR PROGRAMS OUTSIDE THE LAB |
./gem5-config/run_micro.py | Top-level script. Sets up the simulation model and interacts with simulator events |
DefaultO3 Slow ./benchmarks/array.riscv.bin | Parameters passed to the simulation script |
array.riscv.bin | The binary to be simulated |
gem5 is an event-based simulator, and as such the event queue is one of the crucial elements of the system. Every action that takes time (notably, reading from memory) models that delay by scheduling an event in the future.
The gem5 event queue stores a callback for each scheduled future point in time.
The event queue is implemented by the class EventQueue in the file src/sim/eventq.hh.
Not every time needs to have an associated event: if a given time has no events, gem5 simply skips it and jumps to the next event. The queue is basically a linked list of events.
Important examples of events include:

- CPU ticks
- peripheral and memory responses
Analyzing the event queue at the beginning of an AtomicSimpleCPU syscall-emulation run, we see that gem5 sets up exactly two events:

- the first CPU cycle
- one exit event at the end of time, which triggers the gem5 simulate() limit when reached
Then, at the end of the callback of one tick event, another tick is scheduled.
And so the simulation progresses tick by tick, until an exit event happens.
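The tick-by-tick scheme can be sketched as a toy Python event queue (our own simplification of EventQueue in src/sim/eventq.hh): events are kept sorted by tick, simulated time jumps straight to the next event, and the CPU tick callback reschedules itself.

```python
import heapq

class ToyEventQueue:
    """Toy discrete-event queue: time jumps straight to the next event."""
    def __init__(self):
        self.now = 0
        self._heap = []  # (tick, seq, callback); seq keeps FIFO order per tick
        self._seq = 0

    def schedule(self, tick, callback):
        heapq.heappush(self._heap, (tick, self._seq, callback))
        self._seq += 1

    def run(self):
        while self._heap:
            tick, _, callback = heapq.heappop(self._heap)
            self.now = tick          # empty ticks are skipped entirely
            callback(self)

# A CPU "tick" event that reschedules itself, like AtomicSimpleCPU:
log = []
def cpu_tick(eq):
    log.append(eq.now)
    if eq.now < 3000:                # stop after a few ticks
        eq.schedule(eq.now + 1000, cpu_tick)

eq = ToyEventQueue()
eq.schedule(0, cpu_tick)
eq.run()
print(log)   # [0, 1000, 2000, 3000]
```

In real gem5 the exit event at tick 18446744073709551615 plays the role of our stopping condition.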
# start event
0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 17 scheduled @ 0
0: Event: Event_21: generic 21 scheduled @ 0
# Timeout event. At this cycle, gem5 forcefully stops
0: Event: Event_21: generic 21 rescheduled @ 18446744073709551615
0: Event: system.cpu.wrapped_function_event: EventFunctionWrapped 17 executed @ 0
...
# Program exits
9573000: ExecEnable: system.cpu: A0 T0 : @m5_work_end : M5Op : IntAlu : D=0x0000000000000000 flags=(IsInteger|IsSerializeAfter|IsNonSpeculative)
export M5_PATH=/data/gem5-baseline
## Run TimingCPU with Inf Memory (0 latency, Inf capacity)
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-start=0 --debug-end=60000 --debug-file=trace.out --outdir=Inf_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple Inf ./benchmarks/hello.riscv.bin --clock 1GHz
$ cat Inf_m5out/trace.out
# Every instruction completes instantaneously
0: ExecEnable: system.cpu: A0 T0 : @_start : addi a0, zero, 1 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
0: Event: system.mem_ctrl.wrapped_function_event: EventFunctionWrapped 9 scheduled @ 0
0: Event: system.mem_ctrl.wrapped_function_event: EventFunctionWrapped 9 executed @ 0
0: Event: Event_14: Timing CPU icache tick 14 scheduled @ 0
0: Event: Event_14: Timing CPU icache tick 14 executed @ 0
0: ExecEnable: system.cpu: A0 T0 : @_start+4 : addi a1, zero, 0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
In this mode every instruction completes instantaneously, with zero latency.
cd benchmarks # first 3 lines needed if you haven't already compiled benchmarks
make
cd ..
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-start=0 --debug-end=60000 --debug-file=trace.out --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple SingleCycle ./benchmarks/hello.riscv.bin --clock 1GHz
1000: ExecEnable: system.cpu: A0 T0 : @_start : addi a0, zero, 1 : IntAlu : D=0x0000000000000001 flags=(IsInteger)
...
2000: ExecEnable: system.cpu: A0 T0 : @_start+4 : addi a1, zero, 0 : IntAlu : D=0x0000000000000000 flags=(IsInteger)
...
The SingleCycle memory latency is set at line 49 of gem5-config/system.py. Remember: instructions are also data, so every non-memory operation completes in 1000 ps. Try changing the latency in system.py from 1ns to 5ns and see what happens to the rate at which instructions complete.
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-file=trace.out --debug-start=0 --debug-end=60000 --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple SingleCycle ./benchmarks/array.riscv.bin --clock 1GHz
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-file=trace.out --debug-start=0 --outdir=Simple_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Simple Slow ./benchmarks/array.riscv.bin --clock 1GHz
As we have seen, operations can take multiple cycles to complete in the Minor CPU model. In this section, we will change the latency of integer ALU operations to a single cycle and see how it affects performance.
Using the following commands, we will obtain traces and statistics for running two of the microbenchmarks array.riscv.bin
and hello.riscv.bin
using the multi-cycle ALU operations. You can also tweak the number of cycles taken for different operations to complete and see how they affect the performance.
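One way to sketch that change is to subclass the default Minor functional units in the config script; the names SingleCycleIntFU and SingleCycleFUPool below are our own illustration (the MinorDefault* classes come from $M5_PATH/src/cpu/minor/MinorCPU.py):

```python
# Sketch: a single-cycle integer ALU for the Minor CPU.
# SingleCycleIntFU / SingleCycleFUPool are hypothetical names of ours.
class SingleCycleIntFU(MinorDefaultIntFU):
    opLat = 1          # default is 3 cycles

class SingleCycleFUPool(MinorDefaultFUPool):
    funcUnits = [SingleCycleIntFU(), SingleCycleIntFU(),
                 MinorDefaultIntMulFU(), MinorDefaultIntDivFU(),
                 MinorDefaultFloatSimdFU(), MinorDefaultPredFU(),
                 MinorDefaultMemFU(), MinorDefaultMiscFU()]
```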
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-file=trace.out --debug-start=0 --outdir=Minor4_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Minor4 SingleCycle ./benchmarks/array.riscv.bin --clock 1GHz
export M5_PATH=/data/gem5-baseline
$M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-file=trace.out --debug-start=0 --outdir=Minor4_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py Minor4 SingleCycle ./benchmarks/hello.riscv.bin --clock 1GHz
Change the integer ALU latency at line 126 of run_micro.py to 1 cycle and see how the performance is affected, using the same two commands from the previous subsection.
export M5_PATH=/data/gem5-baseline
$ $M5_PATH/build/RISCV/gem5.opt --redirect-stdout --debug-file=trace.out --debug-start=0 --outdir=O3_m5out --debug-flags=Event,ExecAll,FmtFlag ./gem5-config/run_micro.py DefaultO3 Slow ./benchmarks/array.riscv.bin --clock 1GHz
To see all CPU options:
export M5_PATH=/data/gem5-baseline
# To see all CPU options
$M5_PATH/build/RISCV/gem5.opt ./gem5-config/run_micro.py --help
- What are the OOO processor models declared in run_micro.py?
- What is the difference between O3_W256, O3_W2K, and DefaultO3?
First, the statistics file contains general statistics about the execution:
The statistic dump begins with ---------- Begin Simulation Statistics ----------.
There may be multiple of these in a single file if there are multiple statistic dumps during the gem5 execution. This is common for long running applications, or when restoring from checkpoints.
A couple of important statistics are sim_seconds, the total simulated time; sim_insts, the number of instructions committed by the CPU; and host_inst_rate, which tells you the performance of gem5 itself (simulated instructions per host second).
Next, the SimObjects’ statistics are printed. For instance, the memory controller statistics. This has information like the bytes read by each component and the average bandwidth used by those components.
Stat | # | Description |
---|---|---|
final_tick | 18446744073709551615 | Number of ticks from beginning of simulation (restored from checkpoints and never reset) |
host_inst_rate | 402192 | Simulator instruction rate (inst/s) |
sim_freq | 1000000000000 | Frequency of simulated ticks |
sim_insts | 162530 | Number of instructions simulated |
sim_ops | 296082 | Number of ops (including micro ops) simulated |
sim_seconds | 18446744.073710 | Number of seconds simulated |
sim_ticks | 18446744073709551615 | Number of ticks simulated |
system.cpu.BranchMispred | 4133 | Number of branch mispredictions |
system.cpu.Branches | 36480 | Number of branches fetched |
system.cpu.committedInsts | 162530 | Number of instructions committed |
system.cpu.committedOps | 296082 | Number of ops (including micro ops) committed |
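Stats like these are easy to post-process. Below is a minimal parser of our own (standard library only; it assumes the whitespace-separated `name value # description` line format shown above):

```python
def read_stats(text, names):
    """Return {name: value} for the first occurrence of each requested
    stat in gem5 stats.txt text. Lines look like:
        sim_insts   162530   # Number of instructions simulated
    """
    wanted = set(names)
    out = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] in wanted and parts[0] not in out:
            out[parts[0]] = float(parts[1])
    return out

# Usage against a real run (m5out is gem5's default output directory):
# stats = read_stats(open("m5out/stats.txt").read(),
#                    ["sim_insts", "sim_seconds", "host_inst_rate"])
```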
Later in the file come the CPU statistics, which contain information on the number of syscalls, the number of branches, total committed instructions, etc.
Stat | # | Description |
---|---|---|
system.cpu.num_cc_register_reads | 191583 | number of times the CC registers were read |
system.cpu.num_cc_register_writes | 124014 | number of times the CC registers were written |
system.cpu.num_conditional_control_insts | 30902 | number of instructions that are conditional controls |
system.cpu.num_fp_alu_accesses | 5219 | Number of float alu accesses |
system.cpu.num_fp_insts | 5219 | number of float instructions |
system.cpu.num_fp_register_reads | 7182 | number of times the floating registers were read |
system.cpu.num_fp_register_writes | 3645 | number of times the floating registers were written |
system.cpu.num_func_calls | 1414 | number of times a function call or return occurred |
system.cpu.num_idle_cycles | -inf | Number of idle cycles |
system.cpu.num_int_alu_accesses | 291751 | Number of integer alu accesses |
system.cpu.num_int_insts | 291751 | number of integer instructions |
system.cpu.num_int_register_reads | 560951 | number of times the integer registers were read |
system.cpu.num_int_register_writes | 245957 | number of times the integer registers were written |
This document was put together by your CMPT 750/450 instructors with support from the gem5 community.