Due Sunday, January 27. 11:59PM Pacific Time
In this assignment you will learn how to analyze the performance of applications using hardware counters. This is a skill that you could use in many research projects. Your goal will be to analyze how applications are affected by contention for the processor cache.
You will choose several experimental applications. Of these, one will be the principal application and the others will be the interfering applications. You will run the principal application with each interfering application, ensuring that the two share the same cache (more on how to accomplish this below). You will measure the cache miss rate (in misses per instruction) and the instructions per cycle (IPC) of the principal application as it runs with each interfering application. Based on these data you will conclude how the principal application’s cache miss rate and IPC are affected by the interfering applications. Try to explain the effects you observe by learning something about the nature of the applications (you can do this by reading the code or by finding information about the benchmarks online).
You will perform experiments on one of these machines.
- coolthreads.cs.sfu.ca – this is a Solaris/SPARC (Niagara) system with eight cores running Solaris 11. On each core there are four hardware threads, or virtual CPUs. The threads running on the same core share an L1 instruction cache and an L1 data cache. In addition, cores share an L2 cache. L2 cache is unified: it contains both instructions and data. On Niagara you can measure interference in two different cache levels: L1 cache and L2 cache.
To measure interference in both the L1 cache and the L2 cache you will need to run your benchmarks on the same core. There are eight cores on Niagara, and four virtual CPUs per core. In total, there are 32 virtual CPUs numbered 0-31. Virtual CPUs on core 0 are numbered 0-3, virtual CPUs on core 1 are numbered 4-7, etc. To run your benchmarks on core 1, for example, you could bind your benchmarks to virtual processors 4 and 6. (More on binding later.)
To measure the interference in the L2 cache, but not in the L1 cache, you will need to run your benchmarks on two different cores. For example, you could bind them to virtual processors 1 and 5.
- quad.cs.sfu.ca – this is a Solaris/x86 quad-core system running Solaris 11. On quad’s motherboard there are two physical chips (or CPU packages), and each chip has two cores. The two cores on a chip share an L2 cache. On quad you can only test the interference in the L2 cache, since the L1 cache is not shared. Virtual processors in quad are numbered 0, 1, 2, and 3. The L2 cache is shared between processors 0 and 1, and between processors 2 and 3. To test L2 cache interference, you would bind your benchmarks to processors 0 and 1, or to processors 2 and 3.
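The numbering rules for the two machines can be checked with a small helper before you pick virtual CPUs to bind to. This is only a sketch based on the descriptions above; the function names are hypothetical and are not part of any tool on these machines.

```python
# Sketch only: encodes the CPU numbering described in this handout.
# Function names are hypothetical helpers, not an existing utility.

def coolthreads_core(vcpu: int) -> int:
    """Return the core hosting a coolthreads virtual CPU (0-31, four per core)."""
    if not 0 <= vcpu <= 31:
        raise ValueError("coolthreads virtual CPUs are numbered 0-31")
    return vcpu // 4


def coolthreads_shared_caches(a: int, b: int) -> list:
    """Cache levels that two distinct coolthreads virtual CPUs contend for."""
    if a == b:
        raise ValueError("bind the two benchmarks to different virtual CPUs")
    if coolthreads_core(a) == coolthreads_core(b):
        return ["L1", "L2"]  # same core: L1 caches and L2 are both shared
    return ["L2"]            # different cores: only L2 is shared


def quad_shares_l2(a: int, b: int) -> bool:
    """True if two distinct quad processors (0-3) share an L2 cache."""
    for cpu in (a, b):
        if not 0 <= cpu <= 3:
            raise ValueError("quad processors are numbered 0-3")
    return a != b and a // 2 == b // 2


print(coolthreads_shared_caches(4, 6))  # both on core 1
print(coolthreads_shared_caches(1, 5))  # core 0 and core 1
print(quad_shares_l2(0, 1), quad_shares_l2(1, 2))
```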
Using hardware counters:
To evaluate cache interference of applications, you will need to measure various performance statistics. For example, to evaluate L1 cache interference, you will measure L1 data cache misses, L1 instruction cache misses and instructions per cycle (IPC). The cache miss rate should increase if there is high cache interference between the benchmarks. IPC, on the other hand, should decrease: the higher the cache miss rate, the fewer instructions per cycle the program executes.
To measure these performance statistics you will use a tool called cputrack. To find out what kinds of hardware counters are available on your experimental machine, run “cputrack -h”. Notice the difference in hardware counters between coolthreads and quad!
Read the manual page for cputrack (“man cputrack”) to learn how it works.
Here are some examples of using cputrack.
To measure L1 instruction cache misses and IPC on coolthreads, run your benchmark like this:
cputrack -evf -t -T1 -o output.txt -c pic0=IC_miss,sys,pic1=Instr_cnt,sys your_benchmark
- option -t asks cputrack to print processor cycles: you will use this for the calculation of the IPC.
- option -c tells cputrack which events to count in the two counters available on coolthreads. In this example, we count instruction cache misses (IC_miss) in counter 0 and retired instructions (Instr_cnt) in counter 1.
- “sys” tells cputrack to count events that occur both at user level and in system calls.
- option -T1 tells cputrack to sample counters once per second. Using larger intervals is not recommended due to a bug in cputrack (that my students and I have found) – it may result in cputrack reporting wrong values.
- option -o tells cputrack to write output to the file called “output.txt”.
- options -evf are very important to use. Read about them in the man page.
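If you script your runs, it can help to build the cputrack command line in one place so that every experiment uses the same options. The sketch below only reassembles the flags from the example above; the function name and its parameters are hypothetical, and the event names must come from “cputrack -h” on the machine you use.

```python
import shlex

# Sketch only: reproduces the option set from the coolthreads example above.
# cputrack_command and its parameters are hypothetical names.

def cputrack_command(pic0_event, pic1_event, output_file, benchmark_argv):
    """Return an argv list that runs benchmark_argv under cputrack."""
    # ",sys" after each event counts system-level events as well as user-level ones.
    counters = "pic0={},sys,pic1={},sys".format(pic0_event, pic1_event)
    return [
        "cputrack",
        "-evf",             # see the man page for these three options
        "-t",               # also print processor cycles (needed for IPC)
        "-T1",              # sample once per second
        "-o", output_file,  # write samples to this file
        "-c", counters,     # events for counter 0 and counter 1
    ] + list(benchmark_argv)


argv = cputrack_command("IC_miss", "Instr_cnt", "output.txt", ["your_benchmark"])
print(shlex.join(argv))
```

Keeping the command as a list (rather than one string) avoids quoting problems when a benchmark takes arguments with spaces.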
Performance counters on quad
Quad has an Intel processor, often referred to as x86 architecture. This is a CISC processor, and it has a complex structure and many hardware counters. It may be quite challenging to figure out which counter to use to count the events you want. Here are some hints for this assignment:
Also note that on quad only the pic0 counter is working, so you cannot use pic1.
- To count the number of completed instructions use
- To count the number of instruction cache misses use
- To count the number of data cache misses use
- To count the number of L2 cache misses use
Interpreting the output of cputrack.
Once you have run your benchmark via cputrack, it will produce an output file (output.txt in the example above). This file will have many lines reporting hardware counter values – one for each second of running time. You probably care about the aggregate results for the entire run, so you should look for a line that looks something like this:
1229.003 12841 1 fini_lwp 157486144176 3040518 54988926444
There are seven columns in this line:
- Column #1 is the wallclock time (in seconds) since the beginning of the program.
- Column #2 is the PID of the process running your benchmark.
- Column #3 is the lightweight process (LWP) ID – there would be multiple of those if you ran a multithreaded application.
- Column #4 (fini_lwp) tells you that these are the aggregate statistics for when the process exits.
- Column #5 is the number of CPU cycles that elapsed since your program started (this measures the cycles that your program spent on CPU, not counting the time it blocked on I/O or was descheduled).
- Column #6 counts the number of instruction cache misses.
- Column #7 counts the number of retired instructions.
To calculate the IPC, you would divide column #7 by column #5. To calculate the instruction cache miss rate (in misses per instruction), you would divide column #6 by column #7.
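The calculation described in this section can be sketched in a few lines. The snippet below assumes the seven-column fini_lwp layout shown above (cycles in column #5, instruction cache misses in column #6, retired instructions in column #7); the function and field names are hypothetical. For the sample line it gives an IPC of roughly 0.35 and about 5.5e-5 instruction cache misses per instruction.

```python
# Sketch only: assumes the seven-column fini_lwp line format shown above.
# Function and field names are hypothetical.

SAMPLE = "1229.003 12841 1 fini_lwp 157486144176 3040518 54988926444"


def parse_fini_lwp(line):
    """Split one fini_lwp line into named fields."""
    fields = line.split()
    if len(fields) != 7 or fields[3] != "fini_lwp":
        raise ValueError("not a seven-column fini_lwp line: " + line)
    return {
        "time_s": float(fields[0]),  # column #1: wallclock seconds
        "pid": int(fields[1]),       # column #2: process ID
        "lwp": int(fields[2]),       # column #3: lightweight process ID
        "cycles": int(fields[4]),    # column #5: CPU cycles
        "pic0": int(fields[5]),      # column #6: counter 0 (IC_miss here)
        "pic1": int(fields[6]),      # column #7: counter 1 (Instr_cnt here)
    }


def ipc(rec):
    """Instructions per cycle: column #7 divided by column #5."""
    return rec["pic1"] / rec["cycles"]


def misses_per_instruction(rec):
    """Miss rate: column #6 divided by column #7."""
    return rec["pic0"] / rec["pic1"]


rec = parse_fini_lwp(SAMPLE)
print("IPC: {:.3f}".format(ipc(rec)))
print("misses per instruction: {:.2e}".format(misses_per_instruction(rec)))
```

When the output file holds lines for more than one PID, filter on the `pid` field before computing the ratios.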
You will use programs from the SPEC CPU2000 benchmark suite. You can find
more information about these benchmarks here. There is a variety of
applications in this benchmark suite – from them choose two main
benchmarks and two interfering benchmarks. (So you will have four
pairs of benchmarks in total). Make sure that your benchmarks
show a variety of cache access patterns: i.e., that there are
both cache-intensive as well as cache-hungry applications.
Note, you will need to run your benchmarks “by hand” as opposed to using the runspec utility. You can read what it’s all about here.
When you run the main application with an interfering one, you need to
ensure that the interfering application keeps running while the
main application is running. So if the interfering application
is shorter than the main application, you'll need to restart
the interfering application while the main application runs.
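One way to keep the interfering application running is a small driver that restarts it whenever it exits before the main application does. The following is a minimal sketch, not a required method: the function name is hypothetical, the demo commands are stand-ins that just sleep, and the polling interval means the interfering application is briefly absent between restarts.

```python
import subprocess
import sys
import time

# Sketch only: run_with_interference is a hypothetical helper.
# The DEMO_* commands are stand-ins; replace them with real command lines.
DEMO_MAIN = [sys.executable, "-c", "import time; time.sleep(1.5)"]
DEMO_INTERFERER = [sys.executable, "-c", "import time; time.sleep(0.2)"]


def run_with_interference(main_argv, interfering_argv, poll_interval=0.05):
    """Run main_argv once, restarting interfering_argv until main exits.

    Returns (main exit status, number of times the interferer was started).
    """
    main = subprocess.Popen(main_argv)
    interferer = None
    launches = 0
    try:
        while main.poll() is None:
            # Start the interferer, or restart it if it has already finished.
            if interferer is None or interferer.poll() is not None:
                interferer = subprocess.Popen(interfering_argv)
                launches += 1
            time.sleep(poll_interval)
    finally:
        # The main application is done: stop the interferer if still running.
        if interferer is not None and interferer.poll() is None:
            interferer.terminate()
            interferer.wait()
    return main.returncode, launches
```

For example, `run_with_interference(DEMO_MAIN, DEMO_INTERFERER)` starts the short stand-in several times while the longer one runs.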
Binding applications to the same core
Use the runbind utility described here. To bind your benchmark to a particular virtual CPU (CPU 7, for example), you will run it like this:
runbind -p 7 <benchmark>
Recall that you also have to run the whole thing with cputrack to measure performance. So you will do it like this:
cputrack runbind -p 7 <benchmark>
Note that when you do this, your cputrack output file will contain measurements for both the runbind command and your benchmark. Be sure to get the output for the right PID. You can tell which PID corresponds to your benchmark by examining the output file (hint: look at the very beginning)!
Tips and tricks
- To ensure that your results are statistically significant, you will need to
run each experiment more than once and measure the mean and
standard deviation of the measurements. The standard
deviation should be small; otherwise you cannot trust the
numbers. Run each pair of benchmarks three times. If the
standard deviation is small (below 2% of the mean), don't run any more
experiments. If not, you will need to repeat each
experiment more times.
- To ensure that you get sound data, you will need to run experiments
while no one else is using the machine. Therefore, you
will need to reserve time on coolthreads or
quad. The reservation protocol is described here. You are asked to reserve time judiciously, as many of your classmates will need to use the machines as well. Note that you only need to reserve time exclusively when you run your final experiments. For the time spent learning how to run the benchmarks and use cputrack, you will need to reserve the machine in non-exclusive mode.
- When you are learning how to run SPEC benchmarks, you can use the
machine dinosaur.cs.sfu.ca. This is a
Solaris/SPARC machine, whose system interface is very
similar to that of coolthreads. While hardware
performance counters work differently on dinosaur than on
coolthreads, running SPEC benchmarks would work just the
same. You do not need a reservation for dinosaur.
- As machine time may become scarce, you are encouraged to start the assignment early!
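The rule from the first tip above (stop once the standard deviation is below 2% of the mean) can be sketched as follows. This assumes the sample standard deviation and a positive mean; the function name is hypothetical and the numbers are made up purely for illustration.

```python
import statistics

# Sketch only: summarize is a hypothetical helper and the values below are
# made-up illustrations, not measurements.

def summarize(samples, threshold=0.02):
    """Return (mean, standard deviation, stable?) for repeated measurements.

    "stable" means the sample standard deviation is below `threshold`
    (2% by default) of the mean. Assumes a positive mean.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs to compute a standard deviation")
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)  # sample (n - 1) standard deviation
    return mean, stdev, stdev / mean < threshold


tight_runs = [0.349, 0.351, 0.350]   # e.g. IPC from three runs
noisy_runs = [0.30, 0.35, 0.40]

print(summarize(tight_runs))  # stable: no more runs needed
print(summarize(noisy_runs))  # not stable: repeat the experiment
```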
What to submit
You must perform the analysis of the interference between the benchmarks in
any one type of cache (either L1 I-cache, L1 D-cache,
or L2 cache). You must submit a well-written report describing:
- The goal of the study
- Experimental platform
- Methodology for running the experiments (i.e., how you set up the experiment to measure what you want to measure, how you ensured that your results are statistically significant)
- Graphs and charts showing the results
- Analysis of the results, including discussion of any anomalies in the data