Assignment 1

In this assignment you will learn how to analyze the performance of applications using hardware counters. This is a skill that you can use in many research projects. Your goal will be to analyze how applications are affected by contention for the processor cache.

You will choose several experimental applications. Of these applications, one will be the principal application; the others will be the interfering applications. You will run the principal application with each interfering application, ensuring that the two share the same cache. (More on how to accomplish this below.) You will measure the cache miss rate (in misses per instruction) and the instructions per cycle (IPC) of the principal application as it runs with each interfering application. Based on these data you will conclude how the principal application’s cache miss rate and IPC are affected by the interfering applications. Try to provide an explanation for the effects you observe by learning something about the nature of the applications (you can do this by reading the code or by finding information about the benchmarks online).

Experimental platforms:

You will perform experiments on either one (not both) of these machines.
  • coolthreads.cs.sfu.ca – this is a Solaris/SPARC (Niagara) system with eight cores running Solaris 11. On each core there are four hardware threads, or virtual CPUs. The threads running on the same core share an L1 instruction cache and an L1 data cache. In addition, the cores share an L2 cache. The L2 cache is unified: it contains both instructions and data. On Niagara you can measure interference at two different cache levels: the L1 cache and the L2 cache.

    To measure interference in both the L1 cache and the L2 cache you will need to run your benchmarks on the same core. There are eight cores on Niagara, and four virtual CPUs per core. In total, there are 32 virtual CPUs numbered 0-31. Virtual CPUs on core 0 are numbered 0-3, virtual CPUs on core 1 are numbered 4-7, etc. To run your benchmarks on core 1, for example, you could bind them to virtual processors 4 and 6. (More on binding later; see also the sketch after this list.)

    To measure the interference in the L2 cache, but not in the L1 cache, you will need to run your benchmarks on two different cores. For example, you could bind them to virtual processors 1 and 5.

  • quad.cs.sfu.ca – this is a Solaris/x86 quad-core system running Solaris 11. On quad’s motherboard there are two physical chips (or CPU packages), each with two cores. The two cores on a chip share an L2 cache. On quad you can only test interference in the L2 cache, since the L1 cache is not shared. Virtual processors on quad are numbered 0, 1, 2, and 3. The L2 cache is shared between processors 0 and 1, and between processors 2 and 3. To test L2 cache interference, you would bind your benchmarks to processors 0 and 1, or to processors 2 and 3.
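
To make these layouts concrete, here is a sketch of the two binding configurations on coolthreads, using the runbind utility described later in this handout (the cputrack measurement options are omitted for brevity; <benchmark1> and <benchmark2> are placeholders):

# Same core: virtual CPUs 4 and 6 both sit on core 1, so the two
# benchmarks contend for the L1 caches as well as the L2 cache
runbind -p 4 <benchmark1> -p 6 <benchmark2>

# Different cores: virtual CPUs 1 and 5 sit on cores 0 and 1, so the
# benchmarks contend only for the shared L2 cache
runbind -p 1 <benchmark1> -p 5 <benchmark2>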

Using hardware counters:

To evaluate cache interference of applications, you will need to measure various performance statistics. For example, to evaluate L1 cache interference, you will measure L1 data cache misses, L1 instruction cache misses and instructions per cycle (IPC). The cache miss rate should increase if there is high cache interference between the benchmarks. IPC, on the other hand, should decrease: the higher the cache miss rate, the fewer instructions per cycle the program executes.

To measure these performance statistics you will use a tool called cputrack. To find out what kinds of hardware counters are available on your experimental machine, run “cputrack -h”. Notice the difference in hardware counters between coolthreads and quad! Read the manual page for cputrack (“man cputrack”) to learn how it works.

Here are some examples of using cputrack.

To measure L1 instruction cache misses and IPC on coolthreads, run your benchmark like this:

cputrack -evf -t -T0.1 -o output.txt -c pic0=IC_miss,sys,pic1=Instr_cnt,sys your_benchmark
  • option -t asks cputrack to print processor cycles: you will use this for the calculation of the IPC.
  • option -c tells cputrack which events to count in the two counters available on coolthreads. In this example, we count instruction cache misses (IC_miss) in counter 0 and retired instructions (Instr_cnt) in counter 1.
  • “sys” tells cputrack to count events that occur both at user level and in system calls.
  • option -T0.1 tells cputrack to sample counters ten times per second. Using larger intervals is not recommended due to a bug in cputrack (that my students and I have found): it may result in cputrack reporting wrong values.
  • option -o tells cputrack to write output to the file called “output.txt”.
  • options -evf are very important to use. Read about them in the man page.
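
Similarly, to measure L1 data cache misses and IPC on coolthreads, you would swap the event counted in pic0. A sketch, assuming the Niagara data-cache miss event is named DC_miss (verify the exact event names on your machine with “cputrack -h”):

cputrack -evf -t -T0.1 -o output.txt -c pic0=DC_miss,sys,pic1=Instr_cnt,sys your_benchmark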

Performance counters on quad

Quad has an Intel processor, of the x86 architecture. This is a CISC processor with a complex structure and many hardware counters. It may be quite challenging to figure out which counter to use to count the events you want. Here are some hints for this assignment:
  • To count the number of completed instructions use inst_retired
  • To count the number of instruction cache misses use ifu_ifetch_miss
  • To count the number of data cache misses use dcu_lines_in
  • To count the number of L2 cache misses use l2_lines_in
Also note that on quad only the pic0 counter works, so you cannot use pic1.
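
Because only one counter is available, you will need separate runs to collect each statistic. A sketch, assuming the option syntax works as on coolthreads and using the event names listed above (instr.txt and l2miss.txt are arbitrary output file names; check “cputrack -h” on quad for the exact event names and modifiers available):

# Run 1: retired instructions (the -t cycle count lets you compute IPC)
cputrack -evf -t -T0.1 -o instr.txt -c pic0=inst_retired your_benchmark
# Run 2: L2 cache misses
cputrack -evf -t -T0.1 -o l2miss.txt -c pic0=l2_lines_in your_benchmark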

Interpreting the output of cputrack.

Once you have run your benchmark via cputrack, cputrack will produce an output file (output.txt in the example above). This file will have many lines reporting hardware counter values – one for each sampling interval. You probably care about the aggregate results for the entire run, so you should look for a line that looks something like this:
1229.003 12841 1 fini_lwp 157486144176 3040518 54988926444
There are seven columns in this line:
  • Column #1 is the wallclock time (in seconds) since the beginning of the program.
  • Column #2 is the PID of the process running your benchmark.
  • Column #3 is the light-weight process (LWP) id – there would be multiple of these if you ran a multithreaded application.
  • Column #4 (fini_lwp) tells you that this is the aggregate statistics for when the process exits.
  • Column #5 is the number of CPU cycles that elapsed since your program started (this measures the cycles that your program spent on CPU, not counting the time it blocked on I/O or was descheduled).
  • Column #6 counts the number of instruction cache misses.
  • Column #7 counts the number of retired instructions.
To calculate the IPC, you would divide column #7 by column #5. To calculate the instruction cache miss rate (in misses per instruction), you would divide column #6 by column #7.
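
For the sample line above, that works out to an IPC of about 0.35 (54988926444 / 157486144176) and about 0.000055 instruction cache misses per instruction (3040518 / 54988926444). As a quick sanity check, here is a one-liner sketch, assuming the seven-column fini_lwp layout shown above:

awk '/fini_lwp/ { printf("IPC = %.3f, misses/instr = %.6f\n", $7/$5, $6/$7) }' output.txt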

Experimental applications

You will use programs from the SPEC CPU2000 benchmark suite. You can find more information about these benchmarks here. There is a variety of applications in this benchmark suite – from them, choose two main benchmarks and two interfering benchmarks. (So you will have four pairs of benchmarks in total.) Make sure that you pick both memory-intensive and CPU-intensive applications: one main application should be memory-intensive and the other CPU-intensive, and the same goes for the interfering applications. Note that you will need to run your benchmarks “by hand” as opposed to using the runspec utility. You can read what it’s all about here.

When you run the main application with an interfering one, you need to ensure that the interfering application keeps running while the main application is running. So if the interfering application is shorter than the main application, you'll need to restart the interfering application while the main application runs (a simple loop script, sketched below, takes care of this).
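
One simple way to do this is to wrap the interfering benchmark in a loop script and bind the script rather than the benchmark itself, using the runbind utility described in the next section. A sketch, assuming a Bourne-compatible shell and assuming that runbind's processor bindings are inherited by child processes (as they are with the Solaris pbind facility):

#!/bin/sh
# loop.sh – restart the interfering benchmark until this script is killed
while true; do
    <interfering_benchmark> arg1 arg2
done

You would then run something like “runbind -p 5 sh loop.sh” alongside the main application and kill the loop once the main application finishes.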

Binding applications to the same core

Use the runbind utility described here. To bind your benchmark to a particular virtual CPU (CPU 7, for example), you will run it like this:
runbind -p 7 <benchmark>
Recall that you also have to run the whole thing with cputrack to measure performance. So you will combine the two like this:
cputrack runbind -p 7 <benchmark>
Note that when you do this, your cputrack output file will contain measurements for both the runbind command and your benchmark. Be sure to get the output for the right PID. You can tell which PID corresponds to your benchmark by examining the output file (hint: look in the very beginning)!

To bind multiple benchmarks, for instance one to CPU 7 and another to CPU 8, use runbind like this:

cputrack runbind -p 7 <benchmark1> arg1 arg2 -p 8 <benchmark2> arg1 arg2

If your program takes arguments of the form "-p", runbind will get confused. For instance, in the following example:

benchmark -p -p
it will assume that the "-p" argument denotes the specification of the next benchmark to launch. To get around this problem, use the "runbind-one-command" program, and launch multiple instances with cputrack as follows:

For bash:

cputrack runbind-one-command -p 7 <benchmark1> -p arg1 -p arg2 & cputrack runbind-one-command -p 8 <benchmark2> -p arg1 -p arg2
For csh and tcsh, put a ";" after "&".

Tips and tricks

  • To ensure that your results are statistically significant, you will need to run each experiment more than once and compute the mean and standard deviation of the measurements. The standard deviation should be small; otherwise you cannot trust the numbers. Run each pair of benchmarks three times. If the standard deviation is small (below 2% of the mean), don't run any more experiments. If not, you will need to repeat each experiment more times. (See the sketch after this list for one way to compute these statistics.)
  • To ensure that you get sound data, you will need to run experiments while no one else is using the machine. Therefore, you will need to reserve time on coolthreads or quad. The reservation protocol is described here. You are asked to reserve time judiciously, as many of your classmates will need to use it as well. Note that you only need to reserve time exclusively when you run your final experiments. For the time spent learning how to run the benchmarks and use cputrack, you will need to reserve the machine in a non-exclusive mode.
  • When you are learning how to run SPEC benchmarks, you can use the machine dogwood.css.sfu.ca. This is a Solaris/SPARC machine whose system interface is very similar to that of coolthreads. While hardware performance counters work differently on dogwood than on coolthreads, running SPEC benchmarks works just the same. You do not need a reservation for dogwood. You will need an FAS account to log on to dogwood, which you should have if you are a graduate student. If you are an undergraduate student you can apply for an FAS account: you will need to fill out this form and bring it to me for signature.
  • As machine time may become scarce, you are encouraged to start the assignment early!
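
A minimal sketch of the mean/standard-deviation check, assuming a bash-like shell and a hypothetical file ipc.txt containing one IPC measurement per line (one line per run):

awk '{ s += $1; ss += $1*$1; n++ }
     END { m = s/n; sd = sqrt(ss/n - m*m);
           printf("mean = %f, stddev = %f (%.1f%% of mean)\n", m, sd, 100*sd/m) }' ipc.txt

(This computes the population standard deviation; with only three runs you may prefer the sample standard deviation, which divides by n-1 instead of n.)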
What to submit

    You must prepare a well-written report of your analysis. Please spell check your report before submitting! Pick your favourite paper that we have read so far, and model your report after the experimental section in that paper. Your report must not exceed 5 pages in 10-point Times New Roman font with 1-inch margins (so make your figures small and pretty -- but not too small, so they are still readable). I will deduct points if your report does not comply with the formatting specifications. Do not play with line spacing or other formatting tricks to fit more text.

    You must perform the analysis of the interference between the benchmarks in any one type of cache (either L1 I-cache, L1 D-cache, or L2 cache).

    Discuss the following in the report:

    • The goal of the study
    • Experimental platform
    • Benchmarks
    • Methodology for running the experiments (i.e., how you set up the experiment to measure what you want to measure, how you ensured that your results are statistically significant)
    • Graphs and charts showing the results
    • Analysis of the results, including discussion of any anomalies in the data
    • Conclusions