
|
Assignment 2
In class we studied scalable synchronization, and in particular we
learned that how you implement a spinlock can significantly affect
performance of the application. In this assignment you will experiment
with different lock implementations and build your own.
Experimental platforms:
You will perform experiments on one or more of these machines.
- quad.cs.sfu.ca – this is a Solaris/x86 quad-core
system from Intel, running Solaris 11. Quad has two physical chips (or
CPU packages), each with two cores. The two cores share an L2
cache. L2 cache is shared among processors on the same chip:
processors 0 and 1, and processors 2 and 3. Lock
synchronization will be lower between cores located on the same chip
than between cores on different chips.
- octavia.cs.sfu.ca -
this is a Solaris/x86 system similar to quad, but instead of four cores
it has eight. There are four pairs of cores, each pair shares an L2
cache.
- trocadero.cs.sfu.ca
-this is a Solaris/x86 system from AMD, running Solaris 11. It has two
chips. Each chip has four cores. Each core has private L1 instruction
and data caches and a private L2 cache. The L3 cache is shared among
the cores on the same chip.
- sumi.cs.sfu.ca
-this system is very similar to trocadero. It is a Solaris/x86 system from
AMD, running Solaris 11. It
has two chips. Each chip has six cores. Each core has private L1
instruction and data caches and a private L2 cache. The L3 cache is
shared among the cores on the same chip. Sumi has a more efficient cache
coherence protocol than trocadero. The protocol is based on
directory lookup, rather than broadcast.
- miga.cs.sfu.ca
-this system is exactly like sumi. It is a Solaris/x86 system from
AMD, running Solaris 11. It
has two chips. Each chip has six cores. Each core has private L1
instruction and data caches and a private L2 cache. The L3 cache is
shared among the cores on the same chip. Miga has a more efficient cache
coherence protocol than trocadero. The protocol is based on
directory lookup, rather than broadcast.
Before you begin:
Make sure that you can log in to one of the servers suggested above.
Also make sure that the directory containing Sun Studio compiler is in
your path. Otherwise you will not be able to compile the code that you
must use for this assignment.
The directory containing the compiler is:
/opt/SUNWspro/bin/cc
If you do not know how to add it to your path, please find the answer
on Google!
Your experimental software:
You will begin experimenting with a simple test program that runs
several threads competing for locks. The program uses three lock
implementations:
- pthreads mutex -- the default Solaris implementation
- lightweight spinlock -- the author's interpretation of the
Solaris spinlock
- load control lock -- the author's implementation based on
the Johnson's algorithm we read about in class.
To become familiar with the experimental software, here is what you
need to do:
- Download the software to one of
the servers in the list above.
The software consists of the dynamically linked library contatining the
implementation of the lightweight spinlock and the load-control lock
and the test program.
- Compile the library and the test program:
cd LC_IMPL; make
cd ../; make
Becoming familiar with the code:
You need to read the code, both the test application and the library
implementation (in the LC_IMPL directory) to understand how it works.
When you read the library code, focus only on load_control.c file. The
rest of the files were borrowed and adapted from the OpenSolaris kernel
to support load control. Read them if you are interested, but this is
not necessary for the assignment.
To make sure you understand the code, answer the following questions
about it. Answers to the questions must be submitted in your assignment
report, so be sure to write them down. Note that some answers you will
be able to find just by
reading the code. For others, you might have to do some searching
online!
- How do you control which lock to use -- what variable or
argument?
- How do you control the number of threads in the test
program?
- What are the four different ways to control the amount of
lock contention in the test program?
- What is the default sleep interval for the load controller
thread? How can you change it?
- How can you control the amount of work that each thread
performs between critical sections -- which variable or argument?
- How does the test program obtain execution time statistics,
which are printed at the end of the run?
Evaluating performance of locks:
Your next step is to run an experiment that will test the performance
of the three different locks used by the test program. You need
to decide on your own how you are going to design this experiment. Your
goal is to understand which lock implementation performs best. You may
find that different locks perform better in different conditions -- for
instance, depending on what server you use, how many threads you use
(fewer than the number of cores or more), and the degree of lock
contention. So you will need to think how you will vary these different
parameters to understand the performance of locks in different
conditions.
Performance of the test program may vary from one run to another, so
for each experiment you need to perform several runs and report the
mean and standard deviation (or you can use another way to summarize
the distribution of data).
Implementing Anderson Queue
lock:
You final goal is to add the implementation of the Anderson queue lock
that we studied in class. If you'd like more details about how it
should be implemented, refer to this
book chapter on spinlocks. (This link is password protected. I will
tell you the password in class).
Feel free to refer to the implementation of the lightweight spinlock in
the library provided to you, in order to implement efficient spinning
in the Anderson queue lock.
Once you are done, analyze the performance of the test program under
the Anderson queue lock and compare the results with your analysis of
the
other three locks.
Do your results lead to the same conclusion as
those in the Decoupling
Contention
Management
from
Scheduling paper we read in class?
What to submit
You must prepare a well-written report of your analysis. Please spell
check your report before submitting! Pick your favourite paper that we
read so far, and model your report after the experimental section in
that paper. Your report must not exceed 5 pages in 10-point Times New
Roman font with 1-inch margins (so make your figures small and pretty
-- but not too small, so they are still readable). I will deduct points
if your report does not comply with the formating specifications. Do
not play with line spacing or other formatting tricks to fit more text.
Include the following in the report:
- Experimental hardware-- what server you used and key
characteristics of the server.
- Experimental design -- explain how you ran your experiment.
Give enough details so another person could reproduce it, but be
concise (there is no need to provide the scripts).
- Justification for the experiment -- explain why you set up
your experiment in the way you did. What did you want to learn and why
did you decide that the experimental setup that you used was the best
way to answer the questions you were looking to investigate?
- Report the results.
- Analyize the data: does it support your original prediciton
of how the system will behave? If not, explain why. If some behaviour
is unexpected, do not leave it unexplained. Investigate why the results
do not meet your expectations.
- Write a well thought-out concise summary of your findings.
Compare and contrast your results with those in the Decoupling
Contention
Management
from
Scheduling paper we read in class.
- Include the appendix containing the answers to the code
reading questions. The appendix does not count towards 5 pages, must
have the same format as the rest of the report and must not exceed two
pages.
- The hard-copy of the final report is due in class.
|