Parallel Computation Example

Since the last compiler/interpreter/language comparison using the straightforward Mandelbrot set calculation was fun, I thought I'd try it again.

This time, I wanted to experiment with parallelizing the calculation: using as many processor cores as are available to finish the calculation more quickly. The Mandelbrot calculation is embarrassingly parallel, so it was very easy to modify. Since almost all of the work is parallelizable, Amdahl's law predicts good speedup, approximately proportionate to the number of cores.

In each case, this was done with the thread pool pattern: creating a bunch of threads that pick jobs from a queue and execute them.

Python Implementation

The Python multiprocessing module, introduced in Python 2.6 makes working with process pools easy: you can easily create a process pool and just start feeding it tasks.

My parallel implementation in mandel-parallel.py adds relatively little to my original implementation. I mostly just had to split the calculation of a single row off into a separate function that could be called by a pool process.

With two cores, the speedup was very close to 200%.

It should also be noted that there is absolutely no problem using code from a Cython module in a multiprocessing job: you just have to import the module and call the function. As long as you have pure Cython functions, this should be very fast and very easy.

Results with PyPy were just about the same: close to a 200% speedup when using two cores.

Java Implementation

Not surprisingly, the Java implementation suffers the syntactic overhead of static typing, but can do thread pools relatively easily using the ThreadPoolExecutor class. The code that does the work must be in a separate class that implements Runnable.

In order to make everything work, the Runnable implementation needs a constructor that takes whatever arguments are needed, the run() method that does the work and stores the result, and another method that returns the stored result.

My implementation in MandelParallel.java does all of this. Again, the speedup was almost 200% with two cores.

Haskell Implementation

The Haskell parallel package provides some convenient tools for parallelizing functional code. I used this in my mandel-multi.hs. (Compare this with my other parallel Haskell example.)

In fact, the Haskell code was the easiest to modify to run in parallel (once I figured out how): I simply replaced a map with parMap rseq (and used deepseq to kill off some lazy evaluation of the inner lists). The parMap function does the same thing as map, but does it in parallel. The first argument, rseq, indicates a “strategy”: it basically says not to be lazy about the calculation and that parMap should actually compute the values.

GHC does take some convincing to actually run the code in parallel. It must be compiled and run like this:

ghc -O2 --make -with-rtsopts="-N2" -threaded mandel-multi.hs
./mandel-multi

(The -threaded argument tells GHC to include the multi-thread stuff. The -N2 argument tells the runtime to use two threads.)

The speedup was very close to 200% over the single-threaded version without parMap.

The speedup seemed to very sensitive to the exact evaluation strategies used: when I tried with rdeepseq instead of rseq, the speedup was 175%. Without deepseq used in the row function, the program used both cores but took longer than the single-threaded version. The lesson seems to be that controlling lazy evaluation in a non-trivial program takes some skill.

C++ Non-Implementation

I didn't bother implementing the code in C or C++, but it seems that there are some perfectly reasonable C++ Thread Pool options.

Chunk Size

The size of the piece of work that is given to each thread is important.

If the chunk is too small, then the threads spend an excessive time communicating with each other. As an example, I modified the Python implementation so it gives each pixel to a worker (instead of each row) in mandel-parallel-tiny.py. This implementation took around 15% longer than the above (which was actually better than I thought, presumably because the amount of work required to calculate each pixel is relatively large).

If the chunk size is too large, we risk not using our available cores effectively. For example, if we split our job into exactly one chunk for each core, we might find that one of the jobs completes earlier (because it was easier, or that core didn't get interrupted by other work). That will leave the other cores working to finish while one sits idle. As an example, I set chunksize=ystep/2 in the Python implementation (on two cores). This added about 6% to the run time (even though these two chunks should be quite similar in running time).

See also my parallel Haskell example for another look at the chunk size given to each thread/process.


Copyright © , last modified 2012-07-13.