This time, I wanted to experiment with parallelizing the calculation: using as many processor cores as are available to finish the calculation more quickly. The Mandelbrot calculation is embarrassingly parallel, so it was very easy to modify. Since almost all of the work is parallelizable, Amdahl's law predicts good speedup, approximately proportionate to the number of cores.
In each case, this was done with the thread pool pattern: creating a bunch of threads that pick jobs from a queue and execute them.
My parallel implementation in
mandel-parallel.py adds relatively little to my original implementation. I mostly just had to split the calculation of a single row off into a separate function that could be called by a pool process.
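The heart of it looks roughly like this (a minimal sketch, not the exact code from mandel-parallel.py; the image size and the mandel_row helper are stand-ins):

    from multiprocessing import Pool

    WIDTH, HEIGHT, MAX_ITER = 800, 600, 1000

    def mandel_row(y):
        """Compute the iteration counts for a single row of the image."""
        row = []
        for x in range(WIDTH):
            c = complex(-2.5 + 3.5 * x / WIDTH, -1.25 + 2.5 * y / HEIGHT)
            z, count = 0j, 0
            while abs(z) <= 2 and count < MAX_ITER:
                z = z * z + c
                count += 1
            row.append(count)
        return row

    if __name__ == '__main__':
        # one worker process per available core; each job is one row
        with Pool() as pool:
            image = pool.map(mandel_row, range(HEIGHT))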
With two cores, the speedup was very close to 200%.
It should also be noted that there is absolutely no problem using code from a Cython module in a
multiprocessing job: you just have to import the module and call the function. As long as you have pure Cython functions, this should be very fast and very easy.
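Something like this, say, where mandel_cy stands in for whatever the compiled Cython module is actually called and mandel_row for its per-row function:

    from multiprocessing import Pool
    import mandel_cy   # hypothetical compiled Cython extension module

    def row(y):
        # a plain Python wrapper: each job just calls into the fast Cython code
        return mandel_cy.mandel_row(y)

    if __name__ == '__main__':
        with Pool() as pool:
            image = pool.map(row, range(600))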
Results with PyPy were just about the same: close to a 200% speedup when using two cores.
Not surprisingly, the Java implementation suffers from the syntactic overhead of static typing, but can do thread pools relatively easily using the
ThreadPoolExecutor class. The code that does the work must be in a separate class that implements Runnable. In order to make everything work, the
Runnable implementation needs a constructor that takes whatever arguments are needed, the
run() method that does the work and stores the result, and another method that returns the stored result.
My implementation in
MandelParallel.java does all of this. Again, the speedup was almost 200% with two cores.
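In outline, that structure looks something like this (a sketch with my own names, not necessarily those in MandelParallel.java; Executors.newFixedThreadPool returns a ThreadPoolExecutor under the hood):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    class RowTask implements Runnable {
        static final int WIDTH = 800, HEIGHT = 600, MAX_ITER = 1000;
        private final int y;      // the constructor takes whatever arguments the job needs
        private int[] result;     // run() stores its result here

        RowTask(int y) { this.y = y; }

        public void run() {       // does the work and stores the result
            result = new int[WIDTH];
            for (int x = 0; x < WIDTH; x++) {
                double cr = -2.5 + 3.5 * x / WIDTH;
                double ci = -1.25 + 2.5 * y / HEIGHT;
                double zr = 0.0, zi = 0.0;
                int count = 0;
                while (zr * zr + zi * zi <= 4.0 && count < MAX_ITER) {
                    double tmp = zr * zr - zi * zi + cr;
                    zi = 2.0 * zr * zi + ci;
                    zr = tmp;
                    count++;
                }
                result[x] = count;
            }
        }

        int[] getResult() { return result; }   // fetch the stored result afterwards
    }

    public class MandelParallelSketch {
        public static void main(String[] args) throws InterruptedException {
            // a fixed-size pool, one thread per available core
            ExecutorService pool = Executors.newFixedThreadPool(
                    Runtime.getRuntime().availableProcessors());
            RowTask[] tasks = new RowTask[RowTask.HEIGHT];
            for (int y = 0; y < RowTask.HEIGHT; y++) {
                tasks[y] = new RowTask(y);
                pool.execute(tasks[y]);      // queue the job for a pool thread
            }
            pool.shutdown();                 // no more jobs; let the queue drain
            pool.awaitTermination(1, TimeUnit.HOURS);
            System.out.println(tasks[0].getResult()[0]);   // each row is now available
        }
    }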
In fact, the Haskell code was the easiest to modify to run in parallel (once I figured out how): I simply replaced a map with parMap rseq (and used
deepseq to kill off some lazy evaluation of the inner lists). The
parMap function does the same thing as
map, but does it in parallel. The first argument,
rseq, indicates a “strategy”: it basically says not to be lazy about the calculation and that
parMap should actually compute the values.
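A minimal sketch of the change (the mandel and row functions here are illustrative rather than the exact code in mandel-multi.hs, and I use force from the deepseq package to defeat the laziness of the inner lists):

    import Control.Parallel.Strategies (parMap, rseq)
    import Control.DeepSeq (force)

    mandel :: Double -> Double -> Int
    mandel cr ci = go 0 0 0
      where
        go zr zi n
          | n >= 1000         = n
          | zr*zr + zi*zi > 4 = n
          | otherwise         = go (zr*zr - zi*zi + cr) (2*zr*zi + ci) (n + 1)

    row :: Int -> [Int]
    row y = force [ mandel (-2.5 + 3.5 * fromIntegral x / 800)
                           (-1.25 + 2.5 * fromIntegral y / 600)
                  | x <- [0 .. 799] ]   -- force: don't hand back a lazy thunk

    image :: [[Int]]
    image = parMap rseq row [0 .. 599]  -- was:  map row [0 .. 599]

    main :: IO ()
    main = print (sum (map sum image))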
GHC does take some convincing to actually run the code in parallel. It must be compiled and run like this:
    ghc -O2 --make -with-rtsopts="-N2" -threaded mandel-multi.hs
    ./mandel-multi
(The -threaded argument tells GHC to include the multi-threaded runtime. The
-N2 argument tells that runtime to use two threads.)
The speedup was very close to 200% over the single-threaded version built without these options.
The speedup seemed to be very sensitive to the exact evaluation strategies used: when I tried with
rdeepseq instead of
rseq, the speedup was 175%. Without
deepseq used in the
row function, the program used both cores but took longer than the single-threaded version. The lesson seems to be that controlling lazy evaluation in a non-trivial program takes some skill.
I didn't bother implementing the code in C or C++, but it seems that there are some perfectly reasonable C++ Thread Pool options.
The size of the piece of work that is given to each thread is important.
If the chunk is too small, then the threads spend an excessive amount of time communicating with each other. As an example, I modified the Python implementation so it gives each pixel to a worker (instead of each row) in
mandel-parallel-tiny.py. This implementation took around 15% longer than the above (which was actually better than I thought, presumably because the amount of work required to calculate each pixel is relatively large).
If the chunk size is too large, we risk not using our available cores effectively. For example, if we split our job into exactly one chunk for each core, we might find that one of the jobs completes earlier (because it was easier, or that core didn't get interrupted by other work). That will leave the other cores working to finish while one sits idle. As an example, I set
chunksize=ystep/2 in the Python implementation (on two cores). This added about 6% to the run time (even though these two chunks should be quite similar in running time).
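In sketch form, the two experiments amount to changing how the work is batched (again with stand-in helper names, and with the per-pixel version expressed as chunksize=1, which may not be literally how mandel-parallel-tiny.py does it):

    from multiprocessing import Pool

    WIDTH, HEIGHT, MAX_ITER = 800, 600, 1000

    def mandel_pixel(coords):
        x, y = coords
        c = complex(-2.5 + 3.5 * x / WIDTH, -1.25 + 2.5 * y / HEIGHT)
        z, count = 0j, 0
        while abs(z) <= 2 and count < MAX_ITER:
            z = z * z + c
            count += 1
        return count

    def mandel_row(y):
        return [mandel_pixel((x, y)) for x in range(WIDTH)]

    if __name__ == '__main__':
        coords = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]
        with Pool(2) as pool:
            # too fine: one pixel per job means far more messages than work
            tiny = pool.map(mandel_pixel, coords, chunksize=1)
            # too coarse: one giant chunk per worker, so a core that finishes
            # its half early sits idle until the other catches up
            coarse = pool.map(mandel_row, range(HEIGHT), chunksize=HEIGHT // 2)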
See also my parallel Haskell example for another look at the chunk size given to each thread/process.