Writing the following text I had the C++, POSIX and especially POSIX threads context in mind.
The Wide Finder project tries to answer the following question: "What is a good way to write computer programs to take advantage of modern slow-clock-rate/many-core computers, without imposing excessive complexity on programmers?" (See [1])
Currently I think that multithreaded programming is a good answer to the question. If a problem can be parallelized, then it does not cause much hassle to parallelize it using multithreaded programming. In my opinion the problems with multithreaded programming are exaggerated most of the time.
Basically one has to consider that no two threads are allowed to write or read/write to the same memory location at the same time without synchronisation. That's all.
The Wide Finder 2 benchmark is about collecting some statistics from a 42 GiB webserver logfile (for more information see [2]). Each line has to be parsed and analyzed. A simple single-threaded C++ version of the reference program can be found in wide_finder_single.cpp. This version is a little bit longer than necessary because of the older GCC version on the test machine. On the other hand there are no comments. Note the nice partial_sort_copy algorithm.
The runtime of this version is unbearable slow on the test machine. The reason probably is the slow memory allocation on the machine. The code does lots of memory allocations.
Parallelizing the task is quite straightforward: Because the lines are very short compared to the whole file, one can divide the file into equal sized chunks and give them to an appropriate number of threads. There are two things to consider:
An implementation of this strategy can be found in wide_finder_multi.cpp. The code size has nearly doubled but this is due to the very short base. On the test machine this code runs within 37 minutes with 32 threads. Most of the time all 32 cores run at full speed. This shows that the implementation is not I/O bound. However there has to be some multithreaded overhead because the scaling was not really linear. I also ran it on an Opteron machine with 128 GiB of main memory and 16 cores where the whole file can be cached. Here the scaling was nearly linear from 1 to 16 worker threads. This machine is faster than the test platform and it shows better multithreading behaviour.
Given the above results with the multithreaded version I changed my focus. The parallelization was successful with nearly linear scaling and it was easily done. But there were programs on the result page [3] that were much faster than mine. Now I just wanted to write the fastest program for this problem that I was able to write. At least it should be as fast as the currently fastest implementation on the results page.
It turned out that making the program as fast as possible is no longer a parallelization problem. One now has to see how to use the available hardware and operating system in the best possible way. Optimizing the runtime of a single thread is identical to optimizing the runtime of a singlethreaded version. In my opinion the scope of the benchmark is left behind.
With the new goal in mind I started to optimize the program. As everyone knows one should never start to optimize a program without hard data about the bottlenecks. But there was a new problem: On Linux I would just use Oprofile but I have no idea how to do profiling on Solaris. Additionally a similar system like Oprofile would require root access that I do not have. Therefore I have to admit that I used my intuition about the bottlenecks to improve the runtime.
After I ran the first experiments, my intuition told me three things. First, the whole problem has to be I/O bound given a sufficient number of threads because the task is very easy. Second, the memory bandwidth is the main bottleneck of the machine and it seems as if it is connected to the I/O bandwidth. Third, memory management is very poor on that machine.
To remove as much pressure as possible from the memory management I used the following modifications:
There are several ways to read the data from the file. I tried the following ones. They are ordered from slowest to fastest.
Very early in the process I started to use the HOARD Memory Allocator [4]. The improvement was between 10 and 20 percent.
I've tried three different regular expression engines. PCRE [5] was the fastest one. Boost.Regex [6] and Boost.Xpressive [7] were much slower than PCRE for this task.
The current approach divides the file into n chunks if there are n threads to work on them. Although the chunks are equal sized, the runtimes of the threads vary by about 70 percent. It would be possible to have the threads working all time by using smaller chunks and a consumer queue for the threads. However it turns out that this approach is not faster than the current one. One could also parallelize the report phase of the program that currently takes 13 seconds on average, but it is not worth the hassle.
The final result can be found in wide_finder_aio.cpp
The resulting program is bound by the I/O and memory subsytem of the test machine. If I understand it correctly the machine has 8 processors. It makes no sense to run the program with more than 8 worker threads. Running it with more than 8 threads shows higher runtimes. This is probably because of contention in the hardware that leads to more overhead.
The runtime of the program is quite predictable. However if there are blocks of the file cached in main memory and they are used before they are evicted by other parts of the memory, then one can achieve better runtimes than what is given by the raw I/O subsystem. Running with the same number of threads over and over again shows nearly no cache gains because the program starts to work on parts of the file that were evicted first by the previous run. The remaining cached blocks are most probably evicted while the current program runs.
The program runs on average 5 minutes and 50 seconds with 8 threads. However if the caches are filled one can achieve better results. The best runtime I saw was 5 minutes and 1 seconds when a 16 thread version ran directly after a 12 thread version. (Interestingly an open/read/close version ran once in only 4 minutes and 40 seconds)
Here is a file that shows different runs on the whole file. The first row shows the number of threads. The second row shows the real elapsed time. The third row shows the user time and the last one the system time. The runs have been started in the ordering that you find in the file. You see how changing the number of worker threads affects the overall runtime.
The program uses about 1.4 GiB of virtual memory.
I think I have reached my goal of writing the fastest implementation I can. I have no ideas to improve the runtime that are not already proven wrong. To go further one has to do it correctly by using hard profiling data.
Another important conclusion is: I have to learn OCaml.
If you have remarks or questions regarding my approach just write me. widefinder@pontohonk.de
Compiling the code on the test platform is not so easy. First one has to have a 64bit version of PCRE. I did not find one on the machine and therefore installed it in /export/home/bartoschek/software/. To compile one of the programs above you can use the following line:
g++ -m64 -O2 wide_finder_aio.cpp -o finder \
-I/export/home/bartoschek/software/include \
-L/export/home/bartoschek/software/lib \
-lpcrecpp -lrt
To run the code you have to add two directories to the library search path. For example by:
export LD_LIBRARY_PATH=\
/export/home/bartoschek/software/lib:\
/usr/sfw/lib/sparcv9
Afterwards you can just start the program with two arguments. The input data file and the number of threads to use.
Written by Christoph Bartoschek
[1] http://wikis.sun.com/display/WideFinder/Wide+Finder+Home
[2] http://wikis.sun.com/display/WideFinder/The+Benchmark
[3] http://wikis.sun.com/display/WideFinder/Results
[6] http://www.boost.org/doc/libs/release/libs/regex/
[7] http://www.boost.org/doc/libs/release/doc/html/xpressive.html