Wide Finder 2 in C++ and POSIX

Preface

Writing the following text I had the C++, POSIX and especially POSIX threads context in mind.

Wide Finder

The Wide Finder project tries to answer the following question: "What is a good way to write computer programs to take advantage of modern slow-clock-rate/many-core computers, without imposing excessive complexity on programmers?" (See [1])

Currently I think that multithreaded programming is a good answer to the question. If a problem can be parallelized, then it does not cause much hassle to parallelize it using multithreaded programming. In my opinion the problems with multithreaded programming are exaggerated most of the time.

Basically one has to consider that no two threads are allowed to write or read/write to the same memory location at the same time without synchronisation. That's all.

The Wide Finder 2 benchmark is about collecting some statistics from a 42 GiB webserver logfile (for more information see [2]). Each line has to be parsed and analyzed. A simple single-threaded C++ version of the reference program can be found in wide_finder_single.cpp. This version is a little bit longer than necessary because of the older GCC version on the test machine. On the other hand there are no comments. Note the nice partial_sort_copy algorithm.

The runtime of this version is unbearable slow on the test machine. The reason probably is the slow memory allocation on the machine. The code does lots of memory allocations.

The multithreaded version

Parallelizing the task is quite straightforward: Because the lines are very short compared to the whole file, one can divide the file into equal sized chunks and give them to an appropriate number of threads. There are two things to consider:

A chunk can start and end in the middle of the line. One has to decide what to do with such lines. My solution is that all threads ignore the first line in their chunk because it is considered by the thread that works on the previous chunk. The thread that gets the first chunk is not allowed to ignore its first line.
Because all threads heavily update the statistics it is prohibitive to have a global statistic object. Each thread has to get its own statistic object and after processing the statistics have to be merged.

An implementation of this strategy can be found in wide_finder_multi.cpp. The code size has nearly doubled but this is due to the very short base. On the test machine this code runs within 37 minutes with 32 threads. Most of the time all 32 cores run at full speed. This shows that the implementation is not I/O bound. However there has to be some multithreaded overhead because the scaling was not really linear. I also ran it on an Opteron machine with 128 GiB of main memory and 16 cores where the whole file can be cached. Here the scaling was nearly linear from 1 to 16 worker threads. This machine is faster than the test platform and it shows better multithreading behaviour.

Motivation

Given the above results with the multithreaded version I changed my focus. The parallelization was successful with nearly linear scaling and it was easily done. But there were programs on the result page [3] that were much faster than mine. Now I just wanted to write the fastest program for this problem that I was able to write. At least it should be as fast as the currently fastest implementation on the results page.

It turned out that making the program as fast as possible is no longer a parallelization problem. One now has to see how to use the available hardware and operating system in the best possible way. Optimizing the runtime of a single thread is identical to optimizing the runtime of a singlethreaded version. In my opinion the scope of the benchmark is left behind.

Optimization

With the new goal in mind I started to optimize the program. As everyone knows one should never start to optimize a program without hard data about the bottlenecks. But there was a new problem: On Linux I would just use Oprofile but I have no idea how to do profiling on Solaris. Additionally a similar system like Oprofile would require root access that I do not have. Therefore I have to admit that I used my intuition about the bottlenecks to improve the runtime.

After I ran the first experiments, my intuition told me three things. First, the whole problem has to be I/O bound given a sufficient number of threads because the task is very easy. Second, the memory bandwidth is the main bottleneck of the machine and it seems as if it is connected to the I/O bandwidth. Third, memory management is very poor on that machine.

To remove as much pressure as possible from the memory management I used the following modifications:

The data chunks are written to char arrays and parsing is done without any copy. Basically there is only the copy from the file to the main memory left.
All strings a thread has to handle are stored uniquely in the main memory. Each thread has a string store. Within a thread each string is identified by a pointer to it.
The hashtables that are used profit from the previous optimization. The statistics of each thread are hashed by the pointer and not by the whole string. Only for the strings store and for the merging phase one has to hash over the whole string.
For each string that is needed for the statistics there are at most n copies allocated where n is the number of worker threads. During the merge phase no new strings are allocated because one can use one of the existing copies.

There are several ways to read the data from the file. I tried the following ones. They are ordered from slowest to fastest.

The C++ fstream class as it is used in the first two versions was the slowest alternative.
The C FILE API turned out to be a little bit faster than the C++ fstream class but it was still very slow. The main reason might be the additionally memory copies both APIs impose.
A whole file mmap gave better results than the streaming APIs. However there were not many gains with more threads. Tuning the mmap behaviour with madvise did not improve anything.
The basic open/read/write/close POSIX API gave very good results. There is basically only a single copy from the file to the main memory as with mmap. However this API was way faster than the mmap approach. Maybe this is the case because the same memory buffer is heavily reused.
The best results gave the asynchronous I/O POSIX API. Here one advises the kernel to read data in a buffer while one processes the data of another buffer. Ideally the kernel sees the I/O requests very early and is able to schedule them optimally.

Very early in the process I started to use the HOARD Memory Allocator [4]. The improvement was between 10 and 20 percent.

I've tried three different regular expression engines. PCRE [5] was the fastest one. Boost.Regex [6] and Boost.Xpressive [7] were much slower than PCRE for this task.

The current approach divides the file into n chunks if there are n threads to work on them. Although the chunks are equal sized, the runtimes of the threads vary by about 70 percent. It would be possible to have the threads working all time by using smaller chunks and a consumer queue for the threads. However it turns out that this approach is not faster than the current one. One could also parallelize the report phase of the program that currently takes 13 seconds on average, but it is not worth the hassle.

The final result can be found in wide_finder_aio.cpp

Results

The resulting program is bound by the I/O and memory subsytem of the test machine. If I understand it correctly the machine has 8 processors. It makes no sense to run the program with more than 8 worker threads. Running it with more than 8 threads shows higher runtimes. This is probably because of contention in the hardware that leads to more overhead.

The runtime of the program is quite predictable. However if there are blocks of the file cached in main memory and they are used before they are evicted by other parts of the memory, then one can achieve better runtimes than what is given by the raw I/O subsystem. Running with the same number of threads over and over again shows nearly no cache gains because the program starts to work on parts of the file that were evicted first by the previous run. The remaining cached blocks are most probably evicted while the current program runs.

The program runs on average 5 minutes and 50 seconds with 8 threads. However if the caches are filled one can achieve better results. The best runtime I saw was 5 minutes and 1 seconds when a 16 thread version ran directly after a 12 thread version. (Interestingly an open/read/close version ran once in only 4 minutes and 40 seconds)

Here is a file that shows different runs on the whole file. The first row shows the number of threads. The second row shows the real elapsed time. The third row shows the user time and the last one the system time. The runs have been started in the ordering that you find in the file. You see how changing the number of worker threads affects the overall runtime.

The program uses about 1.4 GiB of virtual memory.

Conclusion

I think I have reached my goal of writing the fastest implementation I can. I have no ideas to improve the runtime that are not already proven wrong. To go further one has to do it correctly by using hard profiling data.

Another important conclusion is: I have to learn OCaml.

If you have remarks or questions regarding my approach just write me. widefinder@pontohonk.de

Compiling

Compiling the code on the test platform is not so easy. First one has to have a 64bit version of PCRE. I did not find one on the machine and therefore installed it in /export/home/bartoschek/software/. To compile one of the programs above you can use the following line:


g++ -m64 -O2 wide_finder_aio.cpp -o finder \
    -I/export/home/bartoschek/software/include \
    -L/export/home/bartoschek/software/lib \
    -lpcrecpp -lrt

Figure 1: Compiling the Code

To run the code you have to add two directories to the library search path. For example by:


export LD_LIBRARY_PATH=\
/export/home/bartoschek/software/lib:\
/usr/sfw/lib/sparcv9

Figure 2: Library Search Path

Afterwards you can just start the program with two arguments. The input data file and the number of threads to use.

Written by Christoph Bartoschek

References

[1] http://wikis.sun.com/display/WideFinder/Wide+Finder+Home

[2] http://wikis.sun.com/display/WideFinder/The+Benchmark

[3] http://wikis.sun.com/display/WideFinder/Results

[4] http://www.hoard.org/

[5] http://www.pcre.org

[6] http://www.boost.org/doc/libs/release/libs/regex/

[7] http://www.boost.org/doc/libs/release/doc/html/xpressive.html