Tag Archives: parallel algorithms

Parallel String Histogram

Not long ago I’ve read the following article: Solving the Multicore Dilemma.

The article proposes an interesting approach to enabling automaic parallelization of high-level code. This approach is presented as an alternative to enforcing immutability of values (pure functional programming). We can think of it as (full) scatter-gather paradigm (while pure FP corresponds to the «gather» part).

As for me, I don’t think that this approach is really universally applicable, but it may be fruitful anyway even as a mindset for parallelization by hand.

The post you’re reading is devoted to a small special case, which was used as a simple example in the aforementioned article. It’s about getting the histogram of a string (I call it «counting codes» below). Imagine that the simple sequential algorithm is transformed into a scatter-gather counterpart. There each character codepoint has its own container where 1s are accumulated. I think that a sophisticated parallelizing compiler should find out that we don’t need real containers but just a bunch of simple counters.

Well, this way of reasoning may lead to a structure consisting of atomic counters and replacing pushing and summing with atomic increment (so no scatter-gather anymore). People doing parallel algorithms may guess that atomic counters being used so intensively will prevent us from gaining any performance increase (even with state-of-the-art hardware support for atomic operations) for this «parallel» code. More likely, it will be working slower than that simple sequential algorithm. I don’t claim that our would-be sophisticated compiler couldn’t find out the problem with atomic increment. But too clever compilers may produce quite unpredictable code…

Meanwhile, it seems logical to parallelize this algorithm just by providing each thread with it’s own array of counters and then summing the results. I found it to be interesting to measure real-life speed of these algorithms.

For simplicity I used 8-bit char, so I can use a plain C array of 256 integers fitting exactly in 1kb of RAM to store counters for all possible codepoints. Strings are generated using Mersenne twister mt19937 and uniform_int_distribution. The source code containing all the details is available via URLs given below.

Tested algorithms:

  • reference sequential: reference.h
  • atomic via OpenMP means: atomic.h
  • using std::atomic (a dirty hack, just for testing!) with OpenMP thread-group: std_atomic.h
  • parallel with per-thread counters arrays and no atomic operations: parallel.h

Other code: random_string.h, test.cpp. Archived MSVC project with compiled exe (x64) and the testing results: here.

Compiler/OS used: Microsoft Visual C++ 2012 (x64, default release settings + /GS- /Qpar) / Windows 7SP1 Pro x64.

Testing machines: AMD Phenom 2 x4 @ 2.1GHz and 3.2GHz with 2x2Gb DDR3-1333 RAM, Intel Celeron U3400 1.07GHz with 4Gb DDR3-800.

I tried std::atomic just to see «lock xadd» in assembly (MSVC transforms OpenMP atomic directive into «call _vcomp_atomic_add_i4» here, the procedure is hidden somewhere in the runtime). What’s interesting and counter-intuitive (at least for me), OpenMP variant stably works faster than std::atomic on Phenom and vice versa: std::atomic is quite faster on Celeron.

Both atomic variants work an order of magnitude SLOWER than simple sequential algorithm – a serious warning to everyone using atomics indiscriminately. Atomic operations prevent caching of counters and lock memory bus (therefore each iteration loads from and stores to DRAM an integer, and no matter how many threads try to do it, they do it serially).

«Normal» parallel variant works only twice faster on my quad-core machine (and almost no faster on the dualcore machine) than a sequential one not reaching even 1Gbps string reading speed @ 2.1GHz and hardly reaching 1.33Gbps @ 3.2GHz (in a perfect situation the counters array should reside in L1D, string parts should be prefetched to cores’ caches line by line effectively reaching RAM subsystem throughput limit). So, a field for some improvement obviously exists.

Performance scaling with string length and CPU frequency is almost linear. However doing small strings in parallel may lead to high overhead (threads starting and joining, possible context switches).

Intel Celeron U3400

AMD Phenom @ 3.2GHz

AMD Phenom @ 2.1GHz

UPDATE 2013.03.25.

The code was revised and published on Bitbucket.