How to Post Source Code on WordPress

Since most of the posts in this blog are programming related, I need to post source code from time to time. I’ve been trying to find a way to post source code so that it gets syntax highlighting and all the goodies.

I tried the “/code” tag in the WordPress editor, but it outputs in an awkward font.

The /Code button inserts <code> tags around the text. The font chosen by this theme is not legible.

#include <iostream>

int main()
{
    std::cout << "You can't read me anyway" << std::endl;
    return 0;
}

Use the Sourcecode Tag

So after some googling, I finally found a support page on WordPress. Apparently, all you have to do is wrap your source code in

[sourcecode language="???"] [/sourcecode]

Replace ??? with the language of your choice. For example, C++ will use “cpp”.  See the support page for the complete list.

#include <iostream>

int main()
{
    std::cout << "Ah, much better" << std::endl;
    return 0;
}


It is quite simple!

Prefer Simple Mutex Over Recursive Mutex

For those of us who work in multi-threaded environments, the concept of a mutex (“mutual exclusion”) should be no stranger. There are two common categories of mutexes – simple and recursive.

A simple mutex can only be locked once; if the same thread tries to lock the same mutex again, it will deadlock.

A recursive mutex can be locked by the same thread multiple times, and that thread must unlock it the same number of times before the critical region can be entered by another thread.

Recursive mutexes are common. In fact, mutexes are recursive by default on Windows.

With all that said, I believe recursive mutexes should be avoided as much as possible.

Obvious Overhead

Since a recursive mutex tracks more information than a simple mutex, it should be no surprise that it is more expensive.

Here is a sample test case with multiple threads accessing a critical region. The mutex implementation is from Boost.Thread library.

Red dotted line indicates the cost of a single simple mutex. Blue line indicates the cost of multiple recursive mutexes.

A single recursive mutex is quite a bit more expensive than a simple mutex. If a recursive mutex is used, it is probably going to be locked more than once by the same thread (otherwise, why use it?). The graph also demonstrates the cost of recursively locking the mutex again, up to five times. As you can see, the cost is linear, and it is definitely not free.

The term “recursive mutex” might sound fancy, but recursive locks are nothing but redundant locks. After the critical region is locked, each additional recursive lock adds nothing but overhead.

Some might argue that this cost is negligible. This might be true depending on the application, but this is just the surface of the problem.

The Real Problem

The truth is that locks and mutexes are difficult to use correctly. Until we have STM (software transactional memory), there is no silver bullet.

A recursive mutex makes locking easy by taking away the programmer’s responsibility for tracking mutex ownership. Programmers can add a recursive mutex to existing functions and member functions without refactoring code. This would not be possible with a simple mutex. However, this advantage is merely an illusion.

Don’t believe it? Here are the thoughts of David Butenhof, the person who introduced the recursive mutex into POSIX threads.

Here are some of his most powerful arguments:

1. Recursive mutexes allow you to lose track of your locking scheme and scope. As your call stack gets deeper, you have less sense of how long you’ve been holding the lock on the critical region.

2. Recursive mutexes are reference counted, and releasing one does not imply that the critical region is freed. So how do you know when it is safe to unlock?

I’ve learned a lot from his post. His arguments address the fundamental design issues in software that uses recursive mutexes.

To avoid using recursive mutex, we have to consider refactoring.


In functional programming, a typical refactoring technique to break up recursive locking is to split a function into two components – outer and inner. The outer function holds the lock and invokes the inner function to perform the task. The inner function is an internal function that is not exposed as an API.

In object-oriented programming, this technique can also be applied in the form of the Pimpl idiom or private functions.

Consider this example:

Class A has two public functions – One and Two – and One calls Two. Both methods are thread-safe: they both lock the same recursive mutex, which is a private member of class A. The example below demonstrates that the functional programming refactoring technique can be applied in OO programming.

A::One calls A::Two, they both lock on the same mutex member variable.

Refactor the implementation of A::Two into A::Two_Impl. A::One and A::Two now call A::Two_Impl. The recursive mutex becomes unnecessary and can be replaced with a simple mutex.
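In code, the refactored class might look like the sketch below. It uses std::mutex instead of the Boost classes, and the counter is just a stand-in body for whatever work A::Two actually does; the key point is that Two_Impl assumes its caller already holds the lock.

```cpp
#include <cassert>
#include <mutex>

class A {
public:
    void One() {
        std::lock_guard<std::mutex> guard(mutex_);
        Two_Impl(); // no re-lock, so a simple mutex suffices
    }
    void Two() {
        std::lock_guard<std::mutex> guard(mutex_);
        Two_Impl();
    }
    int calls() const { return calls_; }

private:
    // Internal helper: the caller must already hold mutex_.
    void Two_Impl() { ++calls_; }

    std::mutex mutex_;
    int calls_ = 0;
};

int main() {
    A a;
    a.One(); // locks once, then calls Two_Impl directly
    a.Two();
    assert(a.calls() == 2);
    return 0;
}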

Refactoring – A More Complicated Scenario

When more than one class is involved, things get more complicated, and the technique above will not hold.

Here’s a more complicated (yet realistic) example:

Class A has two public functions – One and Three. Both functions are thread-safe, and lock the same recursive mutex, which is a private member variable of A. Class B has one public function – Two.

A::One calls B::Two, and B::Two calls A::Three.

A more complicated scenario that involves more than one class.

Now we have a mess. The number of possible call paths grows exponentially as the call stack gets deeper.

In the example, there are three possible paths.

1. A::One, B::Two, A::Three

2. B::Two, A::Three

3. A::Three

If we apply the previous technique, we would refactor A::Three into A::Three and A::Three_Impl. That won’t work here, because when B::Two is called from A::One, class B would need A::Three_Impl to be part of A’s public interface.

There is no fix-all solution here, but here are some suggestions.

1. Refactor A and B so that they do not reference each other circularly. In other words, A can reference B, or B can reference A, but not both at the same time.

2. Merge A and B into one class, or move B::Two into class A.

Use Recursive Mutex If It Is Absolutely Positively Necessary

Unfortunately, in the real world, things get even more complicated. The example above involves only two classes. What if more classes are involved, the number of call paths grows exponentially, and refactoring becomes unrealistic?

As a student of computer science, I would argue that for every solution with a recursive mutex, there must be a solution with a simple mutex.

As a software engineer who is under the gun with a deadline… I say just hack it and use recursive mutex. We are shipping tomorrow! 🙂

But as David Butenhof said, the recursive mutex is a hack. Instead of relying on recursive mutexes up front, we should avoid them until they are absolutely necessary.


Recursive mutexes are dangerous because you lose your sense of locking scope, and they cost more than simple mutexes.

Consider refactoring the code to avoid the use of recursive mutexes.

Use a recursive mutex only if the code can’t be refactored economically.


The source and the spreadsheet can be downloaded here.

Tools: Visual Studio 2008, Boost 1.39, Windows XP SP3 32bit

Start Using Unordered Map

For years, the STL map has been the premier container choice for C++ programmers who need fast data insertion, look-up, and removal.

It uses a red-black tree, has elegant logarithmic properties, and is guaranteed to stay balanced.

Therefore, Map is used very often in our product. Almost too often…

Unordered map has long been proposed as part of TR1. Boost has a cross-platform implementation, and VC9.0 also supports the TR1 namespace.

It provides a powerful alternative to C++ programmers.

Penalty of Sorting

Besides fast data operations, map also provides data sorting. As a refresher from Data Structures 101, a balanced binary tree provides O(log n) add/remove/search operations. A red-black tree guarantees that the longest branch is no more than twice the length of the shortest branch. So it is mostly balanced, but some operations can take somewhat longer than others.

But what if sorting is not necessary? Why would we rebalance the tree on every add/remove operation? Can we do better than O(log n)?

TR1 Unordered Map is designed to tackle this problem.

C++ Hash Table

Most modern programming languages come with a standard hash table. For C++, well, it is still in a “draft” called Technical Report 1. As usual, the C++ standard committee spends most of its time on metaprogramming stuff and neglects the simple, useful stuff.

The bottom line is that unordered map is the C++ hash table. Its interface is very similar to map’s. In fact, if sorting is not a requirement, simply changing your container from map to unordered_map may be enough.

Unordered Map Could Be Right for You

Unordered map offers O(1) add/remove/search operations, assuming that the items are sparsely distributed among the buckets. In the worst case, all items are hashed to the same bucket, and every operation becomes linear.

The worst case doesn’t happen often if the hash function is doing its job. If sorting is unnecessary, unordered map is faster than map.

But talk is cheap; let’s see some benchmarks.

Speed Benchmark

To test the containers, we insert a set of uniformly distributed random numbers into each container, and then one by one find the inserted numbers.

First, let’s test the performance at small container sizes, under 50 elements. This is to find out whether either container has any overhead associated with initialization.

For small container size, unordered_map is slightly faster than map
Map search shows a nice O(log n) growth. Unordered_map grows much slower than logarithmic.

Next, we can try a much larger data set to see how the containers handle themselves.

Again, unordered_map insertion grows slower than map

This is the penalty of sorting. Unordered_map runs at O(1), and it is unaffected by the number of elements when the hash function is doing its job.

You can see that the search operation in unordered_map is blazing fast. But hash tables are more than meets the eye; the devil is in the details. We need further analysis before jumping to conclusions.

Looking A Bit Deeper

Unlike a balanced tree structure, hash tables are somewhat unpredictable in nature. If the hash function is poorly suited to the given data set, it can backfire and result in linear runtime. Because of this nasty behavior, some programmers avoid hash tables altogether.

There are numerous algorithms for optimizing a hash table to avoid the worst case. In unordered_map, there are two critical factors to consider – the max load factor and the hash function.

According to C++ TR1, the default max load factor for unordered_map should be 1.0, so if the average bucket size grows greater than 1, rehashing occurs. VC9.0 doesn’t follow the standard and uses 4.0 as the default max load factor.

When unordered_map grows beyond the allowed load factor, the number of buckets changes. In Boost, the implementers chose to grow the bucket count to the next prime number in a predefined list. In VC9.0, the bucket count increases by a factor of 8 on each rehash.

For the hash function, we can look at the integer hash function. Boost unordered_map follows closely Peter Dimov’s proposal in Section 6.18 of the TR1 issue list. The hash function is defined as

for (unsigned int i = length * size_t_bits; i > 0; i -= size_t_bits)
    seed ^= (std::size_t)(positive >> i) + (seed << 6) + (seed >> 2);
seed ^= (std::size_t) val + (seed << 6) + (seed >> 2);

VC9.0 uses a variant of an LCG (linear congruential generator), the Park-Miller random number generator.

Trying for the Worst Case

Now that we know a bit more of implementation details, let’s see how bad the worst case is.

To do this, we can try to hash 1,000,000 random numbers. Unlike before, the random numbers will be generated from non-uniform distributions – Gamma, Exponential, and Geometric. I particularly enjoy the Gamma distribution because it can easily be skewed with its shape and scale parameters.

There are two questions to find out from this test.

1. What is the maximum number of items hashed into a single bucket? In other words, what does the worst case look like?

2. Did the load factor of 1 cause more rehashing? As we know, rehashing is very expensive and should be done as rarely as possible.

Let’s see some results!

After 1,000,000 items inserted, it appears that the skewed data sample makes little difference to hash function.
The number of rehashes is 17 across all distributions.

Okay, I failed miserably in my attempt at the worst case. Choosing a skewed distribution made only a slight dent in the hash table. On average, the most overloaded bucket after 1,000,000 insertions holds only 7 items. In a balanced tree, log2(1,000,000) ≈ 20. The hash function is definitely doing its job.


If sorting is unnecessary, unordered_map provides a nice alternative to map. If the container doesn’t change often, the performance gain from unordered_map is even greater.

Don’t worry too much about the worst case of a hash table. The Boost implementation is rock-solid.

Avoid the VC9.0 TR1 unordered_map because it does not follow the standard. It is also a bit slower than Boost’s.


The focus of this article is on speed. It is also necessary to compare the memory usage of map and unordered map. I did some primitive testing and did not see a difference, but before jumping to conclusions, I need more time to think about the test setup. I will get to that when I have more time.


The source and the spreadsheets can be downloaded here.

Tools: Visual Studio 2008, Boost 1.39, Windows XP SP3