I came across an interesting Microsoft Support article on heap performance counters. Apparently there is a registry setting that enables heap counters on Perfmon. This allows users to profile various aspect of heaps in a process.
Perfmon.exe displays these counters when the following registry key is set:
One of the counter that caught my attention is Heap Lock Contention, which is the number of collisions per sec on the heap lock. I learned of heap contention awhile ago from Windows via C/C++, but I have never been able measure it.
In 2009, I wrote some test code to benchmark Low Fragmentation Heap (LFH). Recall that the original test is single-threaded program that randomly allocates and deallocates various size buffers a number of times.
With minor touch-ups, I customized the test code to run with two threads in parallel. So I kicked off the modified test and added a Heap Lock Contention counter on the main process heap.
The lock contention counter gathered some very interesting results. The test program with default allocator generated about 15 collision per second on the heap lock.
I re-ran the test program to use LFH allocator (switchable through a command line argument). The LFH allocator results in 50% less contention compare to the default allocator in Window XP.
I could not get this counter to work properly under Window 7. Microsoft mentioned that only Windows Server 2003, Windows Vista, and Windows Server 2008 are enhanced.
If heap lock contention is a problem, Windows via C/C++ recommends to create a separate heap for allocation intensive classes with a custom new/delete operator.
LFH outperforms the default allocator under Window XP. The heap contention counter confirms my original test result in 2009.
Tools: Visual Studio 2008 (VC9), Boost 1.45, Window XP SP3 (32 bit)
Recently, I have been playing around with reader-writer (RW) locks. I have never encountered RW locks in practice, but I have read that they could be inefficient in practice, and often results in more harm than good.
Recall that traditional mutex ensures that only one thread may enter a critical region. But if the critical region is being written infrequently, it is possible to exploit this concurrency by allowing multiple reader with RW locks.
So when exactly should RW lock be used in place of traditional mutex? To answer this question, I wrote a benchmark program to understand the scalability of RW locks.
Boost shared_mutex Benchmark
Since C++ is my primary programming language at work, I started by picking on shared_mutex of the Boost threading library.
In my benchmark, I focus primarily on two variables – the writer frequency, and the hold time of the mutex.
For implementation, there are 4 worker threads (for my quad-core CPU) working with a critical region that approximate e. At each iteration, one of the threads has a certain probability to become a writer. My goal is to see the performance change as the writing frequency increases.
And to control the hold time of the mutex, each thread will performs a certain number of iterations called E. As E becomes larger, the hold time of the mutex increases.
At E = 1, even when there are zero contention, the overhead completely wipes out any performance gain of the concurrent readers.
At E = 50, the longer hold time pays off slightly under low contention. However, the performance degrades rapidly as contention increases.
As you can see, the results are very disappointing. Boost shared_mutex only offers performance gain under extremely low contention with large hold time. The large hold time is unrealistic in practice because most programmers are taught to minimize their critical region.
I was curious to see if SRW performs any better, so I added SRW into my benchmark.
Although SRW offers similar scalability compare to boost shared_mutex, it has lower overhead, outperforms boost shared_mutex in almost all cases.
After looking into the implementation of boost shared_mutex, I realize that its lock-free algorithm is complex and tracks many states. This implementation has so much overhead that it is impractical.
SRW offers has far lower overhead, and can be useful under low contention. Unfortunately, it is only available for Vista and beyond.
Neither mutex type offer real performance advantage when contention goes beyond 2%. Somehow, I speculate that Amdahl’s Law is playing a part here. The chart looks very much like the inverse of speedup graph I plotted last year.