Using Visual Studio PGO as a Profiler

Last year, when our software was running into performance issues, I was desperately looking for a profiler for a large native C++ application. In the past, I had tried Rational Purify and DevPartner, and they just could not handle our application (or our machine could not handle the profiler).

So I came across Visual Studio’s Profile Guided Optimization (PGO). In a nutshell, the VS compiler uses PGO to optimize the software based on real-world scenarios, as opposed to traditional static analysis. As you would expect, it consists of three phases – Instrumentation, Training, and PG Optimization.

It turns out that PGO generates useful profile data during the Training phase. With this profile data, PGO can be used as a lightweight native C++ profiler that provides pretty good code coverage.

The Instructions

PGO is supported in VC8.0 and later. I have tried it on VC9.0 and VC10.0, and the instructions were identical.

This assumes your software is written in native C/C++ and can be compiled with Visual Studio.

1. Click Build -> Profile Guided Optimization -> Instrument.

2. Click Build -> Profile Guided Optimization -> Run Instrumented/Optimized Application. You will need to exercise the region of the software that you would like to profile. The longer you run it, the more accurate the profile data will be, as it averages out the startup overhead.

3. Exit your software. In the folder of your executable (the release folder), you should see a xxx.pgd file and a xxx.pgc file. The .pgd file is your profile database that holds all your methods, and the .pgc file is the profiling data recorded during the software run.

4. Now open up your Visual Studio Command Prompt. You will probably find it in Start -> Programs -> Microsoft Visual Studio (version) -> Visual Studio Tools.

5. Go to the release folder of your executable. In this step, you need to merge the software run with the profile database. Type pgomgr /merge xxx.pgc xxx.pgd.

6. Once you have merged it, you can use pgomgr to generate a summary of your software run. To do this, type pgomgr /summary xxx.pgd. I recommend piping the output to a text file.

7. The summary file should include the code coverage analysis from your software run.

The summary provides simple, yet very powerful data on the behavior of your software. It gives you an idea of where the hotspots are, and what to optimize.

To find out more about the summary (including the /detail summary), see Kang Su’s blog on “Cracking Profile-Guided Optimization profile data with PGOMGR”.

Thoughts

Keep in mind that the optimization level of the instrumented build is toned down dramatically. Therefore, the results might not reflect the actual performance in the release build.

In my experience, the instrumented build runs faster than a debug build.

PGO can only instrument DLLs and executables. It cannot instrument static libraries.

I have attempted to use PGO to optimize our software. It didn’t turn out too well. Either my machine ran out of memory (4 GB), or the PGO’ed executable didn’t behave properly.

Troublesome Locks

At my company, we develop a large piece of software with fairly high concurrency. In general, asynchronous message passing is the choice for thread communication because of its simplicity. When performance matters, we use locks and mutexes to share data among threads.

As the software grows, subtle bugs from deadlocks and race conditions are appearing around code that was designed with locks and mutexes.

I can say confidently that almost all of these bugs are caused by poor interface design or poor locking granularity. But I also feel that a well-designed thread-safe interface is extremely difficult to achieve.

Always Chasing The Perfect Granularity

A thread-safe interface requires far more than just putting a lock on every member function. Consider the supposedly thread-safe queue below: is this class really useful in a multi-threaded environment?

// thread-safe queue that has a scoped_lock guarding every member function
template <typename T>
class ts_queue
{
public:
	T& front();
	void push_front(T const &);
	void pop_front();
	bool empty() const;
};

Unfortunately, no. Here’s a simple piece of code where it falls apart. Say f() is called by multiple threads: even if the empty() check returns false, there is no guarantee that front() and pop_front() will succeed.

Even if front() succeeds, there is no guarantee that pop_front() will succeed. The ts_queue interface is subject to race conditions, and there is nothing you can do about it without an interface change.

//f may be called by multiple threads
void f(ts_queue<int> &tsq)
{
	if(tsq.empty() == false)
	{
		int i = tsq.front(); // this might fail even if empty() returned false
		tsq.pop_front(); // this might fail even if front() succeeded
	}
}

The problem is that mutexes are not composable. You can’t just call one thread-safe function after another and expect the combination to be thread-safe. In the ts_queue example, the interface granularity is too fine to perform the task required by f().
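One common fix is to widen the interface so that the check and the pop happen under a single lock. Here is a minimal sketch of that idea; the class name, the try_pop_front method, and the use of C++11’s std::mutex are my own choices for illustration, not part of the original ts_queue:

```cpp
#include <deque>
#include <mutex>

// Sketch: the empty-check, read, and pop are combined into one
// member function, so they execute atomically under one lock.
template <typename T>
class ts_queue2
{
public:
	void push_front(T const &v)
	{
		std::lock_guard<std::mutex> lock(m_);
		q_.push_front(v);
	}

	// Returns false if the queue was empty, instead of letting the
	// caller race between empty(), front(), and pop_front().
	bool try_pop_front(T &out)
	{
		std::lock_guard<std::mutex> lock(m_);
		if (q_.empty())
			return false;
		out = q_.front();
		q_.pop_front();
		return true;
	}

private:
	std::mutex m_;
	std::deque<T> q_;
};
```

With this interface, f() collapses to a single call — `int i; if (tsq.try_pop_front(i)) { /* use i */ }` — and the race disappears because the caller never observes an intermediate state.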

On the other hand, it would be easy to use a single global mutex to lock down the entire ts_queue class. But when the granularity is too coarse, you lose performance to excessive locking. In that case, multi-core processors will never be fully utilized.

Some programmers try to solve it by exposing the internal mutex in the interface, and punt the granularity problem to their users. But if the designers can’t solve the problem, is it fair to expect their users to solve it? Also, do you really want to expose concurrency?

So in the process of designing a thread-safe class, I feel that I am spending the majority of the time chasing the perfect granularity.

Full of Guidelines

In addition to the problem of granularity, you need to follow many guidelines to maintain a well designed thread-safe interface.

  • You can’t pass references/pointers to member variables out through the external interface, whether through a return value or an out parameter. Any time a member variable is passed out, it could be cached by the caller and thereby lose the mutex protection.
  • You can’t pass references/pointers to member variables into functions that you don’t control. The reason is the same as before, except this is far trickier to enforce.
  • You can’t call any unknown code while holding onto a lock. That includes library functions, plug-ins, callbacks, virtual functions, and more.
  • You must always remember to lock mutexes in a fixed order to avoid ABBA deadlocks.
  • There are more…
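The lock-ordering rule in particular is easy to get wrong by hand. C++11’s std::lock can acquire several mutexes at once without deadlocking, regardless of argument order. A small sketch below — the Account type and transfer function are hypothetical, invented only to illustrate the technique:

```cpp
#include <mutex>

// Hypothetical type: each account carries its own mutex.
struct Account
{
	std::mutex m;
	int balance = 0;
};

// transfer(a, b) and transfer(b, a) can run concurrently without
// deadlock: std::lock acquires both mutexes using a deadlock-avoidance
// algorithm, and adopt_lock hands ownership to the RAII guards.
void transfer(Account &from, Account &to, int amount)
{
	std::lock(from.m, to.m);
	std::lock_guard<std::mutex> g1(from.m, std::adopt_lock);
	std::lock_guard<std::mutex> g2(to.m, std::adopt_lock);
	from.balance -= amount;
	to.balance += amount;
}
```

Without std::lock, two threads calling transfer(a, b) and transfer(b, a) could each grab one mutex and wait forever on the other — the classic ABBA deadlock.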

And if you are programming in C++, you need to watch out for all of the above while wrestling with the most engineered programming language in the world. 🙂

Final Thoughts

Locks and mutexes are just too complex for mortals to master.

And until C++0x defines a memory model for C++, lock-free programming is not even worth trying.

I will stop complaining now.