Rvalue Reference Speeds Up Your Code

One of the most exciting features in C++0x, to me, is the Rvalue Reference. It is designed to eliminate spurious copying of objects and to solve the Forwarding Problem.

Rvalue Reference is a small technical extension to C++, but it can seriously improve the performance of existing C++ applications for free or with minimal changes.

GCC 4.3 and beyond have Rvalue Reference support. All it requires is the compiler option “-std=gnu++0x”, which enables all of the supported experimental features.

The C++ committee has a nice introduction to Rvalue Reference.

Too Many Copies

Rvalue Reference by itself isn’t too useful to everyday C++ programmers. Chances are the library writers will do the bulk of the work optimizing their containers, and it will be mostly invisible to the end user. Nevertheless, the end user needs to supply two functions to take advantage of this optimization: the Move Copy Constructor and the Move Assignment Operator.

Consider a POD-style class that holds an index value for sorting and a buffer that holds some data. It has an overloaded copy constructor and assignment operator to keep track of the number of times the object has been copied. Since it is designed to be used in a standard container, I will also supply a sorting predicate.

#include <vector>

// global static variable to keep track of the number
// of times Pod has been copied.
static int Copies = 0;
// Plain Old Data
class Pod
{
public:
	// constructor
	Pod() : m_index(0)
	{
		m_buffer.resize(1000); // holds 1k of data
	}
	// copy constructor counts every deep copy
	Pod(Pod const & pod)
		: m_index(pod.m_index)
		, m_buffer(pod.m_buffer)
	{
		++Copies;
	}
	// not the best operator=, for demo purposes
	Pod &operator=(Pod const &pod)
	{
		m_index = pod.m_index;
		m_buffer = pod.m_buffer;
		++Copies;
		return *this;
	}
	int m_index; // index used for sorting
	std::vector<unsigned char> m_buffer; // some buffer
};
// Sorting Predicate for the POD
struct PodGreaterThan
{
	bool operator() (Pod const & lhs, Pod const & rhs) const
	{
		return lhs.m_index > rhs.m_index;
	}
};

Now we would like to put this data in a vector and sort it.

#include <algorithm>
#include <tr1/random>

std::vector<Pod> manyPods(1000000); // a million Pods
std::tr1::mt19937 eng;
std::tr1::uniform_int<int> unif(1, 0x7fffffff);

// assign a random integer to each Pod's m_index
for (std::size_t i = 0; i < manyPods.size(); ++i)
	manyPods[i].m_index = unif(eng);

// sort the million Pods
std::sort(manyPods.begin(), manyPods.end(), PodGreaterThan());
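
To see how much copying happens, we can print the global counter after the sort (this assumes <iostream> has been included):

// print the global copy counter after sorting
std::cout << "Pod was copied " << Copies << " times" << std::endl;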

If you run this code under GCC 4.3.4, you will find that Pod has been deep copied 17,362,106 times during the course of std::sort. That’s quite a lot of copying! Let’s see what Rvalue Reference can do for us.

Move Copy Constructor and Move Assignment Operator

Many of the copies in the previous example likely involve temporary objects. Regardless of the sorting algorithm, it will spend much of its time swapping pairs of items, and the standard swap logic involves a temporary copy with a very short lifespan. Rvalue Reference can be used to minimize this overhead.
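
To see why, here is a minimal sketch of a classic swap (not GCC’s actual code): it performs one full deep copy plus two deep assignments, even though the temporary dies right away.

// classic swap: three deep copies for one exchange
template <typename T>
void ClassicSwap(T &a, T &b)
{
	T tmp(a); // deep copy #1
	a = b;    // deep copy #2
	b = tmp;  // deep copy #3; tmp is destroyed right after
}

With Rvalue References, each of those three steps can steal the buffer via std::move instead of duplicating it, which is exactly what the C++0x std::swap does.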

We follow Dave Abrahams’ guideline on move semantics:

A move assignment operator “steals” the value of its argument, leaving that argument in a destructible and assignable state, and preserves any user-visible side-effects on the left-hand-side.

We want to apply move semantics to temporary objects because they are short-lived and their side-effects are not a concern.

So let’s supply a Move Copy Constructor and Move Assignment Operator to the class Pod.

// move copy constructor
Pod(Pod && pod)
{
	m_index = pod.m_index;
	// std::move (from <utility>) hints the library to "move" the vector if possible
	m_buffer = std::move(pod.m_buffer);
}
// move assignment operator
Pod & operator=(Pod && pod)
{
	m_index = pod.m_index;
	// std::move (from <utility>) hints the library to "move" the vector if possible
	m_buffer = std::move(pod.m_buffer);
	return *this;
}

These methods look just like the typical deep copy, except that we invoke std::move on the buffer. std::move might look like magic, but it works just like a static_cast from std::vector<unsigned char>& to std::vector<unsigned char>&&. This is the crucial point of Rvalue Reference: by casting the vector to vector&&, the vector's Move functions are invoked automatically, if they exist. If the Move functions do not exist, it falls back to the ordinary non-move functions (deep copy). This is the beauty of the Rvalue Reference overloading rules.
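
In fact, a minimal std::move can be written in a few lines. This is a simplified sketch of what library implementations do (the name move_sketch is mine, to avoid clashing with the real thing):

#include <type_traits> // for std::remove_reference

// no copying happens here; this is purely a cast to rvalue reference
template <typename T>
typename std::remove_reference<T>::type &&
move_sketch(T && t)
{
	return static_cast<typename std::remove_reference<T>::type &&>(t);
}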

The GCC 4.3.4 vector is “Move Ready”. Here is its Move Assignment Operator.

#ifdef __GXX_EXPERIMENTAL_CXX0X__
vector&
operator=(vector&& __x)
{
	// NB: DR 675.
	this->clear();
	this->swap(__x);
	return *this;
}
#endif

On a side note, line 6 of the snippet (the this->clear() call) is very interesting because it gets around a bug caused by a delay in the deletion of the temporary object; that is what the “NB: DR 675” comment refers to.

Speed Up after the Move Upgrade

Now we run the same code again after supplying the Move Copy Constructor and Move Assignment Operator in Pod.

The number of copies drops by two-thirds.

std::sort runs roughly twice as fast as before.

By simply supplying a Move Constructor and a Move Assignment Operator to Pod, the sort routine runs twice as fast. That also implies that we had been wasting half the time in the sort routine deep copying temporary objects around.

Conclusion

Rvalue Reference will provide a nice performance gain to users of STL containers.

For the end user, all that’s required is a Move Constructor and a Move Assignment Operator.

Thoughts

It would be even nicer if compilers could supply default Move functions for all objects. It appears that this has been proposed. I look forward to seeing its progress.

Source

The source can be downloaded here.

Tools: GCC 4.3.4, Cygwin 1.7, Windows 7 64-bit

Memory Alignment Problems

Memory alignment is the way data types are arranged and accessed in memory. There are numerous tutorials on the internet that cover the topic in depth. This is not one of them. This is just a post that gathers my thoughts on how little I knew about the topic. 🙂

I read Write Portable Code a while ago, and there are some recommended practices from it that I follow to avoid alignment issues. Among them: avoid memory overlays and bit-fields.

By avoiding those features, I don’t deal with memory alignment issues very often. But at the same time, I never had to understand those memory alignment issues in the first place.

In the past several weeks, I have worked with a low-level programmer who uses those features. After working through the alignment bugs we encountered, I feel like I need to take Programming 101 again.

Bus Error on Solaris

My co-worker was developing cross-platform software in C that receives and routes data from the software I developed (vague on purpose). We integrated our code, and everything worked fine on the Linux system. Then he ported the code over to a Solaris machine, and it quickly gave him a “Bus Error”.

// code has been drastically simplified for demonstration purposes

// some byte array
char p[5000];

// ...many lines later

// extract a 4-byte integer from the buffer;
// de-referencing this pointer gives a Bus Error on Solaris
int *intData = (int *) p;
int fourByteInteger = *intData;

It smelled like an alignment issue: de-referencing intData probably caused a misaligned memory access. We didn’t know the exact details, but my co-worker changed it to memcpy (one of the recommended practices from Write Portable Code), and everything was happy.
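
For illustration, here is a sketch of the memcpy version; memcpy copies byte by byte, so it never performs a misaligned word access.

#include <string.h> // for memcpy

int fourByteInteger;
// safe regardless of p's alignment
memcpy(&fourByteInteger, p, sizeof(fourByteInteger));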

So it was an alignment issue. But this leaves a bigger question: why did it work on the Linux system?

Unaligned Memory Access on Intel Architecture

The Linux system uses an Intel processor, and the Solaris system uses a SPARC processor.

It turns out that the Intel architecture allows unaligned memory access, with a small performance penalty. I’ve been sheltered under the Intel architecture for so long that I took the unaligned memory access “feature” for granted.

So this leads to two more questions: how large is the penalty, and how can unaligned memory access be detected?

Finding out the penalty isn’t a difficult task. You can force an unaligned access by casting a byte pointer to a pointer of a larger type. Here is some pseudo-code.

#include <stdint.h> // for uint32_t

// loop through the entire array by assuming that the void
// pointer is of type T
template<typename T>
void LoopThroughData( void *data, uint32_t size )
{
	T *dataT = (T*) data;
	T *dataTEnd = dataT + size/sizeof(T);

	while( dataT != dataTEnd )
	{
		(*dataT) *= 2;
		dataT++;
	}
}
...
char buffer[2000];
const int someOffset = 1; // a non-zero offset forces misaligned access
char *bufferWithOffset = buffer + someOffset;
// loop through the array by assuming that it is an integer array;
// the size is reduced by the offset so we never run past the buffer
LoopThroughData<int>((void *)bufferWithOffset, sizeof(buffer) - someOffset);
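
For completeness, here is a hypothetical driver along those lines (the iteration count and the coarse clock() timing are my own choices, not the original benchmark):

#include <ctime>
#include <cstdio>

int main()
{
	static char buffer[2000];
	const int iterations = 100000;
	// offset 0 is aligned; offsets 1 through 3 force misaligned 32-bit access
	for (int offset = 0; offset < 4; ++offset)
	{
		std::clock_t start = std::clock();
		for (int i = 0; i < iterations; ++i)
			LoopThroughData<int>(buffer + offset, sizeof(buffer) - offset);
		std::printf("offset %d: %ld clock ticks\n",
			offset, (long)(std::clock() - start));
	}
	return 0;
}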

Here are two plots showing the penalty of unaligned access on the Intel architecture.

Plot: converting a byte array to a 32-bit integer array at different offset values.
Plot: converting a byte array to a 64-bit integer array at different offset values.

The plots show a 2% performance penalty for unaligned 32-bit integer access and a 3.6% penalty for unaligned 64-bit integer access.
I am using an Intel i5-750 processor; the penalty ratio is likely to differ across the Intel processor family.

Defense Against Unaligned Memory Access

Regardless of architecture, we should avoid unaligned memory access. If you can’t use memcpy for some performance reason, there is a compiler-specific operator that can help detect unaligned memory access.

In Visual Studio, there is __alignof, which returns the alignment requirement of a given type. In GCC, the equivalent is __alignof__. With this tool, I wrote a small C++ routine that determines whether a given pointer meets its alignment requirement.

#include <stdint.h> // for uintptr_t

template <typename T>
bool CheckIfDataIsAligned(T *p)
{
	// the address must be a multiple of the type's alignment requirement
	return ((uintptr_t)p % __alignof(T)) == 0;
}

If your compiler does not support any variant of alignof, there is a clever solution that implements it in terms of offsetof, sketched below.
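
The idea is that the compiler must place a T member that follows a single char at the first offset satisfying T’s alignment, so that offset is the alignment requirement. A sketch (the names are mine; real implementations add compiler-specific tweaks):

#include <stddef.h> // for offsetof

template <typename T>
struct AlignmentOf
{
	struct Probe { char c; T member; };
	// member lands at the first multiple of T's alignment after the char
	enum { value = offsetof(Probe, member) };
};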

Bit-field Padding Problem

Another problem I encountered recently is data structure padding. My co-worker defined bit-fields to extract messages from devices.

// code simplified for demonstration purposes.
#include <stdint.h> // for the fixed-width integer types

// method 1
struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};
// method 2
struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};

I am simplifying the code here; the real story is a bit more complicated. There are many messages defined, and he was using the sizes of the messages to determine the offsets to read from a large buffer. He found that if he used method 1 for his bit-fields, things were completely out of sync. If he used method 2, everything worked.

If you apply the sizeof operator to both objects, the object defined with method 1 will be bigger than the one defined with method 2. This is because compilers tend to pad a structure out to a multiple of the largest member's alignment value. In the case of method 1, the largest member type is uint32_t, so the 6 bytes of data are rounded up to 8, leaving 2 bytes of padding at the end of the structure.
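
You can see the difference directly; the sizes below are what I expect from these definitions on a typical compiler:

#include <cstdio>

int main()
{
	// method 1 is padded out to a multiple of 4 (uint32_t alignment)
	std::printf("sizeof(BitFields)  = %u\n", (unsigned)sizeof(BitFields));  // 8
	// method 2's largest member is uint16_t, so no trailing padding is needed
	std::printf("sizeof(BitFields2) = %u\n", (unsigned)sizeof(BitFields2)); // 6
	return 0;
}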

Defense Against Bit-field Padding

Regardless of how well I understand bit-fields, mistakes can always be made. I came up with two personal guidelines to follow the next time I define a bit-field.

1. Use unnamed bit-fields if padding is intended.

2. Use a Static Assert to validate structure sizes to guard against unintended padding.

#include <stdint.h>
#include <boost/static_assert.hpp>

struct BitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
	uint8_t : 8; // use unnamed bit-field if padding is intentional
	uint8_t : 8; // use unnamed bit-field if padding is intentional
};
// static assertion to guard against unintended padding
BOOST_STATIC_ASSERT(sizeof(BitFields) == 8);

struct BitFields2
{
	uint16_t a;
	uint16_t b;
	uint8_t c;
	uint8_t d;
};
// static assertion to guard against unintended padding
BOOST_STATIC_ASSERT(sizeof(BitFields2) == 6);

On a side note, I know that #pragma pack(n) can also remove padding. But #pragma pack(n) only gives the programmer partial control over a structure’s alignment: it caps each member’s alignment at n, and the compiler can still use an alignment smaller than n when a member’s natural alignment is smaller.
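
For example, here is a sketch of a packed variant of method 1 (my own example, not from the original code): the trailing padding disappears, but access to the members may now be misaligned.

#pragma pack(push, 1)
struct PackedBitFields
{
	uint32_t a : 16;
	uint32_t b : 16;
	uint8_t c;
	uint8_t d;
};
#pragma pack(pop)
// with member alignment capped at 1, the structure shrinks from 8 to 6 bytes
BOOST_STATIC_ASSERT(sizeof(PackedBitFields) == 6);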

Source

The source and spreadsheet can be downloaded here.

Compiler: Visual Studio 2008

Machine Specification: Intel i5-750, Windows 7 64-bit.