## Some Plots On Amdahl’s Law

The typical lifespan of software is much longer than that of hardware. Nowadays it is important to consider a program’s scalability across multiple processors.

Amdahl’s Law is a famous equation that predicts the maximum theoretical speedup with additional processors.

The equation is simple to derive. It yields powerful yet pessimistic results.

I plotted the equation with some practical input to remind myself how poorly my code will scale in the future. 🙂

### Equation

The idea is simple. Consider a computation that has three parts: setup, compute and finalize. The setup and finalize sections cannot be performed concurrently, but the compute section can be parallelized across multiple processors.

The total running time with P processors is the sum of the three parts: T(P) = t_setup + t_compute/P + t_finalize. An important measurement is the relative speedup, S(P) = T(1) / T(P), which tells us how much faster the program runs with additional processors. In a perfect world, if all parts of the computation could be performed concurrently, the speedup would be perfectly linear: S(P) = P for any number of processors. This is idealistic and highly unlikely in practice.

To quantify the burden of the parts that cannot be parallelized, we first denote by ɣ the fraction of the total running time spent in the serial parts: ɣ = (t_setup + t_finalize) / T(1). Rewriting S(P) in terms of ɣ gives Amdahl’s Law: S(P) = 1 / (ɣ + (1 − ɣ)/P). This tells us that even as the number of processors grows to infinity, the speedup is still bounded by 1/ɣ. Efficiency, another metric, is computed by normalizing S(P) by the number of processors: E(P) = S(P) / P.

### Plots

This plot shows the speedup for different ɣ values in the 2-, 4-, 8- and 16-core scenarios. From the graph, when ɣ is 50%, meaning that 50% of the work cannot be parallelized, increasing the number of processors is nearly irrelevant.

Efficiency is a metric that tells you whether you are getting enough “bang for the buck” out of additional processors. As expected, the efficiency curve drops more steeply as more cores are involved.
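
As a sanity check on those curves, here is a minimal C++ sketch (my addition, not part of the original post) that tabulates S(P) and E(P); the ɣ values other than 50% are illustrative guesses.

```
#include <cstdio>

// Tabulate Amdahl's speedup S(P) = 1 / (g + (1 - g) / P) and
// efficiency E(P) = S(P) / P for a few serial fractions g.
int main()
{
    const double gammas[4] = { 0.05, 0.10, 0.25, 0.50 }; // illustrative values
    const int procs[4] = { 2, 4, 8, 16 };                // core counts from the plots
    for(int g = 0; g < 4; ++g)
    {
        for(int p = 0; p < 4; ++p)
        {
            double s = 1.0 / (gammas[g] + (1.0 - gammas[g]) / procs[p]);
            std::printf("gamma=%.2f P=%2d S(P)=%5.2f E(P)=%4.2f\n",
                        gammas[g], procs[p], s, s / procs[p]);
        }
    }
    return 0;
}
```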

### Pitfall

Amdahl’s Law assumes that the problem size stays fixed, which is not always true. For example, if an application performs weather simulation, an extra processor is likely to be used to add detail to the model while keeping the total execution time constant. In such cases, Amdahl’s Law is more pessimistic than necessary.

This is a strong hint at how we can tackle the scalability problem in the future.

### Conclusion

Pay attention to ɣ. If ɣ goes past 50%, the scalability outlook of the program is very poor.

Efficiency is important because it shows the law of diminishing returns as more processors are added.

## Auto Type Deduction At A Glance

One of the new features in C++0x is auto type deduction. Two keywords are introduced in the new standard – auto and decltype.

This feature allows the type of a variable to be deduced from its initializer. Here is an example straight from n1984.

```
// x has type double, not sure why I would do this though.
auto x = 3.14;
// y has the type of x, which is also double. Now this is just confusing.
decltype(x) y = 3.14;
```

Simple concept, yet confusing. Yup, sounds like C++ alright :).

Since I have GCC 4.4 installed on Ubuntu, I spent some time playing around to get a feel for it.

### Reducing Verbosity

C++ is a statically-typed language, unlike dynamically-typed languages such as Ruby or Python. Types in C++, especially with generic programming, are often defined within the scope of another class or namespace, which quickly becomes very verbose.

```
std::vector<int> v(100,0);
std::vector<int>::iterator vi = v.begin();   // very verbose
std::vector<int>::iterator vend = v.end();   // too much typing
while(vi != vend)
{
    std::cout << *vi << std::endl;
    ++vi;
}
```

With the keyword auto, declaration becomes simple.

```
std::vector<int> v(100,0);
auto vi = v.begin(), vend = v.end(); // multi-declarator from n1737
while(vi != vend)
{
    std::cout << *vi << std::endl;
    ++vi;
}
```

### Auto May Fail in RTTI

I am convinced that every feature in C++ comes with bullets to blow your legs away, and this is no exception.

The compiler deduces the type of auto at compile time, but what about run time? Well, it might not work very well, depending on your scenario.

Here’s an example that can cause problems.

```
class Base
{
public:
    virtual void f() {}
};
class Derived : public Base
{
public:
    virtual void f() {}
};
void Oops()
{
    Base * d = new Derived;
    // ...
    auto b = *d; // b has type Base, therefore unintentional slicing occurs
    b.f(); // calls the wrong function
}
```
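
For contrast, a minimal counter-sketch (my addition): deducing a pointer, or binding a reference, preserves the dynamic type, so the virtual call dispatches correctly.

```
void Better()
{
    Base * d = new Derived;
    auto b = d;    // b is deduced as Base*, no slicing
    b->f();        // virtual dispatch: calls Derived::f
    Base & r = *d; // a reference also preserves the dynamic type
    r.f();         // calls Derived::f as well
    delete d;
}
```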

### Imposter Problem

When using the auto keyword, there is another interesting side effect that might lead to undesired results.

Consider this example, where a library exports Object and CreateObject() and is included by another program. The compiler deduces the type of the variable o to be Object.

```
// including an external library
class Object { public: void f(){}; };
Object CreateObject() { return Object(); }
...
// user code
void DoStuff()
{
    auto o = CreateObject(); // o should be Object
    o.f();
}
```

If a programmer accidentally compiles against another library that exports a similar interface, the user code will still compile without any warnings or errors.

```
// oops, included the wrong library
class Imposter { public: void f(){}; };
Imposter CreateObject() { return Imposter(); }
...
// compiler is perfectly happy in user code
void DoStuff()
{
    auto o = CreateObject(); // o is now Imposter
    o.f();
}
```

Some might argue that this is a “feature”. I disagree: if I wanted this behavior, I would have made DoStuff() a generic function, or declared its parameter as decltype(CreateObject()) (more on that later).

What we have here is a non-generic function with generic behavior. I do not feel comfortable using auto in this case.
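
As a sketch of that alternative (my own illustration, not from the original post), the dependency on CreateObject() can be spelled out right in the signature:

```
// The parameter type is literally "whatever CreateObject() returns", so the
// generic behavior is visible in the declaration instead of hidden by auto.
void DoStuffExplicit(decltype(CreateObject()) o)
{
    o.f();
}
```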

### Trailing-Return Type

C++0x allows for an alternative syntax to declare functions, with trailing-return-type.

Here is an example for a function that returns an integer.

```
auto func() -> int
{
    return 5;
}
```

The example here seems silly because this syntax is really designed for generic programming.

Now consider another example where a generic function GenericMerge() merges two classes into a third class.

```
class A{}; class B{}; class C{};
// merge function that creates a C out of an A and a B
C Merge(A const &a, B const &b) { return C(); }
// a generic function that merges any two types
template<typename Lhs, typename Rhs>
auto GenericMerge(Lhs const &lhs, Rhs const &rhs)
    -> decltype(Merge(lhs, rhs))
{
    return Merge(lhs, rhs);
}
```

In C++03 this would not have been possible, because at the point where the return type is declared, the compiler does not yet know that C is the result of Merge(). With a trailing-return-type, the return type is deduced from the type of Merge(lhs, rhs).

This is a very powerful upgrade to generic programming in C++.

Sidenote: expressions inside decltype are never evaluated at runtime.
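
A quick illustration of the sidenote (my example): the operand of decltype is an unevaluated context, so the function below does not even need a definition.

```
int f(); // declared, but never defined anywhere
decltype(f()) x = 5; // fine: f() is not called, x simply has type int
```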

### Conclusion

auto will help to reduce the verbosity of the code. At the same time, it may lead to more confusing code.

Like all C++ features, auto comes with bullets to blow your legs away.

Trailing-return-type gives a serious power-up to generic programming.

### Tools

GCC 4.4 with Eclipse, Ubuntu 9.10

## Pipeline Concurrency Pattern

I was reading a book called Patterns for Parallel Programming during my flight to Hawaii. It was a long flight, and I couldn’t sleep.

The authors’ target audience seems to be the academic world. The content of the book is very dry: too much text and formulas, and not enough pictures. (And I like pictures!)

Anyway, in the design space section, the book briefly covered the Pipeline Pattern. Although the chapter is short, the implementation is very elegant, and the architecture is beautiful.

Here I attempt my own C++ implementation following the spirit of the Pipeline Pattern.

### The Basic Idea

The basic idea of the Pipeline Pattern is similar to a processor instruction pipeline. A pipeline holds a number of stages. Each stage performs some operation and passes its result to the next stage. Since all stages execute concurrently, concurrency is achieved as the pipeline fills up. Let C(i) denote a unit of work. First, C(1) goes through stage 1. When it completes, C(2) enters stage 1 while C(1) moves on to stage 2, and so on.

Expressed in a high-level programming language, a stage simply grabs a work unit from the previous stage, performs its operation, and outputs the result to the next stage.

```
// pseudo-code for a pipeline stage
while(more incoming work units)
{
    receive work unit from the previous stage
    perform operation on the work unit
    send work unit to the next stage
}
```

Since I am implementing the pattern in an OO fashion, each stage is expressed as a class. A pipeline, which consists of many stages, is a class that holds an array of stages.

Finally, since stages communicate with each other by passing data, we need another data structure that supports this behavior.

The authors chose a Blocking Queue to connect the stages. Each stage holds an in-queue and an out-queue pointer. The in-queue of a stage is the out-queue of the previous stage, and the out-queue of a stage is the in-queue of the next stage, so the result is a linked list of queues. Let C(i) be the work unit to be computed. It is first inserted into Q1. Stage 1’s in-queue is Q1 and its out-queue is Q2; after stage 1 completes its operation, it outputs the result to Q2. Stage 2’s in-queue is Q2 and its out-queue is Q3. This continues until C(i) has gone through all stages.

So enough talking, let’s jump straight into the fun stuff.

### Blocking Queue

First, we start with the smallest unit, the Blocking Queue. Its behavior closely follows Java’s implementation.

The Blocking Queue allows two simple operations – Take() and Add(). Take() extracts input data for a stage, blocking until some data is available. Add() simply adds data to the queue.

```
// some code and comments are omitted to save space, download the
// source to see the full implementation
template<typename WorkType>
class CBlockingQueue
{
public:
    WorkType Take()
    {
        boost::mutex::scoped_lock guard(m_mutex);
        while(m_deque.empty()) // loop to guard against spurious wakeups
        {
            m_condSpaceAvailable.wait(guard);
        }
        WorkType o = m_deque.back();
        m_deque.pop_back();
        return o;
    }
    bool Add(WorkType const & o)
    {
        boost::mutex::scoped_lock guard(m_mutex);
        if( (m_deque.size() >= m_maxSize) &&
            (NoMaxSizeRestriction != m_maxSize) )
        {
            return false; // queue is bounded and full
        }
        m_deque.push_front(o);
        m_condSpaceAvailable.notify_one(); // signal that data is available
        return true;
    }
private:
    boost::uint64_t m_maxSize;
    std::deque<WorkType> m_deque;
    boost::mutex m_mutex;
    boost::condition m_condSpaceAvailable;
};
```

### Pipeline Stage

A pipeline stage holds the in-queue and the out-queue. The in-queue is where the work unit comes in, and the out-queue is where the result goes. The Run() function performs three steps that can be customized through inheritance.

```
// some code and comments are omitted to save space, download the
// source to see the full implementation
template<typename WorkType>
class CPipelineStage
{
public:
    CPipelineStage() : m_done(false) {}
    virtual ~CPipelineStage() {}
    void InitQueues(
        boost::shared_ptr<CBlockingQueue<WorkType> > inQueue,
        boost::shared_ptr<CBlockingQueue<WorkType> > outQueue)
    {
        m_inQueue = inQueue;
        m_outQueue = outQueue;
    }
    void Run()
    {
        FirstStep();
        while(Done() == false)
        {
            Step();
        }
        LastStep();
    }

    virtual void FirstStep() = 0;
    virtual void Step() = 0;
    virtual void LastStep() = 0;

    CBlockingQueue<WorkType> & GetInQueue() const { return *m_inQueue; }
    CBlockingQueue<WorkType> & GetOutQueue() const { return *m_outQueue; }

    bool Done() const { return m_done; }
    void Done(bool val) { m_done = val; }
private:
    bool m_done;
    boost::shared_ptr<CBlockingQueue<WorkType> > m_inQueue;
    boost::shared_ptr<CBlockingQueue<WorkType> > m_outQueue;
};
```

### Linear Pipeline

The pipeline class allows stages to be added dynamically through AddStage(). Since all the Blocking Queues are owned by the pipeline, each added stage’s in-queue and out-queue will also be initialized through AddStage().

AddWork() allows a work unit to be added to the first stage. The work unit will be processed by all stages, and can be extracted through GetResult().

Finally, Start() simply kicks off all stages concurrently by spinning up one thread per stage.

```
// some code and comments are omitted to save space, download the
// source to see the full implementation
template<typename WorkType>
class CLinearPipeline
{
public:
    void AddStage(boost::shared_ptr<CPipelineStage<WorkType> > stage)
    {
        m_stages.push_back(stage);
        size_t numStage = m_stages.size();
        m_queues.resize(numStage+1);

        // create the in-queue unless it already exists as the
        // previous stage's out-queue
        if(!m_queues[numStage-1])
        {
            m_queues[numStage-1] =
                boost::shared_ptr<CBlockingQueue<WorkType> >(
                    new CBlockingQueue<WorkType>()
                );
        }
        m_queues[numStage] =
            boost::shared_ptr<CBlockingQueue<WorkType> >(
                new CBlockingQueue<WorkType>()
            );
        m_stages[numStage-1]->InitQueues(
            m_queues[numStage-1], m_queues[numStage]
        );
    }
    void AddWork(WorkType const & work)
    {
        m_queues[0]->Add(work); // work enters the first stage's in-queue
    }
    WorkType GetResult()
    {
        return m_queues[m_queues.size()-1]->Take();
    }
    void Start()
    {
        for(size_t i=0; i<m_stages.size(); ++i)
        {
            m_threads.push_back(boost::shared_ptr<boost::thread>(
                new boost::thread(
                    boost::bind(&CLinearPipeline<WorkType>::StartStage, this, i)
                )));
        }
    }
    void Join()
    {
        for(size_t i=0; i<m_stages.size(); ++i)
        {
            m_threads[i]->join();
        }
    }
private:
    void StartStage(size_t index)
    {
        m_stages[index]->Run();
    }
    std::vector<
        boost::shared_ptr<CPipelineStage<WorkType> >
    > m_stages;
    std::vector<
        boost::shared_ptr<CBlockingQueue<WorkType> >
    > m_queues;
    std::vector<
        boost::shared_ptr<boost::thread>
    > m_threads;
};
```

### Sample Usage

Now that the Blocking Queue, the Stage and the Pipeline are defined, we are ready to make our own pipeline.

As a user of the generic pipeline class, we need to define two things – the work unit, and the stages.

For demonstration, I define the work unit to be a vector of integers, and a stage called CAddOneStage, which adds one to every element of the vector.

```
class CAddOneStage : public CPipelineStage<std::vector<int> >
{
public:
    virtual void FirstStep() { /* omit */ }
    virtual void Step()
    {
        std::vector<int> work = GetInQueue().Take();
        for(size_t i=0; i<work.size(); ++i)
        {
            ++work[i];
        }
        GetOutQueue().Add(work); // pass the result on to the next stage
    }
    virtual void LastStep() { /* omit */ }
};
```

In the program, I create a pipeline with three CAddOneStage stages. So if I feed a vector through the pipeline, every element of the result is the input plus three.

```
int main()
{
    CLinearPipeline<std::vector<int> > pl;
    // three identical stages, each adds one to every element
    for(size_t i=0; i<3; ++i)
    {
        pl.AddStage(boost::shared_ptr<CPipelineStage<std::vector<int> > >(
            new CAddOneStage));
    }

    pl.Start();

    std::vector<int> work(100,0);
    for(size_t i=0; i<10; ++i)
    {
        pl.AddWork(work);
    }
    for(size_t i=0; i<10; ++i)
    {
        std::vector<int> result = pl.GetResult();
        std::cout << result[0] << std::endl; // each element is now 3
    }
    pl.Join();
    return 0;
}
```

### Thoughts

CLinearPipeline essentially establishes a powerful concurrency framework for pipelining. It allows stages to be customized and added easily.

Although work units are passed by copy in the example, they can easily be passed as pointers to avoid the copy; this is nothing more than an implementation detail.
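
For instance, a minimal sketch of that change (my addition; it assumes the stages are updated to dereference the pointer): instantiate the pipeline with a smart-pointer work type, so only the pointer is copied between queues.

```
// Only the shared_ptr travels through the blocking queues;
// the underlying vector is never copied between stages.
typedef boost::shared_ptr<std::vector<int> > WorkPtr;
CLinearPipeline<WorkPtr> pl;
pl.AddWork(WorkPtr(new std::vector<int>(100, 0)));
```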

Problems such as handling protocol stacks can easily leverage this pattern because each layer can be handled independently.

Performance is best when the work is divided evenly among the stages. You don’t want a long pole among the stages.

Unfortunately, the maximum amount of concurrency is limited by the number of pipeline stages.
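
To put numbers behind the last two points, here is a standard back-of-the-envelope model (my addition, not from the book). With S stages, where t_i is the time stage i spends per work unit and t_max is the slowest of them, N work units need roughly the pipeline fill time plus one t_max per remaining unit, so the asymptotic speedup is capped by both stage balance and stage count:

```
T(N, S) \approx \sum_{i=1}^{S} t_i + (N - 1)\, t_{\max},
\qquad
\text{speedup} = \frac{N \sum_{i=1}^{S} t_i}{T(N, S)}
\;\xrightarrow{\;N \to \infty\;}\;
\frac{\sum_{i=1}^{S} t_i}{t_{\max}} \le S
```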

### Tools

Visual Studio 2008, Boost 1.40.

### Reference

Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill (2005). *Patterns for Parallel Programming*. Addison-Wesley. ISBN-10: 0321228111.
