Compiler Optimisations

22 Apr 2019

There are a variety of optimisations that can be applied to code, ranging from local rewrites of machine code to analysing and transforming whole functions. This tip briefly covers a few of the interesting and common ones. Overall there are quite a few optimisations that can be applied to C++; LLVM alone, for example, provides approximately 100 passes that either perform optimisations or enable other passes to do so. Before we dive in, one thing to note is that the order of optimisations matters, since one pass can create (or destroy) opportunities for another.

Return Value Optimisation

When returning a value from a function by copy, there are cases where the return statement creates a temporary.

struct Obj
{
	// Declaring the copy constructor suppresses the implicit
	// default constructor, so it has to be brought back explicitly
	Obj() = default;

	Obj(const Obj&)
	{
		// ...
	}
};

Obj get_obj() {
	return {}; 
}

int main()
{
	// Without elision, this creates a temporary and invokes the copy constructor
	Obj o = get_obj();
}

The RVO optimisation avoids that copy in the case above. As of C++17 this is no longer just a compiler optimisation: when a prvalue is returned, as here, copy elision is guaranteed by the standard. (Returning a named local variable, known as NRVO, remains an optional optimisation.)

Dead Store Elimination

This optimisation removes redundant writes to memory. For example:

Point p;

// Redundant - will be removed
p.z = symbol.position.z;

p = get_current_coord();

// ...

While the benefit may not be obvious in such a contrived example, code with nesting, function calls, and out parameters can use this optimisation to its full potential.

Loop Unrolling & Vectorisation

Unrolling a loop means performing several iterations of the loop body in a single block, so the end condition is checked less often, reducing branch overhead and improving branch prediction. The trade-off is an increase in binary size. Unrolling can be done manually, but the compiler also performs it when generating machine code, often combining it with SIMD (e.g. SSE) instructions to process several elements at once, which is called vectorisation. My favourite example of this, a manual one, compares the std::find implementations of LLVM (libc++) and GCC (libstdc++) (both modified for readability).

// LLVM - libc++
template <class InputIterator, class T>
InputIterator find(InputIterator first, InputIterator last, const T& value)
{
    for (; first != last; ++first)
        if (*first == value)
            break;
    
    return first;
}

// GCC - libstdc++
template<typename RandomAccessIterator, typename T>
RandomAccessIterator find(RandomAccessIterator first, RandomAccessIterator last, const T& value)
{
    int trip_count = (last - first) >> 2;

    for ( ; trip_count > 0 ; --trip_count)
    {
        if (*first == value)
            return first;
        ++first;

        if (*first == value)
            return first;
        ++first;

        if (*first == value)
            return first;
        ++first;

        if (*first == value)
            return first;
        ++first;
    }

    switch (last - first)
    {
    case 3:
        if (*first == value)
            return first;
        ++first;
        // fall through
    case 2:
        if (*first == value)
            return first;
        ++first;
        // fall through
    case 1:
        if (*first == value)
            return first;
        ++first;
        // fall through
    case 0:
    default:
        return last;
    }
}
}

Static Single Assignment (SSA)

Generally, in an IR (such as LLVM IR) each variable is assigned exactly once; changing a value means defining a new variable that holds the new value. These variables model virtual registers and are usually written as short names with a version number. The SSA form doesn't optimise anything by itself, but it makes many other optimisations much easier to perform. Constructing SSA form is an involved process in its own right and is beyond the scope of this tip.

So, given the following C++ code:

x = 1;
y = x + 1;
x = 2;
z = x + 1;

The SSA form might look something like this:

x0 = 1;
y0 = x0 + 1;
x1 = 2;
z0 = x1 + 1;

When branches are involved, a value may be assigned on multiple paths, so a merge function (traditionally written φ, phi) is introduced at the join point:

x = input();

if (x == 42)
    y = 1;
else
    y = x + 2;

print(y);

And the SSA form would be:

x0 = input();

if (x0 == 42)
    y0 = 1;
else
    y1 = x0 + 2;

y2 = φ(y0, y1); // Merge function

print(y2);

This form makes the following optimisations easier.

Strength reduction

Strength reduction generally acts on SSA form. It reduces the "strength" of certain operations, replacing expensive ones with equivalent, cheaper ones. For example, given the following C++ code:

// A
x *= 4;

// B
c = 7;

for (i = 0; i < N; i++)
{
    y[i] = c * i; // Multiplication is expensive
}

And after applying the strength reduction it can look like this:

// A
x <<= 2; // Cheaper bitwise shift

// B
c = 7;
k = 0;

for (i = 0; i < N; i++)
{
    y[i] = k;
    
    // Replaced with cheaper addition instructions
    k = k + c;
}

Constant Propagation & Folding

Constant folding evaluates operations whose operands are all known constants at compile time, collapsing them into a single value.

i += 320 * 200 * 32;

// Folded into
i += 2'048'000;

Constant propagation substitutes known constant values into the expressions that use them, which in turn can enable further folding. Given the following C++ code:

  int a = 30;
  int b = 9 - (a / 5);
  int c;

  c = b * 4;
  if (c > 10)
  {
     c = c - 10;
  }

  return c * (60 / a);

The whole thing can be collapsed, first by constant folding:

  int a = 30;
  int b = 3;
  int c;

  c = 12;
  if (true)
  {
     c = 2;
  }
  return c * 2;

And then by performing full propagation, eliminating variables in the process:

  int c;
  c = 12;

  if (true)
  {
     c = 2;
  }
  return c * 2;

  // Finally yields
  return 4;

Global & Local Value Numbering

Value numbering is another SSA-based optimisation; it finds and eliminates expressions that compute the same value. It can be done at either local (basic block) or global scope, the latter relying heavily on SSA form. Given the following code:

// SSA form
a3 = x1 + y2
b3 = x1 + y2
a4 = 17
c3 = x1 + y2

First, each value is given a unique name/number (remember, we're more or less dealing with virtual registers here!):

a3_0 = x1_0 + y2_0
b3_0 = x1_0 + y2_0
a4_1 = 17
c3_0 = x1_0 + y2_0

And then the duplicated values are eliminated:

a3_0 = x1_0 + y2_0
b3_0 = a3_0
a4_1 = 17
c3_0 = a3_0

Common Subexpression Elimination

CSE is similar to value numbering but acts on larger expressions. It requires an analysis step and likewise comes in local and global variants. A simple example would be:

a = b * c + g;
d = b * c * e;

Which can be transformed to:

t = b * c;
a = t + g;
d = t * e;

Conclusions

While these optimisations are helpful, they can have unintended side effects and can even introduce security flaws; a classic example is dead store elimination removing a memset intended to scrub a secret from memory. Knowing the kinds of optimisations compilers perform is always helpful when debugging such issues or analysing performance problems. This was a very brief intro to a handful of them, so please feel free to dig deeper if you're interested.