The Perils of Undefined Behaviour

02 Jul 2018    

There are quite a few cases where the C++ standard simply states that if certain conditions are met then the behaviour of the program is undefined. This means that there is no guarantee as to what the program will do. Compilers, on the other hand, love undefined behaviour (UB), optimisers especially, since it lets them assume those cases never occur and optimise away quite a few things.

So if a program closely adheres to the rules laid out by the C++ standard then the compiler will most probably do what you expect. Otherwise, there’s absolutely no guarantee, since the compiler is allowed to assume that the lines with UB never happen. This should not be confused with implementation defined behaviour, which is dangerous in its own way. For example, the size of std::size_t is implementation defined, as is the mapping performed by reinterpret_cast<>.
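One way to keep implementation defined behaviour from surprising you is to write your assumptions down so the build fails when they don't hold. The following is only a minimal sketch: the specific assumptions (8-bit bytes, a size_t of at least 32 bits) and the helper name pointer_round_trips are made up for illustration, and std::uintptr_t is an optional type that mainstream compilers happen to provide.

```cpp
#include <climits>
#include <cstddef>
#include <cstdint>

// Assumptions a (hypothetical) code base makes about the implementation.
// If a compiler disagrees, the build stops instead of silently misbehaving.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");
static_assert(sizeof(std::size_t) >= 4, "this code assumes size_t is at least 32 bits wide");

// The integer value a pointer maps to is implementation defined, but
// converting that integer back to the original pointer type gives back
// the original pointer value.
bool pointer_round_trips(int* p)
{
    const std::uintptr_t bits = reinterpret_cast<std::uintptr_t>(p);
    return reinterpret_cast<int*>(bits) == p;
}
```

Unlike UB, none of this lets the compiler assume anything away. The results can simply differ from one implementation to the next.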

There are a few types of UB. While not all of them are equal, they should all be treated with caution, and you should always check the code gen of every compiler you target to make sure. For example, defining a function in the std namespace will result in a program that has undefined behaviour, except where the standard allows it, e.g. specialising std::hash<>. In reality, it’s probably not that bad and won’t really affect your program. However, something like signed integer overflow could have some unintended side effects. Given the following trivial piece of code:

int is_more(int x, int y)
{
    if (x + 1 > x)
        return y;
    else
        return x;
}

// Function actually gets compiled down to
int is_more(int x, int y)
{
    return y;
}

The expression x + 1 overflows when x is the largest representable int, and signed integer overflow is undefined by the C++ standard (unsigned arithmetic is fine, since it wraps around). The compiler is therefore allowed to assume the overflow never happens, which makes x + 1 > x always true, so it collapses the function to return y. These optimisations remain invisible unless you look at the generated code (assembly). Although it is a trivial case, it’s even more worrying if you consider that the if else conditions could have been inlined from another function, or possibly hidden under some macro! Try putting the above code in Compiler Explorer and converting the integers to unsigned: you’ll see the compiler keeps the comparison, because with wrap-around x + 1 > x really is false for the maximum value.
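If the intent really was to detect the overflow case, the check has to be made before the addition rather than after it. Here is a minimal sketch of that idea; the name is_more_checked is made up for illustration and this is just one of several ways to write it:

```cpp
#include <limits>

// Hypothetical overflow-safe variant of is_more. The comparison is made
// against the largest int value, so x + 1 is never evaluated and no
// signed overflow can occur.
int is_more_checked(int x, int y)
{
    if (x < std::numeric_limits<int>::max())
        return y;   // x + 1 would be representable, so x + 1 > x holds
    else
        return x;   // x is the maximum int: x + 1 would overflow
}
```

Because every path through this version is well defined, the compiler has to preserve both branches.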

One could argue that the add instruction on x86 has two’s complement semantics, so the overflow is harmless in practice. That doesn’t matter to the compiler, which follows the C++ standard rather than the hardware. From Roger Miller via Steve Summit, this is like saying:

Somebody once told me that in basketball you can’t hold the ball and run. I got a basketball and tried it and it worked just fine. He obviously didn’t understand basketball.

Another example of UB, and one of my favourites, comes from Raymond Chen at Microsoft. Given the following code stub:

struct RefArray {
    int** Data;

    int& operator[](int index) {
        return *Data[index];
    }
};

RefArray gFrameCounts;

void Refresh(int* frameCount) {
    // .. a bunch of refresh code ..
    if (frameCount != nullptr) ++*frameCount;
}

void RefreshAndCount(int i) {
    Refresh(&gFrameCounts[i]);
}

Can you spot the UB? Its effect shows up at the frameCount != nullptr comparison: the compiler can remove this check completely and always increment frameCount. This is because gFrameCounts[i] returns a reference (int&) and you’re then taking the address of the returned reference. That seems fine, but in C++ a reference cannot be invalid/null: operator[] has to dereference the stored pointer to form the reference, and dereferencing a null pointer is UB. Since the compiler may assume the reference is always valid, the pointer passed to Refresh from RefreshAndCount always has a valid address, and once Refresh is inlined there the nullptr check is completely optimised away!
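If the slots really can be empty, one way out is to never form the reference in the first place and hand the stored pointer straight to Refresh. This is only a sketch of that idea, not the fix from Raymond Chen's article; PtrArray and its at function are hypothetical names, and the array is passed as a parameter here purely to keep the example self-contained.

```cpp
// Hypothetical variant of RefArray that exposes the stored pointer
// instead of dereferencing it to produce a reference.
struct PtrArray {
    int** Data;

    // Returns the pointer held in the slot; it may legitimately be null.
    int* at(int index) {
        return Data[index];
    }
};

void Refresh(int* frameCount) {
    // .. a bunch of refresh code ..
    if (frameCount != nullptr) ++*frameCount;
}

void RefreshAndCount(PtrArray& counts, int i) {
    // No dereference happens before the null check, so the compiler
    // has no grounds to remove it.
    Refresh(counts.at(i));
}
```

The null check now guards a pointer that was never dereferenced beforehand, so it has to stay in the generated code.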

Another notorious UB is the strict aliasing violation. For example if we have the following code:

union IntBytes {
    int i;
    unsigned char c[4];
};

int SumOfBytes(int i) {
    IntBytes temp;

    temp.i = i;

    // Technically speaking this is UB
    return temp.c[0] + temp.c[1] + temp.c[2] + temp.c[3];
}

According to the C++ standard, once we write to temp.i, reading temp.c is undefined. This is because, technically, temp.i and temp.c are not allowed to alias the same thing as far as the compiler is concerned. So the temp.i = i assignment potentially has no effect on the program and can be removed, leaving you returning garbage from the stack. Realistically, though, it depends on the compiler, so always check the code gen to be sure it is producing the correct output. To get around it, especially in trivial cases like these, a memcpy will produce the exact same output without any of the pitfalls of UB (Link to memcpy code). Type punning through unions and through reinterpret_cast both suffer from this problem. There are other aliasing related optimisations that don’t happen but those aren’t UB related.
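For reference, the memcpy approach could look roughly like the following. This is a sketch rather than the code originally linked; the name SumOfBytesMemcpy is made up, and the array is sized with sizeof(int) instead of hard-coding 4.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical memcpy-based replacement for SumOfBytes. Copying the object
// representation into an unsigned char array is well defined, so there is
// no inactive union member and no aliasing question.
int SumOfBytesMemcpy(int i) {
    unsigned char bytes[sizeof(int)];
    std::memcpy(bytes, &i, sizeof(int));

    int sum = 0;
    for (std::size_t n = 0; n < sizeof(int); ++n)
        sum += bytes[n];
    return sum;
}
```

For a small, fixed-size copy like this, optimising compilers typically don't emit an actual call to memcpy, but as always the code gen is worth checking.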

UB is not all bad: it does help the compiler optimise, giving us more efficient code. It is, however, very difficult to spot, and you most probably won’t get a compiler diagnostic about it (although this is now changing in the latest compiler versions). While this was a very brief intro to undefined behaviour, here are a few talks and blog posts to read if you’re interested and want to know more about it: