A Trip Through the CPU Pipeline


29 Apr 2019    

When we compile our C++ code, the compiler converts it into a stream of machine instructions. The CPU then reads this stream of instructions and executes them. But in reality, it’s not as simple as just fetching an instruction and executing it. In this tip we’ll look at what actually goes on inside the CPU when you feed it a stream of instructions.

While this post only targets specific Intel microarchitectures (>= Sandy Bridge), it will give you an idea of what goes on inside the CPU, since the concepts are very much applicable to other processors & microarchitectures as well. The only prerequisite is some basic knowledge of x64 Assembly. As always, please do check out the resources at the end if you want a more in-depth explanation of the concepts.

Why Use a Pipeline?

Pipelining reduces the average number of cycles it takes to execute each instruction, thus increasing throughput. Much like an assembly line (no pun intended), each stage in the pipeline does a different part of the work, so a single instruction (or stage) doesn’t hold up the whole processor. For example, the table below shows how multiple stages can happen at once, since an instruction can be broken down into several stages.

Instruction \ Clock Cycle    1    2    3    4    5    6    7
1                            IF   ID   EX   MEM  WB
2                                 IF   ID   EX   MEM  WB
3                                      IF   ID   EX   MEM  WB
4                                           IF   ID   EX   MEM
5                                                IF   ID   EX

Where:

  • IF - Instruction Fetch
  • ID - Instruction Decode
  • EX - Execute
  • MEM - Memory Access
  • WB - Register Writeback
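
To put rough numbers on the table above, here is a minimal back-of-the-envelope sketch in C++ (the 5-stage model and the one-cycle-per-stage assumption are simplifications for illustration, not real hardware figures):

#include <cstdio>

// With a k-stage pipeline and one instruction entering per cycle,
// n instructions take (k - 1) + n cycles instead of k * n.
int main()
{
    const long long k = 5;        // pipeline stages (IF ID EX MEM WB)
    const long long n = 1000000;  // instructions to execute

    long long unpipelined = k * n;        // each instruction runs alone
    long long pipelined   = (k - 1) + n;  // stages overlap every cycle

    std::printf("unpipelined: %lld cycles\n", unpipelined);
    std::printf("pipelined:   %lld cycles (~%lldx faster)\n",
                pipelined, unpipelined / pipelined);
}

For a long instruction stream the speedup approaches the number of pipeline stages - in this idealised model, about 5x.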

The Intel Pipeline

Intel don’t actually reveal much about their pipeline, but people over the years have reverse engineered CPUs, via performance counters etc., to uncover some information about the internals. This tip is a combination of the official documentation and that reverse-engineered knowledge. Generally speaking, the Intel pipeline has stayed more or less the same from Ivy Bridge to Skylake (at least), and the only changes come in the form of different cache sizes, numbers of entries, number & size of ways, etc. The pipeline we’ll be using for reference is for Skylake processors, as illustrated below:

[Figure: the Skylake pipeline]

That is indeed pretty complicated, but hopefully by the end of this post you’ll be able to understand most of it. Firstly, to simplify what we’re dealing with, we can look at the pipeline in the following way:

[Figure: a simplified view of the pipeline]


Now, let’s walk through the above pipeline!

Branch Prediction

The Branch Predictor Unit (BPU) is responsible for guessing which path a conditional branch instruction will take and storing that prediction for later reuse. If the prediction is correct the pipeline can continue as expected, but if it mispredicts then the whole pipeline has to be flushed!

One simple and rather straightforward way to do this is to store the target address of a branch in a Branch Target Buffer (BTB) along with the state of the branch. The state is stored using a 2-bit (4 possible values) Saturating Counter as shown below:

[Figure: 2-bit saturating counter state diagram]

Every time the branch is taken the counter moves up to the next state, and down if it is not taken. When the counter is in one of the taken states the branch is predicted as taken, and vice versa. So if a branch is sitting at “Strongly Not Taken” then it would need to be taken twice before being predicted as taken. This scheme is prone to flip-flopping between the weaker states, causing repeated mispredictions. Thus, one way to improve the prediction rate is to remember the last n occurrences of the branch and use a saturating counter for each of the 2^n possible patterns. This is called an Adaptive Two-Level Predictor:

[Figure: adaptive two-level predictor]

Consider the example of n = 2. This means that the last two occurrences of the branch are stored in a 2-bit shift register. This branch history register can have 4 different values: 00, 01, 10, and 11; where 0 means “not taken” and 1 means “taken”. Now, we make a pattern history table with four entries, one for each of the possible branch histories. Each entry in the pattern history table contains a 2-bit saturating counter of the same type as in the diagram above. The branch history register is used for choosing which of the four saturating counters to use. If the history is 00 then the first counter is used. If the history is 11 then the last of the four counters is used.

In the case of a branch that is alternately taken and not taken, the branch history register will always contain either 01 or 10. When the history is 01 we will use the counter with the binary number 01B in the pattern history table. This counter will soon learn that after 01 comes a 0. Likewise, counter number 10B will learn that after 10 comes a 1. After a short learning period, the predictor will make 100% correct predictions. Counters number 00B and 11B will not be used in this case.
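
To make this concrete, here is a minimal C++ sketch of such a two-level predictor (the 2-bit history, the 4-entry pattern history table and the alternating branch pattern are illustrative assumptions, not the real hardware layout):

#include <cstdio>

// 2-bit saturating counter per pattern: 0-1 predict "not taken",
// 2-3 predict "taken".
struct TwoLevelPredictor
{
    unsigned history = 0;  // 2-bit branch history shift register
    unsigned pht[4] = {};  // pattern history table of saturating counters

    bool predict() const { return pht[history] >= 2; }

    void update(bool taken)
    {
        unsigned& ctr = pht[history];
        if (taken && ctr < 3) ++ctr;            // saturate upwards
        if (!taken && ctr > 0) --ctr;           // saturate downwards
        history = ((history << 1) | taken) & 3; // shift in the outcome
    }
};

int main()
{
    TwoLevelPredictor p;
    int correct = 0;
    for (int i = 0; i < 100; ++i)  // branch alternates: taken, not taken...
    {
        bool taken = (i % 2 == 0);
        correct += (p.predict() == taken);
        p.update(taken);
    }
    std::printf("%d/100 predicted correctly\n", correct);
}

After a handful of warm-up iterations the counters for histories 01 and 10 lock in, and every remaining prediction is correct.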

That is a quick but basic explanation of one branch prediction strategy; in reality the hardware adds a few extra mechanisms such as a global history table, or alternatively uses an agree predictor. But the bottom line is that the BPU will attempt to spit out an address of where the next executable instructions may lie in memory.
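
As a practical consequence, the same branch can be cheap or expensive purely depending on how predictable its data is. A classic illustration (a hedged sketch - the function, its name and the threshold of 128 are arbitrary): processing sorted data turns a value-dependent branch into two long runs of identical outcomes, which the BPU predicts almost perfectly.

#include <algorithm>
#include <vector>

// Sums only the large elements. On random data the `if` below
// mispredicts often; after std::sort the branch outcome changes
// only once across the whole array, so prediction is near-perfect.
long long sum_large(std::vector<int>& v)
{
    std::sort(v.begin(), v.end()); // makes the branch predictable
    long long sum = 0;
    for (int x : v)
        if (x >= 128)              // value-dependent conditional branch
            sum += x;
    return sum;
}

Of course, sorting has its own cost - this only pays off if the data is reused or already sorted.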

Fetch & Pre-Decode

Once we have an instruction address, we can fetch the instruction stream at that point. The main problem with the instruction stream is that x86 is a CISC architecture with variable length opcodes. Opcodes are byte identifiers that tell the CPU what operation to perform. For example the following assembly has the associated encodings shown in the comments:

xor eax, eax ; 0x31 0xc0
ret ; 0xc3

As you can see from the above example, the instructions vary in size but are all part of a single instruction stream. This isn’t much of a problem on RISC architectures (such as ARM) where instructions are all the same size. So the job of the Fetcher is to recognise these special bytes of memory, basically the beginning of each instruction, and record their offsets. From Ivy Bridge up until Skylake the fetcher processes 16 bytes per cycle, while on Skylake and above this was increased to 32 bytes. Thus aligning your instructions (loops, branch targets, and so on) to 16 or 32 byte boundaries can actually help speed things up, although the compiler tries to do this for you.

Instruction Queue & Macro Fusion

The offsets are then fed to a queue which also performs something called Macro Fusion. Some instructions can actually be combined to be executed (almost) simultaneously. For example:

dec ecx
jnz L1
; The 2 above can be fused into one

cmp dword ptr [esi], 0
je L2
; No fusion. Both memory operand and immediate

Macro fusion is very selective and only happens in a few, but very important, cases. The dec & jnz pair above, for example, could very easily be a loop condition!
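
For instance, a counted-down loop like the sketch below (an illustrative example; the function and its name are made up) typically compiles to exactly that fusible pattern:

// A simple countdown loop; compilers commonly emit a dec/jnz
// (or sub/jne) pair for the loop bottom, which the CPU can then
// macro-fuse into a single µop.
void scale_all(int* data, int n)
{
    while (n--)       // decrement-and-branch loop condition
        data[n] *= 2;
}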

Decode (with Stack Engine)

The x64 instructions we see (& love!) are actually not what gets executed directly on the CPU. Instead, they get converted to an internal instruction set (most likely RISC-like). This means that one x64 instruction can generate one or more micro-operations, often abbreviated as µops or uops. The job of the decoder is to take the incoming instruction stream and produce µops for each of those instructions. For example:

add eax, edi ; generates 1 µop

add eax, [MEM] ; 2 µops - load from MEM & then add

This is done in conjunction with a special unit called the Stack Engine. Stack instructions such as push, pop, call and ret all modify the stack pointer esp. Older processors used the integer ALU for adjusting the stack pointer. For example, the instruction push eax generated three µops on the P3 processor: two for storing the value of eax, and one for subtracting 4 (the 32-bit operand size) from esp. On newer processors, the same instruction generates only one µop. The stack engine also inserts synchronization µops to ensure that sequential order is preserved.

The µops are then fed into the DSB.

Decoded Stream Buffer (DSB)

The DSB is essentially a µop cache, present on Sandy Bridge and later processors. It indexes µops by their instruction address. Before a fetch is performed, the DSB is checked to see if the µops already exist there. This also helps in the case of a misprediction, since it avoids part of the instruction stream having to be decoded into µops again.

This cache is also highly dependent on the alignment of the instruction stream, as well as on the types of instructions that can end a cache way - usually unconditional jumps and the like.

Micro Fusion

The next step is to fuse certain pairs of µops into a single entry. A fused µop may still be executed by two different execution units, but it is tracked and retired as a single µop. This helps increase the effective capacity of the Reorder Buffer (ROB), reducing a bottleneck.

mov [esi], eax ; 1 fused µop
add eax, [esi] ; 1 fused µop
add [esi], eax ; 2 single + 1 fused µop
fadd qword ptr [esi] ; 1 fused µop

Loop Stream Detector (LSD)

After fusing the µops together, they are fed to the LSD. This unit is responsible for detecting when the same stream of µops is being executed repeatedly - in other words, a loop. It does this based on a static size threshold, so even a change as small as offsetting the index of your tight loop by 1 can cause an LSD miss.

Once it detects a loop, it shuts off the front end fetch & decode operations and keeps feeding the same µops until it hits a loop mispredict (i.e. the loop exits). Be wary: this may cause oddities as well as differences in benchmarks.
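
As a rough illustration (whether a given loop actually runs out of the LSD depends on its exact µop count and the microarchitecture), a tight loop like this hypothetical one is a typical candidate:

// The loop body decodes to only a handful of µops, so once the
// LSD detects the repetition it can replay them directly while
// the fetch & decode stages are powered down.
long long sum(const long long* a, int n)
{
    long long s = 0;
    for (int i = 0; i < n; ++i)
        s += a[i];    // small, stable µop stream every iteration
    return s;
}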

The next parts of the pipeline are what Intel collectively calls its Out-of-Order Execution Engine.

Register Allocation & Renaming

While the x64 instruction set only has 16 architectural registers, chips in reality have well over a hundred registers internally! The µops contain references to the architectural registers, and these are renamed to map onto internal physical registers. This is done using a Register Alias Table (RAT). The chip is essentially doing an on-the-fly dependency analysis. This allows instructions to be parallelised, increasing throughput, since with only a limited number of registers, subsequent operations would have to wait for an existing register to free up.

For example if we have this code that adds two numbers to two locations in memory:

; Pseudo µops
eax = read(MEM1)
edi = edi + eax
write MEM1, edi

eax = read(MEM2)
esi = esi + eax
write MEM2, esi

While the assembly here is (more or less) fine, the problem is that eax is reused to load from two different memory locations, creating a false dependency between two otherwise independent blocks. So instead, eax can be renamed to use two internal registers:

eax_0 = read(MEM1)
edi_1 = edi_0 + eax_0
write MEM1, edi_1

eax_2 = read(MEM2)
esi_1 = esi_0 + eax_2
write MEM2, esi_1

Now the reads, adds and writes are independent of one another and can run in parallel using the different internal registers. The RAT then tracks the current version of each register as it maps to a permanent (architectural) register. It does this by mapping either into the Reorder Buffer (ROB, for temporary in-flight values) or into the Permanent Register File (the actual registers and their final values). Another thing this helps with is when values need to be moved around to end up in the final (correct) register:

; ASM Code for
; int get_mod(int x, int y)
; {
;     return x % y;
; }

mov     eax, edi
cdq
idiv    esi
mov     eax, edx
ret

So instead of moving the final value from edx into eax, the Renamer can just directly place the final value in eax, avoiding the mov altogether. The Renamer also creates temporary registers, reorders instructions, and puts them into the ROB. For example:

mov eax, [mem1]
imul eax, 6
mov [mem2], eax
mov eax, [mem3]
add eax, 2
mov [mem4], eax

After reordering this is converted to:

mov eax, [mem1]
mov ebx, [mem3]
imul eax, 6
add ebx, 2
mov [mem2], eax
mov [mem4], ebx

This overlaps the two memory loads and lets the two independent chains run in parallel!

The ROB holds the state of the µops that are in flight. This involves keeping track of the ones that have completed, so that any µops dependent on a completed µop can then be executed. µops remain in the ROB until they are retired. Skylake has 224 entries in its ROB, so the number of µops it can keep track of and execute at once is limited.

ROB Read

The ROB-read stage can receive up to three µops from the RAT in one clock cycle, and each µop can have two input registers. This gives a total of up to six input registers. If these six registers are all different and all stored in the permanent register file, then it will take three clock cycles to perform the six reads through the three read ports of the register file. The preceding RAT stage will be stalled until the ROB-read is ready again. The decoders and instruction fetch will also be stalled if the queue between the decoders and the RAT is full. This queue has only approximately ten entries, so it fills up quickly. Thus re-using the same register can often have advantages:

; Register read stall
mov [edi + esi], eax
mov ebx, [esp + ebp]

; No register read stall
mov [edi + esi], edi
mov ebx, [edi + edi]

Reservation Station

This unit is essentially a scheduler. The CPU has several execution units that process different kinds of instructions (often with some overlap), but these are clustered behind a number of ports - 8 ports on Skylake processors, giving a theoretical peak of 8 µops per cycle (rarely achieved in practice). If you go back to the first (and complicated) diagram, it shows a list of the ports and what type of execution unit each one supports. The thing to note there is that several ports can handle basic tasks such as loads, stores and integer adds, but very few (sometimes just one!) can handle floating point or vector instructions.

So once a µop comes in, it gets dispatched to a suitable port. The state of the µops is tracked using the register files.

Execution

Finally, this is where the µop is executed on one of the execution units. Each execution unit may itself be pipelined: while it may take 6 cycles for a multiply µop to complete, the unit can accept a new µop every cycle.
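
This latency-versus-throughput distinction is easy to see at the C++ level. In the sketch below (hypothetical functions; the second assumes n is a multiple of 4), the first loop is limited by the latency of each multiply, while the second keeps the pipelined multiplier busy with independent work:

// Latency-bound: each multiply depends on the previous result,
// so the loop runs at one multiply per multiply-latency.
long long chain(long long x, int n)
{
    for (int i = 0; i < n; ++i)
        x *= 3;                          // dependent chain
    return x;
}

// Throughput-bound: four independent accumulators let the
// pipelined multiplier accept a new µop (almost) every cycle.
long long parallel(long long x, int n)
{
    long long a = x, b = x + 1, c = x + 2, d = x + 3;
    for (int i = 0; i < n; i += 4)
    {
        a *= 3; b *= 3; c *= 3; d *= 3;  // independent multiplies
    }
    return a + b + c + d;
}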

ROB Write

Once a µop completes, it broadcasts its result across a bus, which unblocks any dependent µops. The ROB is updated to record that the µop has completed and is ready to retire. Since all operations live in the ROB until they retire, this is where speculative µops wait until it is proven that their branch was predicted correctly. If it was, the µop moves on to the next stage; otherwise it is discarded. (Can you spot Spectre / Meltdown here?)

Retirement Station

This is where the values in the temporary registers tracked by the ROB are copied into the permanent registers (eax, ebx, etc.). This is also the part that ensures instructions complete in program order. It is the final stage of the pipeline - if you survived all the way through to the end, congrats!

Conclusion

Commercial-grade CPUs are incredibly complicated beasts that take teams of thousands of people years to design. All this pipelining ultimately benefits users by making code run faster while still carrying a lot of legacy cruft. While this tip covers the basic concepts and stages of the pipeline, it doesn’t really cover how to use these concepts in your code - that is left as an exercise for the reader! (Hint: translate this stuff from assembly back to C++ to find potential improvements.) There are also quite a few things that aren’t covered, so do check out the resources below, especially the Agner Fog manuals, which have a great reputation among assembly programmers.

Resources