Interleaved Multithreading: A Pipeline's Best Friend

Summary

One of the largest CPU performance bottlenecks is branching, which prevents the CPU from being able to pre-process future instructions since it does not know which instructions will be executed next. Interleaved multithreading, a CPU-level solution to the problem of running multiple program threads at once, lessens this problem significantly in very clever and subtle ways.

The CPU Pipeline

Note: readers who already know about the CPU pipeline may wish to skip this section.

Everything you know about CPUs is wrong. A common simplification in explaining how CPUs work is to say that they perform a single instruction every clock cycle. Tick, 1 + 1 = 2. Tick, 2 * 2 = 4. Tick. In fact, most CPU instructions, especially the more complex ones such as division, take many clock cycles to perform.

The reason for this is that many CPU instructions are much more complex than a single step. Take, for example, a simple addition instruction. It looks something like this:

Add the value in register A to the value in register B and place the result in register C.

Though we often think of this as a single instruction, multiple steps are actually required. First the CPU has to read in the values of A, B, and C. Then it has to retrieve the values in the specified registers. Then it has to add those values together. Then it has to store the result in the output register. All told, even a simple addition instruction can take far longer than a single clock cycle.

It turns out, however, that the CPU can still, on average, perform N instructions in N clock cycles. It does this through the use of a CPU pipeline. A CPU pipeline is basically like an assembly line for a CPU. While one piece of a particular instruction is being performed somewhere in the CPU, some other part of the CPU is working on a different piece of a different instruction. Thus, though it may take an instruction far longer than a single clock cycle to progress from the beginning of the pipeline to the end, the pipeline is still completing one instruction every clock cycle.

The Problem of Branching

Note: Readers who already know about the performance penalties to CPU pipelining caused by branching may wish to skip this section.

In order for the pipeline to function at full capacity, however, an instruction must be placed in the pipeline well before the instruction before it is complete. In order for this to happen, the CPU has to know many steps in advance what instructions need to be executed.

Most of the time, this isn’t an issue. Programs are just long sequences of instructions, so looking ahead is no problem. But what happens when the CPU isn’t sure where to look?

This happens when a program branches. Branching is just a technical term for having an instruction which might result in the program jumping to one of multiple different places elsewhere in the program (the “branches”). The most common example of this is an if statement. If the if conditional is true, the program will jump to the body of the statement. Otherwise, it will jump past the body. Imagine the following simple example:

1 if (i < j)
2   i++;
3 i--;

When the CPU gets to the instruction for line 1, it’s not sure if it should place line 2 in the pipeline or line 3. It has to wait until the previous instruction – i < j – has finished processing so that it knows which instruction to place in the pipeline next. [1]

When this happens, the performance gains of the pipeline are lost. If it takes 20 clock cycles for the i < j instruction to pass through the pipeline, then there will be a 20 clock cycle delay between the processing of the i < j instruction and the instruction following it, as opposed to the normal 1 cycle delay.

Interleaved Multithreading

Most modern computing environments allow for multithreading – the running of multiple instruction threads in parallel such that each thread can behave as though it’s the only thing running on the machine. One of the most common methods of achieving multithreading is through the use of scheduling, where threads take turns getting time to run on the CPU.

However, scheduling is not the only method that exists. In particular, in this article, I’m going discuss a method known as interleaved multithreading (IM).

Interleaved multithreading is a multithreading scheme which relies heavily on the design of the CPU. In order to run two threads in parallel, IM executes instructions from each thread in turns, switching between threads each instruction. First it executes an instruction from thread A, then one from thread B, then another from thread A, and so on. This scheme can be easily extended to more than two threads, but I’m going to stick to discussing two for simplicity’s sake.

One of the advantages of IM, and the one that this article focuses on, is that it can greatly reduce the negative performance effects of branching.

Imagine that we have a CPU with a pipeline which can process instructions in ten different stages (instructions process from left to right in this diagram):

0 1 2 3 4 5 6 7 8 9
-------------------
? | ! ^ & % / * - +
-------------------

Notice that the latest instruction to be added was a compare operation (denoted “?”). Let’s assume that the instruction following that compare is a branch. We can now see that no more instructions can be added to the pipeline until the compare instruction is completed. Just after the compare is executed, ten clock cycles elapse before another instruction is completed, and for that time, the CPU effectively performs ten times slower than normal.

0 1 2 3 4 5 6 7 8 9
-------------------
. . . . . . . . . ?
-------------------

But now let’s imagine that this CPU is running IM. In IM, threads A and B switch of between themselves every clock cycle:

0 1 2 3 4 5 6 7 8 9
-------------------
A B A B A B A B A B
-------------------

Now let’s imagine that the last instruction in thread A was a compare, and it will be followed by a branch. Just as we would expect, thread A cannot begin processing any new instructions until the compare is complete. However, thread B has no such restrictions since its program flow is completely independent of the behavior of thread A. Thus, while thread A is pausing to allow its compare to be completed, thread B can continue utilizing the pipeline:

0 1 2 3 4 5 6 7 8 9
-------------------
. B . B . B . B ? B
-------------------

As with the original case, just after the compare is completed, the CPU appears to be underperforming since no new instructions are completed. However, whereas before the CPU appeared to be underperforming by a factor of ten, now it’s only underperforming by a factor of two. Just by adding IM, we have increased our performance under branching fivefold.

In fact, we can extend this even further. In theory, it would be possible for the CPU to notice that a branch is happening, and flood the pipeline with instructions from thread B until the compare from thread A had completed. Such an improvement would effectively render performance penalties from branching meaningless, since branching would not cause a single missed instruction evaluation. However, it’s unclear to me whether this is done in practice.

One thing that certainly is done in practice is to run N threads simultaneously with a pipeline containing N stages. This means that no two instructions from the same thread are ever in the pipeline at the same time. This also renders performance penalties from branching meaningless. For any given thread, the classic simplification that CPUs execute instructions completely serially, one per clock cycle, is effectively true under this approach.


[1] In reality, most modern CPUs have a feature called branch prediction which tries to make an educated guess at which branch will be executed. If the CPU predicts correctly, the program will perform as though no branch had happened, and no pipeline time will be lost. However, branch prediction, while quite good, is not 100% accurate, and branching still causes large performance losses.