Outdated: This post was written in January, 2019 and is not currently maintained.

In concurrent applications, we need a way to allow different tasks to make progress at the “same” time, which is known as concurrency. The most widely-used technology for concurrency is threads, the minimal execution unit provided by the operating system. However, there is a non-negligible overhead associated with multi-threading, especially when the threads are only performing small, lightweight tasks.

Multithreading overhead comes from how threads are implemented in operating systems. In Linux, a thread is just a "lightweight" process that shares resources, like virtual memory, with its parent process. When a thread yields control and begins to wait on an event to wake up, the operating system will take over control, find the next suitable process to run, and switch to that process, just like what it would do with normal processes. This introduces a significant overhead for many kinds of multithread applications, especially those requiring frequent message passing between threads. Additionally, when a thread is created or destroyed, the operating system has to initialize or destroy the thread control block (TCB, task_struct in Linux), which is a non-trivial operation.

Coroutine is a mechanism for reducing multithreading overheads. As a userspace equivalent of threads, coroutines are managed entirely in userspace and is opaque to the underlying OS.

Stackless Coroutines

There are two types of coroutine systems, stackless and stackful. A stackless coroutine system is usually implemented with some kind of language-level transformations, like the CPS (Continuation Passing Style) transformation. A program in CPS form, as its name suggests, instead of passing the results of computations by return values, passes them by “continuations”, which looks like:

// Original form
let a = f1()
let b = f2(a)

// CPS form
    a => f2(
        b => print(b)

The code above looks like the ancient JavaScript callback pattern, doesn't it? JavaScript is by design single threaded and asynchronous, and blocking operations have to be implemented with callbacks. In fact, this kind of callbacks is an explicit form of CPS, regardless of whether Promise is used to avoid problems like callback hell. To simplify programming, async/await syntax is introduced later, which enables users to write code that looks synchronous but is in fact asynchronous, and is usually implemented with automatic CPS transformation.

Another approach to stackless coroutine is to make use of a big switch loop:

// Original form
int f1(int a) {
    int i, v = 0;
    for(i = 0; i < a; i++) v += i;
    return v;

// Stackless coroutine
struct Local {
    int i;
    int v;
int f1(int a, int state, struct Local *local, int *out) {
    switch(state) {
        case 0:
            local->i = 0;
            local->v = 0;
            return 1;
        case 1:
            if(local->i < a) {
                local->v += local->i;
                return 1;
            } else {
                *out = local->v;
                return -1;

In the code above, we maintain the execution state of f1 outside of itself, therefore allowing recovery of its execution at any point of time as long as we preserve state and local.

Stackful Coroutines

Stackless coroutines are “stackless” because they do not explicitly preserve the call stack, while stackful coroutine systems allocate a new call stack for each coroutine. Compared to stackless coroutines, stackful coroutines usually do not need language-level transformations, making it easier to be integrated into different programming languages.

When the OS switches tasks, it needs to save the machine state of the old task and load the new task’s one. A machine state contains all used registers, along with in-memory data that should be preserved, e.g. the call stack. As we are going to do coroutine switching in userspace, we have to manually implement the logic for switching machine states.

For convenience, in this section, we assume that our platform is Linux on a x86_64 machine and we are implementing cooperative scheduling. Therefore, we need to follow the x86_64 System V ABI when dealing with calling conventions. The ABI specifies that the registers rbx, rbp, r12, r13, r14 and r15 are callee saved, so they must be saved before switching to another coroutine.

Let’s start from the coroutine switching function:

    "push %rbx\n"
    "push %rbp\n"
    "push %r12\n"
    "push %r13\n"
    "push %r14\n"
    "push %r15\n"
    "mov (%rdi), %rax\n"
    "mov %rsp, (%rdi)\n"
    "mov %rax, %rsp\n"
    "pop %r15\n"
    "pop %r14\n"
    "pop %r13\n"
    "pop %r12\n"
    "pop %rbp\n"
    "pop %rbx\n"

The co_switch function pushes all current callee-saved registers onto the current stack, swaps (%rdi) (the memory pointed to by the first argument, target stack address here) with the stack pointer %rsp, loads the callee-saved registers of the target coroutine, and then executes ret to return in the target coroutine.

Suppose we have a coroutine A that has just executed the instruction mov %rax, %rsp. At this point, we are sure that the old stack, which is swapped out into memory location (%rdi), has the following layout (from lower address to higher address):

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address]

A coroutine can only be switched out with co_switch, so its stack must have the same layout. Therefore, since we are already on the target stack after executing mov %rax, %rsp, we can pop the saved registers in the reverse order, from r15 to rbx. Then, the ret instruction is executed, which pops the return address and jumps to it, and the caller code on the target coroutine observes a normal return from co_switch.

However, we just assumed that we already have a coroutine switched out previously, and some other logic is required for starting a new coroutine.

We have to construct the stack layout:

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address]

The values of the saved registers can be arbitrary, since there’s no old registers to save. To construct a valid “return” address, a trampoline is needed:

    "pop %rax\n" // entry
    "pop %rdi\n" // userdata
    "call *%rax\n"

pre_call_entry pops a function pointer entry and the associated userdata from the stack, calls entry with userdata as its only argument, then executes ud2 to indicate entry should never return. Therefore, we need to include entry and userdata in our stack construction:

[r15] [r14] [r13] [r12] [rbp] [rbx] [return_address] [entry] [userdata]

With the two assembly functions and the expected stack layout, now we can write the code to start a coroutine:

void co_switch(void **stack);
void pre_call_entry();

struct Coroutine;

typedef void (*CoEntry)(struct Coroutine *co);

struct Coroutine {
    void *stack;
    CoEntry entry;
    int terminated;

void call_entry(struct Coroutine *co) {
    co->terminated = 1;

void start_coroutine(struct Coroutine *co) {
    void **stack = (void **) co->stack;

    *(--stack) = co;
    *(--stack) = call_entry; // 16-byte aligned

    *(--stack) = pre_call_entry;

    *(--stack) = 0;
    *(--stack) = 0;
    *(--stack) = 0;
    *(--stack) = 0;
    *(--stack) = 0;
    *(--stack) = 0;

    co->stack = (void *) stack;


The full example code is available on my GitHub Gist.

Optimizations and Tricks

  • ret in co_switch can be replaced by an explicit pop and jmp to avoid return address prediction failures.
  • If your coroutines are extremely lightweight and usually use a very small space on the stack, we can use a fixed memory region as the execution stack (“shared stack”) and copy in/out the used range (stack_end - sp) of the coroutine stack when switching in/out of a coroutine. Although memory copying is needed, this approach can save a lot of memory while still allowing to use guard pages to ensure safety.
  • To implement cross-platform stackful coroutines without writing assembly, the pthread_* functions can be used along with setjmp and longjmp to initialize the machine state on a new stack and switch between stacks, without actually using OS threads.