Over the past few years, I have gone through a few cycles of learning the x86-64 memory model, only to eventually forget it again. Today I came across a great paper that explains the matter very clearly, so I’m taking notes here for future reference.
The model in the paper is particularly easy to understand because it is described in terms of standard software lock primitives[1], as below:

- There is one global lock `G`.
- There is one background thread `T`.
- Each CPU has a store buffer, which is a queue of `<address, value>` items. The store buffer is pushed by the owning CPU, and popped by the background thread `T`.
The background thread `T` does only one thing:

- Lock the global lock `G`.
- Pop an item `<addr, value>` from the store buffer of a CPU, and write the value to main memory: `MainMemory[addr] = value`.
- Unlock the global lock `G`.
The procedure for a CPU to execute an instruction is described below.

STORE instruction

- Push the item `<addr, value>` to its store buffer.

LOAD instruction

- Lock the global lock `G`.
- If `addr` exists in its store buffer, return the corresponding `value`[2]. Otherwise return `MainMemory[addr]`.
- Unlock the global lock `G`.
MFENCE instruction

- Wait until its store buffer is eventually emptied by the background thread `T`.
ATOMIC instruction

- Lock the global lock `G`.
- Run the atomic instruction, using the subroutines described above for `LOAD` and `STORE`.
- Empty its own store buffer: pop every item `<addr, value>` from the store buffer and write it to main memory: `MainMemory[addr] = value`, until the store buffer is empty.
- Unlock the global lock `G`.
Note that the semantics of `LOAD` and `STORE` provide the expected consistency on single-threaded programs.
An Application
Let’s analyze why the familiar spinlock implementation below is correct under the x86-64 memory model:
```c
void Init(uint8_t* lock) { *lock = 0; }
```
We will prove the correctness via a token-counting argument. Each `<lock, 0>` item in a store buffer counts as one token, and if `MainMemory[lock] == 0`, that also counts as one token. By definition, at any moment the number of tokens cannot go below 0.
By the abstract machine semantics above, it’s not hard to prove that:

- The background thread `T` cannot increase the total number of tokens.
- Each `Unlock()` call creates one token.
- Each `Lock()` call cannot return until it successfully consumes at least one token. (If the CAS succeeded by seeing a `0` in its store buffer, that token is lost after the CAS, because the CAS flushes the store buffer and also changes the memory value to `1`. If the CAS succeeded by seeing a `0` in main memory, that token is also lost, because the store buffer item with the new value `1` is flushed to memory, overwriting the `0`.)
Initially there is one token (created by the `Init()` call). Since `Unlock()` may only be called after[3] (guaranteed by program order) `Lock()`, the total number of tokens cannot exceed one at any moment. So after a `Lock()` returns, there must be zero tokens, and no other `Lock()` can return. The total number of tokens goes back to one only when the `Unlock()` in that program is called, and only after that may another `Lock()` operation return. So the `Lock() -> Unlock()` time intervals are pairwise non-overlapping, providing the mutual exclusion one would expect.
Footnotes
1. While the description is logically equivalent to the guarantees provided by the hardware, it is not how the hardware physically implements the memory subsystem; the actual implementation is far more efficient. ↩︎
2. Of course, if `addr` shows up multiple times in the store buffer, we should return the `value` of the latest version. ↩︎
3. Throughout this argument, the time relationship refers to wall clock time (or, equivalently, the relative position in the interleaved event sequence of all CPUs). ↩︎