Over the past few years, I have gone through a few cycles of learning the x86-64 memory model, only to eventually forget it again. Today I came across a great paper that explains the matter very clearly, so I’m taking notes here for future reference.
The model in the paper is particularly easy to understand because it is described in terms of standard software lock primitives[1], as below:

- There is one global lock `G`.
- There is one background thread `T`.
- Each CPU has a store buffer, which is a queue of `<address, value>` items. The store buffer is pushed by the owning CPU, and popped by the background thread `T`.
The background thread `T` does only one thing:

- Lock the global lock `G`.
- Pop an item `<addr, value>` from the store buffer of a CPU, and write the value to main memory: `MainMemory[addr] = value`.
- Unlock the global lock `G`.
The procedure for a CPU to execute an instruction is described below.

STORE instruction

- Push the item `<addr, value>` to its store buffer.

LOAD instruction

- Lock the global lock `G`.
- If `addr` exists in its store buffer, return the corresponding `value`[2]. Otherwise return `MainMemory[addr]`.
- Unlock the global lock `G`.
MFENCE instruction

- Wait until its store buffer is eventually emptied by the background thread `T`.
ATOMIC instruction

- Lock the global lock `G`.
- Run the atomic instruction, using the subroutines described above for `LOAD` and `STORE`.
- Empty its own store buffer: pop every item `<addr, value>` from the store buffer and write it to main memory: `MainMemory[addr] = value`, until the store buffer is empty.
- Unlock the global lock `G`.
Note that the semantics of `LOAD` and `STORE` provide the expected consistency on single-threaded programs.
An Application
Let’s analyze why the familiar spinlock implementation below is correct under the x86-64 memory model:
```c
void Init(uint8_t* lock) { *lock = 0; }
```
We will prove the correctness via a token-counting argument. Each `<lock, 0>` item in a store buffer counts as one token, and if `MainMemory[lock] == 0`, that also counts as one token. By definition, at any moment the number of tokens cannot go below 0.
By the abstract machine semantics above, it’s not hard to prove that:

- The background thread `T` cannot increase the total number of tokens.
- Each `Unlock()` call creates one token.
- Each `Lock()` call cannot return until it successfully consumes at least one token. (If the CAS succeeded by seeing a `0` in its store buffer, that token is lost after the CAS, because the CAS flushes the store buffer and also changes the memory value to `1`. If the CAS succeeded by seeing a `0` in main memory, that token is also lost, because the store buffer item with the new value `1` is flushed to memory, overwriting the `0`.)
Initially there is one token (created by the `Init()` call). Since `Unlock()` may only be called after[3] (guaranteed by program order) `Lock()`, the total number of tokens cannot exceed one at any moment. So after a `Lock()` returns, there must be zero tokens, and no other `Lock()` can return. The total number of tokens goes back to one only when the `Unlock()` in that program is called, and only after that may another `Lock()` operation return. So the `Lock() -> Unlock()` time intervals are pairwise non-overlapping, providing the mutual exclusion one would expect.
Footnotes
1. While the description is logically equivalent to the guarantees provided by the hardware, it is not how the hardware physically implements the memory subsystem; the actual implementation is far more efficient. ↩︎
2. Of course, if `addr` shows up multiple times in the store buffer, we should return the `value` of the latest version. ↩︎
3. Throughout this argument, the time relationship refers to wall clock time (or, equivalently, the relative position in the interleaved event sequence of all CPUs). ↩︎