This is another note on JavaScriptCore (JSC)'s DFG JIT. Today's topic is DFG's IR design and a related optimizer pass called CPS Rethreading. As before, disclaimer first:
Disclaimer
I do not work on JavaScriptCore, and my experience with JavaScript is very limited. Everything described in this post comes from my understanding of the JSC source code; it may be inaccurate or outright wrong. Please feel free to email me at haoranxu510 [at] gmail.com with any corrections or for discussion.
In DFG’s IR, each function consists of a list of basic blocks, and each basic block contains a list of IR instructions. An IR instruction is also implicitly the value it produces, and can be used as an operand to other IR instructions. Everything so far is similar to a typical SSA representation. However, there are two important differences:
- Each IR instruction may only use SSA values (implicitly produced by other IR instructions) from the same basic block.
- Unlike in SSA, DFG does not have an IR instruction named “Phi”. Instead, in DFG’s IR, each function has a list of “slots”: local variables that one can read from and write to. Two IR instructions, `GetLocal` and `SetLocal`[1], read and write a slot respectively.

As one would expect, a `GetLocal` takes a slot number constant as operand and produces an SSA value. A `SetLocal` takes an SSA value (i.e., an IR instruction that produces a value) and a slot number constant as operands, and produces nothing.
The above representation (called `LoadStore` form in JSC) is very flexible for generating IR, but bad for optimization, because in this representation it is hard to reason about the data stored in local variables. This is where the CPS Rethreading pass comes in: it transforms the graph to the so-called `ThreadedCPS` form. The pass optimizes away redundant loads, adds auxiliary Phi nodes that describe the data flow of loads and stores, and also performs liveness analysis.
The most interesting design choice here is that the Phi nodes are auxiliary and optional, and are not part of the IR graph: the correctness of the IR graph is unaffected even if all Phi nodes were removed; doing so merely puts the graph back into `LoadStore` form.

As a result, a transformation pass may choose either to maintain the Phi nodes and the liveness information, or to simply mark that the graph has been put back into `LoadStore` form (after which the CPS Rethreading pass can be called to bring it to `ThreadedCPS` form again).
The extra information available in `ThreadedCPS` form is the following:

- Each basic block gets a list of auxiliary Phi nodes. Unlike Phi nodes in SSA form, a Phi node here is parametrized by a slot number constant, so it denotes the value of that local variable at the start of the basic block. These Phi nodes are referenced only by other Phi nodes (from other basic blocks) and by the auxiliary data described below; they are never used as an operand by a normal IR instruction.
- Each basic block gets two auxiliary arrays, `variablesAtHead` and `variablesAtTail`, of length `N` (where `N` is the total number of local variable slots of the function). The `variablesAtHead` array denotes the value of every live variable at the start of the basic block. The `variablesAtTail` array denotes the value of every available variable (defined as either live at the start of the basic block, or modified inside the basic block) at the end of the basic block.
- Each `GetLocal` node gets an auxiliary pointer denoting the value this `GetLocal` would yield. The pointer points either to a value-producing instruction in the same basic block (meaning the `GetLocal` would yield that value), or to a Phi node in the same basic block (meaning the `GetLocal` would yield the value of the Phi).
- Redundant `GetLocal`s to the same local variable in the same basic block are removed. There are two cases:
  (a) A `GetLocal` after another `GetLocal` can be replaced by the previous `GetLocal`.
  (b) A `GetLocal` after a `SetLocal` can be replaced by the operand of the `SetLocal`.
There are a few points worth noting:

- The second point above (the two per-block arrays) implies that the total space consumption is `O(NM)`, where `N` is the total number of local variable slots and `M` is the number of basic blocks.
- Redundant `SetLocal`s to the same local variable inside the same basic block are not removed. This probably has something to do with OSR.
- The design above makes sure that all the extra information (most notably, the Phi nodes) is auxiliary: it can be safely dropped without affecting correctness.
The Algorithm
The algorithm is implemented in DFGCPSRethreadingPhase.cpp. For simplicity, as before, we will focus only on `GetLocal` and `SetLocal`, and ignore the logic related to `Flush`, `PhantomLocal`, `SetArgumentDefinitely`, and `SetArgumentMaybe`.
The first step of the algorithm resets the existing state, in case the pass has been invoked earlier. It clears the `variablesAtHead` and `variablesAtTail` arrays of each basic block, removes all Phi nodes, and reserves space for the annotation pointer in each `GetLocal` node (by simply repurposing the unused `child1` field).
In the second step, for each basic block, it iterates all the IR instructions from top to bottom. The `variablesAtTail` array is used as a scratchpad to keep track of the current value of each variable at the instruction being iterated.
- If it is a `GetLocal` instruction and `variablesAtTail[i]` is `nullptr`, then this is the first time variable `i` is used in this basic block, so its value must come from a Phi node. We create a Phi node and store it into `variablesAtHead[i]`. `variablesAtTail[i]` is also updated to the current `GetLocal` node[2].
- If it is a `GetLocal` instruction but `variablesAtTail[i]` is not `nullptr`, then local variable `i` has already been used in this basic block. Thus, this `GetLocal` can always be removed from the graph by replacing it with another node (with the details described earlier). Thanks to the design of the IR, this `GetLocal` may only be used by later IR instructions in the same basic block, so the replacement can be done easily[3].
- If it is a `SetLocal` instruction, we simply update the corresponding slot in `variablesAtTail`.
The third step builds the incoming edges for all Phi nodes. The Phi nodes created in the previous step are put into a list[4], and the algorithm iterates until the list is empty. Each iteration removes one element from the list: say, a Phi node for some variable `i` in basic block `B`. Then, for every predecessor `P` of basic block `B`:
- If `P.variablesAtTail[i]` is `nullptr`, then there isn't a Phi node in `P` for local variable `i` yet. So a new one is created, and both `P.variablesAtHead[i]` and `P.variablesAtTail[i]` are updated. The newly created Phi node is put into the list.
- Otherwise, the predecessor of the current Phi node for that basic block is simply `P.variablesAtTail[i]`[5].
Note that the second and third steps together also guarantee that `variablesAtHead` correctly contains the information for all live variables at the start of each basic block.
Conclusion
In my opinion, the design of the DFG IR is really neat. The standard SSA form is best suited for optimization, but it is both difficult and slow to construct and transform. The DFG IR makes interesting design choices that intentionally deviate from the standard SSA form to improve compilation speed and usability, which matches the DFG JIT's design goal of being a fast optimizer. The CPS Rethreading pass is a neat algorithm that serves multiple purposes simultaneously: it simplifies and canonicalizes the IR, constructs an SSA-like Phi data flow, and performs liveness analysis.
Footnotes
[1] In fact, there are also `Flush` and `PhantomLocal`, which are related to OSR, and `SetArgumentDefinitely` and `SetArgumentMaybe`, which are related to the special-casing of arguments, but we will overlook them in this article for simplicity. ↩︎

[2] It seems the algorithm intentionally updates it to the `GetLocal` node instead of the Phi node because that provides more information to users: one can easily trace to the Phi from the `GetLocal`, but not the other way around. ↩︎

[3] Specifically, all we need to do is maintain a replacement map, and when we iterate a node, check whether any of its operands is in the replacement map; if yes, we replace it. The replacement map could simply be a `std::map`, but even a hash table lookup is slow, so JSC has an 8-byte scratch field in each `Node` struct that can be used for this purpose (or other similar purposes). ↩︎

[4] In fact, there are three different kinds of local variables: `Argument`, `Local`, and `Tmp`, so correspondingly there are three different lists. I think this is only an implementation detail: the three lists could have been merged into one, but the current code design (where the `Kind` is used as a template parameter in several APIs) makes it easiest to just have three lists. ↩︎

[5] The actual implementation is a bit more complex. If the predecessor node is a `GetLocal`, it is further forwarded to the value of that `GetLocal` (as computed in the auxiliary pointer); but if the predecessor node is a `SetLocal`, it is not forwarded to the `SetLocal`'s operand. These are only implementation details to best fit the needs of the clients (the other optimization passes), so we omit them to highlight the core idea. ↩︎