"Such beautiful stars... I think we could only see them because we came this far. That's why... let's go even farther..."

Understanding JavaScriptCore's DFG JIT: CPS Rethreading Pass

This is another note on JavaScriptCore (JSC)'s DFG JIT. Today's topic is DFG's IR design and a related optimizer pass called CPS Rethreading. As before, disclaimer first:

Disclaimer
I do not work on JavaScriptCore, and my experience with JavaScript is very limited. Everything described in this post comes from my understanding of the JSC source code. It may be inaccurate or outright wrong. Please feel free to email me at haoranxu510 [at] gmail.com with any corrections or for discussion.

In DFG’s IR, each function consists of a list of basic blocks, and each basic block contains a list of IR instructions. An IR instruction is also implicitly the value it produces, and can be used as an operand to other IR instructions. Everything so far is similar to a typical SSA representation. However, there are two important differences:

1. Each IR instruction can only use the SSA values (implicitly produced by other IR instructions) in the same basic block.
2. Unlike in SSA, DFG does not have an IR instruction named “Phi”. Instead, in DFG’s IR, each function has a list of “slots”, which are local variables that one can read from and write to. There are two IR instructions, GetLocal and SetLocal[1], which read from and write to a slot respectively.

As one would expect, a GetLocal takes a slot number constant as operand, and produces an SSA value. A SetLocal takes an SSA value (i.e., an IR instruction that produces a value) and a slot number constant as operands, and produces nothing.

The above representation (called LoadStore form in JSC) is very flexible for generating the IR, but bad for optimization, because in such a representation it is hard to reason about the data stored in local variables. This is where the CPS Rethreading pass comes in, which transforms the graph into the so-called ThreadedCPS form. The CPS Rethreading pass optimizes away redundant loads, adds auxiliary Phi nodes that describe the data flow of loads and stores, and also performs liveness analysis.

The most interesting design choice here is that the Phi nodes are auxiliary and optional, and are not part of the IR graph: the correctness of the IR graph is not affected even if all Phi nodes were removed, and by doing so, it only puts the graph back to LoadStore form.

As a result, a transformation pass may choose to either maintain the Phi nodes and the liveness information, or simply mark that the graph has been put back to LoadStore form (and then the CPS Rethreading pass can be called to bring it to ThreadedCPS form again).

The extra information available in ThreadedCPS form is the following:

1. Each basic block gets a list of auxiliary Phi nodes. Unlike Phi nodes in SSA form, a Phi node here is parametrized by a slot number constant, so it denotes the value of that local variable at the start of the basic block. These Phi nodes are only referenced by other Phi nodes (from other basic blocks) and by the auxiliary structures described below. They are never used as operands by normal IR instructions.
2. Each basic block gets two auxiliary arrays variablesAtHead and variablesAtTail of length N (where N is the total number of local variable slots of the function). The variablesAtHead array denotes the value of all live variables at the start of the basic block. The variablesAtTail array denotes the value of all available (defined here as either live at the start of the basic block, or modified inside the basic block) variables at the end of the basic block.
3. Each GetLocal node gets an auxiliary pointer which denotes the value this GetLocal would yield. The pointer points to either a value-producing instruction in the same basic block (which means the GetLocal would yield that value), or a Phi node in the same basic block (which means the GetLocal would yield the value of the Phi).
4. Redundant GetLocals to the same local variable in the same basic block are removed. There are two cases:
(a). A GetLocal after another GetLocal can be replaced by the previous GetLocal.
(b). A GetLocal after a SetLocal can be replaced by the operand of the SetLocal.

There are a few points worth noting:

1. Point (2) above implies that the total space consumption is O(NM) where N is the total number of local variable slots and M is the number of basic blocks.
2. Redundant SetLocals to the same local variable inside the same basic block are not removed. This probably has something to do with OSR.
3. The design above makes sure that all the extra information (most notably, the Phi nodes) is auxiliary: it can be safely dropped without affecting correctness.

The Algorithm

The algorithm is implemented in DFGCPSRethreadingPhase.cpp. For simplicity, as before, we will focus on GetLocal and SetLocal only, and ignore Flush, PhantomLocal, SetArgumentDefinitely and SetArgumentMaybe related logic.

The first step of the algorithm resets the existing state in case the pass has been invoked earlier. It clears the variablesAtHead and variablesAtTail arrays for each basic block, removes all Phi nodes, and reserves space for the annotation pointer of each GetLocal node (by simply repurposing the unused child1 field).

In the second step, for each basic block, it iterates over all the IR instructions from top to bottom. The variablesAtTail array is used as a scratchpad to keep track of the current value of each variable at the IR instruction being iterated.

1. If it is a GetLocal instruction, and variablesAtTail[i] is nullptr, then this is the first time the variable i is used in this basic block, and the value of this variable should come from a Phi node. So we create a Phi node and store it into variablesAtHead[i]. The variablesAtTail[i] should also be updated to the current GetLocal node[2].
2. If it is a GetLocal instruction, but variablesAtTail[i] is not nullptr, then the local variable i has already been accessed in this basic block. Thus, this GetLocal can always be removed from the graph by replacing it with another node (as described earlier). Thanks to the design of the IR, this GetLocal may only be used by later IR instructions in the same basic block, so the replacement can be done easily[3].
3. If it is a SetLocal instruction, then we simply update the corresponding slot in variablesAtTail.

The third step builds the incoming edges for all Phi nodes. The Phi nodes created in the previous step are put into a list[4]. The algorithm iterates until the list becomes empty. Each iteration removes an element, say a Phi node for some variable i in basic block B, from the list. Then for each predecessor P of basic block B:

1. If P.variablesAtTail[i] is nullptr, then there isn’t a Phi node in P for local variable i yet. So, a new one is created, and P.variablesAtHead[i] and P.variablesAtTail[i] are updated. The newly created Phi node is put into the list.
2. Otherwise, the predecessor of the current Phi node for that basic block should simply be P.variablesAtTail[i][5].

Note that the second step and the third step also guarantee that variablesAtHead correctly contains the information for all live variables at the start of a basic block.

Conclusion

In my opinion, the design of the DFG IR is really neat. The standard SSA form is best suited for optimization, but it is both difficult and slow to construct and transform. The DFG IR makes interesting design choices that intentionally deviate from the standard SSA form to improve compilation speed and usability. This matches the DFG JIT’s design goal as a fast optimizer. The CPS Rethreading pass is a neat algorithm that serves multiple purposes simultaneously: it simplifies and canonicalizes the IR, constructs an SSA-like Phi data flow, and performs liveness analysis.

Footnotes

1. In fact, there are also Flush and PhantomLocal which are related to OSR, and SetArgumentDefinitely and SetArgumentMaybe which are related to special casing of arguments, but we will overlook them in this article for simplicity. ↩︎

2. It seems like the algorithm intentionally updates it to the GetLocal node instead of the Phi node because doing so provides more information to users: one can easily trace to the Phi from the GetLocal, but not the other way around. ↩︎

3. Specifically, all we need to do is maintain a replacement map, and when we iterate over a node, check whether any of its operands are in the replacement map. If so, we replace them. The replacement map could simply be a std::map, but a hash table lookup is still slow, so in JSC there is an 8-byte scratch field in each Node struct that can be used for this purpose (or other similar purposes). ↩︎

4. In fact, there are three different kinds of local variables: Argument, Local and Tmp. Correspondingly, there are three different lists. I think this is only an implementation detail: the three lists could have been merged into one, but the current code design (where the Kind is used as a template parameter in several APIs) makes it easiest to just have three lists. ↩︎

5. The actual implementation is a bit more complex. If the predecessor node is a GetLocal, it is further forwarded to the value of that GetLocal (as computed in its auxiliary pointer), but if the predecessor node is a SetLocal, it is not forwarded to its operand. These are only implementation details to best fit the needs of the clients (the other optimization passes), so we omit them to highlight the core idea. ↩︎