“Stars this beautiful, too, are something I think we could only see because we came this far. So... onward, even farther...”

# Building the fastest Lua interpreter.. automatically!

It is well-known that writing a good VM for a dynamic language is never an easy job. High-performance interpreters, such as the JavaScript interpreter in Safari, or the Lua interpreter in LuaJIT, are often hand-coded in assembly. If you want a JIT compiler for better performance, well, you’ve got some more assembly to write. And if you want the best possible performance with multiple-tier JIT compilation… Well, that’s assembly all the way down.

I have been working on a research project to make writing VMs easier. The idea arises from the following observation: writing a naive interpreter is not hard (just write a big switch-case), but writing a good interpreter (or JIT compiler) is hard, as it unavoidably involves hand-coding assembly. So why can’t we implement a special compiler to automatically generate a high-performance interpreter (and even the JIT) from “the big switch-case”, or more formally, a semantic description of what each bytecode does?

### The LuaJIT Remake Project

I chose Lua as the experiment target for my idea, mainly because Lua is concise yet supports almost every language feature one can find in dynamic languages, including exotic ones like stackful coroutines. I named my project LuaJIT Remake (LJR) because in the long term, it will be a multi-tier method-based JIT compiler for Lua.

After months of work on the project, I’ve finally got some early results to share. LJR now has a feature-complete Lua 5.1 interpreter that is automatically generated at build time using a meta-compiler called Deegen (for “Dynamic language Execution Engine Generator”). More importantly, it is the world’s fastest Lua interpreter to date, outperforming LuaJIT’s interpreter by 28% and the official Lua interpreter by 171% on average on a variety of benchmarks[1].

The figure below illustrates the performance of our interpreter, the LuaJIT interpreter, and the official PUC Lua interpreter. PUC Lua’s performance is normalized to 1 as a baseline.

As the figure shows, our interpreter performs better than LuaJIT’s hand-coded-in-assembly interpreter on 31 out of the 34 benchmarks[2], and on geometric average, we run 28% faster than the LuaJIT interpreter, and at almost 3x the speed of official PUC Lua.

Enough of the numbers. Now I will dive a bit into how my approach works.

### Why Assembly After All?

To explain how I built the fastest Lua interpreter, one needs to understand why (previously) the best interpreters have been hand-coded in assembly. This section is all about background. If you are already familiar with interpreters, feel free to skip to the next section.

Mike Pall, the author of LuaJIT, has explained this matter clearly in this great email thread back in 2011. The problem with the “big switch-case” approach is that C/C++ compilers simply cannot handle such code well. Although eleven years have passed, the situation hasn’t changed much. In my experience, even if a function has only one fast path and one cold path, and the cold path has been nicely annotated with unlikely, the LLVM backend will still pour a bunch of unnecessary register moves and stack spills into the fast path[3]. And for the “big switch-case” interpreter loop with hundreds of fast paths and cold paths, it’s unsurprising that compilers fail to work well.

Tail calls, also known as continuation-passing style, are an alternative to the switch-case-based interpreter loop. Basically, each bytecode gets its own function that does the job, and when the job is done, control is transferred to the next function via a tail call dispatch (i.e., a jump instruction at machine code level). So although the bytecode functions are conceptually calling each other, they are really jumping to each other at machine code level, and there is no unbounded stack growth. Another way to look at it is that each “case” clause in the switch-case interpreter loop becomes a function. The “switch” jumps (i.e., tail calls) to the corresponding “case” clause, and at the end of the case a jump (i.e., tail call) is executed to jump back to the switch dispatcher[4].
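To make the shape of a tail-call interpreter concrete, here is a minimal sketch. The three-instruction bytecode, the handler names, and the TAIL_RETURN macro are all made up for this illustration. The macro uses [[clang::musttail]] where the compiler reports support for it and otherwise falls back to a plain return, in which case the jump is no longer guaranteed.

```cpp
#include <cassert>
#include <cstdint>

// Portability shim (an assumption of this sketch): use the guaranteed tail
// call where available, otherwise a plain return that the optimizer may or
// may not turn into a jump.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define TAIL_RETURN [[clang::musttail]] return
#  endif
#endif
#ifndef TAIL_RETURN
#  define TAIL_RETURN return
#endif

// Hypothetical bytecode: one byte per instruction, operating on an accumulator.
enum Op : uint8_t { OP_INC = 0, OP_DOUBLE = 1, OP_HALT = 2 };

// musttail requires caller and callee to share one prototype.
using Handler = int64_t (*)(const uint8_t* pc, int64_t acc);

int64_t DoInc(const uint8_t* pc, int64_t acc);
int64_t DoDouble(const uint8_t* pc, int64_t acc);
int64_t DoHalt(const uint8_t* pc, int64_t acc);

// Dispatch table indexed by opcode.
static const Handler kHandlers[] = {DoInc, DoDouble, DoHalt};

int64_t DoInc(const uint8_t* pc, int64_t acc) {
    acc += 1;
    pc += 1;
    TAIL_RETURN kHandlers[*pc](pc, acc);  // dispatch to the next bytecode
}

int64_t DoDouble(const uint8_t* pc, int64_t acc) {
    acc *= 2;
    pc += 1;
    TAIL_RETURN kHandlers[*pc](pc, acc);  // dispatch to the next bytecode
}

int64_t DoHalt(const uint8_t* pc, int64_t acc) {
    (void)pc;
    return acc;  // the only handler that really returns
}

// Entry point: start with accumulator 0 and jump into the first handler.
int64_t Run(const uint8_t* program) {
    assert(program != nullptr);
    return kHandlers[program[0]](program, 0);
}
```

Running the program {OP_INC, OP_INC, OP_DOUBLE, OP_HALT} through Run yields 4. Note that every handler has the same prototype, which is exactly the restriction on musttail discussed in the following paragraphs.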

With the tail-call approach, each bytecode now gets its own function, and the pathological case for the C/C++ compiler is gone. And as shown by the experience of the Google protobuf developers, the tail-call approach can indeed be used to build very good interpreters. But can it match hand-written assembly interpreters? Unfortunately, the answer is still no, at least in its current state.

The main obstacle to the tail-call approach is the callee-saved registers. Since each bytecode function is still a function, it is required to abide by the calling convention; specifically, every callee-saved register must retain its old value at function exit. So if a bytecode function needs to use a callee-saved register, it needs to save the old value on the stack and restore it at the end[5]. The only way to solve this problem is to use a calling convention with no callee-saved registers. Unfortunately, Clang is (to date) the only compiler that offers a guaranteed-tail-call intrinsic (the [[clang::musttail]] annotation), but it has no user-exposed calling convention without callee-saved registers. So you lose 6 (or 8, depending on cconv) of the 15 registers for no reason on x86-64, which is clearly bad.

Another obstacle to the tail-call approach is, again, the calling convention. No unbounded stack growth is a requirement, but tricky problems can arise when the caller and callee function prototypes do not match and some parameters are passed on the stack. So Clang makes a compromise and requires the caller and callee to have identical function prototypes if musttail is used. This is extremely annoying in practice once you try to write anything serious under this limitation (as a proof of concept I hand-wrote a naive Lua interpreter using musttail, so I have first-hand experience of how annoying it is).

### Generating the Interpreter: Another Level Of Indirection Solves Everything

As you might have seen, the root of all the difficulties is that our tool (C/C++) is not ideal for the problem we want to solve. So what’s the solution?

Of course, throwing the tool away and resorting to sheer force (hand-coding assembly) is one solution, but doing so also results in high engineering cost. Can we do it more swiftly?

It is well-known that all problems in computer science can be solved by another level of indirection. In our case, C/C++ is a very good tool for describing the semantics of each bytecode (i.e., what each bytecode should do), but not a good tool for writing the most efficient interpreter. So what if we add one level of indirection? We write the bytecode semantic description in C++, compile it to LLVM IR, and feed the IR into a special-purpose compiler. The special-purpose compiler takes care of all the dirty work, applying the proper transformations to the IR and finally generating a nice tail-call-based interpreter.

For example, at the LLVM IR level, it is trivial to make a function use the GHC calling convention (a convention with no callee-saved registers) and to transform the functions so that they all share one prototype, thus solving the two major problems with musttail tail calls that are unsolvable at the C/C++ level. In fact, Deegen (our meta-compiler that generates the interpreter) does a lot more than producing the tail calls, which we will cover in the rest of this post.

### Hide All the Ugliness Behind Nice APIs

In the Deegen framework, the semantics of each bytecode is described by a C++ function. One of the most important design philosophies of Deegen is to abstract away all the nasty parts of an interpreter. I will demonstrate with a simplified example for the Add bytecode:

The function Add takes two boxed values (a value along with its type), lhs and rhs, as input. It first checks whether both lhs and rhs are doubles (the Is<tDouble>() check). If not, we throw an error. Otherwise, we add them together by casting each boxed value to its actual type (double) and doing a normal double addition. Finally, we create a new boxed value of double type using TValue::Create<tDouble>(), return it as the result of the bytecode, and dispatch to the next bytecode through the Return() API call (note that this is not the C keyword return).
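The description above can be sketched as follows. To keep the snippet compilable on its own, TValue, tDouble, ThrowError and Return are toy stand-ins written for this illustration: they only mirror the names used in the text and are not Deegen's real definitions. The error message string and the As<tDouble>() accessor are likewise assumptions.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Toy stand-ins (NOT Deegen's real definitions). Only tDouble is modelled.
struct tDouble {};

struct TValue {
    bool isDouble = false;  // a default-constructed TValue is "not a double"
    double payload = 0.0;

    template <typename T>
    bool Is() const {
        static_assert(std::is_same<T, tDouble>::value, "sketch only models tDouble");
        return isDouble;
    }
    template <typename T>
    double As() const {
        static_assert(std::is_same<T, tDouble>::value, "sketch only models tDouble");
        assert(isDouble);
        return payload;
    }
    template <typename T>
    static TValue Create(double v) {
        static_assert(std::is_same<T, tDouble>::value, "sketch only models tDouble");
        TValue r;
        r.isDouble = true;
        r.payload = v;
        return r;
    }
};

// Instead of really dispatching, the toy APIs record what the bytecode did.
struct Outcome {
    bool returned;
    double value;
    std::string error;
};
static Outcome g_last = {false, 0.0, ""};

void ThrowError(const char* msg) { g_last = Outcome{false, 0.0, msg}; }
void Return(TValue v) { g_last = Outcome{true, v.As<tDouble>(), ""}; }

// The bytecode semantics: type check, then unbox, add, box, and "Return".
void Add(TValue lhs, TValue rhs) {
    if (!lhs.Is<tDouble>() || !rhs.Is<tDouble>()) {
        ThrowError("Can't add!");  // hypothetical message
    } else {
        double res = lhs.As<tDouble>() + rhs.As<tDouble>();
        Return(TValue::Create<tDouble>(res));
    }
}
```

Note that Add itself contains no decoding, operand loading, or dispatch logic: in the toy version those are simply absent, whereas in Deegen they are generated around the function.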

Notice how much nasty work we have abstracted away: decoding the bytecode, loading the operands and constants, throwing out errors, storing results to the stack frame, and dispatching to the next bytecode. All of these interpreter details either happen automatically, or happen with a simple API call (e.g., ThrowError or Return).

Now let’s extend our Add to add support for the Lua __add metamethod semantics:

The GetMMForAdd is some arbitrary runtime function call that gets the metamethod. Deegen does not care about its implementation: the bytecode semantic description is just a normal C++ function, so it can do anything allowed by C++, of course including calling other C++ functions. The interesting part is the MakeCall API. It allows you to call other Lua functions with the specified parameters, and most importantly, a return continuation. The MakeCall API does not return. Instead, when the called function returns, control will be returned to the return continuation (the AddContinuation function). The return continuation function is similar to the bytecode function: it has access to all the bytecode operands, and additionally, it has access to all the values returned from the call. In our case, the semantics for Lua __add is to simply return the first value returned by the call as the result of the bytecode, so we use GetReturnValueAtOrd(0) to get that value, and use the Return API we have covered earlier to complete the Add bytecode and dispatch to the next bytecode.

Again, notice how much nasty work we have abstracted away: all the details of creating the new Lua frame, adjusting the parameters and return values (overflowing arguments need to go to the variadic arguments if the callee accepts them, and insufficient arguments need to be filled with nil), transferring control to the callee function, etc., are hidden behind a mere MakeCall API. Furthermore, all of these are language-neutral: if we were to target some other language (e.g., Python), most of the Deegen code that implements MakeCall could be reused.
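The control flow of MakeCall with a return continuation can be illustrated with a toy model. Everything below is a stand-in invented for this sketch: values are plain doubles, a "Lua function" is a std::function, and the toy MakeCall simply calls the callee on the C stack and then invokes the continuation, whereas the real API never returns and transfers control through tail calls.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Toy model: values are doubles, a "Lua function" maps arguments to results.
using Values = std::vector<double>;
using LuaFunction = std::function<Values(const Values&)>;
using ReturnContinuation = void (*)(const Values& returned);

static double g_bytecodeResult = 0.0;

// Stand-in for the Return API: completes the bytecode with one result.
void Return(double v) { g_bytecodeResult = v; }

// Stand-in for MakeCall. Unlike the real API, this version does return to its
// caller; it only demonstrates who receives the callee's return values.
void MakeCall(const LuaFunction& callee, const Values& args, ReturnContinuation k) {
    k(callee(args));
}

// Return continuation for Add: the bytecode result is the first returned
// value. The sketch assumes the callee returns at least one value.
void AddContinuation(const Values& returned) {
    assert(!returned.empty());
    Return(returned[0]);
}

// Slow path of Add once a metamethod has been found (lookup not modelled).
void AddViaMetamethod(const LuaFunction& metamethod, double lhs, double rhs) {
    MakeCall(metamethod, Values{lhs, rhs}, AddContinuation);
}
```

The point of the shape is that nothing in AddViaMetamethod runs after MakeCall: all remaining work lives in AddContinuation, which is what lets the real implementation keep the C stack empty at bytecode boundaries.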

The use of return continuation is designed to support Lua coroutines. Since Lua coroutines are stackful, and yield can happen anywhere (as yield is not a Lua keyword, but a library function), we need to make sure that the C stack is empty at any bytecode boundary, so we can simply tail call to the other continuation to accomplish a coroutine switch. This design also has a few advantages compared with PUC Lua’s coroutine implementation:

1. We have no fragile longjmps.
2. We can easily make any library function that calls into VM yieldable using this mechanism. In fact, the error message cannot yield across C call frames is gone completely in LJR: all Lua standard library functions, including exotic ones like table.sort, are redesigned to be yieldable using this mechanism.

### Automation, Automation, and More Automation!

The bytecode semantic function specifies the execution semantics of the bytecode, but one still needs to specify the definition of the bytecode. For example, one needs to know that AddVN takes two operands where LHS is a bytecode slot and RHS is a number value in the constant table, that AddVN returns one value, and that it always falls through to the next bytecode and cannot branch anywhere else. In Deegen, this is achieved by a bytecode specification language.

Again, let’s use the Add as the example:

There are a few things going on here so we will go through them one by one. First of all, the DEEGEN_DEFINE_BYTECODE is a macro that tells us that you are defining a bytecode.

The Operands(...) API call tells us that the bytecode has two operands, each of which can be either a bytecode slot (a slot in the call frame) or a constant in the constant table. Besides BytecodeSlotOrConstant, one can also use Literal to define literal operands, and BytecodeRange to define a range of bytecode values in the call frame.

The Result(BytecodeValue) API call tells us that the bytecode returns one value and does not branch. The enum key BytecodeValue means the bytecode returns one TValue. One can also use enum key CondBr to specify that the bytecode can branch, or just no argument to specify that the bytecode doesn’t return anything.

The Implementation(...) API specifies the execution semantics of the bytecode, which is the Add function we just covered.

The interesting part is the Variant API calls. They allow one to create different variants of the bytecode. For example, in Lua, we have the AddVV bytecode to add two bytecode values, the AddVN bytecode to add a bytecode value and a constant double, and the AddNV bytecode to add a constant double and a bytecode value. In a traditional interpreter implementation, the implementation of all of these bytecodes must be written by hand, which is not only laborious, but also error-prone. However, in Deegen’s framework, all you need to do is to specify them as Variants, and we will do all the work for you!

The IsConstant API optionally allows further specifying the type of the constant, as shown in the IsConstant<tDoubleNotNaN>() usage in the snippet. Deegen implements a special LLVM optimization pass to simplify the execution semantics function based on the known and speculated type information of the operands. For example, for the bytecode variant where rhs is marked as IsConstant<tDoubleNotNaN>(), Deegen will realize that the rhs.Is<tDouble>() check in the bytecode function must be true, and optimize it out. This allows us to automatically generate efficient specialized bytecode implementations, without adding engineering cost to the user. (And by the way, the tDouble and tDoubleNotNaN things, or more formally, the type lattice of the language, are also user-defined. Deegen is designed to be a generic meta-compiler: it is not welded to Lua).

Finally, Deegen will generate a user-friendly CreateAdd function for the user frontend parser to emit an Add bytecode. For example, the frontend parser can write the following code to generate an Add bytecode that adds bytecode slot 1 with constant 123.4, and stores the output into slot 2:

The implementation of CreateAdd will automatically insert constants into the constant table, select the most suitable variant based on the input types (or throw an error if no satisfying variant can be found), and append the bytecode to the bytecode stream. The concrete layout of the bytecode in the bytecode stream is fully hidden from the user. This provides a maximally user-friendly and robust API for the user parser logic to build the bytecode stream.
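The three responsibilities just listed (interning constants, selecting a variant, appending to the stream) can be sketched with a toy builder. The types ToyBytecodeBuilder, Operand and Instr, the helpers Local and Cst, and the instruction layout are all invented for this illustration; the generated Deegen API and its real bytecode encoding are not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical operand description: either a frame slot or a constant.
struct Operand {
    bool isConstant;
    int slot;      // meaningful when !isConstant
    double value;  // meaningful when isConstant
};
Operand Local(int slot) { return Operand{false, slot, 0.0}; }
Operand Cst(double v) { return Operand{true, -1, v}; }

enum class Opcode : uint8_t { AddVV, AddVN, AddNV };

// Toy fixed-width instruction; lhs/rhs are slot or constant-table indices.
struct Instr {
    Opcode op;
    int lhs;
    int rhs;
    int out;
};

struct ToyBytecodeBuilder {
    std::vector<double> constants;
    std::vector<Instr> stream;

    // Insert the constant if it is new and return its table index.
    int InternConstant(double v) {
        for (std::size_t i = 0; i < constants.size(); ++i) {
            if (constants[i] == v) return static_cast<int>(i);
        }
        constants.push_back(v);
        return static_cast<int>(constants.size()) - 1;
    }

    // Pick the variant from the operand kinds and append it to the stream.
    void CreateAdd(Operand lhs, Operand rhs, int outSlot) {
        if (!lhs.isConstant && !rhs.isConstant) {
            stream.push_back(Instr{Opcode::AddVV, lhs.slot, rhs.slot, outSlot});
        } else if (!lhs.isConstant) {
            int c = InternConstant(rhs.value);
            stream.push_back(Instr{Opcode::AddVN, lhs.slot, c, outSlot});
        } else if (!rhs.isConstant) {
            int c = InternConstant(lhs.value);
            stream.push_back(Instr{Opcode::AddNV, c, rhs.slot, outSlot});
        } else {
            // Assumption of this sketch: no variant accepts two constants.
            throw std::invalid_argument("no Add variant takes two constants");
        }
    }
};
```

With this toy builder, CreateAdd(Local(1), Cst(123.4), 2) selects AddVN and places 123.4 at index 0 of the constant table, without the caller ever naming a variant or an index.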

This link is the real implementation of all the Lua arithmetic bytecodes in LuaJIT Remake. It uses a few features that we haven’t covered yet: the DEEGEN_DEFINE_BYTECODE_TEMPLATE macro allows defining a template of bytecodes, so Add, Sub, Mul, etc., can all be defined at once, minimizing engineering cost. The EnableHotColdSplitting API enables automatic hot-cold splitting based on speculated and proven input operand types, and splits out the cold path into a dedicated function, which improves the final code quality (recall the earlier discussion on the importance of hot-cold code splitting?).

And below is the actual disassembly of the interpreter generated by Deegen for Lua’s AddVV bytecode. Comments are manually added by me for exposition purposes:

As one can see, thanks to all of our optimizations, the quality of the assembly generated by Deegen has no problem rivalling hand-written assembly.

### Inline Caching API: The Tricks of the Trade

A lot of LJR’s speedup over the LuaJIT interpreter comes from our support of inline caching. We have rewritten the Lua runtime from scratch. In LJR, table objects are not stored as a plain hash table with an array part. Instead, our table implementation employs hidden classes, using a design mostly mirroring the hidden class design in JavaScriptCore.

Hidden class allows efficient inline caching, a technique that drastically speeds up table operations. Briefly speaking, one can think of a hidden class as a hash-consed metadata object that describes the layout of a table object, or (simplified for the purpose of exposition), a hash map from string key to the storage slot in the table storing the value of this string key.

Let’s use the TableGetById bytecode (aka, TGETS in LuaJIT) as example. TableGetById takes a table T and a fixed constant string k as input, and outputs T[k].

Due to the natural use case of dynamic languages, for a fixed TableGetById bytecode, the tables it operates on are likely to have the same hidden class, or only a few different kinds of hidden classes. So TableGetById will cache the most recent hidden class H it saw, as well as H[k], the storage slot in the table for the constant string key k. When TableGetById is executed on input T, it first checks whether the hidden class of T is its cached hidden class H. If so (which is likely), it knows that the result must be stored in slot H[k] of T, so the expensive hash-lookup work (which queries hidden class H to obtain H[k]) can be elided.
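A minimal sketch of this caching scheme follows. HiddenClass, Table and TableGetByIdIC are simplified toy types for this illustration (values are plain doubles, the key is assumed to be present, and the miss counter exists only to make the behavior observable); they do not reflect LJR's actual data layout.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Toy hidden class: maps a string key to a storage slot.
struct HiddenClass {
    std::unordered_map<std::string, int> slots;
};

// Toy table: a pointer to its hidden class plus the value storage.
struct Table {
    const HiddenClass* hc;
    std::vector<double> storage;
};

// One cache entry per TableGetById bytecode.
struct TableGetByIdIC {
    const HiddenClass* cachedClass = nullptr;  // H
    int cachedSlot = -1;                       // H[k]
    int misses = 0;                            // for demonstration only
};

// k is a fixed constant of the bytecode, so caching H[k] per bytecode is sound.
double TableGetById(const Table& t, const std::string& k, TableGetByIdIC& ic) {
    if (t.hc != ic.cachedClass) {
        // IC miss: do the expensive hash lookup and refresh the cache.
        auto it = t.hc->slots.find(k);
        assert(it != t.hc->slots.end());  // sketch assumes the key exists
        ic.cachedClass = t.hc;
        ic.cachedSlot = it->second;
        ++ic.misses;
    }
    // IC hit (or freshly filled): a plain indexed load, no hashing.
    return t.storage[ic.cachedSlot];
}
```

Two tables that share one hidden class cause only a single miss: the second lookup goes straight to the cached slot.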

In general, one can characterize the inline caching optimization as follows: there is some generic computation λ : input -> output that can be split into two steps:

1. An expensive but idempotent step λ_i : icKey -> ic where icKey is a subset of the input data, and ic is an opaque result.
2. A cheap but effectful step λ_e : <input, ic> -> output, that takes the input and the idempotent result ic for input in step 1, and outputs the final output.

If the computation satisfies this constraint, then one can cache icKey and the corresponding ic. Then on new inputs, if the icKey matches, the expensive idempotent step of computing ic can be safely elided.

Deegen provides generic inline caching APIs to make it easy to employ the inline caching optimization. Specifically:

1. The full computation λ is specified as a C++ lambda (called the body lambda).
2. The effectful step λ_e is specified as C++ lambdas defined inside the body lambda (called the effect lambdas).

We allow specifying multiple possible effect lambdas in the body lambda, since the λ_e to execute can often be dependent on the outcome of the idempotent step. However, we require that at most one effect lambda can be executed in each run of the body lambda.

For example, for TableGetById, the code that employs inline caching would look like the following (simplified for the purpose of exposition):

The precise semantics of the inline caching APIs are as follows:

1. When ic->Body() executes for the first time, it will honestly execute the body lambda. However, during the execution, when an ic->Effect API call is executed, it will create an inline cache[6] for this bytecode that records the IC key (defined by the ic->Key() API), as well as all captures of this effect lambda that are defined within the body lambda. These variables are treated as constants (the ic state).
2. Next time the ic->Body executes, compare the cached key against the actual key.
3. If the key matches, it will directly execute the previously recorded effect lambda. For each capture of the effect lambda, if the capture is defined inside the body lambda, it will see the cached value recorded in step 1. Otherwise (i.e., the capture is defined as a capture of the body lambda), it will see the fresh value.
4. If the key does not match, just execute step 1.
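The four rules above can be mimicked with a small toy class. This is not Deegen's API: ToyInlineCache, Run, SlowSlotLookup and GetField are invented for the sketch. Here the body is modelled as a function that returns the effect closure, the closure's by-value captures play the role of the cached ic state, and the fresh (uncached) data is passed to the effect as an explicit argument.

```cpp
#include <cassert>
#include <functional>
#include <vector>

template <typename Key, typename Input, typename Output>
class ToyInlineCache {
public:
    using Effect = std::function<Output(const Input&)>;
    using Body = std::function<Effect(const Key&)>;

    Output Run(const Key& key, const Input& input, const Body& body) {
        if (!hasEntry_ || !(cachedKey_ == key)) {
            // Rules 1 and 4: run the body, record the key and the effect.
            cachedEffect_ = body(key);
            cachedKey_ = key;
            hasEntry_ = true;
            ++bodyRuns_;
        }
        // Rules 2 and 3: on a key match only the recorded effect runs; it sees
        // its cached captures plus the fresh input.
        return cachedEffect_(input);
    }

    int bodyRuns() const { return bodyRuns_; }

private:
    bool hasEntry_ = false;
    Key cachedKey_{};
    Effect cachedEffect_;
    int bodyRuns_ = 0;
};

// Pretend-expensive idempotent step: which slot holds the field for a layout.
int SlowSlotLookup(int layoutId) { return layoutId % 3; }

using FieldIC = ToyInlineCache<int, std::vector<double>, double>;

double GetField(FieldIC& ic, int layoutId, const std::vector<double>& row) {
    return ic.Run(layoutId, row, [](const int& id) {
        int slot = SlowSlotLookup(id);  // idempotent part, result is cached
        // Effect: cheap, and depends on the fresh row passed at call time.
        return [slot](const std::vector<double>& r) { return r[slot]; };
    });
}
```

As in the interpreter described in footnote 6, this toy cache holds a single entry, so a new key simply overwrites the previous one.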

The precise semantics might look a bit bewildering at first glance. A more intuitive way to understand them is that one is only allowed to do idempotent computation inside the body lambda (idempotent with respect to the cached key and other values known to be constant for this bytecode). All non-idempotent computation must go into the effect lambdas. As long as this rule is followed, Deegen will automatically generate a correct implementation that employs the inline caching optimization.

Deegen also performs exotic optimizations that fuse the ordinal of the effect lambda into the opcode, to save an expensive indirect branch to the correct effect implementation when the inline cache hits. Such optimizations would have required a lot of engineering effort in a hand-written interpreter. But in Deegen, it is enabled by merely one line: ic->FuseICIntoInterpreterOpcode().

Below is the actual disassembly of the interpreter generated by Deegen, for TableGetById bytecode. The assembly is for a “fused-IC” quickened variant (see above) where the table is known to have no metatable, and the property exists in the inline storage of the table. As before, comments are manually added by me for exposition purposes.

As one can see, in the good case of an IC hit, a TableGetById is executed with a mere 2 branches (one that checks the operand is a heap object, and one that checks the hidden class of the heap object matches the inline-cached value).

LuaJIT’s hand-written assembly interpreter is highly optimized already. Our interpreter generated by Deegen is also highly optimized, and in many cases slightly better optimized than LuaJIT’s. However, the gain from those low-level optimizations is simply not enough to beat LuaJIT by a significant margin, especially on a modern CPU with very good instruction-level parallelism, where having a few more instructions, a few longer instructions, or even a few more L1-hitting loads has negligible impact on performance. The support of inline caching is one of the most important high-level optimizations we employed that contributes to our performance advantage over LuaJIT.

### Concluding Thoughts and Future Work

In this post, we demonstrated how we built the fastest interpreter for Lua (to date) through a novel meta-compiler framework.

However, automatically generating the fastest Lua interpreter is only the beginning of our story. LuaJIT Remake is designed to be a multi-tier method-based JIT compiler generated by the Deegen framework, and we will generate the baseline JIT, the optimizing JIT, the tiering-up/OSR-exit logic, and even a fourth-tier heavyweight optimizing JIT in the future.

Finally, Deegen was never designed to be welded to Lua, and maybe in the very far future, we can employ Deegen to generate high-performance VMs at a low engineering cost for other languages as well.

#### Footnotes

1. The benchmarks are run on my laptop with Intel i7-12700H CPU and 32GB DDR4 memory. All benchmarks are repeated 5 times and the average performance is recorded. ↩︎

2. As a side note, two of the three benchmarks where we lost to LuaJIT are string processing benchmarks. LuaJIT seems to have some advanced string-handling strategy, yielding the speedup. However, the strategy is not perfect: it failed badly on the life benchmark, and as a result, LuaJIT got stomped 3.6x by PUC Lua (and 16x by us) on that benchmark. ↩︎

3. Some of this poor code comes from insufficiently optimized calling convention handling logic (e.g., LLVM often just pours all the spills at function entry for simplicity), and some comes from the register allocator, which doesn’t have enough understanding of hot/cold paths (so it believes that hoisting a register move or a stack spill from the cold path into the fast path is an optimization when it actually isn’t). Compilers are always evolving and getting better, but at least in this case it isn’t enough. ↩︎

4. One can also imagine an optimization that makes each “case” jump directly to the next “case” instead of to the switch dispatcher. This is known as “direct threading” in the literature for continuation-passing-style-based interpreters, or more widely known as a “computed-goto interpreter” for switch-case-based interpreters (since the GCC computed-goto extension is the most straightforward way to implement such an optimization). ↩︎

5. If one looks at the problem globally, clearly the better solution is to save all the callee-saved registers only once when one enters the interpreter, and restore them when the interpreter finishes, instead of having each bytecode function do the same work again and again. But it’s impossible to tell the C/C++ compiler that “this function doesn’t need to abide by the calling convention, because by high-level design someone else will do the job for it”. ↩︎

6. In the current implementation, for the interpreter, each bytecode is only allowed to keep one inline cache entry, so the newly-created entry always overwrites the existing entry. However, for JIT compilers, each inline cache entry will be a piece of JIT-generated code, so there can be multiple IC entries for each bytecode. ↩︎

# Pitfalls of using C++ Global Variable Constructor as a Registration Mechanism

I recently hit the following use case in my project: I have a function RunAllPasses(obj), which runs a list of transformation passes on obj. All passes are independent from each other, so one can run them in any order. The problem is, I want to easily add new passes to the list of passes.

Of course one can manually maintain the list of passes, and call each of them. But this results in quite a bit of boilerplate code needed for each pass, and a lot of header files with each file merely having one function declaration for the pass.

Can we have less boilerplate code?

One intuitive direction is to have each pass “register” itself into a pass list at program initialization time, through the help of a global variable. For example, if one writes

Then the constructor of g_registerMyPass would automatically run when the program starts, and push the pass into a global pass list. The RunAllPasses function can then simply run each pass in the pass list.
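A minimal sketch of this registration idea is shown below. Only the names RunAllPasses and g_registerMyPass come from the text; Obj, Pass, PassList, RegisterPass and MyPass are hypothetical names chosen for the illustration. The pass list lives in a function-local static so that it is constructed on first use, regardless of the order in which global constructors run.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical object that the passes transform.
struct Obj {
    std::string log;
};
using Pass = std::function<void(Obj&)>;

// The global pass list, constructed on first use.
std::vector<Pass>& PassList() {
    static std::vector<Pass> list;
    return list;
}

// A global variable of this type registers a pass from its constructor.
struct RegisterPass {
    explicit RegisterPass(Pass p) { PassList().push_back(std::move(p)); }
};

// A pass and its self-registration, typically in the pass's own .cpp file.
void MyPass(Obj& o) { o.log += "MyPass;"; }
static RegisterPass g_registerMyPass(MyPass);

// Runs every registered pass; no central list needs to be maintained by hand.
void RunAllPasses(Obj& o) {
    for (const Pass& p : PassList()) {
        p(o);
    }
}
```

In a single translation unit, as here, the registration behaves as intended. The problems described in the rest of this post only appear once the registering file ends up in a static library.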

However, this approach turns out to be the source of a stream of problems, which ultimately forced me to give up this approach. Long story short, let’s start with the experiment that led me to my conclusion.

#### Linker: The Deal-Breaker

Create a mini project with two C++ files, a.cpp and b.cpp.

a.cpp simply declares a global variable that has a constructor, which prints a message:

b.cpp is just the main() function:

Now, run the program (the compiler and linker don’t matter, at least for the few I tried):

and we get the expected output of In constructor S followed by In main. This shows that the C++ compiler indeed took care to preserve the global variable s from being pruned by the linker even if it is unused, which is good.

But if we make a.cpp a library, things break!

After further investigation, it turns out that the erratic behavior depends on whether the file a.cpp contains any symbols that are being used by the main program. For example, adding another file c.cpp into the static library won’t help, even if c.cpp contains a function used by the main program. But if we change the code a bit, so that a.cpp contains a function used by the main program, like the following:

Then, magically, the In constructor S line would be printed out again.

What’s the problem? As it turns out, if none of the symbols in some file X of a static library is directly referenced by the main program, then the file X won’t be linked into the main program at all. And this “file-level pruning” ignores whatever “do-not-prune” annotations the C++ compiler emitted in the file, since the file is not linked in at all.

So I reached the conclusion that this approach is fundamentally fragile:

1. The erratic behavior won’t show up if the global variable is defined in an object file, only when it is defined in a static library.
2. The erratic behavior won’t show up if the C++ file defining the global variable contains other declarations that are used by the main program.
3. There is no way (AFAIK) to fix this problem other than the -Wl,--whole-archive linker flag, which is not only fragile, but also a bad option because it unnecessarily bloats the final executable, often by a lot.

The strict triggering condition means that the erratic behavior can stay hidden for a long time, until it is exposed by some completely unrelated change (e.g., moving a file to a static library, or moving some code around) and causes a debugging nightmare.

During the process, I also learned about a number of C++-standard-imposed pitfalls of global variable constructors. I will only note one interesting example below.

The following code has undefined behavior, can you see why?

Answer: at the time the constructor of r runs, the constructor of g_list may not have run.

This is because according to C++ standard, “dynamic initialization of a non-block variable with static storage duration is unordered if the variable is an implicitly or explicitly instantiated specialization” (in our case, any instantiation of the variable r). Since std::map does not have a constexpr constructor, g_list is also dynamically initialized, so r may be initialized before g_list, even if g_list “appears” to be defined before r.
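One common way to sidestep this ordering problem (a general C++ idiom, not something taken from the original code) is to replace the global g_list with a function-local static, which is constructed the first time the function is called. The sketch below uses hypothetical names (GetList, Registrar, Foo) and a variable template r, which requires C++14.

```cpp
#include <cassert>
#include <map>
#include <string>

// Construct-on-first-use: the map is created the first time GetList() runs,
// so it is ready no matter when a registering constructor calls it.
std::map<std::string, int>& GetList() {
    static std::map<std::string, int> list;
    return list;
}

// Hypothetical type to register.
struct Foo {
    static const char* Name() { return "Foo"; }
};

template <typename T>
struct Registrar {
    Registrar() { GetList()[T::Name()] = 1; }
};

// Instantiated specializations of r have unordered dynamic initialization,
// which is harmless here because they no longer depend on another global.
template <typename T>
Registrar<T> r;

// Explicit instantiation so that r<Foo> exists and its constructor runs.
template Registrar<Foo> r<Foo>;
```

This only addresses the initialization-order issue. It does nothing about the static-library pruning problem described earlier.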

#### But isn’t Google Test using the same global variable trick?

The above question came to my mind soon after I uploaded this post, so I gave it a try. The result is as expected: if I move my Google Test files to a static library linked against the final unit test executable, all the tests are gone. Of course, for unit tests, there is absolutely no reason to make them a static library, so I would say Google Test made the completely correct design decision. However, for general use cases, it seems unreasonable to silently introduce bugs when the code is linked as a static library.

# How to check if a real number is an integer in C++?

I have a double, and I want to know if its value is an integer that fits in an int64_t. How can I do it in C++?

Ask any C++ newbie, and you will get an obvious “answer”: cast your double to int64_t, then cast it back to double, and compare if it equals your original number.

But is it really correct? Let’s test it:
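A minimal reconstruction of the round-trip check just described is given below. The function name IsInt64 and the 1e100 call are the ones discussed later in the post; the helper PrintProblemCase is a name invented here so that the problematic call is visible without being executed by default.

```cpp
#include <cstdint>
#include <cstdio>

// Round-trip check: cast to int64_t and back, then compare with the input.
// Undefined behavior when d is NaN, infinite, or outside int64_t's range.
bool IsInt64(double d) {
    return d == static_cast<double>(static_cast<int64_t>(d));
}

// The problematic call: 1e100 does not fit in int64_t, so evaluating this is
// undefined behavior and the printed result cannot be relied upon.
void PrintProblemCase() {
    std::printf("%s\n", IsInt64(1e100) ? "1" : "0");
}
```

For in-range inputs such as 42.0 or 0.5 the function behaves as one would expect. The trouble is confined to inputs for which the cast itself is undefined.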

and here’s the output under clang -O3 (latest version 14.0.0):

!@#%^&… Why? Shouldn’t it at least print either a 1 or a 0?

### The Undefined Behavior

Here’s the reason: when you cast a floating-point value to an integer type, according to the C/C++ standard, if the integral part of the value does not fit into the integer type, the behavior is undefined (by the way, casting the special floating-point values NaN, INF, -INF to integer is also undefined behavior).

And unfortunately, Clang did the least helpful thing in this case:

1. It inlined the function IsInt64, so IsInt64(1e100) becomes the expression 1e100 == (double)(int64_t)1e100.
2. It deduces that (int64_t)1e100 incurs undefined behavior since 1e100 does not fit into int64_t, so it evaluates to a special poison value (i.e., undefined).
3. Any expression on a poison value also produces poison. So Clang deduces that the expression IsInt64(1e100) ? "1" : "0" ultimately evaluates to poison.
4. As a result, Clang deduces that the second parameter to printf is an undefined value. So in machine code, the whole expression is “optimized out”, and whatever junk is stored in that register gets passed to printf. printf will interpret that junk value as a pointer and print out whatever content is at that address, yielding the junk output.

Note that even though gcc happens to produce the expected output in this case, the undefined behavior is still there (as all C/C++ compilers conform to the same C/C++ Standard), so there is no guarantee that the IsInt64 function above will work on gcc or any compiler.

So how to implement this innocent function in a standard-compliant way?

### The Bad Fix Attempt #1

To avoid the undefined behavior, we must check that the double fits in the range of int64_t before doing the cast. However, there are a few tricky problems involved:

1. While -2^63 (the smallest int64_t) has an exact representation in double, 2^63-1 (the largest int64_t) doesn’t. So we must be careful about rounding problems when doing the comparison.
2. Comparing the special floating-point value NaN with any number will yield false, so we must write our check in a way that NaN won’t pass the check.
3. There is another weird thing called negative zero (-0). For the purpose of this post, we treat -0 the same as 0. If not, you will need another special check.

With these in mind, here’s the updated version:

However, unfortunately, while the above version is correct, it results in some unnecessarily terrible code on x86-64:

In fact, although an out-of-range floating-point-to-integer cast is undefined behavior in C/C++, the x86-64 instruction cvttsd2si used above to perform the cast is well-defined on all inputs: if the input doesn’t fit in int64_t, then the output is 0x80000000 00000000. And since 0x80000000 00000000 has an exact representation in double, casting it back to double will yield -2^63, which won’t compare equal to any double value but -2^63. So the range check is actually unnecessary for the code to behave correctly on x86-64: it is only there to keep the C++ compiler happy, but unfortunately, the C++ compiler is unable to realize that such a check is unnecessary on x86-64, and thus cannot optimize it out.

To summarize, on x86-64, all we need to generate is the last few lines of the above code. But is there any way we can teach the compiler to generate such assembly?

### The Bad Fix Attempt #2

In fact, our original buggy implementation produces exactly the above assembly. The problem is, whenever the optimizer of the C++ compiler inlines this function and figures out that the input is a compile-time constant, it will do constant propagation according to the C++ rules, and as a result generate the poison value.

So can we stop the optimizer from doing this unwanted optimization, while still having it optimize the rest of the program properly?

In fact, I posted this question on the LLVM forum months ago, and didn’t get an answer. But recently I suddenly had an idea.
gcc and clang both support a crazy builtin named __builtin_constant_p. Basically, this builtin takes one parameter, and returns true if the parameter can be proven by the compiler to be a compile-time constant[1]. Yes, the result of this function is dependent on the optimization level!

This builtin has a very good use case: to implement constexpr offsetof. If you are certain that some expression p is a compile-time constant, you can do constexpr SomeType foo = __builtin_constant_p(p) ? p : p; to forcefully make p a constexpr, even if p is not constexpr by the C++ standard, and the compiler won’t complain about anything! This allows one to perform a constexpr reinterpret_cast between uintptr_t and pointers, and thus implement a constexpr version of the offsetof operator.

However, what I realized is that this builtin can also be used to prevent the unwanted constant propagation. Specifically, we will check if (__builtin_constant_p(d)). If yes, we run the slow-but-correct code; this doesn’t matter as the optimizer is going to constant-fold the code anyway. If not, we execute the fast-but-UB-prone code, which is also fine because we already know the compiler can’t constant-fold anything to trigger the undefined behavior.

The new version of the code is below:

I tried the above code on a bunch of constant and non-constant cases, and the result seems good. Either the input is correctly constant-folded, or the good-version assembly is generated.

So I thought I had outsmarted the compiler in this stupid Human-vs-Compiler game. But had I…?

### Don’t Fight the Tool!

Why does C/C++ have this undefined behavior after all?
Once I started to think about this problem, I began to realize that something must be wrong…

The root reason that the C/C++ Standard specifies an out-of-range floating-point-to-integer cast as undefined behavior is that on different architectures, the instruction that performs the float-to-int cast exhibits different behavior when the floating-point value doesn’t fit in the integer type. On x86-64, the behavior of the cvttsd2si instruction in such cases is to produce 0x80000000 00000000, which is fine for our use case. But what about the other architectures?

As it turns out, on ARM64, the semantics of the fcvtzs instruction (the analogue of x86-64’s cvttsd2si) is saturation: if the floating-point value is larger than the max value of the integer type, the max value is produced; similarly, if the floating-point value is too small, the minimum integer value is produced. So if the double is larger than 2^63-1, fcvtzs will produce 2^63-1, not -2^63 like on x86-64.

Now, recall that 2^63-1 doesn’t have an exact representation in double. When 2^63-1 is cast to double, it becomes 2^63. So if the input double value is 2^63, casting it to int64_t (fcvtzs x8, d0) will yield 2^63-1, and then casting it back to double (scvtf d1, x8) will yield 2^63 again. So on ARM64, our code will determine that the double value 2^63 fits in int64_t, even though it actually does not.

I don’t own an ARM64 machine like an Apple M1, so I created a virtual machine using QEMU to validate this. Unsurprisingly, on ARM64, our function fails when it is fed the input 2^63.

So clearly, the undefined behavior is there for a reason…

### Pick the Right Tool Instead!

As it turns out, I really should not have tried to outsmart the compiler with weird tricks.

If performance is not a concern, then the UB-free version is actually the only portable and correct version:

And if performance is a concern, then it’s better to simply resort to architecture-dependent inline assembly.
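
As a rough illustration of what such an architecture-dependent version might look like, here is a minimal sketch, assuming GCC/Clang extended asm syntax. The x86-64 branch issues cvttsd2si directly; the operand constraints are my own choice for this sketch and not taken from any library. Every other architecture falls back to the range-checked portable version described earlier, so an ARM64 fast path would still need its own branch.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: returns true iff d holds an integer value representable as
// int64_t. -0.0 is treated the same as 0, as in the rest of this post.
inline bool IsInt64(double d) {
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // cvttsd2si is well-defined on every input: NaN and out-of-range values
    // produce 0x8000000000000000 (i.e. -2^63). Since this is inline assembly
    // rather than a C++ cast, the optimizer has no UB cast to fold into poison.
    int64_t i;
    asm("cvttsd2si %1, %0" : "=r"(i) : "x"(d));
    // int64_t -> double is always defined in C++, so a plain cast is fine.
    return d == static_cast<double>(i);
#else
    // Portable fallback: range-check first so the cast is never out of range.
    // Both bounds (-2^63 and 2^63) are exactly representable as double, and
    // NaN fails the comparisons.
    if (-9223372036854775808.0 <= d && d < 9223372036854775808.0) {
        return d == static_cast<double>(static_cast<int64_t>(d));
    }
    return false;
#endif
}
```

With this shape, a call like IsInt64(1e100) goes through the instruction (or the range check) instead of through a cast the optimizer may treat as undefined.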
Yes, now a different implementation is needed for every architecture, but at least it’s better than dealing with hard-to-debug edge case failures.

Of course, the ideal solution is to improve the compiler, so that the portable version generates optimal code on every architecture. But given that neither gcc nor clang supports this, I assume it’s not an easy thing to do.

#### Footnotes

1. Note that this builtin is different from the C++20 std::is_constant_evaluated(). The is_constant_evaluated only concerns whether a constexpr function is being evaluated constexpr-ly. However, __builtin_constant_p tells you whether a (maybe non-constexpr) expression can be deduced to a compile-time known constant under the current optimization level, so it has nothing to do with constexpr.

# Bizarre Performance Characteristics of Alder Lake CPU

TL;DR: Some of the P-cores in an Alder Lake CPU can exhibit highly unstable performance behavior, resulting in large noise for any benchmark running on them.

UPDATE: A colleague of mine reported that the behavior can be observed on his i9-9980HK as well, and observed ~25% end-performance fluctuations on short-running benchmarks. So it seems like this behavior has been around for quite a while, dating back to at least the 9th-gen Intel CPUs[1].

As a performance engineer, it’s routine to evaluate the performance before and after a code commit. This is why I’ve been faintly feeling that something is unusual about my new Intel Alder Lake i7-12700H laptop CPU. Today I dug into the problem. As I discovered, this CPU indeed exhibits some highly unusual and surprising performance characteristics, which can easily cause pitfalls for benchmarks.

For background, Alder Lake features a hybrid architecture of the powerful P-cores and the weaker E-cores. The i7-12700H has 6 P-cores and 8 E-cores. Of course, we want to have the P-cores run our time-sensitive tasks, such as our benchmarks. This can be done easily by using taskset to pin the process to only the P-cores.
This is where the story begins. I noticed two problems with the P-cores:

1. Sometimes it cannot turbo-boost to 4.7GHz, the Intel-specified max turbo boost frequency (for the one-active-core case) for the i7-12700H.
2. Sometimes it cannot stay at the highest CPU frequency it can boost to.

Point 1 implies that we cannot enjoy the full performance advertised by Intel. Point 2 implies that the core cannot deliver consistent performance, which is problematic for performance engineering, as the noise would make two benchmark runs less comparable.

### Test Setup

To expose the problem, I wrote a dumb program that increments a variable in an infinite loop, so that the frequency of the CPU running the program is maxed out. Then I use taskset to pin the program to one CPU, have it run for 60 seconds, and run cpufreq every second to record the frequency of that CPU for the duration[2].

I took the following precautions to ensure nothing outside the CPU chip is limiting the CPU from boosting to its max frequency:

1. Use the isolcpus Linux kernel boot parameter to exclusively dedicate the tested CPU core to our test program. This removes any noise caused by the OS.
2. Confirm the CPU is not throttled by the power limit: with only one active core (running our test program), the CPU package power consumption is less than 25W, far less than the base 45W TDP of the i7-12700H.
3. Confirm the CPU is not temperature-throttled (by monitoring sensors). To be paranoid, I also set a 20s gap between tests so the temperature goes back to the idle state.
4. Confirm the machine is in an idle state, and stop unnecessary background services.
5. The CPU frequency governor is set to performance, and I confirmed that the governor is not limiting the turbo boost frequency.
6. Everything is at stock settings: nothing is overclocked or undervolted, etc.
7. All tests are repeated 3 times, and consistent behavior is observed for every core.
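
For readers who want to reproduce the idea without the bash script, here is a minimal Linux-only sketch of the same measurement in C++. It is not the script used for the results in this post: it swaps taskset for sched_setaffinity and cpufreq for the sysfs file scaling_cur_freq, and the function names PinToCpu, ReadCurFreqKHz, SpinFor and SampleFrequency are hypothetical helpers made up for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>
#if defined(__linux__)
#include <sched.h>  // sched_setaffinity, CPU_ZERO, CPU_SET
#endif

// Pin the calling thread to one logical CPU; returns false on failure
// (or on non-Linux systems, where this sketch does nothing).
inline bool PinToCpu(int cpu) {
#if defined(__linux__)
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#else
    (void)cpu;
    return false;
#endif
}

// Read the current frequency (in kHz) of a logical CPU from sysfs.
// Returns -1 if the file is missing or unreadable.
inline long ReadCurFreqKHz(int cpu) {
    std::ifstream f("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                    "/cpufreq/scaling_cur_freq");
    long khz = -1;
    if (!(f >> khz)) return -1;
    return khz;
}

// Increment a variable in a loop for the given duration to keep the core
// fully busy; returns the number of increments performed.
inline uint64_t SpinFor(std::chrono::milliseconds duration) {
    volatile uint64_t counter = 0;
    const auto deadline = std::chrono::steady_clock::now() + duration;
    while (std::chrono::steady_clock::now() < deadline) {
        for (int i = 0; i < 100000; i++) counter = counter + 1;
    }
    return counter;
}

// Keep the CPU busy and record its frequency once per interval.
inline std::vector<long> SampleFrequency(int cpu, int samples,
                                         std::chrono::milliseconds interval) {
    std::vector<long> result;
    for (int i = 0; i < samples; i++) {
        SpinFor(interval);
        result.push_back(ReadCurFreqKHz(cpu));
    }
    return result;
}
```

A run equivalent to the setup above would call PinToCpu(cpu) once and then SampleFrequency(cpu, 60, std::chrono::milliseconds(1000)), with the precautions in the list still applied outside the program.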

### Not All P-cores Are Born Equal

The test confirmed my hypothesis that the 6 P-cores in my i7-12700H do not have a uniform quality. Specifically, my 6 P-cores exhibit three different performance characteristics! I dubbed them “gold core”, “B-grade core”, and “wild core”:

1. Gold core: the core can boost to and stay at 4.7GHz, just as Intel claimed.
2. B-grade core: the core can boost to and stay at a frequency lower than 4.7GHz.
3. Wild core: the core cannot boost to 4.7GHz, and cannot stay at any stable frequency: it will fluctuate wildly within a range of frequencies, and the degree of turbulence also varies per core.

We will explain their performance characteristics below.

### The “Wild Cores”

Let’s start with the most bizarre cores: the wild ones. As it turns out, 3 out of my 6 P-cores are wild (a whopping 50%!), and among those three cores, one of them is particularly wild, as shown in the plot below[3]:

As you can see, the CPU frequency fluctuates violently between 4.05GHz and 4.55GHz, and each run exhibits a completely different pattern. Clearly, if any benchmark were run on this core, such large noise would be a headache to deal with.

The other two wild cores I got were less turbulent. Even so, the noise introduced by the frequency instability still makes them less than ideal for benchmark comparison:

### The “B-grade Cores”

The B-grade cores (as I dubbed them) are better: while they cannot boost to 4.7GHz as advertised by Intel, at least they can operate at a consistent frequency, so benchmark results are comparable as long as they are run on the same core.

It turns out that my i7-12700H has two B-grade cores, both capable of operating at 4.5GHz:

As one can see, the core for the second graph has slightly higher frequency variations. Nevertheless, they are much more stable than the three wild cores.

### The “Gold Core”

Only 1 out of the 6 P-cores of my i7-12700H matches Intel’s marketing[4]:

As one can see, it operates stably at about 4.68GHz, just as Intel claimed.

### The Behavior of the E-cores

Unlike the P-cores, it turns out that the E-cores have extremely stable behavior. All eight E-cores can boost to and stay at 3.5GHz, just as the Intel specification says. There is not even a single outlier point: as you can see in the figure, it’s a completely straight line.

### Concluding Thoughts

Given Intel’s tight testing and binning quality-control process, it seems very unlikely that I’m seeing all of this only because I got a defective chip. So I conjecture the “wild core” behavior can likely be observed on many i7-12700H CPUs. Additionally, since i7-12700H is just the same i9-12900 chip with two below-quality P-cores disabled, it would also be interesting to know if the behavior shows up on higher-end Alder Lake models, like the i9-12900K, as those presumably come from the better silicon, but I don’t have the ability to validate it.

Nevertheless, from a practical perspective, the action to take is clear: run the benchmark to identify the best cores and the performance-unstable cores on your chip, avoid running benchmarks on the performance-unstable cores, and use the best cores for the most latency-sensitive applications.

For example, for my particular chip, physical core 2 (logical cores 4-5) turns out to be the only “gold core”, so taskset -c 4 for single-threaded benchmarks is a good idea. Similarly, for a latency-sensitive multi-threaded application (like the QtCreator IDE, where UX is heavily affected by auto-completion latency), it is reasonable to modify the startup command in the desktop link to pin it to the good cores (logical cores 0,1,4,5,8,9 on my particular chip).

#### But why?

I’m no expert at all, but my conjecture is that the increase in clock frequency and number of cores in recent CPUs might be the cause: due to the silicon lottery, the max stable clock frequency is inherently different for each core.
So as the chip gets more cores, it becomes exponentially harder to find chips where all cores in the chip match the spec frequency criteria, so maybe that’s why Intel loosened their criteria?

On the other hand, boost frequency is designed to go down as more cores become active. So in theory, having one golden core is actually enough, as long as the OS is aware of which core is golden, and assigns performance-demanding tasks to that core. However, this doesn’t seem to be the case yet, at least for my Ubuntu running Linux kernel 5.15.

#### Footnotes

1. On the other hand, my 7th-generation i7-7700HQ CPU does not have the problem described in this post.
2. The full bash script for the test can be found here. For the least noise, you should use the isolcpus boot parameter to isolate a subset of CPUs, reboot, modify the script to only test the isolated subset, then change isolcpus to isolate the opposite set of CPUs, reboot, and modify the script to test the opposite set.
3. The two logical CPUs of the physical core exhibit the same behavior, so I only show one of them. Same for other figures in this post.
4. Though if you take a closer look at their specification, you’ll see what Intel claimed is “up to 4.7GHz”, so technically they did not lie, as they never claimed all cores can meet their specification. Though, I guess, two cores 0.2GHz slower, two cores 0.35GHz slower and turbulent, one core 0.5GHz slower and highly turbulent is still, hmm.

# Understanding GC in JSC From Scratch

JavaScript relies on garbage collection (GC) to reclaim memory. In this post, we will dig a little bit into the garbage collection system of JSC (the JavaScript engine of WebKit).

WebKit’s blog post on GC is a great post that explained the novelties of JSC’s GC and also positioned it within the context of various GC schemes in academia and industry.
However, as someone with little GC background, I found WebKit’s blog post too hard to follow, and also too vague about the specific design used by JSC. So this blog post attempts to add some more details, and aims to be understandable even to someone with little prior background in GC.

The garbage collector in JSC is non-compacting, generational and mostly[1]-concurrent. On top of being concurrent, JSC’s GC heavily employs lock-free programming for better performance.

As you can imagine, the design used by JSC is quite complex. So instead of diving into the complex invariants and protocols, we will start with the simplest design, and improve it step by step to converge at JSC’s design in the end. This way, we understand not only why JSC’s design works, but also how JSC’s design is reached.

But first of all, let’s get into some background.

### Memory Allocation in JSC

The memory allocator and the GC are tightly coupled by nature: the allocator allocates memory to be reclaimed by the GC, and the GC frees memory to be reused by the allocator. In this section, we will briefly introduce JSC’s memory allocators.

At the core of the memory allocation scheme in JSC is the data structure BlockDirectory[2]. It implements a fixed-size allocator, that is, an allocator that only allocates memory chunks of some fixed size S. The allocator keeps track of a list of fixed-size (in current code, 16KB) memory pages (“blocks”) it owns, and a free list. Each block is divided into cells of size S, and has a footer at its end[3], which contains various metadata needed by the GC and the allocator, e.g., which cells are free. Aggregating and sharing metadata at the footer both saves memory and improves the performance of related operations: we will go into details later.

When a BlockDirectory needs to make an allocation, it tries to allocate from its free list.
If the free list is empty, it tries to iterate through the blocks it owns[4], to see if it can find a block containing free cells (which are marked free by GC). If yes, it scans the block footer metadata to find all the free cells[5] in this block, and puts them into the free list. Otherwise, it allocates a new block from the OS[6]. Note that this implies a BlockDirectory’s free list only contains cells from one block: this is called m_currentBlock in the code, and we will revisit this later.

The BlockDirectory is used as the building block for the memory allocators in JSC. JSC employs three kinds of allocators:

1. CompleteSubspace: this is a segregated allocator responsible for allocating small objects (max size about 8KB). Specifically, there is a pre-defined list of exponentially-growing size classes[7], and one BlockDirectory is used to handle allocation for each size class. So to allocate an object, you find the smallest size class large enough to hold the object, and allocate from that size class.
2. PreciseAllocation: this is used to handle large allocations that cannot be handled by the CompleteSubspace allocator[8]. It simply relies on the standard (malloc-like) memory allocator, though in JSC a custom malloc implementation called libpas is used. The downside is that since PreciseAllocation is done on a per-object basis, it cannot aggregate and share metadata of multiple objects together to save memory and improve performance (as CompleteSubspace’s block footer does). Therefore, every PreciseAllocation comes with a whopping overhead of a 96-byte GC header to store the various metadata needed by the GC for this object (though this overhead is justified since each allocation is already at least 8KB).
3. IsoSubspace: each IsoSubspace is used to allocate objects of a fixed type with a fixed size.
So each IsoSubspace simply holds a BlockDirectory to do allocation (though JSC also has an optimization for small IsoSubspaces by making them backed by PreciseAllocation[9]). This is mainly a security hardening feature that makes use-after-free-based attacks harder[10].

As you can see, IsoSubspace is mostly a simplified CompleteSubspace, so we will ignore it for the purpose of this post. CompleteSubspace is the one that handles the common case, small allocations, and PreciseAllocation is mostly the rare slow path for large allocations.

### Generational GC Basics

In JSC’s generational GC model, the heap consists of a small “new space” (eden), holding the newly allocated objects, and a large “old space” holding the older objects that have survived one GC cycle. Each GC cycle is either an eden GC or a full GC. New objects are allocated in the eden. When the eden is full, an eden GC is invoked to garbage-collect the unreachable objects in eden. All the surviving objects in eden are then considered to be in the old space[11]. To reclaim objects in the old space, a full GC is needed.

The effectiveness of the above scheme relies on the so-called “generational hypothesis”:

1. Most objects collected by the GC are young objects (they die while still in eden), so an eden GC (which only collects the eden) is sufficient to reclaim most of the memory.
2. Pointers from old space to eden are much rarer than pointers from eden to old space or pointers from eden to eden, so an eden GC’s runtime is approximately linear in the size of the eden, as it only needs to start from a small subset of the old space. This implies that the cost of GC can be amortized by the cost of allocation.

#### Inlined vs. Outlined Metadata: Why?

Practically every GC scheme uses some kind of metadata to track which objects are alive. In this section, we will explain how that metadata is stored in JSC, and the motivation behind such a design.
In JSC, every object managed by the GC carries the following metadata:

1. Every object managed by the GC inherits the JSCell class, which contains a 1-byte member cellState. This cellState is a color marker with two colors: white and black[12].
2. Every object also has two out-of-object metadata bits: isNew[13] and isMarked. For objects allocated by PreciseAllocation, the bits reside in the GC header. For objects allocated by CompleteSubspace, the bits reside in the block footer.

This may seem odd at first glance since isNew and isMarked could have been stored in the unused bits of cellState. However, this is intentional.

The inlined metadata cellState is easy to access for the mutator thread (the thread executing JavaScript code), since it is just a field in the object. However, it has bad memory locality for the GC and allocators, which need to quickly traverse all the metadata of all objects in some block owned by CompleteSubspace (which is the common case). Outlined metadata have the opposite performance characteristics: they are more expensive to access for the mutator thread, but since they are aggregated into bitvectors and stored in the footer of each block, the GC and allocators can traverse them really fast.

So JSC keeps both inlined and outlined metadata to get the best of both worlds: the mutator thread’s fast path will only concern the inlined cellState, while the GC and allocator logic can also take advantage of the memory locality of the outlined bits isNew and isMarked.

Of course, the cost of this is a more complex design… so we have to unfold it bit by bit.

### A Really Naive Stop-the-World Generational GC

Let’s start with a really naive design just to understand what is needed. We will design a generational, but stop-the-world (i.e. not incremental or concurrent) GC, with no performance optimizations at all. In this design, the mutator side transfers control to the GC subsystem at a “safe point”[14] to start a GC cycle (eden or full).
The GC subsystem performs the GC cycle from beginning to end (as a result, the application cannot run during this potentially long period, thus “stop-the-world”), and then transfers control back to the mutator side.

For this purpose, let’s temporarily forget about CompleteSubspace: it is an optimized version of PreciseAllocation for small allocations, and while it is an important optimization, it’s easier to understand the GC algorithm without it.

It turns out that in this design, all we need is one isMarked bit. The isMarked bit will indicate if the object was reachable at the end of the last GC cycle (and consequently, is in the old space, since any object that survived a GC cycle is in old space). All objects are born with isMarked = false.

The GC will use a breadth-first search to scan and mark objects. For a full GC, we want to reset all isMarked bits to false at the beginning, and do a BFS to scan and mark all objects reachable from GC roots. Then all the unmarked objects are known to be dead. For an eden GC, we only want to scan the eden space. Fortunately, all objects in the old space are already marked at the end of the previous GC cycle, so they are naturally ignored by the BFS, and we can simply reuse the same BFS algorithm as in full GC. In pseudo-code:

Eden GC preparation phase: no work is needed.

Full GC preparation phase[15]:

Eden/Full GC marking phase:

Eden/Full GC collection phase:

But where does the scan start, so that we can scan through every reachable object?

For full GC, the answer is clear: we just start the scan from all GC roots[16].

However, for eden GC, in order to reliably scan through all reachable objects, the situation is slightly more complex:

1. Of course, we still need to push the GC roots to the initial queue.
2. If an object in the old space contains a pointer to an object in eden, we need to put the old space object into the initial queue[17].

The invariant for the second case is maintained by the mutator side.
Specifically, whenever one writes a pointer slot of some object A in the heap to point to another object B, one needs to check if A.isMarked is true and B.isMarked is false. If so, one needs to put A into a “remembered set”. Eden GC must treat the objects in the remembered set as if they were GC roots. This is called a WriteBarrier. In pseudo-code:

### Getting Incremental

The stop-the-world GC isn’t feasible for production use. A GC cycle (especially a full GC cycle) can take a long time. Since the mutator (application logic) cannot run during this period, the application would appear unresponsive to the user, which is a very bad user experience.

A natural way to shorten this unresponsive period is to run GC incrementally: at safe points, the mutator transfers control to the GC. The GC only runs for a short time, doing a portion of the work for the current GC cycle (eden or full), then returns control to the mutator. This way, each GC cycle is split into many small steps, so the unresponsive periods are less noticeable to the user.

Incremental GC poses a few new challenges to the GC scheme.

The first challenge is the extra interference between GC and mutator: the mutator side, namely the allocator and the WriteBarrier, must be prepared to see states arising from a partially-completed GC cycle. And the GC side must correctly mark all reachable objects despite changes made by the mutator side in between.

Specifically, our full GC must change: imagine that the full GC scanned some object o and handed back control to the mutator, then the mutator changed a field of o to point to some other object dst. The object dst must not be missed by the scan. Fortunately, in such a case o will be isMarked and dst will be !isMarked (if dst has isMarked then it has been scanned, so there’s nothing to worry about), so o will be put into the remembered set.
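
To make the interaction concrete, here is a minimal sketch of the write barrier described above, together with a marking pass that treats the remembered set as extra roots. The Object and Heap types and their member names are hypothetical simplifications made up for this illustration; they are not JSC’s classes, and the marking pass runs to completion instead of yielding to the mutator.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

struct Object {
    bool isMarked = false;        // set once the GC has reached this object
    std::vector<Object*> fields;  // outgoing pointers
};

struct Heap {
    std::vector<Object*> roots;
    std::vector<Object*> rememberedSet;

    // Mutator side: run on every pointer store into a heap object. If an
    // already-marked object starts pointing at an unmarked one, remember it.
    void WriteBarrier(Object* a, Object* b) {
        if (a->isMarked && b != nullptr && !b->isMarked) {
            rememberedSet.push_back(a);
        }
    }

    void Store(Object* a, std::size_t slot, Object* b) {
        a->fields[slot] = b;
        WriteBarrier(a, b);
    }

    // GC side: BFS from the roots and from the remembered set.
    void Mark() {
        std::deque<Object*> queue;
        for (Object* r : roots) {
            if (!r->isMarked) {
                r->isMarked = true;
                queue.push_back(r);
            }
        }
        // Remembered objects are already marked, so they are re-scanned
        // unconditionally to pick up the pointers written after their scan.
        for (Object* o : rememberedSet) queue.push_back(o);
        rememberedSet.clear();
        while (!queue.empty()) {
            Object* o = queue.front();
            queue.pop_front();
            for (Object* child : o->fields) {
                if (child != nullptr && !child->isMarked) {
                    child->isMarked = true;
                    queue.push_back(child);
                }
            }
        }
    }
};
```

In the scenario above, o is already marked when the mutator calls Store(o, slot, dst), so o lands in the remembered set and the next call to Mark() reaches dst even though o itself is never visited as a fresh object again.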
Therefore, for full GC to function correctly in the incremental GC scheme, it must consider the remembered set as GC roots as well, just like the eden GC.

The other parts of the algorithm as of now can remain unchanged (we leave the proof of correctness as an exercise for the reader). Nevertheless, “what happens if a GC cycle is run partially?” is something that we must keep in mind as we add more optimizations.

The second challenge is that the mutator side can repeatedly put an old space object into the remembered set, resulting in redundant work for the GC: for example, the GC popped some object o from the remembered set, traversed from it, and handed control over to the mutator. The mutator modified o again, putting it back into the remembered set. If this happens too often, the incremental GC could do a lot more work than a stop-the-world GC.

The obvious mitigation is to have the GC scan the remembered set last: only when the queue is otherwise empty do we start popping from the remembered set. However, it turns out that this is not enough. JSC employs a technique called the Space-Time Scheduler to further mitigate this problem. In short, if it observes that the mutator is allocating too fast, the mutator gets a progressively smaller time quota to run so the GC can catch up (and in the extreme case, the mutator gets zero time quota to run, so it falls back to the stop-the-world approach). The WebKit blog post has explained it very clearly, so feel free to take a look if you are interested.

Anyway, let’s update the pseudo-code for the eden/full GC marking phase:

### Incorporate in CompleteSubspace

It’s time to get our CompleteSubspace allocator back so we don’t have to suffer the huge per-object GC header overhead incurred by PreciseAllocation.

For PreciseAllocation, the actual memory management work is done by malloc: when the mutator wants to allocate an object, it just mallocs it, and when the GC discovers a dead object, it just frees it.
CompleteSubspace introduces another complexity, as it only allocates/deallocates memory from the OS at the 16KB-block level, and does memory management itself to divide the blocks into cells that it serves to the application. Therefore, it has to track whether each of its cells is available for allocation. The mutator allocates from the available cells, and the GC marks dead cells as available for allocation again.

The isMarked bit is not enough for the CompleteSubspace allocator to determine if a cell contains a live object or not: newly allocated objects have isMarked = false but are clearly live objects. Therefore, we need another bit.

In fact, there are other good reasons that we need to support checking if a cell contains a live object or not. A canonical example is conservative stack scanning: JSC cannot precisely understand the layout of the stack, so it needs to treat everything on the stack that could be a pointer to a live object as a GC root, and this involves checking if a heap pointer points to a live object or not.

One can easily imagine some kind of isLive bit that is true for all live objects, which is only flipped to false by the GC when the object is dead. However, JSC employs a slightly different scheme, which is needed to facilitate optimizations that we will mention later.

As you have seen earlier, the bit used by JSC is called isNew.

However, keep in mind: you should not think of isNew as a bit that tells you anything related to its name, or indicates anything by itself. You should think of it as a helper bit, whose sole purpose is that, when working together with isMarked, they tell you if a cell contains a live object or not. This way of thinking will be more important in the next section when we introduce logical versioning.

The core invariant around isNew and isMarked is:

1. At any moment, an object is dead iff its isNew = false and isMarked = false.
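
As a small sketch of how this invariant is consumed, assuming the two bits of every cell in a block are stored as bitvectors in the block footer: the BlockFooter struct, the cell count kCellsPerBlock and the helper names below are hypothetical and only meant to illustrate the check, not to mirror JSC’s actual layout.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cell count; a real block derives it from block and cell size.
constexpr std::size_t kCellsPerBlock = 64;

// Simplified stand-in for the outlined metadata kept in a block footer.
struct BlockFooter {
    std::bitset<kCellsPerBlock> isNew;
    std::bitset<kCellsPerBlock> isMarked;
};

// The core invariant: a cell is dead iff both bits are false.
inline bool IsLive(const BlockFooter& footer, std::size_t cell) {
    return footer.isNew[cell] || footer.isMarked[cell];
}

// Because the bits are aggregated into bitvectors, the allocator can find
// every available cell of a block from the combined bits in one pass.
inline std::vector<std::size_t> FindFreeCells(const BlockFooter& footer) {
    const std::bitset<kCellsPerBlock> live = footer.isNew | footer.isMarked;
    std::vector<std::size_t> freeCells;
    for (std::size_t cell = 0; cell < kCellsPerBlock; cell++) {
        if (!live[cell]) freeCells.push_back(cell);
    }
    return freeCells;
}
```

The same IsLive check is what a conservative stack scan would use to decide whether a candidate pointer refers to a live cell.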

If we were a stop-the-world GC, then to maintain this invariant, we would only need the following:

1. When an object is born, it has isNew = true and isMarked = false.
2. At the end of each eden or full GC cycle, we set isNew of all objects to false.

Then, all newly-allocated objects are live because their isNew is true. At the end of each GC cycle, an object is live iff its isMarked is true, so after we set isNew to false (due to rule 2), the invariant on dead objects is maintained, as desired.

However, in an incremental GC, since the state of a partially-run GC cycle can be exposed to the mutator, we need to be careful that the invariant holds in this case as well.

Specifically, in full GC, we reset all isMarked to false at the beginning. Then, during a partially-run GC cycle, the mutator may see a live object with both isMarked = false (because it has not been marked by the GC yet), and isNew = false (because it has survived one prior GC cycle). This violates our invariant.

To fix this, at the beginning of a full GC, we additionally do isNew |= isMarked before clearing isMarked. Now, during the whole full GC cycle, all live objects must have isNew = true[18], so our invariant is maintained. At the end of the cycle, all isNew bits are cleared, and as a result, all the unmarked objects become dead, so our invariant is still maintained as desired. So let’s update our pseudo-code:

Eden GC preparation phase: no work is needed.

Full GC preparation phase:

Eden/Full GC collection phase:

In the CompleteSubspace allocator, to check if a cell in a block contains a live object (if not, then the cell is available for allocation):

### Logical Versioning: Do Not Sweep!

We are doing a lot of work at the beginning of a full GC cycle and at the end of any GC cycle, since we have to iterate through all the blocks in CompleteSubspace and update their isMarked and isNew bits.
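
To see where that work comes from, here is a sketch of the eager (non-versioned) bookkeeping described in the previous section, written over a hypothetical list of block footers. The BlockFooter struct, the cell count and the function names are assumptions made up for this illustration.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kCellsPerBlock = 64;  // hypothetical cell count

struct BlockFooter {
    std::bitset<kCellsPerBlock> isNew;
    std::bitset<kCellsPerBlock> isMarked;
};

// Full GC preparation: fold the old marks into isNew before clearing them,
// so every live cell keeps at least one bit set while the cycle is only
// partially run.
inline void PrepareFullGC(std::vector<BlockFooter>& blocks) {
    for (BlockFooter& block : blocks) {
        block.isNew |= block.isMarked;
        block.isMarked.reset();
    }
}

// End of any (eden or full) GC cycle: clear isNew, so that only the cells
// marked during this cycle remain live.
inline void FinishGCCycle(std::vector<BlockFooter>& blocks) {
    for (BlockFooter& block : blocks) {
        block.isNew.reset();
    }
}
```

Both functions touch every block on every call, including blocks in which nothing will be marked during the cycle.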
Although the bits in one block are clustered into bitvectors and thus have good memory locality, this could still be an expensive operation, especially after we have a concurrent GC (as this stage cannot be made concurrent). So we want something better.

The optimization JSC employs is logical versioning. Instead of physically clearing all bits in all blocks for every GC cycle, we only bump a global “logical version”, indicating that all the bits are logically cleared (or updated). Only when we actually need to mark a cell in a block during the marking phase do we then physically clear (or update) the bitvectors in this block.

You may ask: why bother with logical versioning, if in the future we still have to update the bitvectors physically anyway? There are two good reasons:

1. If all cells in a block are dead (either died out during this GC cycle[19], or already dead before this GC cycle), then we will never mark anything in the block, so logical versioning enables us to avoid the work altogether. This also implies that at the end of each GC cycle, it’s unnecessary to figure out which blocks become completely empty, as logical versioning makes sure that these empty blocks will not cause overhead to future GC cycles.
2. The marking phase can be done concurrently with multiple threads and while the mutator thread is running (our scheme isn’t concurrent now, but we will do it soon), while the preparation / collection phase must be performed single-threaded and with the mutator stopped. Therefore, shifting the work to the marking phase reduces GC latency in a concurrent setting.

There are two global version numbers, g_markVersion and g_newVersion[20]. Each block footer also stores its local version numbers, l_markVersion and l_newVersion.

Let’s start with the easier case: the logical versioning for the isNew bit. If you revisit the pseudo-code above, in GC there is only one place where we write isNew: at the end of each GC cycle, we set all the isNew bits to false.
Therefore, we simply bump g_newVersion there instead. A local version l_newVersion smaller than g_newVersion means that all the isNew bits in this block have been logically cleared to false.

When the CompleteSubspace allocator allocates a new object, it needs to start with isNew = true. One can clearly do this directly, but JSC does it in a trickier way that involves a block-level bit named allocated for slightly better performance. This is not too interesting, so I deferred it to the end of the post, and our scheme described here will not employ this optimization (but is otherwise intentionally kept semantically equivalent to JSC’s):

1. When a BlockDirectory starts allocating from a new block, it updates the block’s l_newVersion to g_newVersion, and sets isNew to true for all already-allocated cells (as the block may not be fully empty), and false for all available cells.
2. Whenever it allocates a cell, it sets its isNew to true.

Why do we want to bother setting isNew to true for all already-allocated cells in the block? This is to provide a good property. Since we bump g_newVersion at the end of every GC cycle, due to the scheme above, for any block with the latest l_newVersion, a cell is live if and only if its isNew bit is set. Now, when checking if a cell is live, if its block’s l_newVersion is the latest, then we can just return isNew without looking at isMarked, so our logic is simpler.

The logical versioning for the isMarked bit is similar. At the beginning of a full GC cycle, we bump g_markVersion to indicate that all mark bits are logically cleared. Note that the global version is not bumped for eden GC, since eden GC does not clear isMarked bits.

There is one extra complexity: the above scheme would break down in incremental GC. Specifically, during a full GC cycle, we have logically cleared the isMarked bit, but we also didn’t do anything to the isNew bit, so all cells in the old space would appear dead to the allocator.
In our old scheme without logical versioning, this case is prevented by doing isNew |= isMarked at the start of the full GC, but we cannot do it now with logical versioning.

JSC solves this problem with the following clever trick: during a full GC, we should also accept an l_markVersion that is off-by-one. In that case, we know the isMarked bit accurately reflects whether or not a cell is live, since that is the result of the last GC cycle. If you are a bit confused, take a look at footnote[21] for a more elaborate case discussion. It might also help to take a look at the comments in the pseudo-code below:

Before we mark an object in CompleteSubspace, we need to update the l_markVersion of the block holding the cell to the latest, and materialize the isMarked bits of all cells in the block. That is, we need to run the logic at the full GC preparation phase in our old scheme: isNew |= isMarked, isMarked = false for all cells in the block. This is shown below.

A fun fact: although what we conceptually want to do above is isNew |= isMarked, the above code never performs a |= at all :)

And also, let’s update the pseudo-code for the relevant GC logic:

Eden GC preparation phase: no work is needed.

Full GC preparation phase:

Eden/Full GC collection phase:

With logical versioning, GC no longer sweeps the CompleteSubspace blocks to reclaim dead objects: the reclamation happens lazily, when the allocator starts to allocate from the block. This, however, introduces an unwanted side-effect. Some objects use manual memory management internally: they own additional memory that is not managed by GC, and have C++ destructors to free that memory when the object is dead. This improves performance as it reduces the work of GC. However, now we do not immediately sweep dead objects and run destructors, so the memory that is supposed to be freed by the destructor could be kept around indefinitely if the block is never allocated from.
To mitigate this issue, JSC will also periodically sweep the blocks and run the destructors of the dead objects. This is implemented by IncrementalSweeper, but we will not go into details.

To conclude, logical versioning provides two important optimizations to the GC scheme:

1. The so-called “sweep” phase of the GC (to find and reclaim dead objects) is removed for CompleteSubspace objects. The reclamation is done lazily. This is clearly better than sweeping through the block again and again in every GC cycle.
2. The full GC does not need to reset all isMarked bits in the preparation phase, but only lazily resets them in the marking phase via aboutToMark: this not only reduces work, but also allows the work to be done in parallel and while the mutator is running, after we make our GC scheme concurrent.

### Optimizing WriteBarrier: The cellState Bit

As we have explained earlier, whenever the mutator modifies a pointer of a marked object o to point to an unmarked object, it needs to add o to the “remembered set”, and this is called the WriteBarrier. In this section, we will dig a bit deeper into the WriteBarrier and explain the optimizations around it.

The first problem with our current WriteBarrier is that the isMarked bit resides in the block footer, so retrieving its value requires quite a few computations from the object pointer. Also, it doesn’t sit in the same CPU cache line as the object, which makes the access even slower. This is undesirable as the cost is paid for every WriteBarrier, whether or not we actually add the object to the remembered set in the end.

The second problem is that our WriteBarrier will repeatedly add the same object o to the remembered set every time it is run. The obvious solution is to make rememberedSet a hash set to de-duplicate the objects it contains, but doing a hash lookup to check if the object already exists is still too expensive.
This is where the last metadata bit that we haven’t explained yet comes in: the cellState bit, which solves both problems.

Instead of making rememberedSet a hash table, we reserve a byte (though we only use 1 bit of it) named cellState in every object’s object header, to indicate if we might need to put the object into the remembered set in a WriteBarrier. Since this bit resides in the object header as an object field (instead of in the block footer), it’s trivially accessible to the mutator, which has the object pointer.

cellState has two possible values: black and white. The two most important invariants around cellState are the following:

1. For any object with cellState = white, it is guaranteed that the object does not need to be added to the remembered set.
2. Unless during a full GC cycle, all black (live) objects have isMarked = true.

Invariant 1 serves as a fast-path: WriteBarrier can return immediately if our object is white, and checking it only requires one load instruction (to load cellState) and one comparison instruction to validate it is white.

However, if the object is black, a slow-path is needed to check whether the object actually needs to be added to the remembered set.

Let’s look at our new WriteBarrier:

The first thing to notice is that the WriteBarrier is no longer checking if dst (the object that the pointer points to) is marked or not. Clearly this does not affect the correctness: we are just making the criteria less restrictive.

However, it is unclear to me if we can improve performance while maintaining correctness by making some kind of check on dst as well, like the original WriteBarrier did. I wasn’t able to get a definite answer on this even from JSC developers. They have two conjectures on why they are doing this. First, by not checking dst, more objects are put into the remembered set and need to be scanned by GC, so the total amount of work increased.
However, the mutator’s work probably decreased, as it does fewer checks and touches fewer cache lines (by not touching the outlined isMarked bit). Of course, the benefit is offset by the mutator adding more objects into the remembered set, but this isn’t too expensive either, as the remembered set is only a segmented vector. GC has to do more work, as it needs to scan and mark more objects. However, after we make our scheme concurrent, the marking phase of GC can be done concurrently while the mutator is running, so the latency is probably[22] hidden. Second, JSC’s DFG compiler has an optimization pass that coalesces barriers on the same object together, and the barrier emitted this way naturally cannot check dst. Therefore, to make things easier, they simply made all the barriers not check dst.

Although these are all conjectures, and it is unclear if adding back the dst check can improve performance, this is how JSC works, so let’s stick to it.

The interesting part is how the invariants above are maintained by the relevant parties. As always, there are three actors: the mutator (WriteBarrier), the allocator, and the GC.

The interaction with the allocator is the simplest. All objects are born white. This is correct because newly-born objects are not marked, so they have no reason to be remembered.

The interaction with GC is during the GC marking phase:

1. When we mark an object and push it into the queue, we set its cellState to white.
2. When we pop an object from the queue, before we start to scan its children, we set its cellState to black.

In pseudo-code, the Eden/Full GC marking phase now looks like the following (Line 5 and Line 9 are the newly-added logic to handle cellState, other lines unchanged):

Let’s argue why the invariants are maintained by the above code.

1. For invariant 1, note that in the above code, an object is white only if it is inside the queue (as once it’s popped out, it becomes black again), pending scanning of its children.
Therefore, it is guaranteed that the object will still be scanned by the GC later, so we don’t need to add the object to the remembered set, as desired.
2. For invariant 2, at the end of any GC cycle, any live object is marked, which means it has been scanned, so it is black, as desired.

Now let’s look at what WriteBarrierSlowPath should do. Clearly, it’s correct if it simply unconditionally adds the object to the remembered set, but that also defeats most of the purpose of cellState as an optimization mechanism: we want something better. A main job of cellState is to prevent adding an object into the remembered set if it is already there. Therefore, after we put the object into the remembered set, we will set its cellState to white, as shown below.

Let’s prove why the above code works. Once we have added an object to the remembered set, we set it to white. We don’t need to add the same object into the remembered set again until it gets popped out from the set by GC. But when GC pops out the object, it sets its cellState back to black, so we are good.

JSC employs one more optimization. During a full GC, we might see black objects that have isMarked = false (note that this is the only possible case in which the object is unmarked, due to invariant 2). In this case, it’s unnecessary to add the object to the remembered set, since the object will eventually be scanned in the future (or it becomes dead some time later before it is scanned, in which case we are good as well). Furthermore, we can flip it back to white, so we don’t have to go into this slow path the next time a WriteBarrier on this object runs. To sum up, the optimized version is as below:

### Getting Concurrent and Getting Wild

At this point, we already have a very good incremental and generational garbage collector: the mutator, allocator and GC all have their respective fast-paths for the common cases, and with logical versioning, we avoid redundant work as much as possible.
In my humble opinion, this is a good balance point between performance and engineering complexity.

However, obviously, “engineering complexity” is not a word inside JSC’s dictionary: after all, they have the most talented engineers, to the point that they even engineered their own purpose-built LLVM from scratch! To squeeze out every bit of performance, JSC proceeded to make the GC scheme concurrent.

However, due to the nature of GC, it’s often infeasible to use locks to protect against race conditions for performance reasons, so extensive lock-free programming is employed. But once lock-free programming is involved, one starts to get into all sorts of architecture-dependent memory reordering problems.

x86-64 is the saner architecture: it provides somewhat-TSO-like semantics, so the only fence it requires is StoreLoadFence(). But JSC also needs ARM64 support for their Apple Silicon devices. ARM64 has even fewer guarantees: load-load, load-store, store-load, and store-store can all be reordered by the CPU, so any innocent operation could actually need a fence. As if things were not bad enough, for performance reasons, JSC does not want to use too many memory fences on ARM64. So they have the so-called Dependency class, which creates an implicit CPU data dependency on ARM64 through some scary assembly hacks, so they can get the desired memory ordering for a specific data-flow without paying the cost of a memory fence.

As you can imagine, with all of these complications and optimizations, the code can easily become horrifying. So due to my limited expertise, it would be unsurprising if I failed to explain or mis-explained some important race conditions in the code, especially some ARM64-specific ones: if you spot any issue in this post, please let me know.

Let’s go through the concurrency assumptions first. Javascript is a single-threaded language, so there is always only one mutator thread[23].
Apart from the mutator thread, JSC has a bunch of compilation threads, a GC thread, and a bunch of marking threads. Only the GC marking phase is concurrent: during it, the mutator thread, the compiler threads, and a bunch of marking threads are concurrently running (yes, the marking itself is also done in parallel). However, all the other GC phases are run with the mutator thread and compilation threads stopped.

#### Some Less Interesting Issues

First of all, clearly the isMarked and isNew bitvectors must be made safe for concurrent access, since multiple threads (including marking threads and the mutator) may concurrently update them. Using CAS with an appropriate retry/bail mechanism is enough for the bitvector itself.

BlockFooter is harder, and needs to be protected with a lock: multiple threads could be simultaneously calling aboutToMark(), so aboutToMark() must be guarded. For the reader side (the isMarked() function, which involves first checking if l_markVersion is the latest, then reading the isMarked bitvector), on x86-64, thanks to x86-TSO, one does not need a lock or any memory fence (as long as aboutToMark takes care to update l_markVersion after the bitvector). On ARM64, since load-load reordering is allowed, a Dependency is required.

Making the cellContainsLiveObject (or in JSC jargon, isLive) check lock-free is harder, since it involves potentially reading both the isMarked bit and the isNew bit. JSC employs optimistic locking to provide a fast-path. This is not very different from an optimistic locking scheme you can find in a textbook, so I won’t dive into the details.

Of course, there are a lot more subtle issues that require changes. Almost all the pseudo-code above needs to be adapted for concurrency, either by using a lock or CAS, or by using some sort of memory barriers and concurrency protocol to ensure that the code works correctly under concurrency settings. But now let’s turn to some more important and tricky issues.
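Before moving on, the CAS retry/bail pattern mentioned above for the bitvector can be sketched as follows. This is only an illustration under stated assumptions, not JSC's implementation: the `MarkBits` type, its `testAndSet` and `get` methods, and the sizes are all hypothetical names chosen for this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Illustrative sizes; not the real block geometry.
constexpr std::size_t kCellsPerBlock = 1024;
constexpr std::size_t kBitsPerWord = 64;

// Hypothetical concurrent bitvector, one bit per cell.
struct MarkBits {
    std::atomic<std::uint64_t> words[kCellsPerBlock / kBitsPerWord] = {};

    // Returns true if this thread set the bit, and false if the bit was
    // already set (the "bail" case: someone else marked the cell first).
    bool testAndSet(std::size_t cell) {
        assert(cell < kCellsPerBlock);
        std::atomic<std::uint64_t>& word = words[cell / kBitsPerWord];
        const std::uint64_t mask = std::uint64_t{1} << (cell % kBitsPerWord);
        std::uint64_t old = word.load(std::memory_order_relaxed);
        for (;;) {
            if (old & mask)
                return false;  // bail: already set
            // On failure, compare_exchange_weak reloads `old`, and we retry.
            if (word.compare_exchange_weak(old, old | mask,
                                           std::memory_order_seq_cst,
                                           std::memory_order_relaxed))
                return true;
        }
    }

    bool get(std::size_t cell) const {
        assert(cell < kCellsPerBlock);
        const std::uint64_t mask = std::uint64_t{1} << (cell % kBitsPerWord);
        return (words[cell / kBitsPerWord].load() & mask) != 0;
    }
};
```

The retry is needed because other threads may be setting neighbouring bits in the same word, and the bail tells the caller that it lost the race to mark this cell.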
#### The Race Between WriteBarrier and Marking

One of the most important races is the race between WriteBarrier and GC’s marking threads. The marking threads and the mutator thread can access the cellState of an object concurrently. For performance reasons, a lock is infeasible, so a race condition arises.

It’s important to note that we call WriteBarrier after we have written the pointer into the object. This is not only more convenient to use (especially for JIT-generated code), but also allows a few optimizations: for example, in certain cases, multiple writes to the same object may only call WriteBarrier once at the end.

With this in mind, let’s analyze why our current implementation is buggy. Suppose o is an object, and the mutator wants to store a pointer to another object target into a field f of o. The marking logic of GC wants to scan o and append its children into the queue. We need to make sure that GC will observe the o -> target pointer link.

Let’s first look at the correct logic:

| Line | Mutator (WriteBarrier) | GC (Marker) |
|------|------------------------|-------------|
| 1 | Store(o.f, target) | Store(o.cellState, black) |
| 2 | StoreLoadFence() // WriteBarrier begin | StoreLoadFence() |
| 3 | t1 = Load(o.cellState) | t2 = Load(o.f) // Load a child of o |
| 4 | if (t1 == black): WriteBarrierSlowPath(o) | Do some check to t2 and push it to queue |

This is mostly just a copy of the pseudocode in the above sections, except that we have two StoreLoadFence(). A StoreLoadFence() guarantees that no LOAD after the fence may be executed by the CPU out-of-order engine until all STOREs before the fence have completed. Let’s first analyze what could go wrong if either of the fences is missing.

Just to make things perfectly clear, the precondition is o.cellState = white (because o is in the GC’s queue) and o.f = someOldValue.

What could go wrong if the mutator WriteBarrier doesn’t have the fence? Without the fence, the CPU can execute the LOAD in line 3 before the STORE in line 1. Then, in the following interleaving:

1. [Mutator Line 3] t1 = Load(o.cellState) // t1 = white
2. [GC Line 1] Store(o.cellState, black)
3. [GC Line 3] t2 = Load(o.f) // t2 = some old value
4. [Mutator Line 1] Store(o.f, target)

Now, the mutator did not add o to the remembered set (because t1 is white, not black), and t2 in GC is the old value in o.f instead of target, so GC did not push target into the queue. So the pointer link from o to target is missed by GC. This can result in target being wrongly reclaimed even though it is live.

And what could go wrong if the GC marking logic doesn’t have the fence? Similarly, without the fence, the CPU can execute the LOAD in line 3 before the STORE in line 1. Then, in the following interleaving:

1. [GC Line 3] t2 = Load(o.f) // t2 = some old value
2. [Mutator Line 1] Store(o.f, target)
3. [Mutator Line 3] t1 = Load(o.cellState) // t1 = white
4. [GC Line 1] Store(o.cellState, black)

Similar to above, the mutator sees t1 = white and GC sees t2 = oldValue. So o is not added to the remembered set, target is not pushed into the queue, and the pointer link is missed.

Finally, let’s analyze why the code behaves correctly if both fences are present. Unfortunately there is not a better way than manually enumerating all the interleavings. Thanks to the fences, Mutator Line 1 must execute before Mutator Line 3, and GC Line 1 must execute before GC Line 3, but the four lines can otherwise be reordered arbitrarily. So there are 4! / 2! / 2! = 6 possible interleavings. So let’s go!

Interleaving 1:

1. [Mutator Line 1] Store(o.f, target)
2. [Mutator Line 3] t1 = Load(o.cellState) // t1 = white
3. [GC Line 1] Store(o.cellState, black)
4. [GC Line 3] t2 = Load(o.f) // t2 = target

In this interleaving, the mutator did not add o to the remembered set, but the GC sees target, so it’s fine.

Interleaving 2:

1. [GC Line 1] Store(o.cellState, black)
2. [GC Line 3] t2 = Load(o.f) // t2 = some old value
3. [Mutator Line 1] Store(o.f, target)
4. [Mutator Line 3] t1 = Load(o.cellState) // t1 = black

In this interleaving, GC saw the old value, but the mutator added o to the remembered set, so GC will eventually drain from the remembered set and scan o again, at which time it will see the correct new value target, so it’s fine.

Interleaving 3:

1. [Mutator Line 1] Store(o.f, target)
2. [GC Line 1] Store(o.cellState, black)
3. [Mutator Line 3] t1 = Load(o.cellState) // t1 = black
4. [GC Line 3] t2 = Load(o.f) // t2 = target

In this interleaving, GC saw the new value target, but the mutator nevertheless saw t1 = black and added o to the remembered set. This is unfortunate since GC will scan o again, but it doesn’t affect correctness.

Interleaving 4:

1. [Mutator Line 1] Store(o.f, target)
2. [GC Line 1] Store(o.cellState, black)
3. [GC Line 3] t2 = Load(o.f) // t2 = target
4. [Mutator Line 3] t1 = Load(o.cellState) // t1 = black

Same as Interleaving 3.

Interleaving 5:

1. [GC Line 1] Store(o.cellState, black)
2. [Mutator Line 1] Store(o.f, target)
3. [Mutator Line 3] t1 = Load(o.cellState) // t1 = black
4. [GC Line 3] t2 = Load(o.f) // t2 = target

Same as Interleaving 3.

Interleaving 6:

1. [GC Line 1] Store(o.cellState, black)
2. [Mutator Line 1] Store(o.f, target)
3. [GC Line 3] t2 = Load(o.f) // t2 = target
4. [Mutator Line 3] t1 = Load(o.cellState) // t1 = black

Same as Interleaving 3.

This proves that with the two StoreLoadFence(), our code is no longer vulnerable to the above race condition.

#### Another Race Condition Between WriteBarrier and Marking

The above fix alone is not enough: there is another race between WriteBarrier and GC marking threads. Recall that in WriteBarrierSlowPath, we attempt to flip the object back to white if we see that it is not marked (this may happen during a full GC), as illustrated below:

It turns out that, after setting the object white, we need to do a StoreLoadFence(), and check again if the object is marked. If it has become marked, we need to set obj->cellState back to black.
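The fix just described can be sketched as follows. This is a single-file illustration under stated assumptions rather than JSC's code: `Object`, `g_rememberedSet`, and the inline `isMarked` field are hypothetical simplifications (in the scheme above, isMarked lives in the block footer), and `std::atomic_thread_fence(std::memory_order_seq_cst)` stands in for StoreLoadFence().

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

enum CellState : std::uint8_t { Black = 0, White = 1 };

// Simplification: isMarked is stored inline only to keep the sketch
// self-contained; in the scheme above it lives in the block footer.
struct Object {
    std::atomic<std::uint8_t> cellState{White};
    std::atomic<bool> isMarked{false};
};

// Hypothetical stand-in for the remembered set (mutator-only access here).
std::vector<Object*> g_rememberedSet;

void WriteBarrierSlowPath(Object* obj) {
    if (!obj->isMarked.load()) {
        // Only possible during a full GC: the object has not been scanned
        // yet, so it need not be remembered. Flip it back to white.
        obj->cellState.store(White, std::memory_order_relaxed);
        // Plays the role of StoreLoadFence(): the store above must be
        // visible before we re-read isMarked.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (obj->isMarked.load(std::memory_order_relaxed)) {
            // A marker raced with us and may already have scanned obj,
            // so future barriers must take the slow path again.
            obj->cellState.store(Black, std::memory_order_relaxed);
        }
        return;
    }
    // Marked object: remember it, then turn it white so that later
    // barriers on the same object do not add it again.
    g_rememberedSet.push_back(obj);
    obj->cellState.store(White, std::memory_order_relaxed);
}
```

Run single-threaded, the re-check never fires; it only matters when a marking thread sets isMarked between the first check and the store of white.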
Without the fix, the code is vulnerable to the following race:

1. [Precondition] o.cellState = black and o.isMarked = false
2. [WriteBarrier] Check isMarked() // see false
3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
4. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
5. [WriteBarrier] Store(o.cellState, white)
6. [Postcondition] o.cellState = white and o.isMarked = true

The post-condition is bad because o will not be added to the remembered set in the future, even though it needs to be (as the GC has already scanned it).

Let’s now prove why the code is correct when the fix is applied. Now the WriteBarrier logic looks like this:

1. [WriteBarrier] Store(o.cellState, white)
2. [WriteBarrier] t1 = isMarked()
3. [WriteBarrier] if (t1 == true): Store(o.cellState, black)

Note that we omitted the first “Check isMarked()” line because it must be the first thing executed in the interleaving, as otherwise the if-check won’t pass at all.

The three lines in WriteBarrier cannot be reordered by the CPU: lines 1-2 cannot be reordered because of the StoreLoadFence(), and lines 2-3 cannot be reordered since line 3 is a store that is only executed if line 2 is true. The two lines in GC cannot be reordered by the CPU because line 2 stores to the same field o.cellState as line 1.

In addition, note that it’s fine if at the end of WriteBarrier, the object is black but GC has only executed to line 1: this is unfortunate, because the next WriteBarrier on this object will add the object to the remembered set even though it’s unnecessary. However, it does not affect our correctness. So now, let’s enumerate all the interleavings again!

Interleaving 1.

1. [WriteBarrier] Store(o.cellState, white)
2. [WriteBarrier] t1 = isMarked() // t1 = false
3. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // not executed

Object is not marked and white, OK.

Interleaving 2.

1. [WriteBarrier] Store(o.cellState, white)
2. [WriteBarrier] t1 = isMarked() // t1 = false
3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
4. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // not executed

Object is in queue and white, OK.

Interleaving 3.

1. [WriteBarrier] Store(o.cellState, white)
2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
3. [WriteBarrier] t1 = isMarked() // t1 = true
4. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is in queue and black, unfortunate but OK.

Interleaving 4.

1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
2. [WriteBarrier] Store(o.cellState, white)
3. [WriteBarrier] t1 = isMarked() // t1 = true
4. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is in queue and black, unfortunate but OK.

Interleaving 5.

1. [WriteBarrier] Store(o.cellState, white)
2. [WriteBarrier] t1 = isMarked() // t1 = false
3. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
4. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // not executed

Object is marked and black, OK.

Interleaving 6.

1. [WriteBarrier] Store(o.cellState, white)
2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
3. [WriteBarrier] t1 = isMarked() // t1 = true
4. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is marked and black, OK.

Interleaving 7.

1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
2. [WriteBarrier] Store(o.cellState, white)
3. [WriteBarrier] t1 = isMarked() // t1 = true
4. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is marked and black, OK.

Interleaving 8.

1. [WriteBarrier] Store(o.cellState, white)
2. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
3. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
4. [WriteBarrier] t1 = isMarked() // t1 = true
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is marked and black, OK.

Interleaving 9.

1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
2. [WriteBarrier] Store(o.cellState, white)
3. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
4. [WriteBarrier] t1 = isMarked() // t1 = true
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is marked and black, OK.

Interleaving 10.

1. [GC Marking] CAS(o.isMarked, true), Store(o.cellState, white), pushed 'o' into queue
2. [GC Marking] Popped 'o' from queue, Store(o.cellState, black)
3. [WriteBarrier] Store(o.cellState, white)
4. [WriteBarrier] t1 = isMarked() // t1 = true
5. [WriteBarrier] if (t1 == true): Store(o.cellState, black) // executed

Object is marked and black, OK.

So let’s update our pseudo-code. However, I would like to note that, in JSC’s implementation, they did not use a StoreLoadFence() after obj->cellState = white. Instead, they made the obj->cellState = white a CAS from black to white (with memory ordering memory_order_seq_cst). This is stronger than a StoreLoadFence(), so their logic is also correct. Nevertheless, just in case my analysis above missed some other race with other components, our pseudo-code will stick to their logic…

Mutator WriteBarrier pseudo-code:

Eden/Full GC Marking phase:

#### Remove Unnecessary Memory Fence In WriteBarrier

The WriteBarrier is now free of hazardous race conditions. However, we are executing a StoreLoadFence() for every WriteBarrier, which is a very expensive CPU instruction. Can we optimize it? The idea is the following: the fence is used to protect against races with GC.
Therefore, we definitely need the fence if the GC is concurrently running. However, the fence is unnecessary if the GC is not running. Therefore, we can check if the GC is running first, and only execute the fence if the GC is indeed running.

JSC is even smarter: instead of having two checks (one that checks if the GC is running and one that checks if the cellState is black), it combines them into a single check for the fast-path where the GC is not running and the object is white. The trick is the following:

1. Assume black = 0 and white = 1 in the cellState enum.
2. Create a global variable called blackThreshold. This blackThreshold is normally 0, but at the beginning of a GC cycle, it will be set to 1, and it will be reset back to 0 at the end of the GC cycle.
3. Now, check if obj->cellState > blackThreshold.

Then, if the check succeeds, we know we can immediately return: the only case in which this check can succeed is when the GC is not running and we are white (because blackThreshold = 0 and cellState = 1 is the only situation that passes the check). This way, the fast path only executes one check. If the check fails, then we fall back to the slow path, which performs the full procedure: check if GC is running, execute a fence if needed, then check if cellState is black again. In pseudo-code:

Note that there is no race between WriteBarrier and GC setting/clearing the IsGcRunning() flag and changing the g_blackThreshold value, because the mutator is always stopped at a safe point (of course, halfway inside WriteBarrier is not a safe point) when the GC starts/finishes.

#### “Obstruction-Free Double Collect Snapshot”

Concurrent GC also introduces new complexities for the ForEachChild function used by the GC marking phase to scan all objects referenced by a certain object. Each Javascript object has a Structure (aka, hidden class) that describes how the content of this object shall be interpreted into object fields.
Since the GC marking phase is run concurrently with the mutator, and the mutator may change the Structure of the object, and may even change the size of the object’s butterfly, GC must be sure that despite the race conditions, it will never crash by dereferencing invalid pointers and never fail to scan a child. Using a lock is clearly infeasible for performance reasons. JSC uses a so-called obstruction-free double collect snapshot to solve this problem. Please refer to the Webkit GC blog post to see how it works.

### Some Minor Design Details and Optimizations

You might find this section helpful if you want to actually read and understand the code of JSC, but otherwise feel free to skip it: these details are not central to the design, and are not particularly interesting either. I mention them only to bridge the gap between the GC scheme explained in this post and the actual implementation in JSC.

As explained earlier, each CompleteSubspace owns a list of BlockDirectory to handle allocations of different sizes; each BlockDirectory has an active block m_currentBlock where it allocates from, and it achieves this by holding a free list of all available cells in the block. But how does it work exactly?

As it turns out, each BlockDirectory has a cursor, which is reset to point at the beginning of the block list at the end of an eden or full GC cycle. Until it is reset, it can only move forward. The BlockDirectory will move the cursor forward until it finds a block containing available cells, and allocate from it. If the cursor reaches the end of the list, it will attempt to steal a 16KB block from another BlockDirectory and allocate from it. If that also fails, it will allocate a new 16KB block from the OS and allocate from it.

I also mentioned that a BlockDirectory uses a free list to allocate from the currently active block m_currentBlock.
It’s important to note that in the actual implementation of JSC, the cells in m_currentBlock do not respect the rule for the isNew bit. Therefore, to check liveness, one either needs to do a special-case check to see if the cell is from m_currentBlock (for example, see HeapCell::isLive), or, for the GC[24], stop the mutator, destroy the free list (and populate isNew in the process), do whatever inspection, then rebuild the free list and resume the mutator. The latter is implemented by two functions named stopAllocating() and resumeAllocating(), which are automatically called whenever the world is stopped or resumed.

The motivation for allowing m_currentBlock to not respect the rule for isNew is (a tiny bit of) performance. Instead of manually setting isNew to true for every allocation, a block-level bit allocated (aggregated as a bitvector in BlockDirectory) is used to indicate if a block is full of live objects. When the free list becomes empty (i.e., the block is fully allocated), we simply set allocated to true for this block. When querying cell liveness, we check this bit first and directly return true if it is set. The allocated bitvector is cleared at the end of each GC cycle, and since the global logical version for isNew is also bumped, this effectively clears all the isNew bits, just as we desired.

JSC’s design also supports the so-called constraint solver, which allows specification of implicit reference edges (i.e., edges not represented as pointers in the object). This is mainly used to support Javascript interaction with DOM. This part is not covered in this post.

Weak references have multiple implementations in JSC. The general (but less efficient) implementation is WeakImpl, denoting a weak reference edge. The data structure managing them is WeakSet, and you can see it in every block footer, and in every PreciseAllocation GC header. However, JSC also employs more efficient specialized implementations to handle the weak map feature in Javascript.
The details are not covered in this post.

In JSC, objects may also have destructors. There are three ways the destructors are run. First, when we begin allocating from a block, destructors of the dead cells are run. Second, the IncrementalSweeper periodically scans the blocks and runs destructors. Finally, when the VM shuts down, the lastChanceToFinalize() function is called to ensure that all destructors are run at that time. The details of lastChanceToFinalize() are not covered in this post.

JSC employs a conservative approach for pointers on the stack and in registers: the GC uses UNIX signals to suspend the mutator thread, so it can copy its stack contents and CPU register values to search for data that looks like pointers. However, it’s important to note that the UNIX signal is not used to suspend the execution of the mutator: the mutator always actively suspends itself at a safe point. This is critical, as otherwise it could be suspended at weird places, for example, in a HeapCell::isLive check after it has read isNew but before it has read isMarked, and then GC did isNew |= isMarked, isMarked = false, and boom. So it seems like the only reason to suspend the thread is for the GC to get the CPU register values, including the SP register value so the GC knows where the stack ends. It’s unclear to me if it’s possible to do so in a cooperative manner instead of using costly UNIX signals.

### Acknowledgements

I thank Saam Barati from JSC team for his enormous help on this blog post. Of course, any mistakes in this post are mine.

#### Footnotes

1. Brief stop-the-world pause is still required at the start and end of each GC cycle, and may be intentionally performed if the mutator thread (i.e. the thread running Javascript code) is producing garbage too fast for the GC thread to keep up with. ↩︎

2. The actual allocation logic is implemented in LocalAllocator.
Although in the code BlockDirectory holds a linked list of LocalAllocator, (at time of writing, for the codebase version linked in this blog) the linked list always contains exactly one element, so BlockDirectory and LocalAllocator are one-to-one and can be viewed as an integrated component. This relationship might change in the future, but it doesn’t matter for the purpose of this post anyway. ↩︎

3. Since the footer resides at the end of a 16KB block, and the block is also 16KB aligned, one can do simple bit math from any object pointer to access the footer of the block it resides in. ↩︎

4. Just as per-cell information is aggregated and stored in the block footer, per-block information is aggregated as bitvectors and stored in BlockDirectory for fast lookup. Specifically, two bitvectors empty and canAllocateButNotEmpty track if a block is empty, or partially empty. The code is relatively confusing because the bitvectors are laid out in a non-standard way to make resizing easier, but conceptually it’s just one bitvector for each boolean per-block property. ↩︎

5. While seemingly straightforward, it is not straightforward at all (as you can see in the code). The free cells are marked free by the GC, and due to concurrency and performance optimization the logic becomes very tricky: we will revisit this later. ↩︎

6. In fact, it also attempts to steal blocks from other allocators, and the OS memory allocator may have some special requirements required for the VM, but we ignore those details for simplicity. ↩︎

7. In the current implementation, the list of sizes (in bytes) is 16, 32, 48, 64, 80, then 80 * 1.4 ^ n for n >= 1 up to about 8KB. Exponential growth guarantees that the overhead due to internal fragmentation is at most a fraction (in this case, 40%) of the total allocation size. ↩︎

8. An interesting implementation detail is that IsoSubspace and CompleteSubspace always return memory aligned to 16 bytes, but PreciseAllocation always returns memory addresses that have remainder 8 modulo 16. This allows identifying whether an object is allocated by PreciseAllocation with simple bit math. ↩︎

9. JSC has another small optimization here. Sometimes an IsoSubspace contains so few objects that it’s a waste to hold them using a 16KB memory page (the block size of BlockDirectory). So the first few memory pages of IsoSubspace use the so-called “lower-tier”, which are smaller memory pages allocated by PreciseAllocation. In this post, we will ignore this design detail for simplicity. ↩︎

10. Memory of an IsoSubspace is only used by this IsoSubspace, never stolen by other allocators. As a result, a memory address in IsoSubspace can only be reused to allocate objects of the same type. So for any type A allocated by IsoSubspace, even if there is a use-after-free bug on type A, it is impossible to allocate A, free it, allocate type B at the same address, and exploit the bug to trick the VM into interpreting an integer field in B controlled by attacker as a pointer field in A. ↩︎

11. In some GC schemes, an eden object is required to survive two (instead of one) eden GC to be considered in old space. The purpose of such design is to make sure that any old space object is at least one eden-GC-gap old. In contrast, in JSC’s design, an object created immediately before an eden collection will be considered to be in old space immediately, which then can only be reclaimed via a full GC. The performance difference between the two designs is unclear to me. I conjecture JSC chose its current design because it’s easier to make concurrent. ↩︎

12. There is one additional color Grey in the code.
However, it turns out that White and Grey make no difference (you can verify it by grepping all uses of cellState and observing that the only comparison on cellState is checking if it is Black). The comments explaining what the colors mean are also a bit outdated. This is likely a historical artifact. In my opinion JSC should really clean it up and update the comment, as it can easily cause confusion to readers who intend to understand the design. ↩︎

13. The bit is actually called isNewlyAllocated in the code. We shorten it to isNew for convenience in this post. ↩︎

14. Safe point is a GC term. At a safe point, the heap and stack are in a coherent state understandable by the GC, so the GC can correctly trace out which objects are dead or live. ↩︎

15. For PreciseAllocation, all allocated objects are chained into a linked list, so we can traverse all objects (live or dead) easily. This is not efficient: we will explain the optimizations for CompleteSubspace later. ↩︎

16. Keep in mind that while this is true for now, as we add more optimizations to the design, this will no longer be true. ↩︎

17. Note that we push the old space object into the queue, not the eden object, because this pointer could have been overwritten at the start of the GC cycle, making the eden object potentially collectable. ↩︎

18. Also note that all objects dead before this GC cycle, i.e. the free cells of a block in CompleteSubspace, still have isNew = false and isMarked = false, as desired. ↩︎

19. Recall that under the generational hypothesis, most objects die young. Therefore, that “all objects in an eden block are found dead during eden GC” is something completely plausible. ↩︎

20. In JSC, the version is stored in a uint32_t and they have a bunch of logic to handle the case that it overflows uint32_t. In my humble opinion, this is an overoptimization that results in very hard-to-test edge cases, especially in a concurrent setting.
So we will ignore this complexity: one can easily avoid these by spending 8 more bytes per block footer to have uint64_t version number instead. ↩︎

21. Note that any number of eden GC cycles may have run between the last full GC cycle and the current full GC cycle, but eden GC does not bump mark version. So for any object born before the last GC cycle (no matter eden or full), the isMarked bit honestly reflects if it is live, and we will accept the bit as its mark version must be off-by-one. For objects born after the last GC cycle, it must have a latest isNew version, so we can know it’s alive through isNew. In both cases, the scheme correctly determines if an object is alive, just as desired. ↩︎

22. And probably not: first, true sharing and false sharing between GC and mutator can cause slowdown. Second, as we have covered before, JSC uses a Time-Space Scheduler to prevent the mutator from allocating too fast while the GC is running. Specifically, the mutator will be intentionally suspended for at least 30% of the duration. So as long as the GC is running, the mutator suffers from a 30%-or-more “performance tax”. ↩︎

23. The real story is a bit more complicated. JSC actually reuses the same VM for different Javascript scripts. However, at any moment, at most one of the scripts can be running. So technically, there are multiple mutually-exclusive mutator threads, but this doesn’t affect our GC story. ↩︎

24. The GC needs to inspect a lot of cells, and its logic is already complex enough, so having one less special-case branch is probably beneficial for both engineering and performance. ↩︎

# NP70PNP + Ubuntu Tweak Notes

Recently I decided to get a new laptop to replace my 5-year-old one. I happened to discover something called “barebone laptop”, which are essentially laptops with no RAM or SSD installed and no brand logo painted. The barebone laptop manufacturers generally only provide bulk sale to OEM manufacturers.
Since retail sale is not possible directly, there is a niche market for buying barebone laptops in bulk and reselling them to end customers, and there are small businesses that live on this niche; buying from one of them is the easiest way to get a barebone.

Apart from being more environment-friendly (by reusing the RAM and SSD from the old laptop), the main advantage of a barebone is the price. My new Clevo NP70PNP bought from R&J Tech (a barebone reseller business) is $1300, while a Dell laptop with the identical configuration[1] is sold at $2650. It’s surprising that Dell at least doubled the price[2] by simply plugging in the RAM and SSD and painting their logo on top[3], and their customers are still happily buying it.

Anyway, let’s get back to the topic. My experience is that the Clevo NP70PNP can work very well with Ubuntu, though a few tweaks are needed. The hardest part is that it’s very hard to find the necessary tweaks in Google due to the unpopularity of barebone laptops, which is why I’m taking notes here.

1. It has the Intel AX201 Wifi card, which doesn’t work in Ubuntu 20.04, and the reason is the kernel version, so not fixable by manually installing the firmware. However, Ubuntu 22.04 (just released last month) supports the Wifi card out of the box.
2. The Ubuntu 22.04 live-USB black screens on regular boot, but can be fixed by selecting safe graphics at boot menu. It’s a pain (for whatever reason it takes 10 minutes to load the desktop), but fortunately this is only needed for live-USB install: after the install, the GPU driver and the graphics work fine.
3. The touchpad (model FTCS1000:01 2808:0102) is the one that took me the longest to make work. It works initially, but would fail randomly after some time. After a lot of fruitless googling, it turns out that the GPU setting is the problem! (I seriously have no idea why.) As it turns out, the fix is to disable MS Hybrid for GPU.
One can do this either in BIOS (in Advanced Chip Settings, switch the option from MS Hybrid to Discrete GPU Only), or in Ubuntu NVIDIA X Server Settings (in PRIME settings choose Performance).
4. Even after the tweak, there are still some minor issues with the touchpad. Specifically, the feature that automatically disables the touchpad when an external mouse is present or while typing does not work, since for some reason the touchpad cannot be disabled from xinput. However, it can still be disabled in GNOME by the bash command gsettings set org.gnome.desktop.peripherals.touchpad send-events disabled. So I wrote a udev rule to automatically disable the touchpad when an external mouse is plugged in and re-enable it when the mouse is unplugged. Googling any udev rule tutorial should be sufficient. Automatically disabling the touchpad on typing seems much harder.
5. There are some minor issues with Bluetooth. For some reason, with A2DP Sink, there is a 0.5s delay in playing music. The problem doesn’t exist with HSP, though the audio quality of HSP is considerably lower. I haven’t figured out how to fix the problem since I usually use a headset.
6. For some reason, whenever the GPU is under load, the fan would spin at max speed (even if the GPU temperature is only 40 C or so), and the noise is a bit too loud. And it seems like neither the NVIDIA GPU driver nor fancontrol could even detect the fan, not to mention controlling it, though I haven’t dug into this issue too deeply either, since it’s not too problematic for my use case.

Other than the issues mentioned above, everything works out of the box under Ubuntu 22.04, including all my external devices and all Fn hotkeys (except the one that disables touchpad).
On the hardware side, IMO the model has only two minor design drawbacks: there are only two USB ports (and one of them is USB 2.0, seriously, why?); and the plastic hull seems relatively fragile and has many very thin parts, so I’m a bit concerned that the hull might break in an accident someday. The weight and the battery life are also not the best on the market, though they are definitely within reasonable range for a 17.3" performance-oriented laptop, and also I’m not too concerned about them for my use case.

Overall, I would recommend it as a great laptop at a great price for Ubuntu users.

#### Footnotes

1. CPU model, GPU model, screen size, screen resolution are all identical. The barebone doesn’t come with RAM or SSD, but the Dell $2650 model has the worst RAM and SSD that is sold at $30 on Amazon. ↩︎

2. In fact, I would conjecture they tripled the price: given that R&J is such a small business and how fast laptop hardware models get outdated, I guess the bulk bought-in price from Clevo is likely much less than $1000. ↩︎

3. Of course, they also install Windows, but a Windows license is not that expensive either, and also I don’t use Windows… ↩︎

# The Watchpoint Mechanism in JSC

While Javascript has a simple syntax, what happens behind the scenes is far from simple. For example, consider this innocent-looking hypot function below:

It is clear what Math.sqrt does: it performs a square root. However, to actually execute Math.sqrt(...), a lot of steps are needed:

1. First, get the global object, where all global variables reside.
2. Then, get the Math property of the global object. Normally the Math property exists (since it is predefined), but we can’t know for sure: if someone had indeed run delete Math; before, we must promptly throw an error.
3. Next, get the sqrt property of Math. Note that we cannot even be certain that Math is an object (as someone could have done Math = 123;). So as in (2), we must not omit any check for error.
4. Finally, similarly, what the sqrt property contains can be anything. Even if it is a function, it could be any function. So as before, we must not omit any check, and if sqrt is indeed a Javascript function, we perform the Javascript function call.
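
To make the four steps concrete, here is a toy model of the fully checked lookup, written in C++ rather than Javascript. It is only a sketch: Value, Object, MakeGlobal and CallMathSqrt are hypothetical names invented for this illustration, not JSC code, and a real engine has to distinguish many more cases.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

struct Object;

// Hypothetical toy value: undefined, a number, an object, or a native function.
struct Value {
    enum Kind { Undefined, Number, Obj, Func };
    Kind kind = Undefined;
    double number = 0;
    std::shared_ptr<Object> object;
    std::function<double(double)> function;
};

struct Object {
    std::map<std::string, Value> props;
};

Value MakeObject(std::shared_ptr<Object> o) {
    Value v;
    v.kind = Value::Obj;
    v.object = std::move(o);
    return v;
}

Value MakeFunction(std::function<double(double)> f) {
    Value v;
    v.kind = Value::Func;
    v.function = std::move(f);
    return v;
}

// Evaluates Math.sqrt(x) the slow way, re-checking every assumption.
double CallMathSqrt(const Object& global, double x) {
    // Steps 1-2: find Math on the global object; it may have been deleted.
    auto mathIt = global.props.find("Math");
    if (mathIt == global.props.end())
        throw std::runtime_error("ReferenceError: Math is not defined");
    // Step 3: Math may not be an object any more (e.g. after Math = 123).
    const Value& math = mathIt->second;
    if (math.kind != Value::Obj)
        throw std::runtime_error("TypeError: Math.sqrt is not a function");
    // Step 4: sqrt can be missing or hold anything; only a function may be called.
    auto sqrtIt = math.object->props.find("sqrt");
    if (sqrtIt == math.object->props.end() || sqrtIt->second.kind != Value::Func)
        throw std::runtime_error("TypeError: Math.sqrt is not a function");
    return sqrtIt->second.function(x);
}

// Builds a global object that contains the predefined Math.sqrt.
Object MakeGlobal() {
    auto mathObj = std::make_shared<Object>();
    mathObj->props["sqrt"] = MakeFunction([](double v) { return std::sqrt(v); });
    Object global;
    global.props["Math"] = MakeObject(mathObj);
    return global;
}
```

Every call pays for two map lookups and three checks before the actual square root runs, which is exactly the overhead the rest of this post tries to remove.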

So, in order to execute this innocent-looking Math.sqrt correctly (as every Javascript engine must), a ton of stuff must be done.

#### How can we make this faster?

The crucial observation is that while the programmer is technically allowed to do anything, including insane things like delete Math; or Math = 123, most sane programs will not do it. So for practical purposes, it is enough if we can make sane programs both correct and fast, while running insane programs only correctly.

In JSC (WebKit’s Javascript engine), this is achieved by Watchpoint.

Conceptually, a WatchpointSet represents a condition that one expects to be true, or simply put, a watchable condition. For example, we may expect the global object to contain property Math, and its value to be equal to the predefined Math object.

One may attach Watchpoints to the WatchpointSet. A Watchpoint is essentially a callback: after attaching to a WatchpointSet, when the condition represented by the WatchpointSet becomes false, the callback is invoked (“fired”), so the owner who created the Watchpoint can react correspondingly.
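
The relationship between the two can be sketched in a few lines of C++. This is a simplified model under my own assumptions (a Watchpoint is just a std::function, and the method names Add and FireAll are made up for the sketch); the real JSC classes carry much more machinery.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified model: a watchable condition plus its watchers.
class WatchpointSet {
public:
    using Watchpoint = std::function<void()>;  // callback run when the condition breaks

    bool IsValid() const { return m_valid; }

    // Attach a watcher; only meaningful while the condition still holds.
    void Add(Watchpoint w) {
        assert(m_valid);
        m_watchers.push_back(std::move(w));
    }

    // Called by whoever makes the condition false: fire every watcher exactly once.
    void FireAll() {
        if (!m_valid) return;
        m_valid = false;
        std::vector<Watchpoint> toFire = std::move(m_watchers);
        m_watchers.clear();
        for (auto& w : toFire) w();
    }

private:
    bool m_valid = true;
    std::vector<Watchpoint> m_watchers;
};
```

The owner of a Watchpoint never polls the condition: it registers once and is told when its assumption has died.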

While the watchpoint mechanism isn’t necessarily bound to JIT compilation (for example, LLIntPrototypeLoadAdaptiveStructureWatchpoint works without JIT), it is most powerful when combined with JIT compilation. We generate code that is optimized assuming the watchpoint condition holds, so inside the generated code, we don’t check for the condition at all. If the condition no longer holds, we must jettison the code. This is expensive, because all the work we did to generate the code is wasted, but the whole point of watchpoint is that such bad cases should happen only rarely.

#### A Motivating Example

Let’s go back to the Math.sqrt example: we want to get notified when a property of an object changes value. Therefore, all logic that writes value into object properties must cooperate with us. For simplicity, let’s assume the object Math has a Structure, say S. Then, there are two kinds of logic that may write to object properties:

1. The C++ code that implements object property writes (the slow paths).
2. The JIT’ed code that writes to a specific property of a specific structure (the fast paths).

The fast paths are known as inline caches. Inline caching is probably the most important optimization in JSC, but I will leave its details to another post. For the purpose of this post, it’s sufficient to think of each inline cache fast-path as a JIT-compiled piece of code that is specialized for a certain structure S and a certain property name prop. Given a value and an object with structure S, it writes value to property prop of the object.

The slow path case is easy to handle: whenever one writes to a property of an object, one checks whether there are Watchpoints watching the condition, and fires them if so. Of course, we are doing one extra check for every object property write. However, that code is already on the slow path, so it doesn’t hurt too much to make it a bit slower.

The fast path case is trickier. A naive solution is to add a watchpoint check, just as we did for the slow path. However, this is unsatisfactory: now, every fast-path write is doing one extra check! We can afford slowing down the slow-path, but we want to keep the fast-path fast.

So, the fast-path must not check for watchpoint conditions it violates at runtime. Instead, we permanently invalidate any and all WatchpointSets it could violate as soon as the fast-path code is JIT’ed, no matter if there are watchers or not. As another consequence, since the fast path works on a fixed property (e.g. sqrt) of a fixed Structure (e.g. S), but not on fixed objects, our watchpoints have to be in the form of <Structure, property>: they work on Structure-level but not object-level (they are called ValueReplacementWatchpointSet in JSC). For example, when a fast-path writing the sqrt property of Structure S is built, we have to be conservative and permanently invalidate WatchpointSet <S, sqrt>, since we have no way to know if that fast-path is going to run on our Math object in the future.

#### The Design

This leads to the following design in JSC. A WatchpointSet has three possible states[1]:

1. DoesNotExist: The WatchpointSet object does not physically exist (and is implicitly Valid). This is needed because there is an infinite number of watchable conditions, and also because we want to save memory. In this state, there exists no fast-path that relies on or violates the watchpoint. Slow-path executions that violate the watchpoint are not recorded (but doing so wouldn’t break the scheme).
2. Valid[2]: The watchpoint is valid: no fast-path that may violate the watched condition has been built, and one may build a fast-path relying on the watchpoint condition as long as it adds itself into the watcher list.
3. Invalidated: The watchpoint is permanently invalidated.

As one can see from the example in the previous section, the Watchpoint system needs to handle interactions with three components:

1. Slow-path (C++ code) that may violate the watched condition.
2. Fast-path (JIT’ed code) that may violate the watched condition.
3. Code (C++ or JIT’ed) that is optimized assuming the watched condition is true.

For (1), the slow-path must check in the code any watchable condition it violated, and if the corresponding WatchpointSet exists, fire all watchers. However, in such a case, the slow-path has the choice between invalidating the WatchpointSet and keeping it valid[3].

For (2), the fast-path code does not check the watchable condition it violates, but we must transition all WatchpointSets it may violate when executed to Invalidated when such a fast-path is JIT’ed (and we must create such a WatchpointSet object if it does not exist yet).

For (3), we must disable the code when the watcher callback is invoked. If the code is C++ code, then disabling the codepath is as easy as flipping a flag. If the code is JIT’ed code, we must jettison the code[4].
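
The three states and the three interactions can be put together in a small C++ sketch. Everything here is hypothetical: WatchpointTable, Watch, SlowPathWrite and FastPathBuilt are names I made up, the Key is a plain <Structure id, property name> pair, and the slow path always invalidates (the text above says it may also choose to keep the set valid).

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; the names do not correspond to JSC classes.
enum class State { Valid, Invalidated };  // DoesNotExist == no entry in the table

struct WatchpointSet {
    State state = State::Valid;
    std::vector<std::function<void()>> watchers;

    void Invalidate() {
        state = State::Invalidated;
        auto toFire = std::move(watchers);
        watchers.clear();
        for (auto& w : toFire) w();  // (3): each watcher disables its optimized code
    }
};

using Key = std::pair<int, std::string>;  // <Structure id, property name>

class WatchpointTable {
public:
    // (3) Optimized code registers itself; fails if the condition is already dead.
    bool Watch(const Key& k, std::function<void()> onFire) {
        WatchpointSet& s = m_sets[k];  // DoesNotExist -> Valid on first use
        if (s.state == State::Invalidated) return false;
        s.watchers.push_back(std::move(onFire));
        return true;
    }

    // (1) Slow path: acts only if the set physically exists.
    void SlowPathWrite(const Key& k) {
        auto it = m_sets.find(k);
        if (it != m_sets.end() && it->second.state == State::Valid)
            it->second.Invalidate();  // simplification: always invalidate
    }

    // (2) A violating fast path is being JIT'ed: invalidate eagerly,
    // creating the set first if it does not exist yet.
    void FastPathBuilt(const Key& k) {
        WatchpointSet& s = m_sets[k];
        if (s.state == State::Valid) s.Invalidate();
    }

    bool Exists(const Key& k) const { return m_sets.count(k) != 0; }

private:
    std::map<Key, WatchpointSet> m_sets;
};
```

Note how FastPathBuilt does its work once at JIT time, so the fast-path code itself never has to look at the table.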

#### Back to Our Example, and Adaptive Watchpoints

Unfortunately, in our example, it turns out that only watching on <Structure, Property> is not enough. While this handles writes to existing properties correctly, one may create new properties in the object, thus transitioning its Structure. Say, one did a Math.abc = 123;. Since it adds a property to Math, the object Math gets a different structure S2, but our watchpoint is watching on <S, sqrt>, and we are screwed. To fix this issue, we must get notified when our object changes structure as well. However, as before, since an object-property-write fast-path works on a fixed Structure but not a fixed object, we have to put our watchpoint at Structure level. That is, we will have a WatchpointSet on each Structure S, asserting that it never makes further transitions to other Structures (this is called a StructureTransitionWatchpointSet in JSC).

The last interesting piece is what to do when a StructureTransitionWatchpointSet turns to Invalidated state. If the transition happened on another object with the same Structure S, even though our Math object is not modified, we have no choice but to invalidate our code, as the StructureTransitionWatchpointSet for S has been invalidated, so we have no way to get notified if our Math object gets transitioned in the future.

However, if the transition happened on object Math (i.e. Math itself gets a new Structure), then it’s possible to keep our optimized code valid: we just need to start watching <S2, sqrt> instead. So we will move our ValueReplacementWatchpoint to watch <S2, sqrt> and our StructureTransitionWatchpoint to watch S2, and keep our code valid[5]. In JSC, such watchpoints whose action on fire is to move themselves to new places are called AdaptiveWatchpoints.

#### Ending Thoughts

This way, by watching that the Math property of the global object never changes value, and that the sqrt property of the Math object never changes value, the code Math.sqrt is reduced from two object property lookups with a ton of error checks to a constant (not even a branch!) in the JIT’ed code.

The watchpoint mechanism also helps other optimizations to generate better code. For example, the call opcode (which calls whatever is stored in Math.sqrt) has its own inline caching that records which functions it has called. For sane programs that do not mess with the predefined objects, there will be only one callee recorded: the sqrt intrinsic function. Normally this would allow the compiler to emit a check (that the result of expression Math.sqrt equals the sqrt intrinsic function) and speculatively inline sqrt. However, since the watchpoint already tells us that Math.sqrt must evaluate to the sqrt intrinsic function, the compiler can do better: it may omit the check and inline sqrt directly. Now, for sane programs, all the terrible stuff listed at the beginning of this post is gone, so the JIT’ed code to evaluate the Math.sqrt part is as efficient as if it were directly written in C++!

Finally, a couple of side notes:

1. If we want to avoid the case that the transition of another object results in invalidation of our code, we can give our object its own unique Structure, though the downside is that we might blow up the inline cache if we do it for too many objects.
2. The slow-path does not fire the watchpoint if the watchpoint is in DoesNotExist or Clear state. This not only saves memory, but is also an advantage for the use case above: while it’s plausible to assume that sane programs will not change Math.sqrt frequently, it’s also plausible for them to change it at program start (e.g., to log a warning if the input to sqrt is negative). Since such code will execute in slow-path and before any fast-path relying on the WatchpointSet is built, it will not invalidate the WatchpointSet, as desired.

#### Acknowledgements

I thank Saam Barati from JSC team for teaching me all of these (and more) using his precious spare time, and for his valuable comments on this post. Of course, any mistakes in this post are mine.

#### Footnotes

1. Note that the DoesNotExist state is not listed in the enum, since in this state the object doesn’t exist at all. ↩︎

2. In fact, JSC further distinguishes Valid state into Clear and Watched, to determine the behavior when a slow-path violation happened (see Footnote 3). However, this is only a design detail, so we put it in footnote for ease of understanding. ↩︎

3. When the WatchpointSet is in Clear state, the slow-path will keep it in Clear state. However, if it is in Watched state, even if there are no watchers, it will be transitioned to Invalidated state. ↩︎

4. Things get trickier if the code is already running (e.g., the code being jettisoned is the current function being executed, a function in the call stack, or even a function inlined by the current function), in which case we must OSR Exit to the baseline tier, but we will ignore such complexities in this post. ↩︎

5. Of course, if the ValueReplacementWatchpointSet of <S2, sqrt> or the StructureTransitionWatchpointSet of S2 is already Invalidated, we will still have to invalidate our code. ↩︎

# Note on x86-64 Memory Model

In the past years, I have undergone a few cycles of learning the x86-64 memory model, only to eventually forget it again. Today I was fortunate to see a great paper which explained this matter very clearly, so I’m taking a note here for future reference.

The model in the paper is particularly easy to understand because it is described by standard software lock primitives[1], as below:

1. There is one global lock G.
2. There is one background thread T.
3. Each CPU has a store buffer, which is a queue of items <address, value>. The store buffer is pushed by the owning CPU, and popped by the background thread T.

The background thread T does only one thing:

1. Lock global lock G.
2. Pop an item <addr, value> from the store buffer of a CPU, write the value to main memory: MainMemory[addr] = value.
3. Unlock global lock G.

The procedure for a CPU to execute an instruction is described below.

#### STORE instruction

1. Push item <addr, value> to its store buffer.

#### LOAD instruction

1. Lock global lock G.
2. If addr exists in its store buffer, return corresponding value[2]. Otherwise return MainMemory[addr].
3. Unlock global lock G.

#### MFENCE instruction

1. Wait until its store buffer is eventually emptied by background thread T.

#### ATOMIC instruction

1. Lock global lock G.
2. Run the atomic instruction, using subroutines described above for LOAD and STORE.
3. Empty its own store buffer: pop every item <addr, value> from the store buffer and write to main memory: MainMemory[addr] = value, until the store buffer is empty.
4. Unlock global lock G.

Note that the semantics of LOAD and STORE provide the expected consistency on single-threaded programs.
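
The abstract machine is small enough to simulate directly. Below is a single-threaded C++ sketch under my own assumptions: the class name Machine and its methods are invented for the illustration, the global lock G is implicit because every method runs as one indivisible step, unwritten memory reads as 0, and the caller plays the role of background thread T by calling DrainOne.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <utility>
#include <vector>

// Hypothetical simulation of the abstract machine described above.
class Machine {
public:
    explicit Machine(int numCpus) : m_buffers(numCpus) {}

    // STORE: push <addr, value> to the CPU's own store buffer.
    void Store(int cpu, int addr, int value) {
        m_buffers[cpu].push_back({addr, value});
    }

    // LOAD: the newest matching entry in the CPU's own buffer wins,
    // otherwise read main memory.
    int Load(int cpu, int addr) const {
        const auto& buf = m_buffers[cpu];
        for (auto it = buf.rbegin(); it != buf.rend(); ++it)
            if (it->first == addr) return it->second;
        auto m = m_memory.find(addr);
        return m == m_memory.end() ? 0 : m->second;
    }

    // One step of background thread T: write back the oldest entry of one CPU.
    bool DrainOne(int cpu) {
        auto& buf = m_buffers[cpu];
        if (buf.empty()) return false;
        m_memory[buf.front().first] = buf.front().second;
        buf.pop_front();
        return true;
    }

    // MFENCE: returns only once the CPU's store buffer is empty. In this
    // simulation we drain it ourselves instead of waiting for T.
    void Mfence(int cpu) {
        while (DrainOne(cpu)) {}
    }

    // ATOMIC (compare-and-swap): LOAD, maybe STORE, then flush the own buffer.
    bool CompareAndSwap(int cpu, int addr, int expected, int desired) {
        bool ok = Load(cpu, addr) == expected;
        if (ok) Store(cpu, addr, desired);
        Mfence(cpu);
        return ok;
    }

private:
    std::vector<std::deque<std::pair<int, int>>> m_buffers;
    std::map<int, int> m_memory;
};
```

With two CPUs, letting CPU 0 store to x and CPU 1 store to y and then having each load the other's address before any drain happens yields 0 on both sides, which is the classic store-buffering outcome that x86-64 permits.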

#### An Application

Let’s analyze why the familiar spinlock implementation below is correct under x86-64 memory model:
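
A minimal sketch of the kind of spinlock this argument is about, written with C++11 atomics. The names Init, Lock and Unlock follow the text; the struct layout, the memory orders and the TryLock helper are my assumptions. The point to notice is that Lock uses a CAS (the ATOMIC instruction of the model) while Unlock is a plain store.

```cpp
#include <atomic>
#include <cassert>

// Sketch of a test-and-set spinlock: 0 means free, 1 means held.
struct SpinLock {
    std::atomic<int> lock;

    void Init() { lock.store(0, std::memory_order_relaxed); }

    // Hypothetical helper: one CAS attempt, returns true if the lock was taken.
    bool TryLock() {
        int expected = 0;
        return lock.compare_exchange_strong(expected, 1, std::memory_order_acquire);
    }

    // Spin until a CAS from 0 to 1 succeeds.
    void Lock() {
        while (!TryLock()) {}
    }

    // A plain STORE: a release store needs no fence or LOCK prefix on x86-64.
    void Unlock() { lock.store(0, std::memory_order_release); }
};
```
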

We will prove the correctness via a token-counting argument. Each <lock, 0> in a store buffer counts as one token, and if MainMemory[lock] == 0, it also counts as one token. By definition at any moment the number of tokens cannot go below 0.

By the abstract machine semantics above, it’s not hard to prove that:

1. The background thread T cannot increase the total number of tokens.
2. Each Unlock() call creates one token.
3. Each Lock() call cannot return until it successfully consumes at least one token (If the CAS succeeded by seeing a 0 in its store buffer, that token is lost after the CAS because CAS flushes store buffer and also changes the memory value to 1. If the CAS succeeded by seeing a 0 in the main memory, that token is also lost because the store buffer item of the new value 1 is flushed to memory, overwriting the 0 value).

Initially there is one token (by the Init() call). Since Unlock() may only be called after[3] (guaranteed by program order) Lock(), the total number of tokens cannot go above one at any moment. So after a Lock() returns, there must be zero tokens, so no other Lock() can return. The total number of tokens goes back to one only when the Unlock() in that program is called, and only after that may another Lock() operation return. So the Lock() -> Unlock() time intervals are pairwise non-overlapping, providing the mutual exclusion one would expect.

##### Footnotes

1. Therefore, while the description is logically equivalent to the guarantees provided by the hardware, this is not how the hardware physically implements the memory subsystem. The hardware implementation is way more efficient. ↩︎

2. Of course, if addr showed up multiple times in the store buffer, we should return the value of the latest version. ↩︎

3. Throughout this argument, the time relationship is about wall clock time (or, the relative position in the interleaved event sequence of all CPUs). ↩︎

# From X Macro to FOR_EACH to Cartesian Product Enumeration with C Macro

Quite a while ago I was implementing an interpreter. A common task in the interpreter is to select the correct interpreter function based on the type of the input. Let’s say we want to implement an addition. We might end up with something like below:

to implement the operation. At runtime, we want to dispatch to the right function based on the type of the operands. A natural way to do this is to have a static array holding the mapping from the operand type to the function pointer, similar to below:

so at runtime we can just read x_addOpPtr[operandType] to obtain the function pointer we want to call.
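A minimal sketch of the setup described above, under a few assumptions of mine: the operand types, the `OperandType` enum, the type-erased `Add` signature, and the `GetAddFn` helper are all illustrative. I also store typed function pointers instead of `void*` so the entries can be called directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed operand type tags; the order must match the table below.
enum OperandType : std::size_t { Int32, Int64, Float, Double, NumOperandTypes };

// One instantiation per operand type; operands are passed type-erased.
template <typename T>
void Add(const void* lhs, const void* rhs, void* out) {
    *static_cast<T*>(out) = *static_cast<const T*>(lhs) + *static_cast<const T*>(rhs);
}

using AddFn = void (*)(const void*, const void*, void*);

// Hand-written mapping from OperandType to function pointer.
// Nothing checks that this order agrees with the enum above.
static const AddFn x_addOpPtr[] = {
    &Add<std::int32_t>,
    &Add<std::int64_t>,
    &Add<float>,
    &Add<double>,
};

// Runtime dispatch: a single array read.
inline AddFn GetAddFn(OperandType operandType) {
    assert(operandType < NumOperandTypes);
    return x_addOpPtr[operandType];
}
```

Swapping two rows of `x_addOpPtr` still compiles, which is exactly the kind of silent mistake the hand-written table invites.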

#### The X Macro

Although the code above can work, it is clearly too error-prone. If we accidentally make a mistake in the order of the list, we are screwed. A better way is the X Macro pattern. We define an “X macro” for all the types:

Then, by defining what X(EnumType, CppType) expands to, we can create logic based on our needs. For example, the following code would reproduce the x_addOpPtr array we want:

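Note the final nullptr: every expansion (void*)Add<CppType>, ends with a comma, so the list would otherwise end with a trailing comma. C++ does accept a trailing comma in a braced initializer, so the nullptr is not strictly required, but it makes a harmless sentinel.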
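A sketch of the X Macro pattern described above. The macro name `ALL_TYPES`, the concrete type list, and the `Add` signature are assumptions for illustration, and the table stores typed function pointers rather than the `(void*)` casts mentioned in the text. The same list generates both the enum and the table, so the two cannot drift out of order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The X macro: one list of (EnumType, CppType) pairs, parameterized on X.
#define ALL_TYPES(X)        \
    X(Int32, std::int32_t)  \
    X(Int64, std::int64_t)  \
    X(Float, float)         \
    X(Double, double)

// Expansion 1: the enum of operand types.
enum OperandType : std::size_t {
#define X(EnumType, CppType) EnumType,
    ALL_TYPES(X)
#undef X
    NumOperandTypes
};

template <typename T>
void Add(const void* lhs, const void* rhs, void* out) {
    *static_cast<T*>(out) = *static_cast<const T*>(lhs) + *static_cast<const T*>(rhs);
}

using AddFn = void (*)(const void*, const void*, void*);

// Expansion 2: the dispatch table, in the same order as the enum.
static const AddFn x_addOpPtr[] = {
#define X(EnumType, CppType) &Add<CppType>,
    ALL_TYPES(X)
#undef X
    nullptr  // sentinel following the last generated comma
};

static_assert(sizeof(x_addOpPtr) / sizeof(x_addOpPtr[0]) == NumOperandTypes + 1,
              "one entry per type plus the sentinel");

inline AddFn GetAddFn(OperandType operandType) {
    assert(operandType < NumOperandTypes);
    return x_addOpPtr[operandType];
}
```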

#### The New Challenge

X Macro solved the above problem, but what if we want to handle, say, a type cast opcode?

Unlike the addition operator, now we have two types, Src and Dst, to enumerate over, so we have to generate a two-dimensional array. While the X Macro can easily iterate through one list and perform an action on every item, it cannot iterate through the Cartesian product of two lists. A worse solution is, of course, to manually define a list containing all the <Src, Dst> pairs, so we can use the X Macro again. But what if we want to do a three-dimensional Cartesian product in the future?

After some fruitless Googling and home-grown attempts to build a “two-dimensional X Macro”, I eventually gave up and switched to an ugly solution. Instead of generating a clean static array, we generate a tree of templated dispatching functions. The function at the i-th level uses a dispatch array (built by the X Macro) to dispatch to the next level’s selector function based on the i-th parameter type. We get the function pointer when we reach the leaf. While this approach works, no doubt it is very ugly, and probably also less performant (I didn’t check whether the C++ compiler was able to optimize away all the terrible things).
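To make the “tree of templated dispatching functions” concrete, here is a two-level sketch for the cast opcode. It is a reconstruction under assumptions, not the code I actually had: `ALL_TYPES`, `Cast`, `SelectDst` and `SelectCast` are illustrative names, and the `Cast` signature is made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#define ALL_TYPES(X)        \
    X(Int32, std::int32_t)  \
    X(Int64, std::int64_t)  \
    X(Float, float)         \
    X(Double, double)

enum OperandType : std::size_t {
#define X(EnumType, CppType) EnumType,
    ALL_TYPES(X)
#undef X
    NumOperandTypes
};

// Hypothetical cast kernel: reads a Src from `in`, writes a Dst to `out`.
template <typename Src, typename Dst>
void Cast(const void* in, void* out) {
    *static_cast<Dst*>(out) = static_cast<Dst>(*static_cast<const Src*>(in));
}

using CastFn = void (*)(const void*, void*);

// Level 2 (leaf): Src is fixed at compile time, Dst is picked at runtime.
template <typename Src>
CastFn SelectDst(OperandType dst) {
    static const CastFn table[] = {
#define X(EnumType, CppType) &Cast<Src, CppType>,
        ALL_TYPES(X)
#undef X
        nullptr
    };
    assert(dst < NumOperandTypes);
    return table[dst];
}

// Level 1: pick the level-2 selector by Src, then let it resolve Dst.
inline CastFn SelectCast(OperandType src, OperandType dst) {
    using Selector = CastFn (*)(OperandType);
    static const Selector table[] = {
#define X(EnumType, CppType) &SelectDst<CppType>,
        ALL_TYPES(X)
#undef X
        nullptr
    };
    assert(src < NumOperandTypes);
    return table[src](dst);
}
```

Each extra dimension needs another level of selector functions and another indirect call at lookup time, which is why a single flat array is more attractive.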

#### The FOR_EACH Macro

I used to believe my ugly solution was as good as one can get without resorting to manually enumerating the Cartesian product. However, today I learnt an interesting approach from David Mazieres, which he calls the FOR_EACH macro.

The semantics of the FOR_EACH macro is pretty clear. Taking a macro X (similar to the X in X Macro) and a comma-separated list of elements e1, e2, ... , en, the FOR_EACH macro invokes X on each e in the list. For example, the addition example would look like:
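A minimal sketch of a `__VA_OPT__`-based FOR_EACH applied to the addition example. The helper macro names and the number of `EXPAND` levels are my assumptions and may differ in detail from the original article; `TYPE_LIST`, the `Add` signature and `GetAddFn` are illustrative. It needs a compiler that supports `__VA_OPT__` (standard since C++20).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rescanning helpers: each EXPAND level forces several more macro rescans,
// and each rescan lets the deferred recursion advance by one element.
#define PARENS ()
#define EXPAND(...) EXPAND3(EXPAND3(EXPAND3(EXPAND3(__VA_ARGS__))))
#define EXPAND3(...) EXPAND2(EXPAND2(EXPAND2(EXPAND2(__VA_ARGS__))))
#define EXPAND2(...) EXPAND1(EXPAND1(EXPAND1(EXPAND1(__VA_ARGS__))))
#define EXPAND1(...) __VA_ARGS__

// FOR_EACH(macro, e1, ..., en) expands to macro(e1) ... macro(en).
#define FOR_EACH(macro, ...) \
    __VA_OPT__(EXPAND(FOR_EACH_HELPER(macro, __VA_ARGS__)))
#define FOR_EACH_HELPER(macro, a1, ...) \
    macro(a1) __VA_OPT__(FOR_EACH_AGAIN PARENS (macro, __VA_ARGS__))
#define FOR_EACH_AGAIN() FOR_EACH_HELPER

// The list is now just elements; it no longer mentions X.
#define TYPE_LIST std::int32_t, std::int64_t, float, double

// Assumed tags, kept in the same order as TYPE_LIST by hand in this sketch.
enum OperandType : std::size_t { Int32, Int64, Float, Double, NumOperandTypes };

template <typename T>
void Add(const void* lhs, const void* rhs, void* out) {
    *static_cast<T*>(out) = *static_cast<const T*>(lhs) + *static_cast<const T*>(rhs);
}

using AddFn = void (*)(const void*, const void*, void*);

static const AddFn x_addOpPtr[] = {
#define X(CppType) &Add<CppType>,
    FOR_EACH(X, TYPE_LIST)
#undef X
    nullptr  // sentinel following the last generated comma
};

static_assert(sizeof(x_addOpPtr) / sizeof(x_addOpPtr[0]) == NumOperandTypes + 1,
              "one entry per type plus the sentinel");

inline AddFn GetAddFn(OperandType operandType) {
    assert(operandType < NumOperandTypes);
    return x_addOpPtr[operandType];
}
```

Running the file through `clang++ -E` shows the expanded initializer list, which is a handy way to check what the macro produced.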

The most important difference between the FOR_EACH macro and the X Macro is that the FOR_EACH list definition doesn’t take X. In the X Macro, the list definition itself hardcodes how X is invoked on each element (X can only be passed the element itself). The FOR_EACH macro decouples “the element to be processed” from “the macro processing the element”. This removes the biggest blocker to implementing a macro that can enumerate the Cartesian product of multiple lists.

The core of the trick that allows FOR_EACH’s list definition to get rid of the X lies in the new C++20 feature __VA_OPT__. David Mazieres’ original article already explains well how the FOR_EACH macro works, so I won’t parrot it here. With the main blocker removed, after only a few hours of work, I was able to extend FOR_EACH to support enumerating the Cartesian product of multiple lists. (By the way, even after implementing it, I still have very little idea of how the C preprocessor works, but clang++ -E is enough to trial-and-error my way into a working solution.)

#### The FOR_EACH_CARTESIAN_PRODUCT Macro

I call my macro FOR_EACH_CARTESIAN_PRODUCT. As the name suggests, it takes a macro X and one or more lists (L1), ..., (Ln). Then for each (e1, ..., en) in the Cartesian product L1 x ... x Ln, the macro X(e1, ..., en) is invoked. The elements in the Cartesian product are enumerated in lexicographic order.

For example, for the type-casting example above, the code below would construct our desired two-dimensional dispatch array:

Note that the generated array is one-dimensional, but indexing it is pretty simple: x_castOpPtr[opType1 * numTypes + opType2] will give us the desired function pointer for Src=opType1 and Dst=opType2.
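The flat indexing can be checked independently of the macro. The sketch below builds a table with the same layout using templates instead of FOR_EACH_CARTESIAN_PRODUCT, purely to illustrate the `opType1 * numTypes + opType2` arithmetic. The type list, the `Cast` signature, `MakeCastTable` and `LookupCast` are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <utility>

// Hypothetical cast kernel: reads a Src from `in`, writes a Dst to `out`.
template <typename Src, typename Dst>
void Cast(const void* in, void* out) {
    *static_cast<Dst*>(out) = static_cast<Dst>(*static_cast<const Src*>(in));
}

using CastFn = void (*)(const void*, void*);

// Assumed type list; the position in the tuple is the operand type index.
using Types = std::tuple<std::int32_t, std::int64_t, double>;
constexpr std::size_t numTypes = std::tuple_size<Types>::value;

// Entry I holds Cast<Src, Dst> with Src = I / numTypes and Dst = I % numTypes,
// i.e. the <Src, Dst> pairs appear in lexicographic order.
template <std::size_t... I>
constexpr std::array<CastFn, sizeof...(I)> MakeCastTable(std::index_sequence<I...>) {
    return {{&Cast<std::tuple_element_t<I / numTypes, Types>,
                   std::tuple_element_t<I % numTypes, Types>>...}};
}

constexpr auto x_castOpPtr =
    MakeCastTable(std::make_index_sequence<numTypes * numTypes>{});

// Flat lookup for Src = opType1 and Dst = opType2.
inline CastFn LookupCast(std::size_t opType1, std::size_t opType2) {
    assert(opType1 < numTypes && opType2 < numTypes);
    return x_castOpPtr[opType1 * numTypes + opType2];
}
```

With three types the table has nine entries, and `LookupCast(0, 2)` lands on index 2, the `Cast<std::int32_t, double>` entry.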

The code, which contains both the implementation of FOR_EACH_CARTESIAN_PRODUCT and the above examples, can be found here. The code is in the public domain, so feel free to use it.

# Some Random Thoughts

Recently I attended the OOPSLA 2021 conference. While I have barely learnt anything from the conference itself, it was indeed a break from my routine life, and I have probably met more people than in the past few months combined. As a result, I had some random thoughts and reflections, which I decided to note down here before they disappear.

#### Methodology of Decision Making

It is well known that one should not judge a decision based on its outcome, because we cannot have full awareness of the world (so there is information that we cannot know beforehand), and the world itself also contains many random factors (so we cannot predict the future even with complete information). Therefore, an undesirable outcome does not imply the original decision was wrong or improvable.

However, humans are born irrational. Many never understand the above argument at all. But even among the people who do recognize it, I have seen (in both myself and others) many cognitive pitfalls when applying the argument.

Pitfall #1. Only applying it when things go wrong. Ego makes people believe in their own decisions. So when things work out, it’s easy to overlook the possibility that the success was only coincidental and the original decision was unjustifiable.
As an extreme example, winning a lottery does not justify the decision to spend large amounts of money on lottery tickets (which has a negative expected gain).

Pitfall #2. Overlooking the information one can know. Similarly, due to people’s ego, when something fails, it’s easy to use the argument as an excuse to deny responsibility, not realizing that the original decision could have been improved with more investigation and reasoning.
The most obvious examples are decisions made through overconfidence and negligence.

Pitfall #3. Overlooking the information one cannot know. As stated earlier, one cannot have complete information about the world, so the fact that “there is information that one cannot know” is itself a piece of information that must be considered in decision making.
Examples of this pitfall are “perfect plans” that are designed without backups and leave no buffer for accidents and errors.

So in short, one should not judge a decision based on its outcome, be it desirable or not; one should judge a decision based on the justifiability of how the decision was reached.

(Interestingly, everything stated above can also be observed in Japanese Mahjong)

#### Methodology of Becoming Productive Researchers

I am aware that all my research ideas have been produced by pure luck. If one rewound time, I’m very doubtful that I could come up with the same ideas again. And I feel clueless about how to figure out the “next idea”, which some of the PhD students who are more “on the right track” seem to do easily. So I have been curious how professors can generate an endless stream of ideas for papers. I happened to discuss this topic with two professors, so for future reference, I am taking notes here based on my memory.

Q1: (The context here is theoretical computer science.) The difficulty of figuring out a proof is probably exponential in the number of steps in the proof. So how can you and your group produce so many long (>50 pages) papers every year?

A1 (Richard): We are not going for particular problems. Instead, we have a set of powerful math tools as building blocks, and we just build things up by gluing the blocks together, without a particular goal in mind. If at some point we find that the thing we built solves some problem, we have a paper. It’s like a naval battle: you don’t search for and destroy a particular ship (problem) on the sea. You patrol the sea and destroy any ship spotted along the way.

Q2: But I assume you still have some intuition on what might work and what might not. What is the intuition that guides you and how did you get this intuition?

A2 (Richard): I don’t know. It’s like the first man who figured out they can put guns on ships.

Q3: (The context here is PL/Compilers, and my advisor primarily works on sparse tensor algebra.) I asked my advisor how he could always have the next paper idea to work on.

A3 (Fred): There are many solved problems in the dense algebra domain, but little is known about the sparse algebra counterparts. TACO is a framework for solving problems in the sparse algebra domain, so it opens up a sequence of works by porting the solved problems in dense algebra to sparse algebra.

#### What Prevents Constructive and Rational Discussion?

The motivation for this part is the (still ongoing!) Chinese Internet shitshow centered around a conspiracy theory that the Grand Final match of Dota2 TI10 was fixed.

I have never held any expectations about the rationality of the mass public, but I’m still astonished that a conspiracy theory without even a self-coherent story can gain dominance in the Chinese Dota2 community.

Though it might be my illusion, I do feel the Internet discussions I see on contemporary matters have been getting increasingly polarized / emotion-driven, and decreasingly constructive / helpful in the past years. Is the mass public becoming more irrational? I don’t know. But after thinking for a bit, I do feel there are a few contributing factors.

1. It takes many more words and much more time to refute a conspiracy theory or to post some serious discussion than to propose a conspiracy theory or to post some trashtalk.
2. The bandwidth of the Internet has greatly increased, but the bandwidth of the useful information being carried has actually decreased. On one hand, blogs have been replaced by tweets and text by pictures and videos, so the mass public has been trained to read only short messages. On the other hand, mobile phones, which are not even designed for efficient typing, surpassed PCs in market share long ago, so it is also harder for the mass public to publish anything other than short messages. The result is a public trained to read 140 characters and post 140 characters, not serious discussions.
3. By the Pareto principle, 80% of the voice in a community comes from 20% of the people. And the people who feel most compelled to speak out are usually the people holding the most extreme opinions. Under the current shape of the Internet, where serious discussions are disfavored, the megaphone is handed over to the most irrational ones, not the most rational ones.
4. The ranking mechanism, where content is ranked by user votes and shown to users by rank, serves as another amplifier.
5. But what about the POLs? Will they send out rational messages and lead public opinion to the rational side? Unfortunately, at least for the Chinese Internet’s status quo, where most of the POLs are commercialized, the answer is negative. The POLs do not care about anything but making more money, which comes from public exposure and supporters. So they have no motivation to argue against the trend at all. In fact, many POLs are known to intentionally start flamewars or spread falsehoods to gain exposure.
6. Another interesting factor is the bots. It might be surprising, but a CMU study showed that the majority of the COVID falsehoods on Twitter are spread by bots. And while I haven’t seen an academic study of the Chinese Internet, it’s undeniable that there are many keyword-based bots for various purposes (lottery, advertisement, promotion, PR manipulation, etc). It would not be surprising if there were similar falsehood-spreading bots as well.

But what exactly went wrong? And how might this be fixed? Honestly, I don’t know.