Jiterpreter For WebAssembly
Right now, for acceptable performance, any user code needs to be AOTed and linked into dotnet.wasm at build time. This only happens during publish, can be time-consuming, and makes it more difficult for users to test and debug their applications. Some code patterns and features are also incompatible with AOT. The result is that in some scenarios, the user has no choice but to rely on the Mono interpreter to run parts of their application - parts that may be performance sensitive.

The Ideal Solution
The ideal solution would be a full JIT targeting WebAssembly, equivalent in functionality to the existing Mono and RyuJIT code generators for x86, ARM, etc. Unfortunately, this poses many challenges:
  • The JIT would need to have adequate support for all the IL and intrinsics in a user application, including obscure opcodes and vectors
  • Some of the features needed to fully JIT entire methods or classes (like vector instructions) are conditionally available, which could cause methods or classes to suddenly become 10x slower depending on the user’s configuration
  • Relying on full method JIT to execute code produces new problems for threaded and synchronous workloads due to synchronization and blocking (browsers limit the amount of code you can compile synchronously to 4KB; see the sketch after this list)
  • The JIT would be a large new source of behavioral differences and bug reports
  • The JIT would compete with AOT for engineering resources due to its size and scope
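To make the synchronous-compilation constraint concrete, here is a small sketch using the standard WebAssembly JS API. The byte array is just the minimal empty module, and the size limit described is Chrome's main-thread behavior:

```typescript
// The 8 bytes below form the smallest valid wasm module:
// the '\0asm' magic number followed by version 1.
const tinyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

// Synchronous compilation works for small modules like this one...
const compiled = new WebAssembly.Module(tinyModule);

// ...but on Chrome's main thread, new WebAssembly.Module() throws for
// buffers larger than ~4KB. Only the async API accepts large modules,
// and a JIT that must patch and resume execution inline can't await it:
// WebAssembly.compile(largeBuffer).then(m => { /* too late */ });
```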
In practice this ideal JIT would be a big engineering undertaking that might not end up usable by many customers. The rewards might eventually justify the investment, but in the medium term it would not be straightforward to build and deliver.

The Jiterpreter
The Mono interpreter is already tuned to deliver good performance in many workloads with the aid of a tiering system that performs optimizations users expect from an AOT compiler, like inlining. However, even aggressively optimizing this interpreter bytecode can’t bring it within reach of native - the overhead will still be in the 5-15x range. Any improvement needs to maintain the key benefits users get out of the interpreter, like fast iteration times, full feature support, and debugging.
The Jiterpreter solves this by providing optimized execution for interpreter bytecodes. Instead of a new from-scratch compiler or backend, the Jiterpreter slots into the existing interpreter and consumes its existing optimized bytecodes to generate tiny blobs of WebAssembly code that can be run in place of the interpreter.
Because it slots directly into the interpreter and uses the same bytecodes, it also can share all the interpreter’s internal state. Critically, this means that it does not have to implement everything, and complex or obscure opcodes can be handed off to the interpreter, while performance-critical computations, copies and comparisons run with performance comparable to native.
Because it fits into the interpreter, it maintains full support for debugging and does not meaningfully increase build or startup time. It benefits from all the same optimizations the interpreter does, so as we further improve the interpreter’s optimizer, the jiterpreter will get faster too. By mirroring the data layout and opcodes of the interpreter, its behavior sticks as closely as possible to what the user would get in a regular debug build.
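As a rough illustration of this shared-state design, here is a minimal sketch (all names are hypothetical, not the runtime's actual API) of the contract between the interpreter and a compiled trace:

```typescript
// A trace operates directly on the interpreter's own frame and locals
// memory, so no state needs to be marshaled in or out. Its only result
// is the bytecode offset where the interpreter should resume.
type TraceFunction = (frame: number, pLocals: number) => number;

function dispatchAt(ip: number, frame: number, pLocals: number,
                    traces: Map<number, TraceFunction>): number {
    const trace = traces.get(ip);
    if (trace) {
        // Run natively as far as possible, then hand the rest of the
        // method (complex or obscure opcodes included) back to the
        // regular interpreter loop.
        return trace(frame, pLocals);
    }
    return ip; // no trace here; the interpreter executes this opcode itself
}
```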

Traces, Not Methods
Once we decide to omit support for complex and obscure opcodes, we leave the realm of compiling entire methods and instead begin compiling “traces” of sequential opcodes. This allows us to intelligently target the hottest parts of an application and avoid completely falling down in the face of difficult code patterns, like 10k-line C# methods or methods with nested exception filters. For those cases, we can simply defer to the interpreter for the hard bits while we optimize the hot paths (if any).
Simpler methods can be represented by a single one of these traces, while a more complex method might have a trace for the body of an inner loop along with traces for the prologue and epilogue. If the epilogue is cold (because the function typically returns from inside the loop body), no trace needs to be generated for it at all.
Traces can also omit the heavy code for implementing cold paths - for example, there’s no need to embed the code for a ‘throw’ statement because we can just hand it off to the interpreter in the rare circumstance that the method fails and has to throw an exception.
This produces small, dense native functions that are gentle on the branch predictor and fit nicely into the instruction cache in a way that an AOT compiler or method JIT - needing to provide full, comprehensive support for all execution paths - may struggle to match.
Instead of globally shutting off optimizations when debugging is in play, we can disable individual traces on the fly based on whether breakpoints are set, and can safely drop out of a trace because it shares state with the interpreter.
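A sketch of what that per-trace disabling could look like, reusing the hypothetical names from the earlier sketch - because all state lives in the interpreter, un-installing a trace is safe at any point:

```typescript
type TraceFunction = (frame: number, pLocals: number) => number;

// Hypothetical sketch: when a breakpoint lands inside a trace's range,
// remove just that trace from the dispatch table. Execution transparently
// falls back to the interpreter at that location, where the debugger
// machinery works as usual; re-enabling is just re-installing the entry.
function setTraceEnabled(traces: Map<number, TraceFunction>, startIp: number,
                         compiled: TraceFunction, enabled: boolean): void {
    if (enabled)
        traces.set(startIp, compiled);
    else
        traces.delete(startIp);
}
```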

Incremental, Progressive Enhancement
Over time this jiterpreter approach can be expanded and improved based on test cases and user applications - if our instrumentation shows an unsupported opcode or code pattern is important, we can cheaply expand the jiterpreter to support it. Heuristics can be adjusted to tune when it comes into play for ideal performance and memory usage. In the long run, we could even locally cache libraries of these compiled traces to get closer to the experience of AOT without some of the downsides (increased binary size, slower downloads, slower builds).
The jiterpreter model would also allow us to handle some key use cases currently not addressed by AOT, like generating function pointers, trampolines and wrappers at runtime.
As an expansion of the interpreter, the jiterpreter also acts as a progressive enhancement - if it fails for any reason (disabled by CSP, no browser support, out of memory, a code generator bug), the runtime can silently fall back to the regular interpreter with no difference in behavior, regressing only the performance of a small number of traces.
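The fallback itself can be as simple as a try/catch around the standard WebAssembly JS API - a sketch under the assumption that the compiled blob exports a single trace function (the export name and surrounding names are hypothetical):

```typescript
// If compilation or instantiation fails for any reason - CSP, lack of
// browser support, out of memory, a code generator bug - we return null
// and the caller leaves the interpreter to run the same bytecode it
// always could.
function tryCompileTrace(wasmBytes: Uint8Array,
                         imports: WebAssembly.Imports): Function | null {
    try {
        const module = new WebAssembly.Module(wasmBytes);
        const instance = new WebAssembly.Instance(module, imports);
        return instance.exports["trace"] as Function;
    } catch {
        return null; // silent fallback: behavior is unchanged, only speed
    }
}
```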

A Brief Implementation Overview
The current implementation works by inserting ‘prepare jiterpreter’ interpreter opcodes at key locations. Those opcodes update hit counts when visited, and when a hit count crosses a key threshold, the jiterpreter activates and attempts to compile a trace at that location. If compilation succeeds, the opcode is patched to become an indirect function call that jumps into and executes the trace; a failed compile patches the opcode to become a special type of nop. When a trace is executed it proceeds forward as far as possible, then returns the new instruction pointer so that the interpreter can resume execution at the appropriate place.
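A sketch of that tiering logic, with hypothetical names and an assumed threshold (the real opcode operates on interpreter state in native memory, not a JS Map):

```typescript
const HIT_THRESHOLD = 1000; // assumed value, purely for illustration

// Invoked each time a 'prepare jiterpreter' opcode is visited. Once the
// location is hot enough we attempt compilation exactly once: success
// patches the opcode into an indirect call to the trace, failure patches
// it into a special nop so we never pay the compilation cost again.
function onPrepareOpcode(ip: number,
                         hitCounts: Map<number, number>,
                         compileTrace: (ip: number) => Function | null,
                         patchOpcode: (ip: number, trace: Function | null) => void): void {
    const hits = (hitCounts.get(ip) ?? 0) + 1;
    hitCounts.set(ip, hits);
    if (hits === HIT_THRESHOLD)
        patchOpcode(ip, compileTrace(ip));
}
```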
The jiterpreter is implemented in TypeScript and maps Mono interpreter opcodes to wasm opcodes, typically around 4-8 wasm opcodes per Mono opcode, in a single pass. When reaching an unsupported opcode or running out of space (due to browsers’ 4KB limit) the trace halts and returns to the interpreter. Forward branches are supported by splitting the trace into conditionally executed blocks, while backward branches and safepoints return to the interpreter.
The jiterpreter’s implementation of a given opcode is typically a 1:1 translation of the interpreter’s, which makes adding new opcodes a few minutes of low-risk work instead of careful interaction with ASTs, IRs, and register allocators.
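To make the translation concrete, here is a sketch of emitting one simple "add two int32 locals" opcode - not the runtime's actual emitter, with the opcode layout simplified and offsets assumed small enough to encode as single-byte LEB128:

```typescript
// Emits raw wasm bytecode for a hypothetical 'add two int32 locals'
// opcode. Local 0 of the trace function is assumed to point at the
// interpreter's locals area, so every value is loaded from and stored
// back to exactly the place the interpreter would use.
function emitAddI4(body: number[], destOffset: number,
                   lhsOffset: number, rhsOffset: number): void {
    const loadLocalI32 = (offset: number) => {
        body.push(0x20, 0x00);         // local.get 0   (locals base pointer)
        body.push(0x28, 0x02, offset); // i32.load      align=4, offset
    };
    body.push(0x20, 0x00);             // local.get 0   (store address)
    loadLocalI32(lhsOffset);
    loadLocalI32(rhsOffset);
    body.push(0x6a);                   // i32.add
    body.push(0x36, 0x02, destOffset); // i32.store     align=4, offset
}
```

Note that this one Mono opcode expands to seven wasm opcodes, consistent with the 4-8 range above.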