A tail-call interpreter in (nightly) Rust

g0xA52A2A 147 points 26 comments April 05, 2026
www.mattkeeter.com · View on Hacker News

Discussion Highlights (6 comments)

dathinab

> resulting VM outperforms both my previous Rust implementation and my hand-coded ARM64 assembly

It's always surprising to me how absurdly efficient highly specialized VM/instruction interpreters are. For example, two independent research projects into better (faster, more compact) serialization in Rust both ended up with something like a VM/interpreter for serialization instructions, yielding higher performance and more compact code size while still being capable of supporting feature sets similar to serde's (1).

(In general, monomorphization and dynamic dispatch (e.g. serde) can take you very far, but the best approach, as usual, is not either extreme: neither always monomorphization nor always dynamic dispatch, but a balance that takes advantage of the strengths of both. Specialized mini VMs are, in a certain way, an extra-flexible form of dynamic dispatch.)

(1): More compact code size on normal-to-large projects, not necessarily on micro projects, since the "fixed overhead" is often slightly larger while the per-serialization-type/protocol overhead can be smaller.

(1b): These were experimental research projects; I'm not sure any of them were published to GitHub, and none are suited for production use or similar.
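The serialization-VM idea described above can be sketched in a few lines: a type's layout is described once as a compact instruction list, and a small interpreter executes it. This is a minimal illustrative sketch, not the actual design of either research project; the `Op` variants and field offsets are invented for the example.

```rust
// Hypothetical "serialization VM": instead of monomorphizing a serializer
// per type (serde-style), each type is compiled once into a small program
// of opcodes, which a single generic interpreter loop executes.
#[derive(Clone, Copy)]
enum Op {
    // Copy a 4-byte little-endian u32 field at this offset.
    WriteU32 { offset: usize },
    // Copy a fixed-size run of raw bytes at this offset.
    WriteBytes { offset: usize, len: usize },
}

fn serialize(ops: &[Op], src: &[u8], out: &mut Vec<u8>) {
    for op in ops {
        match *op {
            Op::WriteU32 { offset } => out.extend_from_slice(&src[offset..offset + 4]),
            Op::WriteBytes { offset, len } => out.extend_from_slice(&src[offset..offset + len]),
        }
    }
}

fn main() {
    // An 8-byte "struct": a u32 at offset 0, then 4 opaque bytes.
    let src = [1u8, 0, 0, 0, b'a', b'b', b'c', b'd'];
    let program = [Op::WriteU32 { offset: 0 }, Op::WriteBytes { offset: 4, len: 4 }];
    let mut out = Vec::new();
    serialize(&program, &src, &mut out);
    println!("{:?}", out);
}
```

One interpreter loop serves every type, so code size grows only with the per-type instruction lists, which matches the comment's note about fixed versus per-type overhead.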

bjoli

Finally! Tail calls! I had to write Rust some years ago, and the OCaml person in me itched to write tail recursion. Tail recursion lets people write really, really neat looping facilities using macros.
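The OCaml-style loop the comment has in mind looks like this in Rust: an accumulator-passing function whose recursive call is in tail position. This sketch compiles on stable Rust, where the frame reuse is not guaranteed; the nightly `become` keyword discussed in the article exists to guarantee it.

```rust
// Accumulator-passing tail recursion: the recursive call is the last thing
// the function does, so a guaranteed tail call could reuse the stack frame.
// On stable Rust this is an ordinary call and deep inputs may overflow.
fn sum_to(n: u64, acc: u64) -> u64 {
    if n == 0 {
        acc
    } else {
        sum_to(n - 1, acc + n) // tail position; `become` would guarantee TCO
    }
}

fn main() {
    println!("{}", sum_to(10, 0));
}
```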

measurablefunc

A more accurate title would say it is a tail-call-optimized interpreter. Tail calls alone aren't special; what matters is that the compiler or runtime properly reuses the caller's frame instead of pushing another call frame and growing the stack.
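The frame-reuse distinction can be made explicit with a trampoline, which simulates guaranteed tail calls on stable Rust: each step returns either a result or the next call's arguments, and a plain loop stands in for the reused frame. This is an illustrative sketch, not anything from the article.

```rust
// A trampoline: `step` never calls itself, it just describes the next call.
// The driver loop reuses the same frame, so depth is bounded regardless of n.
enum Step {
    Done(u64),
    Continue { n: u64, acc: u64 },
}

fn step(n: u64, acc: u64) -> Step {
    if n == 0 {
        Step::Done(acc)
    } else {
        Step::Continue { n: n - 1, acc: acc + n }
    }
}

fn trampoline(mut n: u64, mut acc: u64) -> u64 {
    loop {
        match step(n, acc) {
            Step::Done(v) => return v,
            Step::Continue { n: next_n, acc: next_acc } => {
                n = next_n;
                acc = next_acc;
            }
        }
    }
}

fn main() {
    // Deep enough that naive recursion without frame reuse risks overflow.
    println!("{}", trampoline(1_000_000, 0));
}
```

Guaranteed tail calls give the same bounded-stack behavior without the `Step` boilerplate, which is exactly why interpreters want them.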

ashutoshmishr88

Nice to see `become` landing in nightly. Does this work well with async, or is it purely sync tail calls for now?

anematode

Nice post :) Last year I was working on a tail-call interpreter ( https://github.com/anematode/b-jvm/blob/main/vm/interpreter2... ) and found a similar regression on WASM when transforming it from a switch-dispatch loop to tail calls. SpiderMonkey did the best with almost no regression, while V8 and JSC totally crapped out, the same finding as the blog post. Because I was targeting both native and WASM, I wrote a convoluted macro system that would do switch dispatch on WASM and tail calls on native.

Ultimately, because V8's register allocation couldn't handle the switch loop and was spilling everything, I basically manually outlined all the bytecodes whose implementations were too bloated. But V8 would still inline those implementations and shoot itself in the foot, so I wrote a wasm-opt pass to indirect them through a __funcref table, which prevented inlining.

One trick to get a little more perf out of the WASM tail-call version is to use a typed __funcref table. This was really horrible to set up, and I actually had to write a wasm-opt pass for this too, but basically: if you naively tail-call a "function pointer" (which in WASM is usually an index into some global table), the VM has to check both that the pointer is valid and that the signature matches. With a typed __funcref table you can guarantee the function is valid, avoiding all these annoying checks.
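For readers unfamiliar with the baseline being discussed, a switch-dispatch loop is the classic interpreter shape that the comment falls back to on WASM: one `match` per opcode inside a single loop, so the whole interpreter is one function for the engine's register allocator to chew on. The bytecode below is invented for illustration.

```rust
// Minimal switch-dispatch interpreter for a made-up stack bytecode.
// Dispatch happens at one site (the `match`); a tail-call design would
// instead end each opcode handler with a tail call to the next handler.
#[derive(Clone, Copy)]
enum Op {
    Push(i64),
    Add,
    Halt,
}

fn run(code: &[Op]) -> i64 {
    let mut stack: Vec<i64> = Vec::new();
    let mut pc = 0;
    loop {
        match code[pc] {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                let b = stack.pop().unwrap();
                let a = stack.pop().unwrap();
                stack.push(a + b);
            }
            Op::Halt => return stack.pop().unwrap(),
        }
        pc += 1;
    }
}

fn main() {
    let program = [Op::Push(2), Op::Push(3), Op::Add, Op::Halt];
    println!("{}", run(&program));
}
```

The trade-off the thread keeps circling: one big function gives the branch predictor and register allocator a global view (good when they cope, bad when they spill), while per-opcode functions joined by tail calls keep each handler small at the cost of depending on guaranteed tail-call support.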

kelnos

Ah, that's great! I wonder why they went with a new keyword; I'd assumed the compiler would opportunistically do TCO when it thinks it's possible, and I figured the simplest way to require TCO (or else fail compilation) would be an attribute. (Not sure if the article addressed that... I only skimmed it.)
