Lambda Calculus Benchmark for AI

marvinborner 135 points 39 comments April 25, 2026
victortaelin.github.io

Discussion Highlights (7 comments)

tromp

The corresponding repo https://github.com/VictorTaelin/LamBench describes this as:

λ-bench

A benchmark of 120 pure lambda calculus programming problems for AI models.

→ Live results

What is this?

λ-bench evaluates how well AI models can implement algorithms using pure lambda calculus. Each problem asks the model to write a program in Lamb, a minimal lambda calculus language, using λ-encodings of data structures to implement a specific algorithm.

The model receives a problem description, data encoding specification, and test cases. It must return a single .lam program that defines @main. The program is then tested against all input/output pairs — if every test passes, the problem is solved.

"Live results" wrongly links to https://victortaelin.github.io/LamBench/ rather than the correct https://victortaelin.github.io/lambench/

An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...

Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.
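For readers unfamiliar with the "λ-encodings of data structures" the README mentions, here is a minimal sketch of one such encoding, Church numerals, written with Python lambdas. This is not Lamb syntax (the thread does not show any), and the helper names `to_int` and `from_int` are hypothetical conveniences for inspecting results, not part of the benchmark.

```python
# Church numerals: the number n is represented as a function
# that applies f to x exactly n times.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition: apply f n times, then m more times.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication: repeat "apply f n times" m times.
mul = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Decode a Church numeral into a Python int (hypothetical helper)."""
    return n(lambda k: k + 1)(0)


def from_int(k):
    """Encode a non-negative Python int as a Church numeral (hypothetical helper)."""
    n = zero
    for _ in range(k):
        n = succ(n)
    return n


two = from_int(2)
three = from_int(3)
```

With these definitions, `to_int(add(two)(three))` decodes the sum and `to_int(mul(two)(three))` decodes the product. A benchmark solution would have to build everything, including lists and arithmetic, out of functions in this style.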

dataviz1000

lambench gives each model a single attempt per problem. I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, they are going to need to run each problem about 45 times. LLMs are distributions and behave accordingly. The better story is how the models behave on the same problem after 5 samples, 15 samples, and 45 samples.

That said, lambda calculus is a brilliant subject for benchmarking. The models are reliably incorrect. [0]

[0] https://adamsohn.com/reliably-incorrect/
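One common way to report results over repeated samples is the pass@k estimator used in code-generation benchmarks: given n samples of which c pass, it estimates the chance that at least one of k randomly drawn samples passes. The sketch below is an illustration of that estimator, not something lambench is known to compute; the function name `pass_at_k` and the sample counts in the usage note are assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c passing: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 45 samples per problem one could report `pass_at_k(45, c, 1)`, `pass_at_k(45, c, 5)`, and `pass_at_k(45, c, 15)` side by side, which would show whether a model that fails on a single attempt is occasionally or never correct.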

NitpickLawyer

New, unbenched problems are really the only way to differentiate the models, and every time I see one it's along the same lines. Models from top labs are neck and neck, and the rest of the bunch are nowhere near. That should calm down the "opus killer" marketing we've seen these past few months every time a new model releases, especially the small ones from China.

It's funny that even one of the strongest research labs in China (DeepSeek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"...

I'm a huge fan of local models and have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap" "opus killers" from China. Some work, some look like they work, but they go haywire at the first contact with a real, non-benchmarked task.

[1] https://x.com/julien_c/status/2047647522173104145

cmrdporcupine

Odd to see GPT 5.5 behind 5.4?

internet_points

Would love to see where the mistral stuff lands. Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)

maciejzj

Can anyone more familiar with lambda calculus speculate why all models fail to implement FFT? There are a gazillion FFT implementations in various languages all over the web, and the actual Cooley-Tukey algorithm is rather short.
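For reference, the radix-2 Cooley-Tukey recursion is indeed short in a conventional language. The sketch below is ordinary Python using built-in complex numbers and assumes the input length is a power of two. The benchmark's actual FFT task and its data encoding are not shown in this thread, so this only illustrates the algorithm's size; a pure lambda calculus version would also have to encode the numbers, the list, and the even/odd split by hand.

```python
import cmath


def fft(xs):
    """Recursive radix-2 Cooley-Tukey FFT; len(xs) must be a power of two."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(xs[0])]
    # Transform the even-indexed and odd-indexed halves separately.
    even = fft(xs[0::2])
    odd = fft(xs[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor e^(-2*pi*i*k/n) combines the two half-size results.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

As a quick sanity check, a unit impulse `[1, 0, 0, 0]` transforms to a flat spectrum, and a constant signal `[1, 1, 1, 1]` puts all of its energy in the first bin.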

jakeinsdca

codex 5.5 is worse than 5.4 but 10x faster?
