Lambda Calculus Benchmark for AI
marvinborner
135 points
39 comments
April 25, 2026
Related Discussions
- We're running out of benchmarks to upper bound AI capabilities gmays · 15 pts · April 10, 2026 · 56% similar
- Show HN: A new benchmark for testing LLMs for deterministic outputs khurdula · 50 pts · April 29, 2026 · 55% similar
- Show HN: PhAIL – Real-robot benchmark for AI models vertix · 20 pts · March 31, 2026 · 55% similar
- Writing Lisp is AI resistant and I'm sad djha-skin · 56 pts · April 05, 2026 · 52% similar
- Show HN: LangAlpha – what if Claude Code was built for Wall Street? zc2610 · 128 pts · April 14, 2026 · 52% similar
Discussion Highlights (7 comments)
tromp
The corresponding repo https://github.com/VictorTaelin/LamBench describes this as:

λ-bench
A benchmark of 120 pure lambda calculus programming problems for AI models.
→ Live results

What is this?
λ-bench evaluates how well AI models can implement algorithms using pure lambda calculus. Each problem asks the model to write a program in Lamb, a minimal lambda calculus language, using λ-encodings of data structures to implement a specific algorithm. The model receives a problem description, data encoding specification, and test cases. It must return a single .lam program that defines @main. The program is then tested against all input/output pairs — if every test passes, the problem is solved.

"Live results" wrongly links to https://victortaelin.github.io/LamBench/ rather than the correct https://victortaelin.github.io/lambench/

An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...

Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.
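For readers unfamiliar with the term, here is a minimal sketch of what a "λ-encoding of a data structure" can look like, using Church numerals written as Python lambdas. This is not Lamb syntax (which the thread does not show), and the benchmark's actual encodings may differ; the names `zero`, `succ`, `add`, `mul`, and `to_int` are chosen here purely for illustration.

```python
# Church numerals: the number n is the function that applies f to x n times.
# Illustration only; not Lamb syntax and not necessarily lambench's encoding.

zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))

# Addition: apply f m times on top of applying it n times.
add = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

# Multiplication: repeat "apply f n times" m times.
mul = lambda m: lambda n: lambda f: m(n(f))


def to_int(n):
    """Decode a Church numeral by counting applications of f."""
    return n(lambda k: k + 1)(0)


two = succ(succ(zero))
three = succ(two)
```

With these definitions, `to_int(add(two)(three))` decodes the sum and `to_int(mul(two)(three))` decodes the product. Everything, including the numbers themselves, is built from function abstraction and application alone, which is the constraint the benchmark places on solutions.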
dataviz1000
lambench is single-attempt, one shot per problem. I don't think they understand how LLMs work. To truly benchmark a non-deterministic probabilistic model, they are going to need to run each problem about 45 times. LLMs are distributions and behave accordingly. The better story is how the models behave on the same problem after 5 samples, 15 samples, and 45 samples. That said, lambda calculus is a brilliant subject for benchmarking. The models are reliably incorrect. [0]
[0] https://adamsohn.com/reliably-incorrect/
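One common way to report results over repeated samples is the pass@k estimate: given n samples of which c passed, it estimates the chance that at least one of k randomly drawn samples passes. The sketch below is an assumption about how such a multi-sample evaluation could be scored, not something lambench does; the function name `pass_at_k` and the sample counts in the usage note are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes.

    n: total samples drawn for one problem
    c: number of those samples that passed every test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so any k-subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of k-subsets made up entirely of failures.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 45 samples per problem, calling `pass_at_k(45, c, k)` for k in 5, 15, and 45 would give the three views the comment asks for from a single batch of runs.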
NitpickLawyer
New, unbenched problems are really the only way to differentiate the models, and every time I see one it's along the same lines. Models from top labs are neck and neck, and the rest of the bunch are nowhere near. Should kinda calm down the "opus killer" marketing that we've seen these past few months, every time a new model releases, esp. the small ones from China. It's funny that even one of the strongest research labs in China (deepseek) has said there's still a gap to opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27b dense model is "as good as opus"... I'm a huge fan of local models, have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap" "opus killers" from China. Some work, some look like they work, but they go haywire at the first contact with a real, non-benchmarked task.
[1] - https://x.com/julien_c/status/2047647522173104145
cmrdporcupine
Odd to see GPT 5.5 behind 5.4?
internet_points
Would love to see where the mistral stuff lands. Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)
maciejzj
Can anyone more familiar with lambda calculus speculate why all models fail to implement FFT? There are a gazillion FFT implementations in various languages across the web, and the actual Cooley-Tukey algorithm is rather short.
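To give a sense of how short the algorithm is in a conventional language, here is a minimal recursive radix-2 Cooley-Tukey sketch in Python. It assumes the input length is a power of two and leans on built-in complex numbers and `cmath.exp`, neither of which exists in pure lambda calculus, where numbers and arithmetic must themselves be encoded. It is not the benchmark's reference solution.

```python
import cmath


def fft(xs):
    """Recursive radix-2 Cooley-Tukey FFT; len(xs) must be a power of two."""
    n = len(xs)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(xs[0])]
    even = fft(xs[0::2])  # transform of even-indexed samples
    odd = fft(xs[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor e^(-2*pi*i*k/n) applied to the odd half.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

A unit impulse such as `fft([1, 0, 0, 0])` transforms to a flat spectrum, and a constant signal such as `fft([1, 1, 1, 1])` puts all of its energy in the first bin, which makes both handy sanity checks.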
jakeinsdca
codex 5.5 is worse than 5.4 but 10x faster?