Lean 4 paper (2021): https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
Discussion Highlights (20 comments)
blurbleblurble
Truly exciting
andai
Trustworthy vibe coding. Much better than the other kind! Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something? On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.
lefrenchy
Does Mistral come close to Opus 4.6 with any of their models?
patall
Maybe a naive question: given that they see better performance with more passes but the effect hits a limit after a few passes, would performance increase if they used different models per pass, i.e. leanstral, kimi, qwen and leanstral again instead of 4x leanstral?
jasonjmcghee
Curious if anyone else had the same reaction as me. This model is specifically trained on this task and significantly[1] underperforms opus. Opus costs about 6x more. Which seems... totally worth it based on the task at hand. [1]: based on the total spread of tested models
kittikitti
This is great, congratulations to the Mistral team! I'm looking forward to the code arena benchmark results. Thanks for sharing.
Havoc
What are these "passes" they reference here? Haven't seen that before in LLM evals. Could definitely be interesting for having another model run over the codebase when looking for improvements.
lsb
The real-world success they report reminds me of Simon Willison’s Red Green TDD: https://simonwillison.net/guides/agentic-engineering-pattern... > Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
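The failure mode quoted above can be sketched in a few lines of Lean 4 (the names here are illustrative, not taken from the article). `rw` matches goals syntactically, so a term hidden behind a `def` blocks the rewrite even though the two sides are definitionally equal, until the definition is explicitly unfolded:

```lean
def double (n : Nat) : Nat := 2 * n

example (n m : Nat) (h : 2 * n = m) : double n = m := by
  -- `rw [h]` fails at this point: the goal displays `double n`,
  -- and `rw` cannot see the `2 * n` hidden behind the `def`,
  -- even though the two are definitionally equal.
  unfold double   -- goal becomes `2 * n = m`
  rw [h]
```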
flakiness
FYI The Lean 4 paper: https://dl.acm.org/doi/10.1007/978-3-030-79876-5_37
elAhmo
I don’t know a single person using Mistral models.
miacycle
The TDD foundation! We might need one of those. :)
JoshTriplett
Pleasant surprise: someone saying "open source" and actually meaning Open Source. It looks like the weights are Apache-2.0 licensed.
esperent
I absolutely called this a couple of weeks ago, nice to be vindicated! > I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency. > Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile. https://news.ycombinator.com/item?id=47192116
hnipps
Here we go.
htrp
is the haiku comparison because they've distilled from the model?
rothific
There have been a lot of conversations recently about how model alignment is relative and diversity of alignment is important - see the recent podcast episode between Jack Clark (co-founder of Anthropic) and Ezra Klein. Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.
piyh
Automated theorem provers running on a $5k piece of hardware is a cool version of the future
jasonjmcghee
Curious if pass@2 was tested for haiku and sonnet?
drdaeman
Can someone please explain... If I don't know any Lean (and I suspect most people don't), is it of any direct value? Trying to understand if there's something it can help me with (e.g. automatically write proofs for my Go programs somehow... I'm not sure) or should I just cheer solely for more open models out there, but this one isn't for me?
cadamsdotcom
It’s great to see this pattern of people realising that agents can specify the desired behavior then write code to conform to the specs. TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct. It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs. As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.