Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ashme/aime in lean in < 10GB weights ?


TongKe Xue (Dec 08 2025 at 10:28):

[Context: I'm still not convinced using LLM to generate tactics is the best/right way to solve IMO/Putnam problems.]

Question: Are there any systems that:

  1. uses lean4 to solve Ashme/AIME problems
  2. program + all weights is <= 10 GB ?

Now, one might point out: ASHME is multiple choice, AIME is 0 - 999. Neither involves writing proofs. Why would you want to use Lean4?
Answer: it forces correct reasoning from step n to step n+1. Even if the final answer is 0-999 or multiple choice, it is useful for us to force every intermediate step to be mathematically correct.

Question: are there systems, using Lean4, a mix of good old fashioned symbolic AI + very small weight neural network, that can solve ASHME + AIME problems at > 90% accuracy?

The intuition here is I'm particularly interested in systems that "leans heavily" on lean4 tactics, good old fashioned symbolic ai, and less reliance on neural networks / LLMs.

Thanks!

Leni Aniva (Dec 09 2025 at 02:08):

I'm working on one such system and I'll try to get it published next year

Gerben Koopman (Dec 09 2025 at 13:22):

How would this work given that Mathlib itself is already ~32GB? You need to load this for premises no? Unless you prove every theorem ground up.

Gerben Koopman (Dec 09 2025 at 14:41):

I agree with your feeling though, that next-token generation seems ill-suited to generating tactics. It's just that the alternatives I can think of are also not better. But curious to see what other people came up with.

TongKe Xue (Dec 09 2025 at 19:58):

  1. Great question. Sorry for not being clear. You get all of MathLib, CSLib, any other public open source Lean4 db of formalized lemmas/theorems for free.
  2. The 10GB limit was meant mainly to focus on "weights/blobs of f32s".
  3. Taking a neural network's weights and "encoding" them somehow in a Lean4 db of theorems would be considered cheating. :-)

One reason I consider this might be possible is this:

  1. IMO style problems often require some form of inspiration, i.e. a step that says Consider the object/strategy XYZ where XYZ is a somewhat complicated expr that we need to come up with out of thin air.
  2. ASHME/AIME problems are "more direct"; after formalization, many of them just look like plug & play standard formulas, the 2d/3d geometry problems can often solved by "coordinates", combinatorics problems by case analysis, etc ...

Concretely, a problem that shows up on IMO but never on ASHME/AIME is something like this: "Consider [this weird 2 person game]. State which player has a winning strategy. Prove it."

So a solution to this involves this "inspiration step" of "wtf is the strategy; what are the invariants after each player makes their move"; this coming up of this expr/strategy from thin air. In contrast, the ASHME/AIME feels a bit more "bruteforce-able".

Thus, I too am curious what others are doing with this.

Xiyu Zhai (Dec 25 2025 at 07:19):

Why would this be meaningful? (I'm not doubting) I just want to know the motivation.

Lean4 is not exactly the most optimal frontend of interaction. The room to optimize is huge. I believe you can cooptimize NN and software and make <1B LLM work for these problems. And you didn't mention time complexity at all so it could try as many times as possible.

For things like image editing, I could imagine things like privacy concerns that motivates minizing the models. For math itself, what would be the motivation?

TongKe Xue (Dec 25 2025 at 21:21):

You're right. I failed to state time complexity.

I'd like to have something that can solve IMO/Putnam problems on a < $10k server, at around 5-10 mins / problem. I don't think this is realistic as of right now, but maybe something that could solve ASHME/AIME on a < $10k server, at around 5-10 mins / problem is possible.

Suppose such a system existed, it would be interesting to use it to synthesize formally verifiable code.

I think it is very "shaky foundations" to build "on top" of remote blackbox APIs.

Xiyu Zhai (Dec 25 2025 at 21:46):

It has a long way to go from there to formally verifiable code.

Code is quite complex.

As much as people would like to boast, there's quite a lot of ground to cover for code generation before it's useful for general programming.

Code verification is drastically different from math verification and Lean4 didn't really progress a lot for code verification from Rocq, etc. Currently people like to translate things to lean then verify it in lean then export it. That's an interesting use case but is quite insufficient for general stuffs.

I'm waiting to see Vst-scale projects in Lean4.

Xiyu Zhai (Dec 25 2025 at 21:50):

And I don't think cost plays a factor here. These costs are like nothing compared with modern LLM use cases. I think the real bottleneck isn't in the mlsys of neural networks.

Xiyu Zhai (Dec 25 2025 at 21:53):

I'm actually working on such a system that do minif2f-level problems efficiently. But it's mostly a research interest rather than hoping it's of any immediate huge impact. I would say the room to optimize is quite large. I'm just trying to grab the low-hanging fruit now that it still hasn't attracted many MLsys people to work on that.


Last updated: Feb 28 2026 at 14:05 UTC