Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Hilbert Agent
Opt (Oct 01 2025 at 07:13):
https://arxiv.org/abs/2509.22819 -- they have an agentic framework which, with Gemini + Goedel Prover, gets 70% on PutnamBench
Simon Sorg (Oct 01 2025 at 07:18):
How is this different from Lego Prover?
(deleted) (Oct 01 2025 at 08:27):
Lego Prover looks entirely different to me
(deleted) (Oct 01 2025 at 08:57):
Anyway, it's clear AI can prove theorems. Now the next step is building faster AIs.
(deleted) (Oct 01 2025 at 09:02):
Here's a goal: prove nontrivial theorems in under 1 second.
Jared green (Oct 01 2025 at 15:30):
But how is it different from everything else? Also, when can we use it?
Michael Bucko (Oct 01 2025 at 17:33):
Simon Sorg said:
How is this different from Lego Prover?
Lego: a growing skill library, with a "neural" / AI-autonomous focus.
Hilbert: informal-reasoning-focused, with orchestration (four components: "an informal LLM that excels at mathematical reasoning, a specialized prover LLM optimized for Lean 4 tactics, a formal verifier, and a semantic theorem retriever").
Future systems will most likely combine both "neural" and "natural". Modularity and growing libraries make sense, but so does the ability to have autonomous NNs marking things green in the blueprint.
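As a rough sketch of how those four components might interact (every identifier here is hypothetical, not from the paper or its codebase):

```python
from typing import Protocol

# Hypothetical interfaces for the four components the paper names;
# none of these identifiers come from the Hilbert codebase.
class InformalLLM(Protocol):
    def sketch_proof(self, theorem: str) -> str: ...
    def repair(self, plan: str, errors: str) -> str: ...

class ProverLLM(Protocol):
    def to_lean4(self, theorem: str, plan: str, lemmas: list[str]) -> str: ...

class Verifier(Protocol):
    def check(self, proof: str) -> tuple[bool, str]: ...

class Retriever(Protocol):
    def similar_theorems(self, theorem: str) -> list[str]: ...

def attempt(theorem: str, informal: InformalLLM, prover: ProverLLM,
            verifier: Verifier, retriever: Retriever, rounds: int = 4) -> str | None:
    plan = informal.sketch_proof(theorem)          # informal mathematical reasoning
    lemmas = retriever.similar_theorems(theorem)   # semantic retrieval from a library
    for _ in range(rounds):
        candidate = prover.to_lean4(theorem, plan, lemmas)  # Lean 4 tactic generation
        ok, errors = verifier.check(candidate)     # formal verification
        if ok:
            return candidate
        plan = informal.repair(plan, errors)       # feed verifier errors back to the reasoner
    return None
```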
(deleted) (Oct 01 2025 at 18:00):
Jared green said:
But how is it different from everything else? Also, when can we use it?
The components used in the paper are all publicly available: both Goedel Prover v2 and Gemini 2.5 Pro are accessible, and the prompts are included. Replication takes some effort but isn't impossible, since every important detail is in the paper.
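A minimal replication sketch, assuming Goedel-Prover-V2 is served behind an OpenAI-compatible endpoint (e.g. via vLLM) and Gemini is reached through Google's OpenAI-compatible gateway; the actual prompt texts would come from the paper:

```python
from openai import OpenAI

# Gemini via Google's OpenAI-compatible gateway (an assumption; the
# native Google SDK works just as well).
reasoner = OpenAI(
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    api_key="YOUR_GEMINI_KEY",
)
# Goedel-Prover-V2 served locally, e.g. `vllm serve Goedel-LM/Goedel-Prover-V2-32B`.
prover = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

plan = reasoner.chat.completions.create(
    model="gemini-2.5-pro",
    messages=[{"role": "user", "content": "..."}],  # informal-reasoning prompt from the paper
).choices[0].message.content

proof = prover.chat.completions.create(
    model="Goedel-LM/Goedel-Prover-V2-32B",
    messages=[{"role": "user", "content": "..."}],  # prover prompt, seeded with `plan`
).choices[0].message.content
print(proof)
```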
Kelly Davis (Oct 16 2025 at 09:04):
I'm just getting a chance to start reading the Hilbert paper this morning. It's quite similar to something I'm working on.
Hopefully I can write up a quick article on what I've done over the weekend and open-source my code next week. It's built to be built upon, e.g. swapping in new formal provers and decomposition agents, so it should be a framework of general use in the future.
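The shape I'm aiming for is roughly this (a minimal sketch; these are not the actual interfaces):

```python
from typing import Protocol

class FormalProver(Protocol):
    def prove(self, statement: str) -> str | None: ...

class Decomposer(Protocol):
    def split(self, statement: str) -> list[str]: ...

# Swapping in a new prover just means providing another object with
# .prove(), e.g. a local Goedel-Prover worker vs. an API-backed one.
def solve(stmt: str, prover: FormalProver, decomposer: Decomposer,
          depth: int = 2) -> list[str] | None:
    if (proof := prover.prove(stmt)) is not None:
        return [proof]
    if depth == 0:
        return None
    # Direct attempt failed: decompose and recurse on each subgoal.
    pieces: list[str] = []
    for sub in decomposer.split(stmt):
        sub_proofs = solve(sub, prover, decomposer, depth - 1)
        if sub_proofs is None:
            return None
        pieces += sub_proofs
    return pieces
```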
Kelly Davis (Oct 19 2025 at 14:32):
Here's the initial version: Gödel's Poetry.
I'd appreciate any feedback; with enough eyeballs, all bugs are shallow.
Ayush Agrawal (Nov 05 2025 at 13:21):
The Hilbert paper's results are quite interesting. I tried to reproduce them: https://github.com/ayush1801/HilbertProverX
Results and Observations:
- On the MiniF2F test set, we achieve 74.5% (182/244) using the above parameters. 39 of these proofs cannot be proved directly and are proved recursively. With a similar number of reasoner calls, the paper achieves 80-85%.
- All the successful proofs are provided at ./datasets/MINIF2F_TEST.zip
- Some interesting observations:
  - We find that proofs are generally unnecessarily long, with a lot of unused tactics.
  - Many trivial lemmas are generated, and the current prover models often struggle to prove them:
    - `theorem one_is_integer : (1 : ℚ).isInt := sorry`
    - `theorem a4_val (a : ℕ → ℕ) (h₄ : a 4 ^ 3 = 125) : a 4 = 5 := sorry`
    - `theorem unknown_terms_final_value : ((212 : ℤ) + (414 : ℤ)) = (626 : ℤ) := sorry`
    - `theorem six_div_ten_eq_three_div_five : (6 : ℚ) / (10 : ℚ) = (3 : ℚ) / (5 : ℚ) := sorry`
  - Ideally, the provers should not waste compute on such trivial goals. These goals should be offloaded to tactics from recent Lean versions, such as `grind` and the hammers, which in turn should be made more powerful, since they don't yet handle all trivial goals (see the Lean sketch after this list).
  - Without a strict theorem check, models tend to transform the original theorem statements/lemmas being proved.
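For scale, an untested Lean sketch: at least two of the four lemmas above are one-liners for existing automation (exact tactic behaviour depends on the Mathlib version):

```lean
import Mathlib

-- Untested sketch: these two trivial goals should close with norm_num
-- alone; the isInt and `a 4 ^ 3 = 125` goals need slightly more work.
example : ((212 : ℤ) + (414 : ℤ)) = (626 : ℤ) := by norm_num

example : (6 : ℚ) / (10 : ℚ) = (3 : ℚ) / (5 : ℚ) := by norm_num
```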
Upcoming Plan
- Measuring performance on other benchmarks and using other reasoners.
- Parallelising the pipeline.
- Analysing the results to get a better understanding of the above observations.
- Exploring ideas to make the pipeline more efficient and improve its performance.
- Collaborations are welcome! Please reach out to me!!
Please try it out :slight_smile:
Kelly Davis (Nov 05 2025 at 14:27):
@Ayush Agrawal
I don't know if the repo's settings are the ones you used for benchmarking, but a quick (and maybe incorrect) glance at the code finds that in the repo:
- Subgoal solving seems to use fewer attempts, 1 vs 4, than Hilbert
- Shallow solve has stricter line limits, 5 vs 30, than Hilbert
- Error correction uses fewer iterations, 4 vs 6, than Hilbert
- Search uses fewer queries, 1 vs 5, than Hilbert

These likely go quite a way toward explaining the differences?
Ayush Agrawal (Nov 05 2025 at 14:30):
Subgoal solving does use 4 attempts (see the batch size). Yes, shallow solve has a stricter proof-length limit and error correction uses fewer iterations. Search queries are also 5; check the prompts and the extraction part. If anything still feels off, let me know!
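For reference, a hypothetical sketch of the knobs under discussion (field names invented; values as stated in this thread, not read from either codebase's actual config):

```python
# Hilbert's reported settings, with repo comparisons per this exchange.
HILBERT_PAPER_SETTINGS = {
    "subgoal_attempts": 4,          # matched in the repo via batch size
    "shallow_solve_max_lines": 30,  # repo limit is 5
    "error_correction_iters": 6,    # repo uses 4
    "search_queries": 5,            # matched in the repo
}
```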
Ayush Agrawal (Nov 05 2025 at 14:35):
I think the differences could be attributed to prompt formatting, the thinking budget of the reasoner models, and some hyperparameter differences in the overall compute budget (two of which you pointed out!). We do plan to carry out evaluations on other benchmarks.
Last updated: Dec 20 2025 at 21:32 UTC