Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: TheoremForge: Agentic Workflow for Data Synthesis
Yicheng Tao (Jan 27 2026 at 09:49):
We release TheoremForge, an agentic workflow serving as a formal data synthesis pipeline. It is designed to collect training data for expert models in statement formalization, proof generation, premise selection, proof correction, and proof sketching on a low budget. See our tech report for details.
Specifically, TheoremForge adopts iterative proof refinement and subgoal decomposition strategies inspired by Hilbert. A key difference is that, for cost control, it refrains from recursively decomposing subgoals.
Our work represents only a preliminary exploration of solutions to the scarcity of training data for subtasks in the formalization domain. Experimental results indicate that current general-purpose LLMs, such as Gemini, GPT-5, and Claude, still achieve relatively low performance in formalization. Although increasing the inference budget can mitigate this issue, such an approach is not economically sustainable in the long term.
We look forward to your feedback and hope that our work will provide some useful insights.
Yutong Wu (Jan 27 2026 at 13:56):
Hello! I noticed in the technical report the statement: "We rely on LLM-based extraction rather than the extract_goals tactic, as the latter has proven unreliable in certain contexts" (Subgoal Decomposition, Appendix A.3). Could you provide some examples or statistical results comparing the accuracy of the LLM-based method and the extract_goals tactic in extracting lemmas? I am also working on generating sketches, so I am quite curious about this.
Yicheng Tao (Jan 27 2026 at 14:48):
Yutong Wu said:
Hello! I noticed in the technical report the statement: "We rely on LLM-based extraction rather than the extract_goals tactic, as the latter has proven unreliable in certain contexts" (Subgoal Decomposition in Appendix A.3) Could you provide some examples or statistical results comparing the accuracy of the LLM-based method and the extract_goals tactic in extracting lemmas? I am also working on generating sketches, so I am quite curious about this.
Thanks for your attention. I should say that I didn't keep that example, but sometimes the lemmas extract_goal created would lose all the context, preserving only the name and the goal. I haven't figured out what caused the problem. The documentation warns about potential failures.
BTW, I found goedels-poetry takes a syntactical approach by rewriting the Kimina Lean Server. Maybe it will help.
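For readers following along, here is a minimal sketch of how the tactic is meant to be used inside a subproof (an illustrative example of ours, not taken from the report):

```lean
import Mathlib

theorem demo (x : ℝ) (hx : 0 ≤ x) : 0 ≤ x + 1 := by
  have step : (0 : ℝ) ≤ 1 := by
    -- extract_goal prints the current goal, together with the
    -- hypotheses in context, as a standalone lemma in the InfoView
    extract_goal
    norm_num
  linarith
```

The failure mode described above is when the printed lemma omits the hypotheses (the context) and keeps only the name and the goal.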
Thomas Zhu (Jan 27 2026 at 22:33):
Yicheng Tao said:
Yutong Wu said:
Hello! I noticed in the technical report the statement: "We rely on LLM-based extraction rather than the extract_goals tactic, as the latter has proven unreliable in certain contexts" (Subgoal Decomposition in Appendix A.3) Could you provide some examples or statistical results comparing the accuracy of the LLM-based method and the extract_goals tactic in extracting lemmas? I am also working on generating sketches, so I am quite curious about this.
Thanks for your attention. I should say that I didn't keep that example but sometimes the lemmas extract_goal created will lose all the context, only preserve the name and the goal. I haven't figured out what caused the problem. The document has warnings about potential failures.
BTW, I found goedels-poetry takes a syntactical approach by rewriting the Kimina Lean Server. Maybe it will help.
Per the linked documentation, it sounds like extract_goal * should solve your problem.
Justin Asher (Jan 28 2026 at 00:08):
Thanks for the LeanExplore citation!
Just as a heads-up, I am in the process of updating it—the new search engine will be much better—but the API will be different. (I will try not to make breaking changes again going forward.)
Yicheng Tao (Jan 28 2026 at 03:35):
Thomas Zhu said:
Yicheng Tao said:
Yutong Wu said:
Hello! I noticed in the technical report the statement: "We rely on LLM-based extraction rather than the extract_goals tactic, as the latter has proven unreliable in certain contexts" (Subgoal Decomposition in Appendix A.3) Could you provide some examples or statistical results comparing the accuracy of the LLM-based method and the extract_goals tactic in extracting lemmas? I am also working on generating sketches, so I am quite curious about this.
Thanks for your attention. I should say that I didn't keep that example but sometimes the lemmas extract_goal created will lose all the context, only preserve the name and the goal. I haven't figured out what caused the problem. The document has warnings about potential failures.
BTW, I found goedels-poetry takes a syntactical approach by rewriting the Kimina Lean Server. Maybe it will help.
Per the linked documentation, it sounds like extract_goal * should solve your question.
Since I have lost the failing case, I'm not sure whether it will work. The documentation says only irrelevant variables are removed, but I'm sure that in the failing case there were variables satisfying the relevance conditions, and yet all the sorrys in the main theorem produced lemmas with no context.
Maybe I will come across the case again in the future. I will add an option to use extract_goal * to extract subgoals later.
Yutong Wu (Jan 29 2026 at 08:12):
I found that extract_goal can introduce ambiguity when dealing with pretty-printed objects (as already noted in the documentation), as shown in the following example:
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

theorem mathd_algebra_153 (n : ℝ) (h₀ : n = 1 / 3) :
    Int.floor (10 * n) + Int.floor (100 * n) + Int.floor (1000 * n) + Int.floor (10000 * n) = 3702 := by
  have Step5 : Int.floor ((3 : ℝ) + 1/3) = (3 : ℤ) := by
    extract_goal *
    sorry
  sorry
The InfoView then shows:
theorem mathd_algebra_153.extracted_1_1 (n : ℝ) (h₀ : n = 1 / 3) : ⌊3 + 1 / 3⌋ = 3 := sorry
But if we simply copy this statement out as a new lemma, it causes an error, because the pretty-printed goal has lost the (3 : ℝ) type ascription. Alternatively, we can use set_option pp.all true in extract_goal to keep Int.floor unchanged, but that makes the lemma extremely verbose and unreadable. I think using LLMs to extract lemmas is indeed necessary, as it is difficult to achieve this with rule-based methods. Does anyone have any thoughts on this issue?
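For comparison, here is a hand-fixed version of the extracted lemma (a sketch: the (3 : ℝ) ascription is restored manually, which is exactly the kind of lost context a purely rule-based extractor struggles to recover):

```lean
import Mathlib

-- The pretty-printed ⌊3 + 1 / 3⌋ drops the (3 : ℝ) ascription, so the
-- copied statement no longer elaborates at the intended type. Restoring
-- the ascription by hand recovers the statement that was actually proved:
theorem mathd_algebra_153.extracted_1_1' (n : ℝ) (h₀ : n = 1 / 3) :
    Int.floor ((3 : ℝ) + 1 / 3) = 3 := sorry
```

This is what an LLM-based extractor would be asked to produce directly from the InfoView output plus the surrounding source.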
Last updated: Feb 28 2026 at 14:05 UTC