Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Beyond TP: Formal Problem-Solving


Qi Liu (May 08 2025 at 10:27):

Hello everyone! Glad to share our latest work titled Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving. :glowing_star:

Existing work in formal mathematics mainly focuses on theorem proving (verifying known conclusions), but broader applications demand more emphasis on problem-solving (discovering unknown answers).
This raises a crucial question: :search:

Can a fully verifiable problem-solving process be achieved within a formal environment?:question:

Our answers are as follows:
:one: We give a rigorous formulation of the problem-solving process from the perspective of formal verification. :books:
:two: Formal Problem-Solving (FPS): a framework for process-verified problem-solving, implemented in Lean 4 (a toy illustration of the setting follows this list). :puzzle_piece:
:three: Deductive FPS (D-FPS): a variant that further decouples solving from verification, to better align with informal problem-solving. :puzzle_piece:
:four: Evaluation

  • RPE: an evaluation method for formal answers. :trophy:
  • Three formal-informal aligned problem-solving benchmarks: FormalMath500, MiniF2F-Solving, and PutnamBench-Solving. :bar_chart:
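
To make the distinction concrete, here is a toy illustration (all names below are made up for illustration only; this is not the paper's actual FPS encoding):

-- Theorem proving: the conclusion is fully known in advance, and the task is only to verify it.
theorem tp_example : (2 : Nat) + 2 = 4 := rfl

-- Problem solving: the answer is initially unknown. The solver must construct `answer`
-- and then prove it satisfies the problem's condition, with both steps checked by Lean.
def answer : Nat := 4   -- produced by the solver, not given in the problem
theorem answer_is_correct : 2 + 2 = answer := rfl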

We have also conducted extensive experiments with mainstream formal theorem proving (FTP) methods and our newly proposed methods.

Talk is cheap; here are the paper, code, data & results:
:headlines:Huggingface Paper: https://huggingface.co/papers/2505.04528
:scroll:Arxiv: https://arxiv.org/abs/2505.04528
:computer:Code, Benchmark, and Evaluation Results: https://github.com/Purewhite2019/formal_problem_solving_main

:handshake: Feedback and collaborations are welcome! Let’s advance formal mathematics together :)

Jason Rute (May 08 2025 at 11:51):

Interesting work. I worry about the Putnam benchmark. The original PutnamBench already had this capability. While I don't know whether anyone has used it, I do know that @George Tsoukalas has done a great and thankless job of maintaining this benchmark over time, carefully checking answers and recording SoTA. I worry that taking that away from the PutnamBench team could lead to a split in the benchmark, with a lot of confusion (just like what happened to MiniF2F).

Eric Wieser (May 08 2025 at 11:55):

I also don't think benchmarks that are released only as a jsonl file are maintainable; no one is going to want to fix Lean statements inside a JSON string, where VS Code can't help with them.

Eric Wieser (May 08 2025 at 11:57):

Especially when that JSON contains Lean code like:

@Eq (Set (Real → Real)) answer
  (@setOf (Real → Real) fun (f : Real → Real) =>
    @Exists Real fun (a : Real) =>
      @Exists Real fun (c : Real) =>
        And (@GE.ge Real Real.instLE a (@OfNat.ofNat Real (nat_lit 0) (@Zero.toOfNat0 Real Real.instZero)))
          (@Eq (Real → Real) f fun (x : Real) =>
            @HDiv.hDiv Real Real Real (@instHDiv Real (@DivInvMonoid.toDiv Real Real.instDivInvMonoid)) a
              (@HPow.hPow Real Nat Real (@instHPow Real Nat (@Monoid.toNatPow Real Real.instMonoid))
                (@HSub.hSub Real Real Real (@instHSub Real Real.instSub)
                  (@OfNat.ofNat Real (nat_lit 1) (@One.toOfNat1 Real Real.instOne))
                  (@HMul.hMul Real Real Real (@instHMul Real Real.instMul) c x))
                (@OfNat.ofNat Nat (nat_lit 2) (instOfNatNat (nat_lit 2))))))

Qi Liu (May 09 2025 at 02:43):

Jason Rute said:

Interesting work. I worry about the Putnam benchmark. The original PutnamBench already had this capability. While I don't know whether anyone has used it, I do know that George Tsoukalas has done a great and thankless job of maintaining this benchmark over time, carefully checking answers and recording SoTA. I worry that taking that away from the PutnamBench team could lead to a split in the benchmark, with a lot of confusion (just like what happened to MiniF2F).

Thanks for raising this :) We fully respect the PutnamBench team's work and invaluable efforts. We aim to collaborate, not fragment. Our jsonl version is just for reproducibility of the experiments and peer review of the preprint; we plan to discuss integration with their team later to align contributions :smiley:

Qi Liu (May 09 2025 at 03:02):

Eric Wieser said:

I also don't think benchmarks that are released only as a jsonl file are maintainable; no one is going to want to fix Lean statements inside a JSON string, where VS Code can't help with them.

Yeah, embedding Lean code in JSON isn't the most maintainable approach.
The jsonl was really just a quick solution for experiment reproducibility and peer review. We need to adjust pretty-printing (e.g. set_option pp.explicit true) to extract code that works across both the FPS and D-FPS setups, but I agree it isn't ideal for actual usage.
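
For context, the verbosity comes from that pretty-printer option; a minimal illustration (the checked expression is just a toy example):

import Mathlib

-- With pp.explicit enabled, the pretty-printer shows implicit arguments and instances,
-- which is what produces strings like the one quoted above.
set_option pp.explicit true in
#check fun x : ℝ => 2 * x
-- the output spells out every implicit type argument, literal and instance explicitly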

We have a preliminary idea on this and will try it out later :thinking:

  1. Store problems and answers in .lean files (in theorem-proving style);
  2. Build a lightweight parser to extract the components needed for FPS/D-FPS (sketched below).
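
For concreteness, here is roughly how this could look for the answer Eric quoted above, stored TP-style in a plain .lean file (the names and layout are assumptions, not a finalized format; only the set itself is transcribed from the pp.explicit string):

import Mathlib

-- Hypothetical TP-style storage: the answer is an ordinary definition that VS Code
-- can elaborate, instead of a pp.explicit string embedded in JSON. This is the
-- human-readable form of the expression quoted earlier in the thread.
def answer : Set (ℝ → ℝ) :=
  {f | ∃ a c : ℝ, a ≥ 0 ∧ f = fun x => a / (1 - c * x) ^ 2}

-- The corresponding problem statement would live alongside it as a normal theorem,
-- and the lightweight parser would split the file into the FPS/D-FPS components.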

If you've got ideas or want to poke at the design, feel free to chime in on GitHub! Always happy to make this more usable. :slight_smile:


Last updated: Dec 20 2025 at 21:32 UTC