Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: OpenAI "Learning to Reason with LLMs"


Floris van Doorn (Sep 12 2024 at 19:08):

I just saw this blog post: https://openai.com/index/learning-to-reason-with-llms/

Adam Topaz (Sep 12 2024 at 19:12):

yeah, it's all over twitter.

Jason Rute (Sep 12 2024 at 19:15):

Their approach is very reminiscent of the approaches which work for formal theorem proving: test-time-compute/search with self-improvement via reinforcement learning. It is exciting to see. They are getting a high score on AIME 2024 (at least compared to previous models). I would love to see more benchmarking of this model on math and formal theorem proving. Also when the small version is available through ChatGPT, I’d love to know what Lean/Coq/Isabelle users think of the suggestions. Does it still mix up Lean 3 and Lean 4? Is it a tool you could see yourself using regularly in the right workflow?

Adam Topaz (Sep 12 2024 at 19:17):

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

Ralph Furman (Sep 12 2024 at 19:18):

Meaning they actually decided it for business reasons?

Adam Topaz (Sep 12 2024 at 19:19):

That's what I assumed from that statement

Jason Rute (Sep 12 2024 at 19:20):

“Competitive advantage”

Adam Topaz (Sep 12 2024 at 19:25):

I must say the chain-of-thought output in the math example from the blog post is impressive.

Matthew Ballard (Sep 12 2024 at 19:35):

So it does better on PhD-level physics than on AP calculus, or am I misreading?

Xiyu Zhai (Sep 12 2024 at 19:45):

Jason Rute said:

Their approach is very reminiscent of the approaches which work for formal theorem proving: test-time-compute/search with self-improvement via reinforcement learning. It is exciting to see. They are getting a high score on AIME 2024 (at least compared to previous models). I would love to see more benchmarking of this model on math and formal theorem proving. Also when the small version is available through ChatGPT, I’d love to know what Lean/Coq/Isabelle users think of the suggestions. Does it still mix up Lean 3 and Lean 4? Is it a tool you could see yourself using regularly in the right workflow?

In my personal view, these problems are all just RL problems converted from NP problems, with the verifier either specified or learned. They don't have the complications of a general MDP as in generic RL.

David Renshaw (Sep 12 2024 at 20:03):

I understand "RL" to be "reinforcement learning", but what are NP and MDP?

Adam Topaz (Sep 12 2024 at 20:03):

Xiyu Zhai said:

In my personal view, these problems are all just RL problems converted from NP problems, with the verifier either specified or learned. They don't have the complications of a general MDP as in generic RL.

Can you expand on that?

Adam Topaz (Sep 12 2024 at 20:03):

I assume NP is as in "NP-complete"? and MDP is Markov decision process, I suppose

Jason Rute (Sep 12 2024 at 20:30):

I think Xiyu is saying you can cast informal reasoning problems as NP problems (in the sense that it is easier to verify a solution than to find one). Once you have a verifier (which for informal problems is not straightforward, since there is no formal calculus, so you have to train a verifier model, likely part of the LLM training here), it becomes a matter of searching for a solution. If the model's first attempt is wrong, it can backtrack and find a different answer. As for how to search, RL is one of the best search algorithms we have. Now I don't know if Xiyu is thinking on the macro scale where RL is used to train the LLM model itself. AlphaProof uses test-time RL, but this model only uses test-time compute in the form of chain of thought, not RL or traditional tree search. I don't know the details of how the training-time RL works.

An MDP (Markov decision process) is an RL setup where there is a graph of states connected by transitions (which are actions, but could also be stochastic). The transitions come with rewards, and the goal is to maximize the total reward in the end. An NP-style search problem is a simpler MDP where there are no intermediate rewards, just a reward of 1 at the end for finding the solution. But IMHO this makes the RL even more difficult, because you don't have a good partial signal and have to come up with your own version of internal reward. It also makes it hard to assign credit to early actions in the trajectory.

I'm not sure anything in this work is "just" X. The challenge is getting all the details right, and we will see how long it takes others to catch up.
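In standard RL notation (textbook notation, nothing from the OpenAI post), the sparse-reward setup just described has a return that collapses to a terminal indicator:

$$ R(\tau) \;=\; \sum_{t=0}^{T} r(s_t, a_t) \;=\; \begin{cases} 1 & \text{if the verifier accepts the final state } s_T, \\ 0 & \text{otherwise,} \end{cases} $$

so every prefix of a trajectory looks identical reward-wise until the very end, which is exactly the credit-assignment problem above.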

Jason Rute (Sep 12 2024 at 20:34):

Maybe @Xiyu Zhai isn't talking about RL in the sense of gradient training using observed rewards, but just about the challenge of finding the highest-reward path through an MDP (of which, as noted above, NP-style search problems are a special case).

Xiyu Zhai (Sep 12 2024 at 20:45):

I didn't expand because of a lack of confidence lol. I graduated from a theory group where they have been talking about RL theory for several years, and I mostly don't quite understand it. In my thesis, I made the bold conjecture that most AI problems in CV, NLP, or AI math are NP problems in disguise. Not NP-hard, just NP. Even MNIST, where I believe the proper mathematical definition of shapes is of some NP form: a certificate and a verifier. For those with geometric backgrounds, homotopy can be viewed as NP, Riemannian isometry can be viewed as NP, and shape is similar, intermediate between the topological and the Riemannian. Now let's talk about NLP. Sure, things in physics and chemistry can't be formalized easily as math, but I imagine there is some neural informal verification mechanism for them.
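For concreteness, the standard definition I have in mind: a language $L$ is in NP iff there is a polynomial-time verifier $V$ and a polynomial $p$ such that

$$ x \in L \iff \exists\, w,\ |w| \le p(|x|) \ \text{and} \ V(x, w) = 1, $$

where $w$ is the certificate. The analogy is that a proof, a homotopy, or an isometry plays the role of $w$.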

Xiyu Zhai (Sep 12 2024 at 20:46):

As for RL, I meant mostly the task, not any specific methodology. Gradient descent on the V or Q function is just one way.

Xiyu Zhai (Sep 12 2024 at 20:47):

My understanding is that an NP problem with a recursive nature can be viewed as an RL problem. The Bellman equation is recursive to me.
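The recursion I mean is the optimality form of the Bellman equation (a standard statement, not specific to this thread):

$$ V^*(s) \;=\; \max_a \Big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big], $$

which defines the value of a state in terms of the values of its successor states.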

Xiyu Zhai (Sep 12 2024 at 20:49):

Jason Rute said:

I think Xiyu is saying you can cast informal reasoning problems as NP problems (in the sense that it is easier to verify a solution than to find one). Once you have a verifier (which for informal problems is not straightforward, since there is no formal calculus, so you have to train a verifier model, likely part of the LLM training here), it becomes a matter of searching for a solution. If the model's first attempt is wrong, it can backtrack and find a different answer. As for how to search, RL is one of the best search algorithms we have. Now I don't know if Xiyu is thinking on the macro scale where RL is used to train the LLM model itself. AlphaProof uses test-time RL, but this model only uses test-time compute in the form of chain of thought, not RL or traditional tree search. I don't know the details of how the training-time RL works.

An MDP (Markov decision process) is an RL setup where there is a graph of states connected by transitions (which are actions, but could also be stochastic). The transitions come with rewards, and the goal is to maximize the total reward in the end. An NP-style search problem is a simpler MDP where there are no intermediate rewards, just a reward of 1 at the end for finding the solution. But IMHO this makes the RL even more difficult, because you don't have a good partial signal and have to come up with your own version of internal reward. It also makes it hard to assign credit to early actions in the trajectory.

I'm not sure anything in this work is "just" X. The challenge is getting all the details right, and we will see how long it takes others to catch up.

Agree. Thanks a lot

Xiyu Zhai (Sep 12 2024 at 20:58):

One can compare this with the world models proposed by Yann LeCun. I believe the world-model idea is limited to physical environments, but the NP viewpoint is more general.

Jason Rute (Sep 12 2024 at 21:13):

I think LeCun’s JEPA ideas could be applicable even to theorem proving and other NP problems. But I admit it is not as fleshed-out and explored as the current approaches.

Xiyu Zhai (Sep 12 2024 at 21:13):

Yeah agree

Xiyu Zhai (Sep 12 2024 at 21:13):

Like some high level planning

Xiyu Zhai (Sep 12 2024 at 21:16):

LeCun, from his vision background, might focus more on the physical world. But this hierarchy idea is transferable.

Matthew Ballard (Sep 12 2024 at 22:00):

Create a formal statement of Lagrange’s theorem in Lean

import group_theory.subgroup

variables {G : Type*} [group G] [fintype G] (H : subgroup G) [fintype H]

theorem lagrange_theorem :
  fintype.card H ∣ fintype.card G :=
subgroup.card_subgroup_dvd_card H

In the current version of Lean

import Mathlib.GroupTheory.Subgroup.Basic
import Mathlib.Data.Fintype.Basic

variable {G : Type _} [Group G] [Fintype G] (H : Subgroup G)

theorem lagrange_theorem :
  Fintype.card H ∣ Fintype.card G :=
Subgroup.card_subgroup_dvd_card H

Jason Rute (Sep 12 2024 at 22:33):

@Matthew Ballard Is that using o1-preview?

Jason Rute (Sep 13 2024 at 03:22):

How is it at proofs?

Johan Commelin (Sep 13 2024 at 03:55):

So it still didn't get the memo that the cool kids all use Lean 4 these days :oops:

Eric Taucher (Sep 13 2024 at 10:58):

FYI - Announcement in OpenAI forum

https://community.openai.com/t/new-reasoning-models-openai-o1-preview-and-o1-mini/938081

ChatGPT Plus subscribers can also try o1 today.

Matthew Ballard (Sep 13 2024 at 11:53):

Haven’t tried anything more yet. If anyone has suggestions, I’ll test them this afternoon.

Eric Taucher (Sep 13 2024 at 14:40):

Matthew Ballard said:

If anyone has suggestions

Suggestion: not examples to try, but ideas for making the posted results useful to others. I am doing the same, so these are my current ideas.

  1. Include how you ran the prompt, e.g. API or ChatGPT.
  2. Include which model was used, e.g. o1-mini, o1-preview.
  3. Show the result using another common model such as GPT-4o, Claude, etc.
  4. Include the number of seconds the OpenAI reasoning model thought, e.g. "Thought for 37 seconds". A video of a result showing the thinking-process step names would be useful.

Leonardo de Moura (Sep 13 2024 at 14:46):

BTW, our friend and major past Lean contributor, @Daniel Selsam, is working on this project :tada:

Jason Rute (Sep 13 2024 at 14:55):

Yes, I saw he was a “foundational contributor”.

Eric Taucher (Sep 13 2024 at 14:59):

Here is a rough post of a first example, mostly just to see what makes sense for posting more examples.


The first example is taken from the popular tutorial

https://github.com/PatrickMassot/GlimpseOfLean


ChatGPT - 4o

Prompt

Reply


ChatGPT - o1-mini - Thought for 6 seconds

Prompt

Reply


ChatGPT - o1-preview - Thought for 37 seconds

Prompt

Reply

Jason Rute (Sep 13 2024 at 15:43):

Thanks! But maybe this example is too easy. They all seem to get it modulo not using brackets for rw sometimes.

Matthew Ballard (Sep 13 2024 at 16:01):

Tried a question from my old graduate algebra 1 final exam.

o1-preview through the web UI

Let G be a group. Prove that G is finite if and only if G has finitely many subgroups.

Matthew Ballard (Sep 13 2024 at 16:04):

Thought for 296 seconds.

Response

Matthew Ballard (Sep 13 2024 at 16:10):

4o

Matthew Ballard (Sep 13 2024 at 16:17):

Give a formal proof in the current version of Lean

Thought for 80 seconds

Response

Matthew Ballard (Sep 13 2024 at 16:19):

The current version of Lean please

Thought for 90 seconds

Response

Matthew Ballard (Sep 13 2024 at 16:21):

Anyone still have Lean 3 running?

Jason Rute (Sep 13 2024 at 16:23):

This might be fun, but maybe too hard:

Consider this Lean 4 code:

inductive MyNat
| zero : MyNat
| succ : MyNat -> MyNat

def MyNat.plus_left : (m n : MyNat) -> MyNat
| .zero, n => n
| (.succ m), n => .succ (plus_left m n)

def MyNat.plus_right : (m n : MyNat) -> MyNat
| n, .zero => n
| m, (.succ n) => .succ (plus_right m n)

example (m n: MyNat) : MyNat.plus_left m n = MyNat.plus_right m n := by sorry

Complete the proof of the example theorem.

Matthew Ballard (Sep 13 2024 at 16:24):

Same prompts for 4o and I get (after emphasizing the current version of Lean)

import Mathlib.GroupTheory.Subgroup.Basic
import Mathlib.Data.Fintype.Basic

-- Let G be a group
variables {G : Type*} [Group G]

-- First, we show that if G is finite, then G has finitely many subgroups
lemma finite_group_finite_subgroups [Fintype G] : Fintype (Subgroup G) :=
begin
  -- Subgroups are subsets of G, and since G is finite, the number of subsets is finite
  apply Fintype.ofInjective (λ H : Subgroup G, H.carrier),
  -- Injectivity: if two subgroups have the same carrier, then they are equal
  intros H1 H2 h,
  ext x,
  exact Set.ext_iff.mp h x,
end

-- Next, we show that if G has finitely many subgroups, then G is finite
lemma finite_subgroups_finite_group [Fintype (Subgroup G)] : Fintype G :=
begin
  -- We prove this by contradiction, assuming G is infinite
  by_contradiction h,
  -- Assume G is infinite
  have : Infinite G := Infinite.of_not_fintype h,
  -- Consider the set of cyclic subgroups of G
  let cyclic_subgroups := Set.range (λ g : G, Subgroup.zmultiples g),
  -- Each element of G generates a cyclic subgroup, and distinct elements generate distinct cyclic subgroups
  have : Infinite cyclic_subgroups,
  { apply Set.infinite_of_injective_forall_mem,
    intros g1 g2 h_cyclic,
    -- If the cyclic subgroups generated by g1 and g2 are equal, their generators must be related
    have : g1 ∈ Subgroup.zmultiples g2 := by { rw h_cyclic, exact Subgroup.mem_zmultiples g1 },
    rw Subgroup.mem_zmultiples_iff at this,
    -- g1 = g2^n for some integer n
    cases this with n h,
    cases n,
    -- If n = 0, g1 is the identity element
    { simp [pow_zero] at h, exact h.symm },
    -- Otherwise, g1 and g2 are equal up to powers, ensuring injectivity
    exact h },
  -- But this contradicts the assumption that G has finitely many subgroups
  exact not_infinite_of_fintype this,
end

-- Finally, we combine the two lemmas to get the desired iff statement
theorem group_finite_iff_finitely_many_subgroups : Fintype G ↔ Fintype (Subgroup G) :=
⟨finite_group_finite_subgroups, finite_subgroups_finite_group⟩

Matthew Ballard (Sep 13 2024 at 16:25):

Which is like Lean 3.75

Matthew Ballard (Sep 13 2024 at 16:27):

To @Jason Rute's prompt

example (m n : MyNat) : MyNat.plus_left m n = MyNat.plus_right m n := by
  induction m with
  | zero =>
    -- Base case: m = zero
    -- plus_left zero n = n by definition
    -- plus_right zero n = n by definition
    rfl
  | succ m ih =>
    -- Inductive step: Assume the hypothesis holds for m
    -- Show it holds for succ m
    calc
      plus_left (succ m) n
        = succ (plus_left m n)        : by rfl
        ... = succ (plus_right m n)   : by rw ih
        ... = plus_right (succ m) n   : by rfl
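That calc block is Lean 3 syntax (the "... : proof" steps), and the zero base case is not actually rfl: plus_right .zero n only reduces after induction on n, because plus_right recurses on its second argument. For comparison, here is a sketch of a proof that should work in current Lean 4, given the definitions from Jason's prompt; the helper lemma names are invented for illustration and the tactic details are untested:

-- plus_right recurses on its *second* argument, so zero-on-the-left
-- needs induction on n.
theorem MyNat.plus_right_zero_left (n : MyNat) :
    MyNat.plus_right .zero n = n := by
  induction n with
  | zero => rfl
  | succ n ih => simp [MyNat.plus_right, ih]

-- Likewise, succ-on-the-left commutes out of plus_right by induction on n.
theorem MyNat.plus_right_succ_left (m n : MyNat) :
    MyNat.plus_right (.succ m) n = .succ (MyNat.plus_right m n) := by
  induction n with
  | zero => rfl
  | succ n ih => simp [MyNat.plus_right, ih]

example (m n : MyNat) : MyNat.plus_left m n = MyNat.plus_right m n := by
  induction m with
  | zero => simp [MyNat.plus_left, MyNat.plus_right_zero_left]
  | succ m ih => simp [MyNat.plus_left, MyNat.plus_right_succ_left, ih]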

Matthew Ballard (Sep 13 2024 at 16:28):

And I need to return to productivity :)

Eric Taucher (Sep 13 2024 at 16:53):

Ran Jason's example and captured it to video, which was larger than the 25MB limit for uploading files here. Sorry, I am not so adept at modifying and uploading video files, but I am more than happy to email it to someone so they can use their magic.

Here are the messages that ChatGPT displayed with model o1-preview.

ChatGPT messages

Eric Taucher (Sep 13 2024 at 17:04):

FYI

AMA with the OpenAI o1 team

https://x.com/OpenAIDevs/status/1834608585151594537

Eric Wieser (Sep 13 2024 at 17:17):

Matthew Ballard said:

Anyone still have Lean 3 running?

The web editor is still live; just use ```lean3 instead of ``` for code blocks, and the "open in playground" link will still work:

import mathlib

Eric Taucher (Sep 14 2024 at 12:24):

One point of note related to the new o1 models from OpenAI: OpenAI will adjust the models moving forward, and some prompts that work at present may not work in the next few weeks. As such, I have found it beneficial to establish a baseline of questions early on to compare against future adjustments, so that one has valid facts for determining whether the model is degrading.

As I am not as adept as many here at building such a baseline of questions, it would be appreciated if others could post candidates here.

Bas Spitters (Sep 16 2024 at 08:26):

OpenAI has released a new model that is claimed to be better at reasoning. Has anyone tried to rerun any of the existing benchmarks of AI for ITP to see whether it is relevant for our community?
https://openai.com/index/introducing-openai-o1-preview/

Notification Bot (Sep 16 2024 at 11:03):

A message was moved here from #Machine Learning for Theorem Proving > o1 by Jason Rute.

Jason Rute (Sep 16 2024 at 11:17):

@Bas Spitters, it would have to be a benchmark that previously used an LLM like GPT-4 or GPT-4o to synthesize whole proofs. There certainly are such benchmarks for MiniF2F and PutnamBench, for example, those run by @George Tsoukalas and @Amitayush Thakur, or those in CoqPilot (none of the CoqPilot authors appear to be in this Zulip, so we might have to repeat this discussion on the Coq Zulip). I think there would be some practical concerns, such as whether the same prompts work for o1, whether they can do these evals without getting rate limited, and whether these researchers can afford the extra cost of running o1 benchmarks. (And the usual concerns about benchmarks getting leaked into LLM training data.) I think approaches like Copra (which call GPT-4o at each step of the proof) would not be a good fit for o1, since it would use a lot of computation for each step, but that is not to say there isn't a future way to combine o1's power with knowledge of Lean/Coq's internal state.

Jason Rute (Sep 16 2024 at 11:23):

Oh, and @Sean Welleck's benchmarks in miniCTX and miniCodeProps. Those would be good o1 benchmark candidates.

Jason Rute (Sep 16 2024 at 11:28):

Cross post to the CoqPilot discussion: https://coq.zulipchat.com/#narrow/stream/252087-Machine-learning-and-automation/topic/CoqPilot/near/470561267

George Tsoukalas (Sep 16 2024 at 15:28):

Jason Rute said:

There certainly are such benchmarks for MiniF2F and PutnamBench, for example, those run by George Tsoukalas and Amitayush Thakur … whether they can do these evals without getting rate limited, and whether these researchers can afford the extra cost of running o1 benchmarks.

An interesting thing I noticed is that the math demonstration of o1 in the blog post is actually on a Putnam problem (2023 A2) from last year! I tried o1-preview on the Lean formalization in PutnamBench and it wasn't able to produce a proof. o1-preview mentioned that there are some intricate parts in the argument and left sorrys instead of proving them. Also, it seems that it still confuses Lean 3 and Lean 4 syntax. Based on this, I'm not sure I would expect strong results. And the cost of running it on the whole benchmark would be relatively high: the API is fairly pricey, and the outputs generally contain a lot of tokens for these problems. I think if OpenAI could do an evaluation on any formal theorem-proving benchmark (it doesn't have to be PutnamBench), I would be more confident about using some of our budget to do an eval ourselves.

Sid (Sep 21 2024 at 01:17):

Seems like Qwen 2.5 70B with a reward model on 256 generations + python execution matches o1-mini's AIME score (o1-mini is rumored to be 70B as well)

Jason Rute (Sep 21 2024 at 07:10):

@Sid, reference? Is this Qwen-2.5-Math? Is this in the official documentation or someone else's result? I pity the graders who had to check 256 automatic proofs for correctness. :)

Sid (Sep 21 2024 at 07:25):

Jason Rute said:

Sid, reference? Is this Qwen-2.5-Math? Is this in the official documentation or someone else's result? I pity the graders who had to check 256 automatic proofs for correctness. :)

It's here (https://qwenlm.github.io/blog/qwen2.5-math/) - last table on the page. Haha, I think it's just final-answer correctness (that said, my guess is OpenAI's result is too).
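For context, "a reward model on 256 generations" is best-of-n sampling (often abbreviated rm@256): draw n candidate solutions and keep the one the reward model scores highest. In the usual notation (a standard formulation, not taken from either blog post):

$$ \hat{y} \;=\; \operatorname*{arg\,max}_{1 \le i \le 256} \, r_\phi(x, y_i), \qquad y_1, \dots, y_{256} \sim \pi_\theta(\cdot \mid x), $$

where $\pi_\theta$ is the language model, $r_\phi$ is the reward model, and only the selected sample's final answer is graded.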


Last updated: May 02 2025 at 03:31 UTC