Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Current state of end-user tools


Patrick Massot (Jan 04 2025 at 10:45):

A new year begins and I have a couple of new propaganda talks to prepare. As with each new round of talks, I try to figure out whether I could show anything using ML. There was one year when I could show autoformalization in Lean 3 using Zhangir and Ed’s Lean chat extension. But since then I was never able to find anything that could be seen as useful to users (as opposed to nice research projects) except for Moogle/LeanSearch. Is there anything that is usable today and would either propose auto-formalization or suggest next steps? Note that I don’t want to show Copilot since it seems to simply remember what I typed while preparing the demo.

Kevin Buzzard (Jan 04 2025 at 11:02):

I use copilot and I agree that it's good at remembering, which makes it look even more exciting.

Siddhartha Gadgil (Jan 04 2025 at 11:18):

LeanAide does some autoformalization tolerably well. I will send a demo video soon.

Patrick Massot (Jan 04 2025 at 12:00):

Kevin Buzzard said:

I use copilot and I agree that it's good at remembering, which makes it look even more exciting.

I think there are already enough people whose full-time job is to lie about what LLMs can do; they don’t need my help.

Patrick Massot (Jan 04 2025 at 12:02):

I will definitely try LeanAide before my next talk.

Jason Rute (Jan 04 2025 at 12:31):

Some folks like @Kim Morrison have said that Sonnet 3.5 is pretty good at getting Lean syntax right. It is not free, however (and you also need a free AI coding assistant plugin like Continue.dev, or a VS Code AI-enabled clone like Cursor; those also work with all sorts of LLM models).

Jason Rute (Jan 04 2025 at 12:32):

It is unfortunate that Lean syntax massively changed right in the middle of the AI boom. This has created a lot of artificial problems that I think Isabelle doesn’t have. (Nor does Rocq/Coq, although Rocq’s/Coq’s lack of a standard library is an interesting challenge.) It still might be possible that there is a good prompt to get standard LLMs to avoid common mistakes like import Mathlib.Data.Nat.Basic.

Jason Rute (Jan 04 2025 at 12:32):

One thing I suggest is noting some examples you would like to see. Then maybe someone with Sonnet 3.5 access or LeanAide access (@Siddhartha Gadgil ) could let you know if current tools can do it.

Jason Rute (Jan 04 2025 at 12:32):

@Siddhartha Gadgil is LeanAide able to use Sonnet 3.5 and other LLM APIs, or just OpenAI?

Siddhartha Gadgil (Jan 04 2025 at 12:40):

LeanAide primarily uses the "OpenAI API", but many models support this API (e.g. Mistral on vLLM). I also have support for the Gemini API. In practice I am mostly using the OpenAI models.

One thing I plan to try soon is to switch to using LLMLean instead of my small ad hoc (curl based) internal API. Ideally we should all use common packages for common tasks, and if some important support is missing then it can be contributed upstream.

Patrick Massot (Jan 04 2025 at 12:41):

Thanks Jason. I’m not interested in non-free tools or tools that force using a weird editor.

Patrick Massot (Jan 04 2025 at 12:42):

Does LeanAide require a subscription to some LLM?

Siddhartha Gadgil (Jan 04 2025 at 12:46):

In the default configuration LeanAide uses GPT-4o, so that does require a subscription. However, one can locally host another model (say using vLLM) and configure LeanAide to use the correct url.

In my experiments, for at least a relatively easy test set, the model Mathstral from Mistral gave decent results. It can be run on a workstation with a single 20GB GPU (which my department has). Testing with other open models is something I would like, but I am personally prioritising getting things working with the best (and easiest to use) models available.

Jason Rute (Jan 04 2025 at 12:52):

I think many other public models also use OpenAI’s API syntax.

Siddhartha Gadgil (Jan 04 2025 at 12:53):

Indeed, it has become the standard. Even Gemini supports it by public demand. LeanAide can use any such model - one just has to specify it by url.

Jason Rute (Jan 04 2025 at 12:56):

So it can use other models as long as someone has an API url/key and that API supports OpenAI’s standard?

Siddhartha Gadgil (Jan 04 2025 at 12:58):

Yes for open models, at least if the configuration is of the form I expect. I have hosted models within our LAN with vLLM and used them by specifying the url.

As of now, there is no support for authentication other than OpenAI, Gemini and Azure-OpenAI. But this will be easy to add.
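
For example, a client for a locally hosted model only needs the base url changed. Here is a minimal sketch (the model name and port are placeholders, and it assumes an OpenAI-compatible server such as vLLM is already running, e.g. via vllm serve):

from openai import OpenAI

# Point the standard OpenAI client at a local vLLM (or similar) server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # your LAN/localhost endpoint
    api_key="EMPTY",                      # vLLM accepts any key by default
)

response = client.chat.completions.create(
    model="mistralai/Mathstral-7B-v0.1",  # whichever model the server hosts
    messages=[{"role": "user", "content": "State Gauss's summation formula in Lean 4."}],
)
print(response.choices[0].message.content)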

Alex Kontorovich (Jan 04 2025 at 13:43):

Patrick Massot said:

Kevin Buzzard said:

I use copilot and I agree that it's good at remembering, which makes it look even more exciting.

I think there are already enough people whose full-time job is to lie about what LLMs can do; they don’t need my help.

I don't follow - can you please elaborate on what you have against copilot?

Kevin Buzzard (Jan 04 2025 at 13:44):

Jason Rute said:

It is unfortunate that Lean syntax massively changed right in the middle of the AI boom. This has created a lot of artificial problems that I think Isabelle doesn’t have. (Nor does Rocq/Coq, although Rocq’s/Coq’s lack of a standard library is an interesting challenge.) It still might be possible that there is a good prompt to get standard LLMs to avoid common mistakes like import Mathlib.Data.Nat.Basic.

Did anyone from the Lean community react to https://openai.com/form/rft-research-program/ ? I was half-tempted but decided I already had enough on my plate.

Patrick Massot (Jan 04 2025 at 14:19):

Alex Kontorovich said:

Patrick Massot said:

Kevin Buzzard said:

I use copilot and I agree that it's good at remembering, which makes it look even more exciting.

I think there are already enough people whose full-time job is to lie about what LLMs can do; they don’t need my help.

I don't follow - can you please elaborate on what you have against copilot?

Here specifically I mean the following sequence of actions does not produce accurate things to show in a talk:

  • I prepare some Lean demo in VSCode, by typing everything myself.
  • Then I wonder whether Copilot could have helped and whether I could show it during the talk, so I try turning on suggestions.
  • Now Copilot simply copy-pastes what I typed in step 1, because it remembers me typing it.

I saw this happening and I remember Jeremy also had this issue (and it seems Kevin saw it too).

Patrick Massot (Jan 04 2025 at 14:19):

This specific concern is completely independent from the general issues surrounding LLMs.

Kevin Buzzard (Jan 04 2025 at 14:20):

Yes it happened to me precisely once and I laughed it off (and told the audience what had probably happened, because it hadn't happened in the rehearsal!)

Jason Rute (Jan 04 2025 at 14:54):

Jason Rute said:

One thing I suggest is noting some examples you would like to see. Then maybe someone with Sonnet 3.5 access or LeanAide access (Siddhartha Gadgil) could let you know if current tools can do it.

I again want to strongly emphasize this. It is often difficult or intimidating for users to directly try out new tools like LeanAide, Lean Copilot, or Continue.dev+Sonnet 3.5. Also, those who do know how to use the tools often don’t share their experiences. I think a middle ground is for users to share problems they would like these tools to solve, and for those who are comfortable with the tools to try those problems out on them. Also, it is just good in general for us to collect lists of problems we would like an AI tool to solve. It isn’t ideal, but I think it would help.

Jason Rute (Jan 04 2025 at 14:55):

To get started, here is something I tried recently with limited success:

Write a Lean 4 proof of the fact that the square of the sum of the first $n$ numbers is the sum of the first $n$ cubes.

Siddhartha Gadgil (Jan 04 2025 at 15:51):

Jason Rute said:

To get started, here is something I tried recently with limited success:

Write a Lean 4 proof of the fact that the square of the sum of the first $n$ numbers is the sum of the first $n$ cubes.

I agree with this idea. LeanAide is not yet ready for this level of autoformalization but this is a current focus so hopefully it will get better in a month or so. For now, it assumes that the formulas for sums of numbers and of cubes of numbers are known (we should recursively ask for the proofs of these, which is one of the things I have to work on).

Jason Rute (Jan 04 2025 at 15:57):

If I ask Grok, for example, it gets the statement mostly right. It just messes up the imports (like most models do) and doesn’t open Finset. The proof is wrong on multiple counts.

import Mathlib.Data.Nat.Basic
import Mathlib.Data.Nat.Parity
import Mathlib.Tactic.Linarith

theorem sum_of_squares_is_sum_of_cubes (n : ℕ) :
  (∑ i in range n, i)^2 = ∑ i in range n, i^3 := by
  -- Use induction on n
  induction n with
  | zero =>
    -- Base case: when n is 0, both sums are 0
    simp only [range_zero, sum_nil, pow_zero, zero_pow]
  | succ n ih =>
    -- Step case: assume the theorem holds for n, prove for n+1
    simp only [range_succ, sum_range_succ, pow_two]
    -- Expand both sides using the induction hypothesis
    rw [ih, ← add_assoc, add_mul, mul_add, mul_comm (n * (n + 1)) 1]
    -- Simplify the expression
    ring

Kevin Buzzard (Jan 04 2025 at 15:59):

(for example $2025=\sum_{n=1}^{9}n^3$ :-), certainly the only time I'll see that phenomenon in my lifetime! )

Jason Rute (Jan 04 2025 at 15:59):

@Siddhartha Gadgil The sum of the first n numbers is in mathlib. docs#Finset.sum_range_id
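
For reference, here is a sketch of how a hand-written proof might use that lemma (an untested sketch against a recent Mathlib, so names and syntax may need small adjustments):

import Mathlib

open Finset

-- Nicomachus's theorem: the sum of the first n cubes is the square of
-- the n-th triangular number.
theorem sum_range_cube_eq_sq (n : ℕ) :
    ∑ i ∈ range n, i ^ 3 = (∑ i ∈ range n, i) ^ 2 := by
  induction n with
  | zero => simp
  | succ n ih =>
    -- Gauss's formula applied to range (n + 1) avoids natural subtraction.
    have h : ((∑ i ∈ range n, i) + n) * 2 = (n + 1) * n := by
      simpa [sum_range_succ] using sum_range_id_mul_two (n + 1)
    -- Multiply through by n so the final step is linear in the monomials.
    have h' : ((∑ i ∈ range n, i) + n) * 2 * n = (n + 1) * n * n := by
      rw [h]
    rw [sum_range_succ, ih, sum_range_succ]
    nlinarith [h']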

Siddhartha Gadgil (Jan 04 2025 at 16:08):

@Jason Rute Thanks for this. Seeing what is happening in my code immediately suggests a logical improvement which I will implement (currently I only search for results used to prove a claim; I should search for the claims themselves too).

Alex Kontorovich (Jan 04 2025 at 16:26):

I don't need a copilot-type LLM to give me entire theorem statements (let alone correct, formalized proofs); even if it suggests statements that are close to what I need, that's already a great speed-up of my workflow. (I want quasi-autoformalization, where I'm allowed to interact with the LLM and correct its code, not try to get it to do everything perfectly all by itself.) In that sense, I find GitHub's Copilot, seamlessly integrated in VSCode, to already be extremely useful. In my opinion, the fact that it remembers something I deleted and suggests it next, thereby ruining a demo that's been practiced, is not a reason not to show it to people as a useful resource...(?) (In the PEP I'm running at the JMM, I certainly plan to show people how to harness every chatbot I know of, in case it helps them formalize mathematics...)

Jason Rute (Jan 04 2025 at 16:35):

I would love someone to do a systematic evaluation (even if not scientific) of all the major LLMs for code on Lean in a tool like GitHub Copilot, Continue.dev, or Cursor. Models like Codestral, Claude Sonnet 3.5, GPT-4o, GPT-4, Qwen-2.5-coder, Llama-3.5, Gemini, and more. Also chat bots like ChatGPT, Gemini, Grok, etc., and reasoning chat bots like o1-mini/preview/pro, Gemini-2.0-flash-thinking, QwQ, DeepSeek-R1, etc.

Patrick Massot (Jan 04 2025 at 16:45):

Alex Kontorovich said:

In my opinion, the fact that it remembers something I deleted and suggests it next, thereby ruining a demo that's been practiced, is not a reason not to show it to people as a useful resource...(?)

This is not in a teaching context where I try to teach people how to formalize mathematics. It is in a context where I try to show what proof assistants can do today. So showing an LLM example where the LLM does everything perfectly because it saw it before would just be a lie.

Jason Rute (Jan 04 2025 at 16:48):

@Patrick Massot is the only issue the copying? Like if some other plugin didn’t remember erased code and gave ok-but-not-perfectly correct Lean code suggestions, would that be interesting to you?

Patrick Massot (Jan 04 2025 at 16:49):

Yes, in that context that would be great.

Jason Rute (Jan 04 2025 at 17:02):

You could try the continue.dev plugin with different models and see if they have this behavior. I don’t know if the memorizing behavior of GitHub Copilot is through the model API or the plugin. Or the more difficult-to-implement approach is to do the demo with a fresh problem every time, or to never fix the generated code for the example. :/ Another not-ideal approach is to use a chat interface like ChatGPT and copy and paste the code.

Jason Rute (Jan 04 2025 at 17:08):

A Reddit thread guesses that this GitHub Copilot behavior is due to turning on the option “Allow GitHub to use my code snippets for product improvements”.

Kim Morrison (Jan 05 2025 at 01:00):

(Repeating what I've said before, the Claude Sonnet-powered code completion agent available in Cursor is night-and-day better than GitHub Copilot. Switching to Cursor was pretty much painless for me --- I still have to use VS Code for remote ssh sessions, but otherwise haven't noticed any other annoyance. I would love to hear others' experience with it.)

Jason Rute (Jan 05 2025 at 01:38):

@Kim Morrison Just like Lean-specific tools such as LeanAide and Lean Copilot, I think non-Lean-specific tools like Cursor+Sonnet are confusing to people (although they are probably easier to install and use). Do you think you are the only Lean user using Cursor+Sonnet? I’ll try it out soon and let people know my thoughts, but here are some quick questions for you or someone else who knows:

  • is the Cursor+Sonnet experience due more to Cursor or to Sonnet? (Cursor is an IDE and Sonnet is a model. I think Sonnet is not free.)
  • how much does this setup cost a month?
  • does it help with Lean specific tasks like autoformalization or filling in a proof, or just generic programming like refactoring, boilerplate, and other standard stuff?
  • what is a good user workflow for people doing formalization projects using Cursor+Sonnet?
  • what are some tasks it really excels at?
  • what does it suck at?
  • Lean has a lot of hidden information like goal states or the error messages of tactics. Is there any good way to expose this to the model? Like pasting the goal state or asking it to fix an error?
  • is there any special setup? Like can you change the system prompt to get the model to avoid common mistakes or learn how to query #loogle, #moogle or try tactics like exact??

Julian Berman (Jan 05 2025 at 01:48):

(I don't know the answers to most of that) but I'll just mention that I know that the Neovim equivalent of Cursor supports lots of models but tells users not to bother with anything other than Sonnet and that the other models have much worse performance.

Jason Rute (Jan 05 2025 at 01:59):

@Kim Morrison To clarify, I didn’t mean it pejoratively when I asked if you are the only Cursor+Sonnet Lean user. It is a legitimate question.

Jason Rute (Jan 05 2025 at 01:59):

You are the only one who talks about using that setup here on Zulip.

Kim Morrison (Jan 05 2025 at 05:49):

  • The nice experience is both the interface and the model:
    • Cursor will suggest multiline edits, as well as include visual prompts that it has a suggestion further down the screen than your cursor, which tab will take you to.
    • The model seems to know Lean 4 syntax well. It routinely suggests well-formed cases and induction structured tactics, and I don't recall seeing any Lean 3-isms.
    • Cursor seamlessly provides good context. I've been doing some boring work making the Array API match the List API, and Cursor regularly manages to suggest the next Array theorem (name, statement, and correct proof), apparently looking at the corresponding List file open in a different tab to work out what comes next, and seeing the pattern of how I've been proving the recent Array theorems.
  • I think I pay $20USD/month (or rather, the FRO does). I'd happily pay quite a bit more, tbh.
  • It's worth noting that I've been really enjoying Cursor while I'm working through this tedious List/Array work, which has a lot of boilerplate. But I still regularly use it elsewhere.
  • In terms of workflow, I don't attempt to use anything besides the "out of the box" experience. If tab completion suggests something good, I take it, but I don't attempt to prompt it at all.
  • That said, the built in chat mode is great for writing shell/python/github scripts. I find it better than using ChatGPT canvas mode because the editor integration lets me see what changes it is suggesting more easily.
  • I haven't done any experiments to work out how much of error messages it is reading, but I would really really like to know this!
  • I have not investigated custom prompts.

Jarod Alper (Jan 06 2025 at 04:37):

I am giving a talk at the JMM where I would also like to showcase the state-of-the-art of autoformalization tools such as LeanCopilot, LLMlean, and LeanAide. I've found each of these tools frustratingly difficult to set up, and unfortunately I've spent far more time trying to get them correctly installed than I've been able to actually experiment with them. It's usually the first step of the installation in the readme file that trips me up. The instructions always seem written with professional developers in mind.

Right now, in addition to GitHub Copilot (which I've never had issues getting to work), the only other tool that I have working is LLMlean, so that's the one I will demo this week.

Jason Rute (Jan 06 2025 at 05:30):

What is the talk?

Jason Rute (Jan 06 2025 at 05:34):

But I do agree the Lean-specific ones are very hard to set up (and their perceived user performance is far from the research papers). I am still having trouble getting Lean Copilot working again. I just tried Cursor today and it was very easy to set up (and showed a lot of potential). (I’m not entirely sure how it is using Sonnet 3.5 since I don’t recall having a paid Sonnet account. Is it a trial period?)

Jason Rute (Jan 06 2025 at 05:35):

(I’m also giving a talk at JMM, but currently don’t plan to demo any tools.)

Siddhartha Gadgil (Jan 06 2025 at 06:15):

@Jason Rute @Jarod Alper Please let me know what part of LeanAide installation is hard. I have updated the README at https://github.com/siddhartha-gadgil/LeanAide. There are probably issues still, but if you let me know I will try to fix them.

Kim Morrison (Jan 06 2025 at 06:58):

Is it a trial period?

Yes. I remember it lasting me about a week (I think it depends on number of queries, not a fixed amount of time.)

Kevin Buzzard (Jan 06 2025 at 08:31):

I was given 2 weeks recently for cursor/sonnet

Jason Rute (Jan 06 2025 at 15:42):

@Siddhartha Gadgil I can give more details later, but I'm struggling to use it inside a new project. New projects use lakefile.toml (run elan update first). Also, when I fix that, it still isn't downloading the Mathlib cache, I think because of Lean/Mathlib version differences. See #new members > How to install LeanCopilot? and the surrounding discussion. I haven't tried directly working with the LeanAide repo yet. I can try that later.

Siddhartha Gadgil (Jan 06 2025 at 16:32):

Jason Rute said:

Siddhartha Gadgil I can give more details later, but I'm struggling to use it inside a new project. New projects use lakefile.toml (run elan update first). Also, when I fix that, it still isn't downloading the Mathlib cache, I think because of Lean/Mathlib version differences. See #new members > How to install LeanCopilot? and the surrounding discussion. I haven't tried directly working with the LeanAide repo yet. I can try that later.

Thanks. Besides toolchains not matching, there is an issue of embeddings being downloaded. I will sort these issues out in a couple of weeks and try to have a tag/branch for each stable toolchain/mathlib.

Jarod Alper (Jan 06 2025 at 17:14):

Jason Rute said:

What is the talk?

Embracing AI and Formalization: Experimenting with Tomorrow's Mathematical Tools

Jarod Alper (Jan 06 2025 at 17:32):

Siddhartha Gadgil said:

Jason Rute Jarod Alper Please let me know what part of LeanAide installation is hard. I have updated the README at https://github.com/siddhartha-gadgil/LeanAide. There are probably issues still, but if you let me know I will try to fix them.

My first issue was not knowing where to put the export OPENAI_API_KEY=<your-open-ai-key> command. I tried each of the files .bash_profile, .zprofile, .zshrc, and .bashrc, but it didn't work. I also didn't know whether or not to include the angle brackets; I think not. In the case of LLMlean, I had a similar issue and simply hard-coded my API key into the LLMlean files. There is also either a missing or an additional quotation mark in the command require LeanCodePrompts from git "https://github.com/siddhartha-gadgil/LeanAide"@"main
For the record, I tried installing it both by cloning the LeanAide github repository and by installing it in my existing Lean project, but neither worked for me.

I had Lean Copilot working a few months ago, but it stopped working at some point, probably after updating mathlib.

Siddhartha Gadgil (Jan 06 2025 at 18:14):

@Jarod Alper Thanks. I will fix the README.

The angle brackets are not needed, so for example export OPENAI_API_KEY="sk-proj-hhPtGBjdz14NO1KxDB5U_kWksdlkk3kJF". For running LeanAide after cloning, here are the steps (and an alternative) from bash:

./fetch.sh
export OPENAI_API_KEY=<your-open-ai-key>
code .

That is, we open code from a shell after setting the environment variable. One can also edit .bashrc, then open a new shell and launch code from that.

As environment variables can be fiddly in some contexts, an alternative is to have your OpenAI API key as the contents of the file with path rawdata/OPENAI_API_KEY relative to the base of the LeanAide project.

Using it from another project smoothly will take some work from me (embeddings are toolchain-specific, so matching is an issue), but the repository by itself should work as above.

Jarod Alper (Jan 08 2025 at 06:23):

@Siddhartha Gadgil Thank you! I was able to get it working.

GasStationManager (Jan 08 2025 at 16:55):

I'll mention LeanTool, which I've been working on. I would say it is in an experimental stage, as I'm not aware of any user other than me. But I want to advocate for the general approach of exposing to the AI all the features that make Lean a nice interactive theorem prover for human users: error message feedback for syntax errors; goals from sorries; search engines like Moogle and LeanSearch; interactive commands like apply? and plausible; and automated proving tools like aesop, Duper, and lean-auto. The rationale is that general-purpose LLMs (open source or otherwise) will keep getting stronger in their general intelligence and reasoning abilities; we just need to provide the rest.

Many of these features are not in the training data of these LLMs for various reasons, so getting the LLM to use them requires some prompting effort. There are some initial efforts in the repo, but many folks here have more experience/expertise in the usage of these features, so suggestions are appreciated!
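
To illustrate the shape of the approach (this is not LeanTool's actual implementation; ask_llm stands in for any chat-completion call, and the lake env lean invocation assumes you are inside a Lake project):

import subprocess
import tempfile

def check_lean(code: str) -> str:
    """Compile code with Lean via Lake and return diagnostics ('' if clean)."""
    with tempfile.NamedTemporaryFile("w", suffix=".lean", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["lake", "env", "lean", path],
                            capture_output=True, text=True)
    return (result.stdout + result.stderr).strip()

def prove(task: str, ask_llm, max_rounds: int = 5):
    """Generate Lean code, compile it, feed errors back to the model, retry."""
    prompt = task
    for _ in range(max_rounds):
        code = ask_llm(prompt)       # candidate Lean source from the model
        errors = check_lean(code)
        if not errors:
            return code              # it compiles: accept
        prompt = (f"{task}\n\nYour previous attempt:\n{code}\n\n"
                  f"Lean reported:\n{errors}\n\nPlease fix the errors.")
    return None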

Vedant Gupta (Jan 08 2025 at 18:59):

(deleted)

Hanting Zhang (Jan 29 2025 at 05:25):

I've been scanning through all of the threads in #Machine-Learning-for-Theorem-Proving and am kind of surprised that there isn't more interest in DeepSeek-Prover-V1.5 (DSP-V1.5).

As far as I can see, this model has the "best" absolute performance on miniF2F/ProofNet. But (at least for me) it's gated by the fact that

  1. I can't run it easily.
  2. It's not very useful in covering the "last mile" of friction that makes it easy to integrate into a workflow. By "last mile" I mean the small sub-goals you often sorry at first before closing them at the end, because they are not hard enough to warrant immediate attention but also complicated enough to be annoying to deal with.

Right now DSP-V1.5 is trained on whole-proof generation on the actual theorem text, rather than pairs of (context, goals) states. So although DSP-V1.5 can close pretty complicated goals (like here), it can't reason starting from the state in the middle of a proof. Or I guess it can, but only in an ad-hoc manner that consumes the entire text representation of the theorem again and attempts to continue the proof from that text.

The other issue with these tools is that they don't update their knowledge based on new data the user might have just written. So if you're working on a new project with lemmas about some new object X, then the model has no idea what X is because it doesn't exist in the training data.

(warning: I don't know much about ML and the following is highly-likely to be somewhat BS)

Shouldn't it be possible to use the method from DSP-V1.5 but with (context, goals) pairs as the training data? Assuming the performance can be the same, this would be much easier to integrate into something like LLMLean or other gym environments.
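
Concretely, I imagine training records shaped something like this (all field names here are made up for illustration):

# Hypothetical shape of a (context, goal) -> next-step training record,
# as opposed to a whole-theorem-text -> whole-proof record.
example_record = {
    # declarations from the current file/project visible at this point
    "context": [
        "structure X where ...",
        "lemma X.mul_comm (a b : X) : a * b = b * a := ...",
    ],
    # pretty-printed tactic state in the middle of a proof
    "goal": "a b : X\n⊢ a * b = b * a",
    # the step the model should learn to predict
    "next_step": "exact X.mul_comm a b",
}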

For the other issue, shouldn't there be some way to handle this with a RAG database? That's pretty much the extent of my knowledge on how this stuff works. (Does LeanCopilot try to do something like this?)

But assuming these two issues can be fixed, the next step would be to add this as a backend to something like LLMLean or LeanCopilot to expose it as a tactic to end-users. There should be a considerable jump in the complexity of what "last mile" goals can be closed.

Jason Rute (Jan 29 2025 at 13:34):

When you say (context, goals), what do you mean by context? Local context? If so, that is what a number of papers do. Have you looked at older papers like HyperTree Proof Search (or its newer version, ABEL)? Also AlphaProof might work that way.

Jason Rute (Jan 29 2025 at 13:34):

But I’m a bit confused. Doesn’t DeepSeek-Prover v1.5 have the ability to start in the middle of a proof? Isn’t that the point of the v1.5 tree search? As for consuming the whole proof, if there is good prompt caching, this could mitigate the cost of long prompts in the tree search.

Jason Rute (Jan 29 2025 at 13:34):

As for keeping up-to-date, you can check out papers like LeanDojo, Graph2Tac (for Coq), Rango (for Coq), and miniCTX. Also, there are many more ways to keep a prover up-to-date and it isn’t clear what is best. There is another thread on here discussing many options.

Jason Rute (Jan 29 2025 at 13:35):

Also, finally, another thing with DeepSeek Prover is that it was only trained on competition-type problems, so it probably wouldn’t be useful in practice. But it is a great paper, I agree!

Thomas Zhu (Jan 29 2025 at 13:43):

DeepSeek Prover v1.5 can start in the middle of the proof. It consumes the current proof state as a comment.
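
Roughly, a resumed prompt then looks like this (the comment format here is from memory of the paper, so treat it as approximate):

theorem sq_add_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have ha : 0 ≤ a ^ 2 := sq_nonneg a
  /- tactic state:
    a b : ℝ
    ha : 0 ≤ a ^ 2
    ⊢ 0 ≤ a ^ 2 + b ^ 2
  -/
  -- generation resumes here, conditioned on the state above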

Hanting Zhang (Jan 29 2025 at 18:21):

@Jason Rute thanks for pointing out MiniCTX, I wasn't aware of it till now!

Hanting Zhang (Jan 29 2025 at 18:28):

Thomas Zhu said:

DeepSeek Prover v1.5 can start in the middle of the proof. It consumes the current proof state as a comment.

Yes, I see now. Somehow I missed this when reading the paper, but appending it as a comment makes total sense.

Also, reading the MiniCTX paper,

  1. I would be really interested to see results using DSP-V1.5 instead of 4o. Do you guys have any plans to explore that direction?

Thomas Zhu (Jan 29 2025 at 21:05):

@Hanting Zhang Yes, we have also planned to evaluate DeepSeek-Prover v1.5 on miniCTX (previous discussion regarding this, and on using DeepSeek-Prover in general, here).

Huajian Xin (Jan 30 2025 at 11:41):

You can also try DeepSeek-R1, which is also trained on the data of DeepSeek-Prover V1.5.

George Tsoukalas (Jan 30 2025 at 20:37):

We are also planning to evaluate o1/o3, Sonnet, Gemini Thinking, and R1 on PutnamBench soon. While it is a competition problem dataset, the results will have at least some bearing on what the best frontier models for Lean tasks are.

Andy Jiang (Feb 01 2025 at 15:40):

Huajian Xin said:

You can also try DeepSeek-R1, which is also trained on the data of DeepSeek-Prover V1.5.

Does that mean it was RL'd against the Lean compiler?

Julian Berman (Feb 01 2025 at 16:01):

Anecdotally/unscientifically -- I'm personally still finding Sonnet more helpful than DS-R1 for my simplish Lean proofs.

Julian Berman (Feb 01 2025 at 16:02):

Today I was proving some inequality where one direction was trivial and the other was not (and I provided the easy direction as input); DS-R1 suggested a trivially incorrect symmetric argument (à la changing left to right), while Claude actually helped push me in the right direction.


Last updated: May 02 2025 at 03:31 UTC