Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Incentives & sorry-filling leaderboard


Lenny Taelman (Jan 16 2025 at 11:11):

I really enjoyed @Jason Rute's talk The last mile yesterday. The question raised there about incentives that can nudge progress towards tools of practical use is, I think, a fundamental one. There has been lots of discussion about benchmarks and their limitations here, so here is an idle attempt at formulating another format.

Could filling sorries in public repositories (as explored in LeanAgent, and also mentioned in Jason's talk) be turned into some form of open competition? Something like a leaderboard server with an API, where competing algorithms can fetch a sorry and send back their proof, with the leaderboard keeping track of which team is most successful at filling sorries? (Using some form of Elo rating, where each sorry counts as a single game, so that a team can rise on the board by filling a sorry that a competitor could not.)
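
To make the Elo idea concrete, here is a rough sketch of the update rule I have in mind (team names, the K-factor, and the pairwise treatment are of course just illustrative choices):

# Rough sketch of an Elo-style update where each sorry counts as one "game"
# between every pair of competing teams: a team that fills a sorry which
# another team failed on scores a win against that team.
K = 32  # rating update step size, arbitrary

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update(ratings: dict[str, float], filled: dict[str, bool]) -> None:
    """filled maps team name -> did this team fill the sorry?"""
    teams = list(filled)
    for i, a in enumerate(teams):
        for b in teams[i + 1:]:
            if filled[a] == filled[b]:
                continue  # both solved or both failed: no information
            score_a = 1.0 if filled[a] else 0.0
            e_a = expected(ratings[a], ratings[b])
            ratings[a] += K * (score_a - e_a)
            ratings[b] += K * ((1.0 - score_a) - (1.0 - e_a))

ratings = {"team1": 1500.0, "team2": 1500.0, "team3": 1500.0}
update(ratings, {"team1": True, "team2": False, "team3": True})
print(ratings)  # team2 loses points to both team1 and team3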

Could this be feasible? Would it be useful? Could researchers be enticed to participate in this?

Kevin Buzzard (Jan 16 2025 at 12:07):

I would love to see machines helping me to prove FLT. I am perhaps two weeks away from having a coherent proof of finite-dimensionality of spaces of quaternionic modular forms, modulo a bunch of sorried theorems all of which will have LaTeX proofs; in particular no missing mathematical information and no sorried data. The problem is that I actually want to get this done -- I think it will be a nice paper -- so I don't just want to leave it as a challenge really, I would like to get it over the line.

Lenny Taelman (Jan 16 2025 at 12:59):

What I had in mind would definitely not require "leaving" things as a challenge; I am guessing (hoping) there will always be enough open sorries around. Even if they are open for just a short time window, as long as more than one 'model' gets to have an attempt, it is a potential data point to distinguish their performance and refine their Elo scores.

Kim Morrison (Jan 16 2025 at 22:52):

There's also no reason that agents can't work on "old" sorries as practice problems.

Josh Clune (Jan 17 2025 at 03:58):

Kim Morrison said:

There's also no reason that agents can't work on "old" sorries as practice problems.

I agree there's probably some way to incorporate "old" sorries as part of some evaluation/incentive/leaderboard, but some care would probably be needed to avoid a situation where the framework primarily incentivizes models that are disproportionately good at remembering what proof was used to close an old sorry. If only new sorries are being considered, then there's much less danger of the answer just being in the training set.

Lenny Taelman (Jan 17 2025 at 19:42):

To test the assumption that there will always be enough "new" open sorries around, I scraped GitHub for Lean sorries whose blame date is at most 7 days old. I obtained 63 sorries from 8 repositories. Here is the unfiltered list:
sorry_urls.txt
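
(For the record, the per-line age check is roughly the following; a simplified sketch in Python, not the actual scraper code.)

# Simplified sketch of the age check: ask git blame for the commit date of
# the line containing the sorry, and keep it only if it is at most 7 days old.
import subprocess
from datetime import datetime, timedelta, timezone

def blame_date(repo_dir: str, path: str, line: int) -> datetime:
    out = subprocess.run(
        ["git", "blame", "-L", f"{line},{line}", "--porcelain", path],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    for entry in out.splitlines():
        if entry.startswith("author-time "):
            return datetime.fromtimestamp(int(entry.split()[1]), tz=timezone.utc)
    raise ValueError("no author-time in blame output")

def is_recent(repo_dir: str, path: str, line: int, days: int = 7) -> bool:
    return datetime.now(timezone.utc) - blame_date(repo_dir, path, line) < timedelta(days=days)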

Lenny Taelman (Jan 17 2025 at 19:43):

Not quite sure how to interpret all this; the variation is huge. But much of it should be usable as validation material...

Kim Morrison (Jan 17 2025 at 22:10):

There's lots missing in that list, in particular FLT and Carleson aren't mentioned, but both have many sorrys.

Kim Morrison (Jan 17 2025 at 22:10):

Further, I think scraping branches other than main/master will get more sorrys.

Kim Morrison (Jan 17 2025 at 22:12):

Further, what we really need are "reproducible sorrys", i.e. given a URL, you can git clone it, run lake exe cache get then lake build X.Y.Z and reliably get the warning message about the sorry (ideally have the state at the sorry printed too).
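
Concretely, the check could be as simple as something like this (an untested sketch; the warning string is the one Lean prints, while the repo URL, commit, and module name are placeholders):

# Untested sketch of "reproduce a sorry": clone at a fixed commit, fetch the
# Mathlib cache where available, build one module, and check that Lean's
# "declaration uses 'sorry'" warning shows up.
import pathlib
import subprocess

def reproduce(repo_url: str, commit: str, module: str, workdir: str = "/tmp/sorries") -> bool:
    dest = pathlib.Path(workdir) / commit
    if not dest.exists():
        subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
        subprocess.run(["git", "checkout", commit], cwd=dest, check=True)
    subprocess.run(["lake", "exe", "cache", "get"], cwd=dest)  # may fail harmlessly outside Mathlib-based projects
    build = subprocess.run(["lake", "build", module], cwd=dest, capture_output=True, text=True)
    return "declaration uses 'sorry'" in build.stdout + build.stderr

# e.g. reproduce("https://github.com/some-user/some-project", "<commit sha>", "Some.Module")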

Kim Morrison (Jan 17 2025 at 22:15):

There's then a problem with de-duplicating. Suppose we see "the same" sorry on two different branches of the same repository. When do they count as identical? Even if you pretty print the goal and they are identical, one sorry could be solvable because that branch has acquired the necessary preliminary lemmas, while the other sorry isn't.

Kim Morrison (Jan 17 2025 at 22:20):

Perhaps the right thing to do is to index all found sorries by their (hashed) pretty printed goal, but keep the git url (possibly plus timestamp?) for every discovered appearance of that sorry.

Then, for most purposes solving any one version of "that" sorry can count as victory, and if you want to be really thorough you could then try out that same proof on all the other appearances.
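
In code, the index I'm imagining is just something like this (field names invented, only to fix the data shape):

# Sketch of the index: key each elaborated sorry by a hash of its pretty
# printed goal, and record every (repo URL, commit, file, line, timestamp)
# where that goal was observed.
import hashlib
from collections import defaultdict

def goal_key(pretty_goal: str) -> str:
    return hashlib.sha256(pretty_goal.encode("utf-8")).hexdigest()

def build_index(elaborated_sorries: list[dict]) -> dict[str, list[dict]]:
    index: dict[str, list[dict]] = defaultdict(list)
    for s in elaborated_sorries:
        if s.get("goal") is None:
            continue  # skip entries that failed to elaborate
        index[goal_key(s["goal"])].append({
            "url": s["url"], "commit": s["commit"],
            "file": s["file"], "line": s["line"],
            "first_observed": s["first_observed"],
        })
    return index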

Kim Morrison (Jan 17 2025 at 23:19):

I propose the following architecture:

  • a file known_lean_repos.json, initially just an array of URLs
  • a script to scrape Reservoir and add to known_lean_repos.json
  • a script to query the Github API to add to known_lean_repos.json
  • a file observed_sorries.json, containing initially an array of tuples:
    • git URL (containing a git SHA, not HEAD or a branch name)
    • a line number
    • and a timestamp first_observed
  • now upgrade known_lean_repos.json so that each URL also has an associated last_scraped timestamp field
  • a script scrape_for_sorries <repo_url> <timestamp> which
    • clones the repo
    • looks at every commit since timestamp
    • finds added lines containing sorry (don't even worry for now whether it is a real sorry or a comment!)
    • adds these to observed_sorries.json
    • when finished, update the last_scraped timestamp in known_lean_repos.json
  • now the tricky part, a script elaborate_sorry <git URL> <line number>
    • I propose that this is implemented as much as possible as new functionality in the REPL / Pantograph / your favourite tool
    • clones the repo if not already present
    • checks out the commit
    • runs lake exe cache get and/or lake build as appropriate
    • elaborates the file, identifies which declaration contains the relevant line containing the sorry, and attempts to report the goal state at the sorry
    • result: add an entry to elaborated_sorries.json (or possibly this can be combined with observed_sorries.json?) with a status, either: "could not find repo", "could not find commit", "could not find sorry", "could not elaborate declaration at line number", or a successful pretty printed goal state
  • and then we probably want indexed_sorries.json, which is a dictionary from pretty printed goal states to URLs where that sorry was observed in the wild (but this can be regenerated at any time from the other files).
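
For concreteness, the scrape_for_sorries step could look roughly like this (untested sketch; helper names and JSON fields are invented, nothing here distinguishes real sorries from comments, and a real version would also update last_scraped in known_lean_repos.json):

# Untested sketch of `scrape_for_sorries <repo_url> <timestamp>`: clone the
# repo, walk the diffs of commits since the timestamp, record every added
# .lean line containing "sorry" with its file and new line number, and
# append the results to observed_sorries.json.
import json
import pathlib
import re
import subprocess
import sys
import time

def scrape_for_sorries(repo_url: str, since: str, workdir: str = "scraped") -> list[dict]:
    dest = pathlib.Path(workdir) / re.sub(r"\W+", "_", repo_url)
    if not dest.exists():
        subprocess.run(["git", "clone", repo_url, str(dest)], check=True)
    log = subprocess.run(
        ["git", "log", f"--since={since}", "-Gsorry", "-p",
         "--format=commit %H", "--", "*.lean"],
        cwd=dest, capture_output=True, text=True, check=True,
    ).stdout
    found, commit, file, lineno = [], None, None, 0
    for line in log.splitlines():
        if line.startswith("commit "):
            commit = line.split()[1]
        elif line.startswith("+++ b/"):
            file = line[len("+++ b/"):]
        elif line.startswith("@@"):
            # hunk header "@@ -a,b +c,d @@ ...": the new text starts at line c
            lineno = int(line.split("+")[1].split(",")[0].split()[0]) - 1
        elif line.startswith("+") and not line.startswith("+++"):
            lineno += 1
            if "sorry" in line:
                found.append({"url": repo_url, "commit": commit, "file": file,
                              "line": lineno, "first_observed": time.time()})
        elif line.startswith(" "):
            lineno += 1  # context lines also advance the new-file line counter
    return found

if __name__ == "__main__":
    new = scrape_for_sorries(sys.argv[1], sys.argv[2])
    path = pathlib.Path("observed_sorries.json")
    existing = json.loads(path.read_text()) if path.exists() else []
    path.write_text(json.dumps(existing + new, indent=2))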

This is completely half-baked --- thinking as I type! Please pick it apart!

This seems very doable with current technology, and I think having this infrastructure up and running would enable lots of interesting things.

Lenny Taelman (Jan 18 2025 at 07:55):

Kim Morrison said:

There's lots missing in that list, in particular FLT and Carleson aren't mentioned, but both have many sorrys.

Just to make clear, this is a list of sorries whose blame date is less than 7 days old, so it might be that no sorries were added to FLT and Carleson in the past few days?

Lenny Taelman (Jan 18 2025 at 08:03):

Reproducibility is crucial indeed, but sounds feasible.

I'm not sure if duplicates would really be an issue for the leaderboard thing, at least if there are not too many (so some very basic filtering might suffice). And if some sorries are not solvable because of missing lemmas, that is fine as well. In general, with an Elo approach as outlined above it really doesn't matter if many problems are unsolvable, as long as there are enough solvable ones.

Of course having a clean completely deduplicated list might be useful for other things as well!

Kevin Buzzard (Jan 18 2025 at 08:45):

Kim Morrison said:

There's lots missing in that list, in particular FLT and Carleson aren't mentioned, but both have many sorrys.

Yes, but I've been running LT25 and working on new LaTeX material on a branch, and having problems bumping, so I bet there was nothing with a blame date of 7 or fewer days old on main.

Jared green (Jan 18 2025 at 15:37):

suppose the person creating the repo realises one of the things they tried to prove was actually invalid, and deletes it without recording the fact anywhere. what would you do then?

Jared green (Jan 18 2025 at 15:40):

even if they did record the fact that it was invalid, how would you know?

Jared green (Jan 18 2025 at 15:41):

and what if they were incorrect in deciding that?

Lenny Taelman (Jan 18 2025 at 16:48):

I don't think I would do anything at all? Surely there will always be sorries that are unprovable simply because they are false. As long as there are enough valid sorries that are within reach of competing algorithms (a big if...), that should not matter. But maybe you are referring to something else?

Jared green (Jan 18 2025 at 17:43):

Having a way to prove some sorries invalid would slow the accumulation of sorries that remain unsolved because they are invalid.

Jared green (Jan 18 2025 at 17:50):

And I mean invalid rather than false, as the latter is undecidable.

Kim Morrison (Jan 18 2025 at 22:05):

I don't think bad sorries are a problem at all. They are great test cases for plausible, and examples for the even harder problem of "solving a sorry by correcting the statement".

Alex Meiburg (Jan 20 2025 at 03:08):

Lenny Taelman said:

To test the assumption that there will always be enough "new" open sorries around, I scraped GitHub for Lean sorries whose blame date is at most 7 days old.

I love that you scraped this. But it's also interesting to see how it breaks down.

  • It looks like 17 of them are from someone doing the Mathematics in Lean tutorial
  • Another 20 (the Ostrowski) are from someone making a backup of a file when replacing it with a rewrite; so those 20 sorries are not actually new
  • Another 2 are from a pared-down MIL tutorial (the Demo/session1), and 1 more is autogenerated code - that I don't think currently wants that sorry filled, although it might be interesting as training data.

That leaves 23 on the list. Of those, 12 are from my(!) repo about quantum information, and I can confirm that those are new and mostly-provable sorries. But like you said, the variation is huge: we had about three months where none of us put anything in that repo. And other very active repos, like FLT, just didn't have anything added this week.

But, this would be an interesting number to track; some sort of time-averaged number of sorries that "look" new across time.

Eric Taucher (Jan 20 2025 at 12:02):

Alex Meiburg said:

sorries that "look" new across time.

Should it not be possible to look at the context of the sorry and convert it into a minimal reproducible example and then keep those in a database?

I seem to recall Jason noting this, but I did not see it in the slides; perhaps it was said in the presentation. (The video is down for editing, so I could not check.)

Eric Taucher (Jan 20 2025 at 12:24):

Another piece of information that should be recorded in such a database when an AI model has reasoning ability is the reasoning/thought process.

While these reasoning models are not perfect, they often have seen more patterns than most people, and will sometimes note something in the thought process that might trigger a new way to solve the problem even though the model failed at the task. So while the solution generated in trying to solve the sorry would fail Lean's check as an acceptable proof, the thought process should not be discarded just because the AI failed to find a solution.

Alex Meiburg (Jan 20 2025 at 16:44):

Eric Taucher said:

Should it not be possible to look at the context of the sorry and convert it into a minimal reproducible example and then keep those in a database?

It absolutely should be, in theory. As Kim pointed out, there are subtleties where the context of "what other lemmas are proved already in the repo" (or even "what version of Lean/Mathlib are you building on") effectively changes the proof environment. Or what about an identical proof statement and identical theorems in context, but a different open Classical or similar earlier in the file? Serializing the whole environment is doable in practice but is going to be a very fine-grained partition. What is or isn't necessary for a faithful MWE is a bit ambiguous. (At a minimum, you need to recursively include any defs and theorems referenced - but that is a very bare minimum.)

If I were to design a database system for tracking this ... I think something like "the text representation of the theorem statement + the goal state at the location of the sorry" is probably a 99% accurate indicator of morally equivalent sorries. Then you could have several unique identifiers (repo URL + git commit + filename + line number) associated to that, representing slight variations in environment for what is essentially the same problem.

Kim Morrison (Jan 21 2025 at 01:05):

Yes, I don't think we should attempt at this point to serialize the environment at sorries in an entirely self-contained way. The "serialization" will need to consist of a git repo URL, a commit sha, and then rely on lake to build dependencies. That part can all be locally cached.

Kim Morrison (Jan 21 2025 at 01:05):

But we need simple standalone tools to do this.

Kim Morrison (Jan 21 2025 at 01:06):

The REPL has functionality for processing a local file, but this should be extended to something that can load a git URL/commit sha/filename triple.
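
Roughly, I mean a wrapper like this (assuming the REPL's JSON file mode, i.e. sending {"path": ...} on stdin, and running the built REPL binary inside the project's environment with lake env; both details may need adjusting):

# Sketch of "load a sorry from a (git URL, commit sha, filename) triple":
# clone and build the project, then ask the REPL for the sorries (with goals)
# in that file. The JSON field names follow my reading of the REPL's file
# mode and may need adjusting.
import json
import pathlib
import subprocess

def sorries_in_file(repo_url: str, commit: str, file: str, repl_bin: str) -> list[dict]:
    repo = pathlib.Path("work") / commit
    if not repo.exists():
        subprocess.run(["git", "clone", repo_url, str(repo)], check=True)
        subprocess.run(["git", "checkout", commit], cwd=repo, check=True)
        subprocess.run(["lake", "exe", "cache", "get"], cwd=repo)  # harmless if the project has no cache
        subprocess.run(["lake", "build"], cwd=repo, check=True)
    request = json.dumps({"path": file, "allTactics": False}) + "\n\n"
    out = subprocess.run(["lake", "env", repl_bin], input=request, cwd=repo,
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out).get("sorries", [])  # each entry should carry the goal and position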

Lenny Taelman (Jan 21 2025 at 12:05):

I've now cleaned up and improved the code, see https://github.com/LennyTaelman/SorryScraper. It also finds many more sorries. Attached is the output of a new run, with 51 sorries that are less than 1 day old (based on the git blame date).
sorries.txt.

Lenny Taelman (Jan 21 2025 at 12:29):

This actually looks like a much more interesting sample of sorries. It contains recent sorries by @Xavier Roblot, @Floris van Doorn, @Daniel Soukup, @Joël Riou, @Michael Rothgang, @Daniel Morrison, and a couple of anonymous contributors. If any of those are willing to comment on those sorries (are they new, or just copied from older code? how hard do you expect them to be? any other comments?), that would be very much appreciated!

Kim Morrison (Jan 21 2025 at 12:30):

I'm just going to paste in the contents of sorries.txt to make them clickable:

https://github.com/fpvandoorn/carleson/blob/1d0d61deb511bf80f43f0c8f71da274b0c8659f8/Carleson/HolderVanDerCorput.lean#L97
https://github.com/fpvandoorn/carleson/blob/1d0d61deb511bf80f43f0c8f71da274b0c8659f8/Carleson/HolderVanDerCorput.lean#L105
https://github.com/fpvandoorn/carleson/blob/1d0d61deb511bf80f43f0c8f71da274b0c8659f8/Carleson/HolderVanDerCorput.lean#L112
https://github.com/fpvandoorn/carleson/blob/1d0d61deb511bf80f43f0c8f71da274b0c8659f8/Carleson/HolderVanDerCorput.lean#L171
https://github.com/fpvandoorn/carleson/blob/451f4568375c1be265a0beb82adb601f9a3ee74d/Carleson/HardyLittlewood.lean#L384
https://github.com/hargoniX/fil-rb/blob/495b941eb72c0200039f8750f41d067de617e3a3/Filrb/Internal/Model.lean#L189
https://github.com/hargoniX/fil-rb/blob/495b941eb72c0200039f8750f41d067de617e3a3/Filrb/Internal/Model.lean#L195
https://github.com/hargoniX/fil-rb/blob/495b941eb72c0200039f8750f41d067de617e3a3/Filrb/Internal/Model.lean#L198
https://github.com/hargoniX/fil-rb/blob/495b941eb72c0200039f8750f41d067de617e3a3/Filrb/Internal/Model.lean#L236
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/FundamentalGroupoid.lean#L18
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/FundamentalGroupoid.lean#L20
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/FundamentalGroupoid.lean#L22
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/FundamentalGroupoid.lean#L90
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/FundamentalGroupoid.lean#L96
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/Homotopy.lean#L20
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/Homotopy.lean#L69
https://github.com/joelriou/topcat-model-category/blob/05b734528674155c6c54f2c49d2867762f5f4765/TopCatModelCategory/SSet/Subcomplex.lean#L499
https://github.com/leanprover-community/mathlib4/blob/cd410875f97a5e8ffd93b683739483202ad99651/Mathlib/Geometry/Manifold/HasNiceBoundary.lean#L154
https://github.com/leanprover-community/mathlib4/blob/7c2926281641c28dc9181cac4f3738d4bb2a4e0f/Mathlib/Algebra/Azumaya/Basic.lean#L321
https://github.com/leanprover-community/mathlib4/blob/7c2926281641c28dc9181cac4f3738d4bb2a4e0f/Mathlib/Algebra/Azumaya/Basic.lean#L327
https://github.com/leanprover-community/mathlib4/blob/237526a5ca997d8087fc2e48476189b9ba2b59c2/Mathlib/LinearAlgebra/ExteriorPower/HodgeStar.lean#L71
https://github.com/leanprover-community/mathlib4/blob/237526a5ca997d8087fc2e48476189b9ba2b59c2/Mathlib/LinearAlgebra/ExteriorPower/WedgeProduct.lean#L38
https://github.com/leanprover-community/mathlib4/blob/237526a5ca997d8087fc2e48476189b9ba2b59c2/Mathlib/LinearAlgebra/ExteriorPower/WedgeProduct.lean#L44
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L315
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L316
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L340
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L343
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L345
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L346
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L347
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L368
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L399
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L400
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L424
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L427
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L429
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L430
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L431
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/PolarCoord.lean#L252
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/PolarCoord.lean#L270
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/Sandbox.lean#L233
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/Sandbox.lean#L257
https://github.com/m4lvin/lean4-pdl/blob/fb9f64ec5fc547d303efb4a9316eefe0118f7831/Pdl/Distance.lean#L566
https://github.com/siddhartha-gadgil/LeanAide/blob/cd35740f0fd5b4e20f7f99ba8db190e88d5a6a2a/CodeGen/from_statement_12669426150657371507.lean#L116
https://github.com/siddhartha-gadgil/LeanAide/blob/cd35740f0fd5b4e20f7f99ba8db190e88d5a6a2a/CodeGen/from_statement_12669426150657371507.lean#L161
https://github.com/siddhartha-gadgil/LeanAide/blob/c5b492594bec960b38c8aaab513dd9d6fe9ce50e/CodeGen/from_statement_13190426273245751.lean#L111
https://github.com/urkud/DeRhamCohomology/blob/36294892d98704ed045ed43c086b39ee8cd43247/DeRhamCohomology/ContinuousAlternatingMap/Wedge.lean#L77
https://github.com/urkud/DeRhamCohomology/blob/36294892d98704ed045ed43c086b39ee8cd43247/DeRhamCohomology/ContinuousAlternatingMap/Wedge.lean#L86
https://github.com/urkud/DeRhamCohomology/blob/36294892d98704ed045ed43c086b39ee8cd43247/DeRhamCohomology/ContinuousAlternatingMap/Wedge.lean#L90
https://github.com/urkud/DeRhamCohomology/blob/36294892d98704ed045ed43c086b39ee8cd43247/DeRhamCohomology/ContinuousAlternatingMap/Wedge.lean#L98
https://github.com/urkud/DeRhamCohomology/blob/36294892d98704ed045ed43c086b39ee8cd43247/DeRhamCohomology/ContinuousAlternatingMap/Wedge.lean#L101

Kim Morrison (Jan 21 2025 at 12:32):

The first 3 Carleson ones look moderately easy, easy, and easy.

Kim Morrison (Jan 21 2025 at 12:33):

The first three topcat-model-category sorries are definitions where one needs to understand the intent of the declaration name --- not just any inhabitant will do!

Kim Morrison (Jan 21 2025 at 12:34):

Most of the DeRhamCohomology sorries look easy to moderate.

Lenny Taelman (Jan 21 2025 at 12:34):

Kim Morrison said:

The first three topcat-model-category sorries are definitions where one needs to understand the intent of the declaration name --- not just any inhabitant will do!

Ouch, good point! Those should be out for any kind of theorem-proving competition. That should be part of some actual Lean validation process...

Kim Morrison (Jan 21 2025 at 12:35):

I guess when we build the next step, which finds the goal at each of these sorries, we will not only need to report the goal, but also the type of the goal (i.e. is it in Prop or Type u).

Kim Morrison (Jan 21 2025 at 12:36):

The LeanAide examples are fake sorries --- they are comments, not code!

Lenny Taelman (Jan 21 2025 at 12:36):

Kim Morrison said:

The LeanAide examples are fake sorries --- they are comments, not code!

At least that should be easy to filter out ;-)

Xavier Roblot (Jan 21 2025 at 12:39):

Most of the sorries from my code are just there because I did not have the time to fill them in, or I was not sure what the statement should be or whether I would need it; they are not really interesting.

If you want a sorry for which I would be glad to get some help from AI because I don't see a clean and quick way to do it, you can have a look at this one : https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/Sandbox.lean#L257

Lenny Taelman (Jan 21 2025 at 12:40):

Xavier Roblot said:

Most of the sorries from my code are just there because I did not have the time to fill them in, or I was not sure what the statement should be or whether I would need it; they are not really interesting.

Do you expect some of them to be true and provable?

Kim Morrison (Jan 21 2025 at 12:40):

"not really interesting" doesn't mean that they wouldn't be interesting for a AI benchmark,

Xavier Roblot (Jan 21 2025 at 12:41):

Most of them are missing definitions or incomplete statements, I think. Let me have another look.

Kim Morrison (Jan 21 2025 at 12:41):

The question is not "is this hard and you would be interested in knowing the answer", but rather "is this easy and you wish an AI would grind it out for you so you don't need to bother".

Lenny Taelman (Jan 21 2025 at 12:41):

Or even: is it easy and doable, so it can help us "rate" AI systems, and motivate developers. I think a 'good' sorry for this purpose is any sorry that is easy enough that it might be fillable by some AI system, but hard enough that it won't be filled by all of them.

Xavier Roblot (Jan 21 2025 at 12:42):

Kim Morrison said:

The question is not "is this hard and you would be interested in knowing the answer", but rather "is this easy and you wish an AI would grind it out for you so you don't need to bother".

Well, I think that is the case in the example I sent above. I know how to prove it but I don't want to do it because it would be tedious. Although that example might be a bit too difficult.

Kim Morrison (Jan 21 2025 at 12:44):

Great. I think this is more than enough to go on --- Lenny's list has a non-trivial fraction of relevant-to-AI training sorries, and so we should be doing this scraping regularly, recording the results, and building the tooling to analyse these sorries!

Joël Riou (Jan 21 2025 at 12:57):

Lenny Taelman said:

This actually looks like a much more interesting sample of sorries. It contains recent sorries by Xavier Roblot, Floris van Doorn, Daniel Soukup, Joël Riou, Michael Rothgang, Daniel Morrison, and a couple of anonymous contributors. If any of those are willing to comment on those sorries (are they new, or just copied from older code? how hard do you expect them to be? any other comments?), that would be very much appreciated!

Regarding mine, this is very much work in progress. Some sorries are relatively easy, some are more difficult, and require specific mathematical arguments, which I would not expect AI to solve.

Xavier Roblot (Jan 21 2025 at 13:22):

The following statements are true and provable (with various difficulty):
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L315
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L316
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L340
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L343
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L345
https://github.com/leanprover-community/mathlib4/blob/c853daa166b9dd80ed9735c71db69a20561a613f/Mathlib/NumberTheory/NumberField/CanonicalEmbedding/NormLessThanOne.lean#L346

Jason Rute (Jan 21 2025 at 13:38):

Joël Riou said:

Regarding mine, this is very much work in progress. Some sorries are relatively easy, some are more difficult, and require specific mathematical arguments, which I would not expect AI to solve.

That is the best mix!

Floris van Doorn (Jan 21 2025 at 14:02):

Kim Morrison said:

The first 3 Carleson ones look moderately easy, easy, and easy.

In fact, the second one is false since it's missing I \ne \top :-)
I think this will be common: many sorries will be false because a hypothesis is missing.

Kim Morrison (Jan 21 2025 at 14:07):

I guess you can handle this situation by "raising the stakes" --- an AI agent is allowed to suggest proofs that use extra hypotheses, and if the human accepts it, the AI gets a bigger reward than usual.

Another possibility when considering adding a hypothesis is to ask an agent to demonstrate that the additional hypothesis is available at any existing call sites (if there are any).

Lenny Taelman (Jan 21 2025 at 14:08):

Yet another possibility is to reward disproving sorries.

Kim Morrison (Jan 21 2025 at 14:20):

I'm really hoping that by the end of this year, as soon as you type := after a theorem statement a counterexample generator that actually works will start running in the background... :fingers_crossed:

Jason Rute (Jan 21 2025 at 14:21):

Also, models like DeepSeek-Prover have been trained to show theorems are false or trivially true (where the hypotheses are inconsistent). That could be another target.

Johan Commelin (Jan 21 2025 at 16:20):

@Adam Topaz and I have been discussing a weighted_sorry. We could upstream a prototype if people like the concept.
Our thoughts:

  • weighted_sorry n indicates an expectation that circa n lines of Lean code will be needed to fill in this sorry.
  • AIs can use n to choose which sorries to work on
  • AIs can also train to predict n, in order to turn sorry into weighted_sorry n.

Adam Topaz (Jan 21 2025 at 16:23):

The prototype can currently be found in NodeGraph -- the declaration graphs display the total weight of any given node.

Patrick Massot (Jan 21 2025 at 23:08):

This list already reveals an obvious problem. Some of those sorries are in a student project, and it would probably be very bad to ruin this project by proving them. If you really build this tool, then I think you should ask permission from each repository author to play with their sorries in this way. This would be a lot more polite.

Jason Rute (Jan 22 2025 at 00:57):

Sure, it would be good to ask before sending them PRs, but it isn't clear that such a feature wouldn't be helpful to a lot of students. Right now AI systems aren't that good so it isn't like it would do their project for them, and it might help them get unstuck on theorems where they just don't know the syntax (or by telling them there is an obvious contradiction in their theorem statement). Of course, it is probably situation-specific.

Also, it might be good to separate the two situations: (1) the system collects the sorry data (and tries to prove theorems). (2) the system notifies the authors of a solution via a PR to the project. This might be a case where one does the former but not the latter.

Joseph Myers (Jan 22 2025 at 01:48):

Is the weight in weighted_sorry meant to be the total number of lines to add (including building up API / proving other required lemmas) to get a sorry-free proof, or the number of lines to get a proof that is locally sorry-free but may depend on other pre-existing lemmas that use sorry, or just the number of lines expected locally for that specific sorry (possibly adding new lemmas, not counted in the value of n, that also need proving, in addition to existing lemmas that use sorry)?

Lenny Taelman (Jan 22 2025 at 07:21):

Patrick Massot said:

This list already reveals an obvious problem. Some of those sorries are in a student project, and it would probably be very bad to ruin this project by proving them. If you really build this tool, then I think you should ask permission from each repository author to play with their sorries in this way. This would be a lot more polite.

Indeed it is important to realise that not everyone with a public repository is waiting for help (from humans or machines). As for permission, there are many degrees possible, with at the extremes:

  • most restrictive: do not include any sorries in a database without prior consent of repo owner
  • least restrictive: when a sorry is machine-filled and a pull-request is automatically made, include a link to "stop sending me pull requests" (this is probably necessary anyway, as a protection against a bug in the system generating too many PRs)

As for "ruining a project", I would think that this is not so likely to happen. I imagine (perhaps naively) that a student would be thrilled to find out that they have reduced the problem to small enough sub-goals that a machine can fill. (Just as I was thrilled when I discovered that aesop or exact? could fill in sorries that I thought I had to sweat through myself). I would also guess (maybe even more naively) that progress in machine proving is gradual enough that a supervisor would have a reasonable idea as to which student projects would still be interesting enough to pursue.

Johan Commelin (Jan 22 2025 at 07:33):

Joseph Myers said:

Is the weight in weighted_sorry meant to be the total number of lines to add (including building up API / proving other required lemmas) to get a sorry-free proof, or the number of lines to get a proof that is locally sorry-free but may depend on other pre-existing lemmas that use sorry, or just the number of lines expected locally for that specific sorry (possibly adding new lemmas, not counted in the value of n, that also need proving, in addition to existing lemmas that use sorry)?

This is not completely specced out yet. But my mental model is that n represents the number of lines that need to be added to close the sorry, where the proof may depend on other weighted_sorrys.

So the total number of lines to make a given statement sorry-free would be the sum of all the n that are expected to be necessary for the proof.
I guess this means that weighted_sorry should maybe include pointers to sorried results that are expected to be needed in the proof?

One thing is clear: whatever semantics we choose, the value of n will probably have to be updated regularly as work on the proof progresses.

Jared green (Jan 22 2025 at 13:45):

@Lenny Taelman are you thinking it would be opt-in or opt-out?

Lenny Taelman (Jan 22 2025 at 13:50):

I don't think that's for me to decide, but opt-out is certainly much easier to accomplish. There are actually several parallel questions here, and the answer may be different for them:

  1. Can we scrape your sorries and make the resulting database available (e.g. for people who want to do statistics, test their theorem-proving systems, etcetera)
  2. Can we use this database to create a 'leaderboard' to continuously evaluate automated systems and compare their performance (my original intention: create a benchmark/incentive that is more closely aligned with mathematical practice)
  3. Should automated systems (on their own, or through the leaderboard) be allowed to make unsolicited pull requests to fill sorries in your repositories?

I would actually be very interested in hearing what people think about these questions.

Lenny Taelman (Jan 22 2025 at 13:52):

Of course once we make the database, we cannot enforce how others use it. But at least we could provide some guidelines as to what the community considers good behaviour (if there is consensus...)

Jason Rute (Jan 22 2025 at 13:53):

I think 1 should be opt out, 2 maybe opt out, and 3 opt in.

Lenny Taelman (Jan 22 2025 at 13:54):

I also think the answers should not be set in stone, actual experience with such tools may give good reasons to adjust things...

Jason Rute (Jan 22 2025 at 13:56):

Especially for (1): if it is an Apache- or MIT-licensed project, one is effectively saying that others are free to mess around with the code.

Lenny Taelman (Jan 22 2025 at 13:58):

Sure, but I would hope that such a project has some community support, and not just "everyone dislikes it but it is legally allowed".

Jason Rute (Jan 22 2025 at 13:59):

True

Jared green (Jan 22 2025 at 15:41):

Surely releasing something under a permissive license (or even straight into the public domain) is an active choice that would imply openness to such activities, right? Like it wouldn't just be the law's idea to allow it, but the individual's, especially when there is a statement saying 'contributions are welcome/encouraged' in the readme.

Michael Rothgang (Jan 22 2025 at 16:10):

I would think that a generic student project either has no license, or is forked from a repository which might have one (possibly without the student thinking about that). In other words: having a permissive licence does not mean "I'm open to AI proposing pull requests". (It certainly doesn't for me.)

Jared green (Jan 22 2025 at 16:35):

For a forked project, you would find the original, which would have a subset of the same sorries; the license is a choice in the original. One with no license, if it is dormant, should probably be treated like it has a permissive license; otherwise permission should probably be sought.

Michael Rothgang (Jan 22 2025 at 16:36):

I disagree with the "dormant" part. (And otherwise think Jason's distinction 1/2/3 above is good.)

Alex Meiburg (Jan 22 2025 at 17:23):

IANAL, but as far as the law is concerned and I understand it: code released with no explicit license defaults to retaining full copyright, which means others may not distribute copies or make derivative works. Building a system that scrapes code from public repos with no license and "works on it" is legally gray.

What that means in practice is messy, of course. If someone has a repository with no license, and I make a PR offering a bug fix (which includes forking the repo - a 'derivative work'), it would be pretty bizarre to say that I violated copyright law by doing so. But as Lenny said, 'technically legally allowed' is a low bar.

Jared green (Jan 22 2025 at 23:23):

It seems we may need a lawyer's help on this, then.

Johan Commelin (Jan 23 2025 at 04:57):

No, I think we should stay clear of that. When in doubt, ask the repo owner for permission.

Tristan Figueroa-Reid (Jan 28 2025 at 06:10):

Hi! I was just thinking about this a little, and it seems much more helpful to keep this opt-in:

  • With an opt-in system for formal verification projects, there could be better support for reproducing a sorry or a proof_wanted in lean4web for mathlib4 or just gitpod.
  • As far as I know, there is a very finite list of independent, permissively licensed formalization projects, and for many of them the authors are active participants in the Lean 4 community or have at least chatted in the leanprover Zulip a few times, so trying to scale for a problem that doesn't exist yet, and probably won't exist for quite some time, seems unnecessary. Even if Lean 4's popularity were to suddenly explode, having an established opt-in system would encourage people to make their repositories easier to reproduce, which seems beneficial for everyone involved.

Tristan Figueroa-Reid (Jan 28 2025 at 06:12):

I think just starting off with a simple list of known formalization projects and an automated scraper that clones each repository, finds sorrys or proof_wanteds, and has GitHub links and open PRs targeting said sorrys (like @Yaël Dillies's project) would be something that would help new contributors (and eventually AI) make beneficial contributions to the general lean formalization community.

Eric Taucher (Jan 28 2025 at 09:13):

Half-baked idea that might work with regard to selecting sorry code: consider tagging such code with a custom Lean attribute. I have created and used attributes in other functional programming languages, e.g. F#, but not with Lean, so I am not sure if this will work.

The idea is simply that the attribute is added, possibly with arguments, and then the code collecting the sorry items can easily check whether a sorry should be included. Right now it seems the entire repository would be opt-in/out, but with an attribute the selection can be much more fine-grained.

If the code checking for the attribute starts with an empty sorry datastore for that repository, then removing the attribute would mean that the item is no longer included in the sorry datastore, which would be an added benefit.

Anyway, just an idea.

Tristan Figueroa-Reid (Jan 28 2025 at 20:26):

I think leanprover-community/batteries's proof_wanted is the best form of this :+1:

Tristan Figueroa-Reid (Jan 29 2025 at 05:33):

I just wrote a little website for this that auto-updates and comes with raw data: leodog896.github.io/unsolved-lean/

Tristan Figueroa-Reid (Jan 29 2025 at 05:48):

(Note that the proof_wanted and sorry-collecting code is quite hacky - I tried to shortcut it by using a highlighter, because I couldn't figure out how to parse a Lean 4 file without evaluating every file, but it turns out the highlighter I used doesn't work with -- comments anyway.)

Adam Kurkiewicz (Feb 03 2025 at 12:09):

@Lenny Taelman I contributed solutions for two Olympiad problems to Compfiles:

https://github.com/dwrensha/compfiles

While I am not the maintainer (for more definitive answers you'd have to ask the maintainer, @David Renshaw), here is my understanding of how this particular project works:

1) My expectation is that the maintainer would welcome pull requests that solve most of the sorries in the repo, even very old ones. For example, I expect David would be happy to see a formal solution to this 2015 problem, which according to git blame has had a sorry for ~2 years already:

https://github.com/dwrensha/compfiles/blob/main/Compfiles/Imo2015P2.lean#L32

In fact, one of my contributions has been to finish a proof that was already ~ a third of the way there.

I suspect there might be less excitement about a solution that is poorly readable or compiles very slowly. But if an AI system produces a human-quality solution without any kind of cheating, then I expect David would be happy to accept a contribution. Either way, I think old sorries are good benchmark material.

2) There is a very good dashboard which tells you a lot more about each problem, e.g. we can look at all the IMO problems here:

https://dwrensha.github.io/compfiles/imo.html

If it's green it means there is a full solution -- no sorries for you :). If it's yellow the statement of the problem is formalised, but not the solution -- there is at least one sorry! If it's grey, nothing is formalised yet.

3) If you click on a specific problem, e.g. this 2015 problem I linked to earlier, you will often get a link to an informal solution, in this case you need to click on "Art of Problem Solving" or "Evan Chen" links:

https://dwrensha.github.io/compfiles/problems/Compfiles.Imo2015P2.html

This is interesting, because it means that usually there is already an informal solution out there, and all that the AI needs to do is formalise that solution :).

4) The quality of the continuous integration (how fast things compile, how up-to-date everything is) is very high. It took me very little time to get set up with the project just by following the instructions on GitHub, and the dashboard etc. are all built automatically by the CI.

Overall, I think the project would be perfect for what you're trying to do; there should be over 50 high-quality sorries there, and if you can bundle the informal solutions with the formal problem statements you might get a high-quality autoformalisation benchmark out of this.

Lenny Taelman (Feb 03 2025 at 13:58):

This looks like a wonderful and mature project, very polished indeed. I think that auto-proving and auto-formalizing these problems would be a good test for formal math models.

It is however a bit orthogonal to what I had in mind:

  • I wanted to focus on research math (for lack of a better term) as opposed to olympiad math, as there is already quite some work towards the latter. I'd rather interpret research math broadly, and I don't think it is necessarily harder than olympiad math (looking up the right lemma to finish a proof is also research math; in fact a big part of what I do on a daily basis...). It is however different from olympiad math, and in many ways far more messy and heterogeneous.
  • The aim would be to stimulate the development of models towards something that may be useful in daily research math. A tool that is able to solve existing sorries in ongoing formalization projects is in my opinion by definition a tool that would be useful.
  • The idea would be that there may be some continuously evolving leaderboard, incentivising people to produce models that do well on things we care about.
  • Once there is a model that actually scores high enough, one can try to think about making it into a tool that is actually usable in practice...

I admit, all the above is just vain dreaming, and it may be very naive. The tests discussed above confirm that there is a big enough continuous supply of new sorries of varying level for this to be possible, at least in principle.

The biggest obstacle of course is to convince people to take part. The first step would be to establish that some existing model out there obtains a non-zero score.

So let me ask this here: any suggestions as to which existing model/system might be worth hooking up to see if it can establish a non-zero base line score?

Adam Kurkiewicz (Feb 03 2025 at 15:17):

I haven't tried it yet, but I'd like to try Aider:

https://aider.chat/

And then set it up to use maybe o3-mini (high) for architect and Claude Sonnet 3.5 for editor. But unfortunately most of those off-the-shelf systems are not amazing at Lean 4 (they still confuse syntax with Lean 3, hallucinate lemmas, make outright syntax errors, etc...). Claude Sonnet 3.5 has been the best for me so far, but only for very simple things (not whole sorries). I think getting a non-zero baseline might be hard. And while there are some research-quality tools out there, I haven't really tried installing them yet.

I think creating a benchmark to specifically elicit this kind of non-zero base line performance could be a noble goal in itself, even if the focus isn't pure maths (yet).

Adam Kurkiewicz (Feb 03 2025 at 15:39):

It looks like David enabled Lean Copilot on a separate branch last month in Compfiles, so maybe I will try that too:

https://github.com/dwrensha/compfiles/compare/main...lean-copilot

Although I expect that if it could solve any sorries, they wouldn't make it into the benchmark, because they would already have been solved.

Austin Letson (Feb 23 2025 at 11:35):

@Lenny Taelman What is the next step to stand up a v0 of a sorry leaderboard? This is a cool project and I would love to help.

Is it to create a prototype for the elaborate_sorry <git URL> script:

Kim Morrison said:

  • now the tricky part, a script elaborate_sorry <git URL> <line number>
    • I propose that this is implemented as much as possible as new functionality in the REPL / Pantograph / your favourite tool
    • clones the repo if not already present
    • checks out the commit
    • runs lake exe cache get and/or lake build as appropriate
    • elaborates the file, identifies which declaration contains the relevant line containing the sorry, and attempts to report the goal state at the sorry
    • result: add an entry to elaborated_sorries.json (or possibly this can be combined with observed_sorries.json?) with a status, either: "could not find repo", "could not find commit", "could not find sorry", "could not elaborate declaration at line number", or a successful pretty printed goal state

Lenny Taelman (Feb 23 2025 at 14:27):

Exactly, the next step would be to produce a database of reproducible sorries, as opposed to just things that syntactically look like sorries. I have some rough code to do that using the REPL (in src/offline_sorries.py in this repo), but it needs to be cleaned up, and there are compatibility issues between the REPL and some Lean repos that I have not yet figured out. Also: I am not yet convinced that the REPL is the right tool for this; I would also want to have a look at the alternatives (such as Pantograph).

In any case, it looks like this is doable in principle, and it should be not too hard to build a continuously updating database of reproducible sorries.

Once this code is robust enough, I think the next challenge is to build a basic 'client' which takes a database entry as input, reproduces it, and tries to prove it. The first milestone would be to have such a client that gets a non-zero score; then others can try to beat it ;-).
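
To fix ideas, the shape of such a client could be roughly this (the database file name, the entry fields, and the prove() interface are all hypothetical; a real client would drive the REPL or Pantograph rather than patching files):

# Hypothetical shape of a minimal client: take one database entry, reproduce
# the repo at the recorded commit, ask some prover for candidate proof text,
# splice it in place of the sorry, and rebuild. prove() is a placeholder for
# whatever system (LLM, hammer, ...) a team wants to plug in.
import json
import pathlib
import subprocess

def prove(goal: str) -> str | None:
    """Placeholder: return proof text valid at the sorry's position, or None."""
    return None

def attempt(entry: dict) -> bool:
    repo = pathlib.Path("work") / entry["commit"]
    if not repo.exists():
        subprocess.run(["git", "clone", entry["url"], str(repo)], check=True)
        subprocess.run(["git", "checkout", entry["commit"]], cwd=repo, check=True)
    proof = prove(entry["goal"])
    if proof is None:
        return False
    src = repo / entry["file"]
    lines = src.read_text().splitlines()
    lines[entry["line"] - 1] = lines[entry["line"] - 1].replace("sorry", proof, 1)
    src.write_text("\n".join(lines) + "\n")
    build = subprocess.run(["lake", "build", entry["module"]], cwd=repo,
                           capture_output=True, text=True)
    # A real check would also confirm that the "declaration uses 'sorry'" warning is gone.
    return build.returncode == 0

score = sum(attempt(e) for e in json.loads(pathlib.Path("sorries.json").read_text()))
print(f"solved {score} sorries")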

Any help would be very welcome; I'll send you a DM to chat about this. Others are of course also welcome to join!

Austin Letson (Feb 23 2025 at 22:15):

Okay, thanks for the summary of current status. I look forward to the first non-zero score!

I experimented with Pantograph this morning, taking the new_sorries.json output from SorryScraper, downloading/building the Lean repos, and then feeding the sorrys into Pantograph. I quickly ran into an issue with Lean versions: Pantograph requires that the Lean version of the project and of Pantograph be the same, and Pantograph is currently on 4.15.0. This might be a configuration issue on my end, so I created an issue to ask if newer versions are supported.

repl has version tags, which would make this issue easier. Potentially we should include the Lean version with each sorry.

Are the compatibility issues with repl across an entire repo or for specific sorrys?

Kim Morrison (Feb 24 2025 at 03:35):

@Leni Aniva , do you think you can add tags for all compatible toolchains (ideally both stables and release candidates)? And keep doing this going forward? This is going to be critical for adoption.

Kim Morrison (Feb 24 2025 at 03:37):

This is actually something that @Austin Letson may be interested in, given their great work on lean-action: how can we let users sign up (we could maybe even make opt-in the default in default lake templates) for receiving regular reminders (or maybe even automated PRs) about creating tags for new toolchains, as Mathlib and everything upstream does?

Leni Aniva (Feb 24 2025 at 03:38):

Kim Morrison said:

Leni Aniva , do you think you can add tags for all compatible toolchains (ideally both stables and release candidates)? And keep doing this going forward? This is going to be critical for adoption.

Tags? Not branches? I can do this going forward. The reason I suggest branches is that they leave room for future bug fixes. In Pantograph's repository I have the branch fixed/v4.16.0-rc2 for example, and that branch is pinned to v4.16.0-rc2. My current approach is to keep up with the newest Lean version.

Leni Aniva (Feb 24 2025 at 03:39):

There has been another suggestion here to separate the versioning of PyP and P. Exactly why this is done is not very clear to me, but if we go down this route, we would need to keep manifest files in PyP, and each PyP would support multiple P versions. I've been waiting on the solution to this, so I haven't made the tag yet.

Leni Aniva (Feb 24 2025 at 03:52):

Pantograph can currently import/export goal states, and these can be used to communicate a goal across machines, but the exported goals don't have tagging information about their environment, i.e. the Lean version, Mathlib version, etc. If it's necessary I can engineer such a feature into Pantograph. IMO the best way to transport a frozen sorry across machines is not via binary files; these files are anchored to the Lean versions that produced them. The best way would be to encode each goal as a theorem in readable text.

Austin Letson (Feb 24 2025 at 12:43):

@Lenny Taelman I think I was able to fix one of the compatibility issues with repl by allowing the user to specify the required version to pull from the repl repo. In order to support this across the whole pipeline, I think we will need to pull the Lean version when we pull a new sorry.

I created a PR to add support for specifying a repl version when running offline_sorries.py

Austin Letson (Feb 24 2025 at 13:22):

Kim Morrison said:

This is actually something that Austin Letson may be interested in, given their great work on lean-action: how can we let users sign up (we could maybe even make opt-in the default in default lake templates) for receiving regular reminders (or maybe even automated PRs) about creating tags for new toolchains, as Mathlib and everything upstream does?

I think this could be handled by lean-update. We could add a step after lean-update creates the update commit to also add a tag to that commit?

GasStationManager (Feb 24 2025 at 20:29):

@Adam Kurkiewicz mentioned Aider above; I have not played with aider much myself, but was able to set it up to connect to LeanTool, which then allows the LLM to talk to Lean directly in a feedback loop to fix its errors.
E.g. to use the demo proxy server already up at http://www.codeproofarena.com:8800/v1, do the following:

export OPENAI_API_BASE=http://www.codeproofarena.com:8800/v1
export OPENAI_API_KEY=<your api key for your chosen model>

# Prefix the model name with openai/
aider --model openai/sonnet

Anyone with experience using Aider? Have a sorry that you'd like to try? I am interested in hearing your experience on how it works.


Last updated: May 02 2025 at 03:31 UTC