Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: SorryDB project


Lenny Taelman (May 14 2025 at 13:35):

We are happy to share an update on the SorryDB project, which aims to set up a continuously running competition for AI systems trying to prove sorries from real Lean repositories. This is joint work-in-progress with @Austin Letson, @Oliver Dressler, @Auguste Poiroux. Previous discussion can be found:

Goals:

  • Evaluate AI agents on a useful task for the working/formalizing mathematician: filling in Prop-valued sorries in real Lean repositories, under real-world conditions.
  • Minimize the gap between benchmark performance and real world use (e.g. one should be able to run an agent overnight to attempt to fill sorries in a specific repository)

We were much inspired by Jason Rute's talk on The Last Mile, and by miniCTX, LeanAgent and SWE-bench.

We currently maintain a nightly updated index of Prop-valued sorries in public repositories. This is being scaled up to eventually cover all repositories listed at Reservoir.

We have implemented two naive agents, which take as input a list of sorry indices, and attempt to replace the sorries with valid proof strings. The first simply attempts "rfl", and the second asks an LLM for a one-shot whole-proof attempt. A third which uses an LLM to find a proof in tactic-by-tactic mode is planned. These are meant as templates—not competitive entries—designed to make building new agents as easy as possible. See source code and documentation.

The infrastructure is now mature enough that it makes sense to implement a few serious agents—e.g., porting existing theorem-proving models to this setup. Suggestions and contributions are very welcome! We’re happy to discuss what’s involved.

Our next target: a simple sorry server to automatically select recent sorries, serve them to agents via an API, verify solutions, and maintain an ELO-style ranking.

Finally, a few comments on data policy, based on the previous discussion:

  • we will only index repositories from Reservoir (in particular those have an open source licence).
  • owners can opt-out from having their repository indexed (just send us a message)
  • we will never send unsolicited automated pull requests or similar, any such future feature will be opt-in.

For more information, see SorryDB on GitHub, and the docs. Feedback is much appreciated!

Mario Carneiro (May 15 2025 at 01:29):

Hi, have you considered running on my repo (for various values of "my")? You mentioned in your talk that it only handles recent sorries, but the cutoff was something like 2 weeks which I find to be very short and misses out on a lot of easy sorries that are just dormant.

Justin Asher (May 15 2025 at 06:08):

@Lenny Taelman Thanks for the update! This looks great, and I am excited to play around with building bots. I had a few questions:

  • For the data infrastructure, what is the typical runtime for the sorry data scraping and indexing process? What is the typical runtime when starting from scratch versus updating repositories that you have already indexed?
  • How do you handle repositories with differing Lean versions? When I was writing LeanExplore, this became a problem, but I can imagine workarounds in your case.
  • Are you going to limit the ELO sorry leaderboard to specific repositories? Because I can see this being problematic if applied to repositories with lots of improper usage of axioms and partial / unsafe declarations.

Lenny Taelman (May 15 2025 at 06:22):

Mario Carneiro said:

Hi, have you considered running on my repo (for various values of "my")? You mentioned in your talk that it only handles recent sorries, but the cutoff was something like 2 weeks which I find to be very short and misses out on a lot of easy sorries that are just dormant.

Indeed that could also be part of 'scaling up'. We visit any repository that has been updated since we last visited, or if never visited if it has been updated since a certain cutoff date. Having this cutoff date at -\infty would just incur an extra cost for the initial run. The current update runs in (order of magnitude) 1 hour, so this should be feasible. @Austin Letson can give you more precise figures for the cutoff and runtime.

The focus for now has been on establishing that there is a continuous supply of fresh sorries.

Lenny Taelman (May 15 2025 at 08:29):

Justin Asher said:

  • How do you handle repositories with differing Lean versions? When I was writing LeanExplore, this became a problem, but I can imagine workarounds in your case

We record the Lean version as part of the indexing. (And use a matching version of REPL to analyse the sorry). In particular, any 'agent' that will compete will need to be able to deal with varying (and recent) Lean versions. I consider this a fundamental aspect of the 'real-world conditions' alluded to above.

Lenny Taelman (May 15 2025 at 08:32):

Justin Asher said:

  • Are you going to limit the ELO sorry leaderboard to specific repositories? Because I can see this being problematic if applied to repositories with lots of improper usage of axioms and partial / unsafe declarations.

This will always require a bit of monitoring, but so far the vast majority of sorries we see are honest mathematical statements. In any case, since this is meant to be a continuously-running competition, it is easy to declare some sorries 'invalid' after the fact, and remove them retroactively from the computation of the ELO rating.

A little bit of unchecked noise should be ok, as long as the rules are the same for everyone and the weird or irrelevant sorries do not dominate the landscape.

Austin Letson (May 19 2025 at 12:20):

Lenny Taelman said:

Mario Carneiro said:

Hi, have you considered running on my repo (for various values of "my")? You mentioned in your talk that it only handles recent sorries, but the cutoff was something like 2 weeks which I find to be very short and misses out on a lot of easy sorries that are just dormant.

Indeed that could also be part of 'scaling up'. We visit any repository that has been updated since we last visited, or if never visited if it has been updated since a certain cutoff date. Having this cutoff date at -\infty would just incur an extra cost for the initial run. The current update runs in (order of magnitude) 1 hour, so this should be feasible. Austin Letson can give you more precise figures for the cutoff and runtime.

The focus for now has been on establishing that there is a continuous supply of fresh sorries.

Re update runtime: The runtime of the update varies a lot based on which repos have been updated that day. As Lenny said, 1 hour is a good estimate. We are working on creating aggregated stats for runtime, and repo specific processing time.

Re cutoff date: Currently the cut off date is 2025-04-01. I will stamp that this is the cutoff date for any changes to the git repo. If any branch in the git repo has a commit after the cutoff date then we will index all sorries on that branch. For example if there is a lean version bump, then we will index all sorries on the branch with the bump, not just the sorries created after 2025-04-01. Note, when we incorporate more of reservoir into the database, we going to move this date back significantly.

Austin Letson (May 19 2025 at 12:40):

To @Mario Carneiro's point, I think we still want to index sorries from dormant repos. There are few possibilities on how to do this. For example:

  • Extend the cut off date back to the release of Lean4. This would essentially index sorries on every branch of every repository in Reservoir.
  • Keep a relatively recent cut off date, but index the default git branch for every repo in reservoir, even if the default branch wasn't updated before the cutoff.
  • Incorporate a Lean version into the cut off criteria, e.g., index all branches with Lean version >= 4.9. Since we are limited by the versions that REPL supports, this might be inevitable.

I am curious to hear from others who have maintained long running Lean projects and/or dormant Lean projects on what they think is the best approach.

Mario Carneiro (May 19 2025 at 12:43):

if you push back to -infty then it will inevitably actually be a lean version cutoff in the end, which makes sense to me.

Eric Wieser (May 19 2025 at 12:44):

Analogously, I don't think anyone is interested in training language models to write Python 2.2 code

Austin Letson (Sep 17 2025 at 10:06):

We are still accepting applications for the SorryDB Hackathon in Amsterdam, October 7th - 10th, 2025! For more details see the initial announcement.

If you are interested please fill out the application here.


Last updated: Dec 20 2025 at 21:32 UTC