Zulip Chat Archive

Stream: general

Topic: Benchmark formats

Damiano Testa (May 30 2025 at 13:13):

Eric Wieser said:

If someone could write an automated tool to convert all these benchmarks into your proposed format, then perhaps we could agree as a community to shift all of them over to that style

Here is a possible first step:

import Lean.Meta.Tactic.TryThis
import Batteries.Util.ProofWanted

namespace ToProofWanted

open Lean Elab Command Meta.Tactic.TryThis
def appendToDecl (s : Syntax) (orig : Name) (tail : String) : Syntax :=
  s.replaceM (m := Id) fun i =>
    if i.isIdent && orig == i.getId then
      let parts := orig.components
      let last := parts.getLast?.getD default
      let newLast : Name := .str .anonymous (last.toString ++ tail)

      some (mkIdentFrom i <| (parts.dropLast).foldr (· ++ ·) newLast)
    else none

/--
`#to_proof_wanted theorem thm ... := by sorry` converts a theorem to
* a definition `thm_statement` with the signature of `thm` as its type,
* a `proof_wanted thm : thm_statement` command.

It tries to make an effort at reproducing the local variables needed and at preserving
attributes of the original theorem.
-/
elab tk:"#to_proof_wanted" ppLine cmd:command : command => do
  elabCommand cmd
  match cmd with
  | `($decl:declModifiers theorem $declId $s* : $t := by sorry) =>
    let sVars : TSyntaxArray ``Parser.Term.bracketedBinder := s.raw.map (⟨·⟩)
    let defId := ⟨appendToDecl declId declId.raw[0].getId "_statement"⟩
    let toDef ← `(def $defId : Prop := ∀ $sVars*, $t)
    let toProofWanted ←
      `(command| $decl:declModifiers proof_wanted $declId : $(⟨defId.raw[0]⟩))
    let fmtDef := Syntax.prettyPrint toDef
    let proofWanted := Syntax.prettyPrint toProofWanted
    liftTermElabM do addSuggestion tk <|
      s!"{fmtDef.pretty.trim}\n\n{proofWanted.pretty.trim}\n".replace "  " " "
  | _ => logWarning "Not found!"

end ToProofWanted

/--
info: Try this: def A.B.C.D_statement : Prop := ∀ {k : Nat} (hk : 1 ≤ k) , True

@[simp]
 proof_wanted A.B.C.D : A.B.C.D_statement
---
warning: declaration uses 'sorry'
-/
#guard_msgs in
#to_proof_wanted
@[simp]
theorem A.B.C.D {k : Nat} (hk : 1 ≤ k) :
    True := by
  sorry

Damiano Testa (May 30 2025 at 13:14):

If you want to test #to_proof_wanted on a few declarations and seeing how it performs and what it does wrong, it can certainly be improved. Once you think that it works well, the replacement can be automated.

Damiano Testa (May 30 2025 at 13:17):

(By the way, I know that the formatting is not ideal, but it is what the pretty-printer gives. Rather than optimising the formatting right now, I would like to get the syntax to at least parse correctly, before fixing line breaks and whitespace.)

Eric Wieser (May 30 2025 at 13:18):

Arguably that should produce def A.B.C.D_statement : Prop := ∀ {k : Nat} (hk : 1 ≤ k), True

Damiano Testa (May 30 2025 at 13:33):

Eric, I edited the command above: is this what you had in mind?

Notification Bot (May 30 2025 at 13:36):

5 messages were moved here from #general > Discussion: Formal Conjectures by Eric Wieser.

Eric Wieser (May 30 2025 at 13:36):

The variable line is now unused, right?

Damiano Testa (May 30 2025 at 13:37):

Ah, right! Let me exclude it!

EDIT: The version above now does not include variables.

Damiano Testa (May 30 2025 at 13:41):

As for the formatting, it was quicker to outsource to the pretty-printed syntax, but a better solution would likely be to recycle as much as possible the original syntax, by extracting the relevant substrings. I can deal with that later, once everything else is fixed.

Robin Arnez (May 30 2025 at 16:56):

Damiano Testa schrieb:

(By the way, I know that the formatting is not ideal, but it is what the pretty-printer gives. Rather than optimising the formatting right now, I would like to get the syntax to at least parse correctly, before fixing line breaks and whitespace.)

You should almost certainly use PrettyPrinter.ppCommand toDef instead. Syntax.prettyPrint doesn't "pretty-print" in the usual sense but rather basically copy the source text.

Damiano Testa (May 30 2025 at 16:58):

None of the pretty-printers work as well as recycling the user-typed syntax, to be honest, this is why I did not try hard to make it work. I'll play around with substrings, once all the actual bugs have been ironed out!

Junyan Xu (May 31 2025 at 14:50):

proof_wanted currently doesn't play well with variable and the unusedSectionVars linter. For example

import Mathlib.FieldTheory.IsAlgClosed.Basic

variable {F F' : Type*} (E : Type*) [Field F] [Field F'] [Field E] [Algebra F F'] [Algebra F E]
  [Algebra.IsAlgebraic F E] [IsAlgClosed F']

proof_wanted nonempty_algEquiv_or_of_finrank_algClosure_eq_two
    (h : Module.finrank F F' = 2) :
    Nonempty (E ≃ₐ[F] F) ∨ Nonempty (E ≃ₐ[F] F')

Should we set_option linter.unusedSectionVars false automatically for proof_wanted?

Junyan Xu (May 31 2025 at 14:53):

In comparison, sorry works well and I don't know what mechanism prevents triggering the linter in the following ([Algebra.IsAlgebraic F E] [IsAlgClosed F'] are unused but still included in the signature of the sorried theorem):

import Mathlib.FieldTheory.IsAlgClosed.Basic

variable {F F' : Type*} (E : Type*) [Field F] [Field F'] [Field E] [Algebra F F'] [Algebra F E]
  [Algebra.IsAlgebraic F E] [IsAlgClosed F']

theorem nonempty_algEquiv_or_of_finrank_algClosure_eq_two
    (h : Module.finrank F F' = 2) :
    Nonempty (E ≃ₐ[F] F) ∨ Nonempty (E ≃ₐ[F] F') := by
  sorry

Jason Rute (May 31 2025 at 16:32):

One thing I care a lot about with AI benchmarks (but it is also important for other things like blueprints, homework, and proofs-for-hire) is being able to:

clearly state a theorem to be proved
check that the theorem is really proved

Jason Rute (May 31 2025 at 16:32):

Surprisingly, this is much harder to get right in Lean than one realizes. While Lean has good tools to check a theorem and proof that make it into the environment, it doesn't have good ways to check that:

a theorem/proof made it to the environment
it is the same statement as in the original benchmark (or blueprint, homework assignment, etc)

Jason Rute (May 31 2025 at 16:32):

It is well known that Lean can drop theorems due to bugs, and sometimes it does this silently. Also, it isn't hard to change the (term level) statement of a theorem with things like local type class instances, or turn off kernel proof checking with a user option. If we are serious about AI generating (possible many files worth) of Lean proofs, we need to be able to check for correctness well.

Jason Rute (May 31 2025 at 16:32):

I've talked about this in a few places already, notably #lean4 > A better Lean checker. In short, @Mario Carneiro made a good point that we need to come up with a standard workflow, and after that, we can make checking tools (and add them to standard tools like Lean4Lean or lean4checker). There already is a pretty good existing tool in @GasStationManager's https://github.com/GasStationManager/SafeVerify. (@Mario Carneiro looked over the code and thought it was pretty robust since it uses environment replay similar to Lean4Checker.) Currently, SafeVerify works with sorrys if I recall. A relevant question for this thread is if it could be extended to proof_wanted.

Jason Rute (May 31 2025 at 16:32):

Some related questions are if the SorryDB project (by @Lenny Taelman) can work with proof_wanted, and if Lean blueprint's can work with proof_wanted. Overall, it would be great to have a standard here. I think the defacto standard is using sorry, but it has rough edges as this thread motivates.

GasStationManager (May 31 2025 at 23:42):

I'd be happy to look into extending SafeVerify to proof_wanted, if there is community effort in maintaining proof_wanted based benchmarks... or if someone wants to get the ball rolling by contributing to the repo, I'm happy to collaborate :)

GasStationManager (Jun 01 2025 at 19:31):

On the topic of benchmark formats, if we move towards proof_wanted, I would really like to see a variant of it for function implementations, call it def_wanted or impl_wanted, where the problem solver needs to provide the body of a def, satisfying type requirements. (The proof obligation can be packaged into the type requirement, eg. as subtype). This will also be suitable for mathematical problem-solving benchmarks like PutnamBench where an answer needs to be provided together with the proof.

GasStationManager (Jun 01 2025 at 19:37):

As a silly example, suppose we want a function f (x:Nat) : Nat, with precondition Odd x, and postcondition Even (f x). One way to formalize this would be
def f' (x:Nat) (precond: Odd x) : {y:Nat // Even y} :=sorry
If we don't want the sorry, could we have
impl_wanted f' (x:Nat) (precond: Odd x) : {y:Nat // Even y}

Jason Rute (Jun 01 2025 at 21:10):

I don't know whether to post this comment here or #general > Discussion: Formal Conjectures, but what I imagine with benchmarks like the formal conjecture one (and honestly with poorly formalized benchmarks like most of the ones out there :slight_smile: ) it would also be usefully to easily prove other results about target "theorems":

A conjecture is false
A conjecture follows from another conjecture (or implies another conjecture)
A hypothesis are not vacuous (often a helpful check that something is formalized correctly)

(1) and (2) could be easily accomplished by @Kim Morrison's proposal to use def MyConjecture_statement := ... or using @Eric Wieser's trick typeof% MyConjecture. Also, many of the conjectures in of Formal Conjectures project are of the form ... <-> answer(sorry), which I think is supposed to make it easy to do (1), but I'm less certain if that is as useful as it seems. And, of course, one could manipulate the conjecture expression itself, at either the text level (by hand, by code, or using an LLM) or at the metaprogramming/expression level. Are there already good automated tools for this last approach?

Lenny Taelman (Jun 02 2025 at 08:10):

Jason Rute said:

Some related questions are if the SorryDB project (by Lenny Taelman) can work with proof_wanted, and if Lean blueprint's can work with proof_wanted. Overall, it would be great to have a standard here. I think the defacto standard is using sorry, but it has rough edges as this thread motivates.

Since we don't write our own benchmark, but just index statements (small or large) whose proof has been postponed (for a brief or longer time), the SorryDB project is completely dependent on what practitioners use in their formalisation projects. It also means that it will never be 100% fail-proof. However, we design this to be a centrally run competition, so that we can control the narrative and if necessary we can retroactively invalidate proofs.

Jason Rute (Jun 02 2025 at 10:59):

@Lenny Taelman does your benchmark already index proof_wanted statements?

Lenny Taelman (Jun 02 2025 at 13:29):

Jason Rute said:

@Lenny Taelman does your benchmark already index proof_wanted statements?

At the moment it does not. I did a small experiment a while back and noted that the number of hits was essentially zero. It could certainly be added, and it would in some sense be easier since if I understand things correctly Lean already verifies that the type of a proof_wanted declaration is already Prop-valued (whereas with sorries we need to do some work to infer the type and index only the ones that are Prop-valued).

Kevin Buzzard (Jun 03 2025 at 09:10):

I am tiring of the ever-growing output of compiling FLT just being pages and pages of warnings about sorries, and I don't see any appetite from the FRO to add a "ignore sorries when compiling" switch to lake, so I am currently strongly considering moving away from sorry and was thinking about switching to proof_wanted but I've not done anything about this yet because I have Big Proof to worry about next week.

Mario Carneiro (Jun 03 2025 at 09:11):

you can always use axiom mySorry {p} : p instead

Mario Carneiro (Jun 03 2025 at 09:12):

proof_wanted is not an adequate replacement most likely; it doesn't actually add the theorem to the environment so you can't refer to it in subsequent theorems and definitions

Kevin Buzzard (Jun 03 2025 at 09:23):

Mario Carneiro said:

you can always use axiom mySorry {p} : p instead

Yes. The reason I mention it here is that it will break Lenny's system whatever I use if I switch from sorry, but the pages of output from lake are pushing me that way.

Kevin Buzzard (Jun 03 2025 at 09:38):

Actually @Lenny Taelman does this suggestion break your SorryDB?

Lenny Taelman (Jun 03 2025 at 10:52):

Kevin Buzzard said:

Actually Lenny Taelman does this suggestion break your SorryDB?

Yes using a custom sorry axiom would break SorryDB, so I'd hope for some way to suppress sorry warnings instead!

More generally this is a weakness of the SorryDB idea: we rely on formalization projects tagging their "postponed" proofs in a constant way (both in time and between projects). So far this has been largely the case, but this may of course break down at some point...

Lenny Taelman (Jun 03 2025 at 10:54):

Kevin Buzzard said:

don't see any appetite from the FRO to add a "ignore sorries when compiling" switch to lake

Now I'm curious: Is this a statement about appetite to implementing such a thing, or appetite to accepting a third-party PR which implements it?

Mario Carneiro (Jun 03 2025 at 10:54):

see #lean4 > Silent sorry @ 💬 :)

Kevin Buzzard (Jun 03 2025 at 15:09):

I specifically mentioned in lean4#8611 that it would be good if the community had a standard for saying "I'm not going to prove this theorem" (as opposed to the obvious potential proliferation with the multiple solutions suggested in the linked thread)

Mario Carneiro (Jun 03 2025 at 15:12):

IIRC formal abstracts had a more refined characterization of unproved claims. I can think of at least the following categories:

I plan to get around to this later
I'm not planning on proving this but I think it's true and I welcome someone else doing it for me
This is technically false but assumed for pragmatic reasons
This is in the literature somewhere
This is in the literature here -> [...]
This is an open problem

Kevin Buzzard (Jun 03 2025 at 20:59):

FLT has all of these and they're all sorry -- FLT#xyz

Jason Rute (Jun 04 2025 at 01:43):

To add to Mario’s list, two more common types of sorries are:

this is an exercise (e.g. in a tutorial or homework assignment)
this is an AI benchmark theorem
this is a definition value that needs to be filled in to complete a theorem as part of a benchmark (or homework)

Last updated: Dec 20 2025 at 21:32 UTC