Zulip Chat Archive

Stream: general

Topic: Code with Proofs: the Arena

GasStationManager (Nov 14 2024 at 23:23):

I made a web site "Code with Proofs: The Arena", where users can create coding problems with formal specifications (as Lean theorem statements); other users can submit solutions consists of code and proof, and be judged by the Lean proof checker.

The code is open sourced at https://github.com/GasStationManager/CodeProofTheArena, with a demo site up at http://www.codeproofarena.com:8000/

If you are interested in Lean as a general programming language with ability for formal verification, you might enjoy the practice! Right now the demo site has some relatively easy problems taken from https://github.com/GasStationManager/CodeProofBenchmark Feel free to create your own challenges!

This is a work in progress. Feature requests are welcome! Or even better, contribute to the project.

The stated goal of the site is to collect and share data, for the training of open source coding AI. See my essay https://gasstationmanager.github.io/ai/2024/11/04/a-proposal.html for more details on the motivation.

Kevin Buzzard (Nov 15 2024 at 11:18):

I don't get it. To submit a solution I have to log in but I don't see how to create an account. Normally there would be a link "Don't have an account? Make one!" on the login page. Sorry to be dense.

Ruben Van de Velde (Nov 15 2024 at 11:19):

I see this at the top:
Screenshot from 2024-11-15 12-19-07.png
and it seems like the "Register" text links to what you're looking for

Kevin Buzzard (Nov 15 2024 at 11:25):

Due to the main page being displayed with a white background and the top line being displayed with a black background (the same colour background as my bookmarks bar etc), I literally didn't even notice that line. Thanks!

Kevin Buzzard (Nov 15 2024 at 11:40):

Both Eric and me are trying to break the system for the last challenge, and currently I'm winning:
http://5.78.72.236:8000/challenges/23/submissions

Kevin Buzzard (Nov 15 2024 at 11:43):

Oh apologies @GasStationManager , the system became unresponsive so I mashed "submit" lots of times to try and spark it into life and now you have lots of correct submissions from me to the first challenge :-(

Eric Wieser (Nov 15 2024 at 11:43):

Kevin Buzzard said:

Both Eric and me are trying to break the system for the last challenge, and currently I'm winning:
http://5.78.72.236:8000/challenges/23/submissions

I got there in the end :)

Kevin Buzzard (Nov 15 2024 at 11:46):

So @GasStationManager the point of these attacks on the last problem is just to show how people can be evil in various ways. There was one of these systems for Coq years ago, where you could actually win bitcoin by proving theorems, and people would find hacks and clean the system out :-(

Edward van de Meent (Nov 15 2024 at 18:11):

a good first start might be to make some sort of check that whatever is given as a solution is a term?

Edward van de Meent (Nov 15 2024 at 18:12):

additionally, it turns out you may make uncompilable challenges, which is a bit annoying...

Bolton Bailey (Nov 15 2024 at 18:17):

Bit of a quibble, but it wasn't clear to me that I had to include a := in my solution at first.

Edward van de Meent (Nov 15 2024 at 18:18):

i'm curious how bad the breakage gets... can someone find a solution to this challenge?

Bolton Bailey (Nov 15 2024 at 18:36):

Another note: It would be nice if the link to the playground came pre-loaded with the code relevant to the challenge.

Kent Van Vels (Nov 15 2024 at 18:40):

Is there a canonical setup (version number, etc. ) for the problems? I have a few proofs that work on my machine that aren't accepted on the server. Thanks.

Quinn (Nov 15 2024 at 18:46):

Outstanding

Eric Wieser (Nov 15 2024 at 19:02):

Edward van de Meent said:

i'm curious how bad the breakage gets... can someone find a solution to this challenge?

Yes

Edward van de Meent (Nov 15 2024 at 19:03):

oooh, cool

Eric Wieser (Nov 15 2024 at 19:04):

Edward van de Meent said:

a good first start might be to make some sort of check that whatever is given as a solution is a term?

I think the premise for how to check these is wrong; this has been thought about before in #Lean for teaching, and I think the approach is to have a lean program that uses withImportModules to read the user code, and then inspects the resulting environment to check the statements are what they are supposed to be.

Bolton Bailey (Nov 15 2024 at 19:31):

Quibbles aside, this is a very cool project! I don't think previous attempts at this like (seemingly now defunct) Math Genome Project have had any mechanism for actually checking proofs. Kudos @GasStationManager !

GasStationManager (Nov 15 2024 at 19:44):

Thanks all for testing it out! I did not know of these cool ways to bypass the proof checker :) Right now the code only checked for sorry warnings, and it compiles the function by itself first to prevent wrapping the theorem in comments. I'll look into @Eric Wieser suggestion of withImportModules, thanks! Would that in theory defend against all known attacks?

Eric Wieser (Nov 15 2024 at 19:45):

If you combined it with an axiom check, then I think so

GasStationManager (Nov 15 2024 at 21:08):

Kent Van Vels said:

Is there a canonical setup (version number, etc. ) for the problems? I have a few proofs that work on my machine that aren't accepted on the server. Thanks.

It is the mathlib4 version I pulled a couple days ago, and the lean-toolchain that it points to. Looking at lake-manifest.json the mathlib4 revision is 4ac7881bba2522f56411a175689efe7039731fb9. And Lean version is
Lean (version 4.14.0-rc2, x86_64-unknown-linux-gnu, commit 7dc1ceb8d431, Release)

GasStationManager (Nov 16 2024 at 00:20):

Eric Wieser said:

If you combined it with an axiom check, then I think so

What do you mean by an axiom check? To check the user has not introduced any new axioms?

Eric Wieser (Nov 16 2024 at 01:25):

After using withImportModules, you should call Lean.collectAxioms on the theorem name you intend to check, and verify only the standard three axioms are used

GasStationManager (Nov 16 2024 at 03:00):

Thanks @Eric Wieser ! Another noob question: would I be on the right track if I do the following: first compile the submission file with lean -o, then start with something like https://github.com/kim-em/lean-training-data/blob/master/scripts/declaration_types.lean ? What's the difference between Lean.withImportModules and Mathlib's CoreM.withImportModules (used by the above script)?

Bolton Bailey (Nov 16 2024 at 20:55):

Trying to install the github project (on mac) I get:

> sudo -u postgres psql

sudo: unknown user postgres
sudo: error initializing audit plugin sudoers_audit

Should I expect this to work on macOS or do I need to be on Linux?

GasStationManager (Nov 16 2024 at 22:09):

Bolton Bailey said:

Trying to install the github project (on mac) I get:
> sudo -u postgres psql

sudo: unknown user postgres
sudo: error initializing audit plugin sudoers_audit
Should I expect this to work on macOS or do I need to be on Linux?

I only tested on Linux (Ubuntu). For this part, assuming the postgresql server is installed and running, the main thing needed is to log into postgresql with whichever is the superuser account for the postgresql installation, and then create the database etc following the rest of the instructions

GasStationManager (Nov 16 2024 at 22:22):

This Stackoverflow suggests that for Mac homebrew installation of postgresql, just doing psql would work

GasStationManager (Nov 16 2024 at 22:28):

@Kent Van Vels You might try submitting again; the site had a bug regarding challenges with two theorems earlier, now fixed

𝚠𝚘𝚓𝚌𝚒𝚎𝚌𝚑 𝚗𝚊𝚠𝚛𝚘𝚌𝚔𝚒 (Nov 16 2024 at 22:41):

Eric Wieser said:

After using withImportModules, you should call Lean.collectAxioms on the theorem name you intend to check, and verify only the standard three axioms are used

Checking the used axioms is still insufficient because you can redefine axioms:

import Batteries.Tactic.OpenPrivate

/- Redefine `propext : False`. -/
open Lean Elab Meta in
open private Environment.mk from Lean.Environment in
#eval modifyEnv (m := MetaM) fun env =>
  let consts :=
    { env.constants with
        map₁ := env.constants.map₁.insert ``propext (.axiomInfo {
          name := ``propext
          type := .const ``False []
          levelParams := []
          isUnsafe := false
        })
    }
  Environment.mk env.const2ModIdx consts env.extensions env.extraConstNames env.header

theorem efsq : ∀ (x y z n : Nat),
    0 < x → 0 < y → 0 < z → 2 < n →
    x^n + y^n ≠ z^n := by
  exfalso
  exact propext

-- 'efsq' depends on axioms: [propext]
#print axioms efsq

GasStationManager (Nov 16 2024 at 23:00):

@Wojciech Nawrocki Nice one! I guess to check for this, we would pre-compute and store the correct types for the axioms, then after withImportModules, check whether any of the axioms has been redefined..

Kim Morrison (Nov 16 2024 at 23:05):

I would suggest using Environment.replay to protect against poisoned environments created via the constructor. (This is why we wrote replay.)

GasStationManager (Nov 16 2024 at 23:21):

Thanks @Kim Morrison ! That's very cool, I'll look into it. Quick question: when I google this I got both Lean.Replay and the repo lean4checker. Should I be looking at both?

Kim Morrison (Nov 16 2024 at 23:23):

lean4checker is using replay. It's essentially just a command line frontend around it, so it depends on your exact use case whether it's useful. Likely you just want to use it as an example.

Kim Morrison (Nov 16 2024 at 23:25):

Actually, @Wojciech Nawrocki's example replacing an axiom probably should have a test case in the lean4checker repository, if only for the sake of documenting the variety of "attacks" it protects against.

Kim Morrison (Nov 16 2024 at 23:26):

I think it's different from what we have there at this point.

Bolton Bailey (Nov 16 2024 at 23:42):

Thanks, the psql command worked with some debugging.

I am now not sure what change to alembic.ini step 5 is telling me to make.

I don't know if this related but when I run the application without this, and try to access the list of challenges, I get

sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "challenges" does not exist
LINE 2: FROM challenges JOIN users ON challenges.owner_id = users.id...

GasStationManager (Nov 17 2024 at 00:03):

Bolton Bailey said:

Thanks, the psql command worked with some debugging.

I am now not sure what change to alembic.ini step 5 is telling me to make.

Around line 64 in the file, it is setting sqlalchemy.url to connect to your local postgresql server. Replace the password there with what you chose in the previous step. Then the alembic command in the next step will connect to the db and create the tables

Bolton Bailey (Nov 17 2024 at 00:20):

Ok I substituted "LocalPass" on line 64 with the password I set in step 4, but I am still getting the same error.

GasStationManager (Nov 17 2024 at 00:24):

@Bolton Bailey Now redo Step 6 :
poetry run alembic upgrade head
This should create the tables

Bolton Bailey (Nov 17 2024 at 00:25):

Ah, yes of course that makes sense, I wasn't reading carefully

Bolton Bailey (Nov 17 2024 at 00:26):

Sweet, no more error, the site works locally for me now

GasStationManager (Nov 17 2024 at 22:14):

A brief update: I have gotten a domain name for the demo site: http://www.codeproofarena.com:8000/ so now you don't have to remember the IP address :)
As suggested by @Bolton Bailey , now the link to the playground from the submission page will load the function and theorem signatures into the playground.

Bolton Bailey (Nov 18 2024 at 00:43):

I would like to be able to put simple open questions on this website. One way I could imagine doing this is by having the user specify either "true" or "false", and then prove the theorem in question is equivalent to the truth value they specified. The trouble is, in the same vein as the "determine" problem for IMO problems, someone might just specify the full proposition they are trying to decide as their value.

Is there some native way in Lean to detect that a term is represented in a sufficiently simple form (i.e. only using a certain specified set of functions or constructors)?

Kim Morrison (Nov 18 2024 at 00:45):

Kim Morrison said:

Actually, Wojciech Nawrocki's example replacing an axiom probably should have a test case in the lean4checker repository, if only for the sake of documenting the variety of "attacks" it protects against.

https://github.com/leanprover/lean4checker/pull/31

GasStationManager (Nov 18 2024 at 03:25):

@Bolton Bailey One approach might be that the user can select to either prove the statement or the negation of the statement. Depending on the selection, the corresponding signature is given. Right now the website doesn't have that yet, but I think it can be very useful. E.g. given a piece of code without proof, determine whether it is correct by proving/disproving.

GasStationManager (Nov 18 2024 at 22:39):

Another question for @Kim Morrison and other experts here: we've been working on a more robust theorem checking script based on everyone's suggestions, here's a work-in-progress prototype: https://github.com/GasStationManager/SafeVerify
It is mostly adapted from lean4checker. It takes two .olean files, run Environment.replay on them, and compare the theorem types; also checks for axioms used. One issues is I would like to check for the noncomputable flag for definitions; for coding tasks this could allow cheating by producing an answer using e.g. Classical.ofNonempty. But it seems noncomputable is not part of the ConstantInfo but is in the environment extension. So Environment.replay is not going to load it into the environment (since it only loads the constants). The information should still be in the olean file / ModuleData, but I'm not knowledgeable enough to extract it. Once it is in the environment I can use isNoncomputable to access it. One thing I could do is to re-load the entire file with withImportModules; would I need to put the file somewhere in the search path?

Kim Morrison (Nov 18 2024 at 22:42):

I'm a bit skeptical that it is useful to look at the noncomputable flag, but maybe. withImportModules sounds like the way to go.

Eric Wieser (Nov 18 2024 at 23:07):

Isn't the flag meaningless, because I could be using implemented_by something_entirely_different?

GasStationManager (Nov 18 2024 at 23:30):

@Eric Wieser interesting; I'll have to understand more how implemented_by works.
But yes the general question is whether noncomputable is the right thing to check, to make sure the function is actually going to compute the answer. I could disallow the use of Classical.choice for the function def (so only propext and Quot.sound allowed); would that be too restrictive?

Bolton Bailey (Nov 18 2024 at 23:32):

Many of the Lean ecosystem's close-to-the-metal functionalities assume choice, so I would suggest trying to allow it. I believe simp for example invokes choice.

Eric Wieser (Nov 18 2024 at 23:40):

What type of cheating are you trying to catch here? Do you have an example in mind?

GasStationManager (Nov 18 2024 at 23:45):

@Eric Wieser The type of question I had in mind is asking for a function to produce an element from a set that is known to be nonempty but perhaps not trivial to compute. E.g. a Nash equilibrium of a game. Could one use Classical.ofNonempty to "generate" an element without computing it?

Eric Wieser (Nov 18 2024 at 23:59):

Would you accept an algorithm that doesn't terminate in the lifetime of the universe?

Eric Wieser (Nov 19 2024 at 00:01):

You're also in trouble here if your "computation" is real-valued, since division and comparison on Reals is noncomputable, so you can't write any algorithm that "compute"s in the sense of noncomputable

GasStationManager (Nov 19 2024 at 00:02):

Or perhaps the solution is to actually try to run the function with a single test input, and Lean would given an error/fail to compile if it is noncomputable

Bolton Bailey (Nov 19 2024 at 00:04):

What would be nice would be some kind of proven guarantee of the performance, sort of like this example, but Lean doesn't really have a way to do that for normal Lean functions yet.

GasStationManager (Nov 19 2024 at 00:05):

@Eric Wieser running time would be an issue but the first step is to ensure there is actually an algorithm provided

Eric Wieser (Nov 19 2024 at 00:05):

GasStationManager said:

Or perhaps the solution is to actually try to run the function with a single test input, and Lean would given an error/fail to compile if it is noncomputable

This doesn't tell you anything at all:

@[implemented_by Nat.zero]
noncomputable def definitely_at_least_two : ℕ :=
  Exists.choose (⟨3, by norm_num⟩ : ∃ x, 2 ≤ x)

theorem definitely_at_least_two_spec : 2 ≤ definitely_at_least_two :=
  Exists.choose_spec _

#eval definitely_at_least_two -- 0

Eric Wieser (Nov 19 2024 at 00:06):

Lean doesn't make any strong guarantees that the compiled output faithfully reflects the model you proved things about. All it guarantees is that the proofs about your ideal model really are sound.

Eric Wieser (Nov 19 2024 at 00:07):

It makes weak guarantees: namely that if the model and compiler diverges in core, that's a bug; and if it the model and compiler diverge in your own code that uses implemented_by or unsafe, then you got what you asked for.

Kim Morrison (Nov 19 2024 at 00:07):

(In particular, if you want to trust the compiled output you need to trust all the implemented_by attributes in your codebase.)

Bolton Bailey (Nov 19 2024 at 00:09):

For the purposes of this project, is there some way to simply guarantee that a particular codebase doesn't use the implemented_by or unsafe constructs?

Eric Wieser (Nov 19 2024 at 00:10):

That wouldn't help with things like the issue with Nat.ble in this thread where the model and compiler diverge in core.

Kim Morrison (Nov 19 2024 at 00:10):

Yes, but it would help.

Eric Wieser (Nov 19 2024 at 00:11):

I guess also building a contest to find inconsistencies in core between the model and compiler is probably good for lean, but it's bad for the problems that you meant to ask users to solve, and it's possibly bad from a security perspective to do that in public.

Kim Morrison (Nov 19 2024 at 00:11):

I think that if you use the environment after using replay, all implemented_by and unsafe have been stripped out? That doesn't actually tell you whether they were present.

Kim Morrison (Nov 19 2024 at 00:19):

Perhaps it makes sense to have an option (perhaps irreversible??) that disables implemented_by and unsafe from that point onwards. I'm not at all sure what this would look like, however.

Kim Morrison (Nov 19 2024 at 00:19):

I was just having a look at unsafe and implemented_by in Mathlib, they are used once or twice in meta-programming tools.

Kim Morrison (Nov 19 2024 at 00:20):

I wouldn't mind trying to move those out of Mathlib, then banning these.

Mario Carneiro (Nov 19 2024 at 00:28):

I've argued before that implemented_by should be treated like an axiom, rather than ofReduceBool. ofReduceBool isn't so much an axiom itself as a license to use another brand of axiom

Mario Carneiro (Nov 19 2024 at 00:29):

as long as it has the current meaning, it's almost useless for actual threat assessment unless you want to check that it is not used at all

Mario Carneiro (Nov 19 2024 at 00:31):

From the raw perspective of "what is the consistency strength of the system with this axiom added", for ofReduceBool the answer is trivially "it's inconsistent" if you don't track implemented_by statements. It's the same thing as if we had a tool which checks whether you have used any axioms (without regard to the number or statement of those axioms), it's pretty much useless unless the answer is "no axioms"

Mario Carneiro (Nov 19 2024 at 00:34):

unfortunately, it is not so easy to tell whether an implemented_by has been used, since the compiler doesn't keep track and the resulting code just has the replaced definition, you can't tell if it was supposed to be something else or if that was actually the definition called

Mario Carneiro (Nov 19 2024 at 00:36):

You can get a pretty good estimation if you simulate the replacement process on lean source expressions - just look for uses of the constant to be replaced, and you find one, mark that implemented_by as used and traverse the replacement's definition

Mario Carneiro (Nov 19 2024 at 00:37):

However if an extern is used then you are truly hosed, because even setting aside the behavior of the FFI code itself you don't know which lean functions it calls so that you can continue the traversal

GasStationManager (Nov 19 2024 at 04:08):

Thanks all for the helpful comments; I think I have a better understanding now. For this project my current plan is to ban implemented_by in submissions (by scanning the text). The unsafe/partial flags are already being checked (in the environment after replay). As for noncomputable functions, looks like the noncomputable flag might not be a reliable indicator; I'll investigate further, possibly other approach like inserting an #eval to see if the function can be run.
What is a simple way to run something like #eval in an environment after replay?

Kim Morrison (Nov 19 2024 at 04:45):

One could easily subvert a text scan for implemented_by.

Kim Morrison (Nov 19 2024 at 04:46):

Here's a blob of code from another project that provides a bare bones "text mode interface" to Lean.

Sorry I don't have time to explain it, but others can hopefully suggest alternatives, or explain what this does. :-)

Daniel Weber (Nov 19 2024 at 04:51):

Just checking for noncomputable isn't sufficient, e.g.

partial def a (h : Nonempty α) : α := a h

def computable : Nonempty α → α := a

Violeta Hernández (Nov 19 2024 at 08:44):

Eric Wieser said:

Would you accept an algorithm that doesn't terminate in the lifetime of the universe?

We actually have something like this in current Mathlib :wink:

import Mathlib.SetTheory.Ordinal.Notation

#eval ONote.fastGrowingε₀ 10 -- this is many orders of magnitude larger than Graham's number

Violeta Hernández (Nov 19 2024 at 09:05):

As for everything in this thread. It doesn't seem good that it's so difficult to figure out if a proof from the formal theorem prover is in fact properly formalized...

GasStationManager (Nov 24 2024 at 23:15):

The code and demo site are now updated to use the new proof checking script based on lean4checker. Thanks all for your suggestions and testing! More stress testing is welcome :)

One relevant change: earlier the system inserted import Mathlib by default. Under the new checking script this results in long wait times for compilation and checking. Now Mathlib is not imported by default. The challenge creator can import narrower libraries in the function signature. You can also include imports in the submissions as well, the system will scan for lines starting with import and put them at the beginning.

GasStationManager (Mar 04 2025 at 21:56):

There're a couple of recent discussion threads on reasoning about computational complexity in Lean.

I have posted a challenge on the Arena demo site: http://www.codeproofarena.com:8000/challenges/48
which asks to implement Fibonacci in linear time, and specifications requiring that the operation in question is called a linear number of times (using a state monad to count).

Can do a similar one for comparison sort.

Question is, can this set up be circumvented? Would one be able to produce a linear-time comparison sort algorithm that passes the proof checker?

Eric Wieser (Mar 04 2025 at 22:18):

It would be painful, but I believe you can circumvent it by defining you function as "if m, instMonad, op are isomorphic to the specific arguments passed in the test, then pull out the cheating construction noncomputably; otherwise use some slow implementation"

Eric Wieser (Mar 04 2025 at 22:18):

It's much less painful if you define the theorem after the state monad, since then you can just talk about equality:

open scoped Classical in
noncomputable def fLinear [Monad m] (op : Nat → Nat → m Nat) (n : Nat) : m Nat :=
  if h : ((⟨m, ‹Monad m›, op⟩ : (m : _) × (_ : Monad m) × (Nat → Nat → m Nat)) = ⟨State Nat, inferInstance, countingOp⟩) then
    pure n.fib
  else
    f op n

but ultimately doesn't affect whether it's possible

Eric Wieser (Mar 04 2025 at 22:20):

This all assumes that noncomputable is allowed; I'm not sure if you can cheat without it

GasStationManager (Mar 05 2025 at 01:05):

Thanks @Eric Wieser ! I think if the only way to attack this is using noncomputable, it is perhaps not so bad, since I'd need to ban noncomputable in submissions anyway. The aim of this framework is to have runnable, computable code, that you can also prove theorems about. None of this works if noncomputable code can sneak in.

I guess what I'm somewhat more worried about are potential ways to do the (slow) computation, but somehow avoid the cost being counted in the final tally / or just somehow create a state monad with a fake count. Would something like that be possible?

Eric Wieser (Mar 05 2025 at 01:29):

I suspect you can cheat noncomputable with implemented_by

Eric Wieser (Mar 05 2025 at 01:35):

The open PR on oracle computations plugs the hole above, I believe

GasStationManager (Mar 05 2025 at 07:42):

Eric Wieser said:

The open PR on oracle computations plugs the hole above, I believe

Which hole are you referring to?
I'm actually very interested in getting a better understanding of the PR (I think you mean this one https://github.com/leanprover-community/mathlib4/pull/20924). You're saying that the Comp monad from the PR is in some sense more robust against cheating attempts than the simple state monad. Which is plausible but I want to understand why.

Last updated: May 02 2025 at 03:31 UTC