Zulip Chat Archive

Stream: IMO-grand-challenge

Topic: general discussion

Miroslav Olšák (Oct 02 2019 at 14:07):

Problems of the "find / determine" type
I have collected the answers of the problems of this type from all the shortlists available on the official IMO website for a general idea how the answers can look like.
http://www.olsak.net/mirek/determine-answers.txt

Miroslav Olšák (Oct 02 2019 at 14:07):

Other domains.
It is well known that IMO problems are of four categories: "Geometry / Algebra / Number Theory / Combinatorics" (a friend of mine once came with a nice comparison to "Imagination / Computation / Knowledge / Thinking" :-) ). The problem statements are relatively monotonous inside a single domain unless there is a combinatorial flavour. The most difficult problems from both formalisation perspective and the problem solving perspective are imho problems from Combinatorial Geometry. I am for example curious about the formalisation of the windmill problem (2011-2, https://www.youtube.com/watch?v=M64HUIJFTZM).

Miroslav Olšák (Oct 02 2019 at 14:08):

Partial points
Some of the IMO problems are actually multiple (two) independent tasks, for example 2016-6 (= C7). Whether to allow partial points in such cases is worth consideration.

Daniel Selsam (Oct 03 2019 at 13:22):

I am for example curious about the formalisation of the windmill problem (2011-2, https://www.youtube.com/watch?v=M64HUIJFTZM).

@Miroslav Olšák For the windmill problem, are you concerned about formalizing the statement, finding a solution, or formalizing a solution?

Daniel Selsam (Oct 03 2019 at 13:27):

Are we interested in formalizing olympiad-like mathematical puzzles not necessarily coming from IMO?

@Miroslav Olšák There are many sources of problems that I think would provide valuable data for IMO, e.g. problems from national Olympiads. It is hard to say without knowing more whether your folklore list is worth the trouble of you translating it to English.

Daniel Selsam (Oct 03 2019 at 13:34):

Partial points
Some of the IMO problems are actually multiple (two) independent tasks, for example 2016-6 (= C7). Whether to allow partial points in such cases is worth consideration.

@Miroslav Olšák I agree. Unfortunately, I don't see any official statement concerning the number of points for solving sub-problems of problems. One option is to say "partial credit will be given for fully formalized sub-problems according to the number of points human judges would have awarded for the same sub-problems". What do you think?

Daniel Selsam (Oct 03 2019 at 13:37):

Because of that, I am now focused mainly on geometry, and I have translated the officialy available shortlists to a semi-formal language (parseable but without detailed semantics and not in any particular therem prover so far). So I could help a bit with this part.

@Miroslav Olšák Nice! Can you share one example to give us a sense of the semi-formal language?

Daniel Selsam (Oct 03 2019 at 13:41):

Problems of the "find / determine" type
I have collected the answers of the problems of this type from all the shortlists available on the official IMO website for a general idea how the answers can look like.
http://atrey.karlin.mff.cuni.cz/~mirecek/determine-answers.txt

@Miroslav Olšák Nice! What do you think about the current plan, of requiring human-assessment of the witnesses and only accepting witnesses that human judges would have accepted? See https://github.com/IMO-grand-challenge/formal-encoding/blob/master/design/determine.lean for more context.

Miroslav Olšák (Oct 03 2019 at 13:42):

Miroslav Olšák For the windmill problem, are you concerned about formalizing the statement, finding a solution, or formalizing a solution?

I am not concerned that formalizing the problem statement, or formalizing the solution would be impossible, I just find it somewhat challenging, so I would like to see it. Of course, if an automated system (not designed for solving this particular problem) could find a solution of it, I would be super-impressed. However, I feel the following issue of the formalization of the problem statement. There are some facts about the problem that are so obvious (but really nontrivial to formally proof) that it is unclear whether they are actually a part of the problem statement, or a part of the solution. I mean for example "given any initial line, there is exactly one windmill process".

Miroslav Olšák (Oct 03 2019 at 13:48):

Miroslav Olšák I agree. Unfortunately, I don't see any official statement concerning the number of points for solving sub-problems of problems. One option is to say "partial credit will be given for fully formalized sub-problems according to the number of points human judges would have awarded for the same sub-problems". What do you think?

It makes sense. Note that although there is no public document for the IMO marking scheme, the judges prepare it and agree on the marking scheme before checking the solutions.

Daniel Selsam (Oct 03 2019 at 14:14):

However, I feel the following issue of the formalization of the problem statement. There are some facts about the problem that are so obvious (but really nontrivial to formally proof) that it is unclear whether they are actually a part of the problem statement, or a part of the solution. I mean for example "given any initial line, there is exactly one windmill process".

@Miroslav Olšák I agree that these are big challenges. One approach might be to have a DSL for geometric processes, and then to encode the windmill process as a program in the DSL.

windmill(S, s, l) =
  while true:
    l \gets rotateUntil(l, s, CLOCK_WISE, fun l => exists s' \in S, s \neq s' /\ on(s', l))
    s \gets choose({s' \in S : s' \neq s /\ on(s', l)})

Here it would be provable that the {s' \in S : ...} set always has exactly one element (since by assumption |S| > 1 and no three points are collinear) and thus the process is deterministic. The key insight required to solve the problem could then be cast as discovering an invariant of the program.

I am not sure what kind of semantics such a hypothetical DSL would warrant, probably operational, and bottoming out into some geometric object parameterized by time. I am also not sure whether we would have already had such abstractions before the windmill year, or whether the abstractions we can build now will be good enough for future problems.

Daniel Selsam (Oct 03 2019 at 14:23):

As far as I know, the computational methods (Wu's method / Gröbner basis / ...) are stronger than any synthetic approach, and I have heard that they are capable of solving at least some of the IMO problems
...
Note that I also have a parser for that, so I can tell the types of the objects, possibly convert it to other format, etc.

@Miroslav Olšák I am very curious which existing problems can be solved by which existing tools. What do you think are the most relevant off-the-shelf tools to try?

Miroslav Olšák (Oct 03 2019 at 14:27):

Miroslav Olšák Nice! What do you think about the current plan, of requiring human-assessment of the witnesses and only accepting witnesses that human judges would have accepted? See https://github.com/IMO-grand-challenge/formal-encoding/blob/master/design/determine.lean for more context.

Well, putting a human into the loop is fine, it just rather postpones the problem than solves it (but it may be a good thing to postpone it). I also like the idea of whitelist of allowed operations for every individual problem where the problems available so far would help us prepare templates for such whitelists. But if a problem requiring something more complex emerged, we could simply modify the whitelist to allow what is necessary without providing much hints (it is actually also putting a human to the loop but to another place).

Kevin Buzzard (Oct 03 2019 at 16:10):

This looks interesting : https://mathoverflow.net/a/337705 . I don't know if it's good enough to solve IMO problems though

Patrick Massot (Oct 03 2019 at 20:21):

Maybe this conversation should move to another Zulip thread. That one is meant to be used by the Zulip AI only. More seriously, it would make it easier to find back this conversation. Miroslav, I think you can to the move by editing your first message in this thread.

Reid Barton (Oct 03 2019 at 20:22):

(Miroslav, Patrick is referring to the topic "stream events" in case it's not clear)

Miroslav Olšák (Oct 03 2019 at 20:22):

Oh, yes, sorry, I am not so familiar with Zulip.

Miroslav Olšák (Oct 03 2019 at 20:24):

Now it is "hmble", any suggestion for a better name? (why does it actually require a name, I just wanted to contribute to the general discussion)

Johan Commelin (Oct 03 2019 at 20:25):

Well, then call it "general discussion"

Daniel Selsam (Oct 03 2019 at 20:30):

I don't know, perhaps we should ask people from the community aroung automated deduction in geometry.

@Miroslav Olšák I am most curious about which (non-synthetic) decision procedures work for which existing problems, e.g. by considering them as nonlinear real arithmetic (NRA) problems.

Joe Hendrix (Oct 04 2019 at 20:40):

(deleted)

Patrick Massot (Oct 04 2019 at 20:51):

Well, then call it "general discussion"

Maybe I'm not reading carefully enough, but I was under the impression there was a clear geometry thread. If this is not specific enough then it means we give up using Zulip threads.

Miroslav Olšák (Oct 04 2019 at 21:21):

Well, then call it "general discussion"

Maybe I'm not reading carefully enough, but I was under the impression there was a clear geometry thread. If this is not specific enough then it means we give up using Zulip threads.

I had several remarks, only one of them is related to geometry (and I don't consider the Windmill problem related to geometry, it is rather combinatorics).

Daniel Selsam (Oct 04 2019 at 21:23):

I had several remarks, only one of them is related to geometry

@Miroslav Olšák FYI the recommended style is to use different topics for each question/comment in a batch.

Miroslav Olšák (Oct 04 2019 at 21:40):

By the way, the comments teaching me how to use Zulip look irrelevant from the general perspective. I suggest using rather private messages next time. By the way, can I delete at least my comments of regarding Zulip?

Patrick Massot (Oct 05 2019 at 09:31):

@Miroslav Olšák I'm very sorry my message may have sounded a bit aggressive. I'm sure Daniel and the whole grand-challenge team is very happy to read your contributions. Experience on this forum suggests things are a bit easier if we somehow try to separate topics, but there are plenty of counterexamples. So please don't let that issue prevent you from contributing.

Brando Miranda (Oct 23 2019 at 19:52):

@Daniel Selsam I am trying to understand the specifications we are trying to pin down in this community. Inspired by a question on the Intermediate Langauge stream referencing HOList and worrying about portability of the competition if things get tied down to lean, is the goal of the project also to re-implement something like HOList? How is it going to be different?

From the comment on that thread/stream it seems that re-implementing HOList would be a pain (I wish I understood why), but I think it would be important to understand the difference and planning things out before going out and re-implementing a complicated system like HOList. What are your thoughts?

Daniel Selsam (Oct 23 2019 at 22:53):

From the comment on that thread/stream it seems that re-implementing HOList would be a pain (I wish I understood why), but I think it would be important to understand the difference and planning things out before going out and re-implementing a complicated system like HOList. What are your thoughts?

It is not hard to interface with ML systems. Lean has a tactic framework, with excellent meta-programming support, and also a foreign function interface.

Jason Rute (Oct 24 2019 at 00:06):

it seems that re-implementing HOList would be a pain (I wish I understood why)

@Brando Miranda I can try to address (narrowly) what I think would need to happen to reimplement HOList in Lean. I know at least one person here is working on it. I don’t know what progress they have made. I’m probably the one who said this a “pain”. I should probably backtrack and say it is doable with a good amount of engineering work, and I hope someone builds it! As I think you have an ML background, I won’t try to cover up the ML terminology.

Jason Rute (Oct 24 2019 at 00:06):

As I see it, the HOList projects have the following parts that would need to be reimplemented:

A list of theorems For both training and testing, one needs a list of theorems (and the full context in some nicely parsable form) to train on.
Proof recording (optional) If one wants to do supervised learning, then one needs a list of proofs as well to train on. These proofs will contain the theorem to prove (with the context) as well as the various tactics which have been applied along with their arguments. This needs to be at some intermediate level which records the name of the tactic and the arguments (so at a higher level than type theory), but probably not at the level of the raw lean code. I’ve heard from some in the Lean community that the tactic environment could be hacked to provide this information, but I don’t know that it has ever been done. HOL Light has some advantages here. It has a simpler tactic framework (I think), it is a larger library (more training data), it is written by one person mostly (so is more uniform), and HOL Light only uses tactics (whereas Lean uses a mixture of tactics and the type theoretic framework). However, the ASTactic (CogGym) and ProverBot9001 projects also used proof recording for Coq.
An interactive environment If one wants to do reinforcement learning and/or tree search, one needs to be able to quickly interact with the system. For tree search, given a particular state, one needs to be able to try possible tactics, see what the results are and back track if needed (using for example beam search). Also, for reinforcement learning, one needs to be able to try out a very large number of scenarios (in this case theorems, either real or synthetic, to prove). This necessitates an even faster back-and-forth between the agent and the system. Google rewrote HOL Light in C++ for this purpose. (The various Coq ML projects don’t use reinforcement learning.)
A system for scoring tactics and tactic arguments Scoring the tactics can be done as a probability distribution over the tactics (computed by a neural network), but scoring the arguments to these tactics can be a bit more tricky because of the large number of possibilities. HOList has one system for doing this. The two Coq projects have another system. I don’t know if either is readily adaptable to Lean.
Access to neural networks and computer power for training and evaluation The agent will have to compute tactic and argument scores via (graph?) neural networks. Therefore, it needs access to TensorFlow or PyTorch and a distributed computing system.

Jason Rute (Oct 24 2019 at 00:06):

Some further comments. One doesn’t need proof recording. Instead one can train solely with theorem statements and reinforcement learning. Conversely, if one doesn’t use reinforcement learning, then one doesn’t need as much speed in the interactive environment. Also, the tree search agent could live inside Lean (as a tactic) making FFI calls to TensorFlow, say. Alternately it could have the agent in Python or C++. Then it would have to guide Lean from the outside. I don’t know which is better.

Brando Miranda (Oct 24 2019 at 01:11):

I know at least one person here is working on it.

Awesome! Do you think its possible to get me in touch with them or their team? thanks!

Brando Miranda (Oct 24 2019 at 01:16):

Google rewrote HOL Light in C++ for this purpose. (The various Coq ML projects don’t use reinforcement learning.)

Are you saying google wrote HOL Light (the entire Theorm prover, idk if that is a lot of work or not but it sounds like it) only so that they could do RL on HOL Light? (trying to repeat it back to you to make sure I got it).

On a very related note, does that mean for someone to re-implement HOList to make LeanList we would need to re-implement Lean in a language that allows for high performance/speed to do RL?

I think the fundamental thing I don't understand is how to do the IMO-grand-challenge without a system like HOList built already. Why wouldn't that need to be a pre-requisite?

Mario Carneiro (Oct 24 2019 at 01:16):

Lean is implemented in C++ already

Mario Carneiro (Oct 24 2019 at 01:17):

and high performance has always been an objective

Brando Miranda (Oct 24 2019 at 01:17):

Lean is implemented in C++ already

So its fast enough to do Reinforcement Learning (RL) on it already? Is that what your saying?

Mario Carneiro (Oct 24 2019 at 01:18):

I have no idea what specifically is required for that, but FFI should be sufficient

Brando Miranda (Oct 24 2019 at 01:18):

I have no idea what specifically is required for that, but FFI should be sufficient

what does FFI mean?

Mario Carneiro (Oct 24 2019 at 01:18):

foreign function interface, i.e. calling functions in other languages

Mario Carneiro (Oct 24 2019 at 01:19):

I am also curious about "Google rewrote HOL Light"

Brando Miranda (Oct 24 2019 at 01:20):

Im curious, for Foreign Function Interface (FFI), is Lean's faster than Coq?

Mario Carneiro (Oct 24 2019 at 01:21):

Lean 3 got an FFI only in the community version, and I haven't used Lean 4's

Mario Carneiro (Oct 24 2019 at 01:22):

I don't see any reason why FFI should be very slow

Reid Barton (Oct 24 2019 at 01:22):

I'm sure Lean 4's will be fast.

Mario Carneiro (Oct 24 2019 at 01:22):

I guess marshaling of large objects might be a performance penalty

Brando Miranda (Oct 24 2019 at 01:23):

Is an Foreign Function Interface (FFI), bi-directional? or is it only powerful from within Lean to say Python? What about the reverse?

Mario Carneiro (Oct 24 2019 at 01:24):

The only way I am aware of for other languages to talk to lean is through the server mode, which uses JSON for message passing

Mario Carneiro (Oct 24 2019 at 01:24):

I guess you could also try literally linking with lean as a library, but I've never seen that done and I have no idea if it's doable

Brando Miranda (Oct 24 2019 at 01:27):

How does SerAPI (https://github.com/ejgallego/coq-serapi) compare to Lean's foreign function interface (ffi)? Or are they totally different?

Reid Barton (Oct 24 2019 at 01:30):

It looks like something different

Bryan Gin-ge Chen (Oct 24 2019 at 01:31):

SerAPI looks more like Lean's server mode, at least from my quick skim of the Github readme. At least all of Lean's editor integration is done via lean --server.

Bryan Gin-ge Chen (Oct 24 2019 at 01:32):

Though as Mario said, the Lean server mode uses JSON and it looks like SerAPI is doing something much more sophisticated.

Reid Barton (Oct 24 2019 at 01:33):

The Lean FFI lets you call C functions from a compiled Lean program. For example handle.close is implemented by lean_io_prim_handle_close (okay, bad example!)

Reid Barton (Oct 24 2019 at 01:33):

or Int.add is implemented by lean_int_add

Mario Carneiro (Oct 24 2019 at 01:34):

can you pass or return objects?

Reid Barton (Oct 24 2019 at 01:35):

Like Ints? :slight_smile:

Reid Barton (Oct 24 2019 at 01:35):

though probably Int is itself some kind of magic when compiled

Mario Carneiro (Oct 24 2019 at 01:35):

As long as there is no message passing, I guess there is no reason for much performance overhead with the FFI; total runtime should be dominated by the C function itself

Mario Carneiro (Oct 24 2019 at 01:36):

but if you have some huge array you have to pass in, that could hurt if the FFI layer isn't done properly

Reid Barton (Oct 24 2019 at 01:40):

Based on the other performance engineering that has already gone into Lean 4, I'm confident that it will be at least possible to do efficient FFI

Reid Barton (Oct 24 2019 at 01:40):

The @& in the type of Int.add means that the argument is borrowed, I think

Reid Barton (Oct 24 2019 at 01:42):

https://github.com/leanprover/lean4/blob/master/library/Init/Data/Array/Basic.lean#L75 makes me think you can do zero-copy FFI with Array (assuming the C side is well-behaved of course)

Reid Barton (Oct 24 2019 at 01:43):

It would be cool if there was an FFI to Rust that could cooperate with the Rust types

Reid Barton (Oct 24 2019 at 01:43):

although I have no idea whether that is even possible

Brando Miranda (Oct 24 2019 at 01:44):

Is this High Performance conversation with the foreign function interface (FFI) the reason Coq projects (CoqGym, gamepad, etc) do not do Reinforcement Learning (RL)? Anyone know?

Reid Barton (Oct 24 2019 at 01:51):

I don't know the answer to that, but this FFI business is relevant because it means it is actually viable to build your ML system in Lean, which gives you direct access to the Lean tactic state and so on as well. For other theorem provers, you'd want to use a different programming language and then you have the problem of importing/exporting data like the tactic state. (Although I'm not sure why you couldn't just use OCaml as the host language for the theorem provers written in it.)

Brando Miranda (Oct 24 2019 at 02:28):

Well, most serious ML researchers use python, so thats why, I believe most of us don't know OCaml (I'm learning it myself now thought cuz I predicted it might be useful as you have pointed out, but even if I write my ML in OCaml then it means little people can build on it if its all in OCaml)

Brando Miranda (Oct 24 2019 at 02:30):

but this FFI business is relevant because it means it is actually viable to build your ML system in Lean, which gives you direct access to the Lean tactic state and so on as well.

Oh interesting! But would that mean I can build an ML system inside of Lean or inside of Python?

Brando Miranda (Oct 24 2019 at 02:32):

For other theorem provers, you'd want to use a different programming language and then you have the problem of importing/exporting data like the tactic state.

Do you mind expanding what this means? In particular, why does one need an external programming language for most ITPs? (perhaps a few examples would be nice) Also, why isn't this a problem in Lean? Is it because Lean is a programming language itself or because of the foreign function interface (FFI)?

(Perhaps I will go and play with Lean's foreign function interface (FFI) so that its less wishy-washy in my head and get my hands dirty).

Jason Rute (Oct 24 2019 at 02:35):

As for rewriting HOL light in C++. I tried to look it up. It seems that in the first HOList paper and the website mention that their modified form of HOL Light is called DeepHOL. I think looking at the code they seemed to have just rewritten the kernel in C++, but I am not certain. The whole thing is usable as a docker container where one can treat it as a black box theorem prover interface. (See the website for how to use it.)

Reid Barton (Oct 24 2019 at 02:39):

Other theorem provers (including Lean 3, really) either aren't programming languages at all, or aren't adequate as programming languages for the task (though I must say I don't understand the situation with Coq, in particular)

Jason Rute (Oct 24 2019 at 02:43):

I think one would need the following in a project like this.

A reinforcement learning master algorithm which decides which problems to try to attempt, tells the agent to try to solve them, records the results, and uses these results to train the neural network. This would also probably be a heavily parallelized application.
A search algorithm which tries to solve a particular problem (repeatedly querying a neural network as an oracle).
The part which actually runs the neural network. (And for speed it probably needs to batch up calls to the neural network and send them together to make efficient use of the GPU.
The part which trains the neural network.

I think the last two need to be in Python or C++. The tree search agent could be in Lean with FFI to the part which calls the network, but I don't know how well a purely functional language does with (non-depth-first) tree search. The overall master agent could be written in lean, but I assume it would make more sense in something else.

Jason Rute (Oct 24 2019 at 02:45):

Also, I guess speed doesn't matter as much when you have massive parallelism. (At my job, when we need something done fast, we just reserve more AWS instances. [Well in theory. In practice there always seems to be a bottle neck or it is too expensive.])

Jason Rute (Oct 24 2019 at 12:24):

I thought a bit more about tree search algorithms in Lean. In general, reasonably efficient search algorithms can be implemented with maps (hash tables or other lookup data structures) and priority queues (heaps). I know Lean 3 doesn't have a priority queue/heap but I found a good one in the book Purely Functional Data Structures. A few weeks ago I tried implementing both their BinomialHeap and SplayHeap in Lean. The SplayHeap is about an order of magnitude faster than the Binomial Heap and can do heap sort in Lean as fast as Lean's merge sort. (I noticed Lean 4 implements BinomialHeaps in the base library, and I wonder if SplayHeaps would be better.) So maybe if Lean is decked out with the fastest purely functional fastest data structures (or uses FFI to call non-functional C++ data structures), then it wouldn't be a large bottleneck to have the search agent live in Lean. Then one would have a powerful fast search tactic in Lean (which could be guided by FFI calls to a pre-trained neural network).

Brando Miranda (Oct 24 2019 at 14:07):

Then one would have a powerful fast search tactic in Lean (which could be guided by FFI calls to a pre-trained neural network).

What worries me is the "pre-trained" neural net (NN) part. I think if Lean is going to be used besides just a theorem prover, it should allow for training of the NN. Also, another thing to consider is that its going to be hard for people to adopt the challenge or ITP Lean environment/dataset if it all lives in a new programming language that is not "standard" like python. I dont think its going to be able to kick off like the famous large-scale computer vision competition/dataset (ImageNet). It has to be taken into account that if every competitor is forced to learn a new programming paradigm (like functional programming (fp)), note this isn't only a new programming language, it might take some time to people really use it (or perhaps people won't). It takes some time to get good at a new programming language, specially if its a new paradigm. Having things in Python imho will make any ITP challenge more likely to thrive.

Jason Rute (Oct 25 2019 at 00:21):

@Brando Miranda I think in some sense one needs both (to (1) be able to run a search algorithm inside Lean with a NN and (2) guide Lean from the outside in something like Python). Setting aside the IMO challenge and just thinking about improving Lean, from the perspective of a Lean user, they want whatever system one is building to be usable in Lean. And this is really a problem with the other tools out there. I don't know if any of the AI/ITP systems currently out there are usable and useful to the practitioners of that ITP system. However, if Lean had a HOList-like tactic called lean_ist (which admittedly would involve some additional setup), then they could write by lean_ist. The system would behave similar to the library_search tactic, outputting a proof which can be pasted into Lean.

Jason Rute (Oct 25 2019 at 00:21):

On the flip side, something that inhibits the growth of these AI for ITP systems is that it is a major engineering feat to link your AI library to your theorem prover (and just as large of a feat to learn the intricacies of the logic). One thing that has really increased neural network research progress has been environments which are easy to spin up and play with. These "gym" environments allow you apply the same training algorithm to many different problems which have the same interface. I think a very useful research project would be the following: Go through the other AI for theorem proving projects out there, and find a common gym-like interface for interactive theorem provers and related systems, including tableau calculus, constraint solvers, QBF solvers, SMT solvers with tactics, non-classical proof systems. (I can point one to over a dozen papers, each in a separate system.) As far as I see they have the following uniform framework:

A term and formula language
A local goal state: What one is trying to prove at the moment (as an ordered list of formulas in some formal language)
Premise List: all the possible previously proven theorem statements one could use (which needs to be significantly narrowed down in a process called “premise selection”)
Tactics or inference rules: a fixed finite set of rules one can apply to ones current goal state
Tactic parameters: the possible parameters that can be added to the above tactics. (This is by far the most inconsistent and flexible part of the framework. These parameters can include numbers, terms, and premises from the premise list.)
a training and test set: formulas to prove (and possibly proofs for supervised learning)
application: one needs to be able to quickly apply a rule/tactic
persistence/backtracking: So that any tree search algorithm can be applied, one needs a notion of backtracking. (Practically, this means that states need to be persistent. If one applies a tactic it creates a new state, without changing the old state. Think immutable data structures.)

Jason Rute (Oct 25 2019 at 00:21):

Now not all systems use all of the above. A simple logical system, e.g. QBFs, may not use premise selection or tactic parameters. A fully automatic system (e.g. an ATP like E-prover) may only have one tactic (solve) and the challenge is just premise selection. Some systems also don’t have the backtracking, but that severely limits how effectively one can search.

Jason Rute (Oct 25 2019 at 00:22):

Now HOList has basically created a system like this in DeepHOL (although not very well documented) which they make easier to use by burying it in a docker container, so it just becomes a black box (a gym). I also think they had to do a lot of work to make the persistence/backtracking thing work, but that is also not documented well. I think the CoqGym and GamePad systems also try to do something similar for Coq. In some systems like MetaMath, it shouldn’t be terribly hard to just rewrite the logic to make such an environment. The question is, can Lean be abstracted into a system like this which is fast (and more importantly, it satisfies the persistence/backtracking requirement).

Jason Rute (Oct 25 2019 at 00:22):

Getting back to the IMO challenge, I feel the goal is different than developing a training algorithm for an arbitrary neural theorem prover. However, I agree that it isn’t clear if this project will attract those outside of the established Lean community or Microsoft Research because of the barrier to entry. However, you are here @Brando, so maybe it is working. :)

Mario Carneiro (Oct 25 2019 at 00:27):

lean has always had a backtracking tactic state

Mario Carneiro (Oct 25 2019 at 00:27):

speed might be an issue, but possibly lean 4 has solved that problem

Jason Rute (Oct 25 2019 at 00:28):

But is it useable from the outside. Could it be wrapped in a black box with a set number of tactics? (So from the outside it is just like a chess board with a set number of possible moves?)

Mario Carneiro (Oct 25 2019 at 00:32):

I would really like a C++ API for doing this

Mario Carneiro (Oct 25 2019 at 00:33):

You can do it with lean --server but there is far too much overhead involved

Iddo Drori (Jan 06 2025 at 14:54):

The IMO Grand challenge https://imo-grand-challenge.github.io
rules state that the AI must be open-source, does this mean:
(a) Open code and open pre-training source and open inference source and open weights
or
(b) Open code and open inference source and open weights
or
(c) Open code and API calls

Jason Rute (Jan 06 2025 at 15:12):

@Iddo Drori The IMO grand challenge doesn’t really exist. It was just an ambitious idea that went officially nowhere, but inspired a lot of work. There is AIMO (which I think you have participated it). There are lots of benchmarks related to AI for IMO problems. There are AI agents (like AlphaGeometry, AlphaProof, o1, o3, etc), that can do IMO problems at different levels of success, and I’m sure more to come.

Kevin Buzzard (Jan 06 2025 at 15:29):

(speaking as someone who's listed on that web page, I can confirm Jason is correct)

Gaston Longhitano (Jan 06 2025 at 15:43):

@Jason Rute we're well aware of the field and different competitions.
@Kevin Buzzard we'd appreciate an answer to the question:
Does open source AI for the IMO grand challenge mean:
(a) Open source (such as OLMo 2)
or
(b) Open weights (such as DeepSeek V3)
or
(c) Code with API calls (such as o1)
Btw: Looks like Deepmind participated in the IMO grand challenge last year.

Kevin Buzzard (Jan 06 2025 at 15:52):

It is unlikely that the IMO-grand-challenge committee will ever make some formal response to your question so I would enjoy the freedom which this situation brings and make your own rules up.

Kevin Buzzard (Jan 06 2025 at 15:53):

I do not agree with your claim that DeepMind partipated in the challenge last year. Firstly none of their work is open source in any of the senses you suggest above, and secondly they did not contact any of the committee members indicating that they were participating in any way.

Kevin Buzzard (Jan 06 2025 at 15:56):

I have heard through the grapevine that there is at least one tech company that wants to participate in IMO 2025 in some kind of official way, but I am not involved in the organization of IMO 2025 in any way and I would suggest that you contact the IMO organizers directly if you have an interest in this.

Jason Rute (Jan 06 2025 at 16:08):

@Gaston Longhitano My point is that these questions are meaningless. There is no official committee or money behind the IMO Grand Challenge to adjudicate claims of solving the IMO Grand Challenge. Daniel Selsam's views have almost certainly changed since he dreamed up the competition (as he works on o1-style reasoning at OpenAI), and the committee listed on that website doesn't meet. Remember these rules came even before ~~Codex~~ GPT-3 was a thing, much less ChatGPT. If one wants something official, like @Kevin Buzzard says, I suggest they contact the official IMO 2025 organizers. Or if one wants the XTX AIMO $1 million prize money, they contact that organization. If you just want the opinions of the people in this chat forum, then lots of folks might have different answers. I know for example @Joseph Myers has strong views on this matter.

Jason Rute (Jan 06 2025 at 16:24):

If you would like to know Daniel's thought process he said this in 2021:
Daniel Selsam said:

related: how to ensure winning technology advances science as much as possible (i.e. is reproducible)?

the original rules (requiring entire system be open-source and easily reproducible) is ideal but might not be feasible, e.g. if the NN is an org's singular $1B model, so we'll have to adjust the rules depending on the technical approaches taken

Jason Rute (Jan 06 2025 at 16:48):

Although I guess there are betting markets, such as https://manifold.markets/FlorisvanDoorn/what-will-be-the-year-that-the-imo?play=true by @Floris van Doorn which reference these rules. I guess they will have to figure out how to adjudicate this. :smile:

Mao Mao (Jan 06 2025 at 16:48):

@Kevin Buzzard Thank you for your response and confirming that:

The IMO grand challenge committee exists.
Deepmind did not participate in the IMO grand challenge.

Btw: Perhaps we were confused by Deepmind's blog post (https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level) "More recently, the annual IMO competition has also become widely recognised as a grand challenge in machine learning and an aspirational benchmark for measuring an AI system’s advanced mathematical reasoning capabilities."

Iddo Drori (Jan 06 2025 at 16:50):

Thank you everyone, looking forward to IMO 2025 in Australia: https://imo2025.au

Kevin Buzzard (Jan 06 2025 at 17:13):

I very much hope that the IMO will embrace these new possibilities!

Jason Rute (Jan 06 2025 at 17:20):

Mao Mao said:

The IMO grand challenge committee exists.

Maybe I’m belaboring this point but, there is a committee only in the sense that there are 7 names on a website. I don’t think they have met in the last 3 years if not more. I don’t think they have drafted rules (as I think those rules on the website predate the committee’s formation), and they certainly have never done anything official in regards to hosting or judging competitions. If someone comes out and says “I’ve solved the IMO Grand Challenge” I’m not even sure if you could get the committee to meet to confirm this fits the criteria or add that achievement to their website. @Kevin Buzzard can correct me if I’m wrong.

Kevin Buzzard (Jan 06 2025 at 17:26):

The IMO grand challenge is a website made about 6 years ago and literally nothing else, I do not know how I can stress this any more than I have already. It is vaporware. Jason's understanding is correct.

Eric Wieser (Jan 06 2025 at 17:51):

Should the website be updated to reflect that?

Kevin Buzzard (Jan 06 2025 at 17:59):

That's a question for Selsam. I have never had any access to the website or indeed done anything other than replying "yes" to an email Dan sent me 7 years ago; that is the totality of my contribution here.

Jason Rute (Jan 06 2025 at 18:01):

Yes! If @Daniel Selsam would be willing to add a note at the top it would go a long way to clarifying things. He could even emphasize that while this website never materialized to an official competition, it has had tremendous influence including competitions like AIMO, Math AI like AlphaGeometry, AlphaProof, DeepSeek-Prover, o1, Gemini-Pro, o3, and benchmarks like MiniF2F, AIME 2024, IMO-Bench, Putnam-Bench, FrontierMath, etc.

Tristen Harr (Jan 08 2025 at 03:36):

Hi folks, not really a lean user (although I am a math nerd and software engineer so maybe one day haha), but I joined this forum to mention that there are some additional betting markets tied to this. I'm relatively familiar with how these markets work, and they will resolve the market to No if whomever comprises the "IMO Grand Challenge Committee" do not announce a winner and the $5 million AIMO grand prize hasn't been won by year end. (Regardless of if the committee actually exists, etc.) In case you are curious in that market the contractual rules state that the payout is achieved if "either the IMO Grand Challenge/the AIMO Grand Prize has been won before <date>"

I've thought this market was inefficiently priced for a while now and have been lurking this channel, but I recently hit my max position size so no point keeping quiet. I think there's a lot of AI hype and not enough experts involved which has led to a rare poorly priced market. As "the experts" I'd be curious what you'd all peg the odds of an AI winning the AIMO grand prize or the (apparently non-existent) IMO grand challenge by year end?

As I understand it, the AIMO is still in (relatively) early stages, as we're only at the second progress prize and it sounds like the IMO Grand Challenge isn't being won anytime soon as it isn't even really a thing anymore and is unlikely to ever be won. Is there something I'm missing that would explain why the markets think there's a 56% chance AI wins by year end other than hype? Is there some official competition from XTX markets for the grand prize later this year? It doesn't sound like there are plans to reconvene the committee and award a prize. Anyways... it looks to me like you can buy $1 for $0.53 at the moment unless I've misunderstood the conversations in this thread. Sorry if this is off-topic, I assume when you folks originally came up with the IMO Grand Challenge you didn't think people would be betting on it years later. :joy:

Kim Morrison (Jan 08 2025 at 04:23):

I do think there is an excellent chance that multiple models will gold medal at the 2025 IMO (perhaps with somewhat relaxed standards about how the problems are presented to the models).

If any of those are open-source-ish, I think there is going to be a pretty plausible argument to resolve yes, even if "the IMO Grand Challenge" doesn't actually exist in any meaningful way.

Jason Rute (Jan 08 2025 at 04:58):

Are betting markets the reason all these questions popped up recently in this thread?

Tristen Harr (Jan 08 2025 at 05:04):

Yes. People (myself included) have thousands of dollars on the line based on those old websites.

While I’d agree about there being a plausible argument if we’re talking about multiple models potentially IMO gold medalling, unless it’s officially stated that the IMO Grand Challenge is won, or the XTX Markets $5 million grand prize is won it won’t count. They’re very specific about the contract rules, as these are regulated event contracts. You might say “a reasonable person would say an open-source AI won” but these often resolve on technicalities. If the committee is going to reconvene and announce an official winner then I’d hope they’d follow the rules laid out on the page, even if it’s outdated/irrelevant. Checkable by the Lean kernel in 10 minutes, no opportunity for partial credit, as much time as a human competitor, open sourced and publicly released before the first day of the IMO and easily reproducible alongside not having the capability to query the internet. I’d argue that it also ought to be F2F if there’s an official winner. I guess it would be whatever the committee decided though, although for the sake of people who did bet, I’d hope if the committee reconvenes and announces a winner, they would clarify that nobody won the original IMO Grand Challenge, unless someone actually does. It’s what is linked as the page that will be used to decide who wins the money. :sweat_smile:

This zulip channel is linked from the page they say will be used to ultimately decide the winners of the bet. I'd almost guarantee if you've had a ton of questions popping up that is why.

Tristen Harr (Jan 08 2025 at 05:16):

It might sound silly, but my bet boils down to you folks not reconvening the committee and changing the website to say there is an official winner, and nobody winning the XTX markets grand prize by the end of the year. That’ll be what decides it. I guess your zulip channel has officially become high-stakes. :joy: Anyways, I’ll leave you folks to it, hope I didn’t bother anyone, thanks for the info! If you have any other thoughts would love other expert opinions.

Kim Morrison (Jan 08 2025 at 05:48):

Sounds like there are some excellent market manipulation opportunities here!

Kim Morrison (Jan 08 2025 at 05:56):

I propose that Kevin "announces" a result that will cause the highest loss to people who have wasted time talking about betting markets on this zulip. :-)

Joseph Myers (Jan 08 2025 at 12:46):

The people who set up betting pools based on something that was an idea for a competition rather than ever being an actual fully-specified competition (and where some details in the idea may have been left behind by how AI has developed in the past five years) have created their own problem and it's probably for them to figure out what if anything to do to solve the problem they created!

Joseph Myers (Jan 08 2025 at 12:59):

Here are some key differences between DeepMind getting one point short of gold in 2024, and the principles for the IMO Grand Challenge as originally stated:

The DeepMind AIs aren't open source (parts of AlphaGeometry 1 are open source, though I think not the synthetic data generation code; as far as I can tell, AlphaGeometry 2, as used for IMO 2024, hasn't been released).
The DeepMind work used two separate languages (Lean for non-geometry, and an AlphaGeometry domain-specific language for geometry); the IMO Grand Challenge envisaged everything being in Lean.
DeepMind prepared their own Lean versions of the problems, not following any externally defined conventions; the IMO Grand Challenge envisaged externally-agreed conventions and probably independently written Lean statements ("We are working on a proposal for encoding IMO problems in Lean and will seek broad consensus on the protocol."). In particular, if you have multiple F2F entrants attempting the problems of a given IMO, they all ought to use the same Lean version of the problems (one Lean version per problem), for a proper comparison between those entrants.

Joseph Myers (Jan 08 2025 at 13:01):

And also:

AlphaProof used a lot more wall-clock time than the same-time-limit-as-for-humans (4.5 hours per paper) rule in the IMO Grand Challenge.

Kevin Buzzard (Jan 08 2025 at 13:42):

Whilst I find Kim's suggestion funny, I am pretty sure that the committee will act sensibly here. Right now I have no evidence that the challenge was accomplished in 2024, with the only serious attempt AFAIK being DeepMind's AlphaProof, which did not satisfy the requirements as Joseph has clearly pointed out above. If there is something which gets closer in 2025 then clearly the committee will actually have to do something.

Kevin Buzzard (Jan 08 2025 at 15:02):

I have contacted the other members of the IMO Grand Challenge committee and alerted them to the fact that we might be called on to make a ruling if it looks like a system has satisfied the criteria laid out on the website.

I should also stress strongly here that the IMO Grand Challenge and the XTX-backed AIMO are unrelated projects (although they have connections, e.g. I am involved with both of them) and I am quite surprised that the two projects seem to be casually linked in the betting market links mentioned above. In stark contrast to the IMO Grand Challenge (which is just a website), the AIMO is a serious endeavour backed up by a committee who do do a lot of work and they can certainly be relied upon to make pronouncements on e.g. their progress prizes.

I hope everything now is completely clear.

Joseph Myers (Jan 09 2025 at 02:13):

Kevin Buzzard said:

I have heard through the grapevine that there is at least one tech company that wants to participate in IMO 2025 in some kind of official way, but I am not involved in the organization of IMO 2025 in any way and I would suggest that you contact the IMO organizers directly if you have an interest in this.

It would probably be best for anyone who wants to have their AI participate "officially" (whatever that means) to contact the IMO Board, since this is more a Board-level issue than one for the organizers of a particular IMO.

And if you want something involving significant preparation on either the IMO side or the mathlib side (e.g. an official Lean version of the problems, or common externally agreed rules for all AI F2F entrants), it would be best to start the arrangements with the Board and others well in advance (e.g. now) so there is time for appropriate preparations to be made.

Last updated: May 02 2025 at 03:31 UTC