Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Releasing LeanDojo
Kaiyu Yang (Jun 28 2023 at 02:53):
Hi everyone,
We're excited to release LeanDojo: open-source tools, benchmarks, and models for learning-based theorem proving in Lean. It provides robust and well-documented tools for data extraction and interacting with Lean programmatically (supporting both Lean 3 and Lean 4). We use LeanDojo to construct benchmarks from mathlib and develop LLM-based provers augmented with retrieval for premise selection.
We open-source everything. To our knowledge, this is the first set of open-source LLM-based theorem provers. Our model is relatively inexpensive and needs only one GPU week of training. It does not rely on any proprietary assets. You are welcome to check it out.
- Project: leandojo.org
- Paper
- Code
- Datasets: Lean 3, Lean 4
- ChatGPT plugin
- Tweets: https://twitter.com/KaiyuYang4/status/1673882824158613504. Retweeting is appreciated!
Scott Morrison (Jun 28 2023 at 04:09):
Looks fun!
The paper link at https://github.com/lean-dojo/LeanDojoChatGPT needs updating.
Scott Morrison (Jun 28 2023 at 04:12):
Am I right in understanding that your ChatGPT plugin doesn't actually connect with reprover in any way? That is, the plugin just gives GPT access to lean?
Scott Morrison (Jun 28 2023 at 04:18):
Your paper says:
To mitigate this issue, besides the random split, we create a challenging data split named
novel_premises. It requires testing proofs to use at least one premise that has never been used in
training. For example, the last two theorems in Fig. 3 both use the premise conj_mul. If one theorem
is in the training set of the novel_premises split, the other one must also be in training.
I'm concerned this will bias novel_premises away from "hard" theorems (which will tend to involve premises mentioned elsewhere, by virtue of being interesting) and towards "leaf" theorems (i.e. theorems about things that aren't otherwise of much interest, hence have "novel" premises). I haven't actually looked at the data split, however, so this may be a misguided worry?
Scott Morrison (Jun 28 2023 at 04:42):
I successfully ran your premise retrieval demo from https://github.com/lean-dojo/ReProver#premise-retriever, but I'm a bit confused how I'm meant to use this in practice. That example has 8 hardcoded premises, and then selects 2 after being shown a goal. What is the intended model when I only have a goal (i.e. and implicitly the entire library to choose from)?
Junyan Xu (Jun 28 2023 at 04:46):
Thanks for sharing @Kaiyu Yang ! Two issues I think I should point out:
- 
Mathlib has preferred citation entries which you can find at https://github.com/leanprover-community/mathlib/blob/master/CITATION.md 
- 
In your tweet you claim ReProver to be "the first LLM-based prover augmented with retrieval for premise selection", but the Magnushammer paper (which you did cite), even though being focused on premise selection, does integrate with the LLM-based Thor to achieve better performance. 
Junyan Xu (Jun 28 2023 at 04:53):
What is the intended model when I only have a goal (i.e. and implicitly the entire library to choose from)?
I think this use case is https://github.com/lean-dojo/ReProver#tactic-generator which uses the tacgen model. Combining both requires the retriever-tacgen model.
Scott Morrison (Jun 28 2023 at 05:02):
No, I'm not asking how to generate a candidate tactic. I just want some proposals of relevant premises.
Scott Morrison (Jun 28 2023 at 05:05):
I think this is a fundamental missing component in the Lean tactic ecosystem, and much higher priority than anything LLM driven. I think we're still at the point where it is premature to be expecting much from LLMs in theorem proving. But good premise selection + old fashioned automation (solve_by_elim, exact?, rewrite_search, hammers, etc, etc.) have lots of mileage.
Junyan Xu (Jun 28 2023 at 05:39):
Oh I see. The picture says "all accessible premises in the math library" "33K on average" (probably depends on import)?
The premises in the premise retriever example doesn't contain the full signature but appears to be the plain text appearing in the Lean files. Passing in strings in real time is definitely not efficient, and we should utilize pre-computed embeddings: @Adam Topaz's mathlib4 embeddings apparently use full signatures, but I'm not sure about @Zhangir Azerbayev's mathlib3 embeddings.
Kaiyu Yang (Jun 28 2023 at 14:19):
@Junyan Xu Thanks for pointing them out. We will use mathlib's preferred citation in the next revision (probably in August so that we can incorporate reviewers' comments since the paper is under review at NeurIPS Datasets and Benchmarks Track).
The second point is also valid. Our method emphasizes more on augmenting the LLMs with retrieval, i.e., the LLM takes retrieved premises as part of its input. In Magnushammer, they are separate. The LLM decides whether to call hammer, Magnushammer performs retrieval, and the retrieved premises are combined with a symbolic prover (some heuristic tactic templates), not the LLM. Nevertheless, the statement "first LLM-based prover augmented with retrieval" can be misleading, and we will rephrase it in the next revision.
Kaiyu Yang (Jun 28 2023 at 14:32):
@Scott Morrison Thank you for your comments!
You're right, the ChatGPT plugin is not connected to ReProver in any way. We simply wrap LeanDojo's interface (for initializing the proof search and running tactics) as APIs for ChatGPT to call.
Regarding the "novel_premises" split, it doesn't prevent a testing theorem (say, A) from using a premise that has been used in other theorems (B, C, D). It just requires B, C, and D to be also in the testing set. I don't have the knowledge to comment on whether the testing theorems in "novel_premises" are difficult mathematically. However, from our empirical results, they do seem to be more difficult for our model (and for our baselines, including tidy and GPT-4).
The code demo in https://github.com/lean-dojo/ReProver#using-trained-models-on-hugging-face tries to separate the model from our infrastructure for training and inference. In our setup, LeanDojo performs program analysis on a Lean repo, extracting information such as the dependency DAG between files, what constants are defined in each file, and where they are defined (row/column numbers). This information is saved into a static dataset (LeanDojo Benchmark) that the machine learning model can access. This infrastructure is still optimized for machine learning use cases, rather than interactive use cases in VSCode. The point of separating the model from the infrastructure is that you could use metaprogramming to build your own infrastructure (e.g., for extracting premises in VSCode) and use our models.
Kaiyu Yang (Jun 28 2023 at 14:34):
@Junyan Xu Yes, the number of accessible premises depends on imports. "Premises" in LeanDojo are basically "constants" in Lean's terminology. In each theorem, we can access constants defined earlier in the same file, or imported (directly or indirectly) from other files. 33K is the average of all theorems in LeanDojo Benchmark.
Bolton Bailey (Jun 28 2023 at 14:51):
Screenshot-2023-06-28-at-10.50.42-AM.png
Figure B from the arxiv paper looks like it is experiencing some problems rendering unicode.
Kaiyu Yang (Jun 28 2023 at 14:57):
@Bolton Bailey Thanks for noting that! These Unicodes can be rendered properly on OverLeaf using XeLaTex, but arXiv only supports pdfLaTex. That's super annoying..
Patrick Massot (Jun 28 2023 at 15:25):
arXiv moves very slowly but it moves. Supporting any modern compiler is still listed as Not in the short-term of “this year”.
Patrick Massot (Jun 28 2023 at 15:25):
So there is hope in the long run, but don't hold your breath.
Junyan Xu (Jun 28 2023 at 15:50):
I think you have the option to upload a PDF without TeX source, but am not sure about whether you can add a PDF to your TeX submission. This page says:
a PDF file created from a TeX/LaTeX file will typically be rejected, with exceptions granted on a case-by-case basis. There are good reasons why arXiv insists on TeX/LaTeX source if it is available. arXiv produces PDF automatically from all TeX submitted source.
I think a display issue should warrant an exception.
arXiv is also hiring TeX experts:
TeX/LaTeX rationalization – arXiv accepts the bulk of its content in TeX format. We have TeX or LaTeX source for 90% of our articles. However, compiling all the TeX source we have into viewable papers (in PDF and increasingly also in HTML) can be difficult. We maintain a principle that we can rebuild any TeX submission accepted in the last 32 years. Preserving this capability, while being able to accept submissions that depend on the latest and greatest versions of TeX or LaTeX has become increasingly challenging. It is not unusual for a new release of TeX Live to include changes that cause older papers to no longer build properly. We have an opening for a TeX/LaTeX expert to rationalize our TeX pipeline. This task may include getting deeply involved in the TeX community and contributing to projects that support arXiv's goals.
Kaiyu Yang (Jun 28 2023 at 15:56):
Thanks for the pointers! We'll probably just do the easiest thing: use an image for the figure in the next revision.
Junyan Xu (Jun 28 2023 at 16:57):
we create a challenging data split named novel_premises. It requires testing proofs to use at least one premise that has never been used in training.
I'm concerned this will bias novel_premises away from "hard" theorems (which will tend to involve premises mentioned elsewhere, by virtue of being interesting)
I feel you can simply require that for any testing proof A and any training proof B, the premises used in A aren't a subset of the premises used in B. It's still an interesting testing sample if it requires combining techniques from two proofs the model trained on. Shouldn't be too hard to generate such a split?
Kevin Buzzard (Jun 28 2023 at 18:53):
Re pdfs in ArXiv: it is possible to edit the pdf to make ArXiv believe that it was not produced from a TeX source and hence accept it, but it's not trivial (some methods on the internet apparently don't work any more)
Kaiyu Yang (Jun 28 2023 at 19:36):
@Junyan Xu Yep, it shouldn't be hard to create such splits by adapting our data-generation script (at least for Lean 3, since the premise extraction for Lean 4 is still ongoing work) . We also considered other splits, e.g., training on old theorems and testing on new theorems (theorem dates can be determined from Git history, and LeanDojo already supports that). I guess that doesn't make sense for mathlib 4, since most theorems are exported from mathlib 3?
Jason Rute (Jun 29 2023 at 08:34):
@Kaiyu Yang interesting work. I look forward to diving deeper into it. Since your work uses publicly available pre-training data, it would be interesting and important to look into how much data contamination is in your pre-training data. You may be surprised (as I was in the PACT paper) that it could be quite significant.
Jason Rute (Jun 29 2023 at 08:46):
Also as for if this is the “the first LLM-based prover that is augmented with retrieval for selecting premises from a vast math library”, probably depends whether this AITP abstract on Retrieval-Augmented Proof Step Synthesis by @Christian Szegedy @Markus Rabe and @Henryk Michalewski counts. :smile:
Kaiyu Yang (Jun 29 2023 at 13:39):
@Jason Rute Thanks for your comments!
Our model is built upon the ByT5 model, which was pretrained on the mC4 dataset (details in Sec 3.1 of the mT5 paper). The dataset was relatively old (constructed in 2020) and was designed to include 108 different natural languages (not code). Therefore, we believe the risk of data contamination is smaller compared to prior works such as PACT and HyperTree Proof Search. It would be helpful to download the mC4 dataset and check how much Lean is included. Currently, we don't have the disk for doing that since the dataset is massive, but we'll try to look into it. Thank you for the suggestion!
Regarding Christian et al. 2021, we were aware of this work (and another relevant work Tworkowski et al. 2022) when writing our paper. However, we found it difficult to discuss them or let them influence the claims we made. They were intended to be non-archival extended abstracts (as AITP requires), instead of complete research papers. We were unable to find enough details in the papers (regarding experimental setup, datasets, evaluation protocol, etc.) for us to comfortably discuss them and compare with our work in a fair manner.
Scott Morrison (Jun 29 2023 at 22:52):
Surely they can still be mentioned, even if with that caveat? I have to admit the "not enough detail" for "us to be comfortable" line reads as a bit self-serving to me! Don't get me wrong, I like this work (and wish it had some more usable products...), but given the prevailing skepticism amongst mathematicians of ML papers claiming to make progress in doing maths, I think erring on the side of even-handedness helps overall.
Kaiyu Yang (Jun 30 2023 at 14:07):
@Scott Morrison
I agree they can be mentioned, and we'll probably do that in the next revision. However, I'd suggest focusing the discussion on facts rather than motives that are unfalsifiable. For those who are curious, I recommend reading the two extended abstracts that are being discussed. It's important to recognize that different fields have different norms concerning the discussion of work in progress. For instance, mathematics often has lengthy publication cycles, and it's not unusual for drafts to be shared and circulated for an extended period before final publication. This practice might not be as prevalent in AI.
Regarding "more usable products," if you are referring to proof automation tools integrated with the Lean user's workflow, such as in VSCode, we currently have a student working on running our models in Lean locally, for example, as tactics. I find this type of work challenging not technically but because the priorities in different communities are not always aligned. Jeremy and others had interesting discussions about this on the 2nd day of the AI to Assist Mathematical Reasoning workshop.
Kaiyu Yang (Jun 30 2023 at 15:19):
Some contexts: I say that (regarding integrating LLMs with Lean locally) because I tried to sell this idea as a potential project to several students interested in AI/ML + theorem proving, from undergrads to Ph.D. students. It turned out only the most junior student (@Peiyang Song ) was interested (who is still entering research and is very open-minded).
Andreas Gittis (Jul 01 2023 at 01:42):
This looks really great!
I tried following the instructions in the "Getting Started" section of the docs (https://leandojo.readthedocs.io/en/latest/getting-started.html). After installing when I ran the basic example code I ran into an error with pydantic. I think they upgraded to version 2.0 just today, and when I downgraded to pydantic 1.10.10 and retried the error went away so I'm guessing the package version isn't correctly delimited. Afterwards when I tried to trace the example file I ran into a rather verbose error that had to do with being unable to read/pickle some 'lxml.etree._ListErrorLog' object. I can dump the error message as a text file or image if you're interested, wasn't able to find a way around that one.
Also, I'm not sure if this is the correct thread to discuss technical issues with the package, so please let me know if I should DM this type of thing or post it somewhere else. Thanks again for the hard work from you and your colleagues.
Kaiyu Yang (Jul 01 2023 at 04:19):
Hi @Andreas Gittis, could you please open an issue in our repo? Thanks!
Heather Macbeth (Jul 12 2023 at 01:47):
Kaiyu Yang said:
Regarding the "novel_premises" split, it doesn't prevent a testing theorem (say, A) from using a premise that has been used in other theorems (B, C, D). It just requires B, C, and D to be also in the testing set. I don't have the knowledge to comment on whether the testing theorems in "novel_premises" are difficult mathematically.
@Kaiyu Yang To continue the discussion of your "novel premises" split. From your comment here, it seems like the "novel premises" constraint would encourage clustering of subject matter, so that whole groups of related theorems are enforced to lie only in training or only in test. Do you agree?
Data point: I looked at the test set and it seems to contain material from 449 files in mathlib, out of the ~3000 files in total.  And there was some evidence of "super-clustering" beyond this, e.g. none of the analysis.normed_space.* folder gets included and basically all of the analysis.convex.* folder gets included:
   2 src/analysis/convex/basic.lean
  10 src/analysis/convex/between.lean
   1 src/analysis/convex/body.lean
   1 src/analysis/convex/caratheodory.lean
   3 src/analysis/convex/combination.lean
   1 src/analysis/convex/contractible.lean
   1 src/analysis/convex/extrema.lean
   3 src/analysis/convex/extreme.lean
   2 src/analysis/convex/function.lean
   8 src/analysis/convex/gauge.lean
   3 src/analysis/convex/hull.lean
   3 src/analysis/convex/independent.lean
   3 src/analysis/convex/integral.lean
   5 src/analysis/convex/intrinsic.lean
   2 src/analysis/convex/krein_milman.lean
   1 src/analysis/convex/measure.lean
  17 src/analysis/convex/side.lean
   2 src/analysis/convex/slope.lean
   1 src/analysis/convex/star.lean
   1 src/analysis/convex/stone_separation.lean
   1 src/analysis/convex/strict.lean
   1 src/analysis/convex/strict_convex_between.lean
   6 src/analysis/convex/strict_convex_space.lean
   4 src/analysis/convex/topology.lean
   1 src/analysis/convex/uniform.lean
Heather Macbeth (Jul 12 2023 at 01:49):
I guess the argument here would be that the benchmark is testing how useful a system could be expected to be on a new theory, where it has not previously seen the mathematical subject matter.
Kaiyu Yang (Jul 12 2023 at 02:12):
@Heather Macbeth I think you're correct. The "novel_premises" split may indeed have the effect of encouraging clustering. We also find other side effects, e.g., theorems with easy proofs consisting of a single tactic w/o premise (simp, norm_num, ring, etc.) can only go to training.
Also, I think it might be helpful to create an additional split that explicitly tests how the model generalizes to different topics. Some benchmarks (e.g., CoqGym) even split the theorems by projects, which might be a bit too challenging for current models, in my opinion.
Heather Macbeth (Jul 12 2023 at 02:21):
Kaiyu Yang said:
other side effects, e.g., theorems with easy proofs consisting of a single tactic w/o premise (
simp,norm_num,ring, etc.) can only go to training.
That's interesting. Can you say more about why these can't go in test? Since such proofs contain no lemmas, I would have thought there was no constraint on their presence in the test set.
Jason Rute (Jul 12 2023 at 02:25):
My understanding is that to be in the test set the theorem has to explicitly call a premise not used in any training theorem. So proofs which use powerful tactics but don’t explicitly call premises are excluded from testing. I think this is a big reason why this split is so much harder.
Jason Rute (Jul 12 2023 at 02:27):
Even proofs like simp which use premises but don’t explicitly name them wouldn’t count in this split.
Heather Macbeth (Jul 12 2023 at 03:09):
Are premises required to be theorems, i.e. of type Prop?  If premises can have any type (including Type), I guess you'd have the problem that every theorem mentioning the real numbers has to fall on one side of the dichotomy.
Kaiyu Yang (Jul 12 2023 at 03:23):
@Heather Macbeth Not necessarily. "Premises" in LeanDojo are basically "constants" in Lean's terminology. They can be theorems, definitions, inductive definitions, etc.
I didn't understand "every theorem mentioning the real numbers has to fall on one side of the dichotomy". Could you elaborate a 
bit?
Heather Macbeth (Jul 12 2023 at 03:27):
The real numbers, ℝ, are defined at this line in mathlib:
https://github.com/leanprover-community/mathlib/blob/4e24c4bfcff371c71f7ba22050308aa17815626c/src/data/real/basic.lean#L30
They are a structure, which I would assume is a "constant".  So will that mean that all the theorems for which the token ℝ appears in the proof (for instance in a have-statement or a cast) get grouped together, either (1) all in training or (2) all in test?
Kaiyu Yang (Jul 12 2023 at 03:40):
@Heather Macbeth Nope, for a theorem to be in testing, its human-written proof must use at least one premise that is never used by any training proof. We don't require all premises it uses to be not in training. So ℝ can definitely be used both in training and in testing. This is the script for generating the dataset: https://github.com/lean-dojo/LeanDojo/blob/main/scripts/generate-benchmark-lean3.ipynb
Heather Macbeth (Jul 12 2023 at 03:43):
Ah, indeed this is the point I missed!
Heather Macbeth (Jul 12 2023 at 03:56):
It would be interesting to build a full list of the "force-their-descendants-into-test" premises. I'd be curious to see whether they are all theorems, or whether some of them are definitions or structures.
I understand there's some randomness here.
Min-Hsien Weng (Aug 07 2023 at 05:24):
Kaiyu Yang said:
Heather Macbeth Nope, for a theorem to be in testing, its human-written proof must use at least one premise that is never used by any training proof.
Thanks for the clarification. I understand the complexity of data split, particular for mathematical texts. 
In some cases, the proof of a theorem may rely on existing lemmas that have been solved in the previous proof step. Such a case would be counted as testing, as this lemma would never appear before. 
Please correct me if my understanding is wrong. Thank you.
Kaiyu Yang (Aug 07 2023 at 14:41):
@Min-Hsien Weng Yes, if this lemma is not used by another theorem later.
Kaiyu Yang (Aug 13 2023 at 22:06):
Hi,
We just released a new version of LeanDojo: https://github.com/lean-dojo/LeanDojo/releases/tag/v1.2.0. Two major updates:
- Thanks to @Peiyang Song . Now, LeanDojo fully supports Lean 4 (previously in alpha stage). The new version can extract premises from Lean 4 repos. I.e., for each constant identifier used in tactics, LeanDojo finds out its full name and locates its definition (The previous version can do this only for Lean 3).
- The new version can interact with Lean 4 programmatically through not only tactics but also commands, e.g., def x:= 1or#eval x. See the end of this demo for examples.
Lev Stambler (Aug 15 2023 at 19:57):
Great project! I am playing around with Lean Dojo + Reprover. I was wondering if there is an easy way to load all of Mathlib as potential premises (for Lean 3)?
Kaiyu Yang (Aug 16 2023 at 02:28):
@Lev Stambler Just create a new GitHub repo and have a new file that imports everything in mathlib? mk_all.sh might be helpful.
Lev Stambler (Aug 16 2023 at 12:56):
Kaiyu Yang said:
Lev Stambler Just create a new GitHub repo and have a new file that imports everything in mathlib? mk_all.sh might be helpful.
Fantastic thank you!
Lev Stambler (Aug 16 2023 at 14:47):
I am playing around with ReProver and am noticing that sometimes, ReProver will get stuck in a sort of loop.
In my case, I am trying to prove the lemma in the example repo:
open nat (add_assoc add_comm)
theorem hello_world (a b c : ℕ) : a + b + c = a + c + b
Occasionally, the model suggests rw [add_comm] on a loop or rw [add_assoc] on a loop. Has anyone seen anything similar and have any suggestions?
Kaiyu Yang (Aug 16 2023 at 15:06):
Are you running the model from Python to generate a single tactic for the current proof state? If so, that's kind of an expected behavior. When evaluating the model, we plug the tactic generator into some search algorithm (e.g., best-first search), and the model generates multiple tactic suggestions for each proof state. This way we can get rid of loops.
Jason Rute (Aug 16 2023 at 15:20):
A few papers take the whole partial proof into account which would also help avoid loops (at least in theory since the model would learn that one never uses rw [add_comm] twice in a row), but that is still pretty rare to see.
Lev Stambler (Aug 16 2023 at 15:49):
Kaiyu Yang said:
Are you running the model from Python to generate a single tactic for the current proof state? If so, that's kind of an expected behavior. When evaluating the model, we plug the tactic generator into some search algorithm (e.g., best-first search), and the model generates multiple tactic suggestions for each proof state. This way we can get rid of loops.
Oh I see. I was using beam-search but with a small beam size (4). The paper specifies 64. Thank you!
Min-Hsien Weng (Aug 17 2023 at 01:02):
Kaiyu Yang said:
Are you running the model from Python to generate a single tactic for the current proof state? If so, that's kind of an expected behavior. When evaluating the model, we plug the tactic generator into some search algorithm (e.g., best-first search), and the model generates multiple tactic suggestions for each proof state. This way we can get rid of loops.
I wonder if the tactic generated by Reprover would be a single tactic or a combination of tactics for a proof goal?
In the example of Reprover, it seems that some tactic candidates may contain multiple tactics rw [nat.gcd, nat.gcd_self_right]. It makes great sense for rewrite tactics as it can reduce the length of proofs but the intermediate goal between these two tactics may be hidden.
However, using a single tactic will likely increase the length of proofs when the proof itself is large and complex.
Kaiyu Yang (Aug 17 2023 at 03:13):
I don't think there is a clear boundary between a single tactic and a compound tactic made of tactic combinators such as <;>.  Also, we don't hard-code the model to choose between them. Instead, what the model generates is only determined by the training data. In our dataset (LeanDojo Benchmark), rw [nat.gcd, nat.gcd_self_right] is treated as a single tactic, but X <;> Y consists of two tactics.
Min-Hsien Weng (Aug 17 2023 at 04:19):
That makes more sense. So the tactics generated by the model could be single/compound, depending on the context (proof goal). 
Thanks for the clarification @Kaiyu Yang. I have a better understanding of the model.
Rahul Saha (Aug 18 2023 at 04:16):
Hello there, I have this repository (this is a minimal worked example) that I am trying to run LeanDojo on - https://github.com/rah4927/lean-dojo-mew
I installed LeanDojo fine and can get it to run on simple examples. But on this repository, when I run the following piece of code:
import os
os.environ['CONTAINER'] = 'native'
from lean_dojo import *
repo = LeanGitRepo("https://github.com/rah4927/lean-dojo-mew", "c9e978ef8b2ea42fd6576de6c92e845d38854b65")
theorem = Theorem(repo, "lean4/MiniF2F/Example.lean", "amc12a_2019_p21")
with Dojo(theorem) as (dojo, init_state):
  print(init_state)
I get the following error:
2023-08-18 00:14:03.420 | WARNING  | lean_dojo.interaction.dojo:__post_init__:171 - Using Lean 4 without a hard timeout may lead to problems if a tactic hangs indefinitely.
2023-08-18 00:14:03.996 | INFO     | lean_dojo.data_extraction.trace:get_traced_repo_path:132 - Tracing LeanGitRepo(url='https://github.com/rah4927/lean-dojo-mew', commit='c9e978ef8b2ea42fd6576de6c92e845d38854b65')
2023-08-18 00:14:06.157 | INFO     | __main__:main:102 - Building lean-dojo-mew
info: cloning https://github.com/leanprover-community/mathlib4.git to ./lake-packages/mathlib
info: cloning https://github.com/leanprover/std4 to ./lake-packages/std
info: cloning https://github.com/gebner/quote4 to ./lake-packages/Qq
info: cloning https://github.com/JLimperg/aesop to ./lake-packages/aesop
info: cloning https://github.com/mhuisi/lean4-cli.git to ./lake-packages/Cli
info: cloning https://github.com/EdAyers/ProofWidgets4 to ./lake-packages/proofwidgets
error: no such file or directory (error code: 2)
  file: ./././MiniF2F.lean
Traceback (most recent call last):
  File "/private/var/folders/q5/p2f1mhz563l8_98p40ssf48h0000gq/T/tmpt6l3shge/workspace/build_lean4_repo.py", line 149, in <module>
    main()
  File "/private/var/folders/q5/p2f1mhz563l8_98p40ssf48h0000gq/T/tmpt6l3shge/workspace/build_lean4_repo.py", line 128, in main
    run_cmd(f"lake build")
  File "/private/var/folders/q5/p2f1mhz563l8_98p40ssf48h0000gq/T/tmpt6l3shge/workspace/build_lean4_repo.py", line 28, in run_cmd
    res = subprocess.run(cmd, shell=True, capture_output=capture_output, check=True)
  File "/usr/local/Cellar/python@3.10/3.10.12_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'lake build' returned non-zero exit status 1.
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[10], line 1
----> 1 with Dojo(theorem) as (dojo, init_state):
      2   print(init_state)
      3   # result = dojo.run_tac(init_state, "rw [add_assoc, add_comm b, ←add_assoc]")
      4   # assert isinstance(result, ProofFinished)
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/interaction/dojo.py:302, in Dojo.__enter__(self)
    300 os.chdir(self.origin_dir)
    301 shutil.rmtree(self.tmp_dir)
--> 302 raise ex
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/interaction/dojo.py:207, in Dojo.__enter__(self)
    205 if repo.is_lean4:
    206     raise NotImplementedError("InteractingLean 4 is not supported yet.")
--> 207 traced_repo_path = get_traced_repo_path(repo)
    209 # Copy and `cd` into the repo.
    210 shutil.copytree(
    211     traced_repo_path,
    212     repo.name,
    213     ignore=ignore_patterns("*.dep_paths", "*.ast.json", "*.trace.xml"),
    214 )
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/data_extraction/trace.py:135, in get_traced_repo_path(repo)
    133 with working_directory() as tmp_dir:
    134     logger.debug(f"Working in the temporary directory {tmp_dir}")
--> 135     _trace(repo)
    136     traced_repo = TracedRepo.from_traced_files(tmp_dir / repo.name)
    137     traced_repo.save_to_disk()
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/data_extraction/trace.py:66, in _trace(repo)
     64 else:
     65     assert repo.uses_lean4
---> 66     _trace_lean4(repo)
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/data_extraction/trace.py:105, in _trace_lean4(repo)
     99 container = get_container()
    100 mts = {
    101     Path.cwd() / repo.name: f"/workspace/{repo.name}",
    102     LEAN4_BUILD_SCRIPT_PATH: f"/workspace/{LEAN4_BUILD_SCRIPT_PATH.name}",
    103     LEAN4_DATA_EXTRACTOR_PATH: f"/workspace/{repo.name}/{LEAN4_DATA_EXTRACTOR_PATH.name}",
    104 }
--> 105 container.run(
    106     f"python3 build_lean4_repo.py {repo.name}",
    107     create_mounts(mts),
    108     {"NUM_PROCS": NUM_PROCS},
    109     as_current_user=True,
    110     work_dir="/workspace",
    111 )
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/container.py:193, in NativeContainer.run(self, command, mounts, envs, as_current_user, capture_output, cpu_limit, memory_limit, work_dir)
    190         work_dir = Path.cwd() / work_dir.relative_to(work_dir.root)
    192 with working_directory(work_dir):
--> 193     execute(cmd, capture_output=capture_output)
    195 self._unmount_files(mounts)
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/utils.py:116, in execute(cmd, capture_output)
    114         logger.info(ex.stdout.decode())
    115         logger.error(ex.stderr.decode())
--> 116     raise ex
    117 if not capture_output:
    118     return None
File ~/venv/lean-dojo-benchmark/lib/python3.10/site-packages/lean_dojo/utils.py:111, in execute(cmd, capture_output)
    101 """Execute the shell command ``cmd`` and optionally return its output.
    102
    103 Args:
   (...)
    108     Optional[Tuple[str, str]]: The command's output, including stdout and stderr (None if ``capture_output == False``).
    109 """
    110 try:
--> 111     res = subprocess.run(cmd, shell=True, capture_output=capture_output, check=True)
    112 except subprocess.CalledProcessError as ex:
    113     if capture_output:
File /usr/local/Cellar/python@3.10/3.10.12_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/subprocess.py:526, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    524     retcode = process.poll()
    525     if check and retcode:
--> 526         raise CalledProcessError(retcode, process.args,
    527                                  output=stdout, stderr=stderr)
    528 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command 'NUM_PROCS=12 python3 build_lean4_repo.py lean-dojo-mew' returned non-zero exit status 1.
It appears to be a problem with lean build, however locally I have built and used this project just fine. Wondering if there is anything that I am missing. Some help would be appreciated! Also, really cool tool!
Scott Morrison (Aug 18 2023 at 04:18):
The pertinent part of the error is surely
error: no such file or directory (error code: 2)
  file: ./././MiniF2F.lean
Rahul Saha (Aug 18 2023 at 04:24):
Ah I forgot that file while trying to construct a minimal worked example. Let me update that, sorry about that
Rahul Saha (Aug 18 2023 at 04:31):
Ah, never mind, seems like my gitignore was preventing this file from being pushed to the repository, which created this error. Thanks!
Kaiyu Yang (Aug 18 2023 at 14:56):
@Rahul Saha I'm glad the problem has been resolved. I also noted that you're using an old version of LeanDojo. It would be great to upgrade to the latest version since we fixed a few minor bugs recently.
Rahul Saha (Aug 22 2023 at 18:40):
@Kaiyu Yang Hi, thank you for your input. I just installed lean-dojo from PyPI again, and ran into the following error.
2023-08-22 14:32:18.453 | WARNING  | lean_dojo.interaction.dojo:__init__:172 - Using Lean 4 without a hard timeout may hang indefinitely.
2023-08-22 14:32:18.751 | INFO     | lean_dojo.data_extraction.trace:get_traced_repo_path:131 - Tracing LeanGitRepo(url='https://github.com/rah4927/lean-dojo-mew', commit='3940892546b1371385b34871d73015b9ae50b77a')
Traceback (most recent call last):
  File "build_lean4_repo.py", line 12, in <module>
    from loguru import logger
ModuleNotFoundError: No module named 'loguru'
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
Cell In[5], line 1
----> 1 with Dojo(theorem) as (dojo, init_state):
      2   print(init_state)
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/interaction/dojo.py:283, in Dojo.__enter__(self)
    281 os.chdir(self.origin_dir)
    282 shutil.rmtree(self.tmp_dir)
--> 283 raise ex
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/interaction/dojo.py:203, in Dojo.__enter__(self)
    200 os.chdir(self.tmp_dir)
    202 # Copy and `cd` into the repo.
--> 203 traced_repo_path = get_traced_repo_path(self.repo)
    204 shutil.copytree(
    205     traced_repo_path,
    206     self.repo.name,
    207     ignore=ignore_patterns("*.dep_paths", "*.ast.json", "*.trace.xml"),
    208 )
    209 os.chdir(self.repo.name)
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/data_extraction/trace.py:134, in get_traced_repo_path(repo)
    132 with working_directory() as tmp_dir:
    133     logger.debug(f"Working in the temporary directory {tmp_dir}")
--> 134     _trace(repo)
    135     traced_repo = TracedRepo.from_traced_files(tmp_dir / repo.name)
    136     traced_repo.save_to_disk()
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/data_extraction/trace.py:66, in _trace(repo)
     64 else:
     65     assert repo.uses_lean4
---> 66     _trace_lean4(repo)
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/data_extraction/trace.py:104, in _trace_lean4(repo)
     98 container = get_container()
     99 mts = {
    100     Path.cwd() / repo.name: f"/workspace/{repo.name}",
    101     LEAN4_BUILD_SCRIPT_PATH: f"/workspace/{LEAN4_BUILD_SCRIPT_PATH.name}",
    102     LEAN4_DATA_EXTRACTOR_PATH: f"/workspace/{repo.name}/{LEAN4_DATA_EXTRACTOR_PATH.name}",
    103 }
--> 104 container.run(
    105     f"python3 build_lean4_repo.py {repo.name}",
    106     create_mounts(mts),
    107     {"NUM_PROCS": NUM_PROCS},
    108     as_current_user=True,
    109     work_dir="/workspace",
    110 )
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/container.py:193, in NativeContainer.run(self, command, mounts, envs, as_current_user, capture_output, cpu_limit, memory_limit, work_dir)
    190         work_dir = Path.cwd() / work_dir.relative_to(work_dir.root)
    192 with working_directory(work_dir):
--> 193     execute(cmd, capture_output=capture_output)
    195 self._unmount_files(mounts)
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/utils.py:116, in execute(cmd, capture_output)
    114         logger.info(ex.stdout.decode())
    115         logger.error(ex.stderr.decode())
--> 116     raise ex
    117 if not capture_output:
    118     return None
File ~/.pyenv/versions/lean-dojo-benchmark/lib/python3.9/site-packages/lean_dojo/utils.py:111, in execute(cmd, capture_output)
    101 """Execute the shell command ``cmd`` and optionally return its output.
    102
    103 Args:
   (...)
    108     Optional[Tuple[str, str]]: The command's output, including stdout and stderr (None if ``capture_output == False``).
    109 """
    110 try:
--> 111     res = subprocess.run(cmd, shell=True, capture_output=capture_output, check=True)
    112 except subprocess.CalledProcessError as ex:
    113     if capture_output:
File ~/.pyenv/versions/3.9.16/lib/python3.9/subprocess.py:528, in run(input, capture_output, timeout, check, *popenargs, **kwargs)
    526     retcode = process.poll()
    527     if check and retcode:
--> 528         raise CalledProcessError(retcode, process.args,
    529                                  output=stdout, stderr=stderr)
    530 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command 'NUM_PROCS=32 python3 build_lean4_repo.py lean-dojo-mew' returned non-zero exit status 1.
The error in question seems to be that there is no module named loguru. However, I installed loguru in my environment and even ran from loguru import logger in my notebook to make sure that it imports correctly. Wondering if this is an issue introduced in the latest version? Below is the code that I ran: 
import os
os.environ['CONTAINER'] = 'native'
from lean_dojo import *
repo = LeanGitRepo("https://github.com/rah4927/lean-dojo-mew", "12fec4d9f39ac9a41c6c1f058efab3ec41c028af")
theorem = Theorem(repo, "MiniF2F/Example.lean", "amc12a_2019_p21")
with Dojo(theorem) as (dojo, init_state):
  print(init_state)
EDIT: The following piece of code seems to work just fine:
repo = LeanGitRepo("https://github.com/yangky11/lean4-example", "7d711f6da4584ecb7d4f057715e1f72ba175c910")
theorem = Theorem(repo, "Lean4Example.lean", "hello_world")
with Dojo(theorem) as (dojo, init_state):
  print(init_state)
  result = dojo.run_tac(init_state, "rw [add_assoc, add_comm b, ←add_assoc]")
  assert isinstance(result, ProofFinished)
  print(result)
which makes me think the issue is somewhere in the repository, but I am not sure where it could be, since lean works fine in that project locally.
Kaiyu Yang (Aug 22 2023 at 20:01):
@Rahul Saha  Can you quickly try if python3 -c "import loguru" works? The second example works probably because the repo is downloaded from the remote cache instead of built locally. Setting VERBOSE=1 gives you additional information. Instead of posting it here, can you open an issue on GitHub following https://github.com/lean-dojo/LeanDojo#questions-and-bugs? It will instruct you to run in the debug mode and add more information I need to figure out what's going on.
Rahul Saha (Aug 22 2023 at 20:23):
@Kaiyu Yang yes, python3 -c "import loguru" works fine. I have created an issue here - https://github.com/lean-dojo/LeanDojo/issues/41, thank you!
Lev Stambler (Aug 23 2023 at 18:01):
Hello, I am trying to clone a fork of math-lib (Lean 3) and trace it. I am finding though that trace returns an error saying that the mount path does not exist:
docker: Error response from daemon: invalid mount config for type "bind": bind source path does not exist: /tmp/tmplknmfziv/mathlib-clustering.
See 'docker run --help'.
Kaiyu Yang (Aug 23 2023 at 18:49):
@Lev Stambler Instead of posting it here, can you open an issue on GitHub following https://github.com/lean-dojo/LeanDojo#questions-and-bugs? It will instruct you to run in the debug mode and add more information I need to figure out what's going on.
Adam Kurkiewicz (Apr 21 2024 at 11:57):
Hey @Kaiyu Yang , I'm currently reading the LeanDojo paper:
https://arxiv.org/pdf/2306.15626.pdf
I'd like to try a very simple proof of concept (ideally without running my own instance of any LLM), just to learn a bit more about this field.
In particular, I'd like to try this "beam search" strategy, which you've described in the appendix C.1 on page 24:
Screenshot-2024-04-21-at-13.47.17.png
where the sum of log-likelihoods of each path in the search tree determines the next evaluation of the tree (hopefully leading to a correct solution!). I'd like to play around a bit with that and see what kind of search tree evaluation strategies give the best results.
To keep things really simple, I wanted to try one of the APIs for an existing, generic LLM. The problem is, I can't find any modern, hosted LLM that would give me access to the logprobs through their API (I think older models used to have a "logprobs" parameter in the API call, see for example here).
Any suggestions? Is my understanding correct that without access to logprobs from the model one simply cannot do an effective search for the proof of a given premise?
Kaiyu Yang (Apr 21 2024 at 14:32):
Hi @Adam Kurkiewicz,
I'm not familiar with whether you can get log-probs from hosted models such as GPT-4. If not, you could try to ask it directly output a confidence score. Or you could hard-code the scores, e.g., to a constant, and see how well it works.
For implementing the search algorithm, you may want to refer to https://github.com/lean-dojo/ReProver/tree/main/prover.
Adam Kurkiewicz (Apr 22 2024 at 11:39):
Thank you @Kaiyu Yang. I'll look up the implementation of the search algorithm.
In meanwhile it turns out that OpenAI recently restored access to logprobs to their new models through a new API:
https://platform.openai.com/docs/api-reference/chat/create?lang=python
I got a simple script working.
I'll let you know how I get on with the rest of this test project :)
Last updated: May 02 2025 at 03:31 UTC

