Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Lean Workbook: A large-scale Lean formalized problem set
Huaiyuan Ying (Jun 07 2024 at 02:38):
We have released two synthetic datasets containing 57k and 83k aligned data points for formalized statements translated from natural language math problems. The datasets contain problems from algebra, number theory, and combinatorics, and were evaluated manually to ensure high accuracy. About 5k problems in Lean Workbook have proofs found by search based on InternLM-Math-Plus.
Lean Workbook & Lean Workbook Plus: https://huggingface.co/datasets/internlm/Lean-Workbook
Meanwhile, we open-sourced our base model and pipeline for data generation: https://github.com/InternLM/InternLM-Math
We are also formalizing 21 new IMO problems with this pipeline and working on a pull request to Compfiles: https://github.com/dwrensha/compfiles/pull/27
The arXiv version of our paper: https://arxiv.org/abs/2406.03847
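For anyone who wants to poke at the records programmatically, here is a minimal, hypothetical sketch; the `load_dataset` call and split name are assumptions to check against the dataset card, and the sample record mirrors the field layout shown later in this thread:

```python
# Minimal sketch of working with Lean Workbook records. To load the real data
# (requires network), something like the following should work with the
# Hugging Face `datasets` library -- check the dataset card for split names:
#     from datasets import load_dataset
#     ds = load_dataset("internlm/Lean-Workbook", split="train")

# Hard-coded sample mirroring the record format shown in this thread
# (string contents elided here for brevity):
sample = {
    "natural_language_statement": "prove that ...",
    "formal_statement": "theorem lean_workbook_27094 ... := by sorry",
    "proof": [],            # empty list: no searched proof for this statement
    "split": "lean_workbook",
    "tags": ["inequality"],
}

def has_searched_proof(record: dict) -> bool:
    """True when the record carries at least one proof found by search."""
    return len(record.get("proof", [])) > 0

print(has_searched_proof(sample))  # False: this record has no proof yet
```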
Kevin Buzzard (Jun 07 2024 at 05:33):
Thanks so much! It's quite rare to see data like this, in my opinion, and I do find it interesting to see what people are using to train models.
Bolton Bailey (Jun 07 2024 at 19:43):
Thanks to the authors for providing this dataset; it seems very large and potentially useful for the community.
Given discussion in the past about how hard it is to get properly formalized data, I decided to look into whether the formalizations in the data download match the descriptions. I did this for 10 problems (numbers randomly selected by random.org).
- 408: correct
- 2077: correct
- 5214: correct
- 27195
  - seems to be incorrect
  - the variables and conditions about the values of x and y are absent.
- 30705
  - seems to be incorrect
  - The problem statement is of the form "Find the maximum of this expression subject to these constraints", but the corresponding formalization is "Prove that any value of this expression subject to these constraints is at most 7/18". So this suffers from the unavoidable "determine" issue, as well as the avoidable-but-tricky "maximum" vs "at most" issue.
- 37823: correct
- 38757: correct, though it is a "determine" statement
- 40987
- Problem statement is: Prove that for all $x \ge 0$, $\cos{x} = \sin(\frac{\pi}{2} - x)$ and $\sin{x} \le x$
- Formalization doesn't include the second part.
- Also, it's a bit strange for a dataset problem to be just the conjunction of two theorems that are (or should be) in Mathlib; maybe this has to do with how the database was constructed.
- 52887
  - seems to be incorrect:
  - It uses the Lean expression
    5 / 2 * (a * b + b * c + c * a)
    when it should be
    5 / (2 * (a * b + b * c + c * a))
- 53367
  - The Lean code seems like it will not compile (not sure how this got past the "Compiling Correctness test" mentioned in the paper): \n characters from the LaTeX problem statement are erroneously interspersed in the Lean formalization. (Though maybe these \n characters are just meant to be newline characters in the code; that would make sense.)
  - Other than that, the formalization looks correct.
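For what it's worth, the "maximum" vs "at most" gap in item 30705 can be made precise in Lean. A minimal sketch, assuming Mathlib, with a set `S` standing in for the attainable values (not the dataset's actual statement):

```lean
import Mathlib

-- A faithful "find the maximum" formalization asserts both conjuncts of
--   IsGreatest S (7/18) : 7/18 ∈ S ∧ 7/18 ∈ upperBounds S
-- while the "at most" formalization only proves the bound, i.e. it follows
-- from (and is strictly weaker than) the IsGreatest statement:
example (S : Set ℝ) (h : IsGreatest S (7 / 18)) : ∀ x ∈ S, x ≤ 7 / 18 :=
  fun _ hx => h.2 hx
```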
So a score of about 6 to 8 out of 10, depending on how generous you want to be. Not great, but also not terrible.
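One likely cause of errors like the one in 52887: in Lean, `/` and `*` have the same precedence and associate to the left, so the two expressions genuinely differ. A quick illustration (assuming Mathlib's `ℚ` notation):

```lean
import Mathlib

-- `5 / 2 * x` parses as `(5 / 2) * x`, not `5 / (2 * x)`:
#eval (5 / 2 * 10 : ℚ)    -- (5/2) * 10 = 25
#eval (5 / (2 * 10) : ℚ)  -- 5/20 = 1/4
```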
Bolton Bailey (Jun 07 2024 at 19:53):
FWIW, since seeing a formalization error in an OpenAI blog post, I have never been very confident that the rates of perfectly correct formalization in these datasets are very high. Kudos to the authors for their work analyzing this in Section 5.
Adam Kurkiewicz (Jun 08 2024 at 16:38):
@Huaiyuan Ying , thank you for providing this dataset!
I'm looking at auto-formalisation of randomly-chosen problem 27094 from Lean Workbook:
{'answer': '',
 'formal_statement': 'theorem lean_workbook_27094 (x y z : ℝ) (h : x * y * z = 1) : (x * y + y * z + z * x) * (1 / (x ^ 2 + 2 * x * y) + 1 / (y ^ 2 + 2 * y * z) + 1 / (z ^ 2 + 2 * z * x)) ≥ 3 := by sorry',
 'natural_language_statement': 'prove that \\n $ (xy+yz+zx) \\cdot \\left ( \\frac{1}{x^2+2xy}+\\frac{1}{y^2+2yz}+\\frac{1}{z^2+2zx} \\right ) \\geqslant 3.$\\n',
 'proof': [],
 'split': 'lean_workbook',
 'tags': ['inequality']}
I'm having trouble telling if the discrepancy is due to an error in the formal statement or an error in the natural language statement.
The difference between the two statements is that the formal statement has an extra assumption (h : x * y * z = 1). So the question is whether the auto-formalisation model hallucinated the extra assumption, or whether the OCR model that produced the natural language statement (I'm assuming that's how you do it; I'll need to check in the paper) made a transcription error.
For ease of reference, here's rendered latex of the natural language problem statement:
Screenshot-2024-06-08-at-18.24.57.png
The reason I'm asking is that I think it would be possible to crowd-source improvements to this dataset in a citizen-science fashion (if we could get 1000 participants each formalising one theorem a week, we could cover almost all of the Lean Workbook in a year).
But I think for this we would need to be able to cross-reference the original corpus from which you sourced the natural language statements. In this case, for example, it's hard to tell if the assumption is truly needed or not. I suspect it is not: I did some light numerical testing of the inequality, and it seemed to hold even without the assumption.
How hard would it be to cross-reference the original corpus from which you sourced the problems to verify correctness of the natural language statement?
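As a side note, this kind of light numerical testing can be done inside Lean itself with `Float`; this is a rough spot check only (it says nothing about degenerate denominators or rounding), and `lhs` is just a name I made up:

```lean
-- Value of the left-hand side, with no x*y*z = 1 normalization:
def lhs (x y z : Float) : Float :=
  (x*y + y*z + z*x) *
    (1/(x*x + 2*x*y) + 1/(y*y + 2*y*z) + 1/(z*z + 2*z*x))

#eval lhs 1.0 2.0 3.0  -- 869/240 ≈ 3.62, so ≥ 3 here
#eval lhs 0.5 0.5 0.5  -- x*y*z = 1/8 ≠ 1, and the value is exactly 3
```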
Adam Kurkiewicz (Jun 08 2024 at 16:50):
I tried to find this problem in the AoPS dataset, which I think is how you sourced it, but unfortunately I think I broke their search server, whoops :/ (It could be that using a star in a search query isn't a good idea!)
Screenshot-2024-06-08-at-18.49.17.png
I think having a "source" tag in the json would be fantastic!
InternLM-Math (Jun 09 2024 at 01:26):
Adam Kurkiewicz said:
I tried to find this problem in the AoPS dataset, which I think is how you sourced it, but unfortunately I think I broke their search server, whoops :/ (It could be that using a star in a search query isn't a good idea!)
Screenshot-2024-06-08-at-18.49.17.png
I think having a "source" tag in the json would be fantastic!
We do not use OCR to collect natural language statements; we download the AoPS posts and use an LLM to extract the problem from each. So the error can come from (1) LLM extraction or (2) autoformalization. Adding a cross-reference to Lean Workbook is not very easy for us, but considering its importance we will work on it.
Kevin Buzzard (Jun 09 2024 at 06:30):
The inequality is homogeneous -- if you multiply all of x,y,z by 37 then it doesn't change the truth of the inequality. Hence if you know that all of x,y,z are nonzero (and the problem makes no sense if two of them are zero) then you can scale to make the product 1.
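Concretely, this suggests the normalization hypothesis could be dropped for nonzero x, y, z. A hypothetical, unproved restatement (the `sorry` is deliberate, and the nonzero hypotheses are my guess at what the degenerate cases require):

```lean
import Mathlib

-- Under (x, y, z) ↦ (t*x, t*y, t*z) the first factor scales by t^2 and each
-- reciprocal by t^(-2), so the product is invariant; x*y*z scales by t^3, so
-- any nonzero product can be rescaled to 1 via the real cube root of 1/(x*y*z).
theorem lean_workbook_27094' (x y z : ℝ) (hx : x ≠ 0) (hy : y ≠ 0) (hz : z ≠ 0) :
    (x * y + y * z + z * x) *
      (1 / (x ^ 2 + 2 * x * y) + 1 / (y ^ 2 + 2 * y * z) + 1 / (z ^ 2 + 2 * z * x)) ≥ 3 := by
  sorry
```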
Last updated: May 02 2025 at 03:31 UTC