Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: The use of tactics in the creation of data sets


Srivatsa Srinivas (Jan 24 2025 at 20:37):

Does anyone know if using tactics can hinder the quality of a data set used to train a transformer to auto-formalize English proofs?

For example, the problem `(r ∈ ℚ) → Irrational x → Irrational (r + x)` gets one-shotted by `simp` because of `Rat.cast_add`.
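Roughly, in Lean (a minimal sketch, assuming current Mathlib; `irrational_rat_add_iff` is my guess at the simp lemma doing the work here, on top of cast facts like `Rat.cast_add`):

```lean
import Mathlib

-- A minimal sketch (assuming current Mathlib naming). The goal
-- `Irrational (↑r + x)` is discharged by `simp`, which rewrites it to
-- `Irrational x` via simp lemmas built on cast facts like `Rat.cast_add`,
-- leaving nothing but the hypothesis `hx`.
example (r : ℚ) (x : ℝ) (hx : Irrational x) : Irrational (r + x) := by
  simpa using hx
```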

But not using tactics would make the creation of a large enough data set quite exhausting. Any thoughts?

Jason Rute (Jan 25 2025 at 00:23):

I don't know what this means. Are you asking if it is a challenge that natural language proofs don't have a literal equivalent in formal proofs? Some proofs are long in natural language and short in Lean, while others are long in Lean but short in natural language. If so, I do imagine it is quite a challenge. I don't know if there is really all that much work on automatically translating proofs. There are some papers like Draft-Sketch-Prove and its successors. It seems from those papers that hierarchical approaches help, since you can then match up the higher, more abstract levels of the proof with blocks of code which are then filled in by proof search. Also, one shouldn't expect a perfect one-shot translation, but should have mechanisms for filling in or fixing low-level details. I could imagine this approach getting more refined as the field improves (which, again, has very little research that I'm aware of).

Jason Rute (Jan 25 2025 at 00:25):

Also, is there currently any dataset for auto-formalization of proofs? There are datasets for formalization of statements, but are there any for proofs?

Srivatsa Srinivas (Jan 25 2025 at 00:39):

@Jason Rute

Thanks for the response to my vague question. Here is what I meant: natural language proofs tend to keep mathematical information, such as "the product of two rationals is rational", while skipping type-theoretic information such as `congrArg Rat.cast`. We can skip the type-theoretic details using tactics like `rw`, `exact`, and `simp`, but sometimes these can solve a goal without the user inputting any mathematical content, because the `@[simp]` attribute covers so many theorems. (In the case posted originally, it's `Rat.cast_add`.)
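To make the contrast concrete (a rough sketch, assuming Mathlib's `Rat.cast_add`): the tactic proof records no mathematical content, while the term-mode proof names the lemma doing the work.

```lean
import Mathlib

-- The same cast fact two ways (a sketch, assuming Mathlib's `Rat.cast_add`).

-- Tactic style: `simp` finds `Rat.cast_add` on its own, so the proof
-- text carries no mathematical content.
example (p q : ℚ) : ((p + q : ℚ) : ℝ) = (p : ℝ) + (q : ℝ) := by
  simp

-- Term style: the type-theoretic content is spelled out explicitly.
example (p q : ℚ) : ((p + q : ℚ) : ℝ) = (p : ℝ) + (q : ℝ) :=
  Rat.cast_add p q
```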

The question I'm interested in is: would training data that skips a lot of mathematical content due to tactics be useful to train on?

There are datasets for proofs, but I think the problems don't go beyond high-school level. I believe they are by InternLM.

Jason Rute (Jan 25 2025 at 01:56):

If you know a reference, I would be curious. (I haven't done the best job at keeping track of autoformalization.)

Jason Rute (Jan 25 2025 at 01:58):

But as for your question "Would training data that skips a lot of mathematical content due to tactics be useful to train on?", I assume the answer is yes. Generally, more diverse data is better, no? Of course, if one had better-aligned data that explained step by step how to go from natural language to formal, with reasons for the particulars of Lean, that would probably be even better, but that is more work. That said, I'm not a practitioner in autoformalization, so I could be mistaken.

Srivatsa Srinivas (Jan 25 2025 at 03:23):

Here is the data set: https://huggingface.co/datasets/internlm/Lean-Workbook

Srivatsa Srinivas (Jan 25 2025 at 03:24):

Thanks for the input! I just needed to discuss this to flesh out some thoughts.

Srivatsa Srinivas (Jan 25 2025 at 03:26):

Okay, these guys use tactics

Jason Rute (Jan 25 2025 at 03:31):

I don’t see any natural language proofs in that dataset. Am I missing something?

Srivatsa Srinivas (Jan 26 2025 at 19:18):

You are right. They only have a few proofs (a statement they make, and one that I have not verified). I do not know whether those proofs have natural language alongside them.


Last updated: May 02 2025 at 03:31 UTC