Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Assistance for GPT-Based Lean research project

Vir Malhotra (Jul 14 2023 at 08:12):

Hello everyone! I am a new member here and in the Lean world in general. My name is Vir, a first-year at NYU Abu Dhabi. As part of a research project, my team is looking into the possibility of building a program, similar to GPT-f, but instead of finding better proofs, it's aimed at educating undergraduates in proofs. The way we wish to go about it is that the user enters the proof steps step-by-step, and our GPT-integrated program will translate it to Lean3 code, so that users don't have to learn the language and instead can understand everything through natural language.

The issue I have now is that the dataset I'm looking for should mostly pertain to Propositional Logic, as we want to carry-out very limited testing to begin with. I want to build a dataset that has step-by-step natural language proofs followed by Lean3 proofs preferably only including propositional logic proofs. However, I am unable to find any such dataset.

Is there a way to get only logic-based data from mathlib? Or maybe even how to find a decent amount of examples to build my own. Any suggestions and tips for the same would be greatly appreciated.

Zhangir Azerbayev (Jul 14 2023 at 18:38):

Two resources that might be relevant to your interests:

Dan Velleman's "How to Prove it with Lean"

Patrick Massot's IPAM talk

If you're interested only in propositional logic, I'm sure you could generate statements procedurally.

Heather Macbeth (Jul 14 2023 at 18:42):

I think you'll have the problem here that pure logic deductions are actually conventionally expressed symbolically, rather than in natural language. So there may not be much training data available for natural language logic reasoning.

Heather Macbeth (Jul 14 2023 at 18:42):

Have you considered using a different "toy model"?

Vir Malhotra (Jul 14 2023 at 20:44):

Zhangir Azerbayev said:

Two resources that might be relevant to your interests:

Dan Velleman's "How to Prove it with Lean"

Patrick Massot's IPAM talk

If you're interested only in propositional logic, I'm sure you could generate statements procedurally.

Thank you!
Yeah the only problem is that generating enough for a dataset on my own might be tough because we have a limited summer time span at the moment.

Vir Malhotra (Jul 14 2023 at 20:46):

Heather Macbeth said:

I think you'll have the problem here that pure logic deductions are actually conventionally expressed symbolically, rather than in natural language. So there may not be much training data available for natural language logic reasoning.

That makes sense. The thing is the issue isn't particularly with natural language, as I can always get GPT to translate code into natural language and it does that pretty well. The issue is finding a lean dataset specifically for a single 'toy model' as I'm not sure how to do so. All the existing datasets are huge and diverse and might not be what we're looking for rn.

Vir Malhotra (Jul 14 2023 at 20:47):

I'll look into basing it on other topics though, thank you.

yufeng shen (Jul 16 2023 at 05:03):

Zhangir Azerbayev said:

Two resources that might be relevant to your interests:

Dan Velleman's "How to Prove it with Lean"

Patrick Massot's IPAM talk

If you're interested only in propositional logic, I'm sure you could generate statements procedurally.

Just curious is Prof Massot's tool publicly available? The example he showed is quite interesting.

Heather Macbeth (Jul 16 2023 at 05:04):

It's not publicly available yet, but you can see a second talk by @Kyle Miller with more details at https://www.msri.org/summer_schools/1021/schedules/33480

Last updated: May 02 2025 at 03:31 UTC