Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Wrong informal proof by AI


(deleted) (Sep 26 2025 at 05:06):

AI prompt: How to prove this, give me proof in English (not code). Please prove with elementary approach only

```lean
p : ℕ
hp : Nat.Prime p
h : ∀ (q : ℕ), Nat.Prime q → q < p → Squarefree (p % q)
hp_le3 : ¬p ≤ 3
hp_gt3 : 3 < p
hp_ge4 : 4 ≤ p
hp_ge5 : 5 ≤ p
hp_gt4 : 4 < p
hdiv_three : ∀ {q : ℕ}, Nat.Prime q → q ∣ p - 4 → q = 3
b : ℕ
hb : p - 4 = 3 ^ b
hp_eq : p = 3 ^ b + 4
hb_ge : ¬b ≤ 2
hb_gt2 : 2 < b
hb3 : 3 ≤ b
⊢ False
```

(deleted) (Sep 26 2025 at 05:06):

The above is the prompt I sent to Gemini Deep Think. This is the initial answer.

(deleted) (Sep 26 2025 at 05:07):

View the revision history for the proof.

(deleted) (Sep 26 2025 at 05:08):

And the problem with the above proof is that it only considers the easiest case, b = 3, and assumes that settles it. So the problem with AI-generated informal proofs is that there is no guarantee they are correct, which means they have little industrial value.
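To make the gap concrete (my illustration, not from the AI output): the hypotheses do not stop at b = 3. For example, b = 6 gives p = 3 ^ 6 + 4 = 733, which is also prime, so the proof still has to contradict h for that case too (here, e.g., 733 % 103 = 12 = 2² · 3, which is not squarefree). A quick sanity check in Lean, assuming Mathlib is available:

```lean
import Mathlib

-- b = 3 gives p = 31, the only case the proof handles:
#eval Nat.minFac (3 ^ 3 + 4)   -- 31, i.e. 31 is prime
-- but b = 6 gives another prime, so stopping at b = 3 is incomplete:
#eval Nat.minFac (3 ^ 6 + 4)   -- 733, i.e. 733 is prime
-- for p = 733 the contradiction with h comes from q = 103:
#eval (3 ^ 6 + 4) % 103        -- 12 = 2 ^ 2 * 3, not squarefree
```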

(deleted) (Sep 26 2025 at 05:10):

This is the proof generated by GPT-5 Pro. I haven't checked it, but it doesn't look as obviously wrong.

(deleted) (Sep 26 2025 at 05:10):

https://chatgpt.com/s/t_68d62f201ba081919055f0ce8d2640a6

Jason Rute (Sep 26 2025 at 05:53):

The long AI-generated output here distracts from what you are trying to say. Maybe put the AI output in spoiler blocks or some other collapsible block.

(deleted) (Sep 26 2025 at 06:03):

I do realize that, but I can't do it :(

(deleted) (Sep 26 2025 at 06:04):

But my main message here is that during my day-to-day work I rely on AI-generated proofs, and they can be wrong in a frustrating way.

(deleted) (Sep 26 2025 at 06:04):

I sang Gemini Deep Think's praises because previously I'd never encountered an incorrect proof from it. Too bad, now there's one.

(deleted) (Sep 26 2025 at 06:14):

OK I did what I could.

Jason Rute (Sep 26 2025 at 06:30):

Huỳnh Trần Khanh said:

I do realize that, but I can't do it :(

Markdown code for spoiler
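For reference (presumably what the link above pointed to), Zulip's spoiler syntax is a fenced block with an optional header:

````text
```spoiler Gemini Deep Think output
(paste the long AI-generated proof here)
```
````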

Deleted User 968128 (Sep 26 2025 at 19:59):

Huỳnh Trần Khanh said:

AI prompt: How to prove this, give me proof in English (not code). Please prove with elementary approach only

Yeah, I think the key value prop is verifiable correctness. For any real Lean question, the model would always check its output against a compiler.

Anything informal is likely to cause a lot of disappointment and hallucination, and I'd even go so far as to argue it's not something labs should prioritize. I don't think 'AGI' is a productive goal; it's just going to lead to scheming AI. https://www.arxiv.org/pdf/2509.15541

Deleted User 968128 (Sep 26 2025 at 20:02):

OpenAI has pretty much admitted that hallucinations are impossible to fix with current model architectures. The beauty of domains like code and Lean is that there they can be fixed: if the output fails to run, you know it's wrong and can throw it out (ignoring various edge cases, of course, but those are already limited and can be limited further).
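As a toy illustration (mine, not from the thread): a hallucinated Lean proof is caught mechanically, because it simply fails to elaborate.

```lean
-- A fabricated "proof" of a false statement is rejected by the elaborator:
example : 2 + 2 = 5 := rfl   -- error: rfl fails, since 2 + 2 reduces to 4, not 5
-- while a correct one is accepted:
example : 2 + 2 = 4 := rfl   -- checks
```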

Deleted User 968128 (Sep 26 2025 at 20:02):

https://arxiv.org/pdf/2509.04664


Last updated: Dec 20 2025 at 21:32 UTC