Zulip Chat Archive
Stream: mathlib4
Topic: What are all the checks in Mathlib
Jason Rute (May 02 2025 at 23:07):
In light of the bug found by DeepSeek, I’m curious what Mathlib does to check against the following possible issues:
- check that the proof is indeed correct by the kernel
- check that no `sorry` or other axioms were used (besides the usual ones)
- check that theorems made it into the environment
- check that the theorem statement in the environment is as expected.
I assume the first is done by lean4checker. Does it also cover the second? Is there anything for the last two? If a theorem is used elsewhere, that would probably surface a problem, but if it is a leaf theorem, it could be skipped without lean4checker noticing, right?
Eric Wieser (May 03 2025 at 00:06):
If it didn't make it into the environment, I wouldn't expect to see it in the docs
Yury G. Kudryashov (May 03 2025 at 01:03):
- AFAIK, axioms aren't covered. There is (pretty straightforward) meta code for this, and we're waiting for people who care about avoiding `Classical` to turn it into a linter (see the sketch below).
- If Lean silently fails on a theorem, then we won't see it until someone tries to use it.
- We have nothing to compare the statement in the environment to, so we don't implement any checks here.
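Something like this minimal sketch, assuming recent Lean 4 core APIs (`#check_axioms`, `allowedAxioms`, and `em_demo` are made-up names; `Lean.collectAxioms` is the helper behind `#print axioms`):

```lean
import Lean

open Lean Elab Command

/-- Axioms this hypothetical linter accepts. `Classical.choice` is
deliberately omitted, matching the "avoid `Classical`" use case. -/
def allowedAxioms : List Name := [``propext, ``Quot.sound]

/-- Fail unless the named declaration uses only whitelisted axioms. -/
elab "#check_axioms " id:ident : command => do
  let declName ← resolveGlobalConstNoOverload id
  let usedAxioms ← collectAxioms declName
  let bad := usedAxioms.filter fun ax => !allowedAxioms.contains ax
  unless bad.isEmpty do
    throwError "{declName} uses non-whitelisted axioms: {bad}"
  logInfo m!"{declName} uses only whitelisted axioms"

theorem em_demo (p : Prop) : p ∨ ¬p := Classical.em p

#check_axioms em_demo
-- error: em_demo uses non-whitelisted axioms: [Classical.choice]
```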
Jason Rute (May 03 2025 at 01:15):
Thanks! I’m surprised about not checking for axioms or sorry, but I saw that @Eric Wieser made a request to add that to lean4checker.
Yury G. Kudryashov (May 03 2025 at 02:40):
AFAIK, Lean issues a warning if a declaration uses sorry, so we rely on this for now.
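For instance (`demo` is a made-up name), a proof containing `sorry` both triggers that warning and records a dependence on the `sorryAx` axiom:

```lean
-- Lean reports the warning "declaration uses 'sorry'" on `demo`.
theorem demo (n : Nat) : n + 0 = n := by
  sorry

-- The dependence also shows up as an axiom:
#print axioms demo
-- 'demo' depends on axioms: [sorryAx]
```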
Yury G. Kudryashov (May 03 2025 at 02:42):
As for axioms, we will definitely decline a PR with an `axiom` command. Of course, someone can deliberately sneak one in using metaprogramming, but meta code goes through review too.
Yury G. Kudryashov (May 03 2025 at 02:43):
This isn't a 100% guarantee, but it works well at least until someone deliberately tries to fool us.
Yury G. Kudryashov (May 03 2025 at 02:44):
For DeepSeek, AFAICT they didn't run the final proof through a normal Lean compilation process, otherwise they would've caught this.
Jason Rute (May 03 2025 at 02:48):
Yeah, I think your meticulous review process is the best safeguard, to be honest. The bug the DeepSeek model seems to have found passed the VS Code infoview, a normal Lean build, and lean4checker. (I'm not saying they tested all of that, but even if they had, it wouldn't have been caught.) The usual warning about sorries didn't print, and the theorem wasn't added to the environment, so lean4checker was happy.
Jason Rute (May 03 2025 at 02:50):
That is why I started this conversation. I'm trying to understand the recommended way to avoid this issue in the future, and I assumed Mathlib was the place with the most rigorous checking process.
Jason Rute (May 03 2025 at 02:54):
Here is the link to the discussion of the error: #lean4 > apply? might suppress other warnings/errors
Kim Morrison (May 03 2025 at 11:17):
It's good to keep in mind here that it's only a couple of lines to write a macro that turns all subsequent theorems into no-ops. Lean's metaprogramming is intentionally very powerful.
Obviously you can't use such "theorems", nor will they appear in the docs.
It's easy to check for the presence of a named theorem, or to check for a named theorem with a specified type, using metaprogramming, and e.g. autograders illustrate how to do this (as in the linked thread above).
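A minimal sketch of such a check, assuming current Lean 4 elaborator APIs (`#check_stmt` and `two_mul_two` are made-up names): it fails if the declaration never made it into the environment, or if its type doesn't match the expected statement.

```lean
import Lean

open Lean Elab Command

elab "#check_stmt " id:ident " : " expectedStx:term : command => do
  -- Resolving the name fails if the declaration is not in the environment.
  let declName ← resolveGlobalConstNoOverload id
  let some info := (← getEnv).find? declName
    | throwError "{declName} is not in the environment"
  liftTermElabM do
    -- Elaborate the expected statement and compare it with the actual type.
    let expected ← Term.elabType expectedStx
    unless (← Meta.isDefEq expected info.type) do
      throwError "{declName} : {info.type} does not match expected {expected}"
  logInfo m!"{declName} has the expected statement"

theorem two_mul_two : 2 * 2 = 4 := rfl

#check_stmt two_mul_two : 2 * 2 = 4  -- two_mul_two has the expected statement
```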
At the end of the day, skeptical review by humans is very helpful, and if you're going to claim any big results (especially `False`), at least a superficial sanity check by a human is obviously necessary. The same applies to all AI teams claiming "state of the art" benchmark results, and this last episode will hopefully raise standards there!
(Presumably there will need to be a retraction, and this should help increase caution in future claims, and in particular not to claim things like "Ooh, the model magically manages to use result X to prove many things" without giving some thought to whether that makes mathematical sense.)
Jason Rute (May 03 2025 at 11:33):
(As for the DeepSeek paper, I think the results mostly hold, with just a few false examples. I think a major edit, rather than a full retraction, is what's needed.)
But I think the standards for checking this stuff may need to increase across the board, including in Mathlib. People are going to use AI to fill in lemmas in Lean blueprint projects, to prove the correctness of their code, to prove the correctness of smart contracts, and to build autoformalized libraries of math. The promise of Lean (very possibly false) is that you don't have to check every little detail, and I don't think real users are going to be aware of all the pitfalls. Of course this is a learning curve for all of us, but again, it is noteworthy that some of these non-proofs would (it seems) have passed Mathlib's CI on Lean 4.9.0 (although not the review process), not to mention the CI of a mathematician's random blueprint project.
Jason Rute (May 03 2025 at 11:37):
Maybe I’m making too big a deal of this. (Everyone is right that this would have been caught if one tried to use the lemmas or print the axioms.) I really just wanted to ask here about what Mathlib does.
Ruben Van de Velde (May 03 2025 at 12:00):
We have lean4checker, btw
Kevin Buzzard (May 03 2025 at 12:06):
The `apply?` trick would not make it into Mathlib, because CI lints for tactics which produce blue noise (whatever it's called officially; I mean blue squiggles which give output) and fails if there is any.
Jason Rute (May 03 2025 at 12:08):
@Ruben Van de Velde see above and in the other threads: lean4checker doesn't always catch things, including the DeepSeek example.