Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: zero-trust AI benchmark

Austin Hatfield (Feb 18 2026 at 18:19):

Uploading ks38P2g6.jpg…
Screenshot 2026-02-18 at 9.53.36 AM.png
A few months ago at a YC hackathon I built a zero-trust benchmark to evaluate AI on the 297 unsolved formalized conjectures from DeepMind's formal-conjectures repo. Each proof gets compiled via lake lean with warningAsError=true, 19 banned tokens (sorry, admit, native_decide, etc.), and #print axioms checking. Only verified proof so far is Green's Problem 24 by Gemini 3 Flash in 178 seconds https://asiprize.com/verified/greens-problem-24/. Unfortunately I only had the money to evaluate Flash on all 297 problems but I think a few more could be solved.

Notification Bot (Feb 18 2026 at 18:45):

A message was moved here from #general > Lean in the wild by Johan Commelin.

Johan Commelin (Feb 18 2026 at 18:45):

@Austin Hatfield I moved your message here, because a new thread is a better place, so that discussion can happen.

Johan Commelin (Feb 18 2026 at 18:46):

Do you know about Comparator? It is the recommended way to verify AI-generated proofs.

Paul Lezeau (Feb 18 2026 at 20:03):

Hey @Austin Hatfield! Cool website! One small thing worth noting: the problem solved here isn't Green's Problem, but rather a trivial upper bound that already existed in the literature (the docstring above mentions a reference). Unfortunately it seems the @[category research open] tag was wrong...

Adomas Baliuka (Feb 19 2026 at 10:42):

Indeed, "zero trust" sounds a lot more waterproof than "disallowing some tokens like sorry and flagging warnings as errors". As Johan Commelin (implicitly) pointed out, there are ways to write Lean4 code that appears to prove a theorem but actually doesn't, which your checking process would not catch.

I'm not aware of any LLMs making use of such hacks thus far in order to cheat on benchmarks, but in due time they might (I would imagine especially if RL is used with direct feedback from the proof assistant).

I didn't know about Comparator. I guess it's supposed to be stricter than lean4checker? A comparison would be interesting... Generally, my impression is that robustness against adversarial inputs isn't a priority in Lean or Mathlib development. One would need to invest quite a bit of effort before one could, e.g., award bounties for proven theorems just based on running submitted code without manual human review...

Michael Rothgang (Feb 19 2026 at 10:46):

If you search zulip, you will find previous discussions comparing comparator and lean4checker --- i.e., answering some of the questions you're asking. (In particular, I would disagree with the following, as there is actual work happening in this direction. But feel free to check yourself!)

my impression is that robustness against adversarial inputs isn't a priority in Lean or Mathlib development.

Austin Hatfield (Feb 19 2026 at 16:30):

Hello @Johan Commelin! Thank you so much I've never seen Comparator before. my 6 deep researches couldn't find it so this is the best help ever! Will integrate that for v2 and see what improvements I can make. My verification pipeline was based on the advice in the "Vibe coding safety" thread and heavily inspired by AlphaProof. As I am learning though that's appropriate for non-adversarial AI output but doesn't cover redefinition attacks or statement integrity, which Comparator handles via declaration comparison and kernel replay? I'm gonna deep dive into that repo but are there any other resources you might refer me to? Sorry to slam you I'm just new to lean so want to make sure everything is rigorous.

Austin Hatfield (Feb 19 2026 at 16:35):

@Paul Lezeau Lol I was wondering why Gemini 3 Flash cracked a Green's problem. It's really good at lean for it's size but not THAT good. Mystery solved I'll update the site and pull that claim and remove problem from benchmark for v2.

Austin Hatfield (Feb 19 2026 at 16:48):

@Adomas Baliuka I agree v1 doesn't earn the zero-trust label (didn't know about Comparator and some of the other things you guys flagged). That's the goal though for v2, which I'm trying to get done this week. I will integrate Comparator (sandboxed builds, declaration comparison, kernel replay) so the verification actually backs up the claim. I also realized I was only tracking end-of-compile output rather than mid-compile goal states, so the iteration feedback loop should be a lot stronger next time around too (just finished this). Any other resources or attack vectors you know of I should be aware of?

James E Hanson (Feb 19 2026 at 17:43):

Adomas Baliuka said:

I didn't know about Comparator. I guess it's supposed to be stricter than lean4checker? A comparison would be interesting... Generally, my impression is that robustness against adversarial inputs isn't a priority in Lean or Mathlib development.

Comparator is a lot more airtight than lean4checker, and it actually is designed with adversarial input in mind (which is why it runs Lean files in a sandbox).

James E Hanson (Feb 19 2026 at 17:49):

Lean4checker is really only good for catching basic environment hacking.

Last updated: Feb 28 2026 at 14:05 UTC