Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Gemini 3 pro performance on formal benchmarks


Ayush Agrawal (Nov 20 2025 at 03:29):

The Gemini 2.5 series was known to hallucinate a lot on formal benchmarks. Do we know how the Gemini 3 series performs on the same benchmarks?

Justin Asher (Nov 20 2025 at 04:58):

I am super curious about this too, especially considering Google's emphasis on Lean programming through their various efforts, e.g., AlphaProof.

Deleted User 968128 (Nov 22 2025 at 16:22):

https://epoch.ai/frontiermath
https://critpt.com/

Gemini 3 is significantly outperforming SOTA base models on advanced math and physics; however, web and tool usage are compelling.

How Gemini 3 Pro Deep Think (a model which does a BFS using the base Pro model) will perform is what many are waiting to find out.

Opt (Nov 25 2025 at 21:34):

Tim Shephard said:

https://epoch.ai/frontiermath
https://critpt.com/

Gemini 3 is significantly outperforming SOTA base models on advanced math and physics; however, web and tool usage are compelling.

How Gemini 3 Pro Deep Think (a model which does a BFS using the base Pro model) will perform is what many are waiting to find out.

I don't think we know what kind of search Deep Think does, right? So we don't know if it's BFS or something else.

Junyan Xu (Nov 25 2025 at 21:53):

I find it interesting that Gemini 3 keeps hallucinating a certain McCoy's theorem saying that f.g. ideals in a commutative ring consisting of zerodivisors must be annihilated by some nonzero element, e.g. in this conversation and a previous one (which also shows a bug in extracting the final response from CoT).

Deleted User 968128 (Nov 25 2025 at 23:29):

Opt said:

I don't think we know what kind of search Deep Think does, right? So we don't know if it's BFS or something else.

Admittedly it is a bit of a simplification, but I'd argue it's definitely not DFS.

"Just as people tackle complex problems by taking the time to explore different angles, weigh potential solutions, and refine a final answer, Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques."
https://blog.google/products/gemini/gemini-2-5-deep-think/


Last updated: Dec 20 2025 at 21:32 UTC