Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Gemini 3 pro performance on formal benchmarks
Ayush Agrawal (Nov 20 2025 at 03:29):
The Gemini 2.5 series was known to hallucinate a lot on formal benchmarks. Do we know how the Gemini 3 series performs on the same benchmarks?
Justin Asher (Nov 20 2025 at 04:58):
I am super curious about this too, especially considering Google's emphasis on Lean programming through their various efforts, e.g., AlphaProof.
Deleted User 968128 (Nov 22 2025 at 16:22):
https://epoch.ai/frontiermath
https://critpt.com/
Gemini 3 is significantly outperforming SOTA base models on advanced math and physics; however, web and tool usage are compelling.
How Gemini 3 Pro Deep Think (a model which does a BFS using the base Pro model) will perform is what many are waiting to find out.
Opt (Nov 25 2025 at 21:34):
Tim Shephard said:
https://epoch.ai/frontiermath
https://critpt.com/
Gemini 3 is significantly outperforming SOTA base models on advanced math and physics; however, web and tool usage are compelling.
How Gemini 3 Pro Deep Think (a model which does a BFS using the base Pro model) will perform is what many are waiting to find out.
I don't think we know what kind of search Deep Think does, right? So we don't know if it's BFS or something else.
Junyan Xu (Nov 25 2025 at 21:53):
I find it interesting that Gemini 3 keeps hallucinating a certain McCoy's theorem saying that f.g. ideals in a commutative ring consisting of zerodivisors must be annihilated by some nonzero element, e.g. in this conversation and a previous one (which also shows a bug in extracting the final response from CoT).
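For reference, the claim as described in this message can be written out as below. This is only a transcription of the statement the model reportedly attributes to McCoy, not an assertion that it holds; the symbols R, I, a_i, x, y, and r are notation introduced here for illustration.

```latex
% Claim as described in the message (reported as hallucinated; not asserted here as a theorem).
% The notation R, I, a_i, x, y, r is introduced for illustration only.
Let $R$ be a commutative ring and $I = (a_1, \dots, a_n) \subseteq R$ a finitely generated ideal.
\[
  \bigl(\forall x \in I,\ \exists\, y \in R \setminus \{0\} :\ xy = 0\bigr)
  \;\Longrightarrow\;
  \bigl(\exists\, r \in R \setminus \{0\} :\ rI = 0\bigr).
\]
```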
Deleted User 968128 (Nov 25 2025 at 23:29):
Opt said:
I don't think we know what kind of search Deep Think does, right? So we don't know if it's BFS or something else.
Admittedly it is a bit of a simplification, but I'd argue it's definitely not DFS.
"Just as people tackle complex problems by taking the time to explore different angles, weigh potential solutions, and refine a final answer, Deep Think pushes the frontier of thinking capabilities by using parallel thinking techniques."
https://blog.google/products/gemini/gemini-2-5-deep-think/
Last updated: Dec 20 2025 at 21:32 UTC