Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Judging formalization
V S (Jul 28 2025 at 15:28):
Hello! I've seen many discussions about the results of big_name_like_Deepseek-Prover, but for those models the formalizations of the problems are given (and fixed) by the specific benchmark. I've been looking at the big_name_...-Formalizers and noticed that they produce slightly different formalizations for the same problem, which is of course not unexpected, but it raises the following, possibly philosophical, question:
Can such autoformalization be judged objectively and at least somewhat automatically, in the sense that, given two different attempts, either one is clearly better and should be used instead of the other, or they are of the same quality? If such methods are unattainable at the moment, what role do formalizers even play in the whole 'AI for TP' race?
For example, Kimina Autoformalizer, when asked to formalize "Find the largest integer smaller than sqrt(4)+sqrt(5). The answer is 4", produces the following formalizations, both of which can be proven and are essentially correct (the second one is probably slightly worse, but I feel that if a student rephrased the problem this way, they would still get good marks):
import Mathlib
open Real
theorem my_favorite_theorem_1 : IsGreatest {n : ℤ | n < √4 + √5} 4 := by sorry
theorem my_favorite_theorem_2 : ⌊Real.sqrt 4 + Real.sqrt 5⌋ = 4 := by sorry
Justin Asher (Jul 28 2025 at 16:37):
If you want something automated, I am sure you can just use an LLM as a judge after making a rubric. These have been shown to be effective in other contexts, so I see no reason they cannot be used here.
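For concreteness, a minimal Python sketch of what such a judge could look like. The rubric text, the function names, and the generic call_llm callable are all illustrative assumptions rather than an existing tool; any text-in/text-out LLM client could be plugged in.

from typing import Callable

RUBRIC = """Score each candidate Lean formalization of the informal problem
from 1 to 5 on: (a) faithfulness to the informal statement, (b) no added or
dropped hypotheses, (c) idiomatic use of Mathlib. Give a one-line
justification per candidate and say which one you would keep."""

def judge_formalizations(
    informal: str,
    candidate_a: str,
    candidate_b: str,
    call_llm: Callable[[str], str],  # any text-in/text-out LLM client
) -> str:
    """Ask a judge model to grade two candidate formalizations against the rubric."""
    prompt = (
        f"{RUBRIC}\n\n"
        f"Informal problem:\n{informal}\n\n"
        f"Candidate A:\n{candidate_a}\n\n"
        f"Candidate B:\n{candidate_b}\n"
    )
    return call_llm(prompt)

# Usage with a stub "LLM" so the snippet runs without any API key:
stub = lambda prompt: "A: 5/5/5 -- faithful. B: 5/4/4 -- floor restates the answer. Keep A."
print(judge_formalizations(
    "Find the largest integer smaller than sqrt(4) + sqrt(5). The answer is 4.",
    "theorem my_favorite_theorem_1 : IsGreatest {n : ℤ | n < √4 + √5} 4 := by sorry",
    "theorem my_favorite_theorem_2 : ⌊Real.sqrt 4 + Real.sqrt 5⌋ = 4 := by sorry",
    stub,
))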
Justin Asher (Jul 28 2025 at 16:37):
(The point is that autoformalization, outside of type checking and handwritten tests, is not a verifiable domain.)
Jibiana Jakpor (Jul 28 2025 at 17:29):
@Scott Kovach wrote an interesting blog post discussing the underlying philosophical question.
Shashank Kirtania (Aug 09 2025 at 20:46):
I have been thinking of extending CodeBLEU to proofs. I think it is a non-trivial yet relevant metric, and I would be happy to hear thoughts on it.
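For concreteness, here is a minimal Python sketch of the surface n-gram component of such a metric, applied to two Lean statements. Real CodeBLEU additionally weights language keywords and matches syntax trees and dataflow, which is where the non-trivial proof-specific work would be; everything below is only an illustrative assumption about how the textual part could look.

import re
from collections import Counter

def lean_tokens(src: str) -> list[str]:
    # Naive tokenizer: identifiers (possibly dotted), numerals, single symbols.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_.]*|\d+|\S", src)

def ngram_precision(cand: list[str], ref: list[str], n: int) -> float:
    # Clipped n-gram precision, as in BLEU.
    cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    if not cand_ngrams:
        return 0.0
    overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
    return overlap / sum(cand_ngrams.values())

def surface_similarity(candidate: str, reference: str, max_n: int = 3) -> float:
    """Average n-gram precision of candidate against reference (surface level only)."""
    cand, ref = lean_tokens(candidate), lean_tokens(reference)
    precisions = [ngram_precision(cand, ref, n) for n in range(1, max_n + 1)]
    return sum(precisions) / len(precisions)

# The two formalizations from earlier in the thread overlap only partially at
# the token level even though both are acceptable, which is one reason a purely
# textual score can only be a rough signal.
print(surface_similarity(
    "theorem my_favorite_theorem_1 : IsGreatest {n : ℤ | n < √4 + √5} 4 := by sorry",
    "theorem my_favorite_theorem_2 : ⌊Real.sqrt 4 + Real.sqrt 5⌋ = 4 := by sorry",
))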
Jason Rute (Aug 09 2025 at 21:18):
Why proofs? Also would you use it for RL training or just as a metric?
Shashank Kirtania (Aug 20 2025 at 21:24):
@Jason Rute RL training does sound relevant, but I am not entirely sure how to go about it with the help of the metric. Right now I am trying to use it as a metric and to understand whether it is representative enough to be used for training purposes (mostly analysis).
Shashank Kirtania (Aug 20 2025 at 21:30):
Also, some relevant work on comparing two formalizations:
https://arxiv.org/abs/2507.07399
Paul Dietze (Oct 18 2025 at 20:05):
A possible additional evaluation metric could also be the cosine similarity between the text embeddings of the natural-language statement and its corresponding Lean code. Or does this approach have some potential disadvantages that I am missing?
(I understand that for concepts outside the training data distribution of the chosen embedding model the results will most likely be bad, but for more common statements/proofs etc. this might be a viable approach.)
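For concreteness, a minimal Python sketch of this metric using one off-the-shelf sentence-transformers model; the model name is only an illustrative choice, and, as discussed below, such general-purpose embeddings may handle Lean syntax poorly.

from sentence_transformers import SentenceTransformer, util

# The model name is just one off-the-shelf choice; a specialized NL-Lean
# embedding model would slot in the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

informal = "Find the largest integer smaller than sqrt(4) + sqrt(5). The answer is 4."
candidates = [
    "theorem my_favorite_theorem_1 : IsGreatest {n : ℤ | n < √4 + √5} 4 := by sorry",
    "theorem my_favorite_theorem_2 : ⌊Real.sqrt 4 + Real.sqrt 5⌋ = 4 := by sorry",
]

informal_emb = model.encode(informal, convert_to_tensor=True)
candidate_embs = model.encode(candidates, convert_to_tensor=True)

# One cosine score per candidate; a higher score would be read as "semantically
# closer" to the informal statement under this (imperfect) proxy.
scores = util.cos_sim(informal_emb, candidate_embs)[0]
for cand, score in zip(candidates, scores):
    print(f"{float(score):.3f}  {cand}")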
Shashank Kirtania (Oct 18 2025 at 22:11):
You're right to be skeptical. Embedding models struggle with these nuances for a fundamental reason: natural language and Lean code don't actually exist in the same semantic space, even when describing identical mathematical concepts.
If standard embeddings could meaningfully bridge NL and formal mathematics, we'd see much better performance in retrieval tasks over Mathlib (as projects like Lean Copilot have discovered).
Paul Dietze (Oct 18 2025 at 23:08):
I wasn't necessarily referring to existing off-the-shelf embedding models, but rather to potential future natural-language-to-Lean embedding models that would enforce closeness in the semantic space. My point was more long-term: given enough training data, such models could provide a complementary semantic signal and might even turn out to be better at capturing nuances than type- and/or tree-based similarity measures. But this is just a hypothesis, and I would be very interested in counter-arguments.
Apparently, a recently published paper is directly relevant to this post: https://arxiv.org/pdf/2510.15681. Unfortunately, it seems that the embedding model weights were not released.
Last updated: Dec 20 2025 at 21:32 UTC