Zulip Chat Archive

Stream: PhysLean

Topic: PhysLib


Tyler Josephson ⚛️ (Nov 02 2025 at 18:32):

New paper on Lean + Physics:
https://arxiv.org/abs/2510.26094

GitHub is here, but not public yet: https://github.com/ShirleyLIYuxin/Lean4PHYS

Looks like an alternative to PhysLean, along with the idea of building a physics-themed benchmark for automated theorem proving.

Tyler Josephson ⚛️ (Nov 02 2025 at 18:38):

Lead author @Yuxin Shirley LI seems to be on Zulip - hi!

Alfredo Moreira-Rosa (Nov 02 2025 at 19:01):

Looks more like a setup to build a physics benchmark for LLMs than an alternative to PhysLean.
Indeed, it reads more like a computer science project than a physics one.

Interestingly, it says:

Additionally, we prove the interchangeability between physics quantities and mathematical quantities. Besides, we also define the physics dimension cast and prove that such a cast does not affect the mathematical propriety of such a unit.

Curious how it extends T. Tao's work.

@Yuxin Shirley LI curious to take a look at the repository once you publish it.
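
For context on that claim, here is a minimal Lean 4 sketch of what such a "dimension cast" could look like (all names are my own hypothetical placeholders, not the paper's actual definitions): a quantity indexed by its dimension, and a cast along a dimension equality that provably leaves the numerical value untouched.

```lean
import Mathlib

-- Hypothetical sketch, not the paper's definitions: a dimension as a vector
-- of integer exponents, and a quantity indexed by its dimension.
structure Dimension where
  length : ℤ
  mass   : ℤ
  time   : ℤ

structure Quantity (d : Dimension) where
  val : ℝ

/-- Cast a quantity along an equality of dimensions. -/
def Quantity.cast {d d' : Dimension} (h : d = d') (q : Quantity d) : Quantity d' :=
  h ▸ q

/-- The cast does not affect the underlying mathematical value. -/
theorem Quantity.cast_val {d d' : Dimension} (h : d = d') (q : Quantity d) :
    (q.cast h).val = q.val := by
  subst h
  rfl
```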

Joseph Tooby-Smith (Nov 02 2025 at 19:40):

Giving them the benefit of the doubt, it may be a bad choice of example problems, or it may come down to presentation, but I would not say the 'college'-level problems they are solving are actually college level. The one in Figure 4 is of the type 'here is a formula, plug in the numbers and show it works'; the one in Figure 5 is trivially true.
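
As a toy reconstruction of that first type (my own hypothetical statement, not copied from the paper): the defining formula arrives as a hypothesis, and the entire proof is one arithmetic tactic.

```lean
import Mathlib

-- Toy "plug in the numbers" problem: C = q / V is handed over as a
-- hypothesis, so all that remains is numeric evaluation.
example (C q V : ℝ) (hq : q = 2) (hV : V = 4) (hC : C = q / V) :
    C = 0.5 := by
  subst hq hV hC
  norm_num
```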

Alfredo Moreira-Rosa (Nov 02 2025 at 20:12):

Benchmark results are really low across the board, so I'm also inclined to give the benefit of the doubt,
since today's LLMs should find the examples given easy to solve.

Another explanation for such low scores is that work with unit systems is almost absent from the Lean 4 literature, with only a handful of examples from PhysLean, lean-units, Terence Tao's library, ...
So there is very little for any LLM to have been effectively trained on such formal setups.

ZhiKai Pong (Nov 03 2025 at 15:53):

I appreciate the idea of creating a dataset for benchmarking, but with full respect, I genuinely hope they’ll join the discussion here to exchange ideas. At the risk of sounding too critical, here are some honest thoughts after going through the preprint (judging only from what is currently shown):

  1. Unclear what role PhysLib plays in assisting the LLM.

On page 7 the paper says

without the PhysLib, the model can only perform basic simplification tactics like: constructor, rw, abel, exact, aesop. By adding the PhysLib, the model can learn to perform more advanced tactics, such as simp and norm_num.

This sounds like the LLM is just bad at Lean and outdated in terms of tactics; it may have picked those tactics up from PhysLib, but it is certainly not the case that PhysLib is necessary to teach the LLM those tactics. Moreover, the comparison demonstrated in Figure 5 doesn't really showcase the importance of PhysLib: Gemini is able to solve the problem without PhysLib anyway, and although the added thought process is a plus, I'm pretty sure a similar effect could be achieved with slight changes to the prompt. It is unclear to me what role PhysLib plays here.
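
For what it's worth, `simp` and `norm_num` are stock Mathlib tactics that work with no physics library in scope at all; here are a couple of unrelated toy goals (my own, purely for illustration):

```lean
import Mathlib

-- Both goals close with the "advanced" tactics from the quote, with no
-- physics library involved.
example (a b : ℝ) : a + b - b = a := by simp
example : Nat.Prime 37 := by norm_num
```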

  2. Unclear what the benchmark actually intends to test.

If I understand the work correctly, LeanPhysBench just contains manually formalized statements and the LLMs are evaluated directly on them. In effect this is just asking LLMs to solve Lean problems, and the result in itself can be completely devoid of physical content. This framework works for math because the mathematical content of the proof itself is of interest, and the mathematical ideas are nontrivial enough that LLMs can't just brute-force their way through (disclaimer: I'm obviously not an expert in this area). However, the examples shown here are nowhere near that level of sophistication; to LLMs they are just simple symbolic manipulations. Personally I would hope LeanPhysBench actually provides a way to test the physical understanding of the various AI models, and I'm not sure what's currently shown is sufficient.

  3. Formalizing arithmetic is not formalizing physics.

Noting the above, I'd like to emphasise the difference between formalizing arithmetic and formalizing physics. I think having units is a step in the right direction, but that is still quite far from saying that the statements shown involve actual physical reasoning. Take the competition-easy problem in Figure 4 as an example: the equation C = q/V is already given in the context, and what's left is just arithmetic. I believe the philosophy of PhysLean would be to actually show that the charge is proportional to the voltage, starting from electromagnetism, and I believe this is where the actual "physics" lies. I fully acknowledge that this is precisely what makes formalising physics tricky, but I would still expect some sort of physics derivation to be demonstrated if an AI is to be called a reasoning model, rather than it simply evaluating arithmetic relations.
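
To make that contrast concrete, here is an entirely hypothetical sketch (not anything from the paper) of a statement that at least has a derivation flavour: assuming the charge response is homogeneous in the voltage (a toy linear-response hypothesis standing in for the electrostatics), one extracts a single capacitance constant rather than evaluating q / V numerically.

```lean
import Mathlib

-- Hypothetical "derivation-flavoured" statement: from homogeneity of the
-- charge response q in the voltage, extract a constant C with q V = C * V.
example (q : ℝ → ℝ) (hsmul : ∀ c V : ℝ, q (c * V) = c * q V) :
    ∃ C : ℝ, ∀ V, q V = C * V := by
  refine ⟨q 1, fun V => ?_⟩
  calc q V = q (V * 1) := by rw [mul_one]
    _ = V * q 1 := hsmul V 1
    _ = q 1 * V := mul_comm _ _
```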

To conclude, I'd expect more from an AI doing "formal physics reasoning in Lean4" than just arithmetic. Again, apologies if I sound too harsh; I offer these observations purely for the sake of discussion.

Alfredo Moreira-Rosa (Nov 03 2025 at 17:50):

Maybe someone could reach out to the lead author by email and tell her that the PhysLean community is interested in discussing how they could improve the benchmarks, and that some of us would be happy to give direct feedback.

Alfredo Moreira-Rosa (Nov 03 2025 at 18:14):

OK, I took the liberty of sending her a friendly email to ask if she would be open to discussing her paper here.
Hopefully she'll join the conversation.

Joseph Tooby-Smith (Nov 03 2025 at 18:18):

Awesome! I also sent them an email this morning about PhysLean, but didn't mention this discussion, so thank you!

Yuxin Shirley LI (Nov 04 2025 at 16:31):

Hi everyone,

Thank you all for your interest in our work and the thoughtful feedback! We're actively preparing to release our repository. 

I will definitely try my best to engage more deeply with the discussion. We would like to take this as a precious opportunity to exchange ideas, improve our work and explore potential collaborations.

Thanks again for taking the time to review our work!


Last updated: Dec 20 2025 at 21:32 UTC