Zulip Chat Archive

Stream: lean4

Topic: Normalize definitions (for hashing and caching)


Arvind Arasu (Nov 08 2025 at 22:42):

Is there an inexpensive way of normalizing Lean definitions? The goal is to (SHA)-hash the normalized forms for caching purposes.

Two definitions that normalize to the same text should be equivalent, but it is fine for equivalent definitions to not normalize to the same text.

Perhaps normalizing up to variable renaming and basic type inference is achievable? (In the example below, foo and syntacticallyFoo normalize to the same text, but semanticallyFoo does not).

-- some definition
def foo (m n : Nat) : Nat := m + n

-- some other definition with different
-- variable names and output type elided
def syntacticallyFoo (someVar : Nat) (someOtherVar : Nat) := someVar + someOtherVar

-- this is not syntactially identically modulo
-- variable renaming (but semantically identical)
def semanticallyFoo (m n : Nat) : Nat := n + m

Aaron Liu (Nov 08 2025 at 23:21):

if you want hashing I think it would be better to use docs#Lean.ExprStructEq

Aaron Liu (Nov 08 2025 at 23:21):

what are you hashing it for?

Aaron Liu (Nov 08 2025 at 23:43):

Arvind Arasu said:

Perhaps normalizing up to variable renaming and basic type inference is achievable?

What do you mean "basic type inference"? I don't think you should have to do any type inference for this basic kind of normalization?

Aaron Liu (Nov 08 2025 at 23:48):

and if you're doing type inference then it's not really basic anymore

Arvind Arasu (Nov 09 2025 at 05:10):

Thanks Aaron.

The exact scenario requires some context - not sure I can do justice here: we are using Lean to capture user intent and drive actions using an LLM. Once a Lean definition is formulated, it triggers a potentially expensive LLM action (e.g., proof generation).

We want to mitigate this expense by caching the steps associated with a definition the first time it occurs and using it for subsequent equivalent definitions.

Jovan Gerbscheid (Nov 09 2025 at 13:43):

You probably want to use the default BEq and Hash instances on Lean.Expr, since I believe that already accounts for renaming of variables. You might also want to account for reordering of variables/hypotheses by normalizing their order, which is being discussed here: #mathlib4 > duplicate declarations

Arvind Arasu (Nov 10 2025 at 19:54):

Thanks Jovan. I hacked up something based on this suggestion and it works well for my purposes (but without re-ordering variables/hypothesis for now).


Last updated: Dec 20 2025 at 21:32 UTC