Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Embedder for Lean+NL


Vasily Ilin (Apr 09 2025 at 21:17):

What are my options for an embedder for both Lean and natural language? The downstream task is retrieval from natural language, like LeanSearch.

Justin Asher (Apr 10 2025 at 03:18):

One strategy is to adopt methods similar to those used in LeanDojo's data extraction. For instance, see ExtractData.lean (https://github.com/lean-dojo/LeanDojo/blob/main/src/lean_dojo/data_extraction/ExtractData.lean). Note that this specific code only works with older Lean versions (pre-v4.13.0). This kind of approach involves parsing Lean code to extract structural details, such as abstract syntax trees (ASTs) and theorem dependencies (PremiseTrace). You could then process these dependencies first to build richer context around a definition or theorem before feeding this structured representation into an LLM to generate a clear English description. (I am currently using Gemini models for similar tasks, e.g. Flash 2.0.) Once you have the English text, embedding it with a standard embedding model provides the vectors you need for your natural language retrieval task.

(My own project, Vantage (https://www.vantageproject.org/), explores the complementary challenge of autoformalization, translating English into Lean.)
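A minimal sketch of the final step, assuming the English descriptions have already been generated by an LLM. It uses sentence-transformers with cosine-similarity retrieval; the model name and the toy corpus are placeholders, not recommendations:

```python
# Rough sketch of the retrieval step, assuming each Lean declaration
# already has an LLM-generated English description. The model name and
# example data below are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical output of the "Lean -> English description" step.
corpus = {
    "Nat.add_comm": "Addition of natural numbers is commutative.",
    "List.length_append": "The length of the concatenation of two lists "
                          "is the sum of their lengths.",
}

model = SentenceTransformer("all-MiniLM-L6-v2")
names = list(corpus)
# Normalized embeddings make the dot product equal to cosine similarity.
doc_emb = model.encode([corpus[n] for n in names], normalize_embeddings=True)

def search(query: str, k: int = 5):
    """Return the k declarations whose descriptions best match the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_emb @ q
    top = np.argsort(-scores)[:k]
    return [(names[i], float(scores[i])) for i in top]

print(search("commutativity of addition"))
```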

Bas Spitters (Apr 10 2025 at 07:19):

@Justin Asher For many programming languages, getting such structural details can be done with tree-sitter (https://tree-sitter.github.io/). It supports most popular programming languages, including Agda. I have not looked closely enough at the Lean syntax to know whether this is also possible, but it would be a way to connect to a much wider ecosystem.
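For illustration, a rough sketch of what that could look like with the py-tree-sitter bindings (pre-0.22 build_library API). The existence of a Lean grammar, its local path, the language name, and the node types it produces are all assumptions here:

```python
# Rough sketch with py-tree-sitter (pre-0.22 API). The Lean grammar path
# and node type names are assumptions: they depend on whichever community
# tree-sitter grammar for Lean you build against.
from tree_sitter import Language, Parser

# Compile the grammar checked out at vendor/tree-sitter-lean (assumption).
Language.build_library("build/languages.so", ["vendor/tree-sitter-lean"])
LEAN = Language("build/languages.so", "lean")

parser = Parser()
parser.set_language(LEAN)

source = b"theorem one_add_one : 1 + 1 = 2 := rfl"
tree = parser.parse(source)

def walk(node, depth=0):
    """Print every named node with its source text: the kind of
    structural detail one might feed into the description step."""
    if node.is_named:
        text = source[node.start_byte:node.end_byte].decode()
        print("  " * depth + f"{node.type}: {text}")
    for child in node.children:
        walk(child, depth + 1)

walk(tree.root_node)
```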


Last updated: May 02 2025 at 03:31 UTC