Stream: general

Topic: Discussion: LeanExplore

Justin Asher (May 09 2025 at 21:04):

I am excited to share LeanExplore, a new Mathlib search engine that combines semantic search embeddings, AI-generated translations, and PageRank to give optimized search results.

Try it out live at: https://www.leanexplore.com

In the near future, I will be further releasing a GitHub repository and a paper.

If you have any feedback, please let me know. I am still working on fine-tuning the search signal to give accurate results. I would be happy to add any features that people would like, such as keeping track of their search history, the ability to share specific results, VS code integration, or a dark mode. I also plan to keep the search engine up to date with the latest results from Mathlib, and add support for other repositories, such as the base Lean 4 library or PhysLean.

Bolton Bailey (May 09 2025 at 23:42):

Thanks for making this! I tested it against the queries I listed here. In my opinion, the results are about on par with those of LeanSearch.

If you're taking suggestions, here are some things I personally would like (and some of these apply to LeanSearch as well):

Of the features you mentioned, (optional) support for other repositories would be most useful for me.
Links to the generic docs page (LeanSearch has this)
- I would also like for declarations that appear in the lean code to themselves link to their own docs pages.
As I type in the search bar, I might like a dropdown (fuzzy?) list of search results (as appears in the top left search bar in the main docs), so I can stop halfway through typing if I see what I need.

Justin Asher (May 10 2025 at 01:14):

@Bolton Bailey Thanks for the feedback!

Are there any other repositories that you want to be supported besides Mathlib, PhysLean, and the base Lean 4 repository?

Linking to the docs page and having an expandable citations section on each card should be doable (linking inline would be a bit harder—I will have to see what works).

Vector embeddings are relatively expensive, so the dropdown will have to be something simpler. But certainly I can see if anything works well here.

Kim Morrison (May 10 2025 at 03:44):

Can you integrate it with the existing #leansearch command available in Mathlib (or under a different name?)

suhr (May 10 2025 at 09:20):

Two suggestions:

Similarity search: searching results that are similar to a given result
Vector arithmetic: since embeddings are vectors, there are useful operations like addition, subtraction and projection on a given direction. One could write rather elaborate queries if these operations were available

Siddhartha Gadgil (May 10 2025 at 13:07):

I tried as I usually do: "if a natural number n is less than or equal to m, then n+1 is less than or equal to m+1". This misses Nat.succ_le_succ presumably because it only searches mathlib theorems. It would be more useful if it included the dependencies of mathlib - Lean's core, Std, Batteries.

Siddhartha Gadgil (May 10 2025 at 13:08):

A second issue - is there an API? If so I can integrate with LeanSearchClient along with LeanSearch, Moogle etc.

Justin Asher (May 10 2025 at 14:59):

@Kim Morrison Yes, I can create a Lean command once I create the API.

Justin Asher (May 10 2025 at 15:05):

@suhr I like the idea of adding a “similar results” button—this would effectively utilize the AI summaries. That's a really good idea.

Regarding addition and subtraction, let me think about this.

Justin Asher (May 10 2025 at 15:08):

@Siddhartha Gadgil I started with Mathlib for simplicity, but I can certainly add these libraries. This will take a few days.

There is not an API yet, but I will create one once the website is up to par. Thanks for the offer to integrate it with the LeanSearchClient. I will let you know when the API is ready.

Robin Carlier (May 10 2025 at 15:42):

Will we be able to reproduce and run the service locally?

Justin Asher (May 10 2025 at 15:52):

@Robin Carlier Yes, once I get everything sorted out with the website and indexing more libraries, I will create a Python package so that you can run the program locally (probably both through the API and by downloading the database).

Robin Carlier (May 10 2025 at 15:54):

A "fully offline" option would be very nice!

Justin Asher (May 10 2025 at 17:10):

I am going to update the index / database to include the following:

Batteries
Init
Lean 4 (Core)
Mathlib
PhysLean
Std

I will also include a simple filtering mechanism on the website to make it easier for people to select the libraries that they want to search. This is going to take some time to update, but I will let everyone know once it is done!

Justin Asher (May 10 2025 at 17:11):

If there are any other good packages to add, please let me know.

Bolton Bailey (May 11 2025 at 05:24):

Justin Asher said:

Bolton Bailey Thanks for the feedback!

Are there any other repositories that you want to be supported besides Mathlib, PhysLean, and the base Lean 4 repository?

I personally would most like to see https://github.com/Verified-zkEVM/ArkLib, only since that's something I myself have worked on. If you are looking for repositories most likely to have broad appeal, I might look through the most starred repositories on reservoir (I do think a filtering mechanism would be good to implement first, since some of these libraries won't be what most searchers are looking for).

Justin Asher (May 11 2025 at 06:24):

@Bolton Bailey Thanks for sharing the Reservoir page—I had not seen that before. I am a bit hesitant to add too many packages at the moment since I want to make sure the search results do not become disorganized, but I will see what I can do!

Justin Asher (May 12 2025 at 19:55):

I updated the index to include the discussed packages (Batteries, Init, Lean, Mathlib, PhysLean, Std). Now you can search for all of these, and easily access their GitHub counterparts.

I also standardized it so that the search engine uses v4.19.0 of Lean for all of these.

Kevin Buzzard (May 12 2025 at 20:33):

What does that mean? Right now mathlib is on 4.20rc5. So are you indexing an old version of mathlib? Will you be keeping up in general? This has been one of the problems with previous semantic search engines for mathlib

Justin Asher (May 12 2025 at 20:48):

Yes, I wrote the code with the goal of making updates easy. I want to make sure the search engine is up-to-date, at least up to the latest stable versions. I am going to update the website to include more information regarding what date the index was last updated (along with some other features) to make this transparent.

I can try to update the index to Mathlib from 4.19.0 to 4.20.0-rc5. Previously, I was having issues indexing packages with different Lean versions. However, since I decreased the number of packages I am trying to index, this may no longer be a problem. (I can also add your FLT package to the search engine.)

Kevin Buzzard (May 12 2025 at 21:21):

Probably stable versions is fine, I was really just wondering if you had a long-term vision for keeping up to date.

Justin Asher (May 13 2025 at 16:11):

I added a few of the requested features to LeanExplore:

The ability to choose what packages to search.
A "Find similar" button which when clicked searches for semantically similar statements to the current one.
A dark mode.

Justin Asher (May 20 2025 at 23:05):

I am excited to announce that I have added the ability to create an account for LeanExplore!

Currently, the primary benefit of having an account is viewing your search history. However, once I finish writing the Python package, you will be able to access LeanExplore via the API. (I also plan to implement a search interface directly within Lean.)

I had a lot of problems earlier with the server—they should all be resolved now—but if you run into an issue (e.g., random server errors), please let me know.

Please take a moment to review the new Terms of Services and Privacy Policy regarding data usage.

Let me know if you have any feedback!

Kevin Buzzard (May 21 2025 at 00:24):

I should say that I've been using LeanExplore a fair bit over the last few days because I've been writing FLT stuff and using parts of the library which I don't know my way around, and it's been great!

Justin Asher (May 21 2025 at 01:23):

Thanks, I really appreciate it.

Edison Xie (May 23 2025 at 21:10):

@Justin Asher When trying out LeanExplore, sometimes even if I give the exact definition/theorem name I want, it still has difficulty finding those, am I using it wrong?

Edison Xie (May 23 2025 at 21:10):

eg this
Screenshot 2025-05-23 at 19.34.33.png

Edison Xie (May 23 2025 at 21:12):

and this (though in this case the right theorem name should be Nat.dvd_add_iff_right but that's close enough I think.
Screenshot 2025-05-23 at 22.11.48.png

Justin Asher (May 23 2025 at 22:53):

@Edison Xie No, you are not using it wrong, queries like yours should yield good results.

My original goal was be able to search Lean semantically, so that people can find statements in English. Hence, I indexed (i.e., embedded) the Lean code, the docstrings, and the and the Gemini translations, each as their own item in the search tree.

I noticed that this is an issue: Actual names do not yield good search results, and I am working on resolving this issue. Presumably by also embedding directly the names. This should be done soon, after I am done finishing up the Python package today.

Justin Asher (May 23 2025 at 22:53):

Thanks for the feedback. I want this to be easy to use and accessible for everyone.

Justin Asher (May 25 2025 at 01:12):

@Edison Xie If you try your queries again, you should be able to find what you are looking for, as I just updated the index. Please let me know if you have any other problems.

Justin Asher (May 25 2025 at 02:33):

I also accidentally deleted the user account data while updating the server tonight…so anyone with an account will have to recreate their account and any API keys. I will create a safeguard to prevent this from happening again. My apologies to anyone who had their search history deleted.

Justin Asher (May 25 2025 at 02:33):

On the upside, the lean-explore package should be working. You can check it out here:
https://www.leanexplore.com/docs

suhr (May 25 2025 at 13:25):

For semantic search for a famous shadow library, I did this:

def mk_text(prefix, row):
    _,title,vinfo,series,periodical,tags,topic,desc = row
    res = f"{prefix}Title: {title}"
    if vinfo != "":
        res += f", {vinfo}"
    if series != "":
        res += f"\nSeries: {series}"
    if periodical != "":
        res += f"\nPeriodical: {periodical}"
    if topic != "":
        new_topic = topic.replace('\\\\', " / ")
        res += f"\nTopic: {new_topic}"
    if tags != "":
        new_tags = tags.replace(";", ", ")
        res += f"\nTags: {new_tags}"
    if desc != "" and desc != None:
        res += f"\nDescription: {desc}"
    return res

I wonder what LeanExplore text embeddings look like.

Eric Taucher (May 25 2025 at 14:10):

To get a better understanding of LeanExplore as a programmer, looked at the GitHub repository and noticed lots of parts so asked Claude to create a wire diagram. While not perfect nor what I would publish, it is close to what I find helpful. Perhaps you could create and include similar in documentation. Thanks.

image.png

Eric Taucher (May 25 2025 at 14:11):

The SVG is too large to post here but can provide if you need it.

Eric Taucher (May 25 2025 at 15:29):

Suggestions

Other data repositories

* Neo4j
* Prolog facts
* Wikipedia-KG-RAG
* ...
Create an OpenAPI spec to allow automated use.
With FastMCP with OpenAPI integration the service could be exposed as an MCP server and consumed by other MCP clients, e.g. Anthropic Claude, OpenAI ChatGPT (as MCP client still in development), OpenAI Agents SDK (MCP)
I did see: https://github.com/justincasher/lean-explore/blob/main/src/lean_explore/mcp/tools.py.

Justin Asher (May 25 2025 at 15:57):

@Eric Taucher Will get back to you about everything else. In the meantime, I wanted to ask, are you looking for this?
https://www.leanexplore.com/docs/mcp

Eric Taucher (May 25 2025 at 16:05):

Justin Asher said:

are you looking for this?
https://www.leanexplore.com/docs/mcp

Thanks for the reply.

I was looking for a JSON or YAML file which is how OpenAPI specs are distributed. They are often just a single file much more concise than the human readable docs, thus less tokens.

For example, the Zulip forum's OpenAPI specification is available here.

Will take a more detail look at your code and possibly ask Claude to make an OpenAPI spec file based on the GitHub repo to see what it makes. :slight_smile:

Justin Asher (May 25 2025 at 16:07):

Alright, thanks for letting me know. Feel free to submit a pull request if you figure it out. I will check it out too later.

Eric Taucher (May 25 2025 at 16:09):

Justin Asher said:

Feel free to submit a pull request if you figure it out.

Just might.

I will warn you I am horrible with Git and pull request, so if you see one from me don't be shy about noting mistakes, problems, etc.

Eric Taucher (May 25 2025 at 17:15):

If you can see a Anthropic Claude share you can see an initial creation of the OpenAPI spec.

https://claude.ai/share/ff1e9207-28b9-45d7-8d04-a448010c36ea

Nothing worth sending for a GitHub pull request but will give you some idea of what I am thinking and what is possible.

Eric Taucher (May 25 2025 at 17:16):

Sent the generated OpenAPI spec to https://editor.swagger.io/ to check for warning errors and only noted:

  securitySchemes:
    ApiKeyAuth:
      type: apiKey
      in: header
      name: X-API-Key
      description: API key for authenticated requests (future use)

Security scheme was defined but never used.

The generated documentation from the OpenAPI spec shows

image.png

Eric Taucher (May 25 2025 at 18:55):

Was LeanExplore created for just a Linux environment? Trying with Windows 11 is not working as detailed in QuickStart Guide :half_frown:

Justin Asher (May 25 2025 at 19:00):

@Eric Taucher Does this look good?
https://github.com/justincasher/lean-explore/blob/main/openapi.yaml

Justin Asher (May 25 2025 at 19:10):

suhr said:

I wonder what LeanExplore text embeddings look like.

My issue was that if we just embed everything as one long string, then we run into the problem that Edison Xie was mentioning earlier about being able to find things like "LineaMap.det". The issue is that we are embedding different languages concatenated into one string. Hence, I created four separate embeddings, each to suit their own purpose:

Declaration name. I embedded the name of each declaration (e.g., LinearMap.det) for users who want to search by name.
Lean code. I embedded the entire Lean code of each declaration (e.g., the statement in Lean of a theorem) for people who have a Lean statement in their own work and want to see if it exists elsewhere.
Docstring. For semantic searches. Often the docstrings provide short names like "fundamental theorem of calculus".
AI generated translation. Again, for semantic searches. This time, however, it is the translation of the entire Lean statement, which helps if you have the entire statement of a theorem you want to search for and not just its name.

Then, the search engine uses an inverted file index (IVF) to quickly retrieve relevant statements. The index links each embedding to its original item in the LeanExplore database, meaning that yes, we can have duplicates in the search results that we have to filter out. However, this has not been too much of a problem so far.

I will write a paper eventually…have been very busy lately.

Eric Taucher (May 25 2025 at 19:24):

Justin Asher said:

Does this look good?
https://github.com/justincasher/lean-explore/blob/main/openapi.yaml

It is a good start. :+1:

First off, I am no OpenAPI expert so mostly rely on validating a spec using https://editor.swagger.io/.
The spec provided did not show and warnings or errors.

If you

Navigate to https://editor.swagger.io/
Upload the raw file https://raw.githubusercontent.com/justincasher/lean-explore/refs/heads/main/openapi.yaml
Verify the server https://www.leanexplore.com is running using an Internet browser.
Verify the API server https://www.leanexplore.com/api/v1 is running.
Notice that OpenAPI documentation was created on the right after uploading the OpenAPI spec and includes an Authorize button.
image.png
image.png
Then setting the correct API key you should be able to try out some API examples, e.g.

image.png

I could not get the Authorize to work as expected, was not expecting two options, maybe you can and then demonstrate here.

Happy to test more with feedback once the Try it out are working.

Justin Asher (May 25 2025 at 20:07):

@Eric Taucher Thanks for letting me know about all of this! This is really helpful.

For reference, you can get a key here: https://www.leanexplore.com/api-keys (you will have to create an account). This should allow you to pass the authorization.

By the way, I had to update my backend to allow https://editor.swagger.io/ to make requests, i.e., update the CORS rules. Everything seems to be working now.

Justin Asher (May 25 2025 at 20:15):

Eric Taucher said:

Was LeanExplore created for just a Linux environment? Trying with Windows 11 is not working as detailed in QuickStart Guide :half_frown:

Let me try on my Windows computer. LeanExplore was running fine on my Mac, Linux server, and WSL. Thanks for letting me know. I should have tested this...

Eric Taucher (May 25 2025 at 20:23):

Justin Asher said:

For reference, you can get a key here: https://www.leanexplore.com/api-keys (you will have to create an account). This should allow you to pass the authorization.

Was documenting it expecting it to fail again, this time it worked.

Obviously the key in this example has been revoked by the time you read this.

Using https://www.leanexplore.com/api-keys create a new API key.

image.png

hrpFc-wq9MWpBjC8RYnADhdrNMjhgL6VFZ1H508EdB0

Then using https://editor.swagger.io/ with OpenAPI spec loaded (File -> Import URL) from https://raw.githubusercontent.com/justincasher/lean-explore/refs/heads/main/openapi.yaml

Click Authorize and enter key for APIKeyHeaderAuth

image.png

click Authorize

image.png

click Close

Scroll down to

image.png

Click Try it out

Scroll down to see Execute tab
image.png

Click Execute
Responses section updates
image.png

To revoke the key
Navigate to https://www.leanexplore.com/api-keys

image.png

Click Revoke
image.png

Click Yes, Revoke Key

image.png

Justin Asher (May 25 2025 at 20:35):

@Eric Taucher Can you tell me about any errors you are running into when attempting the Quickstart Guide? I got LeanExplore running on my Windows machine, so it seems like I need to improve the documentation and any errors messages thrown by LeanExplore.

Eric Taucher (May 25 2025 at 20:55):

Justin Asher said:

so it seems like I need to improve the documentation and any errors messages thrown by LeanExplore.

The warnings come about from this command C:\Users\Eric\Projects\Lean 4\LeanExplore>"C:\Users\Eric\AppData\Local\Programs\Python\Python313\Scripts\pip.exe" install lean.explore e.g.

━━╺━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/59 [tqdm] WARNING: The script tqdm.exe is installed in 'C:\Users\Eric\AppData\Local\Programs\Python\Python313\Scripts' which is not on PATH. Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.

I will create a startup *.bat file to update the PATH variable before running again, it sometimes works, but then again Python environments are not something I do everyday, Prolog is more my suit.

Eric Taucher (May 25 2025 at 21:31):

It is working for Windows 11. No need to change the instructions.

Normally I use Windows environment variables so had to learn how to use noted means leanexplore configure api-key, which does work.

A link to such or similar might be of value.

https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety

Also learned I could delete the entire config.toml file without worries then run the command again, leanexplore configure api-key.

image.png

Justin Asher (May 25 2025 at 21:45):

Thanks for letting me know about the API key confusion. I will add the ability for users to use their environment variables, too, in a future version of LeanExplore.

Glad to see it is working!

Justin Asher (May 25 2025 at 21:45):

Let me know if any other problems come up.

Eric Taucher (May 26 2025 at 11:37):

Out of curiosity, no need to reply if you prefer, how did you create the OpenAPI spec for LeanExplore so fast? Many can create an initial spec with some problems but the one you first presented here was warning and error free. Wondering if you used AI which is how I did one or you used some other technique that I should learn.

https://raw.githubusercontent.com/justincasher/lean-explore/refs/heads/main/openapi.yaml

Thanks.

For example, this is one of the first OpenAPI specs I did, which was mostly by hand a few years ago and using ChatGPT or such was at the point that as much as it helped it also left you with problems that would take some time to figure out.

https://raw.githubusercontent.com/EricGT/SWI-Prolog-validation/refs/heads/main/openapi.yaml

Eric Taucher (May 26 2025 at 13:43):

FYI

Making progress on using LeanExplore as an MCP server for use with Anthropic Claude acting as MCP client.

This image shows LeanExplore available as tool for use with Anthropic Claude using MCP (Model Context Protocol)

Untitled.png

This shows the 4 commands from the OpenAPI spec recognized and available for use.

image.png

Eric Taucher (May 26 2025 at 13:45):

Each MCP server started with Claude will start a log file.
On Windows these can be found in C:\Users\<User>\AppData\Roaming\Claude\logs

Successful startup log entries

2025-05-26T13:30:57.869Z [lean-explore] [info] Initializing server...
2025-05-26T13:30:57.897Z [lean-explore] [info] Server started and connected successfully
2025-05-26T13:30:57.902Z [lean-explore] [info] Message from client: {"method":"initialize","params":{"protocolVersion":"2024-11-05","capabilities":{},"clientInfo":{"name":"claude-ai","version":"0.1.0"}},"jsonrpc":"2.0","id":0}
[05/26/25 09:30:58] INFO Starting MCP server 'LeanExplore API server.py:774
Server' with transport 'stdio'
INFO:FastMCP.fastmcp.server.server:Starting MCP server 'LeanExplore API Server' with transport 'stdio'
2025-05-26T13:30:58.854Z [lean-explore] [info] Message from server: {"jsonrpc":"2.0","id":0,"result":{"protocolVersion":"2024-11-05","capabilities":{"experimental":{},"prompts":{"listChanged":false},"resources":{"subscribe":false,"listChanged":false},"tools":{"listChanged":true}},"serverInfo":{"name":"LeanExplore API Server","version":"1.9.1"}}}
2025-05-26T13:30:58.887Z [lean-explore] [info] Message from client: {"method":"notifications/initialized","jsonrpc":"2.0"}
2025-05-26T13:30:58.917Z [lean-explore] [info] Message from client: {"method":"tools/list","params":{},"jsonrpc":"2.0","id":1}
2025-05-26T13:30:58.917Z [lean-explore] [info] Message from client: {"method":"tools/list","params":{},"jsonrpc":"2.0","id":2}
2025-05-26T13:30:58.918Z [lean-explore] [info] Message from client: {"method":"resources/list","params":{},"jsonrpc":"2.0","id":3}
INFO:mcp.server.lowlevel.server:Processing request of type ListToolsRequest
INFO:mcp.server.lowlevel.server:Processing request of type ListToolsRequest
2025-05-26T13:30:58.921Z [lean-explore] [info] Message from server: {"jsonrpc":"2.0","id":1,"result":{"tools":[{"name":"search_lean_statements","description":"\nSearch for Lean 4 statement groups based on a natural language query.\n\nThis tool allows you to search through Lean 4 mathematical statements, theorems,\nand definitions from various projects including Mathlib. The search uses semantic\nmatching to find relevant mathematical concepts.\n\nArgs:\n query: Natural language search query (e.g., \"commutative monoid\", \"group theory\", \"topology\")\n packages: Optional list of package names to filter by (e.g., [\"Mathlib\", \"Std\"])\n\nReturns:\n Dictionary containing search results with statement groups, their definitions,\n docstrings, and metadata like source files and line numbers.\n","inputSchema":{"properties":{"query":{"title":"Query","type":"string"},"packages":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"title":"Packages"}},"required":["query"],"type":"object"}},{"name":"get_lean_statement","description":"\nRetrieve detailed information for a specific Lean statement group by its ID.\n\nUse this tool when you have a specific statement group ID (from search results)\nand want to get complete details about that statement.\n\nArgs:\n group_id: The unique integer ID of the statement group\n\nReturns:\n Dictionary containing detailed information about the statement group including\n its full text, declaration info, source location, and documentation.\n","inputSchema":{"properties":{"group_id":{"title":"Group Id","type":"integer"}},"required":["group_id"],"type":"object"}},{"name":"get_lean_statement_dependencies","description":"\nGet the direct dependencies (citations) for a specific Lean statement group.\n\nThis tool helps you understand what other mathematical statements or theorems\na particular statement depends on or references.\n\nArgs:\n group_id: The unique integer ID of the statement group to get dependencies for\n\nReturns:\n Dictionary containing the list of dependencies with their details including\n names, statements, and source information.\n","inputSchema":{"properties":{"group_id":{"title":"Group Id","type":"integer"}},"required":["group_id"],"type":"object"}},{"name":"explore_lean_concept","description":"\nExplore a mathematical concept in Lean by searching and providing context.\n\nThis is a higher-level tool that searches for a concept and provides additional\ncontext by exploring related statements and their dependencies.\n\nArgs:\n concept: Mathematical concept to explore (e.g., \"group homomorphism\", \"continuous function\")\n max_results: Maximum number of results to explore in detail (default: 10)\n\nReturns:\n Dictionary containing search results with enhanced context including\n related statements and dependency information where available.\n","inputSchema":{"properties":{"concept":{"title":"Concept","type":"string"},"max_results":{"default":10,"title":"Max Results","type":"integer"}},"required":["concept"],"type":"object"}}]}}
2025-05-26T13:30:58.922Z [lean-explore] [info] Message from server: {"jsonrpc":"2.0","id":2,"result":{"tools":[{"name":"search_lean_statements","description":"\nSearch for Lean 4 statement groups based on a natural language query.\n\nThis tool allows you to search through Lean 4 mathematical statements, theorems,\nand definitions from various projects including Mathlib. The search uses semantic\nmatching to find relevant mathematical concepts.\n\nArgs:\n query: Natural language search query (e.g., \"commutative monoid\", \"group theory\", \"topology\")\n packages: Optional list of package names to filter by (e.g., [\"Mathlib\", \"Std\"])\n\nReturns:\n Dictionary containing search results with statement groups, their definitions,\n docstrings, and metadata like source files and line numbers.\n","inputSchema":{"properties":{"query":{"title":"Query","type":"string"},"packages":{"anyOf":[{"items":{"type":"string"},"type":"array"},{"type":"null"}],"default":null,"title":"Packages"}},"required":["query"],"type":"object"}},{"name":"get_lean_statement","description":"\nRetrieve detailed information for a specific Lean statement group by its ID.\n\nUse this tool when you have a specific statement group ID (from search results)\nand want to get complete details about that statement.\n\nArgs:\n group_id: The unique integer ID of the statement group\n\nReturns:\n Dictionary containing detailed information about the statement group including\n its full text, declaration info, source location, and documentation.\n","inputSchema":{"properties":{"group_id":{"title":"Group Id","type":"integer"}},"required":["group_id"],"type":"object"}},{"name":"get_lean_statement_dependencies","description":"\nGet the direct dependencies (citations) for a specific Lean statement group.\n\nThis tool helps you understand what other mathematical statements or theorems\na particular statement depends on or references.\n\nArgs:\n group_id: The unique integer ID of the statement group to get dependencies for\n\nReturns:\n Dictionary containing the list of dependencies with their details including\n names, statements, and source information.\n","inputSchema":{"properties":{"group_id":{"title":"Group Id","type":"integer"}},"required":["group_id"],"type":"object"}},{"name":"explore_lean_concept","description":"\nExplore a mathematical concept in Lean by searching and providing context.\n\nThis is a higher-level tool that searches for a concept and provides additional\ncontext by exploring related statements and their dependencies.\n\nArgs:\n concept: Mathematical concept to explore (e.g., \"group homomorphism\", \"continuous function\")\n max_results: Maximum number of results to explore in detail (default: 10)\n\nReturns:\n Dictionary containing search results with enhanced context including\n related statements and dependency information where available.\n","inputSchema":{"properties":{"concept":{"title":"Concept","type":"string"},"max_results":{"default":10,"title":"Max Results","type":"integer"}},"required":["concept"],"type":"object"}}]}}
INFO:mcp.server.lowlevel.server:Processing request of type ListResourcesRequest
2025-05-26T13:30:58.927Z [lean-explore] [info] Message from server: {"jsonrpc":"2.0","id":3,"result":{"resources":[]}}
2025-05-26T13:30:58.960Z [lean-explore] [info] Message from client: {"method":"prompts/list","params":{},"jsonrpc":"2.0","id":4}
INFO:mcp.server.lowlevel.server:Processing request of type ListPromptsRequest
2025-05-26T13:30:58.963Z [lean-explore] [info] Message from server: {"jsonrpc":"2.0","id":4,"result":{"prompts":[]}}

Eric Taucher (May 26 2025 at 13:50):

For the proof of concept used a venv so some of the paths in the configuration file claude_desktop_config.json need to point at such.

claude_desktop_config.json

{
"mcpServers": {

"lean-explore": {
    "command": "C:\\Users\\Eric\\Projects\\MCP (Model Context Protocol)\\lean_explore_mcp\\venv\\Scripts\\python.exe",
    "args": ["C:\\Users\\Eric\\Projects\\MCP (Model Context Protocol)\\lean_explore_mcp\\lean_explore_server.py"],
    "env": {
        "LEAN_EXPLORE_API_KEY": "API Key copied here"
    }
}

}
}

Eric Taucher (May 26 2025 at 14:07):

Here is a simple test.
Disabled the web search tool for Claude so that it would not try that also, wanted it to rely on LeanExplore for reply.

Prompt: Search for theorems about group homomorphisms in Lean

If Claude determines it needs one of the LeanExplore API endpoints, before the external call is made a dialog is posted

image.png

Claude Reply

I'll search for theorems about group homomorphisms in Lean for you.

Great! I found a comprehensive collection of theorems about group homomorphisms in Lean. Let me get more details about some of the most fundamental ones to provide you with a better overview.

Based on my search, here's a comprehensive overview of theorems about group homomorphisms in Lean's Mathlib:

Fundamental Properties of Group Homomorphisms

Basic Structure Preservation

map_inv: Group homomorphisms preserve inverses: f(a⁻¹) = (f(a))⁻¹
map_div: Group homomorphisms preserve division: f(a/b) = f(a)/f(b)
map_zpow: Group homomorphisms preserve integer powers: f(g^n) = f(g)^n for n : ℤ
map_commutatorElement: Group homomorphisms preserve commutators: f([g₁, g₂]) = [f(g₁), f(g₂)]

Range and Kernel

range: The range of a group homomorphism f : G →* N is a subgroup of N
ker: The kernel of a bounded group homomorphism, naturally endowed with subgroup structure

Quotient Groups and Lifting

QuotientGroup.lift: Universal property - if φ : G →* M and N ≤ ker(φ), then φ factors through the quotient G/N →* M
QuotientGroup.mk': The canonical projection homomorphism from a group to its quotient by a normal subgroup

Specialized Homomorphisms

Category Theory Context

CategoryTheory.NatTrans.appHom: Application of natural transformations as group homomorphisms
CategoryTheory.Preadditive.leftComp and rightComp: Composition operations as group homomorphisms in preadditive categories

Normed Groups

NormedAddGroupHom: Structure for bounded group homomorphisms between normed groups
NormedAddGroupHom.incl: Inclusion of subgroups as bounded homomorphisms

Free Groups and Quotients

FreeAbelianGroup.toFinsupp: Homomorphism from free abelian groups to finitely supported functions
Finsupp.toFreeAbelianGroup: The inverse direction

Sylow Theory

Sylow.mapSurjective: Surjective group homomorphisms map Sylow subgroups to Sylow subgroups

Group Extensions

GroupExtension: Structure for group extensions with inclusion and quotient homomorphisms
HNNExtension.lift: Universal property for HNN extensions

Units and Special Groups

toHomUnits: Lifting group homomorphisms to unit groups
Equiv.Perm.ofSubtype: Inclusion of permutation groups on subtypes

The theorems cover a wide range of applications from basic algebraic properties to advanced topics in representation theory, algebraic topology (via category theory), and specialized areas like Lorentz groups in physics. The Lean formalization maintains both multiplicative and additive versions of most theorems through the @[to_additive] attribute.

Would you like me to explore any specific aspect of these group homomorphism theorems in more detail?

Justin Asher (May 26 2025 at 14:09):

@Eric Taucher To create the OpenAPI spec, I copied my API file from my backend (not publically available), my API models file (on lean-explore repo), and I think a few other relevant files just to be safe. I gave them all to Gemini 2.5 Pro, and let it work its magic. I mainly had to just read through it and correct a few mistakes it made. Since OpenAPI is not a new thing and common, this was relatively straightforward for Gemini.

Justin Asher (May 26 2025 at 14:12):

Eric Taucher said:

Making progress on using LeanExplore as an MCP server for use with Anthropic Claude acting as MCP client.

Nice! I actually had the same thought last night, inspired by Terence Tao's videos, and connected LeanExplore to Claude Desktop too. The main problem with my method was that I had to remove the standard printing to console when starting LeanExplore via mcp leanexplore serve (Claude does not like this...). I also would like to get rid of the mandatory API-key argument, but this was not really a hindrance. I plan on adding information about how to do this in the docs after I update the package with some of the changes we discussed.

Justin Asher (May 26 2025 at 14:17):

Eric Taucher said:

Here is a simple test.

This looks really good. One thing I noticed is that, when writing the CLI agent, it is better to tell it to act like a deep research tool. Meaning, it should make multiple queries and then look at the dependencies of any results that it would like to present.

Here is the prompt that I used:

Prompt

You are a CLI assistant for searching a Lean 4 mathematical library.
Goal: Find relevant Lean statements, understand them (including dependencies), and explain them conversationally to the user.
Output: CLI-friendly (plain text, simple lists). NO complex Markdown/LaTeX.

Packages: Use exact top-level names for filters (Batteries, Init, Lean, Mathlib, PhysLean, Std). Map subpackage mentions to top-level (e.g., 'Mathlib.Analysis' -> 'Mathlib').

Core Workflow:
1. Search & Analyze:
* Execute multiple distinct search queries for each user request (e.g., using full statements, rephrasing). Set limit >= 10 for each search.
* From all search results, select the statement(s) most helpful to the user.
* For each selected statement, MUST use get_dependencies to understand its context before explaining.

Explain Results (Conversational & CLI-Friendly):
* Briefly state your search approach (e.g., 'I looked into X in Mathlib...').
* For each selected statement:
* Introduce it (e.g., Lean name: primary_declaration.lean_name).
* Explain its meaning (use docstring, informal_description, statement_text).
* Provide the full Lean code (statement_text).
* Explain key dependencies (what they are, their role, using statement_text or display_statement_text from get_dependencies output).
Specific User Follow-ups (If Asked):
* get_by_id: For a specific ID, provide: ID, Lean name, statement text, source/line, docstring, informal description (structured CLI format).
* get_dependencies (Direct Request): For all dependencies of an ID, list: ID, Lean name, statement text/summary. State total count.

Always be concise, helpful, and clear.

Justin Asher (May 26 2025 at 14:21):

I think the next big feature that I would like to add to LeanExplore is nightly updates to the indexed packages. Once we have this done, I think it will be a lot easier to start working on Lean bots. The problem is that my current method of extracting dependencies is by putting everything in the same package (i.e., one lakefile), so I use the stable version of each package that I want to index. In order to do nightly updates, we will have to use the lakefile from each package, extract dependency information, and then somehow match this information across different package versions to create one graph. (Extracting ASTs should be fine.) I think this is all possible, but will take a bit of effort, so I plan on working on this later this week (starting Friday) when I have more time.

Justin Asher (May 26 2025 at 14:21):

Let me know what you think. :slight_smile:

Eric Taucher (May 26 2025 at 14:40):

Justin Asher said:

Let me know what you think.

One option I am considering to try and that others might desire is to decouple the data stores for the Lean functions, so that users can add their own and customize them as needed. Think of one MCP server acting as a client for other MCP servers, in this case MCP for storing data. For such MCP example servers for data storage see: https://modelcontextprotocol.io/examples#reference-implementations

As such have a another OpenAPI spec for data sources, then easily creating a new MCP using FastMCP integrated with OpenAPI would allow those vibe coding to on-board quickly in adding other options. Would really like to see others quickly take their ideas to working tools with some help from those up to speed on MCP.

Justin Asher (May 26 2025 at 18:50):

@Eric Taucher Let me think this through. I agree that running multiple storage back-end MCP servers behind a single logic MCP server would be powerful. I am still contemplating how to let users add their own data—while that’s valuable, my existing pipeline for parsing, dependency extraction, embedding generation, and PageRank computation is quite heavy. My plan is to do that work once in the cloud and then serve the results. Having users simply request that LeanExplore server index their project (say through the CLI) should work just as well. If we trigger indexing nightly—or ideally on every commit—the stored data will be fresh enough for day-to-day use. For any files updated since the last index run, we can fall back to loading them directly into the model’s context window.

Eric Taucher (May 26 2025 at 19:05):

Justin Asher said:

Having users simply request that LeanExplore server index their project (say through the CLI) should work just as well

The purpose is not to speed up the processing, it is to allow others to possibly experiment with other ways that might improve on some task.

For example if one has a task where they need to finish off a sorry, then most if not all of what could be needed from Lean would available in the raw data LeanExplore collects. If the raw data is made available to others and they can do different things with it, be on their local machine, in the cloud, some other MCP, etc. They might find a solution to a sorry that other means missed. Having worked with Prolog and Neo4j I have seen some of the pros and cons each can bring to bearing down on problems and they are worth exploring.

Again it is just a suggestion. :slight_smile:

Another idea is to just expose the collection of the raw data from Lean, store it in a data store and expose that as an MCP and the others could accomplish the same. There are a lot of valuable parts within LeanExplore that others can use.

Justin Asher (May 26 2025 at 19:20):

@Eric Taucher I do expose the database in order to run the service locally. This is what running

leanexplore data fetch

does. It downloads the database and index to, if I recall correctly, ~/.lean-explore/data from a Cloudflare R2 bucket. Perhaps I need to make this clearer in the docs. Is this part of what you are looking for? I do not want to upload the, for instance, raw ASTs since those are quite large.

Justin Asher (May 26 2025 at 19:20):

And thanks for the feedback. This has been super helpful.

Eric Taucher (May 26 2025 at 19:24):

Justin Asher said:

I do expose the database in order to run the service locally.

Thanks. Will take a closer look!

Justin Asher (May 27 2025 at 03:40):

Bolton Bailey said:

I would also like for declarations that appear in the lean code to themselves link to their own docs pages.

I finally managed to get around to implementing this feature. Now you can see the dependencies of a given statement in a dropdown at the bottom, and you can click on them to quickly see their cards, as well. I hope this works well.

Justin Asher (May 27 2025 at 03:41):

I also patched a bug in my code which was displaying the wrong names for certain declarations. For instance, AlgebraicGeometry.Scheme was being displayed as AlgebraicGeometry.Scheme.toLocallyRingedSpace. With this being fixed, the updated search mechanism with the embeddings of declaration names should work better.

Justin Asher (May 27 2025 at 03:44):

Also, if anyone tries to download the database, be sure to update your package version:
pip install --upgrade lean-explore

I made a mistake when updating file data with the manifest.json (still learning how to get all of these tools to work together). I likewise need to implement a more robust file system for storing database versions locally. Will work on this when I can.

Justin Asher (May 30 2025 at 19:21):

You should now be able to use LeanExplore with programs like Claude Desktop. Here is a link to the relevant documentation:
https://www.leanexplore.com/docs/mcp#integrating-with-mcp-client-applications

Justin Asher (May 30 2025 at 19:23):

@Alexander Heckett This should solve the MCP problem you were encountering last night. I presume this will work with Claude Code, too, but I have not tested it.

Justin Asher (May 30 2025 at 19:23):

In order for this to work, you will need to upgrade lean-explore.

Justin Asher (Jun 03 2025 at 16:26):

I updated the LeanExplore website to use what I believe to be—and what my data shows to be—an improved search algorithm. Now it is leveraging BM25+ on top of the existing semantic search framework to make it much more responsive to keywords. This is important because in mathematics and type-theory, a single word can often make a big difference.

For instance, before this update, "finite type morphism schemes" would only return 2 results in the first 10 that were genuinely about this topic, typically returning statements about things being finite type in other situations. Now, it 9 of the first 10 results directly relate to scheme morphisms of finite type.

I also fixed a bug in my code which was giving some statements the wrong title. One such example was
"Group.isUnit" was being displayed as "AddGroup.isAddUnit" due to the @[to_additive] tag. However, I think that having my search engine include separate entries for things like AddGroup.isAddUnit is important—right now I only display the source code—and so I will be figuring out how to do this at some point in the future.

I will be updating the Python package later.

Justin Asher (Jun 04 2025 at 19:21):

All the updates should be live now. Let me know if anyone has any suggestions, comments, or criticisms, as always.

Justin Asher (Jun 18 2025 at 13:29):

I added the requested feature to click on the title to go to the docs (this is separate from clicking on the file path which takes you to the source code).

Note that LeanExplore is currently indexing v4.19.0 while we are at v4.20.0+ (I plan to update this soon), so some links may not work, but most of them should work fine. Sorry if that causes any inconvenience.

Please let me know if there are any other requested features, and I will work on them when I can.

Alok Singh (Jun 20 2025 at 06:12):

saving intermediate tactic state or putting it in the text sent off to embeddings?

Justin Asher (Jun 20 2025 at 13:47):

I do not really do embeddings for tactic states yet, since I did not want to balloon the size of the index. Let me think about it.

Alok Singh (Jun 20 2025 at 15:08):

embedding our own repos? (maybe discussed above, don't recall)

Eric Taucher (Jun 20 2025 at 15:29):

Justin Asher said:

I did not want to balloon the size of the index

Are you thinking of that as

a rule of thumb
a hard constraint
a problem needing a solution
applicable only in early development
something else

Trying to get a better understanding of the statement, thanks.

Justin Asher (Jun 20 2025 at 18:37):

Alok Singh said:

embedding our own repos? (maybe discussed above, don't recall)

This is doable but would take a decent amount of time to implement properly. Was already thinking about this, so I might when I find time.

Justin Asher (Jun 20 2025 at 18:42):

@Eric Taucher I presume what is being mentioned in the goal states that the LSP produces. This would not be a bad idea by itself, but since each Lean proof can admit a decent number of these, embedding them and then indexing them would be nontrivial. When you couple this with the requirement that I would like LeanExplore to run locally on peoples' machines (hence, why I am using a 110m parameter model), I did not think people would want to download a potentially 20gb database to do basic Lean querying (the DB is already a decent download size). This seems like a separate functionality which would be better suited as a standalone tool.

There is also the problem that smaller embedding models do not play nice with equations (just try swapping variable names and see what happens), so we would probably need a more specialized embedding solution than just an off-the-shelf model. I bet you could train a small model to do this, but it did not entice me too much at the moment.

Eric Taucher (Jun 20 2025 at 19:09):

Justin Asher said:

we would probably need a more specialized embedding solution than just an off-the-shelf mode

That's what I was thinking too.

Today started looking at Knowledge Graph Embedding methods.

While I have not implemented this idea yet, thinking that while the number of nodes might be large for Lean theorems and such, the number of types of relationships might be small and from what I am reading this might lead to smaller sized embeddings, a dimension of a few hundred versus a dimension of a few thousand.

If this turns up something useful will give an update. Don't expect anything for a week or more, fingers crossed.

Justin Asher (Jun 21 2025 at 16:43):

@Eric Taucher This is good to think about. Let me know if you figure anything else! Was just trying to stick to the basics.

Eric Taucher (Jun 22 2025 at 15:25):

FY I

Understanding RAG Pipelines: Text Embeddings vs Knowledge Graph Embeddings for Lean Code Search

This was generated by Claude.ai - it helped me understand some of the finer points

Understanding RAG Pipelines

Understanding RAG Pipelines: Text Embeddings vs Knowledge Graph Embeddings for Lean Code Search

If you're building tools to search through Lean libraries, theorems, or mathematical formalization, you've probably heard about "embeddings" and "RAG pipelines." This post breaks down these concepts and shows why Knowledge Graph Embeddings (KGE) might be the missing piece for truly powerful Lean code search.

The Standard RAG Pipeline Pattern

Every embedding-based search system follows the same 5-stage pipeline, whether you're searching Lean theorems, mathematical papers, or code:

Stage	Name	Purpose	Example
1	Document Processing	Clean, normalize, chunk content	Parse Lean files, extract theorem statements
2	Embedding Generation	Convert text/structure → dense vectors	Text → SentenceTransformer OR Graph → KGE model
3	Vector Indexing	Organize vectors for fast search	FAISS, Pinecone, Weaviate
4	Persistence Layer	Save for later use	Store indices to disk
5	Retrieval	Query → similar vectors	User query → relevant theorems

The key insight: Stages 1, 3-5 are identical regardless of embedding method. Only Stage 2 (how you create embeddings) differs between approaches.

Text Embeddings (Current Standard)

Pipeline for Lean Code Search

# STAGE 1: Document Processing
lean_files = parse_lean_library()
theorem_texts = extract_theorem_statements(lean_files)
processed_texts = [
    f"Theorem: {thm.name} Statement: {thm.statement} Proof: {thm.proof_sketch}"
    for thm in theorem_texts
]

# STAGE 2: Text Embedding Generation
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-mpnet-base-v2")  # Good for mathematical text
embeddings = model.encode(processed_texts)

# STAGE 3-4: Vector Indexing & Storage
import faiss
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings.astype(np.float32))
faiss.write_index(index, "lean_theorems.faiss")

# STAGE 5: Search
def search_theorems(query: str):
    query_embedding = model.encode([query])
    distances, indices = index.search(query_embedding.astype(np.float32), k=10)
    return [theorems[idx] for idx in indices[0]]

What Text Embeddings Excel At :check:

# Semantic similarity queries
search_theorems("theorems about continuous functions")
search_theorems("properties of prime numbers")
search_theorems("results involving measure theory")

Where Text Embeddings Fail :cross_mark:

# Dependency-based queries
search_theorems("theorems that build on Fermat's Little Theorem")
search_theorems("what can I prove using this specific lemma?")
search_theorems("find theorems in the same mathematical family")

# The problem: These require understanding RELATIONSHIPS between theorems,
# not just semantic similarity of their text descriptions

Knowledge Graph Embeddings (The Missing Piece)

Building a Lean Knowledge Graph

# STAGE 1: Extract Structural Relationships (Not Just Text!)
def extract_lean_knowledge_graph():
    triples = []

    for lean_file in parse_lean_library():
        for theorem in lean_file.theorems:
            # Dependencies from 'by' tactics and 'have' statements
            for dependency in theorem.dependencies:
                triples.append((theorem.name, "depends_on", dependency))

            # Type relationships
            for type_class in theorem.type_classes:
                triples.append((theorem.name, "uses_typeclass", type_class))

            # Namespace hierarchy
            triples.append((theorem.name, "in_namespace", theorem.namespace))

            # Proof technique relationships
            for tactic in theorem.tactics_used:
                triples.append((theorem.name, "uses_tactic", tactic))

            # Mathematical concept relationships
            concepts = extract_mathematical_concepts(theorem.statement)
            for concept in concepts:
                triples.append((theorem.name, "involves_concept", concept))

    return triples

# Example triples from Lean's mathlib:
# ("Continuous.comp", "depends_on", "Continuous")
# ("Prime.dvd_mul", "involves_concept", "Prime")
# ("Ring.add_comm", "in_namespace", "Ring")
# ("Fermat_little_theorem", "generalizes", "Wilson_theorem")

KGE Training (Same Pipeline, Different Stage 2)

# STAGE 2: KGE Embedding Generation (Instead of Text Embeddings)
from pykeen.models import ComplEx
from pykeen.triples import TriplesFactory

# Create knowledge graph
triples = extract_lean_knowledge_graph()
tf = TriplesFactory.from_labeled_triples(triples)

# Train KGE model
model = ComplEx(triples_factory=tf, embedding_dim=300)
model.train()  # Learn vector representations of theorems and relationships

# Extract embeddings (same format as text embeddings!)
theorem_embeddings = model.entity_representations[0]()

# STAGE 3-5: Identical to text approach
index = faiss.IndexFlatIP(300)
index.add(theorem_embeddings.astype(np.float32))
faiss.write_index(index, "lean_kge.faiss")

Where KGE Excels: Relationship-Based Queries

1. Dependency Discovery

query = "What theorems can I prove using Fermat's Little Theorem?"

# Text embeddings: Finds theorems that MENTION "Fermat" in description
# KGE: Finds theorems that actually DEPEND ON Fermat's Little Theorem in their proofs

2. Mathematical Family Understanding

query = "Show me theorems related to group homomorphisms"

# Text embeddings: Keyword matching on "group homomorphism"
# KGE: Discovers theorem families through mathematical relationships:
#   Group.hom → Ring.hom → Field.hom (inheritance)
#   Group.hom → Group.iso → Group.aut (specializations)

3. Proof Strategy Discovery

query = "Theorems that use induction on natural numbers"

# Text embeddings: Finds text mentioning "induction"
# KGE: Finds theorems connected through "uses_tactic" → "induction" relationships

4. Multi-Hop Mathematical Reasoning

# Query: "Theorems connecting topology and algebra"
# KGE can traverse:
#   Theorem_A → involves_concept → "continuous"
#   → related_to → "homomorphism"
#   → involved_in → Theorem_B
#
# Discovering non-obvious connections across mathematical domains

Hybrid Architecture: Best of Both Worlds

class LeanTheremSearch:
    def __init__(self):
        # Text embeddings for semantic understanding
        self.text_index = faiss.read_index("lean_text_embeddings.faiss")
        self.text_model = SentenceTransformer("all-mpnet-base-v2")

        # KGE embeddings for relationship understanding
        self.kge_index = faiss.read_index("lean_kge_embeddings.faiss")
        self.kge_model = load_trained_kge_model()

    def search(self, query: str):
        # Classify query type
        if self.is_relationship_query(query):
            # Use KGE primarily
            results = self.kge_search(query)
            text_boost = self.text_search(query, weight=0.2)
        elif self.is_semantic_query(query):
            # Use text primarily
            results = self.text_search(query)
            kge_boost = self.kge_search(query, weight=0.2)
        else:
            # Combine both equally
            text_results = self.text_search(query)
            kge_results = self.kge_search(query)
            results = self.merge_results(text_results, kge_results)

        return results

    def is_relationship_query(self, query: str) -> bool:
        relationship_indicators = [
            "depends on", "builds on", "uses", "related to",
            "same family", "connected to", "generalizes"
        ]
        return any(indicator in query.lower() for indicator in relationship_indicators)

Real Lean Examples Where KGE Shines

Example 1: Finding Proof Dependencies

-- User has this theorem and wants to know what they can prove with it:
theorem my_custom_lemma (n : ℕ) : n + 0 = n := by simp
-- Text search: Finds theorems mentioning "n + 0" in text
-- KGE search: Finds ALL theorems that could use my_custom_lemma in their proofs
--   through dependency graph traversal

Example 2: Mathematical Concept Hierarchy

-- Query: "Show me the Functor typeclass hierarchy"
-- Text embeddings: Keyword search for "Functor"
-- KGE: Maps the actual mathematical relationships:
--   Functor → Applicative → Monad
--   Functor → Contravariant
--   Functor ← Bifunctor

Example 3: Tactic Discovery

-- Query: "What tactics work well with ring structures?"
-- KGE can find: theorems about rings → tactics used in their proofs
--   Ring theorems often use: ring, ring_nf, abel, group

Implementation Considerations for Lean

Data Sources for Knowledge Graph

Import dependencies (import statements)
Theorem dependencies (extracted from proof terms)
Typeclass relationships (extends, instances)
Namespace hierarchy (mathematical organization)
Tactic usage patterns (proof methodology)
Notation relationships (syntactic connections)

Challenges Specific to Lean

Proof extraction: Getting dependency info from proof terms
Implicit arguments: Understanding when theorems actually depend on each other
Tactic transparency: Some tactics hide their dependencies
Version compatibility: Mathlib evolves rapidly

Eric Taucher (Jun 22 2025 at 15:26):

Continued

Conclusion

For Lean code search, you need both approaches:
* Text embeddings: Handle conceptual, semantic queries about mathematical content
* KGE embeddings: Handle structural, dependency-based queries about theorem relationships

The pipeline pattern is identical—only the embedding generation method changes. FAISS works with both, and you can build a hybrid system that intelligently routes queries to the best approach.

The result is a search system that understands both what theorems say (text) and how theorems connect (knowledge graph), giving Lean users the mathematical reasoning support they need.

Eric Taucher (Jun 22 2025 at 15:32):

Have not created any working code at this time, just trying to get an understanding if this would be beneficial. Before doing a KGE for LEAN, looking at doing one for OEIS (Online Encyclopedia of Integer Sequences).

Eric Taucher (Jun 22 2025 at 21:05):

If one has a full knowledge graph and trained the knowledge graph to then create knowledge graph embeddings, is that better than creating the full knowledge graph in Prolog and then using a Prolog query to get the result?

Claude.ai reply

The choice between knowledge graph embeddings and Prolog-based querying depends significantly on your specific use case, but each approach has distinct advantages:
Knowledge Graph Embeddings are better for:
* Incomplete or noisy data - embeddings can capture patterns even when relationships are missing
* Similarity and analogy queries - finding entities that are semantically similar or completing analogies like "Paris is to France as Tokyo is to ?"
* Integration with neural networks - embeddings can be directly used in downstream ML tasks
* Scalability - once trained, embedding lookups are very fast even for massive graphs
* Handling uncertainty - can represent degrees of relationships rather than just binary true/false
Prolog is better for:
* Exact logical reasoning - when you need guaranteed correct answers based on rules
* Complex rule-based inference - handling sophisticated logical relationships, recursive queries, and constraint satisfaction
* Interpretability - you can trace exactly why a conclusion was reached
* Dynamic queries - easily adding new rules or facts without retraining
* Consistency checking - verifying that your knowledge base doesn't contain contradictions
Hybrid approaches often work best in practice:
* Use Prolog for core logical reasoning and rule enforcement
* Use embeddings for similarity search and handling incomplete data
* Combine both for different types of queries within the same system

The "better" choice really depends on whether you prioritize exact logical inference (Prolog) or flexible pattern recognition and similarity (embeddings). What's your specific use case?

Justin Asher (Jun 22 2025 at 23:08):

@Eric Taucher If you look at the database, I already have everything ordered in a graph like structure. If you give the models to Claude plus the paper it could probably set up the desired search system. I presume running a training algorithm would need some compute, though. I never did anything specialized because these things can be finicky. Would result in a good paper and improvement to the search engine if it worked.

Justin Asher (Jun 22 2025 at 23:10):

One of the things to think about is how we would handle the data sources here. I did not mix the Lean code with the informal descriptions nor with the titles, because this seemed to hurt the search efficacy for typical use cases. I certainly think exploiting the graph structure could lend itself to large improvements.

Eric Taucher (Jun 23 2025 at 09:08):

Justin Asher said:

I presume running a training algorithm would need some compute

Yes.

Eric Taucher (Jun 23 2025 at 09:33):

Justin Asher said:

I never did anything specialized because these things can be finicky

This is true.

Failure is also a good learning exercise. :slight_smile:

Eric Taucher (Jun 23 2025 at 14:35):

Correct me if any part of these two statements are wrong:

Lean Explore is focused on searching for Lean 4 declarations.
A graph in LeanExplore is based on StatementGroups

In this graph, an edge exists from StatementGroup A to StatementGroup B if any declaration within A depends on any declaration within B.

Each StatementGroup thus represents a unique, contiguous block of source code as authored by the user.

A knowledge graph for Knowledge Graph Embedding (KGE) requires the data as triples (ref):

A knowledge graph G={E,R,F} is a collection of entities E, relations R, and facts F. A fact is a triple (h, r, t)∈F that denotes a link r ∈ R between the head h ∈ E and the tail t ∈ E of the triple.

I don't know if StatementGroup will work for a KGE, it seems rather large for an KG entity, but as multiple KG triples will probably work, any misunderstanding of StatementGroup is my mistake.

More specifically with KGE, multi-hop question answering is looking to be of value; could not find a Wikipedia article or specific research survey paper but this is a key phrase that turns up good results for a Google search.

Since one of my goals is to see what an MCP (Model Context Protocol) server for Lean 4 can do, looking at many different Lean 4 problems/challenges.

In focusing in on multi-hop reasoning, these are some ideas proposed by Claude.ai.

Let me expand on areas where multi-hop reasoning would be particularly powerful in Lean 4, which will help us estimate the entity/relation ratios:

High-Value Multi-Hop Reasoning Areas

1. Proof Completion (`sorry` filling)

Chain: theorem A → uses lemma B → requires theorem C → depends on definition D
Reasoning: "To prove this about groups, I need this lemma about monoids, which requires this property of binary operations"

2. Typeclass Instance Discovery

Chain: Type T → has instance I1 → enables instance I2 → allows instance I3
Example: "I need CommRing R, but I have Field F → CommRing F → Ring F → AddCommGroup F"

3. Tactic Chain Suggestion

Chain: goal type G1 → tactic T1 → subgoal G2 → tactic T2 → solved
Pattern: "For inequality proofs, try linarith → if fails, try norm_num → if fails, try simp with these lemmas"

4. Conceptual Prerequisite Discovery

Chain: concept A → requires understanding B → builds on C → assumes knowledge of D
Educational: "To understand manifolds, you need topology, which needs metric spaces, which needs real analysis"

5. Library Navigation & Discovery

Chain: problem type P → solved by area A → uses theorems from B → defined in module C
Discovery: "Vector space problems often use linear algebra theorems from the LinearAlgebra.Basic family"

6. Definition Template Discovery

Chain: new concept X → similar to concept Y → uses pattern Z → requires properties W
Creation: "Defining a new algebraic structure? Look at how Group was defined, which follows this pattern, using these axioms"

Estimated Entity/Relation Ratios

Based on these use cases, here's my estimate for a comprehensive Lean 4 KG:
Entities (~500K-2M):

Theorems/Lemmas: ~100K-500K
Definitions: ~50K-200K
Types: ~20K-100K
Tactics: ~1K-5K
Modules/Files: ~5K-20K
Relations (~20-50 types):
Logical: proves, assumes, contradicts, equivalent_to
Structural: uses, depends_on, defined_in, imports
Semantic: generalizes, specializes, similar_to, applies_to
Tactical: solved_by, enables_tactic, creates_subgoal
Pedagogical: prerequisite_of, builds_on, example_of
Ratio: ~10K-40K entities per relation type

This suggests relation embeddings will be tiny compared to entity embeddings (maybe 1-2% of total size), so keeping all relations in RAM is very feasible.

Note the last line

This suggests relation embeddings will be tiny compared to entity embeddings (maybe 1-2% of total size), so keeping all relations in RAM is very feasible.

If this is not a hallucination, and from reading some research papers seems to indicate it is not, then this is worth the time to investigate.

Justin Asher (Jun 23 2025 at 16:43):

Eric Taucher said:

Correct me if any part of these two statements are wrong:

Lean Explore is focused on searching for Lean 4 declarations.

A graph in LeanExplore is based on StatementGroups

I can clarify a bit. LeanExplore is focused on searching source code. Since each bit of source code can generate multiple items, I needed some way to group these in an amicable way. I think there is probably a better way to do this (e.g., we might want to include declarations generated by tags in the database), but this seemed like the simplest solution. You can just think of a statement group as a bit of source code with a docstring and an informal description.

Justin Asher (Jun 23 2025 at 16:47):

Eric Taucher said:

I don't know if StatementGroup will work for a KGE, it seems rather large for an KG entity, but as multiple KG triples will probably work, any misunderstanding of StatementGroup is my mistake.

I think you would want to define some algorithm which sufficiently incorporates the dependency data to enhance the embeddings and then exploits that for search. This would in theory give the model a clearer idea of what everything means. Intuitively, it would be like instead of just looking for a theorem statement, you are looking for a theorem statement and utilizing the definitions that the theorem depends on in some extremely efficient manner (instead of having an AI agent manually check these). I could see how this would be possible and how it would enhance the algorithm.

Justin Asher (Jun 23 2025 at 16:48):

One fun simple experiment to run would be to just see if we can train the Lean code embeddings to have similar positions in space as the informal embeddings. Right now, I think the model is still quite poor at understanding Lean code, so if we can exploit our informal embeddings, we might be able to give it a much better understanding.

Last updated: Dec 20 2025 at 21:32 UTC