Zulip Chat Archive

Stream: general

Topic: Duper


Jason Rute (Dec 18 2023 at 15:49):

Where should we discuss Duper on Zulip?

Patrick Massot (Dec 18 2023 at 15:51):

I would say the general stream. It is a very general purpose tool that is not limited to Mathlib and not really about Lean 4 itself so not suitable for the Lean 4 stream.

Jason Rute (Dec 18 2023 at 16:04):

Place to talk about #announce > Duper.

Jason Rute (Dec 18 2023 at 16:05):

I made a place #general > Duper but don't have permission to move the messages above.

Jason Rute (Dec 18 2023 at 16:07):

This is really interesting. Do you have any idea how good it is alone? For example @Scott Morrison made a way to benchmark solvers like this. I'd love a comparison to Aesop and to AI-based tools like LLM-Step and Lean Copilot. (And yes, it is tricky to compare AI tools since they have been trained.)

Jason Rute (Dec 18 2023 at 16:13):

Also, you talk about making a Sledgehammer clone, but have you also thought about a MagnusHammer (#Machine Learning for Theorem Proving > New paper: Magnushammer) clone as well? Instead of using an ATP to find lemmas and reconstructing proofs with Duper, MagnusHammer just uses an AI-based lemma selection tool directly. If I read the MagnusHammer paper correctly, it is much better than SledgeHammer (although I'm annoyed at them for not putting a Venn diagram or similar so I don't know if solves similar theorems or if there is still an advantage to SledgeHammer). The simplest approach would be to hook up the lemma selection tool in @Kaiyu Yang's Lean Copilot (#Machine Learning for Theorem Proving > LeanCopilot). Although a lemma selector trained to work with Duper might be better, but unlike Isabelle you don't have any existing data for this since Duper has never been used in practice.

Patrick Massot (Dec 18 2023 at 16:18):

Jason Rute said:

I made a place #general > Duper but don't have permission to move the messages above.

Moved.

Jason Rute (Dec 18 2023 at 16:19):

@Patrick Massot Thanks, but there is no link from #announce > Duper back to this topic now.

Patrick Massot (Dec 18 2023 at 16:19):

Jason, do you know about lean-auto? It is currently then main user of Duper towards a full hammer pipeline.

Jason Rute (Dec 18 2023 at 16:22):

No. I'm out of the loop. So Duper is a component of lean-auto? Is lean-auto like a Sledgehammer converting the goal and a large list of possible lemmas to FOL, sending it to an ATP, reading the lemmas needed to solve the goal, and reconstructing the proof with Duper and the selected lemmas?

Patrick Massot (Dec 18 2023 at 16:23):

That's the plan. It is not yet complete.

Patrick Massot (Dec 18 2023 at 16:24):

Teaser: you will learn a lot more during LT2024 which features a double talk on duper and auto.

Terence Tao (Dec 18 2023 at 16:37):

I have a very naive question (coming from my lack of a good conceptual model of how github really works). For a collaborative formalization project run out of a Github repository, would each participant need to install such a tool separately, or is the installation done entirely within the repository, so that once an administrator installs the tool in the master repository, all participants automatically gain access to the tool after synchronizing with the master? One could imagine a division of labor emerging in a large project where some participants (who for whatever reason are not able or willing to use these tools) would speed up their coding by leaving lots of sorrys for routine "this is obvious" steps, and having other participants who are expert in these automated tools (and have the GPUs to run them quickly) then come in and try to fill them in efficiently.

Josh Clune (Dec 18 2023 at 16:39):

Jason Rute said:

This is really interesting. Do you have any idea how good it is alone? For example Scott Morrison made a way to benchmark solvers like this. I'd love a comparison to Aesop and to AI-based tools like LLM-Step and Lean Copilot. (And yes, it is tricky to compare AI tools since they have been trained.)

I'm not familiar with the technique of Scott's that you're referring to, but would be happy to look into it if you send a link. I haven't interacted much with LLM-Step or Lean Copilot, so I couldn't speak to that comparison at all at this point. I've used Aesop a little bit, but I understand that Aesop benefits a lot from knowing how to tweak options and add rules, so I wouldn't claim to be using Aesop to its full potential.

Still, in my anecdotal experience, I think that currently, Aesop and Duper excel on different sorts of problems. I would expect Aesop to be better than Duper on problems that are particularly far from first or higher order logic (e.g. problems centering on inductive types) and I would expect Duper to be better on problems that require a significant amount of native reasoning (e.g. problems that involve combining hypotheses in a variety of ways as opposed to primarily applying rules to the target).

Yaël Dillies (Dec 18 2023 at 16:40):

It's a global per-project install. It's like importing a file, but with extra steps. I don't think we can avoid rerunning duper incantations for some users only, though (except by telling them not to open/recompile the files containing the incantations). Certainly if you want to give it a try in PFR I can add it.

Jason Rute (Dec 18 2023 at 16:43):

@Terence Tao Some of these tools replace the tool call with proofs that run without the tool, so one doesn't need the tool to run the proof. This is especially important if the tool takes a long time, or uses external software like an automatied theorem prover or a neural network. I think Isabelle's sledgehammer does this for example, and the AI theorem provers like Lean Copilot also do this. Now, some of these generated proofs are more readable and maintainable than others. I imagine (without knowing for sure) that lean-auto calls will be replaced with duper tactics, but that will still require having duper installed which isn't as bad as lean-auto since it doesn't use any external tools.

Terence Tao (Dec 18 2023 at 16:44):

I think I'll start a separate thread in PFR to gauge interest amongst the contributors if they want to experiment with Duper or any other automation tool. Given that PFR has already achieved its primary goal and is now just pursuing secondary goals, I think we can afford to try some experimental tools even if there is a risk that the project could be slowed down or temporarily broken due to a botched install or something.

Josh Clune (Dec 18 2023 at 16:50):

Jason Rute said:

I imagine (without knowing for sure) that lean-auto calls will be replaced with duper tactics, but that will still require having duper installed which isn't as bad as lean-auto since it doesn't use any external tools.

I can confirm that eventually, auto calls (from lean-auto) should be transformable into Duper calls. Using auto would likely require having and installing external tools, but using Duper wouldn't, so although there might be some individuals that can more easily use auto (because they have the external tools), once the auto calls are transformed to duper calls, any Lean user should be able to use the file.

Jason Rute (Dec 18 2023 at 16:51):

@Terence Tao Also, you mention GPUs, but to be clear, I don't think duper or lean-auto use GPUs at all since they don't use neural networks. Duper is just a tactic I think. Lean-auto uses external first-order logic automated theorem provers. Lean Copilot (#Machine Learning for Theorem Proving > LeanCopilot) and LLM-step can however use GPUs if I understand correctly since they are based on locally running large language models. They are two very different approaches to the same problem, automating the "easy stuff".

Terence Tao (Dec 18 2023 at 16:54):

Ah thanks for clarifying. If it is "just" a tactic then it seems like there is less downside to including it in a project where many of the contributors may not actually want to use the tool and don't want the project to accidentally melt their laptops through excessive CPU usage or something.

Josh Clune (Dec 18 2023 at 16:58):

Jason Rute said:

Also, you talk about making a Sledgehammer clone, but have you also thought about a MagnusHammer (#Machine Learning for Theorem Proving > New paper: Magnushammer) clone as well?

I wouldn't be opposed to exploring something like this down the line, but for now, I think there are good technical reasons for pursuing a sledgehammer clone before a magnushammer clone. One of the issues, as you mentioned, is that we don't have any existing data of Duper being used. Another issue is that currently, Duper is pretty sensitive to receiving just the facts that it needs to prove the goal (given too many facts, Duper can very easily spend all of its time proving irrelevant things and time out). That's an aspect of Duper I'm currently working to improve, but as long as that's the case, I expect that a sledgehammer which is able to provide the exact list of facts an external prover needed is more likely to successfully call Duper than just a lemma selection tool.

Jason Rute (Dec 18 2023 at 17:00):

@Josh Clune Is that different than what Sledgehammer uses? Is the METIS tactic (or whatever) behind Sledgehammer less susceptible to extra premises?

Josh Clune (Dec 18 2023 at 17:04):

Currently, yes. I think that has more to do with tuning and heuristics that Metis has that Duper doesn't yet have rather than any fundamental difference in approach though. My hope and intention is that Duper will get better on this front.

Kevin Buzzard (Dec 18 2023 at 17:35):

Sorry for the naive question: I have no idea what Isabelle's Metis tactic does or what we can expect duper to do, and I don't know what a "superposition solver" is. My mental model of aesop is that it will rewrite or apply lemmas which have been tagged with aesop_rule. What should my mental model of a superposition solver be?

Johan Commelin (Dec 18 2023 at 17:37):

My mental model is: you give it (explicitly) a bag of lemmas, and it proves the goal using them.

Johan Commelin (Dec 18 2023 at 17:37):

(Or fails, if you gave a bag that is too small or too large.)

Kevin Buzzard (Dec 18 2023 at 17:37):

So what's the difference between it and aesop?

Johan Commelin (Dec 18 2023 at 17:38):

In Isabelle, you typically use sledgehammer to find a good bag of lemmas, and sledgehammer will spit out a line of the form metis [lemma_1, ..., lemma_10].

Johan Commelin (Dec 18 2023 at 17:38):

I'm not the right person to comment on the different between aesop and duper.

Jason Rute (Dec 18 2023 at 17:52):

I think it also has to follow from first order reasoning. I don't think it will do induction or anything else higher order even if it is trivial (but that is just a guess from similar tactics I've seen in HOL Light).

Patrick Massot (Dec 18 2023 at 17:53):

Duper has some higher order capabilities.

Josh Clune (Dec 18 2023 at 18:05):

Kevin Buzzard said:

Sorry for the naive question: I have no idea what Isabelle's Metis tactic does or what we can expect duper to do, and I don't know what a "superposition solver" is. My mental model of aesop is that it will rewrite or apply lemmas which have been tagged with aesop_rule. What should my mental model of a superposition solver be?

I'm planning on giving a talk at LeanTogether 2024 that will help clarify a lot of this, but in the meantime, here's the gist: Duper will start with the facts that it is given and a target it is trying to prove. The first thing it'll do is try to transform the facts it has and the target its trying to prove into a mostly first-order problem. Then it'll assume the target's negation, take it as fact, and try to prove a contradiction. Duper will repeatedly take facts that it knows and try to combine them with a variety of rules that can be justified with mostly first order reasoning (though as Patrick said, Duper can do some higher order stuff as well). As Duper chugs along, it'll generate more and more facts until eventually, it derives a contradiction, times out, or 'saturates' (meaning it cannot generate more facts).

Currently, Duper won't do any special domain-specific reasoning (unless it is explicitly provided the lemmas necessary to do so), it won't do induction, and it only knows facts that are specifically provided to it. As far as the difference between Duper and Aesop, the fundamental approach is pretty different so there are just different problems that the two tactics are well-suited for. I hesitate to make any strong claims of Aesop's capabilities because I'm not a power-user of Aesop, but my anecdotal experience is that I would expect Aesop to be better than Duper on problems that are particularly far from first or higher order logic (e.g. problems centering on inductive types) and I would expect Duper to be better on problems that require a significant amount of native reasoning (e.g. problems that involve combining hypotheses in a variety of ways as opposed to primarily applying rules to the target).

Kevin Buzzard (Dec 18 2023 at 18:09):

Here is a comment not specific to Duper, but which harks back to Jason's recent comment on the LeanCopilot thread that we have all been "spoiled by github copilot". If I am using my serious home desktop to write Lean code then I would love to have hint running in the background and literally interrupting me (as GH Copilot does) saying "how about this?". However in contrast to Copilot it could only interrupt when it has actually found a proof, so basically telling me "OK I can close the goal from here".

However it seems to me that duper as it stands would not be a good fit for this, because I need to feed it the right lemmas, and this is what Isabelle's Sledgehammer does (and what we currently don't have). Is that correct? So, for example, right now to use duper to good effect I need to manually tell it all the lemmas which I think might be useful?

Josh Clune (Dec 18 2023 at 18:11):

I agree that Duper as it currently stands wouldn't be a good fit for this, but that down the line, auto from lean-auto might be.

Henrik Böving (Dec 18 2023 at 18:32):

Kevin Buzzard said:

Here is a comment not specific to Duper, but which harks back to Jason's recent comment on the LeanCopilot thread that we have all been "spoiled by github copilot". If I am using my serious home desktop to write Lean code then I would love to have hint running in the background and literally interrupting me (as GH Copilot does) saying "how about this?". However in contrast to Copilot it could only interrupt when it has actually found a proof, so basically telling me "OK I can close the goal from here".

However it seems to me that duper as it stands would not be a good fit for this, because I need to feed it the right lemmas, and this is what Isabelle's Sledgehammer does (and what we currently don't have). Is that correct? So, for example, right now to use duper to good effect I need to manually tell it all the lemmas which I think might be useful?

yes, roughly speaking sledgehammer in Isabelle works like this: take Isabelle theorem statement and translate it to an SMT problem (SMT is basically "smoosh together a whole lot of automated, domain specific and general reasoning procedures in a single tool"). Then they run that problem against an SMT solver like Z3 and based on its results (these results have no formal proof attached so we cannot trust them) feed a superposition prover (for Isabelle that's metis, for us that will be duper as I understand) with a list of theorems to use. And then that superposition prover spits out an Isabelle proof that their kernel can check.

I don't know about the intentions of the people working on this stuff but I would guess that some similar architecture might be planned, given the sledgehammer success? Maybe someone can comment on that.

Josh Clune (Dec 18 2023 at 18:36):

Henrik Böving said:

I don't know about the intentions of the people working on this stuff but I would guess that some similar architecture might be planned, given the sledgehammer success? Maybe someone can comment on that.

Yes, auto from lean-auto is working in that direction.

Henrik Böving (Dec 18 2023 at 18:37):

But auto is a bit different in that it is trying to reconstruct a proof from Z3 directly without duper right?

Josh Clune (Dec 18 2023 at 18:47):

That is also a direction that's being explored, but auto can already call Duper itself. Directly reconstructing proofs from external provers is strictly more work than gaining enough information from external provers to call Duper, so in the long run, one goal is to have auto be the analogue to sledgehammer with Duper as the analogue to Metis, and another (perhaps subsequent) goal is to have auto capable of directly reconstructing proofs from external provers. I don't anticipate a future where the latter goal is possible but the former isn't.

Jireh Loreaux (Dec 18 2023 at 18:53):

Just a quick question about naming, is duper meant to be dependent + super, or have I missed the boat completely?

Josh Clune (Dec 18 2023 at 18:56):

That would make a lot of sense, but no. Duper is named that just because it's the spiritual successor to Lean 3's super and the phrase "super duper" exists. It's not deep at all, but maybe I'll retroactively decide that the d in Duper stands for dependent.

Scott Morrison (Dec 18 2023 at 23:46):

Re: the tactic benchmarking tool mentioned above.

What I have right now is a lame prototype. In January I'm hoping to write something much better.

What currently exists is documented in the lean-training-data repository.

It allows you to run a specified tactic against every declaration in a file, or the whole library.

Sample output running on a single file:

% lake exe tactic_benchmark --aesop Mathlib.Topology.ContinuousFunction.Ordered
Mathlib.Topology.ContinuousFunction.Ordered 7 21
 ContinuousMap.abs (0.006912s) (111 heartbeats)
 ContinuousMap.instAbsContinuousMap (0.007926s) (115 heartbeats)
 ContinuousMap.abs_apply (0.027873s) (401 heartbeats)
 ContinuousMap.partialOrder (0.008932s) (105 heartbeats)
 ContinuousMap.le_def (0.071422s) (984 heartbeats)
...

(Here the 7 21 on the first output line is cryptically saying that aesop can generate a term of the specified type for 7 out of the 21 declarations in Mathlib.Topology.ContinuousFunction.Ordered. Note that it is trying to prove defs as well as theorem`s!)

Running on an entire library you can get output like:

Solved by aesop:  21013 / 151756  in  19938s
  but not exact?:  5466
  but not rfl:  6731
  but not simp_all:  15106

Future improvements:

  • Currently this only runs at declaration level, but we should run at every goal.
  • Currently testing new tactics requires modifying the executable!
  • Currently there is no way to filter goals (either by properties of the goal itself, or properties of how it was closed in the library)
  • Currently you just get the binary information "did it succeed", and can't get answers to questions like "is aesop or hint faster on this goal?"

I want to make all of these improvements, and make it easy to ask questions like:

  • Of all goals appearing in the library, which ones are closed by auto faster than the existing proof.
  • Of all goals appearing in the library which are solved by a simp only, which ones still work replacing simp only with simp_rw
  • ...

Notification Bot (Dec 19 2023 at 05:17):

5 messages were moved from this topic to #general > Duper on v4.4.0-rc1 by Utensil Song.

Wojciech Nawrocki (Dec 19 2023 at 21:47):

It looks like Lean's not differentiating between architectures has come back to bite a little here: I am on ARM64 MacOS, but the cloud release downloaded by Lake seems to have been built on x86. Importing Duper fails with a linker error, and the solution is a bit roundabout: rm -r .lake/packages/Duper/.lake/build/lib.

Josh Clune (Dec 20 2023 at 01:03):

Yikes! Thanks for posting about it. I've removed the most recent MacOS asset so hopefully, even if it takes longer to build the first time, others that try to build on an ARM64 Mac won't encounter any errors.


Last updated: Dec 20 2023 at 11:08 UTC