Zulip Chat Archive
Stream: general
Topic: Tactic worth running my PC overnight?
Martin Dvořák (Jan 05 2024 at 10:31):
I was thinking of ways to outsource more of my work to my computer.
Maybe there is a way to make progress on my proofs overnight (other than me having a revelation in a dream).
What if I could configure some tactics (perhaps aesop or simp variants) to be able to succeed even after minutes of no result?
Any tips?
I know there will be severely diminishing returns, but I want to try anyway.
Max Nowak 🐉 (Jan 05 2024 at 19:19):
I imagine a hammer would fall into this category, but I think all current hammers for Lean (lean-auto, lean-smt, duper, …) have very short default timeouts.
The existence of such an overnight tool would be really cool though!
Eric Rodriguez (Jan 05 2024 at 19:23):
I'd also be very in favour of something like this, seems really cool :)
Alok Singh (Jan 06 2024 at 18:24):
is there a way to set the timeout on individual invocations?
Kim Morrison (Jan 06 2024 at 23:11):
set_option maxHeartbeats 0 in ...?
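A minimal illustration of this suggestion (the example lemma is arbitrary; any aesop or simp call behaves the same under the option):

```lean
import Mathlib.Tactic

-- `maxHeartbeats 0` removes the per-declaration time budget entirely
-- (the default is 200000), so a tactic such as `aesop` keeps running
-- until it succeeds or fails rather than being cut off.
set_option maxHeartbeats 0 in
example (l : List Nat) : l.reverse.reverse = l := by
  aesop
```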
namibj (Jan 07 2024 at 00:34):
I'm trying to pull from a couple of recent deep-learning architecture papers to hopefully make a true model-free policy-gradient reinforcement learning approach work, exposing the search tree/graph of attempted tactics and resulting (sub-)goal(-tuples) to let the model choose where to apply a tactic, and of course also what tactic to apply.
I'd love some decent information/statistics for the broadly usable tactics, critically including how long they take to run.
Also quite useful would be some way to decently predict how long a tactic would run in a specific situation, so the RL model can exploit that when judging whether a tactic at a given spot is (currently) worth trying.
Kim Morrison (Jan 07 2024 at 04:17):
@namibj, check the output of the tactic-benchmark script in the lean-training-data repository for a proof of concept.
namibj (Jan 08 2024 at 00:29):
Scott Morrison said:
@namibj, check the output of the tactic-benchmark script in the lean-training-data repository for a proof of concept.
Thanks for the pointer; somehow I wasn't aware of that repo before. (I've now tried it a little, though the benchmarking itself should wait until I'm better at writing Lean, so the necessary modifications won't needlessly risk frustration in comparatively "boring" code.)
(Getting that benchmarking working will be more of a chore than a fun exercise, at least relative to much of the other Lean code I expect to write in the next weeks.)
Kim Morrison (Jan 08 2024 at 02:02):
It hasn't quite bubbled to the top of my list, but I would like to make many further improvements to tactic-benchmark. Really I want to build something much more general, so that it is possible to ask questions like "For all occurrences of tactic X in the library, try running tactic Y (possibly aware of the context and the proof following X), and tell me about how things went."
If you have particular requests let me know!
Martin Dvořák (Jan 09 2024 at 10:33):
How much potential to prove more lemmas does aesop have if allowed an enormous number of heartbeats?
Do you @Jannis Limperg have examples where aesop succeeds iff given more than the standard number of heartbeats?
Jason Rute (Jan 09 2024 at 11:03):
This would probably be a good application for neural approaches, which typically scale better for longer times. Check out LLMStep and Lean Copilot. (The latter at least is just a tactic, so if you can get your vision to work with aesop, I think you could get it to work with Lean Copilot.) And I agree RL is good for this setting, especially if there were a large number of sorries you were trying to fill in overnight. Then it could learn from the ones it solved. There has been lots of research in this direction (see #Machine Learning for Theorem Proving) but fewer practical tools (especially ones intended to be run overnight in batch); that, however, is just an engineering problem.
Jason Rute (Jan 09 2024 at 11:05):
@Kaiyu Yang is giving a talk on Lean Copilot at Lean Together.
Martin Dvořák (Jan 09 2024 at 11:12):
Jason Rute said:
I was unable to follow that stream as there was a lot going on, so I created a thread
https://leanprover.zulipchat.com/#narrow/stream/219941-Machine-Learning-for-Theorem-Proving/topic/Testimonials
hoping to learn about the fruits of ML for Lean, but nobody posted a testimonial there.
I am looking forward to the talk on Lean Copilot, as I need an overview for Lean users who don't actively follow the ML stuff, which I hope the Lean Together lecture will provide.
Martin Dvořák (Jan 09 2024 at 11:22):
Jason Rute said:
(...) just an engineering problem.
If the research part has been quite successful and some engineering problems are the current bottleneck for deploying AI assistants for Lean, then it is great news I guess?
Jason Rute (Jan 09 2024 at 12:01):
Maybe I oversold it slightly. The current approaches are okay but not great. Nonetheless, the research is certainly ahead of the practical tools.
Jason Rute (Jan 09 2024 at 12:04):
As for your testimonial page, I think there are only 2 usable machine-learning tools across all ITPs: Lean Copilot and Coq Tactician. There are also some hammers: Isabelle's Sledgehammer and CoqHammer. So the lack of testimonials is partly due to a lack of usable tools.
Jason Rute (Jan 09 2024 at 12:06):
And there are also some general purpose tools like GitHub Copilot and ChatGPT that some people find valuable.
namibj (Jan 10 2024 at 02:16):
Martin Dvořák said:
If the research part has been quite successful and some engineering problems are the current bottleneck for deploying AI assistants for Lean, then it is great news I guess?
The major problems with RL, as far as I'm aware from what's been published in ATP that would apply to Lean, are:
1. Parametrization means there are far too many admissible tactics at any given moment to use techniques that at some point brute-force over them in their reasoning.
2. RL has a severe habit of being computationally extremely expensive.
3. Humans look at the previous steps and the attempted-but-rejected branches of the tree(-ish) search for a proof. ATP being technically a "purely simulated environment" planning task means that the "no (free) rollback" constraints of most well-researched RL settings (simulated robots, real robots, chess, Atari games, etc.) don't apply. So there is no need for e.g. AlphaGo's Monte Carlo tree search, or similar methods that build a world model to limit their interaction with the environment while still training their policy (and/or value) functions.
4. Lean ATP as a task has an unusually hard-to-predict task topology. There is no real use in predicting the results of a tactic; it's better to just hone one's intuition, try tactics, and observe the results.
5. Lean 4's AST is fairly complex, and side-stepping that by using language models massively increases the machine-learning compute costs.
I'm trying to cobble together a model architecture that ought to cope with 1-3 (2 should be treatable using recent analytical results/developments that use higher-order differentiation to speed up convergence, akin to how Newton's method (applied to the gradient of the loss function) converges faster than plain gradient descent).
The particular points that are not just an engineering problem, AFAIK, are how to isolate the model from human-originated identifiers (i.e. have the model view the symbols as the abstract entities they are), and how to squeeze Lean actions out of an RL policy model that got to think about the proof state to make up its mind on how to proceed (a diffusion model might work, but my attempts at finding prior art on conditioned generative diffusion over ASTs didn't yield anything more relevant than some nice results on generating molecule graphs in (bio)chemical/pharma-industry-adjacent research).
In general I'd love to collaborate, especially with some help on the Lean parts as I'm sadly fairly new to Lean itself.
Martin Dvořák (Jan 10 2024 at 10:25):
What is "task topology" please?
Jannis Limperg (Jan 10 2024 at 10:28):
Martin Dvořák said:
How much potential to prove more lemmas does aesop have if allowed an enormous number of heartbeats?
Do you Jannis Limperg have examples where aesop succeeds iff given more than the standard number of heartbeats?
I don't have any examples like this, no. Ironically, I'm not a very heavy user of Aesop myself. I imagine that Aesop is going to struggle with heartbeat limits deep into Mathlib because it runs simp a lot, so there more heartbeats might help. But we're hopefully talking minutes of runtime here, not hours.
Generally speaking, Aesop is not designed to be "complete given infinite time", like e.g. superposition provers are (in theory). You could try to add very general rules (unfold everything, add all recursors, etc.), but I don't think you'd get anything useful out of this.
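To make the "very general rules" point concrete, here is a hedged sketch; Reach and its constructors are made-up names, and registering constructors as low-probability unsafe rules is exactly the kind of broad registration that tends to blow up the search space:

```lean
import Aesop

-- A toy transitive-reachability predicate (hypothetical example).
inductive Reach : Nat → Nat → Prop
  | refl (n : Nat) : Reach n n
  | step {m n : Nat} : Reach m n → Reach m (n + 1)

-- Register the constructors as unsafe rules with a low success probability;
-- aesop will try them during best-first search, rated at 20%.
attribute [aesop unsafe 20% constructors] Reach

example : Reach 2 4 := by
  aesop
```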
namibj (Jan 10 2024 at 12:32):
Martin Dvořák said:
What is "task topology" please?
The topological behavior/properties of the task itself. I had trouble pinning it down compactly; here's the log of my attempt, with a decent summary at the very end: https://chat.openai.com/share/2e8e0b7f-62cb-427d-8484-e2961a9db0b8
Jason Rute (Jan 10 2024 at 14:39):
Note, when AI people use words like “topology” and “manifold” they use the words more casually than a mathematician means them.
namibj (Jan 10 2024 at 17:24):
Jason Rute said:
Not [...]
I'll assume that was supposed to be "Note".
And yeah, I probably abused "manifold" somewhat. I don't expect to have done that nearly as much with "topology", though.
I did very much mean, e.g., the connectivity of the time-history sweep/extension of the (observation, action) space, which is embedded in what's _roughly_ just the (finite-dimensional) rank-3 tensor representation of the raw video sequence (plus the action representation, if it isn't implied by the observation history).
On a graph-search/path-discovery task like ITP, you're dealing with the structure of a hidden graph with tactics as edges and goals as nodes. You can apply a tactic to a goal, and if it doesn't fail, the edge exists; the resulting goal (which might be True) is added if it's not yet part of the visible portion of the graph, and connected with the edge. If you include tactics producing multiple subgoals you'd get hyperedges, but that doesn't really change much.
The point being: you're trying to find a path from your theorem's initial goal node to True to finish your proof. That graph's structure is fairly resistant to prediction without just trying. Also, the try-able out-degree of a node is, if you're lucky, countably infinite: you can't brute-force, because tactics are parametric, so you'd run into the heat death of the universe before finishing a single node's out-edges.
Thus you have to train a "gut feeling"/policy to have a habit of choosing effective tactics. And you need some way to choose where to apply a tactic next.
Retrieval-less LeanCopilot today, for example, gives you such a "gut feeling", and aesop has some strategies (best/depth/breadth-first) that use a normalized rating of how promising the suggested tactics for a particular goal are, treating them as success probabilities in a probability-tree/nesting calculation.
Many RL tasks in research are far more predictable: e.g. 3D physics-based game-engine-simulated robots, or some industrial controller tasks where the structure is often so regular that until recently people just relied on analytical (near-)optimal solutions, sometimes with proven bounds on their suboptimality.
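A hedged toy sketch (hypothetical and heavily simplified) of the search loop described above, with goals as abstract nodes; step plays the role of "apply the policy's chosen tactic and observe the resulting subgoals", with none meaning the tactic failed or the policy gave up on that branch:

```lean
def search (step : Nat → Option (List Nat)) (done : Nat → Bool) :
    Nat → List Nat → Bool
  | 0, _ => false                      -- out of fuel
  | _, [] => true                      -- no goals left: proof complete
  | fuel + 1, g :: gs =>
    if done g then
      search step done fuel gs         -- this goal is already closed
    else
      match step g with
      | none          => false         -- dead branch
      | some subgoals => search step done fuel (subgoals ++ gs)

-- Toy run: "goal n" is closed once it reaches 0, and each step shrinks it.
#eval search (fun n => some [n - 1]) (· == 0) 100 [5]  -- true
```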
Martin Dvořák (Mar 04 2024 at 11:17):
Is rw_search from Mathlib.Tactic.RewriteSearch worth running overnight?
Does anybody have experience with rw_search running for more than 5 minutes and eventually finding a proof?
Kim Morrison (Mar 04 2024 at 12:48):
No.
Utensil Song (Mar 13 2024 at 07:47):
Martin Dvořák said:
How much potential to prove more lemmas does aesop have if allowed an enormous number of heartbeats?
Do you Jannis Limperg have examples where aesop succeeds iff given more than the standard number of heartbeats?
aided_by :thread: seems to be related, and partially answers this question.
It's not actually for running overnight; instead it runs while the human is thinking, but since it's in the background, it naturally allows more heartbeats (such relaxation is not in the PR yet). Personally, I guess aesop/duper etc. have quite some potential when run for minutes, but not so much for hours.
Martin Dvořák (Mar 14 2025 at 07:54):
It has been a year.
Has any general-purpose tactic worth running overnight appeared in the meantime?
I made a suggestion for overnight aesop https://github.com/leanprover-community/aesop/issues/202 but @Jannis Limperg will not have time for developing it.
Jason Rute (Mar 14 2025 at 10:09):
From previous conversations, it seems like you are looking for a tactic which is easy to install, runs on Windows without WSL, and is worth running overnight, which is a much larger ask.
Martin Dvořák (Mar 14 2025 at 10:16):
Not really "runs on windows without wsl" — I got a new PC with Ubuntu a few weeks ago. Nevertheless, I would like to have something that runs out of the box.
Arthur Paulino (Mar 14 2025 at 12:40):
I think this is a reasonable feature, in terms of UX. But let me try to enhance it a little bit:
- Such a tactic would need some mechanism for not losing track of the solution, if/when one is found. It needs to cache the solution locally (just in case the user accidentally messes something up) and provide the proof as a "try this" hint for easy replacement. If the user replaces the tactic invocation with the "try this" hint, the cache can be thrown away (see the sketch after this list).
- The tactic could also cache previous attempts, if that helps it avoid trying the same fruitless path over and over. Again, once the "try this" hint is accepted by the user, this cache can be thrown away.
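A minimal sketch of such a local cache, assuming goals are keyed by (a hash of) their pretty-printed statement; all names here are hypothetical:

```lean
-- Where a cached proof script for a given goal lives on disk.
def cachePath (goalKey : String) : System.FilePath :=
  ⟨s!".overnight-cache/{hash goalKey}.lean"⟩

-- Persist a proof found overnight, so a crash or an accidental edit
-- cannot lose it before the user accepts the "try this" hint.
def cacheSolution (goalKey proofScript : String) : IO Unit := do
  IO.FS.createDirAll ⟨".overnight-cache"⟩
  IO.FS.writeFile (cachePath goalKey) proofScript

-- Look up a previously cached proof for the same goal, if any.
def lookupSolution (goalKey : String) : IO (Option String) := do
  let path := cachePath goalKey
  if ← path.pathExists then
    some <$> IO.FS.readFile path
  else
    pure none
```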
Shreyas Srinivas (Mar 14 2025 at 12:44):
Such a tactic would also need to checkpoint its proof term generation, to recover from abrupt process termination.
Shreyas Srinivas (Mar 14 2025 at 12:44):
It would need to run asynchronously somehow.
Jason Rute (Mar 14 2025 at 13:15):
Even the "try this" would be printed if run from the command line. I've used this to benchmark Aesop and LeanCopilot in the past.
Kevin Buzzard (Mar 14 2025 at 13:30):
I have never seen aesop take more than a few seconds. Has anyone actually ever set up some version of Mathlib where some call to aesop, with some lemmas suitably tagged, actually takes e.g. 8 hours on a modern machine? Without that, this question is a bit nebulous.
Shreyas Srinivas (Mar 14 2025 at 13:34):
So on Discord we discussed this, and it was pointed out that Aesop is meant to be used interactively. But we were hoping that lean-auto + duper might benefit from multi-hour runs if it were upstreamed and maintained.
Shreyas Srinivas (Mar 14 2025 at 13:34):
Same for bv_decide
Kevin Buzzard (Mar 14 2025 at 13:48):
I thought the idea was "prove theorems in my WIP work while I'm sleeping". I can imagine that we can find some goals which bv_decide would solve but where it takes 8 hours; however, I cannot really imagine that they are goals which would show up in reality (at least in my own work!). I can imagine that AI tools could take hours, though (and also cost a fair bit of money!).
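For reference, a tiny instance of the goal shape bv_decide is complete for (assuming a recent toolchain where Std.Tactic.BVDecide is available); the hard instances would be much larger bit-level problems, where the underlying SAT solver could genuinely run for hours:

```lean
import Std.Tactic.BVDecide

-- Commutativity of bitwise AND on 64-bit vectors: decided by bit-blasting
-- to SAT. Trivial here, but the same tactic scales to hard circuit goals.
example (x y : BitVec 64) : x &&& y = y &&& x := by
  bv_decide
```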
Shreyas Srinivas (Mar 14 2025 at 13:49):
Small combinatorial problems
Shreyas Srinivas (Mar 14 2025 at 13:50):
One of the profs at my department was hacking on a certain allocation problem (assigning discrete goods to agents) satisfying some fairness criterion. He got a SAT solver to say that a certain allocation exists for a small number of agents (which was previously a wide-open question) by running the solver for 32 hours.
Shreyas Srinivas (Mar 14 2025 at 13:51):
Unfortunately the SAT solver he used doesn't plug into a proof assistant easily, so we don't technically have a "proof" yet.
Martin Dvořák (Mar 14 2025 at 14:17):
Shreyas Srinivas said:
So on Discord we discussed this, and it was pointed out that Aesop is meant to be used interactively. But we were hoping that lean-auto + duper might benefit from multi-hour runs if it were upstreamed and maintained.
The difference is that aesop is general-purpose automation that works out of the box.
I don't think lean-auto or duper will satisfy that anytime soon.
Martin Dvořák (Mar 14 2025 at 14:20):
Kevin Buzzard said:
I have never seen aesop take more than a few seconds. Has anyone actually ever set up some version of Mathlib where some call to aesop, with some lemmas suitably tagged, actually takes e.g. 8 hours on a modern machine? Without that, this question is a bit nebulous.
I once had an aesop call that took over 20 minutes and then succeeded. Never hours.
Martin Dvořák (Mar 22 2025 at 15:24):
My impression is that this is getting nowhere because people who have the skills to develop tactics are too focused on efficiency. At this point, I'd like to have anything slightly more complete than aesop, no matter how inefficient; I won't mind if searching the proof space with slightly more breadth increases the runtime from 10 seconds to 10 hours.
Henrik Böving (Mar 22 2025 at 15:42):
My impression is that this is getting nowhere because people who have the skills to develop tactics are too focused on efficiency.
I wouldn't say they are too focused; as a simple matter of fact, the vast majority of users use Lean interactively, and thus getting as much as possible solved in a time that can be considered interactive is the most important metric for the majority of tactic developers. Furthermore, many tools that we have, like aesop or simp, do not have completeness, so they will very likely hit diminishing returns very quickly (unless you are, say, working on gigantic proof states, in which case they might just take a while to consume the entire proof state). As mentioned previously, tools that do have completeness (for their respective domains), and which it is therefore reasonable to let run for a very long time, are for example bv_decide or duper.
Alex Meiburg (Mar 22 2025 at 16:46):
no matter how inefficient
IMHO, I would suggest caution there... For problems like proof search with a pretty "strongly exponential" blowup, a "modest" efficiency improvement in the search strategy can easily make the difference between 10 minutes and 10 days. Okay, maybe 10 days is "fine" for you, but what if it's a month? A year?
As an example...
Something like GodelProver (https://arxiv.org/abs/2502.07640) you could just run as many times as you'd like, getting independent samples. Their paper shows steady improvement in success rate on MiniF2F as the number of attempts goes up exponentially, as one would expect. It's not slowing down by 100k samples, so one can reasonably expect to still benefit up to 1M samples, at least. But producing 1M samples at 1 sec/attempt (which would be kind of beefy!) would still mean roughly 11 days.
Bottom line: with a fairly new LLM-based approach, on MiniF2F problems, you would definitely continue to see benefit waiting up to 11 days per problem you want solved.
At that kind of time scale, I definitely care about efficiency, not just 'can it maybe close my goal in theory'.
Jason Rute (Mar 23 2025 at 02:10):
A good method will solve on the order of log(t) theorems, from a set of theorems to solve, in time t. If you have a sizable test set and enough compute to test it, you can see this plainly. For example, look at the plots in Graph2Tac. (Yes, this won't hold forever, but it will over a sizable range of time.) But it is good to test this, because it may not be true of current methods.
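To put rough numbers on that scaling (a back-of-envelope illustration of the claim above, not a figure from Graph2Tac): if the number of theorems solved within time $t$ behaves like

$$N(t) \approx a + b \log_2 t,$$

then raising the budget from 1 minute to 8 hours (a 480x increase) adds only $b \log_2 480 \approx 8.9\,b$ theorems, i.e. fewer than nine "doublings" worth of extra solves.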
Jason Rute (Mar 23 2025 at 02:13):
@Martin Dvořák honestly as the person who requests this feature the most, you might be the perfect person to work on it, say testing various methods, tweaking Aesop and others to have this property, and making something that works in practice. I don’t know your background, but I imagine you are more capable than you may realize, especially with the support of the Lean community.
Martin Dvořák (Mar 23 2025 at 10:00):
I can look into it after I finish my Ph.D., but now I need to focus on finishing things towards my thesis. Learning metaprogramming would take too much time for me, while my contract is running out soon.
Niels Voss (Mar 23 2025 at 17:39):
One thought: instead of looking for tactics which work on a single goal state for hours, which is something that probably has exponential blowup and quickly causes diminishing returns, we could instead run operations that scan all of Mathlib looking for ways to optimize or analyze proofs. Maybe things like:
- Trying aesop after every single tactic state in Mathlib (iirc, there's a tool to do this now?)
- Trying various proof changes across Mathlib to try to speed them up
- Creating an sqlite database with every single local hypothesis and goal state that has ever been reached in Mathlib, and their corresponding proofs. Then filling in a new sorry could involve doing a scan to see if a similar situation has ever occurred in Mathlib, and informing the user if so.
Kevin Buzzard (Mar 23 2025 at 18:01):
TryAtEachStep is the tool BTW
Ruben Van de Velde (Mar 23 2025 at 18:35):
That's an interesting thing to do, but doesn't help with proving things that you don't have a formalized proof for yet
Niels Voss (Mar 23 2025 at 19:13):
That's fair. I just can't imagine that goals which require between 1 hour and 10 days to solve are very common; I'd imagine most goals take either a couple of seconds at most or longer than the heat death of the universe. I think there would probably need to be an entirely new approach for this to be feasible.
Martin Dvořák (Mar 23 2025 at 19:19):
You are probably right, but in my case, the entire range between 2 minutes and 2 days is almost untested, yet viable to test. Something brute-forceable should be in there.
Niels Voss (Mar 23 2025 at 19:36):
I guess the reason why I bring up scanning all of Mathlib is that stuff involving a full Mathlib scan is expensive, but not that expensive. This is the only task I can think of that is expensive for reasons other than exponential blowup. So maybe if exact? takes a few seconds to run, something like a depth-2 exact? would take on the order of hours? Of course, since you can't expect this to close every goal, this would make the most sense if you had 15 or 20 sorries and left it running overnight to see if any of them could be filled in, rather than focusing on a particular one.
Of course, I'm just throwing around ideas, but it would be really cool if we had a way to let people dedicate compute power to solving math problems, so I hope this succeeds.
Of course, I'm just throwing around ideas, but it would be really cool if we had a way to let people dedicate compute power to solving math problems, so I hope this succeeds.
Martin Dvořák (Mar 23 2025 at 19:51):
Has depth-2 exact? been implemented?
Niels Voss (Mar 23 2025 at 20:52):
Not that I know of
Isak Colboubrani (Mar 24 2025 at 20:01):
@Martin Dvořák A depth-2 exact? in some form would be super cool. This topic was discussed briefly last week in #new members > Feasibility of an `exact?` tactic exploring two lemmas deep?, but as far as I can tell, the interesting feasibility question was never fully addressed.
Kim Morrison (Mar 24 2025 at 23:25):
I think more interesting would be to generalize the discharger for exact?. Currently it solves subgoals just via solve_by_elim. What happens if that in turn is allowed to call simp or grind? What if we use grind instead of solve_by_elim, etc.?
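A hedged illustration of that idea: a first lemma application produces subgoals, and a discharger stronger than solve_by_elim (here omega, standing in for simp or grind) closes them. An exact? equipped with such a discharger could find two-step proofs of this shape on its own:

```lean
import Mathlib.Tactic

-- The first `apply` plays the role of the lemma `exact?` would select;
-- `omega` then discharges both generated subgoals.
example (n : Nat) : n ≤ n + 1 := by
  apply Nat.le_trans (m := n + 0) <;> omega
```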
Martin Dvořák (Mar 26 2025 at 16:29):
Alex Meiburg said:
Something like GodelProver https://arxiv.org/abs/2502.07640
Does anybody know if GodelProver works on goals that involve custom types?
Last updated: May 02 2025 at 03:31 UTC