Zulip Chat Archive

Stream: Equational

Topic: What are the hardest positive implications for an ATP?


Terence Tao (Nov 24 2024 at 15:50):

It might be worth having the following "Theorem" (or more precisely, "Empirical finding") in the paper, for a suitable value of N:

Empirical finding. Every positive implication E ⟹ E′ amongst the 4694 equations of order at most 4 takes at most N seconds to prove in <choose an ATP here>.

[Side note: I am leaning towards using the entailment notation E ⊢ E′ in the paper for implication between laws rather than E ⟹ E′ or E ≤ E′, as it is unambiguous.]

What value of N (and what ATP) can we achieve this with, and which implications are the "worst offenders"? The Kisielewicz example 1689 -> 2 is the only contender I am aware of for this latter title.

I guess one could work just with the transitively reduced set of 10,657 explicitly proven implications rather than the full set of >8 million, though it would also be interesting to see if there are implications in the latter set that are genuinely harder for ATPs to find than the reduced set (my intuition says they should not be significantly harder though, and in fact maybe easier in practice).

There is some work here on systematically exploring longer laws, like Austin laws of order 5, or Higman-Neumann laws of order 8. Empirically, how much more difficult are the positive implications in these settings?
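
For concreteness, the timing data could be gathered with a small harness along the following lines (only a sketch: the file layout is hypothetical, with one TPTP problem per implication, and Vampire's -t flag sets the time budget):

```python
# Sketch of a timing harness; all paths are hypothetical, and it assumes
# one TPTP problem file per implication under problems/.
import glob
import subprocess
import time

results = {}
for path in sorted(glob.glob("problems/*.p")):
    start = time.monotonic()
    proc = subprocess.run(["vampire", "-t", "60s", path],
                          capture_output=True, text=True)
    elapsed = time.monotonic() - start
    # Vampire signals success with lines like "% SZS status Theorem" or
    # "% Refutation found." (exact wording depends on version/mode).
    proved = ("SZS status Theorem" in proc.stdout
              or "Refutation found" in proc.stdout)
    results[path] = (proved, elapsed)

# N is then the worst time over the implications that were proved.
proved_times = [t for ok, t in results.values() if ok]
print("N =", max(proved_times) if proved_times else None)
```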

Vlad Tsyrklevich (Nov 24 2024 at 16:12):

I think the value is much less than 1s for all positive implications in order 4, though I can double-check that. In order 5 it is noticeably worse, but it's hard to bound because I only looked at implications to Equation 2 specifically. In that subset, some implications took Vampire ~1s, and I'll also note that Prover9 can have quite different proving times, sometimes longer, sometimes shorter.

Vlad Tsyrklevich (Nov 24 2024 at 16:26):

I just re-computed all implications that are in the reduced set:
71% took 0.001s
96.8% took <= 0.005s
99.8% took <= 0.1s [all but 23]
99.98% took <= 0.5s [all but 2]

2923 => 2628 takes 2.8s and 650 => 448 takes 3.6s

Trying eprover on those two implications, I found they both take ~0.01s, so I think they are probably just hitting some pathological case with the search parameters, and that eprover could have different equations that are pathological for it. So the definition of "hardest" becomes kind of ATP-specific.

Vlad Tsyrklevich (Nov 24 2024 at 16:32):

Also, just a note that 1689=>2 took vampire 0.003s.

Matthew Bolan (Nov 24 2024 at 16:33):

Another measure of complexity is the length of the proof Vampire finds (which should also be less hardware-specific than run time).

Andrés Goens (Nov 24 2024 at 18:18):

out of curiosity, how long does it take Vampire for 1689 => 2 (the one that Kisielewicz proved in 1997 with computer assistance)?

Andrés Goens (Nov 24 2024 at 18:19):

Vlad Tsyrklevich said:

Also, just a note that 1689=>2 took vampire 0.003s.

oh, never mind, I missed that, sorry!

Vlad Tsyrklevich (Nov 24 2024 at 21:44):

Matthew Bolan said:

Another measure of complexity is the length of the proof Vampire finds (which should also be less hardware-specific than run time).

Since the majority of the time is actually spent in the search, the "search depth" may be a more useful measure of implication difficulty. By search depth I mean the number of clauses it found during the search; this is visible directly in the Vampire output.
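
Both numbers can be scraped from the output. A rough sketch, with the caveat that the exact proof-line and statistics formats below are assumptions to check against the Vampire build in use:

```python
# Rough scraper for the two measures from a single Vampire run's output.
# Assumptions to verify against your Vampire build: proof steps are printed
# as numbered lines like "12. ... [superposition 3,5]", and running with
# --statistics full yields a line like "Generated clauses: 1234".
import re

def proof_length(output):
    # Count the numbered inference lines of the printed refutation.
    return len(re.findall(r"^\s*\d+\.\s", output, flags=re.MULTILINE))

def generated_clauses(output):
    # Crude proxy for "search depth": total clauses generated in the search.
    m = re.search(r"Generated clauses:\s*(\d+)", output)
    return int(m.group(1)) if m else None
```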

Jose Brox (Nov 24 2024 at 22:21):

Terence Tao said:

Every positive implication E⟹E′ amongst the 4694 equations of order at most 4 takes at most N seconds to prove in <choose an ATP here>.

This also depends on the search parameters of the given ATP. In order to compute N, are we going to use fixed search parameters for the ATP, or do we allow the best possible combination for each implication (i.e., N would be the largest among the smallest proof times)?

For example, showing that 71557731 implies x^2 = y^2 takes 278s in Prover9 with max_weight 100 and sos_limit 200, but 2186s with (1000, 2000) (and it can run for 7000s before exhausting the search space without a proof with (200, 100)). With the default Prover9 parameters, which are (100, 20000), it takes 1722s.

Terence Tao (Nov 24 2024 at 22:31):

Hmm, I guess if we are to report the run times scientifically rather than anecdotally, we need some replicable process. Are there standard practices or benchmarks regarding reporting run times of ATPs (or similar tools) in the literature?

Vlad Tsyrklevich (Nov 24 2024 at 22:32):

I don't know how Prover9 works, but the Vampire documentation also states that the choice of timeout can change how it searches, so longer timeouts may actually lead to longer searches.

Daniel Weber (Nov 25 2024 at 06:29):

Terence Tao said:

Hmm, I guess if we are to report the run times scientifically rather than anecdotally, we need some replicable process. Are there standard practices or benchmarks regarding reporting run times of ATPs (or similar tools) in the literature?

There is CASC's evaluation procedure (https://tptp.org/CASC/29/Design.html#Evaluation); I think that's fairly standard.

Terence Tao (Nov 25 2024 at 06:42):

Hmm. I guess we don't have the standard hardware that this competition uses, so perhaps absolute runtimes are not particularly informative, as they would depend on the hardware used. We could perhaps do relative comparisons between ATPs. I must confess this is not my area of expertise at all, and I have no idea what is actually done in the literature to evaluate the difficulty of a problem for an ATP.

Daniel Weber (Nov 25 2024 at 06:48):

I see CASC uses StarExec; we might also be able to use it.

Jose Brox (Nov 25 2024 at 12:55):

Vampire is subtler in that it changes methods and algorithms on the fly as a function of the time remaining, etc. (according to the paper you referenced in another post). Prover9's behaviour is more predictable: it just runs through its algorithm, using whatever tweaks you call it with (like special inference rules, etc.). Concretely, max_weight forces Prover9 to discard any inferred clause with weight ("size") higher than this limit, while sos_limit controls the size of the cache or buffer of remembered clauses.
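
To make that concrete, here is a sketch of a parameter sweep over those two options. The driver is schematic, and the sos law used is the 1689 => 2 implication mentioned earlier in the thread, assuming the project's numbering (1689: x = (y*x)*((x*z)*z), 2: x = y), which should be double-checked:

```python
# Sketch: drive Prover9 over a grid of (max_weight, sos_limit) settings.
# The sos law below is assumed to be Equation 1689 and the goal Equation 2
# in the project's numbering; double-check both before relying on this.
import subprocess
import time

TEMPLATE = """
assign(max_weight, {w}).
assign(sos_limit, {s}).
formulas(sos).
  x = (y * x) * ((x * z) * z).
end_of_list.
formulas(goals).
  x = y.
end_of_list.
"""

for w, s in [(100, 200), (1000, 2000), (100, 20000)]:
    start = time.monotonic()
    proc = subprocess.run(["prover9"], input=TEMPLATE.format(w=w, s=s),
                          capture_output=True, text=True)
    elapsed = time.monotonic() - start
    # Prover9 prints "THEOREM PROVED" on success.
    print(f"max_weight={w} sos_limit={s}: "
          f"proved={'THEOREM PROVED' in proc.stdout} in {elapsed:.1f}s")
```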

Daniel Weber (Nov 25 2024 at 13:27):

Vlad Tsyrklevich said:

I just re-computed all implications that are in the reduced set:
71% took 0.001s
96.8% took <= 0.005s
99.8% took <= 0.1s [all but 23]
99.98% took <= 0.5s [all but 2]

2923 => 2628 takes 2.8s and 650 => 448 takes 3.6s

Trying eprover on those two implications, I found they both take ~0.01s, so I think they are probably just hitting some pathological case with the search parameters, and that eprover could have different equations that are pathological for it. So the definition of "hardest" becomes kind of ATP-specific.

What parameters are you using for Vampire? Are you using --mode casc?

Zoltan A. Kocsis (Z.A.K.) (Nov 25 2024 at 14:08):

Terence Tao said:

I must confess this is not my area of expertise at all, and I have no idea what is actually done in the literature to evaluate the difficulty of a problem for an ATP.

The way I see it, the main article for the Equational Theories Project need not address why or how ATPs handle these problems well or poorly, especially when we lack the expertise to study this thoroughly within the group of current contributors.

It's perfectly fine to document what we actually observed during the project and highlight in our paper that we'd like to see theory that explains our observations. Since this project is sorta-prominent, highlighting this should incentivize experts in the relevant kind of ATP to produce their own follow-ups.

Given the statistics we see above, creating the necessary Prover9/Vampire/etc. run datasets is also straightforward, assuming full knowledge of the implication graph. This means there's no pressing need for us to investigate this now; it's not as if current ATP runs or related data might become hard to recreate. The domain experts can address these questions in their own work, and they'll know how to obtain the data relevant to them much better than we do.

Following up would likely appeal to the domain experts more than becoming a contributor to ETP at this late stage anyway, since it allows them to write articles better suited to their preferred publication venues (though overlaps could ofc happen depending on where this one gets published).

Vlad Tsyrklevich (Nov 25 2024 at 14:40):

Daniel Weber said:

Are you using --mode casc?

I used the default mode (--mode vampire), though I did also try with --mode casc earlier and found similar results. Re-trying it just now, I actually got a concrete example of the timeout behavior. For 650=>448, my local results are:

-t 60s --mode casc finishes in ~12s
-t 5s --mode casc finishes in ~4s
-t 1s --mode casc does not succeed.
-t 60s --mode vampire finishes in ~3.5s
-t 5s --mode vampire finishes in ~0.5s
-t 1s --mode vampire finishes in ~0.5s

So by setting timeouts lower than the original high timeouts I had used, the two pathological examples I cited are actually solvable in just over 0.5s.
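
This kind of timeout/mode sweep is easy to script for reproducibility; a sketch (the problem file path is hypothetical):

```python
# Sketch: how wall-clock time depends on Vampire's time budget and mode,
# for a single problem (file path hypothetical).
import itertools
import subprocess
import time

for mode, budget in itertools.product(["vampire", "casc"], ["1s", "5s", "60s"]):
    start = time.monotonic()
    proc = subprocess.run(
        ["vampire", "--mode", mode, "-t", budget, "problems/650_448.p"],
        capture_output=True, text=True)
    elapsed = time.monotonic() - start
    proved = ("Refutation found" in proc.stdout
              or "SZS status Theorem" in proc.stdout)
    print(f"--mode {mode} -t {budget}: proved={proved} after {elapsed:.1f}s")
```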

Terence Tao (Nov 25 2024 at 16:41):

@Zoltan A. Kocsis (Z.A.K.) Good points. We don't need to do everything in our report; we just need to report on what we can, and allow others to build upon our work. In particular, I had hoped to use the ETP data to create some standard benchmark for future ATPs to test on, but I now see that unless someone with the right domain expertise joins the project (or is already among the participants), it is better to pose this as an open problem and leave it to future work, possibly by experts who did not directly participate in this project.

Vlad Tsyrklevich (Nov 25 2024 at 17:19):

One thing I do think we might be able to contribute is examples of implications/refutations that we know to be true but that ATPs are unable to establish. These may be useful as a test set for CASC, or alternatively as a test set in some online library such as the TPTP Problem Library. If others think that's a good idea, Geoff Sutcliffe organizes both of those, so he can probably advise on what's appropriate.
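
Generating the problem files is mostly mechanical. A sketch of the shape TPTP expects (the mul/2 encoding, the naming scheme, and the toy implication at the end are placeholders, not project conventions):

```python
# Sketch: emit a TPTP problem asserting that one equational law entails
# another, as a fof axiom plus conjecture. Encoding and names are
# placeholders chosen for illustration.
def tptp_problem(name, axiom, conjecture):
    return (f"fof({name}_axiom, axiom, {axiom}).\n"
            f"fof({name}_goal, conjecture, {conjecture}).\n")

# Toy example: commutativity entails (x*y)*x = x*(y*x).
print(tptp_problem(
    "toy",
    "![X, Y]: mul(X, Y) = mul(Y, X)",
    "![X, Y]: mul(mul(X, Y), X) = mul(X, mul(Y, X))",
))
```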

Andrés Goens (Nov 25 2024 at 18:54):

Vlad Tsyrklevich said:

One thing I do think we might be able to contribute is examples of implications/refutations that we know to be true but that ATPs are unable to establish. These may be useful as a test set for CASC, or alternatively as a test set in some online library such as the TPTP Problem Library. If others think that's a good idea, Geoff Sutcliffe organizes both of those, so he can probably advise on what's appropriate.

are there any examples of implications that Vampire/E were not able to prove? I thought it was only some of the refutations that they couldn't?

Vlad Tsyrklevich (Nov 25 2024 at 19:06):

In order ≤ 4, there are no examples. The inspiration for this thread, @Jose Brox's order 8 example, is the only difficult implication I've seen

Jose Brox (Nov 25 2024 at 22:40):

Zoltan A. Kocsis (Z.A.K.) said:

The way I see it, the main article for the Equational Theories Project need not address why or how ATPs handle these problems well or poorly, especially when we lack the expertise to study this thoroughly within the group of current contributors.

I subscribe to this view. Personally, although I've spent many CPU hours doing informal benchmarking of Prover9/Mace4, I would feel uneasy tackling a subject this slippery for us, since ATPs and software testing are complicated... Unless Daniel Weber (or someone else less active until now) declares himself or herself an expert and can guide this part of the project, the result would be less than optimal and would detract from the rest of the paper.

That being said, I still think there is something we can contribute: instead of focusing on what we can offer to expert ATP researchers, we can center on what we can offer to novice ATP users, i.e., algebra (or perhaps general) researchers who, like some of us at the start of this project, want to use ATPs for the first time on problems that need them, and may be at a loss as to what to use, what parameters may be important... in summary, what may make a difference. Here we can be more informal: throw in a couple of "benchmarking" tables for the same ATP with different parameters and for different ATPs, talk about some relative gains in time (changing parameters, we saw a 500-fold speedup on this particular problem), etc. This is knowledge I think we have gained to some extent, and I certainly would have been glad to receive these kinds of hints before we started!

Terence Tao (Nov 25 2024 at 22:48):

I like this framing; in the absence of an ATP expert joining our collaboration, we can present our experiences with ATPs as a field report aimed at a general mathematical audience rather than an expert analysis for the professional ATP community, and leave it as an interesting open problem to properly develop and measure benchmarks for ATPs based on this project.

Terence Tao (Nov 25 2024 at 22:57):

I've opened equational#920 for someone to take a first stab at writing a draft of the ATP section of the paper, hopefully incorporating some of the discussion above. (I plan to mostly focus on writing the more math-intensive sections of the paper, but the ATP section may actually be the most broadly influential in the long run, being of interest well beyond the mathematical field of universal algebra.)

Shreyas Srinivas (Nov 26 2024 at 22:05):

My two cents: I am not sure we can necessarily give ATP experts performance metrics they might find useful. I don't think we know for certain that we have pushed these ATPs to their limits. Only two weeks ago, for a different combinatorics problem, someone I know obtained a solution after running one of these tools for 30-40 hours. I don't recall anybody reporting that they had such long runs. What they might find useful is how the tools are used by non-experts.

Vlad Tsyrklevich (Nov 26 2024 at 22:21):

I agree with your point, but I'll also note that I have 2.5 laptops (so, not super high-end hardware) running ATPs 24/7 and Jose has also probably had some pretty long runs. I'm not sure running for 30 hours on an individual problem is really more illustrative than spending 300 hours on a class of problems.

Terence Tao (Nov 26 2024 at 22:34):

Perhaps some coarse relative comparisons may still be useful, e.g., whether one class of problems seems to take an order of magnitude more time on ATPs than another, or whether one choice of ATP parameters is an order of magnitude more effective than another. Those sorts of conclusions seem less sensitive to things like choice of hardware.

Amir Livne Bar-on (Nov 27 2024 at 07:06):

Large differences in runtime between different orders of the assumptions may be interesting to implementers too, as may different parameters

Jose Brox (Nov 27 2024 at 11:46):

Vlad Tsyrklevich said:

and Jose has also probably had some pretty long runs.

That's right, my computer has basically been on fire since I started serious computations back with Astérix/Obélix. Right now I'm running 13 simultaneous experiments with Prover9/Mace4 for the Higman-Neumann problem, with 10h per candidate and lists of around 15 candidates each, while at the same time I'm running 3 "benchmarking" experiments with the Windows GUI (this is an AMD Ryzen 5 5600X with a 6-core processor and 32GB of RAM, running at 100% CPU and 90% RAM most of the time :grinning_face_with_smiling_eyes:).

Eric Taucher (Nov 27 2024 at 12:07):

One point worth considering mentioning in the paper is the utility of application checkpointing for automated search tools that involve long-running jobs with uncertain completion.

To the best of my knowledge, neither Mace4 nor Prover9 supports native checkpointing functionality.

However, one workaround for applications lacking built-in checkpointing is to run them within a virtual machine (using, e.g., VirtualBox) and utilize its snapshot feature. This allows the state of the application to be saved and restored as needed.
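
A sketch of how that workaround might be scripted (the VM name and snapshot interval are placeholders; it relies on VirtualBox's VBoxManage snapshot command):

```python
# Sketch: periodically snapshot a VirtualBox VM running a long Prover9 or
# Mace4 job, so the search state can be restored after a crash.
# The VM name "atp-vm" and the 30-minute interval are placeholders.
import subprocess
import time

VM = "atp-vm"
INTERVAL = 30 * 60  # seconds between snapshots

n = 0
while True:
    time.sleep(INTERVAL)
    n += 1
    # --live takes the snapshot without stopping the VM.
    subprocess.run(["VBoxManage", "snapshot", VM, "take",
                    f"checkpoint-{n}", "--live"], check=True)
    # Restore later with: VBoxManage snapshot atp-vm restore checkpoint-<n>
```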

Eric Taucher (Nov 27 2024 at 12:18):

FYI: Docker can also checkpoint, but the feature is still experimental.

Eric Taucher (Nov 27 2024 at 13:15):

FYI:
Windows 11 includes Hyper-V Manager, which can be used to create virtual machines (VMs) and manage checkpoints. A newly created VM won't have an operating system pre-installed; however, you can install Linux on it to run tools like Mace4 or Prover9.

That said, I haven’t used Hyper-V in a few years, so I can’t provide much more detail.

Before asking, no, WSL 2 does not support checkpoints. :smile:

Jose Brox (Nov 27 2024 at 18:28):

Vlad Tsyrklevich said:

Jose has also probably had some pretty long runs.

Another interesting example with a long run was the Mace4 search for a weak central groupoid of size 16 which is not strong. I have redone it with the best combination of parameters and identities I have found, and it needed 5326s (about 1.5h). Finding ALL such models apparently takes 11367s (more than 3h).


Last updated: May 02 2025 at 03:31 UTC