Zulip Chat Archive

Stream: Equational

Topic: RECORDS REQUEST: data and performance automated run metrics


Terence Tao (Oct 11 2024 at 15:55):

This request is not aimed at the immediate goal of completing the implication graph in Lean, but for the second stage goal of writing a paper reporting on the results.

A key component of this paper will be describing various automated techniques (e.g., deploying ATPs such as Vampire, using finite magmas to generate counterexamples, etc.), and it would be good to have metrics on how they would perform in isolation on this problem. This is distinct from how they were deployed in reality, in which later runs could benefit from the implications already eliminated in earlier runs; also, due to various post-optimizations to the Lean codebase (e.g., transitive reductions of the implication graph), the current portion of the codebase corresponding to a given run may not be indicative of the absolute performance of that run.

So, while the implementation details of runs are still fresh in the minds of those contributors who programmed these runs, I would like to set up this (long term) thread to ask all of these contributors, when they have the time and spare CPU, to run their programs again against the entire implication graph (or some random sample thereof, if the entire graph is too computationally intractable to cover), and record the results, including metrics such as

  • Number of implications/anti-implications resolved (with data sets recording the specific statements resolved, if possible).
  • CPU time (and hardware used)
  • Any other implementation notes of interest

These would not go into the (already optimized) Lean codebase, but the data files generated could go into the data folder, and implementation notes could also go into relevant blueprint chapters.
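
For concreteness, here is a minimal sketch (not an agreed project schema; the field names and values are invented) of what one such per-run data file could look like, written as a small Python snippet that emits a CSV record:

import csv

# Hypothetical schema for a single automated run's report; field names are
# illustrative only, not an agreed project format.
FIELDS = ["run_name", "hardware", "cpu_time_seconds",
          "implications_proved", "anti_implications_proved", "notes"]

record = {
    "run_name": "example_atp_run",           # placeholder name
    "hardware": "8-core x86_64, 32 GB RAM",  # example hardware description
    "cpu_time_seconds": 3600,
    "implications_proved": 0,
    "anti_implications_proved": 0,
    "notes": "see blueprint chapter for details",
}

with open("run_metrics.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerow(record)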

This is not an urgent priority - the paper writing is likely weeks or months away still - but I can imagine that some of the older runs may be harder to execute after some time due to fading memories (and also some technical changes to the repository data formats). Also, I think having all the different contributors report their statistics in this common location may be enlightening for the entire project community.

EDITED TO ADD: This request is not aimed only at the most "successful" runs (if measured by how much they reduced the number of outstanding implications), but also the runs which may not have numerically achieved significant reduction in outstanding implications. This will be important for comparative analysis (and also some runs may have actually been significantly more effective than they seemed in reality, because they were able to resolve many implications that had already been worked out by earlier runs). (Also, this accords with the general scientific principle that negative results need to be published as well as positive ones.)

Mario Carneiro (Oct 11 2024 at 16:00):

I think it would be better to instead record here (or somewhere else and link here) the instructions to reproduce these ATP runs. This information will need to be included in the paper anyway, and having reproduction instructions means it will be possible to wait until the paper is actually nearing publication to do the runs, rather than have them run now on what is still likely not the final project result

Terence Tao (Oct 11 2024 at 17:16):

Very good points. The emphasis for now should be on recording a replicable process and perhaps a preliminary projection of what the metrics would look like, but we can defer the full runs until we have agreed on a systematic and standardized framework to run them all.

Terence Tao (Oct 11 2024 at 19:15):

In particular, I can see that it makes sense to ultimately perform a lot of these runs on a common set of hardware in order to be able to have an apples-to-apples comparison of performance.

Andrés Goens (Oct 12 2024 at 17:32):

there's also the question of which (subset of the) implications to consider: e.g. the entire ~4700^2 set, a "minimal" subset known to be sufficient to establish all of them via transitivity/duality, or the known implications and non-implications separately.

In the case of the egg tactic, for example, it cannot prove the non-implications, so it's kind of a waste of time and computation to run it on them, but it is part of a fair apples-to-apples comparison (and it's also what we did at the time, albeit just for a subgraph, not knowing whether an implication held or not).

Mario Carneiro (Oct 12 2024 at 17:38):

ATPs are generally graded on how many problems they can solve within a time limit (for increasing values of the time limit). So non-implications are not counted against it, but the asymptote will be the overall performance

Mario Carneiro (Oct 12 2024 at 17:40):

you can think of this as running a race with an independent solver for each problem, where you plot how many solvers crossed their respective finish lines as a function of time
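
As a minimal sketch of this kind of scoring, assuming each solver reports a per-problem solve time (None meaning it never finished within the limit); all numbers are invented:

# Count, for each solver, how many problems it solves within time t.
solve_times = {
    "solver_a": [0.1, 2.5, None, 30.0, 1.2],
    "solver_b": [0.3, None, None, 5.0, 0.9],
}

def solved_within(times, t):
    """Number of problems solved in at most t seconds."""
    return sum(1 for s in times if s is not None and s <= t)

for t in [1, 10, 60, 300]:
    counts = {name: solved_within(times, t) for name, times in solve_times.items()}
    print(f"t = {t:>4}s: {counts}")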

Joe McCann (Oct 12 2024 at 17:45):

When grading time-to-solve for an ATP, is there a set of predefined computational specs?

Mario Carneiro (Oct 12 2024 at 17:45):

what do you mean by that?

Mario Carneiro (Oct 12 2024 at 17:46):

it's a comparative method: you throw a bunch of different solvers at the problem and see how they do, which tells you how good the solvers are on this kind of problem

Mario Carneiro (Oct 12 2024 at 17:47):

the baseline is usually to take the problem and encode it directly in a well-known, off-the-shelf standard solver, e.g. Vampire
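
For illustration only (the two laws below are invented placeholders, not specific numbered equations from the project), a single implication question can be posed to Vampire by writing it as a TPTP first-order problem, with the hypothesised law as an axiom and the candidate consequence as a conjecture:

# Sketch: emit a TPTP problem asking whether one equational law implies another.
# The laws used here are illustrative placeholders, not project equation numbers.
hypothesis = "![X, Y] : ( m(X, Y) = m(Y, X) )"               # e.g. commutativity
conclusion = "![X, Y] : ( m(m(X, Y), X) = m(X, m(Y, X)) )"   # some other law

problem = (
    f"fof(hypothesis, axiom, {hypothesis}).\n"
    f"fof(goal, conjecture, {conclusion}).\n"
)

with open("implication.p", "w") as f:
    f.write(problem)
# The file can then be passed to an off-the-shelf prover, e.g. `vampire implication.p`.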

Terence Tao (Oct 12 2024 at 17:48):

So I guess one is grading not just the ATP, but whatever metastrategy is used to select the implications or anti-implications to try to prove. For instance if one is given in advance what implications are actually true or not, this would greatly advantage an ATP that can only prove one of the two types of implications, as it can focus its CPU cycles on the ones that it can actually solve.

Mario Carneiro (Oct 12 2024 at 17:49):

well like I said we're not putting non-implications on the plot. In the race I described, some of the solvers don't actually have a finish line but don't know it; but they will not appear in the stats because they won't finish

Mario Carneiro (Oct 12 2024 at 17:51):

there is another mode where you count the total number of problems solved by a single solver multiplexing over all the problems, and here you do get penalized if you spend time on unsolvable problems. But I wasn't suggesting that
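
A minimal sketch of this second mode, with invented solve times: problems are attempted one after another under a shared budget, so time spent on unsolvable problems (None below) eats into the budget and lowers the final count:

# Hypothetical per-problem solve times; None means the problem is never solved
# (e.g. a non-implication given to a prover that can only find proofs).
times = [0.5, None, 3.0, None, 1.0, 40.0]

def solved_multiplexed(times, per_problem_timeout, total_budget):
    """Run problems one after another under a shared time budget."""
    solved, spent = 0, 0.0
    for t in times:
        if spent >= total_budget:
            break
        cost = per_problem_timeout if t is None or t > per_problem_timeout else t
        cost = min(cost, total_budget - spent)
        spent += cost
        if t is not None and t <= cost:
            solved += 1
    return solved

print(solved_multiplexed(times, per_problem_timeout=10.0, total_budget=30.0))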

Mario Carneiro (Oct 12 2024 at 17:54):

Here's an example, Fig. 2 of a recent SAT competition publication

[attached image: Fig. 2 from the SAT competition publication]

Mario Carneiro (Oct 12 2024 at 17:56):

there are probably more than 240 problems in the competition, but the plot does not focus on benchmark problems that no one solved

Mario Carneiro (Oct 12 2024 at 18:00):

What this plot does not tell you is how "complementary" the different solvers are - they might be solving completely different sets, almost the same set, or a strict subset of some other solver's set. For this information I usually see a table showing how many problems are solved by the union of the best N methods, up to the full list, so you can see how much additional coverage is added by the second and later methods
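
One way such a table could be computed, as a sketch (the solved sets below are invented; methods are added greedily by marginal contribution here, though other orderings are possible):

# Hypothetical sets of solved problem ids per method.
solved = {
    "method_a": {1, 2, 3, 4, 5},
    "method_b": {4, 5, 6},
    "method_c": {2, 3, 7},
}

covered, remaining = set(), dict(solved)
while remaining:
    # Pick the method that adds the most problems not yet covered.
    name = max(remaining, key=lambda m: len(remaining[m] - covered))
    gained = remaining.pop(name) - covered
    covered |= gained
    print(f"+ {name}: {len(gained)} new, union = {len(covered)}")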

Shreyas Srinivas (Oct 12 2024 at 18:19):

Do we have to limit ourselves to the traditional measurements?

Shreyas Srinivas (Oct 12 2024 at 18:20):

If we want to tell mathematicians which solvers help them with what, it seems useful to also compare solvers against non-implications

Terence Tao (Oct 12 2024 at 19:02):

So I guess once we select a set of implications as benchmarks, there appear to be standard ways in the literature to measure performance of tools against those benchmarks. If so, the remaining question would be how to select a suite of benchmarks to test against.

One obvious benchmark is to just randomly select N assertions EquationX => EquationY (which may be true or false) for some medium value of N, such as N = 5000. But then we will likely miss almost completely the few hundred "interesting" equations that we are currently focusing on.

We could also benchmark against randomly selected true assertions (about a third of the total), or randomly selected false assertions (about two thirds of the total), but again we miss out on the interesting ones.

We could select those implications that take some standard ATP, such as Vampire, more than some time limit to resolve, thus focusing on the "interesting" ones, but then of course we could not test Vampire against that benchmark in any unbiased fashion. Perhaps we could instead take some weighted basket of ATPs to perform the filtering?

Maybe there are some other suggestions. For instance we might be able to use the implication graph itself (once we have completed it) to somehow identify the "interesting" implications in some ATP-agnostic fashion, but I don't see how to do this offhand.
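
As a sketch of the filtering idea above (all names and times are hypothetical; "EqA" etc. are placeholders, not project equation numbers): given the best solve time per candidate implication across a basket of ATPs, keep only the ones the basket found hard or could not resolve:

# Hypothetical data: for each candidate implication, the best time (seconds)
# achieved by any ATP in the basket, or None if none of them resolved it.
basket_best_time = {
    ("EqA", "EqB"): 0.02,
    ("EqC", "EqD"): 45.0,
    ("EqE", "EqF"): None,
}

TIME_LIMIT = 10.0  # illustrative threshold

hard_benchmark = [pair for pair, t in basket_best_time.items()
                  if t is None or t > TIME_LIMIT]
print(hard_benchmark)  # [('EqC', 'EqD'), ('EqE', 'EqF')]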

Mario Carneiro (Oct 12 2024 at 19:07):

As long as you have enough time to prepare the data (and I think you do), I would suggest running the tools until a relatively large limit on every problem, except for the ones for which you know they will time out (i.e. non-implications). Selecting a random subset is only needed if testing on the whole data set is impractical.

Mario Carneiro (Oct 12 2024 at 19:08):

I agree that random sampling is not good for this setting, since it seems like 99.8% of the problems are easy

Mario Carneiro (Oct 12 2024 at 19:09):

you could bias toward the known hard problems, but in that case you might as well sweep up all the easy problems while you are at it

Mario Carneiro (Oct 12 2024 at 19:12):

I realize it looks like cheating to use your current knowledge of implications and non-implications to prune the data in advance of testing, but running a proof search when you know there is no proof, or a countermodel search when there is no countermodel, is just a waste of time; the only thing this would test is unsoundness bugs in the solvers

Vlad Tsyrklevich (Oct 13 2024 at 12:55):

Yeah, for testing implications I'm not sure why the test wouldn't just be of the implications in the reduction; that way you can test every single equation without having to determine which ones are easy vs hard, and the reduction is <10k. But the problem is equivalence classes: 1) from class1 -> class2, do you let the solver try every combination or just pick a single one, and 2) how do you establish the equivalence class itself? You could pick a Hamiltonian cycle that the provers have to prove, but some may not be able to accomplish that particular cycle even though they would have found another way to complete the equivalence class otherwise.
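
As a small sketch of option 1) above, with invented equivalence classes and equation names: given a class-level implication, one can either test a single representative pair or expand to every cross pair:

# Invented equivalence classes of equations and one class-level implication.
classes = {
    "class1": ["EqA", "EqB", "EqC"],
    "class2": ["EqD", "EqE"],
}
class_implications = [("class1", "class2")]

def representative_pairs(classes, class_implications):
    """One arbitrary equation pair per class-level implication."""
    return [(classes[a][0], classes[b][0]) for a, b in class_implications]

def all_pairs(classes, class_implications):
    """Every cross pair per class-level implication."""
    return [(x, y) for a, b in class_implications
                   for x in classes[a] for y in classes[b]]

print(representative_pairs(classes, class_implications))  # [('EqA', 'EqD')]
print(len(all_pairs(classes, class_implications)))        # 6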

Vlad Tsyrklevich (Oct 13 2024 at 12:58):

For non-implications the set of all 'facts' on main is pretty small, around 1k I believe? But there may be the same combinatorial problem of choosing which elements from equivalence class 1 -> 2 you test. I guess the facts are more general than what an ATP would give you for specifically establishing a refutation between two specific equations, but I think the total reduction size is still fairly small, <20k?

Michael Bucko (Oct 17 2024 at 13:16):

I made this kind of workflow sketch using Snakemake, for inspiration (the script and file names below are placeholders). We could also use lineage tools that are commonly used in ML.

Pipeline: Data ingestion -> Data preprocessing -> ATP -> Metrics collection -> Post-processing and analysis -> Reporting and documentation

Lineage: Version control, Metadata management, Automated logging, Reproducibility scripts

# Snakefile

rule all:
    input:
        "data/metrics.csv",
        "reports/analysis.pdf"

# Select the subset of implications to benchmark from the raw outcomes data.
rule preprocess:
    input:
        "data/outcomes.csv"
    output:
        "data/implications.json"
    shell:
        "python scripts/select_subset.py {input} {output}"

# Run the ATPs on the selected implications.
rule run_atps:
    input:
        "data/implications.json"
    output:
        "data/atp_results.json"
    shell:
        "python scripts/run_atps.py {input} {output}"

# Collect timing and success metrics from the raw ATP results.
rule collect_metrics:
    input:
        "data/atp_results.json"
    output:
        "data/metrics.csv"
    shell:
        "python scripts/collect_metrics.py {input} {output}"

# Post-process and analyse the metrics into a report.
rule analyze:
    input:
        "data/metrics.csv"
    output:
        "reports/analysis.pdf"
    shell:
        "python scripts/analyze.py {input} {output}"

Last updated: May 02 2025 at 03:31 UTC