Zulip Chat Archive

Stream: Equational

Topic: Relaxation testing agent on top of LLM and Vampire

Michael Bucko (Nov 08 2024 at 08:17):

I created this GitHub issue and can build this tool, if you think it'd be good to try.

https://github.com/teorth/equational_theories/issues/803

Shreyas Srinivas (Nov 08 2024 at 09:41):

You could try it. But I don't know if this will give you a solution.

Shreyas Srinivas (Nov 08 2024 at 09:43):

A PR would probably include the script and some link to the chat transcript after the script succeeds

Michael Bucko (Nov 08 2024 at 10:16):

Sure, it won't solve everything at once. But it could perhaps assist (I guess even Vampire / Mace4 pros could have errors in their code). I think it'll be a more iterative experience -- one will make adjustments and lead, and the AI will try to adjust and run the code.

And it will be natural language.

I'll look into this and report in this ticket.

Amir Livne Bar-on (Nov 08 2024 at 10:46):

I didn't try generating tptp files with LLMs, but my experience with other less-popular languages (such as ocaml) is that they tend to make more syntax errors and logic errors than in popular ones. So we'll still need to verify that the tptp files test what the user thought about. Maybe we can format them in a nicer way though? Is there a tptp to latex converter?

Shreyas Srinivas (Nov 08 2024 at 10:55):

Ultimately the proofs have to be formalised in lean

Shreyas Srinivas (Nov 08 2024 at 10:55):

That's the minimum requirement for a PR to get merged

Michael Bucko (Nov 08 2024 at 10:57):

Here's what the script is currently doing:

That's how you run it:
aivampire.py --file aivampire.tptp --query "can you try the same exercise with m(W,t(Y,Z))=t(W,Y) replaced by m(t(Y,Z),t(W,Y))=t(t(Y,Z),W)?"

Adjusted TPTP:

fof(eq1323,axiom,

X = m(Y,m(m(m(Y,Y),X),Y))

).

fof(def_T, axiom,

t(Y,X) = m(m(m(m(Y,Y),m(Y,Y)),X),m(Y,Y))

).

fof(axiom_7, axiom,

m(t(Y,Z),t(W,Y))=t(t(Y,Z),W)

).

fof(eq2744,conjecture,

X = m(m(m(Y,Y),m(Y,X)),Y)

).

**Not Proved**

Michael Bucko (Nov 08 2024 at 11:00):

ie one could continuously make requests, and then review the code. One has to then review the code - LLMs make mistakes, but they are getting better too, plus I thought this would not be a bad use case (since it uses a file that already works).

Michael Bucko (Nov 08 2024 at 11:01):

I could try to then generate Lean, but I don't trust LLMs when it comes to Lean too much. (I am still learning Lean myself, and want to contribute to mathlib4)

Michael Bucko (Nov 08 2024 at 11:02):

For the reference:

the sample file comes from @Bernhard Reinke 's 1516->255 research
the sample request comes from @Terence Tao (also related to the beautiful 1516->255 discussion)

Also, this could also be a TS / NextJS app, but I guess it's okay to start with a small CLI experiment.

Shreyas Srinivas (Nov 08 2024 at 11:08):

It's definitely worth an attempt. I am just saying, the resulting PR which produces proofs must have them formalised. That's the bar too meet in a project like this

Michael Bucko (Nov 08 2024 at 11:15):

Tbh I don't think LLMs can't write reliable (enough) code. That's changing fast with agents -- but it's not yet there (one still has to lead, fix, decide, and so on).

We could adjust the script to return Lean code, but it's gonna be most likely more like an LLM recommendation (I think) that won't compile (and will require a Lean expert).

Shreyas Srinivas (Nov 08 2024 at 11:16):

If you need to do some hand formalisation after discovering proofs that is fine, but we can't accept unverified proof claims

Shreyas Srinivas (Nov 08 2024 at 11:16):

Especially because of the reason you mention, namely, that LLMs can't be trusted to output anything correctly

Michael Bucko (Nov 08 2024 at 11:23):

So the question is whether a Lean recommender on top would be useful or not. I don't think they can generate code reliably.

That's more of a use case question, and whether you think that kind of CLI for automating TPTP gen on top of files could be useful. I would use it, because making small adjustments in the existing code is easier for LLMs that writing heavy Lean code.

Shreyas Srinivas (Nov 08 2024 at 11:28):

I am not sure just adding that is useful tbh

Shreyas Srinivas (Nov 08 2024 at 11:28):

With LLMs you want manual intervention or proof verification at each step

Shreyas Srinivas (Nov 08 2024 at 11:28):

Without that it's just ChatGPT on a loop

Michael Bucko (Nov 08 2024 at 11:31):

I don't think generating fool-proof Lean code at scale will be that easy. It might still take some time before LLMS with ATP will be able to do it perfectly.

But I find the combination of LLM and ATP interesting. It's up to you. I don't think LLMs can write perfect Lean code atm.

Shreyas Srinivas (Nov 08 2024 at 11:32):

I know that. What I am saying is, I need to see the value proposition here. The whole point of combining ITPs and LLMs is for ITPs to catch LLMs mistakes

Shreyas Srinivas (Nov 08 2024 at 11:32):

If we are not doing that, then why even use an ITP?

Shreyas Srinivas (Nov 08 2024 at 11:32):

We could have also simply accept Vampire and eprover's claims as is in such a scenario

Shreyas Srinivas (Nov 08 2024 at 11:34):

Lean is the glue of trust here. Without lean verification, this project wouldn't make sense from a math point of view.

Michael Bucko (Nov 08 2024 at 11:38):

Yes, I think the trick here is that I am using a file in the CLI setup, and only requesting the adjustments (rather than asking the LLM to write code from scratch). So my thinking was that such a combo of ATP + LLM could be interesting.

But again, this is a small script. I'm not sure if I'll be using it. For me, it's more of an experiment.

Michael Bucko (Nov 08 2024 at 11:39):

I agree about Lean. But I think LLMs won't be able to write Lean that efficiently. Even with C++, Go, Rust, Ocaml, Nextjs, there's always issues -- and a senior guy needs to lead.

Shreyas Srinivas (Nov 08 2024 at 11:42):

That's fair. Experiments are always good and it is nice to see these experiments happen. But when accepting a PR we need to see what the experiment returns and in this case whether we can verify it.

Michael Bucko (Nov 08 2024 at 11:43):

Absolutely :+1:

Michael Bucko (Nov 08 2024 at 13:38):

Btw. I am trying to finish all the proofs from the Mechanics of Proof book. And even with the simpler ones (like the two below) the LLMs are still struggling (but I am still betting on this, since I think it'll change very quickly).

Spoiler alert for those also doing these proofs.

example {a b : ℚ} (h1 : a - b = 4) (h2 : a * b = 1) : (a + b) ^ 2 = 20 :=
  calc
    (a + b) ^ 2 = (a - b) ^ 2 + 4 * (a * b) := by ring
    _ = 4 ^ 2 + 4 * 1 := by rw [h1, h2]
    _ = 20 := by ring

example {r s : ℝ} (h1 : s = 3) (h2 : r + 2 * s = -1) : r = -7 :=
  calc
    r = r + 2 * s - 2 * s := by ring
    _ = -1 - 2 * s := by rw [h2]
    _ = -1 - 2 * 3 := by rw [h1]
    _ = -7 := by ring

example {a b m n : ℤ} (h1 : a * m + b * n = 1) (h2 : b ^ 2 = 2 * a ^ 2) :
    (2 * a * n + b * m) ^ 2 = 2 :=
  calc
    (2 * a * n + b * m) ^ 2
      = 2 * (a * m + b * n) ^ 2 + (m ^ 2 - 2 * n ^ 2) * (b ^ 2 - 2 * a ^ 2) := by ring
    _ = 2 * 1 ^ 2 + (m ^ 2 - 2 * n ^ 2) * (2 * a ^ 2 - 2 * a ^ 2) := by rw [h2,h1]
    _ = 2 := by ring

Shreyas Srinivas (Nov 08 2024 at 13:50):

You can use zulip's spoiler feature for spoiler alerts

Shreyas Srinivas (Nov 08 2024 at 13:51):

I don't know which specific LLM this is and how they learn from feedback, so it is hard to say anything concrete about how they will perform

Amir Livne Bar-on (Nov 08 2024 at 15:31):

Shreyas Srinivas said:

We could have also simply accept Vampire and eprover's claims as is in such a scenario

This is kind of what happens: when trying out a new ansatz or a new combination for the greedy completion construction, people first ask an ATP to prove/disprove it and trust the result enough to devote days and weeks of work on what the ATP didn't contradict. In this scenario it makes sense to create tools to interact more easily with ATPs.

Amir Livne Bar-on (Nov 08 2024 at 15:35):

(and there are multiple results that we record as "conjectures" that were proven by incomplete arguments in English supported by results from Vampire runs, and there is less focus on formalizing them than on finding new ones)

Michael Bucko (Nov 08 2024 at 15:54):

Amir Livne Bar-on said:

Shreyas Srinivas said:

We could have also simply accept Vampire and eprover's claims as is in such a scenario

This is kind of what happens: when trying out a new ansatz or a new combination for the greedy completion construction, people first ask an ATP to prove/disprove it and trust the result enough to devote days and weeks of work on what the ATP didn't contradict. In this scenario it makes sense to create tools to interact more easily with ATPs.

Exactly main point. Those tools still don't easily reach traction at scale-- but could get some people to fall in love with them. Perhaps with a simple UI.

Michael Bucko (Nov 09 2024 at 12:38):

For those interested, I made it available here https://github.com/riccitensor/vampirellm/

It uses the OpenAI API (latest syntax; you can ofc. switch to any other LLM), and works as a CLI tool.

After you've created a TPTP file, you can try to make simple modifications in the command line. Then the tool will run and interpret Vampire's output. My assumption here is that smaller adjustments to the file should be easier for the LLM than writing the entire file.

Michael Bucko (Nov 09 2024 at 12:40):

LLMs are currently not that reliable, and someone senior has to look at the code (or in this case - a TPTP file), but this will be changing, so I think it's not a bad idea to have a look.

Michael Bucko (Nov 09 2024 at 13:37):

Btw. even though I now feel comfortable writing those TPTP files (or z3 scripts), I can still have errors in those scripts. That kind of LLM assistant makes it easier -- because I only need to review/push the diff to the repo.

Last updated: May 02 2025 at 03:31 UTC