Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: A proposal for Safe and Hallucination-free coding AI


GasStationManager (Nov 05 2024 at 04:24):

I have written an essay "A Proposal for Safe and Hallucination-free Coding AI" (https://gasstationmanager.github.io/ai/2024/11/04/a-proposal.html), in which I propose an open-source collaboration on a research agenda that I believe will eventually lead to coding AIs that have superhuman-level ability, are hallucination-free, and are safe.

Given that I'm posting here, you can probably guess what I'm proposing. Some of what I say about Lean will be familiar to you, so feel free to skim those parts.

I propose a concrete path that combines ML with Lean for the task of code generation. In the essay I admit this approach faces some significant obstacles, including the lack of existing data. I discuss ideas to overcome this with autoformalization and with crowdsourcing. I am also open-sourcing some preliminary work: Code with Proofs: the Arena (https://github.com/GasStationManager/CodeProofTheArena) and Code with Proofs Benchmark (https://github.com/GasStationManager/CodeProofBenchmark).

Comments are welcome!

Shreyas Srinivas (Nov 05 2024 at 09:27):

I don't see what's new in this over what the community already knows, and what has been attempted across multiple papers at AI conferences over the last three years. Also, the first two points in the FAQ pretty much contradict each other. On the one hand you want to generate useful programs, which are almost always impure (often in the IO monad). On the other hand you want to generate pure functions for proof reasons.

GasStationManager (Nov 05 2024 at 13:13):

Thanks for the comments! I am not surprised that people are trying these kinds of ideas; it is not hard to see that this is a promising direction, and I am not claiming novelty in that sense. On the other hand, it is also not hard to see that a major obstacle is the lack of existing data. Can the community do something about that? My claim is that a community-wide collaboration on data creation may be needed, at least on the open-source side.

Regarding your second comment, I think that is a fair point. A partial answer, in my (non-expert) opinion: my impression is that the point of the IO monad is to let the rest of the code stay pure. If all the pure parts can be machine-verified, and the remaining IO parts are small enough that I can check them manually, then I might be prepared to trust the whole thing.
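To make that division of labor concrete, here is a toy sketch of my own (not from the essay): the verified logic is a pure function whose spec is machine-checked via its return type, and the IO wrapper is a few lines one can audit by eye.

-- Toy sketch: the pure core carries a machine-checked spec in its return type...
def double (n : Nat) : {m : Nat // m = 2 * n} :=
  ⟨n + n, by omega⟩

-- ...while the IO shell is small enough to check by hand.
def main : IO Unit := do
  let n := 21
  IO.println s!"double {n} = {(double n).val}"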

Kevin Buzzard (Nov 05 2024 at 13:35):

My impression is that people are thinking about synthetically generating data to train on; this is the sort of bonkers-sounding idea which will "obviously not work" but which then goes on to work. But as far as I know there's nothing public here.

Shreyas Srinivas (Nov 05 2024 at 13:37):

If you are not claiming novelty, then I would recommend including a good literature survey in the article (for which the threads on this Zulip will be useful). The survey should explain the lines of research currently being pursued and assess how far they go toward the broad community goal of keeping AI from hallucinating by using ITPs and of providing AI assistance in proof and program synthesis. There is also a non-trivial amount of non-AI work on synthesizing programs that match a spec.

Shreyas Srinivas (Nov 05 2024 at 13:38):

Also, for data generation, I see an economic problem. Those who train and offer AI models as a service benefit disproportionately (in many cases infinitely more) compared to those who create and curate the content that is fed to the AI as data. This has already been seen in the art community. Before Lean users start creating these datasets, there should be a clear and proportionate compensation model for those who create the data. Given the cost of running these models, it is not sufficient to offer open access to the models. Shared ownership of the resulting revenues, in the form of, say, guaranteed investment into ITP research and math, should be the bare minimum. Otherwise this is a lose-lose scenario for those who create this data.

GasStationManager (Nov 05 2024 at 23:24):

Re: Shreyas, it will be up to the community to set the data license to whichever one they are comfortable with. For example, if people don't want their data to be used for commercial purposes at all, then use a license with a non-commercial clause, like CC BY-NC-SA. I think the people most motivated to contribute will be those who are already interested in the advancement of (open-source) coding AI.

There could also be demand for a separate, compensation-based model, e.g. a bounty system. The licensing terms will likely be dictated by whoever posts the bounties, though.

Shreyas Srinivas (Nov 06 2024 at 02:13):

True, but people who contribute to the advancement of open-source AI implicitly do so in order to allow themselves and other members of the community to benefit from it. In the last two years it has become clear that relying purely on the good faith of the AI model owners is a very inadequate guarantee. I am raising this issue here so that any contributors who read this can consider in depth the negotiating position they are putting themselves in vis-à-vis the AI model trainers (Stack Overflow, GitHub, and ChatGPT come to mind).

Shreyas Srinivas (Nov 06 2024 at 02:13):

About the bounty model, this depends on how the community organises around this issue.

Shreyas Srinivas (Nov 06 2024 at 02:15):

In either case, bringing this issue to the forefront and making people aware is a necessary first step

Xiyu Zhai (Nov 20 2024 at 09:02):

I don’t think a huge amount of data is necessary. Otherwise this implies a sad future for autoformalization of mathematics. Well, let’s see what we can cook in the next few years.

GasStationManager (Dec 04 2024 at 23:46):

I have posted a couple of follow-ups, expanding on the various sub-projects, including the Arena website and autoformalization.

In the most recent post, I tried to come up with a nice proof that memoization (the algorithmic technique) is correct. After I had a proof I was satisfied with, I tried to teach it to Sonnet so that it could do the same for arbitrary functions.

This explores an idea I mentioned in the earlier essay: that, in my opinion, programming with dependent types (putting proofs of correctness inside the code) is a suitable style for LLMs to adopt.
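As a minimal illustration of that style (my own toy example, not the memoization proof from the post): the specification sits in the return type, so each branch of the implementation carries its own proof obligation, discharged inline.

-- The spec lives in the return type; the proofs are discharged inline.
def myMax (a b : Nat) : {c : Nat // c ≥ a ∧ c ≥ b} :=
  if h : a ≥ b then
    ⟨a, by omega⟩
  else
    ⟨b, by omega⟩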

How did Sonnet do? After prompting it with a worked example and asking it to memoize a new function, it produced a very respectable first attempt. There were a couple of errors of the type "simp didn't completely close the goal"; after I pasted the error messages back to Sonnet, it was able to fix them and produce a correct proof.

Takeaways

  • Sonnet is already very strong at writing Lean code;
  • it would be even stronger once it is equipped with the ability to run Lean code and fix its own errors,
  • and the ability to utilize a stronger hammer to close simple goals.
  • The style of programming with dependent types naturally leads the LLM to produce a good proof sketch; sometimes even better, an almost-correct proof that retains the proof-sketch structure, so that individual subgoals can be repaired.

Jason Rute (Dec 05 2024 at 00:33):

Does Sonnet mix up Lean 3 and Lean 4 syntax?

Jason Rute (Dec 05 2024 at 00:35):

With all the academic papers around, I sometimes wonder if we also just need tutorials showing how to use all the existing tools to their full advantage in Lean (or Coq, Isabelle), using tools that are practical and cheap enough not to break the bank.

Kim Morrison (Dec 05 2024 at 01:21):

Jason Rute said:

Does Sonnet mix up Lean 3 and Lean 4 syntax?

I use the Cursor editor all the time now, and I think their default model is Sonnet. I have not seen any Lean 3isms from it, at all.

GasStationManager (Dec 05 2024 at 02:06):

Yes, Sonnet has been very good at sticking to Lean 4 syntax; better than GPT-4o in my experience.
For this exercise, just to make sure, I first had Sonnet read the Sagredo prompt on Lean 4 syntax posted in another thread.

GasStationManager (Dec 10 2024 at 19:59):

In my next experiment with Sonnet, I asked it to use bottom-up dynamic programming to solve the rod-cutting problem (and prove its correctness). This time I did it in a command-line chat interface where Sonnet can directly pass its code to the lean executable and get feedback.

  • Sonnet did very well, considering it was not given a worked example of the same type of problem; only examples of memoization were given.
  • A flaw in my LeanTool implementation: the lean executable does not extract the goals at sorrys, while Sonnet often likes to output proof sketches with sorrys before going further (see the small illustration below). It looks like Pantograph's load_sorry would be great for this. @Leni Aniva what is the best way to install Pantograph in another project?
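To illustrate the gap with a made-up sketch: running lean on a file like the one below only reports that the declaration uses sorry; it does not print the goal state at each sorry, which is exactly the information I would like to feed back to the model.

-- A proof sketch of the kind Sonnet likes to produce: `lean` warns
-- "declaration uses 'sorry'" but does not show the goal at each sorry.
example (n : Nat) : n + 0 = 0 + n := by
  induction n with
  | zero => sorry
  | succ k ih => sorry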

Leni Aniva (Dec 10 2024 at 20:25):

GasStationManager said:

In my next experiment with Sonnet, I asked it to use bottom-up dynamic programming to solve the rod-cutting problem (and prove its correctness). This time I did it in a command-line chat interface where Sonnet can directly pass its code to the lean executable and get feedback.

  • Sonnet did very well, considering it was not given a worked example of the same type of problem; only examples of memoization were given.
  • A flaw in my LeanTool implementation: the lean executable does not extract the goals at sorrys, while Sonnet often likes to output proof sketches with sorrys before going further. It looks like Pantograph's load_sorry would be great for this. Leni Aniva what is the best way to install Pantograph in another project?

You don't install Pantograph in another project. Pantograph reads the project for you. The instructions are given here https://lenianiva.github.io/PyPantograph/setup.html

GasStationManager (Dec 10 2024 at 21:55):

Thanks, that helps! I was trying to see if I could call Pantograph from my Python project, but I see Pantograph needs to be run in its own Lean environment, so that can be tricky. For now I can put my scripts inside Pantograph's installation; that should work.

Leni Aniva (Dec 10 2024 at 22:18):

GasStationManager said:

Thanks, that helps! I was trying to see if I could call Pantograph from my Python project, but I see Pantograph needs to be run in its own Lean environment, so that can be tricky. For now I can put my scripts inside Pantograph's installation; that should work.

You can definitely call Pantograph from another Python project. It sets up the Lean envvars when it calls repl via a subprocess.

If you have any issues please reach out to me or see the examples I have in PyPantograph

GasStationManager (Dec 21 2024 at 09:39):

Update of LeanTool with new features:

  • extracting goal states for sorrys by calling Pantograph (Thanks @Leni Aniva for the help!)
  • system prompts to encourage the LLM to use some of the interactive features of Lean that are less represented in training data, including exact? and searching Moogle.ai via the #moogle command provided by LeanSearchClient (a small sketch of the intended usage follows after this list). Some of these are inspired by discussion in another thread. This is very preliminary; suggestions are welcome on improving the prompt. What other Lean features would you want the LLMs to make more use of?
  • OpenAI API-compatible proxy server: this can be plugged into any application that takes an OpenAI-compatible endpoint with a custom base URL. I have tested that it works with OpenWebUI, a fully featured chat interface, and Continue, a VS Code coding-assistant plugin.
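For a rough idea of the usage the prompt nudges the model toward, here is a hedged sketch, assuming a project where Mathlib (which makes exact? available) and LeanSearchClient (which provides #moogle) are dependencies; the example statement and query wording are my own.

import Mathlib
import LeanSearchClient

-- exact? searches the imported lemmas for a term that closes the goal
example (a b : Nat) : a + b = b + a := by exact?

-- #moogle sends a natural-language query to Moogle.ai and lists candidate lemmas
#moogle "commutativity of addition on natural numbers."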

Jason Rute (Dec 21 2024 at 16:25):

I really need to play around with your project sometime. (Also, have you seen: #general > What do you hope AI can help you achieve with Lean in 2025??)

GasStationManager (Jan 22 2025 at 22:40):

I just posted a report on some initial experiments using Lean and its features to detect and recover from hallucinations in coding AI models. To summarize:

  • Start with Lean 4 coding problem instances where some AI model hallucinates. In this post we focus on a problem where DeepSeek v3 made a logical error.
  • Explore approaches that utilize Lean features to allow the AI to detect and recover from the hallucination. Here, I explored applying property-based testing: generate random inputs and verify the input-output pairs against the formal specification.
  • In this instance, I was not able to directly apply Plausible because the problem's formal specification is not Decidable. Instead, I wrote a script that calls Plausible's #sample command to generate random input values, then calls a custom heuristic proof script to attempt to prove the property or its negation (a minimal sketch of the sampling step follows after this list).
  • The script was able to find counterexamples that violated the specification. Prompted with the counterexamples, DeepSeek was able to fix its code and output a corrected solution.
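The sampling step looks roughly like this (a minimal sketch, not the actual script; the heuristic proof part is omitted):

import Plausible

-- #sample prints a handful of random values of the given type; the script
-- plugs values like these into the spec and tries to prove it or its negation.
#sample Nat
#sample (List Int)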

The code is available.
Comments are welcome! I personally think this is where Lean and other ITPs can be extremely useful.

GasStationManager (Apr 13 2025 at 00:04):

In my most recent report, I explored a style of development that, once adopted, will help detect and reduce hallucinations in coding AIs. It is a combination of programming with dependent types and (property-based) test-driven development.

  • Encode the specification as type information in the input & return types, e.g. via subtypes.
  • This naturally results in proof goals inside the implementation. Initially, we can put sorrys in place of the required proofs.
  • Before attempting the proofs, we first want to make sure that the implementation is correct. We apply property-based testing (PBT) to check whether there are counterexamples to these proof goals (a toy sketch follows after this list).
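Here is a toy sketch of the workflow (an invented example, not one from the report): the spec is a subtype on the return value; one branch's proof is discharged inline and the other is left as a sorry, whose goal can then be property-tested before a proof is attempted.

-- Invented example: spec in the return type, one proof left as a sorry for now.
def absVal (x : Int) : {y : Int // y ≥ 0 ∧ (y = x ∨ y = -x)} :=
  if h : x ≥ 0 then
    ⟨x, ⟨h, Or.inl rfl⟩⟩
  else
    ⟨-x, sorry⟩  -- test this goal (e.g. with PBT) before attempting the proof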

Comments welcome!

GasStationManager (Apr 13 2025 at 00:22):

The reason I think this is promising for AI coding: it provides frequent and immediate feedback from Lean to detect hallucinations. When an AI (or a human) introduces a bug into code, finding the bug is essentially a credit-assignment problem. Doing property-based testing inside the code makes it easier to locate the bug.

One issue I encountered: a natural way to do property-based testing is to put plausible in place of the sorrys inside the implementation. However, I was often not able to make plausible work, even after making sure the input data types are SampleableExt and Shrinkable and the proof goal is Decidable. For example, plausible seems to get confused inside recursive functions:

def f (x : Int) : {y : Int // y > x ∧ y ≥ 0} :=
  if hcomp : x >= 0 then
    ⟨x + 1, (by omega)⟩
  else
    ⟨f 0, by plausible⟩ -- error: (kernel) declaration has free variables '_tmp✝'
termination_by (-x).toNat
decreasing_by
  simp_wf
  omega

Right now my workaround is to check the conditions manually and output error messages using IO.println.

Has anyone else encountered similar issues? I would prefer to resolve this inside Lean, e.g. with a modified plausible command, rather than doing it in Python; but I am not good at Lean metaprogramming (yet)...

Eric Wieser (Apr 13 2025 at 00:24):

I think such issues with plausible should be posted in #general or on the plausible issue tracker, rather than in a stream that not all plausible users are subscribed to

GasStationManager (Apr 14 2025 at 01:51):

Thanks for the suggestion! I'll make a topic in #general that goes into the details. Meanwhile I'm going to make a wild prediction: soon Plausible and similar testing tools will become more relevant to AI, just as work on REPLs is nowadays often closely associated with AI use cases. Just as REPLs lie on a critical path connecting LLMs to direct formal reasoning in Lean, Plausible and testing tools will lie on a critical path connecting informal reasoning output from LLMs (proof sketches, code) to formal feedback from the ITP.


Last updated: May 02 2025 at 03:31 UTC