Zulip Chat Archive

Stream: lean4

Topic: Vibe Proving


West (Sep 03 2025 at 13:11):

Is it safe to say that, as long as you understand the final theorem you're working towards, and you manage to get something to compile with no errors and no sorrys/admits, then it doesn't really matter what led you to your goal?

I ask because GPT-5 managed to complete a proof for me, just by making small changes and correcting based on feedback from lean4 until there's no errors in a loop. While it could be refactored to be more readable/less verbose, etc. if my understanding is correct, it doesn't matter because that proof has been proved.

Here's the proof:
https://gist.github.com/wildwestrom/9f6d45f087503c5850ccff58f0b85e27

Ruben Van de Velde (Sep 03 2025 at 13:19):

Depends on your goal. If you just want to convince lean something is true then yeah, that works. If you want to learn what you're doing, maybe not so much

Arthur Paulino (Sep 03 2025 at 13:22):

There's also the cost of type checking. If you, somehow, found a proof that takes 10 minutes to be checked by Lean but it could be done in 2 seconds, then I think the path you took matters

Martin Dvořák (Sep 03 2025 at 13:25):

From my point of view, yes.

Set the goal (write the theorem and check all definitions it depends on to be sure they are what you think they are) and then get to a sorry-free* proof using any means necessary!

Of course, if you want to contribute to Mathlib, the actual proof will matter, but for most other purposes I'll stick to a simple "yes".

(*) I mean, a proof that depends only on these three axioms:
[propext, Classical.choice, Quot.sound]

West (Sep 03 2025 at 13:27):

I intuitively understood that the function in question is infinitely differentiable (it's continuous (no sharp points, holes, asymptotes on the domain in question), and $d/dx e^x$ is $e^x$). I just didn't want to go through all the pedantic steps and figure out what I'm supposed to look for in mathlib.

West (Sep 03 2025 at 13:49):

As you could tell, I'm not experienced with lean4, but I was wondering if y'all could take a look and see if there's anything that could obviously be refactored out or simplified, for example primitive_is_C_inf_on_unitInterval.
I already did as much as I could wrap my head around.

West (Sep 03 2025 at 13:53):

Ruben Van de Velde said:

...convince lean something is true...

By the way, not sure if you meant this, but I'm not trying to convince lean of anything. If anything, I want lean to convince me that it's done the math right.

Ruben Van de Velde (Sep 03 2025 at 14:01):

Speaking metaphorically, in the sense of writing code that lean can compile

(deleted) (Sep 04 2025 at 03:14):

What are you using? Copilot?

(deleted) (Sep 04 2025 at 03:16):

Do you use LeanExplore?

West (Sep 04 2025 at 03:42):

Huỳnh Trần Khanh said:

What are you using? Copilot?

I used Cursor with gpt-5-low and I had generated a rule that basically describes how to debug (make a small change, see the output of the current state of the proof and the compiler, and repeat).

Claude, gemini, and "auto" failed spectacularly every time.

West (Sep 04 2025 at 03:43):

Huỳnh Trần Khanh said:

Do you use LeanExplore?

Never heard of this. I'll give it a try next time I want to prove something.

West (Sep 04 2025 at 03:55):

One more thing. When using cursor, you also might benefit from removing .lake from .gitignore temporarily so it can search mathlib and other libraries directly. If you gitignore it, Cursor has more trouble grepping through the codebase. Hopefully Cursor will get better at indexing the Mathlib docs soon, last time I tried it did not work at all. Better yet if somebody makes a good MCP server.

(deleted) (Sep 07 2025 at 12:38):

Thank you. My verdict: it has potential. But the most important weakness is the model isn't able to tell whether the list of errors is complete or not.

(deleted) (Sep 07 2025 at 12:38):

So to maximize the potential of GPT-5, we need to make an MCP server that only reports errors when Lean has finished processing the file.

(deleted) (Sep 07 2025 at 12:39):

Cursor successfully proved my theorem, thus helping me earn $105.

(deleted) (Sep 07 2025 at 12:40):

The age of AI printing money for me isn't far away. It's right here.

(deleted) (Sep 07 2025 at 18:01):

Because of my greed, I made Cursor solve another problem. It didn't do that well.

(deleted) (Sep 07 2025 at 18:12):

The reasoning depth matters a lot. If the model still has to do a lot of extra reasoning beyond the provided blueprint then it's more likely to fail.

(deleted) (Sep 07 2025 at 18:16):

To prevent the model from acting on an incomplete error list, a simple #check at the end of the file is enough.

West (Sep 07 2025 at 23:56):

This may be of help: https://gist.github.com/wildwestrom/55f7b11b3ab821e517e0c6d1bd31a5c2

It's hit or miss with 'auto' and seems to work well with 'gpt-5-low'.

I recently started using this lean mcp server.

Mr Proof (Sep 08 2025 at 01:12):

@West Woah. That's pretty cool. Is it all automated? I tried a bit of vibe proofing with the normal GPT in the browser and it just made up a load of stuff. My method was to try a top-down approach and to ask it to try and take the proposition and write it in terms of a few other unproved propositions and do that recursively. Didn't really work though :cry:

Do you have a screen recording of it solving your proof. That would be nice to see.

West (Sep 08 2025 at 01:18):

It definitely needs babysitting. I probably won't make a recording either. But anyway, like I said, gpt-5 is the only family of models that work well with lean within cursor, and you really have to make sure it doesn't make changes without seeing the goal state and diagnostics.

Mr Proof (Sep 08 2025 at 01:22):

P.S I like the name "Vibe Proofing" :copyright:.

(deleted) (Sep 08 2025 at 08:29):

Not a fan of the MCP server. It's too buggy.

(deleted) (Sep 08 2025 at 08:30):

Don't use it.

(deleted) (Sep 08 2025 at 08:37):

I'll just babysit Cursor and tell it to reread lints when there are more errors than what it previously fetched

West (Sep 08 2025 at 08:43):

In any case, let's work together so we can make vibe proving a better experience. We can start by sharing Cursor rules and other procedures. You can modify mine and instead of telling it to use the MCP tell it which commands to run.

Also, I forgot to mention. Somewhere in the project (in a separate file or at the top of the file with your proof) it helps to add context about the problem you're working on. This primes the language model to use its math knowledge more.

West (Sep 08 2025 at 09:09):

Using LLMs for doing Lean is more a problem of having the right information (about APIs, syntax, procedures, etc.) rather than the math itself. This is because language models have math as a large part of their training data.

Eric Wieser (Sep 08 2025 at 09:30):

As a drive by comment, I'm reading "vibe proofing" as "vibe-proofing" as in "making Lean resilient to AI generation" (better error messages etc). I'd perhaps suggest "vibe proving" instead as the analogy to "vibe coding".

West (Sep 08 2025 at 09:49):

You're write. I realized it sounded kinda dumb so I changed my words before your comment.

West said:

vibe proving

I'll change the name of the topic.

Oliver Dressler (Sep 08 2025 at 10:21):

I am developing this MCP. @Huỳnh Trần Khanh I am sorry you had a buggy experience. You can dm me (not to pollute this thread) or open an issue, maybe we can fix some of these bugs.

@West I would be interested in linking your instruction file in the README as an example, if you are OK with this. You can let me know, once you have a ready version...

West (Sep 08 2025 at 10:24):

Oliver Dressler said:

West I would be interested in linking your instruction file in the README as an example, if you are OK with this. You can let me know, once you have a ready version...

I think it could be refined a lot more.
I just asked GPT 5 to generate it and then tweaked it.
It could be more concise.
You're free to use it. Let it be public domain.

Oliver Dressler (Sep 08 2025 at 10:29):

West said:

Oliver Dressler said:

West I would be interested in linking your instruction file in the README as an example, if you are OK with this. You can let me know, once you have a ready version...

I think it could be refined a lot more.

No hurry, I'm happy to wait a bit for some refinement then.

I am aware that instructions are the missing piece for successfully using the MCP. I have been planning to start a collection of examples for a while but haven't gotten around to it.

Honestly I am not a very prolific user of the MCP myself, so I am reliant on user feedback for this.

Rye Howard-Stone (Sep 08 2025 at 12:56):

I love this concept!

I've been experimenting with a number of tasks that agents can accomplish autonomously. I've mainly been using Claude-code, as I'm a bioinformatician and it is much better than GPT5 at coding, but I've been trying to create prompt systems and task scopes that are both useful, and comfortably within the models' current sets of capabilities. Things that don't take all too much effort to check a correct solution: optimizing memory or runtime efficiency, for example. Give the same input as the old program? You should get the same output. Objective measures like runtime or maximum memory utilization can be checked by the agent as it iterates.

If not given proper instructions, the agent may try to optimize the algorithm.. well.. robotically: redesign entire algorithm so input for n<500 is much faster (n<500 already only took 10 seconds). Or, it may find vestigial code in a large repository, and try to optimize that (not having profiled the code to figure out what gets run, and where execution actually hangs).

It's this journey that has brought me to Lean: my undergrad degrees were math and physics (although I admittedly.. barely survived Real Analysis), so I was fascinated to learn of a compilable, verifiable language for math proofs.

I have both cursor and ChatGPT Pro subscriptions, so I am able to try a lot of things here. I also have access to a powerful VM: an Ubuntu 22.04 KVM virtual machine configured with 190 virtual cores and ~380GB of RAM, running on a Dell PowerEdge R7525 server with two AMD EPYC 7552 48-Core Processors and 2TB total RAM, equipped with an A100 GPU with 40Gb memory.

I am, however, entirely new to Lean itself (and my proof abilities are very rusty). My research interests in this area do lie specifically in the automation of agentic AI contribution to scientific endeavors. This has proven difficult for broad tasks, but easier for those with limited scope. Directive sets like the ones you posted, @West:

West said:

https://gist.github.com/wildwestrom/55f7b11b3ab821e517e0c6d1bd31a5c2

are of the utmost importance in this area, in my opinion. I don't like calling what we are doing "vibes": vibes imply you have no clue what the thing is spitting back out at you. That's not the case here: we're not humanities majors coding a website to do ML on polymer prediction, or something we have no clue about. I prefer to call (at least, the current state of) this "Nanny-coding" (or "Nanny-proving" in this case): We are "babysitting" the results.

To that end, and.. given my limited confidence in the mathematical abilities I've retained over the last 8 years of not using them.. What do you think is a good place to start? Is there a nice, 'bite-sized' set of proofs you can direct me to? Are you willing to look at the results, and see if they can be integrated into the community?

How will I know for sure the agent has gotten the correct answer(s)? Are there common ways it will attempt to 'evade'? Are there definitive 'test harnesses' I can apply? Asserting the beginning, and end of the proof perhaps, in some way? In coding I'm able to directly test the output of an algorithm, or at least construct a set of tests with known inputs and outputs. I initially thought that vibe-, or Nanny-proving wouldn't be possible, for lack of verifiability.

As the models continue to expand their sets of capabilities, I think it's very important we work out some of these things now. Even if these things were already of "genius, PhD-level" intelligence, we would still need a verifiable check on machine outputs, and likely, providing a precise directives document with best practices would still produce better output.

Sorry for the essay!

West (Sep 08 2025 at 13:02):

What's important is that you know just enough Lean and mathlib to make sure you (or somebody) can understand exactly what your final theorem is asserting. Only then can you let the AI take over, otherwise it's meaningless.

Rye Howard-Stone (Sep 08 2025 at 13:03):

I guess that's what I'm asking: can't we set up the assertions at the end? Does the language permit some form of 'test harness'? Rather than writing the proof, specify the inputs and outputs?

West (Sep 08 2025 at 13:18):

You must be new to the language. Yes, if you just write your assertion (lemma or theorem) and put sorry or admit you're basically telling Lean "just trust me bro" and you can work on something else without the compiler complaining too much. It's useful for when you know something is true but you can't figure out how to code it out yet.

Rye Howard-Stone (Sep 08 2025 at 13:24):

I am, very! Got it, the language itself is sort of the test harness. So, a set of 'bite-sized' potential proofs for contribution might be things that have been left as sorry or admit in existing mathlib4 proofs?

Kevin Buzzard (Sep 08 2025 at 13:28):

mathlib4 has no sorry or admit, that's one of the rules.

Rye Howard-Stone (Sep 08 2025 at 13:34):

Gotcha! Sensible rule. So, where would you suggest starting? Any, 'low-hanging fruit' to test the ability of an automated system? The process of verifying that the statement of a proof is correct is not nearly as hard as actually proving it (I'm sure there's a counterexample, but for most things). Is there a DB or collection somewhere of statements of Lean proofs, yet to be formalized?

(deleted) (Sep 08 2025 at 13:39):

Lean 4 annotation firms. Look into them. Project Numina is one. You get to solve problems.

Rye Howard-Stone (Sep 08 2025 at 13:46):

Oh my that is a fascinating project!! Thank you! All these communities are clearly on the cusp of great things!

My thinking is that, re:

Arthur Paulino said:

There's also the cost of type checking. If you, somehow, found a proof that takes 10 minutes to be checked by Lean but it could be done in 2 seconds, then I think the path you took matters

That's fundamentally back to the 'efficiency' problem. If a human verifies the statement of the problem, and an AI is able to come up with some correct solution, it can then be profiled for type checking behavior -- and, optimized agentically. Type checking cost is itself an objective measure, right? Treat the inefficient 'first pass' as the 'ground truth', and optimize from there.

Eric Wieser (Sep 08 2025 at 13:58):

Huỳnh Trần Khanh said:

Lean 4 annotation firms. Look into them. Project Numina is one. You get to solve problems.

If these employers are paying for human-written Lean data to train AIs, I don't think they are going to be very happy about you giving them AI generated data

Kevin Buzzard (Sep 08 2025 at 13:59):

Rye Howard-Stone said:

Gotcha! Sensible rule. So, where would you suggest starting? Any, 'low-hanging fruit' to test the ability of an automated system? The process of verifying that the statement of a proof is correct is not nearly as hard as actually proving it (I'm sure there's a counterexample, but for most things). Is there a DB or collection somewhere of statements of Lean proofs, yet to be formalized?

You might be interested in https://github.com/SorryDB/SorryDB but I'm not promising any low-hanging fruit.

The process of verifying that the statement of a theorem is correct is something which I would not trust a machine to do; the process of verifying that a Lean proof of a theorem is correct is absolutely something I would trust a machine to do, indeed that's one thing Lean does.

(deleted) (Sep 08 2025 at 14:00):

Eric Wieser said:

Huỳnh Trần Khanh said:

Lean 4 annotation firms. Look into them. Project Numina is one. You get to solve problems.

If these employers are paying for human-written Lean data to train AIs, I don't think they are going to be very happy about you giving them AI generated data

They actively encourage it actually. The only rule is the code compiles.

Eric Wieser (Sep 08 2025 at 14:01):

Interesting!

Eric Wieser (Sep 08 2025 at 14:02):

The only rule is the code compiles.

Surely there is a rule also against empty files, against native_decide hacks, against malicious code, ...?

(deleted) (Sep 08 2025 at 14:05):

Every month someone goes through all my code and reviews to check whether there's something very bad. The main thing they check is wrong theorem statements, and very egregious native_decide hacks—which have yet to appear in anyone's code. Sometimes I'm asked to review other people's code.

(deleted) (Sep 08 2025 at 14:05):

There used to be a rule against AI but then I asked for clarification and they said only AI code that doesn't work properly is banned

(deleted) (Sep 08 2025 at 14:06):

And the pay is fixed, $105 per problem. And problems are very hard. I can only manage to solve one per day.

Rye Howard-Stone (Sep 08 2025 at 14:07):

Oh woah wait which one pays you lol?

(deleted) (Sep 08 2025 at 14:08):

Project Numina pays me through ABAKA AI and Kili Technology.

Rye Howard-Stone (Sep 08 2025 at 14:08):

Oh wow! Yeah I'm looking at the bottom now, where it says 'join as contributor'?

(deleted) (Sep 08 2025 at 14:09):

Email maxime.michel@kili-technology.com and ask him to join the project. You'll get a response on September 15th.

(deleted) (Sep 08 2025 at 14:10):

There might be other ways I'm not sure. But this is the easiest way.

Rye Howard-Stone (Sep 08 2025 at 14:12):

Thank you! I must say -- y'all are so nice! A lot of other disciplines would very quickly tell the novice: 'shush, go read. The AI can't help you' and just dismiss the giant text block! I'm finding different domains and communities are reacting radically differently to AI contributions, some will not hear of it, others are fascinated -- the right approach, I think, is skeptical!

Rye Howard-Stone (Sep 08 2025 at 14:14):

So on Numina -- the theorems that you are trying to prove, do they come stated in Lean?

Yakov Pechersky (Sep 08 2025 at 14:15):

Announcing how to set up AI to answer questions with a monetary reward is likely to lead to commoditization and thus the reward $$$ to decrease.

Rye Howard-Stone (Sep 08 2025 at 14:15):

I mean hey free market

(deleted) (Sep 08 2025 at 14:16):

Rye Howard-Stone said:

So on Numina -- the theorems that you are trying to prove, do they come stated in Lean?

They're not stated in Lean. You have to translate into Lean yourself. There are human reviewers after all.

(deleted) (Sep 08 2025 at 14:16):

I'm not afraid of more people contributing. Problems will get harder, but in the long run demand will increase, as Lean gets used by more people.

(deleted) (Sep 08 2025 at 14:16):

That's why I openly talk about my methods.

Ilmārs Cīrulis (Sep 08 2025 at 14:21):

AI is meh, I plan on using my natural stupidity as long as I can. :grinning:

That probably means I don't know how to use AI correctly (probably too old). :sweat_smile:

Rye Howard-Stone (Sep 08 2025 at 14:25):

That's the beauty of what I've been finding out: it's our 'natural stupidity' that is best to use when accelerating ourselves with AI. That mindset ('I have the plan in my head, I know what I would do, I know how I'll check it.. it will just take a while') is exactly the kind of thing that is useful to codify as prompts (like West's fantastic Lean proving gist)

(deleted) (Sep 08 2025 at 18:53):

Lol. Anyway

(deleted) (Sep 08 2025 at 18:53):

If the problem is API heavy GPT-5 with Cursor is likely to succeed

(deleted) (Sep 08 2025 at 18:54):

If the problem is reasoning heavy GPT-5 with Cursor is less likely to succeed

(deleted) (Sep 08 2025 at 18:54):

At least it's an improvement compared to Claude Code, which is unable to do anything

(deleted) (Sep 08 2025 at 18:55):

I haven't tried Cursor + GPT-5 without supplying an informal blueprint. In all previous attempts I always supplied a correct informal blueprint as a comment at the top of the file.

(deleted) (Sep 08 2025 at 18:56):

AI firms are actively monitoring the discussions here. I anticipate that they will start using GPT-5 in the future to prove some of the problems, taking them out of the problem pool.

Patrick Massot (Sep 09 2025 at 07:05):

I’m very happy to see Huỳnh Trần Khanh actively participating in AI collapse by feeding AI slop to AI training, please don’t let skeptics discourages you building us a better world!

(deleted) (Sep 09 2025 at 07:06):

Ok I'll stop

(deleted) (Sep 09 2025 at 07:07):

But the actual work is much more complicated... I also talked about AI failing catastrophically too

(deleted) (Sep 09 2025 at 07:09):

I posted screenshots of failures on Zulip

Patrick Massot (Sep 09 2025 at 07:10):

I don’t think you can stop AI training on AI generated stuff and getting worse and worse. Nobody can.

Kim Morrison (Sep 09 2025 at 09:51):

I'm really not sure here. It's just not obvious to me that an LLM driving some form of tree search, pruned via Lean feedback, cannot produce valid proof scripts, which, if incorporated in the initial training corpus will result in a stronger LLM (and hence agentic system). Indeed, it seems quite likely that this is possible --- and apparently contrary to Patrick's assertion.

It's initially AI slop, but potentially it becomes something better after being rejected by Lean enough. :-)

Even if you can get an LLM+Lean to get better than it started, or even better than humans, merely by running it in a closed box, it's still even more unclear how capabilities would scale with resources.

(deleted) (Sep 09 2025 at 10:57):

Too bad I'm not an AI engineer or an AI researcher. AI training feels like a giant game of guessing. But whatever, they want proofs, I give proofs. Some models claim to perform well on MiniF2F but end up being useless in real world usage. Some models are good enough for day-to-day use. And currently program verification doesn't get enough attention. I don't uncritically praise AI.

(deleted) (Sep 10 2025 at 07:13):

Project Numina has this tool to help everyone participate in AI collapse arm AI with a tool to verify Lean code: https://github.com/project-numina/kimina-lean-server. It is much more efficient than existing MCP servers. Try it out!

(deleted) (Sep 10 2025 at 11:38):

image.png
GPT-5 has a strong tendency to cheat by adding an axiom.

Tyler Josephson ⚛️ (Sep 10 2025 at 13:03):

When I task my students (scientists/engineers) to prove things, I think many of them implement forms of vibe proving. It’s not usually systematic and automated like this via Cursor, but instead involves manually chatting while copy-pasting Lean code and errors into the chat window.

Jakub Nowak (Sep 10 2025 at 13:17):

I've tried using GPT-4o mini for Lean questions a few times but it always hallucinated. :<

Arthur Paulino (Sep 10 2025 at 13:20):

I've found a middle ground that has been working nicely for me. Instead of expecting the LLM to spit out precise proofs, I ask it to structure a proof so I can deal with the details myself.

(deleted) (Sep 10 2025 at 13:24):

I am agentically proving an inequality

Arthur Paulino (Sep 10 2025 at 13:24):

My intuition: there's plenty Lean 3 code out there. And even within the Lean 4 realm, mathlib (and other libs) are alive and evolve. And, of course, Lean 4 itself is alive and evolves!

(deleted) (Sep 10 2025 at 13:26):

Yeah. Good thing Cursor provides enough context about the current state of mathlib4 to GPT-5

(deleted) (Sep 10 2025 at 13:26):

Otherwise GPT-5 keeps hallucinating to death

(deleted) (Sep 10 2025 at 13:27):

This of course requires that you delete gitignore

(deleted) (Sep 10 2025 at 13:29):

Hmm. Errors are everywhere. Everything moves slowly

(deleted) (Sep 10 2025 at 13:29):

Let's see how far it goes in the end

(deleted) (Sep 10 2025 at 13:29):

Maybe all this is crap

(deleted) (Sep 10 2025 at 13:32):

And I'm now using up the context window. I'll just tell the agent to fix existing errors for now.

(deleted) (Sep 10 2025 at 15:36):

Speculation: I have a feeling someone at OpenAI did at least some amount of fine tuning so GPT-5 is able to write Lean code. GPT-5 proofs have a lot of have statements too.

(deleted) (Sep 10 2025 at 15:37):

And as I mentioned in a previous thread, Project Numina deliberately engineered their LLM to produce many have statements. This is directly confirmed by their paper.

Emilio Jesús Gallego Arias (Sep 10 2025 at 17:06):

West said:

Is it safe to say that, as long as you understand the final theorem you're working towards, and you manage to get something to compile with no errors and no sorrys/admits, then it doesn't really matter what led you to your goal?

I'd dare to say the style of the proof does matter a lot for _proof maintenance_.

If your proof is a one-shot thing, it shouldn't indeed matter.

Maybe this paper is of your interest.

(deleted) (Sep 10 2025 at 17:26):

Proof maintenance becomes easy when AI can write the proofs, making it cheaper to make a new proof when underlying definitions change

(deleted) (Sep 10 2025 at 17:27):

This is what Certified Programming with Dependent Types says... wait it doesn't mention AI, just automation in general

Arthur Paulino (Sep 10 2025 at 17:34):

It all becomes easily defensible if you assume you have the ideal AI system

(deleted) (Sep 10 2025 at 17:49):

Which doesn't exist. That's why I get paid.

(deleted) (Sep 11 2025 at 11:00):

Sorry. Looks like this thing isn't as magical as I thought. Good thing it helped me earn $105. Now it isn't very helpful anymore. Gemini Deep Think is the most helpful tool I've used, nothing else has come close.

(deleted) (Sep 11 2025 at 11:00):

Sorry to disappoint. AI collapse hasn't happened yet.

(deleted) (Sep 11 2025 at 11:01):

I guess the most important thing I need to improve is my brain. I shouldn't search for the next magical AI tool.

(deleted) (Sep 11 2025 at 11:54):

May I state my personal opinion. Every time a person claims they managed to hook Lean and some lemma search tools up to a general purpose LLM it always turns out to be less useful than expected.

Alfredo Moreira-Rosa (Sep 12 2025 at 11:16):

Patrick Massot said:

I don’t think you can stop AI training on AI generated stuff and getting worse and worse. Nobody can.

I see a lot of missconceptions like this about AI. and it's quite the opposite. AI trained on pure RL becomes better and better, when there is no human in the loop.
AI learning by itself using it's own learned synthetic data makes the AI better. And you can do RL on lean thanks to lean feedback.
I'm pretty sure we'll see superhuman AI on lean proofs very soon thanks to this RL process.
Hallucinations you currently see on LLMs will vanish when the new insights from OpenAI on why they hallucinate so much are applied as the new training process standards.

Alfredo Moreira-Rosa (Sep 12 2025 at 11:20):

Here the paper on the OpenaI insights :
https://arxiv.org/abs/2509.04664

Eric Wieser (Sep 12 2025 at 11:31):

My impression is that a lot of AI trained on AI is carefully controlled by the researchers building the system, such that they're aware of the loopbacks. If they're paying specifically for human expert data, and unknowingly end up with AI output from a competing system, then probably 1) it is harder to design a stable training system, and 2) this might put them at legal risk if the other AI system prohibits training on its outputs.

Arthur Paulino (Sep 12 2025 at 13:15):

Yes, feedback is a game changer. Quite literally.
That's how AlphaGo eventually beat Lee Sedol. Because the computer can play against itself a number of times in an hour no human can come even close in a lifetime.

One challenge is designing the heuristic for how close you think you are from enclosing a proof goal. A dummy heuristic is "return 1 if the goal is closed and 0 otherwise".

(deleted) (Sep 12 2025 at 13:17):

The AI community has yet to devise another heuristic

(deleted) (Sep 12 2025 at 13:17):

It's very widely used

Arthur Paulino (Sep 12 2025 at 13:29):

It makes sense to be conservative. If you have a bad heuristic, you can waste a lot of money and time training a system with wrong rewards

Arthur Paulino (Sep 12 2025 at 13:38):

Hmm, there are interesting safe ideas like punishing the proof context size, proof size and number of theorems used to reward succinct and self-contained proofs as a proxy for "I know what I'm doing"


Last updated: Dec 20 2025 at 21:32 UTC