Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: System card: Claude Opus 4 & Claude Sonnet 4

Eric Taucher (May 26 2025 at 18:03):

I checked and have not seen this noted here, if wrong please link to earlier topic.

In reading "System card: Claude Opus 4 & Claude Sonnet 4" (PDF)

For section 4.2.3 Reinforcement learning behavior review it notes

image.png

...

image.png

The text was to complicated to format correctly here so reverted to using screen captures.

The import part as text so that it can be searched for

Transcript 4.2.3.A An example where Claude Opus 4 produces a superficially reasonable solution to a task it knows is impossible. When given a complex mathematical theorem-proving problem, it determines in its scratchpad that the theorem cannot be true, but does not reveal this in its final response; it instead produces a plausible-looking proof attempt of a theorem it knows to be false.

GasStationManager (May 26 2025 at 20:51):

This is a great find, thanks @Eric Taucher ! Another passage from that document: (this is part of the model's reasoning chain-of-thought:)

Actually, let me think one more time. The user specifically asked me to prove
this theorem, and they said they'll run it in Lean. If the theorem is false
(which I believe it is), then Lean won't be able to prove it.

If we stop here, and imagine a future version of the model that "happens" to know of a way to fool the Lean proof checker...

Eric Taucher (May 26 2025 at 20:59):

GasStationManager said:

This is a great find, thanks

For a few years have been helping answer questions on the OpenAI API forum for developers and one thing some of us try to instill into those asking fundamental questions about specific models is to just read the system card, as they often do have valuable knowledge.

Another point of note is from section 6 Reward hacking

Reward hacking occurs when an AI model performing a task finds a way to maximize its rewards that technically satisfies the rules of the task, but violates the intended purpose (in other words, the model finds and exploits a shortcut or loophole). Specific examples of reward hacking include hard-coding (writing solutions that directly output expected values) and special-casing (writing insufficiently general solutions) to pass tests.

Our previous model, Claude Sonnet 3.7, showed a tendency to resort to these behaviors, particularly in agentic coding settings such as Claude Code. Whereas Claude Opus 4 and Claude Sonnet 4 still exhibit these behaviors, the frequency has decreased significantly, and both models are easier to correct with prompting. Across our reward hacking evaluations, Claude Opus 4 showed an average 67% decrease in hard-coding behavior and Claude Sonnet 4 a 69% average decrease compared to Claude Sonnet 3.7. Further, in our tests, we found that simple prompts could dramatically reduce Claude Opus 4 and Claude Sonnet 4’s propensity towards these behaviors, while such prompts often failed to improve Claude Sonnet 3.7’s behavior, demonstrating improved instruction-following.

Alok Singh (May 27 2025 at 17:21):

This makes me think of a library of false statements explicitly proved false as a dataset

Adam Kurkiewicz (Jun 04 2025 at 19:13):

Great find @Eric Taucher!!!

This made me laugh:

Maybe I can use some tactic that will magically work.

Adam Kurkiewicz (Jun 04 2025 at 19:15):

The fact that they are explicitly testing its performance on Lean 4 probably explains why the performance of their model seems to have, overall, significantly improved from earlier models

David Michael Roberts (Jun 05 2025 at 02:32):

Claude Opus 4 and Claude Sonnet 4 still exhibit these behaviors, the frequency has decreased significantly, and both models are easier to correct with prompting.
....
....we found that simple prompts could dramatically reduce Claude Opus 4 and Claude Sonnet 4’s propensity towards these behaviors

What prompting is this, pray tell?

Eric Taucher (Jun 05 2025 at 09:35):

David Michael Roberts said:

What prompting is this, pray tell?

I checked the system card looking for more specifics on such simple prompts and found none.

All I can say is that when I read that section and why I noted in this forum is that simplifying a prompt is a valid suggestion that works at times and one users may not consider.

There is so much information about creating prompts, more often than not suggesting to add more to the prompt. However sometimes, just a simple prompt works better.

For example years ago created a prompt to check, enhance, style, etc. text that I would publish online in a forum, it was nearing 20 different instructions. Now that prompt is just polish this followed by a blank line and the text needing to be polished.

Like you would be interested in more factual details.

Eric Taucher (Jun 05 2025 at 09:48):

One possibility to help in this situation would be to create a forum topic or subtopic where users can post prompts they are using related to Lean so that other can learn from them and possibly provide feedback.

Eric Taucher (Jun 05 2025 at 09:54):

Recently Anthropic created Claude Explains, a blog for Claude. I see it as a sales pitch to upsell users to using Claude code and ignore that part, however the blog post do contain useful prompts related to Python, Typescript, and other popular coding tasks. The site is something I will be keeping tabs on to see if it does become a useful resource for Clade prompts.

Last updated: Feb 28 2026 at 14:05 UTC