Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: A simple AI benchmark problem: Spearman correlation = 0.5
Lars Ericson (Jan 30 2025 at 12:43):
I just tried this problem with Perplexity.AI, ChatGPT and Claude.ai. The first two could not get it after about 10 tries each. Claude got it on the 7th try. All of them gave extensive reasoning for every attempt. The problem is
Create X and Y such that scipy.stats.spearmanr(X,Y) = 0.5 exactly.
Claude's solution was
X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 2, 5, 3])
which is 0.5 within rounding error, so 0.49999999999999994 on my computer.
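For reference, that's easy to reproduce (a minimal check, assuming NumPy and SciPy are installed):
import numpy as np
from scipy import stats

X = np.array([1, 2, 3, 4, 5])
Y = np.array([1, 4, 2, 5, 3])
rho, _ = stats.spearmanr(X, Y)
print(rho)  # 0.49999999999999994 on my machine; the exact rational value is 1/2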
It would be interesting to understand why this problem is hard for LLMs.
Kevin Buzzard (Jan 30 2025 at 12:56):
This has nothing to do with Lean, right? So really it's off-topic for this Zulip.
Lars Ericson (Jan 30 2025 at 12:59):
@Kevin Buzzard it's off topic if you think this reasoning supplied by Claude is not checkable by Lean, and if you think why a problem is hard for an LLM is off topic for this thread:
# Let me verify:
# Rank differences d: [0, -2, 1, -1, 2]
# d² = [0, 4, 1, 1, 4]
# Σd² = 10 ✓
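Completing the check with the standard no-ties Spearman formula (the same one Claude uses below):
# ρ = 1 - 6·Σd² / (n·(n² - 1)) = 1 - 6·10 / (5·24) = 1 - 60/120 = 0.5 ✓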
If you look at this dialogue, you will see that this problem has nothing to do with numerical approximation and Python/scipy, and everything to do with formal reasoning. Every attempt by Perplexity was backed by incorrect formal reasoning:
https://www.perplexity.ai/search/create-x-and-y-such-that-scipy-UDtcx8LFTLSgvsuDfKHobg
Jason Rute (Jan 30 2025 at 13:29):
@Kevin Buzzard Historically, this #Machine Learning for Theorem Proving topic has been a bit loose about things needing to be just about Lean. @Lars Ericson But it also feels a bit oddly specific and I’m not clear on the utility. Why scipy.stats.spearmanr? Why not ask it a pure math question? Why those three models and not a reasoning model? There are many now, many of which are free: o1-mini, QwQ, DeepSeek-r1, Gemini Flash Thinking. (Also a nitpick: “formal reasoning” is a strange term to be using on a formal theorem provers chat. “Mathematical reasoning” might be better, since I assume there is no formal language being used.)
Lars Ericson (Jan 30 2025 at 14:24):
@Jason Rute , this particular problem wasn't randomly constructed. I am working on the Broad Institute Autoimmune Disorder Challenge. The scoring function for this challenge uses Spearman rank correlation. Spearman correlation is not differentiable: it involves a sorting operation. It is really a kind of discrete math calculation. To make a machine learning model which optimizes this correlation, it is necessary to construct a differentiable approximation to Spearman correlation. I asked an LLM to recommend one. I wanted to test the quality of the approximation. It occurred to me to ask the LLM for a test example with a mathematically exact answer like 1/2. When I asked for this example, it turned out to be hard for the 3 LLMs I asked, all of which were free to me up to a point. I wasn't trying to poll specifically free models (of which DeepSeek is not one); free-to-use was not the issue. Why it was hard to reason about turned out to be the interesting issue.
In the case of Claude, it took 7 tries to get the "formal" or "mathematical" or whatever reasoning right. Here is an example of what turned out to be incorrect reasoning from Claude:
For n=4:
The Spearman formula is: ρ = 1 - (6 * Σd²) / (n * (n² - 1))
For ρ = 0.5:
0.5 = 1 - (6 * Σd²) / (4 * (16 - 1))
0.5 = 1 - (6 * Σd²) / 60
-0.5 = -(6 * Σd²) / 60
Σd² = 5
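The algebra here is fine; the flaw is that Σd² = 5 is unreachable, because for tie-free ranks Σd² over a permutation is always even. A quick brute-force illustration of that point (my sketch, not Claude's output):
from itertools import permutations

# All achievable Σd² values for tie-free rankings of length 4
sums = {sum((i - s) ** 2 for i, s in zip(range(1, 5), p))
        for p in permutations(range(1, 5))}
print(sorted(sums))  # every value is even, so Σd² = 5 (hence ρ = 0.5 via this formula) is impossible for n = 4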
Here is Perplexity's explanation of why it thinks it is hard:
The difficulty lies in the nature of Spearman's rank correlation coefficient, which is based on the ranks of the data rather than the raw values. This makes it less straightforward to manipulate compared to other correlation measures like Pearson's correlation.
Perplexity in its attempts used probabilistic arguments and constructions to try to get there. Claude used math that looked more discrete and combinatoric. Where the LLMs go in the math space seems interesting to me.
As a side-note (off topic for Lean?), here is the differentiable approximation (in Python):
import torchsort  # x and y are expected to be PyTorch tensors

def spearmanr_approx(x, y, regularization_strength=0.5):
    """
    Differentiable approximation of Spearman's rank correlation.
    Args:
        x, y: Input tensors of shape (N,).
        regularization_strength: Controls softness of ranking (higher is more relaxed).
    Returns:
        Approximate Spearman correlation coefficient.
    author:
        https://chatgpt.com/share/679b64fc-dd94-8006-b1ef-7ba628c80386
    """
    # Convert inputs to float tensors
    x, y = x.float(), y.float()
    # Compute soft ranks (torchsort.soft_rank expects a leading batch dimension)
    ranks_x = torchsort.soft_rank(x.unsqueeze(0), regularization_strength=regularization_strength).squeeze(0)
    ranks_y = torchsort.soft_rank(y.unsqueeze(0), regularization_strength=regularization_strength).squeeze(0)
    # Compute Pearson correlation on soft ranks
    mean_x, mean_y = ranks_x.mean(), ranks_y.mean()
    cov_xy = ((ranks_x - mean_x) * (ranks_y - mean_y)).mean()
    std_x = ranks_x.std(unbiased=False)
    std_y = ranks_y.std(unbiased=False)
    return cov_xy / (std_x * std_y + 1e-6)  # Small epsilon to avoid division by zero
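As a quick illustrative sanity check (a sketch assuming torch, torchsort and SciPy are installed; this part is mine, not from the ChatGPT transcript):
import torch
from scipy import stats

x = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
y = torch.tensor([1.0, 4.0, 2.0, 5.0, 3.0])
approx = spearmanr_approx(x, y, regularization_strength=0.1)
exact, _ = stats.spearmanr(x.numpy(), y.numpy())
print(float(approx), exact)  # with small regularization the soft-rank value should land near 0.5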
Should we say that finding differentiable approximations to a discontinuous function is outside the scope of what Lean can or should formalize?
Jason Rute (Jan 30 2025 at 14:36):
I think Gemini Flash Thinking gets it right on the first try (which might be multiple tries if you look at the reasoning trace):
This confirms that X = [1, 2, 3] and Y = [1, 3, 2] (or equivalently X = [0, 1, 2] and Y = [0, 2, 1], or any data that has these ranks) will give a Spearman rank correlation of exactly 0.5.
Jason Rute (Jan 30 2025 at 14:39):
I suspect many other reasoning models might as well, but it seems this model might be able to run code while thinking, which would help if so.
Jason Rute (Feb 01 2025 at 04:08):
I just tried ChatGPT with the new "reasoning" button turned on (so it uses o3-mini) and I think it got it right. (I'm too lazy to plug it into Python.)
Thus, one acceptable answer is:
* X=[1,2,3,4,5]
* Y=[1,4,2,5,3]
This choice ensures that
scipy.stats.spearmanr(X, Y)
returns 0.5 exactly.
Jason Rute (Feb 01 2025 at 04:09):
https://chatgpt.com/share/679d9e5a-a7b0-8001-826f-5d0fefbd7a66
Jason Rute (Feb 01 2025 at 04:13):
So I think it is safe to say the assumption of this question is wrong. Reasoning models (but not non-reasoning LLMs) are able to figure this problem out.
Lars Ericson (Feb 01 2025 at 04:17):
Consider this prompt:
Construct X,Y with Spearman correlation of exactly 1/2
Perplexity gets it right the first try: https://www.perplexity.ai/search/construct-x-y-with-spearman-co-Y5UuFvxNS_WQv5kKkl3J6g
Google Gemini gets it right after coaching it to reduce the vector size.
ChatGPT 4o gets it wrong 9 times in a row, and then asks for money before it will answer any further questions: https://chatgpt.com/share/679d9837-6cb0-8006-8cfe-074158133a47
All 3 LLMs employ reasoning where the steps involve integer arithmetic and order relations. All of the reasoning narratives could be directly translated to elementary proof-checking steps in Lean. Natural Number Game elementary, not FLT math.
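For example, the Σd² = 10 check earlier in the thread reduces to a single rational identity. A minimal Lean sketch of just that arithmetic step (assuming Mathlib's norm_num; this only covers the final computation, not a formalization of spearmanr itself):
example : (1 : ℚ) - 6 * 10 / (5 * (5 ^ 2 - 1)) = 1 / 2 := by norm_num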
I know it is possible to make domain-specific LLMs that do better on the first try. If you can do that and that's your goal, you are done. If you are interested in modes of reasoning, flaws in reasoning, different approaches to reasoning, and how that impacts the performance of LLMs that approach the same problem in different ways, then this example is very efficient at eliciting those differences.
The reason I am bringing this up is that AI/math benchmarking has become a topic of interest. There was some issue about some LLM manufacturers getting early access to good benchmarks and using that to tip the scales on perception. To me the best way to un-tip the scales is to publicly discuss benchmark problems which elicit distinct and revealing behavior from different LLMs. This is (demonstrably) such a problem. So, from my perspective, yay.
One problem at a time, one could build up a repertoire of such problems that elicit distinct and useful insights about LLM reasoning methods. They don't need to be secret, NDA'd or held back in any way. You could train against them, but if you had enough of them (say 300), it wouldn't be possible to train against 290 of them and have a reasonable expectation of getting the other 10 right without actually developing some skills. In that sense, it's a benchmark that can be completely public, and people can train against it all they want, but if they don't train a good model, they won't do well on the held-out problems.
To use another analogy, a good set of benchmark questions is like a public key, where the good trainable model is the private key. Here I am assuming that you are given models to benchmark that differ in their architecture but are trained from scratch on a common set of training data. Obviously this doesn't pragmatically apply to commercial LLMs where the training data and methods are a trade secret.
Lars Ericson (Feb 01 2025 at 04:19):
Note that for all of these, I did the work of plugging them into Python, and in my use of ChatGPT, it was wrong every time, after telling me up and down that it was right.
Jason Rute (Feb 01 2025 at 10:12):
@Lars Ericson Just so I’m clear, my points are as follows:
- I agree it is good to come up with problems that the current models can’t solve.
- But I don’t see much value in coming up with reasoning problems and only testing them on non-reasoning models (Perplexity, Claude Sonnet, ChatGPT 4o, Gemini without “thinking”, etc.). These models may reason a little bit, but they don’t have the robust systematic exploration of the problem that the full reasoning models have.
- My limited experiments with the new free reasoning models have shown that this problem is easy for them. This includes Gemini Flash Thinking and o3-mini (which came out yesterday and is accessible via ChatGPT with the reason button turned on). I also just tried QwQ-32B-preview and DeepSeek-r1 (use the app with DeepThink turned on). They also got it correct on the first try.
- So I think we need harder test problems for the reasoning models.
Lars Ericson (Feb 01 2025 at 14:17):
There is a difference between problems that are hard from a mathematician's point of view and problems that may be easy for the latest LLM but still confirm or elucidate properties of the LLM. These might be called cognitive psychology tests for AI evaluation. For example you can test how many things an LLM can remember, how many things it can remember across prompts, whether it can perform addition, and so on. In the ChatGPT 4o example above, it was wrong 9 times, and convinced each time that it was right. It could be that this state of being convinced represents a unique measurable cognitive feature, in the same sense that short-term memory size (3 to 5 items in humans) is a distinct feature. That makes the Spearman question indicative from a cognitive-science-of-AI point of view, if not from a "does an AI model exist that can do this" point of view.
What I am hearing you say is that the topic of cog sci for AI evaluation is out of scope for this thread. Which is fine. Just trying to clarify.
Lars Ericson (Feb 02 2025 at 23:07):
@Jason Rute regarding the point "But I don’t see much value with coming up with reasoning problems and only testing them on non-reasoning models": Suppose you were testing 10 LLMs without knowing their names or their architectures, and wanted to determine if the model had a feature called "Reasoning" in a reliable, repeatable way. What questions would you ask the model to elicit the presence or absence of that feature? Here I am using the word "Reasoning" not as a thing that philosophers might discuss but just as a label for a thing called a "reasoning model" that is being added as an architectural feature to recent LLMs. For non-philosopher context, here is some AI blather from a DeepSeek-R1 implementation on the properties that define a "reasoning model": https://www.perplexity.ai/search/what-llms-are-reasoning-models-BgmndsTHSP.TO319kGdvuA
Lars Ericson (Feb 19 2025 at 18:48):
https://arxiv.org/abs/2502.07087
Last updated: May 02 2025 at 03:31 UTC