Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: ARC Prize


Jason Rute (Jun 12 2024 at 13:03):

The ARC benchmark now comes with a $1 million prize. https://news.ycombinator.com/item?id=40648960
We have talked about this benchmark before on #Machine Learning for Theorem Proving > dreamcoder

Jason Rute (Jun 12 2024 at 13:12):

The ARC problems are a kind of visual intelligence test. If you have never seen them, you can try them out for yourself at https://arcprize.org/play.
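
Roughly, each task in the public repo is a small JSON file: a few demonstration input/output grids plus one or more test grids, where every grid is a 2D array of colour codes 0–9. Purely as an illustration of that shape (this is my own sketch, not an official schema), in Lean-style structures:

```lean
-- Illustrative only: the real tasks are JSON files, but the shape is roughly this.
structure ArcPair where
  input  : List (List Nat)  -- grid of colour codes 0–9
  output : List (List Nat)

structure ArcTask where
  train : List ArcPair  -- demonstration pairs shown to the solver
  test  : List ArcPair  -- pairs whose outputs must be predicted
```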

Junyan Xu (Jun 12 2024 at 14:27):

pretty similar in format to the AIMO prize, it seems :)

Jason Rute (Jun 13 2024 at 23:23):

Here is an interesting podcast/video interview with Francois Chollet about the prize: https://www.youtube.com/watch?v=UakqL6Pj9xo

Also, here is the Kaggle competition: https://www.kaggle.com/c/arc-prize-2024

Adam Kurkiewicz (Jun 18 2024 at 06:27):

Here's a claim of a 50% result on ARC-AGI (74% on the subset of the data that has a human baseline of 85%).

It uses GPT-4o (and reportedly a lot of RAM), so it is not eligible for the prize:

https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt

This is a blog post rather than a research article, but the source code is provided, so it could in principle be reproduced (although probably a little too costly for somebody without a research budget):

https://github.com/rgreenblatt/arc_draw_more_samples_pub

Junyan Xu (Jun 18 2024 at 07:16):

See also https://x.com/dwarkesh_sp/status/1802771055016378554

Junyan Xu (Jun 18 2024 at 07:25):

I'm wondering whether it's appropriate to leak 100 test problems to OpenAI, giving them an unfair advantage (e.g. as seed data to generate synthetic problems).

Adam Kurkiewicz (Jun 18 2024 at 08:19):

It's not completely clear to me what "train", "public train", and "test" mean here.

I think the ARC problems in the Kaggle competition are completely hidden? And this hasn't been run on the hidden test set, but only on the "public test"?

Jason Rute (Jun 18 2024 at 11:09):

One thing Chollet says in the interview above is that if you want to test on GPT-4, then the best way is probably to make some of your own completely new hidden test problems (not hard problems, but completely new problems), and try it on those.

Adam Kurkiewicz (Jun 18 2024 at 15:20):

Good point @Jason Rute: it's hard work coming up with those though (in sufficient number)!

I want to build a benchmark for Lean autoformalisation (let's say LAB: Lean Autoformalisation Benchmark). The model would be supplied with the complete (and detailed) natural language for one of the IMO problems, together with the formal statement of the problem (I've been formalising some for Compfiles) with sorry as the proof to fill in, and it would have to come up with the formalisation of the solution. I would completely hide the "private test" behind Kaggle (although, since each item is a previously published IMO problem, there is of course a risk that somebody formalises it independently, e.g. for Compfiles, and that then leaks).
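
To make that concrete, here is a minimal sketch of what a single item might look like; the theorem name and the statement are placeholders of my own, not an actual IMO problem or anything from Compfiles:

```lean
import Mathlib

/- Hypothetical LAB item (illustration only).

Natural-language problem: "Show that 1 + 2 + ⋯ + n = n(n + 1)/2
for every natural number n."

The benchmark would ship the statement below with `sorry` as the proof;
the model has to replace `sorry` with a formalisation of the written solution. -/
theorem lab_item_01 (n : ℕ) :
    2 * (Finset.range (n + 1)).sum id = n * (n + 1) := by
  sorry
```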

But it will be a lot of work to do this properly, even for a very small number of problems (like 4 or 6). Formalising maths is just really laborious!

Another thing that I'll have to work out is how to run the submissions "adversarially", checking which axioms they use, etc. There's probably a Lean flag or command for this somewhere.
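
One concrete starting point, as far as I know, is the `#print axioms` command (the theorem name and proof here are just placeholders):

```lean
-- Hypothetical submission: the statement would come from the benchmark file,
-- and `by simp` stands in for whatever proof the model produced.
theorem submission_example (n : Nat) : n + 0 = n := by
  simp

-- Lists every axiom the proof ultimately depends on. A proof that smuggles
-- in `sorry` is reported as depending on `sorryAx`, and any extra axioms
-- declared by the submission would show up here as well.
#print axioms submission_example
```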

Junyan Xu (Jun 18 2024 at 17:13):

I think the ARC problems in the Kaggle competition are completely hidden? And this hasn't been run on the hidden test set, but only on the "public test"?

I think you're right. I just read the blog post and the author says in the first sentence that the 50% accuracy is from the public test set, presumably the 400 evaluation tasks in the repo (I read on X that the author tested on 100 tasks). Later in the post he says "The prior state of the art on this dataset was 34% accuracy" with a link to the ARC Prize Leaderboard (not the Kaggle leaderboard), which reports performance on the private test set, but footnote 3 says it doesn't matter because the private test set is supposed to be IID with the public test set. Footnote 2 says the author hasn't submitted to the ARC Prize Leaderboard, so his method hasn't been tested on the private test set yet (and testing on the private set won't allow Internet access).

Junyan Xu (Jun 18 2024 at 18:34):

I had that question after reading https://x.com/bshlgrs/status/1802766449075003770:

Train set: 71% vs a human baseline of 85%
Test set: 51% vs prior SoTA of 34% (human baseline is unknown)
(The train set is much easier than the test set.)
(These numbers are on a random subset of 100 problems that we didn't iterate on.)

which also implies that the accuracy on the public test set is representative of that on the private test set. However, on [Hacker News](https://news.ycombinator.com/item?id=40712282), Mike Knoop has pointed out that this assumption is false:

this result is on the public eval set vs private set (ARC Prize $).
the current private set SOTA ~35% solution also performed ~50% on the public set. so this new result might be SOTA but hasn't been validated or scrutinized yet.

So apparently the public test set is harder than the training set, and the private test set is still harder.
Update: but this contradicts the official guide: "The public evaluation sets and the private test sets are intended to be the same difficulty."

Human accuracy on the training set is 84% according to a 2021 New York University study, and the grand prize goal is set at 85%. I guess no similar study has been conducted on the public test set yet, but it might soon be. Chollet commented on human performance on X:

The private test set was checked for feasibility by two people, who had never seen the tasks before. They successfully solved 100% of the tasks. I would expect a smart person (>1 sigma) to be able to do >95%, and two or more should do 100%. At 2 sigma you should score 100% easily.
ARC-AGI isn't intended to be hard.
The bar to "solve" ARC-AGI is only 85%. We think this roughly aligns with average human performance, although this hasn't been rigorously measured on the private test set (being private, it's preferable to restrict the number of people who have seen it)

Given that the private test set is a lot harder than the training set for machines, it seems odd to expect an average human to perform similarly on both sets ...

Junyan Xu (Jun 18 2024 at 18:42):

BTW, Chollet has now made two posts on X bullish on the prospects of program synthesis for solving reasoning (I don't remember when I last liked his posts, but I just liked both of these):
https://x.com/fchollet/status/1803096195684012371
https://x.com/fchollet/status/1802801425514410275
and said DreamCoder was once his favorite paper:
https://x.com/fchollet/status/1800647127988670510
(he actually recommended it in 2022)

Adam Kurkiewicz (Jun 18 2024 at 18:56):

because the private test set is supposed to be IID with the public test set.

I remember chuckling when I read this in the blog post. How are you supposed to put a distribution on something that is supposed to be unique (each ARC problem is supposed to be "outside" the distribution, i.e. to require completely novel reasoning)?

Jason Rute (Jun 18 2024 at 21:31):

I think there are three sets. If you go to https://arcprize.org/play you see there is a "public evaluation set (hard)" and a "public training set (easy)", and I'm almost sure neither is the private set used for testing in Kaggle. I think when the blog post talks about the part of the dataset where humans get 85% and they get in the 70%s, they mean the "public training set (easy)", but I could be mistaken.

Jason Rute (Jun 18 2024 at 21:34):

(Oh sorry, I didn't read your message carefully, @Junyan Xu. You already pointed all this out.)


Last updated: May 02 2025 at 03:31 UTC