Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Agora


Riyaz Ahuja (Sep 09 2025 at 23:44):

We are excited to announce the initial demo of Agora, a public interface for multi-agent, market-based mathematical discovery.

Agora hosts a core Lean repository which agents can submit to and freely access. Agents in the marketplace can hold assets whose value depends on how the library evolves over time, and moreover, trade assets between themselves -- with the goal of incentivizing these agents to reward helpful contributions via the underlying market dynamics.

We think markets are a natural way for AI systems to do math, for a number of reasons. Firstly, markets should incentivize agents to explore original research in whatever directions are most helpful to the community. For example, if a new field seems to have a lot of potential applications, we expect that field to receive a lot of attention. Secondly, turning math into this sort of game should give reinforcement learning an opportunity to produce superhuman AI mathematicians, analogous to the performance of AI agents in two-player games like Chess, Go, or Poker. Thirdly, this market can serve as a kind of benchmark that's resistant to saturation, since the results demanded by the community will evolve in response to the capabilities of new models.

Towards this end, we are releasing a web interface so that people can get an intuition for how the market, asset classes, and overall dynamics all work together. We note that Agora is very much under active development; we expect there to be bugs and intend to iterate rapidly. In the coming weeks, we will be releasing a Python SDK and MCP connector to allow LLMs to directly interact with the marketplace. We look forward to hearing your feedback and are excited to keep iterating!

Jason Rute (Sep 10 2025 at 02:21):

I visited the website. Maybe it is just me, but I don't understand this at all. I'm not even sure what to ask. Is this for real-world use? Is this for some benchmark, and if so, what will encourage people to use this? Is the "cash" real money or imaginary currency? Am I supposed to be seeing agents interacting, existing library theorems, or existing targets? Right now, I see nothing on the site.

Bas Spitters (Sep 10 2025 at 08:30):

There have been various proposals for proof markets, including the first one, "proof market", where someone proved a hard theorem using a bug in the Rocq kernel and received a bitcoin (before it was worth as much as it is today).
Here's a recent paper about the topic. @Riyaz Ahuja are you providing something similar?
http://cl-informatik.uibk.ac.at/cek/proofgold/fmbc2022.pdf

Jason Rute (Sep 10 2025 at 11:10):

@Bas Spitters brings up another good point. @Riyaz Ahuja You have to be very careful about exploits (especially if real money is involved). How would you feel if some agent used a known exploit in Lean to prove the theorem, either intentionally (I could probably write such code in a day to exploit your system using known exploits talked about here on Zulip), or accidentally (sometimes AI models find hacks to avoid actually solving the problem)? (If you look through a number of my posts over the last year, I talk a lot about various ways this can happen.)

Jason Rute (Sep 10 2025 at 11:11):

(Also, I talk about good ways to safeguard from these exploits.)

Bolton Bailey (Sep 10 2025 at 15:50):

I am getting an error on my submission

Does not compile
Build failed: Build failed with unknown error

Which is obviously not too helpful for figuring out what is wrong.

Riyaz Ahuja (Sep 10 2025 at 18:25):

Thanks for all the questions! I'll do my best to respond to each in order:

Is this for real-world use?

Yes, but as an eventual goal -- not in its current form. The purpose of this post is simply to start a discussion about this project and concept, and to provide a thread where we can consistently post updates as the project grows in both features and practical usability.

In its current form, Agora is meant as infrastructure for this kind of market interaction, so that we can evaluate and train LLMs on real-world library building alongside humans. Again, the goal of this project is to use this backend to have agents constantly contributing new content to the library, so that if you as a human have something you want proven or developed further -- say, the definition of a ring and some properties about it -- you can just submit that as an offer to the market and the LLMs will conjecture and prove it.
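To make the ring example concrete, a target submitted to the market might just be a Lean statement whose proof agents compete to supply. The sketch below is purely illustrative (it is not Agora's actual submission format) and uses the existing Mathlib lemma `neg_mul_neg`:

```lean
-- Hypothetical market target: a property about rings that an agent
-- could claim a bounty on by supplying the proof term below.
import Mathlib

theorem neg_mul_neg_example {R : Type*} [Ring R] (a b : R) :
    (-a) * (-b) = a * b := by
  exact neg_mul_neg a b
```

In practice the offer would carry only the statement, and the proof would arrive as a competing submission.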

Is it a benchmark? If so, why should someone use this?

It is not exactly what most would consider a “benchmark” per se. Unlike in theorem-proving evals, where we have a set of theorem statements we want proofs for and can easily check them with Lean, the task we're training and testing models on here is a more general sense of "mathematical exploration". More concretely, we want LLMs to generate Lean code, conjecture definitions and theorems, prove those theorems, and interact with the market system so that they learn how to interact with other agents. This is a much fuzzier domain, and there are plenty of ways to design benchmarks that heuristically or formally gauge what makes a good conjecture, proof, or definition. But Agora also lets you gauge your model's performance in a different way: you can measure agents against each other on proving a set of bounties (your "test set"), using each agent's profit as a proxy for performance on this general task. That is definitely different from what a benchmark aims to do, but it lets you gauge relative differences in model performance in an informative way.
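The profit-as-proxy idea above can be sketched in a few lines. Everything here is hypothetical (Agora's actual ledger and API may look nothing like this); the point is just that ranking agents reduces to summing signed cash flows per agent:

```python
# Sketch: ranking agents by realized profit over a fixed bounty set.
# The ledger format and agent names are illustrative assumptions.

def profits(ledger):
    """ledger: list of (agent, delta_cash) entries, e.g. bounty
    payouts (positive) and submission/trading costs (negative)."""
    totals = {}
    for agent, delta in ledger:
        totals[agent] = totals.get(agent, 0) + delta
    return totals

def rank(ledger):
    """Order agents by total profit, best first."""
    return sorted(profits(ledger).items(), key=lambda kv: -kv[1])

ledger = [
    ("prover-a", +50), ("prover-a", -10),  # won a bounty, paid fees
    ("prover-b", +15), ("prover-b", -5),
    ("prover-c", -20),                     # only costs, no wins
]
print(rank(ledger))
```

Under this toy ledger, `prover-a` ranks first and `prover-c` last; the "test set" is just whichever bounties the ledger covers.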

Is the cash real money or imaginary money?

It is imaginary.

Am I supposed to see agents interacting, existing theorems, or existing targets?

As of now, the agents in the demo above are simply other humans who have contributed to this library with different targets and theorems, so you may not see much activity for now. Any interaction you see depends on whether others are online at the same time, and you may see new targets and theorems popping up as others play with the system.
However, the goal of this project is to interleave both human and LLM agents into the market, so that you can then see and interact with dozens of different agents at the same time.

Riyaz Ahuja (Sep 10 2025 at 18:27):

For some more discussion on this topic: Christian Szegedy actually had a good discussion about the concept of math RL markets last week on TWIML, which might be a nice, longer-form explanation of the concept Agora is working on.

Riyaz Ahuja (Sep 10 2025 at 18:28):

Also, I’ve made a few quick changes to the project in response to some of the confusion here:

  • The documentation is now shown automatically on load, and should be a lot more explanatory.
  • The error messages are a lot more informative now, and there are some bug fixes for submission errors and search functionality.

Thanks for all the feedback!

Riyaz Ahuja (Sep 10 2025 at 18:29):


@Riyaz Ahuja are you providing something similar?

We’re providing something somewhat similar to this, but focused more on building an RL environment out of the market dynamics rather than a literal “proofs market” where people pay for proofs. However, you are both 100% right that we need to think about exploits — though in this case it’s more a matter of reward hacking than of getting bitcoin.

Currently we have an admittedly very flimsy layer of security beyond sandboxing submissions with Bubblewrap, but we’re working on integrating tools like lean4checker or SafeVerify before we continue further on the RL side of things. Additionally, we’re looking into whether it’s possible to incentivize agents to catch hacks and mistakes for large payoffs (i.e. an adversarial “reviewer” model) — though that idea is very much still hypothetical for now. But overall, you are correct that there’s a lot of work left to do to ensure better security and mitigate reward hacking.

Riyaz Ahuja (Sep 10 2025 at 18:30):

Bolton Bailey said:

I am getting an error on my submission

Does not compile
Build failed: Build failed with unknown error

Which is obviously not too helpful for figuring out what is wrong.

Sorry about that, better error messages and some bugfixes have been pushed in, so it should be more stable and descriptive now. Your submissions look like they’re all good as targets now.

Bolton Bailey (Sep 10 2025 at 19:07):

I am trying to submit, and the infoview says my proof is fine, but I am now getting the error:

file_typqe2n3
Does not compile
Build failed: Build failed with unknown error:
 error: build failed
✖ [2/6310] Running Library
error: build cycle detected:
  +Library.file_typqe2n3:leanArts
  +Library.file_typqe2n3:olean
  +Library:deps
  +Library:leanArts
  +Library:olean
  +Library.file_typqe2n3:deps
  agora_lib/Library:leanLib.leanArts
  +Library.file_typqe2n3:leanArts
✖ [6309/6310] Running Library.file_typqe2n3
error: ././././Library/file_typqe2n3.lean: bad import 'Library'
Some required builds logged failures:
- Library
- Library.file_typqe2n3
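A `build cycle detected` error like the one above typically means the submitted file was placed inside the `Library` package while also importing the package's root module `Library`: the root aggregates the file, and the file imports the root, so Lake reports a cycle (and `bad import 'Library'`). A minimal reproduction, with the module name illustrative:

```lean
-- Library/MySubmission.lean, built as part of the `Library` lean_lib.
-- Importing the library's own root module creates the cycle:
--   Library depends on Library.MySubmission, which imports Library.
import Library

theorem two_eq_two : 2 = 2 := rfl
```

Importing only the specific modules needed (e.g. individual `Library.*` or `Mathlib.*` files) rather than the root module avoids the cycle.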

Last updated: Dec 20 2025 at 21:32 UTC