Zulip Chat Archive
Stream: Machine Learning for Theorem Proving
Topic: Piecework jobs creating Lean 4 training corpora for LLMs
Lars Ericson (Feb 27 2025 at 03:14):
I have been contacted by two different efforts to do gig work creating training examples for LLMs to learn how to translate English mathematical problem descriptions into Lean 4 proofs. I am relating the details here in case you are interested in the work, or in the questions I had concerning the work arrangements.
Lars Ericson (Feb 27 2025 at 03:14):
The first pitch was from a company called Snorkel.ai:
I help with recruiting here at Snorkel, a data development company spun out of the Stanford AI lab. Our research has focused on approaches for labeling and curating the data used to “teach” and evaluate AI models (e.g. ChatGPT), including large-scale open-source benchmarks like Stanford HELM, with the intention of evaluating them for truthfulness, bias, and other error modes.
To do this, we're launching an exciting project that ABSOLUTELY pushes the frontier of reasoning of modern LLMs. Specifically, we are laser-focused on advanced mathematical problem solving at the graduate level. Your task will be to craft complex mathematical “Stumpers” that AI fails to solve or correctly reason through!
Compensation is multi-faceted, including:
- $50/accepted Stumper
- $20/peer review of fellow Expert’s Stumpers
- Special Leaderboard - given the critical nature of this project, we’ve introduced (for a short time) accelerators for our top contributors! We have dozens of experts who are earning $200+ per accepted Stumper!
We're moving fast, starting as soon as Monday! Please do visit https://snorkel.ai/ and if you're (potentially) interested, I only need the info below to get you engaged with the team!
- First & Last name
- Email address
The Snorkel team will reach out directly with specific details AND you will receive onboarding info from HireArt (our agency of record), https://www.hireart.com. The timeline for this project tentatively goes from now till 2/17. This will be a highly collaborative environment with daily Office Hours, direct communication with Snorkelers, frequent feedback on submissions, and collaboration with fellow Mathematics Experts.
I asked
What is an "agency of record"? What is my relationship with "HireArt"? How probabilistic is the payment? What do you expect to be the effective hourly rate for this work?
The response was
We use HireArt as our agency of record in order to track submission progress and pay you. You'll be working directly with Snorkel on the tasks, but your paycheck will come via direct deposit from HireArt.
An equivalent hourly rate really depends on your level of expertise in the field. If you generate a significant number of accepted stumpers, the equivalent hourly rate can be quite high.
I then had a stream of other questions, for which there was no response, and I didn't pursue the opportunity:
What LinkedIn employment category would this be with respect to the Agency of Record? Freelance?
Can people who participate represent this as a job on their LinkedIn? Are participants asked to sign a disclosure representing that they are test subjects in a human psychology experiment?
I'm sorry if these questions sound a little cranky but this is gig economy/Mechanical Turk stuff and it is fair to be clear about what the labor conditions are. You are asking skilled people to do Mechanical Turk on fancy stuff but it is still Mechanical Turk in the end.
So the questions are about (a) do people really get paid a material amount for their work, versus, for example, crowdsourcing challenges where the payment is nominal and the payoff is really reputational, and (b) is this something that people can brag about as legitimate work, or are people really classified as test subjects from the granting agency's point of view? That is, is it more like very sketchy post-doc work, or more like volunteering as a test subject for a medical experiment?
Also, what level of association with Stanford can people claim, and what level of actual interaction with Stanford people can one expect?
If I like the experience/payoff I can socialize with my community of experienced mathematicians, but I have to determine whether it is more of a grownup experience that people can be proud of, or lunch money for undergrads.
Again, I apologize for the tone of these questions, but I see post-doc skilled mathematicians struggling for funds and opportunities. So this sounds like it could be a good one or a fake-out and so I am trying to differentiate.
Lars Ericson (Feb 27 2025 at 03:14):
The second opportunity came in an email:
I’m a Solution Engineer at Kili Technology, and I’m reaching out about an open-source initiative we’re working on. We’re assembling a team of individuals skilled in Lean 4 / mathlib and advanced mathematics to help us move forward.
The project’s goal is to train an AI capable of advanced mathematical reasoning. We’ve already made great progress and are now entering a second phase. This step involves building a model that translates natural language math problems into formal Lean 4 representations.
Currently, we’re focused on math problems at AMC/AIME difficulty levels, with plans to scale up to AMO complexity. Your expertise, particularly if you have a background in math competitions, would be a fantastic addition to the team!
This is a paid opportunity and even dedicating a few hours a day could make a huge impact. Plus, all results (code and data) will be open-source, contributing to the broader community.
The email directed me to a form with 3 sample theorems to prove. I ran these iteratively through DeepSeek R1 until I could get a proof that checked on Lean 4 web. It took 3 or 4 iterations per theorem for R1 to get it right, about half an hour of my time in total. I think these were AMC-level problems (a hypothetical sketch of that flavor follows after the quoted reply below). The result was acceptable and I was contacted again:
I am coming back to you for the Lean 4 project. Thanks for your work on the expertise assessment; we think you’d be a great fit for this initiative to advance mathematical reasoning with AI.
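For a sense of the flavor, here is a hypothetical AMC-style formalization and proof, roughly the kind of thing a few R1 iterations produced for me. It is my own illustration, not one of the actual assessment problems:

```lean
import Mathlib

-- Hypothetical AMC-flavored problem (my own illustration, not an assessment problem):
-- "If x is a real root of x^2 - 5x + 6 = 0, then x = 2 or x = 3."
theorem sample_amc (x : ℝ) (hx : x ^ 2 - 5 * x + 6 = 0) : x = 2 ∨ x = 3 := by
  -- Factor the quadratic using the hypothesis.
  have h : (x - 2) * (x - 3) = 0 := by linear_combination hx
  -- A product of reals vanishes only if one of the factors does.
  rcases mul_eq_zero.mp h with h2 | h3
  · left; linarith
  · right; linarith
```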
I was invited to a meeting with about 8 other Leaners. In the meeting we learned that they were trying to create a corpus of 3000 examples by summer. Participating applied computational logicians would be paid $15 for proofs of AMC theorems and $45 for AIME theorems. Payment would be through a gig-working platform called Upwork. The sponsor was a non-profit called Project Numina. The corpus would be open-sourced as a benchmark after Numina had a first crack at it. Numina appears to be a French organization, associated with Mistral AI for the development of a model called Mathstral, and with the NuminaMath dataset. Numina received a €3 million grant from XTX Markets, an algorithmic trading firm.
The specific details of Upwork are that you register with them. If you want to advertise yourself on the platform, they ask for a subscription of $240/year. They take 10% of your gross pay from clients. To compensate for the 10%, if Numina via Kili Technology via Upwork pays you $15, they actually pay Upwork $16.67; Upwork takes its 10%, and you get your $15. You don't actually have to advertise yourself, so you don't have to pay the $240.
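In other words (my own back-of-the-envelope check, assuming the 10% comes off the gross amount the client pays):

$$ \text{gross} = \frac{\text{net}}{1 - \text{fee}} = \frac{15}{1 - 0.10} \approx 16.67 $$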
Two other questions regarding the employment arrangement:
- For US contributors, the question was whether this is foreign income. Upwork operates in a lot of countries, so the feeling was that anybody who does this job and has an Upwork entity in their country is working for the Upwork of that country, not indirectly for a French hedge fund. (In the US, foreign income has to be declared on tax forms, and in certain cases technical work has to receive an export license; it can get pretty complicated.)
- Others also asked how this job could be represented on LinkedIn. Are freelance compensated applied computational logicians effectively employees of Upwork, or of Kili Technology, or of Project Numina, or of XTX Markets? The Kili rep said he thought maybe people could associate themselves with Project Numina, but I didn't see a definitive answer come back. Most participants just cracked on with the work itself. The opportunity to get paid to do something with Lean was good enough; cash compensation is more important than the reputational reward of a specific affiliation.
I won't belabor the corporate provenance except to note that OpenAI is also a non-profit, which is valued in the eyes of US President Musk at about USD 100B. So there is serious money involved. And also $15 per crack, if you satisfy the requirements. For myself, I am not really that good at math. I could convince R1 to do the work for me. To make my $15 on the regular, I would have done the same thing. They don't want this. They feel that feeding Lean 4.15-checked LLM-generated proofs into an LLM for training might give it bad habits, as opposed to feeding it Lean 4.15-checked human-written proofs. Either way it is checked, but the human version has a certain je ne sais quoi that the LLM one does not. I don't know. That's just the way they like it. They also don't like `lemma`s or `def`s or any sleight of hand like `linarith`. In detail, they ask (a sketch following these rules appears after the list):
- Do not forget to comment, even for short AMC formalizations (do not put the comment at the end; put it right after the theorem if it is just a few lines).
- Avoid using computationally heavy methods (e.g., nlinarith, decide, r-case).
- Avoid using `lemma` and `def` (forbidden except in edge cases).
- Always put your formalization in the code format (the little code icon on the top right of the formalization box).
- Do not forget to import Mathlib.
- Do not use an LLM to do the formalization. It is detected and not good; I won't be able to keep contributors using formalizations done by an LLM. Of course you can use them to check documentation or to help you in some parts. But make sure you check that the formalization compiles and makes sense.
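To make those rules concrete, here is a minimal sketch of what I take a compliant submission to look like. The problem, the comment placement, and the proof style are my own guesses at the intended format, not an official template:

```lean
import Mathlib

theorem even_add_even (a b : ℤ) (ha : Even a) (hb : Even b) : Even (a + b) := by
  -- Natural-language problem (comment placed right after the theorem, per the rules above):
  -- "Show that the sum of two even integers is even."
  -- No lemma/def, no LLM, no heavy closing tactics: just an explicit witness.
  obtain ⟨k, hk⟩ := ha
  obtain ⟨m, hm⟩ := hb
  exact ⟨k + m, by rw [hk, hm]; ring⟩
```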
That last rule, about not using an LLM, knocks me out. Without an LLM, I don't think I could achieve a decent hourly rate at $15 a pop. I'm sure there are plenty of people on Zulip who could, and for whom this opportunity could be a real help in making ends meet.
As with the first effort I described, there is also an opportunity to get paid to referee:
As we scale up and onboard more contributors, I’m assembling a review team to help ensure the accuracy of our formalizations. The review task involves quickly checking a random sample of formalized problems to verify their mathematical correctness. It’s a lightweight task focused solely on confirming that the proofs are valid. This task will be compensated separately at a rate of $30 per 10 problems reviewed.
On the one hand I am glad to see Lean-related industrial work open up. On the other hand, it feels like a lot of the people who are capable of doing this work may also be suffering from being academically "disestablished", a word I like to use to contrast with the more typical usage "academically established", that is, having a real, recognized, and compensated place in the academic world. This is not that.
Last updated: May 02 2025 at 03:31 UTC