Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Areas of math where synthetic generation of problems of v...


Andy Jiang (Oct 15 2024 at 00:32):

After watching David Silver's talk, where he discusses running RL-based proof search on a diverse set of problems of arbitrary difficulty, it seems to me that it would be interesting to compile a list of areas where it is easy to generate unlimited synthetic data that crosses the human math frontier, in the sense that it contains many easy questions, many hard but solvable questions, and many hard, unsolved ones. Maybe if you have access to huge amounts of compute you could also think about actually running RL on one of these (I don't know if it's feasible today, but it could be in the near future).

Here are a few that came to my mind

1) Diophantine/rational solutions to integer polynomials (a lot of human math progress has come from this; see the sketch after this message)

2) Decision problems in algebraic geometry with no easy algorithm to decide (is a variety birational to P^n? How bad are its singularities? etc.)

3) Infinitude of prime n-tuples with characteristics such that we heuristically expect infinitely many.

Some other examples that came to mind include the (ir)rationality of integrals of algebraic functions and existence/uniqueness for randomly generated PDEs, but I'm less sure about those. I'd be interested in other people's ideas and feedback on this.

I should say that I don't work in any of the above areas, so caveat lector.
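
For concreteness, here is a rough Python sketch of what a generator for (1) might look like. Everything here (the number of terms, the degree and coefficient bounds, the search box) is an arbitrary choice of mine, and the brute-force search only supplies weak labels, not answers:

```python
import itertools
import random

def random_polynomial(n_vars=2, max_degree=3, coeff_bound=10, n_terms=4):
    """Sample a random integer polynomial as (coeff, exponent-tuple) terms."""
    return [(random.randint(-coeff_bound, coeff_bound),
             tuple(random.randint(0, max_degree) for _ in range(n_vars)))
            for _ in range(n_terms)]

def evaluate(poly, point):
    """Evaluate the polynomial at an integer point."""
    total = 0
    for coeff, exps in poly:
        term = coeff
        for x, e in zip(point, exps):
            term *= x ** e
        total += term
    return total

def small_solutions(poly, box=20):
    """Brute-force integer zeros with all coordinates in [-box, box].
    Finding none proves nothing: deciding solvability in general is
    exactly the hard (undecidable) part, which is what makes the
    difficulty of these instances span the whole range."""
    n_vars = len(poly[0][1])
    rng = range(-box, box + 1)
    return [p for p in itertools.product(rng, repeat=n_vars)
            if evaluate(poly, p) == 0]

poly = random_polynomial()
print(poly)
print(small_solutions(poly))
```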

Sid (Oct 15 2024 at 07:06):

Random 3-SAT at the phase transition threshold, though I'm not sure if it has much relevance to improving proof search.
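
For anyone who wants to play with this, a minimal sketch of such a generator (the 4.27 clause-to-variable ratio is the empirical hardness peak for random 3-SAT; the DIMACS-style encoding is just a convenient choice of mine):

```python
import random

def random_3sat(n_vars, ratio=4.27, seed=None):
    """Sample a random 3-SAT instance near the satisfiability phase
    transition (clause/variable ratio around 4.27, where instances are
    empirically hardest).  Clauses are lists of signed ints in DIMACS
    style: -3 means "not x3"."""
    rng = random.Random(seed)
    clauses = []
    for _ in range(round(ratio * n_vars)):
        vars3 = rng.sample(range(1, n_vars + 1), 3)
        clauses.append([v if rng.random() < 0.5 else -v for v in vars3])
    return clauses

print(random_3sat(50, seed=0)[:5])
```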

Kevin Buzzard (Oct 15 2024 at 07:15):

Thinking about how Diophantine equations behave under changes of parameters reminds me a bit of the current Equational project. In the 80s there was a push to find all integer solutions to Y^2 = X^3 + n for n an integer with |n| ≤ 1000. This is an interesting set of equations because for a fixed n there are only finitely many solutions in X, Y, but there can be random large solutions, e.g. already for n = 17 there is a solution with Y a six-digit number. There were a number of techniques known, but each only applied to a subset of the n's (and solving Diophantine equations is in general an undecidable problem, so this is expected).
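
(A quick illustration of the point, not of the actual 80s techniques: a naive box search in Python does find the surprisingly large point on Y^2 = X^3 + 17, but only because we happen to know how far to look; the search bound here is an arbitrary choice.)

```python
from math import isqrt

def mordell_solutions(n, x_bound=10_000):
    """Naive search for integer points on Y^2 = X^3 + n with |X| <= x_bound.
    No completeness guarantee at all -- the n-specific techniques were
    needed precisely because solutions can sit far outside any naive box."""
    sols = []
    for x in range(-x_bound, x_bound + 1):
        rhs = x ** 3 + n
        if rhs < 0:
            continue
        y = isqrt(rhs)
        if y * y == rhs:
            sols.append((x, y))
    return sols

# picks up the surprisingly large point (X, Y) = (5234, 378661) for n = 17
print([s for s in mordell_solutions(17) if s[1] > 100])
```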

Andy Jiang (Oct 15 2024 at 12:44):

Sid said:

Random 3-SAT at the phase transition threshold, though I'm not sure if it has much relevance to improving proof search.

Sounds interesting! I feel like in this case the RL might learn something about proving certain 3-SAT problems, though it's so far outside what I think about that I personally wouldn't be able to understand what it had learned.

fzyzcjy (Oct 19 2024 at 06:06):

Just came across this paper: https://arxiv.org/abs/2410.09988. They seem to claim that the problems from a graduate course on asymptotic methods can be generated algorithmically. That also reminds me of symbolic integration.
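
One standard trick for symbolic integration data (used, e.g., in Lample and Charton's "Deep Learning for Symbolic Mathematics") is to generate problems backwards: differentiate a random closed-form expression, so every integrand ships with a known antiderivative as its label. A rough sympy sketch, where the atom set and combination rules are arbitrary choices of mine:

```python
import random
import sympy as sp

x = sp.symbols('x')
ATOMS = [x, x**2, sp.sqrt(x), sp.exp(x), sp.log(x), sp.sin(x)]

def random_expression(depth=2):
    """Build a random closed-form expression by combining atoms."""
    if depth == 0:
        return random.randint(1, 5) * random.choice(ATOMS)
    a = random_expression(depth - 1)
    b = random_expression(depth - 1)
    return random.choice([a + b, a * b, a / (b + 1)])

# Backward generation: differentiate, so the label comes for free.
f = random_expression()
integrand = sp.simplify(sp.diff(f, x))
print("integrate:", integrand)
print("one antiderivative:", f)
```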

Siddhartha Gadgil (Oct 19 2024 at 10:51):

A couple of other examples (I believe any area has these):

  • Given a random knot, what is its slice genus?
  • Given a random presentation of a group, is it trivial? Finite? Torsion-free? Residually finite?
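
A minimal sketch of a generator for the second kind, with an encoding of my own choosing (positive integers are generators, negative ones their inverses):

```python
import random

def random_relator(n_gens, length):
    """A random freely reduced word in generators 1..n_gens, with -g
    standing for the inverse of g; immediate cancellations are skipped."""
    letters = [s * g for g in range(1, n_gens + 1) for s in (1, -1)]
    word = []
    while len(word) < length:
        g = random.choice(letters)
        if word and g == -word[-1]:
            continue  # g would cancel the previous letter
        word.append(g)
    return word

def random_presentation(n_gens=2, n_rels=2, rel_length=8):
    """A random presentation <x_1,...,x_n | r_1,...,r_k>.  Triviality,
    finiteness, etc. of the resulting group are undecidable in general,
    so instance difficulty genuinely runs past the human frontier."""
    return [random_relator(n_gens, rel_length) for _ in range(n_rels)]

print(random_presentation())
```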

Jason Rute (Oct 19 2024 at 11:03):

One has to be careful. Unless you are specifically interested in a particular distribution of “random” knots/groups/matrices/etc., you could draw the wrong mathematical conclusions from your data. For example, almost all finite groups are 2-groups. Some poor ML-for-math papers draw really weird conclusions that are overfit to their sampling distribution. It is best to work with many distributions, including pathological cases. The DeepMind knot theory Nature paper had a good discussion of this.

Jason Rute (Oct 19 2024 at 11:09):

There is an ongoing seminar at Harvard CMSA about ML for Math, and I think generation of math data is one of their main topics. You could check out their Zulip at https://mml2024.zulipchat.com/.

Terence Tao (Oct 20 2024 at 16:10):

Certainly with the equational theories project we are finding that 0.01% of the implications are holding the vast majority of the mathematically interesting features! So "randomly generated" should definitely only be one component of a diverse data set, not the only component. (Of course, the other components are far harder to find in bulk, more or less by definition...)

On a somewhat parallel note: we exhaustively iterated over all magmas of size up to 4 to generate counterexamples, which worked pretty well but still left a fair chunk of unresolved implications. Iterating over all magmas of size 5 was computationally prohibitive, and using randomly sampled magmas of this size produced almost no new counterexamples. However, using randomly sampled cancellative magmas of size 5 ended up being more productive, as this extra constraint significantly changed the distribution of which laws the magmas obeyed.
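
For anyone who wants to reproduce this kind of sampling: a finite magma is cancellative exactly when its multiplication table is a Latin square (every row and column is a permutation), so the sampling step can be sketched as below. This is my own rough reconstruction, not the project's actual code; the restart-based sampler is not exactly uniform, and the commutativity check is just a stand-in for whatever law one is testing.

```python
import itertools
import random

def random_cancellative_magma(n):
    """Sample the multiplication table of a random cancellative magma of
    size n.  For a finite magma, cancellativity means the table is a
    Latin square, so we fill rows with random compatible permutations
    and restart on dead ends (fine for n = 5)."""
    while True:
        rows = []
        for _ in range(n):
            perms = list(itertools.permutations(range(n)))
            random.shuffle(perms)
            for p in perms:
                if all(p[j] != r[j] for r in rows for j in range(n)):
                    rows.append(list(p))
                    break
            else:
                break  # no permutation fits this row: restart the table
        if len(rows) == n:
            return rows

def satisfies(table, law):
    """Check a two-variable equational law at every pair of elements."""
    n = len(table)
    return all(law(table, a, b) for a in range(n) for b in range(n))

commutative = lambda t, a, b: t[a][b] == t[b][a]
table = random_cancellative_magma(5)
print(table, satisfies(table, commutative))
```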

Andy Jiang (Oct 21 2024 at 23:51):

Terence Tao said:

Certainly with the equational theories project we are finding that 0.01% of the implications are holding the vast majority of the mathematically interesting features! So "randomly generated" should definitely only be one component of a diverse data set, not the only component. (Of course, the other components are far harder to find in bulk, more or less by definition...)

Thanks for the comment! I don't know if this is unrealistic, but I was hoping that running reinforcement learning on a dataset similar to the one you are all working on would first dispatch the "easy" 99.99% of problems and then spend the rest of the time trying to learn from the remainder. Of course, I don't know whether it works this way in practice, or whether the reinforcement learning will just get stuck after the easy problems if the increase in difficulty at the end is too steep.

Btw this project looks really cool, congrats on getting the number of unsolved questions down to just a handful!


Last updated: May 02 2025 at 03:31 UTC