Zulip Chat Archive

Stream: general

Topic: AI and theorem proving


Jason Rute (May 19 2019 at 21:53):

First, since I’ve only posted a few times in here, I thought I’d introduce myself. I’m a former mathematician turned data scientist. I used to work on computability theory and algorithmic randomness. Jeremy Avigad was my advisor, which makes me an academic sibling to a couple of you. Before graduate school, I worked on the Flyspeck project for a bit, but then lost interest in interactive theorem proving. I’ve become more interested in it again with Lean and with the many advancements in AI and theorem proving.

I’m especially interested in AI for theorem proving and I closely follow the field (as an outsider). I’m always happy to chat about it. Since there have been many discussions here on the topic, I thought I’d share some resources and thoughts. (Disclaimer: I am only an interested outsider. I am not involved in any way with this research.)

Jason Rute (May 19 2019 at 21:53):

This talk by Joseph Urban is a great introduction to the state of the field. For more depth, here are two sets of lectures on the subject by Cezary Kaliszyk. (The first is just three lectures and has videos.) Indeed, these two people and their respective labs seem to be leading the charge in AI for theorem proving.

Recently, Berkeley, OpenAI, Google AI, and DeepMind have also been getting into the field, which is really exciting. Berkeley and OpenAI released GamePad, a learning environment for training AI agents in Coq. Google AI just released a similar learning environment, HOList, for HOL Light (see slides at AITP 2019). Both of these projects provide datasets (e.g. the Flyspeck library) and some weak baselines. As for DeepMind, Demis Hassabis recently mentioned “theorem proving” as one of the science projects they are working on right now (search “theorem” in this video/transcript).

As for tactic-based ITPs, besides GamePad and HOList, there is TacticToe, which is probably the most useful of the tools so far. It fills in tactic proofs for HOL4 (paper, slides/video at AITP 2018). Also, there appears to be PSL for Isabelle, but I don't know much about it.

In another channel someone asked about using AlphaZero (or AlphaGo Zero) methods for theorem proving. Despite what most people think, the AlphaZero algorithm isn’t so much about playing games as training an agent to search through a tree. Historically, tree search algorithms in games used min-max with alpha-beta pruning. That is, they use a depth-first search to a certain level of depth, hand-written heuristics, heavy pruning, and other optimizations, very similar to the tree search methods in ATPs. In contrast, Monte Carlo Tree Search (MCTS) explores paths through the tree “at random” (formally it uses UCT and random rollouts), avoiding the need for hand-written heuristics. The AlphaZero variant of MCTS eliminates the need to explore to the end of the tree, instead using policy and value heuristics learned by a neural network to guide the search towards nodes in the tree that look “valuable”. Eliminating the need to explore to the end makes AlphaZero MCTS useful in other tree and graph search problems with sparse terminal states, such as theorem proving. Indeed, most theorem proving settings can be thought of as tree search problems. The nodes in the tree are partially constructed proofs and the edges are manipulations which lead to a new partial proof. For example, in tactic-based ITPs, the nodes correspond to sets of open goals and the edges correspond to tactics which can be applied to the goals. In other settings, the nodes are partially built proof trees or tableaus and the edges correspond to inference rules (in, say, a cut-free sequent calculus). Many papers in AI and theorem proving are starting to experiment with MCTS. For example, the TacticToe paper (the most recent version) uses MCTS, but with hand-engineered heuristics.
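
To make the search analogy concrete, here is a rough Python sketch (my own toy code, not from any of these systems) of the AlphaZero-style selection/expansion/backup loop applied to a tactic prover. The names `policy_value`, `apply_tactic`, and the `state.is_proved()` check are placeholders for whatever interface a real prover would expose; the point is only the PUCT selection rule and the fact that a learned value estimate stands in for rollouts to terminal states.

```python
import math

class Node:
    """One node in the search tree: a partially constructed proof state."""
    def __init__(self, state, prior):
        self.state = state        # e.g. the current list of open goals
        self.prior = prior        # P(s, a) from the policy network
        self.children = {}        # tactic -> Node
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

    def puct_score(self, parent_visits, c_puct=1.5):
        # Exploit the running value estimate; explore where the policy
        # prior is high and the visit count is still low.
        explore = c_puct * self.prior * math.sqrt(parent_visits) / (1 + self.visits)
        return self.value() + explore

def mcts_step(root, policy_value, apply_tactic):
    # 1. Selection: walk down the tree by maximal PUCT score.
    path, node = [root], root
    while node.children:
        parent = node
        node = max(parent.children.values(),
                   key=lambda c: c.puct_score(parent.visits))
        path.append(node)

    # 2. Expansion + evaluation: no rollout to the end of the tree; the
    #    learned value network substitutes for it (the key AlphaZero idea).
    if node.state.is_proved():
        value = 1.0
    else:
        priors, value = policy_value(node.state)      # learned policy/value
        for tactic, p in priors.items():
            next_state = apply_tactic(node.state, tactic)
            if next_state is not None:                # tactic applied successfully
                node.children[tactic] = Node(next_state, p)

    # 3. Backup: propagate the value estimate along the selected path.
    for n in path:
        n.visits += 1
        n.value_sum += value
```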

Another important trend in theorem proving is reinforcement learning (RL), that is, learning by exploring the space instead of using labeled data. This is important since, as many have noted, there isn’t a lot of human-labeled data to use in theorem proving. Recently, there have been some successful experiments in this area: rlCoP (slides/video at AITP 2018), an RL-trained QBF solver, and an RL-trained intuitionistic theorem prover. These results are really encouraging, and many of these systems use MCTS and other ideas borrowed from AlphaZero. Nonetheless, the systems are still quite limited.

There are also applications of AI/ML to SMT solvers, SAT solvers, QBF solvers, resolution theorem provers, connection-style theorem provers, and a variety of different logics. While there aren’t a lot of researchers directly working on theorem proving, there are a number of related research topics including program synthesis, question answering and reasoning in natural language, theorem proving in large relational knowledge databases, and neural-symbolic learning.

Jason Rute (May 19 2019 at 21:54):

I’d also like to share some major trends and challenges I’ve seen in the area of AI for theorem proving.

Suppose an agent wants to prove a theorem or fill in a gap in a proof. It could prove it from scratch, but it is much easier to use previously proved facts. However, if we have a long list of previously proved facts, our combinatorial search space will grow considerably. This is called the “premise selection problem”. (For some tactics one also has to specify terms as arguments. This again is a sort of premise selection problem.)
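
A toy example of what “premise selection as ranking” might look like (my own sketch, not any of the systems above): score each previously proved fact against the goal and only pass the top k to the prover, so the search space stays manageable. A bag-of-symbols cosine similarity is roughly the kind of cheap baseline that k-nearest-neighbor premise selectors start from.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two symbol-count vectors."""
    dot = sum(a[s] * b[s] for s in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def select_premises(goal_symbols, library, k=16):
    """goal_symbols: multiset of constants/operators in the goal statement.
    library: list of (premise_name, premise_symbols) pairs.
    Returns the k premises whose symbol profile best matches the goal."""
    g = Counter(goal_symbols)
    scored = [(cosine(g, Counter(syms)), name) for name, syms in library]
    return [name for _, name in sorted(scored, reverse=True)[:k]]
```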

More powerfully, one would like an agent which creates its own definitions and lemmas. (This is akin to the difference between a proof system with and without cut, the former allowing arbitrary lemmas.) All the current benchmarks I am aware of use theorem proving libraries where the intermediate lemmas are already given (or can be chosen from a large list of previously proved results). I know of no system which has the ability to make its own definitions or lemmas. (Although in the related field of program synthesis there is work on automatic creation of reusable subroutines.)

Another major trend is feature extraction. In order to do machine learning on formulas, one has to extract the useful information from the formula. One time-honored tradition is manual “feature engineering”. While neural networks are really good at automatically discovering features, many of the most successful current systems (e.g. TacticToe) still don’t use neural networks, since neural networks are too slow. Instead, they use hand-engineered features in conjunction with faster machine learning methods like boosted trees or nearest-neighbor search. Nonetheless, I think in the end neural networks will win. It is already established that neural networks allow one to significantly decrease the search space in many theorem proving tasks, and as we build more ambitious systems this exponential reduction in search space will outpace the linear slowdown in feature computation. While tools like SMT solvers are massively optimized, they only work well in narrow domains. For research math problems, we need an agent which can learn from the cutting edge without hand engineering. And we would be willing to wait an hour to solve a problem that would take a week or more for a human to solve.

Even if one uses neural networks to extract features from terms (or formulas, goals, etc.), there are a number of design choices. While terms (and formulas, etc.) are just strings of symbols, neural networks which are designed for strings of symbols (1D CNNs and LSTMs) don’t seem to work as well. The reason is that terms have a deep recursive tree-based structure. There has been more success with recursive tree neural networks, which better capture this recursive structure. However, formulas aren’t really well represented as trees either, because they have bound variables with arbitrary names. Instead, it seems to be best to use de Bruijn-style representations where each bound variable is connected to its associated quantifier. Graph neural networks are amenable to this representation of a term (or formula, etc.) as a graph, and seem to do empirically better. However, graph neural networks are still a relatively new concept and more experimentation is likely needed.
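
As a small illustration of the representation issue (again my own sketch, not any particular system’s encoding), here is one way to turn a term into a graph: keep the ordinary parent-child edges of the syntax tree, but erase bound-variable names and add an edge from each occurrence back to its binder, in the spirit of de Bruijn indices. Alpha-equivalent terms then map to the same graph, which is what a graph neural network can exploit.

```python
# A tiny term AST, e.g. "forall x, f x = x":
#   ("forall", "x", ("eq", ("app", ("const", "f"), ("var", "x")), ("var", "x")))

def term_to_graph(term):
    """Encode a term as (nodes, edges) for a graph neural network.
    nodes: list of labels; edges: (src, dst, kind) triples.  Bound-variable
    occurrences are labeled 'bvar' and get a 'binds' edge to their binder."""
    nodes, edges = [], []

    def walk(t, binders):
        idx = len(nodes)
        head = t[0]
        if head == "var" and t[1] in binders:
            nodes.append("bvar")                        # variable name erased
            edges.append((idx, binders[t[1]], "binds")) # edge back to the binder
        elif head == "forall":
            nodes.append("forall")
            body = walk(t[2], {**binders, t[1]: idx})   # bind t[1] to this node
            edges.append((idx, body, "child"))
        else:
            # constants and free variables keep their name; applications etc.
            # keep the head symbol as the node label
            nodes.append(t[1] if head in ("const", "var") else head)
            for sub in t[1:]:
                if isinstance(sub, tuple):
                    edges.append((idx, walk(sub, binders), "child"))
        return idx

    walk(term, {})
    return nodes, edges

nodes, edges = term_to_graph(
    ("forall", "x", ("eq", ("app", ("const", "f"), ("var", "x")), ("var", "x"))))
# nodes == ['forall', 'eq', 'app', 'f', 'bvar', 'bvar']; both 'bvar' nodes have a
# 'binds' edge pointing at node 0, so renaming x changes nothing.
```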

Despite all the areas for improvement, I am very hopeful. I have no doubt that theorem proving is difficult, but incremental progress can have tangible value: Speeding up SMT and SAT solvers. Learning to quickly prove basic facts in new domains. Improving the tactics in ITPs to fill in the “obvious parts”. Improving hardware and software formalization. Speeding up compilers (even the Lean compiler/type-checker). This will likely come before automatically proving Math Olympiad problems, auto-formalizing arXiv, having a computer develop new areas of mathematics, or solving really hard research problems.

Another reason to be hopeful is that the tools being built in AI are amazingly general. For example, two of the important tools used in AlphaStar (DeepMind’s recent StarCraft II AI) are transformers and LSTMs. Both of these were developed for sequence-to-sequence learning such as machine translation. Nonetheless, they also turned out to be very applicable to playing real-time strategy games and a host of other problems such as creating computer generated images. This reusability of tools is a big win. Researchers don’t need to be working on theorem proving in particular to develop ideas, such as MCTS and graph neural networks, which can be repurposed for theorem proving. Indeed, the problems plaguing AI and theorem proving, such as an infinite search space, are also problems in many other areas of AI. (Just search for “the bigger picture” in this DeepMind talk). Moreover, I think many of the advanced ideas in AI haven’t even been tried in theorem proving (neither the HOList nor GamePad baselines use graph neural networks, for example), which is why I am very excited that the big players (e.g. DeepMind) are joining the field since they have a lot of experience and have shown a willingness to tackle ambitious projects. Now we just need to convince them that this is worth the payoff!

And finally it seems that, by using reinforcement learning and lots of computer power, we can quickly scale up from ok progress on an AI task to great progress. Thinking big (maybe too big), if we can teach an AI agent how to make definitions and lemmas, how to use curiosity to explore a new mathematical subject, and how to effectively leverage previous proved knowledge to build new knowledge, then we might—just might—get really impressive results…

Jason Rute (May 19 2019 at 21:55):

(Sorry for the long posts. As Jeremy can attest to, I haven't learned the skill of short writing.)

Jason Rute (May 19 2019 at 21:56):

(Also, I know many of these resources have been shared many times. I thought there was value in putting them together in the same place.)

Kevin Buzzard (May 19 2019 at 21:57):

Thanks a lot for these -- I find myself a few times a week on an exercise machine and it's great to have stuff to watch there -- I still haven't figured out how to do Lean whilst pedalling energetically.

Kevin Buzzard (May 19 2019 at 22:01):

Making mathematical definitions is an art -- it's analogous to making new tools. Getting computers to prove theorems using a given set of tools sounds like a much easier problem, and even that seems very difficult to do at the minute.

Kevin Buzzard (May 19 2019 at 22:02):

Brian Conrad is fond of challenging me to get my computer to invent / discover a proof that there are infinitely many primes, and even given the fact that the integers are a unique factorization domain I think a computer would find this hard to do.

Jason Rute (May 19 2019 at 22:15):

I agree discovering definitions and "interesting" results will be one of the last things we figure out how to do well. As far as making new tools, I think to scale automatic theorem proving, we will have to figure out how to build new tactics or make simple lemmas that factor out common proof steps. This is related to other problems in AI such as hierarchical reinforcement learning where one wants to learn new high level actions made up of low level actions. Theorem proving is interesting in that the ability to package small steps into one big step is built into the theorem proving framework already, but it is not clear how one would take advantage of it.

Jason Rute (May 19 2019 at 22:24):

But speaking of hierarchical RL, the OpenAI Five Dota AI is an interesting example. They thought they would have to solve hierarchical reinforcement learning before they could solve Dota, but it turns out the existing tools (namely LSTMs) were already capable of learning complicated strategies which are executed over a very long time period. So maybe lemmas and hierarchical RL are not as important as I think.

Moses Schönfinkel (May 19 2019 at 22:36):

And yet it didn't even learn the neutral-camp pull into lane! ;) Thanks for the write-up.

Jesse Michael Han (May 19 2019 at 23:21):

This is a nice summary. It would be nice to compare the rlCoP-style approach of guiding tableaux/superposition calculus proof search with the tidy_on_steroids approach of TacticToe/HOList, but it doesn't seem like they've been implemented for the same systems yet.

Jason Rute (May 20 2019 at 04:54):

The short answer to your question is that the main ideas of the systems are similar. Connection tableau or tactic-goals, it is basically the same idea: we progressively modify the proof goal until it is solved. The devil is in the details. These details include generating human data by recording proofs and hooking up the theorem prover to a machine learning library. Even after that, there are a number of challenges with learning tactics, since they take arguments and that can greatly expand your search space. (A rough sketch of the search loop these systems have in common follows the comparison below.)

rlCoP

  • Logic: Classical FOL
  • Search space:
    • States: Connection tableau
    • Actions: Inference rules (cut-free so limited set of valid rules to choose from)
  • Search algorithm:
    • AlphaZero style Monte Carlo tree search (with learned policy/value)
  • Machine learning method and training algorithm:
    • Hand-engineered features fed into XGBoost
    • AlphaZero style reinforcement learning
    • It was given first order statements to solve from a list of problems. If it could solve them, then it was rewarded, else punished. (This is not exactly true, especially for the policy.)
  • Dataset: Miz40
    • Only the statements, not the proofs.
    • It used RL to learn proofs.
  • What prevents this from being applied to Lean?
    • If used as a first order theorem prover, it would have to be ported from OCaml to Lean. (The most difficult would be the bindings to a machine learning tool to do XGBoost.)
    • If used as a tactic theorem prover, the training method is very extensible to tactics and goals. The main differences are:
      • The search space with tactics isn’t necessarily cut-free, so it can be quite a bit larger, but it might be possible to fix a smaller subset of tactics (and tactic arguments).
      • One needs to redo the hand-engineered features (and obviously the RL training).
      • One needs a good diverse set of training problems of varying difficulty. (Just theorem statements, not proofs.)

TacticToe

  • System: HOL4 tactics
  • Search space:
    • States: Lists of open goals (plus meta information such as how many times each tactic has been applied)
    • Actions: Tactics (but this is complicated since tactics have arguments and tactics may fail or do nothing)
  • Search algorithm:
    • AlphaZero style Monte Carlo tree search (with policy/value)
    • The policy and value cover special tactics like applying an ATP or applying tactics which take arguments.
  • Machine learning method and training algorithm:
    • Curated database of tactics
    • Hand engineered “predictor” features
    • “Learning” means creating and improving this database
    • There is a lot I don’t understand yet about the learning step
    • Policy / value are based on how similar the current tactic / goal is to what is in the database based on the hand-engineered predictor features.
  • Dataset:
    • “Human proofs” (I assume the HOL4 library, with proper proof recording)
  • What prevents this from being applied to Lean?
    • It is probably mostly just engineering, but there is a lot of engineering involved.
    • I don’t know if the differences in logic really matter.

GamePad

  • System: Coq tactics
  • Search space:
    • States: Lists of open goals
    • Actions: Tactics (they didn’t deal with the complicated case of the have tactic or the rewrite tactic which take arguments)
  • Search algorithm:
    • The GamePad baselines don’t have a proof search, but the whole point of GamePad is that it is a framework for building machine learning tools for Coq, such as Monte Carlo tree search.
    • While they don’t do full theorem proving, they do predict tactics used in human proofs and proof evaluation (number of steps left in the proof) which could be used for a tree search algorithm.
  • Machine learning method and training algorithm:
    • Recursive tree neural networks learn embeddings (feature vectors encoding information) for terms, formulas, goals, etc.
    • Tried a number of machine learning methods (SVM, GRU, LSTM) for the prediction and evaluation tasks.
    • Trained with supervised learning
  • Dataset:
    • Coq Library, Feit Thompson
  • What prevents this from being applied to Lean?
    • This is closest to Lean in logic. Again, it’s a lot of engineering.
    • Also, there is no prover built yet. This is more of a learning framework than a final product. (The goal is for others to work in this framework and beat their baselines.)

HOList

  • System: HOL Light tactics
  • Search space:
    • States: Goal (set of subgoals if I am not mistaken)
    • Actions: Tactics (both tactic and premise argument)
  • Search algorithm:
    • Breadth-first search
    • Limit on the number of tactic actions from each goal
    • Tactics are ranked by neural network policy
    • Again, this baseline algorithm is designed to be improved by other users.
  • Machine learning method and training algorithm:
    • Neural network embeddings for goals and premises. (I think they tried both convolutional NNs and WaveNet, but I don’t know for sure.)
    • Neural networks which rank tactics and tactic + premise pairs. (Again, I’m not clear on the details.)
    • Trained with supervised learning first on human proof and then reinforcement learning. (I’m not clear on the RL training details.)
  • Datasets:
    • HOL Light core, complex (as in complex analysis), and Flyspeck
  • What prevents this from being applied to Lean?
    • Again, it’s a lot of engineering.
    • It could be possible to hook up their system to Lean in theory. Of course the exact logic and tactics are different.
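
Despite the differences above, all four systems share roughly the same outer loop: keep a frontier of goal states, ask a learned model to rank the applicable tactics (and premise arguments), apply the highest-ranked ones, and stop when every goal is closed or the budget runs out. Here is a rough, hypothetical sketch of that loop, closest in spirit to the HOList breadth-first baseline; every name in it is made up rather than taken from any of the systems.

```python
from collections import deque

def guided_search(initial_goals, rank_tactics, apply_tactic,
                  max_expansions=1000, branching=8):
    """Toy policy-guided breadth-first proof search.
      - a state is a tuple of open goals; the proof is done when it is empty
      - rank_tactics(goal) is a learned model returning (tactic, score) pairs
      - apply_tactic(goal, tactic) returns the new subgoals, or None on failure
    """
    start = tuple(initial_goals)
    if not start:
        return []                                     # nothing to prove
    frontier = deque([(start, [])])                   # (open goals, tactic script)
    seen = {start}
    for _ in range(max_expansions):
        if not frontier:
            break
        goals, script = frontier.popleft()
        goal, rest = goals[0], goals[1:]              # work on the first open goal
        best = sorted(rank_tactics(goal), key=lambda t: -t[1])[:branching]
        for tactic, _score in best:
            subgoals = apply_tactic(goal, tactic)
            if subgoals is None:                      # tactic failed: prune
                continue
            new_goals = tuple(subgoals) + rest
            new_script = script + [(goal, tactic)]
            if not new_goals:                         # every goal closed: a proof
                return new_script
            if new_goals not in seen:                 # don't revisit states
                seen.add(new_goals)
                frontier.append((new_goals, new_script))
    return None                                       # search budget exhausted
```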

Joseph Corneli (May 20 2019 at 11:06):

Here are a couple references I've come across that look relevant to premise selection (in the context of proof synthesis).

A type-theoretic approach that could be germane to our style of working here:

Jan Bessai et al. Combinatory Logic Synthesizer. In Tiziana Margaria and Bernhard Steffen, editors, Leveraging Applications of Formal Methods, Verification and Validation. Technologies for Mastering Change, pages 26–40, Berlin, Heidelberg. Springer Berlin Heidelberg, 2014.

Applications of contemporary ML tools to code:

Tal Ben-Nun et al. Neural Code Comprehension: A Learnable Representation of Code Semantics. In S. Bengio et al., editors, Advances in Neural Information Processing Systems 31, pages 3589–3601. Curran Associates, Inc., 2018.

Jason Rute (May 29 2019 at 22:36):

I’ve recently come across these three new papers on machine learning theorem proving in interactive theorem provers:

Bryan Gin-ge Chen (Jun 08 2019 at 00:32):

I haven't watched it yet, but here's an interview with Christian Szegedy of the Google Brain team. Presumably he talks about some of the papers mentioned above. https://www.youtube.com/watch?v=p_UXra-_ORQ

Jason Rute (Jun 09 2019 at 11:41):

@Bryan Gin-ge Chen Thanks for sharing! It was really interesting!

For those who prefer to read, here is a (computer generated) transcript of the video. It's a non-technical, high-level vision talk and is very accessible to the Lean community.

  • The first part of the video is about how Christian got into deep learning and about his earlier work (which is quite influential in the field: batch normalization, the Inception architecture, and adversarial examples).
  • At 14:26, they start talking about AI and formalization and Szegedy explains his motivation which is high level automated code generation.
  • At 19:33 the interviewer challenges Szegedy about why he thinks AI mathematics is at all tractable. Szegedy has some interesting thoughts and moreover has a much bigger vision than I realized. He thinks the only way to make this tractable is to simultaneously work on automated formalization of books and papers while also working on automated theorem proving. Both will strengthen the other (and have huge consequences in natural language understanding if successful).
  • At 26:43 they talk about the HOList project (but not by name), and what papers are coming next in a few weeks from his lab. Rather than using a fixed search algorithm like beam search, the AI agent will handle the searching as well. (I've been thinking a lot about this same thing recently. MCTS works really well in games since even small differences in game states can be exploited by an opponent, but in theorem proving, there are many paths to the goal and exploring one branch will tell you a lot about similar branches in a way that beam search or MCTS don't handle.)

Jason Rute (Jul 28 2019 at 13:14):

ProverBot9001 is yet another work using machine learning to prove theorems for Coq. Generating Correctness Proofs with Neural Networks. The code can be found at https://github.com/UCSD-PL/proverbot9001 . The paper is pretty well written. They have made it clear that their approach proves more theorems overall than CoqGym/ASTactic and CoqHammer (see Figure 15). There are still a number of possible improvements to make it much better. I suspect that the work of Szegedy’s lab, if applied to Coq, would probably give much better results overall, but this is hard to say for sure since HOL Light and Coq are different systems with different logics, tactics, and libraries.

Jason Rute (Oct 06 2019 at 14:09):

I haven’t posted any updates on AI and theorem proving in a while. Here are some:

Jason Rute (Oct 06 2019 at 14:10):

Just over a month ago, Szegedy posted a manifesto called A Promising Path Towards Autoformalization and General Artificial Intelligence. It is really interesting and provocative—if a bit challenging to read. (Also, it mentions Lean!) His master vision is to simultaneously build powerful AI autoformalization and AI automated theorem proving by training both tasks in lock-step. The main idea is to take images of math papers and to encode them as vectors in R^n that represent a theorem to prove. (As weird as this sounds to go to R^n instead of the symbols of the theorem to prove, it is a common practice in AI, especially machine translation, to “embed” the input, in this case informal math papers, into an intermediate latent space R^n before translating it into the final symbolic space.) Then one uses these vectors to guide proving new theorems, given theorems already proven. This is a boot-strapping process, where the autoformalization procedure gets positive reinforcement feedback when its “translations” lead to a new proof, and in turn these new proofs provide training data for the theorem prover.

Jason Rute (Oct 06 2019 at 14:12):

While I think the plan is incredibly ambitious, Szegedy explains in Section 7 how one would start to build smaller prototypes of the system parts. If I might be so bold as to suggest another addition to that section (@Christian Szegedy @Sarah Loos @Markus Rabe): instead of working with (images of) informal mathematics papers to begin with, they could use libraries of formal mathematics (Mizar, MetaMath, and maybe even raw Lean, Coq, Isabelle, etc. code) instead, where the latter is given only as unaligned sequences of symbols (or trees/graphs of tokens). While they wouldn’t be developing informal-to-formal translation yet, they would be developing formal-to-formal translation between theorem provers without a need to align symbols manually or get into the complicated inner workings of the source logical language. (And it would still provide useful training data for the automated theorem prover.)

Jason Rute (Oct 06 2019 at 14:12):

In other news, Szegedy’s lab recently posted a paper on Mathematical Reasoning in Latent Space. They showed that you can embed a mathematical statement into R^n (latent space) and then, from that embedding alone, do multiple steps of symbolic rewriting (in the sense of a rewriting tactic). This is important because one of the most expensive parts of tools like HOList (or even AlphaGo) is going from the symbolic representation (e.g. a formula, a Go board) to the prediction of the next step in the search. This prediction function is usually composed of two parts: (1) a large domain-specific neural network (e.g. a graph neural network or a convolutional neural network) which embeds the symbolic problem into a vector in latent space, and (2) a small neural network which takes this embedding and computes the result, which in a search algorithm is the next move to apply (e.g. the next tactic to apply). Now if it is possible to also “perform” that move in the latent space, then there is no need to constantly recalculate the embedding at each step. Moreover, having general-purpose precomputed embeddings is a common way to speed up AI training for specific applications. For example, a lot of work is spent in machine learning on creating pre-trained embeddings for English words and for images of everyday objects. Szegedy’s work shows that one can use the mechanism of symbolic rewriting to compute embeddings of (HOL) formulas automatically.
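
My reading of the idea, as a hedged pseudocode-ish sketch (the function names are placeholders of my own, not the paper’s API): pay the expensive symbolic-to-latent embedding once, then chain cheap “rewrites” directly on the latent vector, only going back to symbols when you need to check or read off the result.

```python
def plan_in_latent_space(formula, rules, embed, latent_rewrite, predict_success,
                         steps=4):
    """Approximate several rewrite steps without re-embedding the formula.
    embed(formula) -> latent vector (expensive, runs once);
    latent_rewrite(v, rule) -> new latent vector (cheap, runs per step);
    predict_success(v, rule) -> estimated probability that the rule applies."""
    v = embed(formula)                     # one expensive symbolic -> latent hop
    trace = []
    for _ in range(steps):
        # pick the rewrite rule the model believes is most likely to apply
        prob, rule = max(((predict_success(v, r), r) for r in rules),
                         key=lambda x: x[0])
        if prob < 0.5:                     # nothing looks applicable, stop early
            break
        v = latent_rewrite(v, rule)        # "perform" the rewrite in latent space
        trace.append(rule)
    return trace, v                        # candidate rewrite plan + end embedding
```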

Jason Rute (Oct 06 2019 at 14:13):

Last, on a different note, one system I neglected to mention previously is Holophrasm, an older AI-based theorem prover for MetaMath by Daniel Whalen. A new paper extending Holophrasm is Learning to prove theorems by learning to generate theorems. The accuracy of this new paper is still fairly low, with ~20% of MetaMath theorems proven, but given that MetaMath has no automation or tactics, this might actually be pretty good. (As usual, the lack of uniform benchmarks makes it challenging to compare ML techniques across different theorem provers. And of course, raw accuracy numbers are not very meaningful to practitioners.)

Luca Seemungal (Oct 06 2019 at 17:37):

Is it possible to save comments for later reference on zulip?

Johan Commelin (Oct 06 2019 at 17:45):

@Luca Seemungal Click the little star (which appears if you hover over the message, in the top right corner). Keyboard shortcut: *

Luca Seemungal (Oct 06 2019 at 17:50):

Ah cheers

Reid Barton (Oct 06 2019 at 18:29):

Do you mean literal images as in arrays of grayscale values?

Bryan Gin-ge Chen (Oct 06 2019 at 18:58):

From Section 6: "A Proposed Path to Autoformalization", (page 11 of the PDF):

One engineering question is how to represent informal mathematical content, that is mathematical papers and text books we want to formalize automatically. The straightforward looking path would be to represent them textually, for example by a sequence of unicode characters. This might work well for use cases that do not require the understanding of formulas, diagrams and graphs. However mathematical content is heavily based on the use of formulas and also diagrams and geometric illustrations might play an important role in informing the reader. Therefore the best path seems to rely on images instead of textual representation. Given that humans work from the same representation, this can preempt a lot of unexpected failure modes and engineering that would go into reconciling various types of textual representations of the same input.

Reid Barton (Oct 06 2019 at 19:17):

Thanks Bryan, my phone isn't displaying the pdf for some reason.

Bryan Gin-ge Chen (Oct 06 2019 at 19:24):

Yeah, I had to log into a google account to access it. Let me just attach the file to Zulip here for convenience: Autoformalization.pdf

Reid Barton (Oct 06 2019 at 19:26):

Now acrobat can't open it because it's not logged in to zulip, instead :)

Reid Barton (Oct 06 2019 at 19:27):

Okay, I managed to get it.

Tim Daly (Oct 06 2019 at 20:22):

Quoting from the Szegedy paper: "Still, everything that can be specified precisely can be considered as mathematics and everything we can talk about precisely is what we can talk about it by mathematical arguments. Therefore, mathematics is the language of all things specified formally."

This is a form of hasty generalization.

The mathematicians I have worked with spent a lot of time "hand waving" at a solution, wrote pseudo-formal attempts, and got "debated" into slightly more formal positions, usually by being more precise in their definitions. The "definition" phase is essentially constructing a new "type" so that they could restrict their domain enough to get a sound and complete map to their solution domain. That, in my experience, is "mathematics", i.e. what a mathematician does.

Computational mathematics takes the formalized definitions and theorems and tries to construct a formal proof from primary axioms. "Everything that can be specified precisely can be considered as mathematics" is a good premise but "mathematics is the language of all things specified formally" does not follow as a conclusion.

Christian Szegedy (Oct 06 2019 at 20:24):

they could use libraries of formal mathematics (Mizar, MetaMath, and maybe even raw Lean, Coq, Isabelle, etc. code) instead, where the latter is given only as unaligned sequences of symbols (or trees/graphs of tokens). While they wouldn’t be developing informal-to-formal translation yet, they would be developing formal-to-formal translation between theorem provers without a need to align symbols manually or get into the complicated inner workings of the source logical language. (And it would still provide useful training data for the automated theorem prover.)

Yes we were considering this, but it is unclear at this point whether it would be an unnecessary detour or a useful proof of concept.

What we plan to do is the (presumably much simpler) task of reconstructing larger parts of our theorem library purely from their semantic embeddings. Even that could be made incrementally easier or harder: in the simplest case the system has access to the original embedding model. The next step is if it needs to reconstruct it, and the third level is if we inject some extra noise into those embeddings as well.

Without getting too technical, I would like to thank Jason for the great and insightful summary. While our project is very ambitious, it is more inspiring to have a high level long term vision and work towards it, rather than doing disconnected random research. Doing so enables us to concentrate on fundamental basic ideas that could actually matter in the long run. Also it forces us to come up with simpler versions of the final system and test them in isolation, leading to interesting ideas like the one you mention above.

Christian Szegedy (Oct 06 2019 at 20:32):

The mathematicians I have worked with spent a lot of time "hand waving" at a solution, wrote pseudo-formal attempts, and got "debated" into slightly more formal positions, usually by being more precise in their definitions. The "definition" phase is essentially constructing a new "type" so that they could restrict their domain enough to get a sound and complete map to their solution domain. That, in my experience, is "mathematics", i.e. what a mathematician does.

You can still debate them into doing so, since they agree that it should be possible.

Tim Daly (Oct 06 2019 at 20:48):

The deepest mathematics I've ever tried was on two problems: The Twin Primes Conjecture and The Andrews-Curtis Conjecture. Both of them are "formally specified" but really require whole new areas of mathematics. As I think Buzzard pointed out in one of his talks, some problems get solved because a much more general problem is solved and the initial problem is just a special case.
(http://daly.axiom-developer.org/TimothyDaly_files/ACConjecture/index.html)

It may be possible to "pattern recognize" methods of proof but this is just defining "tactics". I don't see how you can "pattern recognize" definitions though. What would a "tactic" for constructing a definition look like? Indeed, if you feed definitions to a deep learning system, what would you expect it to recognize?

Tim Daly (Oct 06 2019 at 22:25):

Barry Mazur gave a lecture (https://video.ethz.ch/speakers/bernays/2018.html) and discusses the role of definition at about the 25 minute mark.

Christian Szegedy (Oct 07 2019 at 00:51):

It may be possible to "pattern recognize" methods of proof but this is just defining "tactics". I don't see how you can "pattern recognize" definitions though. What would a "tactic" for constructing a definition look like? Indeed, if you feed definitions to a deep learning system, what would you expect it to recognize?

I think this relies on a misunderstanding of our notion of tactic, also of the role of "pattern recognizing", etc.

In our setup "tactic" just refers to one "proof step" using the HOL Light proof assistant.

In the long run, we want to explore mathematics, but in a way that is still useful for proving theorems in our seed corpus (the seed conjectures given by humans). On the way the system should maintain and extend a database of definitions and conjectures. The system should create new definitions, either by abstracting common patterns in existing definitions or proofs or by just coming up with intriguing new objects, perhaps those that serve as counterexamples of existing conjectures.

Both definition and conjecture creation can be viewed as a special case of generative modeling. In the case of definitions you try to generate new notions that help to compress the existing corpus of mathematics, e.g. making existing statements special cases or just simplifying the proofs. In the case of conjectures, you try to create statements that would either be helpful for proving other conjectures or that could generalize existing statements (again: compression).

Jason Rute (Oct 07 2019 at 02:25):

it is more inspiring to have a high level long term vision and work towards it, rather than doing disconnected random research

I agree. One can compare it to DeepMind’s AlphaStar project to beat StarCraft II. As the AlphaStar project and the subsequent commentary show, different people can look at the same results with very different levels of optimism. The pessimistic view is that deep learning is just function approximation and pattern recognition and this isn’t close to sufficient for success on StarCraft II. Moreover, any success AlphaStar does have is just due to the non-human advantages of computers (many lifetimes’ worth of playing experience, fast brute force search, perfect accuracy, and no mistakes). It can’t plan, strategize, show creativity, or understand the game. And those in this camp are quick to point out how, even with its unfair advantages, AlphaStar is consistently being beaten by humans. I hear the same pessimism about applying deep learning to mathematics: all deep learning can do is recognize patterns in proofs. It can’t understand math, make definitions, or show the level of creativity of a mathematician.

Jason Rute (Oct 07 2019 at 02:26):

I agree, narrowly, that these are valid concerns, but I would offer a much more optimistic viewpoint. In StarCraft II, a lot of progress has been made by combining off-the-shelf tools from other areas of AI research. For example, LSTMs and attention-based mechanisms are both tools heavily used in natural language processing, and population-based training is a very old idea in machine learning inspired by biological evolution and diversity. Even if AlphaStar isn’t perfect, it is still amazing how far we’ve come and how general the techniques are that it employs. A philosopher can argue whether it shows “long term planning” or “creativity”, but it does accomplish something more human-like than other algorithms and can readily be applied to new problems. I think the same story will hold in computer mathematics. We have barely begun to explore the possibilities of applying existing deep learning techniques to theorem proving and autoformalization, and the little existing research done by a select few researchers has already shown real promise! Rather than pessimistically saying this will never work because, say, we don’t have a solid plan for how to teach the agent to make new definitions, let’s see how far the current deep learning techniques take us. And unlike AlphaStar, if the result of this program is only a halfway decent automated theorem prover which can autoformalize some of the existing literature, that is a great initial step with very real value to mathematics and industry!

Jason Rute (Oct 07 2019 at 02:26):

Moreover, my optimism doesn’t stop there. Deep learning research is far from fizzling out, and there seem to be good arguments that the human mind could well use something like deep reinforcement learning to learn both at the level of milliseconds and millennia (the latter via evolution). Moreover, as Christian pointed out, deep learning also provides, through say reinforcement learning and generative modeling, ways to create entirely new and novel information. So I think we are on the right track, even if there are still decades of research needed. As automated neural mathematics reaches the limit of the current techniques in deep learning, other research groups in AI (e.g. machine translation, game play, or robotics) may already have solved some of the problems we face. If not, computer mathematics could very well be the testbed needed to bring in powerful ideas from human intelligence into AI.

Tim Daly (Oct 07 2019 at 04:15):

OK, let's consider the "special cases" as teeth on a gear. Somehow we have to recognize the teeth are related. Then we have to generalize the problem with new mathematics which defines the gear. The final result is that we recognize that the teeth occur in every area of mathematics.

More concretely, I'll make a "natural language conjecture". It appears to me that random walks, the Collatz conjecture, and the Andrews-Curtis conjecture are all instances of the same mathematics. There is some higher level statement of the problem of which these are all instances. Once the higher statement is formalized and proven, not only will these three phenomena be proper instances, we will find that every area of mathematics has such statistical conjectures.

The generalization would allow us to "translate" collatz to andrews-curtis or random walks.

Recognizing that there MIGHT be a generalization is hard since you have to at least know some infinite group theory, and some statistics, etc.

Inventing the mathematics for the general case is even harder. Somehow you have to create definitions and axioms and even some new notation such that it is obvious that all of these examples are "just instances".

I have conjectured that random walks, Collatz, and Andrews-Curtis are all instances of some deeper mathematics. Unfortunately, staring at the problems leaves me with no deeper insight. I'm not even sure how to translate from one notation to another, or to invent a more general notation.

If an AI program could (a) suggest the "common idea" conjecture, (b) invent definitions that cover the instances, (c) invent a notation that covers their more specific notation, (d) formalize the corresponding general case, (e) prove the general case, and (f) prove the instances.... well, color me convinced.

At the moment, I can't see how pattern recognition can be applied to any of the above steps.

Tim Daly (Oct 07 2019 at 04:37):

Let's apply the machine to the machine...

If an AI program could (a) suggest the "common idea" conjecture, (b) invent definitions that cover the instances, (c) invent a notation that covers their more specific notation, (d) formalize the corresponding general case, (e) prove the general case, and (f) prove the instances...

Then we should turn the flashlight of attention on deep learning AI itself. After all, it should be possible to formalize deep learning in such a way that we could gather all instances under a single general notation, make definitions and axioms that formalize what deep learning can and cannot achieve, and then show that deep learning can cover proofs as an instance.

I have spent time on the Open AI Gym website (https://openai.com/). There appears to be a bit of theory about gradient descent (see https://www.analyticsvidhya.com/blog/2017/03/introduction-to-gradient-descent-algorithm-along-its-variants/), some statistics, and some emerging "how to" models. There are even claims that deep learning can model any function. But the question before the court seems to be: "Is what a mathematician does, a 'function'"?

If a mathematician is a function, the book should start with the sentence "Call me F". :-)

(For those who don't know, the famous first line of Moby Dick is "Call me Ishmael")

Tim Daly (Oct 07 2019 at 18:56):

Deep learning is "pattern recognition". Theorem proving is not pattern recognition.

Deep learning can look at many proofs and create new tactics by recognizing patterns of proofs. But proving theorems often requires definitions (to limit the scope of the domain) and lemmas to focus on sub-problems. Constructing definitions and lemmas is not a pattern recognition task.

Deep learning does not incorporate "human modes of reasoning".

There is a whole "other branch" of AI based on symbolic reasoning. I authored a language called Krops that combined knowledge representation and rule based programming. It was used by IBM to construct a system for configuring mainframes, a rather complex task.
http://daly.axiom-developer.org/TimothyDaly_files/publications/krops/HICSS88.pdf

Symbolic reasoning systems, especially those based on certain forms of logic, are more applicable to reasoning about proofs. They do not depend on "training" or "pattern recognition".

Christian Szegedy (Oct 08 2019 at 02:46):

Deep learning is "pattern recognition". Theorem proving is not pattern recognition.

Is the creation of photorealistic looking faces "pattern recognition"? Cf.
https://thispersondoesnotexist.com

Jason Rute (Oct 08 2019 at 03:15):

First, I may not have put enough emphasis on symbolic reasoning so far. To be clear, every one of the projects I’ve mentioned in this thread relies on symbolic reasoning. To me, the melding of symbolic reasoning and deep learning is what makes this area exciting.

Jason Rute (Oct 08 2019 at 03:15):

Deep learning can look at many proofs and create new tactics by recognizing patterns of proofs.

I agree, and I see this as a valuable contribution on its own, especially if these “tactics” fill in obvious or straightforward steps in a proof. Even better if they learn patterns in informal published mathematics and translate this into formal mathematics.

Jason Rute (Oct 08 2019 at 03:16):

But proving theorems often requires definitions (to limit the scope of the domain) and lemmas to focus on sub-problems. Constructing definitions and lemmas is not a pattern recognition task.

I agree that definitions and lemmas present the most significant challenge to any attempt to have a computer do interesting mathematics, both for deep learning and symbolic reasoning. I’m not convinced that Szegedy’s lab has fully thought this through either, but I am more optimistic that they or someone else can. As for pattern recognition, would you not say that intuition (a type of pattern recognition) plays an important role in creating definitions and lemmas? However, I agree that pattern recognition alone is not sufficient for creating deep definitions.

Jason Rute (Oct 08 2019 at 03:16):

As for the term deep learning, I think of it more broadly than just function approximation (pattern recognition) using neural networks. A lot of effort is spent understanding how to train these neural-network-based agents to learn from and explore their environment. This includes instilling them with forms of curiosity, episodic memory, learning how to learn, individual specialization (population-based training), competition, cooperation, planning, imagination/daydreaming, communication, asking what-if questions (tree search), and so on. It is this element of the research which I call “human modes of reasoning”, and of course, symbolic reasoning should fit in there as well. While this training regime is usually used just to train an agent, in the case of theorem proving, the training itself creates symbolic knowledge (theorems) as well as non-symbolic knowledge (better “pattern recognition”), both of which can be used to improve the agent but are also a useful outcome.

Tim Daly (Oct 08 2019 at 08:14):

The idea that a vague anthropomorphic term like "intuition", which we don't fully define, is a form of pattern recognition, sounds compelling.

But using an ill-defined label for deep learning or symbolic computing in order to infer that we can automatically construct exact definitions and lemmas seems like wish-fulfillment. Given a definition, KROPS (see my link above) can use subsumption in a knowledge representation web to classify it. KROPS can also use the definition in rules. But I have no idea how to automatically create a definition ab ovo.

The closest I can come to this idea is to say that the role of a definition (in proof theory) is to define a new type and give it a name. The new type is used to restrict the "domain of discourse", that is, to restrict the input domain of a theorem. Given a theorem statement and a partial proof, it might be possible to reason backward from the point of difficulty. If the failing point could be proven for Nats but not for Reals then one could reason backward to say that the proof only applies to Nats. From that, one could maybe construct a definition of a restricted type that allows the proof to succeed.

While this "backward-reasoned" restriction is a form of definition, it does not cover constructing "forward definitions" that get constructed before stated proofs. In the forward case one is trying to "etch out" a domain. Unfortunately, this seems to require something I don't know how to automate. In the Collatz/Andrews-Curtis/random-walk (CAR) example I have above, I have a vague "intuition" that these are related by higher math. But the "intuition" is so vague I can't even make a definition that applies in all three CAR domains.

Can you suggest an automated way to "forward construct" a CAR definition?

Tim Daly (Oct 08 2019 at 08:17):

My understanding of creating photorealistic faces is to use GANs. But the GAN has been trained on millions of faces. We don't have millions of near-similar definitions. Is there an example of constructing such faces without GANs?

Scott Morrison (Oct 08 2019 at 08:25):

I think Collatz and Andrews-Curtis are not particularly good models for "AI and theorem proving", because they are both rather featureless problems with little connection to mathematics. (Andrews-Curtis less so; there's a fun connection with a fairly featureless corner of 4-manifold topology...)

Tim Daly (Oct 08 2019 at 08:46):

Wow, that's quite a list of words. Unfortunately those anthropomorphic labels are just that... words. Back when I was but a lad there were mechanical mice with light sensors and lights. Based on trivial wiring changes on breadboards one could construct various forms of "social behavior" such as avoidance (reduce the capacitor charge by leaving lit areas). Researchers labelled such mice as "shy". AI has been plagued by these anthropomorphic labels.

Show me the code and I know what it does. Show me a label like "day dreaming" and I... day dream.

Q-Learning is theory and code I can understand. Updating Q_k to chase a non-stationary target is meaningful "behavior". But calling such changes "Hunting" in the human sense is poetic license at best. Whether Q-Learning is an implementation of a Markov process is still not clear to me but they seem closely related.

Gathering Q-Learning state-action pairs (aka "Experience", aka "table lookup") to do pole balancing on a cart looks impressive. And it is. But it is still in the "bicycle" stage, learning behavior that is hard to state formally.

Do what I try to do, which is read the code. Ignore the anthropomorphic labels.

As Schrödinger said to his cat, if you keep climbing into that box and sniffing my qubit, it will be the death of you. In the police report they said that curiosity killed the cat.

Tim Daly (Oct 08 2019 at 09:12):

The fact that Collatz/random walk/Andrews-Curtis (CAR) are featureless is one of the reasons why I feel they must be instances of a deeper result. Many of the statements in areas like geometry are "featureless" instances of topology, as you point out. I wish I could collect more instances of mathematics that "sort-of walks around a point". Then maybe the general case will appear. I also think that CAR examples might be "edge cases" of chaotic behavior so they might be instances of the disorder that arises from chaotic differential equations
(https://www.math.arizona.edu/~dwang/W-noiseOct1409.pdf). Now if I could only "define" CARs as instances of these equations I could search for the boundary conditions.

But that would require constructing definitions about differential equations that restrict the domain of discourse (aka construct a type) so that lemmas and theorems can be used. The only "crossover" work I've seen between differential equations and logic is work by Andre Platzer.
(http://reports-archive.adm.cs.cmu.edu/anon/2017/CMU-CS-17-117.pdf). And, frankly, I have no idea how to construct a definition in differential equations that applies to a conjecture in Infinite Group Theory. That seems to be beyond my personal coffee budget.

Jason Rute (Oct 08 2019 at 13:27):

I added links to papers for the words. (Not code, but still a bit more concrete than before.)


Last updated: Dec 20 2023 at 11:08 UTC