## Stream: general

### Topic: Cell Phone Mathematics

#### Tim Daly (Jan 05 2020 at 18:36):

Can you build a cell phone from scratch? Do you understand how?

I've spent most of my career, such as it is, in areas that were once considered AI (vision, robotics, planning, expert systems, computer algebra, etc.). I'm a bit concerned about the impact a machine learning approach to theorem proving.

I am currently "fuzz testing" Axiom. That is, I have a program that randomly generates expressions which I can integrate to test the integration algorithm.

Note that the fuzz tester does not know anything except that expressions are trees and nodes in the trees have one, two, or many operands. Some of the generated expressions are not valid since they end up doing things like division by zero. That's not a problem as Axiom will discover that and reject that expression.

Note, however, that we are not remembering nor building on these random expressions.

Imagine that some bright-spot constructs a Neural Net AI that can prove theorems. And imagine a fuzz-tester that generates random theorems that the AI tries to prove. Sometimes it succeeds. The fuzzed theorem is correct but has no 'intrinsic meaning', just a 5000 step proof.

A proven theorem has value and should be remembered. It should be added to the list of proven theorems.

Now you arrive with your hand-generated theorem for your theory. You submit it to the AI and it generates a proof. The proof relies on a pile of prior fuzz-generated proven theorems.

Eventually you reach a state where everyone believes your theorem is true (since there is a proof) but no one, including you, has any clue how to prove it without the machine, or what the proof even means.

In other words, "It just works"... or, as I call it, Cell Phone Mathematics.

This used to be the stuff of science fiction. Is there some kind of mathematical "summary" theory that can be derived from a sequence of steps? Do we need an algorithm to create meta-tactics? Or is "understanding" going to be lost on the sea of random theorems?

#### Jason Rute (Jan 05 2020 at 21:17):

I think this is a common concern I've heard about computer generated proofs, namely that there will be incomprehensible proofs of interesting theorems. We will know it is true, but not why. However, I imagine that the advancement of computer proving (using neural networks or any other means) will be slow enough that the mathematics community will have time to adapt. I think we are already starting to face similar issues. The four color theorem reduces to a huge case analysis. The classification of finite simple groups is too big for any one person to wrap their head around. Sometimes some mathematician comes up with a lemma whose proof has no motivation. Other times a mathematician brings in knowledge from a very different area of mathematics to prove something (leaving the mathematicians who normally work in that area a bit angry that they have to learn a whole new field of math to keep up).

#### Jason Rute (Jan 05 2020 at 21:17):

I think one possible solution to the "incomprehensible proof" problem will be that we can use the same AI tools to better understand what is going on. We can ask the AI if certain facts are true (like an oracle) and then get a better big picture idea of what is happening. Alternately we can look at the AI proof at different levels of granularity and try to separate out the ugly lemmas (and them try to replace that part of the proof with something more beautiful). I think for a long time humans will have the upper hand on theory building and understanding. I think of AI more as a new tool, like the telescope was to astronomers or the microscope to biologists. I know combinatorists who plug in a conjecture into a computer to test the first billion numbers or so. That way they know they aren't wasting their time. I imagine that AI could form a similar purpose.

#### Tim Daly (Jan 05 2020 at 23:44):

I know that in Expert Systems which used rules to reason we had to introduce the notion of 'time' (i.e. the most recent fact) and 'priority' (i.e. an ordering between rules), otherwise the system might loop or might reach really obscure conclusions. Even when the rules are written in pseudo-natural style it is very hard to think like a machine.

I suspect that an AI needs 'time' (i.e. focus on the most recent goal) and ordering (i.e. try the premises in a certain order). In logic trees, though, you might have to walk upward to 'find a lemma' in order to make progress in another section of a tree. It would be really great if the systems could 'lemonize' (create lemmas) and recognize repeated patterns as existing or new tactics. I don't know how to 'cascade' AI systems so they can be made into subunits (e.g. a tactic finder) that gets repeatedly used. Most of the NNs I've studied are monolithic objects.

I suppose the NN could learn to use 'refl' or 'hammer-like' tactics first (the analogy of 'priority'). Lemma-finders might be applied post-facto as separate 'rewrite' NNs to find patterns in the proof. It is still not clear to me where you find the training sets. Nor is it clear that ReLU nets are bounded in the set of moves it can apply without some heuristics.

I can barely read the HoTT book now. I fear an AI that swallows all of that for reasoning.

#### Jason Rute (Jan 06 2020 at 01:49):

Right now NNs are primarily used for two tasks in theorem proving which relate to your "priority".

• The first is premise selection. For example, if want to apply a tactic which takes theorems as arguments (maybe a hammer tactic), which previously theorems do I use? If I pick too many, then it means more work for the tactic and it may not find a proof in time. If I pick too few (or not the right one), then won't have the appropriate premises to prove the result. Technically this can be done by inputting the premise and the current goal state into a NN and then have the NN return a score from 0 to 1 where 1 is useful and 0 is not useful. Then one gives the tactic the mot useful premises. (This is just one approach of many.)
• The next task is proof guidance. For example, given a goal state, which tactic do I apply? (Currently this is out of a fixed list of tactics.) This is done by having a NN which takes the goal state and returns a probability distribution over tactics. (Again, this is one of many approaches.) Then usually one uses a tree search algorithm (like beam search) where the decisions are ranked and by the proof guidance NN.

As for training sets, there are currently two approaches (or a combination of both). One is to modify the ITP to record theorems and (tactic level) proofs (and maybe also the intermediate goal states). Then one can train a network to predict the next tactic give a current state and also to predict if a premise is used in that proof. The other approach is to do reinforcement learning where the prover tries to prove a whole bunch theorems (again gotten from the human-written ITP proofs). If it succeeds then those proofs are used as training data, a new model is trained, and it tries again.

There are still some tasks which haven't fully accounted for in these NN systems. Some tactics take terms as input and this is where things get really hard. For example, to prove an existential statement one has to give a satisfying term. There are many NN-based approaches for generating objects (like text or images, etc). These approaches or modifications could be applied to theorem proving, but this is very preliminary and difficult. A similarly difficult task would be lemma creation.

Also, the question of combining low level steps into higher level steps (tactics) is also an interesting research area in machine learning (for example in hierarchical reinforcement learning and in automatic program generation). To automatically create tactics, this may or may not involve NNs. There are Bayesian methods for example, which have shown some success in similar machine learning tasks.

I don't think there is one NN way of doing things. Instead I see NNs as a component in a larger system that could become more complicated in time. I think we are still figuring out how NNs best fit in, how to train them, and how to combine them with other tools.

#### Kevin Buzzard (Jan 06 2020 at 02:05):

All this happened for chess and it just made people more excited about chess, because as well as the computer being able to find incomprehensible mates in 100+ moves in endgames like K+R v K+Q or whatever, it also tells humans things they want to know, gives new insight into openings and so on. I am not worried about incomprehensible proofs at all. If we're there then we're already in Nirvana

#### Tim Daly (Jan 06 2020 at 16:26):

Yes, I worked on chess (I posted a file of chess games for ML if you're interested in a training set).

Games have the property that they are decidable. Theorem proving doesn't have this property.

Games have a fixed set of rules. Theorem proving has an "expandable" set of rules by adding axioms / lemmas / definitions. Reinforcement learning will struggle with the fact that the possible choices, and thus the reinforcement, changes over time.

NNs generally require millions of training data sets. I don't know of a million theorem proving sets. I'm not even sure how to generate s training set.

A tactic is essentially a program. I don't know of any NN that successfully generates programs.

We struggled with several of these problems (in an Expert System context). I developed a language (KROPS) that merged rule-based and knowledge-based semantics (http://daly.axiom-developer.org/TimothyDaly_files/publications/krops/HICSS88.pdf) to try to work with 1-off, undecidable sets.

I'm still reading the current crop of papers, hoping to be pleasantly surprised. But it seems to me that knowledge representation and rule-based AI is more likely to be applicable. An expert system could state rules to perform likely actions at any proof step (e.g. split into cases) by applying context-specific rules (e.g. the AND can be split into 2 cases). Not all AI has to be Neural Networks.

#### Tony Johnson (Jan 06 2020 at 19:51):

I think a good use of a system like lean would be to prove deep results in machine learning and execute the extracted algorithims. Maybe this is just a bias I have against ANNs, but I think they lack elegance and would rather have inductive learning algorithms based on some variant of PAC and Conditional Kolmogorov Complexity. Go GOFAI!

#### Tim Daly (Jan 06 2020 at 21:27):

So the game is as follows:

Given a rule based system such as OPS5 (manual: https://pdfs.semanticscholar.org/aa5e/87dbe6b50bd32ad971044fb113a40d2e87fa.pdf, sourcecode: https://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/expert/systems/ops5/0.html)

1) generate proofs for all known theorems (anybody know how to do this?)
2) for each known tactic (is there a list?)
2a) find the use of the tactic (e.g. refl)
2b) write a rule that recognizes the condition using the tactic (e.g. sideA == sideB)
3) run the resulting rule set over the proofs. it should find at least some of the proofs.
4) figure out what failed and write some more "expert" rules

#### Jason Rute (Jan 06 2020 at 22:16):

If you want to try this game, I'd suggest starting with HOL Light? The HOList project has a data set of HOL Light theorems with tactic proofs. You can compare your results to DeepHOL. I think you could directly modify DeepHOL as described here to test out your rule-based system. I can provide a bit of assistance if you need it. (Also, to be a fair comparison and to avoid overfitting, you should probably only use the proofs in the training set for coming up with your rule-based system.)

Other options would be to try it out on Coq or Isabelle. For Coq, you can look at either CoqGym, ProverBot9001, or (maybe) GamePad. I believe they all have recorded Coq tactic proofs. And I think CoqGym has a way to put in your own theorem proving AI, which you should be able to test your approach.

Last, you can look at TacticToe and their Isabelle dataset. Also there is Metamath and Mizar which have recorded proofs, but they do not use tactics. For Lean, I don't think there is a good way currently to get recorded proofs. My dream is that there would be easy-to-use theorem proving environments that AI researchers such as yourself could try out their ideas with ease (either using expert systems, NNs, or something else).

#### Tim Daly (Jan 07 2020 at 02:02):

Excellent. I see that HOList has a leader board of NNs.

#### Tim Daly (Jan 07 2020 at 03:46):

BTW, 'overfitting' isn't a concept that applies to expert systems.

#### Daniel Selsam (Jan 07 2020 at 12:53):

BTW, 'overfitting' isn't a concept that applies to expert systems.

It does if you view the rule-writer as a program that takes as input a collection of proofs and outputs a set of tactics. I think almost all intuition about statistical overfitting applies to this regime as well, especially the need to keep training and testing sets separate.

#### Daniel Selsam (Jan 07 2020 at 12:58):

Note that you can't win a gold medal in the IMO grand challenge by manually formalizing the proofs after the problems are released :)

#### Johan Commelin (Jan 07 2020 at 13:01):

/me throws away his Christmas holiday project :rolling_on_the_floor_laughing:

#### Tim Daly (Jan 07 2020 at 13:38):

Jason, I planned to attend the FoMM meeting (I'm watching on video) but I came down with the flu, which I'm unwilling to share :-)

#### Tim Daly (Jan 07 2020 at 13:50):

@Daniel Selsam two points... Expert systems aren't statistical and are not approximation functions of their input so the notion of 'overfitting' doesn't apply.

A 'grand challenge" award? I have no idea what that is (reference?). I'm just suggesting that proofs are an area where expert systems might be a better fit for the task. Several people here look at posted proofs and suggest tactics that might apply. Capturing that expertise is the kernel of expert systems. To capture that expertise you need to look deep into the steps taken in proven theorems.

There are two measures of success. One would be to reproduce an existing proof automatically. The other would be to prove a new theorm. Ultimately, though, it would be useful to have a system that, at least, knew the available tactics and could suggest 'the most likely one'. Adding new tactics would require adding rules for when it would apply but there would be no 'retraining' needed.

Expert systems have the additional advantage of being able to explain their reasoning.

#### Jason Rute (Jan 07 2020 at 13:53):

@Tim Daly Feel better soon!

#### Tim Daly (Jan 07 2020 at 14:00):

@Alex J. Best re: grand challenge. I'm not a competitive personality so I don't find that motivating. I do know that Expert Systems were quite useful in a lot of domains (e.g. Mycin). They are especially good at diagnostics (e.g. why does my car rattle?), which seems to me to apply to finding out what "a next proof step" might be.

At the moment I've downloaded OPS5 and am looking for a database of fully expanded proofs that include each step and tactic. The HOL version only includes the theorem and result as far as I can see.
There should be a way to capture the proof steps in a Lean proof .

#### Daniel Selsam (Jan 07 2020 at 14:03):

@Tim Daly I think you misunderstood my comment. I agree that symbolic tactics are not themselves statistical, though I would argue that neither is a black-box action-predictor. It is the process of synthesizing these systems that can be considered statistical. Specifically, I think that the process of looking at proofs and writing an expert system that can do well on those theorems can be productively viewed as a statistical process, and as a kind of manual version of synthesizing neural-network parameters by minimizing a loss function over a training set. This view suggests that "overfitting" should indeed be a central concern when evaluating expert systems. I did not mean to debate the relative merits of symbolic tactics written by humans versus floating-point heuristics trained through gradient descent.

#### Jason Rute (Jan 07 2020 at 14:05):

@Tim Daly Daniel Selsam was just using he IMO grand challenge (which I assume he thought you knew about) as an example of how one could "overfit" (or whatever the analogous concept in expert systems) is. Neither him or Alex were necessarily suggesting you work on that.

#### Jason Rute (Jan 07 2020 at 14:05):

Let me try to find you a list of both theorems and proofs for HOL. I know this is available and part of the HOList training data.

#### Jason Rute (Jan 07 2020 at 14:06):

There should be a way to capture the proof steps in a Lean proof.

This is more complicated than you would think. One problem is that the lowest level of the proof would be steps in the Lean type theory calculus. Getting the intermediate tactics is challenging. Also, not all Lean proofs use tactics and some mix declarative style with tactic style.

#### Tim Daly (Jan 07 2020 at 14:08):

@Jason Rute I don't remember the notion of 'overfit' in Expert Systems. Overfit implies that it works well on the training set but not well in general. That doesn't seem to happen in Expert Systems (although it could, I suppose). Given a set of proofs I don't see how a set of expert rules would "only apply" to that set.

#### Daniel Selsam (Jan 07 2020 at 14:10):

Alex J. Best re: grand challenge. I'm not a competitive personality so I don't find that motivating.

I'll just point out that although we are open to there being multiple (semi-)independent teams, I personally aspire for the IMO Grand Challenge to be as close to a single, community-wide collaboration as possible.

#### Daniel Selsam (Jan 07 2020 at 14:13):

I don't remember the notion of 'overfit' in Expert Systems. Overfit implies that it works well on the training set but not well in general. That doesn't seem to happen in Expert Systems (although it could, I suppose). Given a set of proofs I don't see how a set of expert rules would "only apply" to that set.

@Tim Daly try running any tactic ever written on any problem that is not in its actively maintained test suite.

#### Tim Daly (Jan 07 2020 at 14:17):

@Daniel Selsam I'm not suggesting that a Grand Challenge is a bad idea. I'm just pointing out that, as a personal failing, I don't understand the idea of "competitive mathematics". I have enough "competition" with myself trying to write working mathematics. I took the Putnam exam (got a zero) as an undergrad but that was due to pressure from my friend (who got a 3). I find mathematics difficult and it takes me a long time to "get it" and when I finally do, it is due to being stubborn, which is only useful in long distance running competitions :-) I worked with some "world class" people in the math dept. at IBM Research and I know that I'm not able to play on that field.

#### Tim Daly (Jan 07 2020 at 14:41):

@Jason Rute I went to that site. It mentions a github repo. The data isn't in the repo (or if it is, I can't find it)

#### Jason Rute (Jan 07 2020 at 14:52):

Generally one doesn't put data in git. Instead, you have to download the data from Google's servers (using gsutil). They provide instructions in the section "Proofs and Training Data".

#### Tim Daly (Jan 07 2020 at 15:11):

@Jason Rute I'm working on installing gsutil. Sigh.

#### Jason Rute (Jan 07 2020 at 15:19):

Also, it is a lot of data if I remember correctly. It will take a while to download. Also, they are stored in Google's protocol buffer (protobuf) format.

#### Tim Daly (Jan 07 2020 at 15:27):

@Jason Rute Yeah, I'm just installing it from a tar.gz. I don't want the whole google cloud install.

#### Jason Rute (Jan 07 2020 at 15:55):

Also, when you get it downloaded, they don't use pure HOL Light format. They use a modification. See my notes here. I take the time to describe their syntax, the available tactics, and generally how to read them.

However, note these notes are for the "backend" API which is how DeepHOL communicates with HOL Light. For the most part, you probably want to use the "front end" API, where you just modify DeepHOL to use your algorithm predict the next tactic. This "front end" API is described here and here.

#### Tim Daly (Jan 07 2020 at 16:06):

@Jason Rute Thanks, I'll look into this later. The flu is winning at the moment so I'm going to pause for a bit.

#### Tim Daly (Jan 09 2020 at 01:19):

@Jason Rute I've been giving some thought to our discussion of machine learning vs expert systems. Given the text of a set of proofs we are trying to decide what tactic to apply. This is essentially a labelled data set. There is a technique called decision trees that might be applied here to develop "expert rules" for making decisions. Alternatively, there is a more powerful (but heavier) bit of machinery called a support vector machine which I used once before ((http://daly.axiom-developer.org/TimothyDaly_files/publications/sei/ISeBJournalFinal.pdf). Decision trees seem to be a better choice in this instance to develop "prototype rules" for an expert system, an approach I haven't seen before.
Decision trees also lends itself to "explanation".

#### Jason Rute (Jan 09 2020 at 03:05):

Decision trees, SVMs, and other "classical" supervised machine learning classification algorithms (logistic regression, random forests, gradient boosted trees, k nearest neighbors, naive Bayes) are not out of the question. Your idea to use a decision tree as a way to jumpstart a hand-crafted expert system is also interesting. One advantage of these simpler machine learning methods is that they are much faster to run than neural networks and that can make a big difference in a tree search. Another practical advantage is that they are easier to code, train, and work with. (This last point is especially important if one is trying to code the algorithm in say OCaml, Lean, etc.) Indeed there are a number of papers using these simpler methods. SledgeHammer uses naive Bayes for premise selection. TacticToe uses k-nearest neighbors to learn tactics for Isabelle. The system in "Reinforcement Learning of Theorem Proving" uses gradient boosted trees (another explainable tree-based machine learning method) to guide a tableau theorem prover. FastSMT uses something like a decision tree (specifically a non-looping flow chart of SMT tactics branching on certain formula characteristics) to guide an SMT solver (but in this case the tree was built by a neural-network-based algorithm).

The one thing that all of these methods have in common is they can't work with formulas directly. One has to manually extract which features of the formula are important. Is it the number of variables, the outermost connective/relation/quantifier, the depth of the formula tree, the number of symbols, the type of variable in the quantifier, etc? Of course, this sort of "manual feature engineering" is common in machine learning, automatic theorem proving, and expert systems.

One advantage of neural networks are that they don't necessarily require as much feature engineering. There are a few types of neural networks which can take as input strings, lists of tokens, or structured trees directly. They essentially do the feature engineering for you. If they work (and to be honest it can be difficult to train them correctly, and they are not fast to run) they can give very good results since they don't require a human to hand-engineer the system and think through every case. (I think the same goes for expert systems as well. They are great if the human can think through all the cases, but often there is a lot to think through and it requires a lot of engineering to get it right.) Of course, the counter argument to all this is that humans are smarter than machines still, so we can better see the big picture and exploit the explicit structure of a formula. So I am trying to keep an open mind to your expert system proposal.

Last, I counter that explainability is not very important in this case. In almost every other domain, explainability is a big benefit. One doesn't want an algorithm which makes unethical, dangerous, or unpredictable decisions. In my real world job as a data scientist for a startup, we like explainable algorithms. However, in this very specialized case of theorem proving, if the algorithm is better at predicting tactics but we are not sure why, that is ok. Either the tactic succeeds or it fails, but in either case, it will never prove something false since the theorem prover checks every step. This is one reason I like applying machine learning to theorem proving. The theorem prover provides tight bounds of safety, so we can let the machine learning algorithm be as unpredictable (or creative?) as we want.

#### Jason Rute (Jan 09 2020 at 03:49):

@Jesse Michael Han would like me to point out the following:

in FastSMT, only the candidate strategies are found by neural network. the candidate strategies are then assembled into a tree using a deterministic (and seemingly quite expensive?) process which tries to minimize some notion of multi-label entropy

#### Tim Daly (Jan 09 2020 at 08:58):

@Jason Rute We 'll have to disagree about the importance of explainability. I'm a 'primitivist'. I like to be able to understand (and debug, or in this domain, conjecture why the proof failed) what the machine is doing, down to the metal.

If my proof of an integral algorithm fails I'd prefer the conclusion "Risch's Theorem Violated" instead of "Well, that failed".

My last flight was pulled out of line for take-off, engine shutdown and restart, because the computer needed to be rebooted. The ultimate case of "have you tried turning it off and on again". I don't want my mathematics to depend on reboots :-) Tell me what was tried (so I don't hand-repeat the efforts) and why each nearly-applicable rule failed (so I can focus on the blocking condition).Otherwise, the machine is telling me "that failed" and, ironically, I am now "learning by reinforcement".

#### Tim Daly (Jan 09 2020 at 09:02):

@Jason Rute Feature extraction from proofs is a necessary component, as you mentioned. I haven't reached a point of looking at the proof sets (despite my "fevered attempts" :-) ). I am hopeful, but not optimistic, that these systems already provide the ability to 'grab' any component of a proof. If not, then developing the set of extractions is the first, painful step. I would have expected that the NN group would also want this since it would naturally provide "attributes" or "features" that would be part of the training data.

Last updated: May 13 2021 at 18:26 UTC