Zulip Chat Archive

Stream: general

Topic: CNL and the Stack Project


Johan Commelin (Jan 17 2020 at 07:01):

I just watched @Thales's talk from last week, on CNL and Formal Abstracts. I was wondering about the following question: would it be interesting to attempt to pre-process the Stacks Project into something that is readable by the CNL? Is it reasonable to think that this is possible? (cc people that might know more about this: @Peter Koepke @Jesse Michael Han)

Johan Commelin (Jan 17 2020 at 07:08):

One reason for suggesting the Stacks Project is that it is written in a very strict subset of LaTeX

Johan Commelin (Jan 17 2020 at 07:12):

See https://github.com/stacks/stacks-project/blob/master/documentation/rules from which I will quote the 4th line:

Avoid reading this document at all cost!

Jesse Michael Han (Jan 17 2020 at 15:16):

it would be interesting, and definitely easier than preprocessing a less uniform corpus like all math pages from e.g. Wikipedia

besides working in a small subset of LaTeX, the CNL parser only accepts certain canned phrases in a certain order, so some NLP will be required to align the informal text with CNL equivalents
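
as a toy illustration of that alignment step, something like a rewrite table for common sentence shapes could be a starting point (the informal patterns and the "canned" output below are made up, not the actual CNL grammar):

```python
import re

# Hypothetical rewrites from informal LaTeX sentence shapes to a canned
# controlled form; the real CNL grammar is much richer than this.
REWRITES = [
    (re.compile(r"Let \$(\w+)\$ be an? (\w+)\."), r"Assume \1 is a \2."),
    (re.compile(r"Hence \$(\w+)\$ is (\w+)\."), r"Then \1 is \2."),
]

def to_cnl(sentence: str) -> str:
    for pattern, template in REWRITES:
        sentence = pattern.sub(template, sentence)
    return sentence

print(to_cnl("Let $X$ be a scheme."))  # Assume X is a scheme.
print(to_cnl("Hence $X$ is normal."))  # Then X is normal.
```

real sentences vary far too much for a fixed table like this, which is where the NLP comes in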

Johan Commelin (Jan 17 2020 at 15:20):

Yup, I guess it won't be trivial. But as informal texts go, the Stacks Project is written rather far toward the formal end of the spectrum.

Johan Commelin (Jan 17 2020 at 15:20):

That's another reason why I think it might be an interesting target.

Johan Commelin (Jan 17 2020 at 15:37):

A long time ago, I thought about a baby version of this problem. Together with @Josef Urban we tried to "linkify" the Stacks Project: whenever a concept was defined in SP, link back to it when it is used later. We used some very basic NLP, but nothing fancy. As far as I remember, the results were cute, but not impressive. Also: I had to write a PhD thesis, so I stopped working on it.

Johan Commelin (Jan 17 2020 at 15:40):

Unfortunately, I don't think I have any code left from that side project.

Alex J. Best (Jan 17 2020 at 15:41):

Stacks tags the term being defined in each definition too, which is helpful.

Johan Commelin (Jan 17 2020 at 15:43):

You mean with the pattern {\it some-word}?

Alex J. Best (Jan 17 2020 at 15:44):

Right, it's unfortunate that they don't use a more descriptive macro, but I guess that's counter to the whole philosophy of a strict subset of LaTeX.

Johan Commelin (Jan 17 2020 at 15:46):

I don't expect that you'll find many false positive {\it ...} patterns inside definition environments.

Johan Commelin (Jan 17 2020 at 15:46):

It is true that one definition environment might contain multiple definitions.
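
Something like this rough, untested sketch is what I'd imagine for harvesting the candidate terms (the regexes are simplistic and will miss nested braces):

```python
import re

# Harvest candidate defined terms: every {\it ...} group that occurs
# inside a definition environment.  Returns all matches, since one
# environment can define several terms.
DEFINITION = re.compile(r"\\begin\{definition\}(.*?)\\end\{definition\}",
                        re.DOTALL)
IT_GROUP = re.compile(r"\{\\it\s+([^{}]+)\}")

def defined_terms(latex: str) -> list[str]:
    return [" ".join(m.group(1).split())  # normalize whitespace
            for env in DEFINITION.finditer(latex)
            for m in IT_GROUP.finditer(env.group(1))]

src = r"""
\begin{definition}
Let $X$ be a scheme. We say $X$ is {\it normal} if all of its
local rings are normal domains.
\end{definition}
"""
print(defined_terms(src))  # ['normal']
```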

Alex J. Best (Jan 17 2020 at 15:46):

I have an even babier version of what you are talking about set up for my own notes (nothing fancy at all, just matching exact strings + pluralization), but it works passably. The main difficulties are stacked adjectives (complete separated noetherian scheme) and mathematical overloading (even Stacks has normal subgroups and normal schemes). I'd imagine your NLP approach would deal with the latter quite well?
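
A toy version of the matcher (a hypothetical sketch, not my actual notes code), which also shows the overloading failure in action:

```python
import re

def pattern_for(term: str) -> str:
    # Match the term or a naive plural ("scheme" / "schemes").
    return r"\b" + re.escape(term) + r"s?\b"

def linkify(text: str, anchors: dict[str, str]) -> str:
    # anchors maps a defined term to its tag, e.g. "normal" -> "#def-normal".
    # One combined regex, longest terms first, so each span is linked at
    # most once and "normal scheme" wins over plain "normal".
    terms = sorted(anchors, key=len, reverse=True)
    combined = re.compile("|".join("(%s)" % pattern_for(t) for t in terms))
    def link(m):
        term = terms[m.lastindex - 1]
        return "[%s](%s)" % (m.group(0), anchors[term])
    return combined.sub(link, text)

# "normal" was registered by the scheme definition, so the matcher happily
# links the "normal" in "normal subgroup" to it: the overloading bug.
print(linkify("X is a normal scheme and H is a normal subgroup",
              {"normal scheme": "#def-normal-scheme",
               "normal": "#def-normal-scheme"}))
```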

Johan Commelin (Jan 17 2020 at 15:47):

No, our NLP wasn't very NLP at all. IIRC we used something that Josef called "bag-of-words", which was some sort of statistical matching algorithm.

Johan Commelin (Jan 17 2020 at 15:49):

We didn't pair it with any typing information. So if you had "Let $X$ be a scheme. ... blabla ... Hence $X$ is normal.", it wouldn't figure out that normal meant scheme.normal (if there was no occurrence of the word scheme within the 10 closest words).

Johan Commelin (Jan 17 2020 at 15:49):

In fact, if group occurred nearby (for some reason), it would likely produce a false match.
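
For concreteness, this is the kind of typed context lookup we did not have (all names here are hypothetical):

```python
# Decide the sense of an ambiguous word from the 10 nearest tokens.
SENSES = {"normal": {"scheme": "scheme.normal", "group": "group.normal"}}

def disambiguate(tokens, i, window=10):
    word = tokens[i]
    if word not in SENSES:
        return None
    context = tokens[max(0, i - window):i + window + 1]
    # Pick the sense whose trigger word occurs most often nearby;
    # give up if no trigger occurs at all.
    trigger = max(SENSES[word], key=context.count)
    return SENSES[word][trigger] if trigger in context else None

tokens = "Let $X$ be a scheme . Hence $X$ is normal .".split()
print(disambiguate(tokens, tokens.index("normal")))  # scheme.normal
```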

Johan Commelin (Jan 17 2020 at 15:50):

Ideally, we could get a version of the Ganesalingam–BL parser. But that seems socially impossible.

Alex J. Best (Jan 17 2020 at 15:55):

Yeah, that would definitely be a boost. Let me know if you ever try anything like this; this problem is something I've wanted to do more on for a while.

Kevin Buzzard (Jan 17 2020 at 20:02):

Ideally, we could get a version of the Ganesalingam–BL parser. But that seems socially impossible.

I don't know anything about this parser. What does your second sentence mean?

Johan Commelin (Jan 17 2020 at 20:04):

Mohan Ganesalingam and TBL (yes, that TBL) have a parser that eats LaTeX and spits out some weakly typed abstract syntax tree.

Johan Commelin (Jan 17 2020 at 20:04):

But Mohan doesn't want to share the code.

Johan Commelin (Jan 17 2020 at 20:04):

Ask Tim Gowers for details.

Josef Urban (Jan 17 2020 at 20:41):

Hi,

I got an e-mail ping from Johan's post.

The short 2015 paper about our poor-man's Stacks hyper-linker is at
https://cicm-conference.org/2015/CICM2015-wip.pdf (pp. 19-23).

As for combining statistical parsing and typing, we have so far done
a couple of things in the inf2formal project:

  1. Learning probabilistic grammars from the corpora and adding native
    (HOL/Mizar) typing inside the probabilistic parsers (but also using
    the type checking and ATP ex post). See:
    https://doi.org/10.1007/978-3-319-66107-0_2
    https://doi.org/10.1109/SYNASC.2017.00036 and the Flyspeck-trained
    demo at http://grid01.ciirc.cvut.cz/~mptp/demo.ogv .

  2. We trained supervised neural translation models on a synthetic
    LaTeX-Mizar corpus: http://arxiv.org/abs/1805.06502 .

  3. Recently we have added unsupervised neural translation (so an aligned
    corpus is not needed) - https://arxiv.org/pdf/1912.02636 - and
    combined the neural translation with type-checking. This is so far
    done by a simple data-augmentation loop between the neural net and
    the Mizar elaborator (Section 5.4 of the paper; see the rough
    sketch after this list). The paper also shows that it is so far
    much harder to go from ProofWiki (informal) to Mizar. To do this
    for Stacks, it would be good to have at least some parts of Stacks
    formalized (in Lean or whatever). I believe this is happening
    (Kevin, Johan & Co.), so UMT (unsupervised machine translation)
    methods could probably be tried directly right now.

  4. We have also started to play with neural text summarization for
    e.g. conjecturing over Mizar and Stacks -
    http://grid01.ciirc.cvut.cz/~mptp/AITP_2020_paper_21.pdf . Again
    (unsurprisingly), Stacks looks quite a bit harder for such tasks
    (at least for the one we have chosen) than the nice, declarative
    Mizar ND proofs.
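
A very rough sketch of that augmentation loop, with all names
hypothetical:

```python
# Translate with the current model, keep what the Mizar elaborator
# type-checks, add the survivors to the training data, retrain, repeat
# (cf. Section 5.4 of https://arxiv.org/pdf/1912.02636).
def augmentation_loop(informal_stmts, model, elaborator, rounds=5):
    train_pairs = []
    for _ in range(rounds):
        for stmt in informal_stmts:
            for formal in model.translate(stmt):    # n-best candidates
                if elaborator.type_checks(formal):  # keep only what elaborates
                    train_pairs.append((stmt, formal))
        model = model.retrain(train_pairs)          # feed the survivors back
    return model
```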

Obviously, these are just the beginnings - the future of
autoformalization is bright and unavoidable (the naysayers can bet
against me - see IHP Challenge 3 at http://ai4reason.org/aichallenges.html ;-).

Josef
(I am an email-driven dinosaur, ping me again via @JosefUrban if needed)
(And come to http://aitp-conference.org/2020/ if interested in the topic)

Johan Commelin (Jan 17 2020 at 20:47):

@Josef Urban Oooh, I had completely forgotten about that wip document :face_palm: Thanks for reminding me (-;


Last updated: Dec 20 2023 at 11:08 UTC