Zulip Chat Archive

Stream: Machine Learning for Theorem Proving

Topic: Facebook Transcoder for Lean 4 port

Lars Ericson (May 27 2023 at 22:08):

How about training Facebook Transcoder on Lean 3 to Lean 4? It is open source.

Jason Rute (May 27 2023 at 22:12):

At this point since 75% of mathlib is translated, it would possibly be better to do an aligned translation. Also given that they are 75% done it would likely be more academic than practical. I’d like to see this on translating say Isabelle to Lean. But ideally we need better tools which can translate a whole development including a whole project with definitions, theorems etc.

Lars Ericson (May 28 2023 at 00:13):

I guess it's too close at this point, latest report is

Status Count Percent
Unported files 702/2995 (76.6% of total)
Unported lines 261216/1031509 (74.7% of total)
Longest unported chain 22/115 (80.9% progress)

so there is still a lot to do but not an impossible amount, so practically speaking not worth the effort.

Also the number of lines of Lean 3 code in existence is several orders of magnitude smaller than what was used to train for mainstream leanguages, so...never mind. It was just a thought. (This is from Google Bard but I trust it well enough, at least on this topic.):

Language Pair Translator Name Number of Lines of Code
Python to JavaScript Facebook Transcoder 100 million
Java to C++ Google Translate 50 million
C# to Ruby Microsoft Translator 25 million
PHP to Go DeepL 10 million
Swift to Kotlin Yandex Translate 5 million

The Google Bard queries to get this were

  • What open source code translation models are similar to Facebook Transcoder?
  • Please make a table with the language pair, the translator name, and the number of lines of code used to translate.
  • Can you give me that in Zulip markdown?

Jason Rute (May 28 2023 at 02:52):

Lars Ericson said:

(This is from Google Bard but I trust it well enough, at least on this topic.)

Really? I have a hard time believing both google translate is used for programming languages, and if it were that the amount of training data would be public knowledge.

Jason Rute (May 28 2023 at 03:38):

None of these appear to be translators for code.

Lars Ericson (May 28 2023 at 13:08):

@Jason Rute you are right the table is mostly a hallucination except for Facebook Transcoder which is legit. GPT hallucinations can be very hazardous to a career. In the case of Lean, the risks are that the resulting translations are harder to port than the current output of mathport, or that whoever tried the project discovers that there isn't enough training data to get useful output.

Mario Carneiro (May 28 2023 at 14:10):

To me the solution is obvious: don't trust an LLM to give factual information. Stick to interpretive and creative enterprises, or things where the facts are checked as part of the work (e.g. theorem proving)

Last updated: Dec 20 2023 at 11:08 UTC