Zulip Chat Archive

Stream: general

Topic: Lean, machine learning, and internships


John Tristan (Nov 04 2019 at 16:21):

Hi everyone,

Last summer, Koundinya (Kody) Vajjha, Joseph Tassarotti, and I have been using Lean to formally prove a (very basic) result
in machine learning theory: that decision stumps are PAC learnable. I'm a machine learning researcher working in the industry and
increasingly I have been feeling the need for more trustworthy machine learning applications. I started this project as an experiment
to see how far we could get in a summer. We chose Lean because I had prior experience with Coq and we needed to build on the
measure theory available in mathlib. It was a very positive experience, we finished the proof, and I really enjoyed working with Lean.

If you're interested, you can find the proof: https://github.com/jtristan/stump-learnable/wiki/A-formal-proof-that-decisions-stumps-are-PAC-learnable. It was my first experience with Lean and I had not used a proof assistant in many years so I am aware that my proofs could use a lot of cleaning. The part of the development written by Kody is of course in much better shape.

We also put a paper on arxiv describing our work: https://arxiv.org/pdf/1911.00385.pdf

As a side note, I'm thinking of doing some more work on using Lean for machine learning and I am looking for an intern.
(to join me at Oracle Labs in Boston in the summer of 2020)
Things we could work on include:

  • Generalizing the proof to higher dimensions, by using the Giry monad as much as possible
  • Proving Hoeffding inequality (a common tool in ML)
  • Working our way toward proving the VC inequality
  • Work related to fairness in machine learning

If you're interested please send me an email! (https://jtristan.github.io/)

Johan Commelin (Nov 04 2019 at 16:23):

Nice work! Are you considering PRing the mathematical bits of your repo to mathlib?

Chris Hughes (Nov 04 2019 at 16:27):

There's a to_mathlib file in the repo.

John Tristan (Nov 04 2019 at 16:27):

Yes, we would certainly like to do so.

John Tristan (Nov 04 2019 at 16:30):

We split the proof in two folders for that reason. The results in the lib folder are more general and could potentially be contributed. The results in the stump folder are very specific to the machine learning result.

Koundinya Vajjha (Nov 04 2019 at 16:33):

The to_mathlib file contains a few lemmas on integration, the Markov and Chebyshev inequalities, lemmas about the Giry monad on probability measures.

Koundinya Vajjha (Nov 04 2019 at 16:33):

I also used the Dynkin pi-lambda theorem to prove that the product measure constructed out of the Giry monad is in fact a product measure.

Koundinya Vajjha (Nov 04 2019 at 16:34):

None of this would have been possible without Johannes' work. Many thanks to him!

Jesse Michael Han (Nov 04 2019 at 20:48):

i'm amused thatflypitch's library on dvectors has been replicated verbatim and copyrighted by Oracle: https://github.com/jtristan/stump-learnable/blob/master/src/lib/dvector.lean

Koundinya Vajjha (Nov 04 2019 at 21:35):

That file is from the formalabstracts repo. https://github.com/formalabstracts/formalabstracts/blob/master/src/data/dvector.lean

NOT flypitch.

I am a contributing member of formalabstracts and the license for formalabstracts allows reuse.

Jesse Michael Han (Nov 04 2019 at 21:48):

yes, that file is also copied verbatim from flypitch

Mario Carneiro (Nov 04 2019 at 21:50):

I don't think apache license supports copying without attribution (and relicensing under oracle seems especially problematic)

Mario Carneiro (Nov 04 2019 at 21:56):

ah, correction, it seems stump-learnable is apache licensed but oracle copyrighted

Floris van Doorn (Nov 04 2019 at 21:56):

Relicensing seems allowed:

You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.

https://tldrlegal.com/license/apache-license-2.0-(apache-2.0)#fulltext

John Tristan (Nov 04 2019 at 21:58):

Well, this is a mistake on our part and we will fix it. I didn't realize that some code was copied from another project (I didn't work on that file), but granted, I should have. I will work with Kody to clearly separate the code that was copied and give it the proper attribution.

Mario Carneiro (Nov 04 2019 at 21:59):

This wouldn't have been a problem if dvector was PR'd to mathlib :P

David Michael Roberts (Nov 04 2019 at 22:39):

@John Tristan , maybe it's standard in your area, but in the sentence in the paper

The concept class C is the subset of X^{0,1}
it took me a while to realise that X^{0,1} meant functions from X to {0,1} (the indicator function notation was not a problem, the lambda gave me no clues as to the domain :-). As written it usually means functions {0,1} to X, in mathematical contexts.

Joseph Tassarotti (Nov 04 2019 at 22:41):

John Tristan , maybe it's standard in your area, but in the sentence in the paper

The concept class C is the subset of X^{0,1}
it took me a while to realise that X^{0,1} meant functions from X to {0,1} (the indicator function notation was not a problem, the lambda gave me no clues as to the domain :-). As written it usually means functions {0,1} to X, in mathematical contexts.

Indeed, that's a very confusing typo, thanks!


Last updated: Dec 20 2023 at 11:08 UTC