Zulip Chat Archive

Stream: general

Topic: Jaccard distance

Bjørn Kjos-Hanssen (Dec 11 2020 at 22:26):

I've finally completed a proof that finset \N with the Jaccard distance https://en.wikipedia.org/wiki/Jaccard_index is indeed a metric_space.
I actually proved a more general result from my own research (perhaps as a result of my old-fashioned attitude that all work done has to be related to new research papers :grinning: ).

Usually Jaccard distance would be defined by cases: there is a certain formula that works in all cases except the distance between the empty set and itself. However, since Lean says 0/0=0 :innocent:, I was able to avoid that case distinction and work within a "group with zero".

I know there is a whole process for getting code ready for mathlib, but is there anything particular I should be aware of --- like is there anything in mathlib that is at all similar to Jaccard distance?

Bjørn Kjos-Hanssen (Dec 11 2020 at 22:36):

The code is 900 lines right now. Surely it could be reduced significantly by doing less reinventing of the wheel.

Bjørn Kjos-Hanssen (Dec 11 2020 at 22:43):

Thanks are due to @Johan Commelin, @Kyle Miller, @Pedro Minicz, @Reid Barton, @Scott Morrison, @Heather Macbeth for significant help on this project.

Kevin Buzzard (Dec 11 2020 at 23:38):

I jettisoned the idea that all research work had to be related to new research papers and now I just work with the principle that all research work has to be fun. However I am in an extremely privileged position. Have you ever heard of groups with zero before seeing them in mathlib? They are something which came out of the perfectoid project and I always credit Johan Commelin with the abstraction but I would be interested to know if there was more history. Well done on the project, stuff like this is a great way to learn Lean. I could imagine that something like this could end up in mathlib although it's hard work getting code up to the standard the maintainers require, it's not something which can be taken lightly, as I discovered the hard way

Bryan Gin-ge Chen (Dec 12 2020 at 00:10):

Have you ever heard of groups with zero before seeing them in mathlib?

I probably wouldn't have given this passage a second thought if I hadn't heard about Johan's work here, but Emil Artin defines them in the first chapter of Geometric Algebra (published 1957): Screen-Shot-2020-12-11-at-7.07.53-PM.png

Bryan Gin-ge Chen (Dec 12 2020 at 00:14):

Back when Johan was PRing this stuff to mathlib, I also posted some links to some papers by Lorscheid on "the field with one element" where "monoids with zero" show up.

Adam Topaz (Dec 12 2020 at 00:15):

A hyperfield is a commutative multiplicative group with zero together with blah blah blah

Bjørn Kjos-Hanssen (Dec 12 2020 at 00:16):

@Kevin Buzzard Yes even the fact that the mathlib instructions say that "tabs are 2 spaces" makes me question whether that's worth it. :grinning:

Aaron Anderson (Dec 19 2020 at 18:23):

I'm curious to see the code, to evaluate how hard I'd think it'd be to get into mathlib...

Aaron Anderson (Dec 19 2020 at 18:38):

...and also to evaluate how hard it'd be to extend it to general measure spaces, but I'm probably not familiar enough with Lean measure theory to make that evaluation.

Bjørn Kjos-Hanssen (Dec 20 2020 at 06:04):

Great @Aaron Anderson I almost missed this reply but I can post it on github!

I have no idea about measure theory in Lean but yes, we should be able to replace the counting measure on finite sets by something more general.

Actually, for some entropy related application I'm interested in a version involving certain signed measures.

Bjørn Kjos-Hanssen (Dec 20 2020 at 06:38):

Here it is: https://github.com/bjoernkjoshanssen/jaccard-nid :construction_worker:

Bjørn Kjos-Hanssen (Dec 22 2020 at 23:36):

So, any obvious things that should be done to it? @Aaron Anderson Feel free to "roast" my code!

Alex J. Best (Dec 22 2020 at 23:54):

One thing to do is make the repo an actual lean project, as described on https://leanprover-community.github.io/install/project.html

Alex J. Best (Dec 22 2020 at 23:55):

This will track the dependency of the exact version of mathlib you are using, and make it easier for others to do leanproject get to get a copy of your code running in vscode.

Bjørn Kjos-Hanssen (Dec 22 2020 at 23:57):

Thanks @Alex J. Best

Alex J. Best (Dec 22 2020 at 23:58):

Whats the reasoning for using ring2 it should be possible to replace all of them with ring

Alex J. Best (Dec 22 2020 at 23:59):

(ring2 is not used in mathlib outside of the file it is defined in, its fairly experimental)

Alex J. Best (Dec 23 2020 at 00:01):

If you have multiple let .. in chained you can combine them like

    let x := |X|, y := |Y|, z := |Z|, x_z := |X\Z|, z_x := |Z\X|, y_z := |Y\Z|, z_y := |Z\Y|, x_y := |X\Y|, y_x := |Y\X| in

in place of line 212 of delta.lean

Alex J. Best (Dec 23 2020 at 00:03):

You should not need to do from begin ..., and by begin the words from and by can be removed there.

Alex J. Best (Dec 23 2020 at 00:04):

Likewise, with from by

Alex J. Best (Dec 23 2020 at 00:07):

Theorem seventeen takes ages to compile for me, probably because of all the invocations of finish and tidy, tidy can be replaced with the output of tidy? (which can then often be simplified)

Alex J. Best (Dec 23 2020 at 00:08):

        dist_comm          := λ x y, calc δ m M x y = δ m M y x: delta_comm,

is really just

        dist_comm          := λ x y, delta_comm,

Bjørn Kjos-Hanssen (Dec 23 2020 at 00:08):

Alex J. Best said:

Whats the reasoning for using ring2 it should be possible to replace all of them with ring

Thanks, it seems you're right. (Maybe I just thought ring2 was "the big guns" and it was better to use that.)

Alex J. Best (Dec 23 2020 at 00:09):

ring_exp is the big guns I guess, in that thats what I try if ring fails me :smile:

Alex J. Best (Dec 23 2020 at 00:13):

If you turn on the profiler with set_option profiler true you can see which of the finishes / tidys are taking the most time.

Alex J. Best (Dec 23 2020 at 00:15):

So the tactic#norm_num works in place of tidy in all uses in seventeen, and reduces the time massively

Bjørn Kjos-Hanssen (Dec 23 2020 at 00:27):

I would have found a tactic called ordered_ring useful.

Alex J. Best (Dec 23 2020 at 00:28):

That's probably tactic#linarith ?

Alex J. Best (Dec 23 2020 at 00:29):

Or its friend tactic#nlinarith, they work over linear ordered rings

Bjørn Kjos-Hanssen (Dec 23 2020 at 00:37):

Alex J. Best said:

You should not need to do from begin ..., and by begin the words from and by can be removed there.

Thanks. Also from by apparently is not needed, that always did sound strange...

Bjørn Kjos-Hanssen (Dec 23 2020 at 00:51):

Is there a running time with set_option profiler true that's considered "acceptable"?

Mario Carneiro (Dec 23 2020 at 00:54):

that depends on what you are doing

Mario Carneiro (Dec 23 2020 at 00:55):

for a proof that goes into mathlib I would try for under a second, but up to 5-10 seconds seems ok for a big proof

Mario Carneiro (Dec 23 2020 at 00:56):

The -T50000 challenge uses a particular time bound at the command line to measure this instead of profiler

Mario Carneiro (Dec 23 2020 at 00:57):

For something like tidy that is intended to find a proof but not for archival I don't think there needs to be any bound besides your own patience

Bjørn Kjos-Hanssen (Dec 23 2020 at 01:01):

So maybe there is a list of tactics that should not be used in mathlib code? Or is it just: "whichever ones run slowly"

Mario Carneiro (Dec 23 2020 at 01:04):

I don't think there are any actual banned tactics, but simp, omega, finish, tidy, linarith are the usual suspects if you have a slow proof

Bryan Gin-ge Chen (Dec 23 2020 at 01:05):

Tactics that emit messages, like library_search, squeeze_simp are effectively banned since mathlib files have to be silent when they compile.

Bjørn Kjos-Hanssen (Dec 23 2020 at 01:12):

I noticed that when I #print a theorem proved by a tactic like tidy, the proof printed can't just be cut and pasted into my code --- that gives some errors about failing to synthesize type class instances.

Mario Carneiro (Dec 23 2020 at 01:28):

that's a general problem with pretty printing - it may potentially hide information that is necessary for faithful reproduction as lean source code

Mario Carneiro (Dec 23 2020 at 01:28):

You can mostly solve this using pp.all (although the result probably isn't something you actually want to put in a mathlib file), although there are some things that even pp.all can't roundtrip unfortunately

Bryan Gin-ge Chen (Dec 23 2020 at 01:32):

You can wrap your tactic in show_term { ... } to see the term that the tactic constructs (see tactic#show_term). In many cases the term is much too large and complex to be of any use though.

Rob Lewis (Dec 23 2020 at 01:47):

tidy? will give you a more useful trace. Not the term it generates but the sequence of tactics it tried.

Kevin Buzzard (Dec 23 2020 at 12:07):

You get the hang of the most common round trip failures and how to deal with them after a while, for example \u a often needs to be changed to (a : X) and widgets can tell you X

Bjørn Kjos-Hanssen (Dec 24 2020 at 01:05):

Kevin Buzzard said:

You get the hang of the most common round trip failures and how to deal with them after a while, for example \u a often needs to be changed to (a : X) and widgets can tell you X

Thanks, but now I'm curious what a "round trip failure" is? Casting from N to Q and back to N? I see some mentions of "round trip" in this Zulip but they didn't really help.

Bryan Gin-ge Chen (Dec 24 2020 at 01:08):

In this case it's referring to Lean pretty-printing some expression, which we then try to get Lean to accept as input.

Bjørn Kjos-Hanssen (Dec 26 2020 at 06:51):

Mario Carneiro said:

I don't think there are any actual banned tactics, but simp, omega, finish, tidy, linarith are the usual suspects if you have a slow proof

Thanks. So cc is not included in this list?

Johan Commelin (Dec 26 2020 at 07:00):

@Bjørn Kjos-Hanssen cc can occassionaly also be slow. But I think it's not a "usual suspect" because it's not used that much in mathlib...

Johan Commelin (Dec 26 2020 at 07:00):

Maybe it can be used more often... but it just doesn't seem to be the case.

Bjørn Kjos-Hanssen (Dec 26 2020 at 07:49):

Thanks. I'm noticing a difference, after getting rid of finish and tidy I can now use set_option profiler true on the whole file without server crashing problems.

Bjørn Kjos-Hanssen (Dec 30 2020 at 07:02):

Alex J. Best said:

One thing to do is make the repo an actual lean project, as described on https://leanprover-community.github.io/install/project.html

OK I think I have done that now, it's at https://github.com/bjoernkjoshanssen/jaccard thanks again. Haven't tried to make it comply with the Style Guide.

Bjørn Kjos-Hanssen (Nov 10 2021 at 21:50):

Now on arXiv with linked Lean "Code & Data" https://arxiv.org/abs/2111.02498

Heather Macbeth (Nov 10 2021 at 21:52):

Congratulations! Perhaps you could make a PR adding it to
https://leanprover-community.github.io/papers.html
also?

Heather Macbeth (Nov 10 2021 at 21:53):

Is this our first (Lean) math paper formalized at time of publication?

Heather Macbeth (Nov 10 2021 at 21:55):

Perhaps not, I see there's also https://arxiv.org/abs/1907.07801

Bjørn Kjos-Hanssen (Nov 10 2021 at 22:20):

@Heather Macbeth thanks! Do I need access/permission to make a PR for that page?

Bryan Gin-ge Chen (Nov 10 2021 at 22:56):

@Bjørn Kjos-Hanssen nope, you can go ahead and make a PR from a fork.

Kevin Buzzard (Nov 10 2021 at 23:10):

Only part of @Neil Strickland 's paper with Nicola Bellumat was formalised in Lean IIRC

Bjørn Kjos-Hanssen (Nov 10 2021 at 23:31):

@Bryan Gin-ge Chen how does it work? I haven't done any PR :pensive:

Bryan Gin-ge Chen (Nov 10 2021 at 23:33):

You can try clicking the edit button on this page and adding your paper to the bib file. (We can normalize the formatting afterwards.)

Bjørn Kjos-Hanssen (Nov 10 2021 at 23:47):

@Bryan Gin-ge Chen thanks, I did it now... I think

Bryan Gin-ge Chen (Nov 10 2021 at 23:57):

Bjørn Kjos-Hanssen said:

Bryan Gin-ge Chen thanks, I did it now... I think

I see that you made the changes in a fork, now you can make a PR to the main repository by visiting this link: https://github.com/leanprover-community/leanprover-community.github.io/compare/newsite...bjoernkjoshanssen:patch-1?expand=1

Bjørn Kjos-Hanssen (Nov 11 2021 at 00:08):

Ah, okay... should be done now

Bjørn Kjos-Hanssen (Nov 11 2021 at 00:15):

And is there another step to actually add the paper to https://leanprover-community.github.io/papers.html ?

Bryan Gin-ge Chen (Nov 11 2021 at 00:16):

It will take some time for the website build to finish, but the paper should show up soon afterwards.

Bjørn Kjos-Hanssen (Nov 11 2021 at 00:17):

It's automatized via the tags formalization and lean3 in the BiBTeX entry I guess

Bryan Gin-ge Chen (Nov 11 2021 at 00:18):

I added those manually just now :)

Bjørn Kjos-Hanssen (Nov 11 2021 at 00:21):

Beat you to it :) So now there are two tags lines... should I try to make another change, or is it better to roll-back yours?

Bryan Gin-ge Chen (Nov 11 2021 at 00:21):

Yeah, whoops, I missed that you had already added tags to the BibTeX entry! I've removed the duplicate tags field so it should hopefully be fixed now.

Bjørn Kjos-Hanssen (Nov 11 2021 at 00:22):

yes, thanks @Bryan

Heather Macbeth (Nov 11 2021 at 01:12):

Maybe we should make a new category "math papers with accompanying Lean formalization"!

Heather Macbeth (Nov 11 2021 at 01:12):

Or maybe we should wait til there are 3 examples rather than 2.

Floris van Doorn (Nov 11 2021 at 11:48):

Really cool!
I recommend that you add some references to Lean-related papers, like https://link.springer.com/chapter/10.1007%2F978-3-319-21401-6_26 and https://dl.acm.org/doi/10.1145/3372885.3373824.

Bjørn Kjos-Hanssen (Mar 30 2022 at 21:57):

Kevin Buzzard said:

I jettisoned the idea that all research work had to be related to new research papers and now I just work with the principle that all research work has to be fun. However I am in an extremely privileged position. Have you ever heard of groups with zero before seeing them in mathlib? They are something which came out of the perfectoid project and I always credit Johan Commelin with the abstraction but I would be interested to know if there was more history. Well done on the project, stuff like this is a great way to learn Lean. I could imagine that something like this could end up in mathlib although it's hard work getting code up to the standard the maintainers require, it's not something which can be taken lightly, as I discovered the hard way

Alas now I've found an easier proof, so would want to revise the formalization rather thoroughly... so this advice of not trying to get it into mathlib standards right away was on-point :pray:

Bjørn Kjos-Hanssen (Jun 12 2024 at 23:45):

For what it's worth, here's a new version for Lean 4 that explicitly proves that the Jaccard distance is a metric: https://github.com/bjoernkjoshanssen/jaccard4/blob/main/D.lean

Eric Wieser (Jun 12 2024 at 23:47):

Perhaps I'm missing something, but you have in the docstring:

In this file we give a formal proof that in terms of
$d(X,Y)= m \min(|X\setminus Y|, |Y\setminus X|) + M \max(|X\setminus Y|, |Y\setminus X|)$
the function
$D(X,Y) = d(X,Y)/(|X ∩ Y|+d(X,Y))$
is a metric if and only if m ≤ M and 1 ≤ M.

Where is the "only if" direction?

Bjørn Kjos-Hanssen (Jun 12 2024 at 23:55):

Eric Wieser said:

Perhaps I'm missing something, but you have in the docstring:

In this file we give a formal proof that in terms of
$d(X,Y)= m \min(|X\setminus Y|, |Y\setminus X|) + M \max(|X\setminus Y|, |Y\setminus X|)$
the function
$D(X,Y) = d(X,Y)/(|X ∩ Y|+d(X,Y))$
is a metric if and only if m ≤ M and 1 ≤ M.

Where is the "only if" direction?

Hmm, good point (although it does not affect the claim about the Jaccard distance); there seems to be only a result in that direction in theorem seventeen in the imported file https://github.com/bjoernkjoshanssen/jaccard4/blob/main/delta.lean

Bjørn Kjos-Hanssen (Jun 12 2024 at 23:59):

The full claim is proved (modulo a change of variables) here: https://arxiv.org/abs/2111.02498 Anyway, I fixed the docstring now to make the more modest claim.

Last updated: May 02 2025 at 03:31 UTC