Zulip Chat Archive

Stream: general

Topic: Jaccard distance


view this post on Zulip Bjørn Kjos-Hanssen (Dec 11 2020 at 22:26):

I've finally completed a proof that finset \N with the Jaccard distance https://en.wikipedia.org/wiki/Jaccard_index is indeed a metric_space.
I actually proved a more general result from my own research (perhaps as a result of my old-fashioned attitude that all work done has to be related to new research papers :grinning: ).

Usually Jaccard distance would be defined by cases: there is a certain formula that works in all cases except the distance between the empty set and itself. However, since Lean says 0/0=0 :innocent:, I was able to avoid that case distinction and work within a "group with zero".

I know there is a whole process for getting code ready for mathlib, but is there anything particular I should be aware of --- like is there anything in mathlib that is at all similar to Jaccard distance?

view this post on Zulip Bjørn Kjos-Hanssen (Dec 11 2020 at 22:36):

The code is 900 lines right now. Surely it could be reduced significantly by doing less reinventing of the wheel.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 11 2020 at 22:43):

Thanks are due to @Johan Commelin, @Kyle Miller, @Pedro Minicz, @Reid Barton, @Scott Morrison, @Heather Macbeth for significant help on this project.

view this post on Zulip Kevin Buzzard (Dec 11 2020 at 23:38):

I jettisoned the idea that all research work had to be related to new research papers and now I just work with the principle that all research work has to be fun. However I am in an extremely privileged position. Have you ever heard of groups with zero before seeing them in mathlib? They are something which came out of the perfectoid project and I always credit Johan Commelin with the abstraction but I would be interested to know if there was more history. Well done on the project, stuff like this is a great way to learn Lean. I could imagine that something like this could end up in mathlib although it's hard work getting code up to the standard the maintainers require, it's not something which can be taken lightly, as I discovered the hard way

view this post on Zulip Bryan Gin-ge Chen (Dec 12 2020 at 00:10):

Have you ever heard of groups with zero before seeing them in mathlib?

I probably wouldn't have given this passage a second thought if I hadn't heard about Johan's work here, but Emil Artin defines them in the first chapter of Geometric Algebra (published 1957): Screen-Shot-2020-12-11-at-7.07.53-PM.png

view this post on Zulip Bryan Gin-ge Chen (Dec 12 2020 at 00:14):

Back when Johan was PRing this stuff to mathlib, I also posted some links to some papers by Lorscheid on "the field with one element" where "monoids with zero" show up.

view this post on Zulip Adam Topaz (Dec 12 2020 at 00:15):

A hyperfield is a commutative multiplicative group with zero together with blah blah blah

view this post on Zulip Bjørn Kjos-Hanssen (Dec 12 2020 at 00:16):

@Kevin Buzzard Yes even the fact that the mathlib instructions say that "tabs are 2 spaces" makes me question whether that's worth it. :grinning:

view this post on Zulip Aaron Anderson (Dec 19 2020 at 18:23):

I'm curious to see the code, to evaluate how hard I'd think it'd be to get into mathlib...

view this post on Zulip Aaron Anderson (Dec 19 2020 at 18:38):

...and also to evaluate how hard it'd be to extend it to general measure spaces, but I'm probably not familiar enough with Lean measure theory to make that evaluation.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 20 2020 at 06:04):

Great @Aaron Anderson I almost missed this reply but I can post it on github!

I have no idea about measure theory in Lean but yes, we should be able to replace the counting measure on finite sets by something more general.

Actually, for some entropy related application I'm interested in a version involving certain signed measures.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 20 2020 at 06:38):

Here it is: https://github.com/bjoernkjoshanssen/jaccard-nid :construction_worker:

view this post on Zulip Bjørn Kjos-Hanssen (Dec 22 2020 at 23:36):

So, any obvious things that should be done to it? @Aaron Anderson Feel free to "roast" my code!

view this post on Zulip Alex J. Best (Dec 22 2020 at 23:54):

One thing to do is make the repo an actual lean project, as described on https://leanprover-community.github.io/install/project.html

view this post on Zulip Alex J. Best (Dec 22 2020 at 23:55):

This will track the dependency of the exact version of mathlib you are using, and make it easier for others to do leanproject get to get a copy of your code running in vscode.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 22 2020 at 23:57):

Thanks @Alex J. Best

view this post on Zulip Alex J. Best (Dec 22 2020 at 23:58):

Whats the reasoning for using ring2 it should be possible to replace all of them with ring

view this post on Zulip Alex J. Best (Dec 22 2020 at 23:59):

(ring2 is not used in mathlib outside of the file it is defined in, its fairly experimental)

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:01):

If you have multiple let .. in chained you can combine them like

    let x := |X|, y := |Y|, z := |Z|, x_z := |X\Z|, z_x := |Z\X|, y_z := |Y\Z|, z_y := |Z\Y|, x_y := |X\Y|, y_x := |Y\X| in

in place of line 212 of delta.lean

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:03):

You should not need to do from begin ..., and by begin the words from and by can be removed there.

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:04):

Likewise, with from by

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:07):

Theorem seventeen takes ages to compile for me, probably because of all the invocations of finish and tidy, tidy can be replaced with the output of tidy? (which can then often be simplified)

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:08):

        dist_comm          := λ x y, calc δ m M x y = δ m M y x: delta_comm,

is really just

        dist_comm          := λ x y, delta_comm,

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 00:08):

Alex J. Best said:

Whats the reasoning for using ring2 it should be possible to replace all of them with ring

Thanks, it seems you're right. (Maybe I just thought ring2 was "the big guns" and it was better to use that.)

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:09):

ring_exp is the big guns I guess, in that thats what I try if ring fails me :smile:

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:13):

If you turn on the profiler with set_option profiler true you can see which of the finishes / tidys are taking the most time.

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:15):

So the tactic#norm_num works in place of tidy in all uses in seventeen, and reduces the time massively

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 00:27):

I would have found a tactic called ordered_ring useful.

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:28):

That's probably tactic#linarith ?

view this post on Zulip Alex J. Best (Dec 23 2020 at 00:29):

Or its friend tactic#nlinarith, they work over linear ordered rings

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 00:37):

Alex J. Best said:

You should not need to do from begin ..., and by begin the words from and by can be removed there.

Thanks. Also from by apparently is not needed, that always did sound strange...

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 00:51):

Is there a running time with set_option profiler true that's considered "acceptable"?

view this post on Zulip Mario Carneiro (Dec 23 2020 at 00:54):

that depends on what you are doing

view this post on Zulip Mario Carneiro (Dec 23 2020 at 00:55):

for a proof that goes into mathlib I would try for under a second, but up to 5-10 seconds seems ok for a big proof

view this post on Zulip Mario Carneiro (Dec 23 2020 at 00:56):

The -T50000 challenge uses a particular time bound at the command line to measure this instead of profiler

view this post on Zulip Mario Carneiro (Dec 23 2020 at 00:57):

For something like tidy that is intended to find a proof but not for archival I don't think there needs to be any bound besides your own patience

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 01:01):

So maybe there is a list of tactics that should not be used in mathlib code? Or is it just: "whichever ones run slowly"

view this post on Zulip Mario Carneiro (Dec 23 2020 at 01:04):

I don't think there are any actual banned tactics, but simp, omega, finish, tidy, linarith are the usual suspects if you have a slow proof

view this post on Zulip Bryan Gin-ge Chen (Dec 23 2020 at 01:05):

Tactics that emit messages, like library_search, squeeze_simp are effectively banned since mathlib files have to be silent when they compile.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 23 2020 at 01:12):

I noticed that when I #print a theorem proved by a tactic like tidy, the proof printed can't just be cut and pasted into my code --- that gives some errors about failing to synthesize type class instances.

view this post on Zulip Mario Carneiro (Dec 23 2020 at 01:28):

that's a general problem with pretty printing - it may potentially hide information that is necessary for faithful reproduction as lean source code

view this post on Zulip Mario Carneiro (Dec 23 2020 at 01:28):

You can mostly solve this using pp.all (although the result probably isn't something you actually want to put in a mathlib file), although there are some things that even pp.all can't roundtrip unfortunately

view this post on Zulip Bryan Gin-ge Chen (Dec 23 2020 at 01:32):

You can wrap your tactic in show_term { ... } to see the term that the tactic constructs (see tactic#show_term). In many cases the term is much too large and complex to be of any use though.

view this post on Zulip Rob Lewis (Dec 23 2020 at 01:47):

tidy? will give you a more useful trace. Not the term it generates but the sequence of tactics it tried.

view this post on Zulip Kevin Buzzard (Dec 23 2020 at 12:07):

You get the hang of the most common round trip failures and how to deal with them after a while, for example \u a often needs to be changed to (a : X) and widgets can tell you X

view this post on Zulip Bjørn Kjos-Hanssen (Dec 24 2020 at 01:05):

Kevin Buzzard said:

You get the hang of the most common round trip failures and how to deal with them after a while, for example \u a often needs to be changed to (a : X) and widgets can tell you X

Thanks, but now I'm curious what a "round trip failure" is? Casting from N to Q and back to N? I see some mentions of "round trip" in this Zulip but they didn't really help.

view this post on Zulip Bryan Gin-ge Chen (Dec 24 2020 at 01:08):

In this case it's referring to Lean pretty-printing some expression, which we then try to get Lean to accept as input.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 26 2020 at 06:51):

Mario Carneiro said:

I don't think there are any actual banned tactics, but simp, omega, finish, tidy, linarith are the usual suspects if you have a slow proof

Thanks. So cc is not included in this list?

view this post on Zulip Johan Commelin (Dec 26 2020 at 07:00):

@Bjørn Kjos-Hanssen cc can occassionaly also be slow. But I think it's not a "usual suspect" because it's not used that much in mathlib...

view this post on Zulip Johan Commelin (Dec 26 2020 at 07:00):

Maybe it can be used more often... but it just doesn't seem to be the case.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 26 2020 at 07:49):

Thanks. I'm noticing a difference, after getting rid of finish and tidy I can now use set_option profiler true on the whole file without server crashing problems.

view this post on Zulip Bjørn Kjos-Hanssen (Dec 30 2020 at 07:02):

Alex J. Best said:

One thing to do is make the repo an actual lean project, as described on https://leanprover-community.github.io/install/project.html

OK I think I have done that now, it's at https://github.com/bjoernkjoshanssen/jaccard thanks again. Haven't tried to make it comply with the Style Guide.


Last updated: May 08 2021 at 18:17 UTC