Zulip Chat Archive

Stream: mathlib4

Topic: Should data structures theory sit in mathlib?

Julien Michel (Mar 16 2024 at 11:25):

In mathlib, there is a file on binary trees: Data/Tree.lean and a file on array based heaps: Data/BinaryHeap.lean. Should they be moved to std or core, or stay in mathlib ?
For heaps, there are different representations. The array based one in Data/BinaryHeap.lean seems concerned about performance and feels like it should be in std. But a heap could be defined more abstractly with a tree representation, which might be a good choice to study its theory in mathlib. I'm really unsure about where binary trees should sit.

How to decide where a data structure theorem should go?

Eric Wieser (Mar 16 2024 at 11:57):

I think in general it's fine to PR any such theorem to mathlib, even if eventually things get moved elsewhere

Julien Michel (Mar 16 2024 at 13:25):

It was said on the other topic that "data structures" should probably not go into mathlib.
I think an issue, is that categorizing concepts as being or not being a "data structure" feels a bit meaningless especially in lean's type theory.
Graphs are often qualified as "data structures" by programmers who solve engineering problems. But graphs are also in mathlib because they are mathematically interesting.
I believe many simple tree structures from CS have an elegant definition and are mathematically interesting to study, so why should they not be part of mathlib ? Only because they've been labeled a "data structure" ?

Thanks for your thoughts or clarifications

Notification Bot (Mar 16 2024 at 19:46):

A message was moved here from #mathlib4 > github permission by Julien Michel.

Kevin Buzzard (Mar 16 2024 at 20:54):

You raise interesting questions. I have what you might perceive as a very narrow view of mathematics. In my undergraduate maths degree I did a course on graph theory but I didn't do any course on abstract data structures and hence I regard them as computer science. I have no idea what made my university draw the line where they drew it. It's the job of the mathlib maintainers to decide where they want to draw the line here.

Kim Morrison (Mar 16 2024 at 21:15):

Sorry, I didn't intend to say that they were out of scope. Rather that work that is useful outside of mathematical or theoretical applications should ideally go outside of Mathlib!

Mathlib is gigantic, and makes little attempt at this point to be backwards compatible (not that Lean itself does better!), and freely occupies the root namespace. All of these as practical matters make it difficult for some projects to consider having Mathlib as a dependency,

Hence if "data structures" if interpreted as generally useful things like Std's RBMap, then Mathlib is not the best home.

Julien Michel (Mar 16 2024 at 21:16):

I think it's just that most of these data structures have a different form depending on if you want to reason about them (mathlib would be a nice place), or if you want to use them to build fast software (Std would be a nice place).
That's how I would personally draw the line (but I'm curious what other people think).

A basic example I mentioned is that of a binary heap. For me, a heap defined as a tree has its place in mathlib, but software engineers might want an efficient heap in Std that's implemented via an array.
Another example is Array vs List. I think there should be no occurrence of Array's in mathlib given it's just the same thing but with optimized access.

I would be interested in formalizing some structures at the bottom of this page https://en.wikipedia.org/wiki/Tree_(data_structure) in mathlib, but as I was suggesting, only in their easy-to-reason-about form.

Kim Morrison (Mar 16 2024 at 21:17):

More generally, I really hope that one day we move towards a world where e.g. graph theory / basic category theory / the basic algebraic hierarchy could be in separate libraries from Mathlib. They might still be in the same repository for convenience, but usable as separate dependencies.

Julien Michel (Mar 16 2024 at 23:25):

@Scott Morrison Thank you for answering, that clarifies things. That just gave me an idea (perhaps silly):

Why not split Std into two packages : Std (only data and code, no theorems) and StdTheory (only theorems about Std, or relating Std to other stuff) with these dependencies:

Std -> Core
StdTheory -> Std
StdTheory -> Mathlib -> Core

One advantage is that StdTheory can use all the power of math to say things about Std. Std can be built fast and be lightweight, and doesn't hold any proofs.

Now If I'm a programmer interested in creating verified software, I would create two packages : Software (only data and code, no theorems), SoftwareTheory (theorems about Software) with the following dependencies:

Software -> Std
SoftwareTheory -> Software
SoftwareTheory -> StdTheory

SoftwareTheory could prove stuff about Software using all of StdTheory, or even Mathlib if necessary, but at least Software has minimal dependencies.

Note that Mathlib should not depend on Std nor StdTheory anymore. Most theorems in Core could probably be moved to Mathlib or some other CoreTheory package.

To sum it up, would that make sense to decouple theorems and defs completely in different packages?

Eric Wieser (Mar 16 2024 at 23:38):

The reality is that definitions and theorems are rather entangled, and you can't get away with writing all of the former before writing any of the latter. Sometimes you can't define a function without proving that the recursion is well-founded, and this is much more difficult to do if no theorems are yet available for any of your types.

Kim Morrison (Mar 16 2024 at 23:41):

We want fewer things depending on Mathlib, not more. :-)

Eric Wieser (Mar 16 2024 at 23:41):

Another example of this is the @[csimp] attribute; this lets you prove that an efficient function is equal to a simple one, and tells the compiler that the efficient one always can be used in place of the simple one. But this attribute goes on a theorem, and so if you put it in SoftwareTheory, now your functions are slow unless people import the theory library.

Eric Wieser (Mar 16 2024 at 23:43):

Scott Morrison said:

More generally, I really hope that one day we move towards a world where e.g. graph theory / basic category theory / the basic algebraic hierarchy could be in separate libraries from Mathlib. They might still be in the same repository for convenience, but usable as separate dependencies.

I think it's important to note the "one day" here. If you have a graph theory result or a result about some other definition in mathlib, then at least today that result also belongs in mathlib

Julien Michel (Mar 16 2024 at 23:46):

(deleted)

Julien Michel (Mar 16 2024 at 23:49):

Regarding termination, maybe there should be a way to prove termination elsewhere, just like the attribute command can add simp lemmas later

Julien Michel (Mar 16 2024 at 23:53):

For the @[csimp] theorems, I'm not sure you care, because if you have the fast function in Software, you just use it. I don't see why you would use them for something else than proofs

Julien Michel (Mar 16 2024 at 23:58):

I'm saying that Software and Std code wouldn't depend on Mathlib.
For building executable software under time and memory constraints, I agree it feels important to have few dependencies.
But if you're interest is only in proving, do you really care that much about the time or the memory it will take ? You would only care that you're Theory typechecks once and that's it. Wouldn't you?

Eric Wieser (Mar 17 2024 at 00:01):

Std does not depend on mathlib, the dependency is the other way around; because mathematicians need the basic operations on Nat and List that Std provides.

Julien Michel (Mar 17 2024 at 00:02):

Scott Morrison said:

We want fewer things depending on Mathlib, not more. :-)

Yes I know I thought Scott was implying that I said Std should depend on Mathlib. But I only meant StdTheory could

Kim Morrison (Mar 17 2024 at 00:03):

Yes, I was saying that having a StdTheory that depends on Mathlib is not viable.

Eric Wieser (Mar 17 2024 at 00:04):

Julien Michel said:

Regarding termination, maybe there should be a way to prove termination elsewhere, just like the attribute command can add simp lemmas later

The problem is that this allows functions to exist that don't terminate at all, and now everything is unsound and proofs are meaningless.

Julien Michel (Mar 17 2024 at 00:05):

But it is possible to do that with the partial keyword already?

Julien Michel (Mar 17 2024 at 00:06):

But it's true that you can't unfold the defs I think

Eric Wieser (Mar 17 2024 at 00:06):

You could enforce that the termination is eventually proved in some downstream file but now a) you can't import your Software without also importing SoftwareTheory and b) it is much harder to ensure that proofs are not mutually cyclic if proofs of early results (termination) can appear after later results

Eric Wieser (Mar 17 2024 at 00:07):

Julien Michel said:

But it is possible to do that with the partial keyword already?

def ohno : False := ohno

is not legal even with partial

Eric Wieser (Mar 17 2024 at 00:07):

But the only way you can find out in general is by forcing the user to prove termination

Eric Wieser (Mar 17 2024 at 00:08):

(it's legal with unsafe, which is maybe your answer here)

Julien Michel (Mar 17 2024 at 00:16):

Writing a function like that partial def loop (x : Nat) : Nat := loop 0 compiles and runs (even if it doesn't terminate), and its not unsafe code. So I believe useful software functions could be written in Software and proven to terminate in a non-imported SoftwareTheory. I don't really know about b), I'll think more about it. Thank you for these answers

Mario Carneiro (Mar 17 2024 at 02:38):

You can't prove the termination of partial def after the fact. Once you use partial, you get a fully opaque function which is useless for reasoning purposes. So if StdTheory wants to prove facts about functions defined in Std, those functions have to be defined without using partial and proving termination on the spot.

Mario Carneiro (Mar 17 2024 at 02:40):

Some of the division you are describing does exist in Std already, in particular there are things like Std.Data.Array.Basic with the definitions and Std.Data.Array.Lemmas with the theorems, with the intent that you can import just the first part if you just want to do programming and don't care about the proofs.

Mario Carneiro (Mar 17 2024 at 02:41):

But sometimes there are theorems that are needed by Std.Data.Array.Basic because of the aforementioned reasons, so there is Std.Data.Array.Init.Lemmas for holding those "bootstrapping" lemmas.

Last updated: May 02 2025 at 03:31 UTC