Zulip Chat Archive
Stream: new members
Topic: James B. Wilson
James B. Wilson (Jun 12 2023 at 13:06):
I'm math prof at Colorado State University doing computational algebra, developing fast algorithms for GAP/MAGMA/SAGE and the like. My objectives with LEAN are to see theorems about efficient computation. The catch is that data types in computational algebra are skeletal and nothing like what you write down when doing a math proof. My goal is to close the gap (pun sorry) between what we program about a group or algebra and what we prove about it. Happy to learn about projects others have going on in the space and a preemptive thanks to all those who help us along the way.
Kevin Buzzard (Jun 12 2023 at 13:18):
The idea behind Lean 4 is that it is supposed to be able to unlock this kind of thing: efficient algorithms coupled with proofs that the algorithms are bug-free (the proofs are optional, but are needed if you need to use the algorithms in other proofs of course). However we're yet to really see it in action. Do you have a good toy example to set as a challenge/benchmark? I think that even primality testing would be a good example right now; Lean currently has nothing more exotic than "check all the numbers ".
Yaël Dillies (Jun 12 2023 at 13:20):
@Bhavik Mehta did port the prime certificates from Isabelle to Lean 3 a long time ago, but it was pretty slow.
Bolton Bailey (Jun 12 2023 at 13:26):
The miller-rabin
branch has an implementation of Miller-Rabin in Lean 3, it would be interesting to port to Lean 4 to see how it would perform.
Jireh Loreaux (Jun 12 2023 at 16:13):
James B. Wilson (Jun 12 2023 at 16:16):
Thanks @Kevin Buzzard, the example that seems best is the "Hello World" of computational algebra (the one that I assign to my first year students) is known as Schreier-Sims. It has its own Wikipedia page here . It has a couple things going for it.
- relatively simple data types (arrays/linked list suffice, though a stack is even easier)
- easy algorithms (its a transitive closure with book-keeping)
- requires only basic knowledge of group actions.
James B. Wilson (Jun 12 2023 at 16:17):
When you get it to work then it has some fun pedigogical applications such as having the computer output instructions to solve a Rubix cube. I saw the LEAN4 samples include a Rubix cube visualizer so it probably would tie together nicely as a demo.
James B. Wilson (Jun 12 2023 at 16:18):
The challenge isn't to prove Schreier-Sims is correct, that's relatively trivial. The challenge is to write the program and annotate the calculations in such a way that your result proves its performance.
James B. Wilson (Jun 12 2023 at 16:20):
This would be somewhat direct if LEAN had linear types, but I asked here and got a no on that one.
James B. Wilson (Jun 12 2023 at 16:24):
As @Mario Carneiro observed there I could add linear types myself, but is a bit of a project. I'll look at some alternatives and open to suggestions. The key problem is that you can't stop a system from making a horrible implementation of a function, such as branched recursion, from the signature of the function alone. So the trick the linear types and other structural types give you is a way to insist data is used the amount of times you want.
James B. Wilson (Jun 12 2023 at 16:26):
As a side remark as to motive: while mostly it makes me happy, the real goal is that so much scientific software gets written by post-docs and students who leave and all that knowledge gets lost. Impossible to maintain, sometimes hard to trust. Its usually not a problem of the theory, but it is a problem to know if the implementation matches the theory.
Reid Barton (Jun 12 2023 at 16:54):
There has been a lot of discussion about proving the complexity of algorithms on #general > Computational Complexity Theory.
Reid Barton (Jun 12 2023 at 17:17):
It seems pretty hard to me (and I'm not sure how much linear types would help). Generally I would be more than happy with proven-correct algorithms, that are also efficient in practice!
James B. Wilson (Jun 12 2023 at 17:34):
For me it is the opposite. Nearly all the algorithms to be applied have semi-obvious definitions. Just about anyone can make a program that returns the correct answer even if all it does is brute-force search. The art for my camp is a program that does it with proven complexity.
James B. Wilson (Jun 12 2023 at 17:40):
But I am not seeking to convert anyone just highlighting my motives. Thanks for the link.
James B. Wilson (Jun 12 2023 at 17:42):
Linear types don't solve all the questions but they do make it possible to prevent errant back-tracking and branch and bounds which are an all to common guess at how to make a practical algorithm and which can fool random testing but ultimately mathematics (and humans) tend to ask questions in the hard regime.
Jireh Loreaux (Jun 12 2023 at 17:43):
I would love to see this happen, despite the fact that it is far outside my wheelhouse.
Mario Carneiro (Jun 12 2023 at 18:45):
@James B. Wilson said:
For me it is the opposite. Nearly all the algorithms to be applied have semi-obvious definitions. Just about anyone can make a program that returns the correct answer even if all it does is brute-force search. The art for my camp is a program that does it with proven complexity.
I think you are undervaluing the space of proved correct and (presumed or proved on paper) efficient algorithms. Most CAS's are built out of these things and it is an area we would really like to break into with lean 4. Just because you haven't proved that the algorithm is efficient in the system doesn't mean that it isn't still useful, and it doesn't make it a brute force search.
Mario Carneiro (Jun 12 2023 at 18:48):
If you think "the algorithms to be applied have semi-obvious definitions" then great, do it! These algorithms do require some not so common combination of expertise so just because you think it's easy doesn't mean that anyone can do it. Having a "CAS insider" would be a great boon to getting this stuff done
Mario Carneiro (Jun 12 2023 at 18:50):
Plus, adding the "proved correct" part is already a nontrivial value-add over the original algorithm as it would be implemented in a CAS without proofs. It often changes the character of the work when you have to be proof producing, making different approaches more or less viable.
Mario Carneiro (Jun 12 2023 at 18:53):
By contrast, if you want to have proved efficiency then it completely changes the goal to something which is more useful for algorithms researchers and not so much for users. Of course it is still an interesting research area, but for me that sucks a lot of the fun out if it doesn't put a tactic in the users' hands at the end of the day.
James B. Wilson (Jun 12 2023 at 18:54):
Apologies that my remark appeared to place value. Lots of people working on lots of important aspects, I have respect for them each. I only mean to say that many question have on the surface a straight-forward algorithm that would easily be proved correct. For example, find a solution so some equation, a discrete log, a prime factor, if you find it by brute-force searching you can rather easily prove it is correct. But you might in fact miss the point of the exercise. But I fully agree for complicated algorithms a certificate comes with great care and analysis.
Reid Barton (Jun 12 2023 at 18:55):
Yes, hence the proviso "efficient in practice".
James B. Wilson (Jun 12 2023 at 18:56):
Well as a long time CAS builder I find that phrase has a rather short shelf-life.
James B. Wilson (Jun 12 2023 at 18:56):
Release a package and within a year you hear about cases that don't work "in practice"
Mario Carneiro (Jun 12 2023 at 18:57):
well a good way to build efficient in practice algorithms is to use algorithms that are efficient in theory
Mario Carneiro (Jun 12 2023 at 18:57):
because those are somewhat correlated
James B. Wilson (Jun 12 2023 at 18:58):
My view as well
James B. Wilson (Jun 12 2023 at 18:59):
My colleagues and I always publish proofs of complexity along side correctness. But the code goes into the CAS in whatever version it got implemented. It seems to me the obvious next step is to have the code carry with it a proof of both claims.
Mario Carneiro (Jun 12 2023 at 19:01):
The main point here is that lean does not lend itself naturally to complexity certificates, because it is essentially a denotational semantics, terms are equal when they are extensionally equal so performance is not a thing you can directly reason about with lean functions
Mario Carneiro (Jun 12 2023 at 19:02):
On the other hand it does lend itself naturally to functional correctness proofs
James B. Wilson (Jun 12 2023 at 19:02):
Yeah, its one of the barriers for sure. And I do have prototypes and such in other languages. But the tactics and community in LEAN are attractive
Jireh Loreaux (Jun 12 2023 at 19:03):
(just FYI, despite the L∃∀N text on the community website, it's Lean, not LEAN)
Mario Carneiro (Jun 12 2023 at 19:03):
I'd love to see what an implementation of Schreier–Sims actually looks like in lean. I suspect you are underestimating some aspects of the implementation
Mario Carneiro (Jun 12 2023 at 19:04):
even designing the API for it looks nontrivial, what is the actual setting in which you would use it?
James B. Wilson (Jun 12 2023 at 19:06):
Well I made a stupid hack of it with limited checks such as ignoring array bounds but its a test case. I think its coming together rather as expected, but still don't have linear types so my idris inspired timing certificates are not in place
Mario Carneiro (Jun 12 2023 at 19:07):
the question is, can you actually verify it computes what it claims to?
James B. Wilson (Jun 12 2023 at 19:07):
yes.
Mario Carneiro (Jun 12 2023 at 19:07):
how does this work?
James B. Wilson (Jun 12 2023 at 19:07):
Again subject to some low-order things like putting in the proper bounds checking.
Mario Carneiro (Jun 12 2023 at 19:08):
I don't mean "argue informally for the correctness of the algorithm", I mean "there is a lean proof of the properties of the results of the function"
Mario Carneiro (Jun 12 2023 at 19:08):
or alternatively "it produces a proof that the results are as expected along with the answer"
James B. Wilson (Jun 12 2023 at 19:08):
Right, yeah, that the return is what you say. Most of that comes down to knowing what the mathematical abstraction is and making the type for that.
Mario Carneiro (Jun 12 2023 at 19:09):
I'd love to see the code if you have something
James B. Wilson (Jun 12 2023 at 19:11):
Sure, we have a git repo, we'll make it public soon I wager, I do have to ask the rest of the team. Its meant as a learning exercise since most of us program in other languages so its a hit to our pride to put out what we know to be sloppy LEAN code.
Mario Carneiro (Jun 12 2023 at 19:14):
Yep, it's a learning process of course. I'm interested in the actual end goal statements / theorems, how are you phrasing this problem
Mario Carneiro (Jun 12 2023 at 19:16):
(And also, obviously ask here if you have any questions about how to do something in lean, what the efficiency of some primitive operations is, or what standard style is, we are very interested to support this work!)
James B. Wilson (Jun 12 2023 at 19:18):
Take an orbit theorem
Theorem. If is a group acting on a set and then where and .
To make this computational one obvious question:
- Given: and
- Return:
or return .
James B. Wilson (Jun 12 2023 at 19:19):
That's all directly explained as types but you would struggle to make that a program as the groups are huge. What CAS does is store generators, not the group.
James B. Wilson (Jun 12 2023 at 19:20):
And what it does for the orbit is store a Schreier-tree (a spanning tree of the Cayley graph of the fixed generators). And for generators of it uses what are called "Schreier generators" which are a data types build from the Schreier tree and generators for .
James B. Wilson (Jun 12 2023 at 19:21):
All of these are different types than what one originally describes as and but individually you can see to prove they are equivalent.
Mario Carneiro (Jun 12 2023 at 19:22):
Do you write this using some relationship to e.g. groups and subgroups as described by mathlib? Or are these just data types like GeneratorSet
with some operations on them
James B. Wilson (Jun 12 2023 at 19:23):
for now just hand made dependent and inductive types but yeah the vision is to integrate.
Mario Carneiro (Jun 12 2023 at 19:23):
How would you package this in a way that it would come up in the course of a regular-looking query?
Mario Carneiro (Jun 12 2023 at 19:24):
it seems a bit abstract to "return a group"
James B. Wilson (Jun 12 2023 at 19:40):
Yep, in CAS this is always understood as something specific, usually a set of generators but sometimes a presentation or what is known as a black-box group.
Mario Carneiro (Jun 12 2023 at 19:41):
and how do you interact with a "black-box group"?
Mario Carneiro (Jun 12 2023 at 19:42):
even the basic API for doing things with such groups would be a useful place to start
James B. Wilson (Jun 12 2023 at 19:45):
Mario Carneiro said:
How would you package this in a way that it would come up in the course of a regular-looking query?
I didn't follow, but for example in many proofs you go through several cases and then are left with some calculation. You then insert in your proof that you ran some code and it came back with 3 examples, you check them all and none of them are counter examples. The end. Well its that code you ran that matters to be right, but it can't simply be seen as exhaustively trying all things. The search spaces are huge. But much of the CAS world has found ways to abstract the question of huge search spaces into small amounts of data given and returned. But they drop off on providing certificates not because we don't know how, we have. There is a whole industry of certified returns. But its never been part of interactive proofs or proof checkers. Instead it was Arthur Merlin and BPP and other complexity zoo things where each thing to certify provides its own certificates and its own certificate checker. Its now possible to off load that last part to standard issue proof checkers.
Mario Carneiro (Jun 12 2023 at 19:52):
I'm totally on board with the idea of holding some finite data which "encodes" a group in some sense, but I have no idea what concretely CASs generally use for such a data structure (generators? generators + relations? order?) or what questions are being posed of the groups (elementhood? order? listing subgroups?) and what mechanisms are used to answer those questions. I'm sure there is a book somewhere called something like Introduction to Computer Algebra which details some data structures, and real world CASs use something kind of like that but not quite because such and such works better. There is clearly a whole world here that one wouldn't know unless you have a friend who works on sage or magma or something
James B. Wilson (Jun 12 2023 at 20:01):
Right on all those points. Well, I work with quite a number of the people who make those systems and been fussing about with Agda/Idris/Coq and now LEAN waiting for a system that ticked enough boxes to come off the side lines. I look forward to sharing what we know and seeing how to build bridges.
Books by the way for interested readers
* Holt Eick O'Brien "Handbook of computational Group Theory" does a handsome job for many things.
* Seress "Permutation group algorithms" does more on optimal performance and randomization.
Matthew Ballard (Jun 16 2023 at 14:59):
Bolton Bailey said:
The
miller-rabin
branch has an implementation of Miller-Rabin in Lean 3, it would be interesting to port to Lean 4 to see how it would perform.
As a test, I did it from scratch in Lean 4 and performance was comparable to a Go version I had made for a cryptography course. (This is despite the fact I did everything with Nat
and the Go version used native int
, ie no big numbers.) Obviously next time I teach the course it will be in Lean :)
Reid Barton (Jun 20 2023 at 06:23):
James B. Wilson said:
Linear types don't solve all the questions but they do make it possible to prevent errant back-tracking and branch and bounds which are an all to common guess at how to make a practical algorithm and which can fool random testing but ultimately mathematics (and humans) tend to ask questions in the hard regime.
Here's an Idris example demonstrating that linear types don't really prevent anything when applied to natural numbers (and presumably other inductive types). Basically, I can consume an argument linearly to build a copy of it which I can then use non-linearly. That means I can always turn any function which consumes a Nat
into one that claims to use it linearly.
module Main
launder : (Nat -> Bool) -> ((1 n : Nat) -> Bool)
launder f Z = f Z
launder f (S n) = launder (\k => f (1 + k)) n
unlaunder : ((1 n : Nat) -> Bool) -> (Nat -> Bool)
unlaunder g n = g n
launder_eq : (f : Nat -> Bool) -> (n : Nat) -> unlaunder (launder f) n = f n
launder_eq f Z = Refl
launder_eq f (S n) = launder_eq (\k => f (1 + k)) n
Reid Barton (Jun 20 2023 at 06:24):
My intuition is that more generally the notion of "using linearly" is meaningless in classical mathematics for all types and not just for inductive types, though I don't have a precise formulation of that.
James B. Wilson (Jun 20 2023 at 15:00):
That is an interesting example @Reid Barton. I'll think on it. I've considered a few similar cases of where you might leak time/resources. I rather expect you can shoe-horn in undecidability into any claims, there is after all the halting problem directly in the path. But I'm not after ways to cheat rather I'm after ways that someone who with good intent has made an algorithm and proved it has a fixed complexity can also write that proof down with the algorithm, or at least provide good circumstantial evidence for that claim within the program. Its already possible to do this on Turing Machine directly, and Dal Lago and company's work features some progress towards translating that into untyped lambda calculus. But its not really a programming reality, there are no clear idioms or tactics that I've come across to expose complexity through type annotations. I want to investigate how current theorem provers can nevertheless do at some forms of proof of complexity, and to what extent interactive proofs can offer the right sort of guards to guide a program into the desired complexity. Linear types do add some of this, you have to think a bit harder to leak resources when you've put up the condition to use it a limited number of times.
So all in all, its not a broad claim I'm hoping to make that it will work on all functions. That I don't believe. Just a case study to see if some of the transparent techniques used often in CAS and whether if annotated and forced to comply with some programming paradigms they would infact communicated a proven complexity as part of their program description.
James B. Wilson (Jun 20 2023 at 15:03):
But of course I may fail spectacularly!
Bolton Bailey (Dec 16 2023 at 14:36):
James B. Wilson said:
Thanks Kevin Buzzard, the example that seems best is the "Hello World" of computational algebra (the one that I assign to my first year students) is known as Schreier-Sims. It has its own Wikipedia page here . It has a couple things going for it.
- relatively simple data types (arrays/linked list suffice, though a stack is even easier)
- easy algorithms (its a transitive closure with book-keeping)
- requires only basic knowledge of group actions.
Wait, am I correct in reading this wiki page that the Monte-Carlo version of the algorithm has a quadratic speedup vs. the deterministic version? This is potentially another great example of the same problem that plagues Miller-Rabin and the ring
tactic.
James B. Wilson (Dec 18 2023 at 15:24):
There is a nearly linear time Monte Carlo algorithm and various Las Vegas adaptations.
James B. Wilson (Dec 18 2023 at 15:25):
There are also more clever deterministic versions by Knuth and Jerrum.
James B. Wilson (Dec 18 2023 at 15:26):
And finally there are versions using probabilistic data structures.
So yes, all in all it get to explore a wide range of both algorithms and data types. The hassle/goal is to operate in these regimes keeping all the performance but linking to the theory.
James B. Wilson (Dec 18 2023 at 15:27):
(FYI nearly linear in a multitape TM, it wouldn't be interesting to say that in a RAM model)
Last updated: Dec 20 2023 at 11:08 UTC