Zulip Chat Archive

Stream: general

Topic: Cryptographic Hashing


Tim Daly (Jan 09 2020 at 23:26):

Axiom will include proofs (presumably validated by Lean) for its algorithms. Ideally, when a computation is requested from Lean for a GCD, the result as well as a GCD algorithm proof is returned. Lean can then verify that the proof is valid. But it is computationally more efficient if both Axiom and Lean use a cryptographic hash, such as SHA1. That way the proof doesn't need to be 'reproven', only the hash computation over the proof text needs to be performed. Hashes are blazingly fast. This allows proofs to be exchanged without the need to re-run the proof mechanism. Since a large computation request from Lean might involve many algorithms there would be considerable overhead to recompute each proof. A hash simplifies the issue yet provides proof integrity.

Jason Rute (Jan 10 2020 at 00:49):

Reid was once telling me about some theoretical elliptic curve cryptographic hash mechanism for proofs. I’m having trouble remembering the main idea, much less the details. But in you case, is the idea that Axiom (or Lean) would store the hashes of previously checked proofs, so if a hash is in the list of checked hashes, it could safely avoid checking it again? Why hash the proof? Couldn’t you store (or hash) the theorem statement itself? Maybe I am misunderstanding.

Jalex Stark (Jan 10 2020 at 03:48):

I think you want the notion of a probabilistically checkable proof?

Johan Commelin (Jan 10 2020 at 06:05):

@Tim Daly Note that SHA1 is getting less and less secure every day. Just this week people announced chosen prefix collision attacks.

Johan Commelin (Jan 10 2020 at 06:07):

Also, I don't see why you can't just ask Lean: "Hey, do you have some precompiled binary of this theorem? Please answer with 1 bit".

Tim Daly (Jan 10 2020 at 06:10):

You could simply character compare the prior proof given to Axiom with the proof sent by Axiom. Of course, that assumes you stored the proof. Hashes are generally smaller. Either solution works fine. It's a question of scale. If Axiom uses 100 algorithms to perform a computation and sends 100 proofs along with the answer it pays to be efficient.

Tim Daly (Jan 10 2020 at 07:33):

So instead of shipping whole proof scripts back and forth, all that would be needed is a hash code. The proof script still exists but the hash code can be a table lookup. If the lookup fails for any reason, then request the associated proof script to re-validate the proof. This is especially useful if Lean is a network server. It just keeps a table of hashes of proofs it has validated.

Tim Daly (Jan 10 2020 at 08:13):

Indeed, one can get ever more clever. One could hash all of the hashes for all of the proven algorithms into one hash. Then all that needs to be exchanged for, say, a Groebner basis computation would be a single meta-hash.

Tim Daly (Jan 10 2020 at 08:15):

So there would be a single meta-hash for each algorithm, one for GCD, one for Groebner, etc. It is all very efficient and can scale. The server doesn't need to store proof scripts since if the hash lookup fails, it just asks the client for the proof.

Johan Commelin (Jan 10 2020 at 08:52):

I've been thinking a bit about Unison (unisonweb.org) recently. It's a type-theoretic functional programming language based on content addressable storage. So everything is hashed and then stored. Renaming a function is a 1-line change in the metadata, and doesn't break any code. You no longer store your code in files, but in this content addressable store managed by Unison. If you want to read the definition of a function, the system renders a textual representation for you, based on the content of some hashes + the names in the metadata. So you don't actually see the hashes.
It should make things like distributed code, reproducibility, refactoring code, etc... easier.

Johan Commelin (Jan 10 2020 at 08:53):

I wonder if a similar system might be usable for an ITP as well. It seems closely related to what @Tim Daly is suggesting

Jason Rute (Jan 10 2020 at 12:29):

I’m still confused by two points. Why is Lean storing the proof scripts or proof script hashs instead of just having Axiom store them? Why store proofs or proof hashes instead of theorems? If you store theorems, you don’t even need to generate the proof scripts if you already have that theorem (or theorem hash). And I’m fine with hashes (although I don’t see the need for a secure hash), in the high level languages that I work (Python and Scala) the set objects for storing objects to lookup later are all based on hashes (I think md5 hashing, but I’m not certain).

Jason Rute (Jan 10 2020 at 12:32):

Oh, maybe the reason to lookup the poof (hash) is that you still want to return the proof to the user, so to be safe you want to know that the proof and not just the theorem is correct. Is that it?

Jason Rute (Jan 10 2020 at 13:03):

@Tim Daly I now see you responded to my questions in another topic. I’ll look at that...

Tim Daly (Jan 10 2020 at 15:11):

The hash gives a collision-resistant name to the theorem/proof. Git, for example, uses SHA1. The first 8 hex characters are usually unique and act as a 'nickname' for the full hash. In addition, storing the hashes by using the first 2 hex characters gives a quick file system lookup.

You could use the hash to name the file containing the actual theorem/proof text. Then each theorem could provide the URL for the proof text by just providing a link to the hash-named file.

The Lean user (e.g. Axiom) can also store the proof text or it could rely on Lean to provide the URL. That is a design decision for someone who wants to use Lean as a network proof engine. Axiom Sane wants to generate proofs of its algorithms, store the proofs, and provide them to its users so the results can be trusted.

It seems that Axiom's 10,000 algorithms would be painful to use in a trusted manner if every use required a full re-proof of the algorithm on every use. It is an issue of working-at-scale for automating trust between the computer algebra branch and the logic/proof branch of computational mathematics.

Johan Commelin (Jan 10 2020 at 15:12):

Tim, I suggest you take a look at unison for more ideas in this direction

Johan Commelin (Jan 10 2020 at 15:13):

Also, I repeat that SHA1 isn't your best choice

Johan Commelin (Jan 10 2020 at 15:13):

Linus claims that SHA1 is fine for git, and it probably is. But I wonder if he would choose it again, if he would have to choose something now.

Tim Daly (Jan 10 2020 at 15:38):

re: SHA1. Like git, there is no real concern about its crypto properties (e.g. SHA1 has a prefix-hash collision attack). SHA1 returns a 160 character "name". SHA256 returns a 256 character name. (SHA1 and SHA256 share a design. There is a SHA3 with a different design but like Linux, I don't see the concern) What you want is a hash that depends on EVERY bit of the file contents.

I am looking at the Unison website now. Perhaps Lean could construct itself on Unison and let someone else do the heavy lifting of crafting a hash-based solution. Either way, if you're going from 3 mathematicians to 3000 mathematicians, a couple dozen computer algebra systems, a large corpus of proven algorithms, and a non-trivial pile of mathematics, all hiding somewhere "in the cloud", it helps to have a scalable design.

With trusted algorithms, Axiom could use some other system's algorithms for, say, algebraic geometry without re-implementing them. I'm not inventing these ideas. There are whole conferences and funded research efforts on the subject. (See James Davenport for some connections). Systems need trusted communication designed in.

There is also work on organizing the theorems. Axiom uses a hierarchy so they can be inherited based on properties (e.g. a chain of axioms that assume commutativity (giving arithmetic) verses a different chain that does not assume that (e.g. matrices). See the work of Farmer and Carette on "tiny theories". This scales much better than a flat organization.

Oh, and don't get me started on the (literate) documentation :-)


Last updated: Dec 20 2023 at 11:08 UTC