## Stream: general

### Topic: Verifying a deterministic algorithm

#### Kevin Buzzard (Apr 03 2020 at 12:08):

Let's say that someone ports a random number generator to Lean where you can seed it (default seed is 37) and then it just gives you a function from nat to real or nat to the nats which are at most N or whatever. Using this someone could formalise an algorithm like Miller Rabin and then this could be a pseudo primality checker for large numbers. But how could one formalise the assertion that "the miller rabin test tried with 100 randomly generated primes will give the right answer with probability 99.99999999%"?

#### Kevin Buzzard (Apr 03 2020 at 12:09):

I mean, formalising the things mathematicians say when they justify these sorts of claims

#### Gabriel Ebner (Apr 03 2020 at 12:11):

The assertion is just a statement about the probability of a certain set (where the seed is a (uniformly distributed) random variable).

#### Kevin Buzzard (Apr 03 2020 at 12:12):

Maybe there's some function which eats a proposition depending on a stream input and returns a number between zero and one which we never compute

#### Kevin Buzzard (Apr 03 2020 at 12:14):

I see, and I can then prove that the probability you pass for a given input is at least 3/4 or whatever

#### Kevin Buzzard (Apr 03 2020 at 12:15):

Without ever explicitly computing it

#### Gabriel Ebner (Apr 03 2020 at 12:23):

It's been some time since I've looked at Miler-Rabin. IIRC (and read wikipedia), then in order to check that $n$ is prime, you check for a random $a < n$ that $a^d \not\equiv 1 (n)$ and $a^{2^r d} \not\equiv -1$ for all $r < s$ where $n-1 = 2^sd$. The number $n$ is prime iff no such $a$ exists. If $n$ is not prime, the probability of picking such an $a$ is at least $3/4$.

#### Gabriel Ebner (Apr 03 2020 at 12:28):

So you'd get a function test : fin n → Prop with volume test ≥ 3/4. You can then also define a function (fin m → fin n) → Prop that does the test multiple times. I'm not sure how exactly you would hook this up with a PRNG, you'd probably need some theorems about the distribution of the values of the PRNG.

#### Mario Carneiro (Apr 03 2020 at 12:35):

I don't think there is anything meaningful you can do to hook up the PRNG, at the end of the day you just get one number. You might be able to take the preimage of the probability distribution across the PRNG to give a bound on the volume of the set of "good" seed values, but then you just stick 37 in and hope it's one of the good ones

#### Mario Carneiro (Apr 03 2020 at 12:36):

You might also have trouble proving that iterating the test increases the volume past 3/4, since proving independence is somewhere between very difficult and false

#### Gabriel Ebner (Apr 03 2020 at 12:43):

The part about the independence of the values of the PRNG is a good point, though the probability of finding a witness should still converge to 1 since the PRNG hopefully enumerates all values.

There are two questions here though:

1. How to verify the Miller-Rabin algorithm. Here the answer is probably to ignore the PRNG and just compute the probability of the test directly, which isn't too hard.
2. How to run it as a tactic to disprove primality. I completely agree with you here. Just stick in the randomly chosen values 2, 3, 5, 7, and 11 and pray that you find a witness.

#### Mario Carneiro (Apr 03 2020 at 12:45):

Unless it's a ludicrously large number, an easier way to get a witness is to ask a CAS to find a factor

#### Gabriel Ebner (Apr 03 2020 at 12:46):

I'd even say "in particular if it's a ludicrously large number".

#### Mario Carneiro (Apr 03 2020 at 12:47):

But I agree with the general spirit of the question; I've been struggling with finding the "proof content" of probabilistic proofs for years

#### Mario Carneiro (Apr 03 2020 at 12:48):

it always boils down to relying on open questions like the algorithmic randomness of prime numbers after suitable preprocessing

#### Gabriel Ebner (Apr 03 2020 at 12:50):

They're not probabilistic proofs, they're probabilistic algorithms. And their correctness (= probability that they return the correct result) does not rely on any open questions AFAICT.

#### Mario Carneiro (Apr 03 2020 at 12:51):

I mean the correctness of the over all procedure of taking a test like this, running it 10 or 20 times, and arguing that the chance that the theorem is false is now less than the probability of hardware failure

#### Gabriel Ebner (Apr 03 2020 at 12:51):

I also feel a bit uneasy that these algorithms return an imprecise answer to a binary question, but that doesn't change the fact that they are correct in a very precise sense.

#### Gabriel Ebner (Apr 03 2020 at 12:52):

Yes, that's precisely the part I don't like either.

#### Kevin Buzzard (Apr 03 2020 at 13:17):

You're pure computer scientists.

#### Bryan Gin-ge Chen (Apr 03 2020 at 13:32):

This discussion reminds me a bit of this MO question which I remember reading and being confused about right before I learned about Lean.

What notable algorithms do we currently have in non meta mathlib? There's the extended GCD stuff by Mario in data.int.gcd, some sort algorithms in data.list.sort and some other stuff on data structures like data.hash_maps, anything else that I'm missing? (I guess any def is in some sense an algorithm, so really I'm asking about what mathematicians might label algorithms.)

#### Mario Carneiro (Apr 03 2020 at 13:42):

there are association lists in data.list.alist, R-B maps in core, and ordmap is an implementation of 2/3 size balanced binary trees a la Haskell's Data.Map on a long outdated branch of mathlib. I once had a plan to formalize most of Okasaki's book on functional data structures, but it was dead on arrival because lean 3 doesn't have thunks so half of the algorithms don't work (at least not at the stated performance bounds)

#### Reid Barton (Apr 03 2020 at 13:43):

For these probabilistic algorithms I don't think you can avoid modeling the PRNG output as a random variable (in the sense of probability theory).

#### Mario Carneiro (Apr 03 2020 at 13:44):

but that's... at best unjustified

#### Reid Barton (Apr 03 2020 at 13:44):

Right, which is precisely why you can't do better in a theorem prover :slight_smile:

#### Reid Barton (Apr 03 2020 at 13:45):

The most you could even hope to be true is that your PRNG looks random to someone with bounded computational power

#### Reid Barton (Apr 03 2020 at 13:45):

but I don't think anyone knows how to prove any statement of that form

#### Mario Carneiro (Apr 03 2020 at 13:45):

for all you know there is some deep number theory reason that the method your PRNG uses is disproportionally likely to pick Miller-Rabin false negatives

#### Reid Barton (Apr 03 2020 at 13:50):

Well, you might believe that this is actually false (because false negatives are rare, and you could detect them in polynomial time by doing a polynomial time deterministic primality test) but how would you ever prove it?

#### Reid Barton (Apr 03 2020 at 13:54):

and for your application you would even need to prove it with effective bounds

#### Ryan Lahfa (Apr 03 2020 at 14:05):

That's a small contribution, but AFAIK, people have been successful at modelling some CSPRNG behaviors using Z3 solvers: https://alephsecurity.com/2019/09/02/Z3-for-webapp-security/ — I don't know how feasible it would be to reproduce this in Lean (and how useful?) — but I guess this is far from the ideal desired outcome of this discussion.

#### Mario Carneiro (Apr 03 2020 at 14:12):

Yeah, unfortunately the only thing that seems to be possible in this space is disproving that certain random number generators are good

Last updated: May 13 2021 at 17:42 UTC