Zulip Chat Archive

Stream: mathlib4

Topic: Possible PR for probability theory?

Rick de Wolf (Oct 22 2024 at 08:35):

Hi all, I'm working on an implementation for Bernoulli random variables. I think it might be a good fit for Mathlib as there are quite a few other commonly used probability distributions in there already. Also, I'm following the existing Mathlib code for the geometric distribution, so integration should be relatively easy, hopefully.

Rick de Wolf (Oct 22 2024 at 08:35):

(@Rémy Degenne, hope you don't mind me pinging you)

Rick de Wolf (Oct 22 2024 at 08:39):

My question is, is there any interest in a PR like this?

Josha Dekker (Oct 22 2024 at 09:03):

Hi Rick,
I've written a few files in this direction in the past, and I can tell you that adding some more distributions is certainly desirable. (I've been too busy lately to work on Mathlib myself, unfortunately). What is important to keep in mind is that in Mathlib we typically aim for generality: if possible, I believe it would be nice to start with the categorical distribution, after which you could make a separate file for the bernoulli distribution as a special case. (Similar to how the exponential distribution is written as a special case of the gamma distribution).

Perhaps Rémy has some thoughts on the properties of distributions that we would like to have available for each distribution, to make sure that have some reliable API across the various distributions?

Yaël Dillies (Oct 22 2024 at 09:08):

Rick de Wolf said:

I'm working on an implementation for Bernoulli random variables.

To me this could mean many things. Can you clarify whether you are defining an explicit Bernoulli random variable, defining a predicate saying that a random variable is Bernoulli, proving properties of Bernoulli random variable, something else?

Rick de Wolf (Oct 22 2024 at 09:15):

@Josha Dekker That is great to hear! Generalizing to a categorical distribution makes a lot of sense. When I was working on this last night I tried to use PMF.ofFinset but that becomes quite clunky even for something as simple as a Categorical PMF :sweat_smile:. I can have a go at that and then maybe PR both the categorical and the Bernoulli distribution at the same time?

Josha Dekker (Oct 22 2024 at 09:22):

Rick de Wolf said:

Josha Dekker That is great to hear! Generalizing to a categorical distribution makes a lot of sense. When I was working on this last night I tried to use PMF.ofFinset but that becomes quite clunky even for something as simple as a Categorical PMF :sweat_smile:. I can have a go at that and then maybe PR both the categorical and the Bernoulli distribution at the same time?

You could definitely develop them in parallel, but you may want to split them in two separate PR's (one depending on the other) to speed up the reviewing process. I'd probably stick to categorical at first. If PMF.ofFinset is too clunky to work with, perhaps it is easier to work directly with a PMF as follows: if I have k categories with probabilities p_1,..,p_k, I can define a PMF (in pseudo-code): f(n) = n <= k ? p_n : 0.

But it may be that a small API lemma around PMF.ofFinset is missing/is overlooked, which makes that approach viable?

edit: I see that what I'm describing is practically PMF.ofFinset, I'd stick to that one.

Rick de Wolf (Oct 22 2024 at 09:23):

@Yaël Dillies I am following the geometric distribution's implementation in Mathlib as a guide, so the overall setup is almost the same as that one. To be more concrete though, I'm defining a Bernoulli PMF, using def bernoulli_pmf_real (p : ℝ) (b : Bool) : ℝ := if b then p else (1-p) as a basis. (And yes, I realized it should be {0, 1} instead of Bool last night, that's the one thing that's left to do.) There is more in my file though - there is a lemma stating bernoulli_pmf_real sums to one, for example, and there is a definition for the induced Measure on Bool.

Yaël Dillies (Oct 22 2024 at 09:24):

Do you know about docs#PMF.bernoulli ?

Yaël Dillies (Oct 22 2024 at 09:25):

In fact, I don't like this definition and would rather have it be a PMF over Prop, because that's how binomial random graphs are defined (as a family of independent random Prop-valued Bernoulli variables)

Rick de Wolf (Oct 22 2024 at 09:28):

Josha Dekker said:

Rick de Wolf said:

Josha Dekker That is great to hear! Generalizing to a categorical distribution makes a lot of sense. When I was working on this last night I tried to use PMF.ofFinset but that becomes quite clunky even for something as simple as a Categorical PMF :sweat_smile:. I can have a go at that and then maybe PR both the categorical and the Bernoulli distribution at the same time?

You could definitely develop them in parallel, but you may want to split them in two separate PR's (one depending on the other) to speed up the reviewing process. I'd probably stick to categorical at first. If PMF.ofFinset is too clunky to work with, perhaps it is easier to work directly with a PMF as follows: if I have k categories with probabilities p_1,..,p_k, I can define a PMF (in pseudo-code): f(n) = n <= k ? p_n : 0.

But it may be that a small API lemma around PMF.ofFinset is missing/is overlooked, which makes that approach viable?

edit: I see that what I'm describing is practically PMF.ofFinset, I'd stick to that one.

Yeah PMF.ofFinset basically does everything a categorical dist needs, but I'm not a fan of the interface. Maybe I'm missing something there, but I feel like there should be an interface that only needs to be given an array or similar datastructure with probability masses and automatically turns that into a PMF. I might be oversimplifying things though.

Rick de Wolf (Oct 22 2024 at 09:31):

Yaël Dillies said:

Do you know about docs#PMF.bernoulli ?

No I did not haha This would make my code unncessary it seems. So a PR that changes the existing definition to use Props might be good then?

Josha Dekker (Oct 22 2024 at 09:31):

Rick de Wolf said:

Josha Dekker said:

Rick de Wolf said:

Josha Dekker That is great to hear! Generalizing to a categorical distribution makes a lot of sense. When I was working on this last night I tried to use PMF.ofFinset but that becomes quite clunky even for something as simple as a Categorical PMF :sweat_smile:. I can have a go at that and then maybe PR both the categorical and the Bernoulli distribution at the same time?

You could definitely develop them in parallel, but you may want to split them in two separate PR's (one depending on the other) to speed up the reviewing process. I'd probably stick to categorical at first. If PMF.ofFinset is too clunky to work with, perhaps it is easier to work directly with a PMF as follows: if I have k categories with probabilities p_1,..,p_k, I can define a PMF (in pseudo-code): f(n) = n <= k ? p_n : 0.

But it may be that a small API lemma around PMF.ofFinset is missing/is overlooked, which makes that approach viable?

edit: I see that what I'm describing is practically PMF.ofFinset, I'd stick to that one.

Yeah PMF.ofFinset basically does everything a categorical dist needs, but I'm not a fan of the interface. Maybe I'm missing something there, but I feel like there should be an interface that only needs to be given an array or similar datastructure with probability masses and automatically turns that into a PMF. I might be oversimplifying things though.

It should not be hard to write API for this, i.e. a lemma that takes as input a vector p of probabilities and the hypothesis that they sum to 1, and returns a PMF:

Define a function (pseudocode!) f(n) = n <= k ? p_n : 0, and verify that this satisfies the conditions of PMF.ofFinset.
Then call PMF.ofFinset to supply the PMF.

Josha Dekker (Oct 22 2024 at 09:33):

Rick de Wolf said:

Yaël Dillies said:

Do you know about docs#PMF.bernoulli ?

No I did not haha This would make my code unncessary it seems. So a PR that changes the existing definition to use Props might be good then?

Or perhaps first generalise to categorical, then derive Bernoulli as a special case (still mapping to {0,1}) and then introducing a similar object which is prop-valued? This way we can avoid as much API duplication as possible?

Yaël Dillies (Oct 22 2024 at 09:33):

Rick de Wolf said:

So a PR that changes the existing definition to use Props might be good then?

Yes. I would be happy with that. See LeanCamCombi for some work towards that. If people complain that they want the Bool version still, we could have two versions

Rick de Wolf (Oct 22 2024 at 09:33):

Sounds great! I'll look into defining a categorical distribution interface.

Rick de Wolf (Oct 22 2024 at 09:34):

Yaël Dillies said:

Rick de Wolf said:

So a PR that changes the existing definition to use Props might be good then?

Yes. I would be happy with that. See LeanCamCombi for some work towards that. If people complain that they want the Bool version still, we could have two versions

What about deprecating the old version?

Yaël Dillies (Oct 22 2024 at 09:35):

Happy with it (although personally I think you can change in place), but again maybe people will complain. I just don't know.

Rick de Wolf (Oct 22 2024 at 09:36):

Alright, I can make a PR and we'll take it from there then. It would be my first time contributing to Mathlib so it might take me a minute to set things up.

Rick de Wolf (Oct 22 2024 at 09:36):

Probably not until the weekend at least.

Rick de Wolf (Oct 22 2024 at 09:37):

LeanCamCombi might be interesting for another thing I'm working on as well. Does it have anything for random graphs?

Yaël Dillies (Oct 22 2024 at 09:38):

Feel free to ask for help! I am happy to take to DMs to avoid spamming this thread

Rick de Wolf (Oct 22 2024 at 09:38):

Will do!

Yaël Dillies (Oct 22 2024 at 09:38):

Rick de Wolf said:

LeanCamCombi might be interesting for another thing I'm working on as well. Does it have anything for random graphs?

It doesn't have a huge lot, but I've worked out the definitions and quite immediately saw I needed to refactor mathlib, but I never did. Here it is.

Rémy Degenne (Oct 22 2024 at 09:48):

We will need several definitions for Benoulli distributions: neither Bool nor Prop corresponds to what I would like to have for example (I want to be able to take sums and averages of Bernoulli r.v., to look at some limit of that, so something that coerces to Real).

I don't know if the Bool definition is useful of if replacing that one with Prop would just be better. But let's name that bernoulliBool or bernoulliProp rather than just bernoulli?

Rick de Wolf (Oct 22 2024 at 11:37):

I'm planning on using Bernoulli distributions in a way that requires expectations and variances can be handled, so I'm happy to look into that Rémy :thumbs_up: . Would a simple Finset \Real work?

Last updated: May 02 2025 at 03:31 UTC