Stream: general

Topic: Applied statistics

Simon Hudon (Jul 05 2018 at 23:41):

I haven't done any statistics in a while. Can someone point me to a formula I can use to calculate the probability that x ≤ y if both are normally distributed random variables?

Kenny Lau (Jul 05 2018 at 23:42):

mean and variance?

Simon Hudon (Jul 05 2018 at 23:42):

Yes using the mean and variance.

Simon Hudon (Jul 06 2018 at 00:06):

Do you have a formula for that?

Mario Carneiro (Jul 06 2018 at 01:30):

This is like taking the 2D Gaussian and drawing a line across the distribution; if you project parallel to that line the projected distribution is another Gaussian so the answer should be Erf of some simple term involving the mean and variance

Simon Hudon (Jul 06 2018 at 01:37):

I'm not knowledgeable enough to make sense of that answer

Simon Hudon (Jul 06 2018 at 01:42):

Maybe it would be useful to reveal more context of my problem. I have two tactics. I'm running them each ~30 times, measuring average running time and variance. I'd like to check if one is actually faster than the other

Simon Hudon (Jul 06 2018 at 01:43):

The difference is not big but I'd like to know where I'm heading with the optimization

Mario Carneiro (Jul 06 2018 at 01:44):

If the variables are $X \sim N(a, \sigma_X)$ and $Y \sim N(b, \sigma_Y)$ then let $X = a + \sigma_X X'$ and $Y = b + \sigma_Y Y'$ where $X'$ and $Y'$ are standard normal variables; then $a+\sigma_X X' \le b + \sigma_Y Y'$ iff $\sigma_Y Y' - \sigma_X X' \ge a - b$. But $\sigma_Y Y' - \sigma_X X'$ is a Gaussian with SD $\sqrt{\sigma^2_X + \sigma^2_Y}$, so the probability is $\mathrm{erf}\frac{a-b}{\sqrt{\sigma^2_X + \sigma^2_Y}}$

Mario Carneiro (Jul 06 2018 at 01:48):

At the first order the answer to your question is easy: If the mean is less then it's probably an improvement

Mario Carneiro (Jul 06 2018 at 01:48):

but the larger the standard deviation the less confidence you have in this assertion

Simon Hudon (Jul 06 2018 at 01:49):

Right. One of them has a pretty large standard deviation

Simon Hudon (Jul 06 2018 at 01:49):

Is it weird that it only started making sense when you brought up the formula? :)

Mario Carneiro (Jul 06 2018 at 01:51):

By the way there is a square root in the last two math blocks there, but in my mathjax display I only see a conspicuous space

Simon Hudon (Jul 06 2018 at 01:53):

I see the same. You were missing a bracket and a \$ so I ended up compiler your latex myself. I see both square roots

Mario Carneiro (Jul 06 2018 at 01:54):

I think I've fixed all the delimiter problems above, but the square root still isn't rendering. You know how to fix it?

Simon Hudon (Jul 06 2018 at 01:56):

You're right, it's now rendering almost properly. I haven't played with mathjax so I wouldn't know where to start

Mario Carneiro (Jul 06 2018 at 01:57):

MSE uses mathjax and there isn't much trouble using $\sqrt x$ there

Reid Barton (Jul 06 2018 at 01:57):

Zulip actually uses katex, not mathjax

Simon Hudon (Jul 06 2018 at 01:57):

$x$

Mario Carneiro (Jul 06 2018 at 01:58):

you have to use double dollar delimiter

Mario Carneiro (Jul 06 2018 at 01:58):

does $\sqrt x$ work?

Nice thanks!

Simon Hudon (Jul 06 2018 at 01:59):

$\sqrt x$

No

Simon Hudon (Jul 06 2018 at 01:59):

$\sqrt x$

Mario Carneiro (Jul 06 2018 at 01:59):

Not sure how to do display math either

Simon Hudon (Jul 06 2018 at 01:59):

$\sqrt x$

oh, there it is

Simon Hudon (Jul 06 2018 at 02:00):

If you do "three ticks" math (like you would for Lean code), then it seems to work

Simon Hudon (Jul 06 2018 at 02:05):

If I try my tactic on three problems, I evaluate the formula you gave me for each problem, how do you recommend I aggregate the data (if I need to)?

Mario Carneiro (Jul 06 2018 at 02:22):

To make it a bit easier, you can subtract the two times for each problem, resulting in a random variable that you want to compare to zero

Mario Carneiro (Jul 06 2018 at 02:23):

then you take an average weighted by how many trials were performed in each problem

Mario Carneiro (Jul 06 2018 at 02:26):

the standard deviation would be the sqrt-sum of the individual SDs, divided by 3 since there are three subproblems (assuming equal weights on the three problems)

Simon Hudon (Jul 06 2018 at 02:29):

Does it matter than one of the problem is much harder than the others? It has a run time around 16s while the others are around 7s and 3s

Mario Carneiro (Jul 06 2018 at 02:29):

Oh, I see, yes that changes things

Mario Carneiro (Jul 06 2018 at 02:30):

I'm assuming in this analysis that we have two gaussian variables X and Y and want to estimate them

Mario Carneiro (Jul 06 2018 at 02:30):

but this supposes that the mean and SD are fixed across the sample

Simon Hudon (Jul 06 2018 at 02:30):

That sounds right

Simon Hudon (Jul 06 2018 at 02:31):

I don't understand your last sentence. What's the opposite of fixed?

Mario Carneiro (Jul 06 2018 at 02:32):

We can assume that problem 1 has a gaussian distrib of times for each tactic, but there is no reason to believe that this same gaussian describes the times for problem 2

Simon Hudon (Jul 06 2018 at 02:34):

I think that's reasonable

Mario Carneiro (Jul 06 2018 at 02:37):

So we have to make some assumption about what we are estimating. Let's say we have good data on each problem, so we can say things like problem 1 is 2s +- .1s for tac 1 and 4s +- 1s for tac 2, and similarly for problem 2. If the goal is to minimize total run time over all future uses of tac1 vs tac2, then we need to assume that problem 1 and problem 2 are somehow representative of those future uses

Mario Carneiro (Jul 06 2018 at 02:37):

If we think problem 2 is more common than problem 1, then it will get a higher weight in the average

Simon Hudon (Jul 06 2018 at 02:38):

The three problems are deriving transportable for monoids, groups and rings. Rings are very expensive

Simon Hudon (Jul 06 2018 at 02:39):

We could believe that the complexity is linear in the number of fields of the structure

Mario Carneiro (Jul 06 2018 at 02:39):

I think it is safe to say that each of those will be run only once, but perhaps you want to extrapolate to even larger examples, or smaller?

Mario Carneiro (Jul 06 2018 at 02:40):

In other words you should estimate the common use case size

Simon Hudon (Jul 06 2018 at 02:42):

Digression: should there be an absolute value in your formula? It's giving me negative numbers

Simon Hudon (Jul 06 2018 at 02:42):

I think as you say there will be an average size for the problems where this is used

Mario Carneiro (Jul 06 2018 at 02:49):

You mean the a-b/sqrt(sd) formula?

Mario Carneiro (Jul 06 2018 at 02:51):

there should not be an absolute value, although I think I may have a sign error in there - one sign means tac1 is better, the other means tac2 is better

Simon Hudon (Jul 06 2018 at 02:52):

Ah I see! The highest probability I get is 50%. I'll keep the optimization in but that doesn't seem conclusive

Mario Carneiro (Jul 06 2018 at 02:52):

when you use erf it turns the number into a probability which is close to 100% for positive numbers and close to 0 for negative numbers

Mario Carneiro (Jul 06 2018 at 02:53):

50% means you have a very small difference, of course

Simon Hudon (Jul 06 2018 at 02:53):

Really? I don't get that. Maybe I should be suspicious of my erf implementation.

Mario Carneiro (Jul 06 2018 at 02:54):

you can report the number without the erf, that gives the number of SDs away from the mean

Mario Carneiro (Jul 06 2018 at 02:55):

wait, did you just push some haskell to mathlib?

Simon Hudon (Jul 06 2018 at 02:55):

On my branch yes. I'll take it out before pushing for a merge

Simon Hudon (Jul 06 2018 at 03:01):

I checked with another erf implementation, the results are consistent. I get the following numbers, one for each problem:

0.3548605107843447
0.392231363819503
0.5018422725422864


Simon Hudon (Jul 06 2018 at 03:01):

Is the last one 50% or 0.50%?

Mario Carneiro (Jul 06 2018 at 03:01):

I would assume 50%

Mario Carneiro (Jul 06 2018 at 03:02):

erf returns numbers in the range 0-1

Simon Hudon (Jul 06 2018 at 03:03):

Ok, I got confused when you said:

when you use erf it turns the number into a probability which is close to 100% for positive numbers and close to 0 for negative numbers

Mario Carneiro (Jul 06 2018 at 03:05):

Oh, I just checked wikipedia and it looks like the normalization conventions are a bit different for erf

Mario Carneiro (Jul 06 2018 at 03:05):

there is a factor sqrt 2 on the inputs, and it returns a value in the range -1 to 1

Mario Carneiro (Jul 06 2018 at 03:06):

Maybe you should just leave the erf off and interpret the number of SDs away from the mean directly

Mario Carneiro (Jul 06 2018 at 03:06):

what I really want there is the CDF of the standard normal distribution

Mario Carneiro (Jul 06 2018 at 03:07):

which is basically erf but needs some preprocessing

Mario Carneiro (Jul 06 2018 at 03:09):

wikipedia says the CDF is

$CDF_{N(\mu,\sigma)}(x)=\frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt 2}\right)\right]$

Simon Hudon (Jul 06 2018 at 03:10):

The erf library in haskell has a function normcdf with normcdf x = erfc(-x sqrt 2) ^ 2 does that make sense?

Mario Carneiro (Jul 06 2018 at 03:10):

based on the name that's likely to be the right one

Mario Carneiro (Jul 06 2018 at 03:11):

try evaluating it at -1, 0, 1

Simon Hudon (Jul 06 2018 at 03:13):

0.15865525393145705
0.5
0.8413447460685429


that's correct

Mario Carneiro (Jul 06 2018 at 03:15):

use that in place of erf in the formula I gave before

Thanks!

Simon Hudon (Jul 06 2018 at 03:17):

That looks much better!

Simon Hudon (Jul 06 2018 at 03:18):

0.6276517385920415
0.6416715838500027
0.6840264065071848


Simon Hudon (Jul 06 2018 at 03:18):

And if I flip aand b in a - b, I get:

0.37234826140795846
0.3583284161499974
0.3159735934928151


Simon Hudon (Jul 06 2018 at 03:19):

If we use b - a, that's the probability that b ≥ a, is correct?

yes

Cool

Mario Carneiro (Jul 06 2018 at 03:20):

The interpretation is something like "given the data you gathered, there is a 62% chance that in fact tac 1 is better than tac 2 on problem 1"

Simon Hudon (Jul 06 2018 at 03:21):

Btw, unfold_projs yielded a big improvement in the end: Lean stopped crashing

Mario Carneiro (Jul 06 2018 at 03:21):

ooh, what caused the crash?

Timeout

Simon Hudon (Jul 06 2018 at 03:22):

I suspect that definitions generated by a tactic don't play nicely with simp / dsimp / unfold / dunfold if you try to unfold them

Mario Carneiro (Jul 06 2018 at 03:24):

you can write your own simp lemmas for them if you want

Simon Hudon (Jul 06 2018 at 03:25):

Yes, I tried that. It did help a bit

Simon Hudon (Jul 06 2018 at 03:26):

But not as much as unfold_projs. It removed the need to hard code has_mul.mul et cie

Simon Hudon (Jul 06 2018 at 03:35):

I don't know about you but I'm having so much fun with that stuff (Lean) lately :D

Kenny Lau (Jul 06 2018 at 06:47):

I haven't done any statistics in a while. Can someone point me to a formula I can use to calculate the probability that x ≤ y if both are normally distributed random variables?

oh of course it's 50%, it's symmetric

Kevin Buzzard (Jul 06 2018 at 07:23):

oh of course it's 50%, it's symmetric

I think the means are different

Last updated: May 08 2021 at 11:09 UTC