“Machine Learning, Deep Learning, Data Science, etc. all sit in the same nexus, which is basically Probability plus Statistics.”

 

Statistics, MIT 18.6501x

A useful book and a probability reference link.

What will be covered:

 

Theorems and Tools

i.i.d. stands for independent and identically distributed.

r.v. denotes random variable

Law of Large Numbers (LLN)

According to the law, the average of the results obtained from performing the same experiment a large number of times should be close to the expected value, and tends to get closer as n grows.

Let $X, X_1, X_2, \ldots, X_n$ be i.i.d. r.v. The (weak and strong) laws of large numbers (LLN) state:

$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow[n \to \infty]{\mathbb{P},\ a.s.} \mu$$
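As an illustration (my own sketch, not part of the course materials; the Bernoulli(0.5) population is an arbitrary choice), a quick simulation shows the running average settling near the true mean as $n$ grows:

```python
import numpy as np

# Sketch: running average of i.i.d. Bernoulli(0.5) draws should approach
# the true mean p = 0.5 as n grows (LLN).
rng = np.random.default_rng(0)
p = 0.5
x = rng.binomial(1, p, size=100_000)                 # i.i.d. Ber(p) samples
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  sample mean={running_mean[n - 1]:.4f}")
# The printed means drift toward 0.5 as n increases.
```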

Central Limit Theorem (CLT)

The CLT establishes that when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, 1)$$

where (d) denotes convergence in distribution.

Since the limit is a standard Gaussian, this quantity will be a number in roughly $(-3, 3)$ with overwhelming probability, i.e. $|\bar{X}_n - \mu| \lesssim \frac{3\sigma}{\sqrt{n}}$.

$\bar{X}_n$ is close to $\mu$ up to deviations of order one over the square root of the sample size, $\frac{1}{\sqrt{n}}$.

Rule of thumb to apply the CLT: normally, we require $n \ge 30$.
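A minimal sketch of my own (the exponential population is an arbitrary choice) showing that the standardized sample mean of a skewed distribution already looks approximately standard normal at $n = 30$:

```python
import numpy as np

# Sketch: standardized sample means of a skewed (exponential) population
# should be approximately N(0, 1) once n is moderately large (CLT).
rng = np.random.default_rng(1)
lam, n, reps = 2.0, 30, 50_000            # rule of thumb: n >= 30
mu, sigma = 1 / lam, 1 / lam              # exponential mean and standard deviation

samples = rng.exponential(scale=1 / lam, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

print("mean of z      :", z.mean())                      # ~ 0
print("std of z       :", z.std())                       # ~ 1
print("P(|z| <= 1.96) :", np.mean(np.abs(z) <= 1.96))    # ~ 0.95
```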

Asymptotic normality

Assume that $\{X_n\}$ is an i.i.d. sequence with finite mean and variance. It therefore satisfies the conditions of the Central Limit Theorem.

Hence, the sample mean $\bar{X}_n$ in the above equality is asymptotically normal. In other words, the sample mean $\bar{X}_n$ converges in distribution to a normal random variable with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

$$\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right) \xrightarrow{d} Z$$

Continuous Mapping Theorem (CMT)

The CMT states that continuous functions preserve limits even if their arguments are sequences of random variables: a continuous function $f(\cdot)$ maps convergent sequences into convergent sequences,

$$T_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}/(d)} T \;\Longrightarrow\; f(T_n) \xrightarrow[n \to \infty]{a.s./\mathbb{P}/(d)} f(T)$$

It applies to a.s., $\mathbb{P}$ and (d) convergence respectively.

But how close is $f(T_n)$ to $f(T)$? That is what the Delta ($\Delta$) method answers.

Hoeffding's Inequality

What if n is not large enough to apply CLT?

For bounded random variables, Hoeffding's inequality gives a statement for any $n$. Let $n$ be a positive integer and $X_i \in [a, b]$ almost surely ($a < b$ given numbers). Then the following holds even for small sample sizes $n$:

$$\mathbb{P}\left[\,|\bar{X}_n - \mu| \ge \varepsilon\,\right] \le 2e^{-\frac{2n\varepsilon^2}{(b-a)^2}}, \quad \forall \varepsilon > 0$$

Here I need the random variables to actually be almost surely bounded, which rules out, for example, Gaussian and Exponential random variables.

How to choose ε ?

So let's parse this for a second: take $\varepsilon = \frac{c}{\sqrt{n}}$:

$$X_i \overset{iid}{\sim} \text{Ber}(p) \;\Longrightarrow\; \mathbb{P}\left(\left|\bar{X}_n - p\right| \ge \frac{c}{\sqrt{n}}\right) \le 2e^{-2c^2}$$

since $(b - a)^2 = 1$ for Bernoulli random variables.

The $\frac{1}{\sqrt{n}}$ qualitative behavior holds for any $n$, not just in the limit.

So the conclusion is that the sample average is a good replacement for the true mean.

Is this tight? That's the annoying thing about inequalities.

The left-hand side could actually be as small as something like $e^{-e^{n}}$, which can be much, much smaller than the bound, so the inequality can be quite loose.
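To see how loose the bound can be, here is a small sketch of my own (Bernoulli(0.5), $n = 50$, $\varepsilon = 0.1$ chosen arbitrarily) comparing Hoeffding's bound with the simulated deviation probability:

```python
import numpy as np

# Sketch: compare Hoeffding's bound with the simulated probability
# P(|X_bar - p| >= eps) for X_i ~ Ber(p), where (b - a)^2 = 1.
rng = np.random.default_rng(2)
p, n, eps, reps = 0.5, 50, 0.1, 100_000

means = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(means - p) >= eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)

print("empirical probability:", empirical)   # typically much smaller...
print("Hoeffding bound      :", hoeffding)   # ...than the bound
```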

Chebyshev Inequality & Markov Inequality

These two inequalities guarantee upper bounds on $\mathbb{P}(X \ge t)$ based on the mean and variance of $X$.

Markov inequality

For a random variable $X \ge 0$ with mean $\mu > 0$, and any number $t > 0$:

$$\mathbb{P}(X \ge t) \le \frac{\mu}{t}$$

Note that the Markov inequality is restricted to non-negative random variables.

Chebyshev inequality

For a random variable $X$ with (finite) mean $\mu$ and variance $\sigma^2$, and for any number $t > 0$,

$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$

Remark: when Markov's inequality is applied to $(X - \mu)^2$, we obtain Chebyshev's inequality. Markov's inequality is also used in the proof of Hoeffding's inequality.
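As a quick sanity check (my own sketch; the Exp(1) example and thresholds are arbitrary choices), both bounds can be compared with the exact tail probability:

```python
import numpy as np
from scipy import stats

# Sketch: Markov and Chebyshev bounds vs the exact tail of an Exp(1) r.v.
# (mean = 1, variance = 1), evaluated at a few thresholds t > mean.
mu, var = 1.0, 1.0
for t in (2.0, 3.0, 5.0):
    exact = stats.expon.sf(t)                # P(X >= t)
    markov = mu / t                          # Markov: valid since X >= 0
    chebyshev = var / (t - mu) ** 2          # Chebyshev applied to |X - mu| >= t - mu
    print(f"t={t}: exact={exact:.4f}  Markov<={markov:.4f}  Chebyshev<={chebyshev:.4f}")
```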

Triangle Inequality

$$|\alpha \pm \beta| \le |\alpha| + |\beta|$$

Slutsky’s Theorem

Slutsky's Theorem will be our main tool for convergence in distribution.

Let $T_n, U_n$ be two sequences of r.v. such that:

$$T_n \xrightarrow[n \to \infty]{(d)} T \quad \text{and} \quad U_n \xrightarrow[n \to \infty]{\mathbb{P}} u,$$

where $u$ is a deterministic constant.

Then,

$$T_n + U_n \xrightarrow[n \to \infty]{(d)} T + u \quad \text{and} \quad T_n U_n \xrightarrow[n \to \infty]{(d)} uT$$

Delta (Δ) method

https://en.wikipedia.org/wiki/Delta_method
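For reference, the standard first-order statement (not spelled out in these notes, only linked above): if $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \ne 0$, then

$$\sqrt{n}\left(g(\bar{X}_n) - g(\mu)\right) \xrightarrow[n \to \infty]{(d)} \mathcal{N}\left(0,\ \sigma^2\, g'(\mu)^2\right)$$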

Distribution

Discrete: Probability mass function

Gaussian Distribution

Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$

Mean and Variance: $\mu$, $\sigma^2$

Probability Density Function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$


Cumulative Distribution Function:

$$\Phi(x) = \mathbb{P}(X \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$

$\operatorname{erf}(\cdot)$ is the error function.


Why do we use the Gaussian distribution so frequently?

Normally, we use the sample mean as our estimator, and the reason the Gaussian distribution shows up is that it is the CLT limit the minute you start talking about averages.

It is a universality-type result: if you average enough things, the result goes to a Gaussian.

What about extreme values?

The support of a Gaussian is $(-\infty, \infty)$, but when we use it to model a real quantity, say height, negative values can never occur. Why do we still use the Gaussian?

Yes, extreme values exist, but they never really come into play: the exponential tail gets really, really small, so the Gaussian effectively lives in a bounded interval.

Gaussian Probability Tables

A Gaussian CDF (z-score) calculator.


Quantiles

| $\alpha$ | 2.5% | 5% | 10% |
| --- | --- | --- | --- |
| $q_\alpha$ | 1.96 | 1.65 | 1.28 |
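These values can be reproduced with a standard normal inverse CDF (a small sketch of my own, using scipy):

```python
from scipy.stats import norm

# Sketch: reproduce the quantile table q_alpha = Phi^{-1}(1 - alpha)
# for the standard normal.
for alpha in (0.025, 0.05, 0.10):
    print(f"alpha={alpha:>5}: q_alpha = {norm.ppf(1 - alpha):.3f}")
# -> 1.960, 1.645, 1.282  (the table rounds to 1.96, 1.65, 1.28)
```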

 

Poisson Distribution

The advantage of using the Poisson distribution is that $n$ and $p$ do not need to be known individually! This can make modelling assumptions much easier.

Notation: $X \sim \text{Poi}(\lambda)$, $\lambda \in (0, \infty)$

Mean and Variance: $\lambda$, $\lambda$:

$$\mathbb{E}(X^2) = \text{Var}(X) + \mathbb{E}(X)^2 = \lambda + \lambda^2$$


Prerequisite: $0! = 1$.

Probability Mass Function:

$$\mathbb{P}(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \ldots$$


Cumulative distribution function:

$$F(k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$$

Exponential Distribution

Sample space: $x \in [0, \infty)$

Mean and Variance: $\frac{1}{\lambda}$, $\frac{1}{\lambda^2}$

Probability Density Function:

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$


Cumulative distribution function:

$$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0$$



 

Gamma Distribution

Notation: $X \sim \text{Gamma}(\alpha, \beta)$ or $\Gamma(\alpha, \lambda)$, $\lambda = \frac{1}{\beta}$

Parameters: $\alpha, \lambda \in (0, \infty)$

Mean and Variance: $\frac{\alpha}{\lambda}$, $\frac{\alpha}{\lambda^2}$

Gamma Function:

$$\Gamma(s) = \int_0^{\infty} x^{s-1} e^{-x} \, dx$$

$$\begin{cases} \Gamma(\alpha) = (\alpha - 1)! & \text{if } \alpha \in \mathbb{Z}^+ \\ \Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) & \text{if } \alpha \in \mathbb{R}^+ \\ \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi} & \end{cases}$$
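A quick numeric sketch of my own checking the three Gamma-function facts above (the specific arguments are arbitrary):

```python
import math
from scipy.special import gamma

# Sketch: check the three Gamma-function facts listed above.
print(gamma(5), math.factorial(4))        # Gamma(5) = 4! = 24
print(gamma(3.7), 2.7 * gamma(2.7))       # Gamma(a) = (a - 1) * Gamma(a - 1)
print(gamma(0.5), math.sqrt(math.pi))     # Gamma(1/2) = sqrt(pi)
```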

Probability Density Function:

$$f(x) = \frac{x^{\alpha - 1} \lambda^{\alpha} e^{-\lambda x}}{\Gamma(\alpha)} = \frac{x^{\alpha - 1} e^{-\frac{x}{\beta}}}{\beta^{\alpha}\,\Gamma(\alpha)}, \quad x > 0$$

 

https://upload.wikimedia.org/wikipedia/commons/f/fc/Gamma_distribution_pdf.png

Cumulative distribution function:

$$F(x) = \frac{\gamma(\alpha, \lambda x)}{\Gamma(\alpha)}$$

where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function.

https://upload.wikimedia.org/wikipedia/commons/a/a9/Gamma_distribution_cdf.png


Geometric Distribution

For example: the number of trials until the first success.

The geometric distribution is either of the two distributions below: the number of trials $X$ until the first success, or the number of failures $Y$ before the first success.

 

Notation: $X \sim \text{Geo}(p)$ or $Y \sim \text{Geo}(p)$

Mean and Variance: $\frac{1}{p}$ or $\frac{1-p}{p}$, and $\frac{1-p}{p^2}$

Probability Mass Function:

$$(1 - p)^{k-1}\, p \quad \text{or} \quad (1 - p)^{k}\, p$$


Cumulative distribution function:

$$1 - (1 - p)^{k} \quad \text{or} \quad 1 - (1 - p)^{k+1}$$


Binomial Distribution

Notation: $X \sim B(n, p)$, where $n$ is the number of trials and $p$ is the success probability of each trial

Mean and Variance: $np$, $\sigma^2 = np(1-p)$

Probability Mass Function:

$$\mathbb{P}(X = k) = \binom{n}{k} p^k q^{n-k} = \frac{n!}{k!(n-k)!}\, p^k q^{n-k}, \quad q = 1 - p$$


Cumulative Distribution Function:

$$F(k) = I_q(n - k, 1 + k)$$

where $I_x(a, b)$ is the regularized incomplete beta function and $q = 1 - p$.


Note that a binomial distribution has a finite number of possible values, whereas a normal distribution has infinitely many.
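A small sketch of my own (arbitrary $n$, $p$, $k$) verifying that the regularized incomplete beta form of the CDF matches summing the PMF directly:

```python
from scipy.stats import binom
from scipy.special import betainc

# Sketch: the binomial CDF F(k) = I_q(n - k, 1 + k) with q = 1 - p
# should agree with summing the PMF up to k.
n, p, k = 10, 0.3, 4
q = 1 - p
print(binom.cdf(k, n, p))          # sum of the PMF up to k
print(betainc(n - k, k + 1, q))    # regularized incomplete beta I_q(n - k, k + 1)
```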

Bernoulli Distribution

Notation: $X \sim \text{Ber}(p)$

Mean and Variance: $p$, $p(1-p)$

Probability Mass Function:

$$p^{k}(1 - p)^{1-k}, \quad k \in \{0, 1\}$$

 

Indicator Function

The indicator function of a subset A of a set X is a function

$$\mathbb{1}_A : X \to \{0, 1\}$$

defined as

$$\mathbb{1}_A(x) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$
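A direct translation into code (my own trivial sketch; the set $\{1, 2, 3\}$ is just an example):

```python
# Sketch: indicator function of a subset A, as defined above.
def indicator(A):
    return lambda x: 1 if x in A else 0

one_A = indicator({1, 2, 3})
print(one_A(2), one_A(7))   # -> 1 0
```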

Derivative of indicator function

I have an indicator function $\mathbb{1}(D > Q)$. Its derivative with respect to $D$ is

$$\frac{\partial}{\partial D}\, \mathbb{1}(D > Q) = \delta(D - Q)$$

$\delta$ is symmetric. $\delta$ can be thought of as the derivative of the Heaviside step function $H(x)$, where $H(x) = 1$ for $x > 0$ and $H(x) = 0$ for $x < 0$.

Moment Generation Function (MGF)

https://en.wikipedia.org/wiki/Moment-generating_function

 

expectation of moment generating function


https://online.stat.psu.edu/stat414/book/export/html/676


mixture distribution moment generating function

Useful…

https://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/6normgf.pdf
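Since this section only collects links, here is a minimal numeric sketch of my own: estimating the Gaussian MGF $M_X(t) = \mathbb{E}[e^{tX}]$ by Monte Carlo and comparing it with the closed form $e^{\mu t + \sigma^2 t^2 / 2}$ (the values of $\mu$, $\sigma$, $t$ are arbitrary):

```python
import numpy as np

# Sketch: Monte Carlo estimate of the MGF M_X(t) = E[e^{tX}] for
# X ~ N(mu, sigma^2), compared with the closed form exp(mu*t + sigma^2*t^2/2).
rng = np.random.default_rng(3)
mu, sigma, t = 1.0, 2.0, 0.3
x = rng.normal(mu, sigma, size=1_000_000)

print("Monte Carlo:", np.mean(np.exp(t * x)))
print("Closed form:", np.exp(mu * t + sigma**2 * t**2 / 2))
```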

 

 

Properties of a Distribution

We take the Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ as an example.

Affine Transformation:

$$\alpha X + \beta \sim \mathcal{N}(\alpha\mu + \beta,\ \alpha^2\sigma^2)$$

Standardization:

Let $Z = \frac{X - \mu}{\sigma}$, so that $Z \sim \mathcal{N}(0, 1)$. Then

$$\mathbb{P}(u < X < v) = \mathbb{P}\left(\frac{u - \mu}{\sigma} < Z < \frac{v - \mu}{\sigma}\right)$$
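For instance (a sketch of my own with arbitrary $\mu$, $\sigma$, $u$, $v$), computing $\mathbb{P}(u < X < v)$ via the standardized variable and checking against the direct computation:

```python
from scipy.stats import norm

# Sketch: P(u < X < v) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma.
mu, sigma, u, v = 2.0, 3.0, 0.0, 5.0
z_u, z_v = (u - mu) / sigma, (v - mu) / sigma

prob_std = norm.cdf(z_v) - norm.cdf(z_u)                    # using Z ~ N(0, 1)
prob_dir = norm.cdf(v, mu, sigma) - norm.cdf(u, mu, sigma)  # direct check
print(prob_std, prob_dir)   # identical values
```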

Symmetry:

$$\mathbb{P}(|X| > x) = \mathbb{P}(X > x) + \mathbb{P}(X < -x) = 2\,\mathbb{P}(X > x)$$

(for $X$ symmetric about $0$, e.g. $\mathcal{N}(0, \sigma^2)$).

Modes of Convergence

Three types of convergence, ordered from strongest to weakest.

Almost surely (a.s.) convergence

The realized sequence $T_n(\omega)$ itself must converge to $T(\omega)$ for (almost) every outcome $\omega$:

$$T_n \xrightarrow[n \to \infty]{a.s.} T \iff \mathbb{P}\left[\left\{\omega : T_n(\omega) \xrightarrow[n \to \infty]{} T(\omega)\right\}\right] = 1$$

Convergence in probability

The probability that $T_n$ and $T$ depart from each other by any fixed amount goes to $0$ as $n$ goes to infinity:

$$T_n \xrightarrow[n \to \infty]{\mathbb{P}} T \iff \mathbb{P}\left[\,|T_n - T| \ge \varepsilon\,\right] \xrightarrow[n \to \infty]{} 0, \quad \forall \varepsilon > 0$$

Convergence in distribution

This is just saying that we measure something about the random variable, namely its distribution. For all continuous and bounded functions $f$:

$$T_n \xrightarrow[n \to \infty]{(d)} T \iff \mathbb{E}[f(T_n)] \xrightarrow[n \to \infty]{} \mathbb{E}[f(T)]$$

This only says that the distribution is converging: the probabilities computed for $T_n$ become the same as those computed for $T$ as $n$ goes to infinity.

Properties

Convergence in distribution implies convergence of probabilities if the limit has a density (e.g. Gaussian):

$$T_n \xrightarrow[n \to \infty]{(d)} T \;\Longrightarrow\; \mathbb{P}(a \le T_n \le b) \xrightarrow[n \to \infty]{} \mathbb{P}(a \le T \le b)$$

Addition, Multiplication and Division

Assume,

$$T_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} T \quad \text{and} \quad U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} U$$

Then,

$$T_n + U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} T + U, \qquad T_n U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} TU, \qquad \frac{T_n}{U_n} \xrightarrow[n \to \infty]{a.s./\mathbb{P}} \frac{T}{U} \ \ (\text{if } U \ne 0 \text{ a.s.})$$

Warning: In general, these rules do not apply to convergence in distribution (d).

 

Estimator

Normally, we have two estimators:

How can we decide how many samples ($n$) we need to draw our conclusion? What is the cutoff? If $n = 60$ is enough, what about 59 or 58?

Now we have our first estimator of $p$ in the Kissing Example; we put a hat on anything that is an estimator of something.

$$\hat{p} = \bar{R}_n = \frac{1}{n}\sum_{i=1}^{n} R_i$$

Averages of random variables are essentially controlled by two major tools: the LLN and the CLT.

What is the accuracy of this estimator?

What is the probability that $\hat{p}$ is away from $p$ by more than 10%?

We don't even know the true $p$, and the observation $\hat{p}$, being built from random variables, fluctuates. We need a method to measure this fluctuation.
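As a rough sketch of my own (assuming, purely for illustration, that $R_i \overset{iid}{\sim} \text{Ber}(p)$ with an arbitrary true $p$ and $n$, and interpreting "10%" as an absolute deviation of 0.1), the fluctuation of $\hat{p}$ can be explored by simulation and compared with the CLT approximation:

```python
import numpy as np
from scipy.stats import norm

# Sketch: how often does p_hat land more than eps = 0.1 away from the true p?
# Compare simulation with the two-sided CLT approximation.
rng = np.random.default_rng(4)
p, n, reps, eps = 0.5, 100, 100_000, 0.1

p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(p_hat - p) > eps)

sigma = np.sqrt(p * (1 - p))
clt_approx = 2 * norm.sf(np.sqrt(n) * eps / sigma)   # 2 * P(Z > sqrt(n)*eps/sigma)

print("simulated :", empirical)
print("CLT approx:", clt_approx)
```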

Modelling Assumptions:

Measures of Distance

If we want to estimate the mean of a Gaussian, we can use the sample expectation, but that alone doesn't work for the variance.

You can, for example, compute the variance from the sample moments, which is essentially the method of moments.

But it turns out we have a much more powerful method called the maximum likelihood method, though it is far from trivial.