“Machine Learning, Deep Learning, Data Science, etc. all sit in the same nexus, which is basically Probability plus Statistics.”

 

Statistics, MIT 18.6501x

A useful book and a probability reference link.

What will be covered:

 

Theorems and Tools

i.i.d. stands for independent and identically distributed.

r.v. denotes random variable

Law of Large Numbers (LLN)

According to the law, the average of the results obtained from performing the same experiment a large number of times should be close to the expected value, and tends to get closer as n grows.

Let $X, X_1, X_2, \ldots, X_n$ be i.i.d. r.v. The (weak and strong) laws of large numbers (LLN) state:

$$\bar{X}_n := \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow[n \to \infty]{\mathbb{P},\ a.s.} \mu$$
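As an illustration (my own sketch, not part of the course materials; the Bernoulli(0.5) population is an arbitrary choice), a quick simulation shows the running average settling near the true mean as $n$ grows:

```python
import numpy as np

# Sketch: running average of i.i.d. Bernoulli(0.5) draws should approach
# the true mean p = 0.5 as n grows (LLN).
rng = np.random.default_rng(0)
p = 0.5
x = rng.binomial(1, p, size=100_000)                 # i.i.d. Ber(p) samples
running_mean = np.cumsum(x) / np.arange(1, x.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n={n:>6}  sample mean={running_mean[n - 1]:.4f}")
# The printed means drift toward 0.5 as n increases.
```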

Central Limit Theorem (CLT)

The CLT establishes that when independent random variables are summed, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed:

$$\sqrt{n}\,\frac{\bar{X}_n - \mu}{\sigma} \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, 1)$$

where (d) denotes convergence in distribution.

Since the limit is a standard Gaussian, this quantity will be a number in roughly $(-3, 3)$ with overwhelming probability, i.e. $|\bar{X}_n - \mu| \lesssim \frac{3\sigma}{\sqrt{n}}$.

$\bar{X}_n$ is close to $\mu$ up to deviations of order one over the square root of the sample size, $\frac{1}{\sqrt{n}}$.

Rule of thumb to apply the CLT: normally, we require $n \ge 30$.
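A minimal sketch of my own (the exponential population is an arbitrary choice) showing that the standardized sample mean of a skewed distribution already looks approximately standard normal at $n = 30$:

```python
import numpy as np

# Sketch: standardized sample means of a skewed (exponential) population
# should be approximately N(0, 1) once n is moderately large (CLT).
rng = np.random.default_rng(1)
lam, n, reps = 2.0, 30, 50_000            # rule of thumb: n >= 30
mu, sigma = 1 / lam, 1 / lam              # exponential mean and standard deviation

samples = rng.exponential(scale=1 / lam, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma

print("mean of z      :", z.mean())                      # ~ 0
print("std of z       :", z.std())                       # ~ 1
print("P(|z| <= 1.96) :", np.mean(np.abs(z) <= 1.96))    # ~ 0.95
```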

Asymptotic normality

Assume that $\{X_n\}$ is an i.i.d. sequence with finite mean and variance. It therefore satisfies the conditions of the Central Limit Theorem.

Hence, the sample mean $\bar{X}_n$ in the above equality is asymptotically normal. In other words, the sample mean $\bar{X}_n$ converges in distribution to a normal random variable with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

$$\sqrt{n}\left(\frac{\bar{X}_n - \mu}{\sigma}\right) \xrightarrow{d} Z$$

Continuous Mapping Theorem (CMT)

The CMT states that continuous functions preserve limits even if their arguments are sequences of random variables: a continuous function $f(\cdot)$ maps convergent sequences into convergent sequences,

$$T_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}/(d)} T \;\Longrightarrow\; f(T_n) \xrightarrow[n \to \infty]{a.s./\mathbb{P}/(d)} f(T)$$

It applies to a.s., $\mathbb{P}$ and (d) convergence respectively.

But how close is $f(T_n)$ to $f(T)$? That is what the Delta ($\Delta$) method answers.

Hoeffding's Inequality

What if n is not large enough to apply CLT?

For bounded random variables, Hoeffding's inequality gives a statement for any $n$. Let $n$ be a positive integer and $X_i \in [a, b]$ almost surely ($a < b$ given numbers). Then the following holds even for small sample sizes $n$:

$$\mathbb{P}\left[\,|\bar{X}_n - \mu| \ge \varepsilon\,\right] \le 2e^{-\frac{2n\varepsilon^2}{(b-a)^2}}, \quad \forall \varepsilon > 0$$

Here I need the random variables to actually be almost surely bounded, which rules out, for example, Gaussian and Exponential random variables.

How to choose ε ?

So let's parse this for a second: take $\varepsilon = \frac{c}{\sqrt{n}}$:

$$X_i \overset{iid}{\sim} \text{Ber}(p) \;\Longrightarrow\; \mathbb{P}\left(\left|\bar{X}_n - p\right| \ge \frac{c}{\sqrt{n}}\right) \le 2e^{-2c^2}$$

since $(b - a)^2 = 1$ for Bernoulli random variables.

The $\frac{1}{\sqrt{n}}$ qualitative behavior holds for any $n$, not just in the limit.

So the conclusion is that the sample average is a good replacement for the true mean.

Is this tight? That's the annoying thing about inequalities.

The left-hand side could actually be as small as something like $e^{-e^{n}}$, which can be much, much smaller than the bound, so the inequality can be quite loose.
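To see how loose the bound can be, here is a small sketch of my own (Bernoulli(0.5), $n = 50$, $\varepsilon = 0.1$ chosen arbitrarily) comparing Hoeffding's bound with the simulated deviation probability:

```python
import numpy as np

# Sketch: compare Hoeffding's bound with the simulated probability
# P(|X_bar - p| >= eps) for X_i ~ Ber(p), where (b - a)^2 = 1.
rng = np.random.default_rng(2)
p, n, eps, reps = 0.5, 50, 0.1, 100_000

means = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(means - p) >= eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)

print("empirical probability:", empirical)   # typically much smaller...
print("Hoeffding bound      :", hoeffding)   # ...than the bound
```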

Chebyshev Inequality & Markov Inequality

These two inequalities guarantee upper bounds on $\mathbb{P}(X \ge t)$ based on the mean and variance of $X$.

Markov inequality

For a random variable $X \ge 0$ with mean $\mu > 0$, and any number $t > 0$:

$$\mathbb{P}(X \ge t) \le \frac{\mu}{t}$$

Note that the Markov inequality is restricted to non-negative random variables.

Chebyshev inequality

For a random variable $X$ with (finite) mean $\mu$ and variance $\sigma^2$, and for any number $t > 0$,

$$\mathbb{P}(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}$$

Remark: when Markov's inequality is applied to $(X - \mu)^2$, we obtain Chebyshev's inequality. Markov's inequality is also used in the proof of Hoeffding's inequality.
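As a quick sanity check (my own sketch; the Exp(1) example and thresholds are arbitrary choices), both bounds can be compared with the exact tail probability:

```python
import numpy as np
from scipy import stats

# Sketch: Markov and Chebyshev bounds vs the exact tail of an Exp(1) r.v.
# (mean = 1, variance = 1), evaluated at a few thresholds t > mean.
mu, var = 1.0, 1.0
for t in (2.0, 3.0, 5.0):
    exact = stats.expon.sf(t)                # P(X >= t)
    markov = mu / t                          # Markov: valid since X >= 0
    chebyshev = var / (t - mu) ** 2          # Chebyshev applied to |X - mu| >= t - mu
    print(f"t={t}: exact={exact:.4f}  Markov<={markov:.4f}  Chebyshev<={chebyshev:.4f}")
```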

Triangle Inequality

$$|\alpha \pm \beta| \le |\alpha| + |\beta|$$

Slutsky’s Theorem

Slutsky's Theorem will be our main tool for convergence in distribution.

Let $T_n, U_n$ be two sequences of r.v. such that:

$$T_n \xrightarrow[n \to \infty]{(d)} T \quad \text{and} \quad U_n \xrightarrow[n \to \infty]{\mathbb{P}} u,$$

where $u$ is a deterministic constant.

Then,

$$T_n + U_n \xrightarrow[n \to \infty]{(d)} T + u \quad \text{and} \quad T_n U_n \xrightarrow[n \to \infty]{(d)} uT$$

Delta (Δ) method

https://en.wikipedia.org/wiki/Delta_method
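For reference, the standard first-order statement (not spelled out in these notes, only linked above): if $\sqrt{n}\,(\bar{X}_n - \mu) \xrightarrow[n \to \infty]{(d)} \mathcal{N}(0, \sigma^2)$ and $g$ is differentiable at $\mu$ with $g'(\mu) \ne 0$, then

$$\sqrt{n}\left(g(\bar{X}_n) - g(\mu)\right) \xrightarrow[n \to \infty]{(d)} \mathcal{N}\left(0,\ \sigma^2\, g'(\mu)^2\right)$$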

Distribution

Discrete: Probability mass function

Gaussian Distribution

Notation: $X \sim \mathcal{N}(\mu, \sigma^2)$

Mean and Variance: $\mu$, $\sigma^2$

Probability Density Function:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$


Cumulative Distribution Function:

$$\Phi(x) = \mathbb{P}(X \le x) = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right)\right]$$

$\operatorname{erf}(\cdot)$ is the error function.


Why do we use the Gaussian distribution so frequently?

Normally, we use the sample mean as our estimator, and the reason the Gaussian distribution shows up is that it is the CLT limit the minute you start talking about averages.

It is a universality-type result: if you average enough things, the result goes to a Gaussian.

What about extreme values?

The support of a Gaussian is $(-\infty, \infty)$, but when we use it to model a real quantity, say height, negative values can never occur. Why do we still use the Gaussian?

Yes, extreme values exist, but they never really come into play: the exponential tail gets really, really small, so the Gaussian effectively lives in a bounded interval.

Gaussian Probability Tables

A Gaussian CDF (z-score) calculator.


Quantiles

| $\alpha$ | 2.5% | 5% | 10% |
| --- | --- | --- | --- |
| $q_\alpha$ | 1.96 | 1.65 | 1.28 |
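These values can be reproduced with a standard normal inverse CDF (a small sketch of my own, using scipy):

```python
from scipy.stats import norm

# Sketch: reproduce the quantile table q_alpha = Phi^{-1}(1 - alpha)
# for the standard normal.
for alpha in (0.025, 0.05, 0.10):
    print(f"alpha={alpha:>5}: q_alpha = {norm.ppf(1 - alpha):.3f}")
# -> 1.960, 1.645, 1.282  (the table rounds to 1.96, 1.65, 1.28)
```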

 

Poisson Distribution

The advantage of using the Poisson distribution is that $n$ and $p$ do not need to be known individually! This can make modelling assumptions much easier.

Notation: $X \sim \text{Poi}(\lambda)$, $\lambda \in (0, \infty)$

Mean and Variance: $\lambda$, $\lambda$:

$$\mathbb{E}(X^2) = \text{Var}(X) + \mathbb{E}(X)^2 = \lambda + \lambda^2$$


Prerequisite: $0! = 1$.

Probability Mass Function:

$$\mathbb{P}(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, 2, \ldots$$


Cumulative distribution function:

$$F(k) = e^{-\lambda} \sum_{i=0}^{\lfloor k \rfloor} \frac{\lambda^i}{i!}$$

Exponential Distribution

Sample space: $x \in [0, \infty)$

Mean and Variance: $\frac{1}{\lambda}$, $\frac{1}{\lambda^2}$

Probability Density Function:

$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$


Cumulative distribution function:

$$F(x) = 1 - e^{-\lambda x}, \quad x \ge 0$$



 

Gamma Distribution

Notation: $X \sim \text{Gamma}(\alpha, \beta)$ or $\Gamma(\alpha, \lambda)$, $\lambda = \frac{1}{\beta}$

Parameters: $\alpha, \lambda \in (0, \infty)$

Mean and Variance: $\frac{\alpha}{\lambda}$, $\frac{\alpha}{\lambda^2}$

Gamma Function:

$$\Gamma(s) = \int_0^{\infty} x^{s-1} e^{-x} \, dx$$

$$\begin{cases} \Gamma(\alpha) = (\alpha - 1)! & \text{if } \alpha \in \mathbb{Z}^+ \\ \Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1) & \text{if } \alpha \in \mathbb{R}^+ \\ \Gamma\left(\frac{1}{2}\right) = \sqrt{\pi} & \end{cases}$$
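A quick numeric sketch of my own checking the three Gamma-function facts above (the specific arguments are arbitrary):

```python
import math
from scipy.special import gamma

# Sketch: check the three Gamma-function facts listed above.
print(gamma(5), math.factorial(4))        # Gamma(5) = 4! = 24
print(gamma(3.7), 2.7 * gamma(2.7))       # Gamma(a) = (a - 1) * Gamma(a - 1)
print(gamma(0.5), math.sqrt(math.pi))     # Gamma(1/2) = sqrt(pi)
```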

Probability Density Function:

$$f(x) = \frac{x^{\alpha - 1} \lambda^{\alpha} e^{-\lambda x}}{\Gamma(\alpha)} = \frac{x^{\alpha - 1} e^{-\frac{x}{\beta}}}{\beta^{\alpha}\,\Gamma(\alpha)}, \quad x > 0$$

 

https://upload.wikimedia.org/wikipedia/commons/f/fc/Gamma_distribution_pdf.png

Cumulative distribution function:

$$F(x) = \frac{\gamma(\alpha, \lambda x)}{\Gamma(\alpha)}$$

where $\gamma(\cdot,\cdot)$ is the lower incomplete gamma function.

https://upload.wikimedia.org/wikipedia/commons/a/a9/Gamma_distribution_cdf.png


Geometric Distribution

For example: the number of trials until the first success.

The geometric distribution is either of the two distributions below: the number of trials $X$ until the first success, or the number of failures $Y$ before the first success.

 

Notation: $X \sim \text{Geo}(p)$ or $Y \sim \text{Geo}(p)$

Mean and Variance: $\frac{1}{p}$ or $\frac{1-p}{p}$, and $\frac{1-p}{p^2}$

Probability Mass Function:

$$(1 - p)^{k-1}\, p \quad \text{or} \quad (1 - p)^{k}\, p$$


Cumulative distribution function:

$$1 - (1 - p)^{k} \quad \text{or} \quad 1 - (1 - p)^{k+1}$$


Binomial Distribution

Notation: $X \sim B(n, p)$, where $n$ is the number of trials and $p$ is the success probability of each trial

Mean and Variance: $np$, $\sigma^2 = np(1-p)$

Probability Mass Function:

$$\mathbb{P}(X = k) = \binom{n}{k} p^k q^{n-k} = \frac{n!}{k!(n-k)!}\, p^k q^{n-k}, \quad q = 1 - p$$


Cumulative Distribution Function:

$$F(k) = I_q(n - k, 1 + k)$$

where $I_x(a, b)$ is the regularized incomplete beta function and $q = 1 - p$.


Note that a binomial distribution has a finite number of possible values, whereas a normal distribution has infinitely many.
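A small sketch of my own (arbitrary $n$, $p$, $k$) verifying that the regularized incomplete beta form of the CDF matches summing the PMF directly:

```python
from scipy.stats import binom
from scipy.special import betainc

# Sketch: the binomial CDF F(k) = I_q(n - k, 1 + k) with q = 1 - p
# should agree with summing the PMF up to k.
n, p, k = 10, 0.3, 4
q = 1 - p
print(binom.cdf(k, n, p))          # sum of the PMF up to k
print(betainc(n - k, k + 1, q))    # regularized incomplete beta I_q(n - k, k + 1)
```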

Bernoulli Distribution

Notation: $X \sim \text{Ber}(p)$

Mean and Variance: $p$, $p(1-p)$

Probability Mass Function:

$$p^{k}(1 - p)^{1-k}, \quad k \in \{0, 1\}$$

 

Indicator Function

The indicator function of a subset A of a set X is a function

$$\mathbb{1}_A : X \to \{0, 1\}$$

defined as

$$\mathbb{1}_A(x) := \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{if } x \notin A \end{cases}$$
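A direct translation into code (my own trivial sketch; the set $\{1, 2, 3\}$ is just an example):

```python
# Sketch: indicator function of a subset A, as defined above.
def indicator(A):
    return lambda x: 1 if x in A else 0

one_A = indicator({1, 2, 3})
print(one_A(2), one_A(7))   # -> 1 0
```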

Derivative of indicator function

I have an indicator function $\mathbb{1}(D > Q)$. Its derivative with respect to $D$ is

$$\frac{\partial}{\partial D}\, \mathbb{1}(D > Q) = \delta(D - Q)$$

$\delta$ is symmetric. $\delta$ can be thought of as the derivative of the Heaviside step function $H(x)$, where $H(x) = 1$ for $x > 0$ and $H(x) = 0$ for $x < 0$.

Moment Generation Function (MGF)

https://en.wikipedia.org/wiki/Moment-generating_function

 

expectation of moment generating function


https://online.stat.psu.edu/stat414/book/export/html/676


mixture distribution moment generating function

Useful…

https://www.le.ac.uk/users/dsgp1/COURSES/MATHSTAT/6normgf.pdf
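Since this section only collects links, here is a minimal numeric sketch of my own: estimating the Gaussian MGF $M_X(t) = \mathbb{E}[e^{tX}]$ by Monte Carlo and comparing it with the closed form $e^{\mu t + \sigma^2 t^2 / 2}$ (the values of $\mu$, $\sigma$, $t$ are arbitrary):

```python
import numpy as np

# Sketch: Monte Carlo estimate of the MGF M_X(t) = E[e^{tX}] for
# X ~ N(mu, sigma^2), compared with the closed form exp(mu*t + sigma^2*t^2/2).
rng = np.random.default_rng(3)
mu, sigma, t = 1.0, 2.0, 0.3
x = rng.normal(mu, sigma, size=1_000_000)

print("Monte Carlo:", np.mean(np.exp(t * x)))
print("Closed form:", np.exp(mu * t + sigma**2 * t**2 / 2))
```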

 

 

Properties of a Distribution

We take the Gaussian $X \sim \mathcal{N}(\mu, \sigma^2)$ as an example.

Affine Transformation:

$$\alpha X + \beta \sim \mathcal{N}(\alpha\mu + \beta,\ \alpha^2\sigma^2)$$

Standardization:

Let $Z = \frac{X - \mu}{\sigma}$, so that $Z \sim \mathcal{N}(0, 1)$. Then

$$\mathbb{P}(u < X < v) = \mathbb{P}\left(\frac{u - \mu}{\sigma} < Z < \frac{v - \mu}{\sigma}\right)$$
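For instance (a sketch of my own with arbitrary $\mu$, $\sigma$, $u$, $v$), computing $\mathbb{P}(u < X < v)$ via the standardized variable and checking against the direct computation:

```python
from scipy.stats import norm

# Sketch: P(u < X < v) for X ~ N(mu, sigma^2), via Z = (X - mu) / sigma.
mu, sigma, u, v = 2.0, 3.0, 0.0, 5.0
z_u, z_v = (u - mu) / sigma, (v - mu) / sigma

prob_std = norm.cdf(z_v) - norm.cdf(z_u)                    # using Z ~ N(0, 1)
prob_dir = norm.cdf(v, mu, sigma) - norm.cdf(u, mu, sigma)  # direct check
print(prob_std, prob_dir)   # identical values
```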

Symmetry:

$$\mathbb{P}(|X| > x) = \mathbb{P}(X > x) + \mathbb{P}(X < -x) = 2\,\mathbb{P}(X > x)$$

(for $X$ symmetric about $0$, e.g. $\mathcal{N}(0, \sigma^2)$).

Modes of Convergence

Three types of convergence, ordered from strongest to weakest.

Almost surely (a.s.) convergence

The realized sequence $T_n(\omega)$ itself must converge to $T(\omega)$ for (almost) every outcome $\omega$:

$$T_n \xrightarrow[n \to \infty]{a.s.} T \iff \mathbb{P}\left[\left\{\omega : T_n(\omega) \xrightarrow[n \to \infty]{} T(\omega)\right\}\right] = 1$$

Convergence in probability

The probability that $T_n$ and $T$ depart from each other by any fixed amount goes to $0$ as $n$ goes to infinity:

$$T_n \xrightarrow[n \to \infty]{\mathbb{P}} T \iff \mathbb{P}\left[\,|T_n - T| \ge \varepsilon\,\right] \xrightarrow[n \to \infty]{} 0, \quad \forall \varepsilon > 0$$

Convergence in distribution

This is just saying that we measure something about the random variable, namely its distribution. For all continuous and bounded functions $f$:

$$T_n \xrightarrow[n \to \infty]{(d)} T \iff \mathbb{E}[f(T_n)] \xrightarrow[n \to \infty]{} \mathbb{E}[f(T)]$$

This only says that the distribution is converging: the probabilities computed for $T_n$ become the same as those computed for $T$ as $n$ goes to infinity.

Properties

Convergence in distribution implies convergence of probabilities if the limit has a density (e.g. Gaussian):

$$T_n \xrightarrow[n \to \infty]{(d)} T \;\Longrightarrow\; \mathbb{P}(a \le T_n \le b) \xrightarrow[n \to \infty]{} \mathbb{P}(a \le T \le b)$$

Addition, Multiplication and Division

Assume,

$$T_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} T \quad \text{and} \quad U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} U$$

Then,

$$T_n + U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} T + U, \qquad T_n U_n \xrightarrow[n \to \infty]{a.s./\mathbb{P}} TU, \qquad \frac{T_n}{U_n} \xrightarrow[n \to \infty]{a.s./\mathbb{P}} \frac{T}{U} \ \ (\text{if } U \ne 0 \text{ a.s.})$$

Warning: In general, these rules do not apply to convergence in distribution (d).

 

Estimator

Normally, we have two estimators:

How can we decide how many samples ($n$) we need to draw our conclusion? What is the cutoff? If $n = 60$ is enough, what about 59 or 58?

Now we have our first estimator of $p$ in the Kissing Example; we put a hat on anything that is an estimator of something.

$$\hat{p} = \bar{R}_n = \frac{1}{n}\sum_{i=1}^{n} R_i$$

Averages of random variables are essentially controlled by two major tools: the LLN and the CLT.

What is the accuracy of this estimator?

What is the probability that $\hat{p}$ is away from $p$ by more than 10%?

We don't even know the true $p$, and the observation $\hat{p}$, being built from random variables, fluctuates. We need a method to measure this fluctuation.
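As a rough sketch of my own (assuming, purely for illustration, that $R_i \overset{iid}{\sim} \text{Ber}(p)$ with an arbitrary true $p$ and $n$, and interpreting "10%" as an absolute deviation of 0.1), the fluctuation of $\hat{p}$ can be explored by simulation and compared with the CLT approximation:

```python
import numpy as np
from scipy.stats import norm

# Sketch: how often does p_hat land more than eps = 0.1 away from the true p?
# Compare simulation with the two-sided CLT approximation.
rng = np.random.default_rng(4)
p, n, reps, eps = 0.5, 100, 100_000, 0.1

p_hat = rng.binomial(1, p, size=(reps, n)).mean(axis=1)
empirical = np.mean(np.abs(p_hat - p) > eps)

sigma = np.sqrt(p * (1 - p))
clt_approx = 2 * norm.sf(np.sqrt(n) * eps / sigma)   # 2 * P(Z > sqrt(n)*eps/sigma)

print("simulated :", empirical)
print("CLT approx:", clt_approx)
```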

Modelling Assumptions:

Measures of Distance

If we want to estimate the mean of a Gaussian, we can use the sample expectation, but that alone doesn't work for the variance.

You can, for example, compute the variance from the sample moments, which is essentially the method of moments.

But it turns out we have a much more powerful method called the maximum likelihood method, though it is far from trivial.