Statistical concepts
PMF and CDF
The PMF gives $P(X = x)$, while the CDF gives $P(X \le x)$, where $x$ ranges over all points on the real line.
Probability mass function (PMF)
$$p_X(x) = P(X = x)$$
This is the probability that a discrete random variable takes each particular value; it is sometimes called the discrete density function. The probability mass function is usually the main way of defining a discrete distribution, and such functions exist whenever the domain is:
- a discrete scalar variable
- a multivariate random variable

(Source: Wikipedia)
Cumulative distribution function (CDF)
$$F_X(x) = P(X \le x)$$
Also called the probability distribution function or simply the distribution function, it is the integral of the probability density function and completely describes the probability distribution of a real-valued random variable $X$.

(Source: Wikipedia)
Probability density function (PDF)
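The PDF is the continuous analogue of the PMF: a density $f(x)$ whose integral over an interval gives a probability. A minimal sketch, assuming `scipy` is installed, contrasting the three functions for a discrete and a continuous distribution:

```python
from scipy import stats

# Discrete example: number of heads in 10 fair coin flips
X = stats.binom(n=10, p=0.5)
print(X.pmf(5))    # PMF: P(X = 5)
print(X.cdf(5))    # CDF: P(X <= 5)

# Continuous example: standard normal
Z = stats.norm(loc=0, scale=1)
print(Z.pdf(0.0))  # PDF: density at 0 (not itself a probability)
print(Z.cdf(0.0))  # CDF: P(Z <= 0) = 0.5
```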
Central Limits
Suppose we have a set of independent random variables $X_i$ for $i = 1, \ldots, n$ with:

$$\mathrm{Mean}(X_i) = \mu, \qquad \mathrm{Var}(X_i) = V \quad \text{for all } i.$$

Then as $n$ becomes large, the sum

$$S_n = \sum_{i=1}^{n} X_i \to N(n\mu, nV)$$
tends to become normally distributed.
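A quick simulation sketch (not from the original notes): sums of $n$ i.i.d. Uniform(0, 1) variables, which have $\mu = 1/2$ and $V = 1/12$, should have sample mean close to $n\mu$ and sample variance close to $nV$, and a histogram of the sums looks increasingly bell-shaped as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000

# Each row is one experiment: the sum of n Uniform(0, 1) draws
sums = rng.random((trials, n)).sum(axis=1)

mu, V = 0.5, 1.0 / 12.0      # mean and variance of Uniform(0, 1)
print(sums.mean(), n * mu)   # sample mean of S_n vs n*mu
print(sums.var(), n * V)     # sample variance of S_n vs n*V
```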
Absence of Central Limits
Another case is where the moments are not defined / are infinite (for example, the Cauchy distribution discussed below).
Randomness
Motivation
Three main ways that randomness comes into data science:
- The data themselves are often best understood as random.
- When we want to reason under **subjective uncertainty** (for example in Bayesian approaches), unknown quantities can be represented as random. Often, when we make predictions, they will be **probabilistic**.
- Many of the most effective, efficient, and commonly used algorithms in data science (typically called Monte Carlo algorithms) exploit randomness.
- Unpredictability
- Subjective uncertainty
The logistic map
The logistic map is a quadratic polynomial mapping (recurrence relation), often used as a canonical example of how complex, chaotic behaviour can arise from a very simple nonlinear dynamical equation. It is an example of deterministic chaos: the rule is deterministic, but its results are apparently not easy to predict. It can be viewed, in part, as a discrete-time population model: the logistic model describes the evolution of a biological population and can be written as the one-dimensional nonlinear iteration $x_{n+1} = r x_n (1 - x_n)$.
Math:

$$x(t+1) = \mu\, x(t)\,(1 - x(t))$$
Here $t$ is the discrete time step; for every $t$, $x(t) \in [0,1]$; and $\mu$ is a tunable parameter. To guarantee that the mapped value $x(t)$ always remains in $[0,1]$, we require $\mu \in [0,4]$. $x(t)$ is the fraction of the maximum possible population present at time $t$ (i.e., the ratio of the current population to the maximum possible population). As the parameter $\mu$ varies, the equation exhibits different limiting dynamical behaviours (i.e., how $x(t)$ behaves as $t \to \infty$), including: a stable fixed point ($x(t)$ eventually settles on a single value), periodic behaviour ($x(t)$ jumps between two or more values), and chaos (the long-run values of $x(t)$ never repeat, wandering throughout some interval).
Chaotic behaviour appears for sufficiently large $\mu$ within $[0,4]$ (roughly $\mu > 3.57$). The nonlinear difference equation is intended to capture two effects:
- When the population is small, it reproduces, growing at a rate proportional to the current population.
- Starvation (density-dependent mortality): the growth rate falls at a rate proportional to the environment's 'carrying capacity' minus the current population.
However, as a population model the logistic map has the problem that some initial conditions and parameter values (e.g. $\mu > 4$) lead to trajectories that leave the interval $[0,1]$. This problem does not arise in the older Ricker model, which also exhibits chaotic dynamics.
For $0 \le \mu \le 1$: the population eventually dies out, i.e. $x(t) \to 0$ regardless of the starting value.
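A short sketch (illustrative, not from the notes) iterating the map for a few values of $\mu$ shows the fixed-point, periodic, and chaotic regimes:

```python
def logistic_map(mu, x0=0.2, steps=1000, keep=8):
    """Iterate x(t+1) = mu * x(t) * (1 - x(t)) and return the last few states."""
    x = x0
    traj = []
    for _ in range(steps):
        x = mu * x * (1.0 - x)
        traj.append(x)
    return traj[-keep:]

for mu in (0.8, 2.8, 3.2, 3.9):
    print(mu, [round(v, 4) for v in logistic_map(mu)])
# mu = 0.8 -> decays to 0; mu = 2.8 -> a stable fixed point;
# mu = 3.2 -> a period-2 cycle; mu = 3.9 -> chaotic wandering
```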
Entropy
Another approach is to use factors external to the computer to generate randomness, for example the position and timing of mouse clicks. Here we will treat the time at which the code runs as the external factor: we use the six digits after the decimal point of the current system-clock time (microsecond resolution).
R and MATLAB provide random-number-generation functions through their software packages.
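A minimal standard-library sketch of this idea, using the microsecond digits of the current time as a seed:

```python
import time
import random

# Microsecond part of the current time: six digits that vary
# unpredictably from run to run
micros = int(time.time() * 1_000_000) % 1_000_000
random.seed(micros)
print(micros, random.random())
```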
Estimation of $\pi$ using Monte Carlo methods
Suppose we define $\pi$ as the area of a circle of radius 1 and estimate the number from this definition: we will pick random values of $x$ and $y$ independently from a uniform distribution between 0 and 1, then let the random variable $Z$ equal 1 if the point $(x, y)$ falls within the quarter-circle shown and 0 otherwise. This $Z$ allows us to make an estimate of $\pi$ in that its expected value is $E[Z] = \pi/4$. We can then define a random variable $A_n$ to be the average of $n$ independent samples of $Z$. Formally:

$$A_n = \frac{1}{n}\sum_{i=1}^{n} Z_i = \frac{\pi}{4} + \varepsilon_n$$

where $\varepsilon_n$ is the (random) estimation error.
Code operation
To deal with this error, we'll repeat the experiment $m$ times and make a list of all the estimates we get. We'll then arrange these results in ascending order and throw away a certain fraction $\alpha$ of the largest and smallest results. The remaining values should provide decent upper and lower bounds for an interval containing $\pi$.

```python
m = 100    # Number of estimates taken
n = 80000  # Number of points used in each estimate
```

If we increase $n$ above, we should get a more accurate estimate of $\pi$ each time we run the experiment, while if we increase $m$, we'll get more accurate estimates of the endpoints of an interval containing $\pi$.
```python
import numpy as np

# Generate a set of m estimates of the area of a unit-radius quarter-circle
np.random.seed(42)  # Seed the random number generator
A = np.zeros(m)     # A will hold our m estimates
for i in range(0, m):
    for j in range(0, n):
        # Generate an (x, y) pair in the unit square
        x = np.random.rand()
        y = np.random.rand()
        # Decide whether the point lies in or on
        # the unit circle and set Z accordingly
        r = x**2 + y**2
        if r <= 1.0:
            Z = 1.0
        else:
            Z = 0.0
        # Add up the contribution to the current estimate
        A[i] = A[i] + Z
    # Convert the sum we've built to an estimate of pi
    A[i] = 4.0 * A[i] / float(n)

# Calculate an approximate 95% confidence interval for pi
# based on our Monte Carlo estimates
pi_estimates = np.sort(A)
piLower = np.percentile(pi_estimates, 2.5)
piUpper = np.percentile(pi_estimates, 97.5)
print(f'We estimate that pi lies between {piLower:.3f} and {piUpper:.3f}.')
```
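The nested loops above are easy to read but slow in pure Python. A vectorised numpy version of the same estimator (a sketch, not part of the original code) gives the same statistics far faster:

```python
import numpy as np

rng = np.random.default_rng(42)
m, n = 100, 80_000

A = np.empty(m)
for i in range(m):
    # n random points in the unit square, tested against the quarter-circle at once
    x = rng.random(n)
    y = rng.random(n)
    A[i] = 4.0 * np.mean(x**2 + y**2 <= 1.0)

print(np.percentile(A, 2.5), np.percentile(A, 97.5))
```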
Standard distributions
Bernoulli distribution
$$P(X = x) = p^{x}(1-p)^{1-x}, \quad x = 0, 1; \quad 0 < p < 1$$
There are only two possible outcomes (binary situations), e.g. success/failure or heads/tails.
Random variable ($X$): in the context of the Bernoulli distribution, $X$ can take the values 1 or 0, denoting whether a success occurs.
Bernoulli trial: an individual experiment or trial with only two possible outcomes.
Bernoulli parameter: the probability of success ($p$) in a Bernoulli distribution.
Mean:

$$E[X] = \mu = p$$

Variance:

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = \sigma^2 = p(1-p) = pq$$
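A small check of these formulas, assuming `scipy` is available:

```python
from scipy.stats import bernoulli

p = 0.3
X = bernoulli(p)
print(X.pmf(0), X.pmf(1))  # 1 - p and p
print(X.mean(), X.var())   # p and p * (1 - p)
```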
Applications of Bernoulli Distribution in Business Statistics
1. Quality Control: In manufacturing, every product undergoes quality checks. Bernoulli Distribution helps assess whether a product passes (success) or fails (failure) the quality standards. By analysing the probability of success, manufacturers can evaluate the overall quality of their production process and make improvements.
2. Market Research: Bernoulli Distribution is useful in surveys and market research when dealing with yes/no questions. For instance, when surveying customer satisfaction, responses are often categorised as satisfied (success) or dissatisfied (failure). Analysing these binary outcomes using Bernoulli Distribution helps companies gauge customer sentiment.
3. Risk Assessment: In the context of risk management, the Bernoulli Distribution can be applied to model events with binary outcomes, such as a financial investment succeeding (success) or failing (failure). The probability of success serves as a key parameter for assessing the risk associated with specific investments or decisions.
4. Marketing Campaigns: Businesses use Bernoulli Distribution to measure the effectiveness of marketing campaigns. For instance, in email marketing, success might represent a recipient opening an email, while failure indicates not opening it. Analysing these binary responses helps refine marketing strategies and improve campaign success rates.
Difference between the Bernoulli distribution and the binomial distribution
The Bernoulli distribution and the binomial distribution are both used to model random experiments with binary outcomes, but they differ in how they handle multiple trials or repetitions of these experiments.
| Basis | Bernoulli distribution | Binomial distribution |
|---|---|---|
| Number of trials | Single trial | Multiple trials |
| Possible outcomes | 2 outcomes (1 for success, 0 for failure) | $n+1$ possible outcomes (the number of successes, 0 to $n$) |
| Parameter | Probability of success is $p$ | Probability of success in each trial is $p$; number of trials is $n$ |
| Random variable | $X$ can only be 0 or 1 | $X$ can be any integer from 0 to $n$ |
| Purpose | Describes single-trial events with success/failure | Models the number of successes in multiple trials |
| Example | Coin toss (heads/tails), pass/fail, yes/no, etc. | Number of successful free throws in a series of attempts, number of defective items in a batch, etc. |
Arithmetic with normally-distributed variables
Suppose we have two random variables, $X_1$ and $X_2$, that are independent and both normally distributed with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively. Then

$$W = X_1 + X_2$$
will also be normally distributed
mean:

$$\mu_W = \mu_1 + \mu_2$$
variance:

$$\sigma_W^2 = \sigma_1^2 + \sigma_2^2$$

Similarly,

$$Y = aX_1 + b$$
will also be normally distributed
mean:

$$\mu_Y = a\mu_1 + b$$

variance:

$$\sigma_Y^2 = a^2 \sigma_1^2$$
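A simulation sketch (with illustrative parameter values) checking both identities:

```python
import numpy as np

rng = np.random.default_rng(1)
mu1, s1, mu2, s2 = 2.0, 3.0, -1.0, 0.5
X1 = rng.normal(mu1, s1, 1_000_000)
X2 = rng.normal(mu2, s2, 1_000_000)

W = X1 + X2
print(W.mean(), mu1 + mu2)     # close to mu1 + mu2
print(W.var(), s1**2 + s2**2)  # close to sigma1^2 + sigma2^2

a, b = 2.0, 5.0
Y = a * X1 + b
print(Y.mean(), a * mu1 + b)   # close to a*mu1 + b
print(Y.var(), a**2 * s1**2)   # close to a^2 * sigma1^2
```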
(Figure: CDF.)
Cauchy distribution
The Cauchy distribution has probability density function
$$f(x) = \frac{1}{\pi s \left( 1 + \left( (x - t)/s \right)^2 \right)}$$

where $s$ is a positive scale parameter and $t$ is a location parameter that can take any real value.
It has “heavy tails”, which means that large values are so common that the Cauchy distribution lacks a well-defined mean and variance!
But the parameter t t t gives the location of the mode and median, which are well-defined.
The parameter s s s determines the ‘width’ of the distribution as measured using e.g. the distances between percentiles, which are also well defined.
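A sketch illustrating the missing mean: the running mean of standard normal samples settles down, while the running mean of standard Cauchy samples keeps jumping however many samples we take.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

cauchy = rng.standard_cauchy(n)  # t = 0, s = 1
normal = rng.standard_normal(n)

def running_mean(a):
    """Mean of the first k samples, for every k."""
    return np.cumsum(a) / np.arange(1, len(a) + 1)

for k in (100, 10_000, 100_000):
    print(k, running_mean(normal)[k - 1], running_mean(cauchy)[k - 1])
```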
(Figure: CDF of the Cauchy distribution.)
EDA: Exploratory data analysis
Motivation:
EDA is about getting an intuitive understanding of the data, and as such different people will find different techniques useful.
Data quality
The first thing to understand is where the data come from and how accurate they are.
Star rating
This is based on experience rather than any formal theory:
- 4: Numbers we can believe. Examples: official statistics; well controlled laboratory experiments.
- 3: Numbers that are reasonably accurate. Examples: well conducted surveys / samples; field measurements; less well controlled experiments.
- 2: Numbers that could be out by quite a long way. Examples: poorly conducted surveys / samples; measurements of very noisy systems.
- 1: Numbers that are unreliable. Examples: highly biased / unrepresentative surveys / samples; measurements using biased / low-quality equipment.
- 0: Numbers that have just been made up. Examples: urban legends / memes; fabricated experimental data.
Univariate Data Vectors
Univariate case: one measurement per 'thing' (each variable is explored on its own).
Mathematically, we represent a univariate dataset as a length-$n$ vector:

$$x = (x_1, x_2, \ldots, x_n)$$
The sample mean of a function $f(x)$ is

$$\langle f(x) \rangle = \frac{1}{n}\sum_{i=1}^{n} f(x_i) = \frac{1}{n}\left[ f(x_1) + f(x_2) + \cdots + f(x_n) \right]$$
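In numpy, this angle-bracket notation is simply a mean over the transformed data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
print(np.mean(x))          # <x>
print(np.mean(x**2))       # <x^2>
print(np.mean(np.log(x)))  # <log x>
```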
Visualisation and Information
There is an important distinction in visualisations between
- Lossless ones, from which, if viewed at sufficiently high resolution, one could recover the original dataset
- Lossy ones, where a given plot would be consistent with many different raw datasets

Typically for complex data, choosing the lossy visualisation that loses the 'right' information is key to successful visualisation.
Multivariate Exploratory Data Analysis
- In real applications, we almost always have multiple features of different things measured, and so are in a multivariate rather than univariate situation.
Professional Skill
Data types
- Nominal or categorical (e.g. colours, car names): not ordered; cannot be added or compared; can be relabelled.
- Ordinal (e.g. small/medium/large): sometimes represented by numbers; can be ordered, but differences or ratios are not meaningful.
- Measurement: meaningful numbers, on which (some) operations make sense. They can be:
- Discrete (e.g. publication year, number of cylinders): typically integer.
- Continuous (e.g. height): precision limited only by measurement accuracy.
Measurements can be on an interval scale (e.g. temperature in degrees Celsius), a ratio scale (say, weights in kg), or a circular scale (time of day on the 24-hour clock), depending on where the 0 value sits and on which operations yield meaningful results.
Summary Statistics
Measures of Central Tendency
Often, we are interested in what a typical value of the data is:
- The mean of the data is:
$$\mathrm{Mean}(x) = \langle x \rangle = \frac{1}{n}\sum_{i=1}^{n} x_i$$
- The median of the data is the value that sits in the middle when the data are sorted by value
- A mode in data is a value of $x$ that is 'more common' than those around it, or a 'local maximum' in the density.
  - For discrete data this can be uniquely determined as the most common value.
  - For continuous data, modes need to be estimated; this is one aspect of a major strand in data science, estimating distributions.
Visualising
For the data, we estimate the location of the single mode from the kernel density, and calculate the mean and median directly.
Example:
The data are right-skewed, and as a consequence the mode is smallest and the mean is largest; we will consider this further. (Note that for a normal distribution all three would be equal.)
Variance
| Property | Biased variance | Unbiased variance |
|---|---|---|
| Denominator | $n$ | $n-1$ |
| Bias | Underestimates the population variance | Unbiased estimate of the population variance |
| Typical use | Describing the dispersion of a sample, e.g. fitting models in machine learning | Estimating the population variance in statistics |

When to use which?
- Biased variance: in machine learning one usually computes the biased sample variance (denominator $n$), because the focus is on fitting the model to the sample rather than inferring the population.
- Unbiased variance: in statistics and inference, use the unbiased variance (denominator $n-1$), because it estimates the population parameter more accurately.
$$\begin{aligned}
\mathrm{Var}(x) &= \left\langle (x - \langle x \rangle)^2 \right\rangle \\
&= \frac{1}{n}\sum_{i=1}^{n} (x_i - \langle x \rangle)^2 \\
&= \frac{1}{n}\sum_{i=1}^{n} \left( x_i^2 - 2 x_i \langle x \rangle + \langle x \rangle^2 \right) \\
&= \left( \frac{1}{n}\sum_{i=1}^{n} x_i^2 \right) - 2\left( \frac{1}{n}\sum_{i=1}^{n} x_i \right)\langle x \rangle + \frac{1}{n}\left( \sum_{i=1}^{n} 1 \right)\langle x \rangle^2 \\
&= \frac{1}{n}\left( \sum_{i=1}^{n} x_i^2 \right) - \left( \frac{1}{n}\sum_{i=1}^{n} x_i \right)^2 \\
&= \langle x^2 \rangle - \langle x \rangle^2
\end{aligned}$$
Unbiased Variance and Computation

$$\widehat{\mathrm{Var}}(x) = \frac{n}{n-1}\,\mathrm{Var}(x) = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \langle x \rangle)^2 = \frac{1}{n-1}\left( \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left( \sum_{i=1}^{n} x_i \right)^2 \right)$$

By default, Python (numpy) computes the biased variance, while R computes the unbiased one.
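A quick check of that default in numpy (R's `var()` divides by $n-1$):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(np.var(x))          # biased: divides by n (numpy's default, ddof=0)
print(np.var(x, ddof=1))  # unbiased: divides by n - 1 (matches R's var())
```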
‘Natural’ units
There are two commonly used quantities that have the same units as the data:
- the mean, $\mu = \mathrm{Mean}(x)$
- the standard deviation, $\sigma = \sqrt{\mathrm{Var}(x)}$

These two quantities let us define two transformations commonly applied to data:
- centring: $y_i = x_i - \mu$, which gives $\mathrm{Mean}(y) = 0$
- standardisation: $z_i = y_i / \sigma$, which gives $\mathrm{Var}(z) = 1$
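Both transformations in numpy; note the biased standard deviation (`ddof=0`, the default) is used so that $\mathrm{Var}(z) = 1$ exactly:

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0, 11.0])
mu, sigma = x.mean(), x.std()  # std() defaults to the biased estimate

y = x - mu     # centring: Mean(y) = 0
z = y / sigma  # standardisation: Var(z) = 1
print(y.mean(), z.var())
```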
Higher moments
- In general, the $r$-th moment of the data is $m_r = \langle x^r \rangle$.
- The $r$-th central moment of the data is $\mu_r = \langle (x - \mu)^r \rangle = \langle y^r \rangle$, where the $y$'s are the centred versions of the data.
- The $r$-th standardised moment of the data is $\tilde{\mu}_r = \left\langle \left( \frac{x - \mu}{\sigma} \right)^r \right\rangle = \langle z^r \rangle = \frac{\langle (x - \mu)^r \rangle}{\sigma^r} = \frac{\mu_r}{\sigma^r}$.
In theory, all higher moments are informative about the data, but in practice those with r = 3 and r = 4 are most commonly reported
Standardised moments

$$M_k = \frac{\mu_k}{\sigma^k} = \frac{\text{central moment}}{(\text{standard deviation})^k}$$

- $M_k$: the $k$-th standardised moment.
- $\mu_k$: the $k$-th central moment.
- $\sigma$: the standard deviation.

Dividing by the $k$-th power of the standard deviation removes the units from the moment, which makes it easy to compare the shapes of different distributions.
First standardised moment:

$$M_1 = \frac{\mu_1}{\sigma^1}$$

This indicates the centre of the distribution, but it is 0 whenever moments are taken about the mean.

Second standardised moment:

$$M_2 = \frac{\mu_2}{\sigma^2}$$

This is identically 1, because the distribution has already been standardised by its standard deviation.
Third standardised moment (skewness):

$$M_3 = \frac{\mu_3}{\sigma^3} = \tilde{\mu}_3 = \mathrm{Skew}(x)$$

This describes the symmetry, or degree of skew, of the distribution:
- $M_3 > 0$: the distribution is right-skewed (longer right tail)
- $M_3 < 0$: the distribution is left-skewed (longer left tail)
- $M_3 = 0$: the distribution is symmetric
- A larger (more positive) value of this quantity indicates right-skewness, meaning that more of the data's variability arises from values of $x$ larger than the mean.
- Conversely, a smaller (more negative) value of this quantity indicates left-skewness, meaning that more of the data's variability arises from values of $x$ smaller than the mean.
- A value close to zero means that the variability of the data is similar on either side of the mean (but does not imply an overall symmetric distribution).
Fourth standardised moment (kurtosis):

$$M_4 = \frac{\mu_4}{\sigma^4}$$

This describes how peaked or flat the distribution is:
- $M_4 > 3$: peaked (heavy-tailed) distribution
- $M_4 < 3$: flat (light-tailed) distribution
Uses
- Describing distribution shape: skewness and kurtosis are the most commonly used standardised moments, for studying the symmetry and tail behaviour of a distribution.
- Testing model assumptions: e.g., judging whether data follow a normal distribution.
- Comparing distributions: standardisation removes the effects of scale and units, so the shape features of different datasets can be compared directly.
- A value of this quantity larger than 3 means that more of the variance of the data arises from the tails than would be expected if it were normally distributed
- A value of this quantity less than 3 means that less of the variance of the data arises from the tails than would be expected if it were normally distributed.
- A value close to 3 is consistent with, though not strong evidence for, a normal distribution.
- The difference between the kurtosis and 3 is called the excess kurtosis.
Functions
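A sketch of computing these quantities, assuming `scipy` is available; note that `scipy.stats.kurtosis` returns the excess kurtosis ($M_4 - 3$) by default:

```python
import numpy as np
from scipy import stats

x = np.random.default_rng(0).exponential(size=10_000)  # a right-skewed sample

print(stats.skew(x))                    # M3: about 2 for the exponential
print(stats.kurtosis(x))                # excess kurtosis M4 - 3: about 6
print(stats.kurtosis(x, fisher=False))  # plain M4: about 9
```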
Quantiles and Order Statistics
- The $z$-th percentile, $P_z$, is the value of $x$ for which $z\%$ of the data is $\le x$.
- So the median is $\mathrm{median}(x) = P_{50}$.
- This is related to the ECDF as illustrated below.
- A measure of dispersal of the data is the inter-quartile range, $\mathrm{IQR}(x) = P_{75} - P_{25}$.
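These quantities map directly onto `np.percentile`:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=1_000)
p25, p50, p75 = np.percentile(x, [25, 50, 75])

print(p50)        # the median, P50
print(p75 - p25)  # the inter-quartile range, IQR(x)
```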
Density Estimation
Histograms
A histogram can be used to make an estimate of the probability density underlying a data set. Given data $\{x_1, \ldots, x_n\}$ and a collection of $q+1$ bin boundaries $b = (b_0, b_1, \ldots, b_q)$ chosen so that $b_0 < \min(x)$ and $\max(x) < b_q$, we can think of the histogram-based density estimate as a piecewise-constant (that is, constant on intervals) function arranged so that the value of the estimator in the interval $b_{a-1} \le x < b_a$ is

$$f(x \mid b) = \frac{1}{b_a - b_{a-1}} \left( \frac{\left| \{ x_j \mid b_{a-1} \le x_j < b_a \} \right|}{n} \right)$$

where the second factor is the proportion of the $x_j$ that fall into the interval and $b_a - b_{a-1}$ is the width of the interval. These choices mean that the bar (of the histogram) above the interval has an area equal to the proportion of the data points $x_j$ that fall in that interval.
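numpy's histogram computes exactly this estimator when `density=True`, so the bar areas sum to 1. A quick check:

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10_000)
heights, edges = np.histogram(x, bins=30, density=True)

widths = np.diff(edges)          # b_a - b_{a-1} for each bin
print(np.sum(heights * widths))  # total area = 1.0
```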
Estimating a Density with Kernels
$$\widehat{f}(x \mid w) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{w} K\!\left( \frac{x - x_j}{w} \right)$$
The main players in this formula are
- $K(x)$: the kernel, typically some bump-shaped function such as a Gaussian or a parabolic bump. It should be normalised in the sense that

$$\int_{-\infty}^{\infty} K(x)\, dx = 1$$

- $w$: the bandwidth, which sets the width of the bumps.
Kernel Density Estimation (KDE)
KDE is a non-parametric method for estimating the probability density function (PDF) of a random variable. It gives a smooth description of the data's distribution without relying on a specific distributional assumption (such as normality).

Goal:
- The goal of KDE is to estimate the underlying probability density function from a finite sample.
- Like a histogram, KDE describes the distribution of the data, but it is smoother and does not depend on a particular choice of bins.
Core formula:
Given $n$ data points $\{x_1, x_2, \dots, x_n\}$, the KDE estimate at a location $x$ is:

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)$$
- $K(\cdot)$: the kernel function, which defines how the smoothing weight is spread.
- $h$: the bandwidth parameter, which controls the degree of smoothing.
- $x_i$: the data points.

The kernel function $K(\cdot)$:
- The kernel is a symmetric, non-negative function that integrates to 1; it assigns a weight to each point.
- Common kernels:
  - Gaussian kernel: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$
  - Uniform kernel: $K(u) = \frac{1}{2}$ if $|u| \le 1$, otherwise 0
  - Triangular kernel: $K(u) = 1 - |u|$ if $|u| \le 1$, otherwise 0
The bandwidth $h$:
- The bandwidth controls how far each kernel spreads.
- The choice of $h$ matters a great deal:
  - $h$ too small: the estimate fluctuates too much (overfitting).
  - $h$ too large: the estimate is too smooth (underfitting).
- The core idea of KDE is to 'cover' each data point smoothly with the kernel function $K(\cdot)$.
- Centring a kernel on every data point and scaling its width by the bandwidth $h$ produces a continuous probability density curve.
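A from-scratch sketch of the formula with a Gaussian kernel (in practice `scipy.stats.gaussian_kde` provides a ready-made implementation):

```python
import numpy as np

def kde(grid, data, h):
    """Evaluate the Gaussian-kernel density estimate at each grid point."""
    # u has shape (len(grid), len(data)): one column of (x - x_i)/h per data point
    u = (grid[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # Gaussian kernel
    return K.mean(axis=1) / h                     # (1/nh) * sum of kernels

data = np.random.default_rng(0).normal(size=500)
grid = np.linspace(-4, 4, 9)
print(kde(grid, data, h=0.4).round(3))            # smooth density estimate
```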
Comparison of KDE and histograms

| Feature | Histogram | KDE |
|---|---|---|
| Bins | Data are divided into fixed-width bins | No fixed bins needed |
| Smoothness | Curve can be discontinuous and jagged | Curve is continuous and smooth |
| Parameters | Bin width | Kernel and bandwidth |
| Flexibility | Sensitive to bin placement | More flexible; suits complex data distributions |
Applications
- Visualising data distributions: e.g., examining the central tendency and overall shape of the data.
- Anomaly detection: identifying points that do not fit the estimated density.
- Probability density estimation: modelling feature distributions in machine learning and statistical modelling.