Statistics for ML

Statistical concepts


PMF and CDF

The PMF gives the value $P(X = x)$, while the CDF gives $P(X \le x)$, where $x$ ranges over the whole real line.

Probability mass function (PMF)

$$p_X(x) = P(X = x)$$

The PMF gives the probability that a discrete random variable takes each particular value; it is sometimes also called the discrete density function.

The PMF is usually the primary way of defining a discrete probability distribution, and such functions exist whenever the domain is:

  1. a discrete scalar variable, or
  2. a multivariate (discrete) random variable.

Source: Wikipedia

Cumulative distribution function (CDF)

$$F_X(x) = P(X \le x)$$

Also called the probability distribution function, or simply the distribution function.

For a continuous variable, it is the integral of the probability density function.

It completely describes the probability distribution of a real random variable $X$.

Source: Wikipedia

Probability density function (PDF)


The Central Limit Theorem

Suppose we have a set of independent random variables $X_i$ for $i = 1, \dots, n$ with:

$$Mean(X_i) = \mu, \quad Var(X_i) = V \quad \text{for all } i$$

Then as $n$ becomes large, the sum:

$$S_n = \sum_{i=1}^{n} X_i \to N(n\mu, nV)$$

tends to become normally distributed.
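A quick way to see this numerically is to simulate many sums of i.i.d. variables and compare the sample mean and variance of the sums with $n\mu$ and $nV$. This is a minimal sketch using Uniform(0, 1) draws (so $\mu = 1/2$ and $V = 1/12$); the seed and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12        # number of i.i.d. terms in each sum
m = 100_000   # number of sums to simulate

# Each row is one realisation of X_1, ..., X_n ~ Uniform(0, 1)
sums = rng.random((m, n)).sum(axis=1)

# For Uniform(0, 1): mu = 1/2 and V = 1/12, so S_n is approximately N(n/2, n/12)
print(sums.mean())  # close to n * 0.5 = 6.0
print(sums.var())   # close to n / 12 = 1.0
```

A histogram of `sums` would also look close to the corresponding normal bell curve.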

Absence of Central Limits

Another case is where the moments are undefined or infinite.

Randomness

Motivation

Three main ways that randomness comes into data science:

  1. The data themselves are often best understood as random.
  2. When we want to reason under subjective uncertainty (for example in Bayesian approaches), unknown quantities can be represented as random; often the predictions we make will then be probabilistic.
  3. Many of the most effective / efficient / commonly-used algorithms in data science, typically called Monte Carlo algorithms, exploit randomness.
  • Unpredictability
  • Subjective uncertainty

The logistic map

The logistic map is a quadratic polynomial map (recurrence relation), often used as a canonical example of how complex, chaotic behaviour can arise from a very simple nonlinear dynamical equation.

It is an example of deterministic chaos: the rule itself is deterministic, but its results are apparently not easy to predict.

To some extent it is a discrete-time demographic (population) model.

The logistic model describes the evolution of a biological population, and can be written as the one-dimensional nonlinear iteration $x_{n+1} = r x_n (1 - x_n)$.

Math:

$$x(t+1) = \mu\, x(t)\,(1 - x(t))$$

Here $t$ is the discrete time step, and $x(t) \in [0, 1]$ for every $t$. The adjustable parameter $\mu$ must lie in $[0, 4]$ to guarantee that the map keeps $x(t)$ inside $[0, 1]$. The value $x(t)$ is the fraction of the maximum possible population present at time $t$ (the ratio of the current population to the maximum possible population). As the parameter $\mu$ varies, the equation exhibits different limiting dynamical behaviours (i.e. the behaviour of $x(t)$ as $t \to \infty$), including: stable fixed points ($x(t)$ eventually settles on a single value), periodic orbits ($x(t)$ jumps between two or more values), and chaos (the long-run values of $x(t)$ never repeat, ranging over some interval).

Chaotic behaviour appears for values of $\mu$ towards the top of the interval $[1, 4]$. The nonlinear difference equation is intended to capture two effects:
  • When the population is small, it reproduces and grows at a rate proportional to the current population.
  • Starvation (density-dependent mortality): the growth rate falls at a rate proportional to the environment's 'carrying capacity' minus the current population.

However, as a demographic model the logistic map has the problem that some initial conditions and parameter values (e.g. $\mu > 4$) lead to invalid (negative) population sizes. This problem does not appear in the older Ricker model, which also exhibits chaotic dynamics.

For $0 \le \mu \le 1$, the population eventually dies out ($x(t) \to 0$), regardless of the initial condition.
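The different regimes can be seen by simply iterating the map for a few values of $\mu$. This is a minimal sketch; the starting value and iteration counts are arbitrary choices.

```python
def logistic_orbit(mu, x0=0.2, n_transient=500, n_keep=8):
    """Iterate x(t+1) = mu * x(t) * (1 - x(t)), discard the transient,
    and return the next few values of the orbit (rounded for display)."""
    x = x0
    for _ in range(n_transient):
        x = mu * x * (1 - x)
    orbit = []
    for _ in range(n_keep):
        x = mu * x * (1 - x)
        orbit.append(round(x, 4))
    return orbit

print(logistic_orbit(0.5))  # dies out: x(t) -> 0
print(logistic_orbit(2.5))  # stable fixed point at 1 - 1/mu = 0.6
print(logistic_orbit(3.2))  # period-2 oscillation between two values
print(logistic_orbit(3.9))  # chaotic: no repeating pattern
```

Changing `x0` slightly at $\mu = 3.9$ produces a completely different orbit, which is the 'sensitive dependence on initial conditions' characteristic of deterministic chaos.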

Entropy

Another approach is to use factors external to the computer to generate randomness, for example the position and timing of mouse clicks. Here we will use the time at which the code runs as the external factor,

i.e. the last six decimal digits of the current system-clock time (microsecond resolution).

R and Matlab provide functions for random number generation through their packages.

Estimation of π using Monte Carlo methods

Suppose we define $\pi$ as the area of a circle of radius 1, and try to estimate the number from this definition:

we will pick random values of $x$ and $y$ independently from a uniform distribution between 0 and 1, then let the random variable $Z$ equal 1 if the point $(x, y)$ falls within the quarter-circle shown and 0 otherwise. This $Z$ allows us to make an estimate of $\pi$ in that its expected value is $E[Z] = \pi/4$. We can then define a random variable $A_n$ to be the average of $n$ independent samples of $Z$. Formally:

$$A_n = \frac{1}{n} \sum_{i=1}^{n} Z_i = \frac{\pi}{4} + \varepsilon_n$$

Code operation

To deal with this, we'll repeat the experiment $m$ times and make a list of all the estimates we get. We'll then arrange these results in ascending order and throw away a certain fraction $\alpha$ of the largest and smallest results. The remaining values should provide decent upper and lower bounds for an interval containing $\pi$.

m = 100 # Number of estimates taken
n = 80000 # Number of points used in each estimate

If we increase $n$ above, we should get a more accurate estimate of $\pi$ each time we run the experiment, while if we increase $m$, we'll get more accurate estimates of the endpoints of an interval containing $\pi$.

#Generate a set of m estimates of the area of a unit-radius quarter-circle
np.random.seed(42) # Seed the random number generator
A = np.zeros(m) # A will hold our m estimates
for i in range(0,m):
    for j in range(0,n):
        # Generate an (x, y) pair in the unit square
        x = np.random.rand()
        y = np.random.rand()
  
        # Decide whether the point lies in or on
        # the unit circle and set Z accordingly
        r = x**2 + y**2
        if r <= 1.0:
            Z = 1.0
        else:
            Z = 0.0
    
        # Add up the contribution to the current estimate
        A[i] = A[i] + Z
   
    # Convert the sum we've built to an estimate of pi
    A[i] = 4.0 * A[i] / float( n )
# Calculate approximate 95% confidence interval for pi based on our Monte Carlo estimates
pi_estimates = np.sort(A)
piLower = np.percentile(pi_estimates,2.5)
piUpper = np.percentile(pi_estimates,97.5)
print(f'We estimate that pi lies between {piLower:.3f} and {piUpper:.3f}.')
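The nested loops above can be replaced by vectorised NumPy operations, which is much faster and shorter. This is an equivalent sketch of a single estimate; the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000

# n random points in the unit square
x = rng.random(n)
y = rng.random(n)

# Fraction of points landing in the quarter-circle, scaled by 4
pi_estimate = 4.0 * np.mean(x**2 + y**2 <= 1.0)
print(pi_estimate)  # close to 3.1416
```

The same `m`-repetitions-and-percentiles scheme can then be applied to an array of such estimates.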

Standard distributions

Bernoulli distribution

$$P(X = x) = p^x (1-p)^{1-x}, \quad x = 0, 1; \quad 0 < p < 1$$

There are only two possible outcomes (binary situations), e.g. success/failure or heads/tails.

Random Variable (X): In the context of Bernoulli Distribution, X represents the variable that can take the values 1 or 0, denoting the number of successes occurring.

Bernoulli Trial: An individual experiment or trial with only two possible outcomes.

Bernoulli Parameter: This refers to the probability of success (p) in a Bernoulli Distribution.

Mean:

$$E[X] = \mu = p$$

Variance:

$$Var[X] = E[X^2] - (E[X])^2 = \sigma^2 = p(1-p) = pq$$
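These formulas are easy to check by simulation. A minimal sketch, in which the value of p, the seed, and the sample size are arbitrary choices:

```python
import numpy as np

p = 0.3
rng = np.random.default_rng(1)

# One Bernoulli(p) draw per uniform: 1 with probability p, else 0
samples = (rng.random(1_000_000) < p).astype(float)

print(samples.mean())  # close to p = 0.3
print(samples.var())   # close to p * (1 - p) = 0.21
```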

Applications of Bernoulli Distribution in Business Statistics

1. Quality Control: In manufacturing, every product undergoes quality checks. Bernoulli Distribution helps assess whether a product passes (success) or fails (failure) the quality standards. By analysing the probability of success, manufacturers can evaluate the overall quality of their production process and make improvements.

2. Market Research: Bernoulli Distribution is useful in surveys and market research when dealing with yes/no questions. For instance, when surveying customer satisfaction, responses are often categorised as satisfied (success) or dissatisfied (failure). Analysing these binary outcomes using Bernoulli Distribution helps companies gauge customer sentiment.

3. Risk Assessment: In the context of risk management, the Bernoulli Distribution can be applied to model events with binary outcomes, such as a financial investment succeeding (success) or failing (failure). The probability of success serves as a key parameter for assessing the risk associated with specific investments or decisions.

4. Marketing Campaigns: Businesses use Bernoulli Distribution to measure the effectiveness of marketing campaigns. For instance, in email marketing, success might represent a recipient opening an email, while failure indicates not opening it. Analysing these binary responses helps refine marketing strategies and improve campaign success rates.

Difference between the Bernoulli Distribution and the Binomial Distribution

The Bernoulli Distribution and the Binomial Distribution are both used to model random experiments with binary outcomes, but they differ in how they handle multiple trials or repetitions of these experiments.

| Basis | Bernoulli Distribution | Binomial Distribution |
| --- | --- | --- |
| Number of trials | Single trial | Multiple trials |
| Possible outcomes | 2 outcomes (1 for success, 0 for failure) | Multiple outcomes (each trial is a success or failure) |
| Parameter | Probability of success is p | Probability of success in each trial is p; number of trials is n |
| Random variable | X can only be 0 or 1 | X can be any non-negative integer (0, 1, 2, 3, …) |
| Purpose | Describes single-trial events with success/failure | Models the number of successes in multiple trials |
| Example | Coin toss (Heads/Tails), Pass/Fail, Yes/No, etc. | Number of successful free throws in a series of attempts, number of defective items in a batch, etc. |

Arithmetic with normally-distributed variables

Suppose we have two random variables, $X_1$ and $X_2$, that are independent and are both normally distributed with means $\mu_1$ and $\mu_2$ and variances $\sigma_1^2$ and $\sigma_2^2$, respectively.

$$W = X_1 + X_2$$

will also be normally distributed

mean:

$$\mu_W = \mu_1 + \mu_2$$

variance:

$$\sigma_W^2 = \sigma_1^2 + \sigma_2^2$$

$$Y = a X_1 + b$$

will also be normally distributed

mean:

$$\mu_Y = a \mu_1 + b$$

variance:

$$\sigma_Y^2 = a^2 \sigma_1^2$$
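Both rules can be checked by simulation. A minimal sketch with arbitrary parameter choices:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000

# X1 ~ N(1, 2^2) and X2 ~ N(3, 1^2), independent
x1 = rng.normal(1.0, 2.0, n)
x2 = rng.normal(3.0, 1.0, n)

w = x1 + x2         # should be N(1 + 3, 4 + 1) = N(4, 5)
y = 2.0 * x1 - 1.0  # should be N(2*1 - 1, 2^2 * 4) = N(1, 16)

print(w.mean(), w.var())  # close to 4 and 5
print(y.mean(), y.var())  # close to 1 and 16
```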


Cauchy distribution

The Cauchy distribution has probability density function

$$f(x) = \frac{1}{\pi s \left( 1 + \left( (x - t)/s \right)^2 \right)}$$

where $s$ is positive and the parameter $t$ can be any real number.

It has "heavy tails", which means that large values are so common that the Cauchy distribution lacks a well-defined mean and variance!

But the parameter $t$ gives the location of the mode and median, which are well-defined.

The parameter $s$ determines the 'width' of the distribution as measured using e.g. the distances between percentiles, which are also well-defined.
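The heavy tails are easy to see by simulation: the sample median is a stable estimate of the location, but the sample mean is dominated by occasional enormous values. A minimal sketch (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

# Standard Cauchy: location t = 0, scale s = 1
samples = rng.standard_cauchy(100_000)

# The median is a well-behaved estimate of the location t ...
print(np.median(samples))  # close to 0

# ... but the mean is unreliable, because of rare huge outliers
print(np.mean(samples))
print(np.max(np.abs(samples)))
```

Rerunning with different seeds shows the median barely moving while the mean jumps around wildly.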


EDA: Exploratory data analysis

motivation:

EDA is about getting an intuitive understanding of the data, and as such different people will find different techniques useful.

Data quality

The first thing to understand is where the data come from and how accurate they are.

star rating 星級評級

This is based on experience rather than any formal theory:

  • 4: Numbers we can believe. Examples: official statistics; well controlled laboratory experiments.
  • 3: Numbers that are reasonably accurate. Examples: well conducted surveys / samples; field measurements; less well controlled experiments.
  • 2: Numbers that could be out by quite a long way. Examples: poorly conducted surveys / samples; measurements of very noisy systems.
  • 1: Numbers that are unreliable. Examples: highly biased / unrepresentative surveys / samples; measurements using biased / low-quality equipment.
  • 0: Numbers that have just been made up. Examples: urban legends / memes; fabricated experimental data.

Univariate Data Vectors

Univariate case: one measurement per 'thing', with each variable explored on its own.

Mathematically, we represent a univariate dataset as a length-n vector:

$$x = (x_1, x_2, \dots, x_n)$$

The sample mean of a function $f(x)$ is

$$\langle f(x) \rangle = \frac{1}{n} \sum_{i=1}^{n} f(x_i) = \frac{1}{n} \left[ f(x_1) + f(x_2) + \dots + f(x_n) \right]$$
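In NumPy the angle-bracket average is just a mean over the transformed data. A minimal sketch with $f(x) = x^2$:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# <f(x)> with f(x) = x^2: the average of [1, 4, 9]
mean_f = np.mean(x**2)
print(mean_f)  # 14/3, about 4.6667
```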

Visualisation and Information

There is an important distinction in visualisations between

  • Lossless ones, from which, if viewed at sufficiently high resolution, one could recover the original dataset
  • Lossy ones, where a given plot would be consistent with many different raw datasets

Typically for complex data, choosing the lossy visualisation that loses the 'right' information is key to successful visualisation.

Multivariate Exploratory Data Analysis

  • In real applications we almost always have multiple features of different things measured, and so are in a multivariate rather than univariate situation.

Professional Skill

Data types
  • Nominal or categorical (e.g. colours, car names): not ordered; cannot be added or compared; can be relabelled.
  • Ordinal (e.g. small/medium/large): sometimes represented by numbers; can be ordered, but differences or ratios are not meaningful.
  • Measurement: meaningful numbers, on which (some) operations make sense. They can be:
    • Discrete (e.g. publication year, number of cylinders): typically integer.
    • Continuous (e.g. height): precision limited only by measurement accuracy.

Measurements can be on an interval scale (e.g. temperature in degrees Celsius), a ratio scale (say, weights in kg), or a circular scale (time of day on the 24-hour clock), depending on the meaning of the 0 value and on which operations yield meaningful results.

Summary Statistics

Measures of Central Tendency 集中趨勢測度

Often, we are interested in what a typical value of the data is:

  • The mean of the data is:

$$Mean(x) = \langle x \rangle = \frac{1}{n} \sum_{i=1}^{n} x_i$$

  • The median of the data is the value that sits in the middle when the data are sorted by value.
  • A mode in the data is a value of $x$ that is 'more common' than those around it, or a 'local maximum' in the density.
    • For discrete data this can be uniquely determined as the most common value.
    • For continuous data, modes need to be estimated; this is one aspect of a major strand in data science, estimating distributions.

Visualising

For the data, we estimate from the kernel density that there is one mode, and its location, and we calculate the mean and median directly.

Example:

The data are right-skewed, and as a consequence of this the mode is the smallest and the mean the largest of the three measures; we will consider this further (note that for a normal distribution all three would be equal).

Variance

| Property | Biased variance | Unbiased variance |
| --- | --- | --- |
| Denominator | n | n − 1 |
| Purpose | Describes the spread of the sample | Estimates the population variance |
| Bias | Underestimates the population variance | Unbiased estimate of the population variance |
| Typical use | Data analysis and sample fitting in machine learning | Population-variance estimation in statistics |

When to use which?

  • Biased variance: in machine learning we usually compute the biased sample variance (denominator $n$), because the focus is on fitting the model to the sample rather than on inference about the population.
  • Unbiased variance: in statistics and inference we need the unbiased variance (denominator $n - 1$), because it estimates the population parameter more accurately.

$$\begin{aligned} Var(x) &= \left\langle (x - \langle x \rangle)^2 \right\rangle \\ &= \frac{1}{n} \sum_{i=1}^{n} (x_i - \langle x \rangle)^2 \\ &= \frac{1}{n} \sum_{i=1}^{n} \left( x_i^2 - 2 x_i \langle x \rangle + \langle x \rangle^2 \right) \\ &= \left( \frac{1}{n} \sum_{i=1}^{n} x_i^2 \right) - 2 \left( \frac{1}{n} \sum_{i=1}^{n} x_i \right) \langle x \rangle + \frac{1}{n} \left( \sum_{i=1}^{n} 1 \right) \langle x \rangle^2 \\ &= \langle x^2 \rangle - 2 \langle x \rangle^2 + \langle x \rangle^2 \\ &= \langle x^2 \rangle - \langle x \rangle^2 \end{aligned}$$

Unbiased Variance and Computation

$$\widehat{Var}(x) = \frac{n}{n-1} Var(x) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \langle x \rangle)^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right)$$

By default, Python (NumPy) computes the biased variance, while R computes the unbiased one.
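In NumPy the denominator is controlled by the `ddof` argument of `np.var`: `ddof=0` (the default) gives the biased version, `ddof=1` the unbiased one. A minimal sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

biased = np.var(x)            # divides by n:     1.25
unbiased = np.var(x, ddof=1)  # divides by n - 1: 5/3, about 1.6667

print(biased, unbiased)
```

`np.std` takes the same `ddof` argument.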

‘Natural’ units

There are two commonly-used quantities that have the same units as the data:

  • mean: $\mu = Mean(x)$
  • standard deviation: $\sigma = \sqrt{Var(x)}$

These two quantities let us define two transformations commonly applied to data:

  • centring: $y_i = x_i - \mu$, which gives $Mean(y) = 0$
  • standardisation: $z_i = y_i / \sigma$, which gives $Var(z) = 1$
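Both transformations are one-liners in NumPy. A minimal sketch (note that the population standard deviation, `ddof=0`, is used so that $Var(z)$ comes out exactly 1):

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()
sigma = x.std()  # ddof=0 (population) standard deviation

y = x - mu       # centring:        Mean(y) = 0
z = y / sigma    # standardisation: Var(z) = 1

print(y.mean(), z.var())
```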

Higher moments

  • In general, the $r$-th moment of the data is $m_r = \langle x^r \rangle$.

  • The $r$-th central moment of the data is $\mu_r = \langle (x - \mu)^r \rangle = \langle y^r \rangle$,

    where the y’s are the centred versions of the data.

  • The $r$-th standardised moment of the data is $\tilde{\mu}_r = \left\langle \left( \frac{x - \mu}{\sigma} \right)^r \right\rangle = \langle z^r \rangle = \frac{\langle (x - \mu)^r \rangle}{\sigma^r} = \frac{\mu_r}{\sigma^r}$

In theory, all higher moments are informative about the data, but in practice those with r = 3 and r = 4 are most commonly reported

Standardised moments

$$M_k = \frac{\mu_k}{\sigma^k} = \frac{\text{central moment}}{(\text{standard deviation})^k}$$

  • $M_k$: the $k$-th standardised moment.
  • $\mu_k$: the $k$-th central moment.
  • $\sigma$: the standard deviation.

Dividing by the $k$-th power of the standard deviation makes the moment dimensionless, which makes it easy to compare distributions.

First standardised moment

$$M_1 = \frac{\mu_1}{\sigma^1}$$

Indicates the central location of the distribution, but is 0 when moments are taken about the mean.

Second standardised moment

$$M_2 = \frac{\mu_2}{\sigma^2}$$

Identically equal to 1, because the distribution has already been scaled by its standard deviation.

Third standardised moment (skewness)

$$M_3 = \frac{\mu_3}{\sigma^3} = \tilde{\mu}_3 = Skew(x)$$

  • Describes the symmetry, or degree of skew, of the distribution:

    • $M_3 > 0$: the distribution is right-skewed (longer right tail)
    • $M_3 < 0$: the distribution is left-skewed (longer left tail)
    • $M_3 = 0$: the distribution is symmetric
  • A larger (more positive) value of this quantity indicates right-skewness, meaning that more of the data’s variability arises from values of x larger than the mean

  • Conversely, a smaller (more negative) value of this quantity indicates left-skewness, meaning that more of the data’s variability arises from values of x smaller than the mean.

  • A value close to zero means that the variability of the data is similar either side of the mean (but does not imply an overall symmetric distribution).

Fourth standardised moment (kurtosis)

$$M_4 = \frac{\mu_4}{\sigma^4}$$

  • Describes how peaked or flat the distribution is:
    • $M_4 > 3$: more sharply peaked than a normal distribution
    • $M_4 < 3$: flatter than a normal distribution

Uses

  • Describing distribution shape: skewness and kurtosis are the most commonly used standardised moments, for studying the symmetry and tail behaviour of a distribution.
  • Checking model assumptions: for example, judging whether data are consistent with a normal distribution.
  • Comparing distributions: standardisation removes the effects of scale and units, so the shape features of different datasets can be compared directly.
  • A value of this quantity larger than 3 means that more of the variance of the data arises from the tails than would be expected if it were normally distributed
  • A value of this quantity less than 3 means that less of the variance of the data arises from the tails than would be expected if it were normally distributed.
  • A value close to 3 is consistent with, though not strong evidence for, a normal distribution.
  • The difference between the kurtosis and 3 is called the excess kurtosis.
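Both quantities can be computed directly from the standardised data, as the definitions above suggest. A minimal sketch using the population (`ddof=0`) standard deviation and a small symmetric dataset:

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # a symmetric dataset

z = (x - x.mean()) / x.std()  # standardised data

skew = np.mean(z**3)  # third standardised moment: 0 for symmetric data
kurt = np.mean(z**4)  # fourth standardised moment: 3 for a normal distribution

print(skew, kurt)
```

Here the kurtosis comes out below 3, reflecting that this flat little dataset has lighter tails than a normal distribution.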


Quantiles and Order Statistics

  • The $z$-th percentile, $P_z$, is the value of $x$ for which $z\%$ of the data is $\le x$.
  • So the median is $\text{median}(x) = P_{50}$.
  • This is related to the ECDF as illustrated below.
  • A measure of dispersal of the data is the inter-quartile range $IQR(x) = P_{75} - P_{25}$.
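Percentiles and the IQR are available directly in NumPy. A minimal sketch (the dataset is an arbitrary example):

```python
import numpy as np

x = np.arange(1, 101)  # the numbers 1..100

p25 = np.percentile(x, 25)
p50 = np.percentile(x, 50)  # the median
p75 = np.percentile(x, 75)

iqr = p75 - p25
print(p50, iqr)  # 50.5 and 49.5
```

The values are interpolated (NumPy's default linear method), which is why the median of 1..100 is 50.5 rather than a data value.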

Density Estimation

Histograms

A histogram can be used to make an estimate of the probability density underlying a data set. Given data $\{x_1, \dots, x_n\}$ and a collection of $q+1$ bin boundaries, $b = (b_0, b_1, \dots, b_q)$,

chosen so that $b_0 < \min(x)$ and $\max(x) < b_q$, we can think of the histogram-based density estimate as a piecewise-constant function (that is, constant on intervals) arranged so that the value of the estimator in the interval $b_{a-1} \le x < b_a$ is

$$f(x \mid b) = \frac{1}{b_a - b_{a-1}} \left( \frac{\left| \{ x_j \mid b_{a-1} \le x_j < b_a \} \right|}{n} \right)$$

where the second factor is the proportion of the $x_j$ that fall into the interval and $b_a - b_{a-1}$ is the width of the interval. These choices mean that the bar of the histogram above the interval has an area equal to the proportion of the data points $x_j$ that fall in that interval.
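NumPy's `np.histogram` with `density=True` implements exactly this estimator: each bar's height is the bin proportion divided by the bin width, so the bar areas sum to 1. A minimal sketch with an arbitrary small dataset:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.25, 0.5, 0.55, 0.6, 0.62, 0.9])

# density=True divides each bin count by (n * bin width)
heights, edges = np.histogram(x, bins=4, range=(0.0, 1.0), density=True)

widths = np.diff(edges)
print(heights)                   # the piecewise-constant density estimate
print(np.sum(heights * widths))  # total area: 1.0
```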

Estimating a Density with Kernels

$$\hat{f}(x \mid w) = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{w} K\!\left( \frac{x - x_j}{w} \right)$$

The main players in this formula are

$K(x)$: the kernel, typically some bump-shaped function such as a Gaussian or a parabolic bump. It should be normalised in the sense that

$$\int_{-\infty}^{\infty} K(x)\, dx = 1$$

$w$: the bandwidth, which sets the width of the bumps.

Kernel Density Estimation (KDE)

KDE is a non-parametric method for estimating the probability density function (PDF) of a random variable. It provides a smooth description of how the data are distributed, without relying on a specific distributional assumption (such as normality).

Goal

  • The goal of KDE is to estimate the underlying probability density function from a finite sample of data.
  • Like a histogram, KDE describes the distribution of the data, but it is smoother and is not affected by the choice of fixed bins.

Core formula
Given $n$ data points $\{x_1, x_2, \dots, x_n\}$, the KDE at position $x$ is:

$$\hat{f}(x) = \frac{1}{n h} \sum_{i=1}^{n} K\!\left( \frac{x - x_i}{h} \right)$$

  • $K(\cdot)$: the kernel function, which defines how the smoothing weight is spread out.
  • $h$: the bandwidth parameter, which controls the degree of smoothing.
  • $x_i$: the data points.

The kernel function $K(\cdot)$

  • The kernel is a symmetric, non-negative function that integrates to 1; it assigns a weight to each point.
  • Common kernels:
    • Gaussian kernel: $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$
    • Uniform kernel: $K(u) = \frac{1}{2}$ if $|u| \le 1$, otherwise 0
    • Triangular kernel: $K(u) = 1 - |u|$ if $|u| \le 1$, otherwise 0

The bandwidth $h$

  • The bandwidth controls how far each kernel spreads.
  • The choice of $h$ matters a great deal:
    • $h$ too small: the estimate fluctuates too much (overfitting).
    • $h$ too large: the estimate is too smooth (underfitting).
  • The core idea of KDE is to 'cover' each data point smoothly with the kernel function $K(\cdot)$.
  • Placing a kernel centred on each data point, with its width set by the bandwidth $h$, produces a continuous probability density curve.

KDE vs. histograms

| Feature | Histogram | KDE |
| --- | --- | --- |
| Bins | Data are divided into fixed-width bins | No fixed bins required |
| Smoothness | The curve may be discontinuous and jagged | The curve is continuous and smooth |
| Parameters | Bin width | Kernel and bandwidth |
| Flexibility | Sensitive to bin placement | More flexible; suits complex distributions |

Applications

  1. Visualising data distributions: e.g. observing the central tendency and overall shape of the data.
  2. Anomaly detection: identifying data points that do not fit the estimated density.
  3. Probability density estimation: modelling feature distributions in machine learning and statistical modelling.
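The KDE formula above is short enough to implement directly. This is a minimal sketch with a Gaussian kernel; the data, grid, and bandwidth are arbitrary choices:

```python
import numpy as np

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2*pi); integrates to 1."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, data, h):
    """Kernel density estimate: (1 / (n*h)) * sum_i K((x - x_i) / h)."""
    x = np.asarray(x)[:, None]    # evaluation points as a column
    u = (x - data[None, :]) / h   # pairwise scaled distances
    return gaussian_kernel(u).sum(axis=1) / (len(data) * h)

data = np.array([1.0, 1.2, 1.9, 3.1, 3.3])
grid = np.linspace(-5.0, 10.0, 3001)

density = kde(grid, data, h=0.5)

# The estimate is itself a density: its numerical integral is ~1
dx = grid[1] - grid[0]
print(np.sum(density) * dx)
```

Rerunning with a smaller `h` makes the curve develop a bump per data point (overfitting); a larger `h` merges everything into one broad bump (underfitting).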
