Applied Spatial Statistics(一)统计推断

Applied Spatial Statistics(一)统计推断

1.统计推断:Bootstrap 置信区间

本笔记本演示了如何使用引导方法构建统计数据的置信区间。 我们还将检查 CI 的覆盖概率。

  • 构建 Bootstrap 置信区间
  • 检查覆盖概率
  • Bootstrap CI 相关系数
import numpy as np
import matplotlib.pyplot as plt
#Generate some random data
data = np.random.randn(100)*10 + 20

data
array([20.14846988, 13.40823552, 13.32946651, 19.12680721, 17.52762493,
       45.23921971, 34.37879933, 18.87102583, 15.96800357, 25.24450873,
       20.40852062, 22.59343075, 34.79216096, 16.89194103, 28.92760549,
       18.84044456, 18.28049028, 20.10881674,  7.68688806, 10.37430632,
       20.3172587 , 26.42390427, 39.13238623,  7.40129486,  4.38548135,
       23.36831945, 26.89693477, 15.4169132 , 36.71645312,  7.00419646,
       15.15546063, 13.59549372, 18.88764964, 30.84743651, 31.79246417,
        1.91489133, 19.86336078, 21.92654346, 20.24120974, 19.78252461,
       29.67569607, 22.84760632,  5.64987682,  8.71363322, 21.88605373,
       23.48926653, 23.0107298 , 39.27012335, 13.98657903, 33.82055816,
       20.11463245, 21.64808896, -0.70135753, 31.30912412, -1.16449383,
       24.14380325, 35.47126313, 17.98800236, 18.58904375,  5.67235521,
       21.28026186, 19.49719148, 41.08458071, 10.09613031, 31.31805292,
       29.79117483, 13.69039686, 16.06187024, 35.57088589, 16.34373587,
       17.45492499,  9.27927219, 18.36888727, 15.62266002, 17.47501537,
       16.19066077, 22.28515871, 33.46323477, 10.23985703, 25.26497935,
        5.08193247, 36.18867675, 12.35343392, 24.85332929,  6.87104071,
       15.16828773, 28.68638639, 38.51222579,  0.90316588, 17.36319043,
        9.1263876 , 10.78820054, 13.35119181, 39.2378541 , 17.60207791,
       -1.29705377, 22.3886622 , 10.98409074, 20.81760636,  5.51315604])
plt.hist(data,bins=8)
(array([ 6., 13., 16., 30., 13.,  9., 11.,  2.]),
 array([-1.29705377,  4.51998042, 10.3370146 , 16.15404879, 21.97108297,
        27.78811715, 33.60515134, 39.42218552, 45.23921971]),
 <BarContainer object of 8 artists>)






![在这里插入图片描述](https://img-blog.csdnimg.cn/direct/8ceeaaebbdc74efabf2c640a0950a1a6.png)
print("Population mean is:", np.mean(data))

print("Population variance is:", np.var(data))
Population mean is: 19.73510302232665
Population variance is: 105.44761217481347
#One sample with 30 numbers, this is what we have in hand.

sample_30 = np.random.choice(data, 30, replace=True)

sample_30
array([30.84743651, 39.13238623, 34.37879933, 13.32946651, 13.69039686,
        9.1263876 ,  6.87104071, 31.79246417, 10.98409074, 18.58904375,
       18.87102583, 10.23985703, 13.59549372, 25.24450873, 26.42390427,
        5.08193247, 31.79246417, 28.92760549, 15.4169132 ,  8.71363322,
        7.00419646, 25.26497935,  5.51315604, 22.28515871, 15.15546063,
       30.84743651, 24.14380325, 13.98657903, 25.24450873,  0.90316588])

1.构建Bootstrap CI

#Define a bootstrap function:
def bootstrap(sample):
    
    bootstrap_mean_list = []
    
    for i in range(10000):
        #generate a re-sample with the original sample size, with replacement
        subsample = np.random.choice(sample, len(sample), replace=True)
        
        #compute sub-sample mean
        subsample_mean = np.mean(subsample)

        bootstrap_mean_list.append(subsample_mean)
    
    #Calculatet the mean and std of the bootstrap sampling distribution
    bootstrap_mean = np.mean(bootstrap_mean_list)
    boostrap_std = np.std(bootstrap_mean_list)
    
    # mean +- 2*std for an approximate 95% CI.
    CI = [(bootstrap_mean - 2*boostrap_std), (bootstrap_mean + 2*boostrap_std)]
    
    return CI,bootstrap_mean_list
    
#Do the same thing for the percentile-based method:
def bootstrap_perc(sample):
    
    bootstrap_mean_list = []
    
    for i in range(1000):
        #generate a re-sample with the original sample size, with replacement
        subsample = np.random.choice(sample, len(sample), replace=True)
        
        #compute sample mean
        subsample_mean = np.mean(subsample)

        bootstrap_mean_list.append(subsample_mean)
    
    #Get the lower and upper bound for the middle 95%:
    percentile_CI = [np.percentile(bootstrap_mean_list, 2.5), 
          np.percentile(bootstrap_mean_list, 97.5)]
    
    return percentile_CI,bootstrap_mean_list
    
#Define a anlytical based function:
def anlytical(sample):
    
    sample_mean = np.mean(sample)
    
    err_of_margin = 2*np.std(sample)/np.sqrt(len(sample))
    
    # mean +- 2*std for an approximate 95% CI.
    CI_lower = sample_mean - err_of_margin
    CI_upper = sample_mean + err_of_margin
    
    CI = [CI_lower, CI_upper]
    
    return CI

比较引导 CI 与分析 CI:

bootstrap_CI, bootstrap_mean_list = bootstrap(sample_30)
bootstrap_CI_perc, bootstrap_mean_list_perc = bootstrap_perc(sample_30)
analytical_CI = anlytical(sample_30)
print("95% analytical CI: ", analytical_CI)
print("95% bootstrap CI: ", bootstrap_CI)
print("95% percentile-based bootstrap CI: ", bootstrap_CI_perc)
95% analytical CI:  [15.105707709695746, 22.45411196607375]
95% bootstrap CI:  [15.166149117216644, 22.429589185859715]
95% percentile-based bootstrap CI:  [15.128839804780965, 22.302901561304623]
  1. 测试 95% CI 的覆盖率
  • 引导程序
  • 基于百分位数的 Bootstrap
  • 分析
%%time
#generate samples for multiple times
counter = 0
counter_perc = 0
true_mean = np.mean(data)

for i in range(1000):
    #generate a sample with 30 numbers
    sample = np.random.choice(data, 30, replace=True)
    
    #For each sample, we compute the two CIs:
    ci,_ = bootstrap(sample)
    perc_ci,_ = bootstrap_perc(sample)
    analytical_ci = anlytical(sample)
    
    #Check the coverage
    if ci[0] <= true_mean <= ci[1]:
        counter = counter + 1
        
    if perc_ci[0] <= true_mean <= perc_ci[1]:
        counter_perc = counter_perc + 1
CPU times: total: 28.3 s
Wall time: 2min 6s
print("Number of times 95% bootstrap CI covered the population mean:", 
      counter,"out of 1000")
Number of times 95% bootstrap CI covered the population mean: 944 out of 1000
print("Number of times 95% percentile-based bootstrap CI covered the population mean:", 
      counter,"out of 1000")
Number of times 95% percentile-based bootstrap CI covered the population mean: 944 out of 1000
print("Number of times 95% analytical CI covered the population mean:", 
      counter,"out of 1000")
Number of times 95% analytical CI covered the population mean: 944 out of 1000
  1. 相关系数的 Bootstrap 置信区间
import pandas as pd
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv'
df = pd.read_csv(url)
df = df.dropna()
from scipy.stats import *

plt.scatter(df.horsepower, df.acceleration)
plt.xlabel("Horsepower")
plt.ylabel("Acceleration")
d:\work\miniconda3\lib\site-packages\scipy\__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"





Text(0, 0.5, 'Acceleration')

在这里插入图片描述

print("Pearson:",pearsonr(df.horsepower, df.acceleration))
Pearson: (-0.6891955103342369, 1.5818862297814436e-56)
sample_car = df.sample(frac=0.3,replace=True)
sample_car
mpgcylindersdisplacementhorsepowerweightaccelerationmodel_yearoriginname
23833.5498.083.0207515.977usadodge colt m/m
19422.56232.090.0308517.676usaamc hornet
38036.04120.088.0216014.582japannissan stanza xe
14226.0479.067.0196315.574europevolkswagen dasher
31237.2486.065.0201916.480japandatsun 310
..............................
27123.24156.0105.0274516.778usaplymouth sapporo
13616.08302.0140.0414114.074usaford gran torino
33229.8489.062.0184515.380europevokswagen rabbit
27223.84151.085.0285517.678usaoldsmobile starfire sx
8513.08350.0175.0410013.073usabuick century 350

118 rows × 9 columns

print("Pearson:",pearsonr(sample_car.horsepower, sample_car.acceleration))
Pearson: (-0.7670736293678354, 4.176876123725007e-24)
#Define a bootstrap function:
np.random.seed(222)
def bootstrap_pearson(sample_car):
    
    bootstrap_cor_list = []
    
    for i in range(10000):
        #generate a re-sample with the original sample size, with replacement
        subsample = sample_car.sample(len(sample), replace=True)
        
        #compute correlation
        sample_cor = pearsonr(subsample.horsepower, subsample.acceleration)[0]

        bootstrap_cor_list.append(sample_cor)
    
    #Get the lower and upper bound for the middle 95%:
    percentile_CI = [np.percentile(bootstrap_cor_list, 2.5), 
          np.percentile(bootstrap_cor_list, 97.5)]
    
    return percentile_CI
    
bootstrap_pearson(sample_car)
[-0.897486157722385, -0.49473536389623324]
print("95% confidence interval of the Pearson correlation coefficient is:",
     "[-0.855 to -0.606]")

95% confidence interval of the Pearson correlation coefficient is: [-0.855 to -0.606]

我们有 95% 的信心认为人群中 HP 和 ACC 的真实相关性在 -0.855 到 -0.606 之间。

在这种情况下,CI 不覆盖 0,这意味着总体相关系数不为零。 因此,我们从样本中观察到的负趋势不仅仅是因为抽样变化,而是因为总体中真正的相关性是负的。
_cor_list, 97.5)]

return percentile_CI


```python
bootstrap_pearson(sample_car)
[-0.897486157722385, -0.49473536389623324]
print("95% confidence interval of the Pearson correlation coefficient is:",
     "[-0.855 to -0.606]")

95% confidence interval of the Pearson correlation coefficient is: [-0.855 to -0.606]

我们有 95% 的信心认为人群中 HP 和 ACC 的真实相关性在 -0.855 到 -0.606 之间。

在这种情况下,CI 不覆盖 0,这意味着总体相关系数不为零。 因此,我们从样本中观察到的负趋势不仅仅是因为抽样变化,而是因为总体中真正的相关性是负的。

2.统计推断置信区间

本笔记本演示了估计的置信区间 (CI) 的概念。 我们将检查 CI 的覆盖概率。

考虑这样一个场景:FSU 有 20,000 名学生,他们的智商水平服从正态分布,平均值为 110,标准差为 10。 让我们从学生群体中抽取样本,并检查样本与总体之间的关系。

import numpy as np
import matplotlib.pyplot as plt

生成人口数据

N = 20000 #20,000 students

#Generate from a normal distribution
data = np.random.randn(N)*10 + 110

plt.hist(data)
(array([  48.,  288., 1185., 3287., 5481., 5239., 3162., 1050.,  230.,
          30.]),
 array([ 73.30451127,  80.71913869,  88.13376612,  95.54839355,
        102.96302097, 110.3776484 , 117.79227582, 125.20690325,
        132.62153068, 140.0361581 , 147.45078553]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

print("Population mean is:", np.mean(data))
print("Population variance is:", np.var(data))
Population mean is: 110.01962672036136
Population variance is: 102.25452537436367
现在让我们抽取样本。 让我们从 10 名学生的样本开始
#One sample with 10 numbers

sample_10 = np.random.choice(data, 10)

sample_10
array([109.62388683,  91.97729472, 100.46429321, 111.00988364,
       108.31788058, 113.66683355, 109.74268367, 113.06416223,
       105.39076392, 122.01725268])
print("The sample mean is: ", np.mean(sample_10))
The sample mean is:  108.52749350347217
如果我们重复采样过程(例如 1,000 次)怎么样?

从 t 分布获取临界 t 值。 自由度:10-1(样本大小 - 1)。 置信区间宽度:95%。

from scipy import stats

t = stats.t.ppf(1-0.05/2, 9)
print(t)
d:\work\miniconda3\lib\site-packages\scipy\__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"


2.2621571627409915
coverage_list = []

for i in range(1000):
    #generate a sample with 10 numbers
    sample = np.random.choice(data, 10)
    
    #compute sample mean
    sample_mean = np.mean(sample)
    
    err_of_margin = t*np.std(sample)/np.sqrt(10)
    
    CI_lower = sample_mean - err_of_margin
    CI_upper = sample_mean + err_of_margin
    
    true_mean = np.mean(data)
    
    is_covered = (true_mean>=CI_lower) and (true_mean<=CI_upper)
    
    print(i," 95% confidence interval ", [CI_lower, CI_upper], "Cover the true mean?", is_covered)
    
    coverage_list.append(is_covered)
0  95% confidence interval  [102.7151079840605, 118.2405606324954] Cover the true mean? True
1  95% confidence interval  [104.23885267719237, 118.10221181506196] Cover the true mean? True
2  95% confidence interval  [103.72346627181192, 115.9391835906615] Cover the true mean? True
3  95% confidence interval  [104.62550349440005, 115.14252771113254] Cover the true mean? True
4  95% confidence interval  [102.12727319514039, 112.16268597210077] Cover the true mean? True
5  95% confidence interval  [101.08352856103731, 118.69180761678739] Cover the true mean? True
6  95% confidence interval  [103.18857595332162, 113.00591633235916] Cover the true mean? True
7  95% confidence interval  [103.55533372607188, 110.30496134366432] Cover the true mean? True
8  95% confidence interval  [90.18976464492026, 103.9824575401898] Cover the true mean? False
9  95% confidence interval  [102.14358563332436, 115.72116848794366] Cover the true mean? True
10  95% confidence interval  [100.80227845656053, 111.93303378484153] Cover the true mean? True
11  95% confidence interval  [103.89446446670988, 114.34313482587856] Cover the true mean? True
12  95% confidence interval  [100.7487394903015, 111.36109764079389] Cover the true mean? True

3.统计推断点估计

本笔记本演示了统计量抽样分布的概念。 这也表明

  • 样本均值是无偏的
  • 样本方差是无偏的
  • 均值和方差抽样变异的解析解

考虑这样一个场景:FSU 有 20,000 名学生,他们的智商水平服从正态分布,平均值为 110,标准差为 10。 让我们从学生群体中抽取样本,并检查样本与总体之间的关系。

样本平均值
import numpy as np
import matplotlib.pyplot as plt

生成人口数据

N = 20000 #20,000 students

#Generate from a normal distribution
data = np.random.randn(N)*10 + 110

plt.hist(data)
(array([   8.,  142.,  843., 3136., 5751., 5935., 3176.,  867.,  127.,
          15.]),
 array([ 69.32483145,  77.45863551,  85.59243958,  93.72624364,
        101.8600477 , 109.99385177, 118.12765583, 126.26145989,
        134.39526396, 142.52906802, 150.66287208]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

print("Population mean is:", np.mean(data))
print("Population variance is:", np.var(data))
Population mean is: 110.07413289511938
Population variance is: 98.89323295421896
现在让我们抽取样本。 让我们从 10 名学生的样本开始
#One sample with 10 numbers

sample_10 = np.random.choice(data, 10)

sample_10
array([111.66777512, 112.16222242, 118.08689454, 110.24523641,
       125.48603522,  88.41367076, 125.67988677, 113.88800138,
       108.6807358 , 121.01857026])
print("The sample mean is: ", np.mean(sample_10))
The sample mean is:  113.53290286905217
如果我们重复采样过程(例如 1,000 次)怎么样?
#Create an empty list to hold the numbers from each sample
sample_10_mean_list = []

for i in range(1000):
    #generate a sample with 10 numbers
    sample = np.random.choice(data, 10)
    
    #compute sample mean
    sample_mean = np.mean(sample)
    
    print(i," sample mean: ", sample_mean)
    
    #append them to the list
    sample_10_mean_list.append(sample_mean)
0  sample mean:  108.17391918529407
1  sample mean:  111.67818693824806
2  sample mean:  111.67822497907324
3  sample mean:  107.24418829414785
4  sample mean:  113.17329723752623
5  sample mean:  112.9070413608591
6  sample mean:  108.04910734829619
7  sample mean:  112.5903576727253
8  sample mean:  113.62852862972878
9  sample mean:  113.27149334572691
10  sample mean:  107.76969738534706
11  sample mean:  109.1620666792978
12  sample mean:  103.71358812699627
13  sample mean:  107.697395964914
14  sample mean:  110.8082408750503
15  sample mean:  110.92158184257207
16  sample mean:  108.58419010871549
17  sample mean:  109.65230764397218
18  sample mean:  109.86678913378717
19  sample mean:  105.08768958694495
20  sample mean:  106.88640190007649
21  sample mean:  112.56008138539062
22  sample mean:  110.83395797118828
23  sample mean:  110.73008027594503
24  sample mean:  109.3294505313664
25  sample mean:  110.04042595076564
26  sample mean:  109.67050530705797
27  sample mean:  108.32738807341839
28  sample mean:  108.49314412381798
29  sample mean:  107.1858761872827
30  sample mean:  112.46187853160286
31  sample mean:  113.12349039970186
32  sample mean:  110.14207252003163
33  sample mean:  110.75599034550528
34  sample mean:  106.8240410457283
35  sample mean:  111.8765477301838
36  sample mean:  113.82299698648565
37  sample mean:  109.31286148365352
38  sample mean:  111.97700620796886
39  sample mean:  109.9008197839028
40  sample mean:  107.69013776731724
41  sample mean:  114.28574317155494
42  sample mean:  109.50146395888653
43  sample mean:  107.40447023607123
44  sample mean:  116.24621309981323
45  sample mean:  112.67872689503027
46  sample mean:  106.53331942661855
47  sample mean:  110.35618578397953
48  sample mean:  107.67531638860439
49  sample mean:  111.03510449989963
50  sample mean:  109.03305491756535
51  sample mean:  111.12749226215763
52  sample mean:  107.44884335343575
53  sample mean:  111.4216569660086
54  sample mean:  112.14976633180977
55  sample mean:  108.046169949935
56  sample mean:  107.78856332959371
57  sample mean:  115.0407995071375
58  sample mean:  107.82026621773139
59  sample mean:  106.80442741389179
60  sample mean:  109.07177961379848
61  sample mean:  110.57205199833035
62  sample mean:  107.35628147704801
63  sample mean:  107.84236667449561
64  sample mean:  111.90515735465651
65  sample mean:  114.4823745877502
66  sample mean:  102.21096088759352
67  sample mean:  112.18457688111475
68  sample mean:  107.4440654002789
69  sample mean:  108.47620386204194
70  sample mean:  115.00823605453897
71  sample mean:  106.75913949864893
72  sample mean:  111.13852427880879
73  sample mean:  108.95844954274037
74  sample mean:  113.59768517919773
75  sample mean:  111.31910512168243
76  sample mean:  106.38067170844681
77  sample mean:  109.49375945640797
78  sample mean:  115.093047567975
79  sample mean:  106.09833252432466
80  sample mean:  116.4311422319212
81  sample mean:  108.28528941038228
82  sample mean:  103.82927317186042
83  sample mean:  111.18673783032439
84  sample mean:  108.4971016336932
85  sample mean:  105.49599715374734
86  sample mean:  110.14529385508467
87  sample mean:  105.75197161089457
88  sample mean:  115.80545357784479
89  sample mean:  107.75685400068934
90  sample mean:  109.49381609427321
91  sample mean:  108.95402323987977
92  sample mean:  111.97213167688433
93  sample mean:  109.47712760468328
94  sample mean:  113.14660304988516
95  sample mean:  110.80976482820472
96  sample mean:  107.92343270315664
97  sample mean:  108.37770599789269
98  sample mean:  105.25528550623233
99  sample mean:  115.05666095395479
100  sample mean:  107.77247710607735
101  sample mean:  108.89543987766524
102  sample mean:  112.12273016811253
103  sample mean:  114.80392467210928
104  sample mean:  106.68961413411903
105  sample mean:  106.03545032270267
106  sample mean:  114.97887048765503
107  sample mean:  110.37581153007821
108  sample mean:  110.65923878869683
109  sample mean:  111.3430788040037
110  sample mean:  107.7232254285943
111  sample mean:  111.57244948737664
112  sample mean:  105.28194979231736
113  sample mean:  109.98995161589244
114  sample mean:  109.53396064067333
115  sample mean:  111.38428386596779
116  sample mean:  110.97486991904695
117  sample mean:  112.87118032789704
118  sample mean:  108.04167623331617
119  sample mean:  109.74945215273877
120  sample mean:  113.34034390095641
121  sample mean:  108.87412423611276
122  sample mean:  105.06601270840356
123  sample mean:  109.90153164811923
124  sample mean:  108.29931093364205
125  sample mean:  114.54952533305318
126  sample mean:  108.41885224159255
127  sample mean:  109.432696156925
128  sample mean:  108.85548860855647
129  sample mean:  106.69453432738733
130  sample mean:  110.73156301741199
131  sample mean:  110.3344293210414
132  sample mean:  106.43368100363816
133  sample mean:  111.57845827640968
134  sample mean:  116.41696101920373
135  sample mean:  110.09786113805441
136  sample mean:  110.12713022826912
137  sample mean:  115.19539623223059
138  sample mean:  114.69947911209549
139  sample mean:  112.15932563335575
140  sample mean:  105.03670120615747
141  sample mean:  106.55324659192142
142  sample mean:  108.87155226347852
143  sample mean:  113.19328040597836
144  sample mean:  107.43749597783346
145  sample mean:  112.52280142597355
146  sample mean:  104.25237526935716
147  sample mean:  111.74730695830422
148  sample mean:  106.47421562458676
149  sample mean:  110.41192960632631
150  sample mean:  111.06560448474977
151  sample mean:  110.8574597363748
152  sample mean:  113.5702179737926
153  sample mean:  102.94360945489191
154  sample mean:  112.51717223500005
155  sample mean:  116.24740871295299
156  sample mean:  113.38278417361005
157  sample mean:  115.25751695567212
158  sample mean:  110.0492555852654
159  sample mean:  111.79273879730708
160  sample mean:  115.00085251159253
161  sample mean:  108.46956146896632
162  sample mean:  112.25352540521085
163  sample mean:  110.09704245884075
164  sample mean:  109.37142519972653
165  sample mean:  113.528730214851
166  sample mean:  111.73192964440406
167  sample mean:  107.83625790041121
168  sample mean:  110.30771604129515
169  sample mean:  104.20621128051827
170  sample mean:  110.32212617511611
171  sample mean:  108.00438566363746
172  sample mean:  113.61660167443861
173  sample mean:  108.81058515088898
174  sample mean:  110.77800251434556
175  sample mean:  113.10422375502681
176  sample mean:  114.39592439664179
177  sample mean:  112.55187994163687
178  sample mean:  107.23704574273629
179  sample mean:  108.57205135565349
180  sample mean:  110.58745831308568
181  sample mean:  110.47926496421303
182  sample mean:  107.99057552600043
183  sample mean:  114.1629001335393
184  sample mean:  106.45733896312565
185  sample mean:  114.21043571521477
186  sample mean:  111.0160001995882
187  sample mean:  110.23687892814004
188  sample mean:  113.88730941650071
189  sample mean:  107.42852295711978
190  sample mean:  109.50077760121633
191  sample mean:  110.37270694898527
192  sample mean:  105.7880522406582
193  sample mean:  114.6503862528984
194  sample mean:  109.42960390312935
195  sample mean:  104.93643110081601

每次采样时我们都会观察到不同的样本均值。 如果我们看一下它们的分布,样本均值的抽样分布是:

plt.hist(sample_10_mean_list)
(array([  9.,  18.,  71., 148., 220., 207., 164., 105.,  49.,   9.]),
 array([100.41739832, 102.22911703, 104.04083574, 105.85255445,
        107.66427316, 109.47599187, 111.28771058, 113.0994293 ,
        114.91114801, 116.72286672, 118.53458543]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

**重要提示:**这不是智商水平的分布! 这是给定样本(样本中有 10 名学生)的平均 IQ 水平超过 1,000 次的分布。

该分布的平均值为:

np.mean(sample_10_mean_list)
109.81834080140132

让我们与总体平均值进行比较:

np.mean(data)
110.07413289511938

相当接近!

如果我们1)增加样本量(10->100),样本均值是否会更接近真实均值? 2) 进行更多采样(10,000 -> 1,000,000)。 是的!

#Create an empty list to hold the numbers from each sample
sample_100_mean_list = []

for i in range(1000000):
    #generate a sample with 100 numbers
    sample = np.random.choice(data, 100)
    
    #compute sample mean
    sample_mean = np.mean(sample)
    
    #print(i," sample mean: ", sample_mean)
    
    #append them to the list
    sample_100_mean_list.append(sample_mean)
plt.hist(sample_100_mean_list)
(array([6.80000e+01, 2.14900e+03, 2.68310e+04, 1.42865e+05, 3.29388e+05,
        3.28362e+05, 1.41841e+05, 2.63350e+04, 2.09200e+03, 6.90000e+01]),
 array([105.35779164, 106.30191012, 107.24602861, 108.1901471 ,
        109.13426558, 110.07838407, 111.02250256, 111.96662104,
        112.91073953, 113.85485802, 114.7989765 ]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

print("Average sample mean:", np.mean(sample_100_mean_list))
print("True mean:", np.mean(data))
Average sample mean: 110.07524696756842
True mean: 110.07413289511938

进一步增加样本量和样本次数将使样本均值收敛于真实总体均值,这表明样本均值是一个无偏统计量。

我们可以比较使用两个不同样本量时样本均值的分布。

plt.hist(sample_10_mean_list, density=True, alpha=0.5,bins=10,label='sample size = 10')

plt.hist(sample_100_mean_list, density=True, alpha=0.5,bins=10,label='sample size = 100')

plt.legend()
<matplotlib.legend.Legend at 0x1a70b3b9a90>

在这里插入图片描述

当样本量较小时,抽样分布更宽(即抽样变异性更大)!

这是因为:从分析上来说,样本均值的抽样分布遵循正态分布,均值为总体均值,标准差为: σ n \frac{\sigma}{\sqrt{n}} n σ

其中 σ \sigma σ 是总体标准差, n n n 是样本量。

print("Analytical SD of the sampling distribution:", 
      np.std(data)/np.sqrt(10))

print("Empirical SD of the sampling distribution:", 
      np.std(sample_10_mean_list))
Analytical SD of the sampling distribution: 3.1447294470942797
Empirical SD of the sampling distribution: 3.1421107844908382
print("Analytical SD of the sampling distribution:", 
      np.std(data)/np.sqrt(100))

print("Empirical SD of the sampling distribution:", 
      np.std(sample_100_mean_list))
Analytical SD of the sampling distribution: 0.9944507677819902
Empirical SD of the sampling distribution: 0.994316399287971
同样,让我们检查样本方差的抽样分布。
#define a small function to calculate sample variance.
def sample_var(sample):
    mean = np.mean(sample)
    n = sample.shape[0]
    return np.sum((sample - mean)**2)/(n-1)
#One sample with 10 numbers

sample_10 = np.random.choice(data, 10)

sample_10
array([121.64683719, 103.11098717, 116.80834819, 103.55885849,
        91.93607486, 110.61085577, 120.87592219, 101.59268352,
       117.57011851, 108.56843983])
print("Sample variance is:", sample_var(sample_10))
Sample variance is: 93.84088686288874
#generate samples for multiple times
sample_10_variance_list = []

for i in range(10000):
    #generate a sample with 10 numbers
    sample = np.random.choice(data, 10)
    
    #compute sample variance
    sample_variance = sample_var(sample)
    
    print(i," sample variance: ", sample_variance)
    
    #append them to the list
    sample_10_variance_list.append(sample_variance)
0  sample variance:  133.16255153634953
1  sample variance:  84.95111363055562
2  sample variance:  129.25853520691152
3  sample variance:  162.9507939120388
4  sample variance:  57.65250685184362
5  sample variance:  127.05245098320125
6  sample variance:  70.03273515311018
7  sample variance:  96.41498645686788
8  sample variance:  83.05696382116894
9  sample variance:  119.12242739103021
10  sample variance:  72.61720603700626

方差的抽样分布

plt.hist(sample_10_variance_list)
(array([9.170e+02, 3.058e+03, 3.017e+03, 1.754e+03, 8.010e+02, 2.870e+02,
        1.000e+02, 5.400e+01, 9.000e+00, 3.000e+00]),
 array([  8.95012759,  45.1402609 ,  81.3303942 , 117.5205275 ,
        153.71066081, 189.90079411, 226.09092741, 262.28106072,
        298.47119402, 334.66132733, 370.85146063]),
 <BarContainer object of 10 artists>)

在这里插入图片描述

print("Average of sample variance:", np.mean(sample_10_variance_list))

print("True variance:", np.var(data))
Average of sample variance: 99.59784954844092
True variance: 98.89323295421896

再次,相当接近!

增加样本大小和样本数量:

#generate samples for multiple times
sample_100_variance_list = []

for i in range(1000000):
    #generate a sample with 100 numbers
    sample = np.random.choice(data, 100)
    
    #compute sample variance
    sample_variance = sample_var(sample)
    
    #append them to the list
    sample_100_variance_list.append(sample_variance)
print("Average of sample variance:", np.mean(sample_100_variance_list))
print("True variance:", np.var(data))
Average of sample variance: 98.88787841866694
True variance: 98.89323295421896

靠近点!

比较两个相同大小的样本均值的抽样分布。

plt.hist(sample_10_variance_list,density=True,alpha=0.5,bins=20,label='sample size = 10')
plt.hist(sample_100_variance_list,density=True,alpha=0.5,bins=20,label='sample size = 100')
plt.legend()
<matplotlib.legend.Legend at 0x1a701f14670>

在这里插入图片描述

从分析上看,样本方差的抽样分布遵循自由度为 n - 1(n 为样本大小)的卡方分布。

( n − 1 ) s 2 / σ 2 ∼ χ n − 1 2 (n-1)s^2/\sigma^2 \sim \chi^2_{n-1} (n1)s2/σ2χn12

其中 s s s 是样本标准差, σ \sigma σ 是总体标准差

numpy 函数模拟卡方分布:random.chisquare(df, size=None)

anlytical_dist_10 = np.random.chisquare(10-1, 10000)*np.var(data)/(10-1)

anlytical_dist_100 = np.random.chisquare(100-1, 10000)*np.var(data)/(100-1)
plt.hist(anlytical_dist_10,density=True,alpha=0.5,bins=20,label='sample size = 10')
plt.hist(anlytical_dist_100,density=True,alpha=0.5,bins=20,label='sample size = 100')
plt.legend()
<matplotlib.legend.Legend at 0x1a711da2190>

在这里插入图片描述

分析分布与我们的经验抽样分布相同!


本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/505756.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

数据挖掘入门项目二手交易车价格预测之特征工程

文章目录 目标常见的特征工程具体步骤1. 导入数据2. 删除异常值3. 特征构造3.1 为树模型构造特征3.2 为LR NN 之类的模型构造特征 4. 特征筛选过滤式包裹式嵌入式 5. 总结 本文数据集来自阿里天池&#xff1a;https://tianchi.aliyun.com/competition/entrance/231784/informat…

Debian linux版本下运行的openmediavault网盘 千兆网卡升级万兆

一、适用场景 1、使用vmware ESXi虚拟化平台运行多种不同应用服务器时&#xff0c;其中网盘服务器采用开源的openmediavault搭建&#xff1b; 2、将老专业服务器升级千兆网为万兆网&#xff1b; 3、需要转移的数据量大的企业或用户&#xff1b; 4、从服务器到服务器的数据转移…

LeetCode刷题【链表,图论,回溯】

目录 链表138. 随机链表的复制148. 排序链表146. LRU 缓存 图论200. 岛屿数量994. 腐烂的橘子207. 课程表 回溯 链表 138. 随机链表的复制 给你一个长度为 n 的链表&#xff0c;每个节点包含一个额外增加的随机指针 random &#xff0c;该指针可以指向链表中的任何节点或空节…

基于知识图谱的个性化学习推荐系统的设计与实现(论文+源码)_kaic

摘 要 Abstract 1 绪 论 1.1 研究背景及意义 1.2 国内外现状研究 1.3 研究工作和论文结构 2 相关技术 2.1 HTML 语言 2.2 Python 语言 2.3 数据库技术 2.4 Django 框架 3 系统分析 3.1 需求概述 3.2 系统可行性分析 3.2.1 技术可行性 3.2.2 经济可行性 3.2.3 操作可行性 3.3 功…

网络基础二补充——json与http协议

五、市面上常用序列化和反序列化工具 ​ 常用的有&#xff1a;json、protobuf、xml三种方案&#xff1b; 5.1json的使用 1.安装jsoncpp库&#xff0c;是一个第三方的开发库文件&#xff1b; sudo yum install -y jsoncpp-devel2.使用json ​ 经常使用的头文件是json.h&…

Python之Opencv教程(2):图像边缘检测

1、什么是边缘检测 OpenCV中的边缘检测是一种常见的图像处理技术&#xff0c;用于检测图像中物体边缘的位置。常用的边缘检测算法包括Sobel算子、Scharr算子、Laplacian算子和Canny边缘检测算法等。下面将介绍使用OpenCV实现这些边缘检测算法的方法。 2、边缘检测的作用 边缘…

C语言---自定义类型:联合体和枚举

文章目录 前言1. 联合体类型的声明1.1 联合体类型的声明1.2 联合体的特点1.4 联合体大小的计算1.5 联合的一个练习 2.枚举2.1 枚举类型的声明2.2 枚举类型的优点 前言 上一篇我们学习了自定义类型—结构体&#xff0c;大家会发现&#xff0c;构建一个结构体时&#xff0c;有些…

程序数据模型由OS还是硬件架构决定?

文章目录 前言硬件架构的作用OS的作用编译器的角色OS的数据模型参考 前言 在文章 1>>32的结果是1还是0 中提到了数据模型 L P 64 LP64 LP64 &#xff0c;并提出这个数据模型主要是由 U n i x Unix Unix 以及类 U n i x Unix Unix 的操作系统使用居多&#xff0c;例如…

macOS Catalina for mac (macos 10.15系统)v10.15.7正式版

macOS Catalina是苹果公司专为麦金塔电脑推出的桌面操作系统&#xff0c;是macOS的第16个主要版本。它继承了苹果一贯的优雅与高效&#xff0c;不仅引入了分割视图和侧边栏&#xff0c;还带来了全新的音乐和播客应用&#xff0c;极大地提升了用户体验。在隐私保护和安全性方面&…

java学习总结以及考试总结

1.对象的this引用 this引用用于区分成员变量和局部变量&#xff0c;this引用的一定的指的是成员变量 所以说this语句的作用就是区分成员变量和局部变量&#xff08;如何呢&#xff09; package com.temo.test1;public class student{private String name;//成员变量private …

Optimizer神经网络中各种优化器介绍

1. SGD 1.1 batch-GD 每次更新使用全部的样本&#xff0c;注意会对所有的样本取均值&#xff0c;这样每次更新的速度慢。计算量大。 1.2 SGD 每次随机取一个样本。这样更新速度更快。SGD算法在于每次只去拟合一个训练样本&#xff0c;这使得在梯度下降过程中不需去用所有训…

OpenEuler华为欧拉系统安装教程及联网配置

OpenEuler简介 openEuler是一款开源操作系统。当前openEuler内核源于Linux&#xff0c;支持鲲鹏及其它多种处理器&#xff0c;能够充分释放计算芯片的潜能&#xff0c;是由全球开源贡献者构建的高效、稳定、安全的开源操作系统&#xff0c;适用于数据库、大数据、云计算、人工智…

【Laravel】07 快速套用一个网站模板

【Laravel】07 快速套用一个网站模板 1. 新增post表2.补充 &#xff1a;生成Model、Controller、迁移文件3. 使用php artisan tinker4. 网站模板下载 课程地址 1. 新增post表 在Model中创建Post (base) ➜ example-app php artisan make:model Post Model created successfu…

力扣 1035. 不相交的线

题目来源&#xff1a;https://leetcode.cn/problems/uncrossed-lines/description/ C题解&#xff1a;经过细细一推导&#xff0c;就发现跟力扣 1143. 最长公共子序列-CSDN博客 换汤不换药。 直线不能相交&#xff0c;说明元素顺序不能改变&#xff0c;求可以绘制的最大连线数…

相机显示储存卡未格式化怎么回事?怎么办

在摄影的学习和实践中&#xff0c;相机是我们记录美好瞬间的得力助手。然而&#xff0c;当相机突然提示储存卡未格式化时&#xff0c;这往往会让我们感到困惑和焦虑。本文将探讨相机显示储存卡未格式化的可能原因&#xff0c;并提供相应的解决方案。 图片来源于网络&#xff0c…

游戏引擎中的大气和云的渲染

一、大气 首先和光线追踪类似&#xff0c;大气渲染也有类似的渲染公式&#xff0c;在实际处理中也有类似 Blinn-Phong的拟合模型。关键参数是当前点到天顶的角度和到太阳的角度 二、大气散射理论 光和介质的接触&#xff1a; Absorption 吸收Out-scattering 散射Emission …

汇编语言第四版-王爽第1章 基础知识

前言 基础知识 &#xff08;1&#xff09;换成bit&#xff0c;1KB1024B&#xff0c;1Byte8bit&#xff1b;1KB1024*8bit&#xff0c;即2的13次方&#xff0c;宽度为13. &#xff08;2&#xff09;1个存储单元只能放1个字节&#xff0c;1KB1024B&#xff1b;编号从0到1023. &a…

web前端面试题----->VUE

Vue的数据双向绑定是通过Vue的响应式系统实现的。具体原理&#xff1a; 1. Vue会在初始化时对数据对象进行遍历&#xff0c;使用Object.defineProperty方法将每个属性转化为getter、setter。这样在访问或修改数据时&#xff0c;Vue能够监听到数据的变化。 2. 当数据发生变化时…

书生 浦语大模型全链路开源体系

通用大模型成为发展通用人工智能的重要途径 书生 浦语大模型的开源历程 书生 浦语 2.0体系&#xff0c;面向不同的使用需求&#xff0c;每个规格包含三个模型版本&#xff0c;&#xff08;7B、20B&#xff09;InternLM2-Base、InternLM2、InternLM2-Chat。 大模型是回归语言建…

python通过shapely 的 valid 判断aoi图形是否有效

测试aoi坐标&#xff1a; 116.527712,39.924304;116.527123,39.924353;116.52707,39.923985;116.527685,39.92397;116.527712,39.924304 如图所示是一个有效的坐标&#xff0c;使用python代码判断是否有效&#xff1a; 代码&#xff1a; from shapely.geometry import Polyg…