Applied Spatial Statistics (5): Linear Regression I
This notebook demonstrates:
- Linear regression coefficient estimates are unbiased under the model assumptions
- How to construct bootstrap confidence intervals around the coefficient estimates
- Residual plots
- Q-Q plots
1. Linear regression coefficient estimates are unbiased under the assumptions
import numpy as np
import matplotlib.pyplot as plt
Statsmodels is a statistical modeling package for Python
import statsmodels.api as sm
#Randomly draw X and error
x = np.random.randn(1000) # ~ N(0,1)
e = np.random.randn(1000) # ~ N(0,1)
Define a linear relationship with some disturbance (noise)
#b0 = 10
#b1 = 4
y = 10 + 4*x + e
plt.scatter(x,y,alpha=0.4)
plt.axline((0, 10), (3, 22), linewidth=2, color='r')
x_cons = sm.add_constant(x)
model = sm.OLS(y,x_cons).fit()
Summary of the linear regression results
model.summary()
| Dep. Variable: | y | R-squared: | 0.944 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.944 |
| Method: | Least Squares | F-statistic: | 1.678e+04 |
| Date: | Wed, 07 Feb 2024 | Prob (F-statistic): | 0.00 |
| Time: | 19:25:57 | Log-Likelihood: | -1394.3 |
| No. Observations: | 1000 | AIC: | 2793. |
| Df Residuals: | 998 | BIC: | 2802. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 9.9372 | 0.031 | 321.228 | 0.000 | 9.876 | 9.998 |
| x1 | 4.0251 | 0.031 | 129.541 | 0.000 | 3.964 | 4.086 |

| Omnibus: | 1.647 | Durbin-Watson: | 1.924 |
|---|---|---|---|
| Prob(Omnibus): | 0.439 | Jarque-Bera (JB): | 1.671 |
| Skew: | 0.098 | Prob(JB): | 0.434 |
| Kurtosis: | 2.963 | Cond. No. | 1.06 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
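The point estimates, standard errors, and analytic 95% confidence intervals can also be pulled directly from the fitted results object rather than read off the summary table; a minimal sketch (an addition, not part of the original notebook):
#Point estimates, standard errors, and analytic 95% CIs from the fitted model
print(model.params)
print(model.bse)
print(model.conf_int(alpha=0.05))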
Test for unbiasedness
def reg_unbiased(x,y):
    estimates_list = []
    #Repeat the sampling process for 10,000 times
    for i in range(10000):
        #Get a random index
        sample_index = np.random.choice(1000, 300)
        #Use the index to index y and x
        sample_y = y[sample_index]
        sample_x = x[sample_index]
        #add constant to x
        sample_x = sm.add_constant(sample_x)
        #Fit the OLS
        model = sm.OLS(sample_y,sample_x).fit()
        #Append the parameters
        estimates_list.append(model.params)
    return np.array(estimates_list)
rslt = np.array(reg_unbiased(x,y))
rslt
array([[9.98692129, 4.0672853 ],
[9.93114785, 4.03665871],
[9.99781954, 4.10916308],
...,
[9.91294925, 3.98044767],
[9.88729838, 3.992089 ],
[9.87920183, 4.12874911]])
np.mean(rslt,axis=0)
array([9.93654559, 4.02654646])
The average of the estimates over the 10,000 replications essentially recovers the true regression line with an intercept of 10 and a slope of 4.
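As an additional visual check of unbiasedness, we could plot the sampling distributions of the two estimates; a minimal sketch reusing the rslt array from above:
#Histograms of the 10,000 intercept and slope estimates; both should be centered
#at the true values 10 and 4 (marked in red)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(rslt[:, 0], bins=50)
axes[0].axvline(10, color="red")
axes[0].set_title("Intercept estimates")
axes[1].hist(rslt[:, 1], bins=50)
axes[1].axvline(4, color="red")
axes[1].set_title("Slope estimates")
plt.show()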
2. How to construct bootstrap confidence intervals around the coefficient estimates
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import *
import statsmodels.formula.api as smf
#Use the mpg dataset
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv'
df = pd.read_csv(url)
df = df.dropna()
If your data is in a `pandas.DataFrame`, you can use this formula interface to fit the model
model = smf.ols(formula='acceleration ~ horsepower', data=df).fit()
subsample_coef = model.params
subsample_coef.values[0]
20.701933699127174
model.summary()
| Dep. Variable: | acceleration | R-squared: | 0.475 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.474 |
| Method: | Least Squares | F-statistic: | 352.8 |
| Date: | Wed, 07 Feb 2024 | Prob (F-statistic): | 1.58e-56 |
| Time: | 19:25:58 | Log-Likelihood: | -827.24 |
| No. Observations: | 392 | AIC: | 1658. |
| Df Residuals: | 390 | BIC: | 1666. |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 20.7019 | 0.293 | 70.717 | 0.000 | 20.126 | 21.277 |
| horsepower | -0.0494 | 0.003 | -18.784 | 0.000 | -0.055 | -0.044 |

| Omnibus: | 29.904 | Durbin-Watson: | 1.483 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 37.989 |
| Skew: | 0.608 | Prob(JB): | 5.63e-09 |
| Kurtosis: | 3.919 | Cond. No. | 322. |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Draw bootstrap samples with replacement from the DataFrame
#Define a bootstrap CI function
def bootstrap_ci(df):
    np.random.seed(100)
    bootstrap_b0 = []
    bootstrap_b1 = []
    for i in range(1000):
        #generate a re-sample with the original sample size, with replacement
        sample = df.sample(len(df), replace=True)
        #formulate the model
        model = smf.ols(formula='acceleration ~ horsepower', data=sample).fit()
        #Get the intercept and slope estimates
        subsample_coef = model.params
        #Append them to the list
        bootstrap_b0.append(subsample_coef.values[0])
        bootstrap_b1.append(subsample_coef.values[1])
    #Get the lower and upper bound for the middle 95%:
    b0_CI = [np.percentile(bootstrap_b0, 2.5),
             np.percentile(bootstrap_b0, 97.5)]
    b1_CI = [np.percentile(bootstrap_b1, 2.5),
             np.percentile(bootstrap_b1, 97.5)]
    return b0_CI,b1_CI
rslt = bootstrap_ci(df)
print("95% CI for intercept:",rslt[0])
95% CI for intercept: [20.017892056783964, 21.347498114044548]
print("95% CI for slope:",rslt[1])
95% CI for slope: [-0.0548220191942679, -0.043592380423869134]
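For comparison, the analytic 95% CIs from the full-sample OLS fit can be obtained with conf_int(); they should be close to, but not identical to, the bootstrap percentile intervals above (a minimal sketch reusing the model fitted earlier in this section):
#Analytic 95% CIs for Intercept and horsepower from the full-sample fit
print(model.conf_int(alpha=0.05))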
import seaborn as sns
sns.regplot(data=df, x="horsepower", y="acceleration", ci=95, truncate=False,line_kws={"color": "red"})
#plt.xlim(300)
3. Residual plot
plt.scatter(model.predict(),model.resid)
plt.hlines(0,8,20,color="black")
plt.ylim(-8,8)
plt.xlabel("Fitted acceleration")
plt.ylabel("Residuals")
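Beyond the visual check, we could also run a formal heteroskedasticity test on the same residuals; a minimal sketch using statsmodels' Breusch-Pagan test (an addition to the original notebook):
from statsmodels.stats.diagnostic import het_breuschpagan
#Breusch-Pagan test: H0 = the residual variance does not depend on the regressors
#model.model.exog is the design matrix (intercept + horsepower) used in the fit
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan LM p-value:", lm_pvalue)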
4. Q-Q plot
res = model.resid
fig = sm.qqplot(res,line='q')
#plt.xlim(-8,8)
plt.ylim(-8,8)
plt.show()
plt.hist(model.resid)
(array([ 5., 29., 69., 106., 102., 40., 25., 8., 5., 3.]),
array([-4.99465643, -3.73465643, -2.47465643, -1.21465643, 0.04534357,
1.30534357, 2.56534357, 3.82534357, 5.08534357, 6.34534357,
7.60534357]),
<BarContainer object of 10 artists>)
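The Q-Q plot and the histogram can be complemented by a formal normality test on the residuals; a minimal sketch using the Jarque-Bera statistic that is also reported in the summary table:
from statsmodels.stats.stattools import jarque_bera
#Jarque-Bera test: H0 = the residuals are normally distributed
jb_stat, jb_pvalue, skew, kurtosis = jarque_bera(model.resid)
print("Jarque-Bera p-value:", jb_pvalue)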
5. Linear Regression - Model Specification
This notebook demonstrates:
- Correctly specified model (baseline)
- Underspecified model
  - Uncorrelated predictors
  - Correlated predictors
- Overspecified model
  - Perfect multicollinearity
  - Varying degrees of multicollinearity
  - Including irrelevant variables
import numpy as np
import statsmodels.api as sm
5.1 Correctly specified model (baseline)
- We randomly generate 100 uncorrelated values of X1 and X2 from normal distributions.
- We specify the true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We correctly include X1 and X2 in the regression model: y ~ X1 + X2 (with an intercept)
- We fit 1,000 such regression models and examine the bias and precision of the estimates.
estimates = []
for i in range(1000):
    X1 = np.random.randn(100) + 2
    X2 = np.random.randn(100) + 2
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1,X2]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [3.0037811 2.00347394 3.99617821]
SE of estimates: [0.30086517 0.09971629 0.10145132]
5.2 Underspecified model
Uncorrelated predictors (#1)
- We randomly generate 100 uncorrelated values of X1 and X2 from normal distributions.
- We specify the true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include only X1 in the regression model: y ~ X1 (with an intercept)
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1,0], [0,1]]
    X = np.random.multivariate_normal(means,cov,100)
    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [11.01614906 1.99128492]
SE of estimates: [0.95786376 0.42621356]
We can see that the intercept estimate is biased (it would be unbiased only if the omitted variable had mean zero), while the coefficient b1 is unbiased. The SEs of the estimates are much larger than in the correctly specified model.
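This matches the omitted-variable algebra: with $X_2$ dropped and uncorrelated with $X_1$, the expected intercept becomes $\beta_0 + \beta_2\,\mathrm{E}[X_2] = 3 + 4 \times 2 = 11$, which is what we see above, while the expected slope on $X_1$ remains 2.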
Correlated predictors (#2)
- We randomly generate 100 correlated values of X1 and X2 from a bivariate normal distribution (so that the correlation between X1 and X2 is 0.4).
- We specify the true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include only X1 in the regression model: y ~ X1 (with an intercept)
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    means = [2, 2]
    cov = [[1,0.4], [0.4,1]]
    X = np.random.multivariate_normal(means,cov,100)
    e = np.random.randn(100)
    y = 3 + 2*X[:,0] + 4*X[:,1] + e
    X = sm.add_constant(X[:,0])
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
np.mean(np.array(estimates),axis=0)
array([7.77702431, 3.60628815])
#The standard errors of the estimates across the 1000 replications are:
np.std(np.array(estimates),axis=0)
array([0.85773046, 0.38963273])
We can see that the intercept estimate is biased (it would be unbiased only if both variables had mean zero), and the coefficient b1 is also biased relative to its true value of 2.
Conclusion: when the predictors are correlated, omitting one of them biases the remaining estimates.
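The size of the bias follows the omitted-variable-bias formula: $\mathrm{E}[\hat\beta_1] = \beta_1 + \beta_2\,\frac{\mathrm{Cov}(X_1,X_2)}{\mathrm{Var}(X_1)} = 2 + 4 \times 0.4 = 3.6$, matching the simulated mean of about 3.61, and the implied intercept is $3 + 4 \times (2 - 0.4 \times 2) = 7.8$, matching the simulated mean of about 7.78.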
5.3 Overspecified model
Perfect multicollinearity (#1)
- We randomly generate 100 data points for X1 from a normal distribution.
- We set X2 = 1 - X1
- We specify the true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We fit the regression model: y ~ X1 + X2 (with an intercept)
X1 = np.random.randn(100)
X2 = 1 - X1
e = np.random.randn(100)
y = 3 + 2*X1 + 4*X2 + e
X = sm.add_constant(np.column_stack([X1,X2]))
model = sm.OLS(y,X).fit()
model.summary()
| Dep. Variable: | y | R-squared: | 0.826 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.825 |
| Method: | Least Squares | F-statistic: | 466.2 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 4.98e-39 |
| Time: | 09:50:43 | Log-Likelihood: | -141.90 |
| No. Observations: | 100 | AIC: | 287.8 |
| Df Residuals: | 98 | BIC: | 293.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 4.0724 | 0.075 | 54.100 | 0.000 | 3.923 | 4.222 |
| x1 | 1.0145 | 0.072 | 14.005 | 0.000 | 0.871 | 1.158 |
| x2 | 3.0579 | 0.045 | 67.296 | 0.000 | 2.968 | 3.148 |

| Omnibus: | 1.402 | Durbin-Watson: | 2.156 |
|---|---|---|---|
| Prob(Omnibus): | 0.496 | Jarque-Bera (JB): | 1.313 |
| Skew: | 0.152 | Prob(JB): | 0.519 |
| Kurtosis: | 2.527 | Cond. No. | 1.32e+16 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.91e-30. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
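statsmodels still returns coefficient values here because its OLS solver uses a pseudo-inverse by default, but with X2 = 1 - X1 the individual coefficients are not identified (note [2] above flags this). A quick way to detect the problem is to check the rank and condition number of the design matrix; a minimal sketch reusing the X built above:
#The design matrix [const, X1, X2] has 3 columns but only rank 2,
#so no unique least-squares solution exists for the three coefficients
print("Columns:", X.shape[1])
print("Rank:", np.linalg.matrix_rank(X))
print("Condition number:", np.linalg.cond(X))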
Varying degrees of multicollinearity (#2)
- We randomly generate 100 correlated values of X1 and X2 from a bivariate normal distribution, letting X1 and X2 have varying degrees of correlation (0, 0.3, 0.6, 0.9, 0.95, and 0.99).
- We specify the true regression line: $y = 2X_1 + 4X_2 + e$
- We include X1 and X2 in the regression model: y ~ X1 + X2 (without an intercept, for simplicity)
- For each preset correlation value, we repeat the simulation 1,000 times to generate the sampling distribution of the regression estimates.
- We plot the sampling distributions.
#Write a function to return the sampling distribution of regression coefficients
# based on different degree of correlation between X1 and X2
def simulation_multi_colinearity(cor):
    params = []
    for i in range(1000):
        means = [2,2]
        cov = [[1,cor], [cor,1]]
        X = np.random.multivariate_normal(means,cov,100)
        e = np.random.randn(100)*2
        y = 2*X[:,0] + 4*X[:,1] + e
        model = sm.OLS(y,X).fit()
        params.append(model.params)
    return params
sampling_dist_0 = simulation_multi_colinearity(0)
sampling_dist_0_3 = simulation_multi_colinearity(0.3)
sampling_dist_0_6 = simulation_multi_colinearity(0.6)
sampling_dist_0_9 = simulation_multi_colinearity(0.9)
sampling_dist_0_95 = simulation_multi_colinearity(0.95)
sampling_dist_0_99 = simulation_multi_colinearity(0.99)
print("SE of estimates when X1 and X2 have a cor of 0:", np.std(sampling_dist_0,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.3:", np.std(sampling_dist_0_3,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.6:", np.std(sampling_dist_0_6,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.9:", np.std(sampling_dist_0_9,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.95:", np.std(sampling_dist_0_95,axis=0))
print("SE of estimates when X1 and X2 have a cor of 0.99:", np.std(sampling_dist_0_99,axis=0))
SE of estimates when X1 and X2 have a cor of 0: [0.14924138 0.15362695]
SE of estimates when X1 and X2 have a cor of 0.3: [0.18184739 0.18149639]
SE of estimates when X1 and X2 have a cor of 0.6: [0.23542159 0.23302496]
SE of estimates when X1 and X2 have a cor of 0.9: [0.46005811 0.46522827]
SE of estimates when X1 and X2 have a cor of 0.95: [0.62053723 0.61689469]
SE of estimates when X1 and X2 have a cor of 0.99: [1.5058649 1.50165901]
print("Mean of estimates when X1 and X2 have a cor of 0:", np.mean(sampling_dist_0,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.3:", np.mean(sampling_dist_0_3,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.6:", np.mean(sampling_dist_0_6,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.9:", np.mean(sampling_dist_0_9,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.95:", np.mean(sampling_dist_0_95,axis=0))
print("Mean of estimates when X1 and X2 have a cor of 0.99:", np.mean(sampling_dist_0_99,axis=0))
Mean of estimates when X1 and X2 have a cor of 0: [1.99528853 4.00220908]
Mean of estimates when X1 and X2 have a cor of 0.3: [1.99374527 4.00339456]
Mean of estimates when X1 and X2 have a cor of 0.6: [2.01461324 3.98911849]
Mean of estimates when X1 and X2 have a cor of 0.9: [2.02487827 3.97569048]
Mean of estimates when X1 and X2 have a cor of 0.95: [2.04199266 3.95822687]
Mean of estimates when X1 and X2 have a cor of 0.99: [1.96516269 4.03487359]
We can observe that, across the different degrees of correlation between the two predictors, the regression coefficients remain unbiased, but the standard errors of the estimates are inflated according to the strength of the correlation.
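In practice this inflation is often quantified with the variance inflation factor (VIF); a minimal illustrative sketch (not part of the original notebook) using statsmodels on one simulated design with a correlation of 0.9:
from statsmodels.stats.outliers_influence import variance_inflation_factor
#One simulated design matrix with cor(X1, X2) = 0.9, plus a constant column
X_sim = sm.add_constant(np.random.multivariate_normal([2, 2], [[1, 0.9], [0.9, 1]], 100))
#VIF for each predictor column (column 0 is the constant, so skip it)
for j in range(1, X_sim.shape[1]):
    print(f"VIF for predictor {j}:", variance_inflation_factor(X_sim, j))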
Adding an irrelevant predictor (#3)
- We randomly generate 100 data points each for X1, X2, and X3 from normal distributions.
- We specify the true regression line: $y = 3 + 2X_1 + 4X_2 + e$
- We include X1, X2, and X3 in the regression model: y ~ X1 + X2 + X3 (with an intercept). X3 does not belong in the true model but is included anyway.
- We fit 1,000 such regression models and examine the bias.
estimates = []
for i in range(1000):
    X1 = np.random.randn(100)
    X2 = np.random.randn(100)
    X3 = np.random.randn(100)
    e = np.random.randn(100)
    y = 3 + 2*X1 + 4*X2 + e
    X = sm.add_constant(np.column_stack([X1,X2,X3]))
    model = sm.OLS(y, X).fit()
    estimates.append(model.params)
#The averages of estimates across the 1000 replications are:
print("Average of estimates:", np.mean(np.array(estimates),axis=0))
print("SE of estimates:", np.std(np.array(estimates),axis=0))
Average of estimates: [ 3.00132214e+00 1.99576168e+00 3.99886082e+00 -2.36659646e-03]
SE of estimates: [0.10146959 0.09980508 0.10321398 0.1018764 ]
We find that the estimates for X1 and X2 are unbiased and their standard errors are small. The estimate for X3 is essentially zero, and if we inspect one of the 1,000 fitted models, we find that this estimate is not statistically significant. In this sense, including an irrelevant random predictor in the model does little harm.
model.summary()
| Dep. Variable: | y | R-squared: | 0.962 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.961 |
| Method: | Least Squares | F-statistic: | 806.9 |
| Date: | Tue, 13 Feb 2024 | Prob (F-statistic): | 6.27e-68 |
| Time: | 09:50:44 | Log-Likelihood: | -138.60 |
| No. Observations: | 100 | AIC: | 285.2 |
| Df Residuals: | 96 | BIC: | 295.6 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | 3.0338 | 0.099 | 30.704 | 0.000 | 2.838 | 3.230 |
| x1 | 1.9591 | 0.094 | 20.781 | 0.000 | 1.772 | 2.146 |
| x2 | 3.9472 | 0.090 | 43.918 | 0.000 | 3.769 | 4.126 |
| x3 | 0.1564 | 0.099 | 1.580 | 0.117 | -0.040 | 0.353 |

| Omnibus: | 2.692 | Durbin-Watson: | 2.382 |
|---|---|---|---|
| Prob(Omnibus): | 0.260 | Jarque-Bera (JB): | 2.525 |
| Skew: | -0.387 | Prob(JB): | 0.283 |
| Kurtosis: | 2.920 | Cond. No. | 1.20 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
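If we wanted a more formal check that X3 adds nothing, we could compare the fit with and without it, for example via AIC; a minimal sketch reusing the y, X1, X2, and X3 arrays from the last simulated replication:
#Compare the over-specified model (with X3) to the correctly specified one (without X3);
#when X3 is irrelevant, the reduced model typically has a slightly lower AIC
X_full = sm.add_constant(np.column_stack([X1, X2, X3]))
X_reduced = sm.add_constant(np.column_stack([X1, X2]))
print("AIC with X3:   ", sm.OLS(y, X_full).fit().aic)
print("AIC without X3:", sm.OLS(y, X_reduced).fit().aic)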