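Machine Learning 06: Regression Algorithms

Summary

This article is part of a machine learning course series. It introduces regression algorithms in machine learning, covering linear regression, ridge regression, logistic regression, and related topics.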

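Course goals

Complete the full workflow of applying algorithms in a specific industry:

Understand the business + choose a suitable algorithm + data processing + algorithm training + algorithm tuning + algorithm ensembling
+ algorithm evaluation + continuous tuning + engineering the service interface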

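Definition of machine learning

Tom Michael Mitchell's definition of machine learning is widely quoted:
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.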

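Regression algorithms

Introduction to regression analysis

Regression analysis was first developed by Francis Galton in the late 19th century. In 1886 he published a paper titled "Regression towards Mediocrity in Hereditary Stature", which analyzed the relationship between the heights of parents and their children. He found that taller parents tend to have taller children and shorter parents tend to have shorter children, and he fitted this relationship as a linear one.
He also noticed an interesting pattern: the children of tall parents are usually a little shorter than their parents and closer to the average height, while the children of short parents are usually a little taller than their parents and also closer to the average. Galton chose the word "regression" for this and called the phenomenon "regression towards the mean".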
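
The effect Galton described can be reproduced with a small simulation. The sketch below is only an illustration: the mean height, spread, and parent-child correlation `r` are hypothetical values chosen for the demo, not Galton's data. When the correlation is below 1, the fitted slope is below 1, so children of tall parents are on average still taller than the mean but closer to it than their parents.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population parameters, chosen only for illustration
mean_height = 170.0   # cm
std_height = 7.0      # cm
r = 0.5               # assumed parent-child correlation

n = 10_000
parent = rng.normal(mean_height, std_height, n)
# Child height = mean + inherited share of the parent's deviation + independent noise.
# The noise scale keeps the children's overall spread equal to the parents'.
noise = rng.normal(0.0, std_height * np.sqrt(1 - r**2), n)
child = mean_height + r * (parent - mean_height) + noise

# Least-squares line: child = intercept + slope * parent
slope, intercept = np.polyfit(parent, child, 1)
print(f"fitted slope: {slope:.2f}")

# Parents more than one standard deviation above the mean
tall = parent > mean_height + std_height
print(f"tall parents' mean height:  {parent[tall].mean():.1f}")
print(f"their children's mean height: {child[tall].mean():.1f}")
```

A slope between 0 and 1 is exactly the "regression towards the mean" pattern: the relationship is positive, but deviations from the average shrink from one generation to the next.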

Linear regression
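
As a brief sketch of the standard formulation that the example in this section relies on (the notation is an assumption, chosen to match the names `theta`, `X_b`, and `y` used in the code): the model predicts a weighted sum of the features, the cost is the mean squared error, and setting its gradient to zero gives the closed-form solution.

```latex
% Model for sample i, with the constant feature x_0 = 1
\hat{y}^{(i)} = \theta_0 x_0^{(i)} + \theta_1 x_1^{(i)}, \qquad x_0^{(i)} = 1

% Matrix form over all m samples (X_b has one row per sample)
\hat{y} = X_b\,\theta

% Mean squared error cost
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left(\hat{y}^{(i)} - y^{(i)}\right)^2
          = \frac{1}{m}\,\lVert X_b\theta - y \rVert^2

% Setting the gradient to zero yields the normal equation
\nabla_\theta J(\theta) = \frac{2}{m}\,X_b^{\top}\left(X_b\theta - y\right) = 0
\;\Longrightarrow\;
\hat{\theta} = \left(X_b^{\top}X_b\right)^{-1}X_b^{\top}y
```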

Example:

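import numpy as np
import matplotlib.pyplot as plt

# Set the random seed so the results are reproducible
np.random.seed(100)
# Build a 100x1 matrix. np.random.rand draws from a uniform distribution on [0, 1),
# so X is uniform on [0, 2). This is the feature column x1 (training data).
X = 2 * np.random.rand(100, 1)
# Define the relationship between y and X. np.random.randn(100, 1) draws 100 values
# from a standard normal (Gaussian) distribution, which adds noise to each y.
y = 4 + 3 * X + np.random.randn(100, 1)
# Prepend a column of ones (x0) to X. X_b is a 100x2 matrix whose first column is all ones.
X_b = np.c_[np.ones((100, 1)), X]
# Solve for the optimal theta with the closed-form (normal equation) solution
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
# Create two new data points; these are two values of x1
X_new = np.array([[0], [2]])
# Add the x0 column: two ones
X_new_b = np.c_[np.ones((2, 1)), X_new]
# Multiply the new points X_new_b by the fitted theta to get the predictions y_hat
y_predict = X_new_b.dot(theta_best)
# Plot the prediction line; 'r-' draws a red line
plt.plot(X_new, y_predict, 'r-')
# Plot the known data X against the noisy y as blue dots
plt.plot(X, y, 'b.')
# Set the axis ranges: x from 0 to 2, y from 0 to 15
plt.axis([0, 2, 0, 15])
plt.show()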

Output: a plot of the data points (blue dots) with the fitted line (red).
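
The example above inverts `X_b.T.dot(X_b)` explicitly. As a hedged alternative sketch, NumPy's least-squares solver `np.linalg.lstsq` solves the same problem without forming the inverse, which tends to behave better when the features are nearly collinear. The data generation below repeats the example's setup, and the variable names `theta_inv` and `theta_lstsq` are introduced here only for the comparison.

```python
import numpy as np

# Same synthetic data as in the example: y = 4 + 3x + Gaussian noise
np.random.seed(100)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]

# Closed-form normal equation, as in the example
theta_inv = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y

# Least-squares solver; rcond=None selects NumPy's default cutoff for small singular values
theta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X_b, y, rcond=None)

print("normal equation:", theta_inv.ravel())
print("lstsq:          ", theta_lstsq.ravel())
```

Both approaches should agree to within floating-point error on this well-conditioned problem, and both estimates should land near the true values 4 and 3 used to generate the data.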

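from sklearn import datasets
from sklearn.linear_model import LinearRegression

# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this snippet requires an older scikit-learn version.
data = datasets.load_boston()
linear_model = LinearRegression()
linear_model.fit(data.data, data.target)
linear_model.coef_       # coefficients of the model's independent variables
linear_model.intercept_  # intercept of the model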

Output (the result image from the original is not reproduced here; the call also emits the following warning):

d:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:87: FutureWarning: Function load_boston is deprecated; load_boston is deprecated in 1.0 and will be removed in 1.2.

The Boston housing prices dataset has an ethical problem. You can refer to
the documentation of this function for further details.

The scikit-learn maintainers therefore strongly discourage the use of this
dataset unless the purpose of the code is to study and educate about
ethical issues in data science and machine learning.

In this special case, you can fetch the dataset from the original
source::

    import pandas as pd
    import numpy as np


    data_url = "http://lib.stat.cmu.edu/datasets/boston"
    raw_df = pd.read_csv(data_url, sep="\s+", skiprows=22, header=None)
    data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
    target = raw_df.values[1::2, 2]

Alternative datasets include the California housing dataset (i.e.
:func:`~sklearn.datasets.fetch_california_housing`) and the Ames housing
dataset. You can load the datasets as follows::

    from sklearn.datasets import fetch_california_housing
    housing = fetch_california_housing()

for the California housing dataset and::

    from sklearn.datasets import fetch_openml
    housing = fetch_openml(name="house_prices", as_frame=True)

for the Ames housing dataset.

warnings.warn(msg, category=FutureWarning)
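
Because `load_boston` is no longer available in scikit-learn 1.2 and later, the meaning of `coef_` and `intercept_` can also be checked on synthetic data where the true values are known. This is a minimal sketch, not the original experiment: `true_coef`, `true_intercept`, the sample size, and the noise level are all hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical ground truth used to generate the data
true_coef = np.array([1.5, -2.0, 0.5])
true_intercept = 4.0

# 200 samples, 3 features, small Gaussian noise on the target
X = rng.normal(size=(200, 3))
y = true_intercept + X @ true_coef + rng.normal(scale=0.1, size=200)

model = LinearRegression()
model.fit(X, y)

print("coef_:     ", model.coef_)       # one coefficient per feature
print("intercept_:", model.intercept_)  # the constant term
```

`coef_` holds one weight per input column, in column order, and `intercept_` holds the constant term, so the fitted values here should be close to the hypothetical ground truth.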

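from sklearn.metrics import mean_squared_error
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Note: load_boston requires a scikit-learn version older than 1.2
data = datasets.load_boston()
x = data.data
y = data.target
# Hold out part of the data as a test set (default split: 75% train, 25% test)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linear_model = LinearRegression()
# Fit on the training set only
linear_model.fit(x_train, y_train)
# Predict on the held-out test set
y_predict = linear_model.predict(x_test)
# Mean squared error on the test set (lower is better)
mean_squared_error(y_test, y_predict)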
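
The same train, predict, and evaluate workflow can be sketched on a dataset that ships with current scikit-learn releases. The version below swaps in `load_diabetes` purely as a stand-in (an assumption made so the code runs without `load_boston`; it is not the dataset used above) and also computes the mean squared error by hand, as the average of the squared differences between true and predicted values, to show what `mean_squared_error` returns.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in dataset bundled with scikit-learn (not the Boston data)
x, y = load_diabetes(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

# Fit on the training split, predict on the held-out split
model = LinearRegression().fit(x_train, y_train)
y_predict = model.predict(x_test)

# Library metric versus the definition written out with NumPy
mse_sklearn = mean_squared_error(y_test, y_predict)
mse_manual = np.mean((y_test - y_predict) ** 2)

print("sklearn MSE:", mse_sklearn)
print("manual MSE: ", mse_manual)
```

Because the error is measured on data the model never saw during fitting, it gives a fairer picture of predictive quality than the error on the training set would.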