Data Analysis (Part 3): Implementing Linear Regression Models

1. Overview of Penalized Linear Regression

When linear regression is used in practice, ordinary least squares needs some modification: it minimizes error only on the training data, so it has trouble generalizing to data it has not seen.

Penalized linear regression is a family of methods for overcoming the overfitting problem of ordinary least squares (OLS). Ridge regression is one special case of penalized linear regression: it avoids overfitting by penalizing the sum of the squared regression coefficients. Other penalized regression algorithms use different forms of penalty term.
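To make the ridge penalty concrete, here is a minimal NumPy sketch (an illustration added here, not part of the original text) of the closed-form ridge estimate; X, y, and lam are placeholder names for a standardized attribute matrix, centered labels, and the penalty strength:

import numpy as np

def ridge_fit(X, y, lam):
    # closed-form ridge estimate: beta = (X^T X + lam * I)^{-1} X^T y;
    # lam = 0 recovers ordinary least squares
    ncols = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(ncols), X.T @ y)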

The following characteristics make penalized linear regression methods particularly effective:

--Model training is fast.

--They provide information about variable importance.

--Predictions at deployment time are fast.

--Performance is reliable across a wide range of problems, especially on attribute matrices where the number of rows is not much larger than the number of attributes, or on very sparse matrices; in such cases a sparse solution (a parsimonious model that predicts using only a subset of the attributes) is desirable.

--The problem may be well suited to a linear model.

Equation 4-6 can be described in words as follows: the vector β* and the constant β₀* are the values that minimize the expected mean squared prediction error, where that error is the average, over all data rows (i = 1, ..., n), of the squared error between yᵢ and the generated prediction ŷᵢ:

(β*, β₀*) = argmin over (β, β₀) of (1/n) · Σᵢ (yᵢ − (β₀ + xᵢ·β))²,  i = 1, ..., n
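The same objective as a minimal NumPy sketch (X, y, beta, and beta0 are placeholder names added for illustration):

import numpy as np

def mse_objective(X, y, beta, beta0):
    # Equation 4-6: average over all rows of the squared error between
    # y_i and the prediction beta0 + x_i . beta
    yHat = beta0 + X @ beta
    return np.mean((y - yHat) ** 2)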

The ridge penalty is not the only useful penalty for penalized regression. Any measure of the length of the coefficient vector will do, and different length measures change important properties of the solution. Ridge regression uses the Euclidean (L2) measure: the sum of the squared β values. Another useful algorithm is Lasso regression, which derives from the taxicab metric, also known as Manhattan distance or L1 regularization: the sum of the absolute values of the β entries. The ElasticNet penalty combines the Lasso penalty with the ridge penalty.
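The three penalty terms just described can be written out directly; this is a minimal NumPy sketch with placeholder names (beta is the coefficient vector, alpha the ElasticNet mixing weight):

import numpy as np

def ridge_penalty(beta):
    return np.sum(beta ** 2)     # L2: sum of squared coefficients

def lasso_penalty(beta):
    return np.sum(np.abs(beta))  # L1: sum of absolute coefficients

def elastic_net_penalty(beta, alpha):
    # alpha = 1.0 gives the pure Lasso penalty, alpha = 0.0 pure ridge
    return alpha * lasso_penalty(beta) + (1.0 - alpha) * ridge_penalty(beta)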

2. Solving the Penalized Linear Regression Problem

A large number of general-purpose numerical optimization algorithms can solve the optimization problems corresponding to Equations 4-6, 4-8, and 4-11, but the importance of penalized linear regression has led researchers to develop dedicated algorithms that generate solutions very quickly. This article introduces these algorithms and runs the corresponding code, focusing on two of them: Least Angle Regression (LARS) and Glmnet.

The LARS algorithm can be understood as an improved form of forward stepwise regression.

LARS is covered here because it is very close to both the Lasso and forward stepwise regression; it is easy to understand and relatively compact to implement. By studying the LARS code you will see how the more general ElasticNet regression is solved and come to understand the details of solving penalized regression problems.
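As a preview of the full listing below, here is a minimal sketch of the single step that the forward-stagewise loop repeats: compute residuals, find the attribute most correlated with them, and nudge that attribute's coefficient by a fixed amount in the direction of the correlation. The names X, y, beta, and stepSize are placeholders assuming standardized NumPy arrays:

import numpy as np

def stagewise_step(X, y, beta, stepSize):
    # residuals under the current coefficients
    residuals = y - X @ beta
    # correlation of each standardized attribute with the residuals
    corr = X.T @ residuals / len(y)
    # the attribute with the largest absolute correlation wins this step
    iStar = np.argmax(np.abs(corr))
    # fixed-size step, signed by the correlation
    beta[iStar] += stepSize * np.sign(corr[iStar])
    return beta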

3. Complete Code

from math import sqrt
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm


def x_normalized(xList, xMeans, xSD):
    # standardize each column of xList to zero mean and unit variance
    nrows = len(xList)
    ncols = len(xList[0])
    xNormalized = []
    for i in range(nrows):
        rowNormalized = [(xList[i][j] - xMeans[j]) / xSD[j] for j in range(ncols)]
        xNormalized.append(rowNormalized)
    return xNormalized


def data_normalized(wine):
    # standardize every column using the mean and standard deviation from
    # the module-level summary (wine.describe(): row 1 is the mean, row 2
    # the standard deviation)
    nrows, ncols = wine.shape
    wineNormalized = wine.copy()  # copy so the original frame is not modified in place
    for i in range(ncols):
        mean = summary.iloc[1, i]
        sd = summary.iloc[2, i]
        wineNormalized.iloc[:, i] = (wineNormalized.iloc[:, i] - mean) / sd
    return wineNormalized


def calculate_betaMat(nSteps, stepSize, wineNormalized):
    nrows, ncols = wineNormalized.shape
    # initialize a vector of coefficients beta
    beta = [0.0] * (ncols - 1)
    # initialize the matrix of betas recorded at each step
    betaMat = []
    betaMat.append(list(beta))
    # initialize the residuals list
    residuals = [0.0] * nrows
    for i in tqdm(range(nSteps)):
        # calculate residuals
        for j in range(nrows):
            residuals[j] = wineNormalized.iloc[j, (ncols - 1)]
            for k in range(ncols - 1):
                residuals[j] += - wineNormalized.iloc[j, k] * beta[k]

        # calculate the correlation between each attribute column of the
        # normalized data and the residuals
        corr = [0.0] * (ncols - 1)
        for j in range(ncols - 1):
            for k in range(nrows):
                corr[j] += wineNormalized.iloc[k, j] * residuals[k] / nrows

        iStar = 0
        corrStar = corr[0]
        for j in range(1, (ncols - 1)):
            if abs(corrStar) < abs(corr[j]):  # keep the attribute with the largest absolute correlation
                iStar = j
                corrStar = corr[j]
        beta[iStar] += stepSize * corrStar / abs(corrStar)  # fixed-size step, signed by the correlation
        betaMat.append(list(beta))
    return betaMat


def plot_betaMat1(betaMat):
    nSteps = len(betaMat)
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot the curve of beta values for each attribute across steps
        coefCurve = [betaMat[k][i] for k in range(nSteps)]
        plt.plot(coefCurve)

    plt.xlabel("Steps Taken")
    plt.ylabel("Coefficient Values")
    plt.show()


def plot_betaMat2(nSteps, betaMat):
    ncols = len(betaMat[0])
    for i in range(ncols):
        # plot range of beta values for each attribute
        coefCurve = [betaMat[k][i] for k in range(nSteps)]
        xaxis = range(nSteps)
        plt.plot(xaxis, coefCurve)

    plt.xlabel("Steps Taken")
    plt.ylabel("Coefficient Values")
    plt.show()


def S(z, gamma):
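    # Soft-thresholding operator used by the Glmnet coordinate-descent
    # update below: returns 0 when |z| <= gamma, otherwise shrinks z
    # toward zero by gamma while keeping its sign.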
    if gamma >= abs(z):
        return 0.0
    return (z / abs(z)) * (abs(z) - gamma)


if __name__ == '__main__':
    target_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    wine = pd.read_csv(target_url, header=0, sep=";")

    # normalize the wine data
    summary = wine.describe()
    print(summary)

    # standardize the data (zero mean, unit variance per column)
    wineNormalized = data_normalized(wine)
    # number of steps to take
    nSteps = 100
    stepSize = 0.1
    betaMat = calculate_betaMat(nSteps, stepSize, wineNormalized)
    plot_betaMat1(betaMat)
# ----------------------------larsWine---------------------------------------------------
    # read data into iterable
    names = wine.columns
    xList = []
    labels = []
    for i in range(len(wine)):
        row = wine.iloc[i]
        # put the label (last column) in a separate array
        labels.append(float(row.iloc[-1]))
        # convert the attribute part of the row to a list of floats
        floatRow = [float(num) for num in row.iloc[:-1]]
        xList.append(floatRow)
    # Normalize columns in x and labels
    nrows = len(xList)
    ncols = len(xList[0])
    # calculate means and standard deviations for each column
    xMeans = []
    xSD = []
    for i in range(ncols):
        col = [xList[j][i] for j in range(nrows)]
        mean = sum(col) / nrows
        xMeans.append(mean)
        colDiff = [(xList[j][i] - mean) for j in range(nrows)]
        sumSq = sum([d * d for d in colDiff])
        stdDev = sqrt(sumSq / nrows)
        xSD.append(stdDev)

    # use the computed means and standard deviations to normalize xList
    xNormalized = x_normalized(xList, xMeans, xSD)
    # Normalize labels to zero mean and unit variance as well
    meanLabel = sum(labels) / nrows
    sdLabel = sqrt(sum([(labels[i] - meanLabel) * (labels[i] - meanLabel) for i in range(nrows)]) / nrows)
    labelNormalized = [(labels[i] - meanLabel) / sdLabel for i in range(nrows)]

    # initialize a vector of coefficients beta
    beta = [0.0] * ncols
    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))
    # number of steps to take
    nSteps = 350
    stepSize = 0.004
    nzList = []
    for i in range(nSteps):
        # calculate residuals
        residuals = [0.0] * nrows
        for j in range(nrows):
            labelsHat = sum([xNormalized[j][k] * beta[k] for k in range(ncols)])
            residuals[j] = labelNormalized[j] - labelsHat  # residual for row j

        # calculate correlation between attribute columns from normalized wine and residual
        corr = [0.0] * ncols
        for j in range(ncols):
            corr[j] = sum([xNormalized[k][j] * residuals[k] for k in range(nrows)]) / nrows  # correlation of each attribute with the residuals

        iStar = 0
        corrStar = corr[0]
        for j in range(1, (ncols)):  # find the attribute that does most to reduce the residuals
            if abs(corrStar) < abs(corr[j]):  # i.e. the one with the largest absolute correlation
                iStar = j
                corrStar = corr[j]
        beta[iStar] += stepSize * corrStar / abs(corrStar)  # fixed increment: positive correlation gives a positive step, negative a negative one
        betaMat.append(list(beta))  # record the coefficient vector at this step

        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:  # record, in order, the indices of coefficients that have become non-zero
                nzList.append(q)
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    plot_betaMat2(nSteps, betaMat)  # plot the coefficient curves

# -------------------------------larsWine 10-fold cross-validation------------------------------------------------
    # Build cross-validation loop to determine best coefficient values.
    # number of cross validation folds
    nxval = 10
    # number of steps and step size
    nSteps = 350
    stepSize = 0.004
    # initialize list for storing errors.
    errors = []  # record the out-of-sample errors at each step
    for i in range(nSteps):
        b = []
        errors.append(b)

    for ixval in range(nxval):  # 10-fold cross-validation
        # Define test and training index sets
        idxTrain = [a for a in range(nrows) if a % nxval != ixval]
        idxTest = [a for a in range(nrows) if a % nxval == ixval]
        # Define test and training attribute and label sets
        xTrain = [xNormalized[r] for r in idxTrain]  # training set
        labelTrain = [labelNormalized[r] for r in idxTrain]
        xTest = [xNormalized[r] for r in idxTest]  # test set
        labelTest = [labelNormalized[r] for r in idxTest]

        # Train LARS regression on Training Data
        nrowsTrain = len(idxTrain)
        nrowsTest = len(idxTest)

        # initialize a vector of coefficients beta
        beta = [0.0] * ncols

        # initialize matrix of betas at each step
        betaMat = []
        betaMat.append(list(beta))
        for iStep in range(nSteps):
            # calculate residuals
            residuals = [0.0] * nrowsTrain
            for j in range(nrowsTrain):
                labelsHat = sum([xTrain[j][k] * beta[k] for k in range(ncols)])
                residuals[j] = labelTrain[j] - labelsHat
            # calculate correlation between attribute columns from normalized wine and residual
            corr = [0.0] * ncols
            for j in range(ncols):
                corr[j] = sum([xTrain[k][j] * residuals[k] for k in range(nrowsTrain)]) / nrowsTrain

            iStar = 0
            corrStar = corr[0]
            for j in range(1, (ncols)):
                if abs(corrStar) < abs(corr[j]):
                    iStar = j
                    corrStar = corr[j]
            beta[iStar] += stepSize * corrStar / abs(corrStar)
            betaMat.append(list(beta))

            # Use the beta just calculated to predict and accumulate out-of-sample error (the test rows are not used in fitting beta)
            for j in range(nrowsTest):
                labelsHat = sum([xTest[j][k] * beta[k] for k in range(ncols)])
                err = labelTest[j] - labelsHat
                errors[iStep].append(err)
    cvCurve = []
    for errVect in errors:
        mse = sum([x * x for x in errVect]) / len(errVect)
        cvCurve.append(mse)
    minMse = min(cvCurve)
    minPt = [i for i in range(len(cvCurve)) if cvCurve[i] == minMse][0]
    print("Minimum Mean Square Error", minMse)
    print("Index of Minimum Mean Square Error", minPt)

    xaxis = range(len(cvCurve))
    plt.plot(xaxis, cvCurve)
    plt.xlabel("Steps Taken")
    plt.ylabel("Mean Square Error")
    plt.show()
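    # A possible next step (a sketch, not in the original code): rerun the
    # LARS loop above on the full data set for minPt steps; the coefficient
    # vector recorded at that step is the cross-validation-selected model.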

    # -------------------------------glmnet larsWine2------------------------------------------------
    # select value for alpha parameter
    alpha = 1.0
    # make a pass through the data to determine value of lambda that
    # just suppresses all coefficients.
    # start with betas all equal to zero.
    xy = [0.0] * ncols
    for i in range(nrows):
        for j in range(ncols):
            xy[j] += xNormalized[i][j] * labelNormalized[i]

    maxXY = 0.0
    for i in range(ncols):
        val = abs(xy[i]) / nrows
        if val > maxXY:
            maxXY = val

    # calculate starting value for lambda
    lam = maxXY / alpha
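    # note (added): with beta all zero the residuals equal the labels, so
    # maxXY = max_j |x_j . y| / nrows is the largest coordinate-wise gradient;
    # any lam with lam * alpha >= maxXY leaves every coefficient at zero
    # after soft-thresholding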

    # this value of lambda corresponds to beta = list of 0's
    # initialize a vector of coefficients beta
    beta = [0.0] * ncols

    # initialize matrix of betas at each step
    betaMat = []
    betaMat.append(list(beta))

    # begin iteration
    nSteps = 100
    lamMult = 0.93  # 100 steps gives reduction by factor of 1000 in
    # lambda (recommended by authors)
    nzList = []
    for iStep in range(nSteps):
        # make lambda smaller so that some coefficient becomes non-zero
        lam = lam * lamMult

        deltaBeta = 100.0
        eps = 0.01
        iterStep = 0
        betaInner = list(beta)
        while deltaBeta > eps:
            iterStep += 1
            if iterStep > 100:
                break
            # cycle through attributes and update one-at-a-time
            # record starting value for comparison
            betaStart = list(betaInner)
            for iCol in range(ncols):
                xyj = 0.0
                for i in range(nrows):
                    # calculate residual with current value of beta
                    labelHat = sum([xNormalized[i][k] * betaInner[k] for k in range(ncols)])
                    residual = labelNormalized[i] - labelHat

                    xyj += xNormalized[i][iCol] * residual

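                # coordinate-descent update (added comment): uncBeta is the
                # single-coordinate least-squares estimate given the other
                # betas; soft-threshold it (the L1 part) and shrink by
                # 1 + lam * (1 - alpha) (the L2 part)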
                uncBeta = xyj / nrows + betaInner[iCol]
                betaInner[iCol] = S(uncBeta, lam * alpha) / (1 + lam * (1 - alpha))

            sumDiff = sum([abs(betaInner[n] - betaStart[n]) for n in range(ncols)])
            sumBeta = sum([abs(betaInner[n]) for n in range(ncols)])
            deltaBeta = sumDiff / sumBeta if sumBeta > 0.0 else 0.0  # guard: an all-zero beta counts as converged
        print(iStep, iterStep)  # lambda step and number of inner coordinate-descent sweeps
        beta = betaInner
        # add newly determined beta to list
        betaMat.append(list(beta))
        # keep track of the order in which the betas become non-zero
        nzBeta = [index for index in range(ncols) if beta[index] != 0.0]
        for q in nzBeta:
            if q not in nzList:
                nzList.append(q)
    # print out the ordered list of betas
    nameList = [names[nzList[i]] for i in range(len(nzList))]
    print(nameList)
    nPts = len(betaMat)
    plot_betaMat2(nPts, betaMat)  # plot the coefficient curves
