Official API documentation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc_score#sklearn.metrics.roc_auc_score
For binary classification
The AUC is computed directly from the predicted values and the true labels.
Code:
# ---encoding:utf-8---
# @Time : 2023/6/6 17:41
# @Author : CBAiotAigc
# @Email :1050100468@qq.com
# @Site :
# @File : 癌症分类.py
# @Project : 机器学习
# @Software: PyCharm
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
def logistic_regression_二分类():
    # Load the breast cancer dataset
    data = pd.read_csv("./breast-cancer-wisconsin.csv")
    data.info()

    # Missing values are marked with "?"; replace them with NaN and drop those rows
    data = data.replace(to_replace="?", value=np.nan)
    data = data.dropna()

    # Features are all columns except the id column and the label column
    x = data.iloc[:, 1:-1].values
    y = data.iloc[:, -1].values

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, shuffle=True, stratify=y, random_state=22)

    # Standardize features: fit on the training set, reuse the same scaler on the test set
    transformer = StandardScaler()
    x_train = transformer.fit_transform(x_train)
    x_test = transformer.transform(x_test)

    estimator = LogisticRegression()
    estimator.fit(x_train, y_train)

    y_pred = estimator.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    # Binary case: pass the true labels and the predicted values
    print(roc_auc_score(y_test, y_pred))


if __name__ == '__main__':
    logistic_regression_二分类()
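Passing the hard class predictions works, but roc_auc_score is usually given a continuous score, such as the positive-class probability from predict_proba(...)[:, 1], which lets the metric evaluate every threshold. The following is a minimal, self-contained sketch (not part of the original script; it uses sklearn's built-in breast cancer dataset instead of the CSV file) that compares the two calls:

# Hedged sketch: binary AUC from hard predictions vs. from probabilities.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

x, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y, random_state=22)

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

estimator = LogisticRegression(max_iter=1000)
estimator.fit(x_train, y_train)

# AUC from hard 0/1 predictions: only a single threshold, so the score is coarser
print(roc_auc_score(y_test, estimator.predict(x_test)))

# AUC from the predicted probability of the positive class (column 1 of predict_proba)
print(roc_auc_score(y_test, estimator.predict_proba(x_test)[:, 1]))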
For multiclass classification
Unlike the binary case, which uses y_pred, here the probability scores y_pred_prob are passed instead.
Code:
# ---encoding:utf-8---
# @Time : 2023/6/6 17:41
# @Author : CBAiotAigc
# @Email :1050100468@qq.com
# @Site :
# @File : 癌症分类.py
# @Project : 机器学习
# @Software: PyCharm
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score
def logistic_regression_多分类():
    # Load the iris dataset (no header row in the CSV)
    data = pd.read_csv("./iris.csv", header=None)
    data.info()

    x = data.iloc[:, :-1].values
    data.columns = ["1", "2", "3", "4", "Class"]
    y = data[["Class"]]

    # Map each class name to an integer index, in order of first appearance
    def myapply(x):
        classify = x.unique().tolist()
        list_ = []
        for current in x:
            for idx, c in enumerate(classify):
                if current == c:
                    list_.append(idx)
        return list_

    y = y.apply(myapply)["Class"].values
    # print(y)

    x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, shuffle=True, stratify=y, random_state=22)

    # Standardize features: fit on the training set, reuse the same scaler on the test set
    transformer = StandardScaler()
    x_train = transformer.fit_transform(x_train)
    x_test = transformer.transform(x_test)

    estimator = LogisticRegression()
    estimator.fit(x_train, y_train)

    y_pred = estimator.predict(x_test)
    print(accuracy_score(y_test, y_pred))
    # Multiclass AUC: pass the per-class probability scores and set multi_class
    print(roc_auc_score(y_test, estimator.predict_proba(x_test), multi_class="ovr"))


if __name__ == '__main__':
    logistic_regression_多分类()
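The hand-written myapply helper above simply maps each class name to an integer index. A common alternative (my own suggestion, not what the original code uses) is sklearn's LabelEncoder; a tiny sketch with stand-in label strings:

# Hedged sketch: LabelEncoder as an alternative to the myapply helper above.
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Stand-in for the raw string labels read from the iris CSV
labels = ["setosa", "versicolor", "versicolor", "virginica", "setosa"]
y = encoder.fit_transform(labels)
print(y)                 # e.g. [0 1 1 2 0]
print(encoder.classes_)  # the original class names, in sorted order

One difference: LabelEncoder assigns indices by sorted class name, while myapply assigns them in order of first appearance; for computing accuracy or AUC this does not matter.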
Explanation of the multi_class parameter of roc_auc_score:
multi_class is the parameter used for multiclass problems. In binary classification, the classifier assigns each instance to one of two classes; in multiclass classification, it assigns each instance to one of several classes.
The multi_class parameter accepts three values:
- 'raise': if the labels contain more than two classes and multi_class has not been explicitly set to 'ovr' or 'ovo', roc_auc_score raises a ValueError.
- 'ovr': the One-vs-rest strategy. The multiclass problem is split into multiple binary sub-problems: for each class, the score measures how well that class is separated from all the other classes. For n classes, n such sub-problems are evaluated.
- 'ovo': the One-vs-one strategy. The AUC is computed for each pair of classes, and the final AUC is the average over all pairs. For n classes, this yields n*(n-1)/2 pairwise sub-problems.
When the number of classes is large, the ovo strategy becomes computationally expensive because it involves a large number of pairwise comparisons. The ovr strategy is usually cheaper than ovo, but it is also more sensitive to class imbalance and noisy data.
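As a rough illustration of the two settings, the sketch below (my own example, using sklearn's built-in iris dataset rather than the CSV above) computes both variants from the same probability matrix; on a balanced dataset like iris the two values are typically very close:

# Hedged sketch: comparing multi_class="ovr" and multi_class="ovo" on the iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

x, y = load_iris(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, stratify=y, random_state=22)

estimator = LogisticRegression(max_iter=1000)
estimator.fit(x_train, y_train)
y_prob = estimator.predict_proba(x_test)

# One-vs-rest: average of per-class "this class vs. the rest" AUCs
print(roc_auc_score(y_test, y_prob, multi_class="ovr"))
# One-vs-one: average of pairwise AUCs over all n*(n-1)/2 class pairs
print(roc_auc_score(y_test, y_prob, multi_class="ovo"))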