Predicting Potential Credit Card Customers with Machine Learning Models (XGBoost, LightGBM, and Random Forest)

With the growth of data science and machine learning, more and more companies are using these techniques to improve operational efficiency. In this post, I share how to use machine learning models to predict potential credit card customers. The project is based on code and files I put together, and covers the complete workflow: data preprocessing, data visualization, model training, prediction, and saving the results.

Project Overview

This project aims to use machine learning models to predict which customers are most likely to become credit card leads. We will use three main machine learning models: XGBoost, LightGBM, and Random Forest. The main steps are:

1. Data preprocessing
2. Data visualization
3. Model training
4. Model prediction
5. Model saving

1. Data Preprocessing

Data preprocessing is a crucial step in any machine learning project. Cleaning and preparing the data improves model performance and accuracy.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt 
import seaborn as sns
#Loading the dataset
df_train=pd.read_csv("dataset/train_s3TEQDk.csv")
df_train["source"]="train"
df_test=pd.read_csv("dataset/test_mSzZ8RL.csv")
df_test["source"]="test"
df=pd.concat([df_train,df_test],ignore_index=True)
df.head()
         ID  Gender  Age Region_Code     Occupation Channel_Code  Vintage Credit_Product  Avg_Account_Balance Is_Active  Is_Lead source
0  NNVBBKZB  Female   73       RG268          Other           X3       43             No              1045696        No      0.0  train
1  IDD62UNG  Female   30       RG277       Salaried           X1       32             No               581988        No      0.0  train
2  HD3DSEMC  Female   56       RG268  Self_Employed           X3       26             No              1484315       Yes      0.0  train
3  BF3NC7KV    Male   34       RG270       Salaried           X1       19             No               470454        No      0.0  train
4  TEASRWXV  Female   30       RG282       Salaried           X1       33             No               886787        No      0.0  train

Checking and cleaning the dataset:

#Checking columns of dataset
df.columns
Index(['ID', 'Gender', 'Age', 'Region_Code', 'Occupation', 'Channel_Code',
       'Vintage', 'Credit_Product', 'Avg_Account_Balance', 'Is_Active',
       'Is_Lead', 'source'],
      dtype='object')
#Checking shape 
df.shape
(351037, 12)
#Checking unique values 
df.nunique()
ID                     351037
Gender                      2
Age                        63
Region_Code                35
Occupation                  4
Channel_Code                4
Vintage                    66
Credit_Product              2
Avg_Account_Balance    162137
Is_Active                   2
Is_Lead                     2
source                      2
dtype: int64
#Check for Null Values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product          41847
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64

Observation:
Null values are present in the Credit_Product column. (The nulls in Is_Lead simply correspond to the test rows, where the target is unknown.)

#Fill null values in Credit_Product feature
df['Credit_Product']= df['Credit_Product'].fillna("NA")
#Again check for null values
df.isnull().sum()
ID                          0
Gender                      0
Age                         0
Region_Code                 0
Occupation                  0
Channel_Code                0
Vintage                     0
Credit_Product              0
Avg_Account_Balance         0
Is_Active                   0
Is_Lead                105312
source                      0
dtype: int64
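With the nulls replaced by the string "NA", missingness is preserved as its own category rather than discarded, which can itself carry signal. A quick optional check using plain pandas (exact counts depend on the data version):

#Optional: inspect Credit_Product categories, counting any remaining NaN explicitly
print(df['Credit_Product'].value_counts(dropna=False))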
#Checking Datatypes and info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 351037 entries, 0 to 351036
Data columns (total 12 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   ID                   351037 non-null  object 
 1   Gender               351037 non-null  object 
 2   Age                  351037 non-null  int64  
 3   Region_Code          351037 non-null  object 
 4   Occupation           351037 non-null  object 
 5   Channel_Code         351037 non-null  object 
 6   Vintage              351037 non-null  int64  
 7   Credit_Product       351037 non-null  object 
 8   Avg_Account_Balance  351037 non-null  int64  
 9   Is_Active            351037 non-null  object 
 10  Is_Lead              245725 non-null  float64
 11  source               351037 non-null  object 
dtypes: float64(1), int64(3), object(8)
memory usage: 32.1+ MB
#Changing Yes to 1 and No to 0 in the Is_Active column to convert the data into float

df["Is_Active"].replace(["Yes","No"],[1,0],inplace=True)

df['Is_Active'] = df['Is_Active'].astype(float)
df.head()
         ID  Gender  Age Region_Code     Occupation Channel_Code  Vintage Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead source
0  NNVBBKZB  Female   73       RG268          Other           X3       43             No              1045696        0.0      0.0  train
1  IDD62UNG  Female   30       RG277       Salaried           X1       32             No               581988        0.0      0.0  train
2  HD3DSEMC  Female   56       RG268  Self_Employed           X3       26             No              1484315        1.0      0.0  train
3  BF3NC7KV    Male   34       RG270       Salaried           X1       19             No               470454        0.0      0.0  train
4  TEASRWXV  Female   30       RG282       Salaried           X1       33             No               886787        0.0      0.0  train
#Now changing all categorical columns into numerical form using label encoding
cat_col=[ 'Gender', 'Region_Code', 'Occupation','Channel_Code', 'Credit_Product']

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for col in cat_col:
    df[col]= le.fit_transform(df[col])


df_2= df
df_2.head()
         ID  Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead source
0  NNVBBKZB       0   73           18           1             2       43               1              1045696        0.0      0.0  train
1  IDD62UNG       0   30           27           2             0       32               1               581988        0.0      0.0  train
2  HD3DSEMC       0   56           18           3             2       26               1              1484315        1.0      0.0  train
3  BF3NC7KV       1   34           20           2             0       19               1               470454        0.0      0.0  train
4  TEASRWXV       0   30           32           2             0       33               1               886787        0.0      0.0  train
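A caveat with the loop above: the single LabelEncoder is refitted on every column, so each fit overwrites the previous mapping and none can be inverted later. A small hypothetical refactor (the encoders dict is a new name, not part of the original run) that keeps one fitted encoder per column:

#Sketch: keep one fitted encoder per column so mappings can be reused or inverted
encoders = {}
for col in cat_col:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc
#e.g. encoders['Occupation'].inverse_transform(df['Occupation']) recovers the original labels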
#Separating the train and test sets
df_train=df_2.loc[df_2["source"]=="train"]
df_test=df_2.loc[df_2["source"]=="test"]
df_1 = df_train.copy()  # copy to avoid SettingWithCopyWarning when dropping columns
#We can drop these columns as they are identifiers/bookkeeping fields with no predictive value
df_1.drop(columns=['ID',"source"],inplace=True)
df_1.head()
   Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
0       0   73           18           1             2       43               1              1045696        0.0      0.0
1       0   30           27           2             0       32               1               581988        0.0      0.0
2       0   56           18           3             2       26               1              1484315        1.0      0.0
3       1   34           20           2             0       19               1               470454        0.0      0.0
4       0   30           32           2             0       33               1               886787        0.0      0.0

2. Data Visualization

Data visualization helps us better understand the distribution and characteristics of the data. Below are some commonly used plots:

import warnings
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize']  = (10,6)
plt.rcParams['font.size']  = 16
sns.set_style("whitegrid")

sns.distplot(df['Age']);

[Figure: distribution of Age]

sns.distplot(df['Avg_Account_Balance'])
plt.show()

[Figure: distribution of Avg_Account_Balance]
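Note that distplot is deprecated from seaborn 0.11 onward. If you are on a newer seaborn, histplot with kde=True is the closest replacement; a sketch under that assumption:

#Equivalent plots on seaborn >= 0.11
sns.histplot(df['Age'], kde=True)
plt.show()
sns.histplot(df['Avg_Account_Balance'], kde=True)
plt.show()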

#Countplot for Gender feature 
# plt.figure(figsize=(8,4))
sns.countplot(df['Gender'],palette='Accent')
plt.show()

[Figure: count of customers by Gender]

#Countplot for Target variable i.e 'Is_Lead'
target = 'Is_Lead'
# plt.figure(figsize=(8,4))
sns.countplot(df[target],palette='hls')
print(df[target].value_counts())
0.0    187437
1.0     58288
Name: Is_Lead, dtype: int64
plt.rcParams['figure.figsize']  = (12,6)
#Checking lead counts by customer occupation
# plt.figure(figsize=(8,4))
sns.countplot(x='Occupation',hue='Is_Lead',data=df,palette= 'magma')
plt.show()

[Figure: Occupation counts split by Is_Lead]

#Bar plot of mean customer age by occupation, split by activity in the last 3 months
# plt.figure(figsize=(8,4))
sns.catplot(y='Age',x='Occupation',hue='Is_Active',data=df,kind='bar',palette='Oranges')
plt.show()

[Figure: mean Age by Occupation, split by Is_Active]

3. Model Training

We will train three models: XGBoost, LightGBM, and Random Forest. First we balance the classes by undersampling and benchmark a few baseline algorithms. The training code follows:

# To balance the dataset, we will apply undersampling to the majority class
from sklearn.utils import resample
# separate the minority and majority classes
df_majority = df_1[df_1['Is_Lead']==0]
df_minority = df_1[df_1['Is_Lead']==1]

print(" The majority class values are", len(df_majority))
print(" The minority class values are", len(df_minority))
print(" The ratio of both classes are", len(df_majority)/len(df_minority))
 The majority class values are 187437
 The minority class values are 58288
 The ratio of both classes are 3.215704776283283
# undersample the majority class (replace=False draws without replacement, as undersampling requires)
df_majority_undersampled = resample(df_majority, replace=False, n_samples=len(df_minority), random_state=0)
# combine the minority class with the undersampled majority class
df_undersampled = pd.concat([df_minority, df_majority_undersampled])

df_undersampled['Is_Lead'].value_counts()
df_1=df_undersampled

# display new class value counts
print(" The undersamples class values count is:", len(df_undersampled))
print(" The ratio of both classes are", len(df_undersampled[df_undersampled["Is_Lead"]==0])/len(df_undersampled[df_undersampled["Is_Lead"]==1]))

 The undersamples class values count is: 116576
 The ratio of both classes are 1.0
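If the imbalanced-learn package is available, the same undersampling is a two-liner; a sketch assuming imblearn is installed (not used in the run below):

#Alternative: random undersampling via imbalanced-learn
from imblearn.under_sampling import RandomUnderSampler
df_orig = pd.concat([df_majority, df_minority])  # the pre-undersampling training frame
rus = RandomUnderSampler(random_state=0)
x_res, y_res = rus.fit_resample(df_orig.drop(columns=['Is_Lead']), df_orig['Is_Lead'])
print(y_res.value_counts())  # both classes now at the minority count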
# Drop the target variable from the features and
# assign y for the training and testing phase
xc = df_1.drop(columns=['Is_Lead'])
yc = df_1[["Is_Lead"]]
df_1.head()
    Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
6        1   62           32           1             2       20               0              1056750        1.0      1.0
15       1   33           18           3             1       69               0               517063        1.0      1.0
16       0   46           18           1             2       97               2              2282502        0.0      1.0
17       0   59           33           1             2       15               2              2384692        0.0      1.0
20       1   44           19           3             1       19               2              1001650        0.0      1.0
#Importing necessary libraries
from sklearn import metrics
from scipy.stats import zscore
from sklearn.preprocessing import LabelEncoder,StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
from sklearn.metrics import plot_roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier,GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

#Import warnings
import warnings
warnings.filterwarnings('ignore')
#Standardizing x with StandardScaler so each feature has zero mean and unit variance
sc = StandardScaler()
df_xc = pd.DataFrame(sc.fit_transform(xc),columns=xc.columns)
df_xc.head()
     Gender       Age  Region_Code  Occupation  Channel_Code   Vintage  Credit_Product  Avg_Account_Balance  Is_Active
0  0.871922  1.102987     1.080645   -1.254310      1.098078 -0.961192       -1.495172            -0.118958   1.192880
1  0.871922 -0.895316    -0.208525    1.008933     -0.052828  0.484610       -1.495172            -0.748273   1.192880
2 -1.146891  0.000475    -0.208525   -1.254310      1.098078  1.310784        1.220096             1.310361  -0.838307
3 -1.146891  0.896266     1.172729   -1.254310      1.098078 -1.108723        1.220096             1.429522  -0.838307
4  0.871922 -0.137339    -0.116441    1.008933     -0.052828 -0.990699        1.220096            -0.183209  -0.838307
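One subtlety: the scaler above is fitted on the full balanced dataset before any train/test split, so the hold-out rows contribute to the scaling statistics. A stricter, leakage-free variant (a sketch, not the code used below) fits the scaler on the training split only:

#Leakage-free scaling sketch: fit on the training split, transform both splits
xc_tr, xc_te, yc_tr, yc_te = train_test_split(xc, yc, test_size=0.2, random_state=42, stratify=yc)
sc_strict = StandardScaler().fit(xc_tr)
xc_tr_scaled = sc_strict.transform(xc_tr)
xc_te_scaled = sc_strict.transform(xc_te)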
#defining a function to fit a model and report hold-out ROC AUC plus cross-validated accuracy

def max_accuracy_scr(names,model_c,df_xc,yc):
    # hold-out evaluation on a stratified 80/20 split
    train_xc,test_xc,train_yc,test_yc = train_test_split(df_xc,yc,random_state = 42,test_size = 0.2,stratify = yc)
    model_c.fit(train_xc,train_yc)
    pred = model_c.predict_proba(test_xc)[:, 1]
    roc_score = roc_auc_score(test_yc, pred)
    # 5-fold cross-validated accuracy, computed once and reused
    cross_val = cross_val_score(model_c,df_xc,yc,cv=5,scoring="accuracy")
    mean_acc = cross_val.mean()
    std_dev = cross_val.std()
    print("*"*50)
    print("Results for model : ",names,'\n',
          "max roc score correspond to random state " ,roc_score ,'\n',
          "Mean accuracy score is : ",mean_acc,'\n',
          "Std deviation score is : ",std_dev,'\n',
          "Cross validation scores are :  " ,cross_val) 
    print(f"roc_auc_score: {roc_score}")
    print("*"*50)
#Now we compare multiple algorithms to find which one performs best on our dataset
models=[]
models.append(('Logistic Regression', LogisticRegression()))
models.append(('Random Forest', RandomForestClassifier()))
models.append(('Decision Tree Classifier',DecisionTreeClassifier()))
models.append(("GausianNB",GaussianNB()))

for names,model_c in models:
    max_accuracy_scr(names,model_c,df_xc,yc)

**************************************************
Results for model :  Logistic Regression 
 max roc score correspond to random state  0.727315712597147 
 Mean accuracy score is :  0.6696918411779096 
 Std deviation score is :  0.0030322593046897828 
 Cross validation scores are :   [0.67361469 0.66566588 0.66703839 0.67239974 0.66974051]
roc_auc_score: 0.727315712597147
**************************************************
**************************************************
Results for model :  Random Forest 
 max roc score correspond to random state  0.8792762631904103 
 Mean accuracy score is :  0.8117279862602139 
 Std deviation score is :  0.002031698139189051 
 Cross validation scores are :   [0.81043061 0.81162342 0.81158053 0.81115162 0.81616985]
roc_auc_score: 0.8792762631904103
**************************************************
**************************************************
Results for model :  Decision Tree Classifier 
 max roc score correspond to random state  0.7397495282209642 
 Mean accuracy score is :  0.7426399792028343 
 Std deviation score is :  0.0025271129138200485 
 Cross validation scores are :   [0.74288043 0.74162556 0.74149689 0.73870899 0.74462792]
roc_auc_score: 0.7397495282209642
**************************************************
**************************************************
Results for model :  GaussianNB 
 max roc score correspond to random state  0.7956111563031266 
 Mean accuracy score is :  0.7158677336619202 
 Std deviation score is :  0.0015884106712636206 
 Cross validation scores are :   [0.71894836 0.71550504 0.71546215 0.71443277 0.71499035]
roc_auc_score: 0.7956111563031266
**************************************************

First Attempt: Random Forest Classifier

# Estimating the best n_estimators via grid search for the Random Forest classifier
parameters={"n_estimators":[1,10,100]}
rf_clf=RandomForestClassifier()
clf = GridSearchCV(rf_clf, parameters, cv=5,scoring="roc_auc")
clf.fit(df_xc,yc)
print("Best parameter : ",clf.best_params_,"\nBest Estimator : ", clf.best_estimator_,"\nBest Score : ", clf.best_score_)
Best parameter :  {'n_estimators': 100} 
Best Estimator :  RandomForestClassifier() 
Best Score :  0.8810508979668068
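The grid above only varies n_estimators. The same pattern extends to other Random Forest hyperparameters if the compute budget allows; a hypothetical wider grid (runtime grows with the product of the option counts):

#Sketch of a wider hyperparameter grid
parameters = {"n_estimators": [100, 200],
              "max_depth": [None, 10, 20],
              "min_samples_split": [2, 5]}
clf = GridSearchCV(RandomForestClassifier(random_state=42), parameters, cv=5, scoring="roc_auc")
#clf.fit(df_xc, yc)  # uncomment to run the full search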
#Again running the Random Forest classifier with n_estimators = 100
rf_clf=RandomForestClassifier(n_estimators=100,random_state=42)
max_accuracy_scr("RandomForest Classifier",rf_clf,df_xc,yc)
**************************************************
Results for model :  RandomForest Classifier 
 max roc score correspond to random state  0.879415808805665 
 Mean accuracy score is :  0.8115392510996895 
 Std deviation score is :  0.0008997445291505284 
 Cross validation scores are :   [0.81180305 0.81136607 0.81106584 0.81037958 0.81308171]
roc_auc_score: 0.879415808805665
**************************************************
xc_train,xc_test,yc_train,yc_test=train_test_split(df_xc, yc,random_state = 80,test_size=0.20,stratify=yc)
rf_clf.fit(xc_train,yc_train)
yc_pred=rf_clf.predict(xc_test)
plt.rcParams['figure.figsize']  = (12,8)
#  Random Forest Classifier Results

pred_pb=rf_clf.predict_proba(xc_test)[:,1]
Fpr,Tpr,thresholds = roc_curve(yc_test,pred_pb,pos_label=True)
auc = roc_auc_score(yc_test,pred_pb)

print(" ROC_AUC score is ",auc)
print("accuracy score is : ",accuracy_score(yc_test,yc_pred))
print("Precision is : " ,precision_score(yc_test, yc_pred))
print("Recall is: " ,recall_score(yc_test, yc_pred))
print("F1 Score is : " ,f1_score(yc_test, yc_pred))
print("classification report \n",classification_report(yc_test,yc_pred))

#Plotting confusion matrix
cnf = confusion_matrix(yc_test,yc_pred)
sns.heatmap(cnf, annot=True, cmap = "magma")
 ROC_AUC score is  0.8804566893762799
accuracy score is :  0.8127466117687425
Precision is :  0.8397949673811743
Recall is:  0.7729456167438669
F1 Score is :  0.8049848132928354
classification report 
               precision    recall  f1-score   support

         0.0       0.79      0.85      0.82     11658
         1.0       0.84      0.77      0.80     11658

    accuracy                           0.81     23316
   macro avg       0.81      0.81      0.81     23316
weighted avg       0.81      0.81      0.81     23316

<AxesSubplot:>

[Figure: confusion matrix heatmap for the Random Forest classifier]

plt.rcParams['figure.figsize']  = (12,6)
#plotting the ROC curve; the dashed diagonal is the random-classifier baseline
plt.plot([0,1],[0,1],'g--')
plt.plot(Fpr,Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("Random Forest Classifier")
plt.show()

Second Attempt: XGBoost Classifier

from sklearn.utils import class_weight

# Compute balanced class weights for the two target classes
classes_weights = class_weight.compute_class_weight('balanced',
                                                    classes=np.unique(yc_train["Is_Lead"]),
                                                    y=yc_train["Is_Lead"])

# Map each training sample to its class weight (labels are 0.0 and 1.0, and
# np.unique sorts them, so int(val) indexes the matching weight)
weights = np.ones(yc_train.shape[0], dtype='float')
for i, val in enumerate(yc_train["Is_Lead"]):
    weights[i] = classes_weights[int(val)]

# Per-sample weights are the supported way to re-weight classes in XGBoost:
# xgb_model.fit(xc_train, yc_train, sample_weight=weights)
#Trying XGBoost
import xgboost as xg
from xgboost import XGBClassifier

# Note: XGBClassifier has no class_weight parameter, so XGBoost will warn that
# it is unused (see the log below); the undersampled data is already balanced.
clf2 = xg.XGBClassifier(class_weight='balanced').fit(xc_train, yc_train)
xg_pred = clf2.predict(xc_test)
[23:35:16] WARNING: /private/var/folders/fc/8d9mxh2s4ssd8k64mkmlsrj00000gn/T/pip-req-build-y40nwdrb/build/temp.macosx-10.9-x86_64-3.8/xgboost/src/learner.cc:576: 
Parameters: { "class_weight" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[23:35:16] WARNING: /private/var/folders/fc/8d9mxh2s4ssd8k64mkmlsrj00000gn/T/pip-req-build-y40nwdrb/build/temp.macosx-10.9-x86_64-3.8/xgboost/src/learner.cc:1100: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
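As the warning says, class_weight is silently ignored. XGBoost's supported alternatives are per-sample weights (sketched above) or scale_pos_weight, the negative-to-positive count ratio. A sketch; on our undersampled data the ratio is ~1, so it changes little here:

#Sketch: native XGBoost class re-weighting via scale_pos_weight
neg = (yc_train["Is_Lead"] == 0).sum()
pos = (yc_train["Is_Lead"] == 1).sum()
clf2_weighted = XGBClassifier(scale_pos_weight=neg/pos, eval_metric="logloss")
#clf2_weighted.fit(xc_train, yc_train)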
plt.rcParams['figure.figsize']  = (12,8)
#XGBoost results
xg_pred_2=clf2.predict_proba(xc_test)[:,1]
Fpr,Tpr,thresholds = roc_curve(yc_test,xg_pred_2,pos_label=True)
auc = roc_auc_score(yc_test,xg_pred_2)

print(" ROC_AUC score is ",auc)
print("accuracy score is : ",accuracy_score(yc_test,xg_pred))
print("Precision is : " ,precision_score(yc_test, xg_pred))
print("Recall is: " ,recall_score(yc_test, xg_pred))
print("F1 Score is : " ,f1_score(yc_test, xg_pred))
print("classification report \n",classification_report(yc_test,xg_pred))

cnf = confusion_matrix(yc_test,xg_pred)
sns.heatmap(cnf, annot=True, cmap = "magma")
 ROC_AUC score is  0.8706238059470456
accuracy score is :  0.8033968090581575
Precision is :  0.8246741325500275
Recall is:  0.7706296105678504
F1 Score is :  0.7967364313586378
classification report 
               precision    recall  f1-score   support

         0.0       0.78      0.84      0.81     11658
         1.0       0.82      0.77      0.80     11658

    accuracy                           0.80     23316
   macro avg       0.80      0.80      0.80     23316
weighted avg       0.80      0.80      0.80     23316

<AxesSubplot:>

[Figure: confusion matrix heatmap for the XGBoost classifier]

plt.rcParams['figure.figsize']  = (12,6)
#plotting the ROC curve; the dashed diagonal is the random-classifier baseline
plt.plot([0,1],[0,1],'g--')
plt.plot(Fpr,Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("XG_Boost Classifier")
plt.show()

[Figure: ROC curve for the XGBoost classifier]

Third Attempt: LightGBM with Stratified Folds

#Trying stratified K-fold modeling
from sklearn.model_selection import KFold, StratifiedKFold

def cross_val(xc, yc, model, params, folds=10):

    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    for fold, (train_idx, test_idx) in enumerate(skf.split(xc, yc)):
        print(f"Fold: {fold}")
        xc_train, yc_train = xc.iloc[train_idx], yc.iloc[train_idx]
        xc_test, yc_test = xc.iloc[test_idx], yc.iloc[test_idx]

        model_c= model(**params)
        model_c.fit(xc_train, yc_train,eval_set=[(xc_test, yc_test)],early_stopping_rounds=100, verbose=300)

        pred_y = model_c.predict_proba(xc_test)[:, 1]
        roc_score = roc_auc_score(yc_test, pred_y)
        print(f"roc_auc_score: {roc_score}")
        print("-"*50)
    
    # note: only the model trained on the final fold is returned
    return model_c
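As noted in the function, only the final fold's model is returned. A common variant keeps every fold's model and averages their predicted probabilities at inference time; a sketch (early stopping omitted for brevity; x_new stands for any new feature matrix):

#Sketch: keep all fold models and average their probabilities
def cross_val_ensemble(xc, yc, model, params, folds=10):
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
    fold_models = []
    for train_idx, _ in skf.split(xc, yc):
        m = model(**params)
        m.fit(xc.iloc[train_idx], yc.iloc[train_idx])
        fold_models.append(m)
    return fold_models
#avg_pred = np.mean([m.predict_proba(x_new)[:, 1] for m in fold_models], axis=0)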
#Applying LGBM Model with 10 stratified cross-folds
from lightgbm import LGBMClassifier

lgb_params = {'learning_rate': 0.045, 'n_estimators': 10000, 'max_bin': 84,
              'num_leaves': 10, 'max_depth': 20, 'reg_alpha': 8.457,
              'reg_lambda': 6.853, 'subsample': 0.749}
lgb_model = cross_val(xc, yc, LGBMClassifier, lgb_params)
Fold: 0
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.433821
[600]	valid_0's binary_logloss: 0.433498
Early stopping, best iteration is:
[599]	valid_0's binary_logloss: 0.433487
roc_auc_score: 0.8748638095718249
--------------------------------------------------
Fold: 1
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.434881
[600]	valid_0's binary_logloss: 0.43445
Early stopping, best iteration is:
[569]	valid_0's binary_logloss: 0.43442
roc_auc_score: 0.8755631159104413
--------------------------------------------------
Fold: 2
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.431872
[600]	valid_0's binary_logloss: 0.43125
[900]	valid_0's binary_logloss: 0.430984
Early stopping, best iteration is:
[1013]	valid_0's binary_logloss: 0.430841
roc_auc_score: 0.877077541404848
--------------------------------------------------
Fold: 3
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.442048
[600]	valid_0's binary_logloss: 0.44142
[900]	valid_0's binary_logloss: 0.441142
Early stopping, best iteration is:
[895]	valid_0's binary_logloss: 0.44114
roc_auc_score: 0.8721270953106521
--------------------------------------------------
Fold: 4
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.439466
[600]	valid_0's binary_logloss: 0.438899
Early stopping, best iteration is:
[782]	valid_0's binary_logloss: 0.438824
roc_auc_score: 0.8709229804739002
--------------------------------------------------
Fold: 5
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.427545
Early stopping, best iteration is:
[445]	valid_0's binary_logloss: 0.42739
roc_auc_score: 0.8792290845510382
--------------------------------------------------
Fold: 6
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.440554
[600]	valid_0's binary_logloss: 0.439762
[900]	valid_0's binary_logloss: 0.439505
[1200]	valid_0's binary_logloss: 0.439264
Early stopping, best iteration is:
[1247]	valid_0's binary_logloss: 0.439142
roc_auc_score: 0.872610593872283
--------------------------------------------------
Fold: 7
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.423764
Early stopping, best iteration is:
[414]	valid_0's binary_logloss: 0.423534
roc_auc_score: 0.8806521642373888
--------------------------------------------------
Fold: 8
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.440673
Early stopping, best iteration is:
[409]	valid_0's binary_logloss: 0.440262
roc_auc_score: 0.8708570312002339
--------------------------------------------------
Fold: 9
Training until validation scores don't improve for 100 rounds
[300]	valid_0's binary_logloss: 0.441536
[600]	valid_0's binary_logloss: 0.441034
Early stopping, best iteration is:
[661]	valid_0's binary_logloss: 0.440952
roc_auc_score: 0.8713195377336685
--------------------------------------------------
#LGBM results
# Refit the (last-fold) LGBM model on the training split, then evaluate; the
# probabilities must come from lgb_model, not the XGBoost model clf2
lgb_model.fit(xc_train,yc_train)
lgb_pred_2=lgb_model.predict_proba(xc_test)[:,1]
Fpr,Tpr,thresholds = roc_curve(yc_test,lgb_pred_2,pos_label=True)
auc = roc_auc_score(yc_test,lgb_pred_2)

print(" ROC_AUC score is ",auc)
lgb_pred=lgb_model.predict(xc_test)
print("accuracy score is : ",accuracy_score(yc_test,lgb_pred))
print("Precision is : " ,precision_score(yc_test, lgb_pred))
print("Recall is: " ,recall_score(yc_test, lgb_pred))
print("F1 Score is : " ,f1_score(yc_test, lgb_pred))
print("classification report \n",classification_report(yc_test,lgb_pred))

cnf = confusion_matrix(yc_test,lgb_pred)
sns.heatmap(cnf, annot=True, cmap = "magma")
 ROC_AUC score is  0.8706238059470456
accuracy score is :  0.8030965860353405
Precision is :  0.8258784469242829
Recall is:  0.7681420483787956
F1 Score is :  0.7959646237944981
classification report 
               precision    recall  f1-score   support

         0.0       0.78      0.84      0.81     11658
         1.0       0.83      0.77      0.80     11658

    accuracy                           0.80     23316
   macro avg       0.80      0.80      0.80     23316
weighted avg       0.80      0.80      0.80     23316

<AxesSubplot:>

[Figure: confusion matrix heatmap for the LightGBM classifier]

plt.rcParams['figure.figsize']  = (12,6)
#plotting the ROC curve; the dashed diagonal is the random-classifier baseline
plt.plot([0,1],[0,1],'g--')
plt.plot(Fpr,Tpr)
plt.xlabel('False_Positive_Rate')
plt.ylabel('True_Positive_Rate')
plt.title("LGB Classifier model")
plt.show()

[Figure: ROC curve for the LightGBM classifier]

4. Model Prediction

Once training is complete, we use the test data to generate predictions:

#Drop the source column, which was only used to tag train vs. test rows
df_3 = df_test
df_3.drop(columns=["source"],inplace=True)
df_3.head()
              ID  Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
245725  VBENBARO       1   29            4           1             0       25               2               742366        0.0      NaN
245726  CCMEWNKY       1   43           18           1             1       49               0               925537        0.0      NaN
245727  VK3KGA9M       1   31           20           2             0       14               1               215949        0.0      NaN
245728  TT8RPZVC       1   29           22           1             0       33               1               868070        0.0      NaN
245729  SHQZEYTZ       0   29           20           1             0       19               1               657087        0.0      NaN
# Drop the target variable and the ID to build the prediction feature matrix
xc_pred = df_3.drop(columns=['Is_Lead',"ID"])

#Standardizing x with StandardScaler; note that this refits the scaler on the
#test features, whereas reusing the scaler fitted on the training data would
#guarantee an identical transform
sc = StandardScaler()
df_xc_pred = pd.DataFrame(sc.fit_transform(xc_pred),columns=xc_pred.columns)
lead_pred_xg=clf2.predict_proba(df_xc_pred)[:,1]
lead_pred_lgb=lgb_model.predict_proba(df_xc_pred)[:,1]
lead_pred_rf=rf_clf.predict_proba(df_xc_pred)[:,1]
print(lead_pred_xg, lead_pred_lgb, lead_pred_rf)
[0.09673516 0.9428428  0.12728807 ... 0.31698707 0.1821623  0.17593904] [0.14278614 0.94357392 0.13603912 ... 0.22251432 0.24186564 0.16873483] [0.17 0.97 0.09 ... 0.5  0.09 0.15]
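Since the three arrays hold probabilities for the same rows, blending them is one line. A simple unweighted average often beats any single model, though equal weights are an assumption here, not something tuned:

#Optional sketch: blend the three models' probabilities
lead_pred_blend = (lead_pred_xg + lead_pred_lgb + lead_pred_rf) / 3
print(lead_pred_blend[:5])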
#Dataframe for lead prediction
lead_pred_lgb= pd.DataFrame(lead_pred_lgb,columns=["Is_Lead"])
lead_pred_xg= pd.DataFrame(lead_pred_xg,columns=["Is_Lead"])
lead_pred_rf= pd.DataFrame(lead_pred_rf,columns=["Is_Lead"])
df_test = df_test.reset_index()
df_test.head()
    index        ID  Gender  Age  Region_Code  Occupation  Channel_Code  Vintage  Credit_Product  Avg_Account_Balance  Is_Active  Is_Lead
0  245725  VBENBARO       1   29            4           1             0       25               2               742366        0.0      NaN
1  245726  CCMEWNKY       1   43           18           1             1       49               0               925537        0.0      NaN
2  245727  VK3KGA9M       1   31           20           2             0       14               1               215949        0.0      NaN
3  245728  TT8RPZVC       1   29           22           1             0       33               1               868070        0.0      NaN
4  245729  SHQZEYTZ       0   29           20           1             0       19               1               657087        0.0      NaN
#Saving ID and predictions to a CSV file for the XGBoost model
df_pred_xg=pd.concat([df_test["ID"],lead_pred_xg],axis=1,ignore_index=True)
df_pred_xg.columns = ["ID","Is_Lead"]
print(df_pred_xg.head())
df_pred_xg.to_csv("Credit_Card_Lead_Predictions_final_xg.csv",index=False)

#Saving ID and predictions to a CSV file for the LightGBM model
df_pred_lgb=pd.concat([df_test["ID"],lead_pred_lgb],axis=1,ignore_index=True)
df_pred_lgb.columns = ["ID","Is_Lead"]
print(df_pred_lgb.head())
df_pred_lgb.to_csv("Credit_Card_Lead_Predictions_final_lgb.csv",index=False)

#Saving ID and predictions to a CSV file for the Random Forest model
df_pred_rf=pd.concat([df_test["ID"],lead_pred_rf],axis=1,ignore_index=True)
df_pred_rf.columns = ["ID","Is_Lead"]
print(df_pred_rf.head())
df_pred_rf.to_csv("Credit_Card_Lead_Predictions_final_rf.csv",index=False)
         ID   Is_Lead
0  VBENBARO  0.096735
1  CCMEWNKY  0.942843
2  VK3KGA9M  0.127288
3  TT8RPZVC  0.052260
4  SHQZEYTZ  0.057762
         ID   Is_Lead
0  VBENBARO  0.142786
1  CCMEWNKY  0.943574
2  VK3KGA9M  0.136039
3  TT8RPZVC  0.084144
4  SHQZEYTZ  0.055887
         ID  Is_Lead
0  VBENBARO     0.17
1  CCMEWNKY     0.97
2  VK3KGA9M     0.09
3  TT8RPZVC     0.12
4  SHQZEYTZ     0.09

5. Model Saving

To make it easy to load and use the trained model in the future, we save it as a pickle file:

import joblib
# Save the model as a pickle file
joblib.dump(lgb_model,'lgb_model.pkl')
['lgb_model.pkl']
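Loading the model back later is symmetric; a sketch of the intended reuse:

#Restore the saved model and score new data
lgb_model_loaded = joblib.load('lgb_model.pkl')
#preds = lgb_model_loaded.predict_proba(df_xc_pred)[:, 1]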

