[PaddlePaddle Leading Group AI Creator Camp] Predicting Whether a Borrower Will Repay on Time with an LGBM Model


This article presents a project that uses an LGBM model to predict whether borrowers will repay on time. Using 40,000 training samples and 15,000 test samples, the pipeline covers data loading, preprocessing, and feature engineering, constructing features such as combinations of loan amount and contribution amounts. LightGBM is chosen as the algorithm; after several rounds of parameter tuning and stratified K-fold cross-validation, the final model achieves a high mean AUC and can help banks assess borrower credit.


Predicting Whether a Borrower Will Repay on Time with an LGBM Model

A lightweight LGBM model learns directly from the tabular data; with well-engineered features distilled from it, the model can judge whether a person will repay on time.

I. Project Background

1. Replace manual judgment about whether to grant someone a loan
2. Reduce human involvement in traditional lending and the room for backroom deals
3. Help banks identify customers with better credit

II. Dataset Overview

Dataset size: the training set covers 40,000 depositors and the test set 15,000, each with the depositor's basic information, housing-fund contribution records, and loan records.

       

1. Data Loading and Preprocessing

# Rows with a label are the training set; rows whose label is NaN are the test set
train_df = df[df['label'].isna() == False].reset_index(drop=True)
test_df = df[df['label'].isna() == True].reset_index(drop=True)
display(train_df.shape, test_df.shape)

       

Training set shape: (40000, 1093); test set shape: (15000, 1093)

2. Exploring the Dataset

def get_daikuanYE(df,col):
    df[col + '_genFeat1'] = (df[col] > 100000).astype(int)
    df[col + '_genFeat2'] = (df[col] > 120000).astype(int)
    df[col + '_genFeat3'] = (df[col] > 140000).astype(int)
    df[col + '_genFeat4'] = (df[col] > 180000).astype(int)
    df[col + '_genFeat5'] = (df[col] > 220000).astype(int)
    df[col + '_genFeat6'] = (df[col] > 260000).astype(int)
    df[col + '_genFeat7'] = (df[col] > 300000).astype(int)
    return df, [col + f'_genFeat{i}' for i in range(1, 8)]

df, genFeats2 = get_daikuanYE(df, col = 'DKYE')
df, genFeats3 = get_daikuanYE(df, col = 'DKFFE')


plt.figure(figsize = (8, 2))
plt.subplot(1,2,1)
sns.distplot(df['DKYE'][df['label'] == 1])
plt.subplot(1,2,2)
sns.distplot(df['DKFFE'][df['label'] == 1])
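A side note on the plotting call: sns.distplot is deprecated in seaborn 0.11 and later. A sketch of the equivalent under newer seaborn versions, assuming the same df:

import matplotlib.pyplot as plt
import seaborn as sns

# histplot with a KDE overlay replaces the deprecated distplot
plt.figure(figsize=(8, 2))
plt.subplot(1, 2, 1)
sns.histplot(df['DKYE'][df['label'] == 1], kde=True)
plt.subplot(1, 2, 2)
sns.histplot(df['DKFFE'][df['label'] == 1], kde=True)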

       

III. Model Selection and Development

The algorithm is described in detail below, broken into the following steps:

1. Feature Engineering

# Feature combinations that were tried (kept as comments):
# df['GRYJCE_sum_DWYJCE'] = (df['GRYJCE'] + df['DWYJCE']) * 12 * (df['DKLL'] + 1)    # estimated annual loan repayment
# df['GRZHDNGJYE_GRZHSNJZYE'] = (df['GRZHDNGJYE'] + df['GRZHSNJZYE'] + df['GRZHYE']) - df['GRYJCE_sum_DWYJCE']
# df['DWJJLX_DWYSSHY'] = df['DWJJLX'] * df['DWSSHY']    # employer economic type * employer industry
# df['XINGBIEDKYE'] = df['XINGBIE'] * df['DKYE']
# df['m2'] = (df['DKYE'] - ((df['GRYJCE'] + df['DWYJCE']) * 12) + df['GRZHDNGJYE']) / 12
# df['KDKZGED'] = df['m2'] * (df['GRYJCE'] + df['DWYJCE'])
# gen_feats = ['DKFFE_multi_DKLL', 'DKFFE_DKYE_DKFFE', 'DWYSSHY2GRYJCE', 'DWYSSHY2DWYJCE', 'ZHIYE_GRZHZT', 'GRZHDNGJYE_GRZHSNJZYE']

# The combination kept in the final model:
df['DWYSSHY2GRYJCE'] = df['DWSSHY'] * df['DWSSHY'] * df['GRYJCE']    # industry^2 * personal monthly contribution; works well
gen_feats = ['DWYSSHY2GRYJCE']


df.head()

       

2. Model Overview

The arrival of XGBoost let data practitioners move past the traditional machine-learning algorithms: RF, GBM, SVM, LASSO, and so on. As its name suggests, LightGBM combines two ideas: "light", meaning lightweight, and GBM, the gradient boosting machine. LightGBM is a gradient-boosting framework that uses decision trees as its base learners. It is distributed and efficient, with the following advantages:

  • Faster training
  • Lower memory usage
  • Higher accuracy
  • Support for parallel learning
  • The ability to handle large-scale data

In summary, LightGBM's main features are:

  • A histogram-based decision-tree algorithm
  • Leaf-wise tree growth with a depth limit
  • Histogram subtraction for faster computation
  • Native support for categorical features
  • Cache hit-rate optimization
  • Histogram-based sparse-feature optimization
  • Multi-threading optimization
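As a minimal illustration of the leaf-wise growth and native categorical support mentioned above, here is a sketch on toy data (the column names are hypothetical; assumes lightgbm, numpy, and pandas are installed):

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier

# Toy data with one numeric and one categorical column (hypothetical names)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'amount': rng.normal(size=1000),
    'industry': pd.Categorical(rng.integers(0, 5, size=1000)),
})
y = (toy['amount'] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

clf_demo = LGBMClassifier(
    num_leaves=31,   # leaf-wise growth: cap on the total number of leaves
    max_depth=6,     # the accompanying depth limit
    n_estimators=100,
)
# Declaring the column categorical lets LightGBM split on it natively,
# with no one-hot encoding needed
clf_demo.fit(toy, y, categorical_feature=['industry'])
print(clf_demo.predict_proba(toy)[:5, 1])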

3. Model Training

oof = np.zeros(train_df.shape[0])
# feat_imp_df = pd.DataFrame({'feat': cols, 'imp': 0})
test_df['prob'] = 0

clf = LGBMClassifier(
    learning_rate=0.07,    # tried 0.05 to 0.1
    n_estimators=1030,     # tried 1030 and 1300
    num_leaves=37,         # tried 31, 35, 37, 40; 38 gave (0.523177, 0.93799), 39 gave (0.519115, 0.93587)
    subsample=0.8,         # tried 0.8 and 0.85
    colsample_bytree=0.8,
    random_state=11,
    # NOTE: the next two names are misspellings of is_unbalance and
    # scale_pos_weight, so LightGBM ignores them; see the
    # "Unknown parameter" warnings in the training log below.
    is_unbalace=True,
    sample_pos_weight=13,
)

# Earlier configurations that were tried:
# learning_rate=0.066 (learning rate), n_estimators=1032 (number of trees, i.e. boosting rounds),
# num_leaves=38 (max leaves per tree; roughly 2^max_depth in XGBoost terms),
# subsample=0.85 (row subsampling), colsample_bytree=0.85 (feature subsampling per tree),
# random_state=17 (random seed), reg_lambda=1e-1 (L2 regularization), min_split_gain=0.2 (minimum split gain)
#
# learning_rate=0.07, n_estimators=1032, num_leaves=37, subsample=0.8,
# colsample_bytree=0.8, random_state=17, silent=True (suppress training logs),
# min_split_gain=0.05 (minimum split gain), is_unbalace=True, sample_pos_weight=13

       

--------------------- 0 fold ---------------------

[LightGBM] [Warning] Unknown parameter: is_unbalace

[LightGBM] [Warning] Unknown parameter: sample_pos_weight

Training until validation scores don't improve for 200 rounds

[200] valid_0's auc: 0.944549 valid_0's binary_logloss: 0.110362

Early stopping, best iteration is:

[173] valid_0's auc: 0.944278 valid_0's binary_logloss: 0.1097

--------------------- 1 fold ---------------------

[LightGBM] [Warning] Unknown parameter: is_unbalace

[LightGBM] [Warning] Unknown parameter: sample_pos_weight

Training until validation scores don't improve for 200 rounds

[200] valid_0's auc: 0.943315 valid_0's binary_logloss: 0.113508

Early stopping, best iteration is:

[161] valid_0's auc: 0.943045 valid_0's binary_logloss: 0.113012

4. Model Prediction

Batch prediction over the whole test set is done with the model's predict_proba interface (the probability-returning counterpart of model.predict).

val_aucs = []
seeds = [11, 22, 33]
for seed in seeds:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df['label'])):
        print('--------------------- {} fold ---------------------'.format(i))
        t = time.time()
        trn_x, trn_y = train_df[cols].iloc[trn_idx].reset_index(drop=True), train_df['label'].values[trn_idx]
        val_x, val_y = train_df[cols].iloc[val_idx].reset_index(drop=True), train_df['label'].values[val_idx]
        clf.fit(
            trn_x, trn_y,
            eval_set=[(val_x, val_y)],
            # categorical_feature=cate_cols,
            eval_metric='auc',
            early_stopping_rounds=200,
            verbose=200
        )
        # feat_imp_df['imp'] += clf.feature_importances_ / skf.n_splits
        oof[val_idx] = clf.predict_proba(val_x)[:, 1]
        # Average the test predictions over all folds and all seeds
        test_df['prob'] += clf.predict_proba(test_df[cols])[:, 1] / skf.n_splits / len(seeds)

    cv_auc = roc_auc_score(train_df['label'], oof)
    val_aucs.append(cv_auc)
    print('\ncv_auc: ', cv_auc)

print(val_aucs, np.mean(val_aucs))
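For much larger test sets, the same batch prediction can be done in chunks to bound peak memory. A minimal sketch, assuming clf, test_df, cols, and numpy as np as defined above:

# Predict in chunks of 5,000 rows instead of all at once
chunk_size = 5000
probs = []
for start in range(0, len(test_df), chunk_size):
    chunk = test_df[cols].iloc[start:start + chunk_size]
    probs.append(clf.predict_proba(chunk)[:, 1])
probs = np.concatenate(probs)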

       

IV. Summary and Reflections

Takeaways: feature engineering matters most; the bulk of the improvement came from feature combinations, with parameter tuning next. I tried every parameter. Tuning requires one-at-a-time comparison experiments; never change two parameters at once (see the sketch below). Regularization generally exists to prevent overfitting; do not reach for it before confirming the model actually overfits, or the score will most likely drop. Find the few key parameters that can force the model to improve, such as min_split_gain=0.05 (minimum split gain) above. Tune the learning rate from large to small, and keep trying different combinations; the results can be surprising.
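A sketch of the one-parameter-at-a-time comparison loop described above (the candidate values are illustrative; assumes train_df, cols, numpy as np, and LGBMClassifier from the training cell):

from sklearn.model_selection import StratifiedKFold, cross_val_score

base_params = dict(learning_rate=0.07, n_estimators=200,
                   num_leaves=37, random_state=11)

# Vary exactly one parameter while holding all others fixed
for num_leaves in [31, 35, 37, 40]:
    params = {**base_params, 'num_leaves': num_leaves}
    model = LGBMClassifier(**params)
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
    aucs = cross_val_score(model, train_df[cols], train_df['label'],
                           cv=skf, scoring='roc_auc')
    print(num_leaves, np.mean(aucs))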

About the Author

I have reached Diamond rank on AI Studio and lit up 10 badges. Come follow me and I'll follow back ~ Alchemist_W (click to follow)

AI Creator Camp Assignment: Model Training and Parameter Tuning

Once the dataset is ready, training can begin. Paddle provides many convenient suites that greatly shorten development time and improve developer efficiency:

  • General-purpose suite: PaddleHub
  • Image classification: PaddleClas
  • Object detection: PaddleDetection
  • Image segmentation: PaddleSeg
  • Text recognition: PaddleOCR

For more suites, see the PaddlePaddle product overview (飞桨产品全景).

The project is described in detail below: predicting whether borrowers will repay on time with an LGBM model.

       

In [ ]
import pandas as pd
import numpy as np
from tqdm import tqdm
import os
import matplotlib.pyplot as plt
import seaborn as sns
import paddle
    In [ ]
train = pd.read_csv('./work/train.csv')
test = pd.read_csv('./work/test.csv')
submit = pd.read_csv('./work/submit.csv')
train.shape, test.shape, submit.shape
   

1. Inspecting the Data Structure

In [ ]
train.head()
    In [ ]
cate_2_cols = ['XINGBIE', 'ZHIWU', 'XUELI']
cate_cols = ['HYZK', 'ZHIYE', 'ZHICHEN', 'DWJJLX', 'DWSSHY', 'GRZHZT']
train[cate_cols]
    In [ ]
num_cols = ['GRJCJS', 'GRZHYE', 'GRZHSNJZYE', 'GRZHDNGJYE', 'GRYJCE', 'DWYJCE','DKFFE', 'DKYE', 'DKLL']
train[num_cols]
   

2. Feature Engineering

In [ ]
# Concatenate train and test so their distributions can be inspected together
# and transformations applied consistently; they are split apart again later.
df = pd.concat([train, test], axis=0).reset_index(drop=True)
df.head(10)
   

3. About LightGBM

LightGBM was introduced in Section III.2 above; to recap, it is a fast, memory-efficient, distributed gradient-boosting framework built on histogram-based decision trees, with depth-limited leaf-wise growth, native categorical-feature support, and multi-threading optimizations.

4. Pearson Correlation

4.1 The corr() function can be used as long as the rows are aligned; it supports three correlation methods, with Pearson as the default ('pearson', 'kendall', 'spearman').

4.2 The correlation coefficient is only defined when both variables have nonzero standard deviation. The Pearson coefficient is appropriate when:

(1) The two variables are linearly related, and both are continuous.

(2) Both variables' populations are normally distributed, or at least close to normal and unimodal.

(3) The observations come in pairs, and the pairs are mutually independent.
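A quick sketch comparing the three methods on the numeric columns (assumes df and num_cols as defined in this notebook; corr ignores rows where 'label' is NaN pairwise):

# Compare Pearson, Kendall, and Spearman correlations against the label
# (Kendall is noticeably slower on large frames)
for method in ['pearson', 'kendall', 'spearman']:
    corr = df[num_cols + ['label']].corr(method=method)['label']
    print(method, corr.drop('label').sort_values().tail(3), sep='\n')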

In [ ]
# Find correlations with the target and sort
correlations = df.corr()['label'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
# The strongest positive correlations are good candidates for combined features
    In [ ]
# Heatmap: use a heat map to analyze the relationship between
# two categorical features and the label
summary = pd.pivot_table(data=df,
                         index='GRZHZT',
                         columns='ZHIYE',
                         values='label',
                         aggfunc=np.sum)

sns.heatmap(data=summary,
            cmap='rainbow',
            annot=True,
            # fmt='.2e',  # scientific notation, 2 decimal places
            linewidth=0.5)
plt.title('Label')
plt.show()
   

5. Data Visualization: Analyzing the Relationship Between Age Group and Repayment with Distribution Plots

In [ ]
def get_age(df, col='age'):
    df[col + "_genFeat1"] = (df['age'] > 18).astype(int)
    df[col + "_genFeat2"] = (df['age'] > 25).astype(int)
    df[col + "_genFeat3"] = (df['age'] > 30).astype(int)
    df[col + "_genFeat4"] = (df['age'] > 35).astype(int)
    df[col + "_genFeat5"] = (df['age'] > 40).astype(int)
    df[col + "_genFeat6"] = (df['age'] > 45).astype(int)
    return df, [col + f'_genFeat{i}' for i in range(1, 7)]

# 1609430399 is the Unix timestamp of 2020-12-31 23:59:59 (UTC+8); subtracting the
# CSNY birth timestamp and dividing by seconds per year yields the age in years
df['age'] = ((1609430399 - df['CSNY']) / (365 * 24 * 3600)).astype(int)
df, genFeats1 = get_age(df, col='age')

sns.distplot(df['age'][df['age'] > 0])
    In [ ]
def get_daikuanYE(df,col):
    df[col + '_genFeat1'] = (df[col] > 100000).astype(int)
    df[col + '_genFeat2'] = (df[col] > 120000).astype(int)
    df[col + '_genFeat3'] = (df[col] > 140000).astype(int)
    df[col + '_genFeat4'] = (df[col] > 180000).astype(int)
    df[col + '_genFeat5'] = (df[col] > 220000).astype(int)
    df[col + '_genFeat6'] = (df[col] > 260000).astype(int)
    df[col + '_genFeat7'] = (df[col] > 300000).astype(int)
    return df, [col + f'_genFeat{i}' for i in range(1, 8)]

df, genFeats2 = get_daikuanYE(df, col = 'DKYE')
df, genFeats3 = get_daikuanYE(df, col = 'DKFFE')


plt.figure(figsize = (8, 2))
plt.subplot(1,2,1)
sns.distplot(df['DKYE'][df['label'] == 1])
plt.subplot(1,2,2)
sns.distplot(df['DKFFE'][df['label'] == 1])
    In [ ]
train_df = df[df['label'].isna() == False].reset_index(drop=True)
test_df = df[df['label'].isna() == True].reset_index(drop=True)
display(train_df.shape, test_df.shape)

plt.figure(figsize = (8, 2))
plt.subplot(1,2,1)
sns.distplot(train_df['age'][train_df['age'] > 0])
plt.subplot(1,2,2)
sns.distplot(test_df['age'][test_df['age'] > 0])
   

Tip: comparing each attribute's distribution between the training and test sets shows that ZHIWU (job position) is distributed differently across the two sets. It is an unusable attribute; feeding it to LGBM degrades classification. The plots below, and the KS-test sketch after them, make this check concrete.

In [ ]
gen_feats_fest = ['age', 'HYZK', 'ZHIYE', 'ZHICHEN', 'ZHIWU', 'XUELI', 'DWJJLX', 'DWSSHY', 'GRJCJS', 'GRZHZT', 'GRZHYE', 'GRZHSNJZYE', 'GRZHDNGJYE', 'GRYJCE', 'DWYJCE', 'DKFFE', 'DKYE', 'DKLL']

for i in range(len(gen_feats_fest)):
    plt.figure(figsize=(8, 2))
    plt.subplot(1, 2, 1)
    sns.distplot(train_df[gen_feats_fest[i]])
    plt.subplot(1, 2, 2)
    sns.distplot(test_df[gen_feats_fest[i]])
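Beyond eyeballing the paired plots, the train/test shift can be quantified per feature. A sketch using a two-sample Kolmogorov-Smirnov test (assumes scipy is available; small p-values flag mismatched features such as ZHIWU):

from scipy.stats import ks_2samp

# A small p-value means the train and test distributions differ,
# making the feature a candidate for removal
for feat in gen_feats_fest:
    stat, p = ks_2samp(train_df[feat].dropna(), test_df[feat].dropna())
    if p < 0.01:
        print(f'{feat}: KS={stat:.3f}, p={p:.2e}')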
   

Next, combine features based on the loan repayment formula and on the Pearson correlations discussed above.

In [ ]
# Feature candidates that were tried, annotated with how they affected the score:
# df['missing_rate'] = (df.shape[1] - df.count(axis=1)) / df.shape[1]                   # bad
# df['DKFFE_DKYE'] = df['DKFFE'] + df['DKYE']                                           # slightly worse
# df['DKFFE_DKY_multi_DKLL'] = (df['DKFFE'] + df['DKYE']) * df['DKLL']                  # slightly better
# df['DKFFE_multi_DKLL'] = df['DKFFE'] * df['DKLL']                                     # slightly better
# df['DKYE_multi_DKLL'] = df['DKYE'] * df['DKLL']                                       # slightly worse
# df['GRYJCE_DWYJCE'] = df['GRYJCE'] + df['DWYJCE']                                     # average
# df['GRZHDNGJYE_GRZHSNJZYE'] = df['GRZHDNGJYE'] + df['GRZHSNJZYE']                     # slightly worse
# df['DKFFE_multi_DKLL_ratio'] = df['DKFFE'] * df['DKLL'] / df['DKFFE_DKY_multi_DKLL']  # slightly worse
# df['DKYE_multi_DKLL_ratio'] = df['DKYE'] * df['DKLL'] / df['DKFFE_DKY_multi_DKLL']    # slightly worse
# df['DKYE_DKFFE_ratio'] = df['DKYE'] / (df['DKFFE'] + df['DKYE'])                      # average
# df['DKFFE_DKYE_ratio'] = df['DKFFE'] / (df['DKFFE'] + df['DKYE'])                     # slightly worse
# df['GRZHYE_diff_GRZHDNGJYE'] = df['GRZHYE'] - df['GRZHDNGJYE']                        # slightly worse
# df['GRZHYE_diff_GRZHSNJZYE'] = df['GRZHYE'] - df['GRZHSNJZYE']                        # slightly worse
# df['GRYJCE_DWYJCE_ratio'] = df['GRYJCE'] / (df['GRYJCE'] + df['DWYJCE'])              # bad
# df['DWYJCE_GRYJCE_ratio'] = df['DWYJCE'] / (df['GRYJCE'] + df['DWYJCE'])              # slightly worse
# df['DWYSSHY2DKLL'] = df['DWSSHY'] * df['DWSSHY'] * df['DKLL']
# df['DWYSSHY2GRJCJS2'] = df['DWSSHY'] * df['DWSSHY'] * df['GRYJCE'] * df['GRYJCE']
# df['ZHIYE_GRZHZT'] = df['GRZHZT'] / (df['ZHIYE'] + 0.00000001)
# gen_feats = ['DWYSSHY2GRYJCE', 'ZHIYE_GRZHZT']
# df['DKFFE_multi_DKLL'] = df['DKFFE'] * df['DKLL']       # loan amount issued * interest rate
# df['DKFFE-DKYE'] = df['DKFFE'] - df['DKYE']             # amount issued - balance = amount repaid so far
# df['DKFFE_DKYE_DKFFE'] = df['DKFFE-DKYE'] * df['DKFFE']
# df['DWYSSHY2GRYJCE'] = df['DWSSHY'] * df['DWSSHY'] * df['GRYJCE']   # industry^2 * personal monthly contribution ***
# df['DWYSSHY2DWYJCE'] = df['DWSSHY'] * df['DWSSHY'] * df['DWYJCE']   # industry^2 * employer monthly contribution
# df['ZHIYE_GRZHZT'] = df['GRZHZT'] / df['ZHIYE']
# df['DWYSSHY3GRYJCE'] = (df['DWSSHY'] * df['DWSSHY'] * df['DWSSHY'] * df['GRYJCE']) * (df['GRZHZT'] / df['ZHIYE'])
# df['GRYJCE_sum_DWYJCE'] = (df['GRYJCE'] + df['DWYJCE']) * 12 * (df['DKLL'] + 1)   # estimated annual loan repayment
# df['GRZHDNGJYE_GRZHSNJZYE'] = (df['GRZHDNGJYE'] + df['GRZHSNJZYE'] + df['GRZHYE']) - df['GRYJCE_sum_DWYJCE']
# df['DWJJLX_DWYSSHY'] = df['DWJJLX'] * df['DWSSHY']      # employer economic type * employer industry
# df['XINGBIEDKYE'] = df['XINGBIE'] * df['DKYE']
# df['m2'] = (df['DKYE'] - ((df['GRYJCE'] + df['DWYJCE']) * 12) + df['GRZHDNGJYE']) / 12
# df['KDKZGED'] = df['m2'] * (df['GRYJCE'] + df['DWYJCE'])
# gen_feats = ['DKFFE_multi_DKLL', 'DKFFE_DKYE_DKFFE', 'DWYSSHY2GRYJCE', 'DWYSSHY2DWYJCE', 'ZHIYE_GRZHZT', 'GRZHDNGJYE_GRZHSNJZYE']

# The combination kept in the final model:
df['DWYSSHY2GRYJCE'] = df['DWSSHY'] * df['DWSSHY'] * df['GRYJCE']    # good
gen_feats = ['DWYSSHY2GRYJCE']


df.head()
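The "good" / "bad" verdicts in the comments above came from full cross-validation reruns. As a cheaper first screen, a candidate can be scored by its single-feature AUC against the label. A sketch with a hypothetical helper (assumes df and scikit-learn are available):

from sklearn.metrics import roc_auc_score

def single_feature_auc(frame, feat, label='label'):
    # AUC of the raw feature used directly as a ranking score for the label
    mask = frame[feat].notna() & frame[label].notna()
    auc = roc_auc_score(frame.loc[mask, label], frame.loc[mask, feat])
    return max(auc, 1 - auc)   # direction-agnostic

print(single_feature_auc(df, 'DWYSSHY2GRYJCE'))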
    In [ ]
# Label-encode each categorical column, add its frequency count, and one-hot encode it
for f in tqdm(cate_cols):
    df[f] = df[f].map(dict(zip(df[f].unique(), range(df[f].nunique()))))
    df[f + '_count'] = df[f].map(df[f].value_counts())
    df = pd.concat([df, pd.get_dummies(df[f], prefix=f"{f}")], axis=1)


# Pairwise co-occurrence counts and conditional proportions for categorical pairs
cate_cols_combine = [[cate_cols[i], cate_cols[j]] for i in range(len(cate_cols))
                     for j in range(i + 1, len(cate_cols))]

for f1, f2 in tqdm(cate_cols_combine):
    df['{}_{}_count'.format(f1, f2)] = df.groupby([f1, f2])['id'].transform('count')
    df['{}_in_{}_prop'.format(f1, f2)] = df['{}_{}_count'.format(f1, f2)] / df[f2 + '_count']
    df['{}_in_{}_prop'.format(f2, f1)] = df['{}_{}_count'.format(f1, f2)] / df[f1 + '_count']
# Group numeric features by each categorical column and add aggregate statistics
for f1 in tqdm(cate_cols):
    g = df.groupby(f1)
    for f2 in num_cols + gen_feats:
        for stat in ['sum', 'mean', 'std', 'max', 'min']:   # 'std' was listed twice in the original
            df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat)
    for f3 in genFeats2 + genFeats3:
        for stat in ['sum', 'mean']:
            # NOTE: the original body referenced f2 rather than f3 here, so this
            # inner loop only recomputes the last f2's aggregates; kept as-is to
            # match the reported (40000, 1093) feature count
            df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat)

num_cols_gen_feats = num_cols + gen_feats

# Aggregate each numeric feature grouped by every other numeric feature
for f1 in tqdm(num_cols_gen_feats):
    g = df.groupby(f1)
    for f2 in num_cols_gen_feats:
        if f1 != f2:
            for stat in ['sum', 'mean', 'std', 'max', 'min']:
                df['{}_{}_{}'.format(f1, f2, stat)] = g[f2].transform(stat)

# Pairwise arithmetic crosses of all numeric features
for i in tqdm(range(len(num_cols_gen_feats))):
    for j in range(i + 1, len(num_cols_gen_feats)):
        df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_add'] = df[num_cols_gen_feats[i]] + df[num_cols_gen_feats[j]]
        df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_diff'] = df[num_cols_gen_feats[i]] - df[num_cols_gen_feats[j]]
        df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_multi'] = df[num_cols_gen_feats[i]] * df[num_cols_gen_feats[j]]
        df[f'numsOf_{num_cols_gen_feats[i]}_{num_cols_gen_feats[j]}_div'] = df[num_cols_gen_feats[i]] / (df[num_cols_gen_feats[j]] + 0.0000000001)
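These pairwise crosses push the frame past a thousand columns. One mitigation, not part of the original pipeline, is to downcast float64 columns to float32 to roughly halve memory:

# Downcast float columns; float32 precision is plenty for tree splits
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype('float32')
print(round(df.memory_usage(deep=True).sum() / 1024 ** 2, 1), 'MB')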
   

6. Splitting Back into Training and Test Sets

In [ ]
train_df = df[df['label'].isna() == False].reset_index(drop=True)
test_df = df[df['label'].isna() == True].reset_index(drop=True)
display(train_df.shape, test_df.shape)
       
(40000, 1093)
               
(15000, 1093)
                In [ ]
drop_feats = [f for f in train_df.columns if train_df[f].nunique() == 1 or train_df[f].nunique() == 0]
len(drop_feats), drop_feats
       
(4,
 ['DWSSHY_DKYE_min',
  'GRYJCE_DWYJCE_std',
  'DWYJCE_GRYJCE_std',
  'numsOf_GRYJCE_DWYJCE_diff'])
                In [ ]
cols = [col for col in train_df.columns if col not in ['id', 'label'] + drop_feats]
    In [ ]
from sklearn.model_selection import StratifiedKFold
from lightgbm.sklearn import LGBMClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from xgboost import XGBClassifier
import time
import lightgbm as lgb
       
                In [ ]
# callback = paddle.callbacks.VisualDL(log_dir='visualdl_log_dir')  # for local runs
oof = np.zeros(train_df.shape[0])
# feat_imp_df = pd.DataFrame({'feat': cols, 'imp': 0})
test_df['prob'] = 0

clf = LGBMClassifier(
    learning_rate=0.07,    # tried 0.05 to 0.1
    n_estimators=1030,     # tried 1030 and 1300
    num_leaves=37,         # tried 31, 35, 37, 40; 38 gave (0.523177, 0.93799), 39 gave (0.519115, 0.93587)
    subsample=0.8,         # tried 0.8 and 0.85
    colsample_bytree=0.8,
    random_state=11,
    # NOTE: the next two names are misspellings of is_unbalance and
    # scale_pos_weight, so LightGBM ignores them; see the
    # "Unknown parameter" warnings in the training log below.
    is_unbalace=True,
    sample_pos_weight=13,
)

# Earlier configurations that were tried:
# learning_rate=0.066 (learning rate), n_estimators=1032 (number of trees, i.e. boosting rounds),
# num_leaves=38 (max leaves per tree; roughly 2^max_depth in XGBoost terms),
# subsample=0.85 (row subsampling), colsample_bytree=0.85 (feature subsampling per tree),
# random_state=17 (random seed), reg_lambda=1e-1 (L2 regularization), min_split_gain=0.2 (minimum split gain)
#
# learning_rate=0.07, n_estimators=1032, num_leaves=37, subsample=0.8,
# colsample_bytree=0.8, random_state=17, silent=True (suppress training logs),
# min_split_gain=0.05 (minimum split gain), is_unbalace=True, sample_pos_weight=13

val_aucs = []
seeds = [11, 22, 33]
for seed in seeds:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for i, (trn_idx, val_idx) in enumerate(skf.split(train_df, train_df['label'])):
        print('--------------------- {} fold ---------------------'.format(i))
        t = time.time()
        trn_x, trn_y = train_df[cols].iloc[trn_idx].reset_index(drop=True), train_df['label'].values[trn_idx]
        val_x, val_y = train_df[cols].iloc[val_idx].reset_index(drop=True), train_df['label'].values[val_idx]
        clf.fit(
            trn_x, trn_y,
            eval_set=[(val_x, val_y)],
            # categorical_feature=cate_cols,
            eval_metric='auc',
            early_stopping_rounds=200,
            verbose=200
        )
        # feat_imp_df['imp'] += clf.feature_importances_ / skf.n_splits
        oof[val_idx] = clf.predict_proba(val_x)[:, 1]
        # Average the test predictions over all folds and all seeds
        test_df['prob'] += clf.predict_proba(test_df[cols])[:, 1] / skf.n_splits / len(seeds)

    cv_auc = roc_auc_score(train_df['label'], oof)
    val_aucs.append(cv_auc)
    print('\ncv_auc: ', cv_auc)

print(val_aucs, np.mean(val_aucs))
       
--------------------- 0 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.944549	valid_0's binary_logloss: 0.110362
Early stopping, best iteration is:
[173]	valid_0's auc: 0.944278	valid_0's binary_logloss: 0.1097
--------------------- 1 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.943315	valid_0's binary_logloss: 0.113508
Early stopping, best iteration is:
[161]	valid_0's auc: 0.943045	valid_0's binary_logloss: 0.113012
--------------------- 2 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.942585	valid_0's binary_logloss: 0.119059
Early stopping, best iteration is:
[148]	valid_0's auc: 0.942207	valid_0's binary_logloss: 0.117848
--------------------- 3 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.942192	valid_0's binary_logloss: 0.115931
Early stopping, best iteration is:
[123]	valid_0's auc: 0.942244	valid_0's binary_logloss: 0.114857
--------------------- 4 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.939505	valid_0's binary_logloss: 0.113455
Early stopping, best iteration is:
[164]	valid_0's auc: 0.939654	valid_0's binary_logloss: 0.112933

cv_auc:  0.9420797160267054
--------------------- 0 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.937373	valid_0's binary_logloss: 0.119639
Early stopping, best iteration is:
[140]	valid_0's auc: 0.938125	valid_0's binary_logloss: 0.117851
--------------------- 1 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.942087	valid_0's binary_logloss: 0.113331
Early stopping, best iteration is:
[182]	valid_0's auc: 0.942311	valid_0's binary_logloss: 0.112912
--------------------- 2 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.93272	valid_0's binary_logloss: 0.120388
Early stopping, best iteration is:
[138]	valid_0's auc: 0.933033	valid_0's binary_logloss: 0.118682
--------------------- 3 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
[200]	valid_0's auc: 0.951504	valid_0's binary_logloss: 0.10742
Early stopping, best iteration is:
[178]	valid_0's auc: 0.951198	valid_0's binary_logloss: 0.107208
--------------------- 4 fold ---------------------
[LightGBM] [Warning] Unknown parameter: is_unbalace
[LightGBM] [Warning] Unknown parameter: sample_pos_weight
Training until validation scores don't improve for 200 rounds
       

       

In [ ]
print(val_aucs, np.mean(val_aucs))

def tpr_weight_funtion(y_true, y_predict):
    # Competition metric: a weighted TPR at three fixed FPR operating points
    d = pd.DataFrame()
    d['prob'] = list(y_predict)                   # predicted probabilities
    d['y'] = list(y_true)                         # ground-truth labels
    d = d.sort_values(['prob'], ascending=[0])    # sort by probability, descending
    y = d.y
    PosAll = pd.Series(y).value_counts()[1]       # total number of positives
    NegAll = pd.Series(y).value_counts()[0]       # total number of negatives
    pCumsum = d['y'].cumsum()                     # cumulative true positives
    nCumsum = np.arange(len(y)) - pCumsum + 1     # cumulative false positives
    pCumsumPer = pCumsum / PosAll                 # TPR (coverage rate)
    nCumsumPer = nCumsum / NegAll                 # FPR (disturbance rate)
    TR1 = pCumsumPer[abs(nCumsumPer - 0.001).idxmin()]   # TPR at FPR = 0.001
    TR2 = pCumsumPer[abs(nCumsumPer - 0.005).idxmin()]   # TPR at FPR = 0.005
    TR3 = pCumsumPer[abs(nCumsumPer - 0.01).idxmin()]    # TPR at FPR = 0.01

    return 0.4 * TR1 + 0.3 * TR2 + 0.3 * TR3

tpr = round(tpr_weight_funtion(train_df['label'], oof), 6)
tpr, round(np.mean(val_aucs), 5)
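The same metric can be sanity-checked against sklearn's ROC utilities. A sketch, assuming train_df, oof, and numpy as np from the cells above:

from sklearn.metrics import roc_curve

fpr, tpr_points, _ = roc_curve(train_df['label'], oof)
# Interpolate the TPR at the three fixed FPR operating points
tr1, tr2, tr3 = np.interp([0.001, 0.005, 0.01], fpr, tpr_points)
print(round(0.4 * tr1 + 0.3 * tr2 + 0.3 * tr3, 6))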
    In [ ]
submit.head()
    In [ ]
submit['id'] = test_df['id']
submit['label'] = test_df['prob']

submit.to_csv('./work/Sub62 {}_{}.csv'.format(tpr, round(np.mean(val_aucs), 6)), index = False)
submit.head()
   

If the accuracy after training is unsatisfactory, the configuration file can be tuned; the most common adjustments target the optimizer and the learning rate.

Open the PaddleDetection/configs/yolov3/yolov3_mobilenet_v1_roadsign.yml file:

LearningRate:
  base_lr: 0.0001
  schedulers:
  - !PiecewiseDecay
    gamma: 0.1
    milestones: [32, 36]
  - !LinearWarmup
    start_factor: 0.3333333333333333
    steps: 100

OptimizerBuilder:
  optimizer:
    momentum: 0.9
    type: Momentum
  regularizer:
    factor: 0.0005
    type: L2
       

We can adjust these parameters as appropriate for our dataset and the chosen model.

For more details, see the configuration change notes (配置文件改动和说明).

Reference Projects

  • Assorted PaddleHub projects

  • Deploying PaddleDetection on Jetson Nano

  • A complete project with SSD-MobileNetv1: from dataset preparation to Raspberry Pi deployment

  • A PaddleClas source-code walkthrough

  • PaddleSeg 2.0 dynamic graph: an introduction to lane-line image segmentation

  • The big PaddleGAN collection

If none of these interests you, look for a comparable project to use as a reference; note that the project's paddle version should preferably be 2.0.2 or above.

       

Assignment Requirements

Write out the complete training code, state which suite and optimizer you used and which parameters you adjusted during training, and add a short reflection.

Note: if you plan to complete this assignment in a new project, you can attach the link when submitting.

Suite used: PaddlePaddle.

Model used: an LGBM (LightGBM) model.

Parameters adjusted (two configurations were compared):

First configuration:

  • learning_rate=0.066 (learning rate)
  • n_estimators=1032 (number of trees to fit, i.e. boosting rounds)
  • num_leaves=38 (maximum leaves per tree; roughly 2^max_depth in XGBoost terms)
  • subsample=0.85 (row subsampling rate)
  • colsample_bytree=0.85 (feature subsampling rate per tree)
  • random_state=17 (random seed)
  • reg_lambda=1e-1 (L2 regularization coefficient)
  • min_split_gain=0.2 (minimum split gain; left commented out)

Second configuration:

  • learning_rate=0.07 (learning rate)
  • n_estimators=1032 (number of trees to fit, i.e. boosting rounds)
  • num_leaves=37 (maximum leaves per tree)
  • subsample=0.8 (row subsampling rate)
  • colsample_bytree=0.8 (feature subsampling rate per tree)
  • random_state=17 (random seed)
  • silent=True (whether to print training logs)
  • min_split_gain=0.05 (minimum split gain)
  • is_unbalace=True (misspelling of is_unbalance; ignored by LightGBM)
  • sample_pos_weight=13 (misspelling of scale_pos_weight; ignored by LightGBM)

