[Intro to Data Mining] Quickly Building a Competition Baseline with Tree Models, Plus Ideas for Improving It


This article is an introductory tutorial for data-mining competitions. Taking the Vehicle Loan Default Prediction Challenge as its example, it shows how to build a baseline quickly with the LightGBM tree model: reading the data with memory optimization, EDA, and feature filtering, then training with 5-fold cross-validation and writing out predictions. It also shares ideas for pushing the score higher, helping beginners form a systematic picture of competitions.


Project overview:

This project is an introductory competition tutorial. It demonstrates how to quickly build a competition baseline with a tree model and shares ideas for improving on it, with the goal of giving beginners a systematic view of competitions so they can get started and score well.

About the tree model LightGBM:

LightGBM is a fast, parallel gradient-boosting tree framework that builds on the ideas behind XGBoost. It integrates several ensemble-learning techniques and reworks how tree nodes are split (histogram-based splits with leaf-wise growth), giving it a lower memory footprint and faster training than XGBoost.

LightGBM documentation: https://lightgbm.readthedocs.io/en/latest/

Parameter reference: https://lightgbm.readthedocs.io/en/latest/Parameters.html

Usage guide: "LightGBM operations you should know!" (link in the original post)

Why tree models: tree models excel at generating rules. They distill decision rules from labeled, feature-rich data and present those rules as a tree structure, solving both classification and regression problems.

Tabular tasks are home turf for decision-tree models: boosting frameworks such as XGBoost and LightGBM have become the de facto standard in today's data-mining competitions.
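As a minimal illustration of "rules from features and labels", here is a toy decision tree on invented loan-style data. The column names are only borrowed from this problem for flavor; none of the values come from the competition data.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (all values invented): [credit_score, loan_to_asset_ratio]
X = [[620, 0.90], [700, 0.40], [580, 0.95], [720, 0.30], [600, 0.80], [690, 0.50]]
y = [1, 0, 1, 0, 1, 0]  # 1 = default, 0 = no default

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Print the learned rules as an if/else decision list
print(export_text(tree, feature_names=['credit_score', 'loan_to_asset_ratio']))
```

The printed tree makes the "decision rules" idea concrete: each path from root to leaf is one human-readable rule.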

In [1]
# Install LightGBM
# default (CPU) build
!pip install lightgbm
# GPU build for faster training
# !pip install lightgbm --install-option=--gpu
       
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already satisfied: lightgbm in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (3.1.1)
Requirement already satisfied: numpy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.16.4)
Requirement already satisfied: wheel in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.33.6)
Requirement already satisfied: scikit-learn!=0.22.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (0.22.1)
Requirement already satisfied: scipy in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from lightgbm) (1.3.0)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn!=0.22.0->lightgbm) (0.14.1)
       

This tutorial builds a tree-model baseline for the iFLYTEK Vehicle Loan Default Prediction Challenge.

Competition page: http://challenge.xfyun.cn/topic/info?type=car-loan

Task: train on the training set and predict the loan_default field of the test set, i.e. whether the borrower will default on payments. 1 means the customer is overdue, 0 means the customer is not.

Requirements: nothing demanding; the CPU build is enough to run this project. Tree models generally only start to need serious memory when the data has many features or very high dimensionality.

In [2]
# Unzip the competition data
%cd /home/aistudio/data/data101719/
!unzip data.zip
       
/home/aistudio/data/data101719
Archive:  data.zip
  inflating: sample_submit.csv       
  inflating: test.csv                
  inflating: train.csv
In [3]
# Import dependencies
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import f1_score, roc_auc_score
from tqdm import tqdm
import gc
import time
import lightgbm as lgb
import warnings

warnings.filterwarnings('ignore')
       
In [4]
# Memory-optimization helper to avoid running out of RAM:
# downcast each numeric column to the smallest dtype that fits its value range.
def reduce_mem(df, cols):
    start_mem = df.memory_usage().sum() / 1024 ** 2
    for col in tqdm(cols):
        col_type = df[col].dtypes
        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024 ** 2
    print('{:.2f} Mb, {:.2f} Mb ({:.2f} %)'.format(start_mem, end_mem, 100 * (start_mem - end_mem) / start_mem))
    gc.collect()
    return df
In [5]
# Load the competition data
train = pd.read_csv('./train.csv')  # training set
test = pd.read_csv('./test.csv')    # test set

# Downcast dtypes to save memory
train = reduce_mem(train, [f for f in train.columns])
test = reduce_mem(test, [f for f in test.columns])
       
100%|██████████| 53/53 [00:01<00:00, 42.04it/s]
100%|██████████| 52/52 [00:00<00:00, 559.02it/s]
       
60.65 Mb, 18.02 Mb (70.28 %)
11.90 Mb, 3.55 Mb (70.19 %)
       

In [6]
# Set up the submission file in the required format: 'customer_id', 'loan_default'.
# 'loan_default' is the label to predict for the test set: 1 = customer overdue, 0 = not overdue.
sample_submit = pd.DataFrame(columns=['customer_id', 'loan_default'])
sample_submit['customer_id'] = test['customer_id']
   

Exploratory data analysis (EDA):

Global analysis: the overall shape of the data, including types, size, and quality.

Univariate analysis: exploring each variable on its own, whether categorical, continuous, or text.

Cross-feature analysis: interactions between features and the label, and among the features themselves.

Train/test distribution analysis: a mismatch between the training and test distributions is a major cause of offline scores not matching the leaderboard.

Further reading: Beginner's Competition Study Handbook (link in the original post)
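One common way to check the last point is adversarial validation: label each row by its origin (train vs. test) and see whether a classifier can tell them apart. A minimal sketch on synthetic data follows; the feature names and the 0.5 shift in `f1` are invented for the demo and are not from the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Fake "train" and "test" frames with one deliberately shifted feature.
train_df = pd.DataFrame({'f1': rng.normal(0.0, 1, 1000), 'f2': rng.normal(0, 1, 1000)})
test_df = pd.DataFrame({'f1': rng.normal(0.5, 1, 1000), 'f2': rng.normal(0, 1, 1000)})

# Stack both frames and train a classifier to predict the row's origin.
both = pd.concat([train_df, test_df], ignore_index=True)
is_test = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]
auc = cross_val_score(LogisticRegression(), both, is_test, cv=5, scoring='roc_auc').mean()
# AUC near 0.5 -> similar distributions; well above 0.5 -> distribution shift.
print(f'adversarial AUC: {auc:.3f}')
```

On the real data you would run this on `train[all_cols]` and `test[all_cols]`; features that make the origin easy to predict are candidates for removal.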

In [7]
# Data overview: this competition has many fields, and using them well is one of its main challenges.
train.info()
       

RangeIndex: 150000 entries, 0 to 149999
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   customer_id                    150000 non-null  int32  
 1   main_account_loan_no           150000 non-null  int16  
 2   main_account_active_loan_no    150000 non-null  int16  
 3   main_account_overdue_no        150000 non-null  int8   
 4   main_account_outstanding_loan  150000 non-null  int32  
 5   main_account_sanction_loan     150000 non-null  int32  
 6   main_account_disbursed_loan    150000 non-null  int32  
 7   sub_account_loan_no            150000 non-null  int8   
 8   sub_account_active_loan_no     150000 non-null  int8   
 9   sub_account_overdue_no         150000 non-null  int8   
 10  sub_account_outstanding_loan   150000 non-null  int32  
 11  sub_account_sanction_loan      150000 non-null  int32  
 12  sub_account_disbursed_loan     150000 non-null  int32  
 13  disbursed_amount               150000 non-null  int32  
 14  asset_cost                     150000 non-null  int32  
 15  branch_id                      150000 non-null  int8   
 16  supplier_id                    150000 non-null  int16  
 17  manufacturer_id                150000 non-null  int8   
 18  area_id                        150000 non-null  int8   
 19  employee_code_id               150000 non-null  int16  
 20  mobileno_flag                  150000 non-null  int8   
 21  idcard_flag                    150000 non-null  int8   
 22  Driving_flag                   150000 non-null  int8   
 23  passport_flag                  150000 non-null  int8   
 24  credit_score                   150000 non-null  int16  
 25  main_account_monthly_payment   150000 non-null  int32  
 26  sub_account_monthly_payment    150000 non-null  int32  
 27  last_six_month_new_loan_no     150000 non-null  int8   
 28  last_six_month_defaulted_no    150000 non-null  int8   
 29  average_age                    150000 non-null  int8   
 30  credit_history                 150000 non-null  int8   
 31  enquirie_no                    150000 non-null  int8   
 32  loan_to_asset_ratio            150000 non-null  float16
 33  total_account_loan_no          150000 non-null  int16  
 34  sub_account_inactive_loan_no   150000 non-null  int16  
 35  total_inactive_loan_no         150000 non-null  int8   
 36  main_account_inactive_loan_no  150000 non-null  int16  
 37  total_overdue_no               150000 non-null  int8   
 38  total_outstanding_loan         150000 non-null  int32  
 39  total_sanction_loan            150000 non-null  int32  
 40  total_disbursed_loan           150000 non-null  int32  
 41  total_monthly_payment          150000 non-null  int32  
 42  outstanding_disburse_ratio     150000 non-null  float64
 43  main_account_tenure            150000 non-null  int32  
 44  sub_account_tenure             150000 non-null  int32  
 45  disburse_to_sactioned_ratio    150000 non-null  float32
 46  active_to_inactive_act_ratio   150000 non-null  float16
 47  year_of_birth                  150000 non-null  int16  
 48  disbursed_date                 150000 non-null  int16  
 49  Credit_level                   150000 non-null  int8   
 50  employment_type                150000 non-null  int8   
 51  age                            150000 non-null  int8   
 52  loan_default                   150000 non-null  int8   
dtypes: float16(2), float32(1), float64(1), int16(10), int32(17), int8(22)
memory usage: 18.0 MB
In [8]
# Count distinct values per field; fields with nunique == 1 are constant and can be dropped outright.
train.nunique()
       
customer_id                      150000
main_account_loan_no                104
main_account_active_loan_no          35
main_account_overdue_no              19
main_account_outstanding_loan     48609
main_account_sanction_loan        30564
main_account_disbursed_loan       32862
sub_account_loan_no                  36
sub_account_active_loan_no           21
sub_account_overdue_no                8
sub_account_outstanding_loan       2108
sub_account_sanction_loan          1519
sub_account_disbursed_loan         1725
disbursed_amount                  19235
asset_cost                        38902
branch_id                            82
supplier_id                        2888
manufacturer_id                      10
area_id                              22
employee_code_id                   3241
mobileno_flag                         1
idcard_flag                           1
Driving_flag                          2
passport_flag                         2
credit_score                        570
main_account_monthly_payment      21499
sub_account_monthly_payment        1304
last_six_month_new_loan_no           24
last_six_month_defaulted_no          14
average_age                         100
credit_history                      100
enquirie_no                          23
loan_to_asset_ratio                1994
total_account_loan_no               103
sub_account_inactive_loan_no         90
total_inactive_loan_no               27
main_account_inactive_loan_no        91
total_overdue_no                     19
total_outstanding_loan            49406
total_sanction_loan               31216
total_disbursed_loan              33557
total_monthly_payment             21843
outstanding_disburse_ratio         4391
main_account_tenure               12816
sub_account_tenure                 1230
disburse_to_sactioned_ratio         375
active_to_inactive_act_ratio        211
year_of_birth                        48
disbursed_date                        1
Credit_level                         14
employment_type                       3
age                                  48
loan_default                          2
dtype: int64
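The constant columns visible above (mobileno_flag, idcard_flag, disbursed_date) can also be found programmatically rather than by eye; a small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [0, 0, 0], 'c': [5, 5, 6]})
# Columns with a single unique value carry no information for the model.
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)
print(constant_cols)
```

On the real data, `[c for c in train.columns if train[c].nunique() == 1]` produces the list to exclude in the next cell.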
               

Feature engineering (the key step!):

1. Feature interaction: combining features with one another and deriving new features from existing ones.

2. Feature encoding: one-hot encoding, label encoding, and so on.

3. Feature selection: pruning useless features based on feature-importance and correlation analysis.

Feature engineering is largely about helping the model learn. Where the model learns poorly or with difficulty, hand-selected and hand-built combination features turn patterns the model would otherwise struggle with into something it can pick up easily, which translates into better scores.
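As a small sketch of points 1 and 2, here are a ratio interaction and a frequency (count) encoding on a toy frame. The column names echo this dataset, but the values and the derived feature names are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'disbursed_amount': [50000, 64000, 45000, 64000],
    'asset_cost': [70000, 80000, 60000, 80000],
    'branch_id': [3, 7, 3, 3],
})

# Interaction feature: loan amount relative to asset cost.
df['disbursed_to_cost'] = df['disbursed_amount'] / df['asset_cost']
# Frequency (count) encoding for a high-cardinality categorical id.
df['branch_id_count'] = df['branch_id'].map(df['branch_id'].value_counts())
print(df[['disbursed_to_cost', 'branch_id_count']])
```

Count encoding is a common way to feed high-cardinality ids like supplier_id or employee_code_id to a tree model without one-hot blowup.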

In [9]
# Filter out useless features: the ID, the label, and the constant columns found above.
all_cols = [f for f in train.columns if f not in ['customer_id', 'loan_default', 'mobileno_flag', 'idcard_flag', 'disbursed_date']]
   

Building the baseline model:

This section shows how to stand up a competition baseline quickly with a tree model; feature engineering and model tuning should then be adapted to the specifics of the problem.

In [10]
# Training features
x_train = train[all_cols]
# Training labels
y_train = train['loan_default']
# Test set to predict on
x_test = test[all_cols]
In [11]
# Training-and-prediction routine
def train_predict(clf, train_x, train_y, test_x, clf_name='lgb'):
    # 5-fold cross-validation
    folds = 5
    seed = 2025
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)

    train = np.zeros(train_x.shape[0])
    test = np.zeros(test_x.shape[0])
    cv_scores = []
    for i, (train_index, valid_index) in enumerate(kf.split(train_x, train_y)):
        print('************************************ {} ************************************'.format(str(i + 1)))
        trn_x, trn_y, val_x, val_y = train_x.iloc[train_index], train_y[train_index], train_x.iloc[valid_index], train_y[valid_index]

        train_matrix = clf.Dataset(trn_x, label=trn_y)
        valid_matrix = clf.Dataset(val_x, label=val_y)
        # Tree-model parameters
        params = {
            'boosting_type': 'gbdt',
            'objective': 'binary',
            'metric': 'auc',
            'min_child_weight': 5,
            'num_leaves': 2 ** 7,
            'lambda_l2': 10,
            'feature_fraction': 0.9,
            'bagging_fraction': 0.9,
            'bagging_freq': 4,
            'learning_rate': 0.01,
            'seed': 2025,
            'n_jobs': -1,
            'verbose': -1,
        }
        # The early-stopping patience and evaluation interval should be tuned per problem
        model = clf.train(params, train_matrix, 50000, valid_sets=[train_matrix, valid_matrix],
                          verbose_eval=500, early_stopping_rounds=200)
        # Predict on the validation fold
        val_pred = model.predict(val_x, num_iteration=model.best_iteration)
        # Predict on the test set
        test_pred = model.predict(test_x, num_iteration=model.best_iteration)

        train[valid_index] = val_pred
        test += test_pred / kf.n_splits
        cv_scores.append(roc_auc_score(val_y, val_pred))
        # Print the per-fold validation scores so far
        print(cv_scores)
    print("%s_scotrainre_list:" % clf_name, cv_scores)
    print("%s_score_mean:" % clf_name, np.mean(cv_scores))
    print("%s_score_std:" % clf_name, np.std(cv_scores))
    # After training, print each feature's importance
    print(pd.DataFrame({
        'column': all_cols,
        'importance': model.feature_importance() / 5,
    }).sort_values(by='importance', ascending=False))
    return train, test
In [12]
# Train the model and predict
lgb_train, lgb_test = train_predict(lgb, x_train, y_train, x_test)
       
************************************ 1 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757221	valid_1's auc: 0.665608
Early stopping, best iteration is:
[648]	training's auc: 0.774819	valid_1's auc: 0.666395
[0.6663954692558639]
************************************ 2 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.756217	valid_1's auc: 0.6646
Early stopping, best iteration is:
[774]	training's auc: 0.786664	valid_1's auc: 0.665809
[0.6663954692558639, 0.6658088579217993]
************************************ 3 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757318	valid_1's auc: 0.664588
[1000]	training's auc: 0.809107	valid_1's auc: 0.665196
Early stopping, best iteration is:
[840]	training's auc: 0.794933	valid_1's auc: 0.665534
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231]
************************************ 4 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.758371	valid_1's auc: 0.650627
[1000]	training's auc: 0.809869	valid_1's auc: 0.652059
Early stopping, best iteration is:
[996]	training's auc: 0.809559	valid_1's auc: 0.652149
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985]
************************************ 5 ************************************
Training until validation scores don't improve for 200 rounds
[500]	training's auc: 0.757135	valid_1's auc: 0.662366
Early stopping, best iteration is:
[692]	training's auc: 0.779432	valid_1's auc: 0.662648
[0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_scotrainre_list: [0.6663954692558639, 0.6658088579217993, 0.6655342821383231, 0.652148910391985, 0.662648392749281]
lgb_score_mean: 0.6625071824914504
lgb_score_std: 0.005338481209206612
                           column  importance
18               employee_code_id      1421.0
15                    supplier_id      1374.6
14                      branch_id      1341.0
29            loan_to_asset_ratio      1307.0
12               disbursed_amount      1150.2
13                     asset_cost      1089.6
44                  year_of_birth       995.4
21                   credit_score       781.6
17                        area_id       760.6
39     outstanding_disburse_ratio       635.6
27                 credit_history       565.6
40            main_account_tenure       560.8
26                    average_age       560.8
22   main_account_monthly_payment       445.8
16                manufacturer_id       434.6
38          total_monthly_payment       371.6
3   main_account_outstanding_loan       339.4
43   active_to_inactive_act_ratio       304.6
35         total_outstanding_loan       264.8
36            total_sanction_loan       233.0
46                employment_type       228.6
4      main_account_sanction_loan       213.2
37           total_disbursed_loan       205.6
28                    enquirie_no       188.8
5     main_account_disbursed_loan       182.2
31   sub_account_inactive_loan_no       155.4
0            main_account_loan_no       155.4
25    last_six_month_defaulted_no       155.2
30          total_account_loan_no       152.6
42    disburse_to_sactioned_ratio       141.6
33  main_account_inactive_loan_no       134.4
1     main_account_active_loan_no       126.4
2         main_account_overdue_no       126.4
24     last_six_month_new_loan_no       122.8
47                            age       117.6
34               total_overdue_no        87.4
45                   Credit_level        53.4
19                   Driving_flag        27.0
23    sub_account_monthly_payment        12.4
41             sub_account_tenure        12.4
6             sub_account_loan_no        10.8
9    sub_account_outstanding_loan         8.0
20                  passport_flag         7.0
32         total_inactive_loan_no         5.6
10      sub_account_sanction_loan         5.6
11     sub_account_disbursed_loan         3.0
8          sub_account_overdue_no         0.2
7      sub_account_active_loan_no         0.2
In [13]
# Save the prediction file
sample_submit['loan_default'] = lgb_test
# The competition requires 0/1 outputs, so the predicted probabilities must be thresholded.
# Here anything above 0.25 becomes 1, and everything else becomes 0.
sample_submit['loan_default'] = sample_submit['loan_default'].apply(lambda x: 1 if x > 0.25 else 0).values
# Write the result file
sample_submit.to_csv('result.csv', index=False)
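Rather than hard-coding a cutoff like 0.25, the threshold can be chosen by maximizing F1 on the out-of-fold predictions (in this notebook, `lgb_train` against `y_train`). A self-contained sketch with synthetic scores standing in for the real ones:

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
# Synthetic labels and probabilities correlated with them (stand-ins for
# y_train and the out-of-fold predictions lgb_train).
y_true = rng.integers(0, 2, 2000)
oof_pred = np.clip(0.3 * y_true + rng.normal(0.35, 0.15, 2000), 0, 1)

# Grid-search the cutoff that maximizes F1 on the out-of-fold predictions.
best_t, best_f1 = 0.5, 0.0
for t in np.arange(0.05, 0.95, 0.01):
    f1 = f1_score(y_true, (oof_pred > t).astype(int))
    if f1 > best_f1:
        best_t, best_f1 = t, f1
print(f'best threshold {best_t:.2f}, F1 {best_f1:.4f}')
```

Because the fold predictions are out-of-fold, this tuning does not leak the training labels; the chosen `best_t` then replaces the 0.25 above.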



 2025-07-30
