一、台湾信贷研究目的与问题

1.1 基于聚类算法

基于台湾信贷数据，建立各种聚类模型，根据聚类状况来判断在台湾信贷数据上不同聚类方法的表现。并且在方法论上，探讨每种聚类方法的特点，优势和算法原理，比较和分析每种聚类算法的优缺点和适用情况。 ##1.2 基于分类算法随机森林、bagging、adaboost 等方法都是基于决策树的组合方法，但是这些方法和传统的统计思想有很大不同，如果针对具体数据集调整除合适模型来进行预测是本文的核心，特别是针对信贷数据需要把握的特点即精确度的要求，对这三种算法进行调试，并比较最终预测结果和这三种算法的优缺点。不论是 boosting 还是 bagging 当中，当使用的多个分类器的类型都是一致的，如何针对数据集特征挖掘不同算法和分类器的优势才能选择好的分类器。通过这篇报告，对这三种集成算法可以有更清晰的了解。 #二、台湾信贷数据集简介及变量说明本此大数据机器学习实验从UCI机器学习数据集库中选取了经典台湾信贷数据集，并利用该数据来建立目前比较通用聚类模型和集成模型。该鲍鱼数据集是对30000个信贷用户记录个人信息状况、以往信用记录以及银行卡交易状况的描述信息。

信贷数据将这30000条信贷用户记录分为两类：下一月份违约的用“1”标识。以及下一月份未违约的用“0”标识。该数据集的属性信息以及变量描述如表1-1所示，表中给定的是变量名称，变量说明，变量类型和取值范围等。

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\)表1-1 属性变量说明表

变量名	变量说明	变量类型	取值范围
X1	给定额度	数值型	1万-100万
X2	性别	分类型	1：男，2：女
X3	教育水平	分类型	1:graduate;2:university;3:high school;4:others
X4	婚姻状况	分类型	1:married;2:single;3:others
X5	年龄	数值型	21-79
X6-X11	还款状态	数值型	<0:按时还款，>0:超期相应月数还款
X12-X17	票据金额	数值型	X12表示2015-09，以此类推
X18-X23	先前付款额	数值型	X18表示2015-09，以此类推
Y	是否违约	分类型	0：未违约，1：违约

三、聚类算法与分类算法工具包载入

载入与本次试验相关的数据处理包’reshape2’,‘dplyr’,‘pryr’等,绘图包’ggplot2’,‘DiagrammeR’等，机器学习包’caret’,’xgboost’等以及性能测试包’pryr’等。由于本次试验需要对比各种机器学习算法的ROC曲线以及AUC值，所以进行了pROC的调用。

#首先需要进行xgboost包的安装，只需在R中运行install.packages(‘xgboost’)即可。
library(xgboost)      #导入极限梯度下降法包
library(caret)
library(ggplot2)      #绘图包
library(DiagrammeR)
library(reshape2)     #数据处理包
library(dplyr)
library(pryr) 
library(pROC)         #ROC曲线包
library(plyr)
library(showtext)     #使作图的字体更加丰富
library(RColorBrewer) #增加调色板
library(randomForest) #加载随机森林包
library(pryr)         #加载性能测试包
library(adabag)       #加载GBDT梯度提升树包
library(xgboost)      #加载XGBoost提升树包
library(DiagrammeR)
library(reshape2)
library(dbscan)
library(fpc)
library(sparcl)

四、台湾信贷数据可视化分析

4.1信贷数据预处理

#数据预处理
#数据说明
taiwan<- read.csv("D:/bigdatahw/算法/作业/card1.csv",head=TRUE,sep=',') #读取台湾信用数据
taiwan$SEX<-as.factor(taiwan$SEX)         
taiwan$EDUCATION<-as.factor(taiwan$EDUCATION)
taiwan$MARRIAGE<-as.factor(taiwan$MARRIAGE)
names(taiwan)[25]="default"              #重命名因变量
taiwan$default<-as.factor(taiwan$default)

levels(taiwan$SEX)=list(M="1",F="2")     #进行重命名等级
levels(taiwan$EDUCATION)=list(others="0",graduate="1",university="2", highschool="3",others="4",others="5",others="6")
levels(taiwan$MARRIAGE)=list(married="1", single="2",others="3",others="0")
levels(taiwan$default)=list(T="1", F="0")

4.2信贷数据描述性分析

4.2.1教育状况与违约占比性别分面风玫瑰图

从图中可以看出，不管是男性还是女性，信贷用户中大学毕业生人数比较多，其次是研究生。其他学历人数较少，还可以看出男性的信贷违约人数要少于女性，说明女性更容易违约。在途中还可以看出随着学历的逐渐上升，违约状况也有所改观，说明教育水平会提升信用状况。

label<-c("others","highschool","university","graduate")
taiwan$EDUCATION<- ordered(taiwan$EDUCATION, levels = label)
ggplot(taiwan,aes(x=default,fill=EDUCATION))+
  geom_bar()+coord_polar(theta = 'x')+
  scale_fill_brewer(palette='Spectral')+facet_wrap(~SEX)+theme_bw()+ 
  labs(x="违约状况",y="频数",fill="学历状况",title='分面风玫瑰图')+
  scale_x_discrete()+coord_polar(theta="x")+
  theme(plot.title = element_text(hjust = 
        0.5,family="myFont",size=18,color="gold3"),
        panel.background=element_rect(fill='aliceblue'))

4.2.2违约与年龄分布直方图

根据所受教育水平的不同，违约状况的的年龄分布也有所不同，在学历为其他的用户中违约的年龄分布比较均匀，在高中毕业的水平下，总体的年龄分布比较均匀，违约人的年龄分布也是比较均匀的，在大学的教育水平下，年龄分布比较不均匀，整体成右偏分布，而违约人的年龄分布峰较低，比较平缓。在研究生阶段的的年龄分布峰较高，成尖峰分布，且有右偏拖尾。整体的违约比率较低，分布比较平缓。

label<-c("F","T")
taiwan$default<- ordered(taiwan$default, levels = label)
p<-ggplot(taiwan,aes(x=AGE,fill=default)) 
p+geom_histogram(position="identity",alpha=0.5)+
  ggtitle('各学历违约状况与年龄分布直方图')+
  theme(plot.title = element_text(hjust = 
        0.5,family="myFont",size=18,color="red"),
        panel.background=element_rect(fill='aliceblue')) + 
  xlab("年龄") +ylab("频数")+
  facet_wrap(~EDUCATION)

4.2.3银行给定额度与违约箱线图

如图所示，白色的点代表着中位数，而象限图中的白线则代表着均值。对于给定的信用额度与违约的关系上来看，可以看出的关系是，对于未来不违约的人通常银行给予的信用额度是比较高的，说明银行在当时信用分配筛选时，是下过一番功夫的。因此会导致未来违约的人的信用额度偏低。

ggplot(taiwan,aes(x=default,y=LIMIT_BAL,fill=default))+
  geom_boxplot(outlier.size=1.5, outlier.shape=15,notch=TRUE,alpha=.35)+
  stat_summary (fun.y="mean",geom="point",shape=23,size=2,fill="white")+
  xlab("违约与否") + ylab("给定的信用额度")+
  ggtitle('给定信用额度与违约箱线图')+ylim(0,550000)+
  theme(plot.title = element_text(hjust = 0.5,  
        family="myFont",size=18,color="red"), 
        panel.background=element_rect(fill='aliceblue',color='black'))

4.2.4婚姻状况与违约之间关系

在图中可以看出婚姻状况与违约之间的关系，数量关系没有描述统计的必要，主要进行百分比之间的转化，在分面饼状图中可以清楚的看出，婚姻状况对违约并没有很大的影响，主要的影响是比较细微的，单身的违约概率是比较小的，大约占到总体的1/5左右，而已婚和其他的状况违约概率是几乎一样的，都占各自总体的1/4左右。

ggplot(taiwan,aes(x=factor(1),fill=default))+
  geom_bar(aes(fill=default),position="fill")+
  coord_polar(theta="y")+
  ggtitle('婚姻状况与违约之间关系')+
  theme(plot.title = element_text(hjust = 
       0.5,family="myFont",size=18,color="black"),      
       panel.background=element_rect(fill='aliceblue',color='black'))+
       facet_wrap(~MARRIAGE)

4.2.5教育与违约关系

在教育水平与违约状况环形图中，也可以看到，违约占比较高的教育水平为高中阶段，其次是大学，然后是研究生，相比之下，其他的教育水平违约占比较低，因此我们也可以大胆的推断other是一种更高的教育水平。有可能是博士阶段。各个学历水平的违约率都没超过1/4，other更是小于1/10。

ggplot(taiwan, aes(EDUCATION))+geom_bar(aes(fill=default),position="fill")+
  coord_polar(theta = "y")+ggtitle('教育水平与违约之间关系')+
  theme(plot.title = element_text(hjust = 
        0.5,family="myFont",size=18,color="black"),      
        panel.background=element_rect(fill='aliceblue',color='black'))

4.2.6年龄与给定额度关系

我们筛选的是具有违约状况的信用记录来进行分析。由于点的数量太多，因此绘制了分箱密度估计图，横坐标是违约用户的年龄，纵坐标是给定额度的对数值。来分析对于违约的用户，什么样的年龄和给定额度下违约的人数较多，可以在图中看到，红色区域对应着25岁给定额度为10的人数违约较多，其次还有25岁给定额度为11左右的人。因此银行在分配额度是应该注意这一点，利用密度图来规避风险，尽量少分配给25年龄段更多的信用。还可以看出给定额度与年龄的关系并不大，年龄小一样可以获得较高额度。

taiwan<-taiwan[which(taiwan$default=='T'),]
p=ggplot(taiwan,aes(x=AGE,y=log(LIMIT_BAL)))
#默认等高线图进行分箱化处理
p+geom_point(alpha=0.2)+stat_bin2d()+
  scale_fill_gradient(low="lightblue",high="red")+stat_density2d()+
  theme(plot.title = element_text(hjust = 
        0.5,family="myFont",size=18,color="slateblue2"),
        panel.background=element_rect(fill='papayawhip'))+
  labs(x='年龄',y='log(给定额度)',title='年龄与给定额度密度关系')

五、基于台湾信贷数据的聚类方法比较与分析

5.1 数据预处理

setwd('D:/bigdatahw/算法/作业')   #设置工作路径
taiwan=read.csv("card1.csv")
names(taiwan)[25]="default"       #重命名因变量
#特征变量性别、年龄、教育程度以及下月是否违约转换为因子变量
taiwan$default<- as.factor(taiwan$default)
taiwan$SEX <- as.factor(taiwan$SEX)
taiwan$EDUCATION <- as.factor(taiwan$EDUCATION)
taiwan$MARRIAGE <- as.factor(taiwan$MARRIAGE)
summary(taiwan)

##        ID          LIMIT_BAL       SEX       EDUCATION MARRIAGE 
##  Min.   :    1   Min.   :  10000   1:11888   0:   14   0:   54  
##  1st Qu.: 7501   1st Qu.:  50000   2:18112   1:10585   1:13659  
##  Median :15000   Median : 140000             2:14030   2:15964  
##  Mean   :15000   Mean   : 167484             3: 4917   3:  323  
##  3rd Qu.:22500   3rd Qu.: 240000             4:  123            
##  Max.   :30000   Max.   :1000000             5:  280            
##                                              6:   51            
##       AGE            PAY_0             PAY_2             PAY_3        
##  Min.   :21.00   Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000  
##  1st Qu.:28.00   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000  
##  Median :34.00   Median : 0.0000   Median : 0.0000   Median : 0.0000  
##  Mean   :35.49   Mean   :-0.0167   Mean   :-0.1338   Mean   :-0.1662  
##  3rd Qu.:41.00   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000  
##  Max.   :79.00   Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000  
##                                                                       
##      PAY_4             PAY_5             PAY_6           BILL_AMT1      
##  Min.   :-2.0000   Min.   :-2.0000   Min.   :-2.0000   Min.   :-165580  
##  1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:-1.0000   1st Qu.:   3559  
##  Median : 0.0000   Median : 0.0000   Median : 0.0000   Median :  22382  
##  Mean   :-0.2207   Mean   :-0.2662   Mean   :-0.2911   Mean   :  51223  
##  3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.: 0.0000   3rd Qu.:  67091  
##  Max.   : 8.0000   Max.   : 8.0000   Max.   : 8.0000   Max.   : 964511  
##                                                                         
##    BILL_AMT2        BILL_AMT3         BILL_AMT4         BILL_AMT5     
##  Min.   :-69777   Min.   :-157264   Min.   :-170000   Min.   :-81334  
##  1st Qu.:  2985   1st Qu.:   2666   1st Qu.:   2327   1st Qu.:  1763  
##  Median : 21200   Median :  20089   Median :  19052   Median : 18105  
##  Mean   : 49179   Mean   :  47013   Mean   :  43263   Mean   : 40311  
##  3rd Qu.: 64006   3rd Qu.:  60165   3rd Qu.:  54506   3rd Qu.: 50191  
##  Max.   :983931   Max.   :1664089   Max.   : 891586   Max.   :927171  
##                                                                       
##    BILL_AMT6          PAY_AMT1         PAY_AMT2          PAY_AMT3     
##  Min.   :-339603   Min.   :     0   Min.   :      0   Min.   :     0  
##  1st Qu.:   1256   1st Qu.:  1000   1st Qu.:    833   1st Qu.:   390  
##  Median :  17071   Median :  2100   Median :   2009   Median :  1800  
##  Mean   :  38872   Mean   :  5664   Mean   :   5921   Mean   :  5226  
##  3rd Qu.:  49198   3rd Qu.:  5006   3rd Qu.:   5000   3rd Qu.:  4505  
##  Max.   : 961664   Max.   :873552   Max.   :1684259   Max.   :896040  
##                                                                       
##     PAY_AMT4         PAY_AMT5           PAY_AMT6        default  
##  Min.   :     0   Min.   :     0.0   Min.   :     0.0   0:23364  
##  1st Qu.:   296   1st Qu.:   252.5   1st Qu.:   117.8   1: 6636  
##  Median :  1500   Median :  1500.0   Median :  1500.0            
##  Mean   :  4826   Mean   :  4799.4   Mean   :  5215.5            
##  3rd Qu.:  4013   3rd Qu.:  4031.5   3rd Qu.:  4000.0            
##  Max.   :621000   Max.   :426529.0   Max.   :528666.0            
##

## 定义标准化函数，将原始数据压缩到0-1之间，同一量纲
myfun=function(x)
{
  m1=max(x)
  m2=min(x)
  y=(x-m1)/(m2-m1)
  return(y)
}
## 抽取部分样本作为建模数据，并进行标准化处理
set.seed(123)
s=sample(1:dim(taiwan)[1],500)
taiwan_num=apply(taiwan[s,c(2,6:24)],2,myfun)    #聚类数值数据
taiwan_char_num=cbind(taiwan_num,taiwan[s,3:5])  #聚类数值+分类数据

5.2AGNES算法层次聚类

## 由于聚类分析是无监督分类，无法知道聚类结果到底是属于哪一类。
## 所以要定义函数变换混淆矩阵，使得对角线上的元素最大，然后输
## 出混淆矩阵，以便确定聚类结果分别属于哪一个类。定义函数找出
## 对角线上数量最大的矩阵
myfun1=function(a,b)
{
  table<-table(a, b)  
  Accuracy1=table[1,1]+table[2,2]
  Accuracy2=table[1,2]+table[2,1]
  if(Accuracy1>Accuracy2){
    confusion<-table(a,b)
    rownames(confusion)<-c('未违约','违约')
    colnames(confusion)<-c('未违约','违约')
    return(confusion)}
  else{ label<-c("2","1")
        b<- ordered(b, levels = label)
        confusion<-table(a,b)
        rownames(confusion)<-c('未违约','违约')
        colnames(confusion)<-c('未违约','违约')
        return(confusion)}
}
d<-dist(taiwan_num)               #计算样本之间的欧式距离
hc<-hclust(d, method="complete")  #根据距离进行聚类,类间距离定义为全连接
plot(hc)                          #显示聚类结果图

hccut<-cutree(hc,k=2)             #将类别数定为2
myfun1(taiwan[s,]$default, hccut) #需要变0-1

##         b
## a        未违约 违约
##   未违约    381    1
##   违约      118    0

5.3Kmeans聚类

kc<-kmeans(taiwan_num,2)                 #k均值聚类将样本聚成2类
myfun1(taiwan[s,]$default, kc$cluster)

##         b
## a        未违约 违约
##   未违约    271  111
##   违约       58   60

#显示混淆矩阵，可以调整以下顺序就会不同了，遍历一下

5.4DBSCAN基于密度聚类

##确定最优eps
kNNdistplot(taiwan_num,k=2)
abline(h=0.3)

#可达距离eps设置为0.45，最小可达数为50，也就是说核心点的最小可达点最少要50个
ds=dbscan(taiwan_num,eps=0.1,MinPts=500,
          scale=TRUE,showplot=TRUE,method="raw")     #显示聚类图谱图

ds                                                   #显示聚类结果

## dbscan Pts=500 MinPts=500 eps=0.1
## 
##   0 
## 500

table(taiwan[s,]$default, ds$cluster)

##    
##       0
##   0 382
##   1 118

#optics算法，载入dbscan层次聚类包，采用的算法为optics,eps可达距离为1。
#ε邻域大小的上限。限制邻域大小可以提高性能，而且只要设置得不太低，就
#不会对排序产生什么影响。稀疏K均值聚类
opt<-optics(d, eps=1, minPts=4) 
plot(opt)                       #可达距离图

5.5稀疏聚类

5.5.1Kmeans稀疏聚类

perm <- KMeansSparseCluster.permute(taiwan_num,K=2,wbounds=seq(1.5,7.5,0.5),nperms=3)#选择最优的tunings

## 101234201234301234015016017018019011001110112011301
## Permutation  1 of  3
## 10122013014015016017018019011001110112011301
## Permutation  2 of  3
## 10122013014015016017018019011001110112011301
## Permutation  3 of  3
## 10122013014015016017018019011001110112011301

print(perm)

## Tuning parameter selection results for Sparse K-means Clustering:
##    Wbound # Non-Zero W's Gap Statistic Standard Deviation
## 1     1.5              6       -0.6080                  0
## 2     2.0              6       -0.4944                  0
## 3     2.5             12       -0.4264                  0
## 4     3.0             20       -0.4047                  0
## 5     3.5             20       -0.4047                  0
## 6     4.0             20       -0.4047                  0
## 7     4.5             20       -0.4047                  0
## 8     5.0             20       -0.4047                  0
## 9     5.5             20       -0.4047                  0
## 10    6.0             20       -0.4047                  0
## 11    6.5             20       -0.4047                  0
## 12    7.0             20       -0.4047                  0
## 13    7.5             20       -0.4047                  0
## Tuning parameter that leads to largest Gap statistic:  3

km<- KMeansSparseCluster(taiwan_num,K=2,wbounds=perm$bestw)#在最优的tuning下求解最优权重，并聚类

## 0123

print(km)

## Wbound is  3 :
## Number of non-zero weights:  20
## Sum of weights:  2.996815
## Clustering:  1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 1 2 2 1 
## 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 2 2 2 1 1 1 1 2 2 2 2 1 1 
## 1 1 2 2 1 2 2 1 2 1 2 1 1 2 1 2 2 2 2 1 2 1 1 1 2 1 1 2 2 2 1 2 1 2 2 1 1 
## 1 1 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 2 2 1 1 1 2 2 1 1 2 2 1 2 1 1 
## 1 1 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 1 
## 1 1 2 1 1 2 2 1 1 2 1 1 1 1 1 2 2 1 2 1 1 1 1 2 1 2 2 1 1 2 2 1 1 1 1 2 1 
## 2 2 1 1 2 2 1 1 1 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 1 1 1 
## 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1 2 1 
## 2 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 1 2 1 2 1 1 2 1 1 2 1 1 1 1 1 
## 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 1 1 1 1 2 2 2 1 1 2 2 1 1 1 2 1 1 
## 1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 
## 1 1 1 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 2 2 1 2 1 1 2 1 2 1 1 1 2 1 1 2 2 1 1 
## 1 1 1 1 1 1 2 2 1 1 1 2 2 1 2 1 2 1 2 2 2 2 1 1 1 1 2 1 2 1 1 1 1 2 1 2 1 
## 1 1 1 1 1 1 1 1 2 1 1 2 2 2 2 2 1 1 1 1 1 1 1 2 1

barplot(km[[1]][[1]])#可以看到哪些特征非零

myfun1(taiwan[s,]$default, km[[1]][[2]])

##         b
## a        未违约 违约
##   未违约    259  123
##   违约       86   32

5.5.2层次稀疏聚类

#稀疏层次聚类的调优参数选择
perm.out <- HierarchicalSparseCluster.permute(as.matrix(taiwan_num), wbounds=c(1.5,2:6),nperms=10)

## Running sparse hierarchical clustering on unpermuted data
## 123456
## Running sparse hierarchical clustering on permuted data
## Permutation  1  of  10
## 123456
## Permutation  2  of  10
## 123456
## Permutation  3  of  10
## 123456
## Permutation  4  of  10
## 123456
## Permutation  5  of  10
## 123456
## Permutation  6  of  10
## 123456
## Permutation  7  of  10
## 123456
## Permutation  8  of  10
## 123456
## Permutation  9  of  10
## 123456
## Permutation  10  of  10
## 123456

print(perm.out)

## Tuning parameter selection results for Sparse Hierarchical Clustering:
##   Wbound # Non-Zero W's Gap Statistic Standard Deviation
## 1    1.5              5        0.0164              5e-04
## 2    2.0              7        0.0564              5e-04
## 3    3.0             14        0.1203              4e-04
## 4    4.0             20        0.1479              4e-04
## 5    5.0             20        0.1479              4e-04
## 6    6.0             20        0.1479              4e-04
## Tuning parameter that leads to largest Gap statistic:  4

plot(perm.out)

# 执行稀疏层次聚类
sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists,
                                      wbound=perm.out$bestw, method="complete")

## 12345678910

par(mfrow=c(1,2))
plot(sparsehc)

plot(sparsehc$hc)
print(sparsehc)

## Wbound is  4 :
## Number of non-zero weights:  20
## Sum of weights:  3.81629

cutree(sparsehc$hc,2)

##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
##  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2   1   1 
##  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54 
##   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1 
##  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 
##   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 
##   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1 
## 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 
##   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   2   1   1 
## 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   1   1   2   1   1   1   1   1   1   1   1   2   1   1   1   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 
##   1   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1 
## 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 
##   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 
##   2   1   1   1   1   1   1   1   1   1   2   1   1   1   1   1   2   2 
## 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 
##   1   1   2   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1 
## 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 
##   1   1   2   1   1   1   1   1   1   1   1   2   1   1   1   1   2   1 
## 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   2 
## 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
## 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 
##   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1   1   1   1 
## 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1

myfun1(taiwan[s,]$default,cutree(sparsehc$hc,2))

##         b
## a        未违约 违约
##   未违约    366   16
##   违约      111    7

# 使用类标签知识来比较真实类
#获得聚类的标签
par(mfrow=c(1,1))

y = cutree(hc, 3)
ColorDendrogram(hc,y=y,main="My Simulated Data",branchlength=.007)

#现在，如果我们想看看数据是否包含了一个*二次聚类，
#在计算第一个得到的数据之后。我们寻找互补稀疏聚类：
# 重做分析，但这次使用“绝对值”不同：
perm.out <- HierarchicalSparseCluster.permute(as.matrix(taiwan_num),
                                              wbounds=c(1.5,2:6),
                                              nperms=5, 
                                              dissimilarity="absolute.value")

## Running sparse hierarchical clustering on unpermuted data
## 123456
## Running sparse hierarchical clustering on permuted data
## Permutation  1  of  5
## 123456
## Permutation  2  of  5
## 123456
## Permutation  3  of  5
## 123456
## Permutation  4  of  5
## 123456
## Permutation  5  of  5
## 123456

print(perm.out)

## Tuning parameter selection results for Sparse Hierarchical Clustering:
##   Wbound # Non-Zero W's Gap Statistic Standard Deviation
## 1    1.5              3        0.0037              2e-04
## 2    2.0              9        0.0138              1e-04
## 3    3.0             13        0.0420              1e-04
## 4    4.0             20        0.0677              1e-04
## 5    5.0             20        0.0677              1e-04
## 6    6.0             20        0.0677              1e-04
## Tuning parameter that leads to largest Gap statistic:  4

plot(perm.out)

# 执行稀疏层次聚类
sparsehc <- HierarchicalSparseCluster(dists=perm.out$dists, wbound=perm.out$bestw, method="complete", dissimilarity="absolute.value")

## 123456

par(mfrow=c(1,2))
plot(sparsehc)

5.6双向聚类

5.6.1数据预处理

##双向聚类
##数据预处理,将分类变量处理成独热编码
library(caret)
taiwan=read.csv("card1.csv")
ohe_feats = c('SEX', 'EDUCATION', 'MARRIAGE')
taiwan$SEX<-as.factor(taiwan$SEX)
taiwan$EDUCATION<-as.factor(taiwan$EDUCATION)
taiwan$MARRIAGE<-as.factor(taiwan$MARRIAGE)
dummies <- dummyVars(~ SEX +  EDUCATION + MARRIAGE, data = taiwan)
df_all_ohe <- as.data.frame(predict(dummies, newdata = taiwan))
taiwan_bi <- cbind(df_all_ohe,taiwan_num)

5.6.2双向聚类分析

library(biclust)
taiwan_bi=as.matrix(taiwan_bi[s,])
heatmap(taiwan_bi)

bidata=binarize(taiwan_bi)                 #处理成二分数据

## [1] "Threshold:  0.5"

bic<-biclust(bidata,method=BCBimax(),minr=400, minc=3)
bic #以BIMAX方法进行聚类

## 
## An object of class Biclust 
## 
## call:
##  biclust(x = bidata, method = BCBimax(), minr = 400, minc = 3)
## 
## Number of Clusters found:  100 
## 
## First  5  Cluster sizes:
##                    BC 1 BC 2 BC 3 BC 4 BC 5
## Number of Rows:     498  498  497  496  495
## Number of Columns:    3    3    4    4    5

heatmapBC(x = bidata,
          bicResult = bic,main="BIMAX算法双向聚类热力图")

## [1] "yto 497"
## [1] "xlo 2"
## [1] "yto 0"
## [1] "xlo 1"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 1"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 1"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 2"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 1"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 1"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"
## [1] "yto 0"
## [1] "xlo 0"

bic1<-biclust(bidata,method=BCCC(), number=2)
bic1 #以CC方法进行聚类

## 
## An object of class Biclust 
## 
## call:
##  biclust(x = bidata, method = BCCC(), number = 2)
## 
## There was one cluster found with
##   500 Rows and  33 columns

##heatmapBC(bidata,bic1,main="CC算法双向聚类热力图")  #做出热力图

5.7聚类算法对比

在第五章中，基于台湾信贷数据进行了不同方式的聚类。不同的聚类方式对类别间的差异程度的衡量有所不同，层次聚类和k均值聚类是根据点与点之间的距离进行聚类的，衡量距离的定义有很多，本章中采用的都是欧氏距离。而Dbscan聚类是基于密度的，将密度大的区域作为一个类，而将密度小的区域作为类与类的边界。

在基于距离的聚类中，AGNES层次聚类是一种自下而上的一种层次能聚方法，不断地将点通过相似的距离进行融合，而k均值聚类则是通过给定的k个类，不断迭代，使规划目标类间距离最小，从而找到核心点，确定类别。

而基于密度聚类的Dbscan聚类相较于基于距离聚类的优势就是可以做出任何形式的类，而不需要提前给出类的个数K的个数。缺点就单纯的依靠点与点之间的密度进行距离，衡量的是某一块区域之内的点的个数，缺乏现实意义，聚类比较盲目，也可以再基于台湾的信贷数据上看到，的确其准确度较低。

对于层次和k均值聚类，使用的都是数据的全部特征，这样做不但会增加计算量是计算效率降低，还会使得聚类效果不明显，忽略一些重要的变量，稀疏聚类的优势就是通过增加W权重，并对权重进行范数为1的限制，来对每一个特征指标在聚类中的表现进行限制。会突出一别变量的作用，在特征变量较多的状况的基因数据下，表现较好。起到数据压缩的效果。

双向聚类则是对观测和指标都进行聚类，以这种方式来反映一些变量在某些指标下的相似状况，而非全部特征下，因为能在全部指标下某些观测是不具有相似性的，所以更加具有实际意义。且尚香聚类的BIMAX在处理系数矩阵时具有独特的优势，在储存空间和运算时间上都较少。CC算法对BIMX的改进再在于，它弥补了在观测和指标较多是进行单列，但行的单节点删除的BIMAX法的运行效率较低，采用了大规模的多节点删除算法和节点添加算法，在大数据量下的表现会更加好。运行时间会进一步缩短。最后，基于台湾信贷数据的聚类运算结果表如下表所示：

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\)表1-2 信贷数据聚类算法比较表

	准确率	运行时间（s）
层次聚类	76.2%	0.25
k-means聚类	66.2%	0.85
dbcan密度聚类	76.4%	0.67
k-means稀疏聚类	58.2%	0.94
层次稀疏聚类	74.6%	0.45

从对比表来看，基于台湾信贷数据上来看，在准确率上密度聚类的效果最好，而从运行时间上来看，层次聚类由于计算复杂小，所以耗费时间最短。

六、基于台湾信贷数据的分类算法的比较与分析

6.1 随机森林算法

随机森林是一种很灵活实用的方法，它有如下几个特点：

在当前所有算法中，具有极好的准确率
能够有效地运行在大数据集上
能够处理具有高维特征的输入样本，而且不需要降维
能够评估各个特征在分类问题上的重要性
在生成过程中，能够获取到内部生成误差的一种无偏估计
对于缺省值问题也能够获得很好得结果

6.1.1 随机森林数据预处理

#数据预处理
alldata <- read.csv("D:/bigdatahw/算法/作业/card1.csv",head=TRUE,sep=',')
alldata$default.payment.next.month <- as.factor(alldata$default.payment.next.month)
alldata$SEX <- as.factor(alldata$SEX)
alldata$EDUCATION <- as.factor(alldata$EDUCATION)
alldata$MARRIAGE <- as.factor(alldata$MARRIAGE)
alldata$PAY_0 <- as.factor(alldata$PAY_0)
alldata$PAY_2 <- as.factor(alldata$PAY_2)
alldata$PAY_3 <- as.factor(alldata$PAY_3)
alldata$PAY_4 <- as.factor(alldata$PAY_4)
alldata$PAY_5 <- as.factor(alldata$PAY_5)
alldata$PAY_6 <- as.factor(alldata$PAY_6)
#切分训练集和测试集
n <- sample(1:nrow(alldata),0.8*nrow(alldata))
train <- alldata[n,-1]
test <- alldata[-n,-1]

6.1.2 随机森林实证检验

rf <- randomForest(default.payment.next.month~.,data = train,
                   mtry=9,ntree=200,importance = T)
#查看训练集错误率随树的数量变化情况
plot(rf)

#查看训练集平均错误率
mean(rf$err.rate)

## [1] 0.29328

#训练集的混淆矩阵
rf$confusion

##       0    1 class.error
## 0 17563 1126  0.06024934
## 1  3271 2040  0.61589155

#变量重要性
imp <- importance(x=rf)
imp

##                    0          1 MeanDecreaseAccuracy MeanDecreaseGini
## LIMIT_BAL 17.2029543  9.7047482            21.266565        475.47075
## SEX        0.5943296  1.7876208             1.380573         69.81759
## EDUCATION  4.3848673 -0.8927275             3.416634        155.05200
## MARRIAGE   8.5000092 -2.0870822             6.802100         81.16270
## AGE       16.7357291  2.1579620            15.888386        558.61180
## PAY_0     68.6409536 20.7379376            89.123802        980.54191
## PAY_2     24.3634091 -1.5713306            24.749841        387.41056
## PAY_3     22.3145536 -0.6133761            23.650806        224.12159
## PAY_4     25.7885982 -3.9628031            26.276405        137.14523
## PAY_5     18.7097157  4.9360715            22.120779        124.00833
## PAY_6     20.9214964  9.3434181            26.775247        116.52891
## BILL_AMT1 18.8808991 11.8397139            24.933455        519.15213
## BILL_AMT2 29.6618252 -6.7161920            32.357146        434.70636
## BILL_AMT3 27.7400577 -4.9780282            30.980817        400.06105
## BILL_AMT4 31.3560986 -8.6779579            34.183192        392.65935
## BILL_AMT5 25.4924452 -0.2068033            30.243083        390.36588
## BILL_AMT6 23.8797758  1.5465705            29.403866        402.75028
## PAY_AMT1  20.7704798 -4.2723432            22.589126        419.08704
## PAY_AMT2  19.5958596  2.8907808            22.421400        398.55678
## PAY_AMT3  21.8057086  6.1405462            26.279318        386.16816
## PAY_AMT4  17.6185767  5.4049239            22.768492        365.50157
## PAY_AMT5  22.1770788  0.2160371            26.252567        370.52992
## PAY_AMT6  25.1897353  4.5908928            26.634603        398.01621

varImpPlot(rf)

#运行时间和消耗内存
#timecost <- system.time(randomForest(default.payment.next.month~.,
                                     #data = train,
                                     #mtry=9,ntree=500,
                                     #importance = T))
#print(timecost)
#定义评价函数
index2=function(table) {
  Accuracy=table[1,1]+table[2,2]
  #真阳性率+真阴性率=预测准确率
  precision=table[2,2]/sum(table[,2]) 
  #预测阳性率=精确度
  recall=table[2,2]/sum(table[2,]) 
  #第二类错误，灵敏度 =召回率
  F_measure=2*precision*recall/(precision+recall)
  #对前两者进行加权，综合结果
  results=data.frame(Accuracy=Accuracy,
                     recall=recall,
                     precision=precision,
                     F_measure=F_measure)
  return(results)
}
#测试集进行检验
pred1 <- predict(rf,test[,-25])
#测试集混淆矩阵
table(pred1,test$default.payment.next.month)

##      
## pred1    0    1
##     0 4387  827
##     1  288  498

#测试集评价
rf.pred= predict(rf,newdata=test[,-25]) 
rf.real=test$default.payment.next.month
table_RF=table(rf.real,rf.pred)/nrow(test)
a=index2(table_RF)
print(a)

##    Accuracy    recall precision F_measure
## 1 0.8146667 0.3766038 0.6356688 0.4729858

#ROC曲线
library(ROCR)
re.pred= predict(rf,newdata=test[,-25],type = "prob")
rocr = prediction(as.vector(re.pred[,2]),
                  test$default.payment.next.month)
AUC = performance(rocr,"auc")
print(AUC@y.values[[1]])

## [1] 0.7620552

ROC = performance(rocr,"tpr","fpr")
plot(ROC,main="ROC plot")
text(0.5,0.5,paste("AUC = ",
                   format(AUC@y.values[[1]],
                          digits=5,
                          scientific=FALSE)))

6.2 Adaboost集成算法

AdaBoost 采取加权多数表决的方法。具体地，加大分类误差率小的弱分类器的权值，使其在表决中起较大的作用，减小分类误差率大的弱分类器的权值，使其在表决中起较小的作用。

6.2.1 Adaboost数据预处理

credit <- read.csv("D:/bigdatahw/算法/作业/card1.csv",head=TRUE,sep=',')
credit<-credit[,-1] #去除id行
names(credit)[24] <- 'label'
credit$label <- as.factor(credit$label)
credit$SEX <- as.factor(credit$SEX)
credit$EDUCATION <- as.factor(credit$EDUCATION)
credit$MARRIAGE <- as.factor(credit$MARRIAGE)
credit$PAY_0 <- as.factor(credit$PAY_0)
credit$PAY_2 <- as.factor(credit$PAY_2)
credit$PAY_3 <- as.factor(credit$PAY_3)
credit$PAY_4 <- as.factor(credit$PAY_4)
credit$PAY_5 <- as.factor(credit$PAY_5)
credit$PAY_6 <- as.factor(credit$PAY_6)

#按标签比例各20%筛选训练集与测试集
idx0 <- which(credit$label == 1)
idx1 <- which(credit$label == 0)

cls0 <- sample(idx0,round(0.2 * length(idx0)))
cls1 <- sample(idx1,round(0.2 * length(idx1)))

tst <- credit[c(cls0,cls1),]
trn <- credit[-c(cls0,cls1),]

#查看各数据矩阵维度
dim(credit)

## [1] 30000    24

dim(tst)

## [1] 6000   24

dim(trn)

## [1] 24000    24

6.2.2 Adaboost实证检验

now <- Sys.time()
mem_change(model.AdaBoost <- boosting(label~.,data = trn))

## 48.4 MB

Sys.time() - now

## Time difference of 6.846082 mins

model.pred <- predict(model.AdaBoost,newdata = tst,type='class')
model.pred$confusion      #查看混淆矩阵

##                Observed Class
## Predicted Class    0    1
##               0 4454  844
##               1  219  483

cal=function(table) {
  Accuracy=table[1,1]+table[2,2] 
  #真阳性率+真阴性率=预测准确率
  precision=table[2,2]/sum(table[,2]) 
  #预测阳性率=精确度
  recall=table[2,2]/sum(table[2,]) 
  #第二类错误，灵敏度 =召回率
  F_measure=2*precision*recall/(precision+recall)
  #对前两者进行加权，综合结果
  results=data.frame(Accuracy=Accuracy,recall=recall,precision=precision,F_measure=F_measure)
  return(results)
}
cal(model.pred$confusion / 6000)

##    Accuracy    recall precision F_measure
## 1 0.8228333 0.6880342 0.3639789 0.4760966

6.3 XGBoost集成算法

XGBoost 算法总结起来大致其有三个优点：高效、准确度、模型的交互性：

正则化：标准 GBDT 提升树算法的实现没有像 XGBoost 这样的正则化步骤。正则化用于控制模型的复杂度，对减少过拟合也是有帮助的。XGBoost 也正是以“正则化提升”技术而闻名。
并行处理：XGBoost 可以实现并行处理，相比 GBM 有了速度的飞跃。不过，需要注意 XGBoost 的并行不是 tree 粒度的并行，XGBoost 也是一次迭代完才能进行下一次迭代的（第 t 次迭代的代价函数里包含了前面 t-1 次迭代的预测值）。XGBoost 的并行是在特征粒度上的。决策树的学习最耗时的一个步骤就是对特征的值进行排序（因为要确定最佳分割点）。因此 XGBoost 在 R 重定义了一个自己数据矩阵类 DMatrix。XGBoost 在训练之前，预先对数据进行了排序，然后保存为 block 结构，后面的迭代中重复利用索引地使用这个结构，获得每个节点的梯度，大大减小计算量。这个 block 结构也使得并行成为了可能，在进行节点的分裂时，需要计算每个特征的增益，最终选增益最大的那个特征去做分裂，那么各个特征的增益计算就可以开多线程进行。
高度灵活性：XGBoost 允许用户定义自定义优化目标和评价标准，它对模型增加了一个全新的维度，所以我们的处理不会受到任何限制。
缺失值处理：XGBoost 内置处理缺失值的规则。用户需要提供一个和其它样本不同的值，然后把它作为一个参数传进去，以此来作为缺失值的取值。XGBoost 在不同节点遇到缺失值时采用不同的处理方法，并且会学习未来遇到缺失值时的处理方法。
剪枝：当分裂时遇到一个负损失时，传统 GBDT 会停止分裂。因此传统 GBDT 实际上是一个贪心算法。XGBoost 会一直分裂到指定的最大深度 (max_depth)，然后回过头来剪枝。如果某个节点之后不再有正值，它会去除这个分裂。这种做法的优点，当一个负损失（如-2）后面有个正损失（如+10）的时候，就显现出来了。GBM 会在-2 处停下来，因为它遇到了一个负值。但是 XGBoost 会继续分裂，然后发现这两个分裂综合起来会得到+8，因此会保留这两个分裂。
内置交叉验证：XGBoost 允许在每一轮 boosting 迭代中使用交叉验证。因此，可以方便地获得最优 boosting 迭代次数。而传统的 GBDT 使用网格搜索，只能检测有限个值

6.3.1 XGBoost数据预处理

#读取训练数据集与测试数据集
taiwan<- read.csv("D:/bigdatahw/算法/作业/card1.csv",head=TRUE,sep=',') #读取台湾信用数据
set.seed(1234)      #设置随机种子，可以重复实验结果
tra=sample(1:nrow(taiwan), round(0.8* nrow(taiwan)))  #抽取80%,24000个数据作为训练集
train=taiwan[tra,-1]
test=taiwan[-tra,-1]
#独热编码分类特征
ohe_feats = c('SEX', 'EDUCATION', 'MARRIAGE')
train$SEX<-as.factor(train$SEX)
train$EDUCATION<-as.factor(train$EDUCATION)
train$MARRIAGE<-as.factor(train$MARRIAGE)
dummies <- dummyVars(~ SEX +  EDUCATION + MARRIAGE, data = train)
df_all_ohe <- as.data.frame(predict(dummies, newdata = train))
train <- cbind(df_all_ohe,train[,-c(2,3,4)])

ohe_feats = c('SEX', 'EDUCATION', 'MARRIAGE')
test$SEX<-as.factor(test$SEX)
test$EDUCATION<-as.factor(test$EDUCATION)
test$MARRIAGE<-as.factor(test$MARRIAGE)
dummies <- dummyVars(~ SEX +  EDUCATION + MARRIAGE, data = test)
df_all_ohe <- as.data.frame(predict(dummies, newdata = test))
test<- cbind(df_all_ohe,test[,-c(2,3,4)])

#切分训练特征与回归特征
x<-train[,1:33]
y<-train[,34]
x<-apply(x,2,as.numeric)        #将x的列变量转化为数字型，xgboost对数据类型要求严格
y<-as.numeric(y)             #将y也转化为数字型变量

6.3.2 XGBoost实证检验

# xgboost训练过程
#也可以利用函数xgb.cv对数据进行交叉验证分析，利用交叉验证确定均方误差,
#利用交叉检验确定最佳迭代次数，利用10折交叉验证
cv.res<-xgb.cv(data=x,label=y,max.depth=2,eta=1,nround=15,objective='binary:logistic',nfold=10)

## [1]  train-error:0.178583+0.000810   test-error:0.179375+0.007181 
## [2]  train-error:0.177801+0.000793   test-error:0.178583+0.008006 
## [3]  train-error:0.179069+0.001651   test-error:0.181583+0.008055 
## [4]  train-error:0.177750+0.001285   test-error:0.179042+0.008811 
## [5]  train-error:0.177565+0.001132   test-error:0.179375+0.009150 
## [6]  train-error:0.177514+0.001034   test-error:0.179334+0.009210 
## [7]  train-error:0.177310+0.001087   test-error:0.179583+0.009430 
## [8]  train-error:0.177361+0.001135   test-error:0.179042+0.009140 
## [9]  train-error:0.176968+0.001048   test-error:0.178833+0.008885 
## [10] train-error:0.176893+0.001019   test-error:0.180375+0.008211 
## [11] train-error:0.177074+0.000786   test-error:0.180000+0.008580 
## [12] train-error:0.176866+0.000871   test-error:0.180292+0.008904 
## [13] train-error:0.176616+0.001061   test-error:0.180458+0.008542 
## [14] train-error:0.176717+0.000971   test-error:0.180500+0.008186 
## [15] train-error:0.176495+0.000946   test-error:0.179750+0.007692

cv.res <- as.data.frame(cv.res$evaluation_log)
cv.res<-melt(cv.res[,c(1,2,4)],id = c("iter"),     
       variable.name = "type", 
       value.name = "cases",
       na.rm = TRUE)                #控制两列变量在新表中不动新生成的变量用case标识
ggplot(data=cv.res, aes(x=iter, y=cases, group=type, colour=type)) +
  geom_line(size=1) +
  geom_point() +
  xlab("决策树棵数") + ylab("均方误差")+
  ggtitle('交叉检验确定最优棵数')+
  theme(plot.title = element_text(hjust =0.5,family="myFont",size=20,color="red"), 
        panel.background=element_rect(fill='aliceblue',color='black'),panel.grid.minor = element_blank())

#可以从10折交叉验证结果来看，当决策树数量（也就是迭代次数）
#到达8次后测试误差率开始上升说明XGBoost模型开始过于复杂，
#甚至出现了一点过拟合迹象因此选择决策树棵数也就是迭代次数为8次。

#定义XGBoost模型测试集结果评价函数，
#包含Accuracy，recall，precision，F_measure四个评价指标
#Accuracy总正确率，recall后验准确率，precision先验准确率，F_measure指数
index2=function(table) {
  Accuracy=table[1,1]+table[2,2]
  precision=table[2,2]/sum(table[,2])
  recall=table[2,2]/sum(table[2,])
  F_measure=2*precision*recall/(precision+recall)#计算Recall，Precision和F-measure
  results=data.frame(Accuracy=Accuracy,recall=recall,precision=precision,F_measure=F_measure)
  return(results)
}

timestart<-Sys.time();
bst<-xgboost(data=x,label=y,max.depth=2,eta=1,nround=8,objective='binary:logistic')

## [1]  train-error:0.178500 
## [2]  train-error:0.178000 
## [3]  train-error:0.179375 
## [4]  train-error:0.178583 
## [5]  train-error:0.177208 
## [6]  train-error:0.178125 
## [7]  train-error:0.177875 
## [8]  train-error:0.177208

#在此处，根据交叉验证结果我们选取最大深度为2，学习速率为1，
#决策树棵树为10,也就是迭代次数为10，目标函数是基于二分类问
#题的logistic损失进行模型建立。
timeend<-Sys.time()
runningtime<-timeend-timestart
print(runningtime)             #运行时间

## Time difference of 0.192559 secs

#利用predict函数进行预测
test<-apply(test,2,as.numeric) 
pred<-predict(bst,test[,-34])
pred1 = ifelse(pred>0.5,1,0)   #转化为分类变量0-1
true<-as.factor(test[,34])
table_XG=table(true,pred1)/nrow(test)    #混淆数量矩阵
table_XG                       #从混淆矩阵中看出并不好

##     pred1
## true          0          1
##    0 0.73083333 0.03650000
##    1 0.15000000 0.08266667

a=index2(table_XG)             #预测准确率为0.771%,速度很快
a

##   Accuracy    recall precision F_measure
## 1   0.8135 0.3553009 0.6937063 0.4699195

#绘制ROC曲线
xgb_lr.train.modelroc <- roc(test[,34], pred)  
plot(xgb_lr.train.modelroc,print.auc=TRUE,auc.polygon=TRUE,grid=c(0.1,0.2),  
     grid.col=c("green","red"),max.auc.polygon=TRUE,auc.polygon.col="skyblue",print.thres=TRUE)

#XGBoost变量重要性
model <- xgb.dump(bst, with.stats = T)
model[1:10]

##  [1] "booster[0]"                                                  
##  [2] "0:[f15<1.5] yes=1,no=2,missing=1,gain=2459.88,cover=6000"    
##  [3] "1:[f16<1.5] yes=3,no=4,missing=3,gain=502.502,cover=5379.5"  
##  [4] "3:leaf=-1.43651,cover=4927.25"                               
##  [5] "4:leaf=-0.334253,cover=452.25"                               
##  [6] "2:[f17<-0.5] yes=5,no=6,missing=5,gain=34.9331,cover=620.5"  
##  [7] "5:leaf=-0.186667,cover=36.5"                                 
##  [8] "6:leaf=0.817094,cover=584"                                   
##  [9] "booster[1]"                                                  
## [10] "0:[f29<390.5] yes=1,no=2,missing=1,gain=250.48,cover=4031.08"

# 获得特征的真实名称
names <- dimnames(train)[[2]]
# 计算特征重要性矩阵
importance_matrix <- xgb.importance(names, model = bst)
# 制图
xgb.ggplot.importance(importance_matrix[1:10,], rel_to_first = TRUE)+ylab('Gain')

xgb.plot.tree(model = bst,n_first_tree = 1,plot_width = 600,plot_height = 600)

#查看消耗内存
mem_change(xgboost(data=x,label=y,max.depth=2,eta=1,nround=12,objective='binary:logistic'))

## [1]  train-error:0.178500 
## [2]  train-error:0.178000 
## [3]  train-error:0.179375 
## [4]  train-error:0.178583 
## [5]  train-error:0.177208 
## [6]  train-error:0.178125 
## [7]  train-error:0.177875 
## [8]  train-error:0.177208 
## [9]  train-error:0.177542 
## [10] train-error:0.177833 
## [11] train-error:0.177292 
## [12] train-error:0.176708

## 95.9 kB

6.3.3 XGBoost模型优化

###利用XGBoost构造新的特征提高分类效率
train <- data.matrix(train)
test <- data.matrix(test)
new.features.train <- xgb.create.features(model = bst, train[,-34])  
# 生成xgboost构造的新特征组合，训练集  
new.features.test <- xgb.create.features(model = bst, test[,-34])    
# 生成xgboost构造的新特征组合，测试集  

newdtrain <- as.data.frame(as.matrix(new.features.train))        
# 将训练集的特征组合转化为dataframe格式  
newdtest <- as.data.frame(as.matrix(new.features.test))         
# 将测试集的特征组合转化为dataframe格式  

newtraindata <- cbind(newdtrain,y=train[,34])             
# 将训练集的自变量和因变量合并  
newtestdata <- cbind(newdtest,y=test[,34])               
# 将测试集的自变量和因变量合并 

# 对训练集进行预测  
x1<-newtraindata[,1:58]
y1<-newtraindata[,59]
x1<-apply(x1,2,as.numeric)        
#将x的列变量转化为数字型，xgboost对数据类型要求严格
y1<-as.numeric(y1)                     #将y也转化为数字型变量
cv.res<-xgb.cv(data=x1,label=y1,max.depth=2,eta=1,nround=15,objective='binary:logistic',nfold=10)

## [1]  train-error:0.178500+0.001203   test-error:0.178500+0.010822 
## [2]  train-error:0.176875+0.001102   test-error:0.176875+0.009913 
## [3]  train-error:0.178482+0.001379   test-error:0.178791+0.010053 
## [4]  train-error:0.177597+0.001165   test-error:0.177917+0.009995 
## [5]  train-error:0.177009+0.001166   test-error:0.178125+0.009924 
## [6]  train-error:0.176782+0.001129   test-error:0.177708+0.009737 
## [7]  train-error:0.176500+0.001362   test-error:0.177792+0.009212 
## [8]  train-error:0.176467+0.001486   test-error:0.177583+0.010112 
## [9]  train-error:0.176287+0.001548   test-error:0.177792+0.009536 
## [10] train-error:0.176273+0.001468   test-error:0.178292+0.009964 
## [11] train-error:0.176222+0.001279   test-error:0.178000+0.010616 
## [12] train-error:0.176222+0.001311   test-error:0.178042+0.010445 
## [13] train-error:0.176190+0.001256   test-error:0.178583+0.010487 
## [14] train-error:0.176083+0.001230   test-error:0.178417+0.010156 
## [15] train-error:0.176055+0.001287   test-error:0.178125+0.010551

#十折交叉检验确定迭代次数
bst<-xgboost(data=x1,label=y1,max.depth=2,eta=1,nround=20,objective='binary:logistic')

## [1]  train-error:0.178500 
## [2]  train-error:0.176875 
## [3]  train-error:0.178958 
## [4]  train-error:0.177667 
## [5]  train-error:0.177167 
## [6]  train-error:0.176667 
## [7]  train-error:0.176542 
## [8]  train-error:0.177000 
## [9]  train-error:0.177000 
## [10] train-error:0.177208 
## [11] train-error:0.176875 
## [12] train-error:0.176708 
## [13] train-error:0.176417 
## [14] train-error:0.176542 
## [15] train-error:0.176208 
## [16] train-error:0.175958 
## [17] train-error:0.175833 
## [18] train-error:0.175958 
## [19] train-error:0.175958 
## [20] train-error:0.175833

newtestdata<-apply(newtestdata,2,as.numeric) 
pred<-predict(bst,newtestdata[,-59])
pred1 = ifelse(pred>0.5,1,0)           #转化为分类变量0-1
true<-as.factor(newtestdata[,59])
table_XG=table(true,pred1)/nrow(test)    #混淆数量矩阵
table_XG

##     pred1
## true          0          1
##    0 0.72716667 0.04016667
##    1 0.14833333 0.08433333

a=index2(table_XG)                     #预测准确率为0.771%,速度很快
a

##   Accuracy    recall precision F_measure
## 1   0.8115 0.3624642 0.6773762 0.4722352

#绘制ROC曲线
xgb_lr.train.modelroc <- roc(newtestdata[,59], pred)  
plot(xgb_lr.train.modelroc,print.auc=TRUE,auc.polygon=TRUE,grid=c(0.1,0.2),  
     grid.col=c("green","red"),max.auc.polygon=TRUE,auc.polygon.col="skyblue",print.thres=TRUE)

6.4分类算法对比

先不考虑信贷的实际问题，单纯比较算法性能来讲，集成算法的对比表如下表所示，可以看出单从准确率上看，算法之间差距并不是很大，说明在集成算法上可能都已经逼近了数据本身应有的特征，无法有大的改进，但值得注意的是，随机森林和 Adaboost 都需要进行多棵决策树的生成，迭代次数较多，成千上百次，而XGBoost只需要进行少量的迭代次数，在本次试验中仅需要8次。而且运行时间上XGBoost的多线程并行和基于其block的索引数据结构运行，也发挥了极大作用。比Adaboost缩短时间达到了近400倍，而比随机森林缩短时间近200倍。在本次信贷的数据分析中，XGBoost更有调节阈值后可以更加符合银行的商业要求，更加准确的识别出违约人。

\(~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\)表1-2 算法比较表

	随机森林	Adaboost	XGBoost
Accuracy	0.81	0.82	0.82
recall	0.37	0.34	0.36
precision	0.62	0.67	0.68
F_measure	0.46	0.45	0.48
运行时间(s)	44.46	85.42	0.21

基于信用卡数据算法对比

中央财经大学王思雨

2018-01-12