炼数成金 门户 商业智能 R语言 查看内容

手把手教你用R语言建立信用评分模型(三)— —Logistic模型建构

2017-2-1 21:24| 发布者: 炼数成金_小数| 查看: 21834| 评论: 0|来自: Frank和风险模型们

摘要: 我们在上一篇变量筛选专题中,使用WoE完成了单变量分析的部分。接下来,我们会用经过清洗后的数据看一下变量间的相关性。注意,这里的相关性分析只是初步的检查,进一步检查模型的多重共线性还需要通过 VIF(variance ...

模型 R语言

相关性分析 & IV(信息值)筛选
我们在上一篇变量筛选专题中,使用WoE完成了单变量分析的部分。接下来,我们会用经过清洗后的数据看一下变量间的相关性。注意,这里的相关性分析只是初步的检查,进一步检查模型的多重共线性还需要通过 VIF(variance inflation factor)也就是 方差膨胀因子进行检验。 

R代码:
require(corrplot)
cor1<-cor(train)
corrplot(cor1,tl.cex = 0.5)

输出图像:

从相关矩阵图中可以看出, CreditAmount和Duration的相关性较强(0.37),以及NoofCreditatthisBank和PaymentStatusofPreviousCredit相关性较强(0.42)。

接下来,我进一步计算每个变量的Infomation Value(IV)。IV指标是一般用来确定自变量的预测能力。 其公式为:

通过IV值判断变量预测能力的标准是:

< 0.02: unpredictive 
0.02 to 0.1: weak 
0.1 to 0.3: medium 
0.3 to 0.5: strong 
> 0.5: suspicious

因这部分代码较多,我会将更为详尽的代码放在文章末尾。这里是输出各个变量IV值的语句:
ggplot(infovalue, aes(x = va, y = iv)) + geom_bar(stat = "identity",fill = "blue", colour = "grey60",size = 0.2, alpha = 0.2)+labs(title = "Information value")+ theme(axis.text.x=element_text(angle=90,colour="black",size=10));

输出图像:

可以看出,DuratioCurrentAddress, Guarantors, Instalmentpercent,NoofCreditatthisBank,Occupation,Noofdependents,Telephone变量的IV值明显较低。 所以予以删除。其中相关性分析中NoofCreditatthisBank和PaymentStatusofPreviousCredit相关性较强(0.42)的问题也因NoofCreditatthisBank变量被删除而解决。而CreditAmount和Duration的相关性(0.37)并不显著,可以在这部分忽略不计。

StepWise多变量分析 & Logistic模型建立
在进行StepWise分析前,我们需要将筛选后的变量转换为WoE值并建立Logistic模型。

首先,让先去除在筛选过程中删除的因子:
german_credit$DurationinCurrentaddress=NULL
german_credit$Guarantors=NULL
german_credit$Instalmentpercent=NULL
german_credit$NoofCreditatthisBank=NULL
german_credit$Occupation=NULL
german_credit$Noofdependents=NULL
german_credit$Telephone=NULL

然后计算变量对应的WoE值:
AccountBalancewoe=woe(train2, "AccountBalance",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Durationwoe=woe(train2, "Duration",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
PaymentStatusofPreviousCreditwoe=woe(train2, "PaymentStatusofPreviousCredit",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Purposewoe = woe(train2, "Purpose",Continuous = F, "Creditability",C_Bin = 11,Good = "1",Bad = "0")
CreditAmountwoe= woe(train2, "CreditAmount",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
(全部代码请参见文末)

对变量对应的取值进行WoE替换:
for(i in 1:1000){
  for(s in 1:4){
  if(german_credit$AccountBalance[i]==s){
    german_credit$AccountBalance[i]=-AccountBalancewoe$WOE[s]
 }
  }
  for(s in 1:3){
    if(german_credit$Duration[i]==s){
      german_credit$Duration[i]=-Durationwoe$WOE[s]
    }
  }
  for(s in 0:4){
    if(german_credit$PaymentStatusofPreviousCredit[i]==s){
      german_credit$PaymentStatusofPreviousCredit[i]=-PaymentStatusofPreviousCreditwoe$WOE[s+1]
    }
  }
(全部代码请参见文末)

通过View(german_credit),我们可以看出全部数据已经替换成功:

将经过WoE转换的数据放入Logistic模型中建模,并使用向后逐步回归方法(backward stepwise)筛选变量:
fit<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +Sex.Marital.Status+ Mostvaluableavailableasset + Age + ConcurrentCredits + Typeofapartment + ForeignWorker,train2,family = "binomial")
backwards = step(fit)

输出结果:

可以看出,通过逐步回归,模型删除了 Typeofapartment、 Mostvaluableavailableasset 、Sex.Marital.Status等变量。 

我们再用逐步回归筛选后的的变量进行建模:
fit2<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age + ConcurrentCredits  + ForeignWorker,train2,family = "binomial")
summary(fit2)

输出结果:

其中ConcurrentCredits这一变量并不显著,我们在这一步将此变量删除。继续建立logistic模型:
fit3<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age  + ForeignWorker,train2,family = "binomial")

为防止多重共线性问题的出现,我们对模型进行VIF检验:
library(car)
vif(fit3, digits =3 )

输出结果:

从上图可知,所有变量VIF均小于4,可以判断模型中不存在多重共线性问题。

模型检验
到这里,我们的建模部分基本结束了。我们需要验证一下模型的预测能力如何。我们使用在建模开始阶段预留的250条数据进行检验:

prediction <- predict(fit3,newdata=test2)
for (i in 1:250) {
  if(prediction[i]>0.99){
    prediction[i]=1}
  else
  {prediction[i]=0}
}
confusionMatrix(prediction, test2$Creditability)

输出结果:

模型的精度达到了0.72,模型表现一般。这同Logistic模型本身的局限性有关。传统的回归模型精度一般都会弱于决策树、SVM等机器挖掘算法

完整代码:
german_credit$DurationinCurrentaddress=NULL
german_credit$Guarantors=NULL
german_credit$Instalmentpercent=NULL
german_credit$NoofCreditatthisBank=NULL
german_credit$Occupation=NULL
german_credit$Noofdependents=NULL
german_credit$Telephone=NULL
AccountBalancewoe=woe(train2, "AccountBalance",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Durationwoe=woe(train2, "Duration",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
PaymentStatusofPreviousCreditwoe=woe(train2, "PaymentStatusofPreviousCredit",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Purposewoe = woe(train2, "Purpose",Continuous = F, "Creditability",C_Bin = 11,Good = "1",Bad = "0")
CreditAmountwoe= woe(train2, "CreditAmount",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ValueSavingswoe =woe(train2, "ValueSavings",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Lengthofcurrentemploymentwoe=woe(train2, "Lengthofcurrentemployment",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Sex.Marital.Statuswoe=woe(train2, "Sex.Marital.Status",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Mostvaluableavailableassetwoe=woe(train2, "Mostvaluableavailableasset",Continuous = F, "Creditability",C_Bin = 4,Good = "1",Bad = "0")
Agewoe=woe(train2, "Age",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")
ConcurrentCreditswoe=woe(train2, "ConcurrentCredits",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
Typeofapartmentwoe=woe(train2, "Typeofapartment",Continuous = F, "Creditability",C_Bin = 3,Good = "1",Bad = "0")
ForeignWorkerwoe=woe(train2, "ForeignWorker",Continuous = F, "Creditability",C_Bin = 2,Good = "1",Bad = "0")

for(i in 1:1000){
  
  for(s in 1:4){
  if(german_credit$AccountBalance[i]==s){
    german_credit$AccountBalance[i]=-AccountBalancewoe$WOE[s]
  }
  }

  for(s in 1:3){
    if(german_credit$Duration[i]==s){
      german_credit$Duration[i]=-Durationwoe$WOE[s]
    }
  }
  
  for(s in 0:4){
    if(german_credit$PaymentStatusofPreviousCredit[i]==s){
      german_credit$PaymentStatusofPreviousCredit[i]=-PaymentStatusofPreviousCreditwoe$WOE[s+1]
    }
  }
  
  for(s in 0:10){
    if(s<=6){
    if(german_credit$Purpose[i]==s){
      german_credit$Purpose[i]=-Purposewoe$WOE[s+1]
    }
    }else{
      if(german_credit$Purpose[i]==s){
        german_credit$Purpose[i]=-Purposewoe$WOE[s]
      }
    }
  }
  
  for(s in 1:2){
    if(german_credit$CreditAmount[i]==s){
      german_credit$CreditAmount[i]=-CreditAmountwoe$WOE[s]
    }
  }
  
  for(s in 2:5){
    if(german_credit$ValueSavings[i]==s){
      german_credit$ValueSavings[i]=-ValueSavingswoe$WOE[s-1]
    }
  }
  
  for(s in 1:5){
    if(german_credit$Lengthofcurrentemployment[i]==s){
      german_credit$Lengthofcurrentemployment[i]=-Lengthofcurrentemploymentwoe$WOE[s]
    }
  }
  
  for(s in 1:5){
    if(german_credit$Sex.Marital.Status[i]==s){
      german_credit$Sex.Marital.Status[i]=-Sex.Marital.Statuswoe$WOE[s]
    }
  }
  
  for(s in 1:4){
    if(german_credit$Mostvaluableavailableasset[i]==s){
      german_credit$Mostvaluableavailableasset[i]=-Mostvaluableavailableassetwoe$WOE[s]
    }
  }
  
  for(s in 1:2){
    if(german_credit$Age[i]==s){
      german_credit$Age[i]=-Agewoe$WOE[s]
    }
  }
  
  for(s in 1:5){
    if(german_credit$ConcurrentCredits[i]==s){
      german_credit$ConcurrentCredits[i]=-ConcurrentCreditswoe$WOE[s]
    }
  }
  
  for(s in 1:5){
    if(german_credit$Typeofapartment[i]==s){
      german_credit$Typeofapartment[i]=-Typeofapartmentwoe$WOE[s]
    }
  }
  
  for(s in 1:2){
    if(german_credit$ForeignWorker[i]==s){
      german_credit$ForeignWorker[i]=-ForeignWorkerwoe$WOE[s]
    }
  }
}
fit<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment +Sex.Marital.Status+ Mostvaluableavailableasset + Age + ConcurrentCredits + Typeofapartment + ForeignWorker,train2,family = "binomial")
backwards = step(fit)
summary(backwards)
fit2<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age + ConcurrentCredits  + ForeignWorker,train2,family = "binomial")
summary(fit2)
fit3<-glm(Creditability~ AccountBalance + Duration +PaymentStatusofPreviousCredit +Purpose + CreditAmount + ValueSavings + Lengthofcurrentemployment + Age  + ForeignWorker,train2,family = "binomial")
summary(fit3)
library(car)
vif(fit3, digits =3 )
prediction <- predict(fit3,newdata=test2)
for (i in 1:250) {
  if(prediction[i]>0.99){
    prediction[i]=1}
  else
  {prediction[i]=0}
}
confusionMatrix(prediction, test2$Creditability)

欢迎加入本站公开兴趣群
商业智能与数据分析群
兴趣范围包括各种让数据产生价值的办法,实际应用案例分享与讨论,分析工具,ETL工具,数据仓库,数据挖掘工具,报表系统等全方位知识
QQ群:81035754

鲜花

握手

雷人

路过

鸡蛋

相关阅读

最新评论

热门频道

  • 大数据
  • 商业智能
  • 量化投资
  • 科学探索
  • 创业

即将开课

热门文章

     

    GMT+8, 2019-7-17 11:23 , Processed in 0.168276 second(s), 24 queries .