Published: 2021-11-22 12:36:17 Author: 小新
Source: 亿速云

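# What Is the KS Principle in Python Risk Control?

## 1. Overview of the KS Statistic

### 1.1 Definition of the KS Statistic
The KS (Kolmogorov-Smirnov) statistic is one of the most widely used evaluation metrics in risk modeling. It measures how well a model separates positive samples from negative samples. Mathematically:

KS = max_t |FPR(t) - TPR(t)|

where:
- FPR(t) is the false positive rate at threshold t
- TPR(t) is the true positive rate at threshold t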

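### 1.2 Statistical Meaning of the KS Value
The KS value is the largest separation between positive and negative samples that the model achieves at any single threshold. It ranges from 0 to 1:
- 0 means the model has no discriminatory power
- 1 means the model separates positive and negative samples perfectly
- As a rule of thumb, a model with KS > 0.3 is generally considered useful in practice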

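### 1.3 KS versus AUC
KS and AUC both evaluate model performance, but they answer different questions:
- AUC measures the model's overall ranking ability across all thresholds
- KS measures the model's separation at the single best threshold
- In practice the two metrics are usually reported together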
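
Both metrics can be read off the same ROC curve, which makes the difference concrete. The sketch below is illustrative only: the synthetic scores and the variable names are assumptions, and it relies on scikit-learn's `roc_curve` and `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic labels (1 = bad) and scores: bad samples score higher on average
y_true = np.concatenate([np.ones(500, dtype=int), np.zeros(500, dtype=int)])
y_score = np.concatenate([rng.normal(0.6, 0.1, 500), rng.normal(0.4, 0.1, 500)])

# roc_curve returns FPR and TPR at every distinct threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

ks = np.max(tpr - fpr)                # largest gap between TPR and FPR (one threshold)
auc = roc_auc_score(y_true, y_score)  # area under the whole curve (all thresholds)

print(f"KS = {ks:.3f}, AUC = {auc:.3f}")
```

KS depends on a single point of the curve, while AUC integrates over all of it, so two models can have similar KS and different AUC.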

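## 2. The Mathematics Behind KS

### 2.1 Comparing Cumulative Distribution Functions
The core idea of the KS statistic is to compare the cumulative distribution functions (CDFs) of the scores of the two groups. KS is the largest vertical gap between the two curves: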

```python
import numpy as np
import matplotlib.pyplot as plt

# Example data (fixed seed so the plot is reproducible)
rng = np.random.default_rng(42)
good_score = rng.normal(0.6, 0.1, 1000)  # scores of good customers
bad_score = rng.normal(0.4, 0.1, 1000)   # scores of bad customers

# Empirical CDF: share of samples with a score <= each sorted value
def calculate_cdf(data):
    sorted_data = np.sort(data)
    cdf = np.arange(1, len(sorted_data) + 1) / len(sorted_data)
    return sorted_data, cdf

good_sorted, good_cdf = calculate_cdf(good_score)
bad_sorted, bad_cdf = calculate_cdf(bad_score)

# Plot both CDFs; KS is the largest vertical distance between them
plt.plot(good_sorted, good_cdf, label='Good CDF')
plt.plot(bad_sorted, bad_cdf, label='Bad CDF')
plt.legend()
plt.title('CDF Comparison')
plt.show()
```

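### 2.2 Determining the Optimal Cutoff

The threshold at which the KS value is attained is the model's best cutoff in the KS sense. At this threshold the gap TPR - FPR is at its maximum: the share of positive samples correctly identified exceeds the share of negative samples misclassified by the widest margin. This is not the same as TPR being at its highest or FPR at its lowest individually.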
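
As a minimal sketch of this idea, the cutoff can be found by evaluating both empirical CDFs on every observed score and taking the point where they are furthest apart. The function name `ks_cutoff` and the toy scores are hypothetical, chosen only for illustration.

```python
import numpy as np

def ks_cutoff(bad_scores, good_scores):
    """Return (KS, cutoff), where cutoff maximizes |F_good(t) - F_bad(t)|."""
    bad = np.sort(np.asarray(bad_scores, dtype=float))
    good = np.sort(np.asarray(good_scores, dtype=float))
    # Candidate thresholds: every distinct observed score
    grid = np.unique(np.concatenate([bad, good]))
    # Empirical CDFs: share of each group with score <= t
    cdf_bad = np.searchsorted(bad, grid, side="right") / bad.size
    cdf_good = np.searchsorted(good, grid, side="right") / good.size
    gap = np.abs(cdf_good - cdf_bad)
    i = int(np.argmax(gap))  # first threshold with the largest gap
    return gap[i], grid[i]

# Toy scores (made up): bad customers tend to score higher
bad = [0.9, 0.8, 0.7, 0.4]
good = [0.6, 0.3, 0.2, 0.1]
ks, cutoff = ks_cutoff(bad, good)
print(f"KS = {ks:.2f} at cutoff {cutoff}")
```

When several thresholds tie for the largest gap, `np.argmax` returns the first one; a business cutoff would normally also weigh approval rate and expected loss.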

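### 2.3 Hypothesis-Testing Background

The KS metric comes from the two-sample Kolmogorov-Smirnov test in nonparametric statistics, which tests whether two samples are drawn from the same distribution:
- Null hypothesis H0: the two samples come from the same distribution
- Alternative hypothesis H1: the two samples come from different distributions

In risk control, we want the score distributions of good and bad customers to differ as much as possible.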
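
The test itself is available as `scipy.stats.ks_2samp`. The sketch below, using simulated scores that are purely illustrative, contrasts two samples from different distributions with two samples from the same one.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

good = rng.normal(0.4, 0.1, 1000)        # scores of good customers
bad = rng.normal(0.6, 0.1, 1000)         # scores of bad customers
good_again = rng.normal(0.4, 0.1, 1000)  # second draw from the good distribution

separated = ks_2samp(bad, good)    # H0 should be rejected
same = ks_2samp(good, good_again)  # H0 should usually not be rejected

print(f"different distributions: D={separated.statistic:.3f}, p={separated.pvalue:.3g}")
print(f"same distribution:       D={same.statistic:.3f}, p={same.pvalue:.3g}")
```

A large statistic with a tiny p-value indicates the score separates the two groups; a small statistic means the two distributions are hard to tell apart.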

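## 3. Computing KS in Python

### 3.1 Basic Calculation

```python
import numpy as np
import pandas as pd

def compute_ks(y_true, y_pred):
    """
    Compute the KS statistic.
    :param y_true: true labels (0/1, 1 = bad)
    :param y_pred: predicted probabilities (or any numeric score)
    :return: KS value, threshold at which it is attained
    """
    # Combine labels and predictions, sorted by score
    df = pd.DataFrame({'y_true': np.asarray(y_true), 'y_pred': np.asarray(y_pred)})
    df = df.sort_values('y_pred')

    # Cumulative share of bad and good samples at or below each score
    df['bad_cum'] = df['y_true'].cumsum() / df['y_true'].sum()
    df['good_cum'] = (1 - df['y_true']).cumsum() / (1 - df['y_true']).sum()

    # For tied scores keep only the last row, so each CDF is read once per distinct score
    df = df.drop_duplicates(subset='y_pred', keep='last')

    # KS is the largest absolute gap; the signed gap is <= 0 when bad samples score higher
    df['ks'] = (df['bad_cum'] - df['good_cum']).abs()
    ks_value = df['ks'].max()
    best_threshold = df.loc[df['ks'].idxmax(), 'y_pred']

    return ks_value, best_threshold
```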

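### 3.2 Using scipy

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_scipy(y_true, y_pred):
    # Convert to arrays so the boolean masks work for lists and pandas objects alike
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    good = y_pred[y_true == 0]  # scores of good samples
    bad = y_pred[y_true == 1]   # scores of bad samples
    # Two-sample KS test between the two score distributions
    result = ks_2samp(bad, good)
    return result.statistic, result.pvalue
```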

### 3.3 Plotting the KS Curve

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ks_curve(y_true, y_pred):
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    # Use the score percentiles as candidate thresholds
    percentiles = np.linspace(0, 100, 101)
    thresholds = np.percentile(y_pred, percentiles)

    tpr = []
    fpr = []
    for thresh in thresholds:
        y_pred_label = (y_pred >= thresh).astype(int)
        tp = np.sum((y_true == 1) & (y_pred_label == 1))
        fp = np.sum((y_true == 0) & (y_pred_label == 1))
        tpr.append(tp / np.sum(y_true == 1))
        fpr.append(fp / np.sum(y_true == 0))

    # KS is the largest gap between the two curves
    gap = np.array(tpr) - np.array(fpr)
    ks_value = np.max(gap)
    idx = np.argmax(gap)

    plt.plot(percentiles / 100, tpr, label='TPR')
    plt.plot(percentiles / 100, fpr, label='FPR')
    plt.plot([percentiles[idx] / 100, percentiles[idx] / 100],
             [fpr[idx], tpr[idx]], 'k--',
             label=f'KS={ks_value:.3f}')
    plt.legend()
    plt.title('KS Curve')
    plt.xlabel('Threshold Percentile')
    plt.ylabel('Rate')
    # plt.show() is left to the caller so the curve can also be drawn into a subplot
```

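## 4. Applying KS to Risk Models

### 4.1 Model Evaluation

KS is a core evaluation metric during risk model development:
- Development-sample KS: performance on the training set
- Validation-sample KS: generalization to the test set
- Out-of-time-sample KS: stability of the model over time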

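### 4.2 Variable Screening

KS can also be used for univariate analysis:
- Compute the KS value of each variable
- Keep the variables with higher KS values
- A common rule of thumb is to keep variables with KS > 0.02

```python
import pandas as pd

def variable_ks(df, target):
    # KS of each numeric column against the target, using compute_ks from section 3.1
    ks_dict = {}
    for col in df.columns:
        if col != target:
            ks, _ = compute_ks(df[target], df[col])
            ks_dict[col] = ks
    return pd.DataFrame.from_dict(ks_dict, orient='index', columns=['KS'])
```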

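### 4.3 Model Monitoring

After deployment, the KS value should be monitored continuously:
- The week-over-week or month-over-month change should not exceed 20%
- A marked drop may indicate that the model is no longer effective
- An abnormal rise may indicate a shift in the sample distribution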
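
A check of this kind can be sketched in a few lines of pandas. The function name `flag_ks_drift`, the 20% default, and the monthly KS values below are illustrative assumptions, not production figures.

```python
import pandas as pd

def flag_ks_drift(ks_by_period, max_rel_change=0.20):
    """Flag periods whose KS moved more than max_rel_change versus the previous period."""
    ks = pd.Series(ks_by_period, dtype=float)
    # Relative change: (current - previous) / previous; the first period has no baseline
    rel_change = ks / ks.shift(1) - 1
    return pd.DataFrame({
        "ks": ks,
        "rel_change": rel_change,
        "alert": rel_change.abs() > max_rel_change,
    })

# Made-up monthly KS values for demonstration
monthly_ks = {"2021-07": 0.41, "2021-08": 0.40, "2021-09": 0.39, "2021-10": 0.29}
report = flag_ks_drift(monthly_ks)
print(report)
```

The absolute value is used so that both a sharp drop and an unusual rise raise an alert, matching the two failure modes listed above.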

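## 5. Limitations of the KS Metric

### 5.1 Sensitivity to Class Imbalance

When the ratio of positive to negative samples is extremely imbalanced:
- The KS estimate rests on very few minority-class samples and may be inflated
- It should be judged together with other metrics such as AUC and PSI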

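### 5.2 Reflects Only the Maximum Separation

KS looks only at the best cutoff point:
- It may overlook how the model performs across the rest of the score range
- It should be assessed together with the ROC curve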

### 5.3 Dependence on Sample Size

When the sample is small:
- The KS value may be unstable
- Cross-validation is needed
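
The instability can be made visible by resampling. The sketch below uses a bootstrap rather than cross-validation, purely as an illustration of how much KS can vary with sample size; the function name `bootstrap_ks` and the simulated data are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def bootstrap_ks(y_true, y_score, n_boot=200, seed=0):
    """Resample with replacement and return the KS value of each resample."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n = y_true.size
    values = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # resample lost one class; skip it
            continue
        values.append(ks_2samp(ys[yt == 1], ys[yt == 0]).statistic)
    return np.array(values)

rng = np.random.default_rng(1)
widths = {}
for n in (100, 5000):
    y = rng.integers(0, 2, n)                # simulated labels
    score = rng.normal(0.4 + 0.2 * y, 0.1)   # bad samples score higher on average
    ks_values = bootstrap_ks(y, score)
    lo, hi = np.percentile(ks_values, [2.5, 97.5])
    widths[n] = hi - lo
    print(f"n={n}: 95% bootstrap interval for KS = [{lo:.3f}, {hi:.3f}]")
```

A wide interval on a small sample is a sign that a single KS figure should not be taken at face value.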

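## 6. A Worked Example

### 6.1 Data Preparation

The German credit dataset is used for the demonstration:

```python
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load the data (the German credit dataset is listed on OpenML as 'credit-g')
data = fetch_openml('credit-g', version=1, as_frame=True)
df = data.frame
df['target'] = (df['class'] == 'bad').astype(int)  # 1 = bad customer

# Simple feature engineering: one-hot encode the categorical columns
X = pd.get_dummies(df.drop(['class', 'target'], axis=1))
y = df['target']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```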

### 6.2 Model Training and Evaluation

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predicted probability of the bad class
train_pred = model.predict_proba(X_train)[:, 1]
test_pred = model.predict_proba(X_test)[:, 1]

# Compute the metrics
train_ks, _ = compute_ks(y_train, train_pred)
test_ks, _ = compute_ks(y_test, test_pred)
train_auc = roc_auc_score(y_train, train_pred)
test_auc = roc_auc_score(y_test, test_pred)

print(f"Train KS: {train_ks:.4f}, AUC: {train_auc:.4f}")
print(f"Test KS: {test_ks:.4f}, AUC: {test_auc:.4f}")
```

### 6.3 Visualizing the Results

```python
# Plot the KS curves of the training set and the test set side by side
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plot_ks_curve(y_train, train_pred)
plt.title('Train KS Curve')

plt.subplot(1, 2, 2)
plot_ks_curve(y_test, test_pred)
plt.title('Test KS Curve')
plt.tight_layout()
plt.show()
```

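## 7. Strategies for Improving KS

### 7.1 Binning Optimization

Better binning of the input variables can raise KS. Common approaches:
- Equal-frequency binning
- Equal-width binning
- Decision-tree binning

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def optimal_binning(feature, target, n_bins=10):
    # Fit a small decision tree on the single feature; its split points act as bin edges
    tree = DecisionTreeClassifier(max_leaf_nodes=n_bins)
    tree.fit(np.asarray(feature).reshape(-1, 1), target)
    # Leaf nodes have feature index -2; keep the thresholds of split nodes only
    thresholds = tree.tree_.threshold[tree.tree_.feature >= 0]
    return np.sort(thresholds)
```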
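
For comparison with the tree-based approach, equal-frequency and equal-width binning can be sketched with pandas' `qcut` and `cut`. The simulated `income` variable and the choice of five bins are illustrative assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Simulated right-skewed variable, used only for demonstration
income = pd.Series(rng.lognormal(mean=10, sigma=0.5, size=1000), name="income")

# Equal-frequency: each bin holds roughly the same number of samples
freq_bins = pd.qcut(income, q=5, duplicates="drop")

# Equal-width: each bin spans the same range of values
width_bins = pd.cut(income, bins=5)

print(freq_bins.value_counts().sort_index())
print(width_bins.value_counts().sort_index())
```

On skewed data, equal-width bins tend to leave the upper bins nearly empty, which is one reason equal-frequency or tree-based bins are often preferred for scorecard variables.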

### 7.2 Model Ensembling

Ensembling models can raise KS:
- Bagging
- Boosting
- Stacking
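
A rough way to check whether ensembling helps on a given dataset is to compare test-set KS across model types. The sketch below uses synthetic data from scikit-learn's `make_classification`; the helper `ks_score` and all parameter choices are assumptions for illustration, and results on real data may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def ks_score(y_true, y_score):
    # KS as the largest gap between TPR and FPR over all thresholds
    fpr, tpr, _ = roc_curve(y_true, y_score)
    return float(np.max(tpr - fpr))

# Synthetic, moderately imbalanced data (about 20% positives)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "bagging (random forest)": RandomForestClassifier(n_estimators=100, random_state=0),
    "boosting (gradient boosting)": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = ks_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name}: test KS = {results[name]:.3f}")
```

Comparing on a held-out set matters here: an ensemble can raise training KS considerably while the test KS barely moves.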

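### 7.3 Adjusting Sample Weights

Give the minority class a higher weight:

```python
from sklearn.ensemble import RandomForestClassifier

# class_weight='balanced' weights each class inversely to its frequency
model = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',
    random_state=42
)
```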

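## 8. Summary

The KS value is a core metric for evaluating risk models:
1. It reflects the maximum ability of the model to separate positive and negative samples
2. It is simple to implement in Python and fits naturally into a machine learning workflow
3. It should be combined with other metrics when assessing model performance
4. Its limitations must be kept in mind in practice
5. Several strategies are available for improving it

In real risk-control projects, it is advisable to:
- Monitor changes in the KS value regularly
- Set up an alerting mechanism for the KS value
- Use business understanding to analyze why KS changed
- Avoid chasing a higher KS at the expense of model stability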
