Python Data Analysis Basics 57 - ROC curve, accuracy, recall, precision, specificity, fallout (false positive rate)

코딩탕탕 2022. 11. 18. 15:29
# ROC curve
# The ROC curve evaluates a classification model's performance across all possible thresholds.
# The area under the ROC curve is called the AUC (Area Under the Curve).

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix

x, y = make_classification(n_samples = 100, n_features = 2, n_redundant = 0, random_state = 123)
print(x[:3])
print(y[:3])

# import matplotlib.pyplot as plt
# plt.scatter(x[:, 0], x[:, 1])
# plt.show()

model = LogisticRegression().fit(x, y)
y_hat = model.predict(x)
print('predictions :', y_hat[:3])

print()
f_value = model.decision_function(x) # decision function: signed score of each sample relative to the decision boundary
# print('f_value :', f_value) # scores <= 0 are predicted as class 0, scores > 0 as class 1

df = pd.DataFrame(np.vstack([f_value, y_hat, y]).T, columns = ['f', 'y_hat', 'y'])
print(df.head(3))

print()
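# confusion_matrix rows are actual classes and columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y, y_hat)
print(cm)
tn, fp, fn, tp = cm.ravel()           # here: TN = 44, FP = 4, FN = 8, TP = 44
acc = (tp + tn) / (tn + fp + fn + tp) # (TP + TN) / total
recall = tp / (tp + fn)               # TP / (TP + FN)
precision = tp / (tp + fp)            # TP / (TP + FP)
specificity = tn / (fp + tn)          # TN / (FP + TN)
fallout = fp / (fp + tn)              # FP / (FP + TN)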

print('acc (accuracy) :', acc)
print('recall :', recall)    # TPR
print('precision :', precision)
print('specificity :', specificity)
print('fallout (false positive rate) :', fallout) # FPR
print('fallout (1 - specificity) :', 1 - specificity)

print()
from sklearn import metrics
ac_sco = metrics.accuracy_score(y, y_hat)
print('ac_sco :', ac_sco)

cl_rep = metrics.classification_report(y, y_hat)
print('cl_rep :', cl_rep)

print()
fpr, tpr, threshold = metrics.roc_curve(y, model.decision_function(x))
print('fpr :', fpr)
print('tpr :', tpr)
print('classification thresholds (decision-function scores, not probabilities) :', threshold)

# ROC curve
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, 'o-', label = 'Logistic Regression')
plt.plot([0, 1], [0, 1], 'k--', label = 'random classifier line(AUC : 0.5)')
plt.plot([fallout], [recall], 'ro', ms = 10) # operating point of model.predict: (fallout, recall)
plt.xlabel('fpr')
plt.ylabel('tpr')
plt.title('ROC curve')
plt.legend()
plt.show()

# AUC : area under the ROC curve
print('AUC :', metrics.auc(fpr, tpr)) # the closer to 1, the better the classifier.  AUC : 0.95472
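
To make the "all possible thresholds" idea concrete, here is a minimal sketch that rebuilds the ROC points by hand and compares them with `roc_curve`. It assumes the same synthetic data and model as above. `roc_point` is a hypothetical helper written for this illustration, not a scikit-learn function, and `drop_intermediate=False` is passed so that `roc_curve` returns every point rather than only the corners of the curve.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve

# Same synthetic data and model as in the listing above.
x, y = make_classification(n_samples=100, n_features=2, n_redundant=0, random_state=123)
model = LogisticRegression().fit(x, y)
scores = model.decision_function(x)

def roc_point(y_true, scores, threshold):
    """Hypothetical helper: (fpr, tpr) when every score >= threshold is predicted positive."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return fp / (fp + tn), tp / (tp + fn)

# Use every distinct score as a candidate threshold, from highest to lowest.
candidates = np.unique(scores)[::-1]
manual = [roc_point(y, scores, t) for t in candidates]
manual_fpr, manual_tpr = map(np.array, zip(*manual))

# roc_curve prepends one extra point at (0, 0) for "nothing predicted positive".
fpr, tpr, thresholds = roc_curve(y, scores, drop_intermediate=False)
print('points match :', np.allclose(fpr[1:], manual_fpr) and np.allclose(tpr[1:], manual_tpr))
```

Calling `roc_point(y, scores, 0.0)` gives the single point that `model.predict` corresponds to, since `predict` labels a sample positive when its decision score is above 0.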


<console>
[[-0.01032243 -0.80566819]
 [-1.10293659  2.21661117]
 [-1.90795358 -0.20839902]]
[1 0 0]
predictions : [0 0 0]

          f  y_hat    y
0 -0.285782    0.0  1.0
1 -0.940879    0.0  0.0
2 -4.232450    0.0  0.0

[[44  4]
 [ 8 44]]
acc (accuracy) : 0.88
recall : 0.8461538461538461
precision : 0.9166666666666666
specificity : 0.9166666666666666
fallout (false positive rate) : 0.08333333333333333
fallout (1 - specificity) : 0.08333333333333337

ac_sco : 0.88
cl_rep :               precision    recall  f1-score   support

           0       0.85      0.92      0.88        48
           1       0.92      0.85      0.88        52

    accuracy                           0.88       100
   macro avg       0.88      0.88      0.88       100
weighted avg       0.88      0.88      0.88       100


fpr : [0.         0.         0.         0.02083333 0.02083333 0.04166667
 0.04166667 0.10416667 0.10416667 0.14583333 0.14583333 0.27083333
 0.27083333 0.29166667 0.29166667 0.41666667 0.41666667 0.45833333
 0.45833333 0.47916667 0.47916667 1.        ]
tpr : [0.         0.01923077 0.78846154 0.78846154 0.82692308 0.82692308
 0.84615385 0.84615385 0.88461538 0.88461538 0.90384615 0.90384615
 0.92307692 0.92307692 0.94230769 0.94230769 0.96153846 0.96153846
 0.98076923 0.98076923 1.         1.        ]
classification thresholds (decision-function scores, not probabilities) : [ 7.04651307  6.04651307  1.67685901  1.54237338  0.88285248  0.61225012
  0.6042564  -0.01701686 -0.3136364  -0.72225891 -0.73422207 -1.03469335
 -1.06350616 -1.1284408  -1.26883002 -1.50493726 -1.58820386 -1.70986327
 -1.76264928 -1.77065102 -1.8172322  -5.91181287]
AUC : 0.9547275641025641
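
The AUC value can be checked by hand, because `metrics.auc` applies the trapezoidal rule to the (fpr, tpr) points. The sketch below copies the `fpr` and `tpr` arrays from the console output above. Those values are rounded to 8 decimals, so the result is only expected to agree with the printed AUC to about six decimal places.

```python
import numpy as np

# fpr / tpr copied from the console output above (rounded to 8 decimals).
fpr = np.array([0.0, 0.0, 0.0, 0.02083333, 0.02083333, 0.04166667,
                0.04166667, 0.10416667, 0.10416667, 0.14583333, 0.14583333,
                0.27083333, 0.27083333, 0.29166667, 0.29166667, 0.41666667,
                0.41666667, 0.45833333, 0.45833333, 0.47916667, 0.47916667, 1.0])
tpr = np.array([0.0, 0.01923077, 0.78846154, 0.78846154, 0.82692308,
                0.82692308, 0.84615385, 0.84615385, 0.88461538, 0.88461538,
                0.90384615, 0.90384615, 0.92307692, 0.92307692, 0.94230769,
                0.94230769, 0.96153846, 0.96153846, 0.98076923, 0.98076923, 1.0, 1.0])

# Trapezoidal rule: each step contributes width * average height.
widths = np.diff(fpr)
heights = (tpr[1:] + tpr[:-1]) / 2
auc_manual = np.sum(widths * heights)
print('manual AUC :', auc_manual)
```

Vertical steps (where fpr does not change) have zero width and add nothing, so the area comes entirely from the horizontal segments of the curve.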