Support Vector Machine (SVM) Example - Heart Disease Patient Data

코딩탕탕 2022. 11. 23. 18:33

<Author's code>

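# [SVM classification problem] Practice analyzing classification accuracy with heart disease patient data
# https://www.kaggle.com/zhaoyingzhu/heartcsv
# https://github.com/pykwon/python/tree/master/testdata_utf8         Heart.csv
#
# The Heart data contains observations of 303 thoracic surgery patients.
# It has 13 columns of age, sex, and test information for each patient, and the last column, AHD, records whether the patient has heart disease.
# Split the dataset into train and test sets, build a classification model, and check its classification accuracy.
# Feed in arbitrary values and check the classification result.
# Don't be discouraged if the accuracy turns out lower than expected.
#
# feature columns : exclude string-valued columns
# label column : AHD (severe heart disease)
#
# Sample data)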
# "","Age","Sex","ChestPain","RestBP","Chol","Fbs","RestECG","MaxHR","ExAng","Oldpeak","Slope","Ca","Thal","AHD"
# "1",63,1,"typical",145,233,1,2,150,0,2.3,3,0,"fixed","No"
# "2",67,1,"asymptomatic",160,286,0,2,108,1,1.5,2,3,"normal","Yes"
# ...

import pandas as pd
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

df = pd.read_csv('https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/Heart.csv', )
df = df.drop(['Unnamed: 0'], axis=1)
print(df.head(3), df.shape) # (303, 14)
print(df.describe())
print(df.info())
# print(df.isnull().sum())
df.fillna({'Ca':float(df['Ca'].mean())}, inplace=True)

df_x = df.drop(columns = ['ChestPain', 'Thal', 'AHD'])
df_y = df['AHD']

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size = 0.2, random_state = 1)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # (242, 11) (61, 11) (242,) (61,)

print()
# model
model = svm.SVC(C=1).fit(x_train, y_train)

pred = model.predict(x_test)
print('Predicted :', pred[:10])
print('Actual :', y_test[:10].values)

acc = metrics.accuracy_score(y_test, pred)
print('acc :', acc) # 0.67213

print()
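# Cross-validation
# Note: this runs on the 61-row test split only; cross-validation is more commonly run on the training data.
from sklearn import model_selection
cross_vali = model_selection.cross_val_score(model, x_test, y_test, cv = 5)
print('Accuracy per fold :', cross_vali)
print('Mean CV accuracy :', cross_vali.mean())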

# Predict with new values
new_data = pd.DataFrame({'Age':[60, 65], 'Sex':[1, 0], 'RestBP':[145, 160], 'Chol':[250, 245], 'Fbs':[2, 0], 'RestECG':[1, 2], 'MaxHR':[150, 125], 'ExAng':[0, 1], 'Oldpeak':[2.3, 1.7], 'Slope':[3, 2], 'Ca':[3, 2]})
new_pred = model.predict(new_data)
print('New predictions :', new_pred)


<console>
   Age  Sex     ChestPain  RestBP  Chol  ...  Oldpeak  Slope   Ca        Thal  AHD
0   63    1       typical     145   233  ...      2.3      3  0.0       fixed   No
1   67    1  asymptomatic     160   286  ...      1.5      2  3.0      normal  Yes
2   67    1  asymptomatic     120   229  ...      2.6      2  2.0  reversable  Yes

[3 rows x 14 columns] (303, 14)
              Age         Sex      RestBP  ...     Oldpeak       Slope          Ca
count  303.000000  303.000000  303.000000  ...  303.000000  303.000000  299.000000
mean    54.438944    0.679868  131.689769  ...    1.039604    1.600660    0.672241
std      9.038662    0.467299   17.599748  ...    1.161075    0.616226    0.937438
min     29.000000    0.000000   94.000000  ...    0.000000    1.000000    0.000000
25%     48.000000    0.000000  120.000000  ...    0.000000    1.000000    0.000000
50%     56.000000    1.000000  130.000000  ...    0.800000    2.000000    0.000000
75%     61.000000    1.000000  140.000000  ...    1.600000    2.000000    1.000000
max     77.000000    1.000000  200.000000  ...    6.200000    3.000000    3.000000

[8 rows x 11 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        303 non-null    int64  
 1   Sex        303 non-null    int64  
 2   ChestPain  303 non-null    object 
 3   RestBP     303 non-null    int64  
 4   Chol       303 non-null    int64  
 5   Fbs        303 non-null    int64  
 6   RestECG    303 non-null    int64  
 7   MaxHR      303 non-null    int64  
 8   ExAng      303 non-null    int64  
 9   Oldpeak    303 non-null    float64
 10  Slope      303 non-null    int64  
 11  Ca         299 non-null    float64
 12  Thal       301 non-null    object 
 13  AHD        303 non-null    object 
dtypes: float64(2), int64(9), object(3)
memory usage: 33.3+ KB
None
(242, 11) (61, 11) (242,) (61,)

Predicted : ['No' 'No' 'No' 'Yes' 'No' 'No' 'No' 'Yes' 'Yes' 'Yes']
Actual : ['No' 'No' 'No' 'Yes' 'Yes' 'No' 'No' 'Yes' 'Yes' 'Yes']
acc : 0.6721311475409836

Accuracy per fold : [0.53846154 0.58333333 0.58333333 0.58333333 0.5       ]
Mean CV accuracy : 0.5576923076923077
New predictions : ['No' 'Yes']
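
The test accuracy of about 0.67 and the mean cross-validation accuracy of about 0.56 above come from an SVC fit on unscaled columns (Chol is in the hundreds while Sex is 0 or 1), and the cross-validation used only the 61 test rows. Below is a minimal sketch of one common alternative, not the author's method: wrap StandardScaler and SVC in a pipeline and cross-validate on the training split with StratifiedKFold. The data is synthetic (make_classification), and the column scales, n_splits=5, and the random seeds are assumptions chosen for illustration; it makes no claim about the accuracy the Heart data would reach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the Heart features: 303 rows, 11 numeric columns.
X, y = make_classification(n_samples=303, n_features=11, n_informative=5,
                           random_state=1)
# Give the columns very different scales (an assumption mimicking the real data).
rng = np.random.default_rng(1)
X = X * rng.uniform(1, 200, size=X.shape[1])

x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# The scaler is refit inside each CV fold, so validation-fold statistics do not leak in.
scaled_svc = make_pipeline(StandardScaler(), SVC(C=1))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_scores = cross_val_score(scaled_svc, x_train, y_train, cv=cv)
print('fold accuracies :', cv_scores)
print('mean CV accuracy :', cv_scores.mean())

# Fit on the whole training split, then score once on the held-out test split.
scaled_svc.fit(x_train, y_train)
test_acc = scaled_svc.score(x_test, y_test)
print('test accuracy :', test_acc)
```

With the real data, `X` and `y` would be replaced by the feature frame and the AHD column; the test split is then touched only once, for the final score.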

<Instructor's code>

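# [SVM classification problem] Practice analyzing classification accuracy with heart disease patient data
# https://www.kaggle.com/zhaoyingzhu/heartcsv
# https://github.com/pykwon/python/tree/master/testdata_utf8         Heart.csv
#
# The Heart data contains observations of 303 thoracic surgery patients.
# It has 13 columns of age, sex, and test information for each patient, and the last column, AHD, records whether the patient has heart disease.
# Split the dataset into train and test sets, build a classification model, and check its classification accuracy.
# Feed in arbitrary values and check the classification result.
# Don't be discouraged if the accuracy turns out lower than expected.
#
# feature columns : exclude string-valued columns
# label column : AHD (severe heart disease)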

import pandas as pd 
import numpy as np
from sklearn import svm, metrics
from sklearn.model_selection import train_test_split  # import from the public module, not the private _split

heartdata = pd.read_csv("../testdata/Heart.csv")
print(heartdata.info())

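data = heartdata.drop(["ChestPain", "Thal"], axis = 1)  # exclude object-type (string) columns
# Note: 'Unnamed: 0' (the row number) is not dropped here, so it remains among the features.
data["AHD"] = data["AHD"].map({"Yes": 1, "No": 0})  # encode the label as an integer column (1 = Yes, 0 = No)
print(heartdata.isnull().sum())      # Ca has 4 missing values (Thal has 2, but it was dropped above)

Heart = data.fillna(data.mean())   # replace the missing Ca values with the column mean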
label = Heart["AHD"]
features = Heart.drop(["AHD"], axis = 1)

x_train, x_test, y_train, y_test = train_test_split(features, label, test_size = 0.3, random_state = 12)
print()
model = svm.SVC(C=0.1).fit(x_train, y_train)
pred = model.predict(x_test)
print('Predicted : ', pred)
print('Actual : ', np.array(y_test))

# Classification accuracy
print(model.score(x_train, y_train))   # accuracy on the training data
print(model.score(x_test, y_test))     # accuracy on the test data
print('Classification accuracy : ', metrics.accuracy_score(y_test, pred))

# Predict with new values
new_test = x_test[:2].copy()
print(new_test)
new_test['Age'] = 10
new_test['Sex'] = 0
print(new_test)

new_pred = model.predict(new_test)
print('Prediction : ', new_pred)


<console>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  303 non-null    int64  
 1   Age         303 non-null    int64  
 2   Sex         303 non-null    int64  
 3   ChestPain   303 non-null    object 
 4   RestBP      303 non-null    int64  
 5   Chol        303 non-null    int64  
 6   Fbs         303 non-null    int64  
 7   RestECG     303 non-null    int64  
 8   MaxHR       303 non-null    int64  
 9   ExAng       303 non-null    int64  
 10  Oldpeak     303 non-null    float64
 11  Slope       303 non-null    int64  
 12  Ca          299 non-null    float64
 13  Thal        301 non-null    object 
 14  AHD         303 non-null    object 
dtypes: float64(2), int64(10), object(3)
memory usage: 35.6+ KB
None
Unnamed: 0    0
Age           0
Sex           0
ChestPain     0
RestBP        0
Chol          0
Fbs           0
RestECG       0
MaxHR         0
ExAng         0
Oldpeak       0
Slope         0
Ca            4
Thal          2
AHD           0
dtype: int64

Predicted :  [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Actual :  [0 0 0 0 1 0 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 0 1 0 0 1 0 1 0 0 0 0 1 0 1 0 0
 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 1 1 0 1 0 1 0 0 1 0 0 0
 1 1 0 0 1 1 1 1 1 0 1 1 0 1 0 0 0]
0.5330188679245284
0.5604395604395604
Classification accuracy :  0.5604395604395604
    Unnamed: 0  Age  Sex  RestBP  Chol  ...  MaxHR  ExAng  Oldpeak  Slope   Ca
92          93   62    1     130   231  ...    146      0      1.8      2  3.0
85          86   44    1     140   235  ...    180      0      0.0      1  0.0

[2 rows x 12 columns]
    Unnamed: 0  Age  Sex  RestBP  Chol  ...  MaxHR  ExAng  Oldpeak  Slope   Ca
92          93   10    0     130   231  ...    146      0      1.8      2  3.0
85          86   10    0     140   235  ...    180      0      0.0      1  0.0

[2 rows x 12 columns]
Prediction :  [0 0]
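
Every prediction in the output above is 0, so the test accuracy of about 0.56 is simply the share of class 0 among the 91 test rows. The small sketch below makes this visible with a confusion matrix. The label arrays are rebuilt from the counts in the printed actual values (51 zeros and 40 ones; row order is ignored, which does not affect these metrics), and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Counts taken from the printed actual values: 51 rows of class 0, 40 of class 1.
y_true = np.array([0] * 51 + [1] * 40)
# The model predicted class 0 for all 91 test rows.
y_pred = np.zeros(91, dtype=int)

# Rows are the actual class, columns are the predicted class.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)

acc = accuracy_score(y_true, y_pred)                    # equals 51 / 91
recall_pos = recall_score(y_true, y_pred, pos_label=1)  # share of actual 1s that were found
print('accuracy :', acc)
print('recall for class 1 :', recall_pos)
```

A recall of 0 for class 1 means the model never flags a patient as having heart disease, which accuracy alone does not reveal. Scaling the features or trying other values of C are common things to check next.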