Python Data Analysis Basics 60 - Ensemble Learning
코딩탕탕
2022. 11. 22. 10:45
Ensemble Learning: combine several individual models, aggregate their outputs, and produce a final classification.
The main approaches are voting, bagging, and boosting. This post demonstrates voting; a short bagging/boosting sketch follows the demo below.
# Ensemble Learning : combine several individual models and aggregate their outputs into a final classification
# The main approaches are voting, bagging, and boosting.
# Uses the breast_cancer dataset
# Build a voting classifier from LogisticRegression, DecisionTree, and KNN
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# load_breast_cancer() returns a Bunch object, not a DataFrame
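# (scikit-learn >= 0.23 also supports load_breast_cancer(as_frame=True) to get a DataFrame directly)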
cancer = load_breast_cancer()
data_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
print(data_df.head(2))
# train / test split
x_train, x_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=1)
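# (optional) passing stratify=cancer.target would keep the malignant/benign ratio identical in both splits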
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape) # (455, 30) (114, 30) (455,) (114,)
print(x_train[:3])
print(y_train[:3], set(y_train))  # {0, 1}  0: malignant, 1: benign
# Ensemble model(VotingClassifier) : LogisticRegression + KNN + DecisionTreeClassifier
logi_regression = LogisticRegression()
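# note: with the default max_iter=100 the lbfgs solver may raise a ConvergenceWarning on this
# unscaled data; passing max_iter=10000 or scaling the features avoids it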
knn = KNeighborsClassifier(n_neighbors=3)
demodel = DecisionTreeClassifier()
voting_model = VotingClassifier(estimators=[('LR', logi_regression), ('KNN', knn), ('Decision', demodel)],
                                voting='soft')
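# voting='soft' averages each estimator's predict_proba and predicts the class with the highest mean
# probability (all three estimators here support predict_proba); voting='hard' would take a majority
# vote over the predicted labels instead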
classifiers = [logi_regression, knn, demodel]
# Train and evaluate each individual model
for classifier in classifiers:
    classifier.fit(x_train, y_train)
    pred = classifier.predict(x_test)
    class_name = classifier.__class__.__name__
    print('{0} accuracy : {1:.4f}'.format(class_name, accuracy_score(y_test, pred)))
# Train and evaluate the ensemble model
voting_model.fit(x_train, y_train)
vpred = voting_model.predict(x_test)
print('Ensemble model accuracy : {0:.4f}'.format(accuracy_score(y_test, vpred)))
<console>
mean radius mean texture ... worst symmetry worst fractal dimension
0 17.99 10.38 ... 0.4601 0.11890
1 20.57 17.77 ... 0.2750 0.08902
[2 rows x 30 columns]
(455, 30) (114, 30) (455,) (114,)
[[1.799e+01 2.066e+01 1.178e+02 9.917e+02 1.036e-01 1.304e-01 1.201e-01
8.824e-02 1.992e-01 6.069e-02 4.537e-01 8.733e-01 3.061e+00 4.981e+01
7.231e-03 2.772e-02 2.509e-02 1.480e-02 1.414e-02 3.336e-03 2.108e+01
2.541e+01 1.381e+02 1.349e+03 1.482e-01 3.735e-01 3.301e-01 1.974e-01
3.060e-01 8.503e-02]
[2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01
1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01
1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01
1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01
2.364e-01 7.678e-02]
[9.000e+00 1.440e+01 5.636e+01 2.463e+02 7.005e-02 3.116e-02 3.681e-03
3.472e-03 1.788e-01 6.833e-02 1.746e-01 1.305e+00 1.144e+00 9.789e+00
7.389e-03 4.883e-03 3.681e-03 3.472e-03 2.701e-02 2.153e-03 9.699e+00
2.007e+01 6.090e+01 2.855e+02 9.861e-02 5.232e-02 1.472e-02 1.389e-02
2.991e-01 7.804e-02]]
[0 0 1] {0, 1}
LogisticRegression accuracy : 0.9474
KNeighborsClassifier accuracy : 0.9211
DecisionTreeClassifier accuracy : 0.9386
Ensemble model accuracy : 0.9474
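The script above covers only voting. As a minimal sketch of the other two families named at the top, the snippet below runs scikit-learn's BaggingClassifier and GradientBoostingClassifier on the same train/test split; the hyperparameter values (n_estimators=100, random_state=1) are illustrative choices, not values from the original post.
# Minimal sketch (continues the script above): bagging and boosting on the same split
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
# bagging: train many trees on bootstrap resamples of the training data, then aggregate their votes
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
bagging.fit(x_train, y_train)
print('Bagging accuracy : {0:.4f}'.format(accuracy_score(y_test, bagging.predict(x_test))))
# boosting: fit shallow trees sequentially, each one correcting the errors of the ensemble so far
boosting = GradientBoostingClassifier(random_state=1)
boosting.fit(x_train, y_train)
print('Boosting accuracy : {0:.4f}'.format(accuracy_score(y_test, boosting.predict(x_test))))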