Python Data Analysis Basics 61 - Random forest

코딩탕탕 2022. 11. 22. 12:23
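Random forest is an ensemble machine learning model.
It builds multiple decision trees, passes each new data point through every tree,
and holds a vote over the trees' classifications; the class with the most votes becomes the final prediction.
It uses bagging (bootstrap aggregating): each tree is trained on a bootstrap sample of the training data.
This example uses the Titanic dataset.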
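
To make the bagging and voting idea concrete, here is a minimal hand-rolled sketch. It is not how the post's model is built internally; it only illustrates the description above. The synthetic data from `make_classification`, the tree count of 25, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data (assumption: it merely stands in for the Titanic features)
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a 0/1 vote can never tie
trees = []
for _ in range(n_trees):
    # Bagging: draw a bootstrap sample (same size as the training set, with replacement)
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_features='sqrt',
                                  random_state=int(rng.integers(1_000_000)))
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Every tree classifies every test point: shape (n_trees, n_test)
votes = np.array([t.predict(X_test) for t in trees])
# Majority vote per test point (labels are 0/1)
majority = (votes.sum(axis=0) > n_trees / 2).astype(int)
vote_acc = (majority == y_test).mean()
print('majority-vote accuracy:', vote_acc)
```

Note that scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities rather than counting hard votes, so its predictions can differ slightly from this sketch.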

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import pandas as pd
import numpy as np

df = pd.read_csv('../testdata/titanic_data.csv')
print(df.head(3))
print(df.columns)
print(df.info())
print(df.isnull().any()) # check which columns contain missing values

df = df.dropna(subset=['Pclass', 'Age', 'Sex'])
print(df.shape) # (714, 12)

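# features; df_x is a slice of df, so appending .copy() here would avoid the
# SettingWithCopyWarning that appears in the console output
df_x = df[['Pclass', 'Age', 'Sex']]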
print(df_x.head(2))

# encoding (categorical features are encoded here; no scaling is applied)
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label-encode the Sex column as integers
df_x.loc[:, 'Sex'] = LabelEncoder().fit_transform(df_x['Sex'])
print(df_x.head(5))
# LabelEncoder assigns codes in alphabetical order of the labels: female -> 0, male -> 1.
# Equivalent alternative: df_x['Sex'] = df_x['Sex'].apply(lambda x: 1 if x == 'male' else 0)
print(df_x.head(2))
# print(set(df_x['Pclass'])) # {1, 2, 3}
df_y = df['Survived']
print(df_y.head(2))

# One-hot encoding of the Pclass column
# (the vector length equals the number of categories; the index of the matching category is set to 1 and every other element to 0)
df_x2 = pd.DataFrame(OneHotEncoder().fit_transform(df_x['Pclass'].values[:, np.newaxis]).toarray(),
                     columns=['f_class', 's_class', 't_class'], index=df_x.index)
print(df_x2.head(2))

df_x = pd.concat([df_x, df_x2], axis=1)
print(df_x.head(10))

# train / test split
train_x, test_x, train_y, test_y = train_test_split(df_x, df_y, test_size = 0.25, random_state=12)
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape) # (535, 6) (179, 6) (535,) (179,)

# model
# random_state is not set, so each run builds a different forest
model = RandomForestClassifier(n_estimators=500, criterion='entropy')
model.fit(train_x, train_y)

pred = model.predict(test_x)
print('예측값 :', pred[:5])  # predicted values
print('실제값 :', np.array(test_y[:5]))  # actual values

# accuracy
print('acc :', sum(test_y == pred) / len(test_y)) # e.g. acc : 0.8156424581005587; the value changes from run to run
from sklearn.metrics import accuracy_score
print('acc :', accuracy_score(test_y, pred))

# cross-validation
cross_vali = cross_val_score(model, df_x, df_y, cv = 5)
print(cross_vali)
print(np.mean(cross_vali))

# feature importance
print('특성(변수) 중요도 :',model.feature_importances_)  # label reads "feature (variable) importance"

import matplotlib.pyplot as plt
def plot_importance(model):
    n_features = df_x.shape[1]
    plt.barh(range(n_features), model.feature_importances_, align = 'center')
    plt.yticks(np.arange(n_features), df_x.columns)
    plt.xlabel('feature_importances_')
    plt.ylabel('feature')
    plt.show()
    
plot_importance(model)
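
As a side note on the encoding step, the sketch below shows one possible way to run the same preprocessing without triggering the SettingWithCopyWarning seen in the console output. It is a variant, not the post's code: the toy rows are copied from rows 0-4 and 9 of the console output, and using `pd.get_dummies` with a `class` prefix instead of `OneHotEncoder` is an assumption made to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows copied from the console output (rows 0-4 and 9); illustrative only
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 1, 3, 2],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0, 14.0],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'female'],
})

# .copy() makes df_x an independent DataFrame, so assigning into it
# is unambiguous and pandas has nothing to warn about
df_x = df[['Pclass', 'Age', 'Sex']].copy()

encoder = LabelEncoder()
df_x['Sex'] = encoder.fit_transform(df_x['Sex'])
print(list(encoder.classes_))  # sorted labels; the position is the integer code
print(df_x['Sex'].tolist())

# One-hot encode Pclass; get_dummies names each column after the value it found
dummies = pd.get_dummies(df_x['Pclass'], prefix='class', dtype=float)
df_x = pd.concat([df_x, dummies], axis=1)
print(df_x.columns.tolist())
```

Because `encoder.classes_` is sorted, `female` sits at index 0 and `male` at index 1, which matches the 0/1 values in the console output.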


<console>
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S

[3 rows x 12 columns]
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
PassengerId    False
Survived       False
Pclass         False
Name           False
Sex            False
Age             True
SibSp          False
Parch          False
Ticket         False
Fare           False
Cabin           True
Embarked        True
dtype: bool
(714, 12)
   Pclass   Age     Sex
0       3  22.0    male
1       1  38.0  female
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_x.loc[:, 'Sex'] = LabelEncoder().fit_transform(df_x['Sex'])
   Pclass   Age  Sex
0       3  22.0    1
1       1  38.0    0
2       3  26.0    0
3       1  35.0    0
4       3  35.0    1
   Pclass   Age  Sex
0       3  22.0    1
1       1  38.0    0
0    0
1    1
Name: Survived, dtype: int64
   f_class  s_class  t_class
0      0.0      0.0      1.0
1      1.0      0.0      0.0
    Pclass   Age  Sex  f_class  s_class  t_class
0        3  22.0    1      0.0      0.0      1.0
1        1  38.0    0      1.0      0.0      0.0
2        3  26.0    0      0.0      0.0      1.0
3        1  35.0    0      1.0      0.0      0.0
4        3  35.0    1      0.0      0.0      1.0
6        1  54.0    1      1.0      0.0      0.0
7        3   2.0    1      0.0      0.0      1.0
8        3  27.0    0      0.0      0.0      1.0
9        2  14.0    0      0.0      1.0      0.0
10       3   4.0    0      0.0      0.0      1.0
(535, 6) (179, 6) (535,) (179,)
예측값 : [1 0 0 0 0]
실제값 : [1 0 0 0 1]
acc : 0.8212290502793296
acc : 0.8212290502793296
[0.76223776 0.83216783 0.82517483 0.83216783 0.83098592]
0.816546833448242
특성(변수) 중요도 : [0.0561767  0.55458455 0.302526   0.03068822 0.0131424  0.04288215]
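
The importance array above is printed in column order, which makes it hard to read on its own. As a small sketch, the values can be paired with the column names and sorted. Both the numbers and the column order are copied from the console output of this particular run; the variable names are assumptions for illustration.

```python
import pandas as pd

# Column order of df_x and the importances, both copied from the console output
columns = ['Pclass', 'Age', 'Sex', 'f_class', 's_class', 't_class']
importances = [0.0561767, 0.55458455, 0.302526, 0.03068822, 0.0131424, 0.04288215]

# Pair each importance with its feature name and sort from most to least important
ranked = pd.Series(importances, index=columns).sort_values(ascending=False)
print(ranked)
print('total:', round(ranked.sum(), 6))
```

In this run Age and Sex dominate. The passenger class information is spread over Pclass and its three one-hot columns, which together account for roughly 0.14 of the total.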