Python Data Analysis Basics 51 - Using the LinearRegression Model for Linear Regression Analysis (summary() Not Supported)

코딩탕탕 2022. 11. 16. 14:51

LinearRegression has no summary() function, so use the r2_score() function to check the model's explanatory power (R²).
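
As a minimal sketch of this point, the snippet below fits LinearRegression on a small made-up dataset (the values are illustrative and not from this post), checks that the fitted estimator has no summary attribute, and reads R² from r2_score() instead. For regressors, the estimator's own score() method returns the same quantity.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Toy data, made up for illustration only
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

model = LinearRegression().fit(X, y)

has_summary = hasattr(model, "summary")   # no statsmodels-style report here
r2 = r2_score(y, model.predict(X))        # explanatory power (R^2)
r2_from_score = model.score(X, y)         # same R^2 via the estimator itself

print(has_summary)
print(r2, r2_from_score)
```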

scikit-learn expects the independent variable as a 2-D matrix, not a 1-D vector.
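
A short sketch of what this means in practice, using made-up values: a 1-D array of shape (n,) is rejected by fit(), while the same data reshaped to (n, 1) with reshape(-1, 1) is accepted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0])   # 1-D vector, shape (4,)
y = np.array([3.0, 5.0, 7.0, 9.0])   # generated as y = 2x + 1

X = x.reshape(-1, 1)                 # 2-D matrix, shape (4, 1)

try:
    LinearRegression().fit(x, y)     # 1-D input raises ValueError
    failed = False
except ValueError:
    failed = True

model = LinearRegression().fit(X, y)
print(x.shape, X.shape, failed)
print(model.coef_, model.intercept_)
```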

# Build a linear regression model with LinearRegression (summary() is not supported)
# Look at the scores that can be used to evaluate a linear regression model

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler # min-max normalization
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
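# scikit-learn expects the independent variable as a 2-D matrix, not a 1-D vector.

# Create sample data with random noise
sample_size = 100
np.random.seed(1)

print('Two variables generated with the same standard deviation : small variance')
x = np.random.normal(0, 10, sample_size) # random values with mean 0 and standard deviation 10
y = np.random.normal(0, 10, sample_size) + x * 30
print(x[:5])
print(y[:5])
print('Correlation coefficient :',np.corrcoef(x, y)) # 0.99939357

# Normalize the independent variable x (min-max scaling to [0, 1])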
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x.reshape(-1, 1))
print(x_scaled[:5])

# Visualization
# plt.scatter(x_scaled, y)
# plt.show()

model = LinearRegression().fit(x_scaled, y)
y_pred = model.predict(x_scaled)
print('Predicted :', y_pred[:5])
print('Actual :', y[:5])
# print(model.summary()) # AttributeError: 'LinearRegression' object has no attribute 'summary'

print()
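# Helper function that prints model performance scores
def RegScore_func(y_true, y_pred):
    print('r2_score(coefficient of determination, explanatory power):{}'.format(r2_score(y_true, y_pred)))
    print('explained_variance_score(explained variance score):{}'.format(explained_variance_score(y_true, y_pred)))
    print('mean_squared_error(MSE, mean squared error):{}'.format(mean_squared_error(y_true, y_pred)))
    # RMSE: root mean squared error, i.e. the square root of the MSE printed above
    # MSE: the sum of squared differences between predicted and actual (observed) values, divided by the sample size
    # If r2_score and explained_variance_score differ, the residuals have a nonzero mean (the predictions are biased), which signals a problem with the fit.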
    
RegScore_func(y, y_pred)

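print('Two variables generated with different standard deviations : large variance')
x = np.random.normal(0, 1, sample_size) # random values with mean 0 and standard deviation 1
y = np.random.normal(0, 500, sample_size) + x * 30
print(x[:5])
print(y[:5])
print('Correlation coefficient :',np.corrcoef(x, y)) # 0.00401167

# Normalize the independent variable x (min-max scaling to [0, 1])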
scaler2 = MinMaxScaler()
x_scaled2 = scaler2.fit_transform(x.reshape(-1, 1))
print(x_scaled2[:5])

# Visualization
plt.scatter(x_scaled2, y)
plt.show()

model2 = LinearRegression().fit(x_scaled2, y)
y_pred2 = model2.predict(x_scaled2)
print('Predicted :', y_pred2[:5])
print('Actual :', y[:5])
# print(model2.summary()) # AttributeError: 'LinearRegression' object has no attribute 'summary'

print()
# Evaluate the second model with the RegScore_func helper defined above
RegScore_func(y, y_pred2)
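
The comments in the scoring helper can be made concrete with a small sketch. The function manual_scores below is a hypothetical helper (not part of scikit-learn or of this post) that recomputes the three scores from their definitions on made-up numbers. It illustrates two points under those assumptions: mean_squared_error() returns the MSE, whose square root is the RMSE, and R² matches the explained variance score only when the residuals average to zero.

```python
import numpy as np
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error

def manual_scores(y_true, y_pred):
    """Hypothetical helper: recompute the scores from their definitions."""
    resid = y_true - y_pred
    mse = np.mean(resid ** 2)                                   # mean squared error
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    ev = 1.0 - np.var(resid) / np.var(y_true)                   # ignores the residual mean
    return r2, ev, mse, np.sqrt(mse)                            # last value is the RMSE

# Made-up values for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_unbiased = np.array([3.5, 4.5, 7.0, 9.5, 10.5])  # residuals average to zero
y_biased = y_unbiased + 2.0                         # same errors plus a constant offset

r2_u, ev_u, mse_u, rmse_u = manual_scores(y_true, y_unbiased)
r2_b, ev_b, mse_b, rmse_b = manual_scores(y_true, y_biased)

print(r2_u, ev_u)   # equal: no systematic bias
print(r2_b, ev_b)   # R^2 drops, explained variance does not
```

An ordinary least squares fit with an intercept always has zero-mean residuals on its own training data, which is why the two scores are identical in both runs of this post.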



<console>
Two variables generated with the same standard deviation : small variance
[ 16.24345364  -6.11756414  -5.28171752 -10.72968622   8.65407629]
[ 482.83232345 -171.28184705 -154.41660926 -315.95480141  248.67317034]
Correlation coefficient : [[1.         0.99939357]
 [0.99939357 1.        ]]
[[0.87492405]
 [0.37658554]
 [0.39521325]
 [0.27379961]
 [0.70578689]]
Predicted : [ 490.32381062 -182.64057041 -157.48540955 -321.44435455  261.91825779]
Actual : [ 482.83232345 -171.28184705 -154.41660926 -315.95480141  248.67317034]

r2_score(coefficient of determination, explanatory power):0.9987875127274646
explained_variance_score(explained variance score):0.9987875127274646
mean_squared_error(MSE, mean squared error):86.14795101998743
Two variables generated with different standard deviations : large variance
[-0.40087819  0.82400562 -0.56230543  1.95487808 -1.33195167]
[1020.86531436 -710.85829436 -431.95511059 -381.64245767 -179.50741077]
Correlation coefficient : [[1.         0.00401167]
 [0.00401167 1.        ]]
[[0.45631435]
 [0.68996139]
 [0.42552204]
 [0.90567574]
 [0.27871173]]
Predicted : [-10.75792685  -8.15919008 -11.10041394  -5.7599096  -12.73331002]
Actual : [1020.86531436 -710.85829436 -431.95511059 -381.64245767 -179.50741077]

r2_score(coefficient of determination, explanatory power):1.6093526521765433e-05
explained_variance_score(explained variance score):1.6093526521765433e-05
mean_squared_error(MSE, mean squared error):282457.9703485092
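
The output above also shows a useful relationship: for a simple linear regression with an intercept, evaluated on its own training data, R² equals the square of the Pearson correlation coefficient (0.99939357² ≈ 0.99878751 in the first run, 0.00401167² ≈ 1.6093e-05 in the second). The sketch below checks this numerically. It reuses the seed and data recipe of the first case but, as a simplifying assumption, skips the MinMaxScaler step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same seed and data recipe as the first case in the post; scaling is skipped here
np.random.seed(1)
sample_size = 100
x = np.random.normal(0, 10, sample_size)
y = np.random.normal(0, 10, sample_size) + x * 30

X = x.reshape(-1, 1)                      # 2-D input for scikit-learn
model = LinearRegression().fit(X, y)

corr = np.corrcoef(x, y)[0, 1]            # Pearson correlation between x and y
r2 = r2_score(y, model.predict(X))        # R^2 of the fitted line

print(corr ** 2, r2)                      # the two values agree
```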

Two variables generated with the same standard deviation : small variance

Two variables generated with different standard deviations : large variance