Python Data Analysis Basics 51 - Using the LinearRegression model for linear regression (summary() is not supported)

Python Data Analysis 2022. 11. 16. 14:51

LinearRegression has no summary() method, so use the r2_score() function to check the model's explanatory power (the coefficient of determination, R²).
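As a minimal sketch (the toy data here is made up for illustration and is not the data used in this post), the fitted model's own score() method returns the same R² that r2_score() gives on the model's predictions, and the fitted coefficients are available as attributes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical toy data: y is roughly 3 * x plus some noise
rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(50, 1))            # 2-D: (n_samples, n_features)
y = 3 * X[:, 0] + rng.normal(0, 0.5, size=50)

model = LinearRegression().fit(X, y)

r2_from_metric = r2_score(y, model.predict(X))  # R^2 via sklearn.metrics
r2_from_score = model.score(X, y)               # R^2 via the estimator itself
print(r2_from_metric, r2_from_score)

# There is no summary(), but the slope and intercept can be read directly
print(model.coef_, model.intercept_)
```

Either call works; r2_score() is convenient when the predictions are already stored in a variable.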

sklearn expects the independent variables as a matrix (a 2-D array of shape (n_samples, n_features)), not as a 1-D vector.
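A small sketch of what that means in practice, using made-up values: a 1-D array has to be reshaped with reshape(-1, 1) into a single-column matrix before it is passed to fit().

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0])   # 1-D vector, shape (4,)
y = np.array([2.0, 4.0, 6.0, 8.0])

X = x.reshape(-1, 1)                  # single-column matrix, shape (4, 1)
print(x.shape, X.shape)

model = LinearRegression().fit(X, y)  # works with the 2-D input

# Passing the 1-D vector as the feature array is rejected
try:
    LinearRegression().fit(x, y)
    raised = False
except ValueError:
    raised = True
print('1-D input rejected:', raised)
```

The target y can stay 1-D; only the feature array needs the extra dimension.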

# Building a linear regression model with LinearRegression - summary() is not supported
# Looking at the scores that can be used to evaluate a linear regression model

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler # min-max normalization
from sklearn.metrics import r2_score, explained_variance_score, mean_squared_error
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
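# sklearn expects the independent variable as a 2-D matrix, not a 1-D vector.

# Create sample data with some spread (noise)
sample_size = 100
np.random.seed(1)

print('Generate two variables with the same standard deviation: small variance')
x = np.random.normal(0, 10, sample_size) # random values with mean 0 and standard deviation 10 (the second argument is the standard deviation, not the variance)
y = np.random.normal(0, 10, sample_size) + x * 30
print(x[:5])
print(y[:5])
print('Correlation coefficient :',np.corrcoef(x, y)) # 0.99939357

# Normalize (min-max scale) the independent variable x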
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x.reshape(-1, 1))
print(x_scaled[:5])

# Visualization
# plt.scatter(x_scaled, y)
# plt.show()

model = LinearRegression().fit(x_scaled, y)
y_pred = model.predict(x_scaled)
print('Predicted :', y_pred[:5])
print('Actual :', y[:5])
# print(model.summary()) # AttributeError: 'LinearRegression' object has no attribute 'summary'

print()
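# Helper that prints model performance metrics
def RegScore_func(y_true, y_pred):
    print('r2_score(coefficient of determination, explanatory power):{}'.format(r2_score(y_true, y_pred)))
    print('explained_variance_score(explained variance score):{}'.format(explained_variance_score(y_true, y_pred)))
    print('mean_squared_error(MSE, mean squared error):{}'.format(mean_squared_error(y_true, y_pred)))
    # MSE: the sum of squared differences between predicted and observed values, divided by the sample size
    # RMSE: the square root of MSE; mean_squared_error itself returns MSE, not RMSE
    # r2_score and explained_variance_score differ only when the mean of the residuals is not zero,
    # so a gap between the two indicates systematically biased predictions.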
    
RegScore_func(y, y_pred)

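print('Generate two variables with different standard deviations: large variance')
x = np.random.normal(0, 1, sample_size) # random values with mean 0 and standard deviation 1
y = np.random.normal(0, 500, sample_size) + x * 30
print(x[:5])
print(y[:5])
print('Correlation coefficient :',np.corrcoef(x, y)) # 0.00401167

# Normalize (min-max scale) the independent variable x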
scaler2 = MinMaxScaler()
x_scaled2 = scaler2.fit_transform(x.reshape(-1, 1))
print(x_scaled2[:5])

# Visualization
plt.scatter(x_scaled2, y)
plt.show()

model2 = LinearRegression().fit(x_scaled2, y)
y_pred2 = model2.predict(x_scaled2)
print('Predicted :', y_pred2[:5])
print('Actual :', y[:5])
# print(model2.summary()) # AttributeError: 'LinearRegression' object has no attribute 'summary'

print()
# Evaluate the second model with the same helper (RegScore_func, defined above);
# a second identical function is not needed.
RegScore_func(y, y_pred2)
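The value printed by mean_squared_error() is the MSE. If the RMSE is wanted, one simple way is to take the square root of that value. A minimal sketch with made-up numbers (not the data from this post):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed and predicted values for illustration
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_true, y_hat)  # mean of the squared errors
rmse = np.sqrt(mse)                      # back in the same units as y
print('MSE :', mse)
print('RMSE:', rmse)
```

RMSE is often easier to read because it is on the same scale as the target variable.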



<console>
Generate two variables with the same standard deviation: small variance
[ 16.24345364  -6.11756414  -5.28171752 -10.72968622   8.65407629]
[ 482.83232345 -171.28184705 -154.41660926 -315.95480141  248.67317034]
Correlation coefficient : [[1.         0.99939357]
 [0.99939357 1.        ]]
[[0.87492405]
 [0.37658554]
 [0.39521325]
 [0.27379961]
 [0.70578689]]
Predicted : [ 490.32381062 -182.64057041 -157.48540955 -321.44435455  261.91825779]
Actual : [ 482.83232345 -171.28184705 -154.41660926 -315.95480141  248.67317034]

r2_score(coefficient of determination, explanatory power):0.9987875127274646
explained_variance_score(explained variance score):0.9987875127274646
mean_squared_error(MSE, mean squared error):86.14795101998743
Generate two variables with different standard deviations: large variance
[-0.40087819  0.82400562 -0.56230543  1.95487808 -1.33195167]
[1020.86531436 -710.85829436 -431.95511059 -381.64245767 -179.50741077]
Correlation coefficient : [[1.         0.00401167]
 [0.00401167 1.        ]]
[[0.45631435]
 [0.68996139]
 [0.42552204]
 [0.90567574]
 [0.27871173]]
Predicted : [-10.75792685  -8.15919008 -11.10041394  -5.7599096  -12.73331002]
Actual : [1020.86531436 -710.85829436 -431.95511059 -381.64245767 -179.50741077]

r2_score(coefficient of determination, explanatory power):1.6093526521765433e-05
explained_variance_score(explained variance score):1.6093526521765433e-05
mean_squared_error(MSE, mean squared error):282457.9703485092
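In both runs the R² and the explained variance score are identical. That is expected for a least-squares fit with an intercept evaluated on its own training data, because the residuals then average to zero. The two scores separate only when the predictions are systematically shifted. A minimal sketch with made-up values, where every prediction is off by a constant +1:

```python
import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

# Hypothetical values: predictions follow the pattern perfectly
# but are all shifted upward by 1
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_biased = y_true + 1.0

r2 = r2_score(y_true, y_biased)                  # penalizes the constant offset
ev = explained_variance_score(y_true, y_biased)  # ignores the constant offset
print('r2_score                :', r2)
print('explained_variance_score:', ev)
```

Here the explained variance score stays at 1.0 while R² drops to 0.5, so a gap between the two points to bias in the predictions rather than to extra noise.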

Two variables generated with the same standard deviation: small variance

Two variables generated with different standard deviations: large variance
