Logistic Regression example (diabetes data): saving a trained logistic regression model and loading it for reuse

Python Data Analysis 2022. 11. 18. 10:58

# Classification model for diabetes (yes/no) using the pima-indians-diabetes dataset
import pandas as pd
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

url = "https://raw.githubusercontent.com/pykwon/python/master/testdata_utf8/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(url, names = names, header = None)
print(df.head(3), df.shape) # (768, 9)

array = df.values # get the data as a NumPy matrix
print(array)
x = array[:, 0:8] # 2-D: the eight feature columns
y = array[:, 8]   # 1-D: the Outcome label
print(x.shape, y.shape) # (768, 8) (768,)

x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size = 0.3, random_state = 7)
print(x_train.shape, x_test.shape) # (537, 8) (231, 8)

model = LogisticRegression()
model.fit(x_train, y_train)
print('Predicted :', model.predict(x_test[:10]))
print('Actual :', y_test[:10])
print((model.predict(x_test) != y_test).sum()) # 57 misclassified test rows
print('Classification accuracy on test set :', model.score(x_test, y_test))    # 0.75324
print('Classification accuracy on train set :', model.score(x_train, y_train)) # 0.78212; a large gap between the two is a bad sign (possible overfitting).

from sklearn.metrics import accuracy_score
pred = model.predict(x_test)
print('Classification accuracy :', accuracy_score(y_test, pred))

import joblib
import pickle

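# Once training is finished, save the model and load it back for later use.
# joblib.dump(model, 'pima_model.sav')
with open('pima_model.sav', 'wb') as f:  # the with-block closes the file handle
    pickle.dump(model, f)

# mymodel = joblib.load('pima_model.sav')
with open('pima_model.sav', 'rb') as f:
    mymodel = pickle.load(f)
print('Classification accuracy on test set :', mymodel.score(x_test, y_test))    # 0.75324

# Predict for a single new sample (here, the same values as the first test row)
print(x_test[:1])
print(mymodel.predict([[1., 90., 62., 12., 43., 27.2, 0.58, 24.]]))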




<console>
   Pregnancies  Glucose  BloodPressure  ...  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72  ...                     0.627   50        1
1            1       85             66  ...                     0.351   31        0
2            8      183             64  ...                     0.672   32        1

[3 rows x 9 columns] (768, 9)
[[  6.    148.     72.    ...   0.627  50.      1.   ]
 [  1.     85.     66.    ...   0.351  31.      0.   ]
 [  8.    183.     64.    ...   0.672  32.      1.   ]
 ...
 [  5.    121.     72.    ...   0.245  30.      0.   ]
 [  1.    126.     60.    ...   0.349  47.      1.   ]
 [  1.     93.     70.    ...   0.315  23.      0.   ]]
(768, 8) (768,)
(537, 8) (231, 8)

Predicted : [0. 1. 1. 0. 0. 0. 0. 0. 1. 0.]
Actual : [0. 1. 1. 0. 1. 1. 0. 1. 0. 0.]
57
Classification accuracy on test set : 0.7532467532467533
Classification accuracy on train set : 0.7821229050279329
Classification accuracy : 0.7532467532467533
Classification accuracy on test set : 0.7532467532467533
[[ 1.   90.   62.   12.   43.   27.2   0.58 24.  ]]
[0.]
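
The misclassification count and the test accuracy printed above are two views of the same result. As a quick arithmetic check, using only numbers that appear in the output (231 test rows, 57 wrong predictions):

```python
# Values copied from the console output above.
n_test = 231   # rows in x_test
n_wrong = 57   # (model.predict(x_test) != y_test).sum()

# Accuracy is the share of test rows that were predicted correctly.
accuracy = (n_test - n_wrong) / n_test
print(round(accuracy, 5))  # 0.75325
```

This matches the 0.7532467... reported by both model.score and accuracy_score.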

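When a logistic regression model is used in a project, train it once, save it, and load the saved file when it is needed. Otherwise the model would be retrained every time the page loads, which is inefficient.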
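
A minimal sketch of that "train once, load on demand" idea follows. It is only an illustration under stated assumptions: the data is synthetic (make_classification stands in for the diabetes data so the example runs offline), and MODEL_PATH, train_and_save, get_model, and handle_request are hypothetical names that do not appear in the post. Caching the loaded model with functools.lru_cache is one possible choice, not the only one.

```python
import os
import pickle
import tempfile
from functools import lru_cache

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical path for this demo; the post itself saves to 'pima_model.sav'.
MODEL_PATH = os.path.join(tempfile.gettempdir(), "demo_model.sav")


def train_and_save(path):
    """Run once, offline: fit on synthetic stand-in data and persist the model."""
    x, y = make_classification(n_samples=200, n_features=8, random_state=7)
    model = LogisticRegression(max_iter=1000).fit(x, y)
    with open(path, "wb") as f:
        pickle.dump(model, f)


@lru_cache(maxsize=1)
def get_model(path=MODEL_PATH):
    """Read the saved model on the first call; later calls reuse the cached object."""
    with open(path, "rb") as f:
        return pickle.load(f)


def handle_request(features):
    """Hypothetical page/request handler: predicts without retraining."""
    return int(get_model().predict([features])[0])


train_and_save(MODEL_PATH)             # done once, not per page load
print(handle_request([0.0] * 8))       # predicted class label for one sample
print(get_model() is get_model())      # True: the file is read only once
```

Because pickle.load can execute arbitrary code, only load model files that come from a trusted source.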
