TensorFlow Basics 40 - Text Generation Model After Splitting Text into Jamo Units

TensorFlow 2022. 12. 16. 10:22

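# Text generation model trained on text split into jamo (individual Hangul letters)
# !pip install jamotools
# !pip --use-deprecated=legacy-resolver install <module_name>  # use when a library will not install against the current versions (e.g. installing an arbitrary module on an older Python version)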

import jamotools
import tensorflow as tf
import numpy as np


path_to_file = tf.keras.utils.get_file("toji.txt", "https://raw.githubusercontent.com/pykwon/etc/master/rnn_short_toji.txt")

with open(path_to_file, 'rb') as f:  # close the file once it has been read
    train_text = f.read().decode(encoding='utf-8')
s = train_text[:100]
print(s)

s_split = jamotools.split_syllables(s)
print(s_split)

s2 = jamotools.join_jamos(s_split)
print(s2)
print(s == s2)
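The round trip above works because every precomposed Hangul syllable in Unicode is an arithmetic combination of an initial consonant, a vowel, and an optional final consonant. The following is a minimal pure-Python sketch of that decomposition; it is not the implementation used by jamotools, and the name `split_syllables_sketch` is hypothetical. It emits Hangul Compatibility Jamo, which matches the characters seen in the console output of this post.

```python
# Hangul syllables occupy U+AC00..U+D7A3 and are laid out as
# code = 0xAC00 + (initial * 21 + vowel) * 28 + final
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals, index 0 = none


def split_syllables_sketch(text):
    """Split each Hangul syllable into jamo; leave other characters unchanged."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:              # inside the syllable block
            out.append(CHOSEONG[code // (21 * 28)])
            out.append(JUNGSEONG[(code % (21 * 28)) // 28])
            out.append(JONGSEONG[code % 28])      # empty string when there is no final
        else:
            out.append(ch)                        # punctuation, spaces, hanja, ...
    return "".join(out)


print(split_syllables_sketch("한번"))  # ㅎㅏㄴㅂㅓㄴ
```

Working at this level shrinks the vocabulary from thousands of possible syllables to a few dozen jamo plus punctuation and hanja, which is why the vocabulary in this post has only 136 entries.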

# split train_text into jamo
train_text_x = jamotools.split_syllables(train_text)
vocab = sorted(set(train_text_x))
vocab.append('UNK') # symbols not found in the vocabulary are mapped to the 'UNK' entry
print(len(vocab)) # 136

char2idx = {u:i for i, u in enumerate(vocab)}
print(char2idx) # every symbol now has an integer index

idx2char = np.array(vocab)
print(idx2char)
text_as_int = np.array([char2idx[c] for c in train_text_x])
print(text_as_int)

print(train_text_x[:20])
print(text_as_int[:20])

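# build the training data
seq_length = 80
exam_per_epoch = len(text_as_int) // seq_length  # rough estimate; batch(seq_length + 1) below actually yields len(text_as_int) // (seq_length + 1) chunks
print(exam_per_epoch) # 8636
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

char_dataset = char_dataset.batch(seq_length + 1, drop_remainder=True) # each chunk holds 80 input jamo plus the one jamo that follows them as the target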

for item in char_dataset.take(1):
    print(idx2char[item.numpy()])
    print(item.numpy())
    
def split_input_target2(chunk):
    return [chunk[:-1], chunk[-1]]

train_dataset = char_dataset.map(split_input_target2)

for x, y in train_dataset.take(1):
    print(idx2char[x.numpy()])
    print(x.numpy())
    print(idx2char[y.numpy()])
    print(y.numpy())
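The `batch(seq_length + 1, drop_remainder=True)` call followed by `split_input_target2` cuts the index sequence into non-overlapping chunks and uses the last element of each chunk as the target. The sketch below reproduces that logic in plain Python on a toy sequence, without TensorFlow; `make_examples` is a hypothetical helper written only for illustration.

```python
def make_examples(ids, seq_length):
    """Cut ids into non-overlapping chunks of seq_length + 1 items and split
    each chunk into (input, target). An incomplete tail is dropped, which is
    what drop_remainder=True does."""
    chunk = seq_length + 1
    n_chunks = len(ids) // chunk
    examples = []
    for k in range(n_chunks):
        window = ids[k * chunk:(k + 1) * chunk]
        examples.append((window[:-1], window[-1]))  # first seq_length items, then the next one
    return examples


toy_ids = list(range(10))
examples = make_examples(toy_ids, seq_length=3)
print(examples)  # [([0, 1, 2], 3), ([4, 5, 6], 7)]
```

Because the chunks do not overlap, the number of examples is `len(ids) // (seq_length + 1)`, slightly fewer than `len(ids) // seq_length`.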
    
# model
BATCH_SIZE = 64
steps_per_epoch = exam_per_epoch // BATCH_SIZE
print('steps_per_epoch :', steps_per_epoch)
train_dataset = train_dataset.shuffle(buffer_size=5000).batch(BATCH_SIZE, drop_remainder=True)

total_chars = len(vocab)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(total_chars, 100, input_length=seq_length),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(256, activation='tanh'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(total_chars, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
print(model.summary())  # summary() prints the table itself and returns None, so this also prints "None"

from keras.utils import pad_sequences

def testmodel(epoch, logs):
    if epoch % 5 != 0 and epoch != 49:
        return
    
    # continue only when the epoch index is a multiple of 5 or equal to 49
    test_sentence = train_text[:48]
    test_sentence = jamotools.split_syllables(test_sentence)
    next_chars = 300
    for _ in range(next_chars):
        test_text_x = test_sentence[-seq_length:]
        test_text_x = np.array([char2idx[c] if c in char2idx else char2idx['UNK'] for c in test_text_x])
        test_text_x = pad_sequences([test_text_x], maxlen=seq_length, padding='pre', value=char2idx['UNK'])
        output_idx = np.argmax(model.predict(test_text_x), axis = -1)
        test_sentence += idx2char[output_idx[0]]
    print()
    print(jamotools.join_jamos(test_sentence))
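The input preparation inside the loop (take the last `seq_length` jamo, map unknown symbols to `'UNK'`, left-pad to a fixed length) can be illustrated without Keras. This is a sketch under the assumption that it mirrors `pad_sequences(..., padding='pre', value=char2idx['UNK'])` for inputs that are already at most `seq_length` long; `encode_and_pad` and the toy vocabulary are hypothetical.

```python
def encode_and_pad(jamo_text, char2idx, seq_length, unk="UNK"):
    """Encode the last seq_length symbols and left-pad with the UNK index."""
    tail = jamo_text[-seq_length:]                        # keep only the most recent context
    ids = [char2idx.get(c, char2idx[unk]) for c in tail]  # unknown symbols fall back to UNK
    padding = [char2idx[unk]] * (seq_length - len(ids))   # 'pre' padding goes in front
    return padding + ids


toy_char2idx = {"ㄱ": 0, "ㅏ": 1, "UNK": 2}
print(encode_and_pad("ㄱㅏx", toy_char2idx, 5))        # [2, 2, 0, 1, 2]
print(encode_and_pad("ㄱㅏㄱㅏㄱㅏ", toy_char2idx, 4))  # [0, 1, 0, 1]
```

Note that the training chunks always contain exactly `seq_length` real symbols, so a seed shorter than `seq_length` gives the model padded inputs of a kind it has not seen during training; the first generated jamo may be less reliable for that reason.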

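# callback for inspecting the text the model generates while it is being trained
testmodelcb = tf.keras.callbacks.LambdaCallback(on_epoch_end=testmodel)  # calls testmodel at the end of every epoch

# repeat(): with no argument the dataset repeats indefinitely (with a count, that many times), regardless of where one pass over the data ends; steps_per_epoch therefore decides how long an epoch is
history = model.fit(train_dataset.repeat(), epochs=50, steps_per_epoch=steps_per_epoch, callbacks=[testmodelcb], verbose=2)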

model.save('nlp14.hdf5')

# generate new text with the trained model, starting from an arbitrary sentence
test_sentence = '최참판댁 사랑은 무인지경처럼 적막하다.'

test_sentence = jamotools.split_syllables(test_sentence)
print(test_sentence)

# same generation loop as in testmodel above
next_chars = 500
for _ in range(next_chars):
    test_text_x = test_sentence[-seq_length:]
    test_text_x = np.array([char2idx[c] if c in char2idx else char2idx['UNK'] for c in test_text_x])
    test_text_x = pad_sequences([test_text_x], maxlen=seq_length, padding='pre', value=char2idx['UNK'])
    output_idx = np.argmax(model.predict(test_text_x), axis = -1)
    test_sentence += idx2char[output_idx[0]]

print('Generated text ----------')
print(jamotools.join_jamos(test_sentence))

# visualization
import matplotlib.pyplot as plt
plt.plot(history.history['loss'], c='r', label='loss')
plt.legend()
plt.show()

plt.plot(history.history['accuracy'], c='b', label='accuracy')
plt.legend()
plt.show()


    
<console>
귀녀의 모습을 한번 쳐다보고 떠나려 했다. 집안을 이리저리 기웃거리던 강표수는 윤씨부

인에게 인사를 올리고 중문을 나서는  치수 뒷모습을 보았다. 실망에  얼굴이 일그러지면서 

ㄱㅟㄴㅕㅇㅢ ㅁㅗㅅㅡㅂㅇㅡㄹ ㅎㅏㄴㅂㅓㄴ ㅊㅕㄷㅏㅂㅗㄱㅗ ㄸㅓㄴㅏㄹㅕ ㅎㅐㅆㄷㅏ. ㅈㅣㅂㅇㅏㄴㅇㅡㄹ ㅇㅣㄹㅣㅈㅓㄹㅣ ㄱㅣㅇㅜㅅㄱㅓㄹㅣㄷㅓㄴ ㄱㅏㅇㅍㅛㅅㅜㄴㅡㄴ ㅇㅠㄴㅆㅣㅂㅜ

ㅇㅣㄴㅇㅔㄱㅔ ㅇㅣㄴㅅㅏㄹㅡㄹ ㅇㅗㄹㄹㅣㄱㅗ ㅈㅜㅇㅁㅜㄴㅇㅡㄹ ㄴㅏㅅㅓㄴㅡㄴ  ㅊㅣㅅㅜ ㄷㅟㅅㅁㅗㅅㅡㅂㅇㅡㄹ ㅂㅗㅇㅏㅆㄷㅏ. ㅅㅣㄹㅁㅏㅇㅇㅔ  ㅇㅓㄹㄱㅜㄹㅇㅣ ㅇㅣㄹㄱㅡㄹㅓㅈㅣㅁㅕㄴㅅㅓ 

귀녀의 모습을 한번 쳐다보고 떠나려 했다. 집안을 이리저리 기웃거리던 강표수는 윤씨부

인에게 인사를 올리고 중문을 나서는  치수 뒷모습을 보았다. 실망에  얼굴이 일그러지면서 

True
136
{'\n': 0, '\r': 1, ' ': 2, '!': 3, '"': 4, "'": 5, '(': 6, ')': 7, ',': 8, '-': 9, '.': 10, '0': 11, '1': 12, '2': 13, '3': 14, '4': 15, '5': 16, '6': 17, '7': 18, '8': 19, '9': 20, ':': 21, '?': 22, 'a': 23, 'd': 24, 'f': 25, 'l': 26, 'n': 27, 'p': 28, '‘': 29, '’': 30, '“': 31, '”': 32, '…': 33, '\u3000': 34, 'ㄱ': 35, 'ㄲ': 36, 'ㄳ': 37, 'ㄴ': 38, 'ㄵ': 39, 'ㄶ': 40, 'ㄷ': 41, 'ㄸ': 42, 'ㄹ': 43, 'ㄺ': 44, 'ㄻ': 45, 'ㄼ': 46, 'ㄽ': 47, 'ㄾ': 48, 'ㅀ': 49, 'ㅁ': 50, 'ㅂ': 51, 'ㅃ': 52, 'ㅄ': 53, 'ㅅ': 54, 'ㅆ': 55, 'ㅇ': 56, 'ㅈ': 57, 'ㅉ': 58, 'ㅊ': 59, 'ㅋ': 60, 'ㅌ': 61, 'ㅍ': 62, 'ㅎ': 63, 'ㅏ': 64, 'ㅐ': 65, 'ㅑ': 66, 'ㅒ': 67, 'ㅓ': 68, 'ㅔ': 69, 'ㅕ': 70, 'ㅖ': 71, 'ㅗ': 72, 'ㅘ': 73, 'ㅙ': 74, 'ㅚ': 75, 'ㅛ': 76, 'ㅜ': 77, 'ㅝ': 78, 'ㅞ': 79, 'ㅟ': 80, 'ㅠ': 81, 'ㅡ': 82, 'ㅢ': 83, 'ㅣ': 84, '主': 85, '事': 86, '亡': 87, '佛': 88, '刑': 89, '割': 90, '化': 91, '匠': 92, '善': 93, '地': 94, '壁': 95, '妄': 96, '婚': 97, '子': 98, '寺': 99, '工': 100, '常': 101, '役': 102, '情': 103, '惡': 104, '意': 105, '日': 106, '杖': 107, '水': 108, '池': 109, '無': 110, '燈': 111, '眞': 112, '祈': 113, '祭': 114, '私': 115, '童': 116, '籍': 117, '絶': 118, '置': 119, '者': 120, '衣': 121, '谷': 122, '身': 123, '迷': 124, '造': 125, '銀': 126, '錫': 127, '長': 128, '陷': 129, '電': 130, '食': 131, '金': 132, '落': 133, '來': 134, 'UNK': 135}
['\n' '\r' ' ' '!' '"' "'" '(' ')' ',' '-' '.' '0' '1' '2' '3' '4' '5' '6'
 '7' '8' '9' ':' '?' 'a' 'd' 'f' 'l' 'n' 'p' '‘' '’' '“' '”' '…' '\u3000'
 'ㄱ' 'ㄲ' 'ㄳ' 'ㄴ' 'ㄵ' 'ㄶ' 'ㄷ' 'ㄸ' 'ㄹ' 'ㄺ' 'ㄻ' 'ㄼ' 'ㄽ' 'ㄾ' 'ㅀ' 'ㅁ' 'ㅂ' 'ㅃ'
 'ㅄ' 'ㅅ' 'ㅆ' 'ㅇ' 'ㅈ' 'ㅉ' 'ㅊ' 'ㅋ' 'ㅌ' 'ㅍ' 'ㅎ' 'ㅏ' 'ㅐ' 'ㅑ' 'ㅒ' 'ㅓ' 'ㅔ' 'ㅕ'
 'ㅖ' 'ㅗ' 'ㅘ' 'ㅙ' 'ㅚ' 'ㅛ' 'ㅜ' 'ㅝ' 'ㅞ' 'ㅟ' 'ㅠ' 'ㅡ' 'ㅢ' 'ㅣ' '主' '事' '亡' '佛'
 '刑' '割' '化' '匠' '善' '地' '壁' '妄' '婚' '子' '寺' '工' '常' '役' '情' '惡' '意' '日'
 '杖' '水' '池' '無' '燈' '眞' '祈' '祭' '私' '童' '籍' '絶' '置' '者' '衣' '谷' '身' '迷'
 '造' '銀' '錫' '長' '陷' '電' '食' '金' '落' '來' 'UNK']
[35 80 38 ... 10  2  0]
ㄱㅟㄴㅕㅇㅢ ㅁㅗㅅㅡㅂㅇㅡㄹ ㅎㅏㄴㅂ
[35 80 38 70 56 83  2 50 72 54 82 51 56 82 43  2 63 64 38 51]
8636
2022-12-16 11:56:56.659825: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
['ㄱ' 'ㅟ' 'ㄴ' 'ㅕ' 'ㅇ' 'ㅢ' ' ' 'ㅁ' 'ㅗ' 'ㅅ' 'ㅡ' 'ㅂ' 'ㅇ' 'ㅡ' 'ㄹ' ' ' 'ㅎ' 'ㅏ'
 'ㄴ' 'ㅂ' 'ㅓ' 'ㄴ' ' ' 'ㅊ' 'ㅕ' 'ㄷ' 'ㅏ' 'ㅂ' 'ㅗ' 'ㄱ' 'ㅗ' ' ' 'ㄸ' 'ㅓ' 'ㄴ' 'ㅏ'
 'ㄹ' 'ㅕ' ' ' 'ㅎ' 'ㅐ' 'ㅆ' 'ㄷ' 'ㅏ' '.' ' ' 'ㅈ' 'ㅣ' 'ㅂ' 'ㅇ' 'ㅏ' 'ㄴ' 'ㅇ' 'ㅡ'
 'ㄹ' ' ' 'ㅇ' 'ㅣ' 'ㄹ' 'ㅣ' 'ㅈ' 'ㅓ' 'ㄹ' 'ㅣ' ' ' 'ㄱ' 'ㅣ' 'ㅇ' 'ㅜ' 'ㅅ' 'ㄱ' 'ㅓ'
 'ㄹ' 'ㅣ' 'ㄷ' 'ㅓ' 'ㄴ' ' ' 'ㄱ' 'ㅏ' 'ㅇ']
[35 80 38 70 56 83  2 50 72 54 82 51 56 82 43  2 63 64 38 51 68 38  2 59
 70 41 64 51 72 35 72  2 42 68 38 64 43 70  2 63 65 55 41 64 10  2 57 84
 51 56 64 38 56 82 43  2 56 84 43 84 57 68 43 84  2 35 84 56 77 54 35 68
 43 84 41 68 38  2 35 64 56]
['ㄱ' 'ㅟ' 'ㄴ' 'ㅕ' 'ㅇ' 'ㅢ' ' ' 'ㅁ' 'ㅗ' 'ㅅ' 'ㅡ' 'ㅂ' 'ㅇ' 'ㅡ' 'ㄹ' ' ' 'ㅎ' 'ㅏ'
 'ㄴ' 'ㅂ' 'ㅓ' 'ㄴ' ' ' 'ㅊ' 'ㅕ' 'ㄷ' 'ㅏ' 'ㅂ' 'ㅗ' 'ㄱ' 'ㅗ' ' ' 'ㄸ' 'ㅓ' 'ㄴ' 'ㅏ'
 'ㄹ' 'ㅕ' ' ' 'ㅎ' 'ㅐ' 'ㅆ' 'ㄷ' 'ㅏ' '.' ' ' 'ㅈ' 'ㅣ' 'ㅂ' 'ㅇ' 'ㅏ' 'ㄴ' 'ㅇ' 'ㅡ'
 'ㄹ' ' ' 'ㅇ' 'ㅣ' 'ㄹ' 'ㅣ' 'ㅈ' 'ㅓ' 'ㄹ' 'ㅣ' ' ' 'ㄱ' 'ㅣ' 'ㅇ' 'ㅜ' 'ㅅ' 'ㄱ' 'ㅓ'
 'ㄹ' 'ㅣ' 'ㄷ' 'ㅓ' 'ㄴ' ' ' 'ㄱ' 'ㅏ']
[35 80 38 70 56 83  2 50 72 54 82 51 56 82 43  2 63 64 38 51 68 38  2 59
 70 41 64 51 72 35 72  2 42 68 38 64 43 70  2 63 65 55 41 64 10  2 57 84
 51 56 64 38 56 82 43  2 56 84 43 84 57 68 43 84  2 35 84 56 77 54 35 68
 43 84 41 68 38  2 35 64]
ㅇ
56
steps_per_epoch : 134
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 80, 100)           13600     
                                                                 
 dropout (Dropout)           (None, 80, 100)           0         
                                                                 
 lstm (LSTM)                 (None, 256)               365568    
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense (Dense)               (None, 256)               65792     
                                                                 
 dropout_2 (Dropout)         (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 136)               34952     
                                                                 
=================================================================
Total params: 479,912
Trainable params: 479,912
Non-trainable params: 0
_________________________________________________________________
None
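The parameter counts in the summary above can be checked by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM, and dense layers (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim                            # one vector per vocabulary entry


def lstm_params(input_dim, units):
    return 4 * ((input_dim + units) * units + units)   # 4 gates: W, U and bias each


def dense_params(input_dim, units):
    return input_dim * units + units                   # weights plus bias


counts = [
    embedding_params(136, 100),   # 13600
    lstm_params(100, 256),        # 365568
    dense_params(256, 256),       # 65792
    dense_params(256, 136),       # 34952
]
print(counts, sum(counts))        # total: 479912
```

The LSTM accounts for roughly three quarters of the 479,912 parameters; the dropout layers add none.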

최참판댁 사랑은 무인지경처럼 적막하다.
 "아아우, 기러서보, 알, 알래하......나가. 촛사솟한 시조. '요애해힌데 길하가 살을 흘려 들어자."
""느나?"....ㄴ데 출서빈 아니다. 강포수를 밟토 들어다.
 "기었다. 그 줌글 설고 있었다. 예김화는 깃서바.""" "허허헜자."
""제 줌 사장에 했다. 서엉이 둠어자렀다.  "아아우!" 하지, 어리 기우? 하집해서 갗겨섰다.  "요아우!" 하잡하들 춧하고 있었다. 예기라  부추서 있는 인이 일이 일ㄹㄹㄹㄹㄹㄹㄹㄹㄹㄱㄱㄱ고  우었다.
 "귀운화 서엇다 그서 이 기래하  북겁더서 술에 출렀다.
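The generated sample above contains runs such as "ㄹㄹㄹㄹ", a pattern that often appears when the next symbol is always picked with `np.argmax` (greedy decoding). One commonly used alternative, which this post does not use, is to sample from the predicted distribution with a temperature. The following is a minimal stdlib-only sketch; `sample_with_temperature` is a hypothetical helper, and whether it improves this particular model's output has not been tested here.

```python
import math
import random


def sample_with_temperature(probs, temperature=1.0, rng=None):
    """Draw an index from probs after sharpening (T < 1) or flattening (T > 1) it."""
    rng = rng or random.Random()
    # rescale log-probabilities by 1 / temperature
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    top = max(logits)
    weights = [math.exp(l - top) for l in logits]   # subtract the max for numerical stability
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


toy_probs = [0.1, 0.7, 0.2]
rng = random.Random(0)
# a very low temperature behaves like argmax
greedy_like = [sample_with_temperature(toy_probs, 0.01, rng) for _ in range(20)]
# temperature 1.0 samples from the distribution as predicted
sampled = [sample_with_temperature(toy_probs, 1.0, rng) for _ in range(20)]
print(greedy_like)
print(sampled)
```

In the generation loop, the `np.argmax(...)` line could be swapped for something like `sample_with_temperature(model.predict(test_text_x)[0], temperature=0.8)`; the value 0.8 is only an illustrative assumption.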

Reference site for the pip option used when a library will not install against the current versions (installing an arbitrary module on an older Python version):

pip multiple versions of dependency resolver problem

INFO: pip is looking at multiple versions of to determine which version is compatible with other requirements. This could take a while. INFO: This is taking longer than usual. You might need to provide the dependency resolver with stricter constraints to r

uiandwe.tistory.com
