Simpson Family. © 20th Century Fox

TV Script Generation

Do you like The Simpsons? In this project we'll generate a Simpsons TV script using an RNN. We'll train on part of the Simpsons dataset of scripts from 27 seasons, and then use the RNN to write a brand-new scene at Moe's Tavern.

Let's get started!


Get the Data

The data can be downloaded from the Simpsons dataset.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper

data_dir = './data/simpsons/moes_tavern_lines.txt'
text = helper.load_data(data_dir)
# Ignore notice, since we don't use it for analysing the data
text = text[81:]

Explore the Data

Let's take a look at the data. You can browse different parts of it by changing view_sentence_range.

view_sentence_range = (0, 10)

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import numpy as np

print('Dataset Stats')
print('Roughly the number of unique words: {}'.format(len({word: None for word in text.split()})))
scenes = text.split('\n\n')
print('Number of scenes: {}'.format(len(scenes)))
sentence_count_scene = [scene.count('\n') for scene in scenes]
print('Average number of sentences in each scene: {}'.format(np.average(sentence_count_scene)))

sentences = [sentence for scene in scenes for sentence in scene.split('\n')]
print('Number of lines: {}'.format(len(sentences)))
word_count_sentence = [len(sentence.split()) for sentence in sentences]
print('Average number of words in each line: {}'.format(np.average(word_count_sentence)))

print()
print('The sentences {} to {}:'.format(*view_sentence_range))
print('\n'.join(text.split('\n')[view_sentence_range[0]:view_sentence_range[1]]))
    Dataset Stats
    Roughly the number of unique words: 11492
    Number of scenes: 262
    Average number of sentences in each scene: 15.248091603053435
    Number of lines: 4257
    Average number of words in each line: 11.50434578341555

    The sentences 0 to 10:
    Moe_Szyslak: (INTO PHONE) Moe's Tavern. Where the elite meet to drink.
    Bart_Simpson: Eh, yeah, hello, is Mike there? Last name, Rotch.
    Moe_Szyslak: (INTO PHONE) Hold on, I'll check. (TO BARFLIES) Mike Rotch. Mike Rotch. Hey, has anybody seen Mike Rotch, lately?
    Moe_Szyslak: (INTO PHONE) Listen you little puke. One of these days I'm gonna catch you, and I'm gonna carve my name on your back with an ice pick.
    Moe_Szyslak: What's the matter Homer? You're not your normal effervescent self.
    Homer_Simpson: I got my problems, Moe. Give me another one.
    Moe_Szyslak: Homer, hey, you should not drink to forget your problems.
    Barney_Gumble: Yeah, you should only drink to enhance your social skills.

Implement Preprocessing Functions

The first thing to do with any dataset is preprocessing. We implement the preprocessing functions below in the following order: Lookup Table, then Tokenize Punctuation.

Lookup Table

To use word embeddings, each word first needs an id. In this function we create two dictionaries: one that maps words to ids (vocab_to_int) and one that maps ids back to words (int_to_vocab).

The two dictionaries are returned as the tuple (vocab_to_int, int_to_vocab).

import numpy as np
import problem_unittests as tests

def create_lookup_tables(text):
    """
    Create lookup tables for vocabulary
    :param text: The text of tv scripts split into words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    # text is a list of words
    text = set(text) # remove duplicates
    vocab_to_int, int_to_vocab = {}, {}

    for k, v in enumerate(text):
        vocab_to_int[v] =  k
        int_to_vocab[k] =  v

    print("sample of vocab_to_int at key ['tavern'] -->", vocab_to_int['homer'])
    print("sample of int_to_vocab at key [0] --> ", int_to_vocab[0])
    return (vocab_to_int, int_to_vocab)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_create_lookup_tables(create_lookup_tables)
    sample of vocab_to_int at key ['homer'] --> 64
    sample of int_to_vocab at key [0] -->  back
    Tests Passed
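
As a quick illustration (a usage sketch, not one of the graded cells, using a hypothetical word list rather than the real dataset), the two dictionaries invert each other:

sample_words = ['moe', 'homer', 'drinks', 'at', 'the', 'tavern']
v2i, i2v = create_lookup_tables(sample_words)

# every word gets a unique integer id, and the two dicts are inverses of each other
assert all(i2v[v2i[word]] == word for word in sample_words)
print(v2i['tavern'], i2v[v2i['tavern']])  # e.g. 3 tavern (the ids depend on set ordering)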

Tokenize Punctuation

We'll split the script into words using spaces as delimiters. However, punctuation such as periods '.' and exclamation marks '!' would make the network treat "Bye" and "Bye!" as two different words, so we tokenize the punctuation instead.

Implement token_lookup to return a dictionary that will be used to tokenize symbols like "!" into "||Exclamation_Mark||". The keys of this dictionary are the symbols and the values are their tokens.

This dictionary will be used to tokenize the symbols and add a delimiter (space) around them. Separating the symbols from the words makes it easier for the neural network to predict the next word. Be careful not to use a token that could be confused with a word; for example, use "||dash||" instead of "dash".

def token_lookup():
    """
    Generate a dict to turn punctuation into a token.
    :return: Tokenize dictionary where the key is the punctuation and the value is the token
    """
    punc_dict = {
        '.' : '||Period||',
        ',' : '||Comma||',
        '"' : '||Quotation_Mark||',
        ';' : '||Semicolon||',
        '!' : '||Exclamation_mark||',
        '?' : '||Question_mark||',
        '(' : '||Left_Parentheses||',
        ')' : '||Right_Parentheses||',
        '--' : '||Dash||',
        '\n' : '||Return||',
    }
    return punc_dict

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_tokenize(token_lookup)
    Tests Passed

Preprocess all the data and save it

Running the code cell below preprocesses all the data and saves it to a file.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# Preprocess Training, Validation, and Testing Data
helper.preprocess_and_save_data(data_dir, token_lookup, create_lookup_tables)
    sample of vocab_to_int at key ['homer'] --> 3875
    sample of int_to_vocab at key [0] -->  portfolium
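
helper.preprocess_and_save_data is provided with the project, so its exact implementation isn't shown here; conceptually it combines the two functions above roughly as follows (a sketch under that assumption; the real helper also pickles the results to disk):

def preprocess_sketch(text, token_lookup, create_lookup_tables):
    # replace each punctuation symbol with its token, padded with spaces
    for symbol, token in token_lookup().items():
        text = text.replace(symbol, ' {} '.format(token))

    # lowercase and split on whitespace
    words = text.lower().split()

    # build the lookup tables and convert the script to a list of word ids
    vocab_to_int, int_to_vocab = create_lookup_tables(words)
    int_text = [vocab_to_int[word] for word in words]
    return int_text, vocab_to_int, int_to_vocab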

Check Point

This is the first checkpoint. If you ever decide to come back to this notebook or need to restart it, you can start from here; the preprocessed data has been saved to disk.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import helper
import numpy as np
import problem_unittests as tests

int_text, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()

Build the Neural Network

Now it's time to build the RNN. We implement the components below, one in each of the following sections.

Check the Version of TensorFlow and Access to GPU

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
from distutils.version import LooseVersion
import warnings
import tensorflow as tf

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer'
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
    TensorFlow Version: 1.1.0
    Default GPU Device: /gpu:0

Input

get_inputs() creates the TF Placeholders for the input text, the targets, and the learning rate. The input placeholder is named "input" (via the name parameter) so it can be loaded by name later.

The placeholders are returned as the tuple (Input, Targets, LearningRate).

def get_inputs():
    """
    Create TF Placeholders for input, targets, and learning rate.
    :return: Tuple (input, targets, learning rate)
    """
    inputs = tf.placeholder(tf.int32, [None, None], name="input")
    targets = tf.placeholder(tf.int32, [None, None])
    learning_rate = tf.placeholder(tf.float32)
    return (inputs, targets, learning_rate)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_get_inputs(get_inputs)
    Tests Passed

Build RNN Cell and Initialize

Stack one or more BasicLSTMCells in a MultiRNNCell, create the initial state with zero_state(), and name it 'initial_state' using tf.identity() so it can be loaded by name later.

The function returns the tuple (Cell, InitialState).

def get_init_cell(batch_size, rnn_size):
    """
    Create an RNN Cell and initialize it.
    :param batch_size: Size of batches
    :param rnn_size: Size of RNNs
    :return: Tuple (cell, initialize state)
    """
    # Stack a dropout-wrapped LSTM cell in a MultiRNNCell
    Cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(rnn_size), output_keep_prob=0.6)])

    # Name the initial state so it can be loaded by name later
    initial = Cell.zero_state(batch_size, tf.float32)
    InitialState = tf.identity(initial, name='initial_state')

    return (Cell, InitialState)



"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_get_init_cell(get_init_cell)
    Tests Passed
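
The cell above stacks a single dropout-wrapped LSTM layer. If you want to experiment with a deeper network, a common variant (a sketch with a hypothetical num_layers parameter, not the graded cell above) builds the MultiRNNCell from a list of layers:

def get_init_cell_deep(batch_size, rnn_size, num_layers=2, keep_prob=0.6):
    """Variant of get_init_cell that stacks several dropout-wrapped LSTM layers."""
    layers = [tf.contrib.rnn.DropoutWrapper(tf.contrib.rnn.BasicLSTMCell(rnn_size),
                                            output_keep_prob=keep_prob)
              for _ in range(num_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(layers)

    # name the initial state so it can be loaded by name later, as above
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state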

Word Embedding

Apply embedding to input_data using TensorFlow and return the embedded sequence.

def get_embed(input_data, vocab_size, embed_dim):
    """
    Create embedding for <input_data>.
    :param input_data: TF placeholder for text input. --> placeholder tensor, shape=(50, 5), dtype=int32
    :param vocab_size: Number of words in vocabulary. --> 27
    :param embed_dim: Number of embedding dimensions  --> 256
    :return: Embedded input.
    """
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)

    return embed


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_get_embed(get_embed)
    Tests Passed
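
For comparison, TensorFlow 1.x also provides a contrib helper that creates the embedding matrix and performs the lookup in one call; assuming tf.contrib.layers.embed_sequence is available in your TF version, the function above could be written as:

def get_embed_alt(input_data, vocab_size, embed_dim):
    # one-line alternative: contrib creates the embedding variable and does the lookup
    return tf.contrib.layers.embed_sequence(input_data, vocab_size, embed_dim)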

Build RNN

Use the cell created in get_init_cell() to build the RNN with tf.nn.dynamic_rnn(), and name the final state 'final_state' with tf.identity().

The function returns the tuple (Outputs, FinalState).

def build_rnn(cell, inputs):
    """
    Create a RNN using a RNN Cell
    :param cell: RNN Cell --> cell multirnn object
    :param inputs: Input text data --> tensor, (?, ?, 256) float
    :return: Tuple (Outputs, Final State)
    """
    # Run the embedded inputs through the RNN; name the final state so it can be loaded later
    Outputs, fs = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    FinalState = tf.identity(fs, name="final_state")

    return (Outputs, FinalState)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_build_rnn(build_rnn)
    Tests Passed

Build the Neural Network

Apply the functions implemented above: embed input_data with get_embed, run the embedded input through the RNN with build_rnn, then apply a fully connected layer with a linear activation and vocab_size outputs.

The function returns the tuple (Logits, FinalState).

def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    """
    Build part of the neural network
    :param cell: RNN cell #rnn cell object
    :param rnn_size: Size of rnns #256
    :param input_data: Input data #tensor plc s=(128, 5) int32
    :param vocab_size: Vocabulary size # 27
    :param embed_dim: Number of embedding dimensions #300
    :return: Tuple (Logits, FinalState)
    """
    # Embed the input, run it through the RNN, then project to vocab_size logits
    embeds = get_embed(input_data, vocab_size, embed_dim)
    out, FinalState = build_rnn(cell, embeds)
    Logits = tf.contrib.layers.fully_connected(out, vocab_size, activation_fn=None)

    return (Logits, FinalState)


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_build_nn(build_nn)
    Tests Passed

Batches

Implement get_batches to create batches of inputs and targets. The batches should be a Numpy array with the shape (number of batches, 2, batch size, sequence length). Each batch contains two elements: a batch of inputs and a batch of targets, each with the shape [batch size, sequence length].

If there isn't enough data to fill the last batch, drop it.

For example, get_batches([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], 3, 2) would return the following Numpy array:

[
  # First Batch
  [
    # Batch of Input
    [[ 1  2], [ 7  8], [13 14]]
    # Batch of targets
    [[ 2  3], [ 8  9], [14 15]]
  ]

  # Second Batch
  [
    # Batch of Input
    [[ 3  4], [ 9 10], [15 16]]
    # Batch of targets
    [[ 4  5], [10 11], [16 17]]
  ]

  # Third Batch
  [
    # Batch of Input
    [[ 5  6], [11 12], [17 18]]
    # Batch of targets
    [[ 6  7], [12 13], [18  1]]
  ]
]

Notice that the last target value of the last batch is the first input value of the first batch, in this case 1. This wrap-around is a bit unintuitive, but it is the most widely used technique for building sequence batches.

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # Number of words per batch and number of complete batches we can fill
    characters_per_batch = batch_size * seq_length
    num_batches = len(int_text) // characters_per_batch

    # Clip the text so only complete batches remain. Targets are the inputs
    # shifted one word over; the last target wraps around to the first input word.
    input_data = np.array(int_text[: num_batches * characters_per_batch])
    target_data = np.array(int_text[1: num_batches * characters_per_batch] + [int_text[0]])

    # Reshape to (batch_size, num_batches * seq_length) ...
    inputs = input_data.reshape(batch_size, -1)
    targets = target_data.reshape(batch_size, -1)

    # ... then split along the time axis into num_batches arrays of shape (batch_size, seq_length)
    inputs = np.split(inputs, num_batches, 1)
    targets = np.split(targets, num_batches, 1)

    # Pair each input batch with its target batch:
    # final shape (number of batches, 2, batch size, sequence length)
    batches = np.array(list(zip(inputs, targets)))
    batches = batches.reshape(num_batches, 2, batch_size, seq_length)

    return batches



"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_get_batches(get_batches)
    Tests Passed
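
As a quick sanity check (a usage sketch, not a graded cell), calling get_batches on the example sequence from above reproduces the array shown earlier, including the wrap-around in the final target:

example = get_batches(list(range(1, 21)), 3, 2)
print(example.shape)           # (3, 2, 3, 2): 3 batches, inputs/targets, batch size 3, sequence length 2
print(example[0][0])           # first batch of inputs: [[1 2], [7 8], [13 14]]
print(example[-1][1][-1][-1])  # last target of the last batch wraps around to 1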

Neural Network Training

Hyperparameters

The network is trained with the following hyperparameters:

# Number of Epochs
num_epochs = 20
# Batch Size
batch_size = 256 #256 seems optimal
# RNN Size
rnn_size = 750
# Embedding Dimension Size
embed_dim = 200
# Sequence Length
seq_length = 10
# Learning Rate
learning_rate = 0.01 # 0.01 seems optimal here
# Show stats for every n number of batches
show_every_n_batches = 1

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
save_dir = './save'

Build the Graph

Build the graph from the neural network you implemented.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

Train

Train the neural network on the preprocessed data.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')
    Epoch   0 Batch    0/26   train_loss = 8.823
    Epoch   0 Batch    1/26   train_loss = 8.330
    Epoch   0 Batch    2/26   train_loss = 7.370
    Epoch   0 Batch    3/26   train_loss = 7.173
    ...
    Epoch  19 Batch   22/26   train_loss = 0.834
    Epoch  19 Batch   23/26   train_loss = 0.859
    Epoch  19 Batch   24/26   train_loss = 0.857
    Epoch  19 Batch   25/26   train_loss = 0.868
    Model Trained and Saved

Save Parameters

Save seq_length and save_dir so they can be used later when generating a new TV script.

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
# Save parameters for checkpoint
helper.save_params((seq_length, save_dir))

Checkpoint

"""
DON'T MODIFY ANYTHING IN THIS CELL
"""
import tensorflow as tf
import numpy as np
import helper
import problem_unittests as tests

_, vocab_to_int, int_to_vocab, token_dict = helper.load_preprocess()
seq_length, load_dir = helper.load_params()

Implement Generate Functions

Get Tensors

Use the get_tensor_by_name() function to get the following tensors from loaded_graph: "input:0", "initial_state:0", "final_state:0", and "probs:0".

Return the tensors as the tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor).

def get_tensors(loaded_graph):
    """
    Get input, initial state, final state, and probabilities tensor from <loaded_graph>
    :param loaded_graph: TensorFlow graph loaded from file
    :return: Tuple (InputTensor, InitialStateTensor, FinalStateTensor, ProbsTensor)
    """
    input_tensor = loaded_graph.get_tensor_by_name("input:0")
    initial_state_tensor = loaded_graph.get_tensor_by_name("initial_state:0")
    final_state_tensor = loaded_graph.get_tensor_by_name("final_state:0")
    probs_tensor = loaded_graph.get_tensor_by_name("probs:0")
    return (input_tensor, initial_state_tensor, final_state_tensor, probs_tensor )


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_get_tensors(get_tensors)
    Tests Passed

Choose Word

Implement the pick_word() function to select the next word from probabilities.

def pick_word(probabilities, int_to_vocab):
    """
    Pick the next word in the generated text
    :param probabilities: Probabilites of the next word
    :param int_to_vocab: Dictionary of word ids as the keys and words as the values
    :return: String of the predicted word
    """
    # Sample the next word in proportion to its probability (instead of always taking the argmax)
    cumulative_sum = np.cumsum(probabilities)
    chooser = np.sum(probabilities) * np.random.rand(1)  # random point in the total probability mass
    word = int_to_vocab[int(np.searchsorted(cumulative_sum, chooser))]  # index of the sampled word

    return word


"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
tests.test_pick_word(pick_word)
    Tests Passed
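
pick_word samples the next word in proportion to its probability rather than always taking the argmax, which keeps the generated script from looping on the most frequent words. A quick sketch with a toy distribution (hypothetical three-word vocabulary) shows the idea:

toy_probs = np.array([0.1, 0.2, 0.7])
toy_vocab = {0: 'moe', 1: 'homer', 2: 'beer'}

# 'beer' should be picked roughly 70% of the time
picks = [pick_word(toy_probs, toy_vocab) for _ in range(1000)]
print(picks.count('beer') / len(picks))  # roughly 0.7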

Generate TV Script

Now let's generate the TV script. Set gen_length to the length of script you want to generate.

gen_length = 200
# homer_simpson, moe_szyslak, or Barney_Gumble
prime_word = 'moe_szyslak'

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(load_dir + '.meta')
    loader.restore(sess, load_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})

        pred_word = pick_word(probabilities[dyn_seq_length-1], int_to_vocab)

        gen_sentences.append(pred_word)

    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in token_dict.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')

    print(tv_script)
    INFO:tensorflow:Restoring parameters from ./save
    moe_szyslak:(then) if kemi there is a nigerian princess, homer?
    moe_szyslak:(" no more") not like a business.
    moe_szyslak: moe's tavern. we're not bring each other's and-- i'm totally a committee of that myself.
    barney_gumble:(ominous) for you know, the shriners is?
    homer_simpson:(to homer) sound?
    homer_simpson: see, this one was me.
    now i'm gonna be afraid and--(nods) sure call me?
    moe_szyslak: few out.
    waylon_smithers: marge...
    moe_szyslak:(almost there) that's pretty well. here you mean one way on?
    barney_gumble:(super casual) up-bup-bup. you didn't say you'd kill me.
    moe_szyslak: a sneeze. let's see homer, but what we're workin'.
    moe_szyslak: marge, a little weird i'm just sure for a while.
    moe_szyslak:(to homer) hey, get out there, and i'm a tanked-up glad. what are you safe, as a big deal?
    moe_szyslak: yeah. but i gotta wing

The TV Script is Nonsensical

It's okay if the TV script doesn't make much sense; we trained on less than a megabyte of text. To get better results you would need a smaller vocabulary or more data, and we only trained on a subset of the scripts. The full Simpsons dataset has a lot more data, so training on all of it should give noticeably better results. :)

