Creating A Text Generator Using Recurrent Neural Network


Hello guys, it’s been a while since my last post, and I hope you’re all doing well with your own projects. I’ve been kept busy with my own stuff, too, and by now I have some interesting results that I couldn’t wait to share with you. What I did is create a Text Generator by training a Recurrent Neural Network Model. Below is a sample generated by the trained Model:

They had no choice but the most recent univerbeen fairly uncomfortable and dangerous as ever. As long as he dived experience that it was not uncertain that even Harry had taken in black tail as the train roared and was thin, but Harry, Ron, and Hermione, at the fact that he was in complete disarraying the rest of the class holding him, he should have been able to prove them.

Does it sound familiar? Yeah, you may recognize J. K. Rowling’s style in the paragraph above. That’s because I trained the Model on the famous Harry Potter series! Do you feel excited and want to create something of your own? Just keep reading, a lot of fun is waiting ahead, I promise!

Many of you may know about Recurrent Neural Networks, and many may not, but I’m quite sure you have all heard about Neural Networks. We have already seen how Neural Networks can tackle nearly any Machine Learning problem, no matter how complicated. But fully understanding how Neural Networks work requires a lot of reading and hands-on implementation, and since I haven’t made any tutorials on them yet, it’s nearly impossible to cover it all in this post. So it’s better to leave that for future tutorials and keep things easy this time by looking at the picture below instead.

[Image: neural_network]

As you can see in the picture above, the main reason why Neural Networks can outperform other learning algorithms is the hidden layers. What the hidden layers do is build a more complicated set of features, which results in better predictive accuracy. I also mentioned this in my previous posts: the more complicated and informative the features become, the more likely your Model is to learn well and give precise predictions.

Despite the outstanding performance that Neural Networks have shown us over the last decade, they still have a big limitation: they can’t handle sequences, in which the current state is affected by previous states. And Recurrent Neural Networks came out as a promising solution to that.

A full explanation of Recurrent Neural Networks, what they are and how they work, would be quite long and is not the main purpose of this post, which is to guide you in creating your own text generator. In fact, there are many people out there who have written excellent posts on how Recurrent Neural Networks work; you can refer to their posts through the links below. Some of them provide code too, but they used Theano or Torch for their work, which may hurt a lot if you don’t have experience with those frameworks. To make it easy for you, I re-implemented the code using a more relaxing framework called Keras. You can check it out in the Implementation section below.

And because there are already many great posts on Recurrent Neural Networks, I will only talk briefly about some points which confused me, and may confuse you too.

Vanilla RNN

[Image: vanilla_RNN]

The very first, basic idea of an RNN is to chain the hidden layer across timesteps: the hidden layer at each timestep depends on the input at that timestep and on the hidden layer of the previous timestep, like below:
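(To keep things simple, I’ll write the equations in the common textbook form: \(x_t\) is the input at timestep \(t\), \(h_t\) is the hidden state, \(U\) and \(W\) are the input and recurrent weight matrices, \(\sigma\) is a nonlinear activation such as the sigmoid, and the bias terms are omitted.)

\[h_t = \sigma(U x_t + W h_{t-1})\]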

The output, on the other hand, is computed using only the associated hidden layer:
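With \(V\) denoting the hidden-to-output weights and a softmax turning the result into a probability distribution:

\[y_t = \mathrm{softmax}(V h_t)\]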

So, with hidden layers spanning different timesteps, this new type of Network now has the ability to “remember”. But it can’t remember over many timesteps due to a problem called vanishing gradients (I will talk about it in a future post), and it can’t decide which information from a given timestep is valuable (and should be kept) and which is not (and should be forgotten). So an improvement was required, and Long Short-term Memory, or LSTM, came out as a potential successor.

Long Short-term Memory Networks

[Image: LSTM]

Having seen the limitations of the vanilla RNN, let’s now take a look at its successor, the LSTM Network. The explanations of LSTM in the links above are pretty awesome, but honestly, they confused me a little. Personally, I think it is easier to understand if we begin from what RNNs can already accomplish:
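\[o_t = \sigma(U_o x_t + W_o h_{t-1})\]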

This is exactly the equation an RNN uses to compute its hidden state at timestep \(t\), just with its own weight matrices \(U_o\) and \(W_o\). But in an LSTM it is not the actual hidden state, so we give it a different name, \(o_t\) (it will act as the output gate). From here, we will see how LSTM improves on RNN.

First, the LSTM is given the ability to “forget”, which means it can decide how much of the previous state to keep or forget. This is done by adding a Forget Gate Layer, which has the same form with its own weights \(U_f\) and \(W_f\):
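\[f_t = \sigma(U_f x_t + W_f h_{t-1})\]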

In contrast to the forget gate layer, the Input Gate Layer tells the Model how much new information to write into the current state, so we add it accordingly, with its own weights \(U_i\) and \(W_i\):
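\[i_t = \sigma(U_i x_t + W_i h_{t-1})\]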

Next, we compute the temporary (candidate) cell state for the current timestep, call it \(\tilde{C}_t\), with its own weights \(U_c\) and \(W_c\). It looks just like the RNN hidden-state equation above, except that the tanh activation function is used:
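\[\tilde{C}_t = \tanh(U_c x_t + W_c h_{t-1})\]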

And now we compute the actual cell state for the current timestep, using the forget gate and input gate above. Intuitively, this lets the LSTM keep only the necessary information and forget the unnecessary parts:
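\[C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t\]

(Here \(\odot\) means elementwise multiplication: the forget gate scales how much of the old cell state \(C_{t-1}\) to keep, and the input gate scales how much of the new candidate \(\tilde{C}_t\) to write in.)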

After we have computed the current cell state, we use it, together with \(o_t\) from the beginning, to compute the current hidden state like below:
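\[h_t = o_t \odot \tanh(C_t)\]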

So finally we have the hidden state for the current timestep. The rest is similar to the vanilla RNN: computing the actual output \(y_t\), using the same output weights \(V\) as before:
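\[y_t = \mathrm{softmax}(V h_t)\]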

That’s all I want to tell you about RNNs and LSTMs. I suggest you read the three articles above for a better understanding of how they work. And now let’s jump into the most interesting part (I think so): the Implementation section!

Implementation

As I mentioned earlier in this post, there are quite a lot of excellent posts on how Recurrent Neural Networks work, and those guys also included implementations for demonstration. Actually, because they wrote the code for teaching purposes, reading it does help a lot in understanding the tutorials. But I must say that it may hurt, especially if you don’t have any experience with Theano or Torch (Denny wrote his code in Theano and Andrej used Torch). I want to make it easy for you, so I will show you how to implement an RNN using Keras, an excellent piece of work from François Chollet, which I had a chance to introduce to you in my previous posts.

If you don’t have Keras installed on your machine, just give the link below a click. The installation only takes 20 minutes (max):

Now, let’s get down to business. For the sake of simplicity, I will divide the code into four parts and dig into each one at a time. Of course, I will omit some lines used for importing, argument parsing, etc. You can find the full source file on my GitHub here: Text Generator. Now let’s go into the first part: preparing the data.

1. Prepare the training data

I always try to deal with the most tedious part first, which is data preparation. Not only can good data preparation result in a well-trained Model, but this step is also somewhat tricky: we are likely to spend a lot of time on it before everything works (especially if we are juggling different frameworks).

We are gonna work with text in this post, so obviously we have to prepare a text file to train our Model. You can go on the internet and grab anything you want, such as the free text novels here, and I recommend a file of at least 2MB for an acceptable result. In my case, I used the famous Harry Potter series for training (which, of course, I can’t share here for copyright reasons).

 
data = open(DATA_DIR, 'r').read()
chars = list(set(data))
VOCAB_SIZE = len(chars)

First, we read the text file into the data variable; in Python a string behaves like an array in which each element is a character. Next, we create a new array called chars to store the unique values in data. For example, suppose your text file contains only the following sentence:

I have a dream.

Then the data array will look like this:

 
data
['I',' ', 'h', 'a', 'v', 'e', ' ', 'a', ' ', 'd', 'r', 'e', 'a', 'm', '.']

And the chars array will look like this:

 
chars
['I',' ', 'h', 'a', 'v', 'e', 'd', 'r', 'm', '.']

As you can see, every element in the chars array appears only once. So the data array contains all the examples, and the chars array acts as a feature holder, from which we then create two dictionaries to map between indexes and characters:

 
ix_to_char = {ix:char for ix, char in enumerate(chars)}
char_to_ix = {char:ix for ix, char in enumerate(chars)}

Why do we have to do the mapping anyway? Because it’s better to feed numeric training data into the Network (as with other learning algorithms). And we also need the reverse dictionary to convert the numbers back into the original characters. That’s why we created the two dictionaries above.

After the file reading is done, we create the actual input for the Network. We’re gonna use Keras to create and train our Network, so we must convert the data into the form (number_of_sequences, length_of_sequence, number_of_features). The last dimension is the number of features, in this case the length of the chars array above. The length of sequence means how many characters you want your Model to learn from at a time; it’s also the total number of timesteps of our Network which I showed you above. The first dimension is the number of sequences, which we obtain simply by dividing the length of our data by the length of each sequence. Of course, we also need to convert each character into its corresponding index number (and then into a one-hot vector).

And what about the target sequences? In this post we are only making a simple text generator, so we just set each target by shifting the corresponding input sequence by one character. Obviously, the target sequence will have the same length as the input sequence. Models that can output target sequences of a different length I will leave for the next post.

 
X = np.zeros((len(data)//SEQ_LENGTH, SEQ_LENGTH, VOCAB_SIZE))   # integer division so the shape is an int
y = np.zeros((len(data)//SEQ_LENGTH, SEQ_LENGTH, VOCAB_SIZE))
for i in range(0, len(data)//SEQ_LENGTH):
    # Input: one one-hot vector per character of the i-th sequence
    X_sequence = data[i*SEQ_LENGTH:(i+1)*SEQ_LENGTH]
    X_sequence_ix = [char_to_ix[value] for value in X_sequence]
    input_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        input_sequence[j][X_sequence_ix[j]] = 1.
    X[i] = input_sequence

    # Target: the same sequence shifted one character to the right
    y_sequence = data[i*SEQ_LENGTH+1:(i+1)*SEQ_LENGTH+1]
    y_sequence_ix = [char_to_ix[value] for value in y_sequence]
    target_sequence = np.zeros((SEQ_LENGTH, VOCAB_SIZE))
    for j in range(SEQ_LENGTH):
        target_sequence[j][y_sequence_ix[j]] = 1.
    y[i] = target_sequence

The code is not difficult to understand at all, but make sure you take a look before moving on.

2. Create the Network

So we are done with the data preparation. The rest is somewhat relaxing, since we can let Keras handle the hardest part: creating the Network. We’re gonna use LSTM for its ability to deal with long sequences, but you can experiment with other models by changing LSTM to SimpleRNN or GRU. The choice is yours!

 
model = Sequential()
model.add(LSTM(HIDDEN_DIM, input_shape=(None, VOCAB_SIZE), return_sequences=True))
for i in range(LAYER_NUM - 1):
    model.add(LSTM(HIDDEN_DIM, return_sequences=True))
model.add(TimeDistributed(Dense(VOCAB_SIZE)))
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")

You should have no problem understanding the code above, right? There are only a few points that I want to make clear:

  • return_sequences=True parameter:

We want a sequence as the output, not just a single vector as with normal Neural Networks, so it’s necessary to set return_sequences to True. Concretely, let’s say we feed the LSTM an input of shape (num_seq, seq_len, num_feature). If we don’t set return_sequences=True, the output will have shape (num_seq, hidden_dim): one vector per sequence. If we do, we obtain an output of shape (num_seq, seq_len, hidden_dim): one vector per timestep (you can see both cases in the little sketch after these notes).

  • TimeDistributed wrapper layer:

Since we set return_sequences=True in the LSTM layers, the output is now a three-dimensional tensor. Feeding that directly into the Dense layer raises an error, because (in the Keras version used here) the Dense layer only accepts two-dimensional input. To handle a three-dimensional input, we wrap it in a layer called TimeDistributed, which applies the same Dense layer at every timestep. This keeps the output’s sequence shape, so that we end up with a sequence as output, as the little sketch below illustrates.
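To make both points concrete, here is a little sketch, separate from our actual model, that simply prints the shapes Keras reports via its output_shape property. The layer sizes here are made up just for illustration:

from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed

DEMO_VOCAB, DEMO_HIDDEN = 50, 64   # made-up sizes, just for this sketch

demo = Sequential()
demo.add(LSTM(DEMO_HIDDEN, input_shape=(None, DEMO_VOCAB)))
print(demo.output_shape)   # (None, 64): one hidden vector per sequence

demo = Sequential()
demo.add(LSTM(DEMO_HIDDEN, input_shape=(None, DEMO_VOCAB), return_sequences=True))
print(demo.output_shape)   # (None, None, 64): one hidden vector per timestep

demo.add(TimeDistributed(Dense(DEMO_VOCAB)))
print(demo.output_shape)   # (None, None, 50): one vocabulary-sized vector per timestep

The None entries just mean that the batch size and the sequence length are not fixed in advance.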

3. Train the Network

In the next step, we train our Network using the data we prepared above. Here we want the Model to generate some text after each epoch, so we set nb_epoch=1 and put the training into a while loop. We also save the weights every 10 epochs so we can load them back later, without training the Network again!

 
nb_epoch = 0
while True:
    print('\n\n')
    # Train for exactly one epoch, then generate a sample to see the progress
    model.fit(X, y, batch_size=BATCH_SIZE, verbose=1, nb_epoch=1)
    nb_epoch += 1
    generate_text(model, GENERATE_LENGTH)
    # Save a checkpoint every 10 epochs so we can reload the weights later
    if nb_epoch % 10 == 0:
        model.save_weights('checkpoint_{}_epoch_{}.hdf5'.format(HIDDEN_DIM, nb_epoch))

4. Generate text

Last but not least, I want to talk a little about the method used to generate text. We begin with a random character and use the trained Model to predict the next one. Then we append the predicted character to the input and have the Model predict the next one, which is the third character, and so on. We continue the process until we obtain a sequence of the length we want (500 characters by default). It’s just that simple!

 
def generate_text(model, length):
    # Start from a random character
    ix = [np.random.randint(VOCAB_SIZE)]
    y_char = [ix_to_char[ix[-1]]]
    X = np.zeros((1, length, VOCAB_SIZE))
    for i in range(length):
        # Put the latest character (as a one-hot vector) at position i of the input
        X[0, i, :][ix[-1]] = 1
        print(ix_to_char[ix[-1]], end="")
        # Predict on the sequence so far and keep the most likely next character
        ix = np.argmax(model.predict(X[:, :i+1, :])[0], 1)
        y_char.append(ix_to_char[ix[-1]])
    return ('').join(y_char)

5. Result

I created the Network with three LSTM layers, each with 700 hidden units, and a Dropout ratio of 0.3 at the first LSTM layer. I trained the Network on a GPU for roughly a day (\(\approx200\) epochs), and here are some paragraphs generated by the trained Model:

“Yeah, I know, I saw him run off the balls of the Three Broomsticks around the Daily Prophet that we met Potter’s name!” said Hermione. “We’ve done all right, Draco, and Karkaroff would have to spell the Imperius Curse,” said Dumbledore. “But Harry, never found out about the happy against the school.”

“Albus Dumbledore, I should, do you? But he doesn’t want to adding the thing that you are at Hogwarts, so we can run and get more than one else, you see you, Harry.”

“I know I don’t think I’ll be here in my bed!” said Ron, looking up at the owners of the Dursleys.

“Well, you can’t be the baby way?” said Harry. “He was a great Beater, he didn’t want to ask for more time.”

“What about this thing, you shouldn’t,” Harry said to Ron and Hermione. “I have no furious test,” said Hermione in a small voice.

To be honest, I was impressed by what the Model can generate. After leaving it to learn for a while, as you can see, not only can it generate nearly perfect English words, but it has also learned the structure: it capitalizes the first letter after a period, it knows how to use quotation marks, and so on. If I hadn’t told you anything about RNNs, you might think (I almost do too!) that the paragraphs above were written by somebody. So, it’s now your turn to train your own Network on a dataset of your choice and see what you achieve. And if you find the result interesting, please let me know by dropping me a line below!

Summary

So we have come a long way to finish today’s post, and I hope you can now obtain some interesting results of your own. We walked through a brief introduction to why Recurrent Neural Networks are needed to overcome the limitations of common Neural Networks, and saw how LSTMs improve upon vanilla RNNs.

And we also implemented our own Network to create a simple text generator, which we can use to generate sample texts in the style of whatever it learned from! Note that this is just a quick and dirty implementation, and obviously there is a lot of room for improvement, which I will leave for you to explore by yourself.

That’s it for today. I will be back with you guys in the coming post, with even more interesting stuff. So just stay updated!
