Neural Text Generation

This post explores character-based language modeling with Long Short-Term Memory (LSTM) networks. The network is trained on a corpus of choice, such as movie scripts or novels, to learn a latent space, which is then sampled to generate new text. Training proceeds by feeding the network a sequence of characters one at a time and asking it to predict the next character. Initially the predictions are random, but over time the network learns from its mistakes and gets better at generating words, and even sentences, that appear meaningful. It also picks up dataset-specific habits such as opening and closing brackets, adding punctuation, hyphenating words, and so on.
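The training loop described above can be sketched in a few lines. This is a minimal illustration assuming PyTorch; the toy corpus, vocabulary size, and layer dimensions are made up for the example, not taken from the models in this post.

```python
# Minimal character-level LSTM sketch (assumes PyTorch is installed).
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, state=None):
        emb = self.embed(x)                 # (batch, seq, embed_dim)
        out, state = self.lstm(emb, state)  # (batch, seq, hidden_dim)
        return self.head(out), state        # logits over the next character

# One training step: the input is a sequence of character ids and the
# target is the same sequence shifted left by one position.
text = "hello world"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([[stoi[c] for c in text]])  # shape (1, len(text))
x, y = ids[:, :-1], ids[:, 1:]                 # predict the next character

model = CharLSTM(vocab_size=len(chars))
logits, _ = model(x)                           # (1, len(text) - 1, vocab)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, len(chars)), y.reshape(-1))
loss.backward()                                # gradients for one optimizer step
```

Repeating this step over a real corpus, with an optimizer update after each backward pass, is all the "learning to correct its mistakes" amounts to.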

The rest of the article explores networks trained on various datasets.

Name generator

Amre · Barbov · Calios · MDarEmy · Hyan · Feltoe · Gerar · Hoana

These names were generated by a neural network trained on the characters from the Marvel and DC universes. What's really impressive is that these characters do not exist. The network learned to generate superhero and supervillain names by going through each character name in Marvel and DC and building a language model that predicts the next letter given the letters before it. The original dataset is available here.

Data Preparation: Each entry in the dataset was converted into training data by breaking the name into an array of letters, then feeding the network one letter at a time and asking it to predict the next. The first few entries in the dataset look like this:

Input                                                    | Output
['I', 'R', 'O', 'N', 'M', 'A', 'N']                      | ['R', 'O', 'N', 'M', 'A', 'N', '.']
['W', 'O', 'N', 'D', 'E', 'R', 'W', 'O', 'M', 'A', 'N']  | ['O', 'N', 'D', 'E', 'R', 'W', 'O', 'M', 'A', 'N', '.']
['F', 'L', 'A', 'S', 'H']                                | ['L', 'A', 'S', 'H', '.']
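The table above can be produced with a one-line transform: shift the letters left by one and append an end-of-name marker. A stdlib-only sketch, using the names from the example table:

```python
# Turn each name into an (input, target) pair: the target is the input
# shifted left by one, with '.' marking the end of the name.
def make_pair(name, eos="."):
    letters = list(name)
    return letters, letters[1:] + [eos]

names = ["IRONMAN", "WONDERWOMAN", "FLASH"]
pairs = [make_pair(n) for n in names]
# pairs[0] == (['I', 'R', 'O', 'N', 'M', 'A', 'N'],
#              ['R', 'O', 'N', 'M', 'A', 'N', '.'])
```

The '.' marker is what lets the sampler know when a generated name is complete.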
Try it yourself
Generate a new name by typing the starting letter and the maximum length.
(If you leave it blank, I will pick a random letter for you)
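Behind the demo is a simple loop: start from the seed letter, repeatedly sample a next character from the model, and stop at the '.' marker or the maximum length. The sketch below uses a bigram-count "model" built from a toy name list as a stand-in for the trained LSTM, since the loop itself is the same either way.

```python
# Generation loop sketch. The bigram counts stand in for the trained LSTM;
# the toy name list is illustrative, not the actual dataset.
import random
from collections import defaultdict

names = ["IRONMAN", "WONDERWOMAN", "FLASH", "BATMAN", "SUPERMAN"]
counts = defaultdict(lambda: defaultdict(int))
for name in names:
    for a, b in zip(name, name[1:] + "."):  # '.' ends each name
        counts[a][b] += 1

def generate(seed, max_len, rng=random.Random(0)):
    out = seed
    while len(out) < max_len:
        nxt = counts[out[-1]]
        if not nxt:
            break
        ch = rng.choices(list(nxt), weights=list(nxt.values()))[0]
        if ch == ".":        # end-of-name marker: stop early
            break
        out += ch
    return out
```

With the real LSTM, `counts[out[-1]]` would be replaced by a forward pass that conditions on the entire prefix, not just the last letter.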

Joker

This model was trained on a corpus of jokes from Kaggle. I found the first iteration of Joker to be nasty and largely NSFW, owing to the large number of NSFW jokes in the dataset.

So I decided to clean it up by dropping the NSFW jokes, but reviewing 231,657 jokes manually wasn't too motivating. Luckily, there is already a list of bad words available on the CMU website, so I used it to drop every joke that contained a word from the list. After the cleanup, we had 170,959 jokes left for training. The second iteration of Joker (CleanJoker) looked clean enough to host online.
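The cleanup step is a simple set-intersection filter. A sketch with an illustrative blocklist and joke list (the post used the bad-words list from the CMU website):

```python
# Drop any joke that contains a word from the blocklist.
def clean(jokes, bad_words):
    bad = {w.lower() for w in bad_words}
    return [j for j in jokes if not (set(j.lower().split()) & bad)]

jokes = ["Why did the chicken cross the road?", "A darn rude joke"]
kept = clean(jokes, ["darn"])
# kept == ["Why did the chicken cross the road?"]
```

Note that whole-word matching like this misses obfuscated spellings and punctuation-glued words; it is a first pass, not a perfect filter.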

Try it yourself

A few handpicked examples generated by the Joker. While the generated text doesn't make sense in its entirety, it does seem funny in bits.

1. What do you do when you can be a little bear that says "I can start a bar"

2. Did you hear about pizza needed its packers!

3. What did the chicken creepling been drinking questions. The butchers to eat it.

Speaker

This model was trained on the TED Talks dataset available on Kaggle. Shown below is a sample run where the model was seeded with "Good morning. How are you?" and asked to generate 3,000 characters.
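Samples like the one below are typically drawn with temperature sampling: the seed text primes the model's hidden state, then each next character is drawn from the softmax of the output logits, with the temperature controlling how adventurous the output is. A stdlib-only sketch of the sampling step; the logits and character set are made up for illustration.

```python
# Temperature sampling over a character distribution.
import math
import random

def sample_char(logits, chars, temperature=1.0, rng=random.Random(0)):
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]        # softmax of the scaled logits
    return rng.choices(chars, weights=probs)[0]

chars = ["a", "b", "c"]
logits = [2.0, 0.5, 0.1]
# Low temperature sharpens the distribution toward the most likely character.
picks = [sample_char(logits, chars, temperature=0.2, rng=random.Random(i))
         for i in range(50)]
```

At high temperatures the samples get more varied (and more garbled); at low temperatures the model plays it safe and repeats itself.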

Good morning. How are you? Hard time.

This is why the other planet was like, "Do you know why, awts and and said, "I went across my inside.

The walkers can't consider, in presentation. In fact, I am going to hear my insurance as a cutting bone in space details, and as you feel a fully.

I think the two students -- maybe even better from what we can do. Doing this meant you know a momentum. We talk about the Google Deno right and to be at all. That is absolutely cliparly much graduate for reteriorally unfartied patients, equally expensive and watching about stress changes to murder all the whole new person who spend my hand and refugee. It's about the way full of people. And compose its easy to do the Mithan suffering? That's the two mixederal groups, the approach comes for them.

You can go to explore strategy. And about 50 years of restaurants. I was in Lesters, we have ever seen an angle of translation that every day and can work and a currency flishous nuisan generate. They have to be an interpret for a manuser on that. Imagine the problem that is a normal is some tools to improve centuries, and so putsurated at the front public influence. Last year at all. (Video) Our context means I'm talking about.

You program that? Now what this is the way that because that's why when they take the size of the Third-Power or English had no air at the time, and we can go buying this. First the spot that they actually put our children and consensus, who were so diagnality consumption came together. The markets, except against crispinescents passion. Thank you very much. AB:

Since the heart. For the same way we have the figure? And racism was taken Collars every history and rewarded as creative, still lead to pain. It's the right to take public activity of the alternative with times, was a route.

Yeah. So what if that would actually have an equal rrima. (Laughter) They had feeling fun. Here he says, "No. But you see that the ground stories that convert very till with Boston Loga. Humans has been sickled. The poor promise here? Bad moves up. And in society is not going an immediate violence to participate. this was there. I started trying to be obsessed with a second and remind the mentally change much homes that I felt is a conversation, every single person in radio -- wants to know what they do?

We're at earth-laser deadlines, writing men to police can be digital in the were aignabirt on the entire ninth patterns of the rich masters of translator-didn't have crew on the molecule just for a couple of otherness violence, and thus same way. How do we make the fact in efficiency processors, more laying while we've died, she is sensing. In particular, Gloria 2, like the building stuff that noticed this, the other school propism here to try to do things into the fish means a vowar kind of large success of the website without any really interesting because if I mention of the actor it's beautiful? Media, 13 in the problem, and to ask a little class on purpose and creative that they are the cars.

Alright, so the model is not going to replace a speech writer any time soon. But, hey, it could replace some people on conference calls.

Conclusion

The models above had to learn characters and letters, put them together to form words, and construct sentences. It may appear that these models understand what they are generating, but that is not the case. What they have learned is a statistical model of the language (dataset), and they are merely sampling from it.

On their own, the models may not be of much use, but they could serve as tools to augment writers: something like a smart editor that can autocomplete a sentence or show possible options (sampled from the latent space). The models could serve as autocomplete plugins (e.g. genre-based language models, or models trained on specific datasets such as speeches, movie scripts, etc.). There could even be a marketplace/plugin store hosting pre-trained language models.

PS: For those "forcing" their bots to read/write/watch, just ask them nicely and they'll oblige else they might get back at you with a powerful tensor exception 😛

References:
  1. The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy  [Link]
Icons created by Arjun are licensed under CC 3.0 BY

originally published 30 Jun 2018 and updated 10 Jul 2018