Machine Learning in Language

Mike Heavers
11 min read · Dec 5, 2018


This is part 2 of my art-focused Machine Learning course. Click here to access the overview, which provides links to the other workshops in the series.

Machine learning models within the context of language can operate quite differently from those used in image processing and other applications. For starters, because language follows such specific rules, when processing a series of words our ML models need some concept of what has come before each word, what comes after it, and the sort of context it operates in. For this reason, recurrent neural networks are often used rather than convolutional neural networks and other architectures.

Objectives

We will be looking to understand how Machine Learning works with respect to language, as well as examining some practical examples of ML in Language, including:

Word Vectors

Finding similar words, word analogies, averages between two different kinds of words, etc.

Generative Text

Generating text in the style of a trained corpus, such as J.K. Rowling's Harry Potter, and training a model on our own corpus of words.

Current State of ML in Language

Other applications of ML in language processing, including a predictive writer keyboard to which you can upload your own corpus of text.

RNNs

Recurrent neural networks (RNNs): networks with loops that create persistence. These networks base their knowledge on things they've learned in the past, rather than being insularly focused on the data they are processing at the present moment, as feed-forward networks are. In machine learning this is effective, but slow, and RNNs can also be difficult to train, though recent optimization algorithms have helped with this.

How’s it different from other neural networks?

Unlike feedforward neural networks, RNNs can use their internal state (memory) to process sequences of inputs.
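
To make that concrete, here is a deliberately tiny sketch of recurrence, using one made-up number per step rather than real vectors and learned weights: the same update runs at every step, and a hidden state carries what the network has seen so far into the next step.

// A loose sketch of recurrence: the hidden state is updated from the
// previous state plus the current input, so earlier inputs influence
// later steps. The 0.5 weights are made up; a real RNN learns them.
function rnnStep(prevState, input) {
  return Math.tanh(0.5 * prevState + 0.5 * input);
}

let state = 0;
for (const x of [1, 0, 1, 1]) { // a tiny input sequence
  state = rnnStep(state, x);
  console.log(state); // the state reflects everything seen so far
}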

WORD2VEC

A group of related models trained to reconstruct the linguistic contexts of words by generating word vectors. Essentially, each word is assigned a set of numerical values (a vector) based on the contexts in which it appears, and words with similar vectors end up grouped together.

Basic Premise:

Words that appear in the same contexts / proximity share semantic meaning.

“the quick brown fox jumped over the lazy dog”

We can organize this sentence into pairs where each target word is evaluated against the words that appear around it:

([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ...
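
As a rough sketch of how those pairs could be generated (the one-word window on each side of the target is just for illustration):

// Build ([context words], target) pairs from a sentence, using a window
// of one word on each side of the target.
const sentence = 'the quick brown fox jumped over the lazy dog';
const words = sentence.split(' ');
const pairs = [];
for (let i = 1; i < words.length - 1; i++) {
  pairs.push([[words[i - 1], words[i + 1]], words[i]]);
}
console.log(pairs[0]); // [['the', 'brown'], 'quick']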

The model is then trained with an optimization method such as Stochastic Gradient Descent (SGD). To visualize the resulting high-dimensional vectors, we can reduce them to two dimensions with a technique like t-SNE.

SGD

Different from standard gradient descent in that samples are selected randomly, rather than in order or as a single group. Remember walking down that mountain? Now imagine teleporting around it until you find the bottom.
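
As a very rough sketch of a single-parameter case (made-up data, nothing to do with word2vec specifically): pick a random sample, compute the gradient of the error for that sample alone, and nudge the parameter downhill.

// Minimal SGD sketch: fit y = w * x to a tiny dataset by repeatedly
// picking ONE random sample and stepping against its gradient.
const data = [{ x: 1, y: 2 }, { x: 2, y: 4 }, { x: 3, y: 6 }]; // true w is 2
let w = 0;
const learningRate = 0.05;
for (let step = 0; step < 1000; step++) {
  const sample = data[Math.floor(Math.random() * data.length)]; // random, not in order
  const error = w * sample.x - sample.y;   // how far off we are on this sample
  const gradient = 2 * error * sample.x;   // derivative of the squared error
  w -= learningRate * gradient;            // step downhill
}
console.log(w); // close to 2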

T-SNE

A method of reducing a dataset with many dimensions down to just two, so that it can be visualized.

History

Word2vec was developed by a team of researchers at Google, and is widely used in Natural Language Processing (NLP).

Common Challenges

One of the dangers of vectorizing words is oversimplifying the multiple meanings a word can have depending on its context: "bank" the financial institution and "bank" the side of a river, for example, end up sharing a single vector. It's a tricky art!

What’s available to us in ML5?

We can perform the following operations in ML5:

  • Add or Subtract
  • Average
  • Nearest
  • Get Random Word

Word2Vec Example

HTML File: 06-word2vec.html

Javascript File: assets/js/06-word2vec.js

Looking at the HTML, we have 3 different demonstrations of word2vec.

The first takes a word from the user, and tries to find its place among similar words from a trained model, displaying the top 10 results.

The second tries to find an average between two words.

The last tries to create an analogy: you specify one complete analogy, and give another word from which ml5 should complete an analogy of its own.

Looking at the javascript:

noLoop();
noCanvas();

We're not drawing anything on screen with p5, so we can optimize things by specifying that here. All we are going to do is place words into their appropriate containing HTML elements.

// Create the Word2Vec model with pre-trained file of 10,000 words
word2Vec = ml5.word2vec('/assets/data/wordvecs10000.json', modelLoaded);

ML5 comes with a pre-trained model of 10,000 word embeddings. We’ll specify here that word2vec should use those embeddings when assessing the words supplied by the user. When it has loaded this file (located at assets/data/wordvecs10000.json), it will call the modelLoaded function.
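
The modelLoaded callback isn't shown in this snippet; a minimal placeholder (an assumption on my part, the actual workshop file may update the page instead) only needs to signal that the embeddings are ready:

// A minimal placeholder for the modelLoaded callback referenced above.
function modelLoaded() {
  console.log('Word2Vec model loaded');
}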

// Select all the DOM elements
let nearWordInput = select('#nearword');
let nearButton = select('#submit');
let nearResults = select('#results');
let betweenWordInput1 = select("#between1");
let betweenWordInput2 = select("#between2");
let betweenButton = select("#submit2");
let betweenResults = select("#results2");
let addInput1 = select("#isto1");
let addInput2 = select("#isto2");
let addInput3 = select("#isto3");
let addButton = select("#submit3");
let addResults = select("#results3");

All this code does is declare a bunch of variables so that when we hit buttons on screen, the javascript can be notified, and when we have results, we can assign them to the appropriate places in the html.

// Finding the nearest words
nearButton.mousePressed(() => {
  let word = nearWordInput.value();
  word2Vec.nearest(word, (err, result) => {
    let output = '';
    if (result) {
      for (let i = 0; i < result.length; i++) {
        output += result[i].word + '<br/>';
      }
    } else {
      output = 'No word vector found';
    }
    nearResults.html(output);
  });
});

Above is the first example — finding the words that are most similar to the word supplied by the user. We grab that word (nearWordInput.value()), and call word2vec’s nearest() function, then pass the output to the html page.

betweenButton.mousePressed(() => {
  let word1 = betweenWordInput1.value();
  let word2 = betweenWordInput2.value();
  word2Vec.average([word1, word2], 4, (err, average) => {
    betweenResults.html(average[0].word);
  });
});

This is the second example: finding the average between two words with word2Vec.average([word1, word2], 4, (err, average) => { … }). If we look at the documentation, we can see that average() accepts a few parameters:

The first parameter is our input values. We can see it wants an array, so we could find the average of any number of words: 1 word, 2 words, 2 million words… The second parameter is how many results we want back. We're specifying 4 in the example, but really we only need one (the default).

average[0].word

Here we’re taking the first of the 4 results returned from word2vec, and displaying it on our page.

// Adding two words together to "solve" an analogy
addButton.mousePressed(() => {
  let is1 = addInput1.value();
  let to1 = addInput2.value();
  let is2 = addInput3.value();
  word2Vec.subtract([to1, is1])
    .then(difference => word2Vec.add([is2, difference[0].word]))
    .then(result => addResults.html(result[0].word));
});

This is our third example. We subtract the value of our first word from our second, then take that difference and add it to the third word we provide, theorizing that this will land us in a similar place on the spectrum of like words.
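
As a hypothetical usage, assuming 'king', 'man', and 'woman' all exist in the 10,000-word embedding file, the classic analogy would look like this:

// Hypothetical: "man is to king as woman is to ?"
// difference = king - man; adding it to woman should land near "queen".
word2Vec.subtract(['king', 'man'])
  .then(difference => word2Vec.add(['woman', difference[0].word]))
  .then(result => console.log(result[0].word)); // likely "queen"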

LSTM

LSTM stands for Long Short-Term Memory: a neural network architecture based on RNNs that is useful for working with sequential data where the order of the sequence matters.

The "long-term" part means it can use the past to inform the present. It does this with loops, and by not throwing away the lessons learned from errors in its past.

Google used LSTMs in its speech recognition algorithms starting in 2015 and saw a 49% jump in accuracy.

How it Works

The horizontal line at the top of a standard LSTM diagram is the cell state. It's like a conveyor belt: it can carry information along unchanged through each step of the network.

The vertical arrows are gates. They control what information gets onto the conveyor belt. Their output ranges between 0 and 1, with 0 meaning let no information through, and 1 meaning let everything through. There are 3 of these gates, which together carry out the 4 basic evaluation operations of an LSTM (sketched in code after the list below):

  • Decide what to throw away
  • Decide what to update
  • Decide what new information to allow
  • Decide what to output
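
To make those four operations concrete, here is a deliberately simplified, scalar (one-dimensional) sketch of a single LSTM step. Real LSTMs operate on vectors with learned weight matrices; the weights below are made-up placeholders, not anything from ml5.

const sigmoid = x => 1 / (1 + Math.exp(-x));

function lstmStep(x, prevH, prevC, w) {
  const forgetGate = sigmoid(w.fx * x + w.fh * prevH + w.fb);  // decide what to throw away
  const inputGate = sigmoid(w.ix * x + w.ih * prevH + w.ib);   // decide what to update
  const candidate = Math.tanh(w.cx * x + w.ch * prevH + w.cb); // decide what new information to allow
  const outputGate = sigmoid(w.ox * x + w.oh * prevH + w.ob);  // decide what to output
  const c = forgetGate * prevC + inputGate * candidate;        // new cell state (the conveyor belt)
  const h = outputGate * Math.tanh(c);                         // new hidden state / output
  return { h, c };
}

// One step with arbitrary weights, just to show the shape of the call:
const weights = { fx: 0.5, fh: 0.1, fb: 0, ix: 0.5, ih: 0.1, ib: 0, cx: 0.5, ch: 0.1, cb: 0, ox: 0.5, oh: 0.1, ob: 0 };
console.log(lstmStep(1, 0, 0, weights));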

LSTM in ML5

In our example we'll use a model pre-trained on a body of text to generate new text.

HTML File: 07-lstm.html

Javascript File: assets/js/07-lstm.js

In our HTML, we've got a dropdown of models trained on different bodies of text. The goal is to render new text in the style of the text our model was trained on.

Our seed text is something to get the model started: it will attempt to continue from the seed, generating output of a specified length. Temperature controls the randomness of the output: a value near 1 is more inventive, but potentially unrecognizable English; a value near 0 plays it safe and sticks closely to patterns in the reference material. Choosing a temperature is about finding a balance between invention and imitation.

In our javascript:

lstm = ml5.LSTMGenerator('assets/models/jkrowling/', modelReady);

We’ll have some Harry Potter queued up as the default body of work, but if we change the dropdown, we’ll pass that selection to the LSTMGenerator method instead.

textInput = select('#textInput');
lengthSlider = select('#lenSlider');
tempSlider = select('#tempSlider');
corpus = select('#corpus');
button = select('#generate');

We do our standard assignment of HTML buttons and fields above…

// DOM element events
button.mousePressed(generate);
corpus.input(updateCorpus);
lengthSlider.input(updateSliders);
tempSlider.input(updateSliders);

…and specify what functions to call when a user interacts with them.

function updateCorpus() {
  const corpusVal = corpus.value();
  select('#status').html('Loading Model');
  lstm = ml5.LSTMGenerator(`assets/models/${corpusVal}/`, modelReady);
}

As mentioned before, here’s where we assign a new model to our LSTM Generator when a selection is made from the dropdown.

// Update the slider values
function updateSliders() {
  select('#length').html(lengthSlider.value());
  select('#temperature').html(tempSlider.value());
}

These functions make sure that whenever we change a slider, we capture its new value.

function generate() {
  // Update the status log
  select('#status').html('Generating...');
  // Grab the original text
  let original = textInput.value();
  // Make it lower case
  let txt = original.toLowerCase();
  // Check if there's something to send
  if (txt.length > 0) {
    // This is what the LSTM generator needs:
    // seed text, temperature, length of output
    // TODO: What are the defaults?
    let data = {
      seed: txt,
      temperature: tempSlider.value(),
      length: lengthSlider.value()
    };
    // Generate text with the lstm
    lstm.generate(data, gotData);
    // When it's done
    function gotData(err, result) {
      // Update the status log
      select('#status').html('Ready!');
      select('#result').html(txt + result);
    }
  }
}

Here’s where all the work gets done. When we submit our words and slider values, we display the status, and construct a data object with all of our input values: seed, temperature, and length. Then we pass that object to ML5’s lstm.generate() method, and display the results.


Training Your Own Model

You could train your own model by installing tensorflow, but this can be notoriously difficult, especially on the Mac, and unless you have a CUDA-enabled Graphics Card, it will also be very slow to train.

For this class, it's Paperspace to the rescue again. To train your model with Paperspace:

  • Make a directory on your computer for the project, and open up a terminal window. Then type cd and drag the folder from your Finder window into the terminal window (or type out the full path to your folder).
  • Type paperspace login, and enter the username and password you set up previously.
  • Open a terminal window at the 08/lstm-training folder.
  • Get some source material to train your model on; the more content the better. I am using a short and terrible book I wrote when I was a teenager. Copy all of the text and put it in a file. You can use TextEdit or Notepad if you want; just make sure you save it as plain text (Format > Make Plain Text in TextEdit).
  • See that directory called data in the directory you just changed into? It’s got a folder named zora_neale_hurston. We’re gonna make a folder beside that and save our own material in it. Call your folder whatever you want, and inside it, save your plain text source material file as input.txt.
  • Edit the run.sh file with a code editor, and change the --data_dir=./data/zora_neale_hurston \ bit to reference the directory you made for your own material (e.g. --data_dir=./data/my_dir \).
  • Now we run the process with Paperspace. Type: paperspace jobs create --container tensorflow/tensorflow:1.5.1-gpu-py3 --machineType P5000 --command 'bash run.sh' --project 'LSTM training'.
  • Wait patiently… Eventually your model will be trained. You can log in to the Paperspace console and click on the project name in the Project column, then click on Logs to view the results, or check out the Metrics to see how your model improved over time.
  • Get your stuff. From within the models directory of the ml5js_example folder, run paperspace jobs artifactsGet --jobId [YOUR_JOB_ID], where [YOUR_JOB_ID] can be found in the Paperspace console under Gradient > Jobs in the Machine Type / Job ID column.
  • Within the sketch.js file in the example folder, change the const lstm = ml5.LSTMGenerator('models/hemingway/', modelReady); line from hemingway to the name of the folder you just made.
  • Start a server! With python 2:

python -m SimpleHTTPServer 8001

or with python 3:

python -m http.server 8001

  • Visit http://localhost:8001 in your browser — this time the example will pull from your own body of work to make its predictions.

Code Behind the Code

If we look at the source code for ML5, in the LSTM > index.js file, we’ll see some perhaps unfamiliar terminology. What do these things mean?

Softmax: present in the final layer of a neural network, this "squashes" a vector of raw scores into a single probability distribution.

Forget Bias: how much of the previous step's information to allow through or discard.

One-Hot Encoding: converting data into a format that can be used by the machine learning algorithm (in this case TensorFlow), by turning each categorical label into a binary vector that is all 0s except for a single 1.
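
To make the first and last of those concrete, here's a tiny sketch in plain JavaScript (not ml5 or TensorFlow code):

// Softmax: squash an array of raw scores into probabilities that sum to 1.
function softmax(scores) {
  const exps = scores.map(Math.exp);
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}
console.log(softmax([2, 1, 0.5])); // roughly [0.63, 0.23, 0.14]

// One-hot encoding: each label becomes a vector that is all 0s except for
// a single 1 at that label's index.
const vocab = ['a', 'b', 'c', 'd'];
function oneHot(label) {
  return vocab.map(v => (v === label ? 1 : 0));
}
console.log(oneHot('c')); // [0, 0, 1, 0]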

CURRENT EXAMPLES OF ML IN LANGUAGE

Botnik’s Predictive Writer Keyboard

A better implementation of LSTMs, Botnik’s Predictive Writer keyboard trains itself on a corpus of text, but allows you to choose suggested words from that corpus, replace them with new, randomly selected words, or override it with your own text. You can also upload your own corpus of text, or combine existing corpuses for some weird results.

Botnik analyzes a corpus of text using an algorithm that determines common patterns of words. Then it looks at the last three words of your text to figure out suggestions for the next word. Combining corpuses recalculates word frequency and common word patterns across both texts.
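
As a toy illustration of the "last few words suggest the next word" idea (this is not Botnik's actual algorithm, just the general concept):

// Build a lookup from each run of three consecutive words to the words
// that follow it, then suggest continuations for the last words typed.
function buildSuggestions(corpus, n = 3) {
  const tokens = corpus.toLowerCase().split(/\s+/);
  const table = {};
  for (let i = 0; i + n < tokens.length; i++) {
    const key = tokens.slice(i, i + n).join(' ');
    (table[key] = table[key] || []).push(tokens[i + n]);
  }
  return table;
}

const suggestions = buildSuggestions('the quick brown fox jumped over the lazy dog');
console.log(suggestions['the quick brown']); // ['fox']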

Authors Using AI To Write Novels

Author Robin Sloan uses a custom-trained AI for predictive text suggestions.

FOR NEXT CLASS

  • Train Botnik on a corpus of text of your choosing. Use that to write a short story. A couple of paragraphs is fine.
  • Research some examples of Machine Learning being used in text / language. We’ll share what we found in class. Be ready with one example of a project you might do involving ML in Text / Speech.
  • We’ll be looking into some new coding environments. If you plan on using your own computer, make sure you have the following installed:
  • Python 2.7+
  • PIP
  • Node / NPM
  • In addition, the following are optional (but encouraged)
  • Git
  • Audacity
  • Polyphone
