Bayan Bennett

Pseudo-English—Typing Practice w/ Machine Learning

TensorFlow
JavaScript

Articles in this series:

  1. Introduction
  2. Pseudo-English
  3. Keyboard Input
  4. Inference Using Web Workers

The finished project is located here: https://www.bayanbennett.com/projects/rnn-typing-practice

Objective

Generate English-looking words using a recurrent neural network.

Trivial Methods

Before settling on ML, I first had to convince myself that the trivial methods did not provide adequate results.

Random Letters

// Pick a uniformly random element from a string (or array) of characters
const getRandom = (distribution) => {
  const randomIndex = Math.floor(Math.random() * distribution.length);
  return distribution[randomIndex];
};

const alphabet = "abcdefghijklmnopqrstuvwxyz";

const randomLetter = getRandom(alphabet);
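
For reference, the five-letter samples below can be assembled by drawing one random letter at a time; a hypothetical helper (not from the original project) might look like this:

const randomWord = (length) =>
  Array.from({ length }, () => getRandom(alphabet)).join("");

randomWord(5); // e.g. "snyam"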

Unsurprisingly, the output bears no resemblance to English words, and the generated character sequences were painful to type. Here are a few examples of five-letter words:

snyam	iqunm	nbspl	onrmx	wjavb	nmlgj
arkpt	ppqjn	zgwce	nhnxl	rwpud	uqhuq
yjwpt	vlxaw	uxibk	rfkqa	hepxb	uvxaw

Weighted Random Letters

What if we generated sequences that had the same distribution of letters as English? I obtained the letter frequencies from Wikipedia and created a JSON file that maps each letter of the alphabet to its relative frequency.

// letter-frequencies.json
{
  "a": 0.08497,  "b": 0.01492,  "c": 0.02202,  "d": 0.04253,
  "e": 0.11162,  "f": 0.02228,  "g": 0.02015,  "h": 0.06094,
  "i": 0.07546,  "j": 0.00153,  "k": 0.01292,  "l": 0.04025,
  "m": 0.02406,  "n": 0.06749,  "o": 0.07507,  "p": 0.01929,
  "q": 0.00095,  "r": 0.07587,  "s": 0.06327,  "t": 0.09356,
  "u": 0.02758,  "v": 0.00978,  "w": 0.02560,  "x": 0.00150,
  "y": 0.01994,  "z": 0.00077
}

The idea here is to create a large sequence of letters whose distribution closely matches the frequencies above. Math.random has a uniform distribution, so when we select random letters from that sequence, the probability of picking a letter matches its frequency.

const TARGET_DISTRIBUTION_LENGTH = 1e4; // 10,000

const letterFrequencyMap = require("./letter-frequencies.json");

const letterFrequencyEntries = Object.entries(letterFrequencyMap);

const reduceLetterDistribution = (result, [letter, frequency]) => {
  // e.g. "a" (frequency 0.08497) is repeated round(10000 * 0.08497) = 850 times
  const num = Math.round(TARGET_DISTRIBUTION_LENGTH * frequency);
  const letters = letter.repeat(num);
  return result.concat(letters);
};

const letterDistribution = letterFrequencyEntries
  .reduce(reduceLetterDistribution, "");

const randomLetter = getRandom(letterDistribution);

The increase in the number of vowels was noticeable, but the generated sequences still failed to resemble English words. Here are a few examples of five-letter words:

aoitv	aertc	cereb	dettt	rtrsl	ararm
oftoi	rurtd	ehwra	rnfdr	rdden	kidda
nieri	eeond	cntoe	rirtp	srnye	enshk

Markov Chains

This would be the next logical step: building a table of probabilities for each letter conditioned on the previous letter (or previous few letters). This was the point where I decided to go straight to RNNs. If anyone would like to implement this approach, I'd be interested in seeing the results.
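
For the curious, a minimal first-order (bigram) sketch might look something like the following. This is a hypothetical illustration, not code from the project:

// Count how often each letter follows each other letter in a word list,
// then sample the next letter conditioned on the current one.
const buildBigramCounts = (words) => {
  const counts = {};
  for (const word of words) {
    const padded = ` ${word} `; // spaces act as start/end markers
    for (let i = 0; i < padded.length - 1; i++) {
      const current = padded[i];
      const next = padded[i + 1];
      counts[current] = counts[current] || {};
      counts[current][next] = (counts[current][next] || 0) + 1;
    }
  }
  return counts;
};

const sampleNext = (counts, current) => {
  const entries = Object.entries(counts[current] || {});
  const total = entries.reduce((sum, [, n]) => sum + n, 0);
  let r = Math.random() * total;
  for (const [letter, n] of entries) {
    r -= n;
    if (r <= 0) return letter;
  }
  return " ";
};

// Start from the space marker and sample letters until another space appears.
const generateWord = (counts) => {
  let letter = sampleNext(counts, " ");
  let word = "";
  while (letter !== " " && word.length < 15) {
    word += letter;
    letter = sampleNext(counts, letter);
  }
  return word;
};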

Recurrent Neural Networks

Neural networks are usually memoryless: the system has no information from previous steps. RNNs are a type of neural network where the previous state of the network is an input to the current step.

  • Input: A character.
  • Output: A tensor with the probabilities for the next character.

NNs are inherently bad at processing inputs of varying length, though there are ways around this (like positional encoding in transformers). With RNNs, the inputs are consistent in size: a single character. Natural language processing has a natural affinity for RNNs, as languages are unidirectional (LTR or RTL) and the order of the characters is important. For example, although the words united and untied only have two characters swapped, they have opposite meanings (see: antigram).

The model below is based on the TensorFlow "Text generation with an RNN" tutorial.

Input Layer with Embedding

This was the first time I encountered the concept of an embedding layer. It was a fascinating concept and I was excited to start using it.

I wrote a short post summarizing embeddings here: https://bayanbennett.com/posts/embeddings-in-machine-learning

const generateEmbeddingLayer = (batchSize, outputDim) =>
  tf.layers.embedding({
    inputDim: vocabSize, // one embedding vector per character in the vocabulary
    outputDim,
    maskZero: true, // the integer 0 (our "\0" padding) is treated as a masked value
    batchInputShape: [batchSize, null], // null allows sequences of any length
  });
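
As a quick, hypothetical check of the shapes involved (the outputDim of 16 is arbitrary, and vocabSize comes from the helpers shown further below):

const embedding = generateEmbeddingLayer(1, 16);
// An integer-encoded batch of shape [1, timesteps] becomes [1, timesteps, 16]
const embedded = embedding.apply(tf.tensor2d([[4, 2, 21]])); // "cat" as integers
console.log(embedded.shape); // [1, 3, 16]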

Gated Recurrent Unit (GRU)

I don't have enough knowledge to justify why a GRU was chosen, so I deferred to the implementation in the aforementioned TensorFlow tutorial.

const generateRnnLayer = (units) =>
  tf.layers.gru({
    units, // set to vocabSize when the model is built, so each timestep scores every character
    returnSequences: true, // emit an output for every timestep, not just the last
    recurrentInitializer: "glorotUniform",
    activation: "softmax", // each timestep's output becomes a probability distribution
  });
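
Again as a hypothetical shape check: with returnSequences set to true, the layer produces one output vector per timestep, and with units equal to vocabSize plus a softmax activation, each of those vectors is a probability distribution over the characters.

const gru = generateRnnLayer(vocabSize);
const output = gru.apply(tf.randomNormal([1, 3, 16])); // a stand-in for an embedded batch
console.log(output.shape); // [1, 3, 28] with the 28-character vocabulary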

Putting it all together

Since we are sequentially feeding the output of one layer into the input of another layer, tf.Sequential is the class of model that we should use.

const generateModel = (embeddingDim, rnnUnits, batchSize) => {
  const layers = [
    generateEmbeddingLayer(batchSize, embeddingDim),
    generateRnnLayer(rnnUnits),
  ];
  return tf.sequential({ layers });
};

Training Data

I used Princeton's WordNet 3.1 data set as a source for words.

"WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets)..."
Princeton University. "About WordNet." WordNet. Princeton University, 2010.

Since I was only interested in the words themselves, I parsed each file and extracted just the words. Entries containing spaces were split into separate words. Words matching any of the following criteria were also removed (a rough sketch of this filtering follows the list):

  • Words with diacritics
  • Single character words
  • Words with numbers
  • Roman numerals
  • Duplicate words
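
A rough, hypothetical sketch of that filtering (the actual parsing code is not shown in this post, and the Roman-numeral check in particular is a simplification that also drops a few real words such as "mix"):

const hasDiacritics = (word) => /[^\x00-\x7F]/.test(word); // treat any non-ASCII as a diacritic
const hasNumbers = (word) => /\d/.test(word);
const isRomanNumeral = (word) =>
  word.length > 0 &&
  /^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$/.test(word);

const cleanWords = (rawWords) => {
  const words = rawWords
    .flatMap((entry) => entry.toLowerCase().split(" ")) // split multi-word entries
    .filter((word) => word.length > 1)                  // remove single-character words
    .filter((word) => !hasDiacritics(word))
    .filter((word) => !hasNumbers(word))
    .filter((word) => !isRomanNumeral(word));
  return Array.from(new Set(words));                    // remove duplicates
};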

Dataset Generator

Both tf.LayersModel and tf.Sequential have a .fitDataset method, which is a convenient way of fitting a dataset. We need to create a tf.data.Dataset, but first, here are some helper functions:

// utils.js

// The vocabulary: "\0" for padding, " " as the word separator, then the letters a-z
const characters = Array.from("\0 abcdefghijklmnopqrstuvwxyz");
const mapCharToInt = Object.fromEntries(
  characters.map((char, index) => [char, index])
);

const vocabSize = characters.length;

const int2Char = (int) => characters[int];
const char2Int = (char) => mapCharToInt[char];

// dataset.js

const wordsJson = require("./wordnet-3.1/word-set.json");
const wordsArray = Array.from(wordsJson);

// add 1 to max length to accommodate a single space that follows each word
const maxLength = wordsArray.reduce((max, s) => Math.max(max, s.length), 0) + 1;

const data = wordsArray.map((word) => {
  const paddedWordInt = word
    .concat(" ")
    .padEnd(maxLength, "\0")
    .split("")
    .map(char2Int);
  return { input: paddedWordInt, expected: paddedWordInt.slice(1).concat(0) };
});

function* dataGenerator() {
  for (let { input, expected } of data) {
    /* If I try to make the tensors inside `wordsArray.map`,
     * I get an error on the second epoch of training */
    yield { xs: tf.tensor1d(input), ys: tf.tensor1d(expected) };
  }
}

module.exports.dataset = tf.data.generator(dataGenerator);

Note that we need all the inputs to be the same length, so we pad every word with null characters, which the char2Int function converts to the integer 0.
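
To make the input/expected pairing concrete, here is a hypothetical illustration (the real maxLength is much larger; 6 is used only to keep the example short):

// word "cat", appended space, padded: "cat \0\0"
// input:    [4, 2, 21, 1, 0, 0]   // "c", "a", "t", " ", "\0", "\0"
// expected: [2, 21, 1, 0, 0, 0]   // the input shifted left by one character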

Generating and compiling the model

Here it is, the moment we've been building towards:

const BATCH_SIZE = 500;

// Passing false drops the final partial batch, so every batch matches the fixed batchInputShape
const batchedData = dataset.shuffle(10 * BATCH_SIZE).batch(BATCH_SIZE, false);
const model = generateModel(vocabSize, vocabSize, BATCH_SIZE);
const optimizer = tf.train.rmsprop(1e-2);

model.compile({
  optimizer,
  // the "sparse" variants take integer class labels (our char2Int encoding)
  // rather than one-hot vectors
  loss: "sparseCategoricalCrossentropy",
  metrics: tf.metrics.sparseCategoricalAccuracy,
});

model.fitDataset(batchedData, { epochs: 100 });

A batch size of 500 was selected, as that was roughly the largest I could fit without running out of memory.
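
The sampling code is not shown in this post. Assuming it follows the approach from the TensorFlow tutorial, inference might look roughly like the sketch below: the model is rebuilt with a batch size of 1, the trained weights are copied over, and characters are sampled one at a time until a space or padding character appears. The sampleWord helper is hypothetical, not the project's actual code.

// Assumes training (model.fitDataset above) has already completed.
const inferenceModel = generateModel(vocabSize, vocabSize, 1);
inferenceModel.setWeights(model.getWeights());

const sampleWord = (maxChars = 15) => {
  const sequence = [char2Int(" ")]; // seed with the word separator
  for (let i = 0; i < maxChars; i++) {
    const input = tf.tensor2d([sequence], [1, sequence.length]);
    // Output shape is [1, sequence.length, vocabSize]; keep only the last timestep.
    const probabilities = inferenceModel
      .predict(input)
      .squeeze([0])
      .slice([sequence.length - 1, 0], [1, vocabSize])
      .squeeze();
    // The last argument marks the values as normalized probabilities (softmax output).
    const next = tf.multinomial(probabilities, 1, undefined, true).dataSync()[0];
    if (next === char2Int(" ") || next === char2Int("\0")) break; // end of word
    sequence.push(next);
  }
  return sequence.slice(1).map(int2Char).join("");
};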

Examples

ineco uno kam whya qunaben qunobin
xexaela sadinon zaninab mecoomasph
anonyus lyatra fema inimo unenones

It's not perfect, but it produces words that vaguely appear to come from another Romance or Germanic language. The combined size of the model.json and weights.bin files is only 44 kB. This is important, since simpler models generally run inference faster and are light enough for the end user to download without affecting perceived page performance.

The next step is where the fun begins, building a typing practice web app!

© 2022 Bayan Bennett