Word Embeddings

Joshua Szymanowski
5 min read · Jun 1, 2020

An attempt at describing the vectorization of language

As described in my first post, word vectors are crucial to analyzing, comparing, predicting, and generating text. They also seem quite mysterious at first glance. In fact, take a look at a vector I created for the word mysterious.

Huh?

What are they anyway?

Word vectors (more specifically referred to as “word embeddings”) are numerical representations of a word’s meaning, as shown in the array above. Each number in the array is the word’s value along one dimension, and each dimension loosely stands for some aspect of meaning, with 1 being full alignment with that aspect and -1 being its opposite. It is worth noting that there is no explicit meaning here: you can’t call a function that tells you what a particular dimension means. The dimensions are simply parameters learned by the model you train, and each value expresses how strongly the word aligns with whatever that dimension has come to represent.

In the example pictured above, there are one hundred dimensions (note: when you create word vectors from a corpus, you can specify the number of dimensions). If you can imagine a one-hundred-dimensional space, each word in the corpus is a vector within that space. Proximity between word vectors corresponds to similarity in meaning.
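
To make “proximity” concrete, the usual measure of closeness between two word vectors is cosine similarity. Here is a minimal sketch using NumPy, with two made-up 100-dimensional vectors standing in for real embeddings:

```python
import numpy as np

# Two made-up 100-dimensional vectors standing in for word embeddings
rng = np.random.default_rng(42)
vec_mysterious = rng.uniform(-1, 1, 100)
vec_enigmatic = vec_mysterious + rng.normal(0, 0.1, 100)  # a "nearby" vector

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(vec_mysterious, vec_enigmatic))    # close to 1
print(cosine_similarity(vec_mysterious, -vec_mysterious))  # exactly -1
```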

To think about it more simply, and to get a sense of why this is important, consider the following depiction of a classic example:

[Figure: two-dimensional depiction of the King, Man, Woman, and Queen vectors (source)]

Subtracting Man from King gives you the yellow vector; adding that yellow vector to Woman then gives you a vector equal to (or very nearly equal to) the one for Queen. In a simplistic sense, the yellow vector could be seen as representing something like Royalty. It goes without saying, however, that the English language is rather complicated, and the larger the corpus a model is trained on, the more nuanced and complicated a word’s vector becomes.
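
If you have a set of trained vectors loaded in Gensim, you can check this analogy directly with most_similar. A sketch, assuming the gensim.downloader module and its pretrained “glove-wiki-gigaword-100” vectors (any trained model would do):

```python
import gensim.downloader as api

# Downloads a set of pretrained vectors on first use
wv = api.load("glove-wiki-gigaword-100")

# king - man + woman ≈ ?
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically ranks at or near the top of the results
```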

What’s so special about that?

Prior to these richer word vectors, NLP practitioners mainly represented words with a simpler method called one-hot encoding. One-hot encoding can be used for many things; it turns any categorical data into vectors, and it is still widely used in certain circumstances. While it is much simpler and faster to implement than word embedding, a one-hot vector is purely binary, so it carries no information about context or frequency. In other words, it can tell you nothing about the meaning (the semantic value) of a word.

So while one-hot encoding can tell you which words appear in a sentence, it cannot tell you how those words relate to one another; embeddings can capture both. Word vectors as I’ve been discussing them (again, more frequently referred to as word embeddings) go much further in giving us a semantic and syntactic sense of words. One-hot encoding also gets unwieldy with large corpora, since each vector has the same number of dimensions as there are words in the vocabulary (as always, beware the curse of dimensionality).
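
For comparison, here is what one-hot encoding looks like for a toy vocabulary; a sketch for illustration only. Notice that every vector is as long as the vocabulary, and no two words are any “closer” than any other two:

```python
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Binary vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("fox"))    # [0 0 0 1 0]
print(one_hot("quick"))  # [0 1 0 0 0]
# Every pair of distinct words has dot product 0: no notion of similarity,
# and the vector length grows with the size of the vocabulary.
```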

How do I create word embeddings?

The most common tool for creating word embeddings is Word2vec, an algorithm published in 2013 (and later patented) by researchers at Google. Others include GloVe, developed by researchers at Stanford, and fastText, developed by Facebook. Word2vec and fastText train shallow neural networks on a corpus, while GloVe works from global word co-occurrence statistics. For the rest of this article, however, I’ll be focusing only on Word2vec.

Gensim and TensorFlow
There are two main libraries that can be used to vectorize words using Word2vec: Gensim and TensorFlow. I’ve tried both, and so far Gensim seems a bit easier to implement and play around with.

After you’ve found or created a corpus (for how to create a custom corpus, refer to my previous post), you can input the sentences from the corpus into Gensim’s Word2Vec neural network to build a model. With the model, you can analyze text in various ways, including finding word similarities, as well as building the framework for a text generator.
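As a rough sketch of what that can look like, assuming Gensim 4.x (where the dimension argument is named vector_size) and using two placeholder sentences in place of a real tokenized corpus:

```python
from gensim.models import Word2Vec

# Placeholder for a real corpus: a list of tokenized sentences
sentences = [
    ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
    ["the", "mysterious", "fox", "vanished", "into", "the", "brown", "woods"],
]

model = Word2Vec(
    sentences=sentences,
    vector_size=100,  # number of dimensions per word vector
    window=2,         # context words considered on each side
    min_count=1,      # keep every word, even ones that appear once
    sg=0,             # 0 = CBOW, 1 = skip-gram (more on these below)
)

print(model.wv["mysterious"].shape)         # (100,)
print(model.wv.most_similar("fox", topn=3))  # nearest neighbors in the toy model
```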

And make cool (though, for me, still mysterious) graphs.

Where do the numbers actually come from?

You may be wondering where the vector values actually come from. Within Word2vec, there are two common architectures for learning them: continuous bag-of-words (CBOW) and skip-gram.

Continuous Bag of Words (CBOW)
In CBOW, the model looks at the words surrounding a given word (its context) and uses that context to predict the word itself; training maximizes the likelihood of the correct word given its context, which builds up the probabilities of words appearing near one another. Using CBOW, you can adjust the size of the “window”, i.e. how many words before and after the target word are used for prediction. The vectors of those context words are averaged, and that average is used to predict the most probable target word.

For example, for the sentence “the quick brown fox jumps”, a window size of 2 would use the, quick, fox, and jumps to predict the word brown.
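
To make the windowing concrete, here is a small sketch that slides a window of size 2 over that sentence and prints each (context, target) pair a CBOW model would train on:

```python
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

for i, target in enumerate(sentence):
    # Context: up to `window` words on each side of the target
    context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
    print(f"context={context} -> target={target!r}")

# For i=2 this prints: context=['the', 'quick', 'fox', 'jumps'] -> target='brown'
```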

One drawback of the CBOW model is that, since it always predicts the most probable word, it does not work well for rare words. Say the actual color of the fox were umber; a CBOW model would most likely have trouble predicting that, because umber is a much less common word. That said, CBOW is much faster than skip-gram and more accurate for frequent words.

Skip-gram
With skip-gram, the process is reversed: the model predicts the context based on an input word. Given the word umber as input, a skip-gram model might tell us that the most probable context is the quick…fox jumps.

Skip-gram models tend to do better on smaller datasets, handle rare words far better, and cope better with longer sentences.
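
In Gensim, switching between the two architectures is just the sg flag on the same Word2Vec class; a minimal sketch, with a tiny placeholder corpus:

```python
from gensim.models import Word2Vec

sentences = [["the", "quick", "umber", "fox", "jumps"]]  # placeholder corpus

# Same class, same hyperparameters; only `sg` changes the training architecture.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram
```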

One final note

The more I read about word embeddings the more confused I become, but hopefully this article has helped give you and me a bit more understanding. I look forward to updating this as I learn and understand more.

So in the meantime, I’ll have to learn how to deal with a constantly exploding brain.
