One of my current projects is to create a machine learning (ML) process to extract relevant information from emails. Essentially a bit of Named Entity Recognition (NER) leading to more interesting stuff.
As part of the testing and development for the NER system I have been using a neural network to tag the words in emails.
But as you all probably know, computers like numbers and not words.
While there are a number of ways to transform text data into numerical data, here I will demonstrate one method using Google’s Word2Vec approach.
For the demonstration I will be using some data on people’s names, company names and cities in which the companies are based. There is nothing confidential about this as the first and last names are split, and the company names are split into their component parts, i.e. Stamford Enterprise Ltd becomes [‘Stamford’, ‘Enterprise’, ‘Ltd’] and gets added to the mix of data.
To convert each word into a numerical representation I will be using a pre-built version of Google’s Word2Vec model. The model contains a vocabulary of 3 million words/phrases and was trained on roughly 100 billion words from Google News data. An example of the implementation in Python can be seen [here]. The output from the model is a vector of 300 numbers.
In its basic form you give the model a word or phrase and, if that word or phrase is in the model’s vocabulary, it returns 300 numbers representing it.
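The lookup described above can be sketched roughly as follows. This is a minimal illustration and not the code used for this post: `embed_words` is a hypothetical helper, and `toy_vectors` is a made-up stand-in with 3 numbers per word instead of 300.

```python
def embed_words(words, vectors):
    """Return (found_words, found_vectors), skipping words not in the vocabulary.

    `vectors` can be anything that supports `word in vectors` and
    `vectors[word]`, such as a plain dict.
    """
    found_words, found_vectors = [], []
    for word in words:
        if word in vectors:  # only words the model knows get a vector
            found_words.append(word)
            found_vectors.append(vectors[word])
    return found_words, found_vectors


# Toy stand-in for the real model: made-up 3-dimensional vectors.
toy_vectors = {
    "London": [0.9, 0.1, 0.0],
    "John": [0.1, 0.8, 0.2],
}

kept, vecs = embed_words(["John", "London", "Stamfrod"], toy_vectors)
print(kept)  # ['John', 'London'] (the misspelt word is not in the vocabulary)
```

If you load the pre-trained vectors with gensim (an assumption, other loaders exist), `KeyedVectors.load_word2vec_format(path, binary=True)` returns an object that supports the same `in` and `[]` operations, so it could be passed in place of `toy_vectors`.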
To visualise the data, the 300 numbers are run through a dimensionality reduction technique – t-Distributed Stochastic Neighbor Embedding (t-SNE). This is done as it would be difficult to plot 300 dimensions, so they are reduced to just 2 (x, y). t-SNE is commonly used with this type of data and a more detailed explanation can be found [here].
Here I aim to demonstrate why this is important.
To generate the results I ran 374 words through the word2vec model. The 374 words included 137 last names, 98 company names, 98 first names and 41 cities.
When run through the word2vec model the output was an array of shape (374, 300). Using t-SNE the final result was (374, 2), so 374 words each with 2 numerical values. Again, t-SNE reduced the number of dimensions from 300 to just 2 for visualisation purposes (we could also have reduced to 3 dimensions).
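As a rough sketch of that reduction step, assuming scikit-learn's `TSNE` is used (the post does not say which implementation was used), the shapes work out like this. The input here is random placeholder data, not real word vectors, and the `perplexity` and `random_state` values are illustrative choices.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the real (374, 300) array of word vectors: random numbers,
# so the resulting 2-D layout will not show meaningful clusters.
rng = np.random.default_rng(0)
word_vectors = rng.normal(size=(374, 300))

# Reduce 300 dimensions to 2; n_components=3 would give a 3-D embedding instead.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
points_2d = tsne.fit_transform(word_vectors)

print(word_vectors.shape, points_2d.shape)  # (374, 300) (374, 2)
```

t-SNE is stochastic, so fixing `random_state` is only there to make repeated runs comparable.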
Now let’s plot the numerical data.
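A scatter plot of this kind could be produced along the following lines with matplotlib. This is a sketch under assumptions: `plot_embedding` is a hypothetical helper, the handful of points are made up for illustration, and the colours simply follow the scheme used in this post (cities green, last names red, companies orange, first names blue).

```python
import matplotlib.pyplot as plt

# Colour per category, following the scheme used in this post.
COLOURS = {
    "CITY": "green",
    "LAST NAME": "red",
    "COMPANY": "orange",
    "FIRST NAME": "blue",
}


def plot_embedding(points, labels):
    """Scatter 2-D points, one colour per category, and return the axes."""
    fig, ax = plt.subplots(figsize=(8, 6))
    for category, colour in COLOURS.items():
        xs = [p[0] for p, lab in zip(points, labels) if lab == category]
        ys = [p[1] for p, lab in zip(points, labels) if lab == category]
        if not xs:  # skip categories with no points
            continue
        ax.scatter(xs, ys, color=colour, label=category, s=15)
    ax.set_xlabel("t-SNE x")
    ax.set_ylabel("t-SNE y")
    ax.legend()
    return ax


# Made-up coordinates, only to show the call; real input would be the
# (374, 2) t-SNE output plus one category label per word.
points = [(0.0, 5.0), (0.2, 4.8), (4.0, 0.5), (0.5, 0.2), (-4.0, 0.1)]
labels = ["CITY", "CITY", "LAST NAME", "COMPANY", "FIRST NAME"]
ax = plot_embedding(points, labels)
# plt.show()  # uncomment to display the figure interactively
```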
Here it starts to become evident why Word2Vec is a good method for converting text data into a numerical representation. This can be seen when considering the groups of each type. At the top we can see a group of CITIES (green), to the right a dispersed group of LAST NAMES (red) and in the middle an array of COMPANY (orange) names.
More interestingly we can see the FIRST NAMES (blue) on the left. Not only has it grouped the FIRST NAMES, it has also separated them out into male and female names.
A clearer example of this can be seen below.
Why is this important? What is the point?
The main importance is how the words are handled. For example, [‘John’, ‘Jim’, ‘Dave’] are names; in any sentence containing them they can be interchanged without changing how the sentence works. When run through the Word2Vec model this is reflected in each name being given approximately the same values (they would be close to each other). However, if we took [‘John’, ‘Jim’, ‘London’], then [‘John’, ‘Jim’] would be similar whereas [‘London’] would be dissimilar.
Therefore, values generated by the word2vec model for names would be similar, values for cities would be similar, etc.
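One common way to put a number on "similar" is cosine similarity between two vectors. The sketch below uses made-up 3-dimensional vectors purely to illustrate the idea; they are not actual Word2Vec output, and `cosine_similarity` is a small helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors for illustration only (real ones would have 300 values each).
toy = {
    "John": [0.9, 0.8, 0.1],
    "Jim": [0.8, 0.9, 0.2],
    "London": [0.1, 0.2, 0.9],
}

name_vs_name = cosine_similarity(toy["John"], toy["Jim"])
name_vs_city = cosine_similarity(toy["John"], toy["London"])

# With these toy values the two names score much higher than name vs city.
print(name_vs_name > name_vs_city)  # True
```

With real vectors loaded through gensim (again an assumption about tooling), `KeyedVectors.similarity(w1, w2)` computes the same kind of score directly.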