Disclaimer - The views, thoughts, and opinions expressed in the text belong solely to the author, and should not in any way be attributed to the author’s employer, or to the author as a representative, officer or employee of any organisation.
This article is an extract from my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).
Can a machine learn our language? It turns out that yes, it can, perhaps even better than we humans can. In a lifetime, we can at best endeavour to become proficient in a handful of languages; given enough compute power and data, a machine can learn many languages in a matter of days. Not convinced? Well, let’s see if I can persuade you to think otherwise!
On the 14th of August 2013, three researchers from the Google Knowledge team (Tomas Mikolov, Ilya Sutskever and Quoc Le) published an article on Google’s Open Source blog - “Learning the meaning behind words”. This post stirred a lot of excitement, euphoria and bewilderment in the machine learning, artificial intelligence and natural language processing community, because it carried a very simple yet surreal message: Google scientists had successfully designed a computationally efficient neural network (called word2vec) that allows machines to grasp and numerically represent (in a series of numbers called word vectors) the “meaning” behind words and their semantic relationships, simply by reading what people are writing and posting. And just like that, a new era for computational linguistics was born.
The notion of learning the meaning behind words and sentences by skimming through documents is nothing new. For example, Hollywood has over the years entertained Sci-Fi movie buffs by dramatising scenes of aliens travelling to our planet and scanning documents on the internet to learn about the human race. What is new is how successful Google’s word2vec has been at improving and simplifying the existing state-of-the-art solutions to a number of natural language processing (NLP) problems - so successful that it is on the verge of replacing more traditional representations of words in computational linguistics. It is also widely featured as a member of the new wave of machine learning algorithms that have caused a tectonic shift in the technology landscape in recent years.
But what makes Google’s word2vec so powerful as a tool in natural language processing tasks? To answer this question, let’s quickly explore some of the key features of word2vec.
a) Many machine learning algorithms (including deep learning networks) require their inputs to be numbers or vectors (lists of real numbers) of fixed length; they simply won’t work on strings or plain text. So a natural language modelling technique called “word embedding” is typically used to map the words or phrases in a vocabulary to corresponding fixed-length vectors of real numbers. Google’s word2vec is a word embedding technique that not only maps each word in a vocabulary to a unique vector of real numbers, but also encodes in that same vector the meaning of the word and its semantic relationships with the other words in the vocabulary.
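To make the idea of a word embedding concrete, here is a minimal sketch in Python. The vectors below are made-up illustrative numbers, not values learned by word2vec - the point is simply that an embedding is a lookup table that turns each word into a fixed-length vector a model can consume.

```python
# A word embedding is, at its simplest, a lookup table mapping each
# vocabulary word to a fixed-length vector of real numbers.
# NOTE: these 3-dimensional vectors are invented for illustration;
# word2vec learns such vectors (typically 100-300 dimensions) from raw text.
embedding = {
    "apple":  [0.81, 0.10, 0.05],
    "orange": [0.79, 0.12, 0.07],
    "car":    [0.02, 0.90, 0.33],
}

def embed(sentence):
    """Map a whitespace-tokenised sentence to a list of fixed-length vectors."""
    return [embedding[word] for word in sentence.split()]

vectors = embed("apple orange car")
print(vectors)  # each word is now a length-3 vector, usable as model input
```

Notice that similar words (apple, orange) have been given similar vectors while the dissimilar word (car) has not - encoding that similarity automatically, from text alone, is precisely what word2vec does.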
b) For a language model to be able to predict the meaning of a text, it needs to be aware of the contextual similarity of words. For instance, we tend to find fruit words (like apple or orange) in the context of where, how or why they are grown, picked, eaten and juiced; but you wouldn’t expect to find those same concepts in such close proximity to, say, the word automobile. The vectors created by word2vec preserve these similarities, so that words that regularly occur nearby in text will also be in close proximity in the vector space.
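The intuition of contextual similarity can be sketched without a neural network at all, by simply counting which words co-occur with which. This toy example (my own illustration - word2vec itself learns dense vectors with a shallow neural network rather than counting) shows that fruit words end up with similar context profiles while "automobile" does not:

```python
from collections import Counter
import math

# Toy corpus: the fruit words share contexts (grown, eaten, juiced),
# while "automobile" appears among different context words.
corpus = [
    "apple is grown and eaten and juiced".split(),
    "orange is grown and eaten and juiced".split(),
    "automobile is driven and parked and fueled".split(),
]

vocab = sorted({w for sentence in corpus for w in sentence})

def context_vector(word):
    """Count how often each vocabulary word appears in a sentence with `word`."""
    counts = Counter()
    for sentence in corpus:
        if word in sentence:
            counts.update(w for w in sentence if w != word)
    return [counts[v] for v in vocab]

def cosine(u, v):
    """Cosine similarity: 1.0 means identical direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

apple, orange, auto = (context_vector(w) for w in ("apple", "orange", "automobile"))
print(cosine(apple, orange))  # high: the two fruits share their contexts
print(cosine(apple, auto))    # lower: the automobile lives in different contexts
```

word2vec preserves exactly this property, but in compact, learned vectors rather than sparse co-occurrence counts.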
c) An interesting feature of word vectors is that because they are numerical representations of contextual similarities between words, they can be manipulated arithmetically just like any other vector. For example, if you take the word vectors from Google’s pre-trained word2vec model (more on this later) and apply the following vector arithmetic: king minus man plus woman, the resulting vector is closest to the vector representation of the word queen! Another way to look at this is that the distance between the words man and woman in the word2vec vector space is the same as the distance between the words king and queen (an understanding of the male and female relationship).
king - man + woman => queen
Similarly (understanding of country and capital relationship):
Paris - France + Germany => Berlin
What this means is that the word2vec representation of words is able to preserve the syntactic and semantic relationships between words such as gender, verb tense, and country and capital. If you think about it, this is truly remarkable as all of this knowledge simply comes from skimming through lots and lots of text with no other information provided about their semantics.
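The king/queen arithmetic above can be sketched in a few lines of Python. The 2-dimensional vectors here are hand-crafted for illustration (real word2vec vectors have hundreds of dimensions learned from text), but the mechanics - subtract, add, then find the nearest word by cosine similarity - are the same:

```python
import math

# Hand-crafted toy vectors; dimensions loosely represent "royalty" and "gender".
# These are illustrative assumptions, NOT values trained by word2vec.
vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-0.5, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c),
    excluding the three query words themselves (standard practice)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman"))  # -> queen
```

With a real pre-trained model the same query is typically issued via a library routine that performs this positive/negative vector arithmetic over the full vocabulary.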
d) In this age of social media and micro-blogging, internet slang is constantly changing the way we speak and communicate, so much so that many of the phrases spoken by today’s high school and college students are often difficult to understand. Apparently, they have their very own vocabulary that no dictionary recognises. In this context, word2vec is an extremely powerful word discovery tool, as it does not really care how a word is spelled or represented (e.g. as an emoji). As long as there is a sufficient volume of context words or texts to work with, word2vec can easily locate these new words, and uncover their meanings and relationships with other internet slang or mainstream vocabulary. As a side note, millennials are not the only ones to use jargon and slang. In many professional domains (such as the financial markets), technical jargon is frequently used as a shorthand by people in-the-know to make communication easier.
e) Google open sourced its implementation of the word2vec model alongside the academic paper that explains the model’s architecture. Google also published a pre-trained word2vec model containing 300-dimensional vectors for over three million words and phrases, created from a collection of Google News articles comprising approximately 100 billion words. These initiatives significantly accelerated the understanding and adoption of the word2vec concept by practitioners of machine learning, artificial intelligence and natural language processing. They also led to new innovations (such as document to vector, or doc2vec, and sense to vector, or sense2vec) centered around the practical use of word2vec in NLP applications.
If you are interested in learning more about word2vec, and how it can be applied to natural language processing tasks such as automatically grouping documents based on their similarities, labelling documents as belonging to certain pre-defined categories (e.g. finance, politics, technology, or entertainment related), and automatically generating a summary of a document, read chapter six of my book “Practical Data Analysis: Using Open Source Tools & Techniques” (available on Amazon worldwide, iBook Store, and Barnes & Noble).
Dhiraj Bhuyan, 24 July 2018