Before the Deep Learning Era

Back in the days when a neural network was that scary, hard-to-learn thing, more a mathematical curiosity than a powerful Machine Learning or Artificial Intelligence tool, there were surprisingly many relatively successful applications of classical data mining algorithms in the Natural Language Processing (NLP) domain. It seemed that problems like spam filtering or part-of-speech tagging could be solved using rather simple and understandable models.

But not every problem can be solved this way. Simple models fail to properly capture linguistic subtleties like irony (although humans often fail at that one too), idioms or context. Algorithms based on overall summarization (e.g. bag-of-words) turned out not to be powerful enough to capture the sequential nature of text data, whereas n-grams struggled to model general context and suffered severely from the curse of dimensionality. Even HMM-based models had trouble overcoming these issues because of the Markov assumption: the next state depends only on the current one, so earlier context is forgotten. Of course, these methods were also used to tackle more complex NLP tasks, but without great success.
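
To make these two limitations concrete, here is a minimal sketch in plain Python. The helper names `bag_of_words` and `ngrams`, the example sentences and the vocabulary size are all made up for illustration; they are not taken from any particular library.

```python
from collections import Counter


def bag_of_words(text):
    """Count word occurrences, discarding word order entirely."""
    return Counter(text.lower().split())


def ngrams(text, n):
    """Return the list of n-grams (tuples of n consecutive words)."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


a = "dog bites man"
b = "man bites dog"

# Same words, opposite meaning: bag-of-words cannot tell them apart.
same_bag = bag_of_words(a) == bag_of_words(b)

# Bigrams keep some local order, so the two sentences now differ...
same_bigrams = set(ngrams(a, 2)) == set(ngrams(b, 2))

# ...but the number of possible n-grams grows as V**n for a vocabulary
# of size V (assumed here to be 10,000 words).
vocab_size = 10_000
possible_trigrams = vocab_size ** 3
```

Here `same_bag` is true while `same_bigrams` is false, and a 10,000-word vocabulary already allows a trillion distinct trigrams, most of which will never be seen in any training corpus.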

First breakthrough - Word2Vec

The fundamental improvement that neural networks brought to NLP was a semantically rich representation of words. Before, the most common representation was the so-called one-hot encoding, where each word is transformed into a unique binary vector with only one non-zero entry. This approach suffered severely from sparsity and took no advantage of the meaning of particular words.
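
A small sketch shows why one-hot vectors carry no meaning. The four-word vocabulary and the helpers `one_hot` and `dot` are hypothetical, chosen only to keep the example readable.

```python
def one_hot(word, vocabulary):
    """Return a binary vector with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector


def dot(u, v):
    """Dot product, used here as a simple similarity measure."""
    return sum(x * y for x, y in zip(u, v))


vocabulary = ["cat", "dog", "car", "truck"]  # toy vocabulary, illustrative only

cat = one_hot("cat", vocabulary)
dog = one_hot("dog", vocabulary)
car = one_hot("car", vocabulary)

# Every pair of distinct words has dot product 0, so "cat" is exactly as
# similar to "dog" as it is to "car".
cat_dog = dot(cat, dog)
cat_car = dot(cat, car)
```

Both similarities come out as 0, and the vector length equals the vocabulary size, which is where the sparsity comes from once the vocabulary has tens of thousands of entries.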

Figure 1: Word2Vec representations of words projected onto a 2-dimensional space.

Instead, imagine looking at a few adjacent words, removing the middle one and making a neural network predict the context given the middle word (skip-gram) or the other way around: predict the center word based on the context (Continuous Bag of Words, CBOW). Of course, the prediction task itself is of little practical use, but it turns out that as a side effect the model produces a surprisingly powerful vector representation that preserves the semantic structure of words.
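
The two prediction tasks differ only in which side is the input. The sketch below builds the training examples for both; it covers data preparation only, not the network or its training, and the function names and the `window` parameter are assumptions made for this illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the reverse prediction task."""
    examples = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)
```

With a window of 1, `pairs` contains entries such as `("quick", "the")` and `("quick", "brown")`, while `examples` contains `(["the", "brown"], "quick")`. The word vectors are the weights the network learns while trying to solve either task.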

Further improvements

Even though the powerful new Word2Vec representation boosted the performance of many classical algorithms, there was still a need for a solution able to capture sequential dependencies in text (both long and short term). The first approach to this problem was the so-called vanilla Recurrent Neural Network (RNN). It takes advantage of the temporal nature of the data by feeding words to the network sequentially while using the information about previous words stored in a hidden state.
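
The recurrence can be sketched in a few lines of plain Python. This is a minimal forward pass only, assuming the common formulation h_t = tanh(W_xh x_t + W_hh h_(t-1) + b); the weights and the two-dimensional "word vectors" are hand-picked toy values, not trained ones.

```python
import math


def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h_new = []
    for i in range(len(b)):
        total = b[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, W_xh, W_hh, b):
    """Feed the inputs one at a time, carrying the hidden state forward."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states


# Toy, hand-picked weights: 2-dimensional word vectors, 2 hidden units.
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.0]

sentence_a = [[1.0, 0.0], [0.0, 1.0]]  # "word 1" then "word 2"
sentence_b = [[0.0, 1.0], [1.0, 0.0]]  # same words, reversed order

final_a = run_rnn(sentence_a, W_xh, W_hh, b)[-1]
final_b = run_rnn(sentence_b, W_xh, W_hh, b)[-1]
```

The two final hidden states differ even though both sentences contain the same words, which is exactly the order sensitivity that bag-of-words lacks.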

Figure 2: A recurrent neural network. Image courtesy of Colah's excellent post on LSTMs.

It turned out that these networks handled local dependencies very well, but were difficult to train due to the vanishing gradient problem. To address this issue, a new network topology called LSTM (Long Short-Term Memory) was invented. It handles the problem by introducing special units in the network called memory cells. This sophisticated mechanism allows finding longer patterns without a significant increase in the number of parameters.
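
The vanishing gradient can be seen even in a one-unit network. The sketch below assumes a scalar RNN h_t = tanh(w * h_(t-1) + x_t) and uses the chain rule to compute how much the final state depends on the initial one; the weight and inputs are illustrative values, and the function name is made up for this example.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t).

    By the chain rule this is the product over t of w * (1 - h_t**2).
    """
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)
    return grad


w = 0.9  # illustrative recurrent weight, chosen by hand
short = gradient_through_time(w, [0.5] * 5)
long = gradient_through_time(w, [0.5] * 50)
```

Every factor in the product is smaller than 1 here, so the gradient after 50 steps is many orders of magnitude smaller than after 5. An early word therefore has almost no influence on the weight update, which is why long-range patterns are hard to learn with a vanilla RNN.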

Many popular architectures are variations of LSTM, such as mLSTM or GRU. The latter, thanks to an intelligent simplification of the memory cell update mechanism, significantly decreases the number of parameters needed.

After the astounding success of Convolutional Neural Networks in Computer Vision, it was only a matter of time before their use would be extended to NLP. Today 1D convolutions are popular building blocks of many successful applications, including semantic segmentation, fast machine translation and a general sequence-to-sequence learning framework that beats recurrent networks and can be trained an order of magnitude faster thanks to easier parallelization.
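
A 1D convolution slides a small filter along the sequence. The sketch below is a bare-bones version over a list of numbers, standing in for a single embedding dimension; the `conv1d` helper and the filter values are assumptions for illustration, and real models apply many learned filters over full embedding vectors.

```python
def conv1d(sequence, kernel):
    """Valid 1D convolution (computed as cross-correlation, without flipping the kernel)."""
    k = len(kernel)
    return [
        sum(kernel[j] * sequence[i + j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 5.0]  # stand-in for one embedding dimension
kernel = [1.0, 0.0, -1.0]           # hand-picked filter of width 3

features = conv1d(signal, kernel)
```

Each output value depends only on its own window of three inputs, not on the previous output. Unlike the step-by-step recurrence of an RNN, all positions can be computed at the same time, which is where the training speed-up comes from.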

👀 Convolutional Neural Networks were first used to solve Computer Vision problems, before their adoption in NLP, and remain state-of-the-art in that space. You may want to learn about their applications and capabilities.

Now let's go through common NLP problems

There are various tasks concerning the interaction between computers and human languages that are trivial to a human observer but cause a lot of trouble for a computer. This is mostly due to linguistic nuances such as irony or idioms. Areas of NLP that researchers try to tackle include (roughly in order of their complexity):

The most common and possibly the easiest one is Sentiment Analysis. It boils down to determining the attitude or emotional reaction of a speaker/writer towards a specific topic or in general. Possible sentiments are positive, neutral and negative. Here you can find a great article about using Deep Convolutional Neural Networks to learn sentiment from tweets. Another interesting experiment showed that a Deep Recurrent Net can learn sentiment by accident.

Figure 2: Activation of a neuron from a net used to generate the next character of text. It clearly learned the sentiment even though it was trained in a completely unsupervised setting.

A natural generalization of the previous case is Document Classification, where instead of assigning one of three possible labels to each article, we solve an ordinary classification problem with an arbitrary set of classes. According to a very thorough comparison of algorithms, it is safe to say that Deep Learning is the way to go when it comes to text classification.

Now, we move on to the real thing: Machine Translation has posed a serious challenge for quite some time. It is important to understand that this is a completely different task from the two previous ones. Now we require a model to predict a sequence of words instead of a label. Here we can see what all the fuss about Deep Learning is, as it has been an unbelievable breakthrough when it comes to sequential data. In this blog post you can read more about how (yep, you guessed it) Recurrent Neural Networks tackle translation and here about how they achieve state-of-the-art results.

Still here? Let's crank it up a notch. Say you need an automatic Text Summarization model, which basically needs to extract only the most important parts of a text while preserving all of the meaning. This requires an algorithm to understand the full text while being able to focus on the specific parts that carry most of the meaning. This is solved very neatly by Attention Mechanisms, which can be introduced as a module inside an end-to-end solution.

Lastly, there is Question Answering, which comes as close to Artificial Intelligence as you can get. Not only does the model need to understand a question, it also needs a full understanding of the text of interest and must know exactly where to look to produce an answer. For a detailed explanation of a solution (of course using Deep Learning) check this article.

Figure 3: Beautiful visualization of an attention mechanism in a recurrent neural network trained to translate English to French.
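
The core of an attention mechanism fits in a few lines. The sketch below uses simple dot-product scoring, which is one common variant and not necessarily the one used in the models referenced here; the vectors are toy values and the helper names `softmax` and `attend` are chosen for this illustration.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Dot-product attention: weight each value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Three toy "word" vectors; here the keys double as the values.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 5.0]

weights, context = attend(query, states, states)
```

The first word gets almost no weight because it does not match the query, while the other two share the rest equally. The `context` vector is the weighted average that the model passes on, so it effectively "looks" only at the relevant parts of the input.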

Since Deep Learning offers vector representations for various kinds of data, e.g. text and images, you can build models that take advantage of different domains. This is how researchers came up with Visual Question Answering. The problem is "trivial": all you need to do is answer a question about an image. Sounds like a job for a 7-year-old, right? Nonetheless, deep models are the first to produce any reasonable results without human supervision. Results and a description of such a model are in this paper.

🍔 🍳 🍟 Starving for applications? Get your hands dirty and implement your NLP chatbot using LSTMs.

Recap

So, now you know. Deep Learning appeared in NLP relatively recently due to computational issues and the fact that we needed to learn much more about Deep Neural Networks to understand their capabilities. But once it did, it changed the game forever.