Advanced introduction to deep learning for NLP: how RNNs and Transformers work
Just a few years ago, RNNs and their gated variants (which add multiplicative interactions and better gradient-flow mechanisms) were the most popular architectures for NLP.
Famous researchers like Andrej Karpathy were writing about the unreasonable effectiveness of RNNs, and large companies were keen to adopt such models and build them into virtual assistants and other NLP applications.
Now that Transformers (BERT, GPT-2) have appeared, the community rarely even mentions recurrent networks.
In this article, we will give you a high-level introduction to deep learning for NLP; we will briefly explain how RNNs and Transformers work, and which specific properties of the latter make it the better architecture for a wide range of NLP tasks.
Here we go!
Let's start with RNNs and why, until recently, they were considered special.
Recurrent Neural Networks are a family of neural architectures with a cool property: a recurrence mechanism that makes them a natural choice for processing variable-length sequential data. Unlike a standard feedforward network, an RNN can retain information from previous steps while it receives new input.
This is how it works
Suppose we are building an e-commerce chatbot that consists of an RNN, which processes the text, and a feedforward network, which predicts the intention behind it. The bot receives this message: "Hi! Do you have this shirt in different colors?"
As our input, we have 11 tokens (11 word embeddings), and the sequence, split into tokens, looks like this: I1, I2, ..., I11.
The core idea behind the RNN is that it applies the same weight matrix to every single input and also produces a sequence of hidden states (there will be as many of them as we have inputs), each of which carries information from the previous time steps.
Each hidden state (Ht) is calculated from the previous hidden state (Ht-1) and the current input (It); as we mentioned, the hidden states are essentially one state that keeps being modified at every time step.
So the processing starts with the first word embedding (I1) entering the model together with the initial hidden state (H0); inside the first unit of the RNN, a linear transformation is applied to I1 and H0, a bias is added, and the resulting value is passed through some nonlinearity (sigmoid, ReLU, etc.) - this is how we get H1.
After that, the model takes the pair I2 and H1 and performs the same computation, then I3 and H2 come in, then I4 and H3, and so on, until we have worked through the whole sequence.
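To make the loop above concrete, here is a minimal NumPy sketch of a vanilla RNN processing our 11 embeddings. All weight names and dimensions are illustrative assumptions, not the exact setup of any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 8, 16, 11        # 11 tokens in our example message
inputs = rng.normal(size=(seq_len, embed_dim))    # I1 ... I11 (word embeddings)

W_xh = rng.normal(size=(hidden_dim, embed_dim)) * 0.1   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights (reused at every step)
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)   # H0, the initial hidden state
hidden_states = []
for x_t in inputs:
    # Linear transform of the current input and the previous hidden state,
    # plus a bias, passed through a nonlinearity (tanh here).
    h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    hidden_states.append(h)

# hidden_states now holds H1 ... H11; the last one can feed the intent classifier.
```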
Since we reuse the same weight matrix over and over again, an RNN can process long sequences without growing in size. Another advantage is that, in theory, every time step has access to data from many steps back.
The problems
The very thing that makes the RNN unique - it reuses the same layer over and over - also makes it extremely vulnerable to vanishing and exploding gradients. In practice, it is hard for these networks to preserve information across many steps.
In addition, an RNN does not see any hierarchical structure in a sequence. The model modifies its hidden state every time a new input is processed, even when that input is irrelevant. As a result, data from the earlier steps can be completely washed out by the time the network reaches the end of the sequence.
This means that for our example "Hi! Do you have this shirt in different colors?" the feedforward net might end up predicting the intent based on little more than "different colors?", which is not easy even for humans.
Another inherent shortcoming lies in the sequential nature of the processing: since the inputs are handled one at a time (we cannot compute H2 until we have H1), computation in these networks is generally quite slow.
Gated variants
To address the problems discussed above, different architectural modifications of the RNN have been proposed; the most popular are the long short-term memory (LSTM) and the gated recurrent unit (GRU).
Roughly speaking, the main idea behind the LSTM is that, in addition to the hidden state, each unit has a cell state - a kind of memory cell (both are vectors of the same size).
In addition, these models have three gates (the forget gate, the input gate, and the output gate) that determine what information is written to, read from, or erased from the cell state.
All gates are vectors of the same length as the hidden state, and here is what each of them is for:
The forget gate determines what should be kept and what should be erased from the previous time step.
The input gate determines which new information should be written into the cell state.
The output gate determines which data from the cell state should be merged into the hidden state.
They are all computed with the sigmoid function, so they always output values between 0 and 1.
If a gate produces something close to 1, it is considered open (the corresponding data is let through to or from the cell state), and if it gives a value close to 0, the information is ignored.
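Below is a rough NumPy sketch of a single LSTM step, just to show how the three gates and the cell state interact. The weight names, the concatenated-input formulation, and the dimensions are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step: a sketch of the three gates described above.
    `params` is a dict of weight matrices and biases (hypothetical names)."""
    z = np.concatenate([h_prev, x_t])                       # work on [H_{t-1}; I_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])          # forget gate: what to erase
    i = sigmoid(params["W_i"] @ z + params["b_i"])          # input gate: what to write
    o = sigmoid(params["W_o"] @ z + params["b_o"])          # output gate: what to expose
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])    # candidate cell content
    c = f * c_prev + i * c_tilde                            # update the cell state
    h = o * np.tanh(c)                                      # new hidden state
    return h, c

# Tiny usage example with random weights (illustrative sizes only).
hidden_dim, embed_dim = 16, 8
rng = np.random.default_rng(1)
params = {k: rng.normal(size=(hidden_dim, hidden_dim + embed_dim)) * 0.1
          for k in ("W_f", "W_i", "W_o", "W_c")}
params.update({b: np.zeros(hidden_dim) for b in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step(rng.normal(size=embed_dim), np.zeros(hidden_dim), np.zeros(hidden_dim), params)
```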
GRUs operate similarly to LSTMs but are architecturally simpler: they drop the cell state and compute two gates instead of three before producing the hidden state.
The main point of the GRU is to keep the power and robustness of the LSTM (in terms of mitigating vanishing gradients) while getting rid of its complexity. The gates of the GRU are:
The update gate determines which parts of the hidden state should be modified and which parts should be kept. To some extent, it plays the roles of both the input gate and the forget gate in the LSTM.
The reset gate determines which parts of the previous hidden state are relevant right now. When the gating outputs values close to 1, the network can essentially copy the previous state forward instead of rewriting it (no rewriting means less room for vanishing gradients).
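Here is a minimal NumPy sketch of one GRU step under a common formulation (the convention where an update-gate value near 1 keeps the previous state); parameter names and sizes are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU time step: two gates, no separate cell state.
    `p` holds hypothetical weight matrices acting on [H_{t-1}; I_t]."""
    z_in = np.concatenate([h_prev, x_t])
    update = sigmoid(p["W_z"] @ z_in + p["b_z"])   # update gate: keep vs. rewrite
    reset = sigmoid(p["W_r"] @ z_in + p["b_r"])    # reset gate: how much of the past to use
    candidate = np.tanh(p["W_h"] @ np.concatenate([reset * h_prev, x_t]) + p["b_h"])
    # When `update` is close to 1, the previous state is largely copied forward.
    return update * h_prev + (1.0 - update) * candidate

# Tiny usage example with random weights (illustrative sizes only).
hidden_dim, embed_dim = 16, 8
rng = np.random.default_rng(2)
p = {k: rng.normal(size=(hidden_dim, hidden_dim + embed_dim)) * 0.1 for k in ("W_z", "W_r", "W_h")}
p.update({b: np.zeros(hidden_dim) for b in ("b_z", "b_r", "b_h")})
h_new = gru_step(rng.normal(size=embed_dim), np.zeros(hidden_dim), p)
```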
Both LSTMs and GRUs can control the flow of information, capture long-range dependencies, and let error signals flow back with different strengths depending on the input.
Sequence to sequence (seq2seq) model and attention mechanism
The seq2seq model used to be extremely popular in neural machine translation (NMT); it consists of two RNNs (an encoder and a decoder) stacked together.
The encoder processes the input sequentially and compresses the information it gathers at each time step into a "thought vector". That output is then passed to the decoder, which uses the context to predict an appropriate target sequence (a translation, a chatbot reply, etc.).
However, the problem with vanilla seq2seq is that it tries to squeeze the entire input context into a single fixed-size vector, and there is a limit to how much data such a vector can carry.
This is where attention mechanisms come in handy. They allow the decoder network to focus on the relevant parts of the input while producing the output. They achieve this by giving the decoder, at every decoding step, additional inputs derived from the encoder's hidden states.
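As a sketch of the idea, here is one decoding step of simple dot-product attention in NumPy: score each encoder hidden state against the current decoder state, normalize the scores, and take a weighted sum as the context vector. The shapes and variable names are illustrative assumptions.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attention_context(decoder_state, encoder_states):
    scores = encoder_states @ decoder_state     # one relevance score per input position
    weights = softmax(scores)                   # how much to focus on each position
    return weights @ encoder_states, weights    # context vector + attention weights

rng = np.random.default_rng(3)
enc = rng.normal(size=(11, 16))   # encoder hidden states for our 11 tokens
dec = rng.normal(size=16)         # current decoder hidden state
context, weights = attention_context(dec, enc)
# `context` is fed to the decoder as an extra input at this decoding step.
```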
The fall of RNNs and the rise of Transformers
Yes, we can use LSTMs to give RNNs a more reliable short-term memory, and even use attention to reach back over longer contexts. But we still cannot completely eliminate the effects of vanishing gradients, we cannot make these models fast (their design inhibits parallel computation), and we cannot get them to explicitly model the long-range dependencies and hierarchical structure in a sequence.
The Transformer is a model introduced by Google researchers in 2017 that overcomes all of these shortcomings of the RNN. This revolutionary architecture allows us to drop recurrent computation entirely and, relying solely on attention mechanisms, achieve state-of-the-art results on a variety of NLP tasks (NMT, question answering, etc.).
The Transformer also has an encoder and a decoder. In fact, it has a stack of encoders on one side and a stack of decoders (with the same number of units) on the other.
Encoder
Each encoder unit consists of a self-attention layer and a feedforward layer.
Self-attention is a mechanism that lets a unit compare an input with all the other inputs in the sequence and include information about those relationships in its embedding. If we are talking about a word, self-attention can indicate which other words in the sentence are strongly related to it.
In the Transformer, every position can interact with every other position in the input at the same time, which makes the network's computation trivial to parallelize.
The self-attention layer is further enhanced by the multi-head attention mechanism, which improves the model's ability to focus on different positions and lets it build several representation subspaces (by applying different weight matrices to the same input).
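Here is a sketch of scaled dot-product self-attention for a single head, following the standard Transformer formulation; the weight names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v          # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # every position attends to every other
    return softmax(scores, axis=-1) @ V          # relation-weighted mix of the values

rng = np.random.default_rng(4)
d_model, d_head, seq_len = 16, 8, 11
X = rng.normal(size=(seq_len, d_model))          # embeddings for the 11 tokens
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)           # one attention head's output
# Multi-head attention repeats this with different W_q/W_k/W_v matrices and
# concatenates the results before a final linear projection.
```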
To establish the order of the inputs, the Transformer adds another vector to each embedding (this is called positional encoding), which helps the model identify the position of each input in the sequence and the distances between them.
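The original Transformer paper uses sinusoidal positional encodings; a short sketch of that scheme is below (the function name is ours, and the 10000 constant follows the paper's formulation).

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sines and cosines at different frequencies, one vector per position."""
    positions = np.arange(seq_len)[:, None]            # 0, 1, ..., seq_len - 1
    dims = np.arange(0, d_model, 2)[None, :]           # even embedding indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions: cosine
    return pe

# The encoding is simply added to the token embeddings, e.g.:
# embeddings_with_position = token_embeddings + positional_encoding(11, 16)
```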
Each encoder pushes its output to the unit directly above it.
Decoder
On the decoder side, each unit also has a self-attention layer and a feedforward layer, with an additional element - the encoder-decoder attention layer - in between. The decoder stack of the Transformer receives the output of the top encoder (a set of attention vectors) and uses it to focus on the relevant parts of the input when predicting the target sequence.
In general, Transformers are easier to train than RNNs, are well suited to parallelization, and can learn long-range dependencies.
Concluding note
The transformer architecture has become the basis for many breakthrough models.
Building on the "Attention Is All You Need" paper, Google researchers developed BERT, a powerful language representation model that can be easily adapted to various NLP tasks (by adding a fine-tuned output layer), while scientists at OpenAI created GPT-2, a language model so incredibly coherent that, according to them, it was initially considered too dangerous to release.
Multi-head attention is currently being tried in various research fields. Soon, we may see it change multiple industries in profound ways. That will be exciting to watch.