Hi all, welcome to my blog “Long Short Term Memory and Gated Recurrent Unit’s Explained — ELI5 Way”. This is my last blog of the year 2019. My name is Niranjan Kumar and I’m a Senior Consultant, Data Science at Allstate India.
Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step.
RNNs are mainly used for:
- Sequence Classification — Sentiment Classification & Video Classification
- Sequence Labelling — Part of speech tagging & Named entity recognition
- Sequence Generation — Machine translation & Transliteration
Citation Note: The content and the structure of this article is based on my understanding of the deep learning lectures from One-Fourth Labs — PadhAI.
In a Recurrent Neural Network, the old information gets morphed by the current input at every time step. For longer sentences, we can imagine that after t time steps the information stored at time step t-k (k << t) would have undergone a gradual process of transformation. During back-propagation, the error information has to flow back through this long chain of time steps to update the parameters of the network and minimize its loss.
Consider a scenario where we need to compute the loss of the network at time step four, L₄. Assume that the loss is caused by a wrong computation of the hidden representation S₁ at the first time step, and that the error at S₁ is due to incorrect values in the weight matrix W. This information has to be back-propagated to W so that its parameters can be corrected.
To propagate the information back to W, we need to use the chain rule. In a nutshell, the gradient boils down to a product of the partial derivatives of each hidden representation with respect to the previous one, taken across all the time steps in between.
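For the four-step scenario, the chain rule can be written out roughly as follows. This is a sketch of the usual back-propagation-through-time expression; the symbol ∂⁺ (the direct derivative of s_k with respect to W, treating s_{k-1} as a constant) is notation assumed here, not taken from the original figures.

```latex
\frac{\partial L_4}{\partial W}
  = \sum_{k=1}^{4}
    \frac{\partial L_4}{\partial s_4}
    \left( \prod_{j=k+1}^{4} \frac{\partial s_j}{\partial s_{j-1}} \right)
    \frac{\partial^{+} s_k}{\partial W}
```

The product in the middle is the long chain of per-step derivatives: the further back the time step k, the more factors it contains.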
If we have more than 100 hidden representations for a longer sequence, then back-propagation has to compute the product of more than 100 such partial derivatives. If these partial derivatives are consistently large (magnitude greater than 1), the product grows exponentially and the entire gradient explodes. This is the problem of exploding gradients.
If the partial derivatives are consistently small (magnitude less than 1), the product shrinks towards zero and the gradient vanishes, making the network hard to train. This is the problem of vanishing gradients.
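To get a feel for the numbers, here is a toy sketch that treats every per-step derivative as the same scalar. That constant value is an assumption made purely for illustration; in a real RNN each factor is a matrix that changes from step to step.

```python
import numpy as np

def gradient_factor(per_step_derivative: float, num_steps: int) -> float:
    """Product of `num_steps` identical per-step derivatives (scalar toy model)."""
    return float(np.prod(np.full(num_steps, per_step_derivative)))

vanishing = gradient_factor(0.9, 100)  # every factor slightly below 1
exploding = gradient_factor(1.1, 100)  # every factor slightly above 1

print(f"0.9 ** 100 = {vanishing:.3e}")
print(f"1.1 ** 100 = {exploding:.3e}")
```

Even with factors this close to 1, a chain of 100 steps shrinks the gradient to the order of 1e-5 or blows it up to the order of 1e4.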
White Board Analogy
Consider that you have a whiteboard of fixed size. Over time, the whiteboard becomes so messy that you can’t extract any information from it. In the context of an RNN processing longer sequences, the computed hidden state representation becomes messy in the same way, and it becomes difficult to extract relevant information from it.
Since RNNs have a finite state size, instead of extracting information from all the time steps to compute the hidden state representation, we need to follow a selectively read, write and forget strategy while extracting information from different time steps.
White Board Analogy — RNN Example
Let’s see how the selectively read, write and forget strategy works, taking an example of sentiment analysis using an RNN.
Review: The first half of the movie is dry but the second half really picked up the pace. The lead actor delivered an amazing performance.
The movie review starts with a negative sentiment but then changes to a positive one. In the context of selective read, write and forget:
- We want to forget the information added by stop words (a, the, is etc…).
- Selectively read the information added by sentiment bearing words (amazing, awesome etc…).
- Selectively write hidden state representation information from the current word to the new hidden state.
Using the selective read, write and forget strategy, we control the flow of information so that the network doesn’t suffer from the problem of short-term memory and the finite-sized state vector is used effectively.
Long Short Term Memory — LSTM
LSTMs were introduced to overcome the problems of vanilla RNNs, such as short-term memory and vanishing gradients. In LSTMs we can selectively read, write and forget information by regulating the flow of information using gates.
In the following few sections, we will discuss how to implement the selective read, write and forget strategy. We will also discuss how the network knows which information to read and which information to forget.
In the vanilla RNN, the hidden representation (sₜ) is computed as a function of the previous time step’s hidden representation (sₜ₋₁) and the current input (xₜ), along with a bias (b).
Here, we are taking all the values of sₜ₋₁ and computing the hidden state representation at the current time (sₜ).
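As a minimal sketch of that update, here is one vanilla RNN step in NumPy. The tanh non-linearity, the layer sizes and the random initialisation are assumptions for illustration; W acts on the previous state and U on the current input.

```python
import numpy as np

def rnn_step(s_prev, x_t, W, U, b):
    """One vanilla RNN step: s_t = tanh(W s_{t-1} + U x_t + b)."""
    return np.tanh(W @ s_prev + U @ x_t + b)

rng = np.random.default_rng(0)
state_size, input_size = 4, 3  # hypothetical sizes
W = rng.normal(scale=0.5, size=(state_size, state_size))
U = rng.normal(scale=0.5, size=(state_size, input_size))
b = np.zeros(state_size)

s = np.zeros(state_size)
for x_t in rng.normal(size=(5, input_size)):  # a toy sequence of 5 inputs
    s = rnn_step(s, x_t, W, U, b)             # the whole of s is overwritten at every step
print(s.shape)  # (4,)
```

Every element of the old state is mixed into the new one at every step, which is exactly the "messy whiteboard" behaviour.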
In Selective Write, instead of writing all the information in sₜ₋₁ to compute the hidden representation (sₜ), we pass only some of the information in sₜ₋₁ on to the next state to compute sₜ. One way of doing this is to assign each element a value between 0 and 1 that determines what fraction of the current state information is passed on to the next hidden state.
The way we do selective write is that we multiply every element of sₜ₋₁ by a value between 0 and 1 to compute a new vector hₜ₋₁. These values form a gate vector oₜ₋₁, and we use the new vector hₜ₋₁ to compute the hidden representation sₜ.
How do we compute oₜ₋₁?
We learn oₜ₋₁ from the data, just as we learn the other parameters such as U and W, using parametric learning based on gradient descent optimization. The mathematical equation for oₜ₋₁ is given below:
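One way to write it, assuming the output gate has the same form as the input and forget gates described later in this article (a sigmoid σ over the previous hidden state and the input at that step, plus a bias). The parameter names W_o, U_o and b_o are assumptions.

```latex
o_{t-1} = \sigma\left( W_o h_{t-2} + U_o x_{t-1} + b_o \right)
```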
Once oₜ₋₁ has been learned from the data, it is multiplied element-wise with sₜ₋₁ to get the new vector hₜ₋₁. Since oₜ₋₁ controls what information goes to the next hidden state, it is called the Output Gate.
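A small NumPy sketch of selective write under stated assumptions: the gate is a sigmoid over the previous hidden state and input, the names `W_o`, `U_o`, `b_o` and `selective_write` are hypothetical, and the weights are random rather than learned.

```python
import numpy as np

def sigmoid(z):
    """Squash each element into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def selective_write(s_prev, h_prev2, x_prev, W_o, U_o, b_o):
    """Return the output gate o_{t-1} and the gated vector h_{t-1}."""
    o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)  # one fraction per state element
    h_prev = o_prev * s_prev                               # element-wise multiplication
    return o_prev, h_prev

rng = np.random.default_rng(0)
state_size, input_size = 4, 3  # hypothetical sizes
s_prev = rng.normal(size=state_size)
o_prev, h_prev = selective_write(
    s_prev,
    h_prev2=rng.normal(size=state_size),
    x_prev=rng.normal(size=input_size),
    W_o=rng.normal(size=(state_size, state_size)),
    U_o=rng.normal(size=(state_size, input_size)),
    b_o=np.zeros(state_size),
)
print("gate:", o_prev)
print("written:", h_prev)
```

Each element of `h_prev` is a fraction of the matching element of `s_prev`. Many LSTM formulations first pass the state through tanh before applying the gate; the sketch follows the description here and multiplies the state directly.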
After computing the new vector hₜ₋₁, we compute an intermediate hidden state vector Šₜ (marked in green). In this section, we will discuss how to implement selective read to get our final hidden state sₜ.
The mathematical equation for Šₜ is given below:
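A reconstruction consistent with the description, writing Šₜ as \tilde{s}_t. The tanh non-linearity is an assumption (any squashing function fits the text), and W, U and b are the same kind of parameters as in the vanilla RNN.

```latex
\tilde{s}_t = \tanh\left( W h_{t-1} + U x_t + b \right)
```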
- Šₜ captures all the information from the previous state hₜ₋₁ and the current input xₜ.
- However, we may not want to use all the new information; we want to read from it only selectively before constructing the new state. That is, we would like to read only some of the information from Šₜ to compute sₜ.
Just as with our output gate, here we multiply every element of Šₜ by the corresponding element of a new vector iₜ, which contains values between 0 and 1. Since the vector iₜ controls what information flows in from the current input, it is called the Input Gate.
The mathematical equation for iₜ is given below:
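Written out, with σ as the sigmoid function. The parameter names W_i, U_i and b_i are assumptions.

```latex
i_t = \sigma\left( W_i h_{t-1} + U_i x_t + b_i \right)
```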
In the input gate, we pass the previous time step’s hidden state information hₜ₋₁ and the current input xₜ, along with a bias, into a sigmoid function. The output of the computation will be between 0 and 1, and it decides what information flows in from the current input and the previous time step’s hidden state. 0 means not important and 1 means important.
To recap what we have learned so far: we have the previous hidden state sₜ₋₁, and our goal is to compute the current state sₜ using the selective read, write and forget strategy.
In this section, we will discuss how we will compute the current state vector sₜ by combining sₜ₋₁ and Šₜ.
The Forget Gate fₜ decides what fraction of the information in the sₜ₋₁ hidden vector is retained or discarded.
The mathematical equation for forget gate fₜ is given below:
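Written out, with σ as the sigmoid function. The parameter names W_f, U_f and b_f are assumptions.

```latex
f_t = \sigma\left( W_f h_{t-1} + U_f x_t + b_f \right)
```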
In the forget gate, we pass the previous time step’s hidden state information hₜ₋₁ and the current input xₜ, along with a bias, into a sigmoid function. The output of the computation will be between 0 and 1, and it decides what information is retained or discarded: a value closer to 0 means discard, and a value closer to 1 means retain.
By combining both the forget gate and the input gate, we can compute the current hidden state information.
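In symbols, taking ⊙ to mean element-wise multiplication and \tilde{s}_t to stand for Šₜ (both notational assumptions), the combination can be written as:

```latex
s_t = f_t \odot s_{t-1} + i_t \odot \tilde{s}_t
```

The first term is what we keep from the past, and the second is what we read from the present.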
The final illustration would look like this:
The full set of equations looks like this:
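A reconstruction of the full set from the descriptions in the sections above, not a copy of the original figure. The parameter names, the use of tanh for the intermediate state and the symbol \tilde{s}_t for Šₜ are assumptions; σ is the sigmoid and ⊙ is element-wise multiplication.

```latex
\begin{aligned}
o_t         &= \sigma\left( W_o h_{t-1} + U_o x_t + b_o \right) \\
i_t         &= \sigma\left( W_i h_{t-1} + U_i x_t + b_i \right) \\
f_t         &= \sigma\left( W_f h_{t-1} + U_f x_t + b_f \right) \\
\tilde{s}_t &= \tanh\left( W h_{t-1} + U x_t + b \right) \\
s_t         &= f_t \odot s_{t-1} + i_t \odot \tilde{s}_t \\
h_t         &= o_t \odot s_t
\end{aligned}
```

Many LSTM formulations apply tanh to s_t in the last line before gating it.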
Note: Some versions of the LSTM architecture do not have a forget gate. Instead, they have only an output gate and an input gate to control the flow of information, so they implement only the selective read and selective write strategy.
The variant of LSTM that we discussed above is the most popular variant of LSTM with all three gates controlling the information.
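To make the three-gate variant concrete, here is a minimal NumPy sketch of one time step. It is an illustration under assumptions, not a reference implementation: the parameter names, the tanh on the intermediate state, the sizes and the random (untrained) weights are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(state_size, input_size, rng):
    """Random parameters for the gates (o, i, f) and the intermediate state (no suffix)."""
    params = {}
    for suffix in ("_o", "_i", "_f", ""):
        params["W" + suffix] = rng.normal(scale=0.1, size=(state_size, state_size))
        params["U" + suffix] = rng.normal(scale=0.1, size=(state_size, input_size))
        params["b" + suffix] = np.zeros(state_size)
    return params

def lstm_step(s_prev, h_prev, x_t, p):
    """One time step: returns the new state s_t and the gated output h_t."""
    o_t = sigmoid(p["W_o"] @ h_prev + p["U_o"] @ x_t + p["b_o"])  # output gate: selective write
    i_t = sigmoid(p["W_i"] @ h_prev + p["U_i"] @ x_t + p["b_i"])  # input gate: selective read
    f_t = sigmoid(p["W_f"] @ h_prev + p["U_f"] @ x_t + p["b_f"])  # forget gate: selective forget
    s_tilde = np.tanh(p["W"] @ h_prev + p["U"] @ x_t + p["b"])    # intermediate state
    s_t = f_t * s_prev + i_t * s_tilde                            # keep some past, read some present
    h_t = o_t * s_t                                               # gate the state directly
    return s_t, h_t

rng = np.random.default_rng(0)
state_size, input_size = 4, 3
params = init_params(state_size, input_size, rng)
s, h = np.zeros(state_size), np.zeros(state_size)
for x_t in rng.normal(size=(6, input_size)):  # a toy sequence of 6 inputs
    s, h = lstm_step(s, h, x_t, params)
print(s.shape, h.shape)  # (4,) (4,)
```

Note that two vectors are carried from step to step: the state `s` and its gated version `h`.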
Gated Recurrent Units — GRU’s
In this section, we will briefly discuss the intuition behind the GRU. The Gated Recurrent Unit is another popular variant of the LSTM that uses fewer gates.
In Gated Recurrent Units, just as in the LSTM, we have an output gate oₜ₋₁ controlling what information goes to the next hidden state. Similarly, we also have an input gate iₜ controlling what information flows in from the current input.
The major difference between LSTM and GRU is the way they combine the intermediate hidden state vector Šₜ and the previous hidden state vector sₜ₋₁. In LSTM we had a forget gate to determine what fraction of information to retain from sₜ₋₁.
In GRU, instead of a separate forget gate, we decide how much past information to retain or discard based on the complement of the input gate vector (1 - iₜ).
forget gate = 1 - input gate vector, i.e. fₜ = 1 - iₜ
The full set of equations for the GRU is given below:
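A reconstruction from the description in this section rather than a copy of the original figure. It assumes the gates read the previous state sₜ₋₁ directly (there is no separate h vector), uses tanh for the intermediate state, and writes Šₜ as \tilde{s}_t; the parameter names are also assumptions.

```latex
\begin{aligned}
o_t         &= \sigma\left( W_o s_{t-1} + U_o x_t + b_o \right) \\
i_t         &= \sigma\left( W_i s_{t-1} + U_i x_t + b_i \right) \\
\tilde{s}_t &= \tanh\left( W (o_t \odot s_{t-1}) + U x_t + b \right) \\
s_t         &= (1 - i_t) \odot s_{t-1} + i_t \odot \tilde{s}_t
\end{aligned}
```

In the more common GRU naming, these two gates play the roles of the reset gate and the update gate.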
From the equations, we can see that there are only two gates (input and output) and that we are not explicitly computing the hidden state vector hₜ₋₁. Since GRUs do not maintain this additional state vector, they need fewer computations and are faster to train than LSTMs.
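A minimal NumPy sketch of one GRU step in this two-gate formulation. The parameter names, the tanh on the intermediate state, the sizes and the random (untrained) weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(state_size, input_size, rng):
    """Random parameters for the two gates (o, i) and the intermediate state (no suffix)."""
    params = {}
    for suffix in ("_o", "_i", ""):
        params["W" + suffix] = rng.normal(scale=0.1, size=(state_size, state_size))
        params["U" + suffix] = rng.normal(scale=0.1, size=(state_size, input_size))
        params["b" + suffix] = np.zeros(state_size)
    return params

def gru_step(s_prev, x_t, p):
    """One time step: only the state s is carried forward, there is no separate h."""
    o_t = sigmoid(p["W_o"] @ s_prev + p["U_o"] @ x_t + p["b_o"])        # output gate
    i_t = sigmoid(p["W_i"] @ s_prev + p["U_i"] @ x_t + p["b_i"])        # input gate
    s_tilde = np.tanh(p["W"] @ (o_t * s_prev) + p["U"] @ x_t + p["b"])  # intermediate state
    return (1.0 - i_t) * s_prev + i_t * s_tilde                         # forget gate is 1 - i_t

rng = np.random.default_rng(0)
state_size, input_size = 4, 3
params = init_params(state_size, input_size, rng)
s = np.zeros(state_size)
for x_t in rng.normal(size=(6, input_size)):  # a toy sequence of 6 inputs
    s = gru_step(s, x_t, params)
print(s.shape)  # (4,)
```

Because the retain and read fractions sum to 1 for every element, the new state is a weighted average of the old state and the intermediate state.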
Where to go from here?
If you want to learn more about neural networks using Keras and TensorFlow 2.0 (Python or R), check out the Artificial Neural Networks by Abhishek and Pukhraj from Starttechacademy. They explain the fundamentals of deep learning in a simple manner.
In this article, we discussed the shortfalls of Recurrent Neural Networks when dealing with longer sentences. RNNs suffer from the problem of short-term memory, i.e. they can store only a finite number of states before the information gets morphed. After that, we discussed in detail how the selective read, write and forget strategy works in LSTMs by controlling the flow of information using the gate mechanism. We then looked at a variant of the LSTM called the Gated Recurrent Unit, which has fewer gates and fewer computations than the LSTM model.
In my next post, we will discuss Encoder-Decoder Models in-depth. So make sure you follow me on Medium to get notified as soon as it drops.
Until then, Peace 🙂
Start practising LSTM and GRU using PyTorch:
- Classifying the Name Nationality of a Person using LSTM and Pytorch (www.marktechpost.com)
Learn about Recurrent Neural Networks:
- Recurrent Neural Networks (RNN) Explained — the ELI5 way: Sequence Labeling and Sequence Classification using RNN (towardsdatascience.com)
Niranjan Kumar is Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. Apart from writing on Medium, he also writes for Marktechpost.com as a freelance data science writer. Check out his articles here.
Connect with Me:
- LinkedIn — https://www.linkedin.com/in/niranjankumar-c/
- GitHub — https://github.com/Niranjankumar-c
- Twitter — https://twitter.com/Nkumar_n
- Medium — https://medium.com/@niranjankumarc