Hi all, welcome to my blog “*Long Short Term Memory and Gated Recurrent Units Explained — ELI5 Way*”. This is my last blog of the year 2019. My name is Niranjan Kumar and I’m a Senior Consultant, Data Science at Allstate India.

**Recurrent Neural Networks(RNN)** are a type of Neural Network where the output from the previous step is fed as input to the current step.

RNNs are mainly used for:

- **Sequence Classification** — Sentiment Classification & Video Classification
- **Sequence Labelling** — Part of speech tagging & Named entity recognition
- **Sequence Generation** — Machine translation & Transliteration

Citation Note: The content and structure of this article are based on my understanding of the deep learning lectures from One-Fourth Labs — PadhAI.

In Recurrent Neural Networks, at each time step the old information gets morphed by the current input. For longer sentences, we can imagine that after ‘t’ time steps the information stored at time step ‘t-k’ (k << t) would have undergone a gradual process of transformation. During back-propagation, the information has to flow through this long chain of timesteps to update the parameters of the network and minimize its loss.

Consider a scenario where we need to compute the loss of the network at time step four, **L₄**. Assume that the loss occurred due to the wrong computation of the hidden representation at time step **S₁**. The error at **S₁** is due to incorrect parameters in the vector **W**. This information has to be back-propagated to **W** so that the vector can correct its parameters.

To propagate the information back to the vector W, we need to use the concept of the chain rule. In a nutshell, the chain rule boils down to the product of all the partial derivatives of hidden representations at the specific timesteps.

If we have more than 100 hidden representations for longer sequences, then we have to compute the product of these partial derivatives during back-propagation. If one of the partial derivatives takes a large value, the entire gradient explodes, causing the problem of **exploding gradients**.

If one of the partial derivatives is a small value, the entire gradient becomes too small or vanishes, making the network hard to train. This is the problem of **vanishing gradients**.
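A toy calculation makes this concrete. The numbers below are illustrative (a hypothetical chain of 100 identical per-timestep derivative factors), not from any real network:

```python
# Product of many per-timestep partial derivatives, as in the chain rule
# for back-propagation through time. A factor slightly above 1 makes the
# product explode; slightly below 1 makes it vanish.
def gradient_product(factor, steps=100):
    return factor ** steps

exploding = gradient_product(1.1)   # each step multiplies by 1.1
vanishing = gradient_product(0.9)   # each step multiplies by 0.9

print(exploding)   # a huge number, roughly 1.4e4
print(vanishing)   # a tiny number, roughly 2.7e-5
```

Even a modest deviation from 1 per timestep compounds over 100 steps into a gradient that is either useless or destabilizing.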

# White Board Analogy

Consider that you have a whiteboard of fixed size; over time, the whiteboard becomes so messy that you can’t extract any information from it. In the context of RNNs, for longer sequences the computed hidden state representation becomes messy, and it will be difficult to extract relevant information from it.

Since RNNs have a finite state size, instead of extracting information from all the timesteps to compute the hidden state representation, we need to follow a selective read, write and forget strategy while extracting information from different timesteps.

# White Board Analogy — RNN Example

Let’s see how the selective read, write and forget strategy works, taking sentiment analysis using an RNN as an example.

**Review**: *The first half of the movie is dry but the second half really picked up the pace. The lead actor delivered an amazing performance.*

The movie review started with a negative sentiment but from thereon it changed to a positive response. In the context of selective read, write and forget:

- We want to forget the information added by stop words (a, the, is etc…).
- Selectively read the information added by sentiment bearing words (amazing, awesome etc…).
- Selectively write hidden state representation information from the current word to the new hidden state.

Using the selective read, write and forget strategy, we control the flow of information so that the network doesn’t suffer from the problem of short term memory, and we ensure that the finite-sized state vector is used effectively.

# Long Short Term Memory — LSTM

LSTMs were introduced to overcome problems in the vanilla RNN, such as short term memory and vanishing gradients. In LSTMs, we can selectively read, write and forget information by regulating the flow of information using gates.

*In the following few sections, we will discuss how we can implement the selective read, write and forget strategy. We will also discuss how we know which information to read and which information to forget.*

# Selective Write

In the vanilla RNN, the hidden representation (**sₜ**) is computed as a function of the previous time step’s hidden representation (**sₜ₋₁**) and the current input (**xₜ**), along with a bias (**b**).

Here, we are taking all the values of **sₜ₋₁** and computing the hidden state representation at the current time (**s**ₜ).
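As a minimal NumPy sketch of this vanilla RNN update (the weight names W, U and the tanh nonlinearity are assumptions matching the common formulation sₜ = tanh(W·sₜ₋₁ + U·xₜ + b); the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
W = rng.normal(size=(4, 4))     # weights on the previous hidden state
U = rng.normal(size=(4, 3))     # weights on the current input
b = np.zeros(4)                 # bias

s_prev = np.zeros(4)            # s_{t-1}: previous hidden representation
x_t = rng.normal(size=3)        # x_t: current input

# All of s_{t-1} flows into the new state -- no selectivity yet.
s_t = np.tanh(W @ s_prev + U @ x_t + b)
```

Note that every element of `s_prev` participates in the update; the gates introduced next are what make this flow selective.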

In Selective Write, instead of writing all the information in **sₜ₋₁** to compute the hidden representation (**sₜ**), we pass only some of the information in **sₜ₋₁** to the next state. One way of doing this is to assign each element a value between 0 and 1 that determines what fraction of the current state information is passed on to the next hidden state.

The way we do selective write is that we multiply every element of **sₜ₋₁** by a value between 0 and 1 to compute a new vector **hₜ₋₁**. We will use this new vector to compute the hidden representation **sₜ**.

How do we compute **oₜ₋₁**?

We will learn **oₜ₋₁** from the data, just like we learn other parameters such as **U** and **W**, using gradient descent optimization. The mathematical equation for **oₜ₋₁** is given below:

Once we learn **oₜ₋₁** from the data, it is multiplied with **sₜ₋₁** to get a new vector **hₜ₋₁**. Since **oₜ₋₁** controls what information goes to the next hidden state, it is called the **Output Gate**.
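A rough NumPy sketch of the output gate (the weight names Wₒ, Uₒ, bₒ and the sizes are hypothetical; the tanh squashing of the state follows the standard LSTM formulation):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into (0, 1) -- a per-element "fraction".
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W_o = rng.normal(size=(4, 4))   # output-gate weights on the hidden vector
U_o = rng.normal(size=(4, 3))   # output-gate weights on the input
b_o = np.zeros(4)

h_prev2 = rng.normal(size=4)    # h_{t-2}
x_prev = rng.normal(size=3)     # x_{t-1}
s_prev = rng.normal(size=4)     # s_{t-1}

# Output gate: one value in (0, 1) per element of the state.
o_prev = sigmoid(W_o @ h_prev2 + U_o @ x_prev + b_o)

# Selective write: pass only a fraction of s_{t-1} forward as h_{t-1}.
h_prev = o_prev * np.tanh(s_prev)
```

The element-wise multiplication is what implements "write only a fraction": a gate value near 0 blocks that element of the state, near 1 passes it through.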

# Selective Read

After computing the new vector **hₜ₋₁**, we will compute an intermediate hidden state vector **Šₜ** (marked in green). In this section, we will discuss how to implement selective read to get our final hidden state **sₜ**.

The mathematical equation for **Šₜ** is given below:

- **Šₜ** captures all the information from the previous state **hₜ₋₁** and the current input **xₜ**.
- However, we may not want to use all of this new information, and instead selectively read from it before constructing the new cell state, i.e. we would like to read only some information from **Šₜ** to compute **sₜ**.

Just like with the output gate, here we multiply every element of **Šₜ** by a new vector **iₜ**, which contains values between 0 and 1. Since the vector **iₜ** controls what information flows in from the current input, it is called the **Input Gate**.

The mathematical equation for **iₜ** is given below:

In the input gate, we pass the previous time step’s hidden state **hₜ₋₁** and the current input **xₜ**, along with a bias, into a sigmoid function. The output of the computation will be between 0 and 1, and it decides what information flows in from the current input and the previous time step’s hidden state: 0 means not important and 1 means important.
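The candidate state and input gate can be sketched in NumPy like this (weight names and sizes are hypothetical; the tanh candidate follows the common LSTM formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
state, inp = 4, 3                       # hypothetical sizes

# Candidate-state parameters and input-gate parameters.
W, U, b = rng.normal(size=(state, state)), rng.normal(size=(state, inp)), np.zeros(state)
W_i, U_i, b_i = rng.normal(size=(state, state)), rng.normal(size=(state, inp)), np.zeros(state)

h_prev = rng.normal(size=state)         # h_{t-1}
x_t = rng.normal(size=inp)              # current input x_t

# Intermediate state: everything the current input could contribute.
s_tilde = np.tanh(W @ h_prev + U @ x_t + b)

# Input gate: what fraction of the candidate do we actually read?
i_t = sigmoid(W_i @ h_prev + U_i @ x_t + b_i)

selectively_read = i_t * s_tilde        # element-wise gating of the candidate
```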

A quick recap of what we have learned so far: we have the previous hidden state **sₜ₋₁**, and our goal is to compute the current state **sₜ** using the selective read, write and forget strategy.

# Selective Forget

In this section, we will discuss how to compute the current state vector **sₜ** by combining **sₜ₋₁** and **Šₜ**.

The **Forget Gate fₜ** decides what fraction of information is retained or discarded from the hidden vector **sₜ₋₁**.

The mathematical equation for the forget gate **fₜ** is given below:

In the forget gate, we pass the previous time step’s hidden state **hₜ₋₁** and the current input **xₜ**, along with a bias, into a sigmoid function. The output of the computation will be between 0 and 1, and it decides what information is retained or discarded: a value closer to 0 means discard, and a value closer to 1 means retain.

By combining both the forget gate and the input gate, we can compute the current hidden state information.

The final illustration would look like this:

The full set of equations looks like this:
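Putting the three gates together, one full LSTM step could be sketched as below. This is a minimal NumPy sketch, assuming the standard formulation sₜ = fₜ ⊙ sₜ₋₁ + iₜ ⊙ Šₜ and hₜ = oₜ ⊙ tanh(sₜ); the parameter layout and sizes are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, params):
    """One LSTM time step with input, forget and output gates."""
    Wi, Ui, bi = params["i"]   # input gate parameters
    Wf, Uf, bf = params["f"]   # forget gate parameters
    Wo, Uo, bo = params["o"]   # output gate parameters
    W, U, b = params["c"]      # candidate-state parameters

    i_t = sigmoid(Wi @ h_prev + Ui @ x_t + bi)
    f_t = sigmoid(Wf @ h_prev + Uf @ x_t + bf)
    o_t = sigmoid(Wo @ h_prev + Uo @ x_t + bo)
    s_tilde = np.tanh(W @ h_prev + U @ x_t + b)

    s_t = f_t * s_prev + i_t * s_tilde   # selective forget + selective read
    h_t = o_t * np.tanh(s_t)             # selective write
    return h_t, s_t

# Tiny usage example with hypothetical sizes (state 4, input 3).
rng = np.random.default_rng(42)
mk = lambda: (rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
params = {k: mk() for k in "ifoc"}
h, s = np.zeros(4), np.zeros(4)
for _ in range(5):                       # run a short input sequence
    h, s = lstm_step(rng.normal(size=3), h, s, params)
```

Notice that the state `s_t` is updated additively, which is what gives gradients a more direct path back through time than the repeated matrix multiplications of the vanilla RNN.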

**Note:** *Some versions of the LSTM architecture do not have a forget gate; instead, they have only an output gate and an input gate to control the flow of information. They implement only the selective read and selective write strategies.*

The variant of LSTM that we discussed above is the most popular variant of LSTM with all three gates controlling the information.

# Gated Recurrent Units — GRU’s

In this section, we will briefly discuss the intuition behind the GRU. The Gated Recurrent Unit is another popular variant of the LSTM that uses fewer gates.

In Gated Recurrent Units, just like in the LSTM, we have an output gate **oₜ₋₁** controlling what information goes to the next hidden state. Similarly, we also have an input gate **iₜ** controlling what information flows in from the current input.

The major difference between the LSTM and the GRU is the way they combine the intermediate hidden state vector **Šₜ** and the previous hidden state vector **sₜ₋₁**. In the LSTM, we had a forget gate to determine what fraction of information was retained from **sₜ₋₁**.

In the GRU, instead of a forget gate, we decide how much past information to retain or discard based on the complement of the input gate vector (1 − **iₜ**).

forget gate = 1 − input gate vector, i.e. (1 − **iₜ**)

The full set of equations for the GRU is given below:

From the equations, we can notice that there are only two gates (input and output), and we are not explicitly computing the hidden state vector **hₜ₋₁**. Since GRUs do not maintain this additional state vector, they require fewer computations and are faster to train than the LSTM.
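A GRU step in the same NumPy sketch style (parameter names and sizes are hypothetical, and the gate naming follows this article rather than the standard update/reset terminology):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, s_prev, params):
    """One GRU time step: the 'forget' fraction is the complement of the input gate."""
    Wo, Uo, bo = params["o"]   # output gate parameters
    Wi, Ui, bi = params["i"]   # input gate parameters
    W, U, b = params["c"]      # candidate-state parameters

    o_t = sigmoid(Wo @ s_prev + Uo @ x_t + bo)
    i_t = sigmoid(Wi @ s_prev + Ui @ x_t + bi)

    # Candidate state, computed from the gated previous state and the input.
    s_tilde = np.tanh(W @ (o_t * s_prev) + U @ x_t + b)

    # Retain (1 - i_t) of the past and read i_t of the candidate:
    # no separate forget gate, and no extra h vector to maintain.
    return (1 - i_t) * s_prev + i_t * s_tilde

# Tiny usage example with hypothetical sizes (state 4, input 3).
rng = np.random.default_rng(3)
mk = lambda: (rng.normal(size=(4, 4)), rng.normal(size=(4, 3)), np.zeros(4))
params = {k: mk() for k in "oic"}
s = np.zeros(4)
for _ in range(5):
    s = gru_step(rng.normal(size=3), s, params)
```

Because the new state is a convex combination of the old state and the candidate, a single gate does the work of both the input and forget gates of the LSTM.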

# Where to go from here?

If you want to learn more about Neural Networks using Keras & TensorFlow 2.0 (Python or R), check out the Artificial Neural Networks course by Abhishek and Pukhraj from Starttechacademy. They explain the fundamentals of deep learning in a simple manner.

# Summary

In this article, we discussed the shortfalls of Recurrent Neural Networks when dealing with longer sentences. RNNs suffer from the problem of short term memory, i.e. they can store only a finite amount of state before the information gets morphed. After that, we discussed in detail how the selective read, write and forget strategy works in the LSTM by controlling the flow of information using the gate mechanism. We then looked at the variant of the LSTM called the Gated Recurrent Unit, which has fewer gates and fewer computations than the LSTM model.

In my next post, we will discuss Encoder-Decoder Models in-depth. So make sure you follow me on Medium to get notified as soon as it drops.

Until then, Peace 🙂

NK.

*Recommended Reading*

- Start Practising LSTM and GRU using Pytorch: *Classifying the Name Nationality of a Person using LSTM and Pytorch* (www.marktechpost.com)

- Learn about Recurrent Neural Networks: *Recurrent Neural Networks (RNN) Explained — the ELI5 way* (towardsdatascience.com)

# Author Bio

Niranjan Kumar is Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. Apart from writing on Medium, he also writes for Marktechpost.com as a freelance data science writer. Check out his articles here.

You can connect with him on LinkedIn or follow him on Twitter for updates about upcoming articles on deep learning and machine learning.

**Connect with Me:**

- LinkedIn — https://www.linkedin.com/in/niranjankumar-c/
- GitHub — https://github.com/Niranjankumar-c
- Twitter — https://twitter.com/Nkumar_n
- Medium — https://medium.com/@niranjankumarc