Sequence modeling is the task of predicting which word or letter comes next. A sequence model computes the probability of a number of words occurring in a particular sequence. Unlike in feedforward neural networks (FNNs) and convolutional neural networks (CNNs), in sequence modeling the current output depends not only on the current input but also on the previous inputs. In addition, the length of the input is not fixed.
Citation Note: The content and structure of this article are based on my understanding of the PadhAI deep learning lectures from One-Fourth Labs.
Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are a type of neural network in which the output from the previous step is fed as input to the current step.
RNNs are mainly used for:
- Sequence Classification — Sentiment Classification & Video Classification
- Sequence Labelling — Part of speech tagging & Named entity recognition
- Sequence Generation — Machine translation & Transliteration
In this section, we will discuss how to use an RNN for sequence classification. In sequence classification, we are given a corpus of sentences and their corresponding labels, i.e., the sentiment of each sentence, either positive or negative.
In this scenario, we don’t need to produce an output after every word of the input. Instead, we only need to understand the mood after reading the entire sentence, i.e., whether it is positive or negative.
As you can see from the above figure, the input sentences are not of equal length. Before we feed the data into the RNN, we need to pre-process it so that all input sequences have the same length (the input matrix will have a fixed dimension of m × n). The input words are then converted into one-hot vectors.
During pre-processing, we define a few special tokens, such as start-of-sequence and end-of-sequence.
Every input sequence is prefixed with the “Start-of-sequence” <sos> token to indicate where the sequence begins, and the “End-of-sequence” <eos> token is appended to mark where it ends. Since all sequences must have the same length as defined by the corresponding input layer, padding is applied where needed.
Padding is applied as follows:
- Find the maximum input length across all the sequences (say, 10)
- Add the special word <pad> to all shorter sequences so that they reach the same length (10, in this case).
Once we are done with the pre-processing (adding the special tokens), we have to convert the words, including the special tokens, into one-hot vectors and feed them into the network.
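The pre-processing steps described above can be sketched in a few lines of plain Python. This is only a minimal illustration: the helper names (`pad_sequences`, `build_vocab`, `one_hot`) and the toy sentences are assumptions made for this example and do not come from the lectures.

```python
# Special tokens named after the article's <sos>, <eos> and <pad>.
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"

def pad_sequences(sentences):
    """Wrap each tokenized sentence in <sos>/<eos>, then pad to a common length."""
    wrapped = [[SOS] + list(words) + [EOS] for words in sentences]
    max_len = max(len(seq) for seq in wrapped)
    return [seq + [PAD] * (max_len - len(seq)) for seq in wrapped]

def build_vocab(padded):
    """Map every distinct token to an integer index (sorted for determinism)."""
    tokens = sorted({tok for seq in padded for tok in seq})
    return {tok: i for i, tok in enumerate(tokens)}

def one_hot(token, vocab):
    """Return a vector of zeros with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec

# Toy corpus (hypothetical): two sentences of different lengths.
sentences = [["the", "movie", "was", "great"], ["bad", "plot"]]
padded = pad_sequences(sentences)
vocab = build_vocab(padded)
encoded = [[one_hot(tok, vocab) for tok in seq] for seq in padded]
```

After this step every sequence in `encoded` has the same number of time steps, and every time step is a one-hot vector whose length equals the vocabulary size.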
Important points to note about padding:
- Padding is done only to ensure that the input sequences are of uniform size.
- The computations in the RNN are performed only up to the “End-of-sequence” special token, i.e., padding is not treated as an input to the network.
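One simple way to picture "padding is not considered as an input" is to cut each padded sequence at its first <eos> token before running the recurrence. The sketch below assumes that convention; the function names are hypothetical, and real libraries usually achieve the same effect with sequence lengths or masks instead.

```python
EOS, PAD = "<eos>", "<pad>"

def effective_length(sequence):
    """Number of time steps up to and including the first <eos> token."""
    if EOS in sequence:
        return sequence.index(EOS) + 1
    return len(sequence)  # no <eos> found: fall back to the full length

def steps_to_process(sequence):
    """Return only the tokens the network should actually consume."""
    return sequence[:effective_length(sequence)]

padded = ["<sos>", "bad", "plot", "<eos>", "<pad>", "<pad>"]
real_tokens = steps_to_process(padded)
```

Here `real_tokens` contains the four tokens up to <eos>, so the two trailing <pad> entries never reach the network.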
Part-of-speech tagging is the task of labeling (predicting) the part-of-speech tag for each word in the sequence. Again, in this problem the output at the current time step depends not only on the current input (the current word) but also on the previous input. For example, the probability of tagging the word ‘movie’ as a noun is higher if we know that the previous word is an adjective.
Unlike in sequence classification, in sequence labeling we have to predict an output at each time step, for every word in the sequence. As we can see from the image, since the first sequence has 6 words, we get 6 part-of-speech predictions, based on the structure of the sentence.
Since our input sequences are of varying length, we have to pre-process the data so that the input sequences are of equal length. Remember that the RNN starts processing the sequence of words only after it encounters the “Start-of-sequence” <sos> token, and the “End-of-sequence” <eos> token signals to the network that the input has ended and the output needs to be finalized.
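For sequence labeling, the tag sequences have to stay aligned with the padded word sequences. The sketch below shows one possible way to do this; the placeholder tag for special positions (`IGNORE_TAG`), the function name and the toy tags are all assumptions for illustration, not something the lectures prescribe.

```python
SOS, EOS, PAD = "<sos>", "<eos>", "<pad>"
IGNORE_TAG = "<pad>"  # hypothetical placeholder tag for positions without a real word

def pad_tagged(sentences, tag_sequences):
    """Pad words and tags together so both have one entry per time step."""
    max_len = max(len(words) for words in sentences) + 2  # +2 for <sos> and <eos>
    padded_words, padded_tags = [], []
    for words, tags in zip(sentences, tag_sequences):
        fill = max_len - len(words) - 2
        padded_words.append([SOS] + list(words) + [EOS] + [PAD] * fill)
        # <sos>, <eos> and <pad> positions all receive the placeholder tag.
        padded_tags.append([IGNORE_TAG] + list(tags) + [IGNORE_TAG] * (1 + fill))
    return padded_words, padded_tags

words = [["a", "great", "movie"], ["watch", "it"]]
tags = [["DET", "ADJ", "NOUN"], ["VERB", "PRON"]]
padded_words, padded_tags = pad_tagged(words, tags)
```

With this layout, positions carrying the placeholder tag can be skipped when the loss is computed, so only real words contribute.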
In the previous sections, we discussed some of the tasks where an RNN can be used, along with the pre-processing steps to perform before feeding data into the model. In this section, we will discuss how to model the true relationship between input and output, i.e., the function that approximates it.
As we already know, in sequence classification the output depends on the entire sequence, e.g., predicting the pulse of a movie by analyzing its reviews.
The input to the function is shown in orange and denoted xᵢ. The weights associated with the input are denoted by U, and the hidden representation (sᵢ) of the word is computed as a function of the output of the previous time step (the previous hidden representation) and the current input, along with a bias. The hidden representation is computed at every step until the end of the sequence (sₜ).
The final output (y_hat) of the network is a softmax function of the final hidden representation and the weights associated with it, along with a bias.
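A minimal sketch of this forward pass in plain Python may make the description concrete. It assumes a tanh nonlinearity, an initial hidden state of zeros, and a separate output bias named `c`; none of these choices is stated in the text above, and the toy parameter values are arbitrary.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def rnn_step(x, s_prev, U, W, b):
    """s_i = tanh(U x_i + W s_{i-1} + b); tanh is an assumed nonlinearity."""
    ux, ws = matvec(U, x), matvec(W, s_prev)
    return [math.tanh(a + r + bias) for a, r, bias in zip(ux, ws, b)]

def classify_sequence(xs, U, W, V, b, c):
    """Run the recurrence over the whole sequence, then predict once at the end."""
    s = [0.0] * len(b)  # assumed initial state s_0 = 0
    for x in xs:
        s = rnn_step(x, s, U, W, b)
    logits = [o + bias for o, bias in zip(matvec(V, s), c)]
    return softmax(logits)

# Toy setup: 3-dimensional one-hot inputs, 2 hidden units, 2 classes.
U = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
W = [[0.2, -0.1], [0.3, 0.5]]
V = [[1.0, -1.0], [-0.5, 0.7]]
b = [0.0, 0.1]
c = [0.0, 0.0]
xs = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
y_hat = classify_sequence(xs, U, W, V, b, c)
```

Only the last hidden state feeds the softmax, which matches the idea that the sentiment is decided after reading the entire sentence.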
In sequence labeling, we have to predict the output at each time step, unlike sequence classification, where we predict only at the end.
The mathematical formulation differs slightly from sequence classification: in this approach, we predict the output after each time step.
Once we compute the hidden representation, the output (yᵢ) of the network at that time step is a softmax function of the hidden representation and the weights associated with it, along with a bias. In the same way, we compute the hidden representation and the predicted output for every time step in the sequence.
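The per-step prediction can be sketched as follows. As before, the tanh nonlinearity, the zero initial state, the output bias `c` and the toy parameter values are assumptions made only for this illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    shift = max(z)
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def label_sequence(xs, U, W, V, b, c):
    """Return one predicted distribution per time step."""
    s = [0.0] * len(b)  # assumed initial state s_0 = 0
    outputs = []
    for x in xs:
        pre = [a + r + bias for a, r, bias in zip(matvec(U, x), matvec(W, s), b)]
        s = [math.tanh(p) for p in pre]          # hidden representation s_i
        logits = [o + bias for o, bias in zip(matvec(V, s), c)]
        outputs.append(softmax(logits))          # prediction y_i at this step
    return outputs

# Toy setup: 3-dimensional one-hot inputs, 2 hidden units, 3 possible tags.
U = [[0.5, -0.3, 0.8], [0.1, 0.4, -0.6]]
W = [[0.2, -0.1], [0.3, 0.5]]
V = [[1.0, -1.0], [-0.5, 0.7], [0.3, 0.2]]
b = [0.0, 0.1]
c = [0.0, 0.0, 0.0]
xs = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
y_hats = label_sequence(xs, U, W, V, b, c)
```

The only structural difference from the classification sketch is that the softmax is applied inside the loop, so a sequence of three inputs yields three distributions.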
The purpose of the loss function is to tell the model how far its predictions are from the true values, so that it can be corrected during learning.
In the sequence classification problem, we use the cross-entropy loss function to compare two probability distributions (the true distribution and the predicted distribution). The loss is the negative sum, over all classes, of the true probability multiplied by the log of the predicted probability.
For ‘m’ training samples, the total loss is the average of the per-sample losses (where c indicates the correct, or true, class). Because the true distribution is one-hot, each per-sample loss reduces to the negative log of the probability predicted for class c.
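As a small illustration, the cross-entropy for one sample and its average over m samples could be computed like this. The function names and the toy probabilities are assumptions for this sketch.

```python
import math

def cross_entropy(y_true, y_pred):
    """-sum_k y_true[k] * log(y_pred[k]); classes with y_true[k] == 0 add nothing."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)

def average_loss(true_classes, predictions):
    """Mean of -log(probability predicted for the correct class c) over m samples."""
    m = len(true_classes)
    return -sum(math.log(pred[c]) for c, pred in zip(true_classes, predictions)) / m

# One sample whose true class is index 1 (one-hot true distribution).
y_true = [0, 1]
y_pred = [0.2, 0.8]
single = cross_entropy(y_true, y_pred)

# Two samples with true classes 1 and 0.
batch = average_loss([1, 0], [[0.2, 0.8], [0.6, 0.4]])
```

With a one-hot true distribution, `cross_entropy` and the per-sample term in `average_loss` give the same value, which is why only the correct class c appears in the averaged form.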
In the sequence labeling problem, we have to make a prediction at every time step, which means that at every time step we have a true distribution and a predicted distribution.
Since we are predicting a label at every time step, we can make an error at each time step. So we have to compare the true and predicted probability distributions at every time step to calculate the loss of the model.
In effect, over all the training examples (m) and all the time steps (T), we try to minimize the cross-entropy loss between the true distribution and the predicted distribution, i.e., the negative log of the probability predicted for the true class.
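A sketch of this loss sums the per-step cross-entropy over every time step and averages over the examples. Averaging over m mirrors the classification case and is an assumption here; normalizing by the number of time steps as well would be another reasonable choice. The function name and toy numbers are hypothetical.

```python
import math

def sequence_labeling_loss(true_classes, predictions):
    """Sum -log(prob of the true class) over all time steps, averaged over m examples.

    true_classes[i][t] is the index of the correct class at step t of example i.
    predictions[i][t] is the predicted distribution at that step.
    """
    m = len(true_classes)
    total = 0.0
    for classes, dists in zip(true_classes, predictions):
        for c, dist in zip(classes, dists):
            total -= math.log(dist[c])
    return total / m

# Two examples with two time steps each and two possible labels.
true_classes = [[0, 1], [0, 1]]
predictions = [
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.8, 0.2], [0.3, 0.7]],
]
loss = sequence_labeling_loss(true_classes, predictions)
```

Every time step contributes its own term, so a single wrong label in a long sentence still raises the loss.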
The objective of the learning algorithm is to determine the best possible values for the parameters, such that the overall loss (the cross-entropy loss described above) of the model is minimized as much as possible. Here is the learning algorithm:
We initialize W, U, V and b randomly. We then iterate over all the observations in the data; for each observation, we find the predicted outcome using the RNN equation and compute the overall loss. Based on the loss value, we update the weights so that the overall loss of the model at the new parameters is less than the current loss.
We keep repeating the update until we are satisfied. “Satisfied” could mean any of the following:
- The overall loss of the model becomes zero.
- The overall loss of the model becomes a very small value closer to zero.
- Iterating for a fixed number of passes based on computational capacity.
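The loop below is a toy sketch of this procedure on a scalar RNN with a sigmoid output for two classes. It is a deliberate simplification: gradients are estimated with finite differences as a stand-in for backpropagation through time, and the data, learning rate, tolerance and function names are all assumptions chosen for illustration.

```python
import math
import random

def predict(params, xs):
    """Scalar RNN: s_t = tanh(u*x_t + w*s_{t-1} + b); output = sigmoid(v*s_T)."""
    w, u, v, b = params
    s = 0.0
    for x in xs:
        s = math.tanh(u * x + w * s + b)
    return 1.0 / (1.0 + math.exp(-v * s))

def total_loss(params, data):
    """Average binary cross-entropy over all (sequence, label) pairs."""
    loss = 0.0
    for xs, y in data:
        p = min(max(predict(params, xs), 1e-12), 1 - 1e-12)  # avoid log(0)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(data)

def numerical_gradient(params, data, eps=1e-5):
    """Central finite differences; a stand-in for backpropagation through time."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((total_loss(up, data) - total_loss(down, data)) / (2 * eps))
    return grads

def train(data, lr=0.5, max_epochs=500, tol=1e-3, seed=0):
    rng = random.Random(seed)
    params = [rng.uniform(-0.5, 0.5) for _ in range(4)]  # random w, u, v, b
    history = [total_loss(params, data)]
    for _ in range(max_epochs):        # stop after a fixed number of passes...
        if history[-1] < tol:          # ...or once the loss is close to zero
            break
        grads = numerical_gradient(params, data)
        params = [p - lr * g for p, g in zip(params, grads)]
        history.append(total_loss(params, data))
    return params, history

# Toy data: sequences of +1s are labeled 1, sequences of -1s are labeled 0.
data = [([1.0, 1.0, 1.0], 1), ([-1.0, -1.0, -1.0], 0),
        ([1.0, 1.0], 1), ([-1.0, -1.0], 0)]
params, history = train(data)
```

The two stopping conditions in `train` correspond to the criteria listed above: the loss falling below a small tolerance, or a fixed budget of passes being used up.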
Recommended Reading
- Understanding Convolution Neural Networks — the ELI5 way: Learn about Convolution Operation and CNN’s (towardsdatascience.com)
- Building a Feedforward Neural Network using Pytorch NN Module: A Beginners Guide to Pytorch NN Module (medium.com)
Where to go from here?
If you want to learn more about Artificial Neural Networks using Keras & Tensorflow 2.0 (Python or R), check out the Artificial Neural Networks course by Abhishek and Pukhraj from Starttechacademy. They explain the fundamentals of deep learning in a simple manner.
In this post, we discussed how RNNs are used in tasks like sequence labeling and sequence classification. We then looked at the pre-processing techniques applied to the data before feeding it into the model. After that, we looked at the mathematical models for sequence labeling and sequence classification. Finally, we discussed the loss function and the learning algorithm for RNNs.
In my next post, we will discuss LSTM & GRU in depth. So make sure you follow me on Medium to get notified as soon as it drops.
Until then, Peace 🙂
Niranjan Kumar is Senior Consultant Data Science at Allstate India. He is passionate about Deep Learning and Artificial Intelligence. Apart from writing on Medium, he also writes for Marktechpost.com as a freelance data science writer. Check out his articles here.
Disclaimer — There might be some affiliate links in this post to relevant resources. You can purchase the bundle at the lowest price possible. I will receive a small commission if you purchase the course.