Workings of LSTMs in RNN
Step 1: Decide How Much Past Data It Should Remember
The first step in the LSTM is to determine which information should be omitted from the cell at that particular time step. A sigmoid function determines this: it looks at the previous state h(t-1) together with the current input x(t) and outputs a value between 0 and 1 for each entry of the cell state, where 0 means forget completely and 1 means keep entirely.
Consider the following two sentences:
Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at Chemistry.”
Let the current input at x(t) be “John plays football well. He told me yesterday over the phone that he had served as the captain of his college team.”
The forget gate realizes that there may be a change in context after encountering the first full stop. It compares this with the current input sentence at x(t). Because the next sentence talks about John, the information on Alice is deleted. The position of the subject is vacated and assigned to John.
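The forget gate can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation of any particular library: the names W_f, b_f, h_prev, c_prev and the layer sizes are assumptions chosen for the example, and the weights are random instead of trained.

```python
import numpy as np

def sigmoid(z):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """Compute f_t = sigmoid(W_f . [h_prev, x_t] + b_f)."""
    concat = np.concatenate([h_prev, x_t])
    return sigmoid(W_f @ concat + b_f)

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3               # illustrative sizes (assumption)
h_prev = rng.standard_normal(hidden_size)    # previous hidden state h(t-1)
x_t = rng.standard_normal(input_size)        # current input x(t)
c_prev = rng.standard_normal(hidden_size)    # previous cell state
W_f = rng.standard_normal((hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = forget_gate(h_prev, x_t, W_f, b_f)
kept = f_t * c_prev   # entries of f_t near 0 erase the matching cell value
```

Every entry of f_t lies strictly between 0 and 1, so multiplying it into the old cell state can only shrink each remembered value, never amplify it.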
Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts. One is the sigmoid function, and the other is the tanh function. The sigmoid function decides which values to let through (between 0 and 1). The tanh function gives weightage to the values that are passed, deciding their level of importance (-1 to 1).
With the current input at x(t), the input gate picks out the important information: that John plays football and that he was the captain of his college team. “He told me yesterday over the phone” is less important, so it is forgotten. This process of adding new information is done via the input gate.
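The two parts of this step, and the way they combine with the forget gate's result, can be sketched as follows. This is an illustrative NumPy sketch under assumed names (W_i, b_i, W_c, b_c) and random weights; the forget-gate output f_t is simply fixed to 0.5 here so the block stands on its own.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_cell_state(c_prev, f_t, h_prev, x_t, W_i, b_i, W_c, b_c):
    """Return the input gate, the candidate values and the new cell state."""
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)       # which values to let through, in (0, 1)
    c_tilde = np.tanh(W_c @ concat + b_c)   # candidate values, in (-1, 1)
    c_t = f_t * c_prev + i_t * c_tilde      # keep part of the old state, add the new
    return i_t, c_tilde, c_t

rng = np.random.default_rng(1)
hidden_size, input_size = 4, 3              # illustrative sizes (assumption)
h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
c_prev = rng.standard_normal(hidden_size)
f_t = np.full(hidden_size, 0.5)             # stand-in for the forget gate output
W_i = rng.standard_normal((hidden_size, hidden_size + input_size))
W_c = rng.standard_normal((hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

i_t, c_tilde, c_t = update_cell_state(c_prev, f_t, h_prev, x_t, W_i, b_i, W_c, b_c)
```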
Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which decides what parts of the cell state make it to the output. Then, we put the cell state through tanh to push the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
Let’s consider this example of predicting the next word in the sentence: “John played tremendously well against the opponent and won for his team. For his contributions, brave ____ was awarded player of the match.” There could be many choices for the empty space. The current input, brave, is an adjective, and adjectives describe a noun. So, “John” could be the best output after brave.
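The output step can be sketched in the same style. Again this is only an illustration with assumed names (W_o, b_o) and random, untrained weights; the cell state c_t is taken as given.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_step(c_t, h_prev, x_t, W_o, b_o):
    """Return the output gate o_t and the new hidden state h_t."""
    concat = np.concatenate([h_prev, x_t])
    o_t = sigmoid(W_o @ concat + b_o)   # which parts of the cell state to expose
    h_t = o_t * np.tanh(c_t)            # squash the cell state, then gate it
    return o_t, h_t

rng = np.random.default_rng(2)
hidden_size, input_size = 4, 3          # illustrative sizes (assumption)
h_prev = rng.standard_normal(hidden_size)
x_t = rng.standard_normal(input_size)
c_t = rng.standard_normal(hidden_size)  # cell state after the update step
W_o = rng.standard_normal((hidden_size, hidden_size + input_size))
b_o = np.zeros(hidden_size)

o_t, h_t = output_step(c_t, h_prev, x_t, W_o, b_o)
```

Because tanh keeps the cell state within (-1, 1) and the gate lies within (0, 1), every entry of the hidden state h_t has magnitude below 1.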
Applications of RNN
RNN has multiple uses, especially when it comes to predicting the future. In the financial industry, RNN can be helpful in predicting stock prices or the sign of the stock market direction (i.e., positive or negative).
RNN is useful for an autonomous car as it can avoid a car accident by anticipating the trajectory of the vehicle.
RNN is widely used in text analysis, image captioning, sentiment analysis and machine translation. For example, one can use a movie review to understand the feeling the spectator perceived after watching the movie. Automating this task is very useful when the movie company does not have enough time to review, label, consolidate and analyze the reviews. The machine can do the job with a higher level of accuracy.
Train a Recurrent Neural Network (RNN) in TensorFlow
Now that the data is ready, the next step is building a simple recurrent neural network. Before training with SimpleRNN, the data is passed through an Embedding layer so that every word is mapped to a vector of the same size.
Note: Set return_sequences=True only when another recurrent layer is stacked on top and needs the full sequence of outputs.
Try the model
Now run the model to see that it behaves as expected.
First check the shape of the output:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
(64, 100, 66) # (batch_size, sequence_length, vocab_size)
In the above example the sequence length of the input is 100, but the model can be run on inputs of any length:
model.summary()
Model: "my_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding (Embedding)       multiple                  16896
 gru (GRU)                   multiple                  3938304
 dense (Dense)               multiple                  67650
=================================================================
Total params: 4022850 (15.35 MB)
Trainable params: 4022850 (15.35 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
To get actual predictions from the model you need to sample from the output distribution, to get actual character indices. This distribution is defined by the logits over the character vocabulary.
Try it for the first example in the batch:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=1).numpy()
This gives us, at each timestep, a prediction of the next character index:
sampled_indices
array([15, 52, 19, 34, 6, 39, 41, 62, 50, 61, 42, 26, 29, 57, 34, 46, 12, 61, 53, 14, 26, 50, 5, 8, 29, 44, 2, 65, 62, 52, 53, 26, 25, 39, 64, 36, 53, 21, 34, 30, 12, 58, 61, 43, 38, 29, 1, 26, 47, 35, 52, 30, 10, 20, 59, 9, 11, 34, 59, 45, 56, 20, 39, 29, 46, 10, 54, 56, 57, 17, 19, 19, 14, 40, 12, 12, 4, 54, 22, 17, 31, 7, 61, 44, 56, 36, 5, 38, 30, 32, 23, 21, 52, 39, 42, 30, 42, 8, 17, 53])
Decode these to see the text predicted by this untrained model:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())
Input:
 b" of woman in the world,\nAy, every dram of woman's flesh is false, If she be.\n\nLEONTES:\nHold your pea"

Next Char Predictions:
 b"BmFU'ZbwkvcMPrUg;vnAMk&Pe zwmnMLZyWnHUQ;svdYP\nMhVmQ3Gt.:UtfqGZPg3oqrDFFAa;;$oIDR,veqW&YQSJHmZcQcDn"
Frequently Asked Questions (FAQs)
Q1. What’s the Difference Between a Feedforward Neural Network and Recurrent Neural Network?
In this deep learning interview question, the interviewer expects you to give an in-depth answer.
 In a Feedforward Neural Network, signals travel in one direction from input to output. There are no feedback loops; the network considers only the current input. It cannot memorize previous inputs (e.g., CNN).
 In a Recurrent Neural Network, signals travel in both directions, creating a looped network. It considers the current input together with the previously received inputs when generating the output of a layer and can memorize past data because of its internal memory.
Q2. What Are the Applications of a Recurrent Neural Network (RNN)?
RNNs can be used for sentiment analysis, text mining, and image captioning. Recurrent Neural Networks can also address time series problems such as predicting the prices of stocks over a month or quarter.
Q3. What Are the Softmax and ReLU Functions?
Softmax is an activation function that generates outputs between zero and one. It exponentiates each output and divides by the sum of the exponentials, so that the outputs add up to one. Softmax is often used for output layers.
ReLU (or Rectified Linear Unit) is the most widely used activation function. It gives an output of X if X is positive and zero otherwise. ReLU is often used for hidden layers.
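Both functions are short enough to write out directly. The sketch below uses NumPy; the example logits are made up for illustration.

```python
import numpy as np

def softmax(z):
    """Turn a vector of scores into probabilities that sum to one."""
    shifted = z - np.max(z)   # subtracting the max avoids overflow; the result is unchanged
    exp = np.exp(shifted)
    return exp / exp.sum()

def relu(z):
    """Pass positive values through and replace negative values with zero."""
    return np.maximum(0.0, z)

logits = np.array([2.0, 1.0, -1.0])   # made-up scores
probs = softmax(logits)
activated = relu(logits)
```

The largest logit receives the largest probability, and relu(logits) leaves 2.0 and 1.0 untouched while mapping -1.0 to 0.0.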
Q4. What Are Hyperparameters?
This is another commonly asked deep learning interview question. With neural networks, you’re usually working with hyperparameters once the data is formatted correctly. A hyperparameter is a parameter whose value is set before the learning process begins. It determines how a network is trained and the structure of the network (such as the number of hidden units, the learning rate, epochs, etc.).
Q5. What Will Happen If the Learning Rate Is Set Too Low or Too High?
When your learning rate is too low, training of the model will progress very slowly because we are making minimal updates to the weights. It will take many updates before reaching the minimum point. If the learning rate is set too high, it causes undesirable divergent behavior in the loss function due to drastic updates in the weights. Training may fail to converge (the loss keeps oscillating instead of settling at a minimum) or even diverge (the loss grows with each update).
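The effect is easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(x) = x**2, whose minimum is at 0; the function, the starting point and the three learning rates are assumptions chosen purely for illustration.

```python
def gradient_descent(lr, steps=20, x0=1.0):
    """Minimize f(x) = x**2 (gradient 2*x) from x0 and return the final x."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x
    return x

too_low = gradient_descent(lr=0.001)    # barely moves away from 1.0
reasonable = gradient_descent(lr=0.1)   # approaches the minimum at 0
too_high = gradient_descent(lr=1.5)     # every step overshoots and |x| doubles
```

Each update multiplies x by (1 - 2 * lr), so a rate of 0.001 shrinks x by only 0.2 percent per step, while a rate of 1.5 multiplies it by -2 and the iterate runs away from the minimum.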
Q6. What’s Dropout and Batch Normalization?
Dropout is a technique of randomly dropping out hidden and visible units of a network to prevent overfitting (typically dropping 20 percent of the nodes). It can roughly double the number of iterations needed for the network to converge.
Batch normalization is a technique to improve the performance and stability of neural networks by normalizing the inputs to every layer so that they have a mean output activation of zero and a variance of one.
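Both techniques can be sketched with NumPy. This is a simplified illustration, not a framework implementation: the dropout shown is the common "inverted" variant that rescales the surviving units, and the batch normalization omits the learnable scale and shift parameters and the running statistics used at inference time.

```python
import numpy as np

def dropout(activations, rate, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def batch_norm(x, eps=1e-5):
    """Normalize each feature (column) over the batch to mean 0 and variance close to 1."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(loc=5.0, scale=3.0, size=(64, 8))   # made-up activations
normalized = batch_norm(batch)
dropped = dropout(np.ones((1000, 10)), rate=0.2, rng=rng)
```

With a rate of 0.2, about one in five entries of dropped is zero and the others are scaled up to 1.25, so the expected activation stays the same.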
Q7. What Are Overfitting and Underfitting, and How Do You Combat Them?
Overfitting occurs when the model learns the details and noise in the training data to the degree that it adversely impacts the performance of the model on new data. It is more likely to occur with nonlinear models that have more flexibility when learning a target function. An example would be a model that looks at cars and trucks but only recognizes trucks that have a specific box shape. It might not be able to recognize a flatbed truck because it saw only one kind of truck in training. The model performs well on training data, but not in the real world.
Underfitting refers to a model that is neither well-trained on the data nor able to generalize to new data. This usually happens when there is too little or incorrect data to train a model. An underfit model performs poorly on both the training data and new data.
To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and keep a validation dataset to evaluate the model.
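The splitting behind k-fold cross-validation can be sketched in plain Python. The helper name k_fold_indices is hypothetical, and the sketch assumes the data has already been shuffled; in practice a library routine would normally be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
```

Each sample lands in the validation set exactly once, so averaging the validation score over the folds uses all of the data for evaluation without ever scoring a model on samples it was trained on.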
Q8. How Are Weights Initialized in a Network?
There are two methods here: we can either initialize the weights to zero or assign them randomly.
 Initializing all weights to 0: This makes your model behave like a linear model. All the neurons in every layer perform the same operation, giving the same output and making the deep net useless.
 Initializing all weights randomly: Here, the weights are assigned randomly by initializing them very close to 0. It gives better accuracy to the model since every neuron performs different computations. This is the most commonly used method.
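The difference between the two methods shows up already in a single forward pass. The sketch below is illustrative only: the input vector, the layer size, the tanh activation and the 0.01 scale for the random weights are all assumptions.

```python
import numpy as np

def hidden_activations(x, W, b):
    """One dense layer with a tanh activation."""
    return np.tanh(W @ x + b)

x = np.array([0.5, -1.2, 3.0])                  # made-up input
rng = np.random.default_rng(0)

W_zero = np.zeros((4, 3))                       # all weights set to 0
W_random = rng.standard_normal((4, 3)) * 0.01   # small random weights near 0

h_zero = hidden_activations(x, W_zero, np.zeros(4))
h_random = hidden_activations(x, W_random, np.zeros(4))
```

With zero weights all four neurons output exactly the same value, so they would also receive identical updates; with small random weights each neuron starts out computing something different.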
Q9. What Are the Different Layers in a CNN?
There are four layers in a CNN:
 Convolutional Layer – the layer that performs a convolution operation, creating several smaller picture windows to go over the data.
 ReLU Layer – it brings nonlinearity to the network and converts all the negative pixels to zero. The output is a rectified feature map.
 Pooling Layer – pooling is a down-sampling operation that reduces the dimensionality of the feature map.
 Fully Connected Layer – this layer recognizes and classifies the objects in the image.
Q10. What Is Pooling in a CNN, and How Does It Work?
Pooling is used to reduce the spatial dimensions of a CNN. It performs down-sampling operations to reduce the dimensionality and creates a pooled feature map by sliding a filter matrix over the input matrix.
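As an illustration, max pooling (one common kind of pooling) over a small feature map can be written as follows. The 2x2 window, the stride of 2 and the example values are assumptions for the sketch.

```python
import numpy as np

def max_pool_2d(feature_map, pool_size=2, stride=2):
    """Slide a pool_size x pool_size window over the map and keep each window's maximum."""
    height, width = feature_map.shape
    out_h = (height - pool_size) // stride + 1
    out_w = (width - pool_size) // stride + 1
    pooled = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = feature_map[i * stride:i * stride + pool_size,
                                 j * stride:j * stride + pool_size]
            pooled[i, j] = window.max()
    return pooled

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [1, 8, 3, 4]])
pooled = max_pool_2d(feature_map)   # [[6, 4], [8, 9]]
```

The 4x4 input shrinks to a 2x2 pooled feature map, with each output cell holding the strongest activation of its window.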
Q11. How Does an LSTM Network Work?
Long Short-Term Memory (LSTM) is a special kind of recurrent neural network capable of learning long-term dependencies, remembering information for long periods as its default behavior. There are three steps in an LSTM network:
 Step 1: The network decides what to forget and what to remember.
 Step 2: It selectively updates cell state values.
 Step 3: The network decides what part of the current state makes it to the output.
Q12. What Are Vanishing and Exploding Gradients?
While training an RNN, your slope (gradient) can become either too small or too large; this makes the training difficult. When the slope is too small, the problem is known as a “Vanishing Gradient.” When the slope tends to grow exponentially instead of decaying, it’s referred to as an “Exploding Gradient.” Gradient problems cause long training times, poor performance, and low accuracy.
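Both effects come from multiplying many per-timestep factors together. The sketch below uses a deliberately simplified model, a linear scalar RNN h_t = w * h_(t-1), which is an assumption made so the gradient is just w raised to the number of timesteps.

```python
def gradient_through_time(recurrent_weight, timesteps):
    """Magnitude of d h_T / d h_0 for the linear scalar RNN h_t = w * h_(t-1)."""
    gradient = 1.0
    for _ in range(timesteps):
        gradient *= recurrent_weight
    return abs(gradient)

vanishing = gradient_through_time(0.5, 50)   # shrinks towards zero
exploding = gradient_through_time(1.5, 50)   # grows exponentially
```

After 50 timesteps a weight of 0.5 leaves a gradient far below 1e-10, while a weight of 1.5 produces one in the hundreds of millions.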
Q13. What Is the Difference Between Epoch, Batch, and Iteration in Deep Learning?
Epoch – Represents one iteration over the entire dataset (everything put into the training model).
Batch – Refers to the case where we cannot pass the entire dataset into the neural network at once, so we divide the dataset into several batches.
Iteration – If we have 10,000 images as data and a batch size of 200, then an epoch should run 50 iterations (10,000 divided by 200).
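The arithmetic can be checked in a couple of lines. The helper name iterations_per_epoch is hypothetical, and rounding up (so that a final partial batch still counts as an iteration) is an assumption about how leftover samples are handled.

```python
import math

def iterations_per_epoch(dataset_size, batch_size):
    """Number of batches needed to pass over the whole dataset once."""
    return math.ceil(dataset_size / batch_size)

per_epoch = iterations_per_epoch(10_000, 200)    # 50
total_for_three_epochs = per_epoch * 3           # 150
```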
Figure: Countplot for Class Name Category
Countplots help us to understand the distribution of the whole data along the different categories of a particular column.
Setup
import os
import datetime

import IPython
import IPython.display
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf

mpl.rcParams['figure.figsize'] = (8, 6)
mpl.rcParams['axes.grid'] = False
Setup
import numpy as np
import tensorflow as tf
import keras
from keras import layers
Backpropagation Through Time
Backpropagation through time is when we apply the backpropagation algorithm to a recurrent neural network that has time series data as its input.
In a typical RNN, one input is fed into the network at a time, and one output is obtained. But in backpropagation through time, you use the current as well as the previous inputs as input. This is called a timestep, and one timestep will consist of many time series data points entering the RNN simultaneously.
Once the neural network has trained on a set of timesteps and given you an output, that output is used to calculate and accumulate the errors. After this, the network is rolled back up, and the weights are recalculated and updated with the errors in mind.
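The unrolling and error accumulation can be made concrete with a tiny scalar RNN. This is a sketch under simplifying assumptions: one hidden unit, h_t = tanh(w * h_(t-1) + u * x_t), a squared-error loss on the final state only, and made-up inputs. The analytic gradient from backpropagation through time is compared against a numerical estimate.

```python
import math

def forward(w, u, xs):
    """Unroll the RNN over the sequence and return all hidden states (h_0 = 0)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    return hs

def loss(w, u, xs, target):
    return 0.5 * (forward(w, u, xs)[-1] - target) ** 2

def bptt_grad_w(w, u, xs, target):
    """Gradient of the loss with respect to the shared recurrent weight w."""
    hs = forward(w, u, xs)
    dh = hs[-1] - target                  # error at the last timestep
    grad_w = 0.0
    for t in range(len(xs), 0, -1):       # walk backwards through time
        dz = dh * (1.0 - hs[t] ** 2)      # back through tanh
        grad_w += dz * hs[t - 1]          # accumulate w's contribution at step t
        dh = dz * w                       # pass the error back to h_(t-1)
    return grad_w

xs = [0.5, -0.1, 0.3, 0.8]                # made-up input sequence
w, u, target = 0.9, 0.4, 0.2
analytic = bptt_grad_w(w, u, xs, target)
eps = 1e-6
numeric = (loss(w + eps, u, xs, target) - loss(w - eps, u, xs, target)) / (2 * eps)
```

Because the same weight w is used at every timestep, its gradient is the sum of the contributions from all timesteps, which is exactly what the backward loop accumulates.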
What are Recurrent Neural Networks (RNN)
A recurrent neural network (RNN) is the type of artificial neural network (ANN) that is used in Apple’s Siri and Google’s voice search. RNN remembers past inputs due to an internal memory which is useful for predicting stock prices, generating text, transcriptions, and machine translation.
In a traditional neural network, the inputs and the outputs are independent of each other, whereas the output of an RNN depends on prior elements within the sequence. Recurrent networks also share parameters across each layer of the network. In feedforward networks, there are different weights across each node, whereas an RNN shares the same weights within each layer of the network; during gradient descent, the weights and biases are adjusted individually to reduce the loss.
The image above is a simple representation of recurrent neural networks. If we are forecasting stock prices using simple data [45,56,45,49,50,…], each input from X0 to Xt will contain a past value. For example, X0 will have 45, X1 will have 56, and these values are used to predict the next number in a sequence.
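A forward pass over such a price sequence can be sketched with NumPy. The weights here are random rather than trained, the hidden size of 3 and the division by 100 to scale the prices are assumptions, and the names W_x, W_h and b are chosen for the example.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Apply the same weights at every timestep and return all hidden states."""
    h = np.zeros(W_h.shape[0])
    hidden_states = []
    for x_t in inputs:
        h = np.tanh(W_x * x_t + W_h @ h + b)   # new state depends on input and old state
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
hidden_size = 3
W_x = rng.standard_normal(hidden_size)                  # input-to-hidden weights
W_h = rng.standard_normal((hidden_size, hidden_size))   # hidden-to-hidden weights
b = np.zeros(hidden_size)

prices = np.array([45, 56, 45, 49, 50]) / 100.0   # X0 ... X4, scaled
states = rnn_forward(prices, W_x, W_h, b)
reversed_states = rnn_forward(prices[::-1], W_x, W_h, b)
```

Feeding the same prices in reverse order gives a different final hidden state, which shows that the network's memory depends on the order of the sequence and not just on its values.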
TensorFlow – Recurrent Neural Networks
Recurrent neural networks are a type of deep learning-oriented algorithm that follows a sequential approach. In traditional neural networks, we assume that each input and output is independent of all the others. These networks are called recurrent because they perform the same mathematical computation for every element of a sequence, in a sequential manner.
Consider the following steps to train a recurrent neural network −
Step 1 − Input a specific example from the dataset.
Step 2 − The network takes the example and performs some calculations using randomly initialized variables.
Step 3 − A predicted result is then computed.
Step 4 − Comparing the predicted result with the expected value produces an error.
Step 5 − To trace the error, it is propagated back through the same path, and the variables are adjusted along the way.
Step 6 − Steps 1 to 5 are repeated until we are confident that the variables used to produce the output are set properly.
Step 7 − A prediction is made by applying these variables to new, unseen input.
The schematic approach of representing recurrent neural networks is described below −
Figure: Countplot for the Rating and Recommended IND category
Now let’s plot the histogram plot of the Age group along with the Recommended IND category and the presence of outliers categorywise.
Outputs and states
By default, the output of an RNN layer contains a single vector per sample. This vector is the RNN cell output corresponding to the last timestep, containing information about the entire input sequence. The shape of this output is (batch_size, units), where units corresponds to the units argument passed to the layer’s constructor.
An RNN layer can also return the entire sequence of outputs for each sample (one vector per timestep per sample), if you set return_sequences=True. The shape of this output is (batch_size, timesteps, units).
model = keras.Sequential()
model.add(layers.Embedding(input_dim=1000, output_dim=64))

# The output of GRU will be a 3D tensor of shape (batch_size, timesteps, 256)
model.add(layers.GRU(256, return_sequences=True))

# The output of SimpleRNN will be a 2D tensor of shape (batch_size, 128)
model.add(layers.SimpleRNN(128))

model.add(layers.Dense(10))

model.summary()
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 embedding_1 (Embedding)     (None, None, 64)          64000
 gru (GRU)                   (None, None, 256)         247296
 simple_rnn (SimpleRNN)      (None, 128)               49280
 dense_1 (Dense)             (None, 10)                1290
=================================================================
Total params: 361866 (1.38 MB)
Trainable params: 361866 (1.38 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
In addition, an RNN layer can return its final internal state(s). The returned states can be used to resume the RNN execution later, or to initialize another RNN. This setting is commonly used in the encoder-decoder sequence-to-sequence model, where the encoder final state is used as the initial state of the decoder.
To configure an RNN layer to return its internal state, set the return_state parameter to True when creating the layer. Note that LSTM has 2 state tensors, but GRU only has one.
To configure the initial state of the layer, just call the layer with the additional keyword argument initial_state. Note that the shape of the state needs to match the unit size of the layer, like in the example below.
encoder_vocab = 1000
decoder_vocab = 2000

encoder_input = layers.Input(shape=(None,))
encoder_embedded = layers.Embedding(input_dim=encoder_vocab, output_dim=64)(
    encoder_input
)

# Return states in addition to output
output, state_h, state_c = layers.LSTM(64, return_state=True, name="encoder")(
    encoder_embedded
)
encoder_state = [state_h, state_c]

decoder_input = layers.Input(shape=(None,))
decoder_embedded = layers.Embedding(input_dim=decoder_vocab, output_dim=64)(
    decoder_input
)

# Pass the 2 states to a new LSTM layer, as initial state
decoder_output = layers.LSTM(64, name="decoder")(
    decoder_embedded, initial_state=encoder_state
)
output = layers.Dense(10)(decoder_output)

model = keras.Model([encoder_input, decoder_input], output)
model.summary()
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to
==================================================================================================
 input_1 (InputLayer)        [(None, None)]               0         []
 input_2 (InputLayer)        [(None, None)]               0         []
 embedding_2 (Embedding)     (None, None, 64)             64000     ['input_1[0][0]']
 embedding_3 (Embedding)     (None, None, 64)             128000    ['input_2[0][0]']
 encoder (LSTM)              [(None, 64),                 33024     ['embedding_2[0][0]']
                              (None, 64),
                              (None, 64)]
 decoder (LSTM)              (None, 64)                   33024     ['embedding_3[0][0]',
                                                                     'encoder[0][1]',
                                                                     'encoder[0][2]']
 dense_2 (Dense)             (None, 10)                   650       ['decoder[0][0]']
==================================================================================================
Total params: 258698 (1010.54 KB)
Trainable params: 258698 (1010.54 KB)
Non-trainable params: 0 (0.00 Byte)
__________________________________________________________________________________________________
LSTM Use Case
Now that you understand how LSTMs work, let’s do a practical implementation to predict stock prices using the “Google stock price” data. Based on the stock price data between 2012 and 2016, we are going to predict the stock prices of 2017.
1. Import the required libraries
2. Import the training dataset
3. Perform feature scaling to transform the data
4. Create a data structure with 60 time steps and 1 output
5. Import the Keras library and its packages
6. Initialize the RNN
7. Add the LSTM layers and some dropout regularization
8. Add the output layer
9. Compile the RNN
10. Fit the RNN to the training set
11. Load the stock price test data for 2017
12. Get the predicted stock price for 2017
13. Visualize the results of predicted and real stock price
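Steps 3 and 4 are the part of this list that most often causes confusion, so here is a minimal NumPy sketch of them. It is not the original tutorial's code: the price series is a synthetic stand-in for the Google stock data, and the helper names min_max_scale and make_windows are hypothetical.

```python
import numpy as np

def min_max_scale(values):
    """Feature scaling: map the values linearly onto the range [0, 1]."""
    lo, hi = values.min(), values.max()
    return (values - lo) / (hi - lo)

def make_windows(series, timesteps=60):
    """Build inputs of `timesteps` past values and the next value as the output."""
    X, y = [], []
    for i in range(timesteps, len(series)):
        X.append(series[i - timesteps:i])   # the 60 values before position i
        y.append(series[i])                 # the value to predict
    X = np.array(X).reshape(-1, timesteps, 1)   # (samples, timesteps, features)
    return X, np.array(y)

prices = np.linspace(100.0, 200.0, 100)     # synthetic stand-in for real prices
scaled = min_max_scale(prices)
X_train, y_train = make_windows(scaled, timesteps=60)
```

With 100 prices and 60 time steps this yields 40 training samples, each of shape (60, 1), which is the three-dimensional (samples, timesteps, features) layout that Keras recurrent layers expect as input.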
Create the model
Above is a diagram of the model.

This model can be built as a tf.keras.Sequential.
The first layer is the encoder, which converts the text to a sequence of token indices.
After the encoder is an embedding layer. An embedding layer stores one vector per word. When called, it converts the sequences of word indices to sequences of vectors. These vectors are trainable. After training (on enough data), words with similar meanings often have similar vectors.
This index-lookup is much more efficient than the equivalent operation of passing a one-hot encoded vector through a tf.keras.layers.Dense layer.
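The equivalence between the two operations can be checked with NumPy. The vocabulary size, embedding dimension and token ids below are made up for the illustration, and the random matrix stands in for trained embedding weights.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 6, 4
embedding_matrix = rng.standard_normal((vocab_size, embedding_dim))
token_ids = np.array([2, 0, 5])

# Index lookup: pick one row per token.
looked_up = embedding_matrix[token_ids]

# Equivalent but wasteful: build one-hot vectors and multiply by the full matrix.
one_hot = np.eye(vocab_size)[token_ids]
via_matmul = one_hot @ embedding_matrix
```

Both paths give the same vectors, but the lookup touches only three rows while the matrix product multiplies by every entry of the matrix, most of them against zeros.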
A recurrent neural network (RNN) processes sequence input by iterating through the elements. RNNs pass the outputs from one timestep to their input on the next timestep.
The tf.keras.layers.Bidirectional wrapper can also be used with an RNN layer. This propagates the input forward and backwards through the RNN layer and then concatenates the final output.
The main advantage of a bidirectional RNN is that the signal from the beginning of the input doesn’t need to be processed all the way through every timestep to affect the output.

The main disadvantage of a bidirectional RNN is that you can’t efficiently stream predictions as words are being added to the end.


After the RNN has converted the sequence to a single vector, the two layers.Dense do some final processing, and convert from this vector representation to a single logit as the classification output.
The code to implement this is below:
model = tf.keras.Sequential([
    encoder,
    tf.keras.layers.Embedding(
        input_dim=len(encoder.get_vocabulary()),
        output_dim=64,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1)
])
Please note that the Keras sequential model is used here since all the layers in the model have a single input and produce a single output. If you want to use a stateful RNN layer, you might want to build your model with the Keras functional API or model subclassing so that you can retrieve and reuse the RNN layer states. Please check the Keras RNN guide for more details.
The embedding layer uses masking to handle the varying sequence-lengths. All the layers after the Embedding support masking:
print([layer.supports_masking for layer in model.layers])
[False, True, True, True, True]
To confirm that this works as expected, evaluate a sentence twice. First, alone so there’s no padding to mask:
# predict on a sample text without padding.
sample_text = ('The movie was cool. The animation and the graphics '
               'were out of this world. I would recommend this movie.')
predictions = model.predict(np.array([sample_text]))
print(predictions[0])
1/1 [==============================] - 3s 3s/step
[0.00856274]
Now, evaluate it again in a batch with a longer sentence. The result should be identical:
# predict on a sample text with padding
padding = "the " * 2000
predictions = model.predict(np.array([sample_text, padding]))
print(predictions[0])
1/1 [==============================] - 0s 86ms/step
[0.00856275]
Compile the Keras model to configure the training process:
model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
RNNs with list/dict inputs, or nested inputs
Nested structures allow implementers to include more information within a single timestep. For example, a video frame could have audio and video input at the same time. The data shape in this case could be:
[batch, timestep, {"video": [height, width, channel], "audio": [frequency]}]
In another example, handwriting data could have both coordinates x and y for the current position of the pen, as well as pressure information. So the data representation could be:
[batch, timestep, {"location": [x, y], "pressure": [force]}]
The following code provides an example of how to build a custom RNN cell that accepts such structured inputs.
Define a custom cell that supports nested input/output
See Making new Layers & Models via subclassing for details on writing your own layers.
@keras.saving.register_keras_serializable()
class NestedCell(keras.layers.Layer):
    def __init__(self, unit_1, unit_2, unit_3, **kwargs):
        self.unit_1 = unit_1
        self.unit_2 = unit_2
        self.unit_3 = unit_3
        self.state_size = [tf.TensorShape([unit_1]), tf.TensorShape([unit_2, unit_3])]
        self.output_size = [tf.TensorShape([unit_1]), tf.TensorShape([unit_2, unit_3])]
        super().__init__(**kwargs)

    def build(self, input_shapes):
        # expect input_shape to contain 2 items, [(batch, i1), (batch, i2, i3)]
        i1 = input_shapes[0][1]
        i2 = input_shapes[1][1]
        i3 = input_shapes[1][2]

        self.kernel_1 = self.add_weight(
            shape=(i1, self.unit_1), initializer="uniform", name="kernel_1"
        )
        self.kernel_2_3 = self.add_weight(
            shape=(i2, i3, self.unit_2, self.unit_3),
            initializer="uniform",
            name="kernel_2_3",
        )

    def call(self, inputs, states):
        # inputs should be in [(batch, input_1), (batch, input_2, input_3)]
        # state should be in shape [(batch, unit_1), (batch, unit_2, unit_3)]
        input_1, input_2 = tf.nest.flatten(inputs)
        s1, s2 = states

        output_1 = tf.matmul(input_1, self.kernel_1)
        output_2_3 = tf.einsum("bij,ijkl->bkl", input_2, self.kernel_2_3)
        state_1 = s1 + output_1
        state_2_3 = s2 + output_2_3

        output = (output_1, output_2_3)
        new_states = (state_1, state_2_3)

        return output, new_states

    def get_config(self):
        return {"unit_1": self.unit_1, "unit_2": self.unit_2, "unit_3": self.unit_3}
Build a RNN model with nested input/output
Let’s build a Keras model that uses a keras.layers.RNN layer and the custom cell we just defined.
unit_1 = 10
unit_2 = 20
unit_3 = 30

i1 = 32
i2 = 64
i3 = 32
batch_size = 64
num_batches = 10
timestep = 50

cell = NestedCell(unit_1, unit_2, unit_3)
rnn = keras.layers.RNN(cell)

input_1 = keras.Input((None, i1))
input_2 = keras.Input((None, i2, i3))

outputs = rnn((input_1, input_2))

model = keras.models.Model([input_1, input_2], outputs)

model.compile(optimizer="adam", loss="mse", metrics=["accuracy"])
Train the model with randomly generated data
Since there isn’t a good candidate dataset for this model, we use random Numpy data for demonstration.
input_1_data = np.random.random((batch_size * num_batches, timestep, i1))
input_2_data = np.random.random((batch_size * num_batches, timestep, i2, i3))
target_1_data = np.random.random((batch_size * num_batches, unit_1))
target_2_data = np.random.random((batch_size * num_batches, unit_2, unit_3))
input_data = [input_1_data, input_2_data]
target_data = [target_1_data, target_2_data]

model.fit(input_data, target_data, batch_size=batch_size)
10/10 [==============================] – 1s 27ms/step – loss: 0.7623 – rnn_1_loss: 0.2873 – rnn_1_1_loss: 0.4750 – rnn_1_accuracy: 0.1016 – rnn_1_1_accuracy: 0.0350
With the keras.layers.RNN layer, you are only expected to define the math logic for an individual step within the sequence, and the keras.layers.RNN layer will handle the sequence iteration for you. It’s an incredibly powerful way to quickly prototype new kinds of RNNs (e.g. an LSTM variant).
For more details, please visit the API docs.
Humans do not reboot their understanding of language each time we hear a sentence. Given an article, we grasp the context based on our previous understanding of those words. One of the defining characteristics we possess is our memory (or retention power).
Can an algorithm replicate this? The first technique that comes to mind is a neural network (NN). But the traditional NNs unfortunately cannot do this. Take an example of wanting to predict what comes next in a video. A traditional neural network will struggle to generate accurate results.
That’s where the concept of recurrent neural networks (RNNs) comes into play. RNNs have become extremely popular in the deep learning space, which makes learning them even more imperative. A few real-world applications of RNN include:
In this article, we’ll first quickly go through the core components of a typical RNN model. Then we’ll set up the problem statement, which we will finally solve by implementing an RNN model from scratch in Python.
We can always leverage high-level Python libraries to code an RNN. So why code it from scratch? I firmly believe the best way to learn and truly ingrain a concept is to learn it from the ground up. And that’s what I’ll showcase in this tutorial.
This article assumes a basic understanding of recurrent neural networks. In case you need a quick refresher or are looking to learn the basics of RNN, I recommend going through the below articles first:
Let’s quickly recap the core concepts behind recurrent neural networks.
We’ll do this using an example of sequence data, say the stock of a particular firm. A simple machine learning model, or an artificial neural network, may learn to predict the stock price based on a number of features, such as the volume of the stock, the opening value, etc. Apart from these, the price also depends on how the stock fared in the previous days and weeks. For a trader, this historical data is actually a major deciding factor for making predictions.
In conventional feedforward neural networks, all test cases are considered to be independent. Can you see how that’s a bad fit when predicting stock prices? The NN model would not consider the previous stock price values – not a great idea!
There is another concept we can lean on when faced with time-sensitive data: Recurrent Neural Networks (RNNs)!
A typical RNN looks like this:
This may seem intimidating at first. But once we unfold it, things start looking a lot simpler:
It is now easier for us to visualize how these networks are considering the trend of stock prices. This helps us in predicting the prices for the day. Here, every prediction at time t (h_t) is dependent on all previous predictions and the information learned from them. Fairly straightforward, right?
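To make the dependence on earlier steps concrete, here is a small sketch of an unrolled scalar recurrence. The weights w_in and w_rec, the tanh activation and the input values are all assumptions made up for illustration; the point is only that changing the first input changes the final state.

```python
import numpy as np

def final_state(xs, w_in=0.5, w_rec=0.9):
    """Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}), starting from h = 0."""
    h = 0.0
    for x_t in xs:
        h = np.tanh(w_in * x_t + w_rec * h)
    return h

prices = [1.0, 0.5, -0.2, 0.3]
perturbed = [2.0, 0.5, -0.2, 0.3]  # only the first value differs

# The two final states differ even though the last three inputs are identical.
print(final_state(prices), final_state(perturbed))
```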
RNNs can solve our purpose of sequence handling to a great extent but not entirely.
Text is another good example of sequence data. Being able to predict what word or phrase comes after a given text could be a very useful asset. We want our models to write Shakespearean sonnets!
Now, RNNs are great when it comes to context that is short or small in nature. But in order to be able to build a story and remember it, our models should be able to understand the context behind the sequences, just like a human brain.
In this article, we will work on a sequence prediction problem using RNN. One of the simplest tasks for this is sine wave prediction. The sequence contains a visible trend and is easy to solve using heuristics. This is what a sine wave looks like:
We will first devise a recurrent neural network from scratch to solve this problem. Our RNN model should also be able to generalize well so we can apply it on other sequence problems.
We will formulate our problem like this – given a sequence of 50 numbers belonging to a sine wave, predict the 51st number in the series. Time to fire up your Jupyter notebook (or your IDE of choice)!
Ah, the inevitable first step in any data science project – preparing the data before we do anything else.
What does our network model expect the data to be like? It would accept a single sequence of length 50 as input. So the shape of the input data will be:
(number_of_records x length_of_sequence x types_of_sequences)
Here, types_of_sequences is 1, because we have only one type of sequence – the sine wave.
On the other hand, the output would have only one value for each record. This will of course be the 51st value in the input sequence. So its shape would be:
(number_of_records x types_of_sequences) #where types_of_sequences is 1
Let’s dive into the code. First, import the necessary libraries:
%pylab inline
import math
To create a sine wave like data, we will use the sine function from Python’s math library:
sin_wave = np.array([math.sin(x) for x in np.arange(200)])
Visualizing the sine wave we’ve just generated:
plt.plot(sin_wave[:50])
Python Code:
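The validation snippet that follows relies on X, Y, seq_len and num_records already being defined. A minimal sketch of how the training split might be built, assuming seq_len = 50 (the window length stated above) and num_records = len(sin_wave) - seq_len; both values are assumptions, chosen so that the last 50 windows can be held out for validation:

```python
import math
import numpy as np

sin_wave = np.array([math.sin(x) for x in np.arange(200)])

seq_len = 50                               # assumed: 50 inputs, predict the 51st
num_records = len(sin_wave) - seq_len      # assumed: 150 possible windows

X = []
Y = []
for i in range(num_records - 50):          # keep the last 50 windows for validation
    X.append(sin_wave[i:i + seq_len])      # 50 consecutive values
    Y.append(sin_wave[i + seq_len])        # the value that follows them

X = np.expand_dims(np.array(X), axis=2)    # (records, seq_len, 1)
Y = np.expand_dims(np.array(Y), axis=1)    # (records, 1)

print(X.shape, Y.shape)  # (100, 50, 1) (100, 1)
```

The shapes match the (number_of_records x length_of_sequence x types_of_sequences) layout described earlier.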
X_val = []
Y_val = []

for i in range(num_records - 50, num_records):
    X_val.append(sin_wave[i:i+seq_len])
    Y_val.append(sin_wave[i+seq_len])

X_val = np.array(X_val)
X_val = np.expand_dims(X_val, axis=2)

Y_val = np.array(Y_val)
Y_val = np.expand_dims(Y_val, axis=1)
Our next task is defining all the necessary variables and functions we’ll use in the RNN model. Our model will take in the input sequence, process it through a hidden layer of 100 units, and produce a single valued output:
learning_rate = 0.0001
nepoch = 25
T = 50                   # length of sequence
hidden_dim = 100
output_dim = 1

bptt_truncate = 5
min_clip_value = -10
max_clip_value = 10
We will then define the weights of the network:
U = np.random.uniform(0, 1, (hidden_dim, T))
W = np.random.uniform(0, 1, (hidden_dim, hidden_dim))
V = np.random.uniform(0, 1, (output_dim, hidden_dim))
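Here, U is the weight matrix between the input and hidden layers, W is the weight matrix for the recurrent connections within the hidden layer (shared across time steps), and V is the weight matrix between the hidden and output layers.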
Finally, we will define the activation function, sigmoid, to be used in the hidden layer:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
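The backward pass later needs the derivative of this activation. As a short sketch, the standard logistic sigmoid is 1 / (1 + exp(-x)) and its derivative is s * (1 - s), where s is the sigmoid output. The helper name sigmoid_grad is an assumption for this example; the finite-difference comparison is only a sanity check.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid, expressed through its own output."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-4.0, 4.0, 9)
eps = 1e-6
# Central finite difference as an independent estimate of the derivative.
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(np.max(np.abs(numeric - sigmoid_grad(x))))  # should be very close to zero
```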
Now that we have defined our model, we can finally move on with training it on our sequence data. We can subdivide the training process into smaller steps, namely:
Step 2.1 : Check the loss on training data
Step 2.1.1 : Forward Pass
Step 2.1.2 : Calculate Error
Step 2.2 : Check the loss on validation data
Step 2.2.1 : Forward Pass
Step 2.2.2 : Calculate Error
Step 2.3 : Start actual training
Step 2.3.1 : Forward Pass
Step 2.3.2 : Backpropagate Error
Step 2.3.3 : Update weights
We need to repeat these steps until convergence. If the model starts to overfit, stop! Or simply predefine the number of epochs.
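One common way to turn "stop when the model starts to overfit" into code is patience-based early stopping on the validation loss. The sketch below is an illustration under assumptions: the function name, the patience and min_delta parameters, and the loss history are all made up for this example and are not part of the training loop in this article.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return the 1-based epoch at which training would stop,
    or None if the validation loss keeps improving often enough."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss       # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1       # no improvement this epoch
            if waited >= patience:
                return epoch
    return None

# Illustrative validation losses: the best value is reached at epoch 3.
history = [0.90, 0.60, 0.45, 0.47, 0.46, 0.48, 0.50]
print(early_stopping_epoch(history))  # 6
```

Predefining the number of epochs, as done here with nepoch, is the simpler alternative.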
We will do a forward pass through our RNN model and calculate the squared error for the predictions for all records in order to get the loss value.
for epoch in range(nepoch):
    # check loss on train
    loss = 0.0

    # do a forward pass to get prediction
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]                    # get input, output values of each record
        prev_s = np.zeros((hidden_dim, 1))   # prev_s is the previous activation of the hidden layer, initialized to zeros
        for t in range(T):
            new_input = np.zeros(x.shape)    # we then do a forward pass for every timestep in the sequence
            new_input[t] = x[t]              # for this, we define a single input for that timestep
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        # calculate error
        loss_per_record = (y - mulv)**2 / 2
        loss += loss_per_record
    # y holds a single value, so y.shape[0] is 1 and loss stays the summed error
    loss = loss / float(y.shape[0])
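It may help to see what the masked input does in this forward pass. Because new_input is zero everywhere except at position t, the product of U and new_input reduces to column t of U scaled by x[t]. The sketch below checks this with random values (the seed and the choice t = 7 are arbitrary assumptions). One consequence is that each time step uses its own column of U, so the input weights are not shared across time steps the way they are in a textbook RNN.

```python
import numpy as np

rng = np.random.default_rng(0)
T, hidden_dim = 50, 100
U = rng.uniform(0, 1, (hidden_dim, T))
x = rng.uniform(-1, 1, (T, 1))           # one record, shape (T, 1)

t = 7
new_input = np.zeros(x.shape)
new_input[t] = x[t]                       # everything except step t is zero

masked_product = np.dot(U, new_input)     # what the loop computes
column_product = U[:, [t]] * x[t, 0]      # column t of U scaled by x[t]

print(np.allclose(masked_product, column_product))  # True
```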
We will do the same thing for calculating the loss on validation data (in the same loop):
    # check loss on val
    val_loss = 0.0
    for i in range(Y_val.shape[0]):
        x, y = X_val[i], Y_val[i]
        prev_s = np.zeros((hidden_dim, 1))
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            prev_s = s

        loss_per_record = (y - mulv)**2 / 2
        val_loss += loss_per_record
    val_loss = val_loss / float(y.shape[0])

    print('Epoch: ', epoch + 1, ', Loss: ', loss, ', Val Loss: ', val_loss)
You should get the below output:
Epoch: 1 , Loss: [[101185.61756671]] , Val Loss: [[50591.0340148]] … …
We will now start with the actual training of the network. In this, we will first do a forward pass to calculate the errors and then a backward pass to calculate the gradients and update the weights. Let me show you these step by step so you can visualize how it works.
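In the forward pass, each time step multiplies the input by U, adds the product of W and the previous hidden state, passes the sum through the sigmoid, and multiplies the result by V to get a linear output. The current and previous hidden states are saved at every step so they can be used in the backward pass.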
Here is the code for doing a forward pass (note that it is in continuation of the above loop):
    # train model
    for i in range(Y.shape[0]):
        x, y = X[i], Y[i]

        layers = []
        prev_s = np.zeros((hidden_dim, 1))
        dU = np.zeros(U.shape)
        dV = np.zeros(V.shape)
        dW = np.zeros(W.shape)

        dU_t = np.zeros(U.shape)
        dV_t = np.zeros(V.shape)
        dW_t = np.zeros(W.shape)

        dU_i = np.zeros(U.shape)
        dW_i = np.zeros(W.shape)

        # forward pass
        for t in range(T):
            new_input = np.zeros(x.shape)
            new_input[t] = x[t]
            mulu = np.dot(U, new_input)
            mulw = np.dot(W, prev_s)
            add = mulw + mulu
            s = sigmoid(add)
            mulv = np.dot(V, s)
            layers.append({'s': s, 'prev_s': prev_s})
            prev_s = s
After the forward propagation step, we calculate the gradients at each layer and backpropagate the errors. We will use truncated backpropagation through time (TBPTT) instead of vanilla backprop. It may sound complex, but it’s actually pretty straightforward.
The core difference in BPTT versus backprop is that the backpropagation step is done for all the time steps in the RNN layer. So if our sequence length is 50, we will backpropagate for all the timesteps previous to the current timestep.
As you may have guessed, BPTT is computationally expensive. So instead of backpropagating through all previous timesteps, we backpropagate only up to a fixed number of timesteps (bptt_truncate in the code) to save computation. Consider this conceptually similar to mini-batch stochastic gradient descent, where each update uses a batch of data points instead of all the data points.
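A quick way to see the truncation is to list which earlier time steps are visited when backpropagating from a given step t. This sketch assumes the window is "at most bptt_truncate steps before t, never below step 0"; the helper name truncated_steps is hypothetical.

```python
def truncated_steps(t, bptt_truncate):
    """Earlier time steps visited when backpropagating from step t."""
    return list(range(t - 1, max(-1, t - bptt_truncate - 1), -1))

print(truncated_steps(49, 5))  # [48, 47, 46, 45, 44]
print(truncated_steps(2, 5))   # [1, 0]
print(truncated_steps(0, 5))   # []
```

With a sequence length of 50 and a truncation of 5, the last step looks back over 5 steps instead of 49.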
Here is the code for backpropagating the errors:
        # derivative of pred
        dmulv = (mulv - y)

        # backward pass
        for t in range(T):
            dV_t = np.dot(dmulv, np.transpose(layers[t]['s']))
            dsv = np.dot(np.transpose(V), dmulv)

            ds = dsv
            dadd = add * (1 - add) * ds

            dmulw = dadd * np.ones_like(mulw)

            dprev_s = np.dot(np.transpose(W), dmulw)

            # step back through at most bptt_truncate earlier timesteps
            for i in range(t-1, max(-1, t-bptt_truncate-1), -1):
                ds = dsv + dprev_s
                dadd = add * (1 - add) * ds

                dmulw = dadd * np.ones_like(mulw)
                dmulu = dadd * np.ones_like(mulu)

                dW_i = np.dot(W, layers[t]['prev_s'])
                dprev_s = np.dot(np.transpose(W), dmulw)

                new_input = np.zeros(x.shape)
                new_input[t] = x[t]
                dU_i = np.dot(U, new_input)
                dx = np.dot(np.transpose(U), dmulu)

                dU_t += dU_i
                dW_t += dW_i

            dV += dV_t
            dU += dU_t
            dW += dW_t
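For comparison, here is a sketch of textbook (untruncated) BPTT for a small many-to-one RNN, verified against a finite-difference estimate. It is not the update scheme used in the code above. It makes its own assumptions: a scalar input per time step, input weights U of shape (hidden, 1) shared across time (unlike the (hidden_dim, T) matrix in this article), and the hypothetical helper names forward, loss_fn and bptt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, U, W, V):
    """Return the scalar prediction and the hidden states s_0 .. s_T."""
    states = [np.zeros((W.shape[0], 1))]
    for x_t in x:
        states.append(sigmoid(U * x_t + W @ states[-1]))
    return (V @ states[-1]).item(), states

def loss_fn(x, y, U, W, V):
    pred, _ = forward(x, U, W, V)
    return 0.5 * (pred - y) ** 2

def bptt(x, y, U, W, V):
    """Gradients of the squared-error loss via full backpropagation through time."""
    pred, states = forward(x, U, W, V)
    dU, dW = np.zeros_like(U), np.zeros_like(W)
    d_pred = pred - y                      # derivative of the loss w.r.t. the prediction
    dV = d_pred * states[-1].T
    ds = V.T * d_pred                      # gradient reaching the last hidden state
    for t in range(len(x), 0, -1):
        s_t, s_prev = states[t], states[t - 1]
        da = ds * s_t * (1.0 - s_t)        # back through the sigmoid
        dU += da * x[t - 1]
        dW += da @ s_prev.T
        ds = W.T @ da                      # pass the gradient to the previous step
    return dU, dW, dV

rng = np.random.default_rng(0)
H, T = 4, 6
U = rng.normal(scale=0.5, size=(H, 1))
W = rng.normal(scale=0.5, size=(H, H))
V = rng.normal(scale=0.5, size=(1, H))
x = np.sin(np.arange(T, dtype=float))
y = float(np.sin(T))

dU, dW, dV = bptt(x, y, U, W, V)

# Finite-difference check on one recurrent weight.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss_fn(x, y, U, W_plus, V) - loss_fn(x, y, U, W_minus, V)) / (2 * eps)
print(abs(numeric - dW[1, 2]))  # should be close to zero
```

A gradient check like this is a cheap way to validate any hand-written backward pass before training with it.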
Lastly, we update the weights using the gradients we calculated. One thing to keep in mind is that the gradients tend to explode if you don’t keep them in check. This is a fundamental issue in training neural networks, called the exploding gradient problem. So we clamp the gradients to a fixed range so that they don’t explode. We can do it like this:
            if dU.max() > max_clip_value:
                dU[dU > max_clip_value] = max_clip_value
            if dV.max() > max_clip_value:
                dV[dV > max_clip_value] = max_clip_value
            if dW.max() > max_clip_value:
                dW[dW > max_clip_value] = max_clip_value

            if dU.min() < min_clip_value:
                dU[dU < min_clip_value] = min_clip_value
            if dV.min() < min_clip_value:
                dV[dV < min_clip_value] = min_clip_value
            if dW.min() < min_clip_value:
                dW[dW < min_clip_value] = min_clip_value

        # update
        U -= learning_rate * dU
        V -= learning_rate * dV
        W -= learning_rate * dW
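The same element-wise clamping can be written more compactly with np.clip. A small sketch, assuming a clipping range of -10 to 10 and a made-up gradient matrix:

```python
import numpy as np

min_clip_value, max_clip_value = -10, 10   # assumed clipping range

grad = np.array([[-25.0, 3.0],
                 [0.5, 42.0]])             # illustrative gradient values

# Values below the lower bound or above the upper bound are replaced by the bound.
clipped = np.clip(grad, min_clip_value, max_clip_value)
print(clipped)
```

Values already inside the range (3.0 and 0.5 here) are left untouched, while -25.0 becomes -10.0 and 42.0 becomes 10.0.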
On training the above model, we get this output:
Epoch: 1 , Loss: [[101185.61756671]] , Val Loss: [[50591.0340148]]
Epoch: 2 , Loss: [[61205.46869629]] , Val Loss: [[30601.34535365]]
Epoch: 3 , Loss: [[31225.3198258]] , Val Loss: [[15611.65669247]]
Epoch: 4 , Loss: [[11245.17049551]] , Val Loss: [[5621.96780111]]
Epoch: 5 , Loss: [[1264.5157739]] , Val Loss: [[632.02563908]]
Epoch: 6 , Loss: [[20.15654115]] , Val Loss: [[10.05477285]]
Epoch: 7 , Loss: [[17.13622839]] , Val Loss: [[8.55190426]]
Epoch: 8 , Loss: [[17.38870495]] , Val Loss: [[8.68196484]]
Epoch: 9 , Loss: [[17.181681]] , Val Loss: [[8.57837827]]
Epoch: 10 , Loss: [[17.31275313]] , Val Loss: [[8.64199652]]
Epoch: 11 , Loss: [[17.12960034]] , Val Loss: [[8.54768294]]
Epoch: 12 , Loss: [[17.09020065]] , Val Loss: [[8.52993502]]
Epoch: 13 , Loss: [[17.17370113]] , Val Loss: [[8.57517454]]
Epoch: 14 , Loss: [[17.04906914]] , Val Loss: [[8.50658127]]
Epoch: 15 , Loss: [[16.96420184]] , Val Loss: [[8.46794248]]
Epoch: 16 , Loss: [[17.017519]] , Val Loss: [[8.49241316]]
Epoch: 17 , Loss: [[16.94199493]] , Val Loss: [[8.45748739]]
Epoch: 18 , Loss: [[16.99796892]] , Val Loss: [[8.48242177]]
Epoch: 19 , Loss: [[17.24817035]] , Val Loss: [[8.6126231]]
Epoch: 20 , Loss: [[17.00844599]] , Val Loss: [[8.48682234]]
Epoch: 21 , Loss: [[17.03943262]] , Val Loss: [[8.50437328]]
Epoch: 22 , Loss: [[17.01417255]] , Val Loss: [[8.49409597]]
Epoch: 23 , Loss: [[17.20918888]] , Val Loss: [[8.5854792]]
Epoch: 24 , Loss: [[16.92068017]] , Val Loss: [[8.44794633]]
Epoch: 25 , Loss: [[16.76856238]] , Val Loss: [[8.37295808]]
Looking good! Time to get the predictions and plot them to get a visual sense of what we’ve designed.
We will do a forward pass through the trained weights to get our predictions:
preds = []
for i in range(Y.shape[0]):
    x, y = X[i], Y[i]
    prev_s = np.zeros((hidden_dim, 1))
    # Forward pass
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s

    preds.append(mulv)

preds = np.array(preds)
Plotting these predictions alongside the actual values:
plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y[:, 0], 'r')
plt.show()
preds = []
for i in range(Y_val.shape[0]):
    x, y = X_val[i], Y_val[i]
    prev_s = np.zeros((hidden_dim, 1))
    # For each time step...
    for t in range(T):
        mulu = np.dot(U, x)
        mulw = np.dot(W, prev_s)
        add = mulw + mulu
        s = sigmoid(add)
        mulv = np.dot(V, s)
        prev_s = s

    preds.append(mulv)

preds = np.array(preds)

plt.plot(preds[:, 0, 0], 'g')
plt.plot(Y_val[:, 0], 'r')
plt.show()
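from sklearn.metrics import mean_squared_error

# max_val is not defined anywhere above; the sine wave was never rescaled,
# so a scale factor of 1 is assumed here.
max_val = 1
math.sqrt(mean_squared_error(Y_val[:, 0] * max_val, preds[:, 0, 0] * max_val))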
0.127191931509431
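The same metric can be computed with NumPy alone, which makes the formula explicit: the root mean squared error is the square root of the mean of the squared differences. A minimal sketch; the helper name rmse and the sample values are assumptions for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Only the last pair differs (by 2), so the result is sqrt(4 / 3).
print(rmse([0.0, 1.0, 2.0], [0.0, 1.0, 4.0]))
```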
I cannot stress enough how useful RNNs are when working with sequence data. I implore you all to take this learning and apply it to a dataset. Take an NLP problem and see if you can find a solution for it. You can always reach out to me in the comments section below if you have any questions.
In this article, we learned how to create a recurrent neural network model from scratch by using just the numpy library. You can of course use a high-level library like Keras or Caffe, but it is essential to know the concept you’re implementing.
Do share your thoughts, questions and feedback regarding this article below. Happy learning!
Great article! How would the code need to be modified if more than one time series are used to make a prediction? For example:
- predict next day temperature using the last 50-day temperature and last 50-day humidity level; or,
- predict next day temperature and next day humidity level using the last 50-day temperature and last 50-day humidity level
Thank you, Guy Aubin
This article is useful! But there is a little question I want to ask. In the last cell, ‘math.sqrt(mean_squared_error(Y_val[:, 0] * max_val, preds[:, 0, 0] * max_val))’ The object ‘max_val’ seems not be defined in the above code. What is the value(or meaning) of ‘max_val’? Thank you!
Hi, Thanks for a great note. I am getting an error while running the code. The error is coming from the last line. –>math.sqrt(mean_squared_error(Y_val[:, 0] * max_val, preds[:, 0, 0] * max_val)) How did you define “max_val” here?
Hi How can I work it With audio Data
Yes, I also greatly enjoy the explanations. I’ve tried the code and it works well except that prediction and actual signals are phase quadrature signals which doesn’t appear in your graphics
Thank you for this excellent article. Just wondering, shouldn’t this “new_input = np.zeros(x.shape)” come outside the for loop ? It would preserve the sequence and help the context vector ?
How to predict future values, you have used Train and Test data to predict, but how will you predict future say 20 values ?
How to predict future values , say for next 30 values ?
# derivative of pred
dmulv = (mulv - y)
# backward pass
for t in range(T):
    dV_t = np.dot(dmulv, np.transpose(layers[t]['s']))
    dsv = np.dot(np.transpose(V), dmulv)
quick question here: can we back propagate the derivative of prediction loss back to every t in the sequence in the many-to-one setting?
at the end: what is max_val?
Hi, Thanks for the article. Does RNN use one-hot encoding in each time step for time series data forecasting? for instance, input=[10,20, 30] In 1st time step input is [10, 0, 0], In 2nd time step input is [0, 20, 0], and In 3rd time step input is [0, 0, 30] Isn’t it? If yes, could you please share reference if you have it. Thanks in advance.
Can we use the same code for DNA Sequence?
Above End Notes, in the equations what is the max_val?
Thanks for the tutorial. How would we include static data as well as sequential data?
Hello, I’m very interested in the neural network code. However, the code on replit does not load. How else can you see this code?
Is there some reason why loss_per_record = (y – mulv)**2 / 2, not that loss_per_record = (y – mulv)**2 ?
Recurrent Neural Network Tutorial (RNN)
Training Process of the GRU model
If you notice the accuracy difference, the model performed better in LSTM and GRU cases than in simpler ones. We shall conclude this article by discussing the applications, where RNNs are used widely.
Last Updated: 30 Dec, 2022