Why use the backpropagation algorithm?
Earlier we discussed that the network is trained using 2 passes: forward and backward. At the end of the forward pass, the network error is calculated, and should be as small as possible.
If the current error is high, the network didn’t learn properly from the data. What does this mean? It means that the current set of weights isn’t accurate enough to reduce the network error and make accurate predictions. As a result, we should update network weights to reduce the network error.
The backpropagation algorithm is one of the algorithms responsible for updating network weights with the objective of reducing the network error. It’s quite important.
Here are some of the advantages of the backpropagation algorithm:
 It’s memory-efficient in calculating the derivatives, as it uses less memory compared to other optimization algorithms, like the genetic algorithm. This is a very important feature, especially with large networks.
 The backpropagation algorithm is fast, especially for small and medium-sized networks. As more layers and neurons are added, it starts to get slower as more derivatives are calculated.
 This algorithm is generic enough to work with different network architectures, like convolutional neural networks, generative adversarial networks, fully-connected networks, and more.
 There are no parameters to tune the backpropagation algorithm, so there’s less overhead. The only parameters in the process are related to the gradient descent algorithm, like learning rate.
Next, let’s see how the backpropagation algorithm works, based on a mathematical example.
5.3. Forward Propagation
Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables (including outputs) for a neural network in order from the input layer to the output layer. We now work step-by-step through the mechanics of a neural network with one hidden layer. This may seem tedious but in the eternal words of funk virtuoso James Brown, you must “pay the cost to be the boss”.
For the sake of simplicity, let’s assume that the input example is \(\mathbf{x}\in \mathbb{R}^d\) and that our hidden layer does not include a bias term. Here the intermediate variable is:
\[\mathbf{z} = \mathbf{W}^{(1)} \mathbf{x},\]
where \(\mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}\) is the weight parameter of the hidden layer. After running the intermediate variable \(\mathbf{z}\in \mathbb{R}^h\) through the activation function \(\phi\) we obtain our hidden activation vector of length \(h\):
\[\mathbf{h} = \phi(\mathbf{z}).\]
The hidden layer output \(\mathbf{h}\) is also an intermediate variable. Assuming that the parameters of the output layer possess only a weight of \(\mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}\), we can obtain an output layer variable with a vector of length \(q\):
\[\mathbf{o} = \mathbf{W}^{(2)} \mathbf{h}.\]
Assuming that the loss function is \(l\) and the example label is \(y\), we can then calculate the loss term for a single data example,
\[L = l(\mathbf{o}, y).\]
As we will see in the definition of \(\ell_2\) regularization to be introduced later, given the hyperparameter \(\lambda\), the regularization term is
\[s = \frac{\lambda}{2} \left(\|\mathbf{W}^{(1)}\|_\textrm{F}^2 + \|\mathbf{W}^{(2)}\|_\textrm{F}^2\right),\]
where the Frobenius norm of the matrix is simply the \(\ell_2\) norm applied after flattening the matrix into a vector. Finally, the model’s regularized loss on a given data example is:
\[J = L + s.\]
We refer to \(J\) as the objective function in the following discussion.
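As a rough illustration, the forward pass described above can be sketched in plain Python. The text leaves \(\phi\) and \(l\) unspecified, so the ReLU activation, the half squared-error loss, the layer sizes, and every numeric value below are assumptions chosen only for the example.

```python
def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def frobenius_sq(W):
    """Squared Frobenius norm: the sum of all squared entries."""
    return sum(w * w for row in W for w in row)

def forward(x, y, W1, W2, lam):
    z = matvec(W1, x)                     # z = W1 x
    h = [max(0.0, z_i) for z_i in z]      # h = phi(z); phi assumed to be ReLU
    o = matvec(W2, h)                     # o = W2 h
    # l assumed to be half the squared error between o and y
    L = 0.5 * sum((o_i - y_i) ** 2 for o_i, y_i in zip(o, y))
    s = 0.5 * lam * (frobenius_sq(W1) + frobenius_sq(W2))  # regularization term
    J = L + s                             # objective function
    return z, h, o, L, s, J

# Hypothetical sizes: d = 2 inputs, h = 3 hidden units, q = 1 output.
x = [1.0, 2.0]
y = [1.0]
W1 = [[0.5, -0.5], [0.25, 0.25], [-1.0, 0.5]]
W2 = [[1.0, -1.0, 0.5]]
z, h, o, L, s, J = forward(x, y, W1, W2, lam=0.1)
print(z, h, o, L, s, J)
```

Every intermediate value is returned on purpose: the backward pass needs them, which is why forward propagation is described as calculation and storage.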
There are multiple libraries (PyTorch, TensorFlow) that can assist you in implementing almost any neural network architecture. This article is not about solving a neural net using one of those libraries. There are already plenty of articles and videos on that. In this article, we’ll see a step-by-step forward pass (forward propagation) and backward pass (backpropagation) example. We’ll be taking a single hidden layer neural network and solving one complete cycle of forward propagation and backpropagation.
Getting to the point, we will work step by step to understand how weights are updated in neural networks. A neural network learns by updating its weight parameters during the training phase. Several concepts are needed to fully understand the working mechanism of neural networks: linear algebra, probability, and calculus. I’ll try my best to revisit calculus for the chain rule concept. I will set aside the linear algebra (vectors, matrices, tensors) for this article. We’ll work through each and every computation, and in the end we’ll update all the weights of the example neural network for one complete cycle of forward propagation and backpropagation. Let’s get started.
Here’s a simple neural network on which we’ll be working.
I think the above example neural network is self-explanatory. There are two units in the Input Layer, two units in the Hidden Layer and two units in the Output Layer. The w1, w2, w3, …, w8 represent the respective weights. b1 and b2 are the biases for the Hidden Layer and the Output Layer, respectively.
In this article, we’ll be passing two inputs i1 and i2, performing a forward pass to compute the total error, and then a backward pass to distribute the error inside the network and update the weights accordingly.
Before getting started, let us deal with two basic concepts which should be sufficient to comprehend this article.
Peeking inside a single neuron
Inside a unit, two operations happen: (i) computation of the weighted sum and (ii) squashing of the weighted sum using an activation function. The result from the activation function becomes an input to the next layer (until the Output Layer is reached). In this example, we’ll be using the Sigmoid function (Logistic function) as the activation function. The Sigmoid function takes an input and squashes the value into the range between 0 and +1. We’ll discuss the activation functions in later articles. For now, note that both operations happen inside every unit of a neural network. We can suppose the input layer to have a linear function that produces the same value as the input.
Chain Rule in Calculus
If we have y = f(u) and u = g(x) then we can write the derivative of y as:
\frac{dy}{dx} = \frac{dy}{du} * \frac{du}{dx}
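A small numerical check can make the chain rule concrete. The functions `f` and `g` below are hypothetical choices for illustration (they do not come from the example network); the sketch compares the chain-rule product with a finite-difference estimate.

```python
def g(x):
    return 3.0 * x + 1.0      # u = g(x), so du/dx = 3

def f(u):
    return u ** 2             # y = f(u), so dy/du = 2u

def dy_dx_chain(x):
    u = g(x)
    return (2.0 * u) * 3.0    # dy/du * du/dx

def dy_dx_numeric(x, eps=1e-6):
    # Central finite difference on the composed function f(g(x)).
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x = 2.0
analytic = dy_dx_chain(x)     # 2 * 7 * 3 = 42
numeric = dy_dx_numeric(x)
print(analytic, numeric)
```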
Types of backpropagation
There are 2 main types of the backpropagation algorithm:
 Traditional backpropagation is used for static problems with a fixed input and a fixed output all the time, like predicting the class of an image. In this case, the input image and the output class never change.
 Backpropagation through time (BPTT) targets non-static problems that change over time. It’s applied in time-series models, like recurrent neural networks (RNN).
5.3. Backpropagation
Backpropagation refers to the method of calculating the gradient of neural network parameters. In short, the method traverses the network in reverse order, from the output to the input layer, according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) required while calculating the gradient with respect to some parameters. Assume that we have functions \(\mathsf{Y}=f(\mathsf{X})\) and \(\mathsf{Z}=g(\mathsf{Y})\), in which the input and the output \(\mathsf{X}, \mathsf{Y}, \mathsf{Z}\) are tensors of arbitrary shapes. By using the chain rule, we can compute the derivative of \(\mathsf{Z}\) with respect to \(\mathsf{X}\) via
\[\frac{\partial \mathsf{Z}}{\partial \mathsf{X}} = \textrm{prod}\left(\frac{\partial \mathsf{Z}}{\partial \mathsf{Y}}, \frac{\partial \mathsf{Y}}{\partial \mathsf{X}}\right).\]
Here we use the \(\textrm{prod}\) operator to multiply its arguments after the necessary operations, such as transposition and swapping input positions, have been carried out. For vectors, this is straightforward: it is simply matrix–matrix multiplication. For higher dimensional tensors, we use the appropriate counterpart. The operator \(\textrm{prod}\) hides all the notational overhead.
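Recall that the parameters of the simple network with one hidden layer, whose computational graph is in Fig. 5.3.1, are \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\). The objective of backpropagation is to calculate the gradients \(\partial J/\partial \mathbf{W}^{(1)}\) and \(\partial J/\partial \mathbf{W}^{(2)}\). To accomplish this, we apply the chain rule and calculate, in turn, the gradient of each intermediate variable and parameter. The order of calculations is reversed relative to those performed in forward propagation, since we need to start with the outcome of the computational graph and work our way towards the parameters. The first step is to calculate the gradients of the objective function \(J=L+s\) with respect to the loss term \(L\) and the regularization term \(s\):
\[\frac{\partial J}{\partial L} = 1 \; \textrm{and} \; \frac{\partial J}{\partial s} = 1.\]
Next, we compute the gradient of the objective function with respect to the variable of the output layer \(\mathbf{o}\) according to the chain rule:
\[\frac{\partial J}{\partial \mathbf{o}} = \textrm{prod}\left(\frac{\partial J}{\partial L}, \frac{\partial L}{\partial \mathbf{o}}\right) = \frac{\partial L}{\partial \mathbf{o}} \in \mathbb{R}^q.\]
Next, we calculate the gradients of the regularization term with respect to both parameters:
\[\frac{\partial s}{\partial \mathbf{W}^{(1)}} = \lambda \mathbf{W}^{(1)} \; \textrm{and} \; \frac{\partial s}{\partial \mathbf{W}^{(2)}} = \lambda \mathbf{W}^{(2)}.\]
Now we are able to calculate the gradient \(\partial J/\partial \mathbf{W}^{(2)} \in \mathbb{R}^{q \times h}\) of the model parameters closest to the output layer. Using the chain rule yields:
\[\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{W}^{(2)}}\right) + \textrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(2)}}\right) = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}.\]
To obtain the gradient with respect to \(\mathbf{W}^{(1)}\) we need to continue backpropagation along the output layer to the hidden layer. The gradient with respect to the hidden layer output \(\partial J/\partial \mathbf{h} \in \mathbb{R}^h\) is given by
\[\frac{\partial J}{\partial \mathbf{h}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{o}}, \frac{\partial \mathbf{o}}{\partial \mathbf{h}}\right) = {\mathbf{W}^{(2)}}^\top \frac{\partial J}{\partial \mathbf{o}}.\]
Since the activation function \(\phi\) applies elementwise, calculating the gradient \(\partial J/\partial \mathbf{z} \in \mathbb{R}^h\) of the intermediate variable \(\mathbf{z}\) requires that we use the elementwise multiplication operator, which we denote by \(\odot\):
\[\frac{\partial J}{\partial \mathbf{z}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{h}}, \frac{\partial \mathbf{h}}{\partial \mathbf{z}}\right) = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'\left(\mathbf{z}\right).\]
Finally, we can obtain the gradient \(\partial J/\partial \mathbf{W}^{(1)} \in \mathbb{R}^{h \times d}\) of the model parameters closest to the input layer. According to the chain rule, we get
\[\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \textrm{prod}\left(\frac{\partial J}{\partial \mathbf{z}}, \frac{\partial \mathbf{z}}{\partial \mathbf{W}^{(1)}}\right) + \textrm{prod}\left(\frac{\partial J}{\partial s}, \frac{\partial s}{\partial \mathbf{W}^{(1)}}\right) = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}.\]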
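The backward pass for this one-hidden-layer network can be sketched in plain Python and checked against finite differences. This is only an illustration under stated assumptions: \(\phi\) is taken to be tanh (so \(\phi'(z) = 1 - \tanh^2 z\)), \(l\) is taken to be half the squared error (so \(\partial L/\partial \mathbf{o} = \mathbf{o} - \mathbf{y}\)), and all sizes and numbers are made up.

```python
import math

def matvec(W, v):
    return [sum(w * v_j for w, v_j in zip(row, v)) for row in W]

def outer(a, b):
    """Outer product a b^T as a list of rows."""
    return [[a_i * b_j for b_j in b] for a_i in a]

def objective(x, y, W1, W2, lam):
    z = matvec(W1, x)
    h = [math.tanh(z_i) for z_i in z]             # phi assumed to be tanh
    o = matvec(W2, h)
    L = 0.5 * sum((o_i - y_i) ** 2 for o_i, y_i in zip(o, y))  # assumed loss
    s = 0.5 * lam * sum(w * w for W in (W1, W2) for row in W for w in row)
    return z, h, o, L + s

def backward(x, y, W1, W2, lam):
    z, h, o, _ = objective(x, y, W1, W2, lam)
    dJ_do = [o_i - y_i for o_i, y_i in zip(o, y)]  # dL/do for the assumed loss
    # dJ/dW2 = dJ/do h^T + lam * W2
    dJ_dW2 = [[g + lam * w for g, w in zip(g_row, w_row)]
              for g_row, w_row in zip(outer(dJ_do, h), W2)]
    # dJ/dh = W2^T dJ/do
    dJ_dh = [sum(W2[k][j] * dJ_do[k] for k in range(len(W2)))
             for j in range(len(h))]
    # dJ/dz = dJ/dh (elementwise) phi'(z), with tanh'(z) = 1 - tanh(z)^2
    dJ_dz = [g * (1.0 - h_j ** 2) for g, h_j in zip(dJ_dh, h)]
    # dJ/dW1 = dJ/dz x^T + lam * W1
    dJ_dW1 = [[g + lam * w for g, w in zip(g_row, w_row)]
              for g_row, w_row in zip(outer(dJ_dz, x), W1)]
    return dJ_dW1, dJ_dW2

def numeric_grad(x, y, W1, W2, lam, which, i, j, eps=1e-6):
    """Central finite difference of J with respect to one weight entry."""
    W = W1 if which == 1 else W2
    old = W[i][j]
    W[i][j] = old + eps
    J_plus = objective(x, y, W1, W2, lam)[3]
    W[i][j] = old - eps
    J_minus = objective(x, y, W1, W2, lam)[3]
    W[i][j] = old                                  # restore the weight
    return (J_plus - J_minus) / (2.0 * eps)

x, y, lam = [1.0, 2.0], [1.0], 0.1
W1 = [[0.5, -0.5], [0.25, 0.25], [-1.0, 0.5]]
W2 = [[1.0, -1.0, 0.5]]
dJ_dW1, dJ_dW2 = backward(x, y, W1, W2, lam)
max_err = max(
    [abs(dJ_dW1[i][j] - numeric_grad(x, y, W1, W2, lam, 1, i, j))
     for i in range(3) for j in range(2)] +
    [abs(dJ_dW2[0][j] - numeric_grad(x, y, W1, W2, lam, 2, 0, j))
     for j in range(3)]
)
print(max_err)
```

The gradients have the same shapes as the parameters (\(h \times d\) and \(q \times h\)), and a small `max_err` suggests the analytic formulas agree with the numerical estimate.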
Forward and backward passes in Neural Networks
To train a neural network, there are 2 passes (phases):
 Forward
 Backward
In the forward pass, we start by propagating the data inputs to the input layer, go through the hidden layer(s), measure the network’s predictions from the output layer, and finally calculate the network error based on the predictions the network made.
This network error measures how far the network is from making the correct prediction. For example, if the correct output is 4 and the network’s prediction is 1.3, then the absolute error of the network is 4 - 1.3 = 2.7. Note that the process of propagating the inputs from the input layer to the output layer is called forward propagation. Once the network error is calculated, the forward propagation phase has ended, and the backward pass starts.
The next figure shows a red arrow pointing in the direction of the forward propagation.
In the backward pass, the flow is reversed: we propagate the error from the output layer, through the hidden layer(s), until reaching the input layer. The process of propagating the network error from the output layer to the input layer is called backward propagation, or simply backpropagation. The backpropagation algorithm is the set of steps used to update network weights to reduce the network error.
In the next figure, the blue arrow points in the direction of backward propagation.
The forward and backward phases are repeated for a number of epochs. In each epoch, the following occurs:
 The inputs are propagated from the input to the output layer.
 The network error is calculated.
 The error is propagated from the output layer to the input layer.
We will focus on the backpropagation phase. Let’s discuss the advantages of using the backpropagation algorithm.
How backpropagation algorithm works
How the algorithm works is best explained based on a simple network, like the one given in the next figure. It only has an input layer with 2 inputs (X1 and X2), and an output layer with 1 output. There are no hidden layers.
The weights of the inputs are W1 and W2, respectively. The bias is treated as a new input neuron to the output neuron which has a fixed value +1 and a weight b. Both the weights and biases could be referred to as parameters.
Let’s assume that the output layer uses the sigmoid activation function defined by the following equation:
f(s) = 1/(1 + e^(-s))
where s is the sum of products (SOP) between each input and its corresponding weight, plus the bias:
s = X1*W1 + X2*W2 + b
To make things simple, a single training sample is used in this example. The next table shows the single training sample with the input and its corresponding desired (i.e. correct) output for the sample. In practice, more training examples are used.
| X1 | X2 | Desired Output |
| --- | --- | --- |
| 0.1 | 0.3 | 0.03 |
Assume that the initial values for both weights and bias are as given in the next table.
| W1 | W2 | b |
| --- | --- | --- |
| 0.5 | 0.2 | 1.83 |
For simplicity, the values for all inputs, weights, and bias will be added to the network diagram.
Now, let’s train the network and see how the network will predict the output of the sample based on the current parameters.
As we discussed previously, the training process has 2 phases, forward and backward.
Forward pass
The input of the activation function will be the SOP between each input and its weight. The SOP is then added to the bias to return the output of the neuron:
s=X1* W1+ X2*W2+b
s=0.1* 0.5+ 0.3*0.2+1.83
s=1.94
The value 1.94 is then applied to the activation function (sigmoid), which results in the value 0.874352143.
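The forward pass of this single neuron is short enough to reproduce directly. The following is a minimal sketch using only the values given in the tables; the function and variable names are illustrative.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

X1, X2 = 0.1, 0.3            # inputs of the training sample
W1, W2, b = 0.5, 0.2, 1.83   # initial weights and bias

s = X1 * W1 + X2 * W2 + b    # sum of products plus bias
predicted = sigmoid(s)       # output of the neuron
print(round(s, 2), round(predicted, 9))
```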
The output of the activation function from the output neuron reflects the predicted output of the sample. It’s obvious that there’s a difference between the desired and predicted output. But why? How do we make the predicted output closer to the desired output? We’ll answer these questions later. For now, let’s see the error of our network based on an error function.
An error function tells how close the predicted output(s) are to the desired output(s). The optimal value for the error is zero, meaning there’s no error at all, and both desired and predicted results are identical. One of the error functions is the squared error function, as defined in the next equation:
E = 1/2 (desired - predicted)^2
Note that the factor 1/2 in the equation simplifies the derivative calculations in the backpropagation algorithm.
Based on the error function, we can measure the error of our network as follows:
E = 1/2 (0.03 - 0.874352143)^2
= 1/2 (-0.844352143)^2
= 1/2 (0.712930541)
= 0.356465271
The result shows that there is an error, and a large one: (~0.357). The error just gives us an indication of how far the predicted results are from the desired results.
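The error calculation can be checked with a few lines of Python. This sketch assumes the squared error function with the 1/2 factor described above; `squared_error` is an illustrative name.

```python
import math

def squared_error(desired, predicted):
    # The 1/2 factor cancels the exponent when the derivative is taken.
    return 0.5 * (desired - predicted) ** 2

s = 0.1 * 0.5 + 0.3 * 0.2 + 1.83          # SOP from the forward pass
predicted = 1.0 / (1.0 + math.exp(-s))    # sigmoid output
error = squared_error(0.03, predicted)
print(round(error, 9))
```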
Knowing that there’s an error, what should we do? We should minimize it. To minimize network error, we must change something in the network. Remember that the only parameters we can change are the weights and biases. We can try different weights and biases, and then test our network.
We calculate the error, then the forward pass ends, and we should start the backward pass to calculate the derivatives and update the parameters.
To practically feel the importance of the backpropagation algorithm, let’s try to update the parameters directly without using this algorithm.
Parameters update equation
The parameters can be changed according to the next equation:
W(n+1) = W(n) + η[d(n) - Y(n)]X(n)
Where:
 n: Training step (0, 1, 2, …).
 W(n): Parameters in current training step. W(n) = [b(n), W1(n), W2(n), W3(n), …, Wm(n)]
 η: Learning rate with a value between 0.0 and 1.0.
 d(n): Desired output.
 Y(n): Predicted output.
 X(n): Current input at which the network made a false prediction.
For our network, these parameters have the following values:
 n: 0
 W(n): [1.83, 0.5, 0.2]
 η: Because it is a hyperparameter, we can choose its value freely, for example 0.01.
 d(n): [0.03].
 Y(n): [0.874352143].
 X(n): [+1, 0.1, 0.3]. First value (+1) is for the bias.
We can update our network parameters as follows:
W(n+1) = W(n) + η[d(n) - Y(n)]X(n)
= [1.83, 0.5, 0.2] + 0.01[0.03 - 0.874352143][+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + 0.01[-0.844352143][+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + (-0.00844352143)[+1, 0.1, 0.3]
= [1.83, 0.5, 0.2] + [-0.008443521, -0.000844352, -0.002533056]
= [1.821556479, 0.499155648, 0.197466943]
The new parameters are listed in the next table:
| W1new | W2new | bnew |
| --- | --- | --- |
| 0.499155648 | 0.197466943 | 1.821556479 |
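The parameters update equation can be sketched as a one-line list operation. The `update` helper below is a hypothetical name; the parameter order [b, W1, W2] and the leading +1 input follow the convention used in the text.

```python
import math

def update(W, eta, d, Y, X):
    # W(n+1) = W(n) + eta * (d(n) - Y(n)) * X(n), applied element by element
    return [w + eta * (d - Y) * x for w, x in zip(W, X)]

W = [1.83, 0.5, 0.2]    # [b, W1, W2]
X = [1.0, 0.1, 0.3]     # the leading +1 is the bias input
s = sum(w * x for w, x in zip(W, X))
Y = 1.0 / (1.0 + math.exp(-s))          # predicted output
W_new = update(W, eta=0.01, d=0.03, Y=Y, X=X)
print([round(w, 9) for w in W_new])
```

Every parameter is scaled by the same error term (d - Y); only the input value differs, which is why this rule says nothing about how each weight individually affects the error.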
Based on the new parameters, we will recalculate the predicted output. The new predicted output is used to calculate the new network error. The network parameters are updated according to the calculated error. The process continues to update the parameters and recalculate the predicted output until it reaches an acceptable value for the error.
Here, we successfully updated the parameters without using the backpropagation algorithm. Do we still need that algorithm? Yes. You’ll see why.
The parameters-update equation just depends on the learning rate to update the parameters. It changes all the parameters in a direction opposite to the error.
But, using the backpropagation algorithm, we can know how each single weight correlates with the error. This tells us the effect of each weight on the prediction error. That is, which parameters do we increase, and which ones do we decrease to get the smallest prediction error?
For example, the backpropagation algorithm could tell us useful information, like that increasing the current value of W1 by 1.0 increases the network error by 0.07. This shows us that a smaller value for W1 is better to minimize the error.
Partial derivative
One important operation used in the backward pass is to calculate derivatives. Before getting into the calculations of derivatives in the backward pass, we can start with a simple example to make things easier.
For a multivariate function such as Y = X^2 Z + H, what’s the effect on the output Y given a change in variable X? We can answer this using partial derivatives, as follows:
∂Y/∂X = ∂/∂X (X^2 Z + H)
= 2XZ + 0
= 2XZ
Note that everything except X is regarded as a constant. Therefore, H is replaced by 0 after calculating the partial derivative. Here, ∂X means a tiny change of variable X. Similarly, ∂Y means a tiny change of Y. The change of Y is the result of changing X. By making a very small change in X, what’s the effect on Y?
The small change can be an increase or decrease by a tiny value. By substituting the different values of X, we can find how Y changes with respect to X.
The same procedure can be followed to learn how the NN prediction error changes W.R.T changes in network weights. So, our target is to calculate ∂E/∂W1 and ∂E/∂W2 as we have just two weights W1 and W2. Let’s calculate them.
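The idea of making a tiny change in X while holding everything else constant can be tried numerically. In this sketch the values of X, Z, and H are arbitrary illustrative choices, and the function reads Y as X squared times Z plus H.

```python
def Y(X, Z, H):
    return X ** 2 * Z + H

def dY_dX(X, Z):
    return 2.0 * X * Z            # analytic partial derivative; H drops out

def dY_dX_numeric(X, Z, H, eps=1e-6):
    # Nudge X only; Z and H are held constant.
    return (Y(X + eps, Z, H) - Y(X - eps, Z, H)) / (2.0 * eps)

X, Z, H = 3.0, 2.0, 5.0           # arbitrary illustrative values
analytic = dY_dX(X, Z)            # 2 * 3 * 2 = 12
numeric = dY_dX_numeric(X, Z, H)
print(analytic, numeric)
```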
Derivatives of the Prediction Error W.R.T Parameters
Looking at this equation, Y = X^2 Z + H, it seems straightforward to calculate the partial derivative ∂Y/∂X because there’s an equation relating both Y and X. In our case, there’s no direct equation in which both the prediction error and the weights exist. So, we’re going to use the multivariate chain rule to find the partial derivative of the prediction error W.R.T the weights.
Prediction Error to Parameters Chain
Let’s try to find the chain relating the prediction error to the weights. The prediction error is calculated based on this equation:
E = 1/2 (desired - predicted)^2
This equation doesn’t have any parameter. No problem, we can inspect how each term (desired & predicted) of the previous equation is calculated, and substitute it with its equation until reaching the parameters.
The desired term in the previous equation is a constant, so there’s no chance of reaching parameters through it. The predicted term is calculated based on the sigmoid function, like in the next equation:
f(s) = 1/(1 + e^(-s))
Again, the equation for calculating the predicted output doesn’t have any parameter. But there’s still variable s (SOP) that already depends on parameters for its calculation, according to this equation:
s = X1*W1 + X2*W2 + b
Once we’ve reached an equation that has the parameters (weights and biases), we’ve reached the end of the derivative chain. The next figure presents the chain of derivatives to follow to calculate the derivative of the error W.R.T the parameters.
Note that the derivative of s W.R.T the bias b (∂s/∂b) is 1, so it contributes no new factor to the chain. In this example, the bias is simply updated using the learning rate, as we did previously. This leaves us to calculate the derivatives of the 2 weights.
According to the previous figure, to know how prediction error changes W.R.T changes in the parameters we should find the following intermediate derivatives:
 Network error W.R.T the predicted output.
 Predicted output W.R.T the SOP.
 SOP W.R.T each of the 2 weights.
In total, there are four intermediate partial derivatives:
∂E/∂Predicted, ∂Predicted/∂s, ∂s/∂W1 and ∂s/∂W2
To calculate the derivative of the error W.R.T the weights, simply multiply all the derivatives in the chain from the error to each weight, as in the next 2 equations:
∂E/∂W1 = ∂E/∂Predicted * ∂Predicted/∂s * ∂s/∂W1
∂E/∂W2 = ∂E/∂Predicted * ∂Predicted/∂s * ∂s/∂W2
Important note: We used the derivative chain solution because there was no direct equation relating the error and the parameters together. But, we can create an equation relating them and apply partial derivatives directly to it:
E = 1/2 (desired - 1/(1 + e^(-(X1*W1 + X2*W2 + b))))^2
Because this equation seems complex to calculate the derivative of the error W.R.T the parameters directly, it’s preferred to use the multivariate chain rule for simplicity.
Calculating partial derivatives values by substitution
Let’s calculate partial derivatives of each part of the chain we created.
For the derivative of the error W.R.T the predicted output:
∂E/∂Predicted = ∂/∂Predicted (1/2 (desired - predicted)^2)
= 2 * 1/2 (desired - predicted)^(2-1) * (0 - 1)
= (desired - predicted) * (-1)
= predicted - desired
Substituting the values:
∂E/∂Predicted = predicted - desired = 0.874352143 - 0.03
∂E/∂Predicted = 0.844352143
For the derivative of the predicted output W.R.T the SOP:
∂Predicted/∂s = ∂/∂s (1/(1 + e^(-s)))
Remember: the quotient rule can be used to find the derivative of the sigmoid function as follows:
∂Predicted/∂s = 1/(1 + e^(-s)) * (1 - 1/(1 + e^(-s)))
Substituting the values:
∂Predicted/∂s = 1/(1 + e^(-s)) * (1 - 1/(1 + e^(-s))) = 1/(1 + e^(-1.94)) * (1 - 1/(1 + e^(-1.94)))
= 1/(1 + 0.143703949) * (1 - 1/(1 + 0.143703949))
= 1/1.143703949 * (1 - 1/1.143703949)
= 0.874352143 * (1 - 0.874352143)
= 0.874352143 * (0.125647857)
∂Predicted/∂s = 0.109860473
For the derivative of SOP W.R.T W1:
∂s/∂W1 = ∂/∂W1 (X1*W1 + X2*W2 + b)
= 1*X1*(W1)^(1-1) + 0 + 0
= X1*(W1)^(0)
= X1(1)
∂s/∂W1 = X1
Substituting the values:
∂s/∂W1 = X1 = 0.1
For the derivative of SOP W.R.T W2:
∂s/∂W2 = ∂/∂W2 (X1*W1 + X2*W2 + b)
= 0 + 1*X2*(W2)^(1-1) + 0
= X2*(W2)^(0)
= X2(1)
∂s/∂W2 = X2
Substituting the values:
∂s/∂W2 = X2 = 0.3
After calculating the individual derivatives in all chains, we can multiply all of them to calculate the desired derivatives (i.e. derivative of the error W.R.T each weight).
For the derivative of the error W.R.T W1:
∂E/∂W1 = 0.844352143 * 0.109860473 * 0.1
∂E/∂W1 = 0.009276093
For the derivative of the error W.R.T W2:
∂E/∂W2 = 0.844352143 * 0.109860473 * 0.3
∂E/∂W2 = 0.027828278
Finally, there are two values reflecting how the prediction error changes with respect to the weights:
0.009276093 for W1
0.027828278 for W2
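The whole chain can be reproduced in a few lines and compared against a finite-difference estimate taken on the direct error equation. This is a minimal sketch using the example’s values; the variable names are illustrative.

```python
import math

X1, X2, desired = 0.1, 0.3, 0.03
W1, W2, b = 0.5, 0.2, 1.83

def error(W1, W2):
    # Direct equation relating the error to the weights.
    s = X1 * W1 + X2 * W2 + b
    predicted = 1.0 / (1.0 + math.exp(-s))
    return 0.5 * (desired - predicted) ** 2

s = X1 * W1 + X2 * W2 + b
predicted = 1.0 / (1.0 + math.exp(-s))

dE_dpred = predicted - desired             # dE/dPredicted
dpred_ds = predicted * (1.0 - predicted)   # dPredicted/ds
dE_dW1 = dE_dpred * dpred_ds * X1          # ds/dW1 = X1
dE_dW2 = dE_dpred * dpred_ds * X2          # ds/dW2 = X2

# Finite-difference check: nudge one weight at a time.
eps = 1e-6
num_dW1 = (error(W1 + eps, W2) - error(W1 - eps, W2)) / (2 * eps)
num_dW2 = (error(W1, W2 + eps) - error(W1, W2 - eps)) / (2 * eps)
print(dE_dW1, num_dW1)
print(dE_dW2, num_dW2)
```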
What do these values mean? These results need interpretation.
Interpreting results of backpropagation
There are two useful conclusions from each of the last two derivatives. These conclusions are obtained based on:
 Derivative sign
 Derivative magnitude (DM)
If the derivative sign is positive, that means increasing the weight increases the error. In other words, decreasing the weight decreases the error.
If the derivative sign is negative, increasing the weight decreases the error. In other words, if it’s negative then decreasing the weight increases the error.
But by how much does the error increase or decrease? The derivative magnitude answers this question.
For positive derivatives, increasing the weight by p increases the error by DM*p. For negative derivatives, increasing the weight by p decreases the error by DM*p.
Let’s apply this to our example:
 Because the result of the ∂E/∂W1 derivative is positive, this means if W1 increases by 1, then the total error increases by approximately 0.009276093.
 Because the result of the ∂E/∂W2 derivative is positive, this means that if W2 increases by 1, then the total error increases by approximately 0.027828278.
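A derivative is a local rate of change, so DM*p is a linear estimate that is most accurate for small p. The sketch below compares the actual change in error with DM*p for a few step sizes; the step sizes are arbitrary choices, and the derivative value is the one computed in the text.

```python
import math

def error(W1, W2=0.2, b=1.83, X1=0.1, X2=0.3, desired=0.03):
    s = X1 * W1 + X2 * W2 + b
    predicted = 1.0 / (1.0 + math.exp(-s))
    return 0.5 * (desired - predicted) ** 2

dE_dW1 = 0.009276093                      # derivative from the text
for p in (0.001, 0.1, 1.0):
    actual = error(0.5 + p) - error(0.5)  # true change in error
    print(p, actual, dE_dW1 * p)          # compare with the estimate DM * p
```

For the largest step the actual increase is somewhat smaller than the estimate, while for the smallest step the two agree closely.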
Let’s now update the weights according to the calculated derivatives.
Updating weights
After successfully calculating the derivatives of the error with respect to each individual weight, we can update the weights to improve the prediction. Each weight is updated based on its derivative:
For W1:
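W1new = W1 - η*∂E/∂W1
= 0.5 - 0.01*0.009276093
W1new = 0.49990723907
For W2:
W2new = W2 - η*∂E/∂W2
= 0.2 - 0.01*0.027828278
W2new = 0.1997217172
Note that the derivative (scaled by the learning rate) is subtracted from the old value of the weight. Because both derivatives are positive here, subtracting them decreases the weights, which is the direction that reduces the error.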
The new values for the weights are:
 W1=0.49990723907
 W2= 0.1997217172
These 2 weights, in addition to the bias (1.821556479) previously calculated, are used in a new forward pass to calculate the error. It’s expected that the new error will be smaller than the current error (0.356465271).
Here are the new forward pass calculations:
s=X1* W1+ X2*W2+b
s=0.1*0.49990723907+ 0.3*0.1997217172+1.821556479
s=1.931463718067
f(s) = 1/(1 + e^(-s))
f(s) = 1/(1 + e^(-1.931463718067))
f(s) = 0.873411342830056
E = 1/2 (0.03 - 0.873411342830056)^2
E = 0.35567134660719907
When comparing the new error (0.35567134660719907) with the old error (0.356465271), there’s a reduction of 0.0007939243928009043. As long as there’s a reduction, we’re moving in the right direction.
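The weight update and the second forward pass can be reproduced as follows. This sketch plugs in the derivatives and the bias value quoted in the text; `forward_error` is an illustrative helper name.

```python
import math

def forward_error(W1, W2, b, X1=0.1, X2=0.3, desired=0.03):
    s = X1 * W1 + X2 * W2 + b
    predicted = 1.0 / (1.0 + math.exp(-s))
    return 0.5 * (desired - predicted) ** 2

eta = 0.01
W1_new = 0.5 - eta * 0.009276093    # W1 - eta * dE/dW1
W2_new = 0.2 - eta * 0.027828278    # W2 - eta * dE/dW2
b_new = 1.821556479                 # bias from the earlier update equation

old_error = forward_error(0.5, 0.2, 1.83)
new_error = forward_error(W1_new, W2_new, b_new)
print(old_error, new_error, old_error - new_error)
```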
The error reduction is small because we’re using a small learning rate (0.01). To learn about how the learning rate affects the process of training a neural network, read this article.
The forward and backward passes should be repeated until the error is 0 or for a number of epochs (i.e. iterations). This marks the end of the example.
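Repeating the two passes over many epochs could look like the sketch below. Note the assumptions: unlike the worked example, the bias here is also updated by gradient descent (using ∂s/∂b = 1), and the epoch count of 1000 is an arbitrary choice.

```python
import math

def train(X1, X2, desired, W1, W2, b, eta=0.01, epochs=1000):
    """Repeat the forward and backward passes for a fixed number of epochs."""
    errors = []
    for _ in range(epochs):
        # Forward pass
        s = X1 * W1 + X2 * W2 + b
        predicted = 1.0 / (1.0 + math.exp(-s))
        errors.append(0.5 * (desired - predicted) ** 2)
        # Backward pass: shared part of the chain, dE/ds
        dE_ds = (predicted - desired) * predicted * (1.0 - predicted)
        # Gradient descent step (ds/dW1 = X1, ds/dW2 = X2, ds/db = 1)
        W1 -= eta * dE_ds * X1
        W2 -= eta * dE_ds * X2
        b -= eta * dE_ds
    return W1, W2, b, errors

W1, W2, b, errors = train(0.1, 0.3, 0.03, 0.5, 0.2, 1.83)
print(errors[0], errors[-1])
```

The `errors` list records the error at the start of each epoch, so the first entry matches the error of the initial forward pass.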
The next section discusses how to implement the backpropagation for the example discussed in this section.
The Backpropagation
The aim of backpropagation (backward pass) is to distribute the total error back to the network so as to update the weights in order to minimize the cost function (loss). The weights are updated in such a way that when the next forward pass utilizes the updated weights, the total error will be reduced by a certain margin (until a minimum is reached).
For weights in the output layer (w5, w6, w7, w8)
For w5,
Let’s compute how much contribution w5 has on E_{1}. If we become clear on how w5 is updated, then it would be really easy for us to generalize the same to the rest of the weights. If we look closely at the example neural network, we can see that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, and sum_{o1} is affected by w5. It’s time to recall the Chain Rule.
\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}
Let’s deal with each component of the above chain separately.
Component 1: partial derivative of Error w.r.t. Output
E_{total} = \sum \frac{1}{2}(target - output)^{2}
E_{total} = \frac{1}{2}(target_{1} - output_{o1})^{2} + \frac{1}{2}(target_{2} - output_{o2})^{2}
Therefore,
\frac{\partial E_{total}}{\partial output_{o1}} = 2*\frac{1}{2}*(target_{1} - output_{o1})*(-1) = output_{o1} - target_{1}
Component 2: partial derivative of Output w.r.t. Sum
The output section of a unit of a neural network uses nonlinear activation functions. The activation function used in this example is the Logistic Function. When we compute the derivative of the Logistic Function, we get:
\sigma(x) = \frac{1}{1+e^{-x}}
\frac{\mathrm{d}}{\mathrm{dx}}\sigma(x) = \sigma(x)(1 - \sigma(x))
Therefore, the derivative of the Logistic function is equal to output multiplied by (1 - output).
\frac{\partial output_{o1}}{\partial sum_{o1}} = output_{o1} (1 - output_{o1})
Component 3: partial derivative of Sum w.r.t. Weight
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}
Therefore,
\frac{\partial sum_{o1}}{\partial w5} = output_{h1}
Putting them together,
\frac{\partial E_{total}}{\partial w5} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w5}
\frac{\partial E_{total}}{\partial w5} = [output_{o1} - target_{1}] * [output_{o1} (1 - output_{o1})] * [output_{h1}]
\frac{\partial E_{total}}{\partial w5} = 0.68492 * 0.19480 * 0.60108
\frac{\partial E_{total}}{\partial w5} = 0.08020
The new\_w_{5} is,
new\_w_{5} = w5 - n * \frac{\partial E_{total}}{\partial w5}, where n is the learning rate.
new\_w_{5} = 0.5 - 0.6 * 0.08020
new\_w_{5} = 0.45187
We can proceed similarly for w6, w7 and w8.
For w6,
\frac{\partial E_{total}}{\partial w6} = \frac{\partial E_{total}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial w6}
The first two components of this chain have already been calculated. The last component \frac{\partial sum_{o1}}{\partial w6} = output_{h2}.
\frac{\partial E_{total}}{\partial w6} = 0.68492 * 0.19480 * 0.61538 = 0.08211
The new\_w_{6} is,
new\_w_{6} = w6 - n * \frac{\partial E_{total}}{\partial w6}
new\_w_{6} = 0.6 - 0.6 * 0.08211
new\_w_{6} = 0.55073
For w7,
\frac{\partial E_{total}}{\partial w7} = \frac{\partial E_{total}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial w7}
For the first component of the above chain, let’s recall how the partial derivative of Error is computed w.r.t. Output.
\frac{\partial E_{total}}{\partial output_{o2}} = output_{o2} - target_{2}
For the second component,
\frac{\partial output_{o2}}{\partial sum_{o2}} = output_{o2} (1 - output_{o2})
For the third component,
\frac{\partial sum_{o2}}{\partial w7} = output_{h1}
Putting them together,
\frac{\partial E_{total}}{\partial w7} = [output_{o2} - target_{2}] * [output_{o2} (1 - output_{o2})] * [output_{h1}]
\frac{\partial E_{total}}{\partial w7} = -0.17044 * 0.17184 * 0.60108
\frac{\partial E_{total}}{\partial w7} = -0.01760
The new\_w_{7} is,
new\_w_{7} = w7 - n * \frac{\partial E_{total}}{\partial w7}
new\_w_{7} = 0.7 - 0.6 * (-0.01760)
new\_w_{7} = 0.71056
Proceeding similarly, we get new\_w_{8} = 0.81081 (with \frac{\partial E_{total}}{\partial w8} = -0.01802).
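The four output-layer updates follow the same three-factor pattern, so they can be sketched with one small helper. The numbers are the rounded intermediate values quoted in the text, so results agree only to about four decimal places; the old value w8 = 0.8 is an assumption inferred from the quoted update, and the helper name is illustrative.

```python
def output_weight_gradient(dE_dout, dout_dsum, hidden_out):
    # dE_total/dw = dE_total/doutput * doutput/dsum * dsum/dw
    return dE_dout * dout_dsum * hidden_out

eta = 0.6                                 # learning rate n used in the text
out_h1, out_h2 = 0.60108, 0.61538         # hidden-layer outputs
# (output - target, output * (1 - output)) for each output unit
o1 = (0.68492, 0.19480)
o2 = (-0.17044, 0.17184)

# w8 = 0.8 is inferred from the quoted update, not stated directly.
old = {"w5": 0.5, "w6": 0.6, "w7": 0.7, "w8": 0.8}
grads = {
    "w5": output_weight_gradient(*o1, out_h1),
    "w6": output_weight_gradient(*o1, out_h2),
    "w7": output_weight_gradient(*o2, out_h1),
    "w8": output_weight_gradient(*o2, out_h2),
}
new = {name: old[name] - eta * grads[name] for name in old}
for name in old:
    print(name, round(grads[name], 5), round(new[name], 5))
```

A negative gradient, as for w7 and w8, makes the update increase the weight.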
For weights in the hidden layer (w1, w2, w3, w4)
Similar calculations are made to update the weights in the hidden layer. However, this time the chain becomes a bit longer. No matter how deep the neural network goes, all we need to find out is how much error is propagated (contributed) by a particular weight to the total error of the network. For that purpose, we need to find the partial derivative of Error w.r.t. the particular weight. Let’s work on updating w1 and we’ll be able to generalize similar calculations to update the rest of the weights.
For w1 (with respect to E1),
For simplicity let us compute \frac{\partial E_{1}}{\partial w1} and \frac{\partial E_{2}}{\partial w1} separately, and later we can add them to compute \frac{\partial E_{total}}{\partial w1}.
\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
Let’s quickly go through the above chain. We know that E_{1} is affected by output_{o1}, output_{o1} is affected by sum_{o1}, sum_{o1} is affected by output_{h1}, output_{h1} is affected by sum_{h1}, and finally sum_{h1} is affected by w1. It is quite easy to comprehend, isn’t it?
For the first component of the above chain,
\frac{\partial E_{1}}{\partial output_{o1}} = output_{o1} - target_{1}
We’ve already computed the second component. This is one of the benefits of using the chain rule. As we go deep into the network, the previous computations are reusable.
For the third component,
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2}
\frac{\partial sum_{o1}}{\partial output_{h1}} = w5
For the fourth component,
\frac{\partial output_{h1}}{\partial sum_{h1}} = output_{h1}*(1 - output_{h1})
For the fifth component,
sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}
\frac{\partial sum_{h1}}{\partial w1} = i_{1}
Putting them all together,
\frac{\partial E_{1}}{\partial w1} = \frac{\partial E_{1}}{\partial output_{o1}} * \frac{\partial output_{o1}}{\partial sum_{o1}} * \frac{\partial sum_{o1}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
\frac{\partial E_{1}}{\partial w1} = 0.68492 * 0.19480 * 0.5 * 0.23978 * 0.1 = 0.00159
Similarly, for w1 (with respect to E2),
\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
For the first component of the above chain,
\frac{\partial E_{2}}{\partial output_{o2}} = output_{o2} - target_{2}
The second component is already computed.
For the third component,
sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2}
\frac{\partial sum_{o2}}{\partial output_{h1}} = w7
The fourth and fifth components have also been already computed while computing \frac{\partial E_{1}}{\partial w1}.
Putting them all together,
\frac{\partial E_{2}}{\partial w1} = \frac{\partial E_{2}}{\partial output_{o2}} * \frac{\partial output_{o2}}{\partial sum_{o2}} * \frac{\partial sum_{o2}}{\partial output_{h1}} * \frac{\partial output_{h1}}{\partial sum_{h1}} * \frac{\partial sum_{h1}}{\partial w1}
\frac{\partial E_{2}}{\partial w1} = -0.17044 * 0.17184 * 0.7 * 0.23978 * 0.1 = -0.00049
Now we can compute \frac{\partial E_{total}}{\partial w1} = \frac{\partial E_{1}}{\partial w1} + \frac{\partial E_{2}}{\partial w1}.
\frac{\partial E_{total}}{\partial w1} = 0.00159 + (-0.00049) = 0.00110
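One way to sanity-check a hand-derived gradient like this is a finite-difference estimate: nudge w1 slightly in both directions and see how the total error changes. The sketch below is only an illustration, not part of the original derivation. The values of w2, w4, w6, w8 and b2 are not spelled out in this part of the text; the ones used here are assumptions chosen because they reproduce the sums quoted in the forward pass. The function name `total_error` and the step size `eps` are likewise hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def total_error(w1):
    """E_total of the example network as a function of w1 alone."""
    i1, i2 = 0.1, 0.5
    w2, w3, w4 = 0.2, 0.3, 0.4            # w2, w4 are assumed values
    w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8   # w6, w8 are assumed values
    b1, b2 = 0.25, 0.35                   # b2 is an assumed value
    t1, t2 = 0.05, 0.95
    h1 = sigmoid(i1 * w1 + i2 * w3 + b1)
    h2 = sigmoid(i1 * w2 + i2 * w4 + b1)
    o1 = sigmoid(h1 * w5 + h2 * w6 + b2)
    o2 = sigmoid(h1 * w7 + h2 * w8 + b2)
    return 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2

# Central difference around the current value w1 = 0.1
eps = 1e-6
numeric_grad = (total_error(0.1 + eps) - total_error(0.1 - eps)) / (2 * eps)
print(f"numeric dE_total/dw1 = {numeric_grad:.5f}")
```

The estimate should agree with the chain-rule result to within the rounding used in the text, which is a cheap way to catch a sign or index mistake in a manual derivation.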
The new\_w_{1} is,
new\_w_{1} = w1 - n * \frac{\partial E_{total}}{\partial w1}
new\_w_{1} = 0.1 - 0.6 * 0.00110
new\_w_{1} = 0.09933
Proceeding similarly, we can easily update the other weights (w2, w3 and w4).
new\_w_{2} = 0.19919
new\_w_{3} = 0.29667
new\_w_{4} = 0.39597
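The hidden-layer updates can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions rather than the author's own code: the values of w2, w4, w6, w8 and b2 are inferred from the sums quoted in the forward pass, and the names `delta_o1`, `delta_h1`, `grads`, and `lr` are hypothetical. `lr` plays the role of n (0.6) in the update rule.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

i1, i2 = 0.1, 0.5
# Weights w1..w8; w2, w4, w6, w8 are assumed values (see lead-in).
w = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6, 7: 0.7, 8: 0.8}
b1, b2 = 0.25, 0.35   # b2 is an assumed value
t1, t2 = 0.05, 0.95
lr = 0.6              # learning rate, written n in the text

# Forward pass
h1 = sigmoid(i1 * w[1] + i2 * w[3] + b1)
h2 = sigmoid(i1 * w[2] + i2 * w[4] + b1)
o1 = sigmoid(h1 * w[5] + h2 * w[6] + b2)
o2 = sigmoid(h1 * w[7] + h2 * w[8] + b2)

# First two factors of each chain: dE_k/d(sum_ok)
delta_o1 = (o1 - t1) * o1 * (1 - o1)
delta_o2 = (o2 - t2) * o2 * (1 - o2)

# Add the contributions of both outputs, then multiply by the
# sigmoid derivative of the hidden unit: dE_total/d(sum_h)
delta_h1 = (delta_o1 * w[5] + delta_o2 * w[7]) * h1 * (1 - h1)
delta_h2 = (delta_o1 * w[6] + delta_o2 * w[8]) * h2 * (1 - h2)

# Last factor of the chain is the input feeding each weight
grads = {1: delta_h1 * i1, 2: delta_h2 * i1,
         3: delta_h1 * i2, 4: delta_h2 * i2}
new_w = {k: w[k] - lr * g for k, g in grads.items()}

for k in sorted(new_w):
    print(f"new_w{k} = {new_w[k]:.5f}")
```

Because the script keeps full floating-point precision, the last printed digit may differ by one from the hand-computed values, which were rounded at each step.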
Once we’ve computed all the new weights, we replace all the old weights with them in a single step. Once the weights are updated, one backpropagation cycle is finished. The forward pass is then run again and the new total error is computed, and based on this error the weights are updated again. This goes on until the loss value converges to a minimum. This way a neural network starts with random values for its weights and gradually converges toward values that minimize the loss.
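The repeated forward/backward cycle can be sketched as a short loop. Several choices here are assumptions made for illustration: the biases are held fixed, the loop runs for a fixed 1000 iterations instead of testing for convergence, and w2, w4, w6, w8 and b2 take the values that reproduce the sums quoted in the forward pass. The function name `train_step` is hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def train_step(w, lr=0.6):
    """Run one forward and one backward pass.

    w holds [w1, ..., w8]; returns (total error before the update, new weights).
    """
    i1, i2, b1, b2 = 0.1, 0.5, 0.25, 0.35
    t1, t2 = 0.05, 0.95
    w1, w2, w3, w4, w5, w6, w7, w8 = w

    # Forward pass
    h1 = sigmoid(i1 * w1 + i2 * w3 + b1)
    h2 = sigmoid(i1 * w2 + i2 * w4 + b1)
    o1 = sigmoid(h1 * w5 + h2 * w6 + b2)
    o2 = sigmoid(h1 * w7 + h2 * w8 + b2)
    error = 0.5 * (t1 - o1) ** 2 + 0.5 * (t2 - o2) ** 2

    # Backward pass: every gradient is computed from the old weights
    d_o1 = (o1 - t1) * o1 * (1 - o1)
    d_o2 = (o2 - t2) * o2 * (1 - o2)
    d_h1 = (d_o1 * w5 + d_o2 * w7) * h1 * (1 - h1)
    d_h2 = (d_o1 * w6 + d_o2 * w8) * h2 * (1 - h2)
    grads = [d_h1 * i1, d_h2 * i1, d_h1 * i2, d_h2 * i2,
             d_o1 * h1, d_o1 * h2, d_o2 * h1, d_o2 * h2]

    # All weights are replaced together, after all gradients are known
    return error, [wk - lr * g for wk, g in zip(w, grads)]

weights = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
errors = []
for _ in range(1000):
    error, weights = train_step(weights)
    errors.append(error)

print(f"first error: {errors[0]:.5f}, last error: {errors[-1]:.5f}")
```

Computing every gradient before touching any weight matters: if w5 were updated first, the hidden-layer gradients would be based on a mix of old and new values.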
I hope you found this article useful. I’ll see you in the next one.
What is the meaning of forward pass and backward pass in neural networks?
Everybody is mentioning these expressions when talking about backpropagation and epochs.
I understood that forward pass and backward pass together form an epoch.
The “forward pass” refers to the process of calculating the values of the output layer from the input data. It traverses all neurons from the first layer to the last.
A loss function is calculated from the output values.
The “backward pass” then refers to the process of computing the changes to the weights (de facto learning), using the gradient descent algorithm (or similar). The computation runs from the last layer backward to the first layer.
A forward pass and a backward pass together make one “iteration”.
During one iteration, you usually pass a subset of the data set, which is called a “minibatch” or “batch” (however, “batch” can also mean the entire set, hence the prefix “mini”).
“Epoch” means passing the entire data set in batches. One epoch contains (number_of_items / batch_size) iterations.
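As a small illustration of the iteration count, the sketch below computes the number of iterations in one epoch. The helper name `iterations_per_epoch` is hypothetical, and rounding up when the last batch is only partly filled is an assumption; some training setups drop the incomplete batch instead.

```python
import math

def iterations_per_epoch(number_of_items, batch_size):
    """Number of forward/backward iterations needed to see every item once.

    Assumes a final, partly filled batch still counts as one iteration.
    """
    return math.ceil(number_of_items / batch_size)

print(iterations_per_epoch(1000, 100))  # 10
print(iterations_per_epoch(1050, 100))  # 11
```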
5. Forward Propagation, Backward Propagation, and Computational Graphs
So far, we have trained our models with minibatch stochastic gradient descent. However, when we implemented the algorithm, we only worried about the calculations involved in forward propagation through the model. When it came time to calculate the gradients, we just invoked the backpropagation function provided by the deep learning framework.
The automatic calculation of gradients profoundly simplifies the implementation of deep learning algorithms. Before automatic differentiation, even small changes to complicated models required recalculating complicated derivatives by hand. Surprisingly often, academic papers had to allocate numerous pages to deriving update rules. While we must continue to rely on automatic differentiation so we can focus on the interesting parts, you ought to know how these gradients are calculated under the hood if you want to go beyond a shallow understanding of deep learning.
In this section, we take a deep dive into the details of backward propagation (more commonly called backpropagation). To convey some insight for both the techniques and their implementations, we rely on some basic mathematics and computational graphs. To start, we focus our exposition on a one-hidden-layer MLP with weight decay (\(\ell_2\) regularization, to be described in subsequent chapters).
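To make the setting concrete, the quantities a forward pass through such a model would compute can be written out as follows. This is a sketch under assumptions: the symbols (\(\mathbf{W}^{(1)}\), \(\mathbf{W}^{(2)}\), \(\phi\), \(l\), \(\lambda\)) are chosen here for illustration, and the hidden layer is taken to have no bias term.

```latex
\begin{aligned}
\mathbf{z} &= \mathbf{W}^{(1)} \mathbf{x} && \text{hidden pre-activation (no bias)} \\
\mathbf{h} &= \phi(\mathbf{z}) && \text{hidden activation} \\
\mathbf{o} &= \mathbf{W}^{(2)} \mathbf{h} && \text{output} \\
L &= l(\mathbf{o}, y) && \text{loss for one example} \\
s &= \frac{\lambda}{2}\left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right) && \text{$\ell_2$ penalty} \\
J &= L + s && \text{regularized objective}
\end{aligned}
```

Backpropagation then applies the chain rule to \(J\) in the reverse order, from \(J\) back to \(\mathbf{W}^{(2)}\) and \(\mathbf{W}^{(1)}\), reusing the intermediate values stored during the forward pass.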
5.3. Exercises

1. Assume that the inputs \(\mathbf{X}\) to some scalar function \(f\) are \(n \times m\) matrices. What is the dimensionality of the gradient of \(f\) with respect to \(\mathbf{X}\)?
2. Add a bias to the hidden layer of the model described in this section (you do not need to include bias in the regularization term).
   1. Draw the corresponding computational graph.
   2. Derive the forward and backward propagation equations.
3. Compute the memory footprint for training and prediction in the model described in this section.
4. Assume that you want to compute second derivatives. What happens to the computational graph? How long do you expect the calculation to take?
5. Assume that the computational graph is too large for your GPU.
   1. Can you partition it over more than one GPU?
   2. What are the advantages and disadvantages over training on a smaller minibatch?
Forward pass of a simple recurrent neural network. The recurrent neural network consists of three kinds of layers: an input layer, a hidden layer, and an output layer. The input value is passed into the input layer, then delivered through the hidden layers to the output layer. This procedure is called the forward pass.
Source publication
Motivation: Proteins usually fulfill their biological functions by interacting with other proteins. Although some methods have been developed to predict the binding sites of a monomer protein, these are not sufficient for prediction of the interaction between two monomer proteins. The correct prediction of interface residue pairs from two monomer…
Drawbacks of the backpropagation algorithm
Even though the backpropagation algorithm is the most widely used algorithm for training neural networks, it has some drawbacks:
 The network should be designed carefully to avoid the vanishing and exploding gradients that affect the way the network learns. For example, the gradients calculated out of the sigmoid activation function may be very small, close to zero, which makes the network unable to update its weights. As a result, no learning happens.
 The backpropagation algorithm considers all neurons in the network equally and calculates their derivatives for each backward pass. Even when dropout layers are used, the derivatives of the dropped neurons are calculated, and then dropped.
 Backpropagation relies on infinitesimal effects (partial derivatives) to perform credit assignment. This could become a serious issue as one considers deeper and more nonlinear functions.
 Gradient-based training is only guaranteed to reach the global minimum when the error function is convex. For a non-convex function, it might get stuck in a local optimum.
 The error function and the activation function must be differentiable in order for the backpropagation algorithm to work. It won’t work with non-differentiable functions.
 In the forward pass, layer i+1 must wait for the calculations of layer i to complete. In the backward pass, layer i must wait for layer i+1 to complete. This locks all layers of the network: each one waits for the rest of the network to execute forward and propagate the error backward before it can be updated.
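The vanishing-gradient point from the first drawback can be illustrated numerically. The sketch below is a simplified illustration, not a full analysis: it ignores the weights and looks only at the activation-derivative factor that each sigmoid layer contributes to the chain rule. The sigmoid derivative peaks at 0.25, so even in the best case the product of these factors shrinks geometrically with depth.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative is largest at x = 0, where it equals 0.25.
max_grad = sigmoid_grad(0.0)

# Best-case product of activation derivatives across `depth` sigmoid layers.
for depth in (1, 5, 10, 20):
    print(depth, max_grad ** depth)
```

With ten sigmoid layers the factor is already below one in a million, which is why early layers in a deep sigmoid network may barely change during training.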
The Forward Pass
Remember that each unit of a neural network performs two operations: it computes a weighted sum of its inputs and passes that sum through an activation function. The outcome of the activation function determines whether that particular unit should activate or become insignificant.
Let’s get started with the forward pass.
For h1,
sum_{h1} = i_{1}*w_{1}+i_{2}*w_{3}+b_{1}
sum_{h1} = 0.1*0.1+0.5*0.3+0.25 = 0.41
Now we pass this weighted sum through the logistic function (sigmoid function) to squash it into the range (0, 1). The logistic function is the activation function for our example neural network.
output_{h1}=\frac{1}{1+e^{-sum_{h1}}}
output_{h1}=\frac{1}{1+e^{-0.41}} = 0.60108
Similarly for h2, we perform the weighted sum operation sum_{h2} and compute the activation value output_{h2}.
sum_{h2} = i_{1}*w_{2}+i_{2}*w_{4}+b_{1} = 0.47
output_{h2} = \frac{1}{1+e^{-sum_{h2}}} = 0.61538
Now, output_{h1} and output_{h2} will be considered as inputs to the next layer.
For o1,
sum_{o1} = output_{h1}*w_{5}+output_{h2}*w_{6}+b_{2} = 1.01977
output_{o1}=\frac{1}{1+e^{-sum_{o1}}} = 0.73492
Similarly for o2,
sum_{o2} = output_{h1}*w_{7}+output_{h2}*w_{8}+b_{2} = 1.26306
output_{o2}=\frac{1}{1+e^{-sum_{o2}}} = 0.77955
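The forward pass above translates almost line for line into Python. This is a minimal sketch, not the author's code. The values of w2, w4, w6, w8 and b2 are not spelled out in this part of the text; the ones used below are assumptions chosen because they reproduce the stated sums (0.47, 1.01977, 1.26306).

```python
import math

def sigmoid(x):
    """Logistic function: squashes x into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Inputs, weights, and biases of the example network.
i1, i2 = 0.1, 0.5
w1, w2, w3, w4 = 0.1, 0.2, 0.3, 0.4   # input -> hidden (w2, w4 assumed)
w5, w6, w7, w8 = 0.5, 0.6, 0.7, 0.8   # hidden -> output (w6, w8 assumed)
b1, b2 = 0.25, 0.35                   # hidden bias, output bias (b2 assumed)

# Hidden layer: weighted sum, then activation
sum_h1 = i1 * w1 + i2 * w3 + b1
sum_h2 = i1 * w2 + i2 * w4 + b1
output_h1 = sigmoid(sum_h1)
output_h2 = sigmoid(sum_h2)

# Output layer: the hidden outputs are the inputs here
sum_o1 = output_h1 * w5 + output_h2 * w6 + b2
sum_o2 = output_h1 * w7 + output_h2 * w8 + b2
output_o1 = sigmoid(sum_o1)
output_o2 = sigmoid(sum_o2)

print(f"output_h1={output_h1:.5f} output_h2={output_h2:.5f}")
print(f"output_o1={output_o1:.5f} output_o2={output_o2:.5f}")
```

The printed values should match the hand-computed ones up to the last digit, since the text rounds intermediate results.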
Computing the total error
We started off supposing the expected outputs to be 0.05 and 0.95 respectively for output_{o1} and output_{o2}. Now we will compute the errors based on the outputs received until now and the expected outputs.
We’ll use the following error formula,
E_{total} = \sum \frac{1}{2}(target - output)^{2}
To compute E_{total}, we need to first find out respective errors at o1 and o2.
E_{1} = \frac{1}{2}(target_{1} - output_{o1})^{2}
E_{1} = \frac{1}{2}(0.05 - 0.73492)^{2} = 0.23456
Similarly for E2,
E_{2} = \frac{1}{2}(target_{2} - output_{o2})^{2}
E_{2} = \frac{1}{2}(0.95 - 0.77955)^{2} = 0.01452
Therefore, E_{total} = E_{1} + E_{2} = 0.24908
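The error computation can be checked with a few lines of Python. This sketch starts from the rounded output values quoted in the text, so its last digit may differ by one from the figures above. The function name `squared_error` is hypothetical.

```python
# Output values and targets as stated in the text.
output_o1, output_o2 = 0.73492, 0.77955
target_1, target_2 = 0.05, 0.95

def squared_error(target, output):
    """Half the squared difference between target and output."""
    return 0.5 * (target - output) ** 2

E1 = squared_error(target_1, output_o1)
E2 = squared_error(target_2, output_o2)
E_total = E1 + E2

print(f"E1={E1:.5f} E2={E2:.5f} E_total={E_total:.5f}")
```

The factor of one half is a convenience: it cancels the 2 that appears when the squared term is differentiated, which is why the derivative of each error with respect to its output is simply output minus target.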
Alternatives to traditional backpropagation
There are a number of alternatives to traditional backpropagation. Below are 4 of them.
In Lee, Dong-Hyun, et al. “Difference target propagation.” Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, Cham, 2015., the main idea is to compute targets rather than gradients at each layer. Like gradients, they are propagated backwards. Target propagation relies on autoencoders at each layer. Unlike backpropagation, it can be applied even when units exchange stochastic bits rather than real numbers.
In Ma, Wan-Duo Kurt, J. P. Lewis, and W. Bastiaan Kleijn. “The HSIC bottleneck: Deep learning without back-propagation.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 04. 2020., the authors propose the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to conventional backpropagation, with a number of distinct advantages. The method facilitates parallel processing and requires significantly fewer operations. It doesn’t suffer from exploding or vanishing gradients. It’s biologically more plausible than backpropagation, as there’s no requirement for symmetric feedback.
In Choromanska, Anna, et al. “Beyond backprop: Online alternating minimization with auxiliary variables.” International Conference on Machine Learning. PMLR, 2019., the authors propose an online (stochastic/mini-batch) alternating minimization (AM) approach for training deep neural networks.
In Jaderberg, Max, et al. “Decoupled neural interfaces using synthetic gradients.” International Conference on Machine Learning. PMLR, 2017., the authors break the constraint of locking layers by decoupling modules (i.e. layers), and introduce a model of the future computation of the network graph. These models predict what the modelled subgraph will produce using only local information. As a result, the subgraphs can be updated independently and asynchronously.
Citations
… Geometricbased and physicochemicalbased features. The commonly used geometricbased features like solvent accessible surface area (SASA) and relative solvent accessible surface area (RASA) as well as the lately developed features including interior contact area (ICA), exterior contact area (ECA), and exterior void area (EVA) are used to describe residues [78][79][80][81] . In addition, physicochemicalbased features are also applied, including the hydropathy index (HI, two versions) and pKα (two versions) 82,83 . …
Interactions between intrinsically disordered proteins (IDPs) are highly dynamic, their interaction interfaces are diverse and affected by their inherent conformation fluctuations. Comprehensive characterization of these interactions based on current techniques is challenging. Here, we present GSALIDP, a GraphSAGELSTM Network to capture the dynamic nature of IDP conformation and predict the behavior of IDP interaction. The training data for our method is obtained from atomistic molecular dynamics (MD) simulations. Our method models multiple conformations of IDP as a dynamic graph which can effectively describe the fluctuation of its flexible conformation. GSALIDP can effectively predict the interaction sites of IDP as well as the contact residue pairs between IDPs. Its performance to predict IDP interaction is on par with or even better than the conventional models to predict the interaction of structural proteins. To the best of our knowledge, this is the first machinelearning method to realize the prediction of interaction behavior of IDPs. Our framework can be exploited to study other IDPmediated processes such as foldinguponbinding and fuzzy interaction, wherein the dynamic nature of IDP conformation is also essential.
… DeepComplex, meanwhile, applies a gradient descent optimization algorithm to model quarternary structures by utilizing the predicted interprotein contacts as distance restraints 13 . Given the importance of interfacial interactions, various deep learningbased methods have been developed to predict interprotein contacts [14][15][16][17][18][19][20] and proteinprotein interactions [21][22][23][24] . Compared with the intraprotein Nature Machine Intelligence Article https://doi.org/10.1038/s42256023007412 …
Information regarding the residue–residue distance between interacting proteins is important for modelling the structures of protein complexes, as well as being valuable for understanding the molecular mechanism of protein–protein interactions. With the advent of deep learning, many methods have been developed to accurately predict the intraprotein residue–residue contacts of monomers. However, it is still challenging to accurately predict interprotein residue–residue contacts for protein complexes, especially heteroprotein complexes. Here we develop a protein language modelbased deep learning method to predict the interprotein residue–residue contacts of protein complexes—named DeepInter—by introducing a triangleaware mechanism of triangle update and triangle selfattention into the deep neural network. We extensively validate DeepInter on diverse test sets of 300 homodimeric, 28 CASPCAPRI homodimeric and 99 heterodimeric complexes and compare it with stateoftheart methods including CDPred, DeepHomo2.0, GLINTER and DeepHomo. The results demonstrate the accuracy and robustness of DeepInter.
… Motivated by the success of intraprotein contact predictions in monomer structure prediction [26][27][28] , various advanced deep learning methods have been developed to predict the interchain contacts for protein complexes [29][30][31][32][33][34][35][36][37][38][39][40] . Our previous work, DeepHomo 29 , utilizes sequence and structure features to predict interchain contacts with ResNet2 architectures 41 . …
Membrane proteins are encoded by approximately a quarter of human genes. Interchain residueresidue contact information is important for structure prediction of membrane protein complexes and valuable for understanding their molecular mechanism. Although many deep learning methods have been proposed to predict the intraprotein contacts or helixhelix interactions in membrane proteins, it is still challenging to accurately predict their interchain contacts due to the limited number of transmembrane proteins. Addressing the challenge, here we develop a deep transfer learning method for predicting interchain contacts of transmembrane protein complexes, named DeepTMP, by taking advantage of the knowledge pretrained from a large data set of nontransmembrane proteins. DeepTMP utilizes a geometric triangleaware module to capture the correct interchain interaction from the coevolution information generated by protein language models. DeepTMP is extensively evaluated on a test set of 52 selfassociated transmembrane protein complexes, and compared with stateoftheart methods including DeepHomo2.0, CDPred, GLINTER, DeepHomo, and DNCON2_Inter. It is shown that DeepTMP considerably improves the precision of interchain contact prediction and outperforms the existing approaches in both accuracy and robustness.
… Another group of methods focus on monitoring changes in the accessible surface area upon complex formation , such as NACCESS (8,9) or POPSCOMB (10), which are all based on the original method of Lee and Richards (11). However, there is a lack of consensus among these methodologies regarding how to consistently define the biologically relevant proteinprotein interaction interface (12)(13)(14)(15)(16)(17)(18)(19). A recent study illustrated that even among approaches employing nearly identical algorithms to define interfaces in known protein complexes, a minimal difference in definition can reduce the agreement between them to about 80% and in a significant number of cases the interface definitions could overlap by as little as 40% (20). …
Many biomedical applications, such as classification of binding specificities or bioengineering, depend on the accurate definition of protein binding interfaces. Depending on the choice of method used, substantially different sets of residues can be classified as belonging to the interface of a protein. A typical approach used to verify these definitions is to mutate residues and measure the impact of these changes on binding. Besides the lack of exhaustive data this approach generates, it also suffers from the fundamental problem that a mutation introduces an unknown amount of alteration into an interface, which potentially alters the binding characteristics of the interface. In this study we explore the impact of alternative binding site definitions on the ability of a protein to recognize its cognate ligand using a pharmacophore approach, which does not affect the interface. The study also provides guidance on the minimum expected accuracy of interface definition that is required to capture the biological function of a protein. AUTHOR SUMMARY The residue level description or prediction of protein interfaces is a critical input for protein engineering and classification of function. However, different parametrizations of the same methods and especially alternative methods used to define the interface of a protein can return substantially different sets of residues. Typical experimental or computational methods employ mutational studies to verify interface definitions, but all these approaches inherently suffer from the problem that in order to probe the importance of any one position of an interface, an unknown amount of alteration is introduced into the very interface being studied. In this work, we employ a pharmacophorebased approach to computationally explore the consequences of defining alternative binding sites. 
The pharmacophore generates a hypothesis for the complementary protein binding interface, which then can be used in a search to identify the corresponding ligand from a library of candidates. The accurate ranking of cognate ligands can inform us about the biological accuracy of the interface definition. This study also provides a guideline about the minimum required accuracy of protein interface definitions that still provides a statistically significant recognition of cognate ligands above random expectation, which in turn sets a minimum expectation for interface prediction methods.
… In protein P threedimensional structure, different amino acids have different geometric properties. These geometric properties, such as Accessible Surface Area (ASA), Relative solvent Accessible Surface Area (RASA), Exterior Contact Area (ECA), Interior Contact Area (ICA), and Exterior Void Area (EVA), play important roles in multibody protein complex interactions Yang and Gong, 2018;Liu and Gong, 2019;Zhao and Gong, 2019;. In this paper, we consider using the above five geometric properties to predict the tetramer protein complex interaction. …
… In this paper, we consider using the above five geometric properties to predict the tetramer protein complex interaction. References (Liu and Gong, 2019; and (Zhao and Gong, 2019) introduce the five geometric properties and their computing tools in detail. …
… In several previous research studies (Yang and Gong, 2018;Liu and Gong, 2019;Zhao and Gong, 2019;, it has been found that the five geometric properties (ASA, RASA, ECA, ICA, and EVA) can be used to distinguish interface residues and noninterface residues. According to the five geometric properties of the residue, we map the protein P to 5 number sequences, as shown in formula 2. …
Proteinprotein interactions play an important role in life activities. The study of proteinprotein interactions helps to better understand the mechanism of protein complex interaction, which is crucial for drug design, protein function annotation and threedimensional structure prediction of protein complexes. In this paper, we study the tetramer protein complex interaction. The research has two parts: The first part is to predict the interaction between chains of the tetramer protein complex. In this part, we proposed a feature map to represent a sample generated by two chains of the tetramer protein complex, and constructed a Convolutional Neural Network (CNN) model to predict the interaction between chains of the tetramer protein complex. The AUC value of testing set is 0.6263, which indicates that our model can be used to predict the interaction between chains of the tetramer protein complex. The second part is to predict the tetramer protein complex interface residue pairs. In this part, we proposed a Support Vector Machine (SVM) ensemble method based on undersampling and ensemble method to predict the tetramer protein complex interface residue pairs. In the top 10 predictions, when at least one proteinprotein interaction interface is correctly predicted, the accuracy of our method is 82.14%. The result shows that our method is effective for the prediction of the tetramer protein complex interface residue pairs.
… Most biological processes in organisms are carried out through the interaction between proteins, and present the corresponding physiological functions in the form of protein complexes. Therefore, accurately identifying complexes in protein networks is an important problem in computational biology [4] . The solution of this problem is conducive to explaining the complex relationship between proteins in cells from the micro level, and provides a new way for biologists and medical scientists to understand the internal organization and biological process of life complex networks, it also has important application value in target spot based drug design, disease diagnosis and treatment. …
Accurate recognition of protein complexes is of great practical significance for biological process mechanism research, disease dynamic analysis, target spot based drug design and other applications. However, the existing recognition algorithms still lack the ability to represent the information of changing nodes and edges in dynamic protein networks, which greatly affects the accuracy of recognition. Considering the great advantages of graph convolution neural network in processing graph data, a dynamic protein network complex recognition algorithm based on spatialtemporal graph convolution is proposed. Firstly, the edge strength, node strength and edge existence probability are defined to model the dynamic protein network. Then, combined with the time series information and structure information on the graph, two convolution operators are designed based on Hilbert Huang transform, attention mechanism and residual connection technology to represent and learn the characteristics of proteins in the network, and the dynamic protein network characteristic map was constructed. Finally, spectral clustering is used to identify protein complexes, which well fits the development and change law of complexes in dynamic network environment. The experimental results on several public biological data sets show that the performance of the proposed algorithm is better than the latest complex recognition methods in recall, precision, coverage, function enrichment, reliability and efficiency.
… A predicted interchain contact is correct if the minimal distance between the heavy atoms of the two residues is less than 8 Å. The accuracy order, accuracy rate 46 , and AUC score are also used to evaluate the interchain distance prediction of CDPred. The accuracy order is the rank of the first correct contact prediction divided by the total number of residues of a dimer. …
Residueresidue distance information is useful for predicting tertiary structures of protein monomers or quaternary structures of protein complexes. Many deep learning methods have been developed to predict intrachain residueresidue distances of monomers accurately, but few methods can accurately predict interchain residueresidue distances of complexes. We develop a deep learning method CDPred (i.e., Complex Distance Prediction) based on the 2D attentionpowered residual network to address the gap. Tested on two homodimer datasets, CDPred achieves the precision of 60.94% and 42.93% for top L/5 interchain contact predictions (L: length of the monomer in homodimer), respectively, substantially higher than DeepHomo’s 37.40% and 23.08% and GLINTER’s 48.09% and 36.74%. Tested on the two heterodimer datasets, the top Ls/5 interchain contact prediction precision (Ls: length of the shorter monomer in heterodimer) of CDPred is 47.59% and 22.87% respectively, surpassing GLINTER’s 23.24% and 13.49%. Moreover, the prediction of CDPred is complementary with that of AlphaFold2multimer. Predicting interchain residueresidue distances of protein complexes is useful for constructing and evaluating quaternary structures of the protein complexes. Here, the authors develop a deep attentionbased residual network method (CDPred) to predict interchain residueresidue distances of protein dimers.
… These interactions ought to comply with two conditions: first, the interaction must be by design, i.e. the result of a specific biomolecular event; second, the interaction has evolved to serve a certain nongeneric function [3][4][5]. Thus, one may obtain biological insights into protein functions, disease prevalence, and therapy development by identifying interaction amongst protein pairs [6][7][8]. Hence, proteinprotein interaction (PPI) and proteinligand binding problems have drawn attention in bioinformatics and computeraided drug discovery [7,9,10]. Computational methods paved the way for scientists to predict the 3D structures of proteins from genomes and, hence, to predict their functions and attributes, allowing them to modify proteins and design new ones to target desired functions. …
Most proteins perform their biological function by interacting with themselves or other molecules. Thus, one may obtain biological insights into protein functions, disease prevalence, and therapy development by identifying protein–protein interactions (PPI). However, finding the interacting and noninteracting protein pairs through experimental approaches is labourintensive and timeconsuming, owing to the variety of proteins. Hence, protein–protein interaction and protein–ligand binding problems have drawn attention in the fields of bioinformatics and computeraided drug discovery. Deep learning methods paved the way for scientists to predict the 3D structure of proteins from genomes, predict the functions and attributes of a protein, and modify and design new proteins to provide desired functions. This review focuses on recent deep learning methods applied to problems including predicting protein functions, protein–protein interaction and their sites, protein–ligand binding, and protein design.
… More recently, machine learning methods based on deep learning [34], has been widely used for PPI prediction, with remarkable results. Zhao et al. [35] used nine properties of amino acids as feature representation (including Relative Exterior Solvent Accessible area (RESA) and Hydropathy Index (HI)), and then trained LongShort Term Memory (LSTM) [36] networks to predict interface residue pairs from two monomer proteins. Li et al. [37] firstly substituted corresponding random numbers for amino acids in the protein sequence to complete sequence coding. …
Background Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wetlab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming. Results In this work, we propose a more efficient model, called deep hash learning proteinandprotein interaction (DHLPPI), to predict allagainstall PPI relationships in a database of proteins. First, DHLPPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the prescreening stage of DHLPPI, the string matching problem of comparing a protein sequence against a database with M proteins can be transformed into a much more simpler problem: to find a number inside a sorted array of length M . This prescreening process narrows down the search to a much smaller set of candidate proteins for further confirmation. As a final step, DHLPPI uses the Hamming distance to verify the final PPI relationship. Conclusions The experimental results confirmed that DHLPPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHLPPI is shown to be superior or competitive when compared to the other stateoftheart methods in terms of precision, recall or F1 score. 
Furthermore, in the prediction stage, the proposed DHLPPI reduced the time complexity from $$O(M^2)$$ O ( M 2 ) to $$O(M\log M)$$ O ( M log M ) for performing an allagainstall PPI prediction for a database with M proteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.
… However, these studies used both microarray and sequencing data. Most recently, [15,44] achieved moderate accuracy in protein-protein interaction interface residue pairs prediction, but used supplementary data and hand-tailored algorithms for inference. …
… DNA-binding motif prediction for other target molecules has been achieved in several early works using MSA [27], physics-based simulations [11], and kernel-based algorithms [19]. Deep learning-based approaches most often include using RNNs as in [15,16,18,23,30,35,44]. Some other recent works include combinations of CNN and RNN models [1,21,22,43]. …
Computer-aided rational vaccine design (RVD) and synthetic pharmacology are rapidly developing fields that leverage existing datasets for developing compounds of interest. Computational proteomics utilizes algorithms and models to probe proteins for functional prediction. A potentially strong target for a computational approach is autoimmune antibodies, which are the result of broken tolerance in the immune system, where it cannot distinguish “self” from “non-self”, resulting in an attack on its own structures (proteins and DNA, mainly). The information on structure, function, and pathogenicity of autoantibodies may assist in engineering RVD against autoimmune diseases. Current computational approaches exploit large datasets curated with extensive domain knowledge, most of which require many resources and have been applied indirectly to problems of interest for DNA, RNA, and monomer protein binding. We present a novel method for discovering potential binding sites. We employed long short-term memory (LSTM) models trained on FASTA primary sequences to predict protein binding in DNA-binding hydrolytic antibodies (abzymes). We also employed CNN models applied to the same dataset for comparison with LSTM. While the CNN model outperformed the LSTM on the primary task of binding prediction, analysis of internal model representations of both models showed that the LSTM models recovered subsequences that were strongly correlated with sites known to be involved in binding. These results demonstrate that analysis of internal processes of LSTM models may serve as a powerful tool for primary sequence analysis.
What does a forward pass consist of?
I understand that, during a forward pass, data is passed through the network and the network returns predicted values, but how does this work?
If you already understand the high-level concept of Forward Pass, all that’s left to understand is the Neural Net algorithm itself.
Please see the picture below, taken from here:
This is a simple representation of a neural net. On the left input is taken into the net, scaled by weight factors, and summed to give the Net Input Function. Typically, there is also a bias (constant) added to each input. Multiplying by a weight and adding a bias turns each input into a linear equation. The Net Input Function is a combination of linear equations that is used as input to the Activation Function. The Activation Function can take many different forms, but they all perform the same basic job, adding some nonlinearity to the inputs. A common Activation Function is ReLU:
If you’re struggling to see how this adds nonlinearity, try creating and using the function in Desmos. You’ll get something like the following:
The black function is how you define ReLU in Desmos. The green function is a combination of linear functions passed through ReLU.
If the Neural Net has more hidden layers, the Activation Function’s output is passed forward to the next hidden layer, with a weight and bias, as before, and the process is repeated. If there are no more Hidden layers the output is summed and used to produce predicted values for the input data.
The Forward Pass’s final step is to compute loss, by squaring and summing the difference between the predicted and expected values. The loss indicates the predictive power of the network on the input data.
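The steps in this answer can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper names (relu, forward_pass, squared_error_loss), the tiny 2-2-1 network, and all weight and bias values are made up for the example, and the output layer is left linear.

```python
import numpy as np

def relu(z):
    """ReLU activation: keeps positive values and zeroes out negative ones."""
    return np.maximum(0.0, z)

def forward_pass(x, layers):
    """Pass x through a list of (W, b) pairs, one pair per layer.

    Hidden layers apply ReLU to their net input; the last layer is
    kept linear here (an assumption made for this sketch).
    """
    a = x
    for i, (W, b) in enumerate(layers):
        z = W @ a + b                      # net input: weighted sum plus bias
        is_last = (i == len(layers) - 1)
        a = z if is_last else relu(z)      # activation on hidden layers only
    return a

def squared_error_loss(predicted, expected):
    """Sum of squared differences between predicted and expected values."""
    return np.sum((predicted - expected) ** 2)

# Hypothetical network: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -1.0])),
    (np.array([[2.0, 1.0]]), np.array([0.5])),
]

x = np.array([3.0, 1.0])
prediction = forward_pass(x, layers)                       # [5.5]
loss = squared_error_loss(prediction, np.array([5.0]))     # 0.25
print(prediction, loss)
```

With these values the hidden layer produces [2, 1], the output is 2*2 + 1*1 + 0.5 = 5.5, and the loss against an expected value of 5.0 is 0.25.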
A Comprehensive Guide to the Backpropagation Algorithm in Neural Networks
This article is a comprehensive guide to the backpropagation algorithm, the most widely used algorithm for training artificial neural networks. We’ll start by defining forward and backward passes in the process of training neural networks, and then we’ll focus on how backpropagation works in the backward pass. We’ll work on detailed mathematical calculations of the backpropagation algorithm.
Also, we’ll discuss how to implement a backpropagation neural network in Python from scratch using NumPy, based on this GitHub project. The project builds a generic backpropagation neural network that can work with any architecture.
Let’s get started.
5.3. Computational Graph of Forward Propagation
Plotting computational graphs helps us visualize the dependencies of operators and variables within the calculation. Fig. 5.3.1 contains the graph associated with the simple network described above, where squares denote variables and circles denote operators. The lower-left corner signifies the input and the upper-right corner is the output. Notice that the directions of the arrows (which illustrate data flow) are primarily rightward and upward.
Fig. 5.3.1 Computational graph of forward propagation.
5.3. Training Neural Networks
When training neural networks, forward and backward propagation depend on each other. In particular, for forward propagation, we traverse the computational graph in the direction of dependencies and compute all the variables on its path. These are then used for backpropagation where the compute order on the graph is reversed.
Take the aforementioned simple network as an illustrative example. On the one hand, computing the regularization term (5.3.5) during forward propagation depends on the current values of model parameters \(\mathbf{W}^{(1)}\) and \(\mathbf{W}^{(2)}\). They are given by the optimization algorithm according to backpropagation in the most recent iteration. On the other hand, the gradient calculation for the parameter (5.3.11) during backpropagation depends on the current value of the hidden layer output \(\mathbf{h}\), which is given by forward propagation.
Therefore, when training neural networks, once model parameters are initialized, we alternate forward propagation with backpropagation, updating model parameters using gradients given by backpropagation. Note that backpropagation reuses the stored intermediate values from forward propagation to avoid duplicate calculations. One of the consequences is that we need to retain the intermediate values until backpropagation is complete. This is also one of the reasons why training requires significantly more memory than plain prediction. Besides, the size of such intermediate values is roughly proportional to the number of network layers and the batch size. Thus, training deeper networks using larger batch sizes more easily leads to out-of-memory errors.
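The interdependence described above can be made concrete with a small NumPy sketch. It assumes a one-hidden-layer network without biases, a tanh activation, a squared-error loss, and an L2 penalty weighted by lam; these choices, along with the function names and the cache dictionary, are assumptions for illustration and are not fixed by the text. The forward pass stores its intermediate values, and the backward pass reads them back instead of recomputing them.

```python
import numpy as np

def forward(x, y, W1, W2, lam):
    """Forward propagation: compute and store the intermediate variables."""
    z = W1 @ x                      # hidden pre-activation
    h = np.tanh(z)                  # hidden layer output
    o = W2 @ h                      # network output
    loss = 0.5 * np.sum((o - y) ** 2)
    reg = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # L2 penalty
    cache = {"x": x, "h": h, "o": o}   # kept in memory until backward() runs
    return loss + reg, cache

def backward(y, W1, W2, lam, cache):
    """Backpropagation: traverse the graph in reverse, reusing the cache."""
    x, h, o = cache["x"], cache["h"], cache["o"]
    dJ_do = o - y                               # gradient w.r.t. the output
    dJ_dW2 = np.outer(dJ_do, h) + lam * W2      # needs the stored h
    dJ_dh = W2.T @ dJ_do
    dJ_dz = dJ_dh * (1.0 - h ** 2)              # tanh'(z) = 1 - tanh(z)^2
    dJ_dW1 = np.outer(dJ_dz, x) + lam * W1      # needs the stored x
    return dJ_dW1, dJ_dW2

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
lam = 0.1

J, cache = forward(x, y, W1, W2, lam)
dW1, dW2 = backward(y, W1, W2, lam, cache)
print(J, dW1.shape, dW2.shape)
```

In this sketch the cache holds one array per layer for a single example; with a batch, each array gains a batch dimension, which is why memory grows with both depth and batch size.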
5.3. Summary
Forward propagation sequentially calculates and stores intermediate variables within the computational graph defined by the neural network. It proceeds from the input to the output layer. Backpropagation sequentially calculates and stores the gradients of intermediate variables and parameters within the neural network in the reversed order. When training deep learning models, forward propagation and backpropagation are interdependent, and training requires significantly more memory than prediction.
Coding backpropagation in Python
It’s quite easy to implement the backpropagation algorithm for the example discussed in the previous section. In this section, we’ll use this GitHub project to build a network with 2 inputs and 1 output from scratch.
The next code uses NumPy to prepare the inputs (x1=0.1 and x2=0.4), the output with the value 0.7, the learning rate with value 0.01, and assign initial values for the 2 weights w1 and w2. At the end, 2 empty lists are created to hold the network prediction and error in each epoch.
import numpy

x1 = 0.1
x2 = 0.4

target = 0.7
learning_rate = 0.01

w1 = numpy.random.rand()
w2 = numpy.random.rand()

print("Initial W : ", w1, w2)

predicted_output = []
network_error = []
The next code builds some functions that help us in the calculations:
 sigmoid(): Applies the sigmoid activation function.
 error(): Returns the squared error.
 error_predicted_deriv(): Returns the derivative of the error with respect to the predicted output.
 sigmoid_sop_deriv(): Returns the derivative of the sigmoid function with respect to the sum of products (SOP).
 sop_w_deriv(): Returns the derivative of the SOP with respect to a single weight.
 update_w(): Updates a single weight.
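import numpy

def sigmoid(sop):
    # Sigmoid activation applied to the sum of products (SOP).
    return 1.0 / (1 + numpy.exp(-1 * sop))

def error(predicted, target):
    # Squared error between the prediction and the target.
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    # Derivative of the squared error with respect to the prediction.
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    # Derivative of the sigmoid with respect to the SOP.
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    # Derivative of the SOP with respect to a weight is that weight's input.
    return x

def update_w(w, grad, learning_rate):
    # One gradient descent step for a single weight.
    return w - learning_rate * grad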
Now, we’re ready to do the forward and backward pass calculations for a number of epochs, using a “for” loop according to the next code. The loop goes through 80,000 epochs.
for k in range(80000):
    # Forward Pass
    y = w1*x1 + w2*x2
    predicted = sigmoid(y)
    err = error(predicted, target)

    predicted_output.append(predicted)
    network_error.append(err)

    # Backward Pass
    g1 = error_predicted_deriv(predicted, target)
    g2 = sigmoid_sop_deriv(y)

    g3w1 = sop_w_deriv(x1)
    g3w2 = sop_w_deriv(x2)

    gradw1 = g3w1*g2*g1
    gradw2 = g3w2*g2*g1

    w1 = update_w(w1, gradw1, learning_rate)
    w2 = update_w(w2, gradw2, learning_rate)
In the forward pass, the following lines are executed that calculate the SOP, apply the sigmoid activation function to get the predicted output, and calculate the error. This appends the current network prediction and error in the predicted_output and network_error lists, respectively.
y = w1*x1 + w2*x2
predicted = sigmoid(y)
err = error(predicted, target)

predicted_output.append(predicted)
network_error.append(err)
In the backward pass, the remaining lines in the “for” loop are executed to calculate the derivatives in all chains. The derivatives of the error with respect to the weights are saved in the variables gradw1 and gradw2. Finally, the weights are updated by calling the update_w() function.
g1 = error_predicted_deriv(predicted, target)
g2 = sigmoid_sop_deriv(y)

g3w1 = sop_w_deriv(x1)
g3w2 = sop_w_deriv(x2)

gradw1 = g3w1*g2*g1
gradw2 = g3w2*g2*g1

w1 = update_w(w1, gradw1, learning_rate)
w2 = update_w(w2, gradw2, learning_rate)
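One way to gain confidence in the chain-rule products gradw1 and gradw2 is to compare them against finite-difference estimates of the same derivatives. The sketch below is not part of the GitHub project; the helper names network_error_at and chain_rule_grads, the fixed starting weights, and the step size eps are assumptions chosen for illustration.

```python
import numpy

def sigmoid(sop):
    return 1.0 / (1.0 + numpy.exp(-sop))

def network_error_at(w1, w2, x1, x2, target):
    # Full forward pass: SOP -> sigmoid -> squared error.
    predicted = sigmoid(w1 * x1 + w2 * x2)
    return (predicted - target) ** 2

def chain_rule_grads(w1, w2, x1, x2, target):
    # Same three-link chain as in the backward pass above.
    sop = w1 * x1 + w2 * x2
    predicted = sigmoid(sop)
    g1 = 2 * (predicted - target)          # d(error)/d(predicted)
    g2 = predicted * (1.0 - predicted)     # d(predicted)/d(SOP)
    return x1 * g2 * g1, x2 * g2 * g1      # d(SOP)/d(w) is the input itself

x1, x2, target = 0.1, 0.4, 0.7
w1, w2 = 0.5, 0.2      # fixed hypothetical weights so the check is repeatable
eps = 1e-6             # finite-difference step size

gradw1, gradw2 = chain_rule_grads(w1, w2, x1, x2, target)

# Central differences: nudge one weight at a time and watch the error change.
num_gradw1 = (network_error_at(w1 + eps, w2, x1, x2, target)
              - network_error_at(w1 - eps, w2, x1, x2, target)) / (2 * eps)
num_gradw2 = (network_error_at(w1, w2 + eps, x1, x2, target)
              - network_error_at(w1, w2 - eps, x1, x2, target)) / (2 * eps)

print("w1:", gradw1, num_gradw1)
print("w2:", gradw2, num_gradw2)
```

If the analytic and numerical values agree to several decimal places, the derivative chain is very likely implemented correctly.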
The complete code is below. It prints the predicted output after each epoch. Also, it uses the matplotlib library to create 2 plots, showing how the predicted output and the error evolve by epoch.
import numpy
import matplotlib.pyplot

def sigmoid(sop):
    return 1.0 / (1 + numpy.exp(-1 * sop))

def error(predicted, target):
    return numpy.power(predicted - target, 2)

def error_predicted_deriv(predicted, target):
    return 2 * (predicted - target)

def sigmoid_sop_deriv(sop):
    return sigmoid(sop) * (1.0 - sigmoid(sop))

def sop_w_deriv(x):
    return x

def update_w(w, grad, learning_rate):
    return w - learning_rate * grad

x1 = 0.1
x2 = 0.4

target = 0.7
learning_rate = 0.01

w1 = numpy.random.rand()
w2 = numpy.random.rand()

print("Initial W : ", w1, w2)

predicted_output = []
network_error = []

old_err = 0
for k in range(80000):
    # Forward Pass
    y = w1*x1 + w2*x2
    predicted = sigmoid(y)
    err = error(predicted, target)

    predicted_output.append(predicted)
    network_error.append(err)

    # Backward Pass
    g1 = error_predicted_deriv(predicted, target)
    g2 = sigmoid_sop_deriv(y)

    g3w1 = sop_w_deriv(x1)
    g3w2 = sop_w_deriv(x2)

    gradw1 = g3w1*g2*g1
    gradw2 = g3w2*g2*g1

    w1 = update_w(w1, gradw1, learning_rate)
    w2 = update_w(w2, gradw2, learning_rate)

    print(predicted)

matplotlib.pyplot.figure()
matplotlib.pyplot.plot(network_error)
matplotlib.pyplot.title("Iteration Number vs Error")
matplotlib.pyplot.xlabel("Iteration Number")
matplotlib.pyplot.ylabel("Error")

matplotlib.pyplot.figure()
matplotlib.pyplot.plot(predicted_output)
matplotlib.pyplot.title("Iteration Number vs Prediction")
matplotlib.pyplot.xlabel("Iteration Number")
matplotlib.pyplot.ylabel("Prediction")

# Display both figures when running as a script.
matplotlib.pyplot.show()
In the next figure, the error is plotted for the 80,000 epochs. Note how the error is saturated at the value 3.150953682878443e-13, which is very close to 0.0.
The next figure shows how the predicted output changed by iteration. Remember that the correct output value is set to 0.7 in our example. The output is saturated at the value 0.6999994386664375, very close to 0.7.
The GitHub project also gives a simpler interface to build the network in the Ch09 directory. There’s an example that builds a network with 3 inputs and 1 output. At the end of the code, the function predict() is called to ask the network to predict the output of a new sample [0.2, 3.1, 1.7].
import MLP
import numpy

x = numpy.array([0.1, 0.4, 4.1])
y = numpy.array([0.2])

network_architecture = [7, 5, 4]

# Network Parameters
trained_ann = MLP.MLP.train(x=x,
                            y=y,
                            net_arch=network_architecture,
                            max_iter=500,
                            learning_rate=0.7,
                            debug=True)

print("Derivative Chains : ", trained_ann["derivative_chain"])
print("Training Time : ", trained_ann["training_time_sec"])
print("Number of Training Iterations : ", trained_ann["elapsed_iter"])

predicted_output = MLP.MLP.predict(trained_ann, numpy.array([0.2, 3.1, 1.7]))
print("Predicted Output : ", predicted_output)
This code uses a module called MLP, a script that builds the backpropagation algorithm while giving the user a simple interface to build, train, and test the network. For details about how to build this script, please refer to this book.
Quick overview of Neural Network architecture
In the simplest scenario, the architecture of a neural network consists of some sequential layers, where the layer numbered i is connected to the layer numbered i+1. The layers can be classified into 3 classes:
 Input
 Hidden
 Output
The next figure shows an example of a fully-connected artificial neural network (FCANN), the simplest type of network for demonstrating how the backpropagation algorithm works. The network has an input layer, 2 hidden layers, and an output layer. In the figure, the network architecture is presented horizontally so that each layer is represented vertically from left to right.
Each layer consists of 1 or more neurons represented by circles. Because the network is fully connected, each neuron in layer i is connected with all neurons in layer i+1. If 2 subsequent layers have X and Y neurons, then the number of in-between connections is X*Y.
For each connection, there is an associated weight. The weight is a floating-point number that measures the importance of the connection between 2 neurons. The higher the weight, the more important the connection. The weights are the learnable parameters by which the network makes a prediction. If the weights are good, then the network makes accurate predictions with less error. Otherwise, the weights should be updated to reduce the error.
Assume that a neuron N1 at layer 1 is connected to another neuron N2 at layer 2. Assume also that the value of N2 is calculated according to the next linear equation.
N2 = w1 × N1 + b
If N1=4, w1=0.5 (the weight) and b=1 (the bias), then the value of N2 is 3.
N2 = 0.5 × 4 + 1 = 2 + 1 = 3
This is how a single weight connects 2 neurons together. Note that the input layer has no learnable parameters at all.
Each neuron at layer i+1 has a weight for each connected neuron at layer i, but it only has a single bias. So, if layer i has 10 neurons and layer i+1 has 6 neurons, then the total number of parameters for layer i+1 is:
number of weights + number of biases = 10×6 + 6 = 66
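The same count can be extended to a whole fully-connected network by summing over consecutive layer pairs. The function name count_parameters is a hypothetical helper introduced for this sketch; the layer sizes [10, 6, 4, 1] are the ones used in this section's example architecture.

```python
def count_parameters(layer_sizes):
    """Count learnable parameters in a fully-connected network.

    layer_sizes lists the number of neurons per layer, input layer first.
    Each layer after the input contributes one weight per incoming
    connection plus one bias per neuron; the input layer contributes none.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out   # weights + biases
    return total

# Single layer pair from the text: 10 neurons feeding 6 neurons.
print(count_parameters([10, 6]))         # 66

# Example architecture: 10 inputs, hidden layers of 6 and 4, 1 output.
print(count_parameters([10, 6, 4, 1]))   # 66 + 28 + 5 = 99
```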
The input layer is the first layer in the network, and the network’s inputs are connected directly to it. There can only be a single input layer in the network. For example, if the inputs are student scores in a semester, then these scores are connected to the input layer. In our figure, the input layer has 10 neurons (e.g. scores for 10 courses, for a heroic student who took 10 courses in one semester).
The output layer is the last layer which returns the network’s predicted output. Like the input layer, there can only be a single output layer. If the objective of the network is to predict student scores in the next semester, then the output layer should return a score. The architecture in the next figure has a single neuron that returns the next semester’s predicted score.
Between the input and output layers, there might be 0 or more hidden layers. In this example, there are 2 hidden layers with 6 and 4 neurons, respectively. Note that the last hidden layer is connected to the output layer.
Usually, each neuron in the hidden layer uses an activation function like sigmoid or rectified linear unit (ReLU). This helps to capture the nonlinear relationship between the inputs and their outputs. The neurons in the output layer also use activation functions, like sigmoid (for regression when the targets lie between 0 and 1) or softmax (for classification).
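As a rough illustration of the two output activations mentioned above, the sketch below implements sigmoid and softmax in NumPy. The input values are arbitrary, and subtracting the maximum inside softmax is a common numerical-stability choice rather than something the text prescribes.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turns a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)     # avoids overflow in exp for large scores
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# A single output neuron, e.g. a score scaled to the range 0..1.
single_output = sigmoid(0.0)
print(single_output)            # 0.5

# Three output neurons, one per class (arbitrary example scores).
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())
```

The class with the largest score receives the largest probability, so taking the index of the maximum of probs gives the predicted class.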
After building the network architecture, it’s time to start training it with data.