Describe the long short-term memory neural network architecture (LSTM). Give examples of applications.

## Expert Answer

*Long Short-Term Memory Neural Network:-*

1. Long Short-Term Memory (LSTM) networks are a type of recurrent neural network capable of learning order dependence in sequence prediction problems.

2. This capability is required in complex problem domains such as machine translation and speech recognition.

3. LSTMs are a complex area of deep learning. It can be hard to get a handle on what LSTMs are, and on how terms like bidirectional and sequence-to-sequence relate to the field.

4. This answer draws on the words of the research scientists who developed the methods and applied them to new and important problems.

5. Few are better at clearly and precisely articulating both the promise of LSTMs and how they work than the experts who developed them.

*How do LSTMs Work:-*

1. Before turning to the equations that govern LSTMs, a high-level description is a useful way to quickly get a handle on how they work.

2. In the developers' words: "We use networks with one input layer, one hidden layer, and one output layer… The (fully) self-connected hidden layer contains memory cells and corresponding gate units."

3. "Each memory cell’s internal architecture guarantees constant error flow within its constant error carrousel (CEC)… This represents the basis for bridging very long time lags. Two gate units learn to open and close access to error flow within each memory cell’s CEC. The multiplicative input gate affords protection of the CEC from perturbation by irrelevant inputs. Likewise, the multiplicative output gate protects other units from perturbation by currently irrelevant memory contents."
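The phrase "constant error flow" can be made concrete with a short derivation. This is only a sketch under stated assumptions: it uses the $C_t$, $i_t$ and $\widetilde{C_t}$ notation from the equations later in this answer, it assumes the cell has no forget gate (that is, $f_t = 1$), and it treats the input gate and candidate value as if they did not depend on the previous cell state.

```latex
C_t = C_{t-1} + i_t * \widetilde{C_t}
\quad\Longrightarrow\quad
\frac{\partial C_t}{\partial C_{t-1}} = 1
\quad\Longrightarrow\quad
\frac{\partial C_T}{\partial C_t}
  = \prod_{k=t+1}^{T} \frac{\partial C_k}{\partial C_{k-1}} = 1
```

Under these assumptions the error passed back along the cell state is neither shrunk nor amplified, however large the gap $T - t$ is. The gates then only decide what is written into the cell and what is read out of it.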

*LSTM:-*

In a traditional recurrent neural network, during the gradient back-propagation phase, the gradient signal can end up being multiplied a large number of times (as many as the number of timesteps) by the weight matrix associated with the connections between the neurons of the recurrent hidden layer. This means that the magnitude of the weights in the transition matrix can have a strong impact on the learning process.

If the weights in this matrix are small (or, more formally, if the leading eigenvalue of the weight matrix is smaller than 1.0), it can lead to a situation called vanishing gradients, where the gradient signal gets so small that learning either becomes very slow or stops working altogether. It can also make the task of learning long-term dependencies in the data more difficult. Conversely, if the weights in this matrix are large (or, again, more formally, if the leading eigenvalue of the weight matrix is larger than 1.0), it can lead to a situation where the gradient signal is so large that it causes learning to diverge. This is often referred to as exploding gradients.
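The effect of the leading eigenvalue can be sketched numerically. The example below is a simplified illustration, not a real back-propagation pass: it assumes a diagonal 2×2 recurrent weight matrix (so the eigenvalues are just the diagonal entries) and ignores the derivatives of the activation functions. The helper names `matvec` and `repeated_gradient_norm` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def repeated_gradient_norm(leading_eigenvalue, timesteps):
    """Norm of a gradient vector after `timesteps` multiplications by the matrix.

    Assumption: the recurrent weight matrix is diagonal, so its eigenvalues
    are the diagonal entries, and activation derivatives are ignored.
    """
    weights = [[leading_eigenvalue, 0.0],
               [0.0, leading_eigenvalue / 2.0]]
    gradient = [1.0, 1.0]
    for _ in range(timesteps):
        # One step back in time: multiply the gradient by the weight matrix.
        gradient = matvec(weights, gradient)
    return sum(g * g for g in gradient) ** 0.5


if __name__ == "__main__":
    for eigenvalue in (0.9, 1.0, 1.1):
        print(eigenvalue, repeated_gradient_norm(eigenvalue, 50))
```

After 50 timesteps, a leading eigenvalue of 0.9 shrinks the gradient norm to roughly 0.005, while a leading eigenvalue of 1.1 grows it past 100. Only a value of exactly 1.0 keeps it stable in this toy setting.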

The equations below describe how a layer of memory cells is updated at every timestep $t$. In these equations:

$x_t$ is the input to the memory cell layer at time $t$

$h_{t-1}$ and $C_{t-1}$ are the output and the state of the memory cell layer at the previous timestep

$W_i$, $W_f$, $W_c$, $W_o$, $U_i$, $U_f$, $U_c$, $U_o$ and $V_o$ are weight matrices

$b_i$, $b_f$, $b_c$ and $b_o$ are bias vectors

$\sigma$ is the logistic sigmoid function and $*$ denotes element-wise multiplication

First, we compute the values for $i_t$, the input gate, and $\widetilde{C_t}$, the candidate value for the states of the memory cells at time $t$:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \tag{1}$$

$$\widetilde{C_t} = \tanh(W_c x_t + U_c h_{t-1} + b_c) \tag{2}$$

Second, we compute the value for $f_t$, the activation of the memory cells’ forget gates at time $t$:

$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \tag{3}$$

Given the value of the input gate activation $i_t$, the forget gate activation $f_t$ and the candidate state value $\widetilde{C_t}$, we can compute $C_t$, the memory cells’ new state at time $t$:

$$C_t = i_t * \widetilde{C_t} + f_t * C_{t-1} \tag{4}$$

With the new state of the memory cells, we can compute the value of their output gates and, subsequently, their outputs:

$$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o) \tag{5}$$

$$h_t = o_t * \tanh(C_t) \tag{6}$$
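Equations (1) to (6) can be traced in code. The following is a minimal sketch of a single timestep in plain Python, assuming vectors are lists of floats and matrices are lists of rows. The function names `affine`, `lstm_step` and `zero_params`, and the dictionary layout of the parameters, are hypothetical choices made for illustration and do not come from any particular library.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, used for the three gates."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    """Return W x + U h + b as a list (matrices are lists of rows)."""
    return [
        sum(w * xi for w, xi in zip(W_row, x))
        + sum(u * hi for u, hi in zip(U_row, h))
        + bias
        for W_row, U_row, bias in zip(W, U, b)
    ]


def lstm_step(x_t, h_prev, C_prev, p):
    """One update of a layer of memory cells, following equations (1)-(6)."""
    # (1) input gate
    i_t = [sigmoid(a) for a in affine(p["W_i"], x_t, p["U_i"], h_prev, p["b_i"])]
    # (2) candidate cell state
    C_tilde = [math.tanh(a) for a in affine(p["W_c"], x_t, p["U_c"], h_prev, p["b_c"])]
    # (3) forget gate
    f_t = [sigmoid(a) for a in affine(p["W_f"], x_t, p["U_f"], h_prev, p["b_f"])]
    # (4) new cell state: gated candidate plus gated previous state
    C_t = [i * c + f * cp for i, c, f, cp in zip(i_t, C_tilde, f_t, C_prev)]
    # (5) output gate, which also sees the new cell state through V_o
    o_pre = affine(p["W_o"], x_t, p["U_o"], h_prev, p["b_o"])
    o_t = [
        sigmoid(a + sum(v * c for v, c in zip(V_row, C_t)))
        for a, V_row in zip(o_pre, p["V_o"])
    ]
    # (6) output of the memory cells
    h_t = [o * math.tanh(c) for o, c in zip(o_t, C_t)]
    return h_t, C_t


def zero_params(n_in, n_hidden):
    """Hypothetical helper: all-zero weights and biases, handy for checking."""
    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]

    params = {}
    for gate in "ifco":
        params["W_" + gate] = zeros(n_hidden, n_in)
        params["U_" + gate] = zeros(n_hidden, n_hidden)
        params["b_" + gate] = [0.0] * n_hidden
    params["V_o"] = zeros(n_hidden, n_hidden)
    return params


if __name__ == "__main__":
    params = zero_params(n_in=3, n_hidden=2)
    h_t, C_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -2.0], params)
    print(h_t, C_t)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate state to 0, so the new cell state is simply half the previous one. This gives an easy sanity check before plugging in trained weights.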

*The Problem of Long-Term Dependencies:-*

1. One of the appeals of RNNs is the idea that they might be able to connect previous information to the present task, for example by using previous video frames to inform the understanding of the present frame. If RNNs could do this, they’d be extremely useful. But can they? It depends.

2. Sometimes, we only need to look at recent information to perform the present task. For example, consider a language model trying to predict the next word based on the previous ones. If we are trying to predict the last word in “the clouds are in the sky,” we don’t need any further context: it’s pretty obvious the next word is going to be sky. In such cases, where the gap between the relevant information and the place where it’s needed is small, RNNs can learn to use the past information.