Long Short-Term Memory (LSTM)

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture specifically designed to address the vanishing gradient problem of vanilla RNNs.

Vanilla RNNs fail to learn when the time lag between relevant input events and target signals exceeds roughly 5–10 discrete time steps. This vanishing error problem casts doubt on whether standard RNNs can offer significant practical advantages over time-window-based feedforward networks. LSTM is not affected by this problem: it can learn to bridge minimal time lags in excess of 1,000 discrete time steps by enforcing constant error flow through “constant error carrousels” (CECs) inside special units called cells.

A common LSTM unit is composed of a cell and three gates: an input gate, an output gate and a forget gate. The cell remembers values over arbitrary time intervals, while the gates regulate the flow of information into and out of the cell; they provide continuous analogues of the write, read and reset operations on the cell. The rest of the network can interact with the cells only through the gates.
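The cell-and-gates structure can be sketched in a few lines of NumPy. This is a minimal illustrative forward step of a vanilla LSTM unit, not any particular library's implementation; the weight layout (four stacked blocks) and the toy dimensions are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step of a vanilla LSTM cell.

    W stacks the input-gate, forget-gate, output-gate and candidate
    weights, shape (4*hidden, input+hidden); b is the stacked bias.
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate: "write"
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate: "reset"
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: "read"
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell value
    c = f * c_prev + i * g                  # additive cell update (the CEC)
    h = o * np.tanh(c)                      # gated cell output
    return h, c

# Toy dimensions and random weights, just to run the step.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(5):                          # unroll over a short sequence
    x = rng.standard_normal(n_in)
    h, c = lstm_step(x, h, c, W, b)
```

Note the additive form of the cell update `c = f * c_prev + i * g`: when the forget gate stays near 1, the cell state (and hence the error signal flowing through it) is carried forward essentially unchanged, which is how the CEC avoids vanishing gradients.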

It is interesting to note that even after more than 20 years, the simple (or vanilla) LSTM may still be the best place to start when applying the technique: the most commonly used LSTM architecture performs reasonably well across a variety of datasets. The learning rate and the network size are the most crucial tunable LSTM hyperparameters, and they interact only weakly, which implies that they can be tuned largely independently. In particular, the learning rate can be calibrated first using a fairly small network, saving a lot of experimentation time.
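The "tune the learning rate first, on a small network" recipe amounts to a simple sequential search. The sketch below is a toy stand-in: `train_small_net` uses gradient descent on a quadratic loss in place of actually training a reduced-size LSTM, and the candidate rates are arbitrary assumptions; the point is only the shape of the procedure.

```python
def train_small_net(lr, steps=100):
    """Stand-in for training a small network at learning rate `lr`:
    gradient descent on the toy loss w**2, starting from w = 5.
    Returns the final loss (playing the role of validation loss)."""
    w = 5.0
    for _ in range(steps):
        grad = 2.0 * w          # d/dw of w**2
        w -= lr * grad
    return w ** 2

# Calibrate the learning rate first, on the cheap small model...
candidates = [1e-3, 1e-2, 1e-1, 0.5]
losses = {lr: train_small_net(lr) for lr in candidates}
best_lr = min(losses, key=losses.get)
# ...then reuse best_lr when training the full-size network.
```

In a real experiment, `train_small_net` would train an LSTM with a reduced hidden size on (a subset of) the data and report validation loss; the chosen rate is then carried over to the full-size run.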