Chapter 3

LSTMs: Solving the Memory Problem

How Long Short-Term Memory networks fixed RNN limitations with gates and cell states.

8 min read

The Problem LSTMs Solved

RNNs could process sequences but couldn't remember long-range dependencies. If a sentence started with "The man who went to the store and bought groceries and then walked home" and ended with "was tired," an RNN would struggle to connect "was tired" back to "The man." The information simply decayed over those many steps.

The Memory Challenge

Imagine reading a novel where you forget the main character's name by chapter 3. That's what RNNs experience with long sequences. LSTMs were designed to remember what matters across hundreds of steps.

The LSTM Innovation: Gates

Long Short-Term Memory networks, introduced in 1997, solved this with a brilliant mechanism: gates. Think of gates as selective filters that control what information to keep, what to forget, and what to output.

| Gate | Function | Analogy |
| --- | --- | --- |
| Forget Gate | Decides what to discard | Clearing irrelevant notes |
| Input Gate | Decides what to store | Writing important notes |
| Output Gate | Decides what to output | Reading relevant notes aloud |

Gates in Action

Forget Gate: "Is 'The man' still relevant? Keep it. Is the detail about groceries important? Maybe forget it."

Input Gate: "This new word seems important—add it to memory."

Output Gate: "Based on everything I remember, here's what's relevant right now."
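
To make the gate mechanics concrete, here is a minimal sketch of a single LSTM step in plain NumPy. It follows the standard formulation (sigmoid gates, tanh candidate); the variable names, toy dimensions, and random weights are illustrative assumptions, not a production implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold stacked weights for the
    forget (f), input (i), candidate (g), and output (o) blocks."""
    z = W @ x_t + U @ h_prev + b          # all four pre-activations at once
    f, i, g, o = np.split(z, 4)

    f = sigmoid(f)                        # forget gate: what to erase from the cell state
    i = sigmoid(i)                        # input gate: what new information to store
    g = np.tanh(g)                        # candidate values proposed for storage
    o = sigmoid(o)                        # output gate: what to reveal from memory

    c_t = f * c_prev + i * g              # update the cell state (the "highway")
    h_t = o * np.tanh(c_t)                # expose a filtered view as the hidden state
    return h_t, c_t

# Toy dimensions (hypothetical): 8-dim input, 16-dim hidden/cell state
rng = np.random.default_rng(0)
n_in, n_hid = 8, 16
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):    # walk a short random sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
```

Note that each call depends on the `h` and `c` produced by the previous one; that sequential dependency is what the limitations section below returns to.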

The Cell State: A Highway for Information

The real innovation is the cell state—a highway that runs through the entire sequence with minimal modification. Information can flow along this highway relatively unchanged, protected from the vanishing gradient problem. Gates add or remove information from this highway as needed.

Why This Matters

The cell state acts like a conveyor belt. Information placed on it at step 1 can travel to step 100 with minimal degradation. This is how LSTMs remember long-range dependencies that RNNs cannot.
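
In equation form, the conveyor-belt behaviour comes from the additive cell-state update used in the standard LSTM formulation, where \(\odot\) denotes elementwise multiplication and \(\tilde{c}_t\) is the candidate produced by the input path:

```latex
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
```

When the forget gate stays close to 1 and the input gate close to 0, \(c_t \approx c_{t-1}\): both the stored information and the gradient flowing back through the cell state pass from step to step nearly unchanged. This additive path is what sidesteps the vanishing gradient that plagues plain RNNs.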

Where LSTMs Excel

| Application | Example Companies | Why LSTM |
| --- | --- | --- |
| Language modeling | Early Google, Facebook | Context over paragraphs |
| Machine translation | Google Translate (pre-2017) | Sentence-level memory |
| Time series | Financial firms, utilities | Long-term patterns |
| Speech recognition | Siri, Alexa (early versions) | Audio context |
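
As a rough illustration of the time-series use case in the table above, here is a minimal sketch using PyTorch's `nn.LSTM`. The batch size, sequence length, hidden width, and the single linear prediction head are assumptions chosen for the example, not details taken from any of the systems listed.

```python
import torch
import torch.nn as nn

# Hypothetical setup: predict the next value of a univariate time series
# from the previous 50 observations, for a batch of 32 sequences.
batch, seq_len, n_features, n_hidden = 32, 50, 1, 64

lstm = nn.LSTM(input_size=n_features, hidden_size=n_hidden, batch_first=True)
head = nn.Linear(n_hidden, 1)                 # map the final hidden state to a prediction

x = torch.randn(batch, seq_len, n_features)   # stand-in for real data
output, (h_n, c_n) = lstm(x)                  # output: (batch, seq_len, n_hidden)

y_pred = head(h_n[-1])                        # final hidden state of the last layer
print(y_pred.shape)                           # torch.Size([32, 1])
```

The loop over time steps is hidden inside `nn.LSTM`, but it still runs sequentially, which is exactly the limitation discussed in the next section.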

Why LSTMs Were Superseded

LSTMs dominated for nearly two decades. But they have limitations:

LSTM Limitations

Sequential processing: Each step depends on the previous one, so training is slow
Compute intensive: The gates add extra parameters and per-step computation
Very long sequences: Still struggle to retain information beyond a few thousand steps
Parallelization: The step-by-step dependency prevents full use of modern GPU parallelism

In 2017, transformers arrived and changed everything—not by improving sequential processing, but by abandoning it entirely.

Key Insight

LSTMs are still used today for specific time-series applications where transformers are overkill. They're not obsolete—just no longer the default choice for language tasks.
