motyl.dev
Grzegorz Motyl

© 2026 Grzegorz Motyl. Raising the bar of professional software development.


    Does the Adam Optimizer Make Neural Networks Forget? A Deep Dive

    Published on 19.03.2026

    #hackernoon
    #ai
    #ml
    AI & AGENTS

    Does the Adam Optimizer Amplify Catastrophic Forgetting?

    TLDR: Research by Dylan Ashley, Sina Ghiassian, and Richard Sutton shows that the choice of optimizer has a significant impact on catastrophic forgetting in neural networks, and surprisingly, vanilla SGD often outperforms Adam. The findings also reveal that how you measure forgetting can completely change your conclusions.

    Catastrophic forgetting remains one of the stubborn unsolved problems in neural network research. It is the tendency for artificial neural networks to forget previously learned information when exposed to new information, and it is a severe hindrance to the broad application of ANNs in online learning and continual learning settings. Despite decades of work on the problem, a research paper published on HackerNoon by the Adam Optimizer channel argues that the field still does not understand the phenomenon well enough to measure it reliably, let alone fix it.

    The core finding is striking. The choice of gradient-based optimization algorithm used to train an ANN has a significant impact on the amount of catastrophic forgetting. The surprise is that classical algorithms such as vanilla SGD frequently experience less catastrophic forgetting than more modern algorithms such as Adam. The researchers compared vanilla SGD, SGD with Momentum, RMSProp, and Adam across testbeds drawn from both the reinforcement learning and supervised learning literature. For the first time, the results provide strong empirical grounding for earlier observations about why vanilla SGD is often favored in continual learning settings.
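The four update rules can be sketched side by side on a toy objective. This is only an illustration: the hyperparameters are common defaults, the objective is a one-dimensional quadratic, and none of it reflects the testbeds or settings used in the study.

```python
import numpy as np

# Minimal sketches of the four update rules the study compares.
# Hyperparameters are common defaults chosen for illustration, not the
# values used in the paper.

def sgd(w, g, state, lr=0.1):
    return w - lr * g, state

def sgd_momentum(w, g, state, lr=0.1, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) + g        # velocity accumulates gradients
    return w - lr * state["v"], state

def rmsprop(w, g, state, lr=0.01, beta=0.9, eps=1e-8):
    state["s"] = beta * state.get("s", 0.0) + (1 - beta) * g ** 2
    return w - lr * g / (np.sqrt(state["s"]) + eps), state

def adam(w, g, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)  # bias-corrected second moment
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), state

def minimize(step_fn, w0=5.0, steps=1000):
    """Minimize f(w) = w^2 (gradient 2w) with the given update rule."""
    w, state = w0, {}
    for _ in range(steps):
        w, state = step_fn(w, 2.0 * w, state)
    return w

for name, fn in [("SGD", sgd), ("Momentum", sgd_momentum),
                 ("RMSProp", rmsprop), ("Adam", adam)]:
    print(f"{name:8s} final w = {minimize(fn):+.5f}")
```

All four minimize the toy loss; the paper's point is that they differ in how they treat previously learned parameters in sequential-task settings, which this single-objective sketch does not capture.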

    But the measurement problem is arguably more important than the optimizer finding. The paper compares four different existing metrics for quantifying catastrophic forgetting: retention, relearning, activation overlap, and pairwise interference. The result is disturbing for the research community. The degree to which learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. The relative ranking of algorithms varies wildly between metrics. This means that most existing research into methods to mitigate catastrophic forgetting, which rarely looks at more than one metric, may be drawing conclusions that depend entirely on which measurement they chose rather than on actual algorithm performance.
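To make the measurement problem concrete, here is a rough sketch of how two of these metrics could be scored and how they can disagree. The formulas below are illustrative readings of the metric names (retention as surviving accuracy, relearning as savings in training steps), not the paper's exact definitions, and the learners are hypothetical.

```python
# Illustrative scorings of two of the four metrics named above.  These
# exact formulas are assumptions inferred from the metric names, not
# the paper's definitions.

def retention(acc_before, acc_after):
    """Fraction of task-A accuracy that survives training on task B.
    1.0 = no forgetting, 0.0 = everything lost."""
    return acc_after / acc_before

def relearning_savings(steps_initial, steps_relearn):
    """Savings-style score: > 1 means task A is relearned faster than
    it was first learned, i.e. some trace of it survived."""
    return steps_initial / steps_relearn

# The paper's warning in miniature: two hypothetical learners can rank
# in opposite orders depending on which metric you trust.
learner_a = {"retention": retention(0.90, 0.60),           # keeps 2/3 of task A
             "relearning": relearning_savings(1000, 900)}  # barely relearns faster
learner_b = {"retention": retention(0.90, 0.30),           # keeps only 1/3
             "relearning": relearning_savings(1000, 200)}  # but relearns 5x faster

print(learner_a)
print(learner_b)
```

Under retention, learner A forgets less; under relearning, learner B does. A study that reports only one of the two would reach opposite conclusions about the same pair of systems.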

    The authors recommend that work looking at inter-task forgetting in supervised learning must consider both retention and relearning metrics concurrently. For intra-task forgetting in reinforcement learning, they recommend at the very least measuring with pairwise interference. This is a much more rigorous experimental methodology than the current standard.

    The significance of this work extends beyond academic measurement debates. As neural networks continue to drive major AI breakthroughs, from computer vision to natural language processing to reinforcement learning agents, the ability to learn continuously without catastrophic forgetting becomes critical. If the field has been measuring the problem incorrectly, then the solutions people have been building may be solving the wrong version of the problem.

    The paper also raises important questions about why Adam, which is the default optimizer for most deep learning practitioners, might be worse for continual learning. The adaptive learning rate mechanism that makes Adam converge faster on new tasks may simultaneously be more aggressively overwriting the parameters that encode previously learned information. This is a fundamental tension between fast learning and stable memory that the optimizer choice brings into sharp focus.
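One way to see that tension is to look at what Adam's per-parameter scaling does to a weight whose gradients have been near zero for a long time, i.e. a weight whose old-task job is "done". The numbers below are a toy illustration with standard Adam defaults, not an experiment from the paper.

```python
import math

# After a parameter has seen near-zero gradients for a while, Adam's
# second-moment estimate v collapses, so the first gradient from a new
# task produces a near-maximal normalized step and overwrites the
# parameter quickly.  Plain SGD's step stays proportional to the raw
# gradient.  Hyperparameters are the usual Adam defaults.

def adam_step_size(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Return |update| Adam would apply after seeing this gradient history."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return abs(lr * m_hat / (math.sqrt(v_hat) + eps))

quiet_history = [0.0] * 500           # task A converged: gradients near zero
new_task_grad = 0.01                  # first (small!) gradient from task B

adam_upd = adam_step_size(quiet_history + [new_task_grad])
sgd_upd = abs(1e-3 * new_task_grad)   # SGD with the same learning rate

print(f"Adam update: {adam_upd:.2e}, SGD update: {sgd_upd:.2e}")
```

In this toy setting Adam's first post-switch update is orders of magnitude larger than SGD's for the same small gradient, which is one plausible mechanism behind the fast-learning-versus-stable-memory trade-off described above.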

    The related work section traces the problem back to its origins in McCloskey and Cohen's 1989 work and connects it to the numerous mitigation methods proposed over the years. The authors note that despite all this work, catastrophic forgetting remains unsolved, and recent research is still uncovering fundamental factors that contribute to it.


    External Links (7)

    - Does the Adam Optimizer Amplify Catastrophic Forgetting? (hackernoon.com)
    - Study Finds Optimizer Choice Significantly Impacts Model Retention (hackernoon.com)
    - Measuring Catastrophic Forgetting in AI (hackernoon.com)
    - This Is How Your Model Forgets What It Just Learned (hackernoon.com)
    - Teaching Machines to Remember Means Choosing What They Forget (hackernoon.com)
    - Why Adam May Be Hurting Your Neural Network's Memory (hackernoon.com)
    - The Fragile Memory of Neural Networks, and the Metrics We Trust (hackernoon.com)
