San Diego, California. 1986. A cognitive scientist named David Rumelhart sits in his office at the Institute for Cognitive Science at UC San Diego, working on a paper that he knows is important but cannot yet know will be transformative. He has been working on neural networks for years — on the connectionist approach to cognition and intelligence that the symbolic AI mainstream has been dismissing for a decade and a half. He has a collaborator, Geoffrey Hinton, who has been working on the same problems from a slightly different angle at Carnegie Mellon. And he has an algorithm.

The algorithm is not new. It has been discovered before — at least twice, possibly three times — by researchers who published it in forms that did not reach the right audience at the right moment. It is an algorithm for computing how much each connection in a multi-layer neural network contributed to the network’s output error, and for adjusting those connections to reduce the error. It is, in the vocabulary of mathematics, the application of the chain rule of calculus to the computation of gradients in a layered computational graph. In the vocabulary of neural networks, it is backpropagation.

Rumelhart writes the paper. He develops it with his collaborators — Geoffrey Hinton, now at Carnegie Mellon, and Ronald Williams, a colleague at the Institute for Cognitive Science in San Diego. They refine it, add demonstrations, make the case. The paper is submitted to Nature — one of the world’s most prestigious scientific journals — and accepted. It appears in October 1986, under the title “Learning representations by back-propagating errors.”

The paper is read by everyone. The world changes.

But the story of backpropagation is not the story of the 1986 paper. It is the story of an algorithm that was invented, ignored, forgotten, reinvented, still ignored, and then — at exactly the right moment, with exactly the right demonstrations, in exactly the right journal — finally heard. It is the story of how ideas survive when their time has not yet come, and what happens when their time finally arrives.


The Problem the Algorithm Solves

To understand backpropagation — what it does, why it matters, why finding it was hard — you need to understand the problem it solves.

The problem is the credit assignment problem, and it is one of the central challenges in designing machines that learn. The question it poses is simple to state and hard to answer: when a system makes an error, which of the system’s many internal parameters was responsible for the error, and by how much should each parameter be adjusted to reduce the error?

For a simple system — a single-layer perceptron, for instance — the credit assignment problem is straightforward. The output is produced by a single layer of connections, and each connection’s contribution to the output is directly calculable. If the output is wrong, you can directly compute how much each connection contributed to the wrongness and adjust it accordingly. Rosenblatt’s Perceptron learning rule did exactly this.
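
To make the contrast concrete, here is a minimal sketch of a perceptron-style update in Python (the code and the AND task are illustrative, not Rosenblatt’s original implementation): every weight connects directly to the output, so each weight’s share of the blame is immediate.

    import numpy as np

    def train_perceptron(inputs, targets, epochs=20, lr=0.1):
        """inputs: (n_examples, n_features) array; targets: 0/1 labels."""
        weights = np.zeros(inputs.shape[1])
        bias = 0.0
        for _ in range(epochs):
            for x, t in zip(inputs, targets):
                prediction = 1 if np.dot(weights, x) + bias > 0 else 0
                error = t - prediction          # -1, 0, or +1
                weights += lr * error * x       # each weight adjusted by its own input
                bias += lr * error
        return weights, bias

    # Learning logical AND, a linearly separable task a single layer can handle.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    weights, bias = train_perceptron(X, y)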

For a multi-layer system — a network with one or more hidden layers between the input and the output — the problem is harder. The hidden layers are not directly connected to the output. Their contribution to the output error is indirect, mediated through the subsequent layers. To know how to adjust a connection in the first hidden layer, you need to know how that connection’s activation affected the activations of the second hidden layer, and how those activations affected the output, and how the output differed from what it should have been. The chain of dependencies is what makes the problem hard.

The difficulty is not one of mathematical possibility. You can always, in principle, compute how any parameter in a network contributed to the output error — the mathematics of differentiation handles this. The chain rule of calculus allows you to compute the gradient of a function of a function of a function — to differentiate through multiple layers of composition. A multi-layer neural network is exactly such a composition: the input goes through the first layer, the result through the second, and so on to the output. The chain rule applies.
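
Written out for a small three-layer composition (a standard chain-rule expansion in modern notation, not a formula taken from the 1986 paper), with f1, f2, f3 the layer transformations, w1 the first layer’s weights, and E the error, the chain of dependencies looks like this:

    y = f_3(f_2(f_1(x; w_1); w_2); w_3), \qquad E = L(y, \text{target})

    \frac{\partial E}{\partial w_1}
      = \frac{\partial E}{\partial y}
        \cdot \frac{\partial f_3}{\partial f_2}
        \cdot \frac{\partial f_2}{\partial f_1}
        \cdot \frac{\partial f_1}{\partial w_1}

Each factor is local to one layer; backpropagation’s contribution is an organisation of the computation that evaluates these factors once, working backward from the output, rather than separately for every weight.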

The difficulty, in 1960 and 1970 and even 1975, was partly practical and partly conceptual. Practical, because implementing the chain rule efficiently for networks of the kind needed for AI required a specific algorithmic organisation — a backward pass through the network that computed gradients layer by layer — that was not immediately obvious. Conceptual, because the right way to think about the problem — as a gradient computation in a computational graph — required mathematical concepts that were not widely accessible to the AI researchers who needed them.

Backpropagation is the efficient algorithm for computing gradients in layered networks. It works by making two passes through the network: a forward pass that computes the network’s output from the input, and a backward pass that propagates error signals from the output back through the layers, computing the gradient of the error with respect to each connection weight as it goes. The backward pass is what gives the algorithm its name.

With the gradients in hand, the weights can be updated in the direction that reduces the error — a procedure called gradient descent. Repeat the forward and backward passes many times on many examples, adjusting the weights each time, and the network gradually learns to produce correct outputs.
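
In symbols, the gradient descent update applied to each weight w is the standard one (with η the learning rate, the size of the step; the notation is conventional, not specific to the paper):

    w \;\leftarrow\; w - \eta \, \frac{\partial E}{\partial w}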

This is, in broad outline, how every modern neural network is trained. It is how GPT-4 was trained, how AlphaGo was trained, how every image recognition system and voice assistant and recommendation algorithm that uses neural networks was trained. The specific implementations are enormously more sophisticated than the original 1986 description. But the fundamental algorithm — forward pass, compute error, backward pass to compute gradients, update weights — is the same.


Paul Werbos: The First Discovery, 1974

The first person to clearly describe backpropagation was Paul Werbos, in his doctoral dissertation at Harvard in 1974. Werbos was studying economics and decision theory, and he was interested in the question of how to apply dynamic programming and optimal control theory to complex nonlinear systems. His key insight was that the chain rule could be used to compute gradients efficiently in layered systems — a mathematical observation that was general and powerful, applicable far beyond the specific neural network context.

Werbos included a chapter in his dissertation that described the application of this gradient computation technique to neural networks — describing what would later be called backpropagation clearly and in some detail. He was aware that this had implications for training multi-layer networks, that it solved the credit assignment problem that had limited single-layer architectures.

The dissertation was completed in 1974 and filed with Harvard. It was not published in a journal accessible to the AI research community. Doctoral dissertations of the era were not widely circulated, and Werbos’s work remained in the Harvard library, available to anyone who went looking but not reaching the researchers who needed it.

He did publish papers in subsequent years that described the gradient computation ideas, but they were published in venues — systems theory and control journals, economics publications — that the AI and neural network communities did not read. The work was available. It was not found.

Werbos has spent decades, with some justification, feeling that his priority was not adequately recognised when the backpropagation paper became famous in 1986. He has written about this in academic papers and has engaged with the historical question of credit and priority directly. The question is genuinely complex: he clearly had the key idea first and described it clearly in his dissertation. But the algorithm had no impact on the field until 1986, and the 1986 paper reached the audience that mattered and demonstrated the results that changed the field.

Scientific priority and historical credit are not always the same thing. The person who publishes an idea in the right venue at the right time, with the right demonstrations, and thereby changes the field, often receives more credit than the person who published it earlier in a venue that did not reach the relevant community. This is not always fair, and the Werbos case is a genuine injustice. But it is how intellectual history works.


David Parker: The Second Discovery, 1982

Eight years after Werbos, and four years before the famous 1986 paper, a researcher named David Parker independently discovered backpropagation. Parker was working at Stanford, where he documented the algorithm in a 1982 invention report filed with the university; he later wrote the idea up as a technical report at MIT, under the title “Learning Logic.”

Parker’s description was clear and correct. He understood that the chain rule could be applied to compute gradients in layered networks, that this allowed multi-layer networks to be trained by gradient descent, and that this overcame the limitations of single-layer perceptrons. He circulated the report and sent it to researchers in the field.

The report was not widely read. Technical reports had limited distribution. The timing was not ideal — 1982 was during the expert systems boom, when commercial AI was oriented toward rule-based systems rather than neural networks. The neural network community was small and not closely connected to the channels through which Parker’s report circulated.

Parker’s MIT technical report, which appeared in 1985, described the algorithm again, but once more in a venue that did not reach the broad AI research community. By 1986, when the Rumelhart-Hinton-Williams paper appeared in Nature, Parker had described the idea more than once without it taking hold.

The Werbos and Parker discoveries are important not just as matters of historical justice but as illustrations of how ideas can exist in the literature without having impact. The existence of an idea in a technical report or a conference proceeding or a doctoral dissertation is not the same as the existence of that idea in the active knowledge base of a research community. Ideas need to be findable, readable, and compelling — they need to be presented in the right venue, with the right demonstrations, at the right moment — to have impact. The 1986 paper had all of these qualities. The earlier discoveries did not.


The Rumelhart-Hinton-Williams Paper: Why It Worked

The October 1986 paper — “Learning representations by back-propagating errors,” by David Rumelhart, Geoffrey Hinton, and Ronald Williams — was not the first description of backpropagation. But it was the paper that changed the field. Understanding why requires understanding what made it different from the Werbos and Parker descriptions.

The first difference was the venue. Nature was, and is, one of the most widely read and most prestigious scientific journals in the world. A paper in Nature was read by researchers across scientific disciplines, including AI researchers who would not have read a technical report from MIT’s Sloan School or a paper in a systems theory journal. Publication in Nature guaranteed broad exposure.

The second difference was the demonstrations. Werbos had described the algorithm abstractly, in the context of dynamic programming and optimal control. Parker had described it theoretically. Rumelhart, Hinton, and Williams demonstrated it working — on specific, compelling problems.

The demonstrations were carefully chosen. The Nature paper itself worked through small but pointed problems, such as detecting mirror symmetry in input patterns and learning the structure of family trees, showing that networks with hidden layers trained by backpropagation discovered useful internal representations. And the most striking early demonstration of the approach arrived at almost the same time: NETtalk, a network built by Terrence Sejnowski and Charles Rosenberg that learned to pronounce English text from spelling. A neural network trained with backpropagation could learn, from examples, to map English spelling to phonetic pronunciation. After training on a corpus of English words with their pronunciations, the network produced intelligible speech — not perfectly accurate, but recognisable, and improving steadily with more training.

This demonstration was important because pronunciation is not a simple problem. English spelling is notoriously irregular — the same letter combination can be pronounced many different ways depending on context, word origin, and surrounding letters. The network was learning something genuinely complex, and it was learning it from data without being told the rules. The demonstration showed that backpropagation-trained networks could learn non-trivial, real-world tasks.

Another influential demonstration was the past tense learning network, Rumelhart and McClelland’s connectionist model of how children learn the past tense of English verbs, described in detail in the companion PDP volumes. The network showed a developmental trajectory — over-regularising irregular verbs before learning the correct forms — that closely paralleled the observed development of children learning language. This gave the neural network approach not just engineering relevance (it could learn useful tasks) but scientific relevance (it made correct empirical predictions about human cognition).

The third difference was the timing. By 1986, several things had aligned to make the research community receptive to backpropagation in a way it had not been in 1974 or 1982.

The expert systems boom was at its height, and alongside the enthusiasm for expert systems, there was a growing awareness of their limitations. The knowledge acquisition bottleneck, the brittleness, the inability to learn — these limitations were beginning to be felt in deployed systems. Researchers were starting to think seriously about whether there were better approaches.

The PDP volumes — Rumelhart and McClelland’s two-volume collection on parallel distributed processing — appeared simultaneously with the backpropagation paper and provided a comprehensive intellectual framework for connectionist AI. The paper was not an isolated result; it was part of a manifesto.

And the research community had been primed by a decade of discussion about the limitations of the Minsky-Papert critique. Researchers who had been sceptical of neural networks because of the Perceptrons book knew that the critique applied specifically to single-layer architectures. There was an open question — could multi-layer networks do better? — that the backpropagation paper answered with demonstrations.


Geoffrey Hinton: The Man Most Responsible

Of the three authors of the 1986 paper, Geoffrey Hinton had been working toward it the longest and had been most responsible for keeping the neural network tradition alive during the years when it was marginalised.

Hinton was born in 1947 in Wimbledon, England, into a family with a scientific tradition — his great-great-grandfather was the mathematician George Boole, whose Boolean algebra was the foundation of digital electronics. He studied experimental psychology and then philosophy at Cambridge before moving into AI, specifically the connectionist approach.

His doctoral work at Edinburgh, completed in 1978, was on models of how the brain represented spatial structure — connectionist models of a kind that the symbolic AI mainstream was dismissing as insufficiently rigorous. The years of his doctoral work coincided with the Lighthill Report and the first AI winter, an environment in which his research direction was particularly unfashionable.

After Edinburgh, Hinton moved to the United States — first to UC San Diego, where he worked with Rumelhart, and then to Carnegie Mellon. Through the late 1970s and early 1980s, he was one of the handful of researchers who kept connectionist AI alive: publishing papers, giving talks, training students, and developing the theoretical ideas about distributed representations and learning algorithms that would eventually come together in the backpropagation paper.

Hinton’s specific contribution to the 1986 paper was primarily in the theoretical analysis and the broader intellectual framework — in understanding why backpropagation worked, what it meant for representations to be learned, and how the approach connected to broader questions in cognitive science and neuroscience. Rumelhart and Williams contributed more to the algorithmic formulation itself. But the three authors were genuinely collaborative, and distinguishing their individual contributions precisely is more difficult than the conventional narrative of “Rumelhart discovered it with Hinton and Williams” suggests.

What is clear is that Hinton was the intellectual continuity — the person who had been working on this approach through the lean years, who had the deepest commitment to it, and who would go on to develop it most extensively in subsequent decades. Without Hinton’s persistence through the 1970s and early 1980s, the 1986 paper might not have been written, or might have been written later by someone with less deep theoretical understanding.

Hinton’s subsequent career — developing restricted Boltzmann machines, deep belief networks, dropout regularisation, and other key advances in deep learning; building the research group at the University of Toronto that produced AlexNet; eventually winning the Turing Award jointly with LeCun and Bengio — is the career of a person who understood, earlier and more deeply than almost anyone, where AI was going and what it required to get there.


The PDP Volumes: More Than a Paper

The backpropagation paper in Nature was accompanied by something almost as important: the simultaneous publication of “Parallel Distributed Processing: Explorations in the Microstructure of Cognition,” a two-volume collection edited by Rumelhart and McClelland that provided the intellectual framework for connectionist AI.

The PDP volumes, as they were immediately called by everyone in the field, were not a collection of papers presenting new results. They were a manifesto — a systematic argument for why the connectionist approach was right, what it could explain about human cognition, and what research programme it implied.

Volume 1 laid out the theoretical foundations of parallel distributed processing: the basic architecture of connectionist models, the learning algorithms available (including backpropagation), and the conceptual framework for understanding what distributed representations were and why they mattered. Volume 2 provided a series of case studies — specific models of specific cognitive phenomena — that demonstrated the explanatory power of the approach.

The volumes were comprehensive in a way that the 1986 paper alone was not. The backpropagation paper showed that the algorithm worked on specific tasks. The PDP volumes showed why the approach mattered — what research questions it addressed, what alternative it provided to the symbolic AI mainstream, what scientific programme it implied.

The combined effect of the paper and the volumes was to provide both a technical tool (backpropagation) and an intellectual home (the PDP framework) for researchers who wanted to work on neural networks. It established connectionist AI as a coherent research programme, not just a collection of isolated results.

The impact was immediate. Graduate students who had been uncertain about research directions were drawn to the connectionist approach. Researchers who had been working on symbolic AI began to wonder whether the connectionist alternative deserved more attention. A new generation of neural network researchers began to form around the PDP framework, developing the ideas that would eventually produce deep learning.


The Reception: Enthusiasm and Scepticism

The reception of the backpropagation paper and the PDP volumes was sharply divided — enthusiasm from researchers who had been waiting for exactly this development, and scepticism from the symbolic AI mainstream that saw it as a repetition of the mistakes of the Rosenblatt era.

The enthusiastic reception came primarily from three groups. First, the small community of connectionist researchers who had been working on related problems for years — people like Hinton and LeCun, their students and collaborators, and younger researchers such as Bengio and Schmidhuber who were just entering the field — for whom the 1986 paper was a vindication and a starting point. They immediately began extending and developing the approach, applying backpropagation to new problems, developing better architectures, improving the training algorithms.

Second, cognitive scientists who had been looking for computational models that could account for phenomena — language acquisition, semantic memory, visual perception — that the symbolic AI approach had difficulty with. The PDP volumes provided exactly the kind of comprehensive theoretical framework that cognitive science needed, grounded in empirical demonstrations and connected to the broader programme of understanding human cognition computationally.

Third, a new generation of graduate students who were drawn to the approach by its combination of technical rigour, empirical demonstrations, and intellectual ambition. The backpropagation paper opened a new research space — a vast territory of unexplored problems and promising approaches — that was enormously attractive to researchers at the beginning of their careers.

The sceptical reception came primarily from the symbolic AI mainstream. The critics raised several objections that were, in different degrees, legitimate.

The most substantive objection was about generalisation. The demonstrations in the 1986 paper were on relatively small problems — the past tense learning network was trained on a few hundred words, the pronunciation network on a similarly small corpus. Would the approach scale? Would networks trained with backpropagation generalise well to new examples, or would they overfit — memorising the training data without learning the underlying patterns?

This was a genuine concern. Neural networks trained with backpropagation were known to be susceptible to overfitting, particularly when the networks were large relative to the training data. The theoretical understanding of when and why networks generalised was limited. The engineering challenges of training large networks effectively were substantial.

The objection about interpretability was also legitimate. A neural network trained with backpropagation was, from the perspective of a symbolic AI researcher, a black box. The knowledge it had learned was encoded in the numerical values of thousands or millions of connection weights, in a form that could not be directly interpreted or examined. If the network made an error, you could not easily diagnose why. If you wanted to verify that the network had learned the right thing, you could not examine its “reasoning.” This opacity was a genuine problem for applications where interpretability and verifiability were important.

And the objection about biological plausibility was, ironically, raised by researchers who were not themselves committed to biological modelling. Backpropagation required computing the exact gradient of the error with respect to each connection — a computation that required precise, signed error signals to be propagated backward through the network. This was not known to happen in biological neural systems, which raised the question of whether backpropagation was a plausible model of learning or merely a useful engineering tool.

These objections were not wrong. They identified real limitations of backpropagation and the neural network approach more broadly. But they did not justify dismissing the approach. Every research programme has limitations. The question was whether the limitations were fundamental or addressable, and whether the capabilities unlocked by the approach were worth the costs of its limitations.

The answer, as subsequent decades demonstrated, was clearly yes.


The Technical Details: What Makes Backpropagation Work

For readers who want to understand the algorithm more concretely, a brief description of what backpropagation actually does is worthwhile — not at the level of mathematical formulas, but at the level of intuition.

Imagine a neural network as a series of transformations. The input (say, an image) goes through the first layer, which produces a new representation. That representation goes through the second layer, producing another representation. And so on, through as many layers as the network has, until the final layer produces the output (say, a classification of what the image shows).

Each transformation in each layer is parameterised — it has connection weights that determine how it transforms its input. The question backpropagation answers is: how should each weight be changed to make the output more accurate?

The answer is the gradient — the direction and magnitude of change in each weight that would most reduce the error. Computing this gradient requires knowing how the error depends on each weight, which requires knowing how the error depends on the output of each layer, which requires knowing how the output of each layer depends on its input, and how that input depends on the weights of the previous layer.

The chain rule of calculus allows this chain of dependencies to be computed systematically, layer by layer, working backward from the output. Hence “backpropagation” — the error signal is propagated backward through the network, computing the gradient at each layer as it goes.

With the gradients computed, each weight is adjusted by a small step in the direction that reduces the error. This is gradient descent — moving downhill in the landscape of the error function, toward a minimum where the network performs well.

Repeat this process many times on many examples — forward pass to compute the output, compute the error, backward pass to compute the gradients, adjust the weights — and the network learns. The weights gradually converge to values that make the network perform well on the training examples, and — if the training has been done correctly — on new examples it has not seen before.

This is, in essence, how every modern neural network is trained. The specific implementations are enormously more sophisticated: better optimisers than simple gradient descent, better regularisation techniques to prevent overfitting, better architectures that make the learning problem easier, training on vastly larger datasets with vastly more computing power. But the fundamental algorithm — forward pass, compute error, backward pass, update weights — is the same one that Rumelhart, Hinton, and Williams described in 1986.
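
For readers who want to see the recipe in code, here is a minimal sketch in Python with NumPy: a tiny network with one hidden layer, trained by the forward-pass / backward-pass / update loop described above. The task (XOR), the layer sizes, and the learning rate are illustrative choices, not anything taken from the 1986 paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # XOR: the classic task that a single-layer perceptron cannot learn.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # One hidden layer of four units, one output unit.
    W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
    W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    lr = 0.5
    for step in range(20000):
        # Forward pass: input -> hidden -> output.
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)

        # Error at the output (squared error), then the backward pass:
        # propagate the error signal layer by layer, applying the chain rule.
        err = out - y
        d_out = err * out * (1 - out)            # gradient at the output layer
        dW2 = h.T @ d_out
        db2 = d_out.sum(axis=0)
        d_h = (d_out @ W2.T) * h * (1 - h)       # gradient passed back to the hidden layer
        dW1 = X.T @ d_h
        db1 = d_h.sum(axis=0)

        # Gradient descent: nudge every weight against its gradient.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

    print(out.round(2))   # after training, the outputs should sit close to 0, 1, 1, 0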


LeCun and Convolutional Networks: The First Major Application

One of the first researchers to apply backpropagation to a major real-world problem was Yann LeCun, who at the time was working at Bell Labs. LeCun developed convolutional neural networks — a specific architecture designed for image processing tasks — and trained them with backpropagation to recognise handwritten digits.

LeCun’s insight was that the generic, fully-connected architecture used in the PDP demonstrations was not the best architecture for visual data. Visual data has a specific structure: nearby pixels are more related than distant ones, and the same patterns (edges, textures, shapes) appear at different locations in an image. An architecture that exploited this structure would be more efficient and more effective than one that treated all pixels equally.

Convolutional networks exploited this structure by using convolutional layers — layers in which the same set of weights (a “filter” or “kernel”) was applied at every location in the image. This meant the network learned to detect certain features regardless of where in the image they appeared — the same edge detector would fire whether the edge was in the upper left or the lower right of the image. The translation invariance built into the architecture matched the statistics of natural images.
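
Here is a minimal sketch, in Python, of what weight sharing means in practice: a single small filter is slid across an entire image, so the same detector is applied at every location. The filter values and the toy image are illustrative; real convolutional layers learn many filters at once and are implemented far more efficiently.

    import numpy as np

    def conv2d(image, kernel):
        """Apply one 2D filter at every valid position of a 2D image."""
        H, W = image.shape
        kh, kw = kernel.shape
        out = np.zeros((H - kh + 1, W - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + kh, j:j + kw]
                out[i, j] = np.sum(patch * kernel)   # the same weights reused everywhere
        return out

    # A simple vertical-edge detector: it responds wherever a dark-to-bright
    # transition occurs, regardless of where in the image that edge sits.
    edge_filter = np.array([[-1.0, 1.0],
                            [-1.0, 1.0]])
    image = np.zeros((6, 6)); image[:, 3:] = 1.0     # dark left half, bright right half
    print(conv2d(image, edge_filter))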

LeCun’s convolutional networks, trained with backpropagation on datasets of handwritten digits, achieved dramatically better performance than previous systems. Versions of his digit recognisers were deployed commercially, reading handwritten ZIP codes on envelopes and, later, the amounts on bank cheques, handling large volumes of real-world documents with accuracy that rivalled human operators.

This was the first large-scale commercial deployment of a neural network trained with backpropagation. It demonstrated that the approach was not just a laboratory curiosity but a practical technology with real-world applications. And it provided the first demonstration that the specific architecture mattered — that the right design choices could unlock capabilities that a generic architecture could not approach.

LeCun’s work on convolutional networks in the late 1980s and early 1990s was in many ways a preview of the deep learning revolution that would come two decades later. The systems he built were limited by the computing power and data available at the time. But the architecture and the training approach were essentially correct. When the computing power and the data were available — when GPUs could be repurposed for neural network training and when datasets like ImageNet could be assembled — the same architecture scaled to superhuman performance.


The Long Road: Why the Revolution Took Two More Decades

If backpropagation was described in 1986 and demonstrated to work on real problems by the late 1980s, why did the deep learning revolution not happen until 2012? Why was there a gap of twenty-six years between the algorithm’s mainstream debut and its world-changing impact?

The gap had multiple causes, all of which are important for understanding the full story.

Computing power. Training large neural networks with backpropagation requires substantial computation — millions of multiply-accumulate operations per example, repeated over millions of examples, potentially hundreds or thousands of times. The computers available in the late 1980s and 1990s were fast by the standards of the time but not fast enough to train the large networks that deep learning required. The computing power needed to make deep learning work at scale only became available in the 2000s, when NVIDIA and AMD released graphics processing units capable of the kinds of massively parallel computation that neural network training required.

Data. Neural networks trained with backpropagation require large amounts of labelled training data to generalise well. In the late 1980s, such datasets did not exist for most tasks. The internet had not yet been built. The digitisation of visual and textual content had barely begun. The assembly of the large labelled datasets that would eventually enable deep learning — ImageNet, with its fourteen million labelled images, was assembled over years of effort by Fei-Fei Li and her team starting in 2006 — was a massive undertaking that required a decade of work and the existence of a global internet through which examples could be collected.

Algorithmic improvements. The backpropagation algorithm of 1986 was correct but not optimal. Training deep networks — networks with many hidden layers — was particularly difficult, because error signals decayed as they were propagated backward through many layers, leaving the early layers unable to learn effectively. This problem — the vanishing gradient problem — was identified and analysed in the 1990s by Sepp Hochreiter and Jürgen Schmidhuber, but effective solutions were developed only gradually over subsequent years.
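
A back-of-the-envelope illustration of the effect (an upper-bound calculation assuming sigmoid activations and modestly sized weights, not Hochreiter’s analysis itself): each sigmoid layer contributes a derivative of at most 0.25, and the chain rule multiplies these factors across layers.

    max_sigmoid_slope = 0.25   # the sigmoid's derivative never exceeds 1/4
    for depth in (1, 5, 10, 20, 30):
        print(depth, max_sigmoid_slope ** depth)
    # By 20 layers the factor is roughly 10^-12: almost none of the output error
    # signal survives the trip back to the earliest layers, so they barely learn.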

Regularisation techniques — methods for preventing overfitting, for ensuring that networks generalised well to new examples rather than memorising the training data — were also improved over this period. Dropout, the technique of randomly deactivating units during training, was developed by Hinton and his students in the early 2010s and proved to be one of the most effective regularisation methods available.
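
A minimal sketch of the idea in Python, using the “inverted dropout” scaling that is the common modern convention (the layer values are illustrative):

    import numpy as np

    def dropout(activations, rate, training, rng=np.random.default_rng()):
        """Randomly switch off a fraction `rate` of units during training."""
        if not training or rate == 0.0:
            return activations                        # at test time, every unit is used
        mask = rng.random(activations.shape) >= rate  # keep each unit with probability 1 - rate
        return activations * mask / (1.0 - rate)      # rescale so the expected activation is unchanged

    hidden = np.array([0.2, 1.3, 0.7, 0.9])
    print(dropout(hidden, rate=0.5, training=True))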

And the field gradually developed better understanding of network architectures — better ways to initialise weights, better activation functions, better normalisation techniques — that made training more reliable and more effective.

The intellectual climate. Through the 1990s and early 2000s, the neural network approach was again somewhat marginalised within the machine learning mainstream, which had shifted its attention to support vector machines (SVMs) and other kernel methods that had strong theoretical foundations and good empirical performance on the benchmark problems of the time. Neural networks were not dismissed as completely as they had been after the Perceptrons book, but they were not the dominant approach. The intellectual climate did not favour the large-scale investment in neural network research that deep learning eventually required.

The combination of these factors meant that the promise of backpropagation — visible in the demonstrations of the 1986 paper and the subsequent work of LeCun, Hinton, Bengio, and others — was not fully realised until all the necessary ingredients came together: sufficient computing power, sufficient data, improved algorithms, and an intellectual climate willing to invest seriously in the approach.


The AlexNet Moment: When the Waiting Ended

The waiting ended in 2012, when Geoffrey Hinton’s graduate students Alex Krizhevsky and Ilya Sutskever, working with Hinton himself, submitted an entry to the ImageNet Large Scale Visual Recognition Challenge, the annual competition to classify images drawn from the ImageNet dataset into a thousand categories.

The AlexNet entry — named for Krizhevsky — used a deep convolutional neural network trained with backpropagation on ImageNet. It ran on two GPUs that Krizhevsky had repurposed from their original use in gaming. And it won the competition by a margin so large that it shocked the computer vision community.

The winning systems in previous years had achieved top-5 error rates (the fraction of images for which the correct label was not in the system’s top five predictions) of around 25-26%. AlexNet achieved a top-5 error rate of 15.3% — a reduction in error of nearly 40% over the previous year’s winner. The improvement was not incremental. It was categorical. Something had fundamentally changed.
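
For concreteness, a top-5 error rate can be computed like this (a small Python sketch with made-up scores, not the ImageNet evaluation code):

    import numpy as np

    def top5_error(scores, true_labels):
        """scores: (n_examples, n_classes); true_labels: (n_examples,) integer class ids."""
        top5 = np.argsort(scores, axis=1)[:, -5:]          # indices of each example's five highest scores
        hits = [label in row for row, label in zip(top5, true_labels)]
        return 1.0 - np.mean(hits)

    rng = np.random.default_rng(0)
    fake_scores = rng.random((3, 10))                      # 3 examples, 10 classes
    print(top5_error(fake_scores, np.array([2, 7, 4])))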

What had changed was not primarily the algorithm — backpropagation had been available for twenty-six years. What had changed was the scale: a much deeper network than had previously been practical, trained on a much larger dataset than had previously been available, using GPU computing power that made the training feasible in weeks rather than years.

The AlexNet moment was, in retrospect, the vindication of everything that Hinton had been working toward since his doctoral research in the 1970s. It was the moment when the connectionist tradition that he had sustained through the lean years finally demonstrated, at a scale and with a clarity that could not be dismissed, that it was the right approach.

It was also the vindication of the 1986 paper — of the algorithm that Rumelhart, Hinton, and Williams had described, that had been waiting for the tools to make it work at the scale it required. Backpropagation had been the key all along. It had just needed a door large enough to unlock.


The Godfathers: A Community That Made History

The 1986 backpropagation paper was the work of three people, but the deep learning revolution it helped enable was the work of a community — a small, tightly connected group of researchers who had maintained their commitment to the connectionist approach through years of marginalisation and who, in the years after 1986, developed the theoretical and practical advances that eventually made deep learning work.

Geoffrey Hinton at Toronto was the intellectual centre of the community — the person with the deepest theoretical understanding, the most students, and the most influence. Researchers in and around that community — Yann LeCun and Yoshua Bengio among Hinton’s close collaborators, Sepp Hochreiter and Jürgen Schmidhuber working independently in Europe, and dozens of others — developed the key advances: convolutional networks, recurrent networks with long short-term memory, techniques for training deep networks, regularisation methods.

Yann LeCun, working first at Bell Labs and then at AT&T Labs and then at New York University, developed convolutional networks and demonstrated their power in real-world applications. His digit recognition system was the first large-scale commercial neural network deployment. His theoretical work on the convergence of gradient-based learning contributed to the understanding of why neural networks worked and when they could be expected to generalise.

Yoshua Bengio at the Université de Montréal worked on the theoretical foundations of deep learning — on why deep networks were better than shallow ones for certain kinds of problems, on the representation learning that was central to the deep learning approach, on connections between neural networks and probabilistic models. His group developed many of the ideas that went into the language models that would eventually produce ChatGPT.

These three researchers — whose collaboration and mutual influence were as important as their individual contributions — are now called the Godfathers of Deep Learning. They were jointly awarded the 2018 Turing Award, the most prestigious prize in computer science, in recognition of their contributions to the field. The award was, in part, a recognition of the long game they had played — the commitment to a research direction that the field had not been willing to support for most of the period during which they were developing it.


The Legacy of 1986: An Algorithm That Changed the World

The 1986 backpropagation paper is, by any measure, one of the most consequential scientific papers of the twentieth century. Its impact — measured by the transformation of the AI field and the transformation of the world that transformation has produced — is extraordinary.

Every image recognition system that tags your photos, every voice assistant that responds to your questions, every language model that writes text or answers questions or translates languages, every recommendation algorithm that suggests what you should watch or listen to or buy — all of these systems, in their current form, are implementations of the approach that the 1986 paper enabled. They are neural networks, trained with backpropagation, on large datasets, with architectures and regularisation techniques that were developed in the decades following the paper.

The paper did not cause all of this by itself. It was the key that unlocked the door, but the door was large and the room behind it was filled with additional challenges that required decades of work to address. The computing power, the data, the algorithmic improvements, the architectural innovations — all of these were necessary, and all of them required sustained investment and creative work by thousands of researchers.

But the paper identified the critical missing piece. The credit assignment problem — how to train multi-layer networks — was the obstacle that had prevented neural networks from scaling to the complex problems that general AI required. Backpropagation solved that problem. With the problem solved, in principle, it was possible to imagine — and eventually to build — the deep networks that would transform the field.

The story of backpropagation is also the story of ideas that survive their time. The algorithm was discovered at least three times before it took hold. It sat in doctoral dissertations and technical reports, available but unread, for more than a decade. When it finally reached the right audience at the right moment, it changed everything.

This is how intellectual history works. Ideas do not become important when they are first stated. They become important when the world is ready for them — when the tools to implement them exist, when the data to feed them is available, when the intellectual climate is receptive, when the demonstrations are compelling. Backpropagation was ready before the world was ready for it. When the world caught up, the algorithm was there, waiting.


Further Reading

  • “Learning representations by back-propagating errors” by Rumelhart, Hinton, and Williams (1986) — The original paper, available online. Four pages long, clearly written, transformative. Read it.
  • “Parallel Distributed Processing” edited by Rumelhart and McClelland (1986) — The comprehensive intellectual framework that accompanied the backpropagation paper. Volume 1 is the theoretical foundation; Volume 2 is the empirical applications.
  • “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences” by Paul Werbos (1974) — Werbos’s doctoral dissertation, the first clear description of backpropagation. Available through Harvard’s library archive.
  • “Gradient-Based Learning Applied to Document Recognition” by LeCun et al. (1998) — LeCun’s major paper on convolutional networks, describing the LeNet system that was the direct predecessor of AlexNet. Essential for understanding how the 1986 breakthrough was developed into practical systems.
  • “ImageNet Classification with Deep Convolutional Neural Networks” by Krizhevsky, Sutskever, and Hinton (2012) — The AlexNet paper. The moment when everything changed. Read it alongside the 1986 paper to see how far the field had come in twenty-six years.

Next in the Events series: E11 — Deep Blue vs. Kasparov, 1997: The Match the World Watched — The full story of the chess match that divided history: the six games, the controversy, Kasparov’s accusations that IBM was cheating, the cultural earthquake of a machine defeating the greatest chess player alive, and what it actually meant for AI — and for what it means to be human.


Minds & Machines: The Story of AI is published weekly. If the story of backpropagation — the algorithm that was found, lost, found again, and eventually changed the world — speaks to something about how ideas survive and find their moment, share it with someone who would find that story worth knowing.