Murray Hill, New Jersey. 1950. The corridors of Bell Telephone Laboratories are quiet in the early morning, before most researchers have arrived. But one office is already occupied. A man in his early thirties sits at a workbench, not at a desk — the desk is buried under papers, books, and half-finished mechanical contraptions whose purpose is not immediately obvious. He is tinkering with something that has wheels. And stilts. And a form of balance mechanism.

He is building a unicycle. Not because anyone has asked him to. Not because it is relevant to his research. Because he finds the problem of mechanical balance interesting, and because he has been thinking about juggling, and because the mathematics of juggling and the mathematics of information transmission have, in his mind, a family resemblance that he has not yet worked out how to articulate.

His name is Claude Elwood Shannon. Two years earlier, he published a paper widely regarded as the most important in the history of communications engineering. The paper created the entire mathematical framework within which every digital communication system — every telephone call, every internet transmission, every Wi-Fi signal, every streaming video — would subsequently be designed.

He has solved one of the deepest problems in the history of applied mathematics. He has come to work early to build a unicycle.

This is, it turns out, completely consistent.


The Engineer Who Thought Like a Mathematician

Claude Shannon was born on April 30, 1916, in Petoskey, Michigan, the son of Claude Sr., a businessman and probate judge, and Mabel Wolf Shannon, a high school principal. He grew up in Gaylord, a small town in northern Michigan, in a family that valued education but was not intellectually extraordinary in any obvious way.

What was extraordinary was Shannon himself, and it manifested early. He showed a precocious interest in mechanical and electrical things — in how devices worked, in what could be built from available parts, in the elegant solution to a practical problem. As a child he built model planes, a radio-controlled boat, a telegraph system that connected his house to a friend’s half a mile away. He was, from the beginning, someone who learned by building — who understood things best when he could make them work.

He enrolled at the University of Michigan, graduating in 1936 with two bachelor’s degrees: one in mathematics, one in electrical engineering. This double degree was not an accident or a hedging of bets. It reflected a genuine dual interest that would define his career: the mathematical structures underlying physical systems, and the physical systems that instantiated mathematical structures. Shannon never separated the abstract from the concrete. For him, mathematics was a tool for understanding and building things, and things were a way of testing and extending mathematics.

After Michigan he went to MIT for graduate work, where he encountered an assignment that would produce one of the most extraordinary master’s theses in the history of science.

The assignment was to work on the differential analyser — a large mechanical analogue computing device used for solving differential equations. The analyser was controlled by a network of relay switches, and Shannon’s job was to understand and improve this control network. Working on the relays, he noticed something: the behaviour of the relay network — which relays were open, which were closed, what configurations were possible — could be analysed mathematically using the algebra of logic that George Boole had developed in the 1850s.

Boolean algebra — the algebra of true and false, of AND and OR and NOT — was a system for manipulating logical propositions. Shannon’s insight was that it was also, exactly and precisely, the algebra of electrical relay circuits. A relay was either open or closed — true or false. The combination of relays in series was a logical AND — both had to be closed for current to flow. The combination in parallel was a logical OR — either could be closed. The whole network of relays was, mathematically, a Boolean expression.
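
To see the equivalence concretely, here is a minimal sketch (mine, not Shannon's) of the correspondence in modern code: relays in series behave like AND, relays in parallel like OR, and a whole network can be read off as a Boolean expression.

```python
def series(relay_a: bool, relay_b: bool) -> bool:
    """Two relays in series: current flows only if both are closed (AND)."""
    return relay_a and relay_b

def parallel(relay_a: bool, relay_b: bool) -> bool:
    """Two relays in parallel: current flows if either is closed (OR)."""
    return relay_a or relay_b

# The circuit "A in series with (B in parallel with C)" is the Boolean
# expression A AND (B OR C); the two descriptions agree on every input.
for a in (False, True):
    for b in (False, True):
        for c in (False, True):
            assert series(a, parallel(b, c)) == (a and (b or c))
print("circuit and Boolean expression agree on all eight inputs")
```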

This equivalence — between electrical circuits and logical expressions — was the foundation of digital electronics. It meant that any logical computation could in principle be implemented as an electrical circuit, and any electrical relay circuit could be analysed as a logical expression. It was the mathematical bridge between logic and electronics that made digital computers possible.

Shannon wrote up this result in 1937 as his master's thesis, “A Symbolic Analysis of Relay and Switching Circuits,” and published it as a paper the following year. It has been called the most important master's thesis of the twentieth century. The claim is not unreasonable. Every digital computer, every smartphone, every digital communication system built since 1937 rests on the mathematical foundation that Shannon established in that thesis, at the age of twenty-one.

He was just getting started.


Bell Labs: The Greatest Industrial Research Laboratory in History

After his master’s thesis and a brief detour into genetics for his PhD — he applied Boolean algebra to Mendelian genetics, characteristically finding a mathematical structure that unified two apparently different domains — Shannon joined Bell Telephone Laboratories in 1941.

Bell Labs was, in the mid-twentieth century, the most productive industrial research laboratory in history. It was the research arm of AT&T, the American telephone monopoly, and it was funded with the kind of generosity that only a regulated monopoly with guaranteed revenues can sustain. Its researchers had the freedom, the resources, and the collegial environment to work on fundamental problems without immediate commercial pressure. The results were extraordinary: the transistor, the laser, information theory, UNIX, the C programming language, cellular telephony, the charge-coupled device — the inventions that built the modern world came, in disproportionate numbers, from Bell Labs.

Shannon thrived there. Bell Labs gave him freedom — the freedom to follow his curiosity wherever it led, to spend years on problems that had no obvious commercial application, to pursue mathematical beauty as well as practical utility. In return, the thinking he did at Bell Labs produced results of incalculable commercial value, even when that was not his intention.

His colleagues at Bell Labs were, by any measure, among the most talented people in the world. John Bardeen, Walter Brattain, and William Shockley were there, working on the transistor. Harry Nyquist was there, having already established the sampling theorem that would bear his name. Hendrik Bode was there, developing feedback control theory. The mathematician Norbert Wiener, working at MIT, was in regular contact. The community of people thinking seriously about information, communication, and control was small, interconnected, and extraordinarily productive.

Shannon moved through this world with characteristic independence. He was collegial without being social — genuinely interested in other people’s ideas, willing to engage deeply with problems that were not his own, but also capable of working alone for extended periods with a focus that bordered on the monomaniacal. He rode a unicycle through the Bell Labs corridors — a habit that was tolerated, even celebrated, as a sign of the creative eccentricity that the laboratory’s culture valued. He was seen as playful in a way that was inseparable from his seriousness: the juggling, the mechanical toys, the word games and puzzles were not distractions from his intellectual work but expressions of the same quality of mind.

The quality was a gift for finding deep structure in apparently diverse phenomena — for seeing the mathematical skeleton beneath the flesh of physical reality, and for expressing that skeleton in forms that were simultaneously rigorous and beautiful.


The 1948 Paper: Inventing Information

In July 1948, the Bell System Technical Journal published the first part of a paper by Claude Shannon titled “A Mathematical Theory of Communication.” The second part appeared in October. Together, the two parts constituted one of the most important scientific papers of the twentieth century.

The paper did something that had never been done before: it gave a precise, mathematical definition of information.

Before Shannon, “information” was an everyday word — vague, context-dependent, impossible to measure. You could say that a message contained information, that some messages contained more information than others, that information was transmitted through communication systems and could be corrupted by noise. But you could not say, precisely, how much information a message contained, or calculate exactly how much capacity a communication channel had, or determine rigorously what the limits of reliable communication over a noisy channel were.

Shannon changed all of this. In one paper, he defined information, developed the mathematics for measuring it, established the fundamental limits of communication, and proved that those limits could be achieved. It was an achievement of breathtaking scope and elegance.

The definition of information was the key. Shannon defined the information content of a message in terms of its probability — specifically, in terms of how surprising the message was. A message that was certain to occur conveyed no information: if you already knew what was going to be said, hearing it told you nothing new. A message that was very unlikely conveyed a great deal of information: it told you something you did not already know, something that your prior knowledge gave little reason to expect.

Mathematically, Shannon defined the information content of a message with probability p as the logarithm of one divided by p — written as log(1/p), or equivalently as -log(p). The less probable the message, the higher the information content. This definition had exactly the properties that a good definition of information should have: it was zero for certain events, it increased as events became less probable, and it was additive — the information in two independent messages was the sum of their individual informations.
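
In modern notation, with base-2 logarithms so that the unit is the bit, the definition and its additivity property read:

```latex
I(x) \;=\; \log_2 \frac{1}{p(x)} \;=\; -\log_2 p(x),
\qquad
p(x, y) = p(x)\,p(y) \;\Longrightarrow\; I(x, y) = I(x) + I(y).
```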

From this definition, Shannon derived the concept of entropy — the average information content of a source. A source that always produced the same message had entropy zero — no information at all. A source that could produce many equally probable messages had maximum entropy — maximum information per message, maximum uncertainty about what would come next.
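
In the same notation, the entropy of a source with n possible messages and probabilities p₁ through pₙ is the average of their information contents:

```latex
H \;=\; \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i} \;=\; -\sum_{i=1}^{n} p_i \log_2 p_i
```

It is zero when one message has probability one, and it reaches its maximum, log₂ n, when all n messages are equally likely: exactly the two limiting cases just described.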

The connection to the thermodynamic entropy of physics was not coincidental. Shannon was aware of the parallel, and he chose the word “entropy” deliberately, reportedly at the suggestion of the mathematician John von Neumann. The parallel ran deep: both thermodynamic entropy and Shannon entropy measured the degree of disorder or uncertainty in a system, and the two definitions shared essentially the same mathematical form. The connection suggested a deep relationship between information and physics that has been a productive research area ever since.

Having defined information, Shannon proved the two fundamental theorems of information theory.

The first theorem — the noiseless channel coding theorem — established how efficiently information could be stored and transmitted in the absence of noise. Shannon showed that there was a theoretical limit to compression: you could not compress a message below its entropy, measured in bits. But you could, in principle, achieve compression arbitrarily close to this limit. This theorem established the foundation for all data compression — the algorithms that make files smaller, that allow music and video to be streamed efficiently, that underlie the zip files and JPEG images of everyday computing.
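
A small numerical illustration (the four-symbol source is invented for the purpose) shows what the theorem means in practice: the entropy of the source is the floor, in bits per symbol, below which no lossless code can compress it on average.

```python
import math

# An invented four-symbol source, used only to illustrate the limit.
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

entropy = sum(p * math.log2(1 / p) for p in probabilities.values())
fixed_length = math.log2(len(probabilities))   # a plain 2-bit code per symbol

print(f"entropy (compression floor): {entropy:.2f} bits per symbol")      # 1.75
print(f"fixed-length code:           {fixed_length:.2f} bits per symbol")  # 2.00

# Because every probability here is a power of two, a code with word lengths
# 1, 2, 3, 3 (say a=0, b=10, c=110, d=111) reaches the 1.75-bit floor exactly.
```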

The second theorem — the noisy channel coding theorem — was even more remarkable. Shannon proved that even in the presence of noise — even when a communication channel randomly corrupted some of the transmitted bits — it was possible to transmit information with arbitrarily small error, at a rate approaching the channel’s capacity. This was achieved through redundancy: encoding the information in ways that spread it across many transmitted symbols, so that random corruption of individual symbols did not destroy the message.

The noisy channel coding theorem was extraordinary for two reasons. First, it was counterintuitive: it said that reliable communication was possible even over noisy channels, something that was not at all obvious before Shannon proved it. Second, it established a sharp theoretical limit — the channel capacity — and proved that this limit could be approached but not exceeded. It defined what was possible and impossible in communication theory, with mathematical precision.
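
The standard textbook example makes the limit concrete. A binary symmetric channel flips each transmitted bit with probability p; its capacity, in bits per channel use, is

```latex
C \;=\; 1 - H_2(p),
\qquad
H_2(p) \;=\; -p \log_2 p \;-\; (1 - p)\log_2 (1 - p)
```

so a channel that corrupts one bit in ten (p = 0.1) can still carry roughly 0.53 bits of reliable information per transmitted bit, provided a suitable error-correcting code is wrapped around it.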

These results were not just theoretical. They were directly applicable to the design of every communication system ever built. Every modern communication standard — from Wi-Fi protocols to mobile phone networks to the error-correcting codes used in deep space communication — is an engineering implementation of Shannon’s theory. The engineers who designed these systems were working within the framework Shannon established, trying to approach the limits he proved.


The Bit: The Currency of the Digital Age

Among Shannon’s many contributions, the one that most directly entered everyday language and everyday life was the bit.

The bit — binary digit — was not invented by Shannon. The word had been coined by the statistician John Tukey, and binary representation of numbers was much older. But Shannon established the bit as the fundamental unit of information — the currency in which all information could be measured and exchanged.

A bit was the information contained in the answer to a yes/no question whose two possible answers were equally probable. Flip a fair coin: one bit of information is revealed by the result. Open one of two equally probable doors: one bit of information. Choose between two equally probable paths: one bit.

This definition made the bit universal. Any message from any source could, in principle, be reduced to bits — to a sequence of binary choices. Any amount of information could be measured in bits. Any channel capable of transmitting binary signals could, in principle, transmit any information. The bit unified all of information theory under a single framework and a single unit.

The practical consequence was the digital revolution. If all information — text, numbers, images, sounds, video, computer programs — could be represented as sequences of bits, then a single kind of transmission system, a single kind of storage system, a single kind of processing system could handle all of it. The specialised analogue systems of the pre-digital era — different systems for voice, for text, for images, for different kinds of data — could be replaced by a single universal digital system. Different media would become different bit streams, all handled by the same infrastructure.

This is exactly what happened. The telephone network became digital. Broadcasting became digital. Photography became digital. Music became digital. Video became digital. Publishing became digital. Every medium that existed before 1948, and every medium invented after, was eventually reduced to bits. The information age is, in the most literal sense, the Shannon age.

For AI, the bit is fundamental in a different but equally important way. Every AI system — every neural network, every language model, every computer vision system — represents its inputs, its internal states, and its outputs as patterns of bits. The training data that teaches these systems is stored as bits. The model parameters that encode what they have learned are stored as bits. The computations they perform are operations on bits. AI is, at the most fundamental level, information processing in Shannon’s sense — the transformation of bit patterns according to computational rules.

Shannon’s definition of information as surprise, as reduction of uncertainty, also connects directly to modern machine learning. A language model, trained to predict the next word in a text, is optimising for a quantity closely related to Shannon entropy — it is trying to minimise its uncertainty about what comes next, to assign high probability to likely continuations and low probability to unlikely ones. The connection between Shannon’s information theory and the probability theory that underlies machine learning is not superficial. It is deep and structural.


Chess, Machines, and the Shape of AI

Shannon’s connection to AI was not limited to the mathematical foundations he provided. He was directly engaged with the question of machine intelligence, and made specific contributions to the problem of building AI systems.

In 1950 — the same year as Turing’s landmark paper — Shannon published “Programming a Computer for Playing Chess.” The paper was not a working program (Shannon was a theorist, not primarily a programmer), but it was the first rigorous analysis of what it would take to make a computer play chess well.

The paper identified the fundamental challenge: the space of possible chess games was astronomically large — far too large to search exhaustively, even on the fastest computers imaginable. Playing chess well required selecting good moves from this enormous space, and doing so efficiently enough to be practically useful.

Shannon proposed two approaches. The first — “Type A” — was to search the game tree exhaustively to a fixed depth, evaluating all positions at that depth and choosing the move that led to the best evaluated position. This was brute force, limited by the depth that computing power allowed. The second — “Type B” — was more selective: use chess knowledge to identify which moves were worth exploring, following only the most promising lines rather than all lines.
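
Neither strategy requires anything exotic to express. The sketch below uses a toy game and toy heuristics of my own devising (Shannon's paper contained no code at all); it shows the shared skeleton: a fixed-depth search with an evaluation function at the leaves for Type A, and a switch that prunes each position down to the moves a simple heuristic considers promising for Type B.

```python
# Toy game: a counter starts at some value; players alternately subtract 1, 2,
# or 3; whoever reaches exactly 0 wins. The evaluation function and the
# "promising move" filter are illustrative stand-ins for chess knowledge.

def legal_moves(counter):
    return [m for m in (1, 2, 3) if m <= counter]

def evaluate(counter):
    """Heuristic score from the point of view of the player about to move:
    multiples of four are losing positions in this game."""
    return -1.0 if counter % 4 == 0 else 1.0

def search(counter, depth, maximising, selective=False):
    """Type A: examine every move to a fixed depth, then evaluate.
    Type B (selective=True): examine only the move a heuristic prefers."""
    if counter == 0:                      # the previous player just won
        return -1.0 if maximising else 1.0
    if depth == 0:                        # depth limit reached: estimate
        score = evaluate(counter)
        return score if maximising else -score
    moves = legal_moves(counter)
    if selective:
        # Keep only the move that leaves the opponent nearest a multiple of
        # four -- the Type B idea of following only the promising lines.
        moves = [min(moves, key=lambda m: (counter - m) % 4)]
    values = [search(counter - m, depth - 1, not maximising, selective)
              for m in moves]
    return max(values) if maximising else min(values)

for selective in (False, True):
    label = "Type B (selective)" if selective else "Type A (brute force)"
    print(label, "value of a counter of 10:", search(10, 6, True, selective))
```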

Shannon’s analysis established the framework within which chess AI was developed for the next fifty years. The Deep Blue system that defeated Kasparov in 1997 used a sophisticated version of Type A search — brute force, but at extraordinary speed and depth, augmented with chess-specific evaluation functions. The AlphaGo and AlphaZero systems that later mastered Go and then chess again used something closer to Type B — neural network-guided search that concentrated computational resources on the most promising moves.

Both approaches, and every hybrid between them, were anticipated in Shannon’s 1950 paper. He had identified the right questions before anyone had built a program capable of addressing them.

Shannon also contributed to AI through his work on cryptography — the science of encoding and decoding information securely. His 1949 paper “Communication Theory of Secrecy Systems” placed cryptography on a rigorous mathematical foundation, showing the conditions under which a cipher could be proven unbreakable and the limits of what eavesdroppers could infer. This work was not AI in any direct sense. But it established the mathematical culture of rigorous security analysis that underlies the field of cryptography, and cryptography is now deeply intertwined with AI — in the security of AI systems, in the privacy-preserving techniques used to train models on sensitive data, in the authentication mechanisms that govern who can use AI services.
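
One concrete result from that 1949 paper is easy to demonstrate: Shannon proved that the one-time pad achieves perfect secrecy. The sketch below illustrates the scheme itself (the message is an invented example): with a truly random key as long as the message, used once and kept secret, the ciphertext is statistically independent of the message, so an eavesdropper learns nothing beyond its length.

```python
import secrets

def one_time_pad(data: bytes, key: bytes) -> bytes:
    """XOR the data with the key; applying the same key again decrypts."""
    assert len(key) == len(data), "the key must be exactly as long as the data"
    return bytes(d ^ k for d, k in zip(data, key))

message = b"attack at dawn"
key = secrets.token_bytes(len(message))   # uniformly random, used once, kept secret
ciphertext = one_time_pad(message, key)

print(ciphertext.hex())                   # looks like noise; reveals only the length
print(one_time_pad(ciphertext, key))      # b'attack at dawn'
```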


The Thinking Machine: Shannon’s AI Demonstrations

Shannon was not just a theorist of machine intelligence. He built things — and the things he built were demonstrations of principles, chosen to be illuminating and often to be funny.

His most famous AI construction was Theseus — a maze-solving mouse, built in 1950, that was arguably the first demonstration of a machine learning system. Theseus itself was a small wooden mouse, moved by an electromagnet beneath the floor of the maze. The intelligence lived under that floor, in a bank of relay circuitry — Shannon’s speciality — that remembered which turns had led to dead ends and updated the mouse’s strategy accordingly.

The mouse would enter the maze and begin exploring, trying paths at random when it encountered a new junction. When it hit a dead end, it would back up, record that the path was unproductive, and try a different direction. When it found the exit, it had learned a complete path through the maze. If then placed anywhere in the maze it had previously explored, it would navigate directly to the exit, using its accumulated memory of successful paths.

Theseus could not generalise to new mazes. It could not explain its strategy. It had no understanding of what it was doing. But it solved the maze through a process that had the essential structure of trial-and-error learning — exploring, recording outcomes, updating behaviour based on results, and eventually achieving reliable performance. It was learning by trial and error, embodied in a physical machine, in 1950.
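
The logic is simple enough to sketch in a few lines of modern code. What follows is a software analogue of the idea, not of Shannon's relay design, with a purely illustrative little maze:

```python
import random

# Each square of the toy maze lists the squares it opens onto; "C" is a dead end.
MAZE = {
    "A": ["B"], "B": ["A", "C", "E"], "C": ["B"],
    "E": ["B", "F"], "F": ["E", "EXIT"], "EXIT": ["F"],
}

def explore(square, memory=None, seen=None):
    """First run: wander at random, backing out of dead ends, and record for
    each square the passage that ultimately led towards the exit."""
    memory = {} if memory is None else memory
    seen = set() if seen is None else seen
    seen.add(square)
    if square == "EXIT":
        return memory
    for nxt in random.sample(MAZE[square], len(MAZE[square])):
        if nxt in seen:
            continue
        if explore(nxt, memory, seen) is not None:   # this passage worked
            memory[square] = nxt
            return memory
    return None                                      # every passage failed

def replay(square, memory):
    """Second run: from any square on the remembered route, go straight out."""
    path = [square]
    while square != "EXIT":
        square = memory[square]
        path.append(square)
    return path

memory = explore("A")
print(replay("B", memory))   # ['B', 'E', 'F', 'EXIT'] -- no wrong turns this time
```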

Shannon also built a machine called NIMABI that played the mathematical game of Nim — a game in which two players take turns removing objects from piles, with the player who takes the last object winning or losing depending on the variant. NIMABI played optimally: from any winning position it could not be beaten, and opponents who did not know the theory of the game almost never beat it at all. He demonstrated it at parties and conferences, inviting guests to play against it and watching their reactions when the machine beat them. The machine was not doing anything mysterious — it was mechanising a winning strategy that mathematicians had worked out decades earlier — but the experience of losing to a machine that played perfectly was unsettling for people who had assumed that game-playing required intelligence.
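
The winning rule itself is old mathematics: the complete analysis of Nim was published by Charles Bouton in 1901, and it is short enough to state in code. The sketch below shows the classical optimal strategy for the normal-play variant (whoever takes the last object wins); it illustrates the rule any perfect Nim player must follow, and makes no claim about how Shannon's circuitry implemented it.

```python
from functools import reduce
from operator import xor

def best_move(piles):
    """Optimal move under normal play (whoever takes the last object wins):
    leave the XOR of all pile sizes at zero whenever possible."""
    nim_sum = reduce(xor, piles)
    if nim_sum == 0:
        # Already a lost position against perfect play; take one object and hope.
        i = next(i for i, p in enumerate(piles) if p > 0)
        return i, piles[i] - 1
    for i, p in enumerate(piles):
        target = p ^ nim_sum          # the pile size that restores an all-zero XOR
        if target < p:
            return i, target

print(best_move([3, 4, 5]))   # (0, 1): shrink the 3-pile to 1, leaving 1 ^ 4 ^ 5 == 0
```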

He built a machine called THROBAC — Thrifty Roman-Numeral Backward-looking Computer — that performed arithmetic in Roman numerals, for no other reason than that he found it amusing. He built juggling robots. He built a device called the Ultimate Machine — a box with a switch that, when turned on, caused a mechanical arm to emerge from the box, turn itself off, and retract. It did nothing except prevent itself from being turned on. He presented it to visitors with a perfectly straight face.

These constructions were not frivolous. They were demonstrations — of principles, of limitations, of the gap between what a machine could do and what it appeared to be doing. The maze-solving mouse demonstrated learning without understanding. NIMABI demonstrated optimal play without insight. The Ultimate Machine demonstrated that a machine could have a purpose — turning itself off — without having any goal in any philosophically interesting sense. Shannon was, with playfulness and wit, making serious points about the nature of machine intelligence.


Theseus, Learning, and the Road to Reinforcement Learning

The maze-solving Theseus deserves special attention, because it anticipates one of the most important developments in modern AI: reinforcement learning.

Reinforcement learning is the approach to machine learning in which a system learns to perform a task by trying things, receiving feedback on whether the things it tried were good or bad, and adjusting its behaviour accordingly. It is the approach that underlies AlphaGo, AlphaZero, and many of the most impressive recent AI achievements. It is also, in broad outline, the approach through which biological intelligence is shaped by experience — animals learn what to do by experiencing the consequences of their actions and adjusting their behaviour to increase rewards and avoid punishments.

Shannon’s Theseus was not a reinforcement learning system in the modern technical sense — it did not use gradient-based optimisation or dynamic programming or any of the mathematical machinery of modern reinforcement learning. But it had the same fundamental structure: explore the environment, record the outcomes, update behaviour to do more of what worked and less of what did not, eventually converge on a reliable strategy.

The gap between Theseus and modern reinforcement learning is a gap in mathematical sophistication and computational power, not a gap in conceptual approach. Shannon identified the right approach — trial and error learning with outcome-based updating — and demonstrated it in physical hardware in 1950. The mathematical framework for doing it rigorously was developed later, by Richard Bellman (dynamic programming), by researchers in optimal control, and eventually by the AI researchers who formalised reinforcement learning in the 1980s and 1990s.

Shannon was there first. Not with the mathematics, but with the concept and the demonstration. This was characteristic: he saw where things were going before the tools to get there existed, built a physical demonstration of the principle, and left the rigorous development to others.


The Bandwagon Paper: A Warning From the Inventor

In 1956 — the same year as the Dartmouth Conference — Shannon published a short editorial in the IRE Transactions on Information Theory that he titled “The Bandwagon.” It was, in its quiet way, one of the most remarkable things he ever wrote.

The bandwagon Shannon was worried about was information theory itself. In the eight years since “A Mathematical Theory of Communication,” the term “information theory” had become fashionable. Researchers in fields from biology to psychology to economics to linguistics were claiming to be working in information theory, often with minimal connection to the actual mathematical content of Shannon’s work. The entropy measure was being applied to systems and phenomena where its applicability was untested and unclear. The prestige of the rigorous original theory was lending credibility to work that was, in many cases, merely using the terminology without the substance.

Shannon was concerned. Not about the field’s growth — he was delighted that information theory was proving influential. But about the quality of the work being done under its banner. He worried that premature or inappropriate applications of information theory would produce misleading results, waste researchers’ time, and ultimately damage the reputation of the genuine mathematical theory.

His point was that workers in many fields, attracted by the success of information theory, had borrowed liberally from its vocabulary, its concepts, and sometimes its mathematics, without always appreciating what those borrowings implied. In some cases the application was apt and the methods served well; in others the analogy was shaky and the results at best doubtful, at worst misleading.

Reading this warning in 2026, with the benefit of seventy years of hindsight about what happened to AI, is instructive. Shannon was describing, in the specific context of information theory, a pattern that has recurred throughout the history of AI: a genuine theoretical achievement generates excitement, attracts researchers from adjacent fields, inspires applications that range from the brilliant to the dubious, and produces both real advances and significant amounts of hype.

The AI field has experienced this pattern repeatedly. Symbolic AI’s early successes attracted applications in domains where the approach was inadequate. Expert systems were applied far beyond their useful range. Neural networks were overhyped in the 1980s, then dismissed too completely in the AI winter, then overhyped again in the 2010s. Large language models are currently attracting claims that range from “this is general intelligence” to “this is nothing more than autocomplete on steroids” — with the truth, as usual, somewhere more nuanced and more interesting than either extreme.

Shannon’s warning was not against enthusiasm or ambition. It was for rigour and intellectual honesty. It was for the discipline of distinguishing what a result actually showed from what it might be taken to suggest, for the hard work of understanding whether a tool genuinely applied to a new domain before claiming that it did.

It was, in this respect, a model of the scientific temperament that Shannon embodied throughout his career: open to new ideas, willing to speculate, but always disciplined by the requirement that claims be justified by evidence and that enthusiasm not outrun understanding.


The Man Himself: Playfulness as Intellectual Method

No portrait of Claude Shannon is adequate that does not dwell on the playfulness — the juggling, the unicycle, the word games, the mechanical toys — because the playfulness was not separate from the intellectual achievement. It was the same thing, expressed differently.

Shannon was, fundamentally, a person who found delight in problems. Not in solving problems — not in the satisfaction of completion — but in the problems themselves, in the process of seeing structure emerge from what had appeared to be chaos, in the moment when a good question became clear enough to be worth pursuing. He was drawn to problems that had elegant structures, problems where the right formulation made the solution almost obvious, problems where a simple idea illuminated an apparently complex domain.

This quality — the ability to find the right formulation, to see the problem from the angle that made it tractable — is the quality that produced information theory. Before Shannon, the problem of reliable communication over a noisy channel was a messy, multidimensional engineering challenge. Shannon saw it from an angle that made it a clean mathematical problem — the problem of measuring information and proving fundamental limits — and from that angle, the solution followed.

The same quality drove the juggling. Shannon was genuinely interested in the mathematics of juggling — in the relationships between the number of balls, the timing of throws, and the structure of possible juggling patterns. He developed mathematical theorems about juggling — results that described what patterns were possible under what conditions — and built mechanical jugglers that implemented the patterns he had analysed. The juggling was mathematics embodied in physical performance.

The unicycle was similar. Balance is a control system — a continuous feedback loop between the state of the system and corrective actions. Shannon was interested in the mathematics of balance and control, and building and riding a unicycle was a way of studying that mathematics physically. He reportedly could ride a unicycle while juggling, because of course he could.

He also loved word games. Cryptic crosswords, Scrabble (for which he calculated optimal strategies), invented word puzzles, the kind of play with language that is itself a form of mathematical analysis — finding structure in the combinatorics of letters and words, in the information-theoretic properties of natural language. He was, in his own way, doing what Shannon the engineer was doing: finding the mathematical skeleton underneath apparent complexity.

This unity — between the playful and the serious, between the concrete and the abstract, between the physical and the mathematical — was the characteristic feature of Shannon’s mind. It was also, arguably, what made him so extraordinarily productive. He was never bored by the problem he was working on, because he was working on the problem he found delightful. He followed his curiosity without apology, and his curiosity happened to lead him to some of the most important mathematical results of the twentieth century.


The Intersection With Turing: Two Minds, One Problem

Claude Shannon and Alan Turing met in person during the Second World War, when Turing worked at Bell Labs for about two months in early 1943 during a trip to the United States to discuss cryptographic cooperation. The two men met regularly — at lunch and over tea, talking, exchanging ideas — and each came away impressed by the other.

The conversation is not recorded in detail. What we know is that both men were thinking, at this time, about related problems: the mathematical structure of communication, the nature of computation, the relationship between information and intelligence. Shannon was developing what would become information theory. Turing was working on cryptanalysis and thinking about the theoretical foundations of computing.

The convergence of their thinking is striking in retrospect. Shannon’s information theory and Turing’s computability theory are two aspects of the same mathematical structure — the structure of formal symbol manipulation according to rules. Shannon analysed the information content of symbol sequences and the limits of their transmission. Turing analysed the computational processes that transformed symbol sequences and the limits of those transformations. Together, their two theories provided the mathematical foundation of the digital age: information theory tells us how to measure and transmit information, computability theory tells us what can be computed with it.

The two men also shared an interest in machine intelligence, and their contributions to AI are complementary. Turing’s contribution was primarily philosophical and architectural: he defined what computing was, asked whether machines could think, and identified the research program. Shannon’s contribution was primarily foundational and quantitative: he defined what information was, measured it, and established the limits of what communication systems could do.

Both were necessary. A theory of machine intelligence that did not understand computation — that could not characterise what machines could and could not do — would be philosophically confused. A theory of computation that did not understand information — that could not measure what was being processed or quantify the limits of processing — would be incomplete. Turing and Shannon, taken together, provided the conceptual foundations that AI needed: a theory of what computing was and what it could do, and a theory of what information was and how it could be measured.

They crossed paths for a couple of months, in the middle of a world war. The ideas they were developing would eventually reshape civilisation. Those lunchtime conversations are among the most consequential in intellectual history.


Information Theory and Modern AI: The Deep Connection

The connection between Shannon’s information theory and modern AI is not merely historical — it is technical, deep, and ongoing.

The most direct connection is through the concept of entropy. Shannon entropy, as we have discussed, measures the average information content of a source — the average amount of uncertainty resolved by observing a message. In machine learning, entropy appears everywhere: in the loss functions used to train classification models, in the measures of model uncertainty used in active learning and Bayesian inference, in the information-theoretic analyses of what neural networks learn and what they fail to learn.

The cross-entropy loss — the most commonly used loss function for training classification models, including the language models that power modern AI — is directly derived from Shannon’s entropy. Training a language model to predict the next word in a text is, mathematically, minimising the cross-entropy between the model’s predictions and the true distribution of words. The model is being trained to be a good Shannon encoder of natural language — to assign probabilities to word sequences that are close to their true probabilities, thereby minimising the average amount of information needed to describe natural text.
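
The relationship is easy to see in miniature. In the sketch below (the three-word vocabulary and its probabilities are invented), the quantity printed is exactly the per-token cross-entropy, in bits: the negative log of the probability the model assigned to the word that actually came next. Averaged over a corpus, this is the loss a language model is trained to minimise.

```python
import math

def cross_entropy_bits(predicted, actual_word):
    """Bits of 'surprise' the model pays for the word that actually occurred."""
    return -math.log2(predicted[actual_word])

prediction = {"cat": 0.7, "dog": 0.2, "pancake": 0.1}   # model's guess for the next word

print(cross_entropy_bits(prediction, "cat"))       # ~0.51 bits: confident and correct
print(cross_entropy_bits(prediction, "pancake"))   # ~3.32 bits: surprised, heavily penalised
```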

The connection runs deeper than loss functions. Information theory provides a principled framework for understanding what machine learning systems are doing: they are learning compressed representations of their training data, finding regularities and patterns that allow the data to be described more efficiently. The features learned by a neural network are, in an information-theoretic sense, efficient codings of the data — ways of representing the important structure of the data while discarding less relevant detail.

The field of information-theoretic analysis of neural networks — sometimes called the information bottleneck approach, developed by Naftali Tishby and colleagues — tries to understand what is happening inside neural networks in terms of Shannon information: how much information about the input is preserved in each layer, how much information about the target output is captured, how the network balances compression with prediction accuracy. This is Shannon’s theoretical framework applied to understanding the internal workings of modern AI.

There is also a connection through the theory of compression. Large language models are, in a specific technical sense, compressors of natural language — they assign high probability to common and predictable text and low probability to rare and surprising text. The better the model, the more compressed its representation of natural language. Measuring the quality of a language model by its perplexity — a measure derived directly from Shannon entropy — is measuring how well the model has learned to compress natural language.
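
The conventional definition makes the link explicit. If H is a model's average per-word cross-entropy on a test text, measured in bits, then

```latex
\mathrm{perplexity} \;=\; 2^{\,H},
\qquad
H \;=\; -\frac{1}{N} \sum_{i=1}^{N} \log_2 p_{\text{model}}\!\left(w_i \mid w_1, \dots, w_{i-1}\right)
```

Lower perplexity means fewer bits per word: a model that compresses the text better.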

The theoretical limits of language model performance are, in this framework, set by the Shannon entropy of natural language — the intrinsic uncertainty and unpredictability that remains after all statistical patterns have been learned. A perfect language model would have perplexity equal to the true entropy of natural language. No model can do better than this theoretical limit. Shannon told us this in 1948.


The Late Years: A Life Fully Lived

Shannon moved to MIT as a professor in 1956, retaining an affiliation with Bell Labs for some years afterwards, and he remained at MIT until his retirement. His active research effectively wound down in the early 1960s — he had done the fundamental work, and the detailed mathematical development of information theory was carried out by others. He remained interested in AI, in computation, in the mathematical puzzles that had always delighted him.

He made investments — famously successful ones — in technology companies, treating the stock market, characteristically, as one more mathematical problem to be analysed. He continued to tinker, to build, to play. He designed new unicycles. He built more juggling machines. He played Scrabble with a competitive intensity that his opponents found unnerving.

He was diagnosed with Alzheimer’s disease in the 1990s and spent his final years in a care facility in Medford, Massachusetts. The disease took his memory — the thing that had been at the centre of his intellectual life, the ability to hold complex mathematical structures and manipulate them with precision — with a cruelty that those who had known him in his prime found almost unbearable to witness.

He died on February 24, 2001, at the age of eighty-four. He had outlived most of the colleagues who had shared in the making of the information age. He had lived long enough to see the internet, the World Wide Web, the early stages of the digital revolution he had made possible. Whether he understood what he was seeing, in his final years, is not clear.

What is clear is what he had made. Every bit transmitted over the internet. Every compressed image. Every error-corrected communication from a deep-space probe. Every language model trained on the statistical patterns of human text. Every AI system that learns from data, that compresses experience into weights, that predicts the probable from the patterns of the past. All of it runs, at the deepest mathematical level, on the foundation that Shannon built.


The Forgotten Founder

Like Norbert Wiener, Claude Shannon is less celebrated in the popular history of AI than his contributions warrant. You will find his name in every communications engineering textbook, in every information theory course, in the technical literature of every field that deals with data and uncertainty. But in the popular telling of the AI story — the story of the brilliant minds who built thinking machines — Shannon’s name appears less often than it should.

The reason is partly the same as for Wiener: Shannon’s most important work was foundational rather than applied. He did not build a chatbot or a chess program or a neural network. He built the mathematical framework within which all of those things were developed. The foundation is invisible precisely because it is the foundation — it underlies everything, which means it is not visible in any particular thing.

But the invisibility understates the importance. A building without a foundation is not a building. An information age without information theory is not an information age. Every digital system, every AI application, every piece of data ever transmitted or stored or processed, rests on the mathematical results that Shannon established in 1948.

He was also, in a way that deserves recognition, a model of what scientific curiosity looks like at its best. He followed his interests wherever they led, without regard for prestige or practical utility or the expectations of his field. He did mathematics and engineering with equal facility and equal pleasure. He built unicycles and juggling machines and maze-solving mice with the same quality of attention he brought to his most rigorous proofs. He took play seriously, because play is how curious people think when they are thinking at their best.

The information age is his legacy. The AI revolution is, at its deepest foundations, built on his mathematics. The bits that carry every message, the entropy that measures every uncertainty, the channel capacity that limits every transmission — these are Shannon’s. The understanding of what information is, and what can be done with it, and what cannot — this is Shannon’s.

He invented information. That is enough to earn him a permanent place in the story of minds and machines.


Further Reading

  • “A Mind at Play: How Claude Shannon Invented the Information Age” by Jimmy Soni and Rob Goodman — The best biography of Shannon, warm and well-researched, doing justice to both the mathematics and the man.
  • “A Mathematical Theory of Communication” by Claude Shannon (1948) — The original paper, available freely online. The first few pages are accessible to non-mathematicians and worth reading for the clarity of Shannon’s prose.
  • “The Information: A History, A Theory, A Flood” by James Gleick — A broader history of information and information theory, with an excellent account of Shannon’s contribution and its context.
  • “Elements of Information Theory” by Thomas Cover and Joy Thomas — The standard textbook on information theory, for readers who want to engage with the mathematics seriously.
  • “Programming a Computer for Playing Chess” by Claude Shannon (1950) — Short, accessible, and still relevant. Shannon’s analysis of the chess problem anticipated fifty years of chess AI development.

Next in the Profiles series: P6 — John McCarthy: The Man Who Named AI — The Dartmouth organiser, the inventor of LISP, and the man who gave artificial intelligence its name. McCarthy’s story is the story of the early AI field itself — its ambitions, its methods, its achievements, and its complicated legacy.


Minds & Machines: The Story of AI is published weekly. If Shannon’s story surprised you — if you did not know that the man who invented the mathematical foundation of the digital age also built juggling robots and rode a unicycle — share this with someone who would enjoy that combination.