Oxford, October 1950. The philosophy journal Mind publishes its latest issue. Among the articles on perception, ethics, and the philosophy of language sits a paper by a thirty-eight-year-old mathematician from Manchester named Alan Turing. The paper is titled “Computing Machinery and Intelligence.” Its opening sentence is a provocation: “I propose to consider the question, ‘Can machines think?’”

The paper is not long — thirty pages, readable in an afternoon. It does not contain any new mathematics. It does not describe a new experiment or a new machine. It is, in the most literal sense, a thought experiment: a carefully constructed imaginary scenario designed to make a vague and contested question precise enough to investigate.

Nobody in October 1950 has any idea what they are holding. The field of artificial intelligence does not yet exist — it will not be named for another six years. The first working stored-program computers are barely two years old. The idea that a machine might one day converse with a human being in a way indistinguishable from another human being seems, to almost everyone who encounters it, either wildly premature or philosophically confused.

Seventy-five years later, the paper has been cited tens of thousands of times. The thought experiment it proposed — now known universally as the Turing Test — is among the most debated ideas in the philosophy of mind and in AI. And the question Turing asked in that opening sentence has still not been answered.

This is the story of how a thought experiment changed the world.


The Problem with “Can Machines Think?”

To understand why the Turing Test was necessary, you first have to understand the problem Turing was trying to solve. And the problem was not technical. It was philosophical. It was a problem with a question.

“Can machines think?” seems like a clear question. You either believe machines can think or you believe they cannot. But as soon as you try to answer it seriously, the question dissolves into a tangle of subsidiary questions that turn out to be far harder than the original.

What is a machine? In the broadest sense, a machine is any device built by human hands to perform a function. But surely the relevant class is narrower than that — we are interested in computing machines, information-processing machines, devices that manipulate symbols according to rules. Very well. But the human brain is, in some sense, a biological machine — an information-processing device built by evolution rather than by human hands. Are brains machines? If they are, and if brains think, then the question “Can machines think?” becomes trivially answered: yes, obviously, because we are machines and we think.

So the question must be asking something more specific — can artificially constructed machines, built from metal and silicon rather than grown from biological tissue, think? But this formulation introduces a new problem. Why should the substrate matter? Why should the fact that something is made of silicon rather than carbon determine whether it can think? If a machine can do everything a brain does — process information, learn, reason, communicate — does it matter what it is made of?

Then there is the problem of defining “think.” Human beings think in many different ways — we reason logically, we imagine, we feel emotions, we have sudden intuitions, we dream. Which of these are necessary for “thinking”? All of them? Some of them? And how would we know if a machine had any of them? We cannot directly observe another person’s thinking — we infer it from their behavior, their words, their actions. Why should the standard be different for machines?

Turing saw all of these problems clearly. He addressed them at the start of his paper with characteristic directness: the question “Can machines think?” was, he wrote, “too meaningless to deserve discussion.” The words “machine” and “think” were so laden with prior assumptions and ambiguities that any attempt to answer the question as stated would immediately collapse into definitional disputes that generated no useful knowledge.

What Turing proposed instead was to replace the vague question with a precise one — one that could, in principle, be answered by observation and experiment rather than by philosophical debate about definitions.

The replacement was the Imitation Game.


The Imitation Game: What Turing Actually Proposed

The Imitation Game as Turing originally described it is slightly different from the version that entered popular consciousness as “the Turing Test.” Understanding the original formulation matters, because the differences are philosophically significant.

In Turing’s original version, the game has three participants. There is a man, A. There is a woman, B. And there is an interrogator, C, who is in a separate room communicating with both A and B through written messages — typewritten, to eliminate clues from handwriting or voice.

The interrogator’s goal is to determine which of A and B is the man and which is the woman. The man’s goal is to deceive the interrogator into thinking he is the woman. The woman’s goal is to help the interrogator — to convince the interrogator that she is, in fact, the woman.

The interrogator can ask any questions. A and B can give any answers. The man can lie. The woman can tell the truth. They can both be asked directly: are you the man or the woman? The man will say he is the woman. The woman will say she is the woman. The interrogator must decide who to believe.

This is a game about deception — about whether the interrogator can be fooled. And it is a game that highlights the strange relationship between what we are and how we appear. The man and the woman have genuinely different attributes — different bodies, different lived experiences, different biological facts about themselves. But in the text-only channel of the game, those genuine differences are only accessible through representation, through the words they choose. The interrogator has no access to the reality. Only to the representation.

Turing then asked: what would happen if, in this game, a computer took the part of A? The machine inherits the deceiver’s role, and B remains a human trying to help the interrogator. The question becomes: will the interrogator decide wrongly as often when the computer is playing A as when the man is playing?
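Turing never specified an implementation, of course, but the shape of his reframed question can be put in a few lines of code. Below is a minimal sketch in Python; the stub players, the coin-flipping judge, and every name in it are placeholders for illustration, not anything Turing described:

```python
import random
from typing import Callable

# A hidden player maps a typed question to a typed answer. A judge maps a
# transcript of (question, answer_a, answer_b) rows to "A" or "B", naming
# the respondent it believes is the honest human, B.
Player = Callable[[str], str]
Judge = Callable[[list[tuple[str, str, str]]], str]

def error_rate(a: Player, b: Player, judge: Judge,
               questions: list[str], rounds: int = 1000) -> float:
    """Fraction of rounds in which the interrogator misidentifies the players."""
    wrong = 0
    for _ in range(rounds):
        transcript = [(q, a(q), b(q)) for q in questions]
        if judge(transcript) != "B":
            wrong += 1
    return wrong / rounds

# Toy stand-ins, purely to make the sketch executable.
man     = lambda q: "I am the woman."   # A: deceives
woman   = lambda q: "I am the woman."   # B: tells the truth
machine = lambda q: "I am the woman."   # the machine, taking A's part
coin_judge: Judge = lambda t: random.choice(["A", "B"])

# Turing's reframing: is the second number distinguishable from the first?
print(error_rate(man, woman, coin_judge, ["Are you the woman?"]))
print(error_rate(machine, woman, coin_judge, ["Are you the woman?"]))
```

What the sketch makes visible is the move Turing was making: intelligence is operationalized as a statistical property of the judge’s error rate, measured entirely from outside, with nothing inspected inside the players.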

Notice what this question does and does not ask. It does not ask whether the computer is really intelligent. It does not ask whether the computer really thinks. It asks whether the computer can behave — in a specific, limited, text-based conversational context — in a way that is indistinguishable from a human being’s behavior.

This is a behavioral test, not a test of inner states. And Turing was explicit about why he chose this approach. We cannot directly observe another mind. We never have been able to. All our evidence for the thinking, feeling inner lives of other human beings comes from their behavior — their words, their actions, their responses. If we accept this behavioral evidence as sufficient grounds for attributing minds to other humans, Turing argued, we should be willing to accept the same quality of evidence for machines.

The famous simplified version — usually stated as: if a machine can converse with a human in a way the human cannot distinguish from another human, the machine can be said to think — captures the spirit of Turing’s proposal but loses some of its philosophical subtlety. The original three-player game, with its explicit focus on deception and appearance, was making a specific point about the epistemology of mind: that we only ever have access to appearances, and that the relevant question is whether the appearance of intelligence, sufficiently sophisticated and sustained, constitutes intelligence.


The Nine Objections: Turing Argues With Himself

One of the most remarkable features of the 1950 paper is what comes after the proposal of the Imitation Game. Rather than simply proposing the test and defending it, Turing spent most of the paper identifying and responding to objections — imagining the strongest arguments against machine intelligence and engaging with them seriously.

He listed nine objections, and his responses to them anticipated the major philosophical debates about AI that would occupy researchers and philosophers for the next seventy-five years.

The Theological Objection. Thinking is a function of the immortal soul. God has given immortal souls to men and women but not to animals or machines. Therefore machines cannot think.

Turing’s response was gently dismissive. He found it curious that theologians should wish to place limits on God’s omnipotence by insisting that God could not grant a soul to a machine if God chose to. He also observed how arbitrarily the boundary of ensoulment had been drawn in the past: Christians and Muslims, he noted, disagreed over whether women had souls, and scriptural arguments of the same general kind had once been deployed against Copernicus and Galileo, and had not aged well. This was not a compliment to the theological tradition.

The “Heads in the Sand” Objection. The consequences of machines thinking would be too dreadful. Let us hope they cannot.

Turing’s response here was one word, essentially: no. Wishful thinking is not an argument. The question of whether machines can think is a factual question about the world, and it cannot be settled by an expression of preference about what we would like the answer to be.

The Mathematical Objection. Gödel’s incompleteness theorems show that in any consistent formal system powerful enough to express arithmetic, there are statements that can be neither proved nor disproved within that system. A machine, as the embodiment of such a formal system, therefore has limitations that a human mind does not.

This was a serious objection, raised by serious mathematicians, and Turing took it seriously. His response was careful: even if Gödel’s results showed that any specific machine had mathematical limitations, it did not follow that humans were not subject to similar limitations, or that there existed some question that a human could answer but no machine could. The human mind, on Turing’s account, was also a finite, physical system subject to its own limitations — we just did not know what all those limitations were.

The Argument from Consciousness. A machine cannot truly think unless it is conscious — unless it has genuine inner experience, feelings, sensations. And we have no reason to believe machines can be conscious.

This was, and remains, the hardest objection. Turing’s response was honest: we cannot solve the problem of consciousness in this paper. We do not know what consciousness is, where it comes from, or how to determine whether any system other than ourselves has it. But he noted that solipsism — the view that only one’s own mind can be directly known — followed from the same logic, and that in practice we did not insist on direct evidence of consciousness before attributing minds to other human beings. Why, then, should we insist on it for machines?

The Argument from Various Disabilities. A machine could never be kind, resourceful, beautiful, friendly, have initiative, have a sense of humor, fall in love, enjoy strawberries and cream, make someone fall in love with it, learn from experience, use words properly, be the subject of its own thought, have as much diversity of behavior as a man, do something really new.

Turing noted that each of these objections was usually advanced without evidence — they were statements of what people felt a machine could not do rather than demonstrations of why not. He was particularly interested in the claim that machines could not learn from experience: it was exactly the claim he set out to undermine in the paper’s closing section on learning machines, where he proposed educating a “child machine” rather than programming an adult mind in full.

Lady Lovelace’s Objection. A machine can only do what it is programmed to do. It has no power to originate anything. It cannot surprise us.

Turing’s response here was subtle and important. He argued that machines had, in fact, surprised him repeatedly. When a machine produced an unexpected result — a move in a game that he had not anticipated, a solution to a problem that he had thought would be harder — it felt like surprise, even knowing that the machine was following rules. He suggested that the distinction between “doing what you are programmed to do” and “being genuinely original” was less clear than it seemed — that what looked like originality might emerge from sufficiently complex rule-following, just as the complexity of human behavior might emerge from the rule-following of neurons.

The Argument from Continuity in the Nervous System. The human nervous system is not a discrete-state machine — it operates continuously, not digitally. A digital computer cannot replicate this continuity.

Turing acknowledged the technical point but argued it was not decisive. Even if the brain’s operations were continuous rather than discrete, this did not show that a discrete-state machine could not produce outputs indistinguishable from a brain’s outputs. The test was behavioral, not mechanistic.

The Argument from Informality of Behaviour. Human beings do not follow a fixed set of rules. Their behavior is not reducible to any finite set of instructions. A machine, which must follow explicit rules, cannot replicate this irreducible freedom.

Turing’s response: we do not actually know that human behavior is not rule-governed. The apparent freedom of human behavior might reflect our ignorance of the rules being followed rather than the absence of rules. And if the rules are complex enough — if following them produces behavior as varied and contextually appropriate as human behavior — the distinction between rule-following and freedom might not be meaningful.

The Argument from Extrasensory Perception. Experiments in ESP have shown that humans have capacities beyond what any physical machine could possess.

Turing, characteristically, took this seriously rather than dismissing it as nonsense; he even described the statistical evidence for telepathy, as it then stood, as “overwhelming.” He conceded that if telepathy were real, the game would need tightening, since a clairvoyant interrogator or a telepathic human contestant could skew the results. His solution was practical rather than metaphysical: put the competitors in a “telepathy-proof room.” The objection did not show that thinking was beyond mechanism; it showed only that the conditions of the test would have to be controlled.

These nine responses constitute a remarkable philosophical achievement. Turing did not merely propose a test. He engaged with every major objection to machine intelligence that had been or could be raised, and provided responses that remain influential today. The 1950 paper is not a technical document. It is a work of philosophy — careful, precise, imaginative, and intellectually courageous.


The Immediate Reception: Nobody Noticed

For the first several years after its publication, “Computing Machinery and Intelligence” attracted relatively little attention. This was not entirely surprising. It was published in a philosophy journal read primarily by philosophers, most of whom found its subject matter — computing machines — remote from their concerns. Computer scientists and mathematicians, who might have found it more relevant, did not typically read Mind.

The small audience that did encounter it had mixed reactions. Some philosophers found it provocative and stimulating — exactly the kind of thought experiment that philosophical inquiry thrives on. Others found it frustrating: Turing had, in their view, sidestepped the real question rather than answered it. Replacing “Can machines think?” with “Can machines play the Imitation Game?” did not settle the question of machine thinking — it just changed the subject.

This criticism would become the basis of one of the most sustained philosophical attacks on the Turing Test, mounted decades later. But in 1950, the paper was simply new — an interesting idea from an interesting mathematician, not yet the founding document it would eventually become.

What changed was the development of the field the paper had anticipated. As AI became a named discipline after Dartmouth in 1956, as researchers began building programs that could perform intellectual tasks — prove theorems, play games, solve problems — the Turing Test became the natural benchmark. If you wanted to argue that your program was genuinely intelligent, the obvious question was: could it pass the Turing Test? Could it convince a human interrogator that it was human?

This framing was not entirely Turing’s intention — he had proposed the test as a way of making the question of machine intelligence tractable, not as the only or even the primary measure of AI progress. But it was how the field adopted it, and it has shaped AI research ever since.


ELIZA: The First Ghost in the Machine

The first program to make the Turing Test feel viscerally real — rather than merely theoretically interesting — was not a sophisticated AI system. It was a relatively simple pattern-matching program written by a computer scientist at MIT named Joseph Weizenbaum, in 1966.

Weizenbaum called his program ELIZA. It was designed to simulate a conversation by identifying keywords in the user’s input and generating responses based on simple transformation rules. If you typed “I am feeling sad,” ELIZA might respond “Why do you say you are feeling sad?” or “Tell me more about feeling sad.” If you mentioned your mother, ELIZA would ask about your relationship with your mother. The program did not understand anything it said. It had no model of the world, no knowledge base, no reasoning capability. It was pattern matching — sophisticated find-and-replace.

Weizenbaum created a particular conversational persona for ELIZA — he called it DOCTOR — that simulated a Rogerian psychotherapist, a style of therapy that involves reflecting the patient’s words back to them as questions. This turned out to be an inspired choice, because Rogerian therapy’s conversational style happened to be particularly easy to simulate with pattern matching. A therapist who responds to “I hate my job” with “Tell me more about hating your job” is, from a purely mechanical standpoint, doing something not entirely different from what ELIZA was doing.
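Weizenbaum wrote the original in MAD-SLIP; the Python sketch below is a toy reconstruction of the keyword-and-transformation idea, using the two exchanges quoted above. The rule list and fallback line are illustrative inventions, not Weizenbaum’s actual rules:

```python
import re

# Pronoun reflection: "my" in the input becomes "your" in the reply, etc.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are",
               "you": "i", "your": "my"}

# (pattern, response template) pairs, tried in order.
RULES = [
    (re.compile(r"i am (.*)", re.I), "Why do you say you are {0}?"),
    (re.compile(r"i hate (.*)", re.I), "Tell me more about hating {0}."),
    (re.compile(r".*\bmother\b.*", re.I),
     "Tell me about your relationship with your mother."),
]

def reflect(fragment: str) -> str:
    # Swap first- and second-person words so the echo reads naturally.
    return " ".join(REFLECTIONS.get(w, w) for w in fragment.lower().split())

def respond(line: str) -> str:
    for pattern, template in RULES:
        match = pattern.match(line.strip().rstrip("."))
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please go on."   # the all-purpose fallback

print(respond("I am feeling sad"))   # -> Why do you say you are feeling sad?
print(respond("I hate my job"))      # -> Tell me more about hating your job.
```

Thirty lines of find-and-replace. That is the entire order of mechanism behind the reactions described next.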

The response to ELIZA shocked Weizenbaum. People — intelligent, educated people, including people who knew they were talking to a program — found themselves emotionally engaged with it. They told it things they had not told other people. They felt that it understood them. When Weizenbaum proposed recording the conversations so he could study them, users objected that this would violate their privacy.

Weizenbaum’s secretary, who had watched him build ELIZA and knew perfectly well what it was, asked him to leave the room so she could have a private conversation with it.

Weizenbaum was appalled. He had not built ELIZA to demonstrate the possibility of human-computer intimacy. He had built it as a demonstration of the superficiality of certain kinds of conversation — to show how easily a sense of meaningful communication could be created by simple mechanical means. What he had actually demonstrated, to his horror, was that people were desperately willing to project intelligence, empathy, and understanding onto any system that behaved as though it had them.

ELIZA did not pass the Turing Test in any rigorous sense. Any sustained conversation would quickly reveal its limitations. But it revealed something that Turing had identified in his paper and that ELIZA made undeniable: the bar for human attribution of intelligence to a machine was lower than anyone had thought. We do not need much. A few well-chosen phrases, a pattern of responses that mirrors our own language back to us, and something in us reaches out and insists on finding a mind there.

Weizenbaum spent the rest of his career warning about this tendency. His 1976 book “Computer Power and Human Reason” was a sustained argument against the uncritical anthropomorphization of computers — against the dangerous human willingness to believe that simulation was the same as reality. He had built the most convincing conversational program of its era and become its most prominent critic. He is an important and often neglected figure in AI history, and he will have his own profile in this series.

But the lesson of ELIZA was not only what Weizenbaum drew from it. It also demonstrated, however crudely, that something Turing had predicted was possible: a machine could be built that gave human interlocutors the experience — however shallow and easily disrupted — of genuine conversation. The direction, if not the destination, was visible.


Searle’s Chinese Room: The Philosopher Strikes Back

In 1980, the philosopher John Searle published a paper that has become the most famous philosophical attack on the Turing Test and on the computational theory of mind that underlies it. The paper was called “Minds, Brains, and Programs,” and its central device was a thought experiment called the Chinese Room.

Imagine, Searle said, that you are locked in a room. Through a slot in the wall, people outside pass you pieces of paper with Chinese symbols written on them. You do not understand Chinese — to you, the symbols are meaningless marks. But you have an enormous set of rule books that tell you: if you receive this sequence of symbols, produce that sequence of symbols in response. You follow the rules, look up the appropriate response, and pass it back through the slot.

From outside the room, the interaction looks like a conversation with someone who understands Chinese. The responses are appropriate, coherent, contextually sensitive. An observer who did not know about the rule books — who could only observe the inputs and outputs — might reasonably conclude that someone inside the room understood Chinese.

But you do not understand Chinese. You have no idea what any of it means. You are manipulating symbols according to rules without any comprehension of what the symbols represent.
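The rule-follower’s predicament can be made concrete in a few lines. The sketch below is deliberately trivial, with two invented table entries; a program good enough to pass the test would need a rule book unimaginably larger, but Searle’s claim is that its situation would be the same in kind:

```python
# A toy "rule book": symbol sequences mapped to symbol sequences. The English
# glosses exist only in these comments, for the reader's benefit; the function
# below has no access to them.
RULE_BOOK = {
    "你好吗？": "我很好，谢谢。",    # "How are you?" -> "I'm fine, thanks."
    "你懂中文吗？": "当然懂。",      # "Do you understand Chinese?" -> "Of course."
}

def chinese_room(symbols: str) -> str:
    # Pure syntax: match the input string, return the listed output string.
    return RULE_BOOK.get(symbols, "请再说一遍。")   # "Please say that again."
```

Everything the function “knows” is in the table; whatever understanding exists went into writing it. Whether that observation rescues the program or condemns it is exactly what the replies discussed below dispute.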

Searle’s argument was that this was precisely what a computer program did. A program that passed the Turing Test would be in exactly the position of the person in the Chinese Room: manipulating symbols according to rules, producing outputs that appeared meaningful, without any genuine understanding of what those symbols meant. The appearance of intelligence would not be intelligence. The appearance of understanding would not be understanding.

What was missing, Searle argued, was intentionality — the property of mental states whereby they are about something, they refer to things in the world, they mean something. Syntax — the manipulation of symbols according to rules — was not sufficient to produce semantics — genuine meaning and understanding. No amount of symbol manipulation, no matter how sophisticated, would cross the gap from processing to understanding.

The Chinese Room argument was met with an enormous and still ongoing philosophical response. Critics identified multiple problems with the argument.

The Systems Reply argued that while the person in the room did not understand Chinese, the system as a whole — the person, the rule books, the room — did. Understanding was a property of the whole system, not of any individual component. The neurons in your brain do not individually understand English either, but together they somehow produce a mind that does.

The Brain Simulator Reply asked: what if the program being run in the Chinese Room was a perfect simulation of the neural processes of a Chinese speaker’s brain? Would Searle then say there was no understanding? If yes, he seemed to be committed to the view that biological neurons had some special property — some understanding-generating magic — that was not present in any other physical substrate. What was that property?

The Robot Reply asked: what if the program controlling the Chinese Room responses was also connected to sensors and actuators — to eyes that could see, hands that could manipulate the world, a body that moved through space? Would embodied interaction with the world change the status of the program’s understanding?

Searle responded to each of these objections, and philosophers have been arguing about his responses ever since. The Chinese Room debate is one of the most productive and most stubborn in the philosophy of mind — a genuine philosophical puzzle that has not been resolved to the satisfaction of all parties even after forty-five years of argument.

What the debate revealed, whatever one thinks of Searle’s specific argument, was the depth of the philosophical problem that Turing’s simple thought experiment had uncovered. The question “Can machines think?” turned out to implicate some of the most fundamental and most contested questions in philosophy: What is understanding? What is meaning? What is consciousness? What is the relationship between physical processes and mental states?

Turing had suspected this. It was why he chose to sidestep these questions by proposing a behavioral test rather than a test of inner states. But Searle showed that the sidestepping could not be maintained indefinitely — that the behavioral test, if taken seriously as a criterion for intelligence, forced engagement with exactly the questions Turing had tried to avoid.


The Loebner Prize: The Test Becomes a Competition

In 1990, Hugh Loebner, an American inventor and eccentric, put up money for an annual competition based on the Turing Test. The Loebner Prize — officially the Loebner Prize in Artificial Intelligence — offered $100,000 and a gold medal for the first program to pass an unrestricted Turing Test, with the gold-medal criterion eventually requiring audio-visual performance as well as text and a $25,000 silver medal reserved for the text-only version. For the programs that fell short, a bronze medal and a smaller cash award went annually to the entry the judges found most convincingly human.

The competition was held nearly every year from 1991 until 2019, winding down after Loebner’s death in 2016. No program ever won the gold medal. No program ever passed a full Turing Test in the terms the competition specified.

But the competition produced a series of fascinating and sometimes disturbing demonstrations of how far conversational programs had come — and of how vulnerable human judges were to being fooled.

Early winners included programs that specialized in deflection — responding to difficult questions with jokes, non-sequiturs, or claims to be confused, rather than attempting to answer. This strategy worked better than most people expected. Human conversation is full of deflection, humor, and apparent incoherence, and judges often found it hard to distinguish a program’s evasions from the kind of evasions a real person might produce.
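The deflection strategy is almost embarrassingly easy to implement. A caricature in Python, not any particular entrant’s code:

```python
import random

# Canned evasions: never answer the question, always redirect.
DEFLECTIONS = [
    "Ha! You remind me of my uncle when you talk like that.",
    "Sorry, I'm a bit scattered today. What were we talking about?",
    "Why does everyone keep asking me things like that?",
]

def deflecting_bot(question: str) -> str:
    # A substantive answer can be probed and falsified; an evasion cannot.
    return random.choice(DEFLECTIONS)
```

Entries in this spirit scored points precisely because they refused to generate testable content.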

As the decades passed, the programs became more sophisticated. They incorporated more knowledge, more contextual awareness, better language models. They became harder to detect.

In 2014, a program called Eugene Goostman — simulating a thirteen-year-old Ukrainian boy, a persona deliberately chosen to explain apparent gaps in knowledge and odd language use — was reported by the event’s organizers and some journalists to have “passed the Turing Test” for the first time, convincing 33% of judges that it was human over five-minute conversations at a competition organized by the University of Reading and held at the Royal Society in London.

The claim was immediately and vigorously disputed. Critics pointed out that the competition’s judges were not the rigorous interrogators Turing had imagined. The time limit was short: five minutes is not long enough to probe conversational depth carefully. The persona of a non-native English-speaking teenager was specifically designed to excuse the program’s limitations. And the 30% threshold the organizers used came from a prediction Turing had made about the year 2000, not from any pass criterion he proposed; 33% was a bare squeak past an arbitrary bar, not a convincing demonstration.

The Eugene Goostman episode illustrated a persistent problem with the Turing Test as a practical benchmark: it was susceptible to gaming. A program designed specifically to deceive a specific type of judge in a specific type of test could do better than a more generally capable program that was not optimized for the particular competition format. The test measured a specific, narrow capability — short-term conversational deception — rather than the broad intelligence Turing had been interested in.


Modern AI and the Test That Would Not Die

The development of large language models in the 2020s brought the Turing Test back to the center of public conversation with a force that no previous AI development had matched.

GPT-3, released in 2020, produced text that many readers found genuinely surprising — not just superficially plausible but contextually appropriate, stylistically flexible, and often insightful in ways that felt uncomfortably human. GPT-4, released in 2023, was better. Claude, Gemini, and their contemporaries were better still. By 2024, the outputs of the best language models were, in many contexts, indistinguishable to most readers from human-produced text.

Had these models passed the Turing Test? The question was everywhere. And the answers revealed how much ambiguity remained in the concept.

In one sense, clearly yes. If you showed a typical reader a passage of text generated by GPT-4 and a passage written by a thoughtful human, they would often not be able to tell which was which. The behavioral criterion — producing output indistinguishable from a human’s — was being met routinely, in many contexts, by 2023.

But Turing’s specific formulation involved an interactive conversation, not just text production. In an extended, probing conversation — one where a skilled interrogator was specifically trying to detect a machine — the best language models were still making characteristic errors. They made confident factual mistakes. They lost track of conversational context over long exchanges. They sometimes produced responses that were locally coherent but globally inconsistent with earlier statements. A skilled interrogator, paying careful attention, could often detect these patterns.

But the gap was narrowing. And as it narrowed, the Turing Test’s limitations as a measure of genuine intelligence became more apparent.

The fundamental problem was that the test measured the ability to appear human, not the possession of genuine intelligence. A system could appear human by being very good at producing human-like text, without having any of the understanding, reasoning, or consciousness that we typically associate with intelligence. The Chinese Room objection — which had seemed abstract and academic when Searle first raised it — became newly urgent when confronted with systems that were producing remarkably human-like outputs through processes that were, at bottom, sophisticated statistical pattern matching.

Was there genuine understanding in a large language model? Were there genuine intentions? Genuine comprehension? Or was the model doing something like what the person in the Chinese Room was doing — manipulating symbols in ways that produced the appearance of understanding, without the reality?

These questions do not have agreed-upon answers. Researchers, philosophers, and the technologists who build these systems are genuinely divided. Some argue that sufficiently sophisticated pattern matching is understanding — that there is no meaningful distinction between a system that behaves as if it understands and a system that genuinely understands, because the only evidence we ever have for understanding is behavioral. Others argue that understanding requires something more — a grounding in the world, a connection between symbols and referents, a subjective experience of meaning — that no current AI system possesses.

Turing did not resolve this. His paper pointed at the question without answering it. That was, perhaps, the point.


What the Test Got Right

Despite its limitations and controversies, the Turing Test made several genuinely important contributions to AI and to the philosophy of mind — contributions that deserve recognition even by those who ultimately find the test insufficient.

First, it forced precision. By replacing the vague question “Can machines think?” with a specific, observable criterion — can a machine fool an interrogator in a text-based conversation? — Turing made the question empirically tractable. It became possible to build systems and test them against the criterion, to make progress in measurable ways, to compare approaches. Without some criterion of success, a field cannot know whether it is making progress.

Second, it established the behavioral approach to intelligence as a legitimate scientific methodology. Before Turing, the question of machine intelligence was often treated as metaphysical — a matter of deciding whether machines had souls, or whether they were really thinking, questions that seemed unanswerable by empirical means. Turing’s insistence that the only accessible evidence was behavioral, and that behavioral evidence was the right kind of evidence, opened up the empirical investigation of machine intelligence as a scientific enterprise.

Third, it shifted the burden of proof. Before the Turing Test, the implicit assumption in most discussions of machine intelligence was that machines obviously could not think — the question was whether any conceivable development could overcome this obvious impossibility. Turing’s paper inverted this. He proposed that if a machine could satisfy the behavioral criterion, the burden of proof shifted to those who wanted to argue it was not genuinely intelligent. They needed to explain what, beyond the behavioral evidence, they required, and why.

This inversion was enormously productive. It forced the critics of machine intelligence to articulate their position precisely — to say not just “machines cannot think” but specifically what machines would need to have, or be, or do, before we would be willing to say they thought. The Chinese Room, the consciousness objections, the arguments about intentionality and grounding — all of these are responses to Turing’s inversion of the burden of proof. They are attempts to specify what is missing, to say what the behavioral test does not measure.


What the Test Got Wrong

The Turing Test’s failures are as instructive as its successes.

The most fundamental problem is that it measures the wrong thing. Human beings are not the only possible model of intelligence. Animal minds already express intelligence in distinctly non-human ways, and artificial or alien intelligences, if and when they exist, need not resemble us at all. A test that measures the ability to appear human will systematically fail to recognize intelligence that expresses itself differently.

This is not a merely theoretical concern. Many genuinely intelligent behaviors — mathematical proof, chess playing, protein-structure prediction — are things that AI systems do extraordinarily well but that would not help them pass the Turing Test, because the Turing Test is about human-style conversation. Conversely, a system could be very good at the Turing Test while being quite limited in other intellectual domains.

The second problem is that the test is susceptible to gaming. A system can be specifically designed and optimized to fool interrogators without having the general capabilities that the test was meant to measure. The history of Turing Test competitions is partly a history of systems that achieved high scores through strategies — deflection, persona management, careful exploitation of judges’ expectations — that had little to do with general intelligence.

The third problem, identified most forcefully by Searle, is the gap between appearance and reality. A system can produce outputs indistinguishable from a human’s without having anything like the understanding, intentionality, or consciousness that we associate with genuine intelligence. Whether this gap matters — whether there is any fact of the matter about “genuine” intelligence beyond behavioral performance — is precisely the philosophical question that the Turing Test cannot answer.

Turing knew this. He was honest, in the paper, about what his test did and did not establish. He did not claim that a machine that passed the test was therefore genuinely intelligent. He claimed that a machine that passed the test would give us no reasonable grounds to deny it intelligence — grounds of the kind we actually use when attributing minds to other humans. This is a more modest claim than it is often presented as, and a more defensible one.


The Test Beyond AI: What It Reveals About Us

Perhaps the most important thing the Turing Test reveals is not about machines but about us — about human beings and the way we recognize and attribute minds.

We are, it turns out, surprisingly bad at detecting the absence of mind. We attribute minds to things that do not have them — to characters in novels, to animated characters in films, to stuffed animals, to ELIZA programs. We are, in the language of cognitive science, hyper-sensitive to agency: we evolved in a world where the cost of failing to detect a mind — missing a predator, failing to understand a social partner — was much higher than the cost of falsely attributing a mind to something that did not have one. So we err on the side of attribution. We see minds everywhere.

The Turing Test exploits this tendency. It is designed to trigger our mind-attribution instincts by presenting something that behaves like a mind — that uses language, responds contextually, exhibits apparent understanding — and asking us to judge whether the behavior reflects genuine intelligence or is merely simulated.

What the history of the Turing Test suggests is that this judgment is much harder than we thought, and that our instinctive responses are not reliable guides to the truth. We over-attribute minds to machines. We project understanding onto systems that are producing the appearance of understanding through quite different means. We mistake the map for the territory.

This is not just a problem for AI research. It is a problem for how we understand our own minds, our own intelligence, our own relationship to the information-processing systems that are increasingly part of our world. If we cannot reliably tell the difference between a mind and a good simulation of a mind — if our best behavioral test does not settle the question — we need to think harder about what minds actually are, how they arise, and what, if anything, distinguishes them from the machines we are building to simulate them.

Turing gave us this problem as a gift. He wrapped it in a simple thought experiment, presented it in a philosophy journal in 1950, and left it for the future to unwrap.

We are still unwrapping it.


The Verdict That Never Came

In the seventy-five years since Turing published his paper, the question he asked has been approached, circled, prodded, and analyzed from every angle. It has generated a philosophical literature of enormous depth and breadth. It has inspired the building of systems of extraordinary sophistication. It has become, in the form of the Turing Test, the most widely known benchmark in AI — and the most widely criticized.

And it has not been answered.

We do not know whether machines can think. We do not know whether the most sophisticated AI systems in existence today are doing something that deserves to be called thinking, or whether they are doing something else that merely resembles thinking from the outside. We do not know whether this question has a factual answer, or whether “thinking” is a concept too tied to human biology and human experience to apply meaningfully to systems built from silicon and mathematical operations.

Turing predicted that by about 2000, machines would play the game well enough that an average interrogator would have no more than a 70% chance of making the right identification after five minutes of questioning. On timing he was off by a decade or so; on substance he was close to the mark. His deeper prediction — that by then the question would no longer seem worth debating — was wrong. The question is more urgently debated today than it has ever been.

Perhaps this is, as Turing might have suspected, because the question was never just about machines. It was always also about us — about what we are, what it means to think, and whether the thing that makes us feel most distinctively human is as special, as irreducible, as mysterious as we like to believe.

The machines are getting better. The question is getting harder. And in an era when the systems on the other side of our text-based conversations may or may not be human, when the question of whether you are talking to a mind or a machine is genuinely uncertain in a way it never was before, the thought experiment that a mathematician published in a philosophy journal in 1950 has stopped being a thought experiment.

It has become the condition of modern life.


Further Reading

  • “Computing Machinery and Intelligence” by Alan Turing (1950) — Read the original. It is freely available online and remarkably readable. Everything else in this article flows from it.
  • “Minds, Brains, and Programs” by John Searle (1980) — The Chinese Room paper. Also available online. Essential reading for understanding the philosophical pushback against the Turing Test.
  • “The Most Human Human” by Brian Christian (2011) — The author competed in the Loebner Prize as a human confederate — trying to be more convincingly human than the machines. A brilliant book about what the Turing Test reveals about humanity.
  • “Gödel, Escher, Bach” by Douglas Hofstadter (1979) — A Pulitzer Prize-winning exploration of mind, self-reference, and machine intelligence that engages deeply with the questions Turing raised.
  • “I Am a Strange Loop” by Douglas Hofstadter (2007) — Hofstadter’s more direct engagement with the question of consciousness and what distinguishes minds from machines.

Next in the Events series: E3 — The Logic Theorist, 1956: The First AI Program — The night Allen Newell and Herbert Simon’s program proved mathematical theorems that had taken Russell and Whitehead years to establish — and why the researchers who built it believed, in 1956, that they had solved the problem of machine intelligence. They were wrong. But they were not entirely wrong.


Minds & Machines: The Story of AI is published weekly. If this piece made you think differently about intelligence — yours or a machine’s — share it with someone who would appreciate the question.