The GPT-4 Moment: One Year That Changed Everything

San Francisco, California. March 14, 2023. OpenAI releases GPT-4.

The release is both anticipated and, when it arrives, astonishing. The AI research community has known for months that a new, more capable model is coming. Benchmark results have leaked in various forms. Demonstrations shared with researchers and enterprise partners have circulated. Expectations are high.

They are exceeded.

GPT-4 achieves scores on the bar exam, the medical licensing exam, the SAT, and dozens of other standardised tests that place it in the top percentiles of human performance. It can describe images, read charts, interpret photographs — a capability the previous GPT models did not have. It can maintain coherent, detailed reasoning across conversations of a length that would have exceeded GPT-3.5’s reliable range. It makes fewer factual errors. It follows complex instructions more reliably. It is, across almost every dimension that can be measured, substantially better than its predecessor.

But the most significant thing about GPT-4 is not the benchmark scores. It is what happens when hundreds of millions of people start using it. What happens when people bring it their actual problems — the medical question they have been embarrassed to ask a doctor, the legal question they cannot afford a lawyer for, the code they cannot figure out how to debug, the email they do not know how to write, the argument they want to understand from multiple sides. What happens when a tool this capable is available to anyone with an internet connection.

The year that follows GPT-4’s release is one of the most consequential years in the history of artificial intelligence. The year that made AI real.

GPT-4 released

Date:: March 14, 2023
Location:: OpenAI, San Francisco, California
Significance:: OpenAI released GPT-4 — a large multimodal language model fine-tuned with RLHF — alongside a “technical report” notable for how little technical detail it disclosed (no parameter count, training dataset size, or architecture details). The model achieved scores on the bar exam (90th percentile), medical licensing exam, SAT, and dozens of other standardised tests that placed it in the top percentiles of human performance; could process image inputs alongside text; and made fewer factual errors than GPT-3.5.
Outcome:: The release triggered the most consequential year in AI history — competitive responses from Google (Bard, then Gemini), Anthropic (Claude 3), and Meta (LLaMA 2 and 3); the Biden Executive Order on AI; passage of the EU AI Act; the Bletchley Summit and AI Safety Institutes; and the broad mainstream adoption that made AI the defining technology of the era.

Important

The most significant thing about GPT-4 was not the benchmark scores. It was what happened when hundreds of millions of people started using it — bringing it their actual problems: the medical question they had been embarrassed to ask a doctor, the legal question they could not afford a lawyer for, the code they could not figure out how to debug. The year that followed GPT-4’s release was one of the most consequential years in the history of artificial intelligence. The year that made AI real.

The Technical Achievement: What GPT-4 Actually Was

Understanding the GPT-4 moment requires understanding what GPT-4 actually was — what specifically made it different from GPT-3.5 and why those differences mattered.

OpenAI did not publish full technical details about GPT-4’s architecture or training at release. The model was described in a “technical report” that was notable for how little technical detail it contained. The report confirmed that GPT-4 was a large language model fine-tuned with RLHF, that it was a multimodal system capable of processing both text and images as input, and that it had undergone extensive safety testing. But it did not disclose the number of parameters, the size of the training dataset, the specific training procedure, or the architecture details that would have allowed outside researchers to fully characterise the system.

This opacity was itself significant and was widely commented on. OpenAI had previously published detailed technical reports for its major models; the shift to opacity with GPT-4 reflected the organisation’s increasing caution about releasing information that might give competitors a detailed roadmap for replicating the system’s capabilities.

Note

OpenAI’s GPT-4 “technical report” was notable for how little technical detail it disclosed: no parameter count, no training dataset size, no architecture details, no specific training procedure. The report confirmed that GPT-4 was a large multimodal language model fine-tuned with RLHF and tested for safety — but it deliberately withheld the details that would have allowed outside researchers to fully characterise or replicate the system. The opacity itself was significant: OpenAI had previously published detailed technical reports for GPT-2 and GPT-3; the shift to opacity reflected increasing caution about giving competitors a roadmap.

What was known, from the technical report and from the evidence of the model’s capabilities, was that GPT-4 represented a substantial qualitative improvement over GPT-3.5 in several dimensions.

Reasoning capability. GPT-4 demonstrated substantially better performance on reasoning tasks that required multi-step thinking, careful tracking of multiple variables, and the construction of valid arguments. The improvement was visible on standardised tests that required systematic reasoning — the bar exam, the medical licensing exam, the GRE — where GPT-4 performed in the top percentiles while GPT-3.5 had performed at more modest levels.

Instruction following. GPT-4 followed complex, multi-step instructions more reliably than GPT-3.5. It was better at maintaining the constraints specified in a prompt across a long response, better at producing outputs in specific formats, and better at handling the kind of nuanced, conditional instructions that arise in professional applications.

Factual accuracy. GPT-4 hallucinated less than GPT-3.5 — it was less likely to confidently generate false information. The improvement was meaningful but not complete; GPT-4 still hallucinated, but at a lower rate and often with more appropriate expressions of uncertainty.

Multimodality. GPT-4’s vision capabilities (the ability to process image inputs alongside text) were the most novel capability of the release. A user could upload a photograph and ask questions about it, upload a chart and ask for interpretation, or upload a screenshot and receive analysis. The vision capabilities were impressive from the beginning and rapidly became one of the most practically useful features of the model.

Info

GPT-4 represented a substantial qualitative improvement over GPT-3.5 in four dimensions:

Reasoning capability — substantially better performance on multi-step reasoning tasks (visible on the bar exam, USMLE, GRE — top percentiles versus GPT-3.5’s modest levels)
Instruction following — better at maintaining prompt constraints across long responses and handling nuanced conditional instructions
Factual accuracy — hallucinated less than GPT-3.5; still hallucinated, but at a lower rate and with more appropriate uncertainty expressions
Multimodality — the most novel capability: vision, allowing users to upload photographs, charts, and screenshots and ask questions about them

The System Card: A New Standard for Transparency

Alongside the technical report, OpenAI published a “system card” for GPT-4 — a document describing the safety testing conducted before deployment, the risks that testing had identified, and the mitigations that had been implemented.

The system card was a genuinely unusual document. AI companies had rarely published systematic accounts of the safety testing conducted on their models, and the GPT-4 system card set a new standard for transparency about the risks and limitations of frontier AI systems.

The card documented specific “red team” testing — systematic attempts to elicit harmful outputs from the model — and described the specific categories of harm that had been identified and mitigated. It disclosed that GPT-4, without safety fine-tuning, was capable of providing advice on how to produce biological, chemical, and radiological weapons — capabilities that the safety fine-tuning had substantially reduced but not eliminated. It disclosed specific examples of harmful outputs that had been produced in testing.

This transparency was valuable and was widely noted by AI safety researchers. The willingness to publish specific, documented examples of model failures — rather than simply asserting that safety testing had been conducted — created accountability and raised the bar for how AI companies communicated about their systems’ risks.

The system card also revealed the specific structure of OpenAI’s safety testing process — the combination of automated red-teaming, expert human red-teaming, and collaboration with external safety researchers — that had been developed over the years of working on increasingly capable models. This documentation of process was as valuable as the specific findings.

Definition

System card — A document, published by OpenAI alongside GPT-4 in March 2023, that systematically describes the safety testing conducted on a frontier AI model before deployment, the risks that testing identified, and the mitigations that were implemented. The system card was a genuinely unusual document: AI companies had rarely published systematic accounts of safety testing. The GPT-4 system card disclosed specific “red team” results, including that, without safety fine-tuning, GPT-4 was capable of providing advice on how to produce biological, chemical, and radiological weapons, and set a new transparency standard that subsequent frontier-model releases (Claude, Gemini) followed.

The Benchmark Sweep: What the Scores Meant

The benchmark results for GPT-4 attracted immediate and extensive media coverage, and the specific scores became a shorthand for the model’s significance. Passing the bar exam in the 90th percentile. Scoring in the 88th percentile on the LSAT. Achieving near-perfect scores on the AP exams in several subjects.

The significance of these scores was both real and frequently overstated.

Real, because the scores represented a genuine capability improvement over previous AI systems. GPT-3.5, which was already impressive, scored at much lower percentiles on the same tests. The jump from GPT-3.5 to GPT-4 was large enough to be a qualitative change, not just an incremental one.

But frequently overstated, because performance on standardised tests is not the same as genuine legal or medical expertise. The bar exam is designed to test specific legal knowledge and reasoning capabilities in the context of a specific legal system. A system that can pass the bar exam by drawing on patterns in its training data has demonstrated pattern-matching capability, not necessarily the judgment, the situational awareness, and the specific domain knowledge that a practicing lawyer has.

The benchmark scores were evidence that GPT-4 had learned to produce outputs that resembled good performance on standardised tests — evidence that was genuinely significant, not trivial. But translating from benchmark performance to real-world capability required understanding what the benchmarks were actually measuring and what GPT-4 was doing when it produced high-scoring answers.

Warning

The benchmark scores were real but frequently overstated. Performance on standardised tests is not the same as genuine legal or medical expertise. The bar exam tests specific legal knowledge and reasoning in the context of a specific legal system; a system that passes it by drawing on patterns in its training data has demonstrated pattern-matching capability, not necessarily the judgment, situational awareness, and domain knowledge that a practicing lawyer has. The benchmark scores were genuinely significant evidence — but translating from benchmark performance to real-world capability required understanding what the benchmarks were actually measuring.

The Sparks Paper: Microsoft’s Assessment

The most influential analysis of GPT-4’s capabilities was not published by OpenAI but by Microsoft Research — the “Sparks of Artificial General Intelligence” paper, published in March 2023 by Sébastien Bubeck and colleagues.

The paper was ambitious in its framing and specific in its demonstrations. The title itself was a provocation: to use “artificial general intelligence” in a paper title, about a system that had just been released, was to claim something significant about what GPT-4 was.

The paper’s substance was a systematic evaluation of GPT-4 across a wide range of tasks: mathematics, coding, reasoning, creative writing, legal analysis, medical diagnosis, spatial reasoning, and visual understanding. In task after task, the paper demonstrated that GPT-4’s performance was remarkable: not just better than previous AI systems but in some cases approaching or matching the performance of domain experts.

The paper also documented specific failure modes — cases where GPT-4’s performance fell short of what a domain expert would achieve, where it made specific types of errors, where its performance was inconsistent. The failures were as important as the successes for understanding what GPT-4 actually was.

The most careful part of the paper was its discussion of what GPT-4’s capabilities implied. The authors were careful not to claim that GPT-4 was conscious or that it had general intelligence in any robust sense. They were equally careful not to dismiss what they had observed. The specific claim — “sparks of AGI” — was intended to describe something real: that GPT-4 demonstrated capabilities that were more general and more flexible than previous AI systems, without claiming that it had achieved genuine general intelligence.

The Sparks paper was widely read, widely cited, and widely debated. Researchers who found it credible cited it as evidence that the trajectory of AI development was moving rapidly toward very capable systems. Researchers who found it overstated cited it as an example of the tendency to anthropomorphise AI systems and to mistake sophisticated pattern-matching for genuine intelligence. The debate it generated was itself evidence of the significance of the GPT-4 moment.

Sébastien Bubeck

Born:: 1984
Died:: Living
Nationality:: French
Role:: Senior researcher at Microsoft Research (earlier an assistant professor at Princeton University); mathematical theorist turned AI capabilities researcher
Known for:: Lead author of “Sparks of Artificial General Intelligence: Early Experiments with GPT-4” (March 2023), the most influential early public assessment of GPT-4’s capabilities. The paper’s title — using “artificial general intelligence” about a system just released — was itself a provocation; the authors carefully framed their claim as describing capabilities more general and flexible than previous AI systems, without claiming GPT-4 had achieved genuine general intelligence.

The Competitive Response: Google, Anthropic, and Meta Enter the Race

GPT-4’s release accelerated the competitive dynamics of the AI industry in ways that had lasting effects on the subsequent development of the field.

Google’s response. Google was the company most directly threatened by GPT-4’s deployment. The integration of GPT-4 into Bing Search, which Microsoft had announced in February 2023, weeks before GPT-4’s release, represented a direct competitive threat to Google’s core search business. Google’s response was Bard, released in March 2023 and initially powered by a lightweight version of Google’s LaMDA model; it moved to PaLM 2 in May 2023.

Google launches Bard

Date:: March 21, 2023
Location:: Google, Mountain View, California
Significance:: Google launched Bard, its conversational AI, in a limited preview one week after GPT-4’s release and six weeks after Microsoft announced its GPT-4-powered Bing Search. Bard initially ran on a lightweight version of LaMDA (Language Model for Dialogue Applications) and moved to PaLM 2 in May 2023. Bard’s public demonstration in February had been marred by a factual error in the demo video (Bard incorrectly claimed that the James Webb Space Telescope had taken the first pictures of a planet outside the solar system), which had briefly caused Google’s stock price to drop.
Outcome:: Bard’s initial release was underwhelming compared to GPT-4: less capable, less reliable, less polished. The gap was a significant embarrassment for Google given its long AI research leadership, and accelerated Google’s AI development timeline. Bard began running on Gemini models in December 2023 and was renamed Gemini in February 2024.

Bard’s initial release was underwhelming by comparison with GPT-4. The model was less capable and less reliable, and the interface was less polished. Google’s public demonstration of Bard in February had been marred by a factual error in the demonstration video (Bard had incorrectly claimed that the James Webb Space Telescope took the first pictures of a planet outside the solar system), which had briefly caused Google’s stock price to drop.

The gap between GPT-4 and Bard was a significant embarrassment for Google, given the company’s long history of AI research leadership. The competitive pressure to close the gap accelerated Google’s AI development timeline in ways that would produce significant results over the following months.

Google announces Gemini

Date:: December 6, 2023
Location:: Google, Mountain View, California
Significance:: Google announced Gemini, the model family that replaced PaLM 2 inside Bard, as a natively multimodal model trained from the ground up to process text, images, audio, and video, rather than being a text model with vision capabilities bolted on. Gemini came in three sizes (Ultra, Pro, Nano) targeting different use cases.
Outcome:: Gemini marked the beginning of Google’s competitive recovery in the large language model race, demonstrating that Google’s AI organisation could close the gap with OpenAI when it deployed its full research and compute resources.

Gemini, the model family announced in December 2023 to replace PaLM 2 inside Bard, represented the significant step up in capability that Google’s AI organisation had been working toward. Gemini was a natively multimodal model trained from the ground up to process text, images, audio, and video, rather than being a text model with vision capabilities bolted on. Its release marked the beginning of Google’s competitive recovery in the large language model race.

Anthropic’s response. Anthropic, which had been operating with Claude 2 as its primary model, accelerated its development timeline and released Claude 3 in March 2024, almost exactly one year after GPT-4. Claude 3 came in three versions (Haiku, Sonnet, and Opus) targeting different use cases with different capability and cost tradeoffs.

Anthropic releases Claude 3

Date:: March 4, 2024
Location:: Anthropic, San Francisco, California
Significance:: Anthropic released Claude 3, almost exactly one year after GPT-4, in three versions: Haiku (fast/cheap), Sonnet (balanced), and Opus (most capable). Claude 3 Opus performed comparably to GPT-4 across most benchmarks and exceeded it on several, particularly those measuring reasoning and accuracy.
Outcome:: The release established Anthropic as a genuine peer competitor to OpenAI rather than a safety-focused alternative with slightly lower capability. It also demonstrated that Anthropic’s Constitutional AI approach could produce models competitive with the best available in both capability and safety — Claude 3 was notably less likely to hallucinate than GPT-4 and its refusal behaviour was more calibrated.

Claude 3 Opus, the most capable version, performed comparably to GPT-4 across most benchmarks and exceeded it on several, particularly those measuring reasoning and accuracy. The release established Anthropic as a genuine peer competitor to OpenAI rather than a safety-focused alternative with slightly lower capability.

Claude 3’s release also demonstrated that Constitutional AI — Anthropic’s approach to alignment — could produce models that were competitive with the best available models in terms of both capability and safety. The model was notably less likely to hallucinate than GPT-4, and its refusal behaviour was more calibrated — less likely to refuse legitimate requests on spurious safety grounds.

Meta’s open-source strategy. Meta took a different strategic direction, releasing LLaMA 2 in July 2023 and LLaMA 3 in April 2024 as open-weight models — models whose parameters were publicly available for download and use.

The open-weight strategy was controversial and consequential. By making frontier-quality models available for free, Meta effectively democratised access to large language model capabilities — anyone with the computing resources to run the models could do so, without depending on OpenAI’s or Anthropic’s APIs. This enabled a large ecosystem of applications, fine-tuned models, and research programmes that would not have been possible with proprietary models alone.

The open-weight strategy also raised concerns about the safety implications of distributing very capable models without the safety constraints that API deployment would allow. LLaMA could, in principle, be fine-tuned to remove its safety training and then be used for harmful purposes. Meta’s position (that open access to AI was worth the risks, and that the benefits of open models for research and for developing-world access to AI outweighed the safety risks) was contested by those who believed that open-weight frontier models created unacceptable risks.

Definition

Open-weight model — A model whose trained parameters (weights) are publicly released for download, inspection, fine-tuning, and local deployment, typically under a licence that permits research and (in some licences) commercial use. Meta’s LLaMA 2 (July 2023), LLaMA 3 (April 2024), Mistral’s models, and DeepSeek-R1 (January 2025) are the most prominent open-weight frontier-class models. The open-weight strategy democratises access to capable models, enabling a large ecosystem of applications, fine-tuned variants, and research that would not be possible with proprietary APIs alone. It also raises specific safety concerns: an open-weight model can be fine-tuned to remove its safety training and then be used for harmful purposes, with no API provider able to prevent it.

The Capability Emergence: What Scale Unlocked

One of the most important and most debated phenomena of the GPT-4 moment was the apparent emergence of qualitatively new capabilities at scale — capabilities that had not been present in smaller models and that appeared to arise discontinuously as models were scaled up.

The emergence phenomenon had been observed before GPT-4 — GPT-3’s in-context learning was arguably an example — but GPT-4 provided the most extensive and best-documented case study.

The specific capabilities that appeared to emerge at GPT-4’s scale included:

Analogical reasoning. The ability to reason by analogy — to identify structural similarities between different domains and to apply knowledge from one domain to problems in another — was substantially better in GPT-4 than in smaller models. The capacity to see that a problem in one domain was structurally similar to a solved problem in another domain, and to adapt the solution, is a hallmark of expert human reasoning.

Theory of mind. The ability to reason about what other people know, believe, and intend — a cognitive capability called theory of mind — showed substantial improvement with scale. GPT-4 was better at understanding the perspectives of characters in stories, at predicting how a person who had different information would interpret a situation, and at reasoning about the goals and intentions behind ambiguous actions.

Multilingual reasoning. GPT-4 showed substantially better performance on tasks that required reasoning in languages other than English, including tasks that required translating concepts between languages or reasoning about culturally specific knowledge. The improvement suggested that scale enabled the development of richer, more integrated representations of different languages.

These emergent capabilities were striking and practically significant. They were also theoretically puzzling: it was not obvious from the architecture of transformer-based language models why these capabilities should emerge at a specific scale threshold rather than improving gradually. The puzzle contributed to uncertainty about what future scaling would produce and what other capabilities might emerge.

Definition

Emergent capabilities — Qualitatively new capabilities in large language models that appear to arise discontinuously as models are scaled up — present in larger models but absent in smaller ones, rather than improving gradually with scale. GPT-3’s in-context learning was arguably an early example; GPT-4 provided the most extensive case study, with emergent analogical reasoning, theory of mind (reasoning about others’ knowledge, beliefs, and intentions), and multilingual reasoning. The phenomenon is theoretically puzzling: it is not obvious from the transformer architecture why capabilities should emerge at specific scale thresholds rather than improving gradually. The puzzle contributes to uncertainty about what future scaling will produce.

The Medical Applications: GPT-4 Meets Healthcare

One of the domains where GPT-4’s capabilities created the most immediate and most consequential interest was healthcare — an area where the potential benefits of capable AI were enormous and the risks of failure were high.

GPT-4’s performance on the USMLE (the United States Medical Licensing Examination) was one of its most publicised benchmark results. Passing the USMLE is a prerequisite for medical licensure in the United States; GPT-4’s reported score in the 85th percentile was a result that most medical students would find impressive.

Several research groups rapidly published studies evaluating GPT-4’s clinical reasoning capabilities. The results were largely positive in terms of benchmark performance: GPT-4 performed well on case-based reasoning tasks, on differential diagnosis problems, on interpretation of laboratory values, and on generation of treatment recommendations for hypothetical cases.

The medical applications that were developed and deployed in the months following GPT-4’s release reflected this benchmark performance. Medical documentation assistants, clinical decision support tools, patient education platforms, and diagnostic support systems all incorporated GPT-4 or its successors. Epic, one of the largest electronic health record companies, announced integration of GPT-4 into its clinical documentation workflow. Nuance, a Microsoft subsidiary, deployed GPT-4 in its clinical documentation product.

The deployment raised important questions that the benchmark performance did not resolve. Clinical medicine requires not just knowledge and reasoning but judgment: the ability to weigh competing considerations, to recognise the limits of one’s knowledge, to communicate uncertainty appropriately, to understand the specific context of the individual patient. The ways that GPT-4 failed (the hallucination of medical facts, the inability to appropriately calibrate certainty, the potential to miss rare conditions that were underrepresented in training data) were directly relevant to clinical deployment.

Warning

The deployment of GPT-4 in healthcare raised questions that the benchmark performance did not resolve. Clinical medicine requires not just knowledge and reasoning but judgment — the ability to weigh competing considerations, to recognise the limits of one’s knowledge, to communicate uncertainty appropriately, to understand the specific context of the individual patient. The specific ways GPT-4 failed — hallucinated medical facts, inability to appropriately calibrate certainty, potential to miss rare conditions underrepresented in training data — were specifically relevant to clinical deployment. The thoughtful practitioner’s approach: deploy GPT-4 as a tool that suggests rather than decides, drafts rather than sends, flags rather than acts — with human review readily available.

The approach that most thoughtful healthcare AI practitioners took was to deploy GPT-4 in ways that maintained human oversight — as a tool that suggested rather than decided, that drafted rather than sent, that flagged rather than acted. The specific applications where GPT-4 was most clearly beneficial were the administrative ones: documentation, summarisation, communication — where the stakes of individual errors were lower and where human review was readily available.

The Legal Applications: AI in the Courtroom and the Law Office

The legal profession’s engagement with GPT-4 was similarly characterised by enthusiasm, concern, and the specific failure that became the year’s most cautionary tale about AI-generated legal work.

GPT-4’s performance on the bar exam — scoring in the 90th percentile — created immediate interest from the legal industry, which had been a significant target for AI applications for several years. Law firms began evaluating GPT-4 for legal research, for contract drafting, for document review, and for summarisation of lengthy legal documents.

The cautionary tale that dominated the legal AI conversation in 2023 came from a federal court case in New York, in which a lawyer submitted legal briefs citing cases that ChatGPT had fabricated. The fake cases, with realistic-sounding names, dates, and citations, had been generated by ChatGPT in response to a research query, and the lawyer had submitted them without verifying that the cases actually existed. The judge, discovering the fabrications, sanctioned the two lawyers responsible for the filings and their law firm.

The case became a canonical example of the hallucination problem in a high-stakes professional context. Legal citations are required to be accurate — the entire authority of a legal argument depends on the real existence of the cases cited. A tool that confidently generates realistic-looking but nonexistent citations is specifically dangerous in this context.

Mata v. Avianca — the fabricated-citations case

Date:: June 8, 2023 (sanctions hearing); order issued June 22, 2023
Location:: United States District Court, Southern District of New York
Significance:: In Mata v. Avianca, Inc., a lawyer representing a plaintiff submitted legal briefs citing cases that ChatGPT had fabricated — with realistic-sounding names (e.g. Varghese v. China Southern Airlines), dates, and citations, but no actual existence in any legal database. The lawyer had asked ChatGPT for similar cases, taken the chatbot’s responses at face value, and submitted them without verification.
Outcome:: Judge P. Kevin Castel sanctioned the two lawyers responsible for the filings and their law firm under Federal Rule of Civil Procedure 11. The case became the canonical example of the hallucination problem in a high-stakes professional context, and led to court rules in multiple jurisdictions requiring disclosure when AI was used in drafting legal filings.

The legal profession’s response was to develop specific practices for using AI in legal research that included mandatory verification of all AI-generated citations against legal databases. Firms issued guidance prohibiting submission of AI-generated work product without human review. Several jurisdictions issued court rules requiring disclosure when AI was used in drafting legal filings.

The episode was instructive both for what it revealed about AI’s limitations and for how the legal profession responded. The response was not to ban AI in legal work — the productivity benefits were real and the adoption was continuing. The response was to develop profession-specific practices for AI use that maintained the accuracy and professional standards that legal practice required.

The Education Reckoning: A Year Into the AI Classroom

One year after ChatGPT’s launch, the education system was in the midst of a reckoning that GPT-4 accelerated. The more capable model, available in the same accessible interface, made the challenges of AI in education more acute and the opportunities more visible simultaneously.

The challenges were familiar: students using AI to produce work that was not their own, assessment design that had been obsoleted by AI’s capabilities, the difficulty of distinguishing AI-generated from human-generated work. GPT-4’s improved capability made these challenges harder — its outputs were better, more difficult to detect, and more convincingly the work of a capable student.

The opportunities were also more visible. GPT-4’s ability to explain concepts at multiple levels of sophistication, to answer follow-up questions about its explanations, to work through problems step by step while explaining each step — these were pedagogical capabilities that had not been available at this level of quality before. Teachers who engaged with the opportunity — who redesigned their teaching to incorporate AI as a learning tool rather than trying to exclude it — reported genuine educational value.

The year produced a more nuanced educational conversation than the immediate post-ChatGPT panic had. The binary framing — AI is cheating versus AI is a tool — was replaced by more granular questions about specific uses, specific subjects, specific age groups, and specific learning objectives. The research literature on AI in education began to accumulate, producing the kind of evidence-based analysis that the initial panic had lacked.

Two pedagogical uses showed the most promise. The first was personalised tutoring: AI systems that could identify specific gaps in a student’s understanding and provide targeted explanations. The second was writing feedback: AI systems that could give specific, actionable comments on drafts, helping students improve their writing skills rather than simply producing finished work for them.

The Labour Market Impact: The First Year’s Evidence

By early 2024, a year of wide GPT-4 deployment had begun to produce the first systematic evidence about AI’s labour market impact — evidence that was more complex and more differentiated than the initial debates had suggested.

Several research studies published in 2023 and early 2024 examined the impact of access to AI tools on worker productivity. The results consistently showed significant productivity improvements for workers who used AI assistance on specific tasks, particularly writing-intensive tasks. Writers, coders, and customer service representatives who used AI assistance produced substantially more output in the same time, or the same output in substantially less time.

The productivity improvements were not uniformly distributed across skill levels. Several studies found that the productivity gains were largest for lower-skilled workers — workers who were less naturally proficient at the tasks in question benefited most from AI assistance. The effect was to compress the productivity distribution, reducing the gap between less skilled and more skilled workers on specific tasks.

This finding had important implications for how to think about AI’s labour market impact. If AI disproportionately increased the productivity of lower-skilled workers, it could reduce inequality by giving lower-skilled workers access to the kind of capabilities that higher-skilled workers had previously monopolised. But it could also reduce the wage premium for skills that AI was now supplying — if AI made less-skilled workers as productive as more-skilled workers on specific tasks, the wage premium for those skills would diminish.
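The compression argument can be made concrete with a small worked example. The numbers below are purely hypothetical, chosen for illustration and not drawn from any of the studies; the only assumption carried over from the text is that AI assistance raises the less-skilled worker’s output by a larger percentage.

```python
# Hypothetical output per hour on one task; numbers are illustrative only.
baseline = {"less_skilled": 10.0, "more_skilled": 20.0}

# Assumed uplift from AI assistance: larger for the less-skilled worker.
uplift = {"less_skilled": 0.40, "more_skilled": 0.10}

# Output per hour once each worker uses AI assistance.
with_ai = {worker: baseline[worker] * (1 + uplift[worker]) for worker in baseline}


def skill_ratio(output):
    """Ratio of more-skilled to less-skilled output; 1.0 would mean no gap."""
    return output["more_skilled"] / output["less_skilled"]


ratio_before = skill_ratio(baseline)
ratio_after = skill_ratio(with_ai)

print(f"gap before AI: {ratio_before:.2f}x")
print(f"gap after AI:  {ratio_after:.2f}x")
```

Under these assumed figures both workers produce more, but the gap between them narrows from 2.00x to roughly 1.57x. Whether that narrowing shows up as reduced inequality or as a reduced wage premium is exactly the open question the two readings disagree on.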

The first-year evidence on employment — whether AI was reducing the number of workers needed for specific roles — was more limited. The productivity improvements were clear; the employment effects were harder to measure in the short term. Layoffs in technology companies and in some content-creation industries were visible, but attributing those layoffs specifically to AI adoption was complicated by other factors — post-pandemic labour market adjustments, macroeconomic conditions, and the specific investment cycles of the technology industry.

The clearest evidence of employment impact was in specific roles that were most directly comparable to what language models could do: content writing, basic coding, customer service, and data entry. In these roles, companies that adopted AI assistance were reporting that they could maintain or increase output with fewer workers — a pattern consistent with AI substituting for some fraction of the work previously done by humans.

Info

One of the most replicated findings of the post-GPT-4 labour-market research: AI assistance disproportionately increased the productivity of lower-skilled workers, compressing the productivity distribution. The same finding cuts two ways:

Optimistic reading: AI reduces inequality by giving lower-skilled workers access to capabilities that higher-skilled workers had previously monopolised.
Pessimistic reading: AI reduces the wage premium for skills it can now supply — if AI makes less-skilled workers as productive as more-skilled workers on specific tasks, the wage premium for those skills diminishes.

The clearest evidence of employment displacement (rather than just productivity gain) was in roles most directly comparable to what language models do: content writing, basic coding, customer service, data entry.

The Government Response Accelerates

The twelve months following GPT-4’s release saw the most significant governmental response to AI in the history of the technology — a response driven by the specific capabilities that GPT-4 had demonstrated and by the pace at which those capabilities were being deployed.

The Biden administration’s Executive Order on AI — published in October 2023 — was the most comprehensive AI governance action by the US federal government to date. The order required frontier AI developers to share safety test results with the government, established safety standards for AI systems with potential for catastrophic misuse, directed federal agencies to develop AI governance policies, and initiated a process for developing international AI governance norms.

The order was ambitious in scope and limited in enforceability: executive orders cannot create the binding legal obligations that legislation can, and although the reporting requirements invoked the Defense Production Act, much of the safety evaluation regime depended in practice on cooperation from AI companies. But the order demonstrated that the federal government was taking AI governance seriously and was beginning to develop the institutional capacity to regulate AI.

In Europe, negotiation of the EU AI Act was completed in December 2023, and the European Parliament approved the act in March 2024, one year after GPT-4’s release. The act’s requirements for general-purpose AI systems, the category that included GPT-4 and its successors, included transparency obligations, safety evaluations, and additional requirements for models judged to pose systemic risk.

In the United Kingdom, the AI Safety Institute was established in November 2023 as part of the Bletchley Summit response, providing the first government body specifically charged with technical evaluation of frontier AI safety.

The pace of governance development was faster than the field had seen before, a response to the pace of capability development and to a scale of deployment that made the urgency of governance undeniable.

The Year in Numbers: Scale and Adoption

The twelve months from March 2023 to March 2024 can be characterised by several numbers that capture the scale of the transformation.

An estimated 100 million monthly ChatGPT users in January 2023. 100 million weekly active users reported by OpenAI in November 2023. By early 2024, ChatGPT and its successors were being used by hundreds of millions of people worldwide.

$29 billion — the estimated revenue of the AI software market in 2023, roughly double the previous year. The AI applications market was growing faster than any previous technology market at comparable scale.

$100 billion — the approximate total venture investment in AI startups in 2023, a record by a significant margin.

One million developers using the OpenAI API by mid-2023. The developer ecosystem built on foundation model APIs was growing at a rate that exceeded the growth of previous developer ecosystems.

40% of workers reporting using AI in their jobs, according to various workplace surveys conducted in 2023. AI tool adoption in professional settings was rapid and broad.

These numbers were not just statistics. They represented a transformation in the role of AI in economic and professional life that was genuinely unprecedented. No previous technology had achieved this scale of adoption this quickly, in this breadth of domains.

Info

The twelve months from March 2023 to March 2024, in numbers:

An estimated 100 million monthly ChatGPT users in January 2023 → 100 million weekly active users reported by OpenAI in November 2023
$29 billion — estimated revenue of the AI software market in 2023, roughly double the previous year
$100 billion — approximate total venture investment in AI startups in 2023, a record by a significant margin
One million developers using the OpenAI API by mid-2023
40% of workers reporting using AI in their jobs (2023 workplace surveys)

No previous technology had achieved this scale of adoption this quickly, in this breadth of domains.

What the Year Revealed: The Honest Assessment

Twelve months of GPT-4 deployment produced an honest, evidence-based assessment of the technology that was more nuanced than either the enthusiasm or the alarm of early 2023 had suggested.

The genuine value. GPT-4 produced genuine value for millions of users across hundreds of applications. The productivity improvements were real and measurable. The access to capabilities that had previously required expensive professionals — legal research, medical information, code generation — was genuinely democratising. The creative applications — writing assistance, brainstorming, ideation — were genuinely useful to people who were not professional writers or designers.

The genuine limitations. GPT-4’s hallucination problem, while improved relative to previous models, remained significant enough to require human oversight for any application where factual accuracy mattered. Its tendency to produce confidently wrong answers in specific domains, particularly when queried about specific facts, recent events, or technical details outside its training data, was a limitation that users learned to manage but never fully overcame. Its inconsistency across conversational contexts, which made it sometimes brilliant and sometimes frustrating, remained a characteristic limitation.

The genuine harms. The spread of AI-generated misinformation accelerated. The use of AI for fraud (voice cloning, synthetic media, sophisticated phishing) became more prevalent. Concerns about AI in hiring, credit scoring, and other high-stakes decisions, which had been raised before GPT-4, became more urgent as the systems became more capable and more widely deployed. The labour market impact, while not yet fully measurable, was visible enough to be a genuine concern.

The genuine governance gap. The governance that existed (the voluntary commitments, the emerging regulatory frameworks, the AI Safety Institutes) was significantly behind the pace of deployment. The specific risks that had been identified (misuse for fraud and misinformation, the labour market impact, the use in surveillance and law enforcement) were being addressed by regulation that was still under development: it was moving faster than regulation usually had, but still significantly slower than the technology.

Important

Twelve months of GPT-4 deployment produced an honest, evidence-based assessment more nuanced than either the enthusiasm or the alarm of early 2023:

The genuine value: productivity improvements real and measurable; access to capabilities (legal research, medical information, code generation) genuinely democratised; creative applications genuinely useful
The genuine limitations: hallucination problem improved but still required human oversight for any application where factual accuracy mattered; inconsistency that made it sometimes brilliant and sometimes frustrating
The genuine harms: spread of AI-generated misinformation accelerated; AI-enabled fraud (voice cloning, synthetic media, phishing) more prevalent; labour market impact visible enough to be a concern
The genuine governance gap: voluntary commitments, emerging regulatory frameworks, and AI Safety Institutes all significantly behind the pace of deployment

The Foundation of What Came Next

The GPT-4 moment — the twelve months from its release through the competitive responses, the Gemini launch, the Claude 3 release, and the regulatory surge — established the foundation for the AI landscape of subsequent years.

It established that frontier AI was now a competitive market — that multiple organisations could build and deploy frontier-class systems, that the competitive dynamics would continue to drive rapid capability improvement, and that the pace of development was not going to slow simply because governance frameworks were lagging.

It established that AI capability had crossed a threshold of practical utility — that the systems were good enough, broadly enough, that adoption by professionals and consumers would continue regardless of the limitations and the governance gaps. The productivity improvements were too real, the access to capabilities too valuable, for the adoption to reverse.

It established the specific governance challenge: not whether AI should be governed — that was agreed — but how to govern it fast enough and well enough to manage the risks without foreclosing the benefits.

And it established the stakes. The systems that were deployed in 2023 were not the most capable systems that would be built. The systems of 2025, 2030, and beyond would be more capable, more widely deployed, and more consequential. The decisions made during the GPT-4 year — about governance frameworks, about safety standards, about labour market policy, about the distribution of AI’s benefits and costs — would shape the trajectory of AI development for decades.

The GPT-4 moment was, in every sense, a moment that mattered. The year that made AI real was also the year that made the choices about AI’s future impossible to defer.

Important

The GPT-4 year established four foundations for the AI landscape that followed:

Frontier AI is now a competitive market — multiple organisations can build and deploy frontier-class systems; competitive dynamics will continue to drive rapid capability improvement; the pace will not slow simply because governance is lagging
AI capability has crossed a threshold of practical utility — the systems are good enough, broadly enough, that adoption will continue regardless of limitations and governance gaps; the productivity improvements are too real, the access to capabilities too valuable, for adoption to reverse
The governance challenge is specific — not whether AI should be governed (that is agreed) but how to govern it fast enough and well enough to manage risks without foreclosing benefits
The stakes are established — the systems of 2025, 2030, and beyond will be more capable, more widely deployed, and more consequential; the decisions made during the GPT-4 year will shape AI development for decades

