Minds & Machines · E21 · Act V: The Explosion

The Multimodal Moment: When AI Learned to See, Hear, and Speak

San Francisco, California. April 2022. An artist named Greg Rutkowski is searching his own name on an image generation website and finding images he did not create.

Rutkowski is a Polish digital fantasy artist whose work — detailed, luminous, populated with heroes and monsters in the style of the great nineteenth-century Romantic painters — has been widely shared online and has become a specific aesthetic marker in certain corners of the internet. His name has become, in the months since AI image generation became accessible to the public, one of the most popular style prompts in the AI generation communities. When people want to generate images in a specific aesthetic — epic fantasy, richly detailed, painterly — they type “in the style of Greg Rutkowski” and the AI produces something that looks, to varying degrees, like Rutkowski’s work.

Rutkowski has not consented to this. He has not been paid for this. His name and his aesthetic have become training material for a system he had no role in building, and the system is now generating images that compete with his own work in the market for fantasy illustration.

“My name became a token in a machine,” he says in an interview. “I didn’t sign up for that.”

The Rutkowski case is a window into the most complex, most contested, and most culturally significant development of the AI era: the multimodal revolution — the integration of visual, audio, and language processing into unified AI systems that can generate, understand, and translate between every medium of human communication.

Important

The multimodal revolution — the integration of visual, audio, and language processing into unified AI systems that can generate, understand, and translate between every medium of human communication — is the most complex, most contested, and most culturally significant development of the AI era. The Rutkowski case opened the window onto it: an artist discovering his name had become a token in a machine, his aesthetic turned into training material without his consent or compensation, his market position undermined by a system he had no role in building.


Before Multimodality: The Siloed AI World

The deep learning revolution of 2012 transformed AI’s capabilities in specific domains. Computer vision systems learned to classify images. Speech recognition systems learned to transcribe audio. Natural language systems learned to process and generate text. Each of these capabilities was impressive. Each was developed in relative isolation from the others.

The siloed structure of AI capabilities reflected the siloed structure of AI research. Computer vision researchers worked on vision. NLP researchers worked on language. Speech researchers worked on audio. The architectures, the training data, the evaluation benchmarks, and the research communities were separate, and progress in one domain did not automatically transfer to the others.

This siloed structure was, in retrospect, a historical accident: AI capabilities had matured in separate domains before there were good ways to connect them. The technical barriers to connecting different modalities were not fundamental. The transformer architecture that proved effective for language turned out, with appropriate modifications, to be effective for vision and audio too. But until those modifications were developed and demonstrated, the siloed structure persisted.

The multimodal revolution was the collapse of the silos — the development of AI systems that could process and generate multiple modalities simultaneously, that could understand the relationships between what was seen, heard, and said, and that could translate between modalities in ways that opened entirely new capabilities.

The collapse happened in several stages, driven by different research communities working on different aspects of the problem. Understanding the multimodal revolution requires understanding these stages and the specific contributions that made them possible.

Note

Before multimodality, AI capabilities were siloed: vision researchers worked on vision, NLP researchers on language, speech researchers on audio — with separate architectures, training data, benchmarks, and communities. The siloing was not fundamental (the transformer architecture that proved effective for language turned out, with appropriate modifications, to be effective for vision and audio too), but it persisted until those modifications were developed and demonstrated. The multimodal revolution was the collapse of the silos — the development of AI systems that could process and generate multiple modalities simultaneously, and translate between them in ways that opened entirely new capabilities.


CLIP and Contrastive Learning: The Foundation

The technical foundation for the multimodal revolution was laid by a 2021 paper from OpenAI: “Learning Transferable Visual Models from Natural Language Supervision” — the paper introducing CLIP (Contrastive Language-Image Pre-training).

CLIP was not an image generation system. It was an image understanding system: a model trained to understand the relationship between images and text descriptions. Specifically, CLIP was trained on a dataset of approximately 400 million image-text pairs from the internet, learning to match each image to its text description and each description to its image.

The training objective was contrastive learning: given a batch of images and text descriptions, train the model to produce high similarity scores for matched pairs and low similarity scores for mismatched pairs. This objective did not require labelled data in the traditional sense — it required only paired image and text data, which was available in enormous quantities on the internet.
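The objective described above can be sketched in a few lines of NumPy. This is a minimal illustration, not OpenAI's training code: the embeddings are random stand-ins for encoder outputs, the function names are hypothetical, and the temperature of 0.07 is an assumed value chosen for the example.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an image-text similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity scores
    idx = np.arange(len(logits))                 # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # each text picks its image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                           # stand-in text embeddings
aligned_images = text + 0.01 * rng.normal(size=(4, 8))   # images that nearly match their captions
shuffled_images = aligned_images[[1, 2, 3, 0]]           # every pair deliberately mismatched

loss_aligned = contrastive_loss(aligned_images, text)
loss_shuffled = contrastive_loss(shuffled_images, text)
print(loss_aligned, loss_shuffled)
```

The loss is low when matched pairs score highest in their row and column, and high when the pairing is scrambled. Training pushes the encoders toward the first situation.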

The result of CLIP training was a model that had learned a joint representation of images and text — a shared embedding space in which images and their descriptions were close together, while images and unrelated text were far apart. This joint representation was the technical foundation for everything that followed: CLIP’s image-text embedding space was the bridge that allowed language and vision to be connected.

CLIP demonstrated capabilities that were immediately impressive. Given a text description, CLIP could identify which of a set of images best matched the description — without any task-specific training. Given an image, CLIP could identify which of a set of text descriptions best matched it. The zero-shot capabilities were evidence that CLIP had learned genuine semantic connections between language and visual content, not just surface-level pattern matching.
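Zero-shot matching of this kind reduces to a nearest-neighbour lookup in the shared embedding space. The sketch below assumes the embeddings have already been produced by trained encoders. The three-dimensional vectors are made up for illustration, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np

def cosine_similarity(vec, matrix):
    """Cosine similarity between one vector and each row of a matrix."""
    vec = vec / np.linalg.norm(vec)
    matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix @ vec

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose text embedding lies closest to the image embedding."""
    scores = cosine_similarity(image_emb, label_embs)
    return labels[int(np.argmax(scores))], scores

# Made-up embeddings standing in for real encoder outputs.
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
label_embs = np.array([[1.0, 0.1, 0.0],
                       [0.1, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
image_emb = np.array([0.9, 0.2, 0.1])   # lies closest to the "dog" direction

best_label, scores = zero_shot_classify(image_emb, label_embs, labels)
print(best_label)
```

No classifier is trained for the task. Changing the set of candidate labels only means embedding a different list of descriptions.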

CLIP was also significant because of how it could be combined with other systems. A system that could generate images from random noise and that was connected to CLIP could be guided by text descriptions — it could explore the space of possible images, using CLIP to evaluate which images matched a given description, and gradually converge on images that scored well on the CLIP evaluation.

This was the basic principle behind the first generation of text-to-image generation systems.
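The explore-and-score loop can be illustrated with a toy search. This sketch makes strong simplifying assumptions: the "image" is just a small vector, the scoring function is a plain cosine similarity standing in for a CLIP score, and the search is random hill climbing rather than the gradient-based optimisation that practical systems relied on. All names are hypothetical.

```python
import numpy as np

def score(candidate, text_emb):
    """Stand-in for a CLIP score: cosine similarity between candidate and prompt."""
    return float(candidate @ text_emb /
                 (np.linalg.norm(candidate) * np.linalg.norm(text_emb)))

def guided_search(text_emb, steps=500, step_size=0.1, seed=0):
    """Start from noise and keep only the random changes that raise the score."""
    rng = np.random.default_rng(seed)
    candidate = rng.normal(size=text_emb.shape)      # start from random noise
    history = [score(candidate, text_emb)]
    for _ in range(steps):
        proposal = candidate + step_size * rng.normal(size=text_emb.shape)
        if score(proposal, text_emb) > history[-1]:  # accept improvements only
            candidate = proposal
        history.append(score(candidate, text_emb))
    return candidate, history

prompt_emb = np.array([1.0, 0.0, 0.5, -0.5])   # hypothetical embedding of a text prompt
best, history = guided_search(prompt_emb)
print(history[0], history[-1])
```

The score never decreases, so the candidate drifts toward whatever the scoring model considers a good match for the prompt. This is also why the quality of the scoring model bounds the quality of the result.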

Definition

CLIP (Contrastive Language-Image Pre-training) — A 2021 OpenAI model trained on approximately 400 million image-text pairs scraped from the internet, learning to identify which text descriptions matched which images (and vice versa). The training objective was contrastive learning: given a batch of images and descriptions, produce high similarity scores for matched pairs and low scores for mismatched pairs. CLIP learned a joint representation — a shared embedding space in which images and their descriptions were close together, unrelated content far apart. This shared embedding space was the technical foundation of the multimodal revolution: it was the bridge that allowed language and vision to be connected, and the mechanism by which text-to-image generators could be guided by text prompts.


DALL-E and the Text-to-Image Breakthrough

In January 2021, OpenAI published DALL-E — a system that could generate images from text descriptions. The name was a portmanteau of Salvador Dalí and Pixar’s WALL-E — a reference to surrealist art and to the animated robot, capturing something about the combination of creativity and mechanical precision that the system embodied.

DALL-E released
Date:
January 5, 2021
Location:
OpenAI, San Francisco, California
Significance:
OpenAI published DALL-E — a system built using a modified version of the GPT-3 architecture, trained on a large dataset of image-text pairs, that could generate images from text descriptions, including descriptions of scenes that had never been photographed or painted (“a snail made of harp,” “an avocado armchair,” “a cat sitting in a throne wearing a crown”). The name was a portmanteau of Salvador Dalí and Pixar’s WALL-E.
Outcome:
DALL-E’s release in a limited research preview established the text-to-image generation paradigm. The images were coherent at the level of overall composition but often failed at fine-grained details (hands with too many fingers, illegible text, inconsistent backgrounds). The capability itself — generating images from natural language descriptions — was genuinely unprecedented and clearly consequential.

DALL-E was built using a modified version of the GPT-3 architecture, trained on a large dataset of image-text pairs. It could generate images that corresponded to text descriptions, including descriptions of scenes that had never been photographed or painted — “a snail made of harp,” “an avocado armchair,” “a cat sitting in a throne wearing a crown.”

The images DALL-E generated were impressive but imperfect. They were coherent at the level of overall composition — the avocado armchair looked approximately like what you might imagine an avocado armchair to look like — but they often failed at fine-grained details: hands with too many fingers, text that was illegible, backgrounds that were inconsistent.

But the capability itself — generating images from natural language descriptions — was genuinely unprecedented and clearly consequential. The ability to specify an image in language and receive a visual realisation of that specification collapsed a specific barrier between human imagination and visual output that had previously required artistic skill or access to a skilled artist.

DALL-E’s release in a limited research preview attracted significant attention and established the text-to-image generation paradigm. The question was not whether this capability was interesting or commercially valuable — it clearly was — but whether it would improve rapidly enough to be practically useful and what the social implications would be.


Stable Diffusion and the Democratisation of Image Generation

The democratisation of text-to-image generation — its shift from a research preview accessible to a limited number of researchers to a publicly available tool used by millions — came through a different technical approach: diffusion models.

Definition

Diffusion model — A class of generative model that learns to reverse a process of gradual image degradation. The training process gradually adds noise to real images — turning a photograph of a dog into random noise through a series of steps — and trains a neural network to reverse this process, removing noise at each step. Once trained, the network can start from random noise and generate a coherent image by repeatedly removing noise according to what it has learned about the structure of real images. Diffusion models underlie Stable Diffusion, DALL-E 2 and 3, Midjourney, and Sora — and replaced the earlier GPT-style autoregressive approach used by the original DALL-E.
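The noising half of this process is simple enough to write down directly. The sketch below assumes a linear noise schedule with 1,000 steps, a common choice in the research literature but not necessarily what any of the named products use, and it applies the noise to a random toy array rather than a real image. It shows only the forward (degrading) direction. The learned network that reverses it is omitted.

```python
import numpy as np

def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Return the fraction of original signal variance remaining after each step."""
    betas = np.linspace(beta_start, beta_end, num_steps)   # noise added per step
    return np.cumprod(1.0 - betas)

def add_noise(x0, t, alpha_bar, rng):
    """Jump straight to step t: blend the clean input with Gaussian noise."""
    noise = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return xt, noise   # a denoising network would be trained to predict `noise`

def correlation(a, b):
    """Pearson correlation between two arrays, flattened."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

rng = np.random.default_rng(0)
alpha_bar = make_schedule()
image = rng.uniform(-1.0, 1.0, size=(8, 8))   # toy stand-in for a real image

slightly_noisy, _ = add_noise(image, 10, alpha_bar, rng)
almost_pure_noise, _ = add_noise(image, 999, alpha_bar, rng)

print(correlation(image, slightly_noisy), correlation(image, almost_pure_noise))
```

Early steps leave the input almost intact, and by the final step essentially nothing of it survives. Generation runs the learned reversal from that pure-noise end back toward a clean image.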

Stability AI released Stable Diffusion in August 2022 as an open-source model — its code, weights, and training details were published publicly for anyone to download, run, and modify. The release was deliberate and consequential: by making a high-quality text-to-image model freely available, Stability AI ensured that the technology would spread rapidly and would be modified, extended, and deployed by a large and diverse community.

Stable Diffusion released
Date:
August 22, 2022
Location:
Stability AI (with the CompVis group at Ludwig Maximilian University of Munich and Runway), London / Munich
Significance:
Stability AI released Stable Diffusion as an open-source text-to-image diffusion model — its code, weights, and training details were published publicly for anyone to download, run, and modify. The release was a deliberate democratisation play: by making a high-quality model freely available, Stability AI ensured the technology would spread rapidly and be modified, extended, and deployed by a large and diverse community.
Outcome:
Within weeks, Stable Diffusion had been downloaded millions of times, deployed in dozens of applications and websites, fine-tuned for specific styles and domains, and integrated into creative workflows by artists, designers, and developers worldwide. The ecosystem grew faster than any previous open-source AI project — and triggered the copyright crisis that defined the multimodal era’s legal landscape.

The spread was rapid. Within weeks of the Stable Diffusion release, the model had been downloaded millions of times, deployed in dozens of applications and websites, fine-tuned for specific styles and domains, and integrated into creative workflows by artists, designers, and developers worldwide. The ecosystem of applications built on Stable Diffusion — from straightforward text-to-image interfaces to fine-tuned models for specific art styles to inpainting and outpainting tools for editing existing images — grew faster than any previous open-source AI project.

Midjourney, a commercial image generation service that launched in March 2022 and rapidly improved, offered a different model: high-quality image generation through a Discord interface, with a subscription model that funded continued development. Midjourney’s specific aesthetic — high-production-value images with a particular quality of light and composition that users described as “cinematic” or “epic” — attracted a large community of users who valued the specific visual quality it produced.

The combination of Stable Diffusion’s openness and Midjourney’s quality created a text-to-image ecosystem that, by late 2022, had millions of users generating images in vast numbers. The creative possibilities were extraordinary. The social and legal implications were alarming.


The Copyright Crisis

The text-to-image revolution immediately triggered a set of legal and ethical questions about copyright that the AI field and the legal system were not prepared to answer.

The systems that generated images were trained on enormous datasets of images scraped from the internet. These images were the work of photographers, illustrators, painters, designers, and other visual creators who had not consented to the use of their work in AI training and who were not compensated for it. The legal status of this training — whether training AI systems on copyrighted images without consent or compensation constituted copyright infringement — was not established in law.

In the United States, the legal framework for copyright does not clearly address AI training. The existing doctrine of “fair use” allows for limited use of copyrighted works without permission in certain circumstances, including for transformative purposes and for research. Whether AI training on copyrighted images constituted a transformative use — making something qualitatively different from the training data — was contested, and the courts were beginning to work through the question.

The Rutkowski case was one of the most visible examples of the specific concern that artists had. When people prompted image generation systems with “in the style of Greg Rutkowski,” they were using the artist’s name as a token — as a specification of an aesthetic that the system had learned from his work. The system was not copying Rutkowski’s specific images; it was generating new images in his aesthetic style. Whether this constituted copyright infringement was legally uncertain. Whether it was fair to Rutkowski — whether it used his work without compensation to compete with him in his market — was less uncertain.

Greg Rutkowski
Born:
1988
Died:
Living
Nationality:
Polish
Role:
Digital fantasy artist and illustrator
Known for:
Detailed, luminous fantasy artwork populated with heroes and monsters in the style of the great nineteenth-century Romantic painters. His work, widely shared online, became one of the most popular style prompts in early AI image generation communities (“in the style of Greg Rutkowski”). In April 2022, Rutkowski spoke publicly about finding images he had not created on AI generation websites — “My name became a token in a machine. I didn’t sign up for that” — and became one of the most visible artist voices raising concerns about consent, compensation, and the use of creative work in AI training data.

Class action lawsuits were filed by artists against Stability AI and other image generation companies, alleging copyright infringement in the training process. Getty Images sued Stability AI for training on its image library without a licence. The legal proceedings were still unfolding, and the legal framework for AI training data was still being developed.

The copyright crisis illustrated a broader challenge: the legal and regulatory frameworks for AI had been developed for a world in which human creators and AI tools were clearly distinct. The multimodal revolution — in which AI systems could generate work that was trained on and stylistically indistinguishable from human creative work — challenged those frameworks in ways that required either new legal interpretations or new legislation.

Artist class-action lawsuits against AI image generators
Date:
January 13, 2023 (filed)
Location:
United States District Court, Northern District of California
Significance:
A group of artists including Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a class-action copyright infringement lawsuit against Stability AI, Midjourney, and DeviantArt, alleging that the companies had trained their image generation systems on billions of copyrighted images scraped from the internet without consent, compensation, or licence. Getty Images filed a parallel suit against Stability AI in the UK and US for training on its image library.
Outcome:
The legal proceedings are still unfolding. The cases collectively established the legal landscape in which the multimodal revolution’s copyright crisis will be adjudicated, and exposed the broader challenge that existing copyright frameworks were developed for a world in which human creators and AI tools were clearly distinct.

Warning

The copyright crisis exposed a broader challenge: legal and regulatory frameworks for AI had been developed for a world in which human creators and AI tools were clearly distinct. The multimodal revolution — in which AI systems could generate work that was trained on, and stylistically indistinguishable from, human creative work — challenged those frameworks in ways that required either new legal interpretations or new legislation. The specific questions — whether training on copyrighted images without consent or compensation constitutes infringement; whether “in the style of [artist]” generation is fair use; whether the resulting models compete with the artists whose work trained them — remain contested in courts as of 2024.


GPT-4V: Language and Vision United

When OpenAI announced GPT-4 in March 2023 with vision capabilities, meaning the ability to process image inputs alongside text, it represented a qualitatively different kind of multimodal AI from the text-to-image systems. (Image input reached most users later in 2023, under the name GPT-4V.) Those systems translated text into images; GPT-4V could understand images, answer questions about them, describe them, and reason about their content in the context of a broader conversation.

The vision capabilities of GPT-4V were immediately practically useful in ways that the text-to-image systems were not primarily designed for. A user could upload a photograph of a damaged car and ask for advice about what repairs might be needed. A student could upload a diagram from a textbook and ask for an explanation. A developer could upload a screenshot of a bug and ask for help debugging. A doctor could upload an image of a skin lesion and ask what it might indicate.

Each of these applications was a specific, practical use case that had not been feasible with text-only AI systems. The vision capabilities transformed GPT-4 from a text assistant to something more like a general-purpose perceptual assistant — a system that could engage with the world in the way that humans engaged with it, through a combination of vision and language.

OpenAI has not published the architectural details of GPT-4V. The general approach in vision-language models of this kind is to integrate a vision encoder, a module that converts image inputs into the same embedding space that the language model uses, with the language model itself. Training connects visual representations to linguistic representations in ways that allow the model to reason about images in language and to understand the relationship between the two.
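The idea of a vision encoder feeding a language model can be illustrated with a toy sketch. It is a generic picture of how image inputs can be turned into token-like vectors, not a description of GPT-4V's actual design. The patch size, embedding width, and random projection matrix are all assumptions made for the example, and in a real model the projection would be learned.

```python
import numpy as np

def patchify(image, patch):
    """Cut an (H, W, C) image into flattened, non-overlapping patch vectors."""
    h, w, c = image.shape
    rows, cols = h // patch, w // patch
    image = image[:rows * patch, :cols * patch]          # drop any ragged border
    patches = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch * patch * c)

rng = np.random.default_rng(0)
embed_dim = 16                                            # assumed embedding width
image = rng.uniform(size=(32, 32, 3))                     # toy RGB image
projection = rng.normal(size=(8 * 8 * 3, embed_dim)) * 0.02   # random here, learned in practice
text_tokens = rng.normal(size=(5, embed_dim))             # stand-in for embedded prompt tokens

image_tokens = patchify(image, patch=8) @ projection      # one vector per image patch
sequence = np.concatenate([image_tokens, text_tokens], axis=0)  # single sequence for the model
print(image_tokens.shape, sequence.shape)
```

Once image patches and words are vectors of the same width, the language model can attend over both in one sequence, which is what lets a question about an image be answered in text.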

Several specific capabilities of GPT-4V were particularly striking:

Optical character recognition. GPT-4V could read text in images with high accuracy — not just clean printed text but handwritten notes, text in photographs, and text in complex visual contexts like screenshots of applications.

Visual reasoning. GPT-4V could answer questions that required reasoning about spatial relationships, quantities, and properties visible in images. “How many of these are blue?” “Is the shelf above or below the window?” “Which of these graphs shows the highest peak in 2019?” — these required combining visual understanding with reasoning in ways that text-only models could not.

Meme and cultural understanding. GPT-4V demonstrated understanding of internet meme formats and cultural references embedded in images — a capability that was both impressive as a demonstration of cultural understanding and concerning as evidence of the breadth of training data the system had processed.

Medical imaging. Early evaluations suggested that GPT-4V could provide useful analysis of medical images — X-rays, dermatological photographs, ophthalmological images — at a level that could assist non-specialist clinicians in underserved settings. The potential for AI-assisted diagnosis in low-resource healthcare settings, where specialist expertise was unavailable, was one of the most promising applications.

Info

GPT-4V’s vision capabilities were immediately practically useful in ways that text-to-image systems were not designed for. Four striking capabilities:

  1. Optical character recognition — high accuracy on printed text, handwritten notes, text in photographs, text in complex visual contexts like screenshots
  2. Visual reasoning — answering questions about spatial relationships, quantities, and properties visible in images (“How many of these are blue?” “Is the shelf above or below the window?”)
  3. Meme and cultural understanding — recognising internet meme formats and cultural references embedded in images (impressive as cultural understanding; concerning as evidence of training-data breadth)
  4. Medical imaging — useful analysis of X-rays, dermatological photographs, ophthalmological images, with potential to assist non-specialist clinicians in underserved settings

Sora: Video Generation and the Moving Image

In February 2024, OpenAI announced Sora — a text-to-video generation system that could produce minute-long videos from text descriptions. The announcement was accompanied by demonstration videos that were, by any previous standard of AI video generation, extraordinary.

OpenAI announces Sora
Date:
February 15, 2024
Location:
OpenAI, San Francisco, California
Significance:
OpenAI announced Sora — a text-to-video generation system that could produce minute-long videos from text descriptions, accompanied by demonstration videos of realistic scenes (a woman walking through a city, a dog splashing in water, an underwater world) generated entirely from text. The videos were smooth, the motion natural, the scenes coherent over time — visually compelling in ways that previous AI video systems had not been.
Outcome:
Sora was not released publicly when announced — OpenAI indicated it was being made available to a small group of creative professionals for testing and evaluation, with a broader release to follow. The deliberate pacing reflected both the deployment challenges of video generation at scale and the recognition that the social implications of widely available, high-quality video generation were significant — particularly for misinformation and deepfakes.

Sora’s demonstrations showed videos of realistic scenes (a woman walking through a city, a dog splashing in water, an underwater world) generated entirely by an AI system from text descriptions. The videos were smooth, the motion was natural, the scenes were coherent over time. They were not perfect: careful inspection revealed specific tells, particularly in the handling of complex physical interactions and in the behaviour of objects and people at a distance. But they were visually compelling in ways that previous AI video systems had not been.

The technical approach underlying Sora was a diffusion model extended to video — rather than learning to generate static images by removing noise, Sora learned to generate sequences of frames by removing noise from video. The model incorporated the transformer architecture that had proved so effective for language and for image generation, adapted for the specific challenges of video: the temporal coherence required for smooth motion, the spatial consistency required for scenes to look like they were photographed from a consistent perspective, and the physical plausibility required for objects and people to behave as they do in the real world.
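One way a transformer can be given video as input is to cut the clip into blocks that span both space and time, so that each block becomes one token. OpenAI's technical report describes Sora as operating on spacetime patches of a compressed representation, but the details are unpublished. The sketch below is a toy version on raw pixels, with an assumed clip size and patch shape and a hypothetical function name.

```python
import numpy as np

def spacetime_patches(video, t_patch, s_patch):
    """Split a (T, H, W, C) video into flattened blocks covering
    t_patch frames and s_patch x s_patch pixels each."""
    t, h, w, c = video.shape
    nt, nh, nw = t // t_patch, h // s_patch, w // s_patch
    video = video[:nt * t_patch, :nh * s_patch, :nw * s_patch]   # drop ragged edges
    blocks = video.reshape(nt, t_patch, nh, s_patch, nw, s_patch, c)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)   # group block indices, then block contents
    return blocks.reshape(nt * nh * nw, t_patch * s_patch * s_patch * c)

rng = np.random.default_rng(0)
video = rng.uniform(size=(8, 16, 16, 3))             # toy clip: 8 frames of 16x16 RGB
tokens = spacetime_patches(video, t_patch=2, s_patch=4)
print(tokens.shape)
```

Because each token covers several consecutive frames, the model sees motion inside a token as well as across tokens, which is one route to the temporal coherence the paragraph above describes.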

Sora was not released publicly when announced — OpenAI indicated it was being made available to a small group of creative professionals for testing and evaluation, with a broader release to follow. The deliberate pacing of the release reflected both the genuine challenges of deploying a video generation system at scale and the recognition that the social implications of widely available, high-quality video generation were significant.

The specific implications of high-quality text-to-video generation for misinformation were immediately noted. If AI systems could generate video of any scene from a text description — realistic, smooth, and visually compelling video — the ability to create synthetic media of events that did not occur was significantly advanced. The “deepfake” problem, which had been a concern with AI-generated video for several years, was potentially approaching a qualitative threshold where synthetic video would be indistinguishable from real video for many practical purposes.

Definition

Deepfake — AI-generated synthetic media (video, audio, or image) that depicts a real person saying or doing things they did not actually say or do. The term, coined in 2017 around a Reddit community using face-swapping AI to place celebrities’ faces onto other people’s bodies, originally referred to non-consensual intimate imagery but has expanded to cover any AI-generated synthetic media. The deepfake problem escalated qualitatively with the multimodal revolution: as generation systems improved from requiring significant expertise and computing resources to being accessible through consumer interfaces, the ability to create convincing synthetic media of any person doing anything became democratised — with corresponding implications for political misinformation, fraud, and non-consensual intimate imagery.


The Voice Revolution: AI Speaks, Sings, and Clones

Alongside the visual multimodal capabilities, the voice dimension of multimodal AI underwent its own revolution in 2023 and 2024.

Text-to-speech technology had existed for decades, and the robotic voice quality of early TTS systems was a familiar cultural reference. The deep learning revolution had significantly improved TTS quality — neural TTS systems like Google’s WaveNet, published in 2016, produced voice quality that was dramatically more natural than previous systems. But the voices were still recognisably synthetic, and the style was generic.

The new generation of voice AI systems, including ElevenLabs, OpenAI’s Voice Engine, and Microsoft’s VALL-E, represented a qualitative shift. These systems could clone the voice of a specific person from a few seconds of audio and generate new speech in that voice. Many listeners could not tell the cloned voices from the originals, even when listening carefully under laboratory conditions.

The specific applications of voice cloning were both genuinely valuable and genuinely alarming.

Valuable applications. People who expected to lose their voices, for example to a degenerative disease, could have their voices cloned beforehand, giving them access to personalised speech synthesis that sounded like themselves. Audiobooks and podcasts could be produced in the voice of the author without requiring the author’s time. Accessibility tools, such as screen readers for people with visual impairments, could speak in natural-sounding voices.

Alarming applications. Voice cloning could be used to impersonate specific individuals — to generate audio of politicians saying things they had not said, of celebrities endorsing products they had not endorsed, of family members making urgent requests they had not made. The use of AI-generated voice cloning in fraud — generating audio of a parent asking a child for urgent financial help, or of a CEO approving an unusual financial transaction — was reported in several high-profile cases.

The voice cloning technology was also implicated in the spread of AI-generated music. Tools that could clone the voices of specific musicians and generate new songs in their style proliferated rapidly, raising the same copyright questions that had been raised by visual AI generation.

Warning

Voice cloning technology is genuinely dual-use in a way that text and image generation are not, because the human voice carries unique evidentiary weight: we are accustomed to believing that “I heard them say it” is strong evidence. The valuable applications are real — voice restoration for people with degenerative conditions, audiobook production in the author’s voice, accessibility tools with natural-sounding voices. The alarming applications are equally real — impersonation of politicians, fraudulent “your CEO approved this wire transfer” calls, synthetic evidence in legal contexts. The fraud cases reported in 2023 — voice clones of family members requesting urgent financial help, of CEOs approving unusual transactions — demonstrated that the evidentiary weight of voice had become a vulnerability.


The Integration: GPT-4o and Beyond

The trajectory of multimodal AI development pointed toward increasing integration — toward systems that could not just process multiple modalities but could seamlessly integrate them in real-time interaction. The announcement of GPT-4o (“o” for “omni”) in May 2024 represented a significant step in this direction.

GPT-4o was capable of understanding and generating text, audio, and images in an integrated real-time interaction. A user could speak to GPT-4o and receive a spoken response; could show it an image and ask questions about it in real time; could use it in a video conversation where it responded to what it saw and heard. The response latency was low enough to feel like a natural conversation, rather than the pause-and-respond pattern of previous AI interactions.

The demonstrations of GPT-4o showed capabilities that approached what science fiction had imagined for conversational AI — a system that could engage with a human’s full communicative context, understanding not just words but tone, emotion, and visual context. The system could detect that a speaker was nervous from their vocal quality and respond with appropriate encouragement. It could look at a math problem written on a piece of paper and provide step-by-step guidance. It could engage in multilingual conversation, switching between languages fluidly.

The emotional responsiveness of GPT-4o’s voice modality was particularly striking — and particularly concerning. The system modulated its tone, expressed enthusiasm, sympathy, and concern in ways that were designed to be engaging and appropriate. The question of whether this emotional responsiveness constituted genuine emotional intelligence or sophisticated mimicry of emotional responsiveness — and what the difference was, if any — was one that the multimodal revolution made more acute.

OpenAI announces GPT-4o
Date: May 13, 2024
Location: OpenAI, San Francisco, California
Significance: OpenAI announced GPT-4o (“o” for “omni”), a natively multimodal model capable of understanding and generating text, audio, and images in integrated real-time interaction. A user could speak to GPT-4o and receive a spoken response; could show it an image and ask questions about it in real time; could use it in a video conversation where it responded to what it saw and heard. The response latency was low enough to feel like a natural conversation.
Outcome: GPT-4o’s demonstrations approached what science fiction had imagined for conversational AI: a system that engaged with a human’s full communicative context, understanding not just words but tone, emotion, and visual context. The emotional responsiveness of the voice modality was both striking and concerning: the question of whether it constituted genuine emotional intelligence or sophisticated mimicry, and what the difference was, if any, became more acute.

The Creative Disruption: A New Creative Economy

The multimodal revolution changed the creative economy in specific and significant ways that were still playing out as of 2024.

Stock photography and illustration. The market for stock photography and stock illustration, meaning the licensing of generic images for commercial use, contracted significantly as AI-generated alternatives became available at a fraction of the cost of licensed photographs or illustrations. Companies that had previously purchased stock images began generating images with AI tools, reducing their need for licensed content.

Video production. AI tools for video production — for generating specific scenes, for producing B-roll, for creating demonstrations and explanatory videos — entered professional video production workflows. Independent video creators could produce content with higher production values than previously possible without a production team.

Music production. AI tools for generating music in specific styles, for separating and remixing audio tracks, and for cloning specific musical aesthetics entered music production. Professional musicians both used these tools and worried about them. The concern was partly economic (reduced demand for certain types of music production) and partly cultural: what did it mean for a piece of music to be the creative expression of an individual when the primary creative role was writing a text prompt?

New creative forms. Alongside the displacement of existing creative markets, the multimodal revolution also enabled new creative forms that had not been possible before. Interactive narrative experiences that combined AI-generated text, image, and audio in real time. Personalised creative works that could be generated on demand to specific specifications. Collaborative workflows between human creators and AI tools that produced work that neither could have produced alone.

The creative disruption was uneven: some categories of creative work were displaced, some were transformed, and some new ones emerged. The emerging creative economy was different from the one it replaced: more accessible to non-professionals, more challenging for professionals who had relied on specific technical skills, more complex in its questions about attribution and authorship.

Info

The multimodal revolution’s creative-economy impact split into four categories:

  • Stock photography and illustration — market contracted as AI-generated alternatives became available at a fraction of the cost of licensed content
  • Video production — AI tools for scene generation, B-roll, and explanatory videos entered professional workflows; independent creators could produce higher production values without a production team
  • Music production — AI tools for style generation, audio separation, and voice cloning entered both professional and amateur workflows, raising both economic and cultural concerns
  • New creative forms — interactive narrative experiences, personalised on-demand creative works, human-AI collaborative workflows that neither could produce alone


The Deepfake Problem: Trust and Verification

The multimodal revolution created a specific and significant challenge to the existing mechanisms of media trust and verification. In a world where AI systems could generate realistic images, audio, and video of any person saying or doing anything, the evidentiary value of media was fundamentally challenged.

The term “deepfake” had been coined in 2017 to describe AI-generated synthetic media, and the technology had been a source of concern for several years before the multimodal revolution. But the previous generation of deepfake technology had required significant expertise and computing resources to use; the new generation was accessible through consumer interfaces to anyone.

The specific harms from synthetic media were varied and documented.

Political misinformation. AI-generated audio and video of politicians making statements they had not made circulated in several countries’ elections, attempting to influence voters with fabricated evidence of candidates’ positions or behaviour.

Non-consensual intimate imagery. AI-generated explicit images and videos of real people, primarily women and girls, created without their consent constituted a new form of sexual abuse that was spreading rapidly through online platforms. The women affected reported severe psychological harm, and some had been driven from public life.

Fraud and scams. Voice cloning was used in financial scams — generating audio of trusted individuals making urgent requests for money. Video deepfakes were used in impersonation fraud targeting businesses.

The response to the synthetic media challenge was developing along several lines. Technical detection tools — systems trained to identify AI-generated content — were developed and deployed by social media platforms and by independent researchers. Content authentication standards — cryptographic provenance information attached to genuine images and audio — were being developed by the Coalition for Content Provenance and Authenticity (C2PA). Legal frameworks criminalising non-consensual synthetic intimate imagery were enacted in several jurisdictions.

None of these responses was adequate to the scale and pace of the problem. Detection tools were consistently outpaced by the improving quality of generation systems. Authentication standards required adoption across the media ecosystem that had not yet occurred. Legal frameworks, where enacted, were difficult to enforce against the distribution of synthetic media across international platforms.
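
The idea behind content authentication can be shown with a toy sketch. The example below is not the C2PA format or any real provenance library. It assumes a hypothetical publisher that signs a hash of the media bytes with a shared secret (`SIGNING_KEY`, `make_manifest`, and `verify_manifest` are illustrative names), whereas real provenance systems rely on public-key certificates and embedded manifests. It shows only the core property: any change to the media or to the claimed creator breaks verification.

```python
import hashlib
import hmac

# Hypothetical signing key for illustration only. Real provenance
# systems use public-key certificates, not a shared secret.
SIGNING_KEY = b"example-publisher-key"


def make_manifest(media_bytes: bytes, creator: str) -> dict:
    """Bind a creator name to the exact bytes of a media file."""
    digest = hashlib.sha256(media_bytes).hexdigest()
    payload = f"{creator}:{digest}".encode()
    signature = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return {"creator": creator, "sha256": digest, "signature": signature}


def verify_manifest(media_bytes: bytes, manifest: dict) -> bool:
    """Return True only if the media and the creator claim are both unaltered."""
    digest = hashlib.sha256(media_bytes).hexdigest()
    if digest != manifest["sha256"]:
        return False  # the media bytes were changed after signing
    payload = f"{manifest['creator']}:{digest}".encode()
    expected = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, manifest["signature"])


original = b"raw image bytes"
manifest = make_manifest(original, "Example Newsroom")

print(verify_manifest(original, manifest))             # True
print(verify_manifest(original + b"edit", manifest))   # False
```

The sketch also makes the adoption problem visible: a missing manifest proves nothing, so authentication only helps once most genuine media carries one.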

Definition

Liar’s dividend — The phenomenon, identified by Bobby Chesney and Danielle Citron in 2019, in which the existence of convincing deepfakes degrades the evidentiary value of all media — including genuine media. Once AI-generated video is common and convincing, any genuine video that a politician finds inconvenient can be dismissed as a deepfake, even when it is real. The liar’s dividend is in some ways more dangerous than the deepfakes themselves: the spread of convincing synthetic media does not need to convince most people that specific fakes are real; it only needs to make plausible the claim that real media might be fake, undermining the shared epistemic foundation on which democratic deliberation depends.


The Scientific Applications: Multimodal AI for Research

Beyond the cultural and social dimensions of the multimodal revolution, the scientific applications of multimodal AI — the use of systems that could process and generate multiple types of data for scientific discovery — were opening new possibilities for research.

Medical imaging. The integration of vision and language in multimodal systems created new possibilities for medical imaging AI. A system that could not only analyse an image but describe its findings in clinical language, answer questions about what it saw, and integrate the visual information with clinical notes and patient history was substantially more useful than a system that could only output a classification.

Drug discovery. The integration of molecular structure, protein sequence, and natural language in multimodal systems created new possibilities for drug discovery — systems that could understand the relationship between a drug’s chemical structure, its predicted mechanism of action, and the clinical language used to describe its effects.

Climate science. The analysis of satellite imagery in combination with textual data — climate models, scientific literature, historical records — created new possibilities for understanding climate change and its effects. Systems that could identify and interpret patterns in satellite imagery of glaciers, forests, or agricultural land in the context of broader scientific understanding were more powerful tools for environmental monitoring than either visual or textual systems alone.

Materials discovery. The integration of molecular structure data, experimental results, and scientific literature in multimodal systems supported the search for new materials with specific properties — combining the pattern recognition capabilities of visual systems with the language understanding and reasoning capabilities of language systems.


The Philosophical Question: What Does Multimodality Reveal?

The multimodal revolution raises a specific philosophical question about the nature of intelligence and representation: what does it mean that AI systems can translate between modalities — that they can see an image and describe it, hear a sound and transcribe it, read a description and visualise it?

For humans, the different senses are integrated in the brain through a process that is still not fully understood but that involves specific neural mechanisms for combining information from different sensory modalities. The integration is not just additive — seeing and hearing simultaneously is not simply the sum of seeing and hearing separately. The integration produces specific effects: the McGurk effect, in which what you hear is changed by what you see the speaker doing; synesthesia, in which sensory experiences in one modality trigger experiences in another; the cocktail party effect, in which you can focus on a specific conversation in a noisy environment by using visual cues alongside auditory ones.

The multimodal AI systems of the current era do not replicate this biological integration — they connect modalities through learned shared representations rather than through the biological mechanisms that integrate human perception. But they do produce a kind of integration: the ability to reason about one modality using concepts from another, to translate between modalities, to understand the semantic content that is shared across different sensory representations of the same thing.

This integration is evidence that the representations learned by large AI models are not modality-specific — they are representations of meaning that can be encoded and decoded in multiple sensory modalities. The word “apple” and the image of an apple and the sound of someone saying “apple” all point to something that is not the word, the image, or the sound — something that is the concept itself.

Whether AI systems have access to this concept in the same way that humans do — whether the shared representation space of multimodal AI corresponds to genuine conceptual understanding or to sophisticated statistical correlation — is the fundamental question that the multimodal revolution makes more acute.
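
The notion of a shared representation space can be illustrated with a toy sketch. The vectors below are hand-picked for illustration and do not come from any real model; in an actual multimodal system they would be produced by learned encoders, one per modality. The names `embeddings`, `cosine`, and `nearest` are hypothetical. The point is only that once text, image, and audio map into the same space, “translating” between modalities reduces to finding the nearest neighbour.

```python
import math

# Hand-picked toy vectors, NOT outputs of a real model. Each key is
# (modality, concept); a real system would compute these with encoders.
embeddings = {
    ("text", "apple"): [0.9, 0.1, 0.0],
    ("image", "apple"): [0.8, 0.2, 0.1],
    ("audio", "apple"): [0.85, 0.15, 0.05],
    ("text", "car"): [0.0, 0.2, 0.9],
    ("image", "car"): [0.1, 0.1, 0.95],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query_key, target_modality):
    """Find the item in target_modality closest to the query item."""
    query = embeddings[query_key]
    candidates = [k for k in embeddings if k[0] == target_modality]
    return max(candidates, key=lambda k: cosine(query, embeddings[k]))


# The spoken word "apple" lands nearest the image of an apple.
print(nearest(("audio", "apple"), "image"))  # ('image', 'apple')
print(nearest(("text", "car"), "image"))     # ('image', 'car')
```

Nothing in this sketch settles the question above: a nearest-neighbour lookup works the same way whether the underlying vectors reflect conceptual understanding or statistical correlation.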


The Year Ahead: What Multimodal AI Will Do Next

The multimodal revolution was still accelerating as of 2024. The capabilities that had been demonstrated — real-time audio-visual conversation, text-to-video generation, joint image-text reasoning — were impressive but not yet at the limits of what the approach could produce.

Several directions were being actively developed.

Improved video generation. Sora’s initial capabilities were impressive but limited: the videos were visually compelling but still showed telltale signs of machine generation, and the system could not produce video of arbitrary length or complexity. Improving video generation was a major research and commercial priority, and the gap between the best generated video and conventionally produced footage was likely to close.

Real-time multimodal agents. Systems that could perceive, understand, and act in the world in real time — receiving visual, audio, and text inputs from their environment and producing outputs that affected that environment — were a major research direction. These systems, which would combine multimodal perception with the agentic capabilities being developed in language models, would have capabilities significantly beyond current systems.

Embodied AI. The connection of multimodal perception to physical bodies — robots that could see, hear, and act in the physical world — was a direction that the robotics and AI research communities were pursuing with increasing urgency. The multimodal capabilities developed in language models could, if connected to appropriate physical platforms, enable robots that were significantly more capable and more adaptable than current systems.

The trajectory was clear: towards AI systems that could engage with the full richness of human perceptual experience — seeing, hearing, speaking, creating — and do so in integrated, real-time ways that went beyond the channel-specific AI systems of the previous decade.

What this trajectory would produce — what the world would look like when AI could engage as fully and as flexibly with human perceptual experience as humans themselves — was the question that the multimodal moment had raised, without yet answering.


Further Reading

  • “Learning Transferable Visual Models from Natural Language Supervision” by Radford, Kim, Hallacy et al. (2021) — The CLIP paper. The technical foundation of the multimodal revolution.
  • “High-Resolution Image Synthesis with Latent Diffusion Models” by Rombach et al. (2022) — The paper underlying Stable Diffusion. Essential for understanding how text-to-image generation works technically.
  • “Generative Agents: Interactive Simulacra of Human Behavior” by Park et al. (2023) — An exploration of what becomes possible when language and action are combined in AI agents.
  • “The Machine Stops” by E.M. Forster (1909) — Forster’s 1909 short story about a world in which humans communicate entirely through a vast technological system. A remarkably prescient meditation on what multimodal AI is building toward, written more than a century before it.
  • “Reality+: Virtual Worlds and the Problems of Philosophy” by David Chalmers (2022) — Chalmers’s philosophical exploration of virtual reality and digital experience provides essential context for thinking about what it means for AI systems to engage with the perceptual world.

Event 22: The AI Election: When Synthetic Media Met Democracy

The story of the 2024 election cycle — the deepfakes, the voice clones, the AI-generated misinformation, and the unprecedented challenge to democratic process posed by AI systems that could generate compelling synthetic media of any candidate saying anything. The first election in which AI was a central battleground.

