Is training AI with books innovation—or is it theft?
In “Fair Use and the Origin of AI Training,” Edward Lee presents a powerful legal and historical defense, under the fair use doctrine, of using copyrighted works to develop artificial intelligence models. Forthcoming in the Houston Law Review, the article traces the origins of AI training practices to academic research labs, where early breakthroughs in scaling datasets catalyzed today’s AI revolution. Lee argues that this historical context matters deeply in evaluating whether AI training constitutes fair use, particularly when such use began in non-commercial university research settings with transformative objectives like developing new technologies.
Lee contends that courts assessing the 41+ ongoing copyright lawsuits against AI companies must adopt a use-by-use analysis as mandated by the Supreme Court in Warhol v. Goldsmith. He suggests that training AI models—especially when the models generate new, non-infringing works—may qualify as a legitimate, transformative purpose. However, courts must distinguish between uses made in training AI and those made by public users of generative models who might produce infringing outputs. The article warns that blanket assumptions of infringement could imperil technological innovation, particularly in an era of geopolitical AI competition.
If the use of copyrighted materials by university researchers to develop AI models is copyright infringement and not fair use, then, a fortiori, the fair use defense of AI companies must fail. Conversely, if the courts find that such university-based AI training has a legitimate fair use purpose, then courts should reject broad arguments that use of copyrighted works in AI training by companies cannot serve a fair use purpose.
A major portion of the article critiques the U.S. Copyright Office’s 2025 pre-publication report for endorsing a novel and potentially unconstitutional “market dilution” theory. According to Lee, this theory wrongly treats AI-generated works as harmful to the market for original copyrighted works, even when the outputs are not infringing. He asserts that this view not only overextends the scope of copyright law but also risks chilling innovation by conflating market competition with infringement. He emphasizes that copyright must be balanced with the Progress Clause’s mandate to promote the progress of science and useful arts.
Ultimately, Lee argues that courts must recognize technological innovation, especially in AI, as no less vital to progress than creative works. He appeals to precedents like Google v. Oracle and Sony v. Universal to frame AI training as part of a broader American tradition of technological fair use. In a world where the AI arms race is intensifying, Lee asserts that applying the fair use doctrine judiciously is critical not just for legal coherence but for national competitiveness and public benefit.