A central theme on which I focus is how copyright interacts with and overlays on artificial intelligence. One of the central questions currently occupying the headlines is whether the use of copyrighted works in training generative AI (“GenAI”) models is infringement or fair use. One of the earliest and best attempts at answering this question is this article by Profs. Mark Lemley and Bryan Casey: Fair Learning, which appeared in the Texas Law Review.
It is important to note that Fair Learning is concerned with the use of copyrighted materials in training all kinds of artificial intelligence / machine learning models, not just GenAI models. In fact, Prof. Lemley is a co-author of Foundation Models and Fair Use, which looks at this fair use question specifically for foundation models; I will review that article in a separate post.
As the authors note early in the article, the specter of statutory damages makes the answer to this question existential; e.g., in 2019 Cox Communications was ordered to pay $1 billion for its failure to prevent its customers from using Cox’s internet service to illegally pirate more than 10,000 songs.1 Today’s GenAI models are trained on orders of magnitude more songs, pushing us toward the absurdity of the LimeWire file-sharing litigation, in which some extrapolated that the labels could be seeking $75 trillion in damages. Just last month, the New York Times claimed in its suit against OpenAI that the GenAI company should pay “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.”2
The authors begin by acknowledging that “there are reasons to think that courts in the future won’t be so sympathetic to machine copying” before arguing that “ML systems should generally be able to use databases for training, whether or not the contents of that database are copyrighted.” According to the authors, copying should be “fair use” because (1) broad access to training data sets supports the good public policy goal of encouraging people to compile new databases and make them open for public scrutiny and innovation and (2) ML systems’ use of training data is transformative (as that term is used in “fair use” analysis).
The authors also point out that “there is no plausible option simply to license all of the underlying photographs, videos, audio files, or texts” on which these ML models are trained. This sentiment is echoed by GenAI companies such as OpenAI, which argued to the UK Parliament that
“Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials. Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”
This perspective is reiterated by VC firm Andreessen Horowitz, which told the US Copyright Office that “the only practical way generative AI models can exist is if they can be trained on an almost unimaginably massive amount of content, much of which (because of the ease with which copyright protection can be obtained) will be subject to copyright.”3 More on point, a16z notes
Some of the most powerful AI models in existence today were trained on an enormous cross-section of all of the publicly available information ever published on the internet—that is, billions of pieces of text from millions of individual websites. For a very significant portion of those works, it is essentially impossible to identify who the relevant rights holders are…4
I’m familiar with this argument, having made it myself on behalf of digital music providers and their struggles to license musical works in the absence of a centralized database of copyright ownership information. While I think this argument may persuade some sympathetic readers as a matter of public policy, I’m unaware of any case in which a defendant’s infringement was excused because the court found it “too hard” to find the copyright owner and secure a license. I also find it interesting that a16z uses the Music Modernization Act (MMA), which I negotiated on behalf of the digital music providers as CEO of the Digital Media Association (DiMA), to argue that a statutory licensing scheme for GenAI training data is unworkable. One of the unique aspects of the MMA is that it shifted the cost of administering the statutory license onto the licensees. It appears that a16z’s main objection is having to pay for the administration of the license, just as it objects to paying the license fee itself.
The question of whether the use of copyrighted works in training AI / ML models constitutes “fair use” is governed in the US by § 107 of the Copyright Act, which provides
Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include—
(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
(2) the nature of the copyrighted work;
(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) the effect of the use upon the potential market for or value of the copyrighted work.
The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.
While most scholars (and courts) focus on applying and weighing the four enumerated factors, it is important to consider § 107 in its entirety. That is, § 107 specifies the kinds of end uses for which it might be “fair use” to use copyrighted works without prior authorization; i.e., “for purposes such as criticism, comment, news reporting, teaching …, scholarship, or research.” By way of example, the use of a copyrighted song for parody might be fair use (e.g., 2 Live Crew’s parody of Roy Orbison’s Pretty Woman) but not for satire (e.g., John McCain’s use of Jackson Browne’s Running on Empty during his presidential campaign).
Although they ultimately conclude that the use of copyrighted works in training data is (or ought to be) fair use, Profs. Lemley and Casey lay out why “courts won’t find AI learning to be fair.” First, training AI systems involves copying the entire copyrighted work without alteration (see the third factor of § 107). Because the entire work is copied, courts will likely focus on how transformative the AI model’s output is when considering a fair use defense. For example, an AI model that copies images of stop signs to create an autonomous driving application that can recognize stop signs might be transformative enough to make the copying of the images fair use.5 Conversely, an AI model that copies sound recordings of a particular artist to create a GenAI music product that can generate songs in the style of that artist probably isn’t fair use.6
Second, (almost) all of these AI models are commercial products, which can weigh against fair use under both the first and fourth factors; i.e., commercial use and the effect on the market for the original copyrighted work. The authors suggest that some courts may view the unlicensed use of copyrighted works as some kind of “free riding” by the AI companies, specifically with respect to GenAI applications:
These ML systems––virtually all of which are trained on copyrighted works––have produced writing that’s difficult to distinguish from real journalists, painted in the style of celebrated masters, and even created stock photos comparable to those of professional photographers. Many of these efforts have been so convincing that professionals and opinion columnists alike have begun to openly worry about artificial intelligence as a genuine competitive threat. These concerns, in turn, have triggered criticisms from thought leaders, advocates, academics, and professionals worried that technology companies producing these technologies may be free riding on the labor of creative professionals. Critics of such practices have compared leading ML companies to “robber barons” siphoning up valuable IP. … It is not at all clear these practices are reflective of the kind of exploitation of original expression that copyright law is meant to guard against. But what is clear is that this emerging view of the equities, too, could have consequences for how courts consider the competitive and substitutive implications of a permissive fair use doctrine.
Third, the authors point to a broader public policy rationale against fair use: the negative uses to which these technologies may be put, such as to “spread propaganda, facilitate dystopian surveillance, invade personal and sexual privacy, [and] perpetuate bias…” While these concerns aren’t explicitly incorporated into § 107’s four factors, they could nonetheless influence a court’s analysis or weighing of those factors against a finding of fair use.
Finally, the authors point out the potential existential risk to AI companies if they (and the a16z crowd) are wrong about the applicability of fair use in these cases. Prophetically, the authors note: “An ML that copies millions of works could potentially face hundreds of billions of dollars in statutory damages. And with thousands or even hundreds of thousands of different copyright owners, the risk of multiple opportunistic suits is high.”
Having articulated why a court might not find fair use, the authors turn to their normative assessment of why courts should find this kind of use fair. First, society benefits from allowing ML systems to compile the best possible databases and to open them for public scrutiny and for open AI. “Broad access to training datasets will make AI better, safer, and fairer.” This strikes me as an odd justification, since none of the big AI companies make their datasets publicly available. Even OpenAI, which started out as a non-profit that was supposed to be transparent (hence the “open” in OpenAI), shrouds its datasets in secrecy.
Second, “given the large number of works an AI training data set needs to use and the fact that thousands, if not millions, of different people own those works, AI companies can’t simply license all the underlying photographs or texts for the new use. So allowing a copyright claim is tantamount to saying, not that copyright owners will get paid, but that no one will get the benefit of this new use because it will be impractical to make that use at all.” As noted above, this may be a good policy argument, but I don’t see how this could (or should) inform a court’s fair use analysis. That is, I don’t think “but it would be really hard for us to license all of this stuff” fits into one of § 107’s four factors.
Third, “providing ML systems with broader access to data actually helps to mitigate some of the very negative outcomes that critics of ML systems fear.” I tend to agree with this assessment and it is echoed in Prof. Lobel’s Law of AI for Good, which I reviewed here. But, again, this is a policy argument, not an application of the fair use factors.
Finally, the authors claim that the strongest argument for fair use is that “ML systems generally don’t want to copy the copyrighted work for any copyright-related reason. ML systems generally copy works, not to get access to their creative expression (the part of the work the law protects), but to get access to the uncopyrightable parts of the work—the ideas, facts, and linguistic structure of the works.” As noted above, I agree that certain AI models, such as an autonomous driving application, appear to sufficiently “transform” the underlying copyrighted work as to qualify as fair use. And facts aren’t protectable at all: the Supreme Court held in Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340 (1991), that the alphabetical ordering of facts (in that case subscribers’ phone numbers) was not protectable by copyright.
But I am puzzled by the reference to “linguistic structure of the works.” I keep coming across this idea that the way in which words are ordered in sentences, paragraphs, chapters, books, etc. is a “fact” in the same way as Feist’s alphabetizing of subscribers. For example, in its memorandum in support of its motion to dismiss the copyright infringement claims brought by Sarah Silverman and her co-plaintiffs, OpenAI argues:
Copyright protects the particular way an author expresses an idea—not the underlying idea itself, facts embodied within the author’s articulated message, or other building blocks of creative expression. 17 U.S.C. § 102(b); Feist Publications Inc. v. Rural Tel. Serv. Co., 499 U.S. 340, 344–45 (1991). As a result, “every idea, theory, and fact in a copyrighted work becomes instantly available for public exploitation at the moment of publication.” Eldred v. Ashcroft, 537 U.S. 186, 219 (2003). Accordingly, while an author may register a copyright in her book, the “statistical information” pertaining to “word frequencies, syntactic patterns, and thematic markers” in that book are beyond the scope of copyright protection. Google Books, 804 F.3d at 209; see also id. at 220 (tool that extracts “information about the original [work]” does not infringe because it does not “replicat[e] protected expression”). So too for facts conveyed by the book, or its high-level plot or themes.3 These principles are essential to copyright’s overall goal of “[i]ntellectual (and artistic) progress,” which is “possible only if each author [can] build[] on the work of others.” Nash v. CBS, Inc., 899 F.2d 1537, 1540 (7th Cir. 1990) (Easterbrook, J.).
This got me thinking about Universal Music Group’s recent submission in response to the US Copyright Office’s inquiry into GenAI. In its submission discussing the copyrightability of generative AI outputs, UMG uses the analogy of “autocomplete” to describe what generative AI systems “do.” “When generative AI responds to a human text prompt, it draws upon its analysis of those statistical patterns to create something that matches or “autocompletes” the human prompt, but it is the AI system that creates the output, not a human being. Whatever the nature of the output — text, music, image — its expressive content was created and arranged by a computer making statistical predictions rather than creative choices.” (p. 75). This reminds me of a Patton Oswalt joke about his phone’s autocomplete. Oswalt says that when he types “I” into his phone, it autocompletes the next word as “hate.” Oswalt laughs that he must type “I hate” into his phone so often that it now thinks every time he types “I” the most likely next word is “hate.”
I think this joke goes a long way toward disproving the notion that training GenAI merely involves machines pulling unprotectable “facts” from the training data (instead of copying the entire creative expression of the work). Apparently, Patton Oswalt has a lot of strong feelings and routinely expresses those strong feelings in sentences beginning with the words “I hate.” When Patton types “I” into his phone and it autocompletes “hate” as the next word, his phone is not expressing its feelings. Instead, his phone has identified that, probabilistically, when Patton types “I” the most likely word to follow is “hate.” But his phone could not “know” this unless it had copied and trained on the entirety of Patton’s expressive writing.
Said differently, Patton’s phone is indifferent to what words Patton types or in what order. His phone is merely building a mathematical model of the frequency with which certain words appear in sequence; i.e., his phone is recognizing patterns of expression. But Patton’s phone is able to compute that mathematical model by analyzing the entirety of Patton’s expressive writing, not merely by pulling out individual facts or the “linguistic structure of the works.”
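For illustration only, here is a minimal sketch of the kind of next-word frequency (bigram) model the joke describes. The corpus and the function name are hypothetical, and real phone keyboards use far more sophisticated models, but the underlying point is the same: the counts can only be computed by ingesting the full sequence of words as the author wrote them.

```python
from collections import Counter, defaultdict

# Hypothetical corpus standing in for "everything Patton has typed."
corpus = (
    "I hate traffic. I hate airports. I hate waiting in line. "
    "I love my daughter. I hate slow walkers."
)

words = corpus.lower().replace(".", "").split()

# Count how often each word follows each other word (a bigram model).
# The model stores no feelings, only co-occurrence counts, but it must
# read every word in sequence to compute those counts.
next_word_counts = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    next_word_counts[current][following] += 1

def autocomplete(word: str) -> str:
    """Return the statistically most likely next word after `word`."""
    candidates = next_word_counts[word.lower()]
    return candidates.most_common(1)[0][0] if candidates else ""

print(autocomplete("I"))  # -> "hate", because that pairing dominates the corpus
```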
Profs. Manheim and Atik’s white paper AI Outputs and the First Amendment comes at this issue from a different angle.
These authors note that
LLMs, such as ChatGPT and Bard, use a process called “autoregressive language modeling” to construct their outputs.
“This process involves feeding the model a sequence of tokens, and then predicting the next token in the sequence. The model is trained on a massive dataset of text and code, so it learns the statistical relationships between words, phrases, and sentences. This allows the model to generate coherent and contextually relevant responses when given a prompt or query.”
In other words, the LLM output is merely a sequence of tokens (phrases, words or parts of words) that are chained together in a manner that statistically resembles other text in the machine’s data set.7
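As a rough illustration of what “autoregressive” means here, the toy sketch below (my own simplification, not drawn from the white paper; production LLMs use neural networks over subword tokens rather than raw frequency tables) feeds each newly predicted token back into the model to predict the next one, chaining the sequence together:

```python
import random
from collections import Counter, defaultdict

# Toy training text standing in for the "massive dataset of text and code."
training_text = "the cat sat on the mat and the cat slept on the mat".split()

# Learn simple statistical relationships: which token tends to follow which.
transitions = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    transitions[current][nxt] += 1

def generate(prompt: str, length: int = 8) -> str:
    """Autoregressively extend the prompt one token at a time."""
    tokens = prompt.split()
    for _ in range(length):
        candidates = transitions[tokens[-1]]
        if not candidates:
            break
        # Sample the next token in proportion to how often it followed
        # the previous token in the training data.
        choices, counts = zip(*candidates.items())
        tokens.append(random.choices(choices, weights=counts, k=1)[0])
    return " ".join(tokens)

print(generate("the"))  # e.g. "the cat sat on the mat and the cat"
```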
These authors reiterate the necessity of GenAI outputs being “contextually relevant”: “Maximizing user engagement for a 13-year-old girl worried about her social status requires different statistically relevant selection and sequencing of tokens than that which would radicalize an ISIS sympathizer.” Said differently, LLMs could not create “coherent and contextually relevant responses” unless they processed the creative choices that hundreds or thousands of human authors made in selecting and sequencing words to express their thoughts to 13-year-old girls or ISIS sympathizers. Therefore, I think describing this processing of thousands of authors’ creative choices as “copying facts” is a gross mischaracterization.
The authors of Fair Learning appear to understand this dilemma:
The problem ML systems face is the inability to capture the unprotectable parts to use for training without making a rote copy of the protectable ones. Systems want access to the unprotectable bits of creative works, but the way they get that access is necessarily copying the whole thing. Unlike humans, they can’t read to learn or observe the idea in a painting or song without making a copy of the whole thing in their training data set.
The remainder of the article is a normative description of the authors’ opinions about why the use of copyrighted works in AI / ML training data ought to be fair use.
Chris Eggertsen, Labels & Publishers Win $1 Billion Piracy Lawsuit Against Cox Communications, Billboard. Cox Communications was found liable for infringement of more than 10,000 musical works, with $1 billion in statutory damages awarded to plaintiffs Sony Music, Universal Music Group, and Warner Music. https://www.billboard.com/pro/cox-1-billion-piracy-lawsuit-labels-publishers.
Michael M. Grynbaum and Ryan Mac, The Times Sues OpenAI and Microsoft Over A.I. Use of Copyrighted Work. https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
a16z also argues that “over the last decade or more, there has been an enormous amount of investment—billions and billions of dollars—in the development of AI technologies, premised on an understanding that, under current copyright law, any copying necessary to extract statistical facts is permitted. A change in this regime will significantly disrupt settled expectations in this area.” Based on the multiple lawsuits filed by copyright owners since the launch of these GenAI models, I would venture to say (pun intended) that GenAI companies’ alleged “understanding” of copyright law was not widely shared.
a16z makes the somewhat nonsensical argument that there should not be a licensing regime for training data because it would be too expensive for the licensees; i.e., “a staggering quantity of individual works is required to train AI models. That means that, under any licensing framework that provided for more than negligible payment to individual rights holders, AI developers would be liable for tens or hundreds of billions of dollars a year in royalty payments.”
“ML systems generally copy works, not to get access to their creative expression (the part of the work the law protects), but to get access to the uncopyrightable parts of the work—the ideas, facts, and linguistic structure of the works. A self-driving car, for instance, doesn’t care about the composition or lighting of your photograph, or indeed about what you were likely actually intending to depict in your photo. It cares about the fact that there’s a stop sign in it.”
“When learning is done to copy expression, for example, by training an ML system to make a song in the style of Ariana Grande, the question of fair use can—and should—become much tougher.” There are some really interesting questions around the so-called “sound alike” exception under § 114(b), which provides “The exclusive rights of the owner of copyright in a sound recording … do not extend to the making or duplication of another sound recording that consists entirely of an independent fixation of other sounds, even though such sounds imitate or simulate those in the copyrighted sound recording.” That is, the outputs of a GenAI model that creates independently fixed sounds that are exact copies of existing sound recordings might not be infringement under § 114(b), but surely the training of that model using existing sound recordings would be infringement (i.e., not fair use).
The internal quote is actually generated by Bard in response to the authors’ prompt “how do you and other large language models develop their output strings?”