It is common for there to be tension between copyright-intensive industries and new technologies. For example, in a 1906 article American composer John Philip Sousa claimed that the introduction of recorded music would lead to “a marked deterioration in American music and musical taste, an interruption in the musical development of the country, and a host of other injuries to music in its artistic manifestations, by virtue – or rather by vice – of the multiplication of the various music-reproducing machines.” In 1982, Jack Valenti of the Motion Picture Association of America testified before Congress that “the VCR is to the American film producer and the American public as the Boston Strangler is to the woman home alone.”
So it comes as no surprise that the entertainment industry, in particular, has reacted strongly—and negatively—to the introduction of generative artificial intelligence (GenAI). Previously I have reviewed law review articles looking at whether the fair use defense would succeed in the various copyright infringement cases brought by copyright owners against GenAI companies, including its application in general (https://harrisonpllc.substack.com/p/fair-learning) and to foundational models (https://harrisonpllc.substack.com/p/foundation-models-and-fair-use). Prominent copyright scholar Prof. Samuelson has just posted a draft article entitled “Fair Use Defenses in Disruptive Technology Cases” that reviews past fights between copyright-dependent incumbents and new technologies to better inform how we think about the application of fair use in the context of GenAI.
Xerox
Non-Commercial Use
Prof. Samuelson begins with the introduction of the Xerox machine. To younger people, the importance of the Xerox machine may be difficult to comprehend. In particular, students found the ability to make a copy of a chapter in a book or an article—instead of having to carry around the entire book or journal—very convenient. Book and journal publishers, in contrast, were less sanguine about Xerox.
Williams & Wilkins, the publisher of several academic journals, sued the National Institutes of Health (NIH) and National Library of Medicine (NLM) for making photocopies of journal articles. Discovery revealed that copying journal articles was widespread: in one year the NIH satisfied 85,744 researcher requests for copies of scientific articles, resulting in more than 930,000 copied pages. The court of claims found that the NIH and NLM’s copying of journal articles was fair use. Williams & Wilkins Co. v. United States, 487 F.2d 1345 (Ct. Cl. 1973). The court noted that the NIH and NLM were non-profit entities engaged in scientific research; remember that among the examples of the types of uses that may constitute fair use Sec. 107 specifies “criticism, comment, news reporting, teaching, scholarship, and research.” The court also found that Williams & Wilkins had not proven market harm.
Williams & Wilkins appealed to the Supreme Court and argued that “library photocopying of articles posed a serious threat to the future existence of scientific and technical journals” because “copies were effective substitutes for the originals.” The Solicitor General argued that upending the “longstanding gentlemen’s agreement for scholarly research copying” could harm future medical research and that, in any event, this question of library copying was for Congress—not the courts—to decide. The Supreme Court split 4-4, which left the court of claims’ decision in place but, because there was no majority position, did not affirmatively establish that the NIH and NLM’s use was fair.
Congress more or less resolved this particular question in the 1976 Copyright Act, which codified the fair use doctrine we still use today (including the distinction between commercial and non-commercial uses) and specifically provided for multiple copies for educational use in classrooms.
In some ways, the Williams & Wilkins case seems like a big strategic mistake by the publishers: (1) only one journal publisher sued (instead of multiple plaintiffs or a class action), (2) only eight articles were alleged to have been infringed (instead of the hundreds or thousands of articles that were copied), and (3) the defendants were government-created entities engaged in socially beneficial research.
Commercial Use
A second lawsuit, this one brought on behalf of 83 journal publishers, targeted Texaco, which employed 500 research scientists who were photocopying journal articles (instead of buying the journals in which the articles appeared).1 Focused primarily on the (potential) effects on the market for the journal publishers’ articles, the Second Circuit found that Texaco’s photocopying was not fair use.
Edge Case I — Coursepacks for College, but Sold for Profit
Yet another case challenged photocopying, this time with a group of academic book publishers suing Kinko’s for producing “coursepacks,” compilations of copyrighted portions of textbooks sold to students for use in their college courses.2 Recall that one of the enumerated examples of fair use is making “multiple copies for classroom use.” But Kinko’s was charging the students for the coursepacks, which is a commercial use. The court found that Kinko’s was committing infringement (i.e., its coursepacks were not fair use) because (a) photocopies are not transformative and (b) the coursepacks negatively impacted the market for the publishers’ textbooks.
Edge Case II — Digital Coursepacks by Professors
It may seem that if creating physical coursepacks is copyright infringement then doing the same thing, but wholly online, would also be infringement. According to the 11th Circuit, you would be wrong. Three academic book publishers sued Georgia State University because it allowed professors to post digital copies of book chapters (most likely photocopied from the original book) onto “electronic course reserves” or digital coursepacks. After losing at the district court, the publishers appealed to the 11th Circuit.3 In finding the digital coursepacks to be fair use, the appeals court found that (unlike Kinko’s, which is a for-profit enterprise) in this case the copying was being done by professors—more directly tied to the “multiple copies for classroom use” that Sec. 107 allows. In addition, the court decided that the correct view of the market effects was “whether economic losses from copying were substantial enough to impair the publishers’ incentives, not whether challenged uses might cause some loss of revenues.”
Betamax
Home Video Recording Part I
The Jack Valenti quote above is connected to the advent of home video recording devices. While the rival VHS format ultimately became the default for home video recording, it was Sony and its Betamax device that were sued for contributory and vicarious infringement by two movie studios.4 The case made it all the way to the Supreme Court, which held that
“private noncommercial home copying of Universal movies for time-shift purposes was fair use and that Betamax machines were staple items of commerce which consumers should be able to buy for their non-infringing uses, thus establishing a safe harbor from copyright liability for technologies capable of substantial non-infringing uses.”
The Court found that home recording was non-commercial and that the Betamax was capable of substantial non-infringing uses. The Court also stressed that home recording was simply a kind of “time-shifting” that allowed consumers to watch programs at times convenient to them (instead of live on TV).
In a bit of irony, the Supreme Court’s decision paved the way for the VCR rental market (remember Blockbuster Video???). In 1986, just two years after the Sony case, VCR movie rentals were already a $4 billion a year industry, which was more than the total domestic box office gross for that year.
Home Video Recording Part II
Book and journal publishers had lost the first photocopying case, then won the Kinko’s case, and then lost the Georgia State case. The entertainment industry lost the Betamax case, but that didn’t stop them from suing over “remote-server digital video recording” (RS-DVR) devices—or digital VCRs (DVRs).5 Cablevision, a cable operator in the New York City metro area, developed a digital VCR that enabled subscribers to record shows on servers stored at Cablevision’s headend (rather than on a box in the subscriber’s home).
This is one of my favorite copyright cases and I used it when I taught Music Law at the University of Texas and UC Berkeley law schools as an introduction to how technology and copyright intersect. First, the Second Circuit found that the digital “buffer” copies (i.e., temporary copies made to facilitate recording and playback) on the RS-DVRs were not “copies” as that term is used in the Copyright Act—and, therefore, not infringing. Next, the Second Circuit had to determine whether the playback of the recorded shows was a “public performance” as that term is used in the Copyright Act; i.e., if the performance was “to the public” then it could be infringing, but if it was a private performance, then it was not infringing. Because Cablevision had created separate virtual drives for each subscriber, Cablevision made a separate copy of each show for each subscriber, which the Second Circuit found to be dispositive: because “each RS-DVR transmission is made to a given subscriber using a copy made by that subscriber, we conclude that such a transmission is not ‘to the public.’”
The Internet
File Sharing
Photocopiers and VCRs seem like child’s play compared to the advent of peer-to-peer (P2P) file sharing services such as Napster. P2P systems enabled strangers to share exact (lossless) digital copies (at first primarily music, because the files were relatively small, but eventually including full-length feature films) across the Internet. Ironically, the success of P2P and Napster, which decimated the music industry for nearly two decades, was facilitated by the introduction of the CD, which was an enormous cash cow for record labels in the decade following its introduction.
Just like the movie studios and Sony Betamax, the record labels sued Napster for contributory and vicarious infringement.6 The Ninth Circuit rejected Napster’s argument that its technology was capable of substantial non-infringing uses and found Napster liable for infringement. After the Napster decision, other P2P services such as Grokster evolved into a decentralized architecture, in which users connected directly to each other’s computers (rather than accessing Napster’s server to search for and download files). While the Ninth Circuit was initially persuaded that this decentralized architecture distinguished Grokster from Napster, the Supreme Court found Grokster liable for infringement.7
Search Engines and Webcrawling
Google was sued because the webcrawling software that enabled its search engine made copies of web pages and stored them in caches (to improve search response times).8 The district court found Google’s copying “transformative” because Google made and stored copies for a different purpose than the author’s. In addition, the plaintiff was unable to show any harm to the market for his writing.
Thumbnails
A couple of lawsuits challenged the creation of “thumbnails” by search engines. Thumbnails are small, low-resolution copies of images found on a website that search engines such as Google will display in search results to inform the user of the type of content that will be found at the link. The courts in those cases found that making thumbnails was fair use because the thumbnails were transformative and did not negatively affect the market for the originals—e.g., they were so small and low-resolution that they could not serve as a substitute for the original.
Mass Digitization
In 2004 Google began a massive book digitization project known as Google Book Search in which it digitized millions of books, the vast majority of which were in-copyright. Unsurprisingly, book publishers sued Google in 2005 (the Google Books case).9 The publishers argued that Google’s scanning of entire books and displaying snippets in response to search queries was not fair use because (1) scanning an entire book was not transformative, (2) Google’s use was commercial, and (3) Google Book Search would damage the market for books. The Second Circuit rejected these arguments, finding that Google’s scanning was transformative because
Google’s “making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them,” and the same was true “of Google’s provision of the snippet function.” Nor did Google’s commercial motivation for scanning the books undermine its fair use defense. The court was unpersuaded that “Google has usurped their opportunity to access paid and unpaid licensing markets for substantially the same functions that Google provides,” saying that the argument failed “in part because the licensing markets in fact involve very different functions than those that Google provides, and in part because an author’s derivative rights do not include an exclusive right to supply information (of the sort provided by Google) about [their] works.”
Because I think the Google Books case will feature prominently in the litigation over GenAI, it is instructive to look at how the Second Circuit addressed the display of snippets of books. As Prof. Samuelson writes,
While acknowledging that snippets do display some expressive parts of the books in the corpus, it observed that “[s]nippet view, at best and after a large commitment of manpower, produces discontinuous, tiny fragments, amounting in the aggregate to no more than 16% of a book.” Because of this, snippet view “does not threaten the rights holders with any significant harm to the value of their copyrights or diminish their harvest of copyright revenue.” Even if Google’s supply of snippets does cause some lost sales of books, “the possibility, or even the probability or certainty, of some loss of sales does not suffice to make the copy an effectively competing substitute that would tilt the weighty fourth factor in favor of the rights holder in the original. There must be a meaningful or significant effect ‘upon the potential market for or value of the copyrighted work.’
Jumping ahead, the GenAI companies who have been sued for copyright infringement will undoubtedly focus on this language to argue that their algorithms (usually) produce far less than 16% of any one of the works on which they are trained. In most cases, a plaintiff could not identify any portion of her work copied directly into the output of a GenAI program. The GenAI companies will argue that their models are even more removed from the original than the snippets in the Google Books case are from the original books. Similarly, the GenAI companies will argue that any potentially negative impact on the market for the original works on which the GenAI models were trained is even more tenuously connected to any GenAI outputs than the snippets in Google Book Search are to the potential sale of books.
GenAI Cases
Derivative Works Infringement
After reviewing some of the historical precedents, Prof. Samuelson turns to the current batch of copyright infringement cases involving GenAI companies. She begins her analysis of the merits of these copyright infringement claims by looking at the outputs of GenAI models.
Generative AI is disruptive of copyright markets in a different way than in past cases, such as Sony, Napster, and Grokster, because generative AI systems do not generally produce exact copies of expression in pre-existing works. Outputs of the generative AI systems in litigation are overwhelmingly not substantially similar in expression to the works on which the models were trained. These systems can produce outputs faster and more cheaply than human authors, and the AI outputs are generally of high enough quality to be perceived as competitive in the marketplace with works of human authors. This means that there is a nontrivial potential for market substitution. There are, however, precedents upholding fair use defenses when defendants’ ultimately non-infringing products compete with the plaintiff’s products.
By claiming that GenAI models do not produce copies of works on which they were trained, Prof. Samuelson assumes away the numerous examples where GenAI models do output exact copies of the works on which they’ve been trained (e.g., Is ChatGPT a Copyright Infringement Machine?). The fact that GenAI models do produce exact copies of copyrighted works—even if only infrequently—surely complicates her subsequent claim that the outputs of GenAI models are generally not infringing derivative works. So, while I agree that in order to be infringing a derivative work must first be substantially similar to the original, it does not follow that all outputs from GenAI models are sufficiently dissimilar to avoid infringement liability—particularly when we have evidence of GenAI outputting exact copies. Said differently, with respect to the derivative works claims, the plaintiffs may be able to provide sufficient evidence that particular outputs are infringing derivative works, which is all the plaintiffs need to prove. Prof. Samuelson admits as much: “It is, of course, possible for outputs of generative AI to infringe derivative work rights. This is particularly likely if users prompt AI systems to produce images of popular characters, such as Snoopy, Mickey Mouse, or other content the model may have ‘memorized’ because of multiple duplicates in the training dataset.”
Training Data as Infringement
Prof. Samuelson begins by noting that the compiling of training datasets and actually training the GenAI models are distinct actions, which may be performed by separate entities. Although she doesn’t say it explicitly, I take it the argument would be something like “Company X made copies of copyrighted works and then made a library of such works available to Company Y, which used the library to train a GenAI model. Company X is liable for infringement because its use was not transformative, but Company Y’s use was fair because its use was transformative.”
Getting to the crux of these cases, Prof. Samuelson goes on to state that
Generative AI defendants will almost certainly rely on the Second Circuit’s GBS decision, which involved a commercial firm’s ingestion of in-copyright works for the purpose of creating a large database of works enabling the developer to produce outputs in response to user queries. Because generative AI uses in-copyright works for very different purposes than the originals’ purposes, as in Field, iParadigms, GBS and the search engine cases, the AI training uses are likely to be considered transformative.
To train a GenAI model, the training data is “disassembled” or “tokenized”; words and punctuation are broken down into smaller groups of characters or “tokens.” For example, the sentence “My name is Chris” might be tokenized into “my” and “name” and “is” and “Chris” before being processed. Tokenization enables the GenAI models to recognize patterns in how human authors organize thoughts into sentences and paragraphs and to recognize how those patterns vary.
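The tokenization step described above can be sketched in a few lines of code. This is an illustrative toy only: real GenAI models use learned subword tokenizers (such as byte-pair encoding) rather than simple regex splitting, and the integer token IDs shown here are hypothetical, not those of any actual model.

```python
import re

# Toy illustration of tokenization; real models use learned subword
# tokenizers (e.g., byte-pair encoding), and the IDs below are hypothetical.
def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens (a stand-in for BPE)."""
    return re.findall(r"\w+|[^\w\s]", text)

def to_ids(tokens: list[str], vocab: dict) -> list[int]:
    """Map each token to an integer ID, growing the vocabulary as needed."""
    return [vocab.setdefault(t, len(vocab)) for t in tokens]

vocab = {}
tokens = tokenize("My name is Chris")
print(tokens)                  # ['My', 'name', 'is', 'Chris']
print(to_ids(tokens, vocab))   # [0, 1, 2, 3]
```

The model never “sees” the sentence as prose; it sees only the resulting sequence of integer IDs, which is the sense in which the training process operates on works as data.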
To further this point Prof. Samuelson quotes the testimony of Christopher Callison-Burch before the House Judiciary Committee: “generative AI systems ‘gain some capacity for common sense reasoning, which allows them to understand basic cause-and-effect relationships, infer missing information, and make simple deductions.’” I think this kind of anthropomorphization is dangerous and could lead courts to make “bad” decisions. GenAI models do not have “common sense” as the ordinary lay person uses that term.10 GenAI models have statistical models that predict what token is most likely to follow the prior token, given the prompt and all of the preceding tokens already delivered.
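To make the “statistical model” point concrete, here is a purely illustrative sketch of next-token prediction by frequency counting over a tiny made-up corpus. Real models use neural networks with billions of learned parameters, not literal frequency tables; the corpus sentences are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny invented corpus for illustration only.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]

# Count which token follows each token across the corpus.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the token that most frequently followed `token` in the corpus."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' ("the" is followed by "cat" most often)
print(predict_next("sat"))  # 'on'
```

The point of the sketch is that nothing here “understands” cats or mats; the output is driven entirely by the statistics of what followed what in the training text, which is the claim in the paragraph above.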
Prof. Samuelson argues that this tokenization is important because the scope of copyright protection does not extend to “statistical information” such as “word frequencies, syntactic patterns, and thematic markers.” (quoting the Google Books case). I think this quote is a bit misleading, however, when one looks at the next sentence in the court’s decision:
The district court gave as an example “track[ing] the frequency of references to the United States as a single entity (‘the United States is’) versus references to the United States in the plural (‘the United States are’) and how that usage has changed over time.”
Further quoting Google Books, Prof. Samuelson states that “Courts have not found processing in-copyright works to extract ‘information about the original [work]’ to infringe because this extraction does not ‘replicat[e] protected expression.’” Again, however, the decision in the Google Books case doesn’t really say that. Instead, the Second Circuit stated:
the second factor [of the nature of the copyrighted work] favors fair use not because Plaintiffs' works are factual, but because the secondary use transformatively provides valuable information about the original, rather than replicating protected expression in a manner that provides a meaningful substitute for the original.
In the GenAI cases, plaintiffs’ argument is that the output of the GenAI models is a meaningful substitute for the original!
Prof. Samuelson goes on to argue that “the nature-of-the-work factor may be given little weight in the generative AI cases because those systems’ uses are non-expressive, that is, they use the works as data, not as works.” This sentence is true if it is isolated to the actual training of the GenAI model, as opposed to what the GenAI’s purpose is. In other words, the tokenization of the training data may well be non-expressive, but the purpose of tokenizing the data is so that the GenAI model can produce expressive outputs!
Prof. Samuelson next raises the dubious claim that training GenAI models on copyrighted works is fair use because licensing all of the data would be too difficult.
The class action training data market harm claims seem weak insofar as the data on which models were trained comes from copies of works available on the open internet. Moreover, no class capable of granting a license was in existence when the training datasets were created and used to train models. Nor is it possible for AI developers to license the rights to use all in-copyright works available on the Internet, given the exceptionally large number of works and copyright owners. Transaction costs would be prohibitive relative to the value of use of works as training data; moreover, no individual work may be inherently more valuable as training data than any other. What mainly makes training datasets valuable is the large number of works on which models can be trained. (emphasis added)
This argument is almost immediately refuted in her next paragraph, where Prof. Samuelson admits that Getty Images, which is pursuing copyright infringement claims against GenAI developers, is in the business of licensing its images. Also consider:
Amazon, Apple, Deezer, Google, Pandora, Spotify and Tidal were able to license the rights to distribute tens of millions of songs owned by thousands of copyright owners;
Amazon, Apple, Hulu, Netflix and others were able to license the rights to distribute thousands of movies and TV shows from hundreds of copyright owners;
Amazon, Apple, Google, Spotify and others were able to license the rights to distribute tens of thousands of books and magazines owned by hundreds of copyright owners.
There is simply no evidence that licensing the data needed to train these models was too time-consuming. It may well have been too expensive, but that is another matter entirely!!
The rest of Prof. Samuelson’s article deals with potential legislative solutions to the issues around copyright liability and training data, such as creating a collective blanket license along the lines of ASCAP, BMI, SoundExchange or the Mechanical Licensing Collective that I helped create when I negotiated the Music Modernization Act of 2018. Because such a collective has absolutely ZERO chance of becoming law, I won’t bother recounting her argument here.
American Geophysical Union v. Texaco, 60 F.3d 913 (2d Cir. 1994).
Basic Books, Inc. v. Kinko’s Graphics Corp., 758 F. Supp. 1522 (S.D.N.Y. 1991).
Cambridge Univ. Press v. Patton, 769 F.3d 1232, 1238–41 (11th Cir. 2014).
Universal City Studios, Inc. v. Sony Corp. Am., 480 F. Supp. 429 (C.D. Cal. 1979), rev’d, 659 F.2d 963 (9th Cir. 1981), rev’d, Sony Corp. Am. v. Universal City Studios, Inc., 464 U.S. 417 (1984).
Cartoon Network LP v. CSC Holdings, Inc., 536 F.3d 121 (2d Cir. 2008), cert. denied, 557 U.S. 946 (2009).
A&M Records, Inc. v. Napster, Inc., 239 F.3d 1004, 1011–12 (9th Cir. 2001).
Metro-Goldwyn-Mayer Studios, Inc. v. Grokster, Ltd., 380 F.3d 1154, 1157 (9th Cir. 2004), rev’d, 545 U.S. 913 (2005).
Field v. Google, Inc. 412 F. Supp.2d 1106 (D. Nev. 2006).
Authors Guild, Inc. v. Google, Inc., 804 F.3d 202 (2d Cir. 2015).
Webster’s defines “common sense” as “sound and prudent judgment based on a simple perception of the situation or facts.” GenAI models do not exhibit human-style judgment.