AI and the Copyright Liability Overhang: A Brief Summary of the Current State of AI-Related Copyright Litigation

Article
April 2, 2024
6 minutes

Copyright law, as it relates to Artificial Intelligence (“AI”), is at a crossroads. Rapid innovation in AI has created a great deal of uncertainty regarding whether popular AI platforms infringe copyright. More than a dozen1 suits are pending across the United States in which copyright owners are pursuing various theories of infringement against AI platforms, alleging that AI models infringe their copyrights because they are trained using copyrighted works,2 because the output of the models itself infringes,3 or both. While these suits are pending, the U.S. Copyright Office has issued a Notice of Inquiry (“NOI”) seeking comments about the collection and curation of AI dataset sources, how those datasets are used to train AI models, and whether permission from, or compensation to, copyright owners should be required when their works are used in the process.4

This legal uncertainty and potential for liability hang over the AI industry and will affect how AI is used and the terms of agreements between AI vendors, business partners, and users. This article collects the most prominent ongoing AI copyright cases and their theories, as well as recent discussions about a potential compulsory copyright licensing scheme,5 and provides considerations regarding the allocation of copyright infringement risk for companies that may be entering into agreements related to the use of AI platforms.

Plaintiffs’ Copyright Infringement Theories

Training the AI Requires Copying Copyrighted Works

Most of the plaintiffs in these cases, with the notable exception of Doe 1 v. GitHub,6 have asserted direct infringement claims alleging that the AI company in question accessed copyrighted material and made copies of it for the purpose of training a given AI model. AI models require substantial amounts of data, and some prominent AI vendors “scrape” the internet for that content, a process that requires a copy of the content to be made.7 This theory of copyright infringement has survived a 12(b)(6) motion to dismiss in one case involving “scraped” training content.8 That said, different AI models are trained with different types of data, and publicly available information about how specific models are trained is limited,9 so the success of this theory may depend on the facts of each case.

Post-Training Infringement

Plaintiffs also have offered theories of infringement based on the use of a given AI tool, apart from its training. Plaintiffs have argued that an AI model, while running, is itself an unauthorized derivative work because it draws on copyrighted materials.10 Plaintiffs also have argued that an AI model contains compressed copies of the works in its model weights, and that this unauthorized copying should be considered direct infringement even if the whole of a work is not represented in a traditional way.11 Finally, plaintiffs have argued that AI outputs can result in substantially similar works that infringe.12

Possible Defenses; Notice of Inquiry Comments

Although several of the defendants in the cases discussed above have not yet filed answers in court as of the date of publication, the likely defense theories and positions on copyright liability can be gleaned both from the more advanced cases and from AI companies’ responses13,14,15,16,17 to select questions presented by the Copyright Office in the NOI.18

In response to questions presented by the Copyright Office in its NOI, OpenAI (a defendant in a number of cases19) claimed that its AI is trained using either publicly available or licensed materials.20 Further, it said that training ChatGPT involved access to a large dataset, teaching the model to break text down into smaller units and then associate those units with linguistic features and statistical information, such as probabilistic data about word order. For example, OpenAI said that GPT-4 categorizes words into pronouns, nouns, verbs, and so on, and then uses math to predict the next word given the previous words. Therefore, it said, the structure of language itself is the only thing “stored” in the large language model (LLM), rather than the copyrightable expression in a given work.21 Similarly, Stability AI explained that its generative image AI, Stable Diffusion, breaks down images into basic structures and relative relationships between parts of images.22 Relying on this fragmentation of the scraped content, Microsoft claimed that the idea-expression dichotomy in copyright law should shield AI companies from infringement liability.23
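To illustrate this argument in concrete terms, the sketch below is a deliberately simplified, hypothetical example; it is not drawn from any party’s filings and does not reflect how GPT-4, Stable Diffusion, or any other commercial model actually works. It shows a toy “next word” predictor that records only word-order statistics from its training text, rather than the text itself.

```python
# Purely illustrative sketch: a toy next-word predictor that keeps only
# word-order statistics (bigram counts), not the training text itself.
# This is a hypothetical teaching example, not any vendor's actual method.
from collections import Counter, defaultdict


def train(corpus: list[str]) -> dict[str, Counter]:
    """Count how often each word follows each other word in the corpus."""
    stats: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            stats[prev][nxt] += 1
    return stats


def predict_next(stats: dict[str, Counter], word: str) -> str | None:
    """Return the statistically most likely next word, if any was observed."""
    followers = stats.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the rug"]
    stats = train(corpus)
    # The "model" now holds only counts such as {"sat": {"on": 2}},
    # not the sentences it was trained on.
    print(predict_next(stats, "sat"))  # -> "on"
```

Whether the far richer statistical representations inside real models retain protectable expression, rather than mere structure of this kind, is precisely what the parties dispute.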

Defendants also assert that, once trained, the models do not contain copies of the “scraped” content or of any other material used to train them.24 However, the defendants have generally not yet offered a theory as to why the copies made in the process of training the AI models or creating the training databases are not infringing.25

AI companies have also asserted that their use of copyright-protected content constitutes a transformative fair use.26 OpenAI claimed that in “rare” situations, an output could implicate the exclusive rights of a copyright holder by satisfying the substantial similarity test.27 OpenAI nonetheless maintains that the LLM has a substantial noninfringing use,28 invoking a patent law doctrine29 that has migrated into copyright law30 but has not yet been applied in the manner OpenAI suggests. Microsoft also invoked the substantial similarity test for copyright infringement when it noted that it incorporates “many safeguards” into its AI tools to prevent them from being “misused for copyright infringement,” for example, by training the AI to explicitly decline to provide excerpts from protected materials.31

Prospects for Compulsory Licensing Scheme

The Copyright Office is still pursuing its investigation into the copyright law and policy issues raised by AI and has not yet addressed the public comments. It plans to issue a report, in several sections, analyzing these issues in 2024.32 During a February 2020 meeting about AI and copyright issues, Mary Rasenberger, a former senior policy advisor for the U.S. Copyright Office who is now the Executive Director of the Authors Guild, recommended a collective licensing scheme.33 In mid-2023, the Copyright Office held two public webinars about copyright and AI, and the transcripts from those events reveal that stakeholders in attendance do not see a compulsory licensing scheme as an ideal option because it raises difficult questions of valuation.34

Allocation of Risk in AI-Related Contracts

When contracting with third parties that provide AI-based services or develop AI models, parties should consider how the contract allocates liability for potential copyright infringement. Although AI service providers largely do not indemnify users against copyright claims related to their free AI services,35 some AI vendors offer enterprise and developer customers limited indemnity protections, which are often delineated in terms of use for specific AI products.36 Such terms may narrow the scope of the indemnities with broad exclusions. For example, while vendors will generally indemnify users against third-party infringement claims related to outputs, some will not indemnify users against claims that the training data and inputs were infringing.37 Parties should therefore carefully consider the scope of indemnities when choosing between AI services and pay close attention to what the indemnities leave out.

Looking ahead, if AI companies are found liable to third parties for copyright infringement in the pending litigation discussed above, the cost of using generative AI services may increase as vendors seek to shift liability to customers, perhaps by further narrowing the scope of coverage for AI indemnities. Where companies have the freedom to negotiate more bespoke terms, they would be well advised to seek a price escalation provision that limits a vendor’s ability to raise prices after the parties have entered into the contract.

Conclusion

With at least a dozen copyright cases pending, we are witnessing a period of litigation that will likely determine the legal relationship between content creators and AI platforms. Until the legal dust settles, however, there will be uncertainty and a risk of copyright liability of which anyone contracting for the use of AI platforms should be aware. It remains to be seen what the Copyright Office will recommend to Congress, how those recommendations will affect the progression or resolution of these cases, and who will bear the potential burdens of infringement liability. Attorneys practicing in this area will want to keep abreast of developments and think ahead about how risk may be mitigated by contract.

  1. See Thomson Reuters Enter. Ctr. GmbH v. ROSS Intel. Inc., No. 1:20-cv-00613-SB (D. Del. filed May 6, 2020); UAB Planner 5D v. Facebook, Inc., 534 F. Supp. 3d 1126 (N.D. Cal. 2021); Doe 1 v. GitHub, Inc., No. 4:22-cv-06823-JST (N.D. Cal. filed Nov. 3, 2022); Getty Images, Inc. v. Stability AI, Inc., No. 1:23-cv-00135-JLH (D. Del. filed Feb. 3, 2023); Tremblay v. OpenAI, Inc., No. 3:23-cv-03223 (N.D. Cal. filed June 28, 2023); L. v. Alphabet Inc., No. 3:23-cv-3440 (N.D. Cal. filed July 11, 2023); Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023); Kadrey v. Meta Platforms, Inc., No. 23-cv-03417-VC (N.D. Cal. Nov. 20, 2023); Huckabee v. Bloomberg L.P., No. 1:23-cv-09152 (S.D.N.Y. filed Oct. 17, 2023); Concord Music Grp., Inc. v. Anthropic PBC, No. 3:23-cv-01092 (M.D. Tenn. filed Oct. 18, 2023); Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Oct. 30, 2023) (order granting motion to dismiss with leave to amend); The N.Y. Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023); Nazemian et al. v. Nvidia Corp., No. 24-01454 (N.D. Cal. filed Mar. 8, 2024).
  2. See, e.g., Complaint, Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Oct. 30, 2023).
  3. See, e.g., Complaint, The N.Y. Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
  4. Notice of Inquiry, 88 Fed. Reg. 59942 (U.S. Copyright Office Aug. 30, 2023), https://www.regulations.gov/document/COLC-2023-0006-0001.
  5. See id., Questions 10.3-10.5. For examples of compulsory licensing schemes, see also 17 U.S.C. §§ 111(d) and 115, which provide compulsory copyright licenses for cable system transmissions and music, respectively.
  6. The plaintiffs in GitHub instead rely on the theory that copyright management information was improperly removed or altered in violation of the Digital Millennium Copyright Act. See 17 U.S.C. § 1202. This claim also appears in other cases collected above (for example, in Getty Images).
  7. Unnamed but relevant parties in a few of the matters are LAION and Common Crawl, which maintain and provide databases that AI vendors use for training. The Ninth Circuit has suggested that “scraping” publicly available data does not constitute an invasion of privacy or a violation of the Computer Fraud and Abuse Act under the facts presented to it. hiQ Labs, Inc. v. LinkedIn Corp., 938 F.3d 985 (9th Cir. 2019).
  8. See Thomson Reuters v. ROSS (D. Del. filed May 6, 2020), where the Delaware District Court granted summary judgment to the plaintiffs on the issue of copying.
  9. Compare Answer at 15, ¶ 111, Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023), with Amended Complaint at 19, ¶ 111, Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023).
  10. See Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023); Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Oct. 30, 2023).
  11. See Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Oct. 30, 2023); see also The N.Y. Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
  12. See Doe 1 v. GitHub, Inc., No. 4:22-cv-06823-JST (N.D. Cal. filed Nov. 3, 2022); Getty Images, Inc. v. Stability AI, Inc., No. 1:23-cv-00135-JLH (D. Del. filed Feb. 3, 2023); Concord Music Grp., Inc. v. Anthropic PBC, No. 3:23-cv-01092 (M.D. Tenn. filed Oct. 18, 2023); Andersen v. Stability AI Ltd., No. 3:23-cv-00201 (N.D. Cal. Oct. 30, 2023); The N.Y. Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
  13. Microsoft Comment Letter (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8750.
  14. OpenAI Comment Letter (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8906.
  15. StabilityAI Comment Letter (Oct. 29, 2023), https://www.regulations.gov/comment/COLC-2023-0006-8664.
  16. Meta Comment Letter (Dec. 6, 2023), https://www.regulations.gov/comment/COLC-2023-0006-10332.
  17. Google Comment Letter (Oct. 30, 2023), https://www.regulations.gov/comment/COLC-2023-0006-9003.
  18. All public comment letters are available here: https://www.regulations.gov/document/COLC-2023-0006-0001/comment.
  19. See Tremblay v. OpenAI, Inc., No. 3:23-cv-03223 (N.D. Cal. filed June 28, 2023); Authors Guild v. OpenAI, Inc., No. 1:23-cv-08292 (S.D.N.Y. filed Sept. 19, 2023); The N.Y. Times Co. v. Microsoft Corp., No. 1:23-cv-11195 (S.D.N.Y. filed Dec. 27, 2023).
  20. Supra note 14, at 5.
  21. See id., at 5-6.
  22. Supra note 15, at 10.
  23. Supra note 13, at 3.
  24. See, for example, supra note 16, at 9.
  25. One theory that they may assert is that these training copies, if they persist in RAM for no more than 1.2 seconds, are “buffer copies,” which the Second Circuit has held are too transitory to be “fixed” and therefore are not infringing copies. See Cartoon Network v. CSC Holdings, 536 F.3d 121 (2d Cir. 2008). However, this theory has not yet been asserted by any defendant in these cases, and it may not be applicable depending on the technology involved.
  26. See, e.g., supra note 15, at 2 and 8; see also Meta Platforms Inc.’s Answer to First Consolidated Amended Complaint, at 11, Kadrey v. Meta Platforms, Inc., No. 23-cv-03417-VC (N.D. Cal. Nov. 20, 2023).
  27. Supra note 14, at 14.
  28. Id.
  29. See 35 U.S. Code § 271(c).
  30. See Sony Corp. of America v. Universal City Studios, Inc., 464 U.S. 417 (1984).
  31. Supra note 13, at 11.
  32. Copyright and Artificial Intelligence, United States Copyright Office, https://www.copyright.gov/ai/.
  33. United States Copyright Office, Copyright in the Age of Artificial Intelligence, 167 (Feb. 5, 2020), https://www.copyright.gov/events/artificial-intelligence/transcript.pdf.
  34. United States Copyright Office, Transcript of Proceedings, 121, 128 (May 31, 2023), https://www.copyright.gov/ai/transcripts/230531-Copyright-and-AI-Music-and-Sound-Recordings-Session.pdf; United States Copyright Office, International Copyright Issues and Artificial Intelligence, 13 (Jul. 26, 2023), https://www.copyright.gov/events/international-ai-copyright-webinar/International-Copyright-Issues-and-Artificial-Intelligence.pdf.
  35. See OpenAI, Terms of Use (Nov. 14, 2023), https://openai.com/policies/terms-of-use; see also Amazon, AWS Service Terms (Mar. 27, 2024), https://aws.amazon.com/service-terms/; see also Microsoft, Terms of Use (Feb. 7, 2022), https://aws.amazon.com/service-terms/; see also Google, Google Cloud Generative AI Indemnified Services (Mar. 7, 2024), https://cloud.google.com/terms/generative-ai-indemnified-services; see also Meta, Meta AIs Terms of Service (Apr. 1, 2024), https://m.facebook.com/policies/other-policies/ais-terms.
  36. See Regina Sam Penti, Georgina Jones Suzuki & Derek Mubiru, Trouble Indemnity: IP Lawsuits In The Generative AI Boom, Law360 (Jan. 3, 2024, 4:24 PM), https://www.law360.com/articles/1779936/trouble-indemnity-ip-lawsuits-in-the-generative-ai-boom.
  37. See, e.g., AWS Service Terms, Section 50.10.2 (“AWS will have no obligations or liability [for an Indemnified Generative AI Service] with respect to any claim: (i) arising from Generative AI Output generated in connection with inputs or other data provided by you that, alone or in combination, infringe or misappropriate another party’s intellectual property rights[.]”).