Web Scraping in the Age of AI: Guidance for Data Owners and Scrapers

Authors:

David M. McIntosh, Joshua Talicska, Craig Hu, Manish Namireddy

Introduction

Data scraping, commonly referred to as “web scraping,” refers to the automated process of extracting data from websites using specialized software, bots, or web crawlers. Since the dawn of the internet, developers have employed various web scraping techniques to rapidly collect and aggregate information from the internet for many purposes, such as price monitoring, market research, and lead generation, and many of those web scrapers have faced lawsuits from owners of the scraped data alleging various legal theories. The recent emergence and widespread adoption of artificial intelligence (AI) systems has made web scraping more prevalent than ever due to its critical role in generating the vast datasets necessary for training and developing AI systems and tools, and companies that engage in web scraping for this purpose may expose themselves to liability if they scrape data impermissibly. Meanwhile, many businesses that publish content online have become concerned about their content being scraped by and for AI systems and are seeking to protect their content using both technical and legal strategies.

This article provides an overview of methods used to scrape and navigate the internet and AI-specific web scraping activities. It then discusses actionable claims data owners have against web scrapers. Finally, it provides recommendations to both data owners and companies that scrape the internet under the existing legal framework in the United States.

Web Scraping and Its Significance in the Modern Economy

Methods

When individuals refer to “web scraping,” they are typically referring to one of the following methods for scraping or retrieving data:

Web Scraping: extraction of data from third-party websites using specialized software, such as web crawlers.
Web Crawling: automated navigation and indexing of web pages via hyperlinks.
Screen Scraping: automated extraction of data that is visually displayed on a screen, such as text, images, videos, PDFs, and other document formats.¹,²

The origins of scraping trace back to the dawn of the internet in the early 1990s, when the first web crawlers were developed primarily to measure the size of the web and index its pages for search engines. As the internet expanded, web scraping evolved into a widely used method for businesses and researchers to collect data at scale for applications such as online price monitoring and comparison, competitor product analysis, and lead generation.

Robots.txt

One of the primary mechanisms content owners use for limiting scraping is the Robots Exclusion Protocol, commonly implemented through a file called “robots.txt.” This is a plain text file placed in the root directory of a website that communicates instructions to web crawlers about which pages or sections of the site they are permitted or prohibited from accessing.³ Importantly, robots.txt cannot enforce the instructions it provides to bots.⁴ Although robots.txt has historically been regarded as a voluntary guideline, recent scholarship argues that civil doctrines—particularly contract and tort law—may provide viable frameworks for holding violators accountable, and that under certain conditions, robots.txt can give rise to a unilateral contract or serve as constructive notice sufficient to establish tortious liability.⁵
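
To make the mechanism concrete, the following is a minimal sketch of how a cooperating crawler might interpret a robots.txt file, using the robots.txt parser in Python's standard library. The file contents, the crawler name ExampleAIBot, and the example.com URLs are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: one named crawler is excluded from the
# whole site, while every other crawler is excluded only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("ExampleAIBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/articles/1"),
        ("SomeOtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Nothing in this check is enforced by the website. A crawler that never runs such a check, or that ignores the result, can still request the page, which is why the file operates as an instruction rather than a barrier.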

AI-Specific Contexts

Web scraping plays a critical role in the development and operation of modern AI systems, particularly large language models (LLMs). At the foundational level, training an LLM requires massive volumes of data, far beyond what human researchers could compile manually, and web scraping serves as one of the primary mechanisms by which developers amass the large-scale text corpora necessary to train these models.⁶

Beyond the training phase, web crawling also facilitates the live retrieval of information to complement the outputs of already-trained models. For instance, to access up-to-date information, AI tools use Retrieval-Augmented Generation (RAG), a technique that dynamically incorporates external data into the response generation process by retrieving information relevant to a user’s query at the time it is submitted.⁷ Using RAG, an AI tool can generate search queries, send them to a third-party search engine, pull the top results, and answer the question using the retrieved text as additional context, rather than relying solely on information encoded in the model during training.
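
The retrieval step described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a description of any particular AI product: the DOCUMENTS dictionary is a hypothetical stand-in for pages returned by a search engine, ranking is by simple word overlap rather than a production ranking method, and the call to the language model itself is omitted.

```python
import re

# Hypothetical stand-in for pages returned by a third-party search engine.
DOCUMENTS = {
    "https://example.com/a": "The court issued its ruling on the scraping dispute on Tuesday.",
    "https://example.com/b": "A recipe for sourdough bread with a long fermentation.",
    "https://example.com/c": "The ruling in the scraping dispute is expected to be appealed.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: dict[str, str], top_k: int = 2) -> list[tuple[str, str]]:
    """Return up to top_k (url, text) pairs ranked by word overlap with the query."""
    query_terms = tokenize(query)
    scored = []
    for url, text in documents.items():
        score = len(query_terms & tokenize(text))
        if score > 0:  # ignore documents that share no words with the query
            scored.append((score, url, text))
    # Highest overlap first; ties broken by URL so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(url, text) for _, url, text in scored[:top_k]]


def build_prompt(query: str, passages: list[tuple[str, str]]) -> str:
    """Copy the retrieved text into the prompt that would be sent to the model."""
    context = "\n".join(f"[{url}] {text}" for url, text in passages)
    return (
        "Answer the question using the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )


if __name__ == "__main__":
    question = "What was the ruling in the scraping dispute?"
    print(build_prompt(question, retrieve(question, DOCUMENTS)))
```

The point of the sketch is the data flow: text from third-party pages is fetched at query time and copied into the model's input, so the answer draws on that text rather than only on what the model learned in training.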

Claims Asserted Against Web Scrapers

While web scraping is not inherently illegal in the United States, it may give rise to actionable claims depending on the method of scraping, the type of data collected, and the purpose for which the scraped data is used. These claims may include copyright infringement, violation of the Digital Millennium Copyright Act (DMCA), breach of contract, and violation of the Computer Fraud and Abuse Act (CFAA), among other causes of action.

Copyright Infringement and DMCA

Web scraping activities may give rise to claims of copyright infringement, which requires a plaintiff to establish two elements: (1) ownership of a valid copyright and (2) copying of original elements of the copyrighted work.⁸ In the context of AI, it is generally understood that both the collection and curation of data for training AI models and RAG implicate the reproduction right under copyright law.⁹ As the U.S. Copyright Office’s 2025 report on Generative AI Training explained, “the steps required to produce a training dataset containing copyrighted works clearly implicate the right of reproduction,” and RAG involves “making reproductions, including when the system copies retrieved content at generation time to augment its response.”

Courts have already begun to grapple with these issues, as illustrated by The New York Times Co. v. Microsoft Corp., in which The New York Times filed suit against Microsoft and OpenAI, alleging that their large-scale scraping and ingestion of copyrighted articles to train large language models, including ChatGPT and Microsoft’s Bing Chat (now Copilot), infringed its reproduction right and other exclusive rights.¹⁰ The New York Times claimed that LLMs could “regurgitate” large portions of the copyrighted content, effectively allowing users to bypass paywalls. The case is still ongoing with respect to the copyright infringement claims.

Separately, web scraping may also implicate the DMCA, which prohibits any person from circumventing a technological measure that effectively controls access to a copyrighted work. Where a website operator has deployed technical barriers—such as access controls, authentication requirements, or anti-bot measures—to restrict automated access to its content, a scraper’s efforts to bypass those protections could trigger liability under the DMCA’s anti-circumvention provisions.¹¹

Importantly, web scrapers facing copyright infringement claims may raise a fair use defense, which permits the unlicensed use of copyrighted works under certain circumstances. Fair use is evaluated under a four-factor balancing test that considers: (1) the purpose and character of the use, including whether it is commercial or transformative; (2) the nature of the copyrighted work; (3) the amount and substantiality of the portion used in relation to the whole; and (4) the effect of the use on the potential market for the copyrighted work.¹² In the AI context, whether the ingestion of copyrighted material to train a model constitutes fair use remains an open and highly contested question. AI developers generally argue that training is “transformative,” because the model learns patterns rather than copying expression, and that no market substitution occurs because an LLM serves a fundamentally different function than the original works.¹³ Content owners counter that the sheer volume of material copied, the commercial nature of the resulting AI products, and the capacity of AI-generated outputs to substitute for the originals all weigh against fair use.¹⁴ The U.S. Copyright Office’s 2025 report on Generative AI Training declined to adopt a categorical rule, concluding that fair use in the AI training context must be assessed on a case-by-case basis. No appellate court has yet ruled on the question, and the outcomes of pending cases, including The New York Times Co. v. Microsoft Corp., will be pivotal in shaping the legal landscape.

Breach of Contract

Data owners have also sued and threatened to sue both scrapers and purchasers of scraped data for breach of contract. hiQ Labs, Inc. v. LinkedIn Corp. is a prominent reminder that even publicly available data may be subject to contractual restrictions.¹⁵ In that case, LinkedIn’s user agreement expressly barred users from scraping or copying profiles using crawlers, bots, or automated tools, and the Ninth Circuit noted that hiQ had agreed to those terms before using analytics products to scrape data such as names, job titles, and work histories from public LinkedIn profiles.¹⁶ The litigation concluded at the district court level with LinkedIn prevailing on its breach of contract claim, as the district court held that hiQ had likely violated LinkedIn’s User Agreement by engaging in automated web scraping and by using fake profiles to access the platform.¹⁷

Many websites’ terms of use address data access and usage, but terms of use will only provide an actionable claim for breach of contract if they are enforceable. Courts have drawn a distinction between “clickwrap” and “browsewrap” agreements. Both terms derive from “shrinkwrap” licenses, form agreements wrapped in transparent plastic that bound users once they opened the packaging.¹⁸

In browsewrap agreements, a user is deemed to assent to the terms of an agreement or a license on a website not through express assent but through conduct, such as visiting the interior pages of a website. Courts have been reluctant to enforce browsewrap agreements unless, looking at the totality of the circumstances, the proponent of the agreement can establish that users had actual or constructive notice of its terms. Pollstar v. Gigmania, Ltd., 170 F. Supp. 2d 974, 981 (E.D. Cal. 2000). A Berkeley Technology Law Journal article explains the overarching concern: browsewrap terms purport to bind users by conduct such as visiting interior pages, even though users need not view the website’s terms of use or expressly assent.¹⁹ Recent decisions reflect this concern, asking whether there was actual or constructive notice, whether notice was conspicuous, and whether the user took an action that unambiguously manifested assent.²⁰ For example, if a website places its terms of use at the top of the page, or gives the user a reason to scroll to or otherwise view the terms of use, constructive notice is more likely to be found, and the browsewrap agreement is therefore more likely to be enforceable.²¹

In contrast, clickwrap agreements, which require a user to actively consent to terms or conditions by clicking a button or checking a box in order to proceed, are generally upheld as enforceable. For example, in Meta Platforms, Inc. v. BrandTotal Ltd., Meta prevailed on a claim that BrandTotal breached its contract with Meta by collecting data from Facebook and Instagram via automated means in violation of Meta’s terms of use.²² The LinkedIn user agreement in the hiQ case is another example of a clickwrap agreement.

Contract claims have also been alleged in AI-related data scraping cases. In June 2025, Reddit sued Anthropic for allegedly violating Reddit’s User Agreement by using data from Reddit without authorization to train Anthropic’s AI models. Reddit’s User Agreement, which applies to all visiting users, including bots, prohibits commercial exploitation of its platform. Reddit grants conditional access to its archive through its “Compliance API” to AI companies that enter into formal licenses. Anthropic allegedly refused to enter into this license but continued unauthorized access to the Compliance API for commercial purposes and bypassed Reddit’s technology controls, including robots.txt directives and IP rate limits. In addition to breach of contract, Reddit sued Anthropic under common law theories, such as unjust enrichment and trespass to chattels, which are discussed further below. Notably, because Reddit does not own the copyrights in its users’ posts, it did not assert any copyright infringement claims.

Computer Fraud and Abuse Act

The CFAA prohibits intentionally accessing a computer without authorization, or exceeding authorized access, to obtain information from a “protected computer,” a term that includes any computer “used in or affecting interstate or foreign commerce or communication.”²³ Data owners seeking to bring claims against web scrapers may invoke the CFAA on the theory that a scraper’s automated access to their servers constitutes unauthorized access to a protected computer. The scope of this cause of action has been significantly narrowed by recent case law. In Van Buren v. United States, the Supreme Court held that an individual “exceeds authorized access” under the CFAA only when the individual accesses a computer with authorization but then obtains information located in particular areas of the computer (such as files, folders, or databases) that are off-limits to the individual; the statute does not cover those who have improper motives for obtaining information that is otherwise available to them.²⁴

The Ninth Circuit applied this framework in hiQ Labs, Inc. v. LinkedIn Corp. While, as noted above, the hiQ litigation ultimately concluded with LinkedIn prevailing on its breach of contract claim, the Ninth Circuit concluded that hiQ’s scraping of publicly available LinkedIn member profiles likely did not violate the CFAA, reasoning that when a computer network generally permits public access to its data, a user’s accessing that publicly available data will likely not constitute access “without authorization” under the statute.²⁵ The court distinguished between computers for which no authorization is required (such as public-facing websites), computers for which authorization has been given, and computers for which authorization is required but has not been given, finding that public LinkedIn profiles fell into the first category.

Therefore, while the CFAA may be of limited value where a scraper accesses only publicly available data on an open website, data owners may still pursue CFAA claims where the scraper circumvents generally applicable access restrictions, such as username and password requirements, or may alternatively enforce their rights through breach of contract or terms-of-service claims. The narrowing of the CFAA following Van Buren and hiQ has made contract-based enforcement the primary legal tool for data owners seeking to control scraping of publicly accessible content.

Common Law and Privacy-Related Claims

Data owners may also pursue common law claims against scrapers. Under common law, a data owner may assert a claim for trespass to chattels, which arises when a web scraper accesses a computer system without authorization and, in doing so, causes damage to the system or impairs its functionality. This doctrine treats the computer system as personal property and recognizes that unauthorized interference with its use constitutes a harm.²⁶ A data owner may also bring a claim for unjust enrichment, arguing that the web scraper received a benefit (the scraped data) and that retention of that benefit at the data owner’s expense is unjust. This claim does not require a showing of wrongful conduct but rather focuses on the inequity of allowing the defendant to retain a benefit derived from the plaintiff’s resources without compensation or consent.²⁷

Beyond common law theories, data owners may also assert privacy-related claims where the scraped content includes personal data. Scraping potentially violates all the key principles of privacy laws, such as fairness, individual rights and control, transparency, and consent.²⁸ Web scrapers that collect personal information may face liability under data protection regimes such as the General Data Protection Regulation in the European Union or the California Consumer Privacy Act in the United States, both of which impose obligations regarding the collection, processing, and use of personal data and provide enforcement mechanisms and private rights of action that data owners and affected individuals may invoke against non-compliant actors.²⁹,³⁰

Recommendations for Data Owners

To protect their data from web scrapers, data owners should consider taking the following steps:

Terms of Use. Companies should ensure that their Terms of Use are enforceable by requiring affirmative consent from website visitors, such as requiring users to check a box acknowledging and agreeing to the terms before accessing content.
Robots.txt. Data owners should review their robots.txt files and specify the level of access they are willing to grant to web crawlers, making clear which portions of their sites may or may not be accessed by automated tools.
Technological Measures. Companies should consider implementing technological measures to deter unauthorized scraping. CAPTCHA systems, for example, protect websites against bots by generating and grading tests that humans can pass but current computer programs cannot.³¹ Additionally, services such as Cloudflare have implemented AI-specific solutions that can block AI crawlers entirely³² or require AI companies to pay for access to content through tools like Pay Per Crawl and AI Crawl Control.³³
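
As one further illustration of a technological measure, the sketch below shows a sliding-window request limit per client IP address, a control of the kind referenced in the Reddit complaint discussed above. It is a simplified assumption of how such a limit could work, not a description of any vendor's product. The class name, the thresholds, and the IP address are hypothetical.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Maps each client IP to the timestamps of its recently allowed requests.
        self._history = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        """Record a request at time now and report whether it is within the limit."""
        timestamps = self._history[client_ip]
        # Discard timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            # Over the limit; a server might answer with HTTP 429 here.
            return False
        timestamps.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=3, window_seconds=10.0)
    for t in [0.0, 1.0, 2.0, 3.0, 11.0]:
        print(t, limiter.allow("203.0.113.7", now=t))
```

A limit keyed to IP address slows a single high-volume client but can be sidestepped by spreading requests across many addresses, which is one reason operators tend to layer it with the other measures listed above.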

Recommendations for Web Scrapers

To mitigate potential liability, companies engaged in web scraping should consider taking the following actions:

Abide by Restrictions Set by Data Owners. Web scrapers should comply with the restrictions established by data owners, including websites’ terms of use, robots.txt files, and any technological protection measures, and should use authorized channels where they exist, such as the paid APIs that Wikipedia³⁴ and Reddit have established. Failure to comply with these measures can give rise to the actionable claims discussed above.
Consider the Nature of the Data Being Scraped. Companies should carefully consider the nature of the data being scraped, as copyright infringement claims require copyrightable materials to be at issue and scraping purely factual information is less likely to create legal exposure.
Contractual Protections. Companies that use third-party AI tools or purchase or license scraped data should closely review indemnification provisions, require the third party to indemnify them for losses caused by impermissible web scraping, and require counterparties to represent that they have not circumvented any technological measures. As discussed in this Ropes & Gray article, many providers of generative AI models now offer varying levels of indemnification protection to their customers, and companies should evaluate these protections carefully.³⁵
Licensing Deals. Companies building AI tools or technologies that rely on scraped data should consider entering into direct licensing deals with data- and content-centric companies, as a growing number of AI companies have already done with major news providers to reduce litigation risk.
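
The first recommendation above can be partly automated. The following is a minimal sketch, assuming a scraper that consults robots.txt before fetching anything: it separates candidate URLs into permitted and disallowed groups and reads any Crawl-delay the site requests. The robots.txt text, the ExampleResearchBot name, and the URLs are hypothetical. Crawl-delay is a common extension that not every site or crawler recognizes, and passing this check says nothing about terms of use, copyright, or other restrictions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the site being scraped.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 5
Disallow: /members/
"""


def plan_crawl(robots_txt: str, user_agent: str, urls: list[str]):
    """Split urls into (allowed, skipped) and return the requested delay in seconds.

    The delay is None when robots_txt sets no Crawl-delay for user_agent.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed, skipped = [], []
    for url in urls:
        (allowed if parser.can_fetch(user_agent, url) else skipped).append(url)
    return allowed, skipped, parser.crawl_delay(user_agent)


if __name__ == "__main__":
    candidates = [
        "https://example.com/articles/1",
        "https://example.com/members/42",
        "https://example.com/about",
    ]
    allowed, skipped, delay = plan_crawl(ROBOTS_TXT, "ExampleResearchBot", candidates)
    print("fetch:", allowed)
    print("skip:", skipped)
    print("seconds between requests:", delay)
```

Keeping the skipped list, rather than silently dropping those URLs, gives a scraper a record that it observed the site's stated restrictions.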

Conclusion

The legal landscape surrounding web scraping is undergoing significant change, largely influenced by the rapid advancement of AI and the increasing importance of the underlying data. Although there is no comprehensive statute specifically regulating web scraping, a combination of copyright law, contract law, the CFAA, and common law claims provides data owners with numerous (if imperfect) mechanisms to safeguard their content. These legal tools also present considerable risks for entities engaging in scraping activities without appropriate precautions. Nevertheless, the line between lawful and unlawful web scraping remains blurry, especially with respect to what constitutes fair use of copyright-protected content and the validity of various technical and contractual limitations. Organizations involved in web scraping, whether as data owners or scrapers, should actively evaluate their legal risks, enhance their contractual and technical defenses, and stay informed about ongoing legal developments in this rapidly evolving field.

1. Crawling vs. Scraping, Oxylabs (Oct. 4, 2024), available at https://oxylabs.io/blog/crawling-vs-scraping.
2. Web Scraping vs. Screen Scraping, ParseHub (May 28, 2021), available at https://www.parsehub.com/blog/web-scraping-vs-screen-scraping/.
3. RFC 9309: Robots Exclusion Protocol, IETF (Sept. 2022), available at https://www.rfc-editor.org/rfc/rfc9309.html.
4. What is robots.txt? | How a robots.txt file works, Cloudflare, available at https://www.cloudflare.com/learning/bots/what-is-robots-txt/.
5. Chien-Yi Chang & Xin He, The Liabilities of Robots.txt, 58 Computer L. & Security Rev. 106176 (2025).
6. Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus, EMNLP (2021), available at https://aclanthology.org/2021.emnlp-main.98/.
7. An Introduction to RAG Models, Perplexity, available at https://www.perplexity.ai/page/an-introduction-to-rag-models-jBULt6_mSB2yAV8b17WLDA.
8. Feist Publications, Inc. v. Rural Telephone Service Co., 499 U.S. 340, 361 (1991).
9. U.S. Copyright Office, Copyright and Artificial Intelligence Part 3: Generative AI Training Report (2025), available at https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf.
10. The New York Times Co. v. Microsoft Corp., No. 23-cv-11195 (S.D.N.Y.).
11. Craigslist Inc. v. 3Taps Inc., 964 F. Supp. 2d 1178 (N.D. Cal. 2013).
12. 17 U.S.C. § 107.
13. See A Tale of Three Cases: How Fair Use Is Playing Out in AI Copyright Lawsuits, Ropes & Gray LLP (July 7, 2025), available at https://www.ropesgray.com/en/insights/alerts/2025/07/a-tale-of-three-cases-how-fair-use-is-playing-out-in-ai-copyright-lawsuits.
14. Id.
15. hiQ Labs, Inc. v. LinkedIn Corp., 31 F.4th 1180 (9th Cir. 2022).
16. Id. at 1181-1182.
17. hiQ v. LinkedIn Wrapped Up: Web Scraping Lessons Learned, Zwillgen, available at https://www.zwillgen.com/alternative-data/hiq-v-linkedin-wrapped-up-web-scraping-lessons-learned/.
18. See Mark A. Lemley, Intellectual Property and Shrinkwrap Licenses, 68 S. Cal. L. Rev. 1239, 1241 (1995).
19. Tarra Zynda, Ticketmaster Corp. v. Tickets.com, Inc.: Preserving Minimum Requirements of Contract on the Internet, 19 Berkeley Tech. L.J. 495 (2004); see also Ticketmaster Corp. v. Tickets.com, Inc., No. CV997654HLHVBKX, 2003 WL 21406289 (C.D. Cal. Mar. 7, 2003).
20. See, e.g., Berman v. Freedom Financial Network, LLC, 30 F.4th 849, 856 (9th Cir. 2022); Gaker v. Citizens Disability, LLC, No. 20-CV-11031-AK, 2023 U.S. Dist. LEXIS 19182 (D. Mass. 2023).
21. Byars v. The Goodyear Tire and Rubber Co., et al., No. 5:22-cv-01358-SSS-KKx, 2023 U.S. Dist. LEXIS 22337 (C.D. Cal. 2023).
22. Meta Platforms, Inc. v. BrandTotal Ltd., 605 F. Supp. 3d 1218 (N.D. Cal. 2022).
23. 18 U.S.C. § 1030.
24. Van Buren v. United States, 141 S. Ct. 1648 (2021).
25. See hiQ Labs, Inc. v. LinkedIn Corp., 31 F.4th 1180.
26. See Restatement (Second) of Torts § 217 (Am. L. Inst. 1965); see also eBay, Inc. v. Bidder’s Edge, Inc., 100 F. Supp. 2d 1058 (N.D. Cal. 2000) (applying trespass to chattels in the context of automated web scraping).
27. Restatement (Third) of Restitution and Unjust Enrichment § 1 (Am. L. Inst. 2011).
28. Daniel J. Solove & Woodrow Hartzog, The Great Scrape: The Clash Between Scraping and Privacy, 113 Calif. L. Rev. 1521 (2025).
29. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of Such Data (General Data Protection Regulation).
30. California Consumer Privacy Act of 2018, Cal. Civ. Code §§ 1798.100–1798.199.100.
31. How CAPTCHAs work | What does CAPTCHA mean?, Cloudflare, available at https://www.cloudflare.com/learning/bots/how-captchas-work/.
32. Cloudflare Just Changed How AI Crawlers Scrape the Internet-at-Large; Permission-Based Approach Makes Way for A New Business Model, Cloudflare (July 1, 2025), available at https://www.cloudflare.com/press/press-releases/2025/cloudflare-just-changed-how-ai-crawlers-scrape-the-internet-at-large/.
33. Introducing pay per crawl: Enabling content owners to charge AI crawlers for access, Cloudflare (July 1, 2025), available at https://blog.cloudflare.com/introducing-pay-per-crawl/.
34. Modern Enterprise-grade APIs for Wikipedia & more, Wikimedia Enterprise, available at https://enterprise.wikimedia.com/.
35. Trouble Indemnity: IP Lawsuits In The Generative AI Boom, Law360 (Jan. 3, 2024), available at https://www.law360.com/articles/1779936.