AI firm “cut up and destroyed” millions of books

A curiously revealing story at Ars Technica (June 25, 2025):

Anthropic destroyed millions of print books to build its AI models: Company hired Google’s book-scanning chief to cut up and digitize “all the books in the world.”

On Monday, court documents revealed that AI company Anthropic spent millions of dollars physically scanning print books to build Claude, an AI assistant similar to ChatGPT. In the process, the company cut millions of print books from their bindings, scanned them into digital files, and threw away the originals solely for the purpose of training AI—details buried in a copyright ruling on fair use…

… in February 2024, the company hired Tom Turvey, the former head of partnerships for the Google Books book-scanning project, and tasked him with obtaining “all the books in the world.” …

While destructive scanning is a common practice among some book digitizing operations, Anthropic’s approach was somewhat unusual due to its documented massive scale. By contrast, the Google Books project largely used a patented non-destructive camera process to scan millions of books borrowed from libraries and later returned. …

The article is well worth reading for what it reveals about the inside of the AI world.  The 32-page court judgment is also interesting in itself, as it describes what the AI company did, and why.  Anthropic made a billion dollars this way.

For AI systems (“large language models”) to work, they have to be trained on high-quality text.  Unfortunately, that text almost all belongs to other people, publishers and the like, who have lawyers.  So one way around this is to buy a physical copy of a book, and then store it inside your computer in digital form.

This trick is perfectly legal, or so a court has just ruled.  Why?  Because Anthropic legally purchased the books, destroyed each copy after use, and kept the digital files internally rather than distributing them.

Buying used physical books sidestepped licensing entirely while providing the high-quality, professionally edited text that AI models need, and destructive scanning was simply the fastest way to digitize millions of volumes. The company spent “many millions of dollars” on this buying and scanning operation, often purchasing used books in bulk. Next, they stripped books from bindings, cut pages to workable dimensions, scanned them as stacks of pages into PDFs with machine-readable text including covers, then discarded all the paper originals.

The court documents don’t indicate that any rare books were destroyed in this process—Anthropic purchased its books in bulk from major retailers—but archivists long ago established other ways to extract information from paper. For example, The Internet Archive pioneered non-destructive book scanning methods that preserve physical volumes while creating digital copies. And earlier this month, OpenAI and Microsoft announced they’re working with Harvard’s libraries to train AI models on nearly 1 million public domain books dating back to the 15th century—fully digitized but preserved to live another day.

While Harvard carefully preserves 600-year-old manuscripts for AI training, somewhere on Earth sits the discarded remains of millions of books that taught Claude how to juice up your résumé.

I think most of us will feel somewhat appalled by this treatment of books.  Clearly the development of AI is straining the US copyright regime.
