The Internet Archive’s one trillionth webpage milestone marks a historic achievement in digital preservation after nearly three decades of work.
The nonprofit organisation confirmed that it has now preserved more than one trillion webpages. This effort protects online history from the internet’s natural impermanence.
Digital content can disappear quickly. For example, in 2019, a server migration error led to the deletion of millions of songs uploaded to MySpace over the previous decade. Incidents like this highlight why long-term web preservation matters.
Internet Archive and Digital Preservation
Since 1996, the Internet Archive has relied on web crawlers to archive publicly accessible websites. It combines automated tools with volunteer uploads to document the evolution of the web.
Today, the archive holds over one trillion webpages and 41 million texts, along with other media formats. Storing this collection takes around 100,000 terabytes of storage. Moreover, with hundreds of millions of new webpages appearing daily, the archive continues to expand rapidly.
The Internet Archive plays a crucial role for researchers, journalists and the public. However, it faces new challenges.
The rapid rise of generative AI has led technology companies to scrape large volumes of online content for training data. As a result, major media organisations such as The New York Times, The Guardian and Gannett have blocked the Archive’s crawlers to prevent their newer material from being used for AI model training.
These developments underscore the tension between open digital preservation and evolving content protection strategies in the AI era.