Backing Up The Internet

MrDevanWright · August 25, 2025

With the loss of AnandTech, others before it and more likely to follow, what methods do we have to back up such sites? How much storage does a site like Tom's Hardware actually take?

Obviously the Internet Archive is doing their part, but we can't rely on them for everything. How would someone even back up or mirror a site that is deemed important enough?

Shimejii · August 25, 2025

Simple, you don't. Unless you have a budget in the millions just for server space, a massive warehouse, electricity and such, it's just not feasible. Small sites are very doable, but someone always has to foot that bill.

Eigenvektor · August 25, 2025

You can use tools like wget to recursively download a website. But that will only download things that are publicly visible. Any backend code (e.g. PHP) or any database accessed by frontend code (e.g. JS) would not be included in the backup. So if the site uses interactive features, those would likely be in a nonfunctional state in the backup.
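
As a rough sketch of what that looks like in practice, the snippet below only builds a wget command line and prints it, so you can inspect it before running anything. The function name, the default delay and rate limit, and the example URL are all placeholders I made up; the flags themselves are standard GNU wget options.

```python
import shlex


def build_wget_mirror_command(url, wait_seconds=2, rate_limit="200k"):
    """Return the argument list for a polite recursive wget mirror of `url`.

    The defaults are illustrative assumptions, not recommendations for any
    particular site.
    """
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the copy works offline
        "--adjust-extension",  # save pages with an .html suffix where needed
        "--page-requisites",   # also fetch the images, CSS and JS a page needs
        "--no-parent",         # never climb above the starting directory
        f"--wait={wait_seconds}",      # pause between requests
        "--random-wait",               # vary that pause
        f"--limit-rate={rate_limit}",  # cap download bandwidth
        url,
    ]


if __name__ == "__main__":
    # example.com is a placeholder, not a site from this thread
    print(shlex.join(build_wget_mirror_command("https://example.com/articles/")))
```

The result is still only a static copy: anything rendered from server-side code or a database is saved as whatever HTML the server returned at the time.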

Depending on the site, if you send requests too quickly, they might block you at some point. And, as was said above, you'll likely need a lot of storage and, if you want the content to stay available to others, the bandwidth to serve it. And be prepared to fight the same legal battles the Internet Archive already has.
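
If you write your own crawler instead of using wget, the usual way to avoid hammering a site is to enforce a minimum delay between requests. Here is a minimal sketch, assuming a fixed interval is good enough; the `Throttle` class is a hypothetical helper, and the right interval depends entirely on the site.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Sleep until min_interval seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` right before every fetch keeps the crawl at or below one request per `min_interval` seconds, regardless of how fast the rest of the loop runs.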

It's likely much more worthwhile to contribute to the Internet Archive rather than doing this on your own, either by donating money or by donating your bandwidth: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior

Kisai · August 25, 2025

5 hours ago, MrDevanWright said:

With the loss of Anandtech, others coming and others in the past, what methods do we have to backup such sites? How much storage does a site like TomsHardware actually take?

Obviously the Internet Archive is doing their part, but we can't rely on them for everything. How would someone even backup/mirror a site that is deemed important enough?

Gonna be brutally honest: your average website that doesn't run on garbage (e.g. WordPress) doesn't actually require that much space.

There's a good chance that AnandTech, minus the forums and any other UGC (user-generated content), could have been saved to a 1TB hard drive.

What actually wastes the most physical disk space is UGC spam. I kid you not, on one site I managed before dumping WordPress, the UGC was like 99.9999% spam. Not even legit comments, just a bottomless pit of spam. Getting rid of it took the site from around 30GB to around 1GB.

Webdev friends don't let webdev friends use WordPress. Or at least they should understand how to set it up properly, which basically nobody ever does. Out of the box, WP has neither basic page caching nor a working anti-spam system. The same goes for other popular open source CMSs like phpBB and MediaWiki. WordPress without those two basic plugins consumes 100 times the processing time and bandwidth, and requires the server to allocate 250MB of RAM per PHP process instead of the typical out-of-the-box configuration of 16MB.

So for sites that were custom coded, or that used something you actually pay for support for and that isn't sloppy, the actual asset storage is really small, barely more than the text on the page plus any images included in the post.

But somehow some CMSs manage to spit out web pages with 20x as much code soup as actual content, due to the placement of ads, scripts for sharing, GDPR compliance, and the hundreds of "ad partners" inserting scripts into the page. None of that garbage needs to be saved, just the text, images and videos that are actually on the site itself. The problem is that archive.org and other crawlers don't really know where the content begins or ends unless a custom filter is set up for the site itself.

It's much more efficient to get a database dump and the physical assets of the site and set up a mirror than it is to crawl the site and rip the data.
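
To give an idea of what the crudest possible filter looks like, here is a sketch using only Python's standard `html.parser`. It drops everything inside `script`, `style`, `iframe` and `noscript` and keeps the visible text plus image URLs. The tag list and the names `ContentExtractor` and `extract_content` are my own assumptions; it does not find where the article begins or ends, which is exactly the part that needs a per-site filter.

```python
from html.parser import HTMLParser

# Tags whose contents are assumed to be page furniture, not article content.
SKIPPED_TAGS = {"script", "style", "iframe", "noscript"}


class ContentExtractor(HTMLParser):
    """Collect visible text and image URLs, ignoring skipped tags."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_urls = []
        self._skip_depth = 0  # greater than 0 while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "img" and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (text, image_urls) for one HTML document."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_urls
```

Navigation menus, footers and comment widgets would still come through as text, so this only removes the script and ad-frame layer, not the rest of the page chrome.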

josemirm · August 26, 2025

Just as an anecdote, there was a wiki site in Spain called La Frikipedia (The Nerd/Freakpedia) that made fun of everyone and everything on Earth with articles written by random people. They had to close after making fun of the president of a "copyright association", who sued them, IIRC. The webmaster just gave away the whole '/www' folder in a zip file (with its PHP shenanigans), and it was around 1 GB with lots of images.

I wouldn't be surprised if AnandTech or Tom's Hardware could be stored, with compression, on a couple of Blu-ray discs. The crux, in my opinion, isn't how to get the data, since we could scrape it even if that's not morally correct, but how to pay for keeping it online. We could upload it to the Internet Archive and pay something every year, or we could even seed it via BitTorrent, but that won't be the same as having the site working as it does today.
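
If someone does end up with a folder like that, checking how small it gets is easy. A minimal sketch, assuming the site is already sitting in a local directory; `archive_site` and the paths you pass in are hypothetical, and gzip is just the compression that ships with Python, not necessarily the best choice.

```python
import tarfile
from pathlib import Path


def archive_site(site_dir, archive_path):
    """Pack site_dir into a gzip-compressed tarball.

    Returns (raw_bytes, archive_bytes). The archive should be written
    outside site_dir so it does not try to include itself.
    """
    site_dir = Path(site_dir)
    # Total size of all regular files before compression
    raw_bytes = sum(p.stat().st_size for p in site_dir.rglob("*") if p.is_file())
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(site_dir, arcname=site_dir.name)
    return raw_bytes, Path(archive_path).stat().st_size
```

Text and HTML shrink a lot, while JPEGs and videos are already compressed and barely shrink at all, so the ratio depends on how image-heavy the site is.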

apoyusiken · August 31, 2025

gotta be a millionaire
