With the loss of Anandtech, others before it, and more to come, what methods do we have to back up such sites? How much storage does a site like TomsHardware actually take?

Obviously the Internet Archive is doing their part, but we can't rely on them for everything. How would someone even back up or mirror a site that is deemed important enough?

Link to comment
https://linustechtips.com/topic/1621202-backing-up-the-internet/

Simple: you don't. Unless you have a budget in the millions just for server space, a massive warehouse, electricity and so on, it's just not feasible. Small sites are very doable, but someone always has to foot the bill.

Link to post

You can use tools like wget to recursively download a website. But that will only download things that are publicly visible. Any backend code (e.g. PHP) or any database accessed by frontend code (e.g. JS) would not be included in the backup. So if the site uses interactive features, those would likely be in a nonfunctional state in the backup.

 

Depending on the site, if you send requests too quickly, they might block you at some point. And, as was said above, you'll likely need a lot of storage and, if you want the content to stay available to others, also the bandwidth to serve it. And be prepared to fight the same legal battles the Internet Archive already has.
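
To make the wget suggestion a bit more concrete, here is a rough sketch of what a "polite" recursive download could look like, written as a small Python helper that only builds the command line. The function name, the example URL, the output directory and the default wait and rate-limit values are all made up for illustration; the wget options themselves are standard ones, but what a given site tolerates is something you'd have to find out yourself.

```python
import shlex
from urllib.parse import urlparse


def build_wget_mirror_command(url, dest_dir, wait_seconds=2, rate_limit="500k"):
    """Return the argv list for a throttled, recursive wget mirror of `url`.

    The defaults for wait_seconds and rate_limit are arbitrary guesses,
    not values recommended by any particular site.
    """
    host = urlparse(url).hostname
    if not host:
        raise ValueError(f"not an absolute URL: {url!r}")
    return [
        "wget",
        "--mirror",                    # recursive download with timestamping
        "--page-requisites",           # also fetch images, CSS and JS the pages need
        "--convert-links",             # rewrite links so the copy works offline
        "--adjust-extension",          # save HTML pages with an .html suffix
        "--no-parent",                 # never climb above the starting directory
        f"--domains={host}",           # stay on the original host
        f"--wait={wait_seconds}",      # pause between requests
        "--random-wait",               # vary the pause so the traffic is less bursty
        f"--limit-rate={rate_limit}",  # cap the download bandwidth
        f"--directory-prefix={dest_dir}",
        url,
    ]


if __name__ == "__main__":
    # Hypothetical target; this prints the command instead of running it.
    cmd = build_wget_mirror_command("https://example.com/", "mirror")
    print(shlex.join(cmd))
    # To actually run it (wget must be installed):
    #   import subprocess; subprocess.run(cmd, check=True)
```

As noted above, this still only captures what is publicly visible; backend code and databases are not part of the result.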

 

It's likely much more worthwhile to contribute to the Internet Archive rather than doing this on your own, either by donating money or by donating your bandwidth: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior


Link to post

5 hours ago, MrDevanWright said:

With the loss of Anandtech, others coming and others in the past, what methods do we have to backup such sites? How much storage does a site like TomsHardware actually take?

Obviously the Internet Archive is doing their part, but we can't rely on them for everything. How would someone even backup/mirror a site that is deemed important enough?

Gonna be brutally honest: your average website that doesn't run on garbage (e.g. WordPress) doesn't actually require that much space.

 

There's a good chance that Anandtech, minus the forums and any other user-generated content (UGC), could have been saved to a 1TB hard drive.

 

What actually wastes the most physical disk space is UGC spam. I kid you not, on one site I managed before dumping WordPress, the UGC was like 99.9999% spam. Not even legit comments, just a bottomless pit of spam. Clearing it out took the site from about 30GB to about 1GB.

 

Webdev friends don't let webdev friends use WordPress. Or at least they should understand how to set it up properly, which basically nobody ever does. WP ships without even a basic caching plugin or a basic anti-spam system, and the same goes for other popular open-source CMSs like phpBB and MediaWiki. WordPress without those two basic plugins consumes 100 times the processing time and bandwidth, and requires the server to allocate 250MB of RAM per PHP process instead of the typical out-of-the-box configuration of 16MB.

 

So for sites that were custom coded, or that used something you actually pay for support for and that isn't sloppy, the actual asset storage is really small: barely more than the text on the page plus any images included in the post.

 

But somehow some CMSs manage to spit out web pages that are 20x code soup relative to the actual content, due to the placement of ads, scripts for sharing, GDPR compliance, and the hundreds of "ad partners" inserting scripts into the page. None of that garbage needs to be saved. Just the text, images and videos that are actually on the site itself. The problem is that archive.org and other crawlers don't really know where the content begins or ends unless there is a custom filter set up for the site itself.
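
As a rough illustration of keeping "just the text and images" and dropping the script soup, here is a minimal sketch using only Python's standard-library HTML parser. The class and function names are made up for this example, and it is deliberately crude: it throws away script, style, noscript and iframe content, but it has no idea where the article itself begins or ends, so navigation and footer text would still come through. That part is exactly where a per-site filter would be needed.

```python
from html.parser import HTMLParser


class ContentExtractor(HTMLParser):
    """Collect visible text and image URLs, skipping script-like elements."""

    # Elements whose contents are treated as non-content (an assumption
    # for this sketch, not a complete list of ad or tracking markup).
    SKIP_TAGS = {"script", "style", "noscript", "iframe"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.text_parts = []
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "img" and self._skip_depth == 0:
            src = dict(attrs).get("src")
            if src:
                self.image_urls.append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_content(html):
    """Return (text, image_urls) for one page of HTML."""
    parser = ContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.image_urls


if __name__ == "__main__":
    # Tiny made-up page standing in for a real article.
    page = (
        "<html><head><script>var ads = 1;</script></head>"
        "<body><h1>Review</h1><p>Great CPU.</p>"
        '<img src="/img/cpu.png"><script>track()</script></body></html>'
    )
    print(extract_content(page))
```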

 

It's much more efficient to get a database dump plus the site's actual asset files and set up a mirror than it is to crawl the site and rip the data.

 

 

Link to post

Just as an anecdote, there was a wiki site in Spain called La Frikipedia (The Nerd/Freakpedia) that made fun of everything on Earth, with articles written by random people. They had to close after they made fun of the president of a "copyright association", who sued them, IIRC. The webmaster just gave away the whole '/www' folder as a zip file (with its PHP shenanigans), and it was around 1 GB with lots of images.

 

I wouldn't be surprised if Anandtech or Tom's Hardware could be stored, with compression, on a couple of Blu-ray discs. The crux, in my opinion, isn't how to get the data, since we could scrape it even if that's not morally right, but how to pay for keeping it online. We could upload it to the Internet Archive and pay something every year, or we could even seed it as a torrent, but that wouldn't be the same as the website working the way it does today.

 

 

Link to post

gotta be a millionaire


Link to post
