
Best way to download an entire website?

If you wanted to, for instance, download this entire forum, with every thread, post, image, etc., and be able to browse it offline, how do you do this properly so you get a 1:1 copy with all the scripts and so on?


You don't, unless you have access to the servers and can make a copy of the source and databases. Making a 1:1 copy of a website without server access is only possible if the site consists exclusively of static HTML, CSS and JS, with no server-side code behind it.

 

The best you can do client-side is to scrape the content that you want and reverse engineer the logic in order to display the content the way it originally appears. But this is very bad practice, and it is likely to get you banned by the target website.

 

Not everything on the web is archivable, as evidenced by the number of broken sites that exist on archive.org.


Software exists which will crawl a site and store a local copy of it as it looks to a browser. However, if anything is running on the back end (e.g. a database), you won't be able to capture that, only the output. It's like ordering a meal somewhere: you get the food, not the recipe for it.
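
To make "crawl and store a local copy" a bit more concrete, here is a minimal sketch of the two core steps such a tool performs: finding the links on a page that stay on the same site, and deciding where each URL is saved on disk. This is only an illustration, not how any particular crawler works. The names `LinkCollector`, `same_site_links` and `local_path`, the host `forum.example`, and the `index.html` file layout are all assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href/src attribute values from a page's HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def same_site_links(page_url, html):
    """Return absolute, fragment-free URLs that share page_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for raw in collector.links:
        # Resolve relative links against the page, then drop any "#fragment".
        absolute, _fragment = urldefrag(urljoin(page_url, raw))
        parsed = urlparse(absolute)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc == host and absolute not in found:
            found.append(absolute)
    return found


def local_path(url):
    """Map a URL to a relative file path (hypothetical layout, query ignored)."""
    parsed = urlparse(url)
    path = parsed.path or "/"
    if path.endswith("/"):
        path += "index.html"
    return parsed.netloc + path
```

A crawler would repeat this for every page it saves: fetch, store the bytes under `local_path(url)`, then queue whatever `same_site_links` returns. Note that it only ever sees the HTML the server sends back, which is the "food, not the recipe" limitation in practice.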

 

Using a forum as the example, it would be impossible to get a working copy since it is dynamic and constantly changes. If you just wanted to index the posts, that could be possible, as that is essentially what a search engine does. Sites may deploy protections against "unusual" activity like mass downloading, and the workaround I've used in the past is simply to not have more than 2 downloads running at a time, and to spread them out over time. This may be impractical for obvious reasons if the site is very large, unless you can parallelize it, using VPNs for example, so it doesn't look like the demand is coming from a single point.
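
The "no more than 2 downloads at a time, spread out over time" approach could look roughly like the sketch below. It is an assumption-laden illustration rather than what I actually ran: `polite_download`, `fetch`, `max_parallel` and `delay_seconds` are names invented here, and the one-second default pause is an arbitrary placeholder, not a value known to be safe for any site.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def polite_download(urls, fetch_one=fetch, max_parallel=2, delay_seconds=1.0):
    """Fetch urls with at most max_parallel requests in flight.

    Each worker pauses for delay_seconds after a download, so the
    requests are spread out over time instead of arriving in a burst.
    Returns a dict mapping each URL to its downloaded body.
    """

    def worker(url):
        body = fetch_one(url)
        time.sleep(delay_seconds)  # hypothetical pause; tune per site
        return url, body

    # The pool size is what caps the number of simultaneous downloads.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(pool.map(worker, urls))
```

Calling `polite_download(list_of_urls)` returns the pages keyed by URL. With 2 workers and a 1 second pause that is at most about 2 pages per second, which shows why this gets impractical on a very large site.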



+1 for HTTrack. I used it to scrape our company's internal wiki for an audit. It takes a while to run but gets every public-facing bit of data.
