
Tell wget not to go outside of current webpage

babadoctor
Go to solution: Solved by INSERTNAMEHERE

How do I tell wget not to go outside of the current webpage?

 

An example wget script that downloads all of the images on a webpage:

https://github.com/eduardschaeli/wget-image-scraper/blob/master/scraper.sh

#!/bin/sh
# read one URL per line from sites.txt
while read -r line; do
  # strip the leading "http://" (the first 7 characters)
  stripped_url=$(echo "$line" | cut -c8-)
  # build a folder name from the URL, replacing slashes with underscores
  target_folder="downloads/$(echo "$stripped_url" | sed 's/\//_/g')"

  echo ""
  echo "Scraping $stripped_url"
  echo "-----------------------------------"
  echo "> creating folder $target_folder"
  mkdir -p "$target_folder"
  echo "> scraping $stripped_url"
  wget -e robots=off \
    -H -nd -nc -np \
    --recursive -p \
    --level=1 \
    --accept jpg,jpeg,png,gif,webm \
    --convert-links -N \
    -P "$target_folder" "$stripped_url"
  echo ""
  echo "> Finished scraping $stripped_url"
done < sites.txt

I download a webpage, for example https://boards.4channel.org/v/thread/441907771/the-game-awards-2018, and it downloads all of the images on that page, but it also downloads images from other links, such as 4channel.org/a/, etc.

 

Before the domain change to 4channel, the scraper was working fine, but now I can't figure out how to stop it from wandering to other domains.

Do I blacklist every other link on the page?

 

Thank you


Couldn't you just set the depth to 1?

 

-l 1 (That's an L)

 

From the wget man page.

 

-l depth
--level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.
 
or maybe
 
-D domain-list
--domains=domain-list
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
 
or maybe 
-L
--relative
Follow relative links only. Useful for retrieving a specific home page without any distractions, not even those from the same hosts.
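
For instance, an untested sketch using -l with -L to keep wget on the thread page itself:

wget --recursive -l 1 -L --accept jpg,jpeg,png,gif,webm https://boards.4channel.org/v/thread/441907771/the-game-awards-2018

(-L only helps if the page links its images with relative URLs; if they live on another host, you'd need -H plus -D instead.)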

(I am really not an expert on bash, but couldn't you use a regex to create those restrictions pretty easily?)

 

Idk if you want that flag set: '-H,  --span-hosts    go to foreign hosts when recursive'

Also, that flag might be a good idea: '-L,  --relative       follow relative links only'

That one might be an issue too, since the page might use foreign hosts to display its pictures: '-p,  --page-requisites    get all images, etc. needed to display HTML page'
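
For example, an untested sketch of the script's wget line with -H dropped so it stays on the starting host:

wget -e robots=off -nd -nc -np \
  --recursive -p --level=1 \
  --accept jpg,jpeg,png,gif,webm \
  -P downloads https://boards.4channel.org/v/thread/441907771/the-game-awards-2018

Though if the thread's pictures sit on a different host, it seems -p alone won't fetch them without -H.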

 

And dude, 'wget -h' or --help is a godsend and usually the way to go when something goes wrong ;)

But you probably knew that.

 

@corrado33 yeah you beat me to it :D


2 minutes ago, corrado33 said:

Couldn't you just set the depth to 1?

-l 1 (That's an L)

[...]

If you look at the script, -l is being used and the depth is set to 1.


Just now, babadoctor said:

If you look at the script, -l is being used and the depth is set to 1.

Ah, you're right, I didn't see the line continuations, I only saw the first wget line. :)


10 minutes ago, corrado33 said:

Couldn't you just set the depth to 1?

[...]

It would be pretty rad if -D could take a regex to select a set of URLs.


Just now, INSERTNAMEHERE said:

It would be pretty rad if -D could take a regex to select a set of URLs.

There is a --reject-regex option, I just don't know what flavor of regex it is.


Sadly, the ugly flavor...

grep would give you something more JavaScript- or Python-like, i.e. advanced regex.


2 minutes ago, INSERTNAMEHERE said:

Sadly, the ugly flavor... [...]

aw, that sucks...


2 minutes ago, INSERTNAMEHERE said:

Did you try disabling -H?

boards.4channel.org/v/thread/441907771/the-game-awards-2018



Scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018
-----------------------------------
> creating folder..
downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018
> scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018
Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2018-12-06 20:38:05--  http://boards.4channel.org/v/thread/441907771/the-game-awards-2018
Resolving boards.4channel.org (boards.4channel.org)... 104.16.178.122, 104.16.180.122, 104.16.179.122, ...
Connecting to boards.4channel.org (boards.4channel.org)|104.16.178.122|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp’

the-game-awards-201     [ <=>                ]   1.17M  6.03MB/s    in 0.2s    

2018-12-06 20:38:05 (6.03 MB/s) - ‘downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp’ saved [1226321]

Removing downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp since it should be rejected.

FINISHED --2018-12-06 20:38:06--
Total wall clock time: 0.3s
Downloaded: 1 files, 1.2M in 0.2s (6.03 MB/s)
Converted links in 0 files in 0 seconds.


> Finished scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018

 


Right about now I would start some trial and error :D

It's curious that it doesn't grab all the pictures needed to render the page you're on.

There is also: '--regex-type=TYPE           regex type (posix|pcre)'

If you switch the regex type to pcre, you can use regexr to help build a pattern for the --reject-regex option.

 

https://regexr.com/
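
For example, an untested sketch (it assumes your wget build has PCRE support) that rejects every board except /v/ with a PCRE lookahead:

wget --recursive --level=1 --regex-type=pcre \
  --reject-regex 'boards\.4channel\.org/(?!v/)' \
  https://boards.4channel.org/v/thread/441907771/the-game-awards-2018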

 

Another edit: you will definitely need -D to restrain spanning hosts. In your example, both your page's host (boards.4channel.org/...) and the image host (i.4cdn.org/...) should be on the list.

 

More info here:

https://www.gnu.org/software/wget/manual/wget.html#Spanning-Hosts
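
Putting it together, the wget line in your script might become (untested sketch; it assumes the thread's images are served from i.4cdn.org):

wget -e robots=off \
  -H -nd -nc -np \
  --recursive -p \
  --level=1 \
  --domains=boards.4channel.org,i.4cdn.org \
  --accept jpg,jpeg,png,gif,webm \
  --convert-links -N \
  -P "$target_folder" "$stripped_url"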

 


33 minutes ago, INSERTNAMEHERE said:

Right about now I would start some trial and error :D

[...]

 

Thank you so much!!!

