
Tell wget not to go outside of current webpage

babadoctor
Go to solution: Solved by INSERTNAMEHERE

How do I tell wget not to go outside of the current webpage?

 

An example wget script that downloads all of the images on a webpage:

https://github.com/eduardschaeli/wget-image-scraper/blob/master/scraper.sh

#!/bin/sh
# read one URL per line from sites.txt
while read -r line; do
  # strip the leading "http://" (the first 7 characters)
  stripped_url=$(echo "$line" | cut -c8-)
  # build a folder name from the URL, replacing slashes with underscores
  target_folder="downloads/$(echo "$stripped_url" | sed 's/\//_/g')"

  echo ""
  echo "Scraping $stripped_url"
  echo "-----------------------------------"
  echo "> creating folder $target_folder"
  mkdir -p "$target_folder"
  echo "> scraping $stripped_url"
  wget -e robots=off \
    -H -nd -nc -np \
    --recursive -p \
    --level=1 \
    --accept jpg,jpeg,png,gif,webm \
    --convert-links -N \
    -P "$target_folder" "$stripped_url"
  echo ""
  echo "> Finished scraping $stripped_url"
done < sites.txt

I download a webpage, for example https://boards.4channel.org/v/thread/441907771/the-game-awards-2018, and it downloads all of the images on that page, but it also downloads images from other links, such as 4channel.org/a/, etc.

 

Before the domain change to 4channel, the scraper was working fine, but now I can't figure out how to stop it from wandering to other domains.

Do I blacklist every other link on the page?

 

Thank you


Couldn't you just set the depth to 1?

 

-l 1 (That's an L)

 

From the wget man page.

 

-l depth
--level=depth
Specify recursion maximum depth level depth. The default maximum depth is 5.
 
or maybe
 
-D domain-list
--domains=domain-list
Set domains to be followed. domain-list is a comma-separated list of domains. Note that it does not turn on -H.
 
or maybe 
-L
--relative
Follow relative links only. Useful for retrieving a specific home page without any distractions, not even those from the same hosts.
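
For instance, an untested sketch using -l with -L to keep wget on the thread page itself:

wget --recursive -l 1 -L --accept jpg,jpeg,png,gif,webm https://boards.4channel.org/v/thread/441907771/the-game-awards-2018

(-L only helps if the page links its images with relative URLs; if they live on another host, you'd need -H plus -D instead.)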

(I am really not an expert on bash, but couldn't you use a regex to create those restrictions pretty easily?)

 

Idk if you want that flag set: '-H,  --span-hosts    go to foreign hosts when recursive'

Also, that flag might be a good idea: '-L,  --relative       follow relative links only'

That one might be an issue too, since the page might use foreign hosts to display its pictures: '-p,  --page-requisites    get all images, etc. needed to display HTML page'
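
For example, an untested sketch of the script's wget line with -H dropped so it stays on the starting host:

wget -e robots=off -nd -nc -np \
  --recursive -p --level=1 \
  --accept jpg,jpeg,png,gif,webm \
  -P downloads https://boards.4channel.org/v/thread/441907771/the-game-awards-2018

Though if the thread's pictures sit on a different host, it seems -p alone won't fetch them without -H.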

 

And dude, 'wget -h' or --help is a godsend and usually the way to go when something goes wrong ;)

But you probably knew that.

 

@corrado33 yeah you beat me to it :D


2 minutes ago, corrado33 said:

Couldn't you just set the depth to 1?

-l 1 (That's an L)

[...]

If you look at the script, -l is being used and the depth is set to 1.


Just now, babadoctor said:

If you look at the script, -l is being used and the depth is set to 1.

Ah, you're right, I didn't see the line continuations, I only saw the first wget line. :)


10 minutes ago, corrado33 said:

Couldn't you just set the depth to 1?

[...]

It would be pretty rad if -D could take a regex to select a set of URLs.


Just now, INSERTNAMEHERE said:

It would be pretty rad if -D could take a regex to select a set of URLs.

There is a --reject-regex option, I just don't know what flavor of regex it is.


Sadly, the ugly flavor...

grep would give you something more JavaScript- or Python-like, i.e. advanced regex.


2 minutes ago, INSERTNAMEHERE said:

Sadly, the ugly flavor... [...]

aw, that sucks...


2 minutes ago, INSERTNAMEHERE said:

Did you try disabling -H?

boards.4channel.org/v/thread/441907771/the-game-awards-2018



Scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018
-----------------------------------
> creating folder..
downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018
> scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018
Both --no-clobber and --convert-links were specified, only --convert-links will be used.
--2018-12-06 20:38:05--  http://boards.4channel.org/v/thread/441907771/the-game-awards-2018
Resolving boards.4channel.org (boards.4channel.org)... 104.16.178.122, 104.16.180.122, 104.16.179.122, ...
Connecting to boards.4channel.org (boards.4channel.org)|104.16.178.122|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp’

the-game-awards-201     [ <=>                ]   1.17M  6.03MB/s    in 0.2s    

2018-12-06 20:38:05 (6.03 MB/s) - ‘downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp’ saved [1226321]

Removing downloads/boards.4channel.org_v_thread_441907771_the-game-awards-2018/the-game-awards-2018.tmp since it should be rejected.

FINISHED --2018-12-06 20:38:06--
Total wall clock time: 0.3s
Downloaded: 1 files, 1.2M in 0.2s (6.03 MB/s)
Converted links in 0 files in 0 seconds.


> Finished scraping boards.4channel.org/v/thread/441907771/the-game-awards-2018

 


Right about now I would start some trial and error :D

It's curious that it doesn't grab all the pictures needed to render the page you're on.

There is also: '--regex-type=TYPE           regex type (posix|pcre)'

If you switch the regex type to pcre, you can use regexr to help build a pattern for the --reject-regex option.

 

https://regexr.com/
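
For example, an untested sketch (it assumes your wget build has PCRE support) that rejects every board except /v/ with a PCRE lookahead:

wget --recursive --level=1 --regex-type=pcre \
  --reject-regex 'boards\.4channel\.org/(?!v/)' \
  https://boards.4channel.org/v/thread/441907771/the-game-awards-2018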

 

Another edit: you will definitely need -D to restrain spanning hosts. In your example, both your page's host (boards.4channel.org/...) and the image host (i.4cdn.org/...) should be on the list.

 

More info here:

https://www.gnu.org/software/wget/manual/wget.html#Spanning-Hosts
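
Putting it together, the wget line in your script might become (untested sketch; it assumes the thread's images are served from i.4cdn.org):

wget -e robots=off \
  -H -nd -nc -np \
  --recursive -p \
  --level=1 \
  --domains=boards.4channel.org,i.4cdn.org \
  --accept jpg,jpeg,png,gif,webm \
  --convert-links -N \
  -P "$target_folder" "$stripped_url"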

 


33 minutes ago, INSERTNAMEHERE said:

Right about now I would start some trial and error :D

[...]

 

Thank you so much!!!

